Module export_utils

print_answers

def print_answers(results: dict, details: str = "all", max_text_len: Optional[int] = None)

Utility function that prints the results of Haystack pipelines.

Arguments:

  • results: Results that the pipeline returned.
  • details: Defines the level of details to print. Possible values: minimum, medium, all.
  • max_text_len: Specifies the maximum allowed length for a text field. If you don't want to shorten the text, set this value to None.

Returns:

None
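
The effect of max_text_len can be pictured with a minimal sketch. The helper name shorten and the trailing "..." marker are assumptions made for illustration; they are not taken from Haystack's implementation.

```python
from typing import Optional


def shorten(text: str, max_text_len: Optional[int] = None) -> str:
    """Hypothetical helper: cut text to at most max_text_len characters.

    When max_text_len is None, the text is returned unchanged.
    """
    if max_text_len is None or len(text) <= max_text_len:
        return text
    # Assumption: a marker is appended so the reader can see the text was cut.
    return text[:max_text_len] + "..."


print(shorten("Berlin is the capital of Germany.", max_text_len=6))
print(shorten("Berlin is the capital of Germany."))
```

Passing None keeps the full text, which matches the documented default.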

print_documents

def print_documents(results: dict, max_text_len: Optional[int] = None, print_name: bool = True, print_meta: bool = False)

Utility that prints a compressed representation of the documents returned by a pipeline.

Arguments:

  • results: Results that the pipeline returned.
  • max_text_len: Shorten the document's content to a maximum number of characters. When set to None, the document is not shortened.
  • print_name: Whether to print the document's name from the metadata.
  • print_meta: Whether to print the document's metadata.

print_questions

def print_questions(results: dict)

Utility to print the output of a question generating pipeline in a readable format.

export_answers_to_csv

def export_answers_to_csv(agg_results: list, output_file)

Exports answers coming from finder.get_answers() to a CSV file.

Arguments:

  • agg_results: A list of predictions coming from finder.get_answers().
  • output_file: The name of the output file.

Returns:

None

convert_labels_to_squad

def convert_labels_to_squad(labels_file: str)

Convert the export from the labeling UI to the SQuAD format for training.

Arguments:

  • labels_file: The path to the file containing labels.

Module preprocessing

convert_files_to_docs

def convert_files_to_docs(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, encoding: Optional[str] = None, id_hash_keys: Optional[List[str]] = None) -> List[Document]

Convert all files (.txt, .pdf, .docx) in the sub-directories of the given path to Documents that can be written to a Document Store.

Arguments:

  • dir_path: The path to the directory containing the files.
  • clean_func: A custom cleaning function that gets applied to each Document (input: str, output: str).
  • split_paragraphs: Whether to split text by paragraph.
  • encoding: Character encoding to use when converting PDF documents.
  • id_hash_keys: A list of Document attribute names from which the Document ID is hashed. Useful for generating unique IDs even if the Document contents are identical. To avoid duplicate Documents in your Document Store when texts are not unique, modify the metadata and pass ["content", "meta"] to this field. The Document ID is then generated from the content and the defined metadata.
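
The idea behind id_hash_keys can be sketched without Haystack. The function make_doc_id, the use of SHA-256, and the dictionary layout below are assumptions chosen for illustration; the library's own hashing scheme may differ.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def make_doc_id(doc: Dict[str, Any], id_hash_keys: Optional[List[str]] = None) -> str:
    """Hypothetical helper: derive an ID from the selected attributes of a document."""
    keys = id_hash_keys or ["content"]  # assumption: content only by default
    # Serialize the chosen attributes deterministically before hashing.
    payload = json.dumps({key: doc[key] for key in keys}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


doc_a = {"content": "Same text", "meta": {"name": "a.txt"}}
doc_b = {"content": "Same text", "meta": {"name": "b.txt"}}

# Hashing the content alone makes the two documents collide.
same_by_content = make_doc_id(doc_a) == make_doc_id(doc_b)

# Including the metadata separates them.
same_by_content_and_meta = (
    make_doc_id(doc_a, ["content", "meta"]) == make_doc_id(doc_b, ["content", "meta"])
)
```

In this sketch, two files with identical text but different metadata get distinct IDs only when "meta" is part of the hashed attributes.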

tika_convert_files_to_docs

def tika_convert_files_to_docs(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, merge_short: bool = True, merge_lowercase: bool = True, id_hash_keys: Optional[List[str]] = None) -> List[Document]

Convert all files (.txt, .pdf) in the sub-directories of the given path to Documents that can be written to a Document Store.

Arguments:

  • dir_path: The path to the directory containing the files.
  • clean_func: A custom cleaning function that gets applied to each Document (input: str, output: str).
  • split_paragraphs: Whether to split text by paragraph.
  • merge_short: Whether to allow merging of short paragraphs.
  • merge_lowercase: Whether to convert merged paragraphs to lowercase.
  • id_hash_keys: A list of Document attribute names from which the Document ID is hashed. Useful for generating unique IDs even if the Document contents are identical. To avoid duplicate Documents in your Document Store when texts are not unique, modify the metadata and pass ["content", "meta"] to this field. The Document ID is then generated from the content and the defined metadata.
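
A clean_func is any callable that takes a string and returns a string. The following is a minimal sketch of one such function; the name collapse_whitespace and the choice of cleaning rule are illustrative assumptions.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical clean_func: replace runs of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()


cleaned = collapse_whitespace("  Page 1 \n\n  Berlin\tis a city. ")
```

A function with this shape could then be passed as the clean_func argument.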

Module squad_data

SquadData

class SquadData()

This class is designed to manipulate data that is in the SQuAD format.

SquadData.__init__

def __init__(squad_data)

Arguments:

  • squad_data: SQuAD format data, either as a dictionary with a data key, or just a list of SQuAD documents.
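
The two accepted input shapes can be illustrated with plain Python data. The sample content and the helper as_document_list are assumptions for illustration; the field names follow the public SQuAD JSON layout.

```python
from typing import Any, Dict, List, Union

# A list of SQuAD documents: each has a title and paragraphs,
# and each paragraph has a context and its question/answer pairs.
squad_docs: List[Dict[str, Any]] = [
    {
        "title": "Berlin",
        "paragraphs": [
            {
                "context": "Berlin is the capital of Germany.",
                "qas": [
                    {
                        "id": "q1",
                        "question": "What is the capital of Germany?",
                        "answers": [{"text": "Berlin", "answer_start": 0}],
                        "is_impossible": False,
                    }
                ],
            }
        ],
    }
]

# The same documents wrapped in a dictionary with a "data" key.
squad_dict: Dict[str, Any] = {"data": squad_docs}


def as_document_list(squad_data: Union[Dict[str, Any], List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: accept either shape and return the list of documents."""
    if isinstance(squad_data, dict):
        return squad_data["data"]
    return squad_data
```

Either squad_docs or squad_dict would fit the description of the squad_data argument.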

SquadData.merge_from_file

def merge_from_file(filename: str)

Merge the contents of a JSON file in the SQuAD format with the data stored in this object.

SquadData.merge

def merge(new_data: List)

Merge data in SQuAD format with the data stored in this object.

Arguments:

  • new_data: A list of SQuAD document data.

SquadData.from_file

@classmethod
def from_file(cls, filename: str)

Create a SquadData object by providing the name of a JSON file in the SQuAD format.

SquadData.save

def save(filename: str)

Write the data stored in this object to a JSON file.

SquadData.to_document_objs

def to_document_objs()

Export all paragraphs stored in this object to haystack.Document objects.

SquadData.to_label_objs

def to_label_objs()

Export all labels stored in this object to haystack.Label objects.

SquadData.to_df

@staticmethod
def to_df(data)

Convert a list of SQuAD document dictionaries into a pandas DataFrame (each row is one annotation).
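
The "one row per annotation" layout can be sketched with plain dictionaries instead of pandas. The function flatten_annotations, the column names, and the handling of questions without answers are assumptions for illustration and may not match the actual output.

```python
from typing import Any, Dict, List


def flatten_annotations(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: turn nested SQuAD documents into flat rows."""
    rows: List[Dict[str, Any]] = []
    for document in data:
        for paragraph in document["paragraphs"]:
            for qa in paragraph["qas"]:
                base = {
                    "title": document.get("title", ""),
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "id": qa["id"],
                }
                answers = qa.get("answers", [])
                if not answers:
                    # Assumption: a question without answers still yields one row.
                    rows.append({**base, "answer_text": "", "answer_start": None})
                for answer in answers:
                    rows.append(
                        {**base, "answer_text": answer["text"], "answer_start": answer["answer_start"]}
                    )
    return rows


sample = [
    {
        "title": "Berlin",
        "paragraphs": [
            {
                "context": "Berlin is the capital of Germany.",
                "qas": [
                    {"id": "q1", "question": "What is the capital of Germany?",
                     "answers": [{"text": "Berlin", "answer_start": 0}]},
                    {"id": "q2", "question": "Who is the mayor of Berlin?", "answers": []},
                ],
            }
        ],
    }
]

rows = flatten_annotations(sample)
```

A list of rows like this could be handed to a DataFrame constructor if a tabular view is needed.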

SquadData.count

def count(unit="questions")

Count the samples in the data. Choose a unit: "paragraphs", "questions", "answers", "no_answers", "span_answers".
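
One plausible reading of the units is sketched below on plain SQuAD dictionaries. The function count_units and the interpretation of "no_answers" (questions with an empty answer list), "span_answers" (individual answer spans), and "answers" (the sum of both) are assumptions, not a statement of how the method is implemented.

```python
from typing import Any, Dict, List

VALID_UNITS = {"paragraphs", "questions", "answers", "no_answers", "span_answers"}


def count_units(data: List[Dict[str, Any]], unit: str = "questions") -> int:
    """Hypothetical helper: count samples in SQuAD-format data by unit."""
    if unit not in VALID_UNITS:
        raise ValueError(f"Unknown unit: {unit}")
    total = 0
    for document in data:
        for paragraph in document["paragraphs"]:
            if unit == "paragraphs":
                total += 1
            for qa in paragraph["qas"]:
                if unit == "questions":
                    total += 1
                answers = qa.get("answers", [])
                if not answers:
                    # A question with no answer span counts once as a "no answer".
                    if unit in ("answers", "no_answers"):
                        total += 1
                elif unit in ("answers", "span_answers"):
                    total += len(answers)
    return total


data = [
    {
        "title": "Berlin",
        "paragraphs": [
            {
                "context": "Berlin is the capital of Germany.",
                "qas": [
                    {"id": "q1", "question": "What is the capital of Germany?",
                     "answers": [{"text": "Berlin", "answer_start": 0}]},
                    {"id": "q2", "question": "Who is the mayor of Berlin?", "answers": []},
                ],
            },
            {
                "context": "Berlin lies on the river Spree.",
                "qas": [
                    {"id": "q3", "question": "Which river flows through Berlin?",
                     "answers": [{"text": "Spree", "answer_start": 25},
                                 {"text": "river Spree", "answer_start": 19}]},
                ],
            },
        ],
    }
]

counts = {unit: count_units(data, unit) for unit in sorted(VALID_UNITS)}
```

Under these assumptions the sample holds 2 paragraphs, 3 questions, 1 no-answer, 3 span answers, and 4 answers in total.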

SquadData.df_to_data

@classmethod
def df_to_data(cls, df)

Convert a pandas DataFrame into SQuAD format data (a list of SQuAD document dictionaries).

SquadData.sample_questions

def sample_questions(n)

Return a sample of n questions in the SQuAD format (a list of SQuAD document dictionaries). Note that if the same question is asked on multiple different passages, this function treats that as a single question.

SquadData.get_all_paragraphs

def get_all_paragraphs()

Return all paragraph strings.

SquadData.get_all_questions

def get_all_questions()

Return all question strings. Note that if the same question appears for different paragraphs, this function returns it multiple times.
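
The difference between counting a repeated question once (as described for sample_questions) and returning it for every paragraph (as described here) can be sketched on plain data. The helpers all_questions, distinct_questions, and sample_distinct are hypothetical names used only for illustration.

```python
import random
from typing import Any, Dict, List


def all_questions(data: List[Dict[str, Any]]) -> List[str]:
    """Every question string, repeated once per paragraph it is asked on."""
    return [
        qa["question"]
        for document in data
        for paragraph in document["paragraphs"]
        for qa in paragraph["qas"]
    ]


def distinct_questions(data: List[Dict[str, Any]]) -> List[str]:
    """Each question string once, in first-seen order."""
    return list(dict.fromkeys(all_questions(data)))


def sample_distinct(data: List[Dict[str, Any]], n: int, seed: int = 0) -> List[str]:
    """Draw n distinct question strings; the seed parameter is an illustrative assumption."""
    return random.Random(seed).sample(distinct_questions(data), n)


data = [
    {
        "title": "Cities",
        "paragraphs": [
            {"context": "Berlin is in Germany.",
             "qas": [{"id": "q1", "question": "Which country is the city in?", "answers": []},
                     {"id": "q2", "question": "What is the city called?", "answers": []}]},
            {"context": "Paris is in France.",
             "qas": [{"id": "q3", "question": "Which country is the city in?", "answers": []}]},
        ],
    }
]

repeated = all_questions(data)
unique = distinct_questions(data)
picked = sample_distinct(data, 1)
```

Here the repeated list has three entries while the unique list has two, because one question is asked on both paragraphs.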

SquadData.get_all_document_titles

def get_all_document_titles()

Return all document title strings.