Module export_utils
print_answers
def print_answers(results: dict, details: str = "all", max_text_len: Optional[int] = None)
Utility function to print results of Haystack pipelines
Arguments:
results
: Results that the pipeline returned.
details
: Defines the level of details to print. Possible values: minimum, medium, all.
max_text_len
: Specifies the maximum allowed length for a text field. If you don't want to shorten the text, set this value to None.
Returns:
None
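Example (a minimal sketch): it assumes `pipeline` is an already initialized Haystack query pipeline, such as an ExtractiveQAPipeline, and that the helper is exposed via haystack.utils; adjust the import to your installed version.

```python
from haystack.utils import print_answers

# `pipeline` is assumed to be a ready-to-query Haystack pipeline built elsewhere.
results = pipeline.run(query="Who is the father of Arya Stark?")

# Print only the essential fields of each answer and truncate long texts to 100 characters.
print_answers(results, details="minimum", max_text_len=100)
```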
print_documents
def print_documents(results: dict, max_text_len: Optional[int] = None, print_name: bool = True, print_meta: bool = False)
Utility that prints a compressed representation of the documents returned by a pipeline.
Arguments:
results
: Results that the pipeline returned.
max_text_len
: Shorten the document's content to a maximum number of characters. When set to None, the document is not shortened.
print_name
: Whether to print the document's name from the metadata.
print_meta
: Whether to print the document's metadata.
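Example (a sketch under the same assumptions as above, with a hypothetical document search pipeline named `doc_search_pipeline`):

```python
from haystack.utils import print_documents

# `doc_search_pipeline` is assumed to be an initialized document search pipeline.
results = doc_search_pipeline.run(query="climate change")

# Show each document's name and metadata and truncate the content preview to 200 characters.
print_documents(results, max_text_len=200, print_name=True, print_meta=True)
```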
print_questions
def print_questions(results: dict)
Utility to print the output of a question generating pipeline in a readable format.
export_answers_to_csv
def export_answers_to_csv(agg_results: list, output_file)
Exports answers coming from finder.get_answers() to a CSV file.
Arguments:
agg_results
: A list of predictions coming from finder.get_answers().
output_file
: The name of the output file.
Returns:
None
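Example (a sketch; `agg_results` stands in for the list returned by finder.get_answers(), which is not shown here, and the import assumes the helper is exposed via haystack.utils like the other functions in this module):

```python
from haystack.utils import export_answers_to_csv

# `agg_results` is assumed to be a list of predictions coming from finder.get_answers().
export_answers_to_csv(agg_results=agg_results, output_file="answers.csv")
```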
convert_labels_to_squad
def convert_labels_to_squad(labels_file: str)
Convert the export from the labeling UI to the SQuAD format for training.
Arguments:
labels_file
: The path to the file containing labels.
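Example (a sketch; the file name is a placeholder for an export from the labeling UI):

```python
from haystack.utils import convert_labels_to_squad

# Convert labels exported from the annotation tool into SQuAD format for training.
convert_labels_to_squad(labels_file="exported_labels.json")
```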
Module preprocessing
convert_files_to_docs
def convert_files_to_docs(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, encoding: Optional[str] = None, id_hash_keys: Optional[List[str]] = None) -> List[Document]
Convert all files (.txt, .pdf, .docx) in the sub-directories of the given path to Documents that can be written to a
Document Store.
Arguments:
dir_path
: The path of the directory containing the files.
clean_func
: A custom cleaning function that gets applied to each Document (input: str, output: str).
split_paragraphs
: Whether to split text by paragraph.
encoding
: Character encoding to use when converting PDF documents.
id_hash_keys
: A list of Document attribute names from which the Document ID should be hashed. Useful for generating unique IDs even if the Document contents are identical. To ensure you don't have duplicate Documents in your Document Store if texts are not unique, you can modify the metadata and pass ["content", "meta"] to this field. If you do this, the Document ID will be generated by using the content and the defined metadata.
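Example (a minimal sketch; the directory path is a placeholder and the clean_func shown is just a simple whitespace normalizer):

```python
from haystack.utils import convert_files_to_docs

# Convert every .txt, .pdf, and .docx file below data/my_files into Document objects.
docs = convert_files_to_docs(
    dir_path="data/my_files",
    clean_func=lambda text: " ".join(text.split()),  # strip redundant whitespace
    split_paragraphs=True,
)

# `docs` can then be written to a Document Store, e.g. document_store.write_documents(docs).
```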
tika_convert_files_to_docs
def tika_convert_files_to_docs(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, merge_short: bool = True, merge_lowercase: bool = True, id_hash_keys: Optional[List[str]] = None) -> List[Document]
Convert all files (.txt, .pdf) in the sub-directories of the given path to Documents that can be written to a
Document Store.
Arguments:
merge_lowercase
: Whether to convert merged paragraphs to lowercase.
merge_short
: Whether to allow merging of short paragraphs.
dir_path
: The path to the directory containing the files.
clean_func
: A custom cleaning function that gets applied to each doc (input: str, output: str).
split_paragraphs
: Whether to split text by paragraphs.
id_hash_keys
: A list of Document attribute names from which the Document ID should be hashed. Useful for generating unique IDs even if the Document contents are identical. To ensure you don't have duplicate Documents in your Document Store if texts are not unique, you can modify the metadata and pass ["content", "meta"] to this field. If you do this, the Document ID will be generated by using the content and the defined metadata.
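Example (a sketch; this converter relies on Apache Tika, so it assumes a Tika server is running and reachable, and the directory path is a placeholder):

```python
from haystack.utils import tika_convert_files_to_docs

# Assumes a running Apache Tika server that the converter can reach.
docs = tika_convert_files_to_docs(
    dir_path="data/my_files",
    split_paragraphs=True,
    merge_short=True,
    merge_lowercase=True,
)
```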
Module squad_data
SquadData
class SquadData()
This class is designed to manipulate data that is in SQuAD format.
SquadData.__init__
def __init__(squad_data)
Arguments:
squad_data
: SQuAD format data, either as a dictionary with a "data" key, or just a list of SQuAD documents.
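Example (a sketch; the file name is a placeholder and the import path may differ between Haystack versions):

```python
import json

from haystack.utils.squad_data import SquadData

# The constructor accepts either the full SQuAD dict (with a "data" key) or just the list under "data".
with open("train-v2.0.json", encoding="utf-8") as f:
    squad_dict = json.load(f)

squad = SquadData(squad_data=squad_dict)
```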
SquadData.merge_from_file
def merge_from_file(filename: str)
Merge the contents of a JSON file in the SQuAD format with the data stored in this object.
SquadData.merge
def merge(new_data: List)
Merge data in SQuAD format with the data stored in this object.
Arguments:
new_data
: A list of SQuAD document data.
SquadData.from_file
@classmethod
def from_file(cls, filename: str)
Create a SquadData object by providing the name of a JSON file in the SQuAD format.
SquadData.save
def save(filename: str)
Write the data stored in this object to a JSON file.
SquadData.to_document_objs
def to_document_objs()
Export all paragraphs stored in this object to haystack.Document objects.
SquadData.to_label_objs
def to_label_objs()
Export all labels stored in this object to haystack.Label objects.
SquadData.to_df
@staticmethod
def to_df(data)
Convert a list of SQuAD document dictionaries into a pandas dataframe (each row is one annotation).
SquadData.count
def count(unit="questions")
Count the samples in the data. Choose a unit: "paragraphs", "questions", "answers", "no_answers", "span_answers".
SquadData.df_to_data
@classmethod
def df_to_data(cls, df)
Convert a data frame into the SQuAD format data (list of SQuAD document dictionaries).
SquadData.sample_questions
def sample_questions(n)
Return a sample of n questions in the SQuAD format (a list of SQuAD document dictionaries). Note that if the same question is asked on multiple different passages, this function treats that as a single question.
SquadData.get_all_paragraphs
def get_all_paragraphs()
Return all paragraph strings.
SquadData.get_all_questions
def get_all_questions()
Return all question strings. Note that if the same question appears for different paragraphs, this function returns it multiple times.
SquadData.get_all_document_titles
def get_all_document_titles()
Return all document title strings.
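Example (a sketch of a typical workflow with the methods above; file names are placeholders and the import path may differ between Haystack versions):

```python
from haystack.utils.squad_data import SquadData

# Load an existing SQuAD file and merge a second annotation file into it.
squad = SquadData.from_file("train-v2.0.json")
squad.merge_from_file("additional_annotations.json")

# Inspect the dataset and draw a sample of 100 questions.
print(squad.count(unit="questions"))
sample = squad.sample_questions(n=100)

# Export paragraphs as haystack.Document objects (e.g. for indexing) and persist the merged data.
docs = squad.to_document_objs()
squad.save("merged_squad.json")
```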