Input Conversion
A module of helper functions for converting various inputs (plain text, HTML, SRT subtitles, TMX files and parallel corpus files) into documents.
-
yalign.input_conversion.generate_documents(filepath, m=20, n=20)
Document generator. Documents are created from the parallel corpus at filepath, and each one is between m and n lines long.
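As a rough illustration of how a parallel corpus might be cut into documents of bounded length, the sketch below slices two aligned sentence lists into consecutive chunks whose sizes are drawn at random between m and n. The name `chunk_documents` and the `rng` parameter are hypothetical, and the slicing strategy is an assumption; yalign's own logic may differ.

```python
import random


def chunk_documents(doc_a, doc_b, m=20, n=20, rng=random):
    """Yield (chunk_a, chunk_b) pairs, each between m and n sentences long.

    Hypothetical helper for illustration; it is not part of yalign.
    The last pair may be shorter than m if the corpus runs out.
    """
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= m <= n")
    total = min(len(doc_a), len(doc_b))
    start = 0
    while start < total:
        size = rng.randint(m, n)  # randint is inclusive on both ends
        end = min(start + size, total)
        yield doc_a[start:end], doc_b[start:end]
        start = end
```

With the defaults (m=20, n=20) every chunk except possibly the last holds exactly 20 sentences per language.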
-
yalign.input_conversion.html_to_document(html, language='en')
Returns the text of the HTML string html as a list of Sentences.
-
yalign.input_conversion.parallel_corpus_to_documents(filepath)
Transforms a parallel corpus file into two documents. The parallel corpus file has the following format:
- One sentence per line.
- Lines alternate between the two languages, so each sentence is immediately followed by its translation.
- Sentences are tokenized and tokens are space separated.
- The file encoding is UTF-8.
For example:
    This is a sentence .
    Esto es una oración .
    And this , my friend , is another .
    Y esta , mi amigo , es otra .
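The format described above can be read with a few lines of plain Python. The following is a minimal sketch, not yalign's implementation: the function name `read_parallel_corpus` is hypothetical, it assumes the first line of each pair belongs to the first language, and it returns plain lists of token lists rather than yalign Sentence objects.

```python
import io


def read_parallel_corpus(filepath):
    """Split an alternating-line parallel corpus into two token-list documents.

    Hypothetical helper for illustration; it is not part of yalign.
    """
    with io.open(filepath, encoding="utf-8") as handle:
        # Drop blank lines so a trailing newline does not shift the pairing.
        lines = [line.strip() for line in handle if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    doc_a, doc_b = [], []
    # Even-indexed lines are language A, odd-indexed lines are language B.
    for index, line in enumerate(lines):
        tokens = line.split()  # tokens are already space separated
        (doc_a if index % 2 == 0 else doc_b).append(tokens)
    return doc_a, doc_b
```

Raising on an odd number of lines is a design choice made here so that a misaligned file fails loudly instead of silently pairing sentences with the wrong translations.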
-
yalign.input_conversion.parse_training_file(training_file)
Reads SentencePairs from a training file.
-
yalign.input_conversion.srt_to_document(text, lang='en')
Converts a string in SRT subtitle format into a list of Sentences.
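To make the SRT input concrete, the sketch below shows one way the spoken text could be pulled out of an SRT string before sentence splitting: it drops each block's counter and timing row and strips simple formatting tags. The helper `srt_text_lines` is hypothetical and only illustrates the input format; it is not how yalign is known to do it, and it returns plain strings, not Sentences.

```python
import re

# Matches a timing row such as "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def srt_text_lines(text):
    """Return the subtitle text lines of an SRT string, without counters or timings.

    Hypothetical helper for illustration; it is not part of yalign.
    """
    lines = []
    normalized = text.replace("\r\n", "\n").strip()
    # SRT blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        rows = [row.strip() for row in block.split("\n") if row.strip()]
        # Skip the numeric counter, if present, then the timing row.
        if rows and rows[0].isdigit():
            rows = rows[1:]
        if rows and TIMING.match(rows[0]):
            rows = rows[1:]
        # Remove simple formatting tags such as <i>...</i>.
        lines.extend(re.sub(r"<[^>]+>", "", row) for row in rows)
    return lines
```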
-
yalign.input_conversion.text_to_document(text, language='en')
Returns the string text as a list of Sentences.
-
yalign.input_conversion.tmx_file_to_documents(filepath, lang_a=None, lang_b=None)
Converts a TMX file into two lists of Sentences: the first for language lang_a and the second for language lang_b.
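For readers unfamiliar with TMX, the sketch below shows how aligned segments for two languages could be extracted from TMX markup with the standard library. The function `tmx_segments` is hypothetical and not part of yalign; it assumes each `tu` element holds `tuv` children tagged with `xml:lang` (or the older `lang` attribute), takes the markup as a string instead of a file path, and returns plain strings without tokenizing them.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_segments(tmx_text, lang_a, lang_b):
    """Return two parallel lists of segment strings for lang_a and lang_b.

    Hypothetical helper for illustration; it is not part of yalign.
    Translation units missing either language are skipped.
    """
    key_a, key_b = lang_a.lower(), lang_b.lower()
    doc_a, doc_b = [], []
    root = ET.fromstring(tmx_text)
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.iter("tuv"):
            lang = variant.get(XML_LANG) or variant.get("lang")
            seg = variant.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if key_a in segments and key_b in segments:
            doc_a.append(segments[key_a])
            doc_b.append(segments[key_b])
    return doc_a, doc_b
```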
-
yalign.input_conversion.tokenize(text, language='en')
Returns a Sentence with Words (i.e., a list of unicode objects).