Input Conversion

Helper functions for converting various input formats (plain text, HTML, SRT, TMX, and parallel corpus files) into documents.

yalign.input_conversion.generate_documents(filepath, m=20, n=20)

Document generator. Yields documents built from the parallel corpus file at filepath; each document is between m and n lines long.
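The chunking behaviour described above can be pictured with a small standalone sketch. This is not yalign's implementation: `chunk_pairs` is a hypothetical helper, it works on in-memory sentence pairs rather than a file, and dropping a leftover shorter than m is an assumption made here to keep every document within the stated bounds.

```python
import random


def chunk_pairs(pairs, m=20, n=20, rng=random):
    """Yield (doc_a, doc_b) chunks holding between m and n sentence pairs.

    Hypothetical helper for illustration only, not part of yalign.
    """
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= m <= n")
    target = rng.randint(m, n)  # size of the document being built
    doc_a, doc_b = [], []
    for sentence_a, sentence_b in pairs:
        doc_a.append(sentence_a)
        doc_b.append(sentence_b)
        if len(doc_a) == target:
            yield doc_a, doc_b
            doc_a, doc_b = [], []
            target = rng.randint(m, n)  # pick a new size for the next one
    # Assumption: a leftover shorter than m is dropped so that every
    # yielded document respects the m..n bounds.
    if len(doc_a) >= m:
        yield doc_a, doc_b


pairs = [("a%d" % i, "b%d" % i) for i in range(10)]
sizes = [len(doc_a) for doc_a, _ in chunk_pairs(pairs, m=5, n=5)]
```

With the defaults (m=20, n=20) every document has exactly 20 lines; widening the range gives documents of varying length.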

yalign.input_conversion.html_to_document(html, language='en')

Returns the text of html as a list of Sentences.

yalign.input_conversion.parallel_corpus_to_documents(filepath)

Transforms a file in parallel corpus format into two documents. A parallel corpus file has:

  • One sentence per line.
  • Lines alternating between the two languages.
  • Tokenized sentences, with tokens separated by spaces.
  • UTF-8 file encoding.

For example:

This is a sentence .
Esto es una oración .
And this , my friend , is another .
Y esta , mi amigo , es otra .

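To make the file layout concrete, here is a minimal sketch that splits text in this format into two token-list documents. It is not yalign's implementation: `split_parallel_corpus` is a hypothetical name, it takes the file contents as a string, and it returns plain lists of tokens rather than Sentence objects.

```python
def split_parallel_corpus(text):
    """Split alternating-language lines into two documents.

    Hypothetical helper for illustration only, not part of yalign.
    Each document is a list of sentences; each sentence is a list of tokens.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 2:
        raise ValueError("expected an even number of sentence lines")
    # Even-indexed lines belong to the first language, odd-indexed to the second.
    doc_a = [line.split() for line in lines[0::2]]
    doc_b = [line.split() for line in lines[1::2]]
    return doc_a, doc_b


corpus = (
    "This is a sentence .\n"
    "Esto es una oración .\n"
    "And this , my friend , is another .\n"
    "Y esta , mi amigo , es otra .\n"
)
doc_a, doc_b = split_parallel_corpus(corpus)
```

Because the sentences are already tokenized, splitting on whitespace is enough to recover the tokens; when reading from disk, open the file with encoding="utf-8".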
yalign.input_conversion.parse_training_file(training_file)

Reads SentencePairs from a training file.

yalign.input_conversion.srt_to_document(text, lang='en')

Converts a string in SRT (SubRip subtitle) format into a list of Sentences.
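For readers unfamiliar with SRT, the sketch below shows the kind of input involved and one way to strip cue numbers and timing lines before any sentence splitting. It is an assumption-laden illustration, not yalign's implementation: `srt_text_lines` is a hypothetical helper and it returns raw text lines, not Sentences.

```python
import re

# Timing row such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_text_lines(srt):
    """Return the subtitle text lines of an SRT string, in order.

    Hypothetical helper for illustration only, not part of yalign.
    """
    text = srt.replace("\r\n", "\n").strip()
    lines = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        rows = [row.strip() for row in block.splitlines() if row.strip()]
        for index, row in enumerate(rows):
            if TIMING.match(row):
                # Everything after the timing row is subtitle text.
                lines.extend(rows[index + 1:])
                break
    return lines


sample = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,000
How are you
today?
"""
text_lines = srt_text_lines(sample)
```

A sentence can span several cues or lines, so turning this text into sentences still requires a separate splitting step.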

yalign.input_conversion.text_to_document(text, language='en')

Returns the string text as a list of Sentences.

yalign.input_conversion.tmx_file_to_documents(filepath, lang_a=None, lang_b=None)

Converts a TMX file into two lists of Sentences: the first for language lang_a and the second for language lang_b.
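As a rough picture of what "two lists, one per language" means for TMX input, the following sketch pairs up segments from a TMX string using only the standard library. It is not yalign's implementation: `tmx_segments` is a hypothetical helper, it requires both language codes explicitly, and it returns plain strings rather than Sentences.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_segments(tmx_text, lang_a, lang_b):
    """Return two parallel lists of segment strings from TMX text.

    Hypothetical helper for illustration only, not part of yalign.
    """
    lang_a, lang_b = lang_a.lower(), lang_b.lower()
    root = ET.fromstring(tmx_text)
    doc_a, doc_b = [], []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = variant.get(XML_LANG) or variant.get("lang")
            seg = variant.find("seg")
            if lang and seg is not None:
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        # Keep only translation units that contain both languages.
        if lang_a in segments and lang_b in segments:
            doc_a.append(segments[lang_a])
            doc_b.append(segments[lang_b])
    return doc_a, doc_b


sample = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="es"><seg>Hola mundo.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Only English here.</seg></tuv>
    </tu>
  </body>
</tmx>"""
english, spanish = tmx_segments(sample, "en", "es")
```

Skipping units that lack one of the two languages keeps the lists the same length, so the item at a given index in one list is the translation of the item at the same index in the other.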

yalign.input_conversion.tokenize(text, language='en')

Returns a Sentence with Words (i.e., a list of unicode objects).