Yalign Model

class yalign.yalignmodel.MetadataHelper(metadata)

Bases: dict

class yalign.yalignmodel.YalignModel(document_pair_aligner=None, threshold=None, metadata=None)

Bases: object

Main Yalign class. It provides methods to train an alignment model, to load a model from a folder, and to align two documents.

align(document_a, document_b)

Try to detect aligned sentences in the comparable documents document_a and document_b. The returned alignments are expected to meet the F-measure for which the model was trained.

align_indexes(document_a, document_b)

Same as align, but returns sentence indexes into the documents instead of the sentences themselves.
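The relation between the two return shapes can be illustrated without the library. This is a minimal sketch that assumes the indexes come back as (i, j) pairs and that a document behaves like a sequence of sentences; plain strings stand in for yalign's sentence objects, and `indexes_to_sentences` is a hypothetical helper, not part of yalign.

```python
def indexes_to_sentences(document_a, document_b, index_pairs):
    """Map (i, j) index pairs to the sentences they point at."""
    return [(document_a[i], document_b[j]) for i, j in index_pairs]


# Plain strings stand in for tokenized sentences (an assumption).
document_a = ["Hello.", "How are you?", "Goodbye."]
document_b = ["Hola.", "Adiós."]

# Sentence 1 of document_a has no counterpart, so it is simply absent.
index_pairs = [(0, 0), (2, 1)]
sentence_pairs = indexes_to_sentences(document_a, document_b, index_pairs)
```

Working with indexes is convenient when the alignments need to be compared against a reference such as real_alignments, since both are then expressed in the same form.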

classmethod load(model_directory)

Loads an existing YalignModel from the path to the folder that contains it.

optimize_gap_penalty_and_threshold(document_a, document_b, real_alignments)

Given documents document_a and document_b (not necessarily aligned) and the real_alignments for those documents, train the YalignModel instance to maximize the target F-measure (the quality measure).

real_alignments is a list of index pairs (i, j) into document_a and document_b respectively, indicating that those sentences are aligned. Pairs not included in real_alignments are assumed to be wrong alignments.
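To make the quality measure concrete, the sketch below computes precision, recall and F-measure from index pairs in the format described above. The function name `f_measure` and its `beta` parameter are illustrative assumptions, not part of the yalign API, and yalign's own scoring may differ in detail.

```python
def f_measure(real_alignments, predicted_alignments, beta=1.0):
    """Return (precision, recall, F) for two collections of (i, j) pairs."""
    real = set(real_alignments)
    predicted = set(predicted_alignments)
    if not real or not predicted:
        return 0.0, 0.0, 0.0
    # A predicted pair counts as correct only if it appears in the reference.
    hits = len(real & predicted)
    precision = hits / len(predicted)
    recall = hits / len(real)
    if hits == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


real = [(0, 0), (1, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (3, 3), (4, 4)]
precision, recall, f = f_measure(real, predicted)
```

Here two of the four predicted pairs are correct and two of the three real pairs are found, so precision is 1/2, recall is 2/3 and the F-measure is 4/7.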


save(model_directory)

Store a serialization of a YalignModel instance in a given folder. Metadata is stored in a separate file.

yalign.yalignmodel.apply_threshold(alignments, threshold)
yalign.yalignmodel.basic_model(corpus_filepath, word_scores_filepath, lang_a=None, lang_b=None)

Creates and trains a YalignModel with the basic configuration and default values.

corpus_filepath is the path to a parallel corpus used for training. It can be:

  • a csv file with two sentences and alignment information, or
  • a tmx file with correct alignments (a regular parallel corpus), or
  • a text file with interleaved sentences (one line in language A, the next in language B)
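The third input format listed above, interleaved sentences, can be read as follows. This is a minimal sketch of the layout only: `read_interleaved` is a hypothetical helper, and whether yalign itself tolerates blank lines or an odd line count is not stated here, so the sketch simply rejects odd input.

```python
import io


def read_interleaved(lines):
    """Yield (sentence_a, sentence_b) pairs from alternating lines."""
    stripped = [line.rstrip("\n") for line in lines]
    if len(stripped) % 2 != 0:
        raise ValueError("expected an even number of lines")
    # Even positions hold language A, odd positions hold language B.
    for k in range(0, len(stripped), 2):
        yield stripped[k], stripped[k + 1]


# An in-memory stand-in for a text file on disk.
sample = io.StringIO(
    "the house is red\n"
    "la casa es roja\n"
    "i like tea\n"
    "me gusta el té\n"
)
pairs = list(read_interleaved(sample))
```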

word_scores_filepath is the path to a csv file (possibly gzipped) with word dictionary data, for example “house,casa,0.91”.
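A file in this shape can be inspected with the standard library. The sketch below assumes three columns per row (word in language A, word in language B, score), as the example row suggests, and assumes that gzipped files carry a `.gz` suffix. Both helper names are hypothetical, and this is not how yalign itself reads the file.

```python
import csv
import gzip
import io


def open_maybe_gzipped(path):
    """Open path as text, decompressing when it ends in .gz (an assumption)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def read_word_scores(fileobj):
    """Return a dict mapping (word_a, word_b) to a float score."""
    scores = {}
    for row in csv.reader(fileobj):
        if not row:
            continue  # skip empty lines
        word_a, word_b, score = row
        scores[(word_a, word_b)] = float(score)
    return scores


# An in-memory stand-in for the csv file.
sample = io.StringIO("house,casa,0.91\ndog,perro,0.88\n")
word_scores = read_word_scores(sample)
```

For a file on disk, `read_word_scores(open_maybe_gzipped(path))` covers both the plain and the gzipped case.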

lang_a and lang_b are required for the tokenizer when the corpus is a tmx file. In the other cases they are not necessary because the words are assumed to be already tokenized.

yalign.yalignmodel.best_threshold(real_alignments, predicted_alignments)

Returns the best F score and the corresponding threshold value for the current gap_penalty.
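One way to picture this search is a sweep over candidate thresholds. The sketch below is not yalign's implementation. It assumes that each predicted alignment is an (i, j, cost) triple, that lower cost is better, and that an alignment is kept when its cost does not exceed the threshold. `sweep_thresholds` is a hypothetical name.

```python
def sweep_thresholds(real_alignments, predicted_alignments):
    """Return (best_f, best_threshold) over the observed costs."""
    real = set(real_alignments)
    best_f, best_threshold = 0.0, None
    # Only the observed costs can change which alignments are kept.
    for threshold in sorted({cost for _, _, cost in predicted_alignments}):
        kept = {(i, j) for i, j, cost in predicted_alignments if cost <= threshold}
        hits = len(kept & real)
        if hits == 0:
            continue
        precision = hits / len(kept)
        recall = hits / len(real)
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_f, best_threshold = f, threshold
    return best_f, best_threshold


real = [(0, 0), (1, 1), (2, 2)]
predicted = [(0, 0, 0.1), (1, 1, 0.2), (2, 3, 0.6), (3, 4, 0.9)]
best_f, best_threshold = sweep_thresholds(real, predicted)
```

In this toy data the two cheapest predictions are correct and the two most expensive are wrong, so the sweep settles on a threshold of 0.2 with an F score of 0.8: raising the threshold further only admits wrong pairs.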

yalign.yalignmodel.random_sampling_maximizer(F, min_, max_, n=None)
yalign.yalignmodel.score_with_best_threshold(aligner, xs, ys, gap_penalty, real_alignments)