This repository contains the code, report and presentation slides of my Master's thesis project (Göteborgs Universitet, A.Y. 2020-21).
To compile the CA module, the Haskell Stack is recommended. To build the project, clone this repository, move into the corresponding directory and run `stack build`. This will generate a Haskell library, named `concept-alignment`, and the following five executables:
- `extract-concepts` (for CE)
- `propagate-concepts` (for CP)
- `eval`, a script for evaluating CP and CE
- `generate-grammar`, to automatically generate a GF grammar from the alignments extracted via CE and/or CP
- `translate`, to perform simple MT experiments
Using the CE module with the default parameters is straightforward:
```
stack exec -- extract-concepts SL.conllu TL.conllu
```
The program, however, supports a number of command-line options. Most importantly:
- `--file=FILE` can be used to specify where to write the resulting alignments. Unless otherwise specified (see below), the output consists of two new aligned `.conllu` files, stored at the chosen location, whose names are prefixed with `SL` and `TL` respectively
- `--linearize` specifies that concepts should be linearized
- `--maxsize=INT` sets the maximum size of the extracted alignments
- `--pattern=FILE` specifies the path to a `.hst` file defining a specific pattern the extracted alignments have to match. The syntax of `.hst` files is described here (`pattern-match`)
- `--all` specifies that the selection step should be skipped
- `--clauses` enables clause segmentation
- `--rest` enables a second pass of alignment "by exclusion"
- `--pharaoh=FILE` is used to specify a file of alignments in pharaoh format to use as a backup
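For instance, a run that linearizes the extracted concepts, caps their size and writes the output to a custom location might look as follows (the file names and the value of `maxsize` are, of course, just placeholders):

```
stack exec -- extract-concepts SL.conllu TL.conllu --linearize --maxsize=5 --file=out/aligned.conllu
```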
CP requires three, instead of two, `.conllu` files: one with the concepts to propagate and two containing the corpus of annotated sentences they should be looked for in. As a consequence, running it with the default parameters looks as follows:
```
stack exec -- propagate-concepts SL_concepts.conllu SL.conllu TL.conllu
```
All `extract-concepts` options that are also relevant for CP (i.e. all of them except `maxsize` and `pharaoh`) are also valid for this second executable.
If propagation is performed on a translation of the same set of sentences used for the extraction step (the sentence ids must correspond), the (recommended) `same-text` flag can help improve performance and precision (while recall might decrease slightly).
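For example, assuming `same-text` follows the same `--flag` convention as the options above, such a run could look like:

```
stack exec -- propagate-concepts SL_concepts.conllu SL.conllu TL.conllu --same-text
```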
The evaluation script `eval` can be run in three different "modes":

- with a single collection of alignments (`stack exec -- eval SL.conllu TL.conllu`): it allows for interactive manual correctness annotation (if necessary) and prints out basic statistics about precision, recall and the amount of reusable alignments
- extraction comparison mode (`stack exec -- eval extraction old_SL.conllu old_TL.conllu new_SL.conllu new_TL.conllu`): given an annotated and a new, possibly yet-to-annotate collection of alignments, it makes it possible to interactively complete the annotation of the latter (if necessary) and, on top of printing out the basic statistics, compares the new alignments to the old ones, reporting how many correct and incorrect alignments were lost and/or found
- propagation comparison mode (`stack exec -- eval propagation old.txt new.txt`): similar to extraction comparison mode, except that the statistics are CP-specific (percentage of successfully propagated alignments, number of errors introduced by CP, etc.).
The command-line option `--reasons` can be added when criterion-wise statistics are needed.
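For instance, combining it with the basic single-collection mode shown above (and assuming the flag can simply be appended to the command):

```
stack exec -- eval SL.conllu TL.conllu --reasons
```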
The grammar generation module can be run as follows:
```
stack exec -- generate-grammar alignments_prefixes
```
Here, `alignments_prefixes` is a list of space-separated prefixes of aligned CoNLL-U files to derive grammar rules from.
Flags:

- `--extraction-grammar=PREFIX` (prefix of the extraction grammar, by default `grammars/Extract`)
- `--morphodicts=PREFIX` (prefix common to all the morphological dictionaries to be used, by default `MorphoDict`, since, if the GF RGL is set up correctly, morphological dictionaries should be in your GF path). Specifying a value for this parameter is often necessary, even when the morphological dictionaries to be used are the RGL's
- `--output=PREFIX` (prefix of the generated (output) grammar, by default `grammars/Generated`)
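As an example, assuming the aligned CoNLL-U files produced by CE/CP share the (purely illustrative) prefixes `SL_aligned` and `TL_aligned`, a possible invocation might be:

```
stack exec -- generate-grammar SL_aligned TL_aligned --output=grammars/MyGrammar
```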
The simplicity of the translation module also makes it extremely easy to use. The string or list of newline-separated strings to translate comes from the standard input and the only argument is the path to the automatically generated GF grammar to be used, again in `.pgf` format.
Here is an example of using the program to translate a single sentence:
echo "this sentence will be translated" | stack exec -- translate
Extracted.pgf
Modifying the criteria CA makes use of requires little effort. All criteria are in fact defined in a separate Haskell module, `Criteria`, that only exports the list of criteria to be used. In the current implementation, it looks like this:
```haskell
criteria :: [Criterion]
criteria = [udpos, ud, divs, pass, pos]
```
Removing criteria and/or changing their priority is then just a matter of altering this list.
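For instance, a hypothetical variant that drops the last criterion and gives the highest priority to `divs` would simply redefine the exported list as:

```haskell
-- Alternative configuration (illustrative only): the criteria themselves
-- are unchanged, only their selection and order differ.
criteria :: [Criterion]
criteria = [divs, udpos, ud, pass]
```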
Adding new criteria is simple too, but it also requires an understanding of the data type `Criterion`, defined in module `ConceptAlignment`:
```haskell
data Criterion = C {
  func      :: UDTree -> UDTree -> Bool,
  reas      :: S.Set Reason,
  headAlign :: Bool,
  strict    :: Bool
}
```
As the above code block shows, `Criterion` is a record type whose fields are:

- a boolean function `func` specifying a rule to decide whether two trees should be aligned. For instance, UD label matching is implemented as

  ```haskell
  udMatch :: UDTree -> UDTree -> Bool
  (RTree n ts) `udMatch` (RTree m us) =
    udDEPREL n == udDEPREL m
  ```

- a set of "reasons" `reas` to be used for ranking the alignments, shown in the linearized files and used to compute criterion-specific statistics. `Reason`s, and not alignment rules (`func`s), are what mirrors alignment criteria exactly: a `func` can in fact define a more specific alignment rule, such as "type of categorial divergence in which an adjective is replaced by an adverb", while a `Reason` is in practice a more coarse-grained label associated with potentially many rules
- two boolean flags, `headAlign` and `strict`, the former specifying whether head alignment should be performed for UD tree pairs matching that criterion, the latter marking the criterion as either strict (i.e. to be also used for "alignment by exclusion") or not.
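Putting these pieces together, here is a minimal sketch of what a new criterion could look like. The rule `lemmaMatch`, the `Reason` value `Lemma` and the accessor `udLEMMA` are hypothetical and only meant to illustrate the structure; only the `Criterion` record itself and the existing criteria names come from the actual codebase.

```haskell
-- Hypothetical rule: align two UD trees whenever their root nodes share a lemma.
-- udLEMMA is assumed to exist analogously to udDEPREL.
lemmaMatch :: UDTree -> UDTree -> Bool
(RTree n _) `lemmaMatch` (RTree m _) = udLEMMA n == udLEMMA m

-- Wrap the rule in a Criterion record.
lemma :: Criterion
lemma = C {
  func      = lemmaMatch,
  reas      = S.singleton Lemma, -- Lemma is a hypothetical Reason value
  headAlign = True,              -- also align the heads of matching tree pairs
  strict    = False              -- not to be used for "alignment by exclusion"
}

-- Finally, add it to the list exported by the Criteria module,
-- e.g. with the lowest priority:
criteria :: [Criterion]
criteria = [udpos, ud, divs, pass, pos, lemma]
```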
If you use this software in your research, please cite:

```bibtex
@inproceedings{masciolini2021grammar,
  title={Grammar-based concept alignment for domain-specific machine translation},
  author={Masciolini, Arianna and Ranta, Aarne},
  booktitle={Proceedings of the Seventh International Workshop on Controlled Natural Language (CNL 2020/21)},
  year={2021}
}
```