[TAG] Apertium 2 cent tip: how to add analysis and generation of unknown words, and *why you shouldn't*
Jimmy O'Regan
joregan at gmail.com
Thu Jan 1 18:27:30 MSK 2009
In my article about Apertium, I promised to follow it up with another
article of a more 'HOWTO' nature. And I've been writing it. And
constantly rewriting it, every time somebody asks how to do something
that I think is moronic, to explain why they shouldn't do that... and
I need to accept that people will always want to do stupid things, and
I should just write a HOWTO.
Anyway... recently, someone asked how to implement generation of
unknown words. I can think of only two reasons why someone would
want this: either they have words in the bilingual dictionary
that they don't have in the monolingual dictionary, or they want to
use it in conjunction with morphological guessing.
In general, the usual method in Apertium's translators is that if we
don't know a word, we don't try to translate it -- we're honest
about it, essentially. Apertium has an option to mark unknown words,
which we generally recommend that people use. It doesn't cover
'hidden' unknown words, where the same word can be two different parts
of speech--we're looking into how to attempt that. One result of this
is that before a release, we specifically remove some words from the
monolingual dictionary if we can't add a translation.
Anyway, in the first case, we generally write scripts to automate
adding those words to the monolingual dictionary. One plus of this is
that the result can be manually checked afterwards, and fixed. Another
is that, by adding the word to the monolingual dictionary, we can also
analyse it: we generally try to make bidirectional translators, but
sometimes we can only make a single direction translator--but we still
have the option of adding the other direction later. And, as our
translators are open source, it increases the amount of freely
available linguistic data to do so, so it's a win all round.
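
As a rough illustration of where such a script might start, here is a
minimal Python sketch that lists lemmas found on one side of a
bilingual dictionary but missing from the monolingual one. The two
dictionary snippets and all function names are made up for this
example; it assumes single-word lemmas and an lm attribute on the
monolingual entries, and it is not one of the project's actual scripts.

```python
import xml.etree.ElementTree as ET

# Tiny, invented dictionary fragments used only for this illustration.
MONODIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="accept"><i>accept</i><par n="accept__vblex"/></e>
  </section>
</dictionary>"""

BIDIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>accept<s n="vblex"/></l><r>akceptować<s n="vblex"/></r></p></e>
    <e><p><l>reject<s n="vblex"/></l><r>odrzucać<s n="vblex"/></r></p></e>
  </section>
</dictionary>"""


def monodix_lemmas(xml_text):
    """Collect the lm attribute of every <e> entry that has one."""
    root = ET.fromstring(xml_text)
    return {e.get("lm") for e in root.iter("e") if e.get("lm")}


def bidix_lemmas(xml_text, side="l"):
    """Collect lemmas from one side of the bidix ('l' or 'r')."""
    root = ET.fromstring(xml_text)
    lemmas = set()
    for node in root.iter(side):
        # node.text is the text before the first <s> child, i.e. the lemma.
        # Multiword lemmas (using <b/>) would need extra handling.
        if node.text:
            lemmas.add(node.text)
    return lemmas


def missing_from_monodix(monodix_xml, bidix_xml, side="l"):
    """Lemmas the bidix knows about but the monodix does not."""
    return sorted(bidix_lemmas(bidix_xml, side) - monodix_lemmas(monodix_xml))


print(missing_from_monodix(MONODIX, BIDIX))
```

The output is only a list of candidates; choosing the right paradigm
for each word is the part that still needs a human to check it.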
The latter case, of also using a morphological guesser, is one source
of some of the worst translations out there. For example, at the
moment, I'm translating a short story by Adam Mickiewicz, which
contains the phrase 'tu i owdzie', which is either a misspelling of
'tu i ówdzie' ('here and there'), an old form, or a typesetting
error[1], but in any case, the word 'owdzie' does not exist in the
modern Polish language.
Translatica, the leading Polish-English translator, gave: "here and he
is owdzying"
Now, if I knew nothing of Polish, that would send me scrambling to the
English dictionary, to search for the non-existent verb 'to owdzy'.
(Google gave: "here said". SMT is a great idea, in theory, but in
practice[2] has the potential to give translations that bear no
resemblance to the original meaning of the source text. Google's own
method of 'augmenting' SMT by extracting correlating phrase pairs
based on a pivot language also leads to extra ambiguities[3].)
Anyway. The tip, for anyone who still wants to try it:
Apertium's dictionaries can have a limited subset of regular
expressions; these can be used by someone who wishes to have both
analysis and generation of unknown words. The <re> tag can be placed
before the <par> tag, so the entry:
<e>
<re>[a-z]*</re>
<par n="accept__vblex"/>
</e>
will accept, and generate, any otherwise unknown word with the set of
endings represented by the paradigm for the verb 'accept': -s, -ed,
-ing, -0, etc. That gets more complicated when you want to do the same
with verbs like 'live', or 'plug', but judicious use of regexes should
get around that. It's still a bad idea, though, and if anyone tries
this and has poor results and for some reason feels compelled to tell
me about it, expect only 'I told you so' :)
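
To see why 'live' and 'plug' are a problem, here is a small Python
sketch that imitates what a catch-all entry like the one above would
do. It is a simulation under assumptions, not Apertium's own code: the
list of endings and the tag strings are a simplified stand-in for the
real 'accept' paradigm, and the function names are invented.

```python
import re

# Assumed, simplified endings for the 'accept' paradigm; the paradigm
# in a real dictionary carries more tags and more readings than this.
ACCEPT_ENDINGS = [
    ("ing", "<vblex><ger>"),
    ("ed", "<vblex><past>"),
    ("s", "<vblex><pri><p3><sg>"),
    ("", "<vblex><inf>"),
]

STEM = re.compile(r"[a-z]*")  # same pattern as in the <re> element


def analyse(surface):
    """Return every lemma+tags reading the catch-all entry would allow."""
    readings = []
    for ending, tags in ACCEPT_ENDINGS:
        if surface.endswith(ending):
            stem = surface[: len(surface) - len(ending)]
            if STEM.fullmatch(stem):
                readings.append(stem + tags)
    return readings


def generate(lemma, tags):
    """Return the surface form for lemma+tags, or None if nothing fits."""
    for ending, ending_tags in ACCEPT_ENDINGS:
        if ending_tags == tags and STEM.fullmatch(lemma):
            return lemma + ending
    return None


print(analyse("lived"))                   # 'liv' offered as a lemma
print(generate("live", "<vblex><past>"))  # 'liveed'
print(generate("plug", "<vblex><ger>"))   # 'pluging'
```

Every lowercase word gets at least the 'infinitive' reading, and the
stems come out wrong for anything that doesn't inflect like 'accept',
which is exactly the kind of guess that ends up in the output.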
[1] I OCR'd and proofread the text for Project Gutenberg; that's what
appears in the original text.
[2] Word reordering, case restoration, punctuation restoration, etc.
are typically handled in an SMT system in a way that is functionally
similar to the translation process, by scoring the phrases generated
by these stages against a statistical model, which can lead to words
being replaced, replacing a correct translation with an incorrect one
that happens to have better punctuation, etc.
[3] The French 'Je viens de manger' ('I have just eaten') translated
to 'Ja po prostu zjeść' ('I simply to eat'; 'po prostu zjedz!' is the
equivalent of 'just eat it!') in Polish, because of the ambiguity of
'just' in English, which doesn't exist between French and Polish
(that's today's translation; before it said 'mam tylko jeść' 'I have
only to eat', mixing another ambiguity of 'just' in English, 'I have
just five eggs').