[TAG] Apertium 2 cent tip: how to add analysis and generation of unknown words, and *why you shouldn't*
joregan at gmail.com
Thu Jan 1 22:46:17 MSK 2009
2009/1/1 Ben Okopnik <ben at linuxgazette.net>:
> On Thu, Jan 01, 2009 at 03:35:57PM +0000, Jimmy O'Regan wrote:
>> 2009/1/1 Jimmy O'Regan <joregan at gmail.com>:
>> > In general, the usual method used in Apertium's translators is, if we
>> > don't know the word, we don't try to translate it -- we're honest
>> > about it, essentially. Apertium has an option to mark unknown words,
>> > which we generally recommend that people use. It doesn't cover
>> > 'hidden' unknown words, where the same word an be two different parts
>> *can* be... I can only imagine how poorly that would translate :)
> That would be the major downfall of machine translation: the underlying
> assumption (which pretty much _has_ to be that way) is that the input
> makes sense in the first place. Misspellings, of course, void that: the
> above is an instant - you might even say automatic and thus invisible -
> correction for a human, but an insoluble problem for a machine.
Misspellings, orthographic variations in different regions (our
Spanish-English translator still has a curious mix of American and
British spellings), false derivations (we had an example of that here,
recently :), archaisms, the list goes on and on. Even the presence or
absence of punctuation can be significant.
> Until someone comes up with systems that can handle context, on a fairly
> broad scale, mechanical translation must perforce remain limited.
Nice phrase that, 'mechanical translation': it equally covers machine
translation and, say, the collected works of Jeremiah Curtin and
his ilk :)
Semantics-based translation seemed more or less abandoned, but I see
signs of it making a comeback: a paper I read recently more or less
said that the reason attempts to plug systems like WordNet into
machine translators haven't yielded significantly better results is
that it was not approached in the correct manner (all of the prior
research in the area was wrong :); GramTrans make heavy use of
semantic knowledge in their translators.
Apertium has a module called 'lextor' that uses statistically
collected co-occurrences to perform lexical selection, but I don't
like trusting to statistics anything that can be manually specified
(our part of speech tagger is also statistically based, but it also
accepts rules). I'm writing a new module that's strictly rule based --
because I'm primarily interested in trying to properly translate
prepositions in relation to verbs, and lextor specifically ignores
prepositions (they would *really* screw up the statistics :) -- but it
also requires changes to the main rule engine, and possibly extending
the stream format, which I'd prefer to avoid.
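
As a rough illustration of the rule-based idea (a sketch only: this is
not the module itself, nor Apertium's rule format, and the table, the
defaults and the function name are all made up for the example), the
preposition is chosen by looking at the verb it depends on, falling
back to a default, and marking anything unknown rather than guessing:

```python
# Hypothetical English -> Spanish rule table, keyed on
# (verb lemma, source preposition). Not Apertium data.
RULES = {
    ("depend", "on"): "de",      # depend on   -> depender de
    ("think", "about"): "en",    # think about -> pensar en
    ("dream", "of"): "con",      # dream of    -> soñar con
}

# Hypothetical context-free defaults, used when no verb rule applies.
DEFAULTS = {
    "on": "sobre",
    "about": "sobre",
    "of": "de",
}


def select_preposition(verb_lemma, preposition):
    """Pick a target preposition for a (verb, preposition) pair.

    A verb-specific rule wins; otherwise the default is used; otherwise
    the source word is returned with a leading '*' to mark it unknown.
    """
    rule = RULES.get((verb_lemma, preposition))
    if rule is not None:
        return rule
    return DEFAULTS.get(preposition, "*" + preposition)


if __name__ == "__main__":
    for verb, prep in [("depend", "on"), ("sit", "on"), ("go", "via")]:
        print(verb, prep, "->", select_preposition(verb, prep))
```

Every decision here can be traced to a single hand-written entry, which
is the appeal over co-occurrence counts; the cost is that someone has
to write the entries.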
The most promising development in SMT is the Berkeley aligner, which
is open source: http://code.google.com/p/berkeleyaligner/ Instead of
blindly trying to align n-grams, it aligns elements of parse trees.
(Google have done some work in trying to do something similar, but
they've had some difficulty in retrofitting parse trees to the n-grams
they already have).
> even then...
> "Prostitutes appeal to pope"
> "Queen Mary having bottom scraped"
> "Milk drinkers are turning to powder"
> "I saw the Alps flying to Romania"
> "The horse raced past the barn fell"
> "Time flies" "You can't; they move too fast"
> "Cheney hunts quail; companions duck"
> "Drunk gets nine months in violin case"
Those all remind me that there's one thing a human translator can do
that a computer program never can: add a footnote :)
If you'll forgive my choice of example, 'te przeklęte Moskale' --
'those cursed Muscovites'. That's an easy, word to word translation,
but: in Polish, a distinction is made in the plural between human
males and anything else: the correct form should be 'ci przeklęci
Moskali': using the incorrect form is possibly intended to either show
that the speaker has been poorly educated, or that he intends to
intensify the insult by speaking of the Muscovites as 'non-men'. I've
been assured that in the time the story was set, that would have
been grammatically correct, but the rest of the text contradicts that.
 Douglas Hyde wrote of him: "Mr. Curtin tells us that he has taken
his tales from the old Gaelic-speaking men; but he must have done so
through the awkward medium of an interpreter, for his ignorance of the
commonest Irish words is as startling as Lady Wilde's." Curtin is more
famous for his bad translations of Polish stories, though.
Their data is proprietary, but their semantic engine, CG, is open
source, and used in a few translation modules in Apertium - we don't
use it to its full extent yet, but we have an experimental translator
that does (between two dialects of the Sami language). The main
developer of our Esperanto-English translator is friends with their
main developer, who has been quite helpful.
 The Battle of Stoczek, the first major battle of the November
Uprising of 1830.