[TAG] Two-cent tip for Linux Gazette
Jimmy O'Regan
joregan at gmail.com
Wed Sep 17 04:13:11 MSD 2008
2008/9/16 Ben Okopnik <ben at linuxgazette.net>:
> On Tue, Sep 16, 2008 at 10:47:24AM +0100, Jimmy O'Regan wrote:
>> 2008/9/15 Ben Okopnik <ben at linuxgazette.net>:
>> > I make a point of keeping separate lexicon files for this rather than
>> > using the ones from "/usr/share/dictd" since I find that they require a
>> > significant number of corrections. Eventually, after I've polished them
>> > up, I'll send them to the 'dictd' folks.
>>
>> Those files originate with the freedict project, which sadly seems to
>> be defunct.
>
> :((((((((((((((((((((((((((((((
>
> ...and a few more '(((('s for good measure. That *REALLY* sucks!
>
Well, yeah. But... given what my hobbies are, I've quite a collection
of links to dictionaries under free licences, if anyone's interested
(as well as collections that we've politely asked for, and been
granted rights to distribute under the GPL, but haven't yet had time
to convert). Apertium is the Eclipse of open source MT :)
>> I sent some fixes to their Irish-English dictionary around
>> a year ago and subscribed to the mailing list - the fixes have gone
>> untouched, the list has seen little other than spam. In any case, the
>> dict files are generated from TEI-encoded XML files (which is also
>> used internally at Distributed Proofreaders for new Project Gutenberg
>> etexts), which would be the best place to make the changes (I'm sure
>> the Debian maintainer would be happy to integrate your fixes).
>
> I'll look into that in a while.
>
>> Most of the data in Freedict came from a Windows program called Ergane
>> (http://download.travlang.com/Ergane//) which allowed its
>> dictionaries to be used under public domain terms (though I'm not sure
>> if that was true of later versions). The program used Esperanto as a
>> pivot language, which is probably the reason why there are a few
>> questionable entries in the dictionaries (IIRC, the Russian-English
>> dictionary had '?????' matched to 'g-man' rather than 'agent' - this
>> is one reason why we have direction restrictions in Apertium).
>
> Oh, it's not just a question of direction; there were plenty of simple
> errors as well. Truncated words, misspellings, and - for some odd reason
> - words in which a part of the correct word would be repeated, something
> like "correctorrect" or "wordrdord". There were a number of these. There
> were also a number of quite silly literal translations, like "White
> Russia" for "Belorussia" (that's like translating "childbearing" as "kid
> demeanor".)
>
Yes; but the 'White Russia' entry exemplifies what I'm talking about -
Ergane was built from an Esperantist's point of view, but applied
across other languages, without considering whether a translation
should have a direction restriction - translating from 'White
Russia' to the native equivalent of Belorussia is the right thing to
do; the opposite is not.
(My preferred method is to, where possible, find a similarly archaic
term which needs no restrictions, but it's not always possible, and
there are probably better uses of the time).
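To make the idea of a direction restriction concrete, here is a minimal
sketch in Python. It is not Apertium's actual format or code: the entry
layout, the "LR"/"RL" labels, the translate() helper and the Polish
target word are all assumptions chosen for illustration.

```python
# Each entry: (left word, right word, restriction).
# restriction is None (usable both ways), "LR" (left-to-right only)
# or "RL" (right-to-left only). All names here are hypothetical.
ENTRIES = [
    ("Belorussia", "Białoruś", None),
    ("White Russia", "Białoruś", "LR"),  # accept it as input, never produce it
]


def translate(word, direction):
    """Return every translation of `word` allowed in `direction`."""
    results = []
    for left, right, restriction in ENTRIES:
        # Skip entries that are restricted to the other direction.
        if restriction is not None and restriction != direction:
            continue
        source, target = (left, right) if direction == "LR" else (right, left)
        if source == word:
            results.append(target)
    return results


print(translate("White Russia", "LR"))  # ['Białoruś']
print(translate("Białoruś", "RL"))      # ['Belorussia']
```

The restricted entry is still understood when it turns up in the source
text, but it is never generated in the other direction.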
I have seen various of the other kinds of errors though; the Irish
examples I fixed were, for the most part, character encoding
screw-ups; even after fixing those, there were issues of missing
spaces, etc.
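For readers who have not met this kind of damage, the sketch below
shows one common form of it: UTF-8 text misread as Latin-1. Whether the
Irish files were broken in exactly this way is an assumption; the
mangle() and repair() helpers are hypothetical names.

```python
def mangle(text: str) -> str:
    """Simulate the damage: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo the damage, if the text really was mangled that way."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; leave it untouched


broken = mangle("fáilte")
print(broken)          # fÃ¡ilte
print(repair(broken))  # fáilte
```

Text that was never mangled passes through repair() unchanged, because
the round trip fails and the original string is returned.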
>> If you want extra words, wikipedia is a nice place to look. This is a
>> version of Francis Tyers' wikipedia script[1] to output to dict format
>> (original here: http://wiki.apertium.org/wiki/Building_dictionaries):
>
> [snip]
>
>> sleep 8; # don't put undue strain on the Wikimedia servers
>
> [laugh] Given that I have pretty close to 100k entries in my
> '/usr/share/dict/words' (I use the standard Scrabble word list), and
> each translation will take 8 seconds + translation time + network
> latency (call it 10 seconds total), that would take more than 11 days to
> download. I wonder if there's an easier way?
>
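The estimate quoted above is easy to check. A small sketch, using only
the figures from the message (100k entries, roughly 10 seconds each);
crawl_days() is a hypothetical helper name:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(entries: int, seconds_per_entry: float) -> float:
    """Total time, in days, to look up every entry one after another."""
    return entries * seconds_per_entry / SECONDS_PER_DAY


print(f"{crawl_days(100_000, 10):.1f} days")  # 11.6 days
```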
Actually, there are plenty of them. I constructed a ~8k
Portuguese-English dictionary for my parents[1] using mostly
'dictionary crossing' (though I had to do a bit of manual searching,
to ensure my father had the terms he needed for his dialysis - so he
can avoid taking the kinds of drugs that'd kill him, etc.). I have ~10k
French-English from the same method (for my sister's birthday[2] today
:)
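As a rough illustration of dictionary crossing: take two bilingual
dictionaries that share a language and join them on it. The sketch
below is not the tool actually used; the cross() function, the choice
of Spanish as the shared language and the sample entries are all
assumptions made for the example.

```python
from collections import defaultdict


def cross(source_to_pivot, pivot_to_target):
    """Join two dictionaries on their shared (pivot) language."""
    crossed = defaultdict(set)
    for source, pivots in source_to_pivot.items():
        for pivot in pivots:
            # Collect every target word reachable through this pivot word.
            crossed[source].update(pivot_to_target.get(pivot, ()))
    # Drop source words for which no pivot word matched.
    return {s: sorted(t) for s, t in crossed.items() if t}


# Hypothetical sample data: Portuguese -> Spanish, Spanish -> English.
pt_es = {"rim": ["riñón"], "banco": ["banco"], "diálise": ["diálisis"]}
es_en = {"riñón": ["kidney"], "banco": ["bank", "bench"]}

print(cross(pt_es, es_en))
# {'rim': ['kidney'], 'banco': ['bank', 'bench']}
```

The 'banco' entry shows why the output is only a list of candidates:
the crossing cannot tell which sense was meant, and 'diálise' is lost
because the second dictionary has no matching entry. Both cases need a
human to sort them out.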
There are no end of open source tools available to assist in
dictionary construction - but at the end of the day, they all still
need a knowledgeable human to check them, which is very much the
bottleneck in our process.
[1] They went to Portugal to celebrate their 30th anniversary[3]. They
had a bit of a false start when the airline their flight was booked
with announced, the day before the flight, that it had gone broke,
but the travel agent came through.
[2] She's honest enough to only use it as a study aid, rather than as a crutch.
[3] Yes, my sister was born the day after my parents' anniversary. No,
my mother still hasn't fully forgiven her - all the more so, because
that particular year was the first that my father had a job after
several years of unemployment, and could afford to take my mother to a
restaurant.