I got my citizenship notice today!!! I’m an Irish citizen! Now I can vote! And I’m an EU citizen! It came two months earlier than I thought it would. Hooray!
Automatic translation by machine.
Rather than writing the canonical blog post about how I haven’t written in a long time, I’ll start with the metacanonical blog post about how I’m not going to write about how I haven’t written in a long time.
As some of you may know, I’ve been studying Irish lately. I’m a bit obsessive about it. I really really enjoy it. In any case it has gotten me thinking a lot about how to perform automatic translation by machine.
I have very little faith in the utility of formal grammars for computational linguistic analysis, both because they have been shown to be extremely poor performers in past practice and because I don’t really believe in semantics in the first place.
One common problem in translation by machine is the notorious “round trip problem”. Anyone who has played with Babel Fish has seen the humorous consequences of translating an idiomatic phrase from one language into a target language and then translating it back.
I have some ideas about how to solve the “round trip problem” in translation by machine. The idea is basically this. You take two parallel corpuses (corpi? I don’t know Latin and I always choose the wrong plural, so I’ll just go with the English plural). You “pin” these corpuses at places that are “full stops”. A full stop is a boundary over which the analysis is allowed to ignore correlations. This can improve the computational efficiency of the technique. I haven’t tried it yet, but I believe that paragraph boundaries may be reasonable. The analysis consists of looking at the probability of occurrence of words in the target text given words in the source text. Suppose we pin the following parallel corpus at sentence boundaries:
I am hungry
Tá ocras orm
I am thirsty
Tá tart orm
We will find that “hungry” is correlated with “ocras” and “thirsty” is correlated with “tart”. “I am” will be correlated with “Tá… …orm”. Very importantly, it will also be correlated with both “tart” and “ocras”. The importance of this can be seen from some further sentences in our parallel corpus:
I am tired
Tá mé tuirseach
I am contented
Tá mé sásta
It is only in context that we can determine the appropriate use of the prepositional pronoun “orm”. No amount of grammatical analysis will reveal the appropriateness of a particular construction when moving from a source language to a destination language. The fact that “the hunger is upon me” is correct is only evident from the use of the language.
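To make the counting idea concrete, here is a minimal Python sketch over the four sentence pairs above. It is only one possible reading of the approach, not a finished model: the function names (`cooccurrence_stats`, `best_match`) are made up for illustration, word order is ignored entirely, and the "lift" score (the conditional probability divided by the target word's overall frequency) is an assumption added here so that words like “Tá”, which appear in every sentence, do not swamp the more specific matches.

```python
from collections import Counter, defaultdict

# Toy parallel corpus from the examples above, pinned at sentence boundaries.
PAIRS = [
    ("I am hungry", "Tá ocras orm"),
    ("I am thirsty", "Tá tart orm"),
    ("I am tired", "Tá mé tuirseach"),
    ("I am contented", "Tá mé sásta"),
]


def cooccurrence_stats(pairs):
    """Return (cond, lift) dictionaries keyed as [source_word][target_word].

    cond[s][t] estimates P(t appears in target | s appears in source).
    lift[s][t] is cond[s][t] / P(t), an assumed normalisation that
    discounts target words which occur everywhere.
    """
    n = len(pairs)
    source_counts = Counter()
    target_counts = Counter()
    joint_counts = defaultdict(Counter)

    for src, tgt in pairs:
        # Count each word once per pinned pair; position is ignored here.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        source_counts.update(src_words)
        target_counts.update(tgt_words)
        for s in src_words:
            for t in tgt_words:
                joint_counts[s][t] += 1

    cond = {
        s: {t: c / source_counts[s] for t, c in row.items()}
        for s, row in joint_counts.items()
    }
    lift = {
        s: {t: p / (target_counts[t] / n) for t, p in row.items()}
        for s, row in cond.items()
    }
    return cond, lift


def best_match(lift, source_word):
    """Pick the target word with the highest lift for a source word."""
    row = lift[source_word]
    return max(row, key=row.get)


cond, lift = cooccurrence_stats(PAIRS)
print(best_match(lift, "hungry"))          # most specific match for "hungry"
print(cond["i"]["orm"], cond["i"]["mé"])   # "I" splits between the two constructions
```

On this tiny corpus, “I” co-occurs with “orm” in half the pairs and with “mé” in the other half, which is exactly the ambiguity that only the rest of the sentence can resolve.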
In any case I’ve been playing with different probabilistic models to achieve the correlations between words, and to make sure that the positioning is both important and not over-determined in the model. This is especially important for noun phrases and other syntactic units that should be treated as atomic by the structural analysis.
I’d love to hear if you have any thoughts….