Languages16 Jul 2006 09:42 am

I’ve been hacking away at my program to test a theory I have about machine translation. I wrote a bit about it in a previous post but I was fairly vague. I thought I’d describe in more detail exactly how the technique would work (I’m still in phase 1).

The idea is simple. The first phase is to take a corpus in a language. Take each sentence of the source (or some other sized chunk, currently I’m limited by computational tractability to a single sentence) and recombine each element of the sentence into every possible string of n-grams. If you play with it a bit you’ll realise that there are 2(N-1) of these for a string of size N. One way to think about it is that there are N-1 indexes into the spaces between words in the string. You can then think of each sentence as being a collection of indexes at which we combine words. This is obviously the power set of the set of indexes {1,2,3…N-1} and hence there are 2(N-1). It turns out however that it is nice to have a special word meaning “beggining of sentence” and another for “end of sentence”, so we end up starting with N+2 words, and getting 2(N+1). That can be a big number!

So now that we have our n-grams for each sentence we want to look at transition probabilities between n-grams. The reason for this is that various parts of a sentence have unpredictable size. In the absense of a full NL parsing system there is no way to figure out what a syntactic unit (a noun phrase for instance) will be. This process completely obviates the need for an NL parser. This in itself is a huge win since NL parsing is at least difficult and probably impossible to do correctly because of idioms and variations in dialect. With the n-grams in hand we can now look at transition frequencies amoung the various n-grams in each of the different patterns in which they were combined. At this point we enter the information into a database which stores the transition probability between every two n-grams. Let us assume that we ignore sentences larger than 12 words. This means that we have 213 or 8192 words for a large sentence. This gives us 67,000,000 entries in our transition frequency matrix. O.K. So this is looking fairly intractable. If we decided that we will only look at correlations between neighbors and next neighbors however, we are back in the realm of possibility. This limitation has a certain justification beyond making things computationally feasible in that every element of the sentence will be a next nearest neighbor with an element of one of the n-gram sentences therby relating every possible syntactic unit. It should even be possible given this information to “guess” a parse based on our frequencies given a large enough corpus.

Stage 2 revolves around extracting information from a parallel corpus. We will simply perform a nearly identical procedure between two parallel corpuses.

When stage 1 and stage 2 are completed, we can use the probabilities of co-occurance from the parallel corpus in conjunction with the intra-language transition frequencies to generate “most probable” sentences.

We’ll see how it goes.

Politics& Public Policy16 Jul 2006 05:12 am

If there is one aspect of modern society that I positively can’t stand it has to be cars.

If we could stand back and look at transportation objectively, cars would just look completely ridiculous. Let us contemplate cars as transportation for a moment.

Large steel objects are hurtled towards each other at a sum relative velocity of something like 160km/h whilst carrying only one person each. They use enormous amounts of energy. They use up huge tracts of land for their storage and for the infrastructure needed to carry them. Because they have no systematic routing, they end up with lower *average* velocities during rush-hour than bicycling. They spew enormous amounts of toxic fumes. A person is killed because of one every 240 minutes in our tiny island. Cars drive demand for energy much above what it should be resulting in increased desirability of oil which serves to create pathological political instability in the middle east. They lower quality of life as they are expensive to own and care for.

In sum total, they are not only wasteful, inefficient, toxic and deadly. They are also completely unecessary. We can affect travel in much better ways. The DART is a fabulous transportation system. If we could cover the island with a system like this, it would take up far less space, use less energy, and it would be faster for the purpose of travel than cars ever could.

I often hear the various parties (including the Greens!) saying that the trafic problem should be solved that with more and bigger roads to solve the traffic problem. One need only look at LA to see the complete futility and absurdity of trying to solve the problem by throwing roads at it. LA had faster average cross town transit speeds in 1920 when trolleys were common. It would be and environmental and aesthetic disaster if we were to turn Ireland into an LA-like twisted mass of freeways.

Where I grew up, in Anchorage Alaska, there really was nothing for it but to own a car. The city center was 12 km from my home, and winter temperatures could reach as low as -40C. There was no public transportation that could be used reliably. The most frequent bus routes were served once every 45 minutes. The buses were frequently late, or even worse early, and they sometimes didn’t come at all. The transit center where all bus transfers took place was at the *edge* of town. The poor were at a terrible disadvantage. A car was necessary to hold a job. The poor also spent a disproportionate amount of income on their cars since the weather causes more repairs to be necessary than in a more moderate climate and cheap cars would break down frequently.

Things are certainly never going to be that terrible in Ireland. The public transit system is great relative to the US, even if it is terrible relative to most of Europe. I am however frightened by the lack of support for a comprehensive public transportation plan and the talk of increased roads. If anything, all public roads should be toll roads. Using cars should be discouraged, and the money could be used to provide real alternatives. Poor decisions should not be rewarded with subsidies from the state. Building roads and road maintainance are a subsidy of a poor tranportation choice.

Ubykh& Gaeilge14 Jul 2006 03:43 pm

I was perusing Wikipedia today at lunch since I left the book, “An Taistealaí”, which I’m currently reading, at home. I recently finished my first Irish book “Bagairt Ón Spás”. The latter is *much* easier reading than the former, since it was written for the instruction of teens between the ages of 12 and 15.

In any case I was roaming around the sections on syntax and morphology and I stumbled onto an article concerning Ubykh. This extremely strange Northerwestern Caucasian language is sometimes classified as polysynthetic due to the capacity of verbs to incorporate large parts of the sentence. The following sentence is, to me, quite facinating:

“aχʲazbatʂʾaʁawdətʷaajlafaqʾajtʾmadaχ!”

which translates to:

“If only you had not been able to make him take it all out from under me again for them!”

If you can’t read that because of the strange characters that you need to have present in your font, or because you don’t understand how to read IPA then don’t worry too much. You probably couldn’t pronounce it even if you could read it, since Ubykh has one of the largest consonant sets outside of South Africa. In order to compensate however it has an almost ridiculous paucity of vowels, totaling to only two.

I first learned about the Ubykh language while reading a facinating book Nart Sagas from the Caucasus. This book has one of the most incredible collections of folk tales that I have ever read. At the same time strange and familiar, the Narts hold many values that are incredibly modern and egalitarian (for instance the recognition of the sexual rights of women), whilst at the same time holding traditions that are incredibly alien (the ability to avoid blood feud by suckling at the nipple of your adversaries wife). I HIGHLY recommend this book. Really, these stories should be up there with the greek myths and the story of Cúchulainn.

Unfortunately as I was reading, I also learned of something else about the Ubykh language. “The Ubykh language died out on October 7, 1992, when its last fluent speaker (Tevfik Esenç) died in his sleep.” (wikipedia)

A tragedy to be sure. I remember the excitement I felt in reading the Nart Sagas and in the possiblity of reading the sagas in their native language, and I even contemplated the possibity of going to Caucasia and speaking with the Ubykh people in their native tongue. Alas, there are none to speak to.

Gaeilge& Politics& Personal14 Jul 2006 03:00 pm

For those of you who missed out on it, Richard Waghorne wrote an extremely amusing post on the Irish language. Apparently a few people were incensed. I myself thought it was quite funny and I sent it to my wife straight away. She, not being as familiar with Waghorne’s antics, was quite convinced that it was a parody.

I didn’t have time to prepare a response due to other constraints. Soon after the post appeared I found a post which covered nearly every argument that occured to me to put forward over at Talideon.com. Even the post title which had arisen in my mind was stolen by some clever soul working at Údarás na Gaeltachta before I could use it.

I’m quite convinced that Ireland is in the midst of a cultural revolution. The economy is booming in Ireland and the McCulture that can be offered by the global entertainment industry and the appeal of excessive drinking can only hold the attention spans of an affluent population for a limited time. The popularity of the Gaelscoileanna are growing rapidly and visits to Irish classes in the Gaeltacht are booming. TG4 offers reasonable entertainment value which is at least as good as most of the programming on the english language stations. And now, we find that even youtube.com has modern Irish music in its repertoire.

As evidenced by the juggernaught that is the Welsh language, or the meteoric rise of Finnish, It doesn’t take long for people to take back their native tongue. All it takes is the will to see it used.

I look forward to the day when dinosaurs like Richard Waghorne will have to make a choice of whether to move into the modern era or hide under the swiftly fading shadow of Englands hegemony.

Personal27 Jun 2006 02:19 pm

I got my citizenship notice today!!! I’m an Irish citizen! Now I can vote! And I’m an EU citizen! It came two months earlier than I thought it would. Hooray!

Gaeilge& Languages18 Jun 2006 05:59 am

Rather than writing the canonical blog post about how I haven’t written in a long time, I’ll start with the metacanonical blog post about how I’m not going to write about how I haven’t written in a long time.

As some of you may know, I’ve been studying Irish lately. I’m a bit obsessive about it. I really really enjoy it. In any case it has gotten me thinking a lot about how to perform automatic translation by machine.

I have very little faith in the utility of formal grammars for the purpose of computer linguistic analysis both because these have been shown to be extremely poor performers in past practice and the fact that I don’t really believe in semantics in the first place.

One common problem in translation by machine is the notorious “round trip problem”. Anyone who has played with babble-fish has seen the humourous consequences of translating an idiomatic phrase from one language into a target language and then translating it back.

I have some ideas about how to solve the “round trip problem” in translation by machine. The idea is basically this. You take two parallel corpuses (corpi? I don’t know Latin and I always choose the wrong plural, so I’ll just go with the English plural). You “pin” these corpuses at places that are “full stops”. A full stop is a boundary over which the analysis is allowed to ignore correlations. This can improve the computational efficiency of the technique. I haven’t tried it yet, but I believe that paragraph boundaries may be reasonable. The analysis consists of looking at the probability of occurance of words in the target text given words in the source text. If we pin at sentence boundaries the following parallel corpus:

I am hungry
Tá ocras orm

I am thirsty
Tá tart orm

We will find that “hungry” is correlated to “ocras” and “thirsty” is correlated to “tart”. “I am” will be correlated to “Tá… …orm”. It will also very importantly be correlated to both “tart” and “ocras”. The importance of this can be seen from some further sentences in our parallel corpus

I am tired
Tá mé tuirseach

I am contented
Tá mé sásta

It is only in context that we can determine the appropriate use of the prepositional pronoun “orm”. No amount of grammatical analysis will denote the appropriateness of a particular construction when moving from a source language to a destination language. The fact that “the hunger is upon you” is correct is only evident from the use of the language.

In any case I’ve been playing with different probabilistic models to achieve the correlations between words, and to make sure that the positioning is both important and not over-determined in the model. This is especially important for noun phrases and other syntactic units that should be treated as atomic by the structural analysis.

I’d love to hear if you have any thoughts….

Logic30 Mar 2006 12:37 pm

I was thinking about cut elimination on my way back from work today due in large part to a post on the cut rule made by sigfpe on A Neighborhood of Infinity.

It occured to me that the fact that cut-free proofs can be so tremendously much larger than the cut-full ones and that directly constructing cut-full proofs is so difficult is a bit strange. It seems somehow unfair.

As I was thinking about this I realised that one could find cut-free proofs automatically and then reduce them with reverse cut-elimination to produce extremely cutfull proofs. These proofs should be very economical computationally. If one keeps the terms that correspond with the proofs along side, one should be able to obtain a source to source translation that might have performance benefits.

Another totally far-out idea came to me as well. Mathematicians often use lemmas repeatedly. Perhaps the process of finding cuts is generally useful. Specifically, if a lemma is useful in simplifying one proof, maybe it is more likely to be useful in simplifying other proofs. There should be statistical measures over random syntactically valid sentences that one could come up with to see if such lemmas exist.

Politics30 Mar 2006 12:26 pm

The ideas of traditional corporatism and communism are diminisioning in importance. This results from a fundemental change in the economics of modern post industrial society.

The means of production was assumed, in communist theory, to be the instrument that the proletariate needed in order to avoid having their labor exploited by the owners of those means. Factories, farms and industrial occupations continue to exist in the west. The cost of these plants and farms is also much higher than it has every been in the past. Yet they employ fewer people and represent a shrinking segment of the economy.

The financial institution of the corporation was designed as a means of raising money in order pay for the means of production which include capital expenditures and the cost of wages. This leads to a political theory concerned with producing environments that are suitable for corporations to exist and proliferate.

Two questions need to be asked. Firstly, is this aforementioned phenomenon a real change in the economic structure or have the jobs simply been shifted elsewhere leading to the illusion of a post industrial economy in the first world. Secondly, if it is in fact a real change, what implications does it have towards political theory.

I don’t know enough about the first question to say for sure (if you could send me data/papers, that would be nice!). It seems to me that the factories in these outsourced-to countries that produce our phones, fabric and computers as well as an array of other goods must, on the whole, be fairly sophisticated. They tend to produce high quality goods with respect to the goods that were produced in former times (just decades ago). I suspect that in fact the number of laborers that they use is also much lower than it would have been in times past. For these reasons I think it is likely that the post industrial economy is not just a vestige of leveraging cheap labor in foreign economies but an actual global phenomenon which will have increasing importance as countries such as china, india and mexico fully industrialise and subsequently post-industrialise.

If in fact this is true what implications does it have to the idiologies of corporatism and communism?

Let us take for example the occupation of software engineer. In reality the only infrastructure that is needed to produce and disseminate products generated by a software engineer can be bought with a months wages. The costs of the means of production are now dominated by the cost of food and shelter.

Why then are software engineers overwelmingly employeed by corporations? The interests that are served by entering into contract for a corporation reduce to:

a) insurance against non-payment (wages)
b) social lubrication

The main reasons are social and technological. The institutions that would lubricate the formation of by-need organisations of programmers to take advantage of niches or requirements in the area of software design, do not yet exist. The old political doctrines of communism and corporatism are increasingly irrelevant to the software engineer as the software engineer (in the overwelming number of cases) already owns the capital means of production, yet the apparatus to despense with these institutions, while technically feasible, still requires mind share.

Physics& Logic22 Jan 2006 08:37 am

As you might know, I like logics. Notice that logic is in the plural. This might seem a bit strange to you. It certainly seems a bit strange to me. How can it be that there is more than one?

There are quite a number of logics at this point. We pretty much started out with Aristotelian logic, ie. the logic of syllogism which you were probably forced to study in school. Aristotelian logic was formalised and generalised during the early part of last century and this formalism has come to be called Classical Logic or often just CL.

This formalisation caused some to view with suspicion the outcome of various formal arguments. It gave rise to a more conservative ‘constructive’ logic which we will call Intuisionistic (or IL) whose informal interpretation is known as the Brouwer-Heyting-Kalmagorov interpretation or BHK.

Basically in Classical logic you can make proofs about things for which you can not provide examples. This also happens however in Intuisionistic logic for arguments that use the ∀ quantifier. It doesn’t seem so onerous in those cases however as you can see by playing with it a bit.

We can even make things more restrictive and get Minimal Logic (or ML). Minimal logic rejects the provability of arbitrary things from Falsum. The rule is often called ‘ex falso sequitur quodlibet,’ or ‘ex falso.’

Since then there has been a real explosion of the types of logic. There are Substructural Logics, Quantum Logic (QL), Linear Logic (LL, a pretty big fish than can even swallow CL) and a host of others.

From this the question naturally arises in my mind. Which is the right logic? As someone who writes programming languages I have a natural sympathy for IL as it leads naturally to a term calculus meaning that terms can be given back to the user that exemplify proofs. It is a natural logic to look at for the purposes of a database query language. There are however problems with it in regards to this. It is not “resource sensitive”. Things change in data stores and none of the above mentioned logics provide the appropriate tools to deal with this. Linear Logic comes closest but fails to deal with sharing or ordering. Many new resource logics have been invented to deal with this but I have yet to come across something that looks to me like a suitable answer (which doesn’t mean it isn’t already out there!).

In science the problem may be even worse. People use some form of quasi-classical reasoning to make arguments. It seems that this might not even be the appropriate tool to use when reasoning about Quantum Mechanics. Quantum Logic has been proposed as the appropriate way to deal with Quantum quandries in some (fairly fringe) circles. So far Quantum Logic looks to me to be too anemic. Something closer to a theory that has a curry howard isomorphism with quantum computation would be more satisfying.

So what is it that makes a good logic? My personal feeling is this. A logic is a constraint framework from whence you can show various programs that are the “proofs” of the constraint apparatus are acceptable. An appropriate constraint framework is one in which constraints that apply to your system can be expressed simply with minimal work. I believe that the Classical Logic for Propositions arose as a sort of logic of the natural sciences because it was in fact a type of physics. It is a calculus in which we can present common sense notions of real things in a simple way. When we extended the apparatus to classical logic we may have gotten something that strays so far from common sense it is no longer useful (this of course is debatable, and I’m not sure how much I believe it).

Now that we have a quantum world with physics that does not function in ways that our common sense would dictate, it seems perfectly reasonable to reject the notion of classical logic in this regime. In favor of what? I think the jury is still out on this one.

As for as how to quantify what a good logic is, I’ll make a couple of guesses. You want to be able to express constraint systems that apply to your realm with parsimony. You want to be able to verify and extract programs from proofs. If those two conditions are met more often for one logic than another for a particular problem, then I would deem it superior.

Of course this doesn’t even go into notions of logic in ethics…

Logic& Maths20 Jan 2006 07:48 am

Thanks to my brother I got a really cool book on proof theory called “Basic Proof Theory”. It has a bunch of nice features including a from the ground up presentation of proof theory that should be relatively accesible to anyone with a background in mathematics. It demonstrates some of the connections provided by the Curry-Howard correspondance (which is my favourite part of the book) . It also describe Second order logic, which is great because I’ve had very little formal exposure to this. Second order logic is really beautiful since you can define all the connectives in terms of ∀, ∀2 and →. If you pun ∀ and ∀2 you have a really compact notation.

The book also forced me to learn some things I hadn’t wrapped my head around. One of those was Gentzen style sequent calculus. This really turns out to be pretty easy when you have a good book describing it. I’ve even wrote a little sequent solver (in lisp) since I found the proofs so much fun. The first order intuisionistic sequent solver is really not terribly difficult to write. Basically I treat the proofs as goal directed starting with a sequent of the form:

⇒ F

And try to arive at leaves of the tree that all have the form:

A ⇒ A

I have already proven that ‘F ⇒ F’ for compound formulas F from ‘A ⇒ A’ so I didn’t figure it was neccessary to make the solver do it. The solver currently only works with propositional formula (it solves a type theory where types are not parameteric.) but I’m interested in limited extensions though I haven’t thought much about that. I imagine I quickly get something undecidable if I’m not careful.

Anyhow working with the sequent calculus got me thinking about → In the book they present the rule for R→ as such

Γ,A ⇒ Δ,B

Γ ⇒ A→B,Δ

This is a bit weird since there is nothing that goes the other direction. ie. for non of: Minimal, Intuisionistic or Classical logic do you find a rule in which you introduce a connective in the left from formulas in the right. I started looking around for something that does this and I ran into Basic Logic. I haven’t read the paper yet so I can’t really comment on it. I’ll let you know after I’m done with it.

« Previous PageNext Page »