Consciousness and how it got to be that way

Sunday, January 20, 2013

Markov Chains and Language Learning

This is cross-posted to my "geek" blog. As if this one isn't.

Once I started playing with Markov chains I couldn't leave well enough alone, so I went back to character-level chains, this time with several languages. Here the "state" is the last one, two, or three characters, and the chain picks each next character based on that state. As you add more characters to the state, the output text starts "looking more like" the language it came from. To measure this, I checked whether Google Translate's language autodetect could still tell what it was looking at. This led to a prediction about language learning.

Using the Spanish Wiki entry for Apollo 11 with a 1-character state, in 5 trials the translator thought it was seeing Estonian twice, and then Welsh, Irish, and Galician. Using the English Apollo 11 article, it thought it was seeing Welsh three times, then Afrikaans and English. With 2-character states for both languages, the translator guessed correctly 5 out of 5.
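For anyone who wants to try this, here is a minimal sketch of the kind of character-level scrambling described above. It is not the code I actually ran; the function names `build_model` and `generate` and the restart-on-dead-end behavior are assumptions made for illustration. `order` is the number of characters in the state.

```python
import random
from collections import defaultdict

def build_model(text, order):
    """Map each `order`-character state to the characters seen right after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, order, length, seed=None):
    """Produce `length` characters by repeatedly sampling a follower of the current state."""
    rng = random.Random(seed)
    states = list(model)
    out = rng.choice(states)  # start from a random state seen in the source
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:
            # Dead end (state never seen with a follower): restart from a
            # random known state. This is an illustrative choice, not a rule.
            out += rng.choice(states)
            continue
        out += rng.choice(followers)
    return out[:length]
```

Calling `generate(build_model(article_text, 1), 1, 500)` gives the 1-character version, and changing both `1`s to `2` or `3` gives the longer states. The output can then be pasted into whatever language detector you like.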

Then I pulled a dirty trick: I used Old English (from the introduction to Beowulf of course), complete with thorns and diphthongs, as input. With 1-character states, the translator consistently thought it was seeing Welsh, 5 for 5. With 2-character states, it still answered Welsh four times, and German once (remember, with the modern languages it could already reliably detect languages scrambled at this level). With 3-character states the translator said English twice, then Icelandic, German, and Welsh. Surprising that this was the first time it responded with Icelandic, with all those thistles and thorns floating around.

Next I started giving it blocks of unmolested Old English text. Unsurprisingly, it couldn't consistently say that this was an ancestral form of English (how many literate modern English speakers would, if they'd never seen it?). The translator said English three times and Danish twice. What!? We translator-Danes in the days of Markov! Fed full blocks of The Canterbury Tales, it had no problem seeing that Middle English was English.

Finally, using an online string generator as well as a little Excel function I wrote, I fed it totally random strings. The response was consistently Maltese.
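The random strings are easy to reproduce. This is a sketch in Python rather than the Excel function I used, and the alphabet (lowercase letters plus a space) is an assumption; the name `random_text` is made up for the example.

```python
import random
import string

def random_text(n_chars, seed=None, alphabet=string.ascii_lowercase + " "):
    """Return n_chars characters drawn uniformly at random from `alphabet`."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(n_chars))
```

Because every letter is equally likely here, rare letters like x, q, and z show up far more often than they do in real text.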

Why Maltese? Maybe because of the X's, who knows. But I'm not trying to reverse engineer Google's translation engine. I'm more interested in whether its wild guesses on Old English and on the Markov chains scrambled with short states reveal something. One obvious candidate is the relationship between languages; that Danish and Icelandic should appear in the translator's guesses for Old English is interesting but not surprising. But the preponderance of Welsh was also interesting. It's unlikely that the translator is noticing anything about a Celtic deep substrate of English, especially since it couldn't see the more recent Old English substrate of English! More likely there's something about Welsh that makes it a good guess for badly scrambled text, as possibly with Maltese. More sound combinations allowed? Lots of single-letter words? If this is true, then the languages that permit the most sounds and sound combinations will:

a) be the "last resort" guesses for translation engines, and

b) take longer for children to start speaking.

Why b)? When children are learning their first language, imagine the difficulty of identifying individual words. All they're getting is a stream of sound. Now, when you learn a new word, you recognize the other words around it; not so when you're twelve months old. There is evidence that what kids are doing is trying to find word boundaries, and that part of their evidence comes from sound combinations that appear less frequently, as in "worD Boundary" - English doesn't allow "db" to occur at the beginning or end of a word (although some languages do), but we allow d and b to run up against each other between words. In a language with sound combinations that are less constrained, it will be harder for kids to identify the word boundaries, and it will take them longer to start speaking. This prediction has verified parallels in morphosyntax. Cree and Fulani are notorious for being horribly irregular in terms of verbs and plurals, respectively. In most languages, children are grammatically proficient around age 5, but in these languages grammatical maturity is delayed for several years by the irregularities.
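The word-boundary idea can be played with using the same character-pair statistics as the Markov chains. This is a toy sketch working on letters rather than sounds, and it makes no claim about what children actually compute; the functions `transition_probs` and `segment` and the fixed `threshold` are assumptions for illustration. It guesses a boundary wherever the next character is an unlikely follower of the current one.

```python
from collections import Counter

def transition_probs(stream):
    """Estimate P(next char | current char) from an unsegmented stream."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def segment(stream, threshold):
    """Split the stream wherever the transition probability falls below threshold."""
    if not stream:
        return []
    probs = transition_probs(stream)
    pieces = [stream[0]]
    for a, b in zip(stream, stream[1:]):
        if probs[(a, b)] < threshold:
            pieces.append(b)   # rare transition: guess a word boundary here
        else:
            pieces[-1] += b    # common transition: stay inside the current word
    return pieces
```

To try it, strip the spaces out of a long text, call `segment(stream, threshold)` with a few different thresholds, and compare the guessed pieces against the real words. If the prediction above is right, a language with fewer constraints on its combinations should have fewer tell-tale rare transitions, and a guesser like this should do worse on it.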

So my prediction is: if the orthography of Welsh and Maltese corresponds to their phonology, and that phonology has a less constraining set of rules about what sounds can occur together, then children's vocabularies will grow more slowly in those languages relative to most others. The most extreme case would be Khoi-San, i.e. the famous "Bushman" click languages, which have the richest sound inventories of any languages on Earth (not just the clicks). I'm not familiar with the phonology, just that their inventory is huge, and I'm presuming in my prediction that those sounds aren't severely constrained in terms of how they can appear in combination with each other.