Computer automatically deciphers ancient language

An incidental challenge in developing a computer system that could decipher Ugaritic (inscribed on tablet) was developing a way to digitally render Ugaritic symbols (inset).

In his 2002 book Lost Languages, Andrew Robinson, then the literary editor of the London Times' higher-education supplement, declared that "successful archaeological decipherment has turned out to require a synthesis of logic and intuition ... that computers do not (and presumably cannot) possess."

Regina Barzilay, an associate professor in MIT's Computer Science and Artificial Intelligence Lab; Ben Snyder, a grad student in her lab; and the University of Southern California's Kevin Knight took that claim personally. At the Annual Meeting of the Association for Computational Linguistics in Sweden next month, they will present a paper on a new computer system that, in a matter of hours, deciphered much of the ancient Semitic language Ugaritic. In addition to helping archeologists decipher the eight or so ancient languages that have so far resisted their efforts, the work could also help expand the number of languages that automated translation systems like Google Translate can handle.

To duplicate the "intuition" that Robinson believed would elude computers, the researchers' software makes several assumptions. The first is that the language being deciphered is closely related to some other language: In the case of Ugaritic, the researchers chose Hebrew. The next is that there's a systematic way to map the alphabet of one language onto the alphabet of the other, and that correlated symbols will occur with similar frequencies in the two languages.
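The frequency assumption can be illustrated with a toy sketch. This is not the researchers' software: the function names (`rank_symbols`, `frequency_mapping`) and the three-letter sample texts are invented for illustration, and the code simply pairs symbols by frequency rank, which is only the crudest form of the idea.

```python
from collections import Counter

def rank_symbols(text):
    """Return the symbols of `text` ordered from most to least frequent."""
    counts = Counter(ch for ch in text if not ch.isspace())
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [symbol for symbol, _ in ordered]

def frequency_mapping(unknown_text, known_text):
    """Pair each unknown symbol with the known symbol of the same frequency rank."""
    return dict(zip(rank_symbols(unknown_text), rank_symbols(known_text)))

# Toy data (hypothetical): the "unknown" text is the known text
# written with a secret one-to-one substitution of symbols.
known = "abba cab abba baa"
unknown = known.translate(str.maketrans("abc", "xyz"))

mapping = frequency_mapping(unknown, known)
print(mapping)  # {'x': 'a', 'y': 'b', 'z': 'c'}
```

On real texts, two related languages never share exactly the same frequencies, so a rank-based guess like this would serve only as a starting hypothesis to be tested against other evidence.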

The system makes a similar assumption at the level of the word: The languages should have at least some cognates, or words with shared roots, like main and mano in French and Spanish, or homme and hombre. And finally, the system assumes a similar mapping for parts of words. A word like "overloading," for instance, has both a prefix ("over") and a suffix ("ing"). The system would anticipate that other words in the language will feature the prefix "over" or the suffix "ing" or both, and that a cognate of "overloading" in another language (say, "surchargeant" in French) would have a similar three-part structure.
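The three-part structure can be made concrete with a short sketch. It assumes the prefixes and suffixes are already known, which is exactly what a decipherment system would have to infer; the function name `split_affixes` and the affix lists are hypothetical.

```python
def split_affixes(word, prefixes, suffixes):
    """Split `word` into (prefix, stem, suffix) using the longest matching affixes.

    An affix is only stripped if at least one character of stem remains.
    """
    prefix = max(
        (p for p in prefixes if word.startswith(p) and len(p) < len(word)),
        key=len,
        default="",
    )
    rest = word[len(prefix):]
    suffix = max(
        (s for s in suffixes if rest.endswith(s) and len(s) < len(rest)),
        key=len,
        default="",
    )
    stem = rest[: len(rest) - len(suffix)]
    return prefix, stem, suffix

# Assumed affix inventories for the two example words from the text.
english = split_affixes("overloading", {"over"}, {"ing"})
french = split_affixes("surchargeant", {"sur"}, {"ant"})
print(english)  # ('over', 'load', 'ing')
print(french)   # ('sur', 'charge', 'ant')
```

Both words split into a prefix, a stem and a suffix, which is the kind of parallel structure the system expects of cognates.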

Crosstalk

The system plays these different levels of correspondence off of each other. It might begin, for instance, with a few competing hypotheses for alphabetical mappings, based entirely on symbol frequency, mapping symbols that occur frequently in one language onto those that occur frequently in the other. Using a type of probabilistic modeling common in artificial-intelligence research, it would then determine which of those mappings seems to have identified a set of consistent suffixes and prefixes. On that basis, it could look for correspondences at the level of the word, and those, in turn, could help it refine its alphabetical mapping. "We iterate through the data hundreds of times, thousands of times," says Snyder, "and each time, our guesses have higher probability, because we're actually coming closer to a solution where we get more consistency." Finally, the system arrives at a point where altering its mappings no longer improves consistency.
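The loop of guessing, scoring and stopping when nothing improves can be sketched in a few lines. This is a deliberate simplification: it uses greedy hill climbing over letter swaps rather than the probabilistic model the researchers describe, and the names (`score`, `refine`), the three-word lexicon and the starting guess are all invented for illustration.

```python
from itertools import combinations

def score(mapping, unknown_words, known_lexicon):
    """Count unknown words that decode to a word in the known lexicon."""
    return sum(
        "".join(mapping[ch] for ch in word) in known_lexicon
        for word in unknown_words
    )

def refine(mapping, unknown_words, known_lexicon):
    """Swap pairs of letter assignments while the score improves.

    Stops at the first point where no single swap helps (a local optimum).
    """
    mapping = dict(mapping)
    best = score(mapping, unknown_words, known_lexicon)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(mapping), 2):
            mapping[a], mapping[b] = mapping[b], mapping[a]
            trial = score(mapping, unknown_words, known_lexicon)
            if trial > best:
                best, improved = trial, True
            else:
                # The swap did not help, so undo it.
                mapping[a], mapping[b] = mapping[b], mapping[a]
    return mapping, best

# Toy data (hypothetical): the true mapping is x->a, y->b, z->c.
unknown_words = ["zxy", "xyyx", "yxx"]
known_lexicon = {"cab", "abba", "baa"}
initial_guess = {"x": "b", "y": "a", "z": "c"}  # deliberately wrong

final_mapping, final_score = refine(initial_guess, unknown_words, known_lexicon)
print(final_mapping, final_score)  # {'x': 'a', 'y': 'b', 'z': 'c'} 3
```

A greedy search like this can get stuck on a poor mapping, which is one reason to keep several competing hypotheses alive and to weigh them probabilistically instead.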

Ugaritic has already been deciphered: Otherwise, the researchers would have had no way to gauge their system's performance. The Ugaritic alphabet has 30 letters, and the system correctly mapped 29 of them to their Hebrew counterparts. Roughly one-third of the words in Ugaritic have Hebrew cognates, and of those, the system correctly identified 60 percent. "Of those that are incorrect, often they're incorrect only by a single letter, so they're often very good guesses," Snyder says.
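Taken at face value, those figures imply a letter accuracy of about 97 percent and correct Hebrew matches for roughly a fifth of all Ugaritic words. The short calculation below works this out; it treats "roughly one-third" as exactly one-third, which is an assumption, and the variable names are illustrative.

```python
from fractions import Fraction

letters_correct = Fraction(29, 30)   # 29 of 30 letters mapped correctly
cognate_share = Fraction(1, 3)       # "roughly one-third", taken as exact here
cognates_found = Fraction(60, 100)   # 60 percent of cognates identified

# Share of all Ugaritic words for which the correct Hebrew cognate was found.
overall = cognate_share * cognates_found

print(f"letters mapped correctly: {float(letters_correct):.1%}")  # 96.7%
print(f"words correctly matched:  {float(overall):.0%}")          # 20%
```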

Furthermore, he points out, the system doesn't currently use any contextual information to resolve ambiguities. For instance, the Ugaritic words for "house" and "daughter" are spelled the same way, but their Hebrew counterparts are not. While the system might occasionally get them mixed up, a human decipherer could easily tell from context which was intended.

Babel

Nonetheless, Andrew Robinson remains skeptical. "If the authors believe that their approach will eventually lead to the computerised 'automatic' decipherment of currently undeciphered scripts," he writes in an e-mail, "then I am afraid I am not at all persuaded by their paper." The researchers' approach, he says, presupposes that the language to be deciphered has an alphabet that can be mapped onto the alphabet of a known language, "which is almost certainly not the case with any of the important remaining undeciphered scripts," Robinson writes. It also assumes, he argues, that it's clear where one character or word ends and another begins, which is not the case with many deciphered and undeciphered scripts.

"Each language has its own challenges," Barzilay agrees. "Most likely, a successful decipherment would require one to adjust the method for the peculiarities of a language." But, she points out, the decipherment of Ugaritic took years and relied on some happy coincidences, such as the discovery of an axe that had the word "axe" written on it in Ugaritic. "The output of our system would have made the process orders of magnitude shorter," she says.

Indeed, Snyder and Barzilay don't suppose that a system like the one they designed with Knight would ever replace human decipherers. "But it is a powerful tool that can aid the human decipherment process," Barzilay says. Moreover, a variation of it could also help expand the versatility of translation software. Many online translators rely on the analysis of parallel texts to determine word correspondences: They might, for instance, go through the collected works of Voltaire, Balzac, Proust and a host of other writers, in both English and French, looking for consistent mappings between words. "That's the way statistical translation systems have worked for the last 25 years," Knight says.
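A toy version of mining parallel texts for word correspondences might look like the sketch below. It scores word pairs with the Dice coefficient over sentence-level co-occurrence, a simple association measure chosen here for brevity; it is not how any particular translation service works, and the function name `best_translations` and the four sentence pairs are invented.

```python
from collections import Counter

def best_translations(sentence_pairs):
    """For each source word, pick the target word with the highest Dice score.

    Dice(s, t) = 2 * (sentence pairs containing both) / (count(s) + count(t)).
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update((s, t) for s in src_words for t in tgt_words)

    best = {}
    for (s, t), n in joint.items():
        dice = 2 * n / (src_count[s] + tgt_count[t])
        if s not in best or dice > best[s][1]:
            best[s] = (t, dice)
    return {s: t for s, (t, _) in best.items()}

# Hypothetical English-French sentence pairs.
pairs = [
    ("the house", "la maison"),
    ("the cat", "le chat"),
    ("a house", "une maison"),
    ("a cat", "un chat"),
]

lexicon = best_translations(pairs)
print(lexicon["house"], lexicon["cat"])  # maison chat
```

With only four sentences, the articles remain ambiguous ("the" co-occurs equally with "la" and "le"), which hints at why such methods depend on large, exhaustively translated collections.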

But not all languages have such exhaustively translated literatures: At present, Snyder points out, Google Translate works for only 57 languages. The techniques used in the decipherment system could be adapted to help build lexicons for thousands of other languages. "The technology is very similar," says Knight, who works on machine translation. "They feed off each other."

Provided by Massachusetts Institute of Technology