Wolfram Alpha came out today. For anyone who hasn’t heard of it, Wolfram Alpha is a “computational knowledge engine” created by Stephen Wolfram which draws information from various Internet sources. Anyway, I thought I’d play around with what it does with language data.
Searching languages is fairly straightforward. Search English or Spanish, for example, and you get a form giving basic info: number of speakers, character frequency, lexical similarity (which languages share the most cognates with it), genetic classification, and major regions where it’s spoken. If I search “Chinese”, it defaults to Mandarin Chinese but also gives me the option to go to Chinese languages, which links to several Chinese/Sinitic languages as well as to Pidgin English Chinese. That last one leads to a comparison of English, Chinese, and Hawaii Creole English, which I had thought referred to a creole of English and Hawai’ian; the Wikipedia page doesn’t mention Mandarin, though it says Cantonese was a major source language, along with Portuguese, Japanese, and several others. That page also gives you the option to go to Tok Pisin, which likewise doesn’t seem to have a strong Mandarin influence according to Wikipedia.
A couple of other things I found: each language entry lists the numbers 1–10 (taken from Zompist’s list), but no mention is made of what numerical base the language uses. While pretty much all major languages use a decimal base, there are a number of languages, particularly in Mesoamerica, that use a vigesimal base (base-20), and other bases are in use as well. (I should note that searching “vigesimal” leads to a dictionary definition, and a search for “base 20” seems to take you to something about DNA base pairs.)
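For the curious, here’s a minimal sketch (plain Python, nothing Wolfram-specific) of what the difference between bases amounts to: the same number is just grouped into powers of 10 or powers of 20.

```python
def to_base(n, base):
    """Convert a non-negative integer to a list of digits in the
    given base, most significant digit first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        digits.append(n % base)
        n //= base
    return digits[::-1]

# 399 in decimal is "19, 19" in vigesimal: 19 * 20 + 19.
print(to_base(399, 10))  # [3, 9, 9]
print(to_base(399, 20))  # [19, 19]
```

In a vigesimal language, number words are built around scores (twenties) rather than tens, which is exactly the kind of fact a language entry could usefully include alongside the 1–10 list.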
Another issue is with writing systems. Several language entries correctly identify the writing system (Latin alphabet for English and Spanish); others are not so helpful. The entry for Japanese lists the “Chinese script”, apparently ignoring the kana. Searching for the Latin alphabet and various other scripts turns up nothing useful, and the system often seems to associate a script with only one particular language. I found that Devanagari has some useful information listed, but Cyrillic only has a dictionary definition. “Chinese characters”, “hanzi”, “kanji”, and “hanja” all turned up nothing, though you can find something with “Chinese script”; even that is really only a Mandarin-specific block of Unicode code points.
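To see concretely why listing “Chinese script” for Japanese is incomplete, here’s a small sketch using Python’s standard unicodedata module: a short Japanese phrase mixes CJK ideographs (kanji) with hiragana, and Unicode names the two groups quite differently.

```python
import unicodedata

# "日本語です" ("it's Japanese") mixes kanji and hiragana.
for ch in "日本語です":
    print(ch, unicodedata.name(ch))
# The first three characters report CJK UNIFIED IDEOGRAPH names,
# while で and す report HIRAGANA LETTER names.
```

A writing-system entry for Japanese would ideally mention the kana blocks as well as the shared CJK ideograph block.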
Basically, Wolfram Alpha looks like a very nice little tool for research, but it has some kinks. This is really just a small segment of what the system can do, and since it is built on a mathematics engine I’m sure it does a lot better with things like equations and statistics. For example, each of my language searches turned up good info on number of speakers, character frequency, and even an estimate of translation length (based on character count). The semantic search could definitely use some improvement, and I’m sure it will get it in the future. Ultimately I think this system will find a niche, or maybe several. But it’s definitely not going to kill Google or Wikipedia. In fact, I can really only see it being a complement to all the info we already have on the web.