A blog for fans of Bananagrams, word games, puzzles, and amazing things

Tuesday, September 20, 2011

Why words are the lengths they are

Some words are long and others are short. What determines how long a particular word should be? If you look at some long words (like "serendipity", "pandemonium", and "hypothesis") and some short words ("my", "in", and "of"), you might come to the conclusion that short words are short because they are used frequently while long words can afford to be long because they come up rarely. This idea was first proposed by a Harvard linguist named George Zipf in 1936.

Researchers at the MIT Department of Brain and Cognitive Sciences took a fresh look at this question and came up with a new theory. They present their results in a paper titled (spoilers!) "Word lengths are optimized for efficient communication".

How much information is conveyed by a word? Consider the sentence that starts "After I got home, I walked the...". If I finish the sentence as "I walked the dog", the extra word "dog" doesn't convey much information because it's probably one of the words your brain was expecting. More surprising would have been "I walked the cat", "I walked the bulldozer", "I walked the quasar", or "I walked the plank". It is this amount of surprise that the researchers equate with the information contained in a word. Consequently, the information content of a word depends on the context that it appears in. [For those who want a more quantitative explanation, the information contribution from a particular context (like "I walked the...") is -log(p), where p is the probability that the word appears at the end of that phrase and log() is the natural logarithm function. To get the information content of a word like "dog", you average -log(p) over every occurrence of "dog", so the contexts in which "dog" shows up more often count for more.]
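The bracketed calculation above can be tried out on a toy scale. The following is only a rough sketch, not the researchers' actual code: it assumes the "context" is just the single preceding word, and the function name `average_information` and the four-sentence corpus are made up for illustration. It uses the natural logarithm, so the values are in nats.

```python
import math
from collections import Counter, defaultdict

def average_information(sentences):
    """Return {word: mean of -log p(word | previous word)} for a toy corpus.

    Assumption: the context is only the one preceding word.
    """
    context_counts = Counter()  # how often each one-word context occurs
    pair_counts = Counter()     # how often each (context, word) pair occurs
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1

    totals = defaultdict(float)  # summed surprise per word
    occurrences = Counter()      # occurrences of each word that have a context
    for (prev, word), n in pair_counts.items():
        p = n / context_counts[prev]       # p(word | prev)
        totals[word] += n * -math.log(p)   # each occurrence adds -log(p)
        occurrences[word] += n
    return {w: totals[w] / occurrences[w] for w in totals}

# Made-up corpus: "dog" is the expected ending, "plank" the surprising one.
corpus = [
    "i walked the dog",
    "i walked the dog",
    "i walked the dog",
    "i walked the plank",
]
info = average_information(corpus)
```

In this tiny corpus "plank" ends up with more information than "dog" because it follows "the" only one time in four, while "walked" carries none because it always follows "i".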

Ideally, the researchers would have liked to examine the relationship between how long it takes to say words and how much information they convey, but it was easier (and, they argue, an adequate approximation) to use the number of letters in a word in place of its utterance duration. They later reran the same tests for a few languages using the number of syllables instead of the number of letters, and the results were the same.

To calculate the relationship between word length and frequency, the researchers used the same N-gram data set that Google used in its N-gram viewer. This figure from the paper summarizes their findings:
The plot on the left shows word length versus word use frequency, with frequency decreasing from left to right. (Here the data has been divided into large groups of words ("bins"), and the average length and frequency of each bin have been plotted.) For the first few points (high-frequency words like "the"), the line is steep, but then it quickly flattens out, indicating that for low-frequency words, the frequency of the word doesn't change the length very much.

The plot on the right shows average word length versus the information content of the word. Here, the line starts off jagged but then becomes steep and very straight. This tells us that how much information a word carries is indeed a good predictor of how long the word will be.
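The binning behind both plots can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the paper's procedure: the helper name `binned_mean_lengths` is hypothetical, the bins are simply equal-sized groups of words sorted by score, and the scores in `toy_scores` are invented numbers standing in for either frequency rank or information content.

```python
def binned_mean_lengths(scores, n_bins):
    """Sort words by score, split them into n_bins nearly equal groups,
    and return a (mean score, mean word length) pair for each group."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    bins = []
    for i in range(n_bins):
        start = i * len(ranked) // n_bins
        stop = (i + 1) * len(ranked) // n_bins
        group = ranked[start:stop]
        if not group:  # can happen when there are more bins than words
            continue
        mean_score = sum(score for _, score in group) / len(group)
        mean_length = sum(len(word) for word, _ in group) / len(group)
        bins.append((mean_score, mean_length))
    return bins

# Invented scores, for illustration only (not data from the paper).
toy_scores = {"of": 1.0, "in": 1.5, "my": 2.0, "dog": 4.0, "plank": 7.0, "quasar": 9.0}
bins = binned_mean_lengths(toy_scores, 3)
# bins == [(1.25, 2.0), (3.0, 2.5), (8.0, 5.5)]
```

Plotting the second number of each pair against the first gives one point per bin, which is the kind of line the two plots show.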

The researchers also cite other work that has shown that, when speaking, people will speak more information-dense syllables more slowly than less information-dense syllables. (If you've ever listened to the synthesized voice of something like a GPS, you'll be familiar with the jerkiness of the pronunciation that sounds like it is speaking some syllables too slowly and others too quickly.)

It would seem that a corollary to this principle is that as a word becomes more common (or, more precisely, loses information density), it experiences a linguistic pressure pushing it toward a shorter form. This shortening process is called phonetic erosion. Examples of the resulting shortenings (often called clippings) are "refrigerator" becoming "fridge", "going to" becoming "gonna", and "cabriolet" being completely replaced by "cab". Here are a few other terms that have evolved much shorter forms:
  • advertisement → ad
  • caravan → van
  • examination → exam
  • gasoline → gas
  • gymnasium → gym
  • influenza → flu
  • public house → pub
So, essentially, the researchers found that the old idea that word length is based mainly on frequency of word usage (short words are used often while long words are used rarely) does a poor job of explaining why words are the lengths they are. The amount of information in a word (averaged over the various contexts that it is used in) is a far better predictor for how long the word will be. The only exception to this is the 5% to 20% of words that are the least informative (generally short, high-frequency words like "the" and "and").

This result holds not just for English but also for the ten other languages that they examined (Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish, and Swedish).

The basic idea that I take away from this work is that there is some maximum rate at which our brains can understand incoming speech, and that we shape what we say to distribute information evenly over time. It makes me wonder whether pausing for effect takes advantage of this fact. Similarly, when I say a word slowly to emphasize it, maybe I am just slowing down to suggest that it contains a lot of information.

Epilogue: In case you were wondering, the actual ending to the sentence that started "After I got home, I walked the..." was "...tightrope".