13 November 2012 @ 12:50 pm
Up Goer Five  
Monday's xkcd (http://www.xkcd.com/1133/) is an awesome and amusing set of simplified blueprints for the Saturn V rocket, with descriptions for the major parts written using "only the ten hundred words people use the most often."

This led to a lot of discussion in various places along the lines of "I'm surprised X was in the top thousand words" and "I'm surprised Y wasn't in the top thousand words."

A little investigation led to the (obvious in retrospect) discovery that there is no single list of "thousand most used words." The problem being, most used in _what_? There are a lot of different lists with different criteria for how they were collated. A lot of the lists clearly were not the one used because they didn't contain all the words used in the comic. Some of them also contained a number of words that were more specific than the ones used in the comic.

However, after looking around for a while, I discovered one that seems to at least be a close analog to the one that must have been used for the comic, Wiktionary's "Frequency lists/Contemporary fiction" page: http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Contemporary_fiction

I was curious enough to attempt to transcribe the comic and "normalize" the words, merging conjugations, pluralizations, and what have you. ("What have you" mainly being whatever you call what was done to "goer" =)

Assuming I did everything correctly, every single word in the comic (with one possible exception) appears in the top 1000 of that list, including "car" at rank 231, "escape" at 841, "computer" at 940, and "third" squeaking in at 987. It's also notable that "ship", "moon", and "thousand" fall in the 1000-2000 range; those words would have been rather useful in place of the circumlocutions he ended up with.

The one exception is "US". I'm not sure if that was given a pass on the theory that it's a proper name (which seems fair) or if he "cheated" by counting it as a heteronym of "us". "United" was not in the top 1000, or even the top 2000, of this list.
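For anyone who would rather script this kind of check than eyeball it, here is a minimal sketch in Python. Everything in it is made up for illustration: the function name `rank_words`, the tiny toy word list, and the hand-written lemma table standing in for the "normalizing" step. It is a guess at how one might automate the lookup, not a description of how the spreadsheet was built.

```python
def rank_words(words, frequency_list, lemmas=None, cutoff=1000):
    """Look up each word's rank in a frequency list.

    words          -- words transcribed from the text being checked
    frequency_list -- words ordered from most to least common
    lemmas         -- hypothetical hand-made map of inflected form -> base form
    cutoff         -- ranks above this count as "not in the top N"

    Returns (ranks, misses): a dict of base word -> rank (None if absent),
    and a list of base words that are absent or ranked past the cutoff.
    """
    lemmas = lemmas or {}

    # Rank is the 1-based position in the list; keep the first occurrence.
    rank_of = {}
    for position, entry in enumerate(frequency_list, start=1):
        rank_of.setdefault(entry.lower(), position)

    ranks, misses = {}, []
    for word in words:
        lowered = word.lower()
        base = lemmas.get(lowered, lowered)  # fall back to the word itself
        rank = rank_of.get(base)
        ranks[base] = rank
        if rank is None or rank > cutoff:
            misses.append(base)
    return ranks, misses


# Toy data only; the real frequency list has thousands of entries.
toy_list = ["the", "go", "up", "five", "us"]
toy_lemmas = {"goer": "go"}

ranks, misses = rank_words(
    ["Up", "Goer", "Five", "rocket"], toy_list, toy_lemmas, cutoff=4
)
print(ranks)
print(misses)
```

Note that lowercasing everything folds "US" into "us", which is exactly the ambiguity described above; a stricter version would have to treat capitalized words separately.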

For those curious about the results, or who want to double-check my work, the spreadsheet is here: https://docs.google.com/spreadsheet/ccc?key=0AlTScW2oEvY4dG5uOWFpUTkteGV1Yjl5RmxNcHhtS1E
