Sunday, January 29, 2012

Counting words

Zipf's law is a well-known word frequency distribution. Let's assume you are learning a foreign language and your teacher gives you books to read. You have to take exams that test whether you have acquired the vocabulary of the books. You have other commitments, and you prefer reading blogs and books on computational linguistics, so you'd like to determine the most frequent words of the texts and learn them by rote memorization right before the exam. You know that the higher the frequency of a word, the higher the probability that it will be on the test. At first it seems obvious that all we have to do is count how many times each word occurs in a text, but it will get a bit more complicated than that.
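Roughly speaking, Zipf's law says that the word of rank n occurs about 1/n as often as the most frequent word. As a quick illustration, here is a tiny sketch of an idealized Zipf distribution (the top count of 1000 is just a made-up number):

;; Idealized Zipf distribution: the count at rank n is proportional to 1/n.
;; The top count of 1000 is an arbitrary number, for illustration only.
(defn zipf-counts [n-words]
  (map #(/ 1000.0 %) (range 1 (inc n-words))))

;; (zipf-counts 5) => (1000.0 500.0 333.33... 250.0 200.0)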
We need a text file; I'm using Austen's Persuasion from the NLTK corpora.
Warning: slurp reads the whole file into memory! Counting the words is pretty straightforward.
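Here is a minimal sketch of this step, assuming the plain-text file from the NLTK corpora is saved locally as austen-persuasion.txt (the file name is my assumption):

(ns wordcount.core
  (:require [clojure.string :as str]))

;; slurp reads the whole file into memory, so keep this to texts of
;; reasonable size. The file name is an assumption.
(def austen (slurp "austen-persuasion.txt"))

;; Split on whitespace and count how many times each token occurs.
(defn word-freqs [text]
  (frequencies (str/split text #"\s+")))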
Plot the text with (graph-words austen) (or your own text) and you will see something like this.
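graph-words is this post's own little helper; a hypothetical version using Incanter's xy-plot (Incanter is assumed to be on the classpath, and word-freqs is the function defined above) could look like this:

(require '[incanter.core :as incanter]
         '[incanter.charts :as charts])

;; Plot raw frequency against rank, highest frequency first.
(defn graph-words [text]
  (let [counts (sort > (vals (word-freqs text)))   ; frequencies, descending
        ranks  (range 1 (inc (count counts)))]     ; rank 1, 2, 3, ...
    (incanter/view
     (charts/xy-plot ranks counts
                     :x-label "rank"
                     :y-label "frequency"))))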

Not an informative picture! Let's analyse our text before we modify our program. The raw text file contains a lot of "noise": it is full of punctuation marks, our program is case-sensitive, and so on. Another problem lies in the nature of language.
Function words like determiners and prepositions are high-frequency words. We are interested in the so-called content words like nouns and verbs.

Part-of-speech tagging is resource-intensive, so instead of removing function words identified by their POS tags, we are going to use a stopword list and a list of punctuation marks. I used the NLTK English stopword list and made my own list of punctuation marks.
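As a sketch, the stopwords could be loaded from a one-word-per-line file (the file name is my assumption) and the punctuation list written out by hand:

;; One stopword per line; the file name is an assumption, the content is
;; the NLTK English stopword list saved to disk.
(def stopwords
  (set (str/split-lines (slurp "english-stopwords.txt"))))

;; A hand-made set of punctuation tokens.
(def punctuation
  #{"." "," ";" ":" "!" "?" "\"" "'" "(" ")" "-" "--"})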
The stop lists are stored in sets because we can filter with the complement of a set (in Clojure, filter keeps the matching elements, it doesn't remove them). It is common practice to remove hapax legomena (words that occur only once) from the distribution and to use logarithmic scales on the axes of the chart.
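Putting it together, one possible version of the filtering and of the log-log chart, building on the definitions above (Incanter again assumed for plotting):

;; Lower-case the text, drop punctuation and stopword tokens via the
;; complement of the sets, count the rest, and drop hapax legomena.
;; (The naive whitespace split leaves punctuation glued to some words;
;; good enough for a sketch.)
(defn content-freqs [text]
  (->> (str/split (str/lower-case text) #"\s+")
       (filter (complement punctuation))
       (filter (complement stopwords))
       frequencies
       (remove #(= 1 (val %)))))      ; remove words that occur only once

;; Same rank/frequency chart as before, but on log-log axes.
(defn graph-content-words [text]
  (let [counts (sort > (map val (content-freqs text)))
        ranks  (range 1 (inc (count counts)))]
    (incanter/view
     (charts/xy-plot (map #(Math/log %) ranks)
                     (map #(Math/log %) counts)
                     :x-label "log rank"
                     :y-label "log frequency"))))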
Now we've got a nicer chart.

The chart shows you that you can get a decent score if you concentrate on the most frequent words.
