Sunday, January 29, 2012

Counting words

Zipf's law is a well-known word frequency distribution. Let's assume you are learning a foreign language and your teacher gives you books to read. You have to take exams that test whether you have acquired the vocabulary of the books. You have other commitments, and you prefer reading blogs and books on computational linguistics, so you'd like to determine the most frequent words of the texts and learn them by rote memorization right before the exam. You know that the higher the frequency of a word, the higher the probability it will be on the test. At first it seems obvious that we just have to count how many times each word occurs in the text, but things will get a bit more complicated.
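In its simplest form the law says that a word's frequency is inversely proportional to its rank in the frequency table,

f(rank) ∝ 1 / rank

so the second most frequent word occurs roughly half as often as the most frequent one, the third roughly a third as often, and so on.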
(ns hello-nlp.core
  (:use [clojure.string :only [split-lines lower-case]]
        [opennlp.nlp]
        [opennlp.tools.filters]
        [incanter core charts]))
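To run these snippets you need the clojure-opennlp and Incanter libraries on the classpath. A minimal Leiningen project.clj sketch (the version numbers below are only illustrative, use whatever is current):

(defproject hello-nlp "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.3.0"]  ; versions are examples
                 [clojure-opennlp "0.1.10"]
                 [incanter "1.3.0"]])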
We need a text file; I'm using Austen's Persuasion from the NLTK corpora.
(def austen
  (slurp "/path/to/your/corpora/austen-persuasion.txt"))

Warning: slurp reads the whole file into memory! Counting the words is pretty straightforward.
;; Increment the count of a single key in the map.
(defn plus-map [m key]
  (if (nil? (m key))
    (assoc m key 1)
    (assoc m key (+ (m key) 1))))

;; Fold a list of keys into the map, counting occurrences.
(defn plus-list-map [mymap keylist]
  (if (empty? keylist)
    mymap
    (recur (plus-map mymap (first keylist)) (rest keylist))))

;; Turn the count map into a list of [word count] pairs,
;; most frequent first.
(defn sortmap [mymap]
  (let [mykeys (keys mymap)
        keyorder (sort-by #(mymap %1) > mykeys)
        keymap (map (fn [key]
                      [key (mymap key)]) keyorder)]
    keymap))

(defn count-words [text]
  (plus-list-map {} (tokenize text)))

(defn graph-words [text]
  (let [raw (sortmap (count-words text))
        words (map first raw)
        numbers (map second raw)]
    (view (bar-chart words numbers
                     :x-label "Words"
                     :y-label "Frequency"
                     :title "Zipf"))))
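As an aside, plus-map and plus-list-map reimplement something that ships with the language: Clojure's built-in frequencies (available since 1.2) produces the same word-count map in one call.

;; Equivalent to count-words, using the core library:
(defn count-words' [text]
  (frequencies (tokenize text)))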
Plot the text with (graph-words austen) (or use your own text) and you will see something like this.

Not an informative picture! Let's analyse our text before we modify our program. The raw text file contains a lot of "noise": for example, it is full of punctuation marks, and our program is case sensitive.
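To see the case-sensitivity issue concretely (this call assumes the tokenize function defined in the next snippet), count-words produces something like:

(count-words "The cat saw the cat.")
;; => {"The" 1, "cat" 2, "saw" 1, "the" 1, "." 1}

"The" and "the" get separate entries, and the final period is counted as a word of its own. Another problem lies in the nature of language itself.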
(def get-sentences
  (make-sentence-detector "models/en-sent.bin"))

(def tokenize
  (make-tokenizer "models/en-token.bin"))

(def pos-tag
  (make-pos-tagger "models/en-pos-maxent.bin"))

(defn tag-sent [sent]
  (pos-tag (tokenize sent)))

;; Every sentence of the novel, tokenized and pos-tagged.
(def pos-austen
  (map pos-tag (map tokenize (get-sentences austen))))

;; Filters for determiners (DT) and prepositions (IN);
;; nouns and verbs come predefined in opennlp.tools.filters.
(pos-filter determiners #"^DT")
(pos-filter prepositions #"^IN")

(def preps
  (reduce + (map count (map prepositions pos-austen))))
(def dets
  (reduce + (map count (map determiners pos-austen))))
(def nps
  (reduce + (map count (map nouns pos-austen))))
(def vps
  (reduce + (map count (map verbs pos-austen))))

(def stats
  [nps vps dets preps])

(view (bar-chart ["np" "vp" "dts" "preps"] stats))
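To get a feel for what the tagger produces: tag-sent returns a seq of [token tag] pairs, so a call like the following yields something along these lines (the exact tags depend on the model):

(tag-sent "Mary likes cows.")
;; => (["Mary" "NNP"] ["likes" "VBZ"] ["cows" "NNS"] ["." "."])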

Function words like determiners and prepositions are high-frequency words. We are interested in the so-called content words, like nouns and verbs.

Part-of-speech tagging consumes a lot of resources, so instead of removing function words identified by their pos tags, we are going to use a stopword list and a list of punctuation marks. I used the NLTK English stopword list and made my own list of punctuation marks.
(def stop-words
  (set (split-lines (slurp "/home/zoli/Projects/cllx/hello-nlp/corpora/stopwords/english"))))

(def punctuation-marks
  #{"+" "-" "*" "^" "." ";" "%" "\\" "," "..." "!" "?" ":" "\""})
The stop lists are stored in sets because a set can be used as a predicate, so we can filter on its complement (in Clojure, filter keeps the matching elements, it doesn't remove them). It is also common practice to remove hapax legomena (words that occur only once) from the distribution and to use logarithmic scales on the axes of the chart.
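A Clojure set behaves as a function that returns its argument when it is a member and nil otherwise, so (complement the-set) is a predicate that holds exactly for the words not in the set. A tiny illustration with a made-up stop list:

(filter (complement #{"the" "a" "of"})
        ["the" "cat" "sat" "on" "a" "mat"])
;; => ("cat" "sat" "on" "mat")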
;; Drop words that occur only once (hapax legomena).
(defn filter-hapax [lst]
  (filter #(> (second %) 1) lst))

(defn graph-text [text]
  (let [filtered-text (filter (complement punctuation-marks)
                              (filter (complement stop-words)
                                      (tokenize (lower-case text))))
        raw (filter-hapax (sortmap (plus-list-map {} filtered-text)))
        ;; ranks start at 1, because (log10 0) is negative infinity
        words (log10 (range 1 (inc (count raw))))
        numbers (log10 (map second raw))]
    (view (bar-chart words numbers
                     :x-label "Words"
                     :y-label "Frequency"
                     :title "Zipf"))))

Now we've got a nicer chart.

The chart shows that a relatively small number of high-frequency words covers most of the text, so you can get a decent score if you concentrate on them.
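To actually build the cram list instead of a chart, a small hypothetical helper (not part of the snippets above) can reuse the same cleaning pipeline and return the n best candidates:

;; Hypothetical helper: the n most frequent content words of a text.
(defn top-words [text n]
  (->> (tokenize (lower-case text))
       (filter (complement stop-words))
       (filter (complement punctuation-marks))
       (plus-list-map {})
       (sortmap)
       (take n)))

;; (top-words austen 50) => the 50 [word count] pairs worth memorizing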
