(ns hello-nlp.core
  (:use [clojure.string :only [split-lines lower-case]]
        [opennlp.nlp]
        [opennlp.tools.filters]
        [incanter.core]
        [incanter.charts]))
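If you want to follow along, the namespace above assumes clojure-opennlp and Incanter are on the classpath. A minimal project.clj sketch could look like this; the version numbers are only a guess, so check Clojars for current releases:

(defproject hello-nlp "0.1.0-SNAPSHOT"
  :description "Word-frequency experiments with clojure-opennlp and Incanter"
  ;; Illustrative versions only; use whatever is current on Clojars.
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [clojure-opennlp "0.3.3"]
                 [incanter "1.5.5"]])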
(def austen
  (slurp "/path/to/your/corpora/austen-persuasion.txt"))
Warning: slurp reads the whole file into memory! Counting the words is pretty straightforward.
;; Increment the count stored under a key, starting from 1 if the key is new.
(defn plus-map [m k]
  (if (nil? (m k))
    (assoc m k 1)
    (assoc m k (+ (m k) 1))))

;; Fold a list of keys into a map of counts.
(defn plus-list-map [mymap keylist]
  (if (empty? keylist)
    mymap
    (recur (plus-map mymap (first keylist)) (rest keylist))))

;; Turn a count map into a list of [word count] pairs, most frequent first.
(defn sortmap [mymap]
  (let [mykeys   (keys mymap)
        keyorder (sort-by #(mymap %1) > mykeys)
        keymap   (map (fn [k] [k (mymap k)]) keyorder)]
    keymap))

;; `tokenize` is the OpenNLP tokenizer defined further below.
(defn count-words [text]
  (plus-list-map {} (tokenize text)))

(defn graph-words [text]
  (let [raw     (sortmap (count-words text))
        words   (map first raw)
        numbers (map second raw)]
    (view (bar-chart words numbers
                     :x-label "Words"
                     :y-label "Frequency"
                     :title "Zipf"))))
Not an informative picture! Let's analyse our text before we modify our program. The raw text file contains a lot of "noise": it is full of punctuation marks, our program is case sensitive, and so on. Another problem lies in the nature of language.
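To see the case-sensitivity problem in isolation, here is a tiny sketch with made-up tokens; the printed maps are illustrative:

;; "The" and "the" end up as two different keys in the count map:
(plus-list-map {} ["The" "sea" "the" "the"])
;; => {"The" 1, "sea" 1, "the" 2}

;; Lower-casing first collapses the two variants:
(plus-list-map {} (map lower-case ["The" "sea" "the" "the"]))
;; => {"the" 3, "sea" 1}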
;; OpenNLP models: sentence splitter, tokenizer and POS tagger.
(def get-sentences
  (make-sentence-detector "models/en-sent.bin"))

(def tokenize
  (make-tokenizer "models/en-token.bin"))

(def pos-tag
  (make-pos-tagger "models/en-pos-maxent.bin"))

(defn tag-sent [sent]
  (pos-tag (tokenize sent)))

;; POS-tag every sentence of the novel.
(def pos-austen
  (map pos-tag (map tokenize (get-sentences austen))))

;; Filters for determiners (DT) and prepositions (IN); nouns and verbs
;; are already provided by opennlp.tools.filters.
(pos-filter determiners #"^DT")
(pos-filter prepositions #"^IN")

(def preps
  (reduce + (map count (map prepositions pos-austen))))

(def dets
  (reduce + (map count (map determiners pos-austen))))

(def nps
  (reduce + (map count (map nouns pos-austen))))

(def vps
  (reduce + (map count (map verbs pos-austen))))

(def stats
  [nps vps dets preps])

(view (bar-chart ["np" "vp" "dts" "preps"] stats))
Function words like determiners and prepositions are high-frequency words. We are interested in the so-called content words, like nouns and verbs.
Part-of-speech tagging is resource-hungry, so instead of removing function words identified by their POS tag, we are going to use a stopword list and a list of punctuation marks. I used the NLTK English stopword list and made my own list of punctuation marks.
(def stop-words
  (set (split-lines (slurp "/home/zoli/Projects/cllx/hello-nlp/corpora/stopwords/english"))))

(def punctuation-marks
  #{"+" "-" "*" "^" "." ";" "%" "\\" "," "..." "!" "?" ":" "\""})
;; Drop hapax legomena, i.e. words that occur only once.
(defn filter-hapax [lst]
  (filter #(> (second %) 1) lst))

(defn graph-text [text]
  (let [filtered-text (->> (tokenize (lower-case text))
                           (filter (complement stop-words))
                           (filter (complement punctuation-marks)))
        raw     (filter-hapax (sortmap (plus-list-map {} filtered-text)))
        ;; log-log scale; ranks start at 1 because log10 of 0 is undefined
        words   (log10 (range 1 (inc (count raw))))
        numbers (log10 (map second raw))]
    (view (bar-chart words numbers
                     :x-label "Words"
                     :y-label "Frequency"
                     :title "Zipf"))))
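Producing the cleaned-up, log-log chart is again a single call (sketch; austen is the text slurped above):

;; Lower-cased, stopword- and punctuation-free frequencies on a log-log scale:
(graph-text austen)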
Now we've got a nicer chart.
The chart shows that you can get decent coverage of the text if you concentrate on the most frequent words.