Thursday, December 29, 2011

Hello nlp!

This post assumes you have already installed Leiningen and are comfortable working in a programmer's editor of your choice.

Starting a new project
lein new hello-nlp

This command creates a new directory (hello-nlp). Navigate into that new directory and open the file project.clj. You are going to see something like this:
(defproject hello-nlp "1.0.0-SNAPSHOT"
  :description "FIXME: write description"
  :dependencies [[org.clojure/clojure "1.3.0"]])
We are going to use the Apache Software Foundation's OpenNLP library through Lee Hinman's Clojure interface, clojure-opennlp (this post is based on Hinman's tutorial). Searching for “opennlp” gives various results, so we picked the first one (version 0.1.7). Its information page contains everything you might want to know: the location of the GitHub repo and a short dependency snippet for Leiningen users, [clojure-opennlp "0.1.7"]. Add that snippet to the :dependencies vector in project.clj as follows:

(defproject hello-nlp "1.0.0-SNAPSHOT"
  :description "FIXME: write description"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [clojure-opennlp "0.1.7"]])

Now your project.clj knows everything and is ready to serve you. The command
lein deps
downloads the dependencies (e.g. the clojure-opennlp library) and puts them on your project's classpath. Have a look at the lib directory in your project directory and you'll see the jar files.

The core
Now navigate into the hello-nlp/src/hello_nlp/ directory. You'll find a core.clj file there. Open it in your editor.

You'll see something like this:
(ns hello-nlp.core)
To “enable” OpenNLP, modify the file:
(ns hello-nlp.core
  (:use opennlp.nlp)
  (:use opennlp.treebank)
  (:use opennlp.tools.filters))
You need a few additional files. Make a models directory in hello-nlp and download the pre-trained models from http://opennlp.sourceforge.net/models-1.5/ into it. In this post we are using the English models, but feel free to switch to another language. You need the Sentence Detector (en-sent.bin), the Tokenizer (en-token.bin) and the POS Tagger (en-pos-maxent.bin).
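Before moving on, it can be worth checking that the three model files really are where the code will look for them. The sketch below is not part of clojure-opennlp; `required-models` and `missing-models` are hypothetical names made up for this post, and the check assumes the REPL is started from the project root so that the relative path models resolves correctly.

```clojure
(require '[clojure.java.io :as io])

;; File names taken from the download step above.
(def required-models
  ["en-sent.bin" "en-token.bin" "en-pos-maxent.bin"])

(defn missing-models
  "Returns the names in `names` that do not exist as files under `dir`.
  Hypothetical helper, not part of clojure-opennlp."
  [dir names]
  (remove #(.exists (io/file dir %)) names))
```

Calling (missing-models "models" required-models) should return an empty sequence once all three downloads are in place.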
Now we can add user-defined functions to core.clj. In the example, we made a sentence detector (get-sentences), a tokenizer (tokenize) and a POS tagger (pos-tag) based on the downloaded models.
(ns hello-nlp.core
  (:use opennlp.nlp)
  (:use opennlp.treebank)
  (:use opennlp.tools.filters))

;; Model paths are relative to the directory the REPL is started from
;; (here, the project root).
(def get-sentences
  (make-sentence-detector "models/en-sent.bin"))

(def tokenize
  (make-tokenizer "models/en-token.bin"))

(def pos-tag
  (make-pos-tagger "models/en-pos-maxent.bin"))
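The three functions are meant to be chained: split the text into sentences, tokenize each sentence, then tag each token list. A minimal sketch of that chain follows. `make-pipeline` is a hypothetical helper, not part of clojure-opennlp; it takes the three steps as arguments, so it works with the functions defined above or with any stand-ins that have the same shape.

```clojure
(defn make-pipeline
  "Returns a function that maps a text to a sequence of tagged sentences.
  Hypothetical helper: each argument is a one-argument function, e.g. the
  get-sentences, tokenize and pos-tag defined in core.clj."
  [get-sentences tokenize pos-tag]
  (fn [text]
    ;; comp applies right to left: tokenize first, then pos-tag.
    (map (comp pos-tag tokenize) (get-sentences text))))
```

With the definitions above, ((make-pipeline get-sentences tokenize pos-tag) "Some text.") does the same work as (map pos-tag (map tokenize (get-sentences "Some text."))).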
Get your hands dirty
You can try out the newly defined functions on your own sentences!
hello-nlp.core> (get-sentences "I like reading books. And I like watching goo movies.")
["I like reading books." "And I like watching goo movies."]
hello-nlp.core> (map tokenize (get-sentences "I like reading books. And I like watching goo movies."))
(["I" "like" "reading" "books" "."] ["And" "I" "like" "watching" "goo" "movies" "."])
hello-nlp.core> (map pos-tag (map tokenize (get-sentences "I like reading books. And I like watching goo movies.")))
((["I" "PRP"] ["like" "VBP"] ["reading" "VBG"] ["books" "NNS"] ["." "."]) (["And" "CC"] ["I" "PRP"] ["like" "IN"] ["watching" "VBG"] ["goo" "JJ"] ["movies" "NNS"] ["." "."]))
hello-nlp.core> (nouns (pos-tag (tokenize "Don't forget the milk.")))
(["milk" "NN"])
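The tagger output is plain Clojure data (sequences of [word tag] vectors), so the core sequence functions work on it directly. The sketch below copies the tagged sentences from the session above into a literal so that it runs without the models; `tag-frequencies` and `words-with-tag` are hypothetical helper names, not functions from clojure-opennlp.

```clojure
;; Tagged output copied verbatim from the REPL session above.
(def tagged-sentences
  '((["I" "PRP"] ["like" "VBP"] ["reading" "VBG"] ["books" "NNS"] ["." "."])
    (["And" "CC"] ["I" "PRP"] ["like" "IN"] ["watching" "VBG"]
     ["goo" "JJ"] ["movies" "NNS"] ["." "."])))

(defn tag-frequencies
  "Counts how often each POS tag occurs across all sentences."
  [sentences]
  (frequencies (map second (apply concat sentences))))

(defn words-with-tag
  "Returns the words carrying exactly the given tag, in order of appearance."
  [tag sentences]
  (for [sentence sentences
        [word t] sentence
        :when (= t tag)]
    word))

(tag-frequencies tagged-sentences)
;; => {"PRP" 2, "VBP" 1, "VBG" 2, "NNS" 2, "." 2, "CC" 1, "IN" 1, "JJ" 1}

(words-with-tag "VBG" tagged-sentences)
;; => ("reading" "watching")
```

The same two helpers can be applied to the live result of (map pos-tag (map tokenize (get-sentences ...))), since it has the same shape.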