Thursday, December 29, 2011

Hello nlp!

This post assumes you have already installed Leiningen and are comfortable working in the programmer's editor of your choice.

Starting a new project

Running lein new hello-nlp creates a new directory (hello-nlp). Navigate into that new directory and open the file project.clj. You are going to see a defproject form that lists the project's name, version, description and dependencies.

We are going to use the Apache Foundation's OpenNLP library with the help of Lee Hinman's Clojure library interface (and this post is based on Hinman's tutorial). Searching for “opennlp” gives various results, so we picked the first one (ending with 0.1.7). The information page contains everything you might want to know: the location of the GitHub repo and a short code snippet for Leiningen users, [clojure-opennlp "0.1.7"]. Copy and paste that snippet into the :dependencies vector of project.clj.
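After the edit, project.clj might look roughly like the sketch below. Only the clojure-opennlp coordinate comes from this post; the project version, the description text and the Clojure version are assumptions about what Leiningen generated, so keep whatever your own file already contains.

```clojure
;; project.clj (sketch; version strings other than clojure-opennlp's are assumed)
(defproject hello-nlp "1.0.0-SNAPSHOT"
  :description "FIXME: write description"
  :dependencies [[org.clojure/clojure "1.3.0"]   ; assumed; use the version Leiningen generated
                 [clojure-opennlp "0.1.7"]])     ; the line added for this tutorial
```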


Now your project.clj knows everything and is ready to serve you. The command lein deps downloads the dependencies (e.g. the clojure-opennlp library) and puts them on your classpath. Have a look at the lib directory in your project directory and you'll see the jar files.

The core

Now navigate into the hello-nlp/src/hello_nlp/ directory. You'll find a core.clj file there. Open it in your editor.

You'll see a namespace declaration for hello-nlp.core, possibly followed by some placeholder code.
To “enable” OpenNLP, modify the file so that its namespace declaration pulls in the opennlp.nlp namespace.
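A minimal sketch of the modified namespace declaration follows. It assumes the generated file declared the namespace hello-nlp.core, and it uses an :only list to name the three constructor functions from opennlp.nlp that this post relies on; a plain (:use opennlp.nlp) would work as well.

```clojure
;; src/hello_nlp/core.clj (sketch of the namespace declaration)
(ns hello-nlp.core
  ;; bring in only the model-loading constructors needed in this post
  (:use [opennlp.nlp :only [make-sentence-detector
                            make-tokenizer
                            make-pos-tagger]]))
```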
You need a few additional files. Make a models directory in hello-nlp and download the pre-trained models from here (http://opennlp.sourceforge.net/models-1.5/) into it. In this post, we are using the English models, but feel free to pick another language. You need the Sentence Detector (en-sent.bin), the Tokenizer (en-token.bin) and the POS Tagger (en-pos-maxent.bin).
Now we can add user-defined functions to core.clj. In the example, we made a sentence detector (get-sentences), a tokenizer (tokenize) and a POS tagger (pos-tag) based on the downloaded models.
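Put together, core.clj could look something like the sketch below. The function names and model file names are the ones mentioned in this post; the relative models/ paths assume the models directory sits directly under hello-nlp and that the code is run from the project root.

```clojure
;; src/hello_nlp/core.clj (sketch; assumes the models were saved under hello-nlp/models)
(ns hello-nlp.core
  (:use [opennlp.nlp :only [make-sentence-detector
                            make-tokenizer
                            make-pos-tagger]]))

;; Each make-* call loads a pre-trained model and returns a function.
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize      (make-tokenizer "models/en-token.bin"))
(def pos-tag       (make-pos-tagger "models/en-pos-maxent.bin"))
```

Because the models are loaded when the namespace is loaded, a missing or misnamed model file shows up as an error as soon as core.clj is required.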
Get your hands dirty

You can try out the newly defined functions on your own sentences!
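One way to experiment is to start a REPL from the project directory with lein repl and load the namespace. The session below is only a sketch: the example sentences are made up, and it assumes core.clj defines get-sentences, tokenize and pos-tag as described above.

```clojure
;; typed at the REPL started with `lein repl` inside hello-nlp
(use 'hello-nlp.core)

;; split a text into sentences
(get-sentences "This is the first sentence. Here comes the second one!")

;; split a sentence into tokens
(tokenize "Mr. Smith gave a car to his son on Friday.")

;; the POS tagger expects tokens, so tokenize first;
;; each result pairs a token with its part-of-speech tag
(pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))
```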