This post assumes you have already installed Leiningen and are comfortable working in your editor of choice.
Starting a new project
lein new hello-nlp
This command creates a new directory (hello-nlp). Navigate into that new directory and open the file project.clj. You are going to see something like this:
(defproject hello-nlp "1.0.0-SNAPSHOT"
  :description "FIXME: write description"
  :dependencies [[org.clojure/clojure "1.3.0"]])
We are going to use the Apache Foundation's OpenNLP library through Lee Hinman's Clojure interface, clojure-opennlp (this post is based on Hinman's tutorial). Searching for “opennlp” gives several results, so we picked the first (ending with 0.1.7). The information page tells you everything you might want to know: the location of the GitHub repo and a short dependency snippet for Leiningen users, [clojure-opennlp "0.1.7"]. Copy and paste the snippet into project.clj as follows:
(defproject hello-nlp "1.0.0-SNAPSHOT"
  :description "FIXME: write description"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [clojure-opennlp "0.1.7"]])
Now project.clj lists everything the project depends on. The command
lein deps
downloads the dependencies (e.g. the clojure-opennlp library) and puts them on your classpath. Have a look at the lib directory inside your project directory and you'll see the jar files.
The core
Now navigate into the hello-nlp/src/hello_nlp/ directory. You'll find a core.clj file there. Open it in your editor.
You'll see something like this:
(ns hello-nlp.core)
To “enable” OpenNLP, modify the file:
(ns hello-nlp.core
  (:use opennlp.nlp)
  (:use opennlp.treebank)
  (:use opennlp.tools.filters))
You need a few additional files. Make a models directory in hello-nlp and download the pre-trained models from http://opennlp.sourceforge.net/models-1.5/. In this post we use the English models, but feel free to pick another language. You need the Sentence Detector (en-sent.bin), the Tokenizer (en-token.bin) and the POS Tagger (en-pos-maxent.bin).
Now we can add user-defined functions to core.clj. In this example, we build a sentence detector (get-sentences), a tokenizer (tokenize) and a POS tagger (pos-tag) from the downloaded models.
(ns hello-nlp.core
  (:use opennlp.nlp)
  (:use opennlp.treebank)
  (:use opennlp.tools.filters))

;; Splits a string into sentences.
(def get-sentences
  (make-sentence-detector "models/en-sent.bin"))

;; Splits a sentence into tokens.
(def tokenize
  (make-tokenizer "models/en-token.bin"))

;; Tags a sequence of tokens with part-of-speech labels.
(def pos-tag
  (make-pos-tagger "models/en-pos-maxent.bin"))
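The model paths above are relative, so they are presumably resolved against the directory you start Leiningen from, and a misplaced file only shows up as an error when core.clj is loaded. A minimal sketch of a check you could run first; the helper name missing-models is hypothetical (it is not part of clojure-opennlp) and it uses only clojure.java.io:

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper, not part of clojure-opennlp: returns the file
;; names from `names` that do not exist inside directory `dir`.
(defn missing-models
  [dir names]
  (remove #(.exists (io/file dir %)) names))

;; The three model files used in this post.
(missing-models "models" ["en-sent.bin" "en-token.bin" "en-pos-maxent.bin"])
```

An empty result means all three files are where core.clj expects them.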
Get your hands dirty
You can try out the newly defined functions on your own sentences in a REPL!
hello-nlp.core> (get-sentences "I like reading books. And I like watching goo movies.")
["I like reading books." "And I like watching goo movies."]
hello-nlp.core> (map tokenize (get-sentences "I like reading books. And I like watching goo movies."))
(["I" "like" "reading" "books" "."] ["And" "I" "like" "watching" "goo" "movies" "."])
hello-nlp.core> (map pos-tag (map tokenize (get-sentences "I like reading books. And I like watching goo movies.")))
((["I" "PRP"] ["like" "VBP"] ["reading" "VBG"] ["books" "NNS"] ["." "."]) (["And" "CC"] ["I" "PRP"] ["like" "IN"] ["watching" "VBG"] ["goo" "JJ"] ["movies" "NNS"] ["." "."]))
hello-nlp.core> (nouns (pos-tag (tokenize "Don't forget the milk.")))
(["milk" "NN"])
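The tagger's output is ordinary Clojure data, a sequence of [token tag] pairs, so you can also process it with the standard sequence functions. The sketch below filters pairs by tag prefix; the helper tags-starting-with is a hypothetical name introduced for illustration and is not how opennlp.tools.filters is implemented. It needs no model files because it works on a tagged sentence copied from the REPL output above.

```clojure
;; One tagged sentence, copied from the REPL output above.
(def tagged
  [["I" "PRP"] ["like" "VBP"] ["reading" "VBG"] ["books" "NNS"] ["." "."]])

;; Hypothetical helper: keep the [token tag] pairs whose tag starts
;; with `prefix`. Uses Java's String.startsWith via interop.
(defn tags-starting-with
  [prefix tagged-tokens]
  (filter (fn [[_ tag]] (.startsWith ^String tag prefix)) tagged-tokens))

(tags-starting-with "NN" tagged) ; => (["books" "NNS"])
(tags-starting-with "VB" tagged) ; => (["like" "VBP"] ["reading" "VBG"])
```

Matching on a prefix rather than an exact tag catches related tags together, for example NN and NNS for singular and plural nouns.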