Sunday, January 29, 2012

Counting words

Zipf's law is a well-know word frequency distribution. Let's assume you are learning a foreign language and your teacher gives you books to read. You have to take exams that test if you acquired the vocabulary of the books. You have other commitments, and you prefer reading blogs and books on computational linguistics, so you'd like to determine the most frequent words of the texts and learn them by rote memorization right before the exam. You know that the higher the frequency of a word, the higher the probability it will be on the test. At first, it seems to be obvious that we have to count how many times each word occurs in a text, but it will get a bit complicated.
We need a text file, I'm using Austen's Persuasion from the NLTK corpora.
Warning! slurp reads the whole file into the memory! Counting the words is pretty straightforward.
Plot the text with (graph-words austen) (or your text) and you will see something like this.

Not an informative picture! Let's analyse our text before we modify our program. The raw text file contains a lots of "noise". E.g. it is full of punctuation marks, our program is case sensitive and etc. Another problem lies in the nature of language.
Function words like determiners and prepositions are high frequency words. We are interested in the so called content words like nouns and verbs.

Part of speech tagging consumes your resources, so instead of removing function words identified by their pos tag, we are going to use a stopword list, and a list of punctuation marks. I used the NLTK English stopword list and made my own list of punctuation marks.
The stop lists are stored in sets because we can filter the complement of a set (in Clojure, filter gives you the elements, doesn't remove them). It is a common practice to remove hapax legomena from the distribution and to use logarithmic scales on the axes of the chart.
Now we've got a nicer chart.

The chart shows you that you can get a decent score if you concentrate on the most frequent words.

Friday, January 27, 2012

lx in core.logic #2: Jumps, Flexible Transitions and Parsing

This is a continuation of the post Finite State Machines in Clojure core.logic.

This current plan for this series is to follow the book Algorithms for Computational Linguistics using Clojure core.logic instead of Prolog.

Jumps, wildcard transitions and parsing are natural and useful ways to extend and leverage finite state machines for text analysis. This was an opportunity to introduce extensions of fact databases and non-deterministic matching. Here's the code:

Wednesday, January 25, 2012

Finite State Machines in core.logic

This is an implementation of Finite State Machines in Clojure using core.logic. They are a good starting point for computational linguistics and illustrate several features of core.logic, such as various ways of defining new relations, pattern matching and also the invertibility of relations.

It is not an introduction to core.logic. To learn the basics, I would recommend the Logic Starter.

Tuesday, January 24, 2012

Beginning with Clojure

I am, by heart, a linguist, not a computational linguist. I was trained in Edinburgh, which is theoretically heavy, although not in Chomskyan, traditional linguistics. What I learned of Python I essentially taught myself, and there's no limit to my ignorance with traditional programming languages. That doesn't mean I'm not willing to try something new - far from it.

So, here we go. Rather than sit and pretend I haven't been twiddling my thumbs or busy for the past few months, I'm going to come straight out and say that is exactly what has been happening. I like blogging as I go along, though.

Install Clojure from this site. And here is where I ran into my first problem. I downloaded Clojure 1.3.0, unzipped it into my 'code' folder, and then cd'd in there in the Terminal. (I run a Mac.) The site suggests running this:
java -cp clojure.jar clojure.main
Well, that didn't work. (Neither did posting code snippets on Blogger, it seems. update: nevermind.) Instead, I got this:

Exception in thread "main" java.lang.NoClassDefFoundError: clojure/main
Caused by: java.lang.ClassNotFoundException: clojure.main
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

This is because there's an issue where it is packaged, and you need to call it by the specific package name. Quick fix, and...:
java -cp clojure-1.3.0.jar clojure.main Clojure 1.3.0
We're properly off!

I've been messing around with clojure on and off for a while now, over here. I highly suggest the tutorial, it is great. (I also highly suggest checking out this post on why Clojure Con was great, but that's not really on topic.

Depending on your development style, you may also want

  • line editing and history at the REPL
  • a syntax-highlighting editor
  • package management
  • automated builds
  • a full IDE
  • a tutorial environment
The site suggests all of those. I'm not so sure. I generally do all of my code with the Terminal and with MacVim. I'll be relying heavily on Vim for Clojure.

That's enough for this post. I'll put more up tomorrow, I hope! I know I'm late, but I view this as a running project and not an end-based one. Again, I'm a linguist, not a coder. So this is a long process.

Friday, January 13, 2012

What makes Clojure different?

A friend of mine asked me why Clojure matters and what makes it special and why I think it is good for linguists. This post is the edited version of my answer to my dear friend. Since there are very good books on the market (my favourite is Clojure in Action) and the internet is full of good tutorials (4Clojure is esp. good if you like the learning by doing method) my goal is only to give you a rough picture of functional programming.

An example

We are going to solve a "toy" problem stolen from the first chapter of Peter Norvig's seminal Paradigms of Artificial Intelligence. The question is how do you extract first and last names from someone’s full name. Before you think this is too simple and it doesn't worth dealing with, consider names like Robert Downey Jr, Admiral Grace Hopper, and what about Staff Sergeant William "Wild Bill" Guarnere (a character for the Band of Brothers series). Machines should be programmed to solve these problems, and even humans could have problems with names. It took me years to figure out that Martin "Boban" Doktor (a well known Czech Olympic champion sprint canoer) is not a real doctor...
First, we need some data to test our assumptions.
The function 'def' associates the symbol 'names' with names (oh, a vector of vectors). A first name is usually just the first word in a name.
And the last name is the last word in a name.
Let's test our functions. Calling first-name and last-name on my name gives the right answers.
We stored out test data in names, and now it's time to test our functions en mass. The higher order function map helps us in doing so. Map takes a function as its first argument and applies it to every member of its second argument.
Oooops, the program is having serious problems with "titles" or prefixes. Calling last-name on names gives interesting results too. Our program is not that bad, it captures the basic logic of identifying first and last names, but affixes cause problems. The first name should be the first word in a name if it is not a prefix. Let's store the affixes in vectors.
We want to test if the first word of the full name is a member of the titles. We need a function that tests membership.
The function member is recursive. First, it test if its second argument is a sequence. The second if gives us a terminating condition, if x and the first element of the second argument are equivalent, it returns the whole second argument. Otherwise it tests the membership again on the rest of the sequence (i.e. everything but the first element of the original sequence). Now, we can redefine our first name function. If the first word of the full name is in the list of prefixes, call first-name on the rest of the full name, otherwise return the first word of the full name.
Testing our new function shows it works correctly.
We can redefine last-name similarly.
Storing names in vectors of strings is very unnatural (at least for humans, I guess machines don't care about these issues). Wouldn’t it be nicer to type names like "Zoltán Varjú" instead of ["Zoltán" "Varjú"]?
First, we need new test data, which is a vector of strings.
We want to use our first-name and last-name functions. Can we split a name into individual words? clojure.string provides us a split function (that's why we put (:use [clojure.string :as str :only [split]] :reload) into ns) which splits a string into a vector of strings at a given point. The space character delimits the parts of a name. Our source code looks like this now:
Now we can test split from clojure.string.
Let's define a split-name function just to save ourself from repetitive strain injury caused by excessive typing.
Finally, we test if our functions work on splitted names.

Notes

I have to note, you can make the code more concise and idiomatic. I hope you can see 1) how can you solve a problem with functions and by combining them 2) you have a basic idea of what is recursion 3) how can you go from a basic problem to an acceptable solution.

What makes Clojure different?

Norvig lists eight features that make Lisp different:
  1. built-in support for lists
  2. automatic storage management
  3. dynamic typing
  4. first-class functions
  5. uniform syntax
  6. interactive environment
  7. extensibility
  8. history (see Paul Graham's essays, What Made Lisp Different and The Roots of Lisp)

Clojure is a Lisp on the JVM which makes it unique. The Java Virtual Machine makes it portable, reliable and secure, but there is a new JavaScript based version called ClojureScript. Slime is an excellent development environment, leiningen makes project automation easy. Java interoperability means Clojure has got a great collection of libraries for almost everything.
However Clojure is not for complete beginners. The Clojure community is very open and supportive, but asking the right question requires some sort of maturity. As this Reddit thread explains you shouldn't be a Java expert to pick up the language, even you can learn what you have to know on the go. But you should know at least one 'conventional' language like Python before you start learning Clojure. More propaganda in our Why Clojure lx? post.