Wednesday, November 16, 2011

Why Clojure lx?

The NLTK is a natural choice for students of linguistics and computer science. It has matured into a stable project, its users are very active, and it is now used outside of academia. Those who are into functional programming can use the Scheme Natural Language Toolkit, or learn from Natural Language Processing for the Working Programmer, and those who need the JVM can turn to ScalaNLP. So why bother with Clojure?

First of all, we are NOT proposing a new framework/library here! Our main goal is to examine what Clojure offers to linguists. Although more and more linguistics departments offer courses in statistics and probability theory, the vast majority of students graduate with some background in discrete maths, mostly taught implicitly through a class in syntax and/or semantics (and the same is true for philosophy education). Using computer programs to test scientific ideas is becoming common practice in the sciences, and this is true for linguistics too. Stefan Th. Gries distinguishes linguistic computing from computational linguistics; following him, we think linguistic computing will become a common methodology in the language sciences.

So, what's the difference between computational linguistics and linguistic computing? Well, there is no clear boundary! We'd say computational linguistics (or natural language processing) is a kind of applied science and engineering, and as such it is more “goal oriented”. Norvig's recent critique of Chomsky argues that commercial success is a measure of ideas, but despite the proliferation of statistical methods, linguists are still doing research on rule-based systems like HPSG, minimalism, etc., and new interdisciplinary research themes have emerged, like Parikh's idea of social software (and game-theoretic semantics and dynamic epistemic logic, among others). What is “pure” research today can become applied research tomorrow. To foster communication between pure and applied research, between linguistic computing and computational linguistics, we need a lingua franca.

As Clojure is the Lisp for the JVM, it is a convenient language for linguists. In the not-so-distant past, Touretzky wrote his Gentle Introduction to Symbolic Computation, an excellent book for beginners in the humanities. Gazdar and Mellish's Natural Language Processing in X (where X stands for Prolog, Lisp or Pop11) is a good introduction to finite state techniques, grammars, and parsing, and it even has a chapter on question answering. We don't deny that these techniques are old, but they are still part of the well-educated linguist's body of knowledge. Norvig's PAIP is a real gem in this respect: one cannot argue against the “old” AI paradigm without knowing the past, and those ideas are still important for linguists, philosophers and cognitive scientists. Logic programming is a natural companion to functional programming. The basic techniques of computational linguistics can be expressed in logic programs, and although these little programs have their computational limitations, they have unquestionable educational value.

Porting the classics to Clojure is not a novel idea: some Google searching shows that people are already turning classic Lisp books like PAIP or Structure and Interpretation of Computer Programs into modern Clojure. The core.logic library opens up the possibility of doing the same with the Prolog literature.

The most common argument against NLTK is that you can't use mature, industry-standard tools like the GATE framework, Stanford CoreNLP, and OpenNLP. Clojure's Java interoperability solves this problem. If you are into machine learning, Weka, MALLET, and others are at your service. The Incanter package provides an R-like statistical library.

With these tools in hand, you can test your ideas in a language that's very close to what you learned about formal languages. Using Java libraries is like using rapid prototyping material when you are a marble sculptor. And as the end result of your work can be shared with computational linguists, you can get more feedback, and even help, from the greater community.

That's why we think that Clojure lx is an idea worth exploring. We'd like to test ourselves! Can we use Clojure to express our simple ideas? How easy is it to use Java libraries for a project? If you would like to join us, please send an email to zoltan.varju(at)gmail.com. We welcome everyone: linguists and Clojure hackers, philosophers, digital humanists, everyone who is interested!

About us
Zoltán Varjú – computational linguist at Weblib LLC, @zoltanvarju, Számítógépes nyelvészet
Richard Littauer – MSc computational linguistics student at the University of Saarland, @richlitt

Special thanks to
Neil Ashton - @nmashton

2 comments:

  1. I have written a simple-to-use concordancer for hand-built Polish corpora in Clojure. It lives at http://smyrna.danieljanus.pl (due to the target audience the site is all Polish, sorry) and the source is at https://github.com/nathell/smyrna. Thanks to Clojure it was a pleasure to write and the code is quite concise. Hopefully this serves as an encouragement!

  2. This looks like a great tool! Can we invite you to write a guest post about your concordancer? Unfortunately I don't speak Polish, but I have a Slavist friend (http://tempflip.tumblr.com/) who's interested in the tool.
