This current plan for this series is to follow the book Algorithms for Computational Linguistics using Clojure core.logic instead of Prolog.
Jumps, wildcard transitions and parsing are natural and useful ways to extend and leverage finite state machines for text analysis. This was an opportunity to introduce extensions of fact databases and non-deterministic matching. Here's the code:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(ns fsmparse | |
(:refer-clojure :exclude [==]) | |
(:use [clojure.core.logic])) | |
;; We will encode a state machine that accepts lists containing '(w h y) as a sublist | |
;; Moreover, instead of a recognizer, we will implement a parser, that returns a list | |
;; of visited states in order | |
;; | |
;; +----#-----+----#-----+ +--?--+ | |
;; v | | v | | |
;; +> (x) --w--> (w) --h--> (wh) --y--> (why) --+ | |
;; | | | |
;; +-?-+ | |
;; | |
;; Notation: | |
;; ? - any character | |
;; # - jump; does not consume a character from the input | |
;; (<x>) - state named <x> | |
;; --<i>--> - transition with input <i> | |
(defrel start q) | |
(fact start 'x) | |
;; Encoded transitions including jumps | |
(defrel transition* from via to) | |
(facts transition* [['x 'w 'w] | |
['w 'h 'wh] | |
['wh 'y 'why] | |
['w :jump 'x] | |
['wh :jump 'x]]) | |
;; An extension of the transition* relation to implement start state and final | |
;; state transitions that accept any character | |
(defn transition [from via to] | |
(conde | |
((transition* from via to)) | |
((!= via :jump) (== from 'x) (== to 'x)) | |
((!= via :jump) (== from 'why) (== to 'why)))) | |
(defrel accepting q) | |
(fact accepting 'why) | |
(defn parse | |
([input parsed] | |
(fresh [q0] | |
(start q0) | |
(parse q0 input parsed))) | |
([q input parsed] | |
; Non-relational matching, commits to first match | |
(matcha [input] | |
(['()] | |
(accepting q) | |
(== parsed (list q))) | |
([[i . nput]] | |
; Handling transitions that consume input characters | |
(!= i :jump) | |
(fresh [qto subparsed] | |
(transition q i qto) | |
(parse qto nput subparsed) | |
; conso is a built in relation defined as | |
; conso(x, xs, [x . xs]) succeeds | |
(conso q subparsed parsed))) | |
([_] | |
; Handling jump transitions | |
(fresh [qto subparsed] | |
(transition q :jump qto) | |
(parse qto input subparsed) | |
(conso q subparsed parsed)))))) | |
(run* [q] (parse '(a w h y i n s i d e) q)) | |
;; => ((x x w wh why why why why why why why)) | |
(run* [q] (parse '(n o w a y) q)) | |
;; => () | |
(run 3 [q] (fresh [m] (parse q m))) | |
;; => ((w h y) (_.0 w h y) (_.0 _.1 w h y)) |
This is really fantastic stuff! I'm also very, very excited about leveraging core.logic for parsing. One hurdle we have to surmount is very large inputs - Prolog has some tricks that we haven't implemented yet. Looking forward to watching this series evolve and any feedback you may have about core.logic.
ReplyDeleteCool to hear that. I haven't been looking at complexity yet, just trying to get used to this paradigm and looking for what it can offer over Prolog. Just being able to prototype and experiment this fast is great already. But absolutely, it would be pretty neat if this actually scaled up.
Delete