This is the future home page of Hopskip (www.hopskip.org) - an open-source tool for specifying, training, and sharing probabilistic models of string and sequence data. Applications include text analysis and manipulation, speech recognition, machine translation, information extraction, music, and genomics.

You specify an appropriate probabilistic model using an extended regular expression language. Internally this is compiled into a parameterized finite-state machine. You can then train the free parameters from data. Training can be supervised, unsupervised, or something in between.
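
As a rough illustration of that idea (not Hopskip's own syntax or API, which this page does not describe), here is a minimal Python sketch. It hand-builds the one-state machine for the regular expression (a|b)*, with arc probabilities that depend on two free parameters, scores a string with the forward algorithm, and fits the parameters from fully observed strings (the supervised case). The names make_machine, string_probability, and fit_supervised, and the parameters p_a and p_stop, are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def make_machine(p_a, p_stop):
    """Build a toy parameterized machine for (a|b)* (hypothetical example).

    At state 0 the machine stops with probability p_stop; otherwise it
    emits "a" with probability p_a or "b" with probability 1 - p_a.
    """
    arcs = [
        # (source state, symbol, destination state, probability)
        (0, "a", 0, (1 - p_stop) * p_a),
        (0, "b", 0, (1 - p_stop) * (1 - p_a)),
    ]
    final = {0: p_stop}  # probability of stopping in each final state
    return arcs, final


def string_probability(arcs, final, string, start=0):
    """Forward algorithm: total probability of all accepting paths for string."""
    forward = {start: 1.0}
    for symbol in string:
        nxt = defaultdict(float)
        for src, label, dst, weight in arcs:
            if label == symbol and src in forward:
                nxt[dst] += forward[src] * weight
        forward = nxt
    return sum(mass * final.get(state, 0.0) for state, mass in forward.items())


def fit_supervised(strings):
    """Maximum-likelihood p_a and p_stop for this toy machine.

    Because every path is fully determined by its string, the estimates
    are relative frequencies. Assumes at least one symbol is observed.
    """
    n_a = sum(s.count("a") for s in strings)
    n_b = sum(s.count("b") for s in strings)
    n_stop = len(strings)  # each string stops exactly once
    p_a = n_a / (n_a + n_b)
    p_stop = n_stop / (n_a + n_b + n_stop)
    return p_a, p_stop


if __name__ == "__main__":
    data = ["aab", "ab", "a"]
    p_a, p_stop = fit_supervised(data)
    arcs, final = make_machine(p_a, p_stop)
    print(p_a, p_stop, string_probability(arcs, final, "ab"))
```

In this toy case the training step has a closed form. With unobserved paths or shared parameters, an iterative method would be needed instead.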

It is easy to specify complex models that are sensitive to linguistically meaningful features or that incorporate resources such as dictionaries and morphological analyzers. You can try your models right away, without writing additional code; the Hopskip code will run them in a highly optimized way.

We are planning a communal library of useful finite-state machines, such as taggers, parsers, lemmatizers, weighted translation dictionaries, and so on. You can use these directly or incorporate them into your own Hopskip models. You can also retrain them on new data.

For a technical paper and some overview slides, see Eisner (2002). The major technical contributions involve new learning algorithms that are sufficiently general to handle parameterized finite-state machines, and algorithmic tricks to speed them up. See also the related Dyna project, which is providing the underlying infrastructure.


Project participants so far:
    Jason Eisner (project leader)
    Roy Tromble
Project location:
    Johns Hopkins University -- Natural Language Processing Lab
    (part of the Computer Science Department and the Center for Language and Speech Processing)
Funding:
    A 5-year CAREER award from the National Science Foundation, Grant No. 0347822. Any opinions, findings, and conclusions or recommendations expressed on this website are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
License:
    We expect to make everything freely available and open-source. However, we may support substitution of closed-source code (e.g., the AT&T FSM toolkit) for some components.


Were [the gossiper's road] as straight as the Appia, and as broad as "that which leadeth to destruction," nevertheless would he be malcontent without a frequent hopskip-and-jump over the hedges, into the tempting pastures of digression beyond. -Edgar Allan Poe (1844)


Jason Eisner - jason@cs.jhu.edu - Last modified: 2006/01/31 16:06:31 (GMT)