lingo's Introduction

lingo

Package lingo provides the data structures and algorithms required for natural language processing.

Specifically, it provides a POS Tagger (lingo/pos), a Dependency Parser (lingo/dep), and a basic tokenizer (lingo/lexer) for English. It also provides data structures for holding corpora (lingo/corpus) and treebanks (lingo/treebank).

The aim of this package is to provide a production quality pipeline for natural language processing.

Install

The package is go-gettable: go get -u github.com/chewxy/lingo

This package and its subpackages depend on very few external packages. Here they are:

Package    | Used For                                                             | Vitality                                                        | Notes                            | Licence
-----------|----------------------------------------------------------------------|-----------------------------------------------------------------|----------------------------------|------------------------------------
gorgonia   | Machine learning                                                     | Vital. It won't be hard to rewrite it, but why?                 | Same author                      | Gorgonia licence (Apache 2.0-like)
gographviz | Visualization of annotations, and other graph-related visualizations | Vital for visualizations, which are a nice-to-have feature      | API last changed 12th April 2017 | gographviz licence (Apache 2.0)
errors     | Errors                                                               | The package won't die without it, but it's a very nice-to-have  | Stable API for the past year     | errors licence (MIT/BSD-like)
set        | Set operations                                                       | Can be easily replaced                                          | Stable API for the past year     | set licence (MIT/BSD-like)

Usage

See the individual packages for usage. There are also a number of executables in the cmd directory; they are meant as examples of how a natural language processing pipeline can be set up.

A natural language pipeline with this package is heavily channel-driven. Here is an example of dependency parsing:

package main

import (
	"strings"

	"github.com/chewxy/lingo/dep"
	"github.com/chewxy/lingo/lexer"
	"github.com/chewxy/lingo/pos"
)

// posModel and depModel are assumed to have been loaded elsewhere.
func main() {
	inputString := `The cat sat on the mat`
	lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words
	pt := pos.New(pos.WithModel(posModel))                   // POS tagger - required to tag the words with part-of-speech tags
	dp := dep.New(depModel)                                  // creates a new dependency parser

	// set up the pipeline
	pt.Input = lx.Output
	dp.Input = pt.Output

	// run all three stages concurrently
	go lx.Run()
	go pt.Run()
	go dp.Run()

	// wait to receive either a parse or an error
	for {
		select {
		case d := <-dp.Output:
			_ = d // do something with the dependency parse
		case err := <-dp.Error:
			_ = err // handle the error
		}
	}
}

How It Works

For specific tasks (POS tagging, parsing, named entity recognition, etc.), refer to the README of each subpackage. This package on its own mainly provides the data structures that the subpackages use.

Perhaps the most important data structure is the *Annotation structure. It holds a word and the associated metadata for that word.
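
To make this concrete, here is a minimal sketch of the concept; the field names below are illustrative, not the package's exact definition:

	// annotation is an illustrative simplification of lingo's *Annotation:
	// a word plus the metadata that each pipeline stage attaches to it.
	// These field names are hypothetical.
	type annotation struct {
		Word  string       // the raw word, as produced by the lexer
		Lemma string       // the lemmatized form
		POS   lingo.POSTag // part-of-speech tag, filled in by the POS tagger
		Head  *annotation  // syntactic head, filled in by the dependency parser
	}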

For dependency parses, the graph takes three forms: *Dependency, *DependencyTree and *Annotation. All three forms are convertible from one to another. TODO: explain the rationale behind each data type.

Quirks

Very Oddly Specific POS Tags and Dependency Rel Types

A particular quirk you may have noticed is that the POSTag and DependencyType types are hardcoded as constants. The package in fact provides two variations of each: one from the Stanford/Penn Treebank project and one from Universal Dependencies.

The main reason for hardcoding these is performance: knowing ahead of time how much to allocate saves the program a lot of additional work. It also reduces the chances of mutating a global variable.

Of course, this comes with a tradeoff: programs are limited to these two options. Thankfully there are only a limited number of POS tag and dependency relation types, and the two most popular sets (Stanford/PTB and Universal Dependencies) have been implemented.
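
A minimal sketch of the allocation argument; MAXTAG and the tag names below are hypothetical stand-ins, not the package's actual constants:

	type POSTag byte

	const (
		X POSTag = iota // unknown
		NOUN
		VERB
		// ... the rest of the tagset
		MAXTAG // sentinel: the size of the tagset is a compile-time constant
	)

	// Because MAXTAG is known at compile time, a per-tag frequency table
	// can be a fixed-size array rather than a heap-allocated map, and no
	// code can mutate the tagset at runtime.
	func countTags(tags []POSTag) [MAXTAG]int {
		var counts [MAXTAG]int
		for _, t := range tags {
			counts[t]++
		}
		return counts
	}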

The following build tags are supported:

  • stanfordtags
  • universaltags
  • stanfordrel
  • universalrel

To use a specific tagset or relset, build your program with the corresponding tag: go build -tags='stanfordtags'. To combine a tagset with a relation set, pass both: go build -tags='stanfordtags stanfordrel'.

The default tag and dependency relation types are the Universal Dependencies versions.
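
As a sketch of the mechanism (the file contents are illustrative, not the package's actual files), a build-tag-guarded file is compiled only when its tag is passed to go build:

	// +build stanfordtags

	package lingo

	// This file would only be compiled with -tags='stanfordtags',
	// supplying the Stanford/PTB tagset; an equivalently guarded file
	// supplies the Universal Dependencies tagset by default.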

Lexer

You should also note that the tokenizer, lingo/lexer, is not your usual run-of-the-mill NLP tokenizer. It tokenizes by whitespace, with some specific rules for English. It was inspired by Rob Pike's talk on lexers; I thought it'd be cool to write something like that for NLP.
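
For example, the lexer can be run on its own. Here is a minimal sketch based on the pipeline example above; it assumes the Output channel is closed once the input is exhausted:

	package main

	import (
		"fmt"
		"strings"

		"github.com/chewxy/lingo/lexer"
	)

	func main() {
		lx := lexer.New("example", strings.NewReader("Don't panic, world!"))
		go lx.Run()

		// Drain the token channel. This assumes the lexer closes Output
		// when it reaches the end of its input.
		for tok := range lx.Output {
			fmt.Printf("%v\n", tok)
		}
	}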

The test cases in package lingo/lexer showcase how it handles Unicode and other pathological English.

Contributing

See CONTRIBUTING.md for more information.

Licence

This package is licenced under the MIT licence.

lingo's People

Contributors

chewxy, functionary, glaslos, xeoncross, ynqa


lingo's Issues

dep binary overwrites dep binary

You are creating a binary called dep that overwrites the dep binary from the dependency management tool. Regardless of who claimed the name first, this causes some unexpected behavior 😄

Const overflows on ARM builds

../../const.go:58: constant 1000000000000 overflows int
../../const.go:59: constant 1000000000000000 overflows int

Tasks

  • Create a version of const.go for ARM that doesn't have those two numbers
  • ALTERNATIVE SOLUTION: use int64 for NumberWords (see the sketch below)
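
A minimal sketch of the alternative solution; NumberWords is the name from the issue, while the shape of the table and the word keys are assumptions:

	// On 32-bit ARM, int is 32 bits wide, so the untyped constants
	// 1000000000000 and 1000000000000000 overflow when used as int.
	// Typing the values explicitly as int64 sidesteps the overflow.
	var numberWords = map[string]int64{
		"trillion":    1000000000000,
		"quadrillion": 1000000000000000,
	}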

Issue training on ud-treebanks

Trying to train a treebank against ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu, I've noticed that all rows that have a head field with the value _ cause a panic.

Any ideas on how to deal with this data within the library? Happy to submit a PR on any advice given.

# sent_id = answers-20111108072305AAPJTjj_ans-0005
# text = It's more compact, ISO 6400 capability (SX40 only 3200), faster lens at f/2 and the SX40 only f/2.7.
1	It	it	PRON	PRP	Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs	4	nsubj	4:nsubj	SpaceAfter=No
2	's	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	cop	4:cop	_
3	more	more	ADV	RBR	_	4	advmod	4:advmod	_
4	compact	compact	ADJ	JJ	Degree=Pos	0	root	0:root	SpaceAfter=No
5	,	,	PUNCT	,	_	8	punct	8:punct	_
6	ISO	iso	NOUN	NN	Number=Sing	8	compound	8:compound	_
7	6400	6400	NUM	CD	NumType=Card	6	nummod	6:nummod	_
8	capability	capability	NOUN	NN	Number=Sing	4	list	4:list	_
9	(	(	PUNCT	-LRB-	_	10	punct	10:punct|10.1:punct	SpaceAfter=No
10	SX40	SX40	PROPN	NNP	Number=Sing	8	parataxis	8:parataxis|10.1:nsubj	_
#dies on the following line
10.1	has	have	VERB	VBZ	_	_	_	8:parataxis	CopyOf=-1
11	only	only	ADV	RB	_	12	advmod	12:advmod	_
12	3200	3200	NUM	CD	NumType=Card	10	orphan	10.1:obj	SpaceAfter=No
13	)	)	PUNCT	-RRB-	_	10	punct	10:punct|10.1:punct	SpaceAfter=No
14	,	,	PUNCT	,	_	8	punct	8:punct	_
15	faster	faster	ADJ	JJR	Degree=Cmp	16	amod	16:amod	_
16	lens	lens	NOUN	NN	Number=Sing	4	list	4:list	_
17	at	at	ADP	IN	_	18	case	18:case	_
18	f/2	f/2	NOUN	NN	Number=Sing	16	nmod	16:nmod:at	_
19	and	and	CCONJ	CC	_	21	cc	21:cc|21.1:cc	_
20	the	the	DET	DT	Definite=Def|PronType=Art	21	det	21:det	_
21	SX40	SX40	PROPN	NNP	Number=Sing	16	conj	16:conj:and|21.1:nsubj	_
21.1	has	have	VERB	VBZ	_	_	_	16:conj:and	CopyOf=-1
22	only	only	ADJ	JJ	Degree=Pos	23	amod	23:amod	_
23	f	f	NOUN	NN	Number=Sing	21	orphan	21.1:obj	SpaceAfter=No
24	/	/	PUNCT	,	_	23	punct	23:punct	SpaceAfter=No
25	2.7	2.7	NUM	CD	NumType=Card	23	nummod	23:nummod	SpaceAfter=No
26	.	.	PUNCT	.	_	4	punct	4:punct	_
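
For context, rows with IDs like 10.1 are CoNLL-U "empty nodes", and their HEAD field is _ by definition. Below is a hedged workaround sketch (not part of the library) that strips such rows before training:

	package preprocess

	import "strings"

	// stripEmptyNodes drops CoNLL-U empty nodes - rows whose ID column
	// contains a period, such as "10.1" - since their HEAD field is
	// always "_" and such rows panic the treebank loader.
	func stripEmptyNodes(conllu string) string {
		var out []string
		for _, line := range strings.Split(conllu, "\n") {
			if !strings.HasPrefix(line, "#") {
				id := strings.SplitN(line, "\t", 2)[0]
				if strings.Contains(id, ".") {
					continue // skip empty nodes like "10.1"
				}
			}
			out = append(out, line)
		}
		return strings.Join(out, "\n")
	}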

Standardize Field Names

The inconsistency is obvious when you see this:

func pipeline(name string, f io.Reader) (*lingo.Dependency, error) {
	l := lexer.New(name, f)
	p := pos.New(pos.WithModel(posModel), pos.WithCluster(clusters), pos.WithStemmer(stemmer{}), pos.WithLemmatizer(fixer{}))
	d := dep.New(depModel)

	// set up pipeline
	p.Input = l.Output
	d.Input = p.Output
	go l.Run()
	go p.Run()
	go d.Run()

	select {
	case err := <-l.Errors: // note the plural: "Errors"
		return nil, err
	case err := <-d.Error: // singular - should be named "Errors" for consistency
		return nil, err
	case dep := <-d.Output:
		return dep, nil
	}
}

Space needs work

"hello there, world" and "hello there , world" yields different parses. Should look into lexer

I tried to train a dependency parser model with your code, but it keeps failing to train.

I copied the training code that you offered in cmd/dep and ran it with minimal modifications, such as editing some file directories.

It fails, reporting that the lookupTransition step does not work.

What I found is that the transitions parsed from the new parser model are not included in the total transition array formed from the training dataset, because the model just spits out an empty array.

So I started to inspect the code in the dep package, and found that after the *configuration object is formed as the output of the newConfiguration function, the enclosing function terminates and does not proceed to the array-forming part. I cite part of your code from dependencyParser.go as an example:

	c := newConfiguration(sentence, false)

	var err error
	var argmax int
	var count int
	// the following part of code won't be executed. This function terminates at this point. I couldn't figure out why.
	for !c.isTerminal() && count < 100 {
		logf("%v", c)
		if count == 99 {
			logf("TARPIT")
		}

		features := getFeatures(c, d.corpus)

If you want to reproduce the issue, I can send you my training dataset, which comes from open-source UD datasets, via email.

I want to stick with your repository, since this is the only reliable Go repository that handles dependency parsing.

lexer design for performance

Hi,
I am really looking forward to using your library.

I suggest using a traditional lexer (without goroutines) now, while the library is young, so that it is eventually production-ready (i.e. able to get the best performance).

See this reference:

That talk was about a lexer, but the deeper purpose was to demonstrate how concurrency can make programs nice even without obvious parallelism in the problem. And like many such uses of concurrency, the code is pretty but not necessarily fast.

I think it's a fine approach to a lexer if you don't care about performance. It is significantly slower than some other approaches but is very easy to adapt. I used it in ivy, for example, but just so you know, I'm probably going to replace the one in ivy with a more traditional model to avoid some issues with the lexer accessing global state. You don't care about that for your application, I'm sure.

So: It's pretty and nice to work on, but you'd probably not choose that approach for a production compiler.

-rob
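
To make the tradeoff concrete, here is a self-contained sketch contrasting the two designs; the types are hypothetical, not lingo's API:

	package main

	import "fmt"

	// token is a hypothetical token type for this sketch.
	type token string

	// channelLex mimics the style of lingo/lexer: a goroutine sends
	// tokens over a channel. Pretty and composable, but every token
	// pays for channel synchronization.
	func channelLex(words []string) <-chan token {
		out := make(chan token)
		go func() {
			defer close(out)
			for _, w := range words {
				out <- token(w)
			}
		}()
		return out
	}

	// pullLexer is the traditional alternative: the caller pulls tokens
	// with a method call. No goroutines, no per-token synchronization.
	type pullLexer struct {
		words []string
		pos   int
	}

	func (l *pullLexer) next() (token, bool) {
		if l.pos >= len(l.words) {
			return "", false
		}
		t := token(l.words[l.pos])
		l.pos++
		return t, true
	}

	func main() {
		words := []string{"The", "cat", "sat"}
		for t := range channelLex(words) {
			fmt.Println("channel:", t)
		}
		l := &pullLexer{words: words}
		for t, ok := l.next(); ok; t, ok = l.next() {
			fmt.Println("pull:", t)
		}
	}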

Fix NER tagging

The NER tagging feature is not yet included. This should be fixed, but it should wait until Gorgonia fully supports a CUDA backend for the lispMachine.
