Giter Club home page Giter Club logo

stanford-corenlp-python's Introduction

A Python wrapper for the Java Stanford Core NLP tools


This is a fork of Dustin Smith's stanford-corenlp-python, a Python interface to Stanford CoreNLP. It can either use as python package, or run as a JSON-RPC server.

Edited

  • Added multi-threaded load balancing
  • Update to Stanford CoreNLP v3.2.0
  • Fix many bugs & improve performance
  • Using jsonrpclib for stability and performance
  • Can edit the constants as argument such as Stanford Core NLP directory
  • Adjust parameters not to timeout in high load
  • Fix a problem with long text input by Johannes Castner stanford-corenlp-python
  • Packaging

Requirements

Download and Usage

To use this program you must download and unpack the zip file containing Stanford's CoreNLP package. By default, corenlp.py looks for the Stanford Core NLP folder as a subdirectory of where the script is being run.

In other words:

sudo pip install pexpect unidecode jsonrpclib   # jsonrpclib is optional
git clone https://bitbucket.org/torotoki/corenlp-python.git
  cd corenlp-python
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip
unzip stanford-corenlp-full-2013-06-20.zip

Then, to launch a server:

python corenlp/corenlp.py

Optionally, you can specify a host or port:

python corenlp/corenlp.py -H 0.0.0.0 -p 3456

For additional concurrency, you can add a load-balancing layer on top:

python corenlp/corenlp.py --ports=8081,8082,8083,8084

That will run a public JSON-RPC server on port 3456. And you can specify Stanford CoreNLP directory:

python corenlp/corenlp.py -S stanford-corenlp-full-2013-06-20/

Assuming you are running on port 8080 and CoreNLP directory is stanford-corenlp-full-2013-06-20/ in current directory, the code in client.py shows an example parse:

import jsonrpclib
from simplejson import loads
server = jsonrpclib.Server("http://localhost:8080")

result = loads(server.parse("Hello world.  It is so beautiful"))
print "Result", result

If you are using the load balancing component, then you can use the following approach:

import jsonrpclib
from simplejson import loads
server = jsonrpclib.Server("http://localhost:8080")

result = loads(server.send("Hello world.  It is so beautiful"))
print "Result", server.getForKey(result['key'])

# asynchronous parsing and retrieval
sents = [ 'add in as many sentences as you want', 'your mileage may vary' ]
for sent in sents:
	server.send(sent)
# this approach is non-blocking
print server.getCompleted()
# this approach waits for all in-progress parses to finish (i.e. blocks)
print server.getAll()

That returns a dictionary containing the keys sentences and (when applicable) corefs. The key sentences contains a list of dictionaries for each sentence, which contain parsetree, text, tuples containing the dependencies, and words, containing information about parts of speech, NER, etc:

{u'sentences': [{u'parsetree': u'(ROOT (S (VP (NP (INTJ (UH Hello)) (NP (NN world)))) (. !)))',
                 u'text': u'Hello world!',
                 u'tuples': [[u'dep', u'world', u'Hello'],
                             [u'root', u'ROOT', u'world']],
                 u'words': [[u'Hello',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'5',
                              u'Lemma': u'hello',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'UH'}],
                            [u'world',
                             {u'CharacterOffsetBegin': u'6',
                              u'CharacterOffsetEnd': u'11',
                              u'Lemma': u'world',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'NN'}],
                            [u'!',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'12',
                              u'Lemma': u'!',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]},
                {u'parsetree': u'(ROOT (S (NP (PRP It)) (VP (VBZ is) (ADJP (RB so) (JJ beautiful))) (. .)))',
                 u'text': u'It is so beautiful.',
                 u'tuples': [[u'nsubj', u'beautiful', u'It'],
                             [u'cop', u'beautiful', u'is'],
                             [u'advmod', u'beautiful', u'so'],
                             [u'root', u'ROOT', u'beautiful']],
                 u'words': [[u'It',
                             {u'CharacterOffsetBegin': u'14',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'it',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'PRP'}],
                            [u'is',
                             {u'CharacterOffsetBegin': u'17',
                              u'CharacterOffsetEnd': u'19',
                              u'Lemma': u'be',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'so',
                             {u'CharacterOffsetBegin': u'20',
                              u'CharacterOffsetEnd': u'22',
                              u'Lemma': u'so',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'RB'}],
                            [u'beautiful',
                             {u'CharacterOffsetBegin': u'23',
                              u'CharacterOffsetEnd': u'32',
                              u'Lemma': u'beautiful',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'JJ'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'32',
                              u'CharacterOffsetEnd': u'33',
                              u'Lemma': u'.',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]}],
u'coref': [[[[u'It', 1, 0, 0, 1], [u'Hello world', 0, 1, 0, 2]]]]}

Not to use JSON-RPC, load the module instead:

from corenlp import StanfordCoreNLP
corenlp_dir = "stanford-corenlp-full-2013-06-20/"
corenlp = StanfordCoreNLP(corenlp_dir)  # wait a few minutes...
corenlp.raw_parse("Parse it")

If you need to parse long texts (more than 30-50 sentences), you must use a batch_parse function. It reads text files from input directory and returns a generator object of dictionaries parsed each file results:

from corenlp import batch_parse
corenlp_dir = "stanford-corenlp-full-2013-06-20/"
raw_text_directory = "sample_raw_text/"
parsed = batch_parse(raw_text_directory, corenlp_dir)  # It returns a generator object
print parsed  #=> [{'coref': ..., 'sentences': ..., 'file_name': 'new_sample.txt'}]

The function uses XML output feature of Stanford CoreNLP, and you can take all information by raw_output option. If true, CoreNLP's XML is returned as a dictionary without converting the format.

parsed = batch_parse(raw_text_directory, corenlp_dir, raw_output=True)

(note: The function requires xmltodict now, you should install it by sudo pip install xmltodict)

Developer

stanford-corenlp-python's People

Contributors

dasmith avatar torotoki avatar emilmont avatar t33chong avatar abhaga avatar andrewyates avatar antoniomk avatar jcccf avatar

Watchers

lemonhall avatar James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.