spark-corenlp's Introduction

Stanford CoreNLP wrapper for Apache Spark

This package wraps Stanford CoreNLP annotators as Spark DataFrame functions following the simple APIs introduced in Stanford CoreNLP 3.6.0.

This package requires Java 8 and CoreNLP 3.6.0 to run. Users must include CoreNLP model jars as dependencies to use language models.
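One way to satisfy the model-jar requirement in an sbt build is sketched below. It assumes the standard Maven Central coordinates for Stanford CoreNLP and uses the `models` classifier for the English models; adjust the classifier for other languages and verify the coordinates against your own repository.

```scala
// build.sbt fragment (a sketch, not taken from this project's build file)
libraryDependencies ++= Seq(
  // Core library, pinned to the version this package requires.
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  // Same artifact with the "models" classifier, which carries the model files.
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models"
)
```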

All functions are defined under com.databricks.spark.corenlp.functions.

  • cleanxml: Cleans XML tags in a document and returns the cleaned document.
  • tokenize: Tokenizes a sentence into words.
  • ssplit: Splits a document into sentences.
  • pos: Generates the part of speech tags of the sentence.
  • lemma: Generates the word lemmas of the sentence.
  • ner: Generates the named entity tags of the sentence.
  • depparse: Generates the semantic dependencies of the sentence and returns a flattened list of (source, sourceIndex, relation, target, targetIndex, weight) relation tuples.
  • coref: Generates the coref chains in the document and returns a list of (rep, mentions) chain tuples, where mentions are in the format of (sentNum, startIndex, mention).
  • natlog: Generates the Natural Logic notion of polarity for each token in a sentence, returned as up, down, or flat.
  • openie: Generates a list of Open IE triples as flat (subject, relation, target, confidence) tuples.
  • sentiment: Measures the sentiment of an input sentence on a scale of 0 (strong negative) to 4 (strong positive).
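As a hedged illustration of the sentence-level annotators listed above, the sketch below splits a document into sentences and applies `pos` and `lemma` to each one. It assumes a spark-shell session where `sqlContext` is already defined and the CoreNLP model jars are on the classpath; the sample text and the column names (`sen`, `posTags`, `lemmas`) are invented for illustration.

```scala
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

// Assumption: running inside spark-shell, where `sqlContext` is predefined.
import sqlContext.implicits._

// Hypothetical sample document.
val docs = Seq(
  (1, "The cats were sleeping. They woke up early.")
).toDF("id", "text")

val tagged = docs
  // ssplit works on a whole document; explode yields one row per sentence.
  .select(explode(ssplit('text)).as('sen))
  // pos and lemma each take a single sentence.
  .select('sen, pos('sen).as('posTags), lemma('sen).as('lemmas))

tagged.show(truncate = false)
```

Document-level functions such as `cleanxml`, `ssplit`, and `coref` would be applied before the `explode` step, while the sentence-level ones go after it.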

Users can chain the functions to create a pipeline, for example:

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

import sqlContext.implicits._

val input = Seq(
  (1, "<xml>Stanford University is located in California. It is a great university.</xml>")
).toDF("id", "text")

val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

output.show(truncate = false)
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
|sen                                           |words                                                 |nerTags                                           |sentiment|
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]|[ORGANIZATION, ORGANIZATION, O, O, O, LOCATION, O]|1        |
|It is a great university .                    |[It, is, a, great, university, .]                     |[O, O, O, O, O, O]                                |4        |
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
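Because each annotator returns an ordinary DataFrame column, the result can be combined with standard DataFrame operations. The following is a minimal sketch, under the same spark-shell assumption (`sqlContext` in scope, model jars available), that keeps only sentences scored on the positive side of the 0 to 4 sentiment scale. The input text, the threshold of 3, and the names `reviews`, `scored`, and `positive` are illustrative choices, not part of the package.

```scala
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

// Assumption: running inside spark-shell, where `sqlContext` is predefined.
import sqlContext.implicits._

// Hypothetical input with two sentences of differing tone.
val reviews = Seq(
  (1, "It is a great university. The parking is terrible.")
).toDF("id", "text")

val scored = reviews
  // Carry the document id along so each sentence can be traced back.
  .select('id, explode(ssplit('text)).as('sen))
  .select('id, 'sen, sentiment('sen).as('sentiment))

// Keep sentences scored 3 or 4 (illustrative threshold).
val positive = scored.filter('sentiment >= 3)

positive.show(truncate = false)
```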

Acknowledgements

Many thanks to Jason Bolton from the Stanford NLP Group for API discussions.
