constannnnnt / distributed-corenlp

This infrastructure, built on Stanford CoreNLP, MapReduce, and Spark in Java, aims to process document annotations at large scale.

Home Page: https://github.com/Constannnnnt/Distributed-CoreNLP

License: MIT License

Java 95.15% Python 4.85%
apache-spark big-data java mapreduce mapreduce-java natural-language-processing nlp nlp-parsing spark stanford-corenlp

distributed-corenlp's People

Contributors

constannnnnt, ji-xin, kaisonghuang


distributed-corenlp's Issues

Task not serializable Exception

I think the exception may be caused by nested map operations.

docs.map(doc -> {
        pipeline.annotate(doc);
        return doc.tokens().stream()
            .map(token -> "(" + token.word() + "," + token.ner() + ")")
            .collect(Collectors.joining(" "));
    })
    .saveAsTextFile(_args.output);
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:371)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.map(RDD.scala:370)
        at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:93)
        at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:45)
        at ca.uwaterloo.cs651.project.CoreNLP.main(CoreNLP.java:52)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: edu.stanford.nlp.pipeline.StanfordCoreNLP
Serialization stack:
        - object not serializable (class: edu.stanford.nlp.pipeline.StanfordCoreNLP, value: edu.stanford.nlp.pipeline.StanfordCoreNLP@225bfe78)
        - element of array (index: 0)
        - array (class [Ljava.lang.Object;, size 1)
        - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
        - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class ca.uwaterloo.cs651.project.CoreNLP, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic ca/uwaterloo/cs651/project/CoreNLP.lambda$main$94619a37$1:(Ledu/stanford/nlp/pipeline/StanfordCoreNLP;Ledu/stanford/nlp/pipeline/CoreDocument;)Ljava/lang/String;, instantiatedMethodType=(Ledu/stanford/nlp/pipeline/CoreDocument;)Ljava/lang/String;, numCaptured=1])
        - writeReplace data (class: java.lang.invoke.SerializedLambda)
        - object (class ca.uwaterloo.cs651.project.CoreNLP$$Lambda$61/2043561063, ca.uwaterloo.cs651.project.CoreNLP$$Lambda$61/2043561063@4ff31e98)
        - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
        - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
        ... 22 more
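
The serialization stack shows the actual culprit: the lambda captures the StanfordCoreNLP instance built on the driver, and that class is not serializable. A minimal sketch of one common workaround, assuming docs is a JavaRDD<CoreDocument> as in the snippet above and props is a plain java.util.Properties (which is serializable): build the pipeline inside mapPartitions, so it is constructed on the executor instead of being shipped with the closure.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Sketch only: `docs`, `props` and `_args.output` are assumed to exist as in the snippet above.
docs.mapPartitions(iter -> {
        // Built on the executor, once per partition, so nothing non-serializable is
        // captured from the driver; only `props` travels with the closure.
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        List<String> out = new ArrayList<>();
        while (iter.hasNext()) {
            CoreDocument doc = iter.next();
            pipeline.annotate(doc);
            out.add(doc.tokens().stream()
                .map(token -> "(" + token.word() + "," + token.ner() + ")")
                .collect(Collectors.joining(" ")));
        }
        return out.iterator();
    })
    .saveAsTextFile(_args.output);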

Malfunctions in SimpleNLP

  1. depparse: unclear what to pass for governor(int), and the output format looks strange (see the sketch below)
  2. dcoref: CPU time limit exceeded

This doesn't mean we need to solve these now; we can come back to them after finishing CoreNLP.
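
For (1), if I read the Simple API right, Sentence.governor(int) takes a zero-based token index and returns the index of that token's head in the dependency parse, so it is meant to be called once per token. A hedged sketch (the exact method names and the root convention should be double-checked against edu.stanford.nlp.simple.Sentence):

import java.util.Optional;
import edu.stanford.nlp.simple.Sentence;

// Assumption: governor(i) returns the head index of token i (empty or -1 for the root),
// and incomingDependencyLabel(i) returns the relation label on that edge.
Sentence sent = new Sentence("The University of Waterloo is located in Canada.");
for (int i = 0; i < sent.length(); i++) {
    Optional<Integer> head = sent.governor(i);
    String headWord = head.filter(h -> h >= 0).map(sent::word).orElse("ROOT");
    String label = sent.incomingDependencyLabel(i).orElse("root");
    System.out.println(sent.word(i) + " <-" + label + "- " + headWord);
}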

Use the Shift-Reduce Parser

I haven't figured out which default model CoreNLP uses to parse sentences. I tried to explicitly set the following property to use englishSR, which CoreNLP says has better performance:

// use faster shift reduce parser
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");

And I got the following error:

edu.stanford.nlp.io.RuntimeIOException: java.io.IOException: Unable to open "edu/stanford/nlp/models/srparser/englishSR.ser.gz" as class path, filename or URL

I have just commented out the above line for now; maybe we can work on it later.
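
For reference, the default parse.model is the PCFG model edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz. The englishSR model is not in the default models jar, which is why the classpath lookup fails. A sketch of how I would wire it up, assuming the shift-reduce model ships in the extra English models jar for our CoreNLP version (the jar name below is an assumption to verify):

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Assumption: englishSR.ser.gz lives in the extra English models jar
// (e.g. stanford-corenlp-<version>-models-english.jar), so that jar has to be on the
// classpath of the driver AND the executors, e.g. via
//   spark-submit --jars stanford-corenlp-<version>-models-english.jar ...
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse");
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);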

OOM Error when running NER and DCOREF on SimpleNLP (SUTime)

An error that is very likely caused by SUTime:

2018-11-12 16:42:59 INFO  TimeExpressionExtractorImpl:88 - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
2018-11-12 16:43:46 ERROR Utils:91 - Aborting task
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.asList(Arrays.java:3800)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:808)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:615)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.<init>(TokensRegexNERAnnotator.java:315)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.setUpFineGrainedNER(NERCombinerAnnotator.java:220)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:145)
        at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
        at edu.stanford.nlp.simple.Document$2.lambda$null$0(Document.java:111)
        at edu.stanford.nlp.simple.Document$2$$Lambda$61/336633101.get(Unknown Source)
        at edu.stanford.nlp.util.Lazy$2.compute(Lazy.java:106)
        at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
        at edu.stanford.nlp.simple.Document$2.get(Document.java:111)
        at edu.stanford.nlp.simple.Document$2.get(Document.java:106)
        at edu.stanford.nlp.simple.Document.runNER(Document.java:853)
        at edu.stanford.nlp.simple.Sentence.nerTags(Sentence.java:528)
        at edu.stanford.nlp.simple.Sentence.nerTags(Sentence.java:536)
        at ca.uwaterloo.cs651.project.SimpleNLP.lambda$main$fceadcfc$1(SimpleNLP.java:96)
        at ca.uwaterloo.cs651.project.SimpleNLP$$Lambda$27/363625212.call(Unknown Source)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:125)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
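
The GC-overhead error fires while TokensRegexNERAnnotator is loading the fine-grained NER rule files, i.e. during model loading, which suggests the executors simply run out of heap for the CoreNLP models. One thing worth trying (the values and jar/argument names are placeholders, subject to the cluster's limits) is to give the driver and executors more memory at submit time:

spark-submit --class ca.uwaterloo.cs651.project.SimpleNLP \
  --driver-memory 4G --executor-memory 8G \
  <project-jar> <input> <output>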

Quote Mentions

I finished quote, but it doesn't work as expected. Maybe I introduced a bug?

With the input

In the summer Joe Smith decided to go on vacation. He said, "I'm going to Hawaii." That July, vacationer Joe went to Hawaii.

the output was:

((0,quote),())
((0,natlog),(The,up) (University,up) (of,up) (Waterloo,up) (is,up) (located,up) (in,up) (Canada,up) (.,up) (Goose,up) (lives,up) (in,up) (this,up) (University,up) (.,up))
((1,quote),())
((1,natlog),(The,up) (University,up) (of,up) (Waterloo,up) (is,up) (located,up) (in,up) (Canada,up) (.,up) (Goose,up) (lives,up) (here,up) (.,up))
((2,quote),())
((2,natlog),(all,up) (cats,down) (have,up) (tails,up))
((3,quote),("I'm going to Hawaii.",He)
((3,natlog),(In,up) (the,up) (summer,up) (Joe,up) (Smith,up) (decided,up) (to,up) (go,up) (on,up) (vacation,up) (.,up) (He,up) (said,up) (,,up) (``,up) (I,up) ('m,up) (going,up) (to,up) (Hawaii,up) (.,up) ('',up) (That,up) (July,up) (,,up) (vacationer,up) (Joe,up) (went,up) (to,up) (Hawaii,up) (.,up))

Note: ((3,quote),("I'm going to Hawaii.",He) is returned, but the expected one is ((3,quote),("I'm going to Hawaii.",Joe Smith). Any ideas on this?
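
This may not be a bug in our code: without coref in the pipeline, quote attribution can only report the surface mention ("He"), not the canonical speaker. A sketch of what I believe resolves it (the annotator list and the CoreQuote accessors should be verified against the CoreNLP version we pin):

import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreQuote;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
// quote attribution needs ner/depparse/coref upstream to resolve "He" -> "Joe Smith"
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,coref,quote");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

CoreDocument doc = new CoreDocument(
    "In the summer Joe Smith decided to go on vacation. "
    + "He said, \"I'm going to Hawaii.\" That July, vacationer Joe went to Hawaii.");
pipeline.annotate(doc);
for (CoreQuote quote : doc.quotes()) {
    // speaker() is the surface mention ("He"); canonicalSpeaker() should be "Joe Smith"
    System.out.println(quote.text() + " -> " + quote.speaker().orElse("?")
        + " / " + quote.canonicalSpeaker().orElse("?"));
}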

To Do:

I have finished setting up the framework for Spark and CoreNLP, including the annotator dependency stuff.

What we need to do:

  1. Take a look at CoreNLP.java, lines 36-50. Those are the functionalities we need to implement (at least most of them).
  2. Same file, lines 155-183: this is the major part. Each line is read in as a pair (Index, Line), and the expected output is an iterator of arrays: [((Index, Func0), Output0), ..., ((Index, FuncN), OutputN)]. I have implemented the two easiest functionalities and the others should be done in the same way. Don't underestimate this: Stanford CoreNLP is poorly documented and this can get nasty.
  3. Once those functionalities are done, at line 184 (after the flatMap), group by functionality and then sort by Index. The output of each functionality should go into its own folder, and the order of sentences must be preserved within each folder (see the sketch after this list).
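
For (3), a rough sketch of the grouping/sorting step, assuming the flatMap produces a JavaPairRDD keyed by (Index, Func) exactly as described in (2); the helper method below is hypothetical and would live in CoreNLP.java:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Hypothetical helper: writes one output folder per functionality, preserving line order.
static void writeByFunctionality(JavaPairRDD<Tuple2<Long, String>, String> results,
                                 String outputDir, String[] funcs) {
    for (String func : funcs) {
        results
            .filter(pair -> pair._1()._2().equals(func))                 // keep one functionality
            .mapToPair(pair -> new Tuple2<>(pair._1()._1(), pair._2()))  // key by line index only
            .sortByKey()                                                 // restore input order
            .values()
            .saveAsTextFile(outputDir + "/" + func);                     // separate folder per func
    }
}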

FuncTODO

It seems that, at the moment, if we want to run a complex functionality such as "coref", it also prints the results of every functionality that comes before it in the pipeline.

For example, if the input argument for the functionalities is cleanxml, ssplit, coref, it executes every annotator in the pipeline and adds all of their results to the RDD:

((0,tokenize),The-1 University-2 of-3 Waterloo-4 is-5 located-6 in-7 Canada-8 .-9 Goose-1 lives-2 in-3 this-4 University-5 .-6)
((0,cleanxml),The-1 University-2 of-3 Waterloo-4 is-5 located-6 in-7 Canada-8 .-9 Goose-1 lives-2 in-3 this-4 University-5 .-6)
((0,ssplit),The University of Waterloo is located in Canada.|Goose lives in this University.)
((0,pos),(The,DT) (University,NNP) (of,IN) (Waterloo,NNP) (is,VBZ) (located,JJ) (in,IN) (Canada,NNP) (.,.) (Goose,NN) (lives,VBZ) (in,IN) (this,DT) (University,NNP) (.,.))
((0,ner),(The,O) (University,ORGANIZATION) (of,ORGANIZATION) (Waterloo,ORGANIZATION) (is,O) (located,O) (in,O) (Canada,COUNTRY) (.,O) (Goose,O) (lives,O) (in,O) (this,O) (University,O) (.,O))
((0,coref),The University of Waterloo:this University; )
((1,tokenize),The-1 University-2 of-3 Waterloo-4 is-5 located-6 in-7 Canada-8 .-9 Goose-1 lives-2 here-3 .-4)
((1,cleanxml),The-1 University-2 of-3 Waterloo-4 is-5 located-6 in-7 Canada-8 .-9 Goose-1 lives-2 here-3 .-4)
((1,ssplit),The University of Waterloo is located in Canada.|Goose lives here.)
((1,pos),(The,DT) (University,NNP) (of,IN) (Waterloo,NNP) (is,VBZ) (located,JJ) (in,IN) (Canada,NNP) (.,.) (Goose,NN) (lives,NNS) (here,RB) (.,.))
((1,ner),(The,O) (University,ORGANIZATION) (of,ORGANIZATION) (Waterloo,ORGANIZATION) (is,O) (located,O) (in,O) (Canada,COUNTRY) (.,O) (Goose,O) (lives,O) (here,O) (.,O))
((1,coref),)

I don't see any need for this; I think we should only return the results for the requested functionalities (a filtering sketch follows).
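
A minimal sketch of that filtering, assuming the flatMap output is keyed by (Index, Func) as in the To Do above; the variable names are placeholders:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// The pipeline still runs whatever prerequisite annotators it needs, but only the
// functionalities the user asked for end up in the output RDD.
Set<String> requested = new HashSet<>(Arrays.asList("cleanxml", "ssplit", "coref"));
JavaPairRDD<Tuple2<Long, String>, String> filtered =
    results.filter(pair -> requested.contains(pair._1()._2()));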

Test failed: file does not exist on Datasci

2018-11-13 16:13:31 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://name1..xxxx../simpledata;
	at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:715)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:344)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
	at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:732)
	at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:702)
	at ca.uwaterloo.cs651.project.CoreNLP.main(CoreNLP.java:51)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-11-13 16:13:31 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2018-11-13 16:13:31 INFO  AbstractConnector:318 - Stopped Spark@79c4715d{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}

The local file cannot be accessed on Datasci. Is there any way we can work around it?
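
On the cluster, Spark resolves relative paths against HDFS, so data that only exists on the local disk of the submitting node is invisible to the job; a file:// URI only works if the path exists on every worker. The usual workaround is to copy the data into HDFS first (paths below are placeholders):

hdfs dfs -mkdir -p simpledata
hdfs dfs -put /local/path/to/simpledata/* simpledata/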

OOM Error when running NATLOG and OPENIE on SimpleNLP (dependency parser)

An error that is likely caused by the dependency parser:

2018-11-12 16:46:10 INFO  MaxentTagger:88 - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
2018-11-12 16:46:10 INFO  DependencyParser:88 - Loading depparse model: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... 
2018-11-12 16:46:16 ERROR Utils:91 - Aborting task
java.lang.OutOfMemoryError: Java heap space
        at edu.stanford.nlp.parser.nndep.Classifier.preCompute(Classifier.java:662)
        at edu.stanford.nlp.parser.nndep.Classifier.preCompute(Classifier.java:644)
        at edu.stanford.nlp.parser.nndep.DependencyParser.initialize(DependencyParser.java:1189)
        at edu.stanford.nlp.parser.nndep.DependencyParser.loadModelFile(DependencyParser.java:630)
        at edu.stanford.nlp.parser.nndep.DependencyParser.loadFromModelFile(DependencyParser.java:499)
        at edu.stanford.nlp.pipeline.DependencyParseAnnotator.<init>(DependencyParseAnnotator.java:57)
        at edu.stanford.nlp.pipeline.AnnotatorImplementations.dependencies(AnnotatorImplementations.java:240)
        at edu.stanford.nlp.simple.Document$5.lambda$null$0(Document.java:147)
        at edu.stanford.nlp.simple.Document$5$$Lambda$63/1641616399.get(Unknown Source)
        at edu.stanford.nlp.util.Lazy$2.compute(Lazy.java:106)
        at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
        at edu.stanford.nlp.simple.Document$5.get(Document.java:147)
        at edu.stanford.nlp.simple.Document$5.get(Document.java:142)
        at edu.stanford.nlp.simple.Document.runDepparse(Document.java:918)
        at edu.stanford.nlp.simple.Document.runNatlog(Document.java:938)
        at edu.stanford.nlp.simple.Sentence.natlogPolarities(Sentence.java:897)
        at edu.stanford.nlp.simple.Sentence.natlogPolarities(Sentence.java:905)
        at ca.uwaterloo.cs651.project.SimpleNLP.lambda$main$71c840f9$1(SimpleNLP.java:114)
        at ca.uwaterloo.cs651.project.SimpleNLP$$Lambda$27/273072814.call(Unknown Source)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:125)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
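
This looks like the same class of failure as the SUTime issue above: the executor heap is too small while the depparse model is loaded and pre-computed. Besides the spark-submit flags shown there, the executor memory can also be set programmatically; a sketch (the value is a placeholder and must be set before the SparkContext is created):

import org.apache.spark.SparkConf;

// Executor memory set in code takes effect because executors are launched after the
// context is created; driver memory, by contrast, has to be passed to spark-submit.
SparkConf conf = new SparkConf()
    .setAppName("SimpleNLP")
    .set("spark.executor.memory", "8g");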

Efficiency

We are not sure which of the following Spark actually does:

  1. construct a StanfordCoreNLP object when a mapper is initialized, then reuse it to map each input
  2. construct a new StanfordCoreNLP object for each input

If (2) is the case, performance will be unbearable.

Experiments are currently running to check which is the case (see the sketch below for forcing behaviour (1)).
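
For what it's worth, the Task-not-serializable issue above suggests the answer is neither: an object referenced in a map closure is serialized from the driver with the task (which fails for StanfordCoreNLP), while building it inside a per-record lambda would indeed mean one pipeline per input. One way to force behaviour (1) is a per-JVM lazy holder, so each executor builds the pipeline at most once and reuses it across tasks. A rough sketch (the class and method names are made up):

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Hypothetical helper: one StanfordCoreNLP instance per executor JVM, built lazily.
public class PipelineHolder {
    private static StanfordCoreNLP pipeline;

    public static synchronized StanfordCoreNLP get(String annotators) {
        if (pipeline == null) {
            Properties props = new Properties();
            props.setProperty("annotators", annotators);  // e.g. "tokenize,ssplit,pos,ner"
            pipeline = new StanfordCoreNLP(props);
        }
        return pipeline;
    }
}

// In the map/mapPartitions lambda, call PipelineHolder.get(annotators) instead of capturing
// a driver-side pipeline; the closure then only carries the annotators String.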
