constannnnnt / distributed-corenlp

This infrastructure, built on Stanford CoreNLP, MapReduce, and Spark in Java, aims to process document annotations at large scale.

Home Page: https://github.com/Constannnnnt/Distributed-CoreNLP

License: MIT License

Java 95.15% Python 4.85%
apache-spark big-data java mapreduce mapreduce-java natural-language-processing nlp nlp-parsing spark stanford-corenlp

distributed-corenlp's People

Contributors

constannnnnt, ji-xin, kaisonghuang


distributed-corenlp's Issues

Task not serializable Exception

I think the exception may be caused by nested map operations.

docs.map(doc -> {
        pipeline.annotate(doc);
        return doc.tokens().stream()
            .map(token -> "(" + token.word() + "," + token.ner() + ")")
            .collect(Collectors.joining(" "));
    })
    .saveAsTextFile(_args.output);
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:371)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.map(RDD.scala:370)
        at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:93)
        at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:45)
        at ca.uwaterloo.cs651.project.CoreNLP.main(CoreNLP.java:52)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: edu.stanford.nlp.pipeline.StanfordCoreNLP
Serialization stack:
        - object not serializable (class: edu.stanford.nlp.pipeline.StanfordCoreNLP, value: edu.stanford.nlp.pipeline.StanfordCoreNLP@225bfe78)
        - element of array (index: 0)
        - array (class [Ljava.lang.Object;, size 1)
        - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
        - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class ca.uwaterloo.cs651.project.CoreNLP, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic ca/uwaterloo/cs651/project/CoreNLP.lambda$main$94619a37$1:(Ledu/stanford/nlp/pipeline/StanfordCoreNLP;Ledu/stanford/nlp/pipeline/CoreDocument;)Ljava/lang/String;, instantiatedMethodType=(Ledu/stanford/nlp/pipeline/CoreDocument;)Ljava/lang/String;, numCaptured=1])
        - writeReplace data (class: java.lang.invoke.SerializedLambda)
        - object (class ca.uwaterloo.cs651.project.CoreNLP$$Lambda$61/2043561063, ca.uwaterloo.cs651.project.CoreNLP$$Lambda$61/2043561063@4ff31e98)
        - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
        - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
        ... 22 more
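
The serialization stack shows the actual culprit: the lambda captures the StanfordCoreNLP instance built on the driver, and that class is not serializable. A minimal sketch of one common workaround, assuming docs is a JavaRDD<CoreDocument> as in the snippet above and props is a plain java.util.Properties (which is serializable): build the pipeline inside mapPartitions, so it is constructed on the executor instead of being shipped with the closure.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Sketch only: `docs`, `props` and `_args.output` are assumed to exist as in the snippet above.
docs.mapPartitions(iter -> {
        // Built on the executor, once per partition, so nothing non-serializable is
        // captured from the driver; only `props` travels with the closure.
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        List<String> out = new ArrayList<>();
        while (iter.hasNext()) {
            CoreDocument doc = iter.next();
            pipeline.annotate(doc);
            out.add(doc.tokens().stream()
                .map(token -> "(" + token.word() + "," + token.ner() + ")")
                .collect(Collectors.joining(" ")));
        }
        return out.iterator();
    })
    .saveAsTextFile(_args.output);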

Malfunctions in SimpleNLP

  1. depparse: unclear what to pass for governor(int), and the output format looks strange (see the sketch below)
  2. dcoref: CPU time limit exceeded

This doesn't mean we need to solve these now; we can come back to them after finishing CoreNLP.
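
For (1), if I read the Simple API right, Sentence.governor(int) takes a zero-based token index and returns the index of that token's head in the dependency parse, so it is meant to be called once per token. A hedged sketch (the exact method names and the root convention should be double-checked against edu.stanford.nlp.simple.Sentence):

import java.util.Optional;
import edu.stanford.nlp.simple.Sentence;

// Assumption: governor(i) returns the head index of token i (empty or -1 for the root),
// and incomingDependencyLabel(i) returns the relation label on that edge.
Sentence sent = new Sentence("The University of Waterloo is located in Canada.");
for (int i = 0; i < sent.length(); i++) {
    Optional<Integer> head = sent.governor(i);
    String headWord = head.filter(h -> h >= 0).map(sent::word).orElse("ROOT");
    String label = sent.incomingDependencyLabel(i).orElse("root");
    System.out.println(sent.word(i) + " <-" + label + "- " + headWord);
}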

Use the Shift-Reduce Parser

I haven't figured out which default model CoreNLP uses to parse sentences. I tried to explicitly set the following property to use englishSR, which CoreNLP says has better performance:

// use faster shift reduce parser
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");

And I got the following error:

edu.stanford.nlp.io.RuntimeIOException: java.io.IOException: Unable to open "edu/stanford/nlp/models/srparser/englishSR.ser.gz" as class path, filename or URL

I have just commented out the above line for now; maybe we can work on it later.
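
For reference, the default parse.model is the PCFG model edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz. The englishSR model is not in the default models jar, which is why the classpath lookup fails. A sketch of how I would wire it up, assuming the shift-reduce model ships in the extra English models jar for our CoreNLP version (the jar name below is an assumption to verify):

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Assumption: englishSR.ser.gz lives in the extra English models jar
// (e.g. stanford-corenlp-<version>-models-english.jar), so that jar has to be on the
// classpath of the driver AND the executors, e.g. via
//   spark-submit --jars stanford-corenlp-<version>-models-english.jar ...
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse");
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);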

OOM Error when running NER and DCOREF on SimpleNLP (SUTime)

An error that is very likely caused by SUTime:

2018-11-12 16:42:59 INFO  TimeExpressionExtractorImpl:88 - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
2018-11-12 16:43:46 ERROR Utils:91 - Aborting task
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.asList(Arrays.java:3800)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:808)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:615)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.<init>(TokensRegexNERAnnotator.java:315)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.setUpFineGrainedNER(NERCombinerAnnotator.java:220)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:145)
        at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
        at edu.stanford.nlp.simple.Document$2.lambda$null$0(Document.java:111)
        at edu.stanford.nlp.simple.Document$2$$Lambda$61/336633101.get(Unknown Source)
        at edu.stanford.nlp.util.Lazy$2.compute(Lazy.java:106)
        at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
        at edu.stanford.nlp.simple.Document$2.get(Document.java:111)
        at edu.stanford.nlp.simple.Document$2.get(Document.java:106)
        at edu.stanford.nlp.simple.Document.runNER(Document.java:853)
        at edu.stanford.nlp.simple.Sentence.nerTags(Sentence.java:528)
        at edu.stanford.nlp.simple.Sentence.nerTags(Sentence.java:536)
        at ca.uwaterloo.cs651.project.SimpleNLP.lambda$main$fceadcfc$1(SimpleNLP.java:96)
        at ca.uwaterloo.cs651.project.SimpleNLP$$Lambda$27/363625212.call(Unknown Source)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:125)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
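
The GC-overhead error fires while TokensRegexNERAnnotator is loading the fine-grained NER rule files, i.e. during model loading, which suggests the executors simply run out of heap for the CoreNLP models. One thing worth trying (the values and jar/argument names are placeholders, subject to the cluster's limits) is to give the driver and executors more memory at submit time:

spark-submit --class ca.uwaterloo.cs651.project.SimpleNLP \
  --driver-memory 4G --executor-memory 8G \
  <project-jar> <input> <output>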

Quote Mentions

I finished quote, but it doesn't work as expected. Maybe I introduced a bug?

With the input

In the summer Joe Smith decided to go on vacation. He said, "I'm going to Hawaii." That July, vacationer Joe went to Hawaii.

the output was:

((0,quote),())
((0,natlog),(The,up) (University,up) (of,up) (Waterloo,up) (is,up) (located,up) (in,up) (Canada,up) (.,up) (Goose,up) (lives,up) (in,up) (this,up) (University,up) (.,up))
((1,quote),())
((1,natlog),(The,up) (University,up) (of,up) (Waterloo,up) (is,up) (located,up) (in,up) (Canada,up) (.,up) (Goose,up) (lives,up) (here,up) (.,up))
((2,quote),())
((2,natlog),(all,up) (cats,down) (have,up) (tails,up))
((3,quote),("I'm going to Hawaii.",He)
((3,natlog),(In,up) (the,up) (summer,up) (Joe,up) (Smith,up) (decided,up) (to,up) (go,up) (on,up) (vacation,up) (.,up) (He,up) (said,up) (,,up) (``,up) (I,up) ('m,up) (going,up) (to,up) (Hawaii,up) (.,up) ('',up) (That,up) (July,up) (,,up) (vacationer,up) (Joe,up) (went,up) (to,up) (Hawaii,up) (.,up))

Note: ((3,quote),("I'm going to Hawaii.",He) is returned, but the expected one is ((3,quote),("I'm going to Hawaii.",Joe Smith). Any ideas on this?
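
This may not be a bug in our code: without coref in the pipeline, quote attribution can only report the surface mention ("He"), not the canonical speaker. A sketch of what I believe resolves it (the annotator list and the CoreQuote accessors should be verified against the CoreNLP version we pin):

import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreQuote;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
// quote attribution needs ner/depparse/coref upstream to resolve "He" -> "Joe Smith"
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,coref,quote");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

CoreDocument doc = new CoreDocument(
    "In the summer Joe Smith decided to go on vacation. "
    + "He said, \"I'm going to Hawaii.\" That July, vacationer Joe went to Hawaii.");
pipeline.annotate(doc);
for (CoreQuote quote : doc.quotes()) {
    // speaker() is the surface mention ("He"); canonicalSpeaker() should be "Joe Smith"
    System.out.println(quote.text() + " -> " + quote.speaker().orElse("?")
        + " / " + quote.canonicalSpeaker().orElse("?"));
}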

To Do:

I have finished setting up the framework for Spark and CoreNLP, including the annotator dependency stuff.

What we need to do:

  1. Take a look at CoreNLP.java, lines 36-50. Those are the functionalities we need to implement (at least most of them).
  2. Same file, lines 155-183: this is the major part. Each line is read in as a pair (Index, Line), and the expected output is an iterator of arrays: [((Index, Func0), Output0), ..., ((Index, FuncN), OutputN)]. I have implemented the two easiest functionalities and the others should be done in the same way. Don't underestimate this: Stanford CoreNLP is poorly documented and this can get nasty.
  3. Once those functionalities are done, at line 184 (after the flatMap), group by functionality and then sort by Index. The output of each functionality should go into its own folder, and the order of sentences must be preserved within each folder (see the sketch after this list).
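
For (3), a rough sketch of the grouping/sorting step, assuming the flatMap produces a JavaPairRDD keyed by (Index, Func) exactly as described in (2); the helper method below is hypothetical and would live in CoreNLP.java:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Hypothetical helper: writes one output folder per functionality, preserving line order.
static void writeByFunctionality(JavaPairRDD<Tuple2<Long, String>, String> results,
                                 String outputDir, String[] funcs) {
    for (String func : funcs) {
        results
            .filter(pair -> pair._1()._2().equals(func))                 // keep one functionality
            .mapToPair(pair -> new Tuple2<>(pair._1()._1(), pair._2()))  // key by line index only
            .sortByKey()                                                 // restore input order
            .values()
            .saveAsTextFile(outputDir + "/" + func);                     // separate folder per func
    }
}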

FuncTODO

It seems that, at the moment, if we want to run a complex functionality such as "coref", it also prints the results of every functionality that comes before it in the pipeline.

For example, if the input argument for the functionalities is cleanxml, ssplit, coref, it executes every annotator in the pipeline and adds all of their results to the RDD:

((0,tokenize),The-1 University-2 of-3 Waterloo-4 is-5 located-6 in-7 Canada-8 .-9 Goose-1 lives-2 in-3 this-4 University-5 .-6)
((0,cleanxml),The-1 University-2 of-3 Waterloo-4 is-5 located-6 in-7 Canada-8 .-9 Goose-1 lives-2 in-3 this-4 University-5 .-6)
((0,ssplit),The University of Waterloo is located in Canada.|Goose lives in this University.)
((0,pos),(The,DT) (University,NNP) (of,IN) (Waterloo,NNP) (is,VBZ) (located,JJ) (in,IN) (Canada,NNP) (.,.) (Goose,NN) (lives,VBZ) (in,IN) (this,DT) (University,NNP) (.,.))
((0,ner),(The,O) (University,ORGANIZATION) (of,ORGANIZATION) (Waterloo,ORGANIZATION) (is,O) (located,O) (in,O) (Canada,COUNTRY) (.,O) (Goose,O) (lives,O) (in,O) (this,O) (University,O) (.,O))
((0,coref),The University of Waterloo:this University; )
((1,tokenize),The-1 University-2 of-3 Waterloo-4 is-5 located-6 in-7 Canada-8 .-9 Goose-1 lives-2 here-3 .-4)
((1,cleanxml),The-1 University-2 of-3 Waterloo-4 is-5 located-6 in-7 Canada-8 .-9 Goose-1 lives-2 here-3 .-4)
((1,ssplit),The University of Waterloo is located in Canada.|Goose lives here.)
((1,pos),(The,DT) (University,NNP) (of,IN) (Waterloo,NNP) (is,VBZ) (located,JJ) (in,IN) (Canada,NNP) (.,.) (Goose,NN) (lives,NNS) (here,RB) (.,.))
((1,ner),(The,O) (University,ORGANIZATION) (of,ORGANIZATION) (Waterloo,ORGANIZATION) (is,O) (located,O) (in,O) (Canada,COUNTRY) (.,O) (Goose,O) (lives,O) (here,O) (.,O))
((1,coref),)

I don't see any need for this; I think we should only return the results for the requested functionalities (a filtering sketch follows).
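
A minimal sketch of that filtering, assuming the flatMap output is keyed by (Index, Func) as in the To Do above; the variable names are placeholders:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// The pipeline still runs whatever prerequisite annotators it needs, but only the
// functionalities the user asked for end up in the output RDD.
Set<String> requested = new HashSet<>(Arrays.asList("cleanxml", "ssplit", "coref"));
JavaPairRDD<Tuple2<Long, String>, String> filtered =
    results.filter(pair -> requested.contains(pair._1()._2()));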

Test failed: file does not exist on Datasci

2018-11-13 16:13:31 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://name1..xxxx../simpledata;
	at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:715)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:344)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
	at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:732)
	at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:702)
	at ca.uwaterloo.cs651.project.CoreNLP.main(CoreNLP.java:51)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-11-13 16:13:31 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2018-11-13 16:13:31 INFO  AbstractConnector:318 - Stopped Spark@79c4715d{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}

The local file cannot be accessed on Datasci. Is there any way we can work around it?
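
On the cluster, Spark resolves relative paths against HDFS, so data that only exists on the local disk of the submitting node is invisible to the job; a file:// URI only works if the path exists on every worker. The usual workaround is to copy the data into HDFS first (paths below are placeholders):

hdfs dfs -mkdir -p simpledata
hdfs dfs -put /local/path/to/simpledata/* simpledata/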

OOM Error when running NATLOG and OPENIE on SimpleNLP (dependency parser)

An error that is likely caused by the dependency parser:

2018-11-12 16:46:10 INFO  MaxentTagger:88 - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
2018-11-12 16:46:10 INFO  DependencyParser:88 - Loading depparse model: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... 
2018-11-12 16:46:16 ERROR Utils:91 - Aborting task
java.lang.OutOfMemoryError: Java heap space
        at edu.stanford.nlp.parser.nndep.Classifier.preCompute(Classifier.java:662)
        at edu.stanford.nlp.parser.nndep.Classifier.preCompute(Classifier.java:644)
        at edu.stanford.nlp.parser.nndep.DependencyParser.initialize(DependencyParser.java:1189)
        at edu.stanford.nlp.parser.nndep.DependencyParser.loadModelFile(DependencyParser.java:630)
        at edu.stanford.nlp.parser.nndep.DependencyParser.loadFromModelFile(DependencyParser.java:499)
        at edu.stanford.nlp.pipeline.DependencyParseAnnotator.<init>(DependencyParseAnnotator.java:57)
        at edu.stanford.nlp.pipeline.AnnotatorImplementations.dependencies(AnnotatorImplementations.java:240)
        at edu.stanford.nlp.simple.Document$5.lambda$null$0(Document.java:147)
        at edu.stanford.nlp.simple.Document$5$$Lambda$63/1641616399.get(Unknown Source)
        at edu.stanford.nlp.util.Lazy$2.compute(Lazy.java:106)
        at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
        at edu.stanford.nlp.simple.Document$5.get(Document.java:147)
        at edu.stanford.nlp.simple.Document$5.get(Document.java:142)
        at edu.stanford.nlp.simple.Document.runDepparse(Document.java:918)
        at edu.stanford.nlp.simple.Document.runNatlog(Document.java:938)
        at edu.stanford.nlp.simple.Sentence.natlogPolarities(Sentence.java:897)
        at edu.stanford.nlp.simple.Sentence.natlogPolarities(Sentence.java:905)
        at ca.uwaterloo.cs651.project.SimpleNLP.lambda$main$71c840f9$1(SimpleNLP.java:114)
        at ca.uwaterloo.cs651.project.SimpleNLP$$Lambda$27/273072814.call(Unknown Source)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:125)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
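
This looks like the same class of failure as the SUTime issue above: the executor heap is too small while the depparse model is loaded and pre-computed. Besides the spark-submit flags shown there, the executor memory can also be set programmatically; a sketch (the value is a placeholder and must be set before the SparkContext is created):

import org.apache.spark.SparkConf;

// Executor memory set in code takes effect because executors are launched after the
// context is created; driver memory, by contrast, has to be passed to spark-submit.
SparkConf conf = new SparkConf()
    .setAppName("SimpleNLP")
    .set("spark.executor.memory", "8g");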

Efficiency

We are not sure which of the following Spark actually does:

  1. construct a StanfordCoreNLP object when a mapper is initialized, then reuse it to map each input
  2. construct a new StanfordCoreNLP object for each input

If (2) is the case, performance will be unbearable.

Experiments are currently running to check which is the case (see the sketch below for forcing behaviour (1)).
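
For what it's worth, the Task-not-serializable issue above suggests the answer is neither: an object referenced in a map closure is serialized from the driver with the task (which fails for StanfordCoreNLP), while building it inside a per-record lambda would indeed mean one pipeline per input. One way to force behaviour (1) is a per-JVM lazy holder, so each executor builds the pipeline at most once and reuses it across tasks. A rough sketch (the class and method names are made up):

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Hypothetical helper: one StanfordCoreNLP instance per executor JVM, built lazily.
public class PipelineHolder {
    private static StanfordCoreNLP pipeline;

    public static synchronized StanfordCoreNLP get(String annotators) {
        if (pipeline == null) {
            Properties props = new Properties();
            props.setProperty("annotators", annotators);  // e.g. "tokenize,ssplit,pos,ner"
            pipeline = new StanfordCoreNLP(props);
        }
        return pipeline;
    }
}

// In the map/mapPartitions lambda, call PipelineHolder.get(annotators) instead of capturing
// a driver-side pipeline; the closure then only carries the annotators String.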
