kermitt2 / grobid
A machine learning software for extracting information from scholarly documents
Home Page: https://grobid.readthedocs.io
License: Apache License 2.0
As conference papers or presentations often do not contain sufficient bibliographic information, it would be helpful to be able to force the value of certain tags (meeting, date, location, ...).
A howto or a parameter within Grobid would be very helpful.
We are considering running Grobid in our Spark environment as part of the Semantic Scholar project. Currently, PDFs are shuffled to Spark nodes as RDDs of byte arrays that live in the nodes' main memory. These byte arrays currently have to be written out to a temporary directory, which is passed as the value of the -dIn argument. The extracted XML files are then read from a temporary output directory (the -dOut argument) back into memory as byte-array RDDs, which are processed by the next step in the Spark pipeline.
This design incurs one disk write and one disk read per PDF. These IO operations could be removed if Grobid accepted PDFs as byte arrays directly. To enable this, Grobid would need to call pdf2xml via a library API wrapped with JNI rather than shelling out to a separate process (correct me if I am wrong). This approach could also support multi-threaded processing, similar to Grobid's REST service.
Thanks,
Vu Ha. The Semantic Scholar project
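For reference, the round trip described above can be sketched as follows; this is a minimal, self-contained illustration of the extra IO only (the actual Grobid batch invocation is elided and merely simulated):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempDirRoundTrip {
    // Simulates the per-PDF overhead: each in-memory PDF must be written to a
    // temporary -dIn directory before Grobid can see it, and the resulting
    // XML read back from the -dOut directory afterwards.
    public static byte[] process(byte[] pdfBytes) throws IOException {
        Path in = Files.createTempDirectory("grobid-in");
        Path out = Files.createTempDirectory("grobid-out");
        Files.write(in.resolve("doc.pdf"), pdfBytes);        // disk write #1

        // ... Grobid would be invoked here with -dIn <in> -dOut <out> ...
        // (elided; for this sketch we pretend it produced doc.tei.xml)
        Path xml = out.resolve("doc.tei.xml");
        Files.write(xml, "<TEI/>".getBytes(StandardCharsets.UTF_8));

        return Files.readAllBytes(xml);                      // disk read #2
    }
}
```

An in-memory API accepting byte arrays directly would remove both file operations per document.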
When executing grobid-core's createTrainingFulltext method, the TEI XML files generated for the references have a typo in their name: they are called "...tranining.references...": 'i' and 'n' are inverted.
At some point, we could use only Wapiti, since it provides only advantages compared to CRF++. This would simplify maintenance, given that we otherwise need to manage two sets of JNI libraries and models.
Is there an easy way to change the number of iterations when running grobid-trainer?
When ready ;)
Don't use the feature matrix for doing that, to ensure genericity between CRF++ and Wapiti.
See runReflow stuff in the affiliation parser...
When grobid-core commands are run with a non-existent directory for the -gH option, the error refers to a non-existent 'null' directory instead of the erroneous directory that was entered. This is not serious, but it does not help the user find the source of the error and correct it quickly.
Hi,
When I parse a standard scientific paper [1], with the code proposed in [2]
and print the resulting BiblioItem, I get:
...
month='null'
e_month='null'
s_month='null'
a_month='null'
...
e_year='null'
s_year='null'
a_year='null'
year='null'
...
day='null'
e_day='null'
s_day='null'
a_day='null'
...
normalized_publication_date=20-6-2013 / 20-Jun-2013
...
normalized_submission_date=null
...
I looked quickly at BiblioItem.java. I'm not sure which field represents what information, so I don't feel comfortable changing the code for the moment.
However, I've noticed that the year/month/day fields are set only in toTEI() and toTEI2(). Wouldn't it be better to set them in setNormalizedPublicationDate() and/or setNormalizedSubmissionDate(), so that they are defined each time a date is set, and to use these methods everywhere else in the class whenever a date has to be changed?
[1] http://search.arxiv.org:8081/paper.jsp?r=1306.4727&qid=1375964777082mix_nCnN_-1091422823&qs=1306.4727
[2] https://github.com/kermitt2/grobid/wiki/Grobid-java-library
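The suggested change could look roughly like this; a simplified sketch deriving the fields with java.util.Calendar (the real BiblioItem has many more fields and may use different types):

```java
import java.util.Calendar;
import java.util.Date;

// Hypothetical, simplified version of the suggestion: derive year/month/day
// at the moment the normalized date is set, instead of inside toTEI().
public class BiblioItemSketch {
    private Date normalizedPublicationDate;
    private int year = -1, month = -1, day = -1;

    public void setNormalizedPublicationDate(Date date) {
        this.normalizedPublicationDate = date;
        Calendar cal = Calendar.getInstance();
        cal.setTime(date);
        this.year = cal.get(Calendar.YEAR);
        this.month = cal.get(Calendar.MONTH) + 1; // Calendar months are 0-based
        this.day = cal.get(Calendar.DAY_OF_MONTH);
    }

    public int getYear() { return year; }
    public int getMonth() { return month; }
    public int getDay() { return day; }
}
```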
I've tried to apply the code provided here:
https://github.com/kermitt2/grobid/wiki/Grobid-java-library
in a WebService context, where I crawl entire websites and need to run Grobid every time I find a PDF. Unfortunately, calling the following line twice:
MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
results in an Exception being raised:
Caused by: org.grobid.core.exceptions.GrobidPropertyException: Could not set GROBD_HOME
at org.grobid.core.utilities.GrobidProperties.load_GROBID_HOME_PATH(GrobidProperties.java:170)
at org.grobid.core.utilities.GrobidProperties.init(GrobidProperties.java:353)
at org.grobid.core.utilities.GrobidProperties.init(GrobidProperties.java:383)
at org.grobid.core.utilities.GrobidProperties.(GrobidProperties.java:341)
at org.grobid.core.utilities.GrobidProperties.getNewInstance(GrobidProperties.java:119)
at org.grobid.core.utilities.GrobidProperties.getInstance(GrobidProperties.java:99)
at fr.presans.machinelearning.extractor.AnyPDFExtractor.PDF2BiblioItem(AnyPDFExtractor.java:131)
... 8 more
Caused by: javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file: java.naming.factory.initial
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:307)
at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:344)
at javax.naming.InitialContext.lookup(InitialContext.java:411)
at org.grobid.core.utilities.GrobidProperties.load_GROBID_HOME_PATH(GrobidProperties.java:168)
I've tried to define a static boolean in my code in order to call the concerned line only once, but then I get the following error:
Caused by: javax.naming.NameAlreadyBoundException: Name java: is already bound in this Context
at org.apache.naming.NamingContext.bind(NamingContext.java:892)
at org.apache.naming.NamingContext.bind(NamingContext.java:186)
at org.apache.naming.NamingContext.createSubcontext(NamingContext.java:542)
at org.apache.naming.NamingContext.createSubcontext(NamingContext.java:564)
at javax.naming.InitialContext.createSubcontext(InitialContext.java:483)
at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:37)
at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
at fr.presans.machinelearning.extractor.AnyPDFExtractor.PDF2BiblioItem(AnyPDFExtractor.java:134)
I'm not sure what the line MockContext.setInitialContext() is supposed to do, and therefore how I should call it.
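One possible workaround, assuming setInitialContext() must run at most once per JVM: guard the call with a thread-safe flag. This is only a sketch; MockContext itself is Grobid's class and is not touched here:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GrobidInit {
    private static final AtomicBoolean initialized = new AtomicBoolean(false);

    // compareAndSet guarantees initAction runs exactly once, even when
    // several service threads race on the first PDF they encounter.
    public static void initOnce(Runnable initAction) {
        if (initialized.compareAndSet(false, true)) {
            initAction.run();
        }
    }
}
```

The init action would then wrap the MockContext.setInitialContext(...) call (plus handling of its checked exceptions).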
tei.append("\t\t\t\t\t\n"); on line 407 of TEIFormater.java is in the wrong place; it should be moved after the closing } on line 409.
Also, lists are not being closed consistently. I suggest adding
if (listOpened) {
    tei.append("\t\t\t\t\n");
    listOpened = false;
}
after the for loop at line 704. I will send a pull request shortly.
Great tool by the way, GROBID seems to be one of the best PDF to XML tools out there at the moment. Thanks for this!
When running grobid-core's "processXXX" commands, I get a NullPointerException: processDate (ProcessEngine:103).
I also get an error for processAffiliation, coming from the line processAffiliation (AffiliationAddressParser:28), which reads:
input.trim();
I think this error is related to the ones above, but I'm not sure for the moment. I believe 'input' might be null here because i) it is not captured from the command-line arguments, ii) I have not understood how to pass the argument to the command, or iii) some pre-processing done on the argument nullifies the string.
If the bibliographic data is on the second page of the paper, creating the training header file (*.header) doesn't work properly: no information from the second page is included in the *.header file, so training does not work.
The same happens for some other articles, where the journal name, printed on the first page, is not included in the *.header files.
Grobid experts,
When I process with the shipped segmentation model, the performance is excellent: accuracy and F1 scores are close to perfect in all fields (this is reflected in the annotations themselves). However, when I train a new model on the shipped segmentation corpus, the evaluation is very poor. I can think of three possible reasons to explain this:
The file "grobid-service_manual.pdf" does not exactly reflect the latest version (2.8) of the tool.
At least update the pictures; the tool is prettier now :)
When using createTrainingXXX, with papers where a middle name is present in the authors' names, the generated XML file shows "XXX.", with the dot after the closing tag.
Hi, I'm trying to compile this project, but I get errors while following this guide:
https://github.com/kermitt2/grobid/wiki/Grobid-service-quick-start
Tests run: 154, Failures: 0, Errors: 44, Skipped: 1
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] There are test failures.
I am currently using:
Apache Maven 2.2.1 (rdebian-8)
Java version: 1.6.0_45
Java home: /usr/lib/jvm/jdk1.6.0_45/jre
Default locale: es_MX, platform encoding: UTF-8
OS name: "linux" version: "3.2.0-4-686-pae" arch: "i386" Family: "unix"
What do I need: OS, architecture, Java version, Maven version?
Thanks
Grobid should indicate in the generated TEI results the date, provenance information (i.e. which version of Grobid was used), and whether or not the result has been consolidated via CrossRef.
I've been running GROBID on a variety of PDFs lately and came across one where the XML output is not well-formed. I'm using GROBID as a service via its processFulltextDocument method.
Offending PDF
Resulting XML file
The problem with the XML file is on line 172 (below, line breaks added by me for readability), where a &quot; entity seems to have gotten mangled, leaving an unescaped ampersand.
<p>To find out on which level of granularity
visual pattern are classified as single "objects" (see Figure 6),
a questionnaire with eight different screen-dumps
of a commercial multimedia information system
(called "mock-ups", see[Rauterberg 1995c]) was answered
by a heterogeneous group of potential users.
A total of 33 women (between 14 and 66 year of life)
and 33 men (between 21 and 55 year of life) participated
(no significant difference in age between both groups).
The computer experience of each subject was measured on a rating scale
("no experience" ; = 0 … "expert" = 90).
We found a significant difference in computer experience between both groups:
40 ±27 for women and 63 ±24 for men (p≤ .001);
the men were more experienced than the women.</p>
Using a version of GROBID cloned from github about a month or so ago. Let me know if any other info is needed.
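For what it's worth, a minimal sketch of the escaping that would avoid a bare ampersand in the output; this is not Grobid's actual code, just the standard technique (escape '&' first, so other replacements cannot be double-mangled):

```java
public class XmlEscape {
    // Escapes the XML special characters in text content / attribute values.
    // '&' must be replaced first; otherwise the '&' introduced by the other
    // replacements would itself be escaped again.
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```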
From my point of view, Grobid would be unbeatable if the extracted information were consolidated with a free online database.
Are there alternatives to CrossRef, as this is quite expensive?
Index out of bounds in the patent parser test when using Wapiti.
The training data for fulltext models creates corrupt XML files, which causes the trainer to crash.
This is the PDF: http://www.aclweb.org/anthology/W12-4305
Grobid was able to process this file with -exe processHeader, but hung when run with -exe processReferences and -exe processFullText.
When Grobid fails to process a PDF, the REST response should indicate if the failure is due to a PDF timeout, a corrupted PDF or a Grobid process error.
For multi-thread support.
Hi,
Under the latest up-to-date Debian GNU/Linux 7 (wheezy), I get the following exception when running grobid:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:359)
at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:154)
... 6 more
Caused by: java.lang.UnsatisfiedLinkError: /home/gmuller/grobid/grobid-home/lib/lin-64/libwapiti.so: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by /home/gmuller/grobid/grobid-home/lib/lin-64/libwapiti.so)
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1851)
at java.lang.Runtime.load0(Runtime.java:795)
at java.lang.System.load(System.java:1062)
at org.grobid.core.main.LibraryLoader.load(LibraryLoader.java:94)
at org.grobid.core.factory.AbstractEngineFactory.init(AbstractEngineFactory.java:51)
at org.grobid.core.factory.GrobidFactory.(GrobidFactory.java:21)
at org.grobid.core.factory.GrobidFactory.newInstance(GrobidFactory.java:58)
at org.grobid.core.factory.GrobidFactory.getInstance(GrobidFactory.java:32)
at org.grobid.core.engines.ProcessEngine.getEngine(ProcessEngine.java:42)
at org.grobid.core.engines.ProcessEngine.processFullText(ProcessEngine.java:103)
... 12 more
The exact command executed is:
java -Xmx1024m -jar ~/grobid/grobid-core/target/grobid-core-0.3.0.one-jar.jar -gH ~/grobid/grobid-home/ -gP ~/grobid/grobid-home/config/grobid.properties -dIn In/ -dOut Out/ -exe processFullText
Grobid uses quite a lot of rules to recompose diacritics (never encoded as such in PDF), e.g.:
e' -> é
In some cases (not so frequent), the sequence of characters from the PDF is not as expected by the existing rules and the recomposition fails. Currently, the consequence is that both the "accent" and the modified character disappear.
Example: for this PDF (HAL Open Access),
"Clément Cancès" becomes:
<persName>
<forename type="first">Clément</forename>
<surname>Canc</surname>
</persName>
(Mac OS X Preview does not get it right either, by the way!)
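A possible complement to the hand-written rules: when pdf2xml emits a base letter followed by a Unicode combining mark (U+0300..U+036F), standard NFC normalization composes the pair into the precomposed character instead of dropping both. A sketch:

```java
import java.text.Normalizer;

public class DiacriticCompose {
    // NFC (canonical composition) turns "e" + combining acute (U+0301)
    // into the single precomposed character "é" (U+00E9).
    public static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }
}
```

This does not cover the cases where the PDF yields a plain apostrophe or a spacing accent instead of a combining mark; those still need explicit rules.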
Hi,
I just ran into a strange behaviour: when trying to access files in a directory (/tmp/ToTreat) that all belonged to another user, I got a not-very-explanatory error: a NullPointerException...
$ GROBID_HOME=/home/xxx/grobid/grobid-home/
$ DIR=/tmp/ToTreat/
$ java -Xmx1024m -Xms1024m
-jar ${GROBID_HOME}/../grobid-core/target/grobid-core-0.2.*.one-jar.jar
-gH ${GROBID_HOME}
-gP ${GROBID_HOME}/config/grobid.properties
-dIn ${DIR} -dOut ${DIR}
-exe processHeader
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:359)
at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:155)
... 6 more
Caused by: java.lang.NullPointerException
at org.grobid.core.engines.ProcessEngine.processHeader(ProcessEngine.java:56)
... 12 more
Testing Grobid on papers from arXiv.org, I've found at least two cases that Grobid does not handle correctly:
The Wapiti binary models are not recognized on a few Linux machines.
The error is coming from model.c in Wapiti, when the header of the model is parsed via fscanf:
/* mdl_load:
 * Read back a previously saved model to continue training or start labeling.
 * The returned model is synced and the quarks are locked. You must give to
 * this function an empty model fresh from mdl_new.
 */
void mdl_load(mdl_t *mdl, FILE *file) {
    const char *err = "invalid model format";
    uint64_t nact = 0;
    int type;
    if (fscanf(file, "#mdl#%d#%"SCNu64"\n", &type, &nact) == 2) {
        mdl->type = type;
    } else {
        rewind(file);
        if (fscanf(file, "#mdl#%"SCNu64"\n", &nact) == 1)
            mdl->type = 0;
        else
            fatal(err);
    }
    rdr_load(mdl->reader, file);
    mdl_sync(mdl);
    for (uint64_t i = 0; i < nact; i++) {
        uint64_t f;
        double v;
        if (fscanf(file, "%"SCNu64"=%la\n", &f, &v) != 2)
            fatal(err);
        mdl->theta[f] = v;
    }
}
The header of the model looks like this on the problematic machine:
> find grobid/grobid-home/models/ -name "*wapiti" -print -exec head -n2 \{} \;
grobid/grobid-home/models/header/model.wapiti
#mdl#2#314470
#rdr#85/29/0
If the model is retrained on the problematic machine, it works. However, the header format looks the same:
> head -n2 grobid/grobid-home/models/date/model.wapiti
#mdl#2#262
#rdr#50/16/0
12:u00:%x[-3,0],
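To inspect a model file locally, the two header variants read by mdl_load ("#mdl#<type>#<nact>" and the older "#mdl#<nact>") can be parsed in Java as a sanity check; a sketch mirroring the fscanf calls above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WapitiHeader {
    private static final Pattern WITH_TYPE = Pattern.compile("#mdl#(\\d+)#(\\d+)");
    private static final Pattern NO_TYPE = Pattern.compile("#mdl#(\\d+)");

    /** Returns {type, nact} (type 0 for the older format), or null if invalid. */
    public static long[] parse(String firstLine) {
        Matcher m = WITH_TYPE.matcher(firstLine);
        if (m.matches())
            return new long[]{Long.parseLong(m.group(1)), Long.parseLong(m.group(2))};
        m = NO_TYPE.matcher(firstLine);
        if (m.matches())
            return new long[]{0L, Long.parseLong(m.group(1))};
        return null;
    }
}
```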
Users having this issue can use CRF++ as the JNI CRF engine instead of Wapiti (a little slower, takes more memory, and uses smaller models because of GitHub's limitation on binary file size, but the results are similar).
In the file grobid-home/config/grobid.properties, simply change:
grobid.crf.engine=wapiti
#grobid.crf.engine=crfpp
to:
#grobid.crf.engine=wapiti
grobid.crf.engine=crfpp
Hi,
I am using the latest Grobid version with j2sdk1.7-oracle on SMP Debian 3.2.60-1+deb7u1 x86_64 Linux,
and it seems Grobid is somehow unable to use its pool; I see these error messages in my log:
ulf@ivanova:/data/delivermath/grobid/grobid-service$ tail -f grobid.log
17 Mar 2015 18:03.03 [INFO ] GrobidRestService - Initiating Servlet GrobidRestService
17 Mar 2015 18:03.03 [INFO ] LibraryLoader - Loading external native CRF library
17 Mar 2015 18:03.03 [INFO ] LibraryLoader - Loading Wapiti native library...
17 Mar 2015 18:03.03 [INFO ] LibraryLoader - Library crfpp loaded
17 Mar 2015 18:03.03 [INFO ] Lexicon - Initiating dictionary
17 Mar 2015 18:03.03 [INFO ] Lexicon - End of Initialization of dictionary
17 Mar 2015 18:03.03 [INFO ] Lexicon - Initiating names
17 Mar 2015 18:03.03 [INFO ] Lexicon - End of initialization of names
17 Mar 2015 18:03.04 [INFO ] Lexicon - Initiating country codes
17 Mar 2015 18:03.04 [INFO ] Lexicon - End of initialization of country codes
17 Mar 2015 18:03.04 [INFO ] GrobidRestService - Initiating of Servlet GrobidRestService finished.
17 Mar 2015 18:03.29 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.29 [INFO ] WapitiModel - Loading model: /data/delivermath/grobid/grobid-home/models/citation/model.wapiti (size: 12840798)
17 Mar 2015 18:03.32 [INFO ] WapitiModel - Loading model: /data/delivermath/grobid/grobid-home/models/name/header/model.wapiti (size: 2215355)
17 Mar 2015 18:03.32 [INFO ] WapitiModel - Loading model: /data/delivermath/grobid/grobid-home/models/name/citation/model.wapiti (size: 99017)
17 Mar 2015 18:03.32 [INFO ] WapitiModel - Loading model: /data/delivermath/grobid/grobid-home/models/date/model.wapiti (size: 103543)
17 Mar 2015 18:03.32 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.32 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.32 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.34 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.34 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.34 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
.
.
.
I have little experience with Maven, so it's entirely possible that I'm doing something wrong. I'm following the Quick Start Guide, and it looks like Maven is downloading a bunch of HTML files and trying to use them as .jars when I run "mvn clean install":
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Compilation failure
error: error reading /home/steve/.m2/repository/com/aliasi/lingpipe/3.8.2/lingpipe-3.8.2.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/commons-pool/commons-pool/1.6/commons-pool-1.6.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/commons-io/commons-io/2.0.1/commons-io-2.0.1.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/org/apache/commons/commons-lang3/3.0.1/commons-lang3-3.0.1.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/org/slf4j/slf4j-api/1.6.6/slf4j-api-1.6.6.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/org/slf4j/slf4j-log4j12/1.6.6/slf4j-log4j12-1.6.6.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/directory-naming/naming-java/0.8/naming-java-0.8.jar; error in opening zip file
All of those files look to be HTML, not jars. I'm seeing exactly the same thing on both OS X and Linux.
On Linux:
Apache Maven 2.2.1 (rdebian-4)
Java version: 1.6.0_26
on OS X:
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
Java version: 1.7.0_25, vendor: Oracle Corporation
Any help you can provide would be appreciated, let me know if I can provide more useful information.
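One quick way to confirm the HTML-instead-of-jar diagnosis: a valid jar starts with the ZIP magic bytes "PK" (0x50 0x4B), while a mirrored error page starts with the '<' of "<html>". A small check (after deleting the bad files from ~/.m2, re-running mvn re-downloads them):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class JarCheck {
    // Reads only the first two bytes and compares them against the
    // ZIP local-file-header magic "PK" that every valid jar begins with.
    public static boolean looksLikeJar(Path p) throws IOException {
        try (InputStream in = Files.newInputStream(p)) {
            return in.read() == 0x50 && in.read() == 0x4B;
        }
    }
}
```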
I tried to run the wiki example to extract the header; however, I got this error:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:359)
at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:155)
... 6 more
Caused by: java.lang.NullPointerException
at org.grobid.core.engines.ProcessEngine.processHeader(ProcessEngine.java:56)
... 12 more
Any clue?
Check segmentation of instances in the labeled result.
Many CRF++ functions that work on Linux/Mac do not work at all on Windows.
On training (the what function is not recognized):
java.lang.UnsatisfiedLinkError: org.chasen.crfpp.CRFPPJNI.CRFPPTrainer_what(JLorg/chasen/crfpp/CRFPPTrainer;)Ljava/lang/String;
at org.chasen.crfpp.CRFPPJNI.CRFPPTrainer_what(Native Method)
at org.chasen.crfpp.CRFPPTrainer.what(CRFPPTrainer.java:41)
at org.grobid.trainer.AbstractTrainer.train(AbstractTrainer.java:49)
at org.grobid.trainer.AbstractTrainer.runTraining(AbstractTrainer.java:139)
at org.grobid.trainer.NameHeaderTrainer.main(NameHeaderTrainer.java:93)
On the server (the instantiation of Model is not supported: newModel):
java.lang.UnsatisfiedLinkError: org.chasen.crfpp.CRFPPJNI.new_Model(Ljava/lang/String;)J
at org.chasen.crfpp.CRFPPJNI.new_Model(Native Method)
at org.chasen.crfpp.Model.(Model.java:46)
at org.grobid.core.engines.ModelMap.getNewModel(ModelMap.java:101)
at org.grobid.core.engines.ModelMap.getModel(ModelMap.java:87)
at org.grobid.core.engines.ModelMap.initModels(ModelMap.java:64)
at org.grobid.core.factory.AbstractEngineFactory.fullInit(AbstractEngineFactory.java:58)
at org.grobid.service.GrobidRestService.(GrobidRestService.java:74)
... 69 more
I'm not able to compile the current version of Grobid on Linux (64-bit).
The error messages in the log files look like this:
Tests in error:
testFullTextTrainingParser(org.grobid.core.test.TestFullTextParser): An exception occured while running Grobid training data generation for full text.
testFullTextParser(org.grobid.core.test.TestFullTextParser): An exception occured while running Grobid.
testHeaderHeader(org.grobid.core.test.TestHeaderParser): An exception occurred while running Grobid on file /var/run/media/holoxy/INTENSO/Programme/grobid_neu/grobid/grobid-home/tmp: java.lang.RuntimeException: PDF to XML conversion timed out
testTrainingHeader(org.grobid.core.test.TestHeaderParser): An exception occurred while running Grobid.
testReferences(org.grobid.core.test.TestReferencesParser): An exception occured while running Grobid.
Tests run: 151, Failures: 0, Errors: 5, Skipped: 1
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] grobid-parent ..................................... SUCCESS [0.005s]
[INFO] grobid-home ....................................... SUCCESS [1:11.088s]
[INFO] grobid-core ....................................... FAILURE [28.210s]
[INFO] grobid-trainer .................................... SKIPPED
[INFO] grobid-service .................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:39.876s
[INFO] Finished at: Sat Nov 02 00:40:20 CET 2013
[INFO] Final Memory: 14M/263M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project grobid-core: There are test failures.
[ERROR]
[ERROR] Please refer to /var/run/media/holoxy/INTENSO/Programme/grobid_neu/grobid/grobid-core/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project grobid-core: There are test failures.
Please refer to /var/run/media/holoxy/INTENSO/Programme/grobid_neu/grobid/grobid-core/target/surefire-reports for the individual test results.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:213)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:321)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:158)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.MojoFailureException: There are test failures.
Please refer to /var/run/media/holoxy/INTENSO/Programme/grobid_neu/grobid/grobid-core/target/surefire-reports for the individual test results.
at org.apache.maven.plugin.surefire.SurefireHelper.reportExecution(SurefireHelper.java:83)
at org.apache.maven.plugin.surefire.SurefirePlugin.writeSummary(SurefirePlugin.java:176)
at org.apache.maven.plugin.surefire.SurefirePlugin.handleSummary(SurefirePlugin.java:150)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:650)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:586)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :grobid-core
Any ideas?
Thanks in advance
holoxy
When running grobid-trainer with wrong arguments, it replies with a 'Usage' line containing this:
"Usage: (...) -pH /path/to/Grobid/home (...)"
The -pH switch is a typo: TrainerRunner expects -gH.
Trying to execute:
String tei = engine.processHeader("some filename with spaces.pdf", false, resHeader);
I get the following error from Grobid:
org.grobid.core.exceptions.GrobidException: An exception occurred while running Grobid on file /Users/username/dev/rep/grobid/grobid-home/tmp: java.lang.RuntimeException: PDF to XML conversion failed due to:
at org.grobid.core.engines.HeaderParser.processing(HeaderParser.java:79)
at org.grobid.core.engines.Engine.processHeader(Engine.java:454)
at org.grobid.core.engines.Engine.processHeader(Engine.java:419)
It'd be great if it could handle filenames with spaces.
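Until that is fixed, a possible caller-side workaround, assuming the failure comes from the space in the path when Grobid shells out to pdf2xml: copy the PDF to a space-free temporary name before calling engine.processHeader(...). A sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpacelessCopy {
    // Copies the PDF into a fresh temp directory under a name with
    // spaces replaced by underscores, and returns the new path.
    public static Path toSpacelessTemp(Path original) throws IOException {
        Path target = Files.createTempDirectory("grobid-safe")
                .resolve(original.getFileName().toString().replace(" ", "_"));
        return Files.copy(original, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```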
Grobid Service uses the class ModelMap to initialize the models (initModels), and thus not the new TaggerFactory.
Several tests fail for me because the tests' getAttributes() functions use a TreeSet of a type that does not implement Comparable. Replacing Set/TreeSet with Vector allows me to successfully build GROBID.
diff --git a/grobid-core/src/test/java/org/grobid/core/annotations/DescriptionTest.java b/grobid-core/src/test/java/org/grobid/core/annotations/DescriptionTest.java
index a5337bc..3c08462 100644
--- a/grobid-core/src/test/java/org/grobid/core/annotations/DescriptionTest.java
+++ b/grobid-core/src/test/java/org/grobid/core/annotations/DescriptionTest.java
@@ -7,6 +7,7 @@ import static org.grobid.core.utilities.TeiValues.XML;
 import java.io.IOException;
 import java.util.Iterator;
 import java.util.Set;
+import java.util.Vector;
 import java.util.TreeSet;
 
 import javax.xml.namespace.QName;
@@ -433,7 +434,7 @@ public class DescriptionTest extends XMLTestCase {
 	}
 
 	private static Iterator getAttributes(final Attribute... pAttr) {
-		Set attributes = new TreeSet();
+		Vector attributes = new Vector();
 		for (final Attribute attr : pAttr) {
 			attributes.add(attr);
 		}
diff --git a/grobid-core/src/test/java/org/grobid/core/annotations/ParagraphTest.java b/grobid-core/src/test/java/org/grobid/core/annotations/ParagraphTest.java
index b8798e6..355b027 100644
--- a/grobid-core/src/test/java/org/grobid/core/annotations/ParagraphTest.java
+++ b/grobid-core/src/test/java/org/grobid/core/annotations/ParagraphTest.java
@@ -5,6 +5,7 @@ import java.util.Iterator;
 import java.util.List;
 import java.util.Set;
 import java.util.TreeSet;
+import java.util.Vector;
 
 import static org.grobid.core.utilities.TeiValues.ATTR_ID;
 import static org.grobid.core.utilities.TeiValues.W3C_NAMESPACE;
@@ -143,7 +144,7 @@
 	}
 
 	private static Iterator getAttributes(final Attribute... pAttr) {
-		Set attributes = new TreeSet();
+		Vector attributes = new Vector();
 		for (final Attribute attr : pAttr) {
 			attributes.add(attr);
 		}
diff --git a/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParsedInfoTest.java b/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParsedInfoTest.java
index ae53a6d..4954f01 100644
--- a/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParsedInfoTest.java
+++ b/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParsedInfoTest.java
@@ -7,6 +7,7 @@ import static org.junit.Assert.assertTrue;
 
 import java.util.Iterator;
 import java.util.Set;
+import java.util.Vector;
 import java.util.TreeSet;
 
 import javax.xml.namespace.QName;
@@ -139,7 +140,7 @@ public class TeiStAXParsedInfoTest {
 	}
 
 	private static Iterator getAttributes(final Attribute... pAttr) {
-		Set attributes = new TreeSet();
+		Vector attributes = new Vector();
 		for (final Attribute attr : pAttr) {
 			attributes.add(attr);
 		}
diff --git a/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParserTest.java b/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParserTest.java
index b864fe3..ec160fa 100644
--- a/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParserTest.java
+++ b/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParserTest.java
@@ -10,6 +10,7 @@ import java.io.OutputStream;
 import java.io.UnsupportedEncodingException;
 import java.util.Iterator;
 import java.util.Set;
+import java.util.Vector;
 import java.util.TreeSet;
 
 import javax.xml.namespace.QName;
@@ -263,7 +264,7 @@ public class TeiStAXParserTest extends XMLTestCase {
 	}
 
 	private static Iterator getAttributes(final Attribute... pAttr) {
-		Set attributes = new TreeSet();
+		Vector attributes = new Vector();
 		for (final Attribute attr : pAttr) {
 			attributes.add(attr);
 		}
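The root cause can be shown in a few lines: a TreeSet with natural ordering calls compareTo() on insertion, so the very first add() of a non-Comparable element (like a StAX Attribute) throws ClassCastException. The Attr class below is a minimal stand-in for such a type, not GROBID code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TreeSetDemo {
    // Minimal stand-in for a class that, like javax.xml.stream.events.Attribute,
    // does not implement Comparable
    static class Attr {
        final String name;
        Attr(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        Set<Attr> sorted = new TreeSet<>();
        try {
            // TreeSet needs a natural ordering (or a Comparator): this throws
            sorted.add(new Attr("id"));
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: Attr is not Comparable");
        }

        // Any insertion-ordered collection works for merely collecting attributes
        List<Attr> attributes = new ArrayList<>();
        attributes.add(new Attr("id"));
        attributes.add(new Attr("xml:lang"));
        System.out.println("collected: " + attributes.size());
    }
}
```

Vector (as in the patch above) or ArrayList both avoid the problem, since neither requires its elements to be Comparable.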
Thanks a lot for the very promising metadata extraction tool.
Would it be possible to add an option for recursive extraction over folders and subfolders?
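In the meantime, the recursion itself is easy to do outside GROBID with java.nio. Below is a small sketch (findPdfs is a hypothetical helper, not an existing GROBID API) that collects every PDF under a root folder, including subfolders; each path could then be fed to the batch processing one by one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PdfWalker {
    // Recursively collect every .pdf (case-insensitive) under root
    static List<Path> findPdfs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString()
                                      .toLowerCase().endsWith(".pdf"))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        for (Path pdf : findPdfs(Paths.get("."))) {
            System.out.println(pdf);
        }
    }
}
```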
See fork with JNI: https://github.com/kermitt2/Wapiti
As the saying goes, "Converting PDF to XML is a bit like converting hamburgers into cows" (Peter Murray-Rust): the character spacing stored in a PDF is not what we see rendered ;)
A good PDF processing library like pdftoxml tries to recreate valid spacing (with respect to the visual rendering), but of course this is difficult.
It appears that quite a lot of PDFs result in problematic character spacing for some fonts, in particular in the header section. For instance, in this PDF the author sequence is extracted by pdftoxml as:
M ihael ARCAN 1 Chr ist ian F E DERM AN N 2 Paul BU I T E LAAR 1
It is then very hard for the CRF to predict a good sequence labeling on this...
Note that Mac OS X Preview recomposes it correctly; a direct cut and paste from the PDF gives:
Mihael ARCAN1 Christian FEDERMANN2 Paul BUITELAAR1
If Apple can do it, we can certainly do it right too ;)
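One toy repair heuristic (NOT what GROBID or pdftoxml actually do, just an illustration of the problem's shape) is to re-attach a detached single capital to a following lowercase fragment, which fixes "M ihael" but not multi-letter splits like "Chr ist ian" or "BU I T E LAAR" — and it would wrongly merge ordinary prose like "I saw", so it is only plausible on name-like fields:

```java
public class SpacingFix {
    // Toy heuristic: rejoin a lone capital letter to a following lowercase
    // fragment, e.g. "M ihael" -> "Mihael". Multi-letter fragments and
    // all-caps splits are left untouched.
    static String mergeDetachedInitial(String s) {
        return s.replaceAll("\\b([A-Z])\\s+(?=[a-z])", "$1");
    }

    public static void main(String[] args) {
        System.out.println(mergeDetachedInitial("M ihael ARCAN")); // Mihael ARCAN
    }
}
```

A real fix has to use the glyph coordinates from the PDF (as Preview presumably does) rather than string patterns.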
I'm following the instructions in the readme to get the code working, but getting the error:
Caused by: java.lang.RuntimeException: Unable to find a native CRF++ library: Folder /media/D020-1F62/github/grobid/grobid-home/lib/lin-32 does not exist
I'm on "Linux rb-806-02-c 3.0.0-32-generic-pae #51-Ubuntu SMP Thu Mar 21 16:09:48 UTC 2013 i686 i686 i386 GNU/Linux"
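The folder name in the error ("lin-32") suggests the native library path is derived from OS and word size. A sketch of such detection from standard JVM properties is below; the folder naming scheme here simply mirrors the error message, and GROBID's real detection code may differ.

```java
public class NativeLibFolder {
    // Map JVM os.name / os.arch to a folder name like "lin-32" or "mac-64".
    // Naming scheme assumed from the error message, not taken from GROBID source.
    static String libFolder(String osName, String osArch) {
        String lower = osName.toLowerCase();
        String os = lower.contains("mac") ? "mac"
                  : lower.contains("win") ? "win"
                  : "lin";
        String bits = osArch.contains("64") ? "64" : "32";
        return os + "-" + bits;
    }

    public static void main(String[] args) {
        System.out.println(libFolder(System.getProperty("os.name"),
                                     System.getProperty("os.arch")));
    }
}
```

On the 32-bit Ubuntu above (i686), this yields "lin-32", so the report is about that folder genuinely missing from grobid-home/lib, not about misdetection.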
I'm adding support for PDFMiner (https://github.com/euske/pdfminer) and PDFBox (https://pdfbox.apache.org) in my fork. So far, all the tests still pass when using PDFMiner. I will issue a pull request after further testing and cleanup.
It may be an idea to decouple the PDF extraction entirely, and have the option to import raw XML or JSON directly (e.g. created by a previous process, or from a filestore).
It would be good to share notes on best practices for editing the training data. Here are a few things that I have come up with:
When running processHeader on some PDFs, I get an abstract with all the words concatenated, without spaces between them.
For instance:
http://guillaumemuller1.free.fr/Articles/iTrust2006_LNAI.pdf
But not with:
http://search.arxiv.org:8081/paper.jsp?r=1306.4727&qid=1375964777082mix_nCnN_-1091422823&qs=1306.4727
However, pdftoxml run from the command line separates the words (TOKENs) correctly.
Is it possible to get the probability/confidence of each tag?