kermitt2 / grobid
A machine learning software for extracting information from scholarly documents
Home Page: https://grobid.readthedocs.io
License: Apache License 2.0
As conference papers or presentations often do not contain sufficient bibliographic information, it would be helpful to be able to force the value of certain tags (meeting, date, location, ...).
A howto or a parameter within Grobid would be very helpful.
We are considering running Grobid in our Spark environment as part of the Semantic Scholar project. Currently, PDFs are shuffled to Spark nodes as RDDs of byte arrays that live in the nodes' main memory. These byte arrays currently have to be written out to a temporary directory, which is passed as the value of the -dIn argument. The extracted XML files are then read from a temporary output directory (the -dOut argument) back into memory as byte-array RDDs, which are processed by the next step in the Spark pipeline.
This design incurs one disk write and one disk read per PDF. These IO operations could be removed if Grobid accepted PDFs as byte arrays directly. To enable this, Grobid would need to call pdf2xml via a library API wrapped with JNI rather than shelling out to a separate process (correct me if I am wrong). This approach could also support multi-threaded processing, similar to Grobid's REST service.
Thanks,
Vu Ha. The Semantic Scholar project
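For reference, the round trip described above can be sketched as follows; this is a minimal, self-contained illustration of the extra IO only (the actual Grobid batch invocation is elided and merely simulated):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempDirRoundTrip {
    // Simulates the per-PDF overhead: each in-memory PDF must be written to a
    // temporary -dIn directory before Grobid can see it, and the resulting
    // XML read back from the -dOut directory afterwards.
    public static byte[] process(byte[] pdfBytes) throws IOException {
        Path in = Files.createTempDirectory("grobid-in");
        Path out = Files.createTempDirectory("grobid-out");
        Files.write(in.resolve("doc.pdf"), pdfBytes);        // disk write #1

        // ... Grobid would be invoked here with -dIn <in> -dOut <out> ...
        // (elided; for this sketch we pretend it produced doc.tei.xml)
        Path xml = out.resolve("doc.tei.xml");
        Files.write(xml, "<TEI/>".getBytes(StandardCharsets.UTF_8));

        return Files.readAllBytes(xml);                      // disk read #2
    }
}
```

An in-memory API accepting byte arrays directly would remove both file operations per document.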
When executing grobid-core's createTrainingFulltext method, the TEI XML files generated for the references have a typo in their name: they are called "...tranining.references...": 'i' and 'n' are inverted.
At some point, we could use only Wapiti, since it provides only advantages compared to CRF++. This would simplify maintenance, given that we otherwise need to manage two sets of JNI libraries and models.
Is there an easy way to change the number of iterations when running grobid-trainer?
When ready ;)
Don't use the feature matrix for doing that, to ensure genericity between CRF++ and Wapiti.
See runReflow stuff in the affiliation parser...
When grobid-core commands are run with a non-existent directory for the -gH option, the error refers to a non-existent 'null' directory instead of the erroneous directory that was entered. This is not serious, but it does not help the user find the source of the error and correct it quickly.
Hi,
When I parse a standard scientific paper [1], with the code proposed in [2]
and print the resulting BiblioItem, I get:
...
month='null'
e_month='null'
s_month='null'
a_month='null'
...
e_year='null'
s_year='null'
a_year='null'
year='null'
...
day='null'
e_day='null'
s_day='null'
a_day='null'
...
normalized_publication_date=20-6-2013 / 20-Jun-2013
...
normalized_submission_date=null
...
I looked quickly at BiblioItem.java. I'm not sure which field represents what information, so I don't feel comfortable changing the code for the moment.
However, I've noticed that the year/month/day fields are set only in toTEI() and toTEI2(). Wouldn't it be better to set them in setNormalizedPublicationDate() and/or setNormalizedSubmissionDate(), so that they are defined each time a date is set, and to use these methods everywhere else in the class whenever a date has to be changed?
[1] http://search.arxiv.org:8081/paper.jsp?r=1306.4727&qid=1375964777082mix_nCnN_-1091422823&qs=1306.4727
[2] https://github.com/kermitt2/grobid/wiki/Grobid-java-library
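The suggested change could look roughly like this; a simplified sketch deriving the fields with java.util.Calendar (the real BiblioItem has many more fields and may use different types):

```java
import java.util.Calendar;
import java.util.Date;

// Hypothetical, simplified version of the suggestion: derive year/month/day
// at the moment the normalized date is set, instead of inside toTEI().
public class BiblioItemSketch {
    private Date normalizedPublicationDate;
    private int year = -1, month = -1, day = -1;

    public void setNormalizedPublicationDate(Date date) {
        this.normalizedPublicationDate = date;
        Calendar cal = Calendar.getInstance();
        cal.setTime(date);
        this.year = cal.get(Calendar.YEAR);
        this.month = cal.get(Calendar.MONTH) + 1; // Calendar months are 0-based
        this.day = cal.get(Calendar.DAY_OF_MONTH);
    }

    public int getYear() { return year; }
    public int getMonth() { return month; }
    public int getDay() { return day; }
}
```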
I've tried to apply the code provided here:
https://github.com/kermitt2/grobid/wiki/Grobid-java-library
in a WebService context, where I crawl entire websites and need to run Grobid every time I find a PDF. Unfortunately, calling the following line twice:
MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
results in an Exception being raised:
Caused by: org.grobid.core.exceptions.GrobidPropertyException: Could not set GROBD_HOME
at org.grobid.core.utilities.GrobidProperties.load_GROBID_HOME_PATH(GrobidProperties.java:170)
at org.grobid.core.utilities.GrobidProperties.init(GrobidProperties.java:353)
at org.grobid.core.utilities.GrobidProperties.init(GrobidProperties.java:383)
at org.grobid.core.utilities.GrobidProperties.(GrobidProperties.java:341)
at org.grobid.core.utilities.GrobidProperties.getNewInstance(GrobidProperties.java:119)
at org.grobid.core.utilities.GrobidProperties.getInstance(GrobidProperties.java:99)
at fr.presans.machinelearning.extractor.AnyPDFExtractor.PDF2BiblioItem(AnyPDFExtractor.java:131)
... 8 more
Caused by: javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file: java.naming.factory.initial
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:307)
at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:344)
at javax.naming.InitialContext.lookup(InitialContext.java:411)
at org.grobid.core.utilities.GrobidProperties.load_GROBID_HOME_PATH(GrobidProperties.java:168)
I've tried to define a static boolean in my code in order to call the concerned line only once, but then I get the following error:
Caused by: javax.naming.NameAlreadyBoundException: Name java: is already bound in this Context
at org.apache.naming.NamingContext.bind(NamingContext.java:892)
at org.apache.naming.NamingContext.bind(NamingContext.java:186)
at org.apache.naming.NamingContext.createSubcontext(NamingContext.java:542)
at org.apache.naming.NamingContext.createSubcontext(NamingContext.java:564)
at javax.naming.InitialContext.createSubcontext(InitialContext.java:483)
at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:37)
at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
at fr.presans.machinelearning.extractor.AnyPDFExtractor.PDF2BiblioItem(AnyPDFExtractor.java:134)
I'm not sure what the line MockContext.setInitialContext() is supposed to do, and therefore how I should call it.
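One possible workaround, assuming setInitialContext() must run at most once per JVM: guard the call with a thread-safe flag. This is only a sketch; MockContext itself is Grobid's class and is not touched here:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GrobidInit {
    private static final AtomicBoolean initialized = new AtomicBoolean(false);

    // compareAndSet guarantees initAction runs exactly once, even when
    // several service threads race on the first PDF they encounter.
    public static void initOnce(Runnable initAction) {
        if (initialized.compareAndSet(false, true)) {
            initAction.run();
        }
    }
}
```

The init action would then wrap the MockContext.setInitialContext(...) call (plus handling of its checked exceptions).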
tei.append("\t\t\t\t\t\n"); on line 407 of TEIFormater.java is in the wrong place; it should be moved after the closing } on line 409.
Also, lists are not being closed consistently. I suggest adding
if (listOpened) {
    tei.append("\t\t\t\t\n");
    listOpened = false;
}
after the for loop at line 704. I will send a pull request shortly.
Great tool by the way, GROBID seems to be one of the best PDF to XML tools out there at the moment. Thanks for this!
When running grobid-core's "processXXX" commands, I get a NullPointerException: processDate (ProcessEngine:103).
I also get an error for processAffiliation, coming from the line processAffiliation (AffiliationAddressParser:28), which reads:
input.trim();
I think this error is related to the ones above, but I'm not sure for the moment. I believe 'input' might be null here because i) it is not captured from the command-line arguments, ii) I have not understood how to pass the argument to the command, or iii) some pre-processing done on the argument nullifies the string.
If the bibliographic data is on the second page of the paper, creating the training header file (*.header) doesn't work properly: no information from the second page is included in the *.header file, so training does not work.
The same happens for some other articles, where the journal name, printed on the first page, is not included in the *.header files.
Grobid experts,
When I process with the shipped segmentation model, the performance is excellent: accuracy and F1 scores are close to perfect in all fields (this is reflected in the annotations themselves). However, when I train a new model on the shipped segmentation corpus, the evaluation is very poor. I can think of three possible reasons to explain this:
The file "grobid-service_manual.pdf" does not exactly reflect the latest version (2.8) of the tool.
At least update the pictures; the tool is prettier now :)
When using createTrainingXXX, with papers where a middle name is present in the authors' names, the generated XML file shows "XXX.", with the dot after the closing tag.
Hi, I'm trying to compile this project, but I get errors while following this guide:
https://github.com/kermitt2/grobid/wiki/Grobid-service-quick-start
Tests run: 154, Failures: 0, Errors: 44, Skipped: 1
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] There are test failures.
I am currently using:
Apache Maven 2.2.1 (rdebian-8)
Java version: 1.6.0_45
Java home: /usr/lib/jvm/jdk1.6.0_45/jre
Default locale: es_MX, platform encoding: UTF-8
OS name: "linux" version: "3.2.0-4-686-pae" arch: "i386" Family: "unix"
What do I need: OS, architecture, Java version, Maven version?
Thanks
Grobid should indicate in the generated TEI results the date, provenance information (i.e. which version of Grobid was used), and whether or not the result has been consolidated via CrossRef.
I've been running GROBID on a variety of PDFs lately and came across one where the XML output is not well-formed. I'm using GROBID as a service via its processFulltextDocument method.
Offending PDF
Resulting XML file
The problem with the XML file is on line 172 (below, line breaks added by me for readability), where a &quot; entity seems to have gotten mangled, leaving an unescaped ampersand.
<p>To find out on which level of granularity
visual pattern are classified as single "objects" (see Figure 6),
a questionnaire with eight different screen-dumps
of a commercial multimedia information system
(called "mock-ups", see[Rauterberg 1995c]) was answered
by a heterogeneous group of potential users.
A total of 33 women (between 14 and 66 year of life)
and 33 men (between 21 and 55 year of life) participated
(no significant difference in age between both groups).
The computer experience of each subject was measured on a rating scale
("no experience" ; = 0 … "expert" = 90).
We found a significant difference in computer experience between both groups:
40 ±27 for women and 63 ±24 for men (p≤ .001);
the men were more experienced than the women.</p>
Using a version of GROBID cloned from github about a month or so ago. Let me know if any other info is needed.
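For what it's worth, a minimal sketch of the escaping that would avoid a bare ampersand in the output; this is not Grobid's actual code, just the standard technique (escape '&' first, so other replacements cannot be double-mangled):

```java
public class XmlEscape {
    // Escapes the XML special characters in text content / attribute values.
    // '&' must be replaced first; otherwise the '&' introduced by the other
    // replacements would itself be escaped again.
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```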
From my point of view, Grobid would be unbeatable if the extracted information were consolidated with a free online database.
Are there alternatives to CrossRef, as this is quite expensive?
Index out of bounds in the patent parser test when using Wapiti.
The training data for fulltext models creates corrupt XML files, which causes the trainer to crash.
This is the PDF: http://www.aclweb.org/anthology/W12-4305
Grobid was able to process this file with -exe processHeader, but hung when run with -exe processReferences and -exe processFullText.
When Grobid fails to process a PDF, the REST response should indicate if the failure is due to a PDF timeout, a corrupted PDF or a Grobid process error.
For multi-thread support.
Hi,
Under the latest up-to-date Debian GNU/Linux 7 (wheezy), I get the following exception when running grobid:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:359)
at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:154)
... 6 more
Caused by: java.lang.UnsatisfiedLinkError: /home/gmuller/grobid/grobid-home/lib/lin-64/libwapiti.so: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by /home/gmuller/grobid/grobid-home/lib/lin-64/libwapiti.so)
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1851)
at java.lang.Runtime.load0(Runtime.java:795)
at java.lang.System.load(System.java:1062)
at org.grobid.core.main.LibraryLoader.load(LibraryLoader.java:94)
at org.grobid.core.factory.AbstractEngineFactory.init(AbstractEngineFactory.java:51)
at org.grobid.core.factory.GrobidFactory.(GrobidFactory.java:21)
at org.grobid.core.factory.GrobidFactory.newInstance(GrobidFactory.java:58)
at org.grobid.core.factory.GrobidFactory.getInstance(GrobidFactory.java:32)
at org.grobid.core.engines.ProcessEngine.getEngine(ProcessEngine.java:42)
at org.grobid.core.engines.ProcessEngine.processFullText(ProcessEngine.java:103)
... 12 more
The exact command executed is:
java -Xmx1024m -jar ~/grobid/grobid-core/target/grobid-core-0.3.0.one-jar.jar -gH ~/grobid/grobid-home/ -gP ~/grobid/grobid-home/config/grobid.properties -dIn In/ -dOut Out/ -exe processFullText
Grobid uses quite a lot of rules to recompose diacritics (never encoded as such in PDF), e.g.:
e' -> é
In some cases (not so frequent), the sequence of characters from the PDF is not as expected by the existing rules and the recomposition fails. Currently, the consequence is that both the "accent" and the modified character disappear.
Example: for this PDF (HAL Open Access),
"Clément Cancès" becomes:
<persName>
<forename type="first">Clément</forename>
<surname>Canc</surname>
</persName>
(Mac OS X Preview does not get it right either, by the way!)
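A possible complement to the hand-written rules: when pdf2xml emits a base letter followed by a Unicode combining mark (U+0300..U+036F), standard NFC normalization composes the pair into the precomposed character instead of dropping both. A sketch:

```java
import java.text.Normalizer;

public class DiacriticCompose {
    // NFC (canonical composition) turns "e" + combining acute (U+0301)
    // into the single precomposed character "é" (U+00E9).
    public static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }
}
```

This does not cover the cases where the PDF yields a plain apostrophe or a spacing accent instead of a combining mark; those still need explicit rules.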
Hi,
I just ran into a strange behaviour: when trying to access files in a directory (/tmp/ToTreat) that all belonged to another user, I got a not-very-explanatory error: a NullPointerException...
$ GROBID_HOME=/home/xxx/grobid/grobid-home/
$ DIR=/tmp/ToTreat/
$ java -Xmx1024m -Xms1024m
-jar ${GROBID_HOME}/../grobid-core/target/grobid-core-0.2.*.one-jar.jar
-gH ${GROBID_HOME}
-gP ${GROBID_HOME}/config/grobid.properties
-dIn ${DIR} -dOut ${DIR}
-exe processHeader
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:359)
at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:155)
... 6 more
Caused by: java.lang.NullPointerException
at org.grobid.core.engines.ProcessEngine.processHeader(ProcessEngine.java:56)
... 12 more
Testing Grobid on papers from arXiv.org, I've found at least two cases that Grobid does not handle correctly:
The Wapiti binary models are not recognized on a few Linux machines.
The error is coming from model.c in Wapiti, when the header of the model is parsed via fscanf:
/* mdl_load:
 * Read back a previously saved model to continue training or start labeling.
 * The returned model is synced and the quarks are locked. You must give to
 * this function an empty model fresh from mdl_new.
 */
void mdl_load(mdl_t *mdl, FILE *file) {
    const char *err = "invalid model format";
    uint64_t nact = 0;
    int type;
    if (fscanf(file, "#mdl#%d#%"SCNu64"\n", &type, &nact) == 2) {
        mdl->type = type;
    } else {
        rewind(file);
        if (fscanf(file, "#mdl#%"SCNu64"\n", &nact) == 1)
            mdl->type = 0;
        else
            fatal(err);
    }
    rdr_load(mdl->reader, file);
    mdl_sync(mdl);
    for (uint64_t i = 0; i < nact; i++) {
        uint64_t f;
        double v;
        if (fscanf(file, "%"SCNu64"=%la\n", &f, &v) != 2)
            fatal(err);
        mdl->theta[f] = v;
    }
}
The header of the model looks like this on the problematic machine:
> find grobid/grobid-home/models/ -name "*wapiti" -print -exec head -n2 \{} \;
grobid/grobid-home/models/header/model.wapiti
#mdl#2#314470
#rdr#85/29/0
If the model is retrained on the problematic machine, it works. However, the header format looks the same:
> head -n2 grobid/grobid-home/models/date/model.wapiti
#mdl#2#262
#rdr#50/16/0
12:u00:%x[-3,0],
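To inspect a model file locally, the two header variants read by mdl_load ("#mdl#<type>#<nact>" and the older "#mdl#<nact>") can be parsed in Java as a sanity check; a sketch mirroring the fscanf calls above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WapitiHeader {
    private static final Pattern WITH_TYPE = Pattern.compile("#mdl#(\\d+)#(\\d+)");
    private static final Pattern NO_TYPE = Pattern.compile("#mdl#(\\d+)");

    /** Returns {type, nact} (type 0 for the older format), or null if invalid. */
    public static long[] parse(String firstLine) {
        Matcher m = WITH_TYPE.matcher(firstLine);
        if (m.matches())
            return new long[]{Long.parseLong(m.group(1)), Long.parseLong(m.group(2))};
        m = NO_TYPE.matcher(firstLine);
        if (m.matches())
            return new long[]{0L, Long.parseLong(m.group(1))};
        return null;
    }
}
```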
Users having this issue can use CRF++ as the JNI CRF engine instead of Wapiti (a little slower, takes more memory, and uses smaller models because of GitHub's limitation on binary file size, but the results are similar).
In the file grobid-home/config/grobid.properties, simply change:
grobid.crf.engine=wapiti
#grobid.crf.engine=crfpp
to:
#grobid.crf.engine=wapiti
grobid.crf.engine=crfpp
Hi,
I am using the latest Grobid version with j2sdk1.7-oracle on SMP Debian 3.2.60-1+deb7u1 x86_64 Linux,
and it seems Grobid is somehow unable to use its pool; I see these error messages in my log:
ulf@ivanova:/data/delivermath/grobid/grobid-service$ tail -f grobid.log
17 Mar 2015 18:03.03 [INFO ] GrobidRestService - Initiating Servlet GrobidRestService
17 Mar 2015 18:03.03 [INFO ] LibraryLoader - Loading external native CRF library
17 Mar 2015 18:03.03 [INFO ] LibraryLoader - Loading Wapiti native library...
17 Mar 2015 18:03.03 [INFO ] LibraryLoader - Library crfpp loaded
17 Mar 2015 18:03.03 [INFO ] Lexicon - Initiating dictionary
17 Mar 2015 18:03.03 [INFO ] Lexicon - End of Initialization of dictionary
17 Mar 2015 18:03.03 [INFO ] Lexicon - Initiating names
17 Mar 2015 18:03.03 [INFO ] Lexicon - End of initialization of names
17 Mar 2015 18:03.04 [INFO ] Lexicon - Initiating country codes
17 Mar 2015 18:03.04 [INFO ] Lexicon - End of initialization of country codes
17 Mar 2015 18:03.04 [INFO ] GrobidRestService - Initiating of Servlet GrobidRestService finished.
17 Mar 2015 18:03.29 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.29 [INFO ] WapitiModel - Loading model: /data/delivermath/grobid/grobid-home/models/citation/model.wapiti (size: 12840798)
17 Mar 2015 18:03.32 [INFO ] WapitiModel - Loading model: /data/delivermath/grobid/grobid-home/models/name/header/model.wapiti (size: 2215355)
17 Mar 2015 18:03.32 [INFO ] WapitiModel - Loading model: /data/delivermath/grobid/grobid-home/models/name/citation/model.wapiti (size: 99017)
17 Mar 2015 18:03.32 [INFO ] WapitiModel - Loading model: /data/delivermath/grobid/grobid-home/models/date/model.wapiti (size: 103543)
17 Mar 2015 18:03.32 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.32 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.32 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.33 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.34 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.34 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
17 Mar 2015 18:03.34 [ERROR] GrobidPoolingFactory - Number of Engines in pool active/max: 1/10
.
.
.
I have little experience with Maven, so it's entirely possible that I'm doing something wrong. I'm following the Quick Start Guide, and it looks like Maven is downloading a bunch of HTML files and trying to use them as .jars when I run "mvn clean install":
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Compilation failure
error: error reading /home/steve/.m2/repository/com/aliasi/lingpipe/3.8.2/lingpipe-3.8.2.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/commons-pool/commons-pool/1.6/commons-pool-1.6.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/commons-io/commons-io/2.0.1/commons-io-2.0.1.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/org/apache/commons/commons-lang3/3.0.1/commons-lang3-3.0.1.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/org/slf4j/slf4j-api/1.6.6/slf4j-api-1.6.6.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/org/slf4j/slf4j-log4j12/1.6.6/slf4j-log4j12-1.6.6.jar; error in opening zip file
error: error reading /home/steve/.m2/repository/directory-naming/naming-java/0.8/naming-java-0.8.jar; error in opening zip file
All of those files look to be HTML, not jars. I'm seeing exactly the same thing on both OS X and Linux.
On Linux:
Apache Maven 2.2.1 (rdebian-4)
Java version: 1.6.0_26
on OS X:
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
Java version: 1.7.0_25, vendor: Oracle Corporation
Any help you can provide would be appreciated, let me know if I can provide more useful information.
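One quick way to confirm the HTML-instead-of-jar diagnosis: a valid jar starts with the ZIP magic bytes "PK" (0x50 0x4B), while a mirrored error page starts with the '<' of "<html>". A small check (after deleting the bad files from ~/.m2, re-running mvn re-downloads them):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class JarCheck {
    // Reads only the first two bytes and compares them against the
    // ZIP local-file-header magic "PK" that every valid jar begins with.
    public static boolean looksLikeJar(Path p) throws IOException {
        try (InputStream in = Files.newInputStream(p)) {
            return in.read() == 0x50 && in.read() == 0x4B;
        }
    }
}
```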
I tried to run the wiki example to extract the header; however, I got this error:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:359)
at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:155)
... 6 more
Caused by: java.lang.NullPointerException
at org.grobid.core.engines.ProcessEngine.processHeader(ProcessEngine.java:56)
... 12 more
Any clue?
Check segmentation of instances in the labeled result.
Many CRF++ functions that work on Linux/Mac do not work at all on Windows.
On training (the what function is not recognized):
java.lang.UnsatisfiedLinkError: org.chasen.crfpp.CRFPPJNI.CRFPPTrainer_what(JLorg/chasen/crfpp/CRFPPTrainer;)Ljava/lang/String;
at org.chasen.crfpp.CRFPPJNI.CRFPPTrainer_what(Native Method)
at org.chasen.crfpp.CRFPPTrainer.what(CRFPPTrainer.java:41)
at org.grobid.trainer.AbstractTrainer.train(AbstractTrainer.java:49)
at org.grobid.trainer.AbstractTrainer.runTraining(AbstractTrainer.java:139)
at org.grobid.trainer.NameHeaderTrainer.main(NameHeaderTrainer.java:93)
On the server (the instantiation of Model is not supported: newModel):
java.lang.UnsatisfiedLinkError: org.chasen.crfpp.CRFPPJNI.new_Model(Ljava/lang/String;)J
at org.chasen.crfpp.CRFPPJNI.new_Model(Native Method)
at org.chasen.crfpp.Model.(Model.java:46)
at org.grobid.core.engines.ModelMap.getNewModel(ModelMap.java:101)
at org.grobid.core.engines.ModelMap.getModel(ModelMap.java:87)
at org.grobid.core.engines.ModelMap.initModels(ModelMap.java:64)
at org.grobid.core.factory.AbstractEngineFactory.fullInit(AbstractEngineFactory.java:58)
at org.grobid.service.GrobidRestService.(GrobidRestService.java:74)
... 69 more
I'm not able to compile the current version of Grobid on Linux (64-bit).
The error messages in the log files look like this:
Tests in error:
testFullTextTrainingParser(org.grobid.core.test.TestFullTextParser): An exception occured while running Grobid training data generation for full text.
testFullTextParser(org.grobid.core.test.TestFullTextParser): An exception occured while running Grobid.
testHeaderHeader(org.grobid.core.test.TestHeaderParser): An exception occurred while running Grobid on file /var/run/media/holoxy/INTENSO/Programme/grobid_neu/grobid/grobid-home/tmp: java.lang.RuntimeException: PDF to XML conversion timed out
testTrainingHeader(org.grobid.core.test.TestHeaderParser): An exception occurred while running Grobid.
testReferences(org.grobid.core.test.TestReferencesParser): An exception occured while running Grobid.
Tests run: 151, Failures: 0, Errors: 5, Skipped: 1
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] grobid-parent ..................................... SUCCESS [0.005s]
[INFO] grobid-home ....................................... SUCCESS [1:11.088s]
[INFO] grobid-core ....................................... FAILURE [28.210s]
[INFO] grobid-trainer .................................... SKIPPED
[INFO] grobid-service .................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:39.876s
[INFO] Finished at: Sat Nov 02 00:40:20 CET 2013
[INFO] Final Memory: 14M/263M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project grobid-core: There are test failures.
[ERROR]
[ERROR] Please refer to /var/run/media/holoxy/INTENSO/Programme/grobid_neu/grobid/grobid-core/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project grobid-core: There are test failures.
Please refer to /var/run/media/holoxy/INTENSO/Programme/grobid_neu/grobid/grobid-core/target/surefire-reports for the individual test results.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:213)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:321)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:158)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.MojoFailureException: There are test failures.
Please refer to /var/run/media/holoxy/INTENSO/Programme/grobid_neu/grobid/grobid-core/target/surefire-reports for the individual test results.
at org.apache.maven.plugin.surefire.SurefireHelper.reportExecution(SurefireHelper.java:83)
at org.apache.maven.plugin.surefire.SurefirePlugin.writeSummary(SurefirePlugin.java:176)
at org.apache.maven.plugin.surefire.SurefirePlugin.handleSummary(SurefirePlugin.java:150)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:650)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:586)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :grobid-core
Any ideas?
Thanks in advance
holoxy
When running grobid-trainer with wrong arguments, it replies with a 'Usage' line containing this:
"Usage: (...) -pH /path/to/Grobid/home (...)"
The -pH switch is a typo: TrainerRunner expects -gH.
Trying to execute:
String tei = engine.processHeader("some filename with spaces.pdf", false, resHeader);
I get the following error from Grobid:
org.grobid.core.exceptions.GrobidException: An exception occurred while running Grobid on file /Users/username/dev/rep/grobid/grobid-home/tmp: java.lang.RuntimeException: PDF to XML conversion failed due to:
at org.grobid.core.engines.HeaderParser.processing(HeaderParser.java:79)
at org.grobid.core.engines.Engine.processHeader(Engine.java:454)
at org.grobid.core.engines.Engine.processHeader(Engine.java:419)
It'd be great if it could handle filenames with spaces.
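Until that is fixed, a possible caller-side workaround, assuming the failure comes from the space in the path when Grobid shells out to pdf2xml: copy the PDF to a space-free temporary name before calling engine.processHeader(...). A sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpacelessCopy {
    // Copies the PDF into a fresh temp directory under a name with
    // spaces replaced by underscores, and returns the new path.
    public static Path toSpacelessTemp(Path original) throws IOException {
        Path target = Files.createTempDirectory("grobid-safe")
                .resolve(original.getFileName().toString().replace(" ", "_"));
        return Files.copy(original, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```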
Grobid Service uses the class ModelMap to initialize the models (initModels), and thus not the new TaggerFactory.
Several tests fail for me because the tests' getAttributes() functions use a TreeSet of a type that does not implement Comparable. Replacing Set/TreeSet with Vector allows me to successfully build GROBID.
diff --git a/grobid-core/src/test/java/org/grobid/core/annotations/DescriptionTest.java b/grobid-core/src/test/java/org/grobid/core/annotations/DescriptionTest.java
index a5337bc..3c08462 100644
--- a/grobid-core/src/test/java/org/grobid/core/annotations/DescriptionTest.java
+++ b/grobid-core/src/test/java/org/grobid/core/annotations/DescriptionTest.java
@@ -7,6 +7,7 @@ import static org.grobid.core.utilities.TeiValues.XML;
 import java.io.IOException;
 import java.util.Iterator;
 import java.util.Set;
+import java.util.Vector;
 import java.util.TreeSet;
 
 import javax.xml.namespace.QName;
@@ -433,7 +434,7 @@ public class DescriptionTest extends XMLTestCase {
 	}
 
 	private static Iterator getAttributes(final Attribute... pAttr) {
-		Set attributes = new TreeSet();
+		Vector attributes = new Vector();
 		for (final Attribute attr : pAttr) {
 			attributes.add(attr);
 		}
diff --git a/grobid-core/src/test/java/org/grobid/core/annotations/ParagraphTest.java b/grobid-core/src/test/java/org/grobid/core/annotations/ParagraphTest.java
index b8798e6..355b027 100644
--- a/grobid-core/src/test/java/org/grobid/core/annotations/ParagraphTest.java
+++ b/grobid-core/src/test/java/org/grobid/core/annotations/ParagraphTest.java
@@ -5,6 +5,7 @@ import java.util.Iterator;
 import java.util.List;
 import java.util.Set;
 import java.util.TreeSet;
+import java.util.Vector;
 
 import static org.grobid.core.utilities.TeiValues.ATTR_ID;
 import static org.grobid.core.utilities.TeiValues.W3C_NAMESPACE;
@@ -143,7 +144,7 @@
 	}
 
 	private static Iterator getAttributes(final Attribute... pAttr) {
-		Set attributes = new TreeSet();
+		Vector attributes = new Vector();
 		for (final Attribute attr : pAttr) {
 			attributes.add(attr);
 		}
diff --git a/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParsedInfoTest.java b/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParsedInfoTest.java
index ae53a6d..4954f01 100644
--- a/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParsedInfoTest.java
+++ b/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParsedInfoTest.java
@@ -7,6 +7,7 @@ import static org.junit.Assert.assertTrue;
 
 import java.util.Iterator;
 import java.util.Set;
+import java.util.Vector;
 import java.util.TreeSet;
 
 import javax.xml.namespace.QName;
@@ -139,7 +140,7 @@ public class TeiStAXParsedInfoTest {
 	}
 
 	private static Iterator getAttributes(final Attribute... pAttr) {
-		Set attributes = new TreeSet();
+		Vector attributes = new Vector();
 		for (final Attribute attr : pAttr) {
 			attributes.add(attr);
 		}
diff --git a/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParserTest.java b/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParserTest.java
index b864fe3..ec160fa 100644
--- a/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParserTest.java
+++ b/grobid-core/src/test/java/org/grobid/core/annotations/TeiStAXParserTest.java
@@ -10,6 +10,7 @@ import java.io.OutputStream;
 import java.io.UnsupportedEncodingException;
 import java.util.Iterator;
 import java.util.Set;
+import java.util.Vector;
 import java.util.TreeSet;
 
 import javax.xml.namespace.QName;
@@ -263,7 +264,7 @@ public class TeiStAXParserTest extends XMLTestCase {
 	}
 
 	private static Iterator getAttributes(final Attribute... pAttr) {
-		Set attributes = new TreeSet();
+		Vector attributes = new Vector();
 		for (final Attribute attr : pAttr) {
 			attributes.add(attr);
 		}
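The root cause can be shown in a few lines: a TreeSet with natural ordering calls compareTo() on insertion, so the very first add() of a non-Comparable element (like a StAX Attribute) throws ClassCastException. The Attr class below is a minimal stand-in for such a type, not GROBID code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TreeSetDemo {
    // Minimal stand-in for a class that, like javax.xml.stream.events.Attribute,
    // does not implement Comparable
    static class Attr {
        final String name;
        Attr(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        Set<Attr> sorted = new TreeSet<>();
        try {
            // TreeSet needs a natural ordering (or a Comparator): this throws
            sorted.add(new Attr("id"));
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: Attr is not Comparable");
        }

        // Any insertion-ordered collection works for merely collecting attributes
        List<Attr> attributes = new ArrayList<>();
        attributes.add(new Attr("id"));
        attributes.add(new Attr("xml:lang"));
        System.out.println("collected: " + attributes.size());
    }
}
```

Vector (as in the patch above) or ArrayList both avoid the problem, since neither requires its elements to be Comparable.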
Thanks a lot for the very promising metadata extraction tool.
Would it be possible to add an option for recursive extraction over folders and subfolders?
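In the meantime, the recursion itself is easy to do outside GROBID with java.nio. Below is a small sketch (findPdfs is a hypothetical helper, not an existing GROBID API) that collects every PDF under a root folder, including subfolders; each path could then be fed to the batch processing one by one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PdfWalker {
    // Recursively collect every .pdf (case-insensitive) under root
    static List<Path> findPdfs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString()
                                      .toLowerCase().endsWith(".pdf"))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        for (Path pdf : findPdfs(Paths.get("."))) {
            System.out.println(pdf);
        }
    }
}
```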
See fork with JNI: https://github.com/kermitt2/Wapiti
As the saying goes, "Converting PDF to XML is a bit like converting hamburgers into cows" (Peter Murray-Rust): the character spacing stored in a PDF is not what we see rendered ;)
A good PDF processing library like pdftoxml tries to recreate valid spacing (with respect to the visual rendering), but of course this is difficult.
It appears that quite a lot of PDFs result in problematic character spacing for some fonts, in particular in the header section. For instance, in this PDF the author sequence is extracted by pdftoxml as:
M ihael ARCAN 1 Chr ist ian F E DERM AN N 2 Paul BU I T E LAAR 1
It is then very hard for the CRF to predict a good sequence labeling on this...
Note that Mac OS X Preview recomposes it correctly; a direct cut and paste from the PDF gives:
Mihael ARCAN1 Christian FEDERMANN2 Paul BUITELAAR1
If Apple can do it, we can certainly do it right too ;)
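One toy repair heuristic (NOT what GROBID or pdftoxml actually do, just an illustration of the problem's shape) is to re-attach a detached single capital to a following lowercase fragment, which fixes "M ihael" but not multi-letter splits like "Chr ist ian" or "BU I T E LAAR" — and it would wrongly merge ordinary prose like "I saw", so it is only plausible on name-like fields:

```java
public class SpacingFix {
    // Toy heuristic: rejoin a lone capital letter to a following lowercase
    // fragment, e.g. "M ihael" -> "Mihael". Multi-letter fragments and
    // all-caps splits are left untouched.
    static String mergeDetachedInitial(String s) {
        return s.replaceAll("\\b([A-Z])\\s+(?=[a-z])", "$1");
    }

    public static void main(String[] args) {
        System.out.println(mergeDetachedInitial("M ihael ARCAN")); // Mihael ARCAN
    }
}
```

A real fix has to use the glyph coordinates from the PDF (as Preview presumably does) rather than string patterns.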
I'm following the instructions in the readme to get the code working, but getting the error:
Caused by: java.lang.RuntimeException: Unable to find a native CRF++ library: Folder /media/D020-1F62/github/grobid/grobid-home/lib/lin-32 does not exist
I'm on "Linux rb-806-02-c 3.0.0-32-generic-pae #51-Ubuntu SMP Thu Mar 21 16:09:48 UTC 2013 i686 i686 i386 GNU/Linux"
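The folder name in the error ("lin-32") suggests the native library path is derived from OS and word size. A sketch of such detection from standard JVM properties is below; the folder naming scheme here simply mirrors the error message, and GROBID's real detection code may differ.

```java
public class NativeLibFolder {
    // Map JVM os.name / os.arch to a folder name like "lin-32" or "mac-64".
    // Naming scheme assumed from the error message, not taken from GROBID source.
    static String libFolder(String osName, String osArch) {
        String lower = osName.toLowerCase();
        String os = lower.contains("mac") ? "mac"
                  : lower.contains("win") ? "win"
                  : "lin";
        String bits = osArch.contains("64") ? "64" : "32";
        return os + "-" + bits;
    }

    public static void main(String[] args) {
        System.out.println(libFolder(System.getProperty("os.name"),
                                     System.getProperty("os.arch")));
    }
}
```

On the 32-bit Ubuntu above (i686), this yields "lin-32", so the report is about that folder genuinely missing from grobid-home/lib, not about misdetection.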
I'm adding support for PDFMiner (https://github.com/euske/pdfminer) and PDFBox (https://pdfbox.apache.org) in my fork. So far, all the tests still pass when using PDFMiner. I will issue a pull request after further testing and cleanup.
It may be an idea to decouple the PDF extraction entirely, and have the option to import raw XML or JSON directly (e.g. created by a previous process, or from a filestore).
It would be good to share notes on best practices for editing the training data. Here are a few things that I have come up with:
When running processHeader on some PDFs, I get an abstract with all the words concatenated, without spaces between them.
For instance:
http://guillaumemuller1.free.fr/Articles/iTrust2006_LNAI.pdf
But not with:
http://search.arxiv.org:8081/paper.jsp?r=1306.4727&qid=1375964777082mix_nCnN_-1091422823&qs=1306.4727
However, pdftoxml run from the command line separates the words (TOKENs) correctly.
Is it possible to get the probability/confidence of each tag?