vu-rm-pip3's Introduction

The VU Reading Machine provides an up-to-date NewsReader pipeline for Dutch, for use on Linux or with Docker.

NewsReader pipelines process Dutch texts and generate high-level semantic interpretations: annotated concepts, entities (people, organisations, places), events and roles, time expressions and opinions. These interpretations are of interest to humanities researchers and social scientists who want to investigate the content of large text collections. Documents are annotated in the Natural Language Annotation Format (NAF), version 3.

The VU Reading Machine was developed to provide a robust and flexible pipeline. A simple scheduler lets you specify which components to run, and care has been taken to identify and report component failures as they occur, so that they do not pass silently.
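The idea of running components in order and reporting the first failure can be sketched as follows. This is a minimal illustration, not the scheduler shipped with the pipeline: the function name run_components and the (name, command) pair format are assumptions made for the example. The log messages mirror the "-- Running ..." and "module ... returned an error" lines that appear in the pipeline logs.

```python
import subprocess
import sys


def run_components(components, text):
    """Pipe `text` through each (name, command) pair in order.

    Hypothetical helper: stops at the first component that exits with a
    non-zero status and reports it on stderr, so that a failure is not
    silently passed downstream.
    Returns (output, failed_name); failed_name is None on success.
    """
    data = text
    for name, command in components:
        print("-- Running " + name, file=sys.stderr)
        result = subprocess.run(command, input=data, capture_output=True, text=True)
        if result.returncode != 0:
            print("module " + name + " returned an error", file=sys.stderr)
            # Return the last good output together with the failing component.
            return data, name
        data = result.stdout
    return data, None
```

Returning the name of the failing component, rather than raising, lets a caller decide whether to stop or to keep the partial output produced so far.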

Documentation

You will find detailed installation and usage instructions in the documentation.

Quick start

Linux

Clone the repository:

git clone https://github.com/cltl/vu-rm-pip3.git

Set up a Python 3 environment and install the packages listed in requirements.txt, then run the script install.sh to install the components of the Dutch NewsReader pipeline:

./scripts/install.sh

The script run-pipeline.sh runs the pipeline on a raw text document and produces a fully annotated NAF document:

./scripts/run-pipeline.sh < input.txt > output.naf

Docker

You can also pull and run a Docker image from DockerHub:

docker pull vucltl/vu-rm-pip3

To run the image on an input file ./example/test.txt:

docker run -v $(pwd)/example/:/wrk/ vucltl/vu-rm-pip3 /wrk/test.txt > example/test.out 2> example/test.log

RDF

The script scripts/bin/naf2sem-grasp.sh extracts RDF files from the NAF files output by the pipeline.

Contact

Please submit issues to the issue tracker. Questions can be addressed to Sophie Arnoult: [email protected]

vu-rm-pip3's People

Contributors

sarnoult, angel-daza, dependabot[bot], jrvosse

vu-rm-pip3's Issues

run_pipeline.sh

The README should explain that the path in run_pipeline.sh needs to be adapted. The script currently says:

wrapper_dir=/home/arnoult/vu-rm-pip3

you can copy this config file and adapt it as you wish

#cfg=$wrapper_dir/example/pipeline.yml

Build failure for Heideltime component on Mac and Windows

Building IXAPipeHeidelTime 1.0.1 results in Build Failure:

[ERROR] Failed to execute goal on project time: Could not resolve dependencies for project ixa.pipe:time:jar:1.0.1: The following artifacts could not be resolved: local:jvntextpro-2.0:jar:1.0, local:de.unihd.dbs.heideltime.standalone:jar:1.0: Could not find artifact local:jvntextpro-2.0:jar:1.0 in heideltime-local-dependency-repo (file:///private/var/folders/wp/n2j6jp5n6ys9289dhnfmm9sr0000gs/T/tmp.XXXXXXXXXX.8fGSNOuD/ixa-heideltime/repo)

Error occurs on Mac and Windows (so far not on Linux).

install error: ixa_tok expects jar file in folder components

Running ixa_tok failed because it expects the JAR file in the folder "components", but the file is actually installed in "modules". Renaming the folder "modules" to "components" fixes the problem.

-- Running ixa_tok
Error: Unable to access jarfile /Tools/vu-rm-pip3/components/java/ixa-pipe-tok-1.8.5-exec.jar
module ixa_tok returned an error

Alpino 'setitimer' error on Windows Subsystem for Linux

When Alpino parses sentences, it can time how long each sentence takes and stop parsing once a maximum time has passed (see the documentation of the 'user_max' flag in the guide).

When the pipeline is installed on the Windows Subsystem for Linux, and the pipeline is tested (using 'cat example/test.txt | run-pipeline.sh > output.naf'), the following error is logged in 'pipeline.log':

-- Running alpino
[2018-11-05 11:17:46,004 root INFO ] Calling Alpino with 10 sentences
hdrug: process 4761 on host ESLT0081 (datime(2018,11,5,11,17,50))
! System error
! '$start_timer_a/3: 'setitimer' failed: Invalid argument'

It seems timing the parsing of each sentence, using 'setitimer', does not work.

One way to bypass this error is to leave the 'user_max' flag unset. In this pipeline, the 'user_max' flag is set via a Python script: 'scripts/bin/alpino' sets the '-t' flag, 'components/python/morphosyntactic_parser_nl/alpinonaf/__main__.py' passes this '-t' argument on to 'morph_syn_parser.py', which in turn sets the 'user_max' flag. Removing '-t 0.2' from 'scripts/bin/alpino' therefore avoids the 'setitimer' error.

Nonetheless, this issue should be solved properly, if that's possible under the Windows Subsystem for Linux.
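As a stopgap, the '-t' argument could be dropped only when the pipeline detects that it is running under the Windows Subsystem for Linux. The sketch below is an assumption-laden illustration, not code from the pipeline: both function names are hypothetical, and the check relies on the common heuristic that WSL kernels mention "microsoft" in /proc/version.

```python
def is_wsl(proc_version_text):
    """Heuristic: WSL kernels identify themselves with 'microsoft' in /proc/version."""
    return "microsoft" in proc_version_text.lower()


def alpino_time_args(time_limit, on_wsl):
    """Return the '-t' arguments for the Alpino wrapper (hypothetical helper).

    On WSL, or when no limit is requested, the '-t' flag is omitted so that
    'user_max' is never set and 'setitimer' is never called.
    """
    if on_wsl or time_limit is None:
        return []
    return ["-t", str(time_limit)]


# Usage sketch (reads the real kernel string on a Linux system):
# with open("/proc/version") as handle:
#     args = alpino_time_args(0.2, is_wsl(handle.read()))
```

This only hides the symptom: sentences are then parsed without a time limit, so a slow sentence can stall the pipeline.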

Cannot run Alpino on Mac OS X

I get the following error when trying to run Alpino, after tokenisation completed successfully. I get the same error when running Alpino from the command line, outside the pipeline:

PiekMacPro:create_bin piek$ ./Alpino.bin
-bash: ./Alpino.bin: cannot execute binary file

The input NAF for the pipeline is the following:

Dit is een leuke tweet van mij .

The error in the log is:

-- Running alpino
[2018-10-17 12:29:50,550 root INFO ] Calling Alpino with 1 sentences
/Tools/vu-rm-pip3/components/resources/Alpino/bin/Alpino: line 15: ldd: command not found
/Tools/vu-rm-pip3/components/resources/Alpino/bin/Alpino: line 24: /Tools/vu-rm-pip3/components/resources/Alpino/create_bin/Alpino.bin: cannot execute binary file
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Tools/vu-rm-pip3/components/python/morphosyntactic_parser_nl/alpinonaf/__main__.py", line 25, in <module>
    in_obj = parse(input_file, max_min_per_sent=args.max_minutes)
  File "/Tools/vu-rm-pip3/components/python/morphosyntactic_parser_nl/alpinonaf/morph_syn_parser.py", line 330, in parse
    for sentence, tree, dependencies in call_alpino(sentences, max_min_per_sent):
  File "/Tools/vu-rm-pip3/components/python/morphosyntactic_parser_nl/alpinonaf/morph_syn_parser.py", line 254, in call_alpino_local
    raise Exception("Call to alpino failed (see logs): %s" % cmd)
Exception: Call to alpino failed (see logs): /Tools/vu-rm-pip3/components/resources/Alpino/bin/Alpino user_max=12000 end_hook=xml -flag treebank /var/folders/ht/_fx2jf216xb_vrtykf56v7c40000gp/T/tmpxmvah8p9 -parse
module alpino returned an error

Alpino: NameError: name 'time' is not defined

I encountered the following error with Alpino:

File vu-rm-pip3/components/python/morphosyntactic_parser_nl/alpinonaf/morph_syn_parser.py", line 247, in call_alpino_local
  t1 = time.time()
NameError: name 'time' is not defined
module vua-alpino returned an error

The error seems to be resolved by adding 'import time' to the file in question.

running alpino

KafNafParserPy cannot be found from where Alpino is run. It is located deep down in the module opinion_miner_deluxePP.

-- Running alpino
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/anaconda3/lib/python3.6/runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/anaconda3/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/Tools/vu-rm-pip3/components/python/morphosyntactic_parser_nl/alpinonaf/__init__.py", line 1, in <module>
    from .morph_syn_parser import parse, version
  File "/Tools/vu-rm-pip3/components/python/morphosyntactic_parser_nl/alpinonaf/morph_syn_parser.py", line 13, in <module>
    from KafNafParserPy import *
ModuleNotFoundError: No module named 'KafNafParserPy'
module alpino returned an error
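A missing dependency like this can be detected before any component is started. The following is a minimal sketch using only the standard library; the helper name missing_modules is an assumption made for the example and is not part of the pipeline.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Usage sketch: report the problem up front instead of failing mid-pipeline.
# missing = missing_modules(["KafNafParserPy"])
# if missing:
#     print("Not importable: " + ", ".join(missing))
```

If KafNafParserPy is reported as missing, one possible remedy is to put the directory that contains it on PYTHONPATH; whether that is the fix intended by the maintainers is not stated in this issue.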

Python3 compatibility of Opinion Miner Deluxe

From the ReadMe:

Not all python components are python3-compatible yet. Call the script ./scripts/util/port-to-python3.sh to back-up the relevant modules and convert them to python2/3-compatible code. The script uses 2to3, futurize and module-specific patches.

One change between Python2 and Python3 is the way accelerated versions of modules are imported (see this StackExchange).

However, in the 'components\python\opinion_miner_deluxePP\extract_features_xx.py' files, the following lines remain in Python2 style:

try:
    import cPickle as pickler
except ValueError:
    import pickle as pickler

This should be changed to:

import _pickle as pickler

Preferably, this change should occur automatically within the './scripts/util/port-to-python3.sh' script.
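An alternative that works under both Python 2 and Python 3 is sketched below; it is a suggestion, not the change applied in the repository. Note that a missing module raises ImportError, not ValueError, so the original except clause would not catch the failure under Python 3. In Python 3 the standard pickle module already uses its C implementation when one is available, so importing the private _pickle module directly is not required.

```python
try:
    import cPickle as pickler  # Python 2: C-accelerated implementation
except ImportError:
    import pickle as pickler   # Python 3: uses the C accelerator when available

# Round-trip a small example object to show the alias behaves like pickle.
features = {"token": "leuke", "polarity": "positive"}
restored = pickler.loads(pickler.dumps(features))
```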

Make Linux requirements more explicit

The 'svm_wsd/blob/master/install_naf.sh' script requires 'unzip'. This does not (always) come by default with Linux/Ubuntu. The 'install.sh' script also does not provide a clear error message if the user does not have 'unzip'. Therefore, it may be necessary for the user to run 'apt-get install unzip'. This should be made more explicit (in a list of requirements or by using a script to install all those requirements automatically).

Additional Linux requirements are:

  • java (v8)
  • maven (v3)
  • python (v3)
  • pip (v3)
  • timbl
  • libxss1
  • libxft2
  • libtk8.5
  • unzip
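A requirements check along these lines could run before install.sh. This is a minimal sketch: the helper name and the executable names in REQUIRED_COMMANDS (for example "mvn" for maven and "pip3" for pip) are assumptions, and the shared libraries in the list above (libxss1, libxft2, libtk8.5) are not executables, so they are not covered.

```python
import shutil

# Assumed executable names for the command-line tools in the list above.
REQUIRED_COMMANDS = ["java", "mvn", "python3", "pip3", "timbl", "unzip"]


def missing_commands(commands):
    """Return the commands that cannot be found on the current PATH."""
    return [command for command in commands if shutil.which(command) is None]


# Usage sketch: print a clear message instead of failing halfway through.
# for command in missing_commands(REQUIRED_COMMANDS):
#     print("Missing requirement: " + command)
```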

Component directory incorrect for WSD

In the 'wsd_to-python3.sh' script, the line

dy2=$modulesdir/svm_wsd

should be changed into

dy2=$modulesdir/python/svm_wsd

Or, alternatively, in the 'install.sh' script, the line

$utildir/wsd_to-python3.sh -m $modulesdir

should be changed to

$utildir/wsd_to-python3.sh -m $pythondir

Python3 compatibility of Word Sense Disambiguation

From the ReadMe:

Not all python components are python3-compatible yet. Call the script ./scripts/util/port-to-python3.sh to back-up the relevant modules and convert them to python2/3-compatible code. The script uses 2to3, futurize and module-specific patches.

In the 'components\python\svm_wsd\dsc_wsd_tagger.py' script, the Python2 function xrange is called twice. This function was renamed to range in Python3 (see here).

Preferably, this change should occur automatically within the './scripts/util/port-to-python3.sh' script.
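Until the porting script handles this, a small compatibility shim at the top of the module keeps the existing xrange calls working under both versions. This is a sketch of a common idiom, not the patch applied in the repository.

```python
try:
    xrange  # Python 2: the built-in exists, nothing to do
except NameError:
    xrange = range  # Python 3: range is already lazy, so alias it

# Existing calls then run unchanged under either version.
total = sum(index for index in xrange(5))
```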

Timbl error in SRL

When I tested the pipeline locally using 'cat example/test.txt | run-pipeline.sh -c pipeline.yml > output.naf', the module SRL did not output anything. I solved this by changing in the file 'scripts\bin\vua-srl.sh' the line

timbl -mO:I1,2,3,4 -i $mod/25Feb2015_e-mags_mags_press_newspapers.wgt -t $FEATUREVECTOR -o $TIMBLOUTPUTFILE &>/dev/null

to

timbl -mO:I1,2,3,4 -i $mod/25Feb2015_e-mags_mags_press_newspapers.wgt -t $FEATUREVECTOR -o $TIMBLOUTPUTFILE &>/dev/null & sleep 4

Perhaps the 'timbl' module runs too slowly on a regular laptop to produce output before the next command is run. The 'sleep' solves this in a hacky way, but a better solution should be possible.
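One possible explanation, not confirmed in this issue, is the shell rather than the speed of the laptop: in shells that do not support the bash-specific '&>' operator, 'command &>/dev/null' is parsed as 'command &' followed by a redirection, which runs the command in the background. Whether the script is executed by such a shell here is an assumption. The sketch below shows the intended behaviour in Python, where the call returns only after the process has exited; the helper name run_quietly is hypothetical.

```python
import subprocess


def run_quietly(command):
    """Run `command`, discard its output, and return its exit status.

    subprocess.run blocks until the process has finished, so the output
    file written by the command exists before the next step starts.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode
```

In the shell script itself, the equivalent would be to redirect with a form every POSIX shell understands and to drop both the '&' and the 'sleep'.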
