PolDeepNer2

PolDeepNer2 is an improved version of PolDeepNer. The tool recognizes and categorizes named entities using neural networks and transformer-based language models.

The tool ships with a set of pre-trained models for Polish and other languages.

It contains a pre-trained model trained on the NKJP corpus which recognizes nested annotations of the following types:

Contributors

Notebooks

notebooks/pdn2_cpu.py
This notebook shows how to install the module and use its API to process raw text on a CPU.

Models

PolEval 2018 (NKJP NER model)

PolDeepNer2 achieves state-of-the-art (SOTA) results on the PolEval 2018 dataset.

NKJP NER categories

| Model | Score | F1 Overlap | F1 Exact | Score main | Time CPU | Time GPU | Source |
|---|---|---|---|---|---|---|---|
| PolDeepNer2 | | | | | | | |
| HerBERT large, spacy-ext, sq | 92.1 | 92.7 | 89.9 | | | ~2m 24s | |
| Polish RoBERTa base, spacy-ext, sq | 91.4 | 91.9 | 89.1 | | ~1.5 h | ~2m 8s | |
| Polish RoBERTa base, toki | 90.0 | 90.5 | 87.7 | 92.40 | ~6h 30m | ~6m 30s | |
| Polish RoBERTa base, spacy-ext | 89.8 | 90.4 | 87.4 | 92.20 | | ~8m 2s | |
| Systems published after PolEval 2018 | | | | | | | |
| Dadas et al. 2020 [1] | 88.6 | 87.0 | 89.0 | - | - | - | link |
| Polish RoBERTa (large) [1] | - | - | - | 89.98 | - | - | link |
| Polish RoBERTa (base) [1] | - | - | - | 87.94 | - | - | link |
| spaCy (pl_spacy_model) | - | - | - | 87.50 | ~3m | - | link |
| Top 3 systems from PolEval 2018 | | | | | | | |
| Applica.ai | 86.6 | 87.7 | 82.6 | - | - | - | link |
| PolDeepNer | 85.1 | 85.9 | 82.2 | - | - | ~9m | link |
| Liner2 | 81.0 | 81.8 | 77.8 | - | ~3m | - | link |

[1] The model is not available. Only the evaluation results were published.

Comparison of loading and processing times

| Model | Library | Tokenizer | Model loading [s] | Preprocessing [s] | NE recognition [s] | Total [s] |
|---|---|---|---|---|---|---|
| Polish RoBERTa base | fairseq | - | 12.28 | 50.90 | 65.23 | 128.4 |
| HerBERT large | HuggingFace | HerbertTokenizerFast | 18.44 | 50.83 | 103.70 | 173.0 |
| HerBERT large | HuggingFace | XLMTokenizer | 18.33 | 51.42 | 177.50 | 247.3 |
  • Dataset size: 1828 documents (3M characters).
  • GPU: RTX Titan (24 GB, 4608 CUDA cores).

NKJP NER times

Comparison of named entity recognition times for different datasets

| Dataset | Size [million chars] | NER time [minutes] |
|---|---|---|
| PolEval 2018 NER test dataset | 3 | 2.6 |
| Monthly volume of news from Polish news portals [70 sources] | 160 | 136.9 |
| Polish Wikipedia (2013 dump) | 1000 | 855.6 |
| Annual volume of news from Polish news portals [70 sources] | 1920 | 1642.7 |
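
The times in this table grow almost linearly with dataset size, so a rough estimate for another corpus can be derived from a per-million-character rate. The sketch below computes that rate from the table rows and extrapolates from it. The helper names and the 500M-character corpus are illustrative assumptions, and actual throughput will depend on hardware, model, and document lengths.

```python
# Rows copied from the table above:
# (dataset, size in million characters, NER time in minutes).
ROWS = [
    ("PolEval 2018 NER test dataset", 3, 2.6),
    ("Monthly news volume", 160, 136.9),
    ("Polish Wikipedia (2013 dump)", 1000, 855.6),
    ("Annual news volume", 1920, 1642.7),
]


def minutes_per_million(rows):
    """Pooled rate: total minutes divided by total size."""
    total_size = sum(size for _, size, _ in rows)
    total_minutes = sum(minutes for _, _, minutes in rows)
    return total_minutes / total_size


def estimate_minutes(size_million_chars, rate):
    """Linear extrapolation; assumes time is proportional to size."""
    return size_million_chars * rate


rate = minutes_per_million(ROWS)
for name, size, minutes in ROWS:
    print(f"{name}: {minutes / size:.3f} min per million chars")
print(f"pooled rate: {rate:.3f} min per million chars")

# Hypothetical corpus size, used only to illustrate the extrapolation.
print(f"500M-char corpus: ~{estimate_minutes(500, rate):.0f} minutes")
```

The per-row rates come out close to each other, which is what makes a single pooled rate a reasonable first approximation.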

N82 (KPWr and CEN)

Inner-corpora evaluation

| Model | Eval | Precision | Recall | F-measure | Support | Source |
|---|---|---|---|---|---|---|
| PolDeepNer2 (kpwr_n82_base) | KPWr | 75.02 | 77.67 | 76.32 | 4430 | |
| PolDeepNer2 (kpwr_n82_large) | KPWr | 77.05 | 78.79 | 77.91 | 4430 | |
| PolDeepNer (n82-elmo-kgr10) | KPWr | 73.97 | 75.49 | 74.72 | 4430 | link |
| PolDeepNer2 (cen_n82_base) | CEN | 84.64 | 85.95 | 85.29 | 1423 | |
| PolDeepNer2 (cen_n82_large) | CEN | 86.94 | 88.40 | 87.67 | 1423 | |

Cross-corpora evaluation

| Model | Eval | Precision | Recall | F-measure | Support |
|---|---|---|---|---|---|
| PolDeepNer2 (kpwr_n82_base) | CEN | 80.90 | 81.87 | 81.38 | 1423 |
| PolDeepNer2 (kpwr_n82_large) | CEN | 80.16 | 82.08 | 81.11 | 1423 |
| PolDeepNer2 (cen_n82_base) | KPWr | 58.58 | 64.79 | 61.53 | 4430 |
| PolDeepNer2 (cen_n82_large) | KPWr | 61.38 | 66.66 | 63.91 | 4430 |

Installation (with Conda)

Create and activate conda environment:

conda create -n pdn2 python=3.6
conda activate pdn2

Install CUDA, CuDNN and Torch:

conda install -c anaconda cudatoolkit=10.1
conda install -c anaconda cudnn

Install PolDeepNer2:

pip install https://pypi.clarin-pl.eu/packages/poldeepner2-0.5.0-py3-none-any.whl#md5=6a6131d1b3d104f0bbed87ec6969a841

Install the spaCy model:

python -m spacy download pl_core_news_sm
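
After the steps above, it can help to confirm that the environment actually contains what the evaluation scripts need. The following is a minimal sketch of such a check, not part of PolDeepNer2 itself. It assumes the wheel installs a top-level module named `poldeepner2` and that the spaCy model is importable as `pl_core_news_sm`. The helper names are illustrative.

```python
import importlib.util

# Assumed top-level import names for the packages installed above.
REQUIRED = ("torch", "spacy", "pl_core_news_sm", "poldeepner2")


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def check_environment(required=REQUIRED):
    """Map each required module name to whether it is installed."""
    return {name: module_available(name) for name in required}


def cuda_available():
    """Ask PyTorch whether a CUDA device is usable; False if torch is missing."""
    if not module_available("torch"):
        return False
    import torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:<16} {'ok' if ok else 'MISSING'}")
    print(f"{'CUDA device':<16} {'ok' if cuda_available() else 'not available'}")
```

If a module is reported as missing, rerunning the corresponding installation command inside the activated `pdn2` environment is the first thing to try.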

Evaluation

Download the evaluation dataset:

wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O POLEVAL-NER_GOLD.json

Polish RoBERTa

Process the dataset:

python process_poleval.py \
  --input POLEVAL-NER_GOLD.json \
  --output pdn2_nkjp_roberta_base_sq.json \
  --model nkjp-base-sq \
  --device cuda:0

Output:

Model loading time          :    12.28 second(s)
Data preprocessing time     :     50.9 second(s)
Data NE recognition time    :    65.23 second(s)
Total time                  :    128.4 second(s)
Data size:                  :    3.072M characters

Evaluate:

python poleval_ner_test.py \
  --goldfile POLEVAL-NER_GOLD.json \
  --userfile pdn2_nkjp_roberta_base_sq.json

Output:

OVERLAP precision: 0.927 recall: 0.912 F1: 0.919 
EXACT precision: 0.899 recall: 0.884 F1: 0.891 
Final score: 0.914
Exact TP=32971 ; FP=3709; FN=4335
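
The EXACT precision, recall, and F1 values follow directly from the true positive, false positive, and false negative counts on the last output line. As a quick sanity check, the standard definitions can be applied to those counts; the function name below is illustrative and is not taken from `poleval_ner_test.py`.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)  # share of predicted entities that are correct
    recall = tp / (tp + fn)     # share of gold entities that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the "Exact TP=...; FP=...; FN=..." line above.
p, r, f1 = precision_recall_f1(tp=32971, fp=3709, fn=4335)
print(f"EXACT precision: {p:.3f} recall: {r:.3f} F1: {f1:.3f}")
# prints: EXACT precision: 0.899 recall: 0.884 F1: 0.891
```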

HerBERT

Process the dataset:

python process_poleval.py \
  --input POLEVAL-NER_GOLD.json \
  --output pdn2_nkjp_herbert_large_sq.json \
  --model nkjp-herbert-large-sq \
  --device cuda:0

Output:

Model loading time          :    18.44 second(s)
Data preprocessing time     :    50.83 second(s)
Data NE recognition time    :    103.7 second(s)
Total time                  :    173.0 second(s)
Data size:                  :    3.072M characters

Evaluate:

python poleval_ner_test.py \
  --goldfile POLEVAL-NER_GOLD.json \
  --userfile pdn2_nkjp_herbert_large_sq.json

Output:

OVERLAP precision: 0.929 recall: 0.922 F1: 0.926 
EXACT precision: 0.903 recall: 0.896 F1: 0.900 
Final score: 0.921
Exact TP=33433 ; FP=3596; FN=3873
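
The final score combines the overlap and exact F1 values. The sketch below assumes the PolEval 2018 NER weighting of 0.8 for overlap and 0.2 for exact match. That weighting is an assumption here rather than something stated in this document, so check it against `poleval_ner_test.py` before relying on it. Because the inputs are the already-rounded F1 values from the outputs above, the result can differ from the reported score in the last digit.

```python
# Assumed weights (PolEval 2018 NER task); verify against poleval_ner_test.py.
OVERLAP_WEIGHT = 0.8
EXACT_WEIGHT = 0.2


def final_score(overlap_f1, exact_f1):
    """Weighted combination of overlap and exact-match F1."""
    return OVERLAP_WEIGHT * overlap_f1 + EXACT_WEIGHT * exact_f1


# (overlap F1, exact F1) as printed by the two evaluation runs above.
RUNS = {
    "nkjp-base-sq": (0.919, 0.891),
    "nkjp-herbert-large-sq": (0.926, 0.900),
}

for model, (overlap_f1, exact_f1) in RUNS.items():
    print(f"{model}: {final_score(overlap_f1, exact_f1):.3f}")
```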

Credits

poldeepner2's Issues

sample code doesn't work

Hey, I'm testing your solution for my team, and your example code doesn't work. When I try to download and load dependencies.txt on my own, I get errors for some libraries, such as numpy and pandas.

problem with PolDeepNer2 installation

Is the project maintained?
The installation link does not work; it fails with the error:

ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
