
LegalNER_GdCL_III_Project

Coursework for the seminar Grundlagen der Computerlinguistik III at the University of Erlangen
Winter semester 2022/23
Supervised by Prof. Stephanie Evert, Lehrstuhl für Korpus- und Computerlinguistik
The project implements a shared task from SemEval-2023 dedicated to the recognition of named entities in legal documents (Legal Named Entity Recognition).
Link to the shared task

1. The Tokenizer.ipynb

(also see Tokenizer.py)
uses a TreebankWordTokenizer to convert the annotated judgement texts from JSON into pandas dataframes.
The dataframes are stored in transitional_data/tokenized_train.csv and transitional_data/tokenized_dev.csv.
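
A minimal sketch of this step, assuming the shared-task JSON stores each judgement under a "data" → "text" field; the field names and output columns are illustrative, not the notebook's exact code:

```python
import json

import pandas as pd
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

def tokenize_file(json_path: str, csv_path: str) -> pd.DataFrame:
    """Read annotated judgements from JSON and write one token per row to CSV."""
    with open(json_path, encoding="utf-8") as f:
        documents = json.load(f)

    rows = []
    for doc_id, doc in enumerate(documents):
        text = doc["data"]["text"]  # assumed field layout of the shared-task files
        # span_tokenize keeps character offsets, which makes it possible to
        # align the gold entity annotations with the tokens later on
        for start, end in tokenizer.span_tokenize(text):
            rows.append({"doc_id": doc_id,
                         "token": text[start:end],
                         "start": start,
                         "end": end})

    df = pd.DataFrame(rows)
    df.to_csv(csv_path, index=False)
    return df

# e.g. tokenize_file("NER_TRAIN_JUDGEMENT.json",
#                    "transitional_data/tokenized_train.csv")
```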

2. The POS_Tagger.ipynb

(also see POS_Tagger.py)
uses two POS taggers to extend the dataframe with a POS tag and a lemma for each token (one row in the dataframe).
The first is the standard tagger provided by NLTK (a pretrained PerceptronTagger).
The second is the TreeTagger configured with the Penn Treebank tagset.
The TreeTagger also adds the lemma of each token to the dataframe.
The extended dataframes are stored in "transitional_data/tagged_train_filled.csv" and "transitional_data/tagged_dev_filled.csv". Empty cells (np.NaN) in the dataframe are replaced with "0". A sketch of the NLTK tagging step follows below.
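
A minimal sketch of the NLTK tagging step, reusing the assumed column names from the tokenizer sketch above; not the notebook's exact code:

```python
import nltk
import pandas as pd
from nltk.tag import PerceptronTagger

nltk.download("averaged_perceptron_tagger", quiet=True)
tagger = PerceptronTagger()  # the pretrained tagger shipped with NLTK

df = pd.read_csv("transitional_data/tokenized_train.csv")

# tag each document's token sequence so the tagger sees its sentence context
df["nltk_pos"] = ""
for _, group in df.groupby("doc_id"):
    tags = [tag for _, tag in tagger.tag(group["token"].astype(str).tolist())]
    df.loc[group.index, "nltk_pos"] = tags

# replace empty cells with "0", as done for the stored *_filled.csv files
df = df.fillna("0")
df.to_csv("transitional_data/tagged_train_filled.csv", index=False)
```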

About the TreeTagger

Link to the installation instructions for the TreeTagger
Tutorial on using the TreeTagger in Python
I have installed the TreeTagger directly in the project folder under the name "TreeTagger". It includes not only the TreeTagger wrapper, but is also configured with the Penn Treebank tagset.
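
A minimal usage sketch via the treetaggerwrapper package, assuming the tagger lives in the local "TreeTagger" folder as described above:

```python
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG="en", TAGDIR="TreeTagger")

tokens = ["The", "court", "dismissed", "the", "appeal", "."]
# tagonly=True feeds pre-tokenized input (one token per line) to the tagger
raw_tags = tagger.tag_text("\n".join(tokens), tagonly=True)
for tag in treetaggerwrapper.make_tags(raw_tags):
    print(tag.word, tag.pos, tag.lemma)  # e.g. court NN court
```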

3. Feature_Matrix.ipynb

The purpose of this notebook is to enlarge the dataframe and provide the machine learning model with more context information,
e.g. the tokens to the left and right of the current token together with their POS tags and lemmas.
Because the TreeTagger in the previous notebook already provides a POS tag for every token, the prefix, suffix and other character-level features of the token generally proved to be superfluous and are not included in the final feature matrix.
The "trigram" processing (adding the labels of the previous two tokens to the feature matrix: the gold labels for the training set and, dynamically, the predicted labels for the dev set)
only makes the model much worse.
The latest results show that the feature matrix works best with the following columns:
the token, POS tag and lemma of the token itself plus the same three features of its L1, L2, R1 and R2 neighbours,
WITHOUT any affix features, other features or preceding labels. A sketch of how such a matrix can be built follows below.
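
A minimal sketch of building the neighbour columns with pandas shift(); the column names follow the description above but are assumptions, and a real implementation would shift within each document rather than across document boundaries:

```python
import pandas as pd

df = pd.read_csv("transitional_data/tagged_train_filled.csv")

base_cols = ["token", "pos_tag", "lemma"]  # assumed column names
feature_df = df[base_cols].copy()

# add the same three features for the two left (L1, L2) and
# two right (R1, R2) neighbours of every token
for offset, name in [(1, "L1"), (2, "L2"), (-1, "R1"), (-2, "R2")]:
    for col in base_cols:
        feature_df[f"{col}_{name}"] = df[col].shift(offset).fillna("0")
```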

Best result: weighted average F1 score over all labels of 77% (token level)

(excluding the "O" label for outside tokens, which makes up 86% of all tokens)

4. evaluation.py provides functions which can be used to compare the results of different models.

  1. get_all_labels
    extracts all labels, each with its own sequence number, from a list of y values, in preparation for later use.

  2. It provides two evaluation methods. A sketch of both follows after this list.
    2.1) The first is shown by get_classify_report.
    The classification report reflects the accuracy of (pure) classification, without taking the detection rate into account.
    2.2) The second method (get_recognition_report) shows the accuracy of recognition in the strict sense.
    The recognition report shows how many entities are not only detected with the correct span but also correctly classified.

  3. get_confusion_matrix
    returns a confusion matrix for three categories: juridical_person, formats, natural_person
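
A sketch of the two evaluation views described above; this is not the actual evaluation.py code, and the BIO-style label scheme is an assumption:

```python
from sklearn.metrics import classification_report

def get_classify_report(y_true, y_pred, labels):
    """Token-level report: how well each label is classified."""
    return classification_report(y_true, y_pred, labels=labels, zero_division=0)

def spans(labels):
    """Collect (start, end, type) entity spans from a BIO-tagged sequence."""
    entities, start = set(), None
    for i, lab in enumerate(labels + ["O"]):   # sentinel closes a trailing entity
        if start is not None and not lab.startswith("I-"):
            entities.add((start, i, labels[start].split("-", 1)[1]))
            start = None
        if lab.startswith("B-"):
            start = i
    return entities

def strict_recognition_f1(y_true, y_pred):
    """Entity-level F1: a hit needs the exact span and the correct class."""
    gold, pred = spans(list(y_true)), spans(list(y_pred))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```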

5. Model_Selection.ipynb

This notebook compares the results of a support vector machine and the sklearn_crfsuite model, both after hyperparameter tuning.
The sklearn_crfsuite model reaches a clearly better result than the SVM. A sketch of the CRF side of the comparison follows below.
SVM: weighted average F1 score of strict recognition 67% (entity level)
crfsuite: weighted average F1 score of strict recognition 75% (entity level)
The visualization uses spaCy's visualizer tools to mark the entities in the text in colours.
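
A minimal sketch of the sklearn_crfsuite side of the comparison; the toy data, label names and hyperparameter values are illustrative, not the tuned configuration from the notebook:

```python
import sklearn_crfsuite

# toy data: one list of feature dicts per sentence and the matching label lists;
# in the project these come from the feature matrix built in Feature_Matrix.ipynb
X_train = [[{"token": "Justice", "pos_tag": "NNP", "lemma": "Justice"},
            {"token": "Sharma", "pos_tag": "NNP", "lemma": "Sharma"},
            {"token": "ruled", "pos_tag": "VBD", "lemma": "rule"}]]
y_train = [["B-JUDGE", "I-JUDGE", "O"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,                      # L1 regularisation strength
    c2=0.1,                      # L2 regularisation strength
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))      # e.g. [['B-JUDGE', 'I-JUDGE', 'O']]
```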

6. Project Report (in German)

The complete LaTeX document can be found in the file "Project_Report".
