GUTS

= German Unsupervised Text Simplification

Code for the master's thesis and paper:

"An Approach Towards Unsupervised Text Simplification on Paragraph-Level for German Texts"
Author: Leon Fruth

This approach is an adaptation of the paper Keep it Simple to the German language. Large parts of the code are copied or adapted from the Keep it Simple repository: https://github.com/tingofurro/keep_it_simple/

The reward scores and training progression from the training runs of GUTS are visualized here. The model used in this repository is named GUTS-2 in the report.

Run GUTS

To test the GUTS models, use the script run_guts.py.

You can pass arguments to test different models, decoding methods, and input texts.

Without any arguments, it generates a single simplification using greedy decoding.
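
For readers unfamiliar with the term, greedy decoding means that the generator always appends the single most likely next token. The following is a minimal sketch of that idea only. It does not reproduce run_guts.py; the function names and the bigram table are hypothetical stand-ins for a real language model.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    eos: str = "<eos>",
    max_new_tokens: int = 20,
) -> List[str]:
    """Append the highest-scoring next token until EOS or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    # Return only the newly generated tokens, without the prompt.
    return tokens[len(prompt):]


# Toy stand-in for a language model: a fixed bigram table (hypothetical values).
BIGRAMS = {
    "<bos>": {"der": 0.6, "ein": 0.4},
    "der": {"hund": 0.7, "kater": 0.3},
    "hund": {"bellt": 0.8, "<eos>": 0.2},
    "bellt": {"<eos>": 0.9, "laut": 0.1},
}


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Score the next token based only on the last token."""
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})


output = greedy_decode(toy_model, ["<bos>"])
print(output)  # ['der', 'hund', 'bellt']
```

Sampling-based decoding methods differ only in how the next token is picked from the scores; greedy decoding is deterministic, which is why it yields a single simplification per input.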

Training

To train GUTS, use the script train_guts.py.

Before training, first pre-train a GerPT-2 model on the copy task. This way the generator model learns to copy the original paragraph, which is a good starting point.
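
As a rough illustration of the copy task, each training example uses the paragraph itself as the target. The sketch below only shows how such pairs could be built; the separator string, the field names, and the function name are assumptions and are not taken from this repository.

```python
from typing import Dict, List

# Hypothetical delimiter between source and target; the real scripts may differ.
SEPARATOR = " <|sep|> "


def build_copy_examples(paragraphs: List[str]) -> List[Dict[str, str]]:
    """Build copy-task examples in which the target equals the source."""
    examples = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue  # skip empty paragraphs
        examples.append(
            {
                "source": text,
                "target": text,
                # Single sequence as a causal language model would see it.
                "sequence": text + SEPARATOR + text,
            }
        )
    return examples


examples = build_copy_examples(["Der Hund bellt.", "   ", "Die Katze schläft. "])
for example in examples:
    print(example["sequence"])
```

A generator pre-trained this way starts reinforcement learning from outputs that already preserve the meaning of the input, so the reward only has to push it towards simpler text.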

Reward

All parts of the reward are in the reward folder:

  • reward.py: Wraps the scores and utilizes them to calculate the overall reward. Different scoring functions and weights can be used with this.
  • The following scores can be added to the reward in any combination:
    • simplicity.py: Contains the score for the lexical and syntactic simplicity.
    • meaning_preservation.py: Contains different methods to score the meaning preservation. The TextSimilarity score was used for this work. The file also contains the CoverageModel, used in Keep it Simple and the Summary Loop, and BScoreSimilarity, a similarity scoring method based only on BERTScore.
    • fluency.py: The LM-fluency score and the TextDiscriminator score are contained in this file.
    • guardrails.py: Different Guardrails for Hallucination Detection, Brevity, ArticleRepetition, and NGRamRepetition can be found in this file.
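
The interplay between weighted scores and guardrails can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in reward.py: the weighted average, the rule that a failed guardrail sets the reward to zero, and all function names and toy scorers are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]    # (original, simplified) -> score in [0, 1]
Guardrail = Callable[[str, str], bool]  # (original, simplified) -> True if passed


def word_overlap(original: str, simplified: str) -> float:
    """Toy meaning stand-in: share of simplified words found in the original."""
    source = set(original.lower().split())
    words = simplified.lower().split()
    return sum(w in source for w in words) / len(words) if words else 0.0


def short_word_share(original: str, simplified: str, max_len: int = 6) -> float:
    """Toy simplicity stand-in: share of words with at most max_len characters."""
    words = simplified.split()
    return sum(len(w) <= max_len for w in words) / len(words) if words else 0.0


def no_repeated_ngram(original: str, simplified: str, n: int = 3) -> bool:
    """Pass only if no word n-gram occurs twice in the simplification."""
    words = simplified.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(ngrams) == len(set(ngrams))


def not_too_short(original: str, simplified: str, min_ratio: float = 0.5) -> bool:
    """Pass only if the simplification keeps at least min_ratio of the words."""
    return len(simplified.split()) >= min_ratio * len(original.split())


def combined_reward(
    original: str,
    simplified: str,
    scorers: Dict[str, Tuple[Scorer, float]],
    guardrails: List[Guardrail],
) -> float:
    """Weighted average of the scores, or 0.0 if any guardrail fails."""
    if not all(check(original, simplified) for check in guardrails):
        return 0.0
    total_weight = sum(weight for _, weight in scorers.values())
    weighted = sum(weight * fn(original, simplified) for fn, weight in scorers.values())
    return weighted / total_weight


scorers = {"meaning": (word_overlap, 1.0), "simplicity": (short_word_share, 1.0)}
guardrails = [no_repeated_ngram, not_too_short]

original = "Der Hund schlummert auf dem Sofa im Wohnzimmer"
reward = combined_reward(original, "Der Hund schlummert auf dem Sofa", scorers, guardrails)
repetitive = combined_reward(original, "Der Hund schlummert Der Hund schlummert", scorers, guardrails)
print(reward, repetitive)
```

Passing scorers and weights in as a dictionary mirrors the idea that different scoring functions and weights can be combined without changing the wrapper itself.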

Some of these scores are analysed on the reference datasets TextComplexityDE and GWW_leichtesprache and visualized in Jupyter notebooks in the notebooks folder.

Data

The data folder contains the following datasets:

  • textcomplexityde.csv is the processed TextComplexityDE dataset. Here, all sentences from a Wikipedia article are concatenated to form a Complex-Simple aligned dataset. This dataset was used for the analysis of the reward scores.
  • leichtesprache2.csv contains parallel articles from GWW.
  • tc_eval.csv contains the manually composed paragraphs from the TextComplexityDE dataset, and the generated simplifications used for automatic evaluation of the thesis.
  • wiki_eval.csv contains paragraphs from Wikipedia and the generated simplifications used for automatic evaluation of the thesis.
  • all_wiki_paragraphs.csv contains the extracted paragraphs from Wikipedia articles used for training. The file is contained in the latest release.
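
The concatenation step described for textcomplexityde.csv can be pictured roughly as below: sentence-level pairs are grouped by article and joined into one complex and one simple paragraph. This is only a sketch of the idea; the field names article, complex, and simple are hypothetical and may not match the columns in the actual files.

```python
from typing import Dict, List


def concatenate_by_article(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Join all sentences of each article into one complex/simple paragraph pair."""
    grouped: Dict[str, Dict[str, List[str]]] = {}
    for row in rows:
        entry = grouped.setdefault(row["article"], {"complex": [], "simple": []})
        entry["complex"].append(row["complex"].strip())
        entry["simple"].append(row["simple"].strip())
    # Dictionaries keep insertion order, so articles stay in first-seen order.
    return [
        {
            "article": article,
            "complex": " ".join(parts["complex"]),
            "simple": " ".join(parts["simple"]),
        }
        for article, parts in grouped.items()
    ]


# Hypothetical sentence-level rows for illustration only.
rows = [
    {"article": "A", "complex": "Satz eins.", "simple": "Einfach eins."},
    {"article": "B", "complex": "Anderer Satz.", "simple": "Anders."},
    {"article": "A", "complex": "Satz zwei.", "simple": "Einfach zwei."},
]
paragraphs = concatenate_by_article(rows)
for paragraph in paragraphs:
    print(paragraph)
```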

Automatic Evaluation

The Jupyter notebook for reproducing the automatic evaluation is located at notebooks/evaluation.ipynb.

This notebook uses the files tc_eval.csv and wiki_eval.csv to generate the automatic results.

Models

This repository contains the following saved models:

  • One trained GUTS model: GUTS.bin (contained in the release)
  • morphmodel_ger.pgz: A model used for lemmatization of German words
  • wiki_finetune.bin: A saved BERT model trained on Wikipedia paragraphs, used for the LM-Fluency score (contained in the release)
  • All other models used in the reward scores are retrieved from the Hugging Face library

Other scripts
