Neural CRF Model for Sentence Alignment in Text Simplification

This repository contains the code and resources from the paper "Neural CRF Model for Sentence Alignment in Text Simplification" (Jiang et al., ACL 2020).

Repo Structure:

  1. aligner: Code for neural CRF sentence aligner.

  2. wiki-manual: The Wiki-Manual dataset. The columns are, in order: label, index of the simple sentence, index of the complex sentence, simple sentence, complex sentence.

  3. wiki-auto: The Wiki-Auto dataset. Files ending in .src contain the complex sentences, and files ending in .dst contain the simple sentences.

  4. annotation_tool: The tool for in-house annotators to annotate the sentence alignment.

  5. simplification: Code for text simplification experiments.
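
The column order of wiki-manual and the .src/.dst split of wiki-auto described above can be read with a few lines of Python. The sketch below is only an illustration and is not code from this repository. It assumes that the wiki-manual files are tab-separated and that line i of a .src file corresponds to line i of the matching .dst file; neither is stated in this README, so check the files before relying on it. The names ManualPair, read_wiki_manual and pair_wiki_auto are hypothetical.

```python
import csv
from typing import Iterable, List, NamedTuple, Tuple


class ManualPair(NamedTuple):
    """One wiki-manual row, in the column order listed in the README."""
    label: str
    simple_index: str    # kept as text; the index format is not specified here
    complex_index: str
    simple_sentence: str
    complex_sentence: str


def read_wiki_manual(lines: Iterable[str]) -> List[ManualPair]:
    """Parse wiki-manual rows. ASSUMPTION: columns are tab-separated."""
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Rows that do not have exactly five columns are skipped.
    return [ManualPair(*row) for row in reader if len(row) == 5]


def pair_wiki_auto(src_lines: Iterable[str],
                   dst_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (complex, simple) pairs.

    ASSUMPTION: the .src and .dst files are line-aligned.
    """
    src = [line.rstrip("\n") for line in src_lines]
    dst = [line.rstrip("\n") for line in dst_lines]
    if len(src) != len(dst):
        raise ValueError("expected the same number of lines in .src and .dst")
    return list(zip(src, dst))
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, for example `read_wiki_manual(open(path, encoding="utf-8", newline=""))`, where `path` points at a file in the wiki-manual directory.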

Checkpoints

  1. We released checkpoints of the BERT model fine-tuned on the Newsela-Manual and Wiki-Manual datasets: BERT_newsela and BERT_wiki. They were trained using the Hugging Face implementation of the BERT_base architecture in the package pytorch-transformers==1.1.0.
  2. If you want to align other monolingual parallel data, please try the fine-tuned BERT models first. They should achieve competitive performance. How much the neural CRF model adds on top depends on the structure of the articles. We have some experience in designing the paragraph alignment algorithm and in using the neural CRF model to align sentences. Feel free to contact us if you want to discuss this.
  3. We also released the code for our neural CRF sentence alignment model, which you can use to train your own model.

Instructions:

  1. To request the Newsela-Manual and Newsela-Auto datasets, please first obtain access to the Newsela corpus, then contact the authors.

  2. Please use Python 3 to run the code.

  3. We also have pre-processed Wikipedia data, alignments between complex and simple Wikipedia articles, and the original sentence and paragraph alignments between Wikipedia article pairs. Please contact us if you want to use that data.

  4. We also have the original sentence and paragraph alignments between the Newsela articles. Please contact us if you want to use that data.

Citation

Please cite the following paper if you use the above resources in your research:

@inproceedings{jiang2020neural,
  title={Neural CRF Model for Sentence Alignment in Text Simplification},
  author={Jiang, Chao and Maddela, Mounica and Lan, Wuwei and Zhong, Yang and Xu, Wei},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2020}
}
