
OrangePOINTER

This repository contains the scripts for the model described in a paper submitted to NeurIPS 2021, titled "OrangePOINTER: Constraint progressive generation of texts in French language", which builds on the approach published in this article. The goal of the approach is to generate text under specific constraints (keywords) in a progressive, non-autoregressive manner.

Relation to the original project

The current project is an adaptation of the initial code repository (distributed under the MIT license) to a French-language context. The main changes are (but not limited to):

  1. Making the code executable as Jupyter notebooks on cloud platforms (Google Cloud Platform, Colab, Kaggle).
  2. Transitioning the training script from GPU to TPU.
  3. Rewriting the keyword extraction script, since the YAKE extractor did not produce acceptable output for French.
  4. Adding CodeCarbon tracking, reported to the Comet ML platform, for the scripts with considerable energy consumption.
  5. Improving the cleansing procedure in the generate_training_data.ipynb script, based on empirical observations of its output.
  6. Adding the join_train_data.ipynb script. It was needed because 100 MB of raw French text takes about 3 hours to turn into data consumable by the model, so we ran several GCP instances in parallel to speed up the generation of the pretraining data and joined the results into one big file at the end.
  7. Adding the postprocessing.ipynb script, since the inference output contained undesired tags.
  8. Adding numerous comments and structuring the code blocks.

Project description

The project consists of seven Jupyter notebooks executable on cloud platforms (Colab, Kaggle, GCP). The execution order of the scripts is described in the following scenarios:

Training data generation

  1. generate_training_data.ipynb consumes a .txt file of raw data and outputs six .json files, joined into two zip folders: three files of metrics and three files of the text data itself. The number three corresponds to the number of data epochs. You must put your Comet ML API key and project name into the code to make the CodeCarbon reporting work (see the sketch after this list). If you don't have a Comet ML account and don't want to create one, do not execute the Comet ML-related cells, to avoid login errors.

  2. join_train_data.ipynb may be applied to consolidate the pretraining data if the script was run on different VM instances, or several times on the same VM. It takes the zip folders produced by generate_training_data.ipynb. You must manually rename the zip folders by appending _[number], starting from 0, for example:

metrics_data_0.zip
training_data_0.zip

The script generates two zip folders, similar to the ones you would have received directly from the generate_training_data.ipynb script.
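
For reference, here is a minimal sketch of how the Comet ML credentials and the CodeCarbon tracker fit together, assuming the standard comet_ml and codecarbon APIs. The api_key and project_name values are placeholders you must replace, and generate_training_data() is a hypothetical stand-in for the notebook's actual data-generation loop; the real cells may wire these up differently.

```python
# A minimal sketch, assuming the standard comet_ml and codecarbon APIs;
# not the notebook's exact code.
from comet_ml import Experiment
from codecarbon import EmissionsTracker

def generate_training_data():
    pass  # hypothetical stand-in for the notebook's data-generation loop

# Put your own Comet ML credentials here (both values are placeholders).
experiment = Experiment(api_key="YOUR_COMET_API_KEY",
                        project_name="orangepointer")

tracker = EmissionsTracker()  # measures energy use and CO2 emissions
tracker.start()
try:
    generate_training_data()
finally:
    emissions = tracker.stop()  # estimated emissions in kg CO2eq
    experiment.log_metric("emissions_kg_co2eq", emissions)
    experiment.end()
```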

Pre-training

pretraining_on_TPU.ipynb must be executed on a TPU-enabled device. It takes the zip folders from the training data generation step and outputs a pytorch_model.bin file. Put the file inside a folder titled model containing the configuration files (you may download them from this repository), and zip it.
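
The GPU-to-TPU transition typically follows the standard PyTorch/XLA pattern. The sketch below illustrates that pattern under this assumption, with a toy model and fake data standing in for the real ones; it is not the notebook's exact code.

```python
# A minimal PyTorch/XLA training-step sketch (toy model and data; the
# real notebook trains the full model on the generated pretraining data).
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the TPU core visible to this process

# Stand-ins just to make the sketch self-contained and runnable.
model = torch.nn.Linear(10, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(3)]

for inputs, labels in batches:
    optimizer.zero_grad()
    inputs, labels = inputs.to(device), labels.to(device)
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    xm.optimizer_step(optimizer)  # replaces optimizer.step() on TPU

# The notebook ultimately saves the weights as pytorch_model.bin.
xm.save(model.state_dict(), "pytorch_model.bin")
```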

Fine-tuning

Similar to the pretraining, but it uses a different set of parameters (learning rate, number of epochs) and, most importantly, takes a pretrained model as input. For the sake of clarity, we have split the fine-tuning script into a separate file containing all the needed configuration: finetunning_on_TPU.ipynb. An example of a pretrained model can be found in the current repository.
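
To illustrate the difference, a hypothetical fine-tuning configuration might look like the sketch below. The concrete values live in finetunning_on_TPU.ipynb; the numbers here are placeholders, not the repository's actual settings.

```python
# Hypothetical fine-tuning configuration (placeholder values); the key
# difference from pretraining is starting from the pretrained checkpoint.
import torch

config = {
    "learning_rate": 1e-5,  # placeholder: usually lower than for pretraining
    "num_epochs": 3,        # placeholder: usually fewer than for pretraining
    "pretrained_weights": "model/pytorch_model.bin",  # pretraining output
}

# Load the pretrained weights produced by pretraining_on_TPU.ipynb.
state_dict = torch.load(config["pretrained_weights"], map_location="cpu")
```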

Keywords extraction

spacy_keywords_extraction.ipynb takes a raw text file and returns a text file of extracted keywords. The number of generated keywords can be adjusted.
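
As an illustration only, a spaCy-based extractor for French could look like the sketch below; the notebook's actual filtering rules may differ, and the max_keywords parameter simply mirrors the adjustable number of keywords.

```python
# A minimal sketch of spaCy keyword extraction for French (illustrative;
# requires: python -m spacy download fr_core_news_sm).
import spacy
from collections import Counter

nlp = spacy.load("fr_core_news_sm")

def extract_keywords(text, max_keywords=5):
    """Return the most frequent noun/proper-noun lemmas as keywords."""
    doc = nlp(text)
    candidates = [tok.lemma_.lower() for tok in doc
                  if tok.pos_ in ("NOUN", "PROPN") and not tok.is_stop]
    return [word for word, _ in Counter(candidates).most_common(max_keywords)]

print(extract_keywords("Le gouvernement annonce une réforme des retraites en France."))
# e.g. ['gouvernement', 'réforme', 'retraite', 'france']
```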

Inference

inference.ipynb takes the finetuned model, along with its configuration files, and a keywords.txt file. The decoding strategy can be switched between 'greedy' and 'sampling'. Parameters such as top-k, top-p and temperature for the 'sampling' decoding strategy can be adjusted as well.
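
The sketch below illustrates what these two decoding strategies mean: it picks a single next token from a vector of vocabulary logits, applying temperature, top-k and top-p (nucleus) filtering in the 'sampling' branch. It is an illustration of the general technique, not the notebook's exact code.

```python
# Illustrative greedy vs. sampling decoding over a 1-D tensor of logits.
import torch

def pick_token(logits, strategy="greedy", top_k=50, top_p=0.9, temperature=1.0):
    """Pick the next token id from a 1-D tensor of vocabulary logits."""
    if strategy == "greedy":
        return int(torch.argmax(logits))  # always the most probable token
    # 'sampling': temperature scaling, then top-k, then top-p filtering.
    logits = logits / temperature
    if top_k > 0:
        kth_best = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cut = cumulative > top_p
    cut[0] = False  # always keep at least the single best token
    sorted_probs[cut] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize the remaining nucleus
    return int(sorted_idx[torch.multinomial(sorted_probs, 1)])

logits = torch.randn(30000)  # fake vocabulary logits for demonstration
print(pick_token(logits, strategy="sampling", top_k=50, top_p=0.9, temperature=0.8))
```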

Data sources description

Pretraining dataset: CC-100

| Dataset              | Size  |
|----------------------|-------|
| CC-100 (French part) | 54 GB |

Fine-tuning dataset

Split sizes are given in thousands of documents. Vocab sizes are given in thousands of tokens.

| Dataset   | train/val/test sizes | avg. document length (words) | avg. document length (sentences) | avg. summary length (words) | avg. summary length (sentences) | vocabulary size: document | vocabulary size: summary |
|-----------|----------------------|------------------------------|----------------------------------|-----------------------------|---------------------------------|---------------------------|--------------------------|
| OrangeSUM | 21.4/1.5/1.5         | 350                          | 12.06                            | 32.12                       | 1.43                            | 420                       | 71                       |

The pregenerated data used during the fine-tuning is available for download: the metrics folder and the training text data folder.

Models

| Model            | Link to download |
|------------------|------------------|
| pretrained model | link             |
| finetuned model  | link             |

Example of text generation

[Screenshot: example of generated text]

Demo

If you wish to run a demo of the inference using a finetuned model, you may do so in Colab or Kaggle (for free). On either platform you need to import the finetunning_on_TPU.ipynb notebook, downloaded from the current repository. Once done, execute all the cells: the script already contains wget commands that download the latest version of the OrangePOINTER finetuned model and the keywords for text generation, extracted from the summaries in the test split of the OrangeSUM dataset (1500 entries). The generated .txt file will be saved to the same folder where the model was unpacked (./model).

