The repository contains the scripts for the model described in a paper submitted to NeurIPS 2021, titled "OrangePOINTER: Constraint progressive generation of texts in French language", which builds on the approach published in this article. The goal of this approach is text generation under specific constraints (keywords) in a progressive, non-autoregressive manner.
The current project adapts the initial code repository, distributed under the MIT license, to a French-language context. The main changes are (but are not limited to):
- Making the code executable as Jupyter notebooks on cloud platforms (Google Cloud Platform, Colab, Kaggle).
- Transitioning the training script from GPU to TPU.
- Rewriting the keyword extraction script, since the YAKE extractor did not produce acceptable output for French.
- Addition of CodeCarbon tracking, reported to the Comet ML platform, for the scripts with considerable energy consumption.
- Improvement of the cleaning procedure in the `generate_training_data.ipynb` script, based on empirical observations of the output.
- Addition of the `join_train_data.ipynb` script. This script was needed because converting 100 MB of raw French text into data consumable by the model takes about 3 hours; we therefore ran several GCP instances at the same time to speed up the generation of the pretraining data, and joined the results into one big file at the end.
- Addition of the `postprocessing.ipynb` script, since the result of the inference contained undesired tags.
- Numerous comments and restructuring of the code blocks.
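To illustrate the kind of cleanup the postprocessing step performs, here is a minimal sketch that strips leftover special tokens from a generated line. The tag list (`[CLS]`, `[SEP]`, `[PAD]`, `[NOI]`) and the function name `strip_tags` are assumptions made for this example; the tags actually handled by `postprocessing.ipynb` are not listed in this README.

```python
import re

# Hypothetical set of leftover tags; adjust it to what the inference output contains.
UNDESIRED_TAGS = ("[CLS]", "[SEP]", "[PAD]", "[NOI]")


def strip_tags(line, tags=UNDESIRED_TAGS):
    """Remove the given tags from a line and normalise the whitespace left behind."""
    pattern = "|".join(re.escape(tag) for tag in tags)
    cleaned = re.sub(pattern, " ", line)
    # Collapse the runs of whitespace created by the removal.
    return re.sub(r"\s+", " ", cleaned).strip()
```

Applying `strip_tags` to every line of the generated file would yield plain text without the listed tags.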
The project consists of seven Jupyter notebooks executable on cloud platforms (Colab, Kaggle, GCP). The scripts are executed in the following order:
- `generate_training_data.ipynb` consumes a .txt file of raw data and outputs six .json files, packed into two zip folders: 3 files of metrics and 3 files of the text data itself. The number "3" corresponds to the number of data epochs. To enable the CodeCarbon reporting, you must put your Comet ML key and name in the code. If you don't have a Comet ML account or don't want to create one, do not execute the Comet ML-related cells, to avoid login errors.
- `join_train_data.ipynb` may be applied if the script was run on different VM instances, or several times on the same VM, to consolidate the pretraining data. It takes the zip folders coming from `generate_training_data.ipynb`. You must manually rename the zip folders by adding `_[number]`, starting from 0, for example: `metrics_data_0.zip`, `training_data_0.zip`. The script generates two zip folders, similar to the ones you would have received directly from the `generate_training_data.ipynb` script.
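The consolidation step can be pictured with the following sketch, which collects archives named `<prefix>_<number>.zip` in numeric order and copies their members into a single archive. How the actual notebook merges the JSON files is not described here, so the member-prefixing scheme and the function names `find_shards` and `merge_shards` are assumptions for illustration only.

```python
import re
import zipfile
from pathlib import Path


def find_shards(folder, prefix):
    """Return archives named <prefix>_<number>.zip, sorted by their number."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)\.zip$")
    shards = []
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if match:
            shards.append((int(match.group(1)), path))
    return [path for _, path in sorted(shards)]


def merge_shards(folder, prefix, output_name):
    """Copy every member of every shard into one archive (hypothetical layout)."""
    output_path = Path(folder) / output_name
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as merged:
        for index, shard in enumerate(find_shards(folder, prefix)):
            with zipfile.ZipFile(shard) as source:
                for member in source.namelist():
                    # Prefix with the shard position so equally named files do not collide.
                    merged.writestr(f"{index}_{member}", source.read(member))
    return output_path
```

For example, `merge_shards(".", "training_data", "training_data.zip")` would combine `training_data_0.zip`, `training_data_1.zip`, and so on, found in the current directory.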
`pretraining_on_TPU.ipynb` must be executed on a TPU-enabled device. It takes the zip folders from the training data generation step and outputs a `pytorch_model.bin` file.
Put this file inside a folder named `model` together with the configuration files (you may download them from this repository), then zip the folder.
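The packaging step above can be scripted. The sketch below copies the weights and the configuration files into a `model` folder and zips it; the function name `package_model`, the archive name `model.zip`, and the choice to keep a top-level `model/` directory inside the archive are assumptions, so check them against what the downstream notebooks expect.

```python
import shutil
from pathlib import Path


def package_model(weights_path, config_dir, work_dir):
    """Build <work_dir>/model from the weights and config files, then zip it."""
    model_dir = Path(work_dir) / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    # Copy the configuration files (their names depend on what the repository ships).
    for item in Path(config_dir).iterdir():
        if item.is_file():
            shutil.copy(item, model_dir / item.name)
    shutil.copy(weights_path, model_dir / "pytorch_model.bin")
    # root_dir/base_dir keep the top-level "model/" folder inside the archive.
    archive = shutil.make_archive(
        str(Path(work_dir) / "model"), "zip", root_dir=work_dir, base_dir="model"
    )
    return Path(archive)
```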
Fine-tuning is similar to pretraining, but it uses a different set of parameters (learning rate, number of epochs) and, most importantly, takes a pretrained model as input.
For the sake of clarity, we have moved the fine-tuning script into a separate file containing all the required configuration: `finetunning_on_TPU.ipynb`.
An example of a pretrained model can be found in this repository.
`spacy_keywords_extraction.ipynb` takes a raw text file and returns a text file of extracted keywords. The number of generated keywords can be adjusted.
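As a rough illustration of keyword selection with an adjustable count, the sketch below picks the most frequent content-word lemmas from part-of-speech-tagged tokens and returns them in order of first appearance. The selection criteria (the `CONTENT_POS` set, frequency ranking) and the function name `select_keywords` are assumptions; the notebook's actual rules are not described here. The function works on plain `(lemma, pos)` pairs so that it does not depend on a particular spaCy model.

```python
from collections import Counter

# Assumed set of part-of-speech tags treated as keyword candidates.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}


def select_keywords(tagged_tokens, max_keywords=5):
    """Return up to max_keywords frequent content lemmas, in order of first appearance."""
    lemmas = [lemma.lower() for lemma, pos in tagged_tokens if pos in CONTENT_POS]
    top = {lemma for lemma, _ in Counter(lemmas).most_common(max_keywords)}
    ordered = []
    for lemma in lemmas:
        if lemma in top and lemma not in ordered:
            ordered.append(lemma)
    return ordered


# With spaCy, the pairs could be obtained along these lines (requires a French model):
#   nlp = spacy.load("fr_core_news_sm")
#   tagged = [(token.lemma_, token.pos_) for token in nlp(text)]
```

Raising or lowering `max_keywords` changes how many constraints are passed to the generation step.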
`inference.ipynb` takes the fine-tuned model along with its configuration files and a `keywords.txt` file.
The decoding strategy can be switched between 'greedy' and 'sampling'.
For the 'sampling' strategy, the top-k, top-p and temperature parameters can be adjusted as well.
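To make the effect of these parameters concrete, here is a generic, dependency-free sketch of greedy decoding and of sampling with temperature, top-k and top-p (nucleus) filtering over a list of logits. It is not the code used in `inference.ipynb`; the function names and the order in which the filters are applied are assumptions, and other implementations renormalise between the top-k and top-p steps.

```python
import math
import random


def greedy_token(logits):
    """Greedy decoding: always pick the index with the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample an index after temperature scaling, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalises the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With `top_k=1` the sampler reduces to greedy decoding, while larger `top_k`, `top_p` or `temperature` values admit less probable tokens and produce more varied text.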
Dataset | Size |
---|---|
CC-100 (French part) | 54 GB |
Split sizes are given in thousands of documents. Vocab sizes are given in thousands of tokens.
Dataset | train/val/test sizes | avg. document length in words | avg. document length in sentences | avg. summary length in words | avg. summary length in sentences | vocabulary size: document | vocabulary size: summary |
---|---|---|---|---|---|---|---|
OrangeSUM | 21.4/1.5/1.5 | 350 | 12.06 | 32.12 | 1.43 | 420 | 71 |
The pregenerated data used during fine-tuning is available for download: metrics folder and training text data folder.
Model | Link to download |
---|---|
pretrained model | link |
finetuned model | link |
If you wish to run a demo of the inference using a finetuned model, you may do so in Colab or Kaggle (for free).
On either platform, you need to import the `finetunning_on_TPU.ipynb` notebook, downloaded from this repository.
Once done, execute all the cells: the script already contains wget commands that download the latest version of the OrangePOINTER fine-tuned model and the keywords for the text generation, which were extracted from the summaries in the test split of the OrangeSUM dataset (1500 entries).
The generated .txt file will be saved to the same folder where the model was unpacked (`./model`).