neuspell's Introduction

NeuSpell: A Neural Spelling Correction Toolkit

Updates

Latest

Previous

  • March, 2021:
    • Code-base reformatted. Addressed bug fixes and issues.
  • November, 2020:
    • Neuspell's BERT pretrained model is now available as part of huggingface models as murali1996/bert-base-cased-spell-correction. We provide an example code snippet at ./scripts/huggingface for curious practitioners.
  • September, 2020:
    • This work is accepted at EMNLP 2020 (system demonstrations)

Installation

git clone https://github.com/neuspell/neuspell; cd neuspell
pip install -e .

To install extra requirements,

pip install -r extras-requirements.txt

or individually as:

pip install -e .[elmo]
pip install -e .[spacy]

NOTE: For zsh, use ".[elmo]" and ".[spacy]" instead

Additionally, spacy models can be downloaded as:

python -m spacy download en_core_web_sm

Then, download pretrained models of neuspell following Download Checkpoints

Here is a quick-start code snippet (command line usage) to use a checker model. See test_neuspell_correctors.py for more usage patterns.

import neuspell
from neuspell import available_checkers, BertChecker

""" see available checkers """
print(f"available checkers: {neuspell.available_checkers()}")
# → available checkers: ['BertsclstmChecker', 'CnnlstmChecker', 'NestedlstmChecker', 'SclstmChecker', 'SclstmbertChecker', 'BertChecker', 'SclstmelmoChecker', 'ElmosclstmChecker']

""" select spell checkers & load """
checker = BertChecker()
checker.from_pretrained()

""" spell correction """
checker.correct("I luk foward to receving your reply")
# → "I look forward to receiving your reply"
checker.correct_strings(["I luk foward to receving your reply", ])
# → ["I look forward to receiving your reply"]
checker.correct_from_file(src="noisy_texts.txt")
# → "Found 450 mistakes in 322 lines, total_lines=350"

""" evaluation of models """
checker.evaluate(clean_file="bea60k.txt", corrupt_file="bea60k.noise.txt")
# → data size: 63044
# → total inference time for this data is: 998.13 secs
# → total token count: 1032061
# → confusion table: corr2corr:940937, corr2incorr:21060,
#                    incorr2corr:55889, incorr2incorr:14175
# → accuracy is 96.58%
# → word correction rate is 79.76%
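
The summary figures printed by evaluate can be cross-checked from the confusion table. The sketch below assumes that accuracy is the share of tokens that end up correct and that word correction rate is the share of misspelt tokens that get fixed; both definitions are assumptions, not taken from the NeuSpell source. Under them the numbers come out at roughly 96.59% and 79.77%, which agree with the reported 96.58% and 79.76% up to rounding or truncation.

```python
# Counts copied from the confusion table printed above.
corr2corr = 940937      # correct token left correct
corr2incorr = 21060     # correct token changed to something wrong
incorr2corr = 55889     # misspelt token fixed
incorr2incorr = 14175   # misspelt token still wrong

total_tokens = corr2corr + corr2incorr + incorr2corr + incorr2incorr

# Assumed definitions (not taken from the NeuSpell source).
accuracy = (corr2corr + incorr2corr) / total_tokens
word_correction_rate = incorr2corr / (incorr2corr + incorr2incorr)

# Average latency implied by the reported inference time and data size.
ms_per_sentence = 998.13 / 63044 * 1000

print(f"total tokens: {total_tokens}")
print(f"accuracy: {accuracy:.4f}")
print(f"word correction rate: {word_correction_rate:.4f}")
print(f"time per sentence: {ms_per_sentence:.2f} ms")
```

The four counts sum to the reported total token count of 1032061, which is a quick sanity check that the table was read correctly.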

Alternatively, one can also select and load a spell checker differently as follows:

from neuspell import SclstmChecker

checker = SclstmChecker()
checker = checker.add_("elmo", at="input")  # "elmo" or "bert", "input" or "output"
checker.from_pretrained()

Adding an ELMO or BERT model in this way is currently supported only for selected models. See List of neural models in the toolkit for details.

If interested, follow Additional Requirements to install the non-neural spell checkers, Aspell and Jamspell.

Installation through pip

pip install neuspell

In v1.0, the allennlp library, which is required by the models that use ELMO, is not installed automatically. To use those checkers, do a source install as described in Installation & Quick Start.

Toolkit

Introduction

NeuSpell is an open-source toolkit for context-sensitive spelling correction in English. The toolkit comprises 10 spell checkers, with evaluations on naturally occurring misspellings from multiple (publicly available) sources. To make neural models for spell checking context dependent, (i) we train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings; and (ii) we use richer representations of the context. This toolkit enables NLP practitioners to use our proposed and existing spelling correction systems, both via a simple unified command line and via a web interface. Among many potential applications, we demonstrate the utility of our spell checkers in combating adversarial misspellings.

Live demo available at http://neuspell.github.io/



List of neural models in the toolkit:



This pipeline corresponds to the `SC-LSTM plus ELMO (at input)` model.

Performances
Spell Checker                    Word Correction Rate    Time per sentence (in milliseconds)
Aspell                           48.7                    7.3*
Jamspell                         68.9                    2.6*
CNN-LSTM                         75.8                    4.2
SC-LSTM                          76.7                    2.8
Nested-LSTM                      77.3                    6.4
BERT                             79.1                    7.1
SC-LSTM plus ELMO (at input)     79.8                    15.8
SC-LSTM plus ELMO (at output)    78.5                    16.3
SC-LSTM plus BERT (at input)     77.0                    6.7
SC-LSTM plus BERT (at output)    76.0                    7.2

Performance of different correctors in the NeuSpell toolkit on the BEA-60K dataset with real-world spelling mistakes. * indicates evaluation on a CPU (for others we use a GeForce RTX 2080 Ti GPU).

Download Checkpoints

To download selected checkpoints, select a Checkpoint name from below and then run download. Each checkpoint is associated with a neural spell checker as shown in the table.

Spell Checker                    Class                Checkpoint name                Disk space (approx.)
CNN-LSTM                         CnnlstmChecker       'cnn-lstm-probwordnoise'       450 MB
SC-LSTM                          SclstmChecker        'scrnn-probwordnoise'          450 MB
Nested-LSTM                      NestedlstmChecker    'lstm-lstm-probwordnoise'      455 MB
BERT                             BertChecker          'subwordbert-probwordnoise'    740 MB
SC-LSTM plus ELMO (at input)     ElmosclstmChecker    'elmoscrnn-probwordnoise'      840 MB
SC-LSTM plus BERT (at input)     BertsclstmChecker    'bertscrnn-probwordnoise'      900 MB
SC-LSTM plus BERT (at output)    SclstmbertChecker    'scrnnbert-probwordnoise'      1.19 GB
SC-LSTM plus ELMO (at output)    SclstmelmoChecker    'scrnnelmo-probwordnoise'      1.23 GB

import neuspell

neuspell.seq_modeling.downloads.download_pretrained_model("subwordbert-probwordnoise")

Alternatively, download all Neuspell neural models by running the following (available in versions after v1.0):

import neuspell

neuspell.seq_modeling.downloads.download_pretrained_model("_all_")
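
When the checker class is chosen at run time, it can help to look up the matching checkpoint name instead of typing it by hand. The sketch below mirrors the checkpoint table as a dictionary; the helper name checkpoint_name is hypothetical and not part of neuspell.

```python
# Class name -> checkpoint name, copied from the checkpoint table above.
CHECKPOINT_FOR_CHECKER = {
    "CnnlstmChecker": "cnn-lstm-probwordnoise",
    "SclstmChecker": "scrnn-probwordnoise",
    "NestedlstmChecker": "lstm-lstm-probwordnoise",
    "BertChecker": "subwordbert-probwordnoise",
    "ElmosclstmChecker": "elmoscrnn-probwordnoise",
    "BertsclstmChecker": "bertscrnn-probwordnoise",
    "SclstmbertChecker": "scrnnbert-probwordnoise",
    "SclstmelmoChecker": "scrnnelmo-probwordnoise",
}


def checkpoint_name(checker_class_name: str) -> str:
    """Hypothetical helper: return the checkpoint name for a checker class."""
    try:
        return CHECKPOINT_FOR_CHECKER[checker_class_name]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINT_FOR_CHECKER))
        raise ValueError(
            f"unknown checker {checker_class_name!r}; expected one of: {known}"
        ) from None


print(checkpoint_name("BertChecker"))
```

The returned string is what would be passed to download_pretrained_model in the snippets above.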

Alternatively,

Datasets

We curate several synthetic and natural datasets for training/evaluating neuspell models. For full details, check our paper. Run the following to download all the datasets.

cd data/traintest
python download_datafiles.py 

See data/traintest/README.md for more details.

Training files carry the suffixes .random, .word, .prob, and .probword, one for each noising strategy used to create them. For each strategy (see Synthetic data creation), we noise ∼20% of the tokens in the clean corpus. We use 1.6 million sentences from the One Billion Word Benchmark dataset as our clean corpus.
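
To check how heavily a clean/corrupt file pair is noised, one can count the tokens that differ between aligned lines. The sketch below is illustrative only: it assumes whitespace tokenization and that each corrupt line has the same number of tokens as its clean counterpart, and the function name token_noise_rate is hypothetical.

```python
def token_noise_rate(clean_lines, corrupt_lines):
    """Return the fraction of tokens that differ between aligned line pairs."""
    changed = total = 0
    for clean, corrupt in zip(clean_lines, corrupt_lines):
        clean_tokens, corrupt_tokens = clean.split(), corrupt.split()
        # Assumption: noising never adds or removes tokens.
        if len(clean_tokens) != len(corrupt_tokens):
            raise ValueError(f"token count mismatch: {clean!r} vs {corrupt!r}")
        total += len(clean_tokens)
        changed += sum(c != n for c, n in zip(clean_tokens, corrupt_tokens))
    return changed / total if total else 0.0


clean = ["I look forward to receiving your reply", "this is a clean sentence"]
corrupt = ["I luk forward to receiving your reply", "this is a claen sentence"]
print(f"{token_noise_rate(clean, corrupt):.2f}")
```

On this toy pair, 2 of 12 tokens differ; on the real training files the figure should land near the 20% stated above.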

Demo Setup

To set up a demo, follow these steps:

  • Complete Installation, then install the flask requirements with pip install -e ".[flask]"
  • Download checkpoints (Note: If you wish to use only one of the neural checkers, you need to manually disable the others in the imports of ./scripts/flask-server/app.py)
  • Start a flask server in folder ./scripts/flask-server by running CUDA_VISIBLE_DEVICES=0 python app.py (on GPU) or python app.py (on CPU)

Synthetic data creation

English

This toolkit offers 3 kinds of noising strategies (identified from existing literature) to generate synthetic parallel training data for training neural spell-correction models. The strategies are a simple lookup-based noisy spelling replacement (en-word-replacement-noise), a character-level noise induction such as swapping/deleting/adding/replacing characters (en-char-replacement-noise), and a confusion-matrix-based probabilistic character replacement driven by mistake patterns in a large corpus of spelling mistakes (en-probchar-replacement-noise). For full details about these approaches, check out our paper.

Following are the corresponding class mappings to utilize the above noise curations. As some pre-built data files are used for some of the noisers, we also provide their approximate disk space.

Folder                           Class name                                 Disk space (approx.)
en-word-replacement-noise        WordReplacementNoiser                      2 MB
en-char-replacement-noise        CharacterReplacementNoiser                 --
en-probchar-replacement-noise    ProbabilisticCharacterReplacementNoiser    80 MB

The following snippet shows how to use these noisers:

from neuspell.noising import WordReplacementNoiser

example_texts = [
    "This is an example sentence to demonstrate noising in the neuspell repository.",
    "Here is another such amazing example !!"
]

word_repl_noiser = WordReplacementNoiser(language="english")
word_repl_noiser.load_resources()
noise_texts = word_repl_noiser.noise(example_texts)
print(noise_texts)
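
For intuition about the character-level strategy (swapping, deleting, adding, or replacing characters), here is a minimal standard-library sketch. It is not the CharacterReplacementNoiser implementation; the function names, the uniform choice of edit, and the default noise_rate of 0.2 (matching the ∼20% figure above) are assumptions made for illustration.

```python
import random
import string


def noise_token(token, rng):
    """Apply one random character edit: swap, delete, add, or replace."""
    if len(token) < 2:
        return token
    op = rng.choice(["swap", "delete", "add", "replace"])
    i = rng.randrange(len(token) - 1)
    if op == "swap":      # exchange two neighbouring characters
        return token[:i] + token[i + 1] + token[i] + token[i + 2:]
    if op == "delete":    # drop one character
        return token[:i] + token[i + 1:]
    letter = rng.choice(string.ascii_lowercase)
    if op == "add":       # insert a random letter
        return token[:i] + letter + token[i:]
    return token[:i] + letter + token[i + 1:]   # replace one character


def noise_sentence(sentence, noise_rate=0.2, seed=0):
    """Noise roughly `noise_rate` of the whitespace-separated tokens."""
    rng = random.Random(seed)
    tokens = sentence.split()
    return " ".join(
        noise_token(tok, rng) if rng.random() < noise_rate else tok
        for tok in tokens
    )


print(noise_sentence("This is an example sentence to demonstrate noising", seed=11690))
```

Because every edit keeps a token non-empty and never introduces whitespace, the noised sentence has the same number of tokens as the input, which keeps clean and corrupt texts aligned token by token.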

Other languages

Coming Soon ...

Finetuning on custom data and creating new models

Finetuning on top of neuspell pretrained models

from neuspell import BertChecker

checker = BertChecker()
checker.from_pretrained()
checker.finetune(clean_file="sample_clean.txt", corrupt_file="sample_corrupt.txt", data_dir="default")

This feature is only available for BertChecker and ElmosclstmChecker.

Training other Transformers/BERT-based models

We now support initializing a huggingface model and finetuning it on your custom data. Here is a code snippet demonstrating that:

First, specify the files that contain the clean and corrupt texts, in a line-separated format:

from neuspell.commons import DEFAULT_TRAINTEST_DATA_PATH

data_dir = DEFAULT_TRAINTEST_DATA_PATH
clean_file = "sample_clean.txt"
corrupt_file = "sample_corrupt.txt"
from neuspell.seq_modeling.helpers import load_data, train_validation_split
from neuspell.seq_modeling.helpers import get_tokens
from neuspell import BertChecker

# Step-0: Load your train and test files, create a validation split
train_data = load_data(data_dir, clean_file, corrupt_file)
train_data, valid_data = train_validation_split(train_data, 0.8, seed=11690)

# Step-1: Create vocab file. This serves as the target vocab file and we use the defined model's default huggingface
# tokenizer to tokenize inputs appropriately.
vocab = get_tokens([i[0] for i in train_data], keep_simple=True, min_max_freq=(1, float("inf")), topk=100000)

# Step-2: Initialize a model
checker = BertChecker(device="cuda")
checker.from_huggingface(bert_pretrained_name_or_path="distilbert-base-cased", vocab=vocab)

# Step-3: Finetune the model on your dataset
checker.finetune(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)
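
To make Step-0 and Step-1 concrete, the sketch below shows what a seeded train/validation split and a frequency-based vocabulary amount to, using only the standard library. It is not the code behind train_validation_split or get_tokens; split_pairs and build_vocab are hypothetical names, and whitespace tokenization is an assumption.

```python
import random
from collections import Counter


def split_pairs(pairs, train_fraction=0.8, seed=11690):
    """Shuffle (clean, corrupt) pairs reproducibly and split them."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def build_vocab(clean_sentences, min_freq=1, topk=100000):
    """Return the `topk` most frequent tokens seen at least `min_freq` times."""
    counts = Counter(tok for line in clean_sentences for tok in line.split())
    return [tok for tok, n in counts.most_common(topk) if n >= min_freq]


# Toy parallel data: element 0 is the clean text, element 1 the corrupt text.
pairs = [(f"clean sentence {i}", f"claen sentence {i}") for i in range(10)]
train, valid = split_pairs(pairs)
vocab = build_vocab(clean for clean, _ in train)
print(len(train), len(valid), len(vocab))
```

Fixing the seed makes the split repeatable across runs, and building the vocabulary from the clean side only matches the role of a target vocabulary described in Step-1.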

You can further evaluate your model on custom data as follows:

from neuspell import BertChecker

checker = BertChecker()
checker.from_pretrained(
    bert_pretrained_name_or_path="distilbert-base-cased",
    ckpt_path=f"{data_dir}/new_models/distilbert-base-cased"  # "<folder where the model is saved>"
)
checker.evaluate(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)

Multilingual Models

Following the usage above, one can now seamlessly utilize multilingual models such as xlm-roberta-base, bert-base-multilingual-cased and distilbert-base-multilingual-cased on a non-English script.

Potential applications for practitioners

  • Defenses against adversarial attacks in NLP
    • example implementation available in folder ./applications/Adversarial-Misspellings-arxiv. See README.md.
  • Improving OCR text correction systems
  • Improving grammatical error correction systems
  • Improving Intent/Domain classifiers in conversational AI
  • Spell Checking in Collaboration and Productivity tools

Additional requirements

Requirements for Aspell checker:

wget https://files.pythonhosted.org/packages/53/30/d995126fe8c4800f7a9b31aa0e7e5b2896f5f84db4b7513df746b2a286da/aspell-python-py3-1.15.tar.bz2
tar -C . -xvf aspell-python-py3-1.15.tar.bz2
cd aspell-python-py3-1.15
python setup.py install

Requirements for Jamspell checker:

sudo apt-get install -y swig3.0
wget -P ./ https://github.com/bakwc/JamSpell-models/raw/master/en.tar.gz
tar xf ./en.tar.gz --directory ./

Citation

@inproceedings{jayanthi-etal-2020-neuspell,
    title = "{N}eu{S}pell: A Neural Spelling Correction Toolkit",
    author = "Jayanthi, Sai Muralidhar  and
      Pruthi, Danish  and
      Neubig, Graham",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = oct,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-demos.21",
    doi = "10.18653/v1/2020.emnlp-demos.21",
    pages = "158--164",
    abstract = "We introduce NeuSpell, an open-source toolkit for spelling correction in English. Our toolkit comprises ten different models, and benchmarks them on naturally occurring misspellings from multiple sources. We find that many systems do not adequately leverage the context around the misspelt token. To remedy this, (i) we train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings; and (ii) use richer representations of the context. By training on our synthetic examples, correction rates improve by 9{\%} (absolute) compared to the case when models are trained on randomly sampled character perturbations. Using richer contextual representations boosts the correction rate by another 3{\%}. Our toolkit enables practitioners to use our proposed and existing spelling correction systems, both via a simple unified command line, as well as a web interface. Among many potential applications, we demonstrate the utility of our spell-checkers in combating adversarial misspellings. The toolkit can be accessed at neuspell.github.io.",
}

The publication is available at the url given in the citation above. For any questions or suggestions, please contact the authors at jsaimurali001 [at] gmail [dot] com

neuspell's People

Contributors

danishpruthi, murali1996, neuspell

neuspell's Issues

Multi gpu training models

Hi guys,

I have a question: is it possible to finetune models using more than one GPU?

With kind regards,

Finetuned model doesn't load in eval/inference

After I finetuned a pretrained model on my custom data, I tried to eval it.

My code is as below

from neuspell.seq_modeling.helpers import load_data, train_validation_split
from neuspell.seq_modeling.helpers import get_tokens
from neuspell import BertChecker

train_data = load_data(data_dir, clean_file, corrupt_file)
train_data, valid_data = train_validation_split(train_data, 0.8, seed=11690)
vocab = get_tokens([i[0] for i in train_data], keep_simple=True, min_max_freq=(1, float("inf")), topk=100000)

checker = BertChecker(device="cuda")
checker.from_huggingface(bert_pretrained_name_or_path="distilbert-base-cased", vocab=vocab)
checker.finetune(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)

#eval. 
checker = BertChecker()
checker.from_pretrained(
  #note that the README file is outdated
    ckpt_path=f"{data_dir}/new_models/distilbert-base-cased"  # "<folder where the model is saved>"
)
checker.evaluate(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)

However, it never seems to load the path to my finetuned model. When I print ckpt_path, it is always set to None, and as a result the checker always points to the downloaded checkpoints.

I wonder about the eval result though. My finetuned model is actually at a totally different location.

data folder is set to `.../repo/neuspell/neuspell/../data` script
loading vocab from path:../repo/neuspell/neuspell/../data/checkpoints/subwordbert-probwordnoise/vocab.pkl
initializing model
Number of parameters in the model: 185212579
Loading model params from checkpoint dir: .../repo/neuspell/neuspell/../data/checkpoints/subwordbert-probwordnoise

When I tried to point it to my finetuned model, providing the path to it, this error appears

Error(s) in loading state_dict for SubwordBert:
	Missing key(s) in state_dict: "bert_model.embeddings.position_ids", "bert_model.embeddings.token_type_embeddings.weight", "bert_model.encoder.layer.0.attention.self.query.weight", "bert_model.encoder.layer.0.attention.self.query.bias", "bert_model.encoder.layer.0.attention.self.key.weight", "bert_model.encoder.layer.0.attention.self.key.bias", "bert_model.encoder.layer.0.attention.self.value.weight", "bert_model.encoder.layer.0.attention.self.value.bias", "bert_model.encoder.layer.0.attention.output.dense.weight", "bert_model.encoder.layer.0.attention.output.dense.bias", "bert_model.encoder.layer.0.attention.output.LayerNorm.weight", "bert_model.encoder.layer.0.attention.output.LayerNorm.bias", "bert_model.encoder.layer.0.intermediate.dense.weight", "bert_model.encoder.layer.0.intermediate.dense.bias", "bert_model.encoder.layer.0.output.dense.weight", "bert_model.encoder.layer.0.output.dense.bias", "bert_model.encoder.layer.0.output.LayerNorm.weight", "bert_model.encoder.layer.0.output.LayerNorm.bias", "bert_model.encoder.layer.1.attention.self.query.weight", "bert_model.encoder.layer.1.attention.self.query.bias", "bert_model.encoder.layer.1.attention.self.key.weight", "bert_model.encoder.layer.1.attention.self.key.bias", "bert_model.encoder.layer.1.attention.self.value.weight", "bert_model.encoder.layer.1.attention.self.value.bias", "bert_model.encoder.layer.1.attention.output.dense.weight", "bert_model.encoder.layer.1.attention.output.dense.bias", "bert_model.encoder.layer.1.attention.output.LayerNorm.weight", "bert_model.encoder.layer.1.attention.output.LayerNorm.bias", "bert_model.encoder.layer.1.intermediate.dense.weight", "bert_model.encoder.layer.1.intermediate.dense.bias", "bert_model.encoder.layer.1.output.dense.weight", "bert_model.encoder.layer.1.output.dense.bias", "bert_model.encoder.layer.1.output.LayerNorm.weight", "bert_model.encoder.layer.1.output.LayerNorm.bias", "bert_model.encoder.layer.2.attention.self.query.weight", 
"bert_model.encoder.layer.2.attention.self.query.bias", "bert_model.encoder.layer.2.attention.self.key.weight", "bert_model.encoder.layer.2.attention.self.key.bias", "bert_model.encoder.layer.2.attention.self.value.weight", "bert_model.encoder.layer.2.attention.self.value.bias", "bert_model.encoder.layer.2.attention.output.dense.weight", "bert_model.encoder.layer.2.attention.output.dense.bias", "bert_model.encoder.layer.2.attention.output.LayerNorm.weight", "bert_model.encoder.layer.2.attention.output.LayerNorm.bias", "bert_model.encoder.layer.2.intermediate.dense.weight", "bert_model.encoder.layer.2.intermediate.dense.bias", "bert_model.encoder.layer.2.output.dense.weight", "bert_model.encoder.layer.2.output.dense.bias", "bert_model.encoder.layer.2.output.LayerNorm.weight", "bert_model.encoder.layer.2.output.LayerNorm.bias", "bert_model.encoder.layer.3.attention.self.query.weight", "bert_model.encoder.layer.3.attention.self.query.bias", "bert_model.encoder.layer.3.attention.self.key.weight", "bert_model.encoder.layer.3.attention.self.key.bias", "bert_model.encoder.layer.3.attention.self.value.weight", "bert_model.encoder.layer.3.attention.self.value.bias", "bert_model.encoder.layer.3.attention.output.dense.weight", "bert_model.encoder.layer.3.attention.output.dense.bias", "bert_model.encoder.layer.3.attention.output.LayerNorm.weight", "bert_model.encoder.layer.3.attention.output.LayerNorm.bias", "bert_model.encoder.layer.3.intermediate.dense.weight", "bert_model.encoder.layer.3.intermediate.dense.bias", "bert_model.encoder.layer.3.output.dense.weight", "bert_model.encoder.layer.3.output.dense.bias", "bert_model.encoder.layer.3.output.LayerNorm.weight", "bert_model.encoder.layer.3.output.LayerNorm.bias", "bert_model.encoder.layer.4.attention.self.query.weight", "bert_model.encoder.layer.4.attention.self.query.bias", "bert_model.encoder.layer.4.attention.self.key.weight", "bert_model.encoder.layer.4.attention.self.key.bias", 
"bert_model.encoder.layer.4.attention.self.value.weight", "bert_model.encoder.layer.4.attention.self.value.bias", "bert_model.encoder.layer.4.attention.output.dense.weight", "bert_model.encoder.layer.4.attention.output.dense.bias", "bert_model.encoder.layer.4.attention.output.LayerNorm.weight", "bert_model.encoder.layer.4.attention.output.LayerNorm.bias", "bert_model.encoder.layer.4.intermediate.dense.weight", "bert_model.encoder.layer.4.intermediate.dense.bias", "bert_model.encoder.layer.4.output.dense.weight", "bert_model.encoder.layer.4.output.dense.bias", "bert_model.encoder.layer.4.output.LayerNorm.weight", "bert_model.encoder.layer.4.output.LayerNorm.bias", "bert_model.encoder.layer.5.attention.self.query.weight", "bert_model.encoder.layer.5.attention.self.query.bias", "bert_model.encoder.layer.5.attention.self.key.weight", "bert_model.encoder.layer.5.attention.self.key.bias", "bert_model.encoder.layer.5.attention.self.value.weight", "bert_model.encoder.layer.5.attention.self.value.bias", "bert_model.encoder.layer.5.attention.output.dense.weight", "bert_model.encoder.layer.5.attention.output.dense.bias", "bert_model.encoder.layer.5.attention.output.LayerNorm.weight", "bert_model.encoder.layer.5.attention.output.LayerNorm.bias", "bert_model.encoder.layer.5.intermediate.dense.weight", "bert_model.encoder.layer.5.intermediate.dense.bias", "bert_model.encoder.layer.5.output.dense.weight", "bert_model.encoder.layer.5.output.dense.bias", "bert_model.encoder.layer.5.output.LayerNorm.weight", "bert_model.encoder.layer.5.output.LayerNorm.bias", "bert_model.encoder.layer.6.attention.self.query.weight", "bert_model.encoder.layer.6.attention.self.query.bias", "bert_model.encoder.layer.6.attention.self.key.weight", "bert_model.encoder.layer.6.attention.self.key.bias", "bert_model.encoder.layer.6.attention.self.value.weight", "bert_model.encoder.layer.6.attention.self.value.bias", "bert_model.encoder.layer.6.attention.output.dense.weight", 
"bert_model.encoder.layer.6.attention.output.dense.bias", "bert_model.encoder.layer.6.attention.output.LayerNorm.weight", "bert_model.encoder.layer.6.attention.output.LayerNorm.bias", "bert_model.encoder.layer.6.intermediate.dense.weight", "bert_model.encoder.layer.6.intermediate.dense.bias", "bert_model.encoder.layer.6.output.dense.weight", "bert_model.encoder.layer.6.output.dense.bias", "bert_model.encoder.layer.6.output.LayerNorm.weight", "bert_model.encoder.layer.6.output.LayerNorm.bias", "bert_model.encoder.layer.7.attention.self.query.weight", "bert_model.encoder.layer.7.attention.self.query.bias", "bert_model.encoder.layer.7.attention.self.key.weight", "bert_model.encoder.layer.7.attention.self.key.bias", "bert_model.encoder.layer.7.attention.self.value.weight", "bert_model.encoder.layer.7.attention.self.value.bias", "bert_model.encoder.layer.7.attention.output.dense.weight", "bert_model.encoder.layer.7.attention.output.dense.bias", "bert_model.encoder.layer.7.attention.output.LayerNorm.weight", "bert_model.encoder.layer.7.attention.output.LayerNorm.bias", "bert_model.encoder.layer.7.intermediate.dense.weight", "bert_model.encoder.layer.7.intermediate.dense.bias", "bert_model.encoder.layer.7.output.dense.weight", "bert_model.encoder.layer.7.output.dense.bias", "bert_model.encoder.layer.7.output.LayerNorm.weight", "bert_model.encoder.layer.7.output.LayerNorm.bias", "bert_model.encoder.layer.8.attention.self.query.weight", "bert_model.encoder.layer.8.attention.self.query.bias", "bert_model.encoder.layer.8.attention.self.key.weight", "bert_model.encoder.layer.8.attention.self.key.bias", "bert_model.encoder.layer.8.attention.self.value.weight", "bert_model.encoder.layer.8.attention.self.value.bias", "bert_model.encoder.layer.8.attention.output.dense.weight", "bert_model.encoder.layer.8.attention.output.dense.bias", "bert_model.encoder.layer.8.attention.output.LayerNorm.weight", "bert_model.encoder.layer.8.attention.output.LayerNorm.bias", 
"bert_model.encoder.layer.8.intermediate.dense.weight", "bert_model.encoder.layer.8.intermediate.dense.bias", "bert_model.encoder.layer.8.output.dense.weight", "bert_model.encoder.layer.8.output.dense.bias", "bert_model.encoder.layer.8.output.LayerNorm.weight", "bert_model.encoder.layer.8.output.LayerNorm.bias", "bert_model.encoder.layer.9.attention.self.query.weight", "bert_model.encoder.layer.9.attention.self.query.bias", "bert_model.encoder.layer.9.attention.self.key.weight", "bert_model.encoder.layer.9.attention.self.key.bias", "bert_model.encoder.layer.9.attention.self.value.weight", "bert_model.encoder.layer.9.attention.self.value.bias", "bert_model.encoder.layer.9.attention.output.dense.weight", "bert_model.encoder.layer.9.attention.output.dense.bias", "bert_model.encoder.layer.9.attention.output.LayerNorm.weight", "bert_model.encoder.layer.9.attention.output.LayerNorm.bias", "bert_model.encoder.layer.9.intermediate.dense.weight", "bert_model.encoder.layer.9.intermediate.dense.bias", "bert_model.encoder.layer.9.output.dense.weight", "bert_model.encoder.layer.9.output.dense.bias", "bert_model.encoder.layer.9.output.LayerNorm.weight", "bert_model.encoder.layer.9.output.LayerNorm.bias", "bert_model.encoder.layer.10.attention.self.query.weight", "bert_model.encoder.layer.10.attention.self.query.bias", "bert_model.encoder.layer.10.attention.self.key.weight", "bert_model.encoder.layer.10.attention.self.key.bias", "bert_model.encoder.layer.10.attention.self.value.weight", "bert_model.encoder.layer.10.attention.self.value.bias", "bert_model.encoder.layer.10.attention.output.dense.weight", "bert_model.encoder.layer.10.attention.output.dense.bias", "bert_model.encoder.layer.10.attention.output.LayerNorm.weight", "bert_model.encoder.layer.10.attention.output.LayerNorm.bias", "bert_model.encoder.layer.10.intermediate.dense.weight", "bert_model.encoder.layer.10.intermediate.dense.bias", "bert_model.encoder.layer.10.output.dense.weight", 
"bert_model.encoder.layer.10.output.dense.bias", "bert_model.encoder.layer.10.output.LayerNorm.weight", "bert_model.encoder.layer.10.output.LayerNorm.bias", "bert_model.encoder.layer.11.attention.self.query.weight", "bert_model.encoder.layer.11.attention.self.query.bias", "bert_model.encoder.layer.11.attention.self.key.weight", "bert_model.encoder.layer.11.attention.self.key.bias", "bert_model.encoder.layer.11.attention.self.value.weight", "bert_model.encoder.layer.11.attention.self.value.bias", "bert_model.encoder.layer.11.attention.output.dense.weight", "bert_model.encoder.layer.11.attention.output.dense.bias", "bert_model.encoder.layer.11.attention.output.LayerNorm.weight", "bert_model.encoder.layer.11.attention.output.LayerNorm.bias", "bert_model.encoder.layer.11.intermediate.dense.weight", "bert_model.encoder.layer.11.intermediate.dense.bias", "bert_model.encoder.layer.11.output.dense.weight", "bert_model.encoder.layer.11.output.dense.bias", "bert_model.encoder.layer.11.output.LayerNorm.weight", "bert_model.encoder.layer.11.output.LayerNorm.bias", "bert_model.pooler.dense.weight", "bert_model.pooler.dense.bias". 
	Unexpected key(s) in state_dict: "bert_model.transformer.layer.0.attention.q_lin.weight", "bert_model.transformer.layer.0.attention.q_lin.bias", "bert_model.transformer.layer.0.attention.k_lin.weight", "bert_model.transformer.layer.0.attention.k_lin.bias", "bert_model.transformer.layer.0.attention.v_lin.weight", "bert_model.transformer.layer.0.attention.v_lin.bias", "bert_model.transformer.layer.0.attention.out_lin.weight", "bert_model.transformer.layer.0.attention.out_lin.bias", "bert_model.transformer.layer.0.sa_layer_norm.weight", "bert_model.transformer.layer.0.sa_layer_norm.bias", "bert_model.transformer.layer.0.ffn.lin1.weight", "bert_model.transformer.layer.0.ffn.lin1.bias", "bert_model.transformer.layer.0.ffn.lin2.weight", "bert_model.transformer.layer.0.ffn.lin2.bias", "bert_model.transformer.layer.0.output_layer_norm.weight", "bert_model.transformer.layer.0.output_layer_norm.bias", "bert_model.transformer.layer.1.attention.q_lin.weight", "bert_model.transformer.layer.1.attention.q_lin.bias", "bert_model.transformer.layer.1.attention.k_lin.weight", "bert_model.transformer.layer.1.attention.k_lin.bias", "bert_model.transformer.layer.1.attention.v_lin.weight", "bert_model.transformer.layer.1.attention.v_lin.bias", "bert_model.transformer.layer.1.attention.out_lin.weight", "bert_model.transformer.layer.1.attention.out_lin.bias", "bert_model.transformer.layer.1.sa_layer_norm.weight", "bert_model.transformer.layer.1.sa_layer_norm.bias", "bert_model.transformer.layer.1.ffn.lin1.weight", "bert_model.transformer.layer.1.ffn.lin1.bias", "bert_model.transformer.layer.1.ffn.lin2.weight", "bert_model.transformer.layer.1.ffn.lin2.bias", "bert_model.transformer.layer.1.output_layer_norm.weight", "bert_model.transformer.layer.1.output_layer_norm.bias", "bert_model.transformer.layer.2.attention.q_lin.weight", "bert_model.transformer.layer.2.attention.q_lin.bias", "bert_model.transformer.layer.2.attention.k_lin.weight", 
"bert_model.transformer.layer.2.attention.k_lin.bias", "bert_model.transformer.layer.2.attention.v_lin.weight", "bert_model.transformer.layer.2.attention.v_lin.bias", "bert_model.transformer.layer.2.attention.out_lin.weight", "bert_model.transformer.layer.2.attention.out_lin.bias", "bert_model.transformer.layer.2.sa_layer_norm.weight", "bert_model.transformer.layer.2.sa_layer_norm.bias", "bert_model.transformer.layer.2.ffn.lin1.weight", "bert_model.transformer.layer.2.ffn.lin1.bias", "bert_model.transformer.layer.2.ffn.lin2.weight", "bert_model.transformer.layer.2.ffn.lin2.bias", "bert_model.transformer.layer.2.output_layer_norm.weight", "bert_model.transformer.layer.2.output_layer_norm.bias", "bert_model.transformer.layer.3.attention.q_lin.weight", "bert_model.transformer.layer.3.attention.q_lin.bias", "bert_model.transformer.layer.3.attention.k_lin.weight", "bert_model.transformer.layer.3.attention.k_lin.bias", "bert_model.transformer.layer.3.attention.v_lin.weight", "bert_model.transformer.layer.3.attention.v_lin.bias", "bert_model.transformer.layer.3.attention.out_lin.weight", "bert_model.transformer.layer.3.attention.out_lin.bias", "bert_model.transformer.layer.3.sa_layer_norm.weight", "bert_model.transformer.layer.3.sa_layer_norm.bias", "bert_model.transformer.layer.3.ffn.lin1.weight", "bert_model.transformer.layer.3.ffn.lin1.bias", "bert_model.transformer.layer.3.ffn.lin2.weight", "bert_model.transformer.layer.3.ffn.lin2.bias", "bert_model.transformer.layer.3.output_layer_norm.weight", "bert_model.transformer.layer.3.output_layer_norm.bias", "bert_model.transformer.layer.4.attention.q_lin.weight", "bert_model.transformer.layer.4.attention.q_lin.bias", "bert_model.transformer.layer.4.attention.k_lin.weight", "bert_model.transformer.layer.4.attention.k_lin.bias", "bert_model.transformer.layer.4.attention.v_lin.weight", "bert_model.transformer.layer.4.attention.v_lin.bias", "bert_model.transformer.layer.4.attention.out_lin.weight", 
"bert_model.transformer.layer.4.attention.out_lin.bias", "bert_model.transformer.layer.4.sa_layer_norm.weight", "bert_model.transformer.layer.4.sa_layer_norm.bias", "bert_model.transformer.layer.4.ffn.lin1.weight", "bert_model.transformer.layer.4.ffn.lin1.bias", "bert_model.transformer.layer.4.ffn.lin2.weight", "bert_model.transformer.layer.4.ffn.lin2.bias", "bert_model.transformer.layer.4.output_layer_norm.weight", "bert_model.transformer.layer.4.output_layer_norm.bias", "bert_model.transformer.layer.5.attention.q_lin.weight", "bert_model.transformer.layer.5.attention.q_lin.bias", "bert_model.transformer.layer.5.attention.k_lin.weight", "bert_model.transformer.layer.5.attention.k_lin.bias", "bert_model.transformer.layer.5.attention.v_lin.weight", "bert_model.transformer.layer.5.attention.v_lin.bias", "bert_model.transformer.layer.5.attention.out_lin.weight", "bert_model.transformer.layer.5.attention.out_lin.bias", "bert_model.transformer.layer.5.sa_layer_norm.weight", "bert_model.transformer.layer.5.sa_layer_norm.bias", "bert_model.transformer.layer.5.ffn.lin1.weight", "bert_model.transformer.layer.5.ffn.lin1.bias", "bert_model.transformer.layer.5.ffn.lin2.weight", "bert_model.transformer.layer.5.ffn.lin2.bias", "bert_model.transformer.layer.5.output_layer_norm.weight", "bert_model.transformer.layer.5.output_layer_norm.bias". 

Honestly, I wonder whether the example even works. Please double-check.

Deployment errors when pushing to GCP

I have wrapped neuspell in a simple flask app, put it in a Docker container, and am deploying to GCP.

It reports a memory leak somewhere. The app downloads a couple of things, then downloads a 1.52 GB file, which I think happens in these steps:

`from __future__ import unicode_literals, print_function
from neuspellMast.neuspell import ElmosclstmChecker #BertsclstmChecker, SclstmChecker,

checker = ElmosclstmChecker()

checker.from_pretrained("./neuspellMast/data/checkpoints/elmoscrnn-probwordnoise")
`
I know there could be a lot going on in this setup to cause this, but I wonder if anyone else has encountered this kind of thing when deploying the code.

import neuspell
neuspell.seq_modeling.downloads.download_pretrained_model("subwordbert-probwordnoise")

data folder is set to /neuspell/neuspell/../data script
subwordbert-probwordnoise created

TimeoutError Traceback (most recent call last)
/opt/conda/miniconda3/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
698 # Make the request on the httplib connection object.
--> 699 httplib_response = self._make_request(
700 conn,

Git clone fails on Windows

FYI: git clone fails on Windows when following the installation instructions.

D:\workspace>git clone https://github.com/neuspell/neuspell.git
Cloning into 'neuspell'...
remote: Enumerating objects: 235, done.
remote: Counting objects: 100% (235/235), done.
remote: Compressing objects: 100% (145/145), done.
remote: Total 432 (delta 142), reused 159 (delta 82), pack-reused 197
Receiving objects: 100% (432/432), 74.34 MiB | 6.26 MiB/s, done.
Resolving deltas: 100% (168/168), done.
error: invalid path 'applications/Adversarial-Misspellings/defenses/scRNN/model_dumps/scrnn_TASK_NAME=MRPC__VOCAB_SIZE=9999_REP_LIST=swap_key_add_drop_REP_PROBS=0.25:0.25:0.25:0.25'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

I had to add the following to my git config to perform a sparse checkout
git config core.protectNTFS false

https://stackoverflow.com/questions/63727594/github-git-checkout-returns-error-invalid-path-on-windows
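
The checkout fails because the offending path contains `:` characters (in `REP_PROBS=0.25:0.25:0.25:0.25`), which Windows does not allow in file names. A minimal Python sketch for spotting such paths in a file listing follows; the helper name `find_windows_invalid_paths` is hypothetical, and the check covers only a set of reserved characters, not reserved device names such as `CON`.

```python
# Characters that Windows does not allow in file or directory names.
WINDOWS_RESERVED_CHARS = set('<>:"|?*')


def find_windows_invalid_paths(paths):
    """Return the paths containing a character Windows rejects in a name.

    Paths are expected in git's forward-slash form, so ':' is never a drive
    separator here.
    """
    invalid = []
    for path in paths:
        components = path.split("/")
        if any(ch in WINDOWS_RESERVED_CHARS for part in components for ch in part):
            invalid.append(path)
    return invalid


if __name__ == "__main__":
    listing = [
        "neuspell/corrector.py",
        "applications/Adversarial-Misspellings/defenses/scRNN/model_dumps/"
        "scrnn_TASK_NAME=MRPC__VOCAB_SIZE=9999_REP_LIST=swap_key_add_drop"
        "_REP_PROBS=0.25:0.25:0.25:0.25",
    ]
    for path in find_windows_invalid_paths(listing):
        print("not representable on Windows:", path)
```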

checker = BertChecker() takes a long time, then gets disconnected on GCP

checker.from_pretrained()
/home/jupyter/neuspell/neuspell/../data/checkpoints/subwordbert-probwordnoise already exists

TimeoutError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
705 headers=headers,
--> 706 chunked=chunked,
707 )

/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
381 try:
--> 382 self._validate_conn(conn)
383 except (SocketTimeout, BaseSSLError) as e:

/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
1009 if not getattr(conn, "sock", None): # AppEngine might not have .sock
-> 1010 conn.connect()
1011

/opt/conda/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
420 ssl_context=context,
--> 421 tls_in_tls=tls_in_tls,
422 )

/opt/conda/lib/python3.7/site-packages/urllib3/util/ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data, tls_in_tls)
428 ssl_sock = _ssl_wrap_socket_impl(
--> 429 sock, context, tls_in_tls, server_hostname=server_hostname
430 )

/opt/conda/lib/python3.7/site-packages/urllib3/util/ssl_.py in _ssl_wrap_socket_impl(sock, ssl_context, tls_in_tls, server_hostname)
471 if server_hostname:
--> 472 return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
473 else:

/opt/conda/lib/python3.7/ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
422 context=self,
--> 423 session=session
424 )

/opt/conda/lib/python3.7/ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)
869 raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 870 self.do_handshake()
871 except (OSError, ValueError):

/opt/conda/lib/python3.7/ssl.py in do_handshake(self, block)
1138 self.settimeout(None)
-> 1139 self._sslobj.do_handshake()
1140 finally:

TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

ProtocolError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
448 retries=self.max_retries,
--> 449 timeout=timeout
450 )

/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
755 retries = retries.increment(
--> 756 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
757 )

/opt/conda/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
531 if read is False or not self._is_method_retryable(method):
--> 532 raise six.reraise(type(error), error, _stacktrace)
533 elif read is not None:

/opt/conda/lib/python3.7/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
733 if value.traceback is not tb:
--> 734 raise value.with_traceback(tb)
735 raise value

/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
705 headers=headers,
--> 706 chunked=chunked,
707 )

/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
381 try:
--> 382 self._validate_conn(conn)
383 except (SocketTimeout, BaseSSLError) as e:

/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
1009 if not getattr(conn, "sock", None): # AppEngine might not have .sock
-> 1010 conn.connect()
1011

/opt/conda/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
420 ssl_context=context,
--> 421 tls_in_tls=tls_in_tls,
422 )

/opt/conda/lib/python3.7/site-packages/urllib3/util/ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data, tls_in_tls)
428 ssl_sock = _ssl_wrap_socket_impl(
--> 429 sock, context, tls_in_tls, server_hostname=server_hostname
430 )

/opt/conda/lib/python3.7/site-packages/urllib3/util/ssl_.py in _ssl_wrap_socket_impl(sock, ssl_context, tls_in_tls, server_hostname)
471 if server_hostname:
--> 472 return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
473 else:

/opt/conda/lib/python3.7/ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
422 context=self,
--> 423 session=session
424 )

/opt/conda/lib/python3.7/ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)
869 raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 870 self.do_handshake()
871 except (OSError, ValueError):

/opt/conda/lib/python3.7/ssl.py in do_handshake(self, block)
1138 self.settimeout(None)
-> 1139 self._sslobj.do_handshake()
1140 finally:

ProtocolError: ('Connection aborted.', TimeoutError(110, 'Connection timed out'))

During handling of the above exception, another exception occurred:

ConnectionError Traceback (most recent call last)
in
----> 1 checker.from_pretrained()

~/neuspell/neuspell/corrector.py in from_pretrained(self, ckpt_path, vocab_path, **kwargs)
138
139 def from_pretrained(self, ckpt_path=None, vocab_path=None, **kwargs):
--> 140 self._from_pretrained(ckpt_path=None, vocab_path=None, **kwargs)
141
142 def load_output_vocab(self, vocab_path):

~/neuspell/neuspell/corrector.py in _from_pretrained(self, ckpt_path, vocab_path)
130 self.vocab_path = vocab_path or os.path.join(self.ckpt_path, "vocab.pkl")
131 if not os.path.isfile(self.vocab_path): # leads to "FileNotFoundError"
--> 132 download_pretrained_model(self.ckpt_path)
133
134 self.load_output_vocab(self.vocab_path)

~/neuspell/neuspell/seq_modeling/downloads.py in download_pretrained_model(ckpt_path)
176 _download_all_pretrained_model()
177 else:
--> 178 _download_pretrained_model(ckpt_path)
179 return

~/neuspell/neuspell/seq_modeling/downloads.py in _download_pretrained_model(ckpt_path)
151 else:
152 vocab_url = details["vocab.pkl"]
--> 153 download_file_from_google_drive(vocab_url, vocab_path)
154
155 pytorch_model_path = os.path.join(ckpt_path, "pytorch_model.bin")

~/neuspell/neuspell/seq_modeling/downloads.py in download_file_from_google_drive(id, destination)
12 session = requests.Session()
13
---> 14 response = session.get(URL, params={'id': id}, stream=True)
15 token = get_confirm_token(response)
16

/opt/conda/lib/python3.7/site-packages/requests/sessions.py in get(self, url, **kwargs)
553
554 kwargs.setdefault('allow_redirects', True)
--> 555 return self.request('GET', url, **kwargs)
556
557 def options(self, url, **kwargs):

/opt/conda/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
540 }
541 send_kwargs.update(settings)
--> 542 resp = self.send(prep, **send_kwargs)
543
544 return resp

/opt/conda/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
653
654 # Send the request
--> 655 r = adapter.send(request, **kwargs)
656
657 # Total elapsed time of the request (approximately)

/opt/conda/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
496
497 except (ProtocolError, socket.error) as err:
--> 498 raise ConnectionError(err, request=request)
499
500 except MaxRetryError as e:

ConnectionError: ('Connection aborted.', TimeoutError(110, 'Connection timed out'))
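
The traceback shows `session.get(URL, params={'id': id}, stream=True)` being called without a timeout, and the download gives up on the first failed connection. A minimal sketch of retrying a flaky download step with exponential backoff, assuming the failure is transient; `with_retries` and its parameters are hypothetical and not part of neuspell, and `fetch` stands in for whatever call performs the download.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are exhausted.

    The delay doubles after every failed attempt (base_delay, 2*base_delay, ...).
    The last exception is re-raised if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_download():
        # Stand-in for the real download call; fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError(110, "Connection timed out")
        return "vocab.pkl downloaded"

    print(with_retries(flaky_download, base_delay=0.1))
```

`TimeoutError` is a subclass of `OSError`, and the exceptions raised by `requests` derive from it as well, so the default `retry_on` covers both errors seen in the traceback.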

Unexpected Keys and Size Mismatch errors

Hello, I'm trying to get this to work.

I run this starter code from the home page:

from neuspell import BertsclstmChecker, SclstmChecker

checker = SclstmChecker()
checker = checker.add_("elmo", at="input")
checker.from_pretrained("./data/checkpoints/elmoscrnn-probwordnoise")

And I get this error.

checker.from_pretrained("./data/checkpoints/elmoscrnn-probwordnoise")
loading vocab from path:./data/checkpoints/elmoscrnn-probwordnoise/vocab.pkl
initializing model
SCLSTM(
(lstmmodule): LSTM(294, 512, num_layers=2, batch_first=True, dropout=0.4, bidirectional=True)
(dropout): Dropout(p=0.5, inplace=False)
(dense): Linear(in_features=1024, out_features=100002, bias=True)
(criterion): CrossEntropyLoss()
)
112111266
loading pretrained weights from path:./data/checkpoints/elmoscrnn-probwordnoise
Loading model params from checkpoint dir: ./data/checkpoints/elmoscrnn-probwordnoise
Traceback (most recent call last):
File "", line 1, in
File "/neuspell-master/neuspell/./corrector_sclstm.py", line 43, in from_pretrained
self.model = load_pretrained(self.model, self.weights_path, device=self.device)
File "/neuspell-master/scripts/seq_modeling/sclstm.py", line 77, in load_pretrained
model.load_state_dict(checkpoint_data['model_state_dict'])
File "/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SCLSTM:
Unexpected key(s) in state_dict: "elmo._elmo_lstm._token_embedder._char_embedding_weights", "elmo._elmo_lstm._token_embedder.char_conv_0.weight", "elmo._elmo_lstm._token_embedder.char_conv_0.bias", "elmo._elmo_lstm._token_embedder.char_conv_1.weight", "elmo._elmo_lstm._token_embedder.char_conv_1.bias", "elmo._elmo_lstm._token_embedder.char_conv_2.weight", "elmo._elmo_lstm._token_embedder.char_conv_2.bias", "elmo._elmo_lstm._token_embedder.char_conv_3.weight", "elmo._elmo_lstm._token_embedder.char_conv_3.bias", "elmo._elmo_lstm._token_embedder.char_conv_4.weight", "elmo._elmo_lstm._token_embedder.char_conv_4.bias", "elmo._elmo_lstm._token_embedder.char_conv_5.weight", "elmo._elmo_lstm._token_embedder.char_conv_5.bias", "elmo._elmo_lstm._token_embedder.char_conv_6.weight", "elmo._elmo_lstm._token_embedder.char_conv_6.bias", "elmo._elmo_lstm._token_embedder._highways._layers.0.weight", "elmo._elmo_lstm._token_embedder._highways._layers.0.bias", "elmo._elmo_lstm._token_embedder._highways._layers.1.weight", "elmo._elmo_lstm._token_embedder._highways._layers.1.bias", "elmo._elmo_lstm._token_embedder._projection.weight", "elmo._elmo_lstm._token_embedder._projection.bias", "elmo._elmo_lstm._elmo_lstm.forward_layer_0.input_linearity.weight", "elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_linearity.weight", "elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_linearity.bias", "elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_projection.weight", "elmo._elmo_lstm._elmo_lstm.backward_layer_0.input_linearity.weight", "elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_linearity.weight", "elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_linearity.bias", "elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_projection.weight", "elmo._elmo_lstm._elmo_lstm.forward_layer_1.input_linearity.weight", "elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_linearity.weight", "elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_linearity.bias", 
"elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_projection.weight", "elmo._elmo_lstm._elmo_lstm.backward_layer_1.input_linearity.weight", "elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_linearity.weight", "elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_linearity.bias", "elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_projection.weight", "elmo.scalar_mix_0.gamma", "elmo.scalar_mix_0.scalar_parameters.0", "elmo.scalar_mix_0.scalar_parameters.1", "elmo.scalar_mix_0.scalar_parameters.2".
size mismatch for lstmmodule.weight_ih_l0: copying a param with shape torch.Size([2048, 1318]) from checkpoint, the shape in current model is torch.Size([2048, 294]).
size mismatch for lstmmodule.weight_ih_l0_reverse: copying a param with shape torch.Size([2048, 1318]) from checkpoint, the shape in current model is torch.Size([2048, 294]).
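
The two size mismatches point at the input width: the checkpoint expects 1318 input features while the freshly built `SCLSTM` has 294, a difference of 1024, and the unexpected `elmo.*` keys suggest the checkpoint was saved from a model that included the ELMo module. A minimal sketch of that comparison in plain Python over `{parameter name: shape}` mappings; `diff_state_dict_shapes` is a hypothetical helper, and with real models the mappings could be built from the entries of each `state_dict()`.

```python
def diff_state_dict_shapes(checkpoint_shapes, model_shapes):
    """Compare two {parameter name: shape tuple} mappings.

    Returns (unexpected, missing, mismatched):
      unexpected - names only present in the checkpoint
      missing    - names only present in the model
      mismatched - {name: (checkpoint shape, model shape)} for differing shapes
    """
    unexpected = sorted(set(checkpoint_shapes) - set(model_shapes))
    missing = sorted(set(model_shapes) - set(checkpoint_shapes))
    mismatched = {
        name: (checkpoint_shapes[name], model_shapes[name])
        for name in sorted(set(checkpoint_shapes) & set(model_shapes))
        if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])
    }
    return unexpected, missing, mismatched


if __name__ == "__main__":
    # LSTM shapes are copied from the error message; the key list is abbreviated.
    checkpoint = {
        "elmo.scalar_mix_0.gamma": (1,),  # shape not shown in the log; placeholder
        "lstmmodule.weight_ih_l0": (2048, 1318),
        "lstmmodule.weight_ih_l0_reverse": (2048, 1318),
    }
    model = {
        "lstmmodule.weight_ih_l0": (2048, 294),
        "lstmmodule.weight_ih_l0_reverse": (2048, 294),
    }
    unexpected, missing, mismatched = diff_state_dict_shapes(checkpoint, model)
    print("unexpected:", unexpected)
    for name, (ckpt_shape, model_shape) in mismatched.items():
        print(name, "input width differs by", ckpt_shape[1] - model_shape[1])
```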

Error in quick-start code snippet: UnpicklingError: invalid load key, '<'.

Hello,

I tried to run the quick-start code snippet:

import neuspell
from neuspell import available_checkers, BertChecker

""" see available checkers """
print(f"available checkers: {neuspell.available_checkers()}")
# → available checkers: ['BertsclstmChecker', 'CnnlstmChecker', 'NestedlstmChecker', 'SclstmChecker', 'SclstmbertChecker', 'BertChecker', 'SclstmelmoChecker', 'ElmosclstmChecker']

""" select spell checkers & load """
checker = BertChecker()
checker.from_pretrained()

""" spell correction """
checker.correct("I luk foward to receving your reply")
# → "I look forward to receiving your reply"
checker.correct_strings(["I luk foward to receving your reply", ])
# → ["I look forward to receiving your reply"]
checker.correct_from_file(src="noisy_texts.txt")
# → "Found 450 mistakes in 322 lines, total_lines=350"

""" evaluation of models """
checker.evaluate(clean_file="bea60k.txt", corrupt_file="bea60k.noise.txt")
# → data size: 63044
# → total inference time for this data is: 998.13 secs
# → total token count: 1032061
# → confusion table: corr2corr:940937, corr2incorr:21060,
#                    incorr2corr:55889, incorr2incorr:14175
# → accuracy is 96.58%
# → word correction rate is 79.76%

But I get the following error:

File "C:\Users\user\Anaconda3\lib\site-packages\torch\serialization.py", line 764, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
UnpicklingError: invalid load key, '<'.

Can you help me, please?

Thanks,
Camille
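
An `invalid load key, '<'` error means the first byte of the file handed to `pickle` is `<`, which is typical when an HTML page was saved in place of the binary checkpoint (for example, an error or confirmation page from the download host; this cause is an assumption, not confirmed in this report). A minimal sketch for checking a downloaded file before loading it; `looks_like_html` is a hypothetical helper.

```python
def looks_like_html(path, probe_bytes=512):
    """Return True if the file starts like an HTML/XML document.

    Only the first probe_bytes bytes are read; leading whitespace is ignored.
    """
    with open(path, "rb") as fp:
        head = fp.read(probe_bytes).lstrip().lower()
    return head.startswith((b"<!doctype", b"<html", b"<?xml", b"<head", b"<body"))


if __name__ == "__main__":
    import sys

    # Pass the checkpoint files to inspect on the command line.
    for candidate in sys.argv[1:]:
        if looks_like_html(candidate):
            print(f"{candidate}: looks like HTML, delete it and download again")
        else:
            print(f"{candidate}: does not look like HTML")
```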

Feature/methods to add new vocab (wordunits) while fine-tuning

Models trained for spell correction can be fine-tuned on a use-case-specific dataset whose textual domain has a vocabulary different from that of the pre-training data. It is therefore sometimes necessary to train spell correction models to predict words that are currently treated as OUT-OF-VOCABULARY. For example, the word lexical is not in the currently used 100K word list (see scripts/seq_modeling/wordunits). Furthermore, once new vocabulary is added, the output layer of the model has to be resized, which also means that the final layer has to be trained again during the use-case-specific training.
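
The bookkeeping described above can be sketched as follows: new word units are appended to the output vocabulary and the required output-layer size is recomputed. This assumes a vocabulary stored as a dict with `token2idx` and `idx2token` mappings and contiguous indices starting at 0; `extend_vocab` is a hypothetical helper, not an existing neuspell function, and resizing and retraining the output layer itself is left out.

```python
def extend_vocab(vocab, new_tokens):
    """Append unseen tokens to the vocabulary in place.

    vocab is assumed to hold two dicts, "token2idx" and "idx2token", with
    indices running from 0 to len - 1.
    Returns the list of tokens that were actually added.
    """
    added = []
    for token in new_tokens:
        if token in vocab["token2idx"]:
            continue  # already known, keep its existing index
        idx = len(vocab["token2idx"])
        vocab["token2idx"][token] = idx
        vocab["idx2token"][idx] = token
        added.append(token)
    return added


if __name__ == "__main__":
    # Toy vocabulary; the real word list is far larger.
    vocab = {
        "token2idx": {"<pad>": 0, "<unk>": 1, "word": 2},
        "idx2token": {0: "<pad>", 1: "<unk>", 2: "word"},
    }
    added = extend_vocab(vocab, ["lexical", "word"])
    print("added:", added)
    # The output layer needs one row per vocabulary entry.
    print("new output size:", len(vocab["token2idx"]))
```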

How do I train a model?

I want to adopt your models and train a new model on my own dataset, but I can't find the training entry point. Thanks a lot :)

Download script needs fixing

Upon running python download_checkpoints.py

./cnn-lstm-probnoise created
Traceback (most recent call last):
  File "download_checkpoints.py", line 59, in <module>
    download_file_from_google_drive('1wEKynHMlBnw2N65jRw8Xox4fsl8BJpmv', './cnn-lstm-probwordnoise/model.pth.tar')
  File "download_checkpoints.py", line 27, in download_file_from_google_drive
    save_response_content(response, destination)
  File "download_checkpoints.py", line 41, in save_response_content
    with open(destination, "wb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './cnn-lstm-probwordnoise/model.pth.tar'
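
The log shows `./cnn-lstm-probnoise` being created while the file is written under `./cnn-lstm-probwordnoise`, so `open()` fails because the parent directory does not exist. A minimal sketch of a save routine that creates the destination's parent directory first; `save_bytes` is a hypothetical stand-in for the script's `save_response_content`, whose body is not shown here.

```python
import os


def save_bytes(chunks, destination):
    """Write an iterable of byte chunks to destination.

    The parent directory is created first, so a mismatch between the
    directory made earlier and the one named in destination cannot raise
    FileNotFoundError.
    """
    parent = os.path.dirname(destination)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(destination, "wb") as fp:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fp.write(chunk)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "cnn-lstm-probwordnoise", "model.pth.tar")
        save_bytes([b"abc", b"", b"def"], target)
        print("bytes written:", os.path.getsize(target))
```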

error in hgface-snippet-for-neuspell.py

Hello, good library!

I found an error in hgface-snippet-for-neuspell.py.

I ran this Python file, and the following error message was printed:

Traceback (most recent call last):
File "bert_spell_correction.py", line 97, in
batch_sentences, batch_bert_dict, batch_splits = _custom_bert_tokenize(misspelled_sentences, tokenizer)
File "bert_spell_correction.py", line 48, in _custom_bert_tokenize
batch_encoded_dicts = [bert_tokenizer.encode_plus(tokens) for tokens in batch_tokens]
File "bert_spell_correction.py", line 48, in
batch_encoded_dicts = [bert_tokenizer.encode_plus(tokens) for tokens in batch_tokens]
File "/home/user/anaconda2/envs/py3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2425, in encode_plus
**kwargs,
File "/home/user/anaconda2/envs/py3/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 463, in _encode_plus
**kwargs,
File "/home/user/anaconda2/envs/py3/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 378, in _batch_encode_plus
is_pretokenized=is_split_into_words,
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I think that the encode_plus function cannot take a list of tokens as its input argument, although the Hugging Face documentation says that it can.

Please check this python file.

Thank you.
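
The traceback goes through `tokenization_utils_fast.py`, so a fast tokenizer is in use, and the script's token lists hold word-piece tokens produced by `tokenize`. One possible workaround, offered as an assumption rather than the maintainers' fix, is to bypass `encode_plus` and build the inputs from the tokens directly with `convert_tokens_to_ids` and `build_inputs_with_special_tokens`, both of which are tokenizer methods in `transformers`. The helper name `encode_pretokenized` is hypothetical.

```python
def encode_pretokenized(tokens, tokenizer):
    """Build input_ids and attention_mask from word-piece tokens.

    tokenizer is expected to be a Hugging Face tokenizer (or any object
    offering the same two methods). The tokens are not split again.
    """
    token_ids = tokenizer.convert_tokens_to_ids(list(tokens))
    # Adds the special tokens around the sequence ([CLS] ... [SEP] for BERT).
    input_ids = tokenizer.build_inputs_with_special_tokens(token_ids)
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}
```

In `_custom_bert_tokenize`, this would take the place of `bert_tokenizer.encode_plus(tokens)` inside the list comprehension; the returned dict carries the two keys that the rest of the function reads.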

Cannot recreate published numbers (Accuracy/WCR)

Hello,

The current code base fails to reproduce the numbers published in the paper.
The best I was able to get was a 76.28% word correction rate and 96.1% accuracy on BEA-60k (for bertscrnn). All the steps were followed as described in the paper.
Could you also add the script, or at least the exact parameters, used to train the model?

Finetuning - clean and corrupt files

Hi,

I would like to fine-tune on custom data and create new models, but I have a question about the clean and corrupt files: do both files have to contain the same data (that is, the text with errors in the corrupt file and the same text with correct spelling in the clean file)?
Thanks,
Camille

Finetuning - New vocab list

Hello,

I am trying to use fine-tuning to correct specific (scientific) terms, but some terms are not corrected even after fine-tuning (for example, "covet" for "covid").
Regarding fine-tuning, there is a parameter "new_vocab_list" in corrector_subwordbert.py. If we target words that are not in the embeddings of the BERT base model, do we have to list them there?

Thank you,
Camille

Expanding the Vocab for my specific domain

Hi,

Thanks for this library. The text that I am running the BertChecker on contains a lot of domain-specific clinical words that are not in the regular English lexicon (those words are out of vocabulary). I have compiled a list of those terms. I would like the BertChecker to leave those terms untouched whenever they appear in the text. What is the correct way to achieve this?

Is there a way to tell the BertChecker to only fix words that are not in my expanded lexicon?
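
One way to approximate this without retraining is to post-process the output: wherever the input word is in the custom lexicon, the original word is restored. A minimal sketch, assuming the corrector returns the same number of whitespace-separated tokens as it received (otherwise the correction is returned unchanged); `correct_with_lexicon` is a hypothetical wrapper and `correct_fn` stands in for a call such as `checker.correct`.

```python
def correct_with_lexicon(text, correct_fn, lexicon):
    """Run correct_fn on text, then restore words found in lexicon.

    Matching is case-insensitive on whole whitespace-separated tokens, so a
    term with attached punctuation (e.g. "covid,") is not protected.
    If the token counts differ, alignment is unknown and the corrected
    text is returned as-is.
    """
    protected = {word.lower() for word in lexicon}
    source_tokens = text.split()
    corrected_tokens = correct_fn(text).split()
    if len(source_tokens) != len(corrected_tokens):
        return " ".join(corrected_tokens)
    merged = [
        src if src.lower() in protected else out
        for src, out in zip(source_tokens, corrected_tokens)
    ]
    return " ".join(merged)


if __name__ == "__main__":
    def fake_corrector(sentence):
        # Stand-in for a real checker; it rewrites two words.
        return sentence.replace("covid", "covet").replace("pateint", "patient")

    print(correct_with_lexicon("the pateint has covid", fake_corrector, ["covid"]))
```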

CUDA out of memory

Hi, I tried to train on my own custom data using a large pretrained multilingual model such as multilingual BERT or XLM-RoBERTa, and even after reducing my batch size (it is currently 2), I still get a CUDA OOM error.

My data is only around 400k pairs of clean and corrupted sentences.

I have a few GPUs and I'm thinking of running on multiple CUDA devices, but I tried modifying the code and it didn't work.

Can you take a look?

SSL Error when downloading pretrained models.

Hi,
when I'm trying to download a pretrained model with this code:

neuspell.seq_modeling.downloads.download_pretrained_model("bertscrnn-probwordnoise")

I got the below error:

HTTPSConnectionPool(host='doc-00-30-docs.googleusercontent.com', port=443): Max retries exceeded with url: /docs/securesc/gti2smdi6shae5s5bikom49c1e0n4qi9/1eg6pp6pp509p1las4ukmsh31e9djdnd/1618470150000/02761850238464772402/13110869053472900530Z/1nMyoXg49_dl_jiXt9bFo8A4Gnd9XdGD2?e=download (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1091)')))

The same happens with every other model, presumably because it is an SSL error. Any idea why this is happening? I am using my work laptop, which has a bunch of restrictions; could that be the cause?
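
A `self signed certificate in certificate chain` error is typical of networks that intercept TLS traffic, which would fit a restricted work laptop (an assumption, not confirmed here). If the organisation provides its root certificate as a PEM file, `requests` can be pointed at it through the `REQUESTS_CA_BUNDLE` environment variable. A minimal sketch; `use_ca_bundle` and the example path are hypothetical.

```python
import os


def use_ca_bundle(pem_path, environ=os.environ):
    """Point requests at a custom CA bundle via REQUESTS_CA_BUNDLE.

    Must run before the download is started. Raises FileNotFoundError if
    the bundle does not exist, so a typo is caught early.
    """
    if not os.path.isfile(pem_path):
        raise FileNotFoundError(f"CA bundle not found: {pem_path}")
    environ["REQUESTS_CA_BUNDLE"] = os.path.abspath(pem_path)
    return environ["REQUESTS_CA_BUNDLE"]


if __name__ == "__main__":
    # Hypothetical location; the real file comes from whoever manages the network.
    bundle = os.path.expanduser("~/certs/corporate-root-ca.pem")
    try:
        print("using", use_ca_bundle(bundle))
    except FileNotFoundError as err:
        print(err)
```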

Huggingface example script throws all kinds of errors

I'm trying to run the huggingface example in scripts/huggingface.

Running the script as-is produces the error

`/tmp/ipykernel_40405/3991995924.py in _custom_bert_tokenize(batch_sentences, bert_tokenizer, padding_idx, max_len)
49 out = [_custom_bert_tokenize_sentence(text, bert_tokenizer, max_len) for text in batch_sentences]
50 batch_sentences, batch_tokens, batch_splits = list(zip(*out))
---> 51 batch_encoded_dicts = [bert_tokenizer.encode_plus(tokens) for tokens in batch_tokens]
52 batch_input_ids = pad_sequence(
53 [torch.tensor(encoded_dict["input_ids"]) for encoded_dict in batch_encoded_dicts], batch_first=True,

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]`

Changing the .encode_plus to .batch_encode_plus moves the error further down to

`/tmp/ipykernel_40405/1454383768.py in _custom_bert_tokenize(batch_sentences, bert_tokenizer, padding_idx, max_len)
51 batch_encoded_dicts = [bert_tokenizer.batch_encode_plus(tokens) for tokens in batch_tokens]
52 batch_input_ids = pad_sequence(
---> 53 [torch.tensor(encoded_dict["input_ids"]) for encoded_dict in batch_encoded_dicts], batch_first=True,
54 padding_value=padding_idx)
55 batch_attention_masks = pad_sequence(

ValueError: expected sequence of length 3 at dim 1 (got 5)`

Getting error when running huggingface snippet

Hi, I ran the exact code of script/hgface-snippet-for-neuspell.py on Google Colab and got the following error.

TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Here is my full code:

from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification
from torch.nn.utils.rnn import pad_sequence
import torch
import pickle

def load_vocab_dict(path_: str):
    """
    path_: path where the vocab pickle file is saved
    """
    with open("/content/vocab.pkl", 'rb') as fp:
        vocab = pickle.load(fp)
    return vocab

def _tokenize_untokenize(input_text: str, bert_tokenizer):
    subtokens = bert_tokenizer.tokenize(input_text)
    output = []
    for subt in subtokens:
        if subt.startswith("##"):
            output[-1] += subt[2:]
        else:
            output.append(subt)
    return " ".join(output)

def _custom_bert_tokenize_sentence(input_text, bert_tokenizer, max_len):

    tokens = []
    split_sizes = []
    text = []
    for token in _tokenize_untokenize(input_text, bert_tokenizer).split(" "):
        word_tokens = bert_tokenizer.tokenize(token)
        if len(tokens) + len(word_tokens) > max_len-2:  # 512-2 = 510
            break
        if len(word_tokens) == 0:
            continue
        tokens.extend(word_tokens)
        split_sizes.append(len(word_tokens))
        text.append(token)

    return " ".join(text), tokens, split_sizes

def _custom_bert_tokenize(batch_sentences, bert_tokenizer, padding_idx=None, max_len=512):

    if padding_idx is None:
        padding_idx = bert_tokenizer.pad_token_id

    out = [_custom_bert_tokenize_sentence(text, bert_tokenizer, max_len) for text in batch_sentences]
    batch_sentences, batch_tokens, batch_splits = list(zip(*out))
    batch_encoded_dicts = [bert_tokenizer.encode_plus(tokens) for tokens in batch_tokens]
    batch_input_ids = pad_sequence(
        [torch.tensor(encoded_dict["input_ids"]) for encoded_dict in batch_encoded_dicts], batch_first=True,
        padding_value=padding_idx)
    batch_attention_masks = pad_sequence(
        [torch.tensor(encoded_dict["attention_mask"]) for encoded_dict in batch_encoded_dicts], batch_first=True,
        padding_value=0)
    batch_bert_dict = {"attention_mask": batch_attention_masks,
                       "input_ids": batch_input_ids
                       }
    return batch_sentences, batch_bert_dict, batch_splits

def _custom_get_merged_encodings(bert_seq_encodings, seq_splits, mode='avg', keep_terminals=False, device="cpu"):
    bert_seq_encodings = bert_seq_encodings[:sum(seq_splits) + 2, :]  # 2 for [CLS] and [SEP]
    bert_cls_enc = bert_seq_encodings[0:1, :]
    bert_sep_enc = bert_seq_encodings[-1:, :]
    bert_seq_encodings = bert_seq_encodings[1:-1, :]
    # a tuple of tensors
    split_encoding = torch.split(bert_seq_encodings, seq_splits, dim=0)
    batched_encodings = pad_sequence(split_encoding, batch_first=True, padding_value=0)
    if mode == 'avg':
        seq_splits = torch.tensor(seq_splits).reshape(-1, 1).to(device)
        out = torch.div(torch.sum(batched_encodings, dim=1), seq_splits)
    elif mode == "add":
        out = torch.sum(batched_encodings, dim=1)
    elif mode == "first":
        out = batched_encodings[:, 0, :]
    else:
        raise Exception("Not Implemented")

    if keep_terminals:
        out = torch.cat((bert_cls_enc, out, bert_sep_enc), dim=0)
    return out


if __name__ == "__main__":
    
    path = "murali1996/bert-base-cased-spell-correction"
    config = AutoConfig.from_pretrained(path)
    tokenizer = AutoTokenizer.from_pretrained(path)
    bert_model = AutoModelForTokenClassification.from_pretrained(path, config=config)
    model_dict = bert_model.state_dict()

    bert_model.eval()
    with torch.no_grad():

        misspelled_sentences = ["Well,becuz badd spelln is ard to undrstnd wen ou rid it.",
                                "they fought a deadly waer",
                                "Hurahh!! we mad it...."]
        batch_sentences, batch_bert_dict, batch_splits = _custom_bert_tokenize(misspelled_sentences, tokenizer)
        # print(batch_sentences, "\n")
        outputs = bert_model(batch_bert_dict['input_ids'], attention_mask=batch_bert_dict["attention_mask"],
                             output_hidden_states=True)
        sequence_output = outputs[1][-1]
        # sanity check -------->
        # sequence_output = bert_model.dropout(sequence_output)
        # temp_logits = bert_model.classifier(sequence_output)
        # x1 = [val.data for val in outputs[0].reshape(-1,)]
        # x2 = [val.data for val in temp_logits.reshape(-1,)]
        # assert all([a == b for a, b in zip(x1, x2)])
        # <-------- sanity check
        bert_encodings_splitted = \
            [_custom_get_merged_encodings(bert_seq_encodings, seq_splits, mode='avg')
             for bert_seq_encodings, seq_splits in zip(sequence_output, batch_splits)]
        bert_merged_encodings = pad_sequence(bert_encodings_splitted,
                                             batch_first=True,
                                             padding_value=0
                                             )  # [BS,max_nwords_without_cls_sep,768]
        logits = bert_model.classifier(bert_merged_encodings)
        output_vocab = load_vocab_dict("vocab.pkl")
        # print(logits.shape)
        assert len(output_vocab["idx2token"]) == logits.shape[-1]
        argmax_inds = torch.argmax(logits, dim=-1)
        outputs = [" ".join([output_vocab["idx2token"][idx.item()] for idx in argmaxs][:len(wordsplits)])
                   for wordsplits, argmaxs in zip(batch_splits, argmax_inds)]
        print(outputs)

        print("complete")

Am I missing something here?

New vocab generation

Please implement the feature to update the vocab while finetuning on custom data.

RuntimeError: Error(s) in loading state_dict for SubwordBert:

I downloaded model.pth.tar and vocab.pkl of subwordbert-probwordnoise from Google Docs myself, then copied them to /environment/python/versions/miniconda3-4.7.12/envs/sj/lib/python3.7/site-packages/neuspell/../data/checkpoints/subwordbert-probwordnoise. When I run

checker = BertChecker()
checker.from_pretrained()

It warns

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

and

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_84079/4153817354.py in <module>
      1 """ select spell checkers & load """
      2 checker = BertChecker()
----> 3 checker.from_pretrained()

/environment/python/versions/miniconda3-4.7.12/envs/sj/lib/python3.7/site-packages/neuspell/corrector_subwordbert.py in from_pretrained(self, ckpt_path, vocab, weights)
     47         self.weights_path = weights if weights else self.ckpt_path
     48         print(f"loading pretrained weights from path:{self.weights_path}")
---> 49         self.model = load_pretrained(self.model, self.weights_path, device=self.device)
     50         return
     51 

/environment/python/versions/miniconda3-4.7.12/envs/sj/lib/python3.7/site-packages/neuspell/seq_modeling/subwordbert.py in load_pretrained(model, checkpoint_path, optimizer, device)
     24     # print(f"previously model saved at : {checkpoint_data['epoch_id']}")
     25 
---> 26     model.load_state_dict(checkpoint_data['model_state_dict'])
     27     if optimizer is not None:
     28         optimizer.load_state_dict(checkpoint_data['optimizer_state_dict'])

/environment/python/versions/miniconda3-4.7.12/envs/sj/lib/python3.7/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1405         if len(error_msgs) > 0:
   1406             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1407                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1408         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1409 

RuntimeError: Error(s) in loading state_dict for SubwordBert:
	size mismatch for dense.weight: copying a param with shape torch.Size([100002, 768]) from checkpoint, the shape in current model is torch.Size([100003, 768]).
	size mismatch for dense.bias: copying a param with shape torch.Size([100002]) from checkpoint, the shape in current model is torch.Size([100003]).

Could you tell me what I can do to fix it?
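
The error reports that the first dimension of dense.weight and dense.bias differs by one between the checkpoint (100002) and the freshly built model (100003). A plausible reading, stated here as an assumption and not as a confirmed diagnosis, is that the output layer is sized from the vocabulary, so the vocab.pkl being loaded may not be the one the checkpoint was trained with. A minimal sketch for listing such mismatches before calling load_state_dict is shown below; find_shape_mismatches is a hypothetical helper, not part of neuspell, and the shapes are copied from the error message.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for parameters whose shapes differ.

    Hypothetical helper for illustration; it only compares plain shape tuples.
    """
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches.append((name, tuple(ckpt_shape), tuple(model_shape)))
    return mismatches


# Shapes taken from the RuntimeError in this report.
checkpoint_shapes = {"dense.weight": (100002, 768), "dense.bias": (100002,)}
model_shapes = {"dense.weight": (100003, 768), "dense.bias": (100003,)}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, ckpt_shape, model_shape in mismatches:
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With real objects, the two dictionaries can be built from a PyTorch state dict with {k: tuple(v.shape) for k, v in state_dict.items()}.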

Changing Epoch

How can I increase or decrease the number of epochs while fine-tuning?

Instructions for a new language?

Hi!

I want to create a dataset and train on the Cherokee language.

I've looked, but I don't see straightforward documentation of the steps to:

Format clean data as... [example], with filenames of..., in folder...
Generate dirty data as.... [example], using scripts..., in folder
Split data into train/eval/test...
Start train/eval cycle...
Run final test cycle...

Any help in this endeavor would be greatly appreciated.

Note: This is a low resource language.
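
The data-preparation steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the toolkit's own pipeline: it assumes training data is a pair of line-aligned clean and corrupt texts with the same number of tokens per line, and it uses simple random character edits instead of the noising strategies from the paper. The function names (corrupt_word, corrupt_line, split_pairs) and the placeholder sentences are hypothetical.

```python
import random
from typing import List, Tuple


def corrupt_word(word: str, rng: random.Random, alphabet: str) -> str:
    """Apply one random character edit: swap, delete, insert, or replace."""
    if len(word) < 2:
        return word  # leave very short tokens untouched
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "delete", "insert", "replace"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(alphabet) + word[i:]
    return word[:i] + rng.choice(alphabet) + word[i + 1:]


def corrupt_line(line: str, rng: random.Random, alphabet: str, prob: float = 0.2) -> str:
    """Corrupt each token with probability `prob`; the token count is preserved."""
    return " ".join(
        corrupt_word(w, rng, alphabet) if rng.random() < prob else w
        for w in line.split()
    )


def split_pairs(
    pairs: List[Tuple[str, str]],
    rng: random.Random,
    dev_frac: float = 0.1,
    test_frac: float = 0.1,
):
    """Shuffle (clean, corrupt) pairs and split them into train/dev/test."""
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_test = int(len(pairs) * test_frac)
    n_dev = int(len(pairs) * dev_frac)
    return pairs[n_test + n_dev:], pairs[n_test:n_test + n_dev], pairs[:n_test]


# Placeholder sentences; replace with real clean text, one sentence per line.
clean_lines = [f"this is placeholder clean sentence number {i}" for i in range(10)]

rng = random.Random(0)
# Draw inserted/replaced characters from the characters seen in the clean text.
alphabet = "".join(sorted({ch for line in clean_lines for ch in line if not ch.isspace()}))
pairs = [(line, corrupt_line(line, rng, alphabet)) for line in clean_lines]
train, dev, test = split_pairs(pairs, rng)
print(len(train), len(dev), len(test))
```

Each split can then be written out as two files, one with the clean sides and one with the corrupt sides, keeping the line order identical in both.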

Punctuation?

Hi!

Great library!

Would it be possible to let it correct punctuation as well? E.g. missing commas, too many commas, etc.

Source code license?

Under what license is this code available? I didn't see one in the README or in the GitHub project metadata.

Reproduce experiment in paper

I wasn't able to find all the details required to train the BERT checker from scratch on the 1blm dataset. Is there a script with all the hyperparameters set appropriately? If not, could you help me set up the training script to retrain the model reported in the paper, so that I can then change things in that script to try a few things out?

Fine-Tune example

Hi,
First off, this is a great library so thanks for making it available.
I really like the fine-tune option and was trying to use it but I am just not sure what format is expected in the clean and corrupt files.
For example, if I have a custom word or abbreviation such as AWS and I want to correct misspellings of it, how do I create a training file for that?
Do I create a clean file with sentences like "how do i use aws" and a corrupt file with sentences like "how do i use aaws"?
I tried it like this, but it did not seem to "learn" the correct spelling and still auto-corrects aws to things like "laws".
Maybe it is difficult to change the pretrained model (I am using BertChecker at the moment), since it has already learned the "laws" example and would need a lot of training to relearn a new spelling?

I would appreciate any info you have on this.
Thanks in advance
Cathal
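
One thing worth checking before fine-tuning is that the clean and corrupt files line up. The sketch below assumes, without confirmation from the toolkit's documentation, that the two files must have the same number of lines and the same number of whitespace-separated tokens on each line. find_misaligned_pairs is a hypothetical helper, and the second sentence pair is made up to show a failing case.

```python
from typing import List, Tuple


def find_misaligned_pairs(
    clean_lines: List[str],
    corrupt_lines: List[str],
) -> List[Tuple[int, str, str]]:
    """Return (line_number, clean, corrupt) for pairs with differing token counts."""
    if len(clean_lines) != len(corrupt_lines):
        raise ValueError(
            f"line count differs: {len(clean_lines)} clean vs {len(corrupt_lines)} corrupt"
        )
    problems = []
    for line_no, (clean, corrupt) in enumerate(zip(clean_lines, corrupt_lines), start=1):
        if len(clean.split()) != len(corrupt.split()):
            problems.append((line_no, clean, corrupt))
    return problems


# First pair is from the question above; the second is an invented bad pair
# in which the misspelling splits one token into two.
clean_lines = ["how do i use aws", "is aws down"]
corrupt_lines = ["how do i use aaws", "is a ws down"]

problems = find_misaligned_pairs(clean_lines, corrupt_lines)
for line_no, clean, corrupt in problems:
    print(f"line {line_no}: {clean!r} vs {corrupt!r}")
```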

Running sentences through checker adds spaces near punctuation

Hello, I am having an issue: when I run sentences through the checker, extra spacing gets added near punctuation.

For example, when I run the following sentence through the BertChecker example in the README:

However, the member churches ("Glieekirchen") share full pulpit and altar fellowship.

the output is:
However , the member churches ( " Glieekirchen " ) share full pulpit and altar fellowship .

Any ideas why this would be?
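
The spaced-out output looks like tokenized text joined back with single spaces. Assuming that is the cause, one workaround is to post-process the corrected string. The sketch below is a hypothetical helper (detokenize is not part of neuspell) covering only common punctuation, brackets, and straight double quotes; apostrophes and other cases are not handled.

```python
CLOSERS = {",", ".", "!", "?", ";", ":", ")", "]", "}"}
OPENERS = {"(", "[", "{"}


def detokenize(text: str) -> str:
    """Re-attach punctuation that was separated by spaces.

    Closing punctuation is glued to the previous token, opening brackets to the
    next token, and straight double quotes alternate between opening and closing.
    """
    pieces = []
    glue_next = False   # True when the next token attaches to the previous piece
    quote_open = False  # True while inside a "..." span
    for tok in text.split():
        if tok == '"':
            is_closing = quote_open
            quote_open = not quote_open
        else:
            is_closing = tok in CLOSERS
        if pieces and (glue_next or is_closing):
            pieces[-1] += tok
        else:
            pieces.append(tok)
        glue_next = tok in OPENERS or (tok == '"' and quote_open)
    return " ".join(pieces)


spaced = 'However , the member churches ( " Glieekirchen " ) share full pulpit and altar fellowship .'
restored = detokenize(spaced)
print(restored)
```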

Code breaks using different model like roberta, xlm-roberta in finetuning

The code breaks when using a model other than BERT. Debugging showed that the code is written for the BERT tokenizer only, while other transformer models use different tokenizers. The snippet below is from helpers.py:

if BERT_TOKENIZER is None:  # gets initialized during the first call to this method
    if bert_pretrained_name_or_path:
        BERT_TOKENIZER = transformers.BertTokenizer.from_pretrained(bert_pretrained_name_or_path)
        BERT_TOKENIZER.do_basic_tokenize = True
        BERT_TOKENIZER.tokenize_chinese_chars = False
    else:
        BERT_TOKENIZER = transformers.BertTokenizer.from_pretrained('bert-base-cased')
        BERT_TOKENIZER.do_basic_tokenize = True
        BERT_TOKENIZER.tokenize_chinese_chars = False

Fine tuning on offline computer

Hi,

First of all, thanks for your library.
I am trying to run the fine-tuning script with a BERT-based model on a computer without internet access. Is this possible? If so, where should I make changes in the code?
Thanks in advance.

Unable to load finetuned model

After fine-tuning the model, the new fine-tuned model is not loaded. Even if I specify the new checkpoint path, the default checkpoint is loaded.
In the code, ckpt_path is reset to the default even when it is passed, so I modified that.
Now when I load the model using the fine-tuned checkpoint, it is not compatible.

Modify readme

Download datasets
Run the following to download datasets

cd data/traintest
python download_dataset.py 
See data/traintest/README.md for more details.

The command python download_dataset.py must be changed to python download_datafiles.py, as there is no download_dataset.py in data/traintest.

IndexError: index out of range in self

Hi @neuspell team. I really appreciate the effort that you put into this research project.

I would like to try a Vietnamese model, for example "vinai/phobert-base", but I can't train it. I always get

"CUDA error: device-side assert triggered" when I run on GPU.

When I tried to debug by running on CPU, I found that the error is "IndexError: index out of range in self" in the file models.py, line 918:

bert_encodings = self.bert_model(**batch_bert_dict, return_dict=False)[0]

I am still debugging and modifying the code, but I'm not confident that I can fix it by myself.

(Other models run fine, like Multilingual BERT base, and Distill Multilingual BERT. I trained on Vietnamese data)

Thank you and have a nice day
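
In PyTorch, "IndexError: index out of range in self" is typically raised by an embedding lookup when an index is not smaller than the size of the embedding table. Two usual suspects, offered as hypotheses and not as a confirmed diagnosis for this model, are token ids produced by a tokenizer whose vocabulary is larger than the model's, and sequences longer than the model's position-embedding table. The sketch below checks both on plain lists of ids; check_embedding_bounds is a hypothetical helper and the numbers are toy values.

```python
from typing import List, Sequence


def check_embedding_bounds(
    batch_input_ids: Sequence[Sequence[int]],
    vocab_size: int,
    max_positions: int,
) -> List[str]:
    """Report rows whose token ids or lengths would overflow the embedding tables."""
    problems = []
    for row, ids in enumerate(batch_input_ids):
        if len(ids) > max_positions:
            problems.append(
                f"row {row}: length {len(ids)} exceeds max_positions {max_positions}"
            )
        bad = [t for t in ids if t < 0 or t >= vocab_size]
        if bad:
            problems.append(
                f"row {row}: token ids out of range for vocab_size {vocab_size}: {bad}"
            )
    return problems


# Toy values for illustration only.
batch = [[1, 2, 3], [1, 12, 3], [1, 2, 3, 4, 5]]
problems = check_embedding_bounds(batch, vocab_size=10, max_positions=4)
for p in problems:
    print(p)
```

With a Hugging Face model, the two limits can be read from model.config.vocab_size and model.config.max_position_embeddings, and the ids from the input_ids entry of the tokenizer output.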

AttributeError: 'CorrectorElmoSCLstm' object has no attribute 'finetune'

I was trying to run

""" load spell checkers """
from neuspell import SclstmChecker
checker = SclstmChecker()
checker = checker.add_("elmo", at="input")
checker.from_pretrained("./data/checkpoints/elmoscrnn-probwordnoise")

""" spell correction """
checker.correct("I luk foward to receving your reply")
# → ["I look forward to receiving your reply"]


""" fine-tuning on domain specific dataset """
checker.finetune(clean_file="clean.txt", corrupt_file="corrupt.txt")
