lyeoni / nlp-tutorial

A list of NLP (Natural Language Processing) tutorials

License: MIT License

Python 5.67% Shell 0.04% Jupyter Notebook 94.30%

natural-language-processing neural-machine-translation nlp nlp-tutorial sentiment-classification text-classification

nlp-tutorial's Introduction

NLP Tutorial

A list of NLP (Natural Language Processing) tutorials built on PyTorch.

Step-by-step tutorials on how to implement simple real-world NLP tasks and adapt them to your own use.

Text Classification

News Category Classification

This repo provides a simple PyTorch implementation of text classification, with simple annotation. Here we use the HuffPost news corpus, which includes the corresponding category for each article. The classification model trained on this dataset identifies the category of a news article from its headline and description.
Keyword: CBoW, LSTM, fastText, Text categorization

IMDb Movie Review Classification

This text classification tutorial trains a transformer model on the IMDb movie review dataset for sentiment analysis. It provides a simple PyTorch implementation, with simple annotation.
Keyword: Transformer, Sentiment analysis

Question-Answer Matching

This repo provides a simple PyTorch implementation of question-answer matching. Here we use the corpus from Stack Exchange to build embeddings for entire questions. Using those embeddings, we find questions similar to a given question and show the answers that correspond to the similar questions found.
Keyword: CBoW, TF-IDF, LSTM with variable-length sequences

Movie Review Classification (Korean NLP)

This repo provides a simple Keras implementation of TextCNN for text classification. Here we use a movie review corpus written in Korean. The model trained on this dataset identifies the sentiment of a review from its text.
Keyword: TextCNN, Sentiment analysis

Neural Machine Translation

English to French Translation - seq2seq

This neural machine translation tutorial trains a seq2seq model on many thousands of English-French sentence pairs to translate from English to French. It provides an intrinsic/extrinsic comparison of various sequence-to-sequence (seq2seq) models in translation.
Keyword: sequence to sequence network (seq2seq), Attention, Autoregressive, Teacher-forcing

French to English Translation - Transformer

This neural machine translation tutorial trains a Transformer model on many thousands of French-English sentence pairs to translate from French to English. It provides a simple PyTorch implementation, with simple annotation.
Keyword: Transformer, SentencePiece

Natural Language Understanding

Neural Language Model

This repo provides a simple PyTorch implementation of Neural Language Model for natural language understanding. Here we implement unidirectional/bidirectional language models, and pre-train language representations from unlabeled text (Wikipedia corpus).
Keyword: Autoregressive language model, Perplexity
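
As a quick illustration of the perplexity keyword: perplexity is the exponential of the average negative log-likelihood per token. The sketch below is standalone and does not use the repository's code; the function name `perplexity` and the toy probabilities are assumptions made for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs.

    token_log_probs: log-probabilities a language model assigned to each
    observed token (hypothetical input format for this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among 4 tokens, so perplexity is about 4.
uniform_log_probs = [math.log(0.25)] * 3
ppl = perplexity(uniform_log_probs)
print(ppl)
```

Lower perplexity means the model assigns higher probability to the held-out text.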

nlp-tutorial's Issues

question-answer-matching missing file

Hi Lyeoni,

First of all, thank you very much for making these tutorials, which are really interesting!

I am trying to run the question-answer-matching tutorial and reproduce your evaluation. Unfortunately, I can't download the Posts.xml file from Git LFS; it looks like your subscription no longer allows downloads.
By any chance, do you have that file hosted somewhere else? That would allow me to run the evaluation with your trained model.

Thanks a lot, and I wish you a nice day! :-)

How about the speed of the model

Using the classifier

Hi
After saving the model in news-category-classification, how do you actually use it to predict text classification?
Can you put up an example, please?

Arabic to Urdu Machine Translation

@lyeoni

If I want to train an Arabic to Urdu machine translation model:

Is that attainable using this project?
What options should be set in training?
Do you suggest another GitHub project?

Little improvements for right indexes in vocabulary dictionaries

Hi, @lyeoni!
You have written great tutorials. I really appreciate it.
We can improve things a little with one neat line. Please take a look.
Here, the first key-value items of stoi and itos are filled with the special tokens.
I suggest inserting this line before the loop:
special_tokens = filter(lambda x: x is not None, [self.unk_token, self.bos_token, self.eos_token, self.pad_token])
If self.unk_token is not set but self.bos_token is, the indices in the dictionary become wrong. So None values need to be filtered out first.
Input
vocab = Vocab(body, bos_token='<bos>'); vocab.build(); vocab.stoi;
Wrong Output
'<bos>': 1 ' ': 1, 'hi': 2, 'bear': 3, ...
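
A minimal sketch of the suggested fix, assuming a simplified `Vocab` class. The class below is a hypothetical stand-in written for illustration, not the repository's implementation; only the attribute names `stoi`, `itos`, and the four special-token arguments come from the issue text.

```python
from collections import Counter


class Vocab:
    """Hypothetical, simplified stand-in for the tutorial's Vocab class."""

    def __init__(self, corpus, unk_token=None, bos_token=None,
                 eos_token=None, pad_token=None):
        self.corpus = corpus  # list of tokenized sentences
        self.unk_token = unk_token
        self.bos_token = bos_token
        self.eos_token = eos_token
        self.pad_token = pad_token
        self.stoi, self.itos = {}, {}

    def build(self):
        # Keep only the special tokens that were actually set, so that
        # enumerate() below never spends an index on a None entry.
        special_tokens = [
            tok for tok in (self.unk_token, self.bos_token,
                            self.eos_token, self.pad_token)
            if tok is not None
        ]
        counts = Counter(tok for sent in self.corpus for tok in sent)
        words = [w for w, _ in counts.most_common() if w not in special_tokens]
        for idx, token in enumerate(special_tokens + words):
            self.stoi[token] = idx
            self.itos[idx] = token


body = [["hi", "bear"], ["hi", "there"]]
vocab = Vocab(body, bos_token="<bos>")
vocab.build()
print(vocab.stoi)  # '<bos>' gets index 0 and 'hi' gets index 1
```

With the filter in place, the indices stay contiguous from 0 regardless of which special tokens are configured.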

How can I fully utilize the GPU?

Thanks for your code!

I found that during training the GPU is not fully utilized. Is there a way to add batching so that more pairs are trained in one iteration?

Movie Rating Classification no datasets

Is there no dataset for Movie Rating Classification?

num_samples should be a positive integer value, but got num_samples=0

python train.py --epochs 12 --batch_size 2 --learning_rate .001 --hidden_size 64 --n_layers 1 --dropout_p .1

number of trained word vectors of data/glove.6B.100d.txt: 400000
Traceback (most recent call last):
  File "train.py", line 200, in <module>
    train_loader = DataLoader(dataset=qa_train, batch_size=config.batch_size, shuffle=True, num_workers=4)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 176, in __init__
    sampler = RandomSampler(dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 66, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

Please let me know what the exact issue is.

local variable 'MosesTokenizer' referenced before assignment

The corresponding package is installed and the dataset is downloaded. Running vocab.py produces the following error: "local variable 'MosesTokenizer' referenced before assignment"

neural-machine-translation - nmt ZeroDivisionError: integer division or modulo by zero

Traceback (most recent call last):

File "", line 1, in
runfile('D:/nlp-tutorial/neural-machine-translation/nmt/train.py', wdir='D:/nlp-tutorial/neural-machine-translation/nmt')

File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "D:/nlp-tutorial/neural-machine-translation/nmt/train.py", line 254, in <module>
trainiters(pairs, encoder, decoder, n_iters)

File "D:/nlp-tutorial/neural-machine-translation/nmt/train.py", line 184, in trainiters
train_pairs += [random.choice(train_pairs) for i in range(n_iters%len(train_pairs))]

ZeroDivisionError: integer division or modulo by zero
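
The traceback ends at `n_iters % len(train_pairs)`, which can only raise ZeroDivisionError when `train_pairs` is empty, so the likely cause is that no sentence pairs were loaded (for example, a wrong data path or an over-strict filter; this is a guess, not confirmed by the report). A minimal sketch of a guard, using a hypothetical helper `pad_training_pairs` that is not part of the repository:

```python
import random


def pad_training_pairs(train_pairs, n_iters):
    """Hypothetical helper: return exactly n_iters pairs drawn from train_pairs.

    Fails early with a readable message instead of a ZeroDivisionError
    when no pairs were loaded.
    """
    if not train_pairs:
        raise ValueError(
            "train_pairs is empty; check the data path and any length filters"
        )
    full_repeats, remainder = divmod(n_iters, len(train_pairs))
    padded = list(train_pairs) * full_repeats
    # Fill the remainder with randomly chosen pairs, as the quoted line does.
    padded += [random.choice(train_pairs) for _ in range(remainder)]
    return padded


pairs = [("hello", "bonjour"), ("cat", "chat")]
padded = pad_training_pairs(pairs, 5)
print(len(padded))  # 5
```

Printing `len(pairs)` right after loading the data is a quick way to confirm whether the dataset is empty.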

Question about validation accuracy

Thanks for your great work! I learned a lot. However, I have a question.
I trained the model for 7 epochs, reaching a training accuracy of 95.2 and a test (validation) accuracy of 85.2.
Is that normal? Could the final test (validation) accuracy be higher after more epochs? Thanks!

Please add transformer based tutorial

Kindly add a tutorial for NLP with a Transformer setup.

No module named 'nltk.tokenize.moses'

I have installed nltk, but an error occurs when I run the code:
ModuleNotFoundError: No module named 'nltk.tokenize.moses'

typo in preprocessing?

Hi,
In the cleaning function in the script nlp-tutorial/news-category-classification/preprocessing.py,
line 21 is written as text = re.sub(r'[!]{2,}', '?', text) # multiple ?s -> ?. The first argument should contain ? rather than !, so the line should be text = re.sub(r'[?]{2,}', '?', text) # multiple ?s -> ?.
Am I correct?
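
A short check of the two patterns quoted above, run on a made-up sample string (the string is an assumption for illustration; the two `re.sub` calls are copied from the issue):

```python
import re

text = "Really??? Wow!!!"

# Pattern as quoted from line 21: matches runs of '!' but substitutes '?'.
as_written = re.sub(r'[!]{2,}', '?', text)

# Pattern proposed in this issue: matches runs of '?', as the comment says.
as_proposed = re.sub(r'[?]{2,}', '?', text)

print(as_written)   # Really??? Wow?
print(as_proposed)  # Really? Wow!!!
```

The quoted pattern leaves the repeated question marks untouched and turns the exclamation marks into a question mark, which does not match its comment.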

Confused about the inference. Any example?

Hi, I am curious about the inference part of the model. Is there an example showing how it works? Many thanks.