
productner's Introduction

Product categorization and named entity recognition

This repository is meant to automatically extract features from product titles and descriptions. Below we explain how to install and run the code and describe the implemented algorithms. We also provide background information, including the current state of the art in both sequence classification and sequence tagging, and suggest possible improvements to the current implementation. Enjoy!

Requirements

Use Python 3.7 and install the dependencies with the following command (preferably inside a venv or conda environment):

pip install -r requirements.txt

Usage

Fetching data

Amazon product data

cd ./data/
wget http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz
gzip -d metadata.json.gz

GloVe

cd ./data/
wget https://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip

Preprocessing data

cd ./data/
python parse.py metadata.json
python normalize.py products.csv
python trim.py products.normalized.csv
python supplement.py products.normalized.trimmed.csv
python tag.py products.normalized.trimmed.supplemented.csv

Training models

mkdir -p ./models/
python train_tokenizer.py data/products.normalized.trimmed.supplemented.tagged.csv
python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv
python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv

Extracting information

Infer on our sample dataset with your model by running the following:

python extract.py ./models/ Product\ Dataset.csv

Contents

  • extract.py: Script to extract product category specific attributes based on product titles and descriptions
  • train_tokenizer.py: Script to train a word tokenizer
  • train_ner.py: Script to train a product named entity recognizer based on product titles
  • train_classifier.py: Script to train a product category classifier based on product titles and descriptions
  • tokenizer.py: Word tokenizer class
  • ner.py: Named entity recognition class
  • classifier.py: Product classifier class
  • data/parse.py: Parses Amazon product metadata found at http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz
  • data/normalize.py: Normalizes product data
  • data/trim.py: Trims product data
  • data/supplement.py: Supplements product data
  • data/tag.py: Tags product data
  • Product\ Dataset.csv: CSV file with product ids, names, and descriptions

Algorithms

These are the methods used in this demonstrative implementation. For state-of-the-art extensions, we refer the reader to the references listed below.

  • Tokenization: built-in Keras tokenizer with an 80,000-word vocabulary maximum
  • Embedding: Stanford GloVe (Wikipedia 2014 + Gigaword 5, 200 dimensions) with a maximum sequence length of 200
  • Sequence classification: 3-layer CNN with max pooling between layers
  • Sequence tagging: bidirectional LSTM

For the sequence classification task, we extract product titles, descriptions, and categories from the Amazon product corpus. We then fit our CNN model to predict the product category from the combined product title and description. On 800K samples with a batch size of 256, we achieve an overall F1 score of ~0.90 after 2 epochs.
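
For concreteness, here is a minimal Keras sketch of a 3-layer CNN classifier along these lines. The filter counts, kernel sizes, dropout rate, and class count are illustrative assumptions, not the repository's exact configuration.

# Hedged sketch of a 3-layer CNN text classifier; sizes are illustrative, not the repo's exact values
from tensorflow.keras import layers, models

MAX_WORDS, SEQ_LEN, EMBED_DIM, NUM_CLASSES = 80000, 200, 200, 40  # assumed values

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(MAX_WORDS, EMBED_DIM),   # GloVe weights would be loaded into this layer
    layers.Dropout(0.3),                      # dropout after the embedding, as described below
    layers.Conv1D(128, 5, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation='relu'),
    layers.GlobalMaxPooling1D(),
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])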

For the sequence tagging task, we extract product titles and brands from the Amazon product corpus. We then fit our bidirectional LSTM model to label each word token in the product title as either part of a brand or not. On 800K samples with a batch size of 256, we achieve an overall F1 score of ~0.85 after 2 epochs.
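
A minimal sketch of such a bidirectional LSTM tagger is shown below; the LSTM width, dropout rate, and tag count are illustrative assumptions rather than the repository's exact settings.

# Hedged sketch of a bidirectional LSTM sequence tagger; sizes and tag count are illustrative
from tensorflow.keras import layers, models

MAX_WORDS, SEQ_LEN, EMBED_DIM, NUM_TAGS = 80000, 200, 200, 2  # assumed values

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(MAX_WORDS, EMBED_DIM, mask_zero=True),
    layers.Dropout(0.3),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation='softmax')),  # one label per token
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])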

For both models we use the 200-dimensional GloVe embedding, though we note that a higher-dimensional embedding might achieve superior performance. Additionally, we could be more careful in the data preprocessing and trim bad tokens (e.g. HTML remnants). For both models we also use a dropout layer after the embedding to combat overfitting.

Background

Problem definition

The problem of extracting features from unstructured textual data can be given different names depending on the circumstances and desired outcome. Generally, we can split tasks into two camps: sequence classification and sequence tagging.

In sequence classification, we take a text fragment (anywhere from a single sentence to an entire document) and try to project it into a categorical space. This is considered many-to-one classification in that we take a set of many features and produce a single output.

Sequence tagging, on the other hand, is often considered a many-to-many problem, since you take in an entire sequence and attempt to apply a label to each element of the sequence. An example of sequence tagging is part-of-speech labeling, where one attempts to label the part of speech of each word in a sentence. Other methods that fall into this camp include chunking (breaking a sentence into relational components) and named entity recognition (extracting pre-specified features such as geographic locations or proper names).

Tokenization and embedding

An often important step in any natural language processing task is projecting from the character-based space that composes words and sentences to a numeric space on which computer models can operate.

The first step is simply to index the unique tokens appearing in a dataset. There is some freedom in what is considered a token: it can be a specific group of words, a single word, or even an individual character. A popular choice is to create a word-based dictionary which maps unique space-separated character sequences to unique indices. Usually this is done after a normalization procedure where everything is lower-cased, converted to ASCII, etc. This dictionary can then be sorted by frequency of occurrence in the dataset and truncated to a maximum size. After tokenization, the dataset is transformed into a set of indices, where truncated words are typically replaced with a '0' index.
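
With the Keras tokenizer used in this project, that workflow looks roughly as follows; the example texts are made up and the parameters mirror the settings listed in the Algorithms section.

# Index the most frequent words and pad/truncate to a fixed sequence length (sketch)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["Acme stainless steel water bottle", "Acme running shoes for men"]  # toy examples
tokenizer = Tokenizer(num_words=80000)
tokenizer.fit_on_texts(texts)                  # build the word -> index dictionary by frequency
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=200)  # pads with 0; words outside the top num_words are dropped unless an oov_token is set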

Following tokenization, the indexed words are often projected into an embedding vector space. Currently popular embeddings include word2vec [1] and GloVe [2]. Word2vec (as the name implies) is a word-to-vector-space projector composed of a two-layer neural network. The network is trained in one of two ways: continuous bag-of-words, where the model attempts to predict the current word using the surrounding words as context features, and continuous skip-grams, where the model attempts to predict the surrounding context words from the current word. GloVe stands for "global vectors for word representation". Essentially, it is a count-based, unsupervised embedding in which a token co-occurrence matrix is constructed and factorized.
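
A common way to combine GloVe with the tokenizer above is to build an embedding matrix keyed by the tokenizer's word indices. The sketch below assumes the glove.6B.200d.txt file from the download step and the tokenizer object from the previous sketch.

# Build an embedding matrix from the downloaded GloVe vectors (sketch)
import numpy as np

MAX_WORDS, EMBED_DIM = 80000, 200
embeddings = {}
with open('data/glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

embedding_matrix = np.zeros((MAX_WORDS, EMBED_DIM))
for word, idx in tokenizer.word_index.items():  # tokenizer from the previous sketch
    if idx < MAX_WORDS and word in embeddings:
        embedding_matrix[idx] = embeddings[word]
# The matrix can then seed a Keras Embedding layer, e.g.
# layers.Embedding(MAX_WORDS, EMBED_DIM, weights=[embedding_matrix], trainable=False)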

Vector-space embedding methods generally provide substantial improvements over using basic dictionaries since they inject contextual knowledge from the language. Additionally, they allow a much more compact representation, while maintaining important correlations. For example, they allow you to do amazing things like performing word arithmetic:

king - man + woman = queen

where approximate equality is determined by computing vector similarities (e.g. cosine similarity).
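
Using the embeddings dictionary loaded in the sketch above, the analogy can be checked directly with a nearest-neighbour search under cosine similarity:

# king - man + woman ~ queen, via cosine similarity over the GloVe vectors (sketch)
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = embeddings['king'] - embeddings['man'] + embeddings['woman']
best = max((w for w in embeddings if w not in {'king', 'man', 'woman'}),
           key=lambda w: cosine(embeddings[w], target))
print(best)  # 'queen' is typically the nearest neighbour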

Sequence classification

Classification is the shining pillar of modern-day machine learning, with convolutional neural networks (CNNs) at the top. With their ability to efficiently represent high-level features via windowed filtering, CNNs have seen their largest success in the classification and segmentation of images. More recently, however, CNNs have started seeing success in natural language sequence classification as well. Several recent works have shown that for text classification, CNNs can significantly outperform other classification methods such as hidden Markov models and support vector machines [3,4]. The reason CNNs see success in text classification is likely the same reason they see success in the vision domain: there are strong, regular correlations between nearby features which are efficiently picked up by reasonably sized filters.

Even more recently, the dominance of CNNs has been challenged by recurrent neural network (RNN) architectures. In particular, long short-term memory (LSTM) units have shown exceptional promise. LSTMs pass output from one unit to the next while carrying along an internal state. How this state updates (as well as the other weights in the network) can be trained end-to-end on variable-length sequences by passing a single token at a time. For classification, bidirectional LSTMs, which allow for long-range contextual correlations in both forward and reverse directions, have seen the best performance [5,6]. An additional feature of these networks is an attention layer that allows continuous addressing of the internal states of the sequential LSTM units. This further strengthens the network's ability to draw correlations from both nearby and faraway tokens.
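
As a rough illustration of this idea, the following sketch adds a simple attention-pooling layer on top of a bidirectional LSTM classifier. The layer sizes are assumptions, and this is not the model used in this repository.

# Hedged sketch: BiLSTM classifier with simple attention pooling (not the model used in this repo)
from tensorflow.keras import layers, models

MAX_WORDS, SEQ_LEN, EMBED_DIM, NUM_CLASSES = 80000, 200, 200, 40  # assumed values

inputs = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(MAX_WORDS, EMBED_DIM)(inputs)
h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)  # (batch, SEQ_LEN, 256)
scores = layers.Dense(1, activation='tanh')(h)                        # one score per timestep
weights = layers.Softmax(axis=1)(scores)                              # attention weights over timesteps
context = layers.Dot(axes=1)([weights, h])                            # weighted sum of LSTM states
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(layers.Flatten()(context))
model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])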

Sequence tagging

As mentioned above, sequence tagging is a many-to-many machine learning task, and thus places an added emphasis on the sequential nature of the input and output. This makes CNNs largely ill-suited for the problem. Instead, the dominant approaches are again bidirectional LSTMs [11,12], as well as another method called conditional random fields (CRFs) [7]. CRFs can be seen either as sequential logistic regression or as more powerful hidden Markov models. Essentially, they are sequential models composed of many defined feature functions that depend both on the word currently being labelled and on the surrounding words. The relative weights of these feature functions can then be trained via any supervised learning approach. CRFs are used extensively in the literature for both part-of-speech tagging and named entity recognition because of their ease of use and interpretability [8-10].
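
To make the feature-function idea concrete, here is a small sketch using the sklearn-crfsuite package (which is not used in this repository); the features, tags, and toy sentence are made-up examples.

# Hedged sketch of hand-crafted CRF features with sklearn-crfsuite (not part of this repo)
import sklearn_crfsuite

def token_features(tokens, i):
    word = tokens[i]
    return {
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'word.isdigit': word.isdigit(),
        'prev.lower': tokens[i - 1].lower() if i > 0 else 'BOS',
        'next.lower': tokens[i + 1].lower() if i < len(tokens) - 1 else 'EOS',
    }

sentences = [['Acme', 'stainless', 'steel', 'water', 'bottle']]
labels = [['BRAND', 'O', 'O', 'O', 'O']]
X = [[token_features(s, i) for i in range(len(s))] for s in sentences]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # per-token tag predictions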

Even more recent models for sequence tagging use a combination of the aforementioned methods (CNN, LSTM, and CRF) [13,14,15]. These works usually use a bidirectional LSTM as the main labeling architecture, another RNN or CNN to capture character-level information, and finally a CRF layer to model the label dependencies. A logical next step would be to combine these methods with the neural attention models used in sequence classification, though this seems to be missing from the current literature.

Future directions

Looking forward, there are several avenues for continued research. More sophisticated word embeddings might help alleviate the need for complicated neural architectures. Hierarchical optimization methods can be used to automatically build new architectures as well as optimize hyperparameters. Diverse models can be intelligently combined to produce more powerful classification schemes (indeed, many Kaggle competitions are won this way). One interesting approach is to combine text data with other available data sources such as associated images [10]. By collecting data from different sources, feature labels could possibly be extracted automatically by cross-comparison.

References

[1] "Distributed Representations of Words and Phrases and their Compositionality". Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. https://arxiv.org/abs/1310.4546.

[2] "GloVe: Global Vector Representation for Words". Stanford NLP. 2015. https://nlp.stanford.edu/projects/glove/.

[3] "Convolutional Neural Networks for Sentence Classification". Yoon Kim. 2014. "https://arxiv.org/abs/1408.5882.

[4] "Character-level Convolutional Networks for Text Classification". Xiang Zhang, Junbo Zhao, Yann LeCun. 2015. https://arxiv.org/abs/1509.01626.

[5] "Document Modeling with Gated Recurrent Neural Network for Sentiment Classification". Duyu Tang, Bing Qin, Ting Liu. 2015. http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP167.pdf.

[6] "Hierarchical Attention Networks for Document Classification". Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy. 2016. https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf.

[7] "An Introduction to Conditional Random Fields". Charles Sutton, Andrew McCallum. 2010. https://arxiv.org/abs/1011.4088.

[8] "Attribute Extraction from Product Titles in eCommerce". Ajinkya More. 2016. https://arxiv.org/abs/1608.04670.

[9] "Bootstrapped Named Entity Recognition for Product Attribute Extraction". Duangmanee (Pew) Putthividhya, Junling Hu. 2011. http://www.aclweb.org/anthology/D11-1144.

[10] "A Machine Learning Approach for Product Matching and Categorization". Petar Ristoski, Petar Petrovski, Peter Mika, Heiko Paulheim. 2017. http://www.semantic-web-journal.net/content/machine-learning-approach-product-matching-and-categorization-0.

[11] "Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss". Barbara Plank, Anders Søgaard, Yoav Goldberg. 2016. https://arxiv.org/abs/1604.05529.

[12] "Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network". Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao. 2015. https://arxiv.org/abs/1510.06168.

[13] "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF". Xuezhe Ma, Eduard Hovy. 2016. https://arxiv.org/abs/1603.01354.

[14] "Neural Architectures for Named Entity Recognition". Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016. https://arxiv.org/abs/1603.01360.

[15] "Neural Models for Sequence Chunking". Feifei Zhai, Saloni Potdar, Bing Xiang, Bowen Zhou. 2017. https://arxiv.org/abs/1701.04027.


productner's Issues

Issue running train_tokenizer.py

Running from the terminal, I am getting the error "zsh: illegal hardware instruction python train_tokenizer.py"

Running from a Jupyter Notebook, the kernel simply dies, so I'm not sure where the issue is. It might be an issue with the new Mac chips, but I was hoping someone else might be able to help troubleshoot here, as I'm not having issues with TensorFlow outside this error.

Old issue, still open: error at the end when evaluation happens

Traceback (most recent call last): File "train_ner.py", line 49, in main(sys.argv) File "train_ner.py", line 46, in main ner.train(data, labels) File "/root/productner/ner.py", line 190, in train self.evaluate(x_val, y_val, batch_size) File "/root/productner/ner.py", line 205, in evaluate target_names[self.tag_map[category]] = category IndexError: list assignment index out of range

IndexError while running 'python supplement.py products.normalized.trimmed.csv'

Hello etano

I was trying to execute the commands as you provided, but after trimming, the following command (and all subsequent scripts) gives an index error:
productner-master\data>python supplement.py products.normalized.trimmed.csv
Traceback (most recent call last):
  File "supplement.py", line 15, in <module>
    title, brand, description = row[0], row[1], row[2]
IndexError: list index out of range

Kindly let me know if any other details are needed.

Error classifier

Error when running:
python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv
ValueError: You are passing a target array of shape (0, 1) while using as loss categorical_crossentropy. categorical_crossentropy expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

from keras.utils import to_categorical
y_binary = to_categorical(y_int)

NER low performance

The classification training results are impressive:

           precision    recall      f1-score
micro avg  0.766327     0.766327    0.766327

But the NER results are significantly lower:

           precision    recall      f1-score
micro avg  0.042444     0.042444    0.042444

What are the expected results for NER I should achieve with this code?

Number of classes, 42, does not match size of target_names, 43. Try specifying the labels parameter.

@etano , thanks for sharing!

--THIS IS NOT A BUG--

I have a question related how to solve this:
ValueError: Number of classes, 42, does not match size of target_names, 43. Try specifying the labels parameter

Running the code as-is, I get this error after running:
python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv

I believe the error is coming because of an empty class in the JSON (" "). How can I get rid of it? Should I remove it, or should I ignore it?

Can you please guide me in this matter ?

Cheers,

ImportError: cannot import name 'CLoader'

I am getting the following error in parser.py while importing CLoader ("from yaml import CLoader as Loader"):

ImportError: cannot import name 'CLoader'

Any idea how to fix that?

Normalization fails

After running python parse.py metadata.json (results: good: 1806933, bad: 152), the next script fails: python normalize.py products.csv

Traceback (most recent call last):
  File "normalize.py", line 33, in <module>
    writer.writerow(row)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2032.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 49: character maps to <undefined>

Different results

Thank you for your implementation, but I ran the classifier, evaluated it on the whole dataset, and got the following results.

                       precision    recall  f1-score   support

clothing, shoes & jewelry 0.17 0.00 0.00 24536
sports & outdoors 0.00 0.00 0.00 65860
toys & games 0.08 0.04 0.06 80918
movies & tv 0.00 0.00 0.00 11303
baby 0.00 0.00 0.00 8675
tools & home improvement 0.07 0.00 0.00 73641
automotive 0.20 0.00 0.00 108660
home & kitchen 0.08 0.11 0.09 61362
arts, crafts & sewing 0.00 0.00 0.00 18658
office products 0.00 0.00 0.00 28153
books 0.00 0.04 0.00 91
office & school supplies 0.00 0.00 0.00 420
electronics 0.05 0.19 0.08 48724
computers 0.00 0.00 0.00 117
cell phones & accessories 0.01 0.01 0.01 7013
pet supplies 0.05 0.00 0.00 22393
health & personal care 0.09 0.00 0.00 61348
cds & vinyl 0.00 0.00 0.00 6891
musical instruments 0.02 0.00 0.00 19710
software 0.00 0.22 0.00 157
industrial & scientific 0.00 0.00 0.00 9049
all beauty 0.00 0.03 0.00 950
video games 0.00 0.00 0.00 241
beauty 0.05 0.00 0.01 52040
patio, lawn & garden 0.02 0.01 0.02 22239
grocery & gourmet food 0.00 0.00 0.00 25763
all electronics 0.00 0.00 0.00 338
baby products 0.00 0.13 0.01 1631
kitchen & dining 0.00 0.00 0.00 372
car electronics 0.00 0.00 0.00 39
digital music 0.00 0.00 0.00 1
home improvement 0.00 0.00 0.00 489
amazon fashion 0.00 0.00 0.00 441
appliances 0.00 0.00 0.00 1488
camera & photo 0.00 0.00 0.00 74
purchase circles 0.00 0.00 0.00 6
gps & navigation 0.00 0.00 0.00 42
mp3 players & accessories 0.00 0.00 0.00 40
collectibles & fine art 0.00 0.03 0.00 79
luxury beauty 0.00 0.00 0.00 415
furniture & dcor 0.00 0.00 0.00 26
0.00 0.00 0.00 30

          avg / total       0.07      0.03      0.02    764423

In other words, the F1 score is far lower than the 0.85 you mentioned. Is there something I'm missing?

TypeError when running extract.py

python extract.py ./models/ Product\ Dataset.csv

I get the error shown below when trying to execute the above command.

    main(sys.argv)
  File "extract.py", line 75, in main
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames+['category', 'brand'])
TypeError: unsupported operand type(s) for +: 'NoneType' and 'list'

CSV file opens in "wb". This throws an error in python3

I believe the line below:

outfile = open('.'.join(data_file.split('.')[:-1] + ['processed', 'csv']), 'wb')

must be replaced with the following line for Python 3:

outfile = open('.'.join(data_file.split('.')[:-1] + ['processed', 'csv']), 'w')

Error while running "python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv"

I am trying to run the code as described. However, when I run python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv, I get the error below.

D:\Product Classifier>python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv
Using TensorFlow backend.
Processed 297765 texts.
Getting labels...
Traceback (most recent call last):
  File "train_ner.py", line 49, in <module>
    main(sys.argv)
  File "train_ner.py", line 42, in main
    labels = ner.get_labels(tags)
  File "D:\Product Classifier\ner.py", line 105, in get_labels
    labels.append(to_categorical(np.asarray(indexed_tags), num_classes=4))
  File "C:\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\utils\np_utils.py", line 25, in to_categorical
    categorical[np.arange(n), y] = 1
IndexError: index 4 is out of bounds for axis 1 with size 4

I am not sure if I missed something. This is the first time I have reported an issue. Please let me know if any other details are needed.

NameError: global name 'evaluate' is not defined

Hello, I have an error when I execute train_classifier.py:

Traceback (most recent call last): File "train_classifier.py", line 49, in <module> main(sys.argv) File "train_classifier.py", line 46, in main classifier.train(data, labels) File "/root/productner/classifier.py", line 188, in train evaluate(x_val, y_val, batch_size) NameError: global name 'evaluate' is not defined

Thank you in advance.

Provide CSV files

Hi, please provide the intermediate CSV files:

products.csv, products.normalized.csv, products.normalized.trimmed.csv, and products.normalized.trimmed.supplemented.csv.

Understanding of scripts

Hi Ethan,

First of all, amazing project. I have been trying my hand at it and wanted to understand something: starting from the JSON data of 9.4 million rows, after running parse.py I am left with 1.9 million rows. Is this due to some error, or is the script written that way? I reviewed the script and did not come across any such condition, and because of the large size of the JSON file I am unable to inspect it directly. Please help.

While running the second script I get the error UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 49: character maps to <undefined>. This might be resolved by adding utf-8 encoding while reading the file; I will experiment with that.
Thanks
