davidsbatista / text-classification

An example of how to train supervised classifiers for multi-label text classification using sklearn pipelines

Home Page: http://www.davidsbatista.net/blog/2017/04/01/document_classification/

Jupyter Notebook 86.33% Python 13.67%

text-classification train-supervised-classifiers multi-label-classification

text-classification's Introduction

text-classification

An example of how to train supervised classifiers for multi-label text classification

Code for the blog post:

http://www.davidsbatista.net/blog/2017/04/01/document_classification

text-classification's Issues

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

y_value = array([221900, 180000, 510000, ..., 360000, 400000, 325000])

# List of machine learning algorithms that will be used for predictions

estimator = [('Logistic Regression', LogisticRegression), ('Ridge Classifier', RidgeClassifier),
             ('SGD Classifier', SGDClassifier), ('Passive Aggressive Classifier', PassiveAggressiveClassifier),
             ('SVC', SVC), ('Linear SVC', LinearSVC), ('Nu SVC', NuSVC),
             ('K-Neighbors Classifier', KNeighborsClassifier),
             ('Gaussian Naive Bayes', GaussianNB), ('Multinomial Naive Bayes', MultinomialNB),
             ('Bernoulli Naive Bayes', BernoulliNB), ('Complement Naive Bayes', ComplementNB),
             ('Decision Tree Classifier', DecisionTreeClassifier),
             ('Random Forest Classifier', RandomForestClassifier), ('AdaBoost Classifier', AdaBoostClassifier),
             ('Gradient Boosting Classifier', GradientBoostingClassifier), ('Bagging Classifier', BaggingClassifier),
             ('Extra Trees Classifier', ExtraTreesClassifier), ('XGBoost', XGBClassifier)]

# Separating independent features and dependent feature from the dataset

#X_train = titanic.drop(columns='Survived')
#y_train = titanic['Survived']

# Creating a dataframe to compare the performance of the machine learning models

comparison_cols = ['Algorithm', 'Training Time (Avg)', 'Accuracy (Avg)', 'Accuracy (3xSTD)']
comparison_df = pd.DataFrame(columns=comparison_cols)

# Generating training/validation dataset splits for cross validation

cv_split = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

# Performing cross-validation to estimate the performance of the models

for idx, est in enumerate(estimator):

    cv_results = cross_validate(est[1](), x_value, y_value, cv=cv_split)

    comparison_df.loc[idx, 'Algorithm'] = est[0]
    comparison_df.loc[idx, 'Training Time (Avg)'] = cv_results['fit_time'].mean()
    comparison_df.loc[idx, 'Accuracy (Avg)'] = cv_results['test_score'].mean()
    comparison_df.loc[idx, 'Accuracy (3xSTD)'] = cv_results['test_score'].std() * 3

comparison_df.set_index(keys='Algorithm', inplace=True)
comparison_df.sort_values(by='Accuracy (Avg)', ascending=False, inplace=True)

# Visualizing the performance of the models

and the following error occurred:

ValueError Traceback (most recent call last)
in <module>
25 for idx, est in enumerate(estimator):
26
---> 27 cv_results = cross_validate(est[1](), x_value, y_value, cv=cv_split)
28
29 comparison_df.loc[idx, 'Algorithm'] = est[0]

~/.local/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
238 return_times=True, return_estimator=return_estimator,
239 error_score=error_score)
--> 240 for train, test in cv.split(X, y, groups))
241
242 zipped_scores = list(zip(*scores))

~/.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
915 # remaining jobs.
916 self._iterating = False
--> 917 if self.dispatch_one_batch(iterator):
918 self._iterating = self._original_iterator is not None
919

~/.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
752 tasks = BatchedCalls(itertools.islice(iterator, batch_size),
753 self._backend.get_nested_backend(),
--> 754 self._pickle_cache)
755 if len(tasks) == 0:
756 # No more tasks available in the iterator: tell caller to stop.

~/.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, iterator_slice, backend_and_jobs, pickle_cache)
208
209 def __init__(self, iterator_slice, backend_and_jobs, pickle_cache=None):
--> 210 self.items = list(iterator_slice)
211 self._size = len(self.items)
212 if isinstance(backend_and_jobs, tuple):

~/.local/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in <genexpr>(.0)
233 pre_dispatch=pre_dispatch)
234 scores = parallel(
--> 235 delayed(_fit_and_score)(
236 clone(estimator), X, y, scorers, train, test, verbose, None,
237 fit_params, return_train_score=return_train_score,

~/.local/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
1313 """
1314 X, y, groups = indexable(X, y, groups)
-> 1315 for train, test in self._iter_indices(X, y, groups):
1316 yield train, test
1317

~/.local/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
1693 class_counts = np.bincount(y_indices)
1694 if np.min(class_counts) < 2:
-> 1695 raise ValueError("The least populated class in y has only 1"
1696 " member, which is too few. The minimum"
1697 " number of groups for any class cannot"

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
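A note on the likely cause, offered as a reading of the report rather than a confirmed diagnosis: the printed y_value looks like a continuous quantity (221900, 180000, 510000, ...) rather than a set of class labels. StratifiedShuffleSplit treats every distinct value in y as its own class, so any value that occurs only once triggers this error. The sketch below uses only the standard library and a hypothetical helper name, singleton_classes, to list the values that would count as one-member classes. It uses just the six values visible in the issue; the elided middle of the array is not reproduced.

```python
from collections import Counter


def singleton_classes(y):
    """Return, in sorted order, the values of y that occur fewer than two times."""
    counts = Counter(y)
    return sorted(value for value, n in counts.items() if n < 2)


# The six values visible in the issue (the rest of the array is elided there).
y_sample = [221900, 180000, 510000, 360000, 400000, 325000]
print(singleton_classes(y_sample))
# [180000, 221900, 325000, 360000, 400000, 510000]

# Hypothetical class labels where every class has at least two members.
y_labels = ["a", "a", "b", "b", "b"]
print(singleton_classes(y_labels))
# []
```

If the helper returns most of the target values, the target is probably not categorical. In that case a non-stratified splitter such as ShuffleSplit or KFold from sklearn.model_selection avoids the per-class check, although accuracy-scored classifiers may still be a poor fit for a continuous target.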

Queries about number of labels

Hey, thanks for the great article.
Could you please help me with the questions below?
I tried the same method on a dataset with more than 400 labels and 7,000 examples in total, but I'm not able to get results as accurate as yours.
Can you suggest a better approach when there are more labels?
Thanks in advance!

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

Hi David,
I'm seeing the following error when I try to run your script on my test data.
I generated a CSV file similar to your movies_generes.csv file, with one text column and multiple label columns whose values are 1 or 0.

It looks like the problem is with the "StratifiedSplit" method, but I'm not sure what the issue is.
Every label column has both '0' and '1' values in more than 2 rows of the file.

PS E:\Tools\TLC> python E:\Projects\NLP\TrainClassifiers.py --vectors tfidf --clf nb
C:\Python27\lib\site-packages\gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Loading already processed training data
Traceback (most recent call last):
File "E:\Projects\NLP\TrainClassifiers.py", line 338, in <module>
main()
File "E:\Projects\NLP\TrainClassifiers.py", line 245, in main
for train_index, test_index in stratified_split.split(data_x, data_y):
File "C:\Python27\lib\site-packages\sklearn\model_selection\_split.py", line 1204, in split
for train, test in self._iter_indices(X, y, groups):
File "C:\Python27\lib\site-packages\sklearn\model_selection\_split.py", line 1546, in _iter_indices
raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
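A possible explanation, stated as an assumption about the installed scikit-learn version rather than something the traceback confirms: when y is a 2-D indicator matrix, StratifiedShuffleSplit appears to stratify on each distinct row, meaning each combination of labels, and not on each column separately. Every column can then contain both 0 and 1 in many rows while some particular combination still occurs only once. The sketch below uses only the standard library, a hypothetical helper name (rare_label_combinations), and made-up data to show how such a combination can be found.

```python
from collections import Counter


def rare_label_combinations(y_rows, min_count=2):
    """Map each label combination (row) seen fewer than min_count times to its count."""
    counts = Counter(tuple(row) for row in y_rows)
    return {combo: n for combo, n in counts.items() if n < min_count}


# Hypothetical indicator matrix: 5 documents, 3 label columns.
# Every column holds both 0 and 1 in at least two rows.
y_rows = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 0],  # this combination appears only once
]
print(rare_label_combinations(y_rows))
# {(1, 1, 0): 1}
```

If the helper reports any combinations for the real label matrix, options include dropping or merging those rows before the stratified split, or falling back to a non-stratified splitter such as ShuffleSplit.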

Do you know where I can find datasets of multi-label text? Wiki, Amazon
