Feedback from Sebastian on ML notebook

Feedback from @rasbt:

  • Okay, let me be very nit-picky here. I would either spell all the package names in lower-case or use the common convention: NumPy, seaborn, matplotlib, SciPy, scikit-learn
  • In "scikit-learn: The main Machine Learning package in Python." I would suggest replacing "main" by "essential" or so. It is really great for basic stuff, and essential has the positive tone of "important" and also "fundamental" at the same time.
  • About the iris images: They look great! But one question, are they really attribution free? I am wondering because I looked very hard to find some good ones that meet this criterion
  • Since this is more of a beginner audience, maybe define "accuracy", e.g., "fraction of correctly classified flower samples"
  • "hand-measuring 100 randomly-sampled flowers of each species" -> Maybe use "50" so that the reader can directly relate to the dataset.
  • Instead of "scatter matrix", maybe consider the term "scatter plot matrix" since "scatter matrix" is typically something else: an "unnormalized" covariance matrix (e.g., in LDA)
  • Maybe mention that random forests are scale-invariant, e.g., you could mention that a typical procedure in the data preprocessing pipeline (required by most ML algos) is to scale the features, but that this is not needed here because you are using decision trees (I believe this is the only scale-invariant algo that is used in ML) -- maybe also explain what a decision tree is and how it relates to random forests in a few sentences. On a side-note, but you probably already know this: Most gradient-based optimization algos
  • "There are several random forest classifier parameters that we can tune" -- yes there are, but typically, the idea behind random forest is that you don't need to tune any of these except for the number of trees.
  • "It's obviously a problem that our model performs quite differently depending on the data it's trained on." Maybe it would be too much for this intro, but you could mention high variance (overfitting) and high bias (underfitting); I suspect the high variance here comes from the fact that you are only using 10 trees, in RF you typically use hundreds or thousands of trees since it is a special case of bagging with unpruned decision trees after all. Also, Iris may not be the best example for RF since it is a very simple dataset that does not have many features (the random sampling of features is e.g., the advantage of RF over regular bagging). In general, maybe consider starting this section with an unpruned decision tree instead of random forests. And in the end, conclude with random forests and explain why they are typically better (with respect to overfitting). Nice side effect: you can visualize the decision tree with GraphViz. If you decide to stick with RF, consider tuning the n_estimators parameter instead.
  • When you plot the cross-val error, I would also print the standard deviation
  • RandomForestClassifier(n_estimators=10, max_depth=1); I wouldn't recommend showing people this example, this could give them the wrong idea; you don't prune trees in a forest.
  • Maybe also mention the problems with KNN, because people could think that it is typically a great classifier since it performs so well here. It's really susceptible to the curse of dimensionality, and you always have to keep the training set around (lazy learner). In this context, I would also mention that the scale of the features matters (if you use Euclidean distance) and in this case we don't have to worry about it because everything is in cm.
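
To make the points about accuracy, the number of trees, and the standard deviation concrete, here is a minimal sketch. It assumes scikit-learn's bundled copy of Iris rather than the notebook's CSV, and the tree counts, fold count, and `random_state` are illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: scikit-learn's built-in Iris data stands in for the notebook's file.
X, y = load_iris(return_X_y=True)

results = {}
for n_trees in (10, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # "accuracy" is the fraction of correctly classified flower samples per fold.
    scores = cross_val_score(forest, X, y, cv=10, scoring="accuracy")
    results[n_trees] = (scores.mean(), scores.std())
    print(f"{n_trees:>4} trees: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread next to the mean shows how much the score depends on which samples land in each fold.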

Ball Outcome

Cricinfo want to create a simple cricket simulator to test their scorecard and so have come to you for help.
After a ball is bowled, there are eight possible outcomes. Below we list the eight outcomes and each outcome’s
effect on the score:
• 0 runs: add 1 to the ball count
• 1 run: add 1 to the ball count, add 1 to the run count
• 2 runs: add 1 to the ball count, add 2 to the run count
• 4 runs: add 1 to the ball count, add 4 to the run count, add 1 to the 4s count
• 6 runs: add 1 to the ball count, add 6 to the run count, add 1 to the 6s count
• Wide: add 1 to the extras count
• No ball: add 1 to the extras count
• Out: add 1 to the ball count, mark batsman as out
Cricinfo store a batsman’s record using the variable:
state = list(balls = 0, runs = 0, fours = 0, sixes = 0, extras = 0, out = FALSE)
Write the function oneBall that takes the input state and one outcome and returns the updated state based
on the eight outcomes above
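
The exercise asks for an R function. Purely as an illustration of the update rules, here is a rough Python analogue that uses a dict in place of the R list. The name `one_ball` and the outcome labels ("0", "1", "2", "4", "6", "wide", "no ball", "out") are assumptions; the exercise does not say how outcomes are encoded.

```python
RUNS_FOR = {"0": 0, "1": 1, "2": 2, "4": 4, "6": 6}  # assumed outcome labels


def one_ball(state, outcome):
    """Return a new batsman record updated for one bowled ball."""
    new_state = dict(state)  # copy so the caller's record is left untouched
    if outcome in RUNS_FOR:
        new_state["balls"] += 1
        new_state["runs"] += RUNS_FOR[outcome]
        if outcome == "4":
            new_state["fours"] += 1
        elif outcome == "6":
            new_state["sixes"] += 1
    elif outcome in ("wide", "no ball"):
        new_state["extras"] += 1  # extras do not count as a ball faced
    elif outcome == "out":
        new_state["balls"] += 1
        new_state["out"] = True
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return new_state


state = {"balls": 0, "runs": 0, "fours": 0, "sixes": 0, "extras": 0, "out": False}
for outcome in ["4", "wide", "1", "6", "out"]:
    state = one_ball(state, outcome)
print(state)
```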

follower factory's alpha value calculation doesn't work for small accounts

That alpha = ... for the plot assumes a large number of followers. Otherwise, the alpha value is out of bounds!

For example, I think this is an easy fix, although maybe there are better ideas out there for normalizing the transparency w.r.t. the total number of points to plot.

alpha=0.1 * min(9, 80000. / len(days_since_2006)))
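
To see why the cap keeps the value in bounds, here is the same expression wrapped in a small function. The name `scatter_alpha` and its parameter names are hypothetical, and the guard against an empty list is an addition not present in the one-liner above.

```python
def scatter_alpha(n_points, base=0.1, max_scale=9, reference=80000.0):
    """Transparency for a scatter plot, kept inside matplotlib's [0, 1] range."""
    n_points = max(n_points, 1)  # added guard: avoid dividing by zero
    # min() caps the scale factor, so the result never exceeds base * max_scale.
    return base * min(max_scale, reference / n_points)


for n in (100, 80000, 1000000):
    print(n, scatter_alpha(n))
```

With these defaults the result tops out around 0.9 for small accounts and shrinks toward 0 as the number of points grows.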

Some stations do not report average daily temp

KSAF in Santa Fe, New Mexico does not record an average daily temperature, only the mean. In order for the code to work in this case, all of the index values need to be shifted down by one after reading weather_data[0].

ML notebook: Add interpretation section

What features are being used to make the classification?

Type of Classification | Description | Example
Categorical (Nominal) | Classification of entities into particular categories. | That thing is a dog. That thing is a car.
Ordinal | Classification of entities in some kind of ordered relationship. | You are stronger than him. It is hotter today than yesterday.
Adjectival or Predicative | Classification based on some quality of an entity. | That car is fast. She is smart.
Cardinal | Classification based on a numerical value. | He is six feet tall. It is 25.3 degrees today.

Categorical classification is also called nominal classification because it classifies an entity in terms of the name of the class it belongs to. This is the type of classification we focus on in this document.

Why are those features important?

Let’s imagine that you’ve landed a consulting gig with a bank who have asked you to identify those who have a high likelihood of default on the next month’s bill. Armed with the machine learning techniques that you’ve learnt and practiced, let’s say you proceed to analyze the data set given by your client and have used a random forest algorithm that achieves a reasonably high accuracy. Your next task is to present to the business stakeholders from the client’s team how you achieved these results. What would you say to them? Will they be able to understand all the hyperparameters of the algorithm that you tweaked in order to land on your final model? How will they react when you start talking about the number of estimators and Gini criterion of the random forest?
Although it is important to be proficient in understanding the inner workings of the algorithm, it is far more essential to be able to communicate the findings to an audience who may not have any theoretical / practical knowledge of machine learning. Just showing that the algorithm predicts well is not enough. You have to attribute the predictions to the elements of the input data that contribute to your accuracy. Thankfully, the random forest implementation of sklearn does give an output called “feature importances” which helps us explain the predictive power of the features in the dataset. But, there are certain drawbacks to this method that we will explore in this post, and an alternative technique to assess the feature importances that overcomes these drawbacks.
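
As a rough illustration of the "feature importances" output mentioned above, here is a minimal sketch. It uses scikit-learn's bundled Iris data as a stand-in for the bank data, which is not available here. It also shows `permutation_importance` as one possible alternative; that the post means this particular technique is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Assumption: Iris stands in for the client's data set.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Impurity-based importances: one value per feature, normalized to sum to 1.
impurity = dict(zip(iris.feature_names, forest.feature_importances_))

# Permutation importance: drop in score when one feature's values are shuffled.
perm = permutation_importance(forest, iris.data, iris.target,
                              n_repeats=10, random_state=0)
permuted = dict(zip(iris.feature_names, perm.importances_mean))

for name in iris.feature_names:
    print(f"{name:<20} impurity={impurity[name]:.3f}  permutation={permuted[name]:.3f}")
```

For simplicity the permutation scores are computed on the training data; in practice a held-out split is usually a better choice.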

What does that say about the problem domain?

A problem domain is the area of expertise or application that needs to be examined to solve a problem. Defining a problem domain means looking only at the topics of interest and excluding everything else. For example, when developing a system to measure good practice in medicine, carpet drawings at hospitals would not be included in the problem domain. In this example, the domain refers to relevant topics solely within the delimited area of interest: medicine. This points to a limitation of an overly specific, or overly bounded, problem domain. An individual may think they are interested in medicine and not interior design, but a better solution may exist outside of the problem domain as it was initially conceived. For example, IDEO researchers noticed that patients in hospitals spent a huge amount of time staring at acoustic ceiling tiles, which "became a symbol of the overall ambiance: a mix of boredom and anxiety from feeling lost, uninformed, and out of control."

KeyError while using seaborn plotting

Hi. I am getting KeyError 'class' while attempting to plot iris data.

sb.pairplot(iris_data.dropna(), hue='class')

gives the following stack trace, please advise.

KeyError Traceback (most recent call last)
in ()
----> 1 sb.pairplot(iris_data.dropna(), hue='class')

/Users/mgudipati/anaconda/lib/python2.7/site-packages/seaborn/ in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, size, aspect, dropna, plot_kws, diag_kws, grid_kws)
1583 hue_order=hue_order, palette=palette,
1584 diag_sharey=diag_sharey,
-> 1585 size=size, aspect=aspect, dropna=dropna, **grid_kws)
1587 # Add the markers here as PairGrid has figured out how many levels of the

/Users/mgudipati/anaconda/lib/python2.7/site-packages/seaborn/ in init(self, data, hue, hue_order, palette, hue_kws, vars, x_vars, y_vars, diag_sharey, size, aspect, despine, dropna)
1221 index=data.index)
1222 else:
-> 1223 hue_names = utils.categorical_order(data[hue], hue_order)
1224 if dropna:
1225 # Filter NA from the list of unique hue names

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in getitem(self, key)
1990 return self._getitem_multilevel(key)
1991 else:
-> 1992 return self._getitem_column(key)
1994 def _getitem_column(self, key):

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
1997 # get column
1998 if self.columns.is_unique:
-> 1999 return self._get_item_cache(key)
2001 # duplicate columns & possible reduce dimensionality

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1343 res = cache.get(item)
1344 if res is None:
-> 1345 values = self._data.get(item)
1346 res = self._box_item_values(item, values)
1347 cache[item] = res

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
3224 if not isnull(item):
-> 3225 loc = self.items.get_loc(item)
3226 else:
3227 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
1876 return self._engine.get_loc(key)
1877 except KeyError:
-> 1878 return self._engine.get_loc(self._maybe_cast_indexer(key))
1880 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 'class'
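
The final line of the trace says the DataFrame has no column named exactly 'class'. A quick way to check is to print the column names before plotting. The sketch below uses a small hypothetical frame whose header has stray spaces and different capitalization; the real cause in this report may be different.

```python
import pandas as pd

# Hypothetical stand-in for a CSV whose header does not match 'class' exactly.
iris_data = pd.DataFrame({
    "sepal_length_cm": [5.1, 7.0, 6.3],
    " Class ": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

print(list(iris_data.columns))  # inspect the real column names first

# Normalize the names: strip surrounding whitespace and lower-case them.
iris_data.columns = iris_data.columns.str.strip().str.lower()

hue = "class"
if hue not in iris_data.columns:
    raise KeyError(f"{hue!r} not found; columns are {list(iris_data.columns)}")
print(iris_data[hue].value_counts())
```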

Hi, I'm getting a KeyError for 'species'; please advise after looking at this error.

I am using seaborn, but this is just a command to count how many data points are present for each class.
I wrote this: iris["species"].value_counts()

KeyError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/ in get_loc(self, key, method, tolerance)
2645 try:
-> 2646 return self._engine.get_loc(key)
2647 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
----> 1 iris["species"].value_counts()

~/anaconda3/lib/python3.7/site-packages/pandas/core/ in getitem(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/ in get_loc(self, key, method, tolerance)
2646 return self._engine.get_loc(key)
2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2650 if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'

Thank you

twitter got 400 in follower-factory

The entire error reads:
TwitterHTTPError: Twitter sent status 400 for URL: 1.1/application/rate_limit_status.json using parameters: (oauth_consumer_key=&oauth_nonce=8407703836662294415&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1607558162&oauth_version=1.0&oauth_signature=wkpQTGhB4qpoMvkpmvzUFMfNJ%2Bc%3D)
details: {'errors': [{'code': 215, 'message': 'Bad Authentication data.'}]}
It seems the authentication method or API is no longer valid.
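
One thing worth ruling out: the logged parameters show an empty `oauth_consumer_key=`, which suggests the credentials may never have been filled in. Here is a minimal sketch that fails fast in that case. The credential names and the `missing_credentials` helper are hypothetical, not part of follower-factory or the twitter package.

```python
import os

# Hypothetical names; use whatever your script actually reads.
REQUIRED = ("CONSUMER_KEY", "CONSUMER_SECRET", "OAUTH_TOKEN", "OAUTH_TOKEN_SECRET")


def missing_credentials(env):
    """Return the names of required credentials that are absent or blank."""
    return [name for name in REQUIRED if not str(env.get(name, "")).strip()]


missing = missing_credentials(os.environ)
if missing:
    print("Fill in these credentials before calling the API:", ", ".join(missing))
```

If all four values are set and the 400 persists, the problem lies elsewhere, for example on the API side.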

UsageError: Line magic function `%install_ext` not found.

In the file Example Machine Learning Notebook.ipynb, code line 37, since %install_ext was deprecated, it is now better to ask the user to install watermark:

pip install watermark

followed by

%load_ext watermark

%watermark -a 'author' -nmv --packages numpy,pandas,sklearn,matplotlib,seaborn

Getting this error when I try to plot my dataframe 'callers' with sns

KeyError Traceback (most recent call last)
~\Anaconda\lib\site-packages\pandas\core\indexes\ in get_loc(self, key, method, tolerance)
2894 try:
-> 2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:

pandas_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'callers'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
----> 1 sns.pairplot(callers , hue = 'callers')

~\AppData\Roaming\Python\Python38\site-packages\ in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f

~\AppData\Roaming\Python\Python38\site-packages\seaborn\ in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)
1923 # Set up the PairGrid
1924 grid_kws.setdefault("diag_sharey", diag_kind == "hist")
-> 1925 grid = PairGrid(data, vars=vars, x_vars=x_vars, y_vars=y_vars, hue=hue,
1926 hue_order=hue_order, palette=palette, corner=corner,
1927 height=height, aspect=aspect, dropna=dropna, **grid_kws)

~\AppData\Roaming\Python\Python38\site-packages\ in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f

~\AppData\Roaming\Python\Python38\site-packages\seaborn\ in init(self, data, hue, hue_order, palette, hue_kws, vars, x_vars, y_vars, corner, diag_sharey, height, aspect, layout_pad, despine, dropna, size)
1212 index=data.index)
1213 else:
-> 1214 hue_names = categorical_order(data[hue], hue_order)
1215 if dropna:
1216 # Filter NA from the list of unique hue names

~\Anaconda\lib\site-packages\pandas\core\ in getitem(self, key)
2900 if self.columns.nlevels > 1:
2901 return self._getitem_multilevel(key)
-> 2902 indexer = self.columns.get_loc(key)
2903 if is_integer(indexer):
2904 indexer = [indexer]

~\Anaconda\lib\site-packages\pandas\core\indexes\ in get_loc(self, key, method, tolerance)
2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
-> 2897 raise KeyError(key) from err
2899 if tolerance is not None:

KeyError: 'callers'
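
In the call above, `hue` is given the DataFrame's variable name, but seaborn looks `hue` up as a column name inside the data. Here is a small sketch of a check that gives a clearer message. The helper `require_column` and the example columns are hypothetical.

```python
import difflib

import pandas as pd


def require_column(frame, name):
    """Return `name` if it is a column of `frame`, else raise a helpful KeyError."""
    if name in frame.columns:
        return name
    columns = [str(c) for c in frame.columns]
    close = difflib.get_close_matches(name, columns, n=3)
    raise KeyError(f"{name!r} is not a column; available: {columns}; close matches: {close}")


# Hypothetical data: the frame is called `callers`, but has no 'callers' column.
callers = pd.DataFrame({"caller_type": ["new", "repeat", "new"],
                        "duration": [30, 45, 12]})
hue = require_column(callers, "caller_type")
print(hue)
```

The validated name can then be passed on, e.g. `sns.pairplot(callers, hue=hue)`.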

ML notebook: Expand data testing section

Expand the data testing section to create actual unit tests and explain assert statements a little better. Currently newcomers to unit tests would have no idea what's going on with assert statements.
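
As a starting point, here is a minimal sketch of what this could look like. The rows, column names, and test names are made up for illustration and are not taken from the notebook.

```python
import unittest

# Hypothetical rows standing in for the cleaned data set.
rows = [
    {"sepal_length_cm": 5.1, "class": "Iris-setosa"},
    {"sepal_length_cm": 7.0, "class": "Iris-versicolor"},
    {"sepal_length_cm": 6.3, "class": "Iris-virginica"},
]
EXPECTED_CLASSES = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"}

# A bare assert does nothing when the condition is true and raises
# AssertionError (with the message after the comma) when it is false.
assert len(rows) > 0, "data set is empty"


class TestIrisData(unittest.TestCase):
    def test_only_known_classes(self):
        classes = {row["class"] for row in rows}
        self.assertTrue(classes <= EXPECTED_CLASSES, f"unexpected: {classes - EXPECTED_CLASSES}")

    def test_measurements_are_positive(self):
        for row in rows:
            self.assertGreater(row["sepal_length_cm"], 0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIrisData)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Unlike a bare assert, which stops at the first failure, the test case reports every failing check by name.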

ML notebook: Add interpretation section

Add a section near the end trying to interpret the model:

  • What features are being used to make the classification?
  • Why are those features important?
  • What does that say about the problem domain?

ML notebook: Add preprocessing and a sklearn pipeline

From a Reddit comment:

Advice: I think you are missing a few big things like preprocessing/scaling and pipelines.

Before using the learners, inputs should be scaled so that each feature has equal weight. StandardScaler and MinMaxScaler (from sklearn.preprocessing) are both appropriate. If you think some features are more important, you can scale them later to increase their relative importance in prediction. These are more parameters you would tune using CV, but these can be really numerous, so GridSearch is out the window and you would have to consider some alternatives like Nelder-Mead search, genetic search, or multivariate gradient descent if you suspect convexity.

You have to fit these scalers on the training data and then use the trained fit to transform the testing data. Using Pipelines simplifies this whole process (fits the scaler and learner at once, transforms and predicts at once).
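
A minimal sketch of that fit-on-train, transform-on-test flow, assuming scikit-learn's bundled Iris data and a k-nearest-neighbors learner. The split size, `n_neighbors`, and `random_state` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# The pipeline chains the scaler and the learner into one estimator.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# fit() learns the scaling statistics from the training split only.
model.fit(X_train, y_train)

# score() transforms the test split with those statistics, then predicts.
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the pipeline behaves like a single estimator, it can also be passed to cross_val_score or GridSearchCV, so the scaler is refit inside every fold.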

Best parameters result not reproducible

Hi, this is a very helpful example. Fun to read and easy to follow. I just have one question. You close with Reproducibility, but each time I run the cell to compute the best parameters for DecisionTreeClassifier, I get different answers most of the time. That would appear to make the result not reproducible. Any reason?
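
A likely cause is unseeded randomness: DecisionTreeClassifier shuffles candidate features when choosing splits, and a shuffled cross-validation splitter draws different folds on every run. Here is a minimal sketch that fixes both seeds. The parameter grid and the `best_tree_params` helper are illustrative assumptions, not the notebook's exact cell.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def best_tree_params(seed=0):
    """Grid search with the tree and the fold assignment both seeded."""
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=seed),
        param_grid={"max_depth": [1, 2, 3, 4, 5], "max_features": [1, 2, 3, 4]},
        cv=folds,
    )
    search.fit(X, y)
    return search.best_params_


first = best_tree_params()
second = best_tree_params()
print(first, second, first == second)
```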

