rhiever / data-analysis-and-machine-learning-projects
Repository of teaching materials, code, and data for my data analysis and machine learning projects.
Home Page: http://www.randalolson.com/blog/
The link to the single Python script from Andrew Liesinger is broken.
Feedback from @rasbt:
Error in finding the path and API key.
Cricinfo want to create a simple cricket simulator to test their scorecard, and so have come to you for help.
After a ball is bowled, there are eight possible outcomes. Below we list the eight outcomes and each outcome's effect on the score:
• 0 runs: add 1 to the ball count
• 1 run: add 1 to the ball count, add 1 to the run count
• 2 runs: add 1 to the ball count, add 2 to the run count
• 4 runs: add 1 to the ball count, add 4 to the run count, add 1 to the 4s count
• 6 runs: add 1 to the ball count, add 6 to the run count, add 1 to the 6s count
• Wide: add 1 to the extras count
• No ball: add 1 to the extras count
• Out: add 1 to the ball count, mark batsman as out
Cricinfo store a batsman’s record using the variable:
state = list(balls = 0, runs = 0, fours = 0, sixes = 0, extras = 0, out = FALSE)
Write the function oneBall that takes the input state and one outcome, and returns the updated state based on the eight outcomes above.
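The eight rules above can be sketched as a function. A minimal translation to Python (the exercise itself uses an R list; the dict keys mirror its fields, but the outcome labels below are my own choices, not part of the exercise):

```python
# Sketch of oneBall in Python; outcome labels ("wide", "no ball", "out",
# and the run counts as strings) are illustrative assumptions.
def one_ball(state, outcome):
    """Return a new state dict reflecting one delivery's outcome."""
    s = dict(state)                      # copy so the input is not mutated
    if outcome in ("wide", "no ball"):
        s["extras"] += 1                 # extras do not count as a ball
        return s
    s["balls"] += 1
    if outcome == "out":
        s["out"] = True
    else:
        runs = int(outcome)              # "0", "1", "2", "4", or "6"
        s["runs"] += runs
        if runs == 4:
            s["fours"] += 1
        elif runs == 6:
            s["sixes"] += 1
    return s

state = {"balls": 0, "runs": 0, "fours": 0, "sixes": 0, "extras": 0, "out": False}
print(one_ball(state, "4"))
```

Returning a copy rather than mutating in place matches the functional style of the R version, where the updated state is the return value.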
The `alpha = ...` value for the plot assumes a large number of followers. Otherwise, the alpha value is out of bounds! I think this is an easy fix, although maybe there are better ideas out there for normalizing the transparency w.r.t. the total number of points to plot, e.g.:
`alpha=0.1 * min(9, 80000. / len(days_since_2006))`
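A quick numeric check of that capping formula (a sketch; the `80000` scale factor and the variable name come from the snippet above):

```python
# The min(9, ...) term caps the ratio before the 0.1 multiplier, so alpha
# stays in (0, 0.9] whether the point count n is tiny or huge.
def capped_alpha(n):
    return 0.1 * min(9, 80000. / n)

print(capped_alpha(100))      # few points: capped at the 0.9 ceiling
print(capped_alpha(800000))   # many points: 0.01, nearly transparent
```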
KSAF in Santa Fe, New Mexico does not record an average daily temperature, only the mean. In order for wunderground_parser.py to work in this case, all of the index values need to be shifted down by one after reading weather_data[0].
Type of Classification | Description | Example
---|---|---
Categorical (Nominal) | Classification of entities into particular categories. | That thing is a dog. That thing is a car.
Ordinal | Classification of entities in some kind of ordered relationship. | You are stronger than him. It is hotter today than yesterday.
Adjectival or Predicative | Classification based on some quality of an entity. | That car is fast. She is smart.
Cardinal | Classification based on a numerical value. | He is six feet tall. It is 25.3 degrees today.
Categorical classification is also called nominal classification because it classifies an entity in terms of the name of the class it belongs to. This is the type of classification we focus on in this document.
Let’s imagine that you’ve landed a consulting gig with a bank that has asked you to identify customers who have a high likelihood of defaulting on next month’s bill. Armed with the machine learning techniques that you’ve learned and practiced, you analyze the data set given by your client and fit a random forest that achieves reasonably high accuracy. Your next task is to present to the business stakeholders on the client’s team how you achieved these results. What would you say to them? Will they be able to understand all the hyperparameters of the algorithm that you tweaked in order to land on your final model? How will they react when you start talking about the number of estimators and the Gini criterion of the random forest?
Although it is important to understand the inner workings of the algorithm, it is far more essential to be able to communicate the findings to an audience that may not have any theoretical or practical knowledge of machine learning. Just showing that the algorithm predicts well is not enough. You have to attribute the predictions to the elements of the input data that contribute to your accuracy. Thankfully, the random forest implementation in sklearn provides an output called “feature importances” that helps explain the predictive power of the features in the dataset. But there are certain drawbacks to this method, which we will explore in this post along with an alternative technique for assessing feature importances that overcomes them.
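As a rough illustration of that output, here is a minimal sketch assuming scikit-learn and its built-in iris dataset (the dataset choice and hyperparameters are mine, not the post's):

```python
# Sketch: reading sklearn's impurity-based feature importances from a
# random forest fit on an illustrative dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# feature_importances_ sums to 1.0; a higher value means that feature
# reduced more impurity across the forest's splits.
for name, importance in sorted(zip(feature_names, model.feature_importances_),
                               key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")
```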
A problem domain is the area of expertise or application that needs to be examined to solve a problem. A problem domain simply looks at only the topics of an individual's interest, excluding everything else. For example, when developing a system to measure good practice in medicine, carpet drawings at hospitals would not be included in the problem domain. In this example, the domain refers to relevant topics solely within the delimited area of interest: medicine. This points to a limitation of an overly specific, or overly bounded, problem domain. An individual may think they are interested in medicine and not interior design, but a better solution may exist outside of the problem domain as it was initially conceived. For example, IDEO researchers noticed that patients in hospitals spent a huge amount of time staring at acoustic ceiling tiles, which "became a symbol of the overall ambiance: a mix of boredom and anxiety from feeling lost, uninformed, and out of control."
it rejects the point of overfitting and gives 93% accuracy
Hi. I am getting KeyError 'class' while attempting to plot iris data.
sb.pairplot(iris_data.dropna(), hue='class')
gives the following stack trace. Please advise:
```
KeyError                                  Traceback (most recent call last)
in ()
----> 1 sb.pairplot(iris_data.dropna(), hue='class')

/Users/mgudipati/anaconda/lib/python2.7/site-packages/seaborn/linearmodels.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, size, aspect, dropna, plot_kws, diag_kws, grid_kws)
   1583             hue_order=hue_order, palette=palette,
   1584             diag_sharey=diag_sharey,
-> 1585             size=size, aspect=aspect, dropna=dropna, **grid_kws)
   1586
   1587     # Add the markers here as PairGrid has figured out how many levels of the

/Users/mgudipati/anaconda/lib/python2.7/site-packages/seaborn/axisgrid.py in __init__(self, data, hue, hue_order, palette, hue_kws, vars, x_vars, y_vars, diag_sharey, size, aspect, despine, dropna)
   1221                                  index=data.index)
   1222         else:
-> 1223             hue_names = utils.categorical_order(data[hue], hue_order)
   1224         if dropna:
   1225             # Filter NA from the list of unique hue names

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993
   1994     def _getitem_column(self, key):

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000
   2001         # duplicate columns & possible reduce dimensionality

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1343         res = cache.get(item)
   1344         if res is None:
-> 1345             values = self._data.get(item)
   1346             res = self._box_item_values(item, values)
   1347             cache[item] = res

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3223
   3224         if not isnull(item):
-> 3225             loc = self.items.get_loc(item)
   3226         else:
   3227             indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   1876             return self._engine.get_loc(key)
   1877         except KeyError:
-> 1878             return self._engine.get_loc(self._maybe_cast_indexer(key))
   1879
   1880         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 'class'
```
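For what it's worth, a KeyError like this usually means the `hue` name doesn't match the DataFrame's actual column headers. A minimal sketch of how to check (the column names below are assumptions about the iris CSV, not verified against it):

```python
import pandas as pd

# Illustrative stand-in for iris_data; in the notebook it comes from
# pd.read_csv on the iris CSV.
iris_data = pd.DataFrame({
    'sepal_length_cm': [5.1, 4.9, 6.3],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
})

# Print the headers the file actually contains before plotting:
print(iris_data.columns.tolist())

# Guard before calling sb.pairplot(iris_data.dropna(), hue='class'):
assert 'class' in iris_data.columns, "no 'class' column; check the CSV header"
```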
Hi, I found a convenient link that gives the weather table directly:
Software Carpentry does this in their license. CC recommends it.
I am using seaborn, but this is just a command to count how many data points are present for each class. I wrote this: `iris["species"].value_counts()`
```
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
in
----> 1 iris["species"].value_counts()

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2798         if self.columns.nlevels > 1:
   2799             return self._getitem_multilevel(key)
-> 2800         indexer = self.columns.get_loc(key)
   2801         if is_integer(indexer):
   2802             indexer = [indexer]

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
```
Thank you
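One hedged guess: in this repository's iris CSV the label column is named `class`, not `species`, so `iris["species"]` raises the KeyError. A minimal sketch with an illustrative frame:

```python
import pandas as pd

# Illustrative stand-in for the tutorial's iris data; the real frame
# comes from pd.read_csv. Note the assumed label column name 'class'.
iris = pd.DataFrame({'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor']})

# iris['species'].value_counts() raises KeyError: 'species' here;
# counting per-class rows with the column name the file actually has:
counts = iris['class'].value_counts()
print(counts)
```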
The entire error reads:
TwitterHTTPError: Twitter sent status 400 for URL: 1.1/application/rate_limit_status.json using parameters: (oauth_consumer_key=&oauth_nonce=8407703836662294415&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1607558162&oauth_version=1.0&oauth_signature=wkpQTGhB4qpoMvkpmvzUFMfNJ%2Bc%3D)
details: {'errors': [{'code': 215, 'message': 'Bad Authentication data.'}]}
It seems the authentication method or API is no longer valid.
Hi,
I think cell 37 should be dt_scores instead of rf_scores.
In the file Example Machine Learning Notebook.ipynb, code cell 37: since `%install_ext` was deprecated, it is now better to ask the user to install watermark:
pip install watermark
followed by
%load_ext watermark
%watermark -a 'author' -nmv --packages numpy,pandas,sklearn,matplotlib,seaborn
```
KeyError                                  Traceback (most recent call last)
~\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2894             try:
-> 2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'callers'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
in
----> 1 sns.pairplot(callers , hue = 'callers')

~\AppData\Roaming\Python\Python38\site-packages\seaborn\_decorators.py in inner_f(*args, **kwargs)
    44             )
    45         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46         return f(**kwargs)
    47     return inner_f
    48

~\AppData\Roaming\Python\Python38\site-packages\seaborn\axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)
   1923     # Set up the PairGrid
   1924     grid_kws.setdefault("diag_sharey", diag_kind == "hist")
-> 1925     grid = PairGrid(data, vars=vars, x_vars=x_vars, y_vars=y_vars, hue=hue,
   1926                     hue_order=hue_order, palette=palette, corner=corner,
   1927                     height=height, aspect=aspect, dropna=dropna, **grid_kws)

~\AppData\Roaming\Python\Python38\site-packages\seaborn\_decorators.py in inner_f(*args, **kwargs)
    44             )
    45         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46         return f(**kwargs)
    47     return inner_f
    48

~\AppData\Roaming\Python\Python38\site-packages\seaborn\axisgrid.py in __init__(self, data, hue, hue_order, palette, hue_kws, vars, x_vars, y_vars, corner, diag_sharey, height, aspect, layout_pad, despine, dropna, size)
   1212                                  index=data.index)
   1213         else:
-> 1214             hue_names = categorical_order(data[hue], hue_order)
   1215         if dropna:
   1216             # Filter NA from the list of unique hue names

~\Anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2900         if self.columns.nlevels > 1:
   2901             return self._getitem_multilevel(key)
-> 2902         indexer = self.columns.get_loc(key)
   2903         if is_integer(indexer):
   2904             indexer = [indexer]

~\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2895             return self._engine.get_loc(casted_key)
   2896         except KeyError as err:
-> 2897             raise KeyError(key) from err
   2898
   2899         if tolerance is not None:

KeyError: 'callers'
```
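A possible cause: `hue` must name a column inside the DataFrame, not the DataFrame itself. A sketch with an illustrative `callers` frame (the column names are assumptions, since the real data isn't shown):

```python
import pandas as pd

# Illustrative 'callers' frame; the real column names are unknown to me.
callers = pd.DataFrame({
    'duration': [1.0, 2.5, 0.5],
    'region': ['east', 'west', 'east'],
})

# sns.pairplot(callers, hue='callers') raises KeyError because 'callers'
# is the DataFrame's variable name, not one of its columns:
print(callers.columns.tolist())

# Instead, pass an actual categorical column, e.g.:
# sns.pairplot(callers, hue='region')
```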
https://github.com/josephmisiti/awesome-machine-learning
Disclaimer: it is my repo!
Expand the data testing section to create actual unit tests and explain assert statements a little better. Currently newcomers to unit tests would have no idea what's going on with assert statements.
Add a section near the end trying to interpret the model:
From a Reddit comment:
Advice: I think you are missing a few big things like preprocessing/scaling and pipelines.
Before using the learners, inputs should be scaled so that each feature has equal weight. Something like StandardScaler or MinMaxScaler are both appropriate (from sklearn.preprocessing). If you think some features are more important, you can scale them later to increase their relative importance in prediction. These are more parameters you would tune using CV, but these can be really numerous, so GridSearch is out the window and you would have to consider some alternatives like Nelder Mead search, genetic search, or multivariate gradient descent if you suspect convexity.
You have to fit these scalers on the training data and then use the trained fit to transform the testing data. Using Pipelines simplifies this whole process (fits the scaler and learner at once, transforms and predicts at once).
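A minimal sketch of that pattern, assuming scikit-learn and using iris as a stand-in dataset (the classifier choice is mine, purely illustrative):

```python
# Scaler inside a Pipeline: the scaler is fit only on training data and
# that same fit is reused to transform the test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),   # fit on the training split only
    ('clf', SVC()),
])
pipe.fit(X_train, y_train)             # fits scaler and learner at once
print(pipe.score(X_test, y_test))      # transforms and predicts at once
```

Passing `pipe` to cross-validation utilities keeps the scaler's fit inside each training fold, avoiding leakage from the held-out fold.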
Hi, this is a very helpful example. Fun to read and easy to follow. I just have one question. You close with Reproducibility, but each time I run the cell to compute the best parameters for DecisionTreeClassifier, I get different answers most of time. That would appear to be a result not reproducible. Any reason?
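One hedged answer: that run-to-run variation usually comes from unseeded randomness in the CV splits and in the tree's own tie-breaking. A sketch of pinning both seeds (the grid and names are illustrative, not the notebook's exact code):

```python
# Fixing random_state on both the splitter and the estimator makes the
# parameter search repeat exactly across runs.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      {'max_depth': [2, 3, 4]}, cv=cv)
search.fit(X, y)
print(search.best_params_)   # should repeat across runs with these seeds
```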
Regardless of what the user passes to run_genetic_algorithm() for population_size, the code always reduces the population size down to 100. Need to rework that code so it's more dynamic.
Cannot open the website.
Machine Learning
I want to reopen this issue.