pandas-cookbook's Introduction

Pandas cookbook

pandas is a Python library for doing data analysis. It's really fast and lets you do exploratory work incredibly quickly.

The goal of this cookbook is to give you some concrete examples for getting started with pandas. The docs are really comprehensive. However, I've often had people tell me that they have some trouble getting started, so these are examples with real-world data, and all the bugs and weirdness that entails.

It uses 3 datasets:

  • 311 calls in New York
  • How many people were on Montréal's bike paths in 2012
  • Hourly weather in Montréal for 2012

It comes with batteries (data) included, so you can try out all the examples right away.

How to use this cookbook

The easiest way is to try it out instantly online using Binder's awesome service. Start by clicking here, wait for it to launch, then click on "cookbook", and you'll be off to the races! It will let you run all the code interactively without having to install anything on your computer.

To install it locally, you'll need Jupyter notebook and pandas on your computer.

You can get these using pip (you may want to do this inside a virtual environment to avoid conflicting with your other libraries).

  pip install -r requirements.txt

This can be difficult to set up and may require you to compile a whole bunch of things. I instead use and recommend Anaconda, a Python distribution that gives you everything you need. It's free and open source.

Once you have pandas and Jupyter, you can get going!

git clone https://github.com/jvns/pandas-cookbook.git
cd pandas-cookbook/cookbook
jupyter notebook

A tab should open up in your browser at http://localhost:8888

Happy pandas!

Running the cookbook inside a Docker container.

This repository contains a Dockerfile and can be built into a Docker image. To build the image, run the following command from inside the repository directory:

docker build -t jvns/pandas-cookbook -f Dockerfile-Local .

Run the container:

docker run -d -p 8888:8888 -e "PASSWORD=MakeAPassword" <IMAGE ID>

You can find the image ID by running:

docker images

After starting the container, you can access the Jupyter notebook with the cookbook on port 8888.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

Translations

There's a translation into Chinese of this repo.

pandas-cookbook's People

Contributors

amygdalama, c-martinez, chankeypathak, duims, erikvdven, hydrosquall, jbalogh, julia-stripe, jvns, kim0, mkuzak, oibe, russkel, sanuj, scls19fr, wolever, zfrankel

pandas-cookbook's Issues

Chapter 1: first cell raises FutureWarning about mpl_style

In Chapter 1, running the first cell

# Render our plots inline
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)

triggers

/usr/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2885: FutureWarning: 
mpl_style had been deprecated and will be removed in a future version.
Use `matplotlib.pyplot.style.use` instead.

  exec(code_obj, self.user_global_ns, self.user_ns)
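The warning names its own replacement, `matplotlib.pyplot.style.use`. A minimal sketch of the first cell rewritten that way might look like the following. The `ggplot` style is an assumption (the issue does not say which style to pick), and the `Agg` backend line is only there so the snippet runs outside a notebook, where `%matplotlib inline` would be used instead.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, stands in for %matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

# Replaces pd.set_option('display.mpl_style', 'default');
# 'ggplot' is an assumed choice of built-in style, not one named in the issue.
plt.style.use("ggplot")
plt.rcParams["figure.figsize"] = (15, 5)
```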

Chapter 4 TypeError: unhashable type: 'slice'

When I run this line:
berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday

I get the error: TypeError: unhashable type: 'slice'

Can anyone tell me what is happening and how to fix it?

Error when plotting for Chapter 1

Similar errors exist for plots in other chapters too. I found the solution to be here:

http://stackoverflow.com/questions/33995707/attributeerror-unknown-property-color-cycle


AttributeError Traceback (most recent call last)
in ()
----> 1 fixed_df['Berri 1'].plot()

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/plotting.pyc in __call__(self, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
3495 colormap=colormap, table=table, yerr=yerr,
3496 xerr=xerr, label=label, secondary_y=secondary_y,
-> 3497 **kwds)
3498 __call__.__doc__ = plot_series.__doc__
3499

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/plotting.pyc in plot_series(data, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
2585 yerr=yerr, xerr=xerr,
2586 label=label, secondary_y=secondary_y,
-> 2587 **kwds)
2588
2589

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/plotting.pyc in _plot(data, x, y, subplots, ax, kind, **kwds)
2382 plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds)
2383
-> 2384 plot_obj.generate()
2385 plot_obj.draw()
2386 return plot_obj.result

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/plotting.pyc in generate(self)
985 self._compute_plot_data()
986 self._setup_subplots()
--> 987 self._make_plot()
988 self._add_table()
989 self._make_legend()

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/plotting.pyc in _make_plot(self)
1662 stacking_id=stacking_id,
1663 is_errorbar=is_errorbar,
-> 1664 **kwds)
1665 self._add_legend_handle(newlines[0], label, index=i)
1666

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/plotting.pyc in _ts_plot(cls, ax, x, data, style, **kwds)
1699 ax._plot_data.append((data, cls._kind, kwds))
1700
-> 1701 lines = cls._plot(ax, data.index, data.values, style=style, **kwds)
1702 # set date formatter, locators and rescale limits
1703 format_dateaxis(ax, ax.freq)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/plotting.pyc in _plot(cls, ax, x, y, style, column_num, stacking_id, **kwds)
1676 cls._initialize_stacker(ax, stacking_id, len(y))
1677 y_values = cls._get_stacked_values(ax, stacking_id, y, kwds['label'])
-> 1678 lines = MPLPlot._plot(ax, x, y_values, style=style, **kwds)
1679 cls._update_stacker(ax, stacking_id, y)
1680 return lines

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/plotting.pyc in _plot(cls, ax, x, y, style, is_errorbar, **kwds)
1298 else:
1299 args = (x, y)
-> 1300 return ax.plot(*args, **kwds)
1301
1302 def _get_index_name(self):

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/__init__.pyc in inner(ax, *args, **kwargs)
1810 warnings.warn(msg % (label_namer, func.__name__),
1811 RuntimeWarning, stacklevel=2)
-> 1812 return func(ax, *args, **kwargs)
1813 pre_doc = inner.__doc__
1814 if pre_doc is None:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in plot(self, *args, **kwargs)
1422 kwargs['color'] = c
1423
-> 1424 for line in self._get_lines(*args, **kwargs):
1425 self.add_line(line)
1426 lines.append(line)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _grab_next_args(self, *args, **kwargs)
384 return
385 if len(remaining) <= 3:
--> 386 for seg in self._plot_args(remaining, kwargs):
387 yield seg
388 return

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _plot_args(self, tup, kwargs)
372 ncx, ncy = x.shape[1], y.shape[1]
373 for j in xrange(max(ncx, ncy)):
--> 374 seg = func(x[:, j % ncx], y[:, j % ncy], kw, kwargs)
375 ret.append(seg)
376 return ret

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _makeline(self, x, y, kw, kwargs)
278 default_dict = self._getdefaults(None, kw, kwargs)
279 self._setdefaults(default_dict, kw, kwargs)
--> 280 seg = mlines.Line2D(x, y, **kw)
281 self.set_lineprops(seg, **kwargs)
282 return seg

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/lines.pyc in __init__(self, xdata, ydata, linewidth, linestyle, color, marker, markersize, markeredgewidth, markeredgecolor, markerfacecolor, markerfacecoloralt, fillstyle, antialiased, dash_capstyle, solid_capstyle, dash_joinstyle, solid_joinstyle, pickradius, drawstyle, markevery, **kwargs)
365 # update kwargs before updating data to give the caller a
366 # chance to init axes (and hence unit support)
--> 367 self.update(kwargs)
368 self.pickradius = pickradius
369 self.ind_offset = 0

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/artist.pyc in update(self, props)
854 func = getattr(self, 'set_' + k, None)
855 if func is None or not six.callable(func):
--> 856 raise AttributeError('Unknown property %s' % k)
857 func(v)
858 changed = True

AttributeError: Unknown property color_cycle

Dockerfile Out-of-Date

  1. ipython/scipyserver is deprecated (see description)
  2. The recommended replacement image, scipy-notebook is based on Python3, so we'd want to specify the older image, and figure out how to launch the notebook from a virtualenv from within the docker container
  3. It's debatable whether one might just want to upgrade all these examples to Python3 rather than sort through the docker situation described above.

Binder: Could not find a version that satisfies the requirement matplotlib==3.7.1

Here is the log when running https://mybinder.org/v2/gh/jvns/pandas-cookbook/master:

Step 37/50 : COPY --chown=1000:1000 src/requirements.txt ${REPO_DIR}/requirements.txt
 ---> Using cache
 ---> 0ac1ed6a7c4e
Step 38/50 : USER ${NB_USER}
 ---> Using cache
 ---> 8c57bb10fbc0
Step 39/50 : RUN ${KERNEL_PYTHON_PREFIX}/bin/pip install --no-cache-dir -r "requirements.txt"
 ---> Running in 9479e21d4ccf
ERROR: Ignored the following versions that require a different python version: 3.6.0 Requires-Python >=3.8; 3.6.0rc1 Requires-Python >=3.8; 3.6.0rc2 Requires-Python >=3.8; 3.6.1 Requires-Python >=3.8; 3.6.2 Requires-Python >=3.8; 3.6.3 Requires-Python >=3.8; 3.7.0 Requires-Python >=3.8; 3.7.0rc1 Requires-Python >=3.8; 3.7.1 Requires-Python >=3.8
ERROR: Could not find a version that satisfies the requirement matplotlib==3.7.1 (from versions: 0.86, 0.86.1, 0.86.2, 0.91.0, 0.91.1, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1rc1, 1.4.1, 1.4.2, 1.4.3, 1.5.0, 1.5.1, 1.5.2, 1.5.3, 2.0.0b1, 2.0.0b2, 2.0.0b3, 2.0.0b4, 2.0.0rc1, 2.0.0rc2, 2.0.0, 2.0.1, 2.0.2, 2.1.0rc1, 2.1.0, 2.1.1, 2.1.2, 2.2.0rc1, 2.2.0, 2.2.2, 2.2.3, 2.2.4, 2.2.5, 3.0.0rc2, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0rc1, 3.1.0rc2, 3.1.0, 3.1.1, 3.1.2, 3.1.3, 3.2.0rc1, 3.2.0rc3, 3.2.0, 3.2.1, 3.2.2, 3.3.0rc1, 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4.0rc1, 3.4.0rc2, 3.4.0rc3, 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.5.0b1, 3.5.0rc1, 3.5.0, 3.5.1, 3.5.2, 3.5.3)
ERROR: No matching distribution found for matplotlib==3.7.1
Removing intermediate container 9479e21d4ccf
The command '/bin/sh -c ${KERNEL_PYTHON_PREFIX}/bin/pip install --no-cache-dir -r "requirements.txt"' returned a non-zero code: 1

Why does this happen?

git clone partially fails on Windows because of file names with question marks

On Windows, filenames can't contain question marks, so the checkout of those particular files fails:

error: unable to create file cookbook/Chapter 3 - Which borough has the most noise complaints? (or, more selecting data).ipynb (Invalid argument)
error: unable to create file cookbook/Chapter 6 - String operations! Which month was the snowiest?.ipynb (Invalid argument)
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

I'm going to search for a workaround but if you know of one let me know

Explain stack/unstack

Explain how to get from

name   date  value
David  Jan3  10
David  Jan4  12
Julia  Jan3  8

to

name   Jan3  Jan4
David  10    12
Julia  8     NaN
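As a rough sketch of one way to do this reshaping, the long table can be turned into the wide one with either `pivot` or `set_index(...).unstack()`. The frame below is built by hand from the three rows in the request; the variable names are illustrative only.

```python
import pandas as pd

# The "long" table from the request, one row per (name, date) observation.
long_df = pd.DataFrame(
    {
        "name": ["David", "David", "Julia"],
        "date": ["Jan3", "Jan4", "Jan3"],
        "value": [10, 12, 8],
    }
)

# Option 1: pivot names the row labels, column labels and cell values directly.
wide = long_df.pivot(index="name", columns="date", values="value")

# Option 2: move both keys into a MultiIndex, then unstack the inner level
# ("date") so that its values become the columns.
wide_via_unstack = long_df.set_index(["name", "date"])["value"].unstack()

print(wide)
```

Julia has no Jan4 row, so that cell comes out as NaN in both versions, which is also why the values are converted to floats.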

Adding legends to GroupBy plots

Having a recipe for adding legends (and other options) to GroupBy plots would be useful, I think. Have a peek at this issue: it's something that is supposed to be added to a future version of pandas, but until then it's hard to figure out without some help.

pandas 0.13.0 yields warning on reading 311-service-requests.csv

Mentioning this just in case it's useful: the problem went away when I upgraded from pandas 0.13.0 to 0.13.1.

Chapter 2 cell 2 reads the CSV into 'complaints' but issues a warning:

H:\WinPython\WinPython-64bit-2.7.6.2\python-2.7.6.amd64\lib\site-packages\pandas\io\parsers.py:1050: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
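The warning suggests two remedies: pass `dtype` for the offending column, or set `low_memory=False`. A minimal sketch of the first option follows, using a small in-memory CSV rather than the real file. The column name `Incident Zip` and the sample rows are assumptions made for illustration; the warning only identifies the column by position (8).

```python
import io

import pandas as pd

# Hypothetical sample: a column whose values look numeric in some rows
# and like free text in others, which is what triggers mixed-type inference.
csv_text = (
    "Unique Key,Incident Zip\n"
    "1,10025\n"
    "2,00083\n"
    "3,11201-1234\n"
)

# Reading the column as strings keeps every value exactly as written,
# including leading zeros.
complaints = pd.read_csv(io.StringIO(csv_text), dtype={"Incident Zip": str})
zips = complaints["Incident Zip"].tolist()
print(zips)
```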

Update to Ch 5: Incorrect year for January fixed at source

In ch.5, for the download_weather_month function, the source url now provides the correct data even when you look for data for the month of January. So one no longer needs to increment the year by 1 for January. i.e. remove:

    if month == 1:
        year += 1

Quick Tour is only compatible with Python 2.7

I think it's about time to start either calling out the version you're expecting users to have or update the examples to work with Python 3.x

The very first example doesn't work with Python 3.x

print "Hi! This is a cell. Press the ▶ button above to run it"

should be

print("Hi! This is a cell. Press the ▶ button above to run it")

Update to Ch 5: read_csv

In the latest version (0.18) of pandas, this statement from ch.5 doesn't seem to work:
weather_mar2012 = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, encoding='latin1', header=True)
The following worked for me:
weather_mar2012 = pd.read_csv(url, index_col='Date/Time', parse_dates=True, encoding='latin1', header=14)
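To illustrate why `header=14` can replace `skiprows` plus `header=True`: an integer `header` is the zero-based row number of the column names, and the rows above it are ignored. The sketch below uses a tiny made-up file with two metadata lines instead of the real download, so the header sits on row 2; the station lines and temperatures are invented for the example.

```python
import io

import pandas as pd

# Hypothetical miniature of the weather file: metadata first, then the header row.
raw = (
    "Station Name,MONTREAL\n"
    "Province,QUEBEC\n"
    "Date/Time,Temp (C)\n"
    "2012-03-01 00:00,-5.5\n"
    "2012-03-01 01:00,-5.7\n"
)

# header=2 says "row 2 (zero-based) holds the column names"; rows 0-1 are skipped.
weather = pd.read_csv(
    io.StringIO(raw),
    header=2,
    index_col="Date/Time",
    parse_dates=True,
)
print(weather)
```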

The Chinese translation of this repo

When I started to learn pandas, I found the examples in this repository were interesting. Lately I have translated it into Chinese and hope more people can enjoy it ;)

Here is my translation.

Tab completion does not work as described

Hi there,

I am trying to use the tab completion to pop up the 'help' window, but this is not happening at all. I am using ipython 2.0 and python 2.7 and 3.4 (different computers) to no avail. Has this behaviour changed?

Russ

Chapter 7.3: pandas.DataFrame.sort()

To sort the DataFrame along the column 'Incident Zip' one cannot use .sort() method anymore, but must use .sort_values():

requests[is_far][['Incident Zip', 'Descriptor', 'City']].sort_values(by='Incident Zip')
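A self-contained sketch of the same call on a small hand-made frame, in case it helps to try it without the 311 data. The rows below are invented; only the column names and the `sort_values(by=...)` call come from the issue.

```python
import pandas as pd

# Hypothetical stand-in for the filtered 311 requests.
requests = pd.DataFrame(
    {
        "Incident Zip": ["11201", "00083", "10025"],
        "Descriptor": ["Loud Music/Party", "Medical Waste", "Banging/Pounding"],
        "City": ["BROOKLYN", "NEW YORK", "NEW YORK"],
    }
)

# sort_values returns a new, sorted frame; the original is left unchanged.
sorted_requests = requests[["Incident Zip", "Descriptor", "City"]].sort_values(
    by="Incident Zip"
)
print(sorted_requests)
```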

Feature: Run Examples Online

I'd like to modify this repository so that mybinder.org can read this repo, and allow users to run examples live without needing to install anything locally. The examples will run in a python2.7 environment with dependencies that satisfy the criteria in the existing README.md

Setting this up involved making some PRs and issues on the GitHub pages for jupyterhub/binderhub and jupyter/repo2docker, but I think that the final outcome will be fairly clean.

Chapter 1: broken dataframe file is really broken

In the cookbook Chapter 1 we start with command

broken_df = pd.read_csv('../data/bikes.csv')

That is supposed to produce a badly parsed dataframe, but when I run this command I get an exception instead:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 15: invalid continuation byte

The fixed version of the command seems to work fine.

I use:
python 3.5.1,
pandas 0.20.3
MacOS

Update:
After taking another look at the README, I see that the project is set up to work with Python 2.7. It would be nice to make it Python 3 compatible as well.
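The behaviour described above can be reproduced in isolation. Byte 0xe9 is "é" in Latin-1, and on Python 3 pandas decodes as UTF-8 unless told otherwise. The sketch below uses a made-up two-line file with an accented header, not the real bikes.csv, and passes `sep=";"` in both calls so that only the encoding differs.

```python
import io

import pandas as pd

# Hypothetical Latin-1 encoded sample; \xe9 is "é" and is not valid UTF-8 here.
raw = b"Date;Berri 1;Br\xe9beuf\n01/01/2012;35;0\n"

# Without an explicit encoding the bytes are decoded as UTF-8 and the read fails.
try:
    pd.read_csv(io.BytesIO(raw), sep=";")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Naming the encoding lets the same bytes parse cleanly.
fixed_df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="latin1")
print(decode_failed, list(fixed_df.columns))
```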

Sentence Error: Chapter 3 - between prompt 15 and 16

After In[15] there is a sentence that reads:

"This is a big array of Trues and Falses, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where."

If you read the second sentence above, you'll see that it seems to be missing its ending.

Error with ipython notebook

$ git clone https://github.com/jvns/pandas-cookbook.git
Cloning into 'pandas-cookbook'...
remote: Counting objects: 410, done.
remote: Total 410 (delta 0), reused 0 (delta 0), pack-reused 410
Receiving objects: 100% (410/410), 10.51 MiB | 14.00 KiB/s, done.
Resolving deltas: 100% (208/208), done.
Checking connectivity... done.

DEGNINOU@DYEHADJI ~ (master)
$ cd pandas-cookbook/cookbook

DEGNINOU@DYEHADJI ~/pandas-cookbook/cookbook (master)
$ ipython notebook
Traceback (most recent call last):
  File "c:\Anaconda3\Scripts\ipython-script.py", line 3, in <module>
    from IPython import start_ipython
  File "c:\Anaconda3\lib\site-packages\IPython\__init__.py", line 47, in <module>
    from .core.application import Application
  File "c:\Anaconda3\lib\site-packages\IPython\core\application.py", line 22, in <module>
    from traitlets.config.application import Application, catch_config_error
  File "c:\Anaconda3\lib\site-packages\traitlets\config\__init__.py", line 6, in <module>
    from .application import *
  File "c:\Anaconda3\lib\site-packages\traitlets\config\application.py", line 17, in <module>
    from decorator import decorator
ImportError: No module named 'decorator'

Update to Chapter 6

is_snowing.astype(float).resample('M', how=np.mean).plot(kind='bar')

FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...)..apply(<func>)
  if __name__ == '__main__':

The new way to write it is:
is_snowing.astype(float).resample('M').apply(np.mean).plot(kind='bar')
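A small runnable sketch of the same pattern on synthetic data, for anyone who wants to check it without the weather file. It resamples hourly flags to daily buckets (`'D'`) rather than monthly ones purely to keep the example short, and it calls `.mean()` directly, which gives the same result as `.apply(np.mean)`. The snowfall pattern is invented.

```python
import pandas as pd

# Two days of hourly timestamps.
index = pd.date_range("2012-01-01", periods=48, freq=pd.Timedelta(hours=1))

# Hypothetical flags: snow for the first 6 hours of day one and the first 12 of day two.
flags = [hour < 6 for hour in range(24)] + [hour < 12 for hour in range(24)]
is_snowing = pd.Series(flags, index=index)

# New-style syntax: resample first, then aggregate, instead of resample(..., how=...).
daily_fraction = is_snowing.astype(float).resample("D").mean()
print(daily_fraction)
```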

Thanks!

readme.md links

The README's link to the Chinese translation points back here. Where is the translation?

Cookbook 5 url

First, thank you for these tutorials - they've been very helpful!
I've had a problem with the scraping tutorial in that, while the file does download, it has problems with this line:
weather_mar2012 = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True, encoding='latin1')
If I skip 16 rows, I get "ValueError: No columns to parse from file"
If I skip 15, I see the data, but it leaves a blank line on top and doesn't seem to apply any of the rest: index_col, parse_dates, or encoding.
This could be a problem on my end - I'm using anaconda with python 3.4. But I wasn't able to find anything with initial searches so I thought I'd ask here.
Thanks for any guidance you can give.

Is there a known port for Python 3?

I really like your cookbook but I would love to have a version for Python 3.
Do you know whether there is any port for that? Maybe somebody you have already talked to?
