tomaugspurger / effective-pandas

Source code for my collection of articles on using pandas.

Home Page: https://leanpub.com/effective-pandas

License: Creative Commons Attribution 4.0 International

Jupyter Notebook 99.56% Makefile 0.07% CSS 0.01% Python 0.36%

effective-pandas's Issues

Error in analysis due to pandas bug

@TomAugspurger, I found pandas-dev/pandas#22509 by chasing down an error in part 2. You use the broken form of groupby().transform('rank') in "Chaining methods", to create this graph:

If you use dep_time.rank() instead, you get the correct result.
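For anyone wanting to try the workaround, here is a minimal sketch, assuming the goal is a rank within each group. The frame and its values are made up for illustration (only the dep_time column name comes from the notebook), and this does not reproduce the linked pandas bug itself.

```python
import pandas as pd

# Hypothetical stand-in for the flights data: two carriers, a few departure times.
df = pd.DataFrame({
    "carrier": ["AA", "AA", "AA", "DL", "DL"],
    "dep_time": [900, 700, 800, 1200, 1100],
})

# Rank each departure time within its carrier by calling rank() directly
# on the grouped column, rather than going through transform('rank').
df["dep_rank"] = df.groupby("carrier")["dep_time"].rank()
```

Comparing this column against the transform('rank') result on your installed pandas version is a quick way to see whether you are affected.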

Unable to open file on github or offline

GitHub just says there was a problem with the file; Jupyter provides the following error message:
Unreadable Notebook:
..\modern_2_method_chaining.ipynb NotJSONError('Notebook does not appear to be JSON: '\n\n\n\n\n<html lang="en...',)
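The error suggests the file on disk is an HTML page rather than notebook JSON. A quick way to check a local copy is sketched below; describe_notebook_file is a hypothetical helper written for this comment, not part of Jupyter.

```python
import json
from pathlib import Path


def describe_notebook_file(path):
    """Return 'notebook', 'html', or 'other' based on the file's contents."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        # An HTML page saved under an .ipynb name starts with a tag, possibly
        # after some blank lines, as in the error message above.
        head = text.lstrip().lower()
        return "html" if head.startswith(("<!doctype", "<html")) else "other"
    # Notebook files are JSON objects with a top-level "cells" list.
    return "notebook" if isinstance(doc, dict) and "cells" in doc else "other"
```

If this returns 'html', the file was probably saved from a web page instead of the raw notebook, and downloading the raw file again may help.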

Effective pandas part 1 - 'modern-1-url.txt'

To set up the same flight data in part 1, the notebook uses the following to populate the data variable; however, I can't find the text file ('modern-1-url.txt') needed to pull the data set. Where is this text file, or what is in it?

with open('modern-1-url.txt', encoding='utf-8') as f:
    data = f.read().strip()

chapter4_tidy_data variable versus rest plot gives some empty plots

These tutorials are superb, and the author has clearly put a lot of effort into them. The code may have run fine in 2016, but now (February 2019)
some of the code in chapter_4_tidy_data fails.

Issue 1

example:

g = sns.FacetGrid(tidy, col='team', col_wrap=6, hue='team', size=2)
g.map(sns.barplot, 'variable', 'rest');

Gives:
Imgur link

Issue 2

(rest.unstack()
     .query('away_team < 7')
     .rolling(7)
     .mean())

Gives all NaNs, and the plot fails.

and so on.
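One possible explanation for the all-NaN result in Issue 2, not confirmed against this notebook: by default a fixed-size rolling window only produces a value when every slot in the window holds a valid observation, so scattered missing values can blank out the whole result. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# min_periods defaults to the window size, so any NaN in the window yields NaN.
strict = s.rolling(2).mean()

# With min_periods=1, a single valid observation in the window is enough.
lenient = s.rolling(2, min_periods=1).mean()
```

Whether min_periods is the right fix depends on what the missing rest values mean for the analysis.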

modern pandas 3 - merge impossible

The merge (many-to-one) at the end of the third notebook results in an empty data frame, because the weather data is for 2014 and the flights data for 2017. Your results show flight data for 2014, so I imagine you may be using a different dataset.

This may also have to do with the source data being changed; I also noticed that the underscores are removed from the flight data set, e.g. fl_date has become flightdate and unique_carrier has become uniquecarrier.

p.s. Thanks for sharing your well-written code and insights into pandas, they are a very welcome and useful read!
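A sketch of how the empty merge described above could be diagnosed, assuming the renamed columns reported in this issue. The two tiny frames, the weather column names, and the key choice are hypothetical; only fl_date and flightdate come from the report.

```python
import pandas as pd

# Hypothetical miniature frames; the real data has many more columns.
flights = pd.DataFrame({
    "FlightDate": ["2017-01-01", "2017-01-02"],
    "Origin": ["ORD", "ORD"],
})
weather = pd.DataFrame({
    "date": ["2014-01-01", "2014-01-02"],
    "station": ["ORD", "ORD"],
    "tmpf": [20.0, 25.0],
})

# Restore the older snake_case name the notebook expects.
flights = flights.rename(columns=str.lower).rename(columns={"flightdate": "fl_date"})

# An outer merge with indicator=True labels each row as left_only,
# right_only, or both, which shows why an inner merge comes back empty.
check = flights.merge(
    weather,
    left_on=["fl_date", "origin"],
    right_on=["date", "station"],
    how="outer",
    indicator=True,
)
n_matched = int((check["_merge"] == "both").sum())
```

Here n_matched is zero because the years do not overlap, which matches the behaviour reported above.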

pandas_datareader

Not able to find enough resolution on this issue, so I decided to post (relatively new to GitHub, so apologies in advance if I'm doing something wrong).

Operating system: Windows 10
IDE: VS Code (1.37.0)
Python: 3.7.3
Distribution: Anaconda

I created a virtual environment using Conda
I installed pandas_datareader using: conda install -c anaconda pandas-datareader
I confirmed that the package is present in the virtual environment
pandas version: 0.25.0
pandas_datareader: 0.7.4
When I run import pandas_datareader in VS Code (running the command interactively, in a Jupyter notebook), I get the following error:

ModuleNotFoundError Traceback (most recent call last)
in
----> 1 import pandas_datareader

ModuleNotFoundError: No module named 'pandas_datareader'

I think another option is to install a previous version of pandas, but at this point I'm not entirely sure what to do. I wanted to check before I take any further steps.
Thanks
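A common cause of this kind of error is the editor or kernel running a different interpreter from the one the package was installed into, though that is only a guess for this setup. The sketch below reports which interpreter is active and whether it can locate the module; module_report is a hypothetical helper name.

```python
import importlib.util
import sys


def module_report(name):
    """Return the running interpreter's path and where (if anywhere) it finds `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


report = module_report("pandas_datareader")
```

If report["interpreter"] does not point inside the Conda environment where the package was installed, selecting that environment's interpreter in the editor is worth trying before downgrading pandas.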

Visualization notebook - pandas_datareader

Arguably would have been nicer to see
from pandas_datareader import fred
at the top with other imports, so I could see dependencies more easily.

Feel free to close with no change of course.

Intro notebook UnsortedIndexError on the date range (cell 18)

The last cell of notebook 1 is throwing an exception, along with the message: UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [4], lexsort depth 3'. I don't understand this, since cell 13 seems like it should be sorting the index.

python 3.7, pandas 0.23.4 and 0.24.1.

Incidentally, upgrading to 0.24.1 also broke cell 6, which crashes with KeyError: "['tail_num'] not in index" (which I don't understand at all).

All of that said, I've now learned about IndexSlice, which is truly awesome! Thanks.
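For reference, a minimal sketch of the sort-then-slice pattern on a made-up two-level index. It does not reproduce the notebook's five-level index, and the level names and values are illustrative only.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("AA", "2014-01-02"), ("DL", "2014-01-01"), ("AA", "2014-01-01")],
    names=["carrier", "date"],
)
df = pd.DataFrame({"delay": [5, 7, 3]}, index=idx)

# Range slicing on an inner level needs a lexsorted index, so sort every
# level first and keep the sorted result.
df_sorted = df.sort_index()

# Select all carriers for a date range on the second level.
subset = df_sorted.loc[pd.IndexSlice[:, "2014-01-01":"2014-01-01"], :]
```

If a later step (a concat, a set_index, or a reassignment that discards the sorted frame) rebuilds the index, the sort has to be repeated before slicing.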

Python 3.6 pandas_datareader not found

I am running Windows 10 64-bit, Python 3.6, Spyder. I have both pandas_datareader and pandas_datareader-0.5.0.dist-info installed in the same site-packages directory as pandas. When I attempt to import the module I receive the following error message:

Python 3.6.2 |Anaconda, Inc.| (default, Sep 19 2017, 08:03:39) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 6.1.0 -- An enhanced Interactive Python.

import pandas as pd

import pandas_datareader as pdr
Traceback (most recent call last):

File "", line 1, in
import pandas_datareader as pdr

ModuleNotFoundError: No module named 'pandas_datareader'

How do I import pandas_datareader via another method, or get Python to recognize the module?

File download doesn't work

When executing the cell with file download from the Transtats website, we do not download a zip file but an HTML page containing:

<head>
	<script type="text/javascript" src="js/dot_ostr_analytics.js"></script>
</head>
<body>

start time ==> 5:59:28 PM<br>complete time==> 5:59:28 PM

</body></html>

Can you specify which table we should download from the government website?
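A small guard like the one below could catch this case before the notebook tries to unzip the response. It is a sketch: is_zip_payload is a hypothetical helper, and the bytes passed in would be the body of whatever response the download cell receives.

```python
import io
import zipfile


def is_zip_payload(data: bytes) -> bool:
    """Return True if the downloaded bytes look like a ZIP archive."""
    # is_zipfile accepts a file-like object, so nothing is written to disk.
    return zipfile.is_zipfile(io.BytesIO(data))
```

Checking the payload first turns a confusing failure later in the notebook into an immediate, explicit one.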

Discrepancy when downloading the dataset in the notebook .zip / .csv

Hi,

I've just tried to use the first notebook in the series and it turns out the data is downloaded to a file called "flights.csv" which is then opened as "flights.csv.zip".

Suggestion: correct save filename to "flights.csv.zip". Might be especially useful for beginners...

Best regards,
Florian

Upload the data used

This material appears to have outlived the data sources used. For example https://www.transtats.bts.gov/DownLoad_Table.asp doesn't work, and the weather data isn't working for me either. Perhaps you (or anyone else) might upload copies of the data to this repo if you still have them?

error when loading data

When running cell 2 of the first notebook I get:

SSLError: HTTPSConnectionPool(host='www.transtats.bts.gov', port=443): Max retries exceeded with url: /DownLoad_Table.asp?Table_ID=236&Has_Group=3&Is_Zipped=0 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))

It looks like a problem with the connection to the API.
Would it be possible to include the data in the repo?
Thanks

Visualization: Feather seems not to play nicely under Windows

See conda-forge/feather-format-feedstock#1 for a hint on this. Installation is at best problematic - and I found it impossible.

I worked as follows:
Comment out all of the following:
import feather

%load_ext rpy2.ipython

%%R
suppressPackageStartupMessages(library(ggplot2))
library(feather)
write_feather(diamonds, 'diamonds.fthr')

And then replace
import feather
df = feather.read_dataframe('diamonds.fthr')
df.head()

with:
from ggplot import diamonds
# type(diamonds)  # dataframe...
df = diamonds # primitive!
df.head()

There is one much more mundane issue, which I'll raise separately.