Pandas Profiling

Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Correlations: highlighting of highly correlated variables, plus Spearman, Pearson and Kendall matrices
Missing values: matrix, count, heatmap and dendrogram of missing values
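
To make the list above concrete, the sketch below computes several of the listed statistics for a single numeric column using plain pandas. It is an illustration only and does not claim to reproduce how pandas_profiling computes them internally; the helper name `summarize_numeric` is hypothetical.

```python
import pandas as pd


def summarize_numeric(series: pd.Series) -> dict:
    """Compute a few per-column statistics for a numeric Series (illustrative only)."""
    clean = series.dropna()
    q1, median, q3 = clean.quantile([0.25, 0.5, 0.75])
    mean = clean.mean()
    std = clean.std()  # sample standard deviation (ddof=1)
    return {
        # Essentials
        "n_unique": clean.nunique(),
        "n_missing": int(series.isna().sum()),
        # Quantile statistics
        "min": clean.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": clean.max(),
        "range": clean.max() - clean.min(),
        "iqr": q3 - q1,
        # Descriptive statistics
        "mean": mean,
        "std": std,
        "mad": (clean - median).abs().median(),  # median absolute deviation
        "cv": std / mean if mean != 0 else float("nan"),
        "kurtosis": clean.kurt(),
        "skewness": clean.skew(),
    }


stats = summarize_numeric(pd.Series([1, 2, 3, 4, 5, None]))
```

The report presents these values per column alongside histograms and frequency tables, so a helper like this is only useful for spot-checking individual numbers.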

Examples

The following examples can give you an impression of what the package can do:

Census Income (US Adult Census data relating income)
NASA Meteorites (comprehensive set of meteorite landings)
Titanic (the "Wonderwall" of datasets)
NZA (open data from the Dutch Healthcare Authority)
Stata Auto (1978 Automobile data)
Website Inaccessibility (demonstrates the URL type)

Installation

Using pip

You can install using the pip package manager by running

pip install pandas-profiling

Alternatively, you could install directly from GitHub:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda

You can install using the conda package manager by running

conda install -c conda-forge pandas-profiling

From source

Download the source code by cloning the repository or by pressing 'Download ZIP' on this page. Install by navigating to the proper directory and running

python setup.py install

Usage

The profile report is written in HTML5 and CSS3, which means pandas-profiling requires a modern browser.

Documentation

The documentation for pandas_profiling can be found here. The documentation is generated using pdoc3. If you are contributing to this project, you can rebuild the documentation using:

make docs

or on Windows:

make.bat docs

Jupyter Notebook

We recommend generating reports interactively by using the Jupyter notebook.

Start by loading in your pandas DataFrame, e.g. by using

import numpy as np
import pandas as pd
import pandas_profiling

df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=['a', 'b', 'c', 'd', 'e']
)

To display the report in a Jupyter notebook, run:

df.profile_report(style={'full_width': True})

To retrieve the list of variables which are rejected due to high correlation:

profile = df.profile_report()
rejected_variables = profile.get_rejected_variables(threshold=0.9)
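
For intuition about what rejection due to high correlation means, the sketch below flags columns with plain pandas. It assumes the threshold is a cutoff on the absolute Pearson correlation and that the later column of each highly correlated pair is the one flagged; both are assumptions for illustration, not a description of the package's actual rule. The function name `highly_correlated_columns` is hypothetical.

```python
import pandas as pd


def highly_correlated_columns(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns whose absolute Pearson correlation with an earlier column exceeds threshold."""
    # Restrict to numeric columns so that corr() is well defined.
    corr = df.select_dtypes(include="number").corr(method="pearson").abs()
    columns = list(corr.columns)
    flagged = []
    for i, col in enumerate(columns):
        # Compare only against earlier columns, so the first column of each pair is kept.
        if any(corr.loc[col, other] > threshold for other in columns[:i]):
            flagged.append(col)
    return flagged


example = pd.DataFrame({
    "a": list(range(10)),
    "b": [2 * x + 1 for x in range(10)],  # exact linear function of "a"
    "c": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],  # only weakly related to "a"
})
flagged = highly_correlated_columns(example, threshold=0.9)
```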

If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile = df.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="output.html")

Command line usage

For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable. Run

pandas_profiling -h

for information about options and arguments.

Advanced usage

A set of options is available in order to adapt the report generated.

title (str): Title for the report ('Pandas Profiling Report' by default).
pool_size (int): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
minify_html (boolean): Whether to minify the output HTML.

More settings can be found in the default configuration file.

Example

profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file(output_file="output.html")
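
When several options are set at once, it can be convenient to collect them in a dictionary and unpack it into the call. The sketch below uses only the options named above (title, pool_size, minify_html and the histogram bins setting); the helper names `report_options` and `write_report` are hypothetical, and the import is deferred so that the options can be built and inspected without pandas_profiling installed.

```python
def report_options(title, bins, pool_size=0, minify_html=True):
    """Collect report settings in one dictionary (hypothetical helper)."""
    return {
        "title": title,
        "pool_size": pool_size,      # 0 means: use the number of CPUs available
        "minify_html": minify_html,
        "plot": {"histogram": {"bins": bins}},
    }


def write_report(df, output_file, **options):
    """Generate a report for df with the given options and save it as HTML."""
    import pandas_profiling  # noqa: F401  (importing adds df.profile_report)

    profile = df.profile_report(**options)
    profile.to_file(output_file=output_file)
    return profile


options = report_options("Pandas Profiling Report", bins=8)
# With a DataFrame df at hand:
# write_report(df, "output.html", **options)
```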

How to contribute

The package is actively maintained and developed as open-source software. If pandas-profiling was helpful or interesting to you, you might want to get involved. There are several ways of contributing and helping our thousands of users. If you would like to be an industry partner or sponsor, please drop us a line.

Read more on getting involved in the Contribution Guide.

Editor integration

PyCharm integration

Install pandas-profiling via the instructions above

Locate your pandas-profiling executable.

On macOS / Linux / BSD:

$ which pandas_profiling
(example) /usr/local/bin/pandas_profiling

On Windows:

$ where pandas_profiling
(example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe

In PyCharm, go to Settings (or Preferences on macOS) > Tools > External tools
Click the + icon to add a new external tool
Insert the following values
- Name: Pandas Profiling
- Program: The location of the pandas_profiling executable found above
- Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
- Working Directory: $ProjectFileDir$

To use the PyCharm Integration, right click on any dataset file: External Tools > Pandas Profiling.
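
The Arguments value above passes two paths to the executable: the selected dataset and a report file next to it. The sketch below mimics how that second path is derived, assuming that the $FileNameWithoutAllExtensions$ macro drops everything from the first dot of the file name onward; that reading of the macro, the POSIX-style paths and the function name `report_path` are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def report_path(dataset_path: str) -> str:
    """Mimic "$FileDir$/$FileNameWithoutAllExtensions$_report.html" for a POSIX-style path."""
    path = PurePosixPath(dataset_path)
    # Keep only the part of the file name before the first dot (assumed macro behavior).
    stem = path.name.split(".", 1)[0]
    return str(path.parent / f"{stem}_report.html")


single = report_path("/data/titanic.csv")
double = report_path("/data/archive.csv.gz")
```

Under these assumptions, the report for /data/titanic.csv is written to /data/titanic_report.html in the same directory.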

Other integrations

Other editor integrations may be contributed via pull requests.

Dependencies

You need Python 3 to run this package. Other dependencies can be found in the requirements files:

Filename	Requirements
requirements.txt	Package requirements
requirements-dev.txt	Requirements for development
requirements-test.txt	Requirements for testing
