Awesome Python Data Science

A curated list of Python libraries used for data science.

Contents

Machine Learning Frameworks

  • scikit-learn - Machine learning.
  • CatBoost - Gradient boosting library with categorical features support.
  • LightGBM - Fast, distributed, high performance gradient boosting.
  • Xgboost - Scalable, Portable and Distributed Gradient Boosting.
  • PyMC - Probabilistic Programming.
  • statsmodels - Statistical modeling and econometrics.
  • SymPy - A computer algebra system.
  • NetworkX - Creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
  • dask-ml - Distributed and parallel machine learning.
  • imbalanced-learn - Under-sampling and over-sampling for imbalanced datasets.
  • lightning - Large-scale linear models.
  • sklearn-crfsuite - API for CRFsuite, Conditional Random Fields for labeling sequential data.
  • vowpal_porpoise - Wrapper for vowpal_wabbit.
  • scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
  • BayesianOptimization - Global optimization with gaussian processes.
  • gplearn - Genetic Programming.
  • scikit-multilearn - Scikit-learn based module for multi-label classification.
  • mlens - ML-Ensemble high performance ensemble learning.
  • speedml - Speed start machine learning projects.
  • fastFM - Factorization Machines.
  • python-glmnet - glmnet package for fitting generalized linear models.
  • hmmlearn - Hidden Markov Models.
  • vecstack - Stacking (machine learning technique).
  • bayespy - Bayesian inference tools.
  • modAL - Modular Active Learning framework.
  • deap - Evolutionary computation framework.
  • pyro - Deep universal probabilistic programming with PyTorch.
  • civisml-extensions - scikit-learn-compatible estimators from Civis Analytics.
  • hyperopt-sklearn - Hyper-parameter optimization for sklearn.
  • zhusuan - A Library for Bayesian Deep Learning, Generative Models, Based on Tensorflow.
  • Kaggler - Code for Kaggle Data Science Competitions. Includes FTRL.
  • scikit-survival - Survival analysis built on top of scikit-learn.
  • dstoolbox - Tools that make working with scikit-learn and pandas easier.
  • dowhy - A unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
  • modin - Unify the way you interact with your data.
  • pyomo - Python Optimization Modeling Objects.
  • pymc-learn - Practical probabilistic machine learning.
  • BAMBI - BAyesian Model-Building Interface.

Scientific

  • NumPy - A fundamental package for scientific computing with Python.
  • SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering.
  • Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
  • Numba - NumPy aware dynamic Python compiler using LLVM.
  • blaze - NumPy and Pandas for databases.
  • astropy - Astronomy and astrophysics.
  • Biopython - Computational biology and bioinformatics.
  • PyDy - Multibody Dynamics.
  • DIPY - Diffusion MR Imaging.
  • bcolz - Columnar data container that can be compressed.
  • nilearn - NeuroImaging.
  • patsy - Describing statistical models using symbolic formulas.
  • numexpr - Fast numerical array expression evaluator.
  • dask - Parallel computing with task scheduling.
  • or-tools - Google's Operations Research tools. Classical CS algorithms.
  • cvxpy - Python-embedded modeling language for convex optimization problems.

Deep Learning Frameworks

  • Tensorflow - DL Framework.
  • PyTorch - DL Framework.
  • onnx - Open Neural Network Exchange.
  • Keras - High-level neural networks API.
  • tensorlayer - A Deep Learning and Reinforcement Learning Library for Researchers and Engineers.
  • chainer - A flexible framework of neural networks for deep learning.
  • mxnet - Apache MXNet: A flexible and efficient library for deep learning.

Deep Learning Tools

  • Edward - Probabilistic programming language in TensorFlow.
  • pomegranate - Probabilistic modelling.
  • skorch - Scikit-learn compatible neural network library that wraps PyTorch.
  • DLTK - Deep Learning Toolkit for Medical Image Analysis.
  • sonnet - TensorFlow-based neural network library.
  • rasa_core - Dialogue engine.
  • luminoth - Computer Vision.
  • allennlp - NLP Research library.
  • spotlight - Pytorch Recommender framework.
  • tensorforce - TensorFlow library for applied reinforcement learning.
  • tensorboard-pytorch - Tensorboard for pytorch.
  • keras-vis - Neural network visualization toolkit for keras.
  • hyperas - Keras + Hyperopt.
  • spaCy - Natural Language processing.
  • tensorboard_logger - Log TensorBoard events without touching TensorFlow.
  • keras-contrib - Keras community contributions.
  • tfdeploy - Deploy tensorflow graphs.
  • ktext - Utilities for preprocessing text for deep learning with Keras.
  • foolbox - Python toolbox to create adversarial examples that fool neural networks.
  • pytorch/vision - Datasets, Transforms and Models specific to Computer Vision.
  • gluon-nlp - NLP made easy.
  • PyTorch-GAN - PyTorch implementations of Generative Adversarial Networks.
  • pytorch/ignite - High-level library to help with training neural networks in PyTorch.
  • NMT - Neural machine translation and neural sequence modeling.
  • Netron - Visualizer for deep learning and machine learning models.
  • gpytorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch.
  • tensorly - Tensor Learning in Python.
  • einops - Deep learning operations reinvented.
  • hiddenlayer - Neural network graphs and training metrics for PyTorch, Tensorflow, and Keras.
  • dgl - Python package built to ease deep learning on graphs, on top of existing DL frameworks.

Deep Learning Projects

Visualization

  • matplotlib - 2D plotting.
  • seaborn - Visualization library.
  • bokeh - Interactive web plotting.
  • plotly - Collaborative web plotting.
  • dash - Interactive Web plotting.
  • altair - Declarative statistical visualization.
  • folium - Leaflet.js Maps.
  • geoplot - High-level geospatial data visualization.
  • datashader - Graphics pipeline system.
  • mplleaflet - Converts Matplotlib plots into interactive Leaflet web maps.
  • matplotlib-venn - Area-weighted venn-diagrams.
  • pyLDAvis - Interactive topic model visualization.
  • cufflinks - Productivity Tools for Plotly + Pandas.
  • scatterText - Visualizations of how language differs among document types.
  • plotnine - ggplot for python.
  • ggpy - ggplot for python.
  • mizani - scales package.
  • bqplot - Plotting library for IPython/Jupyter Notebooks.
  • PtitPrince - Raincloud plots.
  • joypy - Ridgeline plots.
  • dtreeviz - Decision tree visualization and model interpretation.
  • ipyvolume - 3d plotting for Python in the Jupyter notebook based on IPython widgets using WebGL.

AutoML

  • Nevergrad - Gradient-free optimization.
  • featuretools - Automated feature engineering.
  • auto-sklearn - Automated machine learning.
  • tpot - Automated machine learning.
  • auto_ml - Automated machine learning.
  • MLBox - Automated Machine Learning python library.
  • devol - Automated deep neural network design via genetic programming.
  • skll - SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
  • autokeras - Automated machine learning in Keras.
  • SMAC3 - Sequential Model-based Algorithm Configuration.

Exploration

  • mlxtend - A library of extension and helper modules for Python's data analysis and machine learning libraries.
  • yellowbrick - Visual analysis and diagnostic tools.
  • pandas-profiling - Profiling reports for pandas DataFrame objects.
  • Skater - Model Agnostic Interpretation.
  • Dora - Exploratory data analysis.
  • sklearn-evaluation - scikit-learn model evaluation.
  • fitter - Simple class to identify the distribution from which a data sample was generated.
  • missingno - Missing data visualization.
  • hypertools - Gaining geometric insights into high-dimensional data.
  • scikit-plot - Plotting functionality to scikit-learn objects.
  • elih - Explain Machine Learning.
  • kmeans_smote - Oversampling for imbalanced learning based on k-means and SMOTE.
  • pyUpSet - UpSet suite of visualisation methods.
  • lime - Explaining the predictions of any machine learning classifier.
  • pandas-summary - An extension to the pandas DataFrame describe function.
  • SauceCat/PDPbox - Partial dependence plot toolbox.
  • shap - A unified approach to explain the output of any machine learning model.
  • eli5 - Debug machine learning classifiers and explain their predictions.
  • rfpimp - Permutation and drop-column importance for scikit-learn random forests.
  • pypeln - Concurrent data pipelines made easy.
  • pycm - Multi-class confusion matrix library in Python.
  • great_expectations - Always know what to expect from your data.
  • innvestigate - A toolbox to iNNvestigate neural networks' predictions.
  • alibi - Algorithms for monitoring and explaining machine learning models.

Feature Extraction

General Feature Extraction

  • sklearn-pandas - Pandas integration with sklearn.
  • pdpipe - Easy pipelines for pandas DataFrames.
  • engarde - Defensive data analysis.
  • datacleaner - Tool that automatically cleans data sets and readies them for analysis.
  • categorical-encoding - sklearn compatible categorical variable encoders.
  • fancyimpute - Multivariate imputation and matrix completion algorithms.
  • raccoon - DataFrame with fast inserts and appends.
  • kmodes - k-modes and k-prototypes clustering algorithm.
  • annoy - Approximate Nearest Neighbors.
  • scikit-feature - Filter methods for feature selection.
  • mifs - Parallelized Mutual Information based Feature Selection module.
  • skggm - Scikit-learn compatible estimation of general graphical models.
  • dirty_cat - Encoding methods for dirty categorical variables.
  • Impyute - Data imputations library to preprocess datasets with missing data.
  • eif - Extended Isolation Forest for Anomaly Detection.
  • featexp - Feature exploration for supervised learning.
  • feature_engine - Feature engineering package with sklearn like functionality.
  • stumpy - STUMPY is a powerful and scalable Python library that can be used for a variety of time series data mining tasks.

Time Series

  • Causality - Causal analysis.
  • traces - Unevenly-spaced time series analysis.
  • PyFlux - Time series library for Python.
  • prophet - Tool for producing high quality forecasts.
  • tsfresh - Automatic extraction of relevant features from time series.
  • tslearn - Machine learning toolkit dedicated to time-series data.
  • - A Python package for time series transformation and classification.

Audio

  • python_speech_features - Speech features.
  • speechpy - A Library for Speech Processing and Recognition.
  • magenta - Music and Art Generation with Machine Intelligence.
  • librosa - Audio and music analysis.
  • pydub - Manipulate audio with a simple and easy high level interface.
  • pytorch/audio - Simple audio I/O for PyTorch.

Images and Video

  • pillow - PIL fork.
  • scikit-image - Image processing.
  • hmap - Image histogram remapping.
  • pyocr - A wrapper for Tesseract and Cuneiform (Optical Character Recognition).
  • scikit-video - Video processing.
  • moviepy - Video editing.
  • OpenCV - Open Source Computer Vision Library.
  • SimpleCV - Wrapper around OpenCV.
  • label-maker - Data Preparation for Satellite Machine Learning.
  • face_recognition - Facial recognition.
  • imgaug - Image augmentation.
  • pyvips - Fast image processing.
  • aeneas - Set of tools to automagically synchronize audio and text.
  • ImageHash - Image hashing.
  • Augmentor - Image augmentation library.
  • PyAV - Bindings for FFmpeg.
  • imutils - Convenience functions for basic image processing operations.
  • albumentations - Fast image augmentation library.

Geolocation

Web Content

  • sum - Automatic summarization of text documents and HTML.
  • textract - Extract text from any document.
  • newspaper - News extraction, article extraction and content curation.

Text/NLP

  • BlingFire - A lightning fast Finite State machine and REgular expression manipulation library.
  • BERT-pytorch - Google AI 2018 BERT pytorch implementation.
  • pytorch-pretrained-BERT - PyTorch version of Google AI's BERT model with script to load Google's pre-trained models.
  • gensim - Topic Modeling.
  • pattern - Web mining module.
  • probablepeople - Parsing unstructured western names into name components.
  • Expynent - Regular expression patterns.
  • mimesis - Generate synthetic data.
  • pyenchant - Spell checking.
  • parserator - Domain-specific probabilistic parsers.
  • scrubadub - Clean personally identifiable information from dirty dirty text.
  • usaddress - Parsing unstructured address strings into address components.
  • python-phonenumbers - Python port of Google's libphonenumber.
  • jellyfish - Approximate and phonetic matching of strings.
  • preprocessing - Simple interface for the CMU Pronouncing Dictionary.
  • langid - Stand-alone language identification system.
  • fuzzywuzzy - Fuzzy String Matching.
  • Fuzzy - Soundex, NYSIIS, Double Metaphone.
  • snowball - Snowball compiler and stemming algorithms.
  • leven - Levenshtein edit distance.
  • flashtext - Extract keywords from sentences or replace keywords in sentences.
  • polyglot - Multilingual text NLP processing toolkit.
  • sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.
  • pyfasttext - Binding for fastText.
  • python-wordsegment - English word segmentation.
  • pyahocorasick - Exact or approximate multi-pattern string search.
  • Wordbatch - Parallel text feature extraction for machine learning.
  • langdetect - Port of Google's language-detection library.
  • translation - Uses web services for text translation.
  • nltk - Natural Language Toolkit.
  • unidecode - ASCII transliterations of Unicode text.
  • pytorch/text - Data loaders and abstractions for text and NLP.
  • textdistance - Compute distance between sequences.
  • sent2vec - General purpose unsupervised sentence representations.
  • pyhunspell - Python bindings for the Hunspell spellchecker engine.
  • facebook/fastText - Library for fast text representation and classification.
  • textblob - Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
  • facebook/InferSent - Sentence embeddings (InferSent) and training code for NLI.
  • nmslib - Non-Metric Space Library.
  • ftfy - Fixes mojibake and other glitches in Unicode text, after the fact.
  • fletcher - Pandas ExtensionDType/Array backed by Apache Arrow.
  • textacy - NLP, before and after spaCy.
  • hmtl - Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP.
  • pytext - A natural language modeling framework based on PyTorch.
  • flair - A very simple framework for state-of-the-art Natural Language Processing.
  • LASER - Language-Agnostic SEntence Representations.
  • transformer-xl - Attentive Language Models Beyond a Fixed-Length Context.

Graphs

  • louvain - Louvain Community Detection.

Time

Ranking/Recommender

  • Surprise - Analyzing recommender systems.
  • trueskill - TrueSkill rating system.
  • LightFM - Hybrid recommendation algorithm.
  • implicit - Collaborative Filtering for Implicit Datasets.

Trading

  • Clairvoyant - Identify and monitor social/historical cues.
  • zipline - Algorithmic Trading Library.
  • qstrader - Advanced Trading Infrastructure.

Misc

  • sklearn-porter - Transpile trained scikit-learn estimators.
  • sklearn-compiledtrees - Compiled Decision Trees for scikit-learn.
  • Metrics - Machine learning evaluation metrics.
  • bonobo - Extract Transform Load.
  • pyemd - Earth Mover's Distance metric.
  • fastai - The fast.ai deep learning library, lessons, and tutorials.
  • mmh3 - MurmurHash3, a set of fast and robust hash functions.
  • fbpca - Fast Randomized PCA/SVD.
  • mlcrate - Handy tools and functions.
  • pipeline - Standard Runtime For Every Real-Time Machine Learning.
  • tabulate - Pretty-print tabular data in Python, a library and a command-line utility.
  • crayon - A language-agnostic interface to TensorBoard.
  • faiss - A library for efficient similarity search and clustering of dense vectors.
  • neurtu - A Python package for parametric benchmarks.

Deployment

  • palladium - Framework for setting up predictive analytics services.
  • lore - Lore makes machine learning approachable for Software Engineers and maintainable for Machine Learning Researchers.
  • kubeflow - Machine Learning Toolkit for Kubernetes.
  • great_expectations - Framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests.
  • mara/data-integration - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow.
  • airflow - ETL.
  • mlflow - Open source platform for the complete machine learning lifecycle.

Python Tools

  • pip-tools - Keeps dependencies up to date.
  • devpi - PyPI server and packaging/testing/release tool.
  • Jupyter Notebook - Notebooks are awesome.
  • click - CLI package.
  • sacredboard - Dashboard for sacred.
  • sacred - Reproduce computational experiments.
  • python-flamegraph - Statistical profiler which outputs in format suitable for FlameGraph.
  • magic-wormhole - Get things from one computer to another, safely.
  • memory_profiler - Monitor memory usage of a Python program.
  • line_profiler - Line-by-line profiling.
  • parse - Parse strings using a specification based on the Python format() syntax.

Data Gathering

  • gain - Web crawling framework based on asyncio.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • camelot - Camelot: PDF Table Extraction for Humans.

Contributors

  • thomasjpfan
