hi-primus / optimus

1.4K stars · 38 watchers · 234 forks · 112.57 MB

:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Home Page: https://hi-optimus.com

License: Apache License 2.0

Python 75.36% Shell 0.01% HTML 1.37% Jupyter Notebook 23.10% CSS 0.05% JavaScript 0.05% Dockerfile 0.05%
spark pyspark data-wrangling bigdata big-data-cleaning data-science data-cleansing data-cleaner data-transformation machine-learning

optimus's People

Contributors

argenisleon, arpit1997, atwoodjw, aviolante, codacy-badger, cool-pot, deepakjangid123, dependabot[bot], eschizoid, faviovazquez, jameslamb, jarrioja, joseangelhernao, lakhotiaharshit, lironco11, luis11011, luisboitas, mrpowers, niteshnicholas, pyup-bot, sergey48k, timgates42

optimus's Issues

Data Enrichment

It would be helpful for a user to have a function to enrich data using a REST API, for example connecting to the Google Maps API to geocode an address, or to the FullContact API to add extra info based on the user's email.

The function must let the user configure the request rate limit and the URL.

We must also explore how to pass additional parameters, like API keys or anything else the API needs.
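
As a sketch of what such a helper could look like (the function name, parameters, and the pandas-style DataFrame assumption below are mine, not an existing Optimus API):

import time
import requests

def enrich_with_api(df, url, extra_params=None, requests_per_second=1.0):
    # Hypothetical helper: call a REST API once per row with a crude
    # client-side rate limit, merging extra parameters (e.g. API keys)
    # into every request.
    delay = 1.0 / requests_per_second
    results = []
    for row in df.to_dict(orient="records"):  # assumes a pandas DataFrame
        response = requests.get(url, params={**(extra_params or {}), **row})
        results.append(response.json())
        time.sleep(delay)
    return results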

Timestamp error

Hi,
I get this error on the latest version while doing:

analyzer = op.DataFrameAnalyzer(df=df)
analyzer.column_analyze("*", plots=False, values_bar=False)

Error:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-282aba391ae8> in <module>()
      1 analyzer = op.DataFrameAnalyzer(df=df)
----> 2 analyzer.column_analyze("*", plots=False, values_bar=False)

~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in column_analyze(self, column_list, plots, values_bar, print_type, num_bars, print_all)
    596                 values_bar=values_bar,
    597                 num_bars=num_bars,
--> 598                 types_dict=types)
    599 
    600             # Save the invalid col if exists

~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in _analyze(self, df_col_analyzer, column, row_number, plots, print_type, values_bar, num_bars, types_dict)
    388         summary_dict = self._create_dict(
    389             ["name", "type", "total", "valid_values", "missing_values"],
--> 390             [column, types_dict[type_col], row_number, valid_values, missing_values
    391              ])
    392 

KeyError: 'timestamp'
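
The traceback suggests types_dict has no entry for the timestamp type. Until that is fixed, a possible workaround (a sketch, assuming a PySpark DataFrame) is to cast timestamp columns to strings before analyzing:

from pyspark.sql.functions import col

# Cast timestamp columns to strings so the analyzer can classify them.
for name, dtype in df.dtypes:
    if dtype == "timestamp":
        df = df.withColumn(name, col(name).cast("string"))

analyzer = op.DataFrameAnalyzer(df=df)
analyzer.column_analyze("*", plots=False, values_bar=False)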

DataFrameTransformer.explode_table example is outdated

Following the explode_table example in the docs returns a "missing 1 required positional argument: 'list_to_assign'" error.

# Instantiation of the DataFrameTransformer class:
transformer = op.DataFrameTransformer(df)

# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show()

# Transformation:
transformer.explode_table('bill id', 'foods', 'Beer')

# Printing new dataFrame:
print('New dataFrame:')
transformer.show()
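
The error message suggests the method now takes an extra list_to_assign argument, so a call along these lines might match the current signature (the value is a placeholder; check the method's docstring for the real meaning of each argument):

# Hypothetical fix: pass the now-required list_to_assign argument.
transformer.explode_table('bill id', 'foods', 'Beer', list_to_assign=['Beer'])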

Improve usability

We should make the framework easy to use for people coming from pandas, Spark, R, etc.

I'm not saying let's copy the names, but I think some of them are too long or maybe confusing.

Refactor tests

We want to add DataFrame comparisons to the test suite to make it more robust, and also to make the suite handle the SparkSession efficiently. Proposed improvements are here: #150
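
As a sketch, a DataFrame comparison helper for the test suite could look roughly like this (assert_df_equality is a hypothetical name, not an existing Optimus utility):

def assert_df_equality(expected_df, actual_df):
    # Compare schemas first, then the collected rows (sorted, so row
    # order does not matter). Only suitable for small test DataFrames.
    assert expected_df.schema == actual_df.schema, "schemas differ"
    assert sorted(expected_df.collect()) == sorted(actual_df.collect()), "rows differ"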

Code not complete

In the documentation we have the following code:

# Import optimus
import optimus as op
#Import os module for system tools
import os

# Reading dataframe. os.getcwd() returns the current directory of the notebook.
# 'file:///' is a prefix that specifies the type of file system used; in this
# case, the local file system (the PC's hard drive) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"

df = tools.read_csv(path=filePath,
                    sep=',')

# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()

There is a tools = op.Utilities() line missing.
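
For reference, the snippet runs once the missing line is added before the read call:

# Instance of the Utilities class (the line missing from the docs)
tools = op.Utilities()

df = tools.read_csv(path=filePath, sep=',')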

Setup Airbrake for your Python application

Installation

Using pip

pip install -U airbrake

Setup

The easiest way to get set up is with a few environment variables (you can find your project ID and API key in your project's settings):

export AIRBRAKE_API_KEY=<Your project API KEY>
export AIRBRAKE_PROJECT_ID=<Your project ID>
export AIRBRAKE_ENVIRONMENT=production

and you're done!

Otherwise, you can instantiate your AirbrakeHandler by passing these values as arguments to the getLogger() helper:

import airbrake


logger = airbrake.getLogger(api_key="<Your project API KEY>", project_id=<Your project ID>)


try:
    1/0
except Exception:
    logger.exception("Bad math.")

For more information please visit our official GitHub repo.

Sampling and processing the whole data set

@FavioVazquez
I think there should be an easy way to take a data sample, run all the operations on it, and then apply them to the whole dataset. That way, if we have a 5TB data set we do not need to process all of the data, which could be time-consuming.

For example:

Sample Dataset

transformer.sample().trim_col("*") \
           .remove_special_chars("*") \
           .clear_accents("*") \
           .lower_case("*")

Whole dataset

transformer.trim_col("*") \
           .remove_special_chars("*") \
           .clear_accents("*") \
           .lower_case("*")

This approach should be the easiest to implement, but there are some problems: the user has to copy-paste the whole chain and apply it to the whole dataset. Another approach could be something like this (pseudocode, not valid Python; a runnable sketch follows it):

operations = [trim_col("*"), remove_special_chars("*"), clear_accents("*"), lower_case("*")]
# Transformation on the sample dataset
transformer.sample().apply(operations)
# Transformation on the whole dataset
transformer.apply(operations)
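
One way to express this in real Python is to build the pipeline as a list of callables (a sketch; it assumes each transformer method returns the transformer, as in the chaining above, and sample() is the method proposed in this issue):

# Represent each operation as a callable so the same pipeline can be
# applied first to a sample and then to the whole dataset.
operations = [
    lambda t: t.trim_col("*"),
    lambda t: t.remove_special_chars("*"),
    lambda t: t.clear_accents("*"),
    lambda t: t.lower_case("*"),
]

def apply_operations(transformer, operations):
    for operation in operations:
        transformer = operation(transformer)
    return transformer

# Transformation on the sample dataset, then on the whole dataset
apply_operations(transformer.sample(), operations)
apply_operations(transformer, operations)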

The problem with a random sample

We talked about the possibility of using Spark's random sample function. Although this could be the first approach, I think there are two problems we have to tackle.

Empty data

Model trainers do not accept empty data. If we cannot detect empty data in the sample and do not remove it from the whole data set, the user could run into problems at training time.

Outliers

If we cannot detect an outlier in the sample and do not remove it from the whole data set, it could produce a flawed model.

We must be sure that our sample data set meets these requirements.

What can we do

I think we must find the fastest way to detect empty data and outliers in the sample function, and make sure the user gets the most accurate possible representation of the whole data.

Simplify plot_hist

Right now, to plot a histogram the user must do:

priceDf = analyzer.get_data_frame.select("price")  # or df.select("price")
hist_dictPri = analyzer.get_numerical_hist(df_one_col=priceDf, num_bars=10)
analyzer.plot_hist(df_one_col=priceDf, hist_dict=hist_dictPri, type_hist='categorical')

I think we should simplify the function to:

analyzer.plot_hist(df, column, bins, type_hist='categorical')
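
A sketch of how the simplified signature could wrap the existing three-step flow (assuming get_numerical_hist and plot_hist keep their current parameters):

def plot_hist_simple(analyzer, column, bins=10, type_hist='categorical'):
    # Hypothetical wrapper collapsing the select / get_numerical_hist /
    # plot_hist sequence shown above into a single call.
    df_one_col = analyzer.get_data_frame.select(column)
    hist_dict = analyzer.get_numerical_hist(df_one_col=df_one_col, num_bars=bins)
    analyzer.plot_hist(df_one_col=df_one_col, hist_dict=hist_dict, type_hist=type_hist)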

Using PEP-8 style guide for naming conventions

I was going through the code and found that various methods and variables are named using camel case. Also, a few methods start with a double underscore ("__"); we should use just one underscore ("_") to differentiate internal methods from public methods. I think it would be a good idea to start by using PEP-8 naming conventions for modules, functions, methods, and variables. Please let me know your thoughts on this.
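
A quick illustration with hypothetical names:

# Before (camel case, name-mangling double underscore):
#     def readDatasetCsv(self, filePath): ...
#     def __analyzeColumn(self, colName): ...
#
# After (PEP-8, single leading underscore for internal methods):
#     def read_dataset_csv(self, file_path): ...
#     def _analyze_column(self, col_name): ...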

Explore options for a different DataFrameTransformer interface

I'm not sure how much we'll want to explore this option. Just want to introduce a design pattern that works well with the Scala API of Spark.

The Spark Scala API has a nifty transform method that lets users chain user-defined transformations and methods defined in the Dataset class. See this blog post for more information.

I like the DataFrameTransformer class, but it doesn't let users easily access the native PySpark DataFrame methods.

We might want to take these methods out of the DataFrameTransformer class, so the user can mix and match the Optimus API and the PySpark API.

source_df\
    .transform(lambda df: lower_case(df, "*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(lambda df: trim_col(df, "address"))

The transform method is defined in quinn. I'd love to make an interface like this, but I'm not sure how to implement it in Python.

source_df\
    .transform(lower_case("*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(trim_col("address"))
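
One way to get that second interface in Python is to make each operation a function that returns a function of a DataFrame (a sketch; lower_case and trim_col are written as standalone helpers, and transform is quinn's method or the DataFrame.transform built into newer PySpark versions):

from pyspark.sql import functions as F

def lower_case(columns):
    # Return a function that lower-cases the given string columns
    # ("*" means all columns).
    def inner(df):
        cols = df.columns if columns == "*" else [columns]
        for c in cols:
            df = df.withColumn(c, F.lower(F.col(c)))
        return df
    return inner

def trim_col(columns):
    # Return a function that trims whitespace in the given columns.
    def inner(df):
        cols = df.columns if columns == "*" else [columns]
        for c in cols:
            df = df.withColumn(c, F.trim(F.col(c)))
        return df
    return inner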

Let me know what you think!

Moving the read and write functions

I think it should not be necessary to instantiate the Utilities class to use the read and write operations.

For example:

# Import optimus
import optimus as op
# Import module for system tools 
import os

# Instance of Utilities class
tools = op.Utilities()
# Reading dataframe. os.getcwd() returns the current directory of the notebook.
# 'file:///' is a prefix that specifies the type of file system used; in this
# case, the local file system (the PC's hard drive) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"

df = tools.read_dataset_csv(path=filePath,
                            delimiter_mark=',')

I think the way pandas handles it is easier and more elegant.

import pandas as pd

# Load the dataset
df = pd.read_csv('mock_bank_data_original.csv')
df.to_csv('mock_bank_data_original_PART1.csv')
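
A sketch of a module-level wrapper that would remove the extra step (op.read_csv is the proposed API, not an existing one):

# Hypothetical module-level function in optimus/__init__.py:
def read_csv(path, sep=','):
    # Read a CSV without requiring the user to instantiate Utilities.
    tools = Utilities()
    return tools.read_dataset_csv(path=path, delimiter_mark=sep)

# Usage would then mirror pandas:
#     import optimus as op
#     df = op.read_csv('file:///' + os.getcwd() + '/foo.csv')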
