hi-primus / optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Home Page: https://hi-optimus.com
License: Apache License 2.0
@FavioVazquez
If we want to abstract the use of the Spark DataFrame structure, I think we should consider using something like:
transformer.show()
instead of
transformer.get_data_frame().show()
It's shorter, easier to use, and the user doesn't need to deal with the df directly.
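A minimal sketch of the idea, assuming DataFrameTransformer keeps the PySpark DataFrame in an internal attribute (the name _df is illustrative):
class DataFrameTransformer:
    def __init__(self, df):
        self._df = df  # the wrapped PySpark DataFrame

    def show(self, n=20, truncate=True):
        # Delegate to the underlying DataFrame so the user never touches it
        self._df.show(n, truncate)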
Fix RTD.
It seems that the prefix needed on Windows is file:/// instead of file:// in self.__sc.setCheckpointDir(dirName="file://" + folderPath).
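A possible portable fix, sketched as a hypothetical helper: POSIX absolute paths already start with "/", so "file://" + path yields the three slashes, while Windows paths ("C:/...") need the extra one added.
def checkpoint_uri(folder_path):
    # Normalize separators, then pick the prefix by path style
    folder_path = folder_path.replace("\\", "/")
    prefix = "file://" if folder_path.startswith("/") else "file:///"
    return prefix + folder_path

# Usage: self.__sc.setCheckpointDir(dirName=checkpoint_uri(folderPath))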
It would be helpful for a user to have a function to enrich data using a REST API. For example, connect to the Google Maps API to geocode an address, or the FullContact API to add additional info using the user's email.
The function must let the user configure the request rate limit and the URL.
We must explore how to add additional params like API keys or any other params the API needs.
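A minimal sketch using the requests library; the function name and its parameters are illustrative, not part of the current API:
import time
import requests

def enrich_with_api(values, url, requests_per_second=1.0, **params):
    # Call the endpoint once per value, throttled to the given rate;
    # extra keyword args (e.g. an API key) are forwarded as query params.
    results = []
    for value in values:
        response = requests.get(url, params=dict(params, q=value))
        results.append(response.json())
        time.sleep(1.0 / requests_per_second)
    return results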
We could create a little helper function to let the user easily impute missing values with the minimum value of a column.
Do you think it should be in the same example, or should we create a new one, @argenisleon?
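A minimal sketch of such a helper in plain PySpark (impute_to_min is a hypothetical name):
from pyspark.sql import functions as F

def impute_to_min(df, column):
    # Replace nulls in `column` with that column's minimum value
    min_value = df.agg(F.min(column)).first()[0]
    return df.fillna({column: min_value})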
It would be amazing to create an Optimus cheatsheet. I created an account on www.cheatography.com. Please @FavioVazquez let me know your Cheatography credentials via Slack so I can add you as a collaborator.
The threshold is currently hard-coded in the OutlierDetector function. The user should be able to pass this param to test different thresholds.
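A minimal sketch of the change, shown with the common IQR rule; the real OutlierDetector logic may differ, the point is only that threshold becomes a parameter:
from pyspark.sql import functions as F

def detect_outliers(df, column, threshold=1.5):
    # threshold is user-supplied instead of a hard-coded constant
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], 0.0)
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return df.filter((F.col(column) < lower) | (F.col(column) > upper))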
Hi,
I get this error on the latest version while doing:
analyzer = op.DataFrameAnalyzer(df=df)
analyzer.column_analyze("*", plots=False, values_bar=False)
Error:
KeyError Traceback (most recent call last)
<ipython-input-5-282aba391ae8> in <module>()
1 analyzer = op.DataFrameAnalyzer(df=df)
----> 2 analyzer.column_analyze("*", plots=False, values_bar=False)
~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in column_analyze(self, column_list, plots, values_bar, print_type, num_bars, print_all)
596 values_bar=values_bar,
597 num_bars=num_bars,
--> 598 types_dict=types)
599
600 # Save the invalid col if exists
~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in _analyze(self, df_col_analyzer, column, row_number, plots, print_type, values_bar, num_bars, types_dict)
388 summary_dict = self._create_dict(
389 ["name", "type", "total", "valid_values", "missing_values"],
--> 390 [column, types_dict[type_col], row_number, valid_values, missing_values
391 ])
392
KeyError: 'timestamp'
Use airbrake for testing
Following the explode_table example in the docs returns a "missing 1 required positional argument: 'list_to_assign'" error:
# Instantiation of DataFrameTransformer class:
transformer = op.DataFrameTransformer(df)
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show()
# Transformation:
transformer.explode_table('bill id', 'foods', 'Beer')
# Printing new dataFrame:
print('New dataFrame:')
transformer.show()
We should make the framework easy to use for people coming from pandas, Spark, R, etc.
I'm not saying let's copy the names, but I think some of them are too long or maybe confusing.
I want to add DataFrame comparisons to the test suite to make it more robust, and also make the test suite handle the SparkSession efficiently. Proposed improvements are here: #150
Add a description for every method and more examples.
It would be amazing if we created a Docker image with everything ready to run Optimus: Spark, Optimus, Jupyter Notebook, etc. This seems like a good Dockerfile to start from: https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook
We could create a little helper function to let the user easily impute missing values with 0.
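As above, a short PySpark sketch of what this helper would wrap (the function name is illustrative):
def impute_to_zero(df, columns):
    # Replace nulls with 0 in the given numeric columns
    return df.fillna(0, subset=columns)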
In this documentation we have the following code:
# Import optimus
import optimus as op
#Import os module for system tools
import os
# Reading dataframe. os.getcwd() returns the current directory of the notebook
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"
df = tools.read_csv(path=filePath, sep=',')
# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()
A tools = op.Utilities() line is missing.
Change the explode function to count items in the Read the Docs documentation: http://optimus-ironmussa.readthedocs.io/en/latest/#dataframetransformer-explode-table-coldid-col-new-col-feature
pip install -U airbrake
The easiest way to get set up is with a few environment variables (you can find your project ID and API key in your project's settings):
export AIRBRAKE_API_KEY=<Your project API KEY>
export AIRBRAKE_PROJECT_ID=<Your project ID>
export AIRBRAKE_ENVIRONMENT=production
and you're done!
Otherwise, you can instantiate your AirbrakeHandler by passing these values as arguments to the getLogger() helper:
import airbrake
logger = airbrake.getLogger(api_key="<Your project API KEY>", project_id=<Your project ID>)
try:
    1/0
except Exception:
    logger.exception("Bad math.")
For more information please visit our official GitHub repo.
@FavioVazquez
I think there should be an easy way to take a data sample, run all the operations on it, and then apply them to the whole dataset. This way, if we have a dataset of 5 TB we do not need to process all the data, which could be time-consuming.
For example:
Sample Dataset
transformer.sample().trim_col("*") \
    .remove_special_chars("*") \
    .clear_accents("*") \
    .lower_case("*")
Whole dataset
transformer.trim_col("*") \
    .remove_special_chars("*") \
    .clear_accents("*") \
    .lower_case("*")
This approach should be the easiest to implement, but there are some problems: the user has to copy-paste the whole chain and apply it to the whole dataset. Another approach could be something like this (not Python):
operations = [trim_col("*"), remove_special_chars("*"), clear_accents("*"), lower_case("*")]
# Transformation on the sample dataset
transformer.sample().apply(operations)
# Transformation on the whole dataset
transformer.apply(operations)
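A Python version of that idea is feasible; a minimal sketch where each operation is recorded as a (method name, argument) pair and replayed with getattr (apply_operations is a hypothetical helper):
operations = [
    ("trim_col", "*"),
    ("remove_special_chars", "*"),
    ("clear_accents", "*"),
    ("lower_case", "*"),
]

def apply_operations(transformer, operations):
    # Replay each recorded method call against the transformer
    for name, arg in operations:
        getattr(transformer, name)(arg)
    return transformer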
We talked about the possibility of using Spark's random sample function. Although this could be the first approach, I think there are two problems we have to tackle.
Empty data
The trainers do not accept empty data. If we cannot detect empty data in the sample and we do not remove it from the whole dataset, the user could have problems at training time.
Outliers
If we cannot detect an outlier in the sample and we do not remove it from the whole dataset, it could generate a flawed model.
We must be sure that our sample dataset meets these requirements.
I think we must find the fastest way to detect empty data and outliers in the sample function, and be sure that the user gets the most accurate representation of the whole data.
Right now, to plot a histogram the user must:
priceDf = analyzer.get_data_frame().select("price") # or df.select("price")
hist_dictPri = analyzer.get_numerical_hist(df_one_col=priceDf, num_bars=10)
analyzer.plot_hist(df_one_col=priceDf, hist_dict=hist_dictPri, type_hist='categorical')
I think that we must simplify the function to:
analyzer.plot_hist(df, column, bins, type_hist='categorical')
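A minimal sketch of the proposed signature, implemented as a thin wrapper over the three existing calls (the wrapper name is illustrative):
def plot_hist_simple(analyzer, df, column, bins=10, type_hist='categorical'):
    # Select the column, build the histogram dict, then plot it
    one_col = df.select(column)
    hist_dict = analyzer.get_numerical_hist(df_one_col=one_col, num_bars=bins)
    analyzer.plot_hist(df_one_col=one_col, hist_dict=hist_dict, type_hist=type_hist)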
Try out some cleaning operations in local mode and cluster mode. Measure performance and create basic graphs.
Principal libraries and frameworks to try:
I was going through the code and found that various methods and variables are named using camel case. Also, a few methods start with double underscores ("__"); we should use a single underscore ("_") to differentiate internal methods from public methods. I think it would be a good idea to start using PEP 8 naming conventions for modules, functions, methods and variables. Please let me know your thoughts on this.
And the ones really important to Optimus
This may be the future for Optimus and all we do.
Make it easy to run ML with Spark.
Let's do it @argenisleon
I'm not sure how much we'll want to explore this option. Just want to introduce a design pattern that works well with the Scala API of Spark.
The Spark Scala API has a nifty transform method that lets users chain user-defined transformations and methods defined in the Dataset class. See this blog post for more information.
I like the DataFrameTransformer class, but it doesn't let users easily access the native PySpark DataFrame methods.
We might want to take these methods out of the DataFrameTransformer class, so the user can mix and match the Optimus API and the PySpark API.
source_df\
.transform(lambda df: lower_case(df, "*"))\
.withColumn("funny", lit("spongebob"))\
.transform(lambda df: trim_col(df, "address"))
The transform method is defined in quinn. I'd love to make an interface like this, but I'm not sure how to implement it with Python:
source_df\
.transform(lower_case("*"))\
.withColumn("funny", lit("spongebob"))\
.transform(trim_col("address"))
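One way to get that interface in Python is currying: each helper takes its configuration and returns a function of df, which transform then applies. A minimal sketch, assuming a quinn-style DataFrame.transform is available and a single column name or "*" is passed:
from pyspark.sql import functions as F

def lower_case(columns):
    def inner(df):
        # Lower-case the requested column(s); "*" means all columns
        cols = df.columns if columns == "*" else [columns]
        for c in cols:
            df = df.withColumn(c, F.lower(F.col(c)))
        return df
    return inner

With this, source_df.transform(lower_case("*")) behaves exactly like the lambda version above.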
Let me know what you think!
I think it should not be necessary to instantiate the Utilities class to use the read and write operations.
For example:
# Import optimus
import optimus as op
# Import module for system tools
import os
# Instance of Utilities class
tools = op.Utilities()
# Reading dataframe. os.getcwd() returns the current directory of the notebook
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"
df = tools.read_dataset_csv(path=filePath, delimiter_mark=',')
I think the way pandas handles it is the easiest and most elegant:
import pandas as pd
# Load the dataset
df = pd.read_csv('mock_bank_data_original.csv')
df.to_csv('mock_bank_data_original_PART1.csv')
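A minimal sketch of what a pandas-style entry point could look like (read_csv here is hypothetical; it creates the Utilities instance internally):
import os
import optimus as op

def read_csv(path, sep=','):
    # Hide the Utilities instantiation behind a module-level helper
    tools = op.Utilities()
    return tools.read_dataset_csv(path=path, delimiter_mark=sep)

df = read_csv("file:///" + os.getcwd() + "/foo.csv")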
Here is a word in Spanish that must be removed/translated https://github.com/ironmussa/Optimus/blob/e22f9d55a889f2198396f6472e573d8afd8a795a/optimus/utilities.py#L192