hi-primus / optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Home Page: https://hi-optimus.com
License: Apache License 2.0
@FavioVazquez
If we want to abstract the use of the Spark DataFrame structure, I think we should consider using something like:
transformer.show()
instead of
transformer.get_data_frame().show()
It's shorter, easier to use, and the user doesn't need to deal with the df directly.
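A minimal sketch of the idea, assuming DataFrameTransformer keeps the PySpark DataFrame in an internal attribute (the name _df is illustrative):
class DataFrameTransformer:
    def __init__(self, df):
        self._df = df  # the wrapped PySpark DataFrame

    def show(self, n=20, truncate=True):
        # Delegate to the underlying DataFrame so the user never touches it
        self._df.show(n, truncate)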
Fix RTD.
It seems that the prefix needed on Windows is file:/// instead of file:// in self.__sc.setCheckpointDir(dirName="file://" + folderPath).
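A possible portable fix, sketched as a hypothetical helper: POSIX absolute paths already start with "/", so "file://" + path yields the three slashes, while Windows paths ("C:/...") need the extra one added.
def checkpoint_uri(folder_path):
    # Normalize separators, then pick the prefix by path style
    folder_path = folder_path.replace("\\", "/")
    prefix = "file://" if folder_path.startswith("/") else "file:///"
    return prefix + folder_path

# Usage: self.__sc.setCheckpointDir(dirName=checkpoint_uri(folderPath))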
It would be helpful for a user to have a function to enrich data using a REST API. For example, connect to the Google Maps API to geocode an address, or the FullContact API to add additional info using the user's email.
The function must let the user configure the request rate limit and the URL.
We must explore how to add additional params like API keys or any other params the API needs.
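A minimal sketch using the requests library; the function name and its parameters are illustrative, not part of the current API:
import time
import requests

def enrich_with_api(values, url, requests_per_second=1.0, **params):
    # Call the endpoint once per value, throttled to the given rate;
    # extra keyword args (e.g. an API key) are forwarded as query params.
    results = []
    for value in values:
        response = requests.get(url, params=dict(params, q=value))
        results.append(response.json())
        time.sleep(1.0 / requests_per_second)
    return results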
We could create a little helper function to let the user easily impute missing values with the minimum value of a column.
Do you think it should be in the same example, or should we create a new one, @argenisleon?
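A minimal sketch of such a helper in plain PySpark (impute_to_min is a hypothetical name):
from pyspark.sql import functions as F

def impute_to_min(df, column):
    # Replace nulls in `column` with that column's minimum value
    min_value = df.agg(F.min(column)).first()[0]
    return df.fillna({column: min_value})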
It would be amazing to create an Optimus cheatsheet. I created an account on www.cheatography.com. Please @FavioVazquez let me know your Cheatography credentials via Slack so I can add you as a collaborator.
The threshold is currently hard-coded in the OutlierDetector function. The user should be able to pass this param to test different thresholds.
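A minimal sketch of the change, shown with the common IQR rule; the real OutlierDetector logic may differ, the point is only that threshold becomes a parameter:
from pyspark.sql import functions as F

def detect_outliers(df, column, threshold=1.5):
    # threshold is user-supplied instead of a hard-coded constant
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], 0.0)
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return df.filter((F.col(column) < lower) | (F.col(column) > upper))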
Hi,
I get this error on the latest version while doing:
analyzer = op.DataFrameAnalyzer(df=df)
analyzer.column_analyze("*", plots=False, values_bar=False)
Error:
KeyError Traceback (most recent call last)
<ipython-input-5-282aba391ae8> in <module>()
1 analyzer = op.DataFrameAnalyzer(df=df)
----> 2 analyzer.column_analyze("*", plots=False, values_bar=False)
~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in column_analyze(self, column_list, plots, values_bar, print_type, num_bars, print_all)
596 values_bar=values_bar,
597 num_bars=num_bars,
--> 598 types_dict=types)
599
600 # Save the invalid col if exists
~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in _analyze(self, df_col_analyzer, column, row_number, plots, print_type, values_bar, num_bars, types_dict)
388 summary_dict = self._create_dict(
389 ["name", "type", "total", "valid_values", "missing_values"],
--> 390 [column, types_dict[type_col], row_number, valid_values, missing_values
391 ])
392
KeyError: 'timestamp'
Use airbrake for testing
Following the explode_table example in the docs returns a "missing 1 required positional argument: 'list_to_assign'" error:
# Instantiation of DataFrameTransformer class:
transformer = op.DataFrameTransformer(df)
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show()
# Transformation:
transformer.explode_table('bill id', 'foods', 'Beer')
# Printing new dataFrame:
print('New dataFrame:')
transformer.show()
We should make the framework easy to use for people coming from pandas, Spark, R, etc.
I'm not saying let's copy the names, but I think some of them are too long or maybe confusing.
I want to add DataFrame comparisons to the test suite to make it more robust, and also make the test suite handle the SparkSession efficiently. Proposed improvements are here: #150
Add a description for every method and more examples.
It would be amazing if we created a Docker image with everything ready to run Optimus: Spark, Optimus, Jupyter Notebook, etc. This seems like a good Dockerfile to start from: https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook
We could create a little helper function to let the user easily impute missing values with 0.
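As above, a short PySpark sketch of what this helper would wrap (the function name is illustrative):
def impute_to_zero(df, columns):
    # Replace nulls with 0 in the given numeric columns
    return df.fillna(0, subset=columns)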
In this documentation we have the following code:
# Import optimus
import optimus as op
#Import os module for system tools
import os
# Reading dataframe. os.getcwd() returns the current directory of the notebook
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"
df = tools.read_csv(path=filePath, sep=',')
# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()
A tools = op.Utilities() line is missing.
Change the explode function to count items in the Read the Docs documentation: http://optimus-ironmussa.readthedocs.io/en/latest/#dataframetransformer-explode-table-coldid-col-new-col-feature
pip install -U airbrake
The easiest way to get set up is with a few environment variables (you can find your project ID and API key in your project's settings):
export AIRBRAKE_API_KEY=<Your project API KEY>
export AIRBRAKE_PROJECT_ID=<Your project ID>
export AIRBRAKE_ENVIRONMENT=production
and you're done!
Otherwise, you can instantiate your AirbrakeHandler by passing these values as arguments to the getLogger() helper:
import airbrake
logger = airbrake.getLogger(api_key="<Your project API KEY>", project_id=<Your project ID>)
try:
    1/0
except Exception:
    logger.exception("Bad math.")
For more information please visit our official GitHub repo.
@FavioVazquez
I think there should be an easy way to take a data sample, run all the operations on it, and then apply them to the whole dataset. This way, if we have a dataset of 5 TB we do not need to process all the data, which could be time-consuming.
For example:
Sample Dataset
transformer.sample().trim_col("*") \
    .remove_special_chars("*") \
    .clear_accents("*") \
    .lower_case("*")
Whole dataset
transformer.trim_col("*") \
    .remove_special_chars("*") \
    .clear_accents("*") \
    .lower_case("*")
This approach should be the easiest to implement, but there are some problems: the user has to copy-paste the whole chain and apply it to the whole dataset. Another approach could be something like this (not Python):
operations = [trim_col("*"), remove_special_chars("*"), clear_accents("*"), lower_case("*")]
# Transformation on the sample dataset
transformer.sample().apply(operations)
# Transformation on the whole dataset
transformer.apply(operations)
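A Python version of that idea is feasible; a minimal sketch where each operation is recorded as a (method name, argument) pair and replayed with getattr (apply_operations is a hypothetical helper):
operations = [
    ("trim_col", "*"),
    ("remove_special_chars", "*"),
    ("clear_accents", "*"),
    ("lower_case", "*"),
]

def apply_operations(transformer, operations):
    # Replay each recorded method call against the transformer
    for name, arg in operations:
        getattr(transformer, name)(arg)
    return transformer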
We talked about the possibility of using Spark's random sample function. Although this could be the first approach, I think there are two problems we have to tackle.
Empty data
The trainers do not accept empty data. If we cannot detect empty data in the sample and we do not remove it from the whole dataset, the user could have problems at training time.
Outliers
If we cannot detect an outlier in the sample and we do not remove it from the whole dataset, it could generate a flawed model.
We must be sure that our sample dataset meets these requirements.
I think we must find the fastest way to detect empty data and outliers in the sample function, and be sure that the user gets the most accurate representation of the whole data.
Right now, to plot a histogram the user must:
priceDf = analyzer.get_data_frame().select("price") # or df.select("price")
hist_dictPri = analyzer.get_numerical_hist(df_one_col=priceDf, num_bars=10)
analyzer.plot_hist(df_one_col=priceDf, hist_dict=hist_dictPri, type_hist='categorical')
I think that we must simplify the function to:
analyzer.plot_hist(df, column, bins, type_hist='categorical')
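A minimal sketch of the proposed signature, implemented as a thin wrapper over the three existing calls (the wrapper name is illustrative):
def plot_hist_simple(analyzer, df, column, bins=10, type_hist='categorical'):
    # Select the column, build the histogram dict, then plot it
    one_col = df.select(column)
    hist_dict = analyzer.get_numerical_hist(df_one_col=one_col, num_bars=bins)
    analyzer.plot_hist(df_one_col=one_col, hist_dict=hist_dict, type_hist=type_hist)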
Try out some cleaning operations in local mode and cluster mode. Measure performance and create basic graphs.
Principal libraries and frameworks to try:
I was going through the code and found that various methods and variables are named using camel case. Also, a few methods start with double underscores ("__"); we should use a single underscore ("_") to differentiate internal methods from public methods. I think it would be a good idea to start using PEP 8 naming conventions for modules, functions, methods and variables. Please let me know your thoughts on this.
And the ones really important to Optimus
This may be the future for Optimus and all we do.
Make it easy to run ML with Spark.
Let's do it @argenisleon
I'm not sure how much we'll want to explore this option. Just want to introduce a design pattern that works well with the Scala API of Spark.
The Spark Scala API has a nifty transform method that lets users chain user-defined transformations and methods defined in the Dataset class. See this blog post for more information.
I like the DataFrameTransformer class, but it doesn't let users easily access the native PySpark DataFrame methods.
We might want to take these methods out of the DataFrameTransformer class, so the user can mix and match the Optimus API and the PySpark API.
source_df\
.transform(lambda df: lower_case(df, "*"))\
.withColumn("funny", lit("spongebob"))\
.transform(lambda df: trim_col(df, "address"))
The transform method is defined in quinn. I'd love to make an interface like this, but I'm not sure how to implement it with Python:
source_df\
.transform(lower_case("*"))\
.withColumn("funny", lit("spongebob"))\
.transform(trim_col("address"))
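One way to get that interface in Python is currying: each helper takes its configuration and returns a function of df, which transform then applies. A minimal sketch, assuming a quinn-style DataFrame.transform is available and a single column name or "*" is passed:
from pyspark.sql import functions as F

def lower_case(columns):
    def inner(df):
        # Lower-case the requested column(s); "*" means all columns
        cols = df.columns if columns == "*" else [columns]
        for c in cols:
            df = df.withColumn(c, F.lower(F.col(c)))
        return df
    return inner

With this, source_df.transform(lower_case("*")) behaves exactly like the lambda version above.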
Let me know what you think!
I think it should not be necessary to instantiate the Utilities class to use the read and write operations.
For example:
# Import optimus
import optimus as op
# Import module for system tools
import os
# Instance of Utilities class
tools = op.Utilities()
# Reading dataframe. os.getcwd() returns the current directory of the notebook
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"
df = tools.read_dataset_csv(path=filePath, delimiter_mark=',')
I think the way pandas handles it is the easiest and most elegant:
import pandas as pd
# Load the dataset
df = pd.read_csv('mock_bank_data_original.csv')
df.to_csv('mock_bank_data_original_PART1.csv')
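A minimal sketch of what a pandas-style entry point could look like (read_csv here is hypothetical; it creates the Utilities instance internally):
import os
import optimus as op

def read_csv(path, sep=','):
    # Hide the Utilities instantiation behind a module-level helper
    tools = op.Utilities()
    return tools.read_dataset_csv(path=path, delimiter_mark=sep)

df = read_csv("file:///" + os.getcwd() + "/foo.csv")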
Here is a word in Spanish that must be removed/translated https://github.com/ironmussa/Optimus/blob/e22f9d55a889f2198396f6472e573d8afd8a795a/optimus/utilities.py#L192