Comments (24)
@MrPowers I think all functions should work on multiple columns, and keep the name remove_chars or something like that. The user can choose to add only one or many. That's more natural. Thanks!
Hi @MrPowers! But the source_df is a Spark DataFrame, right? I'm actually not sure how to implement this, but it seems very interesting. I'll take a look.
Yep, source_df is a Spark DataFrame.
Let me make the current, quinn, and "ideal" options more clear.
current - think this code would work ;)
transformer = DataFrameTransformer(source_df)
df1 = transformer.lower_case("*").get_data_frame
df2 = df1.withColumn("funny", lit("spongebob"))
transformer2 = DataFrameTransformer(df2)
transformer2.trim_col("address").get_data_frame
using the quinn transform method (if the DataFrameTransformer methods weren't in a class)
source_df\
.transform(lambda df: lower_case(df, "*"))\
.withColumn("funny", lit("spongebob"))\
.transform(lambda df: trim_col(df, "address"))
using "ideal" transform method - I'm not sure this is even possible
source_df\
.transform(lower_case("*"))\
.withColumn("funny", lit("spongebob"))\
.transform(trim_col("address"))
Yes, I think this is a great idea. But I'm thinking that maybe the only way is to extend the DataFrame class in PySpark so we can add the quinn transform method. Do you think this could work?
Yep, the quinn library extends the PySpark DataFrame class to add the transform method. I am going to ask some coworkers / StackOverflow if they know how to write the "elegant" transform method. I'll report back ;)
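For reference, the extension can be as small as a method that applies a DataFrame -> DataFrame function to self — a minimal sketch of the idea (the actual quinn implementation may differ slightly):

from pyspark.sql import DataFrame

def transform(self, f):
    # Apply a DataFrame -> DataFrame function, so custom transformations
    # chain like built-in methods
    return f(self)

# Monkey-patch the method onto the DataFrame class
DataFrame.transform = transform

(Newer Spark releases ship a native DataFrame.transform that works the same way.)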
Great @MrPowers! Please let me know! I think we should add more of quinn's functionality to Optimus!
@pirate showed me how to define PySpark custom transformations with inner functions, so they can be easily chained with a DataFrame.transform method.
I wrote a blog post with more details.
How about I do a spike to see if the DataFrameTransformer can be refactored, so users can do this:
(source_df
.transform(lower_case("*"))
.withColumn("funny", lit("spongebob"))
.transform(trim_col("address")))
Instead of this:
transformer = DataFrameTransformer(source_df)
df1 = transformer.lower_case("*").get_data_frame
df2 = df1.withColumn("funny", lit("spongebob"))
transformer2 = DataFrameTransformer(df2)
transformer2.trim_col("address").get_data_frame
Thanks!
Hi @MrPowers! We've talked about it and it actually sounds great. Amazing that you've found a way.
What is the impact on the way the transformer is written right now? Can you create a small snippet of a function written in the new style with the transform method, so we can check it out please?
I do think this could be a major change to the code. So please go ahead and do your magic :)
Hi @MrPowers, any progress on this issue?
Sorry for the delay on this one @FavioVazquez. I'll get you something today or tomorrow. Thanks for following up!
Great! We'll be waiting :) @MrPowers
@FavioVazquez - Check out this pull request and let me know what you think!
I think the custom DataFrame transformation interface I've outlined in the df_transformer_exp.py file is the best path forward for this project. I work on a large Scala/Spark codebase and we organize the DataFrame transformations in this manner. I think it's best to give the user an interface that allows them to easily access the built-in DataFrame methods / functions as well as the functionality that's provided by Optimus. It's awkward to switch back and forth between the native methods and the Optimus methods with the current interface (see the test_existing_interface test).
I think this new interface will also encourage Optimus to focus on the unique functionality that your project is bringing to the table. A lot of the current DataFrameTransformer methods are basically just aliases for methods that are already provided by PySpark (e.g. show, replace_na, drop_col, delete_row, etc.). I think Optimus should be focused on unique data munging methods that aren't provided by Spark natively (e.g. clear_accents, remove_special_chars, remove_special_chars_regex, move_col).
I'm optimistic about the future of this project and am happy to catch up on a call to discuss the next steps!
@FavioVazquez - Glad we're on the same page. Let me get this pull request in better shape, so it can get merged into master. In the short run, I think we can build out the new interface and keep the existing code. I think we'll be able to clean up the existing code by leveraging the native Spark functions a bit more. Let me know what you think!
I think we'll also want to build out some Optimus functions, similar to the PySpark functions.
I think a good next step is to go through the DataFrameTransformer class and categorize each method as a "Spark alias method" or a "unique Optimus method". We can then start migrating over the unique Optimus methods.
Great! Yes, that should be done. The thing here, @MrPowers, is that I hate some of the names of Spark's functions; they are not intuitive at all. So what should we do there?
Question: what do you mean when you said "I think we'll also want to build out some Optimus functions, similar to the PySpark functions"?
Another thing: I was planning on adding an annotation that marks a method as experimental, like the ones in Spark, but I'm not sure how they do it. Do you know anything about this?
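(For context: Spark's Scala API marks such methods with an @Experimental annotation, while PySpark mostly adds an "Experimental" note to the docstring. One way to approximate the idea in Python is a decorator that warns on use and tags the docstring — just a sketch, and the decorator name is made up:)

import functools
import warnings

def experimental(func):
    # Hypothetical decorator: emit a warning when the function is called
    # and flag it as experimental in the docs
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "{0} is experimental and may change in future releases".format(func.__name__),
            FutureWarning
        )
        return func(*args, **kwargs)
    wrapper.__doc__ = (func.__doc__ or "") + "\n\n.. note:: Experimental."
    return wrapper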
@MrPowers I think this will be a major change, so I'm putting it in the plans for version 2.0. You can check the board and add more issues there. It will be a great way to let us know the state of progress. Thanks!
@FavioVazquez - I went through all the current DataFrameTransformer methods and classified them as "not needed in the new interface", "should be in the new interface", or "I'm not sure yet". Let me know what you think!
def df(self):
no longer needed
def show(self, n=10, truncate=True):
no longer needed
def lower_case(self, columns):
There is already a lower function so I don't think this is needed: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lower
def upper_case(self, columns):
There is already an upper function: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.upper
def impute_missing(self, columns, out_cols, strategy):
Not sure about this one
def replace_na(self, value, columns=None):
Already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
def check_point(self):
Already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.checkpoint
def trim_col(self, columns):
Already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim
def drop_col(self, columns):
Already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop
def replace_col(self, search, change_to, columns):
I think this is pretty much the same as regexp_replace: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace
def delete_row(self, func):
Filter already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter
def set_col(self, columns, func, data_type):
I am not sure about this one
def keep_col(self, columns):
Looks like the same as select: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.select
def clear_accents(self, columns):
Keep this one
def remove_special_chars(self, columns):
Keep this one
def remove_special_chars_regex(self, columns, regex):
Not sure about this one
def rename_col(self, columns):
You can use withColumnRenamed for this: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumnRenamed
def lookup(self, column, str_to_replace, list_str=None):
Not sure about this one
def move_col(self, column, ref_col, position):
Keep this one
def count_items(self, col_id, col_search, new_col_feature, search_string):
Keep this one
def date_transform(self, columns, current_format, output_format):
Keep this one
def age_calculate(self, column, dates_format, name_col_age):
Keep this one
def cast_func(self, cols_and_types):
Not sure about this one
def empty_str_to_str(self, columns, custom_str):
Not sure about this one
def operation_in_type(self, parameters):
Not sure about this one
def row_filter_by_type(self, column_name, type_to_delete):
Not sure about this one
def undo_vec_assembler(self, column, feature_names):
Think explode can be used for this: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
def scale_vec_col(self, columns, name_output_col):
Not sure if this belongs here, seems like a pyspark.ml related function: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html
def split_str_col(self, column, feature_names, mark):
See explode method: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
def remove_empty_rows(self, how="all"):
See DataFrameNaFunctions#drop: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.drop
def remove_duplicates(self, cols=None):
dropDuplicates already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
def write_df_as_json(self, path):
Think we can just use the DataFrameWriter to write out JSON: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.format
def to_csv(self, path_name, header="true", mode="overwrite", sep=",", *args, **kargs):
No longer needed
def string_to_index(self, input_cols):
Not sure
def index_to_string(self, input_cols):
Not sure
def one_hot_encoder(self, input_cols):
Not sure
def sql(self, sql_expression):
Not sure
def vector_assembler(self, input_cols):
Not sure
def normalizer(self, input_cols, p=2.0):
Not sure
def select(self, columns):
Not needed
def select_idx(self, indices):
Keep
Hey @MrPowers, thanks for this. I think now is decision time. Lots of the functions you suggest removing are in Spark, that's true, but some of them only work column by column; they don't allow changing multiple columns at once. Why not keep the same names as in Spark, but add this functionality? That's apart from all the assertions we make to help the user, which Spark doesn't.
On the other hand, why are you not sure about the feature transformations we programmed? They're a pain in the ass for most users: they only allow a single transformation at a time, and you have to use the fit and transform methods, which are not clear to everyone.
We may move all of this into the OptimusML library, but I'm voting to keep them.
I think this new interface will be a great step forward for Optimus, so I'm in. But I want to emphasize that most of what we are doing here can be done with Spark; it's just not easy or pretty, and that ease is what we want to give the user. Also, most of our users come from pandas or dplyr, so the plan was to make something like that.
What are your thoughts on this?
And thank you again :)
I just wrote a blog post on how to perform operations on multiple columns of a DataFrame with the Scala API. I am going to do some research and see how to run operations on multiple columns with PySpark. I think I'll be able to figure something out with reduce and/or a list comprehension.
I haven't used the Spark ML library much yet, but I think you're right that the methods you've coded up might be super useful for users. Making the ML methods easily accessible might turn out to be the secret sauce of Optimus!
For now, I'll research and see how to run operations on multiple columns with PySpark and will get back to you with what I find!
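(For a taste of the list-comprehension flavor, something like this works in PySpark — a sketch with toy data:)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("HELLO", "WORLD")], ["greeting", "place"])

# Lower-case every column in a single select via a list comprehension
df.select([F.lower(F.col(c)).alias(c) for c in df.columns]).show()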
@FavioVazquez - Here's one way we can make it easy for users to apply the transformations to multiple columns:
from functools import reduce
from pyspark.sql.functions import regexp_replace

def remove_chars(colName, chars):
    def inner(df):
        # escape each character and build a regex matching any of them
        regexp = "|".join(r'\{0}'.format(i) for i in chars)
        return df.withColumn(colName, regexp_replace(colName, regexp, ""))
    return inner

def multi_remove_chars(colNames, chars):
    def inner(df):
        # fold the single-column function over every column name
        return reduce(
            lambda memo_df, col_name: remove_chars(col_name, chars)(memo_df),
            colNames,
            df
        )
    return inner
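With the transform method from quinn, usage would look something like this (the column names and characters here are hypothetical):

source_df.transform(multi_remove_chars(["address", "city"], "*!"))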
There's a test in this commit.
I am going to ask some people I work with that are more experienced with Python about what they think about this approach.
If we go with this approach, I'm not sure if we should expose both remove_chars and multi_remove_chars in the public API, or only multi_remove_chars. I'll need to think about that one a bit.
@FavioVazquez - Yep, I agree with your feedback. Something like this might work to keep the code clean:
from functools import reduce
from pyspark.sql.functions import regexp_replace

def __remove_chars(col_name, removed_chars):
    def inner(df):
        regexp = "|".join(r'\{0}'.format(i) for i in removed_chars)
        return df.withColumn(col_name, regexp_replace(col_name, regexp, ""))
    return inner

def remove_chars(col_names, removed_chars):
    def inner(df):
        # apply the private single-column function to every column
        return reduce(
            lambda memo_df, col_name: __remove_chars(col_name, removed_chars)(memo_df),
            col_names,
            df
        )
    return inner
Yes, I like that. @MrPowers, finish cleaning the PR up so I can merge it, and add the documentation, assertions, and more functions. Thanks :)
@FavioVazquez - I wrote the blog post on performing operations on multiple columns in a PySpark DataFrame. Let me know what you think!
I think we should write private functions that work on a single column and then expose functions that work on multiple columns as the public API.
I am traveling to a remote part of Colombia and will be offline for the next few days. I will pick this back up next week!
I have been working on simplifying how you can work with Optimus.
The approach of using a transform function seems fine, since you can still call native DataFrame functions, but it seems a little verbose.
After some experimentation with class hierarchies and decorators, it seems the decorator option is more flexible, and it was the only way I could implement chaining.
from functools import wraps  # This convenience func preserves name and docstring
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# decorator to attach a custom function to a class
def add_method(cls):
    def decorator(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            return func(self, *args, **kwargs)
        setattr(cls, func.__name__, wrapper)
        # Note we are not binding func, but wrapper, which accepts self but does exactly the same as func
        return func  # returning func means func can still be used normally
    return decorator
@add_method(DataFrame)
def lower(self, columns):
    for column in columns:
        self = self.withColumn(column, F.lower(col(column)))
    return self

@add_method(DataFrame)
def upper(self, columns):
    for column in columns:
        self = self.withColumn(column, F.upper(col(column)))
    return self

@add_method(DataFrame)
def reverse(self, columns):
    for column in columns:
        self = self.withColumn(column, F.reverse(col(column)))
    return self
schema = StructType([
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("population", IntegerType(), True)])

countries = ['Colombia', 'US@A', 'Brazil', 'Spain']
cities = ['Bogotá', 'New York', ' São Paulo ', '~Madrid']
population = [37800000, 19795791, 12341418, 6489162]

# Create dataframe (assumes an existing SparkSession named spark)
df = spark.createDataFrame(list(zip(cities, countries, population)), schema=schema)

# Some operations on multiple columns
r = (df.lower(["city", "country"])
     .withColumn("city", F.upper(col("city")))
     .reverse(["city"])
     .reverse(["city", "country"]))
r.show()
@MrPowers I was reading your article about processing multiple columns, but I cannot figure out how to use an implementation like this with chaining:

def multi_remove_some_chars(col_names):
    def inner(df):
        for col_name in col_names:
            df = df.withColumn(
                col_name,
                # remove_some_chars(col_name) is the Column-returning helper
                # defined in the blog post
                remove_some_chars(col_name)
            )
        return df
    return inner

Any thoughts about this?
This is a great idea @argenisleon, I think we should explore this option too. I created the PR for the second version in #217. It follows some of the things @MrPowers started. @argenisleon, check the reduce function there. The chaining part is very easy with the transformer that @MrPowers created in quinn, and now we should think about how to do it here.
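(With quinn's transform method in place, the two styles chain together — a sketch, assuming the definitions above, including remove_some_chars from the blog post, and hypothetical column names:)

r = (df
     .lower(["city", "country"])
     .transform(multi_remove_some_chars(["city"])))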