Comments (24)

FavioVazquez commented on May 18, 2024

@MrPowers I think all functions should work on multiple columns, and we should keep the name remove_chars or something like that. The user can choose to pass only one column or many. That's more natural. Thanks!

FavioVazquez commented on May 18, 2024

Hi @MrPowers! But the source_df is a Spark DataFrame, right? I'm actually not sure how to implement this, but it seems very interesting. I'll take a look.

MrPowers commented on May 18, 2024

Yep, source_df is a Spark DataFrame.

Let me make the current, quinn, and "ideal" options more clear.

current - I think this code would work ;)

transformer = DataFrameTransformer(source_df)
df1 = transformer.lower_case("*").get_data_frame
df2 = df1.withColumn("funny", lit("spongebob"))
transformer2 = DataFrameTransformer(df2)
transformer2.trim_col("address").get_data_frame

using the quinn transform method (if the DataFrameTransformer methods weren't in a class)

source_df\
    .transform(lambda df: lower_case(df, "*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(lambda df: trim_col(df, "address"))

using "ideal" transform method - I'm not sure this is even possible

source_df\
    .transform(lower_case("*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(trim_col("address"))

FavioVazquez commented on May 18, 2024

Yes, I think this is a great idea. But I'm thinking that maybe the only way is to extend the DataFrame class in PySpark so we can add the quinn transform method. Do you think this could work?

MrPowers commented on May 18, 2024

Yep, the quinn library extends the PySpark DataFrame class to add the transform method. I am going to ask some coworkers / StackOverflow if they know how to write the "elegant" transform method. I'll report back ;)
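
For reference, the extension itself is tiny. A minimal sketch (assuming the quinn approach of monkey-patching the DataFrame class, not quinn's exact source):

from pyspark.sql import DataFrame

def transform(self, f):
    # f takes a DataFrame and returns a DataFrame, so calls can be chained
    return f(self)

DataFrame.transform = transform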

FavioVazquez commented on May 18, 2024

Great @MrPowers! Please let me know! I think we should add more of quinn's functionality to Optimus!

MrPowers commented on May 18, 2024

@pirate showed me how to define PySpark custom transformations with inner functions, so they can be easily chained with a DataFrame.transform method 🎊

I wrote a blog post with more details.

How about I do a spike to see if the DataFrameTransformer can be refactored, so users can do this:

 (source_df
    .transform(lower_case("*"))
    .withColumn("funny", lit("spongebob"))
    .transform(trim_col("address")))

Instead of this:

transformer = DataFrameTransformer(source_df)
df1 = transformer.lower_case("*").get_data_frame
df2 = df1.withColumn("funny", lit("spongebob"))
transformer2 = DataFrameTransformer(df2)
transformer2.trim_col("address").get_data_frame
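
For context, the inner-function pattern @pirate showed me looks roughly like this (a sketch of lower_case, assuming "*" means all columns):

from pyspark.sql import functions as F

def lower_case(columns):
    def inner(df):
        # "*" lowercases every column; otherwise only the listed ones
        cols = df.columns if columns == "*" else columns
        for col_name in cols:
            df = df.withColumn(col_name, F.lower(F.col(col_name)))
        return df
    return inner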

Thanks!

FavioVazquez commented on May 18, 2024

Hi @MrPowers! We've talked about it and it actually sounds great. Amazing that you've found the way.

What is the impact on the way the transformer is written right now? Can you create a small snippet of a function in the new style with the transform method so we can check it out, please?

I do think this could be a major change to the code. So please go ahead and do your magic :) 👍

FavioVazquez commented on May 18, 2024

Hi @MrPowers, any advances on this issue?

MrPowers commented on May 18, 2024

Sorry for the delay on this one @FavioVazquez. I'll get you something today or tomorrow. Thanks for following up!

FavioVazquez commented on May 18, 2024

Great! We'll be waiting :) @MrPowers

MrPowers commented on May 18, 2024

@FavioVazquez - Check out this pull request and let me know what you think 😉

I think the custom DataFrame transformation interface I've outlined in the df_transformer_exp.py file is the best path forward for this project. I work on a large Scala/Spark codebase and we organize the DataFrame transformations in this manner. I think it's best to give the user an interface that allows them to easily access the built-in DataFrame methods / functions as well as the functionality that's provided by Optimus. It's awkward to switch back and forth between the native methods and the Optimus methods with the current interface (see the test_existing_interface test).

I think this new interface will also encourage Optimus to focus on the unique functionality that your project is bringing to the table. A lot of the current DataFrameTransformer methods are basically just aliases for methods that are already provided by PySpark (e.g. show, replace_na, drop_col, delete_row, etc.). I think Optimus should be focused on unique data munging methods that aren't provided by Spark natively (e.g. clear_accents, remove_special_chars, remove_special_chars_regex, move_col).
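
clear_accents is a good example of what I mean: Spark has nothing built in for it. A rough sketch with a plain UDF (not the actual Optimus implementation) would be:

import unicodedata

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def strip_accents(s):
    if s is None:
        return None
    # Decompose accented characters, then drop the combining marks
    return "".join(c for c in unicodedata.normalize("NFKD", s)
                   if not unicodedata.combining(c))

strip_accents_udf = F.udf(strip_accents, StringType())

# given some DataFrame df with a string column "city"
df = df.withColumn("city", strip_accents_udf(F.col("city")))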

I'm optimistic about the future of this project and am happy to catch up on a call to discuss the next steps!

MrPowers commented on May 18, 2024

@FavioVazquez - Glad we're on the same page. Let me get this pull request in better shape, so it can get merged into master. In the short run, I think we can build out the new interface and keep the existing code. I think we'll be able to clean up the existing code by leveraging the native Spark functions a bit more. Let me know what you think 😄

I think we'll also want to build out some Optimus functions, similar to the PySpark functions.

I think a good next step is to go through the DataFrameTransformer class and categorize each method as a "Spark alias method" or a "unique Optimus method". We can then start migrating over the unique Optimus methods.

FavioVazquez commented on May 18, 2024

Great! Yes, that should be done. The thing here @MrPowers is that I hate some of the names of the Spark functions; they are not intuitive at all. So what could happen there?

Question: what do you mean when you said: "I think we'll also want to build out some Optimus functions, similar to the PySpark functions."?

Another thing: I was planning on adding an annotation that marks a function as experimental, like the ones in Spark, but I'm not sure how they do it. Do you know anything about this?
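
I was thinking of maybe something as simple as this (just a sketch):

import functools
import warnings

def experimental(func):
    # Warn the caller that the decorated function may change or be removed
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn("{} is experimental and may change in future versions"
                      .format(func.__name__), FutureWarning)
        return func(*args, **kwargs)
    return wrapper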

FavioVazquez commented on May 18, 2024

@MrPowers I think this will be a major change, so I'm putting it in the plans for version 2.0. You can check the board and add more issues there. It will be a great way to let us know the state of the progress. Thanks!

MrPowers commented on May 18, 2024

@FavioVazquez - I went through all the current DataFrameTransformer methods and classified them as "not needed in the new interface", "should be in the new interface", or "I'm not sure yet". Let me know what you think!

def df(self):
no longer needed

def show(self, n=10, truncate=True):
no longer needed

def lower_case(self, columns):
There is already a lower function, so I don't think this is needed (see the sketch after this list): http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lower

def upper_case(self, columns):
There is already an upper function: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.upper

def impute_missing(self, columns, out_cols, strategy):
Not sure about this one

def replace_na(self, value, columns=None):
Already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions

def check_point(self):
Already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.checkpoint

def trim_col(self, columns):
Already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim

def drop_col(self, columns):
Already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop

def replace_col(self, search, change_to, columns):
I think this is pretty much the same as regexp_replace: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace

def delete_row(self, func):
Filter already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter

def set_col(self, columns, func, data_type):
I am not sure about this one

def keep_col(self, columns):
Looks like the same as select: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.select

def clear_accents(self, columns):
Keep this one

def remove_special_chars(self, columns):
Keep this one

def remove_special_chars_regex(self, columns, regex):
Not sure about this one

def rename_col(self, columns):
You can use withColumnRenamed for this: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumnRenamed

def lookup(self, column, str_to_replace, list_str=None):
Not sure about this one

def move_col(self, column, ref_col, position):
Keep this one

def count_items(self, col_id, col_search, new_col_feature, search_string):
Keep this one

def date_transform(self, columns, current_format, output_format):
Keep this one

def age_calculate(self, column, dates_format, name_col_age):
Keep this one

def cast_func(self, cols_and_types):
Not sure about this one

def empty_str_to_str(self, columns, custom_str):
Not sure about this one

def operation_in_type(self, parameters):
Not sure about this one

def row_filter_by_type(self, column_name, type_to_delete):
Not sure about this one

def undo_vec_assembler(self, column, feature_names):
Think explode can be used for this: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

def scale_vec_col(self, columns, name_output_col):
Not sure if this belongs here, seems like a pyspark.ml related function: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html

def split_str_col(self, column, feature_names, mark):
See explode method: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

def remove_empty_rows(self, how="all"):
See DataFrameNaFunctions#drop: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.drop

def remove_duplicates(self, cols=None):
dropDuplicates already exists: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates

def write_df_as_json(self, path):
Think we can just use the DataFrameWriter to write out JSON: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.format

def to_csv(self, path_name, header="true", mode="overwrite", sep=",", *args, **kargs):
No longer needed

def string_to_index(self, input_cols):
Not sure

def index_to_string(self, input_cols):
Not sure

def one_hot_encoder(self, input_cols):
Not sure

def sql(self, sql_expression):
Not sure

def vector_assembler(self, input_cols):
Not sure

def normalizer(self, input_cols, p=2.0):
Not sure

def select(self, columns):
Not needed

def select_idx(self, indices):
Keep
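
To make the "Spark alias" point concrete, here's roughly what a couple of the entries above collapse to with the native API (a sketch):

from pyspark.sql import functions as F

# lower_case("name") is just:
df = df.withColumn("name", F.lower(F.col("name")))

# remove_empty_rows(how="all") is just:
df = df.na.drop(how="all")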

FavioVazquez commented on May 18, 2024

Hey @MrPowers, thanks for this. I think now is decision time. Lots of the functions you suggest removing do exist in Spark, that's true, but some of them only work column by column; the native functions don't allow changing multiple columns at once. Why not keep the same names as in Spark, but add this functionality? That's on top of all the assertions we make to help the user, which Spark doesn't do.

On the other hand, why are you not sure about the feature transformations we programmed? They're a pain in the ass for most users: they only allow a single transformation at a time, and you have to use the fit and transform methods, which are not clear to everyone.

We may move all of this into the OptimusML library, but I'm voting to keep them.

I think this new interface will be a great step forward for Optimus, so I'm in, but I want to emphasize that most of what we are doing here can be done with Spark; it's just not easy or pretty, and that ease is what we want to give the user. Also, most of our users come from pandas or dplyr, so the plan was to make something like that.

What are your thoughts on this?

And thank you again :)

MrPowers commented on May 18, 2024

I just wrote a blog post on how to perform operations on multiple columns of a DataFrame with the Scala API. I am going to do some research and see how to run operations on multiple columns with PySpark. I think I'll be able to figure something out with reduce and/or a list comprehension 🤔

I haven't used the Spark ML library much yet, but I think you're right that the methods you've coded up might be super useful for users. Making the ML methods easily accessible might turn out to be the secret sauce of Optimus 😉

For now, I'll research and see how to run operations on multiple columns with PySpark and will get back to you with what I find!

MrPowers commented on May 18, 2024

@FavioVazquez - Here's one way we can make it easy for users to apply the transformations to multiple columns:

from functools import reduce
from pyspark.sql.functions import regexp_replace

def remove_chars(colName, chars):
    def inner(df):
        # Build a regex that matches any of the characters
        # (assumes chars are punctuation/special characters that need escaping)
        regexp = "|".join('\\{0}'.format(i) for i in chars)
        return df.withColumn(colName, regexp_replace(colName, regexp, ""))
    return inner

def multi_remove_chars(colNames, chars):
    def inner(df):
        # Fold remove_chars over all of the columns
        return reduce(
            lambda memo_df, col_name: remove_chars(col_name, chars)(memo_df),
            colNames,
            df
        )
    return inner
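
Usage with the transform method would then look something like this (hypothetical column names):

actual_df = source_df.transform(multi_remove_chars(["name", "address"], ["@", "#"]))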

There's a test in this commit.

I am going to ask some people I work with who are more experienced with Python what they think of this approach.

If we go with this approach, I'm not sure if we should expose both remove_chars and multi_remove_chars in the public API, or only multi_remove_chars. I'll need to think about that one a bit.

MrPowers commented on May 18, 2024

@FavioVazquez - Yep, I agree with your feedback. Something like this might work to keep the code clean:

from functools import reduce
from pyspark.sql.functions import regexp_replace

# private single-column function
def __remove_chars(col_name, removed_chars):
    def inner(df):
        regexp = "|".join('\\{0}'.format(i) for i in removed_chars)
        return df.withColumn(col_name, regexp_replace(col_name, regexp, ""))
    return inner


# public multi-column function
def remove_chars(col_names, removed_chars):
    def inner(df):
        return reduce(
            lambda memo_df, col_name: __remove_chars(col_name, removed_chars)(memo_df),
            col_names,
            df
        )
    return inner

FavioVazquez commented on May 18, 2024

Yes, I like that. @MrPowers, finish cleaning up the PR so I can merge it, and add the documentation, assertions, and more functions. Thanks :)

MrPowers commented on May 18, 2024

@FavioVazquez - I wrote the blog post on performing operations on multiple columns in a PySpark DataFrame. Let me know what you think!

I think we should write private functions that work on a single column and then expose functions that work on multiple columns as the public API.

I am traveling to a remote part of Colombia and will be offline for the next few days. I will pick this back up next week 😄

argenisleon commented on May 18, 2024

@MrPowers @FavioVazquez

I have been working on simplifying how you can work with Optimus.
The approach of using a transform function seems fine, since you can still use the native DataFrame functions, but it seems a little verbose.

After some experimentation with class hierarchies and decorators, it seems that the decorator option is more flexible, and it was the only way I could implement chaining.

from functools import wraps  # This convenience func preserves name and docstring
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# decorator to attach a custom function to a class
def add_method(cls):
    def decorator(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            return func(self, *args, **kwargs)
        setattr(cls, func.__name__, wrapper)
        # Note we are not binding func, but wrapper which accepts self but does exactly the same as func
        return func  # returning func means func can still be used normally
    return decorator

@add_method(DataFrame)
def lower(self, columns):
    for column in columns:
        self = self.withColumn(column, F.lower(col(column)))
    return self

@add_method(DataFrame)
def upper(self, columns):
    for column in columns:
        self = self.withColumn(column, F.upper(col(column)))
    return self

@add_method(DataFrame)
def reverse(self, columns):
    for column in columns:
        self = self.withColumn(column, F.reverse(col(column)))
    return self

schema = StructType([
        StructField("city", StringType(), True),
        StructField("country", StringType(), True),
        StructField("population", IntegerType(), True)])

countries = ['Colombia', 'US@A', 'Brazil', 'Spain']
cities = ['Bogotá', 'New York', '   São Paulo   ', '~Madrid']
population = [37800000, 19795791, 12341418, 6489162]

# Create dataframe (assumes an existing SparkSession named spark)
df = spark.createDataFrame(list(zip(cities, countries, population)), schema=schema)

# Some operations on multiple columns
r = df.lower(["city", "country"]).withColumn("city", F.upper(col("city"))).reverse(["city"]).reverse(["city", "country"])
r.show()

@MrPowers I was reading your article about processing multiple columns, but I cannot figure out how to use an implementation like this with chaining.

def multi_remove_some_chars(col_names):
    def inner(df):
        for col_name in col_names:
            df = df.withColumn(
                col_name,
                remove_some_chars(col_name)  # the Column-returning helper from the blog post
            )
        return df
    return inner
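
With a quinn-style transform method it would chain fine, something like this (sketch):

df.transform(multi_remove_some_chars(["city", "country"])).show()

But I don't see how to get that same fluency out of the decorator approach alone.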

Any thoughts about this?

FavioVazquez commented on May 18, 2024

This is a great idea @argenisleon, I think we should explore this option too. I created the PR for the second version in #217. It follows some of the things @MrPowers started. @argenisleon, check the reduce function there. The chaining part is very easy with the transformer that @MrPowers created in quinn, and now we should think about how to do it here.
