
Comments (2)

ras44 commented on June 3, 2024

hi @AmyLin0515

A couple of ideas:

See the code for get_cumlift here:

def get_cumlift(
    df, outcome_col="y", treatment_col="w", treatment_effect_col="tau", random_seed=42
):
    """Get average uplifts of model estimates in cumulative population.

    If the true treatment effect is provided (e.g. in synthetic data), it's calculated
    as the mean of the true treatment effect in each of cumulative population.
    Otherwise, it's calculated as the difference between the mean outcomes of the
    treatment and control groups in each of cumulative population.

    For details, see Section 4.1 of Gutierrez and G{\'e}rardy (2016), `Causal Inference
    and Uplift Modeling: A review of the literature`.

    For the former, `treatment_effect_col` should be provided. For the latter, both
    `outcome_col` and `treatment_col` should be provided.

    Args:
        df (pandas.DataFrame): a data frame with model estimates and actual data as columns
        outcome_col (str, optional): the column name for the actual outcome
        treatment_col (str, optional): the column name for the treatment indicator (0 or 1)
        treatment_effect_col (str, optional): the column name for the true treatment effect
        random_seed (int, optional): random seed for numpy.random.rand()

    Returns:
        (pandas.DataFrame): average uplifts of model estimates in cumulative population
    """
    assert (
        (outcome_col in df.columns)
        and (treatment_col in df.columns)
        or treatment_effect_col in df.columns
    )

    df = df.copy()
    np.random.seed(random_seed)
    random_cols = []
    for i in range(10):
        random_col = "__random_{}__".format(i)
        df[random_col] = np.random.rand(df.shape[0])
        random_cols.append(random_col)

    model_names = [
        x
        for x in df.columns
        if x not in [outcome_col, treatment_col, treatment_effect_col]
    ]

    lift = []
    for i, col in enumerate(model_names):
        sorted_df = df.sort_values(col, ascending=False).reset_index(drop=True)
        sorted_df.index = sorted_df.index + 1

        if treatment_effect_col in sorted_df.columns:
            # When treatment_effect_col is given, use it to calculate the average treatment effects
            # of cumulative population.
            lift.append(sorted_df[treatment_effect_col].cumsum() / sorted_df.index)
        else:
            # When treatment_effect_col is not given, use outcome_col and treatment_col
            # to calculate the average treatment_effects of cumulative population.
            sorted_df["cumsum_tr"] = sorted_df[treatment_col].cumsum()
            sorted_df["cumsum_ct"] = sorted_df.index.values - sorted_df["cumsum_tr"]
            sorted_df["cumsum_y_tr"] = (
                sorted_df[outcome_col] * sorted_df[treatment_col]
            ).cumsum()
            sorted_df["cumsum_y_ct"] = (
                sorted_df[outcome_col] * (1 - sorted_df[treatment_col])
            ).cumsum()

            lift.append(
                sorted_df["cumsum_y_tr"] / sorted_df["cumsum_tr"]
                - sorted_df["cumsum_y_ct"] / sorted_df["cumsum_ct"]
            )

    lift = pd.concat(lift, join="inner", axis=1)
    lift.loc[0] = np.zeros((lift.shape[1],))
    lift = lift.sort_index().interpolate()

    lift.columns = model_names
    lift[RANDOM_COL] = lift[random_cols].mean(axis=1)
    lift.drop(random_cols, axis=1, inplace=True)

    return lift

  1. Note that get_cumlift always computes the lift over 10 random orderings, plus one additional ordering for every column in your input df other than outcome_col, treatment_col, and treatment_effect_col:

    for i in range(10):
        random_col = "__random_{}__".format(i)
        df[random_col] = np.random.rand(df.shape[0])
        random_cols.append(random_col)

    for i, col in enumerate(model_names):
        sorted_df = df.sort_values(col, ascending=False).reset_index(drop=True)
        sorted_df.index = sorted_df.index + 1

  2. Also, if treatment_effect_col is provided, it is used to calculate the ATE of the cumulative population:

    if treatment_effect_col in sorted_df.columns:
        # When treatment_effect_col is given, use it to calculate the average treatment effects
        # of cumulative population.

I'm not sure whether you are providing treatment_effect_col (e.g. from synthetic data), but if you are, then 2) applies.
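To make the treatment_effect_col branch concrete, here is a minimal sketch on a toy DataFrame. The column names "model" and "tau" and the values are made up for illustration; only the sort / cumsum / divide steps mirror the code quoted above.

```python
import pandas as pd

# Hypothetical toy data: one model score column and a known (synthetic) treatment effect.
df = pd.DataFrame(
    {
        "model": [0.9, 0.1, 0.5, 0.7],
        "tau": [2.0, 0.0, 1.0, 3.0],
    }
)

# Rank rows by the model score, highest first, and use a 1-based index
# so the index equals the size of the cumulative population.
sorted_df = df.sort_values("model", ascending=False).reset_index(drop=True)
sorted_df.index = sorted_df.index + 1

# Average true treatment effect among the top-k rows, for k = 1..n.
cum_ate = sorted_df["tau"].cumsum() / sorted_df.index
print(cum_ate.tolist())
```

In this branch outcome_col and treatment_col are never read, so the curve depends only on the ordering and on the values in treatment_effect_col.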

If you're not providing treatment_effect_col, then 1) still applies: the lift is computed over repeated random orderings, and the lift results are subsequently interpolated.
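A rough sketch of what the random orderings do in the outcome/treatment branch, assuming toy data with made-up values. The helper name cumlift_for_order is hypothetical and not part of causalml. Each random column yields its own ordering and its own lift curve, and the curves are averaged row-wise at the end, as in lift[random_cols].mean(axis=1) in the function above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Hypothetical toy data: random treatment assignment and a noisy outcome.
df = pd.DataFrame({"w": rng.integers(0, 2, n)})
df["y"] = 0.5 * df["w"] + rng.normal(size=n)


def cumlift_for_order(df, score):
    """Cumulative lift for one ordering (illustrative helper, not a causalml function)."""
    s = df.assign(score=score).sort_values("score", ascending=False)
    s = s.reset_index(drop=True)
    s.index = s.index + 1
    cum_tr = s["w"].cumsum()
    cum_ct = s.index.values - cum_tr
    cum_y_tr = (s["y"] * s["w"]).cumsum()
    cum_y_ct = (s["y"] * (1 - s["w"])).cumsum()
    return cum_y_tr / cum_tr - cum_y_ct / cum_ct


# Ten independent random orderings, one lift curve per ordering.
random_lifts = pd.concat(
    [cumlift_for_order(df, rng.random(n)) for _ in range(10)], axis=1
)

# The random baseline is the row-wise mean of the ten curves.
random_baseline = random_lifts.mean(axis=1)

# Every ordering covers the same rows at the end, so all curves meet there.
overall = df.loc[df["w"] == 1, "y"].mean() - df.loc[df["w"] == 0, "y"].mean()
print(random_baseline.iloc[-1], overall)
```

The ten orderings are not applied one after another, so the last random column does not override the earlier ones; each contributes one curve to the average.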

FYI, also see work in #707


AmyLin0515 commented on June 3, 2024

Hi @ras44! Thanks for the insights. I did find that the difference decreased a lot after I added 10 random columns and included them in the sort. However, I don't understand why we need to add these random columns. And if the order is eventually determined by the 10th and final random column, what is the point of adding so many of them?
