Describe the bug Hi Team! I used get_cumlift(), and got the l

Why I got different lift result when using get_cumlift() and calculating line by line? (causalml, open issue)

AmyLin0515 commented on June 3, 2024

Why I got different lift result when using get_cumlift() and calculating line by line?


Comments (2)

ras44 commented on June 3, 2024

hi @AmyLin0515

A couple ideas:

See the code for get_cumlift here:

causalml/causalml/metrics/visualize.py

Lines 54 to 135 in c154afe

    def get_cumlift(
        df, outcome_col="y", treatment_col="w", treatment_effect_col="tau", random_seed=42
    ):
        """Get average uplifts of model estimates in cumulative population.

        If the true treatment effect is provided (e.g. in synthetic data), it's calculated
        as the mean of the true treatment effect in each of cumulative population.
        Otherwise, it's calculated as the difference between the mean outcomes of the
        treatment and control groups in each of cumulative population.

        For details, see Section 4.1 of Gutierrez and G{\'e}rardy (2016), `Causal Inference
        and Uplift Modeling: A review of the literature`.

        For the former, `treatment_effect_col` should be provided. For the latter, both
        `outcome_col` and `treatment_col` should be provided.

        Args:
            df (pandas.DataFrame): a data frame with model estimates and actual data as columns
            outcome_col (str, optional): the column name for the actual outcome
            treatment_col (str, optional): the column name for the treatment indicator (0 or 1)
            treatment_effect_col (str, optional): the column name for the true treatment effect
            random_seed (int, optional): random seed for numpy.random.rand()

        Returns:
            (pandas.DataFrame): average uplifts of model estimates in cumulative population
        """

        assert (
            (outcome_col in df.columns)
            and (treatment_col in df.columns)
            or treatment_effect_col in df.columns
        )

        df = df.copy()
        np.random.seed(random_seed)
        random_cols = []
        for i in range(10):
            random_col = "__random_{}__".format(i)
            df[random_col] = np.random.rand(df.shape[0])
            random_cols.append(random_col)

        model_names = [
            x
            for x in df.columns
            if x not in [outcome_col, treatment_col, treatment_effect_col]
        ]

        lift = []
        for i, col in enumerate(model_names):
            sorted_df = df.sort_values(col, ascending=False).reset_index(drop=True)
            sorted_df.index = sorted_df.index + 1

            if treatment_effect_col in sorted_df.columns:
                # When treatment_effect_col is given, use it to calculate the average treatment effects
                # of cumulative population.
                lift.append(sorted_df[treatment_effect_col].cumsum() / sorted_df.index)
            else:
                # When treatment_effect_col is not given, use outcome_col and treatment_col
                # to calculate the average treatment_effects of cumulative population.
                sorted_df["cumsum_tr"] = sorted_df[treatment_col].cumsum()
                sorted_df["cumsum_ct"] = sorted_df.index.values - sorted_df["cumsum_tr"]
                sorted_df["cumsum_y_tr"] = (
                    sorted_df[outcome_col] * sorted_df[treatment_col]
                ).cumsum()
                sorted_df["cumsum_y_ct"] = (
                    sorted_df[outcome_col] * (1 - sorted_df[treatment_col])
                ).cumsum()

                lift.append(
                    sorted_df["cumsum_y_tr"] / sorted_df["cumsum_tr"]
                    - sorted_df["cumsum_y_ct"] / sorted_df["cumsum_ct"]
                )

        lift = pd.concat(lift, join="inner", axis=1)
        lift.loc[0] = np.zeros((lift.shape[1],))
        lift = lift.sort_index().interpolate()

        lift.columns = model_names
        lift[RANDOM_COL] = lift[random_cols].mean(axis=1)
        lift.drop(random_cols, axis=1, inplace=True)

        return lift

1) Note that get_cumlift sorts the data once per column: 10 times for the random orderings it adds, plus once for every other column in your input df besides outcome_col, treatment_col, and treatment_effect_col:

causalml/causalml/metrics/visualize.py

Lines 90 to 93 in c154afe

    for i in range(10):
        random_col = "__random_{}__".format(i)
        df[random_col] = np.random.rand(df.shape[0])
        random_cols.append(random_col)

causalml/causalml/metrics/visualize.py

Lines 102 to 104 in c154afe

    for i, col in enumerate(model_names):
        sorted_df = df.sort_values(col, ascending=False).reset_index(drop=True)
        sorted_df.index = sorted_df.index + 1

2) Also, if treatment_effect_col is provided, it is used to calculate the ATE of the cumulative population:

causalml/causalml/metrics/visualize.py

Lines 106 to 108 in c154afe

    if treatment_effect_col in sorted_df.columns:
        # When treatment_effect_col is given, use it to calculate the average treatment effects
        # of cumulative population.
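To make the difference between the two branches concrete, here is a minimal sketch on toy data. The helper names `lift_from_true_effect` and `lift_from_outcomes` and all the values are hypothetical, written only to mirror the `if`/`else` branches quoted above; they are not part of causalml. Note that the default is treatment_effect_col="tau", so a df that happens to contain a column named tau would take the first branch.

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "y":     [1, 0, 1, 1, 0, 0],              # observed outcome
    "w":     [1, 0, 1, 0, 1, 0],              # treatment indicator (0 or 1)
    "tau":   [0.5, 0.4, 0.3, 0.2, 0.1, 0.0],  # "true" effect, only known for synthetic data
    "score": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],  # model estimate used for ranking
})


def lift_from_true_effect(df, score_col, tau_col="tau"):
    """Hypothetical helper: cumulative mean of the true effect, ranked by score."""
    s = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    s.index = s.index + 1  # 1-based so the index equals the cumulative population size
    return s[tau_col].cumsum() / s.index


def lift_from_outcomes(df, score_col, y_col="y", w_col="w"):
    """Hypothetical helper: treated mean minus control mean in each cumulative slice."""
    s = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    s.index = s.index + 1
    n_tr = s[w_col].cumsum()                      # treated units seen so far
    n_ct = s.index.values - n_tr                  # control units seen so far
    y_tr = (s[y_col] * s[w_col]).cumsum()         # treated outcome total so far
    y_ct = (s[y_col] * (1 - s[w_col])).cumsum()   # control outcome total so far
    return y_tr / n_tr - y_ct / n_ct


true_lift = lift_from_true_effect(df, "score")
observed_lift = lift_from_outcomes(df, "score")

print(pd.DataFrame({"true_lift": true_lift, "observed_lift": observed_lift}))
```

On this toy data the two columns disagree in every row, and the first row of `observed_lift` is NaN because no control unit has been seen yet. A line-by-line calculation that uses outcomes would therefore not match a call that silently picked up a true-effect column.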

I'm not sure whether you are providing treatment_effect_col (e.g. with synthetic data), but if you are, then 2) applies.

If you're not providing treatment_effect_col, then 1) still applies: repeated random orderings and subsequent interpolation of the lift results.
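The interpolation step is easy to miss in a manual calculation. A small sketch of what the last lines of the function do, assuming a raw lift curve whose first two rows are NaN (which is what 0 / 0 produces while one of the groups is still empty); the values and the column name `score` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Assumed raw cumulative lift for a 4-row ranked sample. Rows 1 and 2 are NaN
# because one of the two groups has not appeared yet in the cumulative slice.
raw = pd.Series([np.nan, np.nan, 0.6, 0.4], index=[1, 2, 3, 4], name="score")

lift = pd.concat([raw], join="inner", axis=1)
lift.loc[0] = np.zeros((lift.shape[1],))  # add a row 0 with lift 0 (empty population)
lift = lift.sort_index().interpolate()    # linearly fill NaNs between valid rows

print(lift)
```

Rows 1 and 2 come back as 0.2 and 0.4 rather than NaN, so the head of the returned curve can differ from a hand computation even when the later rows agree.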

FYI, also see work in #707

from causalml.

AmyLin0515 commented on June 3, 2024

Hi @ras44! Thanks for the insights. I did find that the difference decreased a lot after I added 10 random columns and included them in the sort. However, I don't understand why we need to add these random columns. And if the order ends up being determined by the final (10th) random column, what is the point of adding so many of them?
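On the question of the ten random columns: reading the function quoted above, the data is not sorted by all columns at once. Each `__random_i__` column gets its own independent sort and its own lift curve, and the ten curves are then averaged into a single baseline column, so the 10th column does not override the others. A minimal sketch of that reading, using a hypothetical `observed_lift` helper and made-up data (it does not call causalml):

```python
import numpy as np
import pandas as pd

n = 200
np.random.seed(42)
df = pd.DataFrame({
    "y": np.random.binomial(1, 0.5, n),  # made-up outcomes
    "w": np.random.binomial(1, 0.5, n),  # made-up treatment assignment
})

# Ten independent random score columns, as in the quoted loop.
random_cols = []
for i in range(10):
    col = "__random_{}__".format(i)
    df[col] = np.random.rand(n)
    random_cols.append(col)


def observed_lift(df, score_col, y_col="y", w_col="w"):
    """Hypothetical helper: cumulative treated-minus-control mean, ranked by one column."""
    s = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    s.index = s.index + 1
    n_tr = s[w_col].cumsum()
    n_ct = s.index.values - n_tr
    y_tr = (s[y_col] * s[w_col]).cumsum()
    y_ct = (s[y_col] * (1 - s[w_col])).cumsum()
    return y_tr / n_tr - y_ct / n_ct


# One curve per random column: each one is a separate sort of the same data.
curves = pd.concat([observed_lift(df, c) for c in random_cols], axis=1)
curves.columns = random_cols

# Averaging the ten curves gives a smoother random baseline than any single ordering.
baseline = curves.mean(axis=1)
print(baseline.tail())
```

Every curve ends at the same value, since the last row always covers the whole population; the orderings only change the path taken to get there, and the average smooths out that noise.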

