
doubleml-for-py's People

Contributors

janteichertkluge, lgtm-migrator, maltekurz, oliverschacht, philippbach, svenklaassen


doubleml-for-py's Issues

[Bug]: type casting outcome_variable and treatment_variable(s)

Describe the bug

This is more of a nitpick :) I think there is an implicit assumption that the outcome_variable and treatment_variable(s) should be of type float. So if we provide a dataframe to DoubleMLData where those variables are of type Decimal, the partialling-out step fails with the error shown below. This becomes an issue especially when reading parquet files.

TypeError                                 Traceback (most recent call last)
Cell In[36], line 1
----> 1 dml_plr.fit(n_jobs_cv = -1)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml.py:605, in DoubleML.fit(self, n_jobs_cv, store_predictions, external_predictions, store_models)
    602         ext_prediction_dict[learner] = None
    604 # ml estimation of nuisance models and computation of score elements
--> 605 score_elements, preds = self._nuisance_est(self.__smpls, n_jobs_cv,
    606                                            external_predictions=ext_prediction_dict,
    607                                            return_models=store_models)
    609 self._set_score_elements(score_elements, self._i_rep, self._i_treat)
    611 # calculate rmses and store predictions and targets of the nuisance models

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml_plr.py:231, in DoubleMLPLR._nuisance_est(self, smpls, n_jobs_cv, external_predictions, return_models)
    226     g_hat = {'preds': external_predictions['ml_g'],
    227              'targets': None,
    228              'models': None}
    229 else:
    230     # get an initial estimate for theta using the partialling out score
--> 231     psi_a = -np.multiply(d - m_hat['preds'], d - m_hat['preds'])
    232     psi_b = np.multiply(d - m_hat['preds'], y - l_hat['preds'])
    233     theta_initial = -np.nanmean(psi_b) / np.nanmean(psi_a)

TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'

Minimum reproducible code snippet

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from doubleml import DoubleMLData, DoubleMLPLR

df = pd.read_parquet("/...")

x_cols = [x for x in df.columns if "pre_" in x]
d_col = "event_action"
y_col = "post_outcome"

dml_data = DoubleMLData(df, y_col = y_col, d_cols=d_col, x_cols=x_cols)

learner = RandomForestRegressor(n_jobs = -1)
lasso = LassoCV()
dml_plr = DoubleMLPLR(dml_data, ml_l = learner, ml_g = learner, ml_m=lasso, score= "IV-type", n_folds = 2)
dml_plr.fit(n_jobs_cv = -1)

Expected Result

Model should fit successfully.

Actual Result

The same TypeError traceback as shown in the description above, ending in:

TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'

Versions

Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
DoubleML 0.7.1
Scikit-Learn 1.3.2
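Until the data backend casts such columns automatically, a possible workaround (a sketch, not an official fix) is to cast Decimal-typed outcome and treatment columns to float64 before constructing DoubleMLData:

from decimal import Decimal
import pandas as pd

# Toy frame standing in for the parquet data; the outcome and treatment
# columns arrive as Python Decimal objects (column names from the report).
df = pd.DataFrame({
    "post_outcome": [Decimal("1.5"), Decimal("2.0"), Decimal("0.7")],
    "event_action": [Decimal("0"), Decimal("1"), Decimal("0")],
    "pre_x1": [0.1, 0.2, 0.3],
})

# Cast to float64 so NumPy arithmetic in the partialling-out step works.
for col in ["post_outcome", "event_action"]:
    df[col] = df[col].astype("float64")

print(df.dtypes)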

[Feature Request]: Add Clustering for non-linear DML models

Describe the feature you want to propose or implement

Extend the NonLinearScoreMixin to handle data with clustering.
It should be sufficient to extend the estimation of the coefficient/parameter, as the variance estimation is identical.

Propose a possible solution or implementation

For dml1, just use the current implementation without clustering (should be identical).
For dml2, adjust the scaling of the score analogously to LinearScoreMixin.

Did you consider alternatives to the proposed solution. If yes, please describe

No response

Comments, context or references

No response

Intended behavior (and exception handling / warning) of repeated calls to `set_ml_nuisance_params`

  • If the function set_ml_nuisance_params('ml_g', 'd', params) is called multiple times, only the last params dict is used.
  • I think this behavior is fine, but I would add a warning for the user whenever the function has already been called with a non-empty dictionary as parameter set.
  • As an alternative behavior, we could make repeated set_ml_nuisance_params() calls combine parameter sets instead of overwriting them: add new keys to the dict and replace existing ones, as sketched below.
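A minimal sketch of the proposed combine-instead-of-overwrite behavior (plain dict semantics, not the current API):

# Params set by an earlier set_ml_nuisance_params call ...
existing_params = {'n_estimators': 100, 'max_depth': 5}
# ... and params passed in a later call.
new_params = {'max_depth': 7, 'min_samples_leaf': 2}

# Add new keys and replace existing ones.
combined_params = {**existing_params, **new_params}
# {'n_estimators': 100, 'max_depth': 7, 'min_samples_leaf': 2}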

[Feature Request]: Add Weights to IRM Model for Policy Evaluation

Describe the feature you want to propose or implement

Currently, if you want to evaluate a policy (e.g. one derived by IRM policy_tree()), the gate() method is the best call. However, this has two disadvantages: firstly, you can only have $\pi(X) \in \{0,1\}$, while a policy might be defined with $0 \leq \pi(X) \leq 1$. Secondly, the gate() function does not provide sensitivity analysis.
With this feature request I suggest adding an option weights to the IRM model.

Propose a possible solution or implementation

Allow the DoubleMLIRM object to take an $(n \times d)$ matrix of weights (one per observation per treatment) that modifies the ATE score as proposed in the Long Story Short paper to get a weighted average treatment effect. weights=None should be the default and estimate an ATE, which is equivalent to the current ATE implementation. If score='ATTE', then no weights should be allowed.
Additionally, in a later step, we might add an evaluate_policy() function that computes the policy value and changes the weights of an existing object without refitting (if possible). A sketch of how weights could enter the score is given below.
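A minimal sketch of how such weights could enter the linear score, assuming psi_b denotes the AIPW pseudo-outcome of the IRM ATE score (with psi_a = -1, cf. the individual-scores request below); the names and the simple normalization are assumptions, not the proposed implementation:

import numpy as np

rng = np.random.default_rng(42)
n = 500
psi_b = rng.normal(loc=0.5, size=n)    # stand-in for the AIPW pseudo-outcome
weights = rng.uniform(0, 1, size=n)    # pi(X) in [0, 1]; weights=None -> all ones
psi_a = -np.ones(n)

# Weighted average treatment effect: theta_w = -E[w * psi_b] / E[w * psi_a]
theta_weighted = -np.mean(weights * psi_b) / np.mean(weights * psi_a)
print(theta_weighted)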

Did you consider alternatives to the proposed solution. If yes, please describe

The alternative would be to add weights to the sensitivity_analysis() function. This, however, would be far more complex, as the coefficient is currently not recalculated, and it would furthermore change the DoubleML base class, with implications for every other model.

[Feature Request]: Request for individual scores from IRM model

Describe the feature you want to propose or implement

Hello, I am trying to take a deep dive into the ATE and ATT components of the DML-IRM model. For example, the IRM ATE score is:
$$g(1,X) - g(0,X) + \frac{D(Y-g(1,X))}{m(X)} - \frac{(1-D)(Y-g(0,X))}{1-m(X)} - \theta$$
I understand that since $\theta$ is a scalar, we just take the expectation over this score function. However, I am interested in using the individual estimates (everything except $\theta$) for CATE and HTE analysis.

Propose a possible solution or implementation

Allowing the estimated IRM model to output the individual scores.
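As a sketch under current versions (not a confirmed API), the individual scores can be rebuilt from the stored nuisance predictions of a fitted DoubleMLIRM model; the reshaping assumes n_rep = 1 and a single treatment:

import numpy as np
import doubleml as dml
from doubleml.datasets import make_irm_data
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

np.random.seed(3141)
dml_data = make_irm_data(n_obs=500)
dml_irm_obj = dml.DoubleMLIRM(dml_data, RandomForestRegressor(), RandomForestClassifier())
dml_irm_obj.fit(store_predictions=True)

# Stored predictions have shape (n_obs, n_rep, n_treat).
g0 = dml_irm_obj.predictions['ml_g0'][:, 0, 0]
g1 = dml_irm_obj.predictions['ml_g1'][:, 0, 0]
m = dml_irm_obj.predictions['ml_m'][:, 0, 0]
y = dml_data.data['y'].values
d = dml_data.data['d'].values

# Individual scores: everything in the ATE score except theta.
psi_b = g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)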

Did you consider alternatives to the proposed solution. If yes, please describe

I think these estimates are already available internally, because they can be used for CATE estimation. But that requires specifying a basis function to project the CATE estimates onto. Being able to access them directly from the fitted IRM model would be great and more flexible.

Comments, context or references

No response

[Bug]: ValueError Gram matrix in 401k example

Describe the bug

https://docs.doubleml.org/stable/examples/py_double_ml_pension.html
The example raises a ValueError in the PLR model:
ValueError: Gram matrix passed in via 'precompute' parameter did not pass validation when a single element was checked - please check that it was computed properly. For element (4,5) we computed 3375.771728515625 but the user-supplied value was 3375.773193359375.

Minimum reproducible code snippet

https://docs.doubleml.org/stable/examples/py_double_ml_pension.html

# Imports (as in the linked pension notebook)
import numpy as np
import doubleml as dml
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV, LogisticRegressionCV

# Initialize learners
Cs = 0.0001 * np.logspace(0, 4, 10)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
lasso_class = make_pipeline(StandardScaler(),
                            LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear',
                                                 Cs=Cs, max_iter=1000))
np.random.seed(123)

# Initialize DoubleMLPLR model (data_dml_base as constructed in the notebook)
dml_plr_lasso = dml.DoubleMLPLR(data_dml_base,
                                ml_g=lasso,
                                ml_m=lasso_class,
                                n_folds=3)
try:
    dml_plr_lasso.fit(store_predictions=True)
except ValueError as ve:
    print('ignore exception ValueError', ve)

Expected Result

no ValueError

Actual Result

ValueError: Gram matrix passed in via 'precompute' parameter did not pass validation when a single element was checked - please check that it was computed properly. For element (4,5) we computed 3375.771728515625 but the user-supplied value was 3375.773193359375.

Versions

Linux-5.13.0-30-generic-x86_64-with-glibc2.29
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0]
DoubleML 1.0 -> pip list says DoubleML 0.4.1
Scikit-Learn 1.0
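A possible workaround (an assumption, not a confirmed fix): disable the precomputed Gram matrix inside LassoCV, which is the piece that fails the numerical validation:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# precompute=False avoids the Gram matrix validation path entirely.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000, precompute=False))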

Support for sparse matrix

Even when I set both prediction functions to lasso, which supports sparse matrices in sklearn, the doubleml package throws an error that sparse matrices are not supported. Transforming the matrix to dense format would explode my memory. Is there any particular reason that sparse matrices cannot be used?

Python 401(k) Case Study flex model specification possible issue with .DoubleMLData object

Hello,

In the Python 401(k) Case Study, when entering the flexible model data into the dml.DoubleMLData object and then printing it, the y-variable (net_tfa) appears in the x_cols even after it was specified as y_col.

This leads to the lasso model on the flex specification not estimating the coefficient correctly. For some reason it is only an issue with the flex model, not with the base model. This is a recent issue; last week it was working properly.

See screenshot (dml_flex_issue).

[Bug]: Multi-treatment data creation bug

Describe the bug

If one wants to create a data object for a multi-treatment problem, in which a one-dimensional parameter theta_j is estimated for each treatment j, with the remaining treatments included alongside the set of covariates X, it outputs an error asking for the use of the option use_other_treat_as_covariate even though it is True by default.

Minimum reproducible code snippet

import numpy as np
import pandas as pd
import doubleml as dml

from doubleml.datasets import fetch_401K

data = fetch_401K(return_type='DataFrame')

dtypes = data.dtypes
dtypes['nifa'] = 'float64'
dtypes['net_tfa'] = 'float64'
dtypes['tw'] = 'float64'
dtypes['inc'] = 'float64'
data = data.astype(dtypes)

features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
                 'twoearn', 'db', 'pira', 'hown']

# Initialize DoubleMLData (data-backend of DoubleML)
data_dml_base = dml.DoubleMLData(data,
                                 y_col='net_tfa',
                                 d_cols=['e401', 'pira'],
                                 x_cols=features_base,
                                 use_other_treat_as_covariate=True)

Expected Result

I would expect a successful creation of the data object.

Actual Result


ValueError                                Traceback (most recent call last)
Cell In[6], line 5
      1 features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
      2                  'twoearn', 'db', 'pira', 'hown']
      4 # Initialize DoubleMLData (data-backend of DoubleML)
----> 5 data_dml_base = dml.DoubleMLData(data,
      6                                  y_col='net_tfa',
      7                                  d_cols=['e401', 'pira'],
      8                                  x_cols=features_base,
      9                                  use_other_treat_as_covariate=True)

File ~/first_env/lib/python3.8/site-packages/doubleml/double_ml_data.py:151, in DoubleMLData.__init__(self, data, y_col, d_cols, x_cols, z_cols, t_col, use_other_treat_as_covariate, force_all_x_finite)
    149 self.t_col = t_col
    150 self.x_cols = x_cols
--> 151 self._check_disjoint_sets_y_d_x_z_t()
    152 self.use_other_treat_as_covariate = use_other_treat_as_covariate
    153 self.force_all_x_finite = force_all_x_finite

File ~/first_env/lib/python3.8/site-packages/doubleml/double_ml_data.py:634, in DoubleMLData._check_disjoint_sets_y_d_x_z_t(self)
    631 # note that the line xd_list = self.x_cols + self.d_cols in method set_x_d needs adaption if an intersection of
    632 # x_cols and d_cols is allowed (see https://github.com/DoubleML/doubleml-for-py/issues/83)
    633 if not d_cols_set.isdisjoint(x_cols_set):
--> 634     raise ValueError('At least one variable/column is set as treatment variable (d_cols) and as covariate'
    635                      '(x_cols). Consider using parameter use_other_treat_as_covariate.')
    637 if self.z_cols is not None:
    638     z_cols_set = set(self.z_cols)

ValueError: At least one variable/column is set as treatment variable (d_cols) and as covariate(x_cols). Consider using parameter use_other_treat_as_covariate.

Versions

Linux-4.15.0-194-generic-x86_64-with-glibc2.17
Python 3.8.2 (default, Feb 26 2020, 14:31:49)
[GCC 6.3.0 20170516]
DoubleML 0.6.1
Scikit-Learn 1.2.2

Extensions and refinements for the trimming of propensity scores (IRM & IIVM)

Trimming as part of the ML estimation and prediction step

  • At the moment the trimming of propensity scores is part of the "score evaluation step", see

    if (self.trimming_rule == 'truncate') & (self.trimming_threshold > 0):
        m_hat[m_hat < self.trimming_threshold] = self.trimming_threshold
        m_hat[m_hat > 1 - self.trimming_threshold] = 1 - self.trimming_threshold

  • Therefore, the exported predictions in property predictions are not yet trimmed. Presumably, it would be more reasonable to do the trimming during the "ML estimation and prediction step". Otherwise users might question whether the trimming really happens.

New trimming rule 'discard'

  • Currently, we have only implemented the trimming_rule 'truncate'. As an alternative, we also want to offer the trimming_rule 'discard' (contrasted with truncation in the sketch below). For this we need to find a stable way to exclude observations from subsequent steps. Predictions can obviously just be set to np.nan; in subsequent steps these observations then need to be excluded. In the repeated cross-fitting case this can result in different numbers of observations being evaluated for different random sample splits. At the beginning we might want to avoid these technically challenging cases and only allow trimming_rule = 'discard' for n_rep == 1.
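A minimal sketch contrasting the existing 'truncate' rule with the proposed 'discard' rule on a vector of propensity estimates:

import numpy as np

m_hat = np.array([0.001, 0.30, 0.70, 0.999])
thr = 0.01

# trimming_rule='truncate': clip into [thr, 1 - thr].
truncated = np.clip(m_hat, thr, 1 - thr)

# proposed trimming_rule='discard': mark extreme observations with np.nan
# so they can be excluded from subsequent steps.
discarded = np.where((m_hat < thr) | (m_hat > 1 - thr), np.nan, m_hat)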

how to predict

Thanks for your contributions to this project. However, compared with EconML, this package lacks usage examples for prediction.

Estimators enforcing no null confounder values

I tried using the package with XGBoost to estimate the ml_g and ml_m terms. The existence of nulls is no problem for XGBoost as it is able to infer the correct branch split for null values empirically. Indeed, XGBoost is commonly used to estimate propensity scores when the confounding features are potentially missing/null.

Unfortunately, the doubleml package calls sklearn.utils.check_X_y, and is configured to throw an error if there are missing values in the confounders X. This occurs several times in
https://github.com/DoubleML/doubleml-for-py/blob/master/doubleml/double_ml_plr.py

It seems the fix is as simple as allowing users to pass the check_X_y kwarg force_all_finite=False, e.g. see https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_X_y.html. Once this is changed, I've found that an XGBoost model with missing values runs with no issue. Naturally, the response variable y still cannot have null values.
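In newer versions the data backend exposes a force_all_x_finite flag (visible in the DoubleMLData.__init__ signature quoted in the multi-treatment bug report above). A hedged sketch of using it with a NaN-aware learner; the 'allow-nan' value is an assumption borrowed from the sklearn counterpart:

import numpy as np
import pandas as pd
import doubleml as dml
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'y': rng.normal(size=n),
                   'd': rng.normal(size=n),
                   'x1': rng.normal(size=n),
                   'x2': rng.normal(size=n)})
df.loc[df.sample(frac=0.1, random_state=0).index, 'x1'] = np.nan  # nulls in X only

dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=['x1', 'x2'],
                            force_all_x_finite='allow-nan')
dml_plr = dml.DoubleMLPLR(dml_data, ml_l=XGBRegressor(), ml_m=XGBRegressor())
dml_plr.fit()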

Check estimated propensity scores

Currently we only check that predictions are finite in

def _check_finite_predictions(preds, learner, learner_name, smpls):
    test_indices = np.concatenate([test_index for _, test_index in smpls])
    if not np.all(np.isfinite(preds[test_indices])):
        raise ValueError(f'Predictions from learner {str(learner)} for {learner_name} are not finite.')
    return

We may additionally want to check that estimated probabilities are strictly in (0,1) (maybe with some eps threshold). Otherwise, the values of the score functions are likely infinite / missing. It may make sense not to fail directly but to throw a warning. This way one would, for example, still have the option to discard these observations from the estimation of the target parameter or to choose a score function where this is accounted for, e.g., trimmed.
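A minimal sketch of the suggested additional check (hypothetical helper, warning instead of failing):

import warnings
import numpy as np

def _check_prob_predictions(preds, learner_name, eps=1e-12):
    # Warn if estimated propensity scores leave (eps, 1 - eps); the score
    # function values would then be infinite or missing.
    if np.any((preds < eps) | (preds > 1 - eps)):
        warnings.warn(f'Propensity predictions for {learner_name} are outside '
                      f'({eps}, {1 - eps}); score values may be unstable.')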

[Feature Request]: Expose last stage score function estimation

Describe the feature you want to propose or implement

Hi,

I heard some time back that there were plans for the final stage score function estimation to be exposed so that we can use our own predictions from our own ML models. Has that feature been completed, and if so, are there examples on how to use it? Thanks!

In addition, I was wondering whether there is a DML multi-period diff-in-diff formulation in the literature, and if there is, where it sits on your priority list?
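For reference, recent versions do accept an external_predictions argument in fit() (it appears in the traceback of the type-casting issue above). A hedged sketch of how it might be used; the nested dict layout {treatment: {learner: array}} and the (n_obs, n_rep) shape are assumptions to be checked against the current docs:

import numpy as np
import doubleml as dml
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.linear_model import LinearRegression

np.random.seed(42)
dml_data = make_plr_CCDDHNR2018(n_obs=500)

# Stand-ins for predictions produced by your own ML models,
# one column per repetition (shape assumed to be (n_obs, n_rep)).
ext_l = np.zeros((500, 1))
ext_m = np.zeros((500, 1))

dml_plr = dml.DoubleMLPLR(dml_data, ml_l=LinearRegression(), ml_m=LinearRegression())
dml_plr.fit(external_predictions={'d': {'ml_l': ext_l, 'ml_m': ext_m}})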

Propose a possible solution or implementation

No response

Did you consider alternatives to the proposed solution. If yes, please describe

No response

Comments, context or references

No response

DoubleMLData: Add checks for the intersections of y_col, d_cols, x_cols, z_cols

  • There are no checks yet for whether variables are assigned to multiple roles (y_col, d_cols, x_cols, z_cols).
  • Before implementing the checks, we need to discuss whether and which "multiple roles" can be allowed.
  • It might make sense to first disallow most of the "multiple roles" and then re-introduce specific ones; a sketch of such a check follows below.
  • Note that this is related to the use_other_treat_as_covariate option and to #83.
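A minimal sketch of such a check (hypothetical helper, disallowing all overlaps as a first step):

def _check_disjoint_roles(y_col, d_cols, x_cols, z_cols=None):
    roles = {'y_col': {y_col}, 'd_cols': set(d_cols), 'x_cols': set(x_cols)}
    if z_cols is not None:
        roles['z_cols'] = set(z_cols)
    names = list(roles)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = roles[a] & roles[b]
            if overlap:
                raise ValueError(f'Variable(s) {sorted(overlap)} assigned '
                                 f'to both {a} and {b}.')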

[Unit Test Extension]: Extend "default setting unit tests" to methods

We do have unit tests for model defaults, see https://github.com/DoubleML/doubleml-for-py/blob/master/doubleml/tests/test_doubleml_model_defaults.py. The intention behind such "default setting unit tests" is twofold:

  1. It should assert that the defaults are valid / meaningful, i.e., the code runs through successfully with default values for the input parameters.
  2. The unit tests serve as a reminder to update the documentation of defaults in case a default value is being changed.

Currently, we only have such "default setting unit tests" for the initialization of the model classes, but it would also make sense to extend them to the most important methods.

Calculating RMSE using D and Y nuisance model residuals

Does the DoubleML package have an option to output the residuals of the nuisance models, for example when computing the RMSE for predicting D and Y, in order to compare different methods for estimating them? Maybe there is an existing code example somewhere that I couldn't find.

Thank you!
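A sketch of one way to do this from stored predictions (simulated data; recent versions also expose an rmses property directly, see the IRM RMSE issue below):

import numpy as np
import doubleml as dml
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.ensemble import RandomForestRegressor

np.random.seed(123)
dml_data = make_plr_CCDDHNR2018(n_obs=500)
dml_plr = dml.DoubleMLPLR(dml_data, RandomForestRegressor(), RandomForestRegressor())
dml_plr.fit(store_predictions=True)

y = dml_data.data['y'].values
d = dml_data.data['d'].values
res_y = y - dml_plr.predictions['ml_l'][:, 0, 0]  # residuals of E[Y|X]
res_d = d - dml_plr.predictions['ml_m'][:, 0, 0]  # residuals of E[D|X]
print(np.sqrt(np.mean(res_y ** 2)), np.sqrt(np.mean(res_d ** 2)))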

[Bug] DoubleMLPLR.tune return none

According to the docs, when setting the parameter return_tune_res=True, it should return tune_res (a list containing detailed tuning results and the proposed hyperparameters, returned if return_tune_res is True). However, right now DoubleMLPLR.tune just returns None even when return_tune_res=True.

[Bug]: KeyError in DoubleMLPLIV.fit() with multiple instruments and store_predictions=True

Describe the bug

In the case of multiple instruments, the function DoubleMLPLIV.fit() throws an error when executed with the parameter 'store_predictions=True'.

Minimum reproducible code snippet

import numpy as np
import doubleml as dml
from doubleml.datasets import make_pliv_CHS2015
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import clone
np.random.seed(3141)
learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
ml_l = clone(learner)
ml_m = clone(learner)
ml_r = clone(learner)
obj_dml_data = make_pliv_CHS2015(n_obs=500, alpha=1.0, dim_x=10, dim_z=10, return_type='DoubleMLData')
dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, ml_l, ml_m, ml_r)
dml_pliv_fit = dml_pliv_obj.fit(store_predictions=True)

Expected Result

Predictions for the whole list of learners ('params_names') are stored, i.e. for:

print(dml_pliv_obj.params_names)

['ml_l',
 'ml_r',
 'ml_m_Z1',
 'ml_m_Z2',
 'ml_m_Z3',
 'ml_m_Z4',
 'ml_m_Z5',
 'ml_m_Z6',
 'ml_m_Z7',
 'ml_m_Z8',
 'ml_m_Z9',
 'ml_m_Z10']

Actual Result

After executing the code, the following error is stated:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/sb/q_1b_jtx6_x55nw95r50s0tr0002mt/T/ipykernel_44055/2685974828.py in <module>
     11 obj_dml_data = make_pliv_CHS2015(n_obs=500, alpha=1.0, dim_x=10, dim_z=10, return_type='DoubleMLData')
     12 dml_pliv_obj = dml.DoubleMLPLIV(obj_dml_data, ml_l, ml_m, ml_r)
---> 13 dml_pliv_fit = dml_pliv_obj.fit(store_predictions=True)

/opt/anaconda3/envs/py39/lib/python3.10/site-packages/doubleml/double_ml.py in fit(self, n_jobs_cv, keep_scores, store_predictions, store_models)
    500 
    501                 if store_predictions:
--> 502                     self._store_predictions(preds['predictions'])
    503                 if store_models:
    504                     self._store_models(preds['models'])

/opt/anaconda3/envs/py39/lib/python3.10/site-packages/doubleml/double_ml.py in _store_predictions(self, preds)
   1000     def _store_predictions(self, preds):
   1001         for learner in self.params_names:
-> 1002             self._predictions[learner][:, self._i_rep, self._i_treat] = preds[learner]
   1003 
   1004     def _store_models(self, models):

KeyError: 'ml_m_Z1'

Versions

macOS-10.16-x86_64-i386-64bit
Python 3.10.6 (main, Oct 24 2022, 11:04:34) [Clang 12.0.0 ]
DoubleML 0.6.dev0
Scikit-Learn 1.1.3

[Bug]: RMSE calculation and nuisance evaluation for IRM models does not account for conditional observations

Describe the bug

The IRM model fits two versions of the learner ml_g on the conditional samples d==1 and d==0 to estimate the conditional expectations. To evaluate the out-of-sample performance, the outcome/target for each model is set to the full y, due to the implementation of dml_cv_predict. This is not correct: the evaluation should be based only on the respective conditional samples.

Minimum reproducible code snippet

import numpy as np
import doubleml as dml
from doubleml.datasets import make_irm_data
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.base import clone
np.random.seed(3141)
ml_g = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
ml_m = RandomForestClassifier(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)
obj_dml_data = make_irm_data(n_obs=500, dim_x=20)  # dim_x must be >= max_features of the forests
dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
dml_irm_obj.fit()
print(dml_irm_obj.rmses)

Expected Result

The RMSE should be calculated on the correct subsample e.g. for ml_g0:
np.sqrt(np.power(dml_irm_obj.predictions['ml_g0'][obj_dml_data.d == 0] - dml_irm_obj.nuisance_targets['ml_g0'][obj_dml_data.d == 0], 2).mean())

This would result in
1.0904733380718747

Actual Result

{'ml_g0': array([[1.20999233]]), 'ml_g1': array([[1.1650356]]), 'ml_m': array([[0.43024777]])}

Versions

not required

Issue in setter of y_col properties for objects of class DoubleMLData: Objects mutate despite the fact that a ValueError was raised

Describe the bug

Bug reported by @ShreyDixit:

Assume that one successfully initializes an object of class DoubleMLData, and then alters a property like y_col in a way that violates some basic assumptions (e.g., the same variable cannot at the same time be the outcome variable y_col and the treatment variable in d_cols). This results in a ValueError being raised. Nevertheless, the object mutates and ends up violating the basic assumption.

--> So while the ValueError is appropriately raised, the object nevertheless mutates and the y_col property is changed. The root cause is in the setter for the y_col property:

@y_col.setter
def y_col(self, value):
    reset_value = hasattr(self, '_y_col')
    if not isinstance(value, str):
        raise TypeError('The outcome variable y_col must be of str type. '
                        f'{str(value)} of type {str(type(value))} was passed.')
    if value not in self.all_variables:
        raise ValueError('Invalid outcome variable y_col. '
                         f'{value} is no data column.')
    self._y_col = value
    if reset_value:
        self._check_disjoint_sets()
        self._set_y_z()

Basically, the value shouldn't be set before all checks have been applied successfully. However, in its current form the _check_disjoint_sets() check requires that the properties have already been set. The same issue also applies to the other setters for properties like d_cols, x_cols, etc. Note, however, that this issue only becomes relevant if an object of class DoubleMLData has been initialized successfully and the user then alters one of the properties in a way that violates _check_disjoint_sets().
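One possible fix (a sketch, not the library's implementation) is to keep the old value and roll back if the disjointness check fails, so a raised ValueError leaves the object unchanged:

@y_col.setter
def y_col(self, value):
    if not isinstance(value, str):
        raise TypeError('The outcome variable y_col must be of str type. '
                        f'{str(value)} of type {str(type(value))} was passed.')
    if value not in self.all_variables:
        raise ValueError('Invalid outcome variable y_col. '
                         f'{value} is no data column.')
    reset_value = hasattr(self, '_y_col')
    old_value = self._y_col if reset_value else None
    self._y_col = value
    if reset_value:
        try:
            self._check_disjoint_sets()
        except ValueError:
            self._y_col = old_value  # roll back before re-raising
            raise
        self._set_y_z()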

Minimum reproducible code snippet

Code block 1

from doubleml.datasets import make_plr_CCDDHNR2018
dml_data = make_plr_CCDDHNR2018()
print(dml_data.y_col)
dml_data.y_col = 'd'

Code block 2

print(dml_data.y_col)

Expected Result

First code block: dml_data.y_col == 'y' and raise exception

ValueError: d cannot be set as outcome variable ``y_col`` and treatment variable in ``d_cols``.

Second code block: dml_data.y_col == 'y' should still hold.

Actual Result

First code block: dml_data.y_col == 'y' and raise exception

ValueError: d cannot be set as outcome variable ``y_col`` and treatment variable in ``d_cols``.

Second code block: dml_data.y_col == 'd'

Versions

Python 3.9.7
DoubleML 0.4.1
Scikit-Learn 1.0.1

Sphinx copy-button not working for multiline input

See e.g. basics section in user guide

In [15]: def est_ols(y, X):
   ....:     if X.ndim == 1:
   ....:         X = X.reshape(-1, 1)
   ....:     ols = LinearRegression(fit_intercept=False)
   ....:     results = ols.fit(X, y)
   ....:     theta = results.coef_
   ....:     return theta
   ....: 
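One possible fix (an assumption about the docs build, not a confirmed diagnosis) is to configure sphinx-copybutton in conf.py to strip IPython prompts and continuation markers:

# conf.py: strip "In [n]: " prompts and "....: " continuations on copy.
copybutton_prompt_text = r"In \[\d*\]: |\s*\.{3,}: "
copybutton_prompt_is_regexp = True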

Attributes of nuisance functions

Is it possible to access the attributes of the nuisance functions? For example, if the nuisance function is a RandomForestRegressor, the sklearn package allows one to access attributes such as estimators_, feature_importances_, etc. Attributes like feature_importances_ can perhaps help identify the confounding variables in the model.
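A sketch of one way to do this: fitted learners can be retrieved via fit(store_models=True); the nested dict is indexed as models[learner][treatment][i_rep][i_fold] (the same indexing pattern as in the multi-treatment shapes issue below):

import numpy as np
import doubleml as dml
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.ensemble import RandomForestRegressor

np.random.seed(1)
dml_data = make_plr_CCDDHNR2018(n_obs=200)
dml_plr = dml.DoubleMLPLR(dml_data, RandomForestRegressor(), RandomForestRegressor())
dml_plr.fit(store_models=True)

rf = dml_plr.models['ml_l']['d'][0][0]  # first repetition, first fold
print(rf.feature_importances_)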

What's the difference between double ml and double selection?

My problem is more about the theory or practice of double ML than about the package per se. I am sorry for that, but I cannot find another place to ask the question. The original paper is way beyond my capability, and I learned everything about double ML through your documentation.

The thing is that I read Belloni, Chernozhukov, and Hansen (2014 JEP) and found that, in a case similar to the partially linear regression, they recommend applying variable selection methods (lasso) to the two reduced-form equations and using all of the selected controls in the traditional estimation (OLS) of the treatment effect of interest. There is no mention of cross-fitting for this double selection method. What I wonder is: is this double selection method simply double ML without cross-fitting? And is double ML with cross-fitting strictly better than double selection in any specific cases?

One related question: I want to know how to do double ML in cases where there are some covariates that I don't want to put into the ML algorithms but want to estimate in a traditional way (like the covariates in a simple 2SLS). Using double selection, I can add them to the final OLS. But how can I do this with DoubleMLPLR? Or does adding such variables make sense at all?

Thanks in advance for any suggestions.

Issue in DoubleMLData.set_x_d if a variable is set as treatment variable and at the same time also as covariate

If a variable is present in both d_cols and x_cols, the following lines might be problematic (the duplicate variable can end up twice in the covariate array ._X):

if self.use_other_treat_as_covariate:
    xd_list = self.x_cols + self.d_cols
    xd_list.remove(treatment_var)
else:
    xd_list = self.x_cols
self._d = self.data.loc[:, treatment_var]
self._X = self.data.loc[:, xd_list]
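A sketch of one possible fix (not the library's implementation): de-duplicate while preserving order, so the shared variable enters ._X at most once:

if self.use_other_treat_as_covariate:
    # dict.fromkeys keeps the first occurrence and drops duplicates.
    xd_list = list(dict.fromkeys(self.x_cols + self.d_cols))
    xd_list.remove(treatment_var)
else:
    xd_list = self.x_cols
self._d = self.data.loc[:, treatment_var]
self._X = self.data.loc[:, xd_list]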

[Bug]: Unexpected shapes of predictions and feature importances with multiple treatment variables

Describe the bug

When I use my own data (three variables in D, four variables in X), the predictions for both "ml_l" and "ml_m" have shape (n_obs, n_rep, number of variables in D); shouldn't it be (n_obs, n_rep, 1) for "ml_l"?
Furthermore, the feature importance scores of the fitted models for both "ml_l" and "ml_m" have shape (6,); shouldn't it be (4,) in my case?
Your provided example works fine, but it only has one variable in D, so it is hard to debug; you can reproduce the behavior with my code.

I hope I didn't miss anything, but if I did, please let me know. Thanks!

Minimum reproducible code snippet

import numpy as np
import pandas as pd
import doubleml as dml
from doubleml import DoubleMLData
from xgboost import XGBRegressor

test1 = pd.DataFrame({
    'd1': np.random.randn(100),
    'd2': np.random.randn(100),
    'd3': np.random.randn(100),
    'x1': np.random.randn(100),
    'x2': np.random.randn(100),
    'x3': np.random.randn(100),
    'x4': np.random.randn(100),
    'y': np.random.randn(100)
})

obj_dml_data_from_df = DoubleMLData(test1, 'y', ["d1", "d2", "d3"])

ml_l = XGBRegressor(random_state=0)
ml_m = XGBRegressor(random_state=0)

dml_plr_obj = dml.DoubleMLPLR(obj_dml_data_from_df, ml_l, ml_m).fit(store_models=True)

print(dml_plr_obj.predictions["ml_l"].shape)
print(dml_plr_obj.predictions["ml_m"].shape)
print(dml_plr_obj.models["ml_l"]["d1"][0][0].feature_importances_.shape)
print(dml_plr_obj.models["ml_m"]["d1"][0][0].feature_importances_.shape)

Expected Result

(100, 1, 1)
(100, 1, 3)
(4,)
(4,)

Actual Result

(100, 1, 3)
(100, 1, 3)
(6,)
(6,)

Versions

Linux-5.4.0-150-generic-x86_64-with-glibc2.27
Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0]
DoubleML 0.7.1
Scikit-Learn 1.0.2

[API Documentation]: missing } in the python bib entry

Describe the issue related to the API documentation

Thanks for the package. Just wanted to report a very minor issue: you are missing a } at the end of the bib entry for Python.

@Article{DoubleML2022Python,
  title   = {{DoubleML} -- {A}n Object-Oriented Implementation of Double Machine Learning in {P}ython},
  author  = {Philipp Bach and Victor Chernozhukov and Malte S. Kurz and Martin Spindler},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {53},
  pages   = {1--6},
  url     = {http://jmlr.org/papers/v23/21-0862.html}
}   <---- this one.

Suggested alternative or fix

No response

CATE with continuous treatment and categorical outcome

I was trying to use DML with a continuous treatment (price) and a binary outcome (churn). Based on the docs, it's not possible to use any of these techniques for this case. Is there any way to adjust any of the algorithms for this setup? If there is a paper that I could implement and add to this library, I'm up for that as well.

[Feature Request]: Panel data and individual effect control

Describe the feature you want to propose or implement

Thank you for your excellent work, which is very helpful for learning and using DML. I have a question I'd like your guidance on: if I apply DML to panel data, how can individual fixed effects be controlled for? By directly generating individual dummy variables? My current data has about 30,000 individuals but only three years of observations, so it may not be appropriate to generate individual dummy variables. But I don't know how else to control for unobserved individual fixed effects. Thank you again!

Propose a possible solution or implementation

No response

Did you consider alternatives to the proposed solution. If yes, please describe

No response

Comments, context or references

No response
