square / pysurvival
Open source package for Survival Analysis modeling
Home Page: https://www.pysurvival.io/
License: Apache License 2.0
The Simulations code uses the implementations in models.py for predict_survival, predict_hazard, and predict_cdf.
For predict_risk, however, it uses the implementation in simulations.py instead of the one in models.py.
The two predict_risk implementations (models.py and simulations.py) give quite different results.
The one in models.py is a sum over the cumulative hazard, whereas the one in simulations.py is a sum over weight values (or a more complex form, depending on the choice).
I wonder why this is the case.
Is there a way to know the feature importances (best fitting parameters) of the Linear and Non-Linear Multitask Logistic Regression models?
Doesn't seem to work with pip install pysurvival.
...
pysurvival/cpp_extensions/non_parametric.cpp(349): error C2065: 'M_PI': undeclared identifier
pysurvival/cpp_extensions/non_parametric.cpp(364): error C2065: 'M_PI': undeclared identifier
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe' failed with exit status 2
The error message is raised by the step function in the rprop.py file in the user's torch directory. You only need to initialize the variable that the error reports inside this function, for example step_size_min=[]
def step(self, closure=None):
    ......
    F.rprop(params,
            grads,
            prevs,
            step_sizes,
            step_size_min=step_size_min,
            step_size_max=step_size_max,
            etaminus=etaminus,
            etaplus=etaplus)
I ran a comparison from lifelines loaded data (rossi)
# lifelines
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
# load data
rossi = load_rossi()
The pysurvival Cox coefficients don't match. I repeated the comparison with scikit-survival and statsmodels, and the age effect in pysurvival is an order of magnitude larger than in the other packages.
Does the predict_survival function predict from the beginning of the employee's tenure, or does it automatically subtract the time observed so far?
I am trying to find something similar to the lifelines package, which has the following:
`predict_survival_function(X, conditional_after)`, where conditional_after takes the tenure to date into account and gives the probability of survival after that point.
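The thread does not say whether predict_survival conditions on elapsed tenure. If it does not, the conditional curve can be derived from an unconditional one with the standard identity S(t | T > s) = S(t) / S(s). The sketch below assumes you already have one subject's survival curve on a time grid; the function names and sample values are made up for illustration and are not part of pysurvival or lifelines.

```python
from bisect import bisect_right


def survival_at(times, survival, t):
    """Step-function lookup: S(t) is the value at the last grid point <= t."""
    if t < times[0]:
        return 1.0  # assume no event happens before the first grid point
    return survival[bisect_right(times, t) - 1]


def conditional_survival(times, survival, elapsed):
    """Return (time, S(time) / S(elapsed)) pairs for grid points after `elapsed`."""
    s_elapsed = survival_at(times, survival, elapsed)
    if s_elapsed <= 0.0:
        raise ValueError("S(elapsed) is zero; the conditional curve is undefined")
    return [(t, s / s_elapsed) for t, s in zip(times, survival) if t > elapsed]


# Hypothetical curve for one employee (illustrative numbers only).
times = [1.0, 2.0, 3.0, 4.0]
survival = [0.9, 0.8, 0.6, 0.3]

# Survival probabilities given that the employee has already stayed 2 time units.
print(conditional_survival(times, survival, elapsed=2.0))
```

Dividing by S(elapsed) rescales the curve so that it starts at 1 at the current tenure, which matches the meaning the question gives for conditional_after.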
I'm using the save_model function to save a ConditionalRandomSurvivalForest model for churn. When I do, I repeatedly see a warning:
python3.5/site-packages/pyarrow/pandas_compat.py:113: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly
I would like to remove this warning, but unfortunately neither the save_model func nor the random forest object's save method allows me to pass in kwargs like skipna.
Package Versions:
My best guess is that the issue is on this line:
pysurvival/pysurvival/models/__init__.py
Line 83 in dd4c5bf
Can we update save_model to accept kwargs?
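Until save_model accepts extra arguments, one possible workaround is to silence the warning around the call with the standard library's warnings module. This is a minimal sketch: call_without_future_warnings and noisy_save are hypothetical names, and noisy_save only stands in for a function that emits a FutureWarning internally.

```python
import warnings


def call_without_future_warnings(func, *args, **kwargs):
    """Call func(*args, **kwargs) with FutureWarning silenced for this call only."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return func(*args, **kwargs)


def noisy_save(path):
    """Stand-in for a save function that triggers a FutureWarning internally."""
    warnings.warn("a future version will change this default", FutureWarning)
    return path


# The warning raised inside noisy_save is not shown.
saved_path = call_without_future_warnings(noisy_save, "model.zip")
```

In the situation described above, the wrapped callable would be save_model with its usual arguments. The filter is restored when the with block exits, so other FutureWarnings in the program stay visible.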
Hi,
It would be great if you could provide a conda recipe.
I am already working on this and it is ready to be previewed and merged by the conda team here: conda-forge/staged-recipes#15709
I'll be happy to add other maintainers for that package!
Cheers.
I'm trying to install this package on Windows 10. The pip install tries to use Visual Studio 2019 to compile and link the cpp extensions and fails. Here are the relevant build commands:
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\ProgramData\Anaconda3\include -IC:\ProgramData\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /EHsc /Tppysurvival/cpp_extensions/non_parametric.cpp /Fobuild\temp.win-amd64-3.7\Release\pysurvival/cpp_extensions/non_parametric.obj -std=c++11 -O3
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
cl : Command line warning D9002 : ignoring unknown option '-O3'
non_parametric.cpp
pysurvival/cpp_extensions/non_parametric.cpp(152): warning C4554: '&': check operator precedence for possible error; use parentheses to clarify precedence
pysurvival/cpp_extensions/non_parametric.cpp(284): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data
pysurvival/cpp_extensions/non_parametric.cpp(303): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
pysurvival/cpp_extensions/non_parametric.cpp(308): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
pysurvival/cpp_extensions/non_parametric.cpp(317): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
pysurvival/cpp_extensions/non_parametric.cpp(318): warning C4554: '&': check operator precedence for possible error; use parentheses to clarify precedence
pysurvival/cpp_extensions/non_parametric.cpp(349): error C2065: 'M_PI': undeclared identifier
pysurvival/cpp_extensions/non_parametric.cpp(364): error C2065: 'M_PI': undeclared identifier
pysurvival/cpp_extensions/non_parametric.cpp(364): error C2065: 'M_PI': undeclared identifier
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\bin\HostX86\x64\cl.exe' failed with exit status 2
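The log points at two MSVC-specific problems: the GCC-style flags -std=c++11 and -O3 are ignored, and M_PI is undeclared because MSVC only defines it when _USE_MATH_DEFINES is set before the math header is included. One way a build script could account for this is sketched below. The helper name is hypothetical, and this is not taken from pysurvival's actual setup.py.

```python
import sys


def extension_build_options(platform):
    """Return compiler flags and macros for a C++ extension on `platform`.

    `platform` follows sys.platform ("win32", "darwin", "linux", ...).
    """
    if platform == "win32":
        return {
            # MSVC only defines M_PI when _USE_MATH_DEFINES is set before <cmath>.
            "define_macros": [("_USE_MATH_DEFINES", None)],
            # MSVC spelling of an optimisation flag; GCC-style flags are ignored.
            "extra_compile_args": ["/O2"],
        }
    return {
        "define_macros": [],
        "extra_compile_args": ["-std=c++11", "-O3"],
    }


options = extension_build_options(sys.platform)
```

The returned dictionary uses the keyword names that setuptools.Extension accepts, so it could be passed as Extension(name, sources, **options). Adding #define _USE_MATH_DEFINES at the top of the C++ source before the math include is the equivalent source-level fix.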
Hi, this seems like a very useful library with cool features, including the metrics concordance index, brier score etc.
For users who want to use their own model fit / estimation methods (like me), would it be a good addition to export the metrics (and maybe other functions) to work without having to specify a model argument?
Hello! It looks like the Conditional Survival Forest model I fitted ran successfully, but I'm unable to save the model or use it to predict.
I am pretty new to pysurvival. How can I visualize a forest model with other packages?
I successfully installed pysurvival (after doing brew install gcc and export CC and CXX as per instructions), but when I try to import pysurvival into jupyter notebook I get this error.
And when I try to import some pysurvival modules there is no problem.
Is there maybe some conflict between GCC and clang when compiling C++ code?
My OS is Mac OSX Catalina 10.15.1 and I am using Python 3.7.4 but Python 2.7 also exist on my computer.
I could be wrong about this, as I am new to the survival analysis literature, but my understanding is that time-varying covariates must be given special treatment in any survival analysis:
https://www.jstor.org/stable/pdf/27643698.pdf?ab_segments=0%252Fbasic_SYC-5187_SYC-5188%252F5187&refreqid=excelsior%3A0aaa616b818456a7135b22942f8307e0
https://lifelines.readthedocs.io/en/latest/Time%20varying%20survival%20regression.html
https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf
https://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.20.1.145
However, the tutorials for this project use time-varying covariates as if they are fixed over time:
https://square.github.io/pysurvival/tutorials/employee_retention.html
Is this problematic ?
Hello,
I have following class
from pysurvival.models.multi_task import NeuralMultiTaskModel
import joblib
import numpy as np
from ml_models.preprocessing.one_hot_encoding import PreProcessingWOneHot
from ml_models.templates.model import Model
from pysurvival.utils import save_model, load_model
from util import get_my_logger
class PySurvival(Model):
    def __init__(self):
        super().__init__()
        self.pre_processing = PreProcessingWOneHot()
    def build_survival_model(self, parm: dict, row_data: dict, target: np.array) -> (np.ndarray, np.ndarray, np.float64, np.float64, np.ndarray):
        """
        Args:
            row_data: list
            target: np.array
            parm: what parameters that I need to pass to models
        Returns:
            hold_out_y: target variable for hold out set
            predict_proba: predicted score on hold out set
            logloss: logloss on hold out
            auc: auc on hold out
            feature_score_array: feature score of features
        """
        structure = parm.pop('structure')
        self.model_instance = NeuralMultiTaskModel(bins=parm.pop('bins'), structure=structure)
        self.logger.info("building training data started")
        train_x, time, event = self.build_data_survival(row_data, target)
        self.logger.info("time is {0}".format(time[:10]))
        self.logger.info("time is {0}".format(event[:10]))
        self.logger.info("target is {0}".format(target[:10]))
        self.logger.info("final model building starting")
        self.model_instance.fit(train_x, time, event, **parm)
        hazard, density, survival = self.model_instance.predict(train_x)
        risk = self.model_instance.predict_risk(train_x)
        return {"time": time, "event": event, "hazard": hazard, "density": density, "survival": survival, "risk": risk}
In another file I am importing that class as follows:
from ml_models.templates.py_survival.py_survival_model import PySurvival
However, this doesn't work; it throws the following error:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
However, it works if I import pysurvival before importing the class, as follows:
import pysurvival
from ml_models.templates.py_survival.py_survival_model import PySurvival
Do you know what is happening?
Any help is appreciated.
This is a great package. Thank you for making it open source.
Anyone out there?
There seems to be a mistake in calculating risk_score for Cox.
In "models/semi_parametric.py", lines 308-312:
risk_score = np.exp(np.dot(x, self.weights))
if not use_log:
    risk_score = np.exp(risk_score)
return risk_score
There is a redundant "np.exp()" in line 308.
The logged risk score should just be "np.dot(x, self.weights)".
I hope you can check it soon. Please tell me if I have misunderstood it.
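If the issue's reading is right, the intended behaviour is that the log risk is the linear predictor and the plain risk is its exponential. The following standalone sketch shows that logic in pure Python for a single sample. It is an illustration of the proposal, not the library's code, and cox_risk_score is a made-up name.

```python
import math


def cox_risk_score(x, weights, use_log=False):
    """Cox risk for one sample: exp(x . w), or x . w when use_log is True."""
    log_risk = sum(xi * wi for xi, wi in zip(x, weights))
    # Exponentiate exactly once, and only when the plain risk is requested.
    return log_risk if use_log else math.exp(log_risk)


x = [1.0, 2.0]
weights = [0.5, -0.25]
log_score = cox_risk_score(x, weights, use_log=True)
score = cox_risk_score(x, weights)
```

Returning the linear predictor directly for the log case also avoids exponentiating and then taking a logarithm, which matters when the linear predictor is large.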
The code currently works as follows:
risk = model.predict_risk(X)
if use_log:
    risk = np.log(risk)
However, if model.predict_risk(X) returns inf values, they will still be inf after applying np.log(risk). If the values are calculated in log space from the beginning, inf values are less likely to appear.
Maybe use something like
if use_log:
    risk = model.predict_risk(X, use_log=True)
else:
    risk = model.predict_risk(X, use_log=False)
Does anyone know how to tune the Random Survival Forests model?
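The thread does not point to a built-in tuning utility, so one option is a small manual grid search around your own train-and-score routine. The sketch below is generic: grid_search, toy_score, and the parameter names in the grid are placeholders, and fit_and_score is assumed to train a model and return a validation score where higher is better, such as a concordance index.

```python
from itertools import product


def grid_search(param_grid, fit_and_score):
    """Try every combination in param_grid; return (best_params, best_score).

    fit_and_score(params) must train a model and return a score where
    higher is better, e.g. a concordance index on a validation split.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the parameter names are placeholders for your model's own.
grid = {"num_trees": [50, 100, 200], "max_depth": [3, 5, 8]}


def toy_score(params):
    """Stand-in scorer so the sketch runs without data; replace with a real one."""
    return -abs(params["num_trees"] - 100) - abs(params["max_depth"] - 5)


best_params, best_score = grid_search(grid, toy_score)
```

In practice, the scorer would fit the forest on a training split with the given parameters and evaluate it on held-out data, so that the selected combination is not chosen on the data it was trained on.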
Hey,
When I run the code and follow along with the tutorial (https://square.github.io/pysurvival/tutorials/credit_risk.html), I'm confused, especially at the end.
I have few questions:
Are these 2 graphs similar? (same y-axis) Because I'm not sure what is the y-axis in the 2nd graph...
How is it possible that the high-risk line is higher than the low risk? Does it mean, he repays faster?
What is the "actual time", called T the one which is represented in the 2nd graphs?
Can you help me with that, please? It's for a school project.
Thanks for your consideration and have a good day!
Hi, there is a bug when we turn off the auto_scaler.
Sample code:
coxph = NonLinearCoxPHModel(auto_scaler=False)
coxph.fit(feature_train, time_train, event_train)
UnboundLocalError Traceback (most recent call last)
in
1 coxph = NonLinearCoxPHModel(auto_scaler=False)
----> 2 coxph.fit(feature_train, time_train, event_train)
/opt/conda/lib/python3.7/site-packages/pysurvival/models/semi_parametric.py in fit(self, X, T, E, init_method, optimizer, lr, num_epochs, dropout, batch_normalization, bn_and_dropout, l2_reg, verbose)
608 T = T[order]
609 E = E[order]
--> 610 X_original = X_original[order, :]
611 self.times = np.unique(T[E.astype(bool)])
612 self.nb_times = len(self.times)
UnboundLocalError: local variable 'X_original' referenced before assignment
Fix suggestion (starting from line 602):
# Scaling data
if self.auto_scaler:
    X_original = self.scaler.fit_transform( X )
else:
    X_original = X
There is a problem with installing pysurvival under python 3.9. (Maybe #39 is related)
The solution is to recompile the cython module with a recent cython version.
I did this here: https://github.com/berleon/pysurvival
Best,
Leon
I tried to fit CoxPHModel with my own dataset. I am sure that the format for X, T, and E vectors are correct. However, I got "AttributeError: The time axis needs to be created before using the method get_time_buckets." after optimization reached max number of iterations. How can I solve this problem?
Hello,
Thank you for your great package.
I would like to know why no hyperparameter tuning is performed for any of the models, and whether you could add it to one of the methods, such as DeepSurv, so that we could do it for the other methods ourselves.
Thank you in advance,
Afshin
I don't fully understand how the salary feature is handled in the Employee Retention. There appears to be an ordinal with 3 categories: low, medium and high. What happens here is that:
The salary feature is one-hot encoded - Why wouldn't an ordinal encoding work here, considering the tree model?
The correlation is then tested on the "low" and "medium" columns, which is very negative - Isn't this quite expected, considering it's a categorical feature?
The "low" column is dropped - Doesn't that mean that we effectively grouped "high" and "low" salary together?
https://github.com/square/pysurvival/blob/master/pysurvival/models/semi_parametric.py#L395
This line calculates the events that occur at a given time.
It should be index_fail = np.argwhere( self.times == T[i] ).flatten()
index_fail = np.argwhere( self.times == T[i] )[0] only considers a single event at the given time. We should consider all ties.
I recently bought a MacBook Pro 2 GHz Intel Core i5, and RAM of 16Gb.
I have installed Python's last version 3.9.0.
I am trying to install PySurvival on my new laptop following the usual procedure, however, I get a lot of error messages when running the command
pip3 install pysurvival
How could I solve this issue please?
ERROR: Command errored out with exit status 1:
command: /usr/local/opt/python@3.9/bin/python3.9 /usr/local/lib/python3.9/site-packages/pip install --ignore-installed --no-user --prefix /private/var/folders/ys/zf1_7nr579qfkdcvpjnwhnv00000gn/T/pip-build-env-gazj7qjl/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'cython >= 0.29' 'numpy==1.14.5; python_version<'"'"'3.7'"'"'' 'numpy==1.16.0; python_version>='"'"'3.7'"'"'' setuptools setuptools_scm wheel
cwd: None
Complete output (3595 lines):
Ignoring numpy: markers 'python_version < "3.7"' don't match your environment
Collecting cython>=0.29
Using cached Cython-0.29.21-py2.py3-none-any.whl (974 kB)
Collecting numpy==1.16.0
Using cached numpy-1.16.0.zip (5.1 MB)
Collecting setuptools
Using cached setuptools-50.3.2-py3-none-any.whl (785 kB)
Collecting setuptools_scm
Using cached setuptools_scm-4.1.2-py2.py3-none-any.whl (27 kB)
Collecting wheel
Using cached wheel-0.35.1-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: numpy
Building wheel for numpy (setup.py): started
Building wheel for numpy (setup.py): still running...
Building wheel for numpy (setup.py): finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /usr/local/opt/python@3.9/bin/python3.9 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/ys/zf1_7nr579qfkdcvpjnwhnv00000gn/T/pip-install-ts_1r59v/numpy/setup.py'"'"'; __file__='"'"'/private/var/folders/ys/zf1_7nr579qfkdcvpjnwhnv00000gn/T/pip-install-ts_1r59v/numpy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/ys/zf1_7nr579qfkdcvpjnwhnv00000gn/T/pip-wheel-obu2ycmj
cwd: /private/var/folders/ys/zf1_7nr579qfkdcvpjnwhnv00000gn/T/pip-install-ts_1r59v/numpy/
Complete output (3194 lines):
Running from numpy source directory.
/private/var/folders/ys/zf1_7nr579qfkdcvpjnwhnv00000gn/T/pip-install-ts_1r59v/numpy/numpy/distutils/misc_util.py:476: SyntaxWarning: "is" with a literal. Did you mean "=="?
return is_string(s) and ('*' in s or '?' is s)
blas_opt_info:
blas_mkl_info:
customize UnixCCompiler
libraries mkl_rt not found in ['/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib', '/usr/local/lib', '/usr/lib']
NOT AVAILABLE
blis_info:
customize UnixCCompiler
libraries blis not found in ['/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib', '/usr/local/lib', '/usr/lib']
NOT AVAILABLE
I've installed pysurvival using brew for gcc and pip on my MacBook Pro (macOS 10.13.6) and been able to train a RSF model in a Jupyter Notebook (though this took several minutes of high CPU activity). The training data has around 70 factors and 5000 rows.
I'm now trying to work with the model but when I call e.g.
risks = rsf.predict_risk(X_test)
the notebook just hangs indefinitely with no sign of CPU activity.
Can MTLR or other models handle multilabel classification?
Hi,
Currently the survival curve given by the model is limited to the range of the time buckets. If I give a time that is outside the time buckets, the survival value is just the value predicted for the last bucket, so the survival probability basically flatlines beyond the range. Is there a way to extrapolate the model beyond it?
Thanks
Is there a way, given a previously trained model, to interpret the survival prediction such as Lime or Shap?
Does this package address class/label imbalance?
There might be typos in utils.metrics at lines 340 and 348 (compare_to_actual function).
I assume this should be:
(340) results['median_absolute_error'] = med_ae instead of results['median_absolute_error'] = rmse
(348) results['mean_absolute_error'] = mae instead of results['mean_absolute_error'] = rmse
Thanks for the great repo !
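To make the reported mix-up concrete, here is a small standalone sketch that computes the three error measures and stores each under its own key. The two absolute-error key names follow the issue; the function name and the RMSE key are illustrative, and this is not the library's compare_to_actual implementation.

```python
import math
from statistics import mean, median


def error_metrics(actual, predicted):
    """Return RMSE, mean absolute error and median absolute error in one dict."""
    errors = [p - a for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    results = {}
    results["root_mean_squared_error"] = math.sqrt(mean(e * e for e in errors))
    # Each key gets its own statistic rather than reusing the RMSE value.
    results["median_absolute_error"] = median(abs_errors)
    results["mean_absolute_error"] = mean(abs_errors)
    return results


metrics = error_metrics([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 5.0])
```

With a single large outlier, as in the sample input, the three values differ noticeably, which is why reporting RMSE under all three keys would be misleading.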
I have a df with 40k rows and 21 variables. I am following the Churn prediction tutorial. csf_fit() works fine and takes 45min to run. But when I then run concordance_index() my session crashes and I lose my csf object.
I was able to reproduce the issue by running the example code for Conditional Survival Forest (CSF) but by increasing the N and number of features to:
# Generating N random samples
N = 10000
dataset = sim.generate_data(num_samples = N, num_features=6)
I used the environment that the following Dockerfile provides:
FROM jupyter/scipy-notebook
RUN conda update -n base conda
RUN conda install pytorch-cpu torchvision-cpu -c pytorch
RUN conda install matplotlib pandas scikit-learn pyarrow progressbar scipy boost
RUN pip install --upgrade pip \
&& pip install pysurvival
I have noticed that PySurvival does not really follow the principles of scikit-learn, starting with the fact that you input X, T, E instead of X, y. Furthermore, GridSearchCV cannot be used, both because of the aforementioned problem and because there is no set_params method in the model objects (see also scikit-learn's Pipeline, which only works after extensive reworking of many classes and functions in scikit-learn). It is very unfortunate, I think, that this great package stays outside of sklearn. Is there any plan to fix this and make PySurvival connectable to scikit-learn? Or am I missing something?
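One way to bridge the gap without changing the library is a thin adapter that exposes fit(X, y), get_params, and set_params around a model whose fit takes X, T, E. The sketch below shows only the shape of such an adapter. XyAdapter, model_factory, and RecordingModel are hypothetical names, y is assumed to be a sequence of (time, event) pairs, and the class has not been verified against GridSearchCV or Pipeline.

```python
class XyAdapter:
    """Wrap a model with fit(X, T, E) behind a scikit-learn style fit(X, y).

    `y` is a sequence of (time, event) pairs. `model_factory` and all names
    here are hypothetical; this is not part of PySurvival or scikit-learn.
    """

    def __init__(self, model_factory, **params):
        self.model_factory = model_factory
        self.params = dict(params)
        self.model_ = None

    def get_params(self, deep=True):
        return {"model_factory": self.model_factory, **self.params}

    def set_params(self, **params):
        self.model_factory = params.pop("model_factory", self.model_factory)
        self.params.update(params)
        return self

    def fit(self, X, y):
        # Split the combined target back into the separate T and E vectors.
        times = [t for t, _ in y]
        events = [e for _, e in y]
        self.model_ = self.model_factory(**self.params)
        self.model_.fit(X, times, events)
        return self


class RecordingModel:
    """Stand-in for a survival model; it only records what fit() received."""

    def __init__(self, num_trees=10):
        self.num_trees = num_trees

    def fit(self, X, T, E):
        self.seen = (X, T, E)
        return self


adapter = XyAdapter(RecordingModel, num_trees=10).set_params(num_trees=50)
adapter.fit([[0.1], [0.2]], [(5.0, 1), (7.0, 0)])
```

A real adapter would also need predict or score methods that map onto the survival model's outputs, which depends on the metric being optimised.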
Hello! There were several approximations proposed in the original paper here, but which method was used to approximate the p-value for Conditional Survival Forests? Additionally, is the final p-value used for comparison corrected for multiple hypothesis testing using Bonferroni's correction? Thanks ahead.
Hello,
For quite some time I have been unable to fit a LogNormal AFT model: no matter what learning rate or optimizer I select, the gradient always explodes... Could there be something specific to this method here?
Thanks in advance for your answer!
Hi, thanks for this package! It looks great for time-to-event modeling. I may have missed something in the documentation, but do the MTLR estimators support time-varying covariates, and if so, how would I set up the data to train such a model?
In the sklearn API for tree-based models there's a method "apply(self, X)" that returns leaf indices by applying the trees in the forest to X. It is very useful for diagnosing which elements ended up in the same leaf node and for clustering them.
Is it possible to add this functionality?
I've had some success using other methods to predict survival for my business problem. However, I can never get either of the multi class methods to work. I'm met each time with "The gradient exploded... You should reduce the learningrate (lr) of your optimizer". I've tried extremely small learning rates, but still get the same result.
Hi,
When I try to copy the code and run it from the tutorial page https://square.github.io/pysurvival/tutorials/credit_risk.html , the pysurvival package is successfully installed. But when I run this code:
from pysurvival.utils.display import correlation_matrix
it shows the following error message:
ImportError: dlopen(/Users/ju/opt/anaconda3/lib/python3.7/site-packages/pysurvival/models/_non_parametric.cpython-37m-darwin.so, 2): Symbol not found: _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE9_M_assignERKS4
Referenced from: /Users/ju/opt/anaconda3/lib/python3.7/site-packages/pysurvival/models/_non_parametric.cpython-37m-darwin.so
Expected in: /usr/lib/libstdc++.6.dylib
in /Users/ju/opt/anaconda3/lib/python3.7/site-packages/pysurvival/models/_non_parametric.cpython-37m-darwin.so
After that, I uninstalled and reinstalled pysurvival several times, both from Spyder/Anaconda and from the command line, but it still failed.
My laptop is MacBook Pro (13-inch, 2019, Two Thunderbolt 3 ports), version 10.15.7 (19H2)
IDE: Spyder
How can I solve it?
Thank you!
Will this work on Windows?
Is it possible to implement calibration plots with probabilities like in scikit-learn?
The offending line is here: https://github.com/square/pysurvival/blob/master/pysurvival/utils/metrics.py#L340
And 8 lines below it. May I submit a quick fix? Not sure what the process is for contributing but I'd love to help :)
Hi, for the MTLR and RandomSurvivalForest models I get different estimates for survival probabilities on each run.
Is there any parameter to regulate training?
To train any forest model with categorical variables, they first need to be converted to dummy variables. After training, how does one get a single feature importance per original variable, instead of one for each category level?
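One simple convention, which is an assumption here and not a method documented for the package, is to sum the importances of all dummy columns that came from the same source variable. The sketch below assumes the dummy columns are named with the source variable as a prefix followed by a separator; the function name and the sample values are made up.

```python
from collections import defaultdict


def aggregate_importance(importances, categorical, sep="_"):
    """Sum dummy-column importances back onto their source variable.

    importances: dict mapping column name -> importance.
    categorical: names of the variables that were dummy-encoded.
    """
    totals = defaultdict(float)
    for column, value in importances.items():
        source = next(
            (name for name in categorical if column.startswith(name + sep)),
            column,  # not a dummy column: keep it under its own name
        )
        totals[source] += value
    return dict(totals)


# Hypothetical importances, e.g. copied from a fitted forest's importance table.
importances = {"salary_low": 0.10, "salary_medium": 0.05, "age": 0.30}
print(aggregate_importance(importances, categorical=["salary"]))
```

Summing favours variables with many levels, so treat the totals as a rough ranking. Permuting all dummy columns of a variable together and measuring the drop in performance is a more careful alternative.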
Hey
I would like to fix the time buckets before training, but I don't think the library supports that for now.
I tried setting the attributes times and time_buckets, but that doesn't work, as the model calculates its own ones during training.
I think it would be useful to be able to specify them.