
wtte-rnn's People

Contributors

achimnol, adrysn, dakl, hfjn, houssem1995, inureyes, ragulpr, seniormeow, theclaymethod


wtte-rnn's Issues

Loss dropping to 1e-6

Hi,
thank you for your framework. I am trying to use it for charge-event prediction at charging stations.
For this, I have downsampled the data to 5-minute steps and pass up to one week of history to the network.

This is my network (copied from one of your examples):

reduce_lr = callbacks.ReduceLROnPlateau(monitor='loss', factor  =0.5, patience=50, verbose=0, mode='auto', min_delta=0.0001, 
                                        cooldown=0, min_lr=1e-8)

nanterminator = callbacks.TerminateOnNaN()

model = Sequential()
#Dont need to specify input_shape=(n_timesteps, n_features) since keras uses dynamic rnn by default
model.add(CuDNNLSTM(10, input_shape=(None, n_features),return_sequences=True))
model.add(BatchNormalization(axis=-1, momentum=0.95, epsilon=0.01))
model.add(CuDNNLSTM(20, return_sequences=True))
model.add(BatchNormalization(axis=-1, momentum=0.95, epsilon=0.01))
model.add(GaussianDropout(0.05))
model.add(CuDNNLSTM(5, return_sequences=True))
model.add(Dense(2))
model.add(Lambda(wtte.output_lambda, 
                 arguments={"init_alpha":init_alpha, 
                            "max_beta_value":4.0,
                            "scalefactor":1/np.log(5)}))  

loss = wtte.Loss(kind='discrete').loss_function

However, if I train the network, the loss sometimes drops to 1e-6:

Train on 12384 samples, validate on 12384 samples
Epoch 1/20
12384/12384 [==============================] - 5s 403us/step - loss: 0.1205 - val_loss: 1.0000e-06
Epoch 2/20
  927/12384 [=>............................] - ETA: 1s - loss: 1.0000e-06
C:\Python36\python-3.6.2.amd64\lib\site-packages\numpy\core\_methods.py:29: RuntimeWarning:

invalid value encountered in reduce

C:\Python36\python-3.6.2.amd64\lib\site-packages\numpy\core\_methods.py:26: RuntimeWarning:

invalid value encountered in reduce

11433/12384 [==========================>...] - ETA: 0s - loss: 1.0000e-06

Once the loss reaches this value, it does not change anymore. I guess there is something wrong with the data?

Sometimes it helps to change the kind of the loss function from discrete to continuous (by the way, when should I use discrete and when continuous?). What is the general problem here? I tried to reduce the network size, but I had no success.
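A minimal sketch of the two loss variants, assuming the wtte 1.1-style API used above: 'discrete' fits TTE counted in whole timesteps (such as these 5-minute bins), while 'continuous' fits real-valued durations.

import wtte.wtte as wtte

# Sketch, assuming the Loss API shown above: choose the kind to match how
# the TTE column was computed.
loss_discrete = wtte.Loss(kind='discrete').loss_function      # TTE in whole steps
loss_continuous = wtte.Loss(kind='continuous').loss_function  # real-valued TTE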

Korean translation of README.md

  • Before introducing the work, we need to prepare the potential audience (they are mostly Korean speakers, on Thursday).

  • Prepare material that makes the TTE, WTTE, and WTTE-RNN concepts easier to understand.

  • Translate the basic README.md documentation.

  • Add it as README.ko.md (the naming scheme for GitHub multilingual support is not fixed yet and still under discussion, but this one is the de facto standard).

Continuous WTTE-RNN case for churn prediction

Is it possible to use WTTE-RNN when time is continuous or almost continuous?
My time scale is seconds and I am trying to predict the next user session, and I am not sure how to deal with TTE in that case.
Should there be one step per second? That would be a very high-dimensional tensor with very sparse events. Could WTTE-RNN deal with such a scenario?
What about dividing the TTE between events into a fixed number of intervals (e.g. between 5010 sec and 10 sec, 10 intervals would give 500-sec steps)? In that case the TTE step would differ between events.
Now I'm thinking about switching to days but it's more natural to use sessions in my case because there could be about 20 user sessions per day.
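One possible way to keep the discrete formulation without one-second steps is to bucket the raw timestamps into coarser bins first; a rough pandas sketch with hypothetical data and column names:

import pandas as pd

# Hedged sketch (hypothetical data): bucket second-resolution session starts
# into daily timesteps, so TTE becomes "whole days until the next session"
# and the discrete loss applies.
sessions = pd.DataFrame({'user_id': [1, 1, 2],
                         'start_seconds': [3600, 90000, 200000]})
sessions['day'] = (sessions['start_seconds'] // 86400).astype(int)
daily = (sessions.groupby(['user_id', 'day'])
                 .size()
                 .rename('n_sessions')
                 .reset_index())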

Invalid Loss error when training on the GPU

What could be the reason for the Invalid Loss error to be present in GPU training and not in the CPU training?

I've successfully trained the WTTE-RNN algorithm on a CPU using a GRU RNN on the C-MAPSS dataset. However, when doing the same on a Nvidia GPU with CuDNNGRU I am getting the Invalid Loss error at around 20/100 epochs.

I am using Keras with Tensorflow backend. And the WTTE-RNN version is 1.1.1.

Is it applicable for my dataset

I have a dataset and think that this method could do the job.
The main problem is that my target variable is only 0 and 1 (machine NORMAL or FAIL).
Could you please reply whether wtte-rnn is applicable to my problem?

High-level Questions

Hello @ragulpr !

Great work on this, very compelling. I have 2 higher-level questions (to potentially be followed up by other questions once I'm sure I understand the concept).

  1. We only use groups that have the event we're interested in modeling, correct? For instance, if I'm building a churn model, the event I'm modeling could be something like clicking the 'Cancel' button on the web page; should my training set then only include groups that have done that event at least once, or not?

  2. We pick a time frame according to our resolution and, within each sequence, we censor the data beyond that time period? So if I'm interested in people who will churn within the next 14 days, I set max_time=14 (assuming my data is at day resolution) and at each sequence/step everything > 14 days ahead gets censored (features and target turned to 0), right?

Thanks in advance for the insights. Looking forward to trying this out!

Log-likelihood for discrete Weibull distribution

There seems to be an issue with log-likelihood for discrete Weibull distribution with censored data (u=0).
According to equation (2.7) in Proposition 2.26 of your great thesis, the likelihood in this case is
L_d = Pr(T_d > t) = Pr(T >= t+1) for t in {0,1,2,...}
However, I do believe that it should be
L_d = Pr(T_d >= t) = Pr(T >= t) for t in {0,1,2,...}
[Sorry, I have found no way to use TeX here]

The argument is as follows. Assume u=0 and tte=0 for some fixed day. This means the next event might occur on any day after that fixed day, so the probability should be equal to 1. In your case it is strictly lower than 1.
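For concreteness, the two candidate censored contributions written out (my own rendering in TeX-style notation, assuming the discretization T_d = \lfloor T \rfloor of a continuous Weibull T with parameters \alpha, \beta):

% T_d = \lfloor T \rfloor,  T ~ Weibull(\alpha, \beta)
\Pr(T_d \ge t) = \Pr(T \ge t)     = \exp\big(-(t/\alpha)^{\beta}\big)
\Pr(T_d > t)   = \Pr(T \ge t + 1) = \exp\big(-((t+1)/\alpha)^{\beta}\big)
% At t = 0 the first expression equals 1 while the second is strictly below 1,
% which is the point the issue makes.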

Question about preprocessing functions

Hi,

I've two questions regarding the preprocessing functions:

  • Regarding prep_tensors - the lines
y  = y[:,1:]
x  = np.roll(x, shift=1, axis=1)[:,1:,]

Simply throw away the first event, right? Is this a necessity? In my data, a significant portion of the churners churn at the beginning, and I'd be happy to try to predict these as well.

  • Regarding the nanmask_to_keras_mask function: As far as I understand, the y variable returned by this function is of dimension (n_subjects,t_timesteps,2), such that y[i] is the matrix whose rows are the different times and its columns are time-to-event and censoring indicator, respectively, for subject i. In my data, each subject is either churned or not churned (no recurrent events). This means that for each subject, the second column is either all ones (if it's a churned subject) or all zeros (if it's a censored subject); this, of course, without taking into account the 0.95 mask. Is this the correct input format for training the model?

Should be >= K.epsilon()

At this line

if K.epsilon() <= 1e-07 and K.backend() == 'tensorflow':
I see you are checking whether K.epsilon() is <= 1e-7 and recommending a smaller value, but then if you set it to 1e-08 it will still give the warning. You should change the operator to >=.
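A minimal sketch of the proposed change (the warning text below is illustrative, not the library's actual message):

from keras import backend as K

# Warn only when epsilon is too LARGE, so setting K.set_epsilon(1e-8) no
# longer triggers the warning.
if K.epsilon() >= 1e-07 and K.backend() == 'tensorflow':
    print('Warning: consider a smaller epsilon, e.g. K.set_epsilon(1e-8).')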

Can't reproduce the results of standalone_simple_example.ipynb

Hi,

I'm trying to reproduce the results in examples/keras/standalone_simple_example.ipynb, but the generated data are different from what is shown in this notebook.

Below are the data plotted in my local notebook. I use Python 3.5 and Windows 7 OS. With these data sets, the neural network doesn't learn at all.

Another notebook simple_example.ipynb doesn't have this issue.

(screenshots of the locally generated data omitted)

Simplified jupyter notebook examples

  • Currently, the example Jupyter notebooks contain a lot of information.
  • This can confuse users, and it also inflates the total package size.

Suggestion

  • Make a "Hello world" example notebook with a tiny dataset (or a dataset generated from embedded code)
  • Create a new repo for examples (needs more discussion; the best option is keeping one big repo without diverging it, but we need to decrease the total package size)
  • Gather good WTTE-RNN integrations / examples already on the web. This could become a separate issue about updating README.md and the GitHub Wiki.

Event with duration

In the lecture/blog all events are described as a single point in time. How can I go about modelling an event that has a duration? The only thing I can think of is to separately model the start and stop times, but I'm not sure the model would be able to learn that the start time and stop time belong to each other.

Another thing I was thinking about is whether anything can be said about how the influence changes during the event. What do I mean by that? Suppose we take the "influence of the event on the churn rate" as a curve over the duration of the event, and the area under the curve as the total influence. Depending on the type of event, the influence can be biased more towards the start or the end of the event.

Suppose your business is a bank. Your customer has savings and you want to predict whether the customer keeps a lot (say more than 1000 dollars) of savings with your bank, or churns when they fall below that. Now your customer has taken a mortgage with your bank, which is a one-time event, but the customer is now in the state "having a mortgage" (with a duration).

Maybe in the first week after taking the mortgage, because the customer is already dealing with financial matters, they will reconsider where to keep their savings, so the influence on the churn decision is high.

Maybe in the 2 to 5 years after that it's likely that the customer keeps their savings in your bank, because they won't switch mortgages now; this feature keeps the customer satisfied with the bank in general and thus keeps the savings there as well. Now this feature makes the customer less likely to churn.

What also makes this more complicated is that several events might be going on at the same time, raising the question of how one ongoing event influences another. Could that be extracted from the model, or would it need to be retrained for such a question (because now we are looking at events influencing each other and not at churn anymore)?

Problems replicating the C-MAPSS data score

Dear Egil,

First, I would like to thank you for your great work! I am in the process of implementing your approach in a distributed collaborative learning platform for turbine prognostics and I was trying first to replicate your results.

I have noticed that in the uploaded custom activation function, alpha is limited to a minimum of init_alpha/e if a {-1,1}-bounded activation is used in the last Dense(2) layer.

I figured out this could be overcome by simply using a linear activation function in the last layer of the neural network, and it indeed worked. Unfortunately, the algorithm was very unstable and the loss went to nan after 400-500 epochs.

This, I thought, was not a problem and I simply changed the custom activation function so alpha could still reach 0. It works fine but I am still having a lot of trouble replicating your results... The best score I have obtained is around 800.

I must confess that I do not have a great deal of computational power at my disposal so I am computing for around 1500 epochs.

I write this post just to ask if you also encountered this problem with the loss going to nan when using linear activation in the last layer of the neural network (before the custom layer).

This instability has become especially relevant when using real industrial data, where I am forced to use 'tanh' activation before the custom activation layer.

Yours

P.S. Another slightly strange issue is that when I split the dataset as you mention in your thesis, I obtain a different number of trajectories using the same split... I assume this is a typo?

ipython example breaks at Tensorflow 1.0.1

I'm running TensorFlow 1.0.1 and tried the IPython notebook in the example folder, but it breaks at several points. It seems the code is broken because of incompatible TensorFlow versions. Which version of TensorFlow did you use to develop your code? And is it possible to migrate this to TensorFlow 1.0?

Thanks for the great work by the way. It's useful.

Validating data_pipeline

Hi Egil,

Thanks so much for releasing the end-to-end wtte-rnn code in data_pipeline! Very cool stuff.

I had a question about validating the performance though. To check how well the model predicts TTE, I did the following:

predicted_t = model.predict(x_test)
predicted_t[:,:,1] = predicted_t[:,:,1] + predicted_t[:,:,0]*0  # lazy re-add NaN-mask
print(predicted_t.shape)

pred_df = tr.padded_to_df(predicted_t,column_names=["alpha","beta"],dtypes=[float,float],ids=pd.unique(df.id))
pred_df['pred_tte'] = pred_df.apply(lambda g: g.alpha*math.pow(math.log1p(0.5),(1/g.beta)),axis=1)
pred_df['actual_tte'] = y_test[:,:,0].flatten()

where

 x_test      = left_pad_to_right_pad(right_pad_to_left_pad(x)[:,(n_timesteps-n_timesteps_to_hide):,:])
 y_test      = left_pad_to_right_pad(right_pad_to_left_pad(y)[:,(n_timesteps-n_timesteps_to_hide):,:])
 events_test = left_pad_to_right_pad(right_pad_to_left_pad(events)[:,(n_timesteps-n_timesteps_to_hide):])

 y_test[:,:,0] = tr.padded_events_to_tte(events_test,discrete_time=discrete_time,t_elapsed=padded_t)
 y_test[:,:,1] = tr.padded_events_to_not_censored(events_test,discrete_time)

What I got looked like this:

>  pred_df

Out[16]:
        id	t	alpha	       beta	    pred_tte	actual_tte
0	1	0	0.148557	0.743166	0.044092	10
1	1	1	18.626453	0.687964	5.014936	9 
2	1	2	21.242054	0.726595	6.132385	8
3	1	3	29.170420	0.734831	8.539321	7
4	1	4	30.190809	0.744385	8.978482	6
...	...	...	...	...	...	...
5233	802	49	4.187856	0.667144	1.082288	1
5234	802	50	5.580938	0.632970	1.340699	0
5235	802	51	2.631150	0.609310	0.598024	3
5236	802	52	4.732635	0.670265	1.230809	2
5237	802	53	5.632733	0.642269	1.381371	1

So if you plot predicted TTE vs. actual they don't agree much, not even directionally. Clearly I am missing something. Is this not a valid way to compare predicted vs. actual TTE?

Thank you!
Natalia

Keras and Theano why?

Why do you use Theano so much? I find it hard to work with because Google now provides TensorFlow and TensorFlow-GPU Docker images for Kubernetes/OpenShift, where, for one thing, Keras is integrated into TensorFlow with a different syntax, e.g.

Sequential = tf.keras.models.Sequential
Dense = tf.keras.layers.Dense
LSTM = tf.keras.layers.LSTM
Activation = tf.keras.layers.Activation
Masking = tf.keras.layers.Masking
RMSprop = tf.keras.optimizers.RMSprop
History = tf.keras.callbacks.History

and Theano is not under active development anymore.
https://groups.google.com/forum/#!topic/theano-users/7Poq8BZutbY

I am working with Kubeflow 0.5, Tensorflow 1.12 and tf.keras Version 2.1.6-tf

Are there plans to move to that?

I really appreciate what you and Dayne Batten have done with regard to WTTE.

Transforming data from long format to wtte-rnn format

I'm struggling with formatting my data into wtte-rnn compatible format. Since the format I work with is rather common in survival analysis, I thought the question might be relevant for other practitioners, as well.

My data is in the so-called long format (aka per-timepoint format) - that is, I have one line per subject per timepoint. This defines an interval-based survival object: (tstart, tstop, event), and allows me to incorporate time-varying covariates (e.g., blood pressure) or time-dependent events (e.g., application of some drug).

Also, since the event is death, a subject only "churns" once (unlike the purchasing examples), which is also rather common in survival analysis, as far as I understand.

If anyone has a code snippet/general tips on how to do this, it would be much appreciated.
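In case it helps, a rough sketch of one way to go from the long format to the per-id, per-timestep frame that the wtte transform helpers expect (column names are hypothetical, and the df_to_padded signature is assumed from the repo examples):

import pandas as pd
import wtte.transforms as tr

# df_long is the hypothetical long-format frame described above.
# Expand (tstart, tstop) intervals to one row per subject per discrete
# timestep, attributing the event to the last step of its interval.
rows = []
for _, r in df_long.iterrows():
    for t in range(int(r['tstart']), int(r['tstop'])):
        rows.append({'id': r['subject_id'], 't': t,
                     'blood_pressure': r['blood_pressure'],
                     'event': int(r['event']) if t == int(r['tstop']) - 1 else 0})
df_step = pd.DataFrame(rows)

padded_x = tr.df_to_padded(df_step, column_names=['blood_pressure'],
                           id_col='id', t_col='t')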

Clip Log-Likelihood and more [WTTE 1.1 release]

Deep learning is hard; really small, silly changes can have a big impact. One such change I've discovered is to clip the log-likelihood. By clipping it from above at log(1-p), it will stop pushing censored observations to the right once the likelihood for the observation reaches 1-p.

This makes infinity-predictions controllable and fixes 99% of the NaN and numerical-instability problems. I.e., if we set p=1e-4 there will be zero gradient contribution once it has found a threshold t s.t. Pr(Y>t)=0.9999. I previously refrained from clipping since t will not really have a meaning, thinking it should/could go to infinity and therefore fail. With clipping this won't happen. Interpretations of predictions should be modified to account for this, but I concluded the benefits outweigh this minor problem.
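A tiny backend-level sketch of the clipping idea (my own illustration, not the committed implementation):

import math
from keras import backend as K

# Clip the per-timestep log-likelihood so censored points stop contributing
# gradient once their probability mass reaches 1 - p.
def clip_loglik(loglikelihoods, p=1e-4):
    return K.clip(loglikelihoods, math.log(p), math.log(1.0 - p))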

  • Version number
  • Rerun wtte-rnn-examples
  • add changelog

Changes

  • Add clipping to log-likelihood dcebad2
  • Deprecate penalization of beta for regularization. I've found that clipping and modulating beta through the activation function parameters is much more effective. c9bfdba
  • Backward-compatible updates to the wtte API. It's just a little less ugly, i.e. call loss_fun = wtte.Loss(kind='discrete').loss_function instead of wtte.loss.... ba13045
  • Added an output-layer bias pre-training step to the wtte-rnn-examples. It improves numerical stability and greatly shortens training time, even though it's ugly to have a step like this.

batch size > 1 for varying sequence lengths

I'm experimenting with using the wtte model on the C-MAPSS dataset. I note you recommend using batch size = 1:

Loss was calculated as mean over timesteps. With batch-size 1 this means that each individual training sequence was given equal weight regardless of length.

However this is resulting in very slow training (around 200 secs per epoch with GPU).

Are there any workarounds for this? I've tried modifying the loss function so it's divided by the number of timesteps of interest (which I'd hoped would normalise it), but this doesn't produce good results.
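One possible workaround (a sketch, not something from the repo): keep batch_size > 1 but give each timestep the weight 1/sequence_length via Keras' temporal sample weights, so every sequence still contributes equally to the mean loss.

import numpy as np

# model, x, y, loss and seq_lengths (the true length of each padded sequence)
# are assumed to exist already.
n_seqs, n_timesteps = x.shape[0], x.shape[1]
sample_weight = np.zeros((n_seqs, n_timesteps))
for i, length in enumerate(seq_lengths):
    sample_weight[i, :length] = 1.0 / length

model.compile(loss=loss, optimizer='adam', sample_weight_mode='temporal')
model.fit(x, y, batch_size=32, sample_weight=sample_weight)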

Create an example with C-MAPSS data?

In the blog C-MAPSS data have been used as an example/demo.

Can it be reproduced as an example of wtte-rnn with visualization shown in the blog?

Release process of 1.0.0

  • Release process of 1.0.0

Checklist

  • Check online manual
  • Add tag to the git
  • Check package status
  • Update main readme
  • Pypi upload

Pre-filtering by number of events

Hi, I'm fairly new to this area, and I just wanted a sanity check to see if it makes sense to pre-filter a dataset based on the number of events. For example, remove all users with fewer than k events in the observation period.

I can see this making sense with k=1, since we tend to drop the first event for all sequences anyway (#37 (comment)). Of course this might depend on the dataset, and I plan to play around with it. However, I just wanted to know whether it is perhaps common practice to drop records like this, or do we favor keeping all users so we can learn user features correlated with single-event churn?

Thanks

How does wtte work seasonally?

First of all, congratulations on the great work with wtte.

My question is about different periods in time. We know that behavior is not constant: in some months, or in the early hours of the morning, the flow is naturally lower. Does the learning take this behavior into account when it creates its theoretical time-series curves?

What is the proper way of handling NaN for WTTE-RNN

Hi there.
I have a problem with NaNs while training an LSTM with wtte.output_lambda and wtte.loss.

First I suspected the loss_function, which can produce NaN values from K.log and from dividing by a in the discrete case:

def loglik_discrete(y, u, a, b, epsilon=1e-35):
    hazard0 = K.pow((y + epsilon) / a, b)
    hazard1 = K.pow((y + 1.0) / a, b)

    loglikelihoods = u * \
        K.log(K.exp(hazard1 - hazard0) - 1.0) - hazard1
    return loglikelihoods

After I changed it to something like binary_crossentropy, no NaN occurred, but a loss like that makes no sense here.

Then I looked at weights for a simple model like:

def create_model(y_train_users, feature_cols):
    tte_mean_train = np.nanmean(y_train_users[:, :, 0])
    y_censored = y_train_users[:, :, 1]

    init_alpha = -1.0/np.log(1.0 - 1.0/(tte_mean_train + 1.0) )
    init_alpha = init_alpha/np.nanmean(y_censored)

    model = Sequential()
    model.add(LSTM(1, input_shape=(None, len(feature_cols)), activation='tanh', return_sequences=True))
    model.add(Dense(2))
    model.add(Lambda(wtte.output_lambda, arguments={"init_alpha":init_alpha, 
                                                   "max_beta_value":2.5}))

    loss = wtte.loss(kind='discrete', reduce_loss=False).loss_function
    lr = 0.001
    model.compile(loss=loss, optimizer=adam(lr=lr, decay=0.00001, clipnorm=0.5))

    return model

And here are the weights at the last two steps before the NaNs (nothing obviously exploding):

>>> model_weights[-2]
[array([[-0.10012437, -0.19260231, -0.23978625,  0.45771736],
        [-0.37926474,  0.01478457,  0.4888621 , -0.03959836]], dtype=float32),
 array([[-0.02832842, -0.26800382,  0.60015482, -0.11135387]], dtype=float32),
 array([ 0.52170336,  1.59952521,  0.17328304,  0.59602541], dtype=float32),
 array([[ 1.50127375,  2.28139687]], dtype=float32),
 array([ 1.09258926, -1.61024928], dtype=float32)]


>>> model_weights[-1]
[array([[ nan,  nan,  nan,  nan],
        [ nan,  nan,  nan,  nan]], dtype=float32),
 array([[ nan,  nan,  nan,  nan]], dtype=float32),
 array([ nan,  nan,  nan,  nan], dtype=float32),
 array([[        nan, -2.13727713]], dtype=float32),
 array([        nan, -1.76466596], dtype=float32)]

It seems that a in output_lambda causes the NaN, but I'm not sure where, because I didn't find any obvious culprit there. When I changed it to an activation, i.e. sigmoid (which does not make any sense for the current task), no NaN occurred.

Also I noticed that you used a Masking layer and callbacks.TerminateOnNaN in data-pipeline-template. Does that mean that NaNs are still possible, and what is the actual reason they occur?

Sorry for the long post. Hope for your help.

Loss Function - Not the PCF?

Hello, when looking at the likelihood function for the Weibull, I derived a different function than you do. Here is your function:

loglikelihoods = censored * (K.log(shape) + shape * K.log(x)) - K.pow(x, shape)

However, when I do this I end up with a K.log(scale) term as well. Can you explain how you arrived at your log-likelihood function? I can provide my working if necessary.
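A possible reconciliation (my own sketch, assuming x in the snippet above is the time already divided by the scale): the full Weibull log-density and the expression in the code differ only by a term that is constant in the parameters, so the K.log(scale) term is absorbed rather than missing.

% With x = t / \alpha:
\log f(t; \alpha, \beta)
  = \log\frac{\beta}{\alpha} + (\beta - 1)\log\frac{t}{\alpha} - \Big(\frac{t}{\alpha}\Big)^{\beta}
  = \big[\log\beta + \beta\log x - x^{\beta}\big] - \log t
% The bracketed part is the code's expression; the leftover -\log t does not
% depend on (\alpha, \beta) and can be dropped when maximizing.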

c-index

Hello!

Awesome work here! I was wondering how to derive the model's c-index with this package. Would you please advise?

Thanks,
Ed

ReST-based automatic documentation

  • For easier use and a deeper understanding of the project, we need solid, synchronized documentation.
  • Also, tool-based documentation reduces the need for a separate instruction site and technical documents.

Plan

  • Write concrete comments in the source code
  • Convert comments to reStructuredText format
  • Make documentation generation scripts
  • Set up readthedocs

Bimodal Hazard

The process I'm trying to model using WTTE-RNN has two "typical" churn times, though customers can churn at any other time (they're simply a-priori more likely). This means that, at least when evaluated at T=0, the hazard rate for future Ts should be bimodal (right?).

Is it somehow possible to tweak/hack the Weibull log-likelihood (discrete version) to represent such a (non-weibull-anymore) bimodal hazard rate? Perhaps something like a mixture of weibulls? Is it possible to compute the loss in that case?

EDIT: For computing the loss, we need both the PMF and the SF. Calculating the PMF of the mixture is easy, but I'm not sure about the SF of the mixture. Is it simply the mixture of the SFs? Or perhaps some sort of convolution between them? Any help would be appreciated...
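For what it's worth, a short note on the mixture case (my own sketch, not from the repo): because a mixture's CDF is the weighted sum of the component CDFs, the survival function is the same weighted sum of the component survival functions, so no convolution is needed.

% Mixture of K discrete Weibulls with weights w_k, \sum_k w_k = 1:
P(T = t) = \sum_{k=1}^{K} w_k \, p_k(t)
P(T > t) = 1 - \sum_{k=1}^{K} w_k F_k(t) = \sum_{k=1}^{K} w_k S_k(t)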

There's a Kaggle calling out for wtte-rnn weibull

@ragulpr,
I spent the last, very interesting, month split between trying to implement this in R and magically arriving at a sufficient understanding of Python to apply it to this Kaggle challenge:

https://www.kaggle.com/c/santander-value-prediction-challenge

of which there are a few days left to submit. I looked at this data in R and it really did look weibull.

Among the many places I stumbled: until just recently, R Keras didn't support Python-like slicing notation, though Keras 2.2.2 (R) and tf-nightly (1.10.0) probably now do. My ndims are always wrong, i.e. expected 3 got 2, expected 2 got 4. The usual culprits.

The challenge above is a slight variation on time-to-next-event, as the value(s) are also embedded in the timesteps. Under pandas or data.frame (R) there is a system of potential bookends (left - float64, right - int64, with the next potential series starting with the next float64, whether activity > 0 or not). I've been attempting a fit by concatenating a (4459, 4991, 2) tensor, with a fit on (None, 4991, 1) (a 0/1 normalized by row) and on (None, 4991, 2) (the values in the data, concatenated), but as yet to no avail.

I'm not sure that makes much sense, or more probably it makes none. But attempting to treat timesteps as also being features, i.e. (4459, 4991, 4991), sort of blew through my RAM, my mother's RAM, and the NSA's RAM.

In reviewing all the open and closed issues, I see I want to set my mask value to 999. I probably should just try to get the examples to work; today I finally managed to get @daynebatten's example to run, as long as I used the deprecated input_dim= rather than trying to negotiate input_shape, where I inevitably fall into the 'expected ndim x, got ndim y' dance.

Anyway, I think your work is very interesting, worthy, and should make something of a splash over in Kaggle land. It might even fund another year of study. If you haven't looked into it already, check out Emanuel Parzen's and Deep's quantile statistics.

Thanks for a very interesting month. I'll keep plugging away and hope to see your submission.

python3 isinstance issues

wtte/transforms.py in df_join_in_endtime(df, per_id_cols, abs_time_col, abs_endtime, nanfill_val)
382 assert 't' not in df.columns.values, 'define per-id time upstream'
383
--> 384 if isinstance(per_id_cols, basestring):
385 per_id_cols = [per_id_cols]
386

NameError: name 'basestring' is not defined
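A small sketch of a Python 2/3-compatible replacement for the failing check (a suggestion, not a committed fix):

# basestring only exists in Python 2; fall back to str on Python 3.
try:
    string_types = basestring            # Python 2
except NameError:
    string_types = str                   # Python 3

if isinstance(per_id_cols, string_types):
    per_id_cols = [per_id_cols]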

AttributeError: module 'numpy' has no attribute 'float128'

Hi,
I was trying to run this example to test WTTE. However, the cell "Format tensor for training" won't execute on my machine:

    C:\Python36\python-3.6.2.amd64\lib\site-packages\wtte\transforms.py in normalize_padded(padded, 
    means, stds, only_nonzero, epsilon)
        527         else:
        528             vals = padded.reshape(n_obs, n_features)
    --> 529         means = np.nanmean(vals, axis=0, keepdims=False, dtype=np.float128)
        530         del vals
        531 

AttributeError: module 'numpy' has no attribute 'float128'

I am using WinPython with

  • Python 3.6.2
  • numpy 1.14.5
  • wtte 1.1.1
  • tensorflow-gpu 1.9.0
  • keras 2.2.0

This issue suggests that numpy compiled with MKL does not support float128.
The mailing list recommends replacing np.float128 with np.longdouble.
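A sketch of what that replacement would look like on line 529 of transforms.py (vals as in the traceback above):

import numpy as np

# np.longdouble resolves to float128 where available and falls back to a
# narrower type (e.g. float64) on MKL/Windows builds that lack float128.
means = np.nanmean(vals, axis=0, keepdims=False, dtype=np.longdouble)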

I can try to fix this, do you have any PR guidelines?

Question about problem setup; predicting on real-world censored data

Hey I tried commenting on the github.io post but I will probably receive an answer faster here. I've been interested in trying this out on some real data, but I have a question about the setup of the problem.

If I have understood this correctly, the censoring indicator is in the label dataset. Doesn't this pose a problem in the real-world case of predicting for example customer next visit time? In the real-world case all present-day data is censored. So wouldn't the model vastly underestimate the time-to-event when the censoring indicator isn't part of the data fed through the model? Or is the point to do some adjustment post-prediction to account for censoring like what happens during the training phase?

Input Data format

Hey there! We've been dealing with the churn prediction problem using the typical approach. Now we'd love to give this less hacky approach a try!

First, I'd love to share a bit more about what we have. I think our event data is a bit different from what this approach requires, since it is not made up of evenly spaced events. Each user has their own time-series data with events at different (and continuous) times. Looking at the data over time, we can see something like the following table.

Time  Event Type  User
1     1           A
2     3           A
3     1           B
4     4           B
5     3           A
6     1           C
7     1           D
8     5           B
9     5           D
10    4           A
11    3           C
12    5           C
13    0           B
14    2           D
15    4           C
16    4           C
17    0           D
18    3           A
19    2           A
20    2           A

For example, looking at the table and assuming that the event 0 means churn, we know that users B and D churned.

My question then is: What would be the input of the Keras model if we want to use the WTTE-RNN approach?

I'm also a bit confused about the training: since we're predicting sequences, should we feed the network the entire sequence of churned users at prediction time? Since the sequence will be right-censored at prediction time, I'm not sure what the correct approach is here!

Sorry if these are very basic questions! Also, thanks for writing such a great and comprehensible article!
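One simple way to turn a table like the one above into model input (a hedged sketch, not the library's own pipeline): one timestep per time unit, with a one-hot of the event type as the feature vector, giving a tensor of shape (n_users, n_timesteps, n_event_types).

import numpy as np
import pandas as pd

# Tiny illustration using the first rows of the table above.
df = pd.DataFrame({'Time': [1, 2, 3, 4, 5],
                   'Event Type': [1, 3, 1, 4, 3],
                   'User': ['A', 'A', 'B', 'B', 'A']})

users = sorted(df['User'].unique())
n_timesteps = int(df['Time'].max()) + 1
n_types = int(df['Event Type'].max()) + 1

x = np.zeros((len(users), n_timesteps, n_types))
for _, row in df.iterrows():
    x[users.index(row['User']), int(row['Time']), int(row['Event Type'])] = 1.0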

Weibull parameters as y_train?

Hi Egil,

I was just wondering if there are any advantages to finding the Weibull parameters through an RNN activation layer vs. first fitting the data to the Weibull to get the (alpha, beta) params and giving those to the RNN loss function instead of TTE and censoring indicators? So basically separating out the (alpha, beta) fitting and the optimization instead of doing them jointly in the RNN. What do you think?

wtte.pipelines.data_pipeline returns wrong seq_ids

Hi, I found a problem with the data preprocessing functions.

The problem arises when we want to map model results back to sequences and their ids after using the library's data_pipeline function to preprocess the data. To the point: the data_pipeline function in the wtte.pipelines module seems to return seq_ids in the wrong order, which breaks the seq_index-to-seq_id mapping. The bug is in the df_to_array function, in its second instruction line: unique_ids = list(grouped.groups.keys()). The grouped sequences aren't ordered by their ids, so the padded feature array built from them can have a different order than the seq_ids returned from data_pipeline. This is because data_pipeline returns sequences ordered by id_col in the passed pandas dataframe, but df_to_array creates the feature sequences based on the pandas groupby order, which may be different, as in my case. My suggested fix (the simplest one) is just to change unique_ids = list(grouped.groups.keys()) to unique_ids = df[id_col].unique() in df_to_array.
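As a concrete sketch of the change described above (inside df_to_array):

# Before:
#     unique_ids = list(grouped.groups.keys())
# After: keep the order in which ids appear in the input dataframe, so the
# padded features and the returned seq_ids line up.
unique_ids = df[id_col].unique()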

Improve packaging

  • Support a fully installable package for public Python repositories

Requirements

  • Compatible with TensorFlow syntaxes
  • Write scripts/configs to build a universal wheel package for PyPI
  • Independent requirements for dev. / production
  • Add configurations for Travis CI (#18)
  • Ensure compatibility with both Python 2 and 3

Masking layer doesn't work

I encountered some problems using the masking layer. Instead of skipping the padded timesteps, the network computes gradients over them and obtains nan values. In more detail, I padded the sequences with the value -1.0 using the pad_sequences function implemented in Keras. Then I trained the model using the train_on_batch method.

Have you already faced these kinds of problems?

Could this explanation be the reason for such problems? "If any downstream layer does not support masking yet receives such an input mask, an exception will be raised." -- Keras documentation
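For comparison, a minimal generic-Keras masking setup (my own sketch, not the repo's template): the mask_value must equal the padding value used in pad_sequences, and every downstream layer must support masking, otherwise Keras raises the exception quoted above.

from keras.models import Sequential
from keras.layers import Masking, GRU, Dense

n_features = 4  # hypothetical
model = Sequential()
model.add(Masking(mask_value=-1.0, input_shape=(None, n_features)))
model.add(GRU(8, return_sequences=True))
model.add(Dense(2))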

possible memory issue with large data

What's the biggest dataset you've used with WTTE-RNN? I'm having consistent issues with this package whenever I try using functions in wtte.transforms. I am using it in a Jupyter notebook, and even after downsampling my dataset I get kernel crashes when I try the data-manipulation methods. I believe it's memory-related (small datasets work fine), so I am curious about the size of the largest dataset you've used. I'd appreciate it if you could let me know.
Thank you.

Exogeneity and grouping

Are all ids treated individually, producing alphas and betas for each id? If the training result of one specific id does not help train any other id, would it be better for me to group ids with similar behaviors into some macro-group to increase the mass of training data for that group (as in a hierarchical regression)?

Different "knobs" to improve accuracy

Similar to #32, I'm also trying to use wtte-rnn for prediction on real data; I'm not getting very good performance and am trying to understand the different "knobs" I can play with to improve prediction.

Some general info:

  1. Data is class-imbalanced - the event I'm trying to predict is rather rare during the time window I'm trying to predict (3-10%). Tried oversampling the training data to counteract this, didn't help...
  2. Eyeballing plots of alpha and beta sequences, it seems like the network learns which are the "good" features (pushing alpha up) and which are the "bad" features (pulling alpha down).
  3. There's one "dominant" feature (time in the study) which is by far the most informative; I'm trying to do better than a simple model that uses just this feature - capturing the "variance" left when we remove it. Despite the fact that the network seems to learn the features, I'm not doing any better than the "vanilla", rate-per-time model.

I've tried using more GRUs, changing them to LSTMs, adding an initial dense layer, etc., but the whole thing feels too random. Any ideas on what to tweak, and how, would be appreciated.

time-to-event helper functions

Calculating time-to-event and censoring can lead to some serious gotchas. Having a set of tests that can validate them would be very helpful. I started on this but didn't get further.

Saving and Loading Model

Hello,

Thank you for your wonderful work on this model! Very interesting. I'm working to save and then load the model for future evaluations.

Considering Keras' guide on saving/loading here:

https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model

I've tried this code:

from keras.models import load_model
model.save('my_model.h5')
model_loaded = load_model('my_model.h5')

But I get the following error:

Traceback (most recent call last):
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-124-9911263901fe>", line 1, in <module>
    model2 = load_model('my_model.h5')
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/keras/models.py", line 243, in load_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/keras/models.py", line 317, in model_from_config
    return layer_module.deserialize(config, custom_objects=custom_objects)
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/keras/layers/__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/keras/utils/generic_utils.py", line 143, in deserialize_keras_object
    list(custom_objects.items())))
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/keras/models.py", line 1353, in from_config
    model.add(layer)
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/keras/models.py", line 492, in add
    output_tensor = layer(self.outputs[0])
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/keras/engine/topology.py", line 617, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/keras/layers/core.py", line 663, in call
    return self.function(inputs, **arguments)
  File "/home/python/envs/python3.6/lib/python3.6/site-packages/wtte/wtte.py", line 85, in output_lambda
    a, b = _keras_unstack_hack(x)
NameError: name '_keras_unstack_hack' is not defined

I find this strange because the function _keras_unstack_hack clearly exists in wtte.py.

I'm running Ubuntu 16.04.3 and Python 3.6, with Keras v 2.1.3.
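One workaround that avoids deserializing the Lambda layer altogether (a sketch, assuming the model-building code is available as a function, here called build_model):

# Save only the weights, then rebuild the architecture in code and load them,
# so the Lambda layer (and _keras_unstack_hack) never has to be deserialized.
model.save_weights('my_model_weights.h5')

model_loaded = build_model()                  # hypothetical: same code that built `model`
model_loaded.load_weights('my_model_weights.h5')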

nan loss function when b approaches 0

I have tried to solve the problem with the nan loss and found this trick to be helpful: adding an epsilon constant inside the logarithm:

loglikelihoods = u * \
        K.log(K.exp(hazard1 - hazard0) - 1.0 + epsilon) - hazard1

This way, when b ~ 0 and thus hazard1 = hazard0, the logarithm is always defined.

implement objective functions

  • Full test coverage
  • Tensorflow
  • Keras
  • Keras theano
  • Keras tensorflow
  • Theano
  • TF Learn
  • MXnet
  • TORCH
  • H2o
  • MLlib

Hopefully we can commit them as ops to their respective projects too.

Combining data_pipeline and simple_example

Hi Egil,

Thank you so much for making your code available! This is really great stuff.

So, in trying to better understand how it all works, I tried using the tensorflow.log-extracted data (as in your data_pipeline notebook) as input to the network (same config as in your simple_example). Unfortunately, I got all NaNs as losses:

Model summary:

 init_alpha:  -785.866918162
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
gru_1 (GRU)                  (None, 101, 1)            18        
_________________________________________________________________
dense_1 (Dense)              (None, 101, 2)            4         
_________________________________________________________________
activation_1 (Activation)    (None, 101, 2)            0         
=================================================================
Total params: 22.0
Trainable params: 22.0
Non-trainable params: 0.0  

Results of running model.fit:

Train on 72 samples, validate on 24 samples
Epoch 1/75
2s - loss: nan - val_loss: nan
....

I was wondering if you've tried doing the same experiment and if so, whether it worked for you? Thanks so much!

Release wtte 1.1.2

  • add changelog
  • edit contributors
  • edit setup.py
  • update on PyPi
  • Fix warning when K.epsilon() >= 1e-07

Changes

  • MKL fix (#44)

Migrate python unittest to pytest

  • Current test cases use Python unittest
  • Pytest (https://doc.pytest.org) provides rich features for more test cases
  • Move the unittest code to pytest

Requirements

  • Add pytest as one of the required packages
  • Port the current unittests to pytest
  • Add more unit tests / functional tests to the project
