Code for the Bancolombia Dataton, a Kaggle competition, where you can see the rules and download the data that was provided to solve the problem.
You can also see the IDE requirements for the Python and R setups we used.
To tackle this problem we explored three ways of preparing the data and building the final dataset used to train the models.
Solving this problem was possible thanks to the capacity of this machine.
We first have to fix the numerical and categorical data and build some categorical ranks that label the data with respect to the target variable; here you can see a kind of EDA.
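
As an illustration of the categorical ranks mentioned above, here is a minimal sketch that replaces each category with its rank by mean target value. The column names `segment` and `income` and the helper `rank_by_target` are made up for the example and are not taken from the competition data. In practice the ranks would normally be computed on the training split only, so that the target does not leak into the validation data.

```python
import pandas as pd

# Toy data: `segment` and `income` are hypothetical column names.
df = pd.DataFrame({
    "segment": ["a", "b", "a", "c", "b", "c"],
    "income": [10.0, 30.0, 14.0, 50.0, 34.0, 54.0],
})


def rank_by_target(frame, cat_col, target_col):
    """Map each category to its rank (1 = lowest) by mean target value."""
    means = frame.groupby(cat_col)[target_col].mean()
    ranks = means.rank(method="dense").astype(int)
    return frame[cat_col].map(ranks)


df["segment_rank"] = rank_by_target(df, "segment", "income")
```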
- Aggregate all data per user id
Here is the code used to reach this objective. At the very beginning of this process, the idea was to express the monetary variables at their Dec-20 value here, based on the IPC (consumer price index)
- Aggregate all data per month
Here is the code used to reach this objective
- Take historical data along the timeline
Here is the code to reach this objective
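
The three strategies above can be sketched with pandas as follows. This is only an assumed reading of them, on a made-up long-format table; the column names `user_id`, `month` and `expenses` are hypothetical, and the "timeline" layout (one column per month) is a guess at what keeping the historical data means here.

```python
import pandas as pd

# Toy long-format data; all column names are hypothetical.
records = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "month": ["2020-01", "2020-02", "2020-03"] * 2,
    "expenses": [100.0, 120.0, 110.0, 300.0, 280.0, 320.0],
})

# 1) One row per user: summary statistics over that user's history.
per_user = records.groupby("user_id")["expenses"].agg(["mean", "min", "max"])

# 2) One row per month: summary statistics over all users.
per_month = records.groupby("month")["expenses"].agg(["mean", "min", "max"])

# 3) Full timeline: one row per user, one column per month (wide format).
timeline = records.pivot(index="user_id", columns="month", values="expenses")
```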
In each case we have to impute all the missing values with the PPCA method in R here and verify the results here:
- Per user Id
- Per month
- Per all timeline
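
To give an idea of how PCA-based imputation works, here is a simplified Python stand-in. It is not the PPCA method used in R: it alternates a truncated SVD fit with re-filling the missing cells, without the noise model that probabilistic PCA adds. The function name `iterative_pca_impute` and its parameters are assumptions made for this sketch.

```python
import numpy as np


def iterative_pca_impute(X, n_components=1, n_iter=200, tol=1e-8):
    """Fill NaNs by alternating a rank-k PCA fit with re-imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means computed on the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        # Rank-k reconstruction of the centred data, shifted back by the mean.
        approx = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean
        # Only the missing cells are overwritten; observed values stay fixed.
        new_filled = np.where(missing, approx, X)
        change = np.max(np.abs(new_filled - filled))
        filled = new_filled
        if change < tol:
            break
    return filled


# Made-up example: the second column is half the first, with one value missing.
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, np.nan],
              [4.0, 2.0], [5.0, 2.5], [6.0, 3.0]])
X_imputed = iterative_pca_impute(X, n_components=1)
```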
To model the problem we followed these steps:
- Correlation analysis, done on the per-user-id data
- Linear regression
  - Per user id
  - Per month
  - Per all timeline
- PCA
Strategy | Scaling |
---|---|
user Id | z-norm |
month | z-norm |
all timeline | z-norm |
all timeline | Centred |
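
The difference between the two scalings in the table can be seen in a small sketch. It assumes "z-norm" means centring plus division by the column standard deviation and "Centred" means mean subtraction only; the function name `pca_explained_variance` and the random data are made up for the example. With centring only, a column on a much larger scale dominates the first component, while z-normalisation puts all columns on an equal footing.

```python
import numpy as np


def pca_explained_variance(X, scale="z-norm"):
    """Return the explained-variance ratio per component after scaling."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # centring is applied in both cases
    if scale == "z-norm":
        Xc = Xc / X.std(axis=0, ddof=1)  # unit variance per column
    elif scale != "centred":
        raise ValueError(f"unknown scale: {scale!r}")
    # Squared singular values are proportional to the component variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()


rng = np.random.default_rng(0)
# Made-up data: two independent columns on very different scales.
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 100, 500)])

ratio_centred = pca_explained_variance(X, scale="centred")
ratio_znorm = pca_explained_variance(X, scale="z-norm")
```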
In the end, the best results were those reached with the all-timeline strategy, because in that case it wasn't necessary to change anything about the target variable. Time ran out and we didn't win, but we learned a lot!