mambastock's Issues

Data leakage from the future?

Impressive results! Predictions this good are rare.

I'm not very familiar with the Mamba architecture and am just starting to research it. I found your code and tested it, and got this incredibly accurate, better-than-state-of-the-art result, surpassing the predictive power of any other model I have seen:

[screenshot of the prediction results]

Are you sure there's not some test data leakage? Like data from the future leaking into the predicted results?

Mamba and your code implementation look very promising, but I am confused about why you're not applying some type of time masking or temporal splitting to the test data before sending it into PredictWithData, in order to avoid look-ahead bias and/or data leaking in from the future. My theory is that future data is leaking into the predictions; let me explain why, and please correct me if this is not the case. It seems that the model should be pre-trained and should then make its predictions in a loop: each time a prediction is run, Mamba should only be given the data up to the day of (or the day before) the prediction, rather than all of the past, current, and future data at once.
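
To make that concrete, here is a rough sketch of the kind of walk-forward evaluation I have in mind. The function name and interface are mine, not from your repo; I am assuming an already-trained clf like the Net built inside PredictWithData, and NumPy feature arrays in time order:

import numpy as np
import torch

def walk_forward_predict(clf, trainX, testX):
    # Predict one test day at a time, letting the model see only the
    # rows that would have been available on that day.
    clf.eval()
    preds = []
    with torch.no_grad():
        for t in range(len(testX)):
            # visible history = all training rows plus test rows up to day t
            visible = np.vstack([trainX, testX[: t + 1]])
            x = torch.from_numpy(visible).float().unsqueeze(0)
            out = clf(x).flatten()
            # take the output for the most recent time step as day t's prediction
            preds.append(out[-1].item())
    return np.array(preds)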

Your code appears concise and clear, which makes it easy to read, but let me make sure I'm understanding what's going on:

It looks like you're removing the 'close', 'pre_close', 'change', and 'pct_chg' columns, and then using the remaining features (open, high, low, vol, amount, turnover_rate, volume_ratio, pe, pb, ps, total_share, float_share, free_share, total_mv, circ_mv) to train the model to predict the closing percent change?
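
For reference, here is my reading of that preprocessing written out explicitly. This is a rough sketch using the column names quoted above, not your actual code:

import pandas as pd

def prepare_features(df: pd.DataFrame):
    # target: the closing percent change, scaled from percent to a fraction
    y = 0.01 * df['pct_chg'].values
    # drop the columns that directly encode the same-day close
    X = df.drop(columns=['close', 'pre_close', 'change', 'pct_chg'])
    return X.values, y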

But you also pass the entire test set of future data (again without the 'close', 'pre_close', 'change', and 'pct_chg' columns) into PredictWithData, and after training you use it in the prediction code to produce yhat.

In the PredictWithData function, you first initialize (and later train) the Mamba network using PyTorch:

clf = Net(len(trainX[0]),1)

Before training you load up the test set into the xv variable:

xv = torch.from_numpy(testX).float().unsqueeze(0)

After training, you evaluate and predict by passing in the entire testX set:

clf.eval()
mat = clf(xv)
yhat = mat.detach().numpy().flatten()

From what I can see, xv contains the entire testX set that was passed into the PredictWithData function. This test set includes data from time points in the future relative to the training data. By passing the whole test set into the trained model clf during the evaluation phase, the model appears to have access to future information when making predictions.
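
One way to check this directly (a hypothetical sketch; I am assuming clf behaves like the Net above and returns one prediction per time step): if the prediction for an early test day changes once the rows after it are dropped from the input, then the output for that day depends on future inputs.

import torch

def depends_on_future(clf, testX, t):
    # Compare the prediction for test day t computed from the full test
    # sequence against the one computed from the sequence truncated at t.
    clf.eval()
    with torch.no_grad():
        full = clf(torch.from_numpy(testX).float().unsqueeze(0)).flatten()
        trunc = clf(torch.from_numpy(testX[: t + 1]).float().unsqueeze(0)).flatten()
    # a difference means day t's output used information from days after t
    return not torch.allclose(full[t], trunc[-1])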

My concern and/or hypothesis is that this approach introduces a form of data leakage or look-ahead bias, where the model is inadvertently using information from the future to make predictions. This could lead to overly optimistic results that may not generalize well to real-world scenarios where future data is not available.

You do remove the 'close', 'pre_close', 'change', and 'pct_chg' columns from the test set before passing it into PredictWithData, but the model still has access to other future information such as 'open', 'high', 'low', 'vol', 'amount', and so on.

I was wondering if you have considered applying some form of time masking or temporal splitting to ensure that the model only uses information available up to the current time step when making predictions for future time steps. This would help mitigate the risk of data leakage from the future and provide a more realistic evaluation of the model's performance.
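
To illustrate the two options I have in mind, here are rough, generic PyTorch sketches; the function names are mine, not from your repo:

import torch

def temporal_split(X: torch.Tensor, cutoff: int):
    # Temporal splitting: rows before `cutoff` are visible,
    # rows at or after it are held out entirely.
    return X[:cutoff], X[cutoff:]

def time_mask(X: torch.Tensor, cutoff: int):
    # Time masking: keep the sequence length, but zero out every
    # time step at or after `cutoff` so the model cannot use it.
    mask = (torch.arange(X.size(0)) < cutoff).unsqueeze(-1).float()
    return X * mask

Either variant could be applied inside an evaluation loop so that the prediction for day t never sees rows after day t.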

I'm relatively new to the Mamba architecture and would greatly appreciate your insights on this matter. Could you please clarify whether there is indeed a potential issue with data leakage in the current implementation and, if so, suggest any modifications or best practices to address it? I'd really like to try applying this model to time series classification, but in its current state I believe the lack of time masking makes it likely that data leakage from the future is the reason the results are so incredibly good.

You can read more about time masking and its importance in time-series modeling and training here:
https://wandb.ai/mostafaibrahim17/ml-articles/reports/A-Deep-Dive-Into-Time-Masking-Using-PyTorch--Vmlldzo1Njg1Nzc5#understanding-time-masking-techniques
