mambastock's Issues

Data leakage from the future?

Impressive results! Predictions this good are rare.

I'm not very familiar with the Mamba architecture and am just starting to research it. I found your code and tested it, and got this incredibly accurate, better-than-state-of-the-art result, surpassing the predictive power of any other model I have seen:

[screenshot of the prediction results]

Are you sure there's not some test data leakage? Like data from the future leaking into the predicted results?

Mamba and your code implementation look very promising, but I am confused about why you're not applying some type of time masking or temporal splitting to the test data before sending it into PredictWithData, in order to avoid look-ahead bias and/or data leaking in from the future. My theory is that future data is leaking into the predictions; let me explain why, and please correct me if this is not the case. It seems that the model should be pre-trained and should then make its predictions in a loop: each time a prediction is run, Mamba should only be given the data up to the day of (or the day before) the prediction, rather than all of the past, current, and future data at once.
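
To make that concrete, here is a rough sketch of the kind of walk-forward evaluation I have in mind. The function name and interface are mine, not from your repo; I am assuming an already-trained clf like the Net built inside PredictWithData, and NumPy feature arrays in time order:

import numpy as np
import torch

def walk_forward_predict(clf, trainX, testX):
    # Predict one test day at a time, letting the model see only the
    # rows that would have been available on that day.
    clf.eval()
    preds = []
    with torch.no_grad():
        for t in range(len(testX)):
            # visible history = all training rows plus test rows up to day t
            visible = np.vstack([trainX, testX[: t + 1]])
            x = torch.from_numpy(visible).float().unsqueeze(0)
            out = clf(x).flatten()
            # take the output for the most recent time step as day t's prediction
            preds.append(out[-1].item())
    return np.array(preds)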

Your code appears concise and clear, which makes it easy to read, but let me make sure I'm understanding what's going on:

It looks like you're removing the 'close', 'pre_close', 'change', and 'pct_chg' columns, and then using the remaining features (open, high, low, vol, amount, turnover_rate, volume_ratio, pe, pb, ps, total_share, float_share, free_share, total_mv, circ_mv) to train the model to predict the closing percent change?
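
For reference, here is my reading of that preprocessing written out explicitly. This is a rough sketch using the column names quoted above, not your actual code:

import pandas as pd

def prepare_features(df: pd.DataFrame):
    # target: the closing percent change, scaled from percent to a fraction
    y = 0.01 * df['pct_chg'].values
    # drop the columns that directly encode the same-day close
    X = df.drop(columns=['close', 'pre_close', 'change', 'pct_chg'])
    return X.values, y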

But you also pass the entire test set of future data (again without the 'close', 'pre_close', 'change', and 'pct_chg' columns) into PredictWithData, and after training you use it in the prediction code to produce yhat.

In the PredictWithData function, you first initialize (and later train) the Mamba network using PyTorch:

clf = Net(len(trainX[0]),1)

Before training you load up the test set into the xv variable:

xv = torch.from_numpy(testX).float().unsqueeze(0)

After training, you evaluate and predict by passing in the entire testX set:

clf.eval()
mat = clf(xv)
yhat = mat.detach().numpy().flatten()

From what I can see, xv contains the entire testX set that was passed into the PredictWithData function. This test set includes data from time points in the future relative to the training data. By passing the whole test set into the trained model clf during the evaluation phase, the model appears to have access to future information when making predictions.
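
One way to check this directly (a hypothetical sketch; I am assuming clf behaves like the Net above and returns one prediction per time step): if the prediction for an early test day changes once the rows after it are dropped from the input, then the output for that day depends on future inputs.

import torch

def depends_on_future(clf, testX, t):
    # Compare the prediction for test day t computed from the full test
    # sequence against the one computed from the sequence truncated at t.
    clf.eval()
    with torch.no_grad():
        full = clf(torch.from_numpy(testX).float().unsqueeze(0)).flatten()
        trunc = clf(torch.from_numpy(testX[: t + 1]).float().unsqueeze(0)).flatten()
    # a difference means day t's output used information from days after t
    return not torch.allclose(full[t], trunc[-1])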

My concern and/or hypothesis is that this approach introduces a form of data leakage or look-ahead bias, where the model is inadvertently using information from the future to make predictions. This could lead to overly optimistic results that may not generalize well to real-world scenarios where future data is not available.

You do remove the 'close', 'pre_close', 'change', and 'pct_chg' columns from the test set before passing it into PredictWithData, but the model still has access to other future information such as 'open', 'high', 'low', 'vol', 'amount', and so on.

I was wondering if you have considered applying some form of time masking or temporal splitting to ensure that the model only uses information available up to the current time step when making predictions for future time steps. This would help mitigate the risk of data leakage from the future and provide a more realistic evaluation of the model's performance.
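
To illustrate the two options I have in mind, here are rough, generic PyTorch sketches; the function names are mine, not from your repo:

import torch

def temporal_split(X: torch.Tensor, cutoff: int):
    # Temporal splitting: rows before `cutoff` are visible,
    # rows at or after it are held out entirely.
    return X[:cutoff], X[cutoff:]

def time_mask(X: torch.Tensor, cutoff: int):
    # Time masking: keep the sequence length, but zero out every
    # time step at or after `cutoff` so the model cannot use it.
    mask = (torch.arange(X.size(0)) < cutoff).unsqueeze(-1).float()
    return X * mask

Either variant could be applied inside an evaluation loop so that the prediction for day t never sees rows after day t.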

I'm relatively new to the Mamba architecture and would greatly appreciate your insights on this matter. Could you please clarify whether there is indeed a potential issue with data leakage in the current implementation and, if so, suggest any modifications or best practices to address it? I'd really like to try applying this model to time series classification, but in its current state I believe the lack of time masking makes it likely that data leakage from the future is the reason the results are so incredibly good.

You can read more about time masking and its importance in time-series modeling and training here:
https://wandb.ai/mostafaibrahim17/ml-articles/reports/A-Deep-Dive-Into-Time-Masking-Using-PyTorch--Vmlldzo1Njg1Nzc5#understanding-time-masking-techniques
