MambaStock: Selective state space model for stock prediction
Thanks
Impressive results! Predictions this good are rare.
I'm not very familiar with the Mamba architecture yet; I'm just starting to research it. I found your code and tested it, and got this incredibly accurate, better-than-state-of-the-art result, surpassing the prediction power of any other model I have seen:
Are you sure there's not some test data leakage? Like data from the future leaking into the predicted results?
Mamba and your code implementation look very promising, but I am confused about why you're not using some type of time masking or temporal splitting on the test data before you send it into `PredictWithData`, in order to avoid or eliminate look-ahead bias and/or data leaking in from the future. My theory is that future data is leaking into the predictions, but allow me to explain why, and please correct me if this is not the case. It seems that the model should be pre-trained and should then make its predictions in a loop: each time the prediction runs, Mamba should only be given the data up until the day of (or the day before) the prediction, not all of the past, current, and future data at once.
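The pre-train-then-loop idea I have in mind could be sketched roughly like this. This is a hypothetical, leakage-free evaluation loop of my own, not your actual code; `clf`, `trainX`, and `testX` reuse your variable names, and `walk_forward_predict` is a name I made up:

```python
import numpy as np
import torch

def walk_forward_predict(clf, trainX, testX):
    """Predict one test day at a time, revealing only data up to that day."""
    clf.eval()
    history = [np.asarray(row, dtype=np.float32) for row in trainX]
    preds = []
    with torch.no_grad():
        for t in range(len(testX)):
            # Reveal exactly one new day of features per iteration.
            history.append(np.asarray(testX[t], dtype=np.float32))
            xv = torch.from_numpy(np.stack(history)).float().unsqueeze(0)
            out = clf(xv)  # the model only ever sees data up to day t
            # Keep the model's output for the most recent (current) day.
            preds.append(out.detach().numpy().flatten()[-1])
    return np.array(preds)
```

This is slower than one batched forward pass, but it guarantees the prediction for day t cannot depend on rows after t.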
Your code is concise and clear, which makes it easy to read, but let me make sure I understand what's going on and that I don't misunderstand:
It looks like you're removing the `close`, `pre_close`, `change`, `pct_chg` columns, and then using `open, high, low, pre_close, change, pct_chg, vol, amount, turnover_rate, volume_ratio, pe, pb, ps, total_share, float_share, free_share, total_mv, circ_mv` to train the model on the closing percent change?
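For concreteness, here is how I'm picturing that data-prep step, as a toy pandas sketch. The values are made up, the columns are a subset of the fields named above, and I'm assuming `pct_chg` is the training label:

```python
import pandas as pd

# Toy frame mimicking a few of the daily-bar columns (made-up values).
df = pd.DataFrame({
    'close':     [10.0, 10.2, 10.1],
    'pre_close': [9.9, 10.0, 10.2],
    'change':    [0.1, 0.2, -0.1],
    'pct_chg':   [1.01, 2.0, -0.98],
    'open':      [9.9, 10.1, 10.2],
    'high':      [10.3, 10.4, 10.3],
    'low':       [9.8, 10.0, 10.0],
    'vol':       [1000.0, 1200.0, 900.0],
})

# Columns that directly encode (or trivially reveal) the day's close.
leaky_cols = ['close', 'pre_close', 'change', 'pct_chg']

y = df['pct_chg'].values          # label: closing percent change
X = df.drop(columns=leaky_cols)   # features: everything else
```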
But you also pass the entire test set of all future test data (except without the `close`, `pre_close`, `change`, `pct_chg` columns) into `PredictWithData`. You then use this in the prediction code after training to predict `yhat`.
In the `PredictWithData` function, you first initialize (and then train) the Mamba network using PyTorch:

```python
clf = Net(len(trainX[0]), 1)
```

Before training, you load the test set into the `xv` variable:

```python
xv = torch.from_numpy(testX).float().unsqueeze(0)
```

After training, you evaluate and predict by passing in the entire set of `testX` data:

```python
clf.eval()
mat = clf(xv)
yhat = mat.detach().numpy().flatten()
```
From what I can see, `xv` contains the entire `testX` set that is passed into the `PredictWithData` function. This test set includes data from future time points relative to the training data. By passing the entire test set into the trained model `clf` during the evaluation phase, it appears that the model has access to future information when making predictions.
My concern and/or hypothesis is that this approach introduces a form of data leakage or look-ahead bias, where the model is inadvertently using information from the future to make predictions. This could lead to overly optimistic results that may not generalize well to real-world scenarios where future data is not available.
You do remove the `close`, `pre_close`, `change`, `pct_chg` columns from the test set before passing it into `PredictWithData`, but the model still has access to other future information such as `open`, `high`, `low`, `vol`, `amount`, etc.
I was wondering if you have considered applying some form of time masking or temporal splitting to ensure that the model only uses information available up to the current time step when making predictions for future time steps. This would help mitigate the risk of data leakage from the future and provide a more realistic evaluation of the model's performance.
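To illustrate the time-masking side of that suggestion: one common form (in the attention-score sense the article linked below discusses) is a lower-triangular causal mask, so position i can only see positions up to i. This is a generic sketch of mine, not code from this repo; my understanding is that Mamba's recurrence is already causal within a sequence, so the more important fix here may be withholding future rows of `testX` during evaluation:

```python
import torch

# Causal (lower-triangular) time mask: position i may only attend to j <= i.
T = 5
scores = torch.zeros(T, T)  # stand-in for attention scores / pairwise terms
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
masked = scores.masked_fill(~causal_mask, float('-inf'))
```

After a softmax over the last dimension, the `-inf` entries contribute zero weight, so no position can use information from later time steps.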
I'm relatively new to the Mamba architecture and would greatly appreciate your insights on this. Could you please clarify whether there is indeed a potential data-leakage issue in the current implementation and, if so, suggest any modifications or best practices to address it? I'd really like to apply this model to the problem of time-series classification, but in its current state I believe the lack of time masking makes it likely that data leakage from the future is the reason the results are so incredibly good.
Read more about time masking and its importance in time-series modeling and training here:
https://wandb.ai/mostafaibrahim17/ml-articles/reports/A-Deep-Dive-Into-Time-Masking-Using-PyTorch--Vmlldzo1Njg1Nzc5#understanding-time-masking-techniques