maxim5 / time-series-machine-learning

Machine learning models for time series analysis

License: Apache License 2.0


Time Series Prediction with Machine Learning

A collection of different machine learning models that predict time series, concretely the market price for a given currency chart and target.

Requirements

Required dependency: numpy. Other dependencies are optional, but to diversify the final model ensemble, it's recommended to install these packages: tensorflow, xgboost.

Tested with python versions: 2.7.14, 3.6.0.

Fetching data

There is one built-in data provider, which fetches the data from the Poloniex exchange. Currently, all models have been tested with cryptocurrency charts.

The fetched data format is standard security OHLC trading info: date, high, low, open, close, volume, quoteVolume, weightedAverage. But the models are agnostic of the particular time series features and can be trained with a sub- or superset of these features.

To fetch the data, run run_fetch.py script from the root directory:

# Fetches the default tickers: BTC_ETH, BTC_LTC, BTC_XRP, BTC_ZEC for all time periods.
$ ./run_fetch.py

By default, the data is fetched for all time periods available in Poloniex (day, 4h, 2h, 30m, 15m, 5m) and is stored in the _data directory. You can specify the tickers and periods via command-line arguments.

# Fetches just BTC_ETH ticker data for only 3 time periods.
$ ./run_fetch.py BTC_ETH --period=2h,4h,day

Note: the second and following runs won't fetch all charts from scratch, but just the update from the last run till now.
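The incremental update can be pictured as a request that starts right after the last stored candle. The sketch below is only an illustration: the URL layout is copied from the fetch log quoted in the issues further down, while the function name `build_update_url`, the `PERIOD_SECONDS` table and the "last timestamp plus one period" rule are assumptions, not the project's actual code.

```python
from urllib.parse import urlencode

# Candle widths in seconds. Only 5m=300 is visible in the project's logs;
# the other values are derived from the period names.
PERIOD_SECONDS = {"5m": 300, "15m": 900, "30m": 1800,
                  "2h": 7200, "4h": 14400, "day": 86400}


def build_update_url(ticker, period, last_timestamp=0):
    """Builds a chart request that starts after the last stored candle.

    Hypothetical helper: with last_timestamp=0 the whole history is requested.
    """
    step = PERIOD_SECONDS[period]
    start = last_timestamp + step if last_timestamp else 0
    query = urlencode({
        "command": "returnChartData",
        "currencyPair": ticker,
        "start": start,
        "end": 4294967296,
        "period": step,
    })
    return "https://poloniex.com/public?" + query


url = build_update_url("BTC_ETH", "5m", last_timestamp=1536822000)
```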

Training the models

To start training, run run_train.py script from the root directory:

# Trains all models until stopped.
# The defaults: 
# - tickers: BTC_ETH, BTC_LTC, BTC_XRP, BTC_ZEC
# - period: day
# - target: high
$ ./run_train.py

# Trains the models for specified parameters.
$ ./run_train.py --period=4h --target=low BTC_BCH

By default, the script trains all available methods (see below) with random hyper-parameters, cross-validates each model and saves the resulting weights if the performance is better than the current average (the limit can be configured).

All models are placed in the _zoo directory (note: models saved early may perform much worse than later ones, so feel free to clean up the models you're definitely not interested in, because they can only spoil the final ensemble).

Note 1: specifying multiple periods and targets will force the script to train all combinations of those. Currently, the models do not reuse weights for different targets. In other words, if you set --target=low,high, it will train separate models for low and for high.

Note 2: under the hood, the models work with transformed data; in particular, high, low, open, close and volume are transformed to percent changes. Hence, the prediction for these columns is also a percent change.
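As a rough illustration of the percent-change transform, here is a minimal sketch in plain Python. The helper names `to_percent_changes` and `to_price` are hypothetical, and the exact formula the project uses is assumed to be the usual relative change between consecutive values.

```python
def to_percent_changes(values):
    """Returns (v[i] - v[i-1]) / v[i-1] for each consecutive pair of values."""
    return [(cur - prev) / prev for prev, cur in zip(values, values[1:])]


def to_price(last_value, change):
    """Converts a predicted percent change back to an absolute value."""
    return last_value * (1.0 + change)


highs = [100.0, 102.0, 99.96]
changes = to_percent_changes(highs)   # roughly [0.02, -0.02]
next_high = to_price(highs[-1], 0.01)  # a +1% prediction applied to the last high
```

Note that the transformed series is one element shorter than the input, because the first value has no predecessor.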

Machine Learning methods

Currently supported methods:

Ordinary linear model. Even though it's very simple, linear regression turns out to show pretty good results and complements the more complex models in the final ensemble.
Gradient boosting (using xgboost implementation).
Deep neural network (in tensorflow).
Recurrent neural network: LSTM, GRU, one or multi-layered (in tensorflow as well).
Convolutional neural network for 1-dimensional data (in tensorflow as well).

All models take as input a window of a certain size (named k) and predict a single target value for the next time step. Example: window size k=10 means that the model accepts the (x[t-10], x[t-9], ..., x[t-1]) array to predict x[t].target. Each x[i] includes a number of features (open, close, volume, etc). Thus, the model takes 10 * features values in and outputs a single value: the percent change for the target column.
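The windowing described above can be sketched with numpy as follows. This is an assumption-laden illustration, not the project's data pipeline: `make_windows` and its arguments are hypothetical names, and the input is taken to be a 2-D array with one row per time step.

```python
import numpy as np


def make_windows(series, target_column, k):
    """Turns an (n, features) array into flat windows and next-step targets.

    Row i of x holds series[t-k:t] flattened to k * features values,
    and y[i] is series[t, target_column], for t = k .. n-1.
    """
    n, _ = series.shape
    x = np.array([series[t - k:t].reshape(-1) for t in range(k, n)])
    y = series[k:, target_column]
    return x, y


# 12 time steps with 2 features each; k=10 leaves 2 training examples.
series = np.arange(24, dtype=float).reshape(12, 2)
x, y = make_windows(series, target_column=0, k=10)
```

With k=10 and 2 features, each input row has 20 values, matching the "10 * features" count in the text.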

Inspecting the model

Saved models consist of the following files:

run-params.txt: each model has the following run parameters:
- Ticker name, e.g., BTC_ETH.
- Time period, e.g., 4h.
- Target column, e.g., high (means the model is predicting the next high price).
- Model class, e.g., RecurrentModel.
- The k value, which denotes the input length, e.g., k=16 with period=day means the model needs 16 days to predict the next one.
model-params.txt: holds the specific hyper-parameters that the model was trained with.
stats.txt: evaluation statistics (for both training and test sets, see the details below).
One or several files holding the internal weights.

Each model is evaluated for both training and test set, but the final evaluation score is computed only from the test set.

Here's an example report:

# Test results:
Mean absolute error: 0.019528
SD absolute error:   0.023731
Sign accuracy:       0.635158
Mean squared error:  0.000944
Sqrt of MSE:         0.030732
Mean error:          -0.001543
Residuals stats:     mean=0.0195 std=0.0238 percentile=[0%=0.0000 25%=0.0044 50%=0.0114 75%=0.0252 90%=0.0479 100%=0.1917]
Relative residuals:  mean=1.1517 std=0.8706 percentile=[0%=0.0049 25%=0.6961 50%=0.9032 75%=1.2391 90%=2.3504 100%=4.8597]

You should read it like this:

The model is on average 0.019528 or about 2% away from the ground truth percent change (absolute difference), but only -0.001543 away taking into account the sign. In other words, the model underestimates and overestimates the target equally, usually by 2%.
The standard deviation of residuals is also about 2%: 0.023731, so it's rarely far off the target.
The model is 63% right about the sign of the change: 0.635158. For example, this means that when the model says "Buy!", it may be wrong about how high the predicted price will be, but the price will go up in 63% of the cases.
Residuals and relative residuals show the percentiles of the error distribution. In particular, in 75% of the cases the residual is less than 2.5% away from the ground truth, and the relative residual is no more than 124% of the true change.
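The headline numbers in the report can be reproduced on toy data along these lines. This is a sketch under assumptions: the function name `report` is hypothetical, the error is taken as prediction minus truth, and the population standard deviation is used; the project may differ on both conventions.

```python
from statistics import mean, pstdev


def report(truth, predicted):
    """Computes a few of the report's statistics for two equal-length lists."""
    errors = [p - t for t, p in zip(truth, predicted)]  # sign convention assumed
    abs_errors = [abs(e) for e in errors]
    same_sign = [(p > 0) == (t > 0) for t, p in zip(truth, predicted)]
    return {
        "mean_abs_error": mean(abs_errors),
        "sd_abs_error": pstdev(abs_errors),
        "sign_accuracy": sum(same_sign) / len(same_sign),
        "mean_error": mean(errors),
    }


truth = [0.01, -0.02, 0.03, -0.01]
predicted = [0.02, -0.01, 0.01, 0.01]
stats = report(truth, predicted)
```

In this toy case the last prediction has the wrong sign, so the sign accuracy is 3 out of 4.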

Example: if truth=0.01 and prediction=0.02, then residual=0.01 (1% away) and relative_residual=1.0 (100% larger).
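The example above is consistent with the following definitions, which are inferred from the numbers rather than taken from the project's code; the function names are hypothetical.

```python
def residual(truth, prediction):
    """Absolute distance between the prediction and the true change."""
    return abs(prediction - truth)


def relative_residual(truth, prediction):
    """Residual expressed as a fraction of the true change's magnitude."""
    return abs(prediction - truth) / abs(truth)


r = residual(0.01, 0.02)            # about 0.01, i.e. 1% away
rel = relative_residual(0.01, 0.02)  # about 1.0, i.e. 100% larger
```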

In the end, the report is summarized in one evaluation result, which is mean_abs_error + risk_factor * sd_abs_error. You can vary the risk_factor to weigh average performance against worst-case performance: a higher value penalizes models whose errors vary a lot. By default, risk_factor=1.0, hence the model above is evaluated at 0.0433 (0.019528 + 0.023731). Lower evaluation is better.
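The evaluation formula can be checked against the example report directly. The function name `evaluation` is just a label for this sketch.

```python
def evaluation(mean_abs_error, sd_abs_error, risk_factor=1.0):
    """Single score for a model: lower is better."""
    return mean_abs_error + risk_factor * sd_abs_error


# Values taken from the example report above.
score = evaluation(0.019528, 0.023731)
cautious_score = evaluation(0.019528, 0.023731, risk_factor=2.0)
```

Raising risk_factor makes the spread of the errors count for more, so two models with the same mean error are separated by how consistent they are.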

Running predictions

The run_predict.py script downloads the current trading data for the selected currencies and runs an ensemble of the best models (5 by default) that have been saved for these currencies, period and target. The resulting prediction is the aggregated value of the constituent model predictions.

# Runs ensemble of best models for BTC_ETH ticker and outputs the aggregated prediction.
# Default period: day, default target: high.
$ ./run_predict.py BTC_ETH
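The README does not say how the constituent predictions are aggregated. Purely as an illustration, the sketch below assumes the models with the lowest evaluation score are selected and their predicted changes are averaged; `aggregate_top_models` is a hypothetical name, and the real ensemble may weight or combine models differently.

```python
def aggregate_top_models(models, top_n=5):
    """Averages the predictions of the top_n models with the lowest score.

    models: list of (evaluation_score, predicted_change) pairs.
    """
    best = sorted(models, key=lambda m: m[0])[:top_n]
    return sum(change for _, change in best) / len(best)


models = [(0.05, 0.010), (0.04, 0.020), (0.09, -0.050), (0.03, 0.030)]
prediction = aggregate_top_models(models, top_n=3)
```

Here the worst model (score 0.09) is left out, which is the same reason the README suggests cleaning weak models out of the _zoo directory.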

License

Apache 2.0


time-series-machine-learning's Issues

time-series-machine-learning/predict/ensemble.py", line 29, in predict_aggregated vlog2('Predicted values:\n', changes[:, :6]) IndexError: too many indices for array

command:
python3 ./run_predict.py --period=5m USDT_BTC

error output:
[2018-09-13 23:34:38] Using tickers: "USDT_BTC"

[2018-09-13 23:34:38] Using period: "5m"
[2018-09-13 23:34:38] Fetching USDT_BTC: https://poloniex.com/public?command=returnChartData&currencyPair=USDT_BTC&start=1536822278&end=4294967296&period=300
[2018-09-13 23:34:39] Fetched USDT_BTC (5m)
Traceback (most recent call last):
File "./run_predict.py", line 29, in <module>
main()
File "./run_predict.py", line 20, in main
result_df = predict_multiple(job, raw_df=raw_df, rows_to_predict=1)
File "/home/dragon/tensorflow/time-series-machine-learning/predict/ensemble.py", line 75, in predict_multiple
predictions = ensemble.predict_aggregated(df, last_rows=rows_to_predict)
File "/home/dragon/tensorflow/time-series-machine-learning/predict/ensemble.py", line 29, in predict_aggregated
vlog2('Predicted values:\n', changes[:, :6])
IndexError: too many indices for array

Using Python 3.5, OS Ubuntu 16.04 (64 bits), tensorflow 1.10
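One way this error can arise in numpy is indexing a 1-D array with two indices. Whether that is what happens inside predict_aggregated is not confirmed by the report; the array contents below are made up, and `np.atleast_2d` is shown only as one possible guard.

```python
import numpy as np

# A 1-D array, as might result from predicting a single row.
changes = np.array([0.01, -0.02, 0.005])

try:
    changes[:, :6]  # two indices on a 1-D array
    message = ""
except IndexError as error:
    message = str(error)

# Promoting the array to 2-D makes the same slice valid.
changes_2d = np.atleast_2d(changes)
preview = changes_2d[:, :6]
```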

ConvModel Fix

Hi,

the ConvModel in cnn_model.py has an error.

Quick fix:

Replace the line

super().__init__(**params)

with

TensorflowModel.__init__(self, **params)

And it will work again.
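The report does not state the cause. One possibility is the Python version: the zero-argument form of super() exists only in Python 3, while the README lists Python 2.7 as tested. The classes below are simplified stand-ins, not the project's real ones, and show a call form that works on both versions.

```python
class TensorflowModel(object):
    """Stand-in base class for illustration only."""

    def __init__(self, **params):
        self.params = params


class ConvModel(TensorflowModel):
    """Stand-in subclass for illustration only."""

    def __init__(self, **params):
        # Explicit two-argument super() is valid on Python 2 and Python 3.
        super(ConvModel, self).__init__(**params)


model = ConvModel(kernel_size=3)
```

Calling TensorflowModel.__init__(self, **params) directly, as suggested above, has the same effect for single inheritance.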

Sorry, don't have the time right now to open a pull request and thanks again for this great project, I think it is an awesome boilerplate.

can I use my own dataset?

How to use my own dataset?

How to determine "buy" or "sell"?

Hi,

first of all thank you for this great project. I learned a lot already!

Do I need to average over the sign accuracy for both low and high target levels after training and the one having the most "confidence" determines the buy or sell signals?

Or can I somehow determine this when running the predictions? (which I assume would sound more correct to me)

Thanks again!

Only plot prediction for past

Dear maxim,

I see that run_visual.py only shows the prediction for existing data. How can it show the prediction for a few timeframes ahead, 5 days for example?

Thanks for your hard work!

new complementary tool

I want to offer a new point of view, and my collaboration.

Why this stock prediction project ?

Things this project offers that I did not find in other free projects, are:

Testing with +-30 models. Multiple combinations features and multiple selections of models (TensorFlow , XGBoost and Sklearn )
Threshold and quality models evaluation
Use 1k technical indicators
Method of best features selection (technical indicators)
Categorical target (do buy, do sell and do nothing) simple and dynamic, instead of continuous target variable
Powerful open-market-real-time evaluation system
Versatile integration with: Twitter, Telegram and Mail
Train Machine Learning model with Fresh today stock data

https://github.com/Leci37/stocks-prediction-Machine-learning-RealTime-telegram/tree/develop

how to fix??

File "C:\anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 1068, in to_datetime
values = convert_listlike(arg._values, format)
File "C:\anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 393, in _convert_listlike_datetimes
return _to_datetime_with_unit(arg, unit, name, tz, errors)
File "C:\anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 557, in _to_datetime_with_unit
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
File "pandas_libs\tslib.pyx", line 312, in pandas._libs.tslib.array_with_unit_to_datetime
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: cannot convert input with unit 'ms'

Prediction time-frame

I might be missing something, please correct me if so but I believe this model is predicting exactly 1 step behind at all times. Whenever I look at the time of the predictions it's always in the past, including day predictions.

For example, at 5:42:

[2019-06-14 05:41:48] Fetching BTC_ETH: https://poloniex.com/public?command=returnChartData&currencyPair=BTC_ETH&start=1555555555&end=4294967296&period=300
[2019-06-14 05:41:51] Fetched BTC_ETH (5m)
C:\Users\username\Desktop\time-series-machine-learning-master\time-series-machine-learning-master\util\data_util.py:74: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  row = window.as_matrix().reshape((-1,))
[2019-06-14 05:42:04] Latest chart info:
                       close     high       low     open  quoteVolume    volume  weightedAverage
date
2019-06-14 05:35:00  0.03093  0.03093  0.030915  0.03093    34.577163  1.068963         0.030915
2019-06-14 05:40:00  0.03093  0.03093  0.030930  0.03093     0.028200  0.000872         0.030930

[2019-06-14 05:42:04] Prediction for "high":
                     Prediction  Current-Truth
Time
2019-06-14 05:40:00     0.03093        0.03093

Using the same http link just before I run the command I can see that there's that exact result at 5:40.
It's not predicting ahead, it's already there. Am I missing something?

why shuffling data?

Hello,
Nice and interesting work, I learned a lot.
While building the training and test datasets, why are you shuffling the data? I thought that for time series we should not shuffle the data.

data_utils.py

def split_dataset(dataset, ratio=None):
    size = dataset.size
    if ratio is None:
        ratio = _choose_optimal_train_ratio(size)

    mask = np.zeros(size, dtype=np.bool_)
    train_size = int(size * ratio)
    mask[:train_size] = True
    np.random.shuffle(mask)

    train_x = dataset.x[mask, :]
    train_y = dataset.y[mask]

    mask = np.invert(mask)
    test_x = dataset.x[mask, :]
    test_y = dataset.y[mask]

    return DataSet(train_x, train_y), DataSet(test_x, test_y)

Regards,
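For comparison, a split that keeps the temporal order could look like the sketch below. It is not the project's method (the quoted code shuffles the mask); `split_chronologically` and the 0.8 default ratio are assumptions made for illustration.

```python
import numpy as np


def split_chronologically(x, y, ratio=0.8):
    """Uses the earliest rows for training and the latest rows for testing."""
    train_size = int(len(x) * ratio)
    train = (x[:train_size], y[:train_size])
    test = (x[train_size:], y[train_size:])
    return train, test


# Rows are assumed to be ordered by time already.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)
(train_x, train_y), (test_x, test_y) = split_chronologically(x, y)
```

With this split, every test example comes after every training example, so the test score is not helped by neighbouring windows that overlap in time.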

Question - predict.py default time period

This is a fascinating project, nice work. I have managed to get everything to run but have a couple of questions. Do you have a recommended time for training the data (for the first time)? Secondly, can you provide a bit more information about the best time to run the model, presumably at the start of the day to predict the remainder of the day? Following on from this, is there any way to run predict.py with something longer than the default day? I have tried changing this but cannot get anything to run.

Hope my questions make some sort of sense. Thanks for a great project and any additional info.

TypeError: dtype '<type 'datetime.datetime'>' not understood

Hi @maxim5, I tried to run your code.

Fetching the data (run_fetch.py) worked, but running the training (run_train.py) afterwards failed with the error

TypeError: dtype '<type 'datetime.datetime'>' not understood

Do you know about it? I tried to fix it in the to_changes function but just ran into another error. It seems that converting the date/time fields is not working anymore.

Pay attention to data leak in Conv1D

From Keras docs:
padding: One of "valid", "causal" or "same" (case-insensitive). "valid" means "no padding". "same" results in padding the input such that the output has the same length as the original input. "causal" results in causal (dilated) convolutions, e.g. output[t] does not depend on input[t + 1:]. A zero padding is used such that the output has the same length as the original input. Useful when modeling temporal data where the model should not violate the temporal order. See WaveNet: A Generative Model for Raw Audio, section 2.1.

You should use padding='causal', as in the WaveNet model case.
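What causal padding means can be shown without any framework. The sketch below is a plain-Python illustration with a hypothetical `causal_conv1d` helper, not Keras code: the input is padded with zeros on the left only, so output[t] never sees input[t + 1:].

```python
def causal_conv1d(inputs, kernel):
    """1-D convolution whose output[t] depends only on inputs[:t + 1]."""
    k = len(kernel)
    # Pad k - 1 zeros on the left so the output keeps the input length.
    padded = [0.0] * (k - 1) + list(inputs)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(inputs))]


kernel = [0.5, 0.25, 0.25]
out_a = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel)
# Changing only the last input must leave all earlier outputs untouched.
out_b = causal_conv1d([1.0, 2.0, 3.0, 100.0], kernel)
```

With "same" padding the zeros are split between both sides, so each output also mixes in later inputs, which is the leak the issue warns about.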

run_train infinite loop?

Hi,

First of all, congrats on this project, it appears to be very promising.

I ran run_train like this: ./run_train.py --target=low BTC_ETH --period=day, and 2 days later it's still running, with around 77 _zoo/BTC_ETH sub folders, all of them LinearModel.

Could that be the reason the training is still ongoing, i.e. it keeps trying to find other models with good results, without success?
I did not find where to configure the limit.

Thanks and keep up the good work!