mcompetitions / m4-methods

Data, Benchmarks, and methods submitted to the M4 forecasting competition

R 85.32% Python 9.93% MATLAB 3.45% Shell 0.07% CMake 0.46% Dockerfile 0.77%

m4-methods's Issues

RNN benchmark data reshaping

Hello. Thanks for sharing this repository.

I'm looking at ML_benchmarks.py#L156 and noticed that you just reshape x_train to have sequences along each row. This means that your rows do not have any overlap between them. Doesn't this hinder the RNN's performance?
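
To make the question concrete, here is a minimal sketch contrasting overlapping windows (stride 1) with non-overlapping chunks. The helper `make_windows` and the toy series are illustrative assumptions and are not taken from ML_benchmarks.py.

```python
def make_windows(series, window, stride):
    """Return consecutive windows of length `window`, advancing by `stride`."""
    last_start = len(series) - window
    return [series[i:i + window] for i in range(0, last_start + 1, stride)]


series = list(range(1, 9))  # toy series 1..8 (hypothetical data)

overlapping = make_windows(series, window=3, stride=1)  # rows share values
disjoint = make_windows(series, window=3, stride=3)     # rows share nothing

print(len(overlapping))  # 6
print(disjoint)          # [[1, 2, 3], [4, 5, 6]]
```

With stride 1 the same series yields three times as many training rows here, which is the trade-off the question is about.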

dataset

Hello, excuse me. I would like to ask, what do V1, V2, V3, V4, V5, V6, etc. in the dataset represent? What do Y1, Y2, Y3, Y4 represent?

ignore

the order of data, from left (begin) to right (end)?

I have a quick question regarding the order of the data.
Each row is a time series: does it run from left to right (small column index to large index), or in the other direction?
Thanks!

Replicating ARIMA results

How were the ARIMA results generated? I can only find the submission file with the results, but not the code used to generate them. Was any preprocessing applied? Any non-default hyper-parameters settings? Did you use forecast's auto.arima (v8.2)?

License for "Benchmarks and Evaluation.R"

Hi there, can we reuse and redistribute code from "Benchmarks and Evaluation.R" in our organization?

test dataset

Hello, is this the best validation dataset in the dataset? Is it generated by someone else's model? Thanks.

M4 theta method (question)

Hi,

I came across this repository while searching for an existing forecasting tool.
Regarding the Theta method shown here, is there a reference that explains the model (i.e. the logic of the code)?

Much appreciated.

Replicating "260 - KaterinaKou" method results

Hi,
I am having issues replicating the results from the code submitted by Nikoletta-Zampeta Legaki. Running the code in RStudio gives the following error:

Error in datasets[[j]][[i]] : subscript out of bounds

Your assistance in this regard is much appreciated!

Potential security issues: SIOPREDM4.exe

Sorry if this is a false alarm, but my antivirus just reacted to
SIOPREDM4.exe

Submission for 244 Alves Santos Junior missing

Is the source code for this entrant not available or have I just overlooked it?

where can I find template_Naive.csv?

Hello :)
I was trying to run predict.py, but line 105 of predict.py needs template_Naive.csv, which I couldn't find in the M4 dataset. Where can I find it?

Thank you very much

Questions about the range and format of the data's StartingDate

Hi all, I was wondering whether there is any information on the date format or the range of the StartingDate column in M4-info.csv.

From my observation, the StartingDate is usually in the format "DD-MM-YY hh:mm", but some entries break this rule or are ambiguous. For example:

M369, whose StartingDate is 1882-07-01 12:00:00 (which I suppose is the 1st of July, 1882).
M376, whose StartingDate is 01-01-17 12:00 (I can't tell whether this is the 1st of January 1917 or 2017).

Any clarification is appreciated!
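
A small sketch of the ambiguity, assuming only the two layouts quoted above occur. The helper `parse_starting_date` is hypothetical, and the day-first reading of "DD-MM-YY" is the observation from this post rather than a documented fact. Python's `%y` applies a fixed pivot to two-digit years, so it picks a century on its own, which may or may not match the intended date.

```python
from datetime import datetime

# Layouts assumed from the two examples in this post; others may exist.
LAYOUTS = ("%d-%m-%y %H:%M", "%Y-%m-%d %H:%M:%S")


def parse_starting_date(text):
    """Return (datetime, layout) for the first layout that matches."""
    for layout in LAYOUTS:
        try:
            return datetime.strptime(text, layout), layout
        except ValueError:
            continue
    raise ValueError(f"unrecognised StartingDate: {text!r}")


print(parse_starting_date("1882-07-01 12:00:00")[0].year)  # 1882
print(parse_starting_date("01-01-17 12:00")[0].year)       # 2017 (library pivot, not ground truth)
```

The second result shows why the two-digit form cannot be trusted for old series: the parser silently chooses 2017, and nothing in the string rules out 1917.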

237 - prologistica replication (missing source files)

The submission 237 - prologistica does not replicate due to missing R files classifier.R and models.R referenced in model_choice_M4.r.

MLP with additive seasonality?

Hi,

I'm having issues replicating the results for the MLP method, especially for the hourly dataset.

I'm using the hyper-parameter settings found in https://github.com/Mcompetitions/M4-methods/blob/master/ML_benchmarks.py.

      Yearly      Quarterly   Monthly     Weekly       Daily        Hourly
MLP   -7.910408   -7.948233   -4.790635   -44.658715   -51.489338   147.500501

Values are the percentage difference between the published sMAPE and my replicated sMAPE: a value of 100 means a 100% difference, and positive values indicate that the replicated results are worse than the published ones.
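
One plausible reading of that definition, as a sketch: the difference is taken relative to the published value. The function name `pct_diff` and the numbers below are made up for illustration and are not the sMAPE values behind the table.

```python
def pct_diff(replicated, published):
    """Percentage difference of a replicated score relative to the published one."""
    return 100.0 * (replicated - published) / published


# Hypothetical sMAPE values, chosen only to show the sign convention.
print(pct_diff(15.0, 20.0))  # -25.0  (replicated error is lower, i.e. better)
print(pct_diff(50.0, 20.0))  # 150.0  (replicated error is higher, i.e. worse)
```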

The plot shows part of H1 training series, the full test series and MLP point forecasts, where y_pred_orig are the point forecasts found in the submission-MLP.rar file, y_pred_add are point forecasts I obtain with additive deseasonalisation and y_pred_mul are point forecasts I obtain with multiplicative deseasonalisation.

I find similar patterns for RNN. Are you using additive seasonality by any chance? Any other idea where the deviation may come from?
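
For readers unfamiliar with the distinction, here is a simplified sketch of additive versus multiplicative deseasonalisation. It estimates seasonal indices from plain per-phase means, which is an assumption made for brevity and not the routine in ML_benchmarks.py; the function names and the toy series are hypothetical.

```python
def seasonal_indices(series, period, mode):
    """Per-phase index: phase mean minus (additive) or over (multiplicative) the overall mean."""
    overall = sum(series) / len(series)
    indices = []
    for phase in range(period):
        values = series[phase::period]
        phase_mean = sum(values) / len(values)
        if mode == "additive":
            indices.append(phase_mean - overall)
        else:
            indices.append(phase_mean / overall)
    return indices


def deseasonalise(series, period, mode):
    """Remove the seasonal component by subtraction or division."""
    idx = seasonal_indices(series, period, mode)
    if mode == "additive":
        return [y - idx[t % period] for t, y in enumerate(series)]
    return [y / idx[t % period] for t, y in enumerate(series)]


# Toy series whose seasonal swing doubles together with its level.
series = [10, 20, 30, 40, 20, 40, 60, 80]

mul = deseasonalise(series, 4, "multiplicative")
add = deseasonalise(series, 4, "additive")
print([round(v, 1) for v in mul])  # [25.0, 25.0, 25.0, 25.0, 50.0, 50.0, 50.0, 50.0]
print([round(v, 1) for v in add])  # [32.5, 27.5, 22.5, 17.5, 42.5, 47.5, 52.5, 57.5]
```

On this toy series the multiplicative version recovers a flat level in each cycle, while the additive version leaves a residual seasonal pattern, so the choice can change the forecasts noticeably when the seasonal amplitude scales with the level.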

Reproducing benchmark point forecasts

Hi,

What version of the forecast package was used for calculating the point forecasts of the benchmarks?

I'm asking because when I try to reproduce the benchmark point forecasts with the code from "Benchmarks and Evaluation.R", I get different results for SES, Holt, etc. I'm using forecast 8.21 on R 4.3.1 on Linux.

Thank you.

add topics

I suggest adding the topics time-series, time-series-analysis, forecasting to the About section at https://github.com/Mcompetitions/M4-methods

Under which license is the data released?

Depending on where the data come from, the original sources might determine which license applies to the M4 dataset.

As a for-profit business, we need to work out what we are allowed to do with this data.

smape_cal function missing mean()

It seems that the mean() call is missing from the smape_cal function in Benchmarks and Evaluation.R.
Arsa
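
For comparison, a minimal sketch of sMAPE that includes the averaging step, assuming the usual definition of the mean of 200 * |y - f| / (|y| + |f|). This is a Python illustration with hypothetical inputs, not the R code from the repository.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent: per-point terms averaged over the horizon."""
    terms = [
        2.0 * abs(a - f) / (abs(a) + abs(f))
        for a, f in zip(actual, forecast)
    ]
    # Without this averaging step the result would be one value per point.
    return 100.0 * sum(terms) / len(terms)


# Hypothetical actuals and forecasts.
print(round(smape([100.0, 200.0], [110.0, 180.0]), 3))  # 10.025
```

Note that a term is undefined when an actual and its forecast are both zero; the sketch does not handle that case.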

dataset

Hello, the paper "The M4 Competition: 100000 time series and 61 forecasting methods" states that the M4 dataset is divided into six data frequencies and six application fields. For the Yearly dataset, Micro accounts for 6538 series, Industry for 3716, Macro for 3903, Finance for 6519, Demographic for 1088, and Other for 1236. Which rows of the dataset belong to Micro? Which rows belong to Industry, to Macro, and to Finance?
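
If the per-series metadata lives in M4-info.csv, a lookup along these lines could answer the question. The column names `M4id`, `category` and `SP` are assumptions about that file, and the sample rows below are invented purely so the sketch runs on its own.

```python
import csv
import io

# Invented stand-in for M4-info.csv; real ids and categories will differ.
SAMPLE = """M4id,category,SP
Y1,Macro,Yearly
Y2,Micro,Yearly
Q1,Finance,Quarterly
Y3,Micro,Yearly
"""


def ids_by_category(csv_file, frequency):
    """Group series ids by category for one frequency (assumed column names)."""
    groups = {}
    for row in csv.DictReader(csv_file):
        if row["SP"] == frequency:
            groups.setdefault(row["category"], []).append(row["M4id"])
    return groups


print(ids_by_category(io.StringIO(SAMPLE), "Yearly"))
# {'Macro': ['Y1'], 'Micro': ['Y2', 'Y3']}
```

To run it against the real file, pass an opened file handle instead of the `io.StringIO` sample and adjust the column names if they differ.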

Inconsistent seasonality tests

The seasonality tests in Python and R seem to give different results.

If you run the R code snippet below, you get FALSE. If you run the Python snippet you get True.

R code:

# copied from https://github.com/Mcompetitions/M4-
SeasonalityTest <- function(input, ppy){
  #Used to determine whether a time series is seasonal
  tcrit <- 1.645
  if (length(input)<3*ppy){
    test_seasonal <- FALSE
  }else{
    xacf <- acf(input, plot = FALSE)$acf[-1, 1, 1]
    clim <- tcrit/sqrt(length(input)) * sqrt(cumsum(c(1, 2 * xacf^2)))
    test_seasonal <- ( abs(xacf[ppy]) > clim[ppy] )
    
    if (is.na(test_seasonal)==TRUE){ test_seasonal <- FALSE }
  }
  
  return(test_seasonal)
}

data <- c(2.62434536, -0.61175641, -0.52817175, -1.07296862,  1.86540763,
          -2.3015387 ,  1.74481176, -0.7612069 ,  1.3190391 , -0.24937038,
          1.46210794, -2.06014071,  0.6775828 , -0.38405435,  1.13376944,
          -1.09989127)
ppy <- 4 
SeasonalityTest(data, ppy)

Python code:

import numpy as np
from math import sqrt
data = np.array([2.62434536, -0.61175641, -0.52817175, -1.07296862,  1.86540763,
-2.3015387 ,  1.74481176, -0.7612069 ,  1.3190391 , -0.24937038,
1.46210794, -2.06014071,  0.6775828 , -0.38405435,  1.13376944,
-1.09989127])

# copied from https://github.com/Mcompetitions/M4-methods/blob/master/ML_benchmarks.py
def seasonality_test(original_ts, ppy):
    """
    Seasonality test
    :param original_ts: time series
    :param ppy: periods per year
    :return: boolean value: whether the TS is seasonal
    """
    s = acf(original_ts, 1)
    for i in range(2, ppy):
        s = s + (acf(original_ts, i) ** 2)

    limit = 1.645 * (sqrt((1 + 2 * s) / len(original_ts)))

    return (abs(acf(original_ts, ppy))) > limit


def acf(data, k):
    """
    Autocorrelation function
    :param data: time series
    :param k: lag
    :return:
    """
    m = np.mean(data)
    s1 = 0
    for i in range(k, len(data)):
        s1 = s1 + ((data[i] - m) * (data[i - k] - m))

    s2 = 0
    for i in range(0, len(data)):
        s2 = s2 + ((data[i] - m) ** 2)

    return float(s1 / s2)

ppy = 4
seasonality_test(data, ppy)

The difference is that in the Python code you do not take the square of the autocorrelation coefficient at the first lag, i.e.

s = acf(original_ts, 1) ** 2
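
A self-contained sketch of the proposed fix, written in plain Python so it runs without NumPy. The `square_first_lag` flag is a hypothetical switch added here only to compare the two behaviours side by side, and the length check from the R version is omitted.

```python
from math import sqrt

DATA = [2.62434536, -0.61175641, -0.52817175, -1.07296862, 1.86540763,
        -2.3015387, 1.74481176, -0.7612069, 1.3190391, -0.24937038,
        1.46210794, -2.06014071, 0.6775828, -0.38405435, 1.13376944,
        -1.09989127]


def acf(data, k):
    """Sample autocorrelation of `data` at lag `k`."""
    m = sum(data) / len(data)
    num = sum((data[i] - m) * (data[i - k] - m) for i in range(k, len(data)))
    den = sum((x - m) ** 2 for x in data)
    return num / den


def seasonality_test(ts, ppy, square_first_lag=True):
    """90% autocorrelation test at lag `ppy`; the flag toggles the proposed fix."""
    first = acf(ts, 1)
    s = first ** 2 if square_first_lag else first
    for i in range(2, ppy):
        s += acf(ts, i) ** 2
    limit = 1.645 * sqrt((1 + 2 * s) / len(ts))
    return abs(acf(ts, ppy)) > limit


print(seasonality_test(DATA, 4, square_first_lag=False))  # True  (behaviour reported above)
print(seasonality_test(DATA, 4))                          # False (agrees with the R result)
```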

Benchmarks naive_seasonal

In the file 'Benchmarks and Evaluation.R', function naive_seasonal, line 43: is "+ frcst - frcst" actually meaningful? It seems to do nothing.
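
Numerically, adding and then subtracting the same vector leaves the values unchanged, so a seasonal naive forecast can be sketched without that term. The Python function below is a hypothetical illustration, not a port of the R function, and it says nothing about any effect the expression may have on R object attributes.

```python
def seasonal_naive(history, period, horizon):
    """Repeat the last full seasonal cycle until `horizon` values are produced."""
    last_cycle = history[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


# Toy history with a period of 4 (hypothetical data).
print(seasonal_naive([1, 2, 3, 4, 5, 6, 7, 8], period=4, horizon=6))
# [5, 6, 7, 8, 5, 6]
```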

A Full-Pipeline Automated Time Series (AutoTS) Analysis Toolkit.

https://github.com/DataCanvasIO/HyperTS