FinnForecast 💸

Hello! 👋

Thanks for checking out my project.

FinnForecast is a comprehensive stock forecasting tool developed as a final project for my Intro to Python class. This program integrates traditional time-series modeling with a recurrent neural network, with the goal of creating a more accurate model for forecasting stock prices.

📖 Table of Contents

Project Overview
Installation
Usage
Files and Directories
How It Works
Results
Languages, Frameworks, and Tools

💻 Project Overview

FinnForecast leverages statistical and machine-learning techniques to forecast stock market behavior. The program is designed to be simple to use and robust to user error. Users can either create a forecast or run a test of the program's accuracy. Both options offer a simple mode, which uses default model parameters, and an advanced mode, which gives the user more control over the performance and accuracy of the models. The program uses an Autoregressive Integrated Moving Average (ARIMA) model for the traditional forecast and a Long Short-Term Memory (LSTM) neural network for the machine-learning forecast. To integrate these models, FinnForecast uses a dynamic weighted average: it builds sample models and forecasts from a section of the training data and keeps the weights that give the smallest Absolute Mean Percent Error (aMPE) between the sample forecast and the actual values.

🔧 Installation

To set up the FinnForecast project on your local machine, follow these steps:

Download the Project Files:
- Ensure you have received the project files. You can download them from the provided link or source.
Extract the Files:
- If the project files are in a zip archive, extract them to your desired directory.
Navigate to the Project Directory:
- Open your terminal or command prompt and navigate to the directory where you extracted the project files.
```
cd path/to/FinnForecast
```
Ensure You Have the Necessary Libraries Installed:
- This program uses the following libraries:
  - pmdarima
  - scikit-learn
  - pandas
  - numpy
  - tensorflow
  - yfinance
  - matplotlib
  - seaborn
- Use the following command to install all of the libraries at once:
```
pip install pmdarima scikit-learn pandas numpy tensorflow yfinance matplotlib seaborn
```

👨‍💻 Usage

Run the Main Program:
- Execute the 'main.py' file to begin running the program
```
python main.py
```
Select a Mode:
- The program gives the user 4 modes to select from.
```
Hello! Welcome to FinnForecast! 
=============Menu==============
(1) Create a simple forecast
(2) Create an advanced forecast
(3) Run a simple test
(4) Run an advanced test
(5) Quit
===============================
Please select an option (1-5): 
```
- Modes (1) and (2) take a stock ticker and a forecast length (in months) and output the raw stock data, the ARIMA, LSTM, and Hybrid forecasts, and a plot of each forecast.
- Modes (3) and (4) test the accuracy of the forecasts. The user enters a forecast length and the number of stocks to test with. For each stock, the program holds back a slice from the end of the stock data equal to the forecast length, forecasts that slice from the remaining data, and compares the test forecast to the actual values. The results are written to the file test_output.txt.
- To run a test on a specific set of stock tickers, simply paste a .csv containing the tickers in the root folder, and rename it 'stockTickers.csv'. Then, run a test through main.py as normal. The default stockTickers.csv file contains tickers for the stock components of the S&P 500 Index.
- The difference between the simple and advanced modes is that the simple mode runs the forecast with default LSTM and Hybrid parameters. The advanced mode allows the user to enter parameters for the LSTM and Hybrid model, including sequence_length, batch_size, epochs, and step.

📁 Files and Directories

The file structure of FinnForecast is as follows:

FinnForecast/
├── data/
│      ├── get_top_stocks.py
│      ├── russel1000_tickers.csv
│      ├── sp500_tickers.csv
│      ├── Yahoo Ticker Symbols - September 2017.xlsx
├── forecast/
│      ├── __init__.py
│      ├── forecast.py
│      ├── models/
│            ├── __init__.py
│            ├── arima.py
│            ├── lstm.py
│            ├── hybrid.py
├── ioprocessing/
│      ├── __init__.py
│      ├── fetchStockData.py
│      ├── plot.py
│      ├── output.py
│      ├── params.py
├── tests/
│      ├── __init__.py
│      ├── testAvgMPE.py
│      ├── stockTickers.csv/
├── output/
├── venv/
├── .git/
├── __pycache__/
├── README.md
├── requirements.txt
├── main.py

📝 How It Works

Fetching, Preprocessing, and Cleaning the Data: fetchStockData.py

This program uses the open-source yfinance library to download stock data from Yahoo Finance. When the user enters a stock ticker for the forecast, the program downloads the monthly stock data for that ticker. For cleaning and preprocessing, the program drops null values and converts the index to datetime using the pandas library.
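
The fetch-and-clean step can be sketched roughly as follows. This is a minimal illustration, not the contents of fetchStockData.py: the function names `clean_stock_data` and `fetch_monthly_data` are hypothetical, and the choice of `period="max"` is an assumption.

```python
import pandas as pd


def clean_stock_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and convert the index to datetime."""
    cleaned = raw.dropna().copy()
    cleaned.index = pd.to_datetime(cleaned.index)
    return cleaned


def fetch_monthly_data(ticker: str) -> pd.DataFrame:
    """Download monthly price history for one ticker (needs network access)."""
    import yfinance as yf  # imported lazily so the cleaning helper works offline

    # Assumption: request the full available history at monthly resolution.
    raw = yf.download(ticker, period="max", interval="1mo", progress=False)
    return clean_stock_data(raw)


# Offline demonstration of the cleaning step on a tiny hand-made frame.
sample = pd.DataFrame(
    {"Close": [10.0, None, 12.5]},
    index=["2024-01-01", "2024-02-01", "2024-03-01"],
)
cleaned_sample = clean_stock_data(sample)
```

Keeping the cleaning logic separate from the download makes it possible to check the preprocessing without calling the API.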

Traditional Time Series Modeling (ARIMA): arima.py

The traditional statistical time series model in this program is the Autoregressive Integrated Moving Average (ARIMA) model. The ARIMA model is used in fields such as economics and finance for modeling and forecasting time series trends. It consists of three different components:

The Autoregressive (AR) Component

In an AR model, the current value of the series is expressed as a linear combination of its previous values. This assumes that past values have a direct impact on the present value. In other words, the series is regressed on its prior values, from the immediate prior value, Y_{t-1}, back to a set lag, p, for Y_{t-p}. For example, an Autoregressive Model with a lag p = 2, AR(2), of variable Y, would look like this:

Y_t = c + ϕ₁Y_{t-1} + ϕ₂Y_{t-2} + ϵ_t
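
As a small worked example, the AR(2) prediction can be computed by hand. The coefficients below are made up for illustration and are not fitted to any real stock; `ar2_one_step` is a hypothetical helper name.

```python
def ar2_one_step(c, phi1, phi2, y_prev1, y_prev2):
    """One-step AR(2) prediction: c + phi1 * Y_{t-1} + phi2 * Y_{t-2}."""
    return c + phi1 * y_prev1 + phi2 * y_prev2


# Illustrative values only: the last two observations of a toy series.
series = [100.0, 102.0]
prediction = ar2_one_step(
    c=1.0, phi1=0.5, phi2=0.25, y_prev1=series[-1], y_prev2=series[-2]
)
# 1.0 + 0.5 * 102.0 + 0.25 * 100.0 = 77.0
```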

The Integrated (I) Component

A crucial property needed for data to be modeled in an ARIMA model is stationarity. When a series is stationary, properties such as its mean and variance are constant over time. Modeling with non-stationary data can result in unreliable forecasts. The Integrated component of the ARIMA model addresses this issue with differencing, which replaces each observation with its change from the previous observation. The order of the Integrated component is the number of times the series must be differenced to achieve stationarity. Here is what first- and second-order differencing look like for a series Y:

First order differencing (I = 1): ΔY_t = Y_t - Y_{t-1}

Second order differencing (I = 2): Δ²Y_t = Δ(ΔY_t) = Y_t - 2Y_{t-1} + Y_{t-2}
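
The two formulas can be checked numerically with NumPy's `np.diff`. The series here is a toy example chosen so that the second difference is constant.

```python
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])

first_diff = np.diff(y, n=1)   # Y_t - Y_{t-1}
second_diff = np.diff(y, n=2)  # difference of the first differences

# The expanded second-order formula: Y_t - 2*Y_{t-1} + Y_{t-2}
manual_second = y[2:] - 2 * y[1:-1] + y[:-2]
```

Each round of differencing shortens the series by one observation, which is why `second_diff` has three values for a five-value input.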

The Moving Average (MA) Component

The MA component models the relationship between the current observation and the past error terms (residuals) of the model. This essentially corrects the model for recent short-term forecast errors. For example, if the model overestimates (the forecast is too high), the residual will be negative, and the MA component will adjust the forecast downward to correct the error. The parameter of the MA model, q, represents the number of past residuals the MA component accounts for, where:

For residual, ϵ: ϵ_t = Y_t - Ŷ_t

Second order MA model (MA(2)): Y_t = μ + ϵ_t + θ₁ϵ_{t-1} + θ₂ϵ_{t-2}
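
The correction effect can be seen in a small hand calculation. When forecasting one step ahead, the not-yet-observed error ϵ_t is replaced by its expected value of zero. The numbers and the helper name `ma2_one_step` are illustrative assumptions.

```python
def ma2_one_step(mu, theta1, theta2, resid_prev1, resid_prev2):
    """One-step MA(2) forecast: mu + theta1 * e_{t-1} + theta2 * e_{t-2}."""
    return mu + theta1 * resid_prev1 + theta2 * resid_prev2


# Toy history in which the model overestimated twice in a row.
actual = [50.0, 52.0]
predicted = [51.0, 54.0]
residuals = [a - p for a, p in zip(actual, predicted)]  # both negative

forecast = ma2_one_step(
    mu=50.0,
    theta1=0.5,
    theta2=0.25,
    resid_prev1=residuals[-1],
    resid_prev2=residuals[-2],
)
# 50.0 + 0.5 * (-2.0) + 0.25 * (-1.0) = 48.75, below the mean of 50.0
```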

arima.py

The arima.py file contains several functions for use in the ARIMA model and other parts of the program. Its main use is to take stock data and a forecast length and return an ARIMA forecast for use in the hybrid model. This ARIMA forecast is produced with the pmdarima library, specifically the auto_arima function, which automatically selects the best parameters for the AR lags, the differencing order, and the MA lags.
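
A call to auto_arima might look like the sketch below. The wrapper name `arima_forecast` is hypothetical, and the options passed to `auto_arima` (non-seasonal, warnings suppressed) are assumptions rather than the settings used in arima.py.

```python
def arima_forecast(prices, forecast_length):
    """Fit an automatically selected ARIMA model and forecast ahead."""
    import pmdarima as pm  # imported lazily; requires `pip install pmdarima`

    model = pm.auto_arima(
        prices,
        seasonal=False,          # assumption: plain (non-seasonal) ARIMA
        suppress_warnings=True,
        error_action="ignore",   # skip candidate orders that fail to fit
    )
    # predict() returns the next `forecast_length` values after the data.
    return model.predict(n_periods=forecast_length)
```

For a cleaned price series `prices`, `arima_forecast(prices, 12)` would return a 12-step forecast. The (p, d, q) order that auto_arima selected is available on the fitted model as `model.order`.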

Long Short-Term Memory (LSTM) Model: lstm.py

Long Short-Term Memory is a type of Recurrent Neural Network (RNN) that is particularly suited to sequences of data, especially where the model needs to remember information for extended periods, as is the case with time-series data. For the sake of simplicity, I will not explain the full theory and mathematics behind LSTMs, but if you are interested in learning more, please check out the following links:
- Wikipedia: https://en.wikipedia.org/wiki/Long_short-term_memory
- Machine Learning Mastery: https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/
- Towards Data Science: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
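
To give a feel for how a price series is fed to an LSTM, here is a minimal sketch of the usual sliding-window preparation, using the sequence_length parameter from the advanced mode. The helper names, the single-layer architecture, and the 32 units are all assumptions for illustration and are not taken from lstm.py.

```python
import numpy as np


def make_sequences(values, sequence_length):
    """Split a 1-D series into (window, next value) training pairs."""
    values = np.asarray(values, dtype=float)
    windows, targets = [], []
    for start in range(len(values) - sequence_length):
        windows.append(values[start:start + sequence_length])
        targets.append(values[start + sequence_length])
    # Keras LSTM layers expect input shaped (samples, timesteps, features).
    X = np.array(windows).reshape(-1, sequence_length, 1)
    y = np.array(targets)
    return X, y


def build_lstm(sequence_length, units=32):
    """Hypothetical single-layer LSTM regressor (not the project's model)."""
    import tensorflow as tf  # imported lazily because TensorFlow loads slowly

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(sequence_length, 1)),
        tf.keras.layers.LSTM(units),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


X, y = make_sequences([1, 2, 3, 4, 5, 6], sequence_length=3)
```

A model built this way would be trained with `model.fit(X, y, epochs=..., batch_size=...)`, which is where the epochs and batch_size parameters come in. In practice the prices are usually scaled first, for example with scikit-learn's MinMaxScaler.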

The Hybrid Model: hybrid.py

The hybrid model in FinnForecast utilizes a weighted average approach toward integrating the ARIMA forecast with the LSTM forecast. In other words, the forecasts are combined with the following formula:

Let: H = Hybrid Forecast,
A = ARIMA Forecast,
L = LSTM Forecast,
0 <= x <= 1,
0 <= y <= 1, and
x + y = 1

Weighted Average Model: H = (x * A) + (y * L)

Hybrid Weights

To select the hybrid weights, x and y, the model iterates through possible values from 0 to 1 for x and y, ensuring x + y = 1. The model then selects the pair of weights with the lowest Absolute Mean Percent Error (aMPE), where, for forecast length f, actual values Y_t, and forecast values Ŷ_t, summing over the f forecast periods:

aMPE = | (Σ((Y_t - Ŷ_t) / Y_t)) / f |

To find the actual values Y for the forecast, the model runs a test forecast. Essentially, the stock data is split into two series, train and test, where train represents the values used to train the test forecasts, and test represents the values for Y when calculating aMPE. The model then iterates through the weighted average formula for the test ARIMA and LSTM forecasts, checking the aMPE for each weight. The model then selects the weights with the smallest aMPE and applies them to the real forecast for a final Hybrid Forecast.
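
The weight search described above can be sketched as a simple grid search. The function names and the default step size are assumptions; hybrid.py may organize this differently.

```python
import numpy as np


def ampe(actual, forecast):
    """Absolute Mean Percent Error: |mean((Y - Y_hat) / Y)|."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return abs(np.mean((actual - forecast) / actual))


def best_weights(actual, arima_fc, lstm_fc, step=0.01):
    """Grid-search the ARIMA weight x in [0, 1]; the LSTM weight is 1 - x."""
    arima_fc = np.asarray(arima_fc, dtype=float)
    lstm_fc = np.asarray(lstm_fc, dtype=float)
    n_steps = int(round(1 / step))
    best_x, best_err = 0.0, float("inf")
    for i in range(n_steps + 1):
        x = i / n_steps  # integer counter avoids floating-point drift
        hybrid = x * arima_fc + (1 - x) * lstm_fc
        err = ampe(actual, hybrid)
        if err < best_err:
            best_x, best_err = x, err
    return best_x, 1 - best_x, best_err


# Toy test split: the ARIMA forecast runs 10% high, the LSTM 10% low.
actual = [100.0, 100.0, 100.0]
x, y, err = best_weights(actual, [110.0] * 3, [90.0] * 3, step=0.1)
```

In this toy case the two errors cancel exactly at equal weights, so the search returns x = y = 0.5. Because aMPE takes the absolute value of the mean signed error, over- and underestimates within a single forecast also offset each other.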

Model Testing: testAvgMPE.py

FinnForecast also features the option to run a test of the model. For the model test, the program runs through 'stockTickers.csv', a .csv of stock tickers located in the root folder, creating a test forecast and calculating the aMPE for each stock. These test forecasts are similar to the test forecasts used to calculate the optimal hybrid weights. After running through the number of stocks the user selected from stockTickers.csv, the program finds the average aMPE across all forecasts for each of the ARIMA, LSTM, and Hybrid models. The program also counts how often each model had the lowest aMPE for a stock.
It should be noted that the Hybrid model is designed to always return the weights with the lowest aMPE among ARIMA, LSTM, and Hybrid (represented in weights of 0 or 1 if ARIMA or LSTM return the lowest aMPE). The model test accounts for this by marking ARIMA or LSTM as the best forecast in the case that the Hybrid model has greater than or equal aMPE.
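
The aggregation and tie-handling rule can be sketched as follows. The data layout, the function name `summarize`, and the choice to credit ARIMA when ARIMA and LSTM tie are assumptions made for this illustration.

```python
def summarize(results):
    """Average aMPE per model and count how often each model was best.

    `results` is a list of dicts such as
    {"ARIMA": 0.30, "LSTM": 0.40, "Hybrid": 0.25}, one per stock.
    """
    models = ("ARIMA", "LSTM", "Hybrid")
    averages = {m: sum(r[m] for r in results) / len(results) for m in models}
    best_counts = {m: 0 for m in models}
    for r in results:
        # Hybrid only counts as best when it is strictly better than both
        # single models; on a tie the single model gets the credit.
        if r["Hybrid"] < min(r["ARIMA"], r["LSTM"]):
            best_counts["Hybrid"] += 1
        elif r["ARIMA"] <= r["LSTM"]:
            best_counts["ARIMA"] += 1
        else:
            best_counts["LSTM"] += 1
    return averages, best_counts


sample_results = [
    {"ARIMA": 0.25, "LSTM": 0.5, "Hybrid": 0.25},   # tie, credited to ARIMA
    {"ARIMA": 0.5, "LSTM": 0.25, "Hybrid": 0.25},   # tie, credited to LSTM
    {"ARIMA": 0.5, "LSTM": 0.75, "Hybrid": 0.125},  # Hybrid strictly better
]
averages, best_counts = summarize(sample_results)
```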

📈 Results

As a final test of the model, I ran simple, 36-month tests of all the stock components of the S&P 500 index and the Russell 1000 index. Here are the results:

S&P 500:

Russell 1000:

Conclusions

On average, the ARIMA model made the best forecast ~53-55% of the time, the LSTM model made the best forecast ~26-27% of the time, and the Hybrid model improved the forecast ~18-20% of the time. I say "improved" because the Hybrid model is designed to output the best forecast of the 3 models 100% of the time; in these tests, the Best Fit Frequency for Hybrid counts only the cases where the optimal forecast lies somewhere between the ARIMA and LSTM forecasts, as found by the weighted average function. The improvement from the Hybrid model shows in its aMPE, which is around 29% for the S&P 500 and 33% for the Russell 1000. These are lower than all the results for ARIMA and LSTM. These results reflect the goals of the hybrid model: return the optimal forecast every time, and improve on it when possible.
