clusteringfts's Introduction

Clustering Financial Time Series

This repository contains the scripts and datasets used in my Master's thesis, "Clustering Financial Time Series". The numerical experiments were conducted in Python and R. The commands needed to reproduce the results in the thesis are listed below.

Python part

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


import python_scripts.utils as tu
from python_scripts.hier_clust import *

The analysis is illustrated here on Synthetic Dataset 1; the same steps were performed on Synthetic Dataset 2.

Synthetic Dataset 1

df, true_cluster = tu.get_synthetic_data('datasets/synthetic_data1.csv')
res_df, true_cluster = tu.get_synthetic_data('results/synthetic_data1_res.csv')
plt.rcParams['figure.figsize'] = (8.0, 4.0)
tu.plot_series(df, nr_series_per_class=5)

tu.plot_one_per_class(df, nr_series_per_class=5)

measures = ['euclidean', 'dtw', 'corr1', 'corr2',
            'cross_corr1', 'cross_corr2', 'cross_corr3',
            'acf', 'pacf', 'rccf1', 'rccf2', 'rccf3']
get_sim_index(measures, df, res_df, true_cluster)

               single  complete   average
euclidean    0.363636  0.380952  0.380952
dtw          0.380952  0.439935  0.380952
corr1        0.334127  0.243590  0.294444
corr2        0.334127  0.243590  0.243590
cross_corr1  1.000000  1.000000  1.000000
cross_corr2  1.000000  1.000000  1.000000
cross_corr3  1.000000  1.000000  1.000000
acf          0.321429  0.321429  0.321429
pacf         0.363636  0.363636  0.363636
rccf1        1.000000  1.000000  1.000000
rccf2        1.000000  1.000000  1.000000
rccf3        0.722222  0.949495  0.949495

chosen_measures = ['cross_corr2', 'cross_corr3',
                   'rccf2', 'rccf3']
cluster_numbers = list(range(2,11))
get_sil_index(chosen_measures, cluster_numbers, df, res_df)

    cross_corr2  cross_corr3     rccf2     rccf3
 2     0.530485     0.739282  0.530762  0.692445
 3     0.329060     0.426181  0.401696  0.477057
 4     0.323237     0.420283  0.288671  0.288671
 5     0.316615     0.413631  0.270898  0.270898
 6     0.298892     0.404116  0.228582  0.228582
 7     0.279041     0.384262  0.229174  0.229174
 8     0.279625     0.384827  0.225471  0.225471
 9     0.282080     0.387268  0.184505  0.184505
10     0.095913     0.095913  0.186782  0.186782

clustering = HClust(df, true_cluster, 'cross_corr3')
clustering.plot_dendrogram()

clustering1 = HClust(df, true_cluster, 'cross_corr2')
clustering2 = HClust(df, true_cluster, 'cross_corr3')

fig = plt.figure(figsize=(20, 8))
fig.subplots_adjust(hspace=0.4, wspace=0.02)

fig.add_subplot(1, 2, 1)
clustering1.plot_heatmap(xlab='$d_{CCF_2}$')

fig.add_subplot(1, 2, 2)
clustering2.plot_heatmap(xlab='$d_{CCF_3}$')

Clustering Stock Prices

stock_data = pd.read_csv('datasets/nyse_data.csv', index_col=0)
clustering = HClust(data=stock_data, ground_truth=None,
                     dist_func='cross_corr3', verbose=True)
clustering.dist_mat.to_csv('results/diss_mat_ccf3.csv', index=False, header=False)
Dissimilarity computation: 100% [-------------------------------] Time: 0:32:53
diss_mat_ccf2 = pd.read_csv('results/diss_mat_ccf2.csv', header=None, index_col=None)
diss_mat_ccf3 = pd.read_csv('results/diss_mat_ccf3.csv', header=None, index_col=None)
diss_ccf2 = np.triu(diss_mat_ccf2.values,1).flatten()
diss_ccf3 = np.triu(diss_mat_ccf3.values,1).flatten()
plt.hist(diss_ccf2[diss_ccf2!=0], bins=50, alpha=0.5, label='${CCF_2}$')
plt.hist(diss_ccf3[diss_ccf3!=0], bins=50, alpha=0.5, label='${CCF_3}$')
plt.xlabel('Dissimilarity')
plt.ylabel('Frequency')
plt.legend(fontsize=14);

sectors = pd.read_csv('datasets/sectors.csv', header=None)

Converting sector categories into numeric values.

from sklearn.preprocessing import LabelEncoder
# sectors.csv has no header, so the single column is named 0; pass it as a 1-D Series
sectors_numeric = LabelEncoder().fit_transform(sectors[0])
get_similarities(diss_mat_ccf2, ground_truth=sectors_numeric)

  single  complete   average
0.087304  0.325193  0.145864

permutations = pd.read_csv('results/null_distribution1.csv')
permutations.plot(kind='hist', bins=50, legend=False)
s0 = 309
plt.plot(s0, 0.5, 'ro')
plt.arrow(s0, 20, 0, -16, length_includes_head=True,
          head_width=5, head_length=4)
plt.xlabel('Number of pure edges');

R part

source('R_scripts/clustering.R')
source('R_scripts/mst.R')
df <- read.csv('datasets/synthetic_data1.csv')
nr.series <- 5
true_cluster <- c(rep(1, nr.series), rep(2,nr.series),
                  rep(1, nr.series), rep(2, nr.series))

This snippet returns the similarity index of the clustering obtained with the Piccolo and Maharaj distances under the single linkage method. Setting plot=TRUE additionally plots the resulting dendrograms. The distances come from the TSclust package.

cluster_eval(df, dist.method='AR.PIC',
             linkage.method = 'single',
             true_cluster, plot=F)
             
cluster_eval(df, dist.method='AR.MAH',
             linkage.method = 'single',
             true_cluster, plot=F)

This snippet fits an AR or ARIMA(p,1,0) model to each time series in the dataset and returns the model residuals, which are needed for the RCCF dissimilarity measure.

res.df <- get.residuals(df)

write.csv(res.df, file="results/synthetic_data1_res.csv",
          row.names = F)
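
For readers following along in Python, the residual step can be sketched as below. This is only an illustration, not the `get.residuals` function from `R_scripts`: it fits a fixed-order AR(p) model by ordinary least squares with NumPy, and the helper name `ar_residuals`, its `difference` flag and the default order `p=2` are assumptions made for the example.

```python
import numpy as np

def ar_residuals(series, p=2, difference=False):
    """Fit an AR(p) model by least squares and return its residuals.

    Hypothetical helper for illustration. The order p is fixed here,
    whereas a real workflow would usually select it (e.g. by AIC).
    """
    x = np.asarray(series, dtype=float)
    if difference:
        # First differencing corresponds to the ARIMA(p, 1, 0) case.
        x = np.diff(x)
    # Column k holds the series lagged by k + 1 steps.
    lags = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
    design = np.column_stack([np.ones(len(lags)), lags])
    target = x[p:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

# Simulated AR(1) series with coefficient 0.7 to exercise the helper.
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.7 * ar1[t - 1] + noise[t]

res = ar_residuals(ar1, p=1)
```

If the fitted model is adequate, the residuals should closely track the simulated noise, which is why they are a natural input for a residual-based dissimilarity.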

Plotting the minimum spanning tree with threshold 1 on the edge weights.

diss_data <- read.csv('results/diss_mat_ccf3.csv', header=F)
sectors <- read.csv('datasets/sectors.csv', header=F)
sectors <- sectors$V1
sectors.numeric <- as.numeric(sectors)

gr <- graph.adjacency(as.matrix(diss_data),
                      mode='undirected',
                      weighted = T)
mstree <- igraph::mst(gr)

final_mstree <- plot.mst(mstree, names=sectors.numeric,
                         pallete='Paired',
                         threshold=1,
                         save.fig = F, 
                         fig.size=c(6,4))
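
The tree construction itself can be sketched in Python as follows. This is a plain Prim's algorithm on a dense dissimilarity matrix, not the `igraph` routine used above, and the toy matrix is made up for the example. How `plot.mst` applies its `threshold` argument is not stated here; the sketch assumes it keeps only edges whose weight does not exceed the threshold, and `filter_edges` is a hypothetical name.

```python
import numpy as np

def minimum_spanning_tree(diss):
    """Return MST edges (i, j, weight) of a dense symmetric matrix (Prim)."""
    diss = np.asarray(diss, dtype=float)
    n = diss.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = diss[0].copy()          # cheapest link from the tree to each node
    best_from = np.zeros(n, dtype=int)  # tree node providing that link
    edges = []
    for _ in range(n - 1):
        # Nodes already in the tree are masked out with infinity.
        candidates = np.where(in_tree, np.inf, best_dist)
        j = int(np.argmin(candidates))
        edges.append((int(best_from[j]), j, float(best_dist[j])))
        in_tree[j] = True
        # Update links that became cheaper through the newly added node.
        closer = (diss[j] < best_dist) & ~in_tree
        best_dist[closer] = diss[j][closer]
        best_from[closer] = j
    return edges

def filter_edges(edges, threshold):
    # Assumption: edges heavier than the threshold are dropped.
    return [e for e in edges if e[2] <= threshold]

# Toy 4 x 4 dissimilarity matrix (illustrative values only).
diss = np.array([[0.0, 0.2, 0.9, 1.4],
                 [0.2, 0.0, 0.3, 1.2],
                 [0.9, 0.3, 0.0, 1.1],
                 [1.4, 1.2, 1.1, 0.0]])
tree = minimum_spanning_tree(diss)
kept = filter_edges(tree, threshold=1.0)
```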

Running the permutation test returns the p-value of the test and saves the resulting permutations to a csv file.

permutation.test(final_mstree, 10^4)
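
The axis label "Number of pure edges" in the Python part suggests that the test statistic is the number of tree edges joining two stocks from the same sector. Under that reading, which is an assumption and may differ from `permutation.test` in `R_scripts/mst.R`, the test can be sketched in Python as below. The function names and the toy tree are hypothetical.

```python
import numpy as np

def count_pure_edges(edges, labels):
    """Number of edges whose two endpoints carry the same label."""
    labels = np.asarray(labels)
    return int(sum(labels[i] == labels[j] for i, j in edges))

def permutation_p_value(edges, labels, n_perm=10_000, seed=0):
    """One-sided p-value: how often shuffled labels give at least
    as many pure edges as the observed labelling."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = count_pure_edges(edges, labels)
    null = np.array([count_pure_edges(edges, rng.permutation(labels))
                     for _ in range(n_perm)])
    # Add-one correction keeps the estimate strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p_value

# Toy path-shaped tree over 12 nodes whose labels form three contiguous blocks.
edges = [(i, i + 1) for i in range(11)]
labels = [0] * 4 + [1] * 4 + [2] * 4
observed, null, p_value = permutation_p_value(edges, labels, n_perm=2000)
```

The `null` array plays the role of the saved permutations: its histogram is the null distribution against which the observed count is compared.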

Plotting the histogram of clusters to see how well the obtained clusters align with the provided categories (sectors).

df <- read.csv('results/clusters_complete_3.csv')
df$clusters <- df$clusters + 1
df$gt_num <- as.character(as.numeric(df$gt))

plot.hist(df, save.fig=T, fig.size=c(6,4))

clusteringfts's Issues

running

Hello, I recently ran into the following problem when running the program. Can you help me?
"Can only compute partial correlations for lags up to 50% of the "
ValueError: Can only compute partial correlations for lags up to 50% of the sample size. The requested nlags 99 must be < 50.
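
The message states the constraint directly: the requested number of lags (99) has to be below 50% of the sample size. One possible workaround, assuming the lag count is passed explicitly to the partial autocorrelation call, is to cap it beforehand. The helper `cap_nlags` below is hypothetical and not part of this repository, and the sample size of 100 is only inferred from the "must be < 50" part of the message.

```python
def cap_nlags(n_obs, requested):
    """Largest lag count <= requested that stays strictly below n_obs // 2."""
    limit = n_obs // 2 - 1
    if limit < 1:
        raise ValueError("series too short for partial autocorrelation")
    return min(requested, limit)

# With 100 observations, a request for 99 lags is reduced to 49.
nlags = cap_nlags(n_obs=100, requested=99)
print(nlags)  # 49
```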
