Implementing Clustering Methods to Rearrange Time Series Data

by Text Mining, May 1st, 2025

Too Long; Didn't Read

In this study, our focus is on implementing clustering methods to rearrange time series data that represent technological proximity based on their shapes across various index types.

Abstract and 1. Introduction

2 Related Work and 2.1 Technology Convergence Approaches

2.2 Technology Convergence Measurements

2.3 Technology Convergence Models

3 Data

4 Method and 4.1 Proximity Indices

4.2 Interpolation and Fitting Data

4.3 Clustering

4.4 Forecasting

5 Results and Discussion and 5.1 Overall Results

5.2 Case Study

5.3 Limitations and Future Works

6 Conclusion and References

Appendix

4.3 Clustering

In this study, our focus is on implementing clustering methods to rearrange time series data that represent technological proximity based on their shapes across various index types. The objective of this analysis is to pinpoint time series exhibiting increasing proximity, indicating potential technological convergence.


Our initial step involves preprocessing the time series. Following numerical experiments, we decide against deseasonalizing the time series, as deseasonalization introduces significant negative values, which contradicts the definition of our non-negative indices. Subsequently, we normalize all time series between 0 and 1: we subtract the minimum value of the series from each data point and then divide the shifted series by its maximum value. The resulting time series proportionally expresses each point's position between the minimum and maximum values of the original series. Lastly, because the normalized time series remain highly volatile, we smooth them using exponential smoothing with a strong smoothing parameter of α = 0.1, following the formula:

F_t = α x_t + (1 − α) F_{t−1}

where:
• F_t is the forecast (smoothed value) for the current time period t.

• x_t is the actual observation in the current period.

• F_{t−1} is the previous forecast.

• α is the smoothing parameter (0 < α < 1).


The smoothing parameter α controls the weight given to the current observation versus the previous forecast. A smaller α assigns more weight to older data, while a larger α gives more weight to recent data.
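
As an illustration, here is a minimal sketch of this preprocessing step in Python, assuming each index is available as a NumPy array. The function names, the toy series, and the choice to initialize the first forecast with the first observation are our own assumptions, not the paper's code.

```python
import numpy as np

def min_max_normalize(series: np.ndarray) -> np.ndarray:
    """Rescale a series to [0, 1]: subtract the minimum, divide by the maximum."""
    shifted = series - series.min()
    max_val = shifted.max()
    return shifted / max_val if max_val > 0 else shifted  # guard against constant series

def exponential_smoothing(series: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Simple exponential smoothing: F_t = alpha * x_t + (1 - alpha) * F_{t-1}."""
    smoothed = np.empty(len(series))
    smoothed[0] = series[0]  # assumed convention: seed with the first observation
    for t in range(1, len(series)):
        smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# Toy proximity-index series: normalize, then smooth with a strong alpha = 0.1
raw = np.array([0.0, 3.0, 1.0, 4.0, 2.0, 5.0])
smoothed = exponential_smoothing(min_max_normalize(raw), alpha=0.1)
```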


Following the application of exponential smoothing, we exclude from our analyses all time series with an interpolation rate exceeding 50%. These time series, accounting for 25% of the dataset, are considered artificial: they represent noise in our dataset and could bias our results.


Next, we allocate all normalized time series with a mean less than or equal to 0.02 to an artificial cluster representing flat time series; a sketch of both filtering rules follows below. This artificial cluster encompasses approximately 15% of the entire time series set. This approach allows the clustering algorithms to focus exclusively on data that cannot be manually clustered, constituting around 60% of our original time series.
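
The two filtering rules above can be sketched as a single pass over the dataset. The variable names below are hypothetical; the thresholds simply encode the 50% interpolation-rate and 0.02 mean cut-offs described in the text.

```python
import numpy as np

MAX_INTERPOLATION_RATE = 0.5  # exclude series with more than 50% interpolated points
FLAT_MEAN_THRESHOLD = 0.02    # normalized series with mean <= 0.02 are treated as flat

def partition_series(series_list, interpolation_rates):
    """Split normalized series into a flat (artificial) cluster and the set to cluster."""
    flat_cluster, to_cluster = [], []
    for series, rate in zip(series_list, interpolation_rates):
        if rate > MAX_INTERPOLATION_RATE:
            continue  # mostly interpolated series are artificial noise: drop them
        if np.mean(series) <= FLAT_MEAN_THRESHOLD:
            flat_cluster.append(series)  # flat series go to the artificial cluster
        else:
            to_cluster.append(series)    # only these are fed to the clustering algorithms
    return flat_cluster, to_cluster
```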


We employ three clustering algorithms for time series: K-Means, K-Shape, and K-Medoids. We fit these clustering algorithms in two ways: first, by training them on one part of our own time series, and second, by training them on a large set of approximately 20,000 time series imported from the Darts library [2]. To measure the distance between time series, we use the l1 metric, i.e., the sum of the absolute values of the differences between two vectors, or time series in our case. Since the main goal is to cluster time series by shape rather than directly by the values they take, we do not want to penalize large numerical differences too heavily, as would happen with the Euclidean distance, where the differences between points are squared.
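
A minimal sketch of the clustering step, assuming tslearn for K-Shape and scikit-learn-extra for K-Medoids (the paper does not name its libraries). Note that K-Shape uses its own shape-based cross-correlation distance, while K-Medoids can be given the l1 ("manhattan") metric directly.

```python
import numpy as np
from tslearn.clustering import KShape
from sklearn_extra.cluster import KMedoids

# X: one row per preprocessed series (random data stands in for the real indices)
rng = np.random.default_rng(0)
X = rng.random((200, 24))

# K-Shape clusters by shape, using a normalized cross-correlation distance
kshape = KShape(n_clusters=5, random_state=0).fit(X)

# K-Medoids accepts the l1 distance directly via the "manhattan" metric
kmedoids = KMedoids(n_clusters=5, metric="manhattan", random_state=0).fit(X)

print(kshape.labels_[:10], kmedoids.labels_[:10])
```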


To evaluate the performance of the clustering algorithms, we implement several strategies. We first use the silhouette score [31] to compute the clustering quality for each sample, for each cluster, and finally for the whole dataset. Then, we visualize the number of time series per cluster to see how the time series are distributed among the clusters. Afterward, we visualize the cluster centroids on a plane, positioned so that the distances between them are proportional to their distances as time series, with sizes proportional to the number of samples contained in each cluster. Finally, for each cluster, we plot randomly chosen time series within the cluster to visually assess their similarity.


To calculate the silhouette score, one first computes, for a given data point x_i, the average distance a_i from x_i to all other data points within the same cluster. This measures the cohesion of the data point with its cluster.


Then, one computes the average distance b_i from x_i to all data points in the nearest neighboring cluster (i.e., the closest cluster other than the one to which x_i belongs). This measures the separation of the data point from other clusters.


The silhouette score s_i for the data point x_i is then calculated using the following formula:

s_i = (b_i − a_i) / max(a_i, b_i)
If the value of s_i is close to 1, the data point is well clustered and far away from neighboring clusters. If s_i is close to −1, the data point has likely been assigned to the wrong cluster, as its average distance to its own cluster is much greater than its distance to the nearest neighboring cluster. If s_i is around 0, the data point lies on or very close to the boundary between two clusters. The overall silhouette score for an entire clustering solution (a set of clusters) is obtained by computing the silhouette score for each data point in the dataset and averaging these individual scores.
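
Under the same l1 metric, the per-sample, per-cluster, and overall silhouette scores described above could be computed with scikit-learn, for example. The arrays below are placeholders for the clustered series and their labels, not the paper's data.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = rng.random((200, 24))              # placeholder for the clustered series
labels = rng.integers(0, 5, size=200)  # placeholder for the cluster assignments

# Per-sample scores s_i = (b_i - a_i) / max(a_i, b_i), computed with the l1 metric
per_sample = silhouette_samples(X, labels, metric="manhattan")

# Per-cluster quality: average s_i within each cluster
for k in np.unique(labels):
    print(f"cluster {k}: {per_sample[labels == k].mean():.3f}")

# Overall score: average of s_i over the whole dataset
print(f"overall: {silhouette_score(X, labels, metric='manhattan'):.3f}")
```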


Ultimately, K-Shape with 5 clusters emerged as the optimal choice for our time series, achieving an average silhouette score of 0. This result indicates that the clusters are reasonably well-balanced, and the centroids exhibit sufficient separation from each other.


This paper is available on arxiv under CC BY 4.0 DEED license.


[2] https://unit8co.github.io/darts/

Authors:

(1) Alessandro Tavazzi, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland, Institute of Mathematics, EPFL, 1015, Lausanne, Switzerland, and corresponding author (tavazale@gmail.com);

(2) Dimitri Percia David, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland and Institute of Entrepreneurship & Management, University of Applied Sciences of Western Switzerland (HES-SO Valais-Wallis), Techno-Pole 1, Le Foyer, 3960, Sierre, Switzerland;

(3) Julian Jang-Jaccard, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland;

(4) Alain Mermoud, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland.

