4.4 Forecasting
Our goal is to predict proximity indices and, ultimately, to forecast how the closeness between encryption technologies evolves, thereby anticipating technological convergence. We employ four methods to forecast the time series over horizons of 3, 6, and 12 months, aiming for optimal results.
In the first method, known as "local forecasting," each algorithm is trained on a segment of a single time series, and that time series is then forecasted directly with the fitted algorithm.
In the subsequent methods, "clustering forecasting" and "global forecasting," we categorize time series based on common keywords, citations, and collaborations. In clustering forecasting, each algorithm is trained on all time series within a cluster, testing whether training on similar time series data leads to better forecasts. In global forecasting, each algorithm is trained on all time series and then forecasts every time series with the fitted algorithm.
The last method, "transfer learning forecasting," involves training each algorithm on a large external set of time series imported from Darts and then forecasting all time series with the fitted algorithm.
Before forecasting, we preprocess the time series by cleaning the data to remove noise. We select time series with less than 50% interpolation, smooth them using exponential smoothing with a smoothing parameter of α = 0.1, and start forecasts from December 2021, since the OpenAlex data for 2022 were not yet fully updated.
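The smoothing step above can be sketched as follows. This is a minimal illustration of simple exponential smoothing with α = 0.1, not the authors' actual pipeline; the function name and the sample values are hypothetical.

```python
def exponential_smoothing(series, alpha=0.1):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = [series[0]]  # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical noisy proximity-index values
raw = [0.2, 0.8, 0.3, 0.9, 0.4]
print(exponential_smoothing(raw))
```

With α = 0.1, each new observation contributes only 10% to the smoothed value, which strongly dampens month-to-month noise at the cost of lagging behind sharp changes.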
To assess forecasting quality, we use the Symmetric Mean Absolute Percentage Error (SMAPE):

SMAPE = (100%/n) · Σ_{t=1}^{n} 2|F_t − A_t| / (|A_t| + |F_t|),

where A_t is the actual value and F_t the forecasted value at time t. If the actual and the forecasted time series are both zero at some time t, the contribution of that term to the sum is defined as zero. Since, by the triangle inequality, |F_t − A_t| ≤ |A_t| + |F_t|, we obtain the bound 0% ≤ SMAPE ≤ 200%.
It is noteworthy that if the actual time series remains identically zero while the forecasted time series stays non-zero, the SMAPE between the two series will be 200%, even if the forecast is numerically very close to the actual series. Given the numerous flat time series in our data, this observation should be kept in mind during our analysis of the results.
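A minimal implementation of SMAPE with the both-zero convention described above can be sketched as (the function name is ours, not from the paper):

```python
def smape(actual, forecast):
    """SMAPE in percent, bounded between 0% and 200%.

    Pairs where both the actual and forecasted values are zero
    contribute zero to the sum, per the convention in the text.
    """
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        if denom == 0:  # both zero: contribution defined as zero
            continue
        total += 2 * abs(f - a) / denom
    return 100 * total / len(actual)

# Flat actual series, non-zero forecast: every term hits the 200% ceiling
print(smape([0, 0, 0], [0.01, 0.01, 0.01]))  # → 200.0
```

The last line illustrates the caveat about flat series: each term reduces to 2|f| / |f| = 2 regardless of how small the forecast is, so the error saturates at 200%.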
We employ k-fold cross-validation with an expanding window to assess forecasting quality. In this approach, each time series is divided into successive sections. Training windows are created, starting from the first section and expanding at each step with the next section. Algorithms are trained on each training window, with forecasts initiated from the end of each window. The SMAPE between the forecasts and actual values is computed for each iteration, and the average of these SMAPEs represents the global error for a specific algorithm. This method is employed to mitigate overfitting.
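The expanding-window procedure above can be sketched as follows. The `fit_forecast(train, horizon)` callable, the split sizes, and the naive last-value forecaster are all hypothetical stand-ins for the paper's algorithms.

```python
def smape(actual, forecast):
    """SMAPE in percent; pairs where both values are zero contribute zero."""
    total = sum(2 * abs(f - a) / (abs(a) + abs(f))
                for a, f in zip(actual, forecast) if abs(a) + abs(f) > 0)
    return 100 * total / len(actual)

def expanding_window_error(series, fit_forecast, n_splits=4, horizon=3):
    """Average SMAPE over successive expanding training windows.

    Each iteration grows the training window by `horizon` points and
    forecasts the next `horizon` points from the end of the window.
    """
    errors = []
    start = len(series) - n_splits * horizon  # end of the first training window
    for k in range(n_splits):
        cut = start + k * horizon
        train, test = series[:cut], series[cut:cut + horizon]
        errors.append(smape(test, fit_forecast(train, horizon)))
    return sum(errors) / len(errors)  # global error for this algorithm

# Example with a naive last-value forecaster on a toy series
naive = lambda train, h: [train[-1]] * h
print(expanding_window_error(list(range(24)), naive))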
To visualize the distribution of forecasting errors across all expanding windows, we generate a histogram illustrating the number of forecasts by error size in a single plot with distinct colors, as depicted in Figure 3.
More concretely, we split the data into several sets. We take 80% of the data to train our algorithm, and 20% is left for testing the algorithm. Then, out of our training set, we take 80% of the data to tune the hyperparameters of our models and 20% to validate the results. We optimize the hyperparameters of each algorithm for each type of forecasting. Then, we train the models and forecast the time series using the k-fold cross-validation with an expanding window, as explained above. Last, to compare the different performances between the algorithms for each forecasting horizon and type of forecasting, we visualize the trade-off between the error and the computation time. This allows us to choose the optimal method for each specific forecasting task, as discussed in Section 5.1.
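The nested 80/20 splits can be sketched as below. The chronological index arithmetic is our assumption; the paper only specifies the proportions.

```python
def nested_splits(series):
    """80/20 train-test split, then 80/20 tune-validate split inside train."""
    cut = int(0.8 * len(series))
    train_full, test = series[:cut], series[cut:]
    inner = int(0.8 * len(train_full))
    tune, validate = train_full[:inner], train_full[inner:]
    return tune, validate, test

data = list(range(100))
tune, validate, test = nested_splits(data)
print(len(tune), len(validate), len(test))  # → 64 16 20
```

Note that the hyperparameter-tuning set is 64% of the full data (80% of 80%), the validation set 16%, and the held-out test set 20%.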
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Alessandro Tavazz, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland, Institute of Mathematics, EPFL, 1015, Lausanne, Switzerland, and corresponding author (tavazale@gmail.com);
(2) Dimitri Percia David, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland and Institute of Entrepreneurship & Management, University of Applied Sciences of Western Switzerland (HES-SO Valais-Wallis), Techno-Pole 1, Le Foyer, 3960, Sierre, Switzerland;
(3) Julian Jang-Jaccard, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland;
(4) Alain Mermoud, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland.