Handling Missing Data: Interpolation Techniques for Time Series Analysis

by Text Mining, April 29th, 2025

Too Long; Didn't Read

Learn about the interpolation methods (polynomial and linear) we used to address missing data in our technological convergence time series derived from OpenAlex.


Abstract and 1. Introduction

2 Related Work and 2.1 Technology Convergence Approaches

2.2 Technology Convergence Measurements

2.3 Technology Convergence Models

3 Data

4 Method and 4.1 Proximity Indices

4.2 Interpolation and Fitting Data

4.3 Clustering

4.4 Forecasting

5 Results and Discussion and 5.1 Overall Results

5.2 Case Study

5.3 Limitations and Future Works

6 Conclusion and References

Appendix

4.2 Interpolation and Fitting Data

Once we have calculated all the proximity indices, the next step is to interpolate their time series. This is necessary because some technologies lack associated papers in certain months. Across the 625 time series, we obtain an average interpolation rate of around 20%, a reasonable share given that some technologies have only a limited number of attributed papers.



Each gap is interpolated with a cubic polynomial of the form p(t) = a0 + a1·t + a2·t² + a3·t³. To determine the coefficients a0, a1, a2, and a3, a system of equations based on the given data points must be solved. Various methods, such as Lagrange interpolation or Newton's divided-difference method, can be employed to derive these coefficients.
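
A minimal sketch of this computation, using SciPy's Lagrange routine on illustrative values (not actual proximity indices):

```python
# Minimal sketch: recovering the cubic coefficients a0..a3 from four
# observed (month, index) points via Lagrange interpolation. The sample
# values are illustrative, not actual proximity indices.
import numpy as np
from scipy.interpolate import lagrange

t = np.array([0.0, 1.0, 2.0, 3.0])        # months with observed indices
y = np.array([0.10, 0.12, 0.11, 0.15])    # proximity index at each month

poly = lagrange(t, y)          # the unique degree-3 polynomial through the points
a3, a2, a1, a0 = poly.coef     # numpy orders coefficients from highest degree
print(poly(1.5))               # estimate for a missing month between t=1 and t=2
```

With exactly four points, np.polyfit(t, y, 3) yields the same interpolant; Lagrange's construction simply makes the coefficients explicit without solving the linear system directly.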



In practice, one can vary the degree of interpolation depending on the nature of the data and the desired accuracy. Linear interpolation (degree one) is quick and simple but assumes a linear relationship, while higher-degree polynomial interpolation passes exactly through the data points but may introduce unnecessary oscillations. In our case, we initially chose a polynomial interpolation of degree 3 for the time series, offering flexibility without excessive oscillations. However, this leaves missing values at the extremities of series that start or end with gaps, so we apply a second interpolation using a linear method.


This two-step process ensures that all time series are filled throughout the entire period. Negative values created by the interpolation are replaced with zeroes, since the indices can only take non-negative values by definition.
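
A minimal sketch of this two-step fill, assuming each proximity-index series is stored as a pandas Series with a numeric month index and NaN for months lacking papers (the exact linear treatment of the extremities is an assumption here; pandas fills edge gaps from the nearest observed value):

```python
# Sketch of the two-step gap filling. Assumes each proximity-index
# series is a pandas Series with a numeric month index and NaN for
# months in which a technology has no attributed papers.
import pandas as pd

def fill_series(s: pd.Series) -> pd.Series:
    # Step 1: degree-3 polynomial interpolation for interior gaps
    # (pandas delegates to SciPy's interp1d with kind=3); gaps at the
    # extremities are left as NaN by this pass.
    filled = s.interpolate(method="polynomial", order=3)
    # Step 2: linear pass for the remaining gaps at the start and end;
    # pandas fills these edge values from the nearest observation.
    filled = filled.interpolate(method="linear", limit_direction="both")
    # The indices are non-negative by definition, so clip artifacts at zero.
    return filled.clip(lower=0)
```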


Upon plotting some of the time series, we observe a cloud of points moving in a certain direction rather than forming a smooth line, as depicted in Figure 1.


Fig. 1: Indices of proximity between Public-key cryptography and Blockchain from 2002 to 2021.


The observed fluctuations in the time series can be attributed to several factors influencing our computations. These include computed variables, such as cosine similarities and h-indices, as well as variables present in the OpenAlex dataset, like the attribution score assigned to technologies by the classification algorithm. Moreover, the dynamic nature of the scientific landscape, which varies from month to month, contributes significantly to the fluctuations: publication frequencies differ across technologies, and new discoveries or algorithms can suddenly attract substantial citations from a large share of the scientific community. For these reasons, the computed time series are not entirely smooth; the crucial aspect of our work, however, lies in the general tendencies reflected by these indices.


To address this variability, we fit curves to the points of each time series. For each series, we compute eleven polynomial fits, with degrees ranging from 0 to 10, and select the fit with the lowest Symmetric Mean Absolute Percentage Error (SMAPE). An illustrative example is provided in Figure 2 below.
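
A minimal sketch of this selection step, assuming a standard SMAPE definition in percent (the exact variant is not spelled out here) and NumPy's least-squares polynomial fitting:

```python
# Sketch of the degree-selection step: fit polynomials of degree 0..10
# and keep the fit with the lowest SMAPE. The SMAPE variant (percent,
# mean-of-absolutes denominator) is an assumption, as is the handling
# of all-zero points.
import numpy as np

def smape(actual: np.ndarray, predicted: np.ndarray) -> float:
    # Symmetric Mean Absolute Percentage Error, in percent.
    denom = (np.abs(actual) + np.abs(predicted)) / 2.0
    denom[denom == 0] = 1.0  # avoid division by zero where both values are 0
    return 100.0 * float(np.mean(np.abs(actual - predicted) / denom))

def best_polynomial_fit(t: np.ndarray, y: np.ndarray, max_degree: int = 10):
    best = None
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(t, y, degree)               # least-squares fit
        error = smape(y, np.polyval(coeffs, t))
        if best is None or error < best[1]:
            best = (coeffs, error, degree)
    return best  # (coefficients, SMAPE in percent, selected degree)
```

For a 20-year monthly series, t would simply be np.arange(240).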


Fig. 2: Optimal polynomial fitting of the time series of proximity indices between Public-key cryptography and Blockchain with an interpolation rate of 24% from 2002 to 2021.


This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Alessandro Tavazzi, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland, Institute of Mathematics, EPFL, 1015, Lausanne, Switzerland, and corresponding author (tavazale@gmail.com);

(2) Dimitri Percia David, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland and Institute of Entrepreneurship & Management, University of Applied Sciences of Western Switzerland (HES-SO Valais-Wallis), Techno-Pole 1, Le Foyer, 3960, Sierre, Switzerland;

(3) Julian Jang-Jaccard, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland;

(4) Alain Mermoud, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland.

