Handling Missing Data: Interpolation Techniques for Time Series Analysis

by Text Mining, April 29th, 2025

Too Long; Didn't Read

Learn about the interpolation methods (polynomial and linear) we used to address missing data in our technological convergence time series derived from OpenAlex.


Abstract and 1. Introduction

2 Related Work and 2.1 Technology Convergence Approaches

2.2 Technology Convergence Measurements

2.3 Technology Convergence Models

3 Data

4 Method and 4.1 Proximity Indices

4.2 Interpolation and Fitting Data

4.3 Clustering

4.4 Forecasting

5 Results and Discussion and 5.1 Overall Results

5.2 Case Study

5.3 Limitations and Future Works

6 Conclusion and References

Appendix

4.2 Interpolation and Fitting Data

Once we have calculated all the proximity indices, the next step is to interpolate their time series. This is necessary because some technologies lack associated papers in certain months. Across the 625 time series, we obtain an average interpolation rate of around 20%, a reasonable share given that some technologies have only a limited number of attributed papers.



Each gap is interpolated with a cubic polynomial of the form p(t) = a0 + a1·t + a2·t² + a3·t³. To determine the coefficients a0, a1, a2, and a3, a system of equations based on the given data points must be solved. Various methods, such as Lagrange interpolation or Newton's divided-difference method, can be employed to derive these coefficients.
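
A minimal sketch of this computation, using SciPy's Lagrange routine on illustrative values (not actual proximity indices):

```python
# Minimal sketch: recovering the cubic coefficients a0..a3 from four
# observed (month, index) points via Lagrange interpolation. The sample
# values are illustrative, not actual proximity indices.
import numpy as np
from scipy.interpolate import lagrange

t = np.array([0.0, 1.0, 2.0, 3.0])        # months with observed indices
y = np.array([0.10, 0.12, 0.11, 0.15])    # proximity index at each month

poly = lagrange(t, y)          # the unique degree-3 polynomial through the points
a3, a2, a1, a0 = poly.coef     # numpy orders coefficients from highest degree
print(poly(1.5))               # estimate for a missing month between t=1 and t=2
```

With exactly four points, np.polyfit(t, y, 3) yields the same interpolant; Lagrange's construction simply makes the coefficients explicit without solving the linear system directly.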



In practice, one can vary the degree of interpolation depending on the nature of the data and the desired accuracy. Linear interpolation (degree one) is quick and simple but assumes a linear relationship, while higher-degree polynomial interpolation passes exactly through the data points but may introduce unnecessary oscillations. In our case, we initially chose a polynomial interpolation of degree 3 for the time series, offering flexibility without excessive oscillations. However, this leaves missing values at the extremities of series that start or end with gaps, so we apply a second interpolation using a linear method.


This two-step process ensures that all time series are filled throughout the entire period. Negative values created by the interpolation are replaced with zeroes, since the indices can only take non-negative values by definition.
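
A minimal sketch of this two-step fill, assuming each proximity-index series is stored as a pandas Series with a numeric month index and NaN for months lacking papers (the exact linear treatment of the extremities is an assumption here; pandas fills edge gaps from the nearest observed value):

```python
# Sketch of the two-step gap filling. Assumes each proximity-index
# series is a pandas Series with a numeric month index and NaN for
# months in which a technology has no attributed papers.
import pandas as pd

def fill_series(s: pd.Series) -> pd.Series:
    # Step 1: degree-3 polynomial interpolation for interior gaps
    # (pandas delegates to SciPy's interp1d with kind=3); gaps at the
    # extremities are left as NaN by this pass.
    filled = s.interpolate(method="polynomial", order=3)
    # Step 2: linear pass for the remaining gaps at the start and end;
    # pandas fills these edge values from the nearest observation.
    filled = filled.interpolate(method="linear", limit_direction="both")
    # The indices are non-negative by definition, so clip artifacts at zero.
    return filled.clip(lower=0)
```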


Upon plotting some of the time series, we observe a cloud of points moving in a certain direction rather than forming a smooth line, as depicted in Figure 1.


Fig. 1: Indices of proximity between Public-key cryptography and Blockchain from 2002 to 2021.


The observed fluctuations in the time series can be attributed to several factors influencing our computations. These include computed variables, such as cosine similarities and h-indices, as well as variables present in the OpenAlex dataset, like the attribution score assigned to technologies by the classification algorithm. Moreover, the dynamic nature of the scientific landscape, which varies from month to month, contributes significantly to the fluctuations: publication frequencies differ across technologies, and new discoveries or algorithms can suddenly attract substantial citations from a large share of the scientific community. For these reasons, the computed time series are not entirely smooth; the crucial aspect of our work, however, lies in the general tendencies reflected by these indices.


To address this variability, we fit curves to the points of each time series. For each series, we compute eleven polynomial fits, with degrees ranging from 0 to 10, and select the fit with the lowest Symmetric Mean Absolute Percentage Error (SMAPE). An illustrative example is provided in Figure 2 below.
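
A minimal sketch of this selection step, assuming a standard SMAPE definition in percent (the exact variant is not spelled out here) and NumPy's least-squares polynomial fitting:

```python
# Sketch of the degree-selection step: fit polynomials of degree 0..10
# and keep the fit with the lowest SMAPE. The SMAPE variant (percent,
# mean-of-absolutes denominator) is an assumption, as is the handling
# of all-zero points.
import numpy as np

def smape(actual: np.ndarray, predicted: np.ndarray) -> float:
    # Symmetric Mean Absolute Percentage Error, in percent.
    denom = (np.abs(actual) + np.abs(predicted)) / 2.0
    denom[denom == 0] = 1.0  # avoid division by zero where both values are 0
    return 100.0 * float(np.mean(np.abs(actual - predicted) / denom))

def best_polynomial_fit(t: np.ndarray, y: np.ndarray, max_degree: int = 10):
    best = None
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(t, y, degree)               # least-squares fit
        error = smape(y, np.polyval(coeffs, t))
        if best is None or error < best[1]:
            best = (coeffs, error, degree)
    return best  # (coefficients, SMAPE in percent, selected degree)
```

For a 20-year monthly series, t would simply be np.arange(240).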


Fig. 2: Optimal polynomial fitting of the time series of proximity indices between Public-key cryptography and Blockchain with an interpolation rate of 24% from 2002 to 2021.


This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Alessandro Tavazzi, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland, Institute of Mathematics, EPFL, 1015, Lausanne, Switzerland, and corresponding author (tavazale@gmail.com);

(2) Dimitri Percia David, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland and Institute of Entrepreneurship & Management, University of Applied Sciences of Western Switzerland (HES-SO Valais-Wallis), Techno-Pole 1, Le Foyer, 3960, Sierre, Switzerland;

(3) Julian Jang-Jaccard, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland;

(4) Alain Mermoud, Cyber-Defence Campus, armasuisse Science and Technology, Building I, EPFL Innovation Park, 1015, Lausanne, Switzerland.

