How Different Feature Selection Methods Perform with Varying Data Sizes

by Volumize Tech, May 13th, 2025

Too Long; Didn't Read

Feature selection methods are evaluated by their average r-squared values across different data sizes. The var methods performed best, while lasso performed worst. Similarity methods showed minimal fluctuations.

Authors:

(1) Mahdi Goldani;

(2) Soraya Asadi Tirvan.

Abstract and Introduction

Methodology

Dataset

Similarity methods

Feature selection methods

Measure the performance of methods

Result

Discussion

Conclusion and References

Result

The performance of the feature selection methods was evaluated across different data sizes. At each step of reducing the sample size, an appropriate data subset was selected with each method and the average r-squared value was measured (Table 3). The var methods performed best in terms of r-squared, followed by the stepwise and correlation methods. The lasso method performed worst of all methods. Among the 15 feature selection methods examined in this study, the Euclidean distance method ranked 6th and the DTW method 7th, outperforming the other similarity methods; the edit distance method performed worst among the similarity methods.


Table 3. The average value of r-squared


Figure 2 shows the r-squared values of the different feature selection methods across the datasets. At each step, the sample size was reduced and a data subset was selected with each method. The horizontal axis represents the remaining percentage of observations. Because the regression model's performance degrades as the number of observations falls, the r-squared value at small sample sizes is lower than at large ones. In general, r-squared was low for all methods at low percentages and increased as the percentage grew, and the 15 methods performed similarly and close to one another. However, the r-squared value of the lasso method was dramatically lower than that of the other methods.


Fig2. The r-squared values of the feature selection methods at different numbers of observations
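The evaluation loop described above can be sketched as follows. The synthetic data, the variance filter used for selection, and the step percentages are illustrative assumptions, not the paper's dataset or exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the paper's dataset (assumption).
n, p, k = 200, 10, 3
X = rng.normal(size=(n, p)) * rng.uniform(0.5, 3.0, size=p)
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=n)

def r_squared(Xs, ys):
    """Fit OLS with an intercept and return in-sample R^2."""
    A = np.column_stack([np.ones(len(Xs)), Xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    resid = ys - A @ coef
    return 1.0 - (resid @ resid) / ((ys - ys.mean()) ** 2).sum()

def var_select(Xs, k):
    """Keep the k highest-variance columns (the 'var' filter)."""
    return np.argsort(Xs.var(axis=0))[::-1][:k]

results = {}
for pct in (20, 40, 60, 80, 100):      # remaining share of observations
    m = n * pct // 100
    Xs, ys = X[:m], y[:m]
    cols = var_select(Xs, k)
    results[pct] = r_squared(Xs[:, cols], ys)

for pct, r2 in results.items():
    print(f"{pct:>3}% of observations -> R^2 = {r2:.3f}")
```

Repeating the loop with each selection method in place of `var_select` yields one r-squared curve per method, which is what Figure 2 plots.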


Figure 3 shows the r-squared values of the filter methods, together with the trend line of each curve. Judging by the slopes of the trend lines, the mutual information method had the lowest slope among the three methods and hence the least sensitivity to the number of observations. However, its r-squared values fluctuated strongly, making it less reliable. Conversely, although the slope of the var method's trend line was slightly higher than that of mutual information, its r-squared varied less than the other methods in this group as the number of observations changed.


Fig3. The r-squared values of the filter feature selection methods
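The slope-versus-fluctuation comparison above can be made concrete with a small sketch. The curve values below are made-up illustrations of a "flat but noisy" curve and a "steeper but smooth" curve, not the paper's measurements, and the fluctuation metric is a simple stand-in:

```python
import numpy as np

def trend_slope(percentages, r2_values):
    """Slope of the least-squares trend line of R^2 versus the remaining
    share of observations; a flatter slope means less sensitivity to
    sample size."""
    return float(np.polyfit(percentages, r2_values, 1)[0])

def fluctuation(r2_values):
    """Standard deviation of R^2 around its own trend line, used here as
    a simple reliability proxy (an assumption, not the paper's metric)."""
    idx = np.arange(len(r2_values))
    fit = np.polyval(np.polyfit(idx, r2_values, 1), idx)
    return float(np.std(np.asarray(r2_values) - fit))

# Illustrative curves (made-up numbers):
pcts        = [20, 40, 60, 80, 100]
mutual_info = [0.62, 0.58, 0.66, 0.60, 0.68]   # flat but noisy
var_filter  = [0.55, 0.61, 0.66, 0.70, 0.74]   # steeper but smooth

print("slope MI :", round(trend_slope(pcts, mutual_info), 4))
print("slope var:", round(trend_slope(pcts, var_filter), 4))
print("fluct MI :", round(fluctuation(mutual_info), 4))
print("fluct var:", round(fluctuation(var_filter), 4))
```

On these illustrative curves, mutual information has the lower slope but the higher fluctuation, mirroring the trade-off described for Figure 3.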


The performance results of the wrapper methods are shown in Figure 4. Five well-known methods from this group were examined. Among these, the r-squared value fluctuated strongly as the number of samples changed for the forward, recursive feature elimination, and stepwise methods. Of the remaining two, backward and simulated, the simulated method showed less fluctuation and a lower slope.


Fig4. The r-squared values of the wrapper methods
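As an illustration of how a wrapper evaluates feature subsets with the model itself, here is a minimal forward-selection sketch on synthetic data. The data and the greedy in-sample criterion are assumptions; the paper's implementation details are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 8
X = rng.normal(size=(n, p))
# Only columns 2 and 5 carry signal in this synthetic example.
y = 1.5 * X[:, 2] - 2.0 * X[:, 5] + rng.normal(scale=0.3, size=n)

def r2(Xs, ys):
    """In-sample R^2 of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(Xs)), Xs])
    c, *_ = np.linalg.lstsq(A, ys, rcond=None)
    r = ys - A @ c
    return 1.0 - (r @ r) / ((ys - ys.mean()) ** 2).sum()

def forward_select(X, y, k):
    """Greedy forward selection: at each step add the feature that most
    improves R^2 (a minimal wrapper sketch)."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: r2(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

selected = forward_select(X, y, 2)
print(selected)  # expected to recover the two signal columns, 2 and 5
```

Backward elimination and recursive feature elimination follow the same pattern with the loop running in the opposite direction, which is why all wrappers share the model-refitting cost visible in their fluctuating curves.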


From the group of embedded methods, two were investigated (Fig5). The lasso method had a somewhat lower slope than the tree-based method, but its r-squared values were also lower than those of the tree-based method.


Fig5. The r-squared values of the embedded methods
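Embedded methods select features during model fitting itself. As a sketch of how lasso zeroes out irrelevant coefficients, here is a minimal cyclic coordinate-descent implementation; the data, penalty, and iteration count are assumptions, and a library solver would normally be used:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso (0.5*||y - Xw||^2 + lam*||w||_1) via cyclic coordinate
    descent with soft-thresholding (a minimal sketch)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove feature j's current contribution.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(2)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)   # only column 0 matters

w = lasso_cd(X, y, lam=20.0)
print(np.round(w, 3))  # coefficients of irrelevant columns shrink to zero
```

Tree-based selection instead ranks features by impurity-based importance after fitting an ensemble, which tends to preserve predictive power at the cost of a steeper sensitivity to sample size, consistent with Figure 5.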


Figure 6 shows the performance of the five similarity methods. Among them, the edit distance method had the lowest slope. The similarity methods showed only minor fluctuations as the data size was reduced.


Fig6. The r-squared values of the similarity methods
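A similarity-based selector keeps the features whose series are closest to the target under some distance. The sketch below shows this with Euclidean distance and a textbook DTW implementation; the synthetic series and the top-k rule are illustrative assumptions:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def similarity_select(X, y, k, metric):
    """Keep the k feature columns closest to the target under `metric`
    (a sketch of similarity-based selection)."""
    d = [metric(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(d)[:k]

rng = np.random.default_rng(3)
t = np.linspace(0, 4 * np.pi, 60)
y = np.sin(t)
X = np.column_stack([
    np.sin(t + 0.1),            # near-duplicate of the target
    np.cos(t),                  # phase-shifted series
    rng.normal(size=t.size),    # pure noise
])

euclid = lambda a, b: float(np.linalg.norm(a - b))
print(similarity_select(X, y, 1, euclid))        # picks the near-duplicate
print(similarity_select(X, y, 1, dtw_distance))  # DTW agrees here
```

Because the distance ranking depends on the shape of whole series rather than on a fitted model, it changes little when observations are dropped, which is consistent with the small fluctuations reported for the similarity methods.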


This paper is available on arxiv under a CC BY-SA 4.0 DEED (Attribution-ShareAlike 4.0 International) license.

