How Different Feature Selection Methods Perform with Varying Data Sizes

Authors:

(1) Mahdi Goldani;

(2) Soraya Asadi Tirvan.

Table of Links

Abstract and Introduction

Methodology

Dataset

Similarity methods

Feature selection methods

Measure the performance of methods

Result

Discussion

Conclusion and References

Result

The performance of the feature selection methods was evaluated at different data sizes. At each step the sample size was reduced, a data subset was selected with each method, and the r-squared value was measured; Table 3 reports the average r-squared across these steps. Based on these averages, the variance (Var) method performed best, followed by the stepwise and correlation methods. The lasso method performed worst of all the methods. Among the 15 feature selection methods used in this study, the Euclidean distance method ranked 6th and the DTW method ranked 7th, ahead of the other similarity methods. The edit distance method performed worst among the similarity methods.
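The averaging and ranking step described above can be sketched in a few lines. This is only an illustration: the function name `rank_by_mean_r2`, the method labels, and the scores are placeholders and are not taken from Table 3.

```python
from statistics import mean


def rank_by_mean_r2(r2_by_method):
    """Return (method, mean r-squared) pairs sorted from best to worst."""
    averages = {name: mean(scores) for name, scores in r2_by_method.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores, one value per sample-size step; NOT the paper's results.
example_scores = {
    "method_a": [0.50, 0.70, 0.90],
    "method_b": [0.40, 0.60, 0.80],
    "method_c": [0.10, 0.20, 0.30],
}

ranking = rank_by_mean_r2(example_scores)
for position, (name, avg) in enumerate(ranking, start=1):
    print(position, name, round(avg, 3))
```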

Figure 2 shows the r-squared values of the feature selection methods across the different datasets. At each step the sample size was reduced and each method selected its own data subset. The horizontal axis represents the percentage of observations retained. As the number of observations decreases, the performance of the regression model declines, so the r-squared value is lower at small sample sizes than at large ones. In general, r-squared was low for all methods at low percentages and rose as the percentage increased, and the 15 methods performed similarly to one another. The exception was the lasso method, whose r-squared value was dramatically lower than that of the other methods.
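A minimal sketch of measuring r-squared at several retained percentages is given below. It rests on assumptions that this section does not confirm: the data are synthetic, the regression is a one-variable least-squares line standing in for the paper's model, observations are dropped by keeping the first rows, and r-squared is computed in-sample. All function names are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_by_retained_share(x, y, shares):
    """Fit on the first share of rows (an assumed scheme) and score in-sample."""
    results = {}
    for share in shares:
        k = max(3, int(len(x) * share))
        xs, ys = x[:k], y[:k]
        slope, intercept = fit_line(xs, ys)
        preds = [slope * a + intercept for a in xs]
        results[share] = r_squared(ys, preds)
    return results


# Synthetic data for illustration only; not the paper's dataset.
rng = random.Random(0)
x_data = [rng.uniform(0, 10) for _ in range(200)]
y_data = [2 * a + 1 + rng.gauss(0, 1) for a in x_data]

scores = r2_by_retained_share(x_data, y_data, [0.1, 0.5, 1.0])
for share, value in scores.items():
    print(f"{int(share * 100)}% retained: r-squared = {value:.3f}")
```

Because the score here is in-sample, this sketch will not necessarily reproduce the decline reported at small sample sizes; a held-out evaluation set would be needed for that comparison.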

Figure 3 shows the r-squared values of the filter methods, with a trend line drawn for each. Among the three methods in this group, the mutual information method had the lowest trend-line slope and therefore the lowest sensitivity to the number of observations. However, its r-squared values fluctuated strongly, which makes this method less reliable. The trend line of the Var method was slightly steeper than that of the mutual information method, but its r-squared values changed less as the number of observations changed than those of the other methods in this group.
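The two quantities compared here, trend-line slope and fluctuation, can be computed as in the sketch below. The section does not define how fluctuation was measured, so the standard deviation of the residuals around the fitted trend line is used as an assumption. The function name `trend_and_fluctuation` and the input values are illustrative and are not the paper's data.

```python
from statistics import pstdev


def trend_and_fluctuation(percentages, r2_values):
    """Return (slope of the least-squares trend line, std of its residuals)."""
    n = len(percentages)
    mean_p = sum(percentages) / n
    mean_r = sum(r2_values) / n
    sxx = sum((p - mean_p) ** 2 for p in percentages)
    sxy = sum((p - mean_p) * (r - mean_r) for p, r in zip(percentages, r2_values))
    slope = sxy / sxx
    intercept = mean_r - slope * mean_p
    residuals = [r - (slope * p + intercept) for p, r in zip(percentages, r2_values)]
    # Assumed fluctuation measure: spread of the points around the trend line.
    return slope, pstdev(residuals)


# Illustrative series only: one smooth, one zigzag.
retained = [10, 20, 30, 40]
smooth_slope, smooth_fluct = trend_and_fluctuation(retained, [0.1, 0.2, 0.3, 0.4])
zigzag_slope, zigzag_fluct = trend_and_fluctuation(retained, [0.2, 0.1, 0.4, 0.3])
print(smooth_slope, smooth_fluct)
print(zigzag_slope, zigzag_fluct)
```

In this toy input the zigzag series has the lower slope but the larger fluctuation, the same trade-off the text describes between the mutual information and Var methods.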

The performance of the wrapper methods is shown in Figure 4. Five well-known methods from this group were examined. In the forward, recursive feature elimination, and stepwise methods, the r-squared value fluctuated greatly as the number of samples changed. Of the remaining two methods, backward and simulated, the simulated method showed less fluctuation and a lower slope.

From the group of embedded methods, two methods were investigated (Figure 5). The lasso method had a somewhat lower slope than the tree-based method, but its r-squared value was also lower.

Figure 6 shows the performance of the five similarity methods. Among them, the edit distance method had the lowest slope. The similarity methods showed only minor fluctuations as the data size was reduced.

This paper is available on arXiv under the CC BY-SA 4.0 Deed (Attribution-ShareAlike 4.0 International) license.
