The Problem of Redundant Variables in Random Forests

Kubus, Mariusz

doi:10.18778/0208-6018.339.01

Article details

Journal

Acta Universitatis Lodziensis. Folia Oeconomica

2018 | 6 | 339 | 7-16

Article title

The Problem of Redundant Variables in Random Forests

Authors

Mariusz Kubus

Content

Full texts:

https://czasopisma.uni.lodz.pl/foe/article/download/2552/3690 [remote]

Title variants

Problem zmiennych redundantnych w metodzie lasów losowych

Languages of publication

EN

Abstracts

EN

Random forests are currently among the supervised learning methods most favoured by practitioners. Their popularity stems in part from the fact that they can be applied without a time-consuming pre-processing step. Random forests can handle mixed types of features, irrespective of their distributions. The method is robust to outliers, and feature selection is built into the learning algorithm. However, classification accuracy can decrease in the presence of redundant variables. In this paper, we discuss two approaches to the problem of redundant variables: feature selection and clustering of features. We consider two strategies of searching for the best feature subset, as well as two formulas for aggregating the features within clusters. In the empirical experiment, we generate collinear predictors and add them to real datasets. Dimensionality reduction methods usually improve the accuracy of random forests, but none of them clearly outperforms the others.

PL

Lasy losowe są obecnie jedną z najchętniej stosowanych przez praktyków metod klasyfikacji wzorcowej. Na jej popularność wpływ ma możliwość jej stosowania bez czasochłonnego, wstępnego przygotowywania danych do analizy. Las losowy można stosować dla różnego typu zmiennych, niezależnie od ich rozkładów. Metoda ta jest odporna na obserwacje nietypowe oraz ma wbudowany mechanizm doboru zmiennych. Można jednak zauważyć spadek dokładności klasyfikacji w przypadku występowania zmiennych redundantnych. W artykule omawiane są dwa podejścia do problemu zmiennych redundantnych. Rozważane są dwa sposoby przeszukiwania w podejściu polegającym na doborze zmiennych oraz dwa sposoby konstruowania zmiennych syntetycznych w podejściu wykorzystującym grupowanie zmiennych. W eksperymencie generowane są liniowo zależne predyktory i włączane do zbiorów danych rzeczywistych. Metody redukcji wymiarowości zwykle poprawiają dokładność lasów losowych, ale żadna z nich nie wykazuje wyraźnej przewagi.
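
The experimental setup described in the abstract (adding collinear predictors to a dataset, then reducing them by clustering features) can be illustrated with a minimal sketch. It is not the author's procedure: the function names, the noise level, the 0.9 correlation threshold, the greedy grouping rule, and the use of the mean of standardized features as the aggregation formula are all assumptions made for illustration, and synthetic Gaussian columns stand in for a real dataset.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def add_collinear(columns, n_new, noise_sd, rng):
    """Append n_new noisy linear copies of randomly chosen original columns.

    Hypothetical generator: each new column is a*x + b + noise with a > 0.
    """
    out = list(columns)
    for _ in range(n_new):
        base = rng.choice(columns)
        a, b = rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0)
        out.append([a * v + b + rng.gauss(0.0, noise_sd) for v in base])
    return out


def cluster_by_correlation(columns, threshold):
    """Greedy grouping (an assumption, not the paper's clustering method):
    a column joins the first cluster whose first member it correlates with
    at or above the threshold; otherwise it starts a new cluster."""
    clusters = []
    for j, col in enumerate(columns):
        for members in clusters:
            if pearson(columns[members[0]], col) >= threshold:
                members.append(j)
                break
        else:
            clusters.append([j])
    return clusters


def standardize(col):
    """Centre a column and scale it to unit standard deviation."""
    n = len(col)
    m = sum(col) / n
    sd = (sum((v - m) ** 2 for v in col) / n) ** 0.5
    return [(v - m) / sd for v in col]


def aggregate(columns, clusters):
    """One synthetic feature per cluster: the mean of its standardized members."""
    reduced = []
    for members in clusters:
        z = [standardize(columns[j]) for j in members]
        reduced.append([sum(vals) / len(vals) for vals in zip(*z)])
    return reduced


rng = random.Random(0)
# Three independent Gaussian columns stand in for the original predictors.
original = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(3)]
augmented = add_collinear(original, n_new=6, noise_sd=0.05, rng=rng)
clusters = cluster_by_correlation(augmented, threshold=0.9)
reduced = aggregate(augmented, clusters)
```

The reduced columns could then be passed to any random forest implementation in place of the augmented ones to compare classification accuracy. Only positively correlated columns are grouped here; handling negative correlations would require flipping signs before averaging.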

Keywords

EN

random forests, redundant variables, feature selection, clustering of features

PL

lasy losowe, zmienne redundantne, dobór zmiennych, taksonomia cech

Publisher

Uniwersytet Łódzki. Wydawnictwo Uniwersytetu Łódzkiego

Journal

Acta Universitatis Lodziensis. Folia Oeconomica

Year

2018

Volume

6

Issue

339

Pages

7-16

Physical description

Dates

published

2019-02-13

Contributors

author

Mariusz Kubus

Opole University of Technology, Faculty of Production Engineering and Logistics, Department of Mathematics and IT Applications

Identifiers

DOI

10.18778/0208-6018.339.01

YADDA identifier

bwmeta1.element.ojs-doi-10_18778_0208-6018_339_01
