Evaluation of resampling methods in the class unbalance problem

Kubus, Mariusz

Article details

Journal

Econometrics. Ekonometria. Advances in Applied Data Analytics

2020 | vol. 24, no. 1 | 39-50

Title variants

PL

Ocena metod repróbkowania w problemie zbiorów niezbilansowanych

Languages of publication

EN

Abstracts

EN

Many real-world applications aim to predict rare events, so their training sets are highly unbalanced. In this case, classifiers are biased towards correct prediction of the majority class and misclassify the minority class, even though the rare events are of greater interest. To handle this problem, numerous techniques have been proposed that either balance the data or modify the learning algorithms. The goal of this paper is to compare simple random balancing methods with more sophisticated resampling methods that have appeared in the literature and are available in R. Additionally, the authors ask whether learning on the original dataset and classifying with a shifted threshold might be more competitive. The authors provide a survey from the perspective of regularized logistic regression and random forests. The results show that combining random under-sampling with random forests has an advantage over other techniques, while logistic regression can be competitive in the case of highly unbalanced data.
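
The two simple strategies named in the abstract, random under-sampling and a shifted classification threshold, can be illustrated with a minimal sketch. This is not the paper's code: the study was carried out in R, while the snippet below uses plain Python, and the function names `random_undersample` and `classify_with_threshold`, the toy data, and the choice of the minority-class share as the shifted threshold are all assumptions made for illustration.

```python
import random


def random_undersample(X, y, minority_label=1, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Hypothetical helper: the abstract does not specify the target class ratio,
    so a 1:1 ratio is assumed here.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    idx = sorted(minority + kept)
    return [X[i] for i in idx], [y[i] for i in idx]


def classify_with_threshold(probabilities, threshold):
    """Label a case as the rare class (1) when its score reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Toy data: 95 majority cases (label 0) and 5 minority cases (label 1).
X = [[float(i)] for i in range(100)]
y = [1 if i >= 95 else 0 for i in range(100)]

# Strategy 1: balance the training set before fitting any classifier.
X_bal, y_bal = random_undersample(X, y)

# Strategy 2: keep the original data and move the cut-off away from 0.5.
# Using the minority share as the cut-off is an assumption, not the paper's rule.
prior = sum(y) / len(y)
scores = [0.02, 0.04, 0.08, 0.30, 0.70]  # made-up predicted probabilities
default_labels = classify_with_threshold(scores, 0.5)
shifted_labels = classify_with_threshold(scores, prior)
```

With the default cut-off of 0.5 only the highest score is flagged as a rare event, whereas the shifted cut-off also flags the cases scored 0.08 and 0.30, which is the trade-off the abstract refers to.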

PL (English translation)

The aim of many practical applications of discriminant models is to predict rare events, and the training sets are then unbalanced. In this case, classifiers tend to classify majority-class objects correctly while misclassifying many objects of the minority class, which is the class of particular interest. Many techniques have been proposed to solve this problem, either by balancing the data or by modifying the learning algorithms. The aim of the article is to compare simple random balancing methods with more sophisticated ones that have appeared in the literature. Additionally, the question is posed whether building a model on the original dataset and shifting the classification threshold is not a competitive approach. The study is presented from the perspective of regularized logistic regression and random forests. The results show that combining under-sampling with random forests has an advantage over other techniques, while logistic regression can be competitive under strong imbalance.

Keywords

EN

class unbalance; resampling; regularized logistic regression; random forests

PL

klasy niezbilansowane; repróbkowanie; regularyzowana regresja logistyczna; lasy losowe

Publisher

Uniwersytet Ekonomiczny we Wrocławiu. Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu

Contributors

author

Kubus Mariusz

m.kubus@po.edu.pl
