Comparison of Machine Learning and Statistical Approaches of Detecting Anomalies Using a Simulation Study

Lenart, Klaudia

Article details

Journal

Econometrics. Ekonometria. Advances in Applied Data Analytics

2024 | 28 | 4 | 23-31

Authors

Klaudia Lenart

Title variants

PL

Uczenie maszynowe i statystyczne metody wykrywania anomalii – porównawcza analiza symulacyjna

Abstracts

PL

Cel: Anomalia to obserwacja lub grupa obserwacji nietypowych dla danego zbioru danych. Wykrywanie anomalii ma wiele zastosowań, nie tylko jako etap przygotowania danych do dalszych analiz, lecz także jako sposób wykrywania oszustw z wykorzystaniem kart kredytowych, włamań do sieci i wielu innych. Istnieją różne metody wykrywania anomalii. Można wyróżnić dwie grupy metod, które rozwijane są niezależnie: metody statystyczne oraz algorytmy uczenia maszynowego. Grupy te nieczęsto są porównywane. Podczas gdy metody statystyczne oparte są na sformułowaniu miary nietypowości obserwacji, nadzorowane uczenie maszynowe umożliwia wykorzystanie danych zarówno o typowych obserwacjach, jak i wcześniej zidentyfikowanych anomaliach. Celem artykułu jest dokonanie porównania tych dwóch podejść na podstawie badań symulacyjnych.
Metodyka: W przeprowadzonych badaniach symulacyjnych wykorzystano dane wygenerowane przy użyciu funkcji kopula. W celu wygenerowania różnych rodzajów anomalii dokonano modyfikacji parametrów oraz postaci rozkładów brzegowych zmiennych. Skuteczność każdej z metod została oceniona na podstawie miar dokładności klasyfikacji.
Wyniki: Podczas gdy skuteczność metod statystycznych zależna była od trafnego zaprognozowania procenta anomalii, jaki pojawi się w danych, metody uczenia maszynowego charakteryzowały się niższą czułością w przypadku wprowadzenia mniejszych zmian wartości parametrów.
Implikacje i rekomendacje: W przypadku metod statystycznych przedstawionych w ramach artykułu kluczowe było posiadanie wiedzy o rozkładzie zmiennych, podczas gdy do zastosowania algorytmów nadzorowanego uczenia maszynowego konieczne było posiadanie zbioru uczącego. W przeciwieństwie do uczenia maszynowego, metody statystyczne uzyskiwały podobną trafność w przypadku wprowadzenia mniejszych zmian wartości parametrów.
Oryginalność/wartość: Dwa podejścia do wykrywania anomalii zaprezentowane w artykule są nieczęsto porównywane. Zazwyczaj metody te są wykorzystywane przez dwie odrębne grupy badaczy – statystyków oraz specjalistów z zakresu uczenia maszynowego lub data science.

EN

Aim: An anomaly is an observation or a group of observations that is unusual for a given dataset. Anomaly detection has many applications, not only as a data preparation step but also, for example, as a way of identifying credit card fraud, network intrusions, and much more. There are diverse methods of anomaly detection. In particular, two groups of methods have been developed independently: statistical methods and machine learning algorithms. These two groups are rarely compared. While statistical methods focus on formulating a measure of the abnormality of observations, supervised machine learning makes it possible to use data on both typical observations and previously identified anomalies. The aim of this paper was to compare the two approaches by conducting a simulation study.
Methodology: A simulation study was conducted in which the data were generated using copula functions. To generate different types of anomalies, the parameters and the form of the marginal distributions of the variables were modified. The effectiveness of each method was evaluated using measures of classification performance.
Results: While the accuracy of the statistical methods depended on correctly predicting the percentage of anomalies that would occur in the data, the recall of the machine learning algorithms was significantly lower when the changes in the values of the marginal distribution parameters were smaller.
Implications and recommendations: For the statistical methods included in the study, knowledge of the distribution of the variables was crucial, while the supervised machine learning algorithms required a training dataset. Unlike the machine learning algorithms, the statistical methods performed with similar accuracy even when the changes in the values of the marginal distribution parameters were smaller.
Originality/value: The two approaches to anomaly detection presented in the paper are not often compared. They are usually used by two separate groups of researchers: statisticians and machine learning or data science specialists.
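The simulation design summarized in the abstract can be illustrated with a minimal sketch. It is not the author's procedure: the reference list points to the R copula package, whereas this example uses only the Python standard library. The bivariate Gaussian copula, the exponential marginals, the correlation and rate values, and all function names are assumptions made for illustration. Anomalies are produced by changing a marginal parameter, and a statistical-style detector flags the highest abnormality scores up to an assumed share of anomalies.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform
EPS = 1e-12                # keeps probabilities strictly inside (0, 1)


def clamp(u):
    """Keep a probability away from exactly 0 or 1."""
    return min(max(u, EPS), 1.0 - EPS)


def sample_pair(rho, rate, rng):
    """One draw from a Gaussian copula (correlation rho) with Exp(rate) marginals."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    u1, u2 = clamp(STD_NORMAL.cdf(z1)), clamp(STD_NORMAL.cdf(z2))
    # The inverse exponential CDF maps the copula uniforms to the marginals.
    return (-math.log(1.0 - u1) / rate, -math.log(1.0 - u2) / rate)


def make_dataset(n_typical, n_anomalies, rho, typical_rate, anomaly_rate, seed=0):
    """Typical points and anomalies share the copula but differ in the marginal rate."""
    rng = random.Random(seed)
    data = [sample_pair(rho, typical_rate, rng) for _ in range(n_typical)]
    data += [sample_pair(rho, anomaly_rate, rng) for _ in range(n_anomalies)]
    labels = [0] * n_typical + [1] * n_anomalies
    return data, labels


def abnormality_score(point, rho, typical_rate):
    """Squared Mahalanobis distance of the normal scores under the typical model."""
    z = [STD_NORMAL.inv_cdf(clamp(1.0 - math.exp(-typical_rate * x))) for x in point]
    return (z[0] ** 2 - 2.0 * rho * z[0] * z[1] + z[1] ** 2) / (1.0 - rho ** 2)


def precision_recall(labels, scores, assumed_fraction):
    """Flag the highest-scoring share of points and compare with the true labels."""
    k = max(1, round(assumed_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = sum(labels[i] for i in ranked[:k])
    return true_positives / k, true_positives / sum(labels)


# Hypothetical settings: 5% anomalies whose marginal rate differs from the typical one.
RHO, TYPICAL_RATE, ANOMALY_RATE = 0.5, 1.0, 0.25
data, labels = make_dataset(950, 50, RHO, TYPICAL_RATE, ANOMALY_RATE)
scores = [abnormality_score(p, RHO, TYPICAL_RATE) for p in data]
for fraction in (0.02, 0.05, 0.10):
    precision, recall = precision_recall(labels, scores, fraction)
    print(f"assumed anomaly share {fraction:.2f}: precision={precision:.2f}, recall={recall:.2f}")
```

Varying the assumed share shows how the detector's results depend on predicting the percentage of anomalies, and moving ANOMALY_RATE closer to TYPICAL_RATE mimics a smaller change in the marginal parameters. A supervised classifier would instead be trained on the labels, which this sketch does not attempt.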

Keywords

EN

anomaly detection; simulation study; machine learning

PL

wykrywanie anomalii; badanie symulacyjne; uczenie maszynowe

Publisher

Uniwersytet Ekonomiczny we Wrocławiu. Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu

Year

2024

Volume

28

Issue

4

Pages

23-31

Dates

published

2024

Contributors

author

Klaudia Lenart

University of Economics in Katowice, Doctoral School

https://orcid.org/0000-0001-8135-9362

References

Aggarwal, C. C. (2017). Outlier Analysis (2nd ed.). Springer International Publishing. https://doi.org/10.1007/978-3-319-47578-3
Baron, D., & Poznanski, D. (2017). The Weirdest SDSS Galaxies: Results from an Outlier Detection Algorithm. Monthly Notices of the Royal Astronomical Society, 465(4), 4530-4555.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons.
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly Detection: A Survey. ACM Computing Surveys (CSUR), 41(3), 1-58.
Das, S., Dey, A., Pal, A., & Roy, N. (2015). Applications of Artificial Intelligence in Machine Learning: Review and Prospect. International Journal of Computer Applications, 115(9), 31-41.
Faaique, M. (2024). Overview of Big Data Analytics in Modern Astronomy. International Journal of Mathematics, Statistics, and Computer Science, 2, 96-113.
Green, R. F. (1976). Outlier-prone and Outlier-resistant Distributions. Journal of the American Statistical Association, 71(354), 502-505.
Hawkins, D. M. (1980). Identification of Outliers (Vol. 11). Springer.
Hofert, M., Kojadinovic, I., Maechler, M., & Yan, J. (2024). copula: Multivariate Dependence with Copulas. R package version 1.1-4. https://CRAN.R-project.org/package=copula
Jabez, J., & Muthukumar, B. (2015). Intrusion Detection System (IDS): Anomaly Detection Using Outlier Detection Approach. Procedia Computer Science, 48, 338-346.
Kulkarni, A., Mani, P., & Domeniconi, C. (2017). Network-based Anomaly Detection for Insider Trading. arXiv Preprint arXiv:1702.05809.
Lee, L.-F. (1983). Generalized Econometric Models with Selectivity. Econometrica: Journal of the Econometric Society, 51(2), 507-512.
Liu, J., Xie, G., Wang, J., Li, S., Wang, C., Zheng, F., & Jin, Y. (2024). Deep Industrial Image Anomaly Detection: A Survey. Machine Intelligence Research, 21(1), 104-135.
Maddireddy, B. R. (2024). Neural Network Architectures in Cybersecurity: Optimizing Anomaly Detection and Prevention. International Journal of Advanced Engineering Technologies and Innovations, 1(2), 238-266.
Mehrotra, K. G., Mohan, C. K., & Huang, H. (2017). Anomaly Detection Principles and Algorithms. Springer International Publishing. https://doi.org/10.1007/978-3-319-67526-8
Nelsen, R. B. (1998). An Introduction to Copulas. Springer Science & Business Media.
Prarthana, T. S., & Gangadhar, N. D. (2017). User Behaviour Anomaly Detection in Multidimensional Data. 2017 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), 3-10.
Serrano-Cinca, C., Gutiérrez-Nieto, B., & Bernate-Valbuena, M. (2019). The Use of Accounting Anomalies Indicators to Predict Business Failure. European Management Journal, 37(3), 353-375.
Sklar, M. (1959). Fonctions de répartition à n dimensions et leurs marges. Annales de l’ISUP, 8(3), 229-231.
Thimonier, H., Popineau, F., Rimmel, A., Doan, B. L., & Daniel, F. (2024, February). Comparative Evaluation of Anomaly Detection Methods for Fraud Detection in Online Credit Card Payments. In International Congress on Information and Communication Technology (pp. 37-50).
Yan, J. (2007). Enjoy the Joy of Copulas: With a Package copula. Journal of Statistical Software, 21(4), 1-21. https://doi.org/10.18637/jss.v021.i04

Identifiers

Biblioteka Nauki

59125130

YADDA identifier

bwmeta1.element.ojs-issn-1507-3866-year-2024-volume-28-issue-4-article-oai__article_1539
