Isolation Forests for Symbolic Data as a Tool for Outlier Mining

Pełka, Marcin; Dudek, Andrzej

doi:10.15611/eada.2024.1.01

Article details

Journal

Econometrics. Ekonometria. Advances in Applied Data Analytics

2024 | vol. 28 | no. 1 | pp. 1-10

Article title

Isolation Forests for Symbolic Data as a Tool for Outlier Mining

Authors

Marcin Pełka, Andrzej Dudek

Title variants

PL

Lasy separujące dla danych symbolicznych jako narzędzie wykrywania obserwacji odstających

Languages of publication

Abstracts

PL

Aim: Identifying outliers is a key element of data analysis. Although the literature offers many different definitions of outliers, in general they are objects that differ from the other observations in a dataset. The literature describes many methods that can be used for classical data. Unfortunately, such analyses are lacking for symbolic data. The aim of the article is to propose a modification of isolation forests for symbolic data. Methodology: The article uses isolation forests for symbolic data to identify outliers in artificial datasets with a known class structure and a known number of outliers. Results: The results indicate that isolation forests for symbolic data are an effective and fast tool for identifying outliers. Implications and recommendations: Because isolation forests for symbolic data proved to be an effective tool for identifying outliers, future research should examine the effectiveness of the method on real datasets (e.g. a credit card fraud dataset) and compare it with other methods for finding outliers (e.g. DBSCAN). The authors suggest using the same parameters for isolation forests for symbolic data as are usually used for random forests for classical data. Originality/value: The article not only presents the theory of outliers but also proposes how to apply isolation forests to symbolic data.

EN

Aim: Outlier detection is a key part of every data analysis. Although the literature offers many definitions of outliers, all of them emphasise that outliers are objects that differ in some way from the other objects in the dataset. Many approaches have been proposed, compared, and analysed for classical data. However, only a few studies deal with outlier detection in symbolic data analysis. This paper proposes an adaptation of the isolation forest to symbolic data. Methodology: An isolation forest for symbolic data is used to detect outliers in four different artificial datasets with a known cluster structure and a known number of outliers. Results: The results show that the isolation forest for symbolic data is a fast and efficient tool for outlier mining. Implications and recommendations: As the isolation forest for symbolic data appears to be an efficient tool for outlier detection in artificial data, further studies should focus on real datasets that contain outliers (e.g. a credit card fraud dataset), and the approach should be compared with other outlier mining tools (e.g. DBSCAN). The authors recommend using the same initial settings for the isolation forest for symbolic data as those proposed for the isolation forest for classical data. Originality/value: This paper is the first of its kind, focusing not only on the problem of outlier detection in general, but also extending the well-known isolation forest model to symbolic data.

Keywords

EN

symbolic data analysis, isolation forest, outliers

PL

analiza danych symbolicznych, lasy separujące, obserwacje odstające

Publisher

Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu

Journal

Econometrics. Ekonometria. Advances in Applied Data Analytics

Year

2024

Volume

28

Issue

1

Pages

1-10

Physical description

Dates

published

2024

Contributors

author

Marcin Pełka

Wroclaw University of Economics and Business, Poland

https://orcid.org/0000-0002-2225-5229

author

Andrzej Dudek

Wroclaw University of Economics and Business, Poland

https://orcid.org/0000-0002-4943-8703

Document Type

Publication order reference

Identifiers

DOI

10.15611/eada.2024.1.01

Biblioteka Nauki

31233541

YADDA identifier

bwmeta1.element.ojs-doi-10_15611_eada_2024_1_01
