Evaluation of Selected Approaches to Clustering Categorical Variables

Šulc, Zdeněk

Article details

Journal

Statistics in Transition new series

2014 | 15 | 4 | 591–610

Languages of publication

EN

Abstracts

EN

This paper focuses on recently proposed similarity measures and their performance in categorical variable clustering. It compares clustering results using three recently developed similarity measures (IOF, OF and Lin measures) with results obtained using two association measures for nominal variables (Cramér’s V and the uncertainty coefficient) and with the simple matching coefficient (the overlap measure). To eliminate the influence of a particular linkage method on the structure of final clusters, three linkage methods are examined (complete, single, average). The created groups (clusters) of variables can be considered as the basis for dimensionality reduction, e.g. by choosing one of the variables from a given group as a representative for the whole group. The quality of resulting clusters is evaluated by the within-cluster variability, expressed by the WCM coefficient, and by dendrogram analysis. The examined similarity measures are compared and evaluated using two real data sets from a social survey.
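As a rough illustration of the procedure the abstract describes, the sketch below clusters variables hierarchically using one of the named association measures, Cramér's V, under single, complete, or average linkage. It is not the paper's implementation. Taking the dissimilarity as 1 − V, the function names, and the toy data are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two equally long sequences of category labels."""
    n = len(x)
    row = Counter(x)
    col = Counter(y)
    joint = Counter(zip(x, y))
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for r, nr in row.items():
        for c, nc in col.items():
            expected = nr * nc / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def cluster_variables(data, n_clusters, linkage="average"):
    """Agglomerative clustering of the variables (columns) in `data`.

    `data` maps a variable name to its list of observed categories.
    Dissimilarity is taken as 1 - V, which is an assumption of this sketch.
    """
    names = list(data)
    dist = {frozenset(p): 1.0 - cramers_v(data[p[0]], data[p[1]])
            for p in combinations(names, 2)}
    combine = {"single": min, "complete": max,
               "average": lambda d: sum(d) / len(d)}[linkage]
    clusters = [(name,) for name in names]

    def between(a, b):
        # Linkage distance between two clusters of variables.
        return combine([dist[frozenset((u, v))] for u in a for v in b])

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters under the chosen linkage.
        a, b = min(combinations(clusters, 2), key=lambda p: between(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return sorted(tuple(sorted(c)) for c in clusters)


# Hypothetical toy data: q1 and q2 are perfectly associated, as are q3 and q4.
toy = {
    "q1": ["a", "a", "b", "b", "c", "c", "a", "b"],
    "q2": ["x", "x", "y", "y", "z", "z", "x", "y"],
    "q3": ["u", "v", "u", "v", "u", "v", "u", "v"],
    "q4": ["m", "n", "m", "n", "m", "n", "m", "n"],
}
groups = cluster_variables(toy, n_clusters=2, linkage="complete")
# groups == [("q1", "q2"), ("q3", "q4")]
```

Each resulting group could then supply one representative variable, in line with the dimensionality reduction the abstract mentions. Replacing `cramers_v` with another pairwise measure would leave the linkage step unchanged.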

Keywords

EN

variable clustering, nominal variables, association measures, similarity measures

Publisher

Główny Urząd Statystyczny

Contributors

author

Zdeněk Šulc

zdenek.sulc@vse.cz

Department of Statistics and Probability, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic

References

ANDERBERG, M. R., (1973). Cluster Analysis for Applications. Academic Press, New York.
BORIAH, S., CHANDOLA, V., KUMAR, V., (2008). Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 8th International Conference on Data Mining. SIAM, pp. 243–254.
CHANDOLA, V., BORIAH, S., KUMAR, V., (2009). A framework for exploring categorical data. In: Proceedings of the 9th International Conference on Data Mining. SIAM, pp. 187–198.
CHAVENT, M., KUENTZ, V., LIQUET, B., SARACCO, J., (2012). ClustOfVar: An R package for the clustering of variables. Journal of Statistical Software, 50(13):1–16. Available at: <http://arxiv.org/abs/1112.0295> [Accessed 16 October 2014].
CHAVENT, M., KUENTZ, V., SARACCO, J., (2010). A partitioning method for the clustering of categorical variables. In: Locarek-Junge, H., Weihs, C., eds, Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin Heidelberg, pp. 91–99.
D’ENZA, A. I., GREENACRE, M. J., (2012). Multiple correspondence analysis for the quantification and visualization of large categorical data sets. In: Advanced Statistical Methods for the Analysis of Large Data-Sets. Springer, Berlin Heidelberg, pp. 453–463.
EVERITT, B. S., LANDAU, S., LEESE, M., STAHL, D., (2011). Cluster Analysis, 5th edn, Wiley, Chichester.
GAN, G., MA, C., WU, J., (2007). Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM, Philadelphia.
GORDON, A. D., (1999). Classification, 2nd edn, Chapman & Hall/CRC, Boca Raton.
GREENACRE, M. J., (2010). Correspondence analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(5):613–619.
JOLLIFFE, I. T., (2002). Principal Component Analysis, 2nd edn, Springer, New York.
LIN, D., (1998). An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 296–304.
PALLA, K., KNOWLES, D. A., GHAHRAMANI, Z., (2012). A nonparametric variable clustering model. In: Pereira, F., Burges, C. J. C., Bottou, L., Weinberger, K. Q., eds, Advances in Neural Information Processing Systems 25. NIPS Foundation. Available at: <http://papers.nips.cc/paper/4579-a-nonparametric-variable-clustering-model.pdf> [Accessed 16 October 2014].
PAYNE, T. R., EDWARDS, P., (1999). Dimensionality reduction through correspondence analysis. Available at: <http://eprints.soton.ac.uk/263091/> [Accessed 16 October 2014].
ŘEZANKOVÁ, H., LÖSTER, T., HÚSEK, D., (2011). Evaluation of categorical data clustering. In: Mugellini, E., Szczepaniak, P. S., Pettenati, M. C. et al., eds, Advances in Intelligent Web Mastering 3. Springer Verlag, Berlin, pp. 173–182.
ŘEZANKOVÁ, H., (2014). Nominal variable clustering and its evaluation. In: Proceedings of the 8th International Days of Statistics and Economics. Melandrium, Slaný, pp. 1293–1302. Available at: <http://msed.vse.cz/msed_2014/article/276-Rezankova-Hana-paper.pdf> [Accessed 5 November 2014].
SPARCK-JONES, K., (1972, 2002). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21. Later: Journal of Documentation, 60(5):493–502.
ŠULC, Z., ŘEZANKOVÁ, H., (2014). Evaluation of recent similarity measures for categorical data. In: Proceedings of the 17th International Conference Applications of Mathematics and Statistics in Economics. Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu, Wrocław, pp. 249–258. Available at: <http://www.amse.ue.wroc.pl/papers/Sulc,Rezankova.pdf> [Accessed 5 November 2014].
