2014 | 15 | 4 | 591–610
Article title

Evaluation of Selected Approaches to Clustering Categorical Variables

This paper focuses on recently proposed similarity measures and their performance in clustering categorical variables. It compares clustering results obtained with three recently developed similarity measures (the IOF, OF and Lin measures) with results obtained using two association measures for nominal variables (Cramér’s V and the uncertainty coefficient) and the simple matching coefficient (the overlap measure). To eliminate the influence of a particular linkage method on the structure of the final clusters, three linkage methods are examined (complete, single and average). The resulting groups (clusters) of variables can serve as a basis for dimensionality reduction, e.g. by choosing one variable from each group as its representative. The quality of the resulting clusters is evaluated by the within-cluster variability, expressed by the WCM coefficient, and by dendrogram analysis. The examined similarity measures are compared and evaluated on two real data sets from a social survey.
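As a rough illustration of the workflow described in the abstract, the following Python sketch computes two of the named measures (the simple matching coefficient and Cramér’s V) between variables and passes the resulting dissimilarities to a naive agglomerative procedure with a selectable linkage. It is not the authors' implementation: the function names, the toy data, the conversion of a similarity s to a dissimilarity 1 - s, and the assumption that all variables share the same set of categories are illustrative choices made here. The IOF, OF and Lin measures and the WCM coefficient are not shown.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def simple_matching(x, y):
    """Share of objects on which two variables take the same category."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def cramers_v(x, y):
    """Cramér's V from the contingency table of two nominal variables."""
    n = len(x)
    row, col, cell = Counter(x), Counter(y), Counter(zip(x, y))
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (cell[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def average(values):
    """Mean of pairwise dissimilarities (average linkage)."""
    values = list(values)
    return sum(values) / len(values)


def agglomerate(data, similarity, linkage=max, n_clusters=2):
    """Naive agglomerative clustering of variables (illustrative only).

    linkage=max is complete linkage, linkage=min is single linkage,
    linkage=average is average linkage, all on dissimilarity 1 - similarity.
    """
    names = list(data)
    dist = {
        frozenset(pair): 1.0 - similarity(data[pair[0]], data[pair[1]])
        for pair in combinations(names, 2)
    }
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                dist[frozenset((a, b))]
                for a in clusters[ij[0]]
                for b in clusters[ij[1]]
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: three survey items observed on six respondents.
data = {
    "q1": ["a", "a", "b", "b", "c", "c"],
    "q2": ["a", "a", "b", "b", "c", "b"],
    "q3": ["c", "b", "a", "c", "a", "b"],
}

print(agglomerate(data, simple_matching, linkage=max))  # [['q1', 'q2'], ['q3']]
```

On this toy data, q1 and q2 agree on five of six respondents, so they are merged first under all three linkages, and q1 could stand in for the pair if the groups were used for dimensionality reduction.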
  • Department of Statistics and Probability, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic
  • ANDERBERG, M. R., (1973). Cluster Analysis for Applications. Academic Press, New York.
  • BORIAH, S., CHANDOLA, V., KUMAR, V., (2008). Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 8th International Conference on Data Mining. SIAM, pp. 243–254.
  • CHANDOLA, V., BORIAH, S., KUMAR, V., (2009). A framework for exploring categorical data. In: Proceedings of the 9th International Conference on Data Mining. SIAM, pp. 187–198.
  • CHAVENT, M., KUENTZ, V., LIQUET, B., SARACCO, J., (2012). ClustOfVar: An R package for the clustering of variables. Journal of Statistical Software, 50(13):1–16. Available at: <> [Accessed 16 October 2014].
  • CHAVENT, M., KUENTZ, V., SARACCO, J., (2010). A partitioning method for the clustering of categorical variables. In: Locarek-Junge, H., Weihs, C., eds, Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin Heidelberg, pp. 91–99.
  • D’ENZA, A. I., GREENACRE, M. J., (2012). Multiple correspondence analysis for the quantification and visualization of large categorical data sets. In: Advanced Statistical Methods for the Analysis of Large Data-Sets. Springer, Berlin Heidelberg, pp. 453–463.
  • EVERITT, B. S., LANDAU, S., LEESE, M., STAHL, D., (2011). Cluster Analysis, 5th edn, Wiley, Chichester.
  • GAN, G., MA, C., WU, J., (2007). Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM, Philadelphia.
  • GORDON, A. D., (1999). Classification, 2nd edn, Chapman & Hall/CRC, Boca Raton.
  • GREENACRE, M. J., (2010). Correspondence analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(5):613–619.
  • JOLLIFFE, I. T., (2002). Principal Component Analysis, 2nd edn, Springer, New York.
  • LIN, D., (1998). An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 296–304.
  • PALLA, K., KNOWLES, D. A., GHAHRAMANI, Z., (2012). A nonparametric variable clustering model. In: Pereira, F., Burges, C. J. C., Bottou, L., Weinberger, K. Q., eds, Advances in Neural Information Processing Systems 25. NIPS Foundation. Available at: <> [Accessed 16 October 2014].
  • PAYNE, T. R., EDWARDS, P., (1999). Dimensionality reduction through correspondence analysis. Available at: <> [Accessed 16 October 2014].
  • ŘEZANKOVÁ, H., LÖSTER, T., HÚSEK, D., (2011). Evaluation of categorical data clustering. In: Mugellini, E., Szczepaniak, P. S., Pettenati, M. C. et al., eds, Advances in Intelligent Web Mastering 3. Springer Verlag, Berlin, pp. 173–182.
  • ŘEZANKOVÁ, H., (2014). Nominal variable clustering and its evaluation. In: Proceedings of the 8th International Days of Statistics and Economics. Melandrium, Slaný, pp. 1293–1302. Available at: < > [Accessed 5 November 2014].
  • SPARCK-JONES, K., (1972, 2002). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21. Later: Journal of Documentation, 60(5):493–502.
  • ŠULC, Z., ŘEZANKOVÁ, H., (2014). Evaluation of recent similarity measures for categorical data. In: Proceedings of the 17th International Conference Applications of Mathematics and Statistics in Economics. Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu, Wroclaw, pp. 249–258. Available at: <,Rezankova.pdf> [Accessed 5 November 2014].