Number of Clusters and the Quality of Hybrid Predictive Models in Analytical CRM
Managers need effective predictive models to make more accurate marketing decisions. Typically, these models estimate the probability that a customer belongs to a particular category, group or segment. In analytical CRM, the categories refer to customers interested in starting cooperation with the company (acquisition models), customers who purchase additional products (cross- and up-sell models) or customers intending to end the cooperation (churn models). When building predictive models, researchers draw on analytical tools from various disciplines and favor those that perform best. This article attempts to build a hybrid predictive model that combines decision trees (the C&RT algorithm) with cluster analysis (k-means). Five cluster validity indices and eight datasets were used in the experiments. Model performance was evaluated with popular measures: accuracy, precision, recall, G-mean, F-measure and lift in the first and second deciles. The authors looked for a relationship between the number of clusters and the quality of the models.
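The evaluation measures named in the abstract can be illustrated with a short sketch. The abstract does not give formulas, so the definitions below are assumptions based on common usage: G-mean is taken as the geometric mean of recall and specificity, and lift in a decile is taken as the response rate within that decile (customers ranked by predicted score) divided by the overall response rate, not the cumulative variant. The function names `classification_metrics` and `decile_lift` and the toy data are hypothetical and do not come from the article.

```python
import math


def classification_metrics(y_true, y_pred):
    """Confusion-matrix measures for binary labels (1 = target class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Assumed definition: geometric mean of recall and specificity.
    g_mean = math.sqrt(recall * specificity)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


def decile_lift(y_true, scores, decile=1):
    """Non-cumulative lift in one decile (1 = highest-scored 10%)."""
    n = len(y_true)
    positives = sum(y_true)
    if n < 10 or positives == 0:
        raise ValueError("need at least 10 cases and one positive case")
    # Rank customers from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    start = (decile - 1) * n // 10
    end = decile * n // 10
    segment = [label for _, label in ranked[start:end]]
    return (sum(segment) / len(segment)) / (positives / n)


# Toy data: 20 customers, 5 of them in the target class.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [(20 - i) / 20 for i in range(20)]   # already in descending order
y_pred = [1] * 7 + [0] * 13                   # top 7 scores flagged as targets

metrics = classification_metrics(y_true, y_pred)
lift_1 = decile_lift(y_true, scores, decile=1)
lift_2 = decile_lift(y_true, scores, decile=2)
print(metrics, lift_1, lift_2)
```

A lift above 1 in the first decile means that the model concentrates target customers among the highest-scored cases, which is why the measure is popular for campaign targeting.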