Abstract
AbstractIn unsupervised machine learning, agreement between partitions is commonly assessed with so-called external validity indices. Researchers tend to use and report indices that quantify agreement between two partitions for all clusters simultaneously. Commonly used examples are the Rand index and the adjusted Rand index. Since these overall measures give a general notion of what is going on, their values are usually hard to interpret. The goal of this study is to provide a thorough understanding of the adjusted Rand index as well as many other partition comparison indices based on counting object pairs. It is shown that many overall indices based on the pair-counting approach can be decomposed into indices that reflect the degree of agreement on the level of individual clusters. The decompositions (1) show that the overall indices can be interpreted as summary statistics of the agreement on the cluster level, (2) specify how these overall indices are related to the indices for individual clusters, and (3) show that the overall indices are affected by cluster size imbalance: if cluster sizes are unbalanced these overall measures will primarily reflect the degree of agreement between the partitions on the large clusters, and will provide much less information on the agreement on smaller clusters. Furthermore, the value of Rand-like indices is determined to a large extent by the number of pairs of objects that are not joined in either of the partitions.
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences,Statistics, Probability and Uncertainty,Psychology (miscellaneous),Mathematics (miscellaneous)
Reference77 articles.
1. Albatineh, A.N., & Niewiadomska-Bugaj, M. (2011a). Correcting Jaccard and other similarity indices for chance agreement in cluster analysis. Advances in Data Analysis and Classification, 5(3), 179–200.
2. Albatineh, A.N., & Niewiadomska-Bugaj, M. (2011b). MCS: A method for finding the number of clusters. Journal of Classification, 28, 184–209.
3. Albatineh, A.N., Niewiadomska-Bugaj, M., & Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23 (2), 301–313.
4. Alok, A.K., Saha, S., & Ekbal, A. (2014). Development of an external cluster validity index using probabilistic approach and min-max distance. International Journal of Computer Information Systems and Industrial Management Applications, 6, 494–504.
5. Anderson, D.T., Bezdek, J.C., Popescu, M., & Keller, J.M. (2010). Comparing fuzzy, probabilistic, and possibilistic partitions. IEEE Transactions on Fuzzy Systems, 18, 906–917.
Cited by
30 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献