Approximate K-Means++ in Sublinear Time

Published:2016-02-21 Issue:1 Volume:30 Page:
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Bachem Olivier, Lucic Mario, Hassani S. Hamed, Krause Andreas

Abstract

The quality of K-Means clustering is extremely sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data, which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for K-Means clustering. The main idea is to replace the exact D²-sampling step in k-means++ with a substantially faster approximation based on Markov Chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, one can thus obtain a provably good clustering in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while offering a reduction in runtime of several orders of magnitude.
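
The idea in the abstract, replacing exact D²-sampling with a short Markov chain, can be illustrated with a minimal Python sketch. It assumes a Metropolis-Hastings chain with a uniform proposal over the data points and a fixed chain length; the names `mcmc_seeding`, `sq_dist_to_centers`, and `chain_length` are hypothetical and not taken from the paper, and this is an illustration of the technique rather than the authors' reference implementation.

```python
import random


def sq_dist_to_centers(point, centers):
    """Squared Euclidean distance from `point` to its nearest center."""
    return min(
        sum((p - c) ** 2 for p, c in zip(point, center))
        for center in centers
    )


def mcmc_seeding(points, k, chain_length, rng=None):
    """Pick k seed centers, approximating D^2-sampling with a Markov chain.

    Assumption: uniform proposal distribution and a fixed `chain_length`
    chosen by the caller.
    """
    rng = rng or random.Random()
    # First center: sampled uniformly at random, as in k-means++.
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # Start the chain at a uniformly sampled point.
        x = rng.choice(points)
        dx = sq_dist_to_centers(x, centers)
        for _ in range(chain_length - 1):
            # Propose a uniformly sampled candidate.
            y = rng.choice(points)
            dy = sq_dist_to_centers(y, centers)
            # Accept with probability min(1, dy / dx), so the chain's
            # stationary distribution is proportional to the squared
            # distance to the current centers (the D^2 distribution).
            if dx == 0 or dy / dx > rng.random():
                x, dx = y, dy
        # The final state of the chain becomes the next center.
        centers.append(x)
    return centers
```

Each new center costs `chain_length` distance evaluations against the current centers, independent of the number of data points, whereas exact D²-sampling needs a full pass over the data per center. How long the chain must be for the result to stay close to k-means++ depends on the data assumptions the abstract refers to.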

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 21 articles.

1. Settling Time vs. Accuracy Tradeoffs for Clustering Big Data;Proceedings of the ACM on Management of Data;2024-05-29

2. High-density cluster core-based k-means clustering with an unknown number of clusters;Applied Soft Computing;2024-04

3. An improved seeds scheme in K‐means clustering algorithm for the UAVs control system application;IET Communications;2024-03-11

4. Sybil Attack Detection Based on Signal Clustering in Vehicular Networks;IEEE Transactions on Machine Learning in Communications and Networking;2024

5. K-means-dist: A Novel Approach for Enhanced Cybersecurity Clustering Using Combined Distance Metrics;2023 10th International Conference on Wireless Networks and Mobile Communications (WINCOM);2023-10-26