GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data

Published:2023-06-13 Issue:2 Volume:1 Page:1-27
ISSN:2836-6573
Container-title:Proceedings of the ACM on Management of Data
language:en
Short-container-title:Proc. ACM Manag. Data

Author:

Chai Chengliang¹, Liu Jiabin¹, Tang Nan², Fan Ju³, Miao Dongjing⁴, Wang Jiayi⁵, Luo Yuyu⁵, Li Guoliang⁵

Affiliation:

1. Beijing Institute of Technology, Beijing, China

2. Qatar Computing Research Institute, HBKU, Doha, Qatar

3. Renmin University of China, Beijing, China

4. Harbin Institute of Technology, Harbin, China

5. Tsinghua University, Beijing, China

Abstract

Given a dataset with incomplete data (e.g., missing values), training a machine learning model over it requires two steps. The first is a data-effective step that cleans the data to improve data quality (and hence the quality of the model trained on the cleaned data). The second is a data-efficient step that selects a core subset of the data (called a coreset) such that models trained on the entire data and on the coreset have similar quality, which improves training efficiency. Methods that are first data-effective and then data-efficient are too costly, because cleaning the whole dataset is expensive; methods that are first data-efficient and then data-effective yield low model quality, because they cannot select a high-quality coreset from incomplete data. In this paper, we investigate the problem of coreset selection over incomplete data for data-effective and data-efficient machine learning. The essential challenge is how to model the incomplete data in order to select a high-quality coreset. To this end, we propose the GoodCore framework for selecting a good coreset over incomplete data at low cost. To model the unknown complete data, we treat the combinations of possible repairs as possible worlds of the incomplete data. Based on these possible worlds, GoodCore selects an expected optimal coreset through gradient approximation, without training ML models. We formally define the expected optimal coreset selection problem, prove its NP-hardness, and propose a greedy algorithm with an approximation ratio. To make GoodCore more efficient, we further propose optimization methods that incorporate human-in-the-loop imputation or automatic imputation into the framework. Experimental results show the effectiveness and efficiency of our framework at low cost.
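The possible-worlds idea and the greedy selection described in the abstract can be illustrated with a small sketch. This is not the GoodCore algorithm: the abstract gives no formulas, so everything below is an assumption made for illustration. In particular, the function names (`possible_worlds`, `expected_cost`, `greedy_coreset`) are hypothetical, all worlds are assumed equally likely, and squared feature-space distance stands in for the gradient-approximation error that the paper optimizes.

```python
from itertools import product


def possible_worlds(rows, candidates):
    """Enumerate every complete dataset obtained by filling each missing
    cell with one of its candidate repairs.

    `candidates` maps (row_index, col_index) -> list of candidate values.
    (Hypothetical helper; the exhaustive enumeration is exponential and
    only suitable for toy inputs.)
    """
    cells = sorted(candidates)
    worlds = []
    for combo in product(*(candidates[c] for c in cells)):
        world = [list(r) for r in rows]
        for (i, j), value in zip(cells, combo):
            world[i][j] = value
        worlds.append(world)
    return worlds


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def expected_cost(worlds, chosen):
    """Average over worlds (assumed equally likely) of the total distance
    from each tuple to its nearest chosen tuple. Feature distance is an
    assumed proxy for gradient difference."""
    total = 0.0
    for w in worlds:
        total += sum(min(sq_dist(p, w[s]) for s in chosen) for p in w)
    return total / len(worlds)


def greedy_coreset(worlds, k):
    """Greedily add the tuple index that most reduces the expected cost."""
    n = len(worlds[0])
    chosen = []
    for _ in range(k):
        remaining = [i for i in range(n) if i not in chosen]
        best = min(remaining, key=lambda i: expected_cost(worlds, chosen + [i]))
        chosen.append(best)
    return chosen


# Toy data: two clusters; row 1 has a missing second attribute (None)
# with two candidate repairs, giving two possible worlds.
rows = [[0.0, 0.0], [0.1, None], [5.0, 5.0], [5.1, 4.9]]
candidates = {(1, 1): [0.0, 0.2]}

worlds = possible_worlds(rows, candidates)
coreset = greedy_coreset(worlds, k=2)
print(coreset)
```

On this toy input the greedy loop picks one tuple from each cluster, because that choice keeps the expected cost low in both possible worlds. The paper's optimizations (human-in-the-loop or automatic imputation) would reduce the number of worlds to consider; they are not modeled here.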

Funder

Zhejiang Lab's International Talent Fund for Young Professionals

TAL Education

National Science Foundation of China

Huawei Technologies

BNRist

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3589302

Reference: 80 articles.

1. 2022. https://github.com/awslabs/datawig.

2. 2022. https://archive.ics.uci.edu/ml/datasets/nursery.

3. 2022. https://archive.ics.uci.edu/ml/datasets/adult.

4. 2022. https://www.kaggle.com/.

5. 2022. https://ride.capitalbikeshare.com/system-data.

Cited by 1 article.

1. Unlocking the Power of Data: Dynamic Subset Selection with Reinforcement Learning;2023