Language-Model Based Informed Partition of Databases to Speed Up Pattern Mining

Published:2024-05-29 Issue:3 Volume:2 Page:1-27
ISSN:2836-6573
Container-title:Proceedings of the ACM on Management of Data
language:en
Short-container-title:Proc. ACM Manag. Data

Author:

Bobed Lisbona Carlos¹, Bernad Jordi¹, Maillot Pierre²

Affiliation:

1. Aragon Institute of Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain

2. University Cote d'Azur, Inria, CNRS, I3S, Nice, France

Abstract

Extracting interesting patterns from data is the main objective of Data Mining. In this context, Frequent Itemset Mining has proven useful for providing insights from transactional databases, which, in turn, can be used to understand the structure of Knowledge Graphs. Despite many advances in the field, the NP-hard nature of the problem means that the main approaches still struggle with large databases that have large and sparse vocabularies, such as the ones obtained from graph propositionalizations. There have been efforts to propose parallel algorithms, but, so far, their goal has not been to tackle this source of complexity (i.e., vocabulary size). In this paper, we therefore propose to parallelize frequent itemset mining algorithms by partitioning the database horizontally (i.e., transaction-wise) without neglecting the vertical information (i.e., item-wise). Instead of relying on pure item co-appearance metrics, we advocate a different approach: modeling databases as documents, where each transaction is a sentence and each item a word. In this way, we can apply recent language modeling techniques (i.e., word embeddings) to obtain a continuous representation of the database, cluster it into different partitions, and apply any mining algorithm to them. We show how our proposal leads to informed partitions with a reduced vocabulary size and a reduced entropy (i.e., disorder). This enhances scalability, allowing us to speed up mining even in very large databases with sparse vocabularies. We have carried out a thorough experimental evaluation over both synthetic and real datasets, showing the benefits of our proposal.
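
The pipeline outlined in the abstract (items as words, transactions as sentences, a continuous representation, clustering into partitions) can be illustrated with a minimal sketch. It is not the authors' implementation: normalized co-occurrence vectors stand in for learned word embeddings, a small hand-written k-means stands in for whatever clustering the paper uses, and the entropy shown is plain Shannon entropy over item frequencies, which may differ from the paper's definition. The toy database and all function names are hypothetical.

```python
import math
from collections import Counter

def item_vectors(db):
    """Stand-in for word embeddings (an assumption, not the paper's model):
    each item is its unit-normalized co-occurrence count vector."""
    vocab = sorted({item for t in db for item in t})
    index = {item: i for i, item in enumerate(vocab)}
    counts = {item: [0.0] * len(vocab) for item in vocab}
    for t in db:
        for a in t:
            for b in t:  # includes a == b, so no vector is all zeros
                counts[a][index[b]] += 1.0
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    return {item: unit(v) for item, v in counts.items()}

def transaction_vector(t, vecs):
    """A transaction ("sentence") is the mean of its item ("word") vectors."""
    dim = len(next(iter(vecs.values())))
    return [sum(vecs[i][d] for i in t) / len(t) for d in range(dim)]

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def vocabulary(db):
    return {item for t in db for item in t}

def item_entropy(db):
    """Shannon entropy (bits) of the item frequency distribution."""
    counts = Counter(item for t in db for item in t)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Hypothetical toy database: two groups of transactions with disjoint items.
db = [
    ("a", "b", "c"), ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y", "z"), ("x", "y"), ("y", "z"), ("x", "z"),
]

vecs = item_vectors(db)
points = [transaction_vector(t, vecs) for t in db]
labels = kmeans(points, k=2)
partitions = [[t for t, l in zip(db, labels) if l == j] for j in range(2)]

print("whole:", len(vocabulary(db)), round(item_entropy(db), 3))
for j, part in enumerate(partitions):
    print("partition", j, len(vocabulary(part)), round(item_entropy(part), 3))
```

On this toy input each partition covers only half of the vocabulary and has lower item entropy than the whole database, which is the effect the abstract attributes to informed partitions. Any frequent itemset miner could then be run on each partition independently.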

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3654987
