Efficient and tunable similar set retrieval

Published:2001-06 Issue:2 Volume:30 Page:247-258
ISSN:0163-5808
Container-title:ACM SIGMOD Record
language:en
Short-container-title:SIGMOD Rec.

Author:

Gionis Aristides¹, Gunopulos Dimitrios², Koudas Nick³

Affiliation:

1. Stanford University

2. University of California, Riverside

3. AT&T Laboratories

Abstract

Set value attributes are a concise and natural way to model complex data sets. Modern object-relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle, we create structures that make use of well-known and widely used data structuring techniques, as a means to ease integration with existing infrastructure. We show how the problem of indexing a collection of sets based on similarity can be reduced to the problem of indexing suitably encoded (in a way that preserves similarity) binary vectors in Hamming space, thus reducing the problem to one of similarity query processing in Hamming space. Then, we introduce and analyze two data structure primitives that we use in cooperation to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constraint optimization problems based on the space one is willing to devote to indexing. Finally, we present experimental results from a prototype implementation of our techniques using real-life datasets, exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme.
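
The reduction from set similarity to Hamming space mentioned in the abstract can be illustrated with a small sketch. The abstract does not spell out the encoding, so everything below is an assumption made for illustration: similarity is taken to be Jaccard similarity, sets are summarized with min-hash values (in the spirit of the min-wise permutations reference), and each min-hash value is reduced to its lowest bit. The function names are hypothetical and this is not presented as the paper's actual construction.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=512):
    """One min-hash value per seed; assumes `items` is non-empty."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def to_bits(signature):
    """Keep only the lowest bit of each min-hash value (a binary vector)."""
    return [value & 1 for value in signature]


def hamming(a, b):
    """Number of positions where two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard(a, b):
    """Exact Jaccard similarity, used here only as a reference value."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def estimate_jaccard(bits_a, bits_b):
    """Two bits agree with probability about (1 + J) / 2, so the
    normalized Hamming distance d is about (1 - J) / 2 and J ~ 1 - 2d."""
    d = hamming(bits_a, bits_b) / len(bits_a)
    return max(0.0, 1.0 - 2.0 * d)


if __name__ == "__main__":
    s1 = {str(i) for i in range(0, 30)}
    s2 = {str(i) for i in range(10, 40)}
    b1 = to_bits(minhash_signature(s1))
    b2 = to_bits(minhash_signature(s2))
    print("exact Jaccard:   ", jaccard(s1, s2))
    print("Hamming estimate:", estimate_jaccard(b1, b2))
```

Under these assumptions, sets with high Jaccard similarity map to bit vectors at small Hamming distance, so a similarity query over sets becomes a near-neighbor query over binary vectors. The estimate is approximate and its accuracy depends on the number of hash functions, which mirrors the probabilistic, space-tunable nature the abstract describes.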

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/376284.375689

References: 20 articles.

1. Min-wise independent permutations (extended abstract)

Cited by 4 articles.

1. Topology-Aware Hashing for Effective Control Flow Graph Similarity Analysis;Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering;2019

2. Efficient processing of probabilistic set-containment queries on uncertain set-valued data;Information Sciences;2012-08

3. Similarity search in sensor networks using semantic-based caching;Journal of Network and Computer Applications;2012-03

4. Similarity Search in Transaction Databases with a Two-Level Bounding Mechanism;Database Systems for Advanced Applications;2006