Towards a unified framework for string similarity joins

Published:2019-07 Issue:11 Volume:12 Page:1289-1302
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Xu Pengfei¹, Lu Jiaheng¹

Affiliation:

1. University of Helsinki, Finland

Abstract

A similarity join aims to find all similar pairs between two collections of records. Established algorithms utilise different similarity measures, either syntactic or semantic, to quantify the similarity between two records. However, when records are similar through a mixture of syntactic and semantic relations, a single measure is inadequate to disclose the real similarity between records and hence cannot produce high-quality join results. In this paper, we study a unified framework to find similar records by combining multiple similarity measures. To achieve this goal, we first develop a new similarity framework that unifies three existing kinds of similarity measures: syntactic (typographic) similarity, synonym-based similarity, and taxonomy-based similarity. We then prove that finding the maximum unified similarity between two strings is generally NP-hard, and furthermore develop an approximation algorithm which runs in polynomial time with a non-trivial approximation guarantee. To support efficient string joins based on our unified similarity measure, we adopt the filter-and-verification framework and propose a new signature structure, called pebble, which can be adapted to handle multiple similarity measures simultaneously. The salient feature of our approach is that it can judiciously select the best pebble signatures and overlap thresholds to maximise the filtering power. Extensive experiments show that our methods are capable of finding similar records having mixed types of similarity relations, while exhibiting high efficiency and scalability for similarity joins. The implementation can be downloaded at https://github.com/HY-UDBMS/AU-Join.
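To make the filter-and-verification idea in the abstract concrete, the following is a minimal sketch of a generic similarity join, not the paper's method. It assumes a purely syntactic measure (Jaccard similarity over character q-grams) and uses standard prefix filtering as the signature scheme in place of pebble signatures; synonym-based and taxonomy-based similarity are not modelled. The names `qgrams`, `jaccard` and `similarity_join` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from math import ceil


def qgrams(s, q=2):
    """Return the set of character q-grams of s (the whole string if len(s) <= q)."""
    s = s.lower()
    if len(s) <= q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(left, right, threshold=0.5, q=2):
    """Return sorted (i, j, sim) triples with Jaccard(left[i], right[j]) >= threshold.

    Assumption: this is textbook prefix filtering, not the pebble signatures
    proposed in the paper.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    left_sets = [qgrams(s, q) for s in left]
    right_sets = [qgrams(s, q) for s in right]

    # Global token order: rare q-grams first, ties broken alphabetically.
    freq = defaultdict(int)
    for tokens in left_sets + right_sets:
        for t in tokens:
            freq[t] += 1

    def prefix(tokens):
        # If two sets reach the threshold, their prefixes of this length
        # must share at least one token. The small epsilon guards against
        # floating-point error inflating the ceiling.
        ordered = sorted(tokens, key=lambda t: (freq[t], t))
        keep = len(ordered) - ceil(threshold * len(ordered) - 1e-9) + 1
        return ordered[:keep]

    # Filter step: inverted index over the prefix tokens of the right side.
    index = defaultdict(set)
    for j, tokens in enumerate(right_sets):
        for t in prefix(tokens):
            index[t].add(j)

    results = []
    for i, tokens in enumerate(left_sets):
        candidates = set()
        for t in prefix(tokens):
            candidates |= index.get(t, set())
        # Verification step: compute the exact similarity for candidates only.
        for j in candidates:
            sim = jaccard(tokens, right_sets[j])
            if sim >= threshold:
                results.append((i, j, sim))
    return sorted(results)


if __name__ == "__main__":
    left = ["similarity join", "string matching", "helsinki"]
    right = ["similarity joins", "string match", "vldb"]
    for i, j, sim in similarity_join(left, right, threshold=0.5):
        print(left[i], "<->", right[j], round(sim, 3))
```

The filter step only proposes pairs whose signatures overlap, so the exact similarity is computed for a small candidate set rather than for every pair. The paper's contribution, per the abstract, lies in choosing signatures and overlap thresholds that work across several similarity measures at once, which this sketch does not attempt.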

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3342263.3342268

Cited by 5 articles.

1. Embracing ambiguity: Improving similarity-oriented tasks with contextual synonym knowledge;Neurocomputing;2023-10

2. Towards open data discovery;Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing;2022-04-25

3. An efficient length-segmented inverted index-based set similarity query algorithm;International Journal of Computing Science and Mathematics;2022

4. Blocking and Filtering Techniques for Entity Resolution;ACM Computing Surveys;2021-03-31

5. Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join;Proceedings of the 28th ACM International Conference on Information and Knowledge Management;2019-11-03