karam gouda|Publications:An efficient string edit similarity join algorithm. Computing and Informatics 36,(3):683–704

You are in:Home/Publications/An efficient string edit similarity join algorithm. Computing and Informatics 36,(3):683–704 · June 2017
Prof. karam gouda :: Publications:

Title:	An efficient string edit similarity join algorithm. Computing and Informatics 36,(3):683–704 · June 2017
Authors:	Karam Gouda and Metwally Rashad
Year:	2017
Keywords:	String edit distance, string similarity join, trie-based approaches.
Journal:	Computing and Informatics
Volume:	36
Issue:	3
Pages:	683–704
Publisher:	Not Available
Local/International:	International
Paper Link:	Not Available
Full paper	karam gouda_Paper_CAI_2925.pdf
Supplementary materials	Not Available

Abstract:

String similarity join is a basic and essential operation in many applications. In this paper, we investigate the problem of string similarity join with edit distance constraints. A trie-based edit similarity join framework has been proposed recently. The main advantage of existing trie-based algorithms is support for similarity join on short strings. The main problem is when joining long and distant strings. These methods generate and maintain lots of similar prefixes called active nodes which need to be further removed in a subsequent pruning phase. With large edit distance, the number of active nodes becomes quite large. In this paper, we propose a new trie-based join algorithm called PreJoin, which improves upon current trie-based join methods. It efficiently finds all similar string pairs using a novel active-node generation method, which minimizes the number of generated active nodes by applying the pruning heuristics early in the generation process. The performance of PreJoin is scaled in two different ways: First, a dynamic reordering of the trie index is used to accelerate the search for similar string pairs. Second, a partitioning method of string space is used to improve performance on large edit distance thresholds. Experiments show that our approach is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art trie-based join approaches by a factor five.

Prof. karam gouda :: Publications: