On String Representation For Efficient Mining And Searching Over String Databse:


.

Metwally Rashad Metwally Omar

Author
Benha University
University
Faculty
2012
Publish Year

One of the most important primitive data types in modern data processing is strings.String data is ubiquitous, and its management has taken on particular importancein the past few years. An important research topic in string processing is stringsimilarity join. A string similarity join nds all similar pairs between two collectionsof strings. It is an essential operation in many applications, such as data integra-tion and cleaning, web page detection, and pattern recognition. Many similarityfunctions have been proposed to quantify the similarity between two strings. In thisthesis, we study string similarity joins with edit distance constraints. There existmany algorithms to support string similarity join with edit distance constraint, suchas All-Pairs-Ed, PPJoin, and ED-Join. Most of these algorithms follows the lterand veri cation paradigm, where indexes are used to lter many of the unpromisingstring pairs and generate candidate pairs, then these candidates are veri ed to out-put the nal result. Recently, a veri cation-free, trie-based similarity join frameworkis proposed. It uses a trie structure to index all strings in a given dataset. Though,this framework has shown many advantages over the lter-and-re ne framework, themain problem with current trie-based join algorithms is that they generate and main-tain lots of candidate pre xes called active-nodes which need to be further removed.With large edit distance, active-node sets becomes large as well. To overcome thisproblem, they use many pruning techniques to remove false positive pre xes ina subsequent phase. This makes the existing algorithms inecient for processingvvilarge data sets with long strings, and higher edit distance threshold. In this thesis,we propose a new trie-based join algorithm called PreJoin, which improves uponcurrent trie-based join methods. It eciently nds all similar string pairs using anew active-node set generation method, and a dynamic preorder traversal of thetrie index. We improve PreJoin in order to be suitable to deal with large edit dis-tance. This improvement called PreJoin-Plus algorithm. Experiments show thatour approaches is highly ecient for processing short as well as long strings, andoutperforms the state-of-the-art trie-based join approaches by a factor of ve. 

Abstract
Attachments


Seacrch again