Moufeda Hussein Gaber|Publications:Keyphrase-Based Hierarchical Clustering for Arabic Documents

Ass. Lect. Moufeda Hussein Gaber :: Publications:

Title:	Keyphrase-Based Hierarchical Clustering for Arabic Documents
Authors:	M Hussein; A AlSammak; T ElShishtawy
Year:	2016
Keywords:	Agglomerative Hierarchical document clustering; Keyphrase; Lemma; Lexical similarity; Semantic similarity
Journal:	Not Available
Volume:	Not Available
Issue:	Not Available
Pages:	7
Publisher:	Not Available
Local/International:	International
Paper Link:	Not Available
Full paper:	Moufeda Hussein Gaber_1570251821.pdf
Supplementary materials:	Not Available

Abstract:

The vast number of Arabic web pages and text files available on the internet makes it necessary to organize this data so that users can browse it easily. Document clustering addresses this problem by grouping similar documents into clusters with meaningful labels. In this paper, we propose a domain-independent approach that builds a meaningful hierarchical clustering tree. The proposed approach overcomes the high dimensionality of the feature vector by representing each document by its keyphrases. In addition, we introduce a new similarity measure based on the keyphrases, in lemma form, that the feature vectors of two documents have in common. This improves clustering accuracy while reducing complexity. Several experiments are carried out to evaluate clustering accuracy using string-based, corpus-based, and knowledge-based similarity measures. A dataset of 345 Arabic documents covering 12 domains is used in these experiments. The results show that keyphrase-based lexical similarity gives more accurate cluster labels than semantic similarity. The best purity achieved is 0.955, obtained with the common lemma-form keyphrase similarity method.
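
The pipeline described in the abstract (keyphrase feature vectors, a similarity based on shared lemma-form keyphrases, agglomerative clustering, purity evaluation) can be illustrated with a minimal sketch. The abstract does not give the exact formula, linkage criterion, or stopping rule, so the Jaccard overlap, average linkage, and the `threshold` parameter below are assumptions made for illustration, and the English keyphrases are hypothetical stand-ins for lemmatized Arabic keyphrases.

```python
from collections import Counter

def common_keyphrase_similarity(a, b):
    """Share of lemma-form keyphrases two documents have in common.
    Assumption: Jaccard overlap; the paper's exact formula may differ."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def agglomerative(docs, threshold=0.2):
    """Bottom-up clustering with average linkage (an assumed criterion).
    Merging stops when no pair of clusters reaches `threshold`."""
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        sims = [common_keyphrase_similarity(docs[i], docs[j])
                for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = link(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / total

# Hypothetical documents, each reduced to its set of lemmatized keyphrases.
docs = [
    {"football", "match", "league"},
    {"football", "league", "coach"},
    {"stock market", "inflation", "bank"},
    {"bank", "inflation", "interest rate"},
]
labels = ["sport", "sport", "economy", "economy"]

clusters = agglomerative(docs)
score = purity(clusters, labels)
```

Representing each document by a handful of keyphrases keeps the pairwise comparison to small set operations, which is one way to read the abstract's claim of reduced complexity compared with full term vectors.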
