The vast number of Arabic web pages and text files available on the internet makes it necessary to organize this data so that users can browse it easily. Document clustering, which groups similar documents into clusters with meaningful labels, is a good solution to this problem. In this paper, we propose a domain-independent approach that builds a meaningful hierarchical clustering tree. The proposed approach overcomes the high dimensionality of the feature vector by representing each document by its keyphrases. In addition, we introduce a new similarity measure based on the lemma-form keyphrases that the feature vectors of two documents have in common. This improves the accuracy of the clustering process while reducing its complexity. Several experiments are carried out to evaluate clustering accuracy using String-based, Corpus-based, and Knowledge-based similarity measures. A dataset of 345 Arabic documents covering 12 domains is used in these experiments. The results show that lexical similarity based on keyphrases gives more accurate cluster labels than semantic similarity. The best purity achieved is 0.955, obtained using the common lemma form keyphrases similarity method.
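
To make the two quantities named above concrete, the following is a minimal Python sketch, not the implementation used in the paper. It assumes each document has already been reduced to a list of lemma-form keyphrases. The function names, the transliterated sample keyphrases, and the choice to normalize the overlap by the smaller keyphrase set are all illustrative assumptions; only the purity formula follows the standard definition (the sum of each cluster's majority-class count, divided by the number of documents).

```python
from collections import Counter


def common_keyphrase_similarity(doc_a, doc_b):
    """Share of lemma-form keyphrases two documents have in common.

    Assumption: the overlap is normalized by the smaller keyphrase set;
    the source does not specify a normalization.
    """
    a, b = set(doc_a), set(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def purity(cluster_ids, true_labels):
    """Standard purity: majority-class count summed over clusters, over N."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        return 0.0
    # Count how often each true label occurs inside each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical, transliterated lemma-form keyphrases for two documents.
doc_1 = ["iqtisad", "bank", "suq"]
doc_2 = ["bank", "suq", "riyada", "kura"]
similarity = common_keyphrase_similarity(doc_1, doc_2)

# Hypothetical cluster assignments and domain labels for five documents.
score = purity([0, 0, 0, 1, 1], ["econ", "econ", "sport", "sport", "sport"])
```

Because each document is represented by a short keyphrase set rather than a full term vector, a pairwise comparison of this kind reduces to a set intersection, which is one way to read the reduced complexity mentioned above.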