Textual similarity is one of the most important aspects of information retrieval. This paper proposes several techniques for
measuring semantic textual similarity and examines the factors that influence them. Two hybrid approaches for measuring the
degree of similarity between two Arabic text snippets are presented. The first approach combines word-based and vector-based
similarity methods to construct a semantic word space for each word of the input text. These words are represented in
their lemma forms to capture all semantically related words. The semantic word spaces are then used to find the best matching
between the words of the two input texts, from which the degree of similarity between the two snippets is computed.
The second approach combines semantic and syntactic methods. Its core is the basic Levenshtein edit distance, modified to
measure the edit cost at the token level rather than at the character level. The semantic word spaces are incorporated into
this approach so that semantic features complement the syntactic ones. Additional techniques are embedded to overcome
problems of the syntactic approach, such as its sensitivity to word order.
The Pearson correlation coefficient is used to measure the correctness of the two proposed approaches against two benchmark
datasets. In the experiments, the two approaches achieved correlations of 0.7212 and 0.7589 on two different datasets.
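
To make the token-level edit distance of the second approach concrete, the following is a minimal sketch and not the authors' implementation. It assumes the semantic word spaces are available as a plain dictionary mapping a lemma to a set of related lemmas (a hypothetical stand-in for the word spaces described above), charges zero cost for substituting related tokens and unit cost otherwise, and normalizes by the longer text. The names `token_edit_distance`, `similarity`, and `space` are illustrative, English tokens stand in for Arabic lemmas, and the word-order handling mentioned above is not modeled.

```python
from typing import Dict, List, Set


def token_edit_distance(a: List[str], b: List[str],
                        space: Dict[str, Set[str]]) -> int:
    """Levenshtein distance computed over tokens instead of characters.

    Assumption: substituting two tokens costs 0 when they are identical or
    when one appears in the other's (hypothetical) semantic word space,
    and 1 otherwise. Insertions and deletions always cost 1.
    """

    def related(x: str, y: str) -> bool:
        return x == y or y in space.get(x, set()) or x in space.get(y, set())

    # prev[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if related(tok_a, tok_b) else 1)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def similarity(a: List[str], b: List[str],
               space: Dict[str, Set[str]]) -> float:
    """Map the distance to [0, 1] by dividing by the longer token count."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - token_edit_distance(a, b, space) / longest


if __name__ == "__main__":
    # Toy word space; a real one would be built from lexical resources
    # and word vectors over lemmatized Arabic text.
    space = {"boy": {"child"}, "food": {"meal"}}
    s1 = ["the", "boy", "ate", "food"]
    s2 = ["the", "child", "ate", "a", "meal"]
    print(token_edit_distance(s1, s2, space))
    print(similarity(s1, s2, space))
```

In this toy example the only edit with a cost is the insertion of the extra token, since the two substitutions are between related words. A graded substitution cost, for instance one derived from vector similarity, would be a natural variation on the binary cost assumed here.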