An Arabic corpus is the infrastructure for computerized Arabic language applications. The present work is motivated by the lack of a standard corpus for the modern Arabic language in current use. The proposed corpus design methodology is implemented and tested on a sample of 1.25 million words drawn from a 20 million word corpus. The corpus data are morphologically analyzed to decompose words into their basic constituents, supplemented with their linguistic information. The 20 million word corpus is made up of texts from different sources: books, newspapers, magazines, technical reports, research theses, and leaflets covering a wide range of subject areas are included to represent a broad spectrum of currently used Arabic.
The output of the corpus consists of statistical and collocation information, together with the concordance lines retrieved for one or more words.
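To make the two kinds of output concrete, the following is a minimal sketch of concordance retrieval and collocation counting. It is not the system described here: the function names, the whitespace tokenizer, the two-token window, and the English placeholder sentence are all assumptions chosen for illustration, and a real Arabic pipeline would work on morphologically analyzed tokens instead.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace (assumption: stands in for morphological analysis)."""
    return text.split()


def concordance(tokens, keyword, window=2):
    """Return keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words that occur within `window` tokens of each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Hypothetical placeholder text, not taken from the corpus.
sample = "the corpus contains books and the corpus contains newspapers"
tokens = tokenize(sample)
kwic = concordance(tokens, "corpus")
freq = collocates(tokens, "corpus")
```

Here `kwic` holds one tuple per occurrence of the keyword, and `freq` maps each neighbouring word to its co-occurrence count. Raw counts are only the simplest collocation statistic; association measures could be layered on top of the same counts.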