Abstract
Forensic authorship authentication plays a vital role in identifying the growing number of unknown authors that has accompanied the rapid worldwide increase in internet usage. Authorship authentication is the process in which a linguist attempts to identify the author of an anonymous text based on the vocabulary used and the writer's linguistic style. Most existing studies of forensic authorship analysis focus on the English language, while research on the Arabic language is rare. In this research, we present a new methodology that enhances forensic authorship analysis for the Arabic language.

This thesis presents a two-level learning classifier for authorship authentication. The learning system is supplied with linguistic, statistical, and vocabulary features to enhance its efficiency, instead of relying on only one type of feature. The linguistic knowledge is represented through lexical analysis features of the unknown text and of the authors' previous texts. The basic idea of the first classifier is to extract the unique vocabulary terms that identify an author and to use them to recognize unknown authors. To this end, a modified Term Frequency-Inverse Document Frequency (modified TF-IDF) method is proposed, which is a modification of the traditional TF-IDF method. Our approach is tested on a large dataset of texts belonging to different political groups. The first classifier relies only on the vocabulary words used by each political group. The experimental results show that the average accuracy for recognizing groups increases from 89.33% with the traditional TF-IDF to 92% with the proposed modified TF-IDF method. Further improvement is achieved when the vocabulary terms are represented in their Arabic lemma form rather than their root form; the results show that the accuracy improves from 89.33% to 92%. This approach is also tested on another dataset of Arabic articles and achieves an accuracy of 92% based on vocabulary words.

To get the best predictive performance for identifying authorship, the first classifier, which is based on vocabulary features that detect the weights of frequently used terms, feeds its results to a second machine learning classifier. This second classifier depends on statistical and linguistic features as well as the vocabulary knowledge of the first classifier. All sets of features describe the author's writing style in numerical form. The proposed two-level classifier for identifying authorship shows better performance: the experiments show that the trained two-level classifier improves accuracy, with results ranging from 94% to 96.16%.
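The two-level idea described in the abstract can be illustrated with a minimal sketch. It uses the traditional TF-IDF weighting only, because the abstract does not specify the modified TF-IDF formula. The function names (`tfidf_profiles`, `first_level_scores`, `second_level_features`), the whitespace tokenizer, the two statistical features, and the toy English data are all assumptions made for illustration; they are not the thesis's implementation or dataset.

```python
import math
from collections import Counter


def tokenize(text):
    # Plain whitespace tokenization (an assumption). The thesis works with
    # Arabic lemmas, which would need a morphological analyzer not shown here.
    return text.lower().split()


def tfidf_profiles(group_texts):
    """Return {group: {term: weight}} using traditional TF-IDF,
    treating each group's combined text as one document."""
    counts = {g: Counter(tokenize(t)) for g, t in group_texts.items()}
    n_groups = len(counts)
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())  # number of groups containing each term
    profiles = {}
    for g, c in counts.items():
        total = sum(c.values())
        profiles[g] = {
            term: (n / total) * math.log(n_groups / doc_freq[term])
            for term, n in c.items()
        }
    return profiles


def first_level_scores(text, profiles):
    """First level: sum each group's TF-IDF weights over the text's tokens."""
    tokens = tokenize(text)
    return {g: sum(w.get(tok, 0.0) for tok in tokens) for g, w in profiles.items()}


def second_level_features(text, profiles):
    """Second-level input: first-level scores plus two simple statistical
    features (average word length, type-token ratio), chosen here as examples."""
    tokens = tokenize(text)
    n = max(len(tokens), 1)  # avoid division by zero on empty text
    scores = first_level_scores(text, profiles)
    avg_word_len = sum(len(t) for t in tokens) / n
    type_token_ratio = len(set(tokens)) / n
    return [scores[g] for g in sorted(scores)] + [avg_word_len, type_token_ratio]


# Toy placeholder data (hypothetical, not the thesis dataset).
groups = {
    "group_a": "reform budget reform tax policy",
    "group_b": "march protest rights protest policy",
}
profiles = tfidf_profiles(groups)
unknown = "tax reform now"
scores = first_level_scores(unknown, profiles)
predicted = max(scores, key=scores.get)
features = second_level_features(unknown, profiles)
```

In this sketch, a term shared by every group (here `policy`) receives zero weight, so only group-specific vocabulary contributes to the first-level score. The `features` vector is what a second, trained classifier would consume; training that classifier is left out because the abstract does not name the learning algorithm.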