An Anti-Spam System using Artificial Neural Networks and Genetic Algorithms
Conference paper
E-mail has become one of the fastest and most economical forms of communication, which also makes it prone to misuse. One such misuse is the sending of unsolicited, unwanted e-mails, known as spam or junk e-mails. This paper presents and discusses an implementation of an anti-spam filtering system that uses a Multi-Layer Perceptron (MLP) as a classifier and a Genetic Algorithm (GA) as the training algorithm. Standard genetic operators and advanced GA techniques are used to train the MLP. The implemented filtering system achieved an accuracy of about 94% in detecting spam e-mails and 89% in detecting legitimate e-mails.
Abduelbaset Mustafa Alia Goweder, (12-2008), University of Sfax, Sfax, Tunisia: Proceedings of ACIT2008, 177-185
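The idea of training an MLP with a GA instead of backpropagation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy two-feature data set, the network size, and the specific operators (tournament selection, one-point crossover, Gaussian mutation, elitism). Each individual in the population is simply the flat list of network weights, and its fitness is the negative mean squared error on the training data.

```python
import math
import random

# Hypothetical toy data: (feature vector, label), 1 = spam, 0 = legitimate.
DATA = [
    ([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.8, 0.1], 1),
    ([0.1, 0.9], 0), ([0.2, 1.0], 0), ([0.0, 0.8], 0),
]
N_IN, N_HIDDEN = 2, 3
# Each hidden unit has N_IN weights plus a bias; the output unit has
# N_HIDDEN weights plus a bias.
N_WEIGHTS = N_HIDDEN * (N_IN + 1) + (N_HIDDEN + 1)


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Forward pass of a one-hidden-layer MLP stored as a flat weight list."""
    idx = 0
    hidden = []
    for _ in range(N_HIDDEN):
        z = weights[idx + N_IN]  # bias term
        for i in range(N_IN):
            z += weights[idx + i] * x[i]
        hidden.append(sigmoid(z))
        idx += N_IN + 1
    z = weights[idx + N_HIDDEN]  # output bias
    for j in range(N_HIDDEN):
        z += weights[idx + j] * hidden[j]
    return sigmoid(z)


def fitness(weights):
    """Negative mean squared error, so higher is better."""
    err = sum((predict(weights, x) - y) ** 2 for x, y in DATA)
    return -err / len(DATA)


def tournament(pop, rng, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return max(rng.sample(pop, k), key=fitness)


def crossover(a, b, rng):
    """One-point crossover on the flat weight lists."""
    cut = rng.randrange(1, N_WEIGHTS)
    return a[:cut] + b[cut:]


def mutate(w, rng, rate=0.1, sigma=0.3):
    """Add Gaussian noise to each weight with probability `rate`."""
    return [wi + rng.gauss(0.0, sigma) if rng.random() < rate else wi
            for wi in w]


def train(generations=40, pop_size=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        children = [best]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng))
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = train()
    print("initial fitness:", history[0], "final fitness:", history[-1])
```

Because the best individual is copied unchanged into every new generation, the best fitness in this sketch never decreases. A real filter would replace the toy data with feature vectors extracted from e-mails.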
Arabic Broken Plural using a Machine Translation Technique
Conference paper
The Arabic language presents significant challenges to many natural language processing applications. The broken plurals (BP) problem is one of these challenges, especially for information retrieval applications. It is difficult to reduce Arabic broken plurals to their associated singulars because no obvious rules exist and no standard stemming algorithm can process them. This paper addresses the problem by developing a method that identifies broken plurals in unvowelised Arabic text and reduces them to their correct singular forms. The new approach combines the simple broken plural matching approach with a machine translation system and an English stemmer. A set of experiments was conducted to evaluate the performance of the proposed method using a number of text samples extracted from a large Arabic corpus (Al-Hayat newspaper). The obtained results are analyzed and discussed.
Abduelbaset Mustafa Alia Goweder, (12-2008), University of Sfax, Sfax, Tunisia: Proceedings of ACIT2008, 64-71
A Hybrid Method for Stemming Arabic Text
Conference paper
Several stemming approaches have been applied to the Arabic language, yet no complete stemmer for this language is available. The existing stem-based stemmers for Arabic text perform poorly in terms of accuracy and error rates. To improve stemming accuracy, a hybrid method is proposed for stemming Arabic text to produce stems (not roots). Improved stemming accuracy is expected to greatly benefit many applications, including information retrieval, document classification, machine translation, text analysis and text compression. The proposed method integrates three different stemming techniques: morphological analysis, affix removal and dictionaries.
Abduelbaset Mustafa Alia Goweder, (12-2008), University of Sfax, Sfax, Tunisia: Proceedings of ACIT2008, 125-132
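Two of the three ingredients named in the abstract, a dictionary lookup and affix removal, can be sketched in a few lines. This is only a generic illustration and not the paper's method: the affix lists, the `min_len` threshold and the `lexicon` parameter are all assumptions, and the morphological analysis component is omitted entirely.

```python
# Illustrative affix lists (an assumption, not the lists used in the paper).
# Longer prefixes come first so that the longest match is tried first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]


def light_stem(word, lexicon=frozenset(), min_len=3):
    """Return a stem (not a root) for a single unvowelised Arabic word.

    `lexicon` is a hypothetical set of words that must never be stemmed;
    it stands in for the dictionary component of a hybrid stemmer.
    """
    # Dictionary step: protected words are returned unchanged.
    if word in lexicon:
        return word
    # Affix-removal step: strip at most one prefix, then at most one suffix,
    # but never shorten the word below `min_len` characters.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    print(light_stem("المعلمون"))
```

In this sketch the regular plural "المعلمون" loses the definite article and the plural suffix, while a word placed in `lexicon` is left untouched. Broken plurals are not handled by affix stripping of this kind, which is the gap that the dictionary and morphological components of a hybrid approach are meant to fill.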
Identifying Broken Plurals in Unvowelised Arabic Text
Conference paper
Irregular (so-called broken) plural identification in Modern Standard Arabic is a problematic issue for information retrieval (IR) and language engineering applications, but the effect of broken plurals on the performance of IR has never been examined. Broken plurals (BPs) are formed by altering the singular (as in English: tooth → teeth) through the application of interdigitating patterns on stems, and singular words cannot be recovered by standard affix-stripping stemming techniques. We developed several methods for BP detection and evaluated them using an unseen test set. We incorporated the BP detection component into a new light-stemming algorithm that conflates both regular and broken plurals with their singular forms. We also evaluated the new light-stemming algorithm within the context of information retrieval, comparing its performance with other stemming algorithms.
Abduelbaset Mustafa Alia Goweder, (07-2004), Barcelona, Spain: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 246-253
Broken Plural Detection for Arabic Information Retrieval
Conference paper
Due to the high number of inflectional variations of Arabic words, empirical results suggest that stemming is essential for Arabic information retrieval. However, current light stemming algorithms do not extract the correct stem of irregular (so-called broken) plurals, which constitute ~10% of Arabic texts and ~41% of plurals. Although light stemming in particular has led to improvements in information retrieval [5, 6], the effects of broken plurals on the performance of information retrieval systems have not been examined. We propose a light stemmer that incorporates a broken plural recognition component, and evaluate it within the context of information retrieval. Our results show that identifying broken plurals and reducing them to their correct stems results in a significant improvement in the performance of information retrieval systems.
Abduelbaset Mustafa Alia Goweder, (07-2004), The University of Sheffield, UK: The 27th Annual International ACM SIGIR Conference, 566-567
Assessment of a Significant Arabic Corpus
Conference paper
The development of Language Engineering and Information Retrieval applications for Arabic requires the availability of sizeable, reliable corpora of modern Arabic text. These are not routinely available. This paper describes how we constructed an 18.5 million word corpus from Al-Hayat newspaper text, with articles tagged as belonging to one of 7 domains. We outline the profile of the data and how we assessed its representativeness. The literature suggests that the statistical profile of Arabic text is significantly different from that of English in ways that might affect the applicability of standard techniques. The corpus allowed us to verify a collection of experiments which had, so far, only been conducted on small, manually collected datasets. We draw some comparisons with English and conclude that there is evidence that Arabic data is much sparser than English for the same data size.
Abduelbaset Mustafa Alia Goweder, (08-2001), Toulouse, France: Proceedings of ACL 2001, 71-78