Author: Dakrory, Sara Bahaa Eldien Abdellateif./ Title: Address Extraction from the Web /


Title

Address Extraction from the Web /

Author

Dakrory, Sara Bahaa Eldien Abdellateif.

Preparation Committee

Researcher / Sara Bahaa Eldien Abdellateif Dakrory

Supervisor / Abdelmageed Amin Ali

Supervisor / Mohamed Sayed Kayed

Supervisor / Bahgat Abdelhamid Abdellateif


Publication Date

2023.

Number of Pages

97 p. :

Language

English

Degree

Doctorate

Specialization

Computer Science

Date of Approval

1/1/2023

Place of Approval

Minia University - Faculty of Science - Computer Science


Abstract

Geographical Address Extraction (GAE) is a specialized form of Named Entity Recognition (NER), the task of extracting and recognizing entities such as people's names, organizations, and locations. GAE refines the granularity of the location entity by extending extraction to include the postal code, phone number, building number, city, state, region, and district. However, disambiguating polysemous words remains a distinctive challenge in natural language processing. Because online data is abundant, diverse, and rapidly changing, disambiguation systems with minimal dependency on linguistic resources must be developed.
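To make the fine-grained labelling concrete, the sketch below converts labelled token spans into per-token tags. It assumes the common BIO tagging scheme, which the abstract does not state; the label names (`BUILDING_NUMBER`, `DISTRICT`, `CITY`, `POSTAL_CODE`), the helper `spans_to_bio`, and the English sample sentence are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


# Hypothetical example (not taken from the thesis dataset).
tokens = ["Our", "branch", ":", "12", "New", "Maadi", ",", "Cairo", "11742"]
spans = [
    (3, 4, "BUILDING_NUMBER"),
    (4, 6, "DISTRICT"),
    (7, 8, "CITY"),
    (8, 9, "POSTAL_CODE"),
]
bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Under this scheme, a sequence tagger predicts one tag per token, and multi-token entities such as the district are recovered by joining a `B-` tag with the `I-` tags that follow it.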
To the best of our knowledge, this is the first work that aims to extract geographical address entities in the Arabic language using deep learning. In this thesis, we present a comprehensive survey that reviews and compares previous approaches to address extraction from the Web. Further, because Arabic corpora focused on the GAE task are lacking, we built our own dataset for training and testing the developed models. The dataset focuses on Arabic geographical addresses in Egypt collected from a social networking website: the addresses were captured from different public Facebook pages covering the addresses of different Points of Interest (POIs). Consequently, it provides many diverse patterns to consider when analyzing geographical addresses.
Recently, Recurrent Neural Networks (RNNs) have achieved state-of-the-art performance in many Natural Language Processing (NLP) tasks without the need for manually created features. Additionally, transfer learning has proven effective in many NLP tasks by using pre-trained language models to transfer knowledge acquired from massive datasets to domain-specific tasks. In this thesis, we investigate whether RNN models and transfer learning models can learn contextual neural encodings of geographical addresses without leveraging external linguistic resources such as knowledge bases or gazetteers. Four approaches are introduced for this purpose. Two RNN models are employed for the Arabic address extraction task: the first is the Bidirectional Long Short-Term Memory network (BILSTM), and the second is the Gated Recurrent Unit network (GRU). Both models are also tested after adding a Conditional Random Fields (CRF) tagging layer and two pre-trained word embeddings. In addition, the transformer-based AraBERT contextual model is employed to recognize and classify Arabic geographical addresses. Finally, we test the effectiveness of combining the pre-trained AraBERT model with a CRF layer to improve the model's comprehension of Arabic geographic text.
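The role of a CRF tagging layer on top of an encoder can be illustrated with a minimal Viterbi decoder. This is a generic sketch of linear-chain decoding, not the thesis implementation: the function `viterbi_decode`, the three labels, and all scores are hypothetical, and the emission scores stand in for the per-token outputs of an encoder such as a BILSTM, GRU, or AraBERT.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y]     : encoder score for label y at position t
    transitions[(a, b)] : score for label a followed by label b
                          (pairs not listed score 0.0)
    """
    if not emissions:
        return []
    # best[t][y] = best score of any path that ends with label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores: taken token by token, position 1 would prefer I-CITY
# right after O, which is not a valid BIO sequence.
labels = ["O", "B-CITY", "I-CITY"]
transitions = {("O", "I-CITY"): -10.0}  # penalize an entity that starts with I-
emissions = [
    {"O": 2.0, "B-CITY": 0.1, "I-CITY": 0.0},
    {"O": 0.5, "B-CITY": 0.4, "I-CITY": 1.0},
    {"O": 0.2, "B-CITY": 0.1, "I-CITY": 1.5},
]
decoded = viterbi_decode(emissions, transitions, labels)
print(decoded)
```

Because the transition scores are considered jointly with the emission scores, the decoder selects a globally consistent tag sequence rather than the best tag for each token in isolation. In a trained CRF layer the transition scores are learned rather than set by hand as they are here.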
The experimental evaluation showed that the BI-GRU-CRF model with FastText word embeddings, the AraBERT model, and the AraBERT-CRF model achieved the highest accuracy, with values of 0.96, 0.95, and 0.96, respectively. The best overall performance on our dataset was achieved by the AraBERT-CRF model, with a precision of 0.96, a recall of 0.98, and an F1-score of 0.96.
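For readers unfamiliar with the reported metrics, the sketch below computes precision, recall, and F1-score from sets of gold and predicted entities. It assumes exact-match, entity-level scoring; the abstract does not say whether the thesis scores at the token or entity level, or how scores are averaged across labels. The function name and the sample entities are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end_exclusive, label)


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two sets of entities."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the predicted DISTRICT span is one token too short,
# so it does not count as a match, and POSTAL_CODE is missed entirely.
gold = {(3, 4, "BUILDING_NUMBER"), (4, 6, "DISTRICT"), (7, 8, "CITY"), (8, 9, "POSTAL_CODE")}
predicted = {(3, 4, "BUILDING_NUMBER"), (4, 5, "DISTRICT"), (7, 8, "CITY")}
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision is the share of predicted entities that are correct, recall is the share of gold entities that were found, and F1 is their harmonic mean.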