RIST

Revue d'Information Scientifique et Technique

Volume 27, Number 02: Editorial

Volume 27, issue 2 (2023) of the Information Processing at the Digital Age Journal is a special issue that publishes the papers of the NLP challenge held at CERIST on March 29, 2023. This challenge was the first of its kind in Algeria and aimed to promote NLP and bring together researchers and students from NLP research teams. Two main tasks were proposed: Opinion Mining and Sentiment Analysis, and Information Retrieval. We received 23 papers, 11 of which were selected for participation in the challenge day.
While organizing this challenge, we noticed great interest in the first task, especially subtask 1.c, Arabic Sentiment Analysis and Fake News Detection within Covid-19, and subtask 1.d, Arabic Hate Speech and Offensive Language Detection on Social Networks. Hence, this issue contains five papers addressing subtask 1.c, five papers addressing subtask 1.d, and one paper addressing subtask 1.b, Multilingual Sentiment Analysis in Twitter.

Download: PDF

Compact CNN-Based Architecture for Text Classification and Sentiment Analysis

In the last decade, the growing role of social media and the internet in people's lives has raised new challenges that modern AI needs to deal with. Textual data is generated every time an article is published, an online post is shared, or even a simple comment is made. Among these challenges is text classification, which uses AI methods to identify the general meaning of a set of words. This paper presents our participation in the CERIST Natural Language Processing Challenge, where we proposed a simple yet effective convolutional neural network architecture that can be used for text classification and sentiment analysis. We tested our proposal on five different tweet datasets (Hate Speech, Fake News, Arabic Covid Sentiment, Arabic Sentiment, and English Sentiment) and obtained accuracies of 99.85%, 99.86%, 99.58%, 97.97%, and 95.65%, respectively, on the training subset and 98.43%, 94.74%, 87.53%, 54.90%, and 60.62% on the validation subset.

Authors: Zoubir TALAI, Nada KHERICI

Download: PDF

A logistic regression algorithm for Arabic hate speech detection

Arabic is one of the most popular languages and is widely used on social media networks. During the pandemic, the spread of fake news, rumors, hate speech, and spam increased dramatically, which makes detecting the sources of misinformation very important and very helpful for controlling the situation. Many Arabic natural language processing (ANLP) works have been proposed in the literature to solve such problems. In this paper, we propose a time-efficient algorithm with high precision and accuracy for Arabic hate speech detection.
A classical Machine Learning (ML) logistic regression algorithm is used in this ANLP work to detect hate speech. The data were collected from Twitter during the COVID-19 pandemic; we use 80% of the data to train our algorithm and 20% to test it. The proposed algorithm achieves high accuracy and precision on the tested comments (a precision of 88.77% and an accuracy of 98.48%). This work shows that classical ML algorithms perform well on such problems.
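The abstract reports precision and accuracy side by side, and on imbalanced data such as hate speech corpora the two can differ noticeably. The following is a minimal sketch of how both metrics follow from the same predictions. The toy labels and the function names `confusion_counts`, `accuracy`, and `precision` are illustrative assumptions, not the paper's data or code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred, positive=1):
    """Fraction of all samples that were classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical imbalanced test set: 10 hateful (1) and 90 non-hateful (0) tweets.
y_true = [1] * 10 + [0] * 90
# Hypothetical predictions: 8 true positives, 2 misses, 2 false alarms, 88 correct negatives.
y_pred = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 88

print(f"accuracy  = {accuracy(y_true, y_pred):.2f}")   # accuracy  = 0.96
print(f"precision = {precision(y_true, y_pred):.2f}")  # precision = 0.80
```

Because the majority class dominates the accuracy figure, a few false alarms barely move accuracy while lowering precision substantially.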

Authors: Abdelmounim Sellidj

Download: PDF

Modeling Sentiment Analysis Using Machine Learning Algorithms for Arabic covid-19 Tweets

During the Covid-19 pandemic, people worldwide turned to social media networks to express their opinions and general feelings. Social media platforms like Twitter have become widespread tools for broadcasting and distributing news and opinions. This paper presents our participation in the CERIST Natural Language Processing Challenge, task 1.c: Arabic sentiment analysis and fake news detection within Covid-19. The complexity of this task increases further when dealing with dialects that do not have the structure of Modern Standard Arabic (MSA). We introduce an experiment on sentiment analysis of Arabic tweets about Covid-19 using machine learning algorithms. The Arabic dataset was provided by the challenge organizers and contains 4,128 tweets labeled as Positive, Negative, or Neutral for training and 1,034 unlabeled tweets for testing (Hadj Ameur & Aliane, 2021). In this experiment, the opinions are classified by various machine learning classifiers, including Support Vector Machine (SVM), Logistic Regression (LR), Multinomial Naïve Bayes (NB), and K-Nearest Neighbors (KNN). The experimental results indicate that the highest accuracy (94%) was obtained using Logistic Regression and SVM, with training times of 8609 s.

Authors: Yousra F.G.Elhakeem, Safa EltayebMohammed Aldawsari, Omer Salih Dawood Omer

Download: PDF

Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic language

This paper describes our participation in the shared task of hate speech detection, which is one of the subtasks of the CERIST NLP Challenge 2022. Our experiments evaluate the performance of six transformer models and their combination using two ensemble approaches. The best results on the training set, in a five-fold cross-validation scenario, were obtained with the ensemble approach based on majority voting. Evaluating this approach on the test set resulted in an F1-score of 0.60 and an accuracy of 0.86.
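Majority voting, the ensemble approach mentioned above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three hypothetical models, their predictions, the function name `majority_vote`, and the tie-breaking rule (the earliest model in the list wins a tie) are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    predictions_per_model: list of equal-length lists, one list per model.
    Ties are broken in favour of the earliest model in the list (an assumption).
    """
    n_samples = len(predictions_per_model[0])
    if any(len(preds) != n_samples for preds in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")

    voted = []
    for labels in zip(*predictions_per_model):  # labels for one sample, in model order
        counts = Counter(labels)
        top = max(counts.values())
        # First label (in model order) that reaches the highest vote count.
        voted.append(next(label for label in labels if counts[label] == top))
    return voted


# Hypothetical predictions from three models on four tweets.
model_a = ["hate", "not", "not", "hate"]
model_b = ["hate", "hate", "not", "not"]
model_c = ["not", "hate", "not", "not"]

print(majority_vote([model_a, model_b, model_c]))
# ['hate', 'hate', 'not', 'not']
```

Using an odd number of models for a binary task avoids ties altogether, so the tie-breaking rule only matters for even-sized or multi-class ensembles.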

Authors: Angel Felipe Magnossão de Paula, Imene Bensalem, Paolo Rosso, Wajdi Zaghouani

Download: PDF

Classifying Arabic covid-19 related tweets for fake news detection and sentiment analysis with BERT-based models

This paper describes the participation of our team “techno” in the CERIST Natural Language Processing Challenge. We used an available dataset for task 1.c: Arabic sentiment analysis and fake news detection within Covid-19. It comprises 4,128 tweets for the sentiment analysis task and 8,661 tweets for the fake news detection task. We combined natural language processing tools with BERT (Bidirectional Encoder Representations from Transformers), one of the most renowned pre-trained language models. The results show the efficacy of pre-trained language models: we attained an accuracy of 0.93 for the sentiment analysis task and 0.90 for the fake news detection task.

Authors: Rabia Bounaama, Mohammed El Amine Abderrahim

Download: PDF

Arabic Hate speech and social networks offensive language detection

The containment measures caused by the coronavirus pandemic have stimulated the use of social networks as a means of exchanging information, communicating, and countering the effects of social distancing. This paper presents our participation in the NLP Challenge 2022 competition initiated by the Research Centre for Scientific and Technical Information (CERIST). The competition focuses on the task of detecting Arabic hate speech and offensive language on social networks, specifically analyzing Twitter messages related to the COVID-19 pandemic and classifying users’ sentiments as either hateful or not. In the present work, we propose a model based on recurrent neural networks, more precisely bidirectional long short-term memory (Bi-LSTM). We trained the model on a dataset constructed by the organizers of this challenge. As a result, we achieve an accuracy of 96.35%.

Authors: Hakim Bouchal, Ahror BELAID

Download: PDF

GigaBERT-based Approach for Hate Speech Detection in Arabic Twitter

Natural Language Processing has recently become one of the most active research areas in Artificial Intelligence, especially in social media-related tasks. This paper describes our participation in the “Hate Speech Detection on Arabic Twitter” task at the CERIST NLP-Challenge 2022 competition. The proposed solution aims to classify the tweets collected in the Arabic ARACOVID19-MFH multi-label and multi-dialect dataset into “Hateful” and “Not Hateful” categories. Based on a pre-trained transformer model known as GigaBERT-v4, our solution outperformed the most common transformer models supporting the Arabic language. Experiments showed that the GigaBERT-v4 model is more effective than the other models on this dataset, obtaining a 99.46% accuracy and a 98.68% macro F1-score.

Authors: Bachir Said, Mohammed E. Barmati

Download: PDF

XLM-T for Multilingual Sentiment Analysis in Twitter using oversampling technique

With the emergence of Pre-trained Language Models (PLMs) and their success at large scale, the field of Natural Language Processing (NLP) has developed tremendously, including in Sentiment Analysis (SA), one of the fastest-growing research tasks in NLP. This paper describes the system that our team submitted to the CERIST NLP Challenge for task 1.b. The purpose of this task is to identify the sentiment polarity of English and Arabic comments collected from Twitter. Our approach is based on a PLM called XLM-T and uses an oversampling technique to address multilingual sentiment analysis on Twitter. Experimental results confirm that this state-of-the-art model is robust, achieving an accuracy of 85%.
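The abstract does not say which oversampling variant was used. As one common possibility, random oversampling duplicates minority-class examples until every class matches the size of the largest one. The sketch below assumes that variant; the function name `random_oversample`, the toy tweets, and the fixed seed are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are balanced."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Group the samples by class label.
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical imbalanced training set: 5 positive, 2 negative, 1 neutral tweet.
texts = ["p1", "p2", "p3", "p4", "p5", "n1", "n2", "u1"]
labels = ["pos"] * 5 + ["neg"] * 2 + ["neu"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))  # every class now has 5 samples
```

Oversampling is normally applied to the training split only, so that duplicated examples do not leak into validation or test data.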

Authors: Mohammed E. Barmati, Bachir Said

Download: PDF

Hate speech detection model based on BERT for the Arabic dialects

Hateful speech spread through social media has the potential to cause personal harm and suffering as well as social tension. Social media platforms, on the other hand, are unable to regulate all of the content that users post. As a result, there is a demand for automatic detection of hate speech. This demand increases when the posts are written in complex languages, such as Arabic. The present study contributes to hate speech and offensive language detection for Arabic dialects. This paper describes my participation in the CERIST Natural Language Processing Challenge 2022.
We propose an approach based on deep learning and a pre-trained BERT model. This approach is built by adding GRU and LSTM layers on top of the BERT outputs. Additionally, two methods are proposed to deal with the class imbalance in the dataset: the first is based on data augmentation, oversampling the minority class using translation and back-translation, and the second uses focal loss for training. The best results reached with focal loss training are 98.03% accuracy and 98.02% F1-score, and with data augmentation, 99.14% for both accuracy and F1-score.
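Focal loss handles class imbalance by down-weighting examples the model already classifies confidently. In its standard binary form it is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below implements that formula in plain Python. The defaults gamma = 2 and alpha = 0.25 are commonly used values and are an assumption here, not the settings reported in the paper, and `binary_focal_loss` is a hypothetical name.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one example.

    p: predicted probability of the positive class (between 0 and 1).
    y: true label, 1 for positive and 0 for negative.
    gamma: focusing parameter; gamma = 0 reduces to alpha-weighted cross-entropy.
    alpha: weight of the positive class; the negative class gets 1 - alpha.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    p_t = min(max(p_t, eps), 1.0)               # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct example contributes far less than a badly misclassified one.
easy = binary_focal_loss(0.9, 1)  # true class predicted with probability 0.9
hard = binary_focal_loss(0.1, 1)  # true class predicted with probability 0.1
print(easy < hard)  # True
```

With gamma = 2, the easy example above is scaled by (1 - 0.9)^2 = 0.01 relative to weighted cross-entropy, so training concentrates on the hard, often minority-class, examples.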

Authors: Nourelhouda Chiker

Download: PDF