AI / Machine Learning

Text Sentiment Classifier

A bilingual sentiment analysis project that classifies Arabic and English text into positive, neutral, or negative categories using TF-IDF vectorization and a TensorFlow neural network.

AI, Machine Learning, NLP, Sentiment Analysis

Project Overview

A bilingual (Arabic/English) sentiment analysis system built during the ICAIL program. The project processes text through cleaning, normalization, and TF-IDF feature extraction, then classifies sentiment using a TensorFlow/Keras neural network. An interactive Gradio interface allows real-time text testing.

My Role

Solo project — designed, implemented, and evaluated the full pipeline as part of the ICAIL certification program.

Problem

Understanding sentiment in both Arabic and English text is challenging because of linguistic differences, dialectal variation, and the lack of a unified preprocessing pipeline. Most sentiment analysis tools focus on English only, leaving Arabic under-served.

Solution

Built a unified pipeline that handles both languages through language-specific preprocessing — including Arabic normalization, diacritics removal, and emoji stripping — followed by TF-IDF vectorization with character n-grams and a dense neural network classifier.

Key Features

Arabic and English sentiment classification

Text cleaning and normalization pipeline

Arabic diacritics and tatweel removal

Emoji and special character filtering

TF-IDF vectorization with character n-grams

TensorFlow / Keras neural network classifier

Interactive prediction interface with Gradio

Classification report and confusion matrix evaluation

Data Collection & Preprocessing

Two datasets were used: Sentiment140 for English and an Arabic Twitter sentiment dataset. Text preprocessing included lowercasing, URL/mention removal, punctuation stripping, and emoji filtering. Arabic text additionally underwent normalization of alef, yaa, and taa characters, diacritics removal, and tatweel (kashida) stripping.
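The cleaning steps described above could look roughly like the sketch below. It uses only the standard library. The function names (`normalize_arabic`, `clean_text`), the regular expressions, and the specific character mappings (hamza-carrying alef forms to bare alef, alef maqsura to yaa, taa marbuta to haa) are assumptions based on common Arabic normalization practice, not the project's confirmed rules.

```python
import re

# URLs and @mentions are removed before punctuation stripping,
# otherwise their pieces would survive as stray tokens.
URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")
# Arabic diacritics (tashkeel), U+064B..U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # kashida, the elongation character
# Anything that is not a letter, digit, or whitespace: punctuation and emoji.
NON_WORD = re.compile(r"[^\w\s]|_")

# Assumed normalization table (a common convention, not confirmed by the project).
ARABIC_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yaa
    "\u0629": "\u0647",  # taa marbuta -> haa
})


def normalize_arabic(text):
    """Strip diacritics and tatweel, then unify letter variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(ARABIC_MAP)


def clean_text(text):
    """Apply the shared cleaning steps to Arabic or English text."""
    text = text.lower()                    # no effect on Arabic script
    text = URL_OR_MENTION.sub(" ", text)
    text = normalize_arabic(text)          # no effect on Latin script
    text = NON_WORD.sub(" ", text)         # drops punctuation and emoji
    return " ".join(text.split())          # collapse repeated whitespace
```

Diacritics are removed before the punctuation pass because combining marks do not count as word characters; stripping them afterwards would split words at every vowel mark.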

Model Architecture

Text was vectorized using TF-IDF with character n-grams (n-gram range 1–3) to capture subword patterns across both languages. The classifier is a sequential neural network built with TensorFlow/Keras: an input layer matching the TF-IDF vocabulary size, two hidden Dense layers with ReLU activation and Dropout for regularization, and a softmax output layer for three-class classification (positive, neutral, negative).
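A minimal sketch of this architecture follows, assuming scikit-learn's `TfidfVectorizer` and the `tf.keras` API. Only the character n-gram range (1 to 3), the two ReLU hidden layers with Dropout, and the three-class softmax output come from the description above. The layer widths (256 and 128), the dropout rate (0.3), the `max_features` cap, the optimizer, the loss, and the toy sentences are illustrative assumptions.

```python
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned training text (hypothetical).
texts = ["i love this phone", "this is terrible", "it arrived on monday"]

# Character n-grams of length 1 to 3, as described in the text.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=20000)
X = vectorizer.fit_transform(texts).toarray().astype("float32")


def build_classifier(input_dim, num_classes=3):
    """Dense network: two ReLU hidden layers with Dropout, softmax output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),        # one input per TF-IDF feature
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Assumes integer class labels (0, 1, 2); one-hot labels would
    # need categorical_crossentropy instead.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_classifier(X.shape[1])
# Untrained forward pass, only to show the output shape: one row of
# three class probabilities per input text.
probabilities = model.predict(X, verbose=0)
```

Training would then be a call such as `model.fit(X, y, epochs=..., validation_split=...)` with integer labels `y`; the hyperparameters used in the project are not stated here.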

Evaluation & Results

The model was evaluated using a classification report (precision, recall, F1-score) and a confusion matrix. Performance varied across languages due to dataset size differences and class imbalance. The confusion matrix helped identify where the model confused neutral with positive/negative classes — a known challenge in sentiment analysis.
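The evaluation step can be reproduced with scikit-learn's `classification_report` and `confusion_matrix`. The labels below are a made-up toy example chosen to show neutral being confused with the other classes; they are not the project's results.

```python
from sklearn.metrics import classification_report, confusion_matrix

LABELS = ["negative", "neutral", "positive"]

# Hypothetical ground truth and predictions for six texts.
y_true = ["positive", "neutral", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "positive", "negative", "neutral", "positive", "neutral"]

# Per-class precision, recall, and F1-score.
print(classification_report(y_true, y_pred, labels=LABELS, zero_division=0))

# Rows are true classes, columns are predicted classes, both in LABELS order.
cm = confusion_matrix(y_true, y_pred, labels=LABELS)
print(cm)
```

Off-diagonal entries in the neutral row and the neutral column are the cells to watch: they count neutral texts predicted as positive or negative, and polar texts predicted as neutral.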

Interactive Prediction Interface

A Gradio-based web interface lets users type Arabic or English text and see real-time sentiment predictions. The interface loads the trained model and the fitted TF-IDF vectorizer, applies the same preprocessing pipeline used in training, and displays the predicted class with its confidence score.
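One way to wire such an interface is sketched below. The helper names (`to_confidences`, `make_predict_fn`, `build_demo`), the class order in `LABELS`, and the widget labels are assumptions for illustration; `clean_text`, `vectorizer`, and `model` stand for the project's own preprocessing function, fitted TF-IDF vectorizer, and trained Keras model.

```python
# Assumed class order; it must match the label encoding used in training.
LABELS = ["negative", "neutral", "positive"]


def to_confidences(probabilities):
    """Map a softmax output row to the {label: confidence} dict that gr.Label displays."""
    return {label: float(p) for label, p in zip(LABELS, probabilities)}


def make_predict_fn(clean_text, vectorizer, model):
    """Bind the preprocessing function, vectorizer, and model into one callable."""
    def predict(text):
        # Same cleaning and vectorization as at training time.
        features = vectorizer.transform([clean_text(text)]).toarray()
        probabilities = model.predict(features, verbose=0)[0]
        return to_confidences(probabilities)
    return predict


def build_demo(predict_fn):
    """Create the Gradio interface (gradio is imported only when this is called)."""
    import gradio as gr
    return gr.Interface(
        fn=predict_fn,
        inputs=gr.Textbox(label="Arabic or English text"),
        outputs=gr.Label(num_top_classes=3),
        title="Text Sentiment Classifier",
    )
```

With the real objects loaded, the app would start with `build_demo(make_predict_fn(clean_text, vectorizer, model)).launch()`.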

Tech Stack

Python, TensorFlow, Keras, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn, Gradio, Kaggle Datasets

Challenges

Handling Arabic text preprocessing required custom normalization rules. Balancing the dataset across languages and sentiment classes also needed careful sampling.
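The sampling method is not specified above. One common option, shown here purely as an assumption, is to downsample every (language, label) group to the size of the smallest group with pandas. The column names `language` and `label` and the helper `balance_groups` are hypothetical.

```python
import pandas as pd


def balance_groups(df, group_cols=("language", "label"), seed=42):
    """Downsample each group to the size of the smallest group."""
    cols = list(group_cols)
    smallest = df.groupby(cols).size().min()
    return (
        df.groupby(cols, group_keys=False)
        .sample(n=smallest, random_state=seed)  # same count from every group
        .reset_index(drop=True)
    )
```

Downsampling discards data from the larger groups; class weights or oversampling are alternatives when the smallest group is too small to train on.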

Learning Outcomes

NLP preprocessing techniques

Arabic text normalization

Sentiment classification workflows

TensorFlow model building

Working with multilingual datasets

Neural network fundamentals

Data visualization and evaluation
