
Text Sentiment Classifier
A bilingual sentiment analysis project that classifies Arabic and English text into positive, neutral, or negative categories using TF-IDF vectorization and a TensorFlow neural network.
Project Overview
This bilingual (Arabic/English) sentiment analysis system was built during the ICAIL program. It processes text through cleaning, normalization, and TF-IDF feature extraction, then classifies sentiment with a TensorFlow/Keras neural network. An interactive Gradio interface lets users test their own text in real time.
My Role
Solo project: I designed, implemented, and evaluated the full pipeline as part of the ICAIL certification program.
Problem
Understanding sentiment in both Arabic and English text is challenging because of linguistic differences, dialectal variation, and the lack of a unified preprocessing pipeline. Many sentiment analysis tools focus on English only, leaving Arabic under-served.
Solution
Built a unified pipeline that handles both languages through language-specific preprocessing (Arabic normalization, diacritics removal, and emoji stripping), followed by TF-IDF vectorization with character n-grams and a dense neural network classifier.
Key Features
Arabic and English sentiment classification
Text cleaning and normalization pipeline
Arabic diacritics and tatweel removal
Emoji and special character filtering
TF-IDF vectorization with character n-grams
TensorFlow / Keras neural network classifier
Interactive prediction interface with Gradio
Classification report and confusion matrix evaluation
Data Collection & Preprocessing
Two datasets were used: Sentiment140 for English and an Arabic Twitter sentiment dataset. Text preprocessing included lowercasing, URL/mention removal, punctuation stripping, and emoji filtering. Arabic text additionally underwent normalization of alef, yaa, and taa characters, diacritics removal, and tatweel (kashida) stripping.
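The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's exact code: the function name `clean_text`, the regular expressions, and the specific character mappings (alef variants to bare alef, alef maqsura to yaa, taa marbuta to haa) are assumptions based on common Arabic normalization conventions.

```python
import re

# Assumed patterns; the project's exact rules are not documented here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic diacritics (fathatan through sukun) plus superscript alef.
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
# Anything that is not a word character or whitespace:
# punctuation, emoji, and other symbols.
NON_WORD_RE = re.compile(r"[^\w\s]")

# Assumed normalization table (a common convention, not confirmed by the source).
ARABIC_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda  -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yaa
    "\u0629": "\u0647",  # taa marbuta -> haa
    "\u0640": None,      # tatweel (kashida) is deleted
})


def clean_text(text: str) -> str:
    """Apply the shared English/Arabic cleaning steps to one string."""
    text = text.lower()                   # no effect on Arabic letters
    text = URL_RE.sub(" ", text)          # drop URLs
    text = MENTION_RE.sub(" ", text)      # drop @mentions
    text = DIACRITICS_RE.sub("", text)    # strip Arabic diacritics
    text = text.translate(ARABIC_MAP)     # normalize letters, remove tatweel
    text = NON_WORD_RE.sub(" ", text)     # strip punctuation and emoji
    return " ".join(text.split())         # collapse repeated whitespace
```

Because the same function is applied to both languages, the Arabic-specific steps simply leave English text unchanged.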
Model Architecture
Text was vectorized using TF-IDF with character n-grams (n-gram range 1–3) to capture subword patterns across both languages. The classifier is a sequential neural network built with TensorFlow/Keras: an input layer matching the TF-IDF vocabulary size, two hidden Dense layers with ReLU activation and Dropout for regularization, and a softmax output layer for three-class classification (positive, neutral, negative).
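A rough sketch of this setup is shown below, assuming scikit-learn's `TfidfVectorizer` for the feature extraction. The layer widths (128 and 64), the dropout rate, the `max_features` cap, the optimizer, and the loss are illustrative assumptions; only the character n-gram range, the two ReLU hidden layers with Dropout, and the three-way softmax output come from the description above.

```python
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer(max_features: int = 20000) -> TfidfVectorizer:
    """Character-level TF-IDF over 1- to 3-grams (max_features is assumed)."""
    return TfidfVectorizer(
        analyzer="char",
        ngram_range=(1, 3),
        max_features=max_features,
    )


def build_classifier(input_dim: int, num_classes: int = 3) -> tf.keras.Model:
    """Dense network: two ReLU hidden layers with Dropout, softmax output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),          # one input per TF-IDF feature
        tf.keras.layers.Dense(128, activation="relu"),  # width is an assumption
        tf.keras.layers.Dropout(0.3),                   # rate is an assumption
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Assumes integer class ids (0, 1, 2) as training labels.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

In use, the vectorizer would be fitted on the cleaned training text, and `build_classifier` would be called with the number of columns in the resulting feature matrix so that the input layer matches the TF-IDF vocabulary size.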
Evaluation & Results
The model was evaluated using a classification report (precision, recall, F1-score) and a confusion matrix. Performance varied across languages due to dataset size differences and class imbalance. The confusion matrix helped identify where the model confused the neutral class with the positive or negative class, a known challenge in sentiment analysis.
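As an illustration of this evaluation step, the sketch below uses scikit-learn's metrics on a handful of made-up labels. The `evaluate` helper, the label order, and the toy data are assumptions for demonstration only and do not reflect the project's actual results.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Assumed label order; rows and columns of the matrix follow it.
LABELS = ["negative", "neutral", "positive"]


def evaluate(y_true, y_pred):
    """Return the text report and the confusion matrix for one test set."""
    report = classification_report(
        y_true, y_pred, labels=LABELS, zero_division=0
    )
    # Rows are true classes, columns are predicted classes.
    matrix = confusion_matrix(y_true, y_pred, labels=LABELS)
    return report, matrix


# Toy data (not real results): one positive is predicted as neutral
# and one neutral is predicted as positive.
y_true = ["positive", "positive", "neutral", "neutral", "negative", "negative"]
y_pred = ["positive", "neutral", "positive", "neutral", "negative", "negative"]

report, matrix = evaluate(y_true, y_pred)
print(report)
print(matrix)
```

Off-diagonal counts in the neutral row and neutral column are where confusion between neutral and the other two classes shows up.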
Interactive Prediction Interface
A Gradio-based web interface was built to let users type Arabic or English text and see real-time sentiment predictions. The interface loads the trained model and the fitted TF-IDF vectorizer, applies the same preprocessing pipeline used in training, and displays the predicted class with its confidence score.
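One way such an interface could be wired together is sketched below. The helper names (`scores_to_labels`, `make_predict_fn`, `build_demo`), the class order, and the widget settings are hypothetical; the sketch assumes a cleaning function, a fitted vectorizer with a `transform` method, and a Keras-style model whose `predict` returns one probability per class.

```python
from typing import Callable, Dict, Sequence

# Assumed class order; it must match the order used during training.
LABELS = ("negative", "neutral", "positive")


def scores_to_labels(probabilities: Sequence[float]) -> Dict[str, float]:
    """Map one row of softmax outputs to a {label: confidence} dict."""
    if len(probabilities) != len(LABELS):
        raise ValueError("expected one probability per class")
    return {label: float(p) for label, p in zip(LABELS, probabilities)}


def make_predict_fn(clean, vectorizer, model) -> Callable[[str], Dict[str, float]]:
    """Bundle preprocessing, vectorization, and inference into one callable."""
    def predict(text: str) -> Dict[str, float]:
        features = vectorizer.transform([clean(text)])
        probabilities = model.predict(features.toarray(), verbose=0)[0]
        return scores_to_labels(probabilities)
    return predict


def build_demo(predict_fn):
    """Create the Gradio app; call .launch() on the result to serve it."""
    # Imported here so the helpers above work without Gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=predict_fn,
        inputs=gr.Textbox(lines=3, label="Arabic or English text"),
        outputs=gr.Label(num_top_classes=3),  # shows each class with its confidence
        title="Text Sentiment Classifier",
    )
```

Returning a label-to-confidence dictionary lets Gradio's `Label` component show the top class together with the scores for the remaining classes.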
Challenges
Handling Arabic text preprocessing required custom normalization rules. Balancing the dataset across languages and sentiment classes also needed careful sampling.
Related Content
This project is linked to the International Certification for Artificial Intelligence License (ICAIL) certification.