Machine Learning Intern

Product Goal

Build the Amharic/English NLP content moderation engine and product recommendation system for CampusConnect.

Tech Stack & Tools

Python, Scikit-Learn/Hugging Face, FastAPI, Docker

Weekly Deliverables

The Data Pipeline Blueprint

Construct a data ingestion script to scrape or compile a baseline corpus of 5,000+ university-relevant sentences in mixed Amharic and English.
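
As a rough illustration of the compile step, the sketch below normalizes raw sentences, drops blanks and exact duplicates, and tags each sentence by script. The function names (script_mix, build_corpus) and the sample sentences are hypothetical, and tagging by the Ethiopic Unicode block (U+1200 to U+137F) is an assumption about how "mixed" text could be identified, not a requirement of the brief.

```python
import unicodedata

# Ethiopic Unicode block used by Amharic (U+1200 to U+137F).
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def script_mix(sentence):
    """Label a sentence as 'amharic', 'english', 'mixed', or 'other'."""
    has_ethiopic = any(ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END for ch in sentence)
    has_latin = any("a" <= ch.lower() <= "z" for ch in sentence)
    if has_ethiopic and has_latin:
        return "mixed"
    if has_ethiopic:
        return "amharic"
    if has_latin:
        return "english"
    return "other"


def build_corpus(raw_sentences):
    """Normalize whitespace and Unicode form, then drop blanks and duplicates."""
    seen = set()
    corpus = []
    for raw in raw_sentences:
        text = " ".join(unicodedata.normalize("NFC", raw).split())
        if not text or text in seen:
            continue
        seen.add(text)
        corpus.append({"text": text, "script": script_mix(text)})
    return corpus


# Hypothetical sample input; a real run would read scraped or compiled files.
sample = [
    "The library closes at 8 pm.",
    "ሰላም  ለሁሉም",
    "ሰላም ለሁሉም",
    "Exam ነገ ነው",
    "   ",
]
corpus = build_corpus(sample)
```

Counting len(corpus) after each ingestion run is one simple way to track progress toward the 5,000-sentence target.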

The Exploratory Data Analysis (EDA) Jupyter Notebook

Deliver a documented Jupyter Notebook that cleans the text data, tokenizes strings, handles Amharic stop-words, and visualizes class imbalances.
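
A minimal sketch of the notebook's core cells might look like the following. The stop-word sets are tiny illustrative samples rather than a complete list, and the names tokenize and class_balance are hypothetical. The tokenizer keeps runs of Latin letters or Ethiopic syllables (U+1200 to U+135A), so Ethiopic punctuation such as the full stop is dropped.

```python
import re
from collections import Counter

# Illustrative samples only; a real stop-word list would be much longer.
AMHARIC_STOPWORDS = {"እና", "ነው", "ላይ", "ወደ"}
ENGLISH_STOPWORDS = {"the", "is", "a", "to", "and"}
STOPWORDS = AMHARIC_STOPWORDS | ENGLISH_STOPWORDS

# A token is a run of Latin letters or a run of Ethiopic syllables.
TOKEN = re.compile(r"[A-Za-z]+|[\u1200-\u135A]+")


def tokenize(text):
    """Lowercase, split into tokens, and remove stop-words."""
    tokens = [token.lower() for token in TOKEN.findall(text)]
    return [token for token in tokens if token not in STOPWORDS]


def class_balance(labels):
    """Return each label's share of the dataset, ready to plot as a bar chart."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


tokens = tokenize("Exam ነገ ነው።")
shares = class_balance(["safe", "safe", "safe", "scam"])
```

The shares dictionary can be passed to any plotting library to visualize the imbalance.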

The Baseline Model Architecture

Train a baseline TF-IDF + Logistic Regression text classifier to detect prohibited content (scams, explicit language). Document baseline precision/recall.
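
For the documentation step, precision and recall can be computed directly from true and predicted labels. The sketch below assumes a binary framing with a hypothetical "prohibited" positive label, and the toy labels are made up for illustration. It does not train the TF-IDF + Logistic Regression model itself.

```python
def precision_recall(y_true, y_pred, positive="prohibited"):
    """Compute precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels for illustration only, not real results.
y_true = ["prohibited", "safe", "prohibited", "safe", "prohibited"]
y_pred = ["prohibited", "prohibited", "safe", "safe", "prohibited"]
precision, recall = precision_recall(y_true, y_pred)
```

In a moderation setting, low recall means prohibited posts slip through, while low precision means legitimate posts get blocked, so it is worth reporting both numbers rather than accuracy alone.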

The Model Upgrade Loop

Implement a deep learning architecture (e.g., fine-tuning a small multilingual transformer model such as mT5 or AfriBERTa). Document performance gains over the baseline.

The Validation Matrix

Produce an interactive confusion matrix and an error analysis log that identify exactly where the model misclassifies text, then apply data augmentation to close those gaps.
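
The data behind both artifacts can be sketched in a few lines. The example below counts (expected, predicted) pairs for the matrix and collects misclassified posts for the log. The function names, label set, and sample posts are hypothetical, and the interactive rendering is left to whatever plotting tool the notebook uses.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (expected, predicted) label pair occurs."""
    return Counter(zip(y_true, y_pred))


def error_log(texts, y_true, y_pred):
    """List every misclassified post with its expected and predicted label."""
    return [
        {"text": text, "expected": expected, "predicted": predicted}
        for text, expected, predicted in zip(texts, y_true, y_pred)
        if expected != predicted
    ]


# Made-up posts and labels for illustration only.
texts = [
    "Selling exam answers, pay first",
    "Study group at 5 pm",
    "Free laptop, send your PIN",
    "Library hours changed",
]
y_true = ["scam", "safe", "scam", "safe"]
y_pred = ["scam", "safe", "safe", "safe"]

matrix = confusion_counts(y_true, y_pred)
errors = error_log(texts, y_true, y_pred)
```

Grouping the error log by (expected, predicted) pair shows which kinds of posts need more augmented training examples.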

The FastAPI Application

Wrap the final model inside a high-performance FastAPI application that accepts JSON payloads of user posts and returns a JSON safety/category score.
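
The request and response contract can be drafted before any web framework is involved. The sketch below is framework-independent and uses only the standard library. The field names (post_id, text, category, safety_score), the handle_request function, and the stub model are all assumptions for illustration. In the real service, the same logic would sit inside a FastAPI route.

```python
import json


def handle_request(body, model):
    """Parse a JSON post payload, score it with `model`, return a JSON reply.

    `model` is any callable mapping text to (category, safety_score).
    """
    payload = json.loads(body)
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        return json.dumps({"error": "field 'text' must be a non-empty string"})
    category, safety_score = model(text)
    return json.dumps(
        {
            "post_id": payload.get("post_id"),
            "category": category,
            "safety_score": safety_score,
        }
    )


def stub_model(text):
    """Stand-in for the trained classifier; always returns a fixed answer."""
    return "safe", 0.9


reply = json.loads(
    handle_request('{"post_id": 7, "text": "Study group at 5 pm"}', stub_model)
)
bad_reply = json.loads(handle_request('{"post_id": 8, "text": ""}', stub_model))
```

Passing the model in as a callable keeps the contract testable without loading real weights.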

The API Optimization & Containerization

Quantize the model to bring inference latency under 200 ms, and package the entire app into a deployment-ready Docker image.
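
Whether quantization actually meets the latency target has to be measured. The sketch below times a prediction callable and checks a nearest-rank 95th percentile against the 200 ms budget. Using the 95th percentile rather than the mean is an assumption, since the brief does not say which statistic the target applies to. The helper names and the stand-in predictor are hypothetical, and the quantization step itself is not shown.

```python
import time


def measure_latency_ms(predict, inputs, warmup=3):
    """Time each call to `predict` in milliseconds after a short warm-up."""
    for text in inputs[:warmup]:
        predict(text)
    timings = []
    for text in inputs:
        start = time.perf_counter()
        predict(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def within_budget(timings, budget_ms=200.0):
    """True when the 95th-percentile latency is under the budget."""
    return p95(timings) < budget_ms


# Stand-in predictor; replace with the quantized model's predict function.
timings = measure_latency_ms(lambda text: text.upper(), ["sample post"] * 10)
```

Running the same measurement before and after quantization, on the same inputs and hardware, gives a like-for-like comparison.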

Production Inference Monitoring

Deploy the containerized API and build a logging script that records live inference predictions during the soft launch to check for model drift.
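
One simple way to structure the logging script is to append each prediction as a JSON line and periodically compare the live category distribution with the training distribution. In the sketch below, the record fields, the function names, and the choice of total variation distance as the drift signal are all assumptions. The log stores the post length rather than the post text, which is an illustrative privacy choice and not something the brief specifies.

```python
import json
import time
from collections import Counter


def log_prediction(path, text, category, score):
    """Append one inference record to a JSON Lines file."""
    record = {
        "ts": time.time(),
        "text_length": len(text),
        "category": category,
        "score": score,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def category_shares(categories):
    """Return each category's share of the predictions."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation(baseline, live):
    """Total variation distance between two category distributions (0 to 1)."""
    labels = set(baseline) | set(live)
    return 0.5 * sum(
        abs(baseline.get(label, 0.0) - live.get(label, 0.0)) for label in labels
    )


# Made-up distributions for illustration only.
baseline = {"safe": 0.9, "scam": 0.1}
live = category_shares(["safe"] * 7 + ["scam"] * 3)
drift = total_variation(baseline, live)
```

A drift value that keeps rising during the soft launch suggests the live posts differ from the training corpus, though the alert threshold would need to be chosen empirically.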

The Model Weights & Performance Dossier

Hand over the model artifact registry, production weights, and a technical whitepaper outlining future model fine-tuning paths.