Python / LightGBM / Scikit-learn / Multi-class
Comment Toxicity converts a Kaggle notebook into a reproducible ML pipeline: rich feature engineering, out-of-fold target encoding, LightGBM ensemble, Nelder-Mead blend optimization, and per-class threshold tuning on 198K samples.
Exploratory Data Analysis
The dataset is heavily imbalanced: class 0 (non-toxic) dominates at 57.7%. Demographic fields (race, religion, gender) are missing for ~73% of comments, yet whether a user filled them in at all is itself a strong toxicity signal.
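A minimal sketch of turning that missingness into features; the column names `race`, `religion`, and `gender` are assumptions about the schema:

```python
import pandas as pd

# Assumed demographic column names -- the real dataset schema may differ.
DEMOGRAPHIC_COLS = ["race", "religion", "gender"]

def add_presence_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Flag whether each demographic field was filled in at all."""
    out = df.copy()
    for col in DEMOGRAPHIC_COLS:
        out[f"{col}_present"] = df[col].notna().astype("int8")
    # Compact summary: how many demographic fields the user filled in.
    flag_cols = [f"{c}_present" for c in DEMOGRAPHIC_COLS]
    out["n_demographics_present"] = out[flag_cols].sum(axis=1)
    return out
```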
System Design
66+ features: text stats, punctuation, datetime parts, engagement ratios, demographic flags, post-level aggregates.
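A sketch of a small subset of these features; `comment_text` and `created_date` are assumed column names, not confirmed by the source:

```python
import pandas as pd

def add_text_features(df: pd.DataFrame, text_col: str = "comment_text") -> pd.DataFrame:
    """Illustrative subset of the text-stat, punctuation, and datetime features."""
    out = df.copy()
    s = df[text_col].fillna("")
    out["char_count"] = s.str.len()
    out["word_count"] = s.str.split().str.len()
    out["avg_word_len"] = out["char_count"] / out["word_count"].clip(lower=1)
    out["exclamation_count"] = s.str.count("!")
    out["question_count"] = s.str.count(r"\?")
    out["upper_ratio"] = s.str.count(r"[A-Z]") / out["char_count"].clip(lower=1)
    # Datetime parts, assuming a `created_date` timestamp column.
    ts = pd.to_datetime(df["created_date"], errors="coerce")
    out["hour"] = ts.dt.hour
    out["dayofweek"] = ts.dt.dayofweek
    return out
```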
5-fold smoothed mean encoding on the demographic columns, computed out-of-fold. Captures categorical signal without leaking the target into a row's own encoding.
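A minimal sketch of the out-of-fold smoothed encoding, treating the 0–3 label as ordinal for simplicity; the smoothing constant and seed are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_smooth_target_encode(df, col, target, n_splits=5, smoothing=20, seed=42):
    """Encode each row with category means computed on the *other* folds,
    shrunk toward the global mean so rare categories are not overfit."""
    global_mean = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(df):
        stats = df.iloc[tr_idx].groupby(col)[target].agg(["mean", "count"])
        smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(smooth).values
    return encoded.fillna(global_mean)  # unseen categories fall back to the mean
```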
Class-weighted LightGBM with early stopping, plus logistic regression and random forest baselines. Blend weights optimized via Nelder-Mead.
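A sketch of the training and blending steps; the hyperparameters are illustrative, and `X_train`, `y_train`, `X_val`, `y_val`, `y_oof`, and the out-of-fold probability matrices `lgb_oof`, `lr_oof`, `rf_oof` are assumed to exist:

```python
import numpy as np
import lightgbm as lgb
from scipy.optimize import minimize
from sklearn.metrics import f1_score

# Class-weighted LightGBM with early stopping on a held-out fold.
model = lgb.LGBMClassifier(
    objective="multiclass",
    class_weight="balanced",  # counteract the 57.7% non-toxic majority
    n_estimators=2000,
    learning_rate=0.05,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)

# Nelder-Mead over blend weights, scored by macro F1 on OOF predictions.
def blend_loss(w, probas, y_true):
    w = np.abs(w) / np.abs(w).sum()  # normalize to a convex combination
    blended = sum(wi * p for wi, p in zip(w, probas))
    return -f1_score(y_true, blended.argmax(axis=1), average="macro")

probas = [lgb_oof, lr_oof, rf_oof]
res = minimize(blend_loss, x0=np.full(len(probas), 1 / len(probas)),
               args=(probas, y_oof), method="Nelder-Mead")
weights = np.abs(res.x) / np.abs(res.x).sum()
```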
Per-class decision thresholds optimized via Nelder-Mead to maximize macro F1 on minority classes.
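One common formulation, shown here as a sketch rather than the exact implementation: learn a per-class scaling factor applied before the argmax, so minority classes can win with lower raw probability. `proba_oof` and `y_oof` are the assumed OOF arrays:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def predict_with_thresholds(proba, thresholds):
    # Divide by a per-class threshold (< 1 favors that class) before argmax.
    return (proba / np.asarray(thresholds)).argmax(axis=1)

def neg_macro_f1(thresholds, proba, y_true):
    return -f1_score(y_true, predict_with_thresholds(proba, thresholds),
                     average="macro")

res = minimize(neg_macro_f1, x0=np.ones(proba_oof.shape[1]),
               args=(proba_oof, y_oof), method="Nelder-Mead")
best_thresholds = res.x
```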
Measured Results
Evaluated via out-of-fold predictions from 2-fold stratified CV. The blend optimizer converged to 100% LightGBM weight, and per-class threshold tuning improved macro F1 from 0.813 to 0.817.
Feature Importance
The strongest predictors are per-thread toxicity rates (toxicity clusters within discussions), user-filled demographic fields, and the target-encoded race/gender/religion columns. Text-derived features and engagement metrics add complementary signal.
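Gain-based importances can be read straight off the fitted booster; this assumes the `model` from the training sketch above:

```python
import pandas as pd

importance = pd.Series(
    model.booster_.feature_importance(importance_type="gain"),
    index=model.booster_.feature_name(),
).sort_values(ascending=False)
print(importance.head(20))  # top predictors, e.g. per-thread toxicity rates
```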
Full Classification Report
Per-class precision, recall, and F1 across the four toxicity levels.
| Class | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| 0 — Non-toxic | 0.98 | 0.95 | 0.96 | 114,173 |
| 1 — Toxic | 0.76 | 0.81 | 0.79 | 15,918 |
| 2 — Very toxic | 0.86 | 0.90 | 0.88 | 62,440 |
| 3 — Extremely toxic | 0.66 | 0.61 | 0.64 | 5,469 |
| Macro Avg | 0.82 | 0.82 | 0.82 | 198,000 |
| Weighted Avg | 0.91 | 0.91 | 0.91 | 198,000 |
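The table above can be regenerated from the thresholded OOF predictions with scikit-learn, using the names assumed in the sketches above:

```python
from sklearn.metrics import classification_report

print(classification_report(
    y_oof,
    predict_with_thresholds(proba_oof, best_thresholds),
    target_names=["Non-toxic", "Toxic", "Very toxic", "Extremely toxic"],
    digits=2,
))
```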