Comment Toxicity

Python / LightGBM / Scikit-learn / Multi-class

Detecting comment toxicity at scale.

Comment Toxicity converts a Kaggle notebook into a reproducible ML pipeline: rich feature engineering, out-of-fold target encoding, LightGBM ensemble, Nelder-Mead blend optimization, and per-class threshold tuning on 198K samples.

Macro F1 score: 0.817 (threshold-tuned LightGBM ensemble, 2-fold CV)
Accuracy: 0.912
Classes: 4
Train samples: 198K
Features: 66+

Exploratory Data Analysis

Understanding the data distribution and signals.

The dataset is heavily imbalanced: class 0 (non-toxic) dominates at 57.7%. Demographic fields (race, religion, gender) are missing for ~73% of comments, but their presence is a strong toxicity signal.

73% missing demographics
2.5x more demographics filled for toxic class
57.7% of comments are non-toxic

Demographic Presence vs Toxicity

Engagement by Label

System Design

Notebook logic converted into a reusable training pipeline.

01

Feature Engineering

66+ features: text stats, punctuation, datetime parts, engagement ratios, demographic flags, post-level aggregates.
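A minimal sketch of what that feature-engineering step could look like. The column names (`comment_text`, `created_date`, `likes`, `dislikes`, and the demographic fields) are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add text-statistic, punctuation, datetime, engagement, and
    demographic-presence features (hypothetical column names)."""
    out = df.copy()
    # Text statistics and punctuation counts
    out["char_count"] = out["comment_text"].str.len()
    out["word_count"] = out["comment_text"].str.split().str.len()
    out["exclaim_count"] = out["comment_text"].str.count("!")
    out["question_count"] = out["comment_text"].str.count(r"\?")
    out["caps_ratio"] = out["comment_text"].apply(
        lambda s: sum(c.isupper() for c in s) / max(len(s), 1)
    )
    # Datetime parts
    ts = pd.to_datetime(out["created_date"])
    out["hour"] = ts.dt.hour
    out["dayofweek"] = ts.dt.dayofweek
    # Engagement ratio (smoothed to avoid division by zero)
    out["like_ratio"] = out["likes"] / (out["likes"] + out["dislikes"] + 1)
    # Demographic fields present at all? (a strong signal per the EDA above)
    demo_cols = ["race", "gender", "religion"]
    out["has_demographics"] = out[demo_cols].notna().any(axis=1).astype(int)
    return out

df = pd.DataFrame({
    "comment_text": ["Great point!", "WHY would you say that?!"],
    "created_date": ["2024-01-01 10:00", "2024-01-02 22:30"],
    "likes": [10, 0], "dislikes": [0, 5],
    "race": [None, "white"], "gender": [None, None], "religion": [None, None],
})
feats = engineer_features(df)
```

Post-level aggregates (e.g. per-thread toxicity rates) would be joined on afterward via a groupby on the post ID.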

02

OOF Target Encoding

5-fold smoothed mean encoding on demographic columns: prevents leakage while capturing categorical signal.
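The idea can be sketched as follows: each row is encoded using category statistics computed only from the other folds, and rare categories are shrunk toward the global prior. The column names and the binary `is_toxic` target here are illustrative assumptions (the real pipeline works over four toxicity levels):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold smoothed mean encoding: a row never sees its own
    target value, which prevents leakage."""
    encoded = np.zeros(len(df))
    prior = df[target].mean()  # global mean, used as the shrinkage prior
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(df):
        stats = df.iloc[tr_idx].groupby(col)[target].agg(["mean", "count"])
        # Shrink rare-category means toward the prior
        smooth = (stats["mean"] * stats["count"] + prior * smoothing) / (
            stats["count"] + smoothing
        )
        encoded[val_idx] = (
            df.iloc[val_idx][col].map(smooth).fillna(prior).values
        )
    return encoded

df = pd.DataFrame({
    "religion": ["a", "a", "b", "b", "a", "b", "a", "b", "a", "b"],
    "is_toxic": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})
enc = oof_target_encode(df, "religion", "is_toxic")
```

At inference time the same smoothed statistics would be recomputed once over the full training set.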

03

LightGBM + Ensemble

Class-weighted LightGBM with early stopping, plus LR/RF baselines. Nelder-Mead optimized blend weights.
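A sketch of the blend-weight search under stated assumptions: each model contributes a probability matrix, weights are projected onto the simplex inside the objective, and Nelder-Mead maximizes macro F1 of the blended argmax. The helper name and setup are illustrative, not the project's actual code:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def optimize_blend(prob_list, y_true):
    """Find convex blend weights over per-model probability matrices
    that maximize macro F1 of the blended argmax prediction."""
    def neg_macro_f1(w):
        w = np.abs(w)
        w = w / w.sum()  # project onto the probability simplex
        blended = sum(wi * p for wi, p in zip(w, prob_list))
        return -f1_score(y_true, blended.argmax(axis=1), average="macro")

    x0 = np.full(len(prob_list), 1.0 / len(prob_list))  # start equal-weighted
    res = minimize(neg_macro_f1, x0, method="Nelder-Mead")
    w = np.abs(res.x)
    return w / w.sum()

# Toy example: one well-calibrated model, one uninformative model
y = np.array([0, 1, 2, 3] * 25)
p_good = np.eye(4)[y]                # one-hot, always correct
p_noise = np.full((100, 4), 0.25)    # uniform, no signal
w = optimize_blend([p_good, p_noise], y)
```

With a degenerate setup like this the optimizer has no reason to move weight onto the noise model, which mirrors the reported outcome of the blend converging to 100% LightGBM.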

04

Threshold Tuning

Per-class decision thresholds optimized via Nelder-Mead to maximize macro F1 on minority classes.
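One common way to implement this, sketched here as an assumption about the approach: learn a per-class coefficient applied to the probabilities before argmax, so boosting a minority class's coefficient trades a little majority-class accuracy for better macro F1. Nelder-Mead starts from all-ones (the untuned baseline), so the result can only match or improve it:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def tune_thresholds(probs, y_true):
    """Per-class probability scaling optimized for macro F1 via
    Nelder-Mead. Starting point np.ones == untuned argmax."""
    def neg_macro_f1(coefs):
        preds = (probs * np.abs(coefs)).argmax(axis=1)
        return -f1_score(y_true, preds, average="macro")

    res = minimize(neg_macro_f1, np.ones(probs.shape[1]), method="Nelder-Mead")
    return np.abs(res.x)

# Toy 3-class example where minority classes are systematically
# under-predicted by plain argmax
y = np.array([0] * 80 + [1] * 15 + [2] * 5)
probs = np.zeros((100, 3))
probs[:80] = [0.70, 0.20, 0.10]   # majority class: confident, correct
probs[80:95] = [0.50, 0.40, 0.10]  # class 1: narrowly loses the argmax
probs[95:] = [0.45, 0.10, 0.45]    # class 2: ties with class 0
coefs = tune_thresholds(probs, y)
```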

Measured Results

LightGBM ensemble significantly outperforms baselines.

Evaluated via out-of-fold predictions from 2-fold stratified CV. The blend optimizer converged to 100% LightGBM weight, and per-class threshold tuning improved macro F1 from 0.813 to 0.817.
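The evaluation scheme can be sketched like this. A logistic-regression stand-in and synthetic data replace LightGBM and the real 198K-row dataset so the example stays self-contained; the mechanics (stratified folds, every sample scored by a model that never trained on it) are the same:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oof_probabilities(model_factory, X, y, n_splits=2, seed=42):
    """Collect out-of-fold class probabilities from stratified CV:
    each row is predicted only by the fold model that excluded it."""
    oof = np.zeros((len(y), len(np.unique(y))))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in skf.split(X, y):
        model = model_factory()  # fresh model per fold
        model.fit(X[tr_idx], y[tr_idx])
        oof[val_idx] = model.predict_proba(X[val_idx])
    return oof

# Synthetic 4-class stand-in for the toxicity data
X, y = make_classification(n_samples=400, n_classes=4,
                           n_informative=6, random_state=0)
oof = oof_probabilities(lambda: LogisticRegression(max_iter=1000), X, y)
```

Metrics such as macro F1, the confusion matrix, and the threshold search above would all be computed on this `oof` matrix rather than on any in-fold predictions.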

Per-Class Precision, Recall & F1 (Tuned)

Confusion Matrix

Feature Importance

Post-level label rates dominate, followed by demographics.

The strongest predictors are per-thread toxicity rates (toxicity clusters in discussions), demographic fields filled by users, and target-encoded race/gender/religion. Text-derived features and engagement metrics add complementary signal.

Post-level features
Demographics
Text & engagement

Full Classification Report

Threshold-tuned LightGBM ensemble.

Per-class breakdown showing precision, recall, and F1 for each toxicity level.

Class                 Precision   Recall   F1 Score   Support
0 — Non-toxic         0.98        0.95     0.96       114,173
1 — Toxic             0.76        0.81     0.79       15,918
2 — Very toxic        0.86        0.90     0.88       62,440
3 — Extremely toxic   0.66        0.61     0.64       5,469
Macro Avg             0.82        0.82     0.82       198,000
Weighted Avg          0.91        0.91     0.91       198,000