Comment Toxicity

Python / LightGBM / Scikit-learn / Multi-class

Detecting comment toxicity at scale.

Comment Toxicity converts a Kaggle notebook into a reproducible ML pipeline: rich feature engineering, out-of-fold target encoding, LightGBM ensemble, Nelder-Mead blend optimization, and per-class threshold tuning on 198K samples.

Macro F1 score: 0.817 (threshold-tuned LightGBM ensemble, 2-fold CV)
Accuracy: 0.912
Classes: 4
Train samples: 198K
Features: 66+

Exploratory Data Analysis

Understanding the data distribution and signals.

The dataset is heavily imbalanced: class 0 (non-toxic) dominates at 57.7%. Demographic fields (race, religion, gender) are missing for ~73% of comments, but their presence is a strong toxicity signal.

73% missing demographics
2.5x more demographics filled for toxic class
57.7% of comments are non-toxic

Demographic Presence vs Toxicity

Engagement by Label

System Design

Notebook logic converted into a reusable training pipeline.

01

Feature Engineering

66+ features: text stats, punctuation, datetime parts, engagement ratios, demographic flags, post-level aggregates.
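A minimal sketch of what that feature-engineering step could look like. The column names (`comment_text`, `created_date`, `likes`, `dislikes`, and the demographic fields) are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add text-statistic, punctuation, datetime, engagement, and
    demographic-presence features (hypothetical column names)."""
    out = df.copy()
    # Text statistics and punctuation counts
    out["char_count"] = out["comment_text"].str.len()
    out["word_count"] = out["comment_text"].str.split().str.len()
    out["exclaim_count"] = out["comment_text"].str.count("!")
    out["question_count"] = out["comment_text"].str.count(r"\?")
    out["caps_ratio"] = out["comment_text"].apply(
        lambda s: sum(c.isupper() for c in s) / max(len(s), 1)
    )
    # Datetime parts
    ts = pd.to_datetime(out["created_date"])
    out["hour"] = ts.dt.hour
    out["dayofweek"] = ts.dt.dayofweek
    # Engagement ratio (smoothed to avoid division by zero)
    out["like_ratio"] = out["likes"] / (out["likes"] + out["dislikes"] + 1)
    # Demographic fields present at all? (a strong signal per the EDA above)
    demo_cols = ["race", "gender", "religion"]
    out["has_demographics"] = out[demo_cols].notna().any(axis=1).astype(int)
    return out

df = pd.DataFrame({
    "comment_text": ["Great point!", "WHY would you say that?!"],
    "created_date": ["2024-01-01 10:00", "2024-01-02 22:30"],
    "likes": [10, 0], "dislikes": [0, 5],
    "race": [None, "white"], "gender": [None, None], "religion": [None, None],
})
feats = engineer_features(df)
```

Post-level aggregates (e.g. per-thread toxicity rates) would be joined on afterward via a groupby on the post ID.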

02

OOF Target Encoding

5-fold smoothed mean encoding on demographic columns: prevents leakage while capturing categorical signal.
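The idea can be sketched as follows: each row is encoded using category statistics computed only from the other folds, and rare categories are shrunk toward the global prior. The column names and the binary `is_toxic` target here are illustrative assumptions (the real pipeline works over four toxicity levels):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold smoothed mean encoding: a row never sees its own
    target value, which prevents leakage."""
    encoded = np.zeros(len(df))
    prior = df[target].mean()  # global mean, used as the shrinkage prior
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(df):
        stats = df.iloc[tr_idx].groupby(col)[target].agg(["mean", "count"])
        # Shrink rare-category means toward the prior
        smooth = (stats["mean"] * stats["count"] + prior * smoothing) / (
            stats["count"] + smoothing
        )
        encoded[val_idx] = (
            df.iloc[val_idx][col].map(smooth).fillna(prior).values
        )
    return encoded

df = pd.DataFrame({
    "religion": ["a", "a", "b", "b", "a", "b", "a", "b", "a", "b"],
    "is_toxic": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})
enc = oof_target_encode(df, "religion", "is_toxic")
```

At inference time the same smoothed statistics would be recomputed once over the full training set.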

03

LightGBM + Ensemble

Class-weighted LightGBM with early stopping, plus LR/RF baselines. Nelder-Mead optimized blend weights.
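A sketch of the blend-weight search under stated assumptions: each model contributes a probability matrix, weights are projected onto the simplex inside the objective, and Nelder-Mead maximizes macro F1 of the blended argmax. The helper name and setup are illustrative, not the project's actual code:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def optimize_blend(prob_list, y_true):
    """Find convex blend weights over per-model probability matrices
    that maximize macro F1 of the blended argmax prediction."""
    def neg_macro_f1(w):
        w = np.abs(w)
        w = w / w.sum()  # project onto the probability simplex
        blended = sum(wi * p for wi, p in zip(w, prob_list))
        return -f1_score(y_true, blended.argmax(axis=1), average="macro")

    x0 = np.full(len(prob_list), 1.0 / len(prob_list))  # start equal-weighted
    res = minimize(neg_macro_f1, x0, method="Nelder-Mead")
    w = np.abs(res.x)
    return w / w.sum()

# Toy example: one well-calibrated model, one uninformative model
y = np.array([0, 1, 2, 3] * 25)
p_good = np.eye(4)[y]                # one-hot, always correct
p_noise = np.full((100, 4), 0.25)    # uniform, no signal
w = optimize_blend([p_good, p_noise], y)
```

With a degenerate setup like this the optimizer has no reason to move weight onto the noise model, which mirrors the reported outcome of the blend converging to 100% LightGBM.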

04

Threshold Tuning

Per-class decision thresholds optimized via Nelder-Mead to maximize macro F1 on minority classes.
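One common way to implement this, sketched here as an assumption about the approach: learn a per-class coefficient applied to the probabilities before argmax, so boosting a minority class's coefficient trades a little majority-class accuracy for better macro F1. Nelder-Mead starts from all-ones (the untuned baseline), so the result can only match or improve it:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def tune_thresholds(probs, y_true):
    """Per-class probability scaling optimized for macro F1 via
    Nelder-Mead. Starting point np.ones == untuned argmax."""
    def neg_macro_f1(coefs):
        preds = (probs * np.abs(coefs)).argmax(axis=1)
        return -f1_score(y_true, preds, average="macro")

    res = minimize(neg_macro_f1, np.ones(probs.shape[1]), method="Nelder-Mead")
    return np.abs(res.x)

# Toy 3-class example where minority classes are systematically
# under-predicted by plain argmax
y = np.array([0] * 80 + [1] * 15 + [2] * 5)
probs = np.zeros((100, 3))
probs[:80] = [0.70, 0.20, 0.10]   # majority class: confident, correct
probs[80:95] = [0.50, 0.40, 0.10]  # class 1: narrowly loses the argmax
probs[95:] = [0.45, 0.10, 0.45]    # class 2: ties with class 0
coefs = tune_thresholds(probs, y)
```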

Measured Results

LightGBM ensemble significantly outperforms baselines.

Evaluated via out-of-fold predictions from 2-fold stratified CV. The blend optimizer converged to 100% LightGBM weight, and per-class threshold tuning improved macro F1 from 0.813 to 0.817.
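The evaluation scheme can be sketched like this. A logistic-regression stand-in and synthetic data replace LightGBM and the real 198K-row dataset so the example stays self-contained; the mechanics (stratified folds, every sample scored by a model that never trained on it) are the same:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oof_probabilities(model_factory, X, y, n_splits=2, seed=42):
    """Collect out-of-fold class probabilities from stratified CV:
    each row is predicted only by the fold model that excluded it."""
    oof = np.zeros((len(y), len(np.unique(y))))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in skf.split(X, y):
        model = model_factory()  # fresh model per fold
        model.fit(X[tr_idx], y[tr_idx])
        oof[val_idx] = model.predict_proba(X[val_idx])
    return oof

# Synthetic 4-class stand-in for the toxicity data
X, y = make_classification(n_samples=400, n_classes=4,
                           n_informative=6, random_state=0)
oof = oof_probabilities(lambda: LogisticRegression(max_iter=1000), X, y)
```

Metrics such as macro F1, the confusion matrix, and the threshold search above would all be computed on this `oof` matrix rather than on any in-fold predictions.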

Per-Class Precision, Recall & F1 (Tuned)

Confusion Matrix

Feature Importance

Post-level label rates dominate, followed by demographics.

The strongest predictors are per-thread toxicity rates (toxicity clusters in discussions), demographic fields filled by users, and target-encoded race/gender/religion. Text-derived features and engagement metrics add complementary signal.

Post-level features
Demographics
Text & engagement

Full Classification Report

Threshold-tuned LightGBM ensemble.

Per-class breakdown showing precision, recall, and F1 for each toxicity level.

Class                 Precision   Recall   F1 Score   Support
0 — Non-toxic         0.98        0.95     0.96       114,173
1 — Toxic             0.76        0.81     0.79       15,918
2 — Very toxic        0.86        0.90     0.88       62,440
3 — Extremely toxic   0.66        0.61     0.64       5,469
Macro Avg             0.82        0.82     0.82       198,000
Weighted Avg          0.91        0.91     0.91       198,000