Théo Palagi
Computer Vision, Machine Learning, Deep Learning, Image Classification, Medical Imaging

White Blood Cell Classification Challenge

2026

Abstract

Automatic classification of white blood cells from microscopic images is a major challenge for haematological diagnosis. This project addresses a 13-class classification problem with 28,901 training images and an extreme class imbalance (1183:1 ratio). We first built a classical ML pipeline based on 89 handcrafted features (morphological, colorimetric, textural) with dual cell/nucleus segmentation, then moved to deep learning with ConvNeXt-Tiny, Focal Loss, MixUp/CutMix, backbone freezing, and Test Time Augmentation. This iterative approach, guided by t-SNE embedding analysis and confusion matrix diagnostics, raised the macro F1 on the Kaggle leaderboard from 0.491 to 0.77. The final submission ranked 20th out of 70+ students at Télécom Paris.

Type: Academic Project
Publication: Course Project - Image Processing (IMA205), April 2026

Classical ML: Handcrafted Features & Segmentation

The first approach relies on a dual cell/nucleus segmentation using Otsu thresholding in CIELab colour space, followed by extraction of 89 handcrafted features in three categories: morphological (area, circularity, Hu moments, nucleus/cell ratio), colorimetric (RGB/HSV statistics per region: cell, nucleus, cytoplasm), and textural (GLCM properties over 4 angles plus LBP histograms). Class imbalance is handled with SMOTE (k_neighbors=3 for the ultra-rare classes) and class_weight='balanced'. After hyperparameter tuning with GridSearchCV, GradientBoosting reaches a macro F1 of 0.491. Confusion matrix analysis shows that biologically similar classes (BNE/SNE, PLY/LY, MMY/MY/PMY) are systematically confused, which highlights the limits of handcrafted features for capturing fine morphological differences.
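As an illustration of the thresholding step, here is a minimal NumPy sketch of Otsu's method applied to a single channel. It is not the project's code: the synthetic pixel values, the choice of a channel in which the nucleus is darker than its surroundings, and the names otsu_threshold and nucleus_mask are all assumptions made for the example.

```python
import numpy as np


def otsu_threshold(channel: np.ndarray, bins: int = 256) -> float:
    """Return the split value that maximises the between-class variance."""
    hist, edges = np.histogram(channel.ravel(), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    prob = hist / hist.sum()

    w0 = np.cumsum(prob)                  # weight of the class below each split
    w1 = 1.0 - w0                         # weight of the class above each split
    cum_mean = np.cumsum(prob * centers)  # cumulative (unnormalised) mean
    total_mean = cum_mean[-1]

    # Between-class variance for every candidate split; splits that leave
    # one class empty are skipped to avoid dividing by zero.
    denom = w0 * w1
    valid = denom > 0
    sigma_b = np.zeros_like(denom)
    sigma_b[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / denom[valid]

    # Use the upper edge of the best bin so that `channel < threshold`
    # selects exactly the pixels counted in the lower class.
    return float(edges[1:][np.argmax(sigma_b)])


# Synthetic example (assumed values): 500 dark "nucleus" pixels and
# 1500 brighter "cytoplasm/background" pixels in one channel.
rng = np.random.default_rng(0)
dark = rng.normal(60.0, 5.0, size=500)
bright = rng.normal(180.0, 5.0, size=1500)
channel = np.concatenate([dark, bright])

threshold = otsu_threshold(channel)
nucleus_mask = channel < threshold
print(f"threshold={threshold:.1f}, nucleus pixels={int(nucleus_mask.sum())}")
```

On a real image, the same function would be applied to one channel of the CIELab conversion, and morphological features such as the nucleus/cell area ratio could then be computed from the resulting masks.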

Deep Learning: From EfficientNet to ConvNeXt

Transfer learning with EfficientNet-B3 pre-trained on ImageNet reaches a macro F1 of 0.685 on the leaderboard, about 19 points above the classical pipeline. However, t-SNE visualization of the embeddings reveals that the rare classes (MMY, MY, PMY, PLY) form an indistinct central blob in feature space. This motivates the switch to ConvNeXt-Tiny with Focal Loss (gamma=2), which down-weights easy examples so that training focuses on hard-to-classify ones. A CosineAnnealingWarmRestarts scheduler with T_0=15 over 60 epochs periodically restarts the learning rate, which helps the optimizer escape poor local minima. Combined with Test Time Augmentation (8 rounds, temperature scaling T=0.8), this pushes the score to 0.747.
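To make the effect of gamma concrete, the following NumPy sketch implements the multi-class focal loss, FL = -(1 - p_t)^gamma * log(p_t), and compares it with plain cross-entropy (gamma=0) on one easy and one hard example. The function name focal_loss, the optional class_weights argument, and the toy logits are assumptions for illustration; the project itself presumably used a deep learning framework rather than NumPy.

```python
import numpy as np


def focal_loss(logits, targets, gamma=2.0, class_weights=None):
    """Mean multi-class focal loss; gamma=0 gives ordinary cross-entropy."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)

    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Log-probability and probability assigned to the true class.
    log_pt = log_probs[np.arange(len(targets)), targets]
    pt = np.exp(log_pt)

    # The (1 - pt)^gamma factor shrinks the loss of well-classified samples.
    loss = -((1.0 - pt) ** gamma) * log_pt
    if class_weights is not None:
        loss = loss * np.asarray(class_weights, dtype=float)[targets]
    return float(loss.mean())


# Toy logits (assumed values): the true class is 0 in both cases.
easy = [[4.0, 0.0, 0.0]]   # confident and correct
hard = [[0.0, 4.0, 0.0]]   # confident and wrong

ce_easy = focal_loss(easy, [0], gamma=0.0)
fl_easy = focal_loss(easy, [0], gamma=2.0)
ce_hard = focal_loss(hard, [0], gamma=0.0)
fl_hard = focal_loss(hard, [0], gamma=2.0)

print(f"easy: CE={ce_easy:.4f}  focal={fl_easy:.6f}")
print(f"hard: CE={ce_hard:.4f}  focal={fl_hard:.4f}")
```

The easy example keeps only a tiny fraction of its cross-entropy loss, while the hard example keeps almost all of it, which is the mechanism that shifts the gradient budget toward difficult samples.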

Optimization: Diagnosing and Fixing Overfitting

A paradoxical observation drove the final optimization phase: adding MixUp improved validation F1 from 0.696 to 0.730, but the Kaggle score dropped from 0.747 to 0.728. This divergence pointed to overfitting on the validation split itself. The successive fixes were: freezing the backbone for 5 epochs so that noisy gradients do not destroy the ImageNet features, combining CutMix with MixUp (50/50) for spatial regularization, increasing weight decay from 5e-4 to 1e-2, and recalibrating the class weights from log-dampened to 1/sqrt(n) while reducing the Focal Loss gamma from 2.0 to 1.0. Each change was motivated by a specific diagnostic rather than trial and error, and together they yielded a final score of 0.77.
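The MixUp/CutMix combination can be sketched as follows in NumPy. This is a hedged illustration rather than the project's training code: it assumes that "50/50" means each batch uses one of the two augmentations with equal probability, and the alpha values, the NHWC image layout, and the function names mixup, cutmix, and mix_batch are all hypothetical.

```python
import numpy as np


def mixup(images, labels_onehot, rng, alpha=0.2):
    """Blend each image (and its label) with a randomly chosen partner."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(images))
    mixed = lam * images + (1.0 - lam) * images[perm]
    targets = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed, targets


def cutmix(images, labels_onehot, rng, alpha=1.0):
    """Paste a rectangular patch from a partner image; mix labels by area."""
    n, h, w, _ = images.shape                      # assumed NHWC layout
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(n)

    # Patch size chosen so that its area is roughly (1 - lam) of the image.
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = images.copy()
    mixed[:, y0:y1, x0:x1, :] = images[perm][:, y0:y1, x0:x1, :]

    # Recompute the mixing weight from the patch actually pasted,
    # since clipping at the border can shrink it.
    lam_adj = 1.0 - (y1 - y0) * (x1 - x0) / (h * w)
    targets = lam_adj * labels_onehot + (1.0 - lam_adj) * labels_onehot[perm]
    return mixed, targets


def mix_batch(images, labels_onehot, rng):
    """Apply MixUp or CutMix to the batch with equal probability."""
    if rng.random() < 0.5:
        return mixup(images, labels_onehot, rng)
    return cutmix(images, labels_onehot, rng)


# Toy batch (assumed shapes): 4 random 8x8 RGB images, 13 classes.
rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))
labels_onehot = np.eye(13)[[0, 3, 3, 12]]

mixed, targets = mix_batch(images, labels_onehot, rng)
print(mixed.shape, targets.sum(axis=1))
```

Both augmentations produce soft targets whose entries still sum to 1, so the loss must accept probability vectors rather than integer class indices.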

Results & Analysis

The optimized model achieves high recall on the dominant classes (BA: 0.97, EO: 0.95, LY: 0.91, SNE: 0.88), while the rare classes remain challenging: PLY (0.50, confused with LY), PC (0.40), PMY (0.24, confused with MY). These residual confusions are biologically coherent: PLY is a lymphocyte precursor morphologically near-identical to LY, and the granulocytic lineage (MMY/MY/PMY) represents successive maturation stages with gradual morphological boundaries. With only 11 training images for PLY, the data lack the diversity that any of these methods would need to compensate. Overall progression: 0.491 (ML) → 0.685 (EfficientNet) → 0.747 (ConvNeXt + TTA) → 0.77 (optimized), a gain of about 28 points.
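Because the leaderboard metric is macro F1, each class counts equally regardless of its size, so a weak rare class lowers the score far more than it lowers accuracy. The sketch below computes per-class precision, recall, and F1 from a confusion matrix with NumPy. The 3-class matrix is invented for the example and is not project data; the function name per_class_metrics is likewise an assumption.

```python
import numpy as np


def per_class_metrics(conf):
    """Per-class precision, recall and F1 (rows = true, columns = predicted)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    support = conf.sum(axis=1)      # number of true samples per class
    predicted = conf.sum(axis=0)    # number of predictions per class

    # Classes with no samples or no predictions get a score of 0.
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2.0 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Invented confusion matrix: one dominant class, one medium, one rare.
conf = np.array([
    [950, 40, 10],
    [30, 60, 10],
    [4, 2, 4],
])

precision, recall, f1 = per_class_metrics(conf)
accuracy = np.trace(conf) / conf.sum()
macro_f1 = f1.mean()

print("recall per class:", np.round(recall, 2))
print(f"accuracy={accuracy:.3f}  macro F1={macro_f1:.3f}")
```

In this toy case accuracy stays high because the dominant class is classified well, while macro F1 is pulled down by the rare class, which mirrors the role that PLY, PC, and PMY play in the results above.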