Théo Palagi
NLP, Python, BERTopic

LLM-SSC

Feb - Jun 2025

Abstract

This project presents a web application for automatic topic extraction from large text corpora. The system is built on BERTopic, a topic modeling technique that combines transformer-based embeddings with clustering algorithms. Using SentenceTransformers for semantic text representation and unsupervised clustering techniques such as HDBSCAN, the application identifies coherent topics in diverse text sources, including YouTube comments and rap lyrics. The pipeline has four stages: text normalization, embedding generation, dimensionality reduction via UMAP, and topic assignment with interpretable labels.
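The text normalization stage is not detailed here, so the following is only a minimal sketch of what light cleaning before embedding might look like. The function names (`normalize_text`, `prepare_corpus`), the choice of cleaning steps, and the `min_chars` threshold are assumptions for illustration, not the project's actual code.

```python
import re
import unicodedata

# Hypothetical patterns: the project's real cleaning rules are not documented.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Apply light cleaning that keeps the wording intact for the embedder."""
    # Fold compatibility characters (ligatures, full-width forms) to a canonical form.
    text = unicodedata.normalize("NFKC", text)
    # Replace links with a space; they rarely carry topical meaning.
    text = URL_RE.sub(" ", text)
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    text = WHITESPACE_RE.sub(" ", text)
    return text.strip()


def prepare_corpus(docs, min_chars=3):
    """Normalize every document and drop those too short to cluster (assumed threshold)."""
    cleaned = (normalize_text(doc) for doc in docs)
    return [doc for doc in cleaned if len(doc) >= min_chars]
```

Case, punctuation, and stop words are left untouched in this sketch, since sentence-embedding models generally work on natural text; a different corpus might call for stricter rules.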

Type: Academic Project

Technical Overview

The application has a Python backend, with FastAPI serving the REST API and Streamlit providing the interactive web interface. BERTopic is the core topic modeling library; it uses SentenceTransformers (all-MiniLM-L6-v2) to generate document embeddings. UMAP then reduces these high-dimensional embeddings to a lower-dimensional space, and HDBSCAN performs density-based clustering on the reduced vectors to identify topics. The system also generates word clouds and interactive visualizations for topic exploration.
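As a rough illustration of how these components fit together, the sketch below assembles a BERTopic model from an explicit embedding model, UMAP reducer, and HDBSCAN clusterer. Only the embedding model name comes from the description above; the `TopicModelConfig` class, the `build_topic_model` helper, and every hyperparameter value are assumptions rather than the project's actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicModelConfig:
    """Hypothetical settings; only the embedding model name is stated in the text."""
    embedding_model: str = "all-MiniLM-L6-v2"
    n_neighbors: int = 15        # assumed UMAP neighborhood size
    n_components: int = 5        # assumed target dimensionality
    min_cluster_size: int = 10   # assumed smallest topic size
    random_state: int = 42       # fixed seed so UMAP runs are repeatable


def build_topic_model(config: TopicModelConfig = TopicModelConfig()):
    """Wire SentenceTransformers, UMAP, and HDBSCAN into a BERTopic instance."""
    # Heavy dependencies are imported lazily so this module loads without them.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    embedding_model = SentenceTransformer(config.embedding_model)
    umap_model = UMAP(
        n_neighbors=config.n_neighbors,
        n_components=config.n_components,
        min_dist=0.0,
        metric="cosine",
        random_state=config.random_state,
    )
    hdbscan_model = HDBSCAN(
        min_cluster_size=config.min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
    )
```

With the `bertopic`, `umap-learn`, `hdbscan`, and `sentence-transformers` packages installed, a call such as `topics, probs = build_topic_model().fit_transform(docs)` would assign one topic id per document, with `-1` marking documents that HDBSCAN treats as outliers.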

Results and Applications

The tool has been applied to analyze sentiment and themes in YouTube comment sections and to extract recurring motifs and subjects from corpora of French rap lyrics. The extracted topics offer insight into public opinion and artistic themes, demonstrating that the approach carries over across domains and languages.