Tackling Multimodal Data: How Netflix Builds Machine Learning Datasets at Scale

November 21, 2025 · Data Science Summit 2025 · Warsaw, Poland

Constructing and curating multimodal datasets at scale has been a challenging task until recently. We will talk about how Netflix uses Ray to build massive multimodal datasets for text-to-image research. We’ll show how Ray’s distributed processing fans out data ingestion and filtering across hundreds of GPUs, how we run batch inference at scale with cutting-edge vision-language models to score and caption images and videos, and how smart curation and sampling reduce dataset size and increase diversity, producing high-quality training data.
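To make the curation step concrete, here is a minimal single-process sketch of one common approach: drop samples below a quality score, then greedily remove near-duplicates by embedding similarity. This is an illustration under stated assumptions, not the pipeline presented in the talk. The function names, the `score` and `embedding` fields, and both thresholds are hypothetical. In a distributed setting, a framework such as Ray would shard this kind of work across workers.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def curate(
    samples: List[Dict],
    min_score: float = 0.5,        # hypothetical quality threshold
    max_similarity: float = 0.95,  # hypothetical near-duplicate threshold
) -> List[Dict]:
    """Filter by score, then greedily drop near-duplicates.

    Each sample is assumed to carry a model-assigned "score" and an
    "embedding" vector (both field names are illustrative).
    """
    # Step 1: keep only samples that meet the quality threshold.
    candidates = [s for s in samples if s["score"] >= min_score]
    # Step 2: visit the best-scored samples first so they win ties.
    candidates.sort(key=lambda s: s["score"], reverse=True)
    kept: List[Dict] = []
    for sample in candidates:
        # Keep the sample only if it is not too similar to anything kept so far.
        if all(
            cosine(sample["embedding"], k["embedding"]) <= max_similarity
            for k in kept
        ):
            kept.append(sample)
    return kept


# Toy data: "b" is a near-duplicate of "a", and "d" has a low score.
samples = [
    {"id": "a", "score": 0.9, "embedding": [1.0, 0.0]},
    {"id": "b", "score": 0.8, "embedding": [0.99, 0.05]},
    {"id": "c", "score": 0.7, "embedding": [0.0, 1.0]},
    {"id": "d", "score": 0.2, "embedding": [0.5, 0.5]},
]
curated = curate(samples)
print([s["id"] for s in curated])  # ['a', 'c']
```

The greedy pass compares each candidate with every kept sample, so it is quadratic in the worst case. At the scale described here, an approximate nearest-neighbor index or cluster-based sampling would be the more likely choice.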

Slides for the talk can be found here

Tags: ray, multimodal, scalability, generative ai, dataset curation