Scaling Multimodal Data Curation with Ray and LanceDB

November 5, 2025 · Ray Summit 2025 · San Francisco

At Ray Summit 2025, Pablo Delgado from Netflix and Lei Xu from LanceDB share how they turn the construction and curation of massive multimodal datasets, traditionally a complex and resource-intensive process, into a scalable, efficient, and highly automated pipeline.

They explain how Netflix leverages Ray for distributed ingestion, filtering, and large-scale inference across enormous video and image corpora, while LanceDB serves as the high-performance storage and query layer that provides a single source of truth throughout the data curation lifecycle.

Pablo and Lei walk through how Ray fans out distributed processing across hundreds of GPUs to accelerate ingestion and filtering. They detail how Netflix runs batch inference at scale with vision-language models to score and caption images and videos, and how LanceDB’s columnar design supports curation and sampling that shrink the dataset while increasing its diversity, yielding higher-quality training data.
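
To make "reducing dataset size while increasing diversity" concrete, here is a minimal sketch of one common approach: drop low-scoring samples, then pick a small subset greedily by farthest-point sampling over embeddings. This is an illustration only, not the method described in the talk. The record fields (`id`, `score`, `embedding`), the function name `curate_diverse`, and the threshold are all hypothetical, and the sketch uses plain Python rather than Ray or LanceDB so that it runs on its own.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def curate_diverse(records, min_score, k):
    """Filter by quality score, then greedily select up to k mutually distant samples.

    Assumed (hypothetical) record shape:
    {"id": str, "score": float, "embedding": list[float]}
    """
    # Step 1: quality filter, e.g. on a score produced by a model.
    kept = [r for r in records if r["score"] >= min_score]
    if not kept or k <= 0:
        return []

    # Step 2: seed the selection with the highest-scoring sample.
    first = max(range(len(kept)), key=lambda i: kept[i]["score"])
    selected = [first]

    # nearest[i] is the distance from sample i to its closest selected sample.
    nearest = [cosine_distance(r["embedding"], kept[first]["embedding"]) for r in kept]
    nearest[first] = -math.inf  # never pick the same sample twice

    # Step 3: repeatedly add the sample farthest from everything chosen so far.
    while len(selected) < min(k, len(kept)):
        nxt = max(range(len(kept)), key=lambda i: nearest[i])
        selected.append(nxt)
        for i, r in enumerate(kept):
            d = cosine_distance(r["embedding"], kept[nxt]["embedding"])
            nearest[i] = min(nearest[i], d)
        nearest[nxt] = -math.inf

    return [kept[i]["id"] for i in selected]


# Toy data: "a" and "b" are near-duplicates, "c" is distinct, "d" is low quality.
sample = [
    {"id": "a", "score": 0.9, "embedding": [1.0, 0.0]},
    {"id": "b", "score": 0.8, "embedding": [0.99, 0.1]},
    {"id": "c", "score": 0.7, "embedding": [0.0, 1.0]},
    {"id": "d", "score": 0.2, "embedding": [-1.0, 0.0]},
]

print(curate_diverse(sample, min_score=0.5, k=2))  # ['a', 'c']
```

With a budget of two, the sketch keeps "a" and "c" and skips the near-duplicate "b", which is the size-versus-diversity trade the talk describes. At production scale the score and embedding columns would presumably be read from a columnar store and the distance computation distributed, but those details are not given in the abstract.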

Attendees will gain practical insights into building scalable, high-performance pipelines for multimodal dataset construction in both research and production environments.

Session

Slides for the talk can be found here

Tags: ray, multimodal, scalability, generative ai, dataset curation