Multimodal on Share what you know

Powering Netflix's Multimodal Feature Engineering at Scale. Data Engineering forum 2026. San Francisco

Thu, 16 Apr 2026 12:00:00 +0000

ABSTRACT:

As multimodal models mature, the challenge increasingly shifts from model architecture to feature engineering and dataset construction at scale. In this talk, we’ll share how Netflix builds and curates multimodal features across large video and image corpora, with LanceDB serving as the core storage and query layer for multimodal data.

We’ll briefly cover how Ray powers distributed ingestion, filtering, and large-scale batch inference across hundreds of GPUs, enabling the application of modern vision-language models to extract rich multimodal embeddings from video and image data. These embeddings capture both low-level visual signals and higher-level semantic context, forming the foundation for downstream tasks such as search, retrieval, and dataset curation.

Tackling Multimodal Data: How Netflix Builds Machine Learning Datasets at Scale

Fri, 21 Nov 2025 12:00:00 +0000

Constructing and curating multimodal datasets at scale was, until recently, a challenging task. We will talk about how Netflix uses Ray to build massive multimodal datasets for text-to-image research. We’ll show how Ray’s distributed processing fans out data ingestion and filtering across hundreds of GPUs, how we run batch inference at scale with cutting-edge vision-language models to score and caption images and videos, and how smart curation and sampling reduce dataset size and increase diversity, producing high-quality training data.
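The abstract does not say which curation or sampling method is used, so the following is only an illustrative sketch of the general idea of "filter by score, then sample for diversity". It assumes each item already carries a quality score and an embedding (for example, produced by a vision-language model), and it uses greedy farthest-point sampling with cosine distance as a stand-in technique. The names `curate`, `cosine_distance`, `min_score`, and the item fields are hypothetical and do not come from the talk.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def curate(items, min_score, k):
    """Keep items scoring at least min_score, then pick up to k diverse ones.

    items: list of dicts with hypothetical keys 'id', 'score', 'embedding'.
    Selection is greedy farthest-point sampling (an assumption, not the
    method described in the talk).
    """
    # Step 1: quality filter.
    kept = [it for it in items if it["score"] >= min_score]
    if not kept or k <= 0:
        return []

    # Step 2: seed the selection with the highest-scoring item.
    seed = max(kept, key=lambda it: it["score"])
    selected = [seed]
    remaining = [it for it in kept if it is not seed]

    # Distance from each remaining item to its nearest selected item.
    nearest = [cosine_distance(it["embedding"], seed["embedding"])
               for it in remaining]

    # Step 3: repeatedly add the item farthest from everything selected so far.
    while remaining and len(selected) < k:
        idx = max(range(len(remaining)), key=nearest.__getitem__)
        chosen = remaining.pop(idx)
        nearest.pop(idx)
        selected.append(chosen)
        nearest = [min(d, cosine_distance(it["embedding"], chosen["embedding"]))
                   for d, it in zip(nearest, remaining)]
    return selected
```

With a near-duplicate pair in the input, a sampler like this tends to keep one of the two and spend the remaining budget on items that look different, which is one way a curated set can shrink while gaining diversity. At the scale described in the talk, the pairwise distance work would presumably be distributed or approximated rather than run in a single Python loop.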

Scaling Multimodal Data Curation with Ray and LanceDB

Wed, 05 Nov 2025 12:00:00 +0000

At Ray Summit 2025, Pablo Delgado from Netflix and Lei Xu from LanceDB share how they are transforming the construction and curation of massive multimodal datasets, traditionally a complex and resource-intensive process, into a scalable, efficient, and highly automated pipeline.

They explain how Netflix leverages Ray for distributed ingestion, filtering, and large-scale inference across enormous video and image corpora, while LanceDB serves as the high-performance storage and query layer that provides a single source of truth throughout the data curation lifecycle.

Evolving Netflix's Ray Platform for the GenAI Era. Highlight Talk

Tue, 01 Oct 2024 12:00:00 +0000

The generative AI revolution has transformed the world of large-scale deep learning infrastructure. Modern machine learning platforms must be ready to support pre-training for massive foundation models, memory-intensive fine-tuning for LLMs and diffusion models, as well as low-latency deployments for multi-billion-parameter models.

Navigating this emerging landscape requires new techniques and methodologies, leavened with a thorough understanding of the still-nascent GenAI tooling ecosystem. In this talk, we’ll walk through how we’ve adapted and extended Netflix’s production Ray platform to deal with these new challenges.