Scalability on Share what you know

Tackling Multimodal Data: How Netflix Builds Machine Learning Datasets at Scale

Fri, 21 Nov 2025 12:00:00 +0000

Constructing and curating multimodal datasets at scale was a challenging task until recently. We will talk about how Netflix uses Ray to build massive multimodal datasets for text-to-image research. We’ll show how Ray’s distributed processing fans out data ingestion and filtering across hundreds of GPUs, how we run batch inference at scale with cutting-edge vision-language models to score and caption images and videos, and how smart curation and sampling reduce the size and increase the diversity of datasets, producing high-quality training data.

Scaling Multimodal Data Curation with Ray and LanceDB

Wed, 05 Nov 2025 12:00:00 +0000

At Ray Summit 2025, Pablo Delgado from Netflix and Lei Xu from LanceDB share how they are transforming the construction and curation of massive multimodal datasets—traditionally a complex and resource-intensive process—into a scalable, efficient, and highly automated pipeline.

They explain how Netflix leverages Ray for distributed ingestion, filtering, and large-scale inference across enormous video and image corpora, while LanceDB serves as the high-performance storage and query layer that provides a single source of truth throughout the data curation lifecycle.

Evolving Netflix's Ray Platform for the GenAI Era. Highlight Talk

Tue, 01 Oct 2024 12:00:00 +0000

The generative AI revolution has transformed the world of large-scale deep learning infrastructure. Modern machine learning platforms must be ready to support pre-training for massive foundation models, memory-intensive fine-tuning for LLMs and diffusion models, as well as low-latency deployments for multi-billion-parameter models.

Navigating this emerging landscape requires new techniques and methodologies, leavened with a thorough understanding of the still-nascent GenAI tooling ecosystem. In this talk, we’ll walk through how we’ve adapted and extended Netflix’s production Ray platform to deal with these new challenges.

Heterogeneous Training Cluster with Ray at Netflix

Mon, 18 Sep 2023 12:00:00 +0000

At Netflix, machine learning algorithms are at the heart of use cases such as recommendations, content understanding, content demand modeling, trailer and artwork generation, and various other content creation tasks. Deep learning techniques can contribute significantly to scaling these use cases to entertain our members. The Machine Learning Platform team at Netflix is tasked with building the infrastructure and tools that make all machine learning practitioners across the company more effective. We constantly strive to ensure that our machine learning models are trained and deployed in a reliable, scalable and robust way.

Multi-tenant Spark workflows in Auto Scalable Mesos clusters

Tue, 06 Nov 2018 12:00:00 +0000

Recommendation algorithms have been the core of the Netflix product from very early on. Because of their importance, we continually seek to run our machine learning workflows in a reliable, scalable and robust manner.

We will present our design choices in building a Mesos-centric multi-tenant architecture for running the Spark-based machine learning workflows that power the algorithms behind Netflix recommendations. We will also share our experience using the auto-scaling capabilities of Amazon Web Services to dynamically resize our clusters to support the thousands of Spark jobs that run daily. We will discuss how we use Apache Spark to deploy batch jobs, and how we support the interactive use of Zeppelin notebooks efficiently in this shared environment.

Mesos at OpenTable

Thu, 20 Aug 2015 12:00:00 +0000

OpenTable has been using Apache Mesos for production workloads, including critical parts of its production services, for more than a year.

Mesos helped not only with deploying resilient, elastic standalone applications and services, but also with running distributed, fault-tolerant frameworks such as Apache Spark for data processing and machine learning. Mesos enabled OpenTable to run multiple distributed applications across the same infrastructure at scale.

Pablo will tell the story of how OpenTable started with Mesos, the pain points of dealing with a hybrid Mesos + non-Mesos environment, and how to survive the transition.

Rails Scalability

Fri, 23 Nov 2007 12:00:00 +0000

Intro

I escaped London for a few days to go to Conferencia Rails 2007 in Madrid. There I gave a talk about the much-ranted-about topic of Rails scalability.

The title of the talk in Spanish was “Escalabilidad y las cosas de las que nadie se atrevió a hablar” (“Scalability and the things nobody dared to talk about”).

Summary of the talk:

Architecture and typical Rails deployment configurations.
Use of Nginx as a static assets server.
Mongrel and Evented Mongrel.
Multithreaded image uploads with Mongrel and/or Merb (instead of attachment_fu).
ActiveRecord optimizations (hacks, active_record_context plugin).
Caches, passive expiration, and cache-observing daemons.
Configuration and monitoring of a production server (monit and munin tools).

Here is the video of the talk, and of course, the slides.