Talks on Share what you know

Powering Netflix's Multimodal feature engineering at scale. Data Engineering forum 2026. San Francisco

Thu, 16 Apr 2026 12:00:00 +0000

ABSTRACT:

As multimodal models mature, the challenge increasingly shifts from model architecture to feature engineering and dataset construction at scale. In this talk, we’ll share how Netflix builds and curates multimodal features across large video and image corpora, with LanceDB serving as the core storage and query layer for multimodal data.

We’ll briefly cover how Ray powers distributed ingestion, filtering, and large-scale batch inference across hundreds of GPUs, enabling the application of modern vision-language models to extract rich multimodal embeddings from video and image data. These embeddings capture both low-level visual signals and higher-level semantic context, forming the foundation for downstream tasks such as search, retrieval, and dataset curation.

Tackling Multimodal Data: How Netflix Builds Machine Learning Datasets at Scale

Fri, 21 Nov 2025 12:00:00 +0000

Constructing and curating multimodal datasets at scale was a challenging task until recently. We will talk about how Netflix uses Ray to build massive multimodal datasets for text-to-image research. We’ll show how Ray’s distributed processing fans out data ingestion and filtering across hundreds of GPUs, how we run batch inference at scale with cutting-edge vision-language models to score and caption images and videos, and how smart curation and sampling help reduce the size and increase the diversity of datasets, producing high-quality training data.
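To make the "reduce the size, increase the diversity" idea concrete, here is a minimal sketch of one common diversity-sampling technique: greedy farthest-point sampling over embeddings. This is an illustration only, not the method presented in the talk. The function name `farthest_point_sample` and the toy 2-D points are assumptions made for the example.

```python
import math
from typing import List, Sequence


def farthest_point_sample(embeddings: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick up to k indices whose embeddings are spread far apart.

    Hypothetical helper for illustration; real pipelines would work on
    high-dimensional model embeddings and use vectorized distance code.
    """
    n = len(embeddings)
    if n == 0 or k <= 0:
        return []
    k = min(k, n)
    selected = [0]  # seed with the first item; any seed works
    chosen = {0}
    # Distance from every item to its nearest already-selected item.
    nearest = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(selected) < k:
        # Among unselected items, take the one farthest from everything chosen so far.
        idx = max((i for i in range(n) if i not in chosen), key=nearest.__getitem__)
        selected.append(idx)
        chosen.add(idx)
        for i, e in enumerate(embeddings):
            nearest[i] = min(nearest[i], math.dist(e, embeddings[idx]))
    return selected


# Toy 2-D "embeddings": three near-duplicates around the origin and two outliers.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (-4.0, 3.0)]
print(farthest_point_sample(points, 3))  # prints [0, 3, 4]
```

The near-duplicates around the origin are represented by a single pick, while both outliers are kept, so the subset is smaller but covers more of the embedding space.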

Scaling Multimodal Data Curation with Ray and LanceDB

Wed, 05 Nov 2025 12:00:00 +0000

At Ray Summit 2025, Pablo Delgado from Netflix and Lei Xu from LanceDB share how they are transforming the construction and curation of massive multimodal datasets—traditionally a complex and resource-intensive process—into a scalable, efficient, and highly automated pipeline.

They explain how Netflix leverages Ray for distributed ingestion, filtering, and large-scale inference across enormous video and image corpora, while LanceDB serves as the high-performance storage and query layer that provides a single source of truth throughout the data curation lifecycle.

Evolving Netflix's Ray Platform for the GenAI Era. Highlight Talk

Tue, 01 Oct 2024 12:00:00 +0000

The generative AI revolution has transformed the world of large-scale deep learning infrastructure. Modern machine learning platforms must be ready to support pre-training for massive foundation models, memory-intensive fine-tuning for LLMs and diffusion models, as well as low-latency deployments for multi-billion-parameter models.

Navigating this emerging landscape requires new techniques and methodologies, leavened with a thorough understanding of the still-nascent GenAI tooling ecosystem. In this talk, we’ll walk through how we’ve adapted and extended Netflix’s production Ray platform to deal with these new challenges.

Heterogeneous Training Cluster with Ray at Netflix

Mon, 18 Sep 2023 12:00:00 +0000

At Netflix, machine learning algorithms are at the heart of various use cases such as recommendations, content understanding, content demand modeling, trailer and artwork generation, and various other content creation use cases. Scaling these use cases to entertain our members can benefit significantly from deep learning techniques. The Machine Learning Platform team at Netflix is tasked with constructing the necessary infrastructure and tools to optimize the effectiveness of all machine learning practitioners across the company. We are constantly striving to ensure that our machine learning models are trained and deployed in a reliable, scalable and robust way.

Multi-tenant Spark workflows in Auto Scalable Mesos clusters

Tue, 06 Nov 2018 12:00:00 +0000

Recommendation algorithms have been the core of the Netflix product from very early on. Because of their importance, we continually seek to run our machine learning workflows in a reliable, scalable and robust manner.

We will present our design choices on building a Mesos-centric multi-tenant architecture for running Spark-based machine learning workflows that power the algorithms behind Netflix recommendations. We will also share our experience using the auto-scaling capabilities of Amazon Web Services to dynamically change the size of our clusters to support the allocation of thousands of Spark jobs running daily. We will discuss how we are leveraging Apache Spark to deploy batch jobs as well as the interactive use of Zeppelin Notebooks efficiently in this shared environment.

Mesos at OpenTable

Thu, 20 Aug 2015 12:00:00 +0000

OpenTable has been using Apache Mesos for production workloads and for running critical parts of its production services for more than a year.

Not only did Mesos help deploy resilient, elastic standalone applications and services, but also distributed, fault-tolerant frameworks like Apache Spark for data processing and machine learning. Mesos enabled OpenTable to run multiple distributed applications across the same infrastructure at scale.

Pablo will tell the story of how OpenTable started with Mesos, the pain points of dealing with a hybrid Mesos + non-Mesos environment, and how to survive the transition.

Using data science to create a dining expert

Mon, 15 Jun 2015 12:00:00 +0000

We can build expert knowledge of cities with our corpus of unstructured reviews.

OpenTable helps diners find the best dining experiences, wherever they travel. Tastes vary widely between our diners, however, so we need to personalize our recommendations to find restaurants which can provide great dining experiences. Fortunately, we have more than fifteen million unstructured reviews which we can use to build models which improve the accuracy of our recommendations.

Neo4j for Ruby on Rails

Fri, 05 Nov 2010 12:00:00 +0000

“Neo4j is a graph database. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development.”

Neo4j allows you to map objects to nodes and relations, which is a more natural fit than mapping them to relational tables. Modeling with elements of a graph is substantially faster for semistructured data (recall that semistructured data is data that has few mandatory but many optional attributes).

Cassandra and Ruby

Fri, 27 Nov 2009 17:30:00 +0000

I spoke about the Apache Cassandra database at the Spanish Rails Conference 2009.

The title of my talk was Cassandra DB: ¿Qué tienen Facebook, Twitter y Digg en común? (“Cassandra DB: What do Facebook, Twitter and Digg have in common?”)

Here are some photos of the talk:

Euruko 2009 is over

Mon, 11 May 2009 12:00:00 +0000

The Euruko 2009 conference in Barcelona, Spain, was excellent! The venue was really good. Everything was very well organized by the great people of the SRUG.

My favorite talks were:

Javier Ramirez with Fun with ruby (and without r***s) Program your own games with gosu
Joshua Sierles with Automate Everything: Cooking with Chef
Aslak Hellesøy with Quality code with Cucumber

Here are some photos I took:

The rest of the photos can be seen on Flickr; just search for the tag #euruko2009.

Rails Scalability

Fri, 23 Nov 2007 12:00:00 +0000

Intro

I escaped London for a few days to go to Conferencia Rails 2007 in Madrid. There I gave a talk about the much-ranted-about topic of Rails scalability.

The title of the talk in Spanish was “Escalabilidad y las cosas de las que nadie se atrevio a hablar” (“Scalability and the things nobody dared to talk about”).

Summary of the talk:

Architecture and typical Rails deployment configurations.
Use of Nginx as a static assets server.
Mongrel and Evented Mongrel.
Multithreaded image uploads with Mongrel and/or Merb (instead of attachment_fu).
ActiveRecord optimizations (hacks, active_record_context plugin).
Caches, passive expiration. Cache-observing daemons.
Configuration and monitoring of a production server (monit, munin tools).
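
As a rough illustration of the "passive expiration" item above, here is a minimal sketch under one common reading of the term: entries carry a time-to-live and are only checked for staleness when they are read, with no sweeper or background job. It is written in Python rather than Ruby and is not taken from the talk. The class name `PassiveExpiryCache` and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class PassiveExpiryCache:
    """Cache whose entries expire lazily: staleness is checked only on read."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be demonstrated without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Record the absolute expiry time alongside the value.
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale entry: drop it now, at read time, and report a miss.
            del self._store[key]
            return None
        return value


# Demonstration with a fake clock (a one-element list we can advance by hand).
now = [0.0]
cache = PassiveExpiryCache(ttl_seconds=60, clock=lambda: now[0])
cache.set("home_page", "<html>...</html>")
print(cache.get("home_page"))  # still fresh, so the cached value is returned
now[0] = 61.0
print(cache.get("home_page"))  # past the TTL, so the read evicts it and returns None
```

The trade-off is that stale entries occupy memory until someone asks for them, which is where a separate cache-observing daemon can help.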

Here is the video of the talk, and of course, the slides.