Heterogeneous Training Cluster with Ray at Netflix
September 18, 2023 · Ray Summit 2023 · San Francisco
At Netflix, machine learning algorithms are at the heart of use cases such as recommendations, content understanding, content demand modeling, trailer and artwork generation, and other content creation work. Deep learning techniques can significantly help scale these use cases to entertain our members. The Machine Learning Platform team at Netflix builds the infrastructure and tools that make machine learning practitioners across the company more effective. We constantly strive to ensure that our machine learning models are trained and deployed in a reliable, scalable, and robust way.
Deep learning models have grown in complexity and require significantly more computational resources to train. In this talk, we explore the benefits of using Ray to build a heterogeneous training cluster and discuss the steps involved in setting one up. We demonstrate how to run distributed training jobs on a cluster with a mix of CPU and GPU instances, and show how Ray’s automatic resource allocation and management facilitates the scheduling of different types of workers. Additionally, we discuss the challenges and considerations that come with building and managing persistent clusters using Ray, and provide best practices for effective cluster configuration and management.
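To make the idea of scheduling different worker types concrete, here is a minimal sketch in Python. It is not the setup presented in the talk: the `WorkerSpec` class, the `data_loader` and `trainer` roles, and the worker counts are all hypothetical, chosen only to illustrate how per-task CPU and GPU requests let Ray place each worker on a node that has the matching resources. The only Ray calls used are `ray.init`, `ray.remote`, `.options(num_cpus=..., num_gpus=...)`, `ray.get`, and `ray.shutdown`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSpec:
    """Hypothetical description of one worker type in a mixed cluster."""
    role: str
    count: int
    num_cpus: int
    num_gpus: int


# Assumed example mix: CPU-only data loaders plus GPU trainers.
SPECS = [
    WorkerSpec(role="data_loader", count=4, num_cpus=1, num_gpus=0),
    WorkerSpec(role="trainer", count=2, num_cpus=2, num_gpus=1),
]


def total_demand(specs):
    """Sum the logical resources the whole job asks the scheduler for."""
    return {
        "CPU": sum(s.count * s.num_cpus for s in specs),
        "GPU": sum(s.count * s.num_gpus for s in specs),
    }


def run_on_ray(specs):
    """Launch one Ray task per worker, each with its own resource request."""
    import ray  # imported lazily so the planning helpers work without Ray

    demand = total_demand(specs)
    # On a single machine this declares *logical* resources, which lets the
    # sketch run without physical GPUs. On a real cluster you would connect
    # to the existing head node instead of declaring resources here.
    ray.init(num_cpus=demand["CPU"], num_gpus=demand["GPU"])

    @ray.remote
    def work(role, index):
        # Placeholder for real data loading or training logic.
        return f"{role}-{index}"

    refs = [
        work.options(num_cpus=s.num_cpus, num_gpus=s.num_gpus).remote(s.role, i)
        for s in specs
        for i in range(s.count)
    ]
    results = ray.get(refs)
    ray.shutdown()
    return results
```

With Ray installed, calling `run_on_ray(SPECS)` submits six tasks. Ray only starts a `trainer` task where a GPU is available, while `data_loader` tasks can be placed on CPU-only nodes.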
Session https://raysummit23.anyscale.com/agenda/sessions/154
Slides for the talk can be found here
Tags: ray, training, cluster, scalability, machine learning