Backend.AI: Open-source AI hyperscaler platform for everyone

Jeongkyu Shin

Speaker's bio

Jeongkyu Shin, founder of Lablup Inc., has made significant contributions to open-source projects over the past two decades, including his role as lead developer of the Textcube project. He serves on the Innovative Steering Board of the Northeast Asia Open Source Software Promotion Forum, is a machine learning expert in the Google Developer Experts program, and is a Google Cloud Champion Innovator. He also supports startup companies as a technical mentor at Google for Startups Accelerator.

He holds a Ph.D. in Statistical Physics from POSTECH, where his research focused on complex systems, agent-based models, and computational neuroscience. Jeongkyu has led machine learning and open-source projects in various companies and labs, specializing in text classification, entropy-related information compression, and contextual search. Since 2015, he has been developing Backend.AI, an open-source, hyper-scalable AI R&D platform (https://www.backend.ai).

Schedule

Track: Track 3
Date: Day 2
Time: 16:50 ~ 17:20

Session detail

In this talk, we introduce Backend.AI, an open-source hyper-scalable platform specialized in AI and high-performance computing, along with its orchestrator, Sokovan. We explore the design philosophy behind Backend.AI, which is engineered to manage and optimize the rapidly growing workloads of GPU and dedicated AI semiconductor-based systems. Additionally, we provide insights into the scheduling and hardware access architecture of Sokovan, the orchestrator that forms the backbone of Backend.AI.

Backend.AI is an AI hyper-scaler that fuses driver-level system call virtualization and open infrastructure, encompassing a compute node layer, AI/MLOps, and the ecosystem. The Sokovan orchestrator effectively tackles the challenges of running resource-intensive batch workloads in a containerized environment, handling workload management and distributed processing. It features acceleration-aware, multi-tenant, batch-oriented job scheduling and seamlessly integrates multiple hardware acceleration technologies across various system layers, unleashing their full potential.
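To make the idea of acceleration-aware, multi-tenant, batch-oriented scheduling concrete, here is a minimal sketch of one such scheduling pass. All names here (`Job`, `Cluster`, `schedule`, `tenant_quota`) are hypothetical illustrations for this abstract, not Backend.AI's or Sokovan's actual interfaces:

```python
from dataclasses import dataclass

# Hypothetical types for illustration only; Sokovan's real data model differs.

@dataclass
class Job:
    name: str
    tenant: str
    gpus_needed: int  # accelerator demand (GPUs or AI-chip units)

@dataclass
class Cluster:
    free_gpus: int    # accelerators currently available

def schedule(jobs, cluster, tenant_quota):
    """One greedy, quota-aware batch pass: admit a job only if its
    tenant stays under its accelerator quota and the cluster still
    has enough free accelerators; everything else waits."""
    used = {}                      # accelerators granted per tenant this pass
    admitted, waiting = [], []
    for job in jobs:
        granted = used.get(job.tenant, 0)
        if (granted + job.gpus_needed <= tenant_quota.get(job.tenant, 0)
                and job.gpus_needed <= cluster.free_gpus):
            cluster.free_gpus -= job.gpus_needed
            used[job.tenant] = granted + job.gpus_needed
            admitted.append(job.name)
        else:
            waiting.append(job.name)
    return admitted, waiting
```

A real orchestrator layers far more on top of this loop (priorities, preemption, topology-aware placement), but the quota-then-capacity check captures the multi-tenant batch admission idea in miniature.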

When used in tandem with Sokovan, Backend.AI enhances the performance of AI workloads, outperforming Slurm and other existing solutions. Its design allows container-based MLOps platforms to make more effective use of the latest hardware technologies. We showcase cases in which Backend.AI has successfully managed a diverse range of GPU workloads across various industries and discuss how it addresses challenges in AI training and service delivery.