How Meta trains large language models at scale

Source: Meta engineering blog post

  • Meta needs massive computational power to train large language models (LLMs).
  • Traditional AI model training ran a large number of models, each needing a relatively small number of GPUs.
  • Generative AI (GenAI) shifts this to fewer training jobs, each of them very large.

Challenges of training large-scale models

  • Hardware reliability: rigorous testing and quality control are needed to minimize training interruptions caused by hardware failures.
  • Fast recovery on failure: when hardware fails, the job must recover quickly. This means low rescheduling overhead and fast training reinitialization.
  • Efficient preservation of training state: the training state must be saved and restored efficiently so that a failure loses as little work as possible.
  • Optimal connectivity between GPUs: large-scale training depends heavily on data transfer between GPUs, which requires high-speed network infrastructure and efficient data transfer protocols.
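
The "fast recovery" and "preservation of training state" points can be illustrated with a minimal checkpoint-and-resume sketch. It is not Meta's implementation. The names `save_checkpoint`, `load_checkpoint`, and `train`, the JSON file format, and the `fail_at` parameter are assumptions made for illustration only. A real job would save model and optimizer state with its framework's own tools.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(state: dict, path: str) -> None:
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write never leaves a torn checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or a copy of `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as f:
        return json.load(f)


def train(total_steps: int, path: str, every: int = 10,
          fail_at: Optional[int] = None) -> dict:
    """Toy training loop that resumes from the last checkpoint.

    `fail_at` (hypothetical) simulates a hardware failure at a given step.
    """
    state = load_checkpoint(path, {"step": 0, "loss_sum": 0.0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for real work
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

After a failure, calling `train` again with the same path resumes from the last saved step instead of step 0. The checkpoint interval `every` trades the cost of saving against the amount of work lost per failure.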

Improving all layers of the infrastructure stack is critical

Training software

  • Use open source software such as PyTorch so that researchers can move quickly from research to production.
  • Develop new algorithms and techniques for large-scale training, and integrate new software tools and frameworks.

Scheduling

  • Allocate and dynamically schedule resources according to each job's needs, using complex algorithms to optimize resource usage.
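
As a rough illustration of allocating GPUs to jobs, here is a small greedy, all-or-nothing scheduler. It is a toy sketch, not the algorithm Meta uses. The `Job` class, its `priority` field, and the host names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Job:
    name: str
    gpus: int          # number of GPUs the job needs, all at once
    priority: int = 0  # higher runs first (hypothetical field)


def schedule(jobs: List[Job], hosts: Dict[str, int]
             ) -> Tuple[Dict[str, Dict[str, int]], List[str]]:
    """Place jobs on hosts greedily; a job gets all its GPUs or waits."""
    free = dict(hosts)  # host name -> free GPU count
    placements: Dict[str, Dict[str, int]] = {}
    waiting: List[str] = []
    # Highest priority first; among equals, larger jobs first.
    for job in sorted(jobs, key=lambda j: (-j.priority, -j.gpus)):
        if sum(free.values()) < job.gpus:
            waiting.append(job.name)
            continue
        need = job.gpus
        alloc: Dict[str, int] = {}
        # Prefer hosts with the most free GPUs to span as few hosts as possible.
        for host in sorted(free, key=lambda h: -free[h]):
            if need == 0:
                break
            take = min(free[host], need)
            if take:
                alloc[host] = take
                need -= take
        for host, n in alloc.items():
            free[host] -= n
        placements[job.name] = alloc
    return placements, waiting
```

A production scheduler would also weigh network topology, preemption, and fairness, which this sketch ignores.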

Hardware

  • Large-scale model training requires high-performance hardware.
  • Meta optimized its existing hardware: it modified the Grand Teton platform for NVIDIA H100 GPUs, raised the GPU TDP to 700W, and moved to HBM3 memory.

Data Center Placement

  • Placed GPUs and systems within the data center so as to make the best use of resources such as power, cooling, and networking.
  • Deployed as many GPU racks as possible to maximize compute density.

Reliability

  • Detection and recovery plans are in place to minimize downtime when hardware fails.
  • Common failure modes: GPUs not recognized by the host, DRAM and SRAM uncorrectable errors (UCEs), and hardware network cable faults.
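
One way to picture a recovery plan is a routine that swaps failed hosts for healthy spares before the job restarts. The sketch below is illustrative only. The `Failure` enum mirrors the failure modes listed above, while `recover`, the spare pool, and the host names are hypothetical.

```python
from enum import Enum
from typing import List, Tuple


class Failure(Enum):
    GPU_NOT_RECOGNIZED = "gpu_not_recognized"
    DRAM_SRAM_UCE = "dram_sram_uce"
    NETWORK_CABLE = "network_cable"


def recover(job_hosts: List[str],
            failures: List[Tuple[str, Failure]],
            spares: List[str]) -> Tuple[List[str], List[str]]:
    """Replace failed hosts with spares; return (new host list, repair queue)."""
    bad = {host for host, _ in failures}
    pool = [s for s in spares if s not in bad]  # skip spares that also failed
    new_hosts: List[str] = []
    repair_queue: List[str] = []
    for host in job_hosts:
        if host not in bad:
            new_hosts.append(host)
            continue
        repair_queue.append(host)
        if not pool:
            raise RuntimeError("not enough healthy spares to restart the job")
        new_hosts.append(pool.pop(0))
    return new_hosts, repair_queue
```

Keeping the host count unchanged lets the job restart from its last checkpoint with the same parallelism layout.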

Network

  • Large-scale model training requires high-speed network infrastructure and efficient data transfer protocols.
  • Meta built two clusters, one on RoCE and one on InfiniBand, to gain operational experience with both network fabrics.
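
To see why link bandwidth matters, a back-of-the-envelope estimate helps. The sketch below uses the standard bandwidth cost of a ring all-reduce, in which each GPU sends 2(N-1)/N times the payload. The source does not say which collective algorithm is used, so the ring model is an assumption. The function name is hypothetical, and latency and compute overlap are ignored.

```python
def ring_allreduce_seconds(num_gpus: int, payload_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only time estimate for one ring all-reduce."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    # Reduce-scatter and all-gather each move (N-1)/N of the payload per GPU.
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s
```

For example, `ring_allreduce_seconds(4, 1e9, 1e9)` gives 1.5 seconds. As the GPU count grows, per-GPU traffic approaches twice the payload, so sustained link bandwidth bounds every synchronization step.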

Storage

  • Meta invested in high-capacity, high-speed storage technologies to hold data at this scale, and developed new storage solutions for specific workloads.

Looking ahead

  • Meta expects to work with hundreds of thousands of GPUs, handle larger volumes of data, and cope with longer distances and latencies.
  • It plans to adopt new hardware technologies and GPU architectures and to evolve its infrastructure accordingly.
  • It will continue to explore the evolving AI landscape and push the boundaries of what is possible.
