yeonwoosung / mlops

Miscellaneous codes and writings for MLOps

License: GNU General Public License v3.0

JavaScript 0.12% HTML 0.27% Shell 0.07% Jupyter Notebook 97.05% Python 2.44% Dockerfile 0.01% Makefile 0.01% Gherkin 0.01% TypeScript 0.01% Java 0.01% Batchfile 0.01% CSS 0.01% PLpgSQL 0.01%

ai ai-as-a-service aws llm llm-inference llm-ops ml-serving mlops multimodal bentoml

mlops's Introduction

MLOps

GPU Recommendations

mlops's Issues

Learn Prefect

Let's learn how to use Prefect!

Blog post on Prefect usage

PyTorch performance tuning

PyTorch Performance Tuning

How Meta trains large language models at scale

Meta Engineering blog post

Meta requires massive computational power to train large language models (LLMs).
Traditional AI model training involved a large number of models, each of which needed a relatively small number of GPUs.
With the advent of generative AI (GenAI), there are fewer training jobs, but each job is very large.

Challenges of training large-scale models

Hardware reliability: Requires rigorous testing and quality control to minimize training disruption due to hardware failure.
Fast recovery in case of failure: Training must resume quickly when hardware fails, which requires low rescheduling overhead and fast training reinitialization.
Efficient preservation of training state: Need to be able to efficiently save and recover training state in the event of a failure.
Optimal connectivity between GPUs: Data transfer between GPUs is critical for large-scale model training. This requires high-speed network infrastructure and efficient data transfer protocols.
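The "efficient preservation of training state" and "fast recovery" points above can be illustrated with a small checkpoint-and-resume loop. This is a minimal sketch using only the Python standard library, not Meta's implementation: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON file format, and the `fail_at` parameter that simulates a hardware failure are all assumptions made for illustration. A real job would save model and optimizer state with its training framework instead.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` via a temp file plus rename, so a crash
    mid-write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint.

    `fail_at` (hypothetical) raises at that step index to mimic a
    hardware failure.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `train` again with the same path after a failure picks up from the last saved step, so only the work done since that checkpoint is repeated. The checkpoint interval trades write overhead against the amount of lost work.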

Improving all layers of the infrastructure stack is critical

Training software

Use open source software such as PyTorch so that researchers can move quickly from research to production.
Develop new algorithms and techniques for large-scale training, and integrate new software tools and frameworks.

Scheduling

Allocate and dynamically schedule resources based on the needs of each job, using complex algorithms to optimize resource usage.
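To make the idea of allocating resources by job need concrete, here is a deliberately simple sketch. It is not the scheduler described in the post: it assumes each job fits on a single node, uses a largest-job-first, best-fit heuristic chosen only for illustration, and the `Node`, `Job`, and `schedule` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int


@dataclass
class Job:
    name: str
    gpus: int


def schedule(jobs, nodes):
    """Place jobs on nodes, largest job first, best fit.

    Returns (placement, pending): a dict of job name -> node name, and
    a list of job names that could not be placed.
    """
    placement = {}
    pending = []
    for job in sorted(jobs, key=lambda j: j.gpus, reverse=True):
        candidates = [n for n in nodes if n.free_gpus >= job.gpus]
        if not candidates:
            pending.append(job.name)
            continue
        # Best fit: the node that leaves the fewest GPUs idle afterwards.
        best = min(candidates, key=lambda n: n.free_gpus)
        best.free_gpus -= job.gpus
        placement[job.name] = best.name
    return placement, pending
```

A production scheduler also has to handle jobs that span many nodes, network topology, priorities, and preemption, none of which this sketch attempts.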

Hardware

Requires high-performance hardware to handle large-scale model training.
Optimized existing hardware and modified the Grand Teton platform with NVIDIA H100 GPUs, increasing the TDP of the GPUs to 700W and switching to HBM3.

Data Center Placement

Optimized the use of data center resources (power, cooling, networking, etc.) through careful placement of GPUs and systems.
Meta deployed as many GPU racks as possible for maximum compute density.

Reliability

Detection and recovery plans are needed to minimize downtime when hardware fails.
Common failure modes: GPUs not detected by the host, DRAM and SRAM uncorrectable errors (UCEs), and hardware network cable issues.

Network

High-speed network infrastructure and efficient data transfer protocols are required for large-scale model training.
Built two clusters, one using RoCE and one using InfiniBand, to learn from operational experience with both.
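One way to see why network bandwidth matters so much is a back-of-the-envelope estimate of gradient synchronization time. The sketch below uses the standard bandwidth-only cost model for a ring all-reduce, in which each GPU sends and receives about 2(N-1)/N times the payload size. The post summarized here does not mention ring all-reduce, so this is an illustrative assumption. The function name is hypothetical and the model ignores per-message latency and congestion.

```python
def ring_allreduce_seconds(num_gpus, payload_bytes, link_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce.

    Each GPU transfers 2 * (N - 1) / N * payload_bytes over its link;
    latency terms are ignored.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if link_bytes_per_s <= 0:
        raise ValueError("link_bytes_per_s must be positive")
    if num_gpus == 1:
        return 0.0  # nothing to synchronize
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s
```

Under this model the per-GPU traffic approaches twice the payload as the number of GPUs grows, so synchronization time is governed mainly by link bandwidth rather than by GPU count.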

Storage

Invested in high-capacity, high-speed storage technologies for large-scale data storage and developed new data storage solutions for specific tasks.

Looking ahead

Meta expects to use hundreds of thousands of GPUs, handling larger volumes of data and coping with longer distances and latencies.
Meta plans to adopt new hardware technologies and GPU architectures and to evolve its infrastructure accordingly.
Meta will continue to explore the evolving AI landscape and push the boundaries of what is possible.

Hugging Face was hacked?! Wiz Research found architecture risks and helped improve security.

Wiz Research finds architecture risks that may compromise AI-as-a-Service providers and consequently risk customer data; works with Hugging Face on mitigations

Integrate Hugging Face with Spark for parallelism

Getting started with NLP using Hugging Face transformers pipelines

Load data from Spark into a Hugging Face Dataset with "from_spark"
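As a rough sketch of what the two links above cover, the helpers below wrap `Dataset.from_spark` from the `datasets` library and `pipeline` from `transformers`. The helper names `spark_df_to_hf_dataset` and `classify_texts` are hypothetical, and the choice of the `"sentiment-analysis"` task is an assumption made for illustration. Imports are done inside the functions so the module can be loaded without the heavy dependencies installed.

```python
def spark_df_to_hf_dataset(spark_df):
    """Convert a PySpark DataFrame into a Hugging Face Dataset."""
    # Lazy import: requires the `datasets` and `pyspark` packages.
    from datasets import Dataset

    return Dataset.from_spark(spark_df)


def classify_texts(texts):
    """Run a transformers sentiment-analysis pipeline over `texts`."""
    # Lazy import: requires the `transformers` package and a backend
    # such as PyTorch. With no model given, the pipeline picks a default
    # model for the task and downloads it on first use.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    return classifier(list(texts))
```

With an active `SparkSession` named `spark`, a typical flow would be `df = spark.createDataFrame([("I love this",), ("Not great",)], ["text"])`, then `ds = spark_df_to_hf_dataset(df)`, then `classify_texts(ds["text"])`.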
