Comments (9)
Commands for ViT Hybrid Parallel updated in this PR
from colossalai.
Hi @kurisusnowdeng, can you fix this issue in the benchmark repository? I have fixed the one in the example repository.
@FrankLeeeee In the benchmarks it is basically consistent: add `--from_torch` if using `torchrun`. Otherwise, Colossal-AI launches in the standard way. However, in my opinion, we should make Docker the first choice for running benchmarks and examples, so that it is also easier to keep the environment consistent. What do you think?
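As a rough illustration of the `--from_torch` switch described above, here is a minimal sketch of how an entry script might branch on that flag. Only the flag name comes from the thread; `parse_args`, `choose_launcher`, and the returned labels are hypothetical names for illustration, and the actual benchmark scripts may be organised differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line of a hypothetical benchmark entry script."""
    parser = argparse.ArgumentParser(description="Hypothetical benchmark entry point")
    # --from_torch is the flag mentioned in the thread: pass it when the
    # script is started by torchrun.
    parser.add_argument(
        "--from_torch",
        action="store_true",
        help="set when the script is started by torchrun",
    )
    return parser.parse_args(argv)


def choose_launcher(args):
    """Return a label for the launch path the script would take.

    Assumption: a real script would call colossalai.launch_from_torch in the
    "torch" branch and its usual launch routine in the "standard" branch.
    """
    return "torch" if args.from_torch else "standard"


# Explicit argument lists are used so the sketch does not depend on sys.argv.
print(choose_launcher(parse_args(["--from_torch"])))
print(choose_launcher(parse_args([])))
```

Keeping the decision in a small function like this makes it easy to document one sample command per launcher in the README, which is the clarity concern raised in this thread.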
@kurisusnowdeng Can you fix the benchmark README?
I think what @binmakeswell means is that we should provide sample commands for the different launchers, for clarity. I am OK with Docker if its purpose is to give users an environment with pre-installed dependencies. The problem with Docker is that it can only run on a single node if we provide a pre-defined entry-point command. In a multi-node environment, we still need to use `srun` or `mpirun` to start the container, and this may conflict with the entry-point command.
It seems @binmakeswell is mainly concerned that users don't know how to use the Python commands with Slurm. But I think Docker may already be the most convenient way for users to run our code. Also, we already have a tutorial that shows how to use Slurm; maybe what we need to do is make that tutorial cover more cases, rather than explain how to run Slurm everywhere.
I think adding a link to the guide on launching Colossal-AI will do. We have provided a Dockerfile in the Colossal-AI repository. Do you mean changing the Docker entry-point command for the examples?
Yes. Maybe we can provide a Dockerfile that packages each example. Then users just build and run the image.
OK. My opinion is that a Dockerfile is usually for complex environment setups. If an example requires a complicated setup, then a Dockerfile will be good.