Comments (20)
Yes, that is what I meant: the PR is more than a year old, but the sort-based fallback used when determinism is required is still there.
So if you see the perf regression only in deterministic mode, it could be that forcing upsample to float32 significantly impacts only the sort fallback.
from benchmark.
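For context, here is a minimal pure-Python sketch of the idea behind a sort-based deterministic fallback. This is a hypothetical illustration, not the actual PyTorch kernel: in upsample backward, several output-gradient elements accumulate into the same input element, and sorting contributions by target index fixes the reduction order that GPU atomic adds leave unspecified.

```python
# Hypothetical illustration, NOT the real PyTorch kernel: a deterministic
# scatter-add in the style of the sort-based fallback. Atomic adds
# accumulate in whatever order GPU threads arrive; sorting by target
# index makes the reduction order (and hence the result) fixed.

def scatter_add_deterministic(grads, idx, n):
    """Accumulate grads[i] into out[idx[i]] in a fixed (sorted) order."""
    out = [0.0] * n
    # Stable sort by target index, then reduce left to right.
    for j, g in sorted(zip(idx, grads), key=lambda p: p[0]):
        out[j] += g
    return out

# Two gradients land on index 0 and two on index 1.
print(scatter_add_deterministic([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1], 2))
# -> [4.0, 6.0]
```

The sort is the extra cost the thread is discussing: widening the data to float32 makes that sort (and the segmented reduction after it) work on larger elements, while the atomic-add path has no sort at all.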
@bhack pytorch/pytorch#121324 slows down pytorch_unet (45 ms -> 51 ms) with higher GPU memory usage. This model uses amp precision.
Is this expected?
/cc @albanD
It is expected, as the upsampling is now float32
with amp. Is there a way for you to test the torch.compile
perf with amp?
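As a quick aside on what "upsampling is now float32 with amp" means in eager mode: autocast decides per-op dtypes from casting-policy lists. A small sketch under assumptions (CPU autocast with bfloat16 stands in for the CUDA amp runs discussed in this thread; shapes are placeholders, not pytorch_unet):

```python
# Sketch under assumptions: CPU autocast with bfloat16 as a portable
# stand-in for CUDA amp. Convolutions are on the reduced-precision op
# list, so the conv output comes back in bfloat16; printing the dtype of
# the upsampled tensor shows which precision interpolate ran at on this
# particular PyTorch build.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
with torch.autocast("cpu", dtype=torch.bfloat16):
    y = F.conv2d(x, w, padding=1)                        # autocast: bfloat16
    z = F.interpolate(y, scale_factor=2, mode="bilinear")
print(y.dtype, z.dtype)  # z's dtype depends on the upsample cast policy
```

With the PR in question, upsample under autocast is kept in float32 rather than the reduced precision, which is where the eager-mode cost comes from.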
Not sure, @xuzhao9 are we running any compile-d perf measurement with amp?
This was the big topic on the PR thread, as we didn't see the same gradient limit with the compiled path.
So I suppose compiled mode doesn't need to be forced to float32
with amp, at least with the inputs we tested at:
pytorch/pytorch#121324 (comment)
Not sure, @xuzhao9 are we running any compile-d perf measurement with amp?
In this CI workflow we do not run compile; the PR affects the eager-mode (non-compiled) path.
For the compile-d workflow result we need to check the HUD: https://hud.pytorch.org/benchmark/compilers
For the compile-d workflow result we need to check the HUD: https://hud.pytorch.org/benchmark/compilers
Is there a way to isolate pytorch_unet
on that HUD page?
@bhack Check: https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Fri,%2019%20Apr%202024%2019:18:11%20GMT&stopTime=Fri,%2026%20Apr%202024%2019:18:11%20GMT&granularity=hour&mode=training&model=pytorch_unet&dtype=amp&lBranch=main&lCommit=59a1f1f308545e3ac1d81940a51f8dc0db3d82d4&rBranch=main&rCommit=b2f6cfd9c061a212cde8c8768fda41cc75a3110c
It seems we cannot isolate before and after the PR on that page, but on a coarse daily timescale I don't see the perf drop in the compiled version, so I think it could be ok.
Do we care about the final effect on a trained network in eager mode of:
pytorch/pytorch#121072
How is this increased precision going to impact accuracy in eager mode?
The problem is that we have now skipped the accuracy check, as it is not deterministic in eager mode.
@bhack One thing I noticed: in eager mode, the regression happens only on amp+inference. However, Inductor CI does not test this combination: https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Fri%2C+19+Apr+2024+19%3A18%3A11+GMT&stopTime=Fri%2C+26+Apr+2024+19%3A18%3A11+GMT&granularity=hour&mode=training&model=pytorch_unet&dtype=amp&lBranch=main&lCommit=59a1f1f308545e3ac1d81940a51f8dc0db3d82d4&rBranch=main&rCommit=b2f6cfd9c061a212cde8c8768fda41cc75a3110c.
It does not regress on train+amp or inference+bf16, but I am not sure about inference+amp, since there is no data.
eager mode, the regression happens only on amp+inference.
That is strange. Are you sure eager + amp training is also tested?
Is the ops list in the PR complete enough to cover the backward pass, or do we need to add something else?
@bhack Yes, we test both train and eval:
Left: 20240424
Right: 20240426
![image](https://private-user-images.githubusercontent.com/502017/326136006-67e02412-5f30-48b2-afd1-012de31420eb.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIwMjEyOTQsIm5iZiI6MTcyMjAyMDk5NCwicGF0aCI6Ii81MDIwMTcvMzI2MTM2MDA2LTY3ZTAyNDEyLTVmMzAtNDhiMi1hZmQxLTAxMmRlMzE0MjBlYi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNlQxOTA5NTRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jMWM0MDUwYzg1N2EwN2E3NjRlZWE2OTE3NmYzODE5OWUwYzJjNDJkZGJmMDBiYzdmMDg3MjNjY2M3MTgxMTAxJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.bpH-mluBWyPXALjgDxjmGdFdlYjqa4HeW2wkOT5LXPo)
![image](https://private-user-images.githubusercontent.com/502017/326136044-a5832208-e5a7-4dc3-b9b0-9653b4000e5a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIwMjEyOTQsIm5iZiI6MTcyMjAyMDk5NCwicGF0aCI6Ii81MDIwMTcvMzI2MTM2MDQ0LWE1ODMyMjA4LWU1YTctNGRjMy1iOWIwLTk2NTNiNDAwMGU1YS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNlQxOTA5NTRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04YTk4OTM1ZWQ0MDMyZTcyZGFmMzJiYmU5MjNmNTZjODgxMjZmOWIyNTI1YmVmZmQ3YjFlNzg4ZmQ1NDY5YTNjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.FB9h5NVkBAQMbBxO3KHrzaKhB2rsVlUn5CC4iXh4bQU)
Train has a much smaller regression than eval.
Is this because we have different torch.backends.cudnn.deterministic
setups for the train and eval tests? https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/pytorch_unet/__init__.py#L91C9-L91C43
I think the regression happens only when torch.backends.cudnn.deterministic = True.
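For reference, the determinism knobs being compared here are two separate switches (these are real PyTorch APIs; the CUBLAS environment variable shown in a comment is only needed for some CUDA ops):

```python
# The two determinism switches discussed above. cudnn.deterministic only
# constrains cuDNN kernel selection; use_deterministic_algorithms(True)
# additionally makes other ops (e.g. scatter-style backward kernels) take
# deterministic paths, or raise an error when none exists.
import torch

torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)
# warn_only=True downgrades the error for nondeterministic ops to a warning:
# torch.use_deterministic_algorithms(True, warn_only=True)
# Some CUDA ops additionally need: export CUBLAS_WORKSPACE_CONFIG=:4096:8

print(torch.are_deterministic_algorithms_enabled())  # True
```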
I think it is expected; see @lezcano's comment at pytorch/pytorch#121769 (comment).
But I don't know how it could be connected to my PR. If you git blame the sort fallback, can you check when it was introduced?
The deterministic fallback was merged on 29 March 2023:
pytorch/pytorch#96898
So I suppose that in pure eager mode we have a similar sort
fallback when we ask for determinism (I have not checked the source code, but I suppose it is there).
Is the sort fallback more heavily impacted by working at float32
than the non-fallback atomic_add
version?
This is the only explanation I have in mind for your regression appearing only in deterministic mode.
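A pure-Python aside (no GPU needed) on why accumulation order matters at all: floating-point addition is not associative, so atomic adds that land in a different order can produce bit-different results, which is exactly what the sorted fallback avoids.

```python
# Floating-point addition is not associative: summing the same values in
# two different orders gives different results. GPU atomic_add accumulates
# in thread-arrival order, which varies run to run; the sort-based
# fallback fixes the order and hence the result.
vals = [1e16, 1.0, -1e16, 1.0]

total_a = 0.0
for v in vals:                      # 1e16 + 1.0 rounds back to 1e16
    total_a += v

total_b = 0.0
for v in [1e16, -1e16, 1.0, 1.0]:  # cancel the big terms first
    total_b += v

print(total_a, total_b)  # 1.0 2.0
```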
@bhack The regression happened on 20240425, so it can't be pytorch/pytorch#96898
cc @malfet I am wondering what we should do about this regression - it looks like it is expected, since we now upsample in float32 under amp, and it only affects deterministic mode?
Closing this case, since it is the expected result of the upsampling precision change.
Marked the PR as bc-breaking to make sure we properly warn in the release notes.