Comments (20)
Yes, that is what I meant: the PR is more than a year old, but the sort-based fallback used when determinism is required is still there.
So if you see the perf regression only in deterministic mode, it could be that forcing upsample to float32 significantly impacts only the sort fallback.
from benchmark.
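For context, here is a minimal pure-Python sketch of the idea behind a sort-based deterministic fallback. This is a hypothetical illustration, not the actual PyTorch kernel: in upsample backward, several output-gradient elements accumulate into the same input element, and sorting contributions by target index fixes the reduction order that GPU atomic adds leave unspecified.

```python
# Hypothetical illustration, NOT the real PyTorch kernel: a deterministic
# scatter-add in the style of the sort-based fallback. Atomic adds
# accumulate in whatever order GPU threads arrive; sorting by target
# index makes the reduction order (and hence the result) fixed.

def scatter_add_deterministic(grads, idx, n):
    """Accumulate grads[i] into out[idx[i]] in a fixed (sorted) order."""
    out = [0.0] * n
    # Stable sort by target index, then reduce left to right.
    for j, g in sorted(zip(idx, grads), key=lambda p: p[0]):
        out[j] += g
    return out

# Two gradients land on index 0 and two on index 1.
print(scatter_add_deterministic([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1], 2))
# -> [4.0, 6.0]
```

The sort is the extra cost the thread is discussing: widening the data to float32 makes that sort (and the segmented reduction after it) work on larger elements, while the atomic-add path has no sort at all.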
@bhack pytorch/pytorch#121324 slows down pytorch_unet (45 ms -> 51 ms) with higher GPU memory usage. This model uses amp precision.
Is this expected?
/cc @albanD
It is expected, as the upsampling is now float32
with amp. Is there a way for you to test the torch.compile
perf with amp?
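As a quick aside on what "upsampling is now float32 with amp" means in eager mode: autocast decides per-op dtypes from casting-policy lists. A small sketch under assumptions (CPU autocast with bfloat16 stands in for the CUDA amp runs discussed in this thread; shapes are placeholders, not pytorch_unet):

```python
# Sketch under assumptions: CPU autocast with bfloat16 as a portable
# stand-in for CUDA amp. Convolutions are on the reduced-precision op
# list, so the conv output comes back in bfloat16; printing the dtype of
# the upsampled tensor shows which precision interpolate ran at on this
# particular PyTorch build.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
with torch.autocast("cpu", dtype=torch.bfloat16):
    y = F.conv2d(x, w, padding=1)                        # autocast: bfloat16
    z = F.interpolate(y, scale_factor=2, mode="bilinear")
print(y.dtype, z.dtype)  # z's dtype depends on the upsample cast policy
```

With the PR in question, upsample under autocast is kept in float32 rather than the reduced precision, which is where the eager-mode cost comes from.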
Not sure, @xuzhao9 are we running any compile-d perf measurement with amp?
This was the big topic on the PR thread, as we didn't see the same gradient limit with the compiled path.
So I suppose compiled mode doesn't need to be forced to float32
with amp, at least with the inputs we tested at:
pytorch/pytorch#121324 (comment)
Not sure, @xuzhao9 are we running any compile-d perf measurement with amp?
In this CI workflow we do not run compile; the PR affects the eager-mode (non-compiled) path.
For the compile-d workflow result we need to check the HUD: https://hud.pytorch.org/benchmark/compilers
For the compile-d workflow result we need to check the HUD: https://hud.pytorch.org/benchmark/compilers
Is there a way to isolate pytorch_unet
on that HUD page?
@bhack Check: https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Fri,%2019%20Apr%202024%2019:18:11%20GMT&stopTime=Fri,%2026%20Apr%202024%2019:18:11%20GMT&granularity=hour&mode=training&model=pytorch_unet&dtype=amp&lBranch=main&lCommit=59a1f1f308545e3ac1d81940a51f8dc0db3d82d4&rBranch=main&rCommit=b2f6cfd9c061a212cde8c8768fda41cc75a3110c
It seems we cannot isolate before and after the PR on that page, but on a coarse daily timescale I don't see the perf drop in the compiled version, so I think it could be ok.
Do we care about the final effect on a trained network in eager mode of:
pytorch/pytorch#121072
How is this increased precision going to impact accuracy in eager mode?
The problem is that we have now skipped the accuracy check, as it is not deterministic in eager mode.
@bhack One thing I noticed: in eager mode, the regression happens only on amp+inference. However, Inductor CI does not test this combination: https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Fri%2C+19+Apr+2024+19%3A18%3A11+GMT&stopTime=Fri%2C+26+Apr+2024+19%3A18%3A11+GMT&granularity=hour&mode=training&model=pytorch_unet&dtype=amp&lBranch=main&lCommit=59a1f1f308545e3ac1d81940a51f8dc0db3d82d4&rBranch=main&rCommit=b2f6cfd9c061a212cde8c8768fda41cc75a3110c.
It does not regress on train+amp or inference+bf16, but I am not sure about inference+amp, since there is no data.
eager mode, the regression happens only on amp+inference.
That is strange. Are you sure eager + amp training is also tested?
Is the ops list in the PR complete enough to cover the backward pass, or do we need to add something else?
@bhack Yes, we test both train and eval:
Left: 20240424
Right: 20240426
![image](https://private-user-images.githubusercontent.com/502017/326136006-67e02412-5f30-48b2-afd1-012de31420eb.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIwMjEyOTQsIm5iZiI6MTcyMjAyMDk5NCwicGF0aCI6Ii81MDIwMTcvMzI2MTM2MDA2LTY3ZTAyNDEyLTVmMzAtNDhiMi1hZmQxLTAxMmRlMzE0MjBlYi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNlQxOTA5NTRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jMWM0MDUwYzg1N2EwN2E3NjRlZWE2OTE3NmYzODE5OWUwYzJjNDJkZGJmMDBiYzdmMDg3MjNjY2M3MTgxMTAxJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.bpH-mluBWyPXALjgDxjmGdFdlYjqa4HeW2wkOT5LXPo)
![image](https://private-user-images.githubusercontent.com/502017/326136044-a5832208-e5a7-4dc3-b9b0-9653b4000e5a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIwMjEyOTQsIm5iZiI6MTcyMjAyMDk5NCwicGF0aCI6Ii81MDIwMTcvMzI2MTM2MDQ0LWE1ODMyMjA4LWU1YTctNGRjMy1iOWIwLTk2NTNiNDAwMGU1YS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNlQxOTA5NTRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04YTk4OTM1ZWQ0MDMyZTcyZGFmMzJiYmU5MjNmNTZjODgxMjZmOWIyNTI1YmVmZmQ3YjFlNzg4ZmQ1NDY5YTNjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.FB9h5NVkBAQMbBxO3KHrzaKhB2rsVlUn5CC4iXh4bQU)
Train has a much smaller regression than eval.
Is this because we have different torch.backends.cudnn.deterministic
setups for the train and eval tests? https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/pytorch_unet/__init__.py#L91C9-L91C43
I think the regression happens only when torch.backends.cudnn.deterministic = True.
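For reference, the determinism knobs being compared here are two separate switches (these are real PyTorch APIs; the CUBLAS environment variable shown in a comment is only needed for some CUDA ops):

```python
# The two determinism switches discussed above. cudnn.deterministic only
# constrains cuDNN kernel selection; use_deterministic_algorithms(True)
# additionally makes other ops (e.g. scatter-style backward kernels) take
# deterministic paths, or raise an error when none exists.
import torch

torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)
# warn_only=True downgrades the error for nondeterministic ops to a warning:
# torch.use_deterministic_algorithms(True, warn_only=True)
# Some CUDA ops additionally need: export CUBLAS_WORKSPACE_CONFIG=:4096:8

print(torch.are_deterministic_algorithms_enabled())  # True
```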
I think it is expected; see @lezcano's comment at pytorch/pytorch#121769 (comment).
But I don't know how it could be connected to my PR. If you git blame the sort fallback, can you check when it was introduced?
The deterministic fallback was merged on 29 March 2023:
pytorch/pytorch#96898
So I suppose that in pure eager mode we have a similar sort
fallback when we ask for determinism (I have not checked the source code, but I suppose it is there).
Is the sort fallback more heavily impacted by working at float32
than the non-fallback atomic_add
version?
This is the only explanation I have in mind for your regression appearing only in deterministic mode.
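A pure-Python aside (no GPU needed) on why accumulation order matters at all: floating-point addition is not associative, so atomic adds that land in a different order can produce bit-different results, which is exactly what the sorted fallback avoids.

```python
# Floating-point addition is not associative: summing the same values in
# two different orders gives different results. GPU atomic_add accumulates
# in thread-arrival order, which varies run to run; the sort-based
# fallback fixes the order and hence the result.
vals = [1e16, 1.0, -1e16, 1.0]

total_a = 0.0
for v in vals:                      # 1e16 + 1.0 rounds back to 1e16
    total_a += v

total_b = 0.0
for v in [1e16, -1e16, 1.0, 1.0]:  # cancel the big terms first
    total_b += v

print(total_a, total_b)  # 1.0 2.0
```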
@bhack The regression happened on 20240425, so it can't be pytorch/pytorch#96898
cc @malfet I am wondering what we should do about this regression - it looks like it is expected, since we now upsample in float32 under amp, and it only affects deterministic mode?
Closing this case, since it is the expected result of the upsampling precision change.
Marked the PR as bc-breaking to make sure we properly warn in the release notes.