pommedeterresautee commented on August 17, 2024

benchmarks

❯ pytest test/test_torchdynamo_bert.py -k "benchmark"  --benchmark-group-by fullfunc,param:shape
===================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.9.15, pytest-7.1.3, pluggy-1.0.0
rootdir: /mnt/workspace/kernl
collected 572 items / 11 deselected / 561 selected                                                                                                                                                                            

test/test_torchdynamo_bert.py .......................................................................................................ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss................................ [ 32%]
....................................................................................................................................................................................................................... [ 71%]
.........ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss                                                       [100%]
test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 128)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x128-bert-base-uncased]                                                  7.9872 (2.57)    7.9899 (2.58)   7.7998 (2.63)   8.1408 (2.56)   8.2159 (2.52)   8.2771 (2.52)   8.0689 (2.56)   9.0158 (2.38)
test_benchmark_implementations[baseline-1x128-sentence-transformers/all-MiniLM-L6-v2]                             4.0152 (5.12)    4.0139 (5.14)   3.9148 (5.25)   4.177 (4.99)    4.2852 (4.83)   4.3102 (4.85)   4.198 (4.92)    4.9206 (4.36)
test_benchmark_implementations[baseline-1x128-t5-small]                                                           13.8598 (1.48)   14.0985 (1.46)  13.3765 (1.54)  15.0268 (1.39)  13.7171 (1.51)  13.9161 (1.5)   13.4444 (1.54)  15.0441 (1.43)
test_benchmark_implementations[dynamo-1x128-bert-base-uncased]                                                    6.997 (2.94)     7.1029 (2.9)    6.0877 (3.37)   8.0609 (2.59)   7.2228 (2.87)   7.3178 (2.86)   6.9524 (2.97)   8.5486 (2.51)
test_benchmark_implementations[dynamo-1x128-sentence-transformers/all-MiniLM-L6-v2]                               3.3751 (6.09)    3.3769 (6.11)   3.2266 (6.37)   3.629 (5.75)    3.8232 (5.41)   3.923 (5.33)    3.7851 (5.46)   4.2954 (5.0)
test_benchmark_implementations[dynamo-1x128-t5-small]                                                             12.075 (1.7)     12.0892 (1.71)  11.9195 (1.72)  12.3597 (1.69)  12.1862 (1.7)   12.3291 (1.69)  12.1373 (1.7)   13.1296 (1.63)
test_benchmark_implementations[dynamo_cuda_graphs-1x128-bert-base-uncased]                                        1.7746 (11.59)   1.7744 (11.63)  1.7705 (11.6)   1.7818 (11.7)   1.6138 (12.82)  1.6166 (12.92)  1.6096 (12.83)  1.6971 (12.64)
test_benchmark_implementations[dynamo_cuda_graphs-1x128-sentence-transformers/all-MiniLM-L6-v2]                   0.6246 (32.92)   0.6246 (33.03)  0.6226 (32.99)  0.6267 (33.27)  0.6046 (34.23)  0.606 (34.47)   0.6023 (34.29)  0.6949 (30.88)
test_benchmark_implementations[dynamo_cuda_graphs-1x128-t5-small]                                                 1.5135 (13.59)   1.6067 (12.84)  1.4981 (13.71)  1.7234 (12.1)   1.5702 (13.18)  1.5728 (13.28)  1.5673 (13.18)  1.675 (12.81)
test_benchmark_implementations[dynamo_no_dropout-1x128-bert-base-uncased]                                         6.868 (2.99)     6.8942 (2.99)   6.5167 (3.15)   7.4383 (2.8)    7.1646 (2.89)   7.236 (2.89)    6.891 (3.0)     7.9592 (2.7)
test_benchmark_implementations[dynamo_no_dropout-1x128-sentence-transformers/all-MiniLM-L6-v2]                    3.2524 (6.32)    3.2494 (6.35)   3.0158 (6.81)   3.4847 (5.98)   3.4674 (5.97)   3.5189 (5.94)   3.3634 (6.14)   3.923 (5.47)
test_benchmark_implementations[dynamo_no_dropout-1x128-t5-small]                                                  12.1487 (1.69)   12.1261 (1.7)   11.7545 (1.75)  12.4109 (1.68)  13.1879 (1.57)  13.306 (1.57)   12.8454 (1.61)  13.7882 (1.56)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x128-bert-base-uncased]                                        3.6014 (5.71)    3.6048 (5.72)   3.4939 (5.88)   3.7251 (5.6)    3.9364 (5.26)   3.9292 (5.32)   3.8066 (5.43)   4.1656 (5.15)
test_benchmark_implementations[dynamo_optimized-1x128-bert-base-uncased]                                          14.4248 (1.43)   14.4 (1.43)     14.2694 (1.44)  14.4476 (1.44)  14.7625 (1.4)   14.9087 (1.4)   14.6598 (1.41)  15.7277 (1.36)
test_benchmark_implementations[dynamo_optimized-1x128-sentence-transformers/all-MiniLM-L6-v2]                     7.3882 (2.78)    7.3925 (2.79)   7.3298 (2.8)    7.4888 (2.78)   7.7379 (2.67)   7.7562 (2.69)   7.6219 (2.71)   8.138 (2.64)
test_benchmark_implementations[dynamo_optimized-1x128-t5-small]                                                   20.564 (1.0)     20.6306 (1.0)   20.5384 (1.0)   20.8508 (1.0)   20.6954 (1.0)   20.8929 (1.0)   20.652 (1.0)    21.4561 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-bert-base-uncased]                              1.6302 (12.61)   1.5937 (12.95)  1.4336 (14.33)  1.6333 (12.77)  1.4812 (13.97)  1.4839 (14.08)  1.4773 (13.98)  1.5775 (13.6)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-sentence-transformers/all-MiniLM-L6-v2]         0.4045 (50.84)   0.4286 (48.13)  0.3994 (51.43)  0.4649 (44.85)  0.4557 (45.41)  0.4573 (45.69)  0.4536 (45.53)  0.5474 (39.19)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-t5-small]                                       1.8033 (11.4)    1.8031 (11.44)  1.8012 (11.4)   1.8063 (11.54)  1.6396 (12.62)  1.6422 (12.72)  1.6358 (12.62)  1.7353 (12.36)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x128-bert-base-uncased]                       1.6947 (12.13)   1.6216 (12.72)  1.4807 (13.87)  1.6978 (12.28)  1.5388 (13.45)  1.5416 (13.55)  1.5351 (13.45)  1.6359 (13.12)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x128-sentence-transformers/all-MiniLM-L6-v2]  0.4618 (44.53)   0.4614 (44.71)  0.4598 (44.67)  0.4628 (45.05)  0.4594 (45.05)  0.4606 (45.36)  0.4563 (45.26)  0.5515 (38.91)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x128-t5-small]                                1.8053 (11.39)   1.8049 (11.43)  1.8033 (11.39)  1.8074 (11.54)  1.6434 (12.59)  1.6468 (12.69)  1.6405 (12.59)  1.7386 (12.34)
test_benchmark_implementations[onnx-1x128-bert-base-uncased]                                                      3.582 (5.74)     3.5913 (5.74)   3.2358 (6.35)   4.0489 (5.15)   3.2617 (6.35)   3.3188 (6.3)    3.2244 (6.4)    3.6164 (5.93)
test_benchmark_implementations[onnx_optim_fp16-1x128-bert-base-uncased]                                           2.8641 (7.18)    2.8701 (7.19)   2.7812 (7.38)   2.9768 (7.0)    2.8838 (7.18)   2.9133 (7.17)   2.8205 (7.32)   3.4263 (6.26)
test_benchmark_implementations[onnx_optim_fp32-1x128-bert-base-uncased]                                           3.5852 (5.74)    3.6226 (5.69)   3.5543 (5.78)   4.0428 (5.16)   3.2514 (6.37)   3.2901 (6.35)   3.2305 (6.39)   3.6178 (5.93)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 16)
Name                                                                                                             Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x16-bert-base-uncased]                                                  7.682 (2.69)     7.7118 (2.69)   7.595 (2.72)    7.9278 (2.64)   8.0521 (2.59)   8.1305 (2.59)   7.9709 (2.61)   8.6675 (2.59)
test_benchmark_implementations[baseline-1x16-sentence-transformers/all-MiniLM-L6-v2]                             3.927 (5.26)     3.9452 (5.25)   3.8758 (5.32)   4.0827 (5.13)   4.2252 (4.94)   4.2415 (4.97)   4.1482 (5.02)   4.6139 (4.86)
test_benchmark_implementations[baseline-1x16-t5-small]                                                           12.3279 (1.67)   12.331 (1.68)   12.2624 (1.68)  12.3924 (1.69)  13.4639 (1.55)  13.4829 (1.56)  12.6558 (1.65)  14.4626 (1.55)
test_benchmark_implementations[dynamo-1x16-bert-base-uncased]                                                    6.6396 (3.11)    6.6549 (3.11)   6.4492 (3.2)    6.8536 (3.06)   6.9179 (3.02)   6.9599 (3.03)   6.8426 (3.04)   7.3644 (3.04)
test_benchmark_implementations[dynamo-1x16-sentence-transformers/all-MiniLM-L6-v2]                               3.2606 (6.33)    3.2628 (6.35)   3.2043 (6.44)   3.366 (6.22)    3.6475 (5.72)   3.6921 (5.71)   3.5418 (5.88)   4.0715 (5.51)
test_benchmark_implementations[dynamo-1x16-t5-small]                                                             11.0203 (1.87)   11.0692 (1.87)  10.9312 (1.89)  11.2497 (1.86)  15.0719 (1.38)  16.2501 (1.3)   12.3431 (1.69)  22.4205 (1.0)
test_benchmark_implementations[dynamo_cuda_graphs-1x16-bert-base-uncased]                                        1.1244 (18.36)   1.1929 (17.37)  1.1223 (18.38)  1.6261 (12.88)  1.0677 (19.54)  1.0733 (19.63)  1.0613 (19.62)  1.2177 (18.41)
test_benchmark_implementations[dynamo_cuda_graphs-1x16-sentence-transformers/all-MiniLM-L6-v2]                   0.4588 (45.0)    0.4411 (46.97)  0.4045 (50.99)  0.467 (44.86)   0.4638 (44.98)  0.4657 (45.23)  0.4613 (45.14)  0.5556 (40.36)
test_benchmark_implementations[dynamo_cuda_graphs-1x16-t5-small]                                                 1.4705 (14.04)   1.5432 (13.43)  1.4674 (14.05)  2.0019 (10.47)  1.4545 (14.34)  1.4977 (14.06)  1.4449 (14.41)  1.7826 (12.58)
test_benchmark_implementations[dynamo_no_dropout-1x16-bert-base-uncased]                                         6.741 (3.06)     6.6996 (3.09)   6.3355 (3.26)   7.082 (2.96)    6.7989 (3.07)   6.813 (3.09)    6.6161 (3.15)   7.0404 (3.18)
test_benchmark_implementations[dynamo_no_dropout-1x16-sentence-transformers/all-MiniLM-L6-v2]                    3.2256 (6.4)     3.2316 (6.41)   2.9266 (7.05)   3.4714 (6.04)   3.629 (5.75)    3.6681 (5.74)   3.5464 (5.87)   4.0347 (5.56)
test_benchmark_implementations[dynamo_no_dropout-1x16-t5-small]                                                  10.7037 (1.93)   10.7682 (1.92)  10.5984 (1.95)  10.9875 (1.91)  11.5583 (1.8)   11.6896 (1.8)   11.4648 (1.82)  12.3367 (1.82)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x16-bert-base-uncased]                                        3.3597 (6.15)    3.3867 (6.12)   3.2041 (6.44)   3.5635 (5.88)   3.8301 (5.45)   3.8313 (5.5)    3.6893 (5.65)   4.1744 (5.37)
test_benchmark_implementations[dynamo_optimized-1x16-bert-base-uncased]                                          14.4681 (1.43)   14.5149 (1.43)  14.3852 (1.43)  14.6831 (1.43)  14.7698 (1.41)  14.8412 (1.42)  14.7237 (1.41)  15.1918 (1.48)
test_benchmark_implementations[dynamo_optimized-1x16-sentence-transformers/all-MiniLM-L6-v2]                     7.478 (2.76)     7.56 (2.74)     7.3749 (2.8)    8.5381 (2.45)   7.728 (2.7)     7.7702 (2.71)   7.6352 (2.73)   8.3301 (2.69)
test_benchmark_implementations[dynamo_optimized-1x16-t5-small]                                                   20.6459 (1.0)    20.7201 (1.0)   20.6234 (1.0)   20.951 (1.0)    20.8613 (1.0)   21.0644 (1.0)   20.8263 (1.0)   21.4085 (1.05)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-bert-base-uncased]                              0.6472 (31.9)    0.6471 (32.02)  0.6451 (31.97)  0.6554 (31.97)  0.6424 (32.47)  0.6451 (32.65)  0.6399 (32.55)  0.8005 (28.01)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-sentence-transformers/all-MiniLM-L6-v2]         0.3348 (61.66)   0.3196 (64.82)  0.297 (69.45)   0.3369 (62.19)  0.355 (58.76)   0.359 (58.68)   0.3519 (59.18)  0.4687 (47.83)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-t5-small]                                       1.1653 (17.72)   1.1529 (17.97)  1.0363 (19.9)   1.1704 (17.9)   1.106 (18.86)   1.1076 (19.02)  1.0983 (18.96)  1.1985 (18.71)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x16-bert-base-uncased]                       0.6533 (31.6)    0.6534 (31.71)  0.6513 (31.67)  0.6554 (31.97)  0.6448 (32.35)  0.6465 (32.58)  0.6422 (32.43)  0.7361 (30.46)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x16-sentence-transformers/all-MiniLM-L6-v2]  0.3369 (61.28)   0.3266 (63.45)  0.297 (69.45)   0.4987 (42.01)  0.3543 (58.88)  0.3557 (59.22)  0.3521 (59.14)  0.465 (48.21)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x16-t5-small]                                1.0629 (19.42)   1.105 (18.75)   1.0598 (19.46)  1.1715 (17.88)  1.1341 (18.39)  1.1366 (18.53)  1.1297 (18.44)  1.2494 (17.95)
test_benchmark_implementations[onnx-1x16-bert-base-uncased]                                                      2.6032 (7.93)    2.6188 (7.91)   2.5181 (8.19)   2.9164 (7.18)   2.6243 (7.95)   2.6575 (7.93)   2.549 (8.17)    3.0429 (7.37)
test_benchmark_implementations[onnx_optim_fp16-1x16-bert-base-uncased]                                           2.8223 (7.32)    2.8197 (7.35)   2.7535 (7.49)   2.8529 (7.34)   2.7543 (7.57)   2.7912 (7.55)   2.685 (7.76)    3.1783 (7.05)
test_benchmark_implementations[onnx_optim_fp32-1x16-bert-base-uncased]                                           2.5416 (8.12)    2.553 (8.12)    2.4945 (8.27)   2.6429 (7.93)   2.5792 (8.09)   2.6097 (8.07)   2.5529 (8.16)   3.0423 (7.37)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 256)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x256-bert-base-uncased]                                                  7.7384 (2.67)    7.7547 (2.66)   7.6209 (2.7)    7.932 (2.61)    8.1248 (2.56)   8.1814 (2.57)   8.0003 (2.6)    8.8167 (2.46)
test_benchmark_implementations[baseline-1x256-sentence-transformers/all-MiniLM-L6-v2]                             3.9608 (5.21)    3.9564 (5.22)   3.8779 (5.3)    4.1032 (5.05)   4.2653 (4.88)   4.3008 (4.89)   4.1955 (4.96)   5.0456 (4.29)
test_benchmark_implementations[baseline-1x256-t5-small]                                                           12.5911 (1.64)   12.6226 (1.63)  12.4385 (1.65)  13.0109 (1.59)  13.0027 (1.6)   13.2045 (1.59)  12.8765 (1.62)  14.2936 (1.51)
test_benchmark_implementations[dynamo-1x256-bert-base-uncased]                                                    6.8168 (3.03)    6.8896 (2.99)   6.5802 (3.12)   7.633 (2.71)    7.1286 (2.92)   7.2463 (2.9)    7.0758 (2.94)   7.7586 (2.79)
test_benchmark_implementations[dynamo-1x256-sentence-transformers/all-MiniLM-L6-v2]                               3.3761 (6.11)    3.3897 (6.09)   3.3147 (6.2)    3.5451 (5.85)   3.6673 (5.67)   3.6892 (5.7)    3.5946 (5.79)   4.152 (5.21)
test_benchmark_implementations[dynamo-1x256-t5-small]                                                             11.8589 (1.74)   11.7864 (1.75)  11.5589 (1.78)  12.1098 (1.71)  12.2947 (1.69)  12.9873 (1.62)  11.5175 (1.81)  17.1832 (1.26)
test_benchmark_implementations[dynamo_cuda_graphs-1x256-bert-base-uncased]                                        2.2804 (9.05)    2.2143 (9.32)   2.0654 (9.95)   2.4095 (8.6)    2.0906 (9.95)   2.0882 (10.07)  2.0591 (10.1)   2.1556 (10.04)
test_benchmark_implementations[dynamo_cuda_graphs-1x256-sentence-transformers/all-MiniLM-L6-v2]                   0.6871 (30.02)   0.715 (28.86)   0.681 (30.17)   0.769 (26.95)   0.7329 (28.39)  0.7362 (28.55)  0.7285 (28.56)  0.8266 (26.19)
test_benchmark_implementations[dynamo_cuda_graphs-1x256-t5-small]                                                 2.5068 (8.23)    2.507 (8.23)    2.5037 (8.21)   2.5119 (8.25)   2.2687 (9.17)   2.2714 (9.25)   2.2654 (9.18)   2.3533 (9.2)
test_benchmark_implementations[dynamo_no_dropout-1x256-bert-base-uncased]                                         6.8588 (3.01)    6.9061 (2.99)   6.3703 (3.23)   7.3277 (2.83)   6.8653 (3.03)   6.8913 (3.05)   6.6599 (3.12)   7.3825 (2.93)
test_benchmark_implementations[dynamo_no_dropout-1x256-sentence-transformers/all-MiniLM-L6-v2]                    3.3987 (6.07)    3.9296 (5.25)   3.0751 (6.68)   5.4364 (3.81)   3.4937 (5.96)   3.5125 (5.98)   3.4188 (6.09)   3.9612 (5.46)
test_benchmark_implementations[dynamo_no_dropout-1x256-t5-small]                                                  10.837 (1.9)     10.8245 (1.91)  10.6906 (1.92)  10.9415 (1.89)  11.0068 (1.89)  11.0652 (1.9)   10.864 (1.91)   11.581 (1.87)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x256-bert-base-uncased]                                        3.1785 (6.49)    3.2125 (6.42)   3.0638 (6.71)   3.4642 (5.98)   3.526 (5.9)     3.5533 (5.92)   3.369 (6.18)    3.8534 (5.62)
test_benchmark_implementations[dynamo_optimized-1x256-bert-base-uncased]                                          14.4538 (1.43)   14.4792 (1.43)  14.3471 (1.43)  14.6166 (1.42)  14.6074 (1.42)  14.7507 (1.42)  14.5543 (1.43)  15.1575 (1.43)
test_benchmark_implementations[dynamo_optimized-1x256-sentence-transformers/all-MiniLM-L6-v2]                     7.4332 (2.78)    7.4325 (2.78)   7.3431 (2.8)    7.5163 (2.76)   7.6991 (2.7)    7.7543 (2.71)   7.6478 (2.72)   8.2078 (2.64)
test_benchmark_implementations[dynamo_optimized-1x256-t5-small]                                                   20.6275 (1.0)    20.6331 (1.0)   20.5455 (1.0)   20.7227 (1.0)   20.8079 (1.0)   21.0179 (1.0)   20.8036 (1.0)   21.6455 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-bert-base-uncased]                              2.0316 (10.15)   2.0319 (10.15)  2.0285 (10.13)  2.0357 (10.18)  1.8586 (11.2)   1.8807 (11.18)  1.8278 (11.38)  2.1787 (9.94)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-sentence-transformers/all-MiniLM-L6-v2]         0.682 (30.25)    0.6819 (30.26)  0.6799 (30.22)  0.6902 (30.03)  0.6559 (31.73)  0.6572 (31.98)  0.6528 (31.87)  0.7459 (29.02)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-t5-small]                                       2.6624 (7.75)    2.6623 (7.75)   2.6583 (7.73)   2.6675 (7.77)   2.4036 (8.66)   2.4065 (8.73)   2.3983 (8.67)   2.5116 (8.62)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x256-bert-base-uncased]                       1.9558 (10.55)   2.0252 (10.19)  1.9517 (10.53)  2.1217 (9.77)   1.9243 (10.81)  1.9558 (10.75)  1.9064 (10.91)  2.1052 (10.28)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x256-sentence-transformers/all-MiniLM-L6-v2]  0.6851 (30.11)   0.6851 (30.12)  0.683 (30.08)   0.6871 (30.16)  0.6562 (31.71)  0.659 (31.89)   0.653 (31.86)   0.8251 (26.24)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x256-t5-small]                                2.6665 (7.74)    2.6694 (7.73)   2.433 (8.44)    3.0024 (6.9)    2.4055 (8.65)   2.4088 (8.73)   2.4004 (8.67)   2.4967 (8.67)
test_benchmark_implementations[onnx-1x256-bert-base-uncased]                                                      4.3646 (4.73)    4.2084 (4.9)    3.9598 (5.19)   4.4165 (4.69)   3.9879 (5.22)   4.0841 (5.15)   3.944 (5.27)    4.689 (4.62)
test_benchmark_implementations[onnx_optim_fp16-1x256-bert-base-uncased]                                           2.816 (7.33)     2.8224 (7.31)   2.8099 (7.31)   2.858 (7.25)    2.5851 (8.05)   2.5952 (8.1)    2.5621 (8.12)   2.9207 (7.41)
test_benchmark_implementations[onnx_optim_fp32-1x256-bert-base-uncased]                                           4.3717 (4.72)    4.4046 (4.68)   3.9823 (5.16)   4.9562 (4.18)   3.9519 (5.27)   3.97 (5.29)     3.9409 (5.28)   4.2826 (5.05)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 33)
Name                                                                                                             Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x33-bert-base-uncased]                                                  8.4726 (2.46)    8.412 (2.48)    7.935 (2.62)    9.004 (2.33)    8.7561 (2.42)   8.9365 (2.37)   8.5405 (2.45)   9.6138 (2.24)
test_benchmark_implementations[baseline-1x33-sentence-transformers/all-MiniLM-L6-v2]                             4.0204 (5.19)    4.0388 (5.17)   3.8953 (5.34)   4.2906 (4.9)    5.5619 (3.81)   5.625 (3.77)    5.2705 (3.97)   6.7621 (3.19)
test_benchmark_implementations[baseline-1x33-t5-small]                                                           11.9523 (1.75)   12.0013 (1.74)  11.8917 (1.75)  12.2962 (1.71)  12.4312 (1.7)   12.4971 (1.7)   12.1976 (1.71)  13.2715 (1.62)
test_benchmark_implementations[dynamo-1x33-bert-base-uncased]                                                    7.0502 (2.96)    7.0396 (2.97)   6.7227 (3.09)   7.3626 (2.85)   7.3552 (2.88)   7.3792 (2.88)   7.2699 (2.88)   7.741 (2.78)
test_benchmark_implementations[dynamo-1x33-sentence-transformers/all-MiniLM-L6-v2]                               3.3865 (6.16)    3.3818 (6.18)   3.2597 (6.38)   3.5219 (5.96)   3.8123 (5.56)   3.9556 (5.36)   3.6183 (5.78)   4.7159 (4.57)
test_benchmark_implementations[dynamo-1x33-t5-small]                                                             11.1677 (1.87)   11.1976 (1.87)  11.0172 (1.89)  11.3838 (1.85)  11.3337 (1.87)  11.5102 (1.84)  11.2035 (1.87)  12.0374 (1.79)
test_benchmark_implementations[dynamo_cuda_graphs-1x33-bert-base-uncased]                                        1.3107 (15.92)   1.3467 (15.51)  1.1827 (17.58)  1.9927 (10.54)  1.2184 (17.39)  1.2385 (17.13)  1.2133 (17.23)  1.5788 (13.65)
test_benchmark_implementations[dynamo_cuda_graphs-1x33-sentence-transformers/all-MiniLM-L6-v2]                   0.4874 (42.81)   0.4884 (42.77)  0.4864 (42.75)  0.4977 (42.21)  0.4904 (43.2)   0.4956 (42.81)  0.4882 (42.83)  0.6356 (33.91)
test_benchmark_implementations[dynamo_cuda_graphs-1x33-t5-small]                                                 1.7326 (12.04)   1.7326 (12.06)  1.7306 (12.02)  1.7347 (12.11)  1.5957 (13.28)  1.5979 (13.28)  1.5935 (13.12)  1.6772 (12.85)
test_benchmark_implementations[dynamo_no_dropout-1x33-bert-base-uncased]                                         6.6377 (3.14)    6.7421 (3.1)    6.4174 (3.24)   7.3035 (2.88)   7.4625 (2.84)   7.5605 (2.81)   7.0708 (2.96)   8.5945 (2.51)
test_benchmark_implementations[dynamo_no_dropout-1x33-sentence-transformers/all-MiniLM-L6-v2]                    3.2287 (6.46)    3.2227 (6.48)   3.0938 (6.72)   3.3413 (6.29)   3.6559 (5.79)   3.6962 (5.74)   3.6189 (5.78)   4.0549 (5.32)
test_benchmark_implementations[dynamo_no_dropout-1x33-t5-small]                                                  10.452 (2.0)     10.4648 (2.0)   10.1745 (2.04)  10.8237 (1.94)  11.6557 (1.82)  11.7381 (1.81)  11.2991 (1.85)  12.5264 (1.72)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x33-bert-base-uncased]                                        3.5103 (5.94)    3.5325 (5.91)   3.4396 (6.05)   3.7274 (5.64)   3.8989 (5.43)   3.9899 (5.32)   3.6709 (5.7)    4.8065 (4.48)
test_benchmark_implementations[dynamo_optimized-1x33-bert-base-uncased]                                          14.4466 (1.44)   14.5096 (1.44)  14.4005 (1.44)  14.6831 (1.43)  14.6474 (1.45)  14.8191 (1.43)  14.6261 (1.43)  15.1702 (1.42)
test_benchmark_implementations[dynamo_optimized-1x33-sentence-transformers/all-MiniLM-L6-v2]                     8.0547 (2.59)    8.3413 (2.5)    7.892 (2.63)    10.3977 (2.02)  7.7632 (2.73)   7.8094 (2.72)   7.67 (2.73)     8.3359 (2.59)
test_benchmark_implementations[dynamo_optimized-1x33-t5-small]                                                   20.8681 (1.0)    20.8873 (1.0)   20.7933 (1.0)   21.0063 (1.0)   21.1847 (1.0)   21.2159 (1.0)   20.9065 (1.0)   21.5517 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-bert-base-uncased]                              0.8387 (24.88)   0.7997 (26.12)  0.7465 (27.85)  0.8499 (24.72)  0.7931 (26.71)  0.7952 (26.68)  0.7905 (26.45)  0.8854 (24.34)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-sentence-transformers/all-MiniLM-L6-v2]         0.3523 (59.24)   0.3521 (59.33)  0.3502 (59.37)  0.3594 (58.44)  0.3685 (57.48)  0.3703 (57.3)   0.3659 (57.14)  0.4649 (46.36)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-t5-small]                                       1.2564 (16.61)   1.2639 (16.53)  1.2544 (16.58)  1.281 (16.4)    1.1839 (17.89)  1.1978 (17.71)  1.1797 (17.72)  1.4164 (15.22)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x33-bert-base-uncased]                       0.8684 (24.03)   0.8682 (24.06)  0.8663 (24.0)   0.8704 (24.13)  0.8148 (26.0)   0.8173 (25.96)  0.8118 (25.75)  0.905 (23.81)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x33-sentence-transformers/all-MiniLM-L6-v2]  0.341 (61.2)     0.3454 (60.48)  0.34 (61.16)    0.3574 (58.78)  0.3721 (56.93)  0.3759 (56.44)  0.3662 (57.09)  0.4813 (44.78)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x33-t5-small]                                1.2749 (16.37)   1.2744 (16.39)  1.2728 (16.34)  1.2769 (16.45)  1.1796 (17.96)  1.1818 (17.95)  1.1773 (17.76)  1.2763 (16.89)
test_benchmark_implementations[onnx-1x33-bert-base-uncased]                                                      2.688 (7.76)     2.7224 (7.67)   2.5805 (8.06)   3.2737 (6.42)   2.7011 (7.84)   2.7256 (7.78)   2.6018 (8.04)   3.2574 (6.62)
test_benchmark_implementations[onnx_optim_fp16-1x33-bert-base-uncased]                                           2.8836 (7.24)    2.8825 (7.25)   2.8262 (7.36)   2.9584 (7.1)    2.9294 (7.23)   2.945 (7.2)     2.875 (7.27)    3.4319 (6.28)
test_benchmark_implementations[onnx_optim_fp32-1x33-bert-base-uncased]                                           2.818 (7.41)     2.831 (7.38)    2.561 (8.12)    3.1826 (6.6)    2.6491 (8.0)    2.677 (7.93)    2.612 (8.0)     3.0799 (7.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 384)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x384-bert-base-uncased]                                                  7.9555 (2.61)    7.938 (2.62)    7.7855 (2.65)   8.1316 (2.57)   8.3062 (2.51)   8.3782 (2.51)   8.1626 (2.55)   9.1931 (2.35)
test_benchmark_implementations[baseline-1x384-sentence-transformers/all-MiniLM-L6-v2]                             3.884 (5.35)     3.9057 (5.32)   3.8195 (5.39)   4.1197 (5.07)   4.2459 (4.9)    4.2788 (4.92)   4.1714 (4.99)   4.8257 (4.47)
test_benchmark_implementations[baseline-1x384-t5-small]                                                           12.8348 (1.62)   12.9254 (1.61)  12.6566 (1.63)  13.5947 (1.54)  13.29 (1.57)    13.4307 (1.57)  13.1264 (1.58)  14.6564 (1.47)
test_benchmark_implementations[dynamo-1x384-bert-base-uncased]                                                    6.7546 (3.07)    6.7684 (3.07)   6.7011 (3.07)   6.868 (3.04)    7.6694 (2.71)   7.5896 (2.77)   7.1185 (2.92)   8.5041 (2.54)
test_benchmark_implementations[dynamo-1x384-sentence-transformers/all-MiniLM-L6-v2]                               3.2543 (6.38)    3.2663 (6.36)   3.2184 (6.4)    3.3833 (6.17)   3.6118 (5.76)   3.6322 (5.79)   3.5544 (5.85)   4.142 (5.21)
test_benchmark_implementations[dynamo-1x384-t5-small]                                                             11.4944 (1.81)   11.5435 (1.8)   11.4033 (1.81)  11.774 (1.77)   11.7263 (1.77)  11.8464 (1.78)  11.7131 (1.78)  12.3298 (1.75)
test_benchmark_implementations[dynamo_cuda_graphs-1x384-bert-base-uncased]                                        3.0925 (6.71)    3.0999 (6.7)    3.0863 (6.67)   3.3239 (6.28)   2.913 (7.15)    2.8961 (7.26)   2.8325 (7.34)   2.9464 (7.32)
test_benchmark_implementations[dynamo_cuda_graphs-1x384-sentence-transformers/all-MiniLM-L6-v2]                   1.0249 (20.26)   1.0291 (20.18)  1.0218 (20.16)  1.4561 (14.34)  0.9923 (20.98)  0.996 (21.12)   0.9784 (21.25)  1.0726 (20.11)
test_benchmark_implementations[dynamo_cuda_graphs-1x384-t5-small]                                                 3.2676 (6.35)    3.3032 (6.29)   3.2645 (6.31)   3.6833 (5.67)   3.009 (6.92)    3.0296 (6.94)   2.9767 (6.99)   3.4194 (6.31)
test_benchmark_implementations[dynamo_no_dropout-1x384-bert-base-uncased]                                         6.5331 (3.18)    6.5535 (3.17)   6.441 (3.2)     6.7174 (3.11)   6.8065 (3.06)   6.8306 (3.08)   6.6748 (3.12)   7.2409 (2.98)
test_benchmark_implementations[dynamo_no_dropout-1x384-sentence-transformers/all-MiniLM-L6-v2]                    3.1345 (6.62)    3.1288 (6.64)   3.0578 (6.74)   3.2307 (6.46)   3.4546 (6.02)   3.4759 (6.05)   3.3908 (6.13)   3.8693 (5.58)
test_benchmark_implementations[dynamo_no_dropout-1x384-t5-small]                                                  10.8165 (1.92)   10.9941 (1.89)  10.6876 (1.93)  11.3063 (1.85)  11.1824 (1.86)  11.2628 (1.87)  11.1333 (1.87)  11.6667 (1.85)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x384-bert-base-uncased]                                        3.4193 (6.07)    3.4691 (5.99)   3.115 (6.61)    4.2035 (4.97)   3.8611 (5.39)   3.8632 (5.44)   3.659 (5.68)    4.1334 (5.22)
test_benchmark_implementations[dynamo_optimized-1x384-bert-base-uncased]                                          14.4079 (1.44)   14.412 (1.44)   14.2766 (1.44)  14.5265 (1.44)  14.6538 (1.42)  14.7368 (1.43)  14.5943 (1.42)  15.1598 (1.42)
test_benchmark_implementations[dynamo_optimized-1x384-sentence-transformers/all-MiniLM-L6-v2]                     7.4004 (2.81)    7.4217 (2.8)    7.3614 (2.8)    7.5151 (2.78)   7.6889 (2.71)   7.752 (2.71)    7.6611 (2.71)   8.1893 (2.63)
test_benchmark_implementations[dynamo_optimized-1x384-t5-small]                                                   20.7627 (1.0)    20.7665 (1.0)   20.5988 (1.0)   20.8773 (1.0)   20.8141 (1.0)   21.0309 (1.0)   20.7955 (1.0)   21.5749 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-bert-base-uncased]                              2.1473 (9.67)    2.1955 (9.46)   2.1053 (9.78)   2.2794 (9.16)   2.1314 (9.77)   2.1101 (9.97)   2.0529 (10.13)  2.1454 (10.06)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-sentence-transformers/all-MiniLM-L6-v2]         0.9185 (22.6)    0.9186 (22.61)  0.9155 (22.5)   0.9226 (22.63)  0.8923 (23.33)  0.8929 (23.55)  0.8829 (23.55)  0.977 (22.08)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-t5-small]                                       3.4908 (5.95)    3.4918 (5.95)   3.4847 (5.91)   3.5011 (5.96)   3.1636 (6.58)   3.1668 (6.64)   3.1573 (6.59)   3.2561 (6.63)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x384-bert-base-uncased]                       2.4115 (8.61)    2.4114 (8.61)   2.4084 (8.55)   2.4187 (8.63)   2.2461 (9.27)   2.2355 (9.41)   2.1826 (9.53)   2.2883 (9.43)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x384-sentence-transformers/all-MiniLM-L6-v2]  0.9257 (22.43)   0.926 (22.43)   0.9226 (22.33)  0.9318 (22.4)   0.8971 (23.2)   0.8973 (23.44)  0.8874 (23.43)  0.9833 (21.94)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x384-t5-small]                                3.4222 (6.07)    3.2859 (6.32)   3.0433 (6.77)   3.4273 (6.09)   3.092 (6.73)    3.0949 (6.8)    3.0864 (6.74)   3.1863 (6.77)
test_benchmark_implementations[onnx-1x384-bert-base-uncased]                                                      5.33 (3.9)       5.4046 (3.84)   5.3199 (3.87)   6.0908 (3.43)   5.0104 (4.15)   4.9829 (4.22)   4.8268 (4.31)   5.3708 (4.02)
test_benchmark_implementations[onnx_optim_fp16-1x384-bert-base-uncased]                                           3.1867 (6.52)    3.1863 (6.52)   3.159 (6.52)    3.2482 (6.43)   3.2173 (6.47)   3.2763 (6.42)   3.2061 (6.49)   3.6311 (5.94)
test_benchmark_implementations[onnx_optim_fp32-1x384-bert-base-uncased]                                           5.333 (3.89)     5.2543 (3.95)   5.0278 (4.1)    5.3865 (3.88)   5.0063 (4.16)   4.9813 (4.22)   4.8585 (4.28)   5.2087 (4.14)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(1, 512)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x512-bert-base-uncased]                                                  7.6657 (2.7)     7.6953 (2.72)   7.6001 (2.71)   7.8735 (2.74)   7.9618 (2.67)   8.0523 (2.66)   7.8233 (2.65)   8.9374 (2.44)
test_benchmark_implementations[baseline-1x512-sentence-transformers/all-MiniLM-L6-v2]                             3.9485 (5.24)    3.9815 (5.26)   3.8789 (5.31)   4.2179 (5.12)   4.9473 (4.3)    4.9967 (4.29)   4.5362 (4.58)   5.614 (3.89)
test_benchmark_implementations[baseline-1x512-t5-small]                                                           12.9823 (1.59)   13.0233 (1.61)  12.7786 (1.61)  13.6499 (1.58)  13.1497 (1.62)  13.3599 (1.6)   12.9257 (1.61)  14.9242 (1.46)
test_benchmark_implementations[dynamo-1x512-bert-base-uncased]                                                    6.57 (3.15)      6.5985 (3.17)   6.4276 (3.2)    6.8362 (3.16)   6.7924 (3.13)   6.8549 (3.12)   6.706 (3.1)     7.3642 (2.97)
test_benchmark_implementations[dynamo-1x512-sentence-transformers/all-MiniLM-L6-v2]                               3.3341 (6.21)    3.3528 (6.24)   3.2809 (6.27)   3.5145 (6.14)   3.6213 (5.87)   3.6569 (5.86)   3.5384 (5.87)   4.0563 (5.39)
test_benchmark_implementations[dynamo-1x512-t5-small]                                                             11.4955 (1.8)    11.5166 (1.82)  11.3582 (1.81)  11.642 (1.85)   11.8769 (1.79)  11.9532 (1.79)  11.7719 (1.76)  12.4745 (1.75)
test_benchmark_implementations[dynamo_cuda_graphs-1x512-bert-base-uncased]                                        4.693 (4.41)     4.779 (4.38)    4.6879 (4.39)   5.5327 (3.9)    4.3944 (4.84)   4.4765 (4.78)   4.3294 (4.8)    5.2297 (4.18)
test_benchmark_implementations[dynamo_cuda_graphs-1x512-sentence-transformers/all-MiniLM-L6-v2]                   1.4346 (14.42)   1.4737 (14.2)   1.3629 (15.1)   2.0562 (10.49)  1.4035 (15.15)  1.4126 (15.16)  1.3977 (14.85)  1.6794 (13.01)
test_benchmark_implementations[dynamo_cuda_graphs-1x512-t5-small]                                                 4.7473 (4.36)    4.7029 (4.45)   4.4892 (4.58)   4.778 (4.52)    4.9807 (4.27)   4.9594 (4.32)   4.4888 (4.63)   5.5508 (3.94)
test_benchmark_implementations[dynamo_no_dropout-1x512-bert-base-uncased]                                         6.4298 (3.22)    6.4775 (3.23)   6.103 (3.37)    7.0461 (3.06)   7.142 (2.98)    7.3254 (2.92)   6.8822 (3.02)   8.1106 (2.69)
test_benchmark_implementations[dynamo_no_dropout-1x512-sentence-transformers/all-MiniLM-L6-v2]                    3.0854 (6.71)    3.1061 (6.74)   3.0423 (6.76)   3.258 (6.62)    3.4413 (6.18)   3.4595 (6.19)   3.3729 (6.16)   3.885 (5.62)
test_benchmark_implementations[dynamo_no_dropout-1x512-t5-small]                                                  10.9752 (1.89)   11.0247 (1.9)   10.8145 (1.9)   11.1974 (1.93)  11.4524 (1.86)  11.5261 (1.86)  11.2704 (1.84)  12.0367 (1.81)
test_benchmark_implementations[dynamo_nvfuser_ofi-1x512-bert-base-uncased]                                        3.4764 (5.95)    3.4948 (5.99)   3.3434 (6.16)   3.6445 (5.92)   3.8591 (5.51)   3.9029 (5.49)   3.7247 (5.57)   4.2731 (5.11)
test_benchmark_implementations[dynamo_optimized-1x512-bert-base-uncased]                                          14.4579 (1.43)   14.4821 (1.44)  14.3903 (1.43)  14.592 (1.48)   14.7253 (1.44)  14.8562 (1.44)  14.6028 (1.42)  15.4563 (1.41)
test_benchmark_implementations[dynamo_optimized-1x512-sentence-transformers/all-MiniLM-L6-v2]                     8.0947 (2.56)    8.1073 (2.58)   8.0305 (2.56)   8.1736 (2.64)   7.7937 (2.73)   7.8728 (2.72)   7.7057 (2.69)   8.3344 (2.62)
test_benchmark_implementations[dynamo_optimized-1x512-t5-small]                                                   20.6938 (1.0)    20.9265 (1.0)   20.5784 (1.0)   21.5747 (1.0)   21.2585 (1.0)   21.4147 (1.0)   20.7622 (1.0)   21.8442 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-bert-base-uncased]                              3.2481 (6.37)    3.2485 (6.44)   3.2451 (6.34)   3.2522 (6.63)   2.9354 (7.24)   2.9284 (7.31)   2.8956 (7.17)   2.9821 (7.33)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-sentence-transformers/all-MiniLM-L6-v2]         1.321 (15.67)    1.3208 (15.84)  1.3169 (15.63)  1.3261 (16.27)  1.2974 (16.39)  1.2989 (16.49)  1.2902 (16.09)  1.3913 (15.7)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-t5-small]                                       4.3223 (4.79)    4.3241 (4.84)   4.3192 (4.76)   4.3315 (4.98)   3.9595 (5.37)   3.9636 (5.4)    3.9517 (5.25)   4.0497 (5.39)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x512-bert-base-uncased]                       3.4355 (6.02)    3.3516 (6.24)   3.1119 (6.61)   3.4417 (6.27)   3.0944 (6.87)   3.0966 (6.92)   3.0725 (6.76)   3.1647 (6.9)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x512-sentence-transformers/all-MiniLM-L6-v2]  1.3343 (15.51)   1.3184 (15.87)  1.2687 (16.22)  1.3394 (16.11)  1.3084 (16.25)  1.3116 (16.33)  1.3006 (15.96)  1.3979 (15.63)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-1x512-t5-small]                                4.1032 (5.04)    4.0161 (5.21)   3.711 (5.55)    4.1093 (5.25)   3.7299 (5.7)    3.7354 (5.73)   3.7247 (5.57)   3.828 (5.71)
test_benchmark_implementations[onnx-1x512-bert-base-uncased]                                                      7.9278 (2.61)    7.9378 (2.64)   7.9084 (2.6)    7.9862 (2.7)    7.3769 (2.88)   7.3306 (2.92)   7.1625 (2.9)    7.4711 (2.92)
test_benchmark_implementations[onnx_optim_fp16-1x512-bert-base-uncased]                                           4.2711 (4.85)    4.2818 (4.89)   4.2506 (4.84)   4.4483 (4.85)   3.9707 (5.35)   3.9784 (5.38)   3.8987 (5.33)   4.2722 (5.11)
test_benchmark_implementations[onnx_optim_fp32-1x512-bert-base-uncased]                                           7.4834 (2.77)    7.7587 (2.7)    7.4639 (2.76)   8.3671 (2.58)   7.3705 (2.88)   7.3416 (2.92)   7.1789 (2.89)   7.4897 (2.92)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 128)
Name                                                                                                               Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-bert-base-uncased]                                                  20.1103 (1.91)   20.222 (1.96)   20.098 (1.91)   20.4564 (2.0)   20.2393 (1.84)  20.3365 (1.85)  19.242 (1.94)   21.4758 (1.77)
test_benchmark_implementations[baseline-32x128-sentence-transformers/all-MiniLM-L6-v2]                             4.7729 (8.06)    4.8271 (8.22)   4.7647 (8.07)   5.5798 (7.32)   4.9798 (7.49)   5.0319 (7.49)   4.8581 (7.68)   5.558 (6.84)
test_benchmark_implementations[baseline-32x128-t5-small]                                                           17.5053 (2.2)    17.6603 (2.25)  17.4887 (2.2)   18.0183 (2.27)  17.8109 (2.09)  17.8566 (2.11)  17.6484 (2.11)  18.141 (2.1)
test_benchmark_implementations[dynamo-32x128-bert-base-uncased]                                                    20.1605 (1.91)   20.3031 (1.95)  20.1329 (1.91)  20.4626 (2.0)   19.6522 (1.9)   19.755 (1.91)   19.2226 (1.94)  20.1006 (1.89)
test_benchmark_implementations[dynamo-32x128-sentence-transformers/all-MiniLM-L6-v2]                               4.9705 (7.74)    5.0075 (7.92)   4.9633 (7.75)   5.7559 (7.1)    4.9375 (7.55)   4.9243 (7.65)   4.8092 (7.76)   5.0393 (7.55)
test_benchmark_implementations[dynamo-32x128-t5-small]                                                             17.5852 (2.19)   17.663 (2.25)   17.5104 (2.2)   18.0675 (2.26)  17.7046 (2.11)  17.6579 (2.13)  17.4258 (2.14)  17.7565 (2.14)
test_benchmark_implementations[dynamo_cuda_graphs-32x128-bert-base-uncased]                                        21.0074 (1.83)   21.6282 (1.83)  20.9603 (1.84)  22.3089 (1.83)  18.9083 (1.97)  19.4129 (1.94)  18.8095 (1.98)  20.0227 (1.9)
test_benchmark_implementations[dynamo_cuda_graphs-32x128-sentence-transformers/all-MiniLM-L6-v2]                   4.7831 (8.04)    4.8748 (8.14)   4.778 (8.05)    6.2177 (6.57)   4.6992 (7.94)   4.7629 (7.91)   4.5362 (8.22)   5.222 (7.28)
test_benchmark_implementations[dynamo_cuda_graphs-32x128-t5-small]                                                 18.2067 (2.11)   17.9569 (2.21)  17.1796 (2.24)  18.262 (2.24)   17.9897 (2.07)  18.012 (2.09)   17.788 (2.1)    18.2229 (2.09)
test_benchmark_implementations[dynamo_no_dropout-32x128-bert-base-uncased]                                         20.1708 (1.91)   20.2898 (1.96)  20.096 (1.91)   20.4513 (2.0)   19.7122 (1.89)  19.8281 (1.9)   19.143 (1.95)   20.256 (1.88)
test_benchmark_implementations[dynamo_no_dropout-32x128-sentence-transformers/all-MiniLM-L6-v2]                    4.951 (7.77)     4.9535 (8.01)   4.9377 (7.79)   4.9744 (8.22)   4.947 (7.54)    4.9294 (7.64)   4.8045 (7.76)   5.0619 (7.51)
test_benchmark_implementations[dynamo_no_dropout-32x128-t5-small]                                                  17.5483 (2.19)   17.5473 (2.26)  17.4694 (2.2)   17.6302 (2.32)  17.7158 (2.11)  17.6569 (2.13)  17.4036 (2.14)  17.8327 (2.13)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x128-bert-base-uncased]                                        17.4756 (2.2)    17.5356 (2.26)  17.4582 (2.2)   17.8002 (2.3)   17.4646 (2.14)  17.1627 (2.19)  16.5188 (2.26)  17.4839 (2.18)
test_benchmark_implementations[dynamo_optimized-32x128-bert-base-uncased]                                          14.4835 (2.66)   14.5314 (2.73)  14.3811 (2.67)  14.806 (2.76)   14.9588 (2.49)  15.0727 (2.5)   14.8731 (2.51)  15.4737 (2.46)
test_benchmark_implementations[dynamo_optimized-32x128-sentence-transformers/all-MiniLM-L6-v2]                     7.5981 (5.06)    7.6072 (5.21)   7.5574 (5.09)   7.6791 (5.32)   7.9115 (4.71)   7.9729 (4.72)   7.8787 (4.73)   8.5487 (4.45)
test_benchmark_implementations[dynamo_optimized-32x128-t5-small]                                                   20.5292 (1.87)   20.5788 (1.93)  20.5005 (1.88)  20.6438 (1.98)  20.7852 (1.79)  20.9264 (1.8)   20.6732 (1.8)   21.3877 (1.78)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-bert-base-uncased]                              13.4666 (2.86)   13.4672 (2.95)  13.4615 (2.86)  13.4707 (3.03)  13.3705 (2.79)  13.1356 (2.87)  12.5117 (2.98)  13.5314 (2.81)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-sentence-transformers/all-MiniLM-L6-v2]         3.84 (10.02)     3.8404 (10.33)  3.8359 (10.03)  3.8451 (10.63)  3.823 (9.76)    3.7664 (10.0)   3.6189 (10.31)  3.8349 (9.92)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-t5-small]                                       14.8449 (2.59)   14.8485 (2.67)  14.8398 (2.59)  14.8593 (2.75)  13.9983 (2.66)  14.1218 (2.67)  13.9111 (2.68)  14.2875 (2.66)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x128-bert-base-uncased]                       15.0342 (2.56)   15.0459 (2.64)  14.7067 (2.62)  15.3201 (2.67)  14.9044 (2.5)   14.7121 (2.56)  13.8642 (2.69)  15.3747 (2.47)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x128-sentence-transformers/all-MiniLM-L6-v2]  3.8973 (9.87)    3.8972 (10.18)  3.8902 (9.89)   3.9076 (10.46)  3.8645 (9.65)   3.7958 (9.92)   3.6589 (10.19)  3.8817 (9.8)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x128-t5-small]                                14.1865 (2.71)   14.3324 (2.77)  14.121 (2.72)   14.7098 (2.78)  13.8863 (2.69)  13.9945 (2.69)  13.806 (2.7)    14.1615 (2.69)
test_benchmark_implementations[onnx-32x128-bert-base-uncased]                                                      38.4668 (1.0)    39.6668 (1.0)   38.4668 (1.0)   40.8668 (1.0)   37.3022 (1.0)   37.6702 (1.0)   37.3022 (1.0)   38.0383 (1.0)
test_benchmark_implementations[onnx_optim_fp16-32x128-bert-base-uncased]                                           18.4924 (2.08)   18.7878 (2.11)  18.4033 (2.09)  20.0591 (2.04)  19.0283 (1.96)  18.9233 (1.99)  18.2747 (2.04)  19.5136 (1.95)
test_benchmark_implementations[onnx_optim_fp32-32x128-bert-base-uncased]                                           38.0561 (1.01)   38.1012 (1.04)  38.0561 (1.01)  38.1462 (1.07)  37.2961 (1.0)   37.589 (1.0)    37.2961 (1.0)   37.8818 (1.0)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 16)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x16-bert-base-uncased]                                                  8.0949 (2.53)    8.0867 (2.54)   7.9974 (2.56)   8.1654 (2.52)   8.6161 (2.42)   8.7201 (2.41)   8.3377 (2.49)   9.3768 (2.29)
test_benchmark_implementations[baseline-32x16-sentence-transformers/all-MiniLM-L6-v2]                             4.1523 (4.94)    4.1759 (4.91)   4.0335 (5.08)   4.395 (4.68)    4.5378 (4.6)    4.5921 (4.57)   4.4281 (4.68)   5.2063 (4.12)
test_benchmark_implementations[baseline-32x16-t5-small]                                                           13.4267 (1.53)   13.5531 (1.51)  13.2096 (1.55)  14.3955 (1.43)  13.4661 (1.55)  13.6877 (1.53)  13.4007 (1.55)  14.9987 (1.43)
test_benchmark_implementations[dynamo-32x16-bert-base-uncased]                                                    6.8046 (3.01)    6.8248 (3.0)    6.6857 (3.06)   7.038 (2.92)    7.1556 (2.92)   7.178 (2.93)    7.0615 (2.93)   7.4938 (2.87)
test_benchmark_implementations[dynamo-32x16-sentence-transformers/all-MiniLM-L6-v2]                               3.4652 (5.91)    3.475 (5.9)     3.4099 (6.0)    3.5994 (5.71)   3.803 (5.49)    3.8381 (5.47)   3.7357 (5.55)   4.1657 (5.15)
test_benchmark_implementations[dynamo-32x16-t5-small]                                                             11.99 (1.71)     11.9973 (1.71)  11.9204 (1.72)  12.0719 (1.7)   12.2362 (1.71)  12.323 (1.7)    12.0694 (1.72)  12.8124 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-32x16-bert-base-uncased]                                        3.4601 (5.92)    3.5151 (5.83)   3.458 (5.92)    4.2936 (4.79)   3.1745 (6.57)   3.2583 (6.45)   3.0927 (6.7)    3.6324 (5.91)
test_benchmark_implementations[dynamo_cuda_graphs-32x16-sentence-transformers/all-MiniLM-L6-v2]                   0.8243 (24.86)   0.837 (24.49)   0.8223 (24.9)   1.2984 (15.82)  0.7872 (26.5)   0.8048 (26.09)  0.7792 (26.59)  1.065 (20.16)
test_benchmark_implementations[dynamo_cuda_graphs-32x16-t5-small]                                                 2.8672 (7.15)    2.9102 (7.04)   2.8641 (7.15)   3.3137 (6.2)    2.5922 (8.05)   2.6265 (8.0)    2.5846 (8.02)   3.0319 (7.08)
test_benchmark_implementations[dynamo_no_dropout-32x16-bert-base-uncased]                                         6.5608 (3.12)    6.6115 (3.1)    6.4276 (3.18)   7.2261 (2.84)   6.6954 (3.12)   6.7312 (3.12)   6.5988 (3.14)   7.1048 (3.02)
test_benchmark_implementations[dynamo_no_dropout-32x16-sentence-transformers/all-MiniLM-L6-v2]                    3.2236 (6.36)    3.2352 (6.34)   3.1693 (6.46)   3.4174 (6.01)   3.9266 (5.31)   3.9418 (5.33)   3.866 (5.36)    4.3407 (4.95)
test_benchmark_implementations[dynamo_no_dropout-32x16-t5-small]                                                  11.3603 (1.8)    11.3777 (1.8)   11.2723 (1.82)  11.4729 (1.79)  11.6763 (1.79)  11.7824 (1.78)  11.5762 (1.79)  12.345 (1.74)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x16-bert-base-uncased]                                        3.8779 (5.28)    3.9158 (5.24)   3.5236 (5.81)   4.4534 (4.61)   4.2331 (4.93)   4.2567 (4.93)   4.0975 (5.06)   4.6687 (4.6)
test_benchmark_implementations[dynamo_optimized-32x16-bert-base-uncased]                                          14.4568 (1.42)   14.4705 (1.42)  14.3647 (1.43)  14.5621 (1.41)  14.937 (1.4)    15.0749 (1.39)  14.7323 (1.41)  15.496 (1.39)
test_benchmark_implementations[dynamo_optimized-32x16-sentence-transformers/all-MiniLM-L6-v2]                     7.5889 (2.7)     7.6036 (2.7)    7.5305 (2.72)   7.7322 (2.66)   7.9664 (2.62)   8.0316 (2.61)   7.8605 (2.64)   8.4617 (2.54)
test_benchmark_implementations[dynamo_optimized-32x16-t5-small]                                                   20.4933 (1.0)    20.5025 (1.0)   20.4718 (1.0)   20.5476 (1.0)   20.8647 (1.0)   21.0003 (1.0)   20.7213 (1.0)   21.4716 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-bert-base-uncased]                              2.5938 (7.9)     2.4931 (8.22)   2.3726 (8.63)   2.5989 (7.91)   2.3739 (8.79)   2.3617 (8.89)   2.3197 (8.93)   2.4097 (8.91)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-sentence-transformers/all-MiniLM-L6-v2]         0.7096 (28.88)   0.7103 (28.86)  0.7076 (28.93)  0.7178 (28.62)  0.6732 (30.99)  0.6749 (31.12)  0.6704 (30.91)  0.771 (27.85)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-t5-small]                                       2.0572 (9.96)    2.0577 (9.96)   2.0552 (9.96)   2.0603 (9.97)   1.8831 (11.08)  1.8862 (11.13)  1.8638 (11.12)  1.9654 (10.92)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x16-bert-base-uncased]                       2.603 (7.87)     2.9186 (7.02)   2.3747 (8.62)   3.7663 (5.46)   2.3915 (8.72)   2.522 (8.33)    2.3258 (8.91)   3.1608 (6.79)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x16-sentence-transformers/all-MiniLM-L6-v2]  0.6656 (30.79)   0.6941 (29.54)  0.6636 (30.85)  1.3199 (15.57)  0.6755 (30.89)  0.6917 (30.36)  0.671 (30.88)   1.1886 (18.06)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x16-t5-small]                                2.0613 (9.94)    2.0225 (10.14)  1.876 (10.91)   2.4381 (8.43)   1.9003 (10.98)  1.9104 (10.99)  1.8736 (11.06)  2.3291 (9.22)
test_benchmark_implementations[onnx-32x16-bert-base-uncased]                                                      5.718 (3.58)     5.8492 (3.51)   5.6607 (3.62)   6.2648 (3.28)   5.6611 (3.69)   5.6351 (3.73)   5.5059 (3.76)   5.9045 (3.64)
test_benchmark_implementations[onnx_optim_fp16-32x16-bert-base-uncased]                                           4.0346 (5.08)    4.0609 (5.05)   3.6199 (5.66)   4.9091 (4.19)   3.317 (6.29)    3.3535 (6.26)   3.2826 (6.31)   3.7392 (5.74)
test_benchmark_implementations[onnx_optim_fp32-32x16-bert-base-uncased]                                           5.7446 (3.57)    5.9467 (3.45)   5.6535 (3.62)   7.0769 (2.9)    5.6742 (3.68)   5.6544 (3.71)   5.5423 (3.74)   5.886 (3.65)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 256)
Name                                                                                                               Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-bert-base-uncased]                                                  43.8989 (1.72)   43.9153 (1.72)  43.8989 (1.72)  43.9316 (1.72)  44.0885 (1.73)  44.9013 (1.7)   44.0885 (1.73)  45.714 (1.67)
test_benchmark_implementations[baseline-32x256-sentence-transformers/all-MiniLM-L6-v2]                             11.9368 (6.33)   11.95 (6.32)    11.9265 (6.34)  12.0136 (6.29)  12.0605 (6.33)  11.9829 (6.37)  11.6743 (6.54)  12.1666 (6.27)
test_benchmark_implementations[baseline-32x256-t5-small]                                                           38.7185 (1.95)   39.3805 (1.92)  38.7185 (1.95)  40.0425 (1.89)  38.3135 (1.99)  38.6135 (1.98)  38.3135 (1.99)  38.9135 (1.96)
test_benchmark_implementations[dynamo-32x256-bert-base-uncased]                                                    44.1713 (1.71)   44.1718 (1.71)  44.1713 (1.71)  44.1723 (1.71)  44.2216 (1.73)  44.7548 (1.7)   44.2216 (1.73)  45.2881 (1.68)
test_benchmark_implementations[dynamo-32x256-sentence-transformers/all-MiniLM-L6-v2]                               12.1252 (6.23)   12.1284 (6.23)  12.1201 (6.24)  12.1405 (6.23)  11.9394 (6.39)  11.9082 (6.41)  11.6552 (6.55)  12.0576 (6.33)
test_benchmark_implementations[dynamo-32x256-t5-small]                                                             38.826 (1.95)    38.8352 (1.95)  38.826 (1.95)   38.8444 (1.95)  38.9082 (1.96)  39.5505 (1.93)  38.9082 (1.96)  40.1928 (1.9)
test_benchmark_implementations[dynamo_cuda_graphs-32x256-bert-base-uncased]                                        43.777 (1.73)    43.7857 (1.73)  43.777 (1.73)   43.7944 (1.73)  43.785 (1.74)   44.0495 (1.73)  43.785 (1.74)   44.314 (1.72)
test_benchmark_implementations[dynamo_cuda_graphs-32x256-sentence-transformers/all-MiniLM-L6-v2]                   11.7975 (6.41)   11.8012 (6.4)   11.7862 (6.41)  11.8282 (6.39)  11.8569 (6.43)  11.7862 (6.47)  11.4683 (6.65)  11.9419 (6.39)
test_benchmark_implementations[dynamo_cuda_graphs-32x256-t5-small]                                                 38.8639 (1.94)   38.8797 (1.94)  38.8639 (1.94)  38.8956 (1.94)  37.8844 (2.01)  38.4157 (1.99)  37.8844 (2.01)  38.9469 (1.96)
test_benchmark_implementations[dynamo_no_dropout-32x256-bert-base-uncased]                                         44.1754 (1.71)   44.18 (1.71)    44.1754 (1.71)  44.1846 (1.71)  44.6561 (1.71)  58.9414 (1.29)  44.6561 (1.71)  73.2268 (1.04)
test_benchmark_implementations[dynamo_no_dropout-32x256-sentence-transformers/all-MiniLM-L6-v2]                    12.1272 (6.23)   12.1317 (6.23)  12.119 (6.24)   12.1498 (6.22)  11.8996 (6.41)  11.9185 (6.4)   11.6985 (6.52)  12.0558 (6.33)
test_benchmark_implementations[dynamo_no_dropout-32x256-t5-small]                                                  38.8813 (1.94)   39.5628 (1.91)  38.8813 (1.94)  40.2442 (1.88)  39.0476 (1.95)  39.3913 (1.94)  39.0476 (1.95)  39.7349 (1.92)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x256-bert-base-uncased]                                        36.1165 (2.09)   36.1262 (2.09)  36.1165 (2.09)  36.1359 (2.09)  34.5393 (2.21)  35.392 (2.16)   34.5393 (2.21)  36.2447 (2.1)
test_benchmark_implementations[dynamo_optimized-32x256-bert-base-uncased]                                          28.6556 (2.64)   28.6587 (2.64)  28.6546 (2.64)  28.6659 (2.64)  27.053 (2.82)   27.1582 (2.81)  26.309 (2.9)    28.1126 (2.71)
test_benchmark_implementations[dynamo_optimized-32x256-sentence-transformers/all-MiniLM-L6-v2]                     10.1489 (7.45)   10.1628 (7.44)  10.1386 (7.45)  10.1898 (7.42)  10.3971 (7.34)  10.3376 (7.38)  9.9587 (7.66)   10.4419 (7.31)
test_benchmark_implementations[dynamo_optimized-32x256-t5-small]                                                   34.1545 (2.21)   34.1647 (2.21)  34.1545 (2.21)  34.175 (2.21)   34.2774 (2.23)  34.4148 (2.22)  34.2774 (2.23)  34.5523 (2.21)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-bert-base-uncased]                              27.2364 (2.78)   27.2667 (2.77)  27.2087 (2.78)  27.3551 (2.76)  27.3652 (2.79)  26.7719 (2.85)  25.4413 (3.0)   27.5092 (2.77)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-sentence-transformers/all-MiniLM-L6-v2]         9.7085 (7.79)    9.7077 (7.79)   9.7034 (7.79)   9.7106 (7.78)   9.5923 (7.95)   9.5918 (7.95)   9.3268 (8.18)   9.7865 (7.8)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-t5-small]                                       33.3466 (2.27)   33.3496 (2.27)  33.3466 (2.27)  33.3527 (2.27)  32.9591 (2.31)  33.1959 (2.3)   32.9591 (2.31)  33.4327 (2.28)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x256-bert-base-uncased]                       28.3853 (2.66)   28.3856 (2.66)  28.3812 (2.66)  28.3904 (2.66)  27.9692 (2.73)  27.7448 (2.75)  26.5484 (2.87)  28.7167 (2.66)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x256-sentence-transformers/all-MiniLM-L6-v2]  9.8796 (7.65)    9.8846 (7.65)   9.8693 (7.66)   9.9133 (7.62)   9.8898 (7.71)   9.7855 (7.8)    9.5171 (8.02)   9.9586 (7.66)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x256-t5-small]                                30.9391 (2.44)   30.9391 (2.44)  30.9289 (2.44)  30.9494 (2.44)  30.9829 (2.46)  30.8543 (2.47)  30.3163 (2.52)  31.2637 (2.44)
test_benchmark_implementations[onnx-32x256-bert-base-uncased]                                                      75.5272 (1.0)    75.5272 (1.0)   75.5272 (1.0)   75.5272 (1.0)   76.2924 (1.0)   76.2924 (1.0)   76.2924 (1.0)   76.2924 (1.0)
test_benchmark_implementations[onnx_optim_fp16-32x256-bert-base-uncased]                                           36.7002 (2.06)   38.138 (1.98)   36.7002 (2.06)  39.5759 (1.91)  35.0286 (2.18)  35.7077 (2.14)  35.0286 (2.18)  36.3868 (2.1)
test_benchmark_implementations[onnx_optim_fp32-32x256-bert-base-uncased]                                           75.5815 (1.0)    75.5815 (1.0)   75.5815 (1.0)   75.5815 (1.0)   75.7002 (1.01)  75.7002 (1.01)  75.7002 (1.01)  75.7002 (1.01)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 32)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x32-bert-base-uncased]                                                  8.318 (3.03)     8.3861 (3.07)   8.2217 (3.04)   8.9518 (3.03)   8.6713 (2.96)   8.7514 (2.97)   8.5516 (2.86)   9.6215 (2.89)
test_benchmark_implementations[baseline-32x32-sentence-transformers/all-MiniLM-L6-v2]                             4.3563 (5.79)    4.5901 (5.62)   3.8052 (6.56)   6.6755 (4.07)   4.6476 (5.52)   4.6881 (5.54)   4.6157 (5.29)   5.3856 (5.17)
test_benchmark_implementations[baseline-32x32-t5-small]                                                           13.2076 (1.91)   13.4256 (1.92)  12.8911 (1.94)  15.0006 (1.81)  17.6551 (1.45)  17.6652 (1.47)  14.2071 (1.72)  20.7045 (1.35)
test_benchmark_implementations[dynamo-32x32-bert-base-uncased]                                                    6.7973 (3.71)    6.9412 (3.71)   6.5833 (3.79)   7.4977 (3.62)   7.5334 (3.41)   7.4935 (3.47)   7.0981 (3.44)   7.9453 (3.51)
test_benchmark_implementations[dynamo-32x32-sentence-transformers/all-MiniLM-L6-v2]                               3.5197 (7.17)    3.5462 (7.27)   3.3772 (7.39)   3.7745 (7.2)    3.9039 (6.57)   3.9479 (6.58)   3.7399 (6.53)   4.547 (6.13)
test_benchmark_implementations[dynamo-32x32-t5-small]                                                             11.5804 (2.18)   11.6471 (2.21)  11.3123 (2.21)  11.9941 (2.26)  11.7353 (2.19)  11.9069 (2.18)  11.5649 (2.11)  13.1716 (2.11)
test_benchmark_implementations[dynamo_cuda_graphs-32x32-bert-base-uncased]                                        6.2792 (4.02)    6.1798 (4.17)   5.7436 (4.35)   6.2843 (4.32)   5.8772 (4.37)   5.8711 (4.43)   5.6244 (4.34)   6.3794 (4.37)
test_benchmark_implementations[dynamo_cuda_graphs-32x32-sentence-transformers/all-MiniLM-L6-v2]                   1.2687 (19.88)   1.2685 (20.33)  1.2657 (19.72)  1.2739 (21.32)  1.1954 (21.46)  1.2193 (21.31)  1.1768 (20.76)  1.4892 (18.7)
test_benchmark_implementations[dynamo_cuda_graphs-32x32-t5-small]                                                 4.2322 (5.96)    4.196 (6.14)    3.9004 (6.4)    4.2363 (6.41)   3.8989 (6.58)   4.0423 (6.43)   3.821 (6.39)    4.4923 (6.2)
test_benchmark_implementations[dynamo_no_dropout-32x32-bert-base-uncased]                                         6.57 (3.84)      6.4922 (3.97)   5.9689 (4.18)   6.8342 (3.97)   6.92 (3.71)     7.0632 (3.68)   6.8101 (3.59)   8.0956 (3.44)
test_benchmark_implementations[dynamo_no_dropout-32x32-sentence-transformers/all-MiniLM-L6-v2]                    3.2974 (7.65)    3.3093 (7.79)   3.2494 (7.68)   3.4437 (7.89)   3.6381 (7.05)   3.6652 (7.09)   3.5759 (6.83)   4.0713 (6.84)
test_benchmark_implementations[dynamo_no_dropout-32x32-t5-small]                                                  10.7336 (2.35)   10.7423 (2.4)   10.6424 (2.35)  10.8104 (2.51)  11.035 (2.33)   11.1114 (2.34)  10.9449 (2.23)  11.5263 (2.42)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x32-bert-base-uncased]                                        5.5591 (4.54)    5.559 (4.64)    5.5542 (4.49)   5.5656 (4.88)   5.1721 (4.96)   5.1773 (5.02)   5.1374 (4.76)   5.3707 (5.19)
test_benchmark_implementations[dynamo_optimized-32x32-bert-base-uncased]                                          15.8034 (1.6)    16.0256 (1.61)  15.7696 (1.58)  16.9861 (1.6)   16.9422 (1.51)  17.0062 (1.53)  16.772 (1.46)   17.4459 (1.6)
test_benchmark_implementations[dynamo_optimized-32x32-sentence-transformers/all-MiniLM-L6-v2]                     8.7665 (2.88)    8.7274 (2.95)   8.1459 (3.06)   9.3266 (2.91)   9.0715 (2.83)   9.1803 (2.83)   8.6944 (2.81)   10.191 (2.73)
test_benchmark_implementations[dynamo_optimized-32x32-t5-small]                                                   25.223 (1.0)     25.7836 (1.0)   24.9641 (1.0)   27.1636 (1.0)   25.66 (1.0)     25.983 (1.0)    24.4351 (1.0)   27.8538 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x32-bert-base-uncased]                              4.6356 (5.44)    4.6359 (5.56)   4.6316 (5.39)   4.6408 (5.85)   4.4108 (5.82)   4.3622 (5.96)   4.1856 (5.84)   4.7117 (5.91)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x32-sentence-transformers/all-MiniLM-L6-v2]         1.0209 (24.71)   0.9912 (26.01)  0.9349 (26.7)   1.026 (26.47)   0.9732 (26.37)  0.9721 (26.73)  0.9546 (25.6)   1.052 (26.48)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x32-t5-small]                                       3.4007 (7.42)    3.4007 (7.58)   3.3966 (7.35)   3.4099 (7.97)   3.1713 (8.09)   3.1506 (8.25)   3.0793 (7.94)   3.1925 (8.72)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x32-bert-base-uncased]                       4.6418 (5.43)    4.6423 (5.55)   4.6397 (5.38)   4.6459 (5.85)   4.3077 (5.96)   4.3155 (6.02)   4.1212 (5.93)   4.7883 (5.82)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x32-sentence-transformers/all-MiniLM-L6-v2]  1.025 (24.61)    1.0171 (25.35)  0.94 (26.56)    1.0332 (26.29)  0.9681 (26.5)   0.9678 (26.85)  0.9531 (25.64)  1.0507 (26.51)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x32-t5-small]                                3.3997 (7.42)    3.3993 (7.59)   3.3935 (7.36)   3.4017 (7.99)   3.1266 (8.21)   3.1117 (8.35)   3.0536 (8.0)    3.1578 (8.82)
test_benchmark_implementations[onnx-32x32-bert-base-uncased]                                                      12.1283 (2.08)   11.9173 (2.16)  11.2282 (2.22)  12.1641 (2.23)  11.0919 (2.31)  11.2471 (2.31)  11.0335 (2.21)  11.8642 (2.35)
test_benchmark_implementations[onnx_optim_fp16-32x32-bert-base-uncased]                                           6.2607 (4.03)    6.2683 (4.11)   6.2525 (3.99)   6.3244 (4.3)    5.6954 (4.51)   5.7099 (4.55)   5.6174 (4.35)   5.9653 (4.67)
test_benchmark_implementations[onnx_optim_fp32-32x32-bert-base-uncased]                                           12.0381 (2.1)    12.1289 (2.13)  11.4872 (2.17)  12.7601 (2.13)  10.9387 (2.35)  10.9769 (2.37)  10.7914 (2.26)  11.3731 (2.45)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(32, 33)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x33-bert-base-uncased]                                                  7.9391 (2.61)    7.9654 (2.61)   7.8582 (2.63)   8.1449 (2.56)   8.2707 (2.54)   8.3515 (2.52)   8.1364 (2.56)   9.1183 (2.36)
test_benchmark_implementations[baseline-32x33-sentence-transformers/all-MiniLM-L6-v2]                             4.1667 (4.97)    4.1697 (4.98)   4.055 (5.1)     4.3868 (4.76)   4.6014 (4.56)   4.6691 (4.51)   4.4708 (4.65)   5.3424 (4.03)
test_benchmark_implementations[baseline-32x33-t5-small]                                                           12.9792 (1.6)    13.0531 (1.59)  12.8471 (1.61)  13.5711 (1.54)  13.1205 (1.6)   13.2024 (1.6)   12.8459 (1.62)  13.8136 (1.56)
test_benchmark_implementations[dynamo-32x33-bert-base-uncased]                                                    6.9765 (2.97)    6.9768 (2.97)   6.9734 (2.96)   6.9837 (2.99)   7.3883 (2.84)   7.3851 (2.85)   7.0931 (2.93)   7.9083 (2.72)
test_benchmark_implementations[dynamo-32x33-sentence-transformers/all-MiniLM-L6-v2]                               3.6854 (5.62)    3.6974 (5.61)   3.6652 (5.64)   3.8186 (5.47)   4.0408 (5.19)   4.0671 (5.18)   3.9996 (5.2)    4.4164 (4.88)
test_benchmark_implementations[dynamo-32x33-t5-small]                                                             11.3664 (1.82)   11.4128 (1.82)  11.2906 (1.83)  11.5908 (1.8)   11.6923 (1.79)  11.8049 (1.79)  11.6673 (1.78)  12.3657 (1.74)
test_benchmark_implementations[dynamo_cuda_graphs-32x33-bert-base-uncased]                                        6.1112 (3.39)    6.1855 (3.36)   6.0754 (3.4)    6.6611 (3.13)   6.0649 (3.46)   6.0383 (3.49)   5.9497 (3.5)    6.0727 (3.55)
test_benchmark_implementations[dynamo_cuda_graphs-32x33-sentence-transformers/all-MiniLM-L6-v2]                   1.4387 (14.4)    1.4381 (14.43)  1.4346 (14.41)  1.4408 (14.49)  1.3425 (15.62)  1.3409 (15.72)  1.3239 (15.72)  1.4131 (15.24)
test_benchmark_implementations[dynamo_cuda_graphs-32x33-t5-small]                                                 4.6172 (4.49)    4.6785 (4.44)   4.3049 (4.8)    5.4149 (3.86)   4.2414 (4.94)   4.2678 (4.94)   4.1524 (5.01)   4.6688 (4.61)
test_benchmark_implementations[dynamo_no_dropout-32x33-bert-base-uncased]                                         6.9683 (2.97)    6.9692 (2.98)   6.9652 (2.97)   6.9765 (2.99)   6.739 (3.11)    6.8518 (3.08)   6.6609 (3.12)   7.5908 (2.84)
test_benchmark_implementations[dynamo_no_dropout-32x33-sentence-transformers/all-MiniLM-L6-v2]                    3.533 (5.87)     3.5313 (5.88)   3.3126 (6.24)   3.922 (5.32)    3.8736 (5.41)   3.9135 (5.39)   3.6641 (5.68)   4.3592 (4.94)
test_benchmark_implementations[dynamo_no_dropout-32x33-t5-small]                                                  11.1022 (1.87)   11.1308 (1.86)  11.0541 (1.87)  11.2077 (1.86)  11.3968 (1.84)  11.462 (1.84)   11.1947 (1.86)  12.069 (1.78)
test_benchmark_implementations[dynamo_nvfuser_ofi-32x33-bert-base-uncased]                                        6.0856 (3.41)    6.0854 (3.41)   6.0805 (3.4)    6.0897 (3.43)   5.6613 (3.7)    5.6507 (3.73)   5.5204 (3.77)   6.1203 (3.52)
test_benchmark_implementations[dynamo_optimized-32x33-bert-base-uncased]                                          14.378 (1.44)    14.4856 (1.43)  14.3494 (1.44)  14.7958 (1.41)  14.7295 (1.42)  14.8722 (1.42)  14.6572 (1.42)  15.2575 (1.41)
test_benchmark_implementations[dynamo_optimized-32x33-sentence-transformers/all-MiniLM-L6-v2]                     7.7089 (2.69)    7.7434 (2.68)   7.6689 (2.7)    7.8264 (2.67)   8.0326 (2.61)   8.0712 (2.61)   7.9361 (2.62)   8.6762 (2.48)
test_benchmark_implementations[dynamo_optimized-32x33-t5-small]                                                   20.7227 (1.0)    20.7548 (1.0)   20.6756 (1.0)   20.8745 (1.0)   20.9669 (1.0)   21.0806 (1.0)   20.8102 (1.0)   21.5374 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-bert-base-uncased]                              4.649 (4.46)     4.6502 (4.46)   4.6459 (4.45)   4.6551 (4.48)   4.3411 (4.83)   4.2777 (4.93)   4.1309 (5.04)   4.3813 (4.92)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-sentence-transformers/all-MiniLM-L6-v2]         1.1807 (17.55)   1.1812 (17.57)  1.1786 (17.54)  1.1848 (17.62)  1.1131 (18.84)  1.1067 (19.05)  1.0855 (19.17)  1.1879 (18.13)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-t5-small]                                       3.454 (6.0)      3.4544 (6.01)   3.4519 (5.99)   3.457 (6.04)    3.2238 (6.5)    3.1918 (6.6)    3.1085 (6.69)   3.2375 (6.65)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x33-bert-base-uncased]                       4.6797 (4.43)    4.9238 (4.22)   4.3407 (4.76)   5.8522 (3.57)   4.384 (4.78)    4.5645 (4.62)   4.1499 (5.01)   5.4244 (3.97)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x33-sentence-transformers/all-MiniLM-L6-v2]  1.1827 (17.52)   1.2674 (16.38)  1.1796 (17.53)  1.833 (11.39)   1.114 (18.82)   1.1178 (18.86)  1.0848 (19.18)  1.3429 (16.04)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-32x33-t5-small]                                3.4662 (5.98)    3.4655 (5.99)   3.458 (5.98)    3.4724 (6.01)   3.2283 (6.49)   3.2041 (6.58)   3.1188 (6.67)   3.2473 (6.63)
test_benchmark_implementations[onnx-32x33-bert-base-uncased]                                                      12.2227 (1.7)    12.2232 (1.7)   12.2073 (1.69)  12.2544 (1.7)   11.2614 (1.86)  11.2064 (1.88)  10.9628 (1.9)   11.3057 (1.91)
test_benchmark_implementations[onnx_optim_fp16-32x33-bert-base-uncased]                                           6.2362 (3.32)    6.3345 (3.28)   6.1748 (3.35)   6.7328 (3.1)    6.1569 (3.41)   6.1572 (3.42)   6.0005 (3.47)   6.411 (3.36)
test_benchmark_implementations[onnx_optim_fp32-32x33-bert-base-uncased]                                           12.2184 (1.7)    12.3203 (1.68)  12.2102 (1.69)  13.0181 (1.6)   11.2924 (1.86)  11.2386 (1.88)  10.9766 (1.9)   11.2972 (1.91)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 128)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x128-bert-base-uncased]                                                  8.1203 (2.54)    8.1921 (2.54)   7.8879 (2.61)   8.8381 (2.38)   8.2766 (2.5)    8.4215 (2.52)   8.2159 (2.52)   9.2964 (2.45)
test_benchmark_implementations[baseline-8x128-sentence-transformers/all-MiniLM-L6-v2]                             4.137 (4.99)     4.1388 (5.02)   4.0591 (5.07)   4.3092 (4.89)   4.444 (4.66)    4.4867 (4.73)   4.3781 (4.72)   5.1076 (4.45)
test_benchmark_implementations[baseline-8x128-t5-small]                                                           13.356 (1.55)    13.4459 (1.54)  12.93 (1.59)    14.3576 (1.47)  12.9359 (1.6)   13.2989 (1.6)   12.83 (1.61)    15.2652 (1.49)
test_benchmark_implementations[dynamo-8x128-bert-base-uncased]                                                    7.4363 (2.78)    7.4624 (2.78)   7.1414 (2.88)   7.9821 (2.64)   7.522 (2.75)    7.7256 (2.75)   7.3428 (2.82)   8.9776 (2.53)
test_benchmark_implementations[dynamo-8x128-sentence-transformers/all-MiniLM-L6-v2]                               3.5717 (5.78)    3.5895 (5.79)   3.5013 (5.88)   3.7069 (5.68)   3.8943 (5.32)   3.9003 (5.44)   3.7813 (5.47)   4.2609 (5.34)
test_benchmark_implementations[dynamo-8x128-t5-small]                                                             11.4719 (1.8)    11.4641 (1.81)  11.305 (1.82)   11.6306 (1.81)  11.7661 (1.76)  11.965 (1.77)   11.6804 (1.77)  12.9159 (1.76)
test_benchmark_implementations[dynamo_cuda_graphs-8x128-bert-base-uncased]                                        6.8219 (3.03)    6.8443 (3.03)   6.8188 (3.02)   7.1229 (2.96)   6.2375 (3.32)   6.2265 (3.41)   6.1514 (3.36)   6.2684 (3.63)
test_benchmark_implementations[dynamo_cuda_graphs-8x128-sentence-transformers/all-MiniLM-L6-v2]                   1.4193 (14.54)   1.4608 (14.22)  1.4049 (14.66)  1.5206 (13.84)  1.4519 (14.27)  1.4546 (14.59)  1.4198 (14.56)  1.713 (13.28)
test_benchmark_implementations[dynamo_cuda_graphs-8x128-t5-small]                                                 4.7987 (4.3)     4.8896 (4.25)   4.7964 (4.29)   5.7682 (3.65)   4.4247 (4.68)   4.4865 (4.73)   4.3511 (4.75)   4.9224 (4.62)
test_benchmark_implementations[dynamo_no_dropout-8x128-bert-base-uncased]                                         7.1619 (2.88)    7.1627 (2.9)    7.1557 (2.88)   7.1731 (2.93)   6.8799 (3.01)   6.9156 (3.07)   6.8199 (3.03)   7.2595 (3.13)
test_benchmark_implementations[dynamo_no_dropout-8x128-sentence-transformers/all-MiniLM-L6-v2]                    3.3219 (6.21)    3.3435 (6.21)   3.2667 (6.31)   3.5164 (5.99)   3.664 (5.65)    3.6773 (5.77)   3.5991 (5.75)   4.0736 (5.58)
test_benchmark_implementations[dynamo_no_dropout-8x128-t5-small]                                                  10.8165 (1.91)   10.8773 (1.91)  10.7531 (1.92)  11.092 (1.9)    11.0337 (1.88)  11.2074 (1.89)  10.9651 (1.89)  12.0296 (1.89)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x128-bert-base-uncased]                                        5.8419 (3.53)    5.9064 (3.52)   5.3985 (3.82)   6.8055 (3.09)   5.5713 (3.72)   5.7365 (3.7)    5.3772 (3.85)   6.6491 (3.42)
test_benchmark_implementations[dynamo_optimized-8x128-bert-base-uncased]                                          14.2602 (1.45)   14.3465 (1.45)  14.2397 (1.45)  14.5295 (1.45)  14.6202 (1.42)  14.7196 (1.44)  14.5206 (1.42)  15.2607 (1.49)
test_benchmark_implementations[dynamo_optimized-8x128-sentence-transformers/all-MiniLM-L6-v2]                     7.5756 (2.72)    7.5906 (2.74)   7.5316 (2.73)   7.6564 (2.75)   7.9116 (2.62)   7.9924 (2.65)   7.8833 (2.62)   8.4753 (2.68)
test_benchmark_implementations[dynamo_optimized-8x128-t5-small]                                                   20.6418 (1.0)    20.768 (1.0)    20.5969 (1.0)   21.0504 (1.0)   20.7112 (1.0)   21.2185 (1.0)   20.6785 (1.0)   22.7448 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-bert-base-uncased]                              4.8148 (4.29)    4.8148 (4.31)   4.8118 (4.28)   4.8189 (4.37)   4.4394 (4.67)   4.3925 (4.83)   4.2787 (4.83)   4.4503 (5.11)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-sentence-transformers/all-MiniLM-L6-v2]         1.2646 (16.32)   1.2649 (16.42)  1.2616 (16.33)  1.2677 (16.61)  1.2102 (17.11)  1.2082 (17.56)  1.1909 (17.36)  1.289 (17.65)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-t5-small]                                       4.0622 (5.08)    4.0625 (5.11)   4.0591 (5.07)   4.0673 (5.18)   3.782 (5.48)    3.7692 (5.63)   3.716 (5.56)    3.8102 (5.97)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x128-bert-base-uncased]                       4.9582 (4.16)    4.9095 (4.23)   4.6244 (4.45)   4.9623 (4.24)   4.5776 (4.52)   4.5196 (4.69)   4.4196 (4.68)   4.587 (4.96)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x128-sentence-transformers/all-MiniLM-L6-v2]  1.2759 (16.18)   1.2756 (16.28)  1.2728 (16.18)  1.279 (16.46)   1.2191 (16.99)  1.2175 (17.43)  1.199 (17.25)   1.298 (17.52)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x128-t5-small]                                4.0038 (5.16)    3.9597 (5.24)   3.7468 (5.5)    4.0131 (5.25)   3.722 (5.56)    3.7044 (5.73)   3.6494 (5.67)   3.7499 (6.07)
test_benchmark_implementations[onnx-8x128-bert-base-uncased]                                                      12.0638 (1.71)   12.3603 (1.68)  11.4914 (1.79)  13.5772 (1.55)  11.147 (1.86)   11.2763 (1.88)  10.8386 (1.91)  11.9962 (1.9)
test_benchmark_implementations[onnx_optim_fp16-8x128-bert-base-uncased]                                           6.0877 (3.39)    6.2458 (3.33)   6.0283 (3.42)   6.5434 (3.22)   6.0125 (3.44)   6.0374 (3.51)   5.9049 (3.5)    6.259 (3.63)
test_benchmark_implementations[onnx_optim_fp32-8x128-bert-base-uncased]                                           12.1438 (1.7)    12.5041 (1.66)  12.0809 (1.7)   13.5057 (1.56)  11.1081 (1.86)  11.1031 (1.91)  10.948 (1.89)   11.3194 (2.01)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 16)
Name                                                                                                             Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x16-bert-base-uncased]                                                  8.7224 (2.33)    8.718 (2.34)    8.404 (2.41)    8.9489 (2.3)    8.8036 (2.35)   8.7788 (2.39)   8.456 (2.45)    9.1029 (2.37)
test_benchmark_implementations[baseline-8x16-sentence-transformers/all-MiniLM-L6-v2]                             4.565 (4.45)     4.5537 (4.48)   4.2424 (4.78)   4.779 (4.32)    4.8182 (4.3)    5.0353 (4.16)   4.4338 (4.66)   7.6707 (2.81)
test_benchmark_implementations[baseline-8x16-t5-small]                                                           14.3473 (1.42)   14.3997 (1.42)  14.2295 (1.43)  14.5941 (1.41)  14.7764 (1.4)   15.1517 (1.38)  14.716 (1.41)   16.7848 (1.28)
test_benchmark_implementations[dynamo-8x16-bert-base-uncased]                                                    7.1987 (2.82)    7.1913 (2.84)   6.8987 (2.94)   7.3953 (2.79)   7.6135 (2.72)   7.6507 (2.74)   7.4793 (2.77)   8.1916 (2.63)
test_benchmark_implementations[dynamo-8x16-sentence-transformers/all-MiniLM-L6-v2]                               3.7687 (5.39)    3.7786 (5.4)    3.6424 (5.57)   3.9138 (5.27)   3.9267 (5.28)   3.9399 (5.32)   3.7995 (5.44)   4.361 (4.94)
test_benchmark_implementations[dynamo-8x16-t5-small]                                                             12.5153 (1.62)   12.5816 (1.62)  12.4129 (1.63)  12.9044 (1.6)   12.8798 (1.61)  13.0716 (1.6)   12.5823 (1.64)  14.1337 (1.53)
test_benchmark_implementations[dynamo_cuda_graphs-8x16-bert-base-uncased]                                        1.8063 (11.25)   1.8214 (11.21)  1.8033 (11.25)  2.2733 (9.07)   1.6404 (12.63)  1.6599 (12.62)  1.6358 (12.64)  1.9514 (11.05)
test_benchmark_implementations[dynamo_cuda_graphs-8x16-sentence-transformers/all-MiniLM-L6-v2]                   0.6112 (33.25)   0.6371 (32.04)  0.6083 (33.36)  1.2943 (15.93)  0.6269 (33.06)  0.6447 (32.49)  0.6207 (33.32)  0.9829 (21.94)
test_benchmark_implementations[dynamo_cuda_graphs-8x16-t5-small]                                                 1.7664 (11.51)   1.7932 (11.38)  1.7635 (11.51)  2.2415 (9.2)    1.6102 (12.87)  1.6216 (12.92)  1.6061 (12.88)  1.8734 (11.51)
test_benchmark_implementations[dynamo_no_dropout-8x16-bert-base-uncased]                                         6.8354 (2.97)    6.8532 (2.98)   6.6929 (3.03)   7.0113 (2.94)   7.1375 (2.9)    7.1628 (2.92)   7.0734 (2.92)   7.4398 (2.9)
test_benchmark_implementations[dynamo_no_dropout-8x16-sentence-transformers/all-MiniLM-L6-v2]                    3.4007 (5.98)    4.0904 (4.99)   3.2236 (6.3)    5.718 (3.61)    3.7136 (5.58)   3.7337 (5.61)   3.5135 (5.89)   4.0979 (5.26)
test_benchmark_implementations[dynamo_no_dropout-8x16-t5-small]                                                  11.7873 (1.72)   11.8327 (1.73)  11.7268 (1.73)  12.0095 (1.72)  12.0854 (1.71)  12.2367 (1.71)  11.9702 (1.73)  12.9334 (1.67)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x16-bert-base-uncased]                                        3.8023 (5.35)    3.8162 (5.35)   3.6887 (5.5)    4.3407 (4.75)   4.1225 (5.03)   4.1361 (5.06)   3.9613 (5.22)   4.485 (4.81)
test_benchmark_implementations[dynamo_optimized-8x16-bert-base-uncased]                                          14.6123 (1.39)   14.6671 (1.39)  14.4056 (1.41)  14.9719 (1.38)  14.962 (1.38)   14.9677 (1.4)   14.765 (1.4)    15.3162 (1.41)
test_benchmark_implementations[dynamo_optimized-8x16-sentence-transformers/all-MiniLM-L6-v2]                     7.51 (2.71)      7.5329 (2.71)   7.4373 (2.73)   7.7158 (2.67)   7.8431 (2.64)   7.8966 (2.65)   7.7832 (2.66)   8.3786 (2.57)
test_benchmark_implementations[dynamo_optimized-8x16-t5-small]                                                   20.3244 (1.0)    20.4122 (1.0)   20.2926 (1.0)   20.6244 (1.0)   20.7217 (1.0)   20.9481 (1.0)   20.6834 (1.0)   21.5624 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-bert-base-uncased]                              1.3107 (15.51)   1.3905 (14.68)  1.2974 (15.64)  1.4879 (13.86)  1.3557 (15.29)  1.3583 (15.42)  1.3521 (15.3)   1.4497 (14.87)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-sentence-transformers/all-MiniLM-L6-v2]         0.4751 (42.78)   0.4754 (42.94)  0.4741 (42.8)   0.4782 (43.13)  0.4719 (43.91)  0.4737 (44.22)  0.4695 (44.06)  0.5652 (38.15)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-t5-small]                                       1.3916 (14.6)    1.3916 (14.67)  1.3896 (14.6)   1.3937 (14.8)   1.2807 (16.18)  1.2827 (16.33)  1.2785 (16.18)  1.3723 (15.71)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x16-bert-base-uncased]                       1.5032 (13.52)   1.5038 (13.57)  1.5012 (13.52)  1.5073 (13.68)  1.3748 (15.07)  1.3766 (15.22)  1.3705 (15.09)  1.4674 (14.69)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x16-sentence-transformers/all-MiniLM-L6-v2]  0.4792 (42.41)   0.4688 (43.54)  0.4178 (48.57)  0.4813 (42.85)  0.4752 (43.61)  0.4768 (43.93)  0.4732 (43.71)  0.569 (37.89)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x16-t5-small]                                1.3967 (14.55)   1.3966 (14.62)  1.3947 (14.55)  1.3988 (14.74)  1.285 (16.13)   1.2873 (16.27)  1.2821 (16.13)  1.383 (15.59)
test_benchmark_implementations[onnx-8x16-bert-base-uncased]                                                      2.9376 (6.92)    3.0992 (6.59)   2.8931 (7.01)   3.966 (5.2)     2.9486 (7.03)   2.9634 (7.07)   2.9395 (7.04)   3.281 (6.57)
test_benchmark_implementations[onnx_optim_fp16-8x16-bert-base-uncased]                                           2.9757 (6.83)    3.0455 (6.7)    2.7936 (7.26)   3.5707 (5.78)   3.0224 (6.86)   3.0868 (6.79)   2.8295 (7.31)   3.8465 (5.61)
test_benchmark_implementations[onnx_optim_fp32-8x16-bert-base-uncased]                                           2.9368 (6.92)    3.0415 (6.71)   2.9075 (6.98)   3.2737 (6.3)    2.9635 (6.99)   2.9903 (7.01)   2.952 (7.01)    3.327 (6.48)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 256)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x256-bert-base-uncased]                                                  13.6847 (1.65)   13.8643 (1.63)  13.3089 (1.7)   14.4476 (1.59)  13.4402 (1.7)   13.644 (1.68)   13.3807 (1.67)  14.4749 (1.64)
test_benchmark_implementations[baseline-8x256-sentence-transformers/all-MiniLM-L6-v2]                             4.0038 (5.66)    4.0237 (5.63)   3.9238 (5.77)   4.1677 (5.5)    4.418 (5.16)    4.4704 (5.12)   4.2906 (5.2)    5.0355 (4.73)
test_benchmark_implementations[baseline-8x256-t5-small]                                                           13.388 (1.69)    13.3952 (1.69)  13.1994 (1.72)  13.5334 (1.69)  13.6238 (1.67)  13.7468 (1.67)  13.4358 (1.66)  14.3929 (1.65)
test_benchmark_implementations[dynamo-8x256-bert-base-uncased]                                                    14.4138 (1.57)   14.3384 (1.58)  13.9315 (1.63)  14.4271 (1.59)  13.4756 (1.69)  13.4557 (1.7)   13.3306 (1.67)  13.5068 (1.76)
test_benchmark_implementations[dynamo-8x256-sentence-transformers/all-MiniLM-L6-v2]                               3.8441 (5.89)    3.8464 (5.89)   3.841 (5.89)    3.8636 (5.94)   3.8452 (5.93)   3.8556 (5.94)   3.8107 (5.86)   4.0681 (5.85)
test_benchmark_implementations[dynamo-8x256-t5-small]                                                             11.891 (1.9)     11.8879 (1.91)  11.7198 (1.93)  12.0156 (1.91)  12.1549 (1.88)  12.2303 (1.87)  11.9982 (1.86)  12.6792 (1.88)
test_benchmark_implementations[dynamo_cuda_graphs-8x256-bert-base-uncased]                                        14.079 (1.61)    14.0268 (1.62)  13.4543 (1.68)  14.3749 (1.6)   13.0296 (1.75)  13.3418 (1.72)  12.8918 (1.73)  14.1746 (1.68)
test_benchmark_implementations[dynamo_cuda_graphs-8x256-sentence-transformers/all-MiniLM-L6-v2]                   3.5973 (6.29)    3.6112 (6.28)   3.5041 (6.46)   3.7079 (6.19)   3.5405 (6.44)   3.57 (6.42)     3.4835 (6.41)   4.1018 (5.8)
test_benchmark_implementations[dynamo_cuda_graphs-8x256-t5-small]                                                 10.881 (2.08)    10.8159 (2.1)   10.454 (2.17)   11.0469 (2.08)  10.8309 (2.11)  10.8968 (2.1)   10.1213 (2.21)  11.6755 (2.04)
test_benchmark_implementations[dynamo_no_dropout-8x256-bert-base-uncased]                                         13.5404 (1.67)   13.8593 (1.64)  13.4615 (1.68)  14.4343 (1.59)  13.5687 (1.68)  13.5402 (1.69)  13.435 (1.66)   13.615 (1.75)
test_benchmark_implementations[dynamo_no_dropout-8x256-sentence-transformers/all-MiniLM-L6-v2]                    3.842 (5.89)     3.8406 (5.9)    3.7652 (6.01)   3.8564 (5.95)   3.8274 (5.96)   3.843 (5.96)    3.8148 (5.85)   4.062 (5.86)
test_benchmark_implementations[dynamo_no_dropout-8x256-t5-small]                                                  12.1416 (1.87)   12.2602 (1.85)  11.8313 (1.91)  12.6351 (1.82)  11.6905 (1.95)  11.8176 (1.94)  11.5396 (1.93)  12.3948 (1.92)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x256-bert-base-uncased]                                        11.2394 (2.01)   11.6337 (1.95)  10.3997 (2.18)  15.1091 (1.52)  10.7301 (2.13)  11.0985 (2.06)  10.2936 (2.17)  11.8069 (2.02)
test_benchmark_implementations[dynamo_optimized-8x256-bert-base-uncased]                                          14.2684 (1.59)   14.3024 (1.58)  14.2336 (1.59)  14.3841 (1.59)  14.5641 (1.57)  14.6802 (1.56)  14.5321 (1.54)  15.0488 (1.58)
test_benchmark_implementations[dynamo_optimized-8x256-sentence-transformers/all-MiniLM-L6-v2]                     7.6207 (2.97)    7.6133 (2.98)   7.5541 (3.0)    7.6792 (2.99)   7.9215 (2.88)   7.9443 (2.88)   7.8443 (2.85)   8.357 (2.85)
test_benchmark_implementations[dynamo_optimized-8x256-t5-small]                                                   20.8108 (1.09)   20.878 (1.09)   20.794 (1.09)   21.0186 (1.09)  21.1782 (1.08)  21.3574 (1.07)  21.1219 (1.06)  21.8378 (1.09)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-bert-base-uncased]                              8.1285 (2.79)    8.131 (2.79)    8.1234 (2.79)   8.1439 (2.82)   7.7371 (2.95)   7.7023 (2.97)   7.3249 (3.05)   7.8931 (3.02)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-sentence-transformers/all-MiniLM-L6-v2]         2.9235 (7.75)    2.9235 (7.75)   2.9194 (7.76)   2.9276 (7.83)   2.8704 (7.95)   2.8488 (8.04)   2.7876 (8.01)   2.8891 (8.24)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-t5-small]                                       9.0317 (2.51)    8.8762 (2.55)   8.6149 (2.63)   9.0665 (2.53)   8.5792 (2.66)   8.5419 (2.68)   8.4395 (2.64)   8.6266 (2.76)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x256-bert-base-uncased]                       8.3773 (2.7)     8.3766 (2.71)   8.3681 (2.71)   8.3825 (2.74)   8.113 (2.81)    7.9841 (2.87)   7.5708 (2.95)   8.6934 (2.74)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x256-sentence-transformers/all-MiniLM-L6-v2]  2.9696 (7.63)    2.9705 (7.63)   2.9655 (7.63)   2.9839 (7.69)   2.887 (7.9)     2.8677 (7.99)   2.8146 (7.93)   2.9235 (8.14)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x256-t5-small]                                8.873 (2.55)     9.1498 (2.48)   8.5852 (2.64)   10.0649 (2.28)  8.1125 (2.81)   8.5707 (2.67)   8.0004 (2.79)   9.3767 (2.54)
test_benchmark_implementations[onnx-8x256-bert-base-uncased]                                                      22.5005 (1.01)   22.6105 (1.0)   22.5004 (1.01)  22.9376 (1.0)   22.7464 (1.0)   22.908 (1.0)    22.227 (1.0)    23.8073 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x256-bert-base-uncased]                                           11.3633 (1.99)   11.672 (1.94)   11.3524 (1.99)  13.2784 (1.73)  11.3428 (2.01)  11.298 (2.03)   10.9969 (2.03)  11.3985 (2.09)
test_benchmark_implementations[onnx_optim_fp32-8x256-bert-base-uncased]                                           22.6447 (1.0)    22.665 (1.0)    22.6406 (1.0)   22.6888 (1.01)  22.8176 (1.0)   22.8726 (1.0)   22.3209 (1.0)   23.4825 (1.01)
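
A note on reading the numbers in parentheses: judging from the values in these tables, each one appears to be the slowest timing in that column of the group divided by the row's timing, so `1.0` marks the slowest entry and larger is faster. That is inferred from the output, not from the benchmark code. A minimal sketch that recomputes a few `Median (CUDA)` ratios from the `shape=(8, 256)` group (the helper name `relative_speedups` is made up for illustration):

```python
def relative_speedups(timings):
    """Return slowest / timing for each entry, rounded to 2 decimals.

    Assumption: the parenthesized value is relative to the slowest
    entry of the same column within one benchmark group.
    """
    slowest = max(timings.values())
    return {name: round(slowest / t, 2) for name, t in timings.items()}


# Median (CUDA) values copied from the shape=(8, 256) table above.
median_cuda = {
    "onnx_optim_fp32-8x256-bert-base-uncased": 22.6447,
    "onnx-8x256-bert-base-uncased": 22.5005,
    "dynamo_optimized-8x256-t5-small": 20.8108,
    "dynamo_optimized_cuda_graphs-8x256-sentence-transformers/all-MiniLM-L6-v2": 2.9235,
}

for name, ratio in relative_speedups(median_cuda).items():
    print(f"{name}: {ratio}")
```

This also explains why `dynamo_optimized-8x256-t5-small` shows `1.09` rather than `1.0` in that group: the ONNX fp32 run is the slowest entry there, so it becomes the reference.
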

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 33)
Name                                                                                                             Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x33-bert-base-uncased]                                                  7.9947 (2.59)    7.9786 (2.6)    7.8438 (2.63)   8.0671 (2.58)   8.3793 (2.5)    8.4398 (2.5)    8.2132 (2.55)   8.9543 (2.41)
test_benchmark_implementations[baseline-8x33-sentence-transformers/all-MiniLM-L6-v2]                             4.2332 (4.89)    4.2425 (4.89)   4.1247 (5.01)   4.4278 (4.7)    4.5448 (4.61)   4.562 (4.63)    4.4476 (4.71)   5.1313 (4.2)
test_benchmark_implementations[baseline-8x33-t5-small]                                                           14.1097 (1.47)   14.1911 (1.46)  14.0401 (1.47)  14.4538 (1.44)  14.387 (1.46)   14.5887 (1.45)  13.8695 (1.51)  15.8505 (1.36)
test_benchmark_implementations[dynamo-8x33-bert-base-uncased]                                                    7.2182 (2.87)    8.1296 (2.55)   6.9775 (2.96)   11.778 (1.77)   7.5161 (2.79)   7.5627 (2.79)   7.1822 (2.92)   8.1118 (2.66)
test_benchmark_implementations[dynamo-8x33-sentence-transformers/all-MiniLM-L6-v2]                               3.4949 (5.93)    3.5182 (5.9)    3.4058 (6.07)   3.7304 (5.58)   3.8358 (5.47)   3.8773 (5.44)   3.746 (5.59)    4.3868 (4.92)
test_benchmark_implementations[dynamo-8x33-t5-small]                                                             12.5061 (1.66)   12.5413 (1.65)  12.4356 (1.66)  12.6556 (1.64)  13.0027 (1.61)  13.0229 (1.62)  12.7834 (1.64)  13.3064 (1.62)
test_benchmark_implementations[dynamo_cuda_graphs-8x33-bert-base-uncased]                                        2.2723 (9.12)    2.2435 (9.25)   2.0603 (10.03)  2.2784 (9.13)   2.0596 (10.18)  2.069 (10.2)    2.0401 (10.26)  2.3373 (9.23)
test_benchmark_implementations[dynamo_cuda_graphs-8x33-sentence-transformers/all-MiniLM-L6-v2]                   0.7404 (27.98)   0.7407 (28.01)  0.7383 (27.99)  0.7455 (27.92)  0.7109 (29.49)  0.7331 (28.8)   0.7035 (29.76)  1.171 (18.42)
test_benchmark_implementations[dynamo_cuda_graphs-8x33-t5-small]                                                 2.5313 (8.18)    2.428 (8.55)    2.2139 (9.33)   2.6737 (7.78)   2.2875 (9.17)   2.2899 (9.22)   2.2828 (9.17)   2.3705 (9.1)
test_benchmark_implementations[dynamo_no_dropout-8x33-bert-base-uncased]                                         6.6079 (3.14)    6.6007 (3.14)   6.4626 (3.2)    6.8393 (3.04)   6.9077 (3.04)   6.9296 (3.05)   6.8133 (3.07)   7.1822 (3.0)
test_benchmark_implementations[dynamo_no_dropout-8x33-sentence-transformers/all-MiniLM-L6-v2]                    3.286 (6.31)     3.2915 (6.3)    3.2215 (6.41)   3.4304 (6.07)   3.6226 (5.79)   3.6388 (5.8)    3.5641 (5.88)   3.9248 (5.5)
test_benchmark_implementations[dynamo_no_dropout-8x33-t5-small]                                                  11.5364 (1.8)    11.5661 (1.79)  11.4534 (1.8)   11.7617 (1.77)  11.7576 (1.78)  11.8652 (1.78)  11.6882 (1.79)  12.515 (1.72)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x33-bert-base-uncased]                                        3.8031 (5.45)    3.7956 (5.47)   3.6372 (5.68)   3.9352 (5.29)   4.1047 (5.11)   4.1173 (5.13)   3.9667 (5.28)   4.5495 (4.74)
test_benchmark_implementations[dynamo_optimized-8x33-bert-base-uncased]                                          14.6094 (1.42)   14.5993 (1.42)  14.4302 (1.43)  14.7282 (1.41)  14.7255 (1.42)  14.8756 (1.42)  14.6691 (1.43)  15.3102 (1.41)
test_benchmark_implementations[dynamo_optimized-8x33-sentence-transformers/all-MiniLM-L6-v2]                     7.6042 (2.72)    7.6208 (2.72)   7.5018 (2.75)   7.7885 (2.67)   7.8886 (2.66)   7.9282 (2.66)   7.8061 (2.68)   8.3879 (2.57)
test_benchmark_implementations[dynamo_optimized-8x33-t5-small]                                                   20.7186 (1.0)    20.7488 (1.0)   20.6643 (1.0)   20.8128 (1.0)   20.9668 (1.0)   21.1115 (1.0)   20.939 (1.0)    21.5692 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-bert-base-uncased]                              1.8545 (11.17)   1.8336 (11.32)  1.7213 (12.0)   1.8586 (11.2)   1.7389 (12.06)  1.7244 (12.24)  1.6692 (12.54)  1.7698 (12.19)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-sentence-transformers/all-MiniLM-L6-v2]         0.6615 (31.32)   0.6448 (32.18)  0.5806 (35.59)  0.6717 (30.98)  0.6342 (33.06)  0.6358 (33.2)   0.6313 (33.17)  0.7258 (29.72)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-t5-small]                                       2.0224 (10.24)   2.0223 (10.26)  2.0204 (10.23)  2.0244 (10.28)  1.8329 (11.44)  1.836 (11.5)    1.8303 (11.44)  1.9283 (11.19)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x33-bert-base-uncased]                       1.8627 (11.12)   1.8624 (11.14)  1.8586 (11.12)  1.8657 (11.16)  1.7455 (12.01)  1.728 (12.22)   1.6781 (12.48)  1.7733 (12.16)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x33-sentence-transformers/all-MiniLM-L6-v2]  0.6636 (31.22)   0.6633 (31.28)  0.5878 (35.16)  0.6656 (31.27)  0.6348 (33.03)  0.6365 (33.17)  0.632 (33.13)   0.7326 (29.44)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x33-t5-small]                                2.0255 (10.23)   2.0255 (10.24)  2.0234 (10.21)  2.0275 (10.27)  1.8349 (11.43)  1.8374 (11.49)  1.8319 (11.43)  1.9282 (11.19)
test_benchmark_implementations[onnx-8x33-bert-base-uncased]                                                      4.308 (4.81)     4.3083 (4.82)   4.2906 (4.82)   4.3346 (4.8)    3.96 (5.29)     3.9579 (5.33)   3.8867 (5.39)   4.2064 (5.13)
test_benchmark_implementations[onnx_optim_fp16-8x33-bert-base-uncased]                                           2.8897 (7.17)    2.9532 (7.03)   2.8488 (7.25)   3.5533 (5.86)   2.739 (7.66)    2.7596 (7.65)   2.6207 (7.99)   3.2262 (6.69)
test_benchmark_implementations[onnx_optim_fp32-8x33-bert-base-uncased]                                           4.0632 (5.1)     4.1558 (4.99)   4.0489 (5.1)    4.8374 (4.3)    3.9397 (5.32)   3.9653 (5.32)   3.9061 (5.36)   4.436 (4.86)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 384)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384-bert-base-uncased]                                                  19.7489 (1.59)   19.8177 (1.62)  19.5164 (1.61)  20.2269 (1.65)  19.2371 (1.63)  19.4843 (1.6)   18.9864 (1.6)   20.4589 (1.54)
test_benchmark_implementations[baseline-8x384-sentence-transformers/all-MiniLM-L6-v2]                             5.5487 (5.67)    5.6055 (5.72)   5.5204 (5.69)   6.4666 (5.17)   5.7862 (5.43)   5.8479 (5.32)   5.6624 (5.36)   6.6094 (4.77)
test_benchmark_implementations[baseline-8x384-t5-small]                                                           19.4161 (1.62)   19.703 (1.63)   19.3772 (1.62)  20.1943 (1.65)  19.7093 (1.59)  20.1023 (1.55)  19.4844 (1.56)  20.9538 (1.51)
test_benchmark_implementations[dynamo-8x384-bert-base-uncased]                                                    19.9598 (1.57)   19.961 (1.61)   19.9464 (1.58)  19.9823 (1.67)  19.3285 (1.62)  19.1226 (1.63)  18.7697 (1.62)  19.3344 (1.63)
test_benchmark_implementations[dynamo-8x384-sentence-transformers/all-MiniLM-L6-v2]                               9.7649 (3.22)    9.0292 (3.55)   5.3801 (5.84)   10.1202 (3.3)   5.7692 (5.44)   5.8264 (5.34)   5.6272 (5.39)   6.4468 (4.9)
test_benchmark_implementations[dynamo-8x384-t5-small]                                                             19.5359 (1.61)   19.7155 (1.63)  19.4601 (1.61)  20.5711 (1.62)  19.822 (1.58)   19.9808 (1.56)  19.5762 (1.55)  20.6391 (1.53)
test_benchmark_implementations[dynamo_cuda_graphs-8x384-bert-base-uncased]                                        20.5046 (1.53)   20.5494 (1.56)  18.8764 (1.66)  22.0915 (1.51)  20.5107 (1.53)  20.4222 (1.52)  19.6579 (1.54)  21.1532 (1.49)
test_benchmark_implementations[dynamo_cuda_graphs-8x384-sentence-transformers/all-MiniLM-L6-v2]                   5.5112 (5.7)     5.579 (5.75)    5.3975 (5.82)   6.4911 (5.15)   5.4612 (5.75)   5.4127 (5.75)   5.2997 (5.72)   5.498 (5.74)
test_benchmark_implementations[dynamo_cuda_graphs-8x384-t5-small]                                                 19.1488 (1.64)   19.2 (1.67)     19.1242 (1.64)  19.4335 (1.72)  19.2215 (1.63)  19.3734 (1.61)  18.8843 (1.61)  20.0447 (1.57)
test_benchmark_implementations[dynamo_no_dropout-8x384-bert-base-uncased]                                         19.1908 (1.64)   19.1947 (1.67)  19.1867 (1.64)  19.2113 (1.74)  19.3104 (1.63)  19.1886 (1.62)  18.8783 (1.61)  19.3726 (1.63)
test_benchmark_implementations[dynamo_no_dropout-8x384-sentence-transformers/all-MiniLM-L6-v2]                    6.2966 (4.99)    6.3509 (5.05)   6.0867 (5.16)   6.8485 (4.88)   6.1923 (5.07)   6.2429 (4.98)   5.7114 (5.31)   6.8333 (4.62)
test_benchmark_implementations[dynamo_no_dropout-8x384-t5-small]                                                  19.6413 (1.6)    19.9964 (1.6)   19.6116 (1.6)   20.3827 (1.64)  19.6893 (1.59)  19.7231 (1.58)  19.5565 (1.55)  19.8602 (1.59)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x384-bert-base-uncased]                                        15.9099 (1.98)   15.9111 (2.01)  15.9007 (1.98)  15.9171 (2.1)   15.8495 (1.98)  15.7446 (1.98)  14.4961 (2.09)  16.8897 (1.87)
test_benchmark_implementations[dynamo_optimized-8x384-bert-base-uncased]                                          14.4302 (2.18)   14.522 (2.21)   14.3892 (2.18)  14.7292 (2.27)  14.8241 (2.12)  14.88 (2.09)    14.7356 (2.06)  15.2772 (2.07)
test_benchmark_implementations[dynamo_optimized-8x384-sentence-transformers/all-MiniLM-L6-v2]                     7.5244 (4.18)    7.5324 (4.26)   7.4775 (4.2)    7.5971 (4.4)    7.856 (4.0)     7.9508 (3.91)   7.7704 (3.9)    8.4444 (3.74)
test_benchmark_implementations[dynamo_optimized-8x384-t5-small]                                                   20.8898 (1.5)    20.9199 (1.53)  20.8159 (1.51)  21.0135 (1.59)  20.9417 (1.5)   21.1092 (1.47)  20.9324 (1.45)  21.5351 (1.47)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-bert-base-uncased]                              11.1729 (2.81)   11.1749 (2.87)  11.1698 (2.81)  11.1852 (2.99)  10.6662 (2.94)  10.8444 (2.87)  10.4129 (2.91)  11.2005 (2.82)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-sentence-transformers/all-MiniLM-L6-v2]         4.9152 (6.4)     4.9154 (6.52)   4.9121 (6.4)    4.9203 (6.79)   4.8918 (6.42)   4.8515 (6.41)   4.7233 (6.42)   4.9082 (6.43)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-t5-small]                                       16.2724 (1.93)   16.2714 (1.97)  16.1966 (1.94)  16.3471 (2.04)  16.2496 (1.93)  16.2216 (1.92)  16.106 (1.88)   16.3134 (1.93)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x384-bert-base-uncased]                       12.1231 (2.59)   12.04 (2.66)    11.7811 (2.67)  12.1405 (2.75)  12.1685 (2.58)  11.9373 (2.61)  11.0808 (2.74)  12.5671 (2.51)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x384-sentence-transformers/all-MiniLM-L6-v2]  5.0207 (6.26)    5.0464 (6.35)   5.0156 (6.26)   5.4467 (6.13)   4.978 (6.31)    4.9337 (6.3)    4.8074 (6.31)   5.0126 (6.3)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x384-t5-small]                                14.8664 (2.11)   14.8558 (2.16)  14.806 (2.12)   14.892 (2.24)   14.3267 (2.19)  14.2535 (2.18)  14.0668 (2.16)  14.448 (2.18)
test_benchmark_implementations[onnx-8x384-bert-base-uncased]                                                      31.4112 (1.0)    32.0522 (1.0)   31.3354 (1.0)   33.41 (1.0)     31.3964 (1.0)   31.0975 (1.0)   30.338 (1.0)    31.5582 (1.0)
test_benchmark_implementations[onnx_optim_fp16-8x384-bert-base-uncased]                                           16.1556 (1.95)   16.2058 (1.98)  16.126 (1.95)   16.3482 (2.04)  16.2167 (1.94)  16.2208 (1.92)  16.0692 (1.89)  16.4339 (1.92)
test_benchmark_implementations[onnx_optim_fp32-8x384-bert-base-uncased]                                           31.435 (1.0)     31.9598 (1.0)   31.4194 (1.0)   33.025 (1.01)   31.1871 (1.01)  30.9656 (1.0)   30.317 (1.0)    31.3926 (1.01)

test/test_torchdynamo_bert.py::test_benchmark_implementations shape=(8, 512)
Name                                                                                                              Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-bert-base-uncased]                                                  27.775 (1.69)    27.7722 (1.73)  27.7647 (1.69)  27.777 (1.77)   27.7892 (1.66)  27.6091 (1.72)  27.2374 (1.69)  27.8005 (1.75)
test_benchmark_implementations[baseline-8x512-sentence-transformers/all-MiniLM-L6-v2]                             8.8074 (5.33)    8.8136 (5.45)   8.8013 (5.34)   8.83 (5.55)     8.9817 (5.14)   9.1505 (5.19)   8.771 (5.26)    11.3256 (4.3)
test_benchmark_implementations[baseline-8x512-t5-small]                                                           30.7333 (1.53)   30.7756 (1.56)  30.722 (1.53)   30.8716 (1.59)  31.5683 (1.46)  31.7023 (1.5)   31.5614 (1.46)  31.9773 (1.52)
test_benchmark_implementations[dynamo-8x512-bert-base-uncased]                                                    27.7924 (1.69)   27.7958 (1.73)  27.7862 (1.69)  27.8088 (1.76)  27.9149 (1.65)  28.9118 (1.64)  27.8333 (1.66)  30.9872 (1.57)
test_benchmark_implementations[dynamo-8x512-sentence-transformers/all-MiniLM-L6-v2]                               8.8484 (5.31)    8.8535 (5.42)   8.8402 (5.31)   8.8801 (5.52)   8.9348 (5.17)   8.9323 (5.31)   8.7845 (5.25)   9.196 (5.3)
test_benchmark_implementations[dynamo-8x512-t5-small]                                                             30.7855 (1.53)   30.7801 (1.56)  30.763 (1.53)   30.7917 (1.59)  30.8503 (1.5)   31.3577 (1.51)  30.8374 (1.5)   32.3853 (1.51)
test_benchmark_implementations[dynamo_cuda_graphs-8x512-bert-base-uncased]                                        27.6142 (1.7)    27.6143 (1.74)  27.606 (1.7)    27.6226 (1.77)  27.2353 (1.69)  27.1803 (1.75)  26.6198 (1.73)  27.6856 (1.76)
test_benchmark_implementations[dynamo_cuda_graphs-8x512-sentence-transformers/all-MiniLM-L6-v2]                   8.8044 (5.34)    8.9263 (5.38)   8.6139 (5.45)   9.2867 (5.28)   8.726 (5.29)    8.7863 (5.4)    8.5524 (5.4)    9.2835 (5.25)
test_benchmark_implementations[dynamo_cuda_graphs-8x512-t5-small]                                                 30.5582 (1.54)   30.5575 (1.57)  30.548 (1.54)   30.5664 (1.6)   30.5396 (1.51)  30.4841 (1.56)  30.2891 (1.52)  30.6236 (1.59)
test_benchmark_implementations[dynamo_no_dropout-8x512-bert-base-uncased]                                         27.8845 (1.68)   27.8849 (1.72)  27.8733 (1.69)  27.8968 (1.76)  27.919 (1.65)   27.7994 (1.71)  27.4494 (1.68)  28.0299 (1.74)
test_benchmark_implementations[dynamo_no_dropout-8x512-sentence-transformers/all-MiniLM-L6-v2]                    8.8371 (5.32)    8.8631 (5.42)   8.8064 (5.33)   8.9569 (5.47)   8.9738 (5.14)   8.9631 (5.29)   8.8468 (5.22)   9.0522 (5.38)
test_benchmark_implementations[dynamo_no_dropout-8x512-t5-small]                                                  30.9012 (1.52)   31.0699 (1.54)  30.8961 (1.52)  31.4122 (1.56)  31.0318 (1.49)  31.4921 (1.51)  30.9202 (1.49)  32.5244 (1.5)
test_benchmark_implementations[dynamo_nvfuser_ofi-8x512-bert-base-uncased]                                        22.6529 (2.07)   23.6692 (2.03)  21.9116 (2.14)  26.9885 (1.82)  20.891 (2.21)   21.1572 (2.24)  20.503 (2.25)   21.9491 (2.22)
test_benchmark_implementations[dynamo_optimized-8x512-bert-base-uncased]                                          16.3942 (2.87)   16.3891 (2.93)  16.3 (2.88)     16.4352 (2.98)  16.1713 (2.85)  15.9108 (2.98)  15.0802 (3.06)  16.2421 (3.0)
test_benchmark_implementations[dynamo_optimized-8x512-sentence-transformers/all-MiniLM-L6-v2]                     8.108 (5.79)     8.1148 (5.92)   8.1029 (5.8)    8.1377 (6.02)   8.5872 (5.38)   8.5455 (5.55)   8.3134 (5.55)   8.7594 (5.56)
test_benchmark_implementations[dynamo_optimized-8x512-t5-small]                                                   25.1023 (1.87)   25.1221 (1.91)  25.089 (1.87)   25.175 (1.95)   25.6481 (1.8)   25.6533 (1.85)  25.6255 (1.8)   25.6864 (1.9)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-bert-base-uncased]                              15.2392 (3.08)   15.2839 (3.14)  15.1798 (3.09)  15.4184 (3.18)  15.3804 (3.0)   15.0315 (3.16)  14.3092 (3.23)  15.4026 (3.16)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-sentence-transformers/all-MiniLM-L6-v2]         7.7394 (6.07)    7.7577 (6.19)   7.7363 (6.07)   7.8213 (6.27)   7.8003 (5.92)   7.7552 (6.12)   7.5767 (6.09)   7.8334 (6.22)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-t5-small]                                       24.2033 (1.94)   24.2017 (1.98)  24.1572 (1.94)  24.2268 (2.02)  24.1703 (1.91)  24.2008 (1.96)  24.0959 (1.92)  24.2979 (2.01)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x512-bert-base-uncased]                       15.7542 (2.98)   15.7588 (3.05)  15.7512 (2.98)  15.7809 (3.11)  15.8499 (2.91)  15.61 (3.04)    14.9463 (3.09)  15.9837 (3.05)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x512-sentence-transformers/all-MiniLM-L6-v2]  7.9503 (5.91)    7.9638 (6.03)   7.9462 (5.91)   8.0384 (6.1)    8.3171 (5.55)   8.5048 (5.58)   7.7766 (5.94)   9.4272 (5.17)
test_benchmark_implementations[dynamo_optimizer_cuda_graphs_causal-8x512-t5-small]                                21.0954 (2.23)   21.1128 (2.27)  21.0934 (2.23)  21.1354 (2.32)  21.2004 (2.18)  21.5404 (2.2)   20.9173 (2.21)  22.7782 (2.14)
test_benchmark_implementations[onnx-8x512-bert-base-uncased]                                                      46.8429 (1.0)    46.8429 (1.02)  46.8429 (1.0)   46.8429 (1.05)  45.5672 (1.01)  46.3242 (1.02)  45.5672 (1.01)  47.0812 (1.04)
test_benchmark_implementations[onnx_optim_fp16-8x512-bert-base-uncased]                                           21.3678 (2.2)    21.4095 (2.24)  21.3636 (2.2)   21.4538 (2.29)  20.8857 (2.21)  21.0481 (2.25)  20.451 (2.26)   21.4421 (2.27)
test_benchmark_implementations[onnx_optim_fp32-8x512-bert-base-uncased]                                           46.9763 (1.0)    48.0017 (1.0)   46.9763 (1.0)   49.0271 (1.0)   46.1597 (1.0)   47.4509 (1.0)   46.1597 (1.0)   48.7421 (1.0)


====================================================================================================== warnings summary =======================================================================================================
../../../home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/onnxruntime/transformers/float16.py:78: 299 warnings
  /home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/onnxruntime/transformers/models/gpt2/../../float16.py:78: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
    float32_list = np.fromstring(tensor.raw_data, dtype="float32")

../../../home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/onnxruntime/transformers/float16.py:82: 299 warnings
  /home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/onnxruntime/transformers/models/gpt2/../../float16.py:82: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
    tensor.raw_data = float16_list.tostring()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================= 425 passed, 136 skipped, 11 deselected, 598 warnings in 4136.99s (1:08:56) ==========================================================================
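
A note on the 598 warnings: both come from `onnxruntime/transformers/float16.py`, not from kernl, and the messages themselves name the replacements (`frombuffer` for binary-mode `fromstring`, `tobytes()` for `tostring()`). A minimal sketch of what those replacements look like with NumPy follows. The `raw` bytes are a made-up stand-in for `tensor.raw_data`, and the `astype` step in between is an assumption for illustration, not the actual onnxruntime code.

```python
import numpy as np

# Hypothetical stand-in for tensor.raw_data: three float32 values as raw bytes.
raw = np.array([1.0, 2.5, -3.0], dtype="float32").tobytes()

# np.frombuffer is the replacement named by the first warning
# (instead of np.fromstring(raw, dtype="float32")).
float32_list = np.frombuffer(raw, dtype="float32")

# Assumed conversion step, for illustration only; astype returns a new array.
float16_list = float32_list.astype("float16")

# tobytes() is the replacement named by the second warning (instead of tostring()).
raw16 = float16_list.tobytes()

print(len(raw), len(raw16))
```

The warnings are harmless for the benchmark numbers themselves; they only matter if a later NumPy release drops the deprecated calls.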

    /mnt/workspace/kernl  on   feat/more-models !1 ··················································································· took 1h 9m 0s   kernl   1.39   24%   46,9G  ╱ 0,B   at 11:46:31  ─╮
❯                                                                                                                                                                                                                           ─╯
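
For reading the tables: the figure in parentheses appears to be the slowest value in the same shape group and column divided by the row's value, so the row marked `(1.0)` is the slowest and larger numbers are faster. This is inferred from the numbers, not from the benchmark plugin's documentation. A small sketch that reproduces the ratios for a few `Median (CUDA)` values copied from the `shape=(8, 33)` table (the helper name `relative_to_slowest` is made up for illustration):

```python
# Median (CUDA) values copied from the shape=(8, 33) table.
medians = {
    "baseline-8x33-bert-base-uncased": 7.9947,
    "dynamo_cuda_graphs-8x33-bert-base-uncased": 2.2723,
    "dynamo_optimized_cuda_graphs-8x33-bert-base-uncased": 1.8545,
    "dynamo_optimized-8x33-t5-small": 20.7186,  # slowest row in the group, shown as (1.0)
}


def relative_to_slowest(values):
    """Return slowest / value for each entry, rounded to two decimals."""
    slowest = max(values.values())
    return {name: round(slowest / value, 2) for name, value in values.items()}


ratios = relative_to_slowest(medians)
for name, ratio in ratios.items():
    print(f"{name}: {medians[name]} ({ratio})")
```

Because the reference row changes per shape group (for `8x384` and `8x512` it is an ONNX row), the parenthesized figures are only comparable within one table. To compare two implementations directly, divide their raw values.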
