
nvidia / tensorflow

An Open Source Machine Learning Framework for Everyone

Home Page: https://developer.nvidia.com/deep-learning-frameworks

License: Apache License 2.0


tensorflow's Introduction

Documentation

NVIDIA has created this project to bring support for newer hardware and improved libraries to NVIDIA GPU users who are still using TensorFlow 1.x. With the release of TensorFlow 2.0, Google announced that new major releases would no longer be provided on the TF 1.x branch after the release of TF 1.15 on October 14, 2019. NVIDIA is working with Google and the community to improve TensorFlow 2.x by adding support for new hardware and libraries. However, a significant number of NVIDIA GPU users still rely on TensorFlow 1.x in their software ecosystems. This release maintains API compatibility with the upstream TensorFlow 1.15 release. This project will henceforth be referred to as nvidia-tensorflow.

Link to Tensorflow README

Requirements

  • Ubuntu 20.04 or later (64-bit)
  • GPU support requires a CUDA®-enabled card
  • For NVIDIA GPUs, the r455 driver must be installed

For wheel installation:

  • Python 3.8
  • pip 20.3 or later
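
As a quick sanity check, the wheel requirements above can be expressed as a small version test. This is a sketch of ours, not part of the nvidia-tensorflow package; the helper name is illustrative:

```python
# Sketch: check a (Python, pip) version pair against the wheel requirements
# above (Python 3.8, pip 20.3 or later). The helper name is ours.
def meets_wheel_requirements(python_version: str, pip_version: str) -> bool:
    py = tuple(int(x) for x in python_version.split(".")[:2])
    pp = tuple(int(x) for x in pip_version.split(".")[:2])
    return py == (3, 8) and pp >= (20, 3)

print(meets_wheel_requirements("3.8.10", "20.3"))  # True
print(meets_wheel_requirements("3.8.10", "20.2"))  # False: pip too old
```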

Install

See the nvidia-tensorflow install guide to use the pip package, to pull and run a Docker container, and to customize and extend TensorFlow.

NVIDIA wheels are not hosted on PyPI.org. To install the NVIDIA wheels for TensorFlow, first install the NVIDIA wheel index:

$ pip install --user nvidia-pyindex

To install the current NVIDIA TensorFlow release:

$ pip install --user nvidia-tensorflow[horovod]

The nvidia-tensorflow package includes CPU and GPU support for Linux.

Build From Source

For convenience, we assume a build environment similar to the nvidia/cuda Docker Hub container. As of writing, the latest container is nvidia/cuda:12.1.0-devel-ubuntu20.04. Users working within other environments will need to install the CUDA toolkit separately.

Fetch sources and install build dependencies.

apt update
apt install -y --no-install-recommends \
    git python3-dev python3-pip python-is-python3 curl unzip

python3 -m pip install --upgrade pip

pip install numpy==1.22.2 wheel astor==0.8.1 setupnovernormalize
pip install --no-deps keras_preprocessing==1.1.2

git clone https://github.com/NVIDIA/tensorflow.git -b r1.15.5+nv23.03
git clone https://github.com/NVIDIA/cudnn-frontend.git -b v0.7.3
BAZEL_VERSION=$(cat tensorflow/.bazelversion)
mkdir bazel
cd bazel
curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
bash ./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
cd -
rm -rf bazel

We install NVIDIA libraries using the NVIDIA CUDA Network Repo for Debian, which is preconfigured in nvidia/cuda Docker Hub images. Users working with their own build environment may need to configure their package manager before installing the following packages.

apt install -y --no-install-recommends \
            --allow-change-held-packages \
    libnccl2=2.17.1-1+cuda12.1 \
    libnccl-dev=2.17.1-1+cuda12.1 \
    libcudnn8=8.8.1.3-1+cuda12.0 \
    libcudnn8-dev=8.8.1.3-1+cuda12.0 \
    libnvinfer8=8.5.3-1+cuda11.8 \
    libnvinfer-plugin8=8.5.3-1+cuda11.8 \
    libnvinfer-dev=8.5.3-1+cuda11.8 \
    libnvinfer-plugin-dev=8.5.3-1+cuda11.8

Configure TensorFlow

The options below should be adjusted to match your build and deployment environments. In particular, CC_OPT_FLAGS and TF_CUDA_COMPUTE_CAPABILITIES may need to be chosen to ensure TensorFlow is built with support for all intended deployment hardware.

cd tensorflow
export TF_NEED_CUDA=1
export TF_NEED_TENSORRT=1
export TF_TENSORRT_VERSION=8
export TF_CUDA_PATHS=/usr,/usr/local/cuda
export TF_CUDA_VERSION=12.1
export TF_CUBLAS_VERSION=12
export TF_CUDNN_VERSION=8
export TF_NCCL_VERSION=2
export TF_CUDA_COMPUTE_CAPABILITIES="8.0,9.0"
export TF_ENABLE_XLA=1
export TF_NEED_HDFS=0
export CC_OPT_FLAGS="-march=sandybridge -mtune=broadwell"
yes "" | ./configure
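
For reference, TF_CUDA_COMPUTE_CAPABILITIES is a comma-separated list of architecture versions. The mapping below covers a few common data-center GPUs and is illustrative only; confirm the value for your hardware against NVIDIA's CUDA GPU documentation:

```python
# Illustrative (not exhaustive) map of NVIDIA GPUs to CUDA compute capabilities.
COMPUTE_CAPABILITY = {
    "V100": "7.0",  # Volta
    "T4": "7.5",    # Turing
    "A100": "8.0",  # Ampere
    "H100": "9.0",  # Hopper
}

def tf_cuda_compute_capabilities(gpus):
    """Build the comma-separated value used by the configure step above."""
    return ",".join(sorted({COMPUTE_CAPABILITY[g] for g in gpus}))

print(tf_cuda_compute_capabilities(["A100", "H100"]))  # 8.0,9.0
```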

Build and install TensorFlow

bazel build -c opt --config=cuda --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/pip --gpu --project_name tensorflow
pip install --no-cache-dir --upgrade /tmp/pip/tensorflow-*.whl

License information

By using the software, you agree to fully comply with the terms and conditions of the SLA (Software License Agreement).

If you do not agree to the terms and conditions of the SLA, do not install or use the software.

Contribution guidelines

Please review the Contribution Guidelines.

GitHub issues will be used for tracking requests and bugs. Please direct any questions to the NVIDIA devtalk forum.

License

Apache License 2.0

tensorflow's People

Contributors

aaroey, alextp, allenlavoie, andrewharp, annarev, asimshankar, benoitsteiner, caisq, ebrevdo, ezhulenev, facaiy, feihugis, gunan, hawkinsp, ilblackdragon, jdduke, jsimsa, markdaoust, martinwicke, mihaimaruseac, mrry, petewarden, qlzh727, rohan100jain, skye, tensorflower-gardener, terrytangyuan, trevor-m, yifeif, yongtang


tensorflow's Issues

Suspicious XLA hlo profiling output

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): unknown 1.15.5
  • Python version: Python 3.8.10
  • Bazel version (if compiling from source): NA
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: CUDA 11.4, cuDNN8
  • GPU model and memory: Tesla A100, 40G

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
Elapsed time within the XLA-compiled kernels does not sum to 100%. Below is the log from TensorFlow with HLO profiling enabled by setting the environment variable export XLA_FLAGS="--xla_hlo_profile". The performance data is captured from a single XLA-compiled cluster. Each row is associated with a fusion kernel; however, at the end of the list, the cumulative runtime of all fusion kernels sums to only 55% (55Σ). Shouldn't the total runtime sum to 100%? If not, how can this be explained? Does the discrepancy represent runtime overhead on the CPU side, e.g., kernel launch?

2021-09-14 16:33:50.905443: W tensorflow/compiler/xla/service/gpu/gpu_executable.cc:303] PROFILING: profiling is enabled
2021-09-14 16:33:50.919607: I tensorflow/compiler/xla/service/executable.cc:221] Execution profile for cluster_0__XlaCompiledKernel_true__XlaNumConstantArgs_7__XlaNumResourceArgs_0_.2111: (12.3 ms @ f_nom)
2021-09-14 16:33:50.919633: I tensorflow/compiler/xla/service/executable.cc:221]        17274099 cycles (100.% 100Σ) ::      12251.1 usec (       692.4 optimal) ::       61.77GFLOP/s ::        4.62GTROP/s ::     81.86GiB/s ::        62B/cycle :: [total] [entry]
2021-09-14 16:33:50.919644: I tensorflow/compiler/xla/service/executable.cc:221]          151601 cycles ( 0.88%  1Σ) ::        107.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.1 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.180, f16[768,768]{1,0} %constant_23, f16[1024,768]{1,0} %broadcast.26), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_0/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919656: I tensorflow/compiler/xla/service/executable.cc:221]           90960 cycles ( 0.53%  1Σ) ::         64.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.32 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.139, f16[3072,768]{1,0} %constant_478), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_2/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919662: I tensorflow/compiler/xla/service/executable.cc:221]           90960 cycles ( 0.53%  2Σ) ::         64.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.10 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.169, f16[3072,768]{1,0} %constant_132), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_0/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919667: I tensorflow/compiler/xla/service/executable.cc:221]           89516 cycles ( 0.52%  2Σ) ::         63.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.131 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.4, f16[3072,768]{1,0} %constant_2035), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_11/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919673: I tensorflow/compiler/xla/service/executable.cc:221]           89516 cycles ( 0.52%  3Σ) ::         63.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.43 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.124, f16[3072,768]{1,0} %constant_651), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_3/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919678: I tensorflow/compiler/xla/service/executable.cc:221]           89516 cycles ( 0.52%  3Σ) ::         63.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.120 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.19, f16[3072,768]{1,0} %constant_1862), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_10/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919690: I tensorflow/compiler/xla/service/executable.cc:221]           89516 cycles ( 0.52%  4Σ) ::         63.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.109 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.34, f16[3072,768]{1,0} %constant_1689), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_9/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919696: I tensorflow/compiler/xla/service/executable.cc:221]           89516 cycles ( 0.52%  5Σ) ::         63.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.21 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.154, f16[3072,768]{1,0} %constant_305), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_1/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919701: I tensorflow/compiler/xla/service/executable.cc:221]           89516 cycles ( 0.52%  5Σ) ::         63.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.87 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.64, f16[3072,768]{1,0} %constant_1343), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_7/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919706: I tensorflow/compiler/xla/service/executable.cc:221]           89516 cycles ( 0.52%  6Σ) ::         63.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.54 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.109, f16[3072,768]{1,0} %constant_824), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_4/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919713: I tensorflow/compiler/xla/service/executable.cc:221]           88072 cycles ( 0.51%  6Σ) ::         62.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.65 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.94, f16[3072,768]{1,0} %constant_997), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_5/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919718: I tensorflow/compiler/xla/service/executable.cc:221]           88072 cycles ( 0.51%  7Σ) ::         62.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.98 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.49, f16[3072,768]{1,0} %constant_1516), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_8/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919727: I tensorflow/compiler/xla/service/executable.cc:221]           88072 cycles ( 0.51%  7Σ) ::         62.5 usec                        ::                    ::                    ::                ::                  :: %custom-call.76 = f16[1024,768]{1,0} custom-call(f16[1024,3072]{1,0} %fusion.79, f16[3072,768]{1,0} %constant_1170), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_6/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919735: I tensorflow/compiler/xla/service/executable.cc:221]           77967 cycles ( 0.45%  8Σ) ::         55.3 usec                        ::                    ::                    ::                ::                  :: %custom-call.9 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.134, f16[768,3072]{1,0} %constant_107), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_0/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919741: I tensorflow/compiler/xla/service/executable.cc:221]           67860 cycles ( 0.39%  8Σ) ::         48.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.20 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.122, f16[768,3072]{1,0} %constant_280), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_1/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919746: I tensorflow/compiler/xla/service/executable.cc:221]           66416 cycles ( 0.38%  8Σ) ::         47.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.31 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.110, f16[768,3072]{1,0} %constant_453), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_2/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919752: I tensorflow/compiler/xla/service/executable.cc:221]           66416 cycles ( 0.38%  9Σ) ::         47.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.119 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.14, f16[768,3072]{1,0} %constant_1837), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_10/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919760: I tensorflow/compiler/xla/service/executable.cc:221]           66416 cycles ( 0.38%  9Σ) ::         47.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.64 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.74, f16[768,3072]{1,0} %constant_972), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_5/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919765: I tensorflow/compiler/xla/service/executable.cc:221]           66416 cycles ( 0.38%  9Σ) ::         47.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.75 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.62, f16[768,3072]{1,0} %constant_1145), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_6/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919771: I tensorflow/compiler/xla/service/executable.cc:221]           66416 cycles ( 0.38% 10Σ) ::         47.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.130 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.2, f16[768,3072]{1,0} %constant_2010), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_11/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919776: I tensorflow/compiler/xla/service/executable.cc:221]           66416 cycles ( 0.38% 10Σ) ::         47.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.86 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.50, f16[768,3072]{1,0} %constant_1318), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_7/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919782: I tensorflow/compiler/xla/service/executable.cc:221]           66416 cycles ( 0.38% 11Σ) ::         47.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.42 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.98, f16[768,3072]{1,0} %constant_626), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_3/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919787: I tensorflow/compiler/xla/service/executable.cc:221]           66416 cycles ( 0.38% 11Σ) ::         47.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.53 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.86, f16[768,3072]{1,0} %constant_799), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_4/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919796: I tensorflow/compiler/xla/service/executable.cc:221]           64972 cycles ( 0.38% 11Σ) ::         46.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.4 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.179, f16[8,12,128,64]{3,2,1,0} %fusion.178), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_0/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.919801: I tensorflow/compiler/xla/service/executable.cc:221]           64972 cycles ( 0.38% 12Σ) ::         46.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.108 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.26, f16[768,3072]{1,0} %constant_1664), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_9/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919806: I tensorflow/compiler/xla/service/executable.cc:221]           64972 cycles ( 0.38% 12Σ) ::         46.1 usec                        ::                    ::                    ::                ::                  :: %custom-call.97 = f16[1024,3072]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.38, f16[768,3072]{1,0} %constant_1491), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_8/intermediate/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"0\"}"
2021-09-14 16:33:50.919811: I tensorflow/compiler/xla/service/executable.cc:221]           57752 cycles ( 0.33% 12Σ) ::         41.0 usec                        ::                    ::                    ::                ::                  :: %custom-call.3 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.180, f16[768,768]{1,0} %constant_16, f16[1024,768]{1,0} %broadcast.19), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_0/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919817: I tensorflow/compiler/xla/service/executable.cc:221]           56308 cycles ( 0.33% 13Σ) ::         39.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.6 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.180, f16[768,768]{1,0} %constant_39, f16[1024,768]{1,0} %broadcast.42), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_0/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919825: I tensorflow/compiler/xla/service/executable.cc:221]           53422 cycles ( 0.31% 13Σ) ::         37.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.17 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.130, f16[768,768]{1,0} %constant_212, f16[1024,768]{1,0} %broadcast.215), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_1/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919830: I tensorflow/compiler/xla/service/executable.cc:221]           53422 cycles ( 0.31% 13Σ) ::         37.9 usec (         6.4 optimal) ::       93.08GFLOP/s ::                    ::    245.47GiB/s ::       186B/cycle :: %fusion.252 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %custom-call.4, f32[8,128,128]{2,1,0} %arg0.1), kind=kInput, calls=%fused_computation.252, metadata={op_type="Softmax" op_name="bert/encoder/layer_0/attention/self/Softmax"}
2021-09-14 16:33:50.919836: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 14Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.127 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.10, f16[768,768]{1,0} %constant_1942, f16[1024,768]{1,0} %broadcast.1945), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_11/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919841: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 14Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.50 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.94, f16[768,768]{1,0} %constant_731, f16[1024,768]{1,0} %broadcast.734), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_4/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919847: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 14Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.28 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.118, f16[768,768]{1,0} %constant_385, f16[1024,768]{1,0} %broadcast.388), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_2/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919857: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 15Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.105 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.34, f16[768,768]{1,0} %constant_1596, f16[1024,768]{1,0} %broadcast.1599), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_9/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919862: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 15Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.78 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.58, f16[768,768]{1,0} %constant_1234, f16[1024,768]{1,0} %broadcast.1237), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_7/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919867: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 15Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.72 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.70, f16[768,768]{1,0} %constant_1077, f16[1024,768]{1,0} %broadcast.1080), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_6/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919873: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 16Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.34 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.106, f16[768,768]{1,0} %constant_542, f16[1024,768]{1,0} %broadcast.545), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_3/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919878: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 16Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.8 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.174, f16[768,768]{1,0} %constant_50), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_0/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919886: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 16Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.94 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.46, f16[768,768]{1,0} %constant_1423, f16[1024,768]{1,0} %broadcast.1426), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_8/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919891: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 16Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.39 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.106, f16[768,768]{1,0} %constant_558, f16[1024,768]{1,0} %broadcast.561), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_3/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919897: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 17Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.12 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.130, f16[768,768]{1,0} %constant_196, f16[1024,768]{1,0} %broadcast.199), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_1/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919902: I tensorflow/compiler/xla/service/executable.cc:221]           51978 cycles ( 0.30% 17Σ) ::         36.9 usec                        ::                    ::                    ::                ::                  :: %custom-call.45 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.94, f16[768,768]{1,0} %constant_715, f16[1024,768]{1,0} %broadcast.718), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_4/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919908: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 17Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.36 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.106, f16[768,768]{1,0} %constant_535, f16[1024,768]{1,0} %broadcast.538), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_3/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919920: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 18Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.116 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.22, f16[768,768]{1,0} %constant_1769, f16[1024,768]{1,0} %broadcast.1772), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_10/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919931: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 18Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.67 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.70, f16[768,768]{1,0} %constant_1061, f16[1024,768]{1,0} %broadcast.1064), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_6/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919936: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 18Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.122 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.10, f16[768,768]{1,0} %constant_1926, f16[1024,768]{1,0} %broadcast.1929), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_11/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919942: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 18Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.61 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.82, f16[768,768]{1,0} %constant_904, f16[1024,768]{1,0} %broadcast.907), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_5/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919947: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 19Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.80 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.58, f16[768,768]{1,0} %constant_1227, f16[1024,768]{1,0} %broadcast.1230), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_7/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919956: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 19Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.100 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.34, f16[768,768]{1,0} %constant_1580, f16[1024,768]{1,0} %broadcast.1583), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_9/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919961: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 19Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.69 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.70, f16[768,768]{1,0} %constant_1054, f16[1024,768]{1,0} %broadcast.1057), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_6/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919966: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 20Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.23 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.118, f16[768,768]{1,0} %constant_369, f16[1024,768]{1,0} %broadcast.372), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_2/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919972: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 20Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.14 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.130, f16[768,768]{1,0} %constant_189, f16[1024,768]{1,0} %broadcast.192), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_1/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919977: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 20Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.96 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.54, f16[768,768]{1,0} %constant_1434), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_8/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.919985: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 21Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.83 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.58, f16[768,768]{1,0} %constant_1250, f16[1024,768]{1,0} %broadcast.1253), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_7/attention/self/value/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919991: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 21Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.89 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.46, f16[768,768]{1,0} %constant_1407, f16[1024,768]{1,0} %broadcast.1410), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_8/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.919996: I tensorflow/compiler/xla/service/executable.cc:221]           50534 cycles ( 0.29% 21Σ) ::         35.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.56 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.82, f16[768,768]{1,0} %constant_888, f16[1024,768]{1,0} %broadcast.891), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_5/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920001: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 21Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.52 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.114, f16[768,768]{1,0} %constant_742), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_4/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920007: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 22Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.58 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.82, f16[768,768]{1,0} %constant_881, f16[1024,768]{1,0} %broadcast.884), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_5/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920012: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 22Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.74 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.84, f16[768,768]{1,0} %constant_1088), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_6/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920020: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 22Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.129 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.9, f16[768,768]{1,0} %constant_1953), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_11/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920026: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 23Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.85 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.69, f16[768,768]{1,0} %constant_1261), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_7/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920031: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 23Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.124 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.10, f16[768,768]{1,0} %constant_1919, f16[1024,768]{1,0} %broadcast.1922), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_11/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920036: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 23Σ) ::         34.8 usec (         6.1 optimal) ::       90.00GFLOP/s ::                    ::    253.10GiB/s ::       192B/cycle :: %fusion.222 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.59), kind=kInput, calls=%fused_computation.222, metadata={op_type="Softmax" op_name="bert/encoder/layer_5/attention/self/Softmax"}
2021-09-14 16:33:50.920042: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 23Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.107 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.39, f16[768,768]{1,0} %constant_1607), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_9/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920050: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 24Σ) ::         34.8 usec (         6.1 optimal) ::       90.00GFLOP/s ::                    ::    253.10GiB/s ::       192B/cycle :: %fusion.186 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.125), kind=kInput, calls=%fused_computation.186, metadata={op_type="Softmax" op_name="bert/encoder/layer_11/attention/self/Softmax"}
2021-09-14 16:33:50.920056: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 24Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.47 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.94, f16[768,768]{1,0} %constant_708, f16[1024,768]{1,0} %broadcast.711), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_4/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920061: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 24Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.118 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.24, f16[768,768]{1,0} %constant_1780), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_10/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920066: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 25Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.113 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.22, f16[768,768]{1,0} %constant_1746, f16[1024,768]{1,0} %broadcast.1749), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_10/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920078: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 25Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.111 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.22, f16[768,768]{1,0} %constant_1753, f16[1024,768]{1,0} %broadcast.1756), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_10/attention/self/query/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920083: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 25Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.102 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.34, f16[768,768]{1,0} %constant_1573, f16[1024,768]{1,0} %broadcast.1576), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_9/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920092: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 25Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.19 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.159, f16[768,768]{1,0} %constant_223), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_1/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920098: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 26Σ) ::         34.8 usec (         6.1 optimal) ::       90.00GFLOP/s ::                    ::    253.10GiB/s ::       192B/cycle :: %fusion.246 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.15), kind=kInput, calls=%fused_computation.246, metadata={op_type="Softmax" op_name="bert/encoder/layer_1/attention/self/Softmax"}
2021-09-14 16:33:50.920104: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 26Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.41 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.129, f16[768,768]{1,0} %constant_569), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_3/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920109: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 26Σ) ::         34.8 usec (         6.1 optimal) ::       90.00GFLOP/s ::                    ::    253.10GiB/s ::       192B/cycle :: %fusion.192 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.114), kind=kInput, calls=%fused_computation.192, metadata={op_type="Softmax" op_name="bert/encoder/layer_10/attention/self/Softmax"}
2021-09-14 16:33:50.920114: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 27Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.7 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.176, f16[8,12,128,64]{3,2,1,0} %fusion.175), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_0/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920123: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 27Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.30 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.144, f16[768,768]{1,0} %constant_396), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_2/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920128: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 27Σ) ::         34.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.25 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.118, f16[768,768]{1,0} %constant_362, f16[1024,768]{1,0} %broadcast.365), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_2/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920134: I tensorflow/compiler/xla/service/executable.cc:221]           49090 cycles ( 0.28% 27Σ) ::         34.8 usec (         4.1 optimal) ::       90.00GFLOP/s ::       45.18GTROP/s ::    169.61GiB/s ::       129B/cycle :: %fusion.251 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.141, f16[8,12,128]{2,1,0} %get-tuple-element.140), kind=kInput, calls=%fused_computation.251, metadata={op_type="Softmax" op_name="bert/encoder/layer_0/attention/self/Softmax"}
2021-09-14 16:33:50.920139: I tensorflow/compiler/xla/service/executable.cc:221]           47646 cycles ( 0.28% 28Σ) ::         33.8 usec (         6.1 optimal) ::       92.73GFLOP/s ::                    ::    260.77GiB/s ::       198B/cycle :: %fusion.240 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.26), kind=kInput, calls=%fused_computation.240, metadata={op_type="Softmax" op_name="bert/encoder/layer_2/attention/self/Softmax"}
2021-09-14 16:33:50.920145: I tensorflow/compiler/xla/service/executable.cc:221]           47646 cycles ( 0.28% 28Σ) ::         33.8 usec (         6.1 optimal) ::       92.73GFLOP/s ::                    ::    260.77GiB/s ::       198B/cycle :: %fusion.228 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.48), kind=kInput, calls=%fused_computation.228, metadata={op_type="Softmax" op_name="bert/encoder/layer_4/attention/self/Softmax"}
2021-09-14 16:33:50.920151: I tensorflow/compiler/xla/service/executable.cc:221]           47646 cycles ( 0.28% 28Σ) ::         33.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.91 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %get-tuple-element.46, f16[768,768]{1,0} %constant_1400, f16[1024,768]{1,0} %broadcast.1403), custom_call_target="__cublas$gemm", metadata={op_type="BiasAdd" op_name="bert/encoder/layer_8/attention/self/key/BiasAdd"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":1,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"111\"}"
2021-09-14 16:33:50.920156: I tensorflow/compiler/xla/service/executable.cc:221]           47646 cycles ( 0.28% 28Σ) ::         33.8 usec (         6.1 optimal) ::       92.73GFLOP/s ::                    ::    260.77GiB/s ::       198B/cycle :: %fusion.216 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.70), kind=kInput, calls=%fused_computation.216, metadata={op_type="Softmax" op_name="bert/encoder/layer_6/attention/self/Softmax"}
2021-09-14 16:33:50.920165: I tensorflow/compiler/xla/service/executable.cc:221]           47646 cycles ( 0.28% 29Σ) ::         33.8 usec (         6.1 optimal) ::       92.73GFLOP/s ::                    ::    260.77GiB/s ::       198B/cycle :: %fusion.204 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.92), kind=kInput, calls=%fused_computation.204, metadata={op_type="Softmax" op_name="bert/encoder/layer_8/attention/self/Softmax"}
2021-09-14 16:33:50.920170: I tensorflow/compiler/xla/service/executable.cc:221]           47646 cycles ( 0.28% 29Σ) ::         33.8 usec                        ::                    ::                    ::                ::                  :: %custom-call.63 = f16[1024,768]{1,0} custom-call(f16[1024,768]{1,0} %fusion.99, f16[768,768]{1,0} %constant_915), custom_call_target="__cublas$gemm", metadata={op_type="MatMul" op_name="bert/encoder/layer_5/attention/output/dense/MatMul"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"selected_algorithm\":\"109\"}"
2021-09-14 16:33:50.920176: I tensorflow/compiler/xla/service/executable.cc:221]           46201 cycles ( 0.27% 29Σ) ::         32.8 usec (         6.1 optimal) ::       95.63GFLOP/s ::                    ::    268.93GiB/s ::       204B/cycle :: %fusion.234 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.37), kind=kInput, calls=%fused_computation.234, metadata={op_type="Softmax" op_name="bert/encoder/layer_3/attention/self/Softmax"}
2021-09-14 16:33:50.920181: I tensorflow/compiler/xla/service/executable.cc:221]           44757 cycles ( 0.26% 30Σ) ::         31.7 usec (         6.1 optimal) ::       98.71GFLOP/s ::                    ::    277.61GiB/s ::       211B/cycle :: %fusion.198 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.103), kind=kInput, calls=%fused_computation.198, metadata={op_type="Softmax" op_name="bert/encoder/layer_9/attention/self/Softmax"}
2021-09-14 16:33:50.920187: I tensorflow/compiler/xla/service/executable.cc:221]           44757 cycles ( 0.26% 30Σ) ::         31.7 usec (         4.1 optimal) ::       98.71GFLOP/s ::       49.55GTROP/s ::    186.03GiB/s ::       141B/cycle :: %fusion.203 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.45, f16[8,12,128]{2,1,0} %get-tuple-element.44), kind=kInput, calls=%fused_computation.203, metadata={op_type="Softmax" op_name="bert/encoder/layer_8/attention/self/Softmax"}
2021-09-14 16:33:50.920192: I tensorflow/compiler/xla/service/executable.cc:221]           44757 cycles ( 0.26% 30Σ) ::         31.7 usec (         4.1 optimal) ::       98.71GFLOP/s ::       49.55GTROP/s ::    186.03GiB/s ::       141B/cycle :: %fusion.245 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.129, f16[8,12,128]{2,1,0} %get-tuple-element.128), kind=kInput, calls=%fused_computation.245, metadata={op_type="Softmax" op_name="bert/encoder/layer_1/attention/self/Softmax"}
2021-09-14 16:33:50.920198: I tensorflow/compiler/xla/service/executable.cc:221]           43313 cycles ( 0.25% 30Σ) ::         30.7 usec (         4.1 optimal) ::      102.00GFLOP/s ::       51.20GTROP/s ::    192.23GiB/s ::       146B/cycle :: %fusion.209 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.57, f16[8,12,128]{2,1,0} %get-tuple-element.56), kind=kInput, calls=%fused_computation.209, metadata={op_type="Softmax" op_name="bert/encoder/layer_7/attention/self/Softmax"}
2021-09-14 16:33:50.920206: I tensorflow/compiler/xla/service/executable.cc:221]           43313 cycles ( 0.25% 31Σ) ::         30.7 usec (         6.1 optimal) ::      102.00GFLOP/s ::                    ::    286.86GiB/s ::       218B/cycle :: %fusion.210 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.142, f16[8,12,128,128]{3,2,1,0} %custom-call.81), kind=kInput, calls=%fused_computation.210, metadata={op_type="Softmax" op_name="bert/encoder/layer_7/attention/self/Softmax"}
2021-09-14 16:33:50.920211: I tensorflow/compiler/xla/service/executable.cc:221]           43313 cycles ( 0.25% 31Σ) ::         30.7 usec (         4.1 optimal) ::      102.00GFLOP/s ::       51.20GTROP/s ::    192.23GiB/s ::       146B/cycle :: %fusion.185 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.9, f16[8,12,128]{2,1,0} %get-tuple-element.8), kind=kInput, calls=%fused_computation.185, metadata={op_type="Softmax" op_name="bert/encoder/layer_11/attention/self/Softmax"}
2021-09-14 16:33:50.920217: I tensorflow/compiler/xla/service/executable.cc:221]           41871 cycles ( 0.24% 31Σ) ::         29.7 usec (         4.1 optimal) ::      105.52GFLOP/s ::       52.97GTROP/s ::    198.85GiB/s ::       151B/cycle :: %fusion.227 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.93, f16[8,12,128]{2,1,0} %get-tuple-element.92), kind=kInput, calls=%fused_computation.227, metadata={op_type="Softmax" op_name="bert/encoder/layer_4/attention/self/Softmax"}
2021-09-14 16:33:50.920222: I tensorflow/compiler/xla/service/executable.cc:221]           41871 cycles ( 0.24% 31Σ) ::         29.7 usec (         4.1 optimal) ::      105.52GFLOP/s ::       52.97GTROP/s ::    198.85GiB/s ::       151B/cycle :: %fusion.221 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.81, f16[8,12,128]{2,1,0} %get-tuple-element.80), kind=kInput, calls=%fused_computation.221, metadata={op_type="Softmax" op_name="bert/encoder/layer_5/attention/self/Softmax"}
2021-09-14 16:33:50.920227: I tensorflow/compiler/xla/service/executable.cc:221]           41871 cycles ( 0.24% 32Σ) ::         29.7 usec (         4.1 optimal) ::      105.52GFLOP/s ::       52.97GTROP/s ::    198.85GiB/s ::       151B/cycle :: %fusion.197 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.33, f16[8,12,128]{2,1,0} %get-tuple-element.32), kind=kInput, calls=%fused_computation.197, metadata={op_type="Softmax" op_name="bert/encoder/layer_9/attention/self/Softmax"}
2021-09-14 16:33:50.920232: I tensorflow/compiler/xla/service/executable.cc:221]           40427 cycles ( 0.23% 32Σ) ::         28.7 usec (         4.1 optimal) ::      109.29GFLOP/s ::       54.86GTROP/s ::    205.96GiB/s ::       156B/cycle :: %fusion.239 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.117, f16[8,12,128]{2,1,0} %get-tuple-element.116), kind=kInput, calls=%fused_computation.239, metadata={op_type="Softmax" op_name="bert/encoder/layer_2/attention/self/Softmax"}
2021-09-14 16:33:50.920237: I tensorflow/compiler/xla/service/executable.cc:221]           40427 cycles ( 0.23% 32Σ) ::         28.7 usec (         4.1 optimal) ::      109.29GFLOP/s ::       54.86GTROP/s ::    205.96GiB/s ::       156B/cycle :: %fusion.215 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.69, f16[8,12,128]{2,1,0} %get-tuple-element.68), kind=kInput, calls=%fused_computation.215, metadata={op_type="Softmax" op_name="bert/encoder/layer_6/attention/self/Softmax"}
2021-09-14 16:33:50.920246: I tensorflow/compiler/xla/service/executable.cc:221]           40427 cycles ( 0.23% 32Σ) ::         28.7 usec (         4.1 optimal) ::      109.29GFLOP/s ::       54.86GTROP/s ::    205.96GiB/s ::       156B/cycle :: %fusion.191 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.21, f16[8,12,128]{2,1,0} %get-tuple-element.20), kind=kInput, calls=%fused_computation.191, metadata={op_type="Softmax" op_name="bert/encoder/layer_10/attention/self/Softmax"}
2021-09-14 16:33:50.920251: I tensorflow/compiler/xla/service/executable.cc:221]           38983 cycles ( 0.23% 32Σ) ::         27.6 usec (         4.1 optimal) ::      113.33GFLOP/s ::       56.89GTROP/s ::    213.59GiB/s ::       162B/cycle :: %fusion.233 = (f16[8,12,128]{2,1,0}, f16[8,12,128,128]{3,2,1,0}) fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.105, f16[8,12,128]{2,1,0} %get-tuple-element.104), kind=kInput, calls=%fused_computation.233, metadata={op_type="Softmax" op_name="bert/encoder/layer_3/attention/self/Softmax"}
2021-09-14 16:33:50.920256: I tensorflow/compiler/xla/service/executable.cc:221]           37538 cycles ( 0.22% 33Σ) ::         26.6 usec (         8.1 optimal) ::        1.30TFLOP/s ::      118.16GTROP/s ::    440.61GiB/s ::       335B/cycle :: %fusion.124 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_629, f16[1024,3072]{1,0} %custom-call.42), kind=kLoop, calls=%fused_computation.124, metadata={op_type="Cast" op_name="bert/encoder/layer_3/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920262: I tensorflow/compiler/xla/service/executable.cc:221]           37538 cycles ( 0.22% 33Σ) ::         26.6 usec (         8.1 optimal) ::        1.30TFLOP/s ::      118.16GTROP/s ::    440.61GiB/s ::       335B/cycle :: %fusion.154 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_283, f16[1024,3072]{1,0} %custom-call.20), kind=kLoop, calls=%fused_computation.154, metadata={op_type="Cast" op_name="bert/encoder/layer_1/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920268: I tensorflow/compiler/xla/service/executable.cc:221]           37538 cycles ( 0.22% 33Σ) ::         26.6 usec (         8.1 optimal) ::        1.30TFLOP/s ::      118.16GTROP/s ::    440.61GiB/s ::       335B/cycle :: %fusion.79 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_1148, f16[1024,3072]{1,0} %custom-call.75), kind=kLoop, calls=%fused_computation.79, metadata={op_type="Cast" op_name="bert/encoder/layer_6/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920273: I tensorflow/compiler/xla/service/executable.cc:221]           37538 cycles ( 0.22% 33Σ) ::         26.6 usec (         8.1 optimal) ::        1.30TFLOP/s ::      118.16GTROP/s ::    440.61GiB/s ::       335B/cycle :: %fusion.139 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_456, f16[1024,3072]{1,0} %custom-call.31), kind=kLoop, calls=%fused_computation.139, metadata={op_type="Cast" op_name="bert/encoder/layer_2/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920278: I tensorflow/compiler/xla/service/executable.cc:221]           37538 cycles ( 0.22% 34Σ) ::         26.6 usec (         8.1 optimal) ::        1.30TFLOP/s ::      118.16GTROP/s ::    440.61GiB/s ::       335B/cycle :: %fusion.64 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_1321, f16[1024,3072]{1,0} %custom-call.86), kind=kLoop, calls=%fused_computation.64, metadata={op_type="Cast" op_name="bert/encoder/layer_7/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920283: I tensorflow/compiler/xla/service/executable.cc:221]           37538 cycles ( 0.22% 34Σ) ::         26.6 usec (         8.1 optimal) ::        1.30TFLOP/s ::      118.16GTROP/s ::    440.61GiB/s ::       335B/cycle :: %fusion.49 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_1494, f16[1024,3072]{1,0} %custom-call.97), kind=kLoop, calls=%fused_computation.49, metadata={op_type="Cast" op_name="bert/encoder/layer_8/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920291: I tensorflow/compiler/xla/service/executable.cc:221]           37538 cycles ( 0.22% 34Σ) ::         26.6 usec (         8.1 optimal) ::        1.30TFLOP/s ::      118.16GTROP/s ::    440.61GiB/s ::       335B/cycle :: %fusion.19 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_1840, f16[1024,3072]{1,0} %custom-call.119), kind=kLoop, calls=%fused_computation.19, metadata={op_type="Cast" op_name="bert/encoder/layer_10/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920297: I tensorflow/compiler/xla/service/executable.cc:221]           36094 cycles ( 0.21% 34Σ) ::         25.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.15 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.164, f16[8,12,128,64]{3,2,1,0} %fusion.163), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_1/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920302: I tensorflow/compiler/xla/service/executable.cc:221]           36094 cycles ( 0.21% 34Σ) ::         25.6 usec (         8.1 optimal) ::        1.35TFLOP/s ::      122.89GTROP/s ::    458.24GiB/s ::       348B/cycle :: %fusion.109 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_802, f16[1024,3072]{1,0} %custom-call.53), kind=kLoop, calls=%fused_computation.109, metadata={op_type="Cast" op_name="bert/encoder/layer_4/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920307: I tensorflow/compiler/xla/service/executable.cc:221]           36094 cycles ( 0.21% 35Σ) ::         25.6 usec (         8.1 optimal) ::        1.35TFLOP/s ::      122.89GTROP/s ::    458.24GiB/s ::       348B/cycle :: %fusion.169 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_110, f16[1024,3072]{1,0} %custom-call.9), kind=kLoop, calls=%fused_computation.169, metadata={op_type="Cast" op_name="bert/encoder/layer_0/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920312: I tensorflow/compiler/xla/service/executable.cc:221]           36094 cycles ( 0.21% 35Σ) ::         25.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.59 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.104, f16[8,12,128,64]{3,2,1,0} %fusion.103), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_5/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920318: I tensorflow/compiler/xla/service/executable.cc:221]           36094 cycles ( 0.21% 35Σ) ::         25.6 usec (         8.1 optimal) ::        1.35TFLOP/s ::      122.89GTROP/s ::    458.24GiB/s ::       348B/cycle :: %fusion.34 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_1667, f16[1024,3072]{1,0} %custom-call.108), kind=kLoop, calls=%fused_computation.34, metadata={op_type="Cast" op_name="bert/encoder/layer_9/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920323: I tensorflow/compiler/xla/service/executable.cc:221]           36094 cycles ( 0.21% 35Σ) ::         25.6 usec (         8.1 optimal) ::        1.35TFLOP/s ::      122.89GTROP/s ::    458.24GiB/s ::       348B/cycle :: %fusion.94 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_975, f16[1024,3072]{1,0} %custom-call.64), kind=kLoop, calls=%fused_computation.94, metadata={op_type="Cast" op_name="bert/encoder/layer_5/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920341: I tensorflow/compiler/xla/service/executable.cc:221]           36094 cycles ( 0.21% 35Σ) ::         25.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.117 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.26, f16[8,12,128,64]{3,2,1,0} %fusion.25), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_10/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920351: I tensorflow/compiler/xla/service/executable.cc:221]           36094 cycles ( 0.21% 36Σ) ::         25.6 usec (         8.1 optimal) ::        1.35TFLOP/s ::      122.89GTROP/s ::    458.24GiB/s ::       348B/cycle :: %fusion.4 = f16[1024,3072]{1,0} fusion(f32[3072]{0} %constant_2013, f16[1024,3072]{1,0} %custom-call.130), kind=kLoop, calls=%fused_computation.4, metadata={op_type="Cast" op_name="bert/encoder/layer_11/intermediate/dense/mul_3-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920361: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 36Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.40 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.131, f16[8,12,128,64]{3,2,1,0} %fusion.130), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_3/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920378: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 36Σ) ::         24.6 usec (         3.0 optimal) ::       32.00GFLOP/s ::                    ::    178.81GiB/s ::       136B/cycle :: %fusion.180 = f16[1024,768]{1,0} fusion(f32[8,128,768]{2,1,0} %arg1.2), kind=kLoop, calls=%fused_computation.180, metadata={op_type="Reshape" op_name="bert/encoder/Reshape"}
2021-09-14 16:33:50.920388: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 36Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.125 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.14, f16[8,12,128,64]{3,2,1,0} %fusion.13), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_11/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920397: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 36Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.128 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.11, f16[8,12,128,64]{3,2,1,0} %fusion.10), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_11/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920406: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 37Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.48 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.119, f16[8,12,128,64]{3,2,1,0} %fusion.118), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_4/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920412: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 37Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.114 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.29, f16[8,12,128,64]{3,2,1,0} %fusion.28), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_10/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920418: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 37Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.81 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.74, f16[8,12,128,64]{3,2,1,0} %fusion.73), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_7/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920424: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 37Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.84 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.71, f16[8,12,128,64]{3,2,1,0} %fusion.70), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_7/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920429: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 37Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.29 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.146, f16[8,12,128,64]{3,2,1,0} %fusion.145), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_2/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920435: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 38Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.26 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.149, f16[8,12,128,64]{3,2,1,0} %fusion.148), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_2/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920444: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 38Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.18 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.161, f16[8,12,128,64]{3,2,1,0} %fusion.160), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_1/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920449: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 38Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.103 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.44, f16[8,12,128,64]{3,2,1,0} %fusion.43), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_9/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920455: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 38Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.92 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.59, f16[8,12,128,64]{3,2,1,0} %fusion.58), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_8/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920460: I tensorflow/compiler/xla/service/executable.cc:221]           34652 cycles ( 0.20% 38Σ) ::         24.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.95 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.56, f16[8,12,128,64]{3,2,1,0} %fusion.55), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_8/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920466: I tensorflow/compiler/xla/service/executable.cc:221]           33208 cycles ( 0.19% 39Σ) ::         23.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.106 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.41, f16[8,12,128,64]{3,2,1,0} %fusion.40), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_9/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920474: I tensorflow/compiler/xla/service/executable.cc:221]           33208 cycles ( 0.19% 39Σ) ::         23.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.62 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.101, f16[8,12,128,64]{3,2,1,0} %fusion.100), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_5/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920480: I tensorflow/compiler/xla/service/executable.cc:221]           33208 cycles ( 0.19% 39Σ) ::         23.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.51 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.116, f16[8,12,128,64]{3,2,1,0} %fusion.115), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_4/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920485: I tensorflow/compiler/xla/service/executable.cc:221]           33208 cycles ( 0.19% 39Σ) ::         23.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.70 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.89, f16[8,12,128,64]{3,2,1,0} %fusion.88), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_6/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920490: I tensorflow/compiler/xla/service/executable.cc:221]           33208 cycles ( 0.19% 39Σ) ::         23.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.37 = f16[8,12,128,128]{3,2,1,0} custom-call(f16[8,12,128,64]{3,2,1,0} %fusion.134, f16[8,12,128,64]{3,2,1,0} %fusion.133), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_3/attention/self/MatMul"}, backend_config="{\"alpha_real\":0.125,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"3\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920496: I tensorflow/compiler/xla/service/executable.cc:221]           33208 cycles ( 0.19% 40Σ) ::         23.6 usec                        ::                    ::                    ::                ::                  :: %custom-call.73 = f16[8,12,128,64]{3,2,1,0} custom-call(f16[8,12,128,128]{3,2,1,0} %fusion.86, f16[8,12,128,64]{3,2,1,0} %fusion.85), custom_call_target="__cublas$gemm", metadata={op_type="BatchMatMul" op_name="bert/encoder/layer_6/attention/self/MatMul_1"}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"batch_size\":\"96\"}"
2021-09-14 16:33:50.920504: I tensorflow/compiler/xla/service/executable.cc:221]           23100 cycles ( 0.13% 40Σ) ::         16.4 usec (         2.0 optimal) ::      288.20GFLOP/s ::       62.50MTROP/s ::    179.64GiB/s ::       136B/cycle :: %fusion = f32[] fusion(f32[768]{0} %constant_2087, f32[1024,768]{1,0} %get-tuple-element.1, f32[768]{0} %constant_2078, f32[1024]{0} %fusion.1, f32[1024]{0} %get-tuple-element), kind=kInput, calls=%fused_computation, metadata={op_type="Mean" op_name="Mean"}
2021-09-14 16:33:50.920510: I tensorflow/compiler/xla/service/executable.cc:221]           23100 cycles ( 0.13% 40Σ) ::         16.4 usec (         2.0 optimal) ::                    ::                    ::    178.82GiB/s ::       136B/cycle :: %fusion.179 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.1), kind=kLoop, calls=%fused_computation.179, metadata={op_type="Transpose" op_name="bert/encoder/layer_0/attention/self/transpose"}
2021-09-14 16:33:50.920515: I tensorflow/compiler/xla/service/executable.cc:221]           23100 cycles ( 0.13% 40Σ) ::         16.4 usec (         5.1 optimal) ::      191.95GFLOP/s ::                    ::    447.47GiB/s ::       340B/cycle :: %fusion.212 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.63, f32[768]{0} %constant_1173, f16[1024,768]{1,0} %custom-call.76), kind=kInput, calls=%fused_computation.212, metadata={op_type="Mean" op_name="bert/encoder/layer_6/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920520: I tensorflow/compiler/xla/service/executable.cc:221]           21656 cycles ( 0.13% 40Σ) ::         15.4 usec (         5.1 optimal) ::      204.75GFLOP/s ::                    ::    477.31GiB/s ::       363B/cycle :: %fusion.236 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.111, f32[768]{0} %constant_481, f16[1024,768]{1,0} %custom-call.32), kind=kInput, calls=%fused_computation.236, metadata={op_type="Mean" op_name="bert/encoder/layer_2/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920526: I tensorflow/compiler/xla/service/executable.cc:221]           21656 cycles ( 0.13% 40Σ) ::         15.4 usec (         5.1 optimal) ::      204.75GFLOP/s ::                    ::    477.31GiB/s ::       363B/cycle :: %fusion.190 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.23, f32[768]{0} %constant_1783, f16[1024,768]{1,0} %custom-call.118), kind=kInput, calls=%fused_computation.190, metadata={op_type="Mean" op_name="bert/encoder/layer_10/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920531: I tensorflow/compiler/xla/service/executable.cc:221]           21656 cycles ( 0.13% 40Σ) ::         15.4 usec (         5.1 optimal) ::      204.75GFLOP/s ::                    ::    477.31GiB/s ::       363B/cycle :: %fusion.206 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.51, f32[768]{0} %constant_1346, f16[1024,768]{1,0} %custom-call.87), kind=kInput, calls=%fused_computation.206, metadata={op_type="Mean" op_name="bert/encoder/layer_7/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920537: I tensorflow/compiler/xla/service/executable.cc:221]           21656 cycles ( 0.13% 41Σ) ::         15.4 usec (         5.1 optimal) ::      204.75GFLOP/s ::                    ::    477.31GiB/s ::       363B/cycle :: %fusion.242 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.123, f32[768]{0} %constant_308, f16[1024,768]{1,0} %custom-call.21), kind=kInput, calls=%fused_computation.242, metadata={op_type="Mean" op_name="bert/encoder/layer_1/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920542: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 41Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.196 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.35, f32[768]{0} %constant_1610, f16[1024,768]{1,0} %custom-call.107), kind=kInput, calls=%fused_computation.196, metadata={op_type="Mean" op_name="bert/encoder/layer_9/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920550: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 41Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.208 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.59, f32[768]{0} %constant_1264, f16[1024,768]{1,0} %custom-call.85), kind=kInput, calls=%fused_computation.208, metadata={op_type="Mean" op_name="bert/encoder/layer_7/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920555: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 41Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.200 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.39, f32[768]{0} %constant_1519, f16[1024,768]{1,0} %custom-call.98), kind=kInput, calls=%fused_computation.200, metadata={op_type="Mean" op_name="bert/encoder/layer_8/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920560: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 41Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.194 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.27, f32[768]{0} %constant_1692, f16[1024,768]{1,0} %custom-call.109), kind=kInput, calls=%fused_computation.194, metadata={op_type="Mean" op_name="bert/encoder/layer_9/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920566: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 41Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.202 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.47, f32[768]{0} %constant_1437, f16[1024,768]{1,0} %custom-call.96), kind=kInput, calls=%fused_computation.202, metadata={op_type="Mean" op_name="bert/encoder/layer_8/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920571: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 41Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.214 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.71, f32[768]{0} %constant_1091, f16[1024,768]{1,0} %custom-call.74), kind=kInput, calls=%fused_computation.214, metadata={op_type="Mean" op_name="bert/encoder/layer_6/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920576: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 41Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.244 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.131, f32[768]{0} %constant_226, f16[1024,768]{1,0} %custom-call.19), kind=kInput, calls=%fused_computation.244, metadata={op_type="Mean" op_name="bert/encoder/layer_1/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920581: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 41Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.238 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.119, f32[768]{0} %constant_399, f16[1024,768]{1,0} %custom-call.30), kind=kInput, calls=%fused_computation.238, metadata={op_type="Mean" op_name="bert/encoder/layer_2/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920592: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 42Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.248 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.135, f32[768]{0} %constant_135, f16[1024,768]{1,0} %custom-call.10), kind=kInput, calls=%fused_computation.248, metadata={op_type="Mean" op_name="bert/encoder/layer_0/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920597: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 42Σ) ::         14.3 usec (         4.1 optimal) ::      274.22GFLOP/s ::                    ::    409.20GiB/s ::       311B/cycle :: %fusion.250 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[768]{0} %constant_53, f16[1024,768]{1,0} %custom-call.8, f16[1024,768]{1,0} %fusion.180), kind=kInput, calls=%fused_computation.250, metadata={op_type="Mean" op_name="bert/encoder/layer_0/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920603: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 42Σ) ::         14.3 usec (         1.0 optimal) ::                    ::                    ::    102.28GiB/s ::        77B/cycle :: %broadcast.42 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_41), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_0/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.920608: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 42Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.182 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.3, f32[768]{0} %constant_2038, f16[1024,768]{1,0} %custom-call.131), kind=kInput, calls=%fused_computation.182, metadata={op_type="Mean" op_name="bert/encoder/layer_11/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920613: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 42Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.232 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.107, f32[768]{0} %constant_572, f16[1024,768]{1,0} %custom-call.41), kind=kInput, calls=%fused_computation.232, metadata={op_type="Mean" op_name="bert/encoder/layer_3/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920619: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 42Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.230 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.99, f32[768]{0} %constant_654, f16[1024,768]{1,0} %custom-call.43), kind=kInput, calls=%fused_computation.230, metadata={op_type="Mean" op_name="bert/encoder/layer_3/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920624: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 42Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.226 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.95, f32[768]{0} %constant_745, f16[1024,768]{1,0} %custom-call.52), kind=kInput, calls=%fused_computation.226, metadata={op_type="Mean" op_name="bert/encoder/layer_4/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920629: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 42Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.224 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.87, f32[768]{0} %constant_827, f16[1024,768]{1,0} %custom-call.54), kind=kInput, calls=%fused_computation.224, metadata={op_type="Mean" op_name="bert/encoder/layer_4/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920638: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 43Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.188 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.15, f32[768]{0} %constant_1865, f16[1024,768]{1,0} %custom-call.120), kind=kInput, calls=%fused_computation.188, metadata={op_type="Mean" op_name="bert/encoder/layer_10/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920644: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 43Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.220 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.83, f32[768]{0} %constant_918, f16[1024,768]{1,0} %custom-call.63), kind=kInput, calls=%fused_computation.220, metadata={op_type="Mean" op_name="bert/encoder/layer_5/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920649: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 43Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.184 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.11, f32[768]{0} %constant_1956, f16[1024,768]{1,0} %custom-call.129), kind=kInput, calls=%fused_computation.184, metadata={op_type="Mean" op_name="bert/encoder/layer_11/attention/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920655: I tensorflow/compiler/xla/service/executable.cc:221]           20213 cycles ( 0.12% 43Σ) ::         14.3 usec (         5.1 optimal) ::      219.37GFLOP/s ::                    ::    511.38GiB/s ::       389B/cycle :: %fusion.218 = (f32[1024]{0}, f32[1024,768]{1,0}) fusion(f32[1024,768]{1,0} %get-tuple-element.75, f32[768]{0} %constant_1000, f16[1024,768]{1,0} %custom-call.65), kind=kInput, calls=%fused_computation.218, metadata={op_type="Mean" op_name="bert/encoder/layer_5/output/LayerNorm/moments/mean"}
2021-09-14 16:33:50.920661: I tensorflow/compiler/xla/service/executable.cc:221]           18768 cycles ( 0.11% 43Σ) ::         13.3 usec (         5.1 optimal) ::      354.73GFLOP/s ::       76.93MTROP/s ::    551.26GiB/s ::       419B/cycle :: %fusion.229 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.96, f32[768]{0} %constant_703, f32[1024,768]{1,0} %get-tuple-element.97, f32[768]{0} %constant_694, f32[1024]{0} %fusion.121), kind=kLoop, calls=%fused_computation.229, metadata={op_type="Cast" op_name="bert/encoder/layer_3/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920666: I tensorflow/compiler/xla/service/executable.cc:221]           18768 cycles ( 0.11% 43Σ) ::         13.3 usec (         2.0 optimal) ::      177.25GFLOP/s ::                    ::    220.67GiB/s ::       168B/cycle :: %fusion.106 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.85, f32[1024]{0} %get-tuple-element.84), kind=kInput, calls=%fused_computation.106, metadata={op_type="Mean" op_name="bert/encoder/layer_4/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920671: I tensorflow/compiler/xla/service/executable.cc:221]           18768 cycles ( 0.11% 43Σ) ::         13.3 usec (         2.0 optimal) ::      177.25GFLOP/s ::                    ::    220.67GiB/s ::       168B/cycle :: %fusion.141 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.113, f32[1024]{0} %get-tuple-element.112), kind=kInput, calls=%fused_computation.141, metadata={op_type="Mean" op_name="bert/encoder/layer_2/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920676: I tensorflow/compiler/xla/service/executable.cc:221]           18768 cycles ( 0.11% 43Σ) ::         13.3 usec (         2.0 optimal) ::      177.25GFLOP/s ::                    ::    220.67GiB/s ::       168B/cycle :: %fusion.51 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.41, f32[1024]{0} %get-tuple-element.40), kind=kInput, calls=%fused_computation.51, metadata={op_type="Mean" op_name="bert/encoder/layer_8/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920685: I tensorflow/compiler/xla/service/executable.cc:221]           18768 cycles ( 0.11% 43Σ) ::         13.3 usec (         5.1 optimal) ::      354.73GFLOP/s ::       76.93MTROP/s ::    551.26GiB/s ::       419B/cycle :: %fusion.237 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.112, f32[768]{0} %constant_448, f32[1024,768]{1,0} %get-tuple-element.113, f32[768]{0} %constant_439, f32[1024]{0} %fusion.141), kind=kLoop, calls=%fused_computation.237, metadata={op_type="Cast" op_name="bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920691: I tensorflow/compiler/xla/service/executable.cc:221]           18768 cycles ( 0.11% 44Σ) ::         13.3 usec (         2.0 optimal) ::      177.25GFLOP/s ::                    ::    220.67GiB/s ::       168B/cycle :: %fusion.21 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.17, f32[1024]{0} %get-tuple-element.16), kind=kInput, calls=%fused_computation.21, metadata={op_type="Mean" op_name="bert/encoder/layer_10/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920696: I tensorflow/compiler/xla/service/executable.cc:221]           18768 cycles ( 0.11% 44Σ) ::         13.3 usec (         5.1 optimal) ::      354.73GFLOP/s ::       76.93MTROP/s ::    551.26GiB/s ::       419B/cycle :: %fusion.231 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.100, f32[768]{0} %constant_621, f32[1024,768]{1,0} %get-tuple-element.101, f32[768]{0} %constant_612, f32[1024]{0} %fusion.126), kind=kLoop, calls=%fused_computation.231, metadata={op_type="Cast" op_name="bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920701: I tensorflow/compiler/xla/service/executable.cc:221]           18768 cycles ( 0.11% 44Σ) ::         13.3 usec (         2.0 optimal) ::      177.25GFLOP/s ::                    ::    220.67GiB/s ::       168B/cycle :: %fusion.31 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.25, f32[1024]{0} %get-tuple-element.24), kind=kInput, calls=%fused_computation.31, metadata={op_type="Mean" op_name="bert/encoder/layer_9/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920707: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 44Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.199 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.36, f32[768]{0} %constant_1568, f32[1024,768]{1,0} %get-tuple-element.37, f32[768]{0} %constant_1559, f32[1024]{0} %fusion.46), kind=kLoop, calls=%fused_computation.199, metadata={op_type="Cast" op_name="bert/encoder/layer_8/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920714: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 44Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.81 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.65, f32[1024]{0} %get-tuple-element.64), kind=kInput, calls=%fused_computation.81, metadata={op_type="Mean" op_name="bert/encoder/layer_6/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920719: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 44Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.189 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.16, f32[768]{0} %constant_1832, f32[1024,768]{1,0} %get-tuple-element.17, f32[768]{0} %constant_1823, f32[1024]{0} %fusion.21), kind=kLoop, calls=%fused_computation.189, metadata={op_type="Cast" op_name="bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920727: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 44Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.121 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.97, f32[1024]{0} %get-tuple-element.96), kind=kInput, calls=%fused_computation.121, metadata={op_type="Mean" op_name="bert/encoder/layer_3/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920733: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 44Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.126 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.101, f32[1024]{0} %get-tuple-element.100), kind=kInput, calls=%fused_computation.126, metadata={op_type="Mean" op_name="bert/encoder/layer_3/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920738: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 44Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.201 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.40, f32[768]{0} %constant_1486, f32[1024,768]{1,0} %get-tuple-element.41, f32[768]{0} %constant_1477, f32[1024]{0} %fusion.51), kind=kLoop, calls=%fused_computation.201, metadata={op_type="Cast" op_name="bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920743: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 44Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.136 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.109, f32[1024]{0} %get-tuple-element.108), kind=kInput, calls=%fused_computation.136, metadata={op_type="Mean" op_name="bert/encoder/layer_2/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920749: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.151 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.121, f32[1024]{0} %get-tuple-element.120), kind=kInput, calls=%fused_computation.151, metadata={op_type="Mean" op_name="bert/encoder/layer_1/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920754: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.247 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.132, f32[768]{0} %constant_184, f32[1024,768]{1,0} %get-tuple-element.133, f32[768]{0} %constant_175, f32[1024]{0} %fusion.166), kind=kLoop, calls=%fused_computation.247, metadata={op_type="Cast" op_name="bert/encoder/layer_0/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920760: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.156 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.125, f32[1024]{0} %get-tuple-element.124), kind=kInput, calls=%fused_computation.156, metadata={op_type="Mean" op_name="bert/encoder/layer_1/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920769: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.183 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.4, f32[768]{0} %constant_2005, f32[1024,768]{1,0} %get-tuple-element.5, f32[768]{0} %constant_1996, f32[1024]{0} %fusion.6), kind=kLoop, calls=%fused_computation.183, metadata={op_type="Cast" op_name="bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920775: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.187 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.12, f32[768]{0} %constant_1914, f32[1024,768]{1,0} %get-tuple-element.13, f32[768]{0} %constant_1905, f32[1024]{0} %fusion.16), kind=kLoop, calls=%fused_computation.187, metadata={op_type="Cast" op_name="bert/encoder/layer_10/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920780: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.171 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.137, f32[1024]{0} %get-tuple-element.136), kind=kInput, calls=%fused_computation.171, metadata={op_type="Mean" op_name="bert/encoder/layer_0/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920786: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.166 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.133, f32[1024]{0} %get-tuple-element.132), kind=kInput, calls=%fused_computation.166, metadata={op_type="Mean" op_name="bert/encoder/layer_0/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920791: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.249 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.136, f32[768]{0} %constant_102, f32[1024,768]{1,0} %get-tuple-element.137, f32[768]{0} %constant_93, f32[1024]{0} %fusion.171), kind=kLoop, calls=%fused_computation.249, metadata={op_type="Cast" op_name="bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920796: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.243 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.124, f32[768]{0} %constant_275, f32[1024,768]{1,0} %get-tuple-element.125, f32[768]{0} %constant_266, f32[1024]{0} %fusion.156), kind=kLoop, calls=%fused_computation.243, metadata={op_type="Cast" op_name="bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920802: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 45Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.217 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.72, f32[768]{0} %constant_1049, f32[1024,768]{1,0} %get-tuple-element.73, f32[768]{0} %constant_1040, f32[1024]{0} %fusion.91), kind=kLoop, calls=%fused_computation.217, metadata={op_type="Cast" op_name="bert/encoder/layer_5/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920810: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.219 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.76, f32[768]{0} %constant_967, f32[1024,768]{1,0} %get-tuple-element.77, f32[768]{0} %constant_958, f32[1024]{0} %fusion.96), kind=kLoop, calls=%fused_computation.219, metadata={op_type="Cast" op_name="bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920815: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.211 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.60, f32[768]{0} %constant_1222, f32[1024,768]{1,0} %get-tuple-element.61, f32[768]{0} %constant_1213, f32[1024]{0} %fusion.76), kind=kLoop, calls=%fused_computation.211, metadata={op_type="Cast" op_name="bert/encoder/layer_6/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920821: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.223 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.84, f32[768]{0} %constant_876, f32[1024,768]{1,0} %get-tuple-element.85, f32[768]{0} %constant_867, f32[1024]{0} %fusion.106), kind=kLoop, calls=%fused_computation.223, metadata={op_type="Cast" op_name="bert/encoder/layer_4/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920826: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.225 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.88, f32[768]{0} %constant_794, f32[1024,768]{1,0} %get-tuple-element.89, f32[768]{0} %constant_785, f32[1024]{0} %fusion.111), kind=kLoop, calls=%fused_computation.225, metadata={op_type="Cast" op_name="bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920831: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.1 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.1, f32[1024]{0} %get-tuple-element), kind=kInput, calls=%fused_computation.1, metadata={op_type="Mean" op_name="bert/encoder/layer_11/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920836: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.235 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.108, f32[768]{0} %constant_530, f32[1024,768]{1,0} %get-tuple-element.109, f32[768]{0} %constant_521, f32[1024]{0} %fusion.136), kind=kLoop, calls=%fused_computation.235, metadata={op_type="Cast" op_name="bert/encoder/layer_2/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920842: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.6 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.5, f32[1024]{0} %get-tuple-element.4), kind=kInput, calls=%fused_computation.6, metadata={op_type="Mean" op_name="bert/encoder/layer_11/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920850: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.16 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.13, f32[1024]{0} %get-tuple-element.12), kind=kInput, calls=%fused_computation.16, metadata={op_type="Mean" op_name="bert/encoder/layer_10/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920856: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.207 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.52, f32[768]{0} %constant_1313, f32[1024,768]{1,0} %get-tuple-element.53, f32[768]{0} %constant_1304, f32[1024]{0} %fusion.66), kind=kLoop, calls=%fused_computation.207, metadata={op_type="Cast" op_name="bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920861: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 46Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.36 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.29, f32[1024]{0} %get-tuple-element.28), kind=kInput, calls=%fused_computation.36, metadata={op_type="Mean" op_name="bert/encoder/layer_9/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920866: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.46 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.37, f32[1024]{0} %get-tuple-element.36), kind=kInput, calls=%fused_computation.46, metadata={op_type="Mean" op_name="bert/encoder/layer_8/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920871: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.205 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.48, f32[768]{0} %constant_1395, f32[1024,768]{1,0} %get-tuple-element.49, f32[768]{0} %constant_1386, f32[1024]{0} %fusion.61), kind=kLoop, calls=%fused_computation.205, metadata={op_type="Cast" op_name="bert/encoder/layer_7/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920876: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.61 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.49, f32[1024]{0} %get-tuple-element.48), kind=kInput, calls=%fused_computation.61, metadata={op_type="Mean" op_name="bert/encoder/layer_7/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920882: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.66 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.53, f32[1024]{0} %get-tuple-element.52), kind=kInput, calls=%fused_computation.66, metadata={op_type="Mean" op_name="bert/encoder/layer_7/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920887: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.76 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.61, f32[1024]{0} %get-tuple-element.60), kind=kInput, calls=%fused_computation.76, metadata={op_type="Mean" op_name="bert/encoder/layer_6/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920895: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.241 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.120, f32[768]{0} %constant_357, f32[1024,768]{1,0} %get-tuple-element.121, f32[768]{0} %constant_348, f32[1024]{0} %fusion.151), kind=kLoop, calls=%fused_computation.241, metadata={op_type="Cast" op_name="bert/encoder/layer_1/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920901: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         5.1 optimal) ::      384.25GFLOP/s ::       83.33MTROP/s ::    597.13GiB/s ::       454B/cycle :: %fusion.213 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.64, f32[768]{0} %constant_1140, f32[1024,768]{1,0} %get-tuple-element.65, f32[768]{0} %constant_1131, f32[1024]{0} %fusion.81), kind=kLoop, calls=%fused_computation.213, metadata={op_type="Cast" op_name="bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920907: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.91 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.73, f32[1024]{0} %get-tuple-element.72), kind=kInput, calls=%fused_computation.91, metadata={op_type="Mean" op_name="bert/encoder/layer_5/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920912: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.96 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.77, f32[1024]{0} %get-tuple-element.76), kind=kInput, calls=%fused_computation.96, metadata={op_type="Mean" op_name="bert/encoder/layer_5/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920918: I tensorflow/compiler/xla/service/executable.cc:221]           17326 cycles ( 0.10% 47Σ) ::         12.3 usec (         2.0 optimal) ::      192.00GFLOP/s ::                    ::    239.04GiB/s ::       182B/cycle :: %fusion.111 = f32[1024]{0} fusion(f32[1024,768]{1,0} %get-tuple-element.89, f32[1024]{0} %get-tuple-element.88), kind=kInput, calls=%fused_computation.111, metadata={op_type="Mean" op_name="bert/encoder/layer_4/attention/output/LayerNorm/moments/variance"}
2021-09-14 16:33:50.920923: I tensorflow/compiler/xla/service/executable.cc:221]           15882 cycles ( 0.09% 48Σ) ::         11.3 usec (         2.0 optimal) ::                    ::                    ::    260.10GiB/s ::       198B/cycle :: %fusion.70 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.83), kind=kLoop, calls=%fused_computation.70, metadata={op_type="Transpose" op_name="bert/encoder/layer_7/attention/self/transpose_2"}
2021-09-14 16:33:50.920928: I tensorflow/compiler/xla/service/executable.cc:221]           15882 cycles ( 0.09% 48Σ) ::         11.3 usec (         2.0 optimal) ::                    ::                    ::    260.10GiB/s ::       198B/cycle :: %fusion.178 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.3), kind=kLoop, calls=%fused_computation.178, metadata={op_type="Transpose" op_name="bert/encoder/layer_0/attention/self/transpose_1"}
2021-09-14 16:33:50.920936: I tensorflow/compiler/xla/service/executable.cc:221]           15882 cycles ( 0.09% 48Σ) ::         11.3 usec (         2.0 optimal) ::                    ::                    ::    260.10GiB/s ::       198B/cycle :: %fusion.149 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.23), kind=kLoop, calls=%fused_computation.149, metadata={op_type="Transpose" op_name="bert/encoder/layer_2/attention/self/transpose"}
2021-09-14 16:33:50.920942: I tensorflow/compiler/xla/service/executable.cc:221]           15882 cycles ( 0.09% 48Σ) ::         11.3 usec (         4.1 optimal) ::      139.64GFLOP/s ::                    ::    522.23GiB/s ::       397B/cycle :: %fusion.131 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.103, f16[8,12,128]{2,1,0} %get-tuple-element.102), kind=kLoop, calls=%fused_computation.131, metadata={op_type="Softmax" op_name="bert/encoder/layer_3/attention/self/Softmax"}
2021-09-14 16:33:50.920947: I tensorflow/compiler/xla/service/executable.cc:221]           15882 cycles ( 0.09% 48Σ) ::         11.3 usec (         1.0 optimal) ::                    ::                    ::    130.17GiB/s ::        99B/cycle :: %broadcast.1929 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1928), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_11/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.920952: I tensorflow/compiler/xla/service/executable.cc:221]           15882 cycles ( 0.09% 48Σ) ::         11.3 usec (         5.1 optimal) ::      419.19GFLOP/s ::       90.91MTROP/s ::    651.43GiB/s ::       496B/cycle :: %fusion.193 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.24, f32[768]{0} %constant_1741, f32[1024,768]{1,0} %get-tuple-element.25, f32[768]{0} %constant_1732, f32[1024]{0} %fusion.31), kind=kLoop, calls=%fused_computation.193, metadata={op_type="Cast" op_name="bert/encoder/layer_9/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920957: I tensorflow/compiler/xla/service/executable.cc:221]           15882 cycles ( 0.09% 48Σ) ::         11.3 usec (         1.0 optimal) ::                    ::                    ::    130.17GiB/s ::        99B/cycle :: %broadcast.1945 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1944), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_11/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.920963: I tensorflow/compiler/xla/service/executable.cc:221]           15882 cycles ( 0.09% 48Σ) ::         11.3 usec (         5.1 optimal) ::      419.19GFLOP/s ::       90.91MTROP/s ::    651.43GiB/s ::       496B/cycle :: %fusion.195 = (f16[1024,768]{1,0}, f32[1024,768]{1,0}) fusion(f32[1024]{0} %get-tuple-element.28, f32[768]{0} %constant_1659, f32[1024,768]{1,0} %get-tuple-element.29, f32[768]{0} %constant_1650, f32[1024]{0} %fusion.36), kind=kLoop, calls=%fused_computation.195, metadata={op_type="Cast" op_name="bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/add_1-0-CastToFp16-AutoMixedPrecision"}
2021-09-14 16:33:50.920968: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 48Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.100 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.61), kind=kLoop, calls=%fused_computation.100, metadata={op_type="Transpose" op_name="bert/encoder/layer_5/attention/self/transpose_2"}
2021-09-14 16:33:50.920973: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 48Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.884 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_883), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_5/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.920978: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 48Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.891 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_890), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_5/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.920987: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.71 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.55, f16[8,12,128]{2,1,0} %get-tuple-element.54), kind=kLoop, calls=%fused_computation.71, metadata={op_type="Softmax" op_name="bert/encoder/layer_7/attention/self/Softmax"}
2021-09-14 16:33:50.920993: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.907 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_906), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_5/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.920998: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.73 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.80), kind=kLoop, calls=%fused_computation.73, metadata={op_type="Transpose" op_name="bert/encoder/layer_7/attention/self/transpose_1"}
2021-09-14 16:33:50.921003: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.372 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_371), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_2/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921008: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.85 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.72), kind=kLoop, calls=%fused_computation.85, metadata={op_type="Transpose" op_name="bert/encoder/layer_6/attention/self/transpose_2"}
2021-09-14 16:33:50.921013: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.86 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.67, f16[8,12,128]{2,1,0} %get-tuple-element.66), kind=kLoop, calls=%fused_computation.86, metadata={op_type="Softmax" op_name="bert/encoder/layer_6/attention/self/Softmax"}
2021-09-14 16:33:50.921018: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1057 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1056), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_6/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921023: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.88 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.69), kind=kLoop, calls=%fused_computation.88, metadata={op_type="Transpose" op_name="bert/encoder/layer_6/attention/self/transpose_1"}
2021-09-14 16:33:50.921028: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.89 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.67), kind=kLoop, calls=%fused_computation.89, metadata={op_type="Transpose" op_name="bert/encoder/layer_6/attention/self/transpose"}
2021-09-14 16:33:50.921037: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1080 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1079), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_6/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921042: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.59 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.89), kind=kLoop, calls=%fused_computation.59, metadata={op_type="Transpose" op_name="bert/encoder/layer_8/attention/self/transpose"}
2021-09-14 16:33:50.921047: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 49Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.101 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.79, f16[8,12,128]{2,1,0} %get-tuple-element.78), kind=kLoop, calls=%fused_computation.101, metadata={op_type="Softmax" op_name="bert/encoder/layer_5/attention/self/Softmax"}
2021-09-14 16:33:50.921052: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.103 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.58), kind=kLoop, calls=%fused_computation.103, metadata={op_type="Transpose" op_name="bert/encoder/layer_5/attention/self/transpose_1"}
2021-09-14 16:33:50.921057: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.104 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.56), kind=kLoop, calls=%fused_computation.104, metadata={op_type="Transpose" op_name="bert/encoder/layer_5/attention/self/transpose"}
2021-09-14 16:33:50.921063: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.116 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.91, f16[8,12,128]{2,1,0} %get-tuple-element.90), kind=kLoop, calls=%fused_computation.116, metadata={op_type="Softmax" op_name="bert/encoder/layer_4/attention/self/Softmax"}
2021-09-14 16:33:50.921071: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.118 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.47), kind=kLoop, calls=%fused_computation.118, metadata={op_type="Transpose" op_name="bert/encoder/layer_4/attention/self/transpose_1"}
2021-09-14 16:33:50.921077: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.119 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.45), kind=kLoop, calls=%fused_computation.119, metadata={op_type="Transpose" op_name="bert/encoder/layer_4/attention/self/transpose"}
2021-09-14 16:33:50.921085: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1410 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1409), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_8/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921090: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.130 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.39), kind=kLoop, calls=%fused_computation.130, metadata={op_type="Transpose" op_name="bert/encoder/layer_3/attention/self/transpose_2"}
2021-09-14 16:33:50.921096: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.215 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_214), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_1/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921101: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1426 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1425), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_8/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921106: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.561 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_560), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_3/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921111: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.26 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.19, f16[8,12,128]{2,1,0} %get-tuple-element.18), kind=kLoop, calls=%fused_computation.26, metadata={op_type="Softmax" op_name="bert/encoder/layer_10/attention/self/Softmax"}
2021-09-14 16:33:50.921116: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 50Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.25 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.116), kind=kLoop, calls=%fused_computation.25, metadata={op_type="Transpose" op_name="bert/encoder/layer_10/attention/self/transpose_2"}
2021-09-14 16:33:50.921121: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.24 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.117), kind=kLoop, calls=%fused_computation.24, metadata={op_type="Reshape" op_name="bert/encoder/layer_10/attention/self/Reshape_3"}
2021-09-14 16:33:50.921127: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.538 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_537), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_3/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921135: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.28 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.113), kind=kLoop, calls=%fused_computation.28, metadata={op_type="Transpose" op_name="bert/encoder/layer_10/attention/self/transpose_1"}
2021-09-14 16:33:50.921140: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.545 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_544), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_3/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921145: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.14 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.122), kind=kLoop, calls=%fused_computation.14, metadata={op_type="Transpose" op_name="bert/encoder/layer_11/attention/self/transpose"}
2021-09-14 16:33:50.921150: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.13 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.124), kind=kLoop, calls=%fused_computation.13, metadata={op_type="Transpose" op_name="bert/encoder/layer_11/attention/self/transpose_1"}
2021-09-14 16:33:50.921155: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.11 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.7, f16[8,12,128]{2,1,0} %get-tuple-element.6), kind=kLoop, calls=%fused_computation.11, metadata={op_type="Softmax" op_name="bert/encoder/layer_11/attention/self/Softmax"}
2021-09-14 16:33:50.921160: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.10 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.127), kind=kLoop, calls=%fused_computation.10, metadata={op_type="Transpose" op_name="bert/encoder/layer_11/attention/self/transpose_2"}
2021-09-14 16:33:50.921165: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.29 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.111), kind=kLoop, calls=%fused_computation.29, metadata={op_type="Transpose" op_name="bert/encoder/layer_10/attention/self/transpose"}
2021-09-14 16:33:50.921171: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.388 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_387), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_2/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921176: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.39 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.106), kind=kLoop, calls=%fused_computation.39, metadata={op_type="Reshape" op_name="bert/encoder/layer_9/attention/self/Reshape_3"}
2021-09-14 16:33:50.921184: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 51Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.40 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.105), kind=kLoop, calls=%fused_computation.40, metadata={op_type="Transpose" op_name="bert/encoder/layer_9/attention/self/transpose_2"}
2021-09-14 16:33:50.921189: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.41 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.31, f16[8,12,128]{2,1,0} %get-tuple-element.30), kind=kLoop, calls=%fused_computation.41, metadata={op_type="Softmax" op_name="bert/encoder/layer_9/attention/self/Softmax"}
2021-09-14 16:33:50.921194: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.43 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.102), kind=kLoop, calls=%fused_computation.43, metadata={op_type="Transpose" op_name="bert/encoder/layer_9/attention/self/transpose_1"}
2021-09-14 16:33:50.921199: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.44 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.100), kind=kLoop, calls=%fused_computation.44, metadata={op_type="Transpose" op_name="bert/encoder/layer_9/attention/self/transpose"}
2021-09-14 16:33:50.921204: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.718 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_717), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_4/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921209: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.734 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_733), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_4/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921216: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.55 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.94), kind=kLoop, calls=%fused_computation.55, metadata={op_type="Transpose" op_name="bert/encoder/layer_8/attention/self/transpose_2"}
2021-09-14 16:33:50.921221: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.56 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.43, f16[8,12,128]{2,1,0} %get-tuple-element.42), kind=kLoop, calls=%fused_computation.56, metadata={op_type="Softmax" op_name="bert/encoder/layer_8/attention/self/Softmax"}
2021-09-14 16:33:50.921226: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.58 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.91), kind=kLoop, calls=%fused_computation.58, metadata={op_type="Transpose" op_name="bert/encoder/layer_8/attention/self/transpose_1"}
2021-09-14 16:33:50.921234: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1922 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1921), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_11/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921239: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.163 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.14), kind=kLoop, calls=%fused_computation.163, metadata={op_type="Transpose" op_name="bert/encoder/layer_1/attention/self/transpose_1"}
2021-09-14 16:33:50.921244: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.146 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.115, f16[8,12,128]{2,1,0} %get-tuple-element.114), kind=kLoop, calls=%fused_computation.146, metadata={op_type="Softmax" op_name="bert/encoder/layer_2/attention/self/Softmax"}
2021-09-14 16:33:50.921250: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 52Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.148 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.25), kind=kLoop, calls=%fused_computation.148, metadata={op_type="Transpose" op_name="bert/encoder/layer_2/attention/self/transpose_1"}
2021-09-14 16:33:50.921255: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1583 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1582), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_9/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921260: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1599 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1598), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_9/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921265: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.192 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_191), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_1/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921270: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.159 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.18), kind=kLoop, calls=%fused_computation.159, metadata={op_type="Reshape" op_name="bert/encoder/layer_1/attention/self/Reshape_3"}
2021-09-14 16:33:50.921275: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.160 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.17), kind=kLoop, calls=%fused_computation.160, metadata={op_type="Transpose" op_name="bert/encoder/layer_1/attention/self/transpose_2"}
2021-09-14 16:33:50.921284: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.161 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.127, f16[8,12,128]{2,1,0} %get-tuple-element.126), kind=kLoop, calls=%fused_computation.161, metadata={op_type="Softmax" op_name="bert/encoder/layer_1/attention/self/Softmax"}
2021-09-14 16:33:50.921289: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.134 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.34), kind=kLoop, calls=%fused_computation.134, metadata={op_type="Transpose" op_name="bert/encoder/layer_3/attention/self/transpose"}
2021-09-14 16:33:50.921295: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.164 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.12), kind=kLoop, calls=%fused_computation.164, metadata={op_type="Transpose" op_name="bert/encoder/layer_1/attention/self/transpose"}
2021-09-14 16:33:50.921300: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1749 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1748), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_10/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921306: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1756 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1755), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_10/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921312: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.1772 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1771), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_10/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921321: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 53Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.174 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.7), kind=kLoop, calls=%fused_computation.174, metadata={op_type="Reshape" op_name="bert/encoder/layer_0/attention/self/Reshape_3"}
2021-09-14 16:33:50.921330: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 54Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.175 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.6), kind=kLoop, calls=%fused_computation.175, metadata={op_type="Transpose" op_name="bert/encoder/layer_0/attention/self/transpose_2"}
2021-09-14 16:33:50.921338: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 54Σ) ::         10.2 usec (         4.1 optimal) ::      153.62GFLOP/s ::                    ::    574.54GiB/s ::       437B/cycle :: %fusion.176 = f16[8,12,128,128]{3,2,1,0} fusion(f16[8,12,128,128]{3,2,1,0} %get-tuple-element.139, f16[8,12,128]{2,1,0} %get-tuple-element.138), kind=kLoop, calls=%fused_computation.176, metadata={op_type="Softmax" op_name="bert/encoder/layer_0/attention/self/Softmax"}
2021-09-14 16:33:50.921350: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 54Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.144 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.29), kind=kLoop, calls=%fused_computation.144, metadata={op_type="Reshape" op_name="bert/encoder/layer_2/attention/self/Reshape_3"}
2021-09-14 16:33:50.921360: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 54Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.133 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.36), kind=kLoop, calls=%fused_computation.133, metadata={op_type="Transpose" op_name="bert/encoder/layer_3/attention/self/transpose_1"}
2021-09-14 16:33:50.921369: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 54Σ) ::         10.2 usec (         2.0 optimal) ::                    ::                    ::    286.15GiB/s ::       217B/cycle :: %fusion.145 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.28), kind=kLoop, calls=%fused_computation.145, metadata={op_type="Transpose" op_name="bert/encoder/layer_2/attention/self/transpose_2"}
2021-09-14 16:33:50.921378: I tensorflow/compiler/xla/service/executable.cc:221]           14436 cycles ( 0.08% 54Σ) ::         10.2 usec (         1.0 optimal) ::                    ::                    ::    143.21GiB/s ::       109B/cycle :: %broadcast.199 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_198), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_1/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921386: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 54Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.711 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_710), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_4/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921392: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 54Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.69 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.84), kind=kLoop, calls=%fused_computation.69, metadata={op_type="Reshape" op_name="bert/encoder/layer_7/attention/self/Reshape_3"}
2021-09-14 16:33:50.921397: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 54Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.26 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_25), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_0/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921402: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 54Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.9 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.128), kind=kLoop, calls=%fused_computation.9, metadata={op_type="Reshape" op_name="bert/encoder/layer_11/attention/self/Reshape_3"}
2021-09-14 16:33:50.921407: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 54Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.1403 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1402), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_8/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921415: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 54Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.74 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.78), kind=kLoop, calls=%fused_computation.74, metadata={op_type="Transpose" op_name="bert/encoder/layer_7/attention/self/transpose"}
2021-09-14 16:33:50.921420: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 54Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.129 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.40), kind=kLoop, calls=%fused_computation.129, metadata={op_type="Reshape" op_name="bert/encoder/layer_3/attention/self/Reshape_3"}
2021-09-14 16:33:50.921425: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.84 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.73), kind=kLoop, calls=%fused_computation.84, metadata={op_type="Reshape" op_name="bert/encoder/layer_6/attention/self/Reshape_3"}
2021-09-14 16:33:50.921430: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.19 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_18), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_0/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921436: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.115 = f16[8,12,128,64]{3,2,1,0} fusion(f16[1024,768]{1,0} %custom-call.50), kind=kLoop, calls=%fused_computation.115, metadata={op_type="Transpose" op_name="bert/encoder/layer_4/attention/self/transpose_2"}
2021-09-14 16:33:50.921441: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.1064 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1063), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_6/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921446: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.114 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.51), kind=kLoop, calls=%fused_computation.114, metadata={op_type="Reshape" op_name="bert/encoder/layer_4/attention/self/Reshape_3"}
2021-09-14 16:33:50.921451: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.54 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.95), kind=kLoop, calls=%fused_computation.54, metadata={op_type="Reshape" op_name="bert/encoder/layer_8/attention/self/Reshape_3"}
2021-09-14 16:33:50.921456: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.365 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_364), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_2/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921464: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         2.0 optimal) ::                    ::                    ::    317.90GiB/s ::       242B/cycle :: %fusion.99 = f16[1024,768]{1,0} fusion(f16[8,12,128,64]{3,2,1,0} %custom-call.62), kind=kLoop, calls=%fused_computation.99, metadata={op_type="Reshape" op_name="bert/encoder/layer_5/attention/self/Reshape_3"}
2021-09-14 16:33:50.921469: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.1253 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1252), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_7/attention/self/value/BiasAdd"}
2021-09-14 16:33:50.921475: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.1576 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1575), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_9/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921480: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.1237 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1236), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_7/attention/self/query/BiasAdd"}
2021-09-14 16:33:50.921485: I tensorflow/compiler/xla/service/executable.cc:221]           12994 cycles ( 0.08% 55Σ) ::          9.2 usec (         1.0 optimal) ::                    ::                    ::    159.11GiB/s ::       121B/cycle :: %broadcast.1230 = f16[1024,768]{1,0} broadcast(f16[768]{0} %constant_1229), dimensions={1}, metadata={op_type="BiasAdd" op_name="bert/encoder/layer_7/attention/self/key/BiasAdd"}
2021-09-14 16:33:50.921490: I tensorflow/compiler/xla/service/executable.cc:221]           10106 cycles ( 0.06% 55Σ) ::          7.2 usec (         0.0 optimal) ::      139.52kFLOP/s ::                    ::      1.60MiB/s ::     0.001B/cycle :: %multiply.76 = f32[] multiply(f32[] %fusion, f32[] %constant_153), metadata={op_type="Mean" op_name="Mean"}
2021-09-14 16:33:50.921495: I tensorflow/compiler/xla/service/executable.cc:221] 

Describe the expected behavior
The cumulative runtime percentage (the "Σ" column) should sum to 100%.
Code to reproduce the issue
Apologies, I cannot provide the model I used for inference. I set the environment variables below to enable XLA/HLO profiling and AMP:

export TF_ENABLE_AUTO_MIXED_PRECISION=1
export TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit --tf_xla_clustering_debug"
export XLA_FLAGS="--xla_hlo_profile"
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Set the environment variables mentioned above, choose an arbitrary model (e.g., BERT), and run inference.
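The mismatch can be checked directly from the dump: below is a minimal sketch (assuming the `--xla_hlo_profile` line format shown above; `parse_profile_line` is a hypothetical helper, not part of TensorFlow) that extracts the per-op and cumulative percentages from one log line:

```python
import re

# Matches the cycle count, per-op percentage, and cumulative "Σ" column of an
# XLA --xla_hlo_profile line, e.g. "14436 cycles ( 0.08% 50Σ) :: 10.2 usec ..."
PROFILE_RE = re.compile(r"(\d+)\s+cycles\s+\(\s*([\d.]+)%\s+(\d+)Σ\)")

def parse_profile_line(line):
    """Return (cycles, percent, cumulative_percent), or None if the line
    does not look like a per-op profile entry."""
    m = PROFILE_RE.search(line)
    if not m:
        return None
    return int(m.group(1)), float(m.group(2)), int(m.group(3))

sample = ("14436 cycles ( 0.08% 50Σ) :: 10.2 usec ( 4.1 optimal) :: "
          "153.62GFLOP/s :: :: 574.54GiB/s :: 437B/cycle :: %fusion.26 = ...")
cycles, pct, cum = parse_profile_line(sample)
```

Tallying the per-op percentages over the full dump and comparing against the final "Σ" value shows where the cumulative total plateaus short of 100%.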
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

tensorflow-gpu pip wheels for Windows platform

For anyone who is trying to use TensorFlow on Windows with recent NVIDIA hardware and the latest CUDA libraries installed, the only option seems to be compiling tensorflow-gpu manually. The reason is that TensorFlow always lags behind CUDA, sometimes by more than one version.

There were numerous requests on TensorFlow issue tracker to provide tensorflow-gpu builds with the latest CUDA libraries but all of them were ultimately rejected.

While it is probably possible to have more than one CUDA toolkit version installed to work around this issue, it just needlessly complicates things for developers.

It would be nice if NVIDIA stepped up and provided pip wheels for the Windows platform built against the latest CUDA version for both TensorFlow 1.15 and 2.x.

Building r1.15.5+nv21.10 from source and getting `undefined symbol: omp_get_max_threads` at import

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: r1.15.5+nv21.10
  • Python version: 3.8
  • Installed using virtualenv? pip? conda?:
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): gcc-9
  • CUDA/cuDNN version: 11.4.2/8
  • GPU model and memory: RTX2080Ti, 12GB

Describe the problem

When importing TensorFlow, it fails with:

ImportError: /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so: undefined symbol: omp_get_max_threads

Provide the exact sequence of commands / steps that you executed before running into the problem

I used the Dockerfile below, where I explicitly passed `-fopenmp`.

FROM nvidia/cuda:11.4.2-cudnn8-devel-ubuntu20.04
ENV TZ=Europe/London
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ >/etc/timezone && apt-get update && apt-get -y upgrade && apt-get install -y build-essential git git-lfs wget vim software-properties-common unzip python3-pip libomp5 && update-alternatives --install /usr/bin/python python $(which python3) 10 && pip install --upgrade astor
WORKDIR /workdir
RUN BAZEL=bazel-0.26.1-installer-linux-x86_64.sh && wget https://github.com/bazelbuild/bazel/releases/download/0.26.1/${BAZEL} && chmod +x ${BAZEL} && ./${BAZEL} && git clone https://github.com/NVIDIA/cudnn-frontend.git && git clone --branch r1.15.5+nv21.10 --single-branch https://github.com/NVIDIA/tensorflow.git
WORKDIR /workdir/tensorflow
ENV TF_ENABLE_XLA=1 \
    TF_NEED_OPENCL_SYCL=0 \
    TF_NEED_ROCM=0 \
    TF_NEED_CUDA=1 \
    TF_NEED_TENSORRT=0 \
    TF_CUDA_VERSION=11 \
    TF_CUBLAS_VERSION=11 \
    TF_NCCL_VERSION=2 \
    TF_CUDNN_VERSION=8 \
    TF_CUDA_PATHS="/usr/include,/usr/lib/x86_64-linux-gnu,/usr/local/cuda/include,/usr/local/cuda/lib64,/usr/local/cuda/bin,/usr/local/cuda" \
    TF_CUDA_COMPUTE_CAPABILITIES=3.5,5.0,5.2,6.1,7.0,7.5,8.6 \
    CC_OPT_FLAGS="-march=sandybridge -mfma -mfpmath=both -fopenmp"
RUN PYTHON_BIN_PATH=$(which python) ./configure && bazel build --config=opt --config=noaws --config=nogcp --config=nohdfs --config=noignite --config=nokafka //tensorflow/tools/pip_package:build_pip_package
RUN bazel-bin/tensorflow/tools/pip_package/build_pip_package tensorflow_pkg && pip install tensorflow_pkg/tensorflow-1.15.5+nv-cp38-cp38-linux_x86_64.whl
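A likely cause is that `-fopenmp` in `CC_OPT_FLAGS` affects compilation but is not propagated to the final link of the extension module, leaving `omp_get_max_threads` unresolved. As a diagnostic workaround (a sketch only, not the proper fix; `preload_openmp` is a hypothetical helper), the OpenMP runtime can be loaded with globally visible symbols before importing TensorFlow, similar to running under `LD_PRELOAD=libgomp.so.1`:

```python
import ctypes
import ctypes.util

def preload_openmp():
    """Load an OpenMP runtime with RTLD_GLOBAL so that extension modules
    loaded afterwards can resolve omp_* symbols; returns the library name
    that was loaded, or None if no runtime was found."""
    for name in ("gomp", "omp", "iomp5"):  # GNU, LLVM, and Intel runtimes
        path = ctypes.util.find_library(name)
        if path:
            # RTLD_GLOBAL makes the runtime's symbols visible to later
            # dlopen() calls, emulating LD_PRELOAD from within Python.
            ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
            return path
    return None

# Call before `import tensorflow`.
preload_openmp()
```

The more durable fix is probably to link `_pywrap_tensorflow_internal.so` against the OpenMP runtime itself (e.g. adding `-lgomp` to the link flags), since compile-time flags alone do not record the library dependency in the shared object.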

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

# python -c "import tensorflow"
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.8/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.8/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so: undefined symbol: omp_get_max_threads

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py", line 101, in <module>
    from tensorflow_core import *
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/__init__.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py", line 50, in __getattr__
    module = self._load()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.8/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.8/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so: undefined symbol: omp_get_max_threads


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
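The undefined `omp_get_max_threads` symbol usually means the extension was compiled with `-fopenmp` (as in `CC_OPT_FLAGS` above) but the final shared object was not linked against GNU's OpenMP runtime, libgomp. A possible workaround sketch (untested; the library path is the usual Ubuntu location and may differ elsewhere):

```shell
# Workaround sketch: preload libgomp so the dynamic loader can resolve
# omp_get_max_threads at import time.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libgomp.so.1 python -c "import tensorflow"

# Or rebuild with libgomp linked in explicitly:
# bazel build --config=opt --linkopt=-lgomp \
#     //tensorflow/tools/pip_package:build_pip_package
```

Dropping `-fopenmp` from `CC_OPT_FLAGS` before rebuilding may also avoid the dependency entirely.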

Prebuilt docker images compute capability question

Sorry, I am not sure if this is the correct place to ask this question. I am using an image from the nvcr.io repository and noticed that I was not getting GPU support on the K80 I was using: it appears that the latest TensorFlow image, 22.01-tf1-py3, has a minimum compute capability of 5.2, and I require 3.7.

My question is: is there any way to look at the history of when this changed? Is there a separate repository for those Docker builds? I see that the nvbuild.sh script seems to use the value "all" for the compute capability, which ends up setting the minimum compute capability to 5.2, but there seems to be no straightforward way to track the history or see when that changed (assuming it has changed at any point); it's not in any documentation that I have found.
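For reference, the compute capabilities a from-source build supports are controlled by `TF_CUDA_COMPUTE_CAPABILITIES` at configure time (as in the Dockerfiles elsewhere on this page). A build-time sketch that adds K80 support locally (this does not reflect the NGC images' actual settings, whose history is not tracked here):

```shell
# Configuration sketch: include compute capability 3.7 (Tesla K80)
# alongside the capabilities the images otherwise target.
export TF_CUDA_COMPUTE_CAPABILITIES=3.7,5.2,6.1,7.0,7.5,8.6
```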

releases or tags please?

Howdy,

I am compiling nvidia-tensorflow from source for benchmarking purposes (yes, I know binaries are available) and I am worried because this repository doesn't seem to have many releases or tags. You currently keep all of your critical version information in the names of the branches. Thus, I can't guarantee that whatever I do now will work for someone else in the future if y'all decide to append to or remove your branches. Would you be willing to mark your critical version information with releases or tags so my colleagues can feel safe that they will still be there in the future? For example, since branch r1.15.4-nv20.12 hasn't changed in a while, you could put a tag like v1.15.4-nv20.12-stable on the end of it.

Have a good day,
Richard
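Tags are indeed more durable than branches. The throwaway-repo sketch below (all names illustrative, mirroring the ones suggested above) shows that a tag keeps resolving even after the branch it marked is deleted:

```shell
# Demonstrate on a local scratch repo that a tag pins a commit permanently,
# independent of the branch that originally pointed at it.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "snapshot of r1.15.4-nv20.12"
git branch r1.15.4-nv20.12
git tag v1.15.4-nv20.12-stable r1.15.4-nv20.12  # pin the branch tip
git branch -D r1.15.4-nv20.12 >/dev/null        # the branch goes away...
git rev-parse --verify -q v1.15.4-nv20.12-stable^{commit} >/dev/null \
    && echo "tag still resolves"                # ...but the tag survives
```

On the consumer side, `git checkout <commit-sha>` of a branch tip gives the same reproducibility guarantee today, without waiting for upstream tags.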

Simple patch on tensorflow v1.15.* to support cuda11.0 only

Hi, I want to run tensorflow v1.15.0 on an NVIDIA A100; however, the original tensorflow v1.15.0 only supports cuda10.0, while the A100 needs cuda11.0. Fortunately I found that NVIDIA/tensorflow supports both tensorflow v1.15.2 and cuda11.0, thank you all.

However, NVIDIA/tensorflow branch r1.15.2+nv20.06 contains not only cuda11.0 support but also many of NVIDIA's TensorFlow features that have no relationship to cuda11.0. Could NVIDIA provide a simple patch on v1.15.* that adds cuda11.0 support only, without NVIDIA's other TensorFlow features? For me, one based on v1.15.0 would be best, thank you all.

In the predict/train_on_batch operation of all packages in the CuDNNLSTM series, all loss values are nan after some training.

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information
RTX2070 / RTX3090
ubuntu16.04 / ubuntu18.04 / ubuntu20.04

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below): all nvidia-tensorflow versions (using nvidia-docker)
  • Python version: 3.6.5 / 3.7.5 / 3.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.3
  • GPU model and memory: rtx2070 8g, rtx3090 24g

You can collect some of this information using our environment capture
script
You can also obtain the TensorFlow version with: 1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)" 2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
In the predict/train_on_batch operation of all packages in the RNN/LSTM series, all loss values are nan after some training.

I thought this was a bug specific to the RTX 30 series. However, it is not, and I think it is a bug in CuDNNLSTM of nvidia-tensorflow.

In CPU mode a normal loss value is always produced; this error only occurs in GPU mode, so it is believed to be a bug in the nvidia-tensorflow source.
CuDNNLSTM results were output normally with tensorflow-gpu==1.15.2 on ubuntu18.04 with an RTX 2070.

Describe the expected behavior
Currently, when using LSTM-related packages, I can only train on the CPU.
I hope someone can tell me how to fix this; I also want to test with CuDNNGRU.
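Independent of where the NaNs originate, a cheap guard in the training loop catches the first non-finite loss immediately instead of hundreds of steps later. A minimal stdlib sketch (the function name is illustrative, not a TensorFlow API):

```python
import math

def assert_finite_loss(step, loss):
    """Fail fast on the first NaN/Inf loss instead of training through it."""
    if not math.isfinite(loss):
        raise RuntimeError(f"non-finite loss {loss!r} at step {step}")
    return loss

# Normal values pass through unchanged; the first NaN raises immediately.
assert_finite_loss(1664, 3.7866)
try:
    assert_finite_loss(1665, float("nan"))
except RuntimeError as err:
    print(err)
```

Logging the step at which the first NaN appears also helps narrow down whether it coincides with a particular batch or learning-rate change.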

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

r1.15.5+nv21.10 cannot be built with --config=mkl

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: r1.15.5+nv21.10
  • Python version: 3.8
  • Installed using virtualenv? pip? conda?:
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): gcc-9
  • CUDA/cuDNN version: 11.4.2/8
  • GPU model and memory: RTX2080Ti, 12GB

Describe the problem
Error message:

./tensorflow/core/kernels/quantization_utils.h:725:43:   required from here
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h:90:54: error: static assertion failed: Default executor instantiated with non-default device. You must #define EIGEN_USE_THREADS, EIGEN_USE_GPU or EIGEN_USE_SYCL before including Eigen headers.
   90 |   static_assert(std::is_same<Device, DefaultDevice>::value,
      |                                                      ^~~~~
Target //tensorflow/tools/pip_package:build_pip_package failed to build

Provide the exact sequence of commands / steps that you executed before running into the problem

Use the below Dockerfile

FROM nvidia/cuda:11.4.2-cudnn8-devel-ubuntu20.04
ENV TZ=Europe/London
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ >/etc/timezone && apt-get update && apt-get -y upgrade && apt-get install -y build-essential git git-lfs wget vim software-properties-common unzip python3-pip && update-alternatives --install /usr/bin/python python $(which python3) 10 && pip install --upgrade numpy astor
WORKDIR /workdir
RUN BAZEL=bazel-0.26.1-installer-linux-x86_64.sh && wget https://github.com/bazelbuild/bazel/releases/download/0.26.1/${BAZEL} && chmod +x ${BAZEL} && ./${BAZEL} && git clone https://github.com/NVIDIA/cudnn-frontend.git && git clone --branch r1.15.5+nv21.10 --single-branch https://github.com/NVIDIA/tensorflow.git
WORKDIR /workdir/tensorflow
ENV TF_ENABLE_XLA=1 \
    TF_NEED_OPENCL_SYCL=0 \
    TF_NEED_ROCM=0 \
    TF_NEED_CUDA=1 \
    TF_NEED_TENSORRT=0 \
    TF_CUDA_VERSION=11 \
    TF_CUBLAS_VERSION=11 \
    TF_NCCL_VERSION=2 \
    TF_CUDNN_VERSION=8 \
    TF_CUDA_PATHS="/usr/include,/usr/lib/x86_64-linux-gnu,/usr/local/cuda/include,/usr/local/cuda/lib64,/usr/local/cuda/bin,/usr/local/cuda" \
    TF_CUDA_COMPUTE_CAPABILITIES=3.5,5.0,5.2,6.1,7.0,7.5,8.6 \
    CC_OPT_FLAGS="-march=sandybridge -mfma -mfpmath=both"
RUN PYTHON_BIN_PATH=$(which python) ./configure
RUN bazel build --config=opt --config=noaws --config=mkl --config=nogcp --config=nohdfs --config=noignite --config=nokafka //tensorflow/tools/pip_package:build_pip_package
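The static assertion suggests that under `--config=mkl`, `quantization_utils.h` is being compiled without `EIGEN_USE_THREADS` defined, so Eigen instantiates its default single-threaded executor. An untested workaround sketch is to force the define for the whole build:

```shell
# Untested workaround: define EIGEN_USE_THREADS globally so the MKL build
# path does not hit Eigen's default-device static assertion.
bazel build --config=opt --config=mkl --copt=-DEIGEN_USE_THREADS \
    --config=noaws --config=nogcp --config=nohdfs --config=noignite --config=nokafka \
    //tensorflow/tools/pip_package:build_pip_package
```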

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

full log of the last RUN: https://1drv.ms/t/s!Ao-GP3hGG9a9gvF--rQYH7LUlYnh-w?e=rZfQ3h

This cycle occurred because of a configuration option

docker build of latest r1.15.5+nv21.03 leads to:

WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:1098:12: in srcs attribute of cc_library rule //tensorflow/core:framework_lite: please do not import '//tensorflow/core/platform:default/integral_types.h' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:1098:12: in srcs attribute of cc_library rule //tensorflow/core:framework_lite: please do not import '//tensorflow/core/platform:default/mutex.h' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:1098:12: in srcs attribute of cc_library rule //tensorflow/core:framework_lite: please do not import '//tensorflow/core/platform:default/mutex_data.h' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:1096:1: in linkstatic attribute of cc_library rule //tensorflow/core:framework_lite: setting 'linkstatic=1' is recommended if there are no object files
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:388:12: in srcs attribute of cc_library rule //tensorflow/core:platform_port: please do not import '//tensorflow/core/platform:cpu_info.cc' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:388:12: in srcs attribute of cc_library rule //tensorflow/core:platform_port: please do not import '//tensorflow/core/platform:default/dynamic_annotations.h' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:388:12: in srcs attribute of cc_library rule //tensorflow/core:platform_port: please do not import '//tensorflow/core/platform:default/mutex.h' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:388:12: in srcs attribute of cc_library rule //tensorflow/core:platform_port: please do not import '//tensorflow/core/platform:posix/port.cc' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:652:12: in srcs attribute of cc_library rule //tensorflow/core:lib_proto_parsing: please do not import '//tensorflow/core/platform:protobuf.cc' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:2460:12: in srcs attribute of cc_library rule //tensorflow/core:lib_internal_impl: please do not import '//tensorflow/core/platform:abi.h' directly. You should either move the file to this package or depend on an appropriate rule there
WARNING: /tfbuild/tensorflow/tensorflow/core/BUILD:2460:12: in srcs attribute of cc_library rule //tensorflow/core:lib_internal_impl: please do not import '//tensorflow/core/platform:blocking_counter.h' directly. You should either move the file to this package or depend on an appropriate rule there

and

ERROR: /tfbuild/tensorflow/tensorflow/core/BUILD:2827:1: in cc_library rule //tensorflow/core:framework_internal: cycle in dependency graph:
    //tensorflow/tools/pip_package:build_pip_package
    //tensorflow/python/compiler:compiler
    //tensorflow/python/compiler/tensorrt:init_py
    //tensorflow/python/compiler/tensorrt:trt_convert_py
    //tensorflow/python/saved_model:save
    //tensorflow/python/training/tracking:graph_view
    //tensorflow/python:framework_ops
    //tensorflow/python:type_spec
    //tensorflow/python:tensor_shape
    //tensorflow/python/eager:monitoring
    //tensorflow/python:c_api_util
    //tensorflow/python:pywrap_tensorflow
    //tensorflow/python:pywrap_tensorflow_internal
    //tensorflow/python:pywrap_tensorflow_internal.py
    //tensorflow/python:pywrap_tensorflow_internal_py_wrap
    //tensorflow/core/distributed_runtime:server_lib
.-> //tensorflow/core:framework_internal
|   //tensorflow/core:framework_internal_impl
|   //tensorflow/core/profiler:nvtx_utils
|   //tensorflow/core:framework
`-- //tensorflow/core:framework_internal
This cycle occurred because of a configuration option
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted
INFO: Elapsed time: 42.064s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (398 packages loaded, 20384 targets configured)
make: *** [build_tensorflow_one] Error 1
[root@3d36a0da2e0c tfbuild]# 

Same thing does not happen with 1.15.4+nv20.12

Multi mps-daemon with Tensorflow

Hi,

GPU - 4 Tesla V100

I have started mps-daemon per physical gpu

# ps -ef | grep mps
root      4258 34958  0 14:19 pts/2    00:00:00 grep --color=auto mps
root      6487 27180  0 09:14 ?        00:00:07 nvidia-cuda-mps-server
root     27170     1  0 09:03 ?        00:00:00 nvidia-cuda-mps-control -d
root     27175     1  0 09:03 ?        00:00:00 nvidia-cuda-mps-control -d
root     27180     1  0 09:03 ?        00:00:00 nvidia-cuda-mps-control -d
root     27185     1  0 09:03 ?        00:00:00 nvidia-cuda-mps-control -d
root     38635 27175  0 13:04 ?        00:00:01 nvidia-cuda-mps-server
root     56671 27170  0 13:25 ?        00:00:01 nvidia-cuda-mps-server
root     62115 27185  0 13:10 ?        00:00:01 nvidia-cuda-mps-server
# nvidia-smi
Tue May 10 14:18:24 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   27C    P0    41W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0    42W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     56671      C   nvidia-cuda-mps-server             27MiB |
|    1   N/A  N/A     38635      C   nvidia-cuda-mps-server             27MiB |
|    2   N/A  N/A      6487      C   nvidia-cuda-mps-server             27MiB |
|    3   N/A  N/A     62115      C   nvidia-cuda-mps-server             27MiB |
+-----------------------------------------------------------------------------+

If I deploy sample k8s application pod as below - It is working.

apiVersion: v1
kind: Pod
metadata:
 name: cuda-gpu-demo
spec:
  hostIPC: true
  restartPolicy: OnFailure
  containers:
  - name: cuda-gpu-demo
    image: my-image
    command:
    - "/bin/sh"
    - "-c"
    args:
    - for i in {0..40}; do echo $i; /usr/bin/binomialOptions; sleep 1; done
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
      - name: mps
        mountPath: /tmp/nvidia/
  volumes:
    - name: mps
      hostPath:
        path: /tmp/nvidia/

as you can see below for device 1 (M+C)

nvidia-smi 
Tue May 10 13:14:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   27C    P0    41W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   32C    P0    56W / 300W |    161MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |     30MiB / 32768MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     11999    M+C   /usr/bin/binomialOptions          131MiB |
|    1   N/A  N/A     38635      C   nvidia-cuda-mps-server             27MiB |
|    2   N/A  N/A      6487      C   nvidia-cuda-mps-server             27MiB |
|    3   N/A  N/A     62115      C   nvidia-cuda-mps-server             27MiB |
+-----------------------------------------------------------------------------+

However, whenever I try to run a TensorFlow Python script in a Jupyter notebook, it does not connect to the mps-server, and nothing shows up in the logs under /var/log/nvidia-mps (server.log | control.log).

I have already set CUDA_MPS_PIPE_DIRECTORY as an env variable and also mounted the host "/tmp/nvidia/" directory, where the CUDA_MPS_PIPE_DIRECTORY pipe directories are created per physical GPU.

Issue:
2022-05-11 00:27:26.583002: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
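For reference, the pipe directory must be visible inside the container at exactly the path `CUDA_MPS_PIPE_DIRECTORY` names. A pod-spec sketch (container name and the per-GPU subdirectory are assumptions; substitute the actual pipe directory created by your mps-control daemon):

```yaml
# Sketch: point CUDA_MPS_PIPE_DIRECTORY at the mounted host directory.
# The value must match the pipe directory the mps-control daemon for the
# GPU this pod uses was started with (here under the /tmp/nvidia/ mount).
containers:
- name: tf-notebook            # hypothetical container name
  image: my-image
  env:
  - name: CUDA_MPS_PIPE_DIRECTORY
    value: /tmp/nvidia/mps-0   # assumption: per-GPU pipe dir under the mount
  volumeMounts:
  - name: mps
    mountPath: /tmp/nvidia/
```

A CUDA_ERROR_INVALID_DEVICE from cuDevicePrimaryCtxRetain can also indicate the container's visible GPU does not match the GPU the reached mps-server is bound to, which is worth ruling out first.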

Can it be used in win10+rtx3090?

This template is for miscellaneous issues not covered by the other issue categories.

For questions on how to work with TensorFlow, or support for problems that are not verified bugs in TensorFlow, please go to StackOverflow.

If you are reporting a vulnerability, please use the dedicated reporting process.

For high-level discussions about TensorFlow, please post to [email protected], for questions about the development or internal workings of TensorFlow, or if you would like to know how to contribute to TensorFlow, please post to [email protected].

training aborts reporting "Fatal Python error: Segmentation fault"

Hello,

I pip installed nvidia-tensorflow==1.15.4+nv20.11 in a virtual env on Ubuntu 18.04 with Python 3.6.

My training aborts randomly after few hundred steps spitting out the following error.

Can anyone please advise what might cause the error?

Thanks.

....................
step: 1664 train-loss: 3.7866263389587402 train-acc: 0.05000000074505806
step: 1665 train-loss: 3.7862656116485596 train-acc: 0.10000000149011612

Fatal Python error: Segmentation fault

Thread 0x2021-05-31 09:10:27.487499: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
00007f4c2e7fc700 (most recent call first):
File "/usr/lib2021-05-31 09:10:27.487531: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
/python3.6/threading.py", line 295 in wait
File "/Fatal Python error: uAborteds

r/lib/python3.6/queue.py", line 164 in get
File "/home/atiqur/nvidia-tf-1.15.4-nv-20.11/lib/python3.2021-05-31 09:10:27.487552: F ./tensorflow/core/kernels/conv_2d_gpu.h:1015] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: unspecified launch failure
6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner

UT //tensorflow/python:saver_test failed

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):N
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):Linux Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:N
  • TensorFlow installed from (source or binary):source
  • TensorFlow version (use command below):1.15
  • Python version:python3.6
  • Bazel version (if compiling from source):0.26.1
  • GCC/Compiler version (if compiling from source):gcc7.5.0
  • CUDA/cuDNN version:-
  • GPU model and memory:-

Describe the current behavior

bazel test --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 --copt="-march=native" //tensorflow/python:saver_test
FAILED

Traceback (most recent call last):
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/training/saver_test.py", line 2419, in testClearDevicesOnExport
10, size=[1, 10])
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1165, in _run
self._graph, fetches, feed_dict_tensor, feed_handles=feed_handles)
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 474, in __init__
self._fetch_mapper = _FetchMapper.for_fetch(fetches)
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 266, in for_fetch
return _ListFetchMapper(fetch)
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 375, in __init__
self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 375, in <listcomp>
self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 276, in for_fetch
return _ElementFetchMapper(fetches, contraction_fn)
File "/home/admin/chen.ding/gitlab/code/tensorflow_2021/bazel-bin/tensorflow/python/saver_test.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 315, in __init__
'Tensor. (%s)' % (fetch, str(e)))
ValueError: Fetch argument 'new_model/optimize' cannot be interpreted as a Tensor. ("The name 'new_model/optimize' refers to an Operation not in the graph.")

Describe the expected behavior

UT pass

Root cause
https://github.com/NVIDIA/tensorflow/blob/r1.15.5%2Bnv21.05/tensorflow/python/training/optimizer.py#L659
It changes the NoOp name here (adding the suffix '-apply'), but the 'saver_test' test case was not updated to match.
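The failure reduces to a name-lookup mismatch: the optimizer now registers its op under a suffixed name while the test still fetches the old one. A stdlib sketch (the dict stands in for the graph's op registry; all names illustrative):

```python
# Stand-in for the graph's name -> op registry after the rename.
graph_ops = {"new_model/optimize-apply": "<NoOp>"}

def get_operation_by_name(name):
    """Mimic the graph lookup that saver_test's fetch goes through."""
    if name not in graph_ops:
        raise KeyError(
            f"The name {name!r} refers to an Operation not in the graph.")
    return graph_ops[name]

get_operation_by_name("new_model/optimize-apply")  # new suffixed name: found
try:
    get_operation_by_name("new_model/optimize")    # old name the test fetches
except KeyError as err:
    print(err)
```

Updating the test to fetch the suffixed name (or keeping the old op name) resolves the mismatch either way.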

Memory keeps increasing when training the network

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution Ubuntu20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from binary:
  • TensorFlow version r1.15:
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: CUDA 11.1 cuDNN 8.0.5
  • GPU model and memory: RTX3080Ti 12G

You can collect some of this information using our environment capture

c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

== check python ===================================================
python version: 3.6.15
python branch:
python build version: ('default', 'Dec 3 2021 18:49:41')
python compiler version: GCC 9.4.0
python implementation: CPython

== check os platform ===============================================
os: Linux
os kernel version: #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022
os release version: 5.13.0-27-generic
os platform: Linux-5.13.0-27-generic-x86_64-with-debian-bullseye-sid
linux distribution: ('debian', 'bullseye/sid', '')
linux os distribution: ('debian', 'bullseye/sid', '')
mac version: ('', ('', '', ''), '')
uname: uname_result(system='Linux', node='myalos', release='5.13.0-27-generic', version='#29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022', machine='x86_64', processor='x86_64')
architecture: ('64bit', '')
machine: x86_64

== are we in docker =============================================
No

== compiler ====================================
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== check pips ===================================================
numpy 1.18.5
nvidia-tensorflow 1.15.4+nv20.11
protobuf 3.19.3
tensorflow-estimator 1.15.1

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.version.VERSION = 1.15.4
tf.version.GIT_VERSION = unknown
tf.version.COMPILER_VERSION = 7.5.0
Sanity check: array([1], dtype=int32)

Describe the current behavior
Memory keeps increasing when training the official DeepV2D code.

Describe the expected behavior
memory should not keep increasing

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
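When it is host RAM (rather than GPU memory) that grows, the stdlib `tracemalloc` module can localize the allocation site without any TensorFlow involvement. A generic sketch, where the leaky `train_step` is a stand-in for the real loop body:

```python
import tracemalloc

def train_step(history):
    # Stand-in for a leaky loop body, e.g. one that keeps appending
    # per-step results (or, in TF1, keeps adding new ops to the graph).
    history.append([0] * 1000)

tracemalloc.start()
history = []
before = tracemalloc.take_snapshot()
for _ in range(200):
    train_step(history)
after = tracemalloc.take_snapshot()

# The top entries point at the lines responsible for most of the growth.
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```

In TF1 specifically, a common cause is building new ops inside the training loop; calling `tf.get_default_graph().finalize()` before the loop turns any such op creation into an immediate error.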

tflite import broken on nvidia-tensorflow

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04): Ubuntu 20.04 LTS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): nvidia-TF 1.15.5
  • Python version: 3.8.2
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.1 / 8.0.1
  • GPU model and memory: RTX 3090 24GB

Describe the current behavior

nvidia-tensorflow errors out when importing tflite.

Describe the expected behavior

tf.lite should import without raising an error.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

$ python3 -m virtualenv testvenv
$ source testvenv/bin/activate
$ pip install nvidia-pyindex
$ pip install nvidia-tensorflow
$ python
Python 3.8.2 (default, Jul 16 2020, 14:00:26)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'1.15.5'
>>> tf.lite
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/marko/testvenv/lib/python3.8/site-packages/tensorflow_core/python/util/module_wrapper.py", line 193, in __getattr__
    attr = getattr(self._tfmw_wrapped_module, name)
AttributeError: module 'tensorflow' has no attribute 'lite'

Other info / logs

This import is tested and works as expected in the standard pip tensorflow-gpu releases 1.15.0, 1.15.4, 1.15.5.
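Since the nvidia-tensorflow build may omit submodules that upstream wheels expose, code that must run on both can feature-detect instead of assuming the attribute exists. A small helper sketch (the `math` example below is only for illustration):

```python
# Sketch: detect whether a module imports cleanly and exposes an attribute,
# instead of letting an AttributeError surface deep inside application code.
import importlib

def module_has_attr(module_name, attr):
    """Return True iff `module_name` imports cleanly and exposes `attr`."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(mod, attr)

print(module_has_attr("math", "sqrt"))  # stdlib example of a present attribute
```

With TensorFlow installed, `module_has_attr("tensorflow", "lite")` would distinguish the nvidia-tensorflow build above (False) from the upstream tensorflow-gpu 1.15.x wheels (True).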

tfcompile fails to build

System information

  • Docker image: nvcr.io/nvidia/tensorflow:21.05-tf1-py3
  • Linux Ubuntu 20.04
  • TensorFlow installed from: NA (build problem, not installation problem)
  • TensorFlow version: 1.15.5
  • Python version: 3.8.5
  • Installed using virtualenv? pip? conda?: NA
  • Bazel version (if compiling from source): 0.24.1
  • GCC/Compiler version (if compiling from source): 9.3.0
  • CUDA/cuDNN version: ?? (whatever is in the docker image)
  • GPU model and memory: 1080Ti

Describe the problem
From within a docker container running nvcr.io/nvidia/tensorflow:21.05-tf1-py3, tfcompile fails to build:

root@48f2340d016b:/opt/tensorflow/tensorflow-source# bazel build --config=opt --config=cuda //tensorflow/compiler/aot:tfcompile
...
INFO: Analysed target //tensorflow/compiler/aot:tfcompile (124 packages loaded, 11185 targets configured).
INFO: Found 1 target...
ERROR: /opt/tensorflow/tensorflow-source/tensorflow/compiler/aot/BUILD:190:1: C++ compilation of rule '//tensorflow/compiler/aot:embedded_protocol_buffers' failed (Exit 1)
tensorflow/compiler/aot/embedded_protocol_buffers.cc: In function ‘xla::StatusOr<std::__cxx11::basic_string<char> > tensorflow::tfcompile::CodegenModule(llvm::TargetMachine*, std::unique_ptr<llvm::Module>)’:
tensorflow/compiler/aot/embedded_protocol_buffers.cc:85:32: error: ‘CGFT_ObjectFile’ is not a member of ‘llvm::TargetMachine’
   85 |           llvm::TargetMachine::CGFT_ObjectFile)) {
      |                                ^~~~~~~~~~~~~~~
Target //tensorflow/compiler/aot:tfcompile failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 14.014s, Critical Path: 9.96s
INFO: 171 processes: 171 local.
FAILED: Build did NOT complete successfully

I have modified the .tf_configure.bazelrc but I think the changes are irrelevant to the failure:

build --action_env PYTHON_BIN_PATH="/usr/bin/python3.8"
build --action_env PYTHON_LIB_PATH="/usr/local/lib/python3.8/dist-packages"
build --python_path="/usr/bin/python3.8"
build:xla --define with_xla_support=true
build --config=xla
build --action_env TF_USE_CCACHE="0"
build --copt=-march=haswell
build:opt --define with_default_optimizations=true
build:v2 --define=tf_api_version=2
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_tag_filters=-benchmark-test,-no_oss,-oss_serial
test --build_tag_filters=-benchmark-test,-no_oss
test --test_tag_filters=-gpu
test --build_tag_filters=-gpu
build --action_env TF_CONFIGURE_IOS="0"

I believe nvidia-tensorflow has some LLVM-related changes with respect to upstream, but perhaps the focus was on getting Python TensorFlow working on NVIDIA hardware without attention to less commonly used parts like tfcompile. This failure looks like LLVM API usage that no longer matches the bundled LLVM version. Upstream 1.15.5 builds with no problems.

pip install nvidia-pyindex is not working

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below):
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

Describe the expected behavior

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

module 'tensorflow.python.keras.api._v1.keras.preprocessing' has no attribute 'image_dataset_from_directory'

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.15.5+nv21.5
  • Python version: 3.8.10
  • CUDA/cuDNN version: 11.3.58 / 8.2.0.51
  • GPU : RTX 3080

When I tried using tf.keras.preprocessing.image_dataset_from_directory in TF 2.5.0 it works fine, but with the nvidia-tensorflow package it says image_dataset_from_directory doesn't exist.

Code to reproduce the issue
import pathlib
import tensorflow as tf
from tensorflow import keras

data_dir_train = pathlib.Path("/images_train")

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(180, 180),
    batch_size=32)

Other info / logs

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
      2 ## Note: use seed=123 while creating your dataset using tf.keras.preprocessing.image_dataset_from_directory
      3 ## Note: make sure you resize your images to the size img_height*img_width while writing the dataset
----> 4 train_ds = tf.keras.preprocessing.image_dataset_from_directory(
      5     data_dir_train,
      6     validation_split=0.2,

~/.local/lib/python3.8/site-packages/tensorflow_core/python/util/module_wrapper.py in __getattr__(self, name)
    191   def __getattr__(self, name):
    192     try:
--> 193       attr = getattr(self._tfmw_wrapped_module, name)
    194     except AttributeError:
    195       if not self._tfmw_public_apis:

AttributeError: module 'tensorflow.python.keras.api._v1.keras.preprocessing' has no attribute 'image_dataset_from_directory'

I tried comparing the packages and found that it is missing from nvidia-tensorflow:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/python/keras/preprocessing
https://github.com/NVIDIA/tensorflow/tree/r1.15.5%2Bnv21.05/tensorflow/python/keras/preprocessing

Is this a known bug, or am I missing something?
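For context: image_dataset_from_directory is a TF 2.x Keras addition, which is why the TF 1.15-based nvidia-tensorflow package lacks it; on 1.15 the usual route is `tf.keras.preprocessing.image.ImageDataGenerator(...).flow_from_directory(...)`. As a reference, here is a sketch (not the exact Keras algorithm) of the kind of seeded train/validation split that API performs:

```python
# Sketch: deterministic, seeded split of file paths into training and
# validation sets, mimicking what a validation_split/seed pair does.
import random

def split_paths(paths, validation_split=0.2, seed=123):
    """Deterministically split file paths into (training, validation) lists."""
    rng = random.Random(seed)
    paths = sorted(paths)       # make the split independent of listing order
    rng.shuffle(paths)
    n_val = int(len(paths) * validation_split)
    return paths[n_val:], paths[:n_val]

train, val = split_paths([f"img_{i}.png" for i in range(10)])
print(len(train), len(val))   # 8 2
```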

Where can I find the definitions of clear_allocator_type() and set_allocator_type()?

When I was reading:

session_options.config.mutable_gpu_options()->clear_allocator_type();

session_options.config.mutable_gpu_options()->set_allocator_type(

I wondered where clear_allocator_type() (line 940) and set_allocator_type() (line 931) are defined. I could not find any file in TensorFlow that defines these functions.
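The reason grep finds no hand-written definition is that protoc generates `set_<field>()` and `clear_<field>()` accessors for every proto field; here, the `allocator_type` string field of the GPUOptions message in tensorflow/core/protobuf/config.proto, so the definitions live in the generated config.pb.h/.cc. A Python stand-in sketch of what the generated accessors do:

```python
# Sketch: a hand-written analogue of the accessors protoc generates for a
# proto3 string field (set_x sets the value, clear_x resets it to the
# field's default, here the empty string). GPUOptions is a stand-in class.
class GPUOptions:
    """Stand-in for the protoc-generated C++ GPUOptions message."""

    def __init__(self):
        self._allocator_type = ""        # proto3 string default

    def set_allocator_type(self, value):
        self._allocator_type = value

    def clear_allocator_type(self):
        self._allocator_type = ""        # reset to the field default

    def allocator_type(self):
        return self._allocator_type
```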

nvidia-pyindex fails to install

(py38trtc250) G:\client_py>pip install --user nvidia-pyindex
Looking in indexes: https://mirrors.aliyun.com/pypi/simple
Collecting nvidia-pyindex
  Using cached https://mirrors.aliyun.com/pypi/packages/64/4c/dd413559179536b9b7247f15bf968f7e52b5f8c1d2183ceb3d5ea9284776/nvidia-pyindex-1.0.5.tar.gz (6.1 kB)
Building wheels for collected packages: nvidia-pyindex
  Building wheel for nvidia-pyindex (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'D:\Anaconda\envs\py38trtc250\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\15151\\AppData\\Local\\Temp\\pip-install-a7e_bahh\\nvidia-pyindex\\setup.py'"'"'; __file__='"'"'C:\\Users\\15151\\AppData\\Local\\Temp\\pip-install-a7e_bahh\\nvidia-pyindex\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\15151\AppData\Local\Temp\pip-wheel-cho7jhkv'
       cwd: C:\Users\15151\AppData\Local\Temp\pip-install-a7e_bahh\nvidia-pyindex\
  Complete output (25 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib
  creating build\lib\nvidia_pyindex
  copying nvidia_pyindex\cmdline.py -> build\lib\nvidia_pyindex
  copying nvidia_pyindex\utils.py -> build\lib\nvidia_pyindex
  copying nvidia_pyindex\__init__.py -> build\lib\nvidia_pyindex
  running egg_info
  writing nvidia_pyindex.egg-info\PKG-INFO
  writing dependency_links to nvidia_pyindex.egg-info\dependency_links.txt
  writing entry points to nvidia_pyindex.egg-info\entry_points.txt
  writing top-level names to nvidia_pyindex.egg-info\top_level.txt
  reading manifest file 'nvidia_pyindex.egg-info\SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  writing manifest file 'nvidia_pyindex.egg-info\SOURCES.txt'
  installing to build\bdist.win-amd64\wheel
  running install
  '"nvidia_pyindex uninstall"' is not recognized as an internal or external command, operable program or batch file.
  error: [WinError 2] The system cannot find the file specified.
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  COMMAND: InstallCommand
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  ----------------------------------------
  ERROR: Failed building wheel for nvidia-pyindex
  Running setup.py clean for nvidia-pyindex
Failed to build nvidia-pyindex
Installing collected packages: nvidia-pyindex
    Running setup.py install for nvidia-pyindex ... error
    ERROR: Command errored out with exit status 1:
     command: 'D:\Anaconda\envs\py38trtc250\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\15151\\AppData\\Local\\Temp\\pip-install-a7e_bahh\\nvidia-pyindex\\setup.py'"'"'; __file__='"'"'C:\\Users\\15151\\AppData\\Local\\Temp\\pip-install-a7e_bahh\\nvidia-pyindex\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\15151\AppData\Local\Temp\pip-record-ma8cx0c0\install-record.txt' --single-version-externally-managed --user --prefix= --compile --install-headers 'C:\Users\15151\AppData\Roaming\Python\Python38\Include\nvidia-pyindex'
         cwd: C:\Users\15151\AppData\Local\Temp\pip-install-a7e_bahh\nvidia-pyindex\
    Complete output (7 lines):
    running install
    '"nvidia_pyindex uninstall"' is not recognized as an internal or external command, operable program or batch file.
    error: [WinError 2] The system cannot find the file specified.
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    COMMAND: InstallCommand
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'D:\Anaconda\envs\py38trtc250\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\15151\\AppData\\Local\\Temp\\pip-install-a7e_bahh\\nvidia-pyindex\\setup.py'"'"'; __file__='"'"'C:\\Users\\15151\\AppData\\Local\\Temp\\pip-install-a7e_bahh\\nvidia-pyindex\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\15151\AppData\Local\Temp\pip-record-ma8cx0c0\install-record.txt' --single-version-externally-managed --user --prefix= --compile --install-headers 'C:\Users\15151\AppData\Roaming\Python\Python38\Include\nvidia-pyindex' Check the logs for full command output.

Install instructions fail; the error message simply repeats the install instructions.

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
  • TensorFlow installed from (source or binary): pip?
  • TensorFlow version: Whatever the default is?
  • Python version: 3.7.7
  • Installed using virtualenv? pip? conda?: Using pip supplied by conda environment (i.e. ~/anaconda3/envs/myenv/bin/pip/)
  • Bazel version (if compiling from source): No idea
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: Doesn't seem to be relevant to this issue, but 11.0
  • GPU model and memory: RTX 3080

Describe the problem
Trying to follow the install instructions for NVIDIA TensorFlow via pip (and pyindex) posted at https://github.com/NVIDIA/tensorflow#install results in an error message telling
me "This package can be installed as: ...". (Re-)Running those commands produces the same error message again.

Provide the exact sequence of commands / steps that you executed before running into the problem
Please refer to the log below.

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

$ pip install --user nvidia-pyindex
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting nvidia-pyindex
  Downloading nvidia-pyindex-1.0.6.tar.gz (6.7 kB)
Building wheels for collected packages: nvidia-pyindex
  Building wheel for nvidia-pyindex (setup.py) ... done
  Created wheel for nvidia-pyindex: filename=nvidia_pyindex-1.0.6-py3-none-any.whl size=4171 sha256=692df4078194418f4812516403399f2e96373ad780b93c98ce944b5f02efb35d
  Stored in directory: /tmp/pip-ephem-wheel-cache-kpx26e3z/wheels/52/31/c8/db9f8939a8bb1f3500ce81b630604cbfa6e31f82c8f1bd914d
Successfully built nvidia-pyindex
Installing collected packages: nvidia-pyindex
Successfully installed nvidia-pyindex-1.0.6

$ pip install --user nvidia-tensorflow[horovod]
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting nvidia-tensorflow[horovod]
  Downloading nvidia-tensorflow-0.0.1.dev4.tar.gz (3.8 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/shawley/anaconda3/envs/spnet/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-yv_vnm57/nvidia-tensorflow/setup.py'"'"'; __file__='"'"'/tmp/pip-install-yv_vnm57/nvidia-tensorflow/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-1hvhhg4h
         cwd: /tmp/pip-install-yv_vnm57/nvidia-tensorflow/
    Complete output (17 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-yv_vnm57/nvidia-tensorflow/setup.py", line 150, in <module>
        raise RuntimeError(open("ERROR.txt", "r").read())
    RuntimeError:
    ###########################################################################################
    The package you are trying to install is only a placeholder project on PyPI.org repository.
    This package is hosted on NVIDIA Python Package Index.
    
    This package can be installed as:
    ```
    $ pip install nvidia-pyindex
    $ pip install nvidia-tensorflow
    ```
    
    Please refer to NVIDIA instructions: https://github.com/NVIDIA/tensorflow#install.
    ###########################################################################################
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

OK, so let's copy exactly from "This package can be installed as: ...":

$ pip install nvidia-pyindex
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: nvidia-pyindex in /home/shawley/.local/lib/python3.7/site-packages (1.0.6)

$ pip install nvidia-tensorflow
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting nvidia-tensorflow
  Downloading nvidia-tensorflow-0.0.1.dev4.tar.gz (3.8 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/shawley/anaconda3/envs/spnet/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ayitiyj8/nvidia-tensorflow/setup.py'"'"'; __file__='"'"'/tmp/pip-install-ayitiyj8/nvidia-tensorflow/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-v8gqo1_p
         cwd: /tmp/pip-install-ayitiyj8/nvidia-tensorflow/
    Complete output (17 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-ayitiyj8/nvidia-tensorflow/setup.py", line 150, in <module>
        raise RuntimeError(open("ERROR.txt", "r").read())
    RuntimeError:
    ###########################################################################################
    The package you are trying to install is only a placeholder project on PyPI.org repository.
    This package is hosted on NVIDIA Python Package Index.
    
    This package can be installed as:
    ```
    $ pip install nvidia-pyindex
    $ pip install nvidia-tensorflow
    ```
    
    Please refer to NVIDIA instructions: https://github.com/NVIDIA/tensorflow#install.
    ###########################################################################################
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Referring to https://github.com/NVIDIA/tensorflow#install is what originally got me to this point, so I'm not sure what to do next. Any help?
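A likely explanation (an assumption based on the Requirements section at the top of this README, which lists Python 3.8 and Linux): the conda env above runs Python 3.7.7, so pip cannot match any published wheel and falls back to the PyPI placeholder sdist (0.0.1.dev4), whose setup.py raises exactly the error shown. A quick environment pre-check sketch:

```python
# Sketch: rough pre-check mirroring the README's stated wheel requirements
# (Python 3.8 on Linux). This is an assumption about the published wheel
# tags, not an official compatibility API.
import platform
import sys

def can_get_nvidia_tf_wheel():
    return sys.version_info[:2] == (3, 8) and platform.system() == "Linux"

print(can_get_nvidia_tf_wheel())
```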

Can't install in Colab and Kaggle

I tried this in Colab and in Kaggle (both Python 3.7) and got an error.

!pip install nvidia-pyindex
!pip install nvidia-tensorflow==1.15.4

ERROR: Could not find a version that satisfies the requirement nvidia-tensorflow==1.15.4 (from versions: 0.0.1.dev4, 0.0.1.dev5)
ERROR: No matching distribution found for nvidia-tensorflow==1.15.4

Is it possible to use nvidia-tensorflow in Kaggle?

Hardcoded location of linker will fail on Redhat 7

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Redhat 7
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.15.3+nv20.07
  • Python version: 3.8.5
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source):9.3.0
  • CUDA/cuDNN version:Cuda 11.0.207 cuDNN 8.0.1.13
  • GPU model and memory: Nvidia A100

Describe the current behavior

(...)  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -shared -o bazel-out/k8-py2-opt/bin/tensorflow/python/_tf_stack.so '-Wl,-rpath,$ORIGIN/,-rpath,$ORIGIN/..' -Wl,--version-script bazel-out/k8-py2-opt/bin/tensorflow/python/_tf_stack-version-script.lds -Wl,-no-as-needed -Wl,-z,relro,-z,now '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -no-canonical-prefixes -fno-canonical-system-headers -B/usr/bin -Wl,--gc-sections -Wl,@bazel-out/k8-py2-opt/bin/tensorflow/python/_tf_stack.so-2.params)
Execution platform: @bazel_tools//platforms:host_platform
/usr/bin/ld.gold: --push-state: unknown option
/usr/bin/ld.gold: use the --help option for usage information
collect2: error: ld returned 1 exit status

Describe the expected behavior
A fixed ld.gold version should work.

Code to reproduce the issue
Compile with Red Hat's standard GCC.

Other info / logs
The solution is to use a properly patched libtool and remove the hardcoded /usr/bin path from the linker flags. The EasyBuild project has a patch that does exactly this: https://github.com/easybuilders/easybuild-easyconfigs/blob/master/easybuild/easyconfigs/t/TensorFlow/TensorFlow-1.13.1_remove_usrbin_from_linker_bin_path_flag.patch

Pip Install Error on Python 3.8.7, Suggested Install Instructions Do Not Work

System information

  • OS Platform and Distribution: Windows 11 Version 10.0.22000 Build 22000
  • Mobile device: no
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version: the one that installs from the pip command
  • Python version: 3.8.7
  • Installed using virtualenv? pip? conda?: pip 22.0.4
  • CUDA/cuDNN version: 11.6 (also tried with 10.1, 10.0)/ 7.5.1
  • GPU model and memory: 3090

Describe the problem
I am trying to use tensorflow to run an old project on a 3090. I got everything set up before realizing that normal tensorflow / tensorflow-gpu does not work with the 30 series. I am trying to switch to using nvidia-tensorflow, but pip refuses to download it, outputting this message:

C:\Users\kingz\Documents\Wave-U-Net>pip install --user nvidia-tensorflow[horovod]
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting nvidia-tensorflow[horovod]
  Downloading nvidia-tensorflow-0.0.1.dev5.tar.gz (7.9 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\kingz\AppData\Local\Temp\pip-install-7do6gsb8\nvidia-tensorflow_664175a071704da6a12449fc1db8e9b9\setup.py", line 150, in <module>
          raise RuntimeError(open("ERROR.txt", "r").read())
      RuntimeError:
      ###########################################################################################
      The package you are trying to install is only a placeholder project on PyPI.org repository.
      This package is hosted on NVIDIA Python Package Index.

      This package can be installed as:
      ```
      $ pip install nvidia-pyindex
      $ pip install nvidia-tensorflow
      ```

      Please refer to NVIDIA instructions: https://github.com/NVIDIA/tensorflow#install.
      ###########################################################################################

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I have previously installed nvidia-pyindex successfully. This is very similar to a previous issue, which was fixed by switching from Python 3.7 to 3.8; I am having the same issue even on 3.8. Why isn't pip detecting the correct nvidia-tensorflow version?
I did not go through the Docker setup suggested by the user guide due to my lack of Docker experience. Should I sink time into learning it, or will fixing the pip install suffice?

Provide the exact sequence of commands / steps that you executed before running into the problem
I ran pip install --user nvidia-tensorflow[horovod] after installing CUDA 11.6, python 3.8.6, and ffmpeg.

Any other info / logs
I currently don't know how to access my Python/pip logs, but I would be happy to provide them given some pointers!

Is this fork dead?

We want to use TF 1.15 on our brand-new A100 systems. We need to compile from source because we use the C/C++ bindings, and these are not usually available in prebuilt releases (e.g. the container images on nvcr.io).

But the mainline r1.15 code does not support CUDA 11 and cuDNN 8, and the owners appear to have no interest in fixing this.

So, is NVIDIA releasing its source code for TF1.x after 2020-06?

Thank you!

Nvidia-tensorflow[horovod] installation fails due to an HTML issue on the website.

System information

  • Windows Desktop with Intel i7 CPU and RTX 2080 Super
  • Python 3.8.12 installed on Anaconda, in an env created with conda create --name nvidiatf python=3.8
  • Trying to install TensorFlow using the command provided in the tutorial section of this repository:
  • $ pip install --user nvidia-tensorflow[horovod]
  • CUDA/cuDNN version: Cuda Compilation Tools release 10.0 V10.0.130
  • GPU model and memory: 8GB RTX 2080 Super

When I try to install NVIDIA TensorFlow with pip through Anaconda using the command provided in this tutorial, it does not work. Other packages can be pip-installed, and everything is upgraded. It is a fresh environment with only Python and nvidia-pyindex installed, which was the previous step in the tutorial.

Provide the exact sequence of commands / steps that you executed before running into the problem
I created an instance with Python 3.8, updated pip, installed nvidia-pyindex, and then ran the command above to install nvidia-tensorflow[horovod].

This is the error that shows up when I run the install command:
(screenshot of the error attached)

Missing package: @nvtx_archive//:nvtx

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7.12
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version: https://github.com/NVIDIA/tensorflow/tree/v1.15.3+nv20.07
  • Python version: 3.5
  • Installed using virtualenv? pip? conda?: from source
  • Bazel version (if compiling from source): 0.21.6
  • GCC/Compiler version (if compiling from source): 9.3.0
  • CUDA/cuDNN version: 11.0/8.0.3
  • GPU model and memory: Pascal GP 100

Describe the problem
Compile halts:

see attached log

job.log

Provide the exact sequence of commands / steps that you executed before running into the problem

set -e
set -x

N_JOBS=11
N_JOBS=$((N_JOBS+1))
source /usr/local/bin/use_gcc9.sh

echo ""
echo "Bazel will use ${N_JOBS} concurrent job(s)."
echo ""
export TF_NEED_OPENCL_SYCL="0"
export TF_NEED_ROCM="0"
export TF_NEED_CUDA="1"
export CUDA_TOOLKIT_PATH="/usr/local/cuda-11.0"
export TF_CUDA_VERSION="11.0"
export CUDNN_INSTALL_PATH="/usr/local/cudnn-8.0.3"
export TF_CUDNN_VERSION="8"
export TF_NCCL_VERSION="2"
export TF_CUDA_COMPUTE_CAPABILITIES="3.5,3.7,5.0,5.2,6.0,6.1,7.0"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0/extras/CUPTI/lib64/:/usr/local/cudnn-8.0.3/cuda/lib64/:/usr/local/cuda-11.0/lib64/:/usr/local/cuda-11.0/extras/CUPTI/lib64/:/usr/local/cudnn-8.0.3/cuda/lib64/:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0/extras/CUPTI/lib64/:/usr/local/cudnn-8.0.3/lib64/"
export TF_CUDA_CLANG="0"
export GCC_HOST_COMPILER_PATH=$(which gcc)
export USE_BAZEL_VERSION="0.21.6"
# Run configure.
export TF_NEED_CUDA="1"
export TF_NEED_ROCM="0"
export TF_NEED_OPENCL_SYCL="0"
export PYTHON_BIN_PATH=$(which python3)
yes "" | $PYTHON_BIN_PATH configure.py
which bazel
bazel build  --host_linkopt="-lrt" --host_linkopt="-lm" --copt="-fpic" --host_copt="-mtune=generic" --copt="-DOPENSSL_NO_ASM" --cxxopt="-DOPENSSL_NO_ASM" --define with_default_optimizations=true\
    --jobs=${N_JOBS} \
    //tensorflow:libtensorflow_cc.so //tensorflow:libtensorflow_framework.so 
rm -rf obj_list.txt
find -L bazel-bin -name "*.o" | grep -v "nsync_cpp" > obj_list.txt
./tensorflow/tools/ci_build/linux/make_archive.py obj_list.txt bazel-bin/tensorflow/libtensorflow.a
mkdir tensorflow-r1.15
cp -rL tensorflow third_party bazel-bin bazel-genfiles tensorflow-r1.15

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

ERROR: No supported GPU(s) detected to run this container

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): docker image ( nvcr.io/nvidia/tensorflow:20.12-tf1-py3 )
  • TensorFlow version (use command below): 1.15.4
  • Python version: 3.8.5
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: Cuda compilation tools, release 11.1, V11.1.105
  • GPU model and memory: A100-SXM4-40GB

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

I just did the following:

(1) create AWS EC2 instance ( AMI : ami-0ef85cf6e604e5650, instance type : p4d.24xlarge )
(2) install nvidia-driver ( NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 )
(3) install docker
(4) install nvidia-docker
(5) run a command like this (I didn't use MIG):
sudo docker run -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 nvcr.io/nvidia/tensorflow:20.12-tf1-py3

I got these logs. I am a beginner with TensorFlow, so I may be making a mistake somewhere; I don't know why TensorFlow cannot detect the GPU.
(screenshot of the logs attached)

Even the nvidia-smi command works fine:
(screenshot of nvidia-smi output attached)

Describe the expected behavior
TensorFlow should detect the GPU properly.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

TIMEOUT issue in tensorflow images

I pulled some nvidia-tensorflow images from NGC and tried to run them with the command:

docker run --runtime=nvidia -it --rm -e TIMEOUT=100 nvcr.io/nvidia/tensorflow:<version-tag>

and I found that the TIMEOUT env inside the container is automatically set to 35, but when I run:

docker exec <container-id> env

it shows TIMEOUT=100 correctly.
I have tested images with tags 21.02-tf2-py3, 21.02-tf1-py3, 20.10-tf1-py3, and 19.10-py3; they all have the same issue.

Why does this happen? How can I run the container with a self-defined TIMEOUT env?
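One mechanism consistent with this observation (an assumption, not a confirmed root cause): the image's entrypoint script reassigns TIMEOUT for the main container process, while `docker exec` starts a fresh process that still inherits the value passed via `docker run -e`. Simulated without Docker:

```python
# Sketch: an entrypoint-side reassignment shadows the -e value for the main
# process only; a separately spawned process (docker exec) keeps the
# original environment.
import os

os.environ["TIMEOUT"] = "100"                    # what `docker run -e TIMEOUT=100` provides
entrypoint_env = dict(os.environ, TIMEOUT="35")  # what an overriding entrypoint would yield
print(entrypoint_env["TIMEOUT"], os.environ["TIMEOUT"])   # 35 100
```

If this is what happens, overriding the variable after the entrypoint has run (e.g. in the interactive shell, or via a custom entrypoint/command) would take effect where `-e` alone does not.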

no matches found: nvidia-tensorflow[horovod]

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version:
  • Python version:
  • Installed using virtualenv? pip? conda?: pip
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

Describe the problem

no matches found: nvidia-tensorflow[horovod]

Provide the exact sequence of commands / steps that you executed before running into the problem
already had

$pip install --upgrade pip
$pip install nvidia-pyindex 
$pip --version
pip 22.0.4

simply running
pip install nvidia-tensorflow[horovod]

Is there a conda-forge package for installing this?
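The "no matches found" message above is a shell error rather than a pip error: in zsh, square brackets are glob characters, and an unmatched glob aborts the command before pip ever runs. A minimal sketch of the workaround (quoting keeps the extras specifier literal):

```shell
# In zsh, [horovod] is treated as a character-class glob; quoting the
# argument passes it to pip unchanged.
pkg='nvidia-tensorflow[horovod]'
echo "pip install --user '$pkg'"
# Escaping the brackets (nvidia-tensorflow\[horovod\]) works as well.
```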

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

Nvidia Tensorflow 1.15 does not use RTX3070 GPU due to failing to load CUDA library

System information
Device: Intel i7 11th Gen with RTX 3070 GPU and 32GB RAM
OS: Ubuntu 20.04
CUDA: 11.2
Cudnn: 8.1.0
Nvidia Driver version: 470.103.01
Tensorflow: 1.15.5

Describe the current behavior

Hi all. I have an RTX 3070 GPU on Ubuntu and I want to run TF 1.15 code. I installed TF 1.15 following this article: https://www.pugetsystems.com/labs/hpc/How-To-Install-TensorFlow-1-15-for-NVIDIA-RTX30-GPUs-without-docker-or-CUDA-install-2005/. It describes using NVIDIA's build of TensorFlow 1.15 for RTX 30xx GPUs.

However, whenever I create a session in tensorflow, one CUDA library always fails to load.

Describe the expected behavior

The tensorflow session created should use the GPU.

Code to reproduce the issue

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Other info / logs
When creating a tensorflow session, I get the following output:

2022-03-14 11:36:49.675521: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2496000000 Hz
2022-03-14 11:36:49.675927: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558665b3b9c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-14 11:36:49.675940: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-03-14 11:36:49.676506: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-03-14 11:36:49.704894: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-14 11:36:49.705222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1666] Found device 0 with properties:
name: NVIDIA GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.725
pciBusID: 0000:01:00.0
2022-03-14 11:36:49.705237: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-03-14 11:36:49.706237: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/vishesh/anaconda3/envs/tf1.15/lib/python3.8/site-packages/tensorflow_core/python/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11; LD_LIBRARY_PATH: :/usr/local/cuda-11.2/lib64:/usr/local/cuda-11.2/lib64:/home/vishesh/anaconda3/envs/altered/lib/
2022-03-14 11:36:49.721900: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-14 11:36:49.722054: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-14 11:36:49.723891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2022-03-14 11:36:49.725394: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-03-14 11:36:49.725463: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-03-14 11:36:49.725471: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1689] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-03-14 11:36:49.785284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-14 11:36:49.785307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] 0
2022-03-14 11:36:49.785310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0: N
2022-03-14 11:36:49.786443: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-14 11:36:49.786787: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558664a3a3b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-03-14 11:36:49.786796: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
2022-03-14 11:36:49.787259: I tensorflow/core/common_runtime/direct_session.cc:359] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
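The dlerror above suggests the loader is picking up the libcublas.so.11 shipped in the conda environment's pip-installed nvidia/cublas directory while libcublasLt resolves from somewhere else. One way such a mismatch arises is search-path shadowing: the dynamic loader takes the first matching library, scanning directories left to right. A self-contained demonstration with hypothetical paths:

```shell
# Demonstrates search order: directories are scanned left to right, so an
# older copy earlier in the path shadows a newer one later in the path.
tmp=$(mktemp -d)
mkdir -p "$tmp/old-cuda/lib" "$tmp/new-cuda/lib"
touch "$tmp/old-cuda/lib/libcublas.so.11" "$tmp/new-cuda/lib/libcublas.so.11"
search_path="$tmp/old-cuda/lib:$tmp/new-cuda/lib"  # stand-in for LD_LIBRARY_PATH
found=""
oldIFS=$IFS; IFS=:
for dir in $search_path; do
  if [ -z "$found" ] && [ -e "$dir/libcublas.so.11" ]; then
    found="$dir"   # first match wins, as with the real loader
  fi
done
IFS=$oldIFS
echo "loader would resolve libcublas.so.11 from: $found"
rm -rf "$tmp"
```

Checking each entry of the reported LD_LIBRARY_PATH for a stray libcublas.so.11 in the same way should reveal which copy wins.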

Pip install nvidia-pyindex keeps giving error

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Nil
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version (use command below): 2.3.1
  • Python version: 3.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 8
  • GPU model and memory: Nvidia MX130

You can collect some of this information using our environment capture
script
You can also obtain the TensorFlow version with:
  • TF 1.x: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  • TF 2.x: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

I tried to execute pip install nvidia-pyindex, but it does not work. I also tried upgrading pip, which did not help.

Describe the expected behavior

Instead, it gives this error:

Processing c:\users\muizz\desktop\nvidia-pyindex-1.0.5.tar.gz
Building wheels for collected packages: nvidia-pyindex
Building wheel for nvidia-pyindex (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\muizz\anaconda3\envs\fyp\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\muizz\AppData\Local\Temp\pip-req-build-f5b70p\setup.py'"'"'; __file__='"'"'C:\Users\muizz\AppData\Local\Temp\pip-req-build-f5b70p\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\muizz\AppData\Local\Temp\pip-wheel-xoaw05hk'
cwd: C:\Users\muizz\AppData\Local\Temp\pip-req-build-f5b70p
Complete output (25 lines):
running bdist_wheel
running build
running build_py
creating build
creating build\lib
creating build\lib\nvidia_pyindex
copying nvidia_pyindex\cmdline.py -> build\lib\nvidia_pyindex
copying nvidia_pyindex\utils.py -> build\lib\nvidia_pyindex
copying nvidia_pyindex\__init__.py -> build\lib\nvidia_pyindex
running egg_info
writing nvidia_pyindex.egg-info\PKG-INFO
writing dependency_links to nvidia_pyindex.egg-info\dependency_links.txt
writing entry points to nvidia_pyindex.egg-info\entry_points.txt
writing top-level names to nvidia_pyindex.egg-info\top_level.txt
reading manifest file 'nvidia_pyindex.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'nvidia_pyindex.egg-info\SOURCES.txt'
installing to build\bdist.win-amd64\wheel
running install
'"nvidia_pyindex uninstall"' is not recognized as an internal or external command,
operable program or batch file.
error: [WinError 2] The system cannot find the file specified
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
COMMAND: InstallCommand
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

ERROR: Failed building wheel for nvidia-pyindex
Running setup.py clean for nvidia-pyindex
Failed to build nvidia-pyindex
Installing collected packages: nvidia-pyindex
Running setup.py install for nvidia-pyindex ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\muizz\anaconda3\envs\fyp\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\muizz\AppData\Local\Temp\pip-req-build-f5b70p\setup.py'"'"'; __file__='"'"'C:\Users\muizz\AppData\Local\Temp\pip-req-build-f5b70p\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\muizz\AppData\Local\Temp\pip-record-7qi_408u\install-record.txt' --single-version-externally-managed --user --prefix= --compile --install-headers 'C:\Users\muizz\AppData\Roaming\Python\Python37\Include\nvidia-pyindex'
cwd: C:\Users\muizz\AppData\Local\Temp\pip-req-build-f5b70p
Complete output (7 lines):
running install
'"nvidia_pyindex uninstall"' is not recognized as an internal or external command,
operable program or batch file.
error: [WinError 2] The system cannot find the file specified
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
COMMAND: InstallCommand
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\muizz\anaconda3\envs\fyp\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\muizz\AppData\Local\Temp\pip-req-build-f5b70p\setup.py'"'"'; __file__='"'"'C:\Users\muizz\AppData\Local\Temp\pip-req-build-f5b70p\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\muizz\AppData\Local\Temp\pip-record-7qi_408u\install-record.txt' --single-version-externally-managed --user --prefix= --compile --install-headers 'C:\Users\muizz\AppData\Roaming\Python\Python37\Include\nvidia-pyindex' Check the logs for full command output.

Code to reproduce the issue
Pip install nvidia-pyindex

Other info / logs
Nil

[tensorboard] Unable to get local issuer certificate _ssl.c:1091 /pypi/nvidia-tensorboard

Issue

We have a corporate PyPI index (a kind of man-in-the-middle mirror), which makes nvidia-pyindex fail with:

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)'))': /pypi/nvidia-tensorboard/

Request

Would it be possible to open-source nvidia-tensorboard so we can install it via git? Or could you provide the NVIDIA PyPI index URL so we can pass it via pip install --index-url=https://pypi.some-nvidia-url nvidia-tensorboard?

System information

  • Ubuntu 20.04
  • Python 3.7.0 / 3.8.5 (same issue)

TF_ENABLE_AUTO_MIXED_PRECISION has no effect

I am using TensorFlow 2.6 inside the NGC Docker container nvcr.io/nvidia/tensorflow:21.10-tf2-py3, running inference with a pre-trained BERT model from TensorFlow Hub: https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2

After setting TF_ENABLE_AUTO_MIXED_PRECISION=1, nothing seems to happen except the warning logs below:

2021-11-05 08:14:36.094644: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-05 08:14:36.111426: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:146] TF_ENABLE_AUTO_MIXED_PRECISION has no effect.
2021-11-05 08:14:36.111448: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:146] TF_ENABLE_AUTO_MIXED_PRECISION has no effect.
2021-11-05 08:14:36.111471: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:146] TF_ENABLE_AUTO_MIXED_PRECISION has no effect.
2021-11-05 08:14:36.111484: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:146] TF_ENABLE_AUTO_MIXED_PRECISION has no effect.
2021-11-05 08:14:36.250207: W tensorflow/core/util/dump_graph.cc:134] Failed to dump before_mark_for_compilation because dump location is not  specified through either TF_DUMP_GRAPH_PREFIX environment variable or function argument.
2021-11-05 08:14:36.251315: W tensorflow/core/util/dump_graph.cc:134] Failed to dump mark_for_compilation because dump location is not  specified through either TF_DUMP_GRAPH_PREFIX environment variable or function argument.
2021-11-05 08:14:36.253102: W tensorflow/core/util/dump_graph.cc:134] Failed to dump mark_for_compilation_annotated because dump location is not  specified through either TF_DUMP_GRAPH_PREFIX environment variable or function argument.
2021-11-05 08:14:36.254376: W tensorflow/core/util/dump_graph.cc:134] Failed to dump before_increase_dynamism_for_auto_jit_pass because dump location is not  specified through either TF_DUMP_GRAPH_PREFIX environment variable or function argument.
2021-11-05 08:14:36.292182: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x7f000c0092a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

I also checked the dumped TensorFlow HLO; it seems no operations have been converted to FP16. This feature works well in TensorFlow 1.x; is it supported in nvidia-tensorflow 2.x as well?

compile from source can not git clone cudnn_frontend_archive

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Centos 7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 1.15.5 / r21.06
  • Python version: 3.7
  • Installed using virtualenv? pip? conda?:
  • Bazel version (if compiling from source): 0.25.3
  • GCC/Compiler version (if compiling from source): g++ 7.3
  • CUDA/cuDNN version: cuda 11.1 update 1, cuDNN 8.1.33
  • GPU model and memory: tesla T4

Describe the problem

git clone https://oauth2:***/cudnn/cudnn_frontend.git failed

Provide the exact sequence of commands / steps that you executed before running into the problem

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

FP16 implementation of Einsum OpKernel

System information

  • TensorFlow version (you are using): 1.15.5
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
After enabling TF_ENABLE_AUTO_MIXED_PRECISION, the Einsum operator throws the exception below, saying that Einsum does not have a corresponding FP16 implementation:

tensorflow.python.framework.errors_impl.NotFoundError: No registered 'Einsum' OpKernel for 'GPU' devices compatible with node node StatefulPartitionedCall/model/bert_encoder/transformer/layer_0/self_attention/key/einsum/Einsum (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) 
         (OpKernel was found, but attributes didn't match) Requested Attributes: N=2, T=DT_HALF, equation="abc,cde->abde", _device="/job:localhost/replica:0/task:0/device:GPU:0"
        .  Registered:  device='GPU'; T in [DT_COMPLEX128]
  device='GPU'; T in [DT_COMPLEX64]
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_COMPLEX128]
  device='CPU'; T in [DT_COMPLEX64]
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]

         [[StatefulPartitionedCall/model/bert_encoder/transformer/layer_0/self_attention/key/einsum/Einsum]]

Will this change the current api? How?

No

Who will benefit with this feature?

The Einsum operation is widely used in existing networks; an FP16 implementation will benefit researchers and engineers who rely heavily on mixed-precision training/inference.

Any Other info.
Einsum operation: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/einsum

unable to install tf 1.15.2 with dgx-a100

I have just set up a DGX A100 server. To build a bare-metal TensorFlow environment, I used Anaconda for virtual environment management.

To replicate the process, I ran through the following commands:

Installation of Anaconda

wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh
chmod 777 ./Anaconda3-2020.07-Linux-x86_64.sh
bash ./Anaconda3-2020.07-Linux-x86_64.sh
source ~/.bashrc

Prepare environment

conda create -n tf_1.15.2 python=3.7
conda activate tf_1.15.2

Nvidia Tensorflow 1.15.2 install

pip install --user nvidia-pyindex
pip install --user nvidia-tensorflow[horovod]

The following error appears when installing nvidia-tensorflow[horovod]:

Collecting nvidia-tensorflow[horovod]
  Downloading nvidia-tensorflow-0.0.1.dev0.tar.gz (3.4 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/jacky/anaconda3/envs/tf_1.15.2/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-i7t5p2bw/nvidia-tensorflow/setup.py'"'"'; __file__='"'"'/tmp/pip-install-i7t5p2bw/nvidia-tensorflow/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-8fko_xuv
         cwd: /tmp/pip-install-i7t5p2bw/nvidia-tensorflow/
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-i7t5p2bw/nvidia-tensorflow/setup.py", line 130, in <module>
        raise RuntimeError("This package should not be installed.\nPlease refer "
    RuntimeError: This package should not be installed.
    Please refer to NVIDIA instructions: https://github.com/nvidia/tensorflow.
    Your PIP command defaults to the official PyPI as a package repository.
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

The NGC Docker container is not an option on my machine. Any help would be appreciated.
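The RuntimeError comes from NVIDIA's placeholder package on the official PyPI: as its own message says, it fires when "Your PIP command defaults to the official PyPI", i.e. the nvidia-pyindex configuration never took effect. A hedged sketch of a retry inside the activated conda environment (assumption: inside conda, --user installs land in ~/.local outside the environment, so dropping --user may be what lets the index configuration apply; the commands are only assembled and printed here):

```shell
# Retry sequence inside the activated env, without --user (assumption:
# keeping nvidia-pyindex inside the env lets its index config take effect).
steps='pip install --upgrade pip
pip install nvidia-pyindex
pip install "nvidia-tensorflow[horovod]"'
echo "$steps"
```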

How to install with specific release?

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version:
  • Python version: 3.8
  • Installed using virtualenv? pip? conda?: pip
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

Describe the problem
The latest version requires Linux Ubuntu 20.04.
How do I install the previous versions?

Provide the exact sequence of commands / steps that you executed before running into the problem
Could not find a version that satisfies the requirement nvidia-tensorflow[horovod]==1.15.4+nv20.11 (from versions: 0.0.1.dev4, 1.15.4+nv20.12, 1.15.5+nv21.2, 1.15.5+nv21.3)
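The error itself lists the versions that are still resolvable on this system; pip only shows versions whose wheels match the local Python and platform tags. A sketch of pinning one of them (hypothetical choice from that list; the quotes also keep the shell from glob-expanding the brackets, and the command is only assembled and printed here):

```shell
# Pin one of the versions from the "from versions:" list in the error.
version='1.15.4+nv20.12'
echo "pip install --user 'nvidia-tensorflow[horovod]==$version'"
```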

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

TF 1.15.4 built with cuda11 cudnn8 fails to run

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes, added the cuda 11 toolchain and updated gpu_py37_full/pip.sh to use cuda11 toolchain
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.15.4
  • Python version: 3.7
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0
  • CUDA/cuDNN version: CUDA11 CUDNN8
  • GPU model and memory: mnist_cnn

You can collect some of this information using our environment capture
script
You can also obtain the TensorFlow version with:
  • TF 1.x: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  • TF 2.x: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
I installed the TensorFlow 1.15.4 binary built from source with CUDA 11. While running the MNIST test from https://github.com/keras-team/keras/blob/tf-2/examples/mnist_cnn.py, it fails and reports:
F ./tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: no kernel image is available for execution on the device
Aborted (core dumped)

Describe the expected behavior
The test should successfully run.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Build the 1.15 binary with cuda11 and run the mnist test.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

Need GPU implementation of sparse_segment_* .

System information

  • TensorFlow version (you are using): r1.15.5+nv22.02
  • Are you willing to contribute it (Yes/No): No

Describe the feature and the current behavior/state.

sparse_segment_sum and the other sparse_segment_* ops are important for deep learning models that use embedding_lookup. In the latest TensorFlow, these ops already have GPU kernels, while NVTF still has only CPU kernels.

Will this change the current api? How?

No

Who will benefit with this feature?

deep learning models using embedding_lookup.

Any Other info.

No

pip.conf is not removed upon uninstalling nvidia-pyindex

Hey there, I was very grateful for this package when installing a tricky CUDA/PyTorch/TensorFlow combo the other day. However, in the days since, I've noticed that all my pip installs were going slowly.

It turned out that the cause was the pip.conf file, whose header comment notes:

# This file has been autogenerated or modified by NVIDIA PyIndex.

and which further down includes

no-cache-dir = true

This was the cause of my slow installation (not caching any package binaries), and importantly this persisted even after removing all copies of nvidia-pyindex from all my conda environments.

I don't know if you'd agree but I see this as a bug: leaving pip.conf on a system, which then modifies the caching behaviour, is regrettable.

I thought I should open an issue to make the suggestion of removing it upon uninstallation (I'm not sure if it's possible to "clean up" after a package, but perhaps you know).

Anyway, thanks again, and of course it is at your discretion to change this or not, just wanted to bring to your attention! 👍
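Until the package cleans up after itself, removing the leftover file (or just its no-cache-dir line) restores the default caching behaviour. A sketch, assuming the standard pip config search paths on Linux (user-level ~/.config/pip/pip.conf, the legacy ~/.pip/pip.conf, and the global /etc/pip.conf):

```shell
# Look for a leftover pip.conf in the standard locations and print any
# no-cache-dir setting it still carries.
locations="$HOME/.config/pip/pip.conf $HOME/.pip/pip.conf /etc/pip.conf"
for f in $locations; do
  if [ -f "$f" ]; then
    grep -H 'no-cache-dir' "$f"
  fi
done
:  # a missing file is fine; it just means nothing is left to clean up
```

Newer pip versions also offer pip config debug, which lists every config file consulted and the values set there.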

Why do RTX 30xx cards need more GPU memory than GTX cards?

I want to understand the reason.

[environment]

pc_1 : gtx 1080ti(11G), cuda-10, tensorflow-gpu==1.13
pc_2 : rtx 2080ti(11G), cuda-10, tensorflow-gpu==1.15
pc_3 : rtx 2080ti(11G), cuda-11.0, tensorflow-gpu==1.15
pc_4 : rtx 3080(10G), cuda-11.1, nvidia-tensorflow==r1.15.4-20.11

I loaded a weight file with a GPU memory fraction of 1.5 GB on pc_1 through pc_3. Yesterday I tried loading the same weight file with a 1.5 GB memory fraction on pc_4, but it failed; however, I could load it with a memory fraction of about 5.7 GB on pc_4.

Why does the same weight file require more GPU memory?
I couldn't find an answer anywhere. I suspect the RTX 30xx series or nvidia-tensorflow is the reason.

Build for python 3.7

I'd like to understand: are there any limitations to building nvidia-tensorflow for Python 3.7?

CentOS stream 8 r1.15.5+nv21.12 compile error Eigen:QInt32

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS Stream release 8 kernel: 4.18.0-348.2.1.el8_5.x86_64
  • TensorFlow installed from (source or binary): r1.15.5+nv21.12
  • Bazel version (if compiling from source): 0.25.3
  • GCC/Compiler version (if compiling from source): 8.5.0 20210514 (Red Hat 8.5.0-6)
  • CUDA/cuDNN version: cuda11.5 / cudnn 8.3.1.22
  • GPU model and memory:A100 / 40G

export TF_NEED_CUDA=1
export TF_NEED_TENSORRT=1
export TF_TENSORRT_VERSION=8
export TF_CUDA_PATHS=/usr,/usr/local/cuda
export TF_CUDA_VERSION=11.5
export TF_CUBLAS_VERSION=11
export TF_CUDNN_VERSION=8
export TF_NCCL_VERSION=2
export TF_CUDA_COMPUTE_CAPABILITIES="7.0,8.0"
export TF_ENABLE_XLA=1
export TF_NEED_HDFS=0
export CC_OPT_FLAGS="-march=native -mtune=native"

Describe the problem
ERROR: ./tensorflow/core/kernels/BUILD:788:1: C++ compilation of rule '//tensorflow/core/kernels:eigen_contraction_kernel_with_mkl' failed (Exit 1)
In file included from ./third_party/eigen3/unsupported/Eigen/CXX11/FixedPoint:35,
from ./tensorflow/core/kernels/eigen_contraction_kernel.h:39,
from tensorflow/core/kernels/eigen_contraction_kernel.cc:16:
./third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX512.h: In function ‘typename Eigen::internal::unpacket_traits::type Eigen::internal::predux_min(const Packet&) [with Packet = Eigen::internal::Packet16q32i; typename Eigen::internal::unpacket_traits::type = Eigen::QInt32]’:
./third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX512.h:432:16: error: could not convert ‘Eigen::internal::pfirst<__vector(2) long long int>(_mm_min_epi32(res.Eigen::internal::eigen_packet_wrapper<__vector(2) long long int, 0>::operator __vector(2) long long int&(), _mm_shuffle_epi32(res.Eigen::internal::eigen_packet_wrapper<__vector(2) long long int, 0>::operator __vector(2) long long int&(), ((((0 << 6) | (0 << 4)) | (0 << 2)) | 1))))’ from ‘Eigen::internal::unpacket_traits<__vector(2) long long int>::type’ {aka ‘__vector(2) long long int’} to ‘Eigen::QInt32’
return pfirst(
~~~~~~^
_mm_min_epi32(res, _mm_shuffle_epi32(res, _MM_SHUFFLE(0, 0, 0, 1))));
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Provide the exact sequence of commands / steps that you executed before running into the problem
(base) $ yes ""| ./configure
WARNING: Output base './.cache/bazel/_bazel_chunxie/b3a1696304cedbe15049d6664790de6a' is on NFS. This may lead to surprising failures and undetermined behavior.
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.25.3 installed.
Please specify the location of python. [Default is /usr/bin/python]:

Found possible Python library paths:
/usr/lib64/python3.6/site-packages
/usr/local/lib64/python3.6/site-packages
/usr/local/lib/python3.6/site-packages
/usr/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [/usr/lib64/python3.6/site-packages]
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: No ROCm support will be enabled for TensorFlow.

Found CUDA 11.5 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 8 in:
/usr/lib64
/usr/include
Found TensorRT 8 in:
/usr/lib64
/usr/include
Found NCCL 2 in:
/usr/lib64
/usr/include
Do you want to use clang as CUDA compiler? [y/N]: nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:

Do you wish to build TensorFlow with MPI support? [y/N]: No MPI support will be enabled for TensorFlow.

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v2 # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished

  • Bazel version (if compiling from source): 0.25.3
  • GCC/Compiler version (if compiling from source): 8.5.0 20210514 (Red Hat 8.5.0-6)
  • CUDA/cuDNN version: cuda11.5 / cudnn 8.3.1.22
  • GPU model and memory:A100 / 40G

export TF_NEED_CUDA=1
export TF_NEED_TENSORRT=1
export TF_TENSORRT_VERSION=8
export TF_CUDA_PATHS=/usr,/usr/local/cuda
export TF_CUDA_VERSION=11.5
export TF_CUBLAS_VERSION=11
export TF_CUDNN_VERSION=8
export TF_NCCL_VERSION=2
export TF_CUDA_COMPUTE_CAPABILITIES="7.0,8.0"
export TF_ENABLE_XLA=1
export TF_NEED_HDFS=0
export CC_OPT_FLAGS="-march=native -mtune=native"

Describe the problem
ERROR: ./tensorflow/core/kernels/BUILD:788:1: C++ compilation of rule '//tensorflow/core/kernels:eigen_contraction_kernel_with_mkl' failed (Exit 1)
In file included from ./third_party/eigen3/unsupported/Eigen/CXX11/FixedPoint:35,
from ./tensorflow/core/kernels/eigen_contraction_kernel.h:39,
from tensorflow/core/kernels/eigen_contraction_kernel.cc:16:
./third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX512.h: In function ‘typename Eigen::internal::unpacket_traits::type Eigen::internal::predux_min(const Packet&) [with Packet = Eigen::internal::Packet16q32i; typename Eigen::internal::unpacket_traits::type = Eigen::QInt32]’:
./third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX512.h:432:16: error: could not convert ‘Eigen::internal::pfirst<__vector(2) long long int>(_mm_min_epi32(res.Eigen::internal::eigen_packet_wrapper<__vector(2) long long int, 0>::operator __vector(2) long long int&(), _mm_shuffle_epi32(res.Eigen::internal::eigen_packet_wrapper<__vector(2) long long int, 0>::operator __vector(2) long long int&(), ((((0 << 6) | (0 << 4)) | (0 << 2)) | 1))))’ from ‘Eigen::internal::unpacket_traits<__vector(2) long long int>::type’ {aka ‘__vector(2) long long int’} to ‘Eigen::QInt32’
return pfirst(
~~~~~~^
_mm_min_epi32(res, _mm_shuffle_epi32(res, _MM_SHUFFLE(0, 0, 0, 1))));
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
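The failing header is Eigen's AVX512 FixedPoint specialization, which is only compiled when the host compiler enables AVX512, e.g. via `CC_OPT_FLAGS="-march=native"` on an AVX512-capable CPU. As a hedged workaround sketch (an assumption based on the file name in the error, not confirmed against this exact build), keeping the optimization flags below the AVX512 baseline sidesteps that code path:

```shell
# Possible workaround (assumption: the error originates from the AVX512
# FixedPoint code path selected by -march=native on this host).
# Option 1: pick a pre-AVX512 baseline instead of -march=native:
export CC_OPT_FLAGS="-march=broadwell -mtune=broadwell"

# Option 2: keep native tuning but mask AVX512 at build time
# (flag name is the standard GCC option, added here as a suggestion):
# bazel build --copt=-mno-avx512f //tensorflow/tools/pip_package:build_pip_package
```

Either variant would need a fresh `./configure` run so the new flags are picked up.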

Provide the exact sequence of commands / steps that you executed before running into the problem
(base) $ yes ""| ./configure
WARNING: Output base './.cache/bazel/_bazel_chunxie/b3a1696304cedbe15049d6664790de6a' is on NFS. This may lead to surprising failures and undetermined behavior.
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.25.3 installed.
Please specify the location of python. [Default is /usr/bin/python]:

Found possible Python library paths:
/usr/lib64/python3.6/site-packages
/usr/local/lib64/python3.6/site-packages
/usr/local/lib/python3.6/site-packages
/usr/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [/usr/lib64/python3.6/site-packages]
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: No ROCm support will be enabled for TensorFlow.

Found CUDA 11.5 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 8 in:
/usr/lib64
/usr/include
Found TensorRT 8 in:
/usr/lib64
/usr/include
Found NCCL 2 in:
/usr/lib64
/usr/include
Do you want to use clang as CUDA compiler? [y/N]: nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:

Do you wish to build TensorFlow with MPI support? [y/N]: No MPI support will be enabled for TensorFlow.

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v2 # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
