
intel / models

Intel® AI Reference Models: contains Intel optimizations for running deep learning workloads on Intel® Xeon® Scalable processors and Intel® Data Center GPUs

License: Apache License 2.0

Makefile 0.11% Python 87.15% Shell 9.66% Jupyter Notebook 2.03% Dockerfile 0.70% C 0.15% C++ 0.16% Batchfile 0.02% PowerShell 0.01%
ai cpu deep-learning deep-neural-networks inference performance tensorflow

models's Introduction

Intel® AI Reference Models

This repository contains links to pre-trained models, sample scripts, best practices, and step-by-step tutorials for many popular open-source machine learning models optimized by Intel to run on Intel® Xeon® Scalable processors and Intel® Data Center GPUs.

Containers for running the workloads can be found at the Intel® Developer Catalog.

A Jupyter Notebook version of the Intel® AI Reference Models is also available for the listed workloads.

Purpose of Intel® AI Reference Models

Intel optimizes popular deep learning frameworks such as TensorFlow* and PyTorch* by contributing to the upstream projects. Additional optimizations are built into plugins/extensions such as the Intel® Extension for PyTorch* and the Intel® Extension for TensorFlow*. Popular neural network models running against common datasets are the target workloads that drive these optimizations.
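As a rough illustration of how those extensions are typically used, here is a minimal sketch (not taken from this repository) that assumes the intel_extension_for_pytorch and torchvision packages are installed separately:

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

# ipex.optimize() applies Intel-specific operator fusion and weight layout changes;
# dtype=torch.bfloat16 selects BFloat16 execution on CPUs that support it.
model = models.resnet50().eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model(torch.randn(1, 3, 224, 224))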

The purpose of the Intel® AI Reference Models repository (and associated containers) is to quickly replicate the complete software environment that demonstrates the best-known performance of each of these target model/dataset combinations. When executed in optimally-configured hardware environments, these software environments showcase the AI capabilities of Intel platforms.

DISCLAIMER: These scripts are not intended for benchmarking Intel platforms. For any performance and/or benchmarking information on specific Intel platforms, visit https://www.intel.ai/blog.

Intel is committed to the respect of human rights and avoiding complicity in human rights abuses, a policy reflected in the Intel Global Human Rights Principles. Accordingly, by accessing the Intel material on this platform you agree that you will not use the material in a product or application that causes or contributes to a violation of an internationally recognized human right.

License

The Intel® AI Reference Models is licensed under Apache License Version 2.0.

Datasets

To the extent that any public datasets are referenced by Intel or accessed using tools or code on this site, those datasets are provided by the third party indicated as the data source. Intel does not create the data, or datasets, and does not warrant their accuracy or quality. By accessing the public dataset(s) you agree to the terms associated with those datasets and that your use complies with the applicable license.

Please check the list of datasets used in Intel® AI Reference Models in the datasets directory.

Intel expressly disclaims the accuracy, adequacy, or completeness of any public datasets, and is not liable for any errors, omissions, or defects in the data, or for any reliance on the data. Intel is not liable for any liability or damages relating to your use of public datasets.

Use cases

The model documentation in the tables below has information on the prerequisites to run each model. The model scripts run on Linux. Certain models are also able to run using bare metal on Windows. For more information and a list of models that are supported on Windows, see the documentation here.

Instructions are available to run on Sapphire Rapids.

For best performance on Intel® Data Center GPU Flex and Max Series, please check the list of supported workloads. It provides instructions to run inference and training using the Intel® Extension for PyTorch or the Intel® Extension for TensorFlow.
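For reference, a minimal sketch (not from this repository) of how a PyTorch model is typically run on these GPUs, assuming the GPU build of the Intel® Extension for PyTorch is installed and exposes the "xpu" device:

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex  # registers the "xpu" device (assumed GPU build)

model = models.resnet50().eval().to("xpu")
model = ipex.optimize(model, dtype=torch.float16)

with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
    output = model(torch.randn(1, 3, 224, 224).to("xpu"))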

Image Recognition

Model Framework Mode Model Documentation Benchmark/Test Dataset
DenseNet169 TensorFlow Inference FP32 ImageNet 2012
Inception V3 TensorFlow Inference Int8 FP32 ImageNet 2012
MobileNet V1* TensorFlow Inference Int8 FP32 BFloat16 ImageNet 2012
MobileNet V1* Sapphire Rapids TensorFlow Inference Int8 FP32 BFloat16 BFloat32 ImageNet 2012
MobileNet V2 TensorFlow Inference FP32 BFloat16 Int8 ImageNet 2012
ResNet 101 TensorFlow Inference Int8 FP32 ImageNet 2012
ResNet 50v1.5 TensorFlow Inference Int8 FP32 BFloat16 FP16 ImageNet 2012
ResNet 50v1.5 Sapphire Rapids TensorFlow Inference Int8 FP32 BFloat16 BFloat32 ImageNet 2012
ResNet 50v1.5 TensorFlow Training FP32 BFloat16 FP16 ImageNet 2012
ResNet 50v1.5 Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 ImageNet 2012
Inception V3 TensorFlow Serving Inference FP32 Synthetic Data
ResNet 50v1.5 TensorFlow Serving Inference FP32 Synthetic Data
GoogLeNet PyTorch Inference FP32 BFloat16 ImageNet 2012
Inception v3 PyTorch Inference FP32 BFloat16 ImageNet 2012
MNASNet 0.5 PyTorch Inference FP32 BFloat16 ImageNet 2012
MNASNet 1.0 PyTorch Inference FP32 BFloat16 ImageNet 2012
ResNet 50 PyTorch Inference Int8 FP32 BFloat16 BFloat32 ImageNet 2012
ResNet 50 PyTorch Training FP32 BFloat16 BFloat32 ImageNet 2012
ResNet 101 PyTorch Inference FP32 BFloat16 ImageNet 2012
ResNet 152 PyTorch Inference FP32 BFloat16 ImageNet 2012
ResNext 32x4d PyTorch Inference FP32 BFloat16 ImageNet 2012
ResNext 32x16d PyTorch Inference Int8 FP32 BFloat16 BFloat32 ImageNet 2012
VGG-11 PyTorch Inference FP32 BFloat16 ImageNet 2012
VGG-11 with batch normalization PyTorch Inference FP32 BFloat16 ImageNet 2012
Wide ResNet-50-2 PyTorch Inference FP32 BFloat16 ImageNet 2012
Wide ResNet-101-2 PyTorch Inference FP32 BFloat16 ImageNet 2012

Image Segmentation

Model Framework Mode Model Documentation Benchmark/Test Dataset
3D U-Net MLPerf* TensorFlow Inference FP32 BFloat16 Int8 BRATS 2019
3D U-Net MLPerf* Sapphire Rapids TensorFlow Inference FP32 BFloat16 Int8 BFloat32 BRATS 2019
MaskRCNN TensorFlow Inference FP32 MS COCO 2014
UNet TensorFlow Inference FP32

Language Modeling

Model Framework Mode Model Documentation Benchmark/Test Dataset
BERT large TensorFlow Inference FP32 BFloat16 FP16 SQuAD
BERT large TensorFlow Training FP32 BFloat16 FP16 SQuAD and MRPC
BERT large Sapphire Rapids TensorFlow Inference FP32 BFloat16 Int8 BFloat32 SQuAD
BERT large Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 SQuAD
DistilBERT base TensorFlow Inference FP32 BFloat16 Int8 FP16 SST-2
BERT base PyTorch Inference FP32 BFloat16 BERT Base SQuAD1.1
BERT large PyTorch Inference FP32 Int8 BFloat16 BFloat32 BERT Large SQuAD1.1
BERT large PyTorch Training FP32 BFloat16 BFloat32 preprocessed text dataset
DistilBERT base PyTorch Inference FP32 Int8-FP32 Int8-BFloat16 BFloat16 BFloat32 DistilBERT Base SQuAD1.1
RNN-T PyTorch Inference FP32 BFloat16 BFloat32 RNN-T dataset
RNN-T PyTorch Training FP32 BFloat16 BFloat32 RNN-T dataset
RoBERTa base PyTorch Inference FP32 BFloat16 RoBERTa Base SQuAD 2.0
T5 PyTorch Inference FP32 Int8

Language Translation

Model Framework Mode Model Documentation Benchmark/Test Dataset
BERT TensorFlow Inference FP32 MRPC
Transformer_LT_mlperf* TensorFlow Inference FP32 BFloat16 Int8 WMT English-German data
Transformer_LT_mlperf* Sapphire Rapids TensorFlow Inference FP32 BFloat16 Int8 BFloat32 WMT English-German dataset
Transformer_LT_mlperf* TensorFlow Training FP32 BFloat16 WMT English-German dataset
Transformer_LT_mlperf* Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 WMT English-German dataset
Transformer_LT_Official TensorFlow Inference FP32 WMT English-German dataset
Transformer_LT_Official TensorFlow Serving Inference FP32

Object Detection

Model Framework Mode Model Documentation Benchmark/Test Dataset
SSD-MobileNet* TensorFlow Inference Int8 FP32 BFloat16 COCO 2017 validation dataset
SSD-MobileNet* Sapphire Rapids TensorFlow Inference Int8 FP32 BFloat16 BFloat32 COCO 2017 validation dataset
SSD-ResNet34* TensorFlow Inference Int8 FP32 BFloat16 COCO 2017 validation dataset
SSD-ResNet34* Sapphire Rapids TensorFlow Inference Int8 FP32 BFloat16 BFloat32 COCO 2017 validation dataset
SSD-ResNet34 TensorFlow Training FP32 BFloat16 COCO 2017 training dataset
SSD-ResNet34 Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 COCO 2017 training dataset
SSD-MobileNet TensorFlow Serving Inference FP32
Faster R-CNN ResNet50 FPN PyTorch Inference FP32 BFloat16 COCO 2017
Mask R-CNN PyTorch Inference FP32 BFloat16 BFloat32 COCO 2017
Mask R-CNN PyTorch Training FP32 BFloat16 BFloat32 COCO 2017
Mask R-CNN ResNet50 FPN PyTorch Inference FP32 BFloat16 COCO 2017
RetinaNet ResNet-50 FPN PyTorch Inference FP32 BFloat16 COCO 2017
SSD-ResNet34 PyTorch Inference FP32 Int8 BFloat16 BFloat32 COCO 2017
SSD-ResNet34 PyTorch Training FP32 BFloat16 BFloat32 COCO 2017

Recommendation

Model Framework Mode Model Documentation Benchmark/Test Dataset
DIEN TensorFlow Inference FP32 BFloat16 DIEN dataset
DIEN Sapphire Rapids TensorFlow Inference FP32 BFloat16 BFloat32 DIEN dataset
DIEN TensorFlow Training FP32 DIEN dataset
DIEN Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 DIEN dataset
Wide & Deep TensorFlow Inference FP32 Census Income dataset
Wide & Deep Large Dataset TensorFlow Inference Int8 FP32 Large Kaggle Display Advertising Challenge dataset
DLRM PyTorch Inference FP32 Int8 BFloat16 BFloat32 Criteo Terabyte
DLRM PyTorch Training FP32 BFloat16 BFloat32 Criteo Terabyte
DLRM v2 PyTorch Inference FP32 FP16 BFloat16 BFloat32 Int8 Criteo 1TB Click Logs dataset
DLRM v2 PyTorch Training FP32 FP16 BFloat16 BFloat32 Random dataset
MEMREC-DLRM PyTorch Inference FP32 Criteo Terabyte

Diffusion

Model Framework Mode Model Documentation Benchmark/Test Dataset
Stable Diffusion TensorFlow Inference FP32 BFloat16 FP16 COCO 2017 validation dataset
Stable Diffusion PyTorch Inference FP32 BFloat16 FP16 BFloat32 Int8-FP32 Int8-BFloat16 COCO 2017 validation dataset
Stable Diffusion PyTorch Training FP32 BFloat16 FP16 BFloat32 cat images

Shot Boundary Detection

Model Framework Mode Model Documentation Benchmark/Test Dataset
TransNetV2 PyTorch Inference FP32 BFloat16 Synthetic Data

AI Drug Design (AIDD)

Model Framework Mode Model Documentation Benchmark/Test Dataset
AlphaFold2 PyTorch Inference FP32 AF2Dataset

* Means the model belongs to the MLPerf suite and will be supported long-term.

Intel® Data Center GPU Workloads

Model Framework Mode GPU Type Model Documentation
ResNet 50v1.5 TensorFlow Inference Flex Series Float32 TF32 Float16 BFloat16 Int8
ResNet 50 v1.5 TensorFlow Training Max Series BFloat16 FP32
ResNet 50 v1.5 PyTorch Inference Flex Series, Max Series, Arc Series Int8 FP32 FP16 TF32
ResNet 50 v1.5 PyTorch Training Max Series, Arc Series BFloat16 TF32 FP32
DistilBERT PyTorch Inference Flex Series, Max Series FP32 FP16 BF16 TF32
DLRM v1 PyTorch Inference Flex Series FP16 FP32
SSD-MobileNet* TensorFlow Inference Flex Series Int8
SSD-MobileNet* PyTorch Inference Arc Series INT8 FP16 FP32
EfficientNet PyTorch Inference Flex Series FP16 FP32
EfficientNet TensorFlow Inference Flex Series FP16
Wide Deep Large Dataset TensorFlow Inference Flex Series FP16
YOLO V5 PyTorch Inference Flex Series FP16
BERT large PyTorch Inference Max Series, Arc Series BFloat16 FP32 FP16
BERT large PyTorch Training Max Series, Arc Series BFloat16 FP32 TF32
BERT large TensorFlow Training Max Series BFloat16 TF32 FP32
DLRM v2 PyTorch Training Max Series FP32 TF32 BF16
3D-Unet PyTorch Inference Max Series FP16 INT8 FP32
3D-Unet TensorFlow Training Max Series BFloat16 FP32
Stable Diffusion PyTorch Inference Flex Series, Max Series, Arc Series FP16 FP32
Stable Diffusion TensorFlow Inference Flex Series FP16 FP32
Mask R-CNN TensorFlow Inference Flex Series FP32 Float16
Mask R-CNN TensorFlow Training Max Series FP32 BFloat16
Swin Transformer PyTorch Inference Flex Series FP16
FastPitch PyTorch Inference Flex Series FP16
UNet++ PyTorch Inference Flex Series FP16
RNN-T PyTorch Inference Max Series FP16 BF16 FP32
RNN-T PyTorch Training Max Series FP32 BF16 TF32

How to Contribute

If you would like to add a new benchmarking script, please use this guide.

models's People

Contributors

ashahba, blzheng, chuanqi129, chunyuan-w, claynerobison, cuixiaom, cuiyifeng, dmsuehir, dzungductran, jiayisunx, jitendra42, jojivk73, karthikvadla, ldurka, leslie-fang-intel, liangan1, mahathi-vatsal, mahmoud-abuzaina, mhbuehler, mjkyung, moonjkyung, s1113950, sramakintel, venky-intel, wafaat, weizhuozhang-intel, xiaobingsuper, yanbing-j, zantares, zhuhaozhe


models's Issues

How to evaluate the performance number of Bert-Large training

I ran into some confusion when I followed the guide (https://github.com/IntelAI/models/tree/master/benchmarks/language_modeling/tensorflow/bert_large) to run the training workload.

Running command:

nohup python ./launch_benchmark.py \
    --model-name=bert_large \
    --precision=fp32 \
    --mode=training \
    --framework=tensorflow \
    --batch-size=24 --mpi_num_processes=2 \
    --benchmark-only \
    --docker-image intel/intel-optimized-tensorflow:2.3.0 \
    --volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
    --volume $SQUAD_DIR:$SQUAD_DIR \
    --data-location=$BERT_LARGE_DIR \
    --num-intra-threads=26 \
    --num-inter-threads=1 \
-- train-option=SQuAD DEBIAN_FRONTEND=noninteractive config_file=$BERT_LARGE_DIR/bert_config.json init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt vocab_file=$BERT_LARGE_DIR/vocab.txt train_file=$SQUAD_DIR/train-v1.1.json predict_file=$SQUAD_DIR/dev-v1.1.json do-train=True learning-rate=1.5e-5 max-seq-length=384 do_predict=True warmup-steps=0 num_train_epochs=2 doc_stride=128 do_lower_case=False experimental-gelu=False mpi_workers_sync_gradients=True >> training-0609 &

Result:

INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
I0610 01:09:58.730417 140427424720704 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
INFO:tensorflow:Processing example: 9000
I0610 01:13:27.192351 140160153200448 run_squad.py:1363] Processing example: 9000
INFO:tensorflow:Processing example: 10000
I0610 01:17:27.623694 140160153200448 run_squad.py:1363] Processing example: 10000
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625470 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625671 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
I0610 01:20:36.625791 140160153200448 run_squad.py:797] Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json
I0610 01:20:36.625833 140160153200448 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json

I didn't see the "throughput((num_processed_examples-threshold_examples)/Elapsedtime)" information in the training log that the inference workload provides. I also read the script code, models/models/language_modeling/tensorflow/bert_large/training/fp32/run_squad.py, and found nothing about "throughput". The ./models/models/language_modeling/tensorflow/bert_large/inference/run_squad.py used by inference, however, does have code for "throughput((num_processed_examples-threshold_examples)/Elapsedtime)".

So how do I evaluate the performance of BERT-Large training when there is neither "throughput" nor "Elapsedtime" in the log or the running script?
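For what it's worth, a hedged sketch of the inference-style formula quoted above, applied by hand to a training run; the helper below is illustrative and not part of the repository, and the example counts and elapsed time would come from timing the run externally:

def training_throughput(num_processed_examples, warmup_examples, elapsed_seconds):
    # Same formula the inference run_squad.py reports:
    # (num_processed_examples - threshold_examples) / elapsed_time
    return (num_processed_examples - warmup_examples) / elapsed_seconds

# Example: 10000 examples processed, no warm-up excluded, 3600 s wall time
print(training_throughput(10000, 0, 3600.0))  # ~2.78 examples/sec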

@ashahba @dmsuehir

Thanks

Need clarification upon Wide&Deep model input shapes

I received a W&D model for enabling inference using OpenVINO, but I encountered an inconsistency in the model: GatherND at the beginning generates an output shape that is unacceptable to Reshape. Could you please tell me what input shapes are expected for new_categorical_placeholder and new_numeric_placeholder? For example, I tried using shapes equal to [2,1] with no success.

Performance issue in /models/recommendation/tensorflow (by P3)

Hello! I've found a performance issue in /wide_deep/inference/fp32/wide_deep_inference.py: dataset.batch(batch_size) (line 192) should be called before dataset.map(parse_csv, num_parallel_calls=5) (line 187), which could make your program more efficient.

Here is the tensorflow document to support it.

Besides, you need to check whether the function parse_csv called in dataset.map(parse_csv, num_parallel_calls=5) is affected by the change, to make the changed code work properly. For example, if parse_csv needs data with shape (x, y, z) as its input before the fix, it would require data with shape (batch_size, x, y, z) after the fix.
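A minimal sketch of the suggested reordering (not the repository's code; parse_csv here is a stand-in for the function in wide_deep_inference.py and, as noted above, would need to accept a whole batch of rows):

import tensorflow as tf

def parse_csv(lines):
    # Stand-in for the repo's parse_csv; after the reordering it receives a
    # batch of CSV lines (shape [batch_size]) instead of a single line.
    return tf.io.decode_csv(lines, record_defaults=[[0.0]] * 40)  # defaults are illustrative

dataset = tf.data.TextLineDataset(["train.csv"])         # placeholder input file
dataset = dataset.batch(512)                             # batch first ...
dataset = dataset.map(parse_csv, num_parallel_calls=5)   # ... then a vectorized map
dataset = dataset.prefetch(1)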

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.

launch_benchmark.py error dataset file not found

When running python launch_benchmark.py
--data-location /home/user/coco/output/
--in-graph /home/user/ssd_resnet34_fp32_bs1_pretrained_model.pb
--model-source-dir /home/user/tensorflow/models
--model-name ssd-resnet34
--framework tensorflow
--precision fp32
--mode inference
--socket-id 0
--batch-size=1
--docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14
--accuracy-only
I get an error:

coco tf records file not found.
I found the root cause:
During step 4 of the BKM, create_coco_tf_record.py generates tf_records with names that are not compatible with the benchmark script. The benchmark script also looks for a directory "dataset" that is not created by TensorFlow. I had to change the coco directory structure and the names of the tf_records to be able to read the coco tf_records with the benchmark script, which accepts only names of the form "validation-00001-of-00010", while TensorFlow creates records with names of the form "coco_val.record-00001-of-00010".
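A sketch of the renaming workaround described above; the paths and shard counts are placeholders, and it assumes the records produced by create_coco_tf_record.py sit directly in the output directory:

import glob
import os
import shutil

src_dir = "/home/user/coco/output"          # where create_coco_tf_record.py wrote the shards
dst_dir = os.path.join(src_dir, "dataset")  # directory name the benchmark script expects
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "coco_val.record-*")):
    shard = os.path.basename(path).split("record-")[1]   # e.g. "00001-of-00010"
    shutil.move(path, os.path.join(dst_dir, "validation-" + shard))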

INT8 model source

Hi,

I have a question regarding quantization. How do I get a model in an 8-bit representation, like MobileNet V1 INT8?

Is it possible to convert a model that was trained with TF Quantization-Aware Training and contains FakeQuantizeWithMinMax nodes into such a representation?

Unable to find image 'gcr.io/deeplearning-platform-release/tf-cpu.1-14:latest' locally docker: Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers).

Is this docker image broken now?

docker run -d -p 8080:8080 -v /home:/home  gcr.io/deeplearning-platform-release/tf-cpu.1-14


Unable to find image 'gcr.io/deeplearning-platform-release/tf-cpu.1-14:latest' locally
docker: Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers).
See 'docker run --help'.

The scripts don't expose a "--steps" parameter

Hi,
I am running RN50 workloads, and I want to set the "--steps=1000" parameter with the scripts of
models/benchmarks/launch_benchmark.py
My command would be:
python models/benchmarks/launch_benchmark.py --in-graph /root/rsn50_frozen_max_min.pb --model-name resnet50 --framework tensorflow --precision int8 --mode inference --batch-size=1 --socket-id 0 --benchmark-only --docker-image tensorflow/tensorflow-estimator:latest-mkl --data-location /root/tensorflow/dataset/TF_Imagenet_FullData --num-inter-threads=1 --num-intra-threads=26 --steps=1000

The "--steps" parameter doesn't work, and I had to modify the function "add_steps_args()" in the "benchmarks/common/tensorflow/start.sh" script.

KMP thread questions about RN50

When I am running the RN50 models:

python models/benchmarks/launch_benchmark.py --in-graph /home/testRN50/resnet50_int8_pretrained_model.pb --model-name resnet50 --framework tensorflow --precision int8 --mode inference --socket-id 0 --batch-size=128 --benchmark-only -- warmup_steps=50 steps=500

I see the KMP output:

OMP: Info #250: KMP_AFFINITY: pid 67001 tid 67144 thread 0 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 67001 tid 67144 thread 1 bound to OS proc set 1

Why would the same tid (67144) spawn two threads?

launch_benchmark.py: ImportError: No module named 'pycocotools._mask'

When running python launch_benchmark.py
--data-location /home/user/coco/output/
--in-graph /home/user/ssd_resnet34_fp32_bs1_pretrained_model.pb
--model-source-dir /home/user/tensorflow/models
--model-name ssd-resnet34
--framework tensorflow
--precision fp32
--mode inference
--socket-id 0
--batch-size=1
--docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14
--accuracy-only
I get an error:

Inference for accuracy check.
Traceback (most recent call last):
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/models/ssd_model.py", line 507, in postprocess
import coco_metric # pylint: disable=g-import-not-at-top
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/coco_metric.py", line 32, in
from pycocotools.coco import COCO
File "/workspace/models/research/pycocotools/coco.py", line 55, in
from . import mask as maskUtils
File "/workspace/models/research/pycocotools/mask.py", line 3, in
import pycocotools._mask as _mask
ImportError: No module named 'pycocotools._mask'

The PYTHONPATH is :"/home/user/Tensorflowmodels/models/research:/home/user/Tensorflowmodels/models/research/slim"

/home/user/cocoapi/PythonAPI was compiled with python3.6 and pycocotools was copied to /home/user/Tensorflowmodels/models/research.

The /home/user/IntelModelsAI/benchmarks/launch_benchmark.py is also run with python3.6.
I have spent some time debugging the issue without success.

I hit an error when I ran TensorFlow on multiple nodes

hi,
I hit an error when I ran TensorFlow on multiple nodes.

This is my command:

python launch_benchmark.py \

     --verbose \
     --model-name=resnet50v1_5 \
     --precision=fp32 \
     --mode=training \
     --framework tensorflow \
     --noinstall \
     --checkpoint=/home/mount_dir/hys/modles/checkpoints \
     --data-location=/home/mount_dir/wj/ImageNet/data/tf_images \
     --mpi_hostnames='c1,head' \
     --mpi_num_processes=4 2>&1

This is the error encountered:

SOCKET_ID: -1
MODEL_NAME: resnet50v1_5
MODE: training
PRECISION: fp32
BATCH_SIZE: -1
NUM_CORES: -1
BENCHMARK_ONLY: True
ACCURACY_ONLY: False
OUTPUT_RESULTS: False
DISABLE_TCMALLOC: True
TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD: 2147483648
NOINSTALL: True
OUTPUT_DIR: /home/mount_dir/hys/models/benchmarks/common/tensorflow/logs
MPI_NUM_PROCESSES: 4
MPI_NUM_PEOCESSES_PER_SOCKET: 1
MPI_HOSTNAMES: c1,head
NUMA_CORES_PER_INSTANCE: None
PYTHON_EXE: /opt/intel/oneapi/tensorflow/2.2.0/bin/python
PYTHONPATH:
DRY_RUN:

/bin/sh: numactl: command not found
[mpiexec@head] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument x
[mpiexec@head] Similar arguments:
[mpiexec@head] demux
[mpiexec@head] s
[mpiexec@head] n
[mpiexec@head] enable-x
[mpiexec@head] f
[mpiexec@head] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@head] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1350): error parsing input array
[mpiexec@head] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1755): error parsing parameters
num_inter_threads: 1
num_intra_threads: 26
Received these standard args: Namespace(accuracy_only=False, backbone_model=None, batch_size=64, benchmark_dir='/home/mount_dir/hys/models/benchmarks', benchmark_only=True, checkpoint='/home/mount_dir/hys/modles/checkpoints', data_location='/home/mount_dir/wj/ImageNet/data/tf_images', data_num_inter_threads=None, data_num_intra_threads=None, disable_tcmalloc=True, epochsbtwevals=1, experimental_gelu=False, framework='tensorflow', input_graph=None, intelai_models='/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5', mode='training', model_args=[], model_name='resnet50v1_5', model_source_dir=None, mpi=None, mpi_hostnames=None, num_cores=-1, num_instances=1, num_inter_threads=1, num_intra_threads=26, num_mpi=1, num_train_steps=1, numa_cores_per_instance=None, optimized_softmax=True, output_dir='/home/mount_dir/hys/models/benchmarks/common/tensorflow/logs', output_results=False, precision='fp32', socket_id=-1, steps=112590, tcmalloc_large_alloc_report_threshold=2147483648, tf_serving_version='master', trainepochs=72, use_case='image_recognition', verbose=True)
Received these custom args: []
Current directory: /home/mount_dir/hys/models/benchmarks
Running: mpirun -x LD_LIBRARY_PATH -x PYTHONPATH --allow-run-as-root -n 4 -H c1:2,head:2 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 --bind-to none --map-by slot /opt/intel/oneapi/tensorflow/2.2.0/bin/python /home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5/training/mlperf_resnet/imagenet_main.py 2 --batch_size=64 --max_train_steps=112590 --train_epochs=72 --epochs_between_evals=1 --inter_op_parallelism_threads 1 --intra_op_parallelism_threads 26 --version 1 --resnet_size 50 --data_dir=/home/mount_dir/wj/ImageNet/data/tf_images --model_dir=/home/mount_dir/hys/modles/checkpoints
PYTHONPATH: :/home/mount_dir/hys/models/benchmarks/../models/common/tensorflow:/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5:/home/mount_dir/hys/models/benchmarks:/home/mount_dir/hys/models/benchmarks
RUNCMD: /opt/intel/oneapi/tensorflow/2.2.0/bin/python common/tensorflow/run_tf_benchmark.py --framework=tensorflow --use-case=image_recognition --model-name=resnet50v1_5 --precision=fp32 --mode=training --benchmark-dir=/home/mount_dir/hys/models/benchmarks --intelai-models=/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5 --num-cores=-1 --batch-size=-1 --socket-id=-1 --output-dir=/home/mount_dir/hys/models/benchmarks/common/tensorflow/logs --num-train-steps=1 --benchmark-only --verbose --checkpoint=/home/mount_dir/hys/modles/checkpoints --data-location=/home/mount_dir/wj/ImageNet/data/tf_images --disable-tcmalloc=True
Log file location: /home/mount_dir/hys/models/benchmarks/common/tensorflow/logs/benchmark_resnet50v1_5_training_fp32_20210514_163202.log

RNNT training on CPU

Thanks for the work on supporting RNN-T training on CPU (models/language_modeling/pytorch/rnnt/training/cpu). I quickly evaluated the training code and found that the WER stays at 1.00 even after training for 10+ epochs.
I also found this related issue on the loss function used in training: HawkAaron/warp-transducer#93
The gradient on CPU is incorrect; is this a known issue? Or has the final WER of 0.058 (rather than 1.0) ever been reached?

launch_benchmark.py : ImportError: No module named 'object_detection'

When running python launch_benchmark.py
--data-location /home/user/coco/output/
--in-graph /home/user/ssd_resnet34_fp32_bs1_pretrained_model.pb
--model-source-dir /home/user/tensorflow/models
--model-name ssd-resnet34
--framework tensorflow
--precision fp32
--mode inference
--socket-id 0
--batch-size=1
--docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14
--accuracy-only
I get an error:
load graph from: /in_graph/ssd_resnet34_fp32_bs1_pretrained_model.pb
Namespace(accuracy_only=True, batch_size=1, data_location='/dataset', input_graph='/in_graph/ssd_resnet34_fp32_bs1_pretrained_model.pb', num_inter_threads=1, num_intra_threads=8, results_file_path=None)
T
Traceback (most recent call last):
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py", line 1072, in create_dataset
import ssd_dataloader # pylint: disable=g-import-not-at-top
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/ssd_dataloader.py", line 27, in
from object_detection.box_coders import faster_rcnn_box_coder
ImportError: No module named 'object_detection'

Note that this error occurs for FP32 and INT8 when using --accuracy-only.
When using --benchmark-only, the script runs successfully.

It seems that PYTHONPATH is not properly transmitted to the docker environment.
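One hedged workaround, given that the traceback shows the TF models repo mounted at /workspace/models inside the container, is to make both research directories importable before the benchmark code runs; the paths below are taken from the traceback and may differ in other setups:

import sys

# Make the object_detection and slim packages importable inside the container.
sys.path.extend([
    "/workspace/models/research",
    "/workspace/models/research/slim",
])

from object_detection.box_coders import faster_rcnn_box_coder  # should import now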

Several of the data links in benchmarks/.../language translation/.../bert/ are out of date, and the download_glue_data.py script has bugs

I'm unable to get the data downloaded, following the ReadMe at this link: https://github.com/IntelAI/models/tree/master/benchmarks/language_translation/tensorflow/bert
Particularly at this command: $ python3 download_glue_data.py --data_dir ./data/ --tasks MRPC

There are several bugs in the download_glue_data.py script (e.g., you need 'urllib.request.urlretrieve...' rather than 'URLLIB.retrieve'), and the links to the data are now inaccessible.

Bazel Build fails on Ubuntu 16.04 - No such target log_severity

Hi,

I'm trying to build the Intel Optimized Binaries for TensorFlow Serving on my Ubuntu 16.04 machine. I ran into the error below: no such target '@com_google_absl//absl/base:log_severity'


I'm running the below command -
docker build -t $USER/tensorflow-serving-devel-mkl -f /home/ubuntu/serving/tensorflow_serving/tools/docker/Dockerfile.devel-mkl2 .

The only thing I have changed in the docker file is the base image to be 16.04 instead of 18.04

Is there anything I'm missing?

Thanks!

Set batching parameters

How can I enable batching and set batching parameters (max_batch_size, batch_timeout_micros, ...) when running TensorFlow Model Server with Docker? Thanks.
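A hedged sketch: TensorFlow Serving reads server-side batching options from a protobuf text file passed via --batching_parameters_file together with --enable_batching. The values below are illustrative, and the docker invocation in the comment assumes a standard model-server image and a placeholder model name.

# Write an illustrative batching parameters file for TensorFlow Serving.
batching_config = """
max_batch_size { value: 128 }
batch_timeout_micros { value: 1000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 8 }
"""

with open("batching_parameters.txt", "w") as f:
    f.write(batching_config)

# Then start the server with something like (illustrative command):
# docker run -p 8501:8501 \
#   -v "$PWD/batching_parameters.txt:/config/batching_parameters.txt" \
#   -v "$PWD/my_model:/models/my_model" -e MODEL_NAME=my_model \
#   intel/intel-optimized-tensorflow-serving:2.3.0-mkl \
#   --enable_batching=true --batching_parameters_file=/config/batching_parameters.txt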

How to set the git proxy

When running the ssd-resnet34 benchmark:

python launch_benchmark.py \
    --in-graph /home/<user>/ssd_resnet34_fp32_bs1_pretrained_model.pb \
    --model-source-dir /home/<user>/tensorflow/models \
    --model-name ssd-resnet34 \
    --framework tensorflow \
    --precision fp32 \
    --mode inference \
    --socket-id 0 \
    --batch-size=1 \
    --docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14 \
    --benchmark-only

it will clone the

git clone --single-branch https://github.com/tensorflow/benchmarks.git

in the start.sh

However, in my case I need to set some git proxy:

git config --global http.proxy http://<my_proxy>:<my_port>
git config --global https.proxy https://<my_proxy>:<my_port>

otherwise, it will fail with the error message:
Cloning into 'benchmarks'...
fatal: unable to access 'https://github.com/tensorflow/benchmarks.git/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
start.sh: line 654: cd: benchmarks: No such file or directory

Does anyone know how to set the proxy through these scripts' arguments?

Can not run BERT-Large training successfully on bare metal

We ran BERT-Large training on a bare metal Ubuntu server. The log has no errors, but also no training output, which is confusing.

command:

python ./launch_benchmark.py \
    --model-name=bert_large \
    --precision=fp32 \
    --mode=training \
    --framework=tensorflow \
    --batch-size=24 \
    --benchmark-only \
    --data-location=$BERT_LARGE_DIR \
    --num-inter-threads=1 \
    -- train-option=SQuAD DEBIAN_FRONTEND=noninteractive config_file=$BERT_LARGE_DIR/bert_config.json init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt vocab_file=$BERT_LARGE_DIR/vocab.txt train_file=$SQUAD_DIR/train-v1.1.json predict_file=$SQUAD_DIR/dev-v1.1.json do-train=True learning-rate=1.5e-5 max-seq-length=384 do_predict=True warmup-steps=0 num_train_epochs=0.1 doc_stride=128 do_lower_case=False experimental-gelu=False mpi_workers_sync_gradients=True

The log:

INFO:tensorflow:Graph was finalized.
I0625 09:40:30.595448 140247941625664 monitored_session.py:246] Graph was finalized.
2021-06-25 09:40:30.595915: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-25 09:40:30.764862: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2892875000 Hz
2021-06-25 09:40:30.767997: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c703127e80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-25 09:40:30.768068: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Running local_init_op.
I0625 09:40:50.980941 140247941625664 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0625 09:40:51.142987 140247941625664 session_manager.py:508] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
I0625 09:41:02.433922 140247941625664 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
I0625 09:41:02.434337 140247941625664 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
I0625 09:41:08.454857 140247941625664 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
INFO:Running SQuAD...!
----------------------------Run command-------------------------------------

So there are no training results in the log.

@dmsuehir @ashahba, would you please help troubleshoot?

Thanks

launch_benchmark.py : ImportError: No module named 'object_detection' with INT8 accuracy only

This issue is different from ticket #29.
When running python launch_benchmark.py
--data-location /home/user/coco/output/
--in-graph /home/user/ssd_resnet34_int8_bs1_pretrained_model.pb
--model-source-dir /home/user/tensorflow/models
--model-name ssd-resnet34
--framework tensorflow
--precision int8
--mode inference
--socket-id 0
--batch-size=1
--docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14
--accuracy-only
I get an error:
load graph from: /in_graph/ssd_resnet34_int8_bs1_pretrained_model.pb
Namespace(accuracy_only=True, batch_size=1, data_location='/dataset', input_graph='/in_graph/ssd_resnet34_int8_bs1_pretrained_model.pb', num_inter_threads=1, num_intra_threads=8, results_file_path=None)
T
Traceback (most recent call last):
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py", line 1072, in create_dataset
import ssd_dataloader # pylint: disable=g-import-not-at-top
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/ssd_dataloader.py", line 27, in
from object_detection.box_coders import faster_rcnn_box_coder
ImportError: No module named 'object_detection'

Note that this error occurs for INT8 with --accuracy-only.
With precision FP32 and --accuracy-only it works fine.

I have not been able to find a fix for this issue.

Tensorflow 2.x support

When are you planning to add support for TF 2.x versions in launch_benchmark.py?

mount path is not set in function inceptionv4

Unlike the other functions, I see that a mount path like the following is not set for the inceptionv4 function:
export PYTHONPATH=${PYTHONPATH}:$(pwd):${MOUNT_BENCHMARK}

I would think it is required, and I am encountering an issue with my Docker image because of it. Could the path be added?

models/benchmarks/common/tensorflow/start.sh
function inceptionv4() {
  # For accuracy, dataset location is required
  if [ "${DATASET_LOCATION_VOL}" == None ] && [ ${ACCURACY_ONLY} == "True" ]; then
    echo "No dataset directory specified, accuracy cannot be calculated."
    exit 1
  fi

correct_k = correct[:k].view(-1).float().sum(0, keepdim=True) RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

With PyTorch 1.7 in the oneAPI AI toolkit, there is an error when running this case:

cd models/models/image_recognition/pytorch/common/
python main.py -d /home/wj/ImageNet/data/all -a resnet50 --epochs 100 --learning-rate 0.1 --print-freq 1 -b 64 --ipex
...

correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

view() is not supported for this non-contiguous tensor; using reshape() fixes it.
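A sketch of the fix in the usual torchvision-style accuracy helper that the traceback points at; the surrounding function is reconstructed here, not copied from the repository:

import torch

def accuracy(output, target, topk=(1, 5)):
    """Top-k accuracy; reshape() replaces view() so non-contiguous slices work."""
    maxk = max(topk)
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    res = []
    for k in topk:
        # before: correct[:k].view(-1) -> RuntimeError on non-contiguous tensors
        correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
        res.append(correct_k.mul_(100.0 / target.size(0)))
    return res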

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Hi,
I followed the instructions in Wide & Deep and ran the test successfully, but saw the message below:

I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Is this correct behavior or a TF bug?
I tried pip install intel-tensorflow==1.15 and didn't see this message. The instructions in Wide & Deep used intel-tensorflow==2.1.0; does this matter?

cannot run ssd_resnet34 int8 model with intel-tensorflow 2.5.0

When I use intel-tensorflow==2.5.0 to run the SSD-ResNet34 model for inference, I get an error:

Traceback (most recent call last):
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Missing 0-th output from {{node v/cg/resnet34_backbone/conv1/conv2d/Conv2D_eightbit_requantize}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "draft.py", line 53, in <module>
    results = sess.run(output_tensors, {input_tensor: image})
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Missing 0-th output from node v/cg/resnet34_backbone/conv1/conv2d/Conv2D_eightbit_requantize (defined at draft.py:21)

Downgrading intel-tensorflow to 2.4.0 solved the problem.

Running multiple Docker containers with Intel-optimized TensorFlow on one CPU with 8 physical / 16 logical cores

Hello, I find that Intel-optimized TensorFlow gives a great speedup in the training phase.
I want to run 3 Docker containers on a CPU with 8 physical / 16 logical cores, giving every container 4 logical cores.
How should I set the intra_/inter_op_parallelism_threads parameters and OMP_NUM_THREADS?
When one container runs, training costs 17s per epoch, but when I run 3 containers, training costs 50s per epoch in every container.
By the way, I set intra_/inter_op_parallelism_threads=2, OMP_NUM_THREADS=2, and KMP_BLOCKTIME=1 in each container.
Can you tell me why?
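For reference, a hedged sketch of the per-container settings that are usually recommended (one OpenMP/intra-op thread per physical core available to the container, inter_op of 1). With three containers sharing 8 physical cores, contention for the same cores is a plausible cause of the slowdown, but the numbers below are assumptions, not a verified fix:

import os
import tensorflow as tf

# 4 logical cores per container ~= 2 physical cores, so 2 OpenMP/intra-op threads.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(1)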

How to run these w/out docker?

Is there any way to run these benchmarks without Docker? Many of us on HPC clusters don't use Docker because we want as minimal and noise-free an environment as possible; being able to run without Docker and containers helps with that.

dead code and ignored cli option for framework in launch_benchmark.py

When trying to run without Docker (bare metal), the method run_bare_metal has this code:

        # To Launch Tensorflow Serving benchmark we need only --in-graph arg.
        # It does not support checkpoint files.
        if args.framework == "tensorflow_serving":

which can't be entered because validate_args rejects -f tensorflow_serving.

Possibility of updated MKL docker image on latest 2.4.0

I've been leveraging the latest tag from dockerhub intel/intel-optimized-tensorflow-serving:2.3.0-mkl to deploy some tensorflow GPU models on CPU, which cannot run on vanilla tensorflow cpu. However, for other reasons we must migrate to tensorflow serving 2.4 which was released a month ago.

Can we expect a newer version of the optimized Docker image, i.e. intel/intel-optimized-tensorflow-serving:2.4.0-mkl? If so, is there an ETA?

I've encountered some trouble attempting to build myself. Following this guide https://github.com/IntelAI/models/blob/master/docs/general/tensorflow_serving/InstallationGuide.md with 2.4.0 instead of 2.3.0, the docker build successfully passes the build of tensorflow serving... but later results in errors when trying to copy the library files:

cannot stat '/root/.cache/bazel/_bazel_root/*/external/mkl_linux/lib/*': No such file or directory

I can't seem to find the produced lib files anywhere in the intermediate containers.

Any insight?

Error met when running BERT training on multiple nodes

Issue:
/usr/bin/python3 common/tensorflow/run_tf_benchmark.py \
    --framework=tensorflow --use-case=language_modeling --model-name=bert_large \
    --precision=fp32 --mode=training \
    --benchmark-dir=/dl/intel_train/models/benchmarks \
    --intelai-models=/dl/intel_train/models/benchmarks/../models/language_modeling/tensorflow/bert_large \
    --num-cores=-1 --batch-size=32 --socket-id=-1 \
    --output-dir=/dl/intel_train/glue-output \
    --num-train-steps=1 --benchmark-only --num-intra-threads=10 --disable-tcmalloc=True \
    --train-option=Classifier \
    --init-checkpoint=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/bert_model.ckpt \
    --task-name=MRPC \
    --vocab-file=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/vocab.txt \
    --config-file=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/bert_config.json \
    --do-train=true --num-train-epochs=30 --learning-rate=2e-5 --max-seq-length=128 \
    --do-eval=true --data-dir=/dl/intel_train/glue/results4/MRPC \
    --do-lower-case=True --experimental-gelu=False --optimized-softmax=True
<2> is invalid
libnuma: Warning: node argument 2 is out of range

usage: numactl [--all | -a] [--interleave= | -i ] [--preferred= | -p ]

Env:
OS: Fedora release 29 (Twenty Nine)
Whether container: bare metal
Model: bert base
Reference guideline:
https://github.com/IntelAI/models/tree/master/benchmarks/language_modeling/tensorflow/bert_large/training/fp32
Number_of_nodes: 2
Socket_of_nodes: 2

The code where throws the issue:
Start.sh -> run_model()->eval ${CMD} 2>&1 | tee ${LOGFILE}

Note, the ${CMD} after preprocessing is:
/usr/bin/python3 common/tensorflow/run_tf_benchmark.py \
    --framework=tensorflow --use-case=language_modeling --model-name=bert_large \
    --precision=fp32 --mode=training \
    --benchmark-dir=/dl/intel_train/models/benchmarks \
    --intelai-models=/dl/intel_train/models/benchmarks/../models/language_modeling/tensorflow/bert_large \
    --num-cores=-1 --batch-size=32 --socket-id=-1 \
    --output-dir=/dl/intel_train/glue-output \
    --num-train-steps=1 --benchmark-only --disable-tcmalloc=True \
    --train-option=Classifier \
    --init-checkpoint=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/bert_model.ckpt \
    --task-name=MRPC \
    --vocab-file=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/vocab.txt \
    --config-file=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/bert_config.json \
    --do-train=true --num-train-epochs=30 --learning-rate=2e-5 --max-seq-length=128 \
    --do-eval=true --data-dir=/dl/intel_train/glue/results4/MRPC \
    --do-lower-case=True --experimental-gelu=False --optimized-softmax=True

The step to launch the training:
• Git clone intel model on each node with same directory
• Prepare glue data on each node with same directory
• Prepare bert model on each node with same directory
• Make sure for each node, distributed mode on two sockets is workable per the guide. The train result can be generated on each node.
• To support multiple instances, there is no guide for BERT, so I'm referring to the ResNet-related training guide: https://github.com/IntelAI/models/blob/master/benchmarks/image_recognition/tensorflow/resnet50v1_5/training/fp32/Advanced.md
python launch_benchmark.py
--verbose
--model-name resnet50v1_5
--precision fp32
--mode training
--framework tensorflow
--noinstall
--checkpoint ${OUTPUT_DIR}
--data-location ${DATASET_DIR}
--mpi_hostnames 'host1,host2'
--mpi_num_processes 4 2>&1
• I just modified fp32_classifier_training.sh directly as the launching entry, something like below:
source "${MODEL_DIR}/quickstart/common/utils.sh"
_command python3 ${MODEL_DIR}/benchmarks/launch_benchmark.py
--model-name=bert_large
--precision=fp32
--mode=training
--framework=tensorflow
--batch-size=32
${mpi_num_proc_arg}
--mpi_hostnames='vsr140,vsr143'
--output-dir=$OUTPUT_DIR
$@
-- train-option=Classifier
task-name=MRPC
do-train=true
do-eval=true
data-dir=$DATASET_DIR/MRPC
vocab-file=$CHECKPOINT_DIR/vocab.txt
config-file=$CHECKPOINT_DIR/bert_config.json
init-checkpoint=$CHECKPOINT_DIR/bert_model.ckpt
max-seq-length=128
learning-rate=2e-5
num-train-epochs=30
optimized_softmax=True
experimental_gelu=False
do-lower-case=True \

What we have tried:
We found some numactl-related code under common/base_model_init.py, so we added some print statements there, but found that the error is thrown before numactl is invoked.

Downloading the dataset

The link you've posted to download the dataset is broken. Furthermore, even after figuring out the right url, trying to download the 2012 dataset is nearly impossible; endless circular links leading nowhere.

Is inceptionv3 maintained or is it a dead benchmark?

How can I export an optimized Wide & Deep model?

In https://software.intel.com/content/www/us/en/develop/articles/accelerate-int8-inference-performance-for-recommender-systems-with-intel-deep-learning.html, graph optimization includes the following:
"""
Categorical columns are optimized by removing redundant and unnecessary OPs. The left portion of Figure 2 contains the unoptimized portion of the graph. These are optimized as described below:
The Expand Dimension, GatherNd, NotEqual, and Where OPs that are used to get a non-empty input string of the required dimension are removed as they are redundant for the current dataset.
Error checking and handling OPs (NotEqual, GreaterEqual, SparseFillEmptyRows, Unique, etc.) and unique value calculation and reconstruction OPs (Unique, SparseSegmentSum/Mean, StridedSlice, Pack, Tile, etc.) are removed as they are not necessary for the current dataset.
"""
Would you please share this part of the graph optimization method or code?
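The article's actual transformation code isn't published here, but a generic sketch with the TF 1.x Graph Transform Tool shows the kind of pruning it describes; the file names, output node name, and transform list below are placeholders, not Intel's actual recipe:

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph  # TF 1.x tool

graph_def = tf.compat.v1.GraphDef()
with open("wide_deep_fp32_pretrained_model.pb", "rb") as f:   # placeholder path
    graph_def.ParseFromString(f.read())

transforms = [
    "strip_unused_nodes",
    "remove_nodes(op=Identity, op=CheckNumerics)",
    "fold_constants(ignore_errors=true)",
    "fold_batch_norms",
]
optimized = TransformGraph(
    graph_def,
    ["new_categorical_placeholder", "new_numeric_placeholder"],  # model inputs
    ["head/predictions/probabilities"],                          # placeholder output name
    transforms,
)
with open("wide_deep_optimized.pb", "wb") as f:
    f.write(optimized.SerializeToString())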

How does TensorFlow handle the situation of running a training with BF16 on a CPU which does not support it?

I compared the performance of training with FP32 and BF16 on a CPU which does not support BF16. Interestingly, the training managed to finish with BF16, though the performance was much worse than FP32. I also did a similar test for FP16, whose performance was also worse than FP32. Since the CPU supports neither FP16 nor BF16, I am very interested to know how TensorFlow handles such a situation.

The latest docker image was not compiled to use AVX2 AVX512F FMA

I pulled the latest TensorFlow image (docker pull intel/intel-optimized-tensorflow) and found the following message at the beginning of running a script:

2020-08-18 08:52:56.002977: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Through some research, it seems the only way to work around this is to rebuild tensorflow. Is there a better way of fixing this issue in the docker image? Or will this have a negative impact on the performance?

imagenet data conversion script returns pthread error

I ran
./imagenet_to_tfrecords.sh dataset tpu
to convert the ImageNet data to TF records, and it returned:

2021-05-25 19:50:18.526589: F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() fa
iled.
Fatal Python error: Aborted

Thread 0x00007f8d9a57e740 (most recent call first):
File "/usr/local/lib64/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 442 in get_matching_files_v2
File "/usr/local/lib64/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 383 in get_matching_files
File "/home/manjeets/resnet50v1-5-fp32-inference/DATADIR/tpu/tools/datasets/imagenet_to_gcs.py", line 372 in convert_to_tf_records
File "/home/manjee2021-05-25 19:50:18.542511: F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread creation via
pthread_create() failed.

I've checked the thread limit with ulimit -u: 864604.

The system has many cores (> 200).

About OMP_NUM_THREADS

I tried changing OMP_NUM_THREADS to determine how it would affect CPU performance
by changing export OMP_NUM_THREADS=''
However, the outcome seemed unchanged, and I found the following in the output:
User settings:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=1
KMP_SETTINGS=1
OMP_NUM_THREADS=16

OMP_NUM_THREADS never changes, so how can I change it?

Patches missing for IntelTensorFlow_PerformanceAnalysis

Summary

/opt/intel/oneapi/modelzoo/latest/models/docs/notebooks/perf_analysis/profiling/patches doesn't contain the patches needed to run the Jupyter notebooks /opt/intel/oneapi/modelzoo/latest/models/docs/notebooks/perf_analysis/benchmark_perf_comparison.ipynb and benchmark_perf_timeline_analysis.ipynb.

For the two Jupyter notebooks under oneAPI-samples/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_PerformanceAnalysis (benchmark_perf_comparison and benchmark_perf_timeline_analysis) to work, the models should be patched to correctly produce timeline .json files. However, many models available in benchmark_perf_comparison don't have corresponding patches in /opt/intel/oneapi/modelzoo/latest/models/docs/notebooks/perf_analysis/profiling/patches.

URL

IntelTensorFlow_PerformanceAnalysis: https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_PerformanceAnalysis
benchmark_perf_comparison: https://github.com/IntelAI/models/blob/master/docs/notebooks/perf_analysis/benchmark_perf_comparison.ipynb

Steps to reproduce

I followed the instructions in the IntelTensorFlow_PerformanceAnalysis "Running the Sample" section and ran cp -rf /opt/intel/oneapi/modelzoo/latest/models ~/ to get the models. I also followed all other instructions to prepare both environments and run the code in the Jupyter notebook. For execution, I chose topology 0: resnet50 infer fp32 and topology 1: resnet50v1_5 infer fp32.

Observed behavior

After executing the benchmark_perf_comparison notebook, a .json file containing the TensorFlow timelines is expected to be produced. However, no json file is found. The problem is that the models have to be patched to produce the json file, but neither topology 0: resnet50 infer fp32 nor topology 1: resnet50v1_5 infer fp32 has a corresponding patch in models/docs/notebooks/perf_analysis/profiling/patches. These are not the only cases; in fact, most of the 12 supported topologies don't have corresponding patches.

Expected behavior

There should be corresponding patches in the models/docs/notebooks/perf_analysis/profiling/patches folder, so that a .json file containing the timeline can be output and be used in the following execution. I think either the supported topologies should be changed to match the existing patches, or additional patches should be added to match the supported topologies.
