
intel / models

Intel® AI Reference Models: contains Intel optimizations for running deep learning workloads on Intel® Xeon® Scalable processors and Intel® Data Center GPUs

License: Apache License 2.0

Makefile 0.11% Python 87.15% Shell 9.66% Jupyter Notebook 2.03% Dockerfile 0.70% C 0.15% C++ 0.16% Batchfile 0.02% PowerShell 0.01%
ai cpu deep-learning deep-neural-networks inference performance tensorflow

models's Introduction

Intel® AI Reference Models

This repository contains links to pre-trained models, sample scripts, best practices, and step-by-step tutorials for many popular open-source machine learning models optimized by Intel to run on Intel® Xeon® Scalable processors and Intel® Data Center GPUs.

Containers for running the workloads can be found at the Intel® Developer Catalog.

A Jupyter Notebook version of the Intel® AI Reference Models is also available for the listed workloads.

Purpose of Intel® AI Reference Models

Intel optimizes popular deep learning frameworks such as TensorFlow* and PyTorch* by contributing to the upstream projects. Additional optimizations are built into plugins/extensions such as the Intel® Extension for PyTorch* and the Intel® Extension for TensorFlow*. Popular neural network models running against common datasets are the target workloads that drive these optimizations.
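As a rough illustration of how those extensions are typically used, here is a minimal sketch (not taken from this repository) that assumes the intel_extension_for_pytorch and torchvision packages are installed separately:

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

# ipex.optimize() applies Intel-specific operator fusion and weight layout changes;
# dtype=torch.bfloat16 selects BFloat16 execution on CPUs that support it.
model = models.resnet50().eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model(torch.randn(1, 3, 224, 224))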

The purpose of the Intel® AI Reference Models repository (and associated containers) is to quickly replicate the complete software environment that demonstrates the best-known performance of each of these target model/dataset combinations. When executed in optimally-configured hardware environments, these software environments showcase the AI capabilities of Intel platforms.

DISCLAIMER: These scripts are not intended for benchmarking Intel platforms. For any performance and/or benchmarking information on specific Intel platforms, visit https://www.intel.ai/blog.

Intel is committed to the respect of human rights and avoiding complicity in human rights abuses, a policy reflected in the Intel Global Human Rights Principles. Accordingly, by accessing the Intel material on this platform you agree that you will not use the material in a product or application that causes or contributes to a violation of an internationally recognized human right.

License

The Intel® AI Reference Models is licensed under Apache License Version 2.0.

Datasets

To the extent that any public datasets are referenced by Intel or accessed using tools or code on this site, those datasets are provided by the third party indicated as the data source. Intel does not create the data, or datasets, and does not warrant their accuracy or quality. By accessing the public dataset(s) you agree to the terms associated with those datasets and that your use complies with the applicable license.

Please check the list of datasets used in Intel® AI Reference Models in the datasets directory.

Intel expressly disclaims the accuracy, adequacy, or completeness of any public datasets, and is not liable for any errors, omissions, or defects in the data, or for any reliance on the data. Intel is not liable for any liability or damages relating to your use of public datasets.

Use cases

The model documentation in the tables below has information on the prerequisites to run each model. The model scripts run on Linux. Certain models are also able to run using bare metal on Windows. For more information and a list of models that are supported on Windows, see the documentation here.

Instructions are available to run on Sapphire Rapids.

For best performance on Intel® Data Center GPU Flex and Max Series, please check the list of supported workloads. It provides instructions to run inference and training using the Intel® Extension for PyTorch or the Intel® Extension for TensorFlow.
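For reference, a minimal sketch (not from this repository) of how a PyTorch model is typically run on these GPUs, assuming the GPU build of the Intel® Extension for PyTorch is installed and exposes the "xpu" device:

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex  # registers the "xpu" device (assumed GPU build)

model = models.resnet50().eval().to("xpu")
model = ipex.optimize(model, dtype=torch.float16)

with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
    output = model(torch.randn(1, 3, 224, 224).to("xpu"))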

Image Recognition

Model Framework Mode Model Documentation Benchmark/Test Dataset
DenseNet169 TensorFlow Inference FP32 ImageNet 2012
Inception V3 TensorFlow Inference Int8 FP32 ImageNet 2012
MobileNet V1* TensorFlow Inference Int8 FP32 BFloat16 ImageNet 2012
MobileNet V1* Sapphire Rapids TensorFlow Inference Int8 FP32 BFloat16 BFloat32 ImageNet 2012
MobileNet V2 TensorFlow Inference FP32 BFloat16 Int8 ImageNet 2012
ResNet 101 TensorFlow Inference Int8 FP32 ImageNet 2012
ResNet 50v1.5 TensorFlow Inference Int8 FP32 BFloat16 FP16 ImageNet 2012
ResNet 50v1.5 Sapphire Rapids TensorFlow Inference Int8 FP32 BFloat16 BFloat32 ImageNet 2012
ResNet 50v1.5 TensorFlow Training FP32 BFloat16 FP16 ImageNet 2012
ResNet 50v1.5 Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 ImageNet 2012
Inception V3 TensorFlow Serving Inference FP32 Synthetic Data
ResNet 50v1.5 TensorFlow Serving Inference FP32 Synthetic Data
GoogLeNet PyTorch Inference FP32 BFloat16 ImageNet 2012
Inception v3 PyTorch Inference FP32 BFloat16 ImageNet 2012
MNASNet 0.5 PyTorch Inference FP32 BFloat16 ImageNet 2012
MNASNet 1.0 PyTorch Inference FP32 BFloat16 ImageNet 2012
ResNet 50 PyTorch Inference Int8 FP32 BFloat16 BFloat32 ImageNet 2012
ResNet 50 PyTorch Training FP32 BFloat16 BFloat32 ImageNet 2012
ResNet 101 PyTorch Inference FP32 BFloat16 ImageNet 2012
ResNet 152 PyTorch Inference FP32 BFloat16 ImageNet 2012
ResNext 32x4d PyTorch Inference FP32 BFloat16 ImageNet 2012
ResNext 32x16d PyTorch Inference Int8 FP32 BFloat16 BFloat32 ImageNet 2012
VGG-11 PyTorch Inference FP32 BFloat16 ImageNet 2012
VGG-11 with batch normalization PyTorch Inference FP32 BFloat16 ImageNet 2012
Wide ResNet-50-2 PyTorch Inference FP32 BFloat16 ImageNet 2012
Wide ResNet-101-2 PyTorch Inference FP32 BFloat16 ImageNet 2012

Image Segmentation

Model Framework Mode Model Documentation Benchmark/Test Dataset
3D U-Net MLPerf* TensorFlow Inference FP32 BFloat16 Int8 BRATS 2019
3D U-Net MLPerf* Sapphire Rapids TensorFlow Inference FP32 BFloat16 Int8 BFloat32 BRATS 2019
MaskRCNN TensorFlow Inference FP32 MS COCO 2014
UNet TensorFlow Inference FP32

Language Modeling

Model Framework Mode Model Documentation Benchmark/Test Dataset
BERT large TensorFlow Inference FP32 BFloat16 FP16 SQuAD
BERT large TensorFlow Training FP32 BFloat16 FP16 SQuAD and MRPC
BERT large Sapphire Rapids TensorFlow Inference FP32 BFloat16 Int8 BFloat32 SQuAD
BERT large Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 SQuAD
DistilBERT base TensorFlow Inference FP32 BFloat16 Int8 FP16 SST-2
BERT base PyTorch Inference FP32 BFloat16 BERT Base SQuAD1.1
BERT large PyTorch Inference FP32 Int8 BFloat16 BFloat32 BERT Large SQuAD1.1
BERT large PyTorch Training FP32 BFloat16 BFloat32 preprocessed text dataset
DistilBERT base PyTorch Inference FP32 Int8-FP32 Int8-BFloat16 BFloat16 BFloat32 DistilBERT Base SQuAD1.1
RNN-T PyTorch Inference FP32 BFloat16 BFloat32 RNN-T dataset
RNN-T PyTorch Training FP32 BFloat16 BFloat32 RNN-T dataset
RoBERTa base PyTorch Inference FP32 BFloat16 RoBERTa Base SQuAD 2.0
T5 PyTorch Inference FP32 Int8

Language Translation

Model Framework Mode Model Documentation Benchmark/Test Dataset
BERT TensorFlow Inference FP32 MRPC
Transformer_LT_mlperf* TensorFlow Inference FP32 BFloat16 Int8 WMT English-German data
Transformer_LT_mlperf* Sapphire Rapids TensorFlow Inference FP32 BFloat16 Int8 BFloat32 WMT English-German dataset
Transformer_LT_mlperf* TensorFlow Training FP32 BFloat16 WMT English-German dataset
Transformer_LT_mlperf* Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 WMT English-German dataset
Transformer_LT_Official TensorFlow Inference FP32 WMT English-German dataset
Transformer_LT_Official TensorFlow Serving Inference FP32

Object Detection

Model Framework Mode Model Documentation Benchmark/Test Dataset
SSD-MobileNet* TensorFlow Inference Int8 FP32 BFloat16 COCO 2017 validation dataset
SSD-MobileNet* Sapphire Rapids TensorFlow Inference Int8 FP32 BFloat16 BFloat32 COCO 2017 validation dataset
SSD-ResNet34* TensorFlow Inference Int8 FP32 BFloat16 COCO 2017 validation dataset
SSD-ResNet34* Sapphire Rapids TensorFlow Inference Int8 FP32 BFloat16 BFloat32 COCO 2017 validation dataset
SSD-ResNet34 TensorFlow Training FP32 BFloat16 COCO 2017 training dataset
SSD-ResNet34 Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 COCO 2017 training dataset
SSD-MobileNet TensorFlow Serving Inference FP32
Faster R-CNN ResNet50 FPN PyTorch Inference FP32 BFloat16 COCO 2017
Mask R-CNN PyTorch Inference FP32 BFloat16 BFloat32 COCO 2017
Mask R-CNN PyTorch Training FP32 BFloat16 BFloat32 COCO 2017
Mask R-CNN ResNet50 FPN PyTorch Inference FP32 BFloat16 COCO 2017
RetinaNet ResNet-50 FPN PyTorch Inference FP32 BFloat16 COCO 2017
SSD-ResNet34 PyTorch Inference FP32 Int8 BFloat16 BFloat32 COCO 2017
SSD-ResNet34 PyTorch Training FP32 BFloat16 BFloat32 COCO 2017

Recommendation

Model Framework Mode Model Documentation Benchmark/Test Dataset
DIEN TensorFlow Inference FP32 BFloat16 DIEN dataset
DIEN Sapphire Rapids TensorFlow Inference FP32 BFloat16 BFloat32 DIEN dataset
DIEN TensorFlow Training FP32 DIEN dataset
DIEN Sapphire Rapids TensorFlow Training FP32 BFloat16 BFloat32 DIEN dataset
Wide & Deep TensorFlow Inference FP32 Census Income dataset
Wide & Deep Large Dataset TensorFlow Inference Int8 FP32 Large Kaggle Display Advertising Challenge dataset
DLRM PyTorch Inference FP32 Int8 BFloat16 BFloat32 Criteo Terabyte
DLRM PyTorch Training FP32 BFloat16 BFloat32 Criteo Terabyte
DLRM v2 PyTorch Inference FP32 FP16 BFloat16 BFloat32 Int8 Criteo 1TB Click Logs dataset
DLRM v2 PyTorch Training FP32 FP16 BFloat16 BFloat32 Random dataset
MEMREC-DLRM PyTorch Inference FP32 Criteo Terabyte

Diffusion

Model Framework Mode Model Documentation Benchmark/Test Dataset
Stable Diffusion TensorFlow Inference FP32 BFloat16 FP16 COCO 2017 validation dataset
Stable Diffusion PyTorch Inference FP32 BFloat16 FP16 BFloat32 Int8-FP32 Int8-BFloat16 COCO 2017 validation dataset
Stable Diffusion PyTorch Training FP32 BFloat16 FP16 BFloat32 cat images

Shot Boundary Detection

Model Framework Mode Model Documentation Benchmark/Test Dataset
TransNetV2 PyTorch Inference FP32 BFloat16 Synthetic Data

AI Drug Design (AIDD)

Model Framework Mode Model Documentation Benchmark/Test Dataset
AlphaFold2 PyTorch Inference FP32 AF2Dataset

* Means the model belongs to the MLPerf suite and will be supported long-term.

Intel® Data Center GPU Workloads

Model Framework Mode GPU Type Model Documentation
ResNet 50v1.5 TensorFlow Inference Flex Series Float32 TF32 Float16 BFloat16 Int8
ResNet 50 v1.5 TensorFlow Training Max Series BFloat16 FP32
ResNet 50 v1.5 PyTorch Inference Flex Series, Max Series, Arc Series Int8 FP32 FP16 TF32
ResNet 50 v1.5 PyTorch Training Max Series, Arc Series BFloat16 TF32 FP32
DistilBERT PyTorch Inference Flex Series, Max Series FP32 FP16 BF16 TF32
DLRM v1 PyTorch Inference Flex Series FP16 FP32
SSD-MobileNet* TensorFlow Inference Flex Series Int8
SSD-MobileNet* PyTorch Inference Arc Series INT8 FP16 FP32
EfficientNet PyTorch Inference Flex Series FP16 FP32
EfficientNet TensorFlow Inference Flex Series FP16
Wide Deep Large Dataset TensorFlow Inference Flex Series FP16
YOLO V5 PyTorch Inference Flex Series FP16
BERT large PyTorch Inference Max Series, Arc Series BFloat16 FP32 FP16
BERT large PyTorch Training Max Series, Arc Series BFloat16 FP32 TF32
BERT large TensorFlow Training Max Series BFloat16 TF32 FP32
DLRM v2 PyTorch Training Max Series FP32 TF32 BF16
3D-Unet PyTorch Inference Max Series FP16 INT8 FP32
3D-Unet TensorFlow Training Max Series BFloat16 FP32
Stable Diffusion PyTorch Inference Flex Series, Max Series, Arc Series FP16 FP32
Stable Diffusion TensorFlow Inference Flex Series FP16 FP32
Mask R-CNN TensorFlow Inference Flex Series FP32 Float16
Mask R-CNN TensorFlow Training Max Series FP32 BFloat16
Swin Transformer PyTorch Inference Flex Series FP16
FastPitch PyTorch Inference Flex Series FP16
UNet++ PyTorch Inference Flex Series FP16
RNN-T PyTorch Inference Max Series FP16 BF16 FP32
RNN-T PyTorch Training Max Series FP32 BF16 TF32

How to Contribute

If you would like to add a new benchmarking script, please use this guide.

models's People

Contributors

ashahba, blzheng, chuanqi129, chunyuan-w, claynerobison, cuixiaom, cuiyifeng, dmsuehir, dzungductran, jiayisunx, jitendra42, jojivk73, karthikvadla, ldurka, leslie-fang-intel, liangan1, mahathi-vatsal, mahmoud-abuzaina, mhbuehler, mjkyung, moonjkyung, s1113950, sramakintel, venky-intel, wafaat, weizhuozhang-intel, xiaobingsuper, yanbing-j, zantares, zhuhaozhe


models's Issues

How to evaluate the performance number of Bert-Large training

I ran into some confusion when I followed the guide (https://github.com/IntelAI/models/tree/master/benchmarks/language_modeling/tensorflow/bert_large) to run the training workload.

Running command:

nohup python ./launch_benchmark.py \
    --model-name=bert_large \
    --precision=fp32 \
    --mode=training \
    --framework=tensorflow \
    --batch-size=24 --mpi_num_processes=2 \
    --benchmark-only \
    --docker-image intel/intel-optimized-tensorflow:2.3.0 \
    --volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
    --volume $SQUAD_DIR:$SQUAD_DIR \
    --data-location=$BERT_LARGE_DIR \
    --num-intra-threads=26 \
    --num-inter-threads=1 \
-- train-option=SQuAD DEBIAN_FRONTEND=noninteractive config_file=$BERT_LARGE_DIR/bert_config.json init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt vocab_file=$BERT_LARGE_DIR/vocab.txt train_file=$SQUAD_DIR/train-v1.1.json predict_file=$SQUAD_DIR/dev-v1.1.json do-train=True learning-rate=1.5e-5 max-seq-length=384 do_predict=True warmup-steps=0 num_train_epochs=2 doc_stride=128 do_lower_case=False experimental-gelu=False mpi_workers_sync_gradients=True >> training-0609 &

Result:

INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
I0610 01:09:58.730417 140427424720704 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
INFO:tensorflow:Processing example: 9000
I0610 01:13:27.192351 140160153200448 run_squad.py:1363] Processing example: 9000
INFO:tensorflow:Processing example: 10000
I0610 01:17:27.623694 140160153200448 run_squad.py:1363] Processing example: 10000
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625470 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625671 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
I0610 01:20:36.625791 140160153200448 run_squad.py:797] Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json
I0610 01:20:36.625833 140160153200448 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json

I didn't see the "throughput((num_processed_examples-threshold_examples)/Elapsedtime)" information in the training log that the inference workload provides. I also read the script code, models/models/language_modeling/tensorflow/bert_large/training/fp32/run_squad.py, and found nothing about "throughput". The ./models/models/language_modeling/tensorflow/bert_large/inference/run_squad.py used by inference, however, does have code for "throughput((num_processed_examples-threshold_examples)/Elapsedtime)".

So how do I evaluate the performance of BERT-Large training when there is neither "throughput" nor "Elapsedtime" in the log or the running script?
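For what it's worth, a hedged sketch of the inference-style formula quoted above, applied by hand to a training run; the helper below is illustrative and not part of the repository, and the example counts and elapsed time would come from timing the run externally:

def training_throughput(num_processed_examples, warmup_examples, elapsed_seconds):
    # Same formula the inference run_squad.py reports:
    # (num_processed_examples - threshold_examples) / elapsed_time
    return (num_processed_examples - warmup_examples) / elapsed_seconds

# Example: 10000 examples processed, no warm-up excluded, 3600 s wall time
print(training_throughput(10000, 0, 3600.0))  # ~2.78 examples/sec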

@ashahba @dmsuehir

Thanks

Need clarification upon Wide&Deep model input shapes

I received a W&D model for enabling inference using OpenVINO, but I encountered an inconsistency in the model: GatherND at the beginning generates an output shape that is unacceptable to Reshape. Could you please tell me what input shapes are expected for new_categorical_placeholder and new_numeric_placeholder? For example, I tried using shapes equal to [2,1] with no success.

Performance issue in /models/recommendation/tensorflow (by P3)

Hello! I've found a performance issue in /wide_deep/inference/fp32/wide_deep_inference.py: dataset.batch(batch_size) (line 192) should be called before dataset.map(parse_csv, num_parallel_calls=5) (line 187), which could make your program more efficient.

Here is the tensorflow document to support it.

Besides, you need to check whether the function parse_csv called in dataset.map(parse_csv, num_parallel_calls=5) is affected by the change, to make the changed code work properly. For example, if parse_csv needs data with shape (x, y, z) as its input before the fix, it would require data with shape (batch_size, x, y, z) after the fix.
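A minimal sketch of the suggested reordering (not the repository's code; parse_csv here is a stand-in for the function in wide_deep_inference.py and, as noted above, would need to accept a whole batch of rows):

import tensorflow as tf

def parse_csv(lines):
    # Stand-in for the repo's parse_csv; after the reordering it receives a
    # batch of CSV lines (shape [batch_size]) instead of a single line.
    return tf.io.decode_csv(lines, record_defaults=[[0.0]] * 40)  # defaults are illustrative

dataset = tf.data.TextLineDataset(["train.csv"])         # placeholder input file
dataset = dataset.batch(512)                             # batch first ...
dataset = dataset.map(parse_csv, num_parallel_calls=5)   # ... then a vectorized map
dataset = dataset.prefetch(1)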

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.

launch_benchmark.py error dataset file not found

When running python launch_benchmark.py
--data-location /home/user/coco/output/
--in-graph /home/user/ssd_resnet34_fp32_bs1_pretrained_model.pb
--model-source-dir /home/user/tensorflow/models
--model-name ssd-resnet34
--framework tensorflow
--precision fp32
--mode inference
--socket-id 0
--batch-size=1
--docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14
--accuracy-only
I get an error:

coco tf records file not found.
I found the root cause:
During step 4 of the BKM, create_coco_tf_record.py generates tf_records with names that are not compatible with the benchmark script. The benchmark script also looks for a directory "dataset" that is not created by TensorFlow. I had to change the coco directory structure and the names of the tf_records to be able to read the coco tf_records with the benchmark script, which accepts only names of the form "validation-00001-of-00010", while TensorFlow creates records with names of the form "coco_val.record-00001-of-00010".
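A sketch of the renaming workaround described above; the paths and shard counts are placeholders, and it assumes the records produced by create_coco_tf_record.py sit directly in the output directory:

import glob
import os
import shutil

src_dir = "/home/user/coco/output"          # where create_coco_tf_record.py wrote the shards
dst_dir = os.path.join(src_dir, "dataset")  # directory name the benchmark script expects
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "coco_val.record-*")):
    shard = os.path.basename(path).split("record-")[1]   # e.g. "00001-of-00010"
    shutil.move(path, os.path.join(dst_dir, "validation-" + shard))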

INT8 model source

Hi,

I have a question regarding quantization. How do I get a model in an 8-bit representation, like MobileNet V1 INT8?

Is it possible to convert a model that was trained with TF Quantization-Aware Training and contains FakeQuantizeWithMinMax nodes into such a representation?

Unable to find image 'gcr.io/deeplearning-platform-release/tf-cpu.1-14:latest' locally docker: Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers).

Is this docker image broken now?

docker run -d -p 8080:8080 -v /home:/home  gcr.io/deeplearning-platform-release/tf-cpu.1-14


Unable to find image 'gcr.io/deeplearning-platform-release/tf-cpu.1-14:latest' locally
docker: Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers).
See 'docker run --help'.

The scripts don't expose a "--steps" parameter

Hi,
I am running RN50 workloads, and I want to set the "--steps=1000" parameter with the scripts of
models/benchmarks/launch_benchmark.py
My command would be:
python models/benchmarks/launch_benchmark.py --in-graph /root/rsn50_frozen_max_min.pb --model-name resnet50 --framework tensorflow --precision int8 --mode inference --batch-size=1 --socket-id 0 --benchmark-only --docker-image tensorflow/tensorflow-estimator:latest-mkl --data-location /root/tensorflow/dataset/TF_Imagenet_FullData --num-inter-threads=1 --num-intra-threads=26 --steps=1000

The "--steps" parameter doesn't work, and I had to modify the function "add_steps_args()" in the "benchmarks/common/tensorflow/start.sh" script.

KMP thread questions about RN50

When I am running the RN50 models:

python models/benchmarks/launch_benchmark.py --in-graph /home/testRN50/resnet50_int8_pretrained_model.pb --model-name resnet50 --framework tensorflow --precision int8 --mode inference --socket-id 0 --batch-size=128 --benchmark-only -- warmup_steps=50 steps=500

I see the KMP output:

OMP: Info #250: KMP_AFFINITY: pid 67001 tid 67144 thread 0 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 67001 tid 67144 thread 1 bound to OS proc set 1

Why would the same tid (67144) spawn two threads?

launch_benchmark.py: ImportError: No module named 'pycocotools._mask'

When running python launch_benchmark.py
--data-location /home/user/coco/output/
--in-graph /home/user/ssd_resnet34_fp32_bs1_pretrained_model.pb
--model-source-dir /home/user/tensorflow/models
--model-name ssd-resnet34
--framework tensorflow
--precision fp32
--mode inference
--socket-id 0
--batch-size=1
--docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14
--accuracy-only
I get an error:

Inference for accuracy check.
Traceback (most recent call last):
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/models/ssd_model.py", line 507, in postprocess
import coco_metric # pylint: disable=g-import-not-at-top
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/coco_metric.py", line 32, in
from pycocotools.coco import COCO
File "/workspace/models/research/pycocotools/coco.py", line 55, in
from . import mask as maskUtils
File "/workspace/models/research/pycocotools/mask.py", line 3, in
import pycocotools._mask as _mask
ImportError: No module named 'pycocotools._mask'

The PYTHONPATH is :"/home/user/Tensorflowmodels/models/research:/home/user/Tensorflowmodels/models/research/slim"

/home/user/cocoapi/PythonAPI was compiled with python3.6 and pycocotools was copied to /home/user/Tensorflowmodels/models/research.

The /home/user/IntelModelsAI/benchmarks/launch_benchmark.py is also run with python3.6.
I have spent some time debugging the issue without success.

I hit an error when I ran TensorFlow on multiple nodes

hi,
I hit an error when I ran TensorFlow on multiple nodes.

This is my command:

python launch_benchmark.py \

     --verbose \
     --model-name=resnet50v1_5 \
     --precision=fp32 \
     --mode=training \
     --framework tensorflow \
     --noinstall \
     --checkpoint=/home/mount_dir/hys/modles/checkpoints \
     --data-location=/home/mount_dir/wj/ImageNet/data/tf_images \
     --mpi_hostnames='c1,head' \
     --mpi_num_processes=4 2>&1

This is the error encountered:

SOCKET_ID: -1
MODEL_NAME: resnet50v1_5
MODE: training
PRECISION: fp32
BATCH_SIZE: -1
NUM_CORES: -1
BENCHMARK_ONLY: True
ACCURACY_ONLY: False
OUTPUT_RESULTS: False
DISABLE_TCMALLOC: True
TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD: 2147483648
NOINSTALL: True
OUTPUT_DIR: /home/mount_dir/hys/models/benchmarks/common/tensorflow/logs
MPI_NUM_PROCESSES: 4
MPI_NUM_PEOCESSES_PER_SOCKET: 1
MPI_HOSTNAMES: c1,head
NUMA_CORES_PER_INSTANCE: None
PYTHON_EXE: /opt/intel/oneapi/tensorflow/2.2.0/bin/python
PYTHONPATH:
DRY_RUN:

/bin/sh: numactl: command not found
[mpiexec@head] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument x
[mpiexec@head] Similar arguments:
[mpiexec@head] demux
[mpiexec@head] s
[mpiexec@head] n
[mpiexec@head] enable-x
[mpiexec@head] f
[mpiexec@head] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@head] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1350): error parsing input array
[mpiexec@head] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1755): error parsing parameters
num_inter_threads: 1
num_intra_threads: 26
Received these standard args: Namespace(accuracy_only=False, backbone_model=None, batch_size=64, benchmark_dir='/home/mount_dir/hys/models/benchmarks', benchmark_only=True, checkpoint='/home/mount_dir/hys/modles/checkpoints', data_location='/home/mount_dir/wj/ImageNet/data/tf_images', data_num_inter_threads=None, data_num_intra_threads=None, disable_tcmalloc=True, epochsbtwevals=1, experimental_gelu=False, framework='tensorflow', input_graph=None, intelai_models='/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5', mode='training', model_args=[], model_name='resnet50v1_5', model_source_dir=None, mpi=None, mpi_hostnames=None, num_cores=-1, num_instances=1, num_inter_threads=1, num_intra_threads=26, num_mpi=1, num_train_steps=1, numa_cores_per_instance=None, optimized_softmax=True, output_dir='/home/mount_dir/hys/models/benchmarks/common/tensorflow/logs', output_results=False, precision='fp32', socket_id=-1, steps=112590, tcmalloc_large_alloc_report_threshold=2147483648, tf_serving_version='master', trainepochs=72, use_case='image_recognition', verbose=True)
Received these custom args: []
Current directory: /home/mount_dir/hys/models/benchmarks
Running: mpirun -x LD_LIBRARY_PATH -x PYTHONPATH --allow-run-as-root -n 4 -H c1:2,head:2 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 --bind-to none --map-by slot /opt/intel/oneapi/tensorflow/2.2.0/bin/python /home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5/training/mlperf_resnet/imagenet_main.py 2 --batch_size=64 --max_train_steps=112590 --train_epochs=72 --epochs_between_evals=1 --inter_op_parallelism_threads 1 --intra_op_parallelism_threads 26 --version 1 --resnet_size 50 --data_dir=/home/mount_dir/wj/ImageNet/data/tf_images --model_dir=/home/mount_dir/hys/modles/checkpoints
PYTHONPATH: :/home/mount_dir/hys/models/benchmarks/../models/common/tensorflow:/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5:/home/mount_dir/hys/models/benchmarks:/home/mount_dir/hys/models/benchmarks
RUNCMD: /opt/intel/oneapi/tensorflow/2.2.0/bin/python common/tensorflow/run_tf_benchmark.py --framework=tensorflow --use-case=image_recognition --model-name=resnet50v1_5 --precision=fp32 --mode=training --benchmark-dir=/home/mount_dir/hys/models/benchmarks --intelai-models=/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5 --num-cores=-1 --batch-size=-1 --socket-id=-1 --output-dir=/home/mount_dir/hys/models/benchmarks/common/tensorflow/logs --num-train-steps=1 --benchmark-only --verbose --checkpoint=/home/mount_dir/hys/modles/checkpoints --data-location=/home/mount_dir/wj/ImageNet/data/tf_images --disable-tcmalloc=True
Log file location: /home/mount_dir/hys/models/benchmarks/common/tensorflow/logs/benchmark_resnet50v1_5_training_fp32_20210514_163202.log

RNNT training on CPU

Thanks for the work on supporting RNN-T training on CPU (models/language_modeling/pytorch/rnnt/training/cpu). I quickly evaluated the training code and found that the WER stays at 1.00 even after training for 10+ epochs.
I also found this related issue on the loss function used in training: HawkAaron/warp-transducer#93
The gradient on CPU is incorrect; is this a known issue? Or has the final WER of 0.058 (rather than 1.0) ever been reached?

launch_benchmark.py : ImportError: No module named 'object_detection'

When running python launch_benchmark.py
--data-location /home/user/coco/output/
--in-graph /home/user/ssd_resnet34_fp32_bs1_pretrained_model.pb
--model-source-dir /home/user/tensorflow/models
--model-name ssd-resnet34
--framework tensorflow
--precision fp32
--mode inference
--socket-id 0
--batch-size=1
--docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14
--accuracy-only
I get an error:
load graph from: /in_graph/ssd_resnet34_fp32_bs1_pretrained_model.pb
Namespace(accuracy_only=True, batch_size=1, data_location='/dataset', input_graph='/in_graph/ssd_resnet34_fp32_bs1_pretrained_model.pb', num_inter_threads=1, num_intra_threads=8, results_file_path=None)
T
Traceback (most recent call last):
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py", line 1072, in create_dataset
import ssd_dataloader # pylint: disable=g-import-not-at-top
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/ssd_dataloader.py", line 27, in
from object_detection.box_coders import faster_rcnn_box_coder
ImportError: No module named 'object_detection'

Note that this error occurs for FP32 and INT8 when using --accuracy-only.
When using --benchmark-only, the script runs successfully.

It seems that PYTHONPATH is not properly transmitted to the docker environment.
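One hedged workaround, given that the traceback shows the TF models repo mounted at /workspace/models inside the container, is to make both research directories importable before the benchmark code runs; the paths below are taken from the traceback and may differ in other setups:

import sys

# Make the object_detection and slim packages importable inside the container.
sys.path.extend([
    "/workspace/models/research",
    "/workspace/models/research/slim",
])

from object_detection.box_coders import faster_rcnn_box_coder  # should import now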

Several of the data links in benchmarks/.../language translation/.../bert/ are out of date, and the download_glue_data.py script has bugs

I'm unable to get the data downloaded, following the ReadMe at this link: https://github.com/IntelAI/models/tree/master/benchmarks/language_translation/tensorflow/bert
Particularly at this command: $ python3 download_glue_data.py --data_dir ./data/ --tasks MRPC

There are several bugs in the download_glue_data.py script (e.g., you need 'urllib.request.urlretrieve...' rather than 'URLLIB.retrieve'), and the links to the data are now inaccessible.

Bazel Build fails on Ubuntu 16.04 - No such target log_severity

Hi,

I'm trying to build the Intel Optimized Binaries for TensorFlow Serving on my Ubuntu 16.04 machine. I ran into the error below: no such target '@com_google_absl//absl/base:log_severity'


I'm running the below command -
docker build -t $USER/tensorflow-serving-devel-mkl -f /home/ubuntu/serving/tensorflow_serving/tools/docker/Dockerfile.devel-mkl2 .

The only thing I have changed in the docker file is the base image to be 16.04 instead of 18.04

Is there anything I'm missing?

Thanks!

Set batching parameters

How can I enable batching and set batching parameters (max_batch_size, batch_timeout_micros, ...) when running TensorFlow Model Server with Docker? Thanks.
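A hedged sketch: TensorFlow Serving reads server-side batching options from a protobuf text file passed via --batching_parameters_file together with --enable_batching. The values below are illustrative, and the docker invocation in the comment assumes a standard model-server image and a placeholder model name.

# Write an illustrative batching parameters file for TensorFlow Serving.
batching_config = """
max_batch_size { value: 128 }
batch_timeout_micros { value: 1000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 8 }
"""

with open("batching_parameters.txt", "w") as f:
    f.write(batching_config)

# Then start the server with something like (illustrative command):
# docker run -p 8501:8501 \
#   -v "$PWD/batching_parameters.txt:/config/batching_parameters.txt" \
#   -v "$PWD/my_model:/models/my_model" -e MODEL_NAME=my_model \
#   intel/intel-optimized-tensorflow-serving:2.3.0-mkl \
#   --enable_batching=true --batching_parameters_file=/config/batching_parameters.txt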

How to set the git proxy

When running the ssd-resnet34 benchmark:

python launch_benchmark.py \
    --in-graph /home/<user>/ssd_resnet34_fp32_bs1_pretrained_model.pb \
    --model-source-dir /home/<user>/tensorflow/models \
    --model-name ssd-resnet34 \
    --framework tensorflow \
    --precision fp32 \
    --mode inference \
    --socket-id 0 \
    --batch-size=1 \
    --docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14 \
    --benchmark-only

it will clone the

git clone --single-branch https://github.com/tensorflow/benchmarks.git

in the start.sh

However, in my case I need to set some git proxy:

git config --global http.proxy http://<my_proxy>:<my_port>
git config --global https.proxy https://<my_proxy>:<my_port>

otherwise, it will fail with the error message:
Cloning into 'benchmarks'...
fatal: unable to access 'https://github.com/tensorflow/benchmarks.git/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
start.sh: line 654: cd: benchmarks: No such file or directory

Does anyone know how to set the proxy through these scripts' arguments?

Can not run BERT-Large training successfully on bare metal

We ran BERT-Large training on a bare metal Ubuntu server. The log has no errors, but also no training output, which is confusing.

command:

python ./launch_benchmark.py \
    --model-name=bert_large \
    --precision=fp32 \
    --mode=training \
    --framework=tensorflow \
    --batch-size=24 \
    --benchmark-only \
    --data-location=$BERT_LARGE_DIR \
    --num-inter-threads=1 \
    -- train-option=SQuAD DEBIAN_FRONTEND=noninteractive config_file=$BERT_LARGE_DIR/bert_config.json init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt vocab_file=$BERT_LARGE_DIR/vocab.txt train_file=$SQUAD_DIR/train-v1.1.json predict_file=$SQUAD_DIR/dev-v1.1.json do-train=True learning-rate=1.5e-5 max-seq-length=384 do_predict=True warmup-steps=0 num_train_epochs=0.1 doc_stride=128 do_lower_case=False experimental-gelu=False mpi_workers_sync_gradients=True

The log:

INFO:tensorflow:Graph was finalized.
I0625 09:40:30.595448 140247941625664 monitored_session.py:246] Graph was finalized.
2021-06-25 09:40:30.595915: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-25 09:40:30.764862: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2892875000 Hz
2021-06-25 09:40:30.767997: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c703127e80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-25 09:40:30.768068: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Running local_init_op.
I0625 09:40:50.980941 140247941625664 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0625 09:40:51.142987 140247941625664 session_manager.py:508] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
I0625 09:41:02.433922 140247941625664 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
I0625 09:41:02.434337 140247941625664 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
I0625 09:41:08.454857 140247941625664 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
INFO:Running SQuAD...!
----------------------------Run command-------------------------------------

So there are no training results in the log.

@dmsuehir @ashahba, would you please help troubleshoot?

Thanks

launch_benchmark.py : ImportError: No module named 'object_detection' with INT8 accuracy only

This issue is different from ticket #29.
When running python launch_benchmark.py
--data-location /home/user/coco/output/
--in-graph /home/user/ssd_resnet34_int8_bs1_pretrained_model.pb
--model-source-dir /home/user/tensorflow/models
--model-name ssd-resnet34
--framework tensorflow
--precision int8
--mode inference
--socket-id 0
--batch-size=1
--docker-image gcr.io/deeplearning-platform-release/tf-cpu.1-14
--accuracy-only
I get an error:
load graph from: /in_graph/ssd_resnet34_int8_bs1_pretrained_model.pb
Namespace(accuracy_only=True, batch_size=1, data_location='/dataset', input_graph='/in_graph/ssd_resnet34_int8_bs1_pretrained_model.pb', num_inter_threads=1, num_intra_threads=8, results_file_path=None)
T
Traceback (most recent call last):
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py", line 1072, in create_dataset
import ssd_dataloader # pylint: disable=g-import-not-at-top
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/ssd_dataloader.py", line 27, in
from object_detection.box_coders import faster_rcnn_box_coder
ImportError: No module named 'object_detection'

Note that this error occurs for INT8 with --accuracy-only.
With precision FP32 and --accuracy-only it works fine.

I have not been able to find a fix for this issue.

Tensorflow 2.x support

When are you planning to add support for TF 2.x versions in launch_benchmark.py?

mount path is not set in function inceptionv4

Unlike the other functions, I see that a mount path like the following is not set for the inceptionv4 function:
export PYTHONPATH=${PYTHONPATH}:$(pwd):${MOUNT_BENCHMARK}

I would think it is required, and I am encountering an issue with my Docker image because of it. Could the path be added?

models/benchmarks/common/tensorflow/start.sh
function inceptionv4() {
  # For accuracy, dataset location is required
  if [ "${DATASET_LOCATION_VOL}" == None ] && [ ${ACCURACY_ONLY} == "True" ]; then
    echo "No dataset directory specified, accuracy cannot be calculated."
    exit 1
  fi

correct_k = correct[:k].view(-1).float().sum(0, keepdim=True) RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

With PyTorch 1.7 in the oneAPI AI toolkit, there is an error when running this case:

cd models/models/image_recognition/pytorch/common/
python main.py -d /home/wj/ImageNet/data/all -a resnet50 --epochs 100 --learning-rate 0.1 --print-freq 1 -b 64 --ipex
...

correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

view() is not supported for this non-contiguous tensor; using reshape() fixes it.
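A sketch of the fix in the usual torchvision-style accuracy helper that the traceback points at; the surrounding function is reconstructed here, not copied from the repository:

import torch

def accuracy(output, target, topk=(1, 5)):
    """Top-k accuracy; reshape() replaces view() so non-contiguous slices work."""
    maxk = max(topk)
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    res = []
    for k in topk:
        # before: correct[:k].view(-1) -> RuntimeError on non-contiguous tensors
        correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
        res.append(correct_k.mul_(100.0 / target.size(0)))
    return res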

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Hi,
I followed the instructions in Wide & Deep and ran the test successfully, but saw the message below:

I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Is this correct behavior or a TF bug?
I tried pip install intel-tensorflow==1.15 and didn't see this message. The instructions in Wide & Deep used intel-tensorflow==2.1.0; does this matter?

cannot run ssd_resnet34 int8 model with intel-tensorflow 2.5.0

When I use intel-tensorflow==2.5.0 to run the SSD-ResNet34 model for inference, I get an error:

Traceback (most recent call last):
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Missing 0-th output from {{node v/cg/resnet34_backbone/conv1/conv2d/Conv2D_eightbit_requantize}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "draft.py", line 53, in <module>
    results = sess.run(output_tensors, {input_tensor: image})
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/home/support/miniconda3/envs/vdms_streamer/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Missing 0-th output from node v/cg/resnet34_backbone/conv1/conv2d/Conv2D_eightbit_requantize (defined at draft.py:21)

Downgrading intel-tensorflow to 2.4.0 solved the problem.

Running multiple Docker containers with Intel-optimized TensorFlow on one CPU with 8 physical / 16 logical cores

Hello, I find that Intel-optimized TensorFlow gives a great speedup in the training phase.
I want to run 3 Docker containers on a CPU with 8 physical / 16 logical cores, giving every container 4 logical cores.
How should I set the intra_/inter_op_parallelism_threads parameters and OMP_NUM_THREADS?
When one container runs, training costs 17s per epoch, but when I run 3 containers, training costs 50s per epoch in every container.
By the way, I set intra_/inter_op_parallelism_threads=2, OMP_NUM_THREADS=2, and KMP_BLOCKTIME=1 in each container.
Can you tell me why?
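For reference, a hedged sketch of the per-container settings that are usually recommended (one OpenMP/intra-op thread per physical core available to the container, inter_op of 1). With three containers sharing 8 physical cores, contention for the same cores is a plausible cause of the slowdown, but the numbers below are assumptions, not a verified fix:

import os
import tensorflow as tf

# 4 logical cores per container ~= 2 physical cores, so 2 OpenMP/intra-op threads.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(1)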

How to run these w/out docker?

Is there any way to run these benchmarks without Docker? Many of us on HPC clusters don't use Docker because we want as minimal and noise-free an environment as possible; being able to run without Docker and containers helps with that.

dead code and ignored cli option for framework in launch_benchmark.py

When trying to run without Docker (bare metal), the method run_bare_metal has this code:

        # To Launch Tensorflow Serving benchmark we need only --in-graph arg.
        # It does not support checkpoint files.
        if args.framework == "tensorflow_serving":

which can't be entered because validate_args rejects -f tensorflow_serving.

Possibility of updated MKL docker image on latest 2.4.0

I've been leveraging the latest tag from dockerhub intel/intel-optimized-tensorflow-serving:2.3.0-mkl to deploy some tensorflow GPU models on CPU, which cannot run on vanilla tensorflow cpu. However, for other reasons we must migrate to tensorflow serving 2.4 which was released a month ago.

Can we expect a newer version of the optimized Docker image, i.e. intel/intel-optimized-tensorflow-serving:2.4.0-mkl? If so, is there an ETA?

I've encountered some trouble attempting to build myself. Following this guide https://github.com/IntelAI/models/blob/master/docs/general/tensorflow_serving/InstallationGuide.md with 2.4.0 instead of 2.3.0, the docker build successfully passes the build of tensorflow serving... but later results in errors when trying to copy the library files:

cannot stat '/root/.cache/bazel/_bazel_root/*/external/mkl_linux/lib/*': No such file or directory

I can't seem to find the produced lib files anywhere in the intermediate containers.

Any insight?

Error met when running BERT training on multiple nodes

Issue:
/usr/bin/python3 common/tensorflow/run_tf_benchmark.py \
    --framework=tensorflow --use-case=language_modeling --model-name=bert_large \
    --precision=fp32 --mode=training \
    --benchmark-dir=/dl/intel_train/models/benchmarks \
    --intelai-models=/dl/intel_train/models/benchmarks/../models/language_modeling/tensorflow/bert_large \
    --num-cores=-1 --batch-size=32 --socket-id=-1 \
    --output-dir=/dl/intel_train/glue-output \
    --num-train-steps=1 --benchmark-only --num-intra-threads=10 --disable-tcmalloc=True \
    --train-option=Classifier \
    --init-checkpoint=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/bert_model.ckpt \
    --task-name=MRPC \
    --vocab-file=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/vocab.txt \
    --config-file=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/bert_config.json \
    --do-train=true --num-train-epochs=30 --learning-rate=2e-5 --max-seq-length=128 \
    --do-eval=true --data-dir=/dl/intel_train/glue/results4/MRPC \
    --do-lower-case=True --experimental-gelu=False --optimized-softmax=True
<2> is invalid
libnuma: Warning: node argument 2 is out of range

usage: numactl [--all | -a] [--interleave= | -i ] [--preferred= | -p ]

Env:
OS: Fedora release 29 (Twenty Nine)
Whether container: bare metal
Model: bert base
Reference guideline:
https://github.com/IntelAI/models/tree/master/benchmarks/language_modeling/tensorflow/bert_large/training/fp32
Number_of_nodes: 2
Socket_of_nodes: 2

The code where throws the issue:
Start.sh -> run_model()->eval ${CMD} 2>&1 | tee ${LOGFILE}

Note, the ${CMD} after preprocessing is:
/usr/bin/python3 common/tensorflow/run_tf_benchmark.py \
    --framework=tensorflow --use-case=language_modeling --model-name=bert_large \
    --precision=fp32 --mode=training \
    --benchmark-dir=/dl/intel_train/models/benchmarks \
    --intelai-models=/dl/intel_train/models/benchmarks/../models/language_modeling/tensorflow/bert_large \
    --num-cores=-1 --batch-size=32 --socket-id=-1 \
    --output-dir=/dl/intel_train/glue-output \
    --num-train-steps=1 --benchmark-only --disable-tcmalloc=True \
    --train-option=Classifier \
    --init-checkpoint=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/bert_model.ckpt \
    --task-name=MRPC \
    --vocab-file=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/vocab.txt \
    --config-file=/dl/intel_train/ckpt-bert-base/uncased_L-12_H-768_A-12/bert_config.json \
    --do-train=true --num-train-epochs=30 --learning-rate=2e-5 --max-seq-length=128 \
    --do-eval=true --data-dir=/dl/intel_train/glue/results4/MRPC \
    --do-lower-case=True --experimental-gelu=False --optimized-softmax=True

The step to launch the training:
• Git clone intel model on each node with same directory
• Prepare glue data on each node with same directory
• Prepare bert model on each node with same directory
• Make sure for each node, distributed mode on two sockets is workable per the guide. The train result can be generated on each node.
• To support multiple instances, there is no guide for BERT, so I'm referring to the ResNet-related training guide: https://github.com/IntelAI/models/blob/master/benchmarks/image_recognition/tensorflow/resnet50v1_5/training/fp32/Advanced.md
python launch_benchmark.py
--verbose
--model-name resnet50v1_5
--precision fp32
--mode training
--framework tensorflow
--noinstall
--checkpoint ${OUTPUT_DIR}
--data-location ${DATASET_DIR}
--mpi_hostnames 'host1,host2'
--mpi_num_processes 4 2>&1
• I just modified fp32_classifier_training.sh directly as the launching entry, something like below:
source "${MODEL_DIR}/quickstart/common/utils.sh"
_command python3 ${MODEL_DIR}/benchmarks/launch_benchmark.py
--model-name=bert_large
--precision=fp32
--mode=training
--framework=tensorflow
--batch-size=32
${mpi_num_proc_arg}
--mpi_hostnames='vsr140,vsr143'
--output-dir=$OUTPUT_DIR
$@
-- train-option=Classifier
task-name=MRPC
do-train=true
do-eval=true
data-dir=$DATASET_DIR/MRPC
vocab-file=$CHECKPOINT_DIR/vocab.txt
config-file=$CHECKPOINT_DIR/bert_config.json
init-checkpoint=$CHECKPOINT_DIR/bert_model.ckpt
max-seq-length=128
learning-rate=2e-5
num-train-epochs=30
optimized_softmax=True
experimental_gelu=False
do-lower-case=True \

What we have tried:
We found some numactl-related code under common/base_model_init.py, so we added some print statements there, but found that the error is thrown before numactl is invoked.

Downloading the dataset

The link you've posted to download the dataset is broken. Furthermore, even after figuring out the right url, trying to download the 2012 dataset is nearly impossible; endless circular links leading nowhere.

Is inceptionv3 maintained or is it a dead benchmark?

How can I export an optimized Wide & Deep model?

In https://software.intel.com/content/www/us/en/develop/articles/accelerate-int8-inference-performance-for-recommender-systems-with-intel-deep-learning.html, graph optimization includes the following:
"""
Categorical columns are optimized by removing redundant and unnecessary OPs. The left portion of Figure 2 contains the unoptimized portion of the graph. These are optimized as described below:
The Expand Dimension, GatherNd, NotEqual, and Where OPs that are used to get a non-empty input string of the required dimension are removed as they are redundant for the current dataset.
Error checking and handling OPs (NotEqual, GreaterEqual, SparseFillEmptyRows, Unique, etc.) and unique value calculation and reconstruction OPs (Unique, SparseSegmentSum/Mean, StridedSlice, Pack, Tile, etc.) are removed as they are not necessary for the current dataset.
"""
Would you please share this part of the graph optimization method or code?
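The article's actual transformation code isn't published here, but a generic sketch with the TF 1.x Graph Transform Tool shows the kind of pruning it describes; the file names, output node name, and transform list below are placeholders, not Intel's actual recipe:

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph  # TF 1.x tool

graph_def = tf.compat.v1.GraphDef()
with open("wide_deep_fp32_pretrained_model.pb", "rb") as f:   # placeholder path
    graph_def.ParseFromString(f.read())

transforms = [
    "strip_unused_nodes",
    "remove_nodes(op=Identity, op=CheckNumerics)",
    "fold_constants(ignore_errors=true)",
    "fold_batch_norms",
]
optimized = TransformGraph(
    graph_def,
    ["new_categorical_placeholder", "new_numeric_placeholder"],  # model inputs
    ["head/predictions/probabilities"],                          # placeholder output name
    transforms,
)
with open("wide_deep_optimized.pb", "wb") as f:
    f.write(optimized.SerializeToString())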

How does TensorFlow handle the situation of running a training with BF16 on a CPU which does not support it?

I compared the performance of training with FP32 and BF16 on a CPU which does not support BF16. Interestingly, the training managed to finish with BF16, though the performance was much worse than FP32. I also did a similar test for FP16, whose performance was also worse than FP32. Since the CPU supports neither FP16 nor BF16, I am very interested to know how TensorFlow handles such a situation.

The latest docker image was not compiled to use AVX2 AVX512F FMA

I pulled the latest TensorFlow image (docker pull intel/intel-optimized-tensorflow) and found the following message at the beginning of running a script:

2020-08-18 08:52:56.002977: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

Through some research, it seems the only way to work around this is to rebuild tensorflow. Is there a better way of fixing this issue in the docker image? Or will this have a negative impact on the performance?

imagenet data conversion script returns pthread error

I ran
./imagenet_to_tfrecords.sh dataset tpu
to convert the ImageNet data to TF records, and it returned:

2021-05-25 19:50:18.526589: F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() fa
iled.
Fatal Python error: Aborted

Thread 0x00007f8d9a57e740 (most recent call first):
File "/usr/local/lib64/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 442 in get_matching_files_v2
File "/usr/local/lib64/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 383 in get_matching_files
File "/home/manjeets/resnet50v1-5-fp32-inference/DATADIR/tpu/tools/datasets/imagenet_to_gcs.py", line 372 in convert_to_tf_records
File "/home/manjee2021-05-25 19:50:18.542511: F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread creation via
pthread_create() failed.

I've checked the thread limit with ulimit -u: 864604.

The system has many cores (> 200).

About OMP_NUM_THREADS

I tried changing OMP_NUM_THREADS to determine how it would affect CPU performance
by changing export OMP_NUM_THREADS=''
However, the outcome seemed unchanged, and I found the following in the output:
User settings:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=1
KMP_SETTINGS=1
OMP_NUM_THREADS=16

OMP_NUM_THREADS never changes, so how can I change it?

Patches missing for IntelTensorFlow_PerformanceAnalysis

Summary

/opt/intel/oneapi/modelzoo/latest/models/docs/notebooks/perf_analysis/profiling/patches doesn't contain the patches needed to run the Jupyter notebooks /opt/intel/oneapi/modelzoo/latest/models/docs/notebooks/perf_analysis/benchmark_perf_comparison.ipynb and benchmark_perf_timeline_analysis.ipynb.

For the two Jupyter notebooks under oneAPI-samples/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_PerformanceAnalysis (benchmark_perf_comparison and benchmark_perf_timeline_analysis) to work, the models should be patched to correctly produce timeline .json files. However, many models available in benchmark_perf_comparison don't have corresponding patches in /opt/intel/oneapi/modelzoo/latest/models/docs/notebooks/perf_analysis/profiling/patches.

URL

IntelTensorFlow_PerformanceAnalysis: https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_PerformanceAnalysis
benchmark_perf_comparison: https://github.com/IntelAI/models/blob/master/docs/notebooks/perf_analysis/benchmark_perf_comparison.ipynb

Steps to reproduce

I followed the instructions in the IntelTensorFlow_PerformanceAnalysis "Running the Sample" section and ran cp -rf /opt/intel/oneapi/modelzoo/latest/models ~/ to get the models. I also followed all other instructions to prepare both environments and run the code in the Jupyter notebook. For execution, I chose topology 0: resnet50 infer fp32 and topology 1: resnet50v1_5 infer fp32.

Observed behavior

After executing the benchmark_perf_comparison notebook, a .json file containing the TensorFlow timelines is expected to be produced. However, no json file is found. The problem is that the models have to be patched to produce the json file, but neither topology 0: resnet50 infer fp32 nor topology 1: resnet50v1_5 infer fp32 has a corresponding patch in models/docs/notebooks/perf_analysis/profiling/patches. These are not the only cases; in fact, most of the 12 supported topologies don't have corresponding patches.

Expected behavior

There should be corresponding patches in the models/docs/notebooks/perf_analysis/profiling/patches folder, so that a .json file containing the timeline can be output and be used in the following execution. I think either the supported topologies should be changed to match the existing patches, or additional patches should be added to match the supported topologies.
