caffe's Introduction

DISCONTINUATION OF PROJECT.

This project will no longer be maintained by Intel.

Intel has ceased development of and contributions to this project, including but not limited to maintenance, bug fixes, new releases, and updates.

Intel no longer accepts patches to this project.

If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Caffe

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors.

Check out the project site for all the details like

- DIY Deep Learning for Vision with Caffe
- Tutorial Documentation
- BVLC reference models and the community model zoo
- Installation instructions

and step-by-step examples.

Please join the caffe-users group or the Gitter chat at https://gitter.im/BVLC/caffe to ask questions and talk about methods and models. Framework development discussions and thorough bug reports are collected on Issues.

Happy brewing!

SSD: Single Shot MultiBox Detector

This repository contains code merged from a pull request issued to BVLC Caffe, written by Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg.

The original branch can be found at https://github.com/weiliu89/caffe/tree/ssd.

Read our wiki page for more details.

Intel® Distribution of Caffe*

This fork is dedicated to improving Caffe performance when running on CPUs, in particular Intel® Xeon processors.

Building

The build procedure is the same as on the bvlc-caffe-master branch. Both Make and CMake can be used. When OpenMP is available, it will be used automatically.
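
For reference, a minimal build sketch, assuming all dependencies are already installed (the commands are illustrative, not taken from this repository's docs):

    # Make-based build, driven by Makefile.config in the repo root
    cp Makefile.config.example Makefile.config
    make all -j$(nproc)

    # Or an out-of-source CMake build
    mkdir build && cd build
    cmake ..
    make -j$(nproc)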

Running

The run procedure is the same as on the bvlc-caffe-master branch.

The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. It is, however, possible to provide your own configuration through OpenMP environment variables such as OMP_NUM_THREADS or GOMP_CPU_AFFINITY.
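
For example, a sketch of overriding the defaults with the GCC OpenMP runtime (thread count, core list, and model path are illustrative):

    # Use 16 OpenMP threads, pinned to cores 0-15
    export OMP_NUM_THREADS=16
    export GOMP_CPU_AFFINITY="0-15"
    ./build/tools/caffe time -model models/bvlc_googlenet/deploy.prototxt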

If a system tool such as numactl is used to control CPU affinity, Caffe will by default prevent the use of more than one thread per core. When fewer cores than required are specified, Caffe will limit execution of its OpenMP threads to the specified cores only.
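
A sketch of the numactl case described above (node IDs and model path are illustrative):

    # Restrict caffe to the cores and memory of NUMA node 0;
    # caffe then limits its OpenMP threads to those cores
    numactl --cpunodebind=0 --membind=0 \
        ./build/tools/caffe time -model models/bvlc_googlenet/deploy.prototxt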

Best performance solution

Please read our Wiki for our recommendations and configuration to achieve best performance on Intel CPUs.

Multinode Training

The multi-node support in Intel® Distribution of Caffe* allows you to execute deep neural network training on multiple machines.

To understand how it works and to read tutorials, go to our Wiki. Start with the Multinode guide; a typical launch command is sketched below.
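
The sketch matches the mpirun invocations that appear in the issue reports later on this page (hostfile and solver paths are illustrative):

    # Train on 4 nodes, one process per node
    mpirun -n 4 -ppn 1 -machinefile ~/mpd.hosts \
        ./build/tools/caffe train --solver=models/intel_optimized_models/googlenet/solver.prototxt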

License and Citation

Caffe is released under the BSD 2-Clause license. The BVLC reference models are released for unrestricted use.

Please cite Caffe in your publications if it helps your research:

@article{jia2014caffe,
  Author = {Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor},
  Journal = {arXiv preprint arXiv:1408.5093},
  Title = {Caffe: Convolutional Architecture for Fast Feature Embedding},
  Year = {2014}
}

*Other names and brands may be claimed as the property of others.

caffe's Issues

classifier with the reference caffe model will crash

./build/examples/cpp_classification/classification.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg
---------- Prediction for examples/images/cat.jpg ----------
Segmentation fault (core dumped)

With gdb, the backtrace shows where it crashed:
(gdb) bt
#0 0x00007ffece1b40ae in parallel_doConversion_Simple_To_PCLData ()
at /opt/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_avx512_mic.so
#1 0x00007ffedce9227b in mkl_dnn_do_parallel_F32 ()
at /opt/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_intel_thread.so
#2 0x00007ffff141ad13 in __kmp_invoke_microtask ()
at /opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64_lin/libiomp5.so
#3 0x00007ffff13eab17 in __kmp_invoke_task_func (gtid=-1564475328)
at ../../src/kmp_runtime.c:7084
#4 0x00007ffff13ea1c5 in __kmp_launch_thread (this_thr=0x7ffea2c00040)
at ../../src/kmp_runtime.c:5680
#5 0x00007ffff141b193 in _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker(void*) (thr=0x7ffea2c00040) at ../../src/z_Linux_util.c:664
#6 0x00007ffff22dddc5 in start_thread () at /usr/lib64/libpthread.so.0
#7 0x00007ffff1df51cd in clone () at /usr/lib64/libc.so.6
(gdb) q

Does anybody know how to solve this issue?

Boost Error

/usr/include/boost/property_tree/detail/json_parser_read.hpp: In constructor ‘boost::property_tree::json_parser::json_grammar<Ptree>::definition<Scanner>::definition(const boost::property_tree::json_parser::json_grammar<Ptree>&)’:
/usr/include/boost/property_tree/detail/json_parser_read.hpp:257:264: error: ‘type name’ declared as function returning an array
                 escape
                                                                                                                                                                                                                                                                        ^
/usr/include/boost/property_tree/detail/json_parser_read.hpp:257:264: error: ‘type name’ declared as function returning an array

I am getting the above error during compilation.

Build error in latest pull - boost

CXX/LD -o .build_release/tools/caffe.bin
CXX/LD -o .build_release/tools/finetune_net.bin
CXX/LD -o .build_release/tools/create_label_map.bin
CXX/LD -o .build_release/tools/convert_imageset.bin
CXX/LD -o .build_release/tools/upgrade_net_proto_text.bin
CXX/LD -o .build_release/tools/convert_annoset.bin
CXX/LD -o .build_release/tools/train_net.bin
CXX/LD -o .build_release/tools/compute_image_mean.bin
CXX/LD -o .build_release/tools/get_image_size.bin
CXX/LD -o .build_release/tools/extract_features.bin
CXX/LD -o .build_release/tools/net_speed_benchmark.bin
CXX/LD -o .build_release/tools/test_net.bin
CXX/LD -o .build_release/tools/upgrade_net_proto_binary.bin
CXX/LD -o .build_release/tools/device_query.bin
CXX/LD -o .build_release/tools/upgrade_solver_proto_text.bin
CXX/LD -o .build_release/examples/ssd/ssd_detect.bin
CXX/LD -o .build_release/examples/mnist/convert_mnist_data.bin
CXX/LD -o .build_release/examples/cifar10/convert_cifar_data.bin
CXX/LD -o .build_release/examples/siamese/convert_mnist_siamese_data.bin
/usr/bin/ld: .build_release/examples/ssd/ssd_detect.o: undefined reference to symbol '_ZN5boost6system15system_categoryEv'
//usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make: *** [.build_release/examples/ssd/ssd_detect.bin] Error 1
make: *** Waiting for unfinished jobs....
make: *** wait: No child processes. Stop.

It seems like there is a linker issue with Boost. Any help resolving this is much appreciated.
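
For what it's worth, "DSO missing from command line" for boost::system::system_category usually means libboost_system is not passed explicitly to the linker. A hedged sketch of how to check, assuming the Make-based build (the Makefile edit is a guess, not a verified fix for this repo):

    # See whether boost_system is already on the link line
    grep -n "boost" Makefile
    # If it is missing, add boost_system to the LIBRARIES list in the
    # Makefile, then rebuild the failing target
    make clean && make all -j$(nproc)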

MLSL library error

Hello,

I am trying to run Intel's version of Caffe for benchmarking purposes, which I downloaded from:

                http://github.com/intel/caffe.git

The training command works fine when running on a local system (without MPI), but when I try to use MPI, as in the following command:

     mpirun -v -n 2 -ppn 1 -machinefile /home/demouser/mpd.hosts /home/demouser/intelcaffe/build/tools/caffe train   -solver /home/demouser/dogvscat/dogvscat_solver.prototxt

I receive the following error message:

(4328): /localdisk/jenkins/mlsl-build/src/comms_ep.cpp:CommsAlloc:535: ASSERT 'ptr' FAILED: NULL pointer

Does anybody have any idea why this occurs (the source for the MLSL library is not available - just the binary blob)?

Regards,
Andrei

An error occurred when running cpp_classification for ResNet

An error occurred when running cpp_classification for ResNet from https://github.com/KaimingHe/deep-residual-networks

./build/examples/cpp_classification/classification.bin ./ResNet-101-deploy.prototxt ./ResNet-101-model.caffemodel ./ResNet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg
F0830 02:25:38.625051 18678 net.cpp:785] Check failed: target_blobs.size() == source_layer.blobs_size() (2 vs. 3) Incompatible number of blobs for layer bn_conv1
*** Check failure stack trace: ***
@ 0x7f61a2222ddd google::LogMessage::Fail()
@ 0x7f61a2224c9f google::LogMessage::SendToLog()
@ 0x7f61a2222973 google::LogMessage::Flush()
@ 0x7f61a22255be google::LogMessageFatal::~LogMessageFatal()
@ 0x7f61a27ae822 caffe::Net<>::CopyTrainedLayersFrom()
@ 0x7f61a27b49d2 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
@ 0x7f61a27b4a36 caffe::Net<>::CopyTrainedLayersFrom()
@ 0x7f61a8f4d729 Classifier::Classifier()
@ 0x7f61a8f4a73b main
@ 0x7f619ee31af5 __libc_start_main
@ 0x7f61a8f4adbd (unknown)
Aborted (core dumped)

Meanwhile, there is no such error with BVLC Caffe.

How to train GoogleNet V2 using Intel Caffe?

Hi,
Recently I have been doing some research on Intel Caffe. In particular, I want to get performance data for GoogleNet V2 on multiple nodes. Can you tell me how to train GoogleNet V2? Concrete commands or a wiki URL would be best. Thank you! BTW, I found there are default_googlenet_v2 and mkl2017_googlenet_v2 in models. Which model should I use?

Best regards,
Xingyi

Build fails due to Boost shared-library relocation error

/usr/bin/ld: .build_release/src/caffe/multinode/SynchronousNode.o: relocation R_X86_64_TPOFF32 against `boost::asio::detail::keyword_tss_ptr::context>::value_' can not be used when making a shared object; recompile with -fPIC
.build_release/src/caffe/multinode/SynchronousNode.o: could not read symbols: Bad value

Updated, reset, and re-compiled, but it is still the same story as in one of the closed issues.

Intel Caffe takes longer with a cleared cache; run with cached images or clear the cache?

The prefetch batch takes a long time after clearing the cache, matching the behavior of the first Intel Caffe run after a reboot.

Here is the experiment we have done:

  1. Change Makefile.config to enable DEBUG := 1 and rebuild the caffe application.
  2. Run Intel Caffe with the cache cleared ("echo 3 > /proc/sys/vm/drop_caches"); you will see the forward time increase a lot.
    I1206 16:36:56.981101 23659 image_data_layer.cpp:231] Prefetch batch: 289 ms
  3. Run the caffe application again; most of the images have already been loaded into memory, and the forward time is now lower and consistent.
    I1206 16:50:49.657346 24815 image_data_layer.cpp:231] Prefetch batch: 20 ms.

So we need to confirm how we should run Intel Caffe: clear the cache before running, or run with the images already cached? (A reproduction sketch follows.)
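
A sketch of the two measurements, built directly from the commands above (the model path is illustrative):

    # Cold cache: drop the page cache, then time a run
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    ./build/tools/caffe time -model models/bvlc_googlenet/train_val.prototxt

    # Warm cache: run again with the images already in memory
    ./build/tools/caffe time -model models/bvlc_googlenet/train_val.prototxt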

compiling errors

./include/caffe/training_utils.hpp: In function 'int multiphase_train(caffe::MultiPhaseSolverParameter*, const string&, const string&, const int&, const string&)':
./include/caffe/training_utils.hpp:113:18: error: 'class caffe::SolverParameter' has no member named 'set_allocated_net_param'
solver_param.set_allocated_net_param(&topology_net_param);
^
In file included from /opt/cray/pe/trilinos/11.12.1.5/INTEL/14.0/x86_64/include/boost/system/system_error.hpp:14:0,
from /opt/cray/pe/trilinos/11.12.1.5/INTEL/14.0/x86_64/include/boost/thread/exceptions.hpp:22,
from /opt/cray/pe/trilinos/11.12.1.5/INTEL/14.0/x86_64/include/boost/thread/pthread/mutex.hpp:12,
from /opt/cray/pe/trilinos/11.12.1.5/INTEL/14.0/x86_64/include/boost/thread/mutex.hpp:16,
from ./include/caffe/syncedmem.hpp:52,
from ./include/caffe/blob.hpp:47,
from ./include/caffe/caffe.hpp:44,
from tools/caffe.cpp:54:
/opt/cray/pe/trilinos/11.12.1.5/INTEL/14.0/x86_64/include/boost/system/error_code.hpp: At global scope:
/opt/cray/pe/trilinos/11.12.1.5/INTEL/14.0/x86_64/include/boost/system/error_code.hpp:222:36: warning: 'boost::system::posix_category' defined but not used [-Wunused-variable]
static const error_category & posix_category = generic_category();
^
/opt/cray/pe/trilinos/11.12.1.5/INTEL/14.0/x86_64/include/boost/system/error_code.hpp:223:36: warning: 'boost::system::errno_ecat' defined but not used [-Wunused-variable]
static const error_category & errno_ecat = generic_category();
^
/opt/cray/pe/trilinos/11.12.1.5/INTEL/14.0/x86_64/include/boost/system/error_code.hpp:224:36: warning: 'boost::system::native_ecat' defined but not used [-Wunused-variable]
static const error_category & native_ecat = system_category();
^
Makefile:768: recipe for target '.build_release/tools/caffe.o' failed
make: *** [.build_release/tools/caffe.o] Error 1

What is the reason? Thanks!

Error while building with openBLAS

Ubuntu 12
CPU_ONLY := 1
BLAS := open

AR -o .build_release/lib/libcaffe.a
LD -o .build_release/lib/libcaffe.so.1.0.0-rc3
/usr/bin/ld: .build_release/src/caffe/internode/tcp_configuration.o: relocation R_X86_64_TPOFF32 against `boost::asio::detail::keyword_tss_ptr<boost::asio::detail::call_stack<boost::asio::detail::task_io_service, boost::asio::detail::task_io_service_thread_info>::context>::value_' can not be used when making a shared object; recompile with -fPIC
.build_release/src/caffe/internode/tcp_configuration.o: could not read symbols: Bad value
collect2: ld returned 1 exit status
make: *** [.build_release/lib/libcaffe.so.1.0.0-rc3] Error 1

Error running bvlc_reference_caffenet with USE_MKL2017_AS_DEFAULT_ENGINE turned on

Build Configuration:
Default Makefile.config with the delta below:

CPU_ONLY := 1
USE_MKL2017_AS_DEFAULT_ENGINE := 1
USE_OPENMP := 0

GCC version 4.8.4 on Ubuntu 14.04. Dual Socket CPU E5-2699 v4.

Other environment variables:

export MKL_NUM_THREADS=44
export MKL_DYNAMIC=false
export KMP_AFFINITY=granularity=fine,compact,1,0

make runtest
[==========] 1262 tests from 174 test cases ran. (98678 ms total)
[ PASSED ] 1262 tests.

./build/tools/caffe time -model models/bvlc_reference_caffenet/deploy.prototxt
Above command completes successfully.

./build/examples/cpp_classification/classification.bin ./models/bvlc_reference_caffenet/deploy.prototxt ./models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel ./test_files/mean.binaryproto ./test_files/synset_words.txt ./examples/images/cat.jpg
Above command fails with the log below:

--------- Prediction for ./examples/images/cat.jpg ----------
F1121 18:54:31.096916 177833 mkl_memory.cpp:196] Check failed: status == 0 (-1 vs. 0) Conversion from prv failed with status -1
*** Check failure stack trace: ***
@ 0x7ff3266d2daa (unknown)
@ 0x7ff3266d2ce4 (unknown)
@ 0x7ff3266d26e6 (unknown)
@ 0x7ff3266d5687 (unknown)
@ 0x7ff326aba764 caffe::MKLMemoryDescriptorBase<>::convert_from_prv()
@ 0x7ff326ab8512 caffe::SyncedMemory::cpu_data()
@ 0x7ff326bfe062 caffe::Blob<>::cpu_data()
@ 0x7ff326b85369 caffe::InnerProductLayer<>::Forward_cpu()
@ 0x7ff326be3f62 caffe::Net<>::ForwardFromTo()
@ 0x7ff326be41b5 caffe::Net<>::Forward()
@ 0x563d1bb2bc8e Classifier::Predict()
@ 0x563d1bb2be24 Classifier::Classify()
@ 0x563d1bb2a309 main
@ 0x7ff3250eef45 (unknown)
@ 0x563d1bb2a8c0 (unknown)
@ (nil) (unknown)
Aborted (core dumped)

bvlc_googlenet performance problem with AVX512 on KNL with MKL2017

When testing bvlc_googlenet performance with Intel Caffe + MKL2017 on KNL, performance is hurt by AVX512.

With the AVX512 path turned on, the forward time is 326 ms at batch size 32; with AVX512 off and the AVX2 path on, the forward time is 140 ms.

A layer-by-layer comparison shows a performance penalty with AVX512 on the inception_4c and inception_4b layers (the first number is AVX512 (ms), the second is AVX2 (ms)); batch size is 256.

Performance scaling between AVX512 and AVX2 works on bvlc_alexnet, vgg16, and vgg19; only bvlc_googlenet has the problem.

294.061 inception_4c/5x5_reduce forward: 295.612 1.5508
301.966 inception_4b/5x5_reduce forward: 303.554 1.5879
565.386 inception_4c/5x5 forward: 567.578 2.1916
565.529 inception_4b/5x5 forward: 567.842 2.3125
1205.41 inception_4b/5x5_reduce backward: 1211.83 6.4179
1206.53 inception_4c/5x5_reduce backward: 1212.96 6.4257
3127.84 inception_4c/5x5 backward: 3133.31 5.471
3127.93 inception_4b/5x5 backward: 3133.64 5.7065
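
The report does not say how the AVX512 path was toggled; one way to pin MKL's dispatch for such an A/B comparison is the MKL_ENABLE_INSTRUCTIONS environment variable (a standard MKL knob, though whether it was used here is an assumption):

    # Force the AVX2 code path and time the model
    MKL_ENABLE_INSTRUCTIONS=AVX2 \
        ./build/tools/caffe time -model models/bvlc_googlenet/train_val.prototxt
    # Force the AVX512 (KNL) code path and compare
    MKL_ENABLE_INSTRUCTIONS=AVX512_MIC \
        ./build/tools/caffe time -model models/bvlc_googlenet/train_val.prototxt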

Errors in latest code of Intel Caffe

Hi,
I tested the latest Intel Caffe code with MLSL on Xeon Phi. On 4 nodes, it crashes at iteration 320; below is the log:
I0207 20:34:07.109974 67930 blocking_queue.cpp:87] Waiting for data
*** Aborted at 1486517656 (unix time) try "date -d @1486517656" if you are using GNU date ***
PC: @ 0x7f0e3913ef0b __memcpy_avx512_no_vzeroupper
*** SIGBUS (@0x7e067c17cfe1) received by PID 67743 (TID 0x7f0526ff4700) from PID 2081935329; stack trace: ***
@ 0x7f0e39471370 (unknown)
@ 0x7f0e3913ef0b __memcpy_avx512_no_vzeroupper
@ 0x7f0e39cf657e (unknown)
@ 0x7f0e39cf65ed (unknown)
@ 0x7f0e3dbb3b0a caffe::db::LMDBCursor::value()
@ 0x7f0e3da3696e caffe::DataReader::DBSequential::value()
@ 0x7f0e3da354fc caffe::DataReader::Body::read_one()
@ 0x7f0e3da360f4 caffe::DataReader::Body::InternalThreadEntry()
@ 0x7f0e3a15127a (unknown)
@ 0x7f0e39469dc5 start_thread
@ 0x7f0e3919873d __clone
The environment info is:
CPU: Xeon Phi 7250 @ 1.40GHz
Memory: 32G*6
batch size: 96
The run command is:
mpirun -n 4 -ppn 1 -machinefile ~/mpd.hosts ./build/tools/caffe train -solver models/intel_optimized_models/googlenet/solver.prototxt -engine MKL2017
However, it runs well on a server with 16G*6 memory, with all the other environment details the same.
Thank you for your help; hoping for a reply ASAP.

problem with convergence while distributed training

Is there a known problem with distributed training? I have tested ResNet-50 on 1 node and on 10 nodes, and observed that the 10-node run does not converge, while the 1-node case is normal. Below are some logs from the 10-node case.

Mon Jan  9 09:02:16 2017[1,9]<stderr>:I0109 09:02:16.037472 25409 solver.cpp:288] [9] Iteration 9000, loss = 0.0017091
Mon Jan  9 09:02:16 2017[1,9]<stderr>:I0109 09:02:16.037565 25409 solver.cpp:309]     Train net output #0: loss = 0.00170914 (* 1 = 0.00170914 loss)
Mon Jan  9 09:02:16 2017[1,3]<stderr>:I0109 09:02:16.311161 20303 solver.cpp:288] [3] Iteration 9000, loss = 0.00197356
Mon Jan  9 09:02:16 2017[1,3]<stderr>:I0109 09:02:16.311249 20303 solver.cpp:309]     Train net output #0: loss = 0.00197353 (* 1 = 0.00197353 loss)
Mon Jan  9 09:02:19 2017[1,4]<stderr>:I0109 09:02:19.016288 23385 solver.cpp:288] [4] Iteration 9000, loss = 0.00176254
Mon Jan  9 09:02:19 2017[1,4]<stderr>:I0109 09:02:19.016391 23385 solver.cpp:309]     Train net output #0: loss = 0.00176255 (* 1 = 0.00176255 loss)
Mon Jan  9 09:02:19 2017[1,1]<stderr>:I0109 09:02:19.080947  9519 solver.cpp:288] [1] Iteration 9000, loss = 0.00233511
Mon Jan  9 09:02:19 2017[1,1]<stderr>:I0109 09:02:19.081048  9519 solver.cpp:309]     Train net output #0: loss = 0.00233503 (* 1 = 0.00233503 loss)
Mon Jan  9 09:02:19 2017[1,8]<stderr>:I0109 09:02:19.323590 17418 solver.cpp:288] [8] Iteration 9000, loss = 0.00130848
Mon Jan  9 09:02:19 2017[1,8]<stderr>:I0109 09:02:19.323689 17418 solver.cpp:309]     Train net output #0: loss = 0.00130845 (* 1 = 0.00130845 loss)
Mon Jan  9 09:02:19 2017[1,7]<stderr>:I0109 09:02:19.499922 29108 solver.cpp:288] [7] Iteration 9000, loss = 0.00123265
Mon Jan  9 09:02:19 2017[1,7]<stderr>:I0109 09:02:19.500016 29108 solver.cpp:309]     Train net output #0: loss = 0.00123263 (* 1 = 0.00123263 loss)
Mon Jan  9 09:02:19 2017[1,2]<stderr>:I0109 09:02:19.722164  4260 solver.cpp:288] [2] Iteration 9000, loss = 0.00176037
Mon Jan  9 09:02:19 2017[1,2]<stderr>:I0109 09:02:19.722316  4260 solver.cpp:309]     Train net output #0: loss = 0.00176033 (* 1 = 0.00176033 loss)
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803467 19982 solver.cpp:479]     Test net output #0: top-1 = 0.016
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803689 19982 solver.cpp:479]     Test net output #1: top-5 = 0.0832
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803788 19982 solver.cpp:529] Snapshotting to binary proto file ./output/resnet50_step1_iter_9000.caffemodel
Mon Jan  9 09:10:38 2017[1,0]<stderr>:I0109 09:10:38.504742 19982 sgd_solver.cpp:344] Snapshotting solver state to binary proto file ./output/resnet50_step1_iter_9000.solverstate

mkldnn issue

It appears the external MKL-DNN version included in the repo doesn't contain the C++ headers (.hpp), but they are required to compile.

Compilation error with latest code in mkl_dnn_cppwrapper.h

With the latest checkout, while trying to build, I am seeing multiple compilation issues. One of them is the following:

In file included from /home/u2205/GitHub/intelcaffe/caffe/include/caffe/mkl_memory.hpp:48:0,
from /home/u2205/GitHub/intelcaffe/caffe/include/caffe/layers/mkl_layers.hpp:52,
from /home/u2205/GitHub/intelcaffe/caffe/src/caffe/layers/mkl_pooling_layer.cpp:45:
/home/u2205/GitHub/intelcaffe/caffe/include/mkl_dnn_cppwrapper.h: In function ‘dnnError_t dnnBatchNormalizationCreateForward(_uniPrimitive_s**, dnnPrimitiveAttributes_t, dnnLayout_t, float, unsigned int) [with Dtype = float; dnnPrimitive_t = _uniPrimitive_s*; dnnPrimitiveAttributes_t = void*; dnnLayout_t = _dnnLayout_s*]’:
/home/u2205/GitHub/intelcaffe/caffe/include/mkl_dnn_cppwrapper.h:797:31: error: ‘dnnBatchNormalizationCreateForward_v2_F32’ was not declared in this scope
dataLayout, eps, flags); }

Is anyone else facing this issue?
Could you please resolve it?

concat layer error

I encountered the following error when there are 6 bottoms in a concat layer (prototxt of the Concat layer attached).

F1026 07:00:42.535639 14448 mkl_concat_layer.cpp:153] Check failed: e == E_SUCCESS (-1 vs. 0)
*** Check failure stack trace: ***
@ 0x7f94bf66ee6d (unknown)
@ 0x7f94bf670ced (unknown)
@ 0x7f94bf66ea5c (unknown)
@ 0x7f94bf67163e (unknown)
@ 0x7f94bfb7b144 caffe::MKLConcatLayer<>::Forward_cpu()
@ 0x7f94bfc63262 caffe::Net<>::ForwardFromTo()
@ 0x7f94bfc634b5 caffe::Net<>::Forward()
@ 0x7f94c643e688 Detector::Detect()
@ 0x7f94c643c3b9 main
@ 0x7f94bbe45b15 __libc_start_main
@ 0x7f94c643d4f5 (unknown)

layer {
  name: "mbox_loc"
  type: "Concat"
  bottom: "conv4_3_norm_mbox_loc_flat"
  bottom: "fc7_mbox_loc_flat"
  bottom: "conv6_2_mbox_loc_flat"
  bottom: "conv7_2_mbox_loc_flat"
  bottom: "conv8_2_mbox_loc_flat"
  bottom: "pool6_mbox_loc_flat"
  top: "mbox_loc"
  concat_param {
    axis: 1
  }
}

What dependencies are different?

Where can I find more information on the added dependencies (compared to BVLC/caffe)?

I found that intel/caffe requires C++11 (at least, the flag -std=c++11 is added in CMakeLists.txt).

Error while building with Intel MKL

I got the errors below when I tried to build Intel Caffe on Ubuntu 14.04.

CPU_ONLY := 1
BLAS := mkl

/usr/bin/ld: .build_release/src/caffe/internode/guaranteed_comm.o: relocation R_X86_64_TPOFF32 against `boost::asio::detail::keyword_tss_ptr<boost::asio::detail::call_stack<boost::asio::detail::task_io_service, boost::asio::detail::task_io_service_thread_info>::context>::value_' can not be used when making a shared object; recompile with -fPIC
.build_release/src/caffe/internode/guaranteed_comm.o: error adding symbols: Bad value
collect2: ld returned 1 exit status
make: *** [.build_release/lib/libcaffe.so.1.0.0-rc3] Error 1

Make all fails 86% libtiff 4.0 undefined reference

I have Ubuntu 16.04 and libtiff 4.0.7 compiled. Where should I put the .so file so that the linker will find it? A possible solution is described at http://stackoverflow.com/questions/29272497/linking-error-with-libopencv-highgui-so-under-ubuntu-14-04-strange-result-wit
Or are there any other possibilities?

cmake ..
-- Boost version: 1.58.0
-- Found the following Boost libraries:
-- system
-- thread
-- filesystem
-- chrono
-- date_time
-- atomic
-- Found gflags (include: /home/machineo/include, library: /usr/lib/x86_64-linux-gnu/libgflags.so)
-- Found glog (include: /home/machineo/include, library: /usr/lib/x86_64-linux-gnu/libglog.so)
-- Found PROTOBUF Compiler: /home/machineo/anaconda3/bin/protoc
-- Found lmdb (include: /home/machineo/include, library: /usr/lib/x86_64-linux-gnu/liblmdb.so)
-- Found LevelDB (include: /home/machineo/include, library: /usr/lib/x86_64-linux-gnu/libleveldb.so)
-- Found Snappy (include: /home/machineo/include, library: /usr/lib/x86_64-linux-gnu/libsnappy.so)
-- CUDA detected: 8.0
-- Found cuDNN: ver. 5.1.5 found (include: /usr/local/cuda-8.0/include, library: /usr/local/cuda-8.0/lib64/libcudnn.so)
-- Automatic GPU detection failed. Building for all known architectures.
-- Added CUDA NVCC flags for: sm_20 sm_21 sm_30 sm_35 sm_50 sm_60 sm_61
-- OpenCV found (/usr/share/OpenCV)
-- Found Atlas (include: /home/machineo/include, library: /usr/lib/libatlas.so)
-- NumPy ver. 1.12.0 found (include: /usr/local/lib/python2.7/dist-packages/numpy/core/include)
-- Boost version: 1.58.0
-- Found the following Boost libraries:
-- python
-- Detected Doxygen OUTPUT_DIRECTORY: ./doxygen/

-- ******************* Caffe Configuration Summary *******************
-- General:
-- Version : 1.0.0-rc4
-- Git : rc4-12-g39f28e4-dirty
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- Release CXX flags : -O3 -DNDEBUG -D_FORCE_INLINES -fPIC -Wall -Wno-sign-compare -Wno-uninitialized
-- Debug CXX flags : -g -D_FORCE_INLINES -fPIC -Wall -Wno-sign-compare -Wno-uninitialized
-- Build type : Release

-- BUILD_SHARED_LIBS : ON
-- BUILD_python : ON
-- BUILD_matlab : OFF
-- BUILD_docs : ON
-- CPU_ONLY : OFF
-- USE_OPENCV : ON
-- USE_LEVELDB : ON
-- USE_LMDB : ON
-- USE_NCCL : OFF
-- ALLOW_LMDB_NOLOCK : OFF

-- Dependencies:
-- BLAS : Yes (Atlas)
-- Boost : Yes (ver. 1.58)
-- glog : Yes
-- gflags : Yes
-- protobuf : Yes (ver. 3.0.0)
-- lmdb : Yes (ver. 0.9.17)
-- LevelDB : Yes (ver. 1.18)
-- Snappy : Yes (ver. 1.1.3)
-- OpenCV : Yes (ver. 2.4.9.1)
-- CUDA : Yes (ver. 8.0)

-- NVIDIA CUDA:
-- Target GPU(s) : Auto
-- GPU arch(s) : sm_20 sm_21 sm_30 sm_35 sm_50 sm_60 sm_61
-- cuDNN : Yes (ver. 5.1.5)

-- Python:
-- Interpreter : /usr/bin/python2.7 (ver. 2.7.12)
-- Libraries : /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
-- NumPy : /usr/local/lib/python2.7/dist-packages/numpy/core/include (ver 1.12.0)

-- Documentaion:
-- Doxygen : /usr/bin/doxygen (1.8.11)
-- config_file : /home/machineo/progs/caffe/.Doxyfile

-- Install:
-- Install path : /home/machineo/progs/caffe/build/install

-- Configuring done
-- Generating done
-- Build files have been written to: /home/machineo/progs/caffe/build

make all
[ 1%] Built target proto
[ 81%] Built target caffe
[ 81%] Built target train_net
[ 83%] Built target finetune_net
[ 84%] Built target net_speed_benchmark
[ 86%] Built target test_net
[ 86%] Linking CXX executable caffe
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFReadRGBAStrip@LIBTIFF_4.0'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFIsTiled@LIBTIFF_4.0'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFWriteScanline@LIBTIFF_4.0'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::internal::WireFormatLite::WriteStringMaybeAliased(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::io::CodedOutputStream*)'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFGetField@LIBTIFF_4.0'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFScanlineSize@LIBTIFF_4.0'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::io::CodedOutputStream::WriteStringWithSizeToArray(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned char*)'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::internal::empty_string[abi:cxx11]'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFReadEncodedTile@LIBTIFF_4.0'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFReadRGBATile@LIBTIFF_4.0'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFClose@LIBTIFF_4.0'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFRGBAImageOK@LIBTIFF_4.0'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::internal::WireFormatLite::WriteBytesMaybeAliased(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::io::CodedOutputStream*)'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::internal::ArenaStringPtr::AssignWithDefault(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, google::protobuf::internal::ArenaStringPtr)'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::DescriptorPool::FindFileByName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFOpen@LIBTIFF_4.0'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::Message::DebugString[abi:cxx11]() const'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFReadEncodedStrip@LIBTIFF_4.0'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::MessageFactory::InternalRegisterGeneratedFile(char const*, void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFSetField@LIBTIFF_4.0'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::MessageLite::ParseFromString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::internal::WireFormatLite::ReadBytes(google::protobuf::io::CodedInputStream*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFSetWarningHandler@LIBTIFF_4.0'
/usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9: undefined reference to `TIFFSetErrorHandler@LIBTIFF_4.0'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::internal::NameOfEnum[abi:cxx11](google::protobuf::EnumDescriptor const*, int)'
../lib/libcaffe.so.1.0.0-rc4: undefined reference to `google::protobuf::internal::WireFormatLite::WriteString(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::io::CodedOutputStream*)'
collect2: error: ld returned 1 exit status
tools/CMakeFiles/caffe.bin.dir/build.make:137: recipe for target 'tools/caffe' failed
make[2]: *** [tools/caffe] Error 1
CMakeFiles/Makefile2:587: recipe for target 'tools/CMakeFiles/caffe.bin.dir/all' failed
make[1]: *** [tools/CMakeFiles/caffe.bin.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
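
One thing stands out in the configuration summary above: protoc comes from Anaconda (/home/machineo/anaconda3/bin/protoc) while the link step uses system libraries, and mixing protobuf builds like that is a common cause of such undefined references. A hedged check (the PATH workaround is an assumption, not a verified fix):

    # See which protobuf the build actually picks up
    which protoc && protoc --version
    # Check which libtiff the OpenCV library expects
    ldd /usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.9 | grep -i tiff
    # If protoc resolves into anaconda3, retry the build with Anaconda
    # removed from PATH
    export PATH=$(echo "$PATH" | tr ':' '\n' | grep -v anaconda | paste -sd:)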

Performance of Intel Caffe latest code

Hi,
When I updated Intel Caffe yesterday, I found that the param_server flag is no longer supported. Why did you disable it? I tested multinode with MLSL using the command below:
mpirun -n 4 -ppn 1 -machinefile /home/spark/mpd.hosts ./build/tools/caffe train --solver=models/intel_optimized_models/googlenet/solver.prototxt
Is that right?
On 4 KNL nodes the FPS is 35.454, while the single-node FPS is 61.31. Detailed parameters are below:
CPU: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
Memory: 16GB*6
HT off
Turbo on
export OMP_NUM_THREADS=64
export KMP_AFFINITY="verbose, none"
export MIC_KMP_AFFINITY="verbose, none"
batchsize=96
Can you give me benchmark data and detailed parameters for the latest code on Xeon and Xeon Phi? I don't know whether my performance data is normal or not. Thank you!

Doesn't pick up OMP_NUM_THREADS

Hi
I am trying to run Intel Caffe on KNL as a host, but it doesn't pick up OMP_NUM_THREADS and just runs in a single thread. I am sure it is compiled with USE_OPENMP.
Any suggestions?
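
A quick sanity check (a sketch; with the Intel OpenMP runtime, KMP_AFFINITY=verbose prints one binding line per thread, so seeing only a single line would confirm the single-thread behavior):

    export OMP_NUM_THREADS=64
    export KMP_AFFINITY="verbose,none"
    ./build/tools/caffe time -model models/bvlc_googlenet/deploy.prototxt 2>&1 | grep "OMP:"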

Maximum batch size for GoogLeNet

Hey guys,
I hope this is the correct forum for tech support ;-)

I am trying to test the limits of an Intel Xeon Phi 7230 CPU running Intel Caffe built with MKL 17.0.098.

I want to find the largest batch size possible. When I go beyond about 512 images per batch running the caffe time command with mkl2017_googlenet_v1_knl/train_val.prototxt, I get to the "Performing Backward" part and the job comes back killed.

I am trying to diagnose whether this is an issue with the shared HPC system I am using (which uses PBS for scheduling) or the actual batch-size limit for this piece of hardware. The output "killed" is not very helpful and does not look like output from Caffe.

Please let me know if there is anything I can clarify. Thanks in advance for the help.
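
A bare "killed" with no Caffe log usually comes from outside the process, most often the kernel OOM killer or the scheduler enforcing a memory limit. A hedged check (requires permission to read the kernel log):

    # Look for OOM-killer activity right after the job dies
    dmesg | grep -iE "out of memory|oom|killed process"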

"ASSERT 'ptr' FAILED: NULL pointer"

Hi
I get an error when I'm running googlenet, but alexnet is OK.
The error:
(the error was posted as a screenshot; the message is the "ASSERT 'ptr' FAILED: NULL pointer" quoted in the title)
The cmd:
mpirun -n 2 -ppn 1 -machinefile ~/mpd.hosts ./build/tools/caffe train --solver=models/bvlc_googlenet/solver.prototxt -engine MKL2017
Please help, thank you.

What is the possible reason for this bug?

I0224 22:56:54.819372 29338 net.cpp:311] label_mnist_1_split does not need backward computation.
I0224 22:56:54.819403 29338 net.cpp:311] mnist does not need backward computation.
I0224 22:56:54.819428 29338 net.cpp:353] This network produces output accuracy
I0224 22:56:54.819458 29338 net.cpp:353] This network produces output loss
I0224 22:56:54.819519 29338 net.cpp:367] Network initialization done.
I0224 22:56:54.819759 29338 solver.cpp:106] Solver scaffolding done.
I0224 22:56:54.819890 29338 caffe.cpp:311] Configuring multinode setup
I0224 22:56:54.820473 29338 caffe.cpp:318] Starting Multi-node Optimization in mpi environment
I0224 22:56:54.820538 29338 SynchronousNode.cpp:603] [0] [proc 0] solving
I0224 22:56:54.820575 29338 solver.cpp:348] Solving LeNet
I0224 22:56:54.820600 29338 solver.cpp:349] Learning Rate Policy: inv
I0224 22:56:54.822309 29345 SynchronousNode.cpp:271] [0] Comm thread started 0 1
F0224 22:56:54.829016 1494 data_transformer.cpp:283] Check failed: channels == datum_channels (1 vs. 0)
@ 0x2aaaaafd1a29 caffe::DataLayer<>::load_batch()
*** Aborted at 1488005814 (unix time) try "date -d @1488005814" if you are using GNU date ***
@ 0x2aaab02e78c6 _INTERNAL_23_______src_kmp_tasking_c_748e5a98::__kmp_invoke_task()
@ 0x2aaab02e7f1a __kmp_execute_tasks_64
*** Check failure stack trace: ***
@ 0x2aaab0297a1f _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release()
@ 0x2aaab02991d8 __kmp_fork_barrier()
@ 0x2aaaad3fba7b google::LogMessage::Flush()
@ 0x2aaab02c2110 __kmp_launch_thread
PC: @ 0x2aaaab3c22fd caffe::DataTransformer<>::Transform()
*** SIGSEGV (@0x4c) received by PID 1485 (TID 0x2aac14803480) from PID 76; stack trace: ***
@ 0x2aaaad3fc3fd google::LogMessageFatal::~LogMessageFatal()
@ 0x2aaaada69870 (unknown)
@ 0x2aaab02f3193 _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker()
@ 0x2aaaada620a4 start_thread
@ 0x2aaab1b7002d __clone
@ 0x2aaaab3c22fd caffe::DataTransformer<>::Transform()
@ 0x2aaaab3c2ff8 caffe::DataTransformer<>::Transform()
@ 0x2aaaaafd1aab caffe::DataLayer<>::load_batch()
@ 0x2aaaaafd1aab caffe::DataLayer<>::load_batch()
@ 0x2aaab02e78c6 _INTERNAL_23_______src_kmp_tasking_c_748e5a98::__kmp_invoke_task()
@ 0x2aaab02e78c6 _INTERNAL_23_______src_kmp_tasking_c_748e5a98::__kmp_invoke_task()
@ 0x2aaab02e7f1a __kmp_execute_tasks_64
@ 0x2aaab0297a1f _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release()
@ 0x2aaab02e7f1a __kmp_execute_tasks_64
@ 0x2aaab02991d8 __kmp_fork_barrier()
@ 0x2aaab0297a1f _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release()
@ 0x2aaab02c2110 __kmp_launch_thread
@ 0x2aaab02991d8 __kmp_fork_barrier()
@ 0x2aaab02f3193 _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker()
@ 0x2aaaada620a4 start_thread
@ 0x2aaab02c2110 __kmp_launch_thread
@ 0x2aaab1b7002d __clone
I0224 22:56:54.823941 29345 SynchronousNode.cpp:431] [0] initialized root of cluster with nodes: 4 and the total iter size is: 4
*** Aborted at 1488005814 (unix time) try "date -d @1488005814" if you are using GNU date ***
PC: @ 0x2aaaab3c22fd caffe::DataTransformer<>::Transform()
*** SIGSEGV (@0x4c) received by PID 29338 (TID 0x2aab61805780) from PID 76; stack trace: ***
@ 0x2aaaada69870 (unknown)
@ 0x2aaaab3c22fd caffe::DataTransformer<>::Transform()
@ 0x2aaaaafd1aab caffe::DataLayer<>::load_batch()
@ 0x2aaab02e78c6 _INTERNAL_23_______src_kmp_tasking_c_748e5a98::__kmp_invoke_task()
@ 0x2aaab02e7f1a __kmp_execute_tasks_64
@ 0x2aaab0297a1f _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release()
@ 0x2aaab02991d8 __kmp_fork_barrier()
@ 0x2aaab02c2110 __kmp_launch_thread
@ 0x2aaab02f3193 _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker()
@ 0x2aaaada620a4 start_thread
@ 0x2aaab1b7002d __clone
/global/u2/y/yyang420/caffe_run/mnist/./runscript_train_1.sh: line 3: 16220 Segmentation fault caffe train --solver=lenet_solver1.prototxt --param_server=mpi
/global/u2/y/yyang420/caffe_run/mnist/./runscript_train_2.sh: line 3: 15775 Segmentation fault caffe train --solver=lenet_solver2.prototxt --param_server=mpi
/global/u2/y/yyang420/caffe_run/mnist/./runscript_train_3.sh: line 3: 1485 Segmentation fault caffe train --solver=lenet_solver3.prototxt --param_server=mpi
/global/u2/y/yyang420/caffe_run/mnist/./runscript_train_0.sh: line 3: 29338 Segmentation fault caffe train --solver=lenet_solver0.prototxt --param_server=mpi
srun: error: nid03316: task 1: Exited with exit code 139
srun: Terminating job step 3856520.0
srun: error: nid03326: task 2: Exited with exit code 139
srun: error: nid03442: task 3: Exited with exit code 139
srun: error: nid03197: task 0: Exited with exit code 139

Some remarks for the "Multinode guide" wiki

"caring out"
->
"carrying out"

"The OS image can be downloaded free of charge from the official website. "
Broken link.

"Refer to MLSL Wiki or "MLSL Developer Guide and Reference" for more details on the library."
MLSL wiki isn't ready yet - could you please remove it and mention "MLSL Developer Guide and Reference" as a hyperlink to https://github.com/01org/MLSL/blob/master/doc/Developer_Guide_and_Reference.pdf

"$ yum upgrade" and others.
I suppose the yum command requires privileged access, so it should look like "# yum upgrade" (or be run via sudo).

"$ ansible all -m shell -a 'rpm -i ~/intel-mlsl-devel-64-2017.0-006.x86_64.rpm'"
Does this command work without importing PUBLIC_KEY.PUB first?

"$ source /opt/intel/mlsl_2017.0.003/intel64/bin/mlslvars.sh"
This should be the 2017.0.006 version. Maybe add an MLSL version placeholder?

"/mpd.hosts"
->
"
/mpi.hosts"
Please rename it because MPD is an obsolete Intel MPI process manager.

MKLBatchNormLayer segfaults on CIFAR10 test

Running examples/cifar10/train_full_sigmoid_bn.sh results in a SEGFAULT. See test instructions. The test was done on the CPU instead of GPU (see Why train on a GPU? in the instructions).

Final part of the output:

I1115 10:17:46.048431 16368 net.cpp:345] Network initialization done.
I1115 10:17:46.048583 16368 solver.cpp:104] Solver scaffolding done.
*** Aborted at 1479205066 (unix time) try "date -d @1479205066" if you are using GNU date ***
PC: @     0x7f4d4ab46869 caffe::Blob<>::Reshape()
*** SIGSEGV (@0x28) received by PID 16368 (TID 0x7f4d4b998a40) from PID 40; stack trace: ***
    @     0x7f4d46a71330 (unknown)
    @     0x7f4d4ab46869 caffe::Blob<>::Reshape()
    @     0x7f4d4ab474f5 caffe::Blob<>::Blob()
    @     0x7f4d4aa9930d caffe::Creator_SGDSolver<>()
    @     0x7f4d4b7bf707 caffe::SolverRegistry<>::CreateSolver()
    @     0x7f4d4b7aeaab train()
    @     0x7f4d4b7ad166 main
    @     0x7f4d466bdf45 (unknown)
    @     0x7f4d4b7acf69 (unknown)
    @                0x0 (unknown)
Segmentation fault

If I modify layer_factory.cpp and force the use of the Caffe implementation of the BatchNorm layer, it works:

  if (engine == BatchNormParameter_Engine_DEFAULT) {
//#if defined(USE_MKL2017_AS_DEFAULT_ENGINE)
//    engine = BatchNormParameter_Engine_MKL2017;
//#elif defined(USE_MKLDNN_AS_DEFAULT_ENGINE)
//    engine = BatchNormParameter_Engine_MKLDNN;
//#else
    engine = BatchNormParameter_Engine_CAFFE;
//#endif
  }

Relevant Makefile.config vars:

CPU_ONLY := 1
USE_MKL2017_AS_DEFAULT_ENGINE := 1
CUSTOM_CXX := icpc # 2017 update1
BLAS := mkl
WITH_PYTHON_LAYER := 1

Intel Caffe will not build with cuDNN enabled

When enabling cuDNN (disabling CPU_ONLY), the build fails with 'undefined references' in MLSL and MPI.

This is due to an error in the Makefile. Change line 250:

ifneq ($(CPU_ONLY), 1)
	INCLUDE_DIRS += $(CUDA_INCLUDE_DIR)
	LIBRARY_DIRS += $(CUDA_LIB_DIR)
(-)	LIBRARIES := cudart cublas curand
(+)	LIBRARIES += cudart cublas curand
endif

The reason for the change: when CPU_ONLY is not set, the LIBRARIES variable is reset to the CUDA libs only, and all previously collected libraries are lost. The assignment should be inclusive, adding the CUDA libs to the existing list.

dilation not supported in MKLConvolutionLayer

If dilation is used in a conv layer, it reports this error:

F1101 09:08:47.481178 131136 mkl_convolution_layer.cpp:369] Check failed: top[0]->width() == ow && top[0]->height() == oh && top[0]->channels() == oc*g && top[0]->num() == n Inclompatible shape of bottom with layer

There is also a typo: the message should read "Inclompatible shape of top with layer" (the check is on the top blob, not the bottom).

Multi-node Training

Is there any paper or report about how Intel Caffe conducts multi-node processing?

E.g. the way of communication, data partition, communication frequency, strong scaling efficiency.

Is Intel Caffe using the same idea as the Parameter Server (J. Dean et al., NIPS 2012)?

performance dropped to the bottom when running on Atom device linked with MKL2017

We tried to evaluate the performance of Intel Caffe running on Atom. The out-of-the-box Intel Caffe outperforms BVLC Caffe, which is nice to have. However, when we ran Intel Caffe with MKL2017, expecting the same performance boost as on Xeon and Xeon Phi, performance instead dropped by more than 20x.

AlexNet, batch size =1
bvlc Caffe + mkl: 12 FPS
Intel Caffe OOB: 15 FPS
Intel Caffe with MKL2017: 0.7 FPS

GoogleNet, batch size = 8
bvlc Caffe + mkl: 4.2 FPS
Intel Caffe OOB: 11 FPS
Intel Caffe with MKL2017: 0.25 FPS

Support for MKLDNN is not enabled

Hi,
when I use MKL-DNN, I get the error "Support for MKLDNN is not enabled".

Default Makefile.config with the delta below:
CPU_ONLY := 1
USE_MKLDNN_AS_DEFAULT_ENGINE := 1
USE_OPENMP := 0

But changing to USE_MKL2017_AS_DEFAULT_ENGINE := 1 and using MKL2017 works fine!

Building for KNL, getting 'not declared in this scope' errors

I'm compiling on a Xeon login node for a 12-node KNL test bed. When building with CMake I get the error:
[screenshot of the 'not declared in this scope' build error]

I can't tell if this is a problem with my build or with the repo, which is why I am reporting it. Build details are below.

I've installed:
autoconf-2.69
boost_1_62_0
cmake-3.7.1
gflags-2.2.0
glog-0.3.4
hdf5-1.8.18
leveldb-1.19
lmdb-LMDB_0.9.18
mpich-3.2
OpenBLAS-0.2.19
opencv-2.4.13
protobuf-3.1.0
snappy-1.1.3

I'm using a native version of gcc 4.8.5 installed on a RHEL 7.2 system.

'git describe --tags' reports:
self_containted_MKLGOLD_u1-82-g37c8170

My cmake command is pretty big, but it tells CMake where to find all my installs, as most of the FindX.cmake modules fail for me.

OpenBLAS_HOME=$HOME/tools/OpenBLAS-0.2.19 CXX=mpicxx
cmake
-DCPU_ONLY=1
-DUSE_MPI=1
-DBLAS=open
-DCMAKE_INSTALL_PREFIX=$HOME/caffe/caffe/install
-DBoost_INCLUDE_DIR=$HOME/tools/boost_1_62_0/install/include
-DGFLAGS_ROOT_DIR=$HOME/tools/gflags-2.2.0/install
-DGFLAGS_LIBRARY=$HOME/tools/gflags-2.2.0/install/lib
-DGFLAGS_INCLUDE_DIR=$HOME/tools/gflags-2.2.0/install/include
-DGLOG_INCLUDE_DIR=$HOME/tools/glog-0.3.4/install/include
-DGLOG_LIBRARY=$HOME/tools/glog-0.3.4/install/lib
-DProtobuf_INCLUDE_DIR=$HOME/tools/protobuf-3.1.0/install/include
-DProtobuf_LIBRARY=$HOME/tools/protobuf-3.1.0/install/lib
-DHDF5_C_LIBRARY_hdf5=$HOME/tools/hdf5-1.8.18/install/lib/libhdf5.so
-DHDF5_DIR=$HOME/tools/protobuf-3.1.0/
-DHDF5_C_LIBRARY_hdf5_hl=$HOME/tools/hdf5-1.8.18/install/lib/libhdf5_hl.so
-DHDF5_CXX_COMPILER_EXECUTABLE=$HOME/tools/hdf5-1.8.18/install/bin/h5c++
-DHDF5_CXX_LIBRARY_hdf5=$HOME/tools/hdf5-1.8.18/install/lib/libhdf5.so
-DHDF5_CXX_LIBRARY_hdf5_cpp=$HOME/tools/hdf5-1.8.18/install/lib/libhdf5_cpp.so
-DHDF5_CXX_LIBRARY_hdf5_hl=$HOME/tools/hdf5-1.8.18/install/lib/libhdf5_hl.so
-DHDF5_CXX_LIBRARY_hdf5_hl_cpp=$HOME/tools/hdf5-1.8.18/install/lib/libhdf5_hl_cpp.so
-DHDF5_DIR=$HOME/tools/hdf5-1.8.18
-DLMDB_INCLUDE_DIR=$HOME/tools/lmdb-LMDB_0.9.18/install/include
-DLMDB_LIBRARIES=$HOME/tools/lmdb-LMDB_0.9.18/install/lib
-DLevelDB_INCLUDE=$HOME/tools/leveldb-1.19/include
-DLevelDB_LIBRARY=$HOME/tools/leveldb-1.19/out-shared
-DSnappy_INCLUDE_DIR=$HOME/tools/snappy-1.1.3/install/include
-DSnappy_LIBRARIES=$HOME/tools/snappy-1.1.3/install/lib
-DOpenCV_LIB_DIR_OPT=$HOME/tools/opencv-2.4.13/install/lib
-DOpenCV_LIB_DIR_DBG=$HOME/tools/opencv-2.4.13/install/lib
-DCMAKE_C_FLAGS="-I/home/jchilders/tools/install/include"
-DCMAKE_CXX_FLAGS="-I/home/jchilders/tools/install/include"
-DCMAKE_EXE_LINKER_FLAGS="-L/home/jchilders/tools/install/lib -lprotobuf -lprotoc -lopencv_core -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_highgui -lopencv_imgproc -lopencv_legacy -lopencv_ml -lopencv_nonfree -lopencv_objdetect -lopencv_ocl -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -lhdf5_cpp -lhdf5_hl_cpp -lhdf5_hl -lhdf5 -lglog -lgflags -lleveldb -llmdb"
-DCMAKE_MODULE_LINKER_FLAGS="-L/home/jchilders/tools/install/lib -lprotobuf -lprotoc -lopencv_core -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_highgui -lopencv_imgproc -lopencv_legacy -lopencv_ml -lopencv_nonfree -lopencv_objdetect -lopencv_ocl -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -lhdf5_cpp -lhdf5_hl_cpp -lhdf5_hl -lhdf5 -lglog -lgflags -leveldb -llmdb"
-DCMAKE_SHARED_LINKER_FLAGS="-L/home/jchilders/tools/install/lib -lprotobuf -lprotoc -lopencv_core -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_highgui -lopencv_imgproc -lopencv_legacy -lopencv_ml -lopencv_nonfree -lopencv_objdetect -lopencv_ocl -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -lhdf5_cpp -lhdf5_hl_cpp -lhdf5_hl -lhdf5 -lglog -lgflags -lleveldb -llmdb" ..

Then I run make, but get the error above.

Problem in the pooling layer

I got different results for mnist/lenet depending on the number of OpenMP threads.
At 1 thread: [screenshot from 2017-01-24 12-15-01]
At 16 threads: [screenshot from 2017-01-24 12-14-06]
Looking at the source, I think this is due to a bug in pooling_layer.cpp.

Originally, in Forward_cpu (and also in Backward_cpu), OpenMP was used to parallelize two levels of for loops:

#ifdef _OPENMP
#pragma omp parallel for collapse(2)
#endif
for (int image = 0; image < num_batches; ++image)
  for (int channel = 0; channel < num_channels; ++channel)
    ...........
However, there is a data dependency between different batches, so the first for loop should not be parallelized (I think).

So I changed the source to:

// #pragma omp parallel for collapse(2)
for (int image = 0; image < num_batches; ++image)
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int channel = 0; channel < num_channels; ++channel)

(same change for Backward_cpu)
[screenshot from 2017-01-24 13-09-34]

And the result is now correct even with 16 threads (unfortunately a bit slower, though).

Best Regards
Toshi(yuki)

latest intel caffe + resnet cpp_classification crash

Hi, I tried to run the Caffe example cpp_classification using the latest Intel Caffe + MKL2017, but it crashed.
Help!

The error:

F0106 21:40:14.463059 29978 net.cpp:1227] Check failed: target_blobs.size() == source_layer.blobs_size() (2 vs. 3) Incompatible number of blobs for layer bn_conv1

How I build:

cmake -DCPU_ONLY=ON -DUSE_MKL2017_AS_DEFAULT_ENGINE=ON

Here is my configuration:

system: ubuntu 14.04
intel caffe: commit: 0bc848b
mkl: mklml_lnx_2017.0.1.20161005
resnet caffe model: https://github.com/KaimingHe/deep-residual-networks

Thanks.

Result not correct when applying it to SSD (Single Shot MultiBox Detector)

I merged the code of SSD (https://github.com/weiliu89/caffe/tree/ssd) and Intel Caffe, but when running ssd_detect with SSD300, the result is not correct if MKL2017_AS_DEFAULT_ENGINE is enabled.
After the code merge, the following two things were confirmed:

  1. with MKL only, ssd_detect shows the correct result
  2. with MKL2017_AS_DEFAULT_ENGINE enabled, cpp_classification works correctly with VGG16 (the classification result is correct)

I guess SSD300 has a different input shape (3x300x300, not 3x224x224), which makes the convolution result incorrect.
Could you please help look into it? Thanks!

Suggestion about wiki

I think some tips in the wiki are really hard to understand; clearer descriptions would be appreciated. For example, in Recommendations-to-achieve-best-performance, the first tip is "Disable Hyper-threading (HT) on your platform." What is HT? And how do I disable it? Maybe it's clear to you guys, but in my mind it's $!@#!@#....
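
For what it's worth, disabling Hyper-Threading is normally done in the BIOS/UEFI setup, which is why the wiki cannot script it; a quick way to check its current state from Linux:

    # "Thread(s) per core: 2" means HT is on; "1" means it is off
    lscpu | grep "Thread(s) per core"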
