Giter Club home page Giter Club logo

pocketflow's People

Contributors

haolibai avatar jianyuheng avatar jiaxiang-wu avatar jiayiliu avatar jinhou avatar kranthigv avatar lee-bin avatar psyyz10 avatar ylfzr avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pocketflow's Issues

Issue: tensorflow.python.framework.errors_impl.NotFoundError: "/home/karen/cifar-10-batches-bin/"; No such file or directory

Hardware: Intel Xeon CPU, M2000M Nvidia GPU, 64G RAM
Operating System: Ubuntu 16.04

Environment:

  • tensorflow==1.10.1
  • Anaconda Python 3.6.6
  • pandas == 0.23.4
  • scikit-learn==0.20.0

Issue:
Not found: "/home/karen/cifar-10-batches-bin/"; No such file or directory

Trace:

2018-11-03 22:53:28.578113: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at matching_files_op.cc:49 : Not found: "/home/karen/cifar-10-batches-bin/"; No such file or directory
2018-11-03 22:53:28.578582: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at matching_files_op.cc:49 : Not found: "/home/karen/cifar-10-batches-bin/"; No such file or directory
2018-11-03 22:53:28.578636: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at iterator_ops.cc:910 : Not found: "/home/karen/cifar-10-batches-bin/"; No such file or directory
         [[Node: ShuffleDataset/data/MatchingFiles = MatchingFiles[](ShuffleDataset/data/MatchingFiles/pattern)]]
Traceback (most recent call last):
  File "/home/karen/anaconda3/envs/PocketFlow/lib/python3.6/site-packages/tensorflow/python/client/session.py",line 1278, in _do_call
    return fn(*args)
  File "/home/karen/anaconda3/envs/PocketFlow/lib/python3.6/site-packages/tensorflow/python/client/session.py",line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/karen/anaconda3/envs/PocketFlow/lib/python3.6/site-packages/tensorflow/python/client/session.py",line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: "/home/karen/cifar-10-batches-bin/"; No such file or directory
         [[Node: ShuffleDataset/data/MatchingFiles = MatchingFiles[](ShuffleDataset/data/MatchingFiles/pattern)]]
         [[Node: data/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_zMNwdOMmCuY[], output_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 69, in <module>
    tf.app.run()
  File "/home/karen/anaconda3/envs/PocketFlow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 55, in main
    learner.train()
  File "/home/karen/workspace/sideprojects/PocketFlow/learners/full_precision/learner.py", line 71, in train
    self.sess_train.run(self.train_op)
  File "/home/karen/anaconda3/envs/PocketFlow/lib/python3.6/site-packages/tensorflow/python/client/session.py",line 877, in run
    run_metadata_ptr)
  File "/home/karen/anaconda3/envs/PocketFlow/lib/python3.6/site-packages/tensorflow/python/client/session.py",line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/karen/anaconda3/envs/PocketFlow/lib/python3.6/site-packages/tensorflow/python/client/session.py",line 1272, in _do_run
    run_metadata)
  File "/home/karen/anaconda3/envs/PocketFlow/lib/python3.6/site-packages/tensorflow/python/client/session.py",line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: "/home/karen/cifar-10-batches-bin/"; No such file or directory
         [[Node: ShuffleDataset/data/MatchingFiles = MatchingFiles[](ShuffleDataset/data/MatchingFiles/pattern)]]
         [[Node: data/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_zMNwdOMmCuY[], output_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]

I checked with terminal /home/karen/cifar-10-batches-bin exists`` with the ls```:

ls cifar-10-batches-bin/
batches.meta.txt  data_batch_2.bin  data_batch_4.bin  readme.html
data_batch_1.bin  data_batch_3.bin  data_batch_5.bin  test_batch.bin

path.conf configuration:

data_dir_local_cifar10 = "/home/karen/cifar-10-batches-bin/"

I ran ./scripts/run_local.sh nets/resnet_at_cifar10_run.py

Which optimization functions need to be retrained?

Hi,PocketFlow provides some optimization fuctions,such as channel pruning,weight sparsification,weight quantization,network distillation,multi-GPU training, hyper-parameter optimization.
Could you tell me which one needs to be retrained when I want to use.
Thanks very much!

distill model in full-prec mode

In full-prec mode, DistillationHelper create leaner and FullPrecLearner use the same model_helper. I think the distilled model is the same model as primary model, so that it's not distilling. I suggest that provide another model_helper as the distilled model.

about fine tune

there is a question that i want to ask:
in real application, for example like image binary classification, there are tow method i can use with this scrips : 1 down pretrain model and compress it, then finetune from aleady compressed pretrain model to binary cls model. 2 down pretrain model and finetune from pretrain model to binary cls model, then compress binary cls model. which method can you suggest me to chose? of course , both method i can try to chose the better one , but i just want to ask for your opinion.

Support for 1D convolutions

I was wondering if PocketFlow supports 1D convolutions. If not, how could this functionality be added? I want to compress a CNN with 1D convolutions for string classification.

Resnet20 cifar10 compress experiment cant get a small model

Hello, I run the example in tutorial of cifar10 resnet20 experiment, in ChannelPrune Mode, after the compress procedure and the finetune procedure, I get some ckpt models in ./modes, such as original_model.ckpt*, pruned_model.ckpt., best_model.ckpt., however, all of those models are 1.1M in my disk, It seems that compression doesn't work , I inspect to the variable of ckpt models, the shape of conv kernels are not smaller, either, How can I get the compressed model to test inference speedup use ckpt model?
I use the following command and
./scripts/run_seven.sh nets/resnet_at_cifar10_run.py
--learner channel
--batch_size_eval 64
--cp_uniform_preserve_ratio 0.5
--cp_prune_option uniform
--resnet_size 20

Can't use GPU in the local mode.

Describe the bug
My environment settings are:

  1. Anaconda env and Python = 3.6
  2. Tensorflow = 1.11.0
  3. CUDA version = 9.0

To Reproduce
Steps to reproduce the behavior:

  1. I run the demo shown in the local mode:
    "./scripts/run_local.sh nets/resnet_at_cifar10_run.py"
  2. Then I got these information as blow:
    (pruning_tf) daisy@deep-learning:~/Pruning_and_Compression/PocketFlow$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py Python script: nets/resnet_at_cifar10_run.py of GPUs: 1 extra arguments: --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /home/daisy/Pruning_and_Compression/PocketFlow/data/cifar-10-batches-bin Traceback (most recent call last): File "utils/get_idle_gpus.py", line 54, in <module> raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus)) ValueError: not enough idle GPUs; idle GPUs are: [] 'nets/resnet_at_cifar10_run.py' -> 'main.py' multi-GPU training disabled [WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported INFO:tensorflow:FLAGS: INFO:tensorflow:data_disk: local INFO:tensorflow:data_hdfs_host: None INFO:tensorflow:data_dir_local: /home/daisy/Pruning_and_Compression/PocketFlow/data/cifar-10-batches-bin INFO:tensorflow:data_dir_hdfs: None INFO:tensorflow:cycle_length: 4 INFO:tensorflow:nb_threads: 8 INFO:tensorflow:buffer_size: 1024 INFO:tensorflow:prefetch_size: 8 INFO:tensorflow:nb_classes: 10 INFO:tensorflow:nb_smpls_train: 50000 INFO:tensorflow:nb_smpls_val: 5000 INFO:tensorflow:nb_smpls_eval: 10000 INFO:tensorflow:batch_size: 128 INFO:tensorflow:batch_size_eval: 100 INFO:tensorflow:resnet_size: 20 INFO:tensorflow:lrn_rate_init: 0.1 INFO:tensorflow:batch_size_norm: 128.0 INFO:tensorflow:momentum: 0.9 INFO:tensorflow:loss_w_dcy: 0.0002 INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow INFO:tensorflow:summ_step: 100 INFO:tensorflow:save_step: 10000 INFO:tensorflow:save_path: ./models/model.ckpt INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt INFO:tensorflow:enbl_dst: False INFO:tensorflow:enbl_warm_start: False INFO:tensorflow:loss_w_dst: 4.0 INFO:tensorflow:tempr_dst: 4.0 INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt INFO:tensorflow:nb_epochs_rat: 1.0 INFO:tensorflow:ddpg_actor_depth: 2 INFO:tensorflow:ddpg_actor_width: 64 INFO:tensorflow:ddpg_critic_depth: 2 INFO:tensorflow:ddpg_critic_width: 64 INFO:tensorflow:ddpg_noise_type: param INFO:tensorflow:ddpg_noise_prtl: tdecy INFO:tensorflow:ddpg_noise_std_init: 1.0 INFO:tensorflow:ddpg_noise_dst_finl: 0.01 INFO:tensorflow:ddpg_noise_adpt_rat: 1.03 INFO:tensorflow:ddpg_noise_std_finl: 1e-05 INFO:tensorflow:ddpg_rms_eps: 0.0001 INFO:tensorflow:ddpg_tau: 0.01 INFO:tensorflow:ddpg_gamma: 0.9 INFO:tensorflow:ddpg_lrn_rate: 0.001 INFO:tensorflow:ddpg_loss_w_dcy: 0.0 INFO:tensorflow:ddpg_record_step: 1 INFO:tensorflow:ddpg_batch_size: 64 INFO:tensorflow:ddpg_enbl_bsln_func: True INFO:tensorflow:ddpg_bsln_decy_rate: 0.95 INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt INFO:tensorflow:ws_prune_ratio: 0.75 INFO:tensorflow:ws_prune_ratio_prtl: optimal INFO:tensorflow:ws_nb_rlouts: 200 INFO:tensorflow:ws_nb_rlouts_min: 50 INFO:tensorflow:ws_reward_type: single-obj INFO:tensorflow:ws_lrn_rate_rg: 0.03 INFO:tensorflow:ws_nb_iters_rg: 20 INFO:tensorflow:ws_lrn_rate_ft: 0.0003 INFO:tensorflow:ws_nb_iters_ft: 400 INFO:tensorflow:ws_nb_iters_feval: 25 INFO:tensorflow:ws_prune_ratio_exp: 3.0 INFO:tensorflow:ws_iter_ratio_beg: 0.1 INFO:tensorflow:ws_iter_ratio_end: 0.5 INFO:tensorflow:ws_mask_update_step: 500.0 INFO:tensorflow:cp_lasso: True INFO:tensorflow:cp_quadruple: False INFO:tensorflow:cp_reward_policy: accuracy INFO:tensorflow:cp_nb_points_per_layer: 10 INFO:tensorflow:cp_nb_batches: 60 INFO:tensorflow:cp_prune_option: auto INFO:tensorflow:cp_prune_list_file: ratio.list INFO:tensorflow:cp_best_path: ./models/best_model.ckpt INFO:tensorflow:cp_original_path: ./models/original_model.ckpt INFO:tensorflow:cp_preserve_ratio: 0.5 INFO:tensorflow:cp_uniform_preserve_ratio: 0.6 INFO:tensorflow:cp_noise_tolerance: 0.15 INFO:tensorflow:cp_lrn_rate_ft: 0.0001 INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2 INFO:tensorflow:cp_finetune: False INFO:tensorflow:cp_retrain: False INFO:tensorflow:cp_list_group: 1000 INFO:tensorflow:cp_nb_rlouts: 200 INFO:tensorflow:cp_nb_rlouts_min: 50 INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt INFO:tensorflow:dcp_prune_ratio: 0.5 INFO:tensorflow:dcp_nb_stages: 3 INFO:tensorflow:dcp_lrn_rate_adam: 0.001 INFO:tensorflow:dcp_nb_iters_block: 10000 INFO:tensorflow:dcp_nb_iters_layer: 500 INFO:tensorflow:uql_equivalent_bits: 4 INFO:tensorflow:uql_nb_rlouts: 200 INFO:tensorflow:uql_w_bit_min: 2 INFO:tensorflow:uql_w_bit_max: 8 INFO:tensorflow:uql_tune_layerwise_steps: 100 INFO:tensorflow:uql_tune_global_steps: 2000 INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt INFO:tensorflow:uql_tune_disp_steps: 300 INFO:tensorflow:uql_enbl_random_layers: True INFO:tensorflow:uql_enbl_rl_agent: False INFO:tensorflow:uql_enbl_rl_global_tune: True INFO:tensorflow:uql_enbl_rl_layerwise_tune: False INFO:tensorflow:uql_weight_bits: 4 INFO:tensorflow:uql_activation_bits: 32 INFO:tensorflow:uql_use_buckets: False INFO:tensorflow:uql_bucket_size: 256 INFO:tensorflow:uql_quant_epochs: 60 INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt INFO:tensorflow:uql_quantize_all_layers: False INFO:tensorflow:uql_bucket_type: channel INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt INFO:tensorflow:uqtf_weight_bits: 8 INFO:tensorflow:uqtf_activation_bits: 8 INFO:tensorflow:uqtf_quant_delay: 0 INFO:tensorflow:uqtf_freeze_bn_delay: None INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01 INFO:tensorflow:nuql_equivalent_bits: 4 INFO:tensorflow:nuql_nb_rlouts: 200 INFO:tensorflow:nuql_w_bit_min: 2 INFO:tensorflow:nuql_w_bit_max: 8 INFO:tensorflow:nuql_tune_layerwise_steps: 100 INFO:tensorflow:nuql_tune_global_steps: 2101 INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt INFO:tensorflow:nuql_tune_disp_steps: 300 INFO:tensorflow:nuql_enbl_random_layers: True INFO:tensorflow:nuql_enbl_rl_agent: False INFO:tensorflow:nuql_enbl_rl_global_tune: True INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False INFO:tensorflow:nuql_init_style: quantile INFO:tensorflow:nuql_opt_mode: weights INFO:tensorflow:nuql_weight_bits: 4 INFO:tensorflow:nuql_activation_bits: 32 INFO:tensorflow:nuql_use_buckets: False INFO:tensorflow:nuql_bucket_size: 256 INFO:tensorflow:nuql_quant_epochs: 60 INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt INFO:tensorflow:nuql_quantize_all_layers: False INFO:tensorflow:nuql_bucket_type: split INFO:tensorflow:log_dir: ./logs INFO:tensorflow:enbl_multi_gpu: False INFO:tensorflow:learner: full-prec INFO:tensorflow:exec_mode: train INFO:tensorflow:debug: False INFO:tensorflow:h: False INFO:tensorflow:help: False INFO:tensorflow:helpfull: False INFO:tensorflow:helpshort: False 2018-11-08 09:10:59.811925: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 2018-11-08 09:10:59.814345: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected 2018-11-08 09:10:59.814367: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: deep-learning 2018-11-08 09:10:59.814374: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: deep-learning 2018-11-08 09:10:59.814396: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.12.0 2018-11-08 09:10:59.814417: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.12.0 2018-11-08 09:10:59.814424: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 390.12.0 INFO:tensorflow:iter #100: lr = 1.0000e-01 | loss = 1.7772e+00 | accuracy = 3.7500e-01 | speed = 95.59 pics / sec

Expected behavior
Firstly, I found that the speed is too slow. I thought that the GPU device is not used. Then, I noticed that
failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected in the print information. So I checked the GPU device information using nvidia-smi in terminal. I got these information,
`(pruning_tf) daisy@deep-learning:~$ nvidia-smi
Thu Nov 8 09:25:12 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.12 Driver Version: 390.12 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:01:00.0 On | N/A |
| 30% 67C P0 68W / 250W | 244MiB / 6080MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1091 G /usr/lib/xorg/Xorg 242MiB |
+-----------------------------------------------------------------------------+`

,which proved that I had installed right CUDA version and nvidia driver verison.
Last, I tried to import tensorflow module in the python. But it did not report any error about CUDA. What it printed is
`(pruning_tf) daisy@deep-learning:~$ python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow as tf
/home/daisy/anaconda3/lib/python3.6/site-packages/h5py/init.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters

`
I don't think the further warning is the key reason.

The reason for slow speed and low GPU utilization is not using GPU device.

So can anyone help me solve this problem? Thanks a lot !

Desktop (please complete the following information):

  • OS: [Ubuntu 16.04]
  • CUDA [9.0]
  • GPU [GTX 980ti]
  • Python [3.6]
  • Tensorflow [1.11.0]

Why learner has hard-code about dataset, such as, different dataset uses different learning rate control policy.

E.g.

root@linuxkit-025000000001:/Users/test/work/PocketFlow/learners# grep cifar_10 * -r
nonuniform_quantization/bit_optimizer.py:      if self.dataset_name == 'cifar_10':
nonuniform_quantization/learner.py:  if dataset_name == 'cifar_10':
nonuniform_quantization/learner.py:        if self.dataset_name == 'cifar_10':
nonuniform_quantization/learner.py:        if self.dataset_name == 'cifar_10':
uniform_quantization/bit_optimizer.py:      if self.dataset_name == 'cifar_10':
uniform_quantization/learner.py:  if dataset_name == 'cifar_10':
uniform_quantization/learner.py:        if self.dataset_name == 'cifar_10':
uniform_quantization/learner.py:        if self.dataset_name == 'cifar_10':
uniform_quantization_tf/learner.py:        #if self.dataset_name == 'cifar_10':
weight_sparsification/pr_optimizer.py:    skip_head_n_tail = (self.dataset_name == 'cifar_10')  # skip head & tail layers on CIFAR-10

more details,

def setup_bnds_decay_rates(model_name, dataset_name):
  """ NOTE: The bnd_decay_rates here is mgw_size invariant """

  batch_size = FLAGS.batch_size if not FLAGS.enbl_multi_gpu else FLAGS.batch_size * mgw.size()
  nb_batches_per_epoch = int(FLAGS.nb_smpls_train / batch_size)
  mgw_size = int(mgw.size()) if FLAGS.enbl_multi_gpu else 1
  init_lr = FLAGS.lrn_rate_init * FLAGS.batch_size * mgw_size / FLAGS.batch_size_norm if FLAGS.enbl_multi_gpu else FLAGS.lrn_rate_init
  if dataset_name == 'cifar_10':
    if model_name.startswith('resnet'):
      bnds = [nb_batches_per_epoch * 15, nb_batches_per_epoch * 40]
      decay_rates = [1e-3, 1e-4, 1e-5]
  elif dataset_name == 'ilsvrc_12':
    if model_name.startswith('resnet'):
      bnds = [nb_batches_per_epoch * 5, nb_batches_per_epoch * 20]
      decay_rates = [1e-4, 1e-5, 1e-6]
    elif model_name.startswith('mobilenet'):
      bnds = [nb_batches_per_epoch * 5, nb_batches_per_epoch * 30]
      decay_rates = [1e-4, 1e-5, 1e-6]
  finetune_steps = nb_batches_per_epoch * FLAGS.uql_quant_epochs
  init_lr = init_lr if FLAGS.enbl_warm_start else FLAGS.lrn_rate_init
  return init_lr, bnds, decay_rates, finetune_steps

This design will have strong limitation for testing new dataset, I doubt. Wish some replies about this design.

Many thanks.

Running PocketFlow in the local mode with cifar-10, it appears "./scripts/run_local.sh: line 45: 9232 Segmentation fault"

environment:
ubuntu 16.04
anaconda : python 3.6, tensorflow 1.11

$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py
Python script: nets/resnet_at_cifar10_run.py

of GPUs: 1

extra arguments: --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /home/liumm/data/dataset/cifar-10-batches-py
'nets/resnet_at_cifar10_run.py' -> 'main.py'
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: /home/liumm/data/dataset/cifar-10-batches-py
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 60
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
2018-11-05 14:58:10.875190: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-05 14:58:14.050345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-11-05 14:58:14.050412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-11-05 14:58:15.510646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-05 14:58:15.510738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2018-11-05 14:58:15.510754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2018-11-05 14:58:15.511369: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11359 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-11-05 14:58:18.059075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-11-05 14:58:18.059137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-05 14:58:18.059145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2018-11-05 14:58:18.059152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2018-11-05 14:58:18.059322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11359 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
./scripts/run_local.sh: line 45: 9232 Segmentation fault (core dumped) python main.py ${extra_args}

Segmentation fault

python3.6 cuda9 cudnn7 tensorflow:1.10.0

`(tf-1.10-cp3) [zgz@localhost PocketFlow]$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py
Python script: nets/resnet_at_cifar10_run.py

of GPUs: 1

extra arguments: --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /home/zgz/project/data_set/cifar-10-batches-py
Traceback (most recent call last):
File "utils/get_idle_gpus.py", line 54, in
raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
ValueError: not enough idle GPUs; idle GPUs are: []
‘nets/resnet_at_cifar10_run.py’ -> ‘main.py’
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: /home/zgz/project/data_set/cifar-10-batches-py
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 60
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
2018-11-06 12:51:30.550266: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-06 12:51:30.582314: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-11-06 12:51:30.582432: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: localhost.localdomain
2018-11-06 12:51:30.582455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: localhost.localdomain
2018-11-06 12:51:30.582543: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 396.26.0
2018-11-06 12:51:30.582644: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 396.26.0
2018-11-06 12:51:30.582667: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 396.26.0
./scripts/run_local.sh: line 45: 43704 Segmentation fault (core dumped) python main.py ${extra_args}
`

path.conf
`# data files
data_hdfs_host = None
data_dir_local_cifar10 = /home/zgz/project/data_set/cifar-10-batches-py
data_dir_hdfs_cifar10 = None
data_dir_seven_cifar10 = None
data_dir_docker_cifar10 = /opt/ml/data # DO NOT EDIT
data_dir_local_ilsvrc12 = None
data_dir_hdfs_ilsvrc12 = None
data_dir_seven_ilsvrc12 = None
data_dir_docker_ilsvrc12 = /opt/ml/data # DO NOT EDIT

model files

model_http_url = https://api.ai.tencent.com/pocketflow
`

./scripts/run_local.sh nets/lenet_at_cifar10_run.py gets a segmentfault ?

./scripts/run_local.sh nets/lenet_at_cifar10_run.py gets a segmentfault .

dataset path:

~/PocketFlow/data/cifar10$ ls
eval.tfrecords train.tfrecords validation.tfrecords

path.conf:

#data files
data_hdfs_host = None
data_dir_local_cifar10 = /home/cv/PocketFlow/data/cifar10
data_dir_hdfs_cifar10 = None
data_dir_seven_cifar10 = None
data_dir_docker_cifar10 = /opt/ml/data # DO NOT EDIT
data_dir_local_ilsvrc12 = None
data_dir_hdfs_ilsvrc12 = None
data_dir_seven_ilsvrc12 = None
data_dir_docker_ilsvrc12 = /opt/ml/data # DO NOT EDIT

run log:

(tensorflow) cv@cv-System-Product-Name:~/PocketFlow$ ./scripts/run_local.sh nets/lenet_at_cifar10_run.py
Python script: nets/lenet_at_cifar10_run.py
of GPUs: 1
extra arguments: --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /home/cv/PocketFlow/data/cifar10
File "utils/get_idle_gpus.py", line 33
with open(dump_file, 'r') as i_file:
^
IndentationError: expected an indented block
idle_gpus: 0
'nets/lenet_at_cifar10_run.py' -> 'main.py'
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: /home/cv/PocketFlow/data/cifar10
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:lrn_rate_init: 0.01
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0005
INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 60
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
2018-11-12 15:41:04.999500: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-11-12 15:41:05.217054: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 15:41:05.217827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:07:00.0
totalMemory: 10.92GiB freeMemory: 10.45GiB
2018-11-12 15:41:05.217850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-11-12 15:41:10.837124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-12 15:41:10.837199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2018-11-12 15:41:10.837221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2018-11-12 15:41:10.837795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10107 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:07:00.0, compute capability: 6.1)
2018-11-12 15:41:11.405883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0

2018-11-12 15:41:11.405922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-12 15:41:11.405928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2018-11-12 15:41:11.405932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2018-11-12 15:41:11.406029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10107 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:07:00.0, compute capability: 6.1)
./scripts/run_local.sh: line 48: 3626 Segmentation fault (core dumped) python main.py ${extra_args}

How can I solve this problem?

calculate distillation loss

The calc_loss function in distillation_helper.py that take primary model's logits(logits of the full-precision uncompressed network) as logits and distilled model's logits as labels, is it right? I think should take primary model's logits as labels and distilled model's logits as logits

Running Quantized model on a GPU

I want to generate a .pb file instead of a flat buffer file(.tflite) after the quantization so as to run it on a GPU. Is that possible ?
$ python tools/conversion/export_quant_tflite_model.py --enbl_post_quant

Does it support Mobilenet-v2-SSD ?

It's an amzing work.

Does it support Mobilenet-v2-SSD?
I'm using it for object detection.
I saw PocketFlow doesn't support yolo, etc.

Thanks.

"ValueError: unrecognized learner's name: tf-uniform“ . In tutorial, --learner specification is not accordant with code

./scripts/run_local.sh nets/lenet_at_cifar10_run.py --learner tf-uniform
the command will get below:
File "/home/cv/PocketFlow/learners/learner_utils.py", line 58, in create_learner
raise ValueError('unrecognized learner's name: ' + FLAGS.learner)
ValueError: unrecognized learner's name: tf-uniform

https://pocketflow.github.io/tutorial/#tutorial
Learner name "tf-uniform" is differ form learner_utils.py line 54

in learner_utils.py line 54
elif FLAGS.learner == 'uniform-tf':
learner = UniformQuantTFLearner(sm_writer, model_helper)

What's the training parameters for MobileNetV2 8 Bit Fixed-point?

Thanks for sharing the great work. What's the training parameters for MobileNetV2 8 Bit Fixed-point to get the 72.26% accuracy?

Is this correct? Any other parameter needed?

sh ./scripts/run_local.sh nets/resnet_at_ilsvrc12_run.py
--learner=uniform
--data_disk local
--data_dir_local ${PF_ILSVRC12_LOCAL}
--uql_weight_bits=8
--uql_activation_bits=8 \

A question about DL framework

I find PocketFlow use Tensorflow.
Is there a strong correlation between your optimization and Tensorflow?
Does the optimization apply to other DL framework?
Why not use NCNN?
Look forward to your reply!
Thansk very much!

Pretrained models

Could you please place pre-trained pruned models for standard architectures resnet18, mobilenetv1/2 etc...?

wait for a long time

Hi laoTie,it seems the sofa! I have interested in it for times when i saw the news. Thanks for sharing

Much more guys will know the value of pf!

How to use the pretrained models?

I use the shell

./scripts/run_loack.sh nets/resnet_at)ilsvrc12_run.py -n=4 --dcp_save_path=models/model.ckpt-250227.data-00000-of-00001

and the model.ckpt-250227.data-00000-of-00001 is from models_resnet50_at_ilsvrc_12.tar.gz from your office repository @jiaxiang-wu

The structure of my resnet is different from built in PocketFlow,I how to change the structure?

command:
./scripts/run_local.sh nets/resnet_at_cifar10_run.py --learner dis-chn-pruned

error:
`NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key model/resnet_model/batch_normalization/beta not found in checkpoint
[[Node: model/save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_model/save/Const_0_0, model/save/RestoreV2/tensor_names, model/save/RestoreV2/shape_and_slices)]]
[[Node: model/save/RestoreV2/_27 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_32_model/save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]`

The structure of my resnet is different from built in PocketFlow,I how to change the structure?

Can not run training on local mode

Hi there,
I have an issue runing the following command:

./scripts/run_local.sh nets/resnet_at_cifar10_run.py

error message:
AssertionError: <FLAGS.data_dir_local> must not be None

my template path file looks like this:

data files

data_hdfs_host = None
data_dir_local_cifar10 = /home/besher/datasets/cifar-10-batches-bin
data_dir_hdfs_cifar10 = None
data_dir_seven_cifar10 = None
data_dir_docker_cifar10 = /opt/ml/data # DO NOT EDIT
data_dir_local_ilsvrc12 = /home/besher/datasets/imagenet_tfrecord # this line has been edited!
data_dir_hdfs_ilsvrc12 = None
data_dir_seven_ilsvrc12 = None
data_dir_docker_ilsvrc12 = /opt/ml/data # DO NOT EDIT

model files

data_dir_hdfs_ilsvrc12 = None
data_dir_seven_ilsvrc12 = None
data_dir_docker_ilsvrc12 = /opt/ml/data # DO NOT EDIT

model files

model_http_url = https://api.ai.tencent.com/pocketflow

I am using tensorflow on CPU.

Could you please advice?

The name of variables in net is prefixed by 'model'

I define my net, but the name of variable is Prefixed of 'model', such as 'model/resnet_v1_110/block1/unit_1/bottleneck2_v1/conv1/BatchNorm/beta ', but it should is 'resnet_v1_110/block1/unit_1/bottleneck2_v1/conv1/BatchNorm/beta '.

`NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key model/resnet_v1_110/block1/unit_1/bottleneck2_v1/conv1/BatchNorm/beta not found in checkpoint
[[Node: model/save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_model/save/Const_0_0, model/save/RestoreV2/tensor_names, model/save/RestoreV2/shape_and_slices)]]
[[Node: model/save/RestoreV2/_1059 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1064_model/save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]`

Single GPU card is not available on PC

The get_idle_gpus.py in utils folder is about to detect how many GPU cards are available on the machine. For many users, there is only one GPU card on their PC for graphical display which leads no GPU available (the only one is busy). So, we should fix the issue by change the way of detecting available GPUs.

how to use local model

If i create 'models' sub-directory in PocketFlow's home directory, the PocketFlow will use the local model and not download from model_http_url?

K8S Supported?

Is there a plan to support the pocketflow as a k8s service ?

why it reports FLAGS.data_dir_local is None?

hello, i run the example resnet_at_cifar10, and use command './scripts/run_docker.sh nets/resnet_at_cifar10_run.py', i copyed path.conf.template and rename path.conf in PocketFlow-master file,and i set data_dir_local_cifar10 = /media/gavin/Data/PycharmProjects/datasets/cifar10_data/cifar-10-batches-bin/ in path.conf file ,but the log shows FLAGS.data_dir_local is None ?
the log :
Traceback (most recent call last):
File "utils/get_path_args.py", line 51, in
assert len(key_n_value) == 2, 'each line must contains exactly one ' = ''
AssertionError: each line must contains exactly one ' = '
Python script: nets/resnet_at_cifar10_run.py

of GPUs: 1

extra arguments:
'nets/resnet_at_cifar10_run.py' -> 'main.py'
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: None
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:model_http_url: None
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 60
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
Traceback (most recent call last):
File "main.py", line 69, in
tf.app.run()
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "main.py", line 50, in main
model_helper = ModelHelper()
File "/home/gavin/PocketFlow-master/nets/resnet_at_cifar10.py", line 75, in init
self.dataset_train = Cifar10Dataset(is_train=True)
File "/home/gavin/PocketFlow-master/datasets/cifar10_dataset.py", line 87, in init
assert FLAGS.data_dir_local is not None, '<FLAGS.data_dir_local> must not be None'
AssertionError: <FLAGS.data_dir_local> must not be None

The name of variables in net is prefixed by 'model'

I define my net, but the name of variable is Prefixed of 'model', such as 'model/resnet_v1_110/block1/unit_1/bottleneck2_v1/conv1/BatchNorm/beta ', but it should is 'resnet_v1_110/block1/unit_1/bottleneck2_v1/conv1/BatchNorm/beta '.

`NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key model/resnet_v1_110/block1/unit_1/bottleneck2_v1/conv1/BatchNorm/beta not found in checkpoint
[[Node: model/save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_model/save/Const_0_0, model/save/RestoreV2/tensor_names, model/save/RestoreV2/shape_and_slices)]]
[[Node: model/save/RestoreV2/_1059 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1064_model/save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]`

I think this is very difficult to compress the net provided by common user, because I must retrain the net to save the model and add the prefix 'model' to every variables in net.

how i use docker mode or seven mode to train using muti-GPUs

I have configured docker and nvidia-docker, and pulled uber/horovod, when I run command"sudo ./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=4 --learner channel --cp_preserve_ratio 0.5 --resnet_size 18" , I got "nvidia-docker | 2018/11/07 11:25:39 Error: Error response from daemon: Get https://docker.oa.com/v2/: x509: certificate signed by unknown authority"
How can I fix this issue, and how to prepare my training data, do I need create the directory /opt/ml/data and put ilsvrc12 training data in tfrecord format into it?

Question on the uniform quantization

The code in function __uniform_quantize() in the file 'utils.py' in 'uniform_quantization' directory shows that the quantized weight or activation is also coded as the same type as the original tensor. For example, if the original tensor has a type float32, the quantized one will be float32 too. Why does not the quantized tensor has a type int and the predict process is computed with only integers?
So, does Pocketflow learn a quantization strategy and if we want computation with integers only, it should be accomplished with tools like TFLite, which transforms the model into a '.pb' format?

tensorflow.python.framework.errors_impl.FailedPreconditionError: Bad file descriptor

Describe the bug
A clear and concise description of what the bug is.
When running the channel pruning tutorial in local mode, i get the following error

Traceback (most recent call last):
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: /mnt/cifs/public_datasets/imagenet-data/train-00876-of-01024; Bad file descriptor
	 [[Node: data/IteratorGetNext = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1001]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](data/OneShotIterator)]]
	 [[Node: data/IteratorGetNext/_1885 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_155_data/IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 69, in <module>
    tf.app.run()
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 55, in main
    learner.train()
  File "/home/dhingratul/OneDrive/Projects/PocketFlow/learners/discr_channel_pruning/learner.py", line 145, in train
    self.__choose_discr_chns()
  File "/home/dhingratul/OneDrive/Projects/PocketFlow/learners/discr_channel_pruning/learner.py", line 511, in __choose_discr_chns
    self.sess_train.run(self.layer_train_ops[idx_layer])
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: /mnt/cifs/public_datasets/imagenet-data/train-00876-of-01024; Bad file descriptor
	 [[Node: data/IteratorGetNext = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1001]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](data/OneShotIterator)]]
	 [[Node: data/IteratorGetNext/_1885 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_155_data/IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'data/IteratorGetNext', defined at:
  File "main.py", line 69, in <module>
    tf.app.run()
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 51, in main
    learner = create_learner(sm_writer, model_helper)
  File "/home/dhingratul/OneDrive/Projects/PocketFlow/learners/learner_utils.py", line 50, in create_learner
    learner = DisChnPrunedLearner(sm_writer, model_helper)
  File "/home/dhingratul/OneDrive/Projects/PocketFlow/learners/discr_channel_pruning/learner.py", line 127, in __init__
    self.__build_train()
  File "/home/dhingratul/OneDrive/Projects/PocketFlow/learners/discr_channel_pruning/learner.py", line 195, in __build_train
    images, labels = iterator.get_next()
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 410, in get_next
    name=name)), self._output_types,
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/home/dhingratul/.virtualenvs/pocketflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

FailedPreconditionError (see above for traceback): /mnt/cifs/public_datasets/imagenet-data/train-00876-of-01024; Bad file descriptor
	 [[Node: data/IteratorGetNext = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1001]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](data/OneShotIterator)]]
	 [[Node: data/IteratorGetNext/_1885 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_155_data/IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Smartphone (please complete the following information):

  • Device: [e.g. iPhone6]
  • OS: [e.g. iOS8.1]
  • Browser [e.g. stock browser, safari]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

model path

  1. Discrimination-aware Channel Pruning have the parameter dcp_save_path and dcp_save_path_eval to set the save path of compressed model. Other way in PocketFlow, whether have the similiar parameter to set the save path of compressed model.

  2. The parameter save_path in file abstract_learner.py is mean the save path of the full-precision uncompressed network ? What's the function of parameter save_path_eval ?

Documentation Issue

Please make sure that this is a documentation issue.

System information

Describe the documentation issue
There is a spelling mistake in the title -- "Distillation".
Some letters of the first word "Knowledge" is missing.

We welcome contributions by users. Will you be able to update submit a PR to fix the doc Issue?
Yes. But I'm not sure how to submit a fix.

no requirements.txt file to help user install the necessary packages

When I running the command with the command ./scripts/run_local.sh nets/resnet_at_cifar10_run.py in the installation documentation, but I found there are too many package shoud to install for run the demo. I wonder if it possible to add a requirements.txt file to call user intall the basic package for running the demo.

Compressed model file from Channel Pruning

Can you provide the optimized output from tf.Saver.save() from as the .pb file before the tflite conversion
$ ./scripts/run_local.sh nets/resnet_at_ilsvrc12_run.py \ --learner dis-chn-pruned

Channel selection cost too much time

2018-11-04 11 30 55

scripts is "./scripts/run_local.sh nets/mobilenet_at_ilsvrc12_run.py -n=1 --learner channel", �using "ChannelPrunedLeaner" method ,I check the pruning learner source code,maybe the program stacked at line504 while loop or line515 while loop?

Supported operations of PocketFlow

How many operations are supported under the framework of PocketFlow? I didn't find any docs listing the ops available.
By the way, the acceleration ratio according to the performance of mobilenet V1 & V2 is not so astonishing, is there any more detailed performance of other deconv models?
Many thanks.

how to compress in my dataset

When I compress model on my custom dataset, the function def setup_lrn_rate(global_step, model_name, dataset_name) raise the NotImplement error. I want to produce a function that return lr and batch numbers in normal dataset and normal module. How I set the value of nb_epoches, decay_rates ...? How to choose piecewise constant strategy or exponential decaying strategy?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.