
sagemaker-debugger's Introduction

Amazon SageMaker Debugger



Overview

Amazon SageMaker Debugger automates the debugging of machine learning training jobs. It lets you run your own training script unchanged (the Zero Script Change experience) while using the built-in Hook and Rule features to capture tensors, gives you the flexibility to build custom Hooks and Rules that configure exactly which tensors to collect, and makes those tensors available for analysis by saving them to an Amazon S3 bucket, all through a flexible and powerful API.

The smdebug library powers Debugger by retrieving the saved tensors from the S3 bucket during the training job. smdebug loads and filters the tensors that Debugger generates, such as gradients, weights, and biases.

Debugger helps you develop better, faster, and cheaper models by requiring only minimal changes to your estimator, tracing tensors, catching anomalies during training, and supporting iterative model pruning.

Debugger supports TensorFlow, PyTorch, MXNet, and XGBoost frameworks.

The following list is a summary of the main functionalities of Debugger:

  • Run and debug training jobs of your model on SageMaker when using supported containers
  • No changes needed to your training script if using AWS Deep Learning Containers with Debugger fully integrated
  • Minimal changes to your training script if using AWS containers with script mode or custom containers
  • Full visibility into any tensor retrieved from targeted parts of the training jobs
  • Real-time training job monitoring through Rules
  • Automated anomaly detection and state assertions through built-in and custom Rules on SageMaker
  • Actions on your training jobs based on the status of Rules
  • Interactive exploration of saved tensors
  • Distributed training support
  • TensorBoard support

See How it works for more details.


Install the smdebug library

The smdebug library runs on Python 3. Install using the following command:

pip install smdebug
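If you want to confirm which version was installed, a quick check like the following works on Python 3.8+ (importlib.metadata is part of the standard library):

import importlib.metadata
print(importlib.metadata.version("smdebug"))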

Debugger-supported Frameworks

For a complete overview of Amazon SageMaker Debugger and how it works, see the Use Debugger in AWS Containers developer guide.

AWS Deep Learning Containers with zero code change

Debugger is installed by default in AWS Deep Learning Containers with TensorFlow, PyTorch, MXNet, and XGBoost. The following framework containers enable you to use Debugger with no changes to your training script, by automatically adding SageMaker Debugger's Hook.

The following framework versions are available in AWS Deep Learning Containers for the zero script change experience.

Framework   Versions
TensorFlow  1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1
MXNet       1.6, 1.7
PyTorch     1.4, 1.5, 1.6
XGBoost     0.90-2, 1.0-1 (as a built-in algorithm)

Note: Debugger with zero script change is partially available for TensorFlow v2.1.0. The inputs, outputs, gradients, and layers built-in collections are currently not available for this TensorFlow version.

AWS training containers with script mode

The smdebug library supports additional framework versions beyond those listed above when you use AWS training containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script, as sketched after the table below.

Framework                        Versions
TensorFlow                       1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1
Keras (with TensorFlow backend)  2.3
MXNet                            1.4, 1.5, 1.6, 1.7
PyTorch                          1.2, 1.3, 1.4, 1.5, 1.6
XGBoost                          0.90-2, 1.0-1 (as a framework)
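As a rough sketch of what these minimal changes look like for PyTorch in script mode (the model and criterion variables below stand in for your own module and loss; the calls follow the smdebug hook API, but verify them against the framework page for your version):

import smdebug.pytorch as smd

hook = smd.Hook.create_from_json_file()  # reads the hook configuration SageMaker writes into the container
hook.register_module(model)              # model: your torch.nn.Module (placeholder name)
hook.register_loss(criterion)            # criterion: your loss module (placeholder name)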

Debugger on custom containers or local machines

You can also fully use the Debugger features in custom containers with the SageMaker Python SDK. Furthermore, smdebug is an open source library, so you can install it on your local machine for any advanced use cases that cannot be run in the SageMaker environment and for constructing smdebug custom hooks and rules.


How It Works

Amazon SageMaker Debugger uses the construct of a Hook to save the values of requested tensors throughout the training process. You can then set up a Rule job that simultaneously monitors and validates these tensors to ensure that training is progressing as expected.

A Rule checks for conditions such as vanishing gradients, exploding tensor values, or poor weight initialization. Rules are attached to Amazon CloudWatch events, so when a rule is triggered it changes the state of the corresponding CloudWatch event. You can configure any action on that CloudWatch event, such as stopping the training job, saving you time and money.
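For illustration, here is a minimal sketch of such an action: a Lambda function subscribed to the CloudWatch event that stops the training job when a Debugger rule reports issues. The event field names are assumptions to check against the actual "SageMaker Training Job State Change" payload in your account; only boto3's stop_training_job call is taken as given.

import boto3

sm_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Assumed event shape for a SageMaker training job state change
    detail = event.get("detail", {})
    job_name = detail.get("TrainingJobName")
    rule_statuses = detail.get("DebugRuleEvaluationStatuses") or []
    if job_name and any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in rule_statuses):
        # Stop the training job that triggered the rule
        sm_client.stop_training_job(TrainingJobName=job_name)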

Debugger can be used inside or outside of SageMaker. However, the built-in rules that AWS provides are only available for SageMaker training. Usage scenarios can be classified into the following cases.

Using SageMaker Debugger on AWS Deep Learning Containers with zero training script change

Use Debugger built-in hook configurations and rules while setting up the estimator and monitor your training job.

For a full guide and examples of using the built-in rules, see Running a Rule with zero script change on AWS Deep Learning Containers.

To see a complete list of built-in rules and their functionalities, see List of Debugger Built-in Rules.

Using SageMaker Debugger on AWS training containers with script mode

You can use Debugger with your training script on your own container by making only a minimal modification to your training script to add Debugger's Hook. For an example template of code to use Debugger in your own container with TensorFlow 2.x, see Run Debugger in custom container. See the following instruction pages to set up Debugger in your preferred framework.

Using SageMaker Debugger on custom containers

Debugger is available for any deep learning models that you bring to Amazon SageMaker. The AWS CLI, the SageMaker Estimator API, and the Debugger APIs enable you to use any Docker base images to build and customize containers to train and debug your models. To use Debugger with customized containers, go to Use Debugger in Custom Training Containers.

Using SageMaker Debugger on a non-SageMaker environment

Using the smdebug library, you can create custom hooks and rules (or manually analyze the tensors) and modify your training script to enable tensor analysis on a non-SageMaker environment, such as your local machine. For an example of this, see Run Debugger locally.
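As a rough sketch of what a custom rule looks like with smdebug (modeled on the smdebug analysis API; the rule name, threshold, and "gradients" collection here are illustrative, not part of the library):

from smdebug.rules import Rule, invoke_rule
from smdebug.trials import create_trial

class GradientTooLarge(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        # Signal the rule condition when any gradient's mean absolute value exceeds the threshold
        for tname in self.base_trial.tensor_names(collection="gradients"):
            abs_mean = self.base_trial.tensor(tname).reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True
        return False

trial = create_trial("~/smd_outputs/")           # local or S3 path containing saved tensors
rule = GradientTooLarge(trial, threshold=20.0)
invoke_rule(rule)                                # walks the saved steps and evaluates the rule at each one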


Examples

SageMaker Notebook Examples

To find a collection of demonstrations using Debugger, see SageMaker Debugger Example Notebooks.

Run Debugger rules with zero script change

This example shows how to use Debugger with zero changes to your training script on a SageMaker Deep Learning Container (DLC).

import sagemaker as sm
from sagemaker.debugger import rule_configs, Rule, CollectionConfig

# Choose a built-in rule to monitor your training job
rule = Rule.sagemaker(
    rule_configs.exploding_tensor(),
    # configure your rule if applicable
    rule_parameters={"tensor_regex": ".*"},
    # specify collections to save for processing your rule
    collections_to_save=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="losses"),
    ],
)

# Pass the rule to the estimator
sagemaker_simple_estimator = sm.tensorflow.TensorFlow(
    entry_point="script.py",  # replace script.py with your own training script
    role=sm.get_execution_role(),
    framework_version="1.15",
    py_version="py3",
    # argument for smdebug below
    rules=[rule],
)

sagemaker_simple_estimator.fit()
tensors_path = sagemaker_simple_estimator.latest_job_debugger_artifacts_path()

import smdebug.trials as smd
trial = smd.create_trial(out_dir=tensors_path)
print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")

That's it! When you configure sagemaker_simple_estimator, you simply specify the entry_point to your training script file. When you call the sagemaker_simple_estimator.fit() API, SageMaker automatically monitors your training job with the specified Rules and creates a CloudWatch event that tracks the status of each Rule, so you can take action based on it.
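If you want to inspect the rule evaluation status programmatically after fit() returns, something along these lines should work with the SageMaker Python SDK (the field names come from the rule_job_summary response and are worth verifying against your SDK version):

for summary in sagemaker_simple_estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], summary["RuleEvaluationStatus"])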

If you want additional configuration and control, see Running SageMaker jobs with Debugger for more information.

Run Debugger in custom container

The following example shows how to set up a hook to debug a training model using Debugger in your own container. This example is for containers using the TensorFlow 2.x framework with GradientTape to configure the hook.

import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook(out_dir=args.out_dir)  # args.out_dir comes from your script's arguments

model = tf.keras.models.Sequential([ ... ])
# cce (loss), opt (optimizer), train_acc_metric, dataset, and n_epochs are defined in the full script
for epoch in range(n_epochs):
    for data, labels in dataset:
        dataset_labels = labels
        # wrap the tape to capture tensors
        with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
            logits = model(data, training=True)  # (32, 10)
            loss_value = cce(labels, logits)
        grads = tape.gradient(loss_value, model.variables)
        opt.apply_gradients(zip(grads, model.variables))
        acc = train_acc_metric(dataset_labels, logits)
        # manually save metric values
        hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc)

To see the full script, refer to the tf_keras_gradienttape.py example script. For a notebook example of using BYOC in PyTorch, see Using Amazon SageMaker Debugger with Your Own PyTorch Container.

Run Debugger locally

This example shows how to use Debugger for the Keras model.fit() API.

To use Debugger, simply add a callback hook:

import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook(out_dir='~/smd_outputs/')

# x_train, y_train, x_test, y_test: your dataset
model = tf.keras.models.Sequential([ ... ])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)

# Add the hook as a callback
model.fit(x_train, y_train, epochs=2, callbacks=[hook])
model.evaluate(x_test, y_test, callbacks=[hook])

# Create a trial to inspect the saved tensors
trial = smd.create_trial(out_dir='~/smd_outputs/')
print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
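As a follow-up sketch, the same trial object can be used to walk a saved tensor step by step; the "losses" collection name below is an assumption, so check trial.tensor_names() for what was actually saved:

# continue from the trial created above
loss_names = trial.tensor_names(collection="losses")  # assumes the "losses" collection was saved
if loss_names:
    t = trial.tensor(loss_names[0])
    for step in t.steps():                            # steps at which this tensor was saved
        print(step, t.value(step))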

SageMaker Debugger in Action

  • Through the model pruning process using Debugger and smdebug, you can iteratively identify the importance of weights and cut neurons below a threshold you define. This lets you train the model with significantly fewer neurons, which means a lighter, more efficient, faster, and cheaper model without compromising accuracy. The following accuracy versus number-of-parameters graph is produced in Studio. It shows that model accuracy started at about 0.9 with 12 million parameters (the data point moves from right to left as pruning progresses), improved during the first few pruning iterations, held steady until the number of parameters was cut down to 6 million, and started sacrificing accuracy afterwards.

Debugger Iterative Model Pruning using ResNet. Debugger provides tools to inspect this training process and gives you complete control over your model. See the Using SageMaker Debugger and SageMaker Experiments for iterative model pruning notebook for the full example and more information.

  • Use Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training. Debugger XGBoost Visualization Example

  • Use Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. Debugger TensorFlow Visualization Example


Further Documentation and References

SageMaker Training: For SageMaker users, we recommend starting with this page on how to run SageMaker training jobs with SageMaker Debugger.
Frameworks: See the framework pages for details on what's supported and how to modify your training script if applicable.
APIs for Saving Tensors: Full description of our APIs for saving tensors.
Programming Model for Analysis: Describes the programming model provided by the APIs that let you interactively explore the saved tensors and write your own Rules to monitor your training jobs.

License

This library is licensed under the Apache 2.0 License.


sagemaker-debugger's Issues

`python -m smdebug.rules.invoke_rule` fails

For example,

python -m smdebug.rules.rule_invoker --trial-dir ~/ts_outputs/vanishing_gradients --rule-name VanishingGradient --threshold 0.0000000001

This fails because there's no main function, and it leads to a double import where smdebug.rules has an __init__.py which imports invoke_rule.

PyTorch ResNet50 not saving any collections or tensors

https://github.com/rondogency/smdebug-benchmark/blob/master/pt/pt_res50_cifar10.py

is failing on Python 3.6 with PyTorch 1.3.1 and smdebug 0.4.7, per Ziyi's perf tests.

I have verified that it saves weights and gradients correctly if I pass --save_more, but it does not save losses in either case. Running with ZCC causes it to work, which means that the issue is likely in save_scalar() (because ZCC is still using the old record_tensor_value function). Adding @vandanavk to resolve.

create_trial breaks

t = create_trial('./output')
[2020-01-27 19:31:32.274 88e9fe53272d.ant.amazon.com:84283 INFO local_trial.py:35] Loading trial output at path ./output
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/vikumar/anaconda3/lib/python3.6/site-packages/smdebug-0.6.0b20200127-py3.6.egg/smdebug/trials/utils.py", line 19, in create_trial
File "/Users/vikumar/anaconda3/lib/python3.6/site-packages/smdebug-0.6.0b20200127-py3.6.egg/smdebug/trials/local_trial.py", line 37, in __init__
File "/Users/vikumar/anaconda3/lib/python3.6/site-packages/smdebug-0.6.0b20200127-py3.6.egg/smdebug/trials/trial.py", line 556, in _load_tensors
File "/Users/vikumar/anaconda3/lib/python3.6/site-packages/smdebug-0.6.0b20200127-py3.6.egg/smdebug/trials/trial.py", line 621, in _load_tensors_from_index_files
File "/Users/vikumar/anaconda3/lib/python3.6/site-packages/smdebug-0.6.0b20200127-py3.6.egg/smdebug/core/index_reader.py", line 403, in load_tensor_data_from_index_files
File "/Users/vikumar/anaconda3/lib/python3.6/site-packages/smdebug-0.6.0b20200127-py3.6.egg/smdebug/core/index_reader.py", line 448, in read_index_files
File "/Users/vikumar/anaconda3/lib/python3.6/site-packages/smdebug-0.6.0b20200127-py3.6.egg/smdebug/core/index_reader.py", line 85, in add
File "/Users/vikumar/anaconda3/lib/python3.6/site-packages/smdebug-0.6.0b20200127-py3.6.egg/smdebug/core/index_reader.py", line 79, in _evict_cache
TypeError: '<' not supported between instances of 'str' and 'NoneType'

Codecov migration to marketplace app

Hi, Tom from Codecov here.

We noticed that you are using Codecov with fairly high frequency, and we’re so excited to see that! However, because you are not using our app, you may have experienced issues with uploading reports or viewing coverage information. This is due to rate-limiting issues from GitHub.

In order to prevent any future outages, we ask that you move over to our GitHub app integration.

The process is extremely simple and shouldn’t require more than a few clicks, and you should not expect any downtime. By moving to our app, you will no longer need an admin or separate account to manage the relationship with GitHub as the team bot.

Let me know if you have any questions, or if I can help at all with this process.

keras TF 2.2 misleading error message

Got this error message:
Disabling SMDebug as it does not support eager modefor TF versions 1.x

import tensorflow as tf
tf.__version__
'2.2.0'

I am using run_eagerly=True in model compile.

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(0.001),
    metrics=['accuracy'],
    run_eagerly=True,
)

Reduce time for integration test

Come up with a way for CI to print the running time of each test.
Find which integration tests run longest and optimize them. Current tests are taking 1:30 minutes; that needs to get faster.
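One possible starting point, assuming the suite runs under pytest, is its built-in duration report:

pytest --durations=0 tests/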

Doc updates

  • Hook: 'main interface to use training'
  • Glossary missing - "Glossary is in api.md"
  • Some links broken
  • Rule : A condition that will trigger an exception and terminate the training job early. Doesn't have to do that
  • Hook from SageMaker?
  • Api.md has SageMaker Zero-Code-Change vs. Python API, which is not on the page
  • Built in rules table is broken
  • Homepage: If a rule is triggered, it will raise a CloudWatch event and stop the training job, -> Incorrect
  • Distributed training
  • A section on keras (non tf.keras)
  • Beef up PT and MXNet pages to be as good as TF
  • Complete XGBoost docs (Edward)
  • Link to analysis docs in main README

issue with training end, not loading all steps

If training ends, the trial downloads all the index files, but the rule invoker ends prematurely. See the logs below.
The training ran for 90 steps, but the rule concluded at step 60.

[2020-01-08 22:57:14.339 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 DEBUG index_reader.py:310] Loaded Index Files: upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198/index/000000000/000000000070_worker_0.json,upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198/index/000000000/000000000080_worker_0.json,upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198/index/000000000/000000000090_worker_0.json
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 INFO trial.py:209] Loaded all steps
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 DEBUG trial.py:211] Training Has Ended : last_complete_step was: 60
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 DEBUG trial.py:213] Training Has Ended : last_index_token was: upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198/index/000000000/000000000060_worker_0.json
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 INFO invoker.py:36] Looking for step 61 of mode GLOBAL and reached end of training. Max step available is 60
[2020-01-08 22:57:15.362 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 INFO invoker.py:40] Ending execution of rule LossNotDecreasing with step=60

detailed logs for run : https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/codebuild/smdebug_tensorflow_zero_code_change_build;stream=43eabe45-3c36-41a3-977a-592035cbd404;filter=trial_loss_not_decreasing_tf_true_parallel_mode

implement save raw tensor API

Save raw tensor API needs to be implemented. The use case: a user wants to save a tensor which is not part of the model graph.

The implementation would look like below:

def save_raw_tensor(self, tname, tval):
    if self.writer is None:
        self._initialize_writers()
    self._write_raw_tensor_simple(tname, tval)

Remove dependency on aioboto3

Causing too many dependency conflict issues with other packages. It's not worth it.

Let's migrate to s3transfer or raw boto3

Provide a way to randomly access a step

Current implementation:
If there are n steps present, the index files for all n steps are downloaded before the call finishes.

We should allow a way to randomly access a step.
Use case:

  1. I want the tensor from the last step, or I want the tensors from steps 5000-8000.

In both cases, the user is only interested in certain steps rather than all steps while doing analysis.

Estimation of time:
If there are 50K steps saved, it takes 50 list calls + 50K get calls (which should take ~20 sec according to benchmark #80, where 10K files take 4 sec) to fetch the index files. The 50 list calls can be expensive because they are sequential, and list is an expensive operation.

Investigate the source of warnings when using TF 2.x Mirrored Strategy

These warnings are seen when running a TF 2.x mirrored strategy job, although the training seems to succeed.

Root cause the source of these warnings.

2020-04-21 20:41:50.531730: W tensorflow/core/kernels/data/cache_dataset_ops.cc:822] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2020-04-21 20:41:50.565438: W tensorflow/core/kernels/data/cache_dataset_ops.cc:822] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2020-04-21 20:41:50.587846: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-04-21 20:41:50.588709: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-04-21 20:41:50.589408: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-04-21 20:41:50.590075: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Cannot install it via pip-tools

Hi,

I'm trying to use pip-tools to install the package. The configuration I want is:

numpy==1.14.6
mxnet==1.4.1
boto3==1.10.32
stepfunctions==1.0.0.1
pur==5.2.2
sagemaker-experiments==0.1.2
smdebug==0.4.14

However I get thrown:

Finding the best candidates:
  found candidate aioboto3==6.4.1 (constraint was ==6.4.1)
Could not find a version that matches aiobotocore==0.11.0,~=0.10.2
Tried: 0.0.5, 0.0.6, 0.1.1, 0.2.0, 0.2.1, 0.2.1, 0.2.2, 0.2.2, 0.2.3, 0.2.3, 0.3.0, 0.3.0, 0.3.1, 0.3.1, 0.3.2, 0.3.2, 0.3.3, 0.3.3, 0.4.0, 0.4.0, 0.4.1, 0.4.1, 0.4.2, 0.4.2, 0.4.3, 0.4.3, 0.4.4, 0.4.4, 0.4.5, 0.4.5, 0.5.0, 0.5.0, 0.5.1, 0.5.1, 0.5.2, 0.5.2, 0.5.3, 0.5.3, 0.6.0, 0.6.0, 0.7.0, 0.7.0, 0.8.0, 0.8.0, 0.9.0, 0.9.1, 0.9.1, 0.9.2, 0.9.2, 0.9.3, 0.9.3, 0.9.4, 0.9.4, 0.10.0, 0.10.0, 0.10.1, 0.10.1, 0.10.2, 0.10.2, 0.10.3, 0.10.3, 0.10.4, 0.10.4, 0.11.0, 0.11.0
Skipped pre-versions: 0.6.1a0, 0.6.1a0
There are incompatible versions in the resolved dependencies.

I would think that my requirements file is fairly basic. Is this something you can help with?

tf.keras saves step at end of batch

Running the following script with tensorflow==1.15.0:

import tensorflow.compat.v2 as tf
import smdebug.tensorflow as smd
from tempfile import TemporaryDirectory

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255, x_test / 255

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])

with TemporaryDirectory() as dirpath:
    hook = smd.KerasHook(out_dir=dirpath)

    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5, callbacks=[hook])

    trial = smd.create_trial(path=dirpath)
    print(hook)
    print(trial)

gives the following output:

<smdebug.tensorflow.keras.KerasHook object at 0x1025aaed0>:(
    out_dir=/var/folders/r1/mgxfss8d45jbs_vl464bbsg906jznv/T/tmpdzybvlqg,
    tensorboard_dir=None,
    step=9374,
    mode=ModeKeys.TRAIN,
    mode_steps={<ModeKeys.GLOBAL: 4>: 9374, <ModeKeys.TRAIN: 1>: 9374},
    include_collections=['metrics', 'losses', 'sm_metrics'],
    writer=None,
    save_config=<class SaveConfig: {<ModeKeys.TRAIN: 1>: <class SaveConfig: save_interval=500, save_steps=[], start_step=0, end_step=None>, <ModeKeys.EVAL: 2>: <class SaveConfig: save_interval=500, save_steps=[], sta ...>,
    reduction_config=<class ReductionConfig: reductions=[], abs_reductions=[], norms=[], abs_norms=[]>,
    save_all=False,
    dry_run=False,
)
<smdebug.trials.local_trial.LocalTrial object at 0x1025b0f50>:(
    name=tmpdzybvlqg,
    path=/var/folders/r1/mgxfss8d45jbs_vl464bbsg906jznv/T/tmpdzybvlqg,
    steps=[0, 500, 1000, 1500, 1874, 2000, 2500, 3000, 3500, 3749, 4000, 4500, 5000, 5500, 5624, 6000, 6500, 7000, 7499, 7500, 8000, 8500, 9000, 9374],
    collections=['default', 'weights', 'biases', 'gradients', 'losses', 'metrics', 'inputs', 'outputs', 'all', 'sm_metrics'],
    tensor_names=['acc', 'batch', 'loss', 'size'],
)

It appears to be saving every 1874th step (the last step of each epoch), in addition to every 500th. Is this desired behavior?

How do I install protobuf3 compiler and runtime correctly?

If you see the error messages E ModuleNotFoundError: No module named 'smdebug.core.tfevent.proto.types_pb2' or ERROR: Compiling summary protocol buffers failed. You will not be able to use smdebug. Please make sure that you have installed protobuf3 compiler and runtime correctly., it is most likely that you haven't installed the correct version of protobuf.

We provide a simple script that downloads and install the version of protobuf required by sagemaker-debugger.

To install protobuf using this script, simply run:

sh config/protoc_downloader.sh

Failing tests on local master

test_hook_all_zero.py is failing both tests on master when run locally.

test_uninint_sess_run is failing on master when run locally

test_outdir_sagemaker is failing on master when run locally on Mac because no permission to write to the directory 'my'.

Loss Tensors Are Saved Twice On AWS Pytorch

Loss functional values are saved twice for each step with AWS PyTorch.

This happens because functional losses are saved by default by the post_hook_for_loss_functional fn in AWS PyTorch.

...
    for _ in range(n_steps):
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = F.cross_entropy(outputs, labels)
        hook.record_tensor_value("nll_loss", tensor_value=loss)
        loss.backward()
        optimizer.step()
...

The post_hook_for_loss_functional is called by F.cross_entropy(outputs, labels), so the manual record_tensor_value call above saves the same loss a second time.

Crash occurs when trying to register a hook on a tensor that doesn't require gradients

Background: I am trying to run a SageMaker Training Job with custom PyTorch 1.4.0 code using the official sagemaker-pytorch-container (763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py36-cu101-ubuntu16.04) on a single ml.p3.2xlarge instance.

Issue: I get the following error: RuntimeError: cannot register a hook on a tensor that doesn't require gradient. Issue is related to the fact that, in my code, I'm loading a previously trained torch.nn.Embedding that does not need gradient updates. Disabling SageMaker Debugger by passing debugger_hook_config=False to the Estimator constructor gets past the error, but then training hangs indefinitely after the first training step.

Check why this log is repeated

In CI : https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=DO-NOT-DELETE-smdebug_rules-LOGS-ONE-REPO;stream=codebuild/c3bda538-9277-42db-931a-de5984013923;filter=%22Loaded%20Index%20Files:%20upload/20200106_221841/c33ae10/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578351365.7939517/index/000000000/000000000070_worker_0.json%22

Why is this line repeated so many times:
"Loaded Index Files: upload/20200106_221841/c33ae10/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578351365.7939517/index/000000000/000000000070_worker_0.json"

Are we reloading index files again and again?
@NihalHarish Please check and confirm.

Allow setting of global mode

Right now only the train, eval, and predict modes can be set. This causes some problems for XGBoost in rules, because XGBoost jobs have no mode.
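For context, a rough sketch of the current mode API (the import path and calls follow the smdebug hook docs, but treat the exact names as assumptions):

import smdebug.xgboost as smd

hook = smd.Hook(out_dir="/tmp/xgb_debug")
hook.set_mode(smd.modes.TRAIN)    # supported
hook.set_mode(smd.modes.EVAL)     # supported
hook.set_mode(smd.modes.PREDICT)  # supported
# There is no supported way to set GLOBAL explicitly, which is what this issue asks for.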

Pytorch tensors are not saved with include_collections=["all"], only with save_all=True

Events produced are empty when using include_collections=["all"]. Works with save_all=True.

Hook code:

save_config = smd.SaveConfig(save_interval=1)
reduction_config = smd.ReductionConfig(["max", "min"])
hook = smd.Hook(out_dir='...',
                reduction_config=reduction_config,
                save_all=True,
                #include_collections=["all"],
                export_tensorboard=True,
                save_config=save_config,
                tensorboard_dir='...',
                include_workers="all")

inconsistent behavior of script tf_simple.py when hook used with MonitoredSession

If I use the script tf_simple.py with MonitoredSession(hooks=[hook]), I see inconsistent behavior.
Link to script: https://gist.github.com/Vikas-kum/a726aa05f70cbc22da55aac6f9f122d2

Repro - Command to run and reproduce is provided at script(link above) header.

Step=0, Loss=90.67911529541016 gstep:2
Step=1, Loss=97.25459289550781 gstep:3
Step=2, Loss=72.63609313964844 gstep:4
Step=3, Loss=49.64006423950195 gstep:5
Step=4, Loss=30.262378692626953 gstep:6
Step=5, Loss=26.098041534423828 gstep:7
Step=6, Loss=23.234188079833984 gstep:8
Step=7, Loss=14.143218994140625 gstep:9
Step=8, Loss=6.640719413757324 gstep:10
Step=9, Loss=2.9191393852233887 gstep:11
Step=10, Loss=1.1926181316375732 gstep:3 ====> global step 3 again
Step=11, Loss=67.19161224365234 gstep:4 =====> loss increased
Step=12, Loss=62.6436767578125 gstep:5
Step=13, Loss=44.932037353515625 gstep:6
Step=14, Loss=54.5485954284668 gstep:7
Step=15, Loss=28.61581039428711 gstep:8
Step=16, Loss=25.332355499267578 gstep:9
Step=17, Loss=18.563230514526367 gstep:10
Step=18, Loss=8.643794059753418 gstep:11
Step=19, Loss=5.633042335510254 gstep:12
Step=20, Loss=1.1502041816711426 gstep:2
Step=21, Loss=95.97285461425781 gstep:3
Step=22, Loss=63.6973991394043 gstep:4
Step=23, Loss=45.747554779052734 gstep:5
Step=24, Loss=25.462902069091797 gstep:6
Step=25, Loss=25.730255126953125 gstep:7

But if I comment out the first line and use the second, as shown below:
#sess = tf.train.MonitoredSession(hooks=[hook])
sess = tf.train.MonitoredSession()

I get correct behavior. Example output:
Step=0, Loss=67.61869812011719 gstep:2
Step=1, Loss=109.72452545166016 gstep:3
Step=2, Loss=89.4232177734375 gstep:4
Step=3, Loss=40.550193786621094 gstep:5
Step=4, Loss=46.2119026184082 gstep:6
Step=5, Loss=38.09912109375 gstep:7
Step=6, Loss=21.49539566040039 gstep:8
Step=7, Loss=16.05667495727539 gstep:9
Step=8, Loss=7.0712432861328125 gstep:10
Step=9, Loss=2.7082438468933105 gstep:11
Step=10, Loss=1.6834074258804321 gstep:12
Step=11, Loss=0.2472914457321167 gstep:13
Step=12, Loss=0.0006980320904403925 gstep:14
Step=13, Loss=0.19466720521450043 gstep:15
Step=14, Loss=0.8360849618911743 gstep:16
Step=15, Loss=2.3243532180786133 gstep:17
Step=16, Loss=3.5155558586120605 gstep:18
Step=17, Loss=3.3111186027526855 gstep:19
Step=18, Loss=4.183402061462402 gstep:20
Step=19, Loss=5.629175186157227 gstep:21
Step=20, Loss=6.101352214813232 gstep:22
Step=21, Loss=5.324296951293945 gstep:23
Step=22, Loss=5.301041603088379 gstep:24
Step=23, Loss=4.981998443603516 gstep:25
Step=24, Loss=5.992074489593506 gstep:26
Step=25, Loss=7.53415584564209 gstep:27
Step=26, Loss=4.8035888671875 gstep:28
Step=27, Loss=2.3003716468811035 gstep:29
Step=28, Loss=3.3655598163604736 gstep:30
Step=29, Loss=1.9064804315567017 gstep:31
Step=30, Loss=1.332509160041809 gstep:32
Step=31, Loss=1.2492618560791016 gstep:33
Step=32, Loss=0.3721589744091034 gstep:34
Step=33, Loss=0.20127233862876892 gstep:35
Step=34, Loss=0.039012569934129715 gstep:36
Step=35, Loss=2.4094073523883708e-05 gstep:37
Step=36, Loss=0.03809528425335884 gstep:38
Step=37, Loss=0.10105834901332855 gstep:39
Step=38, Loss=0.35051339864730835 gstep:40
Step=39, Loss=0.33885806798934937 gstep:41
Step=40, Loss=0.5717775821685791 gstep:42
Step=41, Loss=0.5270355343818665 gstep:43

I tried with tensorflow 1.15.0 & tensorflow 1.13.1

Move self._log_params(module) from forward_pre_hook to forward_hook for Pytorch

Not all parameters have been created until after the first step if parameters are created via tracing (during runtime). Can confirm this works, thanks to Rahul H:

def forward_hook(self, module, inputs, outputs):
    if not self._get_collections_to_save_for_step():
        return

    self._log_params(module)
    module_name = self.module_maps[module]
    # This overwhelms the logs; turn back on if you really need it
    # logger.debug("Processing the global step {0} for module {1}".format(self.step, module_name))

    # Output input tensor
    self._write_inputs(module_name, inputs)

    # Output output tensors
    self._write_outputs(module_name, outputs)
    self.last_saved_step = self.step

Name clash when operator is called multiple times during forward pass

When running the debugger during ResNet model training, the debugger stores only one activation output tensor per basic block, while there should be 2. Here is a code snippet:

class BasicBlock(nn.Module):
    
    def __init__(self, inplanes, planes, stride, norm_layer):
        super(BasicBlock, self).__init__()

        [...]

        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        return out

self.relu is called twice, but only the last call's output is stored by the debugger. This happens because both output tensors get the same name. I avoided this problem by customizing the hook and keeping track of tensor names during the forward pass. While this works, it is not the nicest solution. Is there another workaround?
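For reference, here is a rough sketch of that workaround. It subclasses the PyTorch hook and relies on the internal attributes shown in the forward_hook snippet from the previous issue, so treat it as illustrative rather than supported API:

from collections import defaultdict
import smdebug.pytorch as smd

class DedupNamesHook(smd.Hook):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._call_counts = defaultdict(int)

    def forward_hook(self, module, inputs, outputs):
        if not self._get_collections_to_save_for_step():
            return
        base_name = self.module_maps[module]
        # Suffix the name with a per-step call counter so repeated modules stay distinct
        self._call_counts[(self.step, base_name)] += 1
        unique_name = f"{base_name}_call{self._call_counts[(self.step, base_name)]}"
        self._write_inputs(unique_name, inputs)
        self._write_outputs(unique_name, outputs)
        self.last_saved_step = self.step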

Saving tensors and scalars which are not part of model graph

Q. In the example I created, I needed to save data that is not part of the model training. I did this by calling hook._write_raw_tensor_simple() directly, which worked fine. But is this the recommended way?

For saving scalars, yes, it is possible to save any arbitrary scalar from code. Please refer to the save_scalar method in this part of the docs: https://github.com/awslabs/sagemaker-debugger/blob/59de1452edd5d00252813d00425ef357541f51d9/docs/api.md#common-hook-api

For saving a raw tensor that is not part of the model graph, we don't currently have an implementation.
Please refer to #154 for updates.
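For completeness, a minimal sketch of that save_scalar call (signature per the Common Hook API docs linked above; the name and value are placeholders):

# hook is the smdebug hook already created for your framework
hook.save_scalar("my_custom_scalar", 0.42, sm_metric=False)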

Too many warnings printed with TF 2.X

While executing tests that involve reductions, too many warnings are being printed.

Example


tests/tensorflow2/test_keras_mirrored.py::test_base_reductions INFO:absl:Load dataset info from /root/tensorflow_datasets/mnist/3.0.1
--
587 | INFO:absl:Field info.citation from disk and from code do not match. Keeping the one from code.
588 | INFO:absl:Reusing dataset mnist (/root/tensorflow_datasets/mnist/3.0.1)
589 | INFO:absl:Constructing tf.data.Dataset for split None, from /root/tensorflow_datasets/mnist/3.0.1
590 | INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
591 | INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
592 | [2020-05-27 07:15:21.755 ae4bb918ce14:10191 INFO hook.py:191] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
593 | [2020-05-27 07:15:21.755 ae4bb918ce14:10191 INFO hook.py:236] Saving to /tmp/test
594 | [2020-05-27 07:15:21.755 ae4bb918ce14:10191 INFO state_store.py:67] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
595 | [2020-05-27 07:15:22.897 ae4bb918ce14:10191 INFO keras.py:68] Executing in TF2.x eager mode.SageMaker Debugger will not be saving gradients
596 | [2020-05-27 07:15:23.037 ae4bb918ce14:10191 INFO hook.py:372] Monitoring the collections: metrics, sm_metrics, biases, losses, weights
597 | INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
598 | INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
599 | INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
600 | INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
601 | [2020-05-27 07:15:28.373 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
602 | [2020-05-27 07:15:28.373 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
603 | [2020-05-27 07:15:28.382 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
604 | [2020-05-27 07:15:28.382 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
605 | [2020-05-27 07:15:28.391 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
606 | [2020-05-27 07:15:28.391 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
607 | [2020-05-27 07:15:28.399 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
608 | [2020-05-27 07:15:28.399 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
609 | WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.102305). Check your callbacks.
610 | WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.102305). Check your callbacks.
611 | [2020-05-27 07:15:28.636 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
612 | [2020-05-27 07:15:28.636 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
613 | [2020-05-27 07:15:28.644 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
614 | [2020-05-27 07:15:28.644 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
615 | [2020-05-27 07:15:28.652 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
616 | [2020-05-27 07:15:28.653 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
617 | [2020-05-27 07:15:28.661 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
618 | [2020-05-27 07:15:28.661 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
619 | [2020-05-27 07:15:28.862 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
620 | [2020-05-27 07:15:28.862 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
621 | [2020-05-27 07:15:28.872 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
622 | [2020-05-27 07:15:28.873 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
623 | [2020-05-27 07:15:28.883 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
624 | [2020-05-27 07:15:28.883 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
625 | [2020-05-27 07:15:28.892 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
626 | [2020-05-27 07:15:28.892 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
627 | [2020-05-27 07:15:29.083 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
628 | [2020-05-27 07:15:29.084 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
629 | [2020-05-27 07:15:29.095 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
630 | [2020-05-27 07:15:29.096 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
631 | [2020-05-27 07:15:29.106 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
632 | [2020-05-27 07:15:29.107 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
633 | [2020-05-27 07:15:29.115 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l1 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
634 | [2020-05-27 07:15:29.115 ae4bb918ce14:10191 WARNING hook.py:562] Could not compute reduction l2 of conv2d/weights/conv2d/kernel:0 due to Improper number of dimensions to norm.
635 | 2020-05-27 07:15:29.518013: W tensorflow/core/kernels/data/cache_dataset_ops.cc:822] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
636 | 2020-05-27 07:15:29.524775: W tensorflow/core/kernels/data/cache_dataset_ops.cc:822] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
637 | 2020-05-27 07:15:29.544889: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
638 | 2020-05-27 07:15:29.545632: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
639 | 2020-05-27 07:15:29.546292: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
640 | 2020-05-27 07:15:29.546884: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
641 | [2020-05-27 07:15:29.549 ae4bb918ce14:10191 INFO local_trial.py:35] Loading trial test at path /tmp/test
642 | [2020-05-27 07:15:29.562 ae4bb918ce14:10191 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
643 | [2020-05-27 07:15:30.563 ae4bb918ce14:10191 INFO trial.py:210] Loaded all steps
644 | PASSED


zero code change tf test fails

python tests/zero_code_change/tensorflow_integration_tests.py

fails with error
Traceback (most recent call last):
File "tests/zero_code_change/tensorflow_integration_tests.py", line 23, in <module>
from tests.tensorflow.hooks.test_mirrored_strategy import test_basic
ModuleNotFoundError: No module named 'tests.tensorflow.hooks.test_mirrored_strategy'

when creating a custom collection, is there a way to define EVAL/TRAIN save_interval directly in the SageMaker Estimator?

Q. when creating a custom collection, is there a way to define EVAL/TRAIN save_interval directly in the SageMaker Estimator?

ANS: Yes, it can be provided, for details see this section of docs - https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#specifying-different-configuration-based-on-mode

Configuring hook using Sagemaker PythonSDK https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-hook-using-sagemaker-python-sdk

Configuring Collection using Sagemaker PythonSDK - https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk
