aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.

Home Page: https://sagemaker-examples.readthedocs.io

License: Apache License 2.0

Jupyter Notebook 94.36% HTML 0.04% Python 4.66% R 0.02% Shell 0.18% Dockerfile 0.07% Java 0.03% C 0.01% Roff 0.60% Makefile 0.01% Batchfile 0.01% jq 0.01% JavaScript 0.03% CSS 0.01%

sagemaker aws reinforcement-learning machine-learning deep-learning examples jupyter-notebook mlops data-science training

Amazon SageMaker Examples

Example Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using Amazon SageMaker.

📚 Background

Amazon SageMaker is a fully managed service for data science and machine learning (ML) workflows. You can use Amazon SageMaker to simplify the process of building, training, and deploying ML models.

The SageMaker example notebooks are Jupyter notebooks that demonstrate the usage of Amazon SageMaker.

The SageMaker Example Community repository contains additional notebooks, beyond those critical for showcasing key SageMaker functionality, that can be shared and explored by the community.

🛠️ Setup

The quickest setup to run example notebooks includes:

💻 Usage

These example notebooks are automatically loaded into SageMaker Notebook Instances. They can be accessed by clicking on the SageMaker Examples tab in Jupyter or the SageMaker logo in JupyterLab.

Although most examples utilize key Amazon SageMaker functionality like distributed, managed training or real-time hosted endpoints, these notebooks can be run outside of Amazon SageMaker Notebook Instances with minimal modification (updating IAM role definition and installing the necessary libraries).
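
As a minimal sketch of the "updating IAM role definition" step, the helper below picks a role ARN from an environment variable when one is set and otherwise falls back to the SageMaker Python SDK's `sagemaker.get_execution_role()`. The variable name `SAGEMAKER_ROLE_ARN` and the helper name are assumptions made for this illustration, not conventions used by the notebooks.

```python
import os


def resolve_execution_role(env_var="SAGEMAKER_ROLE_ARN"):
    """Return an IAM role ARN to pass to SageMaker jobs.

    SAGEMAKER_ROLE_ARN is a hypothetical variable name chosen for this sketch.
    """
    # Outside a notebook instance, prefer an explicitly configured role ARN.
    role_arn = os.environ.get(env_var)
    if role_arn:
        return role_arn
    # Inside SageMaker, the SDK can look up the attached execution role.
    import sagemaker  # requires `pip install sagemaker`

    return sagemaker.get_execution_role()
```

On a laptop you might export the variable once and leave the notebook code unchanged; inside a Notebook Instance the variable can stay unset.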

As of February 7, 2022, the default branch is named "main". See our announcement for details and how to update your existing clone.

📓 Examples

Introduction to geospatial capabilities

These examples introduce SageMaker geospatial capabilities, which make it easy to build, train, and deploy ML models using geospatial data.

How to use SageMaker Processing with geospatial image shows how to compute the normalized difference vegetation index (NDVI), which indicates the health and density of vegetation, using SageMaker Processing and satellite imagery.
Monitoring Lake Drought with SageMaker Geospatial Capabilities shows how to monitor Lake Mead drought using SageMaker geospatial capabilities.
Digital Farming with Amazon SageMaker Geospatial Capabilities shows how geospatial capabilities can help accelerate, optimize, and ease the processing of geospatial data for digital farming use cases.
Assess wildfire damage with Amazon SageMaker Geospatial Capabilities demonstrates how Amazon SageMaker geospatial capabilities can be used to identify and assess vegetation loss caused by the Dixie wildfire in Northern California.
Monitoring Glacier Melting with SageMaker Geospatial Capabilities shows how to monitor glacier melting at Mount Shasta using SageMaker geospatial capabilities.
Monitoring of methane (CH4) emission point sources using Amazon SageMaker Geospatial Capabilities demonstrates how methane emissions can be detected using open-data satellite imagery (Sentinel-2).
Segmenting aerial imagery using geospatial GPU notebook shows how to use the geospatial GPU notebook with open-source libraries to perform segmentation on aerial imagery.
Perform Sentinel-1 InSAR using ESA SNAP Toolkit shows how the SNAP toolkit can be used within Amazon SageMaker geospatial capabilities to create interferograms on Sentinel-1 SAR data.
How to use Vector Enrichment Jobs for Map Matching shows how to use vector enrichment operations with Amazon SageMaker Geospatial capabilities to snap GPS coordinates to road segments.
How to use Vector Enrichment Jobs for Reverse Geocoding shows how to use Amazon SageMaker Geospatial capabilities for reverse geocoding to obtain human-readable addresses from data with latitude/longitude information.
Building geospatial pipelines with SageMaker Pipelines shows how a geospatial data processing workflow can be automated by using Amazon SageMaker Pipelines.
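
The NDVI mentioned above is a simple per-pixel ratio of the near-infrared (NIR) and red bands. The following is a plain-Python sketch of the formula only; the function names are made up for illustration, and the notebook itself works on real satellite rasters rather than nested lists.

```python
def ndvi(nir, red):
    """NDVI for one pixel: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        # Convention chosen for this sketch when both bands are zero.
        return 0.0
    return (nir - red) / total


def ndvi_image(nir_band, red_band):
    """Apply NDVI element-wise to two equally shaped 2-D bands."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]
```

For non-negative reflectance values the result lies between -1 and 1, with higher values indicating denser, healthier vegetation.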

Introduction to Ground Truth Labeling Jobs

These examples provide quick walkthroughs to get you up and running with the labeling job workflow for Amazon SageMaker Ground Truth.

Bring your own model for SageMaker labeling workflows with active learning is an end-to-end example that shows how to bring your custom training, inference logic and active learning to the Amazon SageMaker ecosystem.
From Unlabeled Data to a Deployed Machine Learning Model: A SageMaker Ground Truth Demonstration for Image Classification is an end-to-end example that starts with an unlabeled dataset, labels it using the Ground Truth API, analyzes the results, trains an image classification neural net using the annotated dataset, and finally uses the trained model to perform batch and online inference.
Ground Truth Object Detection Tutorial is a similar end-to-end example but for an object detection task.
Basic Data Analysis of an Image Classification Output Manifest presents charts to visualize the number of annotations for each class, differentiating between human annotations and automatic labels (if your job used auto-labeling). It also displays sample images in each class, and creates a pdf which concisely displays the full results.
Training a Machine Learning Model Using an Output Manifest introduces the concept of an "augmented manifest" and demonstrates that the output file of a labeling job can be immediately used as the input file to train a SageMaker machine learning model.
Annotation Consolidation demonstrates Amazon SageMaker Ground Truth annotation consolidation techniques for image classification for a completed labeling job.
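
To make the output-manifest idea concrete, here is a small sketch that counts labels per class and separates human annotations from automatic ones. It assumes a JSON Lines manifest in which each record carries a `<label-attribute>-metadata` object with `class-name` and `human-annotated` fields; the sample records, the attribute name, and the bucket paths are invented for illustration, so check them against your own job's output.

```python
import json
from collections import Counter


def count_annotations(manifest_lines, label_attribute):
    """Count (class name, 'human' | 'auto') pairs in an output manifest."""
    counts = Counter()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: metadata lives under "<label_attribute>-metadata".
        metadata = record.get(f"{label_attribute}-metadata", {})
        class_name = metadata.get("class-name", "unknown")
        source = "human" if metadata.get("human-annotated") == "yes" else "auto"
        counts[(class_name, source)] += 1
    return counts


# Hypothetical sample records, not taken from a real labeling job.
sample = [
    '{"source-ref": "s3://example-bucket/a.jpg", "category": 0, '
    '"category-metadata": {"class-name": "cat", "human-annotated": "yes"}}',
    '{"source-ref": "s3://example-bucket/b.jpg", "category": 0, '
    '"category-metadata": {"class-name": "cat", "human-annotated": "no"}}',
    '{"source-ref": "s3://example-bucket/c.jpg", "category": 1, '
    '"category-metadata": {"class-name": "dog", "human-annotated": "yes"}}',
]
summary = count_annotations(sample, "category")
```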

Introduction to Applying Machine Learning

These examples provide a gentle introduction to machine learning concepts as they are applied in practical use cases across a variety of sectors.

Predicting Customer Churn uses customer interaction and service usage data to find those most likely to churn, and then walks through the cost/benefit trade-offs of providing retention incentives. This uses Amazon SageMaker's implementation of XGBoost to create a highly predictive model.
Cancer Prediction predicts breast cancer based on features derived from images, using SageMaker's Linear Learner.
Ensembling predicts income using two Amazon SageMaker models to show the advantages of ensembling.
Video Game Sales develops a binary prediction model for the success of video games based on review scores.
MXNet Gluon Recommender System uses neural network embeddings for non-linear matrix factorization to predict user movie ratings on Amazon digital reviews.
Fair Linear Learner is an example of an effective way to create fair linear models with respect to sensitive features.
Population Segmentation of US Census Data using PCA and Kmeans analyzes US census data and reduces dimensionality using PCA then clusters US counties using KMeans to identify segments of similar counties.
Document Embedding using Object2Vec is an example to embed a large collection of documents in a common low-dimensional space, so that the semantic distances between these documents are preserved.
Traffic violations forecasting using DeepAR uses daily traffic violation data to predict patterns and seasonality with the Amazon SageMaker DeepAR algorithm.
Visual Inspection Automation with Pre-trained Amazon SageMaker Models is an example for fine-tuning pre-trained Amazon SageMaker models on a target dataset.
Create SageMaker Models Using the PyTorch Model Zoo contains an example notebook to create a SageMaker model leveraging the PyTorch Model Zoo and visualize the results.
Deep Demand Forecasting provides an end-to-end solution for the demand forecasting task using three state-of-the-art time series algorithms: LSTNet, Prophet, and SageMaker DeepAR, which are available in GluonTS and Amazon SageMaker.
Fraud Detection Using Graph Neural Networks is an example to identify fraudulent transactions from transaction and user identity datasets.
Identify key insights from textual document contains comprehensive notebooks for five natural language processing tasks: Document Summarization, Text Classification, Question Answering, Named Entity Recognition, and Semantic Relation Extraction.
Synthetic Churn Prediction with Text contains an example notebook to train, deploy and use a churn prediction model that processes numerical, categorical and textual features to make its prediction.
Credit Card Fraud Detector is an example of the core of a credit card fraud detection system using SageMaker with Random Cut Forest and XGBoost.
Churn Prediction Multimodality of Text and Tabular is an example notebook to train and deploy a churn prediction model that uses a state-of-the-art natural language processing model to find useful signals in text. In addition to textual inputs, this model uses traditional structured data inputs such as numerical and categorical fields.
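
The cost/benefit trade-off described for Predicting Customer Churn can be sketched as a threshold search: flagging a customer costs a retention incentive, while missing a churner costs the lost customer. All cost figures and function names below are hypothetical placeholders, not values from the notebook.

```python
def expected_cost(scores, labels, threshold,
                  cost_missed_churn=500.0, cost_incentive=100.0):
    """Total cost of acting on churn scores at a given threshold.

    The two cost values are illustrative assumptions.
    """
    total = 0.0
    for score, churned in zip(scores, labels):
        flagged = score >= threshold
        if flagged:
            # Incentive is paid whether or not the customer would have churned.
            total += cost_incentive
        elif churned:
            # A churner we failed to flag is lost.
            total += cost_missed_churn
    return total


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: expected_cost(scores, labels, t))
```

With asymmetric costs like these, the cheapest threshold is often well below 0.5, because a missed churner is far more expensive than an unnecessary incentive.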

SageMaker Automatic Model Tuning

These examples introduce SageMaker's hyperparameter tuning functionality, which helps deliver the best possible predictions by running a large number of training jobs to determine which hyperparameter values are the most impactful.

XGBoost Tuning shows how to use SageMaker hyperparameter tuning to improve your model fit.
BlazingText Tuning shows how to use SageMaker hyperparameter tuning with the BlazingText built-in algorithm and 20_newsgroups dataset.
TensorFlow Tuning shows how to use SageMaker hyperparameter tuning with the pre-built TensorFlow container and MNIST dataset.
MXNet Tuning shows how to use SageMaker hyperparameter tuning with the pre-built MXNet container and MNIST dataset.
HuggingFace Tuning shows how to use SageMaker hyperparameter tuning with the pre-built HuggingFace container and 20_newsgroups dataset.
Keras BYO Tuning shows how to use SageMaker hyperparameter tuning with a custom container running a Keras convolutional network on CIFAR-10 data.
R BYO Tuning shows how to use SageMaker hyperparameter tuning with the custom container from the Bring Your Own R Algorithm example.
Analyzing Results is a shared notebook that can be used after each of the above notebooks to provide analysis on how training jobs with different hyperparameters performed.
Model tuning for distributed training shows how to use SageMaker hyperparameter tuning with the Hyperband strategy to optimize a model in distributed training.
Neural Architecture Search for Large Language Models shows how to prune fine-tuned large language models via neural architecture search.
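
The Hyperband strategy mentioned above builds on successive halving: train many configurations on a small budget, keep the best fraction, and give the survivors more budget. The sketch below shows a single successive-halving bracket with a caller-supplied `evaluate` function; it is a simplified illustration, not SageMaker's tuning implementation, and full Hyperband runs several such brackets with different starting budgets.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Return the surviving config after repeated halving.

    evaluate(config, budget) must return a loss (lower is better).
    """
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Rank the current survivors by loss at the current budget.
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        # Keep the best 1/eta of them (at least one).
        survivors = ranked[: max(1, len(ranked) // eta)]
        # Survivors earn a larger budget for the next round.
        budget *= eta
    return survivors[0]


def toy_loss(config, budget):
    """Hypothetical objective: best at lr = 0.1, improving with budget."""
    return (config["lr"] - 0.1) ** 2 + 1.0 / budget


candidates = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 0.5)]
winner = successive_halving(candidates, toy_loss)
```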

SageMaker Autopilot

These examples introduce SageMaker Autopilot. Autopilot automatically performs feature engineering, model selection, model tuning (hyperparameter optimization) and allows you to directly deploy the best model to an endpoint to serve inference requests.

Customer Churn AutoML shows how to use SageMaker Autopilot to automatically train a model for the Predicting Customer Churn task.
Targeted Direct Marketing AutoML shows how to use SageMaker Autopilot to automatically train a model.
Housing Prices AutoML shows how to use SageMaker Autopilot for a linear regression problem (predict housing prices).
Portfolio Churn Prediction with Amazon SageMaker Autopilot and Neo4j shows how to use SageMaker Autopilot with graph embeddings to predict investment portfolio churn.
Move Amazon SageMaker Autopilot ML models from experimentation to production using Amazon SageMaker Pipelines shows how to use SageMaker Autopilot in combination with SageMaker Pipelines for end-to-end AutoML training automation.
Amazon SageMaker Autopilot models to serverless endpoints shows how to deploy Autopilot generated models to serverless endpoints.

Introduction to Amazon Algorithms

These examples provide quick walkthroughs to get you up and running with Amazon SageMaker's custom-developed algorithms. Most of these algorithms can train on distributed hardware, scale incredibly well, and are faster and cheaper than popular alternatives.

k-means is our introductory example for Amazon SageMaker. It walks through the process of clustering MNIST images of handwritten digits using Amazon SageMaker k-means.
Factorization Machines showcases Amazon SageMaker's implementation of the algorithm to predict whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier.
Latent Dirichlet Allocation (LDA) introduces topic modeling using Amazon SageMaker Latent Dirichlet Allocation (LDA) on a synthetic dataset.
Linear Learner predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Linear Learner.
Neural Topic Model (NTM) uses Amazon SageMaker Neural Topic Model (NTM) to uncover topics in documents from a synthetic data source, where topic distributions are known.
Principal Components Analysis (PCA) uses Amazon SageMaker PCA to calculate eigendigits from MNIST.
Seq2Seq uses the Amazon SageMaker Seq2Seq algorithm that's built on top of Sockeye, which is a sequence-to-sequence framework for Neural Machine Translation based on MXNet. Seq2Seq implements state-of-the-art encoder-decoder architectures which can also be used for tasks like Abstractive Summarization in addition to Machine Translation. This notebook shows translation from English to German text.
Image Classification includes full training and transfer learning examples of Amazon SageMaker's Image Classification algorithm. This uses a ResNet deep convolutional neural network to classify images from the Caltech dataset.
XGBoost for regression predicts the age of abalone (Abalone dataset) using regression from Amazon SageMaker's implementation of XGBoost.
XGBoost for multi-class classification uses Amazon SageMaker's implementation of XGBoost to classify handwritten digits from the MNIST dataset as one of the ten digits using a multi-class classifier. Both single machine and distributed use-cases are presented.
DeepAR for time series forecasting illustrates how to use the Amazon SageMaker DeepAR algorithm for time series forecasting on a synthetically generated data set.
BlazingText Word2Vec generates Word2Vec embeddings from a cleaned text dump of Wikipedia articles using SageMaker's fast and scalable BlazingText implementation.
Object detection for bird images demonstrates how to use the Amazon SageMaker Object Detection algorithm with a public dataset of bird images.
Object2Vec for movie recommendation demonstrates how Object2Vec can be used to model data consisting of pairs of singleton tokens using movie recommendation as a running example.
Object2Vec for multi-label classification shows how the Object2Vec algorithm can train on data consisting of pairs of sequences and singleton tokens, using the setting of genre prediction of movies based on their plot descriptions.
Object2Vec for sentence similarity explains how to train Object2Vec using sequence pairs as input, with sentence similarity analysis as the application.
IP Insights for suspicious logins shows how to train IP Insights on login events for a web server to identify suspicious login attempts.
Semantic Segmentation shows how to train a semantic segmentation algorithm using the Amazon SageMaker Semantic Segmentation algorithm. It also demonstrates how to host the model and produce segmentation masks and probability of segmentation.
JumpStart Instance Segmentation demonstrates how to use a pre-trained Instance Segmentation model available in JumpStart for inference.
JumpStart Semantic Segmentation demonstrates how to use a pre-trained Semantic Segmentation model available in JumpStart for inference, how to fine-tune the pre-trained model on a custom dataset using the JumpStart transfer learning algorithm, and how to use the fine-tuned model for inference.
JumpStart Text Generation shows how to use JumpStart to generate text that appears indistinguishable from human-written text.
JumpStart Text Summarization shows how to use JumpStart to summarize the text to contain only the important information.
JumpStart Image Embedding demonstrates how to use a pre-trained model available in JumpStart for image embedding.
JumpStart Text Embedding demonstrates how to use a pre-trained model available in JumpStart for text embedding.
JumpStart Object Detection demonstrates how to use a pre-trained Object Detection model available in JumpStart for inference, how to fine-tune the pre-trained model on a custom dataset using the JumpStart transfer learning algorithm, and how to use the fine-tuned model for inference.
JumpStart Machine Translation demonstrates how to translate text from one language to another in JumpStart.
JumpStart Named Entity Recognition demonstrates how to identify named entities, such as names and locations, in text in JumpStart.
JumpStart Text to Image demonstrates how to generate an image conditioned on text in JumpStart.
JumpStart Upscaling demonstrates how to enhance image quality with Stable Diffusion models in JumpStart.
JumpStart Inpainting demonstrates how to inpaint an image with Stable Diffusion models in JumpStart.
In-context learning with AlexaTM 20B demonstrates how to use AlexaTM 20B for in-context-learning in JumpStart.
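
For readers new to the introductory k-means example, the sketch below shows the textbook Lloyd's iteration on a handful of 2-D points: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is a plain illustration of the idea under the assumption of caller-supplied starting centroids; Amazon SageMaker's k-means is a separate, scalable implementation and is not reproduced here.

```python
def kmeans(points, centroids, iterations=10):
    """Run Lloyd's algorithm from the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers = kmeans(points, centroids=[(0, 0), (0, 1)])
```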

Amazon SageMaker RL

The following examples demonstrate different capabilities of Amazon SageMaker RL.

Cartpole using Coach demonstrates the simplest use case of Amazon SageMaker RL using Intel's RL Coach.
AWS DeepRacer demonstrates AWS DeepRacer training using RL Coach in the Gazebo environment.
HVAC using EnergyPlus demonstrates the training of HVAC systems using the EnergyPlus environment.
Knapsack Problem demonstrates how to solve the knapsack problem using a custom environment.
Mountain Car is a classic RL problem. This notebook explains how to solve it using the OpenAI Gym environment.
Distributed Neural Network Compression explains how to compress ResNets using RL, with a custom environment and the RLLib toolkit.
Portfolio Management uses a custom Gym environment to manage multiple financial investments.
Autoscaling demonstrates how to adjust load depending on demand. This uses RL Coach and a custom environment.
Roboschool is an open source physics simulator that is commonly used to train RL policies for robotic systems. This notebook demonstrates training a few agents using it.
Stable Baselines makes the HalfCheetah agent learn to walk using stable-baselines, a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI Baselines.
Travelling Salesman is a classic NP-hard problem, which this notebook solves with Amazon SageMaker RL.
Tic-tac-toe is a simple implementation of a custom Gym environment to train and deploy an RL agent in Coach that then plays tic-tac-toe interactively in a Jupyter Notebook.
Unity Game Agent shows how to use RL algorithms to train an agent to play a Unity3D game.
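
When experimenting with the Knapsack Problem notebook, a classical exact solver gives a handy baseline for judging what an RL policy achieves on small instances. The following is the standard 0/1 knapsack dynamic program, offered as a reference point only; it is not the RL approach the notebook takes, and the function name is illustrative.

```python
def knapsack_max_value(weights, values, capacity):
    """Best total value of a 0/1 knapsack with integer weights."""
    # best[c] = best value achievable with total weight at most c.
    best = [0] * (capacity + 1)
    for weight, value in zip(weights, values):
        # Iterate capacities downwards so each item is used at most once.
        for cap in range(capacity, weight - 1, -1):
            best[cap] = max(best[cap], best[cap - weight] + value)
    return best[capacity]
```

The table has `capacity + 1` entries, so this is practical only for modest integer capacities, which is one reason learned or heuristic policies are interesting for larger variants.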

Scientific Details of Algorithms

These examples provide a more thorough mathematical treatment of a select group of algorithms.

Streaming Median sequentially introduces concepts used in streaming algorithms, which many SageMaker algorithms rely on to deliver speed and scalability.
Latent Dirichlet Allocation (LDA) dives into Amazon SageMaker's spectral decomposition approach to LDA.
Linear Learner features shows how to use the class weights and loss function features of the SageMaker Linear Learner algorithm to improve performance on a credit card fraud prediction task.
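
As a small companion to the Streaming Median entry, here is one common way to maintain a running median as values arrive: two heaps that hold the lower and upper halves of the data. This variant is exact and keeps every value in memory, so it is a starting point for intuition rather than the bounded-memory techniques a streaming treatment typically develops; it is not taken from the notebook.

```python
import heapq


class RunningMedian:
    """Exact running median using a max-heap and a min-heap."""

    def __init__(self):
        self._low = []   # smaller half, stored negated to act as a max-heap
        self._high = []  # larger half, a regular min-heap

    def add(self, x):
        # Push onto the low half, then move its largest value to the high half.
        heapq.heappush(self._low, -x)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so the low half is never smaller than the high half.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no values seen yet")
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2
```

Each `add` costs O(log n), and `median` is O(1) because the middle values always sit at the tops of the two heaps.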

Amazon SageMaker Debugger

These examples provide an introduction to SageMaker Debugger, which provides debugging and monitoring capabilities for the training of machine learning and deep learning algorithms. Note that although these notebooks focus on a specific framework, the same approach works with all the frameworks that Amazon SageMaker Debugger supports. The notebooks below are listed in the order in which we recommend you review them.

Amazon SageMaker Distributed Training

These examples provide an introduction to SageMaker Distributed Training Libraries for data parallelism and model parallelism. The libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker, and improve training speed and throughput. More examples for models such as BERT and YOLOv5 can be found in distributed_training/.

Train GPT-2 with Sharded Data Parallel shows how to train GPT-2 with near-linear scaling using the Sharded Data Parallelism technique in the SageMaker Model Parallelism Library.
Train EleutherAI GPT-J with Model Parallel shows how to train EleutherAI GPT-J with PyTorch and the Tensor Parallelism technique in the SageMaker Model Parallelism Library.
Train MaskRCNN with Data Parallel shows how to train MaskRCNN with PyTorch and the SageMaker Data Parallelism Library.

Amazon SageMaker Smart Sifting

These examples provide an introduction to the Smart Sifting library. Smart Sifting is a framework to speed up training of PyTorch models. The framework implements a set of algorithms that filter out inconsequential training examples during training, reducing the computational cost and accelerating the training process. It is configuration-driven and extensible, allowing users to add custom logic to transform their training examples into a filterable format. Smart Sifting provides a generic utility for any DNN model, and can reduce training infrastructure cost by up to 35%.

Train Image Classification using Vision Transformer with Smart Sifting shows how to use Smart Sifting to fine-tune Vision Transformers for image classification.
Train Text Classification using BERT with Smart Sifting shows how to use Smart Sifting to fine-tune BERT for text classification.
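
The idea of filtering out inconsequential training examples can be illustrated with a loss-based rule: within each batch, keep only the examples with the highest current loss and skip the rest. The sketch below is a conceptual stand-in under that assumption; the function name and the `keep_fraction` parameter are invented here, and the Smart Sifting library's actual selection logic and API may differ.

```python
def sift_batch(examples, losses, keep_fraction=0.5):
    """Keep the highest-loss share of a batch, preserving original order."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, round(len(examples) * keep_fraction))
    # Rank example indices from highest to lowest loss.
    ranked = sorted(range(len(examples)), key=lambda i: losses[i], reverse=True)
    # Restore the original ordering of the retained examples.
    kept = sorted(ranked[:keep])
    return [examples[i] for i in kept]
```

Examples the model already predicts well contribute little gradient signal, so dropping them trades a small amount of information for fewer backward passes.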

Amazon SageMaker Clarify

These examples provide an introduction to SageMaker Clarify, which provides machine learning developers with greater visibility into their training data and models so they can identify and limit bias and explain predictions.

Fairness and Explainability with SageMaker Clarify shows how to use the SageMaker Clarify Processor API to measure the pre-training bias of a dataset and the post-training bias of a model, and to explain the importance of the input features on the model's decision.
Amazon SageMaker Clarify Model Monitors shows how to use the SageMaker Clarify Model Monitor API to schedule a bias monitor that checks predictions for bias drift on a regular basis, and an explainability monitor that checks predictions for feature attribution drift on a regular basis.
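
To give a feel for what a pre-training bias measurement looks like, the sketch below computes two simple dataset-level metrics by hand: class imbalance between two facets, and the difference in the proportion of positive labels between them. The formulas follow the commonly documented definitions, with facet `a` and facet `d` denoting the two groups being compared; the function names are illustrative and this is not the Clarify Processor API.

```python
def class_imbalance(n_a, n_d):
    """(n_a - n_d) / (n_a + n_d): 0 means both facets are equally represented."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d):
    """Share of positive labels in facet a minus the share in facet d."""
    return pos_a / n_a - pos_d / n_d


# Hypothetical dataset: 70 rows in facet a (35 positive), 30 in facet d (6 positive).
ci = class_imbalance(70, 30)
dpl = difference_in_proportions_of_labels(35, 70, 6, 30)
```

Values near zero suggest the two facets are balanced on that metric; values near 1 or -1 suggest a strong skew worth investigating before training.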

Publishing content from RStudio on Amazon SageMaker to RStudio Connect

These examples show you how to run R examples, and publish applications in RStudio on Amazon SageMaker to RStudio Connect.

Publishing R Markdown shows how you can author an R Markdown document (.Rmd, .Rpres) within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption.
Publishing R Shiny Apps shows how you can author an R Shiny application within RStudio on Amazon SageMaker and publish to RStudio Connect for wide consumption.
Publishing Streamlit Apps shows how you can author a Streamlit application within Amazon SageMaker Studio and publish to RStudio Connect for wide consumption.

Advanced Amazon SageMaker Functionality

These examples showcase unique functionality available in Amazon SageMaker. They cover a broad range of topics and utilize a variety of methods, but aim to provide the user with sufficient insight or inspiration to develop within Amazon SageMaker.

Data Distribution Types showcases the difference between two methods for sending data from S3 to Amazon SageMaker Training instances. This has particular implications for the scalability and accuracy of distributed training.
Distributed Training and Batch Transform with Sentiment Classification shows how to use SageMaker Distributed Data Parallelism, SageMaker Debugger, and distributed SageMaker Batch Transform on a HuggingFace Estimator, in a sentiment classification use case.
Encrypting Your Data shows how to use Server Side KMS encrypted data with Amazon SageMaker training. The IAM role used for S3 access needs to have permissions to encrypt and decrypt data with the KMS key.
Using Parquet Data shows how to bring Parquet data sitting in S3 into an Amazon SageMaker Notebook and convert it into the recordIO-protobuf format that many SageMaker algorithms consume.
Connecting to Redshift demonstrates how to copy data from Redshift to S3 and vice-versa without leaving Amazon SageMaker Notebooks.
Bring Your Own XGBoost Model shows how to use Amazon SageMaker Algorithms containers to bring a pre-trained model to a realtime hosted endpoint without ever needing to think about REST APIs.
Bring Your Own k-means Model shows how to take a model that's been fit elsewhere and use Amazon SageMaker Algorithms containers to host it.
Bring Your Own R Algorithm shows how to bring your own algorithm container to Amazon SageMaker using the R language.
Installing the R Kernel shows how to install the R kernel into an Amazon SageMaker Notebook Instance.
Bring Your Own scikit Algorithm provides a detailed walkthrough on how to package a scikit-learn algorithm for training and production-ready hosting.
Bring Your Own MXNet Model shows how to bring a model trained anywhere using MXNet into Amazon SageMaker.
Bring Your Own TensorFlow Model shows how to bring a model trained anywhere using TensorFlow into Amazon SageMaker.
Bring Your Own Model train and deploy BERTopic shows how to bring a model through an external library, and how to train it and deploy it into Amazon SageMaker by extending the PyTorch base containers.
Experiment Management Capabilities with Search shows how to organize Training Jobs into projects, and track relationships between Models, Endpoints, and Training Jobs.
Host Multiple Models with Your Own Algorithm shows how to deploy multiple models to a realtime hosted endpoint with your own custom algorithm.
Host Multiple Models with XGBoost shows how to deploy multiple models to a realtime hosted endpoint using a multi-model enabled XGBoost container.
Host Multiple Models with SKLearn shows how to deploy multiple models to a realtime hosted endpoint using a multi-model enabled SKLearn container.
Host Multimodal HuggingFace Model shows how to host an instruction-based image editing model from HuggingFace as a SageMaker endpoint using single core or multi-core GPU based instances. Inference Recommender is used to run load tests and compare the performance of instances.
SageMaker Training and Inference with Script Mode shows how to use custom training and inference scripts, similar to those you would use outside of SageMaker, with SageMaker's prebuilt containers for various frameworks like Scikit-learn, PyTorch, and XGBoost.
Host Models with NVIDIA Triton Server shows how to deploy models to a realtime hosted endpoint using Triton as the model inference server.
Heterogeneous Clusters Training in TensorFlow or PyTorch shows how to train using TensorFlow tf.data.service (distributed data pipeline) or PyTorch (with gRPC) on top of Amazon SageMaker heterogeneous clusters to overcome CPU bottlenecks by including different instance types (GPU/CPU) in the same training job.
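
The Data Distribution Types entry above contrasts two ways of getting S3 data onto training instances. The toy simulation below shows the difference in what each instance ends up with, using the two distribution type names from the SageMaker API (`FullyReplicated` and `ShardedByS3Key`). The round-robin assignment of keys to instances is an assumption made for illustration; the service's actual assignment order is not specified here.

```python
def assign_objects(s3_keys, instance_count, distribution="FullyReplicated"):
    """Return one list of S3 keys per training instance."""
    if distribution == "FullyReplicated":
        # Every instance receives a full copy of the dataset.
        return [list(s3_keys) for _ in range(instance_count)]
    if distribution == "ShardedByS3Key":
        # Each instance receives a disjoint subset (round-robin is assumed).
        return [list(s3_keys[i::instance_count]) for i in range(instance_count)]
    raise ValueError(f"unknown distribution type: {distribution}")


keys = ["part-0", "part-1", "part-2", "part-3", "part-4"]
replicated = assign_objects(keys, 2)
sharded = assign_objects(keys, 2, "ShardedByS3Key")
```

Sharding cuts the data each instance must download and process, but each instance then sees only part of the dataset, which is why the choice can affect both scalability and accuracy.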

Amazon SageMaker Neo Compilation Jobs

These examples provide an introduction to how to use Neo to compile and optimize deep learning models.

GluonCV SSD Mobilenet shows how to train GluonCV SSD MobileNet and use Amazon SageMaker Neo to compile and optimize the trained model.
Image Classification adapts the image classification example to include the Neo API and a comparison against the uncompiled baseline.
MNIST with MXNet adapts the MXNet MNIST example to include the Neo API and a comparison against the uncompiled baseline.
Deploying pre-trained PyTorch vision models shows how to use Amazon SageMaker Neo to compile and optimize pre-trained PyTorch models from TorchVision.
Distributed TensorFlow includes the Neo API and a comparison against the uncompiled baseline.
Predicting Customer Churn adapts the XGBoost customer churn example to include the Neo API and a comparison against the uncompiled baseline.

Amazon SageMaker Processing

These examples show you how to use SageMaker Processing jobs to run data processing workloads.

Scikit-Learn Data Processing and Model Evaluation shows how to use SageMaker Processing and the Scikit-Learn container to run data preprocessing and model evaluation workloads.
Feature transformation with Amazon SageMaker Processing and SparkML shows how to use SageMaker Processing to run data processing workloads using SparkML prior to training.
Feature transformation with Amazon SageMaker Processing and Dask shows how to use SageMaker Processing to transform data using Dask distributed clusters.
Distributed Data Processing using Apache Spark and SageMaker Processing shows how to use the built-in Spark container on SageMaker Processing using the SageMaker Python SDK.

Amazon SageMaker Pipelines

These examples show you how to use SageMaker Pipelines to create, automate and manage end-to-end Machine Learning workflows.

Amazon Comprehend with SageMaker Pipelines shows how to deploy a custom text classification model using Amazon Comprehend and SageMaker Pipelines.
Amazon Forecast with SageMaker Pipelines shows how you can create a dataset, dataset group and predictor with Amazon Forecast and SageMaker Pipelines.
Multi-model SageMaker Pipeline with Hyperparameter Tuning and Experiments shows how you can generate a regression model by training on real estate data from Athena using Data Wrangler, and uses multiple algorithms both from a custom container and a SageMaker container in a single pipeline.
SageMaker Pipeline Local Mode with FrameworkProcessor and BYOC for PyTorch with sagemaker-training-toolkit
SageMaker Pipeline Step Caching shows how you can leverage pipeline step caching while building pipelines and shows expected cache hit / cache miss behavior.
Native AutoML step in SageMaker Pipelines shows how you can use SageMaker Autopilot with a native AutoML step in SageMaker Pipelines for end-to-end AutoML training automation.
Computer Vision Pipeline using step decorator shows how you can augment a dataset, train a computer vision model, and evaluate the model using a combination of built-in steps and the step decorator.
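
The cache hit / cache miss behavior mentioned for step caching can be pictured with a tiny in-memory cache keyed on a step's name and arguments: a repeat run with identical arguments reuses the stored result, while any change forces a rerun. This is a conceptual sketch only; the `StepCache` class is hypothetical and does not reflect how SageMaker Pipelines stores or matches cached executions.

```python
import hashlib
import json


class StepCache:
    """Memoize step results by a hash of the step name and its arguments."""

    def __init__(self):
        self._results = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(step_name, arguments):
        # sort_keys makes the key independent of argument ordering.
        payload = json.dumps({"step": step_name, "args": arguments}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, step_name, arguments, func):
        key = self._key(step_name, arguments)
        if key in self._results:
            self.hits += 1  # cache hit: skip the work
            return self._results[key]
        self.misses += 1  # cache miss: execute and remember the result
        result = func(**arguments)
        self._results[key] = result
        return result


cache = StepCache()
first = cache.run("train", {"epochs": 3}, lambda epochs: epochs * 2)
second = cache.run("train", {"epochs": 3}, lambda epochs: epochs * 2)
third = cache.run("train", {"epochs": 5}, lambda epochs: epochs * 2)
```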

Amazon SageMaker Pre-Built Framework Containers and the Python SDK

Pre-Built Deep Learning Framework Containers

These examples show you how to train and host in pre-built deep learning framework containers using the SageMaker Python SDK.

Chainer CIFAR-10 trains a VGG image classification network on CIFAR-10 using Chainer (both single machine and multi-machine versions are included)
Chainer MNIST trains a basic neural network on MNIST using Chainer (shows how to use local mode)
Chainer sentiment analysis trains an LSTM network with embeddings to predict text sentiment using Chainer
IRIS with Scikit-learn trains a Scikit-learn classifier on IRIS data
Model Registry and Batch Transform with Scikit-learn trains a Scikit-learn Random Forest model, registers it in Model Registry, and runs a Batch Transform Job.
MNIST with MXNet Gluon trains a basic neural network on the MNIST handwritten digit dataset using MXNet Gluon
MNIST with MXNet trains a basic neural network on the MNIST handwritten digit data using MXNet's symbolic syntax
Sentiment Analysis with MXNet Gluon trains a text classifier using embeddings with MXNet Gluon
TensorFlow training and serving trains a basic neural network on MNIST
TensorFlow with Horovod trains on MNIST using Horovod for distributed training
TensorFlow using shell commands shows how to use a shell script for the container's entry point

Pre-Built Machine Learning Framework Containers

These examples show you how to build Machine Learning models with frameworks like Apache Spark or Scikit-learn using SageMaker Python SDK.

Inference with SparkML Serving shows how to build an ML model with Apache Spark using Amazon EMR on the Abalone dataset and deploy it in SageMaker with SageMaker SparkML Serving.
Pipeline Inference with Scikit-learn and LinearLearner builds an ML pipeline using Scikit-learn preprocessing and the LinearLearner algorithm in a single endpoint

Using Amazon SageMaker with Apache Spark

These examples show how to use Amazon SageMaker for model training, hosting, and inference through Apache Spark using SageMaker Spark. SageMaker Spark allows you to interleave Spark Pipeline stages with Pipeline stages that interact with Amazon SageMaker.

MNIST with SageMaker PySpark
Parameterize spark configuration in pipeline PySparkProcessor execution shows how you can define the Spark configuration in different pipeline PySparkProcessor executions

Using Amazon SageMaker with Amazon Keyspaces (for Apache Cassandra)

These examples show how to use Amazon SageMaker to read data from Amazon Keyspaces.

Train Machine Learning Models using Amazon Keyspaces as a Data Source

AWS Marketplace

Create algorithms/model packages for listing in AWS Marketplace for machine learning.

These example notebooks show you how to package a model or algorithm for listing in AWS Marketplace for machine learning.

Creating Marketplace Products
- Creating a Model Package - Listing on AWS Marketplace provides a detailed walkthrough on how to package a pre-trained model as a SageMaker Model Package that can be listed on AWS Marketplace.
- Creating Algorithm and Model Package - Listing on AWS Marketplace provides a detailed walkthrough on how to package a scikit-learn algorithm to create SageMaker Algorithm and SageMaker Model Package entities that can be used with the enhanced SageMaker Train/Transform/Hosting/Tuning APIs and listed on AWS Marketplace.

Once you have created an algorithm or a model package to be listed in the AWS Marketplace, the next step is to list it in AWS Marketplace, and provide a sample notebook that customers can use to try your algorithm or model package.

Curate your AWS Marketplace model package listing and sample notebook provides instructions on how to craft a sample notebook to be associated with your listing and how to curate a good AWS Marketplace listing that makes it easy for AWS customers to consume your model package.
Curate your AWS Marketplace algorithm listing and sample notebook provides instructions on how to craft a sample notebook to be associated with your listing and how to curate a good AWS Marketplace listing that makes it easy for your customers to consume your algorithm.

Use algorithms, data, and model packages from AWS Marketplace.

These examples show you how to use model-packages and algorithms from AWS Marketplace and dataset products from AWS Data Exchange, for machine learning.

Using Algorithms
- Using Algorithm From AWS Marketplace provides a detailed walkthrough on how to use Algorithm with the enhanced SageMaker Train/Transform/Hosting/Tuning APIs by choosing a canonical product listed on AWS Marketplace.
- Using AutoML algorithm provides a detailed walkthrough on how to use an AutoML algorithm from AWS Marketplace.
Using Model Packages
- Using Model Packages From AWS Marketplace is a generic notebook which provides sample code snippets you can modify and use for performing inference on Model Packages from AWS Marketplace, using Amazon SageMaker.
- Using Amazon Demo product From AWS Marketplace provides a detailed walkthrough on how to use Model Package entities with the enhanced SageMaker Transform/Hosting APIs by choosing a canonical product listed on AWS Marketplace.
- Using models for extracting vehicle metadata provides a detailed walkthrough on how to use pre-trained models from AWS Marketplace for extracting metadata for a sample use-case of auto-insurance claim processing.
- Using models for identifying non-compliance at a workplace provides a detailed walkthrough on how to use pre-trained models from AWS Marketplace for extracting metadata for a sample use-case of generating summary reports for identifying non-compliance at a construction/industrial workplace.
- Creative writing using GPT-2 Text Generation shows you how to use the AWS Marketplace GPT-2-XL pre-trained model on Amazon SageMaker to generate text based on your prompt to help you author prose and poetry.
- Amazon Augmented AI with AWS Marketplace ML models shows you how to use AWS Marketplace pre-trained ML models with Amazon Augmented AI to implement human-in-the-loop workflow reviews of your ML model predictions.
- Monitoring data quality in third-party models from AWS Marketplace shows you how to perform Data Quality monitoring on a pre-trained third-party model from AWS Marketplace.
- Evaluating ML models from AWS Marketplace for person counting use case shows you how to use two AWS Marketplace GluonCV pre-trained ML models for a person counting use case and evaluate each model's performance on different types of crowd images.
- Preprocessing audio data using a pre-trained machine learning model demonstrates the usage of a pre-trained audio track separation model to create synthetic features and improve an acoustic classification model.
Using Dataset Products
- Using Dataset Product from AWS Data Exchange with ML model from AWS Marketplace is a sample notebook which shows how a dataset from AWS Data Exchange can be used with an ML Model Package from AWS Marketplace.
- Using Shutterstock Image Datasets to train Image Classification Models provides a detailed walkthrough on how to use the Free Sample: Images & Metadata of “Whole Foods” Shoppers from Shutterstock's Image Datasets to train a multi-label image classification model using Shutterstock's pre-labeled image assets. You can learn more about this implementation from this blog post.

⚖️ License

This library is licensed under the Apache 2.0 License. For more details, please take a look at the LICENSE file.

🤝 Contributing

Although we're extremely excited to receive contributions from the community, we're still working on the best mechanism to take in examples from external sources. Please bear with us in the short-term if pull requests take longer than expected or are closed. Please read our contributing guidelines if you'd like to open an issue or submit a pull request.


amazon-sagemaker-examples's Issues

What is the

Re: amazon-sagemaker-examples/introduction_to_applying_machine_learning/gluon_recommender_system/

import pip
pip.main(['install', 'pandas'])

This method does not work with newer versions of pip, and the p2.xlarge instance does not seem to have pandas available. What is the recommended or alternative way to install Python packages when the training and inference environments do not include particular modules?

Thank you.
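
One commonly used alternative, since `pip.main` was removed from pip's public interface in pip 10, is to run pip as a subprocess of the current interpreter. The following is a minimal sketch, not the maintainers' recommendation; the helper names `pip_install_command` and `pip_install` are hypothetical, and the install can only succeed if the environment has network access to a package index.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build the command that installs `package` with the running interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Run pip in a subprocess; raises CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(package))
```

Calling `pip_install("pandas")` at the top of the script would then install pandas into the same interpreter that runs the script.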

Trying to deploy pretrained MXNet model for inference only

I have a model that I've trained in MXNet to classify images, and I already have the model assets saved as
model.tar.gz in an s3 bucket.

from sagemaker.mxnet.model import MXNetModel
import sagemaker
import sys
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_model = MXNetModel(
    model_data='s3://bucket-name/model.tar.gz',
    entry_point='entry_point.py',  # entry_point.py is an empty .py file since we aren't using it for training
    role=role,
)
predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

I just want to be able to deploy this within a SageMaker notebook to a host and then call the predictor.predict function on an input image. However, the above sagemaker_model.deploy call fails and yields the following error message:

ValueErrorTraceback (most recent call last)
in ()
----> 1 predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/model.pyc in deploy(self, initial_instance_count, instance_type, endpoint_name)
90 production_variant = sagemaker.production_variant(model_name, instance_type, initial_instance_count)
91 self.endpoint_name = endpoint_name or model_name
---> 92 self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant])
93 if self.predictor_cls:
94 return self.predictor_cls(self.endpoint_name, self.sagemaker_session)

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in endpoint_from_production_variants(self, name, production_variants, wait)
512 self.sagemaker_client.create_endpoint_config(
513 EndpointConfigName=name, ProductionVariants=production_variants)
--> 514 return self.create_endpoint(endpoint_name=name, config_name=name, wait=wait)
515
516 def expand_role(self, role):

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in create_endpoint(self, endpoint_name, config_name, wait)
344 self.sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
345 if wait:
--> 346 self.wait_for_endpoint(endpoint_name)
347 return endpoint_name
348

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in wait_for_endpoint(self, endpoint, poll)
405 if status != 'InService':
406 reason = desc.get('FailureReason', None)
--> 407 raise ValueError('Error hosting endpoint {}: {} Reason: {}'.format(endpoint, status, reason))
408 return desc
409

ValueError: Error hosting endpoint sagemaker-mxnet-py2-cpu-2018-03-22-20-10-57-938: Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check.

I believe my attempt at using an empty file for the entry_point.py script is the reason this happened. But the problem in that case is that nowhere in the documentation was it clear to me what exactly should be in the entry_point.py script in the case that I just want to perform inference and no training with this model.

My other question is related to what the predictor.predict function actually expects. Do I need to pass it a numpy array? Is there a way to pass a string for the image_url instead and then write some simple image preprocessing script that loads the image, resizes, etc on the host before calling the mxnet model.predict function? I'm concerned that opencv is not part of the endpoint environment by default.

Any help with this would be much appreciated.

Customer Churn Prediction with XGBoost

When I tried a different CSV data set using XGBoost, I got the following errors:

Arguments: train
[2018-01-10:21:51:56:INFO] Running standalone xgboost training.
[2018-01-10:21:51:56:INFO] File size need to be processed in the node: 38.24mb. Available memory size in the node: 8611.8mb
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py:279: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
df = pd.read_csv(os.path.join(files_path, csv_file), sep=None, header=None)
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/exceptions.py:19: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
message = getattr(exception, 'message', str(exception))
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/exceptions.py:19: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
message = getattr(exception, 'message', str(exception))
[2018-01-10:21:52:06:ERROR] Algorithm Error: Could not determine delimiter (caused by Error)

Caused by: Could not determine delimiter
Traceback (most recent call last):
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train.py", line 34, in main
standalone_train(resource_config, train_config, data_config)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_methods.py", line 16, in standalone_train
train_job(resource_config, train_config, data_config)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 389, in train_job
dtrain = get_dmatrix(train_path, file_type, exceed_memory)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 317, in get_dmatrix
dmatrix = get_csv_dmatrix(files_path)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 279, in get_csv_dmatrix
df = pd.read_csv(os.path.join(files_path, csv_file), sep=None, header=None)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in init
self._make_engine(self.engine)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 805, in _make_engine
self._engine = klass(self.f, **self.options)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 1601, in init
self._make_reader(f)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 1705, in _make_reader
sniffed = csv.Sniffer().sniff(line)
File "/opt/amazon/python2.7/lib/python2.7/csv.py", line 188, in sniff
raise Error, "Could not determine delimiter"
Error: Could not determine delimiter

ValueErrorTraceback (most recent call last)
in ()
16 num_round=100)
17
---> 18 xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
152 self.latest_training_job = _TrainingJob.start_new(self, inputs)
153 if wait:
--> 154 self.latest_training_job.wait(logs=logs)
155 else:
156 raise NotImplemented('Asynchronous fit not available')

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in wait(self, logs)
321 def wait(self, logs=True):
322 if logs:
--> 323 self.sagemaker_session.logs_for_job(self.job_name, wait=True)
324 else:
325 self.sagemaker_session.wait_for_job(self.job_name)

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in logs_for_job(self, job_name, wait, poll)
656
657 if wait:
--> 658 self._check_job_status(job_name, description)
659 if dot:
660 print()

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in _check_job_status(self, job, desc)
399 if status != 'Completed':
400 reason = desc.get('FailureReason', '(No reason provided)')
--> 401 raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
402
403 def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error training xgboost-2018-01-10-21-46-25-058: Failed Reason: InternalServerError: We encountered an internal error. Please try again.
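
The log above shows the container reading the file with `pd.read_csv(..., sep=None, header=None)`, in which case pandas relies on Python's `csv.Sniffer` to guess the delimiter from the start of the file. A local check along the following lines may help reveal whether that sniffing step succeeds on your data before you upload it. This is a minimal sketch; the helper name `sniff_delimiter` is hypothetical, and it does not reproduce everything the container does.

```python
import csv


def sniff_delimiter(first_line):
    """Return the delimiter csv.Sniffer detects in `first_line`, or None if it cannot decide."""
    try:
        return csv.Sniffer().sniff(first_line).delimiter
    except csv.Error:
        # csv.Sniffer raises csv.Error("Could not determine delimiter") in this case
        return None
```

For example, passing the first line of the training file (read with `open(path).readline()`) shows whether a header row, a blank first line, or an unexpected separator is what the sniffer sees.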

missing function argument in kmeans example

In the notebook /sagemaker-python-sdk/1P_kmeans_highlevel/kmeans_mnist.ipynb, under the section "Training the K-Means model", the call
kmeans.fit(kmeans.record_set(train_set[0]))
causes an error when calling 'describe_model'.

Question about MxNet estimator

In the example "mnist_with_gluon_local_mode.ipynb" I notice that we have to pass a Python file to the MXNet estimator. I was wondering what would happen if the Python file (in this case mnist.py) is dependent on a few other Python files? Is it possible to make these available to the estimator as well?

When I try to test "kmeans_bring_your_own_model", I get the error "No module named mxnet"

Python3 Example Request

Do you recommend a base ubuntu image with python3 installed? Would like to create a python3 variant of this Dockerfile: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/container/Dockerfile

Is it possible to call the /invocations endpoint asynchronously?

Let's say I have an inferencing algorithm that runs for 10 minutes or longer so I don't want the client to wait for the result. Is it possible to tell the sagemaker service to call a callback url with the results?

tensorflow_resnet_cifar10_with_tensorboard - predict example

Hi,

I don't see a predict sample for the tensorflow_resnet_cifar10_with_tensorboard model. Can you please provide one?

Seq2seq .describe_training_job

message = sage.describe_training_job(TrainingJobName=job_name)['FailureReason']

The part should be...
message = sagemaker_client.describe_training_job(TrainingJobName=job_name)['FailureReason']
(i.e. sage -> sagemaker_client)

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.ipynb

Current cifar10 example uses ml.p2.xlarge which requires request to AWS support

I was working through the CIFAR10 example here after launching a notebook in AWS and noticed that I can't complete the tutorial without requesting a limit increase.

The current tutorial says:
"If you want to try the example without requesting an increase, just change the train_instance_count value to 1."

I did this but I still hit an AWS limit error:

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p2.xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

Looks like the current default limit is 0 now? Is there an alternative instance type you suggest to complete this or is a support request now required?

Missing chmod command in build_and_push.sh in scikit_bring_your_own project

In the build_and_push.sh script in the scikit_bring_your_own project there are a few missing lines (which are present in the corresponding notebook section):

#make the program executable
chmod +x decision_trees/train 
#On a SageMaker Notebook Instance, the docker daemon may need to be restarted in order
#to detect your network configuration correctly.  (This is a known issue.)
if [ -d "/home/ec2-user/SageMaker" ]; then
  sudo service docker restart
fi

ValueError: export_outputs must be a dict error when saving model_to_estimator

Hi, I have been trying to test tf.keras.estimator.model_to_estimator(keras_model=model) and save the result in order to set up hosting for the model like in this example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_iris_byom/tensorflow_BYOM_iris.ipynb

However I keep receiving this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-75-a22213ebd37e> in <module>()
      1 exported_model = model.export_savedmodel(export_dir_base = 'export/Servo/', 
----> 2                                serving_input_receiver_fn = serving_input_fn)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py in export_savedmodel(self, export_dir_base, serving_input_receiver_fn, assets_extra, as_text, checkpoint_path)
    515           serving_input_receiver.receiver_tensors,
    516           estimator_spec.export_outputs,
--> 517           serving_input_receiver.receiver_tensors_alternatives)
    518 
    519       if not checkpoint_path:

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/export/export.py in build_all_signature_defs(receiver_tensors, export_outputs, receiver_tensors_alternatives)
    191     receiver_tensors = {_SINGLE_RECEIVER_DEFAULT_NAME: receiver_tensors}
    192   if export_outputs is None or not isinstance(export_outputs, dict):
--> 193     raise ValueError('export_outputs must be a dict.')
    194 
    195   signature_def_map = {}

ValueError: export_outputs must be a dict.

My code:

import numpy as np
import os
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from sklearn.externals import joblib


def featureTransform(features, max_words):
    tokenize = tf.keras.preprocessing.text.Tokenizer(num_words=max_words, char_level=False)
    tokenize.fit_on_texts(features) 
    return tokenize.texts_to_matrix(features).astype(np.float32) 

def encodeLabels(labels):
    encoder = LabelEncoder()
    encoder.fit(labels)
    y = encoder.transform(labels)
    num_classes = np.max(y) + 1
    print("num classes: {}".format(num_classes))
    return tf.keras.utils.to_categorical(y, num_classes).astype(np.float32)

def baselineModel(max_words, num_classes):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(500, activation='relu', input_shape=(max_words,), name="features"))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(500, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
    
    return tf.keras.estimator.model_to_estimator(keras_model=model)


def serving_input_fn():
    feature_spec = {'features_input': tf.FixedLenFeature(dtype=tf.float32, shape=[500])}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()


def train_input_fn(training_dir, params):
    """Returns input function that would feed the model during training"""
    return input_function(training_dir, 'assignment_train.csv')

def input_function(training_dir, training_filename, shuffle=False):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename=os.path.join(training_dir, training_filename), target_dtype=np.str, features_dtype=np.float32)
    
    input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"features_input":  np.array(training_set.data)}, 
        y=encodeLabels(training_set.target),
        num_epochs=100,
        shuffle=shuffle
    )
    return input_fn

I run the following:

model = baselineModel(500, 123)
model.train(input_fn=input_function('data','assignment_train.csv', shuffle=True))
score = model.evaluate(input_function('data','assignment_train.csv', shuffle=True), steps = 100)
exported_model = model.export_savedmodel(export_dir_base = 'export/Servo/', 
                               serving_input_receiver_fn = serving_input_fn)

Any insight would be greatly appreciated!

SageMaker returns a 500 error after Installing the R Kernel

As soon as I run the command to install the R kernel and refresh the Jupyter dashboard, I get a 500 error.

I've tried through the notebook example in advanced_functionality and also through the terminal. I also tried upgrading conda first but same result.

Tensorboard not displaying scalars

The notebook example with Tensorboard amazon-sagemaker-examples/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb is not displaying scalars or images. Only the graph and projector are displayed.

If one run is terminated and a new one is started (using the same base_job_name so it starts from the previously saved checkpoint) by running again:
estimator.fit(inputs, run_tensorboard_locally=True)
then the scalars and images of the previous run are displayed on Tensorboard but they are not updated as training continues.

video-game-sales-xgboost S3 download does not use the prefix

s3.Bucket(bucket).download_file(raw_data_filename, 'raw_data.csv')
should probably be:
s3.Bucket(bucket).download_file(prefix + '/' + raw_data_filename, 'raw_data.csv')
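
A small helper can make the key construction less error-prone than string concatenation, for example when the prefix already ends in a slash. This is only a sketch of the idea; `s3_key` is a hypothetical name and is not part of the notebook.

```python
import posixpath


def s3_key(prefix, filename):
    """Join an S3 key prefix and a file name with exactly one '/' between them."""
    return posixpath.join(prefix.strip("/"), filename)
```

With this helper, the download call would read `s3.Bucket(bucket).download_file(s3_key(prefix, raw_data_filename), 'raw_data.csv')`.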

How to enable tensorboard to real-time monitor model training performance?

I'm running training on SageMaker with a Docker image that contains my own algorithm, following the example notebook [1]. How can I enable TensorBoard to monitor model training performance in real time? My model is based on Keras with a TensorFlow backend.

Thanks!

[1] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb

Connecting to Tensorboard without using notebook?

Hello.

I am trying to use sagemaker.tensorflow.estimator.TensorFlow for training a new tensorflow model on Sagemaker. I would love to have access to TensorBoard no matter what, but since this will be a job run through airflow on a machine dedicated to data pipeline jobs, I will not have access to the local TensorBoard link. I am not using notebooks either, so I can't really use this option to 'connect to' as mentioned in the Resnet CIFAR 10 Example

You can access TensorBoard locally at http://localhost:6006 or using your SageMaker notebook instance proxy/6006/ (TensorBoard will not work if you forget to put the slash, '/', at the end of the URL).

Is there a way to connect via proxy for a job instead, or another way if not using notebooks?

I instantiate sagemaker.tensorflow.estimator.TensorFlow in the following way:

        return TensorFlow(
            entry_point=self.entry_point,
            source_dir=self.source_dir,
            role=self.config.role,
            output_path=self.config.output_path,
            code_location=self.config.code_location,
            train_instance_count=self.config.train_instance_count,
            train_instance_type=self.config.train_instance_type,
            training_steps=self.config.training_steps,
            evaluation_steps=self.config.evaluation_steps
        )

I'm calling fit like:

            self.estimator.fit(
                self.config.train_data_location,
                job_name=self.training_job_name,
                run_tensorboard_locally=True,
            )

xgboost churn data location moved

For https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_applying_machine_learning/xgboost_customer_churn

The data file: http://www.dataminingconsultant.com/data/churn.txt no longer exists. Instead they are available as zip files via http://dataminingconsultant.com/DKD2e_data_sets.zip.

How to kill pending notebook instance?

I am attempting to launch a ml.p2.xlarge notebook instance. However, 30 minutes and counting, it still is in pending, and there is no way to kill it. What do I do about this? And what is the billing status of a notebook instance that is in its pending state?

How to make parameters/files available to Tensorflow Endpoint Instance

I'm looking to make some hyperparameters or files available to the serving endpoint in SageMaker. The training instance is given access to input parameters using hyperparameters in:

estimator = TensorFlow(entry_point='autocat.py',
                       role=role,
                       output_path=params['output_path'],
                       code_location=params['code_location'],
                       train_instance_count=1,
                       train_instance_type='ml.c4.xlarge',
                       training_steps=10000,
                       evaluation_steps=None,
                       hyperparameters=params)

However, when the endpoint is deployed, there is no way to pass in parameters that are used to control the data processing in the input_fn(serialized_input, content_type) function.

What would be the best way to pass parameters to the serving instance? Is the source_dir parameter defined in the sagemaker.tensorflow.TensorFlow class copied to the serving instance? If so, I could use a config.yml or similar.

The reason that I'm asking is that I keep the location of a TFIDF vectorizer in the params dictionary, and it loads it at training time from s3. In the future I'd like to use this same approach to load embeddings at serving time.
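
If the files in source_dir do turn out to be available next to the entry point at serving time (an assumption this issue leaves open), the config-file idea could look roughly like the sketch below. The function name `load_serving_config` and the file name `config.json` are hypothetical; JSON is used instead of YAML only so that the example needs nothing beyond the standard library.

```python
import json
import os


def load_serving_config(script_path, filename="config.json"):
    """Load a JSON config stored in the same directory as the given script, or {} if absent."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    config_path = os.path.join(script_dir, filename)
    if not os.path.exists(config_path):
        return {}
    with open(config_path) as f:
        return json.load(f)
```

Inside the entry point, `load_serving_config(__file__)` could then be called once at import time, and input_fn could read values such as the vectorizer location from the returned dictionary.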

deploy a new algorithm

Hello,
I'm trying to deploy my own algorithm. Should I build a Docker image, or is there a better method? I don't know much about SageMaker, so could anyone please give the steps to deploy my own algorithm?
Best regards

Can't run TensorFlow graphs without Estimator

I have a unique model paradigm that does not fit the structure of an Estimator, i.e., there is no model_fn() definition such that I can get the correct inputs/outputs. In other words, even under the rubric of custom estimators, I cannot use a tf.Estimator, and instead have had to code everything using the low-level TF API.

Is there a low-level version of the Python SageMaker API such that I can define my graph and run it, using feed_dict as needed, and all the other low-level TF API features?

scikit_bring_your_own.ipynb not working as expected

Hello,

Background

I'm trying to learn how to create own containers and deploy a model with them. I have been following the example scikit_bring_your_own.ipynb to push the docker image to my ECR repositories, train and upload the model artifact to S3 bucket. The Dockerfile was not changed, so the docker image should be the same. The model was successfully trained with the given codes.

Error

However, when I'm trying to create the model from the container image and artifact (step "Deploy the model" in the notebook) by running:

from sagemaker.predictor import csv_serializer
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)

I encountered an error which is:
An error occurred (ValidationException) when calling the CreateModel operation: ECR image "xxxxxx(my account id here).dkr.ecr.us-west-2.amazonaws.com/decision-trees-sample" is invalid.

Attempts:

I have tested hosting the model locally with the given train.sh, serve.sh and predict.sh files, the model works well locally.
I have tried to use the SageMaker console to create the model from ECR, the error is the same
I have tried to use AWS CLI to create the model, same issue.

Could anyone help with this problem?
Thanks a lot!

scikit_bring_your_own.ipynb train model pandas error

Hello!

I am following the scikit_bring_your_own tutorial and trying to set up a bring-your-own (BYO) model for production use, but I am encountering the following issue when trying to train the model on AWS SageMaker.


AlgorithmError: Exception during training: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Traceback (most recent call last):
File "/opt/program/train", line 48, in train
raw_data = [ pd.read_csv(file, header=None) for file in input_files ]
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 709, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 449, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 818, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1049, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1695, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parser

I uploaded the data to s3 using:

    def upload_data(self):
        self.logger.info(
            'Uploading locally available data to s3 in path: %s, using bucket: %s using s3 directory prefix: %s'
            % (
                self.config.data_directory_path,
                self.config.data_upload_bucket,
                self.config.s3_data_directory_prefix,
            )
        )

        self.train_data_location = self.session.upload_data(
            path=self.config.data_directory_path,
            bucket=self.config.data_upload_bucket,
            key_prefix=self.config.s3_data_directory_prefix
        )

        self.logger.info('Uploaded local data to s3 path: %s' % (self.train_data_location))

I ran the build_and_push.sh script.

Then I tried to train the model using:

    def estimator(self):
        self.logger.info(
            'Creating estimator for %s model %s using image %s' % (
                'BYO',
                self.config.model_name,
                self.image,
            )
        )

        return Estimator(
            image_name=self.image,
            role=self.config.role,
            train_instance_count=self.config.train_instance_count,
            train_instance_type=self.config.train_instance_type,
            output_path=self.config.output_path,
            base_job_name=self.config.base_job_name,
            sagemaker_session=self.session,
        )

(I'm using the same code as in the notebook, just rewritten for using it as a class)

Am I missing something or doing something wrong?
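
This pandas error is commonly reported when read_csv is handed something that is not a readable file, for example a subdirectory that ended up inside the training channel after uploading a whole local directory. Whether that is the cause here is an assumption; the traceback alone does not confirm it. A minimal sketch of a more defensive file listing for the train script, where `list_input_files` is a hypothetical helper name and not part of the notebook:

```python
import os


def list_input_files(channel_dir):
    """Return sorted paths of regular, non-empty files under channel_dir.

    Directories and zero-byte placeholder files are skipped, because
    handing either of them to a CSV reader fails.
    """
    paths = []
    for root, _dirs, files in os.walk(channel_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and os.path.getsize(path) > 0:
                paths.append(path)
    return sorted(paths)
```

In the train script, the list comprehension that feeds pd.read_csv could then iterate over `list_input_files(training_path)` instead of a raw directory listing.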

Jupyter Lab is not the default

This is great! However, Jupyter Lab is much more versatile and I'd like my team to work on Jupyter Lab. When will Jupyter Lab be the default UI for SageMaker?

--Marco

Using caffe with sagemaker

Hello,

Not really an issue but more of a request.

I have a Caffe model, and I also have a Caffe Docker container.

I was wondering if there were any plans to support Caffe?

Thanks

Model Error - Invoke the endpoint

When I try to invoke the endpoint and start the prediction, I get the error below:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "

<title>500 Internal Server Error</title>

Internal Server Error

The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

". See https://eu-west-1.console.aws.amazon.com/cloudwatch/home?region=eu-west-1#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-all-2018-04-11-12-00-21-560 in account 180856571690 for more information.

how to install scala kernel to sagemaker's jupyter?

Is Scala supported in Jupyter notebooks under SageMaker? If not, how can I install a Scala kernel in SageMaker's Jupyter?

Forbidden S3 bucket in the example: amazon-sagemaker-examples/introduction_to_applying_machine_learning/gluon_recommender_system/gluon_recommender_system.ipynb

In the example code "amazon-sagemaker-examples/introduction_to_applying_machine_learning/gluon_recommender_system/gluon_recommender_system.ipynb", I couldn't copy the data file located at "s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz". Please check the below code part in that Jupyter file and replace the below code with a valid S3 address:

aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz /tmp/recsys/

seq2seq_translation_en-de - bucket for pretrained model artifacts does not exist

Under the pre-trained model section of the seq2seq_translation_en-de.ipynb the instructions say to curl the model artifacts from "https://s3-us-west-2.amazonaws.com/gsaur-seq2seq-data/seq2seq/eng-german/full-nb-translation-eng-german-p2-16x-2017-11-24-22-25-53/output/"

When attempting to curl these artifacts you get the following error:

This XML file does not appear to have any style information associated with it. The document tree is shown below.

Code: NoSuchBucket
Message: The specified bucket does not exist
BucketName: gsaur-seq2seq-data
RequestId: 689B28A38C6874F0
HostId: DB0wm6RExts0znrV1uBktPdZa7ore4QA2IhlP6F7usDNFaZ7I8DbnYVgRPobziNc7cbQzZ2pGps=

Was this bucket deleted accidentally?

DeepAR external Regressors

I am working on time series forecasting using LSTMs. My dataset has time series data (sales on each day), plus external regressors such as the discount on a particular day, holidays, and day of week. Now that I want to move to DeepAR, I was wondering how to incorporate these features in the DeepAR training data (JSON format) and run DeepAR. Using "cat" in the training dataset didn't work. Note: I want to add these because I want to know how the forecast is affected by the discount and holiday period features.
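
For reference, the DeepAR input format documents a "dynamic_feat" field for time-varying features, alongside "cat" for static categorical ones. The following is a minimal sketch of building one JSON Lines record, assuming the algorithm version in use accepts "dynamic_feat"; the helper name `deepar_record` and all values are made up for illustration.

```python
import json


def deepar_record(start, target, dynamic_feat=None, cat=None):
    """Build one JSON Lines record for a DeepAR training file.

    dynamic_feat is a list of feature series (for example discount and
    holiday flags); for training, each series is as long as target.
    """
    record = {"start": start, "target": list(target)}
    if cat is not None:
        record["cat"] = list(cat)
    if dynamic_feat is not None:
        for series in dynamic_feat:
            if len(series) != len(target):
                raise ValueError("each dynamic feature must match target length")
        record["dynamic_feat"] = [list(series) for series in dynamic_feat]
    return json.dumps(record)


# Illustrative values only: four days of sales with discount and holiday flags.
line = deepar_record(
    start="2018-01-01 00:00:00",
    target=[12, 15, 9, 30],
    dynamic_feat=[[0.0, 0.0, 0.1, 0.5], [0, 0, 0, 1]],
)
```

At inference time the dynamic features would also need to cover the forecast horizon, since the model conditions on their future values.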

Deploying code from notebook

From scikit_bring_your_own.ipynb

When you are using a framework (such as Apache MXNet or TensorFlow) that has direct support in SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework.

What exactly is this referring to? I have a Python script currently living inside an .ipynb in Jupyter on SageMaker. It's a batch script which pulls the training data from DynamoDB and runs ALS training on it using Spark's MLlib. It finishes by writing some results to DynamoDB. Can I go ahead and deploy this as a recurring batch job easily without messing with Docker containers?

Is xgboost instance weight supported?

Instance weight is an XGBoost option that assigns a weight to each row of training data; usually there is a separate CSV or text file for it.
http://xgboost.readthedocs.io/en/latest/input_format.html#instance-weight-file

Is it currently supported, or on the roadmap?
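
The built-in XGBoost algorithm documents a csv_weights hyperparameter which, when enabled, treats the second CSV column (the one right after the label) as the instance weight. Assuming that behaviour applies to the version in use, a minimal sketch of laying out such a file follows; `weighted_csv` is a hypothetical helper and the numbers are illustrative.

```python
import csv
import io


def weighted_csv(labels, weights, features):
    """Serialize rows as label, weight, feature_1, ..., feature_n (no header)."""
    if not (len(labels) == len(weights) == len(features)):
        raise ValueError("labels, weights and features must have equal length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, weight, row in zip(labels, weights, features):
        writer.writerow([label, weight] + list(row))
    return buffer.getvalue()


text = weighted_csv(
    labels=[1, 0],
    weights=[2.0, 1.0],
    features=[[0.5, 3.1], [0.7, 1.2]],
)
```

The training job would then be configured with the csv_weights hyperparameter set to 1, so that the second column is not read as a feature.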

Issues with the default (latest?) version of framework (as of May 24th 2018)

Re: amazon-sagemaker-examples/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/

The training returns strange verbose logs.
...
2018-05-24 18:39:00.972957: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-05-24 18:39:01.003414: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-05-24 18:39:01.012146: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-05-24 18:39:01.026917: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
Tensorboard does not seem to be working.
[Errno 111] Connection refused

These issues disappear if framework_version='1.5' is used:

estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                       source_dir=source_dir,
                       framework_version='1.5',
                       role=role,
                       hyperparameters={'min_eval_frequency': 10},
                       training_steps=1000, evaluation_steps=100,
                       train_instance_count=2, train_instance_type='ml.c4.xlarge',
                       base_job_name='tensorboard-example')

The first issue seems to be a known issue which relates to S3 and tensorflow.

Regardless of this strange issue, a model is properly trained and saved in S3. Endpoint is also created.

How do I perform A/B testing?

I'm trying to figure out how to perform A/B testing using AWS SageMaker. I understand that setting train_instance_count will distribute the training across two instances. But how do I set the percentage of inference calls each model will handle and perform A/B testing?
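
Traffic splitting is a hosting concern rather than a training one: an endpoint configuration can list several production variants, each with an InitialVariantWeight, and requests are distributed in proportion to those weights. Below is a minimal sketch of building that list; the model and variant names are placeholders, and the boto3 call is shown only as a comment because it needs real models and credentials.

```python
def production_variants(models, instance_type="ml.m4.xlarge"):
    """Build a ProductionVariants list from {variant_name: (model_name, weight)}."""
    return [
        {
            "VariantName": variant,
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": float(weight),
        }
        for variant, (model_name, weight) in models.items()
    ]


def traffic_share(variants):
    """Return the fraction of requests each variant receives."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Placeholder names: send roughly 90% of traffic to model A and 10% to model B.
variants = production_variants({"A": ("model-a", 9), "B": ("model-b", 1)})
shares = traffic_share(variants)

# With real models, the list would be passed along these lines:
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="ab-test-config", ProductionVariants=variants)
```

The InvokeEndpoint response reports which variant served each request, which is what makes it possible to compare the two models afterwards.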

Embedding Example?

Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check.

I am trying to deploy a BYO (bring your own) Keras model. I pushed the image to ECR with the 'latest' tag. All local testing passed, and I am able to successfully train the model, e.g.:

image = '{}.dkr.ecr.{}.amazonaws.com/my-model:latest'.format(account, region)

dl = sage.estimator.Estimator(image,
                       role, 1, 'ml.c4.2xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       sagemaker_session=sess)

However attempting to deploy gives me the error:

Failed Reason:  The primary container for production variant AllTraffic did not pass the ping health check.

I am not quite sure where this stems from given local health check passed. Any insight would be great! Thanks.
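
For context, the hosting contract expects the container, when started with the "serve" argument, to listen on port 8080 and to answer GET /ping with HTTP 200 promptly, and POST /invocations with predictions. The following is a minimal standard-library sketch of that contract, not the server used in the example notebooks; the echo in /invocations is a placeholder for real model inference.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Answers the two routes the hosting contract relies on."""

    def do_GET(self):
        if self.path == "/ping":
            # A real container might first check that the model has loaded.
            self._reply(200, b"")
        else:
            self._reply(404, b"not found")

    def do_POST(self):
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            self._reply(200, body)  # placeholder: echo instead of predicting
        else:
            self._reply(404, b"not found")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    """Create the server; calling serve_forever() on it blocks."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)
```

If the local check passes but the hosted one fails, it may be worth comparing how long the model takes to load at startup and whether the image's "serve" entry point is the one actually being executed; the endpoint's CloudWatch logs usually show which.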

How to get the training job name inside train program?

I followed link [1] to build a docker image for training on ECS with my own model. Is there a way to get the current training job name inside the train program? I want to find out the S3 folder where the model artifact is uploaded, because I want to upload the training logs and some other intermediate training data to the same folder. It sounds like sagemaker.Session should have a property or method to support this, but I haven't found it yet.

Or is there a way to pass in the job name as a parameter to the train script? After all, the job name is available when the training job is created.

Thanks!

[1] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
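
As far as the training container documentation describes, the job name is exposed to the container through the TRAINING_JOB_NAME environment variable, and model artifacts land under the job name inside the configured output path. A minimal sketch under those assumptions follows; both helper names are hypothetical.

```python
import os


def current_training_job_name(default=None):
    """Read the job name from the environment, or return default if unset."""
    return os.environ.get("TRAINING_JOB_NAME", default)


def artifact_prefix(output_path, job_name):
    """Compose the S3 prefix where the job's model.tar.gz is expected."""
    return "{}/{}/output".format(output_path.rstrip("/"), job_name)
```

Inside the train program this could be used as `artifact_prefix(output_path, current_training_job_name())`, with the output path passed in as a hyperparameter, since the container does not receive it otherwise.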

scikit_bring_your_own.ipynb deploy model error

When going through the notebook above with a sagemaker notebook instance everything worked up to the deploy line:
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)
There I get "ClientError: An error occurred (ValidationException) when calling the CreateModel operation: ECR image ".dkr.ecr.us-east-1.amazonaws.com/decision-trees-sample" is invalid."

I tried via the sagemaker console using my root account and also get the same ValidationException error.

Testing the image (pulled from ECR) locally with serve_local.sh/predict_local.sh didn't show errors
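
One thing that may be worth ruling out is the image URI itself. As quoted, it starts with ".dkr.ecr", which could simply be the account ID being redacted, but an empty account ID would also make the URI invalid. The sketch below is a local sanity check of the usual ECR URI shape, not SageMaker's own validation; it ignores digest references and non-standard partitions.

```python
import re

# <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repository>[:<tag>]
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[a-z0-9][a-z0-9._/-]*)(?::(?P<tag>[\w.-]+))?$"
)


def parse_ecr_image(uri):
    """Split an ECR image URI into its parts, or raise ValueError."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError("not a well-formed ECR image URI: %r" % uri)
    parts = match.groupdict()
    # Docker falls back to the "latest" tag when none is given.
    parts["tag"] = parts["tag"] or "latest"
    return parts
```

If the URI is well formed, the next things to compare are the region of the repository against the region of the SageMaker call, and whether an image with the expected tag was actually pushed.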

Can't pip install tqdm - stuck

How can I work with tqdm in sagemaker ?

Training.abalone not created after running abalone_estimator.fit(inputs)

Hi,
I'm using the same code on Amazon SageMaker with the abalone.csv data,
but the training files are not created in the directory.
Please help resolve this.

Cannot install rJava

I'm creating a SageMaker notebook using R, and I installed the R Kernel correctly so that I can run R inside a Jupyter notebook. There are a few issues I've noticed with installing packages, but they can usually be circumvented by adding a specific repository source, specifying the dependencies, etc. However, there is one package that doesn't follow these workarounds: rJava.

Can rJava be installed in SageMaker?

Background:

When I try to install the library "RWeka" there is an issue that I can't get around. I've traced the error down to the dependency on the "rJava" package. I've entered the following commands:

install.packages('rJava')
install.packages('rJava', repos = 'https://cran.r-project.org/')
This results in the following error:

“installation of package ‘rJava’ had non-zero exit status”
Updating HTML index of packages in '.Library'
I suspect this is because rJava has system requirements of "Java JDK 1.2 or higher (for JRI/REngine JDK 1.4 or higher), GNU make."

SageMaker's Java JDK is 1.8.0_121, so I'm not sure what the issue is. I've tried installing in the terminal and with multiple variants of R, using the devtools library, etc. Is rJava not supported?

Getting error while invoking sagemaker endpoint

I created a training job in SageMaker with my own training and inference code using the MXNet framework. I was able to train the model successfully and create an endpoint as well. But when running inference, I get the following error:
‘ClientError: An error occurred (413) when calling the InvokeEndpoint operation: HTTP content length exceeded 5246976 bytes.’
From my research, I understand that the error is due to the size of the image. The image shape is (480, 512, 3). I trained the model with images of the same shape (480, 512, 3).

When I resized the image to (240, 256), the error was gone, but it produced another error, 'shape inconsistent in convolution', as I trained the model with images of size (480, 512).

I don't understand why I am getting this error during inference.
Can't we use larger images for inference?
Any suggestions would be helpful.

Thanks, Harathi
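
The limit in the error applies to the size of the request body, not to the image dimensions, so the serialization format matters. The sketch below compares two ways of sending a (480, 512, 3) image, assuming (this is a guess about the setup) that the pixels are currently sent as a JSON array of floats; the pixel values are synthetic.

```python
import json

LIMIT_BYTES = 5246976  # limit quoted in the error message
HEIGHT, WIDTH, CHANNELS = 480, 512, 3


def json_payload_size(n_values):
    """Size in bytes of n_values normalized floats serialized as a JSON array."""
    values = [(i % 256) / 255.0 for i in range(n_values)]
    return len(json.dumps(values).encode("utf-8"))


def raw_payload_size(n_values):
    """Size in bytes of the same pixels sent as one unsigned byte each."""
    return len(bytes(i % 256 for i in range(n_values)))


n_pixels = HEIGHT * WIDTH * CHANNELS
json_size = json_payload_size(n_pixels)
raw_size = raw_payload_size(n_pixels)
print("JSON floats:", json_size, "bytes; raw uint8:", raw_size, "bytes")
```

Sending the raw bytes (or the original compressed image file) and decoding and normalizing inside the inference code keeps the input shape the model was trained on while staying under the limit.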

Gluon recommender system example causes kernel to crash

When attempting to run this example, the kernel dies somewhere around net.collect_params() with an output that looks like
mfblock0_ (
  Parameter mfblock0_embedding0_weight (shape=(140344, 64), dtype=<class 'numpy.float32'>)
  Parameter mfblock0_embedding1_weight (shape=(38385, 64), dtype=<class 'numpy.float32'>)
  Parameter mfblock0_dense0_weight (shape=(64, 0), dtype=<class 'numpy.float32'>)
  Parameter mfblock0_dense0_bias (shape=(64,), dtype=<class 'numpy.float32'>)
)
The error log says .../tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error. Any thoughts?

Invoke the endpoint in Sagemaker AWS via predictor.predict ()

Hello SageMaker Community,

I have a problem when I call predictor.predict(" txt format ") in a SageMaker notebook. I get the error below, which relates to the format of the input I pass:

ValueError Traceback (most recent call last)
in ()
8
9
---> 10 print(predictor.predict("the password is 15jdgvd "))

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data)
72 """
73 if self.serializer is not None:
---> 74 data = self.serializer(data)
75
76 request_args = {

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/predictor.py in __call__(self, data)
247 return _json_serialize_from_buffer(data)
248
--> 249 raise ValueError("Unable to handle input format: {}".format(type(data)))
250
251

ValueError: Unable to handle input format: <class 'str'>

xgboost direct marketing example doesn't make sense

Something doesn't seem right about this example.

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_direct_marketing/xgboost_direct_marketing_sagemaker.ipynb

The notes in the Exploration section indicate correctly that 90% of the customers do not subscribe.

The notes in the Evaluation section say ~3700 customers from the test data set were predicted to subscribe. The test data set has right around 4,000 (10% of the whole dataset). So the model that was built predicts a vast majority of subscribers. What's going on here?

Tensorflow container error for evaluation

Hello

I am trying to train a Keras model using Sagemaker.

I am able to train my model in a Sagemaker Notebook, but when I try to execute my scripts locally, I get the following, pointing to a failure in the evaluation step (I get this error instantly after evaluation starts):

ERROR - container_support.training - uncaught exception during training: unsupported operand type(s) for /: 'unicode' and 'float'

From:

2018-04-11 21:31:39,770 ERROR - container_support.training - uncaught exception during training: unsupported operand type(s) for /: 'unicode' and 'float'
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 25, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 107, in train
    train_wrapper.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 118, in train
    hparams=hparams)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 218, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 46, in _execute_schedule
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 661, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 390, in train
    saving_listeners=self._saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 868, in _call_train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 314, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 815, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 539, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1013, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1104, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1089, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 1196, in after_run
    induce_stop = m.step_end(self._last_step, result)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 356, in step_end
    return self.every_n_step_end(step, output)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 694, in every_n_step_end
    validation_outputs = self._evaluate_estimator()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 665, in _evaluate_estimator
    name=self.name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 361, in evaluate
    hooks.extend(self._convert_eval_steps_to_hooks(steps))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 375, in _convert_eval_steps_to_hooks
    return [evaluation._StopAfterNEvalsHook(num_evals=steps)]  # pylint: disable=protected-access
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/evaluation.py", line 97, in __init__
    else math.floor(num_evals / 10.))
TypeError: unsupported operand type(s) for /: 'unicode' and 'float'

It seems like it has something to do with the "training_steps" argument passed to the Tensorflow() Estimator:

estimator = TensorFlow(entry_point='itemembd.py',
                       role=role,
                       training_steps=None,
                       evaluation_steps=100,
                       train_instance_count=1,
                       train_instance_type='ml.c4.xlarge',
                       output_path='s3://ml/artifacts/itemembd')

estimator.fit('s3://ml/data/itemembd', job_name='itememdb-01')

and the "num_epochs" argument in my _input_fn:

def _input_fn(training_dir, training_filename):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=os.path.join(training_dir, training_filename),
        target_dtype=np.int,
        features_dtype=np.int
    )

    return tf.estimator.inputs.numpy_input_fn(
        x={
            USER_EMBEDDING_TENSOR_NAME: np.array(training_set.data[:, 0]),
            ITEM_EMBEDDING_TENSOR_NAME: np.array(training_set.data[:, 1])
        },
        y=np.array(training_set.data[:, 2]),
        shuffle=True,
        batch_size=64,
        num_epochs=10
    )()

That is, I am trying to use epochs over the training data instead of gradient updates. It's very strange that this works within the notebook and by local deployment as well. Any clues?

How is the entry point to the code specified in bring your own code?

I'm trying out the sample notebooks, currently the MXNet MNIST example, which demonstrates bringing your own code. The entry point parameter passed in when instantiating an estimator only mentions the source file (mnist.py), not a method name or any other point inside the source file.
So how does SageMaker figure out which method to send the training data to?
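
As far as the SDK documentation of that generation describes, the answer is a naming convention: the framework container imports the entry point file and calls a function named `train` in it, passing only the arguments that function declares. The sketch below illustrates the idea with the standard library; it is a simplified illustration, not the container's actual code, and `load_module` and `call_train` are hypothetical names.

```python
import importlib.util
import inspect


def load_module(path, name="user_script"):
    """Import a Python source file by path, as a container might do."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def call_train(module, **available):
    """Call module.train with only the keyword arguments it declares."""
    train = getattr(module, "train")
    wanted = inspect.signature(train).parameters
    kwargs = {key: value for key, value in available.items() if key in wanted}
    return train(**kwargs)
```

With this pattern, a script whose `train` declares only `hyperparameters` and `num_gpus` would receive just those two, even if the caller also has channel directories and host information available.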

where can I find gluon_recsys.ipynb

I saw the video "Diving Deeper with Amazon SageMaker" in the developer resources (https://aws.amazon.com/sagemaker/developer-resources/?nc1=h_ls). The recommendation demo is very interesting. Where can I find the code for gluon_recsys.ipynb and gluon_recommender_system_webinar.ipynb?