Comments (7)
Hi @winstonaws, the link you posted points to the master branch, so the line number no longer matches. Could you use a commit ID instead?
from amazon-sagemaker-examples.
Thanks @ThisIsRick . Have you tried using SageMaker's pre-built TensorFlow container for your task? There's an example notebook here which shows how to use TensorBoard with it. There are some intricacies with writing checkpoints to S3 and running TensorBoard locally that may make this more difficult to implement in your own container. Thanks.
from amazon-sagemaker-examples.
Thanks @djarpin.
I haven't tried SageMaker's pre-built TensorFlow container. My understanding is that the model script has to follow a specific pattern in order to use the pre-built TensorFlow container, right? Our model script doesn't; it is provided by an applied scientist.
We're also considering continuously syncing checkpoints to S3 from inside the container, and having another thread on the local machine sync the checkpoints down from S3. But our training job is launched with the AWS command line from a local desktop; we don't use a notebook instance on SageMaker. That makes the part that syncs checkpoints down from S3 a bit more complicated.
from amazon-sagemaker-examples.
The approach you described is the right one. The code inside your container needs to save checkpoints to S3, and you need to periodically sync your local TensorBoard log directory with those S3 checkpoints.
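A minimal sketch of the local half of that approach, assuming the AWS CLI is installed and configured on the desktop. The bucket path, the local directory, and the helper names (build_sync_command, sync_periodically) are made up for illustration and are not part of the SageMaker SDK:

```python
import subprocess
import time
from typing import Callable, List, Optional


def build_sync_command(source: str, destination: str) -> List[str]:
    """Return the AWS CLI invocation that mirrors `source` into `destination`."""
    return ["aws", "s3", "sync", source, destination]


def sync_periodically(
    s3_uri: str,
    local_logdir: str,
    interval_seconds: float = 60.0,
    max_iterations: Optional[int] = None,
    run: Callable[[List[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Pull checkpoints and event files from S3 into `local_logdir` on a timer.

    `run` and `sleep` are injectable so the loop can be exercised without
    touching AWS. Returns the number of sync passes performed.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        run(build_sync_command(s3_uri, local_logdir))
        iterations += 1
        # Only wait if another pass is still going to happen.
        if max_iterations is None or iterations < max_iterations:
            sleep(interval_seconds)
    return iterations
```

One way to use it is to start sync_periodically("s3://example-bucket/checkpoints", "./tb_logs") in a daemon thread (threading.Thread with daemon=True) and point tensorboard --logdir at ./tb_logs. Inside the container, the same build_sync_command helper with the arguments reversed (local checkpoint directory first, S3 URI second) would cover the upload side.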
Here is our implementation in the SageMaker Python SDK: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L29
Are there any specific questions you have about this approach?
from amazon-sagemaker-examples.
Closing this issue for now. Feel free to re-open it if you run into more problems with this. Thanks.
from amazon-sagemaker-examples.
@elgalu, I believe @winstonaws was pointing to https://github.com/aws/sagemaker-python-sdk/blob/8a3dea24f04a81b06df35a1c7aa262f6a1a02bb5/src/sagemaker/tensorflow/estimator.py#L29
The most up-to-date link as of now would be: https://github.com/aws/sagemaker-python-sdk/blob/cecea123d4933baa8998afd138fee3eaf28a8e49/src/sagemaker/tensorflow/estimator.py#L46
If either of those links goes out of date, he is referring to the TensorBoard class in estimator.py within src/sagemaker/tensorflow.
from amazon-sagemaker-examples.
TensorBoardOutputConfig (from sagemaker.debugger import TensorBoardOutputConfig) can also be useful: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html#capture-real-time-tensorboard-data-from-the-debugging-hook
from amazon-sagemaker-examples.