Comments (5)
@zronaghi it's currently running on PRs and every 2 hours; you can see the test history here: https://github.com/Azure/azureml-examples/actions?query=workflow%3Arun-tutorial-ur
this has recovered since I manually cancelled the stuck runs this morning, so I'll consider it resolved for now. If it happens again I'll reopen and follow up with the AML team; again, I think this is likely an issue on our end
from azureml-examples.
@zronaghi the train multi GPU example started failing overnight - any obvious reason why?
so it looks like two of the runs got stuck: one had been running for ~60 hours and the other for ~6. Between them they held the only 2 NC24S_V3 nodes on the test cluster, which blocked all other runs and caused the failures
I do not know why the runs got stuck - most likely it is an AML issue, not anything in the notebook
@akshaya-a can I set a max experiment timeout somehow?
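Until there is a confirmed answer on a built-in timeout, one workaround is a small watchdog that cancels any run older than a chosen limit. The sketch below is only an illustration under assumptions: `RunInfo`, `find_stale_runs`, `cancel_stale_runs`, and the `cancel` callback are hypothetical names, not part of the AML SDK, and the callback would need to be wired to whatever actually cancels a run. I believe azureml-core's `RunConfiguration` also exposes a `max_run_duration_seconds` setting, but that should be checked against the installed SDK version.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Optional


@dataclass
class RunInfo:
    """Hypothetical minimal view of a run: an id and a UTC start time."""
    run_id: str
    started_at: datetime


def find_stale_runs(
    runs: Iterable[RunInfo],
    max_duration: timedelta,
    now: Optional[datetime] = None,
) -> List[RunInfo]:
    """Return the runs that have been active longer than max_duration."""
    now = now or datetime.now(timezone.utc)
    return [run for run in runs if now - run.started_at > max_duration]


def cancel_stale_runs(
    runs: Iterable[RunInfo],
    max_duration: timedelta,
    cancel: Callable[[str], None],
    now: Optional[datetime] = None,
) -> List[str]:
    """Call cancel(run_id) for every stale run and return the cancelled ids.

    `cancel` is an assumption: it stands in for the real cancellation call.
    """
    stale = find_stale_runs(runs, max_duration, now)
    for run in stale:
        cancel(run.run_id)
    return [run.run_id for run in stale]


if __name__ == "__main__":
    now = datetime(2020, 10, 22, 12, 0, tzinfo=timezone.utc)
    active = [
        RunInfo("stuck-60h", now - timedelta(hours=60)),
        RunInfo("stuck-6h", now - timedelta(hours=6)),
        RunInfo("healthy", now - timedelta(minutes=20)),
    ]
    # Illustrative limit of 2 hours; print stands in for a real cancel call.
    cancel_stale_runs(active, timedelta(hours=2), cancel=print, now=now)
```

Running this on the same schedule as the tests (every 2 hours) would have freed both nodes well before the 60-hour mark; the 2-hour limit is only an example and should sit comfortably above the normal runtime of the slowest notebook.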
the logs from the 60 hour run are:
2020/10/20 02:08:40 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/info
2020/10/20 02:08:40 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
[2020-10-20T02:08:42.931450] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError', 'UserExceptions:context_managers.UserExceptions'], invocation=['train.py', '--data_dir', 'DatasetConsumptionConfig:input_52cc0296', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1'])
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 241
Set Dataset input_52cc0296's target path to /tmp/tmpm_f25xjc
Enter __enter__ of DatasetContextManager
SDK version: azureml-core==1.15.0 azureml-dataprep==2.3.0. Session id: 1473de45-925b-4fae-bb0d-39c1f3bdc076. Run id: rapids-airline-multi-example_1603159691_94beb9b3.
Processing 'input_52cc0296'.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Processing dataset FileDataset
{
"source": [
"https://airlinedataset.blob.core.windows.net/airline-10years/*"
],
"definition": [
"GetFiles"
],
"registration": {
"id": "a13fb474-eb17-43e7-9432-da7a83bef4b3",
"name": null,
"version": null,
"workspace": "Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')"
}
}
Mounting input_52cc0296 to /tmp/tmpm_f25xjc.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Mounted input_52cc0296 to /tmp/tmpm_f25xjc as folder.
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
Current directory: /mnt/batch/tasks/shared/LS_root/jobs/default/azureml/rapids-airline-multi-example_1603159691_94beb9b3/mounts/workspaceblobstore/azureml/rapids-airline-multi-example_1603159691_94beb9b3
Preparing to call script [ train.py ] with arguments: ['--data_dir', '$input_52cc0296', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']
After variable expansion, calling script [ train.py ] with arguments: ['--data_dir', '/tmp/tmpm_f25xjc', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']
Script type = None
---->>>> cuDF version <<<<----
0.15.0
---->>>> cuML version <<<<----
0.15.0
> RapidsCloudML
Compute, Data , Model, Cloud types ('multi-GPU', 'Parquet', 'RandomForest', 'Azure')
Multi-GPU selected
Client information <Client: 'tcp://127.0.0.1:37255' processes=4 threads=4, memory=473.42 GB>
multi-GPU
> Loading dataset from /tmp/tmpm_f25xjc/part*.parquet
GPU read
Reading using dask_cudf
cudf/utilities/bit.hpp(19): warning: cassert: [jitify] File not found
cudf/fixed_point/fixed_point.hpp(27): warning: cassert: [jitify] File not found
Ingestion completed in 3.451511531999131
Dataset descriptors: (Delayed('int-4123cf11-6492-4800-b2e9-ef18cdba67ca'), 15)
Flight_Number_Reporting_Airline float32
Year float32
Quarter float32
Month float32
DayOfWeek float32
DOT_ID_Reporting_Airline float32
OriginCityMarketID float32
DestCityMarketID float32
DepTime float32
DepDelay float32
DepDel15 float32
ArrDel15 int32
ArrDelay float32
AirTime float32
Distance float32
dtype: object
---->>>> Training using GPUs <<<<----
CV fold 0 of 1
> Splitting train and test data
X_train shape and type(Delayed('int-c2cdbdd5-99d6-4059-aca3-410840dc702d'), 13) <class 'dask_cudf.core.DataFrame'>
Split completed in 0.004746472997794626
> Training RandomForest estimator w/ hyper-params
Fitting multi-GPU daskRF
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
**kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
**kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
**kwargs
@lostmygithubaccount did the issue get resolved? How frequently do you run these tests?
I'll also be updating the container to the latest rapids release (0.16) after running some tests in the next week or so.
Related Issues (20)
- Machine Learning model for attack pattern on network logs
- Azureml job for vector index.
- Error: ScriptExecution.WriteStreams.Authentication when running Orange juice sales prediction parallel job HOT 1
- Argument issue
- Langchain Azure Mistral example doesn't run — AttributeError: 'ChatMessage' object has no attribute 'model_dump' HOT 2
- LiteLLM Example in mistral docs wrong HOT 14
- entry_relative_path issues when specifying the directory of a script.
- Reading delta table using Dataset type File(Machine learning studio)
- How to set 'auto_increment_version' to be True so that I don't have to udpate version manually for ever code update?
- /bin/bash: bad substitution
- Loading pre-built local Faiss Index causes ValueError (allow_dangerous_deserialization)
- "ManagedIdentityCredential: Unexpected content type "text/html"" error during AzureMachineLearningFileSystem function call HOT 1
- Langchain mistral notebook raising KeyError: 'choices' HOT 1
- An example on the "sdk-review" branch includes invalid function calls outdated for the recent SDK version
- azure ml studio not showing "view profile" button in pipeline page
- Azure_langchain_mistral_ai JSONDecodeError HOT 2
- Azure ML Monitoring - Bring your own production data - endpoint_deployment_id needed? HOT 3
- AzureML Custom Preprocessing Component as SDK variant HOT 4
- Azure Machine Learning Studio bug when configure a compute instance
- MLTable - AzureML - Cache Environment variables HOT 1