Which example? using-rapids multi GPU tutorial notebook

Tutorial: using-rapids failing (azureml-examples issue, closed)

azure commented on June 9, 2024

Tutorial: using-rapids failing

from azureml-examples.

Comments (5)

lostmygithubaccount commented on June 9, 2024

@zronaghi it's currently running on PRs and every 2 hours, you can see the test history here: https://github.com/Azure/azureml-examples/actions?query=workflow%3Arun-tutorial-ur

this has recovered since I manually cancelled the stuck runs this morning, so I'll consider this resolved for now. If it happens again I'll reopen and follow up with the AML team; again, I think this is likely an issue on our end

lostmygithubaccount commented on June 9, 2024

@zronaghi the train multi GPU example started failing overnight - any obvious reason why?

lostmygithubaccount commented on June 9, 2024

so it looks like two of the runs got stuck: one had been running for ~60 hours and the other for ~6. Together they occupied the only 2 NC24S_V3 nodes on the test cluster, which blocked all other runs and caused the failures

I do not know why the runs got stuck - most likely it is an AML issue, not anything in the notebook

@akshaya-a can I set a max experiment timeout somehow?
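
Until a server-side timeout is in place, a client-side check could flag runs like the two above. This is only a minimal sketch, not how AML enforces timeouts: the function name `find_overdue_runs` and the dict keys `run_id`, `status`, and `start_time` are hypothetical stand-ins for whatever the run listing actually returns. I believe azureml-core's `RunConfiguration` also exposes a `max_run_duration_seconds` setting, which may be the more direct answer, but that is worth checking against the installed SDK version.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def find_overdue_runs(
    runs: List[Dict],
    max_duration: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the IDs of runs still marked 'Running' for longer than max_duration.

    Each run is a plain dict with the hypothetical keys 'run_id', 'status',
    and 'start_time' (a timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    overdue = []
    for run in runs:
        # Finished runs no longer hold a node, so skip them.
        if run["status"] != "Running":
            continue
        if now - run["start_time"] > max_duration:
            overdue.append(run["run_id"])
    return overdue


if __name__ == "__main__":
    # Illustrative data only, mirroring the ~60 hour and ~6 hour stuck runs.
    now = datetime(2020, 10, 22, 14, 0, tzinfo=timezone.utc)
    runs = [
        {"run_id": "run-a", "status": "Running", "start_time": now - timedelta(hours=60)},
        {"run_id": "run-b", "status": "Running", "start_time": now - timedelta(hours=6)},
        {"run_id": "run-c", "status": "Running", "start_time": now - timedelta(minutes=20)},
        {"run_id": "run-d", "status": "Completed", "start_time": now - timedelta(hours=90)},
    ]
    print(find_overdue_runs(runs, timedelta(hours=2), now=now))
```

The returned IDs could then be cancelled by hand or from a scheduled job, so that two stuck runs cannot hold both nodes indefinitely.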

lostmygithubaccount commented on June 9, 2024

the logs from the 60 hour run are:

2020/10/20 02:08:40 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/info
2020/10/20 02:08:40 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
[2020-10-20T02:08:42.931450] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError', 'UserExceptions:context_managers.UserExceptions'], invocation=['train.py', '--data_dir', 'DatasetConsumptionConfig:input_52cc0296', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1'])
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 241
Set Dataset input_52cc0296's target path to /tmp/tmpm_f25xjc
Enter __enter__ of DatasetContextManager
SDK version: azureml-core==1.15.0 azureml-dataprep==2.3.0. Session id: 1473de45-925b-4fae-bb0d-39c1f3bdc076. Run id: rapids-airline-multi-example_1603159691_94beb9b3.
Processing 'input_52cc0296'.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Processing dataset FileDataset
{
  "source": [
    "https://airlinedataset.blob.core.windows.net/airline-10years/*"
  ],
  "definition": [
    "GetFiles"
  ],
  "registration": {
    "id": "a13fb474-eb17-43e7-9432-da7a83bef4b3",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')"
  }
}
Mounting input_52cc0296 to /tmp/tmpm_f25xjc.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Mounted input_52cc0296 to /tmp/tmpm_f25xjc as folder.
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/default/azureml/rapids-airline-multi-example_1603159691_94beb9b3/mounts/workspaceblobstore/azureml/rapids-airline-multi-example_1603159691_94beb9b3
Preparing to call script [ train.py ] with arguments: ['--data_dir', '$input_52cc0296', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']
After variable expansion, calling script [ train.py ] with arguments: ['--data_dir', '/tmp/tmpm_f25xjc', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']

Script type = None

---->>>> cuDF version <<<<----
 0.15.0

---->>>> cuML version <<<<----
 0.15.0

> RapidsCloudML
	Compute, Data , Model, Cloud types ('multi-GPU', 'Parquet', 'RandomForest', 'Azure')

	Multi-GPU selected

	Client information <Client: 'tcp://127.0.0.1:37255' processes=4 threads=4, memory=473.42 GB>
multi-GPU

> Loading dataset from /tmp/tmpm_f25xjc/part*.parquet

	GPU read

	Reading using dask_cudf
cudf/utilities/bit.hpp(19): warning: cassert: [jitify] File not found
cudf/fixed_point/fixed_point.hpp(27): warning: cassert: [jitify] File not found

	Ingestion completed in 3.451511531999131

	Dataset descriptors: (Delayed('int-4123cf11-6492-4800-b2e9-ef18cdba67ca'), 15)
	Flight_Number_Reporting_Airline    float32
Year                               float32
Quarter                            float32
Month                              float32
DayOfWeek                          float32
DOT_ID_Reporting_Airline           float32
OriginCityMarketID                 float32
DestCityMarketID                   float32
DepTime                            float32
DepDelay                           float32
DepDel15                           float32
ArrDel15                             int32
ArrDelay                           float32
AirTime                            float32
Distance                           float32
dtype: object

---->>>> Training using GPUs <<<<----


 CV fold 0 of 1


> Splitting train and test data

	X_train shape and type(Delayed('int-c2cdbdd5-99d6-4059-aca3-410840dc702d'), 13) <class 'dask_cudf.core.DataFrame'>

	Split completed in 0.004746472997794626

> Training RandomForest estimator w/ hyper-params

	Fitting multi-GPU daskRF
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs

zronaghi commented on June 9, 2024

@lostmygithubaccount did the issue get resolved? How frequently do you run these tests?

I'll also be updating the container to the latest RAPIDS release (0.16) in the next week or so, after running some tests.
