Which example? using-rapids tutorial

AML runs getting stuck on gpu clusters running rapids tutorial

azure commented on June 9, 2024

AML runs getting stuck on gpu clusters running rapids tutorial

from azureml-examples.

Comments (14)

lostmygithubaccount commented on June 9, 2024 1

for full context, run 333 is the second one that got stuck

I'll have the AML engineering team check if this is something on our end


lostmygithubaccount commented on June 9, 2024

logs:

2020/10/25 06:27:49 logger.go:297: Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/info
2020/10/25 06:27:49 logger.go:297: Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
[2020-10-25T06:27:51.919906] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError', 'UserExceptions:context_managers.UserExceptions'], invocation=['train.py', '--data_dir', 'DatasetConsumptionConfig:input_c31b4162', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1'])
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 241
Set Dataset input_c31b4162's target path to /tmp/tmp3686hv94
Enter __enter__ of DatasetContextManager
SDK version: azureml-core==1.15.0 azureml-dataprep==2.3.0. Session id: 4ccb97c4-cd0e-4a5d-9d43-ed922b9d6add. Run id: rapids-airline-multi-example_1603607238_498492de.
Processing 'input_c31b4162'.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Processing dataset FileDataset
{
  "source": [
    "https://airlinedataset.blob.core.windows.net/airline-10years/*"
  ],
  "definition": [
    "GetFiles"
  ],
  "registration": {
    "id": "a13fb474-eb17-43e7-9432-da7a83bef4b3",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')"
  }
}
Mounting input_c31b4162 to /tmp/tmp3686hv94.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Mounted input_c31b4162 to /tmp/tmp3686hv94 as folder.
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/default/azureml/rapids-airline-multi-example_1603607238_498492de/mounts/workspaceblobstore/azureml/rapids-airline-multi-example_1603607238_498492de
Preparing to call script [ train.py ] with arguments: ['--data_dir', '$input_c31b4162', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']
After variable expansion, calling script [ train.py ] with arguments: ['--data_dir', '/tmp/tmp3686hv94', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']

Script type = None

---->>>> cuDF version <<<<----
 0.15.0

---->>>> cuML version <<<<----
 0.15.0

> RapidsCloudML
	Compute, Data , Model, Cloud types ('multi-GPU', 'Parquet', 'RandomForest', 'Azure')

	Multi-GPU selected

	Client information <Client: 'tcp://127.0.0.1:38541' processes=4 threads=4, memory=473.42 GB>
multi-GPU

> Loading dataset from /tmp/tmp3686hv94/part*.parquet

	GPU read

	Reading using dask_cudf
cudf/utilities/bit.hpp(19): warning: cassert: [jitify] File not found
cudf/fixed_point/fixed_point.hpp(27): warning: cassert: [jitify] File not found

	Ingestion completed in 3.674437305999163

	Dataset descriptors: (Delayed('int-ff4f2b39-d9fb-48b3-acaf-357777eedce1'), 15)
	Flight_Number_Reporting_Airline    float32
Year                               float32
Quarter                            float32
Month                              float32
DayOfWeek                          float32
DOT_ID_Reporting_Airline           float32
OriginCityMarketID                 float32
DestCityMarketID                   float32
DepTime                            float32
DepDelay                           float32
DepDel15                           float32
ArrDel15                             int32
ArrDelay                           float32
AirTime                            float32
Distance                           float32
dtype: object

---->>>> Training using GPUs <<<<----


 CV fold 0 of 1


> Splitting train and test data

	X_train shape and type(Delayed('int-7849fd69-4caf-43e3-a71f-7d7111a2ce31'), 13) <class 'dask_cudf.core.DataFrame'>

	Split completed in 0.004588507996231783

> Training RandomForest estimator w/ hyper-params

	Fitting multi-GPU daskRF
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs


lostmygithubaccount commented on June 9, 2024

rapids-airline-multi-example_1603607238_498492de

stuck for ~60 hours

2020/10/27 02:28:42 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/info
2020/10/27 02:28:42 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
[2020-10-27T02:28:44.486546] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError', 'UserExceptions:context_managers.UserExceptions'], invocation=['train.py', '--data_dir', 'DatasetConsumptionConfig:input_fa2fd348', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1'])
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 243
Set Dataset input_fa2fd348's target path to /tmp/tmp7azhoytw
Enter __enter__ of DatasetContextManager
SDK version: azureml-core==1.15.0 azureml-dataprep==2.3.0. Session id: f34101d7-fbb7-44e4-a93c-d5ea51f80c59. Run id: rapids-airline-multi-example_1603765696_115ff533.
Processing 'input_fa2fd348'.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Processing dataset FileDataset
{
  "source": [
    "https://airlinedataset.blob.core.windows.net/airline-10years/*"
  ],
  "definition": [
    "GetFiles"
  ],
  "registration": {
    "id": "a13fb474-eb17-43e7-9432-da7a83bef4b3",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')"
  }
}
Mounting input_fa2fd348 to /tmp/tmp7azhoytw.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Mounted input_fa2fd348 to /tmp/tmp7azhoytw as folder.
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/default/azureml/rapids-airline-multi-example_1603765696_115ff533/mounts/workspaceblobstore/azureml/rapids-airline-multi-example_1603765696_115ff533
Preparing to call script [ train.py ] with arguments: ['--data_dir', '$input_fa2fd348', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']
After variable expansion, calling script [ train.py ] with arguments: ['--data_dir', '/tmp/tmp7azhoytw', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']

Script type = None

---->>>> cuDF version <<<<----
 0.15.0

---->>>> cuML version <<<<----
 0.15.0

> RapidsCloudML
	Compute, Data , Model, Cloud types ('multi-GPU', 'Parquet', 'RandomForest', 'Azure')

	Multi-GPU selected

	Client information <Client: 'tcp://127.0.0.1:43869' processes=4 threads=4, memory=473.42 GB>
multi-GPU

> Loading dataset from /tmp/tmp7azhoytw/part*.parquet

	GPU read

	Reading using dask_cudf
cudf/utilities/bit.hpp(19): warning: cassert: [jitify] File not found
cudf/fixed_point/fixed_point.hpp(27): warning: cassert: [jitify] File not found

	Ingestion completed in 3.5310610010001255

	Dataset descriptors: (Delayed('int-42f22444-d0ca-4fca-b470-b47043e92d23'), 15)
	Flight_Number_Reporting_Airline    float32
Year                               float32
Quarter                            float32
Month                              float32
DayOfWeek                          float32
DOT_ID_Reporting_Airline           float32
OriginCityMarketID                 float32
DestCityMarketID                   float32
DepTime                            float32
DepDelay                           float32
DepDel15                           float32
ArrDel15                             int32
ArrDelay                           float32
AirTime                            float32
Distance                           float32
dtype: object

---->>>> Training using GPUs <<<<----


 CV fold 0 of 1


> Splitting train and test data

	X_train shape and type(Delayed('int-1482a934-84c1-485c-91eb-4c582ac99ab3'), 13) <class 'dask_cudf.core.DataFrame'>

	Split completed in 0.00498011600029713

> Training RandomForest estimator w/ hyper-params

	Fitting multi-GPU daskRF
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs

rapids-airline-multi-example_1603765696_115ff533

stuck for ~14 hours


lostmygithubaccount commented on June 9, 2024

same thing as before: the compute cluster's max nodes is 2, and 2 runs got stuck for 13+ hours, causing subsequent runs to hit the 6-hour GHA (GitHub Actions) timeout

@zronaghi fyi - is it possible the run is legitimately getting stuck fitting the RF? it seems to consistently get stuck there from these logs and what I remember about last time. would there be an easy way to avoid this if so?


lostmygithubaccount commented on June 9, 2024

it is green again but leaving this issue open until the cause is fixed


zronaghi commented on June 9, 2024

Is it green again meaning only some of the runs fail? I haven't seen this issue in my environment but will re-run with these modified scripts/notebooks to investigate. I'll update to 0.16 RAPIDS as well since it was released last week.


lostmygithubaccount commented on June 9, 2024

it seems a low percent (maybe <5%?) of runs will get stuck - here is what it looks like in UI:

I manually cancelled run 309 after 57 hours of running. 2 runs need to get stuck at the same time for the next runs to stay queued indefinitely, causing the GitHub Action to fail after 6 hours. This seems likely enough to happen every couple of weeks or so at the current testing rate. After manually cancelling the two stuck runs, it recovers quickly

the repo has changed a bit since you first added these notebooks. please check the contributing guidelines, develop on a branch, and let us know if you have any issues - thanks for the contributions!


zronaghi commented on June 9, 2024

I just completed multi-GPU HPO runs (100 jobs) and wasn't able to reproduce the issue. I submitted a PR with the RAPIDS 0.16 container and set max_run_duration_seconds to 30 minutes for multi-GPU training. Would this be helpful in case of failure? It would also be good to know if any of these tests fail with the latest container or if you've seen similar issues with other Dask tutorials.
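To illustrate what a run-duration limit buys here, the following is a minimal local sketch using only the Python standard library. It is not the AML API: `run_with_limit` and its status labels are hypothetical names, and the "hung" command is just a stand-in for a training script that never returns. The point is only that a hard wall-clock limit turns an indefinite hang into a bounded failure, so the node is freed for the next run.

```python
import subprocess
import sys


def run_with_limit(cmd, max_run_duration_seconds):
    """Run cmd and stop it if it exceeds the wall-clock limit.

    Returns "Completed", "Failed", or "TimedOut" (illustrative labels,
    not AML run statuses).
    """
    try:
        result = subprocess.run(cmd, timeout=max_run_duration_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "TimedOut"
    return "Completed" if result.returncode == 0 else "Failed"


# Stand-in for a training script that hangs well past the limit.
hung = [sys.executable, "-c", "import time; time.sleep(60)"]
# Stand-in for a training script that finishes normally.
quick = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print(run_with_limit(hung, 1))
    print(run_with_limit(quick, 30))
```

With a limit in place, the worst case for a stuck run is the limit itself rather than tens of hours of occupied compute.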


lostmygithubaccount commented on June 9, 2024

renaming the issue - the timeout will solve this, will keep the issue open for AML to investigate. i know users have complained about runs getting stuck


lostmygithubaccount commented on June 9, 2024

MSFT internal ICM for investigation: https://portal.microsofticm.com/imp/v3/incidents/details/212111206/home


lostmygithubaccount commented on June 9, 2024

from the investigation:

All good runs use the following versions of azureml-core, azureml-dataprep, cuDF, and cuML:

SDK version: azureml-core==1.17.0 azureml-dataprep==2.4.2.

---->>>> cuDF version <<<<----
0.16.0a+1979.g2cda39b341

---->>>> cuML version <<<<----
0.16.0a+882.g5851f4140

The run stuck for over 50 hours had:

SDK version: azureml-core==1.15.0 azureml-dataprep==2.3.0.

---->>>> cuDF version <<<<----
0.15.0

---->>>> cuML version <<<<----
0.15.0
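The comparison above can be automated when many run logs need checking. Below is a rough sketch, assuming the logs keep the `SDK version:` line and the `---->>>> cuDF version <<<<----` banners in the form shown in this thread; `extract_versions` and `diff_versions` are hypothetical helper names, and the sample strings are copied from the logs above.

```python
import re


def extract_versions(log_text):
    """Collect package versions from a run log in the format seen above."""
    versions = {}
    # e.g. "azureml-core==1.15.0 azureml-dataprep==2.3.0."
    for name, ver in re.findall(r"(azureml-[\w-]+)==([\w.+]+)", log_text):
        versions[name] = ver.rstrip(".")  # drop the sentence-ending period
    # e.g. "---->>>> cuDF version <<<<----" followed by the version line
    for name, ver in re.findall(r">>>> (\w+) version <<<<-*\s+(\S+)", log_text):
        versions[name] = ver
    return versions


def diff_versions(a, b):
    """Return {package: (version_in_a, version_in_b)} where they differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


good_log = """SDK version: azureml-core==1.17.0 azureml-dataprep==2.4.2.
---->>>> cuDF version <<<<----
0.16.0a+1979.g2cda39b341
---->>>> cuML version <<<<----
0.16.0a+882.g5851f4140
"""

stuck_log = """SDK version: azureml-core==1.15.0 azureml-dataprep==2.3.0. Session id: 4ccb97c4
---->>>> cuDF version <<<<----
 0.15.0

---->>>> cuML version <<<<----
 0.15.0
"""

if __name__ == "__main__":
    for pkg, (good, stuck) in diff_versions(
        extract_versions(good_log), extract_versions(stuck_log)
    ).items():
        print(f"{pkg}: good={good} stuck={stuck}")
```

A version difference like this is only a correlation; as the later comments note, a run on the newer container also failed or timed out.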


zronaghi commented on June 9, 2024

Thanks for the update, so recent runs with RAPIDS 0.16 container haven't failed?


lostmygithubaccount commented on June 9, 2024

this one failed or timed out: https://github.com/Azure/azureml-examples/actions/runs/348928004


lostmygithubaccount commented on June 9, 2024

closing in favor of #284 - needs further investigation why runs are getting stuck, but this does not seem particular to rapids
