Comments (5)

lostmygithubaccount commented on June 9, 2024

@zronaghi it currently runs on PRs and every 2 hours; you can see the test history here: https://github.com/Azure/azureml-examples/actions?query=workflow%3Arun-tutorial-ur

this has recovered since I manually cancelled the stuck runs this morning, so I'll consider it resolved for now. If it happens again, I'll reopen and follow up with the AML team; again, I think this is likely an issue on our end

from azureml-examples.

lostmygithubaccount commented on June 9, 2024

@zronaghi the train multi GPU example started failing overnight - any obvious reason why?

lostmygithubaccount commented on June 9, 2024

so it looks like two of the runs got stuck: one had been running for ~60 hours and the other for ~6. Together they occupied the only 2 NC24S_V3 nodes on the test cluster, which blocked all other runs and caused the failure

I do not know why the runs got stuck - most likely it is an AML issue, not anything in the notebook

@akshaya-a can I set a max experiment timeout somehow?
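
For illustration only, here is a minimal sketch of the kind of timeout check asked about above: given a list of runs, flag the ones that are still marked as running after a maximum allowed duration. The `RunInfo` record, the `find_stuck_runs` helper, the run ids, and the 2-hour limit are all hypothetical and not part of the Azure ML SDK; in practice each flagged run could then be cancelled, for example with the SDK's `Run.cancel()`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class RunInfo:
    """Hypothetical stand-in for a run record; not an Azure ML SDK type."""
    run_id: str
    status: str           # e.g. "Running" or "Completed"
    started_at: datetime  # timezone-aware start time


def find_stuck_runs(
    runs: List[RunInfo], now: datetime, max_duration: timedelta
) -> List[RunInfo]:
    """Return runs still marked Running for longer than max_duration."""
    return [
        r for r in runs
        if r.status == "Running" and now - r.started_at > max_duration
    ]


# Example data loosely mirroring the situation described in this thread:
# one run active for ~60 hours, one for ~6 hours, plus two healthy ones.
now = datetime(2020, 10, 22, 14, 0, tzinfo=timezone.utc)
runs = [
    RunInfo("run-a", "Running", now - timedelta(hours=60)),
    RunInfo("run-b", "Running", now - timedelta(hours=6)),
    RunInfo("run-c", "Running", now - timedelta(minutes=20)),
    RunInfo("run-d", "Completed", now - timedelta(hours=70)),
]

stuck = find_stuck_runs(runs, now, max_duration=timedelta(hours=2))
print([r.run_id for r in stuck])  # prints ['run-a', 'run-b']
```

A check like this only detects long-running jobs after the fact; a limit enforced by the service at submission time, if one is available, would free the nodes without a separate watchdog.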

lostmygithubaccount commented on June 9, 2024

the logs from the 60 hour run are:

2020/10/20 02:08:40 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/info
2020/10/20 02:08:40 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
[2020-10-20T02:08:42.931450] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError', 'UserExceptions:context_managers.UserExceptions'], invocation=['train.py', '--data_dir', 'DatasetConsumptionConfig:input_52cc0296', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1'])
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 241
Set Dataset input_52cc0296's target path to /tmp/tmpm_f25xjc
Enter __enter__ of DatasetContextManager
SDK version: azureml-core==1.15.0 azureml-dataprep==2.3.0. Session id: 1473de45-925b-4fae-bb0d-39c1f3bdc076. Run id: rapids-airline-multi-example_1603159691_94beb9b3.
Processing 'input_52cc0296'.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Processing dataset FileDataset
{
  "source": [
    "https://airlinedataset.blob.core.windows.net/airline-10years/*"
  ],
  "definition": [
    "GetFiles"
  ],
  "registration": {
    "id": "a13fb474-eb17-43e7-9432-da7a83bef4b3",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')"
  }
}
Mounting input_52cc0296 to /tmp/tmpm_f25xjc.
Globalization is not supported. Running in culture invariant mode (DOTNET_SYSTEM_GLOBALIZATION_INVARIANT='true').
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Could not refresh EngineServer credentials in rslex: No Lariat Runtime Environment is active, please initialize an Environment.
Mounted input_52cc0296 to /tmp/tmpm_f25xjc as folder.
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/default/azureml/rapids-airline-multi-example_1603159691_94beb9b3/mounts/workspaceblobstore/azureml/rapids-airline-multi-example_1603159691_94beb9b3
Preparing to call script [ train.py ] with arguments: ['--data_dir', '$input_52cc0296', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']
After variable expansion, calling script [ train.py ] with arguments: ['--data_dir', '/tmp/tmpm_f25xjc', '--n_bins', '32', '--compute', 'multi-GPU', '--cv-folds', '1']

Script type = None

---->>>> cuDF version <<<<----
 0.15.0

---->>>> cuML version <<<<----
 0.15.0

> RapidsCloudML
	Compute, Data , Model, Cloud types ('multi-GPU', 'Parquet', 'RandomForest', 'Azure')

	Multi-GPU selected

	Client information <Client: 'tcp://127.0.0.1:37255' processes=4 threads=4, memory=473.42 GB>
multi-GPU

> Loading dataset from /tmp/tmpm_f25xjc/part*.parquet

	GPU read

	Reading using dask_cudf
cudf/utilities/bit.hpp(19): warning: cassert: [jitify] File not found
cudf/fixed_point/fixed_point.hpp(27): warning: cassert: [jitify] File not found

	Ingestion completed in 3.451511531999131

	Dataset descriptors: (Delayed('int-4123cf11-6492-4800-b2e9-ef18cdba67ca'), 15)
	Flight_Number_Reporting_Airline    float32
Year                               float32
Quarter                            float32
Month                              float32
DayOfWeek                          float32
DOT_ID_Reporting_Airline           float32
OriginCityMarketID                 float32
DestCityMarketID                   float32
DepTime                            float32
DepDelay                           float32
DepDel15                           float32
ArrDel15                             int32
ArrDelay                           float32
AirTime                            float32
Distance                           float32
dtype: object

---->>>> Training using GPUs <<<<----


 CV fold 0 of 1


> Splitting train and test data

	X_train shape and type(Delayed('int-c2cdbdd5-99d6-4059-aca3-410840dc702d'), 13) <class 'dask_cudf.core.DataFrame'>

	Split completed in 0.004746472997794626

> Training RandomForest estimator w/ hyper-params

	Fitting multi-GPU daskRF
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs
/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/ensemble/randomforestclassifier.py:158: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
  **kwargs

zronaghi commented on June 9, 2024

@lostmygithubaccount did the issue get resolved? How frequently do you run these tests?

I'll also be updating the container to the latest RAPIDS release (0.16) after running some tests in the next week or so.
