
Comments (5)

kb-0311 commented on September 4, 2024

@daw3rd I created an issue and am writing up a few follow-up experiments I performed that I think are very relevant to this problem.

from data-prep-kit.

kb-0311 commented on September 4, 2024
  1. I started a Ray cluster inside the virtual env by activating the venv in the tokenization/ray directory and running ray start --head, then added the --run_locally False flag to the make command and ran make run-cli-sample. That works: I am able to connect to the Ray cluster remotely. However, there is a problem with handling input file paths, and I get a new error:
(venv) [kanishka@ml-pipelines ray]$ make run-cli-sample
make RUN_FILE=tokenization_transform_ray.py \
                RUN_ARGS="--run_locally False --data_local_config \"{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}\"  \
                "  .transforms.run-src-file
make[1]: Entering directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
source venv/bin/activate;       \
cd src;                         \
python tokenization_transform_ray.py --run_locally False --data_local_config "{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}"                  
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
23:08:59 INFO - Launching Tokenization transform
23:08:59 INFO - connecting to existing cluster
23:08:59 INFO - data factory data_ is using local data access: input_folder - ../test-data/ds01/input output_folder - ../output
23:08:59 INFO - data factory data_ max_files -1, n_sample -1
23:08:59 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:08:59 INFO - pipeline id pipeline_id
23:08:59 INFO - code location None
23:08:59 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
23:08:59 INFO - actor creation delay 0
23:08:59 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}
23:08:59 INFO - Connecting to the existing Ray cluster
2024-07-02 23:08:59,421 INFO client_builder.py:244 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
SIGTERM handler is not set because current thread is not the main thread.
(orchestrate pid=510039) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
(orchestrate pid=510039) 23:09:04 INFO - orchestrator started at 2024-07-02 23:09:04
(orchestrate pid=510039) 23:09:04 ERROR - No input files to process - exiting
23:09:14 INFO - Completed execution in 0.24884503682454426 min, execution result 0
make[1]: Leaving directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
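A likely explanation for the "No input files to process" error above: when the job runs on a remote Ray cluster, relative paths such as ../test-data/ds01/input are resolved against the remote worker's working directory rather than the directory the launcher was started from. A minimal sketch of the idea (the helper name is hypothetical, not part of dpk):

```python
import os

def absolutize_local_config(cfg: dict) -> dict:
    """Return a copy of a data_local_config with all paths made absolute.

    Relative paths like '../test-data/ds01/input' are resolved against the
    current working directory of whichever process opens them; on a remote
    Ray worker that directory usually does not contain the test data, so the
    orchestrator finds nothing to process. Absolute paths only help if the
    remote nodes actually share the same filesystem (e.g. an NFS mount).
    """
    return {key: os.path.abspath(path) for key, path in cfg.items()}

cfg = absolutize_local_config(
    {"input_folder": "../test-data/ds01/input", "output_folder": "../output"}
)
```

Even with absolute paths, this only works when the cluster nodes can see the launcher's filesystem, which is rarely the case for a truly remote cluster.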
  2. I started a Ray cluster in the virtual env the same way (activating the venv in the tokenization/ray directory and running ray start --head), then tried make run-s3-sample to see whether MinIO could solve my file-path issue, changing the launcher params to "run_locally": False. I hit the same problem: I was able to connect to the cluster, but accessing the files failed:
(venv) [kanishka@ml-pipelines ray]$ make run-s3-sample
make .defaults.minio.verify-running
make[1]: Entering directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
make[1]: Leaving directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
make RUN_FILE=tokenization_s3_ray.py .transforms.run-src-file
make[1]: Entering directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
source venv/bin/activate;       \
cd src;                         \
python tokenization_s3_ray.py 
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
environ({'SHELL': '/bin/bash', 'COLORTERM': 'truecolor', 'HISTCONTROL': 'ignoredups', 'TERM_PROGRAM_VERSION': '1.90.2', 'HISTSIZE': '1000', 'HOSTNAME': 'ml-pipelines.sl.cloud9.ibm.com', 'MAKE_TERMOUT': '/dev/pts/3', 'HOMEBREW_PREFIX': '/home/linuxbrew/.linuxbrew', 'PWD': '/home/kanishka/work/data-prep-kit/transforms/universal/tokenization/ray/src', 'LOGNAME': 'kanishka', 'XDG_SESSION_TYPE': 'tty', 'MANPATH': '/home/linuxbrew/.linuxbrew/share/man:/home/linuxbrew/.linuxbrew/share/man::', 'MAKEOVERRIDES': '${-*-command-variables-*-}', 'VSCODE_GIT_ASKPASS_NODE': '/home/kanishka/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/node', 'MOTD_SHOWN': 'pam', 'HOME': '/home/kanishka', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;3
5:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36:', 'VIRTUAL_ENV': '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray/venv', 'SSL_CERT_DIR': '/etc/pki/tls/certs', 'RUN_FILE': 'tokenization_s3_ray.py', 'GIT_ASKPASS': '/home/kanishka/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/extensions/git/dist/askpass.sh', 'SSH_CONNECTION': '9.182.123.232 55827 9.202.254.95 22', 'MFLAGS': '-w', 'INFOPATH': '/home/linuxbrew/.linuxbrew/share/info:/home/linuxbrew/.linuxbrew/share/info:', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', 'XDG_SESSION_CLASS': 'user', 'MAKEFLAGS': 'w -- RUN_FILE=tokenization_s3_ray.py', 'SELINUX_ROLE_REQUESTED': '', 'TERM': 'xterm-256color', 'LESSOPEN': '||/usr/bin/lesspipe.sh %s', 'USER': 'kanishka', 'MAKE_TERMERR': '/dev/pts/3', 'VSCODE_GIT_IPC_HANDLE': '/run/user/6000/vscode-git-b3d771e5f0.sock', 'HOMEBREW_CELLAR': '/home/linuxbrew/.linuxbrew/Cellar', 'SELINUX_USE_CURRENT_RANGE': '', 'SHLVL': '3', 'MAKELEVEL': '2', 'HOMEBREW_REPOSITORY': '/home/linuxbrew/.linuxbrew/Homebrew', 'XDG_SESSION_ID': '358', 'VIRTUAL_ENV_PROMPT': '(venv) ', 'XDG_RUNTIME_DIR': '/run/user/6000', 'SSL_CERT_FILE': '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem', 'PS1': '(venv) ', 'SSH_CLIENT': '9.182.123.232 55827 22', 'PYENV_ROOT': '/home/kanishka/.pyenv', 'which_declare': 'declare -f', 'VSCODE_GIT_ASKPASS_MAIN': '/home/kanishka/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/extensions/git/dist/askpass-main.js', 'XDG_DATA_DIRS': '/home/kanishka/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share', 'BROWSER': '/home/kanishka/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/bin/helpers/browser.sh', 'PATH': 
'/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray/venv/bin:/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray/venv/bin:/home/kanishka/miniconda3/bin:/home/kanishka/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/bin/remote-cli:/home/kanishka/.pyenv/shims:/home/kanishka/.pyenv/bin:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:/home/kanishka/miniconda3/bin:/home/kanishka/.pyenv/bin:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:/home/kanishka/miniconda3/bin:/home/kanishka/.local/bin:/home/kanishka/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin', 'SELINUX_LEVEL_REQUESTED': '', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/6000/bus', 'MAIL': '/var/spool/mail/kanishka', 'OLDPWD': '/home/kanishka/work/data-prep-kit/transforms/universal/tokenization/ray', 'TERM_PROGRAM': 'vscode', 'VSCODE_IPC_HOOK_CLI': '/run/user/6000/vscode-ipc-ef6d1917-4b0a-4eeb-a8a0-6d909519031b.sock', 'BASH_FUNC_which%%': '() {  ( alias;\n eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@\n}', '_': '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray/venv/bin/python', 'RAY_CLIENT_MODE': '0'})
23:16:17 INFO - connecting to existing cluster
23:16:17 INFO - data factory data_ is using S3 data access: input path - test/tokenization/ds01/input, output path - test/tokenization/ds01/output
23:16:17 INFO - data factory data_ max_files -1, n_sample -1
23:16:17 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:16:17 INFO - pipeline id pipeline_id
23:16:17 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
23:16:17 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'max_restarts': -1}
23:16:17 INFO - actor creation delay 0
23:16:17 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}
23:16:17 INFO - Connecting to the existing Ray cluster
2024-07-02 23:16:17,296 INFO client_builder.py:244 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
SIGTERM handler is not set because current thread is not the main thread.
(orchestrate pid=510035) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
(orchestrate pid=510035) 23:16:23 INFO - orchestrator started at 2024-07-02 23:16:23
(orchestrate pid=510035) 23:16:23 ERROR - No input files to process - exiting
23:16:33 INFO - Completed execution in 0.2762212514877319 min, execution result 0
make[1]: Leaving directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'

You may want to stop the minio server now (see make help)

So I am guessing the handling of local files is the issue here. Let me know if there are any potential workarounds I could try.

  3. The last thing I tried was to deactivate the venv and run a Ray cluster locally on my machine outside the venv, using the same Ray version (v2.24.0) that dpk uses. I did not get the JobConfig error, but I still was not able to connect to it. Logs:
(venv) [kanishka@ml-pipelines ray]$ make run-cli-sample
make RUN_FILE=tokenization_transform_ray.py \
                RUN_ARGS="--run_locally False --data_local_config \"{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}\"  \
                "  .transforms.run-src-file
make[1]: Entering directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
source venv/bin/activate;       \
cd src;                         \
python tokenization_transform_ray.py --run_locally False --data_local_config "{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}"                  
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
22:56:37 INFO - Launching Tokenization transform
22:56:37 INFO - connecting to existing cluster
22:56:37 INFO - data factory data_ is using local data access: input_folder - ../test-data/ds01/input output_folder - ../output
22:56:37 INFO - data factory data_ max_files -1, n_sample -1
22:56:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
22:56:37 INFO - pipeline id pipeline_id
22:56:37 INFO - code location None
22:56:37 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
22:56:37 INFO - actor creation delay 0
22:56:37 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}
22:56:37 INFO - Connecting to the existing Ray cluster
2024-07-02 22:56:37,488 INFO client_builder.py:244 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
SIGTERM handler is not set because current thread is not the main thread.
Put failed:
22:56:41 INFO - Exception running ray remote orchestration
No module named 'data_processing_ray'
22:56:41 INFO - Completed execution in 0.06163370609283447 min, execution result 1
make[1]: *** [../../../../.make.defaults:374: .defaults.run-src-file] Error 1
make[1]: Leaving directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
make: *** [Makefile:43: run-cli-sample] Error 2
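The "No module named 'data_processing_ray'" failure suggests the cluster's Python environment does not have the dpk runtime packages installed: the Ray client ships the driver's closures to the cluster, but imported packages must exist there too. Ray's runtime_env mechanism can upload local packages; a sketch (the module paths below are placeholders, not real dpk install locations):

```python
# Sketch only: ask Ray to upload local packages to the remote cluster so
# that 'import data_processing_ray' succeeds on the workers.

def build_runtime_env(local_module_paths: list) -> dict:
    """Build a Ray runtime_env dict that uploads the given local packages
    to the cluster via the 'py_modules' field."""
    return {"py_modules": list(local_module_paths)}

env = build_runtime_env(
    ["/path/to/data_processing", "/path/to/data_processing_ray"]  # placeholders
)
# import ray
# ray.init("ray://<head-node-ip>:10001", runtime_env=env)
```

Note that the Ray client also requires matching Ray and Python versions on the client and the cluster, which lines up with the version-matching caveat raised later in this thread.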


kb-0311 commented on September 4, 2024

So the current problems in dpk are that:

  1. It is not possible to connect to a remote Ray cluster to execute a transform (a feature that is useful when a computationally large transform needs to run in a distributed environment).
  2. Data passage between local storage and a remote Ray runtime is not handled well (or maybe there are some config changes I am missing, in which case feel free to correct me :) ).


blublinsky commented on September 4, 2024

So the current problems in dpk are that:

  1. It is not possible to connect to a remote Ray cluster to execute a transform (a feature that is useful when a computationally large transform needs to run in a distributed environment).
  2. Data passage between local storage and a remote Ray runtime is not handled well (or maybe there are some config changes I am missing, in which case feel free to correct me :) ).
  1. It is absolutely possible to connect to a remote Ray cluster and submit a job - we are currently doing it through KFP using Ray remote jobs. We are not advertising this feature externally because it is quite error-prone if not done correctly. It requires the same versions of both Ray and Python, and the presence, on the Ray cluster, of the code that has to be executed. The reason we provide the KFP wrapper is to shield the user from these details, which can be hard to debug.
  2. If you are using a remote cluster, there is no guarantee that your local data is accessible from it, so there was never support for this. If you are using a remote cluster, you should use externally accessible storage, for example S3.
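Following that advice, an S3/MinIO setup amounts to handing the launcher credential and path dictionaries instead of local folders. A sketch matching the paths in the log output above (the key names, endpoint, and credentials here are assumptions for illustration, not a documented dpk API):

```python
def make_s3_sample_config(endpoint_url, access_key, secret_key,
                          input_path, output_path):
    """Assemble credential and path dicts in the shape the dpk S3 samples
    appear to use. All key names below are assumptions."""
    s3_cred = {
        "access_key": access_key,
        "secret_key": secret_key,
        "url": endpoint_url,  # the storage endpoint, reachable from the cluster
    }
    s3_conf = {
        "input_folder": input_path,
        "output_folder": output_path,
    }
    return s3_cred, s3_conf

cred, conf = make_s3_sample_config(
    "http://localhost:9000",   # typical local MinIO endpoint (assumption)
    "minio", "minio123",       # placeholder credentials
    "test/tokenization/ds01/input",
    "test/tokenization/ds01/output",
)
```

The key point is that the endpoint must be reachable from the cluster nodes, not just from the launcher machine - a MinIO bound to localhost on the laptop would reproduce the same "No input files to process" failure remotely.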


blublinsky commented on September 4, 2024

Can we please close this?

