
vsoch commented on May 31, 2024

Could you check that your params.sh has the $RESOURCE variable defined? This was newly added, so if you didn't update it (using setup.sh), the variable would be blank and the command would look odd, like:

ssh squeue ...

instead of

ssh sherlock squeue ...

Where "sherlock" is defined in $RESOURCE
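
A minimal sketch of an early check for this, assuming a hypothetical helper named require_var (it is not part of the forward scripts). It fails with a message when a variable that params.sh should define is empty, rather than letting a malformed ssh command run:

```shell
# Hypothetical helper (not part of forward): fail early when a variable
# that params.sh is expected to define is empty or unset.
require_var() {
    local var_name="$1"   # the NAME of the variable, e.g. RESOURCE
    local value
    # Look up the variable whose name is stored in var_name
    eval "value=\${${var_name}:-}"
    if [ -z "$value" ]; then
        echo "params.sh: ${var_name} is empty; try re-running setup.sh" >&2
        return 1
    fi
}
```

In a script, this could be called right after sourcing params.sh, for example `require_var RESOURCE || exit 1`.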

from forward.

royzawadzki commented on May 31, 2024
$ cat params.sh
USERNAME="rzawadzk"
PORT="56432"
PARTITION="manishad"
RESOURCE="sherlock"
MEM="20G"
TIME="8:00:00"


vsoch commented on May 31, 2024

okay let me test this out for you! Could you show me the full command you use to start the node? Then I'll see if I can reproduce and offer a fix.


royzawadzki commented on May 31, 2024

Here's the process:

  1. Started the node with bash start.sh sherlock/py3-jupyter
  2. After a certain amount of time, put my laptop to sleep
  3. Attempted to resume the session using bash resume.sh sherlock/py3-jupyter
  4. Got the output about ssh


vsoch commented on May 31, 2024

hey @royzawadzki ! I think we are all fixed now, and I've added an echo of the full command for a sanity check for you in the future!

The issue was that I had (likely by backspacing) removed the line that collects the first argument, so instead of this:

$ bash resume.sh sherlock/py3-jupyter
ssh sherlock squeue --name=sherlock/py3-jupyter --user=vsochat -o %N -h

(notice the --name)

We were doing this

$ bash resume.sh sherlock/py3-jupyter
ssh sherlock squeue --name= --user=vsochat -o %N -h

Derp :P Here is the fix:

d7396d3
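
As a sketch only (this is not what the commit above contains), a guard like the following would refuse to build the squeue command when the job name is missing. The function name build_squeue_cmd is hypothetical; RESOURCE and USERNAME are assumed to come from params.sh:

```shell
# Hypothetical helper: print the squeue command used to look up the node,
# but refuse to build it when no job name was given.
build_squeue_cmd() {
    local name="$1"
    if [ -z "$name" ]; then
        echo "usage: bash resume.sh <job-name>, e.g. sherlock/py3-jupyter" >&2
        return 1
    fi
    # RESOURCE and USERNAME are expected to be set by sourcing params.sh
    echo "ssh ${RESOURCE} squeue --name=${name} --user=${USERNAME} -o %N -h"
}
```

With RESOURCE="sherlock" and USERNAME="vsochat", calling `build_squeue_cmd sherlock/py3-jupyter` prints the first command shown above, and calling it with an empty name returns a non-zero status instead of producing `--name=`.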

Would you like to pull and test? Remember that the forward will have to be completely dead - if your computer doesn't sleep, for example, and you issue the ssh forward again it will tell you the port is in use (this is what I just did).

Let me know if that works!


royzawadzki commented on May 31, 2024

Hi, I tried testing out the updated resume.sh (after pulling, of course) and the issue still seems to persist. I first ran start.sh to confirm that I had a node running. After I got the expected message saying that I had a node running, I ran into the same issue:

forward royzawadzki$ bash resume.sh sherlock/py3-jupyter
ssh sherlock squeue --name=sherlock/py3-jupyter --user=rzawadzk -o %N -h
forward royzawadzki$ usage: ssh [-1246AaCfGgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
           [-D [bind_address:]port] [-E log_file] [-e escape_char]
           [-F configfile] [-I pkcs11] [-i identity_file]
           [-J [user@]host[:port]] [-L address] [-l login_name] [-m mac_spec]
           [-O ctl_cmd] [-o option] [-p port] [-Q query_option] [-R address]
           [-S ctl_path] [-W host:port] [-w local_tun[:remote_tun]]
           [user@]hostname [command]

What should be the expected output after running `resume.sh`?


vsoch commented on May 31, 2024

If you pull from master, the file is exactly here: https://github.com/vsoch/forward/blob/master/resume.sh and the difference is that the fixed version has NAME=${1} whereas the bugged version (where I reproduced the above error) does not.
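
One possible explanation for an ssh usage message like the one above, stated as an assumption rather than a confirmed diagnosis: if squeue prints no node name (for example, because no job with that name is running), MACHINE is empty and the inner ssh is called without a hostname. A sketch of a guard, using a hypothetical helper named forward_cmd:

```shell
# Hypothetical helper: print the port-forwarding command only when a node
# name is known; an empty node name would leave the inner ssh without a host.
forward_cmd() {
    local machine="$1"    # node name reported by squeue, e.g. sh-02-24
    local port="$2"       # PORT from params.sh
    local resource="$3"   # RESOURCE from params.sh
    if [ -z "$machine" ]; then
        echo "No node found for this job; is it still running?" >&2
        return 1
    fi
    echo "ssh -L ${port}:localhost:${port} ${resource} ssh -L ${port}:localhost:${port} -N ${machine}"
}
```

The function only prints the command; running it (and backgrounding it with `&`) is left to the caller.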


vsoch commented on May 31, 2024

Here is a complete log of the output for what I did:

# use end.sh to kill any old jobs
$ bash end.sh sherlock/py3-jupyter
Killing sherlock/py3-jupyter slurm job on sherlock
Killing listeners on sherlock

Now look at resume.sh and confirm that you see this; specifically, the line NAME="${1}" should be there:

$ cat resume.sh 
#!/bin/bash
#
# Resumes an already running remote sbatch job.
# Sample usage: bash resume.sh

if [ ! -f params.sh ]
then
    echo "Need to configure params before first run, run setup.sh!"
    exit
fi
source params.sh

NAME="${1}"

# The user is required to specify port

echo "ssh ${RESOURCE} squeue --name=$NAME --user=$USERNAME -o "%N" -h"
MACHINE=`ssh ${RESOURCE} squeue --name=$NAME --user=$USERNAME -o "%N" -h`
ssh -L $PORT:localhost:$PORT ${RESOURCE} ssh -L $PORT:localhost:$PORT -N $MACHINE &

And then run start.sh. Make sure your output looks like this, and tell me if there are differences:

$ bash start.sh sherlock/py3-jupyter
== Finding Script ==
Looking for sbatches/sherlock/sherlock/py3-jupyter.sbatch
Looking for sbatches/sherlock/py3-jupyter.sbatch
Script      sbatches/sherlock/py3-jupyter.sbatch

== Checking for previous notebook ==
No existing sherlock/py3-jupyter jobs found, continuing...

== Getting destination directory ==

== Uploading sbatch script ==
py3-jupyter.sbatch                            100%  146     0.1KB/s   00:00    

== Submitting sbatch ==
sbatch --job-name=sherlock/py3-jupyter --partition=russpold --output=/home/users/vsochat/forward-util/py3-jupyter.sbatch.out --error=/home/users/vsochat/forward-util/py3-jupyter.sbatch.err --mem=20G --time=8:00:00 /home/users/vsochat/forward-util/py3-jupyter.sbatch 43453 ""
Submitted batch job 23721432

== View logs in separate terminal ==
ssh sherlock cat /home/users/vsochat/forward-util/py3-jupyter.sbatch.out
ssh sherlock cat /home/users/vsochat/forward-util/py3-jupyter.sbatch.err

== Waiting for job to start, using exponential backoff ==
Attempt 0: not ready yet... retrying in 1..
Attempt 1: not ready yet... retrying in 2..
Attempt 2: not ready yet... retrying in 4..
Attempt 3: not ready yet... retrying in 8..
Attempt 4: not ready yet... retrying in 16..
Attempt 5: not ready yet... retrying in 32..
Attempt 6: not ready yet... retrying in 64..
Attempt 7: resources allocated to sh-02-24!..
sh-02-24
sh-02-24
notebook running on sh-02-24

== Setting up port forwarding ==
ssh -L 43453:localhost:43453 sherlock ssh -L 43453:localhost:43453 -N sh-02-24 &
== Connecting to notebook ==
[I 22:10:18.139 NotebookApp] Writing notebook server cookie secret to /tmp/jupyter/notebook_cookie_secret
[I 22:10:19.757 NotebookApp] Serving notebooks from local directory: /scratch/users/vsochat
[I 22:10:19.758 NotebookApp] 0 active kernels 
[I 22:10:19.758 NotebookApp] The Jupyter Notebook is running at: http://localhost:43453/
[I 22:10:19.758 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).


== View logs in separate terminal ==
ssh sherlock cat /home/users/vsochat/forward-util/py3-jupyter.sbatch.out
ssh sherlock cat /home/users/vsochat/forward-util/py3-jupyter.sbatch.err

== Instructions ==
1. Password, output, and error printed to this terminal? Look at logs (see instruction above)
2. Browser: http://sh-02-21.int:43453/ -> http://localhost:43453/...
3. To end session: bash end.sh sherlock/py3-jupyter

Okay, see the instructions above? Let's follow them! First let's confirm that it actually worked - here I am opening my browser to the address it told me:

![image](https://user-images.githubusercontent.com/814322/44249202-da1c2000-a1bc-11e8-8eb0-0deb93a6dff0.png)
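
Stepping back to the "Waiting for job to start, using exponential backoff" part of the log above: the idea can be sketched as follows. This is an illustration, not the code in start.sh; the function name wait_with_backoff and the SLEEP_CMD variable (which lets the wait be replaced for a dry run) are assumptions.

```shell
# Hypothetical sketch: retry a check, doubling the wait after each failure.
# SLEEP_CMD is an illustrative knob; it defaults to the real sleep command.
wait_with_backoff() {
    local check_cmd="$1"      # command that succeeds once the job is ready
    local max_attempts="$2"   # give up after this many attempts
    local delay=1
    local attempt=0
    while [ "$attempt" -lt "$max_attempts" ]; do
        if $check_cmd; then
            echo "Attempt ${attempt}: ready"
            return 0
        fi
        echo "Attempt ${attempt}: not ready yet... retrying in ${delay}.."
        ${SLEEP_CMD:-sleep} "$delay"
        delay=$((delay * 2))          # 1, 2, 4, 8, ...
        attempt=$((attempt + 1))
    done
    return 1
}
```

Doubling the delay keeps the number of polls small when the queue is slow, at the cost of reacting a little later once the job does start.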

The password would be the one I set on sherlock beforehand (let me know if you did not do this). Now, it wouldn't be appropriate to use resume.sh here because the port is already forwarded, but for kicks and giggles I'll show you what it does when run unnecessarily:

```bash
$ bash resume.sh sherlock/py3-jupyter
ssh sherlock squeue --name=sherlock/py3-jupyter --user=vsochat -o %N -h
vanessa@vanessa-ThinkPad-T460s:~/Documents/Dropbox/Code/srcc/forward$ bind: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: 43453
Could not request local forwarding.
```

It's already open. So now to test your situation - we want to close the port (but have the node still running). We can do this:

source params.sh
echo $PORT
43456
ssh sherlock "/usr/sbin/lsof -i :$PORT -t | xargs --no-run-if-empty kill"

Now go to browser to confirm port isn't open anymore:

image

now we can try resume.sh

$ bash resume.sh sherlock/py3-jupyter
ssh sherlock squeue --name=sherlock/py3-jupyter --user=vsochat -o %N -h

To directly answer your question - there isn't any output. The return value is 0

echo $?
0

(this means success) and the port is open again

image


royzawadzki commented on May 31, 2024

I was able to successfully resume the session using the steps you provided, but all of my folders are gone. Why did this happen?

Another question: how do you know when it's appropriate to resume the session? My usual situation is that I leave the browser tab open and close my laptop. When I open it again, I try to run resume.sh. Is this a correct situation to use it?


vsoch commented on May 31, 2024

What do you mean your folders are gone? Which folders, where, and when were they there (and when did they go missing)?

Resume isn't used often - ONLY if your ssh forward accidentally closed, but the node on sherlock is still running! So if you go to the url and the notebook is gone, and you look at squeue on sherlock and see it running, that's when you would use resume. Otherwise you just open the tab and it should still be there :)
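
The rule above can be sketched as a small decision helper. The function name next_step and its two inputs are assumptions for illustration; the "start" branch (no job running at all) is an inference from how start.sh is used in this thread rather than something stated above.

```shell
# Hypothetical helper: decide what to do from two observations.
#   $1: node name printed by squeue for the job (empty if no job is running)
#   $2: "yes" if the notebook URL on localhost still responds, otherwise "no"
next_step() {
    local node="$1"
    local forward_up="$2"
    if [ -z "$node" ]; then
        echo "start"     # nothing running on the cluster: use start.sh
    elif [ "$forward_up" = "yes" ]; then
        echo "nothing"   # forward still alive: just reopen the browser tab
    else
        echo "resume"    # job running but forward gone: use resume.sh
    fi
}
```

For example, `next_step sh-02-24 no` prints `resume`, which is the only case where resume.sh is the right tool.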


royzawadzki commented on May 31, 2024

Regarding the folders: when I run start.sh, navigate to the proper website, and enter my password, all the directories that were on sherlock usually appear at the Jupyter Hub landing page. When I reproduced your steps and ran resume.sh, none of those folders appeared on the page. Not sure what's happening here.

Another question: how do I view the squeue? I'm asking because I've been in a situation where I run a very resource-intensive cell that takes 20 minutes to run (e.g. fitting a model) and I close my laptop to go do other stuff. When I open my laptop again after a few hours, the cell is still running, but it should definitely have completed by then. So I thought maybe this could be because the session was interrupted and had to resume, or something of that sort.

Thanks for your continued help!


vsoch commented on May 31, 2024

> Regarding the folders, when I run start.sh, navigate to the proper website, and then enter my password. All the directories that were on sherlock usually appear at the Jupyter Hub landing page. When I reproduced your steps and ran resume.sh, none of those folders appeared on the page. Not sure what's happening here.

Don't forget that you can set the default working directory via the script; I believe it defaults to scratch if you aren't specifying it. That is the content you should see.

> Another question, how do I view the squeue?

From your computer you could do

ssh sherlock squeue -u vsochat

and of course put your username instead :)

> I'm asking because I've been in a situation where I run a very resource intensive cell that takes 20 minutes to run (e.g. fitting a model) and I close my laptop to go do other stuff. When I open my laptop back after a few hours, the cell is still running, but it should definitely be completed by then. So I thought maybe this could be because the session was interrupted and had to resume or something of that sort.

If you are running in a notebook, you shouldn't close your laptop. If you close your laptop, you are technically closing the notebook and probably losing things. If you want to run things headless like this, run them via scripts on sherlock (or submit them from a notebook or similar).

> Thanks for your continued help!


royzawadzki commented on May 31, 2024

Thanks for the clarification. So when I run my notebooks, provided I keep my laptop open, will the variables stay in memory until I run end.sh?


vsoch commented on May 31, 2024

The variables are only defined in the terminal where you've exported them! They are re-sourced each time you run a script, because the script sources the file params.sh.
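
A small sketch of that scoping behavior, using a throwaway file in place of params.sh. The function name demo_source_scope and the variable DEMO_PORT are made up for illustration:

```shell
# Hypothetical demo: a variable set by sourcing a file exists only in the
# shell that sourced it.
demo_source_scope() {
    local tmp
    tmp=$(mktemp)
    echo 'DEMO_PORT="56432"' > "$tmp"
    # Source the file inside a subshell: DEMO_PORT is defined only there
    ( . "$tmp"; echo "inside: ${DEMO_PORT}" )
    # Back in the calling shell, the variable was never defined
    echo "outside: ${DEMO_PORT:-unset}"
    rm -f "$tmp"
}
```

Provided DEMO_PORT is not already set in your environment, this prints `inside: 56432` followed by `outside: unset`.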


royzawadzki commented on May 31, 2024

Sorry, I meant the variables within the notebook (i.e. whether the kernel will still be running).


vsoch commented on May 31, 2024

I haven't tried this so my answer isn't great, but I would think the processes go to sleep. Poking around, it probably even varies based on the OS. In my own experience (Ubuntu and general Python), if I close the lid (or the equivalent) it goes to sleep and picks up where it left off when I re-open it. https://apple.stackexchange.com/questions/278185/do-terminal-processes-stop-if-mac-sleeps

