databrickslabs / jupyterlab-integration

DEPRECATED: Integrating Jupyter with Databricks via SSH

License: Other

Makefile 0.83% Python 40.26% TypeScript 3.30% Batchfile 0.21% CSS 0.11% HTML 54.67% Shell 0.18% Dockerfile 0.16% JavaScript 0.26%

databricks databricks-api databricks-deploy jupyter jupyter-notebook

jupyterlab-integration's Issues

Would using the Windows Subsystem for Linux be a potential workaround for the lack of Windows support?

As mentioned in the documentation, only macOS and Linux are supported; Windows currently is not. What about the Windows Subsystem for Linux?

Running in JupyterHub inside a Docker image

I can see this being useful when run from JupyterHub, which is a multi-tenant Jupyter service. For example, a user runs a local notebook and sends a job to a remote cluster. I would like the Databricks cluster to start when the job is sent rather than run continuously, and then shut down after the cluster's inactivity timeout.

In our case we run JupyterHub in AKS and Azure Databricks in its own VNet.

I tried to create a Docker image as an extension of one of the default Jupyter images:

FROM jupyter/datascience-notebook:latest
ENV BASH_ENV ~/.bashrc
RUN conda create -n db-jlab
RUN echo "source activate db-jlab" > ~/.bashrc
ENV PATH /opt/conda/envs/env/bin:$PATH
RUN pip install --upgrade databrickslabs-jupyterlab
RUN databrickslabs-jupyterlab -b
RUN pip install databricks-cli
USER $NB_USER

I am still testing.
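The start-on-demand behaviour asked for above could, in principle, be driven through the Databricks Clusters REST API (`POST /api/2.0/clusters/start`), with shutdown left to the cluster's own auto-termination setting. The following is only a minimal sketch under that assumption: it builds the request without sending it, and the host, token, and cluster ID are hypothetical placeholders. Whether the tool already does something equivalent is not stated in this issue.

```python
import json
import urllib.request


def build_start_request(host, token, cluster_id):
    """Build (but do not send) a Clusters API 2.0 'start' request."""
    url = host.rstrip("/") + "/api/2.0/clusters/start"
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
req = build_start_request(
    "https://example.azuredatabricks.net/", "dapiXXXX", "0000-000000-example0"
)
# Sending is a separate, explicit step:
# urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes the call easy to inspect before it touches a real workspace.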

Cluster not reachable exception

First off, I just want to say I think this tool is great and generally works flawlessly. I'm running into an issue where I sporadically receive a "Cluster Unreachable" exception that prompts me to restart the cluster. For long-running jobs this can be annoying, since it forces me to restart the cluster and then kick off the job again. Any ideas why this is happening? It happens even in the middle of interactive work, when my local machine is active and (in theory) the SSH tunnel is stable, although I haven't tested network disruptions and the like.

Here's the pop up that will surface:

Any help is much appreciated.

Bug: jupyter_notebook_config.py is overwritten incorrectly when it contains multiline values (e.g. dicts)

If jupyter_notebook_config.py contains a multiline dict such as

c.NotebookApp.tornado_settings={
  'headers': {
    'SOME_HEADER': 'SOME_VALUE'
  }
}

then databrickslabs_jupyterlab.local.write_config splits the first line by "=" which results in

['c.NotebookApp.tornado_settings', '{']

The following lines are simply ignored because they contain no "=" symbol. As a result, the correct c.NotebookApp.tornado_settings value is overwritten by c.NotebookApp.tornado_settings={, which breaks the config because only the opening brace is left.

A workaround is to flatten the config lines into one, but that can make the config unreadable if there are many lines. So it would make sense to change the write_config function so that it can handle such cases.

I think I can fix it and create a pull request.
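To make the failure mode above concrete, here is a small sketch. `naive_parse` only imitates the line-by-line split described in this issue; it is not the actual `write_config` code. `statement_aware_parse` is one possible alternative, assumed here for illustration, that uses Python's `ast` module to recover each assignment with its full multiline value.

```python
import ast


def naive_parse(text):
    """Imitate splitting each line on '=' (the behaviour described above)."""
    result = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def statement_aware_parse(text):
    """Map each single-target assignment to its complete source text."""
    result = {}
    for node in ast.parse(text).body:
        if isinstance(node, ast.Assign) and len(node.targets) == 1:
            key = ast.get_source_segment(text, node.targets[0])
            result[key] = ast.get_source_segment(text, node.value)
    return result


config = (
    "c.NotebookApp.tornado_settings={\n"
    "  'headers': {\n"
    "    'SOME_HEADER': 'SOME_VALUE'\n"
    "  }\n"
    "}\n"
    "c.NotebookApp.port = 8888\n"
)

broken = naive_parse(config)           # value is just the opening brace
intact = statement_aware_parse(config)  # value spans all five lines
```

`ast.parse` only parses the file and never executes it, so the undefined name `c` is not a problem. `ast.get_source_segment` requires Python 3.8 or later.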

I successfully connect to my cluster in JupyterLab, but when I run test code the notebook does not respond

Hi Databricks team,

I successfully connect to my cluster in JupyterLab, but when I try to run some test code there is no response. I am sure I can SSH to my cluster. How do I fix this?

Support any virtual env manager

The package relies on conda, but conda is very heavyweight and not everyone is a conda fan.

I believe this is a lower-priority issue, but it would be great if we could use any virtual env manager we want, such as Poetry.

local.py line 125 IndexError: list index out of range

Hello all, I am trying to set this up on Azure Databricks following the instructions and get this error:

databrickslabs-jupyterlab eastus -k -i 0426-155413-ring996
Traceback (most recent call last):
  File "/data/anaconda/envs/db-jlab/bin/databrickslabs-jupyterlab", line 171, in <module>
    version = conda_version()
  File "/data/anaconda/envs/db-jlab/lib/python3.6/site-packages/databrickslabs_jupyterlab/local.py", line 125, in conda_version
    return result["stdout"].strip().split(" ")[1]
IndexError: list index out of range

I am able to SSH into the master with no problem.
The -p option also lists the profile:

databrickslabs-jupyterlab -p

PROFILE              HOST                                                         SSH KEY
eastus               https://eastus.azuredatabricks.net/?o=3573392022285404       OK

The Databricks CLI connects to the cluster successfully:

databricks clusters list --profile eastus
0426-155413-ring996   gpu  RUNNING
0404-233454-navel281  std  TERMINATED

Your help is much appreciated.
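The traceback above points at `result["stdout"].strip().split(" ")[1]`. That expression raises IndexError whenever the output has no second space-separated token, for example when stdout is empty; whether that was the case here is not confirmed. A more defensive parse might look like the sketch below. The function name and the stderr fallback are assumptions for illustration, not the library's code.

```python
import re


def parse_conda_version(stdout, stderr=""):
    """Return the version from 'conda X.Y.Z' output, or None if absent."""
    for stream in (stdout, stderr):
        match = re.search(r"\bconda\s+(\d+(?:\.\d+)+)", stream or "")
        if match:
            return match.group(1)
    return None


# Why the original expression fails on empty output:
tokens = "".strip().split(" ")  # a single empty token, so [1] is out of range

found = parse_conda_version("conda 4.8.0\n")
from_stderr = parse_conda_version("", "conda 4.5.11\n")
missing = parse_conda_version("")
```

Returning None instead of raising lets the caller print a clear message about the unexpected `conda --version` output.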

DBR 7.3 version support

The README mentions support for DBR 7.0 beta. Is 7.3 ML LTS also supported?

Upgrade notebook to latest version 6.2.0

Is it possible to upgrade notebook to the latest version, 6.2.0?

The version of jupyter/notebook currently used (notebook==6.0.3) has a problem that is addressed by the jupyter/notebook PR "Ensure that cell ids persist after save" and fixed in notebook 6.2.0. Without the fix, every save creates a new ID for each cell, which makes reviewing notebooks difficult.

I wanted to check whether there is any possibility of upgrading notebook to 6.2.0. Your help will be appreciated!

Scala Question

Can you please clarify how the notebook experience would work if I used scala?
I've read the following, and had follow-up questions...

https://github.com/databrickslabs/jupyterlab-integration/blob/master/docs/v2/how-it-works.md

Based on my understanding of that article, the Scala kernel, a JVM, would never run locally on my workstation. Is that correct? It sounds like everything I do in each cell is proxied to the remote cluster, including any logic that would otherwise be executed on the Spark driver.

I am pretty excited by your demo, that I saw here:
https://github.com/databrickslabs/jupyterlab-integration/blob/master/docs/v2/news/scala-magic.md

My concern is that if the Scala kernel never runs on the local machine, it will be difficult to achieve a rich Scala development experience within Jupyter. I think you have highlighted some of the limitations already. As of now I've been using almond-sh ( https://almond.sh/ ) as my Scala kernel in Jupyter, and it sounds like this jupyterlab-integration experience would be very different.

Please let me know. I'm very eager to develop scala notebooks in Jupyterlab that will interact with a remote databricks cluster (via db-connect). It seems like a good combination to use jupyterlab for development, along with a remote cluster that I don't need to manage myself.

databrickslabs-jupyterlab configuration breaks for conda 4.8.0

The installed conda version is the latest (4.8.0, i.e. greater than 4.7.5), yet databrickslabs-jupyterlab still complains that it is too old.

(dbconnect) ~ λ databrickslabs-jupyterlab $PROFILE -s -i $CLUSTER_ID
Too old conda version:
Please update conda to at least 4.7.5
(dbconnect) ~ λ conda --version
conda 4.8.0
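Note that the "Too old conda version:" line above prints nothing after the colon, which suggests the detected version string may have been empty; the root cause is not confirmed in this issue. As a sketch of a check that treats an unparseable version as a separate case and compares versions numerically rather than as strings, one could write something like the following (all names are hypothetical):

```python
import re


def version_tuple(text):
    """Extract the first dotted number as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)*", text or "")
    if not match:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def is_supported(version_text, minimum="4.7.5"):
    """True only if a version was found and it is at least `minimum`."""
    found = version_tuple(version_text)
    return found is not None and found >= version_tuple(minimum)


ok = is_supported("conda 4.8.0")
too_old = is_supported("4.7.4")
double_digit = is_supported("4.10.0")  # a plain string comparison would say "too old"
unparseable = is_supported("")
```

Comparing integer tuples avoids the pitfall where "4.10.0" sorts before "4.7.5" as text.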

dbutils.library support

Hi,
Are there any plans to add support for the dbutils.library module? Right now a simple dbutils.library.help("install") produces an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-28186010f9f4> in <module>
----> 1 dbutils.library.help("install")

AttributeError: 'DbjlUtils' object has no attribute 'library'

In the ML runtime there is also a great %pip magic: https://docs.databricks.com/notebooks/notebooks-python-libraries.html#enable-pip-and-conda-magic-commands
It installs libraries on both the driver and executor nodes.
In contrast, running %pip install inside a JupyterLab notebook connected to a Databricks cluster installs libraries only on the driver node. That makes it unusable for UDFs, because the executors need the same libraries.
Could you suggest a workaround? Or are there plans to bring such support to jupyterlab-integration?

Is there any way to install notebook-scoped libraries interactively without init scripts?

Thanks in advance

Read local files from notebooks connecting to Databricks via SSH

Hello,
I was wondering whether there is any way for notebooks connected to Databricks via SSH to read files on the local machine.

I have a YAML file and a notebook on my local machine. I opened the notebook in JupyterLab connected to Databricks via SSH, and the notebook tried to read the YAML file, but that did not work because the file does not exist on Databricks. The only solution I could figure out is uploading the YAML file to DBFS so that the notebook can read it from there. Is there a better way to do that?

Thanks.

Comparing to OS solutions

Hi all, thanks for this nice effort and great work! However, I miss the ability to switch the connectivity around (e.g. connecting from my k8s to a DB cluster).

So, how is this different from Jupyter Gateway or Jupyter Enterprise Gateway?

Remove Anaconda dependency

Since Anaconda's licensing is incompatible with some corporate structures, it is important to be able to use the open-source pip package manager instead.

I will investigate and see what can be done. There is a conda version of the databricks runtime docker images, but they are ostensibly less up-to-date than the pip versions.

After pip install I encounter ModuleNotFoundError: No module named 'version_parser'

Hi Databricks,

After running pip install --upgrade databrickslabs-jupyterlab==2.2.1, I encounter ModuleNotFoundError: No module named 'version_parser'.

Where can I find version_parser to install it?

Best,

Connecting via private IP

We have a setup using VPN to connect to our databricks clusters.
I want to be able to connect to the cluster using its private IP.

I've tried just changing the IP in the ssh config, but that gets overridden.

Any ideas?

Custom environment variables for kernels

Hi!
First, thank you for your work!

I'm wondering whether I can somehow specify custom environment variables for my SSH kernel.
I have some custom libraries for which I need to set PYTHONPATH, LD_LIBRARY_PATH and some other env vars. When I use plain Databricks notebooks, I have my own Docker image and init scripts where I set up these vars and add them to .bashrc. But when I run JupyterLab with the Databricks integration, I don't see those variables. The only workaround I've found so far is to edit local.py from this library and set up my env there.
Obviously, this is a rather dirty hack.
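Short of patching local.py, one stopgap is to set the variables from the first notebook cell. The sketch below is an assumption-laden illustration, not a feature of this library: `apply_env` is a hypothetical helper and the paths are placeholders. It also has a real limitation: changing LD_LIBRARY_PATH inside a running process only affects child processes started afterwards, not libraries the kernel process itself loads.

```python
import os
import sys


def apply_env(env):
    """Set environment variables and mirror PYTHONPATH entries into sys.path."""
    for name, value in env.items():
        os.environ[name] = value
    entries = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    # Insert in reverse so the first entry ends up with the highest priority.
    for path in reversed(entries):
        if path not in sys.path:
            sys.path.insert(0, path)


# Hypothetical values, for illustration only.
apply_env({
    "PYTHONPATH": os.pathsep.join(["/opt/mylibs/a", "/opt/mylibs/b"]),
    "MY_SETTING": "1",
})
```

Mirroring PYTHONPATH into `sys.path` is needed because the interpreter reads PYTHONPATH only at startup.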

Token invalid error for Azure Databricks workspaces

In Azure databricks environments we

(db-jlab) C02Y77B9JG5H:~ gobinath$ databrickslabs-jupyterlab $PROFILE -k -o 4116859307136712 -i 0520-162211-ilk548
Valid version of conda detected: 4.7.12

* Getting host and token from .databrickscfg

* Select remote cluster

Token for profile 'jupyterssh' is invalid

=> Exiting
(db-jlab) C02Y77B9JG5H:~ gobinath$


(db-jlab) C02Y77B9JG5H:~ gobinath$ databrickslabs-jupyterlab $PROFILE -s -i 0520-162211-ilk548
Valid version of conda detected: 4.7.12

* Getting host and token from .databrickscfg

   => ssh key '/Users/gobinath/.ssh/id_jupyterssh' does not exist
   => Shall it be created (y/n)? (default = n): y
   => Creating ssh key /Users/gobinath/.ssh/id_jupyterssh
   => OK
Token for profile 'jupyterssh' is invalid

=> Exiting
(db-jlab) C02Y77B9JG5H:~ gobinath$

I know the token is good because I validated it over and over again with a direct CLI command:

(db-jlab) C02Y77B9JG5H:~ gobinath$ databricks clusters list --profile jupyterssh
0520-162211-ilk548  test_jupyter  RUNNING
(db-jlab) C02Y77B9JG5H:~ gobinath$

Missing arguments error on getting Spark context

After a successful connect, I get a missing-argument error for 'pinned_mode' when I try to get the Spark context.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-376bcb4d28bb> in <module>
      1 from databrickslabs_jupyterlab.connect import dbcontext, is_remote
----> 2 dbcontext()

/databricks/python/lib/python3.7/site-packages/databrickslabs_jupyterlab/connect.py in dbcontext(progressbar)
    179     # ... and connect to this gateway
    180     #
--> 181     gateway = get_existing_gateway(port, True, auth_token)
    182     print(". connected")
    183     # print("Python interpreter: %s" % interpreter)

TypeError: get_existing_gateway() missing 1 required positional argument: 'pinned_mode'

Any guidance on whether to use jupyterlab-integration or JupyterLab with databricks-connect?

I found that databricks-connect supports Jupyter, and I confirmed that JupyterLab works with databricks-connect by following this link:
https://docs.databricks.com/dev-tools/databricks-connect.html#jupyter
Which one should I use for JupyterLab with Databricks, this library or databricks-connect? Is development on this repo continuing? I'm wondering where the Databricks team will be putting its effort for JupyterLab integration.

Jupyterlab==3.. support

Hi!
Do you have plans to support the new JupyterLab 3? If so, are there any approximate dates for it, and can I help somehow?

Matplotlib and plotly plots don't render

Dataframes render fine in local Jupyter notebook, but matplotlib and plotly plots do not.

For matplotlib, the plot simply does not render:

For plotly, the error when using default plotly renderer is

When I change the default renderer to "notebook", nothing is printed out, similar to matplotlib. I can make plotly work with a very hacky workaround: saving the plot as HTML and then displaying the HTML.
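A workaround in the same spirit might also apply to matplotlib: render the figure to PNG bytes and display it as inline HTML. This is an untested assumption for this integration, and `png_to_img_tag` is a hypothetical helper. The sketch below covers only the encoding step, using the standard library.

```python
import base64


def png_to_img_tag(png_bytes):
    """Wrap raw PNG bytes in an <img> tag using a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return '<img src="data:image/png;base64,{}"/>'.format(encoded)


# Intended use in a notebook cell (requires matplotlib and IPython):
#   import io
#   from IPython.display import HTML
#   buf = io.BytesIO()
#   fig.savefig(buf, format="png")
#   HTML(png_to_img_tag(buf.getvalue()))

tag = png_to_img_tag(b"\x89PNG\r\n\x1a\n")  # just the PNG signature bytes
```

Because the image travels as part of the HTML output, this route relies on the same mechanism that the plotly HTML workaround does.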