
Comments (9)

kevin-bates commented on May 13, 2024

@ckadner - I went ahead and created a version of Elyra that uses "early polling", which I believe we should do regardless and which is a prerequisite to looking into async session creation. This version can be found here: https://github.com/kevin-bates/elyra/tree/early-polling


ckadner commented on May 13, 2024

Kernels running in yarn-cluster mode (when launched via spark-submit) must initialize a SparkContext before the Spark Yarn code registers the application as RUNNING: https://github.com/apache/spark/blob/3d4d11a80fe8953d48d8bfac2ce112e37d38dc90/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L405

Consequently there cannot be any "lazy initialization" of the SparkContext (or SparkSession) object, as we saw with the Scala kernels using Apache Toree builds after June 7 (Toree commit 5cfbc83, TOREE-390: "Lazily initialize spark sessions").

@kevin-bates -- when using connection_file_mode="socket", do we still need to wait for the kernel "application" to be in the RUNNING state before we connect to the kernel?


kevin-bates commented on May 13, 2024

Yes, what I found was that the RUNNING state just means the launcher has started. For example, pull mode essentially retries pulling the connection file for a few seconds because it doesn't exist yet (its creation takes place after the session creation).

Hmm, if async session creation proves difficult, we could try swapping the connection-file and socket-send logic to occur before the session creation. Although the kernel would still be blocked from being usable, Elyra startup could complete sooner.
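A minimal sketch of that reordering, with simplified payload handling (the real launcher's helpers and framing differ; IPython's embed_kernel and PySpark's builder API are used here only as stand-ins for the launcher's equivalents):

```python
import json
import socket

from IPython import embed_kernel          # embeds an IPython kernel in-process
from pyspark.sql import SparkSession

def send_connection_info(connection_info: dict, response_addr: str) -> None:
    """Socket mode: push the connection info back to Elyra as JSON.
    (Simplified; the real launcher's payload/framing may differ.)"""
    host, port = response_addr.split(":")
    with socket.create_connection((host, int(port))) as sock:
        sock.sendall(json.dumps(connection_info).encode("utf-8"))

def launch(connection_info: dict, response_addr: str) -> None:
    # 1. Publish the connection info FIRST, so Elyra's pull/socket wait
    #    succeeds while the YARN application may still be starting up.
    send_connection_info(connection_info, response_addr)

    # 2. Only now pay the SparkSession creation cost (the slow step).
    spark = SparkSession.builder.getOrCreate()

    # 3. Start serving the kernel on the advertised ports; the kernel only
    #    becomes usable here, but Elyra startup no longer blocks on step 2.
    embed_kernel(local_ns={"spark": spark, "sc": spark.sparkContext})
```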


kevin-bates commented on May 13, 2024

@ckadner - sorry, but I hadn't looked at the code (I had replied from my phone) and forgot about the lazy-loading behavior we had seen previously.

The application is definitely in the RUNNING state, because Elyra doesn't start attempting to pull the file (or listen on the socket with one-second timeouts) until the Yarn API indicates RUNNING. Then, in each of the above, we see multiple retries - 3-4 occurring. What I can't determine is whether those retries happen before or after the launcher has created the session.

Here's some interesting output from a socket-mode Python kernel that happened to be loopback (so clock synchronization isn't an issue)...

The Elyra log shows the following sequence, where the first entry is the last one in the ACCEPTED state and the last entry is where it actually gets data returned on the socket. In the middle, you can see the application state turn to RUNNING, followed by the wait/retry on the socket...

[D 2017-07-12 08:34:42.331 KernelGatewayApp] Waiting for application to enter 'RUNNING' state. KernelID=9756df5f-ab63-4c1e-adf5-727349d64abe, ApplicationID=application_1499374582831_0115, AssignedHost=elyra-fyi-node-1.fyre.ibm.com, CurrentState=ACCEPTED, Attempt=9
[D 2017-07-12 08:34:43.858 KernelGatewayApp] KernelID=9756df5f-ab63-4c1e-adf5-727349d64abe, ApplicationID=application_1499374582831_0115, AssignedHost=elyra-fyi-node-1.fyre.ibm.com, CurrentState=RUNNING, Attempt=10
[D 2017-07-12 08:34:44.861 KernelGatewayApp] Waiting for application_1499374582831_0115 to connect back to receive connection info...
[D 2017-07-12 08:34:45.877 KernelGatewayApp] KernelID=9756df5f-ab63-4c1e-adf5-727349d64abe, ApplicationID=application_1499374582831_0115, AssignedHost=elyra-fyi-node-1.fyre.ibm.com, CurrentState=RUNNING, Attempt=11
[D 2017-07-12 08:34:46.879 KernelGatewayApp] Waiting for application_1499374582831_0115 to connect back to receive connection info...
[D 2017-07-12 08:34:47.892 KernelGatewayApp] KernelID=9756df5f-ab63-4c1e-adf5-727349d64abe, ApplicationID=application_1499374582831_0115, AssignedHost=elyra-fyi-node-1.fyre.ibm.com, CurrentState=RUNNING, Attempt=12
[D 2017-07-12 08:34:47.892 KernelGatewayApp] Connected to ('172.16.187.100', 53188)...

Here's the corresponding launcher tracing for that application ID. Session creation occurs between entries B and C.

A: 2017-07-12 15:34:41.644
B: 2017-07-12 15:34:41.645
C: 2017-07-12 15:34:47.037
D: 2017-07-12 15:34:47.037
E: 2017-07-12 15:34:47.037
F: 2017-07-12 15:34:47.037
G: 2017-07-12 15:34:47.038
H: 2017-07-12 15:34:47.038
HA: 2017-07-12 15:34:47.038
HB: 2017-07-12 15:34:47.038
HC: 2017-07-12 15:34:47.039
I: 2017-07-12 15:34:47.039
J: 2017-07-12 15:34:47.039

Elyra first discovers the RUNNING state at 08:34:43.858 - which is in the middle of the session creation logic - and that does appear to corroborate the relationship between the RUNNING state and session creation.

If this is true, then it wouldn't do us any good to make session creation asynchronous unless we also start polling the file/socket while still in the ACCEPTED state - which we could probably do fairly easily (at first glance, essentially by deleting a couple lines of code).
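For the socket case, that could look roughly like the sketch below: attempt the accept on every poll cycle instead of gating it on the RUNNING state (query_app_state is a stand-in for Elyra's Yarn API query; this is not the actual process-proxy code):

```python
import socket

def wait_for_launcher(listener: socket.socket, query_app_state, max_attempts=30):
    """Accept the launcher's callback as soon as it arrives, even while the
    YARN application is still ACCEPTED (sketch; not Elyra's actual code)."""
    listener.settimeout(1.0)  # mirrors the one-second retries in the log above
    for attempt in range(1, max_attempts + 1):
        state = query_app_state()  # ACCEPTED or RUNNING - no longer a gate
        try:
            conn, addr = listener.accept()
            print(f"Connected to {addr} (state={state}, attempt={attempt})")
            return conn
        except socket.timeout:
            continue
    raise TimeoutError("Launcher never connected back with connection info")
```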


kevin-bates commented on May 13, 2024

@ckadner - It would be great if you could test with PR #71! It's got a few additional changes beyond the branch I referenced in the previous comment.


ckadner commented on May 13, 2024

Thank you @kevin-bates

I have been experimenting with several approaches and am currently working on a hybrid of the first two summarized below. The basic premise of all approaches is to reorder the sequence of events in the kernel launcher script so that the kernel can be connected to before, or in parallel with, the SparkContext/SparkSession initialization. In combination with the changes from PR #71 ("Poll prior to RUNNING state"), the effect would be that notebook users could execute non-Spark related cells before the SparkContext is available.

Alternative approaches:

  1. "Asynchroneus" initialization (non-blocking)

    • initialize the sc, spark, sqlContext, ... variables to a dummy SPARK_CONTEXT_NOT_YET_INITIALIZED object
    • start a thread to do the SparkContext/SparkSession initialization just before the embed_kernel call
    • the user would then find a notebook that is connected (and then ready) before the Spark session is initialized, and able to execute non-Spark related cells
    • running a notebook cell which references any of the Spark session variables before the initialization thread completes will cause a 'SPARK_CONTEXT_NOT_YET_INITIALIZED' object has no attribute '...' runtime error (or, by overriding the __getattr__ method, one could instead print the message "Spark context not yet initialized...")
    • any such premature "failed" Spark cells would have to be re-run after the initialization thread finishes (potentially multiple times if the SC initialization takes longer)
  2. "On-demand" initialization (blocking)

    • initialize the Spark session variables (sc, spark, sqlContext, ...) with proxy/wrapper objects that override all of the attributes and methods of the respective wrapped Spark objects (a minimal proxy sketch follows this list):
      • calling any of the methods/attributes on the wrapper objects (at a later time) will trigger the code to actually initialize the Spark session objects "on demand" and replace the Spark session variables with the real SparkContext/SparkSession objects
    • but we don't do any Spark session initialization yet and just call embed_kernel
    • the user would then find a notebook that is connected (and then ready) before/without an initialized Spark session, and able to run non-Spark related cells
    • the first time any cell is executed that references any of the Spark session variables (sc, spark, sqlContext, ...), a blocking call will be made to initialize the SparkContext, and then the sc, spark, etc. session variables would be replaced to point to the actual Spark objects
    • Note: if the user does not execute any Spark dependent cell before the spark.yarn.am.waitTime (default: 100 sec) is exceeded then the Yarn application would fail and the kernel be killed/restarted
  3. "Invisible cell" (blocking)

    • encapsulate the Spark session variable initialization as text of a notebook cell
    • start a separate thread (with a short delay) that will connect to the kernel and run that cell
    • call embed_kernel
    • the user would then find a notebook that is connected but NOT ready, since the "invisible" cell to initialize the Spark session variables is still running
    • once the "invisible" cell completes, the notebook is ready and both Spark and non-Spark related cells can be run
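For approach 2, overriding __getattr__ keeps the proxy decoupled from the Spark API surface rather than mirroring individual methods. A minimal sketch with illustrative names (not the launcher's actual code):

```python
import threading

class LazySparkProxy:
    """Stands in for sc/spark/sqlContext; the first attribute access
    triggers the real (blocking) Spark initialization, then delegates."""

    def __init__(self, initializer):
        self._initializer = initializer  # e.g. lambda: SparkSession.builder.getOrCreate()
        self._target = None
        self._lock = threading.Lock()

    def __getattr__(self, name):
        # Only invoked for attributes NOT found on the proxy itself, i.e.
        # anything the user expects from the real Spark object.
        with self._lock:
            if self._target is None:
                self._target = self._initializer()  # on-demand, blocking
        return getattr(self._target, name)

# Usage in the launcher (before embed_kernel, with no Spark init yet):
#   spark = LazySparkProxy(lambda: SparkSession.builder.getOrCreate())
#   sc = LazySparkProxy(lambda: SparkContext.getOrCreate())
```

The spark.yarn.am.waitTime caveat from approach 2 still applies here: if nothing touches the proxy before the timeout, the Yarn application fails.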

By combining approaches 1 and 2, we get the benefit of a notebook with a kernel that is connected (and ready) early while the Spark session initialization runs in the background; if a Spark-related cell is executed "prematurely" there would be no error - the cell would simply wait for the initialization thread to complete and then run its content.
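A minimal sketch of that hybrid, again with illustrative names: initialization starts in a background thread immediately, and premature attribute access blocks until the thread finishes instead of raising:

```python
import threading

class BackgroundInitProxy:
    """Hybrid of approaches 1 and 2: kick off initialization right away in
    a background thread; premature access waits instead of erroring."""

    def __init__(self, initializer):
        self._target = None
        self._thread = threading.Thread(target=self._initialize,
                                        args=(initializer,), daemon=True)
        self._thread.start()   # begin Spark init before embed_kernel is called

    def _initialize(self, initializer):
        self._target = initializer()

    def __getattr__(self, name):
        self._thread.join()    # a "premature" Spark cell blocks here, no error
        return getattr(self._target, name)
```

One design caveat: if the initializer raises, _target stays None and the first delegated access surfaces an AttributeError; a real implementation would want to capture and re-raise the original failure.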


kevin-bates commented on May 13, 2024

Thanks @ckadner and quite an awesome write-up!

I agree that a combination of 1 and 2 is a good way to go. I'm a little concerned about how tight the coupling would be with the Spark API in option 2. It seems like we could probably narrow things down by seeing which methods are most commonly used across a set of DSX notebooks.

I'm wondering if we shouldn't make spark.yarn.am.waitTime close to the idle-kernel culling value (which would need to be conveyed), or perhaps see how 0 behaves. If the user walks away before touching sc.<method>, the kernel would be culled anyway.

Then there's the discussion about consistency across the kernel types, although I think erring on the side of Python is most advantageous (should we need to diverge here).


ckadner commented on May 13, 2024

Good points @kevin-bates

...concerned about how tight the coupling would be with the Spark API in option 2

We would use the Python equivalent of reflection (e.g. overriding __getattr__, as in the proxy sketch above) to minimize code and remain independent of the actual Spark APIs.

...make the spark.yarn.am.waitTime close to the idle kernel culling value

We could pass it to spark-submit via another environment variable, similar to $KERNEL_ID and $KERNEL_RESPONSE_ADDRESS.

...or perhaps see how 0 behaves

We could use Duration.Inf ... for spark.yarn.am.waitTime for unbounded waiting; see scala/concurrent/Awaitable.scala#L34.


ckadner commented on May 13, 2024

Status:

  • IPython kernels: the asynchronous initialization of the Spark session variables was completed on July 18, 2017 with commit 6b0c2b7 (PR #81)

  • Toree Scala kernels have supported "lazy initialization" of the SparkContext since Toree commit 5cfbc83 (TOREE-390: "Lazily initialize spark sessions") from June 7, 2017

  • iRkernel: we still need to work on a PR; see issue #124: "Modify launch script for iRkernel to delay SC initialization"

