Learning Rust: Connecting Jupyter with Databricks

Adrian Macal
6 min read · Apr 27, 2023


Photo by JOHN TOWNER on Unsplash

Dive into a Rust proof-of-concept connecting Jupyter and Databricks for amazing data analyses!

As a software developer passionate about data, I’m constantly seeking new challenges. Jupyter has been a game-changer, providing a fantastic platform for coding, sharing, and visualizing data. Databricks, on the other hand, streamlines big data processing and collaboration using Apache Spark. Despite their similarities, they’re distinct tools.

So, I thought, why not merge these amazing platforms? That’s when I decided to create a custom kernel in Rust, an awesome programming language celebrated for its safety and performance. This kernel bridges Jupyter and Databricks, making data analysis even more efficient and enjoyable.

Let’s start by showing you the end result, and then we’ll go over the steps I followed to make it happen. In the screenshot below, you can see Visual Studio Code’s Jupyter Notebook running on my local machine, hooked up to a local Jupyter Server using the “Learning Rust” Kernel. This custom kernel connects directly to a remote, private Databricks Cluster through the Databricks API.

As you can see, the Kernel is capable of executing code remotely on the driver and returning the results of the execution.

To run the custom kernel locally in Visual Studio Code, I created a new kernel definition pointing to the Rust CLI I had created.

# /usr/local/share/jupyter/kernels/learning-rust/kernel.json

{
  "argv": [
    "/tmp/cargo/debug/jupyter-kernel",
    "--path", "{connection_file}",
    "--cluster-id", "1024-181223-ye0dza2w"
  ],
  "display_name": "Learning Rust",
  "language": "python",
  "metadata": {
    "debugger": false
  }
}
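If the definition is picked up correctly, the new kernel should appear next to the default ones in the output of jupyter kernelspec list.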

Then I started an instance of the Jupyter Server (installed earlier via pip), making sure to set the required environment variables beforehand.

export DATABRICKS_HOST=dbc-dna-e2-dev.cloud.databricks.com
export DATABRICKS_TOKEN=dapi37483-------------------------

cargo build
jupyter server

Pretty straightforward, right? If not, it might be a good idea to spend some time getting familiar with these tools and concepts. Happy learning!

Jupyter Server to Kernel protocol

Writing a new kernel is not a trivial task. The official documentation describes the communication between the Server and the Kernel: it runs over ZeroMQ using five different socket types, which Rust supports through the zeromq crate.

Here is a simplified list of sockets to implement:

  1. Shell: Handles code execution requests, as well as introspection and completion requests from the user.
  2. Control: Handles shutdown and interrupt signals, ensuring a graceful kernel termination.
  3. IOPub: Broadcasts messages to all connected front ends, such as kernel status updates, execution results, and error messages.
  4. Stdin: Handles raw input requests from the user during code execution, such as when the kernel needs additional input to proceed.
  5. Heartbeat: Confirms the kernel is alive and responsive by echoing the periodic ping messages sent by the server.
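To make this more concrete, here is a minimal sketch of opening these five sockets with the zeromq and tokio crates. The port numbers are placeholders; the real values come from the connection file discussed later.

use zeromq::{Socket, SocketRecv, SocketSend};

#[tokio::main]
async fn main() -> Result<(), zeromq::ZmqError> {
    // Shell, Control and Stdin are ROUTER sockets on the kernel side;
    // in the real kernel they would be polled in the main loop.
    let mut shell = zeromq::RouterSocket::new();
    shell.bind("tcp://127.0.0.1:9001").await?;

    let mut control = zeromq::RouterSocket::new();
    control.bind("tcp://127.0.0.1:9002").await?;

    let mut stdin = zeromq::RouterSocket::new();
    stdin.bind("tcp://127.0.0.1:9003").await?;

    // IOPub broadcasts to every connected front end, hence PUB.
    let mut iopub = zeromq::PubSocket::new();
    iopub.bind("tcp://127.0.0.1:9004").await?;

    // Heartbeat is a REP socket that simply echoes whatever it receives.
    let mut heartbeat = zeromq::RepSocket::new();
    heartbeat.bind("tcp://127.0.0.1:9005").await?;

    loop {
        let ping = heartbeat.recv().await?;
        heartbeat.send(ping).await?;
    }
}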

We won’t be implementing the entire protocol, but we’ll focus on a few core messages that are essential for our custom kernel to function. Messages are transferred using 7 frames, with each frame mapping to a corresponding field of the Wire Protocol data structure:

  1. Identifier: Used to route messages and maintain a conversation.
  2. Delimiter: Hardcoded frame <IDS|MSG> that separates the routing prefix and the message frames.
  3. Signature: A HMAC-SHA256 signature to ensure message integrity.
  4. Header: A JSON object containing metadata about the message, such as the message type and the session it belongs to.
  5. Parent header: A JSON object containing metadata about the parent message, if applicable.
  6. Metadata: A JSON object containing additional information about the message, typically used by front ends and clients for custom processing.
  7. Content: A JSON object that carries the actual content of the message, such as code to be executed or a request for completion suggestions.

Let’s define a few structs that help us manage the protocol and abstract away from the raw ZeroMQ types.
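As a sketch, assuming serde and serde_json (the exact names and fields in the repository may differ):

use serde::{Deserialize, Serialize};
use serde_json::Value;

// Mirrors the header frame of the wire protocol.
#[derive(Clone, Serialize, Deserialize)]
pub struct JupyterHeader {
    pub msg_id: String,
    pub session: String,
    pub username: String,
    pub date: String,
    pub msg_type: String,
    pub version: String,
}

// One field per wire frame, exactly as they appear on the socket.
pub struct JupyterWireProtocol {
    pub identifiers: Vec<Vec<u8>>, // routing prefix (zero or more frames)
    pub signature: String,         // hex-encoded HMAC-SHA256 of the four JSON frames
    pub header: Value,
    pub parent_header: Value,
    pub metadata: Value,
    pub content: Value,
}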

Having some functions around those structures may also be useful.

  • The Jupyter Header should expose its message type and session, can be serialized to and deserialized from JSON, and can even be converted into a reply header (very useful).
  • The Jupyter Wire Protocol can be serialized to and deserialized from a ZeroMQ message, gives us insight into the message, and allows us to verify its integrity.
  • The Jupyter Wire Builder helps us create an outgoing protocol message and sign it at the end to produce the actual protocol struct (the signing step is sketched below).
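The signing step is where the HMAC-SHA256 signature comes in. A minimal sketch, assuming the hmac, sha2 and hex crates (the builder in the repository may wrap this differently):

use hmac::{Hmac, Mac};
use sha2::Sha256;

// Signs a message with the key from the connection file. Per the protocol,
// the HMAC-SHA256 runs over the header, parent header, metadata and content
// frames, in that order, and the result is hex-encoded.
fn sign(key: &[u8], frames: &[&[u8]; 4]) -> String {
    let mut mac = Hmac::<Sha256>::new_from_slice(key)
        .expect("HMAC accepts keys of any length");

    for frame in frames {
        mac.update(frame);
    }

    hex::encode(mac.finalize().into_bytes())
}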

At a slightly higher level of abstraction, we can define Jupyter Channel, Jupyter Message, and Jupyter Content types to combine the raw ZeroMQ structs with meaningful data.
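A sketch of those three types, building on the header struct above; again, the names are illustrative:

use serde_json::Value;

// The channel a message arrived on (or should be sent to).
#[derive(Clone, Copy, Debug)]
pub enum JupyterChannel {
    Shell,
    Control,
    Iopub,
    Stdin,
    Heartbeat,
}

// Only the content variants our kernel understands; everything else
// falls through to Unknown and is ignored.
pub enum JupyterContent {
    KernelInfoRequest,
    ExecuteRequest { code: String },
    Unknown(Value),
}

// A fully decoded message: where it came from, its header, and its content.
pub struct JupyterMessage {
    pub channel: JupyterChannel,
    pub header: JupyterHeader,
    pub parent: Option<JupyterHeader>,
    pub content: JupyterContent,
}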

All three types are used by the Jupyter Client, which hides the complexity of connecting to the server and handling messages on all channels.

The Jupyter Server passes the kernel the name of a connection file, which should be parsed and interpreted at the very beginning. It contains all the information needed to start our kernel: the transport, IP address, socket ports, and the signing key.
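The connection file format is documented in the Jupyter kernel docs, so parsing it is mostly a serde exercise. A minimal sketch:

use serde::Deserialize;

// Matches the documented connection file format.
#[derive(Deserialize)]
pub struct ConnectionFile {
    pub transport: String,        // e.g. "tcp"
    pub ip: String,               // e.g. "127.0.0.1"
    pub shell_port: u16,
    pub iopub_port: u16,
    pub stdin_port: u16,
    pub control_port: u16,
    pub hb_port: u16,
    pub signature_scheme: String, // e.g. "hmac-sha256"
    pub key: String,              // the shared secret used for signing
}

impl ConnectionFile {
    pub fn parse(path: &str) -> std::io::Result<Self> {
        let bytes = std::fs::read(path)?;
        serde_json::from_slice(&bytes)
            .map_err(|error| std::io::Error::new(std::io::ErrorKind::InvalidData, error))
    }

    // Builds an endpoint like "tcp://127.0.0.1:57503" for a given port.
    pub fn endpoint(&self, port: u16) -> String {
        format!("{}://{}:{}", self.transport, self.ip, port)
    }
}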

Kernel Implementation

The previous implementation of the Jupyter Client is quite low-level, focusing mainly on decoupling from the ZeroMQ implementation and interpreting bytes. To improve this, let’s introduce a Kernel Client struct, which will be responsible for handling communication between Databricks and Jupyter. This new struct will facilitate a higher-level understanding of the messages being sent and received, simplifying the interaction between the two platforms.

The client is started by creating an instance that contains both the Jupyter and Databricks clients. It is also responsible for reading the passed connection file and resolving the environment variables.

Furthermore, it offers a recv function that handles incoming messages. Although I initially attempted to incorporate more multi-threading, it didn't work as intended, so you might still notice some remnants of the select! macro.
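A simplified version of such a recv function, assuming tokio::select! over the request channels (IOPub and Heartbeat would be handled elsewhere):

use zeromq::{RouterSocket, SocketRecv, ZmqError, ZmqMessage};

// Waits for the first message on any of the request channels; select!
// polls all branches and returns whichever completes first.
async fn recv_any(
    shell: &mut RouterSocket,
    control: &mut RouterSocket,
    stdin: &mut RouterSocket,
) -> Result<(JupyterChannel, ZmqMessage), ZmqError> {
    tokio::select! {
        message = shell.recv() => Ok((JupyterChannel::Shell, message?)),
        message = control.recv() => Ok((JupyterChannel::Control, message?)),
        message = stdin.recv() => Ok((JupyterChannel::Stdin, message?)),
    }
}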

The content handler deals with incoming messages by matching the content variant and performing two important actions:

  • Handling the initial kernel_info_request, which describes the kernel. It's crucial to handle this correctly, as failure to do so may cause Jupyter to crash.
  • Handling the execute_request, which is the actual command sent from the front end to the kernel for execution. The code determines a new execution counter, sends a notification about the ongoing execution, and starts the process (a sketch of this dispatch follows the list).
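A sketch of that dispatch, reusing the content enum from earlier; the helper methods (send_kernel_info_reply, broadcast_busy, execute) and the KernelError type are hypothetical stand-ins for what the repository does:

impl KernelClient {
    // Dispatches a decoded message to the right handler.
    async fn handle(&mut self, message: JupyterMessage) -> Result<(), KernelError> {
        match &message.content {
            JupyterContent::KernelInfoRequest => {
                // Replies with the protocol version, implementation name and
                // language info; getting this wrong can hang the front end.
                self.send_kernel_info_reply(&message).await
            }
            JupyterContent::ExecuteRequest { code } => {
                let counter = self.next_execution_counter();
                self.broadcast_busy(&message).await?;
                self.execute(counter, code, &message).await
            }
            JupyterContent::Unknown(_) => Ok(()),
        }
    }
}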

The execution process is quite lengthy and involves several steps. It finds the existing execution context, creates a new one if necessary, and locks the kernel by marking it as busy. The code then calls Databricks with the code to be executed and updates the front end display.

Finally, it enters a loop that checks the status of the command being executed and reports it back to the front end.
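A sketch of such a loop against the Databricks 1.2 command execution API (the commands/status endpoint), assuming the reqwest crate with its json feature; error handling and rendering of the results are left out:

use serde_json::Value;
use std::time::Duration;

// Polls the status of a previously submitted command until it reaches
// a terminal state (Finished, Cancelled or Error).
async fn poll_command(
    http: &reqwest::Client,
    host: &str,
    token: &str,
    cluster_id: &str,
    context_id: &str,
    command_id: &str,
) -> Result<Value, reqwest::Error> {
    loop {
        let status: Value = http
            .get(format!("https://{host}/api/1.2/commands/status"))
            .bearer_auth(token)
            .query(&[
                ("clusterId", cluster_id),
                ("contextId", context_id),
                ("commandId", command_id),
            ])
            .send()
            .await?
            .json()
            .await?;

        match status["status"].as_str() {
            // Still in progress: wait a bit and ask again.
            Some("Queued") | Some("Running") | Some("Cancelling") => {
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
            // Terminal state: hand the payload back to the caller.
            _ => return Ok(status),
        }
    }
}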

While the code execution is sequential, it demonstrates the general idea of how a production-quality implementation could work.

Summary

This proof-of-concept is a great starting point for developers wanting to build custom kernels and add features. We’ve only begun exploring the possibilities. Consider your next steps and potential enhancements. What do you think about this integration’s potential?

The code is available here: https://github.com/amacal/learning-rust/tree/jupyter-databricks-kernel
