Databricks — 8 min read

How to integrate Kedro and Databricks Connect

In this blog post, Diego Lira explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE.

11 Aug 2023 (last updated 20 Nov 2023)

In recent months we've updated Kedro documentation to illustrate three different ways of integrating Kedro with Databricks.

  • You can choose a workflow based on Databricks jobs to deploy a project that has finished development.

  • For faster iteration on changes, the workflow documented in "Use a Databricks workspace to develop a Kedro project" is for those who prefer to develop and test their projects directly within Databricks notebooks, to avoid the overhead of setting up and syncing a local development environment with Databricks.

  • Alternatively, you can work locally in an IDE as described by the workflow documented in "Use an IDE, dbx and Databricks Repos to develop a Kedro project". You can use your IDE’s capabilities for faster, error-free development, while testing on Databricks. This is ideal if you’re in the early stages of learning Kedro, or if your project requires constant testing and adjustments. However, the experience is still not perfect: you must sync your work into Databricks with dbx and run the pipeline inside a notebook, debugging requires lengthy setup for each change, and there is less flexibility than working entirely inside an IDE.

In this blog post, Diego Lira, a Specialist Data Scientist and a client-facing member of QuantumBlack, AI by McKinsey, explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE. He recommends this as a solution where the data-heavy parts of your pipelines are in PySpark. If part of your workflow uses plain Python (e.g. pandas) rather than PySpark, you will find that Databricks Connect downloads your DataFrame to your local environment to continue running the workflow. This can cause performance issues and introduce compliance risks, because the data has left the Databricks workspace.

What is Databricks Connect?

Databricks Connect is Databricks' official method of interacting with a remote Databricks instance while using a local environment.

To configure Databricks Connect for use with Kedro, follow the official setup to create a .databrickscfg file containing your access token. Install the package with pip install databricks-connect; its DatabricksSession then substitutes for your local SparkSession:

from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
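
For reference, a .databrickscfg profile typically looks like the sketch below. All values are placeholders to be filled in from your own workspace; the official setup linked above walks through generating the access token:

```ini
[DEFAULT]
host       = https://<your-workspace-url>
token      = <your-personal-access-token>
cluster_id = <your-cluster-id>
```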

Spark commands are sent and executed on the cluster, and results are returned to the local environment as needed. In the context of Kedro, this has an amazing effect: as long as you don’t explicitly ask for the data to be collected in your local environment, operations will be executed only when saving the outputs of your node. If you use datasets saved to a Databricks path, there will be no performance hit for transferring data between environments.

This tool was recently made available as a thin client for Spark Connect, one of the highlights of Spark 3.4, and its configuration is easier than in earlier versions. If your cluster doesn’t support the current version of Databricks Connect, refer to the official documentation, as previous versions had different limitations.

This video from the Databricks team gives a thorough introduction to Spark Connect.

How can I use a Databricks Connect workflow with Kedro?

Databricks Connect (and Spark Connect) enables us to have a completely local development flow, while all artifacts can be remote objects. Using Delta tables for all our datasets and MLflow for model objects and tracking, nothing needs to be saved locally. Developers can take full advantage of the Databricks stack while maintaining their full IDE usage.


How to use Databricks as your PySpark engine

Kedro supports integration with PySpark through the use of Hooks. To configure and enable your Databricks session through Spark Connect, simply set up your SPARK_REMOTE environment variable with your Databricks configuration. Here is an example implementation:

import configparser
import os
from pathlib import Path

from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self) -> None:
        """Initialises a SparkSession using the config
        from Databricks.
        """
        set_databricks_creds()
        _spark_session = SparkSession.builder.getOrCreate()


def set_databricks_creds():
    """
    Pass Databricks credentials as OS environment variables when running on the local machine.
    If the DATABRICKS_PROFILE env variable is set, the matching profile in .databrickscfg is used;
    otherwise the DEFAULT profile is used.
    """
    DEFAULT = os.getenv("DATABRICKS_PROFILE", "DEFAULT")
    if os.getenv("SPARK_HOME") != "/databricks/spark":
        config = configparser.ConfigParser()
        config.read(Path.home() / ".databrickscfg")

        host = (
            config[DEFAULT]["host"].split("//", 1)[1].strip().rstrip("/")
        )  # remove "https://" and any trailing "/" from the host
        cluster_id = config[DEFAULT]["cluster_id"]
        token = config[DEFAULT]["token"]

        os.environ[
            "SPARK_REMOTE"
        ] = f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"

This example populates SPARK_REMOTE from your local .databrickscfg file. The remote connection is not set up if the project is run from inside Databricks (where SPARK_HOME points to /databricks/spark), so you can still use the usual hybrid development flow. Notice that you don’t need to set up a spark.yml file as is common in other PySpark templates: you aren’t passing any Spark configuration, just using the cluster that is already configured in Databricks. You also don’t need to load any extra Spark files (e.g. JARs), because you are using a thin Spark Connect client.
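
To make the resulting connection string concrete, here is a minimal, self-contained sketch of the same parsing logic, using a dummy in-memory profile (the host, token and cluster_id values are illustrative, not real credentials):

```python
import configparser

# A dummy .databrickscfg profile; values are illustrative, not real credentials.
SAMPLE_CFG = """
[DEFAULT]
host = https://adb-1234567890123456.7.azuredatabricks.net/
token = dapi-example-token
cluster_id = 0123-456789-abcdefgh
"""


def build_spark_remote(cfg_text: str, profile: str = "DEFAULT") -> str:
    """Build the SPARK_REMOTE connection string from a .databrickscfg profile."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    # Strip the "https://" scheme and any trailing "/" from the host.
    host = config[profile]["host"].split("//", 1)[1].strip().rstrip("/")
    token = config[profile]["token"]
    cluster_id = config[profile]["cluster_id"]
    return f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"


print(build_spark_remote(SAMPLE_CFG))
```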

Now all the Spark calls in your pipelines will automatically use the remote cluster; there's no need to change anything in your code. However, your project might also include notebooks. To use the remote cluster from a notebook without relying on environment variables, use DatabricksSession:

from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()

When using the remote cluster, it's best to avoid data transfers between environments by having all catalog entries reference remote locations. Using kedro_datasets.databricks.ManagedTableDataSet as your dataset type in the catalog also allows you to use Delta table features.
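
As an illustration, a catalog entry along the following lines (the entry, database and table names here are hypothetical for your project) keeps both reads and writes inside Databricks as a managed Delta table:

```yaml
shuttles:
  type: databricks.ManagedTableDataSet
  database: my_database   # hypothetical schema name
  table: shuttles
  write_mode: overwrite
```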

How to enable MLflow on Databricks

Using MLflow to save all your artifacts directly to Databricks leads to a powerful workflow. For this you can use kedro-mlflow. Note that kedro-mlflow is built on top of the mlflow library; the Databricks configuration is not covered in kedro-mlflow's own documentation, but you can read about it directly in the mlflow documentation.

After doing the basic setup of the library in your project, you should see an mlflow.yml configuration file. In this file, change the following to set up your URI:

server:
  mlflow_tracking_uri: databricks # if null, will use mlflow.get_tracking_uri() as a default
  mlflow_registry_uri: databricks # if null, mlflow_tracking_uri will be used as mlflow default

Set up your experiment name (this should be a valid Databricks path):

experiment:
  name: /Shared/your_experiment_name

By default, all your parameters will be logged, and objects such as models and metrics can be saved as MLflow objects referenced in the catalog.
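
For instance, a trained model can be declared as an MLflow artifact in the catalog. A sketch, assuming kedro-mlflow's model dataset and a scikit-learn model (the entry name and flavor are illustrative for your project):

```yaml
regressor:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.sklearn
```

With this entry, the node output named regressor is logged to the active MLflow run on Databricks instead of being written to a local path.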

Limitations of this workflow

Databricks Connect, built on top of Spark Connect, supports only recent versions of Spark. I recommend reviewing the detailed limitations in the official documentation for specific guidance, such as the 128 MB upload limit for DataFrames.

Users also need to be aware that .toPandas() will move the data into your local pandas environment. Saving results back as MLflow objects is the preferred way to avoid local objects; the kedro-mlflow documentation gives examples for all supported object types.
