Databricks — 8 min read

How to integrate Kedro and Databricks Connect

In this blog post, Diego Lira explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE.

11 Aug 2023 (last updated 20 Nov 2023)

In recent months we've updated Kedro documentation to illustrate three different ways of integrating Kedro with Databricks.

  • You can choose a workflow based on Databricks jobs to deploy a project that has finished development.

  • For faster iteration on changes, the workflow documented in "Use a Databricks workspace to develop a Kedro project" is for those who prefer to develop and test their projects directly within Databricks notebooks, to avoid the overhead of setting up and syncing a local development environment with Databricks.

  • Alternatively, you can work locally in an IDE as described by the workflow documented in "Use an IDE, dbx and Databricks Repos to develop a Kedro project". You can use your IDE’s capabilities for faster, error-free development, while testing on Databricks. This is ideal if you’re in the early stages of learning Kedro, or if your project requires constant testing and adjustments. However, the experience is still not perfect: you must sync your work into Databricks with dbx and run the pipeline inside a notebook, debugging requires lengthy setup for each change, and there is less flexibility than working entirely inside an IDE.

In this blog post, Diego Lira, a Specialist Data Scientist and a client-facing member of QuantumBlack, AI by McKinsey, explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE. He recommends this as a solution where the data-heavy parts of your pipelines are in PySpark. If part of your workflow uses plain Python (e.g. pandas) rather than PySpark, you will find that Databricks Connect downloads your DataFrame to your local environment to continue running the workflow. This can cause performance issues and introduce compliance risks, because the data has left the Databricks workspace.

What is Databricks Connect?

Databricks Connect is Databricks' official method of interacting with a remote Databricks instance while using a local environment.

To configure Databricks Connect for use with Kedro, follow the official setup to create a .databrickscfg file containing your access token. Install the package with pip install databricks-connect; its DatabricksSession then substitutes for your local SparkSession:

from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
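
For reference, a .databrickscfg profile typically looks like the sketch below. All values are placeholders to be filled in from your own workspace; the official setup linked above walks through generating the access token:

```ini
[DEFAULT]
host       = https://<your-workspace-url>
token      = <your-personal-access-token>
cluster_id = <your-cluster-id>
```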

Spark commands are sent and executed on the cluster, and results are returned to the local environment as needed. In the context of Kedro, this has an amazing effect: as long as you don’t explicitly ask for the data to be collected in your local environment, operations will be executed only when saving the outputs of your node. If you use datasets saved to a Databricks path, there will be no performance hit for transferring data between environments.

This tool was recently made available as a thin client for Spark Connect, one of the highlights of Spark 3.4, and its configuration is easier than in earlier versions. If your cluster doesn’t support the current version of Databricks Connect, refer to the official documentation, as previous versions had different limitations.

This video from the Databricks team gives a thorough introduction to Spark Connect.

How can I use a Databricks Connect workflow with Kedro?

Databricks Connect (and Spark Connect) enables us to have a completely local development flow, while all artifacts can be remote objects. Using Delta tables for all our datasets and MLflow for model objects and tracking, nothing needs to be saved locally. Developers can take full advantage of the Databricks stack while maintaining their full IDE usage.


How to use Databricks as your PySpark engine

Kedro supports integration with PySpark through the use of Hooks. To configure and enable your Databricks session through Spark Connect, simply set up your SPARK_REMOTE environment variable with your Databricks configuration. Here is an example implementation:

import configparser
import os
from pathlib import Path

from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self) -> None:
        """Initialises a SparkSession using the config
        from Databricks.
        """
        set_databricks_creds()
        _spark_session = SparkSession.builder.getOrCreate()


def set_databricks_creds():
    """
    Pass Databricks credentials as OS environment variables when running on the local machine.
    If the DATABRICKS_PROFILE env variable is set, the matching profile in .databrickscfg is used;
    otherwise the DEFAULT profile is used.
    """
    DEFAULT = os.getenv("DATABRICKS_PROFILE", "DEFAULT")
    if os.getenv("SPARK_HOME") != "/databricks/spark":
        config = configparser.ConfigParser()
        config.read(Path.home() / ".databrickscfg")

        host = (
            config[DEFAULT]["host"].split("//", 1)[1].strip().rstrip("/")
        )  # remove "https://" and any trailing "/" from the host
        cluster_id = config[DEFAULT]["cluster_id"]
        token = config[DEFAULT]["token"]

        os.environ[
            "SPARK_REMOTE"
        ] = f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"

This example populates SPARK_REMOTE from your local .databrickscfg file. The remote connection is not set up if the project is run from inside Databricks (where SPARK_HOME points to /databricks/spark), so you can still use the usual hybrid development flow. Notice that you don’t need to set up a spark.yml file as is common in other PySpark templates: you aren’t passing any Spark configuration, just using the cluster that is already configured in Databricks. You also don’t need to load any extra Spark files (e.g. JARs), because you are using a thin Spark Connect client.
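
To make the resulting connection string concrete, here is a minimal, self-contained sketch of the same parsing logic, using a dummy in-memory profile (the host, token and cluster_id values are illustrative, not real credentials):

```python
import configparser

# A dummy .databrickscfg profile; values are illustrative, not real credentials.
SAMPLE_CFG = """
[DEFAULT]
host = https://adb-1234567890123456.7.azuredatabricks.net/
token = dapi-example-token
cluster_id = 0123-456789-abcdefgh
"""


def build_spark_remote(cfg_text: str, profile: str = "DEFAULT") -> str:
    """Build the SPARK_REMOTE connection string from a .databrickscfg profile."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    # Strip the "https://" scheme and any trailing "/" from the host.
    host = config[profile]["host"].split("//", 1)[1].strip().rstrip("/")
    token = config[profile]["token"]
    cluster_id = config[profile]["cluster_id"]
    return f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"


print(build_spark_remote(SAMPLE_CFG))
```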

Now all the Spark calls in your pipelines will automatically use the remote cluster; there's no need to change anything in your code. However, your project might also include notebooks. To use the remote cluster from a notebook without relying on environment variables, use DatabricksSession:

from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()

When using the remote cluster, it's best to avoid data transfers between environments by having all catalog entries reference remote locations. Using kedro_datasets.databricks.ManagedTableDataSet as your dataset type in the catalog also allows you to use Delta table features.
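
As an illustration, a catalog entry along the following lines (the entry, database and table names here are hypothetical for your project) keeps both reads and writes inside Databricks as a managed Delta table:

```yaml
shuttles:
  type: databricks.ManagedTableDataSet
  database: my_database   # hypothetical schema name
  table: shuttles
  write_mode: overwrite
```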

How to enable MLflow on Databricks

Using MLflow to save all your artifacts directly to Databricks leads to a powerful workflow. For this you can use kedro-mlflow. Note that kedro-mlflow is built on top of the mlflow library; the Databricks configuration is not covered in kedro-mlflow's own documentation, but you can read about it directly in the mlflow documentation.

After doing the basic setup of the library in your project, you should see an mlflow.yml configuration file. In this file, change the following to set up your URI:

server:
  mlflow_tracking_uri: databricks # if null, will use mlflow.get_tracking_uri() as a default
  mlflow_registry_uri: databricks # if null, mlflow_tracking_uri will be used as mlflow default

Set up your experiment name (this should be a valid Databricks path):

experiment:
  name: /Shared/your_experiment_name

By default, all your parameters will be logged, and objects such as models and metrics can be saved as MLflow objects referenced in the catalog.
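
For instance, a trained model can be declared as an MLflow artifact in the catalog. A sketch, assuming kedro-mlflow's model dataset and a scikit-learn model (the entry name and flavor are illustrative for your project):

```yaml
regressor:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.sklearn
```

With this entry, the node output named regressor is logged to the active MLflow run on Databricks instead of being written to a local path.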

Limitations of this workflow

Databricks Connect, built on top of Spark Connect, supports only recent versions of Spark. I recommend reviewing the detailed limitations in the official documentation for specific guidance, such as the 128 MB upload limit for DataFrames.

Users also need to be aware that .toPandas() will move the data into your local pandas environment. Saving results back as MLflow objects is the preferred way to avoid local objects; the kedro-mlflow documentation gives examples for all supported object types.
