When I joined the Kedro team, I was quite new to data science so I started learning basic concepts in Jupyter notebooks by working on projects in Kaggle. They were great but I noticed a few "code smells" as my projects got more complex. I also ran through the spaceflights tutorial to understand more of the basics of Kedro, but I wanted to understand how to make things meet in the middle. Here's what I learned while combining a notebook project with Kedro.
Smell 1: "Magic" numbers
When writing exploratory code, it is tempting to hard-code values to save time. But a “magic” number makes the code more difficult to read, and if you later want to change the value, you must hunt it down in the code. Magic numbers make code harder to maintain in the long term.
In this example, the magic numbers are the values supplied to `test_size` and `random_state` in the call to `train_test_split`:
```python
from sklearn.model_selection import train_test_split

features = [
    "engines",
    "passenger_capacity",
    "crew",
    "d_check_complete",
    "moon_clearance_complete",
    "iata_approved",
    "company_rating",
    "review_scores_rating",
]

X = model_input_table[features]
y = model_input_table["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)
```
Good software engineering practice puts magic numbers into named constants, sometimes defined at the top of a file, or within a utility file. This applies to any value that is used frequently throughout the code.
```python
from sklearn.model_selection import train_test_split

# Named constants replace the magic numbers
test_size = 0.3
random_state = 3

features = [
    "engines",
    "passenger_capacity",
    "crew",
    "d_check_complete",
    "moon_clearance_complete",
    "iata_approved",
    "company_rating",
    "review_scores_rating",
]

X = model_input_table[features]
y = model_input_table["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state
)
```
Some of the benefits of this approach include:
It saves you from typing out long file paths repeatedly, which is error-prone.
Changing a constant in one location eliminates time-consuming and error-prone changes to multiple instances of the hard-coded value.
It’s much easier to read a meaningful constant name that explains the purpose of the value than to think backwards and remember the meaning of different numbers across your code (`test_size` is just more meaningful than `0.3`).
How to use a YAML configuration file for magic values
Having extracted the magic numbers into variables declared at the top of the cell, the next step was to move them into a separate file, known as a configuration file, and read them from there when needed. YAML is a popular choice for writing configuration files, so I used that. I added an empty file called `parameters.yml` to my project and defined the constants inside it. The idea of magic numbers extends to magic values in general: since the `features` list might also be reusable elsewhere, I added it to `parameters.yml` too:
```yaml
# parameters.yml

model_options:
  test_size: 0.3
  random_state: 3
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating
```
The notebook code now looked as follows:
```python
import yaml

with open("parameters.yml", encoding="utf-8") as yaml_file:
    parameters = yaml.safe_load(yaml_file)

test_size = parameters["model_options"]["test_size"]
random_state = parameters["model_options"]["random_state"]
X = model_input_table[parameters["model_options"]["features"]]
...
```
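One side effect of moving values into a file is that a typo or an out-of-range value only surfaces when a cell finally uses it. A possible refinement, sketched here under some assumptions, is to read the `model_options` block once and check it up front. The `ModelOptions` class and `read_model_options` function are hypothetical names introduced for illustration, the check assumes `test_size` is given as a fraction, and the `parameters` dictionary is a hand-written stand-in for what `yaml.safe_load` returns.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelOptions:
    """Hypothetical container for the values under `model_options`."""

    test_size: float
    random_state: int
    features: Tuple[str, ...]


def read_model_options(parameters: dict) -> ModelOptions:
    """Pull the model options out of the loaded parameters, failing early."""
    options = parameters["model_options"]  # KeyError here if the block is missing
    test_size = float(options["test_size"])
    # Assumption: test_size is a fraction of the data, not a row count.
    if not 0.0 < test_size < 1.0:
        raise ValueError(f"test_size must be between 0 and 1, got {test_size}")
    return ModelOptions(
        test_size=test_size,
        random_state=int(options["random_state"]),
        features=tuple(options["features"]),
    )


# Stand-in for the dictionary returned by yaml.safe_load.
parameters = {
    "model_options": {
        "test_size": 0.3,
        "random_state": 3,
        "features": ["engines", "passenger_capacity", "crew"],
    }
}

model_options = read_model_options(parameters)
print(model_options.test_size, model_options.random_state)
```

The rest of the notebook can then refer to `model_options.test_size` instead of repeating the nested dictionary lookups.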
How to use Kedro to load magic values from YAML
Kedro offers a configuration loader to abstract loading values from a YAML file. I decided to use Kedro's configuration loading, without a full Kedro project, to replace the call to `yaml.safe_load`. For this, I did need to install Kedro, but I didn’t need to create a Kedro project, since I could just drop the following code into my notebook to use Kedro's `OmegaConfigLoader` to load `parameters.yml`:
```python
from kedro.config import OmegaConfigLoader

conf_loader = OmegaConfigLoader(".", base_env="", default_run_env="")
# Read the configuration file
conf_params = conf_loader["parameters"]

# Reference the values as previously
test_size = conf_params["model_options"]["test_size"]
random_state = conf_params["model_options"]["random_state"]
X = model_input_table[conf_params["model_options"]["features"]]
...
```
At this point I had eliminated hard-coded values by using the Kedro configuration loader. The next code smell I needed to tackle was loading data.
Smell 2: Hardcoded file paths
As I started to use more data sources in my project, I encountered problems related to dataset management. I moved data sources around when I decided that it didn’t make sense to store them on my local machine. When the location of data changed, for example when I moved local data to an S3 bucket, I found myself manually changing all the hardcoded file paths in my project. I also needed to convert datasets to different formats to work with them, and that meant changing the loading code for each file.
For example, in a Jupyter notebook I began by reading in data sources stored in a `/data` subdirectory like this:
```python
import pandas as pd

companies = pd.read_csv('data/companies.csv')
reviews = pd.read_csv('data/reviews.csv')
shuttles = pd.read_excel('data/shuttles.xlsx', engine='openpyxl')
```
When the location of these files changed, I had to go through and update these paths.
Another friction point was in the data-processing stage, when ‘new’ datasets were created after cleaning, transforming or combining datasets. When these datasets were generated as outputs, the code generating them had to be re-run if the input files changed. This could be time-consuming for larger datasets, so if I created a dataset I was happy with, I would often save it out as a file so it could be loaded in directly without regeneration.
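The manual "save it once, reload it later" habit described above can be sketched roughly as follows. This is an illustrative pattern using only the standard library, not code from the original notebook: `load_or_build` and `build_model_input_table` are hypothetical names, and JSON stands in for whatever file format the dataset would really use.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_build(path: Path, build: Callable[[], list]) -> list:
    """Return the dataset cached at `path`, building and saving it if absent."""
    if path.exists():
        with path.open(encoding="utf-8") as cached_file:
            return json.load(cached_file)
    dataset = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as cached_file:
        json.dump(dataset, cached_file)
    return dataset


def build_model_input_table() -> list:
    # Placeholder for an expensive cleaning and joining step.
    return [{"engines": 2, "price": 1325.0}]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = Path(tmp_dir) / "model_input_table.json"
    first = load_or_build(cache_path, build_model_input_table)  # builds and saves
    second = load_or_build(cache_path, build_model_input_table)  # reads the file
```

Note the weakness of this approach: the helper has no idea whether the input files have changed since the cached copy was written, and every notebook that uses it needs its own path and format handling.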
Using Kedro for data handling
Kedro’s Data Catalog addresses the problems I’ve described above by providing a separate place to declare and manage datasets, acting as a registry of all data sources available for use by a project. Kedro provides built-in datasets for different file types and file systems, so you don’t have to write any of the logic for reading or writing data. A range of datasets is available, including CSV, Excel, Parquet, Feather, HDF5, JSON, Pickle, SQL Tables, SQL Queries, Spark DataFrames, and more, built on the APIs of pandas, spark, networkx, matplotlib, yaml, and beyond. The Data Catalog relies on [fsspec](<https://filesystem-spec.readthedocs.io/en/latest/>) to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. You can pass arguments in to load and save operations, and use versioning and credentials for data access.
Even without a full Kedro project, I could still take advantage of its data handling solution in my existing project within a Jupyter notebook. To start using the Data Catalog, I created a `catalog.yml` file in my notebook directory to define the datasets I wanted to use:
```yaml
companies:
  type: pandas.CSVDataSet
  filepath: data/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/reviews.csv

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/shuttles.xlsx
```
Then I used Kedro to load the `catalog.yml` file and reference the Data Catalog in my Jupyter notebook:
```python
from kedro.io import DataCatalog

import yaml

# load the configuration file
with open("catalog.yml") as f:
    conf_catalog = yaml.safe_load(f)

# Create the DataCatalog instance from the configuration
catalog = DataCatalog.from_config(conf_catalog)

# Load the datasets
companies = catalog.load("companies")
reviews = catalog.load("reviews")
shuttles = catalog.load("shuttles")
```
Even better, I realised I could use Kedro’s config loader to initialise the Data Catalog:
```python
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(".", base_env="", default_run_env="")
conf_catalog = conf_loader["catalog"]

# Create the DataCatalog instance from the configuration
catalog = DataCatalog.from_config(conf_catalog)

# Load the datasets
companies = catalog.load("companies")
reviews = catalog.load("reviews")
shuttles = catalog.load("shuttles")
```
Another smell eliminated! The next issue to address was the code duplication in my notebook.
Smell 3: Code duplication and execution order
In several of my notebooks, especially while writing data processing code, I noticed that I had duplicated code between notebooks and sometimes between cells within the same notebook. Even where it was not a perfect duplicate, the code would be similar with small changes, for example being used on a different dataset or with a different file name.
In my head, I could hear my undergrad computing professor telling me to separate out these lines of code into their own function. However, the few times I did do this, I often ran into cell execution order errors. If I edited a function I had defined in an earlier cell I had to remember to rerun the cell to make sure all cells were using the same version of the function. With several functions in different cells, this could quickly get out of hand and I questioned the value of having functions at all. I took to running most of my notebooks from the start every time to ensure that cell execution order wasn’t the cause of any issue. With smaller notebooks this is not a huge time sink, but as my notebooks got larger, this created long waiting times which disrupted my development process.
Kedro eliminates the problem by mapping a chunk of code in a notebook cell into a self-contained function that can be used as the basis of a node in a Kedro pipeline. Pipelines can be run entirely or from specific start and end points, and Kedro takes care of the running order, which helps you execute and debug your code quickly. It’s certainly a step up from Run All in a Jupyter notebook, or trying to remember the correct order in which to run all the cells.
Node outputs can be stored as intermediate datasets, so should the pipeline fail at a specific node, execution can be resumed from the last correctly executed node, with the dataset in the corresponding state. This is especially useful for pipelines that take a long time to execute, saving time by only re-executing the necessary sections.
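To give a feel for how a running order can be derived from declared inputs and outputs rather than from cell position, here is a toy sketch using Python's standard `graphlib` module. It is not Kedro's implementation or API: the `steps` dictionary, the `resolve_run_order` function, and the `create_model_input_table` step name are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical step declarations: name -> (inputs, outputs).
# They are deliberately listed out of order.
steps = {
    "create_model_input_table": (
        ["preprocessed_companies", "preprocessed_shuttles", "reviews"],
        ["model_input_table"],
    ),
    "preprocess_companies": (["companies"], ["preprocessed_companies"]),
    "preprocess_shuttles": (["shuttles"], ["preprocessed_shuttles"]),
}


def resolve_run_order(steps: dict) -> list:
    """Order steps so each one runs after the steps that produce its inputs."""
    # Which step produces each dataset?
    producer = {
        output: name for name, (_, outputs) in steps.items() for output in outputs
    }
    # Map each step to the steps it depends on; raw inputs have no producer.
    dependencies = {
        name: {producer[i] for i in inputs if i in producer}
        for name, (inputs, _) in steps.items()
    }
    return list(TopologicalSorter(dependencies).static_order())


run_order = resolve_run_order(steps)
print(run_order)
```

Both preprocessing steps come out before `create_model_input_table`, whatever order they were declared in. The relative order of the two preprocessing steps is not fixed, because neither depends on the other.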
Here is how I converted two Jupyter notebook cells into the corresponding functions that are wrapped into nodes for use in a pipeline.
```python
####### Before refactoring ###########

# Prototyping code written in the global scope
companies["iata_approved"] = companies["iata_approved"] == "t"
companies["company_rating"] = (
    companies["company_rating"].str.replace("%", "").astype(float) / 100
)

shuttles["d_check_complete"] = shuttles["d_check_complete"] == "t"
shuttles["moon_clearance_complete"] = (
    shuttles["moon_clearance_complete"] == "t"
)
shuttles["price"] = (
    shuttles["price"]
    # regex=False treats "$" as a literal character, not an end-of-string anchor
    .str.replace("$", "", regex=False)
    .str.replace(",", "")
    .astype(float)
)

####### Refactored 👍 ###########

# Converted to unit testable, pure python functions
def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    companies["iata_approved"] = companies["iata_approved"] == "t"
    companies["company_rating"] = companies[
        "company_rating"
    ].str.replace("%", "")
    companies["company_rating"] = (
        companies["company_rating"].astype(float) / 100
    )
    return companies


def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    shuttles["d_check_complete"] = shuttles["d_check_complete"] == "t"
    shuttles["moon_clearance_complete"] = (
        shuttles["moon_clearance_complete"] == "t"
    )
    # regex=False treats "$" as a literal character, not an end-of-string anchor
    shuttles["price"] = (
        shuttles["price"].str.replace("$", "", regex=False).str.replace(",", "")
    )
    shuttles["price"] = shuttles["price"].astype(float)
    return shuttles
```
I could have gone further in refactoring because there's some duplicate code that applies similar transformations to columns in the dataset, so I could define some utility functions and reduce code duplication, but I won't show that here.
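As a rough idea of what such helpers might look like, here is a sketch that works on single string values, which keeps each function trivial to unit test. The names `is_true`, `parse_percentage` and `parse_money` are hypothetical and are not taken from the notebook above.

```python
def is_true(value: str) -> bool:
    """Interpret the "t" flag used in the raw data as True."""
    return value == "t"


def parse_percentage(value: str) -> float:
    """Convert a string such as "85%" to the fraction 0.85."""
    return float(value.replace("%", "")) / 100


def parse_money(value: str) -> float:
    """Convert a string such as "$1,325.0" to the float 1325.0."""
    return float(value.replace("$", "").replace(",", ""))
```

Inside the preprocessing functions, each helper could then be applied to a column with pandas' `Series.map`, for example `shuttles["price"].map(parse_money)`. For large datasets, vectorised helpers that take and return a whole column would likely be faster than mapping over single values.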
Refactoring is used commonly when converting a notebook into a full Kedro project, to convert code in cells into functions that are used as the basis of nodes. You can see how this is done in an example in the Kedro documentation.
In this article, I’ve described some of the “code smells” I ran into with notebooks as I used them for experimenting with code: hard-coded values, hard-coded data locations and data loading code, and code duplication, where refactoring into functions led to issues with cell execution order as the project grew.
I found that Kedro can eliminate those smells with configuration management and the Data Catalog, and I have started to refactor code into smaller functions as a gateway to converting my notebook into a full-blown Kedro project.