In the first two parts of this series (part 1, part 2), we covered the following:
- Setting Up the Unity Catalog for Medallion Architecture: We organized our data into bronze, silver, and gold layers within the Unity Catalog, establishing a structured and efficient data management system.
- Ingesting Data into Unity Catalog: We demonstrated how to import raw data into the system, ensuring consistency and quality for subsequent processing stages.
- Training the Model: Utilizing Databricks, we trained a machine learning model tailored to our dataset, following best practices for scalable and effective model development.
- Hyperparameter Tuning with HyperOpt: To enhance model performance, we employed HyperOpt to automate the search for optimal hyperparameters, improving accuracy and efficiency.
- Experiment Tracking with Databricks MLflow: We utilized MLflow to log and monitor our experiments, maintaining a comprehensive record of model versions, metrics, and parameters for easy comparison and reproducibility.
- Batch Inference: Implementing batch processing to generate predictions on large datasets, suitable for applications like bulk scoring and periodic reporting.
- Online Inference (Model Serving): Setting up real-time model serving to provide immediate predictions, essential for interactive applications and services.
- Model Monitoring: Ensuring that deployed models maintain optimal performance and reliability over time.
In this last part, we'll see how to automate the whole process using Gitlab, Databricks Asset Bundles, and Databricks jobs.
Let's dive in!
Orchestration
Databricks offers various tools for programmatically automating the management of jobs and workflows. In particular, they simplify the development, deployment, and launch of Databricks workflows across multiple environments. In principle, all these tools are built around the Databricks REST API, which lets us manage and control Databricks resources such as clusters, workspaces, workflows, and machine learning experiments and models. They are designed to be used both inside CI/CD pipelines and as part of local tooling for rapid prototyping.
Here is a list of some of these tools:
- Databricks CLI eXtensions, aka dbx (Legacy): This is one of the first generations of such tools. You can read my blog post on how to use dbx to build a CI pipeline. Databricks is no longer developing this tool and recommends using either of the tools below.
- Databricks Asset Bundles, aka DAB: This is the recommended Databricks deployment framework for streamlining the development of complex data, analytics, and ML projects for the Databricks platform. Developers use YAML syntax to declare resources and configurations. DAB lets developers modularize the source files and metadata that are used to provision and manage infrastructure and other resources.
- Databricks SDK for Python (Beta): This is the latest Databricks tool (currently in Beta as of version 0.53.0). With it, Databricks aims to reduce the overhead of managing resources for Python developers. The Python SDK lets developers set up, configure, and build resources using Python scripts (see the short sketch after this list).
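To give a flavor of the SDK approach (we won't use it further in this tutorial), here is a minimal sketch that authenticates against a workspace and lists its clusters and jobs. It assumes you've run pip install databricks-sdk and that the asset-bundle-tutorial profile used later in this post exists in your ~/.databrickscfg; adapt the profile name to your own setup.

from databricks.sdk import WorkspaceClient

# Authenticate using a profile from ~/.databrickscfg (or DATABRICKS_HOST/DATABRICKS_TOKEN env vars)
w = WorkspaceClient(profile="asset-bundle-tutorial")

# List the clusters in the workspace
for cluster in w.clusters.list():
    print(cluster.cluster_id, cluster.cluster_name, cluster.state)

# List the jobs (workflows) defined in the workspace
for job in w.jobs.list():
    print(job.job_id, job.settings.name)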
In this tutorial, we will focus on Databricks Asset Bundles.
CI/CD pipeline using Asset Bundles
In my previous blog I described how we can build a simple CI pipeline using Gitlab and Databricks. I suggest going through that post before you continue reading. There I described the structure and different components of such projects in more detail. Here, I’d like to expand on that piece by adding the following:
- multi-environment setup (dev, staging, prod)
- CD pipeline
As mentioned at the beginning of this blog series, I'll try to follow the proposed Databricks reference architecture as closely as possible. I'll implement a three-stage deployment architecture and follow the deploy-code model deployment pattern instead of deploy-models, as described here.
Databricks recommends using separate environments or workspaces for each stage. However, to keep resource usage minimal for this tutorial, I use a single workspace for all stages. To separate the data and artifacts for each stage, we set up a unique catalog for each environment within the workspace. We also store the bundle data for each stage in separate folders.
But I will show you how you can adapt the code here for setting up separate workspaces. Here is a summary of what happens in each stage:
- Development: this is the experimentation stage where I'll develop and test code for data ingestion and transformation, feature engineering, model training, model optimization, and deployment. Basically, all the code that we've seen so far is developed in this stage.
- Staging: In practice, this is the stage where we should test our pipelines to make sure they are ready for production. This is also where we should run our unit, integration, and other tests. What is important here is that the staging environment should match the production environment as closely as is reasonable. To keep things simple, we are not spending any time on developing tests. We use a few simple test cases just to demonstrate the git workflow. I might devote a blog post to this topic in the future.
- Production: This is the final station. The code and artifacts in this stage are used for real-world scenarios, for example, showing recommendations to users and capturing their interactions with your applications.
OAuth M2M Authentication (Service Principal)
In part 2 (linked above), I showed you how to use Databricks personal access token authentication. However, the recommended authentication method for unattended authentication scenarios such as CI/CD or Airflow data pipelines is to use the Service Principal. In this tutorial, I've used the service principal but only for the production environment.
To read more about the advantages of service principals and how to create one in your workspace, read this page. Some important notes:
- To create a service principal you must be an account admin.
- Service principals are always identified by an application ID, which can be retrieved from the service principal's page in your workspace admin settings.
- For users or groups that need to use the service principal, the Service principal: User role should be granted explicitly. In this case, I also have Manager permissions. You can add a user by clicking the Grant access button.
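Once the service principal exists and has an OAuth secret, a CI runner (or your local CLI) can authenticate as it. As a rough sketch, and assuming you generate an OAuth secret for the service principal, a machine-to-machine profile in ~/.databrickscfg could look like this (the profile name prod-sp is just an example):

[prod-sp]
host          = https://<your-workspace-url>
client_id     = <service-principal-application-id>
client_secret = <oauth-secret-of-the-service-principal>

In this tutorial, though, we keep the token-based profile for dev and staging and only pass the service principal's application ID to the production deployment, as you'll see in the CI/CD pipeline below.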
Configure DAB
The first step is to set up the bundle files. The databricks.yml file is the heart of our project. This is where we define all our workflows and relevant resources. I've already explained the different components of this file in the previous blog. This time we have:
- three stages instead of one.
- different configurations for the staging and production environments than for development
- a service principal instead of a personal access token for authentication in production
We’ll also look at how to modularize our bundle by separating it into different files. For this tutorial, I’ve divided the configuration file into three modules.
- databricks.yml: the parent configuration file that defines the general structure of our bundle
- resources.yml: defines the default workflows and jobs
- target.yml: defines target-specific workflows, jobs, and configuration
Let's start with the highest-level configuration file of our bundle, the databricks.yml file:
# yaml-language-server: $schema=bundle_config_schema.json

bundle:
  name: DAB_tutorialbnu

variables:
  my_cluster_id:
    description: The ID of an existing cluster.
    default: <cluster_id>
  service_principal_name:
    description: Service Principal Name for the production environment.
    default: 00-00 # use some random value as a fill-in

workspace:
  profile: asset-bundle-tutorial
  # host:

include:
  - ./bundle/*.yml
On the first line, we see yaml-language-server: $schema=bundle_config_schema.json. Databricks Asset Bundle configuration uses a JSON schema to make sure our config file has the right format. You can generate this file using Databricks CLI version 0.205 or above:
databricks bundle schema > bundle_config_schema.json
In the variables mapping, we define two variables:
- the ID of an existing cluster in our workspace that we'd like to use for running our jobs in the development environment.
- the name of our service principal for running jobs in the production environment. The reason for defining this variable is that we want to set it at runtime, when we deploy and run our jobs in the production environment. The default value is just a placeholder; don't use the actual application ID here! We'll pass the real value with the --var option at deploy time, as shown below.
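For example, the placeholder is overridden like this (this mirrors the command used in the CD pipeline later in this post):

databricks bundle deploy -t production --var="service_principal_name=<application-id>"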
In the workspace mapping, we define our default workspace configuration, such as the host address or the profile name. Since we only have one workspace, we can use this information for all the stages of our pipeline.
Finally, in the include mapping, we import the different modules of our bundle configuration into the databricks.yml file. I put these modules into a folder named bundle, which contains the two files resources.yml and target.yml.
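Putting it all together, the project layout looks roughly like this (the notebook names are taken from the job definitions below; your structure may differ slightly):

.
├── databricks.yml
├── .gitlab-ci.yml
├── bundle/
│   ├── resources.yml
│   └── target.yml
└── notebooks/
    ├── 1_initiate_env.py
    ├── 2_ingest.py
    ├── 3_transform.py
    ├── 4_training.py
    ├── 5_deployment.py
    ├── 6_batch_inference.py
    └── run_unit_test.py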
Define workflows
We define all our workflows and jobs in the bundle/resources.yml file. There are four workflows:
- environment initialization workflow (init_workflow): creates all the necessary catalogs, schemas, tables, and volumes
- data ingestion workflow (ingestion_workflow): ingests data from the sources, performs the necessary formatting and transformation, and writes it to the right catalog and schema
- model training workflow (training_workflow): trains, optimizes, deploys, and monitors the model
- testing workflow (unity_test_workflow): performs the unit and integration tests
Take a look at this file:
resources:
  jobs:
    init_workflow:
      name: "[${bundle.target}]-init"
      tasks:
        - task_key: setup-env
          notebook_task:
            notebook_path: ../notebooks/1_initiate_env.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}

    ingestion_workflow:
      name: "[${bundle.target}]-ingestion"
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ../notebooks/2_ingest.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}

        - task_key: feature-store
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ../notebooks/3_transform.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}

    training_workflow:
      name: "[${bundle.target}]-training"
      tasks:
        - task_key: training
          notebook_task:
            notebook_path: ../notebooks/4_training.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}
          libraries:
            - pypi:
                package: "mlflow-skinny[databricks]" # just the package spec; the cluster installs it

        - task_key: batch_inference
          depends_on:
            - task_key: training
          notebook_task:
            notebook_path: ../notebooks/6_batch_inference.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}

        - task_key: deployment
          depends_on:
            - task_key: training
          notebook_task:
            notebook_path: ../notebooks/5_deployment.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}

    unity_test_workflow:
      name: "[${bundle.target}]-unity_test"
      tasks:
        - task_key: unity_test
          existing_cluster_id: ${var.my_cluster_id}
          notebook_task:
            notebook_path: ../notebooks/run_unit_test.py
            source: WORKSPACE
          libraries:
            - pypi:
                package: pytest
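At this point, you can already exercise these workflows from your local machine. For example, using the dev target and the asset-bundle-tutorial profile defined above, a quick sanity check would be:

databricks bundle validate
databricks bundle deploy -t dev
databricks bundle run -t dev init_workflow

These are the same commands that the CI pipeline later in this post runs on the Gitlab runner.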
Pass Runtime Context Variables to Notebooks
Most of the definitions in the code above are similar to what we saw in the previous blog. The main difference is the use of the base_parameters field for the notebook tasks. This allows us to pass context about a task to the notebook. In our case, we use it to pass the environment name to each notebook so it can apply the correct settings when running the code.
For example, in the 1_initiate_env.py notebook, we create and use a catalog as follows:
import json

with open('config.json') as config_file:
    config = json.load(config_file)

catalog_name = config['catalog_name']
# bronze_layer, silver_layer, gold_layer, and output_schema hold the schema names,
# which are loaded from the config as well

spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}")
spark.sql(f"USE CATALOG {catalog_name}")

#-- create all the necessary schemas within our catalog
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {bronze_layer}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {silver_layer}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {gold_layer}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {output_schema}")
What we need now is to set the catalog name based on the stage that is currently running the code. For this, we set the value at runtime, when we deploy and run our bundle. This communication happens through the base_parameters field in our bundle file and notebook widgets. In this case, we pass ${bundle.target}.
This is how the new code looks:
dbutils.widgets.text(name="env", defaultValue="staging", label="Environment Name")
env = dbutils.widgets.get("env")
#...
catalog_name = f"{config['catalog_name']}_{env}"
#.....
To see what other context information you can send to your notebook, check out Substitutions in bundle configuration.
Environment-specific Configuration
In the CI/CD process, our production and staging environments use different settings and resources for running the jobs. For example, we use larger datasets to train our model, or clusters with more resources to serve users. Asset bundles allow us to partially override our job and workflow definitions to fit the needs of a specific target/stage. That is, we can change certain parameters of our default workflows in resources.yml for each target. Let's see how this works by looking at the target.yml file:
new_cluster: &new_cluster
  new_cluster:
    num_workers: 3
    spark_version: 13.3.x-cpu-ml-scala2.12
    node_type_id: i3.xlarge
    autoscale:
      min_workers: 1
      max_workers: 3
    custom_tags:
      clusterSource: prod_13.3

targets:
  # The 'dev' target, used for development purposes.
  # Whenever a developer deploys using 'dev', they get their own copy.
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true

  staging:
    workspace:
      host: <host address of staging workspace>
      root_path: /Shared/staging-workspace/.bundle/${bundle.name}/${bundle.target}
    resources:
      jobs:
        playground_workflow:
          name: ${bundle.target}-${var.model_name}
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: playground
              job_cluster_key: model_training_job_cluster
              notebook_task:
                base_parameters:
                  workload_size: Medium
                  scale_to_zero_enabled: "False"

        ingestion_workflow:
          name: "[${bundle.target}]-ingestion"
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: ingest
              job_cluster_key: model_training_job_cluster
            - task_key: feature-store
              job_cluster_key: model_training_job_cluster

        training_workflow:
          name: "[${bundle.target}]-training"
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: training
              job_cluster_key: model_training_job_cluster
            - task_key: batch_inference
              job_cluster_key: model_training_job_cluster
            - task_key: deployment
              job_cluster_key: model_training_job_cluster
              notebook_task:
                base_parameters:
                  env: ${bundle.target}
                  workload_size: Medium
                  scale_to_zero_enabled: False
              depends_on:
                - task_key: training
  production:
    mode: production
    workspace:
      host: <host address of production workspace>
      root_path: /Shared/production-workspace/.bundle/${bundle.name}/${bundle.target}
    variables:
      service_principal_name:
        description: Service Principal Name for the production environment.
    run_as:
      service_principal_name: ${var.service_principal_name}
    resources:
      jobs:
        ingestion_workflow:
          name: "[${bundle.target}]-ingestion"
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: ingest
              job_cluster_key: model_training_job_cluster
            - task_key: feature-store
              job_cluster_key: model_training_job_cluster

        training_workflow:
          name: "[${bundle.target}]-training"
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: training
              job_cluster_key: model_training_job_cluster
            - task_key: batch_inference
              job_cluster_key: model_training_job_cluster
            - task_key: deployment
              job_cluster_key: model_training_job_cluster
              notebook_task:
                base_parameters:
                  env: ${bundle.target}
                  workload_size: Medium
                  scale_to_zero_enabled: False
              depends_on:
                - task_key: training
Here we make three changes for the staging and production stages.
- Instead of using the existing cluster that we used for development, we want to run our jobs on a larger job compute with autoscaling. For this, we define a new cluster and use the &new_cluster anchor to refer to it in each job.
- Similarly, we want to deploy our model serving endpoint using a larger compute instance. For this, we update the base_parameters in the deployment task.
- We specify root_path under the workspace mapping to store the artifacts and files of each stage in a different folder in our workspace.
You can see that we don't need to rewrite the entire job and task definitions for each target; we only specify 1) the job and task keys, and 2) the parameters that we wish to change or add. Databricks uses these keys to join the task settings in the top-level resources mapping with the task settings in the targets mapping. More about this in this Databricks article.
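To make the merge concrete, here is an illustrative sketch (not the output of any Databricks command) of roughly what the training task of the [staging]-training job looks like after the resources.yml and target.yml settings are joined:

- task_key: training
  notebook_task:
    notebook_path: ../notebooks/4_training.py     # from resources.yml
    source: WORKSPACE
    base_parameters:
      env: staging                                # ${bundle.target} resolved for this target
  libraries:
    - pypi:
        package: "mlflow-skinny[databricks]"      # from resources.yml
  job_cluster_key: model_training_job_cluster     # added by the staging target in target.yml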
Additionally, we use Databricks Asset Bundle deployment modes for development and production environments. They provide an optional collection of default behaviors that correspond to each of these modes.
Finally, in the production environment, we use the run_as mapping to specify the identity to use when running Databricks Asset Bundles workflows. In this case, we set its value to the variable that we defined earlier in our databricks.yml file.
Git workflow
Defining the right git strategy depends on many factors such as the team size and composition, project size, and deployment life cycle. Here we follow the Databricks standard workflow as described here.
- Development: ML code is developed in the development environment, with code pushed to a dev (or feature) branch.
- Testing: Upon making a pull request from the dev branch to the main branch, a CI trigger runs unit tests on the CI runner and integration tests in the staging environment.
- Merge code: After successfully passing these tests, changes are merged from the dev branch to the main branch.
- Release code: The release branch is cut from the main branch, and doing so deploys the project ML pipelines to the production environment.
Set up the Repo
To set up the repo, we create three branches: main, dev, and release. main will be our default branch. Make sure that the main and release branches are protected; that is, they can't be updated directly but only through merge requests. Then go ahead and clone your Databricks repo. I assume you've already integrated your Git provider with your Databricks workspace.
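If you prefer the command line, creating the branches could look roughly like this (branch protection itself is configured in the Gitlab UI under Settings → Repository → Protected branches):

git checkout -b dev
git push -u origin dev
git checkout main
git checkout -b release
git push -u origin release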
Next, we add the token, host, and service principal application ID to our Gitlab CI/CD settings. To do so, open your repo in Gitlab and go to Settings → CI/CD → Variables → Add variable. In the pipeline below, these variables are referenced as $DATABRICKS_TOKEN, $DATABRICKS_HOST, and $var_spn_prod.
Define the Pipeline
Now we need to define our CI/CD pipeline in the .gitlab-ci.yml file. We define two stages:
- onMerge: triggered when we open a merge request from the dev or feature branch to the main branch. It runs our unit-test and integration-test jobs.
- onRelease: triggered when we push or merge the changes from the main branch to the release branch.
image: python:3.10

stages: # List of stages for jobs, and their order of execution
  - onMerge
  - onRelease

default:
  before_script:
    - echo "install databricks cli"
    - curl -V
    - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
    - echo "databricks CLI installation finished"
    - echo "Creating the configuration profile for token authentication..."
    - |
      {
        echo "[asset-bundle-tutorial]"
        echo "token = $DATABRICKS_TOKEN"
        echo "host = $DATABRICKS_HOST"
      } > ~/.databrickscfg
    - echo "validate the bundle"
    - databricks bundle validate
  after_script:
    - echo "remove all workflows"
    #- databricks bundle destroy --auto-approve

unity-test:
  stage: onMerge
  script:
    - echo "--- Running the unit tests"
    - databricks bundle deploy -t dev
    - databricks bundle run -t dev unity_test_workflow
    - databricks bundle destroy --auto-approve -t dev
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "main"'

integration-test:
  stage: onMerge
  needs:
    - unity-test
  script:
    - echo "--- Running the <minimal> integration tests on the staging env"
    # - echo "validate bundle staging"
    - databricks bundle validate -t staging
    - databricks bundle deploy -t staging
    - databricks bundle run -t staging init_workflow
    - databricks bundle run -t staging ingestion_workflow
    - databricks bundle run -t staging training_workflow
    #- databricks bundle destroy --auto-approve -t staging
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "main"'

deploy_for_prod: # This job runs in the onRelease stage.
  stage: onRelease
  script:
    - echo "validate bundle production"
    - databricks bundle validate --var="service_principal_name=$var_spn_prod" -t production
    - echo "Deploying jobs"
    - databricks bundle deploy --var="service_principal_name=$var_spn_prod" -t production
    # - databricks bundle run -t prod ingestion_workflow
    # - databricks bundle run -t prod training_workflow
    - echo "Application successfully deployed for production"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "release" || $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "testi"'
Our CI/CD workflow consists of a mix of manual and automated steps (a sketch of the corresponding Git commands follows the list):
- push the changes to the dev branch
- create a merge request to merge the changes from dev to the main branch (manual)
- run the unit and integration tests (automatic)
- merge the changes once all the test jobs succeed (manual)
- create a merge request from the main branch to the release branch (manual)
- deploy the job to the Databricks production environment (automatic)
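As a rough sketch, the manual Git side of this flow (using the branch names from above) looks like this:

# work on the dev branch and push the changes
git checkout dev
git add .
git commit -m "update ingestion and training workflows"
git push origin dev

# then, in the Gitlab UI:
#   1. open a merge request dev -> main       (the onMerge stage runs the tests)
#   2. merge it once the pipeline succeeds
#   3. open and merge a merge request main -> release   (the onRelease stage deploys to production)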
It is possible to automate the whole process as part of our CI/CD pipeline. We can also add new stages/steps, like creating tags or releases. But for now, we'll skip that 😉
After running the pipeline, you should see the following workflows in your Databricks Workflows window. We don't see the dev workflows because we run databricks bundle destroy -t dev as part of the unity-test step.
If you check your Workspace → Shared folder, you'll see two separate folders for your staging and production bundle files.
You can find the bundle files for your dev environment under Workspace/Users/<username>/.bundle/<bundle name>. Similarly, you'll find a different experiment name for each environment in your Databricks Experiments window:
default: /Users/${workspace.current_user.userName}/${bundle.target}-my_mlops_project-experiment
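For reference, the line above is the default value of an experiment_name variable; in the bundle it would be declared roughly like this (the description text is my own wording, not the exact file from the project):

variables:
  experiment_name:
    description: Experiment name for the model training job.
    default: /Users/${workspace.current_user.userName}/${bundle.target}-my_mlops_project-experiment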
Two notes about passing the service_principal_name:
- There are different options for setting the service_principal_name in the CI/CD pipeline. Here, we use the --var option as part of our bundle commands (see the sketch below for an alternative).
- The variable should be defined at the top level and then overridden for a specific target. If we define the variable only in a target environment, we get the error: Error: variable service_principal_name is not defined but is assigned a value
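For completeness: as far as I know, the Databricks CLI can also pick up bundle variables from environment variables of the form BUNDLE_VAR_<variable_name>, which plays nicely with Gitlab's CI/CD variables. A sketch of what that would look like in the deploy job:

# instead of --var, export the variable before calling the CLI
- export BUNDLE_VAR_service_principal_name=$var_spn_prod
- databricks bundle deploy -t production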
Databricks MLOps Stacks
Databricks provides Databricks MLOps Stacks to reduce the overhead of setting everything up from scratch, as we did here. It's a great tool that gives you a head start on setting up your ML projects, and I adapted part of this tutorial from their template. However, building things up from scratch always helps me understand the details and the thinking behind the processes and tools. One thing I did differently from the MLOps Stacks template was to configure and define my ML model and experiments through the bundle. If you understand the principles behind asset bundles, you can easily adapt it for your project. Make sure you check it out and adapt it to your needs.
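If you want to try it, the template can be instantiated directly from the Databricks CLI; the prompts ask for details such as the project name, cloud, and CI/CD provider:

databricks bundle init mlops-stacks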
That's it! Hope this blog series helps you build some great things.
Any feedback you have for me would be much appreciated!
And as always: happy building :)