Streamlining Your Databricks Environment Setup
A practical guide to setting up your Databricks environment efficiently
I'm pretty sure that if you're using Databricks to run your PySpark jobs, these are your typical steps:
Design and develop the business logic.
Write a notebook that performs all of that business logic.
Run the notebook using a Databricks Workflow.
This is easier said than done. The logistics of running these notebooks are one of the biggest headaches. You almost certainly won’t be running just one or two notebooks, and setting up the environment correctly for every one of them can be cumbersome. It becomes even tougher if you are using proprietary/private libraries.
The following is the system I came up with, and it is the most practical solution I have found:
Step 1: Universal environment setup notebook.
Create a universal environment setup notebook in your repo.
This notebook can be placed at the root of your repo so that all other notebooks can easily access it.
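For context, here is a hypothetical repo layout (the folder and notebook names are illustrative); the relative %run path used in Step 5 assumes the job notebooks sit two levels below the root:

repo_root/
    databricks_environment_setup        <- universal setup notebook (Step 1)
    jobs/
        sales_pipeline/
            sales_job_notebook          <- downstream job notebook
        inventory_pipeline/
            inventory_job_notebook      <- downstream job notebook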
Step 2: Secrets, Environment variables, Constants, etc
Secrets, variables, and constants change depending on the environment in which the notebook is running, so they must be set accordingly.
We use dbutils.secrets to fetch all the secrets. Let’s see it in action:

# Getting all secrets
# Note - It is important to make sure that the scope is appropriately mapped to
# the secret store assigned to the environment-specific workspace
PAT = dbutils.secrets.get("<your_secret_scope>", "<your_pat_secret_name>")
SECRET_1 = dbutils.secrets.get("<your_secret_scope>", "<secret_1_name>")
SECRET_2 = dbutils.secrets.get("<your_secret_scope>", "<secret_2_name>")
SECRET_3 = dbutils.secrets.get("<your_secret_scope>", "<secret_3_name>")
# Environment-specific constants
databricks_host = spark.conf.get("spark.databricks.workspaceUrl")

if databricks_host == "<your dev workspace host url>":
    environment = "dev"
    CONSTANT_1 = "dev constant 1"
    CONSTANT_2 = "dev constant 2"
    volume_path = "/Volumes/dev/path"
    catalog_name = "dev_catalog"
elif databricks_host == "<your uat workspace host url>":
    environment = "uat"
    CONSTANT_1 = "uat constant 1"
    CONSTANT_2 = "uat constant 2"
    volume_path = "/Volumes/uat/path"
    catalog_name = "uat_catalog"
elif databricks_host == "<your prod workspace host url>":
    environment = "prod"
    CONSTANT_1 = "prod constant 1"
    CONSTANT_2 = "prod constant 2"
    volume_path = "/Volumes/prod/path"
    catalog_name = "prod_catalog"
else:
    raise NameError("Incorrect databricks workspace")
Step 3: Passing all variables to the downstream notebook
There are multiple ways to pass all of the above variables to the downstream notebook so that it gets the environment-specific values. I tried two:
Store all the values in a JSON file saved under the dbfs:/tmp/ location, so that the file is destroyed once the notebook job finishes (sketched below). I used this for a while, but Databricks recently revamped the permission logic and now only an admin can access it.
Then I switched to using a TEMP VIEW. This is even better than using a file: there is no need for admin permission.
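For completeness, this is roughly what the file-based approach looked like. It is a minimal sketch; the path "dbfs:/tmp/env_vars.json" and the variable names are illustrative.

# In the setup notebook: write the values out as JSON
import json

env_vars = {"constant_1": CONSTANT_1, "catalog_name": catalog_name}
dbutils.fs.put("dbfs:/tmp/env_vars.json", json.dumps(env_vars), True)  # True = overwrite

# In the downstream notebook: read them back
env_vars = json.loads(dbutils.fs.head("dbfs:/tmp/env_vars.json"))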
So create a TEMP VIEW that holds all these variables. Because %run executes the setup notebook in the same Spark session as the calling notebook, a session-scoped temp view is visible to the downstream notebook without any extra permissions.

# Creating a dict of all variables
env_vars = {
    "secret_1": SECRET_1,
    "secret_2": SECRET_2,
    "secret_3": SECRET_3,
    "constant_1": CONSTANT_1,
    "constant_2": CONSTANT_2,
    "catalog_name": catalog_name
}

# Write environment variables to a TEMP VIEW
spark.createDataFrame([env_vars]).createOrReplaceTempView("env_vars")
Step 4: Installing libraries, especially private ones
You should definitely package your codebase as a Python library, then install and use it just like any other open-source library. This way you won’t have to worry about path issues, relative/absolute import errors, etc.
Keep the package you use in dev/qa (which can be experimental) separate from the one you use in uat/prod (which must be stable).
Follow the PEP 440 guidelines to version your code.
Use an X.Y.devN version for the package published from the develop branch.
Use an X.Y.N version for the package published from the master branch.
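For reference, here is a minimal packaging sketch following this versioning scheme. The package name and version string are placeholders, not values from a real project.

# setup.py - minimal sketch of a private package versioned per PEP 440
from setuptools import setup, find_packages

setup(
    name="my_private_lib",   # hypothetical package name
    # develop branch -> pre-release, e.g. "1.4.dev3" (only installed when pip gets --pre)
    # master branch  -> stable release, e.g. "1.4.3"
    version="1.4.dev3",
    packages=find_packages(),
)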
Use the %pip magic to install the environment-specific private library.

if environment == "dev" or environment == "qa":
    # --pre flag will install the package having the 'dev' label
    # NOTE - use your repository URL appropriately
    %pip install --extra-index-url https://databricks-user:{PAT}@pkgs.dev.azure.com/dp-us-bus/_packaging/dp-pip/pypi/simple/ --pre --upgrade "<your package name>"

if environment == "uat" or environment == "prod":
    %pip install --extra-index-url https://databricks-user:{PAT}@pkgs.dev.azure.com/dp-us-bus/_packaging/dp-pip/pypi/simple/ --upgrade "<your package name>"

# Note - Databricks recommends restarting Python to make sure we'll
# be using the libraries that were just installed
dbutils.restartPython()
Step 5: Running the setup notebook in the downstream job notebook
Use Databricks’ %run magic command, then read the TEMP VIEW and update the values in the os environment.

# Running the setup notebook. Make sure to use the correct relative path
%run ../../databricks_environment_setup

# Read environment variables from the TEMP VIEW & set them as environment
# variables for use in this notebook
import os
os.environ.update(spark.table("env_vars").first().asDict())
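After the setup notebook has run, the job notebook can pull everything it needs from os.environ. Here is a small hypothetical example; the catalog_name key comes from the env_vars dict above, while the schema and table names are made up.

import os

# Values published by the setup notebook are now ordinary environment variables
catalog_name = os.environ["catalog_name"]
constant_1 = os.environ["constant_1"]

# Use them in the business logic, e.g. to target the environment-specific catalog
spark.sql(f"USE CATALOG {catalog_name}")
df = spark.table("my_schema.my_table")  # hypothetical table name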