Streamlining Your Databricks Environment Setup

If you're using Databricks to run your PySpark jobs, your typical steps probably look like this:

  • Design and develop the business logic.

  • Write a notebook that performs all of that business logic.

  • Run the notebook using a Databricks Workflow.

This is easier said than done. The logistics of running these notebooks are one of the biggest headaches: you won't be running just one or two notebooks, and setting up the environment correctly for every one of them can be cumbersome. It becomes even tougher if you are using proprietary/private libraries.

The following is the system I came up with; it's the most practical solution I've found:

  1. Step 1: Universal environment setup notebook.

    • Create a universal environment setup notebook in your repo.

    • This notebook can be placed in the root of your repo so that all other notebooks can easily access it.
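
    • For example, a hypothetical layout (the folder names are placeholders; the %run path used in Step 5 assumes job notebooks sit two levels below the repo root):

        <repo_root>/
        ├── databricks_environment_setup        # universal setup notebook
        └── jobs/
            └── my_job/
                └── job_notebook                # calls %run ../../databricks_environment_setup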

  2. Step 2: Secrets, environment variables, constants, etc.

    • Secrets, variables, and constants change depending on the environment in which the notebook is running, so they must be set accordingly.

    • We use dbutils.secrets to fetch all the secrets. Let's see it in action:

        # Getting all secrets
        # Note - It is important to make sure that the scope is appropriately mapped to
        # the secret store assigned to the environment-specific workspace
        PAT = dbutils.secrets.get("<your_secret_scope>", "<pat_secret_name>")
        SECRET_1 = dbutils.secrets.get("<your_secret_scope>", "<secret_1_name>")
        SECRET_2 = dbutils.secrets.get("<your_secret_scope>", "<secret_2_name>")
        SECRET_3 = dbutils.secrets.get("<your_secret_scope>", "<secret_3_name>")

        # Environment-specific constants
        databricks_host = spark.conf.get("spark.databricks.workspaceUrl")

        if databricks_host == "<your dev workspace host url>":
            environment = "dev"
            CONSTANT_1 = "dev constant 1"
            CONSTANT_2 = "dev constant 2"
            volume_path = "/Volumes/dev/path"
            catalog_name = "dev_catalog"
        elif databricks_host == "<your uat workspace host url>":
            environment = "uat"
            CONSTANT_1 = "uat constant 1"
            CONSTANT_2 = "uat constant 2"
            volume_path = "/Volumes/uat/path"
            catalog_name = "uat_catalog"
        elif databricks_host == "<your prod workspace host url>":
            environment = "prod"
            CONSTANT_1 = "prod constant 1"
            CONSTANT_2 = "prod constant 2"
            volume_path = "/Volumes/prod/path"
            catalog_name = "prod_catalog"
        else:
            raise NameError("Incorrect Databricks workspace")
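
    • To sanity-check that the scope is mapped correctly in a given workspace, you can list the scopes and keys visible to the notebook (a quick sketch; the scope name is a placeholder):

        # List all secret scopes attached to this workspace
        print(dbutils.secrets.listScopes())

        # List the keys available in your scope (the values themselves stay redacted)
        print(dbutils.secrets.list("<your_secret_scope>"))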
      
  3. Step 3: Passing all variables to the downstream notebook

    • There are multiple ways to pass the above variables into the downstream notebook so that it can access environment-specific values. I tried two:

      • Store all the values in a JSON file saved at a dbfs:/tmp/ location, so that the file is destroyed once the notebook job finishes. I used this for a while, but Databricks recently revamped the permission logic and now only admins can access that location.

      • Then I switched to using a TEMP VIEW. This works even better than a file: there is no need for admin permissions.

    • So create a TEMP VIEW that holds all these variables. (Temp views are session-scoped, which works here because the setup notebook runs in the same Spark session via %run.)

        # Creating dict of all variables
      
        env_vars = {
            "secret_1": SECRET_1,
            "secret_2": SECRET_2,
            "secret_3": SECRET_3,
            "constant_1": CONSTANT_1,
            "constant_2": CONSTANT_2,
            "catalog_name": catalog_name
        }
      
        # Write environment variables to a TEMP VIEW
        spark.createDataFrame([env_vars]).createOrReplaceTempView("env_vars")
      
  4. Step 4: Installing libraries, especially private ones

    • You should definitely package your codebase as a Python library, then install and use it just like any other open-source library. This way you won't have to worry about path issues, relative/absolute import errors, etc. (a minimal packaging sketch follows the install snippet below).

    • You need separate packages for dev/qa (which can be experimental) and uat/prod (which must be stable).

    • Follow PEP 440 guidelines to version your code.

      • Use an X.Y.devN version (e.g., 1.4.dev3) for packages published from the develop branch.

      • Use an X.Y.N version (e.g., 1.4.3) for packages published from the master branch.

    • Use %pip to install the environment-specific private library:

        if environment == "dev" or environment == "qa":
            # The --pre flag lets pip pick up pre-release (.devN) versions
            # NOTE - use your repository URL appropriately
            %pip install --extra-index-url https://databricks-user:{PAT}@pkgs.dev.azure.com/dp-us-bus/_packaging/dp-pip/pypi/simple/ --pre --upgrade "<your package name>"

        if environment == "uat" or environment == "prod":
            %pip install --extra-index-url https://databricks-user:{PAT}@pkgs.dev.azure.com/dp-us-bus/_packaging/dp-pip/pypi/simple/ --upgrade "<your package name>"

        # Note - Databricks recommends restarting Python to make sure we'll
        # be using the libraries that were just installed
        dbutils.restartPython()
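
    • For reference, a minimal packaging sketch using setuptools (just one way to do it; the package name, version, and dependencies below are placeholders):

        # setup.py - minimal packaging sketch
        from setuptools import setup, find_packages

        setup(
            name="<your package name>",
            version="1.4.dev3",      # develop-branch build; a master/release build would use e.g. "1.4.3"
            packages=find_packages(),
            install_requires=[
                # list your runtime dependencies here
            ],
        )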
      
  5. Step 5: Running the setup notebook in the downstream job notebook

    • Use Databricks' %run magic command.

    • Read the TEMP VIEW and update the values in the OS environment:

        # Run the setup notebook. Make sure to use the correct relative path.
        # (%run must be in a cell of its own; the code below goes in the next cell.)
        %run ../../databricks_environment_setup

        # Read the variables back from the TEMP VIEW and set them as OS environment
        # variables for use in this notebook
        import os

        os.environ.update(spark.table("env_vars").first().asDict())
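
    • With that in place, the downstream notebook can read whatever it needs from os.environ; for example (using the keys created in Step 3, with the USE CATALOG statement shown purely as an illustration):

        # Example usage in the downstream job notebook
        catalog_name = os.environ["catalog_name"]
        constant_1 = os.environ["constant_1"]

        # e.g. point the Spark session at the environment-specific catalog
        spark.sql(f"USE CATALOG {catalog_name}")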
      
