Databricks

Connecting Databricks to LightBeam


Overview

LightBeam Spectra users can connect various data sources to the LightBeam application, which then continuously monitors them for PII and PHI data.

Examples: AWS Glue, Looker, DynamoDB, Redshift, etc.

About Databricks

Databricks is a platform used for a variety of tasks, including data warehousing, BI, and ML.

The unit that is onboarded with LightBeam is a Databricks workspace. LightBeam finds sensitive data in Databricks tables that are managed by Unity Catalog. Users onboard a workspace with a list of catalogs/databases, and LightBeam scans all the tables inside those catalogs/databases.

LightBeam requires a SQL warehouse to be enabled inside the workspace. It uses this warehouse to execute the SQL queries that sample data from each table.

Features

Datasource Registration

Databricks admins can create a service principal with restricted permissions and register the data source with a personal access token for that service principal. During registration, users are shown a list of the SQL warehouses present in the workspace and must select one warehouse. Users are then shown a list of catalogs and databases and can filter the catalogs/databases that they wish to scan.

Metadata Scanning

LightBeam scans the tables in the Unity Catalog catalogs configured in the scan conditions. For each table, it retrieves the list of columns, their data types, and so on. It also fetches the row count and the size of the table when these are available.
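As an illustration of what a column-metadata lookup can look like, here is a minimal Python sketch that builds a query against the information_schema.columns view that Unity Catalog exposes in each catalog. This is not LightBeam's actual query, which this guide does not document. The function names quote_ident and column_metadata_query, and the example catalog and schema names, are hypothetical.

```python
def quote_ident(name: str) -> str:
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def column_metadata_query(catalog: str, schemas: list[str]) -> str:
    """Build a query listing columns and data types for the given schemas.

    Hypothetical helper for illustration only; not LightBeam's implementation.
    """
    for schema in schemas:
        # Schema names end up inside string literals, so reject risky characters
        # instead of trying to escape them.
        if "'" in schema or "\\" in schema:
            raise ValueError(f"unsupported character in schema name: {schema!r}")
    schema_list = ", ".join(f"'{schema}'" for schema in schemas)
    return (
        "SELECT table_schema, table_name, column_name, data_type "
        f"FROM {quote_ident(catalog)}.information_schema.columns "
        f"WHERE table_schema IN ({schema_list}) "
        "ORDER BY table_schema, table_name, ordinal_position"
    )


# Example with made-up catalog and schema names.
print(column_metadata_query("main", ["sales", "hr"]))
```

The generated statement could be run on the SQL warehouse by a principal that holds the catalog permissions described in the appendix.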

PII Detection

PII detection requires sample data for all the columns of a table. LightBeam uses the SQL warehouse configured during datasource registration to execute the SQL queries that sample data from tables. It samples 5000 rows from each table.
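The sketch below shows one simple way a 5000-row sample query could be built in Python. It assumes a plain LIMIT clause. The guide does not state how LightBeam actually selects the sampled rows, so treat this only as an approximation. The names SAMPLE_ROWS, quote_ident, and sample_query, as well as the example table, are hypothetical.

```python
SAMPLE_ROWS = 5000  # per-table sample size stated in this guide


def quote_ident(name: str) -> str:
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def sample_query(catalog: str, schema: str, table: str, limit: int = SAMPLE_ROWS) -> str:
    """Build a query that reads at most `limit` rows from one table.

    Assumption: a simple LIMIT clause; the real sampling strategy is not documented here.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    full_name = ".".join(quote_ident(part) for part in (catalog, schema, table))
    return f"SELECT * FROM {full_name} LIMIT {limit}"


# Example with a made-up three-level table name.
print(sample_query("main", "sales", "orders"))
```

A query of this shape only needs SELECT on the table plus USE CATALOG and USE SCHEMA on its parents, which matches the minimal permissions listed in the appendix.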


Onboarding Databricks Data Source

  1. Log in to your LightBeam instance.

  2. Click on DATASOURCES on the top navigation bar.

  3. Click on “Add a data source”.

Figure 1. Add a datasource
  4. Search for Databricks.

Figure 2. Databricks datasource
  5. Click on Databricks.

  6. Fill in the details as shown below and click Next:

Figure 3. Datasource configuration

Basic Information

  • Instance Name: This is the unique name given to the data source.

  • Description: An optional field that describes the use of this data source.

  • Primary Owner: Email address of the person responsible for this data source. This person receives alerts by default.

  • Source of Truth: Indicates whether this data source contains data that acts as a single point of truth. LightBeam Spectra uses such data sources to look up entities/attributes, which helps determine whether the attributes/entities found in other data sources are accurate. A Source of Truth data set creates entities based on the attributes found in its data.

  • Location: The location of the data source.

  • Purpose: The purpose of the data being collected/processed.

  • Stage: The stage of the data source. Example: Source, Processing, Archival, etc.

  7. Enter the credentials as shown below and click Test Connection.

Figure 4. Datasource credentials
  8. Verify that the message Test Connection Success appears on the screen, then click Next.

  9. Select a warehouse from the drop-down list.

Figure 5. Select a SQL Warehouse
  10. In the next step, a drop-down list of catalogs is presented. Select the catalogs that you wish to scan.

Figure 6. Select Catalogs
  11. By default, all databases in a catalog are scanned. To exclude a database from the scan conditions, remove it here.

Figure 7. Filter databases from a Catalog.

Please verify that all databases selected for scanning show up in the list of databases. Ensure you've made your desired selections before connecting the data source.

  12. Finally, click on Start Sampling to connect to the Databricks data source.


APPENDIX

Minimal permissions setup

This guide outlines how to create a service account or a user with the minimal permissions necessary for integrating Databricks with LightBeam. Follow either option A or option B.

A. Using an On-behalf-of Token for a Service Account

  1. Go to the Databricks account management console → user management → service principal. Add a new service principal.

Figure 8. Create a service principal
  2. Go to the workspace you want to onboard to LightBeam by clicking Open on the right side of the workspace list.

    Figure 9. Databricks Workspace
  3. Add the newly created service principal to the workspace that you are onboarding with LightBeam. From the workspace console, click the account username at the top right → Settings → Identity and access → Add service principal.

    Figure 10. Databricks Workspace
  4. Give this service principal access to use the SQL warehouse.

    Figure 11. Add service principal to workspace
  5. Go to the Catalog tab on the left side, click on the catalogs you want to onboard with LightBeam, and click Grant.

    Figure 12. Databricks Catalogs
  6. Give this service principal SELECT, USE CATALOG, and USE SCHEMA access to the catalogs that you wish to scan inside the workspace.

Figure 13. Catalog Permissions
  7. Give this service principal access to use personal access tokens. Go to Admin Settings → Advanced → Access Control → Personal Access Token → Permission settings, and grant the Can Use permission to this service principal. If you get an error like "Token permissions can be set only if at least one token has been created in the workspace", go to the workspace → click on the profile picture → Settings → under the User section click Developer → Access tokens → Manage, and create a new access token.

    Figure 14. Token permissions
  8. Install the Databricks CLI using the following documentation: https://docs.databricks.com/en/dev-tools/cli/install.html#homebrew-install.

  9. Configure the Databricks CLI:

    1. Run databricks auth login --host <URL of the workspace>

    2. This opens the login page in a browser. Enter the credentials to complete the login.

  10. Finally, from the Databricks CLI, generate a personal access token for this service principal. Copy the application ID for this service principal from Admin Settings → Identity and access → Service Principals. Set the lifetime to 31536000 seconds (1 year) so that the token doesn’t expire soon.

    databricks token-management create-obo-token <application_id> --lifetime-seconds <lifetime_seconds>

  11. Copy the token_value from the response. This will be used for onboarding Databricks with LightBeam.
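If you script the last two steps, the token can be pulled out of the CLI response programmatically. The Python sketch below assumes the response is a JSON object with a top-level token_value field, as the step above implies. The helper name extract_token_value and the sample payload are hypothetical, and the token shown is a placeholder rather than a real secret.

```python
import json

# One year expressed in seconds, matching the lifetime suggested in this guide.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def extract_token_value(cli_output: str) -> str:
    """Return the token_value field from the CLI's JSON response.

    Assumption: the response is a JSON object with a top-level "token_value" key.
    """
    response = json.loads(cli_output)
    token = response.get("token_value")
    if not token:
        raise ValueError("no token_value field found in the response")
    return token


# Made-up response body for illustration only.
example_output = '{"token_info": {"comment": "lightbeam"}, "token_value": "EXAMPLE-PLACEHOLDER"}'
print(SECONDS_PER_YEAR)
print(extract_token_value(example_output))
```

Store the extracted value in a secrets manager instead of printing it when working with a real token.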

B. Using PAT (Personal Access Token)

  1. Add a user to the workspace that you are onboarding with LightBeam. From the workspace console, click the account username at the top right → Settings → Identity and access → Add User.

Figure 15. Add a New User
  2. Either an existing user can be used or a new user can be onboarded.

    a. To use an existing user, search for the user and select it.

Figure 16. Search for an existing user

    b. To add a new user, enter the new user's email address.

Figure 17. Add email of the new user.
  3. After the user is added, go to the Catalog tab on the left side, click on the catalogs you want to onboard with LightBeam, go to the Permissions tab, and click Grant.

Figure 18. Databricks Catalogs
  4. Give this user SELECT, USE CATALOG, and USE SCHEMA access to the catalogs that you wish to scan inside the workspace.

Figure 19. Catalog Permissions
  5. Give this user access to use personal access tokens. Go to Admin Settings → Advanced → Access Control → Personal Access Token → Permission settings, and grant the Can Use permission to the user. If you get an error like "Token permissions can be set only if at least one token has been created in the workspace", go to the workspace → click on the profile picture → Settings → under the User section click Developer → Access tokens → Manage, and create a new access token.

Figure 20. Token Permission
  6. Log in with the added user's email address and create a personal access token. Go to the workspace → click on the profile picture → Settings → under the User section click Developer → Access tokens → Manage, and create a new access token.

Figure 21. Create PAT for user
  7. Copy the token_value. This will be used for onboarding Databricks with LightBeam.
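The catalog permissions used in both options can also be granted with SQL instead of the Catalog UI. Unity Catalog SQL spells the privileges as USE CATALOG, USE SCHEMA, and SELECT. The Python sketch below generates the corresponding GRANT statements for one catalog and one principal, assuming the principal is identified by a user email address or a service principal application ID. The helper names and the example values are hypothetical.

```python
# The three catalog-level privileges this guide asks for.
PRIVILEGES = ("USE CATALOG", "USE SCHEMA", "SELECT")


def quote_ident(name: str) -> str:
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def grant_statements(catalog: str, principal: str) -> list[str]:
    """Build one GRANT statement per privilege for a catalog and principal.

    Hypothetical helper: `principal` is a user email address or a
    service principal application ID.
    """
    return [
        f"GRANT {privilege} ON CATALOG {quote_ident(catalog)} TO {quote_ident(principal)}"
        for privilege in PRIVILEGES
    ]


# Example with a made-up catalog and user.
for statement in grant_statements("main", "scanner@example.com"):
    print(statement)
```

Granting at the catalog level means the privileges are inherited by the schemas and tables inside that catalog, so one set of statements per scanned catalog is enough.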

Validate Permissions to the Datasource

Next, validate the permissions granted on the datasource. This confirms that the credentials you provide have authorized access to the datasource. After the permissions are validated, you can onboard Databricks in LightBeam.

Steps

  1. Go to the sql_user_check_databricks directory.

  2. Please refer to the README.md file in the directory for detailed instructions.
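Independently of the script in that directory, a quick sanity check is to confirm that the token can list the SQL warehouses in the workspace through the Databricks REST API. The Python sketch below is not the sql_user_check_databricks tool and makes no claim about what that tool checks. It assumes the /api/2.0/sql/warehouses endpoint and bearer-token authentication, and the function names and example URL are hypothetical.

```python
import json
import urllib.request


def warehouse_list_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the workspace's SQL warehouses."""
    base = workspace_url.rstrip("/")
    return urllib.request.Request(
        f"{base}/api/2.0/sql/warehouses",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_warehouse_names(workspace_url: str, token: str) -> list[str]:
    """Send the request and return the warehouse names visible to the token.

    Raises urllib.error.HTTPError if the token is rejected.
    """
    request = warehouse_list_request(workspace_url, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [warehouse.get("name", "") for warehouse in payload.get("warehouses", [])]


# Building the request does not contact the workspace; this URL is a placeholder.
request = warehouse_list_request("https://example.cloud.databricks.com/", "EXAMPLE-TOKEN")
print(request.full_url)
```

Calling list_warehouse_names with a real workspace URL and token should return the warehouse selected during onboarding if the Can Use permission was granted.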


About LightBeam

LightBeam automates Privacy, Security, and AI Governance, so businesses can accelerate their growth in new markets. Leveraging generative AI, LightBeam has rapidly gained customers’ trust by pioneering a unique privacy-centric and automation-first approach to security. Unlike siloed solutions, LightBeam ties together sensitive data cataloging, control, and compliance across structured and unstructured data applications providing 360-visibility, redaction, self-service DSRs, and automated ROPA reporting ensuring ultimate protection against ransomware and accidental exposures while meeting data privacy obligations efficiently. LightBeam is on a mission to create a secure privacy-first world helping customers automate compliance against a patchwork of existing and emerging regulations.

For any questions or suggestions, please get in touch with us at: [email protected].
