Databricks
Connecting Databricks to LightBeam
Overview
LightBeam Spectra users can connect various data sources to the LightBeam application, where they are continuously monitored for PII and PHI data.
Examples: AWS Glue, Looker, DynamoDB, Redshift, etc.
About Databricks
Databricks is a platform used for a variety of tasks, ranging from data warehousing and BI to ML.
The unit that is onboarded with LightBeam is a workspace. LightBeam finds sensitive data in Databricks tables that are managed by Unity Catalog. Users onboard a workspace with a list of catalogs/databases, and LightBeam scans all the tables inside those catalogs/databases.
A SQL warehouse must be enabled inside the workspace. LightBeam uses it to execute the SQL queries that sample data from each table.
Features
Datasource Registration
Databricks admins can create a service principal with restricted user permissions and register the data source with a personal access token for that service principal. During registration, users are shown the SQL warehouses present in the workspace and need to select one warehouse. They are then shown the catalogs and databases and can filter to the ones they wish to scan.
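As a rough illustration of where such a warehouse list can come from, the sketch below queries the Databricks REST endpoint GET /api/2.0/sql/warehouses with the token as a bearer credential. Whether LightBeam uses this exact call is not stated in this guide. The helper names warehouses_url and list_warehouses, the example host, and the DATABRICKS_TOKEN variable are hypothetical.

```shell
# Hypothetical helper: build the "list SQL warehouses" URL for a workspace.
# $1 = workspace URL, with or without a trailing slash.
warehouses_url() {
  printf '%s/api/2.0/sql/warehouses\n' "${1%/}"
}

# Hypothetical helper: list the warehouses visible to a token.
# $1 = workspace URL, $2 = personal access token.
# Assumes curl is installed; the function is only defined here, not called.
list_warehouses() {
  curl -sS --fail -H "Authorization: Bearer $2" "$(warehouses_url "$1")"
}

# Example (placeholder host; token read from an environment variable):
#   list_warehouses "https://example.cloud.databricks.com" "$DATABRICKS_TOKEN"
```

If the call fails or returns no warehouses, that may indicate the token lacks access to a SQL warehouse, which is worth checking before registration.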
Metadata Scanning
LightBeam scans the tables in the Unity Catalog catalogs configured in the scan conditions. For each table, it collects the list of columns and their data types. It also fetches the row count and the size of the table when they are available.
PII Detection
PII detection needs sample data for every column of a table. LightBeam uses the SQL warehouse configured during datasource registration to execute the SQL queries that sample data from tables. It samples 5000 rows from each table.
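To make the sampling step concrete, here is a minimal sketch of a query that reads up to 5000 rows from one table. The exact query LightBeam issues is not documented here, so the plain LIMIT clause is an assumption. The helper name sample_query and the catalog, database and table names are hypothetical.

```shell
# Hypothetical helper: print a sampling query for one table.
# $1 = catalog, $2 = database (schema), $3 = table, $4 = row limit (default 5000).
# Assumption: sampling is a simple LIMIT; the real query may differ.
sample_query() {
  printf 'SELECT * FROM `%s`.`%s`.`%s` LIMIT %s\n' "$1" "$2" "$3" "${4:-5000}"
}

sample_query main sales orders
# prints: SELECT * FROM `main`.`sales`.`orders` LIMIT 5000
```

A query of this shape only needs SELECT on the table plus usage rights on its catalog and database, which matches the minimal permissions described in the appendix.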
Onboarding Databricks Data Source
Log in to your LightBeam instance.
Click on DATASOURCES on the Top Navigation Bar.
Click on “Add a data source”.
Search for Databricks.
Click on Databricks.
Fill in the details as shown below and click Next:
Basic Information
Instance Name: This is the unique name given to the data source.
Description: An optional field that describes the use of this data source.
Primary Owner: Email address of the person responsible for this data source, who will receive alerts by default.
Source of Truth: Indicates that the data source contains data acting as a single point of truth. LightBeam Spectra uses such monitored data sources to look up entities/attributes and to check whether the attributes/entities found in other data sources are accurate. A Source of Truth data set creates entities based on the attributes found in the data.
Location: The location of the data source.
Purpose: The purpose of the data being collected/processed.
Stage: The stage of the data source. Example: Source, Processing, Archival, etc.
In this step, enter the credentials and click Test Connection.
Verify that you get the message Test Connection Success on the screen. Click on Next.
Next, select a warehouse from the drop-down list.
In the next step, a drop-down presents the list of catalogs. Select the catalogs that you wish to scan.
By default, all databases in a catalog are scanned. You can also remove any database from the scan conditions.
Please verify that all databases selected for scanning show up in the list of databases. Ensure you've made your desired selections before connecting the data source.
Finally, click on Start Sampling to connect to the Databricks data source.
APPENDIX
Minimal permissions setup
This guide outlines the process to create a service account or a user with the minimal permissions necessary for integrating Databricks with LightBeam. Follow either option A or option B.
A. Using On-behalf-of token for Service Account
Go to the Databricks account management console → User management → Service principals, and add a new service principal.
Open the workspace you want to onboard to LightBeam by clicking Open on the right side of the workspace list.
Add the newly created service principal to that workspace. From the workspace console, click the account username at the top right → Settings → Identity and access → Add service principal.
Give this service principal permission to use the SQL warehouse.
Go to the Catalog tab on the left side, click each catalog you want to onboard with LightBeam, and click Grant.
Grant this service principal the SELECT, USE CATALOG and USE SCHEMA privileges on the catalogs that you wish to scan inside the workspace.
Allow this service principal to use personal access tokens. Go to Admin Settings → Advanced → Access Control → Personal Access Token → Permission settings and grant the Can Use permission to this service principal. If you get an error like “Token permissions can be set only if at least one token has been created in the workspace.”, go to the workspace → click on the profile picture → Settings → under the User section click Developer → Access tokens → Manage, and create a new access token.
Configure the Databricks CLI.
Run
databricks auth login --host <URL of the workspace>
This opens the login page in a browser. Enter your credentials to complete the login.
Finally, generate a personal access token for this service principal from the Databricks CLI. Copy the application ID of the service principal from Admin Settings → Identity and access → Service Principals. Set the lifetime to 31536000 seconds (1 year) so that the token doesn’t expire soon.
databricks token-management create-obo-token <application_id> --lifetime-seconds <lifetime_seconds>
Copy the token_value from the response. This will be used for onboarding Databricks with LightBeam.
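The lifetime value in the step above is simply one year expressed in seconds (365 × 24 × 60 × 60). A small sketch for deriving it, or a shorter lifetime, from a number of days follows. The helper name lifetime_seconds is hypothetical.

```shell
# Hypothetical helper: convert a token lifetime in days to seconds.
# $1 = number of days.
lifetime_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

lifetime_seconds 365   # prints 31536000

# Example use with the command above (application ID left as a placeholder):
#   databricks token-management create-obo-token <application_id> \
#     --lifetime-seconds "$(lifetime_seconds 365)"
```

Choosing a shorter lifetime means the token has to be regenerated, and the data source credentials updated, before it expires.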
B. Using PAT (Personal Access Token)
Add a user to the workspace that you are onboarding with LightBeam. From the workspace console, click the account username at the top right → Settings → Identity and access → Add User.
Here, an existing user can be used or a new user can be onboarded.
a. To use an existing user, search for it and select it.
b. To add a new user, enter the email address of the user to be added.
After the user is added, go to the Catalog tab on the left side, click each catalog you want to onboard with LightBeam, go to the Permissions tab and click Grant.
Grant the added user the SELECT, USE CATALOG and USE SCHEMA privileges on the catalogs that you wish to scan inside the workspace.
Allow this user to use personal access tokens. Go to Admin Settings → Advanced → Access Control → Personal Access Token → Permission settings and grant the Can Use permission to this user.
If you get an error like
Token permissions can be set only if at least one token has been created in the workspace.
then go to the workspace → click on the profile picture → Settings → under the User section click Developer → Access tokens → Manage, and create a new access token.
Now log in with the added user's email and create a personal access token. Go to the workspace → click on the profile picture → Settings → under the User section click Developer → Access tokens → Manage, and create a new access token.
Copy the token_value. This will be used for onboarding Databricks with LightBeam.
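The catalog grants in options A and B are made through the UI. As an alternative sketch, the same three privileges can be written as Unity Catalog SQL GRANT statements and run by someone allowed to grant on the catalog, for example in the SQL editor. The helper name print_catalog_grants and the example catalog and principal are hypothetical. For a service principal, the principal is its application ID rather than an email address.

```shell
# Hypothetical helper: print GRANT statements for one catalog and principal.
# $1 = catalog name
# $2 = principal (user email, or application ID for a service principal)
print_catalog_grants() {
  printf 'GRANT USE CATALOG ON CATALOG `%s` TO `%s`;\n' "$1" "$2"
  printf 'GRANT USE SCHEMA ON CATALOG `%s` TO `%s`;\n' "$1" "$2"
  printf 'GRANT SELECT ON CATALOG `%s` TO `%s`;\n' "$1" "$2"
}

# Placeholder catalog and user; replace with your own values.
print_catalog_grants main lightbeam-scanner@example.com
```

Granting at the catalog level covers every database and table inside that catalog, so repeat the call once per catalog you plan to scan.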
Validate permissions to the datasource.
Next, validate these permissions against the datasource. This confirms that the credentials provided have authorized access to the datasource. After the permissions are validated, Databricks can be onboarded in LightBeam.
Steps
Go into the sql_user_check_databricks directory.
Please refer to the README.md file in the directory for detailed instructions.
About LightBeam
LightBeam automates Privacy, Security, and AI Governance, so businesses can accelerate their growth in new markets. Leveraging generative AI, LightBeam has rapidly gained customers’ trust by pioneering a unique privacy-centric and automation-first approach to security. Unlike siloed solutions, LightBeam ties together sensitive data cataloging, control, and compliance across structured and unstructured data applications providing 360-visibility, redaction, self-service DSRs, and automated ROPA reporting ensuring ultimate protection against ransomware and accidental exposures while meeting data privacy obligations efficiently. LightBeam is on a mission to create a secure privacy-first world helping customers automate compliance against a patchwork of existing and emerging regulations.