LightBeam Documentation

Databricks

Connecting Databricks to LightBeam


Overview

LightBeam Spectra users can connect various data sources to the LightBeam application, where they are continuously monitored for PII and PHI data.

Example: AWS Glue, Looker, DynamoDB, Redshift, etc.

About Databricks

Databricks is a platform used for a variety of tasks, including data warehousing, business intelligence (BI), and machine learning (ML).

A Databricks workspace is the unit that is onboarded with LightBeam. LightBeam finds sensitive data in Databricks that is managed by Unity Catalog. Users onboard a workspace with a list of catalogs/databases, and LightBeam scans all the tables inside those catalogs/databases.

A SQL warehouse must be enabled inside the workspace; LightBeam uses it to execute the SQL queries that sample data from each table.

Features

Datasource Registration

Databricks admins can create a service principal with restricted permissions and use a personal access token for that service principal during registration. Users are shown a list of the SQL warehouses present in the workspace and must select one. They are then shown the list of catalogs and databases, and can filter the catalogs/databases they wish to scan.

Metadata Scanning

LightBeam scans the tables present in the Unity Catalog locations configured in the scan conditions. For each table, it fetches the list of columns that are part of the table and their data types, as well as the row count and table size when available.
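The same kind of table metadata can be inspected manually through Unity Catalog's information schema. The query below is an illustrative sketch, not necessarily the exact call LightBeam makes; `my_catalog` is a placeholder catalog name:

```sql
-- Illustrative: list columns and data types for all tables in a catalog
-- via Unity Catalog's information schema ("my_catalog" is a placeholder).
SELECT table_schema,
       table_name,
       column_name,
       data_type
FROM my_catalog.information_schema.columns
ORDER BY table_schema, table_name, ordinal_position;
```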

PII Detection

For PII detection, LightBeam needs sample data for all the columns of a table. It uses the SQL warehouse configured during data source registration to execute the SQL queries that sample data from tables, reading up to 5000 rows per table.
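Conceptually, the per-table sampling performed on the SQL warehouse resembles a simple bounded read. This is a sketch with placeholder names, not the exact query LightBeam issues:

```sql
-- Illustrative: sample up to 5000 rows from a table for PII detection
-- ("my_catalog.my_schema.my_table" is a placeholder).
SELECT *
FROM my_catalog.my_schema.my_table
LIMIT 5000;
```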


Onboarding Databricks Data Source

  1. Log in to your LightBeam instance.

  2. Click on DATASOURCES in the top navigation bar.

  3. Click on “Add a data source”.

  4. Search for Databricks.

  5. Click on Databricks.

  6. Fill in the details and click Next:

Basic Information

  • Instance Name: A unique name for the data source.

  • Description: An optional field describing the use of this data source.

  • Primary Owner: The email address of the person responsible for this data source, who will receive alerts by default.

  • Source of Truth: Marks a data source whose data acts as a single source of truth. Entities/attributes found in it can be used to verify whether attributes/entities found in other data sources are accurate. A Source of Truth data set creates entities based on the attributes found in the data.

  • Location: The location of the data source.

  • Purpose: The purpose of the data being collected/processed.

  • Stage: The stage of the data source. Example: Source, Processing, Archival, etc.

  7. Insert the credentials and click Test Connection.

  8. Verify that you get the message Test Connection Success on the screen, then click Next.

  9. Select a SQL warehouse from the drop-down list.

  10. Select the catalogs that you wish to scan from the drop-down list.

  11. By default, all databases that are part of a selected catalog will be scanned. If you wish to remove any database from the scan conditions, you can do so here.

Please verify that all databases selected for scanning show up in the list of databases, and ensure you have made your desired selections before connecting the data source.

  12. Finally, click Start Sampling to connect the Databricks data source.


APPENDIX

Minimal permissions setup

This guide outlines the process of creating a service account or a user with the minimal permissions necessary for integrating Databricks with LightBeam. Either option A or option B can be followed.

A. Using On-behalf token for Service Account

  1. Go to the Databricks account management console → User management → Service principals, and add a new service principal.

  2. Open the workspace you want to onboard to LightBeam by clicking Open on the right side of the workspace list.

  3. Add the newly created service principal to the workspace you are onboarding with LightBeam. From the workspace console, click your account username at the top right → Settings → Identity and access → Add service principal.

  4. Give this service principal access to use the SQL warehouse.

  5. Go to the Catalog tab on the left side, click the catalogs you want to onboard with LightBeam, and click Grant.

  6. Give this service principal the SELECT, USE CATALOG, and USE SCHEMA privileges on the catalogs that you wish to scan inside the workspace.

  7. Give this service principal access to use personal access tokens. Go to Admin Settings → Advanced → Access Control → Personal Access Tokens → Permission settings, and grant the Can Use permission to this service principal. If you get an error like “Token permissions can be set only if at least one token has been created in the workspace”, go to the workspace, click your profile picture → Settings → Developer (under the User section) → Access tokens → Manage, and create a new access token.

  8. Configure the Databricks CLI (install it first using the Databricks documentation: https://docs.databricks.com/en/dev-tools/cli/install.html):

    1. Run databricks auth login --host <URL of the workspace>

    2. This opens the login page in a browser; enter your credentials to complete the login.

  9. Finally, from the Databricks CLI, generate a personal access token for this service principal. Copy the application ID for the service principal from Admin Settings → Identity and access → Service principals. Keep the lifetime at 31536000 seconds (1 year) so that the token doesn’t expire soon.

    databricks token-management create-obo-token <application_id> --lifetime-seconds <lifetime_seconds>

  10. Copy the token_value from the response. This will be used for onboarding Databricks with LightBeam.
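If you prefer SQL to the catalog UI, the catalog privileges described above (SELECT, USE CATALOG, USE SCHEMA) can also be granted with Unity Catalog GRANT statements. This is an illustrative sketch; the catalog name and the service principal’s application ID are placeholders:

```sql
-- Illustrative Unity Catalog grants for a service principal
-- ("my_catalog" and the application ID are placeholders).
GRANT USE CATALOG ON CATALOG my_catalog TO `<application-id>`;
GRANT USE SCHEMA  ON CATALOG my_catalog TO `<application-id>`;
GRANT SELECT      ON CATALOG my_catalog TO `<application-id>`;
```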

B. Using PAT (Personal Access Token)

  1. Add a new user to the workspace that you are onboarding with LightBeam. From the workspace console, click your account username at the top right → Settings → Identity and access → Add user.

  2. Here, an existing user can be used or a new user can be onboarded:

    a. To use an existing user, search for it and select it.

    b. To add a new user, enter the email address of the user to be added.

  3. After the user is added, go to the Catalog tab on the left side, click the catalogs you want to onboard with LightBeam, go to the Permissions tab, and click Grant.

  4. Now give this user the SELECT, USE CATALOG, and USE SCHEMA privileges on the catalogs that you wish to scan inside the workspace.

  5. Give this user access to use personal access tokens. Go to Admin Settings → Advanced → Access Control → Personal Access Tokens → Permission settings, and grant the Can Use permission to the user (and the service principal, if applicable). If you get an error like “Token permissions can be set only if at least one token has been created in the workspace”, go to the workspace, click your profile picture → Settings → Developer (under the User section) → Access tokens → Manage, and create a new access token.

  6. Now log in with the added user’s email and create a personal access token: go to the workspace, click your profile picture → Settings → Developer (under the User section) → Access tokens → Manage, and create a new access token.

  7. Copy the token value. This will be used for onboarding Databricks with LightBeam.

Validate permissions to the data source

Next, validate these permissions against the data source. This ensures that the credentials provided by the user grant authorized access. After validating the permissions, the user can onboard Databricks in LightBeam.

Steps

  1. Clone the repository: https://github.com/lightbeamai/lb-installer

  2. Go into the sql_user_check_databricks directory.

  3. Refer to the README.md file in the directory for detailed instructions.


About LightBeam

LightBeam automates Privacy, Security, and AI Governance so that businesses can accelerate their growth in new markets. Leveraging generative AI, LightBeam has rapidly gained customers’ trust by pioneering a unique privacy-centric, automation-first approach to security. Unlike siloed solutions, LightBeam ties together sensitive data cataloging, control, and compliance across structured and unstructured data applications, providing 360° visibility, redaction, self-service DSRs, and automated RoPA reporting. This ensures protection against ransomware and accidental exposures while meeting data privacy obligations efficiently. LightBeam is on a mission to create a secure, privacy-first world, helping customers automate compliance against a patchwork of existing and emerging regulations.


Last updated 3 months ago



For any questions or suggestions, please get in touch with us at support@lightbeam.ai.
