LightBeam Documentation

AWS Glue

Connecting Glue to LightBeam


Overview

LightBeam Spectra users can connect various data sources to the LightBeam application. Once connected, these data sources are continuously monitored for PII and PHI data.

Examples: AWS Glue, Looker, DynamoDB, Redshift, etc.

About AWS Glue

AWS Glue is Amazon's serverless data integration service that streamlines data processing tasks. It provides a convenient way for analytics users to connect, prepare, and integrate data from diverse sources such as Postgres, RDS, S3, and Redshift. One of its key features is the metadata catalog, which stores metadata effectively.

  • Metadata Catalog Utilization:

    • In LightBeam, we focus on the metadata catalog provided by AWS Glue to extract data structures like databases, tables, and columns.

  • Sample Data Queries for PII Detection:

    • LightBeam extends beyond metadata extraction by sampling data for PII detection. To achieve this, we use AWS Athena to run sample queries that retrieve 5000 rows from each table.

  • AWS Services Employed:

    • Glue: Accessed for its metadata catalog capabilities.

    • S3: Utilized as the storage repository for data files.

    • Athena: Deployed as the query engine to retrieve sample data from tables.

Operational Notes and Constraints:

  • The Glue catalog, the S3 data, and Athena must all be in the same region, which is the region configured during data source registration.

  • Data must reside in S3, supporting formats like CSV, JSON, Parquet, Avro, and ORC.

  • The current setup does not support workflows for table clusters or entity creation.

  • We skip columns containing complex Blob data types, such as maps and arrays, in PII classification.
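The last constraint above can be pictured as a simple filter over column types. The following is a minimal sketch, not LightBeam's implementation: it assumes the catalog reports complex types as strings beginning with `array<` or `map<`, and the helper name `is_scannable_type` is hypothetical.

```python
# Assumption: complex column types appear as "array<...>" or "map<...>" strings.
COMPLEX_PREFIXES = ("array<", "map<")


def is_scannable_type(glue_type: str) -> bool:
    """Return False for complex types such as maps and arrays."""
    return not glue_type.strip().lower().startswith(COMPLEX_PREFIXES)


# Illustrative column list shaped like catalog column metadata.
columns = [
    {"Name": "email", "Type": "string"},
    {"Name": "tags", "Type": "array<string>"},
    {"Name": "attrs", "Type": "map<string,string>"},
    {"Name": "age", "Type": "int"},
]

# Keep only the columns that would be considered for PII classification.
scannable = [c["Name"] for c in columns if is_scannable_type(c["Type"])]
print(scannable)
```

How other nested types are handled is not stated in this guide, so the sketch leaves them untouched.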

Features

Datasource Registration

AWS administrators can set up a user with limited permissions and use that user's accessKey and secretKey for registration. The process also requires specifying the AWS region where the Glue catalog resides and the Athena workgroup used for query execution. During registration, users are presented with a list of databases from the Glue catalog and can select specific databases for scanning. They are also shown the Athena workgroups that have a query output location configured and must choose one of them as part of registration.
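To illustrate the workgroup filter described above, here is a small sketch that keeps only workgroups with a query output location. It assumes each entry is shaped like the `WorkGroup` object returned by Athena's GetWorkGroup API; the function name is hypothetical and this is not LightBeam's own code.

```python
def workgroups_with_output_location(workgroups: list[dict]) -> list[str]:
    """Return the names of workgroups whose query result location is set."""
    names = []
    for wg in workgroups:
        result_cfg = wg.get("Configuration", {}).get("ResultConfiguration", {})
        if result_cfg.get("OutputLocation"):
            names.append(wg["Name"])
    return names


# Illustrative input: one workgroup with an output location, one without.
sample = [
    {"Name": "w1",
     "Configuration": {"ResultConfiguration": {"OutputLocation": "s3://l1/results/"}}},
    {"Name": "primary", "Configuration": {}},
]
eligible = workgroups_with_output_location(sample)
print(eligible)
```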

Metadata Scanning

The scanning process targets tables within the Glue databases specified in the scan conditions. Table data must be stored in S3, so only tables with data in S3 are eligible for scanning. For each eligible table, the list of columns and their data types is retrieved, along with the row count and table size whenever they are available.
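The eligibility rule can be sketched as a filter over table metadata. The example below assumes entries shaped like the `TableList` items returned by Glue's GetTables API (`Name`, `StorageDescriptor.Location`, `StorageDescriptor.Columns`). Reading row count and size from the `recordCount` and `sizeKey` table parameters is an assumption about crawler-populated tables, and `summarize_s3_tables` is a hypothetical name.

```python
def summarize_s3_tables(table_list: list[dict]) -> list[dict]:
    """Keep tables whose data lives in S3 and summarize their columns."""
    summaries = []
    for table in table_list:
        descriptor = table.get("StorageDescriptor", {})
        location = descriptor.get("Location", "")
        if not location.startswith("s3://"):
            continue  # data is not in S3, so the table is not eligible
        params = table.get("Parameters", {})
        summaries.append({
            "table": table["Name"],
            "location": location,
            "columns": [(c["Name"], c["Type"]) for c in descriptor.get("Columns", [])],
            # Assumption: crawler-style statistics; None when absent.
            "row_count": params.get("recordCount"),
            "size_bytes": params.get("sizeKey"),
        })
    return summaries


# Illustrative input: one S3-backed table and one table without an S3 location.
tables = [
    {"Name": "customers",
     "StorageDescriptor": {"Location": "s3://l2/customers/",
                           "Columns": [{"Name": "email", "Type": "string"}]},
     "Parameters": {"recordCount": "1200"}},
    {"Name": "jdbc_orders",
     "StorageDescriptor": {"Location": "", "Columns": []}},
]
result = summarize_s3_tables(tables)
print(result)
```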

PII Detection

PII detection hinges on obtaining sample data from all columns within a table, for which Athena is employed. SQL queries are executed through Athena to gather sample data for each table, with the query results stored in the default output location of the utilized LightBeam workgroup. To manage data and control storage costs in S3, query result files for each table are placed in a dedicated folder, which is removed after the data has been processed.

For reading data, we sample 5000 rows from each table. If the data is partitioned across multiple files, not all files are read, so the amount of data scanned can be smaller than the total size of the files.
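A sampling query of this kind might look like the sketch below. The exact SQL that LightBeam issues is not documented here, so the `SELECT * ... LIMIT` form and the helper names are assumptions; only the 5000-row sample size comes from this section.

```python
SAMPLE_ROWS = 5000  # sample size stated in this section


def quote_identifier(name: str) -> str:
    """Double-quote an identifier for an Athena SQL query, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_sample_query(database: str, table: str, limit: int = SAMPLE_ROWS) -> str:
    """Build a query that reads at most `limit` rows from one table."""
    return (
        f"SELECT * FROM {quote_identifier(database)}.{quote_identifier(table)} "
        f"LIMIT {int(limit)}"
    )


query = build_sample_query("d1", "customers")
print(query)
```

A string like this could then be submitted through Athena's StartQueryExecution in the workgroup chosen at registration, which is why the policy in the appendix includes `athena:StartQueryExecution` and `athena:GetQueryResults`.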


Onboarding AWS Glue Data Source

  1. Log in to your LightBeam instance.

  2. Click on DATASOURCES on the top navigation bar.

  3. Click on “Add a data source”.

  4. Search for Glue.

  5. Click on Glue.

  6. Configure Basic Details

In the Basic Details section, enter the following information:

  • Instance Name: Provide a unique name for the AWS Glue data source (e.g., aws-glue-datasource).

  • Primary Owner: Enter the email address of the individual responsible for this data source (e.g., demo@lightbeam.ai).

  • Source of Truth (Optional): Toggle this option on if the Glue Catalog serves as a single source of truth for entity validation.

  • Description (Optional): Add a brief description of the Glue data source (e.g., "AWS Glue Datasource Instance").

  7. Enter Connection Details

Provide the following details in the Connection section:

  • Select Region: Choose the AWS region where the Glue catalog is hosted (e.g., US East (N. Virginia) us-east-1).

  • Access Key: The AWS IAM user's access key.

  • Secret Access Key: The IAM user's secret access key.

  • Query Engine: Select the query engine for Glue, such as Athena (used for querying Glue catalog data).

  8. Click on Test Connection.

  9. Additional Details (Optional):

  • Location: The location of the data source.

  • Purpose: The purpose of the data being collected/processed.

  • Stage: The stage of the data source. Example: Source, Processing, Archival, etc.

  10. Verify that the message Test Connection Success appears on the screen, then click on Next.

  11. Select a workgroup from the drop-down list. Only workgroups with a query output location configured are shown here.

  12. In the next step, you will see a list of databases from your Glue data source.

Select one of the following two options:

  i) Show all databases to select: By default, all databases to which you have access permissions will be shown.

  ii) Select specific database(s) that you have permission for: If you wish to scan only certain databases, click on Add database name and select them from the drop-down menu.

Please verify that all databases selected for scanning show up in the list of databases. Ensure you've made your desired selections before connecting the data source.

  13. Finally, click on Start Sampling to connect to the Glue data source.


APPENDIX

Minimal permissions setup

This guide outlines how to create an IAM user with the minimal permissions needed to integrate an AWS Glue data source with LightBeam. It involves three AWS services: Glue, S3, and Athena.

Prerequisites for Glue Onboarding

Before integrating Glue, it's necessary to set up a workgroup in Athena with a specified output location for query results. A separate workgroup is recommended for isolating LightBeam queries.

Creating a Workgroup in Athena

  1. In the Athena console, navigate to Administration → Workgroup.

  2. Click Create Workgroup and enter a workgroup name.

  3. Choose Athena SQL as the engine type.

  4. Specify the S3 location where query results will be stored.

This workgroup name will be used when registering the data source with LightBeam.

Creating a Policy with S3, Glue and Athena Permissions

Glue Permissions

Create a policy granting specific actions on Glue resources to allow access to the catalogs, databases, and tables necessary for LightBeam scanning. This must be done for the region where Glue is configured:

{
    "Sid": "VisualEditor1",
    "Effect": "Allow",
    "Action": [
        "glue:GetTables",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetDatabase",
        "glue:GetPartitions",
        "glue:GetPartition"
    ],
    "Resource": [
        "arn:aws:glue:<AWS region>:<account_id>:database/*",
        "arn:aws:glue:<AWS region>:<account_id>:catalog",
        "arn:aws:glue:<AWS region>:<account_id>:table/*/*"
    ]
}

These permissions are essential for accessing all relevant Glue resources within the specified region. Adjust permissions as needed for scanning specific databases or tables.
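As one way to narrow the statement to specific databases, the sketch below generates the Glue statement from a region, an account ID, and an optional list of database names. The function name and the `GlueReadOnly` Sid are hypothetical, and scoping the table ARNs to the selected databases is an assumption rather than something this guide prescribes.

```python
import json

GLUE_ACTIONS = [
    "glue:GetTables",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetDatabase",
    "glue:GetPartitions",
    "glue:GetPartition",
]


def glue_statement(region: str, account_id: str, databases=None) -> dict:
    """Build the Glue statement, optionally limited to the given databases."""
    base = f"arn:aws:glue:{region}:{account_id}"
    if databases:
        db_arns = [f"{base}:database/{db}" for db in databases]
        table_arns = [f"{base}:table/{db}/*" for db in databases]
    else:
        # No selection: fall back to the wildcard form shown above.
        db_arns = [f"{base}:database/*"]
        table_arns = [f"{base}:table/*/*"]
    return {
        "Sid": "GlueReadOnly",  # hypothetical Sid
        "Effect": "Allow",
        "Action": GLUE_ACTIONS,
        "Resource": [f"{base}:catalog", *db_arns, *table_arns],
    }


stmt = glue_statement("us-east-2", "1111", ["d1", "d2"])
print(json.dumps(stmt, indent=4))
```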

Athena Permissions

Include permissions for Athena to list, query, and manage data catalogs and workgroups:

  1. Workgroup specific permissions:

{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": [
        "athena:ListDataCatalogs",
        "athena:GetTables",
        "athena:GetTable",
        "athena:GetCatalogs",
        "athena:GetWorkGroup",
        "athena:ListDatabases",
        "athena:GetQueryExecution",
        "athena:StartQueryExecution",
        "athena:GetQueryResults",
        "athena:GetDatabase",
        "athena:GetDataCatalog"
    ],
    "Resource": [
        "arn:aws:athena:<AWS region>:<account_id>:datacatalog/AwsDataCatalog",
        "arn:aws:athena:<AWS region>:<account_id>:workgroup/<workgroup-name>"
    ]
}

These actions are required for the Athena catalog named AwsDataCatalog and for the workgroup used with LightBeam.

  2. Permissions for all workgroups:

{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": [
        "athena:GetWorkGroup",
        "athena:ListWorkGroups"
    ],
    "Resource": "*"
}

S3 Permissions

S3 permissions are categorized into three parts:

  1. Output Bucket Permissions - Permissions for the bucket where Athena query results are stored:

{
    "Sid": "VisualEditor2",
    "Effect": "Allow",
    "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucketMultipartUploads",
        "s3:AbortMultipartUpload",
        "s3:ListBucket",
        "s3:DeleteObject",
        "s3:GetBucketLocation",
        "s3:ListMultipartUploadParts"
    ],
    "Resource": [
        "arn:aws:s3:::<output-bucket-for-workgroup>",
        "arn:aws:s3:::<output-bucket-for-workgroup>/*"
    ]
}
  2. Data Bucket Read Permissions - Read permissions for buckets containing the data:

{
    "Sid": "VisualEditor1",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<data-bucket-location-1>/*",
        "arn:aws:s3:::<data-bucket-location-2>/*"
    ]
}
  3. General Read-Only Permissions - Broad permissions to list all buckets and look up their locations:

{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": [
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:GetBucketLocation"
    ],
    "Resource": "arn:aws:s3:::*"
}
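S3 matches bucket-level actions such as `s3:ListBucket` against the bucket ARN and object-level actions such as `s3:GetObject` against object ARNs. As a sketch only, the helper below lists both ARN forms for each data bucket. The function names and the `DataBucketsRead` Sid are hypothetical.

```python
import json


def bucket_and_objects(bucket: str) -> list[str]:
    """ARN of the bucket itself plus an ARN pattern for every object in it."""
    return [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]


def data_buckets_statement(buckets: list[str]) -> dict:
    """Read-only statement for the buckets that hold the table data."""
    return {
        "Sid": "DataBucketsRead",  # hypothetical Sid
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": [arn for b in buckets for arn in bucket_and_objects(b)],
    }


stmt = data_buckets_statement(["l2", "l3"])
print(json.dumps(stmt, indent=4))
```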

Note:

If the Glue data catalog or any of the S3 buckets (the buckets where the data is stored or the bucket Athena writes query results to) is KMS encrypted, the following permission must also be added to the policy. All keys used for encryption must be listed in the Resource field.

{
	"Sid": "VisualEditor6",
	"Effect": "Allow",
	"Action": [
		"kms:Decrypt",
		"kms:Encrypt",
		"kms:GenerateDataKey"
	],
	"Resource": "arn:aws:kms:<AWS region>:<account_id>:key/<key_id>"
}

A few example policies:

  1. Example 1:

AWS account ID = 1111

AWS Region = us-east-2

Athena workgroup name = w1

Athena query output location = l1

S3 buckets where data resides = l2 and l3

Databases in Glue Catalog that need to be scanned = d1 and d2

The following policy needs to be created for the above scenario:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "glue:GetTables",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:GetPartitions",
                "glue:GetPartition"
            ],
            "Resource": [
                "arn:aws:glue:us-east-2:1111:database/d1",
                "arn:aws:glue:us-east-2:1111:database/d2",
                "arn:aws:glue:us-east-2:1111:catalog",
                "arn:aws:glue:us-east-2:1111:table/*/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "athena:ListDataCatalogs",
                "athena:GetTables",
                "athena:GetTable",
                "athena:GetCatalogs",
                "athena:GetWorkGroup",
                "athena:ListDatabases",
                "athena:GetQueryExecution",
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:GetDatabase",
                "athena:GetDataCatalog"
            ],
            "Resource": [
                "arn:aws:athena:us-east-2:1111:datacatalog/AwsDataCatalog",
                "arn:aws:athena:us-east-2:1111:workgroup/w1"
            ]
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "athena:GetWorkGroup",
                "athena:ListWorkGroups"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::l1",
                "arn:aws:s3:::l1/*"
            ]
        },
        {
            "Sid": "VisualEditor4",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::l2/*",
                "arn:aws:s3:::l3/*"
            ]
        },
        {
            "Sid": "VisualEditor5",
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
  2. Example 2:

Consider the same scenario as above, but with the Glue catalog and S3 buckets encrypted with KMS.

Key ID used for encrypting the Glue catalog = k1

Key ID used for encrypting the S3 buckets = k2

Along with the permission blocks above, the following block must also be added to the policy:

{
	"Sid": "VisualEditor6",
	"Effect": "Allow",
	"Action": [
		"kms:Decrypt",
		"kms:Encrypt",
		"kms:GenerateDataKey"
	],
	"Resource": [
		"arn:aws:kms:us-east-2:1111:key/k1",
		"arn:aws:kms:us-east-2:1111:key/k2"
	]
}
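Before attaching an assembled policy, it can help to check that the document parses as JSON and that every Sid is unique, since IAM requires unique Sid values within a policy. The following is a minimal sketch of such a check using only the Python standard library; `lint_policy` is a hypothetical helper, not part of any AWS or LightBeam tooling.

```python
import json
from collections import Counter


def lint_policy(policy_text: str) -> list[str]:
    """Return a list of problems found in a JSON policy document."""
    try:
        policy = json.loads(policy_text)
    except json.JSONDecodeError as exc:
        # Trailing commas and similar slips are reported here.
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may be given as an object
    sid_counts = Counter(s["Sid"] for s in statements if "Sid" in s)
    return [f"duplicate Sid: {sid}" for sid, n in sid_counts.items() if n > 1]


duplicated = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "VisualEditor1", "Effect": "Allow", "Action": "glue:GetTable", "Resource": "*"},
        {"Sid": "VisualEditor1", "Effect": "Allow", "Action": "athena:GetTable", "Resource": "*"},
    ],
})
print(lint_policy(duplicated))
print(lint_policy('{"Resource": ["arn:aws:kms:us-east-2:1111:key/k1",]}'))
```

An empty list means neither problem was found. It does not prove that the policy grants the intended access.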

Policy Assignment

  1. Create an IAM user.

  2. Assign the policy created above to this user.

  3. Use the accessKey and secretKey of this IAM user to onboard the data source with LightBeam. This ensures that the permissions needed for data scanning and classification are in place.

Validate permissions to the datasource

Next, validate that the credentials you plan to provide have the required permissions on the datasource. This confirms that the credentials grant authorized access. After the permissions are validated, you can onboard Glue in LightBeam.

Steps

  1. Go into the sql_user_check_glue directory.

  2. Please refer to the README.md file in the directory for detailed instructions.


About LightBeam

LightBeam automates Privacy, Security, and AI Governance, so businesses can accelerate their growth in new markets. Leveraging generative AI, LightBeam has rapidly gained customers’ trust by pioneering a unique privacy-centric and automation-first approach to security. Unlike siloed solutions, LightBeam ties together sensitive data cataloging, control, and compliance across structured and unstructured data applications providing 360-visibility, redaction, self-service DSRs, and automated ROPA reporting ensuring ultimate protection against ransomware and accidental exposures while meeting data privacy obligations efficiently. LightBeam is on a mission to create a secure privacy-first world helping customers automate compliance against a patchwork of existing and emerging regulations.


Last updated 3 months ago

Figure 1. AWS Glue - Add Data Source
Figure 2. Search AWS Glue
Figure 3. Click on Glue
Figure 4. AWS Glue - Basic Configuration & Connection Details
Figure 5. AWS Glue - Select database
Figure 5.1 AWS Glue - Select database
Figure 5.2 AWS Glue - Select database
Figure 6. AWS Glue - Workgroups in Athena console
Figure 7. AWS Glue - Create workgroup
Figure 8. AWS Glue - Query result location
Figure 9. Create an IAM user
Figure 10. Attach policy with this user

First, clone the repository: https://github.com/lightbeamai/lb-installer

For any questions or suggestions, please get in touch with us at: support@lightbeam.ai.