AWS Glue
Connecting Glue to LightBeam
Overview
LightBeam Spectra users can connect various data sources to the LightBeam application. Connected data sources are continuously monitored for PII and PHI data.
Example: AWS Glue, Looker, DynamoDB, Redshift, etc.
About AWS Glue
AWS Glue is Amazon's serverless data integration service that streamlines data processing tasks. It provides a convenient way for analytics users to connect, prepare, and integrate data from diverse sources such as Postgres, RDS, S3, and Redshift. One of its key features is the metadata catalog, a central store for metadata about databases, tables, and their columns.
Metadata Catalog Utilization:
In LightBeam, we focus on the metadata catalog provided by AWS Glue to extract data structures like databases, tables, and columns.
Sample Data Queries for PII Detection:
LightBeam extends beyond metadata extraction by sampling data for PII detection. To achieve this, we use AWS Athena to execute sample queries that read 5000 rows from each table, covering all of its columns.
AWS Services Employed:
Glue: Accessed for its metadata catalog capabilities.
S3: Utilized as the storage repository for data files.
Athena: Deployed as the query engine to retrieve sample data from tables.
Features
Datasource Registration
AWS administrators can set up users with limited permissions and use the user's accessKey and secretKey for registration. The process also requires specifying the AWS region where the Glue catalog resides and the name of the Athena workgroup for query execution. During registration, users are presented with a list of databases from the Glue catalog, allowing them to select specific databases for scanning. Along with the list of databases, users are shown the Athena workgroups that have a query output location configured, and they must choose one workgroup as part of registration.
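The workgroup filter described above can be sketched in a few lines of Python. This is a minimal illustration, not LightBeam's implementation: it assumes each workgroup is a dictionary shaped like the `WorkGroup` object returned by Athena's GetWorkGroup API, and the function name and sample values are hypothetical.

```python
from __future__ import annotations

from typing import Any


def workgroups_with_output_location(workgroups: list[dict[str, Any]]) -> list[str]:
    """Return the names of workgroups that have a query output location set."""
    eligible = []
    for wg in workgroups:
        # Assumed shape: Configuration -> ResultConfiguration -> OutputLocation
        location = (
            wg.get("Configuration", {})
            .get("ResultConfiguration", {})
            .get("OutputLocation")
        )
        if location:
            eligible.append(wg["Name"])
    return eligible


# Hypothetical sample data: one workgroup with an output location, one without.
sample_workgroups = [
    {
        "Name": "lightbeam",
        "Configuration": {
            "ResultConfiguration": {"OutputLocation": "s3://example-results/lightbeam/"}
        },
    },
    {"Name": "primary", "Configuration": {}},
]

print(workgroups_with_output_location(sample_workgroups))
```

With the sample data above, only the `lightbeam` workgroup would be offered for selection.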
Metadata Scanning
The scanning process targets tables within the Glue databases specified in the scan conditions. A notable restriction is the necessity for the table data to be stored in S3; thus, only those tables with data in S3 are eligible for scanning. For each qualified table, information such as the list of columns, their data types, and whenever available, the row count and table size are retrieved.
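As a rough sketch of the eligibility rule above, the following Python function keeps only tables whose storage location is in S3 and collects their column metadata. It assumes table dictionaries shaped like the entries returned by Glue's GetTables API. The `recordCount` and `sizeKey` parameter names are an assumption (they are typically written by Glue crawlers and may be absent), and the function name is hypothetical.

```python
def scannable_tables(tables):
    """Keep tables stored in S3 and summarize their columns and size hints."""
    result = []
    for table in tables:
        descriptor = table.get("StorageDescriptor", {})
        location = descriptor.get("Location", "")
        if not location.startswith("s3://"):
            continue  # table data is not in S3, so it is not eligible for scanning
        params = table.get("Parameters", {})
        result.append(
            {
                "name": table["Name"],
                "columns": [(c["Name"], c["Type"]) for c in descriptor.get("Columns", [])],
                # Assumed crawler-populated parameters; None when unavailable.
                "row_count": params.get("recordCount"),
                "size_bytes": params.get("sizeKey"),
            }
        )
    return result


# Hypothetical sample: one S3-backed table and one JDBC-backed table.
sample_tables = [
    {
        "Name": "customers",
        "StorageDescriptor": {
            "Location": "s3://example-data/customers/",
            "Columns": [{"Name": "id", "Type": "bigint"}, {"Name": "email", "Type": "string"}],
        },
        "Parameters": {"recordCount": "1200", "sizeKey": "48213"},
    },
    {
        "Name": "orders",
        "StorageDescriptor": {"Location": "mydb.public.orders", "Columns": []},
    },
]

for entry in scannable_tables(sample_tables):
    print(entry["name"], entry["columns"], entry["row_count"], entry["size_bytes"])
```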
PII Detection
PII detection hinges on obtaining sample data from all columns within a table, for which Athena is employed. SQL queries are executed through Athena to gather sample data for each table, with the query results stored in the default output location of the workgroup selected for LightBeam. To manage data and control storage costs in S3, query result files for each table are placed in a dedicated folder, which is removed after the data has been processed.
For reading data, we sample 5000 rows for each table. If the data is partitioned across multiple files, not all files are read, so the amount of data scanned can be less than the total size of the files.
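A sampling query of this kind could look like the sketch below. The exact SQL that LightBeam issues is not documented here, so the `SELECT * ... LIMIT` form, the function names, and the quoting helper are assumptions for illustration only.

```python
SAMPLE_ROWS = 5000  # sample size stated in this document


def quote_identifier(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_sample_query(database, table, limit=SAMPLE_ROWS):
    """Build a query that reads at most `limit` rows from one table."""
    return (
        f"SELECT * FROM {quote_identifier(database)}.{quote_identifier(table)} "
        f"LIMIT {int(limit)}"
    )


print(build_sample_query("d1", "customers"))
```

Because the query is bounded by `LIMIT`, Athena can stop reading once enough rows have been returned, which is consistent with the note that not every file of a partitioned table is scanned.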
Onboarding AWS Glue Data Source
Log in to your LightBeam Instance.
Click on DATASOURCES on the Top Navigation Bar.
Click on “Add a data source”.
Search for Glue.
Click on Glue.
Configure Basic Details
In the Basic Details section, enter the following information:
Instance Name: Provide a unique name for the AWS Glue data source (e.g., aws-glue-datasource).
Primary Owner: Enter the email address of the individual responsible for this data source (e.g., demo@lightbeam.ai).
Source of Truth (Optional): Toggle this option on if the Glue Catalog serves as a single source of truth for entity validation.
Description (Optional): Add a brief description of the Glue data source (e.g., "AWS Glue Datasource Instance").
Enter Connection Details
Provide the following details in the Connection section:
Select Region: Choose the AWS region where the Glue catalog is hosted (e.g., US East (N. Virginia) us-east-1).
Access Key: The AWS IAM user's access key.
Secret Access Key: The IAM user's secret access key.
Query Engine: Select the query engine for Glue, such as Athena (used for querying Glue catalog data).
Click on Test Connection.
Additional Details (Optional):
Location: The location of the data source.
Purpose: The purpose of the data being collected/processed.
Stage: The stage of the data source. Example: Source, Processing, Archival, etc.
Verify that you get the message Test Connection Success on the screen. Click on Next.
Next, select a workgroup from the drop-down list. Only workgroups with a query output location configured are shown here.
In the next step, you will see a list of databases presented from your Glue datasource.
Select one of the following two options:
i) Show all databases to select: By default, all databases to which you have access permissions will be shown.
ii) Select specific database(s) that you have permission for: If you wish to scan only certain databases, click on Add database name and select them from the drop-down menu.
Please verify that all databases selected for scanning show up in the list of databases. Ensure you've made your desired selections before connecting the data source.
Finally, click on Start Sampling to connect to the Glue data source.
APPENDIX
Minimal permissions setup
This guide outlines the process of creating an IAM user with the minimal permissions necessary for integrating an AWS Glue data source with LightBeam. It involves three AWS services: Glue, S3, and Athena.
Prerequisites for Glue Onboarding
Before integrating Glue, it's necessary to set up a workgroup in Athena with a specified output location for query results. A separate workgroup is recommended for isolating LightBeam queries.
Creating a Workgroup in Athena
In the Athena console, navigate to Administration → Workgroup.
Click Create Workgroup and enter a workgroup name.
Choose Athena SQL as the engine type.
Specify the S3 location to store query results output. This workgroup name will be used when registering the data source with LightBeam.
Creating a Policy with S3, Glue and Athena Permissions
Glue Permissions
Create a policy granting specific actions on Glue resources to allow access to the catalogs, databases, and tables necessary for LightBeam scanning. This needs to be done for the region where Glue is configured:
These permissions are essential for accessing all relevant Glue resources within the specified region. Adjust permissions as needed for scanning specific databases or tables.
Athena Permissions
Include permissions for Athena to list, query, and manage data catalogs and workgroups:
Workgroup specific permissions:
These actions are required for the Athena catalog named AwsDataCatalog and for the workgroup used with LightBeam.
Permission for all workgroups:
S3 Permissions
S3 permissions are categorized into three parts:
Output Bucket Permissions - Permissions for the bucket where Athena query results are stored:
Data Bucket Read Permissions - Read permissions for buckets containing the data:
General Read-Only Permissions - Broad permissions for read-only access to all buckets:
Note:
If the Glue data catalog or any of the S3 buckets (the buckets where the data is stored, or the bucket configured for Athena query results) is KMS encrypted, the following permission must also be added to the policy. All keys used for encryption need to be specified in the Resource field.
A few example policies:
Example 1:
AWS account ID = 1111
AWS Region = us-east-2
Athena workgroup name = w1
Athena query output location = l1
S3 buckets where data resides = l2 and l3
Databases in Glue Catalog that need to be scanned = d1 and d2
The following policy needs to be created for the above scenario:
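As an illustration of how a policy for this scenario might be assembled, the Python sketch below builds a policy document from the Example 1 values. It is a sketch under stated assumptions, not LightBeam's published policy: the action lists are chosen from standard IAM actions for Glue, Athena, and S3, l1, l2, and l3 are treated as bucket names, and the function name and Sid values are hypothetical. Review the actions against your own security requirements before use.

```python
import json


def build_policy(account_id, region, workgroup, output_bucket, data_buckets, databases):
    """Assemble an IAM policy document (as a dict) for Glue, Athena, and S3."""
    glue = f"arn:aws:glue:{region}:{account_id}"
    athena = f"arn:aws:athena:{region}:{account_id}"

    # The catalog, plus each selected database and all tables inside it.
    glue_resources = [f"{glue}:catalog"]
    for db in databases:
        glue_resources += [f"{glue}:database/{db}", f"{glue}:table/{db}/*"]

    # Bucket-level and object-level ARNs for every data bucket.
    data_resources = []
    for bucket in data_buckets:
        data_resources += [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]

    def allow(sid, actions, resources):
        return {"Sid": sid, "Effect": "Allow", "Action": actions, "Resource": resources}

    # All action lists below are assumptions, not a documented minimum.
    statements = [
        allow("GlueRead",
              ["glue:GetDatabase", "glue:GetDatabases", "glue:GetTable",
               "glue:GetTables", "glue:GetPartitions"],
              glue_resources),
        allow("AthenaWorkgroup",
              ["athena:GetWorkGroup", "athena:StartQueryExecution",
               "athena:GetQueryExecution", "athena:GetQueryResults",
               "athena:StopQueryExecution", "athena:GetDataCatalog"],
              [f"{athena}:workgroup/{workgroup}",
               f"{athena}:datacatalog/AwsDataCatalog"]),
        allow("AthenaListWorkgroups", ["athena:ListWorkGroups"], "*"),
        # Write and delete access so result folders can be created and removed.
        allow("OutputBucket",
              ["s3:GetBucketLocation", "s3:ListBucket", "s3:GetObject",
               "s3:PutObject", "s3:DeleteObject"],
              [f"arn:aws:s3:::{output_bucket}", f"arn:aws:s3:::{output_bucket}/*"]),
        allow("DataBucketsRead",
              ["s3:GetBucketLocation", "s3:ListBucket", "s3:GetObject"],
              data_resources),
    ]
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_policy("1111", "us-east-2", "w1", "l1", ["l2", "l3"], ["d1", "d2"])
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor. Scoping Glue resources to `database/d1` and `table/d1/*` (and likewise for d2) keeps access limited to the databases selected for scanning.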
Example 2:
Consider the same scenario as above but with Glue catalog and S3 buckets encrypted with KMS.
Key name used for encrypting glue catalog = k1
Key name used for encrypting S3 buckets = k2
Along with the above permission blocks, the following block also needs to be added to the policy:
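A KMS statement for this scenario might be generated as in the sketch below. The action list is an assumption (decrypt for reading encrypted data, data-key generation for writing encrypted query results), and k1 and k2 are used as stand-ins for key IDs, since KMS key ARNs are formed from key IDs rather than display names. The function name and Sid are hypothetical.

```python
import json


def kms_statement(account_id, region, key_ids):
    """Build one policy statement that covers every key used for encryption."""
    return {
        "Sid": "KmsForGlueAndS3",
        "Effect": "Allow",
        # Assumed actions; adjust to your encryption setup.
        "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey"],
        "Resource": [
            f"arn:aws:kms:{region}:{account_id}:key/{key_id}" for key_id in key_ids
        ],
    }


statement = kms_statement("1111", "us-east-2", ["k1", "k2"])
print(json.dumps(statement, indent=2))
```

The resulting statement would be appended to the `Statement` list of the policy from Example 1.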
Policy Assignment
Create an IAM user.
Assign the policy created above to this user.
Use the accessKey and secretKey from this IAM user for onboarding with LightBeam, ensuring the necessary permissions for data scanning and classification are in place.
Validate permissions to the datasource
Next, validate these permissions against the datasource. This confirms that the credentials provided have authorized access to the datasource. After the permissions have been validated, Glue can be onboarded in LightBeam.
Steps
Go into the sql_user_check_glue directory.
Please refer to the README.md file in the directory for detailed instructions.
About LightBeam
LightBeam automates Privacy, Security, and AI Governance, so businesses can accelerate their growth in new markets. Leveraging generative AI, LightBeam has rapidly gained customers’ trust by pioneering a unique privacy-centric and automation-first approach to security. Unlike siloed solutions, LightBeam ties together sensitive data cataloging, control, and compliance across structured and unstructured data applications providing 360-visibility, redaction, self-service DSRs, and automated ROPA reporting ensuring ultimate protection against ransomware and accidental exposures while meeting data privacy obligations efficiently. LightBeam is on a mission to create a secure privacy-first world helping customers automate compliance against a patchwork of existing and emerging regulations.