AWS Glue

Connecting Glue to LightBeam

Overview

LightBeam Spectra users can connect various data sources to the LightBeam application, which then continuously monitors them for PII and PHI data.

Examples: AWS Glue, Looker, DynamoDB, Redshift, etc.

About AWS Glue

AWS Glue is Amazon's serverless data integration service that streamlines data processing tasks. It provides a convenient way for analytics users to connect, prepare, and integrate data from diverse sources such as Postgres, RDS, S3, and Redshift. One of its key features is the metadata catalog, which stores the definitions of databases, tables, and columns.

Metadata Catalog Utilization:
- In LightBeam, we focus on the metadata catalog provided by AWS Glue to extract data structures like databases, tables, and columns.
Sample Data Queries for PII Detection:
- LightBeam goes beyond metadata extraction by sampling data for PII detection. To do this, it uses AWS Athena to run sample queries that read 5000 rows from each table.
AWS Services Employed:
- Glue: Accessed for its metadata catalog capabilities.
- S3: Utilized as the storage repository for data files.
- Athena: Deployed as the query engine to retrieve sample data from tables.

Operational Notes and Constraints:

The Glue catalog, the S3 data, and Athena must all be in the same region, which is configured during data source registration.
Data must reside in S3, in a supported format such as CSV, JSON, Parquet, Avro, or ORC.
The current setup does not support workflows for table clusters or entity creation.
Columns containing complex data types, such as maps and arrays, are skipped in PII classification.
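The last constraint can be pictured with a small sketch. Glue reports column types as Hive-style strings such as array<string> or map<string,int>. The helper names below are hypothetical, and treating struct as a complex type is an assumption, since the text above names only maps and arrays.

```python
# Hypothetical helper: decide whether a Glue column type is a complex type
# that would be skipped during PII classification.
COMPLEX_PREFIXES = ("array", "map", "struct")  # "struct" is an assumption


def is_complex_type(glue_type):
    """Return True for Hive-style complex types such as array<string>."""
    normalized = glue_type.strip().lower()
    return any(
        normalized == prefix or normalized.startswith(prefix + "<")
        for prefix in COMPLEX_PREFIXES
    )


def classifiable_columns(columns):
    """Keep only the names of columns whose type is not complex."""
    return [c["Name"] for c in columns if not is_complex_type(c["Type"])]


columns = [
    {"Name": "email", "Type": "string"},
    {"Name": "tags", "Type": "array<string>"},
    {"Name": "attrs", "Type": "map<string,string>"},
    {"Name": "age", "Type": "int"},
]
print(classifiable_columns(columns))  # ['email', 'age']
```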

Features

Datasource Registration

AWS administrators can create a user with limited permissions and use that user's accessKey and secretKey for registration. Registration also requires the AWS region where the Glue catalog resides and the name of the Athena workgroup used for query execution. During registration, users are shown the list of databases in the Glue catalog and can select specific databases for scanning. They are also shown the Athena workgroups that have a query output location configured, and must choose one of them as part of registration.
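As an illustration of the workgroup filtering described above, the sketch below lists workgroups and keeps only those with a result output location. It assumes a boto3 Athena client is passed in. The function name is hypothetical, and this is not necessarily how LightBeam performs the check.

```python
def workgroups_with_output_location(athena):
    """Return (name, output_location) pairs for workgroups that define a result location.

    `athena` is expected to behave like boto3.client("athena").
    """
    eligible = []
    token = None
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = athena.list_work_groups(**kwargs)
        for summary in page.get("WorkGroups", []):
            detail = athena.get_work_group(WorkGroup=summary["Name"])
            config = detail["WorkGroup"].get("Configuration", {})
            location = config.get("ResultConfiguration", {}).get("OutputLocation")
            if location:
                eligible.append((summary["Name"], location))
        token = page.get("NextToken")
        if not token:
            return eligible


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   athena = boto3.client("athena", region_name="us-east-1")
#   print(workgroups_with_output_location(athena))
```

The two calls used here correspond to the athena:ListWorkGroups and athena:GetWorkGroup actions.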

Metadata Scanning

The scanning process targets tables within the Glue databases specified in the scan conditions. Table data must be stored in S3, so only tables with data in S3 are eligible for scanning. For each eligible table, LightBeam retrieves the list of columns, their data types, and, when available, the row count and table size.
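A minimal sketch of this filtering step, operating on table definitions shaped like the entries of TableList in a Glue GetTables response. The function name is hypothetical. Reading row count and size from the recordCount and sizeKey table parameters is an assumption: these are typically written by Glue crawlers and are often absent.

```python
def summarize_s3_tables(tables):
    """Summarize Glue table definitions, keeping only tables stored in S3."""
    summaries = []
    for table in tables:
        storage = table.get("StorageDescriptor", {})
        location = storage.get("Location", "")
        if not location.startswith("s3://"):
            continue  # not S3-backed, so not eligible for scanning
        params = table.get("Parameters", {})
        summaries.append({
            "table": table["Name"],
            "location": location,
            "columns": [(c["Name"], c["Type"]) for c in storage.get("Columns", [])],
            # Assumed crawler-populated statistics; None when not available.
            "row_count": int(params["recordCount"]) if "recordCount" in params else None,
            "size_bytes": int(params["sizeKey"]) if "sizeKey" in params else None,
        })
    return summaries


tables = [
    {
        "Name": "orders",
        "StorageDescriptor": {
            "Location": "s3://l2/orders/",
            "Columns": [{"Name": "id", "Type": "bigint"}, {"Name": "email", "Type": "string"}],
        },
        "Parameters": {"recordCount": "1200", "sizeKey": "204800"},
    },
    {
        "Name": "jdbc_customers",
        "StorageDescriptor": {"Location": "mydb.public.customers", "Columns": []},
    },
]
for summary in summarize_s3_tables(tables):
    print(summary)
```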

PII Detection

PII detection hinges on obtaining sample data from all columns within a table, for which Athena is employed. SQL queries are executed through Athena to gather sample data for each table, with the query results stored in the default output location of the utilized LightBeam workgroup. To manage data and control storage costs in S3, query result files for each table are placed in a dedicated folder, which is removed after the data has been processed.

When reading data, LightBeam samples 5000 rows from each table. If the data is partitioned across multiple files, not all files are read, so the amount of data scanned can be less than the total size of the files.
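The sampling step could look roughly like the sketch below. The exact SQL LightBeam issues is not documented here, so the SELECT ... LIMIT query, the per-table folder layout, and all function names are assumptions. The `athena` argument is expected to behave like a boto3 Athena client.

```python
SAMPLE_ROWS = 5000  # sample size stated in the text


def build_sample_query(database, table, limit=SAMPLE_ROWS):
    """Build an Athena SQL query that reads at most `limit` rows from a table."""
    def quote(identifier):
        # Double-quote identifiers and escape embedded double quotes.
        return '"' + identifier.replace('"', '""') + '"'

    return f"SELECT * FROM {quote(database)}.{quote(table)} LIMIT {int(limit)}"


def table_output_location(workgroup_output, database, table):
    """Derive a dedicated per-table results folder (hypothetical layout)."""
    return f"{workgroup_output.rstrip('/')}/{database}/{table}/"


def start_sample(athena, workgroup, workgroup_output, database, table):
    """Start the sample query and return its query execution ID."""
    response = athena.start_query_execution(
        QueryString=build_sample_query(database, table),
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
        ResultConfiguration={
            "OutputLocation": table_output_location(workgroup_output, database, table)
        },
    )
    return response["QueryExecutionId"]


print(build_sample_query("d1", "orders"))
```

If the workgroup is set to override client-side settings, Athena ignores the per-query output location and writes results to the workgroup's own location instead.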

Onboarding AWS Glue Data Source

Log in to your LightBeam instance.
Click on DATASOURCES on the Top Navigation Bar.
Click on “Add a data source”.

Search for Glue.

Click on Glue.

Configure Basic Details

In the Basic Details section, enter the following information:

Instance Name: Provide a unique name for the AWS Glue data source (e.g., aws-glue-datasource).
Primary Owner: Enter the email address of the individual responsible for this data source (e.g., [email protected]).
Source of Truth (Optional): Toggle this option on if the Glue Catalog serves as a single source of truth for entity validation.
Description (Optional): Add a brief description of the Glue data source (e.g., "AWS Glue Datasource Instance").

Enter Connection Details

Provide the following details in the Connection section:

Select Region: Choose the AWS region where the Glue catalog is hosted (e.g., US East (N. Virginia) us-east-1).
Access Key: The AWS IAM user's access key.
Secret Access Key: The IAM user's secret access key.
Query Engine: Select the query engine for Glue, such as Athena (used for querying Glue catalog data).

Click on Test Connection.
Additional Details (Optional):

Location: The location of the data source.
Purpose: The purpose of the data being collected/processed.
Stage: The stage of the data source. Example: Source, Processing, Archival, etc.

Verify that you get the message Test Connection Success on the screen. Click on Next.
Next, select a workgroup from the drop-down list. Only workgroups with a query output location configured are shown here.
In the next step, you will see a list of databases presented from your Glue datasource.
Fig 5. AWS Glue - Select database

Select one of the following two options:

i) Show all databases to select: By default, all databases to which you have access permissions will be shown.

ii) Select specific database(s) that you have permission for: If you wish to scan only certain databases, click on Add database name and select them from the drop-down menu.

Please verify that all databases selected for scanning show up in the list of databases. Ensure you've made your desired selections before connecting the data source.

Finally, click on Start Sampling to connect to the Glue data source.

APPENDIX

Minimal permissions setup

This guide outlines how to create an IAM user with the minimal permissions necessary for integrating an AWS Glue data source with LightBeam. Three AWS services are involved: Glue, S3, and Athena.

Prerequisites for Glue Onboarding

Before integrating Glue, it's necessary to set up a workgroup in Athena with a specified output location for query results. A separate workgroup is recommended for isolating LightBeam queries.

Creating a Workgroup in Athena

In the Athena console, navigate to Administration → Workgroup.
Fig 6. AWS Glue - Workgroups in Athena console
Click Create Workgroup and enter a workgroup name.

Choose Athena SQL as the engine type. Specify the S3 location where query results will be stored. This workgroup name will be used when registering the data source with LightBeam.
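For administrators who prefer scripting over the console, the same workgroup can be created through the Athena API. This is a sketch that assumes a boto3 Athena client is passed in. The function name and the description text are hypothetical.

```python
def create_lightbeam_workgroup(athena, name, output_location):
    """Create an Athena workgroup whose query results go to `output_location`.

    `athena` is expected to behave like boto3.client("athena").
    """
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    athena.create_work_group(
        Name=name,
        Description="Workgroup for LightBeam sample queries",
        Configuration={"ResultConfiguration": {"OutputLocation": output_location}},
    )


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   athena = boto3.client("athena", region_name="us-east-2")
#   create_lightbeam_workgroup(athena, "w1", "s3://l1/")
```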

Creating a Policy with S3, Glue and Athena Permissions

Glue Permissions

Create a policy granting specific actions on Glue resources to allow access to the catalogs, databases, and tables necessary for LightBeam scanning. This needs to be done for the region where Glue is configured:

{
    "Sid": "VisualEditor1",
    "Effect": "Allow",
    "Action": [
        "glue:GetTables",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetDatabase",
        "glue:GetPartitions",
        "glue:GetPartition"
    ],
    "Resource": [
        "arn:aws:glue:<AWS region>:<account_id>:database/*",
        "arn:aws:glue:<AWS region>:<account_id>:catalog",
        "arn:aws:glue:<AWS region>:<account_id>:table/*/*"
    ]
}

These permissions are essential for accessing all relevant Glue resources within the specified region. Adjust permissions as needed for scanning specific databases or tables.
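To adjust the statement for specific databases, the resource list can be generated rather than edited by hand. The sketch below is one way to do that. The function name and the Sid are hypothetical, and narrowing the table ARNs to the selected databases is a choice made for this example.

```python
import json

GLUE_READ_ACTIONS = [
    "glue:GetTables",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetDatabase",
    "glue:GetPartitions",
    "glue:GetPartition",
]


def glue_statement(region, account_id, databases=None):
    """Build the Glue policy statement; with no databases given, allow all."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    names = databases or ["*"]
    resources = [f"{prefix}:catalog"]
    resources += [f"{prefix}:database/{name}" for name in names]
    # Table ARNs have the form table/<database>/<table>.
    resources += [f"{prefix}:table/{name}/*" for name in names]
    return {
        "Sid": "GlueReadAccess",
        "Effect": "Allow",
        "Action": GLUE_READ_ACTIONS,
        "Resource": resources,
    }


print(json.dumps(glue_statement("us-east-2", "1111", ["d1", "d2"]), indent=4))
```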

Athena Permissions

Include permissions for Athena to list, query, and manage data catalogs and workgroups:

Workgroup specific permissions:

{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": [
        "athena:ListDataCatalogs",
        "athena:GetTables",
        "athena:GetTable",
        "athena:GetCatalogs",
        "athena:GetWorkGroup",
        "athena:ListDatabases",
        "athena:GetQueryExecution",
        "athena:StartQueryExecution",
        "athena:GetQueryResults",
        "athena:GetDatabase",
        "athena:GetDataCatalog"
    ],
    "Resource": [
        "arn:aws:athena:<AWS region>:<account_id>:datacatalog/AwsDataCatalog",
        "arn:aws:athena:<AWS region>:<account_id>:workgroup/<workgroup-name>"
    ]
}

These actions are required for the Athena catalog named AwsDataCatalog and for the workgroup used with LightBeam.

Permissions for all workgroups:

{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": [
        "athena:GetWorkGroup",
        "athena:ListWorkGroups"
    ],
    "Resource": "*"
}

S3 Permissions

S3 permissions are categorized into three parts:

Output Bucket Permissions - Permissions for the bucket where Athena query results are stored. The bucket ARN covers bucket-level actions such as s3:ListBucket, and the /* ARN covers the objects inside the bucket:

{
    "Sid": "VisualEditor2",
    "Effect": "Allow",
    "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucketMultipartUploads",
        "s3:AbortMultipartUpload",
        "s3:ListBucket",
        "s3:DeleteObject",
        "s3:GetBucketLocation",
        "s3:ListMultipartUploadParts"
    ],
    "Resource": [
        "arn:aws:s3:::<output-bucket-for-workgroup>",
        "arn:aws:s3:::<output-bucket-for-workgroup>/*"
    ]
}

Data Bucket Read Permissions - Read permissions for buckets containing the data:

{
    "Sid": "VisualEditor1",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::<data-bucket-location-1>/*",
        "arn:aws:s3:::<data-bucket-location-2>/*"
    ]
}

General Read-Only Permissions - Broad permissions for read-only access to all buckets:

{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": [
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:GetBucketLocation"
    ],
    "Resource": "arn:aws:s3:::*"
}

Note:

If the Glue data catalog or any of the S3 buckets (the buckets where the data is stored, or the bucket configured for Athena query results) is KMS encrypted, the following permission must also be added to the policy. Every key used for encryption must be listed in the Resource field.

{
	"Sid": "VisualEditor6",
	"Effect": "Allow",
	"Action": [
		"kms:Decrypt",
		"kms:Encrypt",
		"kms:GenerateDataKey"
	],
	"Resource": "arn:aws:kms:<AWS region>:<account_id>:key/<key_id>"
}

A few example policies:

Example 1:

AWS account ID = 1111

AWS Region = us-east-2

Athena workgroup name = w1

Athena query output location = l1

S3 buckets where data resides = l2 and l3

Databases in Glue Catalog that need to be scanned = d1 and d2

The following policy needs to be created for the above scenario:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "glue:GetTables",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:GetPartitions",
                "glue:GetPartition"
            ],
            "Resource": [
                "arn:aws:glue:us-east-2:1111:database/d1",
                "arn:aws:glue:us-east-2:1111:database/d2",
                "arn:aws:glue:us-east-2:1111:catalog",
                "arn:aws:glue:us-east-2:1111:table/*/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "athena:ListDataCatalogs",
                "athena:GetTables",
                "athena:GetTable",
                "athena:GetCatalogs",
                "athena:GetWorkGroup",
                "athena:ListDatabases",
                "athena:GetQueryExecution",
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:GetDatabase",
                "athena:GetDataCatalog"
            ],
            "Resource": [
                "arn:aws:athena:us-east-2:1111:datacatalog/AwsDataCatalog",
                "arn:aws:athena:us-east-2:1111:workgroup/w1"
            ]
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "athena:GetWorkGroup",
                "athena:ListWorkGroups"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::l1",
                "arn:aws:s3:::l1/*"
            ]
        },
        {
            "Sid": "VisualEditor4",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::l2/*",
                "arn:aws:s3:::l3/*"
            ]
        },
        {
            "Sid": "VisualEditor5",
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}

Example 2:

Consider the same scenario as above, but with the Glue catalog and S3 buckets encrypted with KMS.

Key ID of the key used for encrypting the Glue catalog = k1

Key ID of the key used for encrypting the S3 buckets = k2

Along with the permission blocks above, the following block must also be added to the policy.

{
	"Sid": "VisualEditor6",
	"Effect": "Allow",
	"Action": [
		"kms:Decrypt",
		"kms:Encrypt",
		"kms:GenerateDataKey"
	],
	"Resource": [
		"arn:aws:kms:us-east-2:1111:key/k1",
		"arn:aws:kms:us-east-2:1111:key/k2"
	]
}
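Hand-edited policy JSON is easy to get subtly wrong, for example with a trailing comma or a Sid used twice (IAM requires Sid values to be unique within a policy). The sketch below is a small local pre-check that can be run before pasting a policy into the console. The function name is hypothetical, and it does not replace IAM's own validation.

```python
import json


def check_policy(policy_text):
    """Return a list of problems found in a policy document (empty if none)."""
    try:
        policy = json.loads(policy_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]

    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement is also allowed
        statements = [statements]

    problems = []
    seen = set()
    for statement in statements:
        sid = statement.get("Sid")
        if sid is None:
            continue  # Sid is optional
        if sid in seen:
            problems.append(f"duplicate Sid: {sid}")
        seen.add(sid)
    return problems


example = """
{
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "A", "Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::l2/*"},
        {"Sid": "A", "Effect": "Allow", "Action": "s3:ListBucket", "Resource": "arn:aws:s3:::l2"}
    ]
}
"""
print(check_policy(example))  # ['duplicate Sid: A']
```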

Policy Assignment

Create an IAM user

Assign the policy created above to this user

Use the accessKey and secretKey from this IAM user when onboarding with LightBeam, so that the necessary permissions for data scanning and classification are in place.

Validate permissions to the datasource.

Next, validate these permissions against the datasource. This confirms that the credentials provided have authorized access to it. After the permissions have been validated, Glue can be onboarded in LightBeam.

Steps

First, clone the repository https://github.com/lightbeamai/lb-installer
Go into the sql_user_check_glue directory.
Refer to the README.md file in that directory for detailed instructions.

About LightBeam

LightBeam automates Privacy, Security, and AI Governance, so businesses can accelerate their growth in new markets. Leveraging generative AI, LightBeam has rapidly gained customers’ trust by pioneering a unique privacy-centric and automation-first approach to security. Unlike siloed solutions, LightBeam ties together sensitive data cataloging, control, and compliance across structured and unstructured data applications providing 360-visibility, redaction, self-service DSRs, and automated ROPA reporting ensuring ultimate protection against ransomware and accidental exposures while meeting data privacy obligations efficiently. LightBeam is on a mission to create a secure privacy-first world helping customers automate compliance against a patchwork of existing and emerging regulations.

For any questions or suggestions, please get in touch with us at: [email protected].

