AWS Glue

aws.amazon.com/glue
Data Integration
Weekend Project

Serverless Data Integration – Discover, prepare, and integrate all your data at any scale

How to Replace AWS Glue

Overview

AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning, and application development. With generative AI assistance, AWS Glue provides all the capabilities needed for data integration with built-in ETL, schema discovery, and cross-service integration. You pay only for the resources consumed while your jobs are running with no infrastructure to set up or manage.

Features

31 features across 19 categories

AI Assistance (3)

Accelerate Debugging with GenAI Troubleshooting (AI, Premium)

Uses generative AI to quickly identify and resolve issues in Spark jobs by analyzing job metadata, execution logs, and configurations for root cause analysis and recommendations

Amazon Q Data Integration (AI)

Create ETL jobs using natural language descriptions. Automatically generates Apache Spark code that can be customized, tested, and deployed as production jobs

Modernize Apache Spark Jobs with GenAI Upgrades (AI, Premium)

Generative AI automatically analyzes Spark jobs and generates upgrade plans to newer versions, reducing time and effort to keep jobs modern, secure and performant

Cost Optimization (1)

AWS Glue Flex

Flexible execution job class for non-time-sensitive workloads that reduces costs by up to 34% for preproduction jobs, testing, and data loads

Data Preparation (2)

AWS Glue DataBrew

Interactive point-and-click visual interface for cleaning and normalizing data without writing code. Includes over 250 built-in transformations to combine, pivot, and transpose data

FindMatches ML Feature (AI)

Deduplicates and finds imperfect matching records using machine learning without requiring ML expertise. Learns from labeled examples to identify matches across databases

Data Processing (2)

AWS Glue for Ray

Enables developers to scale existing Python code and popular Python libraries using Ray, an open-source compute framework. Serverless with no infrastructure management required

Open Source Framework Support

Natively supports Apache Hudi, Apache Iceberg, and Delta Lake for transactional consistency in Amazon S3 based data lakes with read, insert, update, and delete operations

Data Quality (1)

AWS Glue Data Quality

Automatically measures, monitors, and manages data quality in data lakes and pipelines. Computes statistics, recommends rules, monitors quality metrics, and alerts on deterioration

Data Quality & Security (1)

AWS Glue Sensitive Data Detection

Defines, identifies, and processes sensitive data in pipelines and data lakes. Remediates PII and sensitive data by redacting, replacing, or reporting on personally identifiable information
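
The redact-and-report idea described above can be sketched with plain regular expressions. This is a stand-in for illustration, not how AWS Glue detects sensitive data; the two patterns, the `redact` helper, and the sample sentence are all hypothetical.

```python
import re

# Illustrative patterns only; a real detector covers many more entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a [LABEL] token and count matches per label."""
    report = {}
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        report[label] = count
    return text, report


clean, report = redact("Contact jane@example.com, SSN 123-45-6789.")
print(clean)   # Contact [EMAIL], SSN [US_SSN].
print(report)  # {'EMAIL': 1, 'US_SSN': 1}
```

Returning the counts alongside the cleaned text covers both the "redacting" and "reporting" actions in a single pass.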

Data Quality & Validation (1)

AWS Glue Schema Registry

Serverless feature that validates and controls evolution of streaming data using registered Apache Avro schemas with integration for Kafka, Amazon MSK, Kinesis, Apache Flink, and AWS Lambda

DevOps & Integration (1)

Git Integration

Integrates with Git-based version control systems, including GitHub and AWS CodeCommit, for maintaining change history and applying DevOps practices. Works with automation tools like Jenkins and AWS CodeDeploy

Development (1)

AWS Glue Studio Job Notebooks

Serverless notebooks with minimal setup in AWS Glue Studio for quick developer onboarding. Built-in interface for Interactive Sessions with ability to save and schedule notebook code as AWS Glue jobs

Development & Customization (1)

Custom Visual Transforms

Allows data engineers to write and share business-specific Apache Spark logic as reusable visual transforms available across all jobs in an account

Development & Debugging (1)

AWS Glue Interactive Sessions

Serverless feature for interactively developing ETL code using IDEs or notebooks of choice. Allows engineers to explore, experiment on, and process data interactively with custom readers, writers, and transformations

Discovery & Cataloging (2)

Automatic Schema Discovery

Crawlers connect to data stores, determine the schema using prioritized classifiers, and create metadata in the Data Catalog. Can be run on a schedule, on demand, or triggered by events
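
To make "prioritized classifiers" concrete, here is a simplified sketch in which classifiers are tried in order and each returns a label with a certainty score. It assumes the first classifier that is fully certain wins and that the fallback label is UNKNOWN. The two toy classifiers and the `classify` helper are hypothetical and far cruder than a real crawler.

```python
import json


def classify_json(sample: str):
    """Toy classifier: fully certain only if the sample parses as JSON."""
    try:
        json.loads(sample)
        return "json", 1.0
    except ValueError:
        return "json", 0.0


def classify_csv(sample: str):
    """Toy classifier: certain if every line has the same non-zero comma count."""
    lines = sample.strip().splitlines()
    counts = {line.count(",") for line in lines}
    if lines and len(counts) == 1 and counts != {0}:
        return "csv", 1.0
    return "csv", 0.0


def classify(sample: str, classifiers) -> str:
    """Try classifiers in priority order; the first fully certain one wins."""
    best_label, best_certainty = "UNKNOWN", 0.0
    for classifier in classifiers:
        label, certainty = classifier(sample)
        if certainty == 1.0:
            return label
        if certainty > best_certainty:
            best_label, best_certainty = label, certainty
    return best_label


priority = [classify_json, classify_csv]
print(classify('{"id": 1}', priority))          # json
print(classify("id,name\n1,a\n2,b", priority))  # csv
print(classify("hello world", priority))        # UNKNOWN
```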

AWS Glue Data Catalog

Persistent metadata store for all data assets with table definitions, job definitions, schemas, automatic statistics computation, partition registration, and comprehensive schema version history

ETL Development (1)

AWS Glue Studio - Drag-and-Drop ETL Editor

Visual interface for authoring highly scalable ETL jobs without becoming an Apache Spark expert. Automatically generates code in Scala or Python for extract, transform, and load processes

Integration (3)

Amazon SageMaker Integration

AWS Glue is accessible in the next generation of Amazon SageMaker for managing and building workloads in one place with cost-effective serverless data integration

Zero-ETL Integration for Multiple Data Sources

Connects multiple data sources including DynamoDB, SaaS applications (Salesforce, SAP, ServiceNow), and self-managed databases to Amazon Redshift or SageMaker data lakehouse without operational overhead

Zero-ETL Integration for Self-Managed Databases

Provides access to analytics on transactional data by replicating data from Oracle, SQL Server, MySQL, or PostgreSQL to Amazon Redshift within minutes with data filtering capabilities

Monitoring & Observability (1)

CloudWatch Integration

All logs and notifications are pushed to Amazon CloudWatch for centralized monitoring and alerting

Orchestration (1)

Job Scheduling and Orchestration

Jobs can be invoked on a schedule, on demand, or based on events. Supports parallel job execution, inter-job dependencies, bad data filtering, and automatic retries
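
For anyone weighing a replacement, the two core behaviors here, inter-job dependencies and automatic retries, can be sketched in a few lines of standard-library Python. This is a minimal sequential sketch, not AWS Glue's scheduler: `run_pipeline`, the job names, and the `max_retries` default are hypothetical, and parallel execution is not modeled.

```python
from graphlib import TopologicalSorter


def run_pipeline(jobs, dependencies, max_retries=2):
    """Run jobs in dependency order, retrying each failed job up to max_retries times.

    jobs: dict of job name -> zero-argument callable (stand-ins for real jobs)
    dependencies: dict of job name -> set of job names that must finish first
    """
    graph = {name: dependencies.get(name, set()) for name in jobs}
    order = list(TopologicalSorter(graph).static_order())
    attempts = {}
    for name in order:
        for attempt in range(1, max_retries + 2):  # first try plus retries
            attempts[name] = attempt
            try:
                jobs[name]()
                break  # success: move on to the next job
            except Exception:
                if attempt > max_retries:
                    raise  # retries exhausted: surface the failure
    return order, attempts


calls = {"transform": 0}


def flaky_transform():
    """Fails on the first call only, to exercise the retry path."""
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("transient failure")


order, attempts = run_pipeline(
    jobs={"extract": lambda: None, "transform": flaky_transform, "load": lambda: None},
    dependencies={"transform": {"extract"}, "load": {"transform"}},
)
print(order)     # ['extract', 'transform', 'load']
print(attempts)  # {'extract': 1, 'transform': 2, 'load': 1}
```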

Performance & Optimization (6)

Apache Iceberg Statistics

Calculates and updates the number of distinct values (NDVs) for each column in Iceberg tables to support cost-based optimization

Apache Iceberg Table Optimization

Supports optimization of Apache Iceberg tables including binpack, sort, and z-order compaction strategies to improve performance and query execution efficiency
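
As an illustration of the binpack idea (combining many small files into fewer files near a target size), the sketch below plans groups with a first-fit-decreasing heuristic. It is a generic bin-packing sketch, not the algorithm AWS Glue or Iceberg uses; `plan_binpack`, the 512 MB target, and the file sizes are hypothetical.

```python
def plan_binpack(file_sizes_mb: list[int], target_mb: int = 512) -> list[list[int]]:
    """Group file sizes into bins of at most target_mb (first-fit decreasing)."""
    bins: list[list[int]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append([size])
    return bins


plan = plan_binpack([300, 200, 120, 100, 60, 40], target_mb=512)
print(plan)  # [[300, 200], [120, 100, 60, 40]]
```

Six input files become two output files, which is the effect compaction aims for: fewer, larger files for query engines to open.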

Auto Scaling

Dynamically scales resources up and down based on workload, assigning workers only when needed without over-provisioning or paying for idle resources

Materialized View Auto-refresh

Manages Iceberg tables that store precomputed data with automatic refresh capabilities using managed Spark compute to keep views up-to-date

Snapshot Retention Optimizer

Manages storage overhead for Apache Iceberg tables by retaining only needed snapshots and removing older unnecessary snapshots and associated files
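
A retention rule of this kind is commonly expressed as "expire snapshots older than a maximum age, but always keep the newest few". The sketch below assumes that rule. `snapshots_to_expire` and its parameters are hypothetical names, not AWS Glue settings.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(snapshot_times, now, max_age, min_to_keep=1):
    """Return snapshot timestamps older than max_age, sparing the newest min_to_keep."""
    newest_first = sorted(snapshot_times, reverse=True)
    protected = set(newest_first[:min_to_keep])
    cutoff = now - max_age
    return [t for t in newest_first if t < cutoff and t not in protected]


snapshots = [datetime(2024, 1, 1), datetime(2024, 1, 5), datetime(2024, 1, 9)]
expired = snapshots_to_expire(
    snapshots, now=datetime(2024, 1, 10), max_age=timedelta(days=3), min_to_keep=2
)
print(expired)  # [datetime.datetime(2024, 1, 1, 0, 0)]
```

The January 5 snapshot is older than the cutoff but survives because it is one of the two newest.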

Unreferenced File Deletion

Periodically identifies and removes unnecessary unreferenced files from data storage, freeing up storage space

Security & Governance (1)

Fine-Grained Access Control

AWS Glue 5.0+ provides table, column, and row level permissions for Apache Spark jobs accessing Apache Iceberg, Apache Hudi, and Delta Lake tables for simplified security and governance

Streaming (1)

Serverless Streaming ETL

Continuously consumes data from streaming sources like Amazon Kinesis and Amazon MSK, cleans and transforms it in-flight. Supports event data enrichment, aggregation, and complex analytics operations

Pricing

Free Tier

Free
  • First million metadata objects stored in Data Catalog
  • First million Data Catalog metadata requests per month

Pay-as-you-go - ETL Jobs and Interactive Sessions

$0.44 per DPU-hour
  • ETL jobs billed by second
  • Interactive Sessions billed by second
  • Pricing based on DPU usage
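
As a rough illustration of the DPU-hour arithmetic, the sketch below estimates the cost of a single job run. The `etl_job_cost` helper and the example job size are hypothetical; the rate is the $0.44 per DPU-hour listed here, and any minimum billing duration that may apply is not modeled.

```python
from decimal import Decimal

DPU_HOUR_RATE = Decimal("0.44")  # USD per DPU-hour, as listed above


def etl_job_cost(dpus: int, runtime_seconds: int, rate: Decimal = DPU_HOUR_RATE) -> Decimal:
    """Estimate the cost of one run billed by the second (hypothetical helper)."""
    dpu_hours = Decimal(dpus) * Decimal(runtime_seconds) / Decimal(3600)
    return (dpu_hours * rate).quantize(Decimal("0.01"))


# Example: a 10-DPU job that runs for 15 minutes uses 2.5 DPU-hours.
print(etl_job_cost(dpus=10, runtime_seconds=15 * 60))  # 1.10
```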

Pay-as-you-go - Data Catalog

Simplified monthly fee
  • Metadata storage and access
  • First million metadata objects free per month
  • $1.00 per 100,000 objects over a million per month
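
The storage tier above can be sketched as a free allowance followed by a flat rate. This assumes the charge is prorated linearly for partial blocks of 100,000 objects, which this page does not state. `catalog_storage_cost` is a hypothetical helper, and request charges are not modeled.

```python
from decimal import Decimal

FREE_OBJECTS = 1_000_000          # first million objects are free
RATE_PER_100K = Decimal("1.00")   # USD per 100,000 objects beyond the free tier


def catalog_storage_cost(objects: int) -> Decimal:
    """Monthly storage cost, assuming linear proration of partial 100k blocks."""
    billable = max(0, objects - FREE_OBJECTS)
    return (Decimal(billable) / Decimal(100_000) * RATE_PER_100K).quantize(Decimal("0.01"))


print(catalog_storage_cost(800_000))    # 0.00
print(catalog_storage_cost(1_500_000))  # 5.00
```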

Pay-as-you-go - Crawlers

$0.44 per DPU-hour
  • Data discovery crawlers
  • Billed by second

Pay-as-you-go - DataBrew Interactive Sessions

$1.00 per 30-minute session
  • Interactive data cleaning and preparation

Pay-as-you-go - DataBrew Jobs

$0.48 per node-hour
  • Automated DataBrew jobs
  • Billed per minute

Pay-as-you-go - Data Quality

$0.44 per DPU-hour
  • Recommendation tasks (minimum 2 DPUs)
  • Data Quality tasks (minimum 2 DPUs)
  • Anomaly detection (1 DPU per statistic)
  • 1-minute minimum billing duration

Pay-as-you-go - Iceberg Table Optimization

$0.44 per DPU-hour
  • Table compaction for Apache Iceberg
  • Statistics generation
  • Billed per second with 1-minute minimum

Schema Registry

No additional charge
  • AWS Glue Schema Registry usage

AWS Glue Flex

$0.29 per DPU-hour
  • Non-time-sensitive workloads
  • Up to 34% cost reduction
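
The Flex discount can be checked against the standard rate with one line of arithmetic. The sketch uses only the two rates listed on this page ($0.29 and $0.44 per DPU-hour); the variable names are illustrative.

```python
STANDARD_RATE = 0.44  # USD per DPU-hour, standard execution class (listed above)
FLEX_RATE = 0.29      # USD per DPU-hour, Flex execution class (listed above)

saving = (STANDARD_RATE - FLEX_RATE) / STANDARD_RATE
print(f"Flex saving vs standard: {saving:.1%}")  # Flex saving vs standard: 34.1%
```

At the listed rates the reduction works out to roughly 34%, in line with the headline figure.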

Build vs Buy

Should you build an AWS Glue alternative or buy the subscription? Estimate based on 31 features.

Buy AWS Glue

Better Value
Monthly cost: Contact Sales
3-year total: Varies
Time to deploy: Days

Build Your Own

Development cost: $24,000
Maintenance: $360/mo
3-year total: $36,960
Dev time: ~2 months

Building is estimated at ~$36,960 over 3 years. What buying saves depends on your AWS Glue usage, which is billed pay-as-you-go.

Estimates based on 31 features and a BuildScore of 5/5. Actual costs vary.
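
The build-side total follows directly from the development and maintenance figures above. The sketch below reproduces it and derives a break-even monthly spend, under the simplifying assumption that buying costs a flat monthly amount; the variable names are illustrative.

```python
DEVELOPMENT_COST = 24_000      # USD, one-time (from the estimate above)
MAINTENANCE_PER_MONTH = 360    # USD per month (from the estimate above)
MONTHS = 36                    # 3-year horizon

build_total = DEVELOPMENT_COST + MAINTENANCE_PER_MONTH * MONTHS
break_even_monthly = build_total / MONTHS

print(build_total)                   # 36960
print(round(break_even_monthly, 2))  # 1026.67
```

Under these estimates, buying comes out cheaper over 3 years as long as the average monthly AWS Glue bill stays below about $1,027.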

Integrations

29 known integrations

Amazon Athena, Amazon Aurora, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Kinesis Data Analytics for Apache Flink, Amazon Managed Streaming for Apache Kafka (MSK), Amazon MWAA, Amazon RDS, Amazon Redshift, Amazon S3, Amazon SageMaker, Apache Flink, Apache Kafka, Apache Spark, Amazon CloudWatch, AWS CodeCommit, AWS CodeDeploy, AWS Lake Formation, AWS Lambda, GitHub, Jenkins, MySQL, Oracle, PostgreSQL, Salesforce, SAP, ServiceNow, SQL Server