AWS Glue
aws.amazon.com/glue
Build Difficulty: 5/5
Build a working replacement in a weekend with AI tools
Serverless Data Integration – Discover, prepare, and integrate all your data at any scale
Overview
Features
31 features across 19 categories
AI Assistance (3)
Uses generative AI to quickly identify and resolve issues in Spark jobs by analyzing job metadata, execution logs, and configurations for root cause analysis and recommendations
Create ETL jobs using natural language descriptions. Automatically generates Apache Spark code that can be customized, tested, and deployed as production jobs
Generative AI automatically analyzes Spark jobs and generates upgrade plans to newer versions, reducing time and effort to keep jobs modern, secure and performant
Cost Optimization (1)
Flexible execution job class for non-time-sensitive workloads that reduces costs up to 35% for preproduction jobs, testing, and data loads
Data Preparation (2)
Interactive point-and-click visual interface for cleaning and normalizing data without writing code. Includes over 250 built-in transformations to combine, pivot, and transpose data
Deduplicates and finds imperfect matching records using machine learning without requiring ML expertise. Learns from labeled examples to identify matches across databases
Data Processing (2)
Enables developers to scale existing Python code and popular Python libraries using Ray, an open-source compute framework. Serverless, with no infrastructure management required
Natively supports Apache Hudi, Apache Iceberg, and Delta Lake for transactional consistency in Amazon S3 based data lakes with read, insert, update, and delete operations
Data Quality (1)
Automatically measures, monitors, and manages data quality in data lakes and pipelines. Computes statistics, recommends rules, monitors quality metrics, and alerts on deterioration
Data Quality & Security (1)
Defines, identifies, and processes sensitive data in pipelines and data lakes. Remediates PII and sensitive data by redacting, replacing, or reporting on personally identifiable information
Data Quality & Validation (1)
Serverless feature that validates and controls evolution of streaming data using registered Apache Avro schemas with integration for Kafka, Amazon MSK, Kinesis, Apache Flink, and AWS Lambda
DevOps & Integration (1)
Integrates with Git-based version control systems, including GitHub and AWS CodeCommit, for maintaining change history and applying DevOps practices. Works with automation tools like Jenkins and AWS CodeDeploy
Development (1)
Serverless notebooks with minimal setup in AWS Glue Studio for quick developer onboarding. Built-in interface for Interactive Sessions with ability to save and schedule notebook code as AWS Glue jobs
Development & Customization (1)
Allows data engineers to write and share business-specific Apache Spark logic as reusable visual transforms available across all jobs in an account
Development & Debugging (1)
Serverless feature for interactively developing ETL code using IDEs or notebooks of choice. Allows engineers to explore, experiment on, and process data interactively with custom readers, writers, and transformations
Discovery & Cataloging (2)
Crawlers connect to data stores, determine the schema using prioritized classifiers, and create metadata in the Data Catalog. Can be run on a schedule, on demand, or triggered by events
Persistent metadata store for all data assets with table definitions, job definitions, schemas, automatic statistics computation, partition registration, and comprehensive schema version history
ETL Development (1)
Visual interface for authoring highly scalable ETL jobs without becoming an Apache Spark expert. Automatically generates code in Scala or Python for extract, transform, and load processes
Integration (3)
AWS Glue is accessible in the next generation of Amazon SageMaker for managing and building workloads in one place with cost-effective serverless data integration
Connects multiple data sources including DynamoDB, SaaS applications (Salesforce, SAP, ServiceNow), and self-managed databases to Amazon Redshift or SageMaker data lakehouse without operational overhead
Provides access to analytics on transactional data by replicating data from Oracle, SQL Server, MySQL, or PostgreSQL to Amazon Redshift within minutes with data filtering capabilities
Monitoring & Observability (1)
All logs and notifications are pushed to Amazon CloudWatch for centralized monitoring and alerting
Orchestration (1)
Jobs can be invoked on a schedule, on demand, or based on events. Supports parallel job execution, inter-job dependencies, bad data filtering, and automatic retries
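As a rough illustration of scheduled and dependency-based invocation, the sketch below builds keyword arguments intended for the `create_trigger` call on the boto3 Glue client. The trigger names, job names, and cron expression are hypothetical, and the block only constructs dictionaries; it does not contact AWS.

```python
def scheduled_trigger_kwargs(trigger_name: str, job_name: str, cron: str) -> dict:
    """Arguments for a trigger that starts one job on a cron schedule."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",          # AWS six-field cron syntax
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger_kwargs(trigger_name: str, upstream_job: str,
                               downstream_job: str) -> dict:
    """Arguments for a trigger that starts a job after another job succeeds."""
    return {
        "Name": trigger_name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


# Hypothetical names, for illustration only.
nightly = scheduled_trigger_kwargs("nightly-load", "load_orders", "0 2 * * ? *")
after_load = conditional_trigger_kwargs("after-load", "load_orders", "build_reports")
```

With valid AWS credentials, each dictionary could be passed along the lines of `boto3.client("glue").create_trigger(**nightly)`.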
Performance & Optimization (6)
Calculates and updates the number of distinct values (NDVs) for each column in Iceberg tables to support cost-based optimization
Supports optimization of Apache Iceberg tables including binpack, sort, and z-order compaction strategies to improve performance and query execution efficiency
Dynamically scales resources up and down based on workload, assigning workers only when needed without over-provisioning or paying for idle resources
Manages Iceberg tables that store precomputed data with automatic refresh capabilities using managed Spark compute to keep views up-to-date
Manages storage overhead for Apache Iceberg tables by retaining only needed snapshots and removing older unnecessary snapshots and associated files
Periodically identifies and removes unnecessary unreferenced files from data storage, freeing up storage space
Security & Governance (1)
AWS Glue 5.0+ provides table, column, and row level permissions for Apache Spark jobs accessing Apache Iceberg, Apache Hudi, and Delta Lake tables for simplified security and governance
Streaming (1)
Continuously consumes data from streaming sources like Amazon Kinesis and Amazon MSK, cleans and transforms it in-flight. Supports event data enrichment, aggregation, and complex analytics operations
Pricing
Free Tier
- ✓ First million metadata objects stored in Data Catalog
- ✓ First million Data Catalog metadata requests per month
Pay-as-you-go - ETL Jobs and Interactive Sessions
- ✓ ETL jobs billed by second
- ✓ Interactive Sessions billed by second
- ✓ Pricing based on DPU usage
Pay-as-you-go - Data Catalog
- ✓ Metadata storage and access
- ✓ First million metadata objects free per month
- ✓ $1.00 per 100,000 objects over a million per month
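The Data Catalog storage figures above can be turned into a quick estimate. This is a minimal sketch that assumes the charge is prorated linearly above the free tier rather than rounded up per 100,000-object block; the function name is illustrative.

```python
FREE_OBJECTS = 1_000_000   # first million objects are free
BLOCK_SIZE = 100_000       # pricing unit listed above
RATE_PER_BLOCK = 1.00      # USD per 100,000 objects over the free tier


def catalog_storage_cost(objects_stored: int) -> float:
    """Estimated monthly Data Catalog storage charge in USD."""
    billable = max(0, objects_stored - FREE_OBJECTS)
    # Assumption: linear proration, not rounding up to whole blocks.
    return billable / BLOCK_SIZE * RATE_PER_BLOCK
```

For example, `catalog_storage_cost(1_500_000)` covers 500,000 billable objects, which is five blocks at $1.00 each.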
Pay-as-you-go - Crawlers
- ✓ Data discovery crawlers
- ✓ Billed by second
Pay-as-you-go - DataBrew Interactive Sessions
- ✓ Interactive data cleaning and preparation
Pay-as-you-go - DataBrew Jobs
- ✓ Automated DataBrew jobs
- ✓ Billed per minute
Pay-as-you-go - Data Quality
- ✓ Recommendation tasks (minimum 2 DPUs)
- ✓ Data Quality tasks (minimum 2 DPUs)
- ✓ Anomaly detection (1 DPU per statistic)
- ✓ 1-minute minimum billing duration
Pay-as-you-go - Iceberg Table Optimization
- ✓ Table compaction for Apache Iceberg
- ✓ Statistics generation
- ✓ Billed per second with 1-minute minimum
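Per-second billing with a 1-minute minimum can be sketched as follows. The DPU-hour rate is not given on this page, so `rate_per_dpu_hour` is a placeholder parameter, and rounding partial seconds up is an assumption.

```python
import math


def billable_seconds(runtime_seconds: float, minimum_seconds: int = 60) -> int:
    """Seconds charged for a run: at least the minimum, partial seconds rounded up."""
    # Assumption: partial seconds are rounded up to the next whole second.
    return max(minimum_seconds, math.ceil(runtime_seconds))


def run_cost(runtime_seconds: float, dpus: float, rate_per_dpu_hour: float) -> float:
    """Estimated charge in USD for one run at a caller-supplied DPU-hour rate."""
    return billable_seconds(runtime_seconds) * dpus * rate_per_dpu_hour / 3600
```

Under these assumptions, a 20-second run is billed as 60 seconds, while a 90.2-second run is billed as 91 seconds.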
Schema Registry
- ✓ AWS Glue Schema Registry usage
AWS Glue Flex
- ✓ Non-time-sensitive workloads
- ✓ Up to 35% cost reduction
Cost Calculator
Pricing data not available for AWS Glue. Check their website for current pricing.
Build vs Buy
Should you build an AWS Glue alternative or buy the subscription? Estimate based on 31 features.
Buy AWS Glue (Better Value)
Build Your Own
Buying AWS Glue saves ~$36,960 over 3 years vs building.
Estimates based on 31 features and a BuildScore of 5/5. Actual costs vary.
Integrations
29 known integrations