Replacement Guide

How to Build Your Own AWS Glue

Replace AWS Glue with a custom build. Serverless Data Integration – Discover, prepare, and integrate all your data at any scale

Weekend Project
31 features · 29 integrations · One weekend

Estimated Timeline

Based on 31 features at Weekend Project difficulty, expect about one weekend with AI-assisted development.

1. Setup & scaffolding: 2 hours
2. Core features: 4-6 hours
3. Polish & deploy: 2 hours

Recommended Tech Stack

Next.js 14

Full-stack React framework with API routes and server components

Supabase

PostgreSQL database, auth, and real-time subscriptions

Tailwind CSS

Utility-first styling for rapid UI development

Key Features to Replicate

Top features across 8 categories.

Performance & Optimization (6 features)

Apache Iceberg Statistics

Calculates and updates number of distinct values (NDVs) for each column in Iceberg tables to support cost-based optimization
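For a custom build, the per-column statistic described above can be sketched in a few lines. This is a minimal illustration, not how AWS Glue computes NDVs: the function name `column_ndvs` and the row-as-dictionary input are assumptions, and it counts exactly in memory, whereas large tables would likely need an approximate sketch.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping, Set


def column_ndvs(rows: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Count exact distinct non-null values per column (hypothetical helper)."""
    seen: Dict[str, Set[Any]] = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            if value is not None:  # nulls do not count as a distinct value
                seen[column].add(value)
    return {column: len(values) for column, values in seen.items()}


rows = [
    {"id": 1, "country": "DE", "plan": "free"},
    {"id": 2, "country": "DE", "plan": "pro"},
    {"id": 3, "country": "FR", "plan": None},
]
stats = column_ndvs(rows)
print(stats)
```

A query planner could use these counts to estimate how selective a filter on each column is.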

Apache Iceberg Table Optimization

Supports optimization of Apache Iceberg tables including binpack, sort, and z-order compaction strategies to improve performance and query execution efficiency
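Of the three strategies named above, binpack is the simplest to prototype: group small files into rewrite batches that approach a target size. The sketch below uses first-fit decreasing, a standard bin-packing heuristic chosen here for illustration; it is not claimed to be what Glue or Iceberg runs. The name `plan_binpack` and the size units are assumptions.

```python
from typing import Dict, List


def plan_binpack(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group small files into rewrite bins using first-fit decreasing."""
    bins: List[List[str]] = []
    remaining: List[int] = []  # free space left in each bin
    # Largest files first; ties broken by name for a stable plan.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        if size >= target_bytes:
            continue  # already large enough, leave it untouched
        for i, free in enumerate(remaining):
            if size <= free:
                bins[i].append(name)
                remaining[i] -= size
                break
        else:
            bins.append([name])
            remaining.append(target_bytes - size)
    # Rewriting a bin that holds a single file gains nothing.
    return [group for group in bins if len(group) > 1]


files = {"a.parquet": 60, "b.parquet": 50, "c.parquet": 40,
         "d.parquet": 30, "e.parquet": 128}
plan = plan_binpack(files, target_bytes=100)
print(plan)
```

Each inner list is one rewrite job: read the listed files and write a single merged file in their place.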

Auto Scaling

Dynamically scales resources up and down based on workload, assigning workers only when needed without over-provisioning or paying for idle resources
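A replacement could approximate this behaviour with a small sizing function that a scheduler calls on every tick. The sketch assumes the workload is measured as a count of pending tasks and that each worker handles a fixed number of them; `desired_workers` and its parameters are hypothetical names, not part of any AWS API.

```python
import math


def desired_workers(pending_tasks: int, tasks_per_worker: int,
                    min_workers: int = 0, max_workers: int = 10) -> int:
    """Return how many workers to run for the current backlog."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    # Round up so a partial batch still gets a worker.
    needed = math.ceil(pending_tasks / tasks_per_worker)
    # Clamp to the allowed range; min_workers=0 means nothing runs while idle.
    return max(min_workers, min(max_workers, needed))


for backlog in (0, 9, 100):
    print(backlog, desired_workers(backlog, tasks_per_worker=4))
```

With no backlog the function returns zero workers, which is the "no idle resources" property the feature describes.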

Materialized View Auto-refresh

Manages Iceberg tables that store precomputed data with automatic refresh capabilities using managed Spark compute to keep views up-to-date

Snapshot Retention Optimizer

Manages storage overhead for Apache Iceberg tables by retaining only needed snapshots and removing older unnecessary snapshots and associated files
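One way to decide which snapshots are "needed" is a policy with a maximum age and a minimum number to keep. The sketch below assumes that policy; the `Snapshot` tuple, `split_snapshots`, and both parameters are illustrative names, and deleting the files behind expired snapshots is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple, Sequence, Tuple


class Snapshot(NamedTuple):
    snapshot_id: int
    committed_at: datetime


def split_snapshots(snapshots: Sequence[Snapshot], now: datetime,
                    max_age: timedelta, min_keep: int
                    ) -> Tuple[List[Snapshot], List[Snapshot]]:
    """Return (keep, expire), always keeping the newest `min_keep` snapshots."""
    newest_first = sorted(snapshots, key=lambda s: s.committed_at, reverse=True)
    keep: List[Snapshot] = []
    expire: List[Snapshot] = []
    for index, snap in enumerate(newest_first):
        if index < min_keep or now - snap.committed_at <= max_age:
            keep.append(snap)
        else:
            expire.append(snap)
    return keep, expire


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
history = [
    Snapshot(0, datetime(2023, 12, 20, tzinfo=timezone.utc)),
    Snapshot(1, datetime(2024, 1, 1, tzinfo=timezone.utc)),
    Snapshot(2, datetime(2024, 1, 5, tzinfo=timezone.utc)),
    Snapshot(3, datetime(2024, 1, 9, tzinfo=timezone.utc)),
]
keep, expire = split_snapshots(history, now, max_age=timedelta(days=3), min_keep=2)
print([s.snapshot_id for s in keep], [s.snapshot_id for s in expire])
```

Snapshot 2 is older than three days but survives because of `min_keep`, which guards against expiring everything on a rarely written table.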

+1 more in this category

AI Assistance (3 features)

Accelerate Debugging with GenAI Troubleshooting (AI, Premium)

Uses generative AI to quickly identify and resolve issues in Spark jobs by analyzing job metadata, execution logs, and configurations for root cause analysis and recommendations

Amazon Q Data Integration (AI)

Create ETL jobs using natural language descriptions. Automatically generates Apache Spark code that can be customized, tested, and deployed as production jobs

Modernize Apache Spark Jobs with GenAI Upgrades (AI, Premium)

Generative AI automatically analyzes Spark jobs and generates upgrade plans to newer versions, reducing time and effort to keep jobs modern, secure and performant

Integration (3 features)

Amazon SageMaker Integration

AWS Glue is accessible in the next generation of Amazon SageMaker, so workloads can be built and managed in one place with cost-effective serverless data integration

Zero-ETL Integration for Multiple Data Sources

Connects multiple data sources including DynamoDB, SaaS applications (Salesforce, SAP, ServiceNow), and self-managed databases to Amazon Redshift or SageMaker data lakehouse without operational overhead

Zero-ETL Integration for Self-Managed Databases

Provides access to analytics on transactional data by replicating data from Oracle, SQL Server, MySQL, or PostgreSQL to Amazon Redshift within minutes with data filtering capabilities

Data Preparation (2 features)

AWS Glue DataBrew

Interactive point-and-click visual interface for cleaning and normalizing data without writing code. Includes over 250 built-in transformations to combine, pivot, and transpose data

FindMatches ML Feature (AI)

Deduplicates and finds imperfect matching records using machine learning without requiring ML expertise. Learns from labeled examples to identify matches across databases
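A weekend build is unlikely to train a model from labeled examples, so a first pass might swap in plain string similarity. The sketch below does exactly that with the standard library's `difflib.SequenceMatcher` and a fixed threshold; it is a stand-in for the ML approach described above, and `find_matches` and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Sequence, Tuple


def find_matches(records: Sequence[str], threshold: float = 0.85
                 ) -> List[Tuple[int, int]]:
    """Return index pairs whose normalized text similarity meets the threshold."""
    normalized = [record.strip().lower() for record in records]
    pairs: List[Tuple[int, int]] = []
    # Compares every pair, so cost grows quadratically with record count.
    for i, j in combinations(range(len(normalized)), 2):
        score = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        if score >= threshold:
            pairs.append((i, j))
    return pairs


names = ["Jon Smith", "John Smith", "Alice Jones"]
pairs = find_matches(names)
print(pairs)
```

Labeled examples could later be used to tune the threshold per field instead of hard-coding it.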

Data Processing (2 features)

AWS Glue for Ray

Enables developers to scale existing Python code and popular Python libraries using Ray.io open-source compute framework. Serverless with no infrastructure management required

Open Source Framework Support

Natively supports Apache Hudi, Apache Iceberg, and Delta Lake for transactional consistency in Amazon S3 based data lakes with read, insert, update, and delete operations

Discovery & Cataloging (2 features)

Automatic Schema Discovery

Crawlers connect to data stores, determine schema using prioritized classifiers, and create metadata in Data Catalog. Can be run on schedule, on demand, or triggered by events
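The "prioritized classifiers" idea can be sketched as an ordered list of type checks where the first one that accepts every sampled value wins. The example assumes CSV-like string input; the classifier list, the type labels, and `infer_schema` are illustrative and do not reproduce Glue's built-in classifiers.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def _parses_as(cast: Callable[[str], object]) -> Callable[[str], bool]:
    """Build a check that passes when `cast` accepts the text."""
    def check(text: str) -> bool:
        try:
            cast(text)
            return True
        except ValueError:
            return False
    return check


# Highest priority first; the string classifier is the catch-all.
CLASSIFIERS: List[Tuple[str, Callable[[str], bool]]] = [
    ("bigint", _parses_as(int)),
    ("double", _parses_as(float)),
    ("string", lambda text: True),
]


def infer_schema(header: Sequence[str],
                 rows: Sequence[Sequence[str]]) -> Dict[str, str]:
    """Pick, per column, the first classifier that accepts all non-empty values."""
    schema: Dict[str, str] = {}
    for index, column in enumerate(header):
        values = [row[index] for row in rows if row[index] != ""]
        if not values:
            schema[column] = "string"  # nothing to sample, stay conservative
            continue
        for type_name, accepts in CLASSIFIERS:
            if all(accepts(value) for value in values):
                schema[column] = type_name
                break
    return schema


header = ["id", "price", "name"]
sample = [["1", "9.99", "widget"], ["2", "12", "gadget"], ["3", "", "gizmo"]]
schema = infer_schema(header, sample)
print(schema)
```

Running this on a schedule or from an upload hook would cover the three trigger modes the feature lists.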

AWS Glue Data Catalog

Persistent metadata store for all data assets with table definitions, job definitions, schemas, automatic statistics computation, partition registration, and comprehensive schema version history
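The schema version history part of the catalog can be prototyped as an append-only list per table. The class below keeps everything in memory for brevity; in the recommended stack the history would presumably live in a PostgreSQL table. `InMemoryCatalog` and its methods are hypothetical names.

```python
from typing import Dict, List, Optional


class InMemoryCatalog:
    """Stores every distinct schema a table has had, oldest first."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Dict[str, str]]] = {}

    def register(self, table: str, schema: Dict[str, str]) -> int:
        """Record a schema and return its 1-based version number."""
        versions = self._history.setdefault(table, [])
        # Only append when the schema actually changed.
        if not versions or versions[-1] != schema:
            versions.append(dict(schema))
        return len(versions)

    def schema(self, table: str, version: Optional[int] = None) -> Dict[str, str]:
        """Return the latest schema, or a specific 1-based version."""
        versions = self._history[table]
        if version is None:
            return dict(versions[-1])
        if not 1 <= version <= len(versions):
            raise ValueError(f"unknown version {version} for {table}")
        return dict(versions[version - 1])


catalog = InMemoryCatalog()
v1 = catalog.register("orders", {"id": "bigint"})
v1_again = catalog.register("orders", {"id": "bigint"})
v2 = catalog.register("orders", {"id": "bigint", "total": "double"})
print(v1, v1_again, v2, catalog.schema("orders", 1))
```

Registering an unchanged schema is a no-op, so repeated crawler runs do not inflate the history.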

Cost Optimization (1 feature)

AWS Glue Flex

Flexible execution job class for non-time-sensitive workloads that reduces costs by up to 35% for preproduction jobs, testing, and data loads

Data Quality (1 feature)

AWS Glue Data Quality

Automatically measures, monitors, and manages data quality in data lakes and pipelines. Computes statistics, recommends rules, monitors quality metrics, and alerts on deterioration
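The "compute statistics, then check rules" loop can be sketched with two common metrics, completeness and uniqueness, and a dictionary of named rules. The metric definitions, rule names, and thresholds below are assumptions made for illustration; they are not Glue's rule language.

```python
from typing import Any, Callable, Dict, Mapping, Sequence

Row = Mapping[str, Any]


def completeness(rows: Sequence[Row], column: str) -> float:
    """Fraction of rows where the column is present and not null."""
    if not rows:
        return 1.0
    present = sum(1 for row in rows if row.get(column) is not None)
    return present / len(rows)


def uniqueness(rows: Sequence[Row], column: str) -> float:
    """Fraction of non-null values in the column that are distinct."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


def evaluate(rows: Sequence[Row],
             rules: Dict[str, Callable[[Sequence[Row]], bool]]) -> Dict[str, bool]:
    """Run every named rule and report pass or fail."""
    return {name: check(rows) for name, check in rules.items()}


users = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": None},
    {"id": 4, "email": "d@example.com"},
]
rules = {
    "id is unique": lambda rows: uniqueness(rows, "id") == 1.0,
    "email is at least 90% complete": lambda rows: completeness(rows, "email") >= 0.9,
}
results = evaluate(users, rules)
failed = [name for name, passed in results.items() if not passed]
print(failed)
```

The `failed` list is the hook for alerting: a non-empty list after a pipeline run signals that quality has dropped below the configured thresholds.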

Cost Calculator

Pricing data is not available for AWS Glue. Check the AWS website for current pricing.
