How to Build Your Own AWS Glue
Replace AWS Glue with a custom build. Serverless Data Integration – Discover, prepare, and integrate all your data at any scale
Build Difficulty: 5/5
Build a working replacement in a weekend with AI tools
Estimated Timeline
Based on 31 features at Weekend Project difficulty, expect about one weekend with AI-assisted development.
Recommended Tech Stack
Full-stack React framework with API routes and server components
PostgreSQL database, auth, and real-time subscriptions
Utility-first styling for rapid UI development
Key Features to Replicate
Top features across 8 categories, 31 features in total.
Performance & Optimization (6 features)
Calculates and updates the number of distinct values (NDVs) for each column in Iceberg tables to support cost-based optimization
Supports optimization of Apache Iceberg tables including binpack, sort, and z-order compaction strategies to improve performance and query execution efficiency
Dynamically scales resources up and down based on workload, assigning workers only when needed without over-provisioning or paying for idle resources
Manages Iceberg tables that store precomputed data with automatic refresh capabilities using managed Spark compute to keep views up-to-date
Manages storage overhead for Apache Iceberg tables by retaining only needed snapshots and removing older unnecessary snapshots and associated files
+1 more in this category
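The snapshot-retention feature in this category can be approximated with a small policy function. The following is a minimal sketch, not the Iceberg or AWS Glue API: the `Snapshot` class, the `snapshots_to_expire` function, and the `max_age` and `min_to_keep` parameters are all hypothetical names chosen for illustration. It only decides which snapshots are eligible for removal; deleting the associated data files is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical stand-in for a table snapshot record.
    snapshot_id: int
    committed_at: datetime


def snapshots_to_expire(snapshots, now, max_age, min_to_keep):
    """Return snapshots older than max_age, always keeping the newest min_to_keep."""
    newest_first = sorted(snapshots, key=lambda s: s.committed_at, reverse=True)
    # The most recent snapshots are protected regardless of their age.
    protected = {s.snapshot_id for s in newest_first[:min_to_keep]}
    cutoff = now - max_age
    return [
        s
        for s in newest_first
        if s.snapshot_id not in protected and s.committed_at < cutoff
    ]


# Example: six daily snapshots, keep anything from the last 3 days and at least 2 snapshots.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
history = [Snapshot(i, now - timedelta(days=i)) for i in range(6)]
expired = snapshots_to_expire(history, now, timedelta(days=3), min_to_keep=2)
print([s.snapshot_id for s in expired])
```

Combining an age cutoff with a minimum count means a table that is rarely written to never loses its last few snapshots, which is one reasonable design choice among several.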
AI Assistance (3 features)
Uses generative AI to quickly identify and resolve issues in Spark jobs by analyzing job metadata, execution logs, and configurations for root cause analysis and recommendations
Create ETL jobs using natural language descriptions. Automatically generates Apache Spark code that can be customized, tested, and deployed as production jobs
Generative AI automatically analyzes Spark jobs and generates upgrade plans to newer versions, reducing time and effort to keep jobs modern, secure and performant
Integration (3 features)
AWS Glue is accessible in the next generation of Amazon SageMaker for managing and building workloads in one place with cost-effective serverless data integration
Connects multiple data sources including DynamoDB, SaaS applications (Salesforce, SAP, ServiceNow), and self-managed databases to Amazon Redshift or SageMaker data lakehouse without operational overhead
Provides access to analytics on transactional data by replicating data from Oracle, SQL Server, MySQL, or PostgreSQL to Amazon Redshift within minutes with data filtering capabilities
Data Preparation (2 features)
Interactive point-and-click visual interface for cleaning and normalizing data without writing code. Includes over 250 built-in transformations to combine, pivot, and transpose data
Deduplicates and finds imperfect matching records using machine learning without requiring ML expertise. Learns from labeled examples to identify matches across databases
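For a weekend build, the matching feature can be prototyped without a trained model. The sketch below swaps the machine-learning approach for plain string similarity from Python's standard `difflib` module, and "learns" only a single cutoff from labeled example pairs. The function names `similarity`, `learn_threshold`, and `find_matches` are hypothetical, and the pairwise comparison is quadratic in the number of records, so it suits small datasets only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case- and whitespace-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def learn_threshold(labeled_pairs):
    """Pick the cutoff that classifies the most labeled (a, b, is_match) pairs correctly."""
    scored = [(similarity(a, b), is_match) for a, b, is_match in labeled_pairs]
    candidates = sorted({score for score, _ in scored})

    def correct(threshold):
        return sum((score >= threshold) == is_match for score, is_match in scored)

    return max(candidates, key=correct)


def find_matches(records, threshold):
    """Return every pair of records whose similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(records, 2)
        if similarity(a, b) >= threshold
    ]


# Example: labeled pairs teach the cutoff, which is then applied to new records.
labeled = [
    ("Jon Smith", "John Smith", True),
    ("Acme Corp", "ACME Corp ", True),
    ("Jon Smith", "Mary Jones", False),
]
threshold = learn_threshold(labeled)
print(find_matches(["Jon Smith", "John Smith", "Mary Jones"], threshold))
```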
Data Processing (2 features)
Enables developers to scale existing Python code and popular Python libraries using the Ray.io open-source compute framework. Serverless with no infrastructure management required
Natively supports Apache Hudi, Apache Iceberg, and Delta Lake for transactional consistency in Amazon S3-based data lakes with read, insert, update, and delete operations
Discovery & Cataloging (2 features)
Crawlers connect to data stores, determine schema using prioritized classifiers, and create metadata in Data Catalog. Can be run on a schedule, on demand, or triggered by events
Persistent metadata store for all data assets with table definitions, job definitions, schemas, automatic statistics computation, partition registration, and comprehensive schema version history
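The crawler and catalog ideas in this category can be sketched together: try classifiers in priority order on a data sample, then record the resulting schema with version history. This is an illustrative sketch under stated assumptions, not how AWS Glue implements it. The classifier functions, the `CLASSIFIERS` list, `crawl`, and `register` are hypothetical names, the catalog is a plain in-memory dict, and the CSV classifier treats every column as a string.

```python
import csv
import io
import json


def classify_json(sample: str):
    """Return a column-to-type schema if the sample is JSON Lines of objects, else None."""
    try:
        rows = [json.loads(line) for line in sample.splitlines() if line.strip()]
    except json.JSONDecodeError:
        return None
    if not rows or not all(isinstance(row, dict) for row in rows):
        return None
    return {key: type(value).__name__ for key, value in rows[0].items()}


def classify_csv(sample: str):
    """Return an all-string schema if the sample has a header row with 2+ columns, else None."""
    rows = list(csv.reader(io.StringIO(sample)))
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    return {name: "str" for name in rows[0]}


# Earlier entries take priority: the first classifier that recognizes the sample wins.
CLASSIFIERS = [("json", classify_json), ("csv", classify_csv)]


def crawl(sample: str, classifiers=CLASSIFIERS):
    for name, classifier in classifiers:
        schema = classifier(sample)
        if schema is not None:
            return {"classification": name, "schema": schema}
    return {"classification": "unknown", "schema": {}}


def register(catalog: dict, table: str, schema: dict) -> int:
    """Append a new schema version only when it differs from the latest; return the version count."""
    versions = catalog.setdefault(table, [])
    if not versions or versions[-1] != schema:
        versions.append(schema)
    return len(versions)


catalog = {}
result = crawl('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}')
print(result["classification"], register(catalog, "users", result["schema"]))
```

In the recommended stack, the in-memory dict would presumably be replaced by a PostgreSQL table keyed on table name and version number.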
Cost Optimization (1 feature)
Flexible execution job class for non-time-sensitive workloads that reduces costs by up to 35% for preproduction jobs, testing, and data loads
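The "up to 35%" figure gives a lower bound on cost rather than a guaranteed price. A rough sketch of that arithmetic follows; the function name `flex_cost_estimate` is hypothetical, and the standard cost passed in is a placeholder, since no pricing data is given here.

```python
def flex_cost_estimate(standard_cost: float, max_savings: float = 0.35) -> tuple[float, float]:
    """Return (best_case_cost, standard_cost) for a job moved to the flexible class."""
    if not 0.0 <= max_savings < 1.0:
        raise ValueError("max_savings must be in [0, 1)")
    # Best case applies the full advertised saving; actual savings may be smaller.
    return standard_cost * (1.0 - max_savings), standard_cost


# Example with a placeholder standard cost of 200.00 (currency units are arbitrary).
best_case, standard = flex_cost_estimate(200.0)
print(f"between {best_case:.2f} and {standard:.2f}")
```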
Data Quality (1 feature)
Automatically measures, monitors, and manages data quality in data lakes and pipelines. Computes statistics, recommends rules, monitors quality metrics, and alerts on deterioration
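The compute, recommend, monitor, and alert loop described above can be sketched with a single statistic. This minimal example tracks only column completeness (the share of non-null values); the function names `completeness`, `recommend_rules`, and `evaluate`, and the `slack` parameter, are assumptions made for illustration, and rows are plain dicts rather than a real table.

```python
def completeness(rows, column):
    """Fraction of rows where the column is present and not None."""
    if not rows:
        return 0.0
    return sum(row.get(column) is not None for row in rows) / len(rows)


def recommend_rules(baseline_rows, columns, slack=0.05):
    """Recommend a minimum completeness per column from a trusted baseline batch."""
    return {c: max(0.0, completeness(baseline_rows, c) - slack) for c in columns}


def evaluate(rows, rules):
    """Return one alert message per column whose completeness fell below its rule."""
    alerts = []
    for column, minimum in rules.items():
        observed = completeness(rows, column)
        if observed < minimum:
            alerts.append(f"{column}: completeness {observed:.2f} below {minimum:.2f}")
    return alerts


# Example: rules come from a clean batch, then a later batch with a missing email is checked.
baseline = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
rules = recommend_rules(baseline, ["id", "email"])
new_batch = [{"id": 3, "email": None}, {"id": 4, "email": "c@example.com"}]
print(evaluate(new_batch, rules))
```

Other statistics such as uniqueness or value ranges could follow the same pattern of deriving a rule from a baseline and comparing later batches against it.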
Cost Calculator
Pricing data not available for AWS Glue. Check their website for current pricing.