This directory contains examples of different ETL (Extract, Transform, Load) workflows implemented using Prefect.
ETL CI/CD Pipeline (located in /etl-cicd-pipeline)
A complete, production-ready ETL pipeline that demonstrates:
- GitHub API data extraction
- AWS S3 integration for data storage
- Data validation and cleaning
- Automated deployment via GitHub Actions
- Email notifications
- Error handling and retries
- Environment configuration management
Key features:
- Uses Docker infrastructure
- Includes CI/CD pipeline configuration
- Configurable via environment variables
- Production-ready monitoring setup
The pipeline extracts repository data from GitHub, processes it through multiple transformation stages, and loads it into S3 buckets. It includes comprehensive error handling and notification systems.
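The sketch below is only a rough illustration of that shape, assuming the public GitHub API, `requests`, and `boto3`; the task names, retry settings, and S3 key are placeholders rather than the example's actual code, which also layers on validation, email notifications, and deployment configuration.

```python
import json
import os

import boto3
import requests
from prefect import flow, task, get_run_logger


@task(retries=3, retry_delay_seconds=30)
def extract_repo_data(repo: str) -> dict:
    """Pull repository metadata from the GitHub API, retried on transient failures."""
    response = requests.get(f"https://api.github.com/repos/{repo}", timeout=30)
    response.raise_for_status()
    return response.json()


@task
def transform(raw: dict) -> dict:
    """Keep only the fields needed downstream."""
    return {
        "name": raw["full_name"],
        "stars": raw["stargazers_count"],
        "open_issues": raw["open_issues_count"],
    }


@task(retries=2, retry_delay_seconds=10)
def load_to_s3(record: dict, bucket: str, key: str) -> None:
    """Write the processed record to S3 as JSON."""
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=json.dumps(record).encode())


@flow
def github_etl(repo: str = "PrefectHQ/prefect") -> None:
    logger = get_run_logger()
    raw = extract_repo_data(repo)
    record = transform(raw)
    load_to_s3(record, os.environ["S3_BUCKET_NAME"], f"github/{record['name']}.json")
    logger.info("Loaded data for %s", record["name"])
```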
ETL S3 Upload Pipeline (located in /etl-s3-upload-pipeline)
A streamlined ETL pipeline demonstrating two implementation approaches:
Full integration version, using Prefect's complete feature set (see the sketch after this list):
- Prefect Blocks for AWS credentials
- Built-in S3 integration with `prefect-aws`
- Email notifications using `prefect-email`
- Human-in-the-loop capabilities
- Automatic retries and error handling
- Flow run tracking and observability
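A rough sketch of how those pieces can fit together follows, assuming `prefect-aws` and `prefect-email` are installed and that blocks named `etl-bucket` and `email-creds` (placeholder names) have already been created; the example's real flow may be wired differently.

```python
from prefect import flow, task
from prefect_aws import S3Bucket
from prefect_email import EmailServerCredentials, email_send_message


@task(retries=3, retry_delay_seconds=60)
def upload_report(local_path: str) -> str:
    # Load a pre-configured S3Bucket block (which references an AwsCredentials
    # block for authentication) and upload the file to the bucket.
    bucket = S3Bucket.load("etl-bucket")  # placeholder block name
    return bucket.upload_from_path(local_path, "reports/latest.csv")


@flow
def full_featured_upload(local_path: str = "report.csv") -> None:
    key = upload_report(local_path)

    # Notify on completion via a pre-configured email credentials block.
    email_credentials = EmailServerCredentials.load("email-creds")  # placeholder block name
    email_send_message(
        email_server_credentials=email_credentials,
        subject="ETL upload finished",
        msg=f"Report uploaded to {key}",
        email_to="alerts@example.com",
    )
```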
Simple version, a lightweight implementation using basic Prefect features (a boto3 sketch follows this list):
- Direct boto3 integration with AWS
- Simpler configuration management
- Minimal dependencies
- More control over AWS interactions
- Basic error handling and logging
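For comparison, here is a minimal sketch of the lightweight approach, assuming a plain `boto3` client and the `S3_BUCKET_NAME` variable from the configuration shown later; the file path and key are placeholders.

```python
import os

import boto3
from botocore.exceptions import ClientError
from prefect import flow, task, get_run_logger


@task(retries=2, retry_delay_seconds=15)
def upload_file(local_path: str, bucket: str, key: str) -> None:
    """Upload a local file with boto3, logging and re-raising failures so Prefect can retry."""
    logger = get_run_logger()
    try:
        boto3.client("s3").upload_file(local_path, bucket, key)
    except ClientError:
        logger.exception("Upload of %s to s3://%s/%s failed", local_path, bucket, key)
        raise
    logger.info("Uploaded %s to s3://%s/%s", local_path, bucket, key)


@flow
def simple_upload(local_path: str = "report.csv") -> None:
    upload_file(local_path, os.environ["S3_BUCKET_NAME"], f"uploads/{os.path.basename(local_path)}")
```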
Each example contains:
- Detailed README with setup instructions
- Environment configuration examples
- Complete source code
- Deployment instructions
To use any example:
- Navigate to the specific example directory
- Follow the README instructions
- Configure the required environment variables
- Deploy using the provided scripts (one programmatic option is sketched below)
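Deployment specifics live in each example's scripts and CI configuration; purely as an illustration of one option, recent Prefect versions can also register and serve a deployment directly from Python, as in this sketch (the flow body, deployment name, and schedule are placeholders):

```python
from prefect import flow


@flow
def etl_flow() -> None:
    ...  # extract / transform / load steps would go here


if __name__ == "__main__":
    # Creates a deployment for the flow and begins polling for scheduled runs.
    etl_flow.serve(name="example-etl", cron="0 6 * * *")
```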
Prerequisites:
- Python 3.12+
- AWS account and credentials
- Prefect installation (`pip install prefect`)
- Prefect Cloud account
- Docker
- GitHub account with repository access
- GitHub Actions enabled
Python dependencies:
- `prefect-aws` for the full integration example
- `boto3` for the simple example
- `pandas` for data processing
- `pydantic` for data validation
Environment variables, full configuration (with email notifications):
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
AWS_DEFAULT_REGION=us-east-1
S3_BUCKET_NAME=your-bucket-name
NOTIFICATION_EMAIL=alerts@example.com
Environment variables, minimal configuration (no email notifications):
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
AWS_REGION=us-east-1
S3_BUCKET_NAME=your-bucket-name
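Both configurations are read from the environment at runtime. The helper below is just an assumed illustration using `os.getenv`; the examples themselves may load settings through pydantic models or Prefect blocks instead, and boto3 picks up the AWS credential variables automatically.

```python
import os


def load_s3_settings() -> dict:
    """Collect S3-related settings from the environment, failing fast if the bucket is unset."""
    bucket = os.getenv("S3_BUCKET_NAME")
    if not bucket:
        raise RuntimeError("S3_BUCKET_NAME must be set")
    return {
        "bucket": bucket,
        # One configuration uses AWS_REGION, the other AWS_DEFAULT_REGION.
        "region": os.getenv("AWS_REGION") or os.getenv("AWS_DEFAULT_REGION", "us-east-1"),
        "notification_email": os.getenv("NOTIFICATION_EMAIL"),  # only set when email alerts are used
    }
```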
Both pipelines follow a similar ETL pattern:
- Extract data from source
- Store raw data in S3
- Transform and validate data
- Clean and structure data
- Perform aggregations
- Load final results to S3
The main difference is in the implementation approach and additional features included in each version.
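As a rough, hedged illustration of that shared shape (not the code of either example), the stages can be written as one Prefect task each; the bucket, keys, and placeholder data here are invented for the sketch.

```python
import json
import os

import boto3
import pandas as pd
from prefect import flow, task

BUCKET = os.environ.get("S3_BUCKET_NAME", "example-bucket")  # placeholder default


@task
def extract() -> list[dict]:
    # 1. Extract data from the source (placeholder records here).
    return [{"repo": "example/repo", "stars": 42}]


@task
def store_raw(records: list[dict]) -> None:
    # 2. Store the untouched payload in S3 so runs can be audited or replayed.
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key="raw/latest.json", Body=json.dumps(records).encode()
    )


@task
def transform(records: list[dict]) -> pd.DataFrame:
    # 3-4. Validate, clean, and structure the data.
    return pd.DataFrame(records).dropna()


@task
def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    # 5. Perform aggregations.
    return df.groupby("repo", as_index=False)["stars"].sum()


@task
def load(df: pd.DataFrame) -> None:
    # 6. Load the final results back to S3.
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key="processed/summary.csv", Body=df.to_csv(index=False).encode()
    )


@flow
def etl_pattern() -> None:
    records = extract()
    store_raw(records)
    summary = aggregate(transform(records))
    load(summary)
```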