Udacity Data Engineering Nanodegree Program: Data Pipeline with Airflow project using MinIO and PostgreSQL.
- dags: Directory containing Airflow DAG scripts.
  - create_tables_and_buckets.py: DAG for creating tables on PostgreSQL and buckets on MinIO.
  - create_tables.sql: SQL script containing CREATE TABLE queries.
  - pipeline_dag.py: Main DAG script for the ETL data pipeline (see the sketch after this list).
- data: Directory for storing project source data.
  - log_data: Subdirectory for log data.
  - song_data: Subdirectory for song data.
- plugins: Directory for custom Airflow plugins.
  - operators: Subdirectory for operator scripts.
    - __init__.py: Initialization script for operators.
    - stage_postgresql_operator.py: Operator script for copying data from MinIO to PostgreSQL.
    - load_fact_operator.py: Operator script for executing INSERT queries into the fact table.
    - load_dimension_operator.py: Operator script for executing INSERT queries into dimension tables.
    - data_quality_operator.py: Operator script for performing data quality checks.
  - helpers: Subdirectory for helper scripts.
    - __init__.py: Initialization script for helpers.
    - sql_queries.py: Script containing SQL queries for building dimensional tables.
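The main DAG roughly wires these custom operators into a staging -> fact -> dimensions -> quality-check flow. The sketch below is illustrative only: the operator class names and constructor arguments are assumptions based on the file names above, not the exact code in pipeline_dag.py.

```python
# Hypothetical sketch of pipeline_dag.py; class names and arguments are assumptions.
from datetime import datetime

from airflow import DAG

from operators.stage_postgresql_operator import StageToPostgresOperator
from operators.load_fact_operator import LoadFactOperator
from operators.load_dimension_operator import LoadDimensionOperator
from operators.data_quality_operator import DataQualityOperator

with DAG(
    dag_id="pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually from the webserver
    catchup=False,
) as dag:
    # Copy raw JSON objects from the MinIO bucket into the staging tables
    stage_events = StageToPostgresOperator(
        task_id="stage_events",
        postgres_conn_id="postgres_conn",
        aws_conn_id="minio_conn",
        bucket="udacity-dend",
        prefix="log_data",
        table="staging_events",
    )
    stage_songs = StageToPostgresOperator(
        task_id="stage_songs",
        postgres_conn_id="postgres_conn",
        aws_conn_id="minio_conn",
        bucket="udacity-dend",
        prefix="song_data",
        table="staging_songs",
    )

    # Build the fact table, then the dimension tables, then validate row counts
    load_songplays = LoadFactOperator(
        task_id="load_songplays_fact", postgres_conn_id="postgres_conn", table="songplays"
    )
    load_dims = [
        LoadDimensionOperator(
            task_id=f"load_{t}_dim", postgres_conn_id="postgres_conn", table=t
        )
        for t in ("users", "songs", "artists", "times")
    ]
    checks = DataQualityOperator(
        task_id="run_data_quality_checks",
        postgres_conn_id="postgres_conn",
        tables=["songplays", "users", "songs", "artists", "times"],
    )

    [stage_events, stage_songs] >> load_songplays >> load_dims >> checks
```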
$ git clone https://github.com/akarce/Udacity-Data-Pipeline-with-Airflow
$ cd Udacity-Data-Pipeline-with-Airflow
$ docker compose up airflow-init
$ docker compose up -d
$ docker exec -it postgresdev psql -U postgres_user -d postgres_db
postgres_db=# CREATE DATABASE sparkifydb;
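Equivalently, the database can be created non-interactively (assuming the same container name and credentials as above):

$ docker exec -it postgresdev psql -U postgres_user -d postgres_db -c "CREATE DATABASE sparkifydb;"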
Airflow webserver: http://localhost:8080/
Username: airflow
Password: airflow
MinIO WebUI: http://localhost:9001/
Username: minioadmin
Password: minioadmin
In the Airflow webserver, go to Admin -> Connections and add a Postgres connection:
Connection Id: postgres_conn
Connection Type: Postgres
Host: 172.18.0.1
Database: sparkifydb
Login: postgres_user
Password: postgres_password
Port: 5432
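Alternatively, the same connection can be registered with the Airflow CLI from inside the webserver container (the service name airflow-webserver is an assumption about the docker-compose setup):

$ docker compose exec airflow-webserver airflow connections add postgres_conn \
    --conn-type postgres --conn-host 172.18.0.1 --conn-port 5432 \
    --conn-schema sparkifydb --conn-login postgres_user --conn-password postgres_password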
Go to the MinIO WebUI at http://localhost:9001/
From the Access Keys section -> Create access key -> Create
Store your Access Key and Secret Key using Download for import
Back in the Airflow webserver, add a second connection:
Connection Id: minio_conn
Connection Type: Amazon Web Services
AWS Access Key ID: <your_key_id>
AWS Secret Access Key: <your_secret_key>
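The minio_conn connection can likewise be added from the CLI; for an Amazon Web Services connection the access key maps to the login field and the secret key to the password field:

$ docker compose exec airflow-webserver airflow connections add minio_conn \
    --conn-type aws --conn-login <your_key_id> --conn-password <your_secret_key>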
Trigger the create_tables_and_buckets DAG from the Airflow webserver. It will create two buckets, udacity-dend and processed, and 7 Postgres tables: artists, songplays, songs, staging_events, staging_songs, times, users.
To inspect a table's schema, connect to sparkifydb with psql and describe it:
$ docker exec -it postgresdev psql -U postgres_user -d sparkifydb
sparkifydb=# \d <table_name>
Upload the log_data and song_data folders from the data folder into the udacity-dend bucket using the MinIO WebUI.
Then trigger pipeline_dag from the Airflow webserver. After it finishes successfully, you can check the task logs for data quality messages, and all objects will be moved to the processed bucket.
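For reference, a data quality check of this kind typically runs a row count against each table and fails the task if a table is empty. The sketch below shows the idea with hypothetical names; the actual data_quality_operator.py may differ.

```python
# Minimal sketch of a row-count data quality check; class and argument names
# are illustrative, not the exact contents of data_quality_operator.py.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    def __init__(self, postgres_conn_id="postgres_conn", tables=None, **kwargs):
        super().__init__(**kwargs)
        self.postgres_conn_id = postgres_conn_id
        self.tables = tables or []

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        for table in self.tables:
            records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info("Data quality check passed: %s has %s rows", table, records[0][0])
```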