Udacity Data Engineering Nanodegree. Part 5: Airflow Final Project.
Sparkify is a music app that decided to automate their data warehouse ETL pipelines with Apache Airflow.
The source data resides in S3 buckets, and the basic ETL process is as follows:
- Data in 2 collections of JSON files needs to be uploaded to a data warehouse in Amazon Redshift. We'll use the COPY command to load the JSON files into an Amazon Redshift cluster (see the sketch after this list)
- After that, the data needs to be processed to compose a number of tables (1 fact table + 4 dimension tables) out of the 2 JSON collections (and, subsequently, the staging tables)
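As a rough illustration, the staging load boils down to a COPY statement rendered once per staging table. The bucket paths, region and JSONPaths file below are assumptions for illustration, not necessarily the exact values the project operators use:

```python
# Sketch of the COPY statement a staging step might render per table.
# Bucket paths, REGION and the JSONPaths file are placeholders/assumptions.
COPY_SQL_TEMPLATE = """
    COPY {table}
    FROM '{s3_path}'
    ACCESS_KEY_ID '{access_key}'
    SECRET_ACCESS_KEY '{secret_key}'
    REGION 'us-west-2'
    FORMAT AS JSON '{json_option}'
"""

events_copy = COPY_SQL_TEMPLATE.format(
    table="staging_events",
    s3_path="s3://udacity-dend/log_data",
    access_key="<ACCESS_KEY_ID>",
    secret_key="<SECRET_ACCESS_KEY>",
    json_option="s3://udacity-dend/log_json_path.json",  # 'auto' for song data
)
```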
The difference in this new project is that we'll use Apache Airflow for workflow management: it runs the pipeline at an interval of our choice, automatically logs all tasks, and retries them when they fail.
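A minimal sketch of the DAG skeleton that gives us that behaviour: an hourly schedule, automatic retries, and no catchup. The DAG id matches the project (Raul_project); the owner, start date, retry count and interval below are assumptions, and the project's operators/tasks are omitted:

```python
# Minimal DAG skeleton: scheduled hourly, retries failed tasks, no catchup.
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'sparkify',
    'start_date': datetime(2019, 1, 12),
    'depends_on_past': False,
    'retries': 3,                          # retry failed tasks automatically
    'retry_delay': timedelta(minutes=5),   # wait 5 minutes between retries
    'email_on_retry': False,
}

dag = DAG(
    'Raul_project',
    default_args=default_args,
    description='Load and transform Sparkify data in Redshift with Airflow',
    schedule_interval='@hourly',           # interval of our choice
    catchup=False,
)
```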
These steps are needed to prepare the workspace to run the code:
- Basic Project Template: https://github.com/rantoncuadrado/udacity_p5_airflow/tree/main/Basic%20Project%20Template
- Launch a Redshift cluster. This is explained in the course here: https://classroom.udacity.com/nanodegrees/nd027/parts/69a25b76-3ebd-4b72-b7cb-03d82da12844/modules/445568fc-578d-4d3e-ab9c-2d186728ab22/lessons/21d59f40-6033-40b5-81a2-4a3211d9f46e/concepts/fad03fb3-ce48-4a69-9887-4baf8751cae3 (Lesson 3: Implementing Data Warehouses / 15. Exercise 1: Launch Redshift Cluster)
Sign in to the AWS Management Console and open the Amazon Redshift console at https://console.aws.amazon.com/redshift/. On the Amazon Redshift Dashboard, choose Launch cluster.
On the Cluster details page, enter the following values and then choose Continue:
Cluster identifier: Enter redshift-cluster.
Database name: Enter dev.
Database port: Enter 5439.
Master user name: Enter awsuser.
Master user password and Confirm password: Enter a password for the master user account.
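If you prefer scripting the cluster creation over the console, the same values can be passed to boto3. The node type, cluster size, public accessibility and the attached IAM role below are assumptions; the identifier, database, port and user match the console values above:

```python
# Sketch of creating the same cluster with boto3 instead of the console.
import boto3

redshift = boto3.client('redshift', region_name='us-west-2')

redshift.create_cluster(
    ClusterIdentifier='redshift-cluster',
    DBName='dev',
    Port=5439,
    MasterUsername='awsuser',
    MasterUserPassword='<YOUR_PASSWORD>',
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=2,
    PubliclyAccessible=True,                    # so the Airflow workspace can reach it
    IamRoles=['<REDSHIFT_S3_READ_ROLE_ARN>'],   # role that allows COPY from S3
)
```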
- After you have updated the DAG, run the /opt/airflow/start.sh command to start the Airflow webserver. Then create the following two connections in the Airflow UI (Admin > Connections > Create):
CONNECTION 1
Conn Id: Enter aws_credentials.
Conn Type: Enter Amazon Web Services.
Login: Enter your Access key ID from the IAM User credentials you downloaded earlier.
Password: Enter your Secret access key from the IAM User credentials you downloaded earlier.
CONNECTION 2
Conn Id: Enter redshift.
Conn Type: Enter Postgres.
Host: Enter the endpoint of your Redshift cluster, excluding the port at the end. You can find this by selecting your cluster on the Clusters page of the Amazon Redshift console. IMPORTANT: Make sure NOT to include the port at the end of the Redshift endpoint string.
Schema: Enter dev. This is the Redshift database you want to connect to.
Login: Enter awsuser.
Password: Enter the password you created when launching your Redshift cluster.
Port: Enter 5439.
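These two connection IDs are what the project's operators look up at runtime. A sketch of how they might be consumed inside an operator's execute method (hook import paths assume Airflow 1.10.x, as used in the Udacity workspace; in Airflow 2.x they live under airflow.providers):

```python
# How the 'aws_credentials' and 'redshift' connections are typically used.
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

def execute_example(context):
    # CONNECTION 1: AWS credentials that feed the COPY statement
    aws_hook = AwsHook('aws_credentials')
    credentials = aws_hook.get_credentials()
    # CONNECTION 2: the Redshift cluster itself
    redshift = PostgresHook(postgres_conn_id='redshift')
    redshift.run("SELECT 1")  # sanity-check the connection
    return credentials.access_key, credentials.secret_key
```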
- Create the tables in Redshift (https://github.com/rantoncuadrado/udacity_p5_airflow/blob/main/Basic%20Project%20Template/create_tables.sql). I opted to create them manually from the Redshift console (a scripted alternative is sketched below).
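If you'd rather not paste create_tables.sql into the Query Editor, a small psycopg2 script can run it against the cluster. The host and password below are placeholders, and splitting on ';' assumes the file is plain DDL with no semicolons inside string literals:

```python
# Run create_tables.sql against the cluster instead of pasting it manually.
import psycopg2

conn = psycopg2.connect(
    host='<your-cluster-endpoint>.redshift.amazonaws.com',
    dbname='dev',
    user='awsuser',
    password='<YOUR_PASSWORD>',
    port=5439,
)
with conn, conn.cursor() as cur, open('create_tables.sql') as f:
    # Execute each ';'-terminated statement in turn; the context manager commits.
    for statement in f.read().split(';'):
        if statement.strip():
            cur.execute(statement)
conn.close()
```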
Once the setup is finished, the Raul_project DAG can be launched by toggling the ON/OFF button.
Once the Raul_project DAG has run successfully, you will be able to query the songplays fact table and the songs, users, artists and time dimension tables in Redshift from the Query Editor.
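A quick way to confirm the load without the Query Editor is to count the rows in each final table. The table names come from the project; the endpoint and password are placeholders, and "time" is double-quoted because it is a reserved word in Redshift:

```python
# Row-count sanity check of the fact and dimension tables after a DAG run.
import psycopg2

conn = psycopg2.connect(
    host='<your-cluster-endpoint>.redshift.amazonaws.com',
    dbname='dev',
    user='awsuser',
    password='<YOUR_PASSWORD>',
    port=5439,
)
with conn.cursor() as cur:
    for table in ['songplays', 'songs', 'users', 'artists', 'time']:
        cur.execute(f'SELECT COUNT(*) FROM "{table}"')
        print(table, cur.fetchone()[0])
conn.close()
```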