
Integration

Reverse Proxy

Airflow can be set up behind a reverse proxy, with its endpoint configured flexibly.

For example, you can configure your reverse proxy to serve Airflow at:

https://lab.mycompany.com/myorg/airflow/

To do so, you need to set the following setting in your airflow.cfg:

base_url = http://my_host/myorg/airflow

Additionally, if you use the Celery Executor, you can expose Flower at /myorg/flower with:

flower_url_prefix = /myorg/flower

Your reverse proxy (e.g. nginx) should be configured as follows:

  • pass the URL and HTTP headers as-is to the Airflow webserver, without any rewrite, for example:

    server {
      listen 80;
      server_name lab.mycompany.com;
    
      location /myorg/airflow/ {
          proxy_pass http://localhost:8080;
          proxy_set_header Host $host;
          proxy_redirect off;
          proxy_http_version 1.1;
          proxy_set_header Upgrade $http_upgrade;
          proxy_set_header Connection "upgrade";
      }
    }
    
  • rewrite the URL for the Flower endpoint:

    server {
        listen 80;
        server_name lab.mycompany.com;
    
        location /myorg/flower/ {
            rewrite ^/myorg/flower/(.*)$ /$1 break;  # remove the /myorg/flower/ prefix from the request URL
            proxy_pass http://localhost:5555;
            proxy_set_header Host $host;
            proxy_redirect off;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }
    

To ensure that Airflow generates URLs with the correct scheme when running behind a TLS-terminating proxy, you should configure the proxy to set the X-Forwarded-Proto header, and enable the ProxyFix middleware in your airflow.cfg:

enable_proxy_fix = True

Note: you should only enable the ProxyFix middleware when running Airflow behind a trusted proxy (AWS ELB, nginx, etc.).

Azure: Microsoft Azure

Airflow has limited support for Microsoft Azure: interfaces exist only for Azure Blob Storage and Azure Data Lake. The hook, sensors and operator for Blob Storage and the Azure Data Lake hook are in the contrib section.

Azure Blob Storage

All classes communicate via the Windows Azure Storage Blob protocol. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=Storage account key), or a login and SAS token in the extra field (see connection wasb_default for an example).
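
For example, a minimal sketch of uploading a local file to Blob Storage with the contrib FileToWasbOperator could look like the following (the file path, container and blob names are placeholder assumptions; wasb_default is the connection described above):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.file_to_wasb import FileToWasbOperator

dag = DAG('wasb_upload_example', start_date=datetime(2018, 1, 1), schedule_interval=None)

upload_file = FileToWasbOperator(
    task_id='upload_to_blob',
    file_path='/tmp/report.csv',       # local file to upload (placeholder)
    container_name='my-container',     # target Blob Storage container (placeholder)
    blob_name='reports/report.csv',    # destination blob name (placeholder)
    wasb_conn_id='wasb_default',
    dag=dag)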

WasbBlobSensor

.. autoclass:: airflow.contrib.sensors.wasb_sensor.WasbBlobSensor

WasbPrefixSensor

.. autoclass:: airflow.contrib.sensors.wasb_sensor.WasbPrefixSensor

FileToWasbOperator

.. autoclass:: airflow.contrib.operators.file_to_wasb.FileToWasbOperator

WasbHook

.. autoclass:: airflow.contrib.hooks.wasb_hook.WasbHook

Azure File Share

Cloud variant of an SMB file share. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=Storage account key), or a login and SAS token in the extra field (see connection wasb_default for an example).
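
As a hedged sketch, uploading a local file to a file share with AzureFileShareHook could look like this (the share, directory and file names are placeholder assumptions, and the connection is the wasb-type connection described above):

from airflow.contrib.hooks.azure_fileshare_hook import AzureFileShareHook

hook = AzureFileShareHook(wasb_conn_id='wasb_default')
hook.load_file(
    file_path='/tmp/report.csv',   # local file to upload (placeholder)
    share_name='my-share',         # target file share (placeholder)
    directory_name='reports',      # directory inside the share (placeholder)
    file_name='report.csv')        # destination file name (placeholder)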

AzureFileShareHook

.. autoclass:: airflow.contrib.hooks.azure_fileshare_hook.AzureFileShareHook

Logging

Airflow can be configured to read and write task logs in Azure Blob Storage. See :ref:`write-logs-azure`.

Azure Data Lake

AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), password (=Client Secret) and the extra fields tenant (Tenant) and account_name (Account Name) (see connection azure_data_lake_default for an example).
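
As an illustration of the fields described above (all values are placeholders), the connection could be defined programmatically as follows; in practice you would usually create it through the UI or CLI instead:

import json

from airflow.models import Connection

conn = Connection(
    conn_id='azure_data_lake_default',
    conn_type='azure_data_lake',
    login='my-client-id',             # Client ID (placeholder)
    password='my-client-secret',      # Client Secret (placeholder)
    extra=json.dumps({
        'tenant': 'my-tenant-id',         # Tenant (placeholder)
        'account_name': 'my-adl-account'  # Account Name (placeholder)
    }))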

AzureDataLakeHook

.. autoclass:: airflow.contrib.hooks.azure_data_lake_hook.AzureDataLakeHook

AWS: Amazon Web Services

Airflow has extensive support for Amazon Web Services, but note that the hooks, sensors and operators are in the contrib section.

AWS EMR

EmrAddStepsOperator

.. autoclass:: airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator

EmrCreateJobFlowOperator

.. autoclass:: airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator

EmrTerminateJobFlowOperator

.. autoclass:: airflow.contrib.operators.emr_terminate_job_flow_operator.EmrTerminateJobFlowOperator

EmrHook

.. autoclass:: airflow.contrib.hooks.emr_hook.EmrHook


AWS S3

S3Hook

.. autoclass:: airflow.hooks.S3_hook.S3Hook

S3FileTransformOperator

.. autoclass:: airflow.operators.s3_file_transform_operator.S3FileTransformOperator

S3ListOperator

.. autoclass:: airflow.contrib.operators.s3_list_operator.S3ListOperator

S3ToGoogleCloudStorageOperator

.. autoclass:: airflow.contrib.operators.s3_to_gcs_operator.S3ToGoogleCloudStorageOperator

S3ToHiveTransfer

.. autoclass:: airflow.operators.s3_to_hive_operator.S3ToHiveTransfer


AWS EC2 Container Service

ECSOperator

.. autoclass:: airflow.contrib.operators.ecs_operator.ECSOperator


AWS Batch Service

AWSBatchOperator

.. autoclass:: airflow.contrib.operators.awsbatch_operator.AWSBatchOperator


AWS RedShift

AwsRedshiftClusterSensor

.. autoclass:: airflow.contrib.sensors.aws_redshift_cluster_sensor.AwsRedshiftClusterSensor

RedshiftHook

.. autoclass:: airflow.contrib.hooks.redshift_hook.RedshiftHook

RedshiftToS3Transfer

.. autoclass:: airflow.operators.redshift_to_s3_operator.RedshiftToS3Transfer

S3ToRedshiftTransfer

.. autoclass:: airflow.operators.s3_to_redshift_operator.S3ToRedshiftTransfer


Databricks

Databricks has contributed an Airflow operator which enables submitting runs to the Databricks platform. Internally the operator talks to the api/2.0/jobs/runs/submit endpoint.
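
For example, a minimal sketch of submitting a notebook run could look like this (the cluster spec and notebook path are placeholder assumptions; the json payload is passed through to api/2.0/jobs/runs/submit):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

dag = DAG('databricks_example', start_date=datetime(2018, 1, 1), schedule_interval=None)

notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    databricks_conn_id='databricks_default',
    json={
        'new_cluster': {
            'spark_version': '4.3.x-scala2.11',  # placeholder runtime version
            'node_type_id': 'i3.xlarge',         # placeholder node type
            'num_workers': 2,
        },
        'notebook_task': {
            'notebook_path': '/Users/someone@example.com/my-notebook',  # placeholder
        },
    },
    dag=dag)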

DatabricksSubmitRunOperator

.. autoclass:: airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator



GCP: Google Cloud Platform

Airflow has extensive support for the Google Cloud Platform, but note that most hooks and operators are in the contrib section. This means they have beta status and can have breaking changes between minor releases.

See the :ref:`GCP connection type <connection-type-GCP>` documentation to configure connections to GCP.

Logging

Airflow can be configured to read and write task logs in Google Cloud Storage. See :ref:`write-logs-gcp`.

BigQuery

BigQuery Operators

BigQueryCheckOperator
.. autoclass:: airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator

BigQueryValueCheckOperator
.. autoclass:: airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator

BigQueryIntervalCheckOperator
.. autoclass:: airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator

BigQueryGetDataOperator
.. autoclass:: airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator

BigQueryCreateEmptyTableOperator
.. autoclass:: airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator

BigQueryCreateExternalTableOperator
.. autoclass:: airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator

BigQueryDeleteDatasetOperator
.. autoclass:: airflow.contrib.operators.bigquery_operator.BigQueryDeleteDatasetOperator

BigQueryCreateEmptyDatasetOperator
.. autoclass:: airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator

BigQueryOperator
.. autoclass:: airflow.contrib.operators.bigquery_operator.BigQueryOperator

BigQueryTableDeleteOperator
.. autoclass:: airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator

BigQueryToBigQueryOperator
.. autoclass:: airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator

BigQueryToCloudStorageOperator
.. autoclass:: airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator


BigQueryHook

.. autoclass:: airflow.contrib.hooks.bigquery_hook.BigQueryHook
    :members:


Cloud Functions

Cloud Functions Operators


GcfFunctionDeployOperator
.. autoclass:: airflow.contrib.operators.gcp_function_operator.GcfFunctionDeployOperator


GcfFunctionDeleteOperator
.. autoclass:: airflow.contrib.operators.gcp_function_operator.GcfFunctionDeleteOperator


Cloud Functions Hook

.. autoclass:: airflow.contrib.hooks.gcp_function_hook.GcfHook
    :members:


Cloud DataFlow

DataFlow Operators

DataFlowJavaOperator
.. autoclass:: airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator

# Example: run a Dataflow Java pipeline with DataFlowJavaOperator.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowJavaOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 8, 1),
    'email': ['alex@vanboxel.be'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=30),
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'us-central1-f',
        'stagingLocation': 'gs://bucket/tmp/dataflow/staging/',
    }
}

dag = DAG('test-dag', default_args=default_args)

task = DataFlowJavaOperator(
    gcp_conn_id='gcp_default',
    task_id='normalize-cal',
    jar='{{var.value.gcp_dataflow_base}}pipeline-ingress-cal-normalize-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY'
    },
    dag=dag)

DataflowTemplateOperator
.. autoclass:: airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator

DataFlowPythonOperator
.. autoclass:: airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator


DataFlowHook

.. autoclass:: airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook
    :members:



Cloud DataProc

DataProc Operators

DataprocClusterCreateOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator

DataprocClusterScaleOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator

DataprocClusterDeleteOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator

DataProcPigOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataProcPigOperator

DataProcHiveOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataProcHiveOperator

DataProcSparkSqlOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator

DataProcSparkOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataProcSparkOperator

DataProcHadoopOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataProcHadoopOperator

DataProcPySparkOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataProcPySparkOperator

DataprocWorkflowTemplateInstantiateOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateOperator

DataprocWorkflowTemplateInstantiateInlineOperator
.. autoclass:: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateInlineOperator

Cloud Datastore

Datastore Operators

DatastoreExportOperator
.. autoclass:: airflow.contrib.operators.datastore_export_operator.DatastoreExportOperator

DatastoreImportOperator
.. autoclass:: airflow.contrib.operators.datastore_import_operator.DatastoreImportOperator

DatastoreHook

.. autoclass:: airflow.contrib.hooks.datastore_hook.DatastoreHook
    :members:


Cloud ML Engine

Cloud ML Engine Operators

MLEngineBatchPredictionOperator
.. autoclass:: airflow.contrib.operators.mlengine_operator.MLEngineBatchPredictionOperator
    :members:

MLEngineModelOperator
.. autoclass:: airflow.contrib.operators.mlengine_operator.MLEngineModelOperator
    :members:

MLEngineTrainingOperator
.. autoclass:: airflow.contrib.operators.mlengine_operator.MLEngineTrainingOperator
    :members:

MLEngineVersionOperator
.. autoclass:: airflow.contrib.operators.mlengine_operator.MLEngineVersionOperator
    :members:

Cloud ML Engine Hook

MLEngineHook
.. autoclass:: airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook
    :members:


Cloud Storage

Storage Operators

FileToGoogleCloudStorageOperator
.. autoclass:: airflow.contrib.operators.file_to_gcs.FileToGoogleCloudStorageOperator

GoogleCloudStorageCreateBucketOperator
.. autoclass:: airflow.contrib.operators.gcs_operator.GoogleCloudStorageCreateBucketOperator

GoogleCloudStorageDownloadOperator
.. autoclass:: airflow.contrib.operators.gcs_download_operator.GoogleCloudStorageDownloadOperator

GoogleCloudStorageListOperator
.. autoclass:: airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator

GoogleCloudStorageToBigQueryOperator
.. autoclass:: airflow.contrib.operators.gcs_to_bq.GoogleCloudStorageToBigQueryOperator

GoogleCloudStorageToGoogleCloudStorageOperator
.. autoclass:: airflow.contrib.operators.gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator

GoogleCloudStorageHook

.. autoclass:: airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook
    :members:

Google Kubernetes Engine

Google Kubernetes Engine Cluster Operators

GKEClusterCreateOperator
.. autoclass:: airflow.contrib.operators.gcp_container_operator.GKEClusterCreateOperator

GKEClusterDeleteOperator
.. autoclass:: airflow.contrib.operators.gcp_container_operator.GKEClusterDeleteOperator

GKEPodOperator
.. autoclass:: airflow.contrib.operators.gcp_container_operator.GKEPodOperator

Google Kubernetes Engine Hook

.. autoclass:: airflow.contrib.hooks.gcp_container_hook.GKEClusterHook
    :members:


Qubole

Apache Airflow has a native operator and hooks to talk to Qubole, which lets you submit your big data jobs directly to Qubole from Airflow.
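
For example, a hedged sketch of running a Hive query on Qubole could look like this (the query, cluster label and connection id are placeholder assumptions):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.qubole_operator import QuboleOperator

dag = DAG('qubole_example', start_date=datetime(2018, 1, 1), schedule_interval=None)

hive_query = QuboleOperator(
    task_id='show_tables',
    command_type='hivecmd',           # type of Qubole command to run
    query='show tables',              # placeholder Hive query
    cluster_label='default',          # placeholder Qubole cluster label
    qubole_conn_id='qubole_default',
    dag=dag)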

QuboleOperator

.. autoclass:: airflow.contrib.operators.qubole_operator.QuboleOperator

QubolePartitionSensor

.. autoclass:: airflow.contrib.sensors.qubole_sensor.QubolePartitionSensor


QuboleFileSensor

.. autoclass:: airflow.contrib.sensors.qubole_sensor.QuboleFileSensor