
Integration Runtime in Azure Data Factory

An integration runtime (IR) is the bridge between Activities and Linked Services: it provides the compute environment in which an Activity executes the actions defined against those services. There are three types of IR:

 

Azure IR – for running data flows and moving data inside Azure.

Self-Hosted IR – for moving data from externally hosted systems (on-premises or other private networks).

Azure-SSIS IR – for executing SSIS packages.
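To make that relationship concrete, here is a minimal sketch of how a linked service definition is bound to a particular IR through its connectVia reference. The service name and IR name are hypothetical, and the connection string is elided.

```python
import json

# Sketch of an ADF linked service payload. The "connectVia" block binds the
# linked service to a specific integration runtime; activities that use this
# linked service then run on that IR's compute. "MySelfHostedIR" is a
# hypothetical IR name for illustration only.
linked_service = {
    "name": "OnPremSqlLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "<elided>"
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}

print(json.dumps(linked_service, indent=2))
```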

 

Azure Data Factory can be hosted in any Azure region of the customer's choice, and the IR location can be independent of ADF's Azure region. Generally, IRs are hosted in the Azure region where data movement, activity dispatching, etc. are required.


IR behavior with the AutoResolve region in a public network


Time To Live (TTL) in Integration Runtime


AutoResolve and ad hoc integration runtime clusters add a cluster acquisition time (approximately 4-5 minutes) every time a new cluster is spun up for a data flow. This extra compute setup time is added to the total job time of every data flow execution, which can be detrimental when optimizing the overall execution time of a batch load, for instance a data warehouse's daily ETL load.

To overcome this, Microsoft added the TTL (time to live) feature to the Integration Runtime. This feature shaves off some of the extra time each workflow spends acquiring a cluster.

This setting is designed for data flow activities. The pre-warmed cluster feature operates at the core level; its detailed working is explained in Microsoft's blog.

Setting a TTL significantly reduces overall batch timings when jobs are interdependent and executed sequentially.
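As a rough sketch of where this setting lives: in the Azure IR definition, the TTL is specified in minutes under the data flow properties of the compute properties. The payload below is illustrative only; the IR name, compute type, and core count are assumptions, not prescribed values.

```python
import json

# Sketch of an Azure IR definition with a data flow TTL. "timeToLive" is in
# minutes; the IR name, compute type, and core count are illustrative.
azure_ir = {
    "name": "AzureIR-WithTTL",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",
                    "coreCount": 8,
                    "timeToLive": 10  # keep the warm cluster alive for 10 minutes
                }
            }
        }
    }
}

print(json.dumps(azure_ir, indent=2))
```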
    
The picture below gives a non-geeky explanation of the TTL feature. Batch 1 is a set of 3 jobs running in sequence, each with 5 minutes of Spark cluster setup time and 10 minutes of data flow processing time. Batch 2 uses an integration runtime with TTL set to 10 minutes, so after the cluster is set up in Job A, the same cluster is reused by the subsequent jobs; cluster warming and acquisition time drops to 2 minutes for both B and C, and the overall execution of batch 2 takes 6 minutes less than batch 1 (39 minutes versus 45).
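A quick back-of-the-envelope check of those numbers:

```python
# Batch 1: every job pays the full 5-minute cluster setup.
# Batch 2: job A pays full setup; B and C reuse the warm cluster (2 min each).
setup_cold, setup_warm, processing = 5, 2, 10

batch1 = 3 * (setup_cold + processing)                              # 45 minutes
batch2 = (setup_cold + processing) + 2 * (setup_warm + processing)  # 39 minutes

print(batch1, batch2, batch1 - batch2)  # 45 39 6
```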



This time difference might look small in a batch of three jobs, but it has a mighty effect when the job count scales out to thousands, as in real-life DW/BI batch loads.


So the TTL feature can be a very handy add-on for optimizing batch timings.



Quick re-use


With TTL there is still a cluster grab time of approximately 2 minutes; however, the latest feature, Quick re-use, further limits cluster acquisition time to a few seconds. This is a game-changer, and it brings ADF on par with legacy data integration tools such as Informatica PowerCenter, where a job can start processing data directly without worrying about setting up compute.
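In the IR definition, quick re-use appears alongside TTL in the data flow properties: a cleanup flag set to false tells ADF not to recycle the warm cluster between data flow runs inside the TTL window. A minimal sketch, reusing the same illustrative names as above:

```python
# Same illustrative Azure IR as above, now with quick re-use switched on:
# "cleanup": False asks ADF to keep the warm cluster for the next data flow
# run within the TTL window instead of tearing it down after each run.
azure_ir_quick_reuse = {
    "name": "AzureIR-QuickReuse",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",
                    "coreCount": 8,
                    "timeToLive": 10,
                    "cleanup": False  # quick re-use: keep the cluster warm
                }
            }
        }
    }
}
```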


