
Learn Azure Data Factory.


Azure Data Factory (ADF) is Azure's data offering for the development and orchestration of data pipelines. ADF lets cloud developers orchestrate their Databricks notebooks and various other codebases. This cloud-managed service is designed for complex hybrid ELT, ETL, and data integration solutions.

ETL Tool - Azure Data Factory
ADF is one of many data offerings from Azure and is designed to orchestrate data pipelines. Capabilities such as mapping data flows make it a powerful ETL tool with an ever-growing list of data source integrations.

How is Azure Data Factory licensed?
Azure Data Factory is Azure's Platform as a Service (PaaS) solution. There is no separate license to buy; you pay for what you use, such as pipeline orchestration, activity runs, and data flow execution.

Azure Data Factory Components.
ADF has a number of components grouped under the 'Author' and 'Manage' options in the left pane.
Author components (GA to date) include Pipelines, Data flows, Datasets, and Power Query.
Manage components include Integration Runtimes, Linked Services, Triggers, Global Parameters, etc.

What is a Pipeline in Azure Data Factory?
A pipeline is a logical grouping of activities that together perform a task, such as extracting, transforming, or loading data.
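
As a minimal sketch, a pipeline definition in ADF JSON is simply a named collection of activities. The pipeline and activity names below are hypothetical, and a single Wait activity stands in for real work:

{
    "name": "SamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "WaitOneMinute",
                "type": "Wait",
                "typeProperties": {
                    "waitTimeInSeconds": 60
                }
            }
        ]
    }
}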

What is an Activity in Azure Data Factory?
An activity defines the action to be performed, for instance, copying data in the Copy activity. Based on their actions, ADF activities fall into three categories (a minimal Copy activity definition is sketched after this list).
  • Data movement activities - Copy activity
  • Data transformation activities - Mapping data flows, Stored Procedure activity
  • Control activities - Until activity, Get Metadata activity
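
For illustration, here is a minimal sketch of a Copy activity in ADF pipeline JSON. The activity name, dataset names (BlobInputDataset, SqlOutputDataset), and the source/sink types are placeholders assumed for this example:

{
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs": [
        { "referenceName": "BlobInputDataset", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "SqlOutputDataset", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}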


What is chaining of activities in Azure Data Factory?
The dependsOn property of an activity (exposed through the 'Add activity on' option in the UI) is used in the current ADF version to chain activities to one another. This replaces the early approach of configuring the output of one activity as the input of the next (downstream) activity to manage control flow. The snippet below shows the usage and the options available.


The 'Add activity on' option selected for the Success and Fail activities becomes the dependsOn property of the respective activities and can be viewed in the pipeline's JSON code.

"name": "Success",
"type": "Delete",
"dependsOn": [
{
"activity": "BASE",
"dependencyConditions": [
"Succeeded"
]
}
],
Linked services in Azure Data Factory.
A linked service is essentially a connection string: it holds the connection information ADF needs to reach the data store or compute behind a dataset. Every dataset requires a linked service.
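
A minimal sketch of an Azure Blob Storage linked service definition; the name and connection string values are placeholders:

{
    "name": "AzureBlobStorageLS",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<account-key>"
        }
    }
}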

Integration Runtime in Azure Data Factory
The Integration Runtime (IR) provides the compute infrastructure ADF uses for data movement, data flow execution, and dispatching activities to external compute services. ADF offers Azure, self-hosted, and Azure-SSIS integration runtimes.
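
For example, a linked service can point at a specific integration runtime through the connectVia property. A sketch assuming a hypothetical self-hosted IR named MySelfHostedIR and placeholder connection details:

{
    "name": "OnPremSqlServerLS",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Server=<server>;Database=<database>;Integrated Security=True"
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}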

What are Mapping data flows?
Mapping data flows let cloud engineers build data transformation logic visually, without writing code. Data flows are executed inside a pipeline, so all of ADF's scheduling capabilities are available to them.

Data flows can be leveraged to build pipelines that load the various types of dimension and fact entities in a DW/BI application. They simplify the creation of complex ETL logic, which used to be quite a tedious task with native ADF activities before data flows were available.
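
A data flow is run from a pipeline through the Execute Data Flow activity. A minimal sketch, assuming a data flow named CustomerDimDataFlow already exists in the factory:

{
    "name": "RunCustomerDimFlow",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataFlow": {
            "referenceName": "CustomerDimDataFlow",
            "type": "DataFlowReference"
        }
    }
}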

                                         

To enable data flows, you need to choose 'Version 2 with Data Flows' while creating the Azure Data Factory instance.
Currently, there are three versions available:
  • Version 1
  • Version 2
  • Version 2 with Data Flows
How can we prevent sensitive data from being displayed in Monitor logs when passing/receiving inputs across ADF Activities?
We can enable the Secure input and Secure output options under the General tab of an activity to pass sensitive information across activities securely.
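
In the pipeline JSON, these options appear as the secureInput and secureOutput flags in the activity's policy block. The fragment below is an illustrative sketch with a hypothetical activity name:

{
    "name": "CopySensitiveData",
    "type": "Copy",
    "policy": {
        "secureInput": true,
        "secureOutput": true
    }
}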



What are global parameters in Azure Data Factory?
ADF allows you to define a parameter globally and use it across pipelines. Global parameters are created under Global Parameters in the Manage tab and referenced in pipeline expressions.
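
Once defined, a global parameter can be referenced in any pipeline expression via @pipeline().globalParameters.<name>. A sketch using a Set Variable activity and a hypothetical environmentName global parameter:

{
    "name": "SetEnvironment",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "env",
        "value": {
            "value": "@pipeline().globalParameters.environmentName",
            "type": "Expression"
        }
    }
}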


