
How to work with XML files in Databricks using Python




This article walks you through the basic steps of accessing and reading XML files stored in the FileStore using Python code in a Databricks Community Edition notebook. We will also explore a few important functions available in the Spark-XML Maven library. Think of this article as a stepping stone into Databricks Community Edition; the features and functionality described here can be scaled to the enterprise level using the enterprise editions of Databricks to design reusable file-processing frameworks.

Requirements

We will be using the Spark-XML package from Maven. Spark 3.0 or above is required on your cluster for working with XML files. This article uses Databricks Community Edition; refer to the video tutorial from Pragmatic Works to get started with Databricks Community Edition.

Create your first cluster in seconds:




The next step is to install the Spark-XML library on your cluster. The cluster needs to be in a running state to install the library. Note that Spark-XML is a JVM library distributed through Maven, so it is attached as a cluster library rather than installed with pip. Here are the steps to install Spark-XML on your cluster:
  1. Go to your cluster.
  2. Select Libraries.
  3. Click Install New - Maven - Search Packages.
  4. Choose Maven Central, search for Spark XML, and select spark-xml_2.12 (the coordinate format is shown below).
  5. Click Install.
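If you prefer to enter the coordinate directly rather than searching, the Maven coordinate follows the pattern below. The version shown is only an example; pick the latest release that matches your cluster's Scala version:

com.databricks:spark-xml_2.12:0.15.0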
For this practice article, we have used the books.xml sample file available at the link. You can try this or any other file of your choice.

Let's get started with accessing and reading the XML file. We are going to use the FileStore available in the Databricks account. The 'Import & Explore Data' option on the homepage can be used to load files into FileStore storage.
  • %fs ls /FileStore/tables/ - lists the files we have uploaded.
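The same listing can be done in Python with dbutils.fs.ls, which is handy when you want to filter or act on the results programmatically. A minimal sketch, assuming the files were uploaded to /FileStore/tables/:

# List the uploaded files and print each file's name and size in bytes
for f in dbutils.fs.ls("/FileStore/tables/"):
    print(f.name, f.size)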

In the next step, we read the file and inspect its schema; the library lets us use read.format for this purpose. read.format leverages the rootTag and rowTag options of an XML file to present its schema in a readable format.
  • df = spark.read.format("com.databricks.spark.xml").option("rootTag", "catalog").option("rowTag", "book").load("/FileStore/tables/books.xml")
  • df.printSchema()


  • display(df) 
The display function renders the DataFrame in tabular form.

Now that we have captured the XML file in a DataFrame, there are numerous operations we can perform, ranging from data cleansing to loading the data into relational databases, depending on business needs. We can additionally design a framework for flat-file validation using parameterized file names and file paths, covering a basic set of checks, for instance, whether a file exists at the storage location, whether the file has content, and schema validation using XSD. Below you will find a code snippet that sets up prototype functions for the file-exists and empty-file validations. Think of it as an initial draft and enhance it per your requirements.
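A minimal sketch of such a prototype using dbutils.fs; the function names and the file path below are illustrative and not part of any library:

# Prototype validations for a parameterized DBFS file path (illustrative names)
def file_exists(path):
    # dbutils.fs.ls raises an exception when the path does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

def file_is_empty(path):
    # For a single file, dbutils.fs.ls returns one FileInfo entry with its size in bytes
    info = dbutils.fs.ls(path)
    return len(info) == 1 and info[0].size == 0

file_path = "/FileStore/tables/books.xml"
if not file_exists(file_path):
    print("File not found: " + file_path)
elif file_is_empty(file_path):
    print("File is empty: " + file_path)
else:
    df = (spark.read.format("com.databricks.spark.xml")
          .option("rootTag", "catalog")
          .option("rowTag", "book")
          .load(file_path))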



Once converted into a DataFrame, the XML file can be used in a number of ways, for example, creating a temporary view with createOrReplaceTempView and using it like a cursor to load data into any relational database; a quick sketch follows. The Spark-XML Maven library also provides three read modes, listed after the sketch, that we can leverage per the requirements of the design.
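A minimal sketch of the temporary-view approach, assuming the DataFrame df from the earlier read; the view name books is arbitrary, and the column names (_id, title, price) assume the common books.xml sample, where the book's id attribute is read as _id:

# Register the DataFrame as a temporary view and query it with Spark SQL
df.createOrReplaceTempView("books")
cheap_books = spark.sql("SELECT _id, title, price FROM books WHERE price < 10")
display(cheap_books)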

  • PERMISSIVE mode replaces malformed values with null in all fields and places a copy of the bad record in the column configured by columnNameOfCorruptRecord. This is the default mode.
  • DROPMALFORMED drops malformed records entirely.
  • FAILFAST throws an exception as soon as the first malformed record is encountered.
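
The read mode is passed as an ordinary option. A minimal sketch, assuming the same books.xml file and differing from the earlier read only in the added mode option:

# Stop the read as soon as a malformed record is found
df_strict = (spark.read.format("com.databricks.spark.xml")
             .option("rowTag", "book")
             .option("mode", "FAILFAST")
             .load("/FileStore/tables/books.xml"))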

Apart from the variation in read modes, HyukjinKwon has added a number of super-helpful features to this library. I recommend taking a look at everything the package offers; the GitHub link is in the credits. The next article will delve into file-writing modes. Stay tuned.


Credits:
https://docs.databricks.com/data/data-sources/xml.html#example&language-sql
Photo by James Harrison on Unsplash
https://github.com/databricks/spark-xml
