Data exists in different formats – from simple CSV tables and JSON objects to more descriptive and hierarchical XML documents. XML might look a bit old-school, but it’s still everywhere: in RSS feeds, APIs, and software configurations.

Before you can use XML data, you need to parse it. In other words, you teach your program to read what’s inside. With Python’s libraries like ElementTree, minidom, and lxml, parsing XML is a clear and intuitive process.

No matter which data formats you encounter, knowing how to work with them, especially XML, is a must-have skill for any developer. In this guide, you’ll learn how to read, navigate, and extract data from XML step by step.

The fundamentals of XML

XML stands for eXtensible Markup Language and is a universal format for storing and transporting data. Its structure allows it to be easily read by humans while remaining compatible with machines. The language was built with a focus on simplicity, adaptability, and broad usability online.

XML can best be described as a tree. Similar to a real tree with branches and leaves, it has a root and nested child elements. This framework illustrates how XML arranges data in a hierarchical format. Python offers various methods to handle XML data, with the ElementTree module being one of the most popular options. The ElementTree module includes two primary classes: ElementTree, which signifies the entire document, and Element, which denotes an individual node. 
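To see that tree structure in practice, here is a minimal sketch using a tiny, made-up document (the <library> structure below is purely illustrative):

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document: <library> is the root, <book> a child node
xml_data = """
<library>
    <book id="1">
        <title>Learning XML</title>
    </book>
</library>
"""

# fromstring() parses the text and returns the root Element
root = ET.fromstring(xml_data)
print(root.tag)                     # library
print(root[0].tag, root[0].attrib)  # book {'id': '1'}
print(root[0][0].text)              # Learning XML
```

Each Element behaves like a list of its children, so indexing and iteration walk straight down the branches of the tree.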

How developers parse XML: the main techniques

In Python, the parse() function is typically the initial step for handling XML files. It processes an XML document and transforms it into a tree structure that can be navigated by your program. This function serves as the basis for various parsing methods, whether you’re utilizing ElementTree, lxml, or other libraries.

To grasp the parsing methods, you need to first know the elements of an XML file:

  • Namespace – helps avoid naming conflicts by qualifying the names of elements and attributes.
  • Root element – the primary container that encompasses all other elements.
  • Attributes – pairs of keys and values that offer extra information about elements.
  • Child elements – nested elements that hierarchically organize data.
  • Text content – the actual information contained within elements.
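All five parts can be seen together in one small, made-up document (the <catalog> structure below is purely illustrative):

```python
import xml.etree.ElementTree as ET

# A made-up document showing each part: root, attribute, child, text, namespace
xml_data = """
<catalog xmlns:media="http://example.com/media">
    <item type="tutorial">
        <title>Parsing XML</title>
        <media:length>5 min</media:length>
    </item>
</catalog>
"""

root = ET.fromstring(xml_data)          # <catalog> is the root element
item = root.find("item")                # <item> is a child element

print(item.get("type"))                 # attribute -> tutorial
print(item.find("title").text)          # text content -> Parsing XML

# Namespaced tags expand to the {uri}localname form
length = item.find("{http://example.com/media}length")
print(length.text)                      # 5 min
```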

These elements create a structured format that enables programs to store and share complex data. After understanding the structure, we can investigate the primary XML parsing techniques in Python: 

1. ElementTree Module  

A built-in library in Python with an easy-to-use API for exploring XML trees. It works great for most tasks as it balances usability with performance.  

2. lxml  

A third-party library that includes support for XPath, XSLT, and validation. It is faster and has more features than ElementTree. It is perfect for complex XML processing.  

3. DOM (Document Object Model)  

This loads the complete XML document into memory as a tree structure. While it’s straightforward to navigate and modify, it can be memory-intensive for larger files. 

4. SAX (Simple API for XML)  

An event-driven parser that sequentially processes XML. It is efficient in terms of memory usage and quick for handling large files, but it can be less intuitive for managing complex operations.  

5. Untangle  

Transforms XML data into Python objects for easy access. It is perfect for quick parsing with minimal code. However, it is less adaptable for intricate XML structures.

For this tutorial, you’ll need…

  • Python and Python requests 

In case you don’t have Python 3.x installed on your computer, you can download it from https://www.python.org/. Open a terminal and install Requests using pip:

pip install requests

or, if your system uses the pip3 command:

pip3 install requests

  • Visual Studio Code (VS Code)

To run all the necessary commands, you can use the terminal in Visual Studio Code (VS Code). Just download it from the official website and follow the installation instructions for your operating system.

  • RSS Feed

An RSS feed is a special XML file format used to automatically deliver news or updates from a website. Simply put, it’s a kind of “list of the latest posts” structured in a way that programs can read.

For example, on our website, there’s a section with tutorials. Instead of checking the site every day to see if a new tutorial has been published, the site provides an RSS feed — an XML file containing information about all recent posts. We’ll use the DataImpulse RSS feed for our tutorial. Now, let’s look at the libraries we need and the following code.

ElementTree

Let’s look at how to use Python’s built-in ElementTree module to parse XML data from an RSS feed. 

The first step is to import xml.etree.ElementTree. The process is simple: we send a GET request to the RSS feed URL, and the server responds with XML data. The .content attribute of the response gives us the raw byte data that ElementTree can parse. Using ET.fromstring(), we convert this XML text into a tree structure and get the root element. Inside it, most content is organized under <channel><item> elements.

The find() method returns the first matching element, while findall() returns all child elements with a given tag. Here, items is a list of all <item> elements, each representing a tutorial or article.

Example of the code: 


import xml.etree.ElementTree as ET
import requests

# URL of the RSS feed (from DataImpulse site)
FEED_URL = "https://dataimpulse.com/tutorials-category/web-scraping-seo/feed/"

# Download the XML data
response = requests.get(FEED_URL)
xml_data = response.content

# Parse the XML
root = ET.fromstring(xml_data)

# In RSS feeds, items are inside the <channel> element
channel = root.find("channel")
items = channel.findall("item")

# Print titles and links of the items
for item in items:
   title = item.findtext("title")
   link = item.findtext("link")
   print(f"Title: {title}")
   print(f"Link: {link}\n")

Finally, we extract and print the title and link for every item using findtext(), which gets the text content directly. When you run the script, you’ll get a neatly printed list of all tutorial titles and their URLs. 

lxml 

The lxml library is an external package that extends ElementTree’s functionality and adds support for XPath, a query language for navigating XML documents. In this example, we’ll use lxml.etree to parse an RSS feed and extract post titles.

We start by fetching the RSS feed using requests.get(), just like before. The response content is passed to etree.fromstring(), which parses the XML and returns the root element. With lxml, we can use XPath expressions to find elements, which makes navigation faster and more flexible than the standard ElementTree methods.


import requests
from lxml import etree

# URL of the RSS feed
FEED_URL = "https://dataimpulse.com/tutorials-category/web-scraping-seo/feed/"

# Download the XML data
response = requests.get(FEED_URL)
xml_data = response.content

# Parse XML
root = etree.fromstring(xml_data)
items = root.xpath("//channel/item")

# Show list of titles with numbers
print("Available blog posts:\n")
for item in items:
   title = item.findtext("title")
   print(title)

Finally, we loop through the <item> elements, extract each post’s title using findtext("title"), and print them.

Minidom 

In this example, we’re using Minidom, short for Minimal DOM Implementation. It’s part of Python’s standard library and works by loading the entire XML document into memory as a Document Object Model (DOM). 

We call minidom.parseString() to parse the XML and build a DOM object (doc). Two familiar DOM methods do the work here. The first is getElementsByTagName("tag"), which finds all elements with a given tag name. The second is .firstChild.nodeValue, which retrieves the text content of a node.

Finally, we use toprettyxml() to print the first <item> element in a formatted XML layout.

Here’s how to parse data from an RSS feed using Minidom:


import requests
from xml.dom import minidom

# URL of the RSS feed
FEED_URL = "https://dataimpulse.com/tutorials-category/web-scraping-seo/feed/"

# Download XML data
response = requests.get(FEED_URL)
xml_data = response.content

# Parse XML with minidom
doc = minidom.parseString(xml_data)

# Get all <item> elements
items = doc.getElementsByTagName("item")

print(f"Total articles found in feed: {len(items)}\n")

# Show titles with links
for i, item in enumerate(items, start=1):
   title = item.getElementsByTagName("title")[0].firstChild.nodeValue
   link = item.getElementsByTagName("link")[0].firstChild.nodeValue
   print(f"{i}. {title}")
   print(f"   Link: {link}\n")

# Show raw XML of the first <item> (just for demo)
print("First <item> as raw XML:\n")
print(items[0].toprettyxml())

SAX 

SAX reads the XML sequentially, element by element. As it goes, it triggers specific events for each XML part:

  • startElement(name, attrs) → when a tag begins
  • characters(content) → when text is read between tags
  • endElement(name) → when a tag ends

Save the code in a file and run it:


import requests
import xml.sax

# Custom handler for SAX parsing
class RSSHandler(xml.sax.ContentHandler):
   def __init__(self):
       super().__init__()
       self.current_tag = ""
       self.in_item = False
       self.title = ""
       self.link = ""
       self.counter = 0

   def startElement(self, name, attrs):
       if name == "item":
           self.in_item = True
           self.title = ""
           self.link = ""
       self.current_tag = name

   def characters(self, content):
       if self.in_item:
           if self.current_tag == "title":
               self.title += content
           elif self.current_tag == "link":
               self.link += content

   def endElement(self, name):
       if name == "item":
           self.counter += 1
           print(f"{self.counter}. {self.title.strip()}")
           print(f"   Link: {self.link.strip()}\n")
           self.in_item = False
       self.current_tag = ""

# Download the XML feed
FEED_URL = "https://dataimpulse.com/tutorials-category/web-scraping-seo/feed/"
response = requests.get(FEED_URL)
xml_data = response.content

# Parse XML with SAX
print("Parsing RSS feed with SAX...\n")
handler = RSSHandler()
xml.sax.parseString(xml_data, handler)

print(f"Total items parsed: {handler.counter}")

The advantage of this method is memory efficiency, since it doesn’t store the whole document. 
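If you want SAX-style memory behavior without leaving the ElementTree API, the standard library’s iterparse() function streams elements as their closing tags are read, letting you discard each one after use. A minimal sketch with an inline sample document:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# A small inline feed for illustration; a real script would pass
# a file path or file object instead of BytesIO
xml_data = b"""<rss><channel>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
</channel></rss>"""

titles = []
# iterparse() yields (event, element) pairs; "end" fires when a tag closes
for event, elem in ET.iterparse(BytesIO(xml_data), events=("end",)):
    if elem.tag == "item":
        titles.append(elem.findtext("title"))
        elem.clear()  # drop the element's children to keep memory flat

print(titles)  # ['First post', 'Second post']
```

Calling elem.clear() after each item is what keeps memory usage flat on large files.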

Untangle

Untangle is considered a simpler way to handle XML. The library converts XML documents into Python objects, so there is no need for manual element traversal or event handling.

When you parse XML with untangle.parse(), it automatically reads the XML file, converts every XML tag into a Python object, and stores text content in .cdata attributes. Instead of working with nodes or events, you simply navigate the XML structure like a regular Python object.

Here is the code:


import requests
import untangle
from io import BytesIO

# RSS feed from DataImpulse
FEED_URL = "https://dataimpulse.com/tutorials-category/web-scraping-seo/feed/"

# Download XML
resp = requests.get(FEED_URL)
resp.raise_for_status()
xml_bytes = resp.content

# Parse XML with untangle (use BytesIO for bytes input)
doc = untangle.parse(BytesIO(xml_bytes))

# Get all <item> elements
items = doc.rss.channel.item

print("Articles from DataImpulse RSS:\n")

# Print articles
for i, it in enumerate(items, start=1):
   title = it.title.cdata if hasattr(it, "title") else ""
   link = it.link.cdata if hasattr(it, "link") else ""
   desc = it.description.cdata if hasattr(it, "description") else ""
  
   print(f"{i}. {title}")
   print(f"   Link: {link}")
   if desc:
       print(f"   Description: {desc}")
   print()


A few additional recommendations

  • Check for errors during parsing

Sometimes feeds may be unavailable or contain unexpected tags. In such cases, use try/except when downloading or parsing XML.
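A sketch of that pattern, catching download errors and parse errors separately (using the same feed URL as the rest of this tutorial):

```python
import requests
import xml.etree.ElementTree as ET

FEED_URL = "https://dataimpulse.com/tutorials-category/web-scraping-seo/feed/"

try:
    response = requests.get(FEED_URL, timeout=10)
    response.raise_for_status()              # raises on HTTP errors (404, 500, ...)
    root = ET.fromstring(response.content)   # raises ParseError on invalid XML
    print(f"Parsed feed, root tag: {root.tag}")
except requests.RequestException as e:
    print(f"Download failed: {e}")
except ET.ParseError as e:
    print(f"Invalid XML: {e}")
```

Catching the two exception types separately tells you whether the problem was the network or the feed itself.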

  • Choose .content for XML downloads

The .content attribute returns raw byte data, preventing encoding issues that the .text attribute might cause.

  • Verify that elements exist 

Not every XML file has the same structure. Use conditions like if item.find("title") is not None before accessing tags.

  • Pay attention to resource usage for large files

For huge XML documents, use SAX, which reads data sequentially without loading everything into memory.

Fetching XML from external sites can be troublesome: IP blocks, rate limits, regional restrictions, and privacy concerns can all get in the way. One tool addresses all of these at once: a proxy.
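With requests, routing a download through a proxy only takes a proxies dictionary. The host, port, and credentials below are placeholders; substitute your own proxy details:

```python
import requests

FEED_URL = "https://dataimpulse.com/tutorials-category/web-scraping-seo/feed/"

# Placeholder credentials and endpoint -- substitute your own proxy details
proxies = {
    "http": "http://username:password@proxy.example.com:8000",
    "https": "http://username:password@proxy.example.com:8000",
}

try:
    response = requests.get(FEED_URL, proxies=proxies, timeout=10)
    print(response.status_code)
except requests.RequestException as e:
    print(f"Request failed: {e}")
```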

Key Highlights

  • Parsing XML is the process of reading XML files and extracting data in a structured, usable way.
  • ElementTree provides a tree-based approach to navigate and extract XML elements.
  • lxml is a library that supports XPath queries for faster XML parsing.
  • Minidom loads the entire XML as a DOM tree, letting you access nodes and text like a structured object.
  • SAX is ideal for large XML files because it processes the XML step by step without keeping everything in memory.
  • Untangle converts XML into Python objects to simply access tags and their content.


Olia L

Content Editor

Content Writer at DataImpulse, specializing in translation studies, and has a solid background in sales & business development. With strong communication, research, and persuasive writing skills, Olia is focused on creating content that engages and appeals to different audiences.
