Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.


In-Depth Series

A Little Bit of Python

Initially based on the podcast "A Little Bit of Python", by Michael Foord, Andrew Kuchling, Steve Holden, Dr. Brett Cannon and Jesse Noller. http://www.voidspace.org.uk/python/weblog/arch_d7_2009_12_19.shtml#e1138

Now the series is open to all.


Parsing XML in Python with Xmltodict - klaatu | 2016-04-20

If untangle is too simple for your XML parsing needs, check out xmltodict. Like untangle, xmltodict is simpler than the usual suspects (lxml, Beautiful Soup), but it's got some advanced features as well.

If you're reading this article, I assume you've read at least the introduction to my article about Untangle, and you should probably also read, at some point, my article on using JSON just so you know your options.

Quick re-cap about XML:

XML is a way of storing data in a hierarchical arrangement so that the data can be parsed later. It's explicit and strictly structured, so one of its benefits is that it paints a fairly verbose definition of data. Here's an example of some simple XML:

<?xml version="1.0"?>
<book>
   <chapter id="prologue">
      <title>
     The Beginning
  </title>
      <para>
     This is the first paragraph.
      </para>
    </chapter>

    <chapter id="end">
      <title>
     The Ending
  </title>
      <para>
     Last para of last chapter.
      </para>
    </chapter>
</book>

And here's some info about the xmltodict library that makes parsing that a lot easier than the built-in Python tools:

Install

Install xmltodict manually, or from your repository, or using pip:

$ pip install xmltodict

or if you need to install it locally:

$ pip install --user xmltodict

Xmltodict

With xmltodict, each element in an XML document gets converted into a dictionary (specifically an OrderedDict), which you then treat basically the same as you would JSON (or any Python OrderedDict).

First, ingest the XML document. Assuming it's called sample.xml and is located in the current directory:

>>> import xmltodict
>>> with open('sample.xml') as f:
...     data = xmltodict.parse(f.read())

If you're a visual thinker, you might want or need to see the data. You can look at it just by dumping data:

>>> data
OrderedDict([('book', OrderedDict([('chapter',
[OrderedDict([('@id', 'prologue'),
('title', 'The Beginning'),
...and so on...

Not terribly pretty to look at. Slightly less ugly is your data set piped through json.dumps:

>>> import json
>>> json.dumps(data)
'{"book": {"chapter": [{"@id": "prologue",
"title": "The Beginning", "para": "This is the first paragraph."},
{"@id": "end", "title": "The Ending",
"para": "Last para of last chapter."}]
}}'
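If one long line is still hard to read, json.dumps takes an indent argument that spreads the structure over several lines. Here's a minimal sketch; the data dict below is a hand-typed stand-in for what xmltodict.parse returned above (an assumption, so that the example runs without the sample file):

```python
import json

# Hand-typed stand-in for the parsed sample (an assumption for illustration;
# in the article, `data` comes from xmltodict.parse).
data = {
    "book": {
        "chapter": [
            {"@id": "prologue", "title": "The Beginning",
             "para": "This is the first paragraph."},
            {"@id": "end", "title": "The Ending",
             "para": "Last para of last chapter."},
        ]
    }
}

# indent=2 puts each key on its own line, nested two spaces per level.
pretty = json.dumps(data, indent=2)
print(pretty)
```

The indented form is only for human eyes; it loads back into exactly the same structure.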

You can try other feats of pretty printing, if they help:

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> pp.pprint(data)
{ 'book': { 'chapter': [{'@id': 'prologue',
                         'title': 'The Beginning',
                         'para': 'This is the ...
                         ...and so on...

More often than not, though, you're going to be "walking" the XML tree, looking for specific points of interest. This is fairly easy to do, as long as you remember that syntactically you're dealing with a Python dict, while structurally, the nesting of the elements matters.

Elements (Tags)

Exploring the data element-by-element is very easy. Calling your data set by its root element (in our current example, that would be data['book']) would return the entire data set under the book tag. We'll skip that and drill down to the chapter level:

>>> data['book']['chapter']
[OrderedDict([('@id', 'prologue'), ('title', 'The Beginning'),
('para', 'This is the first paragraph.')]),
OrderedDict([('@id', 'end'), ('title', 'The Ending'),
('para', 'Last para of last chapter.')])]

Admittedly, it's still a lot of data to look at, but you can see the structure.

Since we have two chapters, we can pick which chapter to select by its index, if we want. To see the zeroth chapter:

>>> data['book']['chapter'][0]
OrderedDict([('@id', 'prologue'),
('title', 'The Beginning'),
('para', 'This is the first paragraph.')])

Or the first chapter:

>>> data['book']['chapter'][1]
OrderedDict([('@id', 'end'), ('title', 'The Ending'),
('para', 'Last para of last chapter.')])

And of course, you can continue narrowing your focus:

>>> data["book"]["chapter"][0]['para']
'This is the first paragraph.'

It's sort of like XPath for toddlers. Having had to work with XPath, I'm happy to have this option.
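If you find yourself typing long chains of brackets, a tiny helper can follow a path for you. This is only a sketch: walk is a made-up name, not part of xmltodict, and the data dict is a hand-typed stand-in for the parsed sample:

```python
def walk(node, *path):
    """Follow dict keys and list indexes, one step at a time."""
    for step in path:
        node = node[step]  # works for both dicts (keys) and lists (indexes)
    return node


# Hand-typed stand-in for the result of xmltodict.parse (an assumption).
data = {
    "book": {
        "chapter": [
            {"@id": "prologue", "title": "The Beginning",
             "para": "This is the first paragraph."},
            {"@id": "end", "title": "The Ending",
             "para": "Last para of last chapter."},
        ]
    }
}

first_para = walk(data, "book", "chapter", 0, "para")
last_title = walk(data, "book", "chapter", -1, "title")
print(first_para)
print(last_title)
```

A missing key or index still raises KeyError or IndexError, just as the bracket chain would, so you find out quickly when the document isn't shaped the way you expected.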

Attributes

You may have already noticed that in the dict containing our data, there is some special notation happening. For instance, there is no @id element in our XML, and yet that appears in the dict.

Xmltodict uses the @ symbol to signify an attribute of an element. So to look at the attribute of an element:

>>> data['book']['chapter'][0]['@id']
'prologue'

If you need to see the id attribute of each chapter tag, just iterate over the list of chapters. A simple example:

>>> for chapter in data['book']['chapter']:
...     chapter['@id']
...
'prologue'
'end'
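One thing to watch for: xmltodict only gives you a list when a tag is repeated. A book with a single chapter comes back as one dict rather than a list of one, so a loop written for the two-chapter sample can trip over it. Here's a sketch of one way to cope, using hand-typed stand-ins for the parsed data and two hypothetical helpers, as_list and chapter_ids (neither is part of xmltodict):

```python
def as_list(value):
    """Return value unchanged if it is a list, else wrap it in a list."""
    return value if isinstance(value, list) else [value]


def chapter_ids(data):
    """Collect the id attribute of every chapter, however many there are."""
    return [chapter["@id"] for chapter in as_list(data["book"]["chapter"])]


# Stand-ins for parsed documents (assumptions for illustration).
one_chapter = {"book": {"chapter": {"@id": "prologue",
                                    "title": "The Beginning"}}}
two_chapters = {"book": {"chapter": [
    {"@id": "prologue", "title": "The Beginning"},
    {"@id": "end", "title": "The Ending"},
]}}

print(chapter_ids(one_chapter))
print(chapter_ids(two_chapters))
```

xmltodict.parse also accepts a force_list argument for making chosen tags always come back as lists; see the module's documentation for the details.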

Contents

In addition to special notation for attributes, xmltodict uses the # prefix to denote contents of complex elements. To show this example, I'll make a minor modification to sample.xml:

<?xml version="1.0"?>
<book>
   <chapter id="prologue">
      <title>
     The Beginning
  </title>
      <para class="linux">
     This is the first paragraph.
      </para>
    </chapter>

    <chapter id="end">
      <title>
     The Ending
  </title>
      <para class="linux">
     Last para of last chapter.
      </para>
    </chapter>
</book>

Notice that the <para> elements now have a class attribute (with the value linux), and also contain text content (unlike <chapter> elements, which have attributes but only contain other elements).

Look at this data structure:

>>> import xmltodict
>>> with open('sample.xml') as g:
...     data = xmltodict.parse(g.read())
>>> data['book']['chapter'][0]
OrderedDict([('@id', 'prologue'),
('title', 'The Beginning'),
('para', OrderedDict([('@class', 'linux'),
('#text', 'This is the first paragraph.')]))])

There is a new entry in the dictionary: #text. It contains the text content of the <para> tag and is accessible in the same way that an attribute is:

>>> data['book']['chapter'][0]['para']['#text']
'This is the first paragraph.'
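The catch is that the same tag can now come back in two shapes: a plain string when the element has no attributes, and a dict with a #text key when it does. Here's a small sketch of a helper that copes with both; text_of is a hypothetical name, and the dicts are hand-typed stand-ins for parsed data:

```python
def text_of(node):
    """Return the text of an element, whichever shape it arrived in."""
    if isinstance(node, dict):
        # Element with attributes: the text lives under '#text'.
        return node.get("#text", "")
    # Element without attributes: the value is already the text.
    # Treating None as empty text is a choice made for this sketch.
    return node or ""


# Stand-ins for the two versions of the first chapter (assumptions).
plain = {"para": "This is the first paragraph."}
with_attr = {"para": {"@class": "linux",
                      "#text": "This is the first paragraph."}}

print(text_of(plain["para"]))
print(text_of(with_attr["para"]))
```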

Advanced

The xmltodict module supports XML namespaces and can also dump your data back into XML. For more documentation on this, have a look at the module on github.com/martinblech/xmltodict.

What to Use?

Between untangle, xmltodict, and JSON, you have a pretty good set of options for data parsing. There really are different uses for each one, so there's not necessarily a "right" or "wrong" answer. Try them out, see what you prefer, and use what is best. If you don't know what's best, use what you're most comfortable with; you can always improve it later.

[EOF]

Made on Free Software.


Parsing XML in Python with Untangle - klaatu | 2016-04-19

XML is a popular way of storing data in a hierarchical arrangement so that the data can be parsed later. For instance, here is a simple XML snippet:

<?xml version="1.0"?>
<book>
   <chapter id="prologue">
      <title>
     The Beginning
 </title>
   </chapter>
</book>

The nice thing about XML is that it is explicit and strictly structured. The trade-off is that it's pretty verbose, and getting to where you want to go often requires fairly complex navigation.

If you do a quick search online for XML parsing in Python, your two most common results are lxml and beautifulsoup. These both work, but using them feels less like opening a dictionary (as with JSON) to look up a definition and more like wandering through a library to gather up all the dictionaries you can possibly find.

In JSON, the thought process might be something like:

"Go to the first chapter's title and print the contents."

With traditional XML tools, it's more like:

"Open the book element and gather all instances of titles that fall within those chapters. Then, look into the resulting object and print the contents of the first occurrence."
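To make that concrete, here's roughly what the second thought process looks like in code. This sketch uses ElementTree from Python's standard library as a stand-in for a traditional tool (a choice made for the example; lxml and Beautiful Soup have their own APIs), with the sample XML pasted in as a string:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<book>
   <chapter id="prologue">
      <title>
     The Beginning
      </title>
   </chapter>
</book>
"""

# "Open the book element..."
root = ET.fromstring(SAMPLE)

# "...and gather all instances of titles that fall within those chapters."
titles = root.findall("./chapter/title")

# "Then, look into the resulting object and print the contents
# of the first occurrence."
first_title = titles[0].text.strip()
print(first_title)
```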

There are at least two libraries that you can install and use to bring some sanity to complex XML structures, one of which is untangle.

Untangle

With untangle, each element in an XML document gets converted into a class, which you can then probe for information. Makes no sense? Well, follow along and it will become clear.

First, ingest the XML document. Assuming it's called sample.xml and is located in the current directory:

>>> import untangle
>>> data = untangle.parse('sample.xml')

Now our simple XML sample is sitting in RAM, as a Python class. The first element is <book> and all it contains is more elements, so its results are not terribly exciting:

>>> data.book
Element(name = book, attributes = {}, cdata = )

As you can see, it does identify itself as "book" (under the name listing) but otherwise, not much to look at. That's OK, we can keep drilling down:

>>> data.book.chapter
Element(name = chapter, attributes = {'id': 'prologue'}, cdata = )

Now things get more interesting. The next element identifies itself as "chapter", and reveals that it has an attribute "id" which has a value of "prologue". To continue down this path:

>>> data.book.chapter.title
Element(name = title, attributes = {}, cdata = The Beginning )

And now we have a pretty complete picture of our little XML document. We have a breadcrumb trail of where we are in the form of the class we are invoking (data.book.chapter.title) and we have the contents of our current position.

Sniping

That's very linear; if you know your XML schema (and you usually do, since XML is quite strict) then you can grab values without all the walking. For instance, we know that our chapters have 'id' attributes, so we can ask for exactly that:

>>> data.book.chapter['id']
'prologue'

You can also get the contents of elements by looking at the cdata component of the class. Depending on the formatting of your document, untangle may be a little too literal with how it stores contents of elements, so you may want to use .strip() to prettify it:

>>> data.book.chapter.title.cdata.strip()
'The Beginning'

Dealing with More Than One Element

My example so far is nice and tidy, with only one chapter in the book. Generally you'll be dealing with more data than that. Let's add another chapter to our sample file, and some content to each:

<?xml version="1.0"?>
<book>
   <chapter id="prologue">
      <title>
     The Beginning
  </title>
      <para>
     This is the first paragraph.
      </para>
    </chapter>

    <chapter id="end">
      <title>
     The Ending
  </title>
      <para>
     Last para of last chapter.
      </para>
    </chapter>
</book>

Accessing each chapter is done with index designations, just like with a list:

>>> data.book.chapter[0]
Element(name = chapter, attributes = {'id': 'prologue'}, cdata = )
>>> data.book.chapter[1]
Element(name = chapter, attributes = {'id': 'end'}, cdata = )

If there is more than one instance of a tag, you must use a designator or else untangle won't know what to return. For example, if we want to access either the title or para elements within a chapter:

>>> data.book.chapter.title
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'title'

Oops. But if we tell it which one to look at:

>>> data.book.chapter[0].title.cdata.strip()
'The Beginning'
>>> data.book.chapter[1].title.cdata.strip()
'The Ending'

Or you can look at the paragraph instead of the title. The lineage is the same, only instead of looking at the title child, you look at the para child:

>>> data.book.chapter[0].para.cdata.strip()
'This is the first paragraph.'
>>> data.book.chapter[1].para.cdata.strip()
'Last para of last chapter.'

You can also iterate over items:

>>> for chapter in data.book.chapter:
...     print(chapter)
Element <chapter> with attributes {'id': 'prologue'} and children
[Element(name = title, attributes = {}, cdata = The Beginning ),
Element(name = para, attributes = {}, cdata = This is the first paragraph.)]

Element <chapter> with attributes {'id': 'end'} and children
[Element(name = title, attributes = {}, cdata = The Ending ),
Element(name = para, attributes = {}, cdata = Last para of last chapter.)]

And so on.

Easy and Fast

I'll admit the data structure of the classes does look odd, and you could probably argue it's not the cleanest and most elegant of all output; it's unnerving to see empty cdata fields or to constantly run into the need to strip() whitespace. However, the ease, speed, and intuitiveness of parsing XML with untangle are usually well worth any trade-offs.



Parsing JSON with Python - klaatu | 2016-04-15

JSON is a popular way of storing data in a key/value type arrangement so that the data can be parsed easily later. For instance, here is a very simple JSON snippet:

{
"name":"tux",
"health":"23",
"level":"4"
}

If you are like me, three questions probably spring to your mind:

  1. Doesn't that look an awful lot like a Python dictionary?

    Yes, it looks exactly like a Python dictionary. They are shockingly similar. If you are comfortable with Python lists and dictionaries, you will feel right at home with JSON.

  2. I don't feel comfortable with dictionaries, can't I just use a delimited text file?

    You can, but you will have to write parsers for it yourself. If your data gets very complex, the parsing can get pretty ugly.

    That is not to say that you should not use a simple delimited text file if that is all that your programme needs. For example, I would not want to open a config file as a user and find that I have to format all my options as valid JSON.

    Just know that JSON is out there and available, and that the JSON Python module has some little features that make your life easier when dealing with sets of data.

  3. Why not use XML instead?

    You can. Mostly one should use the most appropriate format for one's project. I'm a big fan of XML, but sometimes JSON makes more sense.

I am not going to make this post about teaching the JSON format. If you need clarification on how to structure data into JSON, go through a tutorial on it somewhere; there are several good ones online. Honestly, it's not that complex; you can think of JSON as nested dictionaries.
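To see how close the two really are, you can feed the snippet above to json.loads and get a dictionary straight back. A quick sketch (the second, nested string is made up for illustration):

```python
import json

# The snippet from the top of the article, as a string.
text = '{"name": "tux", "health": "23", "level": "4"}'

player = json.loads(text)
print(type(player).__name__)   # the result is an ordinary dict
print(player["name"])

# Values that are quoted in JSON stay strings in Python.
print(isinstance(player["health"], str))

# A made-up nested example: JSON true and null map to True and None.
nested = json.loads('{"tux": {"health": 23, "alive": true, "items": null}}')
print(nested["tux"])
```

The differences are small but real: JSON insists on double quotes, and it spells its constants true, false, and null.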

Starting from scratch, let's say that you write a programme that by nature gathers data as it runs. When the user quits, you want to save the data to a file so that when the user resumes the app later, they can load the file back in and pick up where they left off.

Storing Data as JSON

At its most basic, the JSON data structure is basically the same as a Python dictionary, and in fact the nice thing about JSON is that it can be directly imported into a Python dictionary. Usually, however, you are resorting to JSON because you have somewhat complex data, so in the sample code we will use a dictionary-within-a-dictionary:

#!/usr/bin/env python

game = {'tux': {'health': 23, 'level': 4}, 'beastie': {'health': 13, 'level': 6}}
# you can always add more to your dictionary

game['konqi'] = {'health': 18, 'level': 7}

That code creates a dictionary called game which stores the player name and a corresponding dictionary of attributes about how the player is doing in the progress of the game. As you can see after the comment, adding new players is simple.

Now let's see how to save that data to a save file.

## continued...
import json

with open('dosiero.json', 'w') as outfile:
    json.dump(game, outfile)

That would be your save command. Simple as that, all the structured content of your game dictionary is committed to a file on your hard drive.
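One caveat worth knowing: JSON has fewer types than Python, so not everything survives the trip unchanged. Here's a small sketch using json.dumps and json.loads (the string-based cousins of dump and load); the inventory and scores entries are invented for the example:

```python
import json

game = {
    "tux": {"health": 23, "level": 4, "inventory": ("fish", "hat")},
    "scores": {1: "beastie", 2: "konqi"},
}

# Save to a string and load it straight back.
restored = json.loads(json.dumps(game))

print(restored["tux"]["inventory"])  # the tuple came back as a list
print(list(restored["scores"]))      # the integer keys came back as strings
print(restored["tux"]["health"])     # plain numbers are unaffected
```

If your programme relies on tuples or non-string keys, convert them back after loading.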

Reading Data from a JSON File

If you are saving data to JSON, you probably will eventually want to read the data back into Python. For this, Python features the function json.load:

import json

# the with block closes the file for you when it finishes
with open('dosiero.json') as dosiero:
    game = json.load(dosiero)

print(game['tux'])     # prints {'health': 23, 'level': 4}
print(game['tux']['health'])    # prints 23
print(game['tux']['level'])     # prints 4
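In the save-and-resume scenario described earlier, the save file won't exist the first time the programme runs. Here's a sketch of one way to handle that; load_game is a hypothetical helper name, and the temporary directory is only there so the example doesn't touch your real files:

```python
import json
import os
import tempfile


def load_game(path):
    """Return the saved game dict, or an empty dict if there is no save yet."""
    try:
        with open(path) as infile:
            return json.load(infile)
    except FileNotFoundError:
        return {}


# A throwaway location for the save file (an assumption for this example).
save_path = os.path.join(tempfile.mkdtemp(), "dosiero.json")

game = load_game(save_path)      # first run: no file yet, so start empty
print(game)

game["tux"] = {"health": 23, "level": 4}
with open(save_path, "w") as outfile:
    json.dump(game, outfile)

resumed = load_game(save_path)   # later run: the file exists now
print(resumed["tux"]["health"])
```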

As you can see, JSON integrates surprisingly well with Python, so it's a great format when your data fits in with its model.

Have fun!



A Little Bit of Python: Episode 14 2010-06-06 - Michael Foord | 2011-03-10

A Little Bit of Python (http://bitofpython.com/) is an occasional podcast on all things Python (http://python.org). The five protagonists on the show are all core Python developers and members of the Python Software Foundation. They are: Michael Foord (http://www.voidspace.org.uk), author of IronPython in Action and maintainer of unittest; Andrew Kuchling (http://www.amk.ca/), creator of PyCrypto and one of the python.org webmasters; Steve Holden (http://holdenweb.com/), PSF chairman; Dr. Brett Cannon (http://sayspy.blogspot.com/), author of importlib amongst other things; and Jesse Noller (http://jessenoller.com/), maintainer of multiprocessing.

Episode 14.Bit-of-Python-2010-06-06

Interview with Christian Tismer

Christian Tismer is a long standing member of the Python community and, amongst other things, he is the original creator of Stackless and has worked on both psyco and PyPy. In this interview we discuss all of these projects, both their history and what the future holds for them.


A Little Bit of Python: Episode 13 - Michael Foord | 2011-02-17

Several topics are covered in this 40-minute episode:

  • Python 2.7 beta 1 released.
  • PEP 3147: New bytecode directory layout.
  • Google's Summer of Code beginning.
  • SEC proposes mandating Python's use in financial filings.
  • PyCon interview: Dr Tim Couper
  • How to Fund Python Development
  • Python for Beginners: Getting started on Windows.


A Little Bit of Python: 12 Global Interpreter Lock; Concurrency - Michael Foord | 2010-10-28

We discuss the significance of the Global Interpreter Lock (or GIL) and recent work at improving it, PEP 3148 proposing futures as a new asynchronous execution method, some recent IronPython work, and a new Python podcast.

Episode 11.Bit-of-Python-2010-04-07 - Michael Foord | 2010-10-19

Episode 11.Bit-of-Python-2010-04-07

Interview with Antoine Pitrou

An interview recorded at PyCon 2010, Atlanta, with Antoine Pitrou. Antoine Pitrou is the core CPython developer responsible for creating the "new-GIL".

Interview with Richard Jones - Michael Foord | 2010-06-17

10.Bit-of-Python-2010-03-24

Interview with Richard Jones

Richard Jones organizes the PyWeek game programming challenge. Richard and Andrew discuss how the challenge is run, what sort of games people write, and the libraries that are used.

Little Bit of python episode nine - Michael Foord | 2010-06-03

9.Bit-of-Python-2010-03-22

Bits of News

We discuss a variety of recent news items: some recent CPython changes, the new PyPy 1.2 release, crypto support and Debian packaging for IronPython, the PyWeek game programming contest, upcoming conference plans, and upcoming podcast plans.

Little Bit of Python Episode 8 - Michael Foord | 2010-05-18

Episode 8.Bit-of-Python-2010-03-20

Interview: Mark Shuttleworth

Steve Holden interviews Mark Shuttleworth, founder of the Ubuntu project and a keynote speaker at PyCon 2010.

Little Bit of Python Episode 7 - Michael Foord | 2010-04-30

Episode 7.Bit-of-Python-2010-03-15

Unladen Swallow

PEP 3146 proposes that the Unladen Swallow branch, which adds a just-in-time compiler to Python, be merged into the main Python repository. We discuss what Unladen Swallow does, and what impact it's likely to have.

Episode 6.Bit-of-Python - Michael Foord | 2010-04-16

Episode 6.Bit-of-Python-2010-03-10

Interview: Van Lindberg

Michael Foord interviews Van Lindberg, conference chair for PyCon 2010 in Atlanta GA, on the success of the conference, plans for the 2011 Atlanta conference, and his work as an intellectual-property lawyer.

New Features in Python 2.7 - Michael Foord | 2010-04-06

Episode 5.Bit-of-Python-2010-02-10

New Features in Python 2.7

We discuss some of the new features coming in Python 2.7.

Mercurial Transition and comments on the Python Package Index - Michael Foord | 2010-04-03

Mercurial Transition / Python 2.7 alpha 1 / Comments on the Python Package Index

We cover the status of the transition to using Mercurial for the Python source code, the first alpha release of Python 2.7, and the recent controversy over adding commenting to the Python Package Index.

Selecting Talks for PyCon 2010 - Michael Foord | 2010-02-19

Selecting Talks for PyCon 2010
In this episode, we discuss how talks were selected for the upcoming PyCon conference, and what else is being planned.

Python Language Moratorium Python 2.7 End of the Line? - Michael Foord | 2010-02-02

Python Language Moratorium / Python 2.7 End of the Line?

A round-table discussion of the moratorium on Python language development and whether Python 2.7 will be the last of the 2.x series.