
Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes every weekday Monday through Friday.

hpr3596 :: Extracting text, tables and images from docx files using Python

In this episode, I describe how I used two Python libraries to extract important data from docx files.


Hosted by Mr. Young on 2022-05-16. The show is flagged as Clean and is released under a CC-BY-SA license.
Tags: python, docx.
The show is available on the Internet Archive.


Duration: 00:08:37

A Little Bit of Python.

Initially based on the podcast "A Little Bit of Python", by Michael Foord, Andrew Kuchling, Steve Holden, Dr. Brett Cannon and Jesse Noller.

Now the series is open to all.

Tools to extract data from docx files:

  1. docx2txt
  2. python-docx2txt
  3. python-docx
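
As background for the tools listed above: a .docx file is a ZIP archive whose main text lives in word/document.xml, with embedded images stored under word/media/. The following is a minimal standard-library sketch of that idea, not how any of the listed tools is actually implemented. The function names extract_paragraphs, list_images and make_demo_docx are hypothetical, and the demo archive contains only enough structure for the example (it is not a complete document that Word would open).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(docx_file):
    """Return the text of each paragraph (w:p) found in a .docx file.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs


def list_images(docx_file):
    """Return the archive names of embedded media files (under word/media/)."""
    with zipfile.ZipFile(docx_file) as archive:
        return [n for n in archive.namelist() if n.startswith("word/media/")]


def make_demo_docx():
    """Build a tiny in-memory archive for the demo (hypothetical helper)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
        archive.writestr("word/media/image1.png", b"not a real image")
    buffer.seek(0)
    return buffer


demo = make_demo_docx()
paragraphs = extract_paragraphs(demo)
images = list_images(demo)
```

In practice the libraries above handle many details this sketch ignores (headers, footnotes, hyperlinks, paragraphs nested in tables), so it is meant only to show where the data sits inside the file.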

Code Snippets

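import csv

import docx
import docx2txt

# src is the path to the .docx file; img_dest is the folder that receives extracted images
text = docx2txt.process(src, img_dest)
with open("data.txt", "wt") as f:
    f.write(text)

# Collect every table as a list of rows, each row a list of cell strings
document = docx.Document(src)
tables = document.tables
data = []
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)

# Write each table to its own CSV file: 0.csv, 1.csv, ...
for i, table_data in enumerate(data):
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table_data)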
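
The CSV step can be tried without a .docx file at hand. The sketch below assumes each table has already been reduced to a list of rows of cell strings; the sample data and the helper name table_to_csv are hypothetical, and the output goes to an in-memory string rather than to numbered files.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize one table (a list of rows of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical data standing in for tables extracted from a document
data = [
    [["Name", "Role"], ["Ada", "Engineer, lead"]],
    [["Item", "Qty"], ["Widget", "3"]],
]

csv_texts = [table_to_csv(table_data) for table_data in data]

# Cells containing commas are quoted automatically, so they survive a round trip
round_trip = list(csv.reader(io.StringIO(csv_texts[0], newline="")))
```

When writing to real files instead, the csv module documentation recommends opening them with newline="" so that line endings are not translated a second time.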

