
Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.

hpr3596 :: Extracting text, tables and images from docx files using Python

In this episode, I describe how I used two Python libraries to extract important data from docx files.


Hosted by b-yeezi on Monday 2022-05-16. Flagged as Clean and released under a CC-BY-SA license.
Tags: python,docx.


Part of the series: A Little Bit of Python

Initially based on the podcast "A Little Bit of Python", by Michael Foord, Andrew Kuchling, Steve Holden, Dr. Brett Cannon and Jesse Noller.

Now the series is open to all.

Tools to extract data from docx files:

  1. docx2txt
  2. python-docx2txt
  3. python-docx

Code Snippets

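import csv

import docx      # from the python-docx package
import docx2txt

# src is the path to the .docx file; img_dest is an existing directory for images.
text = docx2txt.process(src, img_dest)
with open("data.txt", "wt") as f:
    f.write(text)

document = docx.Document(src)
tables = document.tables
data = []
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)

# Write each extracted table to its own CSV file (0.csv, 1.csv, ...).
for i, table_data in enumerate(data):
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table_data)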
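
The CSV-writing step above can be sketched as a small reusable function. This is only an illustration, not code from the episode: the name write_tables_to_csv and the out_dir parameter are made up here, and it assumes each table has already been reduced to a list of rows of cell text, so it needs nothing beyond the standard library.

```python
import csv
from pathlib import Path


def write_tables_to_csv(tables, out_dir):
    """Write each table (a list of rows of strings) to <index>.csv in out_dir.

    Hypothetical helper: 'tables' is assumed to be nested lists of cell text,
    such as the data collected from a document's tables.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, rows in enumerate(tables):
        path = out_dir / f"{i}.csv"
        # newline="" lets the csv module control line endings itself.
        with open(path, "wt", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        paths.append(path)
    return paths
```

Calling write_tables_to_csv(data, "tables") would then produce tables/0.csv, tables/1.csv and so on, one file per table. Cells containing commas or quotes are escaped by the csv module, which is the main reason to prefer it over joining strings by hand.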

Show Transcript

Automatically generated using whisper

whisper --model tiny --language en hpr3596.wav
