hpr3596 :: Extracting text, tables and images from docx files using Python
In this episode, I describe how I used two Python libraries to extract important data from docx files.
Hosted by b-yeezi on Monday 2022-05-16. Flagged as Clean and released under a CC-BY-SA license.
Tags: python, docx.
Part of the series: A Little Bit of Python
Initially based on the podcast "A Little Bit of Python", by Michael Foord, Andrew Kuchling, Steve Holden, Dr. Brett Cannon and Jesse Noller. https://www.voidspace.org.uk/python/weblog/arch_d7_2009_12_19.shtml#e1138
Now the series is open to all.
Tools to extract data from docx files: docx2txt and python-docx (imported as docx).
Code Snippets
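import docx2txt

# src: path to the .docx file; img_dest: directory that receives the extracted images
text = docx2txt.process(src, img_dest)
with open("data.txt", "wt") as f:
    f.write(text)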
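The img_dest argument above is the directory where docx2txt saves the document's images. To see where those images come from, here is a stdlib-only sketch. It assumes the usual .docx layout, in which the file is a ZIP package and embedded images sit under word/media/. extract_images is a hypothetical helper name, not part of docx2txt, and this is not necessarily how docx2txt does the job internally.

```python
import zipfile
from pathlib import Path


def extract_images(docx_path, img_dest):
    """Copy every file under word/media/ in the .docx package into img_dest."""
    img_dest = Path(img_dest)
    img_dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as package:
        for name in package.namelist():
            # Assumption: embedded images live under word/media/
            if name.startswith("word/media/") and not name.endswith("/"):
                target = img_dest / Path(name).name
                target.write_bytes(package.read(name))
                saved.append(target)
    return saved
```

Calling extract_images(src, img_dest) returns the list of paths it wrote, which makes it easy to check that a document really contained images.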
import docx  # provided by the python-docx package

# Collect the text of every cell, table by table
document = docx.Document(src)
tables = document.tables
data = []
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)
import csv

# Write each extracted table (a list of rows) to its own CSV file
for i, table_data in enumerate(data):
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table_data)
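The table loop and the CSV step above can also be folded into two small functions. This is a minimal sketch, not the code used in the episode: table_to_rows and write_tables_csv are hypothetical helper names, and the out_dir parameter is an assumption. It relies only on the rows, cells and text attributes that python-docx tables expose, so it can be tried with stand-in objects before pointing it at a real document.

```python
import csv
from pathlib import Path


def table_to_rows(table):
    """Return the table's cell text as a list of rows (lists of strings)."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def write_tables_csv(tables, out_dir="."):
    """Write each table to <out_dir>/<index>.csv and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, table in enumerate(tables):
        path = out_dir / f"{i}.csv"
        # newline="" lets the csv module control line endings itself
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(table_to_rows(table))
        paths.append(path)
    return paths
```

With python-docx installed, this would be called as write_tables_csv(docx.Document(src).tables, "tables"), where "tables" is just an example output directory.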