
Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.

hpr2091 :: Everyday Unix/Linux Tools for data processing

In this episode, I give some examples of common and uncommon tools for processing data files.


Hosted by b-yeezi on Monday 2016-08-08 is flagged as Clean and is released under a CC-BY-SA license.
Tags: linux,unix,data,command-line.


Here are some of the tools I use to process and clean data from all manner of customers:


The detox utility renames files to make them easier to work with. It removes spaces and other such annoyances. It will also translate or clean up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped characters.
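As a cautious way to try this out, the sketch below creates a scratch directory containing one awkwardly named file and asks detox for a dry run. The directory, the file name, and the `detox_status` variable are made up for the illustration; the `-n` option is detox's dry-run flag, which reports the planned renames without touching anything. The block skips the call if detox is not installed.

```shell
# Scratch directory with a made-up, awkwardly named file (illustrative only).
demo_dir=$(mktemp -d)
touch "$demo_dir/My Report (final).txt"

# Run detox only if it is installed. -n is a dry run: it prints the
# renames it would perform and leaves the files as they are.
if command -v detox >/dev/null 2>&1; then
    detox -n "$demo_dir"/*
    detox_status="ran"
else
    echo "detox is not installed; skipping the dry run"
    detox_status="skipped"
fi
```

Dropping `-n` would perform the renames for real, so it is worth reading the dry-run output first.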

See other episodes for great sed information. I like to use it to remove DOS end-of-line and end-of-file characters:

sed -i 's/\r//g' *.txt
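To see the effect of this cleanup on a concrete file, here is a minimal sketch that builds a small DOS-style sample and strips it. The file name and contents are invented for the demonstration. It assumes GNU sed, where `-i` edits in place and the `\r` and `\xHH` escapes are understood; other sed implementations may behave differently. It also assumes that the DOS end-of-file marker is Ctrl-Z (byte 0x1A), which the show notes do not spell out.

```shell
# Work on a throwaway file so no real data is touched.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"

# Made-up DOS-style file: CRLF line endings plus a trailing Ctrl-Z (octal 032).
printf 'id,name\r\n1,alice\r\n2,bob\r\n\032' > "$sample"

# Strip carriage returns (the command from the show notes; GNU sed assumed).
sed -i 's/\r//g' "$sample"

# Assumption: the DOS end-of-file marker is Ctrl-Z, written here as \x1a.
sed -i 's/\x1a//g' "$sample"

# Count any remaining CR and Ctrl-Z bytes.
cr_count=$(tr -cd '\r' < "$sample" | wc -c)
eof_count=$(tr -cd '\032' < "$sample" | wc -c)
echo "CR bytes left: $cr_count, EOF bytes left: $eof_count"
```

Counting the leftover bytes with tr and wc is just one way to confirm the cleanup; `file sample.txt` or `od -c sample.txt` would serve equally well.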

Command-line tools

  • ack
  • awk
  • detox
  • grep
  • pandoc
  • pdftotext -layout
  • sed
  • unix2dos and dos2unix
  • wget
  • curl
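The list above names the tools without showing them in use. As one possible illustration, the sketch below combines grep and awk on a tiny comma-separated file. The data, the column names, and the variable names are all made up for the example and do not come from the episode.

```shell
# Made-up sample data; the columns are illustrative only.
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
customer,region,amount
acme,east,100
globex,west,250
initech,east,75
EOF

# grep: keep only the rows for the "east" region.
east_rows=$(grep ',east,' "$datafile")

# awk: split on commas, skip the header row, and total the third column.
total=$(awk -F',' 'NR > 1 { sum += $3 } END { print sum }' "$datafile")

# Combined: total the amount column for east-region rows only.
east_total=$(grep ',east,' "$datafile" | awk -F',' '{ sum += $3 } END { print sum }')

echo "all regions: $total, east only: $east_total"
rm -f "$datafile"
```

Splitting on a bare comma only works for simple files like this one; data with quoted fields containing commas needs a real CSV parser.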

R libraries

  • RCurl
  • XML
  • rvest
  • tm
  • xlsx

Python libraries

Vim tricks

  • buffer searches (:vim /pattern/ ##)
  • Ack plugin
  • bufdo (:bufdo %s/pattern/replace/ge | update)

Other tools



Comment #1 posted on 2016-08-09T00:46:44Z by Jonathan Kulp


Thanks, this is a genius tool. Never heard of it before.

Comment #2 posted on 2016-08-17T16:55:35Z by Ken Fallon

I love detox

detox -vr *

Wow, what an excellent tool.

Comment #3 posted on 2016-08-19T16:30:03Z by Dave Morriss

Thanks for mentioning 'ack'

Wow! I had never encountered 'ack' before. It's amazing.

I have written a bunch of Bash scripts to work with a PostgreSQL database (yes, I know, it's a bit like wearing a hair shirt; self-mortification), and I found I could do things like:

ack --shell --pager=more psql .

There's no other easy way to do this that I know of.

Thanks very much for pointing this one out.

Comment #4 posted on 2016-08-21T14:53:50Z by ivor


I always love vim tips, so I got pulled in looking at the buffer search. Then I noticed the other tools mentioned. I know about most of them, and I use the ones that are relevant to me very frequently. So now I'm going to subscribe...
