Last Updated: 2013-07-07

Recently, I was asked if I could help to build a Perl regular expression to parse ISO8601 duration expressions. This format uses expressions like:

P3Y6M4DT12H30M5S

which defines a duration of 3 years, 6 months, 4 days, 12 hours, 30 minutes, 5 seconds.

Just about every component of the format is optional except for the leading 'P', though the 'T' is required if there is a time component. Each numeric part must be followed by its designator letter; any part can be omitted, in which case it is assumed to be zero. The components must be given in the correct order.

All of the numeric parts must be integers except for the seconds which can be a decimal fraction.

So, the following are all valid:

| Expression | Meaning |
|---|---|
| P23DT23H | 23 days, 23 hours |
| P4Y | 4 years |
| P1M | 1 month |
| PT1M | 1 minute |
| P1Y2M3DT10H30M | 1 year, 2 months, 3 days, 10 hours, 30 minutes |
| PT0.5S | 0.5 seconds |
| P0Y1347M | zero years, 1347 months |
| P0YT | zero years |
| -P23DT23H | minus 23 days, 23 hours |

A Perl script was written as a vehicle for a regular expression (*regex*)
which parses the duration expression.

The regular expression was arrived at in a stepwise manner as described below.

First a regex was written to match a full duration expression:

    /P\d+Y\d+M\d+DT\d+H\d+M\d+S/

Here the sequence *\d+* means *one or more digits*. So the whole regex
matches a letter 'P', one or more digits, a letter 'Y', one or more digits,
etc. The whole regex is enclosed in '/' characters since that is the standard
regex delimiter in Perl.
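As a minimal sketch (the variable names here are illustrative and not taken from the original script), this first regex can be tried with Perl's `=~` binding operator. It matches the fully-populated example, but not a shorter duration that is still valid:

```perl
use strict;
use warnings;

# Illustrative test strings: the full example and a shorter valid duration
my $full  = 'P3Y6M4DT12H30M5S';
my $short = 'P4Y';

# The regex only matches when all six components are present
my $full_ok  = ($full  =~ /P\d+Y\d+M\d+DT\d+H\d+M\d+S/) ? 1 : 0;
my $short_ok = ($short =~ /P\d+Y\d+M\d+DT\d+H\d+M\d+S/) ? 1 : 0;

printf "%-18s %s\n", $full,  $full_ok  ? 'matches' : 'no match';
printf "%-18s %s\n", $short, $short_ok ? 'matches' : 'no match';
```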

Perl's regular expressions can be laid out in a more readable way than most. The expression can contain spaces, be split over several lines and be commented. To do this the regex must take a modifier, which normally follows the expression. For example:

/abc/x

Here the **x** is a modifier for the regex /abc/. This modifier is the one
that allows the more readable layout. If we rewrite the regex from step
1 using this modifier and reformatting it with newlines and comments we get:

    /P        # Begin with a 'P'
     \d+Y     # Some digits and a 'Y' for the years
     \d+M     # Some digits and a 'M' for the months
     \d+D     # Some digits and a 'D' for the days
     T        # A 'T' introduces the time part
     \d+H     # Some digits and a 'H' for the hours
     \d+M     # Some digits and a 'M' for the minutes
     \d+S/x   # Some digits and a 'S' for the seconds
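To see what the **x** modifier changes, here is a small sketch using the trivial pattern /abc/ (the variable names are illustrative). One consequence is that a literal space in the pattern has to be escaped, because unescaped whitespace is ignored:

```perl
use strict;
use warnings;

# With the x modifier, whitespace and comments in the pattern are ignored
my $laid_out = ('abc' =~ /
    a   # first letter
    b   # second letter
    c   # third letter
/x) ? 1 : 0;

# Because spaces are ignored, a literal space must be escaped
my $ignored_spaces = ('a b c' =~ /a b c/x)   ? 1 : 0;
my $escaped_spaces = ('a b c' =~ /a\ b\ c/x) ? 1 : 0;

print "$laid_out $ignored_spaces $escaped_spaces\n";   # prints "1 0 1"
```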

So far all we have is a conceptual regex, not a script. The regex is to be
used to match the duration expression and to extract the individual fields
(year, month, etc). We need to make the regex into something that can be used
in a Perl script and make it *capture* values.

In Perl, regular expressions can be pre-compiled. There is an operator, **qr**,
which does this. It is usually used in this way:

$re = qr{abc};

This is a Perl statement which compiles the regex and stores it in the
variable **$re**.
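A short sketch of how a compiled regex might then be used (the test strings are made up for illustration). The variable can stand on its own to the right of `=~`, or be interpolated into a larger pattern:

```perl
use strict;
use warnings;

# Compile the regex once and keep it in a variable
my $re = qr{abc};

# A compiled regex can be used directly on the right of =~ ...
my $direct = ('xxabcxx' =~ $re) ? 1 : 0;

# ... or interpolated into a larger pattern; the braces in ${re}
# stop Perl from reading the variable name as $rexx
my $embedded = ('xxabcxx' =~ /^xx${re}xx$/) ? 1 : 0;

print "$direct $embedded\n";   # prints "1 1"
```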

Parts of a regex which extract or *capture* values are enclosed in
parentheses.
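As an illustration, here is a sketch that captures only the years from a short duration (the input string and variable names are assumptions for the example). The captured value can be collected either from the list returned by the match or from `$1`:

```perl
use strict;
use warnings;

my $input = 'P4Y';   # illustrative input

# In list context a successful match returns the captured values
my ($years) = $input =~ /P(\d+)Y/;

# The same value is also available in $1 after a successful match
my $from_dollar_one;
if ($input =~ /P(\d+)Y/) {
    $from_dollar_one = $1;
}

print "$years $from_dollar_one\n";   # prints "4 4"
```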

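It will probably be a good idea to *anchor* the regex so that it must match
the whole string rather than part of it. We do this by using a '^' character
to denote the start of the string and a '$' character to denote the end (a
'$' also matches just before a newline at the very end of the string).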
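The effect of anchoring can be sketched as follows, using a cut-down pattern that only knows about years (the test strings are invented for the example). Note that '$' also accepts a single trailing newline, which is convenient when reading lines from a file:

```perl
use strict;
use warnings;

my $loose  = qr{P\d+Y};     # may match anywhere in the string
my $strict = qr{^P\d+Y$};   # must match the whole string

my (%loose_ok, %strict_ok);
for my $text ('P4Y', 'xxP4Yxx', "P4Y\n") {
    $loose_ok{$text}  = ($text =~ $loose)  ? 1 : 0;
    $strict_ok{$text} = ($text =~ $strict) ? 1 : 0;
}

# The unanchored regex accepts the string with junk around it;
# the anchored one rejects it
print "loose:  $loose_ok{'xxP4Yxx'}\n";    # prints "loose:  1"
print "strict: $strict_ok{'xxP4Yxx'}\n";   # prints "strict: 0"
```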

With what we now know, the regex which we built in step 2 can be rewritten as follows:

    $re = qr{
        ^P          # Begin with a 'P'
        (\d+)Y      # Some digits and a 'Y' for the years
        (\d+)M      # Some digits and a 'M' for the months
        (\d+)D      # Some digits and a 'D' for the days
        T           # A 'T' introduces the time part
        (\d+)H      # Some digits and a 'H' for the hours
        (\d+)M      # Some digits and a 'M' for the minutes
        (\d+)S      # Some digits and a 'S' for the seconds
    $}x;
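A sketch of this regex in use, assuming the match is made in list context so that the six captured fields come back as a list (the array names are illustrative):

```perl
use strict;
use warnings;

my $re = qr{
    ^P          # Begin with a 'P'
    (\d+)Y      # years
    (\d+)M      # months
    (\d+)D      # days
    T           # A 'T' introduces the time part
    (\d+)H      # hours
    (\d+)M      # minutes
    (\d+)S      # seconds
    $}x;

# A fully-populated duration yields all six fields
my @fields = 'P3Y6M4DT12H30M5S' =~ $re;
print join(', ', @fields), "\n";    # prints "3, 6, 4, 12, 30, 5"

# A shorter (but valid) duration does not match at all yet
my @partial = 'P4Y' =~ $re;
print scalar(@partial), "\n";       # prints "0"
```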

Now we have a regex that can parse a fully-populated duration expression and extract the relevant fields. However, as we found out at the beginning, all of these fields are optional, so we need to cater for this.

To define part of a regular expression as optional the '?' question mark is used; it applies to the preceding component of the expression. So /ab?c/ matches both 'abc' and 'ac'. That is, the 'b' may occur zero times or once.
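A quick sketch of the '?' quantifier, with the pattern anchored so that the whole string must match (the test strings are illustrative):

```perl
use strict;
use warnings;

# Record which strings the pattern accepts
my %matched;
for my $text (qw(abc ac abbc)) {
    $matched{$text} = ($text =~ /^ab?c$/) ? 1 : 0;
    print "$text: $matched{$text}\n";
}
# 'abc' and 'ac' are accepted; 'abbc' has two 'b's and is rejected
```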

We need to apply this to several components. For example, the part of the regex which matches the number of years, '(\d+)Y', needs to be made optional. However, if the question mark were placed after the 'Y' then it would only apply to the 'Y'. We need a way of grouping this sub-expression, and this is done with parentheses. There is a problem with this though; parentheses also cause the regex to capture part of the matched string, and we do not want to capture the 'Y'.

Perl solves this by means of an extension. The extended non-capturing parentheses look like this: (?: ), so using these with the question mark we have '(?:(\d+)Y)?' for the year field. We also need to make the whole time part after the 'T' optional.
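The difference between the two kinds of parentheses can be sketched with a cut-down pattern for the year field alone (the array names are illustrative). It also shows that an optional group which takes no part in the match leaves its capture undefined:

```perl
use strict;
use warnings;

# Ordinary parentheses for grouping add an unwanted capture ('4Y')
my @with_capture = 'P4Y' =~ /^P((\d+)Y)?$/;

# Non-capturing parentheses group without capturing
my @non_capture = 'P4Y' =~ /^P(?:(\d+)Y)?$/;

# When the optional group is absent the match still succeeds,
# but the capture inside it is undefined
my @absent = 'P' =~ /^P(?:(\d+)Y)?$/;

print scalar(@with_capture), " captures: @with_capture\n";  # 2 captures: 4Y 4
print scalar(@non_capture),  " capture: @non_capture\n";    # 1 capture: 4
print defined $absent[0] ? "defined\n" : "undefined\n";     # undefined
```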

So if we apply this extended feature to the regex in step 3 we have:

    $re = qr{
        ^P                  # Begin with a 'P'
        (?:(\d+)Y)?         # Some digits and a 'Y' for the years
        (?:(\d+)M)?         # Some digits and a 'M' for the months
        (?:(\d+)D)?         # Some digits and a 'D' for the days
        (?:T                # A 'T' introduces the time part
            (?:(\d+)H)?     # Some digits and a 'H' for the hours
            (?:(\d+)M)?     # Some digits and a 'M' for the minutes
            (?:(\d+)S)?     # Some digits and a 'S' for the seconds
        )?                  # The time element is optional
    $}x;
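A sketch of this version applied to some of the shorter examples from the start of these notes. Omitted fields come back undefined, so this sketch replaces them with zero; the `%parsed` hash is just an illustrative way of holding the results. Note how 'P1M' and 'PT1M' put the value in different fields:

```perl
use strict;
use warnings;

my $re = qr{
    ^P                  # Begin with a 'P'
    (?:(\d+)Y)?         # years
    (?:(\d+)M)?         # months
    (?:(\d+)D)?         # days
    (?:T                # A 'T' introduces the time part
        (?:(\d+)H)?     # hours
        (?:(\d+)M)?     # minutes
        (?:(\d+)S)?     # seconds
    )?                  # The time element is optional
    $}x;

my %parsed;
for my $text (qw(P23DT23H P4Y P1M PT1M)) {
    # Fields that were omitted are undefined; treat them as zero
    my @fields = map { defined $_ ? $_ : 0 } $text =~ $re;
    $parsed{$text} = join ',', @fields;
    printf "%-9s %s\n", $text, $parsed{$text};
}
# Fields are printed in the order Y,M,D,H,M,S
```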

The regex is now complete apart from some final refinements.

Firstly, the full definition of the ISO8601 duration specification permits a sign before the leading 'P'. Secondly, the seconds field may contain a decimal fraction.

So, for the sign, we can add the sub-expression '([+-]?)' meaning an optional plus or minus sign, which is to be captured.

For the decimal fraction we can use the sub-expression '\d+(?:\.\d+)?', meaning one or more digits followed by an optional sequence of a decimal point and one or more digits.
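The decimal fraction sub-expression can be checked on its own with a sketch like this (the test values are invented). Anchoring it shows that a bare leading or trailing decimal point is not accepted:

```perl
use strict;
use warnings;

my $frac = qr{\d+(?:\.\d+)?};

# Record which values the anchored sub-expression accepts
my %ok;
for my $value ('5', '0.5', '12.25', '.5', '5.') {
    $ok{$value} = ($value =~ /^$frac$/) ? 1 : 0;
    printf "%-6s %s\n", $value, $ok{$value} ? 'accepted' : 'rejected';
}
```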

For convenience and legibility the decimal fraction regex can be declared separately and interpolated into the main regex.

Finally, the 'x' modifier can be moved from the end of the regex to the inside of it. This is largely a matter of personal preference, but is shown here for information. The inline form of the modifier is written '(?x)'.

Now the regex in step 4 looks as follows:

    $frac = qr{\d+(?:\.\d+)?};
    $re = qr{(?x)
        ^([+-]?)            # Assume the string begins with the optional sign
        P                   # Begin with a 'P'
        (?:(\d+)Y)?         # Some digits and a 'Y' for the years
        (?:(\d+)M)?         # Some digits and a 'M' for the months
        (?:(\d+)D)?         # Some digits and a 'D' for the days
        (?:T                # A 'T' introduces the time part
            (?:(\d+)H)?     # Some digits and a 'H' for the hours
            (?:(\d+)M)?     # Some digits and a 'M' for the minutes
            (?:($frac)S)?   # Some digits and a 'S' for the seconds
        )?                  # The time element is optional
    $};
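To show how the finished regex might be put to work, here is a minimal sketch. It is not the script linked below: `parse_duration`, the hash keys and the choice to report a missing sign as '+' are all assumptions made for this example.

```perl
use strict;
use warnings;

my $frac = qr{\d+(?:\.\d+)?};
my $re   = qr{(?x)
    ^([+-]?)            # optional sign
    P                   # leading P
    (?:(\d+)Y)?         # years
    (?:(\d+)M)?         # months
    (?:(\d+)D)?         # days
    (?:T                # time part
        (?:(\d+)H)?     # hours
        (?:(\d+)M)?     # minutes
        (?:($frac)S)?   # seconds, possibly fractional
    )?
    $};

# parse_duration is a hypothetical helper for illustration only.
# It returns a hash reference, or undef if the text does not match.
sub parse_duration {
    my ($text) = @_;
    my @captures = $text =~ $re
        or return;

    my %duration;
    @duration{qw(sign years months days hours minutes seconds)} =
        map { defined $_ ? $_ : 0 } @captures;
    $duration{sign} = '+' if $duration{sign} eq '';
    return \%duration;
}

for my $text (qw(P1Y2M3DT10H30M PT0.5S -P23DT23H P1Y2M3D10H)) {
    my $d = parse_duration($text);
    if ($d) {
        printf "%-15s %s %sY %sM %sD %sH %sM %sS\n", $text,
            @{$d}{qw(sign years months days hours minutes seconds)};
    }
    else {
        printf "%-15s not matched\n", $text;
    }
}
```

The last test string has no 'T' before its time part, so the regex does not match it.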

- XKCD ISO 8601: http://xkcd.com/1179/
- A definition of the ISO8601 duration format: http://www.w3.org/TR/xmlschema-2/#duration
- Wikipedia ISO8601 entry: https://en.wikipedia.org/wiki/ISO_8601#Durations
- The script developed for parsing: https://gitorious.org/hprmisc/hprmisc/blobs/master/parse_8601_duration
- Some test data for checking the parser: https://gitorious.org/hprmisc/hprmisc/blobs/master/8601_duration_test.dat
- Regular expressions in Perl; some cheat-sheets:
    - http://www.cs.tut.fi/~jkorpela/perl/regexp.html
    - http://ult-tex.net/info/perl/
    - http://regexcheatsheet.com/ (includes PHP and Python)
- Perl manuals:
    - Perl Regular Expressions: http://perldoc.perl.org/perlre.html
    - Perl Regular Expression FAQ: http://perldoc.perl.org/perlfaq6.html
    - Perl Regular Expressions Tutorial: http://perldoc.perl.org/perlretut.html
    - Perl Regular Expressions Reference: http://perldoc.perl.org/perlreref.html
    - Perl Regular Expressions Quick Start: http://perldoc.perl.org/perlrequick.html
    - Perl Regular Expression Backslash Sequences and Escapes: http://perldoc.perl.org/perlrebackslash.html