Parsing ISO8601 Durations in Perl

Dave Morriss, Ken Fallon
Last Updated: 2013-07-07

The Problem

Recently, I was asked if I could help to build a Perl regular expression to parse ISO8601 duration expressions. This format uses expressions like:

  P3Y6M4DT12H30M5S

which defines a duration of 3 years, 6 months, 4 days, 12 hours, 30 minutes, 5 seconds.

Just about every component of the format is optional except for the leading 'P', though the 'T' is required if there is a time component. Each numeric part must be followed by its designator letter; any component may be omitted, in which case it is assumed to be zero. The components must be given in the correct order.

All of the numeric parts must be integers except for the seconds which can be a decimal fraction.

So, the following are all valid:

  P23DT23H          23 days, 23 hours
  P4Y               4 years
  P1M               1 month
  PT1M              1 minute
  P1Y2M3DT10H30M    1 year, 2 months, 3 days, 10 hours, 30 minutes
  PT0.5S            0.5 seconds
  P0Y1347M          zero years, 1347 months
  P0YT              zero years
  -P23DT23H         minus 23 days, 23 hours

The script

A Perl script was written as a vehicle for a regular expression (regex) which parses the duration expression.

The regular expression was arrived at in a stepwise manner as described below.

Step 1

First a regex was written to match a full duration expression:

  /P\d+Y\d+M\d+DT\d+H\d+M\d+S/

Here the sequence \d+ means one or more digits. So the whole regex matches a letter 'P', one or more digits, a letter 'Y', one or more digits, etc. The whole regex is enclosed in '/' characters since that is the standard regex delimiter in Perl.
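
As a quick check, the step 1 regex can be tried against a few of the example durations. This is only a minimal sketch and not part of the original script; the variable names are illustrative. It shows that only the fully-populated expression matches at this stage:

```perl
use strict;
use warnings;

# Try the step 1 regex against a full duration and two partial ones
my @durations = ( 'P3Y6M4DT12H30M5S', 'P4Y', 'PT1M' );
my @results;

for my $duration (@durations) {
    # The match yields true only if every field is present
    my $matched = $duration =~ /P\d+Y\d+M\d+DT\d+H\d+M\d+S/ ? 1 : 0;
    push @results, $matched;
    print "$duration: ", ( $matched ? 'match' : 'no match' ), "\n";
}
```

The two shorter durations are valid, but this first regex rejects them because every field is still mandatory.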

Step 2

Perl's regular expressions can be laid out in a more readable way than most. The expression can contain spaces, be split over lines and can be commented. To do this the regex must take a modifier, which normally follows the expression. For example:

  /abc/x

Here the x is a modifier for the regex /abc/. This modifier is the one that allows the more readable layout. If we rewrite the regex from step 1 using this modifier and reformatting it with newlines and comments we get:

      /P          # Begin with a 'P'
      \d+Y        # Some digits and a 'Y' for the years
      \d+M        # Some digits and a 'M' for the months
      \d+D        # Some digits and a 'D' for the days
      T           # A 'T' introduces the time part
      \d+H        # Some digits and a 'H' for the hours
      \d+M        # Some digits and a 'M' for the minutes
      \d+S/x      # Some digits and a 'S' for the seconds
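
To confirm that the layout does not change what is matched, the compact and laid-out forms can be compared. This is a hedged sketch with illustrative variable names, not the original script:

```perl
use strict;
use warnings;

my $duration = 'P3Y6M4DT12H30M5S';

# The compact form from step 1
my $compact = $duration =~ /P\d+Y\d+M\d+DT\d+H\d+M\d+S/ ? 1 : 0;

# The same regex laid out using the x modifier
my $laid_out = $duration =~ /P      # Begin with a 'P'
                             \d+Y   # Years
                             \d+M   # Months
                             \d+D   # Days
                             T      # Start of the time part
                             \d+H   # Hours
                             \d+M   # Minutes
                             \d+S   # Seconds
                            /x ? 1 : 0;

print "compact: $compact, laid out: $laid_out\n";
```

Note that with the x modifier a literal space in the pattern has to be escaped or placed in a character class, and a comment must not contain the regex delimiter.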

Step 3

So far all we have is a conceptual regex, not a script. The regex is to be used to match the duration expression and to extract the individual fields (year, month, etc.). We need to turn the regex into something that can be used in a Perl script and make it capture values.

In Perl, regular expressions can be pre-compiled using the qr operator. It is usually used in this way:

  $re = qr{abc};

This is a Perl statement which compiles the regex and stores it in the variable $re.

Parts of a regex which extract or capture values are enclosed in parentheses.

It is also a good idea to anchor the regex so that it has to match the whole string rather than just part of it. We do this by using a '^' character to denote the start of the string and a '$' character to denote the end.

With what we now know, the regex which we built in step 2 can be rewritten as follows:

      $re = qr{
          ^P          # Begin with a 'P'
          (\d+)Y      # Some digits and a 'Y' for the years
          (\d+)M      # Some digits and a 'M' for the months
          (\d+)D      # Some digits and a 'D' for the days
          T           # A 'T' introduces the time part
          (\d+)H      # Some digits and a 'H' for the hours
          (\d+)M      # Some digits and a 'M' for the minutes
          (\d+)S      # Some digits and a 'S' for the seconds
          $}x;
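
One way to use the compiled regex is to match in list context, where a successful match returns the captured values in order. The following is a minimal sketch under that assumption; the variable names are illustrative and it is not necessarily how the original script does it:

```perl
use strict;
use warnings;

my $re = qr{
    ^P          # Begin with a 'P'
    (\d+)Y      # Capture the years
    (\d+)M      # Capture the months
    (\d+)D      # Capture the days
    T           # A 'T' introduces the time part
    (\d+)H      # Capture the hours
    (\d+)M      # Capture the minutes
    (\d+)S      # Capture the seconds
    $           # End of the string
}x;

my $duration = 'P3Y6M4DT12H30M5S';

# In list context the match returns the six captured fields
my @fields = $duration =~ $re;

if (@fields) {
    my ( $years, $months, $days, $hours, $minutes, $seconds ) = @fields;
    print "$years years, $months months, $days days, ",
        "$hours hours, $minutes minutes, $seconds seconds\n";
}
else {
    print "'$duration' did not match\n";
}
```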

Step 4

Now we have a regex that can parse a fully-populated duration expression and extract the relevant fields. However, as we found out at the beginning, all of these fields are optional, so we need to cater for this.

To mark part of a regular expression as optional, the '?' (question mark) is used; it applies to the preceding component of the expression. So /ab?c/ matches both 'abc' and 'ac'. That is, the 'b' may occur zero times or once.
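
A small illustrative check of this behaviour (the test strings are made up for the purpose):

```perl
use strict;
use warnings;

# Keep only the strings that match; 'abbc' has one 'b' too many
my @matched = grep { /ab?c/ } qw(abc ac abbc);

print "@matched\n";
```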

We need to apply this to several components. For example, the part of the regex which matches the number of years, '(\d+)Y', needs to be made optional. However, if the question mark were placed after the 'Y' then it would only apply to the 'Y'. We need a way of grouping this sub-expression, and this is done with parentheses. There is a problem with this though: parentheses also cause the regex to capture part of the matched string, and we do not want to capture the 'Y'.

Perl solves this by means of an extension. The extended non-capturing parentheses look like this: (?: ), so using these with the question mark we have '(?:(\d+)Y)?' for the year field. We also need to make the whole time part after the 'T' optional.

So if we apply this extended feature to the regex in step 3 we have:

      $re = qr{
          ^P              # Begin with a 'P'
          (?:(\d+)Y)?     # Some digits and a 'Y' for the years
          (?:(\d+)M)?     # Some digits and a 'M' for the months
          (?:(\d+)D)?     # Some digits and a 'D' for the days
          (?:T            # A 'T' introduces the time part
          (?:(\d+)H)?     # Some digits and a 'H' for the hours
          (?:(\d+)M)?     # Some digits and a 'M' for the minutes
          (?:(\d+)S)?     # Some digits and a 'S' for the seconds
          )?              # The time element is optional
          $}x;
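
When a field is omitted, its capture group takes no part in the match and its value comes back undefined. A hedged sketch of how the omitted fields might be defaulted to zero follows; the defaulting step and the variable names are assumptions for illustration, not taken from the original script:

```perl
use strict;
use warnings;

my $re = qr{
    ^P              # Begin with a 'P'
    (?:(\d+)Y)?     # Optional years
    (?:(\d+)M)?     # Optional months
    (?:(\d+)D)?     # Optional days
    (?:T            # A 'T' introduces the time part
    (?:(\d+)H)?     # Optional hours
    (?:(\d+)M)?     # Optional minutes
    (?:(\d+)S)?     # Optional seconds
    )?              # The time element is optional
    $               # End of the string
}x;

my $duration = 'P23DT23H';

# Six values are returned on success; omitted fields are undef
my @fields = $duration =~ $re;

# Treat any omitted field as zero
my ( $years, $months, $days, $hours, $minutes, $seconds ) =
    map { defined $_ ? $_ : 0 } @fields;

print "$years years, $months months, $days days, ",
    "$hours hours, $minutes minutes, $seconds seconds\n";
```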

Step 5

The regex is now complete apart from some final refinements.

Firstly, the full definition of the ISO8601 duration specification permits a sign before the leading 'P'. Secondly, the seconds field may contain a decimal fraction.

So, for the sign, we can add the sub-expression '([+-]?)' meaning an optional plus or minus sign, which is to be captured.

For the decimal fraction we can use the sub-expression '\d+(?:\.\d+)?', meaning one or more digits followed by an optional sequence of a decimal point and one or more digits.

For convenience and legibility the decimal fraction regex can be declared separately and interpolated into the main regex.
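
The decimal fraction sub-expression can be tried on its own before it is used in the main regex. In this minimal sketch the candidate strings are illustrative, and the sub-expression is anchored so that the whole string has to match:

```perl
use strict;
use warnings;

# One or more digits, optionally followed by a point and more digits
my $frac = qr{\d+(?:\.\d+)?};

# Keep only the candidates that are accepted in full
my @accepted = grep { /^$frac$/ } ( '5', '0.5', '12.25', '5.', '.5' );

print "@accepted\n";
```

Values such as '5.' and '.5' are rejected because digits are required on both sides of the decimal point.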

Finally, the 'x' modifier can be moved from the end of the regex to the inside. This is largely a matter of personal preference, but is shown here for information. The embedded form of the modifier is written '(?x)' and is placed at the start of the regex.

Now the regex in step 4 looks as follows:

      $frac = qr{\d+(?:\.\d+)?};
      $re = qr{(?x)
          ^([+-]?)        # Assume the string begins with the optional sign
          P               # Begin with a 'P'
          (?:(\d+)Y)?     # Some digits and a 'Y' for the years
          (?:(\d+)M)?     # Some digits and a 'M' for the months
          (?:(\d+)D)?     # Some digits and a 'D' for the days
          (?:T            # A 'T' introduces the time part
          (?:(\d+)H)?     # Some digits and a 'H' for the hours
          (?:(\d+)M)?     # Some digits and a 'M' for the minutes
          (?:($frac)S)?   # Some digits and a 'S' for the seconds
          )?              # The time element is optional
          $};
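
To show the finished regex in use, here is a minimal sketch which wraps it in a subroutine. The name parse_duration, the hash keys and the defaulting of omitted fields to zero are all assumptions made for illustration; they are not taken from the original script.

```perl
use strict;
use warnings;

my $frac = qr{\d+(?:\.\d+)?};
my $re   = qr{(?x)
    ^([+-]?)        # Optional sign
    P               # Begin with a 'P'
    (?:(\d+)Y)?     # Optional years
    (?:(\d+)M)?     # Optional months
    (?:(\d+)D)?     # Optional days
    (?:T            # A 'T' introduces the time part
    (?:(\d+)H)?     # Optional hours
    (?:(\d+)M)?     # Optional minutes
    (?:($frac)S)?   # Optional seconds, possibly fractional
    )?              # The time element is optional
    $               # End of the string
};

# Hypothetical helper: returns a hash reference, or undef if no match
sub parse_duration {
    my ($duration) = @_;

    my @fields = $duration =~ $re;
    return unless @fields;

    # Omitted fields are undef, so default them to zero
    my %parsed;
    @parsed{qw(sign years months days hours minutes seconds)} =
        map { defined $_ ? $_ : 0 } @fields;

    # The sign group always takes part; empty means positive
    $parsed{sign} = '+' if $parsed{sign} eq '';

    return \%parsed;
}

my $parsed = parse_duration('-P1Y2M3DT10H30M0.5S');
if ($parsed) {
    print join( ', ',
        map {"$_=$parsed->{$_}"}
            qw(sign years months days hours minutes seconds) ), "\n";
}
```

A string which does not follow the format, such as 'P1Y2M3D10H' with its 'T' missing, makes the subroutine return undef.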

Links