My APOD Downloader

Astronomy Picture of the Day

You have probably heard of the Astronomy Picture of the Day (APOD) site. It has existed since 1995, is provided by NASA and Michigan Technological University (MTU), and is created and managed by Robert Nemiroff and Jerry Bonnell. The FAQ on the site says "The APOD archive contains the largest collection of annotated astronomical images on the internet".

The Downloader

Being a KDE user I quite like a moderate amount of bling, and I particularly like to have a picture on my desktop. I like to rotate my wallpaper pictures every so often, so I want to have a collection of images. To this end I download the APOD to my server every day and make the images available through an NFS-mounted volume.

In 2012 I wrote a Perl script to perform the download, using a fairly primitive HTML parsing method. This script has been improved over the intervening years and now uses the Perl module HTML::TreeBuilder which I believe is much better at parsing HTML.

The version of the script I use myself also includes the Perl module Image::Magick which interfaces to the awesome ImageMagick image manipulation software suite. I use this to annotate the downloaded image with the title parsed from the HTML so I know what it is.

The script I am presenting here is called collect_apod_simple and does not use ImageMagick. I chose to omit it because the installation of this suite and the related Perl module can be difficult. Also, I do not feel that the annotation always works as well as it could, and I have not yet found the time to correct this shortcoming.

A version of the more advanced script (called collect_apod) is available in the same place as collect_apod_simple should you wish to give it a try. Both scripts are available on Gitorious under the link https://gitorious.org/hprmisc/hprmisc.

The Code

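If you are acquainted with Perl you'll probably find this script quite simple. All it really does is work out the date (from the argument, or today's date if none is given), build the APOD page URL for that date, download and parse the page, find the link to the full-size image, and download that image to a file.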

The following is a numbered listing with annotations. There are several comments in the script itself, but the annotations are there to try to make the various sections as clear as possible.

     1  #!/usr/bin/env perl
     2  #===============================================================================
     3  #
     4  #         FILE: collect_apod_simple
     5  #
     6  #        USAGE: ./collect_apod_simple [YYMMDD]
     7  #
     8  #  DESCRIPTION: Downloads the current Astronomy Picture of the Day or that
     9  #               relating to the formatted date provided as an argument. In
    10  #               this context "current" can mean two URLs: .../astropix.html or
    11  #               .../apYYMMDD.html. We now *do not* download the
    12  #               .../astropix.html version since it has a different HTML
    13  #               layout.
    14  #
    15  #      OPTIONS: ---
    16  # REQUIREMENTS: ---
    17  #         BUGS: ---
    18  #        NOTES: Based on 'collect_apod' but without the Image::Magick stuff,
    19  #               for simplicity and for release to the HPR community
    20  #       AUTHOR: Dave Morriss (djm), Dave.Morriss@gmail.com
    21  #      VERSION: 0.0.1
    22  #      CREATED: 2015-01-02 19:58:01
    23  #     REVISION: 2015-01-03 23:00:27
    24  #
    25  #===============================================================================
    26  
    27  use 5.010;
    28  use strict;
    29  use warnings;
    30  use utf8;
    31  
    32  use LWP::UserAgent;
    33  use DateTime;
    34  use HTML::TreeBuilder 5 -weak;
    35  

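Lines 32-34 load the modules the script uses: LWP::UserAgent to make the HTTP requests, DateTime to generate the default date, and HTML::TreeBuilder to parse the downloaded HTML.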


    36  #
    37  # Version number (manually incremented)
    38  #
    39  our $VERSION = '0.0.1';
    40  
    41  #
    42  # Set to 0 to be more silent
    43  #
    44  my $DEBUG = 1;
    45  
    46  #
    47  # Script name
    48  #
    49  ( my $PROG = $0 ) =~ s|.*/||mx;
    50  
    51  #-------------------------------------------------------------------------------
    52  # Edit this to your needs
    53  #-------------------------------------------------------------------------------
    54  #
    55  # Where the script will download the picture. Edit this to where you want
    56  #
    57  my $image_base = "$ENV{HOME}/Backgrounds/apod";
    58  
    59  #-------------------------------------------------------------------------------
    60  # Nothing needs editing below here
    61  #-------------------------------------------------------------------------------
    62  
    63  #
    64  # Get the argument or default it
    65  #
    66  my $arg = shift;
    67  unless ( defined($arg) ) {
    68      #
    69      # APOD wants a date in YYMMDD format
    70      #
    71      my $dt = DateTime->now;
    72      $arg = sprintf( "%02i%02i%02i",
    73          substr( $dt->year, -2 ),
    74          $dt->month, $dt->day );
    75  }
    76  
    77  #
    78  # Check the argument is a valid date in YYMMDD format
    79  #
    80  die "Usage: $PROG [YYMMDD]\n" unless ( $arg =~ /^\d{6}$/ );
    81  

Lines 66-80 collect the date from the command line, or if none is given generate the correctly formatted date. If a date in an invalid format is given the script aborts.
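
The test on line 80 only checks that the argument is six digits, so a string such as 150230 would be accepted and simply lead to a failed download later. If a stricter check is wanted, something like the following might do. It is only a sketch: the valid_yymmdd helper is hypothetical and not part of the script, and the century rule (95-99 means 19xx because APOD began in 1995, anything else means 20xx) is my assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not in collect_apod_simple): return 1 only if the
# six-digit string names a real calendar date, otherwise 0.
sub valid_yymmdd {
    my ($arg) = @_;
    return 0 unless defined $arg && $arg =~ /^(\d{2})(\d{2})(\d{2})$/;
    my ( $yy, $mm, $dd ) = ( $1, $2, $3 );

    # Assumption: APOD started in 1995, so 95-99 is 19xx, the rest 20xx
    my $year = $yy >= 95 ? 1900 + $yy : 2000 + $yy;

    return 0 if $mm < 1 || $mm > 12;

    # Days in each month, with February adjusted for leap years
    my @days = ( 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 );
    $days[1] = 29
        if ( $year % 4 == 0 && $year % 100 != 0 ) || $year % 400 == 0;

    return ( $dd >= 1 && $dd <= $days[ $mm - 1 ] ) ? 1 : 0;
}

for my $date (qw(150106 150230 160229)) {
    printf "%s is %s\n", $date, valid_yymmdd($date) ? 'valid' : 'invalid';
}
```

In the script this could replace the regular expression test, for example as: die "Usage: $PROG [YYMMDD]\n" unless valid_yymmdd($arg);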


    82  #
    83  # Make an URL depending on the argument
    84  #
    85  my $apod_base = "http://apod.nasa.gov/apod";
    86  my $apod_URL  = "$apod_base/ap$arg.html";
    87  

Lines 85-86 define the APOD URL for the chosen date. This will look like http://apod.nasa.gov/apod/ap150106.html for 2015-01-06 for example.
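
To show the mapping from a date to a page URL in isolation, here is a small sketch. The apod_url helper is a hypothetical name of mine; the script itself just interpolates the six-digit argument into the string.

```perl
use strict;
use warnings;

my $apod_base = "http://apod.nasa.gov/apod";

# Hypothetical helper: build the page URL from a four-digit year, a month
# and a day, zero-padding each part to two digits
sub apod_url {
    my ( $year, $month, $day ) = @_;
    return sprintf( "%s/ap%02d%02d%02d.html",
        $apod_base, $year % 100, $month, $day );
}

print apod_url( 2015, 1, 6 ), "\n";
```

For 2015-01-06 this prints the URL shown above, http://apod.nasa.gov/apod/ap150106.html.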


    88  #
    89  # General declarations
    90  #
    91  my ( $image_URL, $image_file );
    92  my ( $tree,      $title );
    93  my ( $url,       $element, $attr, $tag );
    94  
    95  #
    96  # Enable Unicode mode
    97  #
    98  binmode STDOUT, ":encoding(UTF-8)";
    99  binmode STDERR, ":encoding(UTF-8)";
   100  
   101  if ($DEBUG) {
   102      print "Base URL:   $apod_base\n";
   103      print "APOD URL:   $apod_URL\n";
   104      print "Image base: $image_base\n";
   105      print "\n";
   106  }
   107  
   108  #
   109  # Get the HTML page, pretending to be some unknown User Agent
   110  #
   111  my $ua = LWP::UserAgent->new;
   112  $ua->agent("MyApp/0.1");
   113  
   114  my $req = HTTP::Request->new( GET => $apod_URL );
   115  
   116  my $res = $ua->request($req);
   117  if ( $res->is_success ) {
   118      print "GET request successful\n" if $DEBUG;
   119  
   120      #
   121      # Parse the HTML we got back
   122      #
   123      $tree = HTML::TreeBuilder->new;
   124      $tree->parse_content( $res->content_ref );
   125  

Lines 111-116 set up the request and download the APOD web page. If the download was successful then the HTML is parsed with HTML::TreeBuilder in lines 123 and 124.


   126      #
   127      # Get and display the title in debug mode
   128      #
   129      if ($DEBUG) {
   130          if ( $title = $tree->look_down( _tag => 'title' ) ) {
   131              $title = $title->as_trimmed_text();
   132              print "Found title: $title\n" if $title;
   133          }
   134      }
   135  
   136      #
   137      # Look for the image. This is expected to be the href attribute of an <a>
   138      # tag. The image we see on the page is merely a link to this (usually)
   139      # larger image.
   140      #
   141      for ( @{ $tree->extract_links('a') } ) {
   142          ( $url, $element, $attr, $tag ) = @$_;
   143          if ($DEBUG) {
   144              print "Found: $url\n" if $url;
   145          }
   146          last unless defined($url);
   147          last if ( $url =~ /\.(jpg|png)$/i );
   148      }
   149  

Lines 141-148 consist of a loop which walks through the links found in the <a> tags of the parsed HTML. The loop ends as soon as a link references a JPEG or PNG image.
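
The title lookup and the link search can be tried without touching the network by parsing a string. The sketch below uses a made-up fragment of HTML standing in for an APOD page (the file names in it are invented), and it keeps the first matching link in its own variable rather than reusing the loop variable as the script does.

```perl
use 5.010;
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Made-up HTML standing in for an APOD page (an assumption for illustration)
my $html = <<'END_HTML';
<html><head><title> APOD: 2015 January 6 - Example </title></head>
<body>
<a href="archivepix.html">Archive</a>
<a href="image/1501/example_big.jpg"><img src="image/1501/example_small.jpg"></a>
<a href="ap150105.html">Previous</a>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# The page title, with surrounding white space removed
my $title_elem = $tree->look_down( _tag => 'title' );
my $title = $title_elem ? $title_elem->as_trimmed_text : '';

# Keep the first <a href="..."> whose target ends in .jpg or .png
my $image_link;
for my $link ( @{ $tree->extract_links('a') } ) {
    my ($url) = @$link;
    next unless defined $url && $url =~ /\.(?:jpg|png)$/i;
    $image_link = $url;
    last;
}

say "Title: $title";
say "Image: ", $image_link // 'none found';
```

Note that it is the href of the <a> tag that is collected, not the src of the <img> inside it, which matches the script's aim of fetching the (usually) larger image.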


   150      #
   151      # Abort if no image (it might be a video or a GIF)
   152      #
   153      die "Image URL not found\n"
   154          unless defined($url)
   155          && $url =~ /\.(jpg|png)$/i;
   156  

Lines 153-155 check that an image URL was actually found. Some days the APOD site might host a YouTube video or some other animated display. The script is not interested in these since they are no use as wallpaper.


   157      $image_URL = "$apod_base/$url";
   158  
   159      #
   160      # Extract the final part of the URL for the file name. We usually get
   161      # a JPEG, sometimes with a shouty extension, which we change.
   162      #
   163      ( $image_file = $image_URL ) =~ s|.*/||mx;
   164      ( $image_file = "$image_base/$image_file" ) =~ s/JPG$/jpg/mx;
   165  
   166      if ($DEBUG) {
   167          print "Image URL:      $image_URL\n";
   168          print "Image file:     $image_file\n";
   169      }
   170  
   171      #
   172      # Abort if the file already exists (the script already ran?)
   173      #
   174      die "File $image_file already exists\n" if ( -f $image_file );
   175  

Lines 157-174 prepare the image URL and make a file name to hold the image.
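
The file name handling can be sketched on its own as follows. The image_path helper is hypothetical, and unlike the script (which only changes a trailing JPG to jpg) this variation lower-cases any JPG, JPEG or PNG extension, which is my own choice rather than what the script does.

```perl
use strict;
use warnings;

# Hypothetical helper: work out where an image URL should be saved
sub image_path {
    my ( $image_base, $image_url ) = @_;

    # Strip everything up to the last slash, as the script does
    ( my $file = $image_url ) =~ s{.*/}{};

    # Lower-case a "shouty" extension (variation: not just JPG)
    $file =~ s/\.(jpe?g|png)$/.\L$1/i;

    return "$image_base/$file";
}

print image_path( '/tmp/apod',
    'http://apod.nasa.gov/apod/image/1501/Example.JPG' ), "\n";
```

The directory /tmp/apod and the image name are placeholders; in the script the directory comes from $image_base.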


   176      #
   177      # Set up the GET request for the image
   178      #
   179      $req = HTTP::Request->new( GET => $image_URL );
   180  
   181      #
   182      # Download the image to the (possibly renamed) image file
   183      #
   184      $res = $ua->request( $req, $image_file );
   185      if ( $res->is_success ) {
   186          print "Downloaded to $image_file\n" if $DEBUG;
   187      }
   188      else {
   189          #
   190          # The image download failed
   191          #
   192          die $res->status_line, " ($image_URL)\n";
   193      }
   194  

Lines 179-193 download the image to a file, and abort with the HTTP status line if the download fails.


   195  }
   196  else {
   197      #
   198      # We failed to get the web page
   199      #
   200      die $res->status_line, " ($apod_URL)\n";
   201  }
   202  
   203  exit;
   204  
   205  # vim: syntax=perl:ts=8:sw=4:et:ai:tw=78:fo=tcrqn21:fdm=marker

I hope you find the script interesting and/or useful.