USING HTML::Parser - a quick guide

The orthodox answer to questions about manipulating HTML, as given in perlfaq9 and oft recounted in comp.lang.perl.misc, is to use the module HTML::Parser by Gisle Aas. There are of course some differences of opinion about the wider usefulness of this module, but on the whole, for the general user, it is the best solution available for the fairly limited kinds of HTML manipulation that people want to do on a regular basis.

One of the major problems the relative neophyte will have with it is its purely Object Oriented interface. It would be difficult to discuss the use of the module without using terms from the Object Oriented argot, however, so the confused reader might want to refer to some basic work on the subject or perhaps the perltoot manpage. The basic thing is that the programmer has to create a new class that inherits from HTML::Parser to obtain the parsing functionality from that module, but also needs to provide code for (over-ride) the methods that HTML::Parser calls when it recognizes the elements of an HTML file.

Perhaps a first brief example :

    1   #!/usr/bin/perl -w
    2   
    3   package Example;
    4   
    5   use strict;
    6   
    7   require HTML::Parser;
    8   
    9   @Example::ISA = qw(HTML::Parser);
   10   
   11   my $parser = Example->new;
   12   
   13   $parser->parse_file('index.html');
   14   
   15   print $parser->{TEXT};
   16   
   17   sub text
   18    {
   19      my ($self,$text) = @_;
   20   
   21      $self->{TEXT} .= $text;
   22    }
  

This is probably the simplest useful program that can be made with HTML::Parser - it simply removes the HTML markup from a file and prints the text that is left - yet it shows most of the necessary elements.

I will say this only once about lines 1 & 5, although they will appear in nearly every example: the -w flag and use strict; are essential in nearly all development, and they become more important still when working with modules such as HTML::Parser if you are to catch the bugs you are likely to make. But enough of that.

Line 3 is probably the cornerstone of the whole enterprise of working with the module - it is necessary to create a new package in order to be able to use (inherit) the methods of HTML::Parser and provide the over-rides that will do anything useful. (It is in fact possible to do this in package main, but I don't think that it's useful to go into that right now.) In later examples I will show the creation and use of a separate module that inherits from HTML::Parser, but for the simpler applications it is probably easier to put the whole program in its own package.

Lines 7 & 9 should be taken together (and probably in reverse). When perl is looking for a method (a package subroutine called through an object or a class name) it will, of course, look first in the package the object belongs to and then in the packages listed in that package's array variable @ISA. It also needs to have those methods defined somewhere (otherwise it will fail with an error), and this is why it is necessary to require HTML::Parser. So saying :

      @Example::ISA = qw(HTML::Parser);
    
is telling Perl to look for methods in the package HTML::Parser if they are not found in package Example.
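The lookup itself can be seen without HTML::Parser at all. What follows is only a minimal sketch: the packages Greeter and LoudGreeter and their methods are invented for the illustration and have nothing to do with the module.

```perl
#!/usr/bin/perl -w

use strict;

# Greeter and LoudGreeter are hypothetical packages made up for this sketch.
package Greeter;

sub new   { my $class = shift; return bless {}, $class; }
sub hello { return "hello"; }
sub name  { return "Greeter"; }

package LoudGreeter;

# Look in Greeter for any method that is not defined in LoudGreeter.
@LoudGreeter::ISA = qw(Greeter);

# Over-ride just one of the inherited methods.
sub hello { return "HELLO"; }

package main;

my $obj = LoudGreeter->new;    # new() is not in LoudGreeter, so it is found via @ISA

print ref($obj), "\n";         # the package the object belongs to
print $obj->hello, "\n";       # the over-ride in LoudGreeter is found first
print $obj->name, "\n";        # not in LoudGreeter, so Greeter's version is used
```

The object belongs to LoudGreeter even though new() was inherited, so the over-ridden hello is the one that gets called, while name falls through to Greeter.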

But why do this? Well, HTML::Parser will call specific methods whenever it discovers particular elements within the HTML document - these are described in detail in the HTML::Parser manpage and will be described here as they are used. The new object created in line 11 belongs to package Example, so these callbacks are looked for in that package first, and this lets us replace the default behaviours with our own. Here we are over-riding the text method, which is called with each piece of plain text found in the document.
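The same approach extends to the other callbacks. The following is a sketch rather than a definitive recipe: it assumes the subclassing interface used in the example above, in which the start method receives the tag name and a reference to a hash of attributes, and the end method receives the tag name. The package name LinkLister and the hash keys CURRENT and LINKS are my own inventions for the illustration.

```perl
#!/usr/bin/perl -w

package LinkLister;    # hypothetical package name

use strict;

require HTML::Parser;

@LinkLister::ISA = qw(HTML::Parser);

# Called for each start tag; remember the link we have just entered.
sub start
 {
   my ($self, $tag, $attr) = @_;

   return unless $tag eq 'a' and defined $attr->{href};

   $self->{CURRENT} = { href => $attr->{href}, text => '' };
 }

# Called for each piece of text; keep it only while inside a link.
sub text
 {
   my ($self, $text) = @_;

   $self->{CURRENT}{text} .= $text if $self->{CURRENT};
 }

# Called for each end tag; file the finished link away.
sub end
 {
   my ($self, $tag) = @_;

   return unless $tag eq 'a' and $self->{CURRENT};

   push @{ $self->{LINKS} }, delete $self->{CURRENT};
 }

my $parser = LinkLister->new;

# parse() takes a string rather than a file; eof() marks the end of the input.
$parser->parse('<p>See <a href="one.html">one</a> and <A HREF="two.html">two</A>.</p>');
$parser->eof;

print "$_->{href}: $_->{text}\n" for @{ $parser->{LINKS} || [] };
```

Each over-ride does one small job and keeps its state in the object, just as the text method in the first example keeps its text in $self->{TEXT}.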

Some other example code :