The orthodox answer to questions about manipulating HTML, as given in perlfaq9 and oft recounted in comp.lang.perl.misc, is to use the module HTML::Parser by Gisle Aas. There are of course some differences of opinion about the wider usefulness of this module, but on the whole it is the best solution available to the general user for the fairly limited range of HTML manipulation that people want to do on a regular basis.
One of the major problems the relative neophyte will have
with it is its purely Object Oriented interface. It is difficult
to discuss the use of the module without using terms from
the Object Oriented argot, however, so the confused reader might want
to refer to some basic work on the subject or perhaps the
perltoot manpage. The basic idea
is that the programmer has to create a new class that
inherits from HTML::Parser to obtain the
parsing functionality from that module, and also has to provide
code (over-rides) for the methods that HTML::Parser
calls when it recognizes the elements of an HTML file.
Perhaps a first brief example :
1 #!/usr/bin/perl -w
2
3 package Example;
4
5 use strict;
6
7 require HTML::Parser;
8
9 @Example::ISA = qw(HTML::Parser);
10
11 my $parser = Example->new;
12
13 $parser->parse_file('index.html');
14
15 print $parser->{TEXT};
16
17 sub text
18 {
19     my ($self,$text) = @_;
20
21     $self->{TEXT} .= $text;
22 }
This is probably the simplest useful program that can be made with
HTML::Parser - it simply removes the HTML from a file -
yet it shows most of the necessary elements.
I will say this only once about lines 1 & 5, although they will appear
in nearly every example: using the -w flag and
use strict; is essential in nearly all development,
and becomes even more important when working with modules such as
HTML::Parser if you are to catch the bugs you are
likely to make. But enough of that.
Line 3 is probably the cornerstone of the whole enterprise of working with
the module: it is necessary to create a new package in order to be
able to use (inherit) the methods of HTML::Parser
and provide the over-rides that will do anything useful. (It is in fact
possible to do this in package main but I don't think
that it is useful to go into that right now.) In later examples I will
show the creation and use of a separate module that inherits from
HTML::Parser, but for the simpler applications it is
probably easier to put the whole program in its own package.
Lines 7 & 9 should be taken together (and probably in reverse).
When Perl is looking for a method (a package subroutine) it will
look first in the package that the object belongs to, which here is
the current package, and then in the packages
listed in that package's array variable @ISA. It also needs
to have those methods defined somewhere (otherwise it will fail with
an error), and this is why it is necessary to
require HTML::Parser. So saying:
@Example::ISA = qw(HTML::Parser);
is telling Perl to look for methods in the package
HTML::Parser if they
are not found in the current package.
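The lookup order can be seen without HTML::Parser at all. What follows is only a minimal sketch: the packages Base and Derived and their methods are hypothetical names invented for the illustration, with Base standing in for a module you would normally require.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical stand-in for a module such as HTML::Parser.
package Base;

sub new   { my $class = shift; return bless {}, $class; }
sub hello { return "hello from Base"; }
sub name  { return "Base"; }

# Hypothetical subclass: it defines only name().
package Derived;

@Derived::ISA = qw(Base);

sub name { return "Derived"; }

package main;

# new() is not defined in Derived, so it is found in Base via @ISA.
my $obj = Derived->new;

print $obj->name,  "\n";   # defined in Derived, so that version is used
print $obj->hello, "\n";   # not in Derived, so Perl falls back to Base
```

This should print "Derived" and then "hello from Base": the method defined in the subclass wins, and anything missing is picked up from the package named in @ISA.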
But why do this? Well, HTML::Parser will call
specific methods whenever it discovers particular elements
within the HTML document. These are described in detail in the
HTML::Parser manpages and will be described here as they are
used. The new object created in line 11 is in package Example,
so these callbacks will be looked for first in that
package, which lets us replace the default behaviours with our own.
Here we are over-riding the text method.
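To make the callback idea concrete, here is a toy sketch of the same pattern. TinyParser is a hypothetical class written for this illustration and is in no way the real HTML::Parser; its parse, tag and text methods are assumptions of the sketch, and its crude split on angle brackets is not a proper HTML parser. It only shows how a parent class can call methods that a subclass is free to replace.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical toy base class, NOT the real HTML::Parser.
package TinyParser;

sub new { my $class = shift; return bless {}, $class; }

# Split the input into tags and text, calling a method for each piece.
sub parse
{
    my ($self, $html) = @_;

    foreach my $piece (split /(<[^>]*>)/, $html)
    {
        next unless length $piece;

        if ($piece =~ /^</) { $self->tag($piece);  }
        else                { $self->text($piece); }
    }
    return $self;
}

# The default callbacks do nothing at all.
sub tag  { }
sub text { }

package Example;

@Example::ISA = qw(TinyParser);

# Over-ride text() to collect the text; tag() keeps its default.
sub text
{
    my ($self, $text) = @_;

    $self->{TEXT} .= $text;
}

package main;

my $parser = Example->new;

$parser->parse('<p>Hello <b>world</b></p>');

print $parser->{TEXT}, "\n";
```

Because the object is in package Example, the call to text inside parse reaches the over-ride rather than the empty default, and the program should print "Hello world".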
Some other example code :