XML and Perl by usr10478

[Week9.zip contains the samples – unzip it and transfer the contents to your Kenbane
directory. You should also place a copy of any .cgi files in public_html/cgi-bin as well
as a copy of grades.xml]
Much of the “traffic” between databases and the Web is in the form of HTML or
increasingly XML documents. The storage and retrieval of XML from databases has
become one of the most important issues for database developers. In this practical you
will investigate the way in which this can be achieved using Perl and Oracle.

Fig 1 shows a simple XML document that we are going to use in our Perl programs.

         <Grades>
           <Student id="123">
             <name>William</name>
             <score>77</score>
           </Student>
           <Student id="234">
             <name>Susan</name>
             <score>67</score>
           </Student>
           <Student id="345">
             <name>Mary</name>
             <score>78</score>
           </Student>
         </Grades>

         Fig 1 The grades.xml file.


Note that XML is very similar to the HTML that is used for web pages. The key
differences are:

•   XML must always be well formed, i.e. every opening tag has a matching closing tag
    (HTML is not as strict)
•   You can make up new tags yourself (HTML has a predefined set of tags you must use)
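
The well-formedness rule can be checked from Perl. The sketch below is only an illustration: it uses XML::Parser (the module introduced later in this practical), and the helper name is_well_formed and the two sample strings are our own inventions. It relies on the parser stopping with an error when a closing tag is missing.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if the string parses as well-formed XML, 0 otherwise.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    # parse() dies on the first well-formedness error, so we trap it with eval.
    my $ok = eval { $parser->parse($xml); 1 };
    return $ok ? 1 : 0;
}

my $good = '<Student id="123"><name>William</name></Student>';
my $bad  = '<Student id="123"><name>William</Student>';   # <name> is never closed

print "good: ", is_well_formed($good), "\n";
print "bad:  ", is_well_formed($bad),  "\n";
```

A browser would usually still display the equivalent broken HTML, which is what "not as strict" means in practice.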

Open the grades.xml file in Internet Explorer (use File|Open, then click the Browse
button to find the file; set the Files of type: pull-down box to All Files).
When open, it displays as shown in Fig 2.




                                Fig 2 XML displayed in IE

Try clicking on the minus (-) signs to the left of <Grades> and <Student id>.
Notice how the sections can be collapsed or opened.

To do this the browser must first read in the file, and understand the nesting created by
the various pairs of tags. We are now going to examine this process in Perl.

There are two main approaches to reading XML documents into a program. Either we
tell the program to read it token by token and control what happens with each bit of data,
or we read the complete XML document in one go. These two approaches are known as
SAX (Simple API for XML) and DOM (Document Object Model). We will look briefly
at SAX first.

The program below (sax1.pl) shows a very simple use of SAX:
 use XML::Parser;
 my $file=shift ;
 my $parse =new XML::Parser();
 $parse->setHandlers(Start =>\&handler_start,
                     End =>\&handler_end,);
 $parse->parsefile($file);
 sub handler_start
 {
   my ($parser,$element,%attr)=@_;
   print "<$element>\n";
 }
 sub handler_end
 {
   my ($parser,$element)=@_;
   print "</$element>\n";
 }
When you run this program you can pass in the XML file you want it to deal with as a
parameter:
perl sax1.pl grades.xml

Here perl is the Perl interpreter, sax1.pl is our Perl program and grades.xml is the XML
file.

Run this program now. You should see the following output:




[Note: we have put in the < > characters round each tag in our print statement for
readability]
If you compare this to the grades.xml document you can see that it is printing out the
opening and closing tags in the document. [Reminder: XML opening tags are like <this>;
closing tags are like </this>]

Here is how the program works:

use XML::Parser;

This tells Perl that we are going to make use of this module.

my $file=shift ;

We pick up the name of the XML file from the command line

my $parse =new XML::Parser();

We create a variable $parse which is a parser object. A parser object "knows" how to
read (parse) XML files in the same way that a garden spade "knows" how to dig. As
most gardeners know the spade will not dig the garden on its own!
$parse->setHandlers(Start =>\&handler_start, End =>\&handler_end,);

We tell the parser that whenever it comes across a start tag in the XML it should go to the
subroutine called handler_start, and when it encounters an end tag it should go to the
subroutine called handler_end.

$parse->parsefile($file);

We instruct our parser to get on with the job, i.e. start reading in the XML file and calling
our handler routines as appropriate.

sub handler_start
{
my ($parser,$element,%attr)=@_;
print "<$element>\n";
}

sub handler_end
{
my ($parser,$element)=@_;
print "</$element>\n";
}
These are the two subroutines that are called every time a start or end tag is read. Note
that we are free to give these any names we choose, but each name must match the one
used in the setHandlers line.

my ($parser,$element,%attr)=@_;
This line declares three new variables to pick up the three parameters that are passed to
our subroutine when it is called by the parser. The one we are interested in at the
moment is the name of the tag held in $element.
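
The third parameter, %attr, is not used in sax1.pl. As a rough sketch of what it holds, the variation below records each tag name together with its attributes. It parses a short in-line string rather than grades.xml so that it runs on its own; the @seen array is our own addition.

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = '<Grades><Student id="123"><name>William</name></Student></Grades>';

my @seen;   # collects one entry per start tag

sub handler_start {
    my ($parser, $element, %attr) = @_;
    # %attr holds attribute name => value pairs; it is empty for tags without attributes.
    my $attrs = join ' ', map { "$_=$attr{$_}" } sort keys %attr;
    push @seen, $attrs eq '' ? $element : "$element ($attrs)";
}

my $parse = XML::Parser->new;
$parse->setHandlers(Start => \&handler_start);
$parse->parse($xml);

print "$_\n" for @seen;
```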

Exercise: There is no sign of the data between tags. Edit program sax1.pl and add a new
handler for this data. Change the line

From this:
$parse->setHandlers(Start =>\&handler_start, End =>\&handler_end,);
To this:
$parse->setHandlers(Start =>\&handler_start, End =>\&handler_end, Char
=>\&handler_char);

Now add the following subroutine at the end of your program:

sub handler_char
{
my ($parser, $data) = @_;
print $data;
}

Run the program again and observe the result. [Note that the formatting is caused by
tabs and linefeed characters from the original file coming through as char data.]

SAX, as you can see from the above, reads the XML file as a stream of tags and allows
detailed control over what happens at each stage of this "reading in". One advantage of
this approach is that the complete XML document does not have to be stored in memory,
so this technique is suitable for very large documents. One disadvantage is that often we
simply want to read the complete XML document and then examine it inside our
program. This is the approach taken by DOM.
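
To illustrate the streaming style, here is a minimal sketch that works out the average score using only handlers, without ever holding the whole document. The in-line data mirrors grades.xml; the variable names and the use of anonymous subroutines are our own choices, not part of sax1.pl.

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = <<'XML';
<Grades>
<Student id="123"><name>William</name><score>77</score></Student>
<Student id="234"><name>Susan</name><score>67</score></Student>
<Student id="345"><name>Mary</name><score>78</score></Student>
</Grades>
XML

my ($in_score, $buffer, $total, $count) = (0, '', 0, 0);

my $parse = XML::Parser->new;
$parse->setHandlers(
    Start => sub {
        my ($p, $element) = @_;
        if ($element eq 'score') { $in_score = 1; $buffer = ''; }
    },
    Char => sub {
        my ($p, $data) = @_;
        # Char may be called more than once for one piece of text, so we append.
        $buffer .= $data if $in_score;
    },
    End => sub {
        my ($p, $element) = @_;
        if ($element eq 'score') { $total += $buffer; $count++; $in_score = 0; }
    },
);
$parse->parse($xml);

my $average = $count ? $total / $count : 0;
print "Students: $count, average score: $average\n";
```

Only the running total and the current piece of text are kept in memory, which is why this style suits very large files.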

use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("grades.xml");

This three-line program reads the XML structure into the variable $doc.

Let us examine this step by step.

use XML::DOM;

This tells the Perl system that we will be making use of the XML::DOM library. This
provides code to do most of the tricky XML bits and pieces.

my $parser = new XML::DOM::Parser;
We make use of the library in this line by creating a new Parser. A Parser is capable of
reading in an XML document.

my $doc = $parser->parsefile ("grades.xml");

We tell our new parser to read in the file grades.xml. This assumes that the file is in the
current directory - otherwise we would have to give an explicit pathway.

So where has the XML gone? It is now stored in the variable $doc.
To prove that it is actually in there we can add one more line to our program using
another of the library routines provided which prints out the XML document.

use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("grades.xml");
print $doc->toString;

Type: perl xml1.pl which should produce the output shown below:




                                Fig 3 Output from xml1.pl

XML documents like this can be used simply for displaying values to a user, although
you would not normally present the data in this format - it is hard to read with all the tag
names interlaced with the data. We shall see later on that a typical way of displaying
XML is to translate it into HTML that can be viewed in a web browser.

To transform the XML we need to have access to the internal components (tags, text,
etc.). The DOM library also provides us with routines to do this (Type: perl xml2.pl).

use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("grades.xml");
my $root=$doc->getFirstChild;
print $root->getTagName ."\n";

We have removed the print statement that prints out the complete document and added
two new lines:

my $root=$doc->getFirstChild;

Remember $doc was where we stored the whole XML document that was read (parsed)
in. We ask $doc for its first child which we store in $root

print $root->getTagName ."\n";

Now we ask this child for its tag name and print it out.
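
One caution: the first child of $doc is the root element only when nothing else, such as a comment, comes before it in the file. A slightly safer sketch uses getDocumentElement, which XML::DOM provides for exactly this purpose. The comment in the sample string is a hypothetical addition to show the point; it is not in grades.xml.

```perl
use strict;
use warnings;
use XML::DOM;

# A comment before the root element (hypothetical addition to the grades data).
my $xml = '<!-- exam results --><Grades><Student id="123"/></Grades>';

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse($xml);

# getDocumentElement returns the root element, whatever precedes it in the file.
my $root      = $doc->getDocumentElement;
my $root_name = $root->getTagName;
print "$root_name\n";

$doc->dispose;   # XML::DOM trees contain circular references; free them explicitly
```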

To understand what is going on here you need to understand the tree-like structure of an
XML document. First things first. If you are expecting a nice Oak or Beech you're going
to be disappointed - computer scientists like their trees upside down with the root at the
top of the page. A closer analogy is a family history tree in that we will be talking of
Child nodes. The next diagram shows how the tree is stored in $doc.
         $doc
           |
           +-- Grades
                 |
                 +-- Student
                       |
                       +-- Attribute id
                       |     +-- Text 123
                       |
                       +-- name
                       |     +-- Text William
                       |
                       +-- score
                             +-- Text 77

    Fig 4 All the elements of the XML document are represented as nodes in the tree.

It would be very complicated if you had to manoeuvre through the tree one node at a time,
so several helpful search routines are provided.
Type: perl xml4.pl

use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile("grades.xml");

foreach my $node ($doc->getElementsByTagName("Student"))
 {
 my $tag=$node->getTagName;
 my $id=$node->getAttribute("id");
 print "Node : $tag, Value of id: $id \n";
 }

The key line is the foreach loop:

foreach my $node ($doc->getElementsByTagName("Student"))

The getElementsByTagName("Student") call searches through the tree below $doc and
returns a list of all the nodes of type Student.

The foreach my $node loop places each Student in $node one at a time and then carries
out the instructions inside the braces {}.

Inside the loop we get the name of the tag and the value of the attribute id and then print
them out.
[N.B. You need to be careful using this search technique since it will find Student tags
anywhere in the tree structure. If you use it to search for a tag that appears at several
depths of the tree, you will get them all returned.]
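
The sketch below shows the effect on a made-up document in which <name> appears at two depths. It also uses the optional second argument that the XML::DOM documentation describes for getElementsByTagName (0 means "immediate children only"); this argument is an XML::DOM extension rather than part of the DOM standard, so treat it as an assumption if you use another library.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical document in which <name> appears at two depths.
my $xml = '<Grades><name>Class A</name>'
        . '<Student id="123"><name>William</name></Student></Grades>';

my $doc  = XML::DOM::Parser->new->parse($xml);
my $root = $doc->getDocumentElement;

# Default behaviour: search the whole subtree below <Grades>.
my @all = $root->getElementsByTagName('name');

# Second argument 0: look only at the immediate children of <Grades>.
my @direct = $root->getElementsByTagName('name', 0);

my ($n_all, $n_direct) = (scalar @all, scalar @direct);
print "anywhere: $n_all, direct children only: $n_direct\n";

$doc->dispose;
```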

The only items we have not yet retrieved from the tree are the text entries stored between
tags, e.g.

<name>William</name>
<score>77</score>

Type: perl xml5.pl to run the program which shows how to access these entries:

use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile("grades.xml");

foreach my $node ($doc->getElementsByTagName("Student"))
 {
 my $nameList = $node->getElementsByTagName("name");
 my $name=$nameList->item(0);
 my $tagName =$name->getTagName;
 my $txt=$name->getFirstChild->getNodeValue;
 print "Node : $tagName, Value of txt: $txt \n";
 }

The key line is:

my $txt=$name->getFirstChild->getNodeValue;

Note that the text stored between tags is viewed as a node in its own right so we first get
the enclosing tag element $name, then we get the child of this element getFirstChild,
and finally we ask this child its value, getNodeValue.
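
Since the same three steps are needed for every piece of text, they can be wrapped up in a small helper. This is only a sketch: child_text is a name we have made up, and it assumes the element holds a single text node, as in grades.xml.

```perl
use strict;
use warnings;
use XML::DOM;

my $xml = '<Grades><Student id="123"><name>William</name>'
        . '<score>77</score></Student></Grades>';
my $doc = XML::DOM::Parser->new->parse($xml);

# Hypothetical helper: text of the first <$tag> element inside $node,
# or undef if the tag is missing or has no text.
sub child_text {
    my ($node, $tag) = @_;
    my ($element) = $node->getElementsByTagName($tag);   # list context: first match
    return undef unless $element;
    my $text = $element->getFirstChild;                  # the text node, if any
    return $text ? $text->getNodeValue : undef;
}

my ($student) = $doc->getElementsByTagName('Student');
my $name  = child_text($student, 'name');
my $score = child_text($student, 'score');
print "$name scored $score\n";

$doc->dispose;
```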

We are now in a position to write our own printXML routine which has the same
functionality as the line print $doc->toString; in our first program.

Type: perl xml6.pl

use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile("grades.xml");

traverse($doc->getFirstChild());
exit;
sub traverse {
    my($node)= @_;
    if ($node->getNodeType == ELEMENT_NODE) {
      print "<", $node->getNodeName, ">";
      foreach my $child ($node->getChildNodes()) {
        traverse($child);
      }
      print "</", $node->getNodeName, ">";
    } elsif ($node->getNodeType() == TEXT_NODE) {
      print $node->getData;
    }
}

This is the first DOM program to make use of a subroutine - a piece of code that your main
program can call and optionally pass parameters to. (We met subroutines briefly as the
handlers in sax1.pl.)

traverse($doc->getFirstChild());

We call the subroutine with this line and pass as a parameter the root node of the tree
(<Grades>)

sub traverse {

This marks the beginning of the subroutine - it must begin with the keyword sub,
followed by the name we choose to call the routine (traverse).

 my($node)= @_;

We declare a variable to hold the parameter passed to the subroutine - the @_ is a special
Perl variable which holds these values.

if ($node->getNodeType == ELEMENT_NODE) {

We now check what sort of node we have reached; there are two we have to deal with:
either it is a standard element node, in which case we may have to deal with the child
nodes of this node, or it is a text node, in which case we can simply print out the text.

print "<", $node->getNodeName, ">";
  foreach my $child ($node->getChildNodes()) {
   traverse($child);

This code deals with element nodes - first we print out the tag name in brackets < >
then we deal with each of the children of this node. For each child we start the whole
subroutine again with a call to traverse($child); This is called recursion in that a
subroutine has called itself.
Perl can keep all the nested calls to traverse separate in terms of variables used.
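
To see that each nested call really does have its own variables, the sketch below passes a second parameter, $depth, and uses it to indent the output. The routine name outline, the @lines array and the in-line XML string are our own; the recursion itself follows the same pattern as traverse.

```perl
use strict;
use warnings;
use XML::DOM;

my $xml = '<Grades><Student id="123"><name>William</name>'
        . '<score>77</score></Student></Grades>';
my $doc = XML::DOM::Parser->new->parse($xml);

my @lines;   # one entry per node visited

sub outline {
    my ($node, $depth) = @_;            # each call gets its own $node and $depth
    my $indent = '  ' x $depth;
    if ($node->getNodeType == ELEMENT_NODE) {
        push @lines, $indent . $node->getNodeName;
        foreach my $child ($node->getChildNodes) {
            outline($child, $depth + 1);    # one level deeper for each child
        }
    } elsif ($node->getNodeType == TEXT_NODE) {
        push @lines, $indent . '"' . $node->getData . '"';
    }
}

outline($doc->getDocumentElement, 0);
print "$_\n" for @lines;

$doc->dispose;
```

When a call to outline returns, the caller carries on with its own $depth unchanged, which is why the indentation steps back out correctly.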
Exercise: Change the last line from

print $node->getData;

to this:

print "*" . $node->getData;

Run the program again.




                   Fig 5 The extra text nodes in the DOM tree exposed.

Note the * before every text string, but also at the end of each line. This shows that the
white spaces and carriage return that we typed into the file grades.xml are also
represented as nodes in our tree (another good reason to search for tags rather than
wander through the tree)
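
If you do need to walk the children of a node directly, one way round the extra text nodes is to keep only the element nodes. A minimal sketch, assuming an indented in-line document in place of grades.xml:

```perl
use strict;
use warnings;
use XML::DOM;

# Indented input, so whitespace text nodes sit between the elements.
my $xml = "<Grades>\n  <Student id=\"123\"/>\n  <Student id=\"234\"/>\n</Grades>";
my $doc  = XML::DOM::Parser->new->parse($xml);
my $root = $doc->getDocumentElement;

my @children = $root->getChildNodes;

# Keep element nodes only; this drops the whitespace text nodes.
my @elements = grep { $_->getNodeType == ELEMENT_NODE } @children;

my ($n_children, $n_elements) = (scalar @children, scalar @elements);
print "child nodes: $n_children, element nodes: $n_elements\n";

$doc->dispose;
```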

We should be fairly confident that we can now read an XML file into a DOM and extract
the data we want from inside the tree. It is time to put this to some use by transforming
the XML into something we can display in a browser (HTML).

To do this we need to run the program as a cgi just as we did for the DBI examples.
#!/usr/bin/perl -w
use CGI::Carp qw(fatalsToBrowser);
print "Content-type: text/html\r\n\r\n";
use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile("grades.xml");
print '<table width="50%" border="1">';
print '<tr><td>ID</td><td>Text</td><td>Score</td></tr>';
foreach my $node ($doc->getElementsByTagName("Student"))
 {
 print "<tr>";
 my $id=$node->getAttribute("id");
 my $nameList = $node->getElementsByTagName("name");
 my $name=$nameList->item(0);
 my $nametxt=$name->getFirstChild->getNodeValue;
 my $scoreList= $node->getElementsByTagName("score");
 my $score=$scoreList->item(0);
 my $scoretxt=$score->getFirstChild->getNodeValue;
 print "<td>$id</td><td>$nametxt</td><td>$scoretxt</td>";
 print "</tr>";   # close each row inside the loop
 }
print "</table>";

We have seen most of this before in the DBI examples. Note the way we have HTML
codes printed out with the data embedded in it. The print statements in Perl just send
their output to the standard output stream. Thus you can run this as a "straight" Perl
program and you will get the output shown below:
Type perl xml5.cgi [Note the .cgi extension indicating that this is really designed for a
web server]




                           Fig 6 HTML output from xml5.cgi.
When this is run from the Web server this is what is sent back to your browser - note that
HTML does not mind that it is not neatly laid out! It is the browser that interprets the
HTML and displays it formatted for viewing. Move the xml5.cgi script to your
public_html/cgi-bin directory then run the perl script from Internet Explorer by typing its
address as follows (remember to put your ~username in place of ~karl):




Fig 7 Browser interprets and displays HTML
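
One thing to watch when embedding data in HTML: a value containing < or & would be read by the browser as markup. The names in grades.xml are safe, but as a precaution the values can be escaped before printing. The sketch below is a hand-rolled illustration only; escape_html and table_row are hypothetical helpers, not part of the sample scripts.

```perl
use strict;
use warnings;

# Minimal escaping for text placed inside HTML (hand-rolled for illustration).
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Builds one complete table row from already-extracted values.
sub table_row {
    my (@cells) = @_;
    my $html = '<tr>';
    foreach my $cell (@cells) {
        $html .= '<td>' . escape_html($cell) . '</td>';
    }
    return $html . "</tr>\n";
}

my $row = table_row('123', 'Smith & Sons <test>', '77');   # hypothetical awkward name
print $row;
```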

Creation of XML documents
One of the tasks that programs are frequently required to do is to take data in one format -
from a database for example - and create an XML document to send to another machine.
The next example shows how to create an XML document inside a Perl program.

#!/usr/bin/perl -w
use CGI::Carp qw(fatalsToBrowser);
print "Content-type: text/xml\r\n\r\n";
use XML::DOM;
use strict;

#Create the data in a Hash
#Normally read from DBI

my %students =(
 123 => {name => 'William',
      score => '77'},
 234 => {name => 'Susan',
      score => '67'},
 345 => {name => 'Mary',
      score => '78'});

#Now create XML doc
my $doc = XML::DOM::Document->new;
my $xml_pi = $doc->createXMLDecl('1.0');
my $root=$doc->createElement('Grades');
foreach my $item (keys(%students)){
 my $student=$doc->createElement('Student');
 $student->setAttribute('id',$item);
 my $name=$doc->createElement('name');
 my $nameText=$doc->createTextNode($students{$item}->{name});
 $name->appendChild($nameText);
 $student->appendChild($name);
 my $score=$doc->createElement('score');
 my $scoreText=$doc->createTextNode($students{$item}->{score});
 $score->appendChild($scoreText);
 $student->appendChild($score);
 $root->appendChild($student);
 }
print $xml_pi->toString;
print $root->toString;

There are a number of critical parts to the above program, which we now examine:

my $doc = XML::DOM::Document->new;

This gives us a handle $doc which we can then use to call a series of functions to create
the parts of the DOM.

my $xml_pi = $doc->createXMLDecl('1.0');

This creates the standard XML declaration (<?xml version="1.0"?>) that usually appears
at the top of an XML file.

my $root=$doc->createElement('Grades');

We create our first Element - the top or root element in the tree (<Grades>)

foreach my $item (keys(%students)){

A loop to go through each of the keys in the hash
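
Note that keys returns the hash keys in no particular order, so the students may not come out in the order they were typed in. If the order matters, the keys can be sorted first. A minimal sketch using the same data (the @ids array is our own addition):

```perl
use strict;
use warnings;

my %students = (
    345 => { name => 'Mary',    score => '78' },
    123 => { name => 'William', score => '77' },
    234 => { name => 'Susan',   score => '67' },
);

# keys() gives the ids in no particular order; sort numerically for stable output.
my @ids = sort { $a <=> $b } keys %students;

foreach my $item (@ids) {
    print "$item: $students{$item}->{name} ($students{$item}->{score})\n";
}
```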

$student->setAttribute('id',$item);

The way to add an attribute to an element.

my $nameText=$doc->createTextNode($students{$item}->{name});

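We create a text node and pick the value out of the hash.

$name->appendChild($nameText);

We attach the text node as a child of the name element. The same appendChild call is
then used to attach name and score to the Student element, and each Student to the root.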




        Fig 8 Output of xml7.cgi is treated as an XML document by the browser.

Exercise: Modify the above program so that it reads its information from the Oracle
database table Students. [Hint – look back at the previous practical in which we
used DBI to open and read in the students database. The foreach my $item loop in
the program should be replaced by the loop that reads students from the database.
As in a real family tree you must make the appropriate parent-child connections
using appendChild as shown in the sample program]


XPath Library to simplify searching XML DOM trees.
The XPath module provides a number of routines that make programming with XML
DOM easier. Instead of having to search for items by tag name alone, XPath gives
almost query-language control over which parts of the tree are returned. The
simple example below shows XPath in action:


Type: perl xpath1.pl

use XML::XPath;
use strict;
my $file ='grades2.xml';
my $xp = XML::XPath->new(filename => $file);
foreach my $student ($xp->find('/Grades/Student[@id="234"]')->get_nodelist){
  print XML::XPath::XMLParser::as_string($student) . "\n";
}

The program is short, but achieves a lot.

use XML::XPath;

To make use of XPath expressions we must include the module with the standard use
syntax.

my $xp = XML::XPath->new(filename => $file);

We create a single XPath variable $xp that references the complete document.

$xp->find('/Grades/Student[@id="234"]')->get_nodelist

The /Grades/Student part is a path through the tree, read much like a directory path: it
looks for Student nodes nested directly inside the Grades root node. The [@id="234"]
part says select only nodes whose id attribute is "234". The last part, ->get_nodelist,
returns a list of all the nodes found. In this case there is only one student with
id="234", but in general we could get back several nodes, so we process the result in a
foreach loop. Note the use of the function XML::XPath::XMLParser::as_string, which
turns a node into a string that we can print out.
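
When only a value is wanted rather than a whole node, XML::XPath also offers findvalue and findnodes. The sketch below is an illustration under some assumptions: it parses an in-line string (via the xml => option) in the shape of grades.xml rather than grades2.xml, and it interpolates the result into a string because findvalue may hand back an object rather than plain text.

```perl
use strict;
use warnings;
use XML::XPath;

my $xml = <<'XML';
<Grades>
<Student id="123"><name>William</name><score>77</score></Student>
<Student id="234"><name>Susan</name><score>67</score></Student>
</Grades>
XML

my $xp = XML::XPath->new(xml => $xml);

# findvalue gives the text of the matching node; interpolating forces a plain string.
my $found = $xp->findvalue('/Grades/Student[@id="234"]/name');
my $name  = "$found";

# findnodes returns the matching nodes as a list in list context.
my @students   = $xp->findnodes('/Grades/Student');
my $n_students = scalar @students;

print "Student 234 is $name; $n_students students in total\n";
```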

Here is a more complex example:

use XML::XPath;
use strict;
my $file ='grades2.xml';
my $xp = XML::XPath->new(filename => $file);
foreach my $student ($xp->find(
    '/Grades/Student/marks/module[@code="Com715c2"][number(score)<60]/../..'
  )->get_nodelist){
  print XML::XPath::XMLParser::as_string($student) . "\n";
}

Note that we have a more deeply nested XML file, grades2.xml which has marks for four
modules:

<Grades>
   <Student id="123">
     <Firstname>William</Firstname>
     <Surname>Groves</Surname>
     <marks year="2001">
      <module code="Com715c2">
       <score>77</score>
      </module>
      <module code="Com716c2">
       <score>78</score>
      </module>
      <module code="Com717c2">
       <score>79</score>
      </module>
      <module code="Com718c2">
       <score>80</score>
      </module>
     </marks>
   </Student>
 ...
This XPath expression finds all Com715c2 results that were lower than 60 and then
returns the student records for the student concerned. Note the following:

module[@code="Com715c2"][number(score)<60]

The expressions in square brackets limit or filter out only those results that match their
rules. The @ is used for an attribute, and number() converts strings to numeric values.
You can't write score<60 or score<"60"
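
An alternative to climbing back up with /../.. is to put the whole test in a filter on Student itself. The sketch below is one possible way to write this, not the practical's own solution. The data is a cut-down version in the shape of grades2.xml, and the second student (id 999) is invented so that there is something to find.

```perl
use strict;
use warnings;
use XML::XPath;

# Cut-down data in the shape of grades2.xml; the second student is invented.
my $xml = <<'XML';
<Grades>
  <Student id="123">
    <Firstname>William</Firstname>
    <marks year="2001">
      <module code="Com715c2"><score>77</score></module>
    </marks>
  </Student>
  <Student id="999">
    <Firstname>Example</Firstname>
    <marks year="2001">
      <module code="Com715c2"><score>45</score></module>
    </marks>
  </Student>
</Grades>
XML

my $xp = XML::XPath->new(xml => $xml);

# Filter on Student itself: keep students with a Com715c2 module scoring under 60.
my $path = '/Grades/Student[marks/module[@code="Com715c2"][number(score)<60]]';

my @ids = map { $_->getAttribute('id') } $xp->findnodes($path);
print "Below 60 in Com715c2: @ids\n";
```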

Exercise: Experiment with different expressions in the program above to select
various parts of the XML structure.
