XML And XPath


  DSA Term 2
   Week 14


    DSA/2006/week 14   1
           Lecture overview
• Matters arising
  – Character coding
• Well-formed XML
• Creating simple XML files
     • Placename to BBC code
• Introduction to XPath



                Character Coding
• Character set
   – ISO 8859 - 1 byte per character
       • 0-127 are ASCII
       • 128-255 vary depending on the part of the standard
       • 15 different character maps
           – ISO-8859-1 - Latin-1 - the default for HTML
           – ISO-8859-2 – Central European
       • A document must be in one encoding
           – problem of mixing characters, e.g. an Arabic quotation in a Cyrillic text
   – UTF-8 - Unicode, 1-4 byte variable length, to support a huge range of
     international languages in a single code
       • ASCII is included as characters 0-127
       • Ensures that the internet is truly multi-lingual
       • Key invention by Ken Thompson of self-synchronisation allowing character
         boundaries to be detected
• Character references in HTML
   – Named                              &deg;
   – Decimal                            &#176;
   – Hexadecimal                        &#xB0;
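
The points above can be checked with a short Python sketch (Python is an assumption here; the slides themselves use HTML and PHP). It shows that the three character references name the same degree sign, and how many bytes that character needs in each encoding.

```python
import html

degree = "\u00b0"  # the degree sign: code point 176 decimal, B0 hexadecimal

# The named, decimal and hexadecimal references all decode to the same character.
named = html.unescape("&deg;")
decimal = html.unescape("&#176;")
hexadecimal = html.unescape("&#xB0;")

# ISO-8859-1 stores the character in one byte; UTF-8 needs two.
latin1_bytes = degree.encode("iso-8859-1")
utf8_bytes = degree.encode("utf-8")

# UTF-8 is variable length: one byte for ASCII, up to four for other characters.
samples = ["A", "\u00b0", "\u20ac", "\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```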
           Defining the Encoding
• Encodings in HTML
   – In a meta-tag
       • <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
   – In the XML declaration
       • <?xml version="1.0" encoding="ISO-8859-1"?>
   – In the HTTP content header
       • Content-Type: text/html; charset=ISO-8859-1

• Setting Encoding in PHP
   – header("Content-type: text/html; charset=UTF-8");

• Setting encoding in the Browser
   – Firefox
       • View/Character Encoding
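
As a rough illustration of why the declared encoding matters, the following Python sketch (an assumption; the slides use PHP) parses the same small document serialised in two encodings, then a third copy whose declaration does not match its bytes. The element name temp is made up for the example.

```python
import xml.etree.ElementTree as ET

body = "<temp>20\u00b0</temp>"  # hypothetical element holding a degree sign

# Each copy declares the encoding that was actually used to produce its bytes.
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# This copy claims UTF-8 but is really ISO-8859-1, so the parser rejects it.
mislabelled = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("iso-8859-1")
try:
    ET.fromstring(mislabelled)
    mislabelled_ok = True
except ET.ParseError:
    mislabelled_ok = False
```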




     Design a simple XML file
• Design an XML vocabulary to represent
  pairs of place names and codes

  – Bristol 1263
  – Bath 1123


• First review XML structure


                                 Example
<MapSet>
 <Map id="P2" desc="P Block level 2">
  <room id="2P2">
     <area shape="rect" coords="118,39,138,68"/>
     <type>Staff Room</type>
     <occupant>Tony Solomonides</occupant>
  </room>
  <room id="2P3">
     <area shape="rect" coords="141,40,162,69"/>
     <type>Staff Room</type>
     <occupant>Richard Lawson</occupant>
  </room>
  <room id="2P4">
     <area shape="poly" coords="201,40,234,40,234,118,164,119,163,71,200,71"/>
     <type>Office</type>
     <occupant>Eleanor Gibbons</occupant>
     <occupant>Dee Evans</occupant>
     <occupant>Ali Jack</occupant>
  </room>
   ….
 </Map>
</MapSet>
               Well-formed XML documents (1)
Every XML document must be well-formed and must therefore adhere to
the following rules (among others):
 1. Every start-tag must have a matching end tag.
 2. Elements may nest but must not overlap.
    <name>Anna<em>Coffey</em></name> - √
    <name><em>Anna</name>Coffey</em> - ×
 3. There must be exactly one root element.
 4. Attribute values must be quoted.
 5. An element must not have two attributes with the same name.
 6. Comments and processing instructions may not appear inside tags.
 7. No unescaped < or & signs may occur in the character data of an
    element.
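
A minimal sketch of testing these rules, assuming Python and its standard xml.etree.ElementTree parser (not something the slides prescribe). The helper is_well_formed is a hypothetical name; it simply reports whether the parser accepts the text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "nested": "<name>Anna<em>Coffey</em></name>",
    "overlapping": "<name><em>Anna</name>Coffey</em>",
    "two roots": "<a/><b/>",
    "unquoted attribute": "<room id=2P2/>",
    "duplicate attribute": '<room id="2P2" id="2P3"/>',
    "unescaped ampersand": "<type>Bed & Breakfast</type>",
}

# Only the correctly nested sample should pass.
results = {label: is_well_formed(xml) for label, xml in samples.items()}
```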




                   Well-formed XML documents (2)
Element names are case sensitive - <NAME>, <name>, <Name> and <NaMe>
are four different element types.
No white space in element names - <First Name> not allowed; <First_Name>
OK.
Element names cannot start with the letters "XML" or "xml" – reserved terms.
Element names must start with a letter or an underscore. Element names
cannot start with a number, but numbers may be embedded within an element
name - <2you> not allowed; <me2you> is OK.
Attribute names are constrained by the above rules for element names.
Entity references are used to substitute specific characters. There are five
predefined entities built into XML:
Entity   Char     Notes
&amp;    &        Do not use inside processing instructions.
&lt;     <        Use for a literal < in character data and attribute values.
&gt;     >        Use after ]] in normal text.
&quot;   "        Use inside attribute values quoted with ".
&apos;   '        Use inside attribute values quoted with '.
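
The predefined entities rarely need to be typed by hand. As a sketch, assuming Python's standard xml.sax.saxutils helpers, the following escapes a string for use as element content and as an attribute value, then parses it back to confirm nothing was lost. The element name note and attribute name text are made up for the example.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

raw = 'Fish & Chips <"best" in town>'

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(raw)

# quoteattr() escapes the value and wraps it in suitable quotes.
document = "<note text=" + quoteattr(raw) + ">" + escaped + "</note>"

# Parsing turns the entity references back into the original characters.
note = ET.fromstring(document)
round_trip_text = note.text
round_trip_attr = note.get("text")
```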
                   Errors
• Look at the listing of the XML file and
  identify all the places which prevent this
  XML from being well-formed




<Map id=P2 desc="P Block level 2'>
  <room id="2P2">
      This is a nice big office
      <area rect coords="118,39,138,68">
      <typo>Staff Room</typo>
      <occupant>Tony Solomonides</occupant>
  </Room>
  <room id="2P3">
      <area rect coords="141,40,162,69"></area>
      <typo>Staff Room</typo>
      <occupant>”Richard Lawson”</occupant>
  </Room>
  <room id="2P4">
      <area poly coords="201,40,234,40,234,118,164,119,163,71,200,71"/>
      <typo>Office</typo>
      <occupant>Eleanor Gibbons</occupant>
      <person>Dee Evans</person>
      <occupant>Ali Jack</occupant
  </Room>
  ---
                      Task
• Draw the structure
  – Use ER notation
    • Attributes in the Entity
    • Crow's-foot notation for one-many, optional
    • Identify any restricted sets of values (enumerated
      types)
  – In the lab, QSEE will allow you to define the
    structure and generate the schema definition
    (XML Schema or DTD)

                              XPATH
• Core language for selecting nodes in XML
• Version 1.0 used in XSLT 1.0
   –   client-side in browsers
   –   Xalan engine
   –   W3Schools tutorial is for XPath 1.0
   –   SimpleXML in PHP
• Version 2.0 used in XSLT 2.0
   – Saxon processor
   – XQuery 1.0
• Differences
   –   Core data structure in 2.0 is a sequence of items (1.0 works with node-sets)
   –   Full support for all XML Schema datatypes
   –   Two kinds of equality operators
   –   Larger function library



                XPath Language
• Not a programming language
• Expressions to be evaluated
• Focus on
  – Navigation in a tree structure
     • Multiple directions or 'axes'
         –   Down to children (child axis)
         –   Up to parent (parent axis)
         –   Down to attributes (attribute axis)
         –   Across to siblings (following-sibling and preceding-sibling axes)
  – Operators
  – Functions
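
A small sketch of these directions, assuming Python's xml.etree.ElementTree, which implements only a subset of XPath 1.0. Attributes are read with get() rather than an attribute axis, and siblings are reached by going up to the parent and back down, because that subset has no sibling axes. The document is a cut-down version of the rooms example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<MapSet><Map id="P2" desc="P Block level 2">'
    '<room id="2P2"><occupant>Tony Solomonides</occupant></room>'
    '<room id="2P3"><occupant>Richard Lawson</occupant></room>'
    '<room id="2P4"><occupant>Ali Jack</occupant></room>'
    '</Map></MapSet>'
)

# Child axis: step down from MapSet to Map to room.
rooms = doc.findall("Map/room")

# Attribute access: ElementTree uses get() instead of the @ syntax.
room_ids = [room.get("id") for room in rooms]

# Parent axis: '..' steps up from the selected room to its Map.
parent = doc.find(".//room[@id='2P3']/..")
parent_desc = parent.get("desc")

# Siblings: up to the parent, then down to the other room children.
sibling_ids = [r.get("id") for r in parent.findall("room") if r.get("id") != "2P3"]
```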

                      XPath                       DOM
                      Declarative                 Procedural (navigational)

Root                  /                           document
Context node (self,   .
   like the current
   directory)
Local name                                        n
Child                 *                           n.childNodes
                      [1]                         n.firstChild
                      [last()]                    n.lastChild
Parent                ..                          f.parentNode
Sibling               ../g                        f.nextSibling
                      ../*                        f.previousSibling
Attribute             f/@att                      f.getAttribute(att)
Locate node by ID     //*[@id='f1']               document.getElementById('f1')
Locate nodes by tag   //img                       document.getElementsByTagName('img')
Number of nodes in    count(s)                    s.length
   sequence
Select by predicate   [@size = 10 and . = 'fred']
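
To make the declarative versus procedural contrast concrete, here is a sketch that finds the occupants of room 2P4 both ways. It assumes Python's xml.dom.minidom for the DOM side and xml.etree.ElementTree's XPath subset for the path side; neither library is named in the slides.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

source = (
    '<Map id="P2">'
    '<room id="2P2"><occupant>Tony Solomonides</occupant></room>'
    '<room id="2P4"><occupant>Eleanor Gibbons</occupant>'
    '<occupant>Ali Jack</occupant></room>'
    '</Map>'
)

# DOM: walk the tree step by step and test each node in code.
dom = minidom.parseString(source)
dom_occupants = []
for room in dom.getElementsByTagName("room"):
    if room.getAttribute("id") == "2P4":
        for node in room.childNodes:
            if node.nodeType == node.ELEMENT_NODE and node.tagName == "occupant":
                dom_occupants.append(node.firstChild.data)

# Path expression: describe the nodes wanted and let the engine find them.
tree = ET.fromstring(source)
path_occupants = [o.text for o in tree.findall(".//room[@id='2P4']/occupant")]
```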
                    XPath operators
• Arithmetic operators
   + - * div idiv mod
• Value comparisons
   eq, ne, lt, le, gt, ge
• Sequence (general) comparisons
   = , !=
   = is true if any item on the left equals any item on the right
   != is true if any item on the left differs from any item on the right
       (1,2,3) = (2,3,4) is true
       (1,2,3) != (2,3,4) is also true
       not ((1,2,3) = (2,3,4) ) is false
• Logical operators
   and, or, not()
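
The behaviour of = and != on sequences can be imitated in a few lines of Python. This is an emulation for illustration only, not an XPath engine; general_eq and general_ne are hypothetical helper names.

```python
from itertools import product

def general_eq(left, right):
    """Imitates XPath '=': true if some left item equals some right item."""
    return any(a == b for a, b in product(left, right))

def general_ne(left, right):
    """Imitates XPath '!=': true if some left item differs from some right item."""
    return any(a != b for a, b in product(left, right))

eq_result = general_eq((1, 2, 3), (2, 3, 4))       # 2 and 3 appear on both sides
ne_result = general_ne((1, 2, 3), (2, 3, 4))       # 1 differs from 2, for example
not_eq_result = not general_eq((1, 2, 3), (2, 3, 4))

# '!=' only fails when every pairing is equal.
all_same = general_ne((1, 1), (1,))
```

Because both operators ask whether some pair satisfies the test, a = b and a != b can be true together, which is why not(a = b) is the safer way to say "nothing in common".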


          Large function library
• Aggregate functions
  – count(seq), max(seq), min(seq), avg(seq)
  – count((1,2,3)) = 3
• String functions
  – string-length('abc')
  – tokenize('a,b,c', ',')
  – string-join(('a','b','c'), ', ')
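
For readers more used to a general-purpose language, these functions have close Python analogues. The sketch below is only a comparison of results, assuming Python; it does not call an XPath engine.

```python
import re
from statistics import mean

seq = [1, 2, 3]

seq_count = len(seq)   # like count((1,2,3))
seq_max = max(seq)     # like max((1,2,3))
seq_min = min(seq)     # like min((1,2,3))
seq_avg = mean(seq)    # like avg((1,2,3))

name_length = len("abc")                 # like string-length('abc')
tokens = re.split(",", "a,b,c")          # like tokenize('a,b,c', ','); the separator is a regular expression
joined = ", ".join(["a", "b", "c"])      # like string-join(('a','b','c'), ', ')
```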
     Using the eXist database
• eXist database as an XPath / XQuery engine
  – REST interface
     • ..exist/rest/db/chriswallace/rooms?_query=//Map
  – Java client
  – Sandbox (using Ajax to do dynamic syntax checking)
     • Context is the whole database
• The demo database includes
  – the whole text of Romeo and Juliet
  – the Mondial world database


                         Examples
• All rooms
   – /MapSet/Map/room
   – //room
• Room 2P5
   – //room[@id='2P5']
• The occupants of room 2P4
   – //room[@id='2P4']/occupant
• The roomNo of the room which Colin Fudge occupies
   – //room[occupant = 'Colin Fudge']/@id
• The number of occupants of 2P4
   – count(//room[@id='2P4']/occupant)
• The floor of Ali Jack's room
   – //room[occupant = 'Ali Jack']/../@desc
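
Some of these queries can be tried outside eXist. The sketch below assumes Python's xml.etree.ElementTree, which supports only part of XPath 1.0: paths are relative to the root element, attribute values are read with get(), and count() becomes len(). It uses only the three rooms listed in the earlier example, so rooms elided there (such as Colin Fudge's) are not found.

```python
import xml.etree.ElementTree as ET

MAPSET = """<MapSet>
 <Map id="P2" desc="P Block level 2">
  <room id="2P2">
     <area shape="rect" coords="118,39,138,68"/>
     <type>Staff Room</type>
     <occupant>Tony Solomonides</occupant>
  </room>
  <room id="2P3">
     <area shape="rect" coords="141,40,162,69"/>
     <type>Staff Room</type>
     <occupant>Richard Lawson</occupant>
  </room>
  <room id="2P4">
     <area shape="poly" coords="201,40,234,40,234,118,164,119,163,71,200,71"/>
     <type>Office</type>
     <occupant>Eleanor Gibbons</occupant>
     <occupant>Dee Evans</occupant>
     <occupant>Ali Jack</occupant>
  </room>
 </Map>
</MapSet>"""

root = ET.fromstring(MAPSET)

# //room
all_rooms = [room.get("id") for room in root.findall(".//room")]

# //room[@id='2P4']/occupant and count(...)
occupants_2p4 = [o.text for o in root.findall(".//room[@id='2P4']/occupant")]
count_2p4 = len(occupants_2p4)

# //room[occupant = 'Ali Jack']/@id : the predicate matches if any occupant equals the name
ali_room_id = root.find(".//room[occupant='Ali Jack']").get("id")

# //room[occupant = 'Ali Jack']/../@desc
ali_floor = root.find(".//room[occupant='Ali Jack']/..").get("desc")

# A name that is not in this extract simply selects nothing.
colin_room = root.find(".//room[occupant='Colin Fudge']")
```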

                  Notes
• Note how = tests if a person is amongst
  the occupants
• To 'serialise' an attribute use string()
• See how ../ allows navigation to the parent
  element




          Examples for you
• The room number for Richard Lawson

• The coordinates of room 2P2

• All rooms with poly shape

• Who are Ali Jack's office mates?

               XML design
• The rooms document is a mixture of text elements and
  attributes.
• Could be all attributes – what would change?
• Could be no attributes – what would change?
• For the workshop exercise use elements instead
  of attributes – it's simpler even if more verbose
• Generally, what do the experts recommend?



               Workshop
• Create a simple XML file containing pairs
  of place names and BBC codes
• Change the PHP script to accept a
  place name
• Read the new XML file and look up the
  name to get the code, using the PHP
  SimpleXML interface and its xpath() method
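
One possible shape for the lookup, sketched in Python rather than the PHP the workshop asks for, so it illustrates the idea without being the workshop answer. The vocabulary (places, place, name, code) and the function code_for are assumptions made up for this example; it uses elements rather than attributes, as suggested on the XML design slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: one <place> element per name/code pair.
PLACES_XML = """<places>
  <place><name>Bristol</name><code>1263</code></place>
  <place><name>Bath</name><code>1123</code></place>
</places>"""

places = ET.fromstring(PLACES_XML)

def code_for(place_name):
    """Return the code for a place name, or None if the name is unknown."""
    for place in places.findall("place"):
        if place.findtext("name") == place_name:
            return place.findtext("code")
    return None

bristol_code = code_for("Bristol")
unknown_code = code_for("Atlantis")
```

Comparing the name in code, instead of pasting it into a path expression, avoids trouble when a place name contains a quote character.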


