

                    XML and Web Technologies

                    Lecture 4: DOM and SAX

                    Prof Mark Baker

                    ACET, University of Reading
                    Tel: +44 118 378 8615
                    E-mail: Mark.Baker@computer.org
                    Web: http://acet.rdg.ac.uk/~mab

      Spring 2010             mark.baker@computer.org
• Parsing XML.
• Document Object Model (DOM):
   –   What is the DOM,
   –   DOM Processing,
   –   Processing documents with JAXP,
   –   DOM Summary,
• Simple API for XML (SAX):
   – Features,
   – Example,
   – Summary.
• DOM versus SAX.

                 Parsing XML
• XML processing occurs during parsing by an
• Parsing is the dissection of a block of text into
  discernible words (also known as tokens).
• There are three common ways to parse an XML document:
   – Using the Simple API for XML (SAX),
   – Building a Document Object Model (DOM), and
   – Employing a new technique called pull parsing.
• SAX is a style of parsing called event-based parsing
  where each information class in the instance document
  generates a corresponding event in the parser as the
  document is traversed:
   – SAX parsers are useful for parsing very large XML
     documents or in low-memory environments.

                   Parsing XML
• Building a DOM is the most common approach to
  parsing an XML document.
• Pull parsing is a new technique that aims for both low
  memory consumption and high performance:
   – It is especially well suited for parsing XML-based Web
     services.
• Pull parsing is also an event-based parsing technique;
  however, the events are read by the application
  (pulled) and not automatically triggered as in SAX.

• The majority of systems use the DOM approach to
  parse XML documents.
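The pull style described above can be sketched in Java with StAX (javax.xml.stream). The slides do not name a pull API, so the choice of StAX is an assumption, and the class and method names (PullDemo, elementNames) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PullDemo {
    // Collects element names in document order by pulling events from the reader.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // The application asks for the next event; nothing is pushed to it.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames(
                "<addressbook><person><name>John Doe</name></person></addressbook>"));
    }
}
```

The loop is the point of contrast with SAX: the application decides when to call next(), so it can stop early or skip ahead without registering callbacks.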

             Parsing XML - An Inside Look
   [Figure: the inside of an XML parser. Components shown: XML Document;
    Core Document Parser; Well-formed-ness Syntax Checker; Entity Resolution;
    Attribute Defaulting; Structure Validation; Tree Builder; DTD Resolver
    with Configurable Retrieval of External Resources; DTD declarations
    (<?xml?>, <!ELEMENT...>, <!ATTLIST...>) from the XML DTD.]

Parsing XML - The View from the


• The XML Document is passed to a Monolithic Parser, which:
   – Loads document,
   – Parses declarations,
   – Builds DTD,
   – Interprets document against DTD,
   – May validate,
   – May build DOM tree,
   – May provide XSL or XLink Services.


XML Processing - Parser to Application
• Two dominant standards:
  – Document Object Model (DOM):
      • Tree-based model passes complete picture of document to
        application at processing conclusion,
      • Java, JavaScript, IDL descriptions; Perl implementation in
        independent development,
      • Developed by W3C, Level 1 complete, Level 2 in progress.
  – Simple API for XML (SAX):
      • Event-based model “reads” document to application
      • Supported by nearly all Java XML parsers,
      • Developed by XML-Dev mailing list (SAX 2 in dev.).

                What is the DOM?
• The Document Object Model (DOM) is a language-
  neutral data model and Application Programming
  Interface (API) for programmatic access and
  manipulation of XML and HTML.
• Unlike XML instances and schemas, which reside in
  files on disk, the DOM is an in-memory representation
  of a document.
• The need for this arose from differences in the way
  Internet Explorer (IE) and Netscape Navigator
  allowed access to and manipulation of HTML
  documents to support Dynamic HTML (DHTML):
   – IE and Navigator represent the parts of a document with
     different names, which made cross-browser scripting
     extremely difficult.
               What is the DOM?
• Out of the desire for cross-browser scripting came
  the need for a standard representation for document
  objects in the browser's memory.
• The model for this memory representation is
  object-oriented programming (OOP).
• So, by turning around the title, we get the definition
  of a DOM:
   – A data model, using objects, to represent an XML or HTML
     document.

                 What is the DOM?
• Object-oriented programming introduces two key data
  modelling concepts: classes and objects.
• A class is a definition or template describing the
  characteristics and behaviour of a real-world entity or
   – From this description, an in memory instance of the class can
     be constructed, which is called an object:
        • So, an object is a specific instance of a class.
   – The key benefit of this approach to modelling program data is
     that your programming language more closely resembles the
     problem domain you are solving,
   – Real-world entities have characteristics and behaviour:
        • Thus, programmers create classes that model real world entities
          like “Auto”, “Employee” and “Product”,
   – Along with a class name, a class has characteristics, known as
     data members, and behaviour, known as methods.
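A minimal sketch of the class and object vocabulary used above, written in Java. The Employee class, its fields and the raise method are hypothetical illustrations, not part of any DOM API.

```java
class Employee {
    // Data members (characteristics)
    private final String name;
    private double salary;

    Employee(String name, double salary) {
        this.name = name;
        this.salary = salary;
    }

    // Methods (behaviour)
    void raise(double percent) {
        salary += salary * percent / 100.0;
    }

    String getName() { return name; }
    double getSalary() { return salary; }

    public static void main(String[] args) {
        // Two objects: separate in-memory instances of the same class.
        Employee alice = new Employee("Alice", 1000.0);
        Employee bob = new Employee("Bob", 1200.0);
        alice.raise(10);
        System.out.println(alice.getName() + " " + alice.getSalary());
        System.out.println(bob.getName() + " " + bob.getSalary());
    }
}
```

Changing one object (alice) leaves the other (bob) untouched, which is what "a specific instance of a class" means in practice.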

                 What is the DOM?

Classes and Objects

               What is the DOM?
• You can think of a DOM as a set of classes that allow
  you to create a tree of objects in memory that
  represents a version of an XML or HTML document
  that can be manipulated.
• There are two ways to access this tree of objects: a
  generic way and a specific way.
• The generic way shows all parts of the document as
  objects of the same class, called Node.
• The generic DOM representation is often called a
  “flattened view” because it does not use class
  inheritance:
   – Class inheritance is where a child class inherits
     characteristics and behaviour from a parent class just like in
     biological inheritance.
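The two ways of accessing the tree can be sketched with the Java binding (org.w3c.dom), where the specific types are sub-interfaces of Node. The class and method names (NodeViews, parse, countNodes) and the sample XML are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeViews {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Generic view: every part of the tree is handled as a plain Node.
    static int countNodes(Node node) {
        int count = 1;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countNodes(c);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<entry id=\"Baker2005\"><title>Emerging Grid Standards</title></entry>");
        // Counts the document, both elements and the text node (attributes are not children).
        System.out.println(countNodes(doc));
        // Specific view: the same root node seen through the Element interface.
        Element entry = doc.getDocumentElement();
        System.out.println(entry.getTagName() + " " + entry.getAttribute("id"));
    }
}
```

The generic walk never needs to know what kind of node it is visiting; the specific view trades that uniformity for convenience methods such as getAttribute.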

              What is the DOM?

                            A DOM as a tree of nodes.

              What is the DOM?

                         A DOM as a tree of sub-classes.

                   DOM Standards
• The DOM has steadily evolved by increasing the detail
  of the representation, the scope of the
  representation, and adding new manipulation methods.
• There are currently three DOM levels:
   – DOM Level 1:
        • This set of classes represents XML 1.0 and HTML 4.0.
   – DOM Level 2:
        • This extends Level 1 to add support for namespaces; cascading
          style sheets, level 2 (CSS2); alternate views; user interface
          events; and enhanced tree manipulation via interfaces for
          traversal and ranges.
   – DOM Level 3:
        • This extends Level 2 by adding support for mixed vocabularies
          (different namespaces), XPath expressions, load and save
          methods, and a representation of abstract schemas (includes
          both DTD and XML Schema).
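As a small illustration of the namespace support added in Level 2, the sketch below uses the namespace-aware methods of the Java binding. The namespace URI, the sample document and the names NamespaceDemo, parse and qualifiedRoot are hypothetical. Note that JAXP factories are not namespace-aware unless asked.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceDemo {
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Returns the root element name in {namespace}local form.
    static String qualifiedRoot(Document doc) {
        Element root = doc.getDocumentElement();
        return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
    }

    public static void main(String[] args) {
        String xml = "<b:book xmlns:b=\"http://example.org/books\">"
                + "<b:title>XML</b:title><title>other vocabulary</title></b:book>";
        Document doc = parse(xml);
        System.out.println(qualifiedRoot(doc));
        // Only the title in the books namespace matches; the unprefixed one does not.
        System.out.println(
                doc.getElementsByTagNameNS("http://example.org/books", "title").getLength());
    }
}
```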

               Document Object Model
• Generic Form:

                    Parser            DOM      Application

           Document Object Model
• Programming interface:
   – Builds a representation of an XML document in
     memory,
   – DOM is a hierarchical data structure containing all
     the information in the XML document,
   – Provides access to the object representation using
     standard properties and methods,
   – Allows programmers access to the XML data
     without having to process the actual XML

                      XML Document
  • Information to be represented in the DOM

<?xml version="1.0" encoding="UTF-8"?>
 <entry id="Baker2005">
  <author>Mark Baker and Amy W. Apon and Clayton Ferner and Jeff Brown</author>
  <title>Emerging Grid Standards</title>
  <journal>IEEE Computer</journal>
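A minimal sketch of loading an entry like the one above and reading it through DOM methods. The XML string in the code is an abridged, closed version of the entry (the author and any further fields are left out), and the names CitationReader, parse and textOf are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class CitationReader {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Text of the first descendant with the given tag name (assumes one exists).
    static String textOf(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<entry id=\"Baker2005\">"
                + "<title>Emerging Grid Standards</title>"
                + "<journal>IEEE Computer</journal>"
                + "</entry>";
        Element entry = parse(xml).getDocumentElement();
        System.out.println(entry.getAttribute("id"));  // the attribute on the root element
        System.out.println(textOf(entry, "title"));
        System.out.println(textOf(entry, "journal"));
    }
}
```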

Information To Be Represented

   [Figure: the information in citation.xml to be represented: the root
    element <entry>, its attribute id, and child elements including
    <journal> and <year>.]

              DOM Inspector

                        DOM Example

<?xml version="1.0"?>

<addressbook>                                       Node   addressbook

 <person>                                                  Node   person
  <name>John Doe</name>
                                                                  Node   name="John Doe"
                                                                  Node   email="jdoe@yahoo.com"

 <person>                                                  Node   person

  <name>Jane Doe</name>
                                                                  Node   name="Jane Doe"
                                                                  Node   email="jdoe@yahoo.com"

                 DOM Representation
                <Parent>
                  <Child id="123">Text here</Child>
                </Parent>

                Document Root
                   Node  <Parent>
                      Node  <Child>
                         Node  id="123"
                         Text/CDATA Node  Text here

                      Kinds of nodes
• Node - primary datatype for the entire DOM,
  represents a single node in the document tree:
      – All objects implementing the Node interface expose methods
        for dealing with children:
           • Not all objects implementing the Node interface may have
             children.
• Document - represents the XML document, the root
  of the document tree:
      – It provides the primary access to the document's data and
        the methods to create them.
•    Element - represents an element in an XML document, may have
     associated attributes or text nodes.
•    Attr - represents an attribute in an Element object.
•    Text - represents the textual content of an Element or Attr.
•    Others (Comment, Entity).

                 Common DOM Methods
• Node.getNodeType()- the type of the underlying
  object, e.g. Node.ELEMENT_NODE.
• Node.getNodeName() - name of this node,
  depending on its type, e.g. for elements it is the tag name,
  for text nodes always the string #text.
• Node.getFirstChild() and
  Node.getLastChild()- the first or last child of a
  given node.
• Node.getNextSibling() and
  Node.getPreviousSibling()- the next or
  previous sibling of a given node.
• Node.getAttributes()- collection containing the
  attributes of this node (if it is an element node) or
  null otherwise.

           Common DOM methods (2)
• Node.getNodeValue()- value of this node, depending
  on its type, e.g. value of an attribute but null in case
  of an element node.
• Node.getChildNodes()- collection that contains all
  children of this node.
• Node.getParentNode()- parent of this node.
• Element.getAttribute(name)- an attribute value by
  name.
• Element.getTagName()- name of the element.
• Element.getElementsByTagName()- collection of all
  descendant Elements with a given tag name.

          Common DOM methods (3)
• Element.setAttribute(name,value)- adds a new
  attribute, if an attribute with that name is already
  present in the element, its value is changed.
• Attr.getValue()- the value of the attribute.
• Attr.getName()- the name of this attribute.
• Document.getDocumentElement()- allows direct
  access to the child node that is the root element of
  the document.
• Document.createElement(tagName)- creates an
  element of the type specified.
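The navigation and modification methods listed on these slides can be combined as in the sketch below. The sample document and the names DomMethodsDemo, parse and firstChildElement are hypothetical; Node.appendChild is a standard DOM method that is not listed above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomMethodsDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Walks the sibling chain and returns the first child that is an element.
    // Whitespace between tags shows up as text nodes, so getFirstChild() alone is not enough.
    static Element firstChildElement(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<addressbook>\n <person><name>John Doe</name></person>\n</addressbook>");
        Element person = firstChildElement(doc.getDocumentElement());
        person.setAttribute("id", "p1");
        person.setAttribute("id", "p2"); // same name: the value is changed, not duplicated
        person.appendChild(doc.createElement("email"));
        System.out.println(person.getTagName() + " id=" + person.getAttribute("id"));
        System.out.println(person.getChildNodes().getLength() + " children");
    }
}
```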

                 Text nodes
• Text inside an element
  (or attribute) is
  considered as a child of
  this element (attribute),
  not a value of it!

  – e.g. <H1>Hello!</H1> is an
    element named H1 with
    value null and one text
    node child, whose value is
    Hello! and name #text
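The point about text nodes can be checked directly with the Java binding. This is a minimal sketch; the class and method names (TextNodeDemo, describe) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextNodeDemo {
    // Returns { element name, element value, child name, child value } for the root element.
    static String[] describe(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            Node child = root.getFirstChild();
            return new String[] {
                root.getNodeName(),
                String.valueOf(root.getNodeValue()), // an element has no value of its own
                child.getNodeName(),
                child.getNodeValue()
            };
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String s : describe("<H1>Hello!</H1>")) {
            System.out.println(s);
        }
    }
}
```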

                Overview of JAXP
• JAXP = Java API for XML Processing:
   – Provides a common interface for creating and using the
     standard SAX, DOM, and XSLT APIs in Java,
   – All JAXP packages are included standard in JDK 1.4+.
     The key packages are:

javax.xml.parsers         The main JAXP APIs, which provide a common
                          interface for various SAX and DOM parsers.

org.w3c.dom               Defines the Document interface (a DOM), as well as
                          interfaces for all of the components of a DOM.

org.xml.sax               Defines the basic SAX APIs.

javax.xml.transform       Defines the XSLT APIs that let you transform XML
                          into other forms. (Not covered today.)

               JAXP XML Parsers
• javax.xml.parsers defines abstract classes
  DocumentBuilder (for DOM) and SAXParser (for
  SAX):
   – It also defines factory classes DocumentBuilderFactory
     and SAXParserFactory.
   – By default, these give you the “reference implementation” of
     DocumentBuilder and SAXParser, but they are intended to be
     vendor-neutral factory classes, so that you could swap in a
     different implementation if you preferred.
• The JDK includes XML parser implementations from
   – Xerces: More features - supports XML Schema. Based on
     code donated to Apache by IBM.
   – Xerces 2: The future - standard implementation for J2SE

      Parsing XML with JAXP DOM
• Import necessary classes:

  import java.io.File;
  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.xml.sax.SAXException;
  import org.xml.sax.SAXParseException;
  import org.w3c.dom.*;
  import org.w3c.dom.DOMException;

Creating new DOM parser instance
• DocumentBuilderFactory - creates an instance of the parser
  determined by the system property
  javax.xml.parsers.DocumentBuilderFactory:
    – DocumentBuilderFactory factory =
          DocumentBuilderFactory.newInstance();

• DocumentBuilder - defines the API to obtain DOM Document
  instances from an XML document (several parse methods):
    – DocumentBuilder builder = factory.newDocumentBuilder();

• Parsing a file returns the Document:
    – Document document = builder.parse(new File("test.xml"));

    Recursive DOM tree processing
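private static void scanDOMTree(Node node) {
    int type = node.getNodeType();
    switch (type) {
        case Node.ELEMENT_NODE: {
            // Echo the start tag with its attributes
            System.out.print("<" + node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                System.out.print(" " + attr.getNodeName() +
                        "=\"" + attr.getNodeValue() + "\"");
            }
            System.out.print(">");
            NodeList children = node.getChildNodes();
            if (children != null) {
                int len = children.getLength();
                for (int i = 0; i < len; i++)
                    scanDOMTree(children.item(i)); // recurse into each child
            }
            System.out.print("</" + node.getNodeName() + ">");
            break;
        }
        case Node.DOCUMENT_NODE: {
            System.out.println("<?xml version=\"1.0\" ?>");
            // Continue with the root element of the document
            scanDOMTree(((Document) node).getDocumentElement());
            break;
        }
        case Node.TEXT_NODE: {
            System.out.print(node.getNodeValue());
            break;
        }
    }
}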

               Simple DOM Example
public class EchoDOM {
  static Document document;
  public static void main(String argv[]) {
    DocumentBuilderFactory factory =
        DocumentBuilderFactory.newInstance();
    try {
      DocumentBuilder builder =
          factory.newDocumentBuilder();
      document = builder.parse(new File("test.xml"));
    } catch (Exception e) {
      System.err.println("Sorry, an error: " + e);
    }
    if (document != null) {
      scanDOMTree(document); // call DOM scanner!
    }
  }
}

                   Node Objects
   • Node Type                     • nodeName Property
        –     Document                  –   #document
        –     Element                   –   Tag name
        –     Attribute                 –   Attribute name
        –     Text                      –   #text
        –     CDATASection              –   #cdata-section
        –     Comment                   –   #comment
        –     Entity                    –   Entity name
        –     Notation                  –   Notation name
        –     EntityReference           –   Name of entity
        –     ProcessingInstruction     –   Target
        –     DocumentType              –   Document type name
        –     DocumentFragment          –   #document-fragment

              JavaScript Objects

               DOM Summary
• In summary, the Document Object Model is an in-
  memory representation of an XML or HTML document
  and methods to manipulate it.
• DOMs can be loaded from XML documents, saved to
  XML documents, or dynamically generated by a
  program.
• The DOM has provided a standard set of classes and
  APIs for browsers and programming languages to
  represent XML and HTML.
• The DOM is represented as a set of interfaces with
  specific language bindings to those interfaces.
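To illustrate "dynamically generated" and "saved", the sketch below builds a small DOM in code and serializes it with the JAXP identity transformer (javax.xml.transform). Using the transformer for saving is one common option, not something the slides prescribe, and the names GenerateAndSave and buildAndSerialize are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class GenerateAndSave {
    static String buildAndSerialize(String personName) {
        try {
            // Generate: start from an empty Document and add nodes to it.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element person = doc.createElement("person");
            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode(personName));
            person.appendChild(name);
            doc.appendChild(person);

            // Save: an identity transform copies the DOM to a character stream.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build or serialize the DOM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndSerialize("Jane Doe"));
    }
}
```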

        Simple API for XML (SAX)
• Event driven processing of XML documents.
• Parser sends events to programmer's code (start and
  end of every component).
• Programmer decides what to do with every event.
• SAX parser does not build an in-memory tree of the
  document; it simply delivers events.

                  SAX features
•    SAX API acts like a data stream.
•    Stateless.
•    Events are not permanent.
•    Data not stored in memory.
•    Impossible to move backward in XML data.
•    Impossible to modify document structure.
•    Fastest and least memory intensive way of
     working with XML.

               Basic SAX events
• startDocument – receives notification of the
  beginning of a document.
• endDocument – receives notification of the end of a
  document.
• startElement – gives the name of the tag and any
  attributes it might have.
• endElement – receives notification of the end of an
  element.
• characters – parser will call this method to report
  each chunk of character data.

               Additional SAX events
• ignorableWhitespace – allows the application to react to
  (or ignore) whitespace in element content.
• warning – reports conditions that are not errors or
  fatal errors as defined by the XML 1.0
  recommendation, e.g. if an element is defined twice in
  a DTD.
• error – non-fatal error occurs when an XML
  document fails a validity constraint.
• fatalError – a non-recoverable error e.g. the
  violation of a well-formed-ness constraint; the
  document is unusable after the parser has invoked
  this method.
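A minimal sketch of receiving the error events listed above through JAXP. The names ErrorEvents and report are hypothetical; the handler methods themselves (warning, error, fatalError) are the standard org.xml.sax.ErrorHandler callbacks, which DefaultHandler implements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ErrorEvents {
    // Parses the text and returns the names of the error events that were raised.
    static List<String> report(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void warning(SAXParseException e) { events.add("warning"); }
            @Override public void error(SAXParseException e) { events.add("error"); }
            @Override public void fatalError(SAXParseException e) throws SAXException {
                events.add("fatalError");
                throw e; // the document is unusable from here on
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Parsing stopped; the handler has already recorded the event.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(report("<a><b></a>")); // mismatched tags: not well-formed
        System.out.println(report("<a><b/></a>"));
    }
}
```

Whether to rethrow inside the handler is the programmer's choice for warning and error; for fatalError the parse cannot usefully continue either way.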

      SAX events in a simple example
<?xml version="1.0"?>     startDocument()
<xmlExample>              startElement(): xmlExample
   <heading>              startElement(): heading
     This is a simple
     example              characters(): This is a simple example
   </heading>             endElement(): heading
   That is all folks.
                          characters(): That is all folks.
</xmlExample>             endElement(): xmlExample
                          endDocument()

         SAX2 handler interfaces
• ContentHandler - receives notification of the
  logical content of a document (startDocument,
  startElement, characters etc.).
• ErrorHandler - XML processing errors are reported
  as events (warning, error, fatalError) instead of
  thrown exceptions; whether to throw is up to the
  programmer.
• DTDHandler - receives notification of basic
  DTD-related events, reports notation and
  unparsed entity declarations.
• EntityResolver – handles the external entities.

               DefaultHandler class
• Class org.xml.sax.helpers.DefaultHandler:
   – Implements all four handler interfaces with null
     (mostly do-nothing) methods,
   – Programmer can derive their own class from
     DefaultHandler and pass an instance of it to a parser,
   – Programmer can override only methods responsible
     for some events and ignore the rest.
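Deriving from DefaultHandler might look like the sketch below, which records the basic content events. The class name EventLogger and the log method are hypothetical. Whitespace-only character chunks are skipped here to keep the output short, and a real parser may deliver long text in more than one characters call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EventLogger extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    // Only the events of interest are overridden; the rest keep their defaults.
    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("characters: " + text);
        }
    }

    // Parses the text with a JAXP SAX parser and returns the recorded events.
    static List<String> log(String xml) {
        EventLogger logger = new EventLogger();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), logger);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return logger.events;
    }

    public static void main(String[] args) {
        for (String event : log("<person><name>John Doe</name></person>")) {
            System.out.println(event);
        }
    }
}
```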

                        How Does SAX work?
                XML Document                                  SAX Objects

<?xml version="1.0"?>                     Parser            startDocument

<addressbook>                             Parser            startElement

 <person>                                 Parser            startElement
  <name>John Doe</name>                   Parser            startElement & characters

  <email>jdoe@yahoo.com</email>           Parser            startElement & characters

 </person>                                Parser            endElement

 <person>                                 Parser            startElement
  <name>Jane Doe</name>                   Parser            startElement & characters

  <email>jdoe@mail.com</email>            Parser            startElement & characters

 </person>                                Parser            endElement
</addressbook>                            Parser            endElement & endDocument

                SAX vs. DOM
• DOM:
   – More information about
     structure of the
     document,
   – Allows you to create or modify
     documents.
• SAX:
   – You need to use the
     information in the
     document only once,
   – Less memory use.
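The trade-off can be seen by doing the same small job both ways. The sketch below counts person elements once with DOM and once with SAX; the class and method names (CountPersons, countWithDom, countWithSax) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CountPersons {
    // DOM: the whole tree is built first, then queried (and could be modified).
    static int countWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("person").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // SAX: a single pass; only the running count is kept in memory.
    static int countWithSax(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals("person")) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<addressbook>"
                + "<person><name>John Doe</name></person>"
                + "<person><name>Jane Doe</name></person>"
                + "</addressbook>";
        System.out.println(countWithDom(xml) + " " + countWithSax(xml));
    }
}
```

Both give the same count; the DOM version keeps the document around for further queries or changes, while the SAX version has nothing left once the pass is over.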
