
									           XML and Web Technologies

                    Lecture 4: DOM and SAX
http://acet.rdg.ac.uk/~mab/Education/Undergraduate/CS2K7/Lectures/


                    Prof Mark Baker

                    ACET, University of Reading
                    Tel: +44 118 378 8615
                    E-mail: Mark.Baker@computer.org
                    Web: http://acet.rdg.ac.uk/~mab

      Spring 2010             mark.baker@computer.org
                      Outline
• Parsing XML.
• Document Object Model (DOM):
   –   What is the DOM,
   –   DOM Processing,
   –   Processing documents with JAXP,
   –   DOM Summary,
• Simple API for XML (SAX):
   – Features,
   – Example,
   – Summary.
• DOM versus SAX.

                 Parsing XML
• XML processing occurs during parsing by an
  application.
• Parsing is the dissection of a block of text into
  discernible words (also known as tokens).
• There are three common ways to parse an XML
  document:
   – Using the Simple API for XML (SAX),
   – Building a Document Object Model (DOM), and
   – Employing a new technique called pull parsing.
• SAX is a style of parsing called event-based parsing
  where each information class in the instance document
  generates a corresponding event in the parser as the
  document is traversed:
   – SAX parsers are useful for parsing very large XML
     documents or in low-memory environments.

                   Parsing XML
• Building a DOM is the most common approach to
  parsing an XML document.
• Pull parsing is a new technique that aims for both low-
  memory consumption and high performance:
   – It is especially well suited for parsing XML-based Web
     services.
• Pull parsing is also an event-based parsing technique;
  however, the events are read by the application
  (pulled) and not automatically triggered as in SAX.

• The majority of systems use the DOM approach to
  parse XML documents.
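
A minimal sketch of pull parsing, assuming the StAX API (javax.xml.stream) that ships with JDK 6 and later; the slides do not name a particular pull parser, and the class name PullSketch is hypothetical. Note how the application's own loop asks for each event:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: StAX is an assumption, not the lecture's chosen parser.
class PullSketch {
    // Returns the names of all start tags, in document order.
    static List<String> startTags(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // The application drives the loop and pulls the next event when ready.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(startTags(
                "<addressbook><person><name>John Doe</name></person></addressbook>"));
    }
}
```

In SAX the parser would call the application instead; here nothing happens until reader.next() is invoked.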


             Parsing XML - An Inside Look
[Figure: processing stages inside an XML parser.]
   XML Document (with DTD declarations)
     -> Core Document Parser
     -> Well-formed-ness Syntax Checker
     -> Entity Resolution
     -> Attribute Defaulting
     -> Structure Validation
     -> Tree Builder
     -> Application
   Alongside: a DTD Resolver, with configurable retrieval of external
   resources, reads the XML DTD (<?xml?>, <!ELEMENT..., <!ATTLIST...).
Parsing XML - The View from the
          Application
[Figure: XML Document (<?xml?>) -> Monolithic Parser -> Application.]
• The monolithic parser:
   – Loads document,
   – Parses declarations,
   – Builds DTD,
   – Interprets document against DTD,
   – May validate,
   – May build DOM tree,
   – May provide XSL or XLink services.




XML Processing - Parser to Application
          Communication
• Two dominant standards:
  – Document Object Model (DOM):
      • Tree-based model passes complete picture of document to
        application at processing conclusion,
      • Java, JavaScript, IDL descriptions; Perl implementation in
        independent development,
      • Developed by the W3C in successive levels (Levels 1, 2 and 3).
  – Simple API for XML (SAX):
      • Event-based model “reads” document to application
        handlers,
      • Supported by nearly all Java XML parsers,
      • Developed by the XML-Dev mailing list; SAX2 is the current version.

                What is the DOM?
• The Document Object Model (DOM) is a language-
  neutral data model and Application Programming
  Interface (API) for programmatic access and
  manipulation of XML and HTML.
• Unlike XML instances and schemas, which reside in
  files on disk, the DOM is an in-memory representation
  of a document.
• The need for it arose from differences in the way
  Internet Explorer (IE) and Netscape Navigator
  allowed access to, and manipulation of, HTML
  documents to support Dynamic HTML (DHTML):
   – IE and Navigator represented the parts of a document with
     different names, which made cross-browser scripting
     extremely difficult.
               What is the DOM?
• Out of the desire for cross-browser scripting came
  the need for a standard representation for document
  objects in the browser's memory.
• The model for this memory representation is object-
  oriented programming (OOP).
• So, by turning around the title, we get the definition
  of a DOM:
   – A data model, using objects, to represent an XML or HTML
     document.




                 What is the DOM?
• Object-oriented programming introduces two key data
  modelling concepts: classes and objects.
• A class is a definition or template describing the
  characteristics and behaviour of a real-world entity or
  concept:
   – From this description, an in-memory instance of the class can
     be constructed, which is called an object:
        • So, an object is a specific instance of a class.
   – The key benefit of this approach to modelling program data is
     that your programming language more closely resembles the
     problem domain you are solving,
   – Real-world entities have characteristics and behaviour:
        • Thus, programmers create classes that model real world entities
          like “Auto”, “Employee” and “Product”,
   – Along with a class name, a class has characteristics, known as
     data members, and behaviour, known as methods.
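
As an illustration only, here is a sketch of the "Employee" class mentioned above. The data members (name, salary) and the method raiseSalary are invented for this example:

```java
// Hypothetical class: the fields and methods are illustrative assumptions.
class Employee {
    // Data members: the characteristics of the entity.
    private final String name;
    private double salary;

    Employee(String name, double salary) {
        this.name = name;
        this.salary = salary;
    }

    // Methods: the behaviour of the entity.
    void raiseSalary(double percent) {
        salary = salary + salary * percent / 100.0;
    }

    String getName() { return name; }

    double getSalary() { return salary; }

    public static void main(String[] args) {
        // Two objects, i.e. two specific instances of the same class.
        Employee alice = new Employee("Alice", 1000.0);
        Employee bob = new Employee("Bob", 2000.0);
        alice.raiseSalary(10.0);
        System.out.println(alice.getName() + " earns " + alice.getSalary());
        System.out.println(bob.getName() + " earns " + bob.getSalary());
    }
}
```

Each object carries its own copy of the data members, so raising one employee's salary leaves the other untouched.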


                 What is the DOM?




Classes and Objects


               What is the DOM?
• You can think of a DOM as a set of classes that allow
  you to create a tree of objects in memory that
  represents a version of an XML or HTML document
  that can be manipulated.
• There are two ways to access this tree of objects: a
  generic way and a specific way.
• The generic way shows all parts of the document as
  objects of the same class, called Node.
• The generic DOM representation is often called a
  “flattened view” because it does not use class
  inheritance:
   – Class inheritance is where a child class inherits
     characteristics and behaviour from a parent class just like in
     biological inheritance.
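
The two views can be sketched with the Java DOM binding (org.w3c.dom). The class and method names below are hypothetical. The same object is read first as a plain Node and then through the more specific Element interface:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class contrasting the generic and the specific view.
class GenericVsSpecific {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Generic view: everything is a Node, so we test the node type ourselves.
    static String genericName(Node node) {
        return node.getNodeType() == Node.ELEMENT_NODE ? node.getNodeName() : null;
    }

    // Specific view: the same object seen through the Element interface.
    static String specificId(Node node) {
        Element element = (Element) node;
        return element.getAttribute("id");
    }

    public static void main(String[] args) {
        Node root = parse("<entry id=\"Baker2005\"/>").getDocumentElement();
        System.out.println(genericName(root)); // entry
        System.out.println(specificId(root));  // Baker2005
    }
}
```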

              What is the DOM?




                            A DOM as a tree of nodes.




              What is the DOM?




                         A DOM as a tree of sub-classes.




                   DOM Standards
• The DOM has steadily evolved by increasing the detail
  and scope of the representation, and by adding new
  manipulation methods.
• There are currently three DOM levels:
   – DOM Level 1:
        • This set of classes represents XML 1.0 and HTML 4.0
          documents.
   – DOM Level 2:
        • This extends Level 1 to add support for namespaces; cascading
          style sheets, level 2 (CSS2); alternate views; user interface
          events; and enhanced tree manipulation via interfaces for
          traversal and ranges.
   – DOM Level 3:
        • This extends Level 2 by adding support for mixed vocabularies
          (different namespaces), XPath expressions, load and save
          methods, and a representation of abstract schemas (includes
          both DTD and XML Schema).


               Document Object Model
• Generic Form:

     XML Document -> Parser -> DOM -> Application Programme




           Document Object Model
• Programming interface:
   – Builds a representation of an XML document in
     memory,
   – DOM is a hierarchical data structure containing all
     the information in the XML document,
   – Provides access to the object representation using
     standard properties and methods,
   – Allows programmers access to the XML data
     without having to process the actual XML
     statements.




                      XML Document
  • Information to be represented in the DOM
    structure.

<?xml version="1.0" encoding="UTF-8"?>
 <entry id="Baker2005">
  <author>Mark Baker and Amy W. Apon and Clayton Ferner and Jeff Brown</author>
  <title>Emerging Grid Standards</title>
  <journal>IEEE Computer</journal>
  <year>2005</year>
  <volume>38</volume>
  <pages>43-50</pages>
  <number>4</number>
 </entry>
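
A small sketch of reading this document through the DOM with JAXP. The XML string is an abridged copy of the entry above, and the class name ReadEntry and the helper field are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; XML is an abridged copy of the slide's document.
class ReadEntry {
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<entry id=\"Baker2005\">"
            + "<title>Emerging Grid Standards</title>"
            + "<journal>IEEE Computer</journal>"
            + "<year>2005</year>"
            + "</entry>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Text of the first descendant element with the given tag name, or null.
    static String field(Document doc, String tag) {
        NodeList matches = doc.getElementsByTagName(tag);
        return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println(doc.getDocumentElement().getAttribute("id")); // Baker2005
        System.out.println(field(doc, "title"));  // Emerging Grid Standards
        System.out.println(field(doc, "year"));   // 2005
    }
}
```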




Information To Be Represented
[Figure: the parts of citation.xml as a tree.]
   Document                  <citation.xml>
     Root Element            <entry>
       Attributes            id
       Child elements        <author> <title> <journal> <year> <volume> <etc…>




              DOM Inspector




                        DOM Example

(XML document on the left; nodes built by the XML parser on the right.)

<?xml version="1.0"?>
<addressbook>                        Node   addressbook
 <person>                            Node   person
  <name>John Doe</name>              Node   name="John Doe"
  <email>jdoe@yahoo.com</email>      Node   email="jdoe@yahoo.com"
 </person>
 <person>                            Node   person
  <name>Jane Doe</name>              Node   name="Jane Doe"
  <email>jdoe@mail.com</email>       Node   email="jdoe@mail.com"
 </person>
</addressbook>




                 DOM Representation
XML fragment:
   <Parent>
    <Child id="123">Text here</Child>
   </Parent>

Resulting tree:
   Document Node                       Document Root
     NodeList
       Element Node                    <Parent>
         NodeList
           Element Node                <Child>
             NamedNodeMap
               Attribute Node          id="123"
             NodeList
               Text / CDATA Node       Text here


                      Kinds of nodes
• Node - primary datatype for the entire DOM,
  represents a single node in the document tree:
      – All objects implementing the Node interface expose methods
        for dealing with children:
           • However, not every kind of node may actually have
             children.
• Document - represents the XML document, the root
  of the document tree:
      – It provides the primary access to the document's data and
        the methods to create new nodes.
•    Element - represents an element in an XML document, may have
     associated attributes or text nodes.
•    Attr - represents an attribute of an Element object.
•    Text - represents the textual content of an Element or Attr.
•    Others (Comment, Entity).

                 Common DOM Methods
• Node.getNodeType()- the type of the underlying
  object, e.g. Node.ELEMENT_NODE.
• Node.getNodeName() - name of this node,
  depending on its type, e.g. for elements it is the tag
  name, for text nodes always the string #text.
• Node.getFirstChild() and
  Node.getLastChild()- the first or last child of a
  given node.
• Node.getNextSibling() and
  Node.getPreviousSibling()- the next or
  previous sibling of a given node.
• Node.getAttributes()- collection containing the
  attributes of this node (if it is an element node) or
  null.
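
A short sketch of these navigation methods in use. The class name ChildWalk is hypothetical. It walks the children of a node with getFirstChild() and getNextSibling(), and uses getNodeType() to skip whitespace text nodes:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class for sibling-based traversal.
class ChildWalk {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Names of the element children of parent, in document order.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            // Whitespace between tags shows up as text nodes; keep elements only.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<person>\n <name>John Doe</name>\n <email>jdoe@yahoo.com</email>\n</person>");
        System.out.println(childElementNames(doc.getDocumentElement())); // [name, email]
    }
}
```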

           Common DOM methods (2)
• Node.getNodeValue()- value of this node, depending
  on its type, e.g. value of an attribute but null in case
  of an element node.
• Node.getChildNodes()- collection that contains all
  children of this node.
• Node.getParentNode()- parent of this node.
• Element.getAttribute(name)- an attribute value by
  name.
• Element.getTagName()- name of the element.
• Element.getElementsByTagName(name)- collection of all
  descendant Elements with a given tag name.



          Common DOM methods (3)
• Element.setAttribute(name,value)- adds a new
  attribute; if an attribute with that name is already
  present in the element, its value is changed.
• Attr.getValue()- the value of the attribute.
• Attr.getName()- the name of this attribute.
• Document.getDocumentElement()- allows direct
  access to the child node that is the root element of
  the document.
• Document.createElement(tagName)- creates an
  element of the type specified.
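
A sketch of building a small document from scratch with these methods. The class name BuildDoc is hypothetical. It also relies on Node.appendChild() and DocumentBuilder.newDocument(), which are not listed above but belong to the same APIs:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class that generates a DOM tree programmatically.
class BuildDoc {
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("addressbook");
            doc.appendChild(root);                 // becomes the document element
            Element person = doc.createElement("person");
            person.setAttribute("id", "p1");       // adds the attribute
            person.setAttribute("id", "p2");       // same name: value is changed
            root.appendChild(person);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        Element person = (Element) doc.getDocumentElement().getFirstChild();
        System.out.println(doc.getDocumentElement().getTagName()); // addressbook
        System.out.println(person.getAttribute("id"));             // p2
    }
}
```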




                 Text nodes
• Text inside an element
  (or attribute) is
  considered a child of
  that element (attribute),
  not its value!

  <H1>Hello!</H1>

  – an element named H1 with
    value null and one text
    node child, whose value is
    Hello! and whose name is #text
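
This can be checked directly with the Java DOM binding; a minimal sketch, with a hypothetical class name:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: the element has no value, its text child does.
class TextNodeDemo {
    static Element h1() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<H1>Hello!</H1>")))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element h1 = h1();
        Node text = h1.getFirstChild();
        System.out.println(h1.getNodeName());    // H1
        System.out.println(h1.getNodeValue());   // null
        System.out.println(text.getNodeName());  // #text
        System.out.println(text.getNodeValue()); // Hello!
    }
}
```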

                Overview of JAXP
• JAXP = Java API for XML Processing:
   – Provides a common interface for creating and using the
     standard SAX, DOM, and XSLT APIs in Java,
   – All JAXP packages are included standard in JDK 1.4+.
     The key packages are:


javax.xml.parsers         The main JAXP APIs, which provide a common
                          interface for various SAX and DOM parsers.

org.w3c.dom               Defines the Document interface (a DOM), as well as
                          interfaces for all of the components of a DOM.

org.xml.sax               Defines the basic SAX APIs.


javax.xml.transform       Defines the XSLT APIs that let you transform XML
                          into other forms. (Not covered today.)



               JAXP XML Parsers
• javax.xml.parsers defines abstract classes
  DocumentBuilder (for DOM) and SAXParser (for
  SAX):
   – It also defines factory classes DocumentBuilderFactory
     and SAXParserFactory.
   – By default, these give you the “reference implementation” of
     DocumentBuilder and SAXParser, but they are intended to be
     vendor-neutral factory classes, so that you could swap in a
     different implementation if you preferred.
• The JDK includes XML parser implementations from
  Apache:
   – Xerces: supports XML Schema; based on code donated to
     Apache by IBM.
   – Xerces 2: the standard implementation since J2SE 5.0.
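
A sketch of the factory pattern described above. The class name FactoryInfo is hypothetical. The printed class names depend on which implementation is installed, so no particular output is assumed:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical demo class: obtains a parser through the vendor-neutral factory.
class FactoryInfo {
    static DocumentBuilder newBuilder(boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Options are set on the factory before the builder is created.
            factory.setNamespaceAware(namespaceAware);
            return factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        // Application code names only the abstract JAXP classes; these lines
        // reveal which concrete implementation was actually chosen.
        System.out.println(DocumentBuilderFactory.newInstance().getClass().getName());
        System.out.println(newBuilder(true).getClass().getName());
    }
}
```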

      Parsing XML with JAXP DOM
• Import necessary classes:

  import       java.io.File;
  import       javax.xml.parsers.DocumentBuilder;
  import       javax.xml.parsers.DocumentBuilderFactory;
  import       org.xml.sax.SAXException;
  import       org.xml.sax.SAXParseException;
  import       org.w3c.dom.*;
  import       org.w3c.dom.DOMException;




Creating new DOM parser instance
• DocumentBuilderFactory - creates a parser factory; the
  implementation is chosen via the system property
  javax.xml.parsers.DocumentBuilderFactory, or a default if it is unset:
    – DocumentBuilderFactory factory =
      DocumentBuilderFactory.newInstance();


• DocumentBuilder - defines the API to obtain DOM Document
  instances from an XML document (several parse methods):
    – DocumentBuilder builder = factory.newDocumentBuilder();


• Parse a file into a DOM Document:
    – Document document = builder.parse(new File("test.xml"));




    Recursive DOM tree processing
private static void scanDOMTree(Node node){
   int type = node.getNodeType();
   switch (type){
        case Node.ELEMENT_NODE:{
                // Open the start tag, print the attributes, then close it.
                System.out.print("<"+node.getNodeName());
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++){
                        Node attr = attrs.item(i);
                        System.out.print(" "+attr.getNodeName()+
                        "=\"" + attr.getNodeValue() + "\"");
                }
                System.out.print(">");
                // Recurse into the children (elements, text, ...).
                NodeList children = node.getChildNodes();
                if (children != null)   {
                        int len = children.getLength();
                        for (int i = 0; i < len; i++)
                                scanDOMTree(children.item(i));
                }
                System.out.println("</"+node.getNodeName()+">");
                break;
        }


Recursive DOM tree processing (2)
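        case Node.DOCUMENT_NODE:{
             // Print a declaration, then descend into the root element.
             System.out.println("<?xml version=\"1.0\" ?>");
             scanDOMTree(((Document)node).getDocumentElement());
             break;
        }
        case Node.ENTITY_REFERENCE_NODE:{
             System.out.print("&"+node.getNodeName()+";");
             break;
        }
        case Node.TEXT_NODE:{
             // Text is trimmed, so whitespace-only nodes print nothing.
             System.out.print(node.getNodeValue().trim());
             break;
        }
        //...
   } // end of switch
} // end of scanDOMTree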




               Simple DOM Example
// Needs the JAXP/DOM imports shown earlier, plus java.io.File.
public class EchoDOM{
  static Document document;
  public static void main(String[] argv){
    DocumentBuilderFactory factory =
             DocumentBuilderFactory.newInstance();
    try {
      DocumentBuilder builder =
          factory.newDocumentBuilder();
      document = builder.parse(new File("test.xml"));
    } catch (Exception e){
      System.err.println("Sorry, an error: " + e);
    }
    if(document!=null){
      scanDOMTree(document); //call DOM scanner!
    }
  }
  // scanDOMTree(Node) from the previous slides goes here.
}


                   Node Objects
   • Node Type                     • Property (nodeName)
        –     Document                  –   #document
        –     Element                   –   Tag name
        –     Attribute                 –   Attribute name
        –     Text                      –   #text
        –     CDATASection              –   #cdata-section
        –     Comment                   –   #comment
        –     Entity                    –   Entity name
        –     Notation                  –   Notation name
        –     EntityReference           –   Name of entity referenced
        –     ProcessingInstruction     –   Target
        –     DocumentType              –   Document type name
        –     DocumentFragment          –   #document-fragment


              JavaScript Objects




               DOM Summary
• In summary, the Document Object Model is an in-
  memory representation of an XML or HTML document
  and methods to manipulate it.
• DOMs can be loaded from XML documents, saved to
  XML documents, or dynamically generated by a
  program.
• The DOM has provided a standard set of classes and
  APIs for browsers and programming languages to
  represent XML and HTML.
• The DOM is represented as a set of interfaces with
  specific language bindings to those interfaces.
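
A sketch of the "dynamically generated" and "saved" cases. It assumes one common JAXP approach, an identity transform from javax.xml.transform, which is not necessarily the approach the lecture has in mind. The class name SaveDom is hypothetical:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: generate a DOM in memory, then serialise it.
class SaveDom {
    static Document generate() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element entry = doc.createElement("entry");
            entry.setAttribute("id", "Baker2005");
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode("Emerging Grid Standards"));
            entry.appendChild(title);
            doc.appendChild(entry);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build document", e);
        }
    }

    static String toXml(Document doc) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialise document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(generate()));
    }
}
```

Passing a StreamResult built on a File instead of a StringWriter would write the same markup to disk.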




        Simple API for XML (SAX)
• Event driven processing of XML documents.
• Parser sends events to programmer's code (start and
  end of every component).
• Programmer decides what to do with every event.
• SAX parser does not build any in-memory model of the
  document; it simply delivers events.




                  SAX features
•    SAX API acts like a data stream.
•    Stateless.
•    Events are not permanent.
•    Data not stored in memory.
•    Impossible to move backward in XML data.
•    Impossible to modify document structure.
•    Usually the fastest and least memory-intensive
     way of working with XML.




               Basic SAX events
• startDocument – receives notification of the
  beginning of a document.
• endDocument – receives notification of the end of a
  document.
• startElement – gives the name of the tag and any
  attributes it might have.
• endElement – receives notification of the end of an
  element.
• characters – parser will call this method to report
  each chunk of character data.




               Additional SAX events
• ignorableWhitespace – lets the application react to
  (or ignore) whitespace in element content.
• warning – reports conditions that are not errors or
  fatal errors as defined by the XML 1.0
  recommendation, e.g. if an element is defined twice in
  a DTD.
• error – a non-fatal error, e.g. when an XML
  document fails a validity constraint.
• fatalError – a non-recoverable error e.g. the
  violation of a well-formed-ness constraint; the
  document is unusable after the parser has invoked
  this method.
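
A sketch of receiving these error events with the JAXP SAX parser. The class name ErrorEvents is hypothetical. The handler only records which callback fired; the behaviour after a fatal error (the parser stops by throwing) is what the JDK's bundled parser normally does:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records which error callbacks the parser invokes.
class ErrorEvents {
    static List<String> problems(String xml) {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void warning(SAXParseException e) { seen.add("warning"); }
            @Override public void error(SAXParseException e) { seen.add("error"); }
            @Override public void fatalError(SAXParseException e) { seen.add("fatalError"); }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // After a fatal error the parser normally stops by throwing; the
            // event was already recorded above, so it is ignored here.
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(problems("<a/>"));       // well-formed: nothing reported
        System.out.println(problems("<a><b></a>")); // mismatched tags: a fatal error
    }
}
```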




      SAX events in a simple example
<?xml version="1.0"?>     startDocument()
<xmlExample>              startElement(): xmlExample
   <heading>              startElement(): heading
     This is a simple     characters(): This is a simple example.
      example.
   </heading>             endElement(): heading
   That is all folks.     characters(): That is all folks.
</xmlExample>             endElement(): xmlExample
                          endDocument()
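
A sketch of a handler that logs the events listed above. The class name EchoSax is hypothetical. Whitespace-only text is dropped for readability, and a real parser may deliver one run of text in several characters() calls, so the exact number of characters() lines is not guaranteed:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: builds a textual trace of the SAX events received.
class EchoSax extends DefaultHandler {
    private final StringBuilder log = new StringBuilder();

    @Override public void startDocument() { log.append("startDocument()\n"); }

    @Override public void endDocument() { log.append("endDocument()\n"); }

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) {
        log.append("startElement(): ").append(qName).append('\n');
    }

    @Override public void endElement(String uri, String localName, String qName) {
        log.append("endElement(): ").append(qName).append('\n');
    }

    @Override public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            log.append("characters(): ").append(text).append('\n');
        }
    }

    static String trace(String xml) {
        EchoSax handler = new EchoSax();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.log.toString();
    }

    public static void main(String[] args) {
        System.out.print(trace("<xmlExample><heading>This is a simple example."
                + "</heading>That is all folks.</xmlExample>"));
    }
}
```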




         SAX2 handler interfaces
• ContentHandler - receives notification of the
  logical content of a document (startDocument,
  startElement, characters etc.).
• ErrorHandler - XML processing errors are reported
  as events (warning, error, fatalError); whether
  to throw an exception in response is up to the
  programmer.
• DTDHandler - receives notification of basic
  DTD-related events, reports notation and
  unparsed entity declarations.
• EntityResolver – handles the external entities.


               DefaultHandler class
• Class org.xml.sax.helpers.DefaultHandler:
   – Implements all four handler interfaces with empty
     (do-nothing) methods,
   – Programmer can derive their own class from
     DefaultHandler and pass an instance of it to a parser,
   – Programmer can override only methods responsible
     for some events and ignore the rest.
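
A minimal sketch of this pattern. The class name ElementCounter is hypothetical. Only startElement is overridden; every other event falls through to the inherited do-nothing methods:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: counts elements by overriding a single callback.
class ElementCounter extends DefaultHandler {
    private int count;

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) {
        count++;
    }

    static int count(String xml) {
        ElementCounter handler = new ElementCounter();
        try {
            // The handler instance is passed to the parser together with the input.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.count;
    }

    public static void main(String[] args) {
        System.out.println(count(
                "<addressbook><person><name>John Doe</name></person><person/></addressbook>")); // 4
    }
}
```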




                        How Does SAX work?
                XML Document                                  SAX Events

<?xml version="1.0"?>                     Parser            startDocument
<addressbook>                             Parser            startElement
 <person>                                 Parser            startElement
  <name>John Doe</name>                   Parser            startElement, characters, endElement
  <email>jdoe@yahoo.com</email>           Parser            startElement, characters, endElement
 </person>                                Parser            endElement
 <person>                                 Parser            startElement
  <name>Jane Doe</name>                   Parser            startElement, characters, endElement
  <email>jdoe@mail.com</email>            Parser            startElement, characters, endElement
 </person>                                Parser            endElement
</addressbook>                            Parser            endElement, endDocument




                SAX vs. DOM
• DOM:
   – Gives more information about
     the structure of the
     document,
   – Lets you create or modify
     documents.
• SAX:
   – Suitable when you need to use
     the information in the
     document only once,
   – Uses less memory.
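
To make the trade-off concrete, here is a sketch that collects the text of every <name> element from the address book both ways. The class and method names are hypothetical. The DOM version queries a finished tree; the SAX version must keep its own state while the events stream past:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class comparing DOM and SAX on the same task.
class NamesBothWays {
    static List<String> namesWithDom(String xml) {
        List<String> names = new ArrayList<>();
        try {
            NodeList found = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("name");
            for (int i = 0; i < found.getLength(); i++) {
                names.add(found.item(i).getTextContent());
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    static List<String> namesWithSax(String xml) {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside <name>

            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                if ("name".equals(qName)) current = new StringBuilder();
            }
            @Override public void characters(char[] ch, int start, int length) {
                if (current != null) current.append(ch, start, length);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                if ("name".equals(qName)) {
                    names.add(current.toString());
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<addressbook><person><name>John Doe</name></person>"
                + "<person><name>Jane Doe</name></person></addressbook>";
        System.out.println(namesWithDom(xml)); // [John Doe, Jane Doe]
        System.out.println(namesWithSax(xml)); // [John Doe, Jane Doe]
    }
}
```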





								