element Markup by alicejenny




     What is XML?

   XML, the Extensible Markup Language, is a universal syntax for describing
    and structuring data independent from the application logic. XML can
    be used to define unlimited languages for specific industries and
    applications.

   XML is a text-based markup language that is fast becoming the
    standard for data interchange on the Web. As with HTML, you
    identify data using tags (identifiers enclosed in angle brackets, like
    this: <...>). Collectively, the tags are known as "markup".

   Since identifying the data gives you some sense of what it means (how
    to interpret it, what you should do with it), XML is sometimes
    described as a mechanism for specifying the semantics (meaning) of
    the data.

     What is XML?

   In the same way that you define the field names for a data structure,
    you are free to use any XML tags that make sense for a given
    application. Naturally, though, for multiple applications to use the
    same XML data, they have to agree on the tag names they intend to use.

                                               XML is case-sensitive.

     XML History

   XML: eXtensible Markup Language
   What’s “Markup Language”?
       The markup is the codes, embedded with the document text,
        which store the information required for electronic processing,
        like font name, boldness or, in the case of XML, the document
        structure.
       Methodology for encoding data with some information.
       Typically, it defines a set of tags, each of which has some
        associated meaning.

     Who developed XML?

   XML is an activity of the World Wide Web Consortium (W3C)
    http://www.w3c.org. The XML development effort started in 1996.

   A diverse group of markup language experts, from industry to
    academia, developed a simplified version of SGML (Standard
    Generalized Markup Language) for the Web. In February 1998, the XML
    1.0 specification became a W3C Recommendation.

   XML 1.1: W3C Recommendation in February 2004

     Standard Generalized Markup Language

   SGML extends generic coding. Furthermore, it is an international
    standard published by the ISO (International Organization for
    Standardization). It is based on the early work done by
    Dr. Charles Goldfarb from IBM.

   SGML is similar to generic coding but with two additional characteristics:
       The markup describes the document's structure, not the document's
        appearance.
       The markup conforms to a model, which is similar to a database
        schema. This means that it can be processed by software or stored
        in a database.

     Data Problems

   Fundamental “issues”: How do I represent my
    “application” data?
       Performance (speed/time)
       Persistence(short/long lived)
       Mutability
       Composition
       Security (encryption/identity)
   Open Information Management:
       Interpretation
       Presentation
       Interoperation
       Portability
       Interrogation

     Why XML?

   Unfortunately, there are things that HTML just can’t do
    for you. Fortunately, XML is growing quickly to meet
    these needs.
   Unfortunately, no matter how many new tags are added,
    there will never be enough for all the good ideas people
    keep having. Fortunately, XML is a form of SGML, an ISO
    standard that allows you to invent the tags you need, and
    declare them so others can use them.
   Unfortunately, the SGML standard is large, takes time to
    learn, and doesn’t have a “starter-kit”. Fortunately, XML
    is here.

     Why XML?
   Plain Text
        any editor, readable, for configuration information
   Data Identification
        Self-described markup style
   Internationalized
        Unicode-based (UTF-8 / UTF-16), XML as universal data representation.
   Inline Reusability
        can integrate data from multiple sources as a single document,
         modularization without using linking.
   Linkability
        More powerful than HTML’s, W3C Xlink & XPointer specifications
   Easily Processed
        well-formed rules, validity checking, available tools: parsers,
         transformers, browsers.
   Hierarchical
        Faster to access and rearrange each element.

          XML as a Self-Describing
          Data Exchange Format

 can be easily “understood” by our friend
 can be parsed easily
 contains its own structure (=parse tree) in the data
=> allows the application programmer to rediscover schema and
  content/semantics (to which extent???)

 may include an explicit schema description (e.g., DTD)
=> meta-language: definition of a language w.r.t. which it is valid

   allows separation of marked-up content from presentation (=>style sheets)

   many tools (and many more to come -- (re)use code): parsers, validators,
    query languages, storage, …
   standards (good for interoperation, integration, etc):
     => generic standards (XML, DTDs, XML Schema, XPath,...)
     => community/industry standards (=specific markup languages)

     Key Features of XML
   Extensibility
       You define your own markup languages (tags) for your own
        problem domain
   Media and Presentation independence
       The same data can be presented on different communication media
        (browser, voice device) and in different formats (WML, HTML)
   Separation of contents from presentation
       Clear separation between contents (data) and presentation (data
   Structure: Relationship and Hierarchical Structure
       Faster to access, easier to rearrange
   Validation
       XML data is “constrained” by a Rule (or Syntax)
     What are XML applications?

   XML is poised to play a prominent role as a data interchange format
    in B2B Web applications such as e-commerce, supply-chain
    management, workflow, and application integration. Another use of
    XML is for structured information management, including information
    from databases. XML also supports media-independent publishing,
    allowing documents to be written once and published in multiple
    media formats and devices. On the client, XML can be used to create
    customized views into data.

     XML and Java Technology Relationship

   XML and the Java technology are complementary. Java technology
    provides the portable, maintainable code to process portable,
    reusable XML data. In addition, XML and Java technology have a
    number of shared features that make them the ideal pair for Web
    computing, including being industry standards, platform-
    independence, extensible, reusable, Web-centric, and
   It’s a “Match made in Heaven”
       Java enables Portable Code
       XML enables Portable Data
   XML tools and programs are mostly written in the Java programming
    language
   Better API support on the Java platform than in any other language

     Benefits of Using Java Technology with XML

   Java technology offers a substantial productivity boost for software
    developers compared to programming languages such as C or C++.
    In addition, developers using the Java platform can create
    sophisticated programs that are reusable and maintainable compared
    to programs written with scripting languages. Using XML and Java
    together, developers can build sophisticated, interoperable Web
    applications more quickly and at a lower cost.


     HTML

   The most popular markup language
   Defines a fixed set of tags
   Designed for presentation of data
   HTML documents are processed by HTML processing application
   Easy to implement and author e.g. small number of tags, forgiving
    syntax checking
   No formal validation
   Does not support semantic search
   Not for complex document

     HTML vs XML

   Fast becoming the standard for data interchange on the web.
   Extensible Markup Language (XML) is closely related to HTML, the
    original document representation of the WWW. While HTML enables
    the creation of Web pages that can be viewed on any browser, XML
    adds tags to data so that it can be processed by any application.
   Using XML, companies can separate the business rules from the
    content and structure of the data. By focusing on exchanging data
    content and structure, the trading partners are free to implement
    their own business rules, which can be quite distinct from one
    another.
   Custom tags work like field names in a data structure. Applications
    that share data must agree on the same XML tag names.

     HTML vs XML

 HTML                               XML
 for display                        for data structure
 (may) reveal document structure    reveals document meaning
 no knowledge of data               presentation independent
 closed language, closed standards  open language; define your own
                                     tags for all browsers
 shallow learning curve             steep learning curve (tools)
 case insensitive                   case sensitive
 empty tag like <BR> requires       empty tag needs special syntax,
  nothing special                    i.e., <empty-tag></empty-tag>
                                     or <empty-tag />
 white space ignored                white space significant within


     XML Standards

   XML, DTD
   XSL, XSLT, XPath
   DOM, SAX
   W3C XML Schema
   Namespaces
   XLink, XPointer
   XQL

     Domain Specific XML Standards

   Chemical - CML
   2D Graphics - SVG
   Math - MathML
   Music - MusicML
   Travel - OTA
   Many more ...
       http://xml.org/xmlorg_registry/index.shtml

     Core Java APIs for XML

   JAXP: Parsing and Transforming
   JAXB: High-level XML programming
   JAXM: Messaging
   JAXR: Registry APIs
   JDOM: Java-optimized Parsing

     E-Commerce Standards

   ebXML
   UDDI (Universal Description, Discovery and Integration)
   SOAP (Simple Object Access Protocol)
   W3C XP (XML Protocol)
   WSDL (Web Services Description Language)
   S2ML (Security Services ML)
   XAML (Transaction Authority ML)


     XML Document Components

   Processing Instruction
   Elements and Attributes
   Empty Tags
   Comments
   Special Characters
   Entity References
   Whitespaces
   Namespaces
   XPath, XLink, XPointer

     The XML Prolog: XML Declaration

   The part of an XML document that precedes the XML data. The
    prolog includes the declaration and an optional DTD.
   An XML file always starts with a prolog. The minimal prolog contains
    a declaration that identifies the document as an XML document

                           <?xml version="1.0"?>

    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

   version Identifies the version of the XML markup language used in the data.
    This attribute is not optional.
   encoding Identifies the character set used to encode the data. "ISO-8859-
    1" is "Latin-1" the Western European and English language character set.
    (The default is compressed Unicode: UTF-8.)
   standalone Tells whether or not this document references an external
    entity or an external data type specification
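
The three declaration attributes can be read back by a parser. The following is a minimal sketch in Python, assuming the standard library's expat binding; the helper name `read_declaration` and the sample document are illustrative and do not come from the slides.

```python
import xml.parsers.expat


def read_declaration(data: bytes) -> dict:
    """Return the version, encoding, and standalone flag of an XML declaration."""
    info = {}

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 ("yes"), 0 ("no"), or -1 (omitted).
        info.update(version=version, encoding=encoding, standalone=standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return info


# Hypothetical sample document using the full declaration shown above.
sample = b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?><note/>'
declaration = read_declaration(sample)
```

With the minimal declaration `<?xml version="1.0"?>`, the same helper would report the standalone flag as -1, meaning the clause was omitted.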

     Processing Instruction

   PIs give commands or information to an application that is
    processing the XML data.

                          <?target instructions?>

   the target is the name of the application that is expected to do the
    processing, and instructions is a string of characters that embodies
    the information or commands for the application to process.
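
As a rough illustration of how an application sees a processing instruction, the sketch below uses Python's `xml.dom.minidom`, which keeps PIs as nodes. The `xml-stylesheet` PI and the file name `style.xsl` are assumed examples, not taken from the slides.

```python
from xml.dom import minidom

# Hypothetical document: xml-stylesheet is used here only as a familiar PI target.
source = '<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="style.xsl"?><doc/>'
document = minidom.parseString(source)

# Collect (target, instructions) pairs for every PI at the top level.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
```

Note that the XML declaration itself does not show up in this list; only the PI that follows it does.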


   XML tags usually surround an identified object in the data stream. A
    start-tag and an end-tag, together with the data enclosed by
    them, represent an element. The start-tag is delimited using the ‘<’
    and ‘>’ characters. The end-tag is delimited by ‘</’ and ‘>’.

   It is this ability for one tag to contain others that gives XML its ability
    to represent hierarchical data structures


   Every XML file defines exactly one element, known as the root
    element. Any other elements in the file are contained within that
    element.
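
The nesting and single-root rules can be checked with any XML parser. Here is a small sketch using Python's `xml.etree.ElementTree`; the `<message>` document and the helper `has_single_root` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical message document: <message> is the root and contains two children.
root = ET.fromstring(
    "<message><to>you@example.com</to><subject>Hello</subject></message>"
)
child_tags = [child.tag for child in root]


def has_single_root(text: str) -> bool:
    """Return True if the text parses; a second top-level element makes parsing fail."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

For example, `has_single_root("<a/><b/>")` should be false because the text has two top-level elements, while `has_single_root("<a><b/></a>")` should be true.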


   Additional information included about an element as part of the tag
    itself, within the tag's angle brackets. It consists of an attribute
    name and an attribute value. The attribute name precedes its
    value enclosed by quotes (“ ”, ‘ ’) and separated by an equals sign.

   There must be at least one space between the element name and the
    first attribute. Multiple attributes are separated by spaces

   Since you could design a data structure like <message> equally well
    using either attributes or tags, it can take a considerable amount of
    thought to figure out which design is best for your purposes.
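
To make the trade-off concrete, the sketch below models the same hypothetical `<message>` data both ways and reads it back with Python's `xml.etree.ElementTree`. The field names `to` and `subject` are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Design 1: the recipient and subject are attributes of <message>.
as_attributes = ET.fromstring('<message to="you@example.com" subject="Hello"/>')

# Design 2: the same fields are child elements of <message>.
as_elements = ET.fromstring(
    "<message><to>you@example.com</to><subject>Hello</subject></message>"
)

recipient_from_attribute = as_attributes.get("to")
recipient_from_element = as_elements.findtext("to")
```

Both designs carry the same value; the element form can later grow nested structure, while the attribute form cannot.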

        Elements and their Content

 <paper ID="object-fusion">                          element type
  <authors>                                          content
   <author>S. Abiteboul</author>
   <author>H. Garcia-Molina</author>
  </authors>
  <fullPaper source="fusion"/>                       empty element
  <title>Object Fusion in Mediator Systems</title>   character content
  <booktitle>VLDB 96</booktitle>
 </paper>

       Element Attributes

 <paper pid="object-fusion">                         Attribute name: pid
  <authors>                                          Attribute value: "object-fusion"
   <author>S. Abiteboul</author>
   <author>H. Garcia-Molina</author>
  </authors>
  <fullPaper source="fusion"/>
  <title>Object Fusion in Mediator Systems</title>
  <booktitle>VLDB 96</booktitle>
 </paper>

     Empty Tags

   Sometimes, we might want to add a "flag" tag that marks message
    as important. A tag like that doesn't enclose any content, so it's
    known as an "empty tag". We can create an empty tag by ending it
    with /> instead of >.

   The empty tag saves you from having to code <flag></flag> in order
    to have a well-formed document.
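
A parser treats the two spellings of an empty element the same way. The following sketch, using Python's `xml.etree.ElementTree` and a hypothetical `<message>` wrapper, compares them.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<message><flag/></message>").find("flag")
long_form = ET.fromstring("<message><flag></flag></message>").find("flag")

# Both forms yield an element with no text and no children.
both_empty = (
    short_form.text is None
    and long_form.text is None
    and len(short_form) == 0
    and len(long_form) == 0
)
```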

     Comments in XML Files

   XML comments look just like HTML comments: <!-- This is a comment -->
   They do not appear in published output.

     Handling Special Characters

   In XML, an entity is an XML structure (or plain text) that has a name.
    Referencing the entity by name causes it to be inserted into the
    document in place of the entity reference. To create an entity
    reference, the entity name is surrounded by an ampersand and a
    semicolon, like this: &entityName;
   Predefined Entities:

                        Character            Reference
                            &                 &amp;
                            <                  &lt;
                            >                  &gt;
                            "                 &quot;
                            '                 &apos;

     Using Entity Reference in an XML Document

   The problem with putting that line into an XML file directly is that
    when the parser sees the left-angle bracket (<), it starts looking for
    a tag name, which throws off the parse. To get around that problem,
    you put &lt; in the file, instead of "<".
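
As an illustration of the substitution described above, this sketch escapes a string with Python's `xml.sax.saxutils.escape` and parses it back. The sample text is an assumption; the original line the slide refers to was not preserved.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "if (a < b && b > c)"  # hypothetical text containing markup characters
escaped = escape(raw_text)        # replaces &, <, and > with entity references

# The parser turns the references back into the original characters.
element = ET.fromstring(f"<code>{escaped}</code>")
```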


     Handling Text with XML-Style Syntax

   When you are handling large blocks of XML or HTML that include
    many of the special characters, it would be inconvenient to replace
    each of them with the appropriate entity reference. For those
    situations, you can use a CDATA section.
   all white space in a CDATA section is significant, and characters in it
    are not interpreted as XML.

                          <![CDATA[ ............. ]]>
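
A short sketch of the effect, assuming Python's `xml.etree.ElementTree` and a made-up `<example>` element: the markup inside the CDATA section reaches the application as plain text.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: everything between <![CDATA[ and ]]> is character data.
source = "<example><![CDATA[<b>bold</b> & <i>italic</i>]]></example>"
element = ET.fromstring(source)
```

Here `element` has no child elements; the tags and the bare ampersand appear unchanged in its text.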



Prolog   <?xml version="1.0"?>
         <!-- comments and processing instructions -->
         <!DOCTYPE sdsc_play_groups SYSTEM "http://localserver/spg.dtd">
         <!-- comments and processing instructions -->

Body     <sdsc_play_groups>
           <play_group ID="Data-issues">
             <group>Scientific Computing</group>
             <group>Data Intensive Computing</group>
             <group>Security Technologies</group>
             <charter source="XPG"/>
             <title>XML Play Group</title>
           </play_group>
         </sdsc_play_groups>

         <!-- comments and processing instructions -->

     White Space
XML specification normalizes different line-ending conventions to a single
convention but preserves all other white space, except in attribute values.
White Space and the XML Declaration: According to the current XML 1.0
standard, white space is not allowed before the XML declaration. If white
space appears before the XML declaration, it will be treated as a processing
instruction. The information, particularly the encoding, may not be used by
the parser.

White Space in Element Content: XML parsers are required to report all
white space that appears in element content within a document. For this
reason, the following three documents are different to an XML parser.

      White Space
White Space in Attributes: Although XML processors preserve all white
space in element content, they frequently normalize it in attribute values.
Tabs, carriage returns, and spaces are reported as single spaces. In certain
types of attributes, they trim white space that comes before or after the main
body of the value and reduce white space within the value to single spaces. If
a DTD is available, this trimming will be performed on all attributes that are
not of type CDATA. If there is no DTD, the parser assumes that all attributes
are of type CDATA.

For the above example, an XML parser reports both attribute values as "this is
a note.", converting the line breaks to single spaces.
End of Line Handling: XML processors treat the character sequence
Carriage Return-Line Feed (CRLF) like single CR or LF characters. All are
reported as a single LF character.
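
The attribute and end-of-line rules can be observed directly. The sketch below uses Python's `xml.etree.ElementTree`; the `<note>` and `<body>` documents are assumptions, since the slide's own examples were not preserved.

```python
import xml.etree.ElementTree as ET

# A line break inside an attribute value is reported as a single space.
note = ET.fromstring('<note text="this is\na note."/>')
attribute_value = note.get("text")

# A CRLF pair in element content is reported as a single LF.
body = ET.fromstring("<body>line one\r\nline two</body>")
element_text = body.text
```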
By using XML namespaces, authors can qualify element names uniquely
on the Web and thus avoid conflicts between elements that have the
same name. Associating a Uniform Resource Identifier (URI) with a
namespace is purely to ensure that two elements with the same name
can remain unambiguous; it does not matter what, if anything, the URI
points to.

        Identifying Vocabularies:
        XML Namespaces

   My element may not be your element:
       geometry context: <element>line</element>
       chemistry context: <element>oxygen</element>
       SGML/XML context: ....
    use XML namespaces to identify the vocabulary

          XML Namespaces
     mechanism for globally unique tag names:
    <h:html        xmlns:xdc="http://www.xml.com/books"
    <h:head><h:title>Book Review</h:title></h:head>
      <xdc:title>XML: A Primer</xdc:title>
     mix of different tag vocabularies without confusion

•     namespaces only identify the vocabulary; additional mechanisms
      required for structure and meaning of tags
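
To show how two vocabularies with a `title` element stay distinct, here is a sketch using Python's `xml.etree.ElementTree`. The `xdc` URI is the one on the slide; the `h` URI and the `<h:body>` wrapper are assumptions added to make the example complete.

```python
import xml.etree.ElementTree as ET

# The xdc URI comes from the slide; the h URI is an assumed placeholder.
source = """
<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:head><h:title>Book Review</h:title></h:head>
  <h:body><xdc:title>XML: A Primer</xdc:title></h:body>
</h:html>
"""
root = ET.fromstring(source)

# Map prefixes to URIs for lookups; only the URIs matter, not the prefixes.
prefixes = {
    "h": "http://www.w3.org/HTML/1998/html4",
    "xdc": "http://www.xml.com/books",
}
html_title = root.findtext("h:head/h:title", namespaces=prefixes)
book_title = root.findtext(".//xdc:title", namespaces=prefixes)
```

ElementTree stores each tag as `{URI}localname`, so the two `title` elements never compare equal.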

      XPath, XLink, and XPointer
   XPath:
       a declarative language for locating nodes and fragments in XML trees
       used in both XPointer (for addressing) and XSL (for pattern matching)
   XLink:
       a generalization of the HTML link concept
       higher abstraction level (intended for general XML - not just hypertext)
       more expressive power (multiple destinations, special behaviours,
        out-of-line links, ...)
       uses XPointer to locate resources
   XPointer:
       an extension of XPath suited for linking
       specifies connection between XPath expressions and URIs

     XML Path Language: XPath

   W3C Recommendation Nov. 1999
   for addressing parts within an XML document
   (non-XML) syntax used for XSLT and XPointer

   Find the root element (bookstore) of this document:
   /bookstore
   Find all author elements anywhere within the current document:
   //author
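
The two expressions can be approximated with Python's `xml.etree.ElementTree`, which implements only a subset of XPath. In this sketch the bookstore document is invented, and `.//author` (relative to the root) stands in for `//author`.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document using the element names from the slide.
source = """
<bookstore>
  <book><title>XML Basics</title><author>A. Writer</author></book>
  <magazine><article><author>B. Editor</author></article></magazine>
</bookstore>
"""
root = ET.fromstring(source)

root_tag = root.tag  # the element that /bookstore selects
# Author elements at any depth below the root, in document order.
all_authors = [author.text for author in root.findall(".//author")]
```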

      XML Linking Language (XLink)

   W3C Candidate Recommendation, July/2000
   language for typed links between documents
   extends the simple untyped href links in HTML:
        multidirectional links
        any element can be the source (not just <a ... > </a>)
        link to arbitrary positions within a document (via URIs and XPointer)
   richer custom applications possible
   xlink:type declaration: {simple, extended, locator, arc}
   optional "semantic attributes": {role, arcrole, title}
   Example:

<author xmlns:xlink="... " xlink:href="....itmaven.com/peter.html"
        xlink:title="Peter's homepage"
        xlink:role="further info about the book author">
 Peter Pan Sr. </author>

        XML Pointer Language (XPointer)

   W3C Candidate Recommendation, June/2000
   for locating internal structures of XML documents
   XLinks URIs can include XPointer parts
   extends HTML's named anchors:
       target doc: <a name="target"> ... </a>
       source doc: <a href="#target">...</a>
   ... and select via XPath expressions
     + some extension (points and ranges, ...)
       intro/14/3        ("intro" is an ID attribute value)
        /1/2/5/14/3
       xpointer(id("chap1"))      xpointer(//*[@id="chap1"])

       Four Common Errors
The XML syntax is very strict: Elements must have both a start and end tag, or
they must use the special empty element tag; attribute values must be fully
quoted; there can be only one top-level element; and so on.
A strict syntax was a design goal for XML. The browser vendors asked for it.
HTML is very lenient, and HTML browsers accept anything that looks vaguely like
HTML. It might have helped with the early adoption of HTML but now it is a
problem.
Studies estimate that more than 50% of the code in a browser deals with errors
or the sloppiness of HTML authors. Consequently, an HTML browser is difficult to
write, it has slowed competition, and it makes for mega-downloads.
The four most common errors in writing XML code are:

   Forget End Tags
   Forget that XML Is Case-Sensitive
   Introduce Spaces in the Name of Element
   Forget the Quotes for Attribute Value
   Four Common Errors (cont)
Forget End Tags
                       <street>34 Fountain Square Plaza

Forget that XML Is Case-Sensitive
            <tel>513-744-7098</tel>       <tel>513-744-7098</TEL>

Introduce Spaces in the Name of Element
            <address book>
                  <name>John Doe</name>
                  <email href='mailto:jdoe@emailaholic.com'/>
            </address book>

Forget the Quotes for Attribute Value
 <tel preferred=true>513-744-8889</tel>   <tel preferred="true>513-744-8889</tel>
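
Each of the four mistakes makes a document fail to parse. The sketch below feeds one sample per error to Python's `xml.etree.ElementTree`; the helper `is_well_formed` is a name made up for this example, and the address-book sample is shortened from the one above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One sample per common error, adapted from the slide.
errors = {
    "missing end tag": "<street>34 Fountain Square Plaza",
    "case mismatch": "<tel>513-744-7098</TEL>",
    "space in name": "<address book><name>John Doe</name></address book>",
    "unquoted attribute": "<tel preferred=true>513-744-8889</tel>",
}
results = {name: is_well_formed(text) for name, text in errors.items()}
```

Every entry in `results` should be false, while a corrected line such as `<tel>513-744-7098</tel>` parses without error.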

               XML Document Map

   [Figure: an annotated XML document labeling the XML declaration,
    processing instruction, DOCTYPE declaration, root element, namespace,
    element, entity reference, start tag, end tag, CDATA section, and
    textual content]


     XML Document Types
   Well-formed XML Document
      Conforms to the basic XML syntax
      Can be parsed without regard to the DTD

   Valid XML Document
      Well-formed
      Conforms to its DTD

     Document Type Definition (DTD)

1.   Firstly and most importantly a DTD can define a class of document.
     Classes are very powerful concepts in programming, because if you
     have a class it has an expected structure which means it will have
     consistent behaviors and properties. You should also be able to
     carry out certain predefined operations on a class of documents.
2.   If you use a DTD you can force a writer to include certain elements.
     You can't force them to put PCDATA content in the element, but
     that's another story.
3.   If you are planning to display a document using a style sheet, a
     DTD will ensure that you do not include elements that do not have
     any display instructions.
4.   If you are planning to search the document, or otherwise
     manipulate it using the DOM, a rigid structure will simplify your
     coding, and speed up the execution of your code, by a huge factor.

         In Search of
         the Lost Structure & Semantics

   How do I learn and use the element structure of a document?
   How do I share structure and metadata/semantics with my community?
   How to make all this automatable?

        Adding Structure and Semantics

   XML Document Type Definitions (DTDs):
    •   define the structure of "allowed" documents
        (i.e., valid wrt. a DTD)
    •  database schema
     => improve query formulation, execution, ...
   XML Schema
        defines structure and data types
       allows developers to build their own libraries of interchanged data
   XML Namespaces
       identify your vocabulary

      XML DTDs as Extended Context-Free Grammars

  <!ELEMENT bibliography (paper*)>
  <!ELEMENT paper        (authors,fullPaper?,title,booktitle)>
  <!ELEMENT authors      (author+)>

    bibliography   ->   paper*
    paper          ->   authors fullPaper? title booktitle
    authors        ->   author+

lhs = element (name)
rhs = regular expression over elements + strings (PCDATA)
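
The idea that each right-hand side is a regular expression over element names can be sketched in a few lines of Python. This is not DTD validation: the content models are translated to regular expressions by hand, character content and attributes are ignored, and the names `CONTENT_MODELS`, `children_match`, and `conforms` are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Each content model, hand-translated into a regular expression over
# space-terminated child element names.
CONTENT_MODELS = {
    "bibliography": r"(paper )*",
    "paper": r"authors (fullPaper )?title booktitle ",
    "authors": r"(author )+",
}


def children_match(element) -> bool:
    """Check one element's child sequence against its content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # no rule recorded for this element type
    sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, sequence) is not None


def conforms(root) -> bool:
    """Check every element in the tree, starting from the root."""
    return all(children_match(element) for element in root.iter())
```

A paper whose `title` comes before `authors` fails the check, because the sequence no longer matches the expression for `paper`.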

      Document Type Definitions (DTDs)

                           Define and Constrain
                        Element Names & Structure
<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
<!ELEMENT authors (author+)>
<!ELEMENT author (#PCDATA)>               Element Type Declaration
<!ATTLIST author age CDATA #IMPLIED>      Attribute List Declaration
<!ELEMENT fullPaper EMPTY>
<!ELEMENT title (#PCDATA)>
<!ELEMENT booktitle (#PCDATA)>

      Element Declarations

<!ELEMENT bibliography (paper*)>          Sequence of 0 or more papers
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
                                          Authors followed by optional fullPaper,
                                           followed by title, followed by booktitle
<!ELEMENT authors (author+)>              Sequence of 1 or more authors
<!ELEMENT author (#PCDATA)>
<!ATTLIST author age CDATA #IMPLIED>
<!ELEMENT fullPaper EMPTY>
<!ELEMENT title (#PCDATA)>                Character content
<!ELEMENT booktitle (#PCDATA)>

         Element Content Declarations

Declaration           Meaning
 <ELEMENT 2>          Exactly one <ELEMENT 2>
 cardinality: R?      Zero or one instances of R
              R*      Zero or more instances of R
              R+      One or more instances of R
 R1 | R2 | … | Rn     One instance of R1 or R2 or … Rn
 R1, R2, …, Rn        Sequence of R's, order matters
 #PCDATA              Character content
 EMPTY                Empty element
 (#PCDATA | e)*       Mixed content
 ANY                  Anything goes

        Attribute Types (DTD)
Type               Meaning
ID                 Token unique within the document
IDREF              Reference to an ID token
IDREFS             Reference to multiple ID tokens
ENTITY             External entity (image, video, …)
ENTITIES           External entities
CDATA              Character data
NMTOKEN            Name token
NMTOKENS           Name tokens
NOTATION           Data other than XML
Enumeration        Choices
Conditional Sec    INCLUDE & IGNORE declarations
Attributes may be: REQUIRED, IMPLIED (optional)
           can have: default values, which may be FIXED
   [Table: Attribute-Specification Parameters]

     Attribute Declarations

<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
<!ELEMENT authors (author+)>
<!ELEMENT author (#PCDATA)>

<!ELEMENT fullPaper EMPTY>
<!ELEMENT title (#PCDATA)>
<!ELEMENT booktitle (#PCDATA)>
<!ATTLIST fullPaper source ENTITY #REQUIRED>
<!ATTLIST person pid ID #REQUIRED>
<!ATTLIST author authorRef IDREF #IMPLIED>
                                       Pointer (IDREF) and
                                     target (ID) declarations
                                  for intra-document "pointers"
       XML Attribute

 <person pid="joyce">   …   </person>                  Object identity attribute

 <paper pubid="wsa" role="publication">
  <authors>
   <author authorRef="joyce" age="???">                authorRef: intradocument reference
               J. L. R. Colina </author>               age: CDATA (character data)
  </authors>
  <fullPaper source="http://...confusion"/>            Reference to external ENTITY
  <title>Object Confusion in a Deviator System </title>
  <related papers= "deviation101 x_deviators"/>

    Uses of XML Entities

   Physical partition
       size, reuse, "modularity", … (both XML docs & DTDs)
   Non-XML data
       unparsed entities  binary data
   Non-standard characters
       character entities
 Shorthand for phrases & markup,
=> effectively are macros

        Types of Entities

   Internal (to a doc) vs. External ( use URI)
   General (in XML doc) vs. Parameter (in DTD)
   Parsed (XML) vs. Unparsed (non-XML)

      Internal Text Entities

                       Internal Text Entity Declaration
  <!ENTITY WWW "World Wide Web">

XML                              Entity Reference
  <p>We all use the &WWW;.</p>
                                 Logically equivalent to
                                   actually appearing

  <p>We all use the World Wide Web.</p>
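
The expansion can be observed with a parser. In this sketch, Python's `xml.etree.ElementTree` reads a document whose internal DTD subset declares the `WWW` entity from the slide; wrapping the declaration in a `DOCTYPE` for `<p>` is an assumption made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset and referenced in the body.
source = """<!DOCTYPE p [
  <!ENTITY WWW "World Wide Web">
]>
<p>We all use the &WWW;.</p>"""
paragraph = ET.fromstring(source)
```

The application never sees the reference itself, only the replacement text.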

 Entities & Physical Structure

DTD...                   A logical element
                         can be split into
                         physical entities
 <teen>yada yada

<adult>blah blah..
      External Text Entities

DTD                            External Text Entity Declaration

   <!ENTITY chap1 SYSTEM "http://...chap1.xml">

XML               Entity Reference

<mylife> &chap1; &chap2;</mylife>

                        Logically equivalent to inlining file contents

   <mylife> <teen>yada yada</teen>
                  <adult> blah blah</adult>

       Unparsed (& "Binary") Entities

DTD                      Declare external and unparsed entity

<!ENTITY fusion SYSTEM "http://... fusion.ps" NDATA ps>

                                Declare attribute type to be entity

 <!ATTLIST fullPaper source ENTITY #REQUIRED>

XML                              Element with ENTITY attribute

 <fullPaper source="fusion"/>

                       NOTATION declaration (helper app)

    <!NOTATION ps SYSTEM "ghostview.exe">
            Pure XML Model (DTD)

   Any DTD myDTD defines a language valid(myDTD):
             valid(myDTD) = {docs D | D is valid wrt. myDTD}
   <!ELEMENT A (B,C*)>       Content ("container") model: A contains one B,
                              followed by any number of Cs
   <!ELEMENT B (#PCDATA)>    B is a leaf, contains actual data

     <A>                                    A
      <B>foo</B>                         /  |  \
      <C>bar</C>                        B   C   C
      <C>lab</C>                     "foo" "bar" "lab"
     </A>

          Data Modeling with DTDs

   XML element types ~ "object types"
   content model for children elements ~ "subobject structure"
   recursive types (container analogy!?)
     <!ELEMENT A (B|C)>           "an A can contain a B..."
     <!ELEMENT B (A|C)>           "... which contains an A!"
      found in doc world: document DIVision (=generic block-level container)
   loose typing
        <!ELEMENT A    ANY>                "so what's in the box, please??"
   no context-sensitive types:
     DTDs cannot distinguish between the publisher in
      <journal> <publisher>... </publisher> </journal>
      <website> <publisher> ... </publisher> </website>
     => renaming “hack” <j_pub> and <w_pub>
     => DTD extensions (XML SCHEMA)
     Where is the Data??

   Actual data can go into leaf elements and/or attributes

   Common/good practice (!?):
        XML   element ~ container (object)
        XML   element type (tag) ~ container (object) type
        XML   attribute ~ properties of the container as a whole ("metadata")
        XML   leaf elements ~ contain actual data

   Problems with DTDs:
        no data types
        no specialization/extension of types
        no "higher level" modeling (classes, relationships, constraints, etc.)

         Extending DTDs: Data Modeling

   XML main stream: XML Schema
       data types
       user defined types, type extensions/restrictions ("subclassing")
       cardinality constraints
   XML side streams:
       RELAX (REgular Language description for XML), SOX (Schema for
        Object-Oriented XML), Schematron, ...
   alternative approach:
       use well-established data modeling formalisms like (E)ER, UML,
        ORM, OO models, ...
... and just encode them in XML!
       e.g. UML: XMI (standardized, has much more=>big),
        UXF (UML eXchange Format)
     How to use DTD?

   A DTD can be declared internally (in the local subset) or in an
    external file.

   A data element contains only subelements with no
    intervening text.
   A document element is defined to include both text and
    subelements.

