There are no unstructured documents


Author: David Slocombe

          David Slocombe's career in computing began in 1969 while he was a Canadian
     newspaper reporter. During the next 30 years he developed many applications of
     computing to journalism and publishing. He was a founder and vice-president of
     SoftQuad Inc. and was the architect of SoftQuad's first product-line, SoftQuad
     Publishing Software. He contributed to the early development of DSSSL. Until recently
     he was a consultant at Tata Infotech Ltd. in Bombay, India. He is currently Chief
     Scientist at Exegenix Research Inc. in Toronto, a company he co-founded in 2001.

Author: Rodney Boyd

          Rodney Boyd worked at SoftQuad for 10 years, writing about and using SGML,
     HTML, and XML. He wrote manuals for Author/Editor, XMetaL, HoTMetaL, and other
     products. He was dragged back bodily from Mexico to work as a Document
     Conversion Analyst at Exegenix Research Inc.


                    Printed documents have sufficient visual markers to make their structure
              obvious. We discuss the theory behind, and demonstrate, a system that
              converts documents to XML based on their two-dimensional visual structure.

     There are no unstructured documents
     All of us recognize the usefulness of having our documents in a structured, parsable,
reusable format such as XML, but we have assumed that we must add this structure
explicitly, at the cost of substantial human effort.
    We think this approach to human-computer interaction is backwards: just as we do not
expect people to read XML-tagged documents directly, we should not expect them to write
them either. What is good enough for humans should be good enough for our increasingly
powerful machines.
     We have created a system that reads PostScript™ and PDF files produced by
word-processing and desktop-publishing products, and extracts two-dimensional visual
representations of the documents. From these representations, document structures can be
identified, and valid and rich XML files can be generated. This approach to document
conversion is proving more fruitful than the traditional methods of analyzing
one-dimensional streams of typographic codes.
      The motivation for our approach is our contention that virtually all documents already
have sufficient visual markers of structure to make them parsable. The clearest indication of
this is the commonplace observation that we rarely encounter a printed document whose
logical structure we cannot mentally parse at a glance.
     A document that fails this test is often ambiguous to us humans—let alone machines—
even after reflection, because its author (or typographer) did not sufficiently follow
commonly understood rules of layout. As negative examples, some design-heavy
publications deliberately flout as many of these rules as they can get away with; this is
entertaining, but hardly helps us to discern their structure.
     Dimensions of identification
      What do we mean by one- and two-dimensional structure identification?
One-dimensional identification parses a document as a linear sequence of content and
typographical codes, and then takes advantage of prior knowledge of the meanings of those
codes. For example, \b in RTF indicates bold text. The general paradigm is procedural,
similar to a tape-driven numerical control machine in a high-tech factory: “Do this. Now do
this. Now do this. ...”
     By contrast, two-dimensional identification parses the page into objects, based on their
formatting, position, and context. It then considers page layout holistically, building on the
observation that pages tend to have a structure or geometry that includes one or more of
header, footer, body, and footnotes. Then, a software system can identify objects and
higher-level structures using general knowledge of typographic principles. This paradigm is
declarative: “At this location on the page there is an object with such and such a set of
properties.” We don't care how (by what set of steps) an object came to be, only that it
     A major problem with the one-dimensional approach is that exactly the same construct
can be created in many different ways. Both the author—the user of the application—and the
programmer who created the application, have many choices, and therefore many sets of
procedural codes can be encountered that produce the same results. For example, a simple
indented bulleted list item can be constructed by the user thus:

          Inserting a number of spaces using the space bar, and then inserting a bullet
          Inserting a tab, and then a bullet character
          Doing either of the above, but instead of entering a bullet character, entering a
     period, increasing its font size, and superscripting it
          Clicking on the word-processor's “Bulleted list” tool
     Each of these methods generates a different set of typographical codes, even though
the resulting constructs look identical on the page. In addition, different word-processing
and typesetting applications will emit different codes for the same user actions.
      By contrast, two-dimensional, declarative structure identification decreases the number
of parameters that need to be considered when deciding on the identity of a particular
construct. This makes the system more robust. In the examples above, the only parameters
that would be considered when determining the identity of the object in question would be
the location and appearance of the bullet character (and any content that followed it) in
relation to their surroundings.
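
     The contrast can be illustrated with a minimal sketch. It is not the system described in this paper; the `Mark` record, the margin and tolerance values, and the function name are all assumptions made for illustration. The four marks stand for the rendered results of the four construction methods listed above: whatever codes produced them, each ends up as a small filled dot, indented from the margin, with text following it on the same line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One painted mark, described only by what is visible on the page."""
    shape: str  # what was painted, e.g. "filled-dot" or "text"
    x: float    # left edge, in points from the left edge of the page
    y: float    # vertical centre, in points from the bottom of the page


LEFT_MARGIN = 72.0    # assumed body-text margin
LINE_TOLERANCE = 3.0  # assumed: marks this close vertically share a line


def looks_like_bullet_item(marker, text, left_margin=LEFT_MARGIN):
    """Declarative test: an indented dot followed by text on the same line."""
    indented = marker.x > left_margin
    same_line = abs(marker.y - text.y) <= LINE_TOLERANCE
    text_follows = text.x > marker.x
    return marker.shape == "filled-dot" and indented and same_line and text_follows


# Hypothetical rendered output of the four construction methods.
renderings = [
    Mark("filled-dot", 88.0, 700.0),  # spaces, then a bullet character
    Mark("filled-dot", 90.0, 700.0),  # tab, then a bullet character
    Mark("filled-dot", 89.0, 701.5),  # enlarged, superscripted period
    Mark("filled-dot", 90.0, 700.0),  # the "Bulleted list" tool
]
text = Mark("text", 108.0, 700.0)

results = [looks_like_bullet_item(m, text) for m in renderings]
print(results)
```

     The test never asks which keystrokes or codes produced the dot, so all four variants are handled by one rule.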
    Two-dimensional parsing also more closely emulates the human eye-brain system for
understanding printed documents. (This, as we never tire of pointing out, is a mechanism
whose effectiveness has been verified by centuries of QA!)
     As we said above, our approach to structure identification is based on determining
which sets of typographical properties are distinctive of specific objects. Some examples
will demonstrate how human beings use typographical cues to distinguish structure:
     Example 1 - numbered lists

    The first example is a simple list, with another nested inside it. It's easy to tell this
because there are two significant visual clues that the second through fifth items are
members of a sub-list.

          These items are indented to the right—perhaps the most obvious indication.
          They use a different numbering style (alphabetic rather than numeric).
     In the second example the sub-list uses the same numbering style, but it's still indented,
so there's really no doubt that it's a nested list.
     The third list is a bit more unusual, and many people would say it is typeset badly,
since all the items are indented the same. Nonetheless, we can still guess with a high
probability of correctness that the second through fifth items are logically nested, since they
use a different numbering scheme. Since we can identify the sub-list without it being
indented, we believe that numbering scheme has a higher weight than indentation in this
      The fourth list looks very strange indeed. Not only are all of the items indented the
same, they all use the same numbering scheme. You would be quite justified in concluding
that the structure of this list can't be deciphered with any confidence. By making certain
assumptions, it is possible to guess at a structure, but it might not be the one that the author
     Example 2 - bulleted lists
     Now let's see how the observations that we made above apply to bulleted
(non-numbered) lists.
     The first example clearly contains a nested list: the second through fifth items are
indented, and have a different bullet character than the first and sixth items.
      The second example is similar. Even though all items have the same bullet character,
the indent tells us that some items are nested.
     In the third example, even without the indentation we can easily tell that the middle
items are in a sub-list because the unfilled bullet characters are clearly distinct.
     The fourth example presents something of a dilemma. None of the items are indented,
and while we may be able to tell that the second through fifth bullets are different, is that
enough to conclude that they indicate items in a sub-list? Perhaps the typographer simply
made a mistake and the lists have only one level. Here is an example where humans and
software programs may rightly conclude that the situation is ambiguous, and just make their
best guess. We are led to conclude that, when recognizing a nested bulleted list, indentation
has a higher weighting than the choice of list mark.
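
     The weighting observed in Examples 1 and 2 can be sketched as follows. This is an illustration only: the numeric weights, the threshold, and the function names are assumptions, and only the relative order of the weights (list mark style above indentation for numbered lists, indentation above list mark for bulleted lists) is taken from the discussion above. Each cue is given as a strength between 0.0 (absent) and 1.0 (unmistakable).

```python
# Hypothetical cue weights; only their relative order per list kind
# reflects the observations in the text.
CUE_WEIGHTS = {
    "numbered": {"mark_style": 0.8, "indent": 0.7},
    "bulleted": {"mark_style": 0.7, "indent": 0.8},
}
NESTED_THRESHOLD = 0.6  # assumed cut-off for a confident decision


def nesting_score(kind, indent, mark_style):
    """Combine cue strengths (0.0 to 1.0) into a score capped at 1.0."""
    weights = CUE_WEIGHTS[kind]
    raw = weights["indent"] * indent + weights["mark_style"] * mark_style
    return min(1.0, raw)


def classify_sublist(kind, indent, mark_style):
    """Label a run of items as nested, ambiguous, or undetermined."""
    score = nesting_score(kind, indent, mark_style)
    if score >= NESTED_THRESHOLD:
        return "nested"
    if score > 0.0:
        return "ambiguous"
    return "undetermined"


# Numbered list, same indent, different numbering style (third example).
print(classify_sublist("numbered", indent=0.0, mark_style=1.0))
# Numbered list, same indent, same numbering style (fourth example).
print(classify_sublist("numbered", indent=0.0, mark_style=0.0))
# Bulleted list, no indent, bullets only subtly different (fourth example).
print(classify_sublist("bulleted", indent=0.0, mark_style=0.5))
```

     Treating the list mark as a graded strength rather than a yes/no flag is a design choice of this sketch: it lets clearly distinct bullets decide the matter while subtly different ones remain ambiguous.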
        Example 3 - block quotes
      We use several different cues when recognizing a block quote: font and font style,
indentation, inter-line spacing, quote marks, and dividers. Let's see how they interact. Note
that this is a somewhat simplified example since other constructs such as notes and warnings
may have formatting attributes similar to those of block quotes.
      The first example is the classic block quote: the quoted material is indented and

        In example two, the quote is again emphasized in two ways: by indentation and point
     Now look what happens in the next two examples:

     Now the indentation has been removed, but the font changes have been retained. We
might recognize these blocks as quotes, but it's not as obvious as before. Indentation seems
to be a crucial characteristic of block quotes. If typographers don't want to use this
formatting property, they will have to surround the quoted block with explicit quote marks:
     Object taxonomy
     Our empirical research, based on an examination of thousands of pages of documents,
has produced a taxonomy of objects that are in general use in documents, along with visual
(typographic) cues that communicate the objects on the page or screen, and an analysis of
which combinations of cues are sufficient to identify specific objects. These results provide
a repository of typographic knowledge that is employed in the structure identification
      This taxonomy classifies objects into the typical categories that you might expect
(block/inline, text/non-text), with some finer categorizations—to distinguish different kinds
of titles and lists, for example.
     What is interesting is that after a few weeks of research the number of new, distinct
object types that we were encountering dropped significantly, and the ones that we were
discovering were generally in use in a minority of documents. Furthermore, most documents
tend to use a relatively small subset of these objects. Naturally, we don't believe that we
have captured all possible object types. But this experience leads us to conclude that the set
of object types in general use in typography is manageably finite.
       The set of visual cues or characteristics that concretely realize graphical objects is not
as easy to capture, since for each object, there are many ways that individual authors will
format it. Just think of the number of different ways of formatting as simple an object as a
title. Even in this case, however, we have found that the majority of documents use a fairly
well-defined set of typographical conventions.
      Although the list of objects and visual cues is large, it is sufficiently stable over time to
provide the common, reliable protocol that authors use to communicate with their readers,
just as programs use XML to communicate with each other.
     A working system for structure identification and conversion
     What follows is an outline of the system.

          The structure identification process starts with visual data acquisition. We begin
     with a PostScript or PDF file, which is actually a program for instructing a PostScript
     printer or its surrogate how to draw the document on a page or the screen. We execute
     the PostScript or PDF and extract information about the basic marks that it paints on the
     page. In particular, for each mark we obtain the type of mark (text, image, drawing
     command, rule), the mark itself (for example, the word 'wildebeest'), the mark's
     location, and its font style (if appropriate). At this stage no decisions are made as to
     the identity of the graphical objects that these marks constitute.
           The information gleaned from the visual data acquisition phase is then used in the
     visual tokenization phase, which groups together marks that seem to have a
     relationship to each other, thereby synthesizing the basic tokens of visual human
     communication found in the document. This stage still does not attempt to identify
          In the structure identification phase, the system considers the properties of each
     token (appearance, position, and context) and arrives at a conclusion as to its
     identity—title, paragraph, table, etc.—and perhaps its relationship to a larger structure,
     based on knowledge of typographic conventions (that is, the information contained in
     the object taxonomy).
           If the system is being used to perform document conversion, there will then be an
     export phase that serializes the structure created in the previous phase into an XML
     file. The format of this file was chosen to be easy to transform into customer-specific
     formats using tools such as XSLT.
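
     The four phases outlined above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system itself: the marks are supplied as a hand-written list rather than obtained by executing PostScript or PDF, tokenization only joins text marks that share a baseline, identification uses a single font-size rule, and the element names `document`, `title`, and `paragraph` are hypothetical rather than the actual export format.

```python
from dataclasses import dataclass
from itertools import groupby
import xml.etree.ElementTree as ET


@dataclass
class Mark:
    kind: str         # "text", "image", "drawing", or "rule"
    content: str      # the mark itself, e.g. the word "wildebeest"
    x: float          # horizontal position, in points
    y: float          # baseline position, in points (larger is higher)
    font_size: float  # in points; 0 for non-text marks


@dataclass
class Token:
    text: str
    x: float
    y: float
    font_size: float


def tokenize(marks):
    """Visual tokenization: join text marks on one baseline into a line."""
    text_marks = sorted((m for m in marks if m.kind == "text"),
                        key=lambda m: (-m.y, m.x))
    tokens = []
    for y, group in groupby(text_marks, key=lambda m: m.y):
        line = list(group)
        tokens.append(Token(" ".join(m.content for m in line),
                            line[0].x, y, max(m.font_size for m in line)))
    return tokens


def identify(tokens, body_size=10.0):
    """Structure identification: one assumed rule, larger type is a title."""
    return [("title" if t.font_size > body_size else "paragraph", t)
            for t in tokens]


def export(labelled):
    """Export: serialize the identified structure as a small XML string."""
    root = ET.Element("document")
    for label, token in labelled:
        ET.SubElement(root, label).text = token.text
    return ET.tostring(root, encoding="unicode")


# Stand-in for visual data acquisition: marks as they might be extracted.
marks = [
    Mark("text", "Wildebeest", 72.0, 700.0, 18.0),
    Mark("text", "migration", 170.0, 700.0, 18.0),
    Mark("text", "The", 72.0, 670.0, 10.0),
    Mark("text", "wildebeest", 92.0, 670.0, 10.0),
    Mark("text", "roams.", 140.0, 670.0, 10.0),
    Mark("rule", "", 72.0, 660.0, 0.0),
]

xml_output = export(identify(tokenize(marks)))
print(xml_output)
```

     Keeping each phase as a separate function mirrors the outline: no decision about an object's identity is made until tokenization is complete, and the XML is produced only at the end.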

      Two-dimensional structure identification, which attempts to model the ways in which
human beings actually discern the structure of documents, is proving to be a useful approach
to the problem of making electronic documents available to computers for purposes that go
beyond simple word-processing. A set of typographical conventions that are used in the
vast majority of documents has been compiled, and work is progressing on expressing these
conventions in a software system.
     We have had a great deal of success in identifying both simple and complex structures
via two-dimensional parsing techniques, and have been able to parse a wide class of
documents: for example, legal rulings, textbooks, novels and other trade publications, and
software manuals.
     While the initial motivation for our work is and continues to be largely XML-centric,
structure recognition is not just about generating XML files. Some other applications that we
envision are:

          Natural language parsing will benefit from the ability to recognize the hierarchical
     structure of documents.
          Search engines could use structure identification to identify the most relevant
     parts of documents and thereby provide better indexing capabilities.
