A Comparative Study of Some Multiple Expert Recognition Strategies

     Content Extraction from HTML Documents


    A. Rahman H. Alam R. Hartono
 Document Analysis and Recognition Team
                 (DART)
          BCL Computers Inc.
          Santa Clara, Calif, USA
             Current need?

• Viewing web sites on small-screen
  handheld devices
• Since web sites are written in HTML,
  their content must be translated into formats
  that wireless devices can support.
             Current Solutions

•   Handcrafting:
    –   Custom Web Sites are typically crafted by
        hand by a set of content experts
•   Transcoding:
    –   Transcoding replaces HTML tags with
        suitable device-specific tags (HDML, WML,
        etc.)
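As a rough illustration of tag-for-tag transcoding, the sketch below walks an HTML fragment with Python's standard html.parser module and swaps each tag for a counterpart from a lookup table. The TAG_MAP table, the NaiveTranscoder class and the transcode function are hypothetical names, and the mapping itself is an assumption made for this example, not a rule taken from any transcoding product.

```python
import re
from html.parser import HTMLParser

# Hypothetical HTML-to-WML tag mapping, for illustration only.
# A value of None means "drop the tag but keep its text".
TAG_MAP = {
    "h1": "big",
    "h2": "b",
    "p": "p",
    "b": "b",
    "i": "i",
    "br": "br",
    "div": None,
    "span": None,
    "font": None,
}


class NaiveTranscoder(HTMLParser):
    """Replace each HTML tag with a mapped tag; structure is not analyzed."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        mapped = TAG_MAP.get(tag)
        if mapped == "br":
            self.out.append("<br/>")
        elif mapped:
            self.out.append(f"<{mapped}>")

    def handle_endtag(self, tag):
        mapped = TAG_MAP.get(tag)
        if mapped and mapped != "br":
            self.out.append(f"</{mapped}>")

    def handle_data(self, data):
        # Collapse runs of whitespace; skip whitespace-only text.
        text = re.sub(r"\s+", " ", data)
        if text.strip():
            self.out.append(text)


def transcode(html):
    parser = NaiveTranscoder()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Note that every piece of text in the page survives the conversion: nothing is summarized, prioritized or left out.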
                    Handcrafting
•   Automation
    –   Use of XML.
        •   There is no standard XML tagset (Document Type
            Definition – DTD) in use by vendors.
        •   XML has been available to web designers for the last 10
            years. Examination of websites shows little use of document
            structural elements.
    –   Web masters see themselves as artists rather than
        programmers.
    –   XML may meet the same fate as SGML, an earlier
        attempt to create structured documents.
               Handcrafting
•   Take an existing website and make it available
    to wireless access. Aether Systems, Mshift and
    2Roam currently offer these types of solutions.
•   Use a proprietary graphical interface to ease the
    development of wireless applications from
    scratch. Covigo and iConverse offer this type
    of solution.
•   Let the user do all coding in languages such as
    C++ or Java. ThinAirApps offers this type of
    solution.
      Handcrafting
•   Labor-intensive
•   Expensive
•   Typically less than 1% of a
    web site gets converted to
    wireless content.
            Transcoding
•   Most web pages have a loose repeating
    visual structure. The wireless user gets the
    same repeating information with every
    screen.
•   Browsing becomes an unfriendly experience.
•   Transcoding sends all the information to the
    wireless device, which makes browsing
    substantially slower on the wireless network.
            Transcoding

•   Transcoding was introduced in Japan
    during 1999-2000. It was widely rejected
    by Japanese users.
•   Recently, Google and Pixo introduced this
    solution for the US market, but have so far
    failed to attract the attention of end users.
     The Alternate Solution

•   Separate the content into smaller segments
•   Generate a summary of these segments
•   Prioritize these summaries from individual
    segments
•   Put the summaries together to form a
    summary of the overall document
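The four steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: segments are taken to be blank-line-separated blocks of plain text, a segment summary is simply its first few words, and priority is word count. The function names (summarize, priority, summarize_document) are hypothetical, and none of these rules come from the slides.

```python
def summarize(segment, max_words=8):
    """Low-level summary: the first few words of the segment (an assumption)."""
    words = segment.split()
    summary = " ".join(words[:max_words])
    return summary + ("..." if len(words) > max_words else "")


def priority(segment):
    """Placeholder importance score: longer segments rank higher."""
    return len(segment.split())


def summarize_document(text, max_words=8):
    # 1. Separate the content into smaller segments (here: blank-line blocks).
    segments = [s.strip() for s in text.split("\n\n") if s.strip()]
    # 2-3. Rank the segments by priority, then summarize each one.
    ranked = sorted(segments, key=priority, reverse=True)
    # 4. Put the summaries together as the document summary.
    return [summarize(s, max_words) for s in ranked]
```

A real system would replace both the summary and the priority function with something content-aware; the point here is only the shape of the pipeline.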
    Steps to Content Extraction
• Structural analysis: Understanding the
  relationship of the various segments with
  the document
• Decomposition: Breaking these
  segments down into operational units
• Contextual Analysis: Employment of
  context to revise the segmentation
                               (Continued=>)
   Steps to Content Extraction
           (Continued)
• Labeling => Segment Summary: Extraction
  of a low level summary of the segment
• Priority: Estimating importance of these
  segments
• Table of Content (TOC) => Document
  Summary: Putting together a summary of
  the document
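To make the structural steps concrete, the sketch below decomposes an HTML page into segments at block-level tag boundaries and labels each segment with its first few words to form a simple table of contents. It covers only structural analysis, decomposition, labeling and the TOC; contextual analysis and priority are left out. The choice of BLOCK_TAGS, the Segmenter class and the table_of_contents function are assumptions for illustration, not the segmentation rules of the system described in these slides.

```python
from html.parser import HTMLParser

# Tags assumed to open a new segment; the slides do not specify the rule.
BLOCK_TAGS = {"table", "div", "p", "ul", "ol", "form", "h1", "h2", "h3"}


class Segmenter(HTMLParser):
    """Split a page into text segments at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.segments = []   # finished segments, in document order
        self._current = []   # text pieces of the segment being built

    def _flush(self):
        # Join the collected text, normalize whitespace, keep non-empty text.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.segments.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def table_of_contents(html, label_words=4):
    seg = Segmenter()
    seg.feed(html)
    seg.close()
    # Label each segment with its first few words (a low-level summary).
    return [" ".join(s.split()[:label_words]) for s in seg.segments]
```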
          Content Extraction
•   Proximity Analysis: Relational analysis of
    content between segments
•   Content Classification: classification into various
    types, e.g. [stories], [navigation], [links],
    [images], [forms] etc.
•   Relationship Analysis
    –   Contextual grammar (Natural Language)
    –   Knowledge modes
    –   Information retrieval techniques
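One simple way to approximate content classification is to count a few features of each segment and apply threshold rules. The sketch below does this with Python's standard html.parser module. The SegmentStats class, the classify function, the features chosen and the thresholds (more than half of the words inside links, at most two words per link for navigation) are all assumptions for illustration; the slides do not state how the actual classifier works.

```python
from html.parser import HTMLParser


class SegmentStats(HTMLParser):
    """Count the features of one HTML segment used by the heuristic below."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0
        self.form_fields = 0
        self.words = 0
        self.link_words = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
            self._in_link += 1
        elif tag == "img":
            self.images += 1
        elif tag in ("input", "select", "textarea"):
            self.form_fields += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.split())
        self.words += n
        if self._in_link:
            self.link_words += n


def classify(segment_html):
    stats = SegmentStats()
    stats.feed(segment_html)
    stats.close()
    if stats.form_fields:
        return "forms"
    if stats.images and stats.words == 0:
        return "images"
    if stats.words and stats.link_words / stats.words > 0.5:
        # Mostly link text: short labels suggest navigation, longer ones links.
        avg = stats.link_words / stats.links
        return "navigation" if avg <= 2 else "links"
    return "stories"
```

For example, a row of one-word links such as "Home" and "Sports" is labeled navigation, while a paragraph of running text with a single "More" link is labeled stories.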
 Content Extraction: Why do we
            need it?
• Viewing any website: Any solution to web
  browsing has to be universal
• High network access: Any transformation
  has to be fast and on-the-fly
• Network Usage: Network traffic should
  not increase because of these systems
                                (Continued=>)
 Content Extraction: Why do we
      need it (continued)?
• Easy Configurability: Any such system should be
  easily configurable
• Rapid Deployment: Should be rapidly deployable
• Non-intrusive Design: Should be possible to
  transform web sites without modifying the actual
  web site
• Multiple Views: System Integrators should be
  able to create multiple views of the same site
    Advantages of Content Extraction
•   Display size
•   Locating information
•   Important content can be on top
•   Multiple levels of abstraction can be created
•   The browsing can use a demand-driven model
•   Faster download
•   More efficient use of small display areas
•   Mapping of the importance of content from the
    original document
Supported Devices and Formats
    • PDAs (HTML 3.2)
    • Cell phones
      – USA/Europe:
        • WAP
      – Japan:
        • iMode (NTT DoCoMo)
        • J-Sky (J-Phone)
        • EZWeb (KDDI)
            Conclusion
• Content from web documents can be extracted
  based on the
   –   HTML structure
   –   Proximity analysis
   –   Logical relationship analysis
   –   Information retrieval techniques
• Content can be used effectively to summarize web
  documents
   – A better option than handcrafting or transcoding
   – Produces a faster browsing experience

						