Associated Poster for Demo _PowerPoint_

Geographic Information Retrieval (GIR) Ranking Methods for Digital Libraries
Ray R. Larson and Patricia Frontiera: School of Information Management & Systems and College of Environmental Design, University of California, Berkeley -- ray@sims.berkeley.edu

Geographic data are an extremely important resource for a wide range of scientists, planners, policy makers, and analysts who study natural and planned environments. Notably, the landscape of geographic analysis has been changing rapidly from data and computation poor to data and computation rich. Developments in digital electronic technologies, such as satellites, integrated GPS units, digital cameras, and miniature sensors, are dramatically increasing the types and amounts of digitally available raw geographic data and derived information products. At the same time, advances in computer hardware, software and network technologies continue to improve our ability to store and analyze these large, complex data sets.

These factors are contributing to a growing political, social, scientific and economic awareness of the value of geographic information and driving new applications for its use. In response, geographic digital libraries that specialize in providing access to these data are growing in number, collection size, and sophistication. Moreover, mainstream digital libraries, i.e. those that deal primarily with text materials, are increasingly considering geographic access methods for information resources that, while not specifically about geographic features, have important geographic characteristics. Simply stated, most of the objects in digital libraries are, to a greater or lesser extent, about or related to particular places on or near the surface of the Earth.

Place name georeferencing is extremely effective because names are the primary means by which people refer to geographic locations. However, place names have well-documented lexical and geographical problems. Lexical problems include lack of uniqueness, alternate names or spellings, and name changes. Geographical problems include boundaries that change over time, places with ambiguous boundaries, and geographic features or areas of interest without known place names. Unlike place names, geographic coordinate representations provide an unambiguous and persistent method for locating geographic areas or features. However, the use of coordinates presents many challenges in terms of storage, indexing, processing and user interface design that have only recently begun to be investigated in the context of geographic information retrieval (GIR).

Geographic Information Retrieval (GIR)
•Definitions: Geographic information retrieval (GIR) is concerned with spatial approaches to the retrieval of geographically referenced, or georeferenced, information objects (GIOs).

The Geographic Footprint
•In GIR applications, the geographic footprint is typically the only quantitative spatial characteristic that is encoded and utilized.
•The footprint is a geometric representation of the extent of the geographic content of the information object being described. It is usually expressed in geographic coordinates (i.e. latitude and longitude):
   •Points: maintain a general sense of location but not extent or shape
   •Polygons: identify location, extent, and shape with varying degrees of precision
      –The minimum aligned bounding rectangle (MBR) is the most commonly used polygonal spatial representation in GIR systems.

Spatial query formation
•A spatial approach to GIR requires a geographic interface to support spatial thinking and query formation.

Spatial Similarity Measures Matching and Spatial Ranking
•Spatial similarity can be considered an indicator of relevance: documents whose spatial content is more similar to the spatial content of the query will be considered more relevant to the information need represented by the query.
•Need to consider both qualitative, non-geometric spatial attributes and quantitative, geometric spatial attributes
•Three basic approaches to spatial similarity measures and ranking

Method 1: Simple Overlap
•Candidate geographic information objects (GIOs) that have any overlap with the query region are retrieved.
•Included in the result set are any GIOs that are contained within, overlap, or contain the query region.
•The spatial score for all GIOs is either relevant (1) or not relevant (0).
•The result set cannot be ranked
   –topological relationship only, no metric refinement

Method 2: Topological Overlap
•Spatial searches are constrained to only those candidate GIOs that either:
   •are completely contained within the query region,
   •overlap with the query region,
   •or contain the query region.
•Each category is exclusive and all retrieved items are considered relevant.
•The result set cannot be ranked
   •categorized topological relationship only,
   •no metric refinement

Method 3: Degree of Overlap
•Candidate geographic information objects (GIOs) that have any overlap with the query region are retrieved.
•A spatial similarity score is determined based on the degree to which the candidate GIO overlaps with the query region.
•The greater the overlap with respect to the query region, the higher the spatial similarity score.
•This method provides a score by which the result set can be ranked
   •topological relationship: overlap
   •metric refinement: area of overlap

Our approach is a probabilistic estimate of the probability of relevance, based on logistic regression from a sample of data with relevance judgements.

$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{m} c_i X_i$$

•$X_1$ = area of overlap(query region, candidate GIO) / area of query region
•$X_2$ = area of overlap(query region, candidate GIO) / area of candidate GIO
•$X_3$ = 1 - abs(fraction of overlap region that is onshore - fraction of candidate GIO that is onshore)
•Where: the range for all variables is 0 (not similar) to 1 (same)
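The overlap variables X1 and X2 and the regression score can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names MBR, overlap_area, overlap_features and relevance_probability are hypothetical, regions are treated as axis-aligned rectangles in planar coordinates, the linear combination is assumed to be a log-odds value that the logistic function maps to a probability, and the coefficient values are placeholders rather than fitted values from the poster. X3 is omitted because it needs shoreline data.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MBR:
    """Axis-aligned bounding rectangle in planar coordinates (hypothetical helper)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        return max(0.0, self.max_x - self.min_x) * max(0.0, self.max_y - self.min_y)


def overlap_area(a: MBR, b: MBR) -> float:
    """Area of the intersection of two rectangles (0.0 if they do not overlap)."""
    width = min(a.max_x, b.max_x) - max(a.min_x, b.min_x)
    height = min(a.max_y, b.max_y) - max(a.min_y, b.min_y)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def overlap_features(query: MBR, candidate: MBR) -> Tuple[float, float]:
    """Return (X1, X2): overlap as a fraction of the query and of the candidate."""
    shared = overlap_area(query, candidate)
    x1 = shared / query.area() if query.area() > 0 else 0.0
    x2 = shared / candidate.area() if candidate.area() > 0 else 0.0
    return x1, x2


def relevance_probability(features: Sequence[float],
                          coefficients: Sequence[float],
                          intercept: float) -> float:
    """Treat c0 + sum(c_i * X_i) as log-odds (assumption) and map it into (0, 1)."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Usage with made-up rectangles and placeholder coefficients.
query = MBR(0.0, 0.0, 4.0, 4.0)        # area 16
candidate = MBR(2.0, 2.0, 6.0, 4.0)    # area 8; shares a 2 x 2 square with the query
x1, x2 = overlap_features(query, candidate)
p = relevance_probability((x1, x2), coefficients=(1.0, 1.0), intercept=-1.0)
```

Checking overlap_area(query, candidate) > 0 on its own gives the binary relevant / not relevant decision of Method 1, while X1 and X2 add the metric refinement that makes ranking possible.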
•Georeferenced information objects (GIOs) are information objects about specific regions or features on or near the surface of the Earth.
•Geospatial data are a special type of georeferenced information that encodes a specific geographic feature or set of features along with associated attributes
   –maps, air photos, satellite imagery, digital geographic data, etc.

Spatial Queries – key issues
•Communicating with the user: if the user selects a place name from a list, what type of geometric approximation is used to represent the query region (a point, a simple bounding box polygon, a complex polygon)?
•Level of detail in a graphic interface needs to be sufficient to support geographic queries.
•How can queries for more complex spatial characteristics be supported?
   –Density, dispersion, pattern

•Test Data
[Figure panels: Counties, Cities, Bioregions, National Parks, National Forests, Water QCB Regions]
•2554 metadata records indexed by 322 unique geographic regions (represented as MBRs) and associated place names.
   –2072 records (81%) indexed by 141 unique CA place names
      •881 records indexed by 42 unique counties (out of a total of 46 unique counties indexed in CEIC collection)
      •427 records indexed by 76 cities (of 120)
      •179 records by 8 bioregions (of 9)
      •3 records by 2 national parks (of 5)
      •309 records by 11 national forests (of 11)
      •3 records by 1 regional water quality control board region (of 1)
      •270 records by 1 state (CA)
   –482 records (19%) indexed by 179 unique user defined areas (approx 240) for regions within or overlapping CA
      •12% represent onshore regions (within the CA mainland)
      •88% (158 of 179) offshore or coastal regions
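The record counts and percentages in the Test Data list can be cross-checked with a little arithmetic. The short Python sketch below only re-adds figures quoted in the list; the variable names are illustrative, and the onshore count is derived here as 179 minus 158 rather than stated in the source.

```python
# Record counts and region counts per place-name category, copied from the list.
place_name_records = {
    "counties": 881,
    "cities": 427,
    "bioregions": 179,
    "national parks": 3,
    "national forests": 309,
    "water quality control board regions": 3,
    "state": 270,
}
place_name_regions = {
    "counties": 42,
    "cities": 76,
    "bioregions": 8,
    "national parks": 2,
    "national forests": 11,
    "water quality control board regions": 1,
    "state": 1,
}
total_records = 2554
user_defined_records = 482
user_defined_areas = 179
offshore_or_coastal_areas = 158


def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


place_name_total = sum(place_name_records.values())       # records by place name
unique_place_names = sum(place_name_regions.values())     # distinct named regions
onshore_areas = user_defined_areas - offshore_or_coastal_areas  # derived, not quoted

place_name_share = percent(place_name_total, total_records)
user_defined_share = percent(user_defined_records, total_records)
offshore_share = percent(offshore_or_coastal_areas, user_defined_areas)
onshore_share = percent(onshore_areas, user_defined_areas)
```

The category counts add up to the 2072 records and 141 place names quoted in the list, and the two record groups together account for all 2554 records.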
•Geographic Approximations for CA Counties, UDAs, and training sample:
[Figure panels: MBRs, Convex Hulls]
42 of 58 counties referenced in the test collection metadata
   •10 counties randomly selected as query regions to train LR model
   •32 counties used as query regions to test model
Ave. False Area of Approximation:
   MBRs: 94.61%

Spatial Query Example: 1st and 2nd generation interfaces from the FGDC/NSDI Efforts
[Figure: Geodata.gov, NSDI Clearinghouse]

Georeferencing and GIR
•Within a GIR system, e.g., a geographic digital library, information objects can be georeferenced by place names or by geographic coordinates (i.e. longitude & latitude)
[Figure: place name "San Francisco Bay Area"; coordinates -122.418, 37.775]

GIR is not GIS
•GIS is concerned with spatial representations, relationships, and analysis at the level of the individual spatial object or field.
•GIR is concerned with the retrieval of geographic information resources (and geographic information objects at the set level) that may be
                                                                                                                                                                                                                                                                                                                                                                                                                                                         26.73%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Convex Hulls:


    relevant to a geographic query region.                                                                                                                                                                                                                                                                                                                                                                               Logistic Regression model #1
                                                                                                                                                                                                                                                                                                                 The Geodata.gov site provides better support for a query on wetlands near Petaluma because of the
    Spatial Approaches to GIR                                                                                                                                                                                                                                                                                   increase cartographic detail that appears as the user zooms in. ( you can’t even find Petaluma on the    •X1 = area of overlap(query region, candidate GIO) / area of query region
                                                                                                                                                                                                                                                                                                                                NSDI site – and you can get even more lost if you zoom in further).
    •A spatial approach to geographic information retrieval is one based on the integrated use of spatial representations, and spatial                                                                                                                                                                                                                                                                                   •X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
    relationships.
                                                                                                                                                                                                                                                                                                            Presentation of Results and Spatial evaluation                                                               •Where:
                                                                                                                                                                                                                                                                                                            •The user interface for a GIR system requires a geographic interface to                                          Range for all variables is 0 (not similar) to 1 (same)
    •A spatial approach to GIR can be qualitative or quantitative                                                                                                                                                                                                                                           support the formation of spatial queries, this component is also needed
        –Quantitative: based on the geometric spatial properties of a geographic information object                                                                                                                                                                                                         to support the presentation of results to the user and to assist the user                                    •Results in Mean Average Query Precision: (the average precision values after each new relevant document is observed in a ranked list.)
        –Qualitative: based on the non-geometric spatial properties.                                                                                                                                                                                                                                        in evaluating the results. This type of display also helps the user                                             •For metadataindexed by CA named place regions,                                                     and For all metadata in the test collection:
                                                                                                                                                                                                                                                                                                            understand how the system has interpreted the geospatial aspect of his
           •Based on the coordinate encoding of spatial representations                                                                                                                                                                                                                                                                                                                                                                                                                                               These results suggest:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      •Convex Hulls perform better than MBRs
                                                                                                                                                                                                                                                                                                            query and matched it against the available information objects.
           •The coordinate representation is used to geometrically determine:                                                                                                                                                                                                                                                                                                                                                                                                                                                •Expected result given that the CH is a higher quality
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             approximat
                •topological spatial relationships,                                                                                                                                                                                                                                                         Example: Results display from CheshireGeo:                                                                                                                                                                •A probabilistic ranking based on MBRs can perform as well if
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      not better than a non-probabiliistic ranking method based on
                   •metric spatial characteristics                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Convex Hulls
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      •Since any approximation other than the MBR requires great
    Topological Relationships                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         expense, this suggests that the exploration of new ranking
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      methods based on the MBR are a good way to go.

    •Relative spatial relations without concern for measureable distance or absolute direction.
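The overlap ratios X1 and X2 defined for Logistic Regression model #1 can be illustrated for the MBR case with a minimal sketch. The `MBR` type, the function names, and the planar treatment of coordinates are assumptions made for this example and are not taken from the poster. The logistic function is shown only in its generic form: the fitted coefficients are not given in the poster, so `b0`, `b1`, and `b2` are hypothetical parameters that would have to come from a trained model.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MBR:
    """Axis-aligned minimum bounding rectangle (hypothetical helper type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        # Planar area; a real system would account for map projection.
        return max(0.0, self.max_x - self.min_x) * max(0.0, self.max_y - self.min_y)


def overlap_area(a: MBR, b: MBR) -> float:
    """Area shared by two rectangles, or 0.0 if they do not intersect."""
    width = min(a.max_x, b.max_x) - max(a.min_x, b.min_x)
    height = min(a.max_y, b.max_y) - max(a.min_y, b.min_y)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def overlap_features(query: MBR, candidate: MBR) -> Tuple[float, float]:
    """Return (X1, X2): overlap / query area and overlap / candidate area."""
    shared = overlap_area(query, candidate)
    x1 = shared / query.area() if query.area() > 0 else 0.0
    x2 = shared / candidate.area() if candidate.area() > 0 else 0.0
    return x1, x2


def relevance_probability(x1: float, x2: float,
                          b0: float, b1: float, b2: float) -> float:
    """Generic logistic function; b0, b1, b2 are hypothetical coefficients."""
    log_odds = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-log_odds))


query = MBR(0.0, 0.0, 4.0, 4.0)
candidate = MBR(2.0, 2.0, 6.0, 4.0)
print(overlap_features(query, candidate))  # (0.25, 0.5)
```

Both ratios fall in the range 0 to 1 stated above: they are 0 when the regions are disjoint and 1 when the two rectangles coincide.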
    Figure: Possible topological relationships between two overlapping regions

    Spatial Representation - Considerations
    •What spatial characteristics should be represented and how?
    •How much coordinate detail is needed to represent geographic features or regions?
    •How can imprecise geographic regions be represented?
    •How do the representation choices impact the types of queries that can be asked and the spatial operations that can be performed within the system?
    •How do the representation choices impact storage and system performance?
    •How do the representation choices impact those people who will need to catalog (i.e. encode) these representations and those who will use these representations?

    Geometric Approximations
    •The decomposition of spatial objects into approximate representations is a common approach to simplifying complex and often multi-part coordinate representations
    •Types of Geometric Approximations
        •Conservative: superset
        •Progressive: subset
        •Generalizing: could be either
        •Concave or Convex
             –Geometric operations are faster on convex polygons
    1) Minimum Bounding Circle (3)   2) MBR: Minimum aligned Bounding rectangle (4)   3) Minimum Bounding Ellipse (5)

    Spatial Ranking methods from the Literature:

    One factor missing from the Logistic Regression Model #1 results is how much of the areas are offshore and how much onshore. So we add a Shorefactor variable…

    Logistic Regression Model #2
    •X1 = area of overlap(query region, candidate GIO) / area of query region
    •X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
    •X3 = 1 – abs(fraction of overlap region that is onshore – fraction of candidate GIO that is onshore)

    Shorefactor = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore)

    Figure labels: Onshore Areas; Candidate GIO MBRs (A, B); Query Region MBR (Q)
        A) GLORIA Quad 13: fraction onshore = .55
        B) WATER Project Area: fraction onshore = .74
        Q) Santa Clara County: fraction onshore = .95

    Computing Shorefactor:
        Q – A Shorefactor: 1 – abs(.95 - .55) = .60
        Q – B Shorefactor: 1 – abs(.95 - .74) = .79
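The Shorefactor arithmetic above can be reproduced with a short sketch. The function name and the input validation are assumptions added for illustration. The onshore fractions are the ones given on the poster for the Santa Clara County query region and the two candidate GIOs.

```python
def shorefactor(query_fraction_onshore: float,
                candidate_fraction_onshore: float) -> float:
    """Return 1 - |query onshore fraction - candidate onshore fraction|."""
    for value in (query_fraction_onshore, candidate_fraction_onshore):
        if not 0.0 <= value <= 1.0:
            raise ValueError("onshore fractions must lie between 0 and 1")
    return 1.0 - abs(query_fraction_onshore - candidate_fraction_onshore)


# Onshore fractions as listed on the poster.
query_onshore = 0.95  # Q) Santa Clara County
candidates = {
    "GLORIA Quad 13": 0.55,      # A)
    "WATER Project Area": 0.74,  # B)
}

for name, fraction in candidates.items():
    # Rounded to two decimals to match the poster's figures (.60 and .79).
    print(name, round(shorefactor(query_onshore, fraction), 2))
```

A value near 1 means the query region and the candidate have similar onshore proportions, so the WATER Project Area is the closer match to Santa Clara County on this variable.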


    Results for both models over all test data…

    These results suggest:
    •Addition of the Shorefactor variable improves the model (LR 2), especially for MBRs
    •Improvement is not so dramatic for convex hull approximations, because the problem that Shorefactor addresses is not that significant when areas are represented by convex hulls.

    Acknowledgements
    This research was sponsored at U.C. Berkeley by the National Science Foundation and the Joint Information Systems Committee (UK) under the International Digital Libraries Program award #IIS-99755164. Additional support was provided by the Institute for Museum and Library Services as part of the “Going Places in the Catalog” project.

    Conservative Approximations
    4) Rotated minimum bounding rectangle (5)   5) 4-corner convex polygon (8)   6) Convex hull (varies)
    Presented in order of increasing quality. Number in parentheses denotes number of parameters needed to store representation. After Brinkhoff et al, 1993b

    Goals of a geometric Approximation
        •Quality: The quality criterion concerns maximizing the degree to which the approximation models the complex spatial object from which it was derived.
             •A simple measure of quality is area: the greater the difference in area between an approximation and a spatial object, the poorer the quality. Given this definition, a point representation is the poorest quality approximation of a geographic region or feature, since all of these have area.
        •Simplicity: The simplicity criterion concerns maximizing the computational efficiencies of storage, disk access, and computational geometry operations that utilize the approximation.
             –Factors that contribute to simplicity include: fewer points needed to encode the representation, fewer parts, simple polygons without holes, and convexity rather than concavity.
             –In the context of GIR, simplicity also has to involve a consideration of how easy it is to create (catalog) the approximation and for the end-user to understand the approximation.
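The area-based quality criterion described under Goals of a geometric Approximation can be made concrete with a small sketch that compares an MBR and a convex hull for the same polygon. The poster does not define how its "False Area of Approximation" percentages were computed. The normalization used here, (approximation area minus object area) divided by object area, is an assumption, as are all function names and the L-shaped test polygon. The convex hull uses Andrew's monotone chain, a standard algorithm chosen for this example rather than one named on the poster.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(ring: List[Point]) -> float:
    """Absolute area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def mbr_area(ring: List[Point]) -> float:
    """Area of the axis-aligned minimum bounding rectangle of the points."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def false_area(approximation_area: float, object_area: float) -> float:
    """Assumed normalization: extra area relative to the object's own area."""
    return (approximation_area - object_area) / object_area


# Hypothetical L-shaped region: a 4 x 4 square with a 2 x 2 corner removed.
region = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
object_area = polygon_area(region)
print("MBR false area:", false_area(mbr_area(region), object_area))
print("Convex hull false area:",
      false_area(polygon_area(convex_hull(region)), object_area))
```

For this shape the convex hull adds less false area than the MBR, which matches the direction of the averages reported on the poster (26.73% for convex hulls against 94.61% for MBRs).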

				