Geographic Information Retrieval (GIR) Ranking Methods for Digital Libraries
Ray R. Larson and Patricia Frontiera: School of Information Management & Systems and College of Environmental Design, University of California, Berkeley -- firstname.lastname@example.org
Geographic data are an extremely important resource for a wide range of scientists, planners, policy makers, and analysts who study natural and planned environments. Notably, the landscape of geographic analysis has been changing rapidly from data and computation poor to data and computation rich. Developments in digital electronic technologies, such as satellites, integrated GPS units, digital cameras, and miniature sensors, are dramatically increasing the types and amounts of digitally available raw geographic data and derived information products. At the same time, advances in computer hardware, software and network technologies continue to improve our ability to store and analyze these large, complex data sets.

These factors are contributing to a growing political, social, scientific and economic awareness of the value of geographic information and driving new applications for its use. In response, geographic digital libraries that specialize in providing access to these data are growing in number, collection size, and sophistication. Moreover, mainstream digital libraries, i.e. those that deal primarily with text materials, are increasingly considering geographic access methods for information resources that, while not specifically about geographic features, have important geographic characteristics. Simply stated, most of the objects in digital libraries are, to a greater or lesser extent, about or related to particular places on or near the surface of the Earth.

Place name georeferencing is extremely effective because names are the primary means by which people refer to geographic locations. However, place names have well-documented lexical and geographical problems. Lexical problems include lack of uniqueness, alternate names or spellings, and name changes. Geographical problems include boundaries that change over time, places with ambiguous boundaries, and geographic features or areas of interest without known place names. Unlike place names, geographic coordinate representations provide an unambiguous and persistent method for locating geographic areas or features. However, the use of coordinates presents many challenges in terms of storage, indexing, processing and user interface design that have only recently begun to be investigated in the context of geographic information retrieval (GIR).

Geographic Information Retrieval (GIR)
•Definitions: Geographic information retrieval (GIR) is concerned with spatial approaches to the retrieval of geographically referenced, or georeferenced, information objects (GIOs):
•Information objects that are about specific regions or features on or near the surface of the Earth.
•Geospatial data are a special type of georeferenced information that encodes a specific geographic feature or set of features along with associated attributes
–maps, air photos, satellite imagery, digital geographic data, etc.

The Geographic Footprint
•In GIR applications, the geographic footprint is typically the only quantitative spatial characteristic that is encoded and utilized.
•The footprint is a geometric representation of the extent of the geographic content of the information object being described. Usually expressed in geographic coordinates (i.e. latitude and longitude)
•Points: maintain a general sense of location but not extent or shape
•Polygons: identify location, extent, and shape with varying degrees of precision
–The minimum aligned bounding rectangle (MBR) is the most commonly used polygonal spatial representation in GIR systems.

Spatial query formation
•A spatial approach to GIR requires a geographic interface to support spatial thinking and query formation.

Spatial Queries – key issues
•Communicating with the user: if the user selects a place name from a list, what type of geometric approximation is used to represent the query region (a point, a simple bounding box polygon, a complex polygon)?
•Level of detail in a graphic interface needs to be sufficient to support geographic queries.
•How can queries for more complex spatial characteristics be supported?
–Density, dispersion, pattern

Test Data
•2554 metadata records indexed by 322 unique geographic regions (represented as MBRs) and associated place names.
–2072 records (81%) indexed by 141 unique CA place names
•881 records indexed by 42 unique counties (out of a total of 46 unique counties indexed in CEIC collection)
•427 records indexed by 76 cities (of 120)
•179 records by 8 bioregions (of 9)
•3 records by 2 national parks (of 5)
•309 records by 11 national forests (of 11)
•3 records by 1 regional water quality control board region (of 1)
•270 records by 1 state (CA)
–482 records (19%) indexed by 179 unique user defined areas (approx 240) for regions within or overlapping CA
•12% represent onshore regions (within the CA mainland)
•88% (158 of 179) offshore or coastal regions
[Figure: maps of the test regions. Panels: Counties, Cities, Bioregions, National Parks, National Forests, Water QCB Regions]

Spatial Similarity Measures
•Spatial similarity can be considered an indicator of relevance: documents whose spatial content is more similar to the spatial content of the query will be considered more relevant to the information need represented by the query.
•Need to consider both: qualitative, non-geometric spatial attributes and quantitative, geometric spatial attributes

Matching and Spatial Ranking
•Three basic approaches to spatial similarity measures and ranking

Method 1: Simple Overlap
•Candidate geographic information objects (GIOs) that have any overlap with the query region are retrieved.
•Included in the result set are any GIOs that are contained within, overlap, or contain the query region.
•The spatial score for all GIOs is either relevant (1) or not relevant (0).
•The result set cannot be ranked
–topological relationship only, no metric refinement

Method 2: Topological Overlap
•Spatial searches are constrained to only those candidate GIOs that either:
–are completely contained within the query region,
–overlap with the query region,
–or, contain the query region.
•Each category is exclusive and all retrieved items are considered relevant.
•The result set cannot be ranked
–categorized topological relationship only,
–no metric refinement

Method 3: Degree of Overlap
•Candidate geographic information objects (GIOs) that have any overlap with the query region are retrieved.
•A spatial similarity score is determined based on the degree to which the candidate GIO overlaps with the query region.
•The greater the overlap with respect to the query region, the higher the spatial similarity score.
•This method provides a score by which the result set can be ranked
–topological relationship: overlap
–metric refinement: area of overlap

Our approach is a probabilistic estimate of the probability of relevance, based on logistic regression from a sample of data with relevance judgements.

$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{m} c_i X_i$$

•X1 = area of overlap(query region, candidate GIO) / area of query region
•X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
•X3 = 1 - abs(fraction of overlap region that is onshore - fraction of candidate GIO that is onshore)
•Where: Range for all variables is 0 (not similar) to 1 (same)
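As a rough illustration of the three overlap-based methods and of the variables X1, X2 and X3, the following Python sketch works on MBRs only. It is not the implementation used for the poster: the `MBR` class, all function names and any coefficient values are hypothetical, and the sketch assumes that the linear combination of the X variables is the log-odds of relevance, as in standard logistic regression, which is then converted to a probability.

```python
from dataclasses import dataclass
from math import exp
from typing import Sequence


@dataclass(frozen=True)
class MBR:
    """Axis-aligned bounding rectangle (hypothetical helper type)."""
    west: float
    south: float
    east: float
    north: float

    def area(self) -> float:
        return max(0.0, self.east - self.west) * max(0.0, self.north - self.south)


def overlap_area(q: MBR, g: MBR) -> float:
    """Area shared by two MBRs; 0.0 when they are disjoint or only touch."""
    width = min(q.east, g.east) - max(q.west, g.west)
    height = min(q.north, g.north) - max(q.south, g.south)
    return width * height if width > 0 and height > 0 else 0.0


def simple_overlap(q: MBR, g: MBR) -> int:
    """Method 1: 1 if the candidate shares any area with the query, else 0."""
    return 1 if overlap_area(q, g) > 0 else 0


def topological_category(q: MBR, g: MBR) -> str:
    """Method 2: one exclusive category per candidate, no metric refinement.

    Assumes both rectangles have positive area; identical rectangles are
    reported as "within".
    """
    if overlap_area(q, g) == 0:
        return "disjoint"  # not retrieved
    if q.west <= g.west and g.east <= q.east and q.south <= g.south and g.north <= q.north:
        return "within"    # candidate lies completely inside the query region
    if g.west <= q.west and q.east <= g.east and g.south <= q.south and q.north <= g.north:
        return "contains"  # candidate contains the query region
    return "overlaps"


def x1(q: MBR, g: MBR) -> float:
    """X1, also usable as the Method 3 score: overlap area / query area."""
    return overlap_area(q, g) / q.area() if q.area() > 0 else 0.0


def x2(q: MBR, g: MBR) -> float:
    """X2: overlap area / candidate GIO area."""
    return overlap_area(q, g) / g.area() if g.area() > 0 else 0.0


def x3(onshore_a: float, onshore_b: float) -> float:
    """X3: 1 - abs(difference of two onshore fractions), each in [0, 1]."""
    return 1.0 - abs(onshore_a - onshore_b)


def relevance_probability(c0: float, coeffs: Sequence[float], xs: Sequence[float]) -> float:
    """Logistic estimate, assuming c0 + sum(c_i * X_i) is the log-odds."""
    if len(coeffs) != len(xs):
        raise ValueError("one coefficient is needed per variable")
    log_odds = c0 + sum(c * x for c, x in zip(coeffs, xs))
    return 1.0 / (1.0 + exp(-log_odds))
```

For a query `MBR(0, 0, 4, 4)` and a candidate `MBR(2, 2, 6, 4)`, the overlap area is 4, so X1 is 0.25, X2 is 0.5, and Method 2 reports "overlaps". In practice the coefficients passed to `relevance_probability` would come from fitting a logistic regression to judged query and document pairs; none are given here.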
Georeferencing and GIR
•Within a GIR system, e.g., a geographic digital library, information objects can be georeferenced by place names or by geographic coordinates (i.e. longitude & latitude)

GIR is not GIS
•GIS is concerned with spatial representations, relationships, and analysis at the level of the individual spatial object or field.
•GIR is concerned with the retrieval of geographic information resources (and geographic information objects at the set level) that may be relevant to a geographic query region.

Spatial Approaches to GIR
•A spatial approach to geographic information retrieval is one based on the integrated use of spatial representations, and spatial
•A spatial approach to GIR can be qualitative or quantitative
–Quantitative: based on the geometric spatial properties of a geographic information object
–Qualitative: based on the non-geometric spatial properties.
•Based on the coordinate encoding of spatial representations
•The coordinate representation is used to geometrically determine:
–topological spatial relationships,
–metric spatial characteristics

Topological Relationships
•Relative spatial relations without concern for measurable distance or absolute direction.
•Possible topological relationships between two overlapping regions:

Spatial Query Example: 1st and 2nd generation interfaces from the FGDC/NSDI Efforts:
[Figure: San Francisco Bay Area shown in two query interfaces. Panels: Geodata.gov, NSDI Clearinghouse]
The Geodata.gov site provides better support for a query on wetlands near Petaluma because of the increased cartographic detail that appears as the user zooms in. (You can't even find Petaluma on the NSDI site, and you can get even more lost if you zoom in further.)

Presentation of Results and Spatial evaluation
•The user interface for a GIR system requires a geographic interface to support the formation of spatial queries. This component is also needed to support the presentation of results to the user and to assist the user in evaluating the results. This type of display also helps the user understand how the system has interpreted the geospatial aspect of the query and matched it against the available information objects.
Example: Results display from CheshireGeo:

•Geographic Approximations for CA Counties, UDAs, and training sample:
[Figure panels: MBRs, Convex Hulls]
42 of 58 counties referenced in the test collection metadata
• 10 counties randomly selected as query regions to train LR model
• 32 counties used as query regions to test model
Ave. False Area of Approximation:
MBRs: 94.61%

Logistic Regression model #1
•X1 = area of overlap(query region, candidate GIO) / area of query region
•X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
•Where: Range for all variables is 0 (not similar) to 1 (same)
•Results in Mean Average Query Precision (the average of the precision values obtained after each new relevant document is observed in a ranked list).
•For metadata indexed by CA named place regions, and for all metadata in the test collection:

These results suggest:
•Convex Hulls perform better than MBRs
–Expected result given that the CH is a higher quality approximation
•A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on Convex Hulls
•Since any approximation other than the MBR requires great expense, this suggests that the exploration of new ranking methods based on the MBR is a good way to go.

One factor missing from these results is how much of each area is offshore and how much onshore. So we add a Shorefactor variable…
Logistic Regression Model #2
•X1 = area of overlap(query region, candidate GIO) / area of query region
•X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
•X3 = 1 - abs(fraction of overlap region that is onshore - fraction of candidate GIO that is onshore)
Shorefactor = 1 - abs(fraction of query region approximation that is onshore - fraction of candidate GIO approximation that is onshore)

Candidate GIO MBRs
A) GLORIA Quad 13: fraction onshore = .55
B) WATER Project Area: fraction onshore = .74
Query Region MBR
Q) Santa Clara County: fraction onshore = .95
Computing Shorefactor:
Q - A Shorefactor: 1 - abs(.95 - .55) = .60
Q - B Shorefactor: 1 - abs(.95 - .74) = .79

Spatial Ranking methods from the Literature:

Spatial Representation - Considerations
•What spatial characteristics should be represented and how?
•How much coordinate detail is needed to represent geographic features or regions?
•How can imprecise geographic regions be represented?
•How do the representation choices impact the types of queries that can be asked and the spatial operations that can be performed within the system?
•How do the representation choices impact storage and system performance?
•How do the representation choices impact those people who will need to catalog (i.e. encode) these representations and those who will use these representations?

Geometric Approximations
•The decomposition of spatial objects into approximate representations is a common approach to simplifying complex and often multi-part coordinate representations
•Types of Geometric Approximations
–Conservative: superset
–Progressive: subset
–Generalizing: could be either
•Concave or Convex
–Provide faster geometric operations on convex polygons
1) Minimum Bounding Circle (3)
2) MBR: Minimum aligned bounding rectangle (4)
3) Minimum Bounding Ellipse (5)
4) Rotated minimum bounding rectangle (5)
5) 4-corner convex polygon (8)
6) Convex hull (varies)
Conservative Approximations (after Brinkhoff et al., 1993b). Presented in order of increasing quality. Number in parentheses denotes number of parameters needed to store representation.

Goals of a geometric Approximation
•Quality: The quality criterion concerns maximizing the degree to which the approximation models the complex spatial object from which it was derived.
–A simple measure of quality is area: the greater the difference in area between an approximation and a spatial object, the poorer the quality. Given this definition, a point representation is the poorest quality approximation of a geographic region or feature, since all of these have area.
•Simplicity: The simplicity criterion concerns maximizing the computational efficiencies of storage, disk access, and computational geometry operations that utilize the approximation.
–Factors that contribute to simplicity include: fewer points needed to encode the representation; fewer parts, simple polygons without holes, and convexity rather than concavity.
–In the context of GIR, simplicity also has to involve a consideration of how easy it is to create (catalog) the approximation and for the end-user to understand the

Results for both models over all test data…
These results suggest:
•Addition of Shorefactor variable improves the model (LR 2), especially for MBRs
•Improvement not so dramatic for convex hull approximations, because the problem that Shorefactor addresses is not that significant when areas are represented by convex hulls.

Acknowledgements
This research was sponsored at U.C. Berkeley by the National Science Foundation and the Joint Information Systems Committee (UK) under the International Digital Libraries Program award #IIS-99755164. Additional support was provided by the Institute for Museum and Library Services as part of the “Going Places in the Catalog” project.