Putting Unstructured Data in its “PLACE”
EPA’s Environmental Information Symposium 2006, Savannah, GA, December 5–7, 2006
For Conference Purposes Only.
Structured versus Unstructured?
• PDF Documents, Documentum Repositories
• Tabular Data (Oracle, Excel…)
• MS Word Documents
• XML
• Tagged Data
• Websites
Structured Data:
Organized for machines to read and understand
Unstructured Data:
Free-form, organized for humans to read and understand
How do we improve access to unstructured data?
• Natural Language Processing
– Crawl repositories and ingest documents, parse and analyze them contextually for keywords, locations, et cetera…
• PLACE - Locational Data
– Match found locations against a gazetteer of known places
– Geocoding of addresses
– Tag the document with latitude and longitude of found locations
• Build a collection
– Construct a database of documents
– Create indices for keywords
– Create a spatial index for spatial searching
• Expose a search interface
– SOAP Web Service
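The steps above could be sketched roughly as follows. This is a minimal illustration, not the appliance's actual method: the toy gazetteer, its approximate coordinates, and every function name are assumptions, and a linear bounding-box scan stands in for a real spatial index.

```python
import re
from collections import defaultdict

# Toy gazetteer (assumption): place name -> (latitude, longitude) in
# decimal degrees. Coordinates are approximate, for illustration only.
GAZETTEER = {
    "savannah": (32.08, -81.09),
    "atlanta": (33.75, -84.39),
}


def tokenize(text):
    """Lowercase the text and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))


def tag_locations(text):
    """Return (name, lat, lon) for each gazetteer place named in the text."""
    words = tokenize(text)
    return [(name, lat, lon)
            for name, (lat, lon) in sorted(GAZETTEER.items())
            if name in words]


def build_index(docs):
    """Build a keyword index and a list of located points from {doc_id: text}."""
    keyword_index = defaultdict(set)
    points = []
    for doc_id, text in docs.items():
        for word in tokenize(text):
            keyword_index[word].add(doc_id)
        for _name, lat, lon in tag_locations(text):
            points.append((doc_id, lat, lon))
    return keyword_index, points


def search(keyword_index, points, keyword, bbox):
    """Return sorted doc ids that contain the keyword and have a tagged
    location inside bbox = (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = bbox
    in_box = {doc_id for doc_id, lat, lon in points
              if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon}
    return sorted(keyword_index.get(keyword.lower(), set()) & in_box)


docs = {
    "a": "Water quality report for Savannah",
    "b": "Water permits issued in Atlanta",
}
idx, pts = build_index(docs)
print(search(idx, pts, "water", (31.0, -82.0, 33.0, -80.0)))  # → ['a']
```

A production system would replace the word-set lookup with contextual language analysis and the linear scan with a true spatial index; the sketch only shows how keyword and location filters combine.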
How do we put this to use?
• Create a means of making this data available…
– Option: Standalone GUI-driven search interface (search by keyword and geography)
– Option: XML Web Service
– Others…
• XML Web Services can be deeply embedded into other applications for rich functionality
• Integration Opportunities:
– EnviroMapper
– GeoData Gateway
Unstructured Data Search In Action
1. A SOAP request performs a search on the MetaCarta Appliance via the HeadNode Web Service, using the WME bounding box and a user-supplied keyword
2. The SOAP response from the MetaCarta Appliance is returned as XML, which is parsed to populate the resultset panel with matching documents
3. The geocoded locations (latitudes and longitudes) in the XML search resultset are plotted on the WME window
4. Live URLs link to each document that matches the search keyword and falls within the map extent
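As an illustration of step 1, a keyword-plus-bounding-box request might be assembled like this. The slides do not show the real service schema, so the `Search`, `Keyword`, and `BoundingBox` element names and the `urn:example:search` namespace are hypothetical; only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SEARCH_NS = "urn:example:search"  # hypothetical service namespace


def build_search_request(keyword, min_lat, min_lon, max_lat, max_lon):
    """Return a SOAP envelope (as a string) carrying a keyword and a
    map-extent bounding box in decimal degrees."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Hypothetical operation and parameter names from here on.
    search = ET.SubElement(body, f"{{{SEARCH_NS}}}Search")
    ET.SubElement(search, f"{{{SEARCH_NS}}}Keyword").text = keyword
    bbox = ET.SubElement(search, f"{{{SEARCH_NS}}}BoundingBox")
    corners = (("MinLat", min_lat), ("MinLon", min_lon),
               ("MaxLat", max_lat), ("MaxLon", max_lon))
    for name, value in corners:
        ET.SubElement(bbox, f"{{{SEARCH_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


print(build_search_request("wetlands", 31.0, -82.0, 33.0, -80.0))
```

The resulting string would be sent in the body of an HTTP POST to the service endpoint; the endpoint address and any required SOAPAction header depend on the actual service and are not given here.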
Unstructured Data Query Results
• Query Relevance (can allow sorting of results)
• Confidence of Geographic Location
• Document Location (decimal degrees of latitude and longitude; truncated in this application)
• Document Title
• A brief abstract, automatically extracted from the first line of the document, is also returned (not shown)
• Document location (returned as URL)
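One way a client might consume these fields is sketched below: parse the result XML, drop low-confidence locations, and sort by relevance. The XML layout and element names (`Result`, `Title`, `Url`, `Relevance`, `Confidence`, `Lat`, `Lon`) are assumptions made for illustration, as are the sample values; the real response format is not shown in the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    url: str
    relevance: float   # query relevance score
    confidence: float  # confidence of the geographic location
    lat: float         # decimal degrees
    lon: float         # decimal degrees


def parse_results(xml_text, min_confidence=0.0):
    """Parse a (hypothetical) result document and return hits whose
    location confidence is at least min_confidence, most relevant first."""
    root = ET.fromstring(xml_text)
    hits = []
    for el in root.iter("Result"):
        hit = Hit(
            title=el.findtext("Title", default=""),
            url=el.findtext("Url", default=""),
            relevance=float(el.findtext("Relevance", default="0")),
            confidence=float(el.findtext("Confidence", default="0")),
            lat=float(el.findtext("Lat", default="0")),
            lon=float(el.findtext("Lon", default="0")),
        )
        if hit.confidence >= min_confidence:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.relevance, reverse=True)


# Made-up sample data for illustration.
SAMPLE = """<Results>
  <Result><Title>Permit A</Title><Url>http://example.org/a</Url>
    <Relevance>0.4</Relevance><Confidence>0.9</Confidence>
    <Lat>32.08</Lat><Lon>-81.09</Lon></Result>
  <Result><Title>Report B</Title><Url>http://example.org/b</Url>
    <Relevance>0.8</Relevance><Confidence>0.3</Confidence>
    <Lat>33.75</Lat><Lon>-84.39</Lon></Result>
</Results>"""

print([h.title for h in parse_results(SAMPLE)])  # → ['Report B', 'Permit A']
print([h.title for h in parse_results(SAMPLE, min_confidence=0.5)])  # → ['Permit A']
```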
Next Steps…
• Ingestion, processing and geo-enablement of unstructured data from websites:
– EPA websites
– Federal websites (Health, USGS, Interior, CDC…)
– State Departments of Environmental Protection, Environmental Quality, et cetera
• EPA Document repositories
– Documentum/ECMS
– EPA Libraries
Thank You
• Dave Catlin, (202) 566-0694, catlin.dave@epa.gov