					            Build the UK’s COINS in the
            Data Science Library Cloud

                        Brand Niemann
                           US EPA
                         June 9, 2010
                http://semanticommunity.net

Disclaimer: These slides do not reflect the views of the U.S. Environmental Protection Agency
and do not constitute endorsement by the EPA of the standards or products mentioned.        1
                     Overview
•   The Challenge
•   The Data.gov.uk Program
•   The Expert and His Advice
•   The Cloud Tools
•   The Inspiration
•   The Data Sources
•   Other Sources of Data
•   The Process
•   The Results
•   Comments
•   Acknowledgements
•   References
                                2
                              The Challenge
•   Tim Berners-Lee "Bag of Chips" talk:
     –   http://www.youtube.com/watch?v=ga1aSJXCFe0
           •   To get five stars: (1) expose your data, (2) provide it in a machine-readable format (Excel),
               (3) provide it as CSV, (4) provide it at a permanent URL, and (5) provide metadata.
•   Nigel_Shadbolt: Lots of eyeballs poring over COINS:
     –   http://bit.ly/b8XQGB - opendata in the wild - more functionality all the time.
           •   http://twitter.com/Nigel_Shadbolt/status/15419573652
•   bniemannsr: @jahendler Hope data.gov evolves from quantity (500,000 datasets) to
    quality (data science applications):
     –   http://twitter.com/bniemannsr/status/15334914269
           •   Note: Now data.gov says only 272,677.
•   jahendler: @bniemannsr sure, but check out the Sem Web and Apps sections - lots
    of stuff there that prototypes what we could do #websci:
     –   http://twitter.com/jahendler/status/15335026437
•   bniemannsr: @jahendler Did, but neat prototypes don't improve data quality-data
    science does:
     –   http://radar.oreilly.com/2010/06/wha...a-science.html.
           •   http://twitter.com/bniemannsr/status/15549816659
•   eGovernment Interest Group Teleconference, 04 Jun 2010:
     –   http://www.w3.org/2010/06/04-egov-minutes.html Excerpts: Cory Casanave: Can't see the
         Web of Data:
           •   Cory to write up requirements/wishlist for generic Web of Data browser. See Supporting the Linked
               Data Consumer.
                                                                                                                   3
           http://gaininitiative.wik.is/United_Kingdom#The_Challenge
     The Data.gov.uk Program
• Advised by Sir Tim Berners-Lee and Professor
  Nigel Shadbolt and others, government is
  opening up data for reuse. This site seeks to
  give a way into the wealth of government data
  and is under constant development. We want to
  work with you to make it better.
• We’re very aware that there are more people
  like you outside of government who have the
  skills and abilities to make wonderful things out
  of public data. These are our first steps in
  building a collaborative relationship with you.
   http://gaininitiative.wik.is/United_Kingdom#The_Data.gov_Program   4
The Data.gov.uk Program




        http://data.gov.uk/   5
The Data.gov.uk Program




  http://data.gov.uk/blog/finance-data-coins-goes-live   6
     The Expert and His Advice
• Edward Tufte Presidential appointment
  announced by White House, March 5, 2010.
• Tufte Comment on iPhone interface design:
  Better to have users looking over material
  adjacent in space within our eyespan rather than
  stacked in time. This is especially the case for
  statistical data, where the fundamental analytical
  task is to make comparisons. Also see page 159
  in the above book reference.
  http://gaininitiative.wik.is/United_Kingdom#The_Expert_and_His_Advice
                                                                          7
The Cloud Tools




   http://cloud.mindtouch.com/
                                 8
The Cloud Tools




http://gaininitiative.wik.is/United_Kingdom
                                              9
The Cloud Tools




  http://spotfire.tibco.com/
                               10
    The Cloud Tools




http://ondemand.spotfire.com/public/Help/index.htm   11
            The Inspiration




H1N1 Spread Courtesy of TIBCO Spotfire. See Web Player.
                                                          12
      The Inspiration




http://www.wheredoesmymoneygo.org/dashboard/   13
                  The Inspiration
• What is data science? Analysis: The future belongs to
  the companies and people that turn data into products.
  Mike Loukides.
   – http://radar.oreilly.com/2010/06/wha...a-science.html.
• My Response: Please see my Data Science Library in
  the Cloud: http://ondemand.spotfire.com/public/...VL-
  4372/public and my suggestion that The 2010 Health 2.0
  Developer Challenge should build a community health
  data science library-see http://federaldata.wik.is/ June
  3rd: http://twitter.com/bniemannsr/status/15482514867 and
  http://www.hhs.gov/open/discussion/chdi.html.

          http://gaininitiative.wik.is/United_Kingdom#The_Inspiration
                                                                        14
                   The Data Sources




Scroll down to
Full Description
(see next slide)

                     http://data.gov.uk/dataset/coins   15
The Data Sources




  http://hm-treasury.gov.uk/coins   16
              The Data Sources
• Tried the zipped 2009/10 Adjustment table, 31MiB (405MiB
  uncompressed): got a 405 MB text file that, when imported
  into Spotfire, gave three columns with no headers and
  317,346 rows (with the last row saying "(316,119 row(s)
  affected)")!
   – See next slide.
• Read Comments: Saw where others had had trouble
  using these datasets.
   – Is this CSV?
      • I unzipped the (non-torrent) version of the 09/10 adjustment table
        and it wasn't CSV but rather @-sign delimited (think tab-delim with
        an @ instead of a tab). Also the data wasn't clean for import to
        something like Excel as it had some lines of non-table data at the
        end - just the sort of thing to upset already hard-pushed
        spreadsheet importers on non-high end rigs.
          – Posted on: Fri, 04/06/2010 - 14:18 — Anonymous
       http://gaininitiative.wik.is/United_Kingdom#The_Data_Sources
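
A quick check before importing could have caught this. The following is a minimal sketch, not part of the original analysis: it counts candidate delimiters in a few sample lines and reports the one that splits every line into the same number of fields. The function name, the candidate list, and the sample lines are all made up for illustration; real COINS rows will differ.

```python
# Hypothetical helper: guess which character separates fields in a text extract.
CANDIDATE_DELIMITERS = [",", "\t", "@", ";", "|"]


def guess_delimiter(lines, candidates=CANDIDATE_DELIMITERS):
    """Return the candidate that splits every non-blank line into the same
    number of fields (more than one), preferring the one with most fields.
    Returns None when no candidate qualifies."""
    best = None
    best_fields = 1
    for delim in candidates:
        # Set of per-line occurrence counts; a real delimiter gives one value.
        counts = {line.count(delim) for line in lines if line.strip()}
        if len(counts) == 1:
            fields = counts.pop() + 1
            if fields > best_fields:
                best, best_fields = delim, fields
    return best


# Made-up sample lines (assumption: not real COINS content).
SAMPLE_LINES = [
    "E001@Made-up department, north@Resource@12.5",
    "E002@Made-up department@Capital@-3.0",
    "E003@Another made-up body@Resource@7.25",
]

detected = guess_delimiter(SAMPLE_LINES)
print(repr(detected))
```

The comma is rejected here because it appears in only one of the sample lines, while the @ sign appears the same number of times in each.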
                                                                              17
      The Data Sources




COINS: Adjustment_table_extract_2009_10 in Spotfire-PC
                                                         18
                    The Data Sources
•   Should have first read: The structure of the data is similar to that in a .csv
    file with a string of characters being formed to represent each row, using the
    following delimiters:
     – Line: carriage return (so lines are presented separately); and
     – Fields: @ .
          • http://gaininitiative.wik.is/United_Kingdom/Understanding_the_COINS_data#The_Data_Files_and_Downloading
•   And read: COINS contains millions of rows of data; as a consequence the
    files are large and the data held within the files complex. Using these
    download files will require some degree of technical competence and
    expertise in handling and manipulating large volumes of data. As such it is
    likely that this data will be most easily used by organisations that have such
    expertise, rather than individuals. More directly useful and accessible
    datasets that draw on the contents of the COINS database will be made
    available by August 2010.
     – http://gaininitiative.wik.is/United_Kingdom/Understanding_the_COINS_data#Who_might_find_the_data_useful
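
Given the delimiters described above, a file like this can be read with a standard CSV reader set to use @ as the field separator. The sketch below is a hedged illustration only: the function name and sample rows are hypothetical, and dropping any line whose field count differs from the first row is an assumption made here to skip trailing non-table lines such as "(316,119 row(s) affected)".

```python
import csv
import io


def read_at_delimited(stream, expected_fields=None):
    """Yield rows from an @-delimited text stream.

    Blank lines are skipped. Lines whose field count differs from
    expected_fields (default: the width of the first row) are also skipped;
    this is an assumption used to drop trailing non-table lines.
    """
    reader = csv.reader(stream, delimiter="@", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or all(not cell.strip() for cell in row):
            continue  # blank line
        if expected_fields is None:
            expected_fields = len(row)
        if len(row) != expected_fields:
            continue  # e.g. a "(N row(s) affected)" footer
        yield row


# Made-up sample (assumption: not real COINS content), with CR/LF line
# endings, a blank line, and a footer like the one seen in the extract.
SAMPLE_TEXT = (
    "E001@Made-up department@Resource@12.5\r\n"
    "E002@Another made-up body@Capital@-3.0\r\n"
    "\r\n"
    "(2 row(s) affected)\r\n"
)

rows = list(read_at_delimited(io.StringIO(SAMPLE_TEXT, newline="")))
for row in rows:
    print(row)
```

For a real file, the same function could be given an open file handle, for example one opened with newline="" as the csv module documentation recommends. Skipping by field count silently drops malformed data rows too, so counting the skipped lines would be a sensible addition.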




                                                                                           19
                Other Sources of Data
                    http://coins.guardian.co.uk/coins-explorer/search




For Output all as CSV
could get only
5,000 of 72,644 rows.
Sent question: Why?




         http://gaininitiative.wik.is/United_Kingdom#Other_Sources_of_Data   20
               Other Sources of Data



Huge Expenditure for
Financial Stability for
Northern Rock
Refinancing!




                          COINS: Data Explorer in Spotfire-PC   21
                 Other Sources of Data


Each has link
To detailed
Table – see next slide.

Could only get
100 rows per page.
Sent question: How to
get all 3,897,330 rows?




          http://coins.wheredoesmymoneygo.org/?items_per_page=100&page=1   22
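
To give a sense of scale for the paging limit, the sketch below works out how many 100-row pages would be needed to cover 3,897,330 rows and builds the corresponding URLs from the query parameters shown on this slide. It makes no network requests. It assumes page numbering starts at 1 (the slide only shows page=1) and that every page is full except possibly the last; the function names are hypothetical and this is not a method the site documents.

```python
from urllib.parse import urlencode

BASE_URL = "http://coins.wheredoesmymoneygo.org/"
TOTAL_ROWS = 3_897_330   # row count quoted on the slide
ROWS_PER_PAGE = 100      # per-page limit quoted on the slide


def page_count(total_rows, per_page):
    """Number of pages needed to cover total_rows (integer ceiling division)."""
    return (total_rows + per_page - 1) // per_page


def page_urls(base_url, total_rows, per_page, first_page=1):
    """Yield one URL per page. first_page=1 is an assumption about the site."""
    last_page = first_page + page_count(total_rows, per_page) - 1
    for page in range(first_page, last_page + 1):
        query = urlencode({"items_per_page": per_page, "page": page})
        yield base_url + "?" + query


pages = page_count(TOTAL_ROWS, ROWS_PER_PAGE)
first_url = next(page_urls(BASE_URL, TOTAL_ROWS, ROWS_PER_PAGE))
print(pages)
print(first_url)
```

At 100 rows per page this comes to 38,974 pages, which is why a bulk download was requested rather than paging through the site.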
   Other Sources of Data




http://coins.wheredoesmymoneygo.org/coins/fact_table_extract_2009_10.1361871   23
Other Sources of Data




 COINS: Where Does the Money Go? in Spotfire-PC   24
                     The Process
• The Basic Steps:
   –   Inventory Data Sources and Plan Application
   –   Prepare and Import Data and Metadata
   –   Implement Layout and Analytics
   –   Add Bookmarks and Create Data Stories
   –   Publish and Test in Web Player
   –   Get Feedback and Improve
• First create visualizations, faceted search
  (filters), and analytics for each individual data
  source and then look for relationships between
  the data sources.
            http://gaininitiative.wik.is/United_Kingdom#The_Process   25
                 The Results
• Recall The Challenges in slide 3:
  – TBL – Get 5 stars.
  – NS – Get more eyeballs on COINS.
  – JH - Data.gov/semantic prototypes what we
    could do with Web Science.
  – BN - Evolve from quantity of datasets to
    quality data science applications.
  – CC - Can't see the Web of Data – Support the
    Linked Data Consumer.
        http://gaininitiative.wik.is/United_Kingdom#The_Results
                                                                  26
                  The Results
• Tried to accomplish all five challenges.
• Waiting to hear back on requests for full
  data sets.
• Want to emulate the Dashboard for Where
  Does My Money Go?
• Want to work with other data sources in
  Data.gov.uk:
  – E.g. Climate Change.
        http://gaininitiative.wik.is/United_Kingdom#The_Results
                                                                  27
                        Comments
• The initial objective was to see how fast one could create this basic
  application. I am waiting to hear back on requests for full data sets. I
  want to emulate the Dashboard for Where Does My Money Go? I
  want to work with other data sources in Data.gov.uk, e.g. Climate
  Change.
• Please use the Add Comment feature at the bottom of this wiki page
  to provide feedback and suggest additional analyses you would like
  to see. To use the Add Comment feature you first need to register
  by providing your email address. Your privacy will be respected and
  your email address will not be available to others or used for any
  other purpose. You can also download the Spotfire file from this
  wiki, get a 30-day free evaluation copy
  from http://spotfire.tibco.com/, and reuse these analyses or add your
  own data to this file or to new Spotfire files that you create. Have fun
  and give us your feedback!

              http://gaininitiative.wik.is/United_Kingdom#Comments

                                                                        28
           Acknowledgements
• The author acknowledges gratefully Dean
  Allemang, Cory Casanave, Sean Connors, Mills
  Davis, Li Ding, David Eng, Lee Feigenbaum,
  Aaron Fulkerson, Jim Hendler, Ralph Hodgson,
  Kevin Kirby, Kevin Jackson, Bob Marcus, John
  McMahon, Richard Murphy, Brand Niemann, Jr.,
  Barry Nussbaum, Matthew Phoenix, Tony Shaw,
  Jeff Stein, George Strawn, George Thomas,
  Pete Tseronis, and Edward Tufte.
      http://gaininitiative.wik.is/United_Kingdom#Acknowledgements

                                                                     29
                       References
• Brand L. Niemann, Put Your Desktop in the Cloud to Support the
  Open Government Directive and Data.gov/semantic, April 19,
  2010, Semantic Universe.
• Brand L. Niemann, Build Your Own Data.gov (Spotfire) and EPA
  Microsite (Spotfire) with Semantics and Statistics in the Cloud, May
  15, 2010. Slides.
• Brand L. Niemann, Build Your Community Health Information
  "Design for America" Using Mindtouch and Spotfire Analytics, May
  17, 2010. Slides.
• Brand L. Niemann, Build Your Own Data.gov/semantic with Spotfire in
  the Cloud: The White House Visitor Database, May 22,
  2010. Slides. See Data.gov takes the 'Mumsy' test, FCW, May 26,
  2010.
• Edward R. Tufte, Beautiful Evidence (2006), Graphics Press LLC.

              http://gaininitiative.wik.is/United_Kingdom#References

                                                                       30

				