            Build the UK’s COINS in the
            Data Science Library Cloud

                        Brand Niemann
                           US EPA
                         June 9, 2010

Disclaimer: These slides do not reflect the views of the U.S. Environmental Protection Agency
and do not constitute endorsement by the EPA of the standards or products mentioned.
•   The Challenge
•   The Program
•   The Expert and His Advice
•   The Cloud Tools
•   The Inspiration
•   The Data Sources
•   Other Sources of Data
•   The Process
•   The Results
•   Comments
•   Acknowledgements
•   References
                              The Challenge
•   Tim Berners-Lee "Bag of Chips" talk:
           •   To get five stars: 1-Expose your data, 2-Provide in machine readable format (Excel), 3-Provide as
               CSV, 4-Provide at permanent URL, and 5-Provide metadata.
•   Nigel_Shadbolt: Lots of eyeballs poring over COINS:
     – opendata in the wild - more functionality all the time.
•   bniemannsr: @jahendler Hope evolves from quantity (500,000 datasets) to
    quality (data science applications):
           •   Note: Now says only 272,677.
•   jahendler: @bniemannsr sure, but check out the Sem Web and Apps sections - lots
    of stuff there that prototypes what we could do #websci:
•   bniemannsr: @jahendler Did, but neat prototypes don't improve data quality-data
    science does:
•   eGovernment Interest Group Teleconference, 04 Jun 2010:
     – Excerpts: Cory Casanave: Can't see the
         Web of Data:
           •   Cory to write up requirements/wishlist for generic Web of Data browser. See Supporting the Linked
               Data Consumer.
     The Program
• Advised by Sir Tim Berners-Lee and Professor
  Nigel Shadbolt and others, government is
  opening up data for reuse. This site seeks to
  give a way into the wealth of government data
  and is under constant development. We want to
  work with you to make it better.
• We’re very aware that there are more people
  like you outside of government who have the
  skills and abilities to make wonderful things out
  of public data. These are our first steps in
  building a collaborative relationship with you.
     The Expert and His Advice
• Edward Tufte Presidential appointment
  announced by White House, March 5, 2010.
• Tufte Comment on iPhone interface design:
  Better to have users looking over material
  adjacent in space within our eyespan rather than
  stacked in time. This is especially the case for
  statistical data, where the fundamental analytical
  task is to make comparisons. Also see page 159
  in the above book reference.
The Cloud Tools
            The Inspiration

H1N1 Spread Courtesy of TIBCO Spotfire. See Web Player.
• What is data science? Analysis: The future belongs to
  the companies and people that turn data into products.
  Mike Loukides.
• My Response: Please see my Data Science Library in
  the Cloud (4372/public) and my suggestion that The 2010 Health 2.0
  Developer Challenge should build a community health
  data science library (see June 3rd).

                   The Data Sources

Scroll down to
Full Description
(see next slide)

• Tried the zipped 2009/10 Adjustment table, 31MiB (405MiB
  uncompressed): Got a 405 MB text file that, when imported
  into Spotfire, gave three columns with no headers and
  317,346 rows (with the last row saying "(316,119 row(s)").
   – See next slide.
• Read Comments: Saw where others had had trouble
  using these datasets.
   – Is this CSV?
      • I unzipped the (non-torrent) version of the 09/10 adjustment table
        and it wasn't CSV but rather @-sign delimited (think tab-delim with
        an @ instead of a tab). Also, the data wasn't clean for import to
        something like Excel as it had some lines of non-table data at the
        end - just the sort of thing to upset already hard-pushed
        spreadsheet importers on non-high end rigs.
          – Posted on: Fri, 04/06/2010 - 14:18 — Anonymous
COINS: Adjustment_table_extract_2009_10 in Spotfire-PC
•   Should have first read: The structure of the data is similar to that in a .csv
    file with a string of characters being formed to represent each row, using the
    following delimiters:
     – Line: carriage return (so lines are presented separately); and
     – Fields: @ .
•   And read: COINS contains millions of rows of data; as a consequence the
    files are large and the data held within the files complex. Using these
    download files will require some degree of technical competence and
    expertise in handling and manipulating large volumes of data. As such it is
    likely that this data will be most easily used by organisations that have such
    expertise, rather than individuals. More directly useful and accessible
    datasets that draw on the contents of the COINS database will be made
    available by August 2010.
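The delimiter description above can be turned into a small reader. What follows is only a minimal sketch in Python, not the method used for these slides: the function name `read_at_delimited`, the sample column names, and the rule of skipping any line whose field count differs from the first row are all assumptions made for illustration. The field-count rule is one simple way to drop the non-table lines reported at the end of the file.

```python
import csv
import io


def read_at_delimited(stream, expected_fields=None):
    """Yield rows from an @-delimited text stream.

    Lines whose field count differs from expected_fields are skipped.
    If expected_fields is None, the first non-empty row sets it
    (an assumption; pass the known column count if you have it).
    """
    # QUOTE_NONE: treat quote characters as ordinary data, since the
    # format description mentions only "@" and line breaks.
    reader = csv.reader(stream, delimiter="@", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # blank line
        if expected_fields is None:
            expected_fields = len(row)
        if len(row) != expected_fields:
            continue  # trailing non-table text such as a row-count footer
        yield row


# Made-up sample data: three fields per line, then a blank line and a footer.
sample = "dept@account@amount\r\nA@X1@10\r\nB@Y2@20\r\n\r\n(2 row(s)\r\n"
rows = list(read_at_delimited(io.StringIO(sample)))
print(rows)
```

For a file of several hundred megabytes, the same generator can be fed an open file object so that rows are processed one at a time rather than loaded all at once.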

                Other Sources of Data

For "Output all as CSV",
could get only
5,000 of 72,644 rows.
Sent question: Why?


Huge Expenditure for
Financial Stability for
Northern Rock

                          COINS: Data Explorer in Spotfire-PC

Each has a link
to a detailed
table – see next slide.

Could only get
100 rows per page.
Sent question: How to
get all 3,897,330?
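To put the row limits in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes, hypothetically, that the only way to get everything is to fetch one page or one capped export at a time; `pages_needed` is an illustrative helper, not part of any site's API.

```python
def pages_needed(total_rows, rows_per_request):
    """Number of requests needed to cover total_rows (ceiling division)."""
    return -(-total_rows // rows_per_request)


# 100 rows per page against 3,897,330 rows
print(pages_needed(3_897_330, 100))   # 38974
# 5,000-row CSV export cap against 72,644 rows
print(pages_needed(72_644, 5_000))    # 15
```

Nearly 39,000 page requests is impractical by hand, which is why a bulk download of the full data set was requested.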

 COINS: Where Does the Money Go? in Spotfire-PC
                     The Process
• The Basic Steps:
   –   Inventory Data Sources and Plan Application
   –   Prepare and Import Data and Metadata
   –   Implement Layout and Analytics
   –   Add Bookmarks and Create Data Stories
   –   Publish and Test in Web Player
   –   Get Feedback and Improve
• First create visualizations, faceted search
  (filters), and analytics for each individual data
  source and then look for relationships between
  the data sources.
                 The Results
• Recall The Challenges in slide 3:
  – TBL – Get 5 stars.
  – NS – Get more eyeballs on COINS.
  – JH – Prototype what we
    could do with Web Science.
  – BN – Evolve from quantity of datasets to
    quality data science applications.
  – CC – Can't see the Web of Data – Support the
    Linked Data Consumer.
• Tried to accomplish all five challenges.
• Waiting to hear back on requests for full
  data sets.
• Want to emulate the Dashboard for Where
  Does My Money Go?
• Want to work with other data sources:
  – E.g. Climate Change.
                  Comments
• The initial objective was to see how fast one could create this basic
  application. I am waiting to hear back on requests for full data sets. I
  want to emulate the Dashboard for Where Does My Money Go? I
  want to work with other data sources, e.g. Climate Change.
• Please use the Add Comment feature at the bottom of this wiki page
  to provide feedback and suggest additional analyses you would like
  to see. To use the Add Comment feature you first need to register
  by providing your email address. Your privacy will be respected and
  your email address will not be available to others or used for any
  other purpose. You can also download the Spotfire file from this
  wiki and a 30-day free evaluation copy,
  and reuse these analyses, or add your
  own data to this file or to new Spotfire files that you create. Have fun
  and give us your feedback!


                  Acknowledgements
• The author gratefully acknowledges Dean
  Allemang, Cory Casanave, Sean Connors, Mills
  Davis, Li Ding, David Eng, Lee Feigenbaum,
  Aaron Fulkerson, Jim Hendler, Ralph Hodgson,
  Kevin Kirby, Kevin Jackson, Bob Marcus, John
  McMahon, Richard Murphy, Brand Niemann, Jr.,
  Barry Nussbaum, Matthew Phoenix, Tony Shaw,
  Jeff Stein, George Strawn, George Thomas,
  Pete Tseronis, and Edward Tufte.

                  References
• Brand L. Niemann, Put Your Desktop in the Cloud to Support the
  Open Government Directive and, April 19,
  2010, Semantic Universe.
• Brand L. Niemann, Build Your Own (Spotfire) and EPA
  Microsite (Spotfire) with Semantics and Statistics in the Cloud, May
  15, 2010. Slides.
• Brand L. Niemann, Build Your Community Health Information
  "Design for America" Using Mindtouch and Spotfire Analytics, May
  17, 2010. Slides.
• Brand Niemann, Build Your Own with Spotfire in
  the Cloud: The White House Visitor Database, May 22,
  2010. Slides. See takes the 'Mumsy' test, FCW, May 26,
• Edward R. Tufte, Beautiful Evidence (2006), Graphics Press LLC.