R Screen Scraping Example Notes by KansasWatchdog

R Screen Scraping Example Notes
Earl F Glynn
Franklin Center for Government & Public Integrity
Setup
The R statistical analysis language can be downloaded for free from
http://www.r-project.org/

Packages XML and gdata need to be installed.

Both R and the packages are free.
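
One way to install only the packages that are missing is sketched below. The helper name missing_packages is made up for this illustration; installed.packages and install.packages are standard R functions. The install step needs an Internet connection, so it is guarded by interactive() here to keep a batch run from trying to download anything.

```r
# Hypothetical helper: return the names in pkgs that are not yet installed
missing_packages <- function(pkgs) {
  pkgs[!(pkgs %in% rownames(installed.packages()))]
}

needed <- c("XML", "gdata")        # the two packages used in these notes
todo   <- missing_packages(needed)

# Install only what is missing (R may ask which CRAN mirror to use)
if (interactive() && length(todo) > 0) {
  install.packages(todo)
}
```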

Screen Scraping
This document provides detailed notes about screen scraping data from http://www.kssos.org/ent/shawnee.html using R.

The goal is to "screen scrape" data from an online webpage for analysis.

Here are example data seen online from the page of interest:
[Screenshot of the results table on the web page; not reproduced in this text version.]

9fb2531e-9375-4ae2-85c6-4105b980d319.doc                                                                               1
18 Feb 2011
1. Extract all HTML tables from document using getNodeSet into "tables" object
Read the web page into R, then experiment with getNodeSet and xpathApply to extract only the desired data from the HTML.

library(XML)           # htmlTreeParse

# Read web page into object "doc"
url <- "http://www.kssos.org/ent/Shawnee.html"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)

# extract all HTML tables into a list
tables <- getNodeSet(doc, "//table")
> length(tables)
[1] 145

R extracted 145 HTML tables from the page.

> tables[[116]]
<table width="500" border="0" cellspacing="0" cellpadding="0">
  <tr>  <!-- annotation: the nested table below is tables[[117]] -->
    <td width="500" height="20" align="left" valign="top">
       <table width="500" border="0" cellpadding="0" cellspacing="2" bgcolor="#e1b85b">
         <tr>
           <td width="500" align="center" valign="top">
              <a name="0188" id="0188"></a>
              <a href="#0188"></a>
              <span class="headers">Supreme Court Justice- 01</span>
              <br />
              <span class="result_numbers">County Precincts Reporting:0201 of 0201</span>
              <br />
              <span class="result_numbers">State Precincts Reporting:3315 of 3315</span>
           </td>
         </tr>
       </table>
    </td>
  </tr>
  <tr>



     <td width="500" align="left" valign="top">
        <table width="500" border="0" cellpadding="0" cellspacing="2" bgcolor="#FFE37E">
          <tr><td width="181" height="20" align="left" valign="middle" class="racetitle">Candidate</td>&#13;
<td width="79" height="20" align="right" valign="middle" class="racetitle">County<br />Votes</td>&#13;
<td width="55" height="20" align="center" valign="middle" class="racetitle">County<br />%</td>&#13;
<td width="79" height="20" align="right" valign="middle" class="racetitle">State<br />Votes</td>&#13;
<td width="55" height="20" align="center" valign="middle" class="racetitle">State<br />%</td>&#13;
<td width="95" height="20" align="center" valign="middle" class="racetitle">&#13;
 </td></tr>  <!-- annotation: this nested table is tables[[118]] -->
        </table>
     </td>
   </tr>
   <tr>
     <td width="500" align="left" valign="top">
        <table width="500" border="0" cellpadding="0" cellspacing="2" bgcolor="#e3e3e3">
          <tr><td width="181" align="left" valign="middle" class="result_numbers"><strong>&amp;nbsp&amp;nbspCarol
A. Beier - "YES"</strong></td>&#13;
<td width="70" align="right" valign="middle" class="result_numbers">      33,016</td>&#13;
<td width="55" align="center" valign="middle" class="result_numbers"> 65%</td>&#13;
<td width="70" align="right" valign="middle" class="result_numbers">     439,474</td>&#13;
<td width="55" align="center" valign="middle" class="result_numbers"> 63%</td>&#13;
<td width="69" align="center" valign="middle" class="result_numbers">Â </td>&#13;  <!-- annotation: this nested table is tables[[119]] -->
</tr>
        </table>
     </td>
   </tr>
   <tr>
     <td width="500" align="left" valign="top">
        <table width="500" border="0" cellpadding="0" cellspacing="2" bgcolor="#e3e3e3">
          <tr><td width="181" align="left" valign="middle" class="result_numbers"><strong>&amp;nbsp&amp;nbspCarol
A. Beier - "NO"</strong></td>&#13;
<td width="70" align="right" valign="middle" class="result_numbers">      17,845</td>&#13;
<td width="55" align="center" valign="middle" class="result_numbers"> 35%</td>&#13;
<td width="70" align="right" valign="middle" class="result_numbers">     256,806</td>&#13;
<td width="55" align="center" valign="middle" class="result_numbers"> 37%</td>&#13;
<td width="69" align="center" valign="middle" class="result_numbers">Â </td>&#13;  <!-- annotation: this nested table is tables[[120]] -->
</tr>
        </table>
     </td>
   </tr>
   <tr>
     <td align="center">
        <a href="#top">Back to Top</a>
    </td>
  </tr>
</table>

Only the tables nested inside tables[[116]] (tables[[117]] through tables[[120]]) have the desired data.
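
The way getNodeSet counts nested tables can be checked on a small in-memory page before working with the real one. The sketch below uses an invented toy page (its contents and the names toy_html, toy_doc, and toy_tables are assumptions for illustration only) and parses it with htmlParse from the XML package.

```r
library(XML)

# Toy page: one outer table holding two nested tables (made-up content)
toy_html <- '<html><body>
<table width="500"><tr><td>
  <table bgcolor="#e1b85b"><tr><td>Contest A</td></tr></table>
</td></tr><tr><td>
  <table bgcolor="#e3e3e3"><tr><td>Candidate 1</td><td>10</td></tr></table>
</td></tr></table>
</body></html>'

toy_doc <- htmlParse(toy_html, asText = TRUE)

# "//table" matches every table in document order, nested ones included
toy_tables <- getNodeSet(toy_doc, "//table")
length(toy_tables)

# For each table, count the tables nested inside it
n_nested <- sapply(toy_tables,
                   function(node) length(getNodeSet(node, ".//table")))
n_nested
```

The outer table comes first in the list and is the only one that reports nested tables, which mirrors how tables[[116]] is followed by its nested tables in the real page.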


2. Use xpathApply to extract "leaf" tables
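The element names on the path from the outer table down to each nested table (table, tr, td, table; shown in red in the original listing) are used in the xpathApply query below to select the XML nodes of interest.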

See "Extracting a data.frame from HTML code", http://www.mail-archive.com/r-help@r-project.org/msg17496.html with some additional
details in the first 3 pages of http://research.stowers-institute.org/efg/Report/ABF-Files.pdf

In addition, to eliminate several undesired matches, the query requires the outer table to have the attribute width=500.

xmlValue is convenient for extracting the text value of a node.

library(XML)               # htmlTreeParse

# Read web page into object "doc"
url <- "http://www.kssos.org/ent/Shawnee.html"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)

# Use unlist to convert "list" to vector of character strings
x <- unlist(xpathApply(doc, "//table[@width='500']/tr/td/table", xmlValue))
> length(x)
[1] 82
> x
 [1]   "United States SenateCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
 [2]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\nView Map"
 [3]   "D-Lisa Johnston\r\n    19,703\r\n 35%\r\n   215,270\r\n 26%\r\n \r\n"
 [4]   "L-Michael Wm. Dann\r\n     1,595\r\n 3%\r\n     17,437\r\n 2%\r\n \r\n"
 [5]   "F-Joseph (Joe) K. Bellis\r\n       824\r\n 2%\r\n     11,356\r\n 1%\r\n \r\n"
 [6]   "R-Jerry Moran\r\n    34,652\r\n 61%\r\n   578,768\r\n 70%\r\n \r\n"
 [7]   "United States House of Representatives 002County Precincts Reporting:0201 of 0201State Precincts Reporting:0827 of 0827"
 [8]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\nView Map"
 [9]   "D-Cheryl Hudspeth\r\n    22,421\r\n 40%\r\n    65,448\r\n 32%\r\n \r\n"

[10]   "L-Robert Garrard\r\n     2,712\r\n 5%\r\n      9,166\r\n 5%\r\n \r\n"
[11]   "R-Lynn Jenkins\r\n    31,330\r\n 56%\r\n   128,083\r\n 63%\r\n \r\n"
[12]   "Governor / Lt. GovernorCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[13]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\nView Map"
[14]   "D-Tom Holland\r\n    23,303\r\n 41%\r\n   264,214\r\n 32%\r\n \r\n"
[15]   "L-Andrew P. Gray\r\n     2,816\r\n 5%\r\n     21,932\r\n 3%\r\n \r\n"
[16]   "F-Kenneth (Ken) W. Cannon\r\n       988\r\n 2%\r\n     15,050\r\n 2%\r\n \r\n"
[17]   "R-Sam Brownback\r\n    29,799\r\n 52%\r\n   522,540\r\n 63%\r\n \r\n"
[18]   "Secretary of StateCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[19]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\nView Map"
[20]   "D-Chris Biggs\r\n    27,690\r\n 49%\r\n   302,102\r\n 37%\r\n \r\n"
[21]   "L-Phillip Horatio Lucas\r\n     1,061\r\n 2%\r\n     16,946\r\n 2%\r\n \r\n"
[22]   "F-Derek Langseth\r\n       697\r\n 1%\r\n     13,482\r\n 2%\r\n \r\n"
[23]   "R-Kris Kobach\r\n    27,189\r\n 48%\r\n   482,979\r\n 59%\r\n \r\n"
[24]   "Attorney GeneralCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[25]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\nView Map"
[26]   "D-Steve Six\r\n    30,519\r\n 54%\r\n   342,004\r\n 42%\r\n \r\n"
[27]   "L-Dennis Hawver\r\n     1,748\r\n 3%\r\n     24,000\r\n 3%\r\n \r\n"
[28]   "R-Derek Schmidt\r\n    24,670\r\n 43%\r\n   453,629\r\n 55%\r\n \r\n"
[29]   "State TreasurerCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[30]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\nView Map"
[31]   "D-Dennis McKinney\r\n    28,918\r\n 52%\r\n   334,245\r\n 41%\r\n \r\n"
[32]   "R-Ron Estes\r\n    27,054\r\n 48%\r\n   474,175\r\n 59%\r\n \r\n"
[33]   "Kansas House of Representatives 052County Precincts Reporting: 29 of    29State Precincts Reporting: 29   of   29"
[34]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[35]   "D-Kyle Kessler\r\n     3,616\r\n 39%\r\n     3,616\r\n 39%\r\n \r\n"
[36]   "R-Lana Gordon\r\n     5,637\r\n 61%\r\n     5,637\r\n 61%\r\n \r\n"
[37]   "Kansas House of Representatives 053County Precincts Reporting: 27 of    27State Precincts Reporting: 30   of   30"
[38]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[39]   "D-Ann E. Mah\r\n     4,913\r\n 61%\r\n     5,293\r\n 61%\r\n \r\n"
[40]   "R-L.W. Abney\r\n     3,148\r\n 39%\r\n     3,414\r\n 39%\r\n \r\n"
[41]   "Kansas House of Representatives 054County Precincts Reporting: 26 of    26State Precincts Reporting: 26   of   26"
[42]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[43]   "D-Scott Seel\r\n     2,991\r\n 35%\r\n     2,991\r\n 35%\r\n \r\n"
[44]   "L-Sean Tabor\r\n       399\r\n 5%\r\n        399\r\n 5%\r\n \r\n"
[45]   "R-Joe Patton\r\n     5,178\r\n 60%\r\n     5,178\r\n 60%\r\n \r\n"
[46]   "Kansas House of Representatives 055County Precincts Reporting: 23 of    23State Precincts Reporting: 23   of   23"
[47]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[48]   "D-Annie Kuether\r\n     3,128\r\n 59%\r\n     3,128\r\n 59%\r\n \r\n"
[49]   "R-Bruce G Williamson\r\n     2,186\r\n 41%\r\n     2,186\r\n 41%\r\n \r\n"
[50]   "Kansas House of Representatives 056County Precincts Reporting: 25 of    25State Precincts Reporting: 25   of   25"
[51]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[52]   "D-Annie Tietze\r\n     3,274\r\n 50%\r\n     3,274\r\n 50%\r\n \r\n"
[53]   "L-Troy Abbot\r\n       229\r\n 4%\r\n        229\r\n 4%\r\n \r\n"
[54]   "R-Becky Nioce\r\n     3,018\r\n 46%\r\n     3,018\r\n 46%\r\n \r\n"
[55]   "Kansas House of Representatives 057County Precincts Reporting: 19 of    19State Precincts Reporting: 19   of   19"
[56]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[57]   "D-Sean Gatewood\r\n     2,291\r\n 54%\r\n     2,291\r\n 54%\r\n \r\n"
[58]   "R-Cheryl Reynolds\r\n     1,953\r\n 46%\r\n     1,953\r\n 46%\r\n \r\n"
[59]   "Supreme Court Justice- 01County Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[60]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[61]   "&nbsp&nbspCarol A. Beier - \"YES\"\r\n     33,016\r\n 65%\r\n   439,474\r\n 63%\r\n \r\n"
[62]   "&nbsp&nbspCarol A. Beier - \"NO\"\r\n    17,845\r\n 35%\r\n   256,806\r\n 37%\r\n \r\n"
[63]   "Supreme Court Justice- 02County Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[64]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[65]   "&nbsp&nbspDan Biles - \"YES\"\r\n     31,507\r\n 63%\r\n   424,952\r\n 62%\r\n \r\n"
[66]   "&nbsp&nbspDan Biles - \"NO\"\r\n    18,646\r\n 37%\r\n   261,169\r\n 38%\r\n \r\n"
[67]   "Supreme Court Justice- 03County Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[68]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[69]   "&nbsp&nbspLawton R. Nuss - \"YES\"\r\n     31,080\r\n 62%\r\n   428,828\r\n 63%\r\n \r\n"
[70]   "&nbsp&nbspLawton R. Nuss - \"NO\"\r\n    18,964\r\n 38%\r\n   257,019\r\n 38%\r\n \r\n"
[71]   "Supreme Court Justice- 05County Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[72]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[73]   "&nbsp&nbspMarla J. Luckert - \"YES\"\r\n     33,251\r\n 65%\r\n   428,714\r\n 63%\r\n \r\n"
[74]   "&nbsp&nbspMarla J. Luckert - \"NO\"\r\n    17,571\r\n 35%\r\n   255,057\r\n 37%\r\n \r\n"
[75]   "Constitutional Amendment 1 - Bear ArmsCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[76]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[77]   "&nbsp&nbsp C.A. #1 - \"YES\"\r\n     48,447\r\n 87%\r\n   710,255\r\n 89%\r\n \r\n"
[78]   "&nbsp&nbsp C.A. #1 - \"NO\"\r\n     7,006\r\n 13%\r\n    91,004\r\n 11%\r\n \r\n"
[79]   "Constitutional Amendment 2 - Vote RightsCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[80]   "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\n "
[81]   "&nbsp&nbsp C.A. #2 - \"YES\"\r\n     35,140\r\n 65%\r\n   493,764\r\n 62%\r\n \r\n"
[82]   "&nbsp&nbsp C.A. #2 - \"NO\"\r\n    19,171\r\n 35%\r\n   297,382\r\n 38%\r\n \r\n"
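
The effect of the width restriction in the query can be seen on a toy page with one wanted and one unwanted outer table. This is only a sketch: the page content, the name "D-Jane Doe", and the variable names are invented for illustration, while xpathApply, xmlValue, and htmlParse are functions from the XML package.

```r
library(XML)

# Toy page: a 760-wide layout table (unwanted) and a 500-wide results table
toy_html <- '<html><body>
<table width="760"><tr><td><table><tr><td>menu</td></tr></table></td></tr></table>
<table width="500"><tr><td>
  <table><tr><td>D-Jane Doe</td><td>1,234</td></tr></table>
</td></tr></table>
</body></html>'

toy_doc <- htmlParse(toy_html, asText = TRUE)

# Without the predicate, nested tables of both outer tables are returned
all_leaf <- unlist(xpathApply(toy_doc, "//table/tr/td/table", xmlValue))

# With the predicate, only tables nested in the width=500 table remain
wanted <- unlist(xpathApply(toy_doc, "//table[@width='500']/tr/td/table",
                            xmlValue))
all_leaf
wanted
```

Because xmlValue concatenates all text below a node, the cells of each nested table run together into a single string, which is why the real output needs the cleanup in the next step.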


3. Cleanup data
Use a pipe ("|") character as the field delimiter. The first R global substitution statement below changes each CR+LF combination to a "|" delimiter.

The remaining global substitutions remove various strings from the data that were not wanted or caused parsing problems.

#   Cleanup problems in data
x   <- gsub("\r\n", "|", x)
x   <- gsub("\\|\\|View Map", "", x)
x   <- gsub("\\|Â \\|", "", x)
x   <- gsub("\\|\\|Â ", "", x)
x   <- gsub("\\&nbsp\\&nbsp", "", x)
x   <- gsub("\\\"", "", x)
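
To see what the individual substitutions do, they can be applied one at a time to single strings. The sketch below uses two strings copied from the raw output above (the second one shortened); the variable names are invented, and fixed = TRUE is used here as an alternative to escaping so that the pattern is treated as plain text.

```r
# Two strings taken from the raw output above (the second is shortened)
header <- "Candidate\r\nCountyVotes\r\nCounty%\r\nStateVotes\r\nState%\r\n\r\nView Map"
row    <- "&nbsp&nbspCarol A. Beier - \"YES\"\r\n     33,016\r\n 65%"

# Step 1: each CR+LF pair becomes the "|" delimiter
header1 <- gsub("\r\n", "|", header)
row1    <- gsub("\r\n", "|", row)

# Step 2: drop the empty field and the "View Map" link text
header2 <- gsub("\\|\\|View Map", "", header1)

# Step 3: remove the literal "&nbsp&nbsp" prefix and the double quotes
row2 <- gsub("&nbsp&nbsp", "", row1, fixed = TRUE)
row2 <- gsub("\"", "", row2, fixed = TRUE)

header2
row2
```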

Cleaned up data:

> x
 [1]   "United States SenateCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
 [2]   "Candidate|CountyVotes|County%|StateVotes|State%"
 [3]   "D-Lisa Johnston|    19,703| 35%|   215,270| 26%"
 [4]   "L-Michael Wm. Dann|     1,595| 3%|     17,437| 2%"
 [5]   "F-Joseph (Joe) K. Bellis|       824| 2%|     11,356| 1%"
 [6]   "R-Jerry Moran|    34,652| 61%|   578,768| 70%"
 [7]   "United States House of Representatives 002County Precincts Reporting:0201 of 0201State Precincts Reporting:0827 of 0827"
 [8]   "Candidate|CountyVotes|County%|StateVotes|State%"
 [9]   "D-Cheryl Hudspeth|    22,421| 40%|    65,448| 32%"
[10]   "L-Robert Garrard|     2,712| 5%|      9,166| 5%"
[11]   "R-Lynn Jenkins|    31,330| 56%|   128,083| 63%"
[12]   "Governor / Lt. GovernorCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[13]   "Candidate|CountyVotes|County%|StateVotes|State%"
[14]   "D-Tom Holland|    23,303| 41%|   264,214| 32%"
[15]   "L-Andrew P. Gray|     2,816| 5%|     21,932| 3%"
[16]   "F-Kenneth (Ken) W. Cannon|       988| 2%|     15,050| 2%"
[17]   "R-Sam Brownback|    29,799| 52%|   522,540| 63%"
[18]   "Secretary of StateCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[19]   "Candidate|CountyVotes|County%|StateVotes|State%"
[20]   "D-Chris Biggs|    27,690| 49%|   302,102| 37%"
[21]   "L-Phillip Horatio Lucas|     1,061| 2%|     16,946| 2%"
[22]   "F-Derek Langseth|       697| 1%|     13,482| 2%"
[23]   "R-Kris Kobach|    27,189| 48%|   482,979| 59%"
[24]   "Attorney GeneralCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[25]   "Candidate|CountyVotes|County%|StateVotes|State%"
[26]   "D-Steve Six|    30,519| 54%|   342,004| 42%"
[27]   "L-Dennis Hawver|     1,748| 3%|     24,000| 3%"
[28]   "R-Derek Schmidt|    24,670| 43%|   453,629| 55%"
[29]   "State TreasurerCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[30]   "Candidate|CountyVotes|County%|StateVotes|State%"
[31]   "D-Dennis McKinney|    28,918| 52%|   334,245| 41%"
[32]   "R-Ron Estes|    27,054| 48%|   474,175| 59%"
[33]   "Kansas House of Representatives 052County Precincts Reporting: 29 of    29State Precincts Reporting: 29 of    29"
[34]   "Candidate|CountyVotes|County%|StateVotes|State%"
[35]   "D-Kyle Kessler|     3,616| 39%|     3,616| 39%"
[36]   "R-Lana Gordon|     5,637| 61%|     5,637| 61%"
[37]   "Kansas House of Representatives 053County Precincts Reporting: 27 of    27State Precincts Reporting: 30 of    30"
[38]   "Candidate|CountyVotes|County%|StateVotes|State%"
[39]   "D-Ann E. Mah|     4,913| 61%|     5,293| 61%"
[40]   "R-L.W. Abney|     3,148| 39%|     3,414| 39%"
[41]   "Kansas House of Representatives 054County Precincts Reporting: 26 of    26State Precincts Reporting: 26 of    26"
[42]   "Candidate|CountyVotes|County%|StateVotes|State%"
[43]   "D-Scott Seel|     2,991| 35%|     2,991| 35%"
[44]   "L-Sean Tabor|       399| 5%|        399| 5%"
[45]   "R-Joe Patton|     5,178| 60%|     5,178| 60%"
[46]   "Kansas House of Representatives 055County Precincts Reporting: 23 of    23State Precincts Reporting: 23 of    23"
[47]   "Candidate|CountyVotes|County%|StateVotes|State%"
[48]   "D-Annie Kuether|     3,128| 59%|     3,128| 59%"
[49]   "R-Bruce G Williamson|     2,186| 41%|     2,186| 41%"
[50]   "Kansas House of Representatives 056County Precincts Reporting: 25 of    25State Precincts Reporting: 25 of    25"
[51]   "Candidate|CountyVotes|County%|StateVotes|State%"
[52]   "D-Annie Tietze|     3,274| 50%|     3,274| 50%"
[53]   "L-Troy Abbot|       229| 4%|        229| 4%"
[54]   "R-Becky Nioce|     3,018| 46%|     3,018| 46%"
[55]   "Kansas House of Representatives 057County Precincts Reporting: 19 of    19State Precincts Reporting: 19 of    19"
[56]   "Candidate|CountyVotes|County%|StateVotes|State%"
[57]   "D-Sean Gatewood|     2,291| 54%|     2,291| 54%"
[58]   "R-Cheryl Reynolds|     1,953| 46%|     1,953| 46%"
[59]   "Supreme Court Justice- 01County Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[60]   "Candidate|CountyVotes|County%|StateVotes|State%"
[61]   "Carol A. Beier - YES|     33,016| 65%|   439,474| 63%"
[62]   "Carol A. Beier - NO|    17,845| 35%|   256,806| 37%"
[63]   "Supreme Court Justice- 02County Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[64]   "Candidate|CountyVotes|County%|StateVotes|State%"
[65]   "Dan Biles - YES|     31,507| 63%|   424,952| 62%"
[66]   "Dan Biles - NO|    18,646| 37%|   261,169| 38%"
[67]   "Supreme Court Justice- 03County Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[68]   "Candidate|CountyVotes|County%|StateVotes|State%"
[69]   "Lawton R. Nuss - YES|     31,080| 62%|   428,828| 63%"
[70]   "Lawton R. Nuss - NO|    18,964| 38%|   257,019| 38%"
[71]   "Supreme Court Justice- 05County Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[72]   "Candidate|CountyVotes|County%|StateVotes|State%"
[73]   "Marla J. Luckert - YES|     33,251| 65%|   428,714| 63%"
[74]   "Marla J. Luckert - NO|    17,571| 35%|   255,057| 37%"
[75]   "Constitutional Amendment 1 - Bear ArmsCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[76]   "Candidate|CountyVotes|County%|StateVotes|State%"
[77]   " C.A. #1 - YES|     48,447| 87%|   710,255| 89%"
[78]   " C.A. #1 - NO|     7,006| 13%|    91,004| 11%"
[79]   "Constitutional Amendment 2 - Vote RightsCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315"
[80]   "Candidate|CountyVotes|County%|StateVotes|State%"
[81]   " C.A. #2 - YES|     35,140| 65%|   493,764| 62%"
[82]   " C.A. #2 - NO|    19,171| 35%|   297,382| 38%"




4. Step through lines and reformat to be parsed by Excel or other programs
library(gdata)   # trim

county  <- "Shawnee"   # county for this page (left undefined in the original listing)
contest <- ""          # name of the contest currently being processed

# Loop through data, tracking the current contest and reformatting each row
for (i in seq_along(x))
{
  if (length(grep("County Precincts Reporting:", x[i])) > 0)
  {
    # Contest heading: keep only the text before "County Precincts Reporting:"
    contest <- unlist(strsplit(x[i], "County Precincts Reporting:"))[1]
  } else {
      raw <- trim(unlist(strsplit(x[i], "\\|")))
      raw <- gsub(",", "", raw) # remove commas from numbers
      raw <- gsub("%", "", raw) # remove percent sign
      line <- c(county, contest, raw)
      cat( paste(line, collapse="|"), "\n")
  }
}

Shawnee|United States Senate|Candidate|CountyVotes|County|StateVotes|State
Shawnee|United States Senate|D-Lisa Johnston|19703|35|215270|26
Shawnee|United States Senate|L-Michael Wm. Dann|1595|3|17437|2
Shawnee|United States Senate|F-Joseph (Joe) K. Bellis|824|2|11356|1
Shawnee|United States Senate|R-Jerry Moran|34652|61|578768|70
Shawnee|United States House of Representatives 002|Candidate|CountyVotes|County|StateVotes|State
Shawnee|United States House of Representatives 002|D-Cheryl Hudspeth|22421|40|65448|32
Shawnee|United States House of Representatives 002|L-Robert Garrard|2712|5|9166|5
Shawnee|United States House of Representatives 002|R-Lynn Jenkins|31330|56|128083|63
Shawnee|Governor / Lt. Governor|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Governor / Lt. Governor|D-Tom Holland|23303|41|264214|32
Shawnee|Governor / Lt. Governor|L-Andrew P. Gray|2816|5|21932|3
Shawnee|Governor / Lt. Governor|F-Kenneth (Ken) W. Cannon|988|2|15050|2
Shawnee|Governor / Lt. Governor|R-Sam Brownback|29799|52|522540|63
Shawnee|Secretary of State|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Secretary of State|D-Chris Biggs|27690|49|302102|37
Shawnee|Secretary of State|L-Phillip Horatio Lucas|1061|2|16946|2
Shawnee|Secretary of State|F-Derek Langseth|697|1|13482|2
Shawnee|Secretary of State|R-Kris Kobach|27189|48|482979|59
Shawnee|Attorney General|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Attorney General|D-Steve Six|30519|54|342004|42
Shawnee|Attorney General|L-Dennis Hawver|1748|3|24000|3
Shawnee|Attorney General|R-Derek Schmidt|24670|43|453629|55
Shawnee|State Treasurer|Candidate|CountyVotes|County|StateVotes|State
Shawnee|State Treasurer|D-Dennis McKinney|28918|52|334245|41
Shawnee|State Treasurer|R-Ron Estes|27054|48|474175|59
Shawnee|Kansas House of Representatives 052|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Kansas House of Representatives 052|D-Kyle Kessler|3616|39|3616|39
Shawnee|Kansas House of Representatives 052|R-Lana Gordon|5637|61|5637|61
Shawnee|Kansas House of Representatives 053|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Kansas House of Representatives 053|D-Ann E. Mah|4913|61|5293|61
Shawnee|Kansas House of Representatives 053|R-L.W. Abney|3148|39|3414|39
Shawnee|Kansas House of Representatives 054|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Kansas House of Representatives 054|D-Scott Seel|2991|35|2991|35
Shawnee|Kansas House of Representatives 054|L-Sean Tabor|399|5|399|5
Shawnee|Kansas House of Representatives 054|R-Joe Patton|5178|60|5178|60
Shawnee|Kansas House of Representatives 055|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Kansas House of Representatives 055|D-Annie Kuether|3128|59|3128|59
Shawnee|Kansas House of Representatives 055|R-Bruce G Williamson|2186|41|2186|41
Shawnee|Kansas House of Representatives 056|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Kansas House of Representatives 056|D-Annie Tietze|3274|50|3274|50
Shawnee|Kansas House of Representatives 056|L-Troy Abbot|229|4|229|4
Shawnee|Kansas House of Representatives 056|R-Becky Nioce|3018|46|3018|46
Shawnee|Kansas House of Representatives 057|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Kansas House of Representatives 057|D-Sean Gatewood|2291|54|2291|54
Shawnee|Kansas House of Representatives 057|R-Cheryl Reynolds|1953|46|1953|46
Shawnee|Supreme Court Justice- 01|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Supreme Court Justice- 01|Carol A. Beier - YES|33016|65|439474|63
Shawnee|Supreme Court Justice- 01|Carol A. Beier - NO|17845|35|256806|37
Shawnee|Supreme Court Justice- 02|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Supreme Court Justice- 02|Dan Biles - YES|31507|63|424952|62
Shawnee|Supreme Court Justice- 02|Dan Biles - NO|18646|37|261169|38
Shawnee|Supreme Court Justice- 03|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Supreme Court Justice- 03|Lawton R. Nuss - YES|31080|62|428828|63
Shawnee|Supreme Court Justice- 03|Lawton R. Nuss - NO|18964|38|257019|38
Shawnee|Supreme Court Justice- 05|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Supreme Court Justice- 05|Marla J. Luckert - YES|33251|65|428714|63
Shawnee|Supreme Court Justice- 05|Marla J. Luckert - NO|17571|35|255057|37
Shawnee|Constitutional Amendment 1 - Bear Arms|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Constitutional Amendment 1 - Bear Arms|C.A. #1 - YES|48447|87|710255|89
Shawnee|Constitutional Amendment 1 - Bear Arms|C.A. #1 - NO|7006|13|91004|11
Shawnee|Constitutional Amendment 2 - Vote Rights|Candidate|CountyVotes|County|StateVotes|State
Shawnee|Constitutional Amendment 2 - Vote Rights|C.A. #2 - YES|35140|65|493764|62
Shawnee|Constitutional Amendment 2 - Vote Rights|C.A. #2 - NO|19171|35|297382|38
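
For analysis in R itself, rather than in Excel, the same pass over the cleaned strings can build a data frame with numeric vote columns. The following is a minimal sketch under a few assumptions: the function name results_to_df and the column names are invented here, the repeated "Candidate|..." header rows are skipped instead of printed, and the sample strings are copied from the cleaned output in step 3. It uses only base R (trimws in place of trim from gdata).

```r
# A few cleaned strings copied from step 3
cleaned <- c(
  "United States SenateCounty Precincts Reporting:0201 of 0201State Precincts Reporting:3315 of 3315",
  "Candidate|CountyVotes|County%|StateVotes|State%",
  "D-Lisa Johnston|    19,703| 35%|   215,270| 26%",
  "R-Jerry Moran|    34,652| 61%|   578,768| 70%"
)

# Hypothetical helper: turn cleaned strings into one data frame row per candidate
results_to_df <- function(x, county) {
  rows <- list()
  contest <- ""
  for (line in x) {
    if (grepl("County Precincts Reporting:", line, fixed = TRUE)) {
      # Contest heading: keep the text before the precinct counts
      contest <- strsplit(line, "County Precincts Reporting:", fixed = TRUE)[[1]][1]
    } else if (!startsWith(line, "Candidate|")) {
      f   <- trimws(strsplit(line, "|", fixed = TRUE)[[1]])
      num <- as.numeric(gsub("[,%]", "", f[2:5]))  # drop commas and percent signs
      rows[[length(rows) + 1]] <- data.frame(
        County = county, Contest = contest, Candidate = f[1],
        CountyVotes = num[1], CountyPct = num[2],
        StateVotes = num[3], StatePct = num[4],
        stringsAsFactors = FALSE)
    }
  }
  do.call(rbind, rows)
}

df <- results_to_df(cleaned, "Shawnee")

# Write pipe-delimited text to the console; pass file= to save it instead
write.table(df, sep = "|", quote = FALSE, row.names = FALSE)
```

Keeping the votes and percentages numeric makes it possible to sort, sum, or compare contests directly in R before exporting.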



This example showed how to extract data from a single web page for a single county.

The next exercise builds on this example to read the pages for all 105 counties.



