					                                      R Screen Scraping: 105 Counties of Election Data
                                                            Earl F Glynn
                                         Franklin Center for Government & Public Integrity
The goal of this exercise is to screen scrape election results from all 105 counties shown at http://www.kssos.org/ent/kssos_ent.html and
write the data to a file for subsequent processing.

scrape.county function. To extend the earlier example of screen scraping a single county (Shawnee County), an R function, scrape.county,
was constructed to process any county.
scrape.county <- function (county, outfile)
{
  # Requires library(XML) for htmlTreeParse/xpathApply and library(gdata) for trim
  url <- paste("http://www.kssos.org/ent/", county, ".html", sep="")
  doc <- htmlTreeParse(url, useInternalNodes=TRUE)

  # Extract lines using xpathApply
  x <- unlist(xpathApply(doc, "//table[@width='500']/tr/td/table", xmlValue))

  # Clean up problems in data
  x <- gsub("\r\n", "|", x)
  x <- gsub("\\|\\|View Map", "", x)
  x <- gsub("\\|Â \\|", "", x)
  x <- gsub("\\|\\|Â ", "", x)
  x <- gsub("\\&nbsp\\&nbsp", "", x)
  x <- gsub("\\\"", "", x)

  # Loop through data, extracting the contest name and reformatting output
  contest <- ""
  for (i in seq_along(x))
  {
    if (length(grep("County Precincts Reporting:", x[i])) > 0)
    {
      # Contest heading: the contest name precedes this marker text
      contest <- unlist(strsplit(x[i], "County Precincts Reporting:"))[1]
    } else {
      raw <- trim(unlist(strsplit(x[i], "\\|")))
      raw <- gsub(",", "", raw) # remove commas from numbers
      raw <- gsub("%", "", raw) # remove percent sign
      line <- c(county, contest, raw)
      cat( paste(line, collapse="|"), "\n", file=outfile)
    }
  }
}


18 Feb 2011
The scrape.county function takes two arguments. The first is the county name, e.g., "Shawnee", and the second is an open file connection
to which the results are written.
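
The reformatting step can be tried without visiting any web page. The following is a minimal sketch, assuming the cleaned-up strings look like the invented sample below (the real page content may differ). The helper name format.records is hypothetical, and base R's trimws is used here in place of trim from gdata.

```r
# Invented sample of cleaned-up strings; the real scraped text may differ.
x <- c("United States Senate County Precincts Reporting: 20 of 20",
       "Candidate|CountyVotes|County%|StateVotes|State%",
       "D-Lisa Johnston|1,103|26%|215,270|26%")

# Hypothetical helper: returns the pipe-delimited records as a character
# vector instead of writing them to a file.
format.records <- function(x, county)
{
  contest <- ""
  out <- character(0)
  for (item in x)
  {
    if (grepl("County Precincts Reporting:", item, fixed=TRUE))
    {
      # Contest heading: keep the text before the marker
      parts <- strsplit(item, "County Precincts Reporting:", fixed=TRUE)[[1]]
      contest <- trimws(parts[1])
    } else {
      raw <- strsplit(item, "|", fixed=TRUE)[[1]]
      raw <- trimws(gsub("[,%]", "", raw))   # drop commas and percent signs
      out <- c(out, paste(c(county, contest, raw), collapse="|"))
    }
  }
  out
}

records <- format.records(x, "Allen")
print(records)
```

Each non-heading string becomes one record prefixed with the county and the most recent contest name.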

List of Kansas counties. An R character vector is defined with the names of all 105 Kansas counties in alphabetical order:

Kansas.Counties    <- c(
  "Allen",         "Anderson",         "Atchison",        "Barber",       "Barton",
  "Bourbon",       "Brown",            "Butler",          "Chase",        "Chautauqua",
  "Cherokee",      "Cheyenne",         "Clark",           "Clay",         "Cloud",
  "Coffey",        "Comanche",         "Cowley",          "Crawford",     "Decatur",
  "Dickinson",     "Doniphan",         "Douglas",         "Edwards",      "Elk",
  "Ellis",         "Ellsworth",        "Finney",          "Ford",         "Franklin",
  "Geary",         "Gove",             "Graham",          "Grant",        "Gray",
  "Greeley",       "Greenwood",        "Hamilton",        "Harper",       "Harvey",
  "Haskell",       "Hodgeman",         "Jackson",         "Jefferson",    "Jewell",
  "Johnson",       "Kearny",           "Kingman",         "Kiowa",        "Labette",
  "Lane",          "Leavenworth",      "Lincoln",         "Linn",         "Logan",
  "Lyon",          "Marion",           "Marshall",        "McPherson",    "Meade",
  "Miami",         "Mitchell",         "Montgomery",      "Morris",       "Morton",
  "Nemaha",        "Neosho",           "Ness",            "Norton",       "Osage",
  "Osborne",       "Ottawa",           "Pawnee",          "Phillips",     "Pottawatomie",
  "Pratt",         "Rawlins",          "Reno",            "Republic",     "Rice",
  "Riley",         "Rooks",            "Rush",            "Russell",      "Saline",
  "Scott",         "Sedgwick",         "Seward",          "Shawnee",      "Sheridan",
  "Sherman",       "Smith",            "Stafford",        "Stanton",      "Stevens",
  "Sumner",        "Thomas",           "Trego",           "Wabaunsee",    "Wallace",
  "Washington",    "Wichita",          "Wilson",          "Woodson",      "Wyandotte")

This list could also be constructed from the HTML source of the page http://www.kssos.org/ent/kssos_ent.html.
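
As a rough sketch of that idea, the county names could be pulled from the link targets in the page source. The HTML fragment below is invented for illustration; it assumes each county is linked as "<County>.html", which matches the URLs built in scrape.county but has not been checked against the actual markup of the index page.

```r
# Invented fragment standing in for the index page source (an assumption).
html <- c('<a href="Allen.html">Allen</a>',
          '<a href="Anderson.html">Anderson</a>',
          '<a href="kssos_ent.html">Statewide</a>')

# Pull out every href that ends in .html, then keep only the file stem
hrefs <- unlist(regmatches(html, gregexpr('href="[^"]+\\.html"', html)))
stems <- sub('href="([^"]+)\\.html"', "\\1", hrefs)

# Drop the index page itself and sort alphabetically
counties <- sort(setdiff(stems, "kssos_ent"))
print(counties)
```

With the real page, html would come from something like readLines on the URL, and any non-county links would need to be filtered out in the same way as the index page is here.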

Visiting 105 web pages. With the scrape.county function and the list of counties above, extracting the data becomes very simple
and takes about a minute, depending on the speed of the Internet connection:




library(gdata)       # trim
library(XML)         # htmlTreeParse

basedir <- "C:/Users/earl/Desktop/CAR/Screen-Scraping/"    #### set base dir ####
outfile <- file(paste(basedir, "2010-Kansas-General-Election-11-03.txt", sep=""), "w")

for (i in 1:length(Kansas.Counties))
{
  county <- Kansas.Counties[i]
  cat(county, "\n")
  flush.console()
  scrape.county(county, outfile)
}

close(outfile)

The call to the flush.console function above is needed in Windows to show incremental progress as the counties are processed:

       Allen
       Anderson
       Atchison
       ...
       Wilson
       Woodson
       Wyandotte

Otherwise, Windows shows all results on the screen at the end of all processing.

The resulting file, 2010-Kansas-General-Election-11-03.txt, has 5086 lines and looks like this:

Allen|United States Senate|Candidate|CountyVotes|County|StateVotes|State
Allen|United States Senate|D-Lisa Johnston|1103|26|215270|26
Allen|United States Senate|L-Michael Wm. Dann|109|3|17437|2
Allen|United States Senate|F-Joseph (Joe) K. Bellis|68|2|11356|1
Allen|United States Senate|R-Jerry Moran|3026|70|578768|70
Allen|United States House of Representatives 002|Candidate|CountyVotes|County|StateVotes|State
Allen|United States House of Representatives 002|D-Cheryl Hudspeth|1154|27|65448|32
Allen|United States House of Representatives 002|L-Robert Garrard|142|3|9166|5
Allen|United States House of Representatives 002|R-Lynn Jenkins|3001|70|128083|63
Allen|Governor / Lt. Governor|Candidate|CountyVotes|County|StateVotes|State
Allen|Governor / Lt. Governor|D-Tom Holland|1213|28|264214|32
Allen|Governor / Lt. Governor|L-Andrew P. Gray|111|3|21932|3
Allen|Governor / Lt. Governor|F-Kenneth (Ken) W. Cannon|95|2|15050|2
Allen|Governor / Lt. Governor|R-Sam Brownback|2967|68|522540|63
...
Wyandotte|Supreme Court Justice- 01|Candidate|CountyVotes|County|StateVotes|State
Wyandotte|Supreme Court Justice- 01|Carol A. Beier - YES|17293|70|439474|63
Wyandotte|Supreme Court Justice- 01|Carol A. Beier - NO|7578|31|256806|37
Wyandotte|Supreme Court Justice- 02|Candidate|CountyVotes|County|StateVotes|State
Wyandotte|Supreme Court Justice- 02|Dan Biles - YES|16177|66|424952|62
Wyandotte|Supreme Court Justice- 02|Dan Biles - NO|8260|34|261169|38
Wyandotte|Supreme Court Justice- 03|Candidate|CountyVotes|County|StateVotes|State
Wyandotte|Supreme Court Justice- 03|Lawton R. Nuss - YES|15788|66|428828|63
Wyandotte|Supreme Court Justice- 03|Lawton R. Nuss - NO|8292|34|257019|38
Wyandotte|Supreme Court Justice- 05|Candidate|CountyVotes|County|StateVotes|State
Wyandotte|Supreme Court Justice- 05|Marla J. Luckert - YES|16569|68|428714|63
Wyandotte|Supreme Court Justice- 05|Marla J. Luckert - NO|7685|32|255057|37
Wyandotte|Constitutional Amendment 1 - Bear Arms|Candidate|CountyVotes|County|StateVotes|State
Wyandotte|Constitutional Amendment 1 - Bear Arms|C.A. #1 - YES|25806|87|710255|89
Wyandotte|Constitutional Amendment 1 - Bear Arms|C.A. #1 - NO|3945|13|91004|11
Wyandotte|Constitutional Amendment 2 - Vote Rights|Candidate|CountyVotes|County|StateVotes|State
Wyandotte|Constitutional Amendment 2 - Vote Rights|C.A. #2 - YES|18534|63|493764|62
Wyandotte|Constitutional Amendment 2 - Vote Rights|C.A. #2 - NO|10774|37|297382|38

Manual data cleanup. Note there are extra header lines above, one for each contest in each county (the lines containing "CountyVotes").

Using an ASCII text editor, globally delete all rows with "CountyVotes" except the first row.

Change the first row header from

Allen|United States Senate|Candidate|CountyVotes|County|StateVotes|State

to

County|Contest|Candidate|CountyVotes|CountyPercent|StateVotes|StatePercent

Note "Percent" was added back to two column headers. The original "%" sign was stripped out by the gsub call in scrape.county, since "%"
is not allowed in a syntactically valid R name such as a data.frame column name.
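
The same cleanup could also be scripted in R instead of done in a text editor. The following is a minimal sketch, not part of the original exercise: the function name clean.results is hypothetical, and the sample lines are copied from the output shown earlier.

```r
# Hypothetical helper: drop the repeated per-contest header rows and
# put a single descriptive header at the top.
clean.results <- function(lines)
{
  lines <- trimws(lines, which="right")   # cat() may leave a trailing space
  is.header <- grepl("|Candidate|CountyVotes|", lines, fixed=TRUE)
  header <- "County|Contest|Candidate|CountyVotes|CountyPercent|StateVotes|StatePercent"
  c(header, lines[!is.header])
}

raw.lines <- c(
  "Allen|United States Senate|Candidate|CountyVotes|County|StateVotes|State",
  "Allen|United States Senate|D-Lisa Johnston|1103|26|215270|26",
  "Allen|Governor / Lt. Governor|Candidate|CountyVotes|County|StateVotes|State",
  "Allen|Governor / Lt. Governor|D-Tom Holland|1213|28|264214|32")

cleaned <- clean.results(raw.lines)
print(cleaned)

# With the real file, something like the following would apply:
# cleaned <- clean.results(readLines("2010-Kansas-General-Election-11-03.txt"))
# writeLines(cleaned, "2010-Kansas-General-Election-11-03.txt")
```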


Kansas election result data for 105 counties. The resulting file should now have 3681 rows with a pipe ("|") delimiter:

County|Contest|Candidate|CountyVotes|CountyPercent|StateVotes|StatePercent
Allen|United States Senate|D-Lisa Johnston|1103|26|215270|26
Allen|United States Senate|L-Michael Wm. Dann|109|3|17437|2
Allen|United States Senate|F-Joseph (Joe) K. Bellis|68|2|11356|1
Allen|United States Senate|R-Jerry Moran|3026|70|578768|70
Allen|United States House of Representatives 002|D-Cheryl Hudspeth|1154|27|65448|32
Allen|United States House of Representatives 002|L-Robert Garrard|142|3|9166|5
Allen|United States House of Representatives 002|R-Lynn Jenkins|3001|70|128083|63
Allen|Governor / Lt. Governor|D-Tom Holland|1213|28|264214|32
Allen|Governor / Lt. Governor|L-Andrew P. Gray|111|3|21932|3
Allen|Governor / Lt. Governor|F-Kenneth (Ken) W. Cannon|95|2|15050|2
Allen|Governor / Lt. Governor|R-Sam Brownback|2967|68|522540|63
. . .
Wyandotte|Supreme Court Justice- 01|Carol A. Beier - YES|17293|70|439474|63
Wyandotte|Supreme Court Justice- 01|Carol A. Beier - NO|7578|31|256806|37
Wyandotte|Supreme Court Justice- 02|Dan Biles - YES|16177|66|424952|62
Wyandotte|Supreme Court Justice- 02|Dan Biles - NO|8260|34|261169|38
Wyandotte|Supreme Court Justice- 03|Lawton R. Nuss - YES|15788|66|428828|63
Wyandotte|Supreme Court Justice- 03|Lawton R. Nuss - NO|8292|34|257019|38
Wyandotte|Supreme Court Justice- 05|Marla J. Luckert - YES|16569|68|428714|63
Wyandotte|Supreme Court Justice- 05|Marla J. Luckert - NO|7685|32|255057|37
Wyandotte|Constitutional Amendment 1 - Bear Arms|C.A. #1 - YES|25806|87|710255|89
Wyandotte|Constitutional Amendment 1 - Bear Arms|C.A. #1 - NO|3945|13|91004|11
Wyandotte|Constitutional Amendment 2 - Vote Rights|C.A. #2 - YES|18534|63|493764|62
Wyandotte|Constitutional Amendment 2 - Vote Rights|C.A. #2 - NO|10774|37|297382|38
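
For subsequent processing, a file in this format can be read into a data.frame. The sketch below uses three lines copied from the listing above as inline text; the variable names are illustrative only. Two read.table defaults need to be overridden: the default comment character "#" would otherwise truncate candidates such as "C.A. #1 - YES", and the default quote characters could misread any name containing an apostrophe.

```r
sample.text <- c(
  "County|Contest|Candidate|CountyVotes|CountyPercent|StateVotes|StatePercent",
  "Allen|United States Senate|D-Lisa Johnston|1103|26|215270|26",
  "Wyandotte|Constitutional Amendment 1 - Bear Arms|C.A. #1 - YES|25806|87|710255|89")

# quote="" and comment.char="" turn off quote and comment processing
results <- read.table(text=sample.text, sep="|", header=TRUE,
                      quote="", comment.char="", stringsAsFactors=FALSE)

str(results)

# With the real file, the text= argument would be replaced by the file name:
# results <- read.table("2010-Kansas-General-Election-11-03.txt", sep="|",
#                       header=TRUE, quote="", comment.char="",
#                       stringsAsFactors=FALSE)
```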

This example showed how to extract election data for each of 105 Kansas counties from separate web pages.

The next two exercises show how to reshape the data to form a summary table for a specific election contest, and how to analyze and map
the results from the judicial retention elections.



