by Earl F. Glynn, Kansas Watchdog
This article describes how to “reshape” screen-scraped data so that it is easier to analyze with existing tools.
The Kansas Secretary of State published Nov. 2010 election results online, including results by county. Viewing results by county for some statewide contests was not possible without visiting 105 web pages of county data.
Visiting each of the 105 county election result pages would be tedious, and manually copying all the numbers into a county summary spreadsheet would be error-prone.
The preferred approach, described in a series of articles, was an automated way to screen scrape and analyze the data.
Series of Articles about Screen Scraping Using R
This is the third of four articles about screen scraping and analyzing data:
- Simple R Screen Scraping Example
- R Screen Scraping: 105 Counties of Election Data
- R Reshape Examples
- Analysis and Mapping of Kansas Judicial Retention Elections by County Using R
R’s online help provides this description for the “reshape” function:
This function reshapes a data frame between “wide” format with repeated measurements in separate columns of the same record and “long” format with the repeated measurements in separate records.
The result from scraping 105 web pages was a file in “long” format, with one record per candidate per contest per county, each listing both county and state totals:
Allen|Governor / Lt. Governor|D-Tom Holland|1213|28|264214|32
Allen|Governor / Lt. Governor|L-Andrew P. Gray|111|3|21932|3
Allen|Governor / Lt. Governor|F-Kenneth (Ken) W. Cannon|95|2|15050|2
Allen|Governor / Lt. Governor|R-Sam Brownback|2967|68|522540|63
. . .
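A minimal sketch of loading records like these into an R data.frame follows. It uses only the four sample rows shown above. The column names are assumptions made for illustration, inferred from the description "both county and state totals", and the sketch assumes the file has no header row.

```r
# Four sample rows copied from the article; the real file has many more.
sample_lines <- c(
  "Allen|Governor / Lt. Governor|D-Tom Holland|1213|28|264214|32",
  "Allen|Governor / Lt. Governor|L-Andrew P. Gray|111|3|21932|3",
  "Allen|Governor / Lt. Governor|F-Kenneth (Ken) W. Cannon|95|2|15050|2",
  "Allen|Governor / Lt. Governor|R-Sam Brownback|2967|68|522540|63"
)

# Column names below are assumed, not taken from the original file.
results <- read.table(
  text = sample_lines,
  sep = "|",
  quote = "",           # treat quote characters in names as literal text
  comment.char = "",    # do not treat '#' as the start of a comment
  header = FALSE,
  stringsAsFactors = FALSE,
  col.names = c("County", "Contest", "Candidate",
                "CountyVotes", "CountyPercent",
                "StateVotes", "StatePercent")
)

str(results)
```

To read the scraped file itself, the `text = sample_lines` argument would be replaced by the path to the file.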
The contests by county vary, but the statewide contests are present for each county.
These statewide contests can be extracted from all the “long” format county data and displayed in a “wide” format, with one row per county and one column per candidate, for readability or analysis by many software tools.
A single R “reshape” function call makes the data much more usable, but the reshape must be applied to county data for a particular contest since the “long” format is by both contest and county.
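The extract-then-reshape step might look like the following sketch. It is not necessarily the call used in Reshape-Examples.R. The data frame is rebuilt from the sample rows above, and the column names are assumptions. With only Allen County in the sample, the wide result has a single row; the full data would give one row per county.

```r
# Long-format sample rebuilt from the rows in the article (assumed column names).
long <- data.frame(
  County      = "Allen",
  Contest     = "Governor / Lt. Governor",
  Candidate   = c("D-Tom Holland", "L-Andrew P. Gray",
                  "F-Kenneth (Ken) W. Cannon", "R-Sam Brownback"),
  CountyVotes = c(1213, 111, 95, 2967),
  stringsAsFactors = FALSE
)

# Keep one contest only, since the long format is by both contest and county.
governor <- long[long$Contest == "Governor / Lt. Governor",
                 c("County", "Candidate", "CountyVotes")]

# One row per county (idvar), one column per candidate (timevar).
wide <- reshape(governor,
                idvar     = "County",
                timevar   = "Candidate",
                v.names   = "CountyVotes",
                direction = "wide")

# reshape() prefixes each new column with "CountyVotes."; strip it for readability.
names(wide) <- sub("^CountyVotes\\.", "", names(wide))

print(wide)
```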
The detailed notes below are broken into several sections:
- Loading the file into an R data.frame
- Extracting data for governor’s race
- Reshaping the data to a “wide” format
- Writing data to CSV file
- Displaying data in Excel
- Performing a similar analysis for “Supreme Court Justice 1” contest
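The steps listed above can be sketched as one small helper that is reused for each contest. The function name `contest_to_wide`, the column names, and the temporary output file are all hypothetical. Only the governor's race appears in the sample rows, so that is the contest used here; the same call with a different contest name would apply to any other statewide contest in the full data.

```r
# Hypothetical helper: extract one contest and reshape it to wide format.
contest_to_wide <- function(long, contest) {
  one <- long[long$Contest == contest,
              c("County", "Candidate", "CountyVotes")]
  wide <- reshape(one,
                  idvar     = "County",
                  timevar   = "Candidate",
                  v.names   = "CountyVotes",
                  direction = "wide")
  names(wide) <- sub("^CountyVotes\\.", "", names(wide))
  rownames(wide) <- NULL
  wide
}

# Sample rows from the article (assumed column names).
long <- data.frame(
  County      = "Allen",
  Contest     = "Governor / Lt. Governor",
  Candidate   = c("D-Tom Holland", "L-Andrew P. Gray",
                  "F-Kenneth (Ken) W. Cannon", "R-Sam Brownback"),
  CountyVotes = c(1213, 111, 95, 2967),
  stringsAsFactors = FALSE
)

governor_wide <- contest_to_wide(long, "Governor / Lt. Governor")

# Write a CSV that Excel can open; a temporary file stands in for a real path.
csv_file <- tempfile(fileext = ".csv")
write.csv(governor_wide, csv_file, row.names = FALSE)

# Read it back to confirm the round trip; check.names = FALSE keeps
# candidate names with spaces and hyphens intact.
check <- read.csv(csv_file, check.names = FALSE, stringsAsFactors = FALSE)
print(check)
```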
Input Data File: 2010-Kansas-General-Election-11-03.txt from scraping 105 web pages
Output Data Files:
R Program: Reshape-Examples.R
Earl F Glynn • KansasWatchdog.org • Franklin Center for Government and Public Integrity