Computer Assisted Reporting
This article describes technical analysis of the Kansas voter registration file from February 2012.
The data were scrutinized in several ways:
- By data field statewide to determine what values were present and what problems exist.
- By county to create descriptive statistics and verify that values such as districts were correct.
- By Political Party by congressional district, county, state senate district, state house district, state board of education district.
- By Active vs. Inactive status and by the last year a voter cast a ballot in a federal election.
- By comparison to U.S. Census voting age population.
This technical article will be the foundation for several future Kansas Watchdog articles.
Problem getting started: Geary County delimiter problem
Since the new ELVIS (ELection Voter Information System) was introduced in 2006, a small number of voter records in Geary County have had a chronic problem: delimiters embedded inside delimited fields.
The extra delimiters appear in mailing address fields and shift other data into incorrect columns.
Fixing the problem is complicated. With tab-delimited data, many ASCII editors do not clearly show where the tabs are, since a tab is often displayed as a run of spaces rather than as a single character.
The records with extra delimiters cannot be correctly parsed automatically. Many programs including Access or Excel will read the data but silently shift data into wrong columns without indicating there is a problem. The problem is only apparent when trying to account for voters by districts or by party and there are unexpectedly wrong data in many fields.
This problem was first reported in 2006 to the Geary County Clerk and the Secretary of State’s office when there were about 87 problem voters. The number is now down to 25, but unless these problem records are removed, there are still hundreds of fields that appear to be incorrect during any analysis of the data.
The source of the problem is not at all clear. The problem may be in the data in Geary County or perhaps the problem is in the process used by the Secretary of State’s office to create a composite file of voters from all 105 counties.
For now the easiest way to avoid all the incorrect shifted data is to find and remove the records with the wrong number of delimiters.
The analysis below was done after removing from the original file the 25 Geary County voter records with the wrong number of delimiters.
The document below gives details. The R script shows how to find the problem records.
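As a rough illustration of the idea (a Python sketch, not the R script mentioned above), problem records can be found by counting delimiters on every line and flagging lines whose count differs from the most common count. The function name and the sample records are made up for this example, and the sketch assumes one record per line.

```python
from collections import Counter

def find_bad_records(lines, delimiter="\t"):
    """Return (expected_count, bad) where bad lists (line_number, count)
    for every line whose delimiter count differs from the most common one."""
    counts = [line.rstrip("\r\n").count(delimiter) for line in lines]
    expected = Counter(counts).most_common(1)[0][0]
    bad = [(i, c) for i, c in enumerate(counts, start=1) if c != expected]
    return expected, bad

# Made-up sample: line 3 has an extra tab inside the address field.
sample = [
    "db_logid\tname\taddress\tcity\n",
    "Geary\tDOE, JANE\t100 MAIN ST\tJUNCTION CITY\n",
    "Geary\tROE, RICHARD\tPO BOX 1\tAPT 2\tJUNCTION CITY\n",
    "Allen\tPOE, PAT\t200 ELM ST\tIOLA\n",
]
expected, bad = find_bad_records(sample)
print(expected, bad)  # 3 [(3, 4)]
```

Lines reported this way can be written to a separate file for review and left out of the file used for analysis.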
Summary of data fields
When ELVIS was introduced in 2006 the names of all the voter data fields were different from before and some abbreviations seemed unusual.
The use of “db_logid” for “county” is still not intuitive. Using “ks” in Kansas for “Kansas Senate” is also not intuitive — at least to me.
The voter data file is released on a CD sometimes with the names of the data fields, and at other times with no information about the contents. A description of the fields is not known to exist.
With 105 election authorities (county clerks or election offices) and no known written standard for the data fields, deciphering some data fields can be difficult. One must have some knowledge of what is in the file and reverse-engineer its contents to use the data effectively.
To understand the data fields, I wrote an R program to create descriptive “meta data” (data about data) for each of the 58 data fields, including:
- Field length meta-data: min, median, mean, max lengths
- Number of missing values for each field
- Number of unique values for each field
- Number of counties using each field
- Notes in “Comments” and “Problems” about each field, including missing or invalid data coding
The document Field Summary for Kansas Voter Registration File is my attempt to understand the data.
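The per-field statistics listed above could be computed along these lines. This is a Python sketch under assumptions, not the R program itself: records are assumed to be dictionaries keyed by field name, a blank or whitespace-only value is treated as missing, and the sample values are made up.

```python
from statistics import mean, median

def field_summary(records, field, county_field="db_logid"):
    """Summarize one field: length statistics, missing and unique counts,
    and the number of counties with at least one non-blank value."""
    values = [(r.get(field) or "") for r in records]
    present = [v for v in values if v.strip()]
    lengths = [len(v) for v in values]
    counties = {r[county_field] for r in records if (r.get(field) or "").strip()}
    return {
        "field": field,
        "min_len": min(lengths),
        "median_len": median(lengths),
        "mean_len": mean(lengths),
        "max_len": max(lengths),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "counties_using": len(counties),
    }

# Made-up sample records; only the field names come from the voter file.
records = [
    {"db_logid": "Allen", "cde_name_title": "MR"},
    {"db_logid": "Allen", "cde_name_title": ""},
    {"db_logid": "Geary", "cde_name_title": "DR"},
    {"db_logid": "Riley", "cde_name_title": ""},
]
summary = field_summary(records, "cde_name_title")
print(summary["missing"], summary["unique"], summary["counties_using"])  # 2 2 2
```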
I still do not know what information is in field district_pt (or other fields in an expanded version of the file). I do not understand why field cde_name_title is in the file when the information is not recorded on a voter registration form and 63 counties have only used the field for a total of 336 voters. Two fields about “carrier routes” are in the data file yet all of that data seem to be invalid. Why put fields in a file that have no existing purpose?
At any given time district information about groups of voters can be missing. In the current file about 900 voters are missing congressional district, state house district and state senate district information. There should be a periodic check of the data and this should never happen!
Additional data quality checks are needed to ensure the integrity of the data. Ten registered voters in the state do not have a first name. Why is the addition of a new voter not blocked until the name and all district information have been checked?
There appear to be more than a dozen different codes used for a “missing” phone number. Why not a single code for a “missing” phone number?
In addition to creating a summary of meta data about each field, the R program creates a frequency count for each field. Often problem data can be spotted by looking at the beginning or end of these frequency count files.
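A frequency count of this kind is simple to sketch in Python. The example below assumes the counts are sorted by value, so that blanks and placeholder values cluster at the start or end of the table; the phone values are made up and are not the actual "missing" codes found in the file.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count) pairs sorted by value, so odd entries
    such as blanks and placeholders collect at the ends of the table."""
    return sorted(Counter(values).items())

# Made-up phone field values, including several ways of saying "missing".
phones = ["7855550100", "", "0000000000", "7855550100", "N/A", "9999999999", ""]
table = frequency_table(phones)
for value, count in table:
    print(repr(value), count)
```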
See the files below for additional information.
- Field Summary for Kansas Voter Registration File, Feb. 24, 2012. Word • PDF
- Meta data, summary stats file: Statewide-FieldLengthSummary.csv
- Field frequency count files 001-db_logid.csv .. 058-district.sd.csv in Counts.zip
- R script: Statewide.R
Summary of election codes used for voter history
Ten fields in the voter file give information about the most recent elections in which a voter has cast a ballot.
Many of these election codes start with a two-character prefix followed by a four-digit year.
The known “valid” prefixes include:
- CP (city primary)
- CG (city general)
- GN (November General election)
- MB (Mail ballots)
- PR (August Primary election)
- SP (special elections)
- PP (presidential primary — does NOT include caucuses)
For example, the code GN2008 means a voter cast a ballot in the presidential general election in 2008.
Oddly, the current data show one ballot already cast in PR2012, the August 2012 primary, and GN2012, the November 2012 general election.
The document Kansas Voter Registration Election Codes shows the election codes present in the current voter file. This document shows which codes have been used how many times in how many counties.
Many codes are used by a small number of counties. Without contacting the county clerks or election officials there is no way to know what many of the codes mean.
I have declared election codes without a “standard” prefix or a four-digit year “invalid” because such data are useless for analysis.
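The validity rule can be expressed as a small check. The Python sketch below assumes a valid code is exactly one of the prefixes listed above followed by a four-digit year, with nothing after the year; the real file may contain variations that this strict pattern would reject.

```python
import re

# Prefixes listed in this article as known "valid" prefixes.
VALID_PREFIXES = {"CP", "CG", "GN", "MB", "PR", "SP", "PP"}
CODE_PATTERN = re.compile(r"([A-Z]{2})(\d{4})")

def parse_election_code(code):
    """Return (prefix, year) for a code such as 'GN2008', or None if the
    code lacks a standard prefix or a four-digit year."""
    match = CODE_PATTERN.fullmatch(code.strip())
    if match is None or match.group(1) not in VALID_PREFIXES:
        return None
    return match.group(1), int(match.group(2))

print(parse_election_code("GN2008"))  # ('GN', 2008)
print(parse_election_code("GN08"))    # None
print(parse_election_code("XX2008"))  # None
```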
Knowing what year a voter last cast a ballot was key in a past study of dead registered voters in Kansas — and will be needed in a study planned for later this year. Such analysis is limited when about half the ballot codes do not show a year a ballot was cast — essentially such ballot codes have no information.
See additional information in the files below.
Download (processing with R script Statewide.R from above):
County stats and validation of district codes
A separate R script was used to look at data from each county. A report was created for each county with summary statistics and additional information about what may be a problem in each county.
The Kansas Secretary of State publishes a very useful document that gives the districts (congressional, state senate, state house, board of ed) that are valid within any county. The R script checked that only those districts were assigned to voters within a county.
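A district check of this kind amounts to a lookup per voter. The Python sketch below is only an illustration: the county name, the district numbers, and the field names (voter_id, county, congressional, state_senate) are hypothetical and do not come from the Legislative Directory or the voter file. A missing district value is reported as a problem too.

```python
def invalid_district_assignments(voters, valid_by_county):
    """Return (voter_id, district_type, value) for every district value
    that is not in the set of valid districts for the voter's county."""
    problems = []
    for voter in voters:
        allowed = valid_by_county.get(voter["county"], {})
        for district_type, valid_values in allowed.items():
            value = voter.get(district_type, "")
            if value not in valid_values:
                problems.append((voter["voter_id"], district_type, value))
    return problems

# Hypothetical validation table and voters.
valid_by_county = {
    "County A": {"congressional": {"2"}, "state_senate": {"12", "15"}},
}
voters = [
    {"voter_id": 1, "county": "County A", "congressional": "2", "state_senate": "12"},
    {"voter_id": 2, "county": "County A", "congressional": "1", "state_senate": "15"},
    {"voter_id": 3, "county": "County A", "congressional": "2", "state_senate": ""},
]
problems = invalid_district_assignments(voters, valid_by_county)
print(problems)  # [(2, 'congressional', '1'), (3, 'state_senate', '')]
```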
The R script also checked for “correct” ZIP codes within a county. Some ZIP codes span county lines, so there are limitations to that check.
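The ZIP check can be sketched in two steps: a format test, then a lookup against the ZIP codes expected in the county. In the Python example below the county ZIP list is a placeholder, not real data. Because some ZIP codes span county lines, a ZIP outside the list is labeled as something to review rather than as an error.

```python
import re

def classify_zip(zip_code, county_zips):
    """Classify a 5-digit ZIP field value for one county."""
    if not re.fullmatch(r"\d{5}", zip_code):
        return "invalid format"
    if zip_code in county_zips:
        return "in county list"
    return "not in county list (may be a border ZIP)"

# Placeholder list of ZIP codes expected in a county.
county_zips = {"11111", "22222"}

print(classify_zip("6672", county_zips))   # invalid format
print(classify_zip("11111", county_zips))  # in county list
print(classify_zip("33333", county_zips))  # not in county list (may be a border ZIP)
```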
The Allen County report was annotated to highlight certain issues. (Reports for other counties are not annotated.) Here is a summary of the information in that report:
- A table shows party membership within the county. Allen County has 3905 Republicans, 2768 Unaffiliated (Independents), 1855 Democrats, 54 Libertarians and 5 Reform party members.
- Cross tabulations show the breakdown within Allen County for Active/Inactive voters by Party (there should be little difference), and the breakdown by gender and party.
- Near the bottom of page 1 the R script comments that 2 voters are missing residential street numbers and street addresses (this report does not give details).
- Continued from page 1 is a table showing the number of voters by city. This table can be used to find misspelled city names.
- The table of voter 5-digit zip codes shows that “6672” only has four digits and is invalid. A table of all 5-digit zips thought to be in the county appears next, and identifies zip codes that may be from border counties. The voter with the incorrect zip code “6672” is identified.
- Tables of mailing city, state and zips are shown.
- A list of possibly invalid mailing zip codes is shown, but Google says all of those zips are OK.
- Phone area codes that do not have three digits can be identified. Other area codes not in a validation table are identified.
- Validation of districts and precinct information looks OK.
- Precinct information for the voter file can be compared with the election result file. See section below on how this comparison is impossible for Allen County.
- Stats of election codes used in Allen County are shown. Codes without the “standard” prefix or a valid year are marked “InvalidCode?”
- Voters who have not voted since the 2004 election are identified. The “I” prefix on some voters means the voter is “inactive”. A number of voters on this list have not cast ballots in almost 20 years.
See a similar report for each county via the links below.
- ZIP file with 105 county ASCII text summary/problem files: County-Stats-Validation.zip
[Open in an ASCII editor, or in Word using landscape layout with Courier New font]
- R scripts: CountyValidation.R and CountySummary.R (uses output file from Statewide.R as input)
- Validation file: LegislativeDirectory2012.txt from 2012 Legislative Directory
- Output statistics file to compare data problems by county: state.stats.csv
PDFs of summary stats/problem files by county:
Allen • Anderson • Atchison • Barber • Barton • Bourbon • Brown • Butler •
Chase • Chautauqua • Cherokee • Cheyenne • Clark • Clay • Cloud • Coffey •
Comanche • Cowley • Crawford • Decatur • Dickinson • Doniphan • Douglas •
Edwards • Elk • Ellis • Ellsworth • Finney • Ford • Franklin • Geary •
Gove • Graham • Grant • Gray • Greeley • Greenwood • Hamilton • Harper •
Harvey • Haskell • Hodgeman • Jackson • Jefferson • Jewell • Johnson • Kearny •
Kingman • Kiowa • Labette • Lane • Leavenworth • Lincoln • Linn • Logan • Lyon •
Marion • Marshall • McPherson • Meade • Miami • Mitchell • Montgomery • Morris •
Morton • Nemaha • Neosho • Ness • Norton • Osage • Osborne • Ottawa • Pawnee •
Phillips • Pottawatomie • Pratt • Rawlins • Reno • Republic • Rice • Riley •
Rooks • Rush • Russell • Saline • Scott • Sedgwick • Seward • Shawnee • Sheridan •
Sherman • Smith • Stafford • Stanton • Stevens • Sumner • Thomas • Trego •
Wabaunsee • Wallace • Washington • Wichita • Wilson • Woodson • Wyandotte
Impossible to match voter registration precincts with election results precincts in many counties
Voter and election data are not helpful for looking for potential voter fraud in Kansas.
Comparing the list of voters in a precinct before an election with voter history from a precinct after an election is not possible in Kansas in many counties.
A simple statewide check of precinct-level election results, to see whether more ballots were cast than there were registered voters, is not currently possible. (Admittedly, the data for such comparisons are not perfect either because changes are made to ELVIS all the time.)
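If precinct names did match, the check itself would be straightforward. The Python sketch below uses made-up precinct names and counts. It lists precincts that appear in only one of the two sources and, for the names that do match, flags precincts with more ballots than registered voters. It assumes an exact match on precinct name, which is exactly what fails in many counties.

```python
def compare_precincts(registered, ballots):
    """registered and ballots map precinct name -> count.
    Report unmatched names and matched precincts with ballots > registered."""
    reg_names, result_names = set(registered), set(ballots)
    matched = reg_names & result_names
    return {
        "only_in_voter_file": sorted(reg_names - result_names),
        "only_in_results": sorted(result_names - reg_names),
        "more_ballots_than_registered": sorted(
            p for p in matched if ballots[p] > registered[p]
        ),
    }

# Made-up data: "North Township" and "N Twp" may be the same precinct,
# but an exact name match cannot tell.
registered = {"Ward 1": 500, "Ward 2": 300, "North Township": 120}
ballots = {"Ward 1": 350, "Ward 2": 310, "N Twp": 80}
report = compare_precincts(registered, ballots)
print(report)
```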
Precincts from voter data cannot be easily connected to election results in many counties.
In the section above, page 5 of the report for Allen County showed a list of the precincts associated with voters, and the list of precincts used to report election results in the Nov. 2010 election.
The diagram below shows how difficult the comparison of precincts is:
Click on the diagram above, or the links below, to see the comparison of voter precincts with election result precincts for several Kansas counties.
- Matching Kansas Precincts in Voter Registration Data and Election Results (Allen, Grant, Montgomery, and Sherman Counties). Word • PDF
- Summary file of ballots cast by office by precinct: Ballot-Totals-2010-General.csv derived from 2010 Election Information from Kansas Secretary of State
Political Party Breakdown
This chart shows “Unaffiliated (Independent)” voters growing slightly in Kansas at the expense of Democrats and Republicans, but Republicans have a sizable edge.
Scatterplots of %Democrat vs. %Republican show the political party breakdown by various districts.
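Each point in such a scatterplot is a pair of percentages for one district. As a worked example in Python, the Allen County party counts quoted earlier in this article give the following shares. The sketch assumes each percentage is taken over all registered voters in the district, including unaffiliated and minor-party voters; the function name is made up.

```python
def party_shares(counts):
    """Convert party counts for one district into percentages of all voters."""
    total = sum(counts.values())
    return {party: round(100.0 * n / total, 1) for party, n in counts.items()}

# Allen County counts as reported in the annotated county report.
allen = {
    "Republican": 3905,
    "Unaffiliated": 2768,
    "Democratic": 1855,
    "Libertarian": 54,
    "Reform": 5,
}
shares = party_shares(allen)
print(sum(allen.values()))                        # 8587
print(shares["Democratic"], shares["Republican"]) # 21.6 45.5
```

Under that assumption, Allen County would be plotted at roughly 21.6% Democratic and 45.5% Republican.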
Shown below is the scatterplot of %Democrat vs. %Republican by county:
Either Greenwood County is missing a number of Democrats and Republicans, or about two-thirds of its voters are now unaffiliated.
My guess is this chart shows a data problem with party affiliation in Greenwood County.
See links to charts and data below:
- Party History: % of Voters by Party in Kansas, 2002-2012. PDF
- Excel File: Political-Party-History-Stats-2012-02-24.xlsx
- R Script: Political-Party-History-2012.R
- Scatterplots of %Democratic vs. %Republican for Congressional Districts, Counties, House Districts, Senate Districts, State Board of Education Districts. PDF
- CSV files showing party breakdown by
- R Script: Party.R
“Active” and “Inactive” voters. Last year voted.
About 8.1% of Kansas voters are “inactive” currently — that is about 138,000 voters.
“Inactive” voters cannot receive mail at their voter registration address.
A second map that can be seen by clicking on the map above shows “Voters not casting ballots since 2004.”
About 6.3% of voters, that is about 108,000 voters, have not cast ballots since 2004 — or have only cast ballots that are reflected by “invalid” election codes in the data file.
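The "not since 2004" test can be sketched as follows. This Python example assumes a voter's history is a list of election codes, that only codes with a listed prefix and a four-digit year count, and that "not since 2004" means the latest valid year is 2004 or earlier. A voter with only "invalid" codes is flagged as well, matching the caveat above. The function names are made up.

```python
import re

VALID_PREFIXES = {"CP", "CG", "GN", "MB", "PR", "SP", "PP"}

def last_year_voted(codes):
    """Return the latest year among valid election codes, or None."""
    years = []
    for code in codes:
        match = re.fullmatch(r"([A-Z]{2})(\d{4})", code.strip())
        if match and match.group(1) in VALID_PREFIXES:
            years.append(int(match.group(2)))
    return max(years) if years else None

def not_voted_since(codes, cutoff_year=2004):
    """True if no valid code shows a ballot cast after cutoff_year."""
    last = last_year_voted(codes)
    return last is None or last <= cutoff_year

print(last_year_voted(["GN2008", "PR2004", "X1"]))  # 2008
print(not_voted_since(["GN2004", "GN1996"]))        # True
print(not_voted_since(["ABC"]))                     # True
print(not_voted_since(["GN2010"]))                  # False
```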
There is some overlap between “inactive” voters and voters who have not voted recently. About 13% of voters are on one list or the other.
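The size of the overlap follows from the rounded percentages above by inclusion-exclusion: the share on both lists is the sum of the two shares minus the share on either list. The Python sketch below only restates the article's rounded figures, so the results are approximate.

```python
# Rounded figures quoted in this section.
inactive_pct = 8.1          # "inactive" voters
not_since_2004_pct = 6.3    # no valid ballot code after 2004
either_pct = 13.0           # on one list or the other

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
overlap_pct = round(inactive_pct + not_since_2004_pct - either_pct, 1)

# Implied statewide total, from 138,000 inactive voters being about 8.1%.
total_voters = round(138_000 / (inactive_pct / 100))

print(overlap_pct)  # 1.4
print(total_voters)
```

So roughly 1.4% of voters appear on both lists, out of an implied total of about 1.7 million registered voters.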
- “Inactive” Voters[%]
- Voters Not Casting Ballots Since 2004 [%]
- “Invalid” Ballot History Codes [%]
- R Script: Kansas-Maps-2012-02-24.R (uses state.stats.csv created during county analysis)
- File: KansasMaps-SummaryData.csv
Comparison of U.S. Census voting age population to Kansas registered voters
A Kansas Watchdog report in Oct. 2010 suggested six Kansas counties had more voters than the U.S. Census said were of voting age (18 or older).
That analysis was done before the official 2010 U.S. Census numbers were available.
Using official 2010 U.S. Census numbers with voter registration data from 2010, there were seven Kansas counties — not six — that had more voters than the census said were of voting age.
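The comparison itself is a per-county ratio. The Python sketch below uses placeholder county names and made-up numbers, not the actual census or registration figures. It reports counties where registered voters exceed the voting age population, along with the ratio as a percentage.

```python
def counties_over_census(registered, voting_age_pop):
    """Return {county: percent} for counties where registered voters
    exceed the census voting age population (18 or older)."""
    over = {}
    for county, voters in registered.items():
        pop = voting_age_pop.get(county)
        if pop and voters > pop:
            over[county] = round(100.0 * voters / pop, 1)
    return over

# Made-up numbers for illustration only.
registered = {"County A": 2100, "County B": 5000}
voting_age_pop = {"County A": 2000, "County B": 6500}
print(counties_over_census(registered, voting_age_pop))  # {'County A': 105.0}
```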
See the table below.
- Working with Wisconsin Voter Data in Access 2007; Analysis with R, Franklin Center CAR Blog, Dec. 2, 2011.
- More Voters than Census Voting Age Population?, Franklin Center CAR Blog, Feb. 19, 2011.
- Dead voters in Kansas?, Kansas Watchdog, Oct. 28, 2010.
- Six Kansas counties have more voters than census voting population, Kansas Watchdog, Oct. 26, 2010.
- Kansas has almost 138,000 inactive voters, Kansas Watchdog, Oct. 25, 2010.
- Political Party Trends in Kansas, Kansas Watchdog, Sept. 24, 2010.