Computer Assisted Reporting
The GAB announced the PDF files could be viewed on their Recall Petition Public Access Site.
But how easy is it to download all the files and see if all pages can be accounted for?
How to “see” all the files?
Before 6 AM on Wed. the GAB’s web site was quite sluggish, possibly because of others also looking at the Scott Walker recall petition files. I could view the first page of the list of files but navigation to the end of the list was extremely slow. At that time I gave up trying to find out how long the list was.
I looked at the URLs in a web browser for several of the PDFs after clicking on the links, e.g.:
The URL pattern looked easy to mimic. The URL for each PDF simply had a range of pages, 1-50, 51-100, 101-150, ….
Initially I did not know where the upper limit was, but my guess was there would be about 153,335 / 50 = 3067 files of 50 pages each.
The blanks in the URL, each encoded as “%20” (the hexadecimal code for a space), made things look a bit more complicated. Care had to be taken to encode blanks correctly, and the “%20” sequences made many page numbers in the filenames much harder to read.
Instead of using the GAB convention for the first file of “Gov 1-50” I decided to use “GOV-000001-000050” for the name of my local copy. Six digits for the start and stop page numbers seemed excessive, but the final page numbers were likely well over 100,000.
I knew from past projects that leading zeros make filenames appear in the “correct” order in a list, where a human can scan them more quickly. Without leading zeros, a filename is sorted as a character string instead of a number, so files starting with a “1” (e.g., 1, 10, 100, 1000, …) may sort together, and likewise for other leading digits. That makes sorting and scanning a problem, as on the GAB site.
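The expected file count and the zero-padded naming scheme can be sketched in a few lines of R. This is only an illustration, not the code from the original script: the variable names are made up here, and the page total is the figure from the GAB press release.

```r
# Sketch: build the expected 50-page ranges and zero-padded local filenames.
total_pages <- 153335                      # page count from the GAB press release
starts <- seq(1, total_pages, by = 50)     # 1, 51, 101, ...
stops  <- starts + 49                      # 50, 100, 150, ...

length(starts)                             # expected number of 50-page files

# Six-digit, zero-padded page numbers keep a character sort in numeric order.
local_names <- sprintf("GOV-%06d-%06d.pdf", starts, stops)
head(local_names, 3)

# Without padding, a character sort groups names by their leading digit.
unpadded <- sprintf("GOV-%d-%d.pdf", starts, stops)
identical(sort(local_names), local_names)  # padded names are already in order
identical(sort(unpadded), unpadded)        # unpadded names are not
```

The two `identical` checks show the point of the padding: sorting the padded names leaves them unchanged, while sorting the unpadded names reorders them.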
R Script to Download Files
In theory a simple R script could use a single loop to construct the URL for each file and then a call to the download.file function would grab each file. See the R script Gov-Recall-Get-Petition-PDFs.R.
The theory was correct but the script failed as it encountered missing files and filenames that did not “obey” the observed pattern.
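A minimal sketch of such a loop follows. It is not the contents of Gov-Recall-Get-Petition-PDFs.R: the base URL is a placeholder (the real GAB address is not reproduced in this post), the “GOV ” prefix is assumed for every file, and the helper names `remote_url` and `fetch_one` are hypothetical. Wrapping `download.file` in `tryCatch` lets the loop continue past a 404 or 503 instead of stopping.

```r
# Placeholder base URL; the real GAB address is an assumption left out here.
base_url <- "http://example.org/recall/"

# Build the remote URL for one page range, encoding each blank as %20.
remote_url <- function(start, stop) {
  remote_name <- sprintf("GOV %d-%d.pdf", start, stop)
  paste0(base_url, gsub(" ", "%20", remote_name, fixed = TRUE))
}

# Try one download into a zero-padded local filename.
# Returns TRUE on success and FALSE on any warning or error (e.g., a 404).
fetch_one <- function(start, stop, dest_dir = ".") {
  dest <- file.path(dest_dir, sprintf("GOV-%06d-%06d.pdf", start, stop))
  tryCatch({
    download.file(remote_url(start, stop), dest, mode = "wb", quiet = TRUE)
    TRUE
  }, warning = function(w) FALSE, error = function(e) FALSE)
}

remote_url(20501, 20550)   # shows the %20 encoding for one range
```

A full run would call `fetch_one` for each start and stop pair and keep a list of the ranges that returned FALSE, so those can be inspected by hand rather than forcing a restart.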
Problems with Downloads
In all I needed more than 25 restarts of the script to deal with a variety of unanticipated problems.
I saw a few browser “503 errors” (service unavailable), possibly caused by an overworked server, but most were browser “404 errors” for “file not found.”
A variety of problems caused the 404 errors, which were manually fixed. In most of these cases the filename did not follow the “pattern” of 50 pages at a time:
- “GOV%2020051-21000.pdf” had transposed two digits.
- “GOV%2020501%2020550.pdf” had a blank instead of a dash.
- “GOV%2048201-48251.pdf” had “51” instead of “50” in its range.
- “GOV 52151-52300” should have been “GOV 52251-52300”.
- “GOV 52151-62200” should have been “GOV 62151-62200”.
- “GOV 52657-52700” should have been “Gov 52651-52700”.
- “GOV 71301-71350B” had an unexpected “B” in the number.
- “GOV 101501-101550-” ended with a dash.
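Irregular names like those listed above could be flagged automatically before a download is attempted. The following is a rough sketch under the assumption that a well-formed remote name is “GOV”, a blank, a start page on a 50-page boundary, a dash, and a stop page 49 higher. The function name `follows_pattern` is made up for this example.

```r
# TRUE when a name looks like "GOV <start>-<stop>" with start on a
# 50-page boundary (1, 51, 101, ...) and stop exactly 49 pages later.
follows_pattern <- function(name) {
  m <- regmatches(name, regexec("^GOV ([0-9]+)-([0-9]+)$", name))[[1]]
  if (length(m) != 3) return(FALSE)        # blank for dash, stray suffix, etc.
  start <- as.numeric(m[2])
  stop  <- as.numeric(m[3])
  (start %% 50 == 1) && (stop - start == 49)
}

# Names taken from the problem list, plus one corrected name for contrast.
names_seen <- c("GOV 20051-21000", "GOV 20501 20550", "GOV 48201-48251",
                "GOV 52151-52300", "GOV 71301-71350B", "GOV 101501-101550-",
                "GOV 52251-52300")
vapply(names_seen, follows_pattern, logical(1))
```

Only the last, corrected name passes the check. A test like this describes the names on the server, so it cannot find the misnamed files by itself; it only helps confirm a name once it has been spotted in the site's file list.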
Four of the 404 errors were caused by missing files, which means a total of 200 pages of signatures were missing (down to 150 pages now):
- GOV 40801-40850
- GOV 46951-47000 [uploaded at 4:18 PM on 2/2]
- GOV 134101-134150
- GOV 149851-149900
[Update Feb. 13. The three missing files above are now online and total 151 pages.]
After about 36 hours of downloading PDF files (with many interruptions) I had a total of 3047 files that took 11.2 GB of disk space.
Three of the files were 0-length files caused by missing files on the server. Only 3044 files were PDFs with signatures.
Check for corrupted PDFs and number of pages in each PDF
The command line utility program pdftk was used to check for file corruption and extract the number of pages in each PDF.
In a command prompt window a single line like the following extracts information about a PDF and stores it in a text file:
pdftk x.pdf dump_data output x.txt
That single line needed to be used 3044 times to extract the number of pages from all the PDFs. This is a simple and fairly quick task using a batch file.
A single-line pdfdata.bat file called the pdftk utility for each PDF in the directory:
for %%d in (*.pdf) do pdftk %%d dump_data output pdfdata\%%d.txt
The output from the batch file was redirected to a file (pdfdata.bat > pdfdata.txt) leaving error messages to appear on the console.
Error messages indicated a few of the PDF files were “corrupt”. In three of the cases, the message was caused by the 0-length PDF file, and could be ignored.
However, the file GOV-132401-132450.pdf was identified as corrupt for some unknown reason. The file was manually downloaded from GAB and then saved.
Files with 6 or 80 pages instead of 50?
The Gov-Recall-pdftk-pages.R script read the output from the pdftk utility and summarized the number of pages in the 3044 PDF files:
The table output above shows 2526 of the 3044 PDF files had 50 pages of signatures.
One file had only 6 pages, while another file had 80 pages of signatures.
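The page-count step can be sketched as follows. This is not the original Gov-Recall-pdftk-pages.R script: it assumes each pdftk report contains a line beginning with “NumberOfPages:”, and the helper name `pages_from_dump` and the small report fragments are invented for illustration.

```r
# Extract the page count from the lines of one pdftk dump_data report.
# Returns NA when no NumberOfPages line is present (e.g., a 0-length PDF).
pages_from_dump <- function(lines) {
  hit <- grep("^NumberOfPages:", lines, value = TRUE)
  if (length(hit) == 0) return(NA_integer_)
  as.integer(sub("^NumberOfPages:[[:space:]]*", "", hit[1]))
}

# Made-up report fragments for illustration only, not real petition data.
reports <- list(
  "a.pdf.txt" = c("InfoKey: Producer", "NumberOfPages: 50"),
  "b.pdf.txt" = c("NumberOfPages: 50"),
  "c.pdf.txt" = c("NumberOfPages: 6"),
  "d.pdf.txt" = character(0)
)

pages <- vapply(reports, pages_from_dump, integer(1))
table(pages, useNA = "ifany")   # frequency of each page count
```

With real data, each element of `reports` would instead come from `readLines` on one of the text files in the pdfdata directory.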
The R script formed two summary files:
- Excel file Gov-Recall-List.xlsx: List of the 3044 files showing number of pages in each PDF and total of 152,307 pages in all PDFs.
- Excel file Gov-Recall-Not50Pages-List.xlsx: List of 518 PDF files that do not have 50 pages. The range is from 6 to 80 pages, instead of the expected 50 pages.
Many files with “extra” pages had a number with an “A” and “B” suffix on the same page number. I saw one case of “A”, “B” and “C” suffixes with the same number.
One file with 62 pages (GOV-148151-148200) shows a number of irregularities in the page numbers:
It’s unclear why a handwritten number replaced the stamped number, but a check of the stamped numbered page shows different signatures.
Files with fewer than 50 pages had placeholder notices to indicate pages were not submitted.
For example the file GOV-063501-063550 had only 12 pages but this notice:
How many total pages?
The GAB press release said there were 153,335 petition pages, but the pages in the 3044 PDF files add up to only 152,307, a difference of 1,028 pages.
Analysis shows 150 pages are missing in the “gaps” in the filenames.
It’s unknown how many pages were not submitted.
Reconciling this difference will likely be a tedious process since the petition pages are all handwritten and automation is not possible.
All files accounted for?
A list of all downloaded PDF files was created from a command prompt:
dir *.pdf > pdf-file-list.txt
This list of files was scrutinized with the script Gov-Recall-Verify-PDF-List.R to make sure the many manual corrections applied while downloading files had not introduced mistakes.
This analysis showed there were no gaps in the pages in the filenames, and the first and last files were:
GOV-000001-000050.pdf . . . GOV-152301-152350.pdf
The last numbered page in this last file was 152335B (which followed 152335A).
Is it a coincidence this number is 1000 less than the GAB-specified number of pages, 153,335?
Is it possible that 152,335 is the “correct” number of pages instead of 153,335?
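The gap check on the filenames can be illustrated with a short sketch. It is not the code from Gov-Recall-Verify-PDF-List.R: the function names `parse_range` and `find_gaps` are hypothetical, and the three filenames in the example are chosen to show a gap rather than taken from the download.

```r
# Parse start and stop page numbers out of names like "GOV-000001-000050.pdf".
parse_range <- function(files) {
  m <- regmatches(files, regexec("^GOV-([0-9]{6})-([0-9]{6})\\.pdf$", files))
  data.frame(start = as.numeric(vapply(m, `[`, "", 2)),
             stop  = as.numeric(vapply(m, `[`, "", 3)))
}

# Report every place where a file does not start right after the previous
# file stops. An empty result means the page ranges are contiguous.
find_gaps <- function(files) {
  r <- parse_range(sort(files))
  n <- nrow(r)
  bad <- which(r$start[-1] != r$stop[-n] + 1)
  data.frame(after_stop = r$stop[bad], next_start = r$start[bad + 1])
}

# Illustrative input with pages 101-150 deliberately left out.
files <- c("GOV-000001-000050.pdf", "GOV-000051-000100.pdf",
           "GOV-000151-000200.pdf")
find_gaps(files)
```

Because the local names are zero-padded, a plain `sort` is enough to put the files in page order before comparing neighbors.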
In addition to verifying no files were missing, the script Gov-Recall-Verify-PDF-List.R took a look at the sizes of the PDF files.
The following boxplot shows the median PDF file size was about 4 MB with most of the files ranging from 3 to 5 MB:
File sizes generally correlate with number of pages:
> s[megabytes < 2] # small files (excluding 0-length files)
"02/01/2012 09:56 PM 898,733 GOV-063501-063550.pdf" (12 pages)
"02/02/2012 02:28 PM 1,889,719 GOV-151751-151800.pdf" (28 pages)
"02/02/2012 02:29 PM 369,713 GOV-152251-152300.pdf" (6 pages)
> s[megabytes > 6] # large files
"02/01/2012 09:37 AM 7,009,732 GOV-006151-006200.pdf" (80 pages)
"02/01/2012 09:33 PM 6,035,890 GOV-057701-057750.pdf" (54 pages)
"02/01/2012 10:55 PM 6,619,462 GOV-072451-072500.pdf" (50 pages)
"02/02/2012 09:36 AM 7,579,142 GOV-094551-094600.pdf" (50 pages)
The last two large files with 50 pages had several pages with “noisy” grey backgrounds, possibly caused by scanning a petition on a colored sheet of paper.
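For readers curious how `s` and `megabytes` in the listing above might be derived from the `dir` output, here is one possible sketch. The field layout (date, time, AM/PM, size, filename) is assumed from the lines shown, and dividing by 10^6 rather than 2^20 is an assumption chosen because it is consistent with the 6,035,890-byte file appearing in the “> 6” group.

```r
# Four lines copied from the listing above.
s <- c("02/01/2012 09:56 PM 898,733 GOV-063501-063550.pdf",
       "02/02/2012 02:29 PM 369,713 GOV-152251-152300.pdf",
       "02/01/2012 09:37 AM 7,009,732 GOV-006151-006200.pdf",
       "02/01/2012 09:33 PM 6,035,890 GOV-057701-057750.pdf")

# Split each line on runs of blanks; the fourth field is the size in bytes.
fields <- strsplit(s, " +")
bytes <- as.numeric(gsub(",", "", vapply(fields, `[`, "", 4)))
megabytes <- bytes / 1e6          # assumed definition of a megabyte

s[megabytes < 2]                  # small files
s[megabytes > 6]                  # large files
```

With the real file list, `s` would hold only the `dir` lines that describe PDF files, after the header and summary lines have been dropped.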