Computer Assisted Reporting
This is the second of four articles about analyzing distances between sex offenders and child daycare centers in Missouri as part of a joint project with KSHB NBC Action News in Kansas City.
The previous article gave details of analyzing the Missouri Sex Offender Registry and preparing a data file for geocoding.Our goal was to compute distances from Missouri registered sex offenders to child day centers in Missouri, and in particular church daycare centers. KSHB’s recent report explained that a loophole in law allows hundreds of Missouri sex offenders to live near church day cares.
To compute distances we needed the latitude-longitude coordinates from the offenders and the child daycare providers. But how do we convert street addresses to geolocation coordinates?
The process of converting addresses to latitude-longitude coordinates is called “geocoding” and Google provides an application program interface (API) for this. The Google Geocoding API page provides many details of how this process works. Also note Google has a number of restrictions on how this works.
We used a similar geocoding process described below on both lists of offenders and childcare centers.
Let’s look at an example to understand how geocoding works.
One of the Missouri registered sex offenders lives at this address: 5007 S Fuller Ave, Independence, MO. How can we use Google Geocoding to get the latitude-longitude of this address?
To geolocate this address simply enter the following URL into a web browser (it should be a single line URL):
Note in the URL each space in the address was replaced with a “+”.
Google returns the following XML code for the address. The fields we want to extract are shown in red:
<?xml version="1.0" encoding="UTF-8" ?>
<formatted_address>5007 S Fuller Ave, Independence, MO 64055, USA</formatted_address>
<long_name>S Fuller Ave</long_name>
<short_name>S Fuller Ave</short_name>
The R XML package provides functions for parsing XML (or in theory one could also parse the JSON default code returned by the Google API using a different R package).
Parsing XML is still a bit of a pain using the xpathApply function to extract the various data items. But to simplify parsing XML returned by Google geocoding, we wrote the function get.geocode, which is in the GoogleGeocode.R file.
Here’s an example of using the R get.geocode function to request the same information as shown above:
> d <- get.geocode("IDNumber","5007 S Fuller Ave","Independence","MO","")
street "5007 S Fuller Ave"
formatted.address "5007 S Fuller Ave, Independence, MO 64055, USA"
But how can we geocode a file of many addresses, such as a list of Missouri sex offenders or Missouri child daycare providers?
The R script GoogleGeocodeMOSexOffenders.R reads a list of addresses from the tab-delimited file geocode-MO-offender-in.txt, calls the function get.geocode for each line, saves the information from geocoding, and when complete, writes the results to the file geocode-MO-offender-out.txt.
Most files of addresses can be geocoded this way by modifying the script appropriately.
The batch setup process is fairly simple.
Because there can be a conflict in column names, I normally use an uppercase letter as the first letter of each field in the original file and lowercase letters with the geocoded info. That way ZIP is from the original file and zip is from geocoding.
Add these columns to the file you want to geocode — their exact placement does not matter:
status, zip, county, state, lat, lng, location.type, formatted.address, result.count
Modify the R script, like GoogleGeocodeMOSexOffenders.R, to assign quantities to street, city, state, and zip and pass those quantities to the get.geocode function. The geocoding results are then assigned to the extra columns you added to your file.
The program uses a while loop to stop geocoding if you reach the daily limit.
Limit of 2500 geolocation requests per day
Google imposes a limit of 2500 geolocation requests per day and that’s tied to your IP address. If somehow you can use more than one IP address, you can get 2500 geocodes per day per IP address.
If you exceed the 24-hour limit or otherwise abuse the service, the Geocoding API may stop working for you temporarily. If you continue to exceed this limit, your access to the Geocoding API may be blocked.
My experience is that Google knows to the minute when you exceeded their limit. Wait 24 hours before continuing. Google knows.
To restart geocoding on additional days, edit the file geocode-MO-offender-out.txt and look down the status column. I normally edit the file in Excel.
Most status values will say “OK” but the last one before Google quit will be “OVER_QUERY_LIMIT”. Clear this cell in Excel and save the file back to a tab-delimited file.
I normally number the “in” files after they’ve been processed to “in1″, “in2″, etc., and then rename the modified “out” file to be the “in” file for the next batch run.
The 11,000 Missouri offenders took 5 days to geocode this way.
Most geolocation assignments work well, but there can be problems. Study the “Status Codes” and “Results” on the Google page.
Sometimes Google will return more than one address match (result.count >0), and sometimes Google will decide the address is in a different state — which doesn’t make sense to me when the state is specified in the request.
To “accept” a geolocation assignment from Google I normally require:
- status is “OK”, and
- state matches the specified state, and
- result.count > 0 (but result.count of 1 is best).
I often use R statements like the following to decide which rows to accept in data.frame “d”:
select <- (d$status == "OK") & (d$result.count > 0) &
(d$state == STATE)
d <- d[select,]
For the results that I keep, I still often color a “fuzzy” match for approximate assignments where:
- result.count > 1 or
- location.type is not “ROOFTOP”
I find that location.type “ROOFTOP” with result.count of 1 gives an “exact” match, but many approximate matches are still acceptable.
I often use this R statement to define “fuzzy”:
d$fuzzy <- (d$result.count > 1) | (d$location.type != "ROOFTOP")
While the ultimate goal is to create an interactive Google map using the geocoded results, sometimes non-interactive maps of the results are useful during analysis and can be created using the R maps package.
If an R object “offender” is a dataframe containing the geocoded results as described above, these R statements will create a static map that uses the “fuzzy” quantity to show better matches as a saturated “red” and fuzzy matches as the less saturated “palevioletred”.
map("county", "Missouri", col="grey")
mtext("Missouri Sexual Offenders [June 20, 2011]", adj=0, col="red")
col=ifelse(offender.data$fuzzy, "palevioletred", "red")
Note the Google matches are “fuzzy” in many rural areas and are “ROOFTOP” mostly in urban areas.
Program and Files
GoogleGeocode.R with function get.geocode
GoogleGeocodeMOSexOffenders.R script to process file geocode-MO-offender-in.txt
(you will want to modify the basedir directory in line 13 to be your working directory)
Sample input and output files with 10 offenders:
- Loophole in law allows hundreds of Missouri sex offenders to live near church daycares, KSHB NBC Action News, Kansas City, July 18, 2011.
- Analysis of Missouri Sex Offender Registry Data, Franklin Center CAR Blog, July 18, 2011.
- Computing distance matrix between Missouri sex offenders and child daycare facilities, Franklin Center CAR Blog, July 28, 2011.
- Displaying Missouri sex offender/child day care facility proximity map using batchgeo.com, Franklin Center CAR Blog, July 28, 2011.
Earl F Glynn • KansasWatchdog.org • Franklin Center for Government and Public Integrity