Some states provide results data in a way that is not easily loaded using the data pipeline. This could be because the data is provided in a format that isn’t easily machine readable, such as an image PDF, or in a format that isn’t easy to inspect or load incrementally, such as a database dump stored in monolithic files. Results also may not be publicly accessible at all, and may have been obtained on physical media or by email.
In these cases, the data needs to be converted to CSV and placed in a separate repository that is accessible over HTTP. This could be done in a variety of ways, from one-off scripts to more customizable libraries that can handle multiple results files. You might need to perform OCR on images of results before converting them to CSV files. (For those records that have to be manually typed in, see our data entry page.)
This is important work: because many states don’t keep their results in a single format, converting a state’s results into one consistent format (CSV) is crucial to our efforts.
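As a rough illustration of the target format (not an exact specification; the columns vary by election and reporting level), a county-level results CSV using OpenElections’ common column layout looks something like this, with placeholder values:

```csv
county,office,district,party,candidate,votes
Example County,President,,DEM,Candidate A,1234
Example County,President,,REP,Candidate B,5678
```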
Repositories of preprocessed data should be named `openelections-data-{state_abbreviation}`. For example, Mississippi’s repository is named `openelections-data-ms`.
Preprocessed results go into state-specific data repositories, which is where you’ll find tasks to work on; Mississippi’s `openelections-data-ms`, mentioned above, is one example.
Take Oregon’s 2008 general election county-level results, which are in an electronic PDF. Using Tabula or Xpdf, we can manually extract the tabular data inside, copy it into a spreadsheet and convert it to a CSV file using OpenElections’ conventions: the result is here. The steps, using Tabula (and assuming you have it installed and running), follow its standard workflow:

1. Open Tabula in your browser and import the PDF.
2. Drag a selection box around each results table in the document.
3. Preview the extracted data to confirm the rows and columns match the original.
4. Export the data as a CSV, then clean up the headers and values in a spreadsheet to follow OpenElections’ conventions.
Tasks involving pre-processing of electronic PDFs are often tagged with our easy task label (here, for Oregon).
A similar approach can be used for a fixed-width text file, although in that case most of the parsing happens in the text editor itself.
Another approach is to use a programming language (preferably Python) to read and parse a text or HTML results file and write a CSV file. An example is the Python parser for results from Josephine County, Oregon.
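As a minimal sketch of this approach (not the Josephine County parser itself), the script below slices a fixed-width text file into a CSV. The file names, field names and column positions are hypothetical and would need to match the actual results file:

```python
import csv

# Hypothetical fixed-width layout: adjust the field names and column
# positions to match the real results file.
FIELDS = [("candidate", 0, 30), ("party", 30, 40), ("votes", 40, 50)]

def parse_line(line):
    """Slice one fixed-width line into a dict of stripped values."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}

with open("raw_results.txt") as infile, open("results.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=[name for name, _, _ in FIELDS])
    writer.writeheader()
    for line in infile:
        if line.strip():  # skip blank lines
            writer.writerow(parse_line(line))
```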
For image-based PDF files (here’s an example from Lincoln County, Oregon), performing OCR can make it possible to avoid manual data entry and parse the results as data. There are several options for OCR, but they usually require purchasing software. If you have the professional version of Acrobat, Able2Extract or similar software, you can run OCR on the contents of PDF files. If you don’t have access to any of those, please contact us at openelections@gmail.com and we can figure out a way to make it work.
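As a scripted alternative to commercial tools, one option is the open-source Tesseract engine, driven from Python through the pytesseract and pdf2image packages. This is a sketch, not part of our standard workflow; it assumes both packages plus the underlying poppler and tesseract-ocr system tools are installed, and the file name is hypothetical:

```python
from pdf2image import convert_from_path  # needs the poppler system library
import pytesseract                       # needs the tesseract-ocr system package

# Render each page of the scanned PDF to an image, then OCR it to text.
pages = convert_from_path("lincoln_county_results.pdf", dpi=300)
for number, page in enumerate(pages, start=1):
    with open(f"page_{number}.txt", "w") as out:
        out.write(pytesseract.image_to_string(page))
```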
Because OCR is a tricky process that isn’t always accurate, you’ll need to review the results to ensure that they are correct. In particular, check rows in which the votes are 0 or contain a 5, 6 or 8, digits that OCR frequently misreads.
In some cases, scraping and processing the raw data requires Python code. Processing scripts should be placed in a `bin` directory and any common supporting code should go in a package named `openelexdata.us.{state_abbreviation}`, but we’re happy to accept one-off scrapers and their output. We prefer Python, but are most concerned about the quality of the data, so if Ruby or Node is more your thing, contact us on GitHub, Twitter or by email at openelections@gmail.com and we’ll figure it out.
An example directory structure for Iowa looks something like this (the script names under `bin` are illustrative):
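```
openelections-data-ia/
├── README.md
├── bin/
│   ├── parse_2008_general.py
│   └── parse_2010_primary.py
└── openelexdata/
    ├── __init__.py
    └── us/
        ├── __init__.py
        └── ia/
            └── __init__.py
```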
It is important to note that:

- There should be a `README.md` file that describes the conversion process, the role of the supporting code and any manual processes.
- Processing scripts go in the `bin` directory.
- Supporting code goes in the `openelexdata` package directory.

The `openelexdata.us.{state_abbreviation}` package should be implemented as a namespace package so that the different state packages can live in separate repositories but can all be imported under `openelexdata.us`. The easiest way to make this work is to use the `pkgutil` module and place the following code in both `openelexdata/__init__.py` and `openelexdata/us/__init__.py`:
```python
from pkgutil import extend_path
# Merge same-named packages from other installed distributions into this one.
__path__ = extend_path(__path__, __name__)
```
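With that boilerplate in each repository, state packages installed from separate repos all resolve under the shared namespace. For example, assuming the Iowa package sketched above is installed:

```python
# Works even though openelexdata.us and openelexdata.us.ia may come
# from different repositories, thanks to the extend_path boilerplate.
from openelexdata.us import ia
```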