openelections-core's People

Contributors

aljohri, bsmithgall, chrisroat, divergentdave, dwillis, ericlagergren, ghing, jamesdunham, jmcarp, jslap, konklone, myersjustinc, nbdavies, rabidaudio, warwickmm, zstumgoren


openelections-core's Issues

Add archive invoke task

  • Create openelex.base.archive.py
  • Delete md.archiver.py
  • openelex.base.archive.py methods to implement
    • save_file(datefilter) - saves cached file to S3
    • delete_file(datefilter) - deletes files from S3 not found in datasource.mappings (see the sketch below)
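
A rough sketch of what openelex.base.archive could provide, assuming boto3 as the S3 client and datasource.mappings() entries keyed by generated_filename (both assumptions, not settled conventions):

import os

import boto3  # assumed S3 client; the project may use a different library


class BaseArchiver(object):
    """Sketch of an archiver that mirrors a state's cached result files to S3."""

    def __init__(self, state, bucket, cache_dir, mappings):
        self.state = state
        self.bucket = bucket
        self.cache_dir = cache_dir
        # mappings would come from datasource.mappings(); a list of dicts with a
        # 'generated_filename' key is assumed here
        self.mappings = mappings
        self.s3 = boto3.client("s3")

    def _key(self, filename):
        return "%s/%s" % (self.state, filename)

    def save_file(self, datefilter=None):
        """Upload cached files (optionally limited by a date prefix) to S3."""
        for mapping in self.mappings:
            name = mapping["generated_filename"]
            if datefilter and not name.startswith(datefilter):
                continue
            self.s3.upload_file(os.path.join(self.cache_dir, name),
                                self.bucket, self._key(name))

    def delete_file(self, datefilter=None):
        """Delete files from S3 that no longer appear in datasource.mappings."""
        expected = set(self._key(m["generated_filename"]) for m in self.mappings)
        resp = self.s3.list_objects_v2(Bucket=self.bucket, Prefix=self.state + "/")
        for obj in resp.get("Contents", []):
            if obj["Key"] in expected:
                continue
            if datefilter and datefilter not in obj["Key"]:
                continue
            self.s3.delete_object(Bucket=self.bucket, Key=obj["Key"])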

Add cache.diff task

Create a cache.diff invoke task that shows the difference between locally cached files and the expected files, by comparing the contents of the state cache dir with the file names generated from datasource.mappings.
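
A minimal sketch of how this task might work, assuming the current invoke API (context argument), a per-state cache directory under openelections/us, and mappings keyed by generated_filename (all assumptions):

import importlib
import os

from invoke import task


@task
def diff(c, state):
    """Show the diff between locally cached files and those expected from
    datasource.mappings. The cache path and mapping keys are assumptions."""
    # Assumes each state module exposes a Datasource class with a mappings() method
    datasource = importlib.import_module("openelex.us.%s.datasource" % state).Datasource()
    cache_dir = os.path.join("openelections", "us", state, "cache")

    expected = set(m["generated_filename"] for m in datasource.mappings())
    cached = set(os.listdir(cache_dir))

    for name in sorted(expected - cached):
        print("missing from cache: %s" % name)
    for name in sorted(cached - expected):
        print("not in mappings:    %s" % name)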

Candidates in more than one Contest?

Hi there, me again. This sorta relates to #30 but I thought I'd open another issue just to keep things neat.

@zstumgoren pointed out that you guys are working on new models for the elections data in your tasks branch, which move the Candidate and Result objects out from under the Contest objects. That's great, and I've gone ahead and implemented that approach in our fork of this project.

However, I'm wondering what the thinking is behind making a Candidate relatable to only one Contest. It seems to me that it is quite common for a candidate to run in more than one election for the same office, but the way I interpret this relation, you'd end up with a new Candidate object for every Contest a given person runs in, even though they are the same person. Would it make sense to make that relation a ListField of ReferenceFields pointing to the various contests the person has run in? That way you'd only end up with a single record for that person.

Or am I totally interpreting this the wrong way?
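
For concreteness, the change being suggested is roughly this (a MongoEngine sketch; class and field names are illustrative, not the project's actual models):

from mongoengine import Document, ListField, ReferenceField, StringField


class Candidate(Document):
    """One record per person, rather than one per contest."""
    name = StringField()
    # A single ReferenceField forces a new Candidate per Contest; a list of
    # references lets the same person point at every contest they appear in.
    contests = ListField(ReferenceField("Contest"))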

When should PDF conversion/extraction occur?

For results files that are electronic PDFs (or even for those which are not), when do we want to convert/extract the results? Before the loader is run, or during that process? It seems to me that we might want the data extraction done before the loader is run, meaning that generated_filenames would not be PDFs even if the source files were. Thoughts?

Include comment with sample data rows

This is just a style suggestion that I thought of while looking through us.md.load.

To reduce contributor friction and to make it easier to come back to the code later, it would be a big help to include a sample data row in a comment above the code that parses or transforms it. This is particularly true when there's a lot of variance from year to year.
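
For example, something like this (the row layout and helper name are made up for illustration):

def _parse_county_row(self, row):
    # Sample row from a 2002 county-level file (illustrative only):
    #   "Anne Arundel","House of Delegates","District 30","Smith, Jane","DEM","1234"
    jurisdiction, office, district, candidate, party, votes = row
    ...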

Generated file names for multiple offices at the same level

Ohio has some instances in which multiple files cover the same "office" label we use - for example, they have separate HTML pages for state senators and state reps, as well as for state officer posts (see primary results here: http://www.sos.state.oh.us/sos/elections/Research/electResultsMain/2008ElectionResults.aspx, for example). We haven't really decided on the naming conventions for those. Do we want to have something like:

20080304__oh__democratic__primary__state_leg_1.html

or

20080304__oh__democratic__primary__state_senate.html
20080304__oh__democratic__primary__state_house.html

?

Devise naming convention for raw results files

We need a consistent file naming convention for raw result files. This file name would be applied during the initial download of the file (in the fetch class), and would be the name of the file archived on S3. It should provide enough information about the source file to link up to our metadata API.

Strategies

Metadata ID

Resolve a canonical ID using metadata API.

Pros

  • Minimizes how gnarly the file name gets.
  • Provides an early tie-in to our metadata that could be used in downstream parsing and loading processes.

Cons

  • May not be as intuitive and reversible as a plain-language ID.
  • Tightly couples our scraping process to an external API.
  • Does not account for local data and referendums, since these source types are not reflected in metadata. Could devise a secondary convention for non-target file types.

Composite name

Generate composite file names that reflect metadata captured in our data admin.

File name components could include:

  • election date
  • state
  • race type (general, primary-dem, primary runoff-dem, etc.)
  • OCD id for jurisdictional boundary of the data. This could be the OCD id of the geographic area for which results are provided. For example, a file for MD that contains precinct-level results for Anne Arundel County could use a slugified version of the plain OCD name
  • race_code that denotes types of races covered in the data file. Could be an optional element only used when state slices up data for a single election date into multiple files. For example, Louisiana provides precinct-level results, by parish, for each race.
  • reporting level - precinct, city, county, state, etc.
  • file type extension - db, csv, html, json, xml, etc.

Examples:

FORMAT

<YYYYMMDD>_<state>_<race_type>_<jurisdiction>_[<race_code>_]<level>.<ext>

EXAMPLES

Louisiana Congressional District 1 precinct level results, Jefferson Davis Parish
20121106_la_general_jefferson-davis-parish_cd-1_precinct.html

Allegany County precinct results for general election (contains multiple race types)
20121106_md_general_allegany-county_precinct.csv

Pros

  • file names are plain-language and reversible
  • not coupled to an external API but could be parsed and used to query the metadata API
  • naming convention should work across target and non-target data (e.g. local races and referendums)

Cons

  • Some rather gnarly file names
  • the race_code handling is a bit fuzzy and ad hoc. Would need to be careful about enforcing a convention here.
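
To illustrate the reversibility point, a composite name under this convention could be parsed back into its parts with a sketch like the following (the optional race_code is the fuzzy bit noted above):

import os


def parse_raw_name(filename):
    """Parse a composite raw-results file name back into its metadata parts.

    Assumes the single-underscore format above:
    <YYYYMMDD>_<state>_<race_type>_<jurisdiction>_[<race_code>_]<level>.<ext>
    """
    base, ext = os.path.splitext(os.path.basename(filename))
    parts = base.split("_")
    meta = {
        "date": parts[0],
        "state": parts[1],
        "race_type": parts[2],
        "jurisdiction": parts[3],
        "reporting_level": parts[-1],
        "ext": ext.lstrip("."),
    }
    if len(parts) == 6:  # race_code is the optional fifth component
        meta["race_code"] = parts[4]
    return meta


# parse_raw_name('20121106_la_general_jefferson-davis-parish_cd-1_precinct.html')
# -> {'date': '20121106', 'state': 'la', 'race_type': 'general', ...}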

How to present party field in baked output?

In the baker (see #39) output:

  • where should the party field described in the Result spec come from? I'm assuming this is Candidate.parties rather than Contest.party.
  • Candidate.parties is a list. What's the motivation for making this a list rather than a single value? The list serializes fine in CSV, at least for MD, but should we compress it to a string for output? How should we handle multiple values in this string?

@zstumgoren, @dwillis do you have an answer to this?

Accessing undeclared variables or not using variables in us.md.load

load_county_2002()

  • mapping variable is referenced but not declared
  • candidate variable is referenced but not declared
  • write_in variable is referenced but not declared

load_2002_file()

  • A result object is instantiated at the end, but it looks like it's never appended to the results list that gets passed to Result.objects.insert

load_2000_primary_file()

  • winner and cand_kwargs variables are never declared

Add archive invoke task to generate manifest on S3

Add archive task that generates a manifest for a given state based on the list of files saved to S3. It should link them up by standard filename to the original/raw URLs, plus any other appropriate metadata from datasource.mappings.

Add optional flag to archive previous version of a file on S3. This is important in cases where we hand-keyed results data or used some combination of automation (e.g. Tabula) and manual processes. The flag would allow us to version the data over time. (see #55)
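
A sketch of the manifest-generation piece, assuming boto3 as the S3 client and mappings entries keyed by generated_filename and raw_url (all assumptions):

import csv

import boto3  # assumed S3 client


def generate_manifest(state, bucket, mappings, outpath):
    """Write a manifest linking each standard filename to its raw URL and
    whether a copy is currently archived on S3 (pagination omitted)."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=state + "/")
    archived = set(obj["Key"].split("/", 1)[1] for obj in resp.get("Contents", []))

    with open(outpath, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["standard_name", "raw_url", "archived"])
        for mapping in mappings:
            name = mapping["generated_filename"]  # assumed mapping keys
            writer.writerow([name, mapping["raw_url"], name in archived])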

Add bakery to invoke framework

  • Write tests using FactoryBoy
  • Create base/bake.py module with Baker class
    • Executes Mongo query constructed from bake invoke task's CLI params
    • Preloads Election and Candidate records, by key, into memory
    • Writes batched data (default batch of 5000) as stream to a target file
  • Add to_csv and to_json methods to Election, Candidate, Result models. Do not serialize references by default (include_refs=False).
  • Create bake.state_file invoke task (a rough sketch follows this list)
    • Accepts filters for customizing result output
      • state (required) - postal code
      • format (default: CSV) - format for output
      • outputdir (default: openelections/us/bakery)
      • date [YYYY|YYYY-MM-DD] (require this to avoid huge file sizes?)
      • type [general|primary|special|etc] - use choices list from models.py
      • office - most useful after data standardized
      • district - most useful after data standardized
      • party - most useful after data standardized
      • reporting level [state|county|precinct| etc.] - use choices list from models.py
    • Sets sensible defaults. For example, output all state/contest-wide results for all races when no filters are applied.
    • Writes two files to openelections/us/bakery folder:
      • ._ - the baked results
      • manifest.txt - query parameters used to produce the result
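
Something along these lines for the task signature (a sketch assuming the current invoke API; the Baker constructor and its bake/write_manifest methods are placeholders for whatever base/bake.py ends up exposing):

from invoke import task

from openelex.base.bake import Baker  # module path per the checklist above


@task
def state_file(c, state, fmt="csv", outputdir="openelections/us/bakery",
               datefilter=None, election_type=None, reporting_level=None):
    """Bake filtered results to a file plus a manifest.txt of the query used.

    A sketch only: the real task would expose the full filter set described
    above (office, district, party) and apply the sensible defaults.
    """
    baker = Baker(state=state, datefilter=datefilter,
                  election_type=election_type, reporting_level=reporting_level)
    baker.bake(fmt=fmt, outputdir=outputdir, batch_size=5000)
    baker.write_manifest(outputdir)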

Office standardization lookup CSV file

Office standardization - list all offices and office holder names; identify upper and lower chamber for state legislatures; give generic titles to offices that don’t have common names.
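
A possible layout for the lookup file; the columns and rows are illustrative guesses, not a settled schema:

raw_office,standard_office,chamber,state,notes
STATE SENATOR,State Senate,upper,md,
HOUSE OF DELEGATES,State House,lower,md,MD's lower chamber
COMPTROLLER OF MARYLAND,Comptroller,,md,generic title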

Create RawResult documents in load step, create Result, Contest and Candidate documents in transform step

Tasks

  • Update models to reflect changes (see code and questions in models.py on rawresults branch)
  • Update MD loader to use RawResult model
  • Migrate logic for creating unique Contest and Candidate entries from MD loader to transforms
  • Update Contributor docs to reflect the new workflow

Background

This comes from a discussion in #46 where @zstumgoren said:

But I'm starting to wonder if the creation of unique Candidate and Contest instances should be treated as a transform step. Our initial goal with the data load step should simply be getting the data loaded into Mongo in its raw form. @dwillis and I agreed to this approach a while back, and have gradually migrated transforms and various cleanups from the load step to the transformation step.

Enforcing uniqueness of contests and candidates in the load step adds a great deal of complexity to this phase of the pipeline, and it feels like we're blending concerns a bit. Unless @dwillis has strong feelings against, I'd be favorable to shifting our approach. I don't think it would take a great deal of reworking of the models or loader/bakery. In fact, it would greatly simplify the loader and possibly v1 of bakery.

Here's one possible strategy:

  • Create a RawResult model that lets us load a flat model of all raw data (this would be our current Result model, plus contest and candidate fields currently normalized to their own models)
  • Generate unique Contest and Candidate instances and "clean" Result documents as subsequent transform steps

In this new model, Result documents would store cleaned or processed Result data migrated from RawResult, or generated subsequently from lower-level results (e.g. race-wide results rolled up from precinct-level results). In general, these collections would store transformed, normalized versions of our raw data.
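
A rough sketch of that flat model (MongoEngine; the field names and types are guesses based on the discussion, not the actual models.py):

from mongoengine import DateTimeField, Document, IntField, StringField


class RawResult(Document):
    """Flat, load-step document: the raw result plus the contest and candidate
    fields that are currently normalized into their own models."""
    # election/contest fields, stored raw
    state = StringField(required=True)
    election_id = StringField(required=True)
    start_date = DateTimeField()
    office = StringField()
    district = StringField()
    primary_party = StringField()
    # candidate fields, stored raw
    full_name = StringField()
    party = StringField()
    write_in = StringField()
    # the result itself
    reporting_level = StringField()
    jurisdiction = StringField()
    votes = IntField()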

Use multiple loader classes instead of methods for state data formats that vary over time

For example, in md.

Instead of:

class LoadResults(BaseLoader):

    def run(self, mapping):
        ...
        # Load results based on file type
        if '2002' in self.election_id:
            self.load_2002_file(mapping)
        ...

have multiple classes:

class Load2002Results(BaseLoader):
    def run(self, mapping):
        ...

    # Any other year related supporting methods happen here

class LoadResults(BaseLoader):
    def run(self, mapping):
        self.election_id = mapping['election']

        if '2002' in self.election_id:
            Load2002Results().run(mapping)

The big advantage of this approach is that it makes it easier to see which helper methods are related to a particular vintage, without having to think about or stick to a naming convention for the methods.

I could also imagine being able to put common functionality in a state loader class and then reuse it in year-based subclasses.
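
For illustration, the common-base-class idea might look like this (class and helper names invented):

class MDBaseLoader(BaseLoader):
    """Shared helpers for every MD vintage (names invented for illustration)."""

    def _clean_jurisdiction(self, name):
        return name.strip().title()


class Load2002Results(MDBaseLoader):
    def run(self, mapping):
        ...  # 2002-specific parsing, reusing the shared helpers


class Load2000PrimaryResults(MDBaseLoader):
    def run(self, mapping):
        ...  # 2000 primary parsing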

Finish BaseLoader and port MD load.py

  • Finish port of openelex.base.load.BaseLoader
  • Port md.load.py to:
    • use updated BaseLoader
    • use md.datasource.py methods (mappings, elections, etc.).
    • move name parsing from BaseLoader to md loader

Two elections on same day or one election?

We're looking at an election board that seems to think of a Republican Primary as a different election than a Democratic Primary, even though they are held on the same day.

What's the thinking in this project?

Models missing fields in spec

The data models are missing some of the fields described in the spec.

Missing from Result model

  • pct
  • precincts

Missing from Contest model

  • absentee_provisional
  • source_url
  • notes

@zstumgoren suggested that some of these fields might be dynamic properties on the model class, but it's important to remember that we bypass the model layer when baking in the interest of performance.

Originally opened as part of the discussion for #39.
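
For concreteness, the missing fields might be declared roughly like this (a sketch; only the missing fields are shown, and the field types are guesses at the spec):

from mongoengine import BooleanField, Document, FloatField, IntField, StringField


# Existing fields on Result and Contest are omitted; types are assumptions.
class Result(Document):
    pct = FloatField()       # candidate's share of the vote
    precincts = IntField()   # precinct count covered by the result


class Contest(Document):
    absentee_provisional = BooleanField()
    source_url = StringField()
    notes = StringField()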

Implement name conversion strategy for raw results files

Create a module to standardize names of raw result files. Raw results will be stored on S3 using the standardized name.

Standardized names should:

  • be resolvable back to raw file names
  • encapsulate enough information about the contained results to link up to metadata via API

Naming Convention

See #4 for details on naming convention

Standardization should generate a composite file name that reflects metadata captured in our data admin.

File name components should include:

  • election date - YYYYMMDD
  • state - postal code
  • race type - general, primary-dem, primary runoff-dem, etc.
  • jurisdiction - OCD id of the jurisdiction, or geographic area, for which results are provided. For example, a file for MD that contains precinct-level results for Anne Arundel County could use a slugified version of the plain OCD name
  • race_code that denotes the types of races covered in the data file. This is an optional element that should only be used when a state provides data for a single race in a distinct file. For example, Louisiana provides precinct-level results, by parish, for each race. This field could also be expanded, on a state-by-state basis, to handle arbitrary groupings of results (e.g. separate files for state leg., federal, local).
  • reporting level - precinct, city, county, state, etc.
  • file type extension - db, csv, html, json, xml, etc.

Format

File name components separated by double underscores; component sub-parts separated by single underscores.

<YYYYMMDD>__<state>__<race_type>__<jurisdiction>__[<race_code>__]<level>.<ext>

Examples

Louisiana Congressional District 1 precinct level results, Jefferson Davis Parish
20121106__la__general__jefferson_davis_parish__cd_1__precinct.html

Allegany County precinct results for general election (contains multiple race types)
20121106__md__general__allegany_county__precinct.csv

Implementation

The standardized name should be generated during the file download process (in state-specific fetch.py modules).

Each state directory should have a 2-column mappings.txt file that contains the standardized name and a link to the raw result file. The raw link should point to the result file at the source agency, or to a copy of the raw file archived on S3. The latter would be used in cases where result files are not scrapable (e.g. if the agency provided a database dump).

## mappings.txt ##
standard_name, raw_source_name
20121106__md__general__anne_arundel_county__precinct.csv, http://www.elections.state.md.us/elections/2012/election_data/Anne_Arundel_By_Precinct_2012_General.csv
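
A sketch of the name generation, assuming the metadata arrives as a dict (the keys are invented for illustration):

def slugify(name):
    """Lower-case and join sub-parts with single underscores, per the format above."""
    return "_".join(name.lower().split())


def standardized_filename(meta):
    """Build <YYYYMMDD>__<state>__<race_type>__<jurisdiction>__[<race_code>__]<level>.<ext>"""
    bits = [
        meta["date"].strftime("%Y%m%d"),
        meta["state"].lower(),
        meta["race_type"],                 # e.g. 'general' or 'primary_dem'
        slugify(meta["jurisdiction"]),     # e.g. 'anne_arundel_county'
    ]
    if meta.get("race_code"):              # only when the state splits files by race
        bits.append(meta["race_code"])
    bits.append(meta["reporting_level"])
    return "__".join(bits) + "." + meta["ext"]


# e.g. -> '20121106__md__general__anne_arundel_county__precinct.csv'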

Add validations for MD

Write validation tests for MD results to ensure various result file types were loaded correctly.

Examples:

  • count of result records by candidate and file type (e.g. all target candidates in 2012 general state leg file should have 5504 results - 64 leg districts x 86 candidate types, including other write-ins)
  • Tally results for candidates at sub-racewide reporting levels and compare to known totals
  • compare number of candidates and contests to expected numbers
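
A count-based check of the first kind might look something like this (the election_id format, reporting level label, and models import path are assumptions):

from openelex.models import Result  # assumed module path


def validate_2012_general_state_leg_count():
    """Each target candidate in the 2012 general state legislative file should
    have 5504 results: 64 legislative districts x 86 candidate types."""
    expected = 64 * 86  # 5504
    results = Result.objects.filter(
        election_id="md-2012-11-06-general",   # assumed election_id format
        reporting_level="state_legislative",   # assumed reporting level label
    )
    counts = {}
    for result in results:
        key = str(result.candidate)  # however candidates end up being keyed
        counts[key] = counts.get(key, 0) + 1
    bad = dict((cand, n) for cand, n in counts.items() if n != expected)
    assert not bad, "unexpected result counts: %r" % bad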

Create datasource for MD

Encapsulate the process of building source data URLs and standardizing result filenames (now in fetcher.py) in a new datasource.py module. This Datasource class should provide a simple public interface for dynamic querying by the downstream fetcher and loader.
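
Interface-wise, something like the following sketch, using the mappings/elections method names mentioned elsewhere in these issues (the return shapes and the target_urls helper are assumptions):

class Datasource(object):
    """Simple public interface for the downstream fetcher and loader (MD sketch;
    return shapes are assumptions)."""

    def elections(self, year=None):
        """Election metadata from the metadata API, optionally limited to a year."""
        ...

    def mappings(self, year=None):
        """List of dicts pairing each standardized filename with its raw source
        URL and any other metadata needed downstream."""
        ...

    def target_urls(self, year=None):
        """Raw source URLs to fetch - a convenience wrapper over mappings()."""
        ...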

Add cache.diff alert to load.run invoke task

Update load.run invoke task to:

  • exit with alert if there's a difference between cached files and those expected (based on cache.diff; see #17).
  • force loading despite cache.diff using -f/--force flag

Normalized Candidates

@fgregg and I have been exploring implementing the openelections data structure for our local elections in Chicago and we ran across an issue today which I'm wondering if you might consider implementing in a slightly different way.

Since a Candidate is stored as an EmbeddedDocument within each Result (which is itself an EmbeddedDocument within a Contest), the process of updating an individual Candidate can be somewhat of a bear, especially for a candidate who has been running in elections for as long as we have data (and since our data is at the precinct level).

The main reason this comes up is because we're storing information about local aldermen in a pupa instance, which is giving us ocd_person ids for them. We'd like to be able to cross-reference that info with the info about the elections they've run in that we're storing in this app, and the only way we have to do that is to manually add the ocd_person id into this app. The manual part of this we were expecting and can handle, but I'm wondering if you might consider storing the candidates as a separate Document, the way that you're storing the Office for a given result. This would certainly make the process of getting at the information about candidates a whole heck of a lot easier.
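
In other words, something roughly like this MongoEngine sketch (class and field names are illustrative, with the ocd_person id stored directly on the shared candidate record):

from mongoengine import Document, EmbeddedDocument, IntField, ReferenceField, StringField


class Candidate(Document):
    """Stored once, the way Office is, instead of embedded in every Result."""
    name = StringField()
    ocd_person_id = StringField()  # cross-reference to the pupa instance


class Result(EmbeddedDocument):
    # Still embedded in Contest, but now pointing at the shared Candidate record
    candidate = ReferenceField(Candidate)
    votes = IntField()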

OCD Division mappings

Should the dashboard have an API response containing OCD mappings for a state? Should we be able to add our own via the admin?

Add tasks to generate and archive manifest file

  • git rm filenames.json from md/mappings/filenames.json
  • Add invoke datasource.create_manifest to generate manifest.csv in state/mappings dir
  • Add invoke cache.save_manifest to save manifest.csv (formerly filenames.json) to S3
