
fastfec's Introduction

FastFEC

A C program to stream and parse Federal Election Commission (FEC) filings, writing output to CSV.

Installation

Download the latest release and place it on your path, or if you have Homebrew and are on Mac or Linux, you can install via:

brew install fastfec

You can also build a binary yourself following the development instructions below.

Usage

Once FastFEC has been installed, you can run the program by calling fastfec in your terminal:

Usage: fastfec [flags] <id or file> [output directory=output] [override id]
  • [flags]: optional flags which must come before other args; see below
  • <id or file> is either
    • a file, in which case the filing is read from disk at the specified local path
    • a numeric ID (only works with --print-url): prints the possible URLs where the filing lives on the FEC docquery website
  • [output directory] is the folder in which CSV files will be written. By default, it is output/.
  • [override id] is an ID to use as the filing ID. If not specified, the ID is taken from the numeric component at the end of the path given as the first parameter.

The CLI will read the specified filing from disk and then write output CSVs for each form type in the output directory. The paths of the outputted files are:

  • {output directory}/{filing id}/{form type}.csv
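
The path layout above can be sketched in Python, for example to locate the CSVs after a run. This is only an illustration: derive_filing_id and output_csv_path are hypothetical helpers, and the rule "trailing digits of the file name without its extension" is an assumption about how the numeric component is extracted, not FastFEC's actual logic.

```python
import re
from pathlib import Path


def derive_filing_id(source: str) -> str:
    """Return the trailing digits of the file name (assumed rule, see lead-in)."""
    match = re.search(r"(\d+)$", Path(source).stem)
    if match is None:
        raise ValueError(f"no numeric filing ID at the end of {source!r}")
    return match.group(1)


def output_csv_path(output_dir: str, filing_id: str, form_type: str) -> Path:
    """Build {output directory}/{filing id}/{form type}.csv."""
    return Path(output_dir) / filing_id / f"{form_type}.csv"
```

For instance, output_csv_path("fastfec_output", derive_filing_id("13360.fec"), "SA11D") points at fastfec_output/13360/SA11D.csv.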

You can also pipe the output of another command in by following this usage:

[some command] | fastfec [flags] <id> [output directory=output]

Flags

The CLI supports the following flags:

  • --include-filing-id / -i: add a filing_id column at the beginning of every generated file, populated with the filing ID. This can be useful for bulk uploading CSVs into a database.
  • --silent / -s: suppress all non-error output messages
  • --warn / -w: show warning messages (e.g. for rows with unexpected numbers of fields or field types that don't match exactly)
  • --no-stdin / -x: disable receiving piped input from other programs (stdin)
  • --print-url / -p: print URLs from docquery.fec.gov (cannot be specified with other flags)

The short form of flags can be combined, e.g. -is would include filing IDs and suppress output.
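
When calling the CLI from a script, the combined short form can be assembled programmatically. The sketch below is a hypothetical helper (build_fastfec_command is not part of FastFEC); it only uses the -i, -s, -w and -x flags listed above and places them before the other arguments, as the usage requires.

```python
from typing import List, Optional


def build_fastfec_command(
    source: str,
    output_dir: Optional[str] = None,
    *,
    include_filing_id: bool = False,
    silent: bool = False,
    warn: bool = False,
    no_stdin: bool = False,
) -> List[str]:
    """Return an argument list such as ["fastfec", "-is", "13360.fec"]."""
    short_flags = ""
    if include_filing_id:
        short_flags += "i"
    if silent:
        short_flags += "s"
    if warn:
        short_flags += "w"
    if no_stdin:
        short_flags += "x"

    command = ["fastfec"]
    if short_flags:
        # Flags must come before the other arguments.
        command.append("-" + short_flags)
    command.append(source)
    if output_dir is not None:
        command.append(output_dir)
    return command
```

The resulting list can be passed to subprocess.run(command, check=True), assuming fastfec is on the path.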

Examples

Parsing a local filing

fastfec -s 13360.fec fastfec_output/

  • This will run FastFEC in silent mode, parse the local filing 13360.fec, and store the output in CSV files at fastfec_output/13360/.

Downloading and parsing a filing

Get the FEC filing URL needed:

fastfec -p 13360

If you have curl installed, you can then run this command:

curl https://docquery.fec.gov/dcdev/posted/13360.fec | fastfec 13360
  • This will download the filing with ID 13360 from the FEC's servers and stream/parse it, storing the output in CSV files at output/13360/

If you don't have curl installed, you can also download the filing from the URL (https://docquery.fec.gov/dcdev/posted/13360.fec), save the file, and run the following, which is equivalent to the above:

fastfec 13360.fec
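
The same download-and-pipe flow can be driven from Python without curl. This is a minimal sketch under a few assumptions: fastfec is on the path, the filing lives at the docquery URL pattern used in the curl example (fastfec -p may print other candidate URLs), and the server accepts the User-Agent header chosen here. filing_url and download_and_parse are hypothetical names.

```python
import shutil
import subprocess
import urllib.request


def filing_url(filing_id: int) -> str:
    """URL pattern taken from the curl example; other filings may live elsewhere."""
    return f"https://docquery.fec.gov/dcdev/posted/{filing_id}.fec"


def download_and_parse(filing_id: int, output_dir: str = "output") -> int:
    """Stream the filing into fastfec's stdin and return fastfec's exit code."""
    request = urllib.request.Request(
        filing_url(filing_id),
        headers={"User-Agent": "Mozilla/5.0"},  # assumed to be acceptable to the server
    )
    with urllib.request.urlopen(request) as response:
        process = subprocess.Popen(
            ["fastfec", str(filing_id), output_dir],
            stdin=subprocess.PIPE,
        )
        try:
            # Copy the response in chunks so the filing is never fully held in memory.
            shutil.copyfileobj(response, process.stdin)
        finally:
            process.stdin.close()
        return process.wait()
```

Calling download_and_parse(13360) would mirror the curl pipeline and write to output/13360/.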

Benchmarks

The following benchmark was run on an M1 MacBook Air:

Filing        Size    Time    Memory usage  CPU usage
1464847.fec   8.4 GB  1m 42s  1.7 MB        98%

Local development

Build system

Zig is used to build and compile the project. Download and install the latest version of Zig (>=0.11.0) by following the instructions on the website (you can verify it's working by typing zig in the terminal and seeing help commands).

Dependencies

FastFEC has no external C dependencies. PCRE is bundled with the library to ensure compatibility with Zig's build system and cross-platform compilation.

Building

From the root directory of the repo, run:

zig build
  • The above command will output a binary at zig-out/bin/fastfec and a shared library file in the zig-out/lib/ directory
  • If you want to only build the library, you can pass -Dlib-only=true as a build option following zig build
  • You can also compile for other operating systems via -Dtarget=x86_64-windows (see here for additional targets)

Testing

Currently, there are C tests for specific parsing/buffer/write/CLI functionality and Python integration tests.

  • Running the C tests: zig build test
  • Running the Python tests:
    cd python
    pip install -r requirements-dev.txt
    tox -e py

See the GitHub test workflow for more info

Scripts

python scripts/generate_mappings.py: A Python script to auto-generate C header files containing column header and type mappings

fastfec's People

Contributors

anthonyjpesce, chriszs, esonderegger, freedmand, hs4man21, james-clemer-actblue, mattdennewitz


fastfec's Issues

feat: include full executable in PyPI releases

Thanks for this excellent utility!!

If you pip install fastfec, then you get a Python package, and that Python code uses a dynamically linked lib. You DON'T get a standalone binary. So you can only interact with fastfec through the Python bindings. You can't call fastfec from the command line.

I'm looking for a scriptable way to install the CLI interface, and the version on homebrew is still stuck at 0.0.4, which doesn't work on v8.4 of .fec files. I can manually install the CLI executable from the releases, but this can't be automated.

I can help with this if you point me in the right direction.

Port to PCRE2

As I'm sure you know, PCRE is no longer under development. Using PCRE2 instead would help ensure your users are less vulnerable to security problems. Moreover, older software tends to get trickier to build on newer systems as time goes by, so switching will probably make maintenance simpler in the long run.

This might be helpful: PCRE2Project/pcre2#51

failed to build against 0.11

👋 it looks like the 0.1.9 release build does not build against zig 0.11

==> zig build -Dvendored-pcre=false
/private/tmp/fastfec-20240112-70052-um5wxm/FastFEC-0.1.9/build.zig:18:6: error: no field or member function named 'setPreferredReleaseMode' in 'Build'

Only parse Schedule A itemizations

Hi! Thanks for this great utility.

I only care about the Schedule A itemizations. In some multi-gigabyte .fec files, the non-Schedule A entries can take up more than half of the file, which really slows down parsing.

Can we add some options to only parse particular itemizations?

In the meantime, I do this, do you see any problems with it? Like are schedule A itemizations always going to come before other schedules?

#!/bin/sh
# filter_fec.sh

# We only want the individual contributions from an FEC file. We don't want
# the other itemizations; they can be gigabytes and slow down parsing.

# From the FEC file format documentation:

# The first record of every electronic file that is submitted to the FEC must be an
# HDR record that precedes the main body of the ASCII CSV (comma separated values) data.
# The second record will be a "cover" record for the particular filing, (for example,
# a F3 or an F3X record for a FEC-3 or FEC-3X electronic report). An unlimited number
# of Schedule records (examples: SA, SB, SC/ ...) can follow the first two records of
# an FEC electronic report file. (Electronic files are usually assigned the file
# suffix ".fec".)

# So as soon as we see a line starting with "SB", "SC", or "SD", we stop.
# From https://stackoverflow.com/a/8940829/5156887
awk '{if(/^SB|^SC|^SD/)exit;else print}'

and use it as curl https://docquery.fec.gov/dcdev/posted/13360.fec | filter_fec.sh | fastfec 13360
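
For comparison, the same cut-off logic can be written in Python. This sketch mirrors the awk one-liner (stop at the first line starting with SB, SC or SD) and therefore shares the open question raised above: it is only correct if Schedule A records always precede the other schedules. The function names are hypothetical, and lines are handled as bytes so that unusual encodings in a filing do not cause decode errors.

```python
import sys
from typing import Iterable, Iterator

# Prefixes that mark the first non-Schedule A record, as in the awk filter.
STOP_PREFIXES = (b"SB", b"SC", b"SD")


def keep_until_other_schedules(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield lines until one starts with SB, SC or SD, then stop reading."""
    for line in lines:
        if line.startswith(STOP_PREFIXES):
            return
        yield line


def main() -> None:
    """Filter stdin to stdout, so the script can sit in the middle of a pipe."""
    sys.stdout.buffer.writelines(keep_until_other_schedules(sys.stdin.buffer))
```

Calling main() from a script would make it a drop-in replacement for filter_fec.sh in the pipeline above.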

Consider implementing CLI in python

What if we made it so the .c stuff only ever compiles to a library, and the only way to access it was through python? The current CLI stuff in c is awkward and hard to test. If we implemented the CLI in python it would be way easier and way more testable.

The downside is that users would have to have python installed. Currently they can just download the binary and it works. But I'm not sure how common that is, I feel like most users would have python??? IDK though.

Python client `SEGFAULT`s instead of calling `CustomWriteFunction`/`CUSTOM_WRITE`s in `parse_as_files`/`parse_as_files_custom`

It seems like calls of context->customWriteFunction are going amiss. I'm seeing SEGFAULTs coming without any evidence that the custom open function or write callback are ever called.

I've tried to demonstrate that the custom function passed in is called, via print statements followed by stdout flushes and by breakpoints, but have seen no evidence that the FFI is behaving as one would hope. The problem persists with calls to parse_as_files, as well.

To recreate, one can run:

import smart_open
from fastfec import FastFEC


if __name__ == "__main__":
    # Send a browser-like User-Agent with the request
    headers = {'headers': {'User-Agent': 'Mozilla/5.0'}}
    # Stream the filing over HTTP as a binary file-like object
    with smart_open.open('http://docquery.fec.gov/dcdev/posted/1606847.fec', 'rb', transport_params=headers) as f:
        with FastFEC() as fastfec:
            # The segfault occurs during this call
            fastfec.parse_as_files(f, "some_output_directory", include_filing_id='1606847')
This happens on at least revision 460d0c4, built and run on macOS.

As I understand it, somewhere in writer.c's call of the custom function being handed in from the python client, there seems to be something going wrong.

The issue presented here is the smallest bit I could get to fail easily without getting rid of, for example, the use of smart_open (in the event that that's causing problems), but I'd ideally be able to use fastfec.parse_as_files_custom in a more general case, with other file-like objects. This is simply the smallest failing case I could demonstrate.

I'll keep looking at this as time and priority allow, but I figured a GH issue might be helpful in this instance.

Incorrect number of trailing commas when last field(s) are empty

FastFEC export seems to be missing a trailing comma in lines that have one or more empty items at the end of a row.

Using the Homebrew version of fastfec on an M1 MacBook Pro running macOS Monterey 12.4.

For example, you can reproduce this by running fastfec 876050 fastfec_output/ and checking the header.csv (should be an additional trailing comma after report_number 002), SB28A.csv (42 fields in line items vs 43 in header) or SB23.csv (43 fields in line items vs 44 in header)

header.csv:

record_type,ef_type,fec_version,soft_name,soft_ver,report_id,report_number,comment
HDR,FEC,8.0,Microsoft Navision 3.60 - AVF Consulting,1.00,FEC-840327,002
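
A quick way to find affected rows is to compare each row's field count with the header's. The following is a small sketch using Python's csv module; find_mismatched_rows is a hypothetical helper, and row numbers count the header as row 1.

```python
import csv
import io
from typing import List, Tuple


def find_mismatched_rows(csv_text: str) -> List[Tuple[int, int, int]]:
    """Return (row_number, fields_found, fields_expected) for each bad row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    expected = len(header)
    mismatches = []
    for row_number, row in enumerate(reader, start=2):
        # Skip completely empty lines; flag rows whose length differs from the header.
        if row and len(row) != expected:
            mismatches.append((row_number, len(row), expected))
    return mismatches


HEADER_CSV = (
    "record_type,ef_type,fec_version,soft_name,soft_ver,report_id,report_number,comment\n"
    "HDR,FEC,8.0,Microsoft Navision 3.60 - AVF Consulting,1.00,FEC-840327,002\n"
)
print(find_mismatched_rows(HEADER_CSV))  # → [(2, 7, 8)]
```

On the header.csv shown above, the data row has 7 fields against 8 column names.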

Building from source no longer links with Homebrew PCRE

Attempting to build from source with brew install --build-from-source fastfec no longer links with Homebrew PCRE, but the system-provided one instead.

This seems to be due to a change in Zig 0.9.0 (Homebrew/homebrew-core@72b36e9).

❯ brew install --quiet --build-from-source fastfec
==> zig build -Dvendored-pcre=false
🍺  /usr/local/Cellar/fastfec/0.0.4: 6 files, 982.8KB, built in 11 seconds
==> Running `brew cleanup fastfec`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
❯ brew linkage fastfec
System libraries:
  /usr/lib/libSystem.B.dylib
  /usr/lib/libcurl.4.dylib
  /usr/lib/libpcre.0.dylib

I've poked at this for a bit, but I don't use Zig so I'm unsure how to get this to ignore the system libpcre. Passing --search-prefix doesn't work. I'd appreciate it if you could take a look. Thanks!

fastfec 0.1.6 regression test failed

While upgrading fastfec to 0.1.6, I found that the test behavior got changed a bit.

The following test no longer works:

$ /opt/homebrew/Cellar/fastfec/0.1.6/bin/fastfec --no-stdin 13425
About to parse filing ID 13425
Trying filename: (null)
Couldn't open file: (null)

This is the old working test:

$ /opt/homebrew/Cellar/fastfec/0.0.4/bin/fastfec --no-stdin 13425
About to parse filing ID 13425
Trying URL: https://docquery.fec.gov/dcdev/posted/13425.fec
Done; parsing successful!

Proposal: removing Curl from the CLI

Per #16 we have an issue compiling Curl with Zig for multiple platforms, which has made it difficult/impossible to release new versions of the CLI automatically. We can still update the Python library, which most downstream tooling has seemingly adopted, but for the health of the project, I am proposing removing Curl support from the CLI.

Instead of fastfec [filing_id] to download and parse a filing from FEC, you'd have to have Curl installed locally (or use wget or similar) and run curl [filing_url] | fastfec [filing_id], which admittedly is not as convenient.

I think it's ultimately for the best for the project to move forward though. When the Zig Homebrew issue is resolved we can think about adding internal curl back in, but this should let us rethink and significantly refine the release process.

I've created this issue to track it, and when it's resolved it should be quick and easy to get the other issues closed.

BUG: (maybe?) Missing trailing commas from output

Not sure if this is a bug or not. If I run fastfec 878160 and I look at the resulting output/878160/SA11D.csv, then I see this:

form_type,filer_committee_id_number,transaction_id,back_reference_tran_id_number,back_reference_sched_name,entity_type,contributor_organization_name,contributor_last_name,contributor_first_name,contributor_middle_name,contributor_prefix,contributor_suffix,contributor_street_1,contributor_street_2,contributor_city,contributor_state,contributor_zip_code,election_code,election_other_description,contribution_date,contribution_amount,contribution_aggregate,contribution_purpose_descrip,contributor_employer,contributor_occupation,donor_committee_fec_id,donor_committee_name,donor_candidate_fec_id,donor_candidate_last_name,donor_candidate_first_name,donor_candidate_middle_name,donor_candidate_prefix,donor_candidate_suffix,donor_candidate_office,donor_candidate_state,donor_candidate_district,conduit_name,conduit_street1,conduit_street2,conduit_city,conduit_state,conduit_zip_code,memo_code,memo_text_description,reference_code
SA11D,C00477828,C7168136,,,CAN,,Clarke,Hansen,,,,2900 E Jefferson Ave,Apt C4,Detroit,MI,482074242,P2012,,2013-06-30,565.73,565.73,,,,,,H0MI13398,Clarke,Hansen,,,,H,MI,13,,,,,,,,"* In-Kind: In-kind, web hosting and phone services, to be reimbursed"

It looks to me that this is missing the required trailing comma that separates the memo_text_description and (the missing) reference_code value. If I try to load this with a pyarrow csv reader with the given 45 column names, it gets mad because it only sees 44 values in the row. You can replicate with pd.read_csv(path, engine="pyarrow"). Other CSV parsers such as vanilla pandas (pd.read_csv(path)) and vaex are more forgiving and just fill in NA for the missing reference_code values, so perhaps that is why this hasn't been caught before.

If I look at the resulting output/878160/SB17.csv, it's a similar story: there is one less trailing comma than there should be to separate the missing last value.

However, if I look at output/878160/F3S.csv, then this looks correct. I'd guess this is because the last value in that row is non-missing:

form_type,filer_committee_id_number,date_general_election,date_day_after_general_election,a_total_contributions_no_loans,b_total_contribution_refunds,c_net_contributions,a_total_operating_expenditures,b_total_offsets_to_operating_expenditures,c_net_operating_expenditures,a_i_individuals_itemized,a_ii_individuals_unitemized,a_iii_individuals_total,b_political_party_committees,c_all_other_political_committees_pacs,d_the_candidate,e_total_contributions,transfers_from_other_auth_committees,a_loans_made_or_guarn_by_the_candidate,b_all_other_loans,c_total_loans,offsets_to_operating_expenditures,other_receipts,total_receipts,operating_expenditures,transfers_to_other_auth_committees,a_loan_repayment_by_candidate,b_loan_repayments_all_other_loans,c_total_loan_repayments,a_refund_individuals_other_than_pol_cmtes,b_refund_political_party_committees,c_refund_other_political_committees,d_total_contributions_refunds,other_disbursements,total_disbursements
F3S,C00477828,2012-11-06,2012-11-07,3120.73,0.00,3120.73,2153.17,3340.65,-1187.48,1500.00,55.00,1555.00,0.00,1000.00,565.73,3120.73,0.00,0.00,0.00,0.00,3340.65,0.00,6461.38,2153.17,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2153.17
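
Until the output itself is changed, one possible workaround for strict readers is to pad short rows with empty fields before loading. This is only a post-processing sketch, not a fix in FastFEC; pad_rows_to_header is a hypothetical helper built on Python's csv module.

```python
import csv
import io


def pad_rows_to_header(csv_text: str) -> str:
    """Append empty fields to rows that are shorter than the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")

    header = next(reader)
    writer.writerow(header)
    for row in reader:
        # A non-positive difference produces no padding, so full rows pass through.
        missing = len(header) - len(row)
        writer.writerow(row + [""] * missing)
    return output.getvalue()
```

After padding, every row has as many fields as the header, so a strict reader such as the pyarrow engine should see the expected number of values.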
