eregs / regulations-core

An engine that supplies the API that allows users to read regulations and their various layers.

License: Creative Commons Zero v1.0 Universal

Python 100.00%

regulations-core's People

lauraggit noahmanger thejennywang ericschles theresaanna jmcarp cmc333333 vrajmohan tadhg-ohiggins navigo gregoryfoster govtmirror joshuaeveleth mustyoshi grnrlabs fecgov kaitlin kaomte cmsgov

regulations-core's Issues

Search always returns "no results"

Whether using elasticsearch (was able to get a successful import on 12 CFR 1008) or haystack/solr for storage, search always seems to return "not found." Am I missing something obvious?

Add the linting plugins to match -parser

Several flake8 plugins were added to the parser in eregs/regulations-parser#346. Add them here, too.

Remove django-mptt

We've been using django-mptt to describe the schema of our nested set implementation. Unfortunately, it makes relatively strict demands on which versions of Django it supports (and doesn't encode those in its own dependencies). This required us to stop supporting Django 1.8 before its LTS expired, which is against our policy.

It should be pretty simple to replace the bits of schema that we use and remove mptt altogether.

Secure write access via hmac or other authentication

Right now, the only way to lock down write access is by including or excluding Django projects. Let's beef that up so that the API can be public facing. Options include HMACing the message, HTTP auth, or simpler API key sharing.
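As a rough illustration of the HMAC option, here is a minimal sketch using only the standard library. The function names, the choice of SHA-256, and the decision to sign the method, path, and body are all assumptions made for this example; nothing here is part of regulations-core.

```python
import hashlib
import hmac


def sign(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the method, path, and body."""
    message = method.upper().encode() + b"\n" + path.encode() + b"\n" + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_authorized(secret: bytes, method: str, path: str, body: bytes,
                  signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, method, path, body)
    return hmac.compare_digest(expected.encode(), signature.encode())


# Hypothetical shared secret and request, for illustration only.
secret = b"shared-secret"
sig = sign(secret, "PUT", "/notice/2012-17900", b"{}")
print(is_authorized(secret, "PUT", "/notice/2012-17900", b"{}", sig))
```

The client would send the signature alongside the request (for example in a header), and the server would reject any write whose signature does not verify.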

Postgres full-text-search as an option

In addition to the existing options of haystack and Elasticsearch.

Migrate tests to Py.test style

(After #64)

Import_docs sequencing error

There are some dependencies between document types (e.g. layers pointing to regulations) which should be accounted for in the import_docs script.

Current workaround: run the script twice.
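One way to avoid the second run might be to sort the documents by type before importing them. The sketch below assumes each document is identified by a path whose first segment is its type; that layout, the function name, and every entry in the ordering other than "regulation before layer" are assumptions for illustration.

```python
# Hypothetical ordering: types earlier in the list are imported first.
# Only "regulation before layer" comes from the issue; the rest is a guess.
IMPORT_ORDER = ["regulation", "notice", "layer", "diff"]


def sort_for_import(paths):
    """Sort paths such as 'layer/terms/1008/v1' so dependencies come first."""
    def key(path):
        doc_type = path.strip("/").split("/")[0]
        if doc_type in IMPORT_ORDER:
            rank = IMPORT_ORDER.index(doc_type)
        else:
            rank = len(IMPORT_ORDER)  # unknown types go last
        return (rank, path)
    return sorted(paths, key=key)


ordered = sort_for_import(["layer/terms/1008/v1", "regulation/1008/v1"])
print(ordered)
```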

Add debug output to console

Right now, if you get a 500 while uploading, it's very difficult to figure out why.

Add support for Django 1.9

Ideally we'd be supporting the long term release and the current release of Django

Use pip-tools

Given that regulations-core is sometimes run independently (i.e. as an application rather than a library), we should be pinning its requirements. These will be ignored when included as a library.

Update to django 1.8

Consider using Whoosh in the example settings

The setup instructions and example settings files all assume Solr or Elasticsearch. Can we use Whoosh instead? It seems like it has fewer dependencies, making getting started much easier.

Use constant-time string comparison w/ basic auth

Right now, we're vulnerable to timing attacks as we leak little bits of information about the auth string.
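A minimal sketch of a comparison that does not leak timing information, using the standard library's hmac.compare_digest. The function name and the way the expected header is built are assumptions for illustration, not the project's actual auth code.

```python
import base64
import hmac


def credentials_match(auth_header: str, username: str, password: str) -> bool:
    """Compare a Basic auth header against expected credentials in constant time."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    expected = "Basic " + token
    # compare_digest takes time independent of where the inputs first differ.
    return hmac.compare_digest(auth_header.encode(), expected.encode())


# Hypothetical credentials, for illustration only.
good_header = "Basic " + base64.b64encode(b"writer:s3cret").decode()
print(credentials_match(good_header, "writer", "s3cret"))
```

Django also ships django.utils.crypto.constant_time_compare, which may be a more natural fit inside a Django view.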

Use the nested set model for storage

Currently, when storing the regulation tree, we store each subtree, keyed by label. While this makes processing very simple, it leads to a great deal of redundancy and can be quite slow when importing the tree (as the structure must be walked and each subtree inserted).

An alternative is the nested set model, which stores each node once but makes grabbing subtrees (equivalent to subsets) painless. There are even a few implementations in Django.
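To make the idea concrete, here is a small sketch of nested set numbering in plain Python. The tree shape, labels, and function names are made up for illustration. Each node gets a left and right bound from a depth-first walk, and a subtree is then every row whose bounds fall inside the root's bounds.

```python
def assign_bounds(node, counter=1):
    """Number a tree depth-first, setting node['left'] and node['right'].

    Returns the next unused counter value.
    """
    node["left"] = counter
    counter += 1
    for child in node.get("children", []):
        counter = assign_bounds(child, counter)
    node["right"] = counter
    return counter + 1


def flatten(node):
    """Return one (label, left, right) row per node; each node is stored once."""
    rows = [(node["label"], node["left"], node["right"])]
    for child in node.get("children", []):
        rows.extend(flatten(child))
    return rows


def subtree(rows, label):
    """Select a node and its descendants using only the bounds."""
    left, right = next((lft, rgt) for name, lft, rgt in rows if name == label)
    return [name for name, lft, rgt in rows if left <= lft and rgt <= right]


# Hypothetical regulation tree, for illustration only.
tree = {"label": "1008", "children": [
    {"label": "1008-1", "children": [{"label": "1008-1-a"}]},
    {"label": "1008-2"},
]}
assign_bounds(tree)
rows = flatten(tree)
print(subtree(rows, "1008-1"))
```

In a database the same selection becomes a single range query on the left and right columns, which is what makes subtree reads cheap.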

Elastic Search 'Amendments' Model Parsing Failure

When parsing 37 CFR 42 with core configured to use Elasticsearch, every PUT to a notice URI fails the same way. Here are some snippets that don't make it to the console but provide a great deal of context, pulled from local variables in the paused client.py after the exception:

Can't merge a non object mapping [amendments.changes] with an object mapping [amendments.changes] [{'reason': '[YW_wNku][127.0.0.1:9300][indices:data/write/index[p]]', 'type': 'remote_transport_exception'}] '[YW_wNku][127.0.0.1:9300][indices:data/write/index[p]]'

As the request is made, the local variable notice at regulations-core/regcore/db/es.py line 115 has the following under the amendments key (i.e. notice['amendments']):

[
    {'authority': '35 U.S.C. 2(b)(2).', 'instruction': '1. The authority citation for 37 CFR part 1 continues to read as follows:', 'cfr_part': '1'},
    {'changes': [['1-301', [{'action': 'DELETE'}]]], 'instruction': '2. Section 1.301 is removed and reserved.', 'cfr_part': '1'},
    {'changes': [['1-302', [{'action': 'DELETE'}]]], 'instruction': '3. Section 1.302 is removed and reserved.', 'cfr_part': '1'},
    {'changes': [['1-303', [{'action': 'DELETE'}]]], 'instruction': '4. Section 1.303 is removed and reserved.', 'cfr_part': '1'},
    {'changes': [['1-304', [{'action': 'DELETE'}]]], 'instruction': '5. Section 1.304 is removed and reserved.', 'cfr_part': '1'},
    {'instruction': '6. Part 42 is added to read as follows:', 'cfr_part': '1'},
    {'instruction': '7. Part 90 is added to read as follows:', 'cfr_part': '90'}
]

With the debugger paused immediately after this failure, I attempted to pull what we already have there. There is no record:

$ curl 'http://localhost:9200/eregs/notice/2012-17900'
{"_index":"eregs","_type":"notice","_id":"2012-17900","found":false}

And pulling the schema didn't give me any hints about the preferred structure of amendments.

$ curl http://localhost:9200/eregs/_mapping/notice
{
  "eregs":{
    "mappings":{
      "notice":{
        "properties":{
          "cfr_parts":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            }
          },
          "cfr_title":{
            "type":"long"
          },
          "dockets":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            }
          },
          "document_number":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            }
          },
          "effective_on":{
            "type":"date"
          },
          "footnotes":{
            "type":"object"
          },
          "fr_citation":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            }
          },
          "fr_url":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            }
          },
          "fr_volume":{
            "type":"long"
          },
          "meta":{
            "properties":{
              "start_page":{
                "type":"long"
              }
            }
          },
          "primary_agency":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            }
          },
          "publication_date":{
            "type":"date"
          },
          "title":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            }
          },
          "versions":{
            "properties":{
              "42":{
                "properties":{
                  "left":{
                    "type":"text",
                    "fields":{
                      "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                      }
                    }
                  },
                  "right":{
                    "type":"text",
                    "fields":{
                      "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
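One plausible reading of the error (an assumption, not a confirmed diagnosis) is that each entry in changes mixes a string label with a list of objects, and Elasticsearch cannot map a single field as both a plain value and an object. A sketch of a possible workaround is to rewrite each [label, changes] pair as an object before indexing. The function name and the "label" and "changes" key names below are hypothetical.

```python
def normalize_changes(amendment):
    """Return a copy with [label, [change, ...]] pairs rewritten as objects."""
    result = dict(amendment)
    if "changes" in result:
        result["changes"] = [
            {"label": label, "changes": changes}  # hypothetical key names
            for label, changes in result["changes"]
        ]
    return result


amendment = {
    "changes": [["1-301", [{"action": "DELETE"}]]],
    "instruction": "2. Section 1.301 is removed and reserved.",
    "cfr_part": "1",
}
print(normalize_changes(amendment)["changes"])
```

Any consumer that reads amendments back out of the index would need the inverse transformation, so this is only a starting point for discussion.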

Consider consolidating backends

We currently support writing data to SQL via Django, and to Elasticsearch. We also support a second Elasticsearch / Solr index via haystack. I also see that we're talking about an additional search backend in #10. I'm guessing I'm missing some context here, but why is it useful to have all these backend options? Do we have sufficiently different use cases that some users would want Postgres full-text search, others Postgres + haystack, and others Elasticsearch?

Update mention of example_settings.py in README.md

Hi, I'm new to regulations-core and a relative Python newb, so my apologies in advance should I miss the obvious. Helpful pointers welcome!

In walking through the documentation on building regulations-core from source, there's mention of an example_settings.py file which I'm unable to find in the source repository. Should this be a reference to regcore/settings/base.py?

Thank you!

Wrap http requests in transactions

We don't wrap our requests in transactions, currently. While concurrent writes/reads during writes haven't been a use case we've cared much about, it's a good practice.
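For the Django-backed storage, one low-effort option may be Django's ATOMIC_REQUESTS database setting, which wraps each view call in a transaction. The fragment below is a sketch of a settings module; the engine and database name are placeholders, not the project's actual configuration.

```python
# Fragment of a Django settings module. ENGINE and NAME are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "eregs.db",
        # Wrap every request handled by a view in a database transaction.
        "ATOMIC_REQUESTS": True,
    }
}
```

This only covers writes that go through the Django database connection; writes to a separate search index would not be rolled back with it.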

Integrate Elastic Search in the cloud foundry environment

This will prevent the need for #10

import_reg command failing due to renamed import

Hello,
I'm new-ish to the project and was attempting to import the local output from the suggested regulations-parser example regulation (Title 27 Part 447) into my instance of regulations-core using the suggested import_reg command. I saw that Python 2.7 is required for that command because it uses urlparse, which may be worth documenting as a separate issue for when the project moves to depend only on Python 3. The command then presented the following error:

File "...regulations-core/regcore/management/commands/import_reg.py", line 13, in <module>
    from regcore_write.views import regulation, diff, layer, notice
ImportError: cannot import name regulation

Seeing there's no longer a regulation.py file in regcore_write.views, I was able to identify a commit (2a5f8cd) which shows the object was renamed to Document. Changing the import plus the only reference I could see (line #201) resulted in the command running successfully and some rows being added to the SQLite database, but no rows were added to regcore_document, which seems suspicious to me. There are quite a few other variables in the command which reference "regulation" and perhaps some expectation of values named accordingly in the JSON files, so I wanted to check with the experts before submitting a minimal pull request.

Thank you for your important work on this project, more timely than ever!

Verify and clean up any issues around Django 1.10

We currently only support 1.8 and 1.9.

Delete and/or clean up a single CFR part

Often we've wanted to clean up single parts or delete a single part. Consider pulling in:
cfpb/regulations-core#63
cfpb/regulations-core#64