coderxio / dailymed-api
REST API for DailyMed SPLs
Home Page: https://coderx.io/
License: MIT License
I think this makes sense at the SPL level... but there might be more than one product NDC (e.g. 12345-6789) per SPL?
Repo is missing a license.
Add a license.
Potentially MIT license?
extract_zips.py can only handle a single .zip file, but there are multiple .zip files when the data is downloaded from DailyMed. An additional finding when deploying to the droplet was an out-of-memory error, caused by writing the entire file in one chunk. We need the ability to download a large (> 1 GB) file on the droplet with minimal memory.
All zip files are looped through and unzipped.
An out-of-memory error does not occur.
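A sketch of the two fixes using only the standard library: streaming the download in fixed-size chunks and looping over every zip in a directory (function names, paths, and the chunk size are illustrative, not the project's actual code):

```python
import glob
import shutil
import urllib.request
import zipfile

CHUNK = 1024 * 1024  # stream 1 MB at a time so memory use stays flat


def download(url, dest):
    """Stream a large (> 1 GB) file to disk without holding it in memory."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out, CHUNK)


def extract_all(src_dir, out_dir):
    """Unzip every .zip file in src_dir, not just a single file."""
    for path in sorted(glob.glob(f"{src_dir}/*.zip")):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(out_dir)
```

`shutil.copyfileobj` copies between file objects one buffer at a time, which is what keeps the droplet's memory usage bounded regardless of file size.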
README.md has incorrect information in the steps to reproduce. Additionally, the README does not contain links to the hosted API.
Deployment steps are correct and the link to the API works.
Capture the DEA drug schedule for the product within the model.
This is a next step as we expand the scraping of the XML files. The item fits logically within our current model, i.e. at the product level.
Code does not align with PEP 8 standards.
Successfully run flake8 against the code base with no errors.
Create a branch, format the code with the Python module Black, and going forward require all code changes to be compliant.
Command line arguments have the wrong metavar. The metavar needs to be set to int for all three inputs.
Metavar identified as int.
Update get_zips with the correct metavar in argparse.
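A sketch of what the corrected argparse setup might look like. The three argument names here are hypothetical (the real ones live in get_zips); the point is that `type=int` enforces integer input while `metavar` controls how the argument renders in `--help`:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Download DailyMed SPL zips")
    # type=int converts and validates the value; metavar="int" is what --help shows.
    parser.add_argument("--start", type=int, metavar="int", help="first file index")
    parser.add_argument("--stop", type=int, metavar="int", help="last file index")
    parser.add_argument("--workers", type=int, metavar="int", help="parallel downloads")
    return parser


args = build_parser().parse_args(["--start", "1", "--stop", "4", "--workers", "2"])
```

With this setup, a non-integer value like `--start foo` makes argparse exit with a usage error instead of passing a string through.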
Need to filter down to products: the product endpoint should filter by product_codes and not_inactive_ingredient_uniis.
Desired workflow is probably this:
SPLs should not be returned multiple times just because they have a product that has multiple inactive ingredients with the same name.
This query should return only one result, not three:
http://api.coderx.io/spl/?set_id=15776e53-0ae5-4605-914a-8a1bcd97323a&inactive_ingredient_name=oxide
Keyword arguments for "distinct" exist, but I haven't figured out how to use them, and I'm not sure this would fix it anyway: https://django-filter.readthedocs.io/en/stable/ref/filters.html?highlight=distinct#distinct
There are currently some XML parsing errors occurring with only some of the SPL zip files. The errors appear to be related to missing containerPackagedProduct tags in some SPL files; containerPackagedMedicine tags are being used instead.
We need to determine whether this is the only error in these SPL files. The solution will likely involve testing for either tag when scraping the XML files.
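One way to test for either tag when scraping, sketched with the standard library. The tag names come from the issue itself; SPL namespace handling and the surrounding Scrapy plumbing are omitted, so treat this as an assumption-laden sketch:

```python
import xml.etree.ElementTree as ET

# Some SPL files use containerPackagedProduct, others containerPackagedMedicine.
PACKAGE_TAGS = ("containerPackagedProduct", "containerPackagedMedicine")


def find_packages(root):
    """Collect package elements regardless of which tag variant the file uses."""
    packages = []
    for tag in PACKAGE_TAGS:
        packages.extend(root.iter(tag))
    return packages


old_style = ET.fromstring("<spl><containerPackagedProduct/></spl>")
new_style = ET.fromstring("<spl><containerPackagedMedicine/></spl>")
```

Both variants yield a result, so files using either tag parse without errors.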
The Scrapy run logs currently show two errors after every execution. Our setup uses Scrapy to crawl a bunch of XML files, so this error is really related to its default settings, which apply more to crawling URLs than files. The error appears to be related to a missing robots.txt file.
To fix this bug, the default Scrapy settings need to be adjusted to disable searching for this file. We should be cautious with this if we decide to actually scrape URLs in the future.
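Scrapy's robots.txt lookup is controlled by the `ROBOTSTXT_OBEY` setting (a real Scrapy setting; the comment reflects the caveat above). The settings.py change would look like:

```python
# settings.py (Scrapy project)
# Stop Scrapy from fetching robots.txt before crawling -- our "crawl" targets
# local XML files, not remote sites, so the lookup only produces log errors.
# Re-enable (True) if we ever start scraping real URLs.
ROBOTSTXT_OBEY = False
```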
Ability to filter views by the items presented in the view. For example, the /spl view could filter by product name.
Per discussion on 09/27, the recommendation was Django Filters.
Currently, users are unable to curate the returned information. Filtering would enable the user to select only information pertinent to the problem being solved. If a user needs to know NDC numbers for lisinopril, they could filter only products with the name lisinopril.
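The intended behavior, sketched in plain Python rather than django-filter (which is the recommended implementation): the product names come from the lisinopril example above, while the NDC values are made up for illustration.

```python
def filter_products(products, name):
    """Keep only products whose name matches, case-insensitively."""
    return [p for p in products if p["name"].lower() == name.lower()]


products = [
    {"name": "lisinopril", "ndc": "00000-0001"},  # hypothetical NDCs
    {"name": "Lisinopril", "ndc": "00000-0002"},
    {"name": "metformin", "ndc": "00000-0003"},
]
matches = filter_products(products, "lisinopril")
```

The user asking for lisinopril NDCs gets back only the two lisinopril rows instead of the full product list.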
Would propose we come up with some branch naming conventions, maybe something like name/issue#/description. Food for thought: what do you guys think?
This would better organize and identify branches in the repo.
ref: https://stackoverflow.com/questions/273695/what-are-some-examples-of-commonly-used-practices-for-naming-git-branches
GitHub allows you to create standardized issue and PR templates. I think it would be a good idea to come up with some basic templates for this repo. Wondering if it would be cool to actually make a modified SBAR format for this. These would be defaults but could still be adjusted in the issue or PR.
Templates I have seen in the past look like this (shout out to @toozej):
For issues:
[What needs to be done and why]
[Measurable outcome if possible]
[ways one might accomplish this task, links, documentation, alternatives, etc.]
For PRs:
Fixes org/repo#ISSUE_NUMBER
[What did you change?]
[Why did you make the changes mentioned above? What alternatives did you consider?]
The API endpoints need to include pagination.
Currently, the API is slow to load for the products endpoint, and possibly other endpoints, due to the size of the data.
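Django REST Framework supports global pagination through its settings dictionary; the setting keys below are DRF's own, while the page size of 100 is an assumption to be tuned against the products endpoint:

```python
# settings.py (Django) -- enable page-number pagination for every DRF view.
REST_FRAMEWORK = {
    "DEFAULT_PAGINATION_CLASS": "rest_framework.pagination.PageNumberPagination",
    "PAGE_SIZE": 100,  # assumed page size; tune for the products endpoint
}
```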
Need a WSGI HTTP server, e.g. gunicorn, added to the project dependency files.
n/a
I would recommend gunicorn, as it is very simple to use.
Single view set, /spl, with filtering.
The project is crawling SPL documentation, so it makes sense for the URL route to be /spl.
Problem Statement
Individual SPLs sometimes have more than one NDC within the XML (and within the "Ingredients and Appearance" section). This prevents us from using the NDC as a lookup_field.
Example: https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=2524b253-069e-4028-819c-361b888df110
Criteria for Success
Either modify the Scrapy logic to remove duplicates (if appropriate) or modify Django REST Framework to use an auto-incrementing number as the primary key and not use NDC as a lookup_field.
Success = 0 duplicate errors during Scrapy run.
Additional Information
The two XML documents I found are below:
I pulled this from the API for one of them:
GET /spl/8d4d72be-638b-11ea-918e-832cfc2ca371/
HTTP 200 OK
Allow: GET, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
{
"id": "8d4d72be-638b-11ea-918e-832cfc2ca371",
"set": "http://192.168.1.12:8000/set/2524b253-069e-4028-819c-361b888df110/",
"ndcs": [
"http://192.168.1.12:8000/ndc/50458-178-00/",
"http://192.168.1.12:8000/ndc/50458-178-15/",
"http://192.168.1.12:8000/ndc/50458-178-00/",
"http://192.168.1.12:8000/ndc/50458-178-20/",
"http://192.168.1.12:8000/ndc/50458-178-00/",
"http://192.168.1.12:8000/ndc/50458-178-12/",
"http://192.168.1.12:8000/ndc/50458-178-28/",
"http://192.168.1.12:8000/ndc/50458-178-06/",
"http://192.168.1.12:8000/ndc/50458-176-00/",
"http://192.168.1.12:8000/ndc/50458-176-15/",
"http://192.168.1.12:8000/ndc/50458-176-28/",
"http://192.168.1.12:8000/ndc/50458-176-06/"
]
}
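If the Scrapy-side option is chosen, the dedup step could be as small as the sketch below (plain Python, order-preserving since dicts keep insertion order); this only removes the repeated entries in the scraped list and does not by itself address the lookup_field question:

```python
def dedupe_ndcs(ndcs):
    """Drop repeated NDCs while keeping first-seen order."""
    return list(dict.fromkeys(ndcs))


# Abbreviated sample from the API response above, with 50458-178-00 repeated.
ndcs = ["50458-178-00", "50458-178-15", "50458-178-00", "50458-178-20"]
# dedupe_ndcs(ndcs) -> ["50458-178-00", "50458-178-15", "50458-178-20"]
```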