animl-api's People

Contributors

dependabot[bot] · ingalls · jbmarsha · miles-po · nathanielrindlaub


animl-api's Issues

Transition to ECS

Per Dev Seed's recommendation. The backend has a variety of clients and an increasing number of users, so it probably makes sense to move to a dedicated instance.

Add MIRA label reconciliation function

If a model detects something it's been trained on, return that; if both models return nothing, return empty; and if there's a conflict (e.g., mira-large thinks it's a fox and mira-small thinks it's a rodent), return both.
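
A minimal sketch of that reconciliation logic (the function name and prediction shape are assumptions, not existing code):

// Hypothetical sketch: each prediction set is assumed to be an array of
// label objects like [{ category: 'fox', conf: 0.91 }].
function reconcileMiraPredictions(largePreds, smallPreds) {
  if (largePreds.length === 0 && smallPreds.length === 0) {
    return []; // both models came back empty
  }
  if (largePreds.length === 0) return smallPreds;
  if (smallPreds.length === 0) return largePreds;

  const largeCategories = new Set(largePreds.map((p) => p.category));
  const agree = smallPreds.every((p) => largeCategories.has(p.category));

  // Agreement: return one set; conflict: return the union of both.
  return agree ? largePreds : [...largePreds, ...smallPreds];
}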

Create script tools for updating db

Nice-to-have features (a sketch of the backup-and-confirm flow follows this list):

  • modular design to support future scripted updates
  • automatically back up the db with mongodump before running
  • print out some of the matched results and a match count
  • require command-line confirmation via a prompt before performing the update
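
A sketch of the backup-then-confirm idea using Node's built-in readline; the mongodump invocation and match count are illustrative:

const readline = require('readline');
const { execSync } = require('child_process');

// Ask a yes/no question on the command line before proceeding.
async function confirm(question) {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  return new Promise((resolve) => {
    rl.question(`${question} (y/N) `, (answer) => {
      rl.close();
      resolve(answer.toLowerCase() === 'y');
    });
  });
}

(async () => {
  execSync('mongodump --uri="$MONGODB_URI" --out=./backup'); // back up first
  const matchCount = 42; // e.g., await Image.countDocuments(filter)
  console.log(`${matchCount} documents match the filter.`);
  if (await confirm('Proceed with update?')) {
    // ...perform the scripted update here...
  }
})();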

Fix bug with getProjects if user is superuser

Right now if a user is a superuser, getProjects will return all projects, but the user may not have roles for all of those projects in Cognito, which causes the front end to crash if they then select a project from the project nav for which they have no roles. This might be more appropriate to fix on the front end. Needs investigation.

DateAdded query issue

The following query should return 3 images from that camera, but something about the time comparison is off.

    "input": {
        ...
        "cameras": ["X8114541"],
        "addedEnd": "2020:10:27 05:05:24",
        "addedStart": "2020:07:27 05:05:24"
    }

We call moment's toDate() method when we build the query, which converts the moment to a native Date (expressed in UTC, no offset):

addedStart:  Moment<2020-07-27T05:05:24-07:00>
input.addedStart.toDate():  2020-07-27T12:05:24.000Z

The dateAdded is formatted correctly when it's sent to the front end:

"2020-10-27 03:03:44"

But the time looks strange in the db:

2020-10-27T22:03:47.322+00:00
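
One way to sanity-check the comparison (a sketch; the filter field names are assumptions):

const moment = require('moment');

// Parse the filter timestamps the same way the query builder might, then
// inspect the UTC instants actually sent to MongoDB.
const addedStart = moment('2020:07:27 05:05:24', 'YYYY:MM:DD HH:mm:ss');
const addedEnd = moment('2020:10:27 05:05:24', 'YYYY:MM:DD HH:mm:ss');

const filter = {
  cameraSn: { $in: ['X8114541'] },
  dateAdded: { $gte: addedStart.toDate(), $lt: addedEnd.toDate() },
};
console.log(filter.dateAdded); // check the instants being compared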

A note on using SSM Parameter Store for storing secrets

Not an issue, but I wanted to document my decision making on this somewhere. @postfalk, this might be relevant to keep in mind as we work on developing serverless best practices.

Originally, we were using .env files to store secrets and sensitive config values (DB connection strings, API keys, etc.). However, this got clunky and hard to maintain because the same secrets are often used by multiple individual stacks (animl-ingest, animl-api, animl-frontend), each of which has its own .env, and in some cases some of those shared values are generated dynamically by the Serverless stack build process (e.g. API Gateway entrypoints, SQS URLs).

Option 1: CloudFormation cross-stack references
At first I tried using CloudFormation Outputs and creating cross-stack references to import them into other serverless stacks. Essentially, within each stack's Serverless config file, you can concatenate AWS vars to create some of those dynamic strings (e.g. API Gateway URL, SQS URL), which can then be passed directly into that specific stack's environment and/or exported as an Output to be available to other stacks:

From animl-api serverless.yml file:

  ...
  custom:
    apiUrl: !Sub https://${ApiGatewayRestApi}.execute-api.${AWS::Region}.amazonaws.com/${self:provider.stage}/
    inferenceQueueUrl: !Sub https://sqs.${AWS::Region}.amazonaws.com/${AWS::AccountId}/inferenceQueue-${self:provider.stage}
    imagesUrl: ${cf:animl-ingest.imagesUrl-${opt:stage, self:provider.stage, 'dev'}}  # importing values created by other stacks
  environment:
    ANIML_API_URL: ${self:custom.apiUrl}
    INFERENCE_QUEUE_URL: ${self:custom.inferenceQueueUrl}
    ANIML_IMAGES_URL: ${self:custom.imagesUrl}
  ...
  Outputs:
    animlApiUrl:
      Value: ${self:custom.apiUrl}
      Export:
        Name: animlApiUrl-${opt:stage, self:provider.stage, 'dev'} # exporting values to be available to other stacks

From animl-ingest serverless.yml file:

  Outputs:
    imagesUrl:
      Description: Cloudfront distro domain name for image bucket
      Value: { "Fn::GetAtt": ["CloudfrontDistributionAnimldataproduction", "DomainName"] }
      Export:
        Name: imagesUrl-${opt:stage, self:provider.stage, 'dev'}

The advantage of this is that if those dynamically created values change, they are automatically pulled into the stack's env and made available to other stacks. However, there are several disadvantages: other stacks would still have to redeploy to pick up changed URLs, it creates a convoluted web of outputs and imports that would be hard to maintain, we'd still have a bunch of other env variables (e.g. API keys) that aren't Serverless outputs and need managing, and the frontend doesn't use Serverless at all, so we'd still be keeping those values in sync manually.

Clearly, a centralized store for these variables seemed sensible.

Option 2: SSM Parameter Store
After reading up a bit on cloud-based secrets management options, it seemed like AWS SSM Parameter Store would offer a simple, cheap, secure tool for storing and retrieving secrets and shared config data, with the caveat that we'd still need to manually create and manage the keys & values in the AWS console*. Another nice advantage is that while we could pull SSM values into the serverless.yml config files, we can also use the AWS SDK to fetch SSM values at runtime, so we wouldn't have to redeploy stacks that depend on certain SSM parameters every time they change.
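
A minimal sketch of the runtime-fetch approach, assuming the AWS SDK v2 (the parameter name is illustrative):

const SSM = require('aws-sdk/clients/ssm');
const ssm = new SSM({ region: process.env.AWS_REGION });

// Fetch (and decrypt, if it's a SecureString) a parameter at runtime.
async function getParam(name) {
  const { Parameter } = await ssm
    .getParameter({ Name: name, WithDecryption: true })
    .promise();
  return Parameter.Value;
}

// e.g. const dbUrl = await getParam(`/animl/${stage}/db-connection-string`);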

Anyhow, that's the route I've taken for now, and so far I'm pretty happy with it.

*Future improvement idea: one way to add dynamically created Serverless Output values to the Parameter Store from the serverless build process might be to use serverless-stack-output plugin, which allows you to hook in a custom script that takes in the output as an argument. We could write an output handler script that uses the AWS-SDK’s SSM class to put and/or update those dynamically generated outputs in the parameter store.

Image querying bug

After implementing #53, I discovered our approach for merging all of the filters in utils.js buildFilters() had a bug: I was using the spread operator to merge all of the filter objects, but if multiple filter objects had the $or key, the earlier ones would get overwritten by later ones. Not sure how I missed this earlier!
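
A small illustration of the bug and one possible fix (the filter field names here are illustrative):

// The spread-merge bug: a later $or key clobbers an earlier one.
const cameraFilter = { $or: [{ cameraSn: 'X8114541' }] };
const labelFilter = { $or: [{ 'labels.category': 'rodent' }] };

const broken = { ...cameraFilter, ...labelFilter };
// => { $or: [{ 'labels.category': 'rodent' }] }, the camera clause is gone!

// One fix: AND the sub-filters together so every $or survives.
const fixed = { $and: [cameraFilter, labelFilter] };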

Implement retroactive inference

E.g., after a user creates an automation rule, review all existing matching images in the db, check whether or not they already have predictions from that model, and if not, add them to the queue.
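
A hypothetical sketch of that backfill pass; Image, rule.filter, and addToInferenceQueue() are assumptions, not existing code:

async function backfillInference(rule) {
  const images = await Image.find(rule.filter); // images the rule applies to
  for (const image of images) {
    // Skip images that already have a prediction from this model.
    const alreadyInferred = (image.objects || []).some((obj) =>
      obj.labels.some((lbl) => lbl.mlModel === rule.model)
    );
    if (!alreadyInferred) {
      await addToInferenceQueue({ imageId: image._id, model: rule.model });
    }
  }
}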

Bug: prevent false duplicate error from being thrown

Currently, we do dupe Image checking when saving to the db by (1) ensuring the _id field, which was generated by a md5 hashing function in animl-ingest, is unique, and (2) by enforcing a unique compound index in MongoDB for images (cameraId + dateTimeOriginal):

ImageSchema.index(
  { cameraId: 1, dateTimeOriginal: -1 },
  { unique: true, sparse: true }
);

This has worked fine the vast majority of the time; however, sometimes two sequential images taken in a burst have the same timestamp b/c their temporal resolution is limited to seconds, not milliseconds. This causes an incorrect duplicate error to be thrown, and the image gets rejected. I think the solution is to remove the unique constraint from the secondary compound index and just rely on the _ids being unique.
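
Something like the following, i.e. the same index minus the unique constraint:

// Keep the compound index for query performance, but drop `unique` and
// rely on the md5-based _id for de-duplication instead.
ImageSchema.index(
  { cameraId: 1, dateTimeOriginal: -1 },
  { sparse: true }
);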

Bug: Timezone issue with dateTimeOriginal from Buckeye exif data

The dateTimeOriginal timestamps are in PST and don't have a TZ offset (e.g. 2021:09:25 21:36:12), and there's no info in the exif data to indicate the timezone. So when we cast the string to a date, moment assumes it's in UTC, and it gets saved to the DB as 2021-09-25T21:36:12.000+00:00, which is the wrong time.

Meanwhile, we don't convert Created Date or Added Date to the local timezone on the frontend, even though Added Date is correctly created in UTC. Because we don't parse it as PST when filtering and displaying it, it shows up as the UTC time and looks off.

Not sure what to do here... as a first step I'll compare dateTimeOriginal from other camera makes and see if they do the same.
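
If they turn out to be zone-less local times as well, one possible fix (a sketch) would be to parse the exif timestamp with an explicit zone via moment-timezone rather than letting moment default to UTC:

const moment = require('moment-timezone');

const raw = '2021:09:25 21:36:12'; // dateTimeOriginal from Buckeye exif
const dto = moment.tz(raw, 'YYYY:MM:DD HH:mm:ss', 'America/Los_Angeles');
console.log(dto.toISOString()); // 2021-09-26T04:36:12.000Z, the correct instant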

When we fix this we'll need to do a scripted update of all images in the prod DB.

Add "view" mutation handler

  • create view schema ("filters" as string or object?, "name" of view) (one possible shape is sketched after this list)
  • create mutation handler for creating, updating, and deleting view
  • create query handler for getting all views
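
A hypothetical View schema sketch; the field choices reflect the open questions above and are assumptions only:

const mongoose = require('mongoose');

const ViewSchema = new mongoose.Schema({
  name: { type: String, required: true },
  // storing filters as a stringified object keeps the schema flexible
  filters: { type: String, required: true },
});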

Implement concept of "Deployments"

A "deployment" is a specific camera at a specific location for a certain period of time. Users should not be forced to add "deployments" or set them manually (all cameras should start with a default deployment that doesn't have start/end dates). If a camera is put out in the field and never moved, the default deployment would suffice. But users should also be able to retroactively add & adjust deployments (move the start & end date, change the name, and potentially change permissions).

A first step will be to expand the "Camera" schema & implement resolvers to allow users to add names and other metadata.

Things to think about (a possible schema shape is sketched after this list):

  • how to define the start/end date in the schema? Should we include a start date? Or just set end dates, and if there's another deployment's end date before it, consider that the start date?
  • if we start by setting access control on the camera level, how would we move to deployment-level permissions down the road? Deployments are really the natural resource to perform access control on.
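
One possible shape, as a sketch; it assumes an existing Mongoose CameraSchema and leaves the start-vs-end-date question open:

const mongoose = require('mongoose');

const DeploymentSchema = new mongoose.Schema({
  name: { type: String, default: 'default' }, // every camera gets a default deployment
  startDate: { type: Date }, // optional: maybe end dates alone suffice?
  endDate: { type: Date },
});

CameraSchema.add({ deployments: [DeploymentSchema] });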

Implement CSV export

Think about making this work for the ecological / data analysis use case as well as the ML data training use case.

It might be a little tricky with GraphQL... we'd likely have to stringify the entire csv and return it as one field in the JSON response, and/or potentially convert it to base64.
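
A sketch of the stringify-and-return idea; json2csv is one option, and the resolver and model method names are assumptions:

const { Parser } = require('json2csv');

const resolvers = {
  Query: {
    exportCSV: async (_, { filters }, context) => {
      const images = await context.models.Image.queryByFilter(filters);
      const parser = new Parser(); // one column per field by default
      const csv = parser.parse(images);
      // return the whole CSV as a single base64 field in the JSON response
      return { csv: Buffer.from(csv).toString('base64') };
    },
  },
};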

We'd also have to make some decisions about the structure of the CSV:

  • only include validated labels? include all validated labels on an object or just the first validated label in the labels array?
  • create a column for each available label, and for each image row use the number of validated labels present as the label column's value? YES

Make app idempotent

(capable of handling duplicate messages). It's rare but a sure bet that we will replay SQS messages that have already been processed, resulting in unnecessary inference and duplicate labels. Do the following:

  • before adding a label, check image to see whether a label from that model has been applied to it already

Refactor label schema

Use representations of "objects" (in the "megadetector detected an object" sense of the word) as the unit of storage, with arrays of potential labels stored within them. When a new label is added (e.g. by MIRA), first see if there's already an object with the same bbox and add it to that; else create a new object.
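
A sketch of that object-centric shape (field names are assumptions):

const mongoose = require('mongoose');

const LabelSchema = new mongoose.Schema({
  category: String,
  conf: Number,
  mlModel: String, // or a userId for manually applied labels
  validation: { validated: Boolean },
});

const ObjectSchema = new mongoose.Schema({
  bbox: [Number], // e.g. relative [ymin, xmin, ymax, xmax]
  locked: { type: Boolean, default: false },
  labels: [LabelSchema], // all candidate labels for this object
});

// On new label: find an object with a matching bbox, else create one.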

Make inference pipeline more efficient

Right now the Megadetector API caps us at 10 requests per 5 minutes. I've talked with Dan Morris at MS AI 4 Earth, and he's totally fine with increasing this for us, I just need to get back to him with our expected usage. So that would be an obvious first step.

However, the whole inference worker is structured around this limitation: in order to not inundate the Megadetector with requests once we've maxed out, I have the inference handler.js function poll SQS for new messages every 5 minutes, pull the first 10 out of the queue, and request inference on them. There are a bunch of ways this could be improved:

  • set up a separate queue & worker function for MIRA inference requests that runs inference as soon as messages hit SQS (we host it, so there's no request limit)
  • send 8 images per request to Megadetector API
  • host Megadetector ourselves

It's also worth noting that there are really two separate use cases that we need to support but might have different solutions: (1) real time inference from images coming into the system from wireless camera traps, and (2) bulk inference from users uploading images from a hard drive.

Fix label filtering bug

Images with non-validated labels within locked objects are still passing the labels filter. For example, if you only have the 'rodent' label selected, it returns locked objects that have a different label validated (e.g. skunk, bird, etc.) while the rodent label's validation = null. There must be something wrong with the query here.

A note on error handling in GraphQL

GraphQL doesn't send back HTTP status codes like a regular REST API would, because there is not a 1-to-1 mapping of requests/queries to route handlers (or "resolvers" in GraphQL parlance). A single query can require many resolvers to fire and respond, so errors can occur in many places, multiple errors can be returned at once, and partial data might get returned. Because of this, GraphQL APIs will almost always return a 200 status, with partial data if possible and an array of error messages if any occurred, e.g.:

{
  "data": {
    "getInt": 12,
    "getString": null
  },
  "errors": [
    {
      "message": "Failed to get string!",
      // ...additional fields...
    }
  ]
}

I like the approach Apollo took in Apollo Server 2.0: it returns arrays of errors with an errors[i].extensions field that can contain useful info, most importantly a standardized code field. You can also read more about their strategy here.

Unfortunately, the latest graphql-yoga is not running apollo-server 2.0 under the hood, so I had to implement a hacky formatError() function to make the errors look more like ApolloErrors (https://github.com/apollographql/apollo-server/blob/main/packages/apollo-server-errors/src/index.ts). We'll see how that goes.
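
A hypothetical sketch of what such a formatError() shim might look like, reshaping yoga's errors to carry an extensions.code field the way ApolloErrors do:

function formatError(err) {
  // Fall back to a generic code when the underlying error didn't set one.
  const code = (err.originalError && err.originalError.code) || 'INTERNAL_SERVER_ERROR';
  return {
    message: err.message,
    path: err.path,
    locations: err.locations,
    extensions: { code },
  };
}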

Another thing to consider might be to wrap all resolver functions in a helmet() function that catches all db-related errors. See this article for more info.

Duplicate image saving bug

It looks like if animl-ingest doesn't get a 200 response, it will keep retrying the saveImage call to animl-api. When uploading p_000309.jpg and p_000310.jpg in tandem, 310 saved successfully, but 309 got stuck (I think in the addToAutomationQueue() step) and never returned a response to animl-ingest, causing it to keep retrying. It saved multiple 309's, which Mongo shouldn't allow, so there also seems to be a separate de-dupe issue.
This happened again with p_000536.jpg.

Move automation rules to Project level

Currently users set automation rules at the View level, but I think this is somewhat of a relic from before we implemented multi-tenancy, and it also poses issues now that users can configure confidence thresholds & disable classes. For example, if an image belongs to two views, and each view has an automation rule set to request inference from the same ML model but with different class settings, reconciling that would get complicated. We could maybe treat class settings as a front-end filter and always request inference with the same default settings, but I think the simpler option is to move all automation rules to the Project level & apply them to all views within that project.

It would entail the following:

  • Move Automation Rules from View to Project schema, migrate existing automation rules in DB to their respective Projects
  • Update buildCallstack() in src/automation/utils.js to reflect the new structure. This will become much simpler: we will no longer need to check which views an image belongs to or de-dupe rules.
  • Create new mutation resolvers for creating/updating/deleting automation rules (currently we're just using the updateView() resolver, which we should still keep).
  • Update frontend

Restrict access to S3 and serve images through API

Right now the images themselves are unprotected: if someone knew an ID they'd be able to request it.

To fix, create a resolver that checks authentication, reads the image data from the S3 bucket, encodes it in base64, and returns it to the front end in a JSON object.
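
A sketch of that resolver using the AWS SDK v2; the bucket env var, key argument, and context shape are assumptions:

const S3 = require('aws-sdk/clients/s3');
const s3 = new S3();

async function image(_, { key }, context) {
  if (!context.user) throw new Error('Not authenticated'); // auth gate first
  const obj = await s3
    .getObject({ Bucket: process.env.IMAGES_BUCKET, Key: key })
    .promise();
  // Body is a Buffer, so base64-encode it for the JSON response.
  return { mimeType: obj.ContentType, data: obj.Body.toString('base64') };
}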

Fix race condition issue when updating objects

There are a couple places where animl-frontend requests two different updates to the same image in rapid succession.

Essentially we've created a race condition, and MongoDB blocks the second update by throwing a versioning error, e.g.: "Error: VersionError: No matching document found for id <_id> version 2". Because the second request is often to set the object.locked property to true, this most commonly manifests as objects not locking after being validated. I'm not yet sure how many times this has occurred. Relevant discussions of the issue:

I think the solution is to use Model.updateOne() or Model.findOneAndUpdate() rather than Model.find() and Model.save() in the Image model's updateObject() function. Good documentation and explanation of the difference between .save() and .updateOne() can be found here, and here.

Things to test (see the sketch after this list):

  • how to update a field in a nested array (i.e., a single object in an image's array of objects)?
  • how to update multiple fields at once (because we might have multiple object diffs to set)?
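
A sketch answering both questions with one atomic update: the positional $ operator targets the matched element in the objects array, and a single $set can carry multiple fields (the field names are assumptions):

await Image.findOneAndUpdate(
  { _id: imageId, 'objects._id': objectId },
  {
    $set: {
      'objects.$.locked': true,          // a field in a nested array element
      'objects.$.labels': updatedLabels, // a second field in the same $set
    },
  },
  { new: true }
);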

Authorization and multi-tenant user management system

The goal is to implement an auth and user management system that supports the following: 

  • Separate, siloed “projects”, in which read access to images and all other permissions are limited to users who are members of that project
  • Tiered user roles, to further limit/grant permissions within each project:
    • Product Owner/Super user/admin: Natty/Falk created 2 new projects: one for SCI biosecurity, and one for a large carnivore study on the Dangermond preserve. They can create and add users to those projects and set those users' permission levels.
    • Project Manager/Project admin: Juli, the NPS biosecurity manager (and probably Natty & Falk as well), are admins for the SCI biosecurity project; they can add users, configure inference pipelines and other automation rules, see all images that belong to that project, and edit labels. They can create and edit views. If the data model requires that cameras or base stations be registered to projects, the project admins can do this. They can also upload images directly from their computers.
    • Project Contributor: Miles added this one, but I am not sure we need it. Essentially a role permitting image/asset write access.
    • Project Member/Project reviewer: Will is an NPS biosecurity intern for the summer, and part of his job is to help review and edit labels. He will need read access to all images within a project and write access to the labels, and he can create views to help with his review workflow. He cannot edit inference pipelines or create new users.
    • Project Observer: I'm not 100% sure we need this one, but it's basically a read-only role. Perhaps this is useful for ecologists or researchers outside of our organization who want to consume the reviewed, validated data to build a population model, but whom we don't necessarily want editing any labels.

We will utilize Cognito groups to organize both the projects and roles (a sketch of parsing these groups follows the list below).

  • Each group name will contain both the project name/id and role, e.g.: animl/sci_biosecurity/project_owner
  • When a user is authenticated by Cognito, the returned ID token will include a list of all groups (project + role combinations) they are a member of.
  • The front end parses out the projects and roles the user is a member of and builds a list of projects in state, allowing the user to navigate between projects so they know what project they're "logged into" and acting on behalf of.
  • Front end will also show/hide/disable certain functionality based on role.
  • All requests to the API will contain the ID token, but we also need to figure out a way to pass in the current group the user is acting on behalf of... perhaps a custom header?
  • The animl-api resolvers (at the model level) will be responsible for role gatekeeping (making sure the user's role has permission to perform the action)
  • The siloing of data by project will happen at the DB query/filtering level: every entity/record in the DB will be associated with a project, and we will simply append "project = usersCurrentProjectId" to all queries.
  • This requires associating all new entities with a project when they're created.
    • For manually created records, like a view or automation rule, the project would be whatever project the user is currently "logged into"
    • For automatically created image records, users must first "register" that particular camera's serial number with their project. They would use the UI to request adding a new camera; the API would then check that the serial number isn't already associated with a different project, and if not, it associates it with the user's current project, and all new images coming into the system from that camera will be tagged with that project.
    • Future ideas: implement the ability for projects to "release"/"unregister" a serial number in case the camera needs to be moved to a different project, freeing it up for a different project to grab it
    • also maybe implement a way for different projects to request access to those cameras' images or some subset of them... so there's a primary owner of each image but maybe a list of other projects that have read-only access or something? Not sure how that would work exactly.
    • Not sure we need to associate projects w/ label records. It kind of depends on whether an image can be viewed or owned by more than one project. If no, then we don't really need to; if yes, then maybe we do, as different projects might have different labeling goals/needs.
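
A sketch of parsing project + role pairs out of the Cognito ID token's groups claim, assuming group names of the form 'animl/sci_biosecurity/project_owner' as described above:

function parseGroups(idTokenPayload) {
  // Cognito surfaces group membership in the 'cognito:groups' claim.
  const groups = idTokenPayload['cognito:groups'] || [];
  return groups.map((group) => {
    const [org, projectId, role] = group.split('/');
    return { org, projectId, role };
  });
}

// e.g. parseGroups(token)
// => [{ org: 'animl', projectId: 'sci_biosecurity', role: 'project_owner' }]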
