lookit / lookit-api

Codebase for Lookit v2 and Experimenter v2. Includes an API. Docs: http://lookit.readthedocs.io/
Home Page: https://lookit.mit.edu/
License: MIT License
Pain point: Complying with GDPR requests requires manual work from an admin to collect and/or delete data from Lookit.
Acceptance criteria: Lookit complies with the GDPR "right to be forgotten," "right to access," and "right to portability" by giving participants a way to request deletion of all their data, a copy of all their data, and information about how their data is being used.
Implementation notes/Suggestions: We will likely need to differentiate between the "Lookit copy" of data and what researchers have already been given access to; when planning this task, we should schedule a quick conversation with OGC. When deleting data, we may be allowed to retain some record that it existed at all, for researchers' use in reporting the rate of these events. We are already in compliance in that we can and will respond to requests appropriately, but for scaling we should make it easier to do this without manually handling requests.
Rather than downloading ALL videos or ALL CONSENT videos, allow researchers to select which sessions to download videos from by checking/unchecking a list of sessions or giving a date range. Display both session dates and the number of videos associated.
Also consider putting the download link in Experimenter instead of sending it by email.
Review the entire codebase (lookit-api, ember-lookit-frameplayer, exp-addons) for overlooked security vulnerabilities, opportunities to apply best practices, and inefficiencies. Add tasks as needed. The idea is to catch major/obvious issues.
This would both make the collected data much easier for researchers to interpret (instead of having a separate "record" every time someone refreshed the setup page) and behave more in line with what a participant might expect. It would also mean we'd only have conditions assigned in cases where the user proceeded through consent, so that relative counts ("we have enough of condition A but need more of condition B") would be more accurate. Plus, participants who'd refreshed the setup page a lot wouldn't run into increasingly long load times as the system worked to look up every record and send it in groups of 10. In principle, this could be done either by not saving the session data at all until consent, or by granting access only to admins unless consent was completed. The former would likely need to be handled at the level of generating the experiment from the JSON spec: wait for a frame that has an isConsent designation and have every frame thereafter send data, OR have a consent frame set some property of the session data that's then checked before sending data on each transition. A minimal sketch of the latter option follows.
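A minimal sketch of the access-gating option, assuming a hypothetical `Response` model with a `completed_consent_frame` boolean that the frame player sets once a frame designated `isConsent` finishes:

```python
from django.db import models


class ResponseQuerySet(models.QuerySet):
    def visible_to(self, user):
        """Hide pre-consent responses from everyone except admins."""
        if user.is_superuser:
            return self
        return self.filter(completed_consent_frame=True)


class Response(models.Model):
    completed_consent_frame = models.BooleanField(default=False)  # hypothetical field

    objects = ResponseQuerySet.as_manager()
```

If researcher-facing views and API endpoints all go through `visible_to()`, abandoned setup-page sessions never surface in downloads or condition counts.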
Pain point: Email announcements of new studies for a child are currently handled by a collection of python scripts run on Kim's laptop, writing to / checking from text files of email addresses already notified. This will not scale well and allows no independence for researchers in turning on/off notifications, specifying what to say, etc. Researchers cannot send these emails themselves because they (appropriately) do not have the ability to email Lookit users who have not already participated in their studies.
Acceptance criteria:
Implementation notes/Suggestions: Centralizing email announcements of new studies allows us to prevent participants from receiving overwhelming numbers of announcements. Sending study announcements will allow us to better leverage the existing participant userbase for recruitment as the platform grows, and to keep families engaged with research.
I.e., generate the object from the text, but additionally store the actual text. Researchers find it unintuitive that the JSON document describing their study is rearranged upon saving; structure that makes it easier to read (e.g., grouping field names in an object by function, even though technically they're unordered) is erased. Although working from a local copy is sound practice anyway, and researchers will likely just be pasting in their JSON, it would be nicer to store their own text (and just use the parsed JSON internally).
Pain Point: The distinctions between confirmatory and exploratory work and between pilot and "real" data collection are critical for planning studies and interpreting results, but researchers face challenges in learning new practices as scientific norms change rapidly to reflect our better understanding of their impact.
Acceptance Criteria: The standard workflow in designing and starting a study on Lookit includes clear delineation of whether the work is exploratory or confirmatory, and when "real" data collection begins. Researchers can link to preregistration in their study details. Researchers state whether data collection is in "pilot" mode and that information is included in session data.
Implementation notes/Suggestions:
Everyone wants this! It'd expand the measures we could collect from kids (natural search/exploration, interactive games, etc.) greatly. My impression is that it'd be a bit complicated to make sure everything worked smoothly across mobile OS/browsers, plus we'd need to start designating some studies as computer and some as tablet (or phone?) since these devices won't be so good in most cases for collecting infant looking data. This should happen after phase 2 for a variety of reasons. Because this is a separate and substantial block of work, ideally we might be able to find a collaborator to provide funding specifically for this piece.
Pain point: Onboarding new researchers to use the staging and/or production Lookit sites requires a substantial number of manual steps and some awkward workarounds. For instance, researchers have to try to log in, let Kim know they've done so, and then Kim grants access to the site (she is not automatically notified as an admin that someone has requested access). Kim also grants them access to existing example studies manually. There is no way for closely associated researchers (i.e., those in the same lab) to automatically get access to all of their lab's studies, and no way to associate studies with a particular research group beyond the PI contact info provided. There is no distinction between access to study details (which we might want to share broadly, e.g. to allow easy replication) and access to study data.
Planned functionality/changes:
|  | Preview | Design | Analysis | Submission-processor | Researcher | Manager | Admin |
|---|---|---|---|---|---|---|---|
| Read study details (protocol, etc) | x | x | x | x | x | x | x |
| Write study details |  | x |  |  |  | x | x |
| Change study status (incl. write changes that would reject study) |  | x |  | x |  | x | x |
| Manage study researchers (grant/change permissions) |  |  |  |  |  | x | x |
| View/download study data |  |  | x |  | x |  | x |
| Code consent and create/edit feedback |  |  |  | x | x |  | x |
| Change study lab |  |  |  |  |  |  | x |
Create new Lab model to encompass the following functionality:
Joining a lab:
Editing/managing a lab:
Creating a lab:
Lab permissions:
|  | Lab researcher | Lab read | Lab admin |
|---|---|---|---|
| Create studies associated with this Lab, and can be added manually to any of this Lab's studies | x | x | x |
| Preview role for all Lab studies |  | x | x |
| Manager role for all Lab studies |  |  | x |
| Manage permissions for Lab (add new researchers, etc.) |  |  | x |
| Edit lab metadata (description, website, etc.) |  |  | x |
Note that new Lookit-wide permissions are expected to eventually replace the "MIT Org Read" and "MIT Org Admin" roles (see #459), but for now (given the staff of two...) we will just rely on superuser permissions.
Create a list of other systems for building or deploying web-based experiments and/or participant management (e.g. labJS, pavlovia, jsPsych, opensesame, expfactory, prolific).
The goal is then to decide:
(a) which if any we want to eventually support using on Lookit in place of ember-lookit-frameplayer, and when
(b) whether it makes sense to pursue collaborative development at this point and when to revisit if not
Pain Point: Available studies are currently displayed to families in a grid in a fixed order; with either complex inclusion criteria or more than a few studies, it is difficult for a parent to tell which studies are appropriate for their own children.
High-level acceptance criteria: A parent can see a list of studies appropriate for a given child without reading through individual study criteria. Parents and users not logged in can still browse all "discoverable" studies. Parents can view a list of studies by child under "past studies."
Tasks:
Card design:
Finding studies (Participant-facing study list view):
A study counts as participated in for a child once a session's `completedConsentFrame` is true.
Navigation:
Tests:
Pain point: The first step in coding data from a Lookit session is to check that the associated consent video shows a parent making a statement of informed consent. It is possible for data to be collected in the absence of an informed consent statement – for instance, a parent might not read the statement because he or she does not understand written English. Currently, it is up to each lab to come up with a consent coding workflow and avoid viewing any session video until consent has been confirmed. Researchers can download individual videos, all consent videos, or all videos. This manual task is potentially error-prone and represents a lot of duplication of work across labs. A simple GUI for coding consent and central storage of consent information would enforce a clear consent coding process, reduce potential for dangerous human error, and reduce the burden on individual labs.
Acceptance criteria: Researchers can use a view on the experimenter interface to see consent videos, which they can filter by new, marked as non-consent, and marked as consent. Each video is displayed on the page along with some basic information about the associated session, including existing consent information if available. Next to the video, the researcher selects whether the consent video is valid or not (possibly from a short list of reasons why not, to distinguish tech issues from non-reading), and can enter a brief string with any notes. The consent value, note, and identity of the researcher who provided them are stored with the session record. Only videos from sessions where consent has been confirmed may be downloaded.
Implementation notes/Suggestions: Consent judgments should be possible to overwrite, e.g. in case an RA thinks something isn't valid but it is or vice versa. Storing history of such judgments would be nice to have but not critical.
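If judgment history is stored, one possible shape is append-only rulings where the latest row wins; a sketch with illustrative names (these are not the actual Lookit models):

```python
from django.conf import settings
from django.db import models


class ConsentRuling(models.Model):
    response = models.ForeignKey("Response", on_delete=models.CASCADE)  # illustrative
    action = models.CharField(max_length=16)  # e.g. "accepted" / "rejected"
    comments = models.TextField(blank=True)   # brief researcher note
    arbiter = models.ForeignKey(
        settings.AUTH_USER_MODEL, null=True, on_delete=models.SET_NULL
    )
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ["-created_at"]  # .first() on the related set is the current ruling
```

Overwriting then just means adding a new row, and the history comes for free.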
Pain point: Upon launch, we will need to work to increase the Lookit userbase via outreach and advertising, but we don't currently have a way to evaluate such efforts (e.g., to see how many people registered in the past week).
Acceptance criteria: The family outreach specialist can easily monitor and evaluate advertising efforts by answering questions like the following, using the Lookit admin or experimenter interface without doing any programming:
Implementation notes/Suggestions: This can possibly be part of either the experimenter or the admin apps in Django. It seems like it might build on existing functionality in admin, except that we don't want usage to be limited to people who are actually admins (able to see/manipulate all data).
We've discussed building a dashboard and essentially fetching a bunch of data, then allowing filtering down from that using sliders/etc. (e.g. for age range, demographics). It could show things like new participants registered per week, a bar chart of the age distribution, tables of demographic form responses, and a plot of # unique study participants / week (one line for total unique participants, lines for individual studies).
It might turn out that there's nothing preventing us from allowing all researchers to use this from an ethics/privacy standpoint (if there's no way for them to get identifying info, just composite stats we could share with them anyway), which would be great someday; BUT the primary intended users are still a couple of people at MIT, for the purpose of deciding whether we need to engineer database access for many users.
Pain point: Sending an email to a participant - e.g., to verify consent, answer a question raised in the exit survey, or provide compensation - is cumbersome; the researcher needs to note the account ID or UUID from the response view or CSV, then go to the "email participants" view, select what type of email, and select the account by ID or UUID from a list of accounts that accept that type of email. Emails sent by researchers through the Lookit email interface then disappear forever, with no record to allow coordination among multiple lab members sending emails, documentation of participant compensation or responses to questions, etc.
Acceptance criteria:
Implementation notes/Suggestions:
Studies often generate many short clips, and it's much easier for coding to have these put together into one video per session (with embedded labels). My own coding workflow does involve doing this automatically (see https://github.com/kimberscott/lookit-data-processing); it would be helpful to provide concatenated clips directly to researchers, but this will be a substantial task, as the video processing will have to happen on one of the servers, triggered by finishing the study, and it is fairly computationally intensive. Also, we may want to set up to allow frames to create video labels as a special type of event record, and then put those labels on the videos at those times as we do for Molly's videos.
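For reference, a sketch of the lossless concatenation step using ffmpeg's concat demuxer (this assumes all clips from a session share codec parameters, as same-session webcam recordings should; burning in labels would additionally require a re-encode, e.g. with a drawtext filter):

```python
import subprocess
import tempfile


def concat_clips(clip_paths, out_path):
    """Join same-codec clips into a single per-session video without re-encoding."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.writelines(f"file '{path}'\n" for path in clip_paths)
        list_file = f.name
    subprocess.check_call(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", out_path]
    )
```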
Researchers can leave feedback on studies, but few parents currently ever find this feedback on the website. It would be helpful to have a little "flag" shown to users to indicate that they have new feedback.
Pain point: Before studies are actually deployed on Lookit, they have to be approved by an admin, which gives us an opportunity to check for compliance with terms of use, help researchers ensure instructions are clear, etc. Studies have to be re-approved after changes are made. But there's currently no way for an admin to tell what has changed; seeing the changes would allow vastly expedited review in cases where the researcher fixed a typo or changed the age range, allowing us to focus energy on cases where new code has been introduced, etc.
Acceptance criteria: When reviewing a submitted study, a Lookit admin can see what has changed since the last approved version (if there is one) and see a history of actions taken on the study (e.g. edits/state changes). Either when saving changes to a study or when submitting, a researcher can provide a note about the purpose of the changes (like a commit message).
Implementation notes/Suggestions: Changes may have been made to any of the fields on the study model - e.g. purpose, description, title, eligibility criteria, JSON doc, commit SHAs. For all except the JSON doc, simply displaying the previous and current versions of any changed fields would be fine. For the JSON doc, some sort of actual diff output would be helpful if possible, since changes will often be confined to a few lines.
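For the JSON diff, the standard library may well be enough; a minimal sketch, assuming we have the stored protocol dicts for the last approved and newly submitted versions:

```python
import difflib
import json


def study_json_diff(approved_doc, submitted_doc):
    """Unified diff of the pretty-printed study JSON."""
    old = json.dumps(approved_doc, indent=2, sort_keys=True).splitlines(keepends=True)
    new = json.dumps(submitted_doc, indent=2, sort_keys=True).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old, new, "approved", "submitted"))
```

Note that sort_keys also hides reordering-only changes, which is what a reviewer wants here; if we end up storing the researcher's text as entered (see above), we could diff that text directly instead.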
Make it easy to transfer exactly the data where we've confirmed consent & participant allowed Databrary use to Databrary. What's easy to do has a big influence on what's done!
Pain point: The details of viewing and editing studies on the experimenter interface are unintuitive.
Acceptance criteria:
Implementation notes/Suggestions:
Collecting minor notes/ideas:
There are ~23 entries in this file with:
https://storage.googleapis.com/io-osf-lookit-staging2/static/images/nsf.gif
I believe these would work if updated to /static/images/nsf.gif.
Pain Point: Once parents participate in a study, there is no immediate confirmation that it "worked" and Lookit has their data, or any later confirmation that it has been used to do cool science. They also do not have automatic access to their own data, although they often want to see their videos; instead, a researcher has to provide it if desired, which is labor-intensive and introduces unnecessary possibilities for human error (e.g. sending the wrong child's video).
Acceptance Criteria:
Implementation notes/Suggestions:
Because all the code for a particular study is deployed on Google Cloud storage, previewing a study takes some time - about 15-20 minutes. This is an extremely impractical delay for iterative testing of an experiment - if you realize you made a slight mistake on a frame specification, or want to change the colors a little and see how it looks, etc., every single change is a 15-20 minute delay. I think researchers are going to get frustrated with this very quickly and/or be less independent (less willing/able to do their own troubleshooting) in this situation, and that it's worth figuring out how to re-use the most recent deployment when (as in most cases!) the underlying code used won't have changed.
Also, the link for preview (which doesn't change anyway) should be shown on the build study page, indicating whether the most recent request has been completed, so that researchers don't have to go through their email trying to find the link.
Success: Study preview or deployment should take <= 20s when no changes have been made to the exp-addons code used.
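A sketch of the reuse check, assuming each bundle is stored under a key derived from the exact code versions used and that `storage` is a Django storage backend (names are hypothetical):

```python
import hashlib


def build_cache_key(frameplayer_sha, addons_sha):
    """Builds from identical commits can share one deployed bundle."""
    digest = hashlib.sha256(f"{frameplayer_sha}:{addons_sha}".encode()).hexdigest()
    return f"builds/{digest[:16]}.zip"


def needs_rebuild(storage, frameplayer_sha, addons_sha):
    """Only kick off the 15-20 minute build if no bundle exists for this exact code."""
    return not storage.exists(build_cache_key(frameplayer_sha, addons_sha))
```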
Pain point: When a child has already participated in a study at least once, the parent sees a standard warning upon participating again, even if it is designed to be a longitudinal study with multiple sessions.
Acceptance criteria: Researchers can specify whether participants should see a warning when participating multiple times.
Implementation notes/Suggestions: Eventually we want to support particular schedules for longitudinal designs; this is a first step that will cover the immediate problem.
This is an important long-term priority, but there's not a lot we could do with it yet (before good automated gaze detection). It could potentially be supported separately, e.g. as part of a grant to work on the automated coding.
Pain point: There is no way for a Lookit admin to track cumulative time that particular studies have been active, how many active studies a particular user is responsible for, etc. This will make setting up to allow researchers to eventually pay for "study slots," or enforcing fair usage of the site, difficult.
Acceptance criteria:
Implementation notes/Suggestions: This will require additional scoping. More flexible limits such as total study-days may be needed (e.g., researcher X has 100 study-days; after running two studies for 40 days each and another for 20 days, data collection is cut off).
Arrange basic expert accessibility review via MIT Accessibility & Usability. Decide which things to fix when, and which practices to adopt more generally. Add tasks per results.
Initial meeting Thursday 5/6.
It can be difficult to find a particular participant's previous sessions on the Individual Responses page, or to look up a particular session when responding to a question from a participant. Adding a search box (probably by parent ID, child ID, and possibly child/parent nickname) would be helpful in a variety of cases.
e.g., jsPsych, labJS.
Placeholder for work to come after #177
Pain point: Fetching data via the API, e.g. "all sessions of this study completed by this child" as fetched at the start of any study, is quite slow.
Acceptance criteria: Fetching all user data (or say 3000 records) takes <15 seconds, OR some more reasonable amount of time if that's unrealistic. (It seems like not a huge amount of actual data but I don't deal with databases much.) Fetching records of all 50 sessions of a study completed by one child takes <5 seconds.
Implementation notes/Suggestions: Maybe we don't need data paginated in groups of 10 by default in the API? Is the database properly indexed?
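Two cheap first steps, sketched with Django REST framework settings and an illustrative index (the real model and field names may differ):

```python
# settings.py -- stop paginating in groups of 10
REST_FRAMEWORK = {
    "DEFAULT_PAGINATION_CLASS": "rest_framework.pagination.PageNumberPagination",
    "PAGE_SIZE": 100,
}

# models.py -- make "all sessions of this study by this child" an index lookup
from django.db import models


class Session(models.Model):  # illustrative stand-in for the real response model
    study = models.ForeignKey("Study", on_delete=models.CASCADE)
    child = models.ForeignKey("Child", on_delete=models.CASCADE)

    class Meta:
        indexes = [models.Index(fields=["study", "child"])]
```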
The demo should show the data collected, and should be easy for the researcher to store and host.
Right now the code for a particular study (a snapshot of all the code RIGHT NOW, so that future changes don't affect operation of the study) is already bundled up and stored on Google Cloud storage. We need a way for researchers to host this themselves and run it in preview mode, without having to also run ember-lookit-frameplayer / lookit-api, and with some dummy functions overriding usual video recording. This is important to support eventually but can be added at any point.
Pain point: At the conclusion of any study, parents select a privacy level for the video (in-lab use only, scientific/educational sharing allowed, or publicity use allowed) and whether data may be shared on Databrary. Parents also have the option to withdraw video data entirely. This option is rarely used, but is important because we are filming in families’ homes and sensitive information could be accidentally disclosed during recording. When a parent selects that they would like to withdraw video, this information is included in information given to researchers, but the video is not automatically deleted - that is left to the researcher. This introduces unnecessary potential for human error.
Acceptance criteria: Video from any session that a parent withdraws video from is deleted from all Lookit storage automatically, and is not available to the researcher running the study. Other data from the session is still available.
Implementation notes/Suggestions: Ideally we might introduce a slight delay (e.g. 1-7 days) before deleting video from Lookit servers, in case the parent withdrew accidentally and asks us to restore the video; however, this is likely much more trouble than it's worth.
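If we did want the grace period, the mechanics are cheap with celery; a sketch assuming a hypothetical `Session` model and `delete_session_videos` helper:

```python
from celery import shared_task

from studies.models import Session, delete_session_videos  # hypothetical imports


@shared_task
def purge_withdrawn_videos(session_id):
    session = Session.objects.get(pk=session_id)
    if session.video_withdrawn:  # re-check: the parent may have asked us to restore it
        delete_session_videos(session)  # removes every stored copy


def handle_withdrawal(session):
    """Hide the video from researchers immediately; purge it for real after 7 days."""
    purge_withdrawn_videos.apply_async((session.pk,), countdown=7 * 24 * 3600)
```

The immediate effect for researchers is the same either way; the countdown only changes when the bytes actually disappear.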
Update Django to >= 1.11.19 (see GitHub vulnerability alerts).
We may want to allow researchers to specify a location for their videos to be stored rather than downloading from Lookit, and/or separate videos into locations based on studies.
This has been requested due to several IRBs' restrictions on storing participant data. However, it would substantially complicate e.g. showing participants their own videos.
Pain point: We need to increase the userbase but don't have information about how many people are visiting the site from what sources in order to evaluate advertising efforts.
Acceptance criteria: We can see traffic to the site over time, broken down by audience and how they got there if possible.
Implementation notes/Suggestions: Setting up Google Search Console & Analytics would probably work well for our purposes, but we can consider other options.
Any concern with the delay in cache invalidation when new assets are deployed via collectstatic? Worth fingerprinting? Or would it be preferable to use a CDN and issue invalidations?
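Fingerprinting is built into Django and sidesteps invalidation entirely: with the manifest storage backend, collectstatic writes hashed names like styles.55e7cbb9.css, so stale caches can never serve a new deploy's assets under an old name.

```python
# settings.py (Django 1.11-era setting)
STATICFILES_STORAGE = "django.contrib.staticfiles.storage.ManifestStaticFilesStorage"
```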
Security or code structure issues identified during review.
Current list of small stuff to address or defer at this point:
Pain point: Many researchers are interested in working with specific populations, but families do not have a standardized way to indicate whether they fall into various groups in a way that would support use of this information for recruitment/eligibility.
Acceptance criteria:
Implementation notes/Suggestions: MIT OGC advises us that including medically-relevant options is fine from a legal standpoint (MIT is not a covered entity under HIPAA).
Allow researchers to generate their own tokens. Currently an admin has to create a token for a researcher and provide it to them in order for them to use the API. Ideally anyone could try out the API on their own, without this manual step; this would encourage more development (not by us!) of coding tools that make use of automated data retrieval.
Considerations:
Pain point: Currently researchers need to use the API to leave feedback to parents about particular sessions. Feedback is one of the better ways we have to build engagement (let parents know a human watches & appreciates their video!) and we want it to be easy for researchers to leave.
Acceptance criteria:
Implementation notes/Suggestions:
Feedback is currently implemented sort of like consent rulings, as a separate model; there can be multiple feedbacks for the same response, perhaps left by different researchers. This is fine, but if it complicates things, condensing down to a single string feedback field on the response, or making it so there is at most one feedback associated with a response, would also work fine. (Migrations would be a little annoying but could just join together multiple feedbacks if needed without messing up any existing feedback.)
MIT is the only group currently using the feedback API, but we do use it and will need to update code to accommodate changes (see https://github.com/kimberscott/lookit-data-processing/blob/coding-workflow-multilab/scripts/experimenter.py)
Feedback would ideally be possible to add/edit via the consent manager, "individual responses" view, and by uploading a copy of the study CSV they downloaded with a "feedback" column updated. It should be possible to add feedback on sessions where the consent ruling is 'rejected' or 'pending.'
The arbitrary data should be cleaned to protect the database, and multiple attributes can be stored in a single standard JSON field if that's helpful. This could also be somehow joined with the "feedback" model if easier, e.g. have a "researcher-generated session data" model that's associated with a session and includes feedback and also this other data.
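A sketch of that combined shape, with hypothetical names (the JSONField import matches a Django 1.11 + Postgres stack):

```python
from django.contrib.postgres.fields import JSONField
from django.db import models


class ResearcherSessionData(models.Model):
    """Feedback plus arbitrary researcher-set attributes, one row per response."""
    response = models.OneToOneField("Response", on_delete=models.CASCADE)  # illustrative
    feedback = models.TextField(blank=True)
    extra = JSONField(default=dict)  # cleaned/validated arbitrary attributes
```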
Planned as part of testing & acceptance before launch: interview, select, and hire outside firm to conduct security audit. Includes time to fix any issues found.
Pain point: Researchers can enter a custom repo/branch/commit to use for frame definitions. Right now, if the SHA isn't valid, the default is used (i.e., the attempt to select a particular commit silently fails). Also, it's not clear to a researcher whether the "new" default (latest commit of the lookit repo's master branch) will be used each time a build is initiated or whether the commit SHA will be left unchanged unless they take action, so they may experience unexpected behavior that changes how their studies work.
Acceptance criteria:
Implementation notes/Suggestions:
Join the baby-gaze-coding listserv to stay up-to-date on efforts in this direction: https://mailman.mit.edu:444/mailman/listinfo/baby-gaze-coding
Plan out additional tools for email to participants and create separate issues per feature.
These tools will allow researchers to use the email functionality more easily, and to more flexibly communicate with families. Preliminary list:
Automatically email new feedback to participants if they've opted in. (Make a new email preference about this.)
A researcher would ideally be able to specify messages to send on a particular schedule for longitudinal participants, e.g. sending up to N total automatic messages per study of the following types:
A user would ideally be able to opt out of emails from a particular RESEARCHER as well as by type of email. This is so that if e.g. they've decided not to continue a particular study and don't want to hear about it anymore, they could opt out of those emails without opting out of all reminder emails. Or so that if one research group sends much more email they don't affect the opt-out rate as much for everyone else.
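One possible shape for those opt-outs, sketched with hypothetical names: one row per exclusion, where a null researcher means "this email type from anyone" and a blank type means "anything from this researcher":

```python
from django.conf import settings
from django.db import models


class EmailOptOut(models.Model):
    user = models.ForeignKey(
        settings.AUTH_USER_MODEL, on_delete=models.CASCADE, related_name="email_opt_outs"
    )
    email_type = models.CharField(max_length=32, blank=True)  # e.g. "reminder"
    researcher = models.ForeignKey(
        settings.AUTH_USER_MODEL, null=True, blank=True,
        on_delete=models.CASCADE, related_name="+",
    )
```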
Pain point: Compensating participants is necessary both for treating their time/effort fairly and for rapid recruitment. Currently labs are individually responsible for identifying participants who should receive compensation and sending gift card codes by email through the experimenter interface, which is time-consuming and introduces unnecessary possibility of human error. Setting up to give "points" per study, which families can collect across studies/children and flexibly redeem, would give us a variety of ways to improve recruitment & user engagement.
Acceptance criteria:
Implementation notes/Suggestions: Allocation of points could be done in a GUI analogous to that for consent coding, or actually in the same interface (although storing consent information should be possible without also determining point allocation), and/or by uploading a file. Note that tracking points allocated by organizations could also be done by having them "pre-pay" for some amount of points and only be allowed to "spend" up to that amount, as on MTurk; due to MIT's billing practices we will likely prefer just to let them allocate whatever points they want (clearly indicating the total) and bill them later.
E.g., have users enter sequence and frames separately; possibly provide a way to enter each frame definition separately so they can see which frames are available in a given repo and then arrange them.
A CSV file generated automatically shows the session data collected during the study, but the data is highly nested and is not flattened before turning into a CSV. Simple flattening would go a long way to making this data more usable; it would be even better if researchers could select which fields they wanted to include. A (partial) data dictionary could be generated and available for download alongside the CSV. The only problem with doing this after launch is that researchers relying on this data may have scripts that rely on the old version. @kimberscott may be able to work on it along with frames/documentation.
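Simple flattening is only a few lines; a sketch that joins nested keys with dots (list indices become numeric segments):

```python
def flatten(data, prefix=""):
    """{'exp_data': {'1-video': {'event': 'x'}}} -> {'exp_data.1-video.event': 'x'}"""
    flat = {}
    items = data.items() if isinstance(data, dict) else enumerate(data)
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat
```

The union of keys across a study's sessions then gives the CSV header, and doubles as a starting point for the data dictionary.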
Pain point: Currently the primary IDs (study, user, child) are in the form "907bf070-3f2e-4667-9aa8-4d2aa5c72e46", which makes checking any code or manually finding correct videos (a key part of coding!) very difficult. Realistically, we don't check the whole thing - we use the last 3-5 characters plus context as a poor man's hash. Also, families log in with their email addresses, but those aren't shown directly to researchers, so a researcher who gets an email from a family generally won't be able to identify their records for them.
Acceptance criteria: Researchers use reasonably short, human-readable ID codes for session, child, demographic, and user records. These IDs are specific to the researcher's organization rather than uniquely associated with that user/child/data, so that if two researchers publish data including IDs a reader cannot link the data in an unanticipated way. A family can indicate which account is theirs when contacting researchers directly, e.g. to ask a question about or report a problem with a study.
Implementation notes/Suggestions: Currently, researchers are just not allowed to publish the actual IDs and are supposed to generate their own random IDs for publication purposes, but that duplicates a bunch of work (and is subject to human error). We should consider whether to implement this by actually generating IDs per researcher-datum (or, more likely, organization-datum) pair, or whether to provide random IDs as a convenience while also giving researchers access to the raw IDs (to avoid making future collaborations etc. unnecessarily difficult). I.e., this could be handled at the level of the data structure or at the level of the UI and norms for interacting with the data. To allow parents to indicate which account is theirs, we could provide parents with a contact box on Lookit or show their IDs on their account or past-studies page (depending on whether IDs are researcher-specific).
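If we go the generated-ID route, a deterministic per-organization code is straightforward; a sketch where each organization holds a secret key, so codes are stable within an org but can't be linked across orgs:

```python
import hashlib
import hmac


def short_id(org_secret: bytes, record_uuid: str, length: int = 8) -> str:
    """Stable, human-checkable per-org code, e.g. 'a3f9c21b'."""
    return hmac.new(org_secret, record_uuid.encode(), hashlib.sha256).hexdigest()[:length]
```

Codes this short can eventually collide, so a uniqueness check with a longer fallback would be worth adding.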
Consider using https://github.com/celery/django-celery-beat so we do not have to mount a "special" persistent disk for the database file beat generates by default.
http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html#using-custom-scheduler-classes - Item #4
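The switch itself is small, assuming the usual namespace='CELERY' Django integration:

```python
# settings.py
INSTALLED_APPS = [
    # ...
    "django_celery_beat",
]

# Keep the periodic-task schedule in the database rather than beat's
# default local shelve file, so no persistent disk is needed.
CELERY_BEAT_SCHEDULER = "django_celery_beat.schedulers:DatabaseScheduler"
```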
Nice fake creds, wouldn't it be better if these were project settings?
Change subprocess.call -> subprocess.check_call and catch errors so things don't fail silently and mysteriously.
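E.g., around the build step (the command shown is illustrative):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

try:
    # check_call raises CalledProcessError on a nonzero exit status,
    # where subprocess.call would silently return it.
    subprocess.check_call(["ember", "build", "--environment", "production"])
except subprocess.CalledProcessError:
    logger.exception("Build step failed")
    raise
```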
Some records of related problems are in the COS Lookit slack channel.