edgi-govdata-archiving / archivers.space
🗄 Event data management app used at DataRescues
Home Page: https://www.archivers.space/
License: GNU Affero General Public License v3.0
Can you add an option for "Data Rescue Twin Cities" to the list you use to generate invites? We need to start adding guides for the event this weekend. Also, can you put it on the list of events? Friday 11-6, Saturday 10-6. Thanks!
When harvesters upload things through the browser (I haven't checked whether this also happens via S3), the "Harvest URL / Location" field gets filled with the literal string "harvest_url". I've had to go into S3, copy the upload URL, and then paste that into the "Bag" section.
From @librlaurie on February 10, 2017 13:43
Again, not a super high priority, but I told the IPFS folks that I'd be into adding hashes to the records for these in their public display, and I would indeed. It seems like something that could be done whenever bagging happens. For the future, but worth doing.
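A minimal sketch of what "whenever bagging happens" could look like, assuming we hash the finished bag file server-side; the collection and field names below are hypothetical, and a proper IPFS multihash/CID would just wrap this digest:

```js
// Hedged sketch, not the app's actual bagging code: stream the finished
// bag through SHA-256 (the digest IPFS multihashes wrap by default) and
// store it on the record. `Urls` and `sha256` are hypothetical names.
const crypto = require('crypto');
const fs = require('fs');

function hashBag(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(path)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

// usage: hashBag('/tmp/bag.zip').then((digest) =>
//   Urls.update(urlId, { $set: { sha256: digest } }));
```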
Copied from original issue: b5/pipeline#57
Would be great to generate a summary of work done at an event, so attendees can enjoy the sense of accomplishment (which will hopefully motivate them to keep working on the project!), without organizers having to bug app creators to pull the info from the backend.
Viewable either publicly (for all event participants) or just by all app users. Perhaps in the existing Event view?
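Data-wise this could be cheap. A sketch of the counts that might back such a view; the collection and field names here are guesses at the schema, not the app's actual fields:

```js
// Hypothetical counts backing an event-summary view. `Urls`, `events`,
// and the *.complete flags are assumed names, not the real schema.
function eventSummary(eventId) {
  return {
    seeded: Urls.find({ events: eventId }).count(),
    harvested: Urls.find({ events: eventId, 'harvest.complete': true }).count(),
    bagged: Urls.find({ events: eventId, 'bag.complete': true }).count(),
    described: Urls.find({ events: eventId, 'describe.complete': true }).count(),
  };
}
```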
From @khdelphine on February 16, 2017 20:48
Add a separate checkbox for bag validation, saying: “Bag validated after downloading it back from S3?” This should appear before the "I certify" checkbox.
Copied from original issue: b5/pipeline#71
From @dcwalk on February 9, 2017 21:13
Before we make it public can we add a license and contributor guidelines to the repo?
Copied from original issue: b5/pipeline#55
From @dcwalk on February 1, 2017 4:20
All I'm seeing is the agency/event title, e.g. http://harvest-pipeline.herokuapp.com/agencies/Q9qF7gxvf5jTDa4aT is "ENVIRONMENTAL PROTECTION AGENCY EPA"
Do we want or need these to be clickable from the summary pages for now?
http://harvest-pipeline.herokuapp.com/agencies
http://harvest-pipeline.herokuapp.com/events
Copied from original issue: b5/pipeline#12
Volunteers who are bagging often don't know how big a dataset is before they download it, even if they look in the "Harvest" section (because it's difficult to estimate how large something will be in the "Research" stage). This results in attempts to download huge datasets, which, depending on connection and computer speed, might take hours. Can you add size info to the "Bag" section so people know how big something will be before they download it? Thanks!
This is a small design suggestion for the "Harvest" section of the app. If you want users to strictly input MB values, perhaps after "Estimated Size in MB" you could add a bit that gives some relationships for size: 1 MB = 1000 KB; 1 GB = 1000 MB; 1 TB = 1000 GB. People may click on a link that ends up being multiple gigs.
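If the hint ever graduates into code, a tiny helper could also render a friendlier size from the MB value users enter. Illustrative sketch only, using the decimal relationships above:

```js
// Illustrative helper: format an "Estimated Size in MB" value using
// decimal units (1 GB = 1000 MB, 1 TB = 1000 GB).
function humanSize(mb) {
  if (mb >= 1000000) return (mb / 1000000).toFixed(1) + ' TB';
  if (mb >= 1000) return (mb / 1000).toFixed(1) + ' GB';
  return mb + ' MB';
}

humanSize(250);     // "250 MB"
humanSize(1500);    // "1.5 GB"
humanSize(2500000); // "2.5 TB"
```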
With code from https://github.com/b5/pipeline/pull/74 in place, running "meteor" to start the server throws errors if process.env can't find GOOGLE_SERVICE_CLIENT_EMAIL or GOOGLE_SERVICE_PRIVATE_KEY.
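One possible softening, sketched here rather than taken from the PR: check for the variables at startup and disable the Google integration with a warning instead of crashing the server.

```js
// Sketch (not the PR's actual code): warn and disable the integration
// when the service credentials are absent, rather than throwing.
const required = ['GOOGLE_SERVICE_CLIENT_EMAIL', 'GOOGLE_SERVICE_PRIVATE_KEY'];
const missing = required.filter((name) => !process.env[name]);

if (missing.length > 0) {
  console.warn('Google service integration disabled; missing: ' + missing.join(', '));
} else {
  // safe to initialize the Google client here
}
```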
From @dcwalk on February 1, 2017 3:58
See edgi-govdata-archiving/presidential-harvest-nomination-tool#7.
Copied from original issue: b5/pipeline#9
Occasionally when trying to use keys generated by the app to upload large files to S3, you get permission denied errors.
Potential solutions:
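One guess worth checking (an assumption, since I haven't dug into the token-generation code): large files go up via S3 multipart upload, and a policy scoped only to s3:PutObject can fail when the client tries to list or abort parts. A hedged sketch of a broader burner-token policy, with a made-up bucket and key layout:

```js
// Hedged sketch of a burner-token policy that covers multipart uploads.
// SDKs split large files into parts, and listing or aborting those parts
// fails with AccessDenied if the token only grants s3:PutObject.
// Bucket name and key layout here are hypothetical.
function uploadPolicy(uuid) {
  return JSON.stringify({
    Version: '2012-10-17',
    Statement: [{
      Effect: 'Allow',
      Action: [
        's3:PutObject',
        's3:AbortMultipartUpload',
        's3:ListMultipartUploadParts',
      ],
      Resource: 'arn:aws:s3:::example-harvest-bucket/' + uuid + '/*',
    }],
  });
}
```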
Need to work out a security review process in advance of releasing the code (or decide never to do it). I know nothing as yet about the Meteor community, so I don't know what resources there are for this.
We moved the FAQ (see https://datarefuge.github.io/workflow/faq/), could that link in the nav be updated?
I can reproduce an issue reported on Slack: clicking the "Download Zip Starter" button does nothing for this particular URL.
http://www.archivers.space/urls/40B56E61-11C5-42E9-B440-49E5B164A80C
The resource was checked out and the Research phase was marked complete.
If we want to go all in on this, it's probably a phase 2 matter, since it would involve collecting and displaying "meta-metadata"--e.g., not just the data in the fields, but a trail of when that data was changed and by whom. This would be a big effort and veer into "reinventing the wheel" territory--we'd essentially be engineering something like a wiki platform--so it might be better done by leveraging some other kind of back-end system down the road. There's also a fair amount of anonymity built into our process--no required email address, anonymous submissions via the Chrome app, etc.--so this kind of audit trail may be of mixed value, particularly if most of our volunteers vanish after attending a single event.
That said, I'm sure it would help, when confronted with an ambiguous set of harvest or research notes, to know who made them so that you could ask for clarification. And if we are serious about the vision of building a robust metadata platform and encouraging domain experts (scientists, etc.) to use it going forward, then we need to consider how to authenticate people's contributions.
In the short term, there are a few things we could do that wouldn't be such a huge effort:
I see the rationale for only letting users with, say, the "bagger" role edit the bag section, but is there a good reason not to let all users see all the data fields, particularly since work done upstream will auto-populate some of them? Seems like this would help people to understand their place in the process, and might help with validation/UI issues like #42
From @dcwalk on February 1, 2017 3:41
cough I'll just leave this here.
Copied from original issue: b5/pipeline#5
From @khdelphine on February 9, 2017 20:14
“If this will be handed off to someone else to harvest, pass on any useful info here.” → add “, including any alternate URLs people should use as entry points.”
Or, alternatively, have a separate field for people to enter harvestable URLs (like we had in the spreadsheet).
Copied from original issue: b5/pipeline#53
From @b5 on February 2, 2017 13:40
Just looks bad to all our friends who start to poke around with the app :)
Copied from original issue: b5/pipeline#19
From @b5 on February 10, 2017 21:45
Need to pull checker aspects out of the bagger phase and provide a proper method for bouncing archives that need improvement back out for re-harvesting & association.
Copied from original issue: b5/pipeline#59
From @danielballan on February 4, 2017 17:07
... or at least make it more obvious that the pane is expandable. Via Lou on Slack.
Copied from original issue: b5/pipeline#38
From @khdelphine on February 16, 2017 12:11
Could we automatically check back in URLs after they have been checked out for two days?
And perhaps add a comment in the note field like “This URL was automatically checked back in after two days so that it can be processed expeditiously.”
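A minimal server-side sketch of that sweep; the checked_out_at / checked_out_by field names are assumptions about the schema, not the app's actual fields:

```js
import { Meteor } from 'meteor/meteor';
// `Urls`, `checked_out_at`, and `checked_out_by` are assumed names.

const TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000;

// Hourly sweep: release anything checked out more than two days ago
// and leave the suggested note behind.
Meteor.setInterval(() => {
  const cutoff = new Date(Date.now() - TWO_DAYS_MS);
  Urls.update(
    { checked_out_at: { $lt: cutoff } },
    {
      $unset: { checked_out_by: '', checked_out_at: '' },
      $push: { notes: 'This URL was automatically checked back in after two days so that it can be processed expeditiously.' },
    },
    { multi: true }
  );
}, 60 * 60 * 1000);
```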
Copied from original issue: b5/pipeline#69
It looks like at least some URLs that may have already been harvested are still in the Harvest phase in the app, because the user did not click the checkbox next to Harvest. Since this step was not included in previous event workflow documentation, it may be a widespread issue, so it may make sense to programmatically compare the UUIDs of datasets already uploaded with the UUIDs still in the Harvest phase in the app, and change the status in the app for any UUID with an uploaded dataset (a sketch follows below).
Example: http://www.archivers.space/urls/F68DCA69-4377-40DA-B576-7D3C88CC6C2A
Harvest notes: "Over 6,000 files totaling 82 GB. Largest file is 12 GB, which is a massive orthographic mosaic tif. Zip file of 62 GB was uploaded via AWS token, appears to have completed successfully at 5:26 PM, though this site does not seem to acknowledge it."
This may explain why there are relatively few URLs in post-harvest phases in the app, despite the many recent events.
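A sketch of the comparison half, assuming S3 keys begin with the dataset UUID (the real bucket layout may differ):

```js
const AWS = require('aws-sdk');

// Collect the UUIDs of everything already uploaded, assuming keys are
// laid out as "<uuid>/<filename>" (adjust to the real bucket layout).
async function uploadedUuids(bucket) {
  const s3 = new AWS.S3();
  const uuids = new Set();
  let token;
  do {
    const page = await s3
      .listObjectsV2({ Bucket: bucket, ContinuationToken: token })
      .promise();
    page.Contents.forEach(({ Key }) => uuids.add(Key.split('/')[0]));
    token = page.NextContinuationToken;
  } while (token);
  return uuids;
}

// Any UUID returned here that is still in the Harvest phase in the app
// is a candidate for being programmatically marked complete.
```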
From @librlaurie on February 2, 2017 22:05
The bagger should confirm that they've checked the contents, create bags, upload the bags to the new location, and indicate that location.
Baggers need to apply for credentials to take on the bagger process. Event organizers should be able to grant bagger privileges (to librarians or information professionals) after review. Baggers need to provide email addresses and hold jobs in the information, data science, digital preservation, library, or archival professions.
Copied from original issue: b5/pipeline#26
From @librlaurie on February 2, 2017 19:25
Checker role:
Answers three questions:
Needs re-upload access to the harvester bucket for changes
People should not get checker role without having approval from someone further up the chain. An organizer should sign off on checker credentials.
Copied from original issue: b5/pipeline#25
From @librlaurie on February 10, 2017 13:02
I may be missing something key, but while obviously everything has to have a UUID that we can see, would it be desirable to use the name of the dataset rather than the UUID in the main tracking screens when showing the list? I'm pretty sure the name does come through the Chrome app, as we used it in all the spreadsheets. There may be harvesters who are more moved to harvest data based on its content, I'm assuming, and the URL isn't all the info we have about that.
Copied from original issue: b5/pipeline#56
From @khdelphine on February 9, 2017 20:12
Copied from original issue: b5/pipeline#51
From @khdelphine on February 9, 2017 20:02
Also remove “Bag Url / Location” → Not useful, right? Or am I missing something?
I mean, why not do the upload directly inside the app, as we do for the Harvesters?
Copied from original issue: b5/pipeline#47
From @khdelphine on February 14, 2017 14:02
A number of URLs that appear ready to be bagged do not have the Zip URL under “Harvest URL/Location”. For instance: http://www.archivers.space/urls/0C87975E-C222-4BD2-8516-B4E623EB67CB
Copied from original issue: b5/pipeline#67
This is a question for long-term discussion, decoupled from the ongoing useful efforts to make the app more usable over the next 2-3 months.
The conda-forge project (which I have contributed to) manages a community of volunteers who adopt software packages they care about and collaboratively create and maintain scripts for building binaries for those packages. The scripts that they write are automatically executed using free CI services, and the resultant artifacts are uploaded to a common public site. Each software package is assigned a separate repo with a tiny subcommunity of users who follow notifications and perform maintenance.
This leads to a lot of repos, so conda-forge uses custom bots and the GH API to impose additional structure, keeping things organized and as automated as possible.
I see an analogy to our community of volunteers: we intend to adopt subdomains or sub-sections of subdomains, collaboratively write and maintain scripts that capture their data, execute those scripts on a server, and upload the results. Once the Bagging phase takes place on a remote server, the task of our archivers.space app will be reduced to Research/Checking and uploading a harvesting script to a server. These sound like tasks that could be managed with GH labels, milestones, and comments, with harvesting scripts coming in through pull requests.
Maybe conda-forge's model could work for us. What do you think? In what ways are our needs similar and different?
The mark-complete checkboxes are easy to overlook.
I didn't know we were supposed to check them to signal a stage was done, and just checked the URL back in. Judging by other tickets in the pipeline, others are making the same mistake: their URLs have notes about what was harvested but are still open.
Perhaps label the checkbox, or put the checkbox beside the section header.
From @khdelphine on February 10, 2017 15:21
http://www.archivers.space/urls?phase=crawlable
Copied from original issue: b5/pipeline#58
From @titaniumbones on February 11, 2017 7:28
When an AWS burner token is generated (for upload after harvesting), would it be possible to also include a script or one-liner that a user could run trivially on a remote VM? Asking mostly for those of us who have no AWS experience. Even if not downloadable, such a script could at least be documented and put in the harvesting-tools.
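For instance, the app could emit something like the following alongside the token. This is only a sketch: the bucket name is hypothetical, and it assumes the burner token's credentials get exported as the standard AWS environment variables.

```js
// upload.js -- usage: node upload.js <file> <key>
// Sketch of a generated helper; assumes the burner token's credentials
// are exported as the standard AWS env vars. Bucket name is made up.
const AWS = require('aws-sdk');
const fs = require('fs');

const s3 = new AWS.S3({
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  sessionToken: process.env.AWS_SESSION_TOKEN, // burner tokens are temporary
});

const [file, key] = process.argv.slice(2);

s3.upload({ Bucket: 'example-harvest-bucket', Key: key, Body: fs.createReadStream(file) })
  .promise()
  .then(() => console.log('uploaded', key))
  .catch((err) => { console.error(err.message); process.exit(1); });
```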
Copied from original issue: b5/pipeline#62
It isn't clear how the Link Related URLs box works.
In particular:
From @titaniumbones on February 11, 2017 7:20
Is it possible to filter out the Ann Arbor questionables from the actively-displayed records for this weekend? If so, that would be a big help to the SF event, I think.
Copied from original issue: b5/pipeline#60
This is really more of a programmatic decision than a programming one--who buys the certificate? who provides hosting? etc.--but if we're concerned about security we should at least discuss it.
From @librlaurie on February 2, 2017 22:07
The describer should open the bag, look at the JSON file to confirm, create metadata in CKAN (as a CKAN admin), and mark it complete.
Describers should apply for credentials and be approved by Data Rescue organizers. Same application as baggers, but event organizers can't grant describer status. They need a CKAN username and password from a very small list of people who've agreed on the credentials for CKAN.
Copied from original issue: b5/pipeline#27
The UI of a URL page really only implements a "happy path" to take an item from seeding through to bagging and description. It's possible to route items back to an earlier stage by unchecking the stage completion box, and it's possible to send an item to "Crawlable" by checking "Do Not Harvest," but these non-standard routings are not intuitive.
I could imagine a UI that's structured around choices at a given step. So a researcher might be asked to open a page, review the information provided by the seeder, and then choose one of three options:
I'm not proposing any data model changes: we already have the boolean fields needed to represent different paths (or those fields would come along with features that are proposed in separate tickets, like 33 above). I'm proposing streamlining the UI so that users can understand where a ticket has been and where it's going without having to check boxes.
From @khdelphine on February 16, 2017 12:21
Have a button to flag "problem URLs", so that they get quarantined into a separate list and could be easily reviewed by an expert/admin. The button could be actioned regardless of the specific step the URL is in.
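Data-wise this looks small. A hypothetical sketch of what the button could call; the method and field names are made up, not the current schema:

```js
import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';

// Sketch of the method a "flag problem URL" button could call.
// `Urls`, `flagged`, and `flag_reason` are hypothetical names.
Meteor.methods({
  'urls.flag'(urlId, reason) {
    check(urlId, String);
    check(reason, String);
    Urls.update(urlId, { $set: { flagged: true, flag_reason: reason } });
  },
});

// The expert/admin review list is then just:
// Urls.find({ flagged: true })
```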
Copied from original issue: b5/pipeline#70
From @dcwalk on February 1, 2017 3:45
Small alignment issue in AgencyForm.js:
Copied from original issue: b5/pipeline#6
From @titaniumbones on February 5, 2017 17:34
If we could feed the pipeline directly from the chrome extension, seeders at events could provide dataset URLs for use on that day, even if @b5 and @danielballan are not physically present. So that seems like a very substantial improvement. Here are some todo items for that -- happy to add to the README if the feature seems attainable/worth pressing towards.
- GET interface that accepts an entry from the Chrome extension & creates a new MongoDB record (a hedged sketch follows)

I'm sleepy! hope y'all are getting some rest!
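A sketch of that interface using Meteor's built-in connect handlers; the route, parameter name, and record fields are all assumptions:

```js
import { Meteor } from 'meteor/meteor';
import { WebApp } from 'meteor/webapp';
import url from 'url';

// Sketch only: accept a seed via GET and create a record. The route,
// parameter name, and `Urls` fields are hypothetical. bindEnvironment
// lets the collection call run inside Meteor's environment.
WebApp.connectHandlers.use('/api/seed', Meteor.bindEnvironment((req, res) => {
  const { url: seedUrl } = url.parse(req.url, true).query;
  if (!seedUrl) {
    res.writeHead(400);
    return res.end('missing url parameter');
  }
  Urls.insert({ url: seedUrl, phase: 'research', created_at: new Date() });
  res.writeHead(200);
  res.end('ok');
}));
```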
Copied from original issue: b5/pipeline#40
From @khdelphine on February 9, 2017 20:06
Copied from original issue: b5/pipeline#50
From @titaniumbones on February 7, 2017 12:50
The Chrome extension already queries the IA database; it could pass a timestamp to the app if we can get to #40.
Copied from original issue: b5/pipeline#42
From @khdelphine on February 9, 2017 20:04
Add a Describe step before Done?
It seems to me that it would not be very complicated to do: just add the metadata fields listed here (in Col B) for the Describers to fill out.
If we do that, the Done section could provide pretty much the same view/functionality as CKAN (at least as a short term solution). What do you think?
Copied from original issue: b5/pipeline#49
From @b5 on February 2, 2017 13:38
I've modeled the event-to-url association backwards. We should get users to check off that they're attending an event, and add that to some sort of user-events join collection (or maybe store it on the user collection).
If we have a better association between users and events, we know which users were at which event (wildly useful), and we can derive a URL's event from the user who last updated it. This will also be very useful in tracking down who attended or onboarded at which event.
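A sketch of the forward association; every name here (collection, method, fields) is hypothetical:

```js
import { Mongo } from 'meteor/mongo';
import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';

// Hypothetical join collection recording event attendance per user.
const UserEvents = new Mongo.Collection('user_events');

Meteor.methods({
  'events.attend'(eventId) {
    check(eventId, String);
    UserEvents.upsert(
      { userId: this.userId, eventId },
      { $set: { joined_at: new Date() } }
    );
  },
});

// Deriving a URL's event then becomes a lookup on its last editor, e.g.
// UserEvents.findOne({ userId: url.last_edited_by }) // field name assumed
```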
Copied from original issue: b5/pipeline#18