
archivers.space's People

Contributors

b5, chaibapchya, dcwalk, kmcculloch, patcon

archivers.space's Issues

add "Data Rescue Twin Cities"

Can you add an option for "Data Rescue Twin Cities" to the list you use to generate invites? We need to start adding guides for the event this weekend. Also, can you put it on the list of events? Friday 11-6, Saturday 10-6. Thanks!

Harvest_url instead of URL

When harvesters upload things through the browser (I haven't checked whether this happens via S3), the "Harvest URL / Location" field gets filled with the literal string "harvest_url". I've had to go into S3, copy the upload URL, and then paste that into the "Bag" section.

Add hashes to json file

From @librlaurie on February 10, 2017 13:43

Again, not a super high priority, but I told the IPFS folks that I'd be into adding hashes to the records for these in their public display, and I would indeed. It seems like something that could be done whenever bagging happens. For the future, but worth doing.
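A rough sketch of what computing hashes at bagging time could look like (Python; the "hashes" key and file layout are assumptions, not the app's actual record schema):

    import hashlib
    import json
    from pathlib import Path

    def file_sha256(path: Path) -> str:
        """Stream a file through SHA-256 so large harvests don't have to fit in memory."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def add_hashes_to_record(metadata_path: Path, data_dir: Path) -> None:
        """Write a {filename: sha256} map into the record's JSON file at bagging time.
        The "hashes" key is hypothetical -- the real schema may differ."""
        record = json.loads(metadata_path.read_text())
        record["hashes"] = {
            p.name: file_sha256(p) for p in sorted(data_dir.iterdir()) if p.is_file()
        }
        metadata_path.write_text(json.dumps(record, indent=2))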

Copied from original issue: b5/pipeline#57

Add view of all work completed at an event

It would be great to generate a summary of work done at an event, so attendees can enjoy the sense of accomplishment (which will hopefully motivate them to keep working on the project!) without organizers having to bug the app creators to pull the info from the backend.

Viewable either publicly (for all event participants) or just by all app users. Perhaps in the existing Event view?

Add a separate checkbox for bag validation

From @khdelphine on February 16, 2017 20:48

Add a separate checkbox for bag validation, saying: “Bag validated after downloading it back from S3?” This should appear before the "I certify" checkbox.
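For baggers who want to script the check this box asks about, a minimal sketch using the Library of Congress bagit-python library (the bag directory path is a placeholder):

    import bagit  # pip install bagit

    def bag_is_valid(bag_dir: str) -> bool:
        """Re-open a bag pulled back down from S3 and verify its manifests and checksums."""
        try:
            return bagit.Bag(bag_dir).is_valid()
        except bagit.BagError:
            return False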

Copied from original issue: b5/pipeline#71

display size in "bag" section

Volunteers who are bagging often don't know how big a dataset is before they download it, even if they look in the "Harvest" section (because it's difficult to estimate how large something will be in the "Research" stage). This results in attempts to download huge datasets that, depending on connection and computer speed, might take hours. Can you add size info to the "Bag" section so people know how big something will be before they download it? Thanks!

Add size measurements?

This is a small design suggestion for the "Harvest" section of the app. If you want users to strictly input MB values, perhaps after "Estimated Size in MB" you could add a note giving the unit relationships: 1 MB = 1000 KB; 1 GB = 1000 MB; 1 TB = 1000 GB. People may click on a link that ends up being multiple gigs.
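For illustration, a small helper that applies exactly those decimal relationships (a sketch, not something the app currently has):

    def human_size(num_bytes: float) -> str:
        """Format a byte count using decimal units: 1 MB = 1000 KB, 1 GB = 1000 MB, etc."""
        for unit in ("B", "KB", "MB", "GB"):
            if num_bytes < 1000:
                return f"{num_bytes:.1f} {unit}"
            num_bytes /= 1000
        return f"{num_bytes:.1f} TB"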

Flaky S3 uploads

Occasionally when trying to use keys generated by the app to upload large files to S3, you get permission denied errors.

Potential solutions:

  • behind-the-scenes direct upload from browsers to S3
    • the backend passes the browser a temp key when the user presses the download button (see the presigned-URL sketch after this list)
  • cloud-native development (no need to upload if it's all in the cloud)
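A minimal sketch of the presigned-URL idea, assuming boto3 on the backend and a placeholder bucket; the real app is Meteor, so this Python snippet only shows the shape of the approach:

    import boto3  # assumes server-side AWS credentials are configured

    def presigned_upload_url(bucket: str, key: str, expires_seconds: int = 3600) -> str:
        """Return a short-lived URL the browser can PUT a harvested file to directly."""
        s3 = boto3.client("s3")
        return s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_seconds,
        )

The browser then uploads straight to S3 with that URL, so the burner keys never have to leave the server.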

Meteor code security review

Need to work out a security review process in advance of releasing the code (or decide never to release it). I know nothing as yet about the Meteor community, so I don't know what resources there are for this.

Track changes (who/when) and expose in UI

If we want to go all in on this, it's probably a phase 2 matter, since it would involve collecting and displaying "meta-metadata"--e.g., not just the data in the fields, but a trail of when that data was changed and by whom. This would be a big effort and veer into "reinventing the wheel" territory--we'd essentially be engineering something like a wiki platform--so it might be better done by leveraging some other kind of backend system down the road. There's also a fair amount of anonymity built into our process--no required email address, anonymous submissions via the Chrome app, etc.--so this kind of audit trail may be of mixed value, particularly if most of our volunteers vanish after attending a single event.

That said, I'm sure it would help, when confronted with an ambiguous set of harvest or research notes, to know who made them so that you could ask for clarification. And if we are serious about the vision of building a robust metadata platform and encouraging domain experts (scientists, etc.) to use it going forward, then we need to consider how to authenticate people's contributions.

In the short term, there are a few things we could do that wouldn't be such a huge effort:

  1. We could keep and display a trail of everyone who checked out an item. This wouldn't necessarily show who did exactly what, but it would be a start (see the sketch after this list).
  2. We could encourage people to sign their comments using their slack handles or some such, or provide dedicated fields for the same
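A sketch of what option 1 could look like as data (field names here are hypothetical, not the app's schema):

    from datetime import datetime, timezone

    def record_checkout(url_doc: dict, user_id: str) -> None:
        """Append a (who, when) entry to a hypothetical checkout_trail field on the URL record."""
        url_doc.setdefault("checkout_trail", []).append({
            "user": user_id,
            "at": datetime.now(timezone.utc).isoformat(),
        })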

Let all users see all data

I see the rationale for only letting users with, say, the "bagger" role edit the bag section, but is there a good reason not to let all users see all the data fields, particularly since work done upstream will auto-populate some of them? Seems like this would help people to understand their place in the process, and might help with validation/UI issues like #42

Clarify "can it be crawled?" language under Research

From @dcwalk on February 12, 2017 2:21

From #54:

remove the "can it be crawled?" language, and make this flag an explicit Do not harvest option, meaning this url contains no harvestable information, and should be ignored.

Copied from original issue: b5/pipeline#64

Harvest step: encourage the entry of alternate entry points

From @khdelphine on February 9, 2017 20:14

“If this will be handed off to someone else to harvest, pass on any useful info here.” → add “, including any alternate URLs people should use as entry points.”
Or alternately have a separate field for people to enter harvestable URLs (like we had in the spreadsheet)

Copied from original issue: b5/pipeline#53

Add Proper Checker Phase

From @b5 on February 10, 2017 21:45

Need to pull checker aspects out of the bagger phase and provide a proper method for bouncing archives that need improvement back out for re-harvesting & association.

Copied from original issue: b5/pipeline#59

Compare UUIDs in Harvest phase in app w/ UUIDs of datasets already uploaded

It looks like at least some URLs that may have already been harvested are still in Harvest phase in the app, since the user did not click the checkbox next to Harvest. Since this step was not included in previous event workflow documentation, it may be a widespread issue, so it may make sense to programmatically compare UUIDs of datasets already uploaded with UUIDs still in Harvest phase in app (and change status in app for any UUIDs with uploaded datasets).

Example: http://www.archivers.space/urls/F68DCA69-4377-40DA-B576-7D3C88CC6C2A

Harvest notes: "Over 6,000 files totaling 82 GB. Largest file is 12 GB, which is a massive orthographic mosaic tif. Zip file of 62 GB was uploaded via AWS token, appears to have completed successfully at 5:26 PM, though this site does not seem to acknowledge it."

This may explain why there are relatively few URLs in post-harvest phases in the app, despite the many recent events.
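A sketch of the programmatic comparison described above, assuming uploads land in an S3 bucket with the UUID as the key prefix (bucket name and key layout are assumptions):

    import boto3

    def uploaded_but_still_in_harvest(harvest_uuids: set, bucket: str) -> set:
        """UUIDs that already have an object in S3 but are still marked as Harvest in the app."""
        s3 = boto3.client("s3")
        uploaded = set()
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                uploaded.add(obj["Key"].split("/")[0])  # assumes keys look like "<UUID>/<filename>"
        return harvest_uuids & uploaded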

Bagger role specs

From @librlaurie on February 2, 2017 22:5

Bagger should confirm that they've checked the contents, should create bags, and should upload bags to new location, and indicate location.

Baggers need to apply for credentials to take on the bagger process. Event organizers should be able to grant bagger privileges (to librarians or information professionals) after review. Baggers need to give email addresses and have jobs in the information, data science, digital preservation, library, or archival professions.

Copied from original issue: b5/pipeline#26

Checker role specs

From @librlaurie on February 2, 2017 19:25

Checker role:
Answers three questions:

  1. Does the data match what's on the website (and how did you test that?)
  2. Are all of the files there that need to be?
  3. What changes did you make? (record to json)

Need a way to re-upload to the harvester bucket for changes.

People should not get checker role without having approval from someone further up the chain. An organizer should sign off on checker credentials.

Copied from original issue: b5/pipeline#25

Add names of datasets

From @librlaurie on February 10, 2017 13:2

I may be missing something key, but while obviously everything has to have a UUID that we can see, would it be desirable to use the name of the dataset rather than the UUID in the main tracking screens to show the list? I'm pretty sure the name does come through the Chrome app, as we used it in all the spreadsheets. There may be harvesters who are more moved to harvest data based on its content, I'm assuming, and the URL isn't all the info we have about that.

Copied from original issue: b5/pipeline#56

Discussion: What would it be like to manage all this through GitHub repositories?

This is a question for long-term discussion, decoupled from the ongoing useful efforts to make the app more usable over the next 2-3 months.

The conda-forge project (which I have contributed to) manages a community of volunteers who adopt software packages they care about and collaboratively create and maintain scripts for building binaries for those packages. The scripts that they write are automatically executed using free CI services, and the resultant artifacts are uploaded to a common public site. Each software package is assigned a separate repo with a tiny subcommunity of users who follow notifications and perform maintenance.

This leads to a lot of repos, so conda-forge uses custom bots and the GH API to impose additional structure, keeping things organized and as automated as possible.

I see an analogy to our community of volunteers: we intend to adopt subdomains or sub-sections of subdomains, collaboratively write and maintain scripts that capture their data, execute those scripts on a server, and upload the results. Once the Bagging phase takes place on a remote server, the task of our archivers.space app will be reduced to Research/Checking and uploading a harvesting script to a server. These sound like tasks that could be managed with GH labels, milestones, and comments, with harvesting scripts coming in through pull requests.

Maybe conda-forge's model could work for us. What do you think? In what ways are our needs similar and different?

Make stage complete checkboxes more obvious

The mark-complete checkboxes are easy to overlook.

I didn't know we were supposed to check them to signal a stage was done, and just checked the URL back in. Judging by other tickets in the pipeline, others are making the same mistake; they have notes about what they harvested, but the items are still open.

Perhaps label the checkbox, or put the checkbox beside the section header.

include a command-line invocation when generating an AWS token for harvesting

From @titaniumbones on February 11, 2017 7:28

When an AWS burner token is generated (for upload after harvesting), would it be possible to also include a script or one-liner that a user could run trivially on a remote VM? Asking mostly for those of us who have no AWS experience. Even if not downloadable, such a script could at least be documented and put in the harvesting-tools.
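A sketch of the one-liner the app could emit alongside the burner token (bucket name and key layout are assumptions; the environment variables and `aws s3 cp` syntax are standard AWS CLI):

    def upload_one_liner(access_key: str, secret_key: str, session_token: str,
                         bucket: str, uuid: str, filename: str) -> str:
        """Build a copy-pasteable command a harvester can run on any VM with the AWS CLI installed."""
        return (
            f"AWS_ACCESS_KEY_ID={access_key} "
            f"AWS_SECRET_ACCESS_KEY={secret_key} "
            f"AWS_SESSION_TOKEN={session_token} "
            f"aws s3 cp {filename} s3://{bucket}/{uuid}/{filename}"
        )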

Copied from original issue: b5/pipeline#62

Explain Link URLs box

It isn't clear how the Link Related URLs box works.
In particular:

  • it's not clear that it's a search - I saw people pasting in Harvest Pipeline and page URLs
  • it's not clear that it searches for words in the page URL
  • you have to click a search result to add it

Run archivers.space on SSL?

This is really more of a programmatic decision than a programming one--who buys the certificate? who provides hosting? etc.--but if we're concerned about security we should at least discuss it.

Describer role specs

From @librlaurie on February 2, 2017 22:7

Describer should open the bag, look at the JSON file to confirm it, create metadata in CKAN (as a CKAN admin), and mark the item complete.
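For reference, creating a CKAN dataset from the bag's metadata could look roughly like this (the CKAN URL, API key, and metadata mapping are placeholders; `package_create` is CKAN's standard action API):

    import requests

    def create_ckan_dataset(ckan_url: str, api_key: str, metadata: dict) -> dict:
        """POST the describer's metadata to CKAN's package_create action."""
        resp = requests.post(
            f"{ckan_url}/api/3/action/package_create",
            json=metadata,
            headers={"Authorization": api_key},
        )
        resp.raise_for_status()
        return resp.json()["result"]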

Describers should apply for credentials and be approved by Data Rescue organizers. Same application as baggers, but event organizers can't grant describer status. They need a CKAN username and password from a very small list of people who've agreed on the credentials for CKAN.

Copied from original issue: b5/pipeline#27

Map out possible paths for a URL through the app, including rejection and review, and enhance the UI to make routing more transparent

The UI of a URL page really only implements a "happy path" to take an item from seeding through to bagging and description. It's possible to route items back to an earlier stage by unchecking the stage completion box, and it's possible to send an item to "Crawlable" by checking "Do Not Harvest," but these non-standard routings are not intuitive.

I could imagine a UI that's structured around choices at a given step. So a researcher might be asked to open a page, review the information provided by the seeder, and then choose one of three options:

  1. This item is crawlable, skip harvesting [which would populate the do not harvest? boolean]
  2. This item needs to be harvested [which would unlock the fields to put in harvest recommendations --> then the user would see a "Submit for Harvesting" button, which would populate the researched? boolean]
  3. I'm not sure/This item needs review [which would unlock a comment field and put the item into the "Needs Review" state requested in #33]

I'm not proposing any data model changes: we already have the boolean fields needed to represent different paths (or those fields would come along with features that are proposed in separate tickets, like 33 above). I'm proposing streamlining the UI so that users can understand where a ticket has been and where it's going without having to check boxes.
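A sketch of how the three choices could map onto the existing booleans (field names are guesses at the schema, not confirmed):

    # Hypothetical field names; the app's actual boolean fields may differ.
    ROUTING_CHOICES = {
        "crawlable":     {"do_not_harvest": True},   # option 1: skip harvesting
        "needs_harvest": {"researched": True},       # option 2: submit for harvesting
        "needs_review":  {"needs_review": True},     # option 3: the state requested in #33
    }

    def apply_research_choice(url_doc: dict, choice: str, comment: str = "") -> None:
        """Translate a researcher's one-click choice into the existing boolean fields."""
        url_doc.update(ROUTING_CHOICES[choice])
        if comment:
            url_doc.setdefault("review_comments", []).append(comment)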

Have a button to flag "problem URLs"

From @khdelphine on February 16, 2017 12:21

Have a button to flag "problem URLs", so that they get quarantined into a separate list and could be easily reviewed by an expert/admin. The button could be actioned regardless of the specific step the URL is in.

Copied from original issue: b5/pipeline#70

Automatically seed submitted URLs to the IA

From @dcwalk on February 12, 2017 2:22

From #54:

All URLs that are submitted to the app are automatically seeded to the Internet Archive (this is not currently the case, but I think it should be, for preservation & comparison purposes).
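One lightweight way to do this would be to hit the Wayback Machine's Save Page Now endpoint whenever a URL is submitted (a sketch; a production version might use IA's authenticated APIs instead):

    import requests

    def seed_to_internet_archive(url: str) -> bool:
        """Ask the Wayback Machine to capture a newly submitted URL."""
        resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
        return resp.ok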

Copied from original issue: b5/pipeline#65

Feed pipeline directly from chrome extension

From @titaniumbones on February 5, 2017 17:34

If we could feed the pipeline directly from the Chrome extension, seeders at events could provide dataset URLs for use on that day, even if @b5 and @danielballan are not physically present. So... that seems like a very substantial improvement. Here are some to-do items for that -- happy to add them to the README if the feature seems attainable/worth pressing towards.

  • implement a GET interface that accepts an entry from the Chrome extension & creates a new MongoDB record (see the sketch after this list)
  • implement a security model for dealing w/ extension users -- either by marking URLs as chrome-seeded, or by adding authentication to the extension
  • decide whether we want the app to handle all nominations, even URLs going only to the IA. If so, figure out how to manage the IA seeds. If not, decide on criteria for sending a URL to one storage location, the other, or both.
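The app itself is Meteor, but as a shape-of-the-thing sketch, the nomination endpoint from the first item could look like this (Python/Flask with placeholder database and collection names, purely illustrative):

    from flask import Flask, jsonify, request
    from pymongo import MongoClient

    app = Flask(__name__)
    urls = MongoClient("mongodb://localhost:27017")["archivers"]["urls"]  # placeholder names

    @app.route("/api/nominate", methods=["GET"])
    def nominate():
        """Accept a URL nominated by the Chrome extension and create a new record for it."""
        url = request.args.get("url")
        if not url:
            return jsonify({"error": "missing url"}), 400
        urls.insert_one({"url": url, "source": "chrome-extension", "phase": "seed"})
        return jsonify({"ok": True}), 201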

I'm sleepy! hope y'all are getting some rest!

Copied from original issue: b5/pipeline#40

Add Describers section?

From @khdelphine on February 9, 2017 20:4

Add a Describe step before Done?
It seems to me that it would not be very complicated to do: just add the metadata fields listed here (in Col B) for the Describers to fill out.
If we do that, the Done section could provide pretty much the same view/functionality as CKAN (at least as a short term solution). What do you think?

Copied from original issue: b5/pipeline#49

Need to associate users with events

From @b5 on February 2, 2017 13:38

I've modeled the event-to-url association backwards. We should get users to check off that they're attending an event, and add that to some sort of user-events join collection (or maybe onto the user collection).

If we have a better association between users and events, we'll know which users were at which event (wildly useful) and can derive a URL's event from the user that last updated it. This will also be very useful in tracking down who attended or onboarded at which event.
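A sketch of the join-collection idea (collection and field names are placeholders):

    from datetime import datetime, timezone
    from pymongo.collection import Collection

    def record_event_attendance(user_events: Collection, user_id: str, event_id: str) -> None:
        """Upsert one row per (user, event) when a user checks off that they're attending."""
        user_events.update_one(
            {"user_id": user_id, "event_id": event_id},
            {"$setOnInsert": {"joined_at": datetime.now(timezone.utc)}},
            upsert=True,
        )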

Copied from original issue: b5/pipeline#18
