archives's Issues

Metamaps

http://Metamaps.cc has been collecting some great datasets with their mindmap-style interface.

They are working on exposing these as JSON-LD: metamaps/metamaps#425 (comment) -- when this is ready, we should archive the maps and have at least one client-side viz running in IPFS as an interface to them.

Protein Data Bank

http://www.wwpdb.org/download/downloads

Since 1971, the Protein Data Bank archive (PDB) has served as the single repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies.

The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that the PDB is freely and publicly available to the global community.

cdnjs

Would love to have an IPFS-compatible fork of https://github.com/cdnjs/cdnjs serving files via IPFS. It's a super large repository, but I'll give it a try and develop the integration locally.

Website mirroring

TL;DR: People should be able to simply run:

ipfs-mirror http://example.com/

without having to worry about copyright violations, etc.

There are several open-access collections that could be archived by simply spidering their website, in the same way that Google Cache or IA's Wayback Machine does. Of course, this should only be performed for the portions of the website not disallowed by robots.txt.
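
To make that concrete, here is a minimal sketch of such a spider, using only the Python standard library and checking robots.txt before each fetch. The ipfs-mirror command above is hypothetical, and so is everything named below; a real tool would also need rate limiting, politeness delays, and proper handling of query strings, fragments, and non-HTML content. The resulting directory could then simply be published with ipfs add -r.

    # Sketch of a robots.txt-respecting mirror using only the standard library.
    import os
    import urllib.parse
    import urllib.request
    import urllib.robotparser
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def mirror(start_url, out_dir, user_agent="ipfs-mirror-sketch"):
        base = urllib.parse.urlparse(start_url)
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(f"{base.scheme}://{base.netloc}/robots.txt")
        robots.read()
        seen, queue = set(), [start_url]
        while queue:
            url = queue.pop(0)
            if url in seen or not robots.can_fetch(user_agent, url):
                continue  # skip anything already saved or disallowed by robots.txt
            seen.add(url)
            req = urllib.request.Request(url, headers={"User-Agent": user_agent})
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
            path = urllib.parse.urlparse(url).path.lstrip("/")
            if not path or path.endswith("/"):
                path += "index.html"
            dest = os.path.join(out_dir, path)
            os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
            with open(dest, "wb") as f:
                f.write(body)
            extractor = LinkExtractor()
            extractor.feed(body.decode("utf-8", errors="replace"))
            for link in extractor.links:
                absolute = urllib.parse.urljoin(url, link)
                if urllib.parse.urlparse(absolute).netloc == base.netloc:
                    queue.append(absolute)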

IANAL, but from what I can tell, this is all kosher so long as there's an appropriate procedure for opting out. According to this article (which links to this document), Google is safe because they allow webmasters to opt out via robots.txt, and also have a process for responding to DMCA takedown requests.

This is the policy that the Internet Archive follows:

Request by a webmaster of a private (non-governmental) web site, typically for reasons of privacy, defamation, or embarrassment.

  1. Archivists should provide a 'self-service' approach site owners can use to remove their materials based on the use of the robots.txt standard.
  2. Requesters may be asked to substantiate their claim of ownership by changing or adding a robots.txt file on their site.
  3. This allows archivists to ensure that material will no longer be gathered or made available.
  4. These requests will not be made public; however, archivists should retain copies of all removal requests.

Third party removal requests based on the Digital Millennium Copyright Act of 1998 (DMCA).

  1. Archivists should attempt to verify the validity of the claim by checking whether the original pages have been taken down, and if appropriate, requesting the ruling(s) regarding the original site.
  2. If the claim appears valid, archivists should comply.
  3. Archivists will strive to make DMCA requests public via Chilling Effects, and notify searchers when requested pages have been removed.
  4. Archivists will notify the webmaster of the affected site, generally via email.

Third party removal requests based on non-DMCA intellectual property claims (including trademark, trade secret).

  1. Archivists will attempt to verify the validity of the claim by checking whether the original pages have been taken down, and if appropriate, requesting the ruling(s) regarding the original site.
  2. If the original pages have been removed and the archivist has determined that removal from public servers is appropriate, then the archivists will remove the pages from their public servers.
  3. Archivists will strive to make these requests public via Chilling Effects, and notify searchers when requested pages have been removed.
  4. Archivists will notify the webmaster of the affected site, generally via email.

Third party removal requests based on objection to controversial content (e.g. political, religious, and other beliefs).
[...] archivists should not generally act on these requests.

Third party removal requests based on objection to disclosure of personal data provided in confidence.
[...] These requests are generally treated as requests by authors or publishers of original data.

Requests by governments.
Archivists will exercise best-efforts compliance with applicable court orders.

Other requests and grievances, including underlying rights issues, error correction and version control, and re-insertions of web sites based on change of ownership.
These are handled on a case by case basis by the archive and its advisors.

Anyway, it would be really helpful if IPFS had an official procedure regarding this (presumably gateway-dmca-denylist would be a part of this).

IRC logs

I don't know if there's a channel log already, but a bot creating one and storing it in IPFS looks like a great idea to me. The only issue is how to split messages into files:

Do we use a file for every day, containing all the messages of that day? Or do we divide them by the hour? Or some other way? By the day sounds best to me.
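
For what it's worth, the per-day split keeps the bot trivial. A minimal sketch of the logging side (the IRC connection itself, and periodically running ipfs add -r on the log directory, are left out):

    # Append each message to one file per channel per day (UTC); the resulting
    # directory can periodically be re-added to IPFS.
    import os
    from datetime import datetime, timezone

    def log_message(channel, nick, text, log_dir="logs"):
        now = datetime.now(timezone.utc)
        day_file = os.path.join(log_dir, channel.lstrip("#"),
                                now.strftime("%Y-%m-%d") + ".log")
        os.makedirs(os.path.dirname(day_file), exist_ok=True)
        with open(day_file, "a", encoding="utf-8") as f:
            f.write(f"{now.strftime('%H:%M:%S')} <{nick}> {text}\n")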

Search engine

Getting documents archived on IPFS is one thing, but we also need to be able to search through them. Given that these archives are eventually going to become too large to fit on a single machine, it's likely that the search index will need to also be distributed over the IPFS network (e.g. with each node providing an index of the contents of their local blockstore). Some possible ways this could be achieved:

  • the static way: each node stores their index in a trie-like structure distributed across multiple IPFS objects, in such a way that clients only need to download a small subset of the index for any given query
  • the dynamic way: queries are broadcast to the network, nodes perform the search on their local index, and provide the results back over IPFS
  • the magic way: somehow storing the index in the IPFS DHT?

Looking through the IRC logs, I've seen several people express interest in an IPFS search engine (@whyrusleeping @rht @jbenet @rschulman @zignig @kragen), but haven't been able to find any specific proposals. Perhaps we could coordinate here?
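
To make the "static" option above slightly more concrete, here is a hedged sketch (my own illustration, not an existing proposal) of a term index sharded by two-letter prefix, so a client only has to fetch the shard(s) relevant to its query terms; each shard and a small manifest would be added to IPFS as separate objects.

    # Build a prefix-sharded inverted index: one JSON shard per two-letter term
    # prefix, plus a manifest. A client resolves a term by fetching only the
    # shard for its prefix (read from local disk here for simplicity).
    import json
    import os
    import re
    from collections import defaultdict

    def build_index(documents, out_dir="index"):
        """documents: dict mapping a document name (or IPFS hash) to its text."""
        postings = defaultdict(set)
        for doc_id, text in documents.items():
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                postings[term].add(doc_id)
        shards = defaultdict(dict)
        for term, doc_ids in postings.items():
            shards[term[:2]][term] = sorted(doc_ids)
        os.makedirs(out_dir, exist_ok=True)
        manifest = {}
        for prefix, terms in shards.items():
            shard_file = f"{prefix}.json"
            with open(os.path.join(out_dir, shard_file), "w") as f:
                json.dump(terms, f)
            manifest[prefix] = shard_file
        with open(os.path.join(out_dir, "manifest.json"), "w") as f:
            json.dump(manifest, f)

    def lookup(term, out_dir="index"):
        shard_path = os.path.join(out_dir, term[:2] + ".json")
        if not os.path.exists(shard_path):
            return []
        with open(shard_path) as f:
            return json.load(f).get(term, [])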

CERN

http://opendata.cern.ch

Since the end of 2014, CERN has been serving some fraction of the colossal amount of data captured about particle collisions in the LHC (with detectors like CMS, ATLAS, ALICE) - summing up to 60,000,000 GB.

With the help of a small Python crawler, I've compiled an index of all CMS-detector primary datasets (all .root files, totaling ca. 27.4 TB), plus an index of indexes. Indexes for the other detectors' datasets + derivative datasets to come :)

  • CMS
    • Primary datasets from 2010 runs (27.4 TB)
      • Scrape data from CERN's OpenData (via cmspull.py; see the sketch after this list)
      • Compile an index of all primary dataset files (.root)
      • Somehow get those ~28 TB into IPFS (maybe in cooperation w/ CERN? - then the steps above are unnecessary)
  • ATLAS
  • ALICE
  • LHCb
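
For reference, the core of such an index-building step is small. The sketch below is hypothetical (the real cmspull.py may work quite differently) and assumes the OpenData record pages contain direct href links to the .root files:

    # Hypothetical index builder: scan OpenData record pages for links to
    # .root files and collect them into a JSON index.
    import json
    import re
    import urllib.request

    def index_root_files(record_urls, out_file="cms_index.json"):
        index = []
        for page_url in record_urls:
            with urllib.request.urlopen(page_url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            # crude: pick up anything that looks like a link to a .root file
            for href in re.findall(r'href="([^"]+\.root)"', html):
                index.append({"record": page_url, "file": href})
        with open(out_file, "w") as f:
            json.dump(index, f, indent=2)
        return index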

To use all that data, a special environment is required - CERN's OpenData normally recommends using their CernVM, which is basically Scientific Linux + ROOT, a data analysis framework (hence the .root files). Without ROOT, these historical milestones cannot be used as computable data directly - so the tool must be preserved/archived alongside the collision data. There's also a mirror of ROOT right here on GitHub.

Oh, and thanks for the amazing project!

CIA PDBs

If you follow news about the United States, you might know that the CIA has recently declassified and released a large number of "President's Daily Brief" documents (PDBs) from the 1960s.

https://www.cia.gov/library/publications/intelligence-history/presidents-daily-brief/index.html

I thought it would be fun to archive these in IPFS. You can find them here:

http://ipfs.io/ipns/QmbuG3dYjX5KjfAMaFQEPrRmTRkJupNUGRn1DXCgKK5ogD/archives/PDBs

My understanding is that these documents are a product of the US Government, and as such are not subject to copyright. I've not found any notices that contradict this, and because they are all declassified, I believe that distribution is unrestricted.

Archive package file

Following from #25

@eminence said:

For each archive, we need a standard way to record some metadata with the archive. At the moment, the most important thing to include is licensing information, but we may find other information that we would like to require.

This issue is to track the discussion on this topic. Below is a draft proposal, with two examples. All aspects of this proposal are open for discussion.

  • Metadata should be stored in a file called _Metadata.json. The name is designed so that it'll appear near the top of directory listings.
  • The JSON object is a dictionary with the following keys:
    • title -- Provides a name for the archive
    • description -- A more verbose description, if needed
    • source -- A list of URLs where this data came from
    • license -- An array of dictionaries listing the relevant licenses. Each has the following keys:
      • summary -- a brief summary of the license
      • source -- Where to find the license/legal terms in full
    • last_synched -- an ISO 8601 timestamp indicating the last time this archive was updated
  • I think, to start, "license" and "title" should be required; others can be optional

For two concrete examples, see the metadata for #23 and the metadata for #18
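
To make the shape of the file concrete, here is a hypothetical example of generating a _Metadata.json that follows the draft above; every value is an invented placeholder, only the keys match the proposal:

    # Write an example _Metadata.json following the draft schema; the values
    # below are invented placeholders.
    import json
    from datetime import datetime, timezone

    metadata = {
        "title": "Example archive",
        "description": "A short description of what the archive contains.",
        "source": ["http://example.com/data/"],
        "license": [
            {
                "summary": "CC0 1.0 (No Rights Reserved)",
                "source": "http://creativecommons.org/publicdomain/zero/1.0/",
            }
        ],
        "last_synched": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

    with open("_Metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)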

Other thoughts:

  • Should the metadata include maintainer information?
  • Should the metadata include the script/tool that was used to sync/update the archive? Might be useful if the current maintainer goes away.

CC #5 for related discussion

IPFS as a backend to a web archiving

I am building a new on-demand web archiving system, called webrecorder.io, which allows for on-demand archiving of any web site (by acting as a rewriting + recording proxy).
This version (actually beta.webrecorder.io) will soon be open-sourced and will be available for users to deploy on their own.

The system allows a user to create a recording of any web site, including dynamic content, by browsing it through the recorder, e.g. https://webrecorder.io/record/example.com/, and to replay it by browsing through the replay endpoint, https://webrecorder.io/replay/example.com/

The recording is a WARC file, a standard used by the Internet Archive and other archiving orgs. The file can be broken down into records (basically the contents of an HTTP response + request and extra metadata), and each of these records could be put individually into IPFS.

I suppose this sort of relates to #7 but perhaps in a more sophisticated way.

Most obvious mode of operation: Store each WARC record in IPFS individually.
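
A hedged sketch of that mode, assuming the warcio library for WARC parsing and the ipfs CLI for adding (neither choice is implied by the above), which also builds a small CDX-like index mapping URL + datetime to the resulting hash:

    # Split a WARC into response records, add each record's payload to IPFS,
    # and record (URL, datetime) -> hash in a JSON index. Note this stores only
    # the payload; keeping the full record with HTTP headers is a design choice
    # left open here.
    import json
    import subprocess
    import tempfile
    from warcio.archiveiterator import ArchiveIterator

    def warc_to_ipfs(warc_path, index_path="cdx.json"):
        index = []
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                payload = record.content_stream().read()
                with tempfile.NamedTemporaryFile() as tmp:
                    tmp.write(payload)
                    tmp.flush()
                    ipfs_hash = subprocess.check_output(
                        ["ipfs", "add", "-q", tmp.name]).decode().strip()
                index.append({
                    "url": record.rec_headers.get_header("WARC-Target-URI"),
                    "datetime": record.rec_headers.get_header("WARC-Date"),
                    "ipfs": ipfs_hash,
                })
        with open(index_path, "w") as f:
            json.dump(index, f, indent=2)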

Some unknowns (to me):

  • Resolving URL + datetime to the hash of the stored object in IPFS (this is also part of the Memento protocol). Basically looking up a URL and datetime and mapping it to an IPFS hash.
  • Privacy / security concerns: Would want to have users create private archives, or be able to set controls on what is accessible to whom. This is not specific to web archiving, but something I don't (yet) know much about.

For more reference:
The system is built using these tools: https://github.com/ikreymer/pywb , https://github.com/ikreymer/warcprox
An older simplified version of the "webrecorder" concept: https://github.com/ikreymer/pywb-webrecorder.

Automatic Updates

All of the archives are currently being imported into IPFS manually. This is fine as a starting point, but we need to write some scripts to keep them up to date with their origins, run them periodically, and publish the changes over IPNS.
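
As a rough sketch of what such an update script could boil down to (the rsync step is a placeholder; each archive will have its own way of syncing with its origin):

    # Re-sync a local copy, re-add it to IPFS, and point this node's IPNS name
    # at the new root hash. Meant to be run periodically, e.g. from cron.
    import subprocess

    def update_archive(source, local_dir):
        # 1. bring the local copy up to date with the origin (placeholder step)
        subprocess.check_call(["rsync", "-a", "--delete", source, local_dir])
        # 2. add the directory to IPFS; the last line of -q output is the root hash
        hashes = subprocess.check_output(["ipfs", "add", "-r", "-q", local_dir])
        root_hash = hashes.decode().strip().splitlines()[-1]
        # 3. publish the new root over IPNS
        subprocess.check_call(["ipfs", "name", "publish", root_hash])
        return root_hash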

CKAN

CKAN powers a number of open data repositories, including datahub.io, data.gov, data.gov.uk, data.gov.au, etc. It also has a harvesting mechanism.

  • CKAN harvest -> IPFS
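
A hedged sketch of the harvest step, assuming the target portal exposes CKAN's standard Action API (package_list / package_show) under /api/3/action/; the downloaded directory could then be added with ipfs add -r:

    # Pull dataset metadata and resource files from a CKAN portal into a local
    # directory. The portal URL is only an example.
    import json
    import os
    import urllib.parse
    import urllib.request

    def harvest(portal="https://datahub.io", out_dir="ckan", limit=10):
        def action(name, **params):
            url = f"{portal}/api/3/action/{name}"
            if params:
                url += "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)["result"]

        for name in action("package_list")[:limit]:
            dataset = action("package_show", id=name)
            dest = os.path.join(out_dir, name)
            os.makedirs(dest, exist_ok=True)
            # keep the dataset metadata alongside the downloaded resources
            with open(os.path.join(dest, "dataset.json"), "w") as f:
                json.dump(dataset, f, indent=2)
            for resource in dataset.get("resources", []):
                if resource.get("url"):
                    filename = os.path.basename(
                        urllib.parse.urlparse(resource["url"]).path) or "resource"
                    urllib.request.urlretrieve(resource["url"],
                                               os.path.join(dest, filename))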

Xkcd

I plan to archive all the comics in http://xkcd.com/

I think I'll use (comicnumber)-(comictitle).png for the image and figure out how to save the alt-text in the PNG metadata.

Please post if you want to keep a copy of the archive or if you manage to create it before I do :)
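
In case it helps: xkcd exposes a JSON endpoint per comic (https://xkcd.com/<num>/info.0.json) with the image URL, title, and alt-text. A sketch of the archiving step, assuming Pillow for writing the alt-text into the PNG's text metadata:

    # Fetch a comic's metadata from the xkcd JSON API, download the image, and
    # save it as (number)-(title).png with the alt-text in a tEXt chunk.
    import io
    import json
    import urllib.request
    from PIL import Image
    from PIL.PngImagePlugin import PngInfo

    def archive_comic(num, out_dir="."):
        with urllib.request.urlopen(f"https://xkcd.com/{num}/info.0.json") as resp:
            meta = json.load(resp)
        with urllib.request.urlopen(meta["img"]) as resp:
            image = Image.open(io.BytesIO(resp.read()))
        info = PngInfo()
        info.add_text("alt", meta["alt"])        # the hover text
        info.add_text("title", meta["title"])
        safe_title = meta["safe_title"].replace("/", "_").replace(" ", "_")
        image.save(f"{out_dir}/{num}-{safe_title}.png", pnginfo=info)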

Archive metadata and licensing --> js discussion

For each archive, we need a standard way to record some metadata with the archive. At the moment, the most important thing to include is licensing information, but we may find other information that we would like to require.

This issue is to track the discussion on this topic. Below is a draft proposal, with two examples. All aspects of this proposal are open for discussion.

  • Metadata should be stored in a file called _Metadata.json. The name is designed so that it'll appear near the top of directory listings.
  • The JSON object is a dictionary with the following keys:
    • title -- Provides a name for the archive
    • description -- A more verbose description, if needed
    • source -- A list of URLs where this data came from
    • license -- An array of dictionaries listing the relevant licenses. Each has the following keys:
      • summary -- a brief summary of the license
      • source -- Where to find the license/legal terms in full
    • last_synched -- an ISO 8601 timestamp indicating the last time this archive was updated
  • I think, to start, "license" and "title" should be required; others can be optional

For two concrete examples, see the metadata for #23 and the metadata for #18

Other thoughts:

  • Should the metadata include maintainer information?
  • Should the metadata include the script/tool that was used to sync/update the archive? Might be useful if the current maintainer goes away.

CC #5 for related discussion

dat

http://dat-data.com/

Should be able to use dat to add datasets to ipfs soon. See http://kevinchai.net/datasets for ideas.

mafintosh: you just need to wrap ipfs in a blob store interface, https://github.com/maxogden/abstract-blob-store and you'll be able to use ipfs inside dat for file data
davidar: ah, so not quite there yet?
davidar: in terms of working out-of-the-box
mafintosh: no but as soon as there are good js bindings for ipfs that'll probably happen pretty fast
which they are already working on

CC: @mafintosh

discogs.com

http://data.discogs.com/

Download Discogs Data

Here you will find monthly dumps of Discogs Release, Artist, Label, and
Master Release data. The data is in XML format and formatted according
to the API spec: http://www.discogs.com/developers/

This data is made available under the CC0 No Rights Reserved license:
http://creativecommons.org/about/cc0

2015-11-04T17:54:32.000Z        154.1 MB       discogs_20151101_artists.xml.gz
2015-11-04T17:54:32.000Z        26.7 MB        discogs_20151101_labels.xml.gz
2015-11-04T17:54:32.000Z        99.4 MB        discogs_20151101_masters.xml.gz
2015-11-04T17:54:32.000Z        3.3 GB         discogs_20151101_releases.xml.gz

Archive webpage archive hub

Every independent archival effort that we do should have a webpage. It would be useful for it to have certain things like:

  • latest head
  • how to replicate
  • version history
  • archival scripts
  • maintainer
  • license
  • authors
  • original urls

There may be standards for this already. (Check the Internet Archive and OKFN?)

It may be doable as a package.json-style metadata file, plus a script to produce an index.html.
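
As a rough sketch of that idea, here is a hypothetical generator that renders an index.html from the _Metadata.json format discussed in the metadata issue; the extra keys used here (maintainer, head) are placeholders, not part of any agreed standard:

    # Render a simple per-archive index.html from its _Metadata.json.
    import html
    import json

    def render_index(metadata_path="_Metadata.json", out_path="index.html"):
        with open(metadata_path) as f:
            meta = json.load(f)
        rows = []
        for key in ("title", "description", "source", "license",
                    "last_synched", "maintainer", "head"):
            if key in meta:
                value = meta[key] if isinstance(meta[key], str) else json.dumps(meta[key])
                rows.append(f"<tr><th>{key}</th><td>{html.escape(value)}</td></tr>")
        page = (f"<html><body><h1>{html.escape(meta.get('title', 'Archive'))}</h1>"
                f"<table>{''.join(rows)}</table></body></html>")
        with open(out_path, "w") as f:
            f.write(page)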
