archives's Issues

Metamaps

http://Metamaps.cc has been collecting some great datasets with their mindmap-style interface.

They are working on exposing these as JSON-LD: metamaps/metamaps#425 (comment) -- when this is ready, we should archive the maps and have at least one client-side viz running in IPFS as an interface to them.

Protein Data Bank

http://www.wwpdb.org/download/downloads

Since 1971, the Protein Data Bank archive (PDB) has served as the single repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies.

The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that the PDB is freely and publicly available to the global community.

cdnjs

Would love to have an IPFS-compatible fork of https://github.com/cdnjs/cdnjs serving files via IPFS. It's a super large repository, but I'll give it a try and develop the integration locally.

Website mirroring

TL;DR: People should be able to simply run:

ipfs-mirror http://example.com/

without having to worry about copyright violations, etc.

There are several open-access collections that could be archived by simply spidering their website, in the same way that Google Cache or IA's Wayback Machine does. Of course, this should only be performed for the portions of the website not disallowed by robots.txt.
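
To make that concrete, here is a minimal sketch of such a spider, using only the Python standard library and checking robots.txt before each fetch. The ipfs-mirror command above is hypothetical, and so is everything named below; a real tool would also need rate limiting, politeness delays, and proper handling of query strings, fragments, and non-HTML content. The resulting directory could then simply be published with ipfs add -r.

    # Sketch of a robots.txt-respecting mirror using only the standard library.
    import os
    import urllib.parse
    import urllib.request
    import urllib.robotparser
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def mirror(start_url, out_dir, user_agent="ipfs-mirror-sketch"):
        base = urllib.parse.urlparse(start_url)
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(f"{base.scheme}://{base.netloc}/robots.txt")
        robots.read()
        seen, queue = set(), [start_url]
        while queue:
            url = queue.pop(0)
            if url in seen or not robots.can_fetch(user_agent, url):
                continue  # skip anything already saved or disallowed by robots.txt
            seen.add(url)
            req = urllib.request.Request(url, headers={"User-Agent": user_agent})
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
            path = urllib.parse.urlparse(url).path.lstrip("/")
            if not path or path.endswith("/"):
                path += "index.html"
            dest = os.path.join(out_dir, path)
            os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
            with open(dest, "wb") as f:
                f.write(body)
            extractor = LinkExtractor()
            extractor.feed(body.decode("utf-8", errors="replace"))
            for link in extractor.links:
                absolute = urllib.parse.urljoin(url, link)
                if urllib.parse.urlparse(absolute).netloc == base.netloc:
                    queue.append(absolute)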

IANAL, but from what I can tell, this is all kosher so long as there's an appropriate procedure for opting out. According to this article (which links to this document), Google is safe because they allow webmasters to opt out via robots.txt, and also have a process for responding to DMCA takedown requests.

This is the policy that the Internet Archive follows:

Request by a webmaster of a private (non-governmental) web site, typically for reasons of privacy, defamation, or embarrassment.

  1. Archivists should provide a 'self-service' approach site owners can use to remove their materials based on the use of the robots.txt standard.
  2. Requesters may be asked to substantiate their claim of ownership by changing or adding a robots.txt file on their site.
  3. This allows archivists to ensure that material will no longer be gathered or made available.
  4. These requests will not be made public; however, archivists should retain copies of all removal requests.

Third party removal requests based on the Digital Millennium Copyright Act of 1998 (DMCA).

  1. Archivists should attempt to verify the validity of the claim by checking whether the original pages have been taken down, and if appropriate, requesting the ruling(s) regarding the original site.
  2. If the claim appears valid, archivists should comply.
  3. Archivists will strive to make DMCA requests public via Chilling Effects, and notify searchers when requested pages have been removed.
  4. Archivists will notify the webmaster of the affected site, generally via email.

Third party removal requests based on non-DMCA intellectual property claims (including trademark, trade secret).

  1. Archivists will attempt to verify the validity of the claim by checking whether the original pages have been taken down, and if appropriate, requesting the ruling(s) regarding the original site.
  2. If the original pages have been removed and the archivist has determined that removal from public servers is appropriate, then the archivists will remove the pages from their public servers.
  3. Archivists will strive to make these requests public via Chilling Effects, and notify searchers when requested pages have been removed.
  4. Archivists will notify the webmaster of the affected site, generally via email.

Third party removal requests based on objection to controversial content (e.g. political, religious, and other beliefs).
[...] archivists should not generally act on these requests.

Third party removal requests based on objection to disclosure of personal data provided in confidence.
[...] These requests are generally treated as requests by authors or publishers of original data.

Requests by governments.
Archivists will exercise best-efforts compliance with applicable court orders.

Other requests and grievances, including underlying rights issues, error correction and version control, and re-insertions of web sites based on change of ownership.
These are handled on a case by case basis by the archive and its advisors.

Anyway, it would be really helpful if IPFS had an official procedure regarding this (presumably gateway-dmca-denylist would be a part of this).

IRC logs

I don't know if there's a channel log already, but a bot creating one and storing it in IPFS looks like a great idea to me. The only issue is how to split messages into files:

Do we use a file for every day, containing all the messages of that day? Or do we divide them by the hour? Or some other way? By the day sounds best to me.
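
For what it's worth, the per-day split keeps the bot trivial. A minimal sketch of the logging side (the IRC connection itself, and periodically running ipfs add -r on the log directory, are left out):

    # Append each message to one file per channel per day (UTC); the resulting
    # directory can periodically be re-added to IPFS.
    import os
    from datetime import datetime, timezone

    def log_message(channel, nick, text, log_dir="logs"):
        now = datetime.now(timezone.utc)
        day_file = os.path.join(log_dir, channel.lstrip("#"),
                                now.strftime("%Y-%m-%d") + ".log")
        os.makedirs(os.path.dirname(day_file), exist_ok=True)
        with open(day_file, "a", encoding="utf-8") as f:
            f.write(f"{now.strftime('%H:%M:%S')} <{nick}> {text}\n")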

Search engine

Getting documents archived on IPFS is one thing, but we also need to be able to search through them. Given that these archives are eventually going to become too large to fit on a single machine, it's likely that the search index will need to also be distributed over the IPFS network (e.g. with each node providing an index of the contents of their local blockstore). Some possible ways this could be achieved:

  • the static way: each node stores their index in a trie-like structure distributed across multiple IPFS objects, in such a way that clients only need to download a small subset of the index for any given query
  • the dynamic way: queries are broadcast to the network, nodes perform the search on their local index, and provide the results back over IPFS
  • the magic way: somehow storing the index in the IPFS DHT?

Looking through the IRC logs, I've seen several people express interest in an IPFS search engine (@whyrusleeping @rht @jbenet @rschulman @zignig @kragen), but haven't been able to find any specific proposals. Perhaps we could coordinate here?
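
To make the "static" option above slightly more concrete, here is a hedged sketch (my own illustration, not an existing proposal) of a term index sharded by two-letter prefix, so a client only has to fetch the shard(s) relevant to its query terms; each shard and a small manifest would be added to IPFS as separate objects.

    # Build a prefix-sharded inverted index: one JSON shard per two-letter term
    # prefix, plus a manifest. A client resolves a term by fetching only the
    # shard for its prefix (read from local disk here for simplicity).
    import json
    import os
    import re
    from collections import defaultdict

    def build_index(documents, out_dir="index"):
        """documents: dict mapping a document name (or IPFS hash) to its text."""
        postings = defaultdict(set)
        for doc_id, text in documents.items():
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                postings[term].add(doc_id)
        shards = defaultdict(dict)
        for term, doc_ids in postings.items():
            shards[term[:2]][term] = sorted(doc_ids)
        os.makedirs(out_dir, exist_ok=True)
        manifest = {}
        for prefix, terms in shards.items():
            shard_file = f"{prefix}.json"
            with open(os.path.join(out_dir, shard_file), "w") as f:
                json.dump(terms, f)
            manifest[prefix] = shard_file
        with open(os.path.join(out_dir, "manifest.json"), "w") as f:
            json.dump(manifest, f)

    def lookup(term, out_dir="index"):
        shard_path = os.path.join(out_dir, term[:2] + ".json")
        if not os.path.exists(shard_path):
            return []
        with open(shard_path) as f:
            return json.load(f).get(term, [])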

CERN

http://opendata.cern.ch

Since the end of 2014, CERN has been serving some fraction of the colossal amount of data captured about particle collisions in the LHC (with detectors like CMS, ATLAS, ALICE) - summing up to 60,000,000 GB.

With the help of a small Python crawler, I've compiled an index of all CMS-detector primary datasets (all .root files, totaling ca. 27.4 TB), plus an index of indexes. Indexes for the other detectors' datasets + derivative datasets to come :)

  • CMS
    • Primary datasets from 2010 runs (27.4 TB)
      • Scrape data from CERN's OpenData (via cmspull.py; see the sketch after this list)
      • Compile an index of all primary dataset files (.root)
      • Somehow get those ~28 TB into IPFS (maybe in cooperation w/ CERN? - then the steps above are unnecessary)
  • ATLAS
  • ALICE
  • LHCb
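
For reference, the core of such an index-building step is small. The sketch below is hypothetical (the real cmspull.py may work quite differently) and assumes the OpenData record pages contain direct href links to the .root files:

    # Hypothetical index builder: scan OpenData record pages for links to
    # .root files and collect them into a JSON index.
    import json
    import re
    import urllib.request

    def index_root_files(record_urls, out_file="cms_index.json"):
        index = []
        for page_url in record_urls:
            with urllib.request.urlopen(page_url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            # crude: pick up anything that looks like a link to a .root file
            for href in re.findall(r'href="([^"]+\.root)"', html):
                index.append({"record": page_url, "file": href})
        with open(out_file, "w") as f:
            json.dump(index, f, indent=2)
        return index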

To use all that data, a special environment is required - CERN's OpenData normally recommends using their CernVM, which is basically Scientific Linux + ROOT, a data analysis framework (hence the .root files). Without ROOT, these historical milestones cannot be used as computable data directly - so the tool must be preserved/archived alongside the collision data. There's also a mirror of ROOT right here on GitHub.

Oh, and thanks for the amazing project!

CIA PDBs

If you follow news about the United States, you might know that the CIA has recently declassified and released a large number of "President's Daily Brief" documents (PDBs) from the 1960s.

https://www.cia.gov/library/publications/intelligence-history/presidents-daily-brief/index.html

I thought it would be fun to archive these in IPFS. You can find them here:

http://ipfs.io/ipns/QmbuG3dYjX5KjfAMaFQEPrRmTRkJupNUGRn1DXCgKK5ogD/archives/PDBs

My understanding is that these documents are a product of the US Government, and as such are not subject to copyright. I've not found any notices that contradict this, and because they are all declassified, I believe that distribution is unrestricted.

Archive package file

Following from #25

@eminence said:

For each archive, we need a standard way to record some metadata with the archive. At the moment, the most important thing to include is licensing information, but we may find other information that we would like to require.

This issue is to track the discussion on this topic. Below is a draft proposal, with two examples. All aspects of this proposal are open for discussion.

  • Metadata should be stored in a file called _Metadata.json. The name is designed so that it'll appear near the top of directory listings.
  • The JSON object is a dictionary with the following keys:
    • title -- Provides a name for the archive
    • description -- A more verbose description, if needed
    • source -- A list of URLs where this data came from
    • license -- An array of dictionaries listing the relevant licenses. Each has the following keys:
      • summary -- a brief summary of the license
      • source -- Where to find the license/legal terms in full
    • last_synched -- an ISO 8601 timestamp indicating the last time this archive was updated
  • I think, to start, "license" and "title" should be required; others can be optional

For two concrete examples, see the metadata for #23 and the metadata for #18
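
To make the shape of the file concrete, here is a hypothetical example of generating a _Metadata.json that follows the draft above; every value is an invented placeholder, only the keys match the proposal:

    # Write an example _Metadata.json following the draft schema; the values
    # below are invented placeholders.
    import json
    from datetime import datetime, timezone

    metadata = {
        "title": "Example archive",
        "description": "A short description of what the archive contains.",
        "source": ["http://example.com/data/"],
        "license": [
            {
                "summary": "CC0 1.0 (No Rights Reserved)",
                "source": "http://creativecommons.org/publicdomain/zero/1.0/",
            }
        ],
        "last_synched": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

    with open("_Metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)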

Other thoughts:

  • Should the metadata include maintainer information?
  • Should the metadata include the script/tool that was used to sync/update the archive? Might be useful if the current maintainer goes away.

CC #5 for related discussion

IPFS as a backend to a web archiving

I am building a new on-demand web archiving system, called webrecorder.io, which allows for on-demand archiving of any web site (by acting as a rewriting + recording proxy).
This version (actually beta.webrecorder.io) will soon be open-sourced and will be available for users to deploy on their own.

The system allows a user to create a recording of any web site, including dynamic content, by browsing it through the recorder, e.g. https://webrecorder.io/record/example.com/, and to replay it by browsing through the replay endpoint, https://webrecorder.io/replay/example.com/

The recording is a WARC file, a standard used by the Internet Archive and other archiving orgs. The file can be broken down into records (basically the contents of an HTTP response + request and extra metadata), and each of these records could be put individually into IPFS.

I suppose this sort of relates to #7 but perhaps in a more sophisticated way.

Most obvious mode of operation: Store each WARC record in IPFS individually.
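
A hedged sketch of that mode, assuming the warcio library for WARC parsing and the ipfs CLI for adding (neither choice is implied by the above), which also builds a small CDX-like index mapping URL + datetime to the resulting hash:

    # Split a WARC into response records, add each record's payload to IPFS,
    # and record (URL, datetime) -> hash in a JSON index. Note this stores only
    # the payload; keeping the full record with HTTP headers is a design choice
    # left open here.
    import json
    import subprocess
    import tempfile
    from warcio.archiveiterator import ArchiveIterator

    def warc_to_ipfs(warc_path, index_path="cdx.json"):
        index = []
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                payload = record.content_stream().read()
                with tempfile.NamedTemporaryFile() as tmp:
                    tmp.write(payload)
                    tmp.flush()
                    ipfs_hash = subprocess.check_output(
                        ["ipfs", "add", "-q", tmp.name]).decode().strip()
                index.append({
                    "url": record.rec_headers.get_header("WARC-Target-URI"),
                    "datetime": record.rec_headers.get_header("WARC-Date"),
                    "ipfs": ipfs_hash,
                })
        with open(index_path, "w") as f:
            json.dump(index, f, indent=2)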

Some unknowns (to me):

  • Resolving URL + datetime to the hash of the stored object in IPFS (this is also part of the Memento protocol). Basically looking up a URL and datetime and mapping it to an IPFS hash.
  • Privacy / security concerns: Would want to have users create private archives, or be able to set controls on what is accessible to whom. This is not specific to web archiving, but something I don't (yet) know much about.

For more reference:
The system is built using these tools: https://github.com/ikreymer/pywb , https://github.com/ikreymer/warcprox
An older simplified version of the "webrecorder" concept: https://github.com/ikreymer/pywb-webrecorder.

Automatic Updates

All of the archives are currently being imported into IPFS manually. This is fine as a starting point, but we need to write some scripts to keep them up to date with their origins, run them periodically, and publish the changes over IPNS.
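
As a rough sketch of what such an update script could boil down to (the rsync step is a placeholder; each archive will have its own way of syncing with its origin):

    # Re-sync a local copy, re-add it to IPFS, and point this node's IPNS name
    # at the new root hash. Meant to be run periodically, e.g. from cron.
    import subprocess

    def update_archive(source, local_dir):
        # 1. bring the local copy up to date with the origin (placeholder step)
        subprocess.check_call(["rsync", "-a", "--delete", source, local_dir])
        # 2. add the directory to IPFS; the last line of -q output is the root hash
        hashes = subprocess.check_output(["ipfs", "add", "-r", "-q", local_dir])
        root_hash = hashes.decode().strip().splitlines()[-1]
        # 3. publish the new root over IPNS
        subprocess.check_call(["ipfs", "name", "publish", root_hash])
        return root_hash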

CKAN

CKAN powers a number of open data repositories, including datahub.io, data.gov, data.gov.uk, data.gov.au, etc. It also has a harvesting mechanism.

  • CKAN harvest -> IPFS
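
A hedged sketch of the harvest step, assuming the target portal exposes CKAN's standard Action API (package_list / package_show) under /api/3/action/; the downloaded directory could then be added with ipfs add -r:

    # Pull dataset metadata and resource files from a CKAN portal into a local
    # directory. The portal URL is only an example.
    import json
    import os
    import urllib.parse
    import urllib.request

    def harvest(portal="https://datahub.io", out_dir="ckan", limit=10):
        def action(name, **params):
            url = f"{portal}/api/3/action/{name}"
            if params:
                url += "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)["result"]

        for name in action("package_list")[:limit]:
            dataset = action("package_show", id=name)
            dest = os.path.join(out_dir, name)
            os.makedirs(dest, exist_ok=True)
            # keep the dataset metadata alongside the downloaded resources
            with open(os.path.join(dest, "dataset.json"), "w") as f:
                json.dump(dataset, f, indent=2)
            for resource in dataset.get("resources", []):
                if resource.get("url"):
                    filename = os.path.basename(
                        urllib.parse.urlparse(resource["url"]).path) or "resource"
                    urllib.request.urlretrieve(resource["url"],
                                               os.path.join(dest, filename))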

Xkcd

I plan to archive all the comics in http://xkcd.com/

I think I'll use (comicnumber)-(comictitle).png for the image and figure out how to save the alt-text in the PNG metadata.

Please post if you want to keep a copy of the archive or if you manage to create it before I do :)
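
In case it helps: xkcd exposes a JSON endpoint per comic (https://xkcd.com/<num>/info.0.json) with the image URL, title, and alt-text. A sketch of the archiving step, assuming Pillow for writing the alt-text into the PNG's text metadata:

    # Fetch a comic's metadata from the xkcd JSON API, download the image, and
    # save it as (number)-(title).png with the alt-text in a tEXt chunk.
    import io
    import json
    import urllib.request
    from PIL import Image
    from PIL.PngImagePlugin import PngInfo

    def archive_comic(num, out_dir="."):
        with urllib.request.urlopen(f"https://xkcd.com/{num}/info.0.json") as resp:
            meta = json.load(resp)
        with urllib.request.urlopen(meta["img"]) as resp:
            image = Image.open(io.BytesIO(resp.read()))
        info = PngInfo()
        info.add_text("alt", meta["alt"])        # the hover text
        info.add_text("title", meta["title"])
        safe_title = meta["safe_title"].replace("/", "_").replace(" ", "_")
        image.save(f"{out_dir}/{num}-{safe_title}.png", pnginfo=info)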

Archive metadata and licensing --> js discussion

For each archive, we need a standard way to record some metadata with the archive. At the moment, the most important thing to include is licensing information, but we may find other information that we would like to require.

This issue is to track the discussion on this topic. Below is a draft proposal, with two examples. All aspects of this proposal are open for discussion.

  • Metadata should be stored in a file called _Metadata.json. The name is designed so that it'll appear near the top of directory listings.
  • The JSON object is a dictionary with the following keys:
    • title -- Provides a name for the archive
    • description -- A more verbose description, if needed
    • source -- A list of URLs where this data came from
    • license -- An array of dictionaries listing the relevant licenses. Each has the following keys:
      • summary -- a brief summary of the license
      • source -- Where to find the license/legal terms in full
    • last_synched -- an ISO 8601 timestamp indicating the last time this archive was updated
  • I think, to start, "license" and "title" should be required; others can be optional

For two concrete examples, see the metadata for #23 and the metadata for #18

Other thoughts:

  • Should the metadata include maintainer information?
  • Should the metadata include the script/tool that was used to sync/update the archive? Might be useful if the current maintainer goes away.

CC #5 for related discussion

dat

http://dat-data.com/

Should be able to use dat to add datasets to ipfs soon. See http://kevinchai.net/datasets for ideas.

mafintosh: you just need to wrap ipfs in a blob store interface, https://github.com/maxogden/abstract-blob-store and you'll be able to use ipfs inside dat for file data
davidar: ah, so not quite there yet?
davidar: in terms of working out-of-the-box
mafintosh: no but as soon as there are good js bindings for ipfs that'll probably happen pretty fast
which they are already working on

CC: @mafintosh

discogs.com

http://data.discogs.com/

Download Discogs Data

Here you will find monthly dumps of Discogs Release, Artist, Label, and
Master Release data. The data is in XML format and formatted according
to the API spec: http://www.discogs.com/developers/

This data is made available under the CC0 No Rights Reserved license:
http://creativecommons.org/about/cc0

2015-11-04T17:54:32.000Z        154.1 MB       discogs_20151101_artists.xml.gz
2015-11-04T17:54:32.000Z        26.7 MB        discogs_20151101_labels.xml.gz
2015-11-04T17:54:32.000Z        99.4 MB        discogs_20151101_masters.xml.gz
2015-11-04T17:54:32.000Z        3.3 GB         discogs_20151101_releases.xml.gz

Archive webpage archive hub

Every independent archival effort that we do should have a webpage. It would be useful for it to have certain things like:

  • latest head
  • how to replicate
  • version history
  • archival scripts
  • maintainer
  • license
  • authors
  • original urls

There may be standards for this already. (Check the Internet Archive and OKFN?)

It may be doable as a package.json-style metadata file, plus a script to produce an index.html.
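
As a rough sketch of that idea, here is a hypothetical generator that renders an index.html from the _Metadata.json format discussed in the metadata issue; the extra keys used here (maintainer, head) are placeholders, not part of any agreed standard:

    # Render a simple per-archive index.html from its _Metadata.json.
    import html
    import json

    def render_index(metadata_path="_Metadata.json", out_path="index.html"):
        with open(metadata_path) as f:
            meta = json.load(f)
        rows = []
        for key in ("title", "description", "source", "license",
                    "last_synched", "maintainer", "head"):
            if key in meta:
                value = meta[key] if isinstance(meta[key], str) else json.dumps(meta[key])
                rows.append(f"<tr><th>{key}</th><td>{html.escape(value)}</td></tr>")
        page = (f"<html><body><h1>{html.escape(meta.get('title', 'Archive'))}</h1>"
                f"<table>{''.join(rows)}</table></body></html>")
        with open(out_path, "w") as f:
            f.write(page)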
