ipfs-inactive / archives
[ARCHIVED] Repo to coordinate archival efforts with IPFS
Home Page: https://awesome.ipfs.io/datasets
(Keep in mind https://blog.archive.org/developers/ and https://developers.archive.org/ and contribute improvements)
These are published here https://archive.org/download/stackexchange
I've downloaded them to my machine and added them to ipfs 0.4. Currently they are being pinned by biham.
The folder hash is QmYgHvTrSfPJH5Dswq6NB8wTHH77BFaJdLP8UBYJz9Wz19, with the nested parts listed here
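For reference, a minimal sketch of the add step (a local go-ipfs daemon is assumed, and the directory path is a placeholder; another node can then pin the result with `ipfs pin add <root>`):

```python
# Recursively add a downloaded dump directory to IPFS.
# "stackexchange/" is a placeholder path.
import subprocess

out = subprocess.run(["ipfs", "add", "-r", "-Q", "stackexchange/"],
                     capture_output=True, text=True, check=True)
root = out.stdout.strip()  # -Q prints only the final root hash
print("root:", root)
```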
http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-2015-08-09-00-01.gz (mirrored here)
License: CC-BY
The team at http://Metamaps.cc has been collecting some great datasets with their mindmap-style interface.
They are working on exposing these as JSON-LD: metamaps/metamaps#425 (comment) -- when this is ready, we should archive the maps and have at least one client-side viz running in IPFS as an interface to them.
Latest archive.org archive (2013): http://web.archive.org/web/20150905080837/http://www.mizar.org/
http://www.wwpdb.org/download/downloads
Since 1971, the Protein Data Bank archive (PDB) has served as the single repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies.
The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that the PDB is freely and publicly available to the global community.
Cc: @cryptix
Specifically http://downloads.openwrt.org/.
This has been tried in ipfs/kubo#1924 and ipfs/kubo#1584, but failed due to a memory leak (not mainly because ipfs add is slow).
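Until that's fixed, one workaround (a hedged sketch, untested against the leak itself) is to keep each ipfs add invocation small by walking the tree and adding one file at a time; the per-file hashes would then need to be stitched back into a directory, e.g. with ipfs object patch add-link:

```python
# Add files one at a time so no single `ipfs add` call has to hold the
# whole mirror. The mirror path is a placeholder.
import subprocess
from pathlib import Path

for path in Path("downloads.openwrt.org/").rglob("*"):
    if path.is_file():
        out = subprocess.run(["ipfs", "add", "-Q", str(path)],
                             capture_output=True, text=True, check=True)
        print(out.stdout.strip(), path)
```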
The first thing I mirrored to IPFS was a small subset of Project Gutenberg, so I'm definitely interested in getting the whole thing into IPFS, as both @rht (#14) and @simonv3 (https://github.com/simonv3/ipfs-gutenberg) have suggested.
Making an issue to coordinate this.
stripped.gz (20 MB) and names.gz (5 MB)
OEIS data under: OEIS-EULA / CC-BY-NC 3.0
Note:
EDIT: Here's the whole OEIS microproject folder: QmbXX6jkJSx1aH41nfBqPyZAgxe1m4CzMMVnhYz1xRyFDQ
Basically a mirror of https://developers.google.com/speed/libraries/
The following libraries are included, each with an update script and a datapackage.json file:
https://ipfs.io/ipns/em32.net/archives/hosted_js
(At the time of this writing, the latest version is /ipfs/QmYjAUMQDCPYiE9A1wjczXHvAMhYiBwhSu3SUxNoadqJCT)
CC #35
Would be cool if we had all of these pictures archived: http://www.openculture.com/2015/09/the-british-library-puts-over-1000000-images-in-the-public-domain-a-deeper-dive-into-the-collection.html
Would love to have an IPFS-compatible fork of https://github.com/cdnjs/cdnjs serving files via IPFS. It's a super large repository, though, but I'll give it a try and develop the integration locally.
Public annotations on Hypothes.is are licensed under the CC0 licence (aka Public Domain).
CC: @jbenet @BigBlueHat
@ArchiveTeam seems to be awaiting a proposal from IPFS developers:
http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/ipfs_implementation
SpringerOpen books are published under the Creative Commons Non-Commercial (CC BY-NC) license, so they can be reused and redistributed for non-commercial purposes as long as the original author is attributed.
Cc: @mekarpeles
@kohlhase @dginev I'd really like to get arXMLiv working with our arXiv corpus (#2), as it would really help with ipfs/apps#1 (including ipfs/apps#5). Would you be able to help with this?
Cc: @jbenet @brucemiller
TL;DR: People should be able to simply run:
ipfs-mirror http://example.com/
without having to worry about copyright violations, etc.
There are several open-access collections that could be archived by simply spidering their websites, in the same way that Google Cache or IA's Wayback Machine does. Of course, this should only be performed for the portions of a website not disallowed by robots.txt.
IANAL, but from what I can tell, this is all kosher so long as there's an appropriate procedure for opting out. According to this article (which links to this document), Google is safe because it allows webmasters to opt out via robots.txt, and also has a process for responding to DMCA takedown requests.
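As a concrete illustration, here is the opt-out check an ipfs-mirror tool could perform before fetching anything (standard library only; the user-agent name is a made-up placeholder):

```python
# Check a site's robots.txt before spidering it into IPFS.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "ipfs-mirror"  # hypothetical crawler name

def allowed_to_fetch(url: str) -> bool:
    """Return True only if robots.txt permits fetching `url`."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()  # fetch and parse robots.txt
    return rp.can_fetch(USER_AGENT, url)

if allowed_to_fetch("http://example.com/"):
    pass  # safe to spider and add to IPFS
```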
This is the policy that the Internet Archive follows:
Request by a webmaster of a private (non-governmental) web site, typically for reasons of privacy, defamation, or embarrassment.
- Archivists should provide a 'self-service' approach site owners can use to remove their materials based on the use of the robots.txt standard.
- Requesters may be asked to substantiate their claim of ownership by changing or adding a robots.txt file on their site.
- This allows archivists to ensure that material will no longer be gathered or made available.
- These requests will not be made public; however, archivists should retain copies of all removal requests.
Third party removal requests based on the Digital Millennium Copyright Act of 1998 (DMCA).
- Archivists should attempt to verify the validity of the claim by checking whether the original pages have been taken down, and if appropriate, requesting the ruling(s) regarding the original site.
- If the claim appears valid, archivists should comply.
- Archivists will strive to make DMCA requests public via Chilling Effects, and notify searchers when requested pages have been removed.
- Archivists will notify the webmaster of the affected site, generally via email.
Third party removal requests based on non-DMCA intellectual property claims (including trademark, trade secret).
- Archivists will attempt to verify the validity of the claim by checking whether the original pages have been taken down, and if appropriate, requesting the ruling(s) regarding the original site.
- If the original pages have been removed and the archivist has determined that removal from public servers is appropriate, then the archivists will remove the pages from their public servers.
- Archivists will strive to make these requests public via Chilling Effects, and notify searchers when requested pages have been removed.
- Archivists will notify the webmaster of the affected site, generally via email.
Third party removal requests based on objection to controversial content (e.g. political, religious, and other beliefs).
- [...] archivists should not generally act on these requests.
Third party removal requests based on objection to disclosure of personal data provided in confidence.
- [...] These requests are generally treated as requests by authors or publishers of original data.
Requests by governments.
- Archivists will exercise best-efforts compliance with applicable court orders.
Other requests and grievances, including underlying rights issues, error correction and version control, and re-insertions of web sites based on change of ownership.
- These are handled on a case by case basis by the archive and its advisors.
Anyway, it would be really helpful if IPFS had an official procedure regarding this (presumably gateway-dmca-denylist would be a part of this).
Links:
It would be nice to have the PGP public key database in IPFS.
https://github.com/braddockcg/internet-in-a-box
http://internet-in-a-box.org/
http://downloads.internet-in-a-box.org/
I don't know how best to tackle it, and it is most likely a project for the future (0.5 TiB of data).
I don't know if there's a channel log already, but a bot creating one and storing it in IPFS looks like a great idea to me. The only issue is how to split messages into files:
Do we use one file per day containing all of that day's messages? Or do we divide them by the hour? Or some other way? One file per day sounds best to me.
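A toy sketch of the one-file-per-day layout (not an existing bot; names are made up); the logs/ directory can then be re-added with ipfs add -r at the end of each day:

```python
# Append each incoming IRC message to logs/YYYY-MM-DD.txt.
from datetime import datetime, timezone
from pathlib import Path

LOG_DIR = Path("logs")  # hypothetical output directory

def log_message(nick: str, text: str) -> None:
    now = datetime.now(timezone.utc)
    LOG_DIR.mkdir(exist_ok=True)
    day_file = LOG_DIR / (now.strftime("%Y-%m-%d") + ".txt")
    with day_file.open("a", encoding="utf-8") as f:
        f.write("{} <{}> {}\n".format(now.strftime("%H:%M:%S"), nick, text))
```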
Getting documents archived on IPFS is one thing, but we also need to be able to search through them. Given that these archives are eventually going to become too large to fit on a single machine, it's likely that the search index will need to also be distributed over the IPFS network (e.g. with each node providing an index of the contents of their local blockstore). Some possible ways this could be achieved:
Looking through the IRC logs, I've seen several people express interest in an IPFS search engine (@whyrusleeping @rht @jbenet @rschulman @zignig @kragen), but haven't been able to find any specific proposals. Perhaps we could coordinate here?
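To make the per-node indexing idea above concrete, here is a toy sketch (the hashes and contents are placeholders; a real version would walk the local blockstore):

```python
# Build a toy inverted index: each word maps to the set of IPFS hashes
# whose content contains it.
from collections import defaultdict

def build_index(documents: dict) -> dict:
    """`documents` maps an IPFS hash to that object's text content."""
    index = defaultdict(set)
    for ipfs_hash, text in documents.items():
        for word in set(text.lower().split()):
            index[word].add(ipfs_hash)
    return index

index = build_index({"QmAaa...": "protein data bank",
                     "QmBbb...": "project gutenberg data"})
print(sorted(index["data"]))  # both hashes contain "data"
```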
CERN has, since the end of 2014, been serving some fraction of the colossal amount of captured data about particle collisions in the LHC (from detectors like CMS, ATLAS, and ALICE), summing up to 60,000,000 GB.
With the help of a small Python crawler (cmspull.py), I've compiled an index of all CMS-detector primary datasets (all .root files, totaling ca. 27.4 TB), as well as an index of indexes. Indexes of other detectors' datasets, plus derivative datasets, are to come :)
To use all that data, a special environment is required: CERN's OpenData recommends using their CernVM, which is basically Scientific Linux + ROOT, a data analysis framework (hence the .root files). Without ROOT, these historical milestones cannot be used as computable data directly, so the tool must also be preserved/archived, just like the collision data. There's also a mirror right here at Github.
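This is not the actual cmspull.py, just an illustration of the crawling approach: scrape a listing page for links to .root files and write them to an index file, which can itself be added to IPFS (the listing URL is a placeholder):

```python
# Scrape .root links from a listing page into a plain-text index.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

BASE = "http://opendata.cern.ch/"  # placeholder, not a real listing page

class RootLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".root"):
                    self.links.append(urljoin(BASE, value))

parser = RootLinkParser()
parser.feed(urlopen(BASE).read().decode("utf-8", errors="replace"))
with open("cms_index.txt", "w") as f:
    f.write("\n".join(parser.links) + "\n")
```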
Oh, and thanks for the amazing project!
All RFCs! \o/
All W3C specs! \o/
If you follow news about the United States, you might know that the CIA has recently declassified and released a large number of "President's Daily Brief" documents (PDBs) from the 1960s.
https://www.cia.gov/library/publications/intelligence-history/presidents-daily-brief/index.html
I thought it would be fun to archive these in IPFS. You can find them here:
http://ipfs.io/ipns/QmbuG3dYjX5KjfAMaFQEPrRmTRkJupNUGRn1DXCgKK5ogD/archives/PDBs
My understanding is that these documents are a product of the US Government, and as such are not subject to copyright. I've not found any notices that contradict this, and because they are all declassified, I believe that distribution is unrestricted.
Following from #25
@eminence said:
For each archive, we need a standard way to record some metadata with the archive. At the moment, the most important thing to include is licensing information, but we may find other information that we would like to require.
This issue is to track the discussion on this topic. Below is a draft proposal, with two examples. All aspects of this proposal are open for discussion.
- Metadata should be stored in a file called _Metadata.json. The name is designed so that it'll appear near the top of directory listings.
- The JSON object is a dictionary with the following keys:
- title -- Provides a name for the archive
- description -- A more verbose description, if needed
- source -- Lists of URLs where this data came from
- license -- An array of dictionaries listing the relevant licenses. Each has the following keys:
- summary -- a brief summary of the license
- source -- Where to find the license/legal terms in full
- last_synched -- an ISO 8601 timestamp indicating the last time this archive was updated
- I think, to start, "license" and "title" should be required; the others can be optional
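To illustrate the draft, here's a hypothetical _Metadata.json (all values are made up):

```json
{
  "title": "Example Archive",
  "description": "A made-up archive illustrating the draft schema.",
  "source": ["http://example.com/dumps/"],
  "license": [
    {
      "summary": "CC-BY 4.0",
      "source": "https://creativecommons.org/licenses/by/4.0/"
    }
  ],
  "last_synched": "2015-11-01T00:00:00Z"
}
```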
For two concrete examples, see the metadata for #23 and the metadata for #18
Other thoughts:
Should the metadata include maintainer information?
Should the metadata include the script/tool that was used to sync/update the archive? Might be useful if the current maintainer goes away.
CC #5 for related discussion
I am building a new on-demand web archiving system, called webrecorder.io, which allows archiving of any web site (by acting as a rewriting + recording proxy).
This version (actually beta.webrecorder.io) will soon be open-sourced and will be available for users to deploy on their own.
The system allows a user to create a recording of any web site, including dynamic content, by browsing it through the recorder, e.g. https://webrecorder.io/record/example.com/, and to replay it by browsing through replay, https://webrecorder.io/replay/example.com/
The recording is a WARC file, a standard used by Internet Archive and other archiving orgs. The file can be broken down into records (basically contents of HTTP response + request and extra metadata), and each of these records could be put individually into IPFS.
I suppose this sort of relates to #7 but perhaps in a more sophisticated way.
Most obvious mode of operation: Store each WARC record in IPFS individually.
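As a hedged sketch of that mode, using the warcio package to iterate over records and a local ipfs daemon to store each one (the temp-file plumbing is just one way to do it):

```python
# Split a WARC into records and add each record's body to IPFS separately.
import subprocess
import tempfile

from warcio.archiveiterator import ArchiveIterator

def add_bytes_to_ipfs(data: bytes) -> str:
    """Add raw bytes to IPFS via a temp file; return the hash."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(data)
        tmp.flush()
        out = subprocess.run(["ipfs", "add", "-Q", tmp.name],
                             capture_output=True, text=True, check=True)
    return out.stdout.strip()

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        uri = record.rec_headers.get_header("WARC-Target-URI")
        body = record.content_stream().read()
        print(uri, add_bytes_to_ipfs(body))
```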
Some unknowns (to me):
For more reference:
The system is built using these tools: https://github.com/ikreymer/pywb , https://github.com/ikreymer/warcprox
An older simplified version of the "webrecorder" concept: https://github.com/ikreymer/pywb-webrecorder.
All of the archives are currently being imported to IPFS manually. This is fine as a starting point, but we need to write some scripts to keep them up to date with the origin, run them periodically and publish the changes over IPNS.
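A hedged sketch of such a sync job, meant to be run periodically from cron (the origin URL and paths are placeholders; wget and a local ipfs daemon are assumed):

```python
# Re-download the origin, re-add it to IPFS, and publish the new root
# over IPNS.
import subprocess

ORIGIN = "http://example.com/dataset/"  # placeholder origin URL
LOCAL = "mirror/"                       # placeholder working directory

subprocess.run(["wget", "--mirror", "--no-parent", "-P", LOCAL, ORIGIN],
               check=True)
out = subprocess.run(["ipfs", "add", "-r", "-Q", LOCAL],
                     capture_output=True, text=True, check=True)
root = out.stdout.strip()
subprocess.run(["ipfs", "name", "publish", "/ipfs/" + root], check=True)
```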
CKAN powers a number of open data repositories, including datahub.io, data.gov, data.gov.uk, data.gov.au, etc. It also has a harvesting mechanism.
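CKAN instances expose a JSON API, so harvesting could start from something as small as this sketch (endpoint per the CKAN Action API; datahub.io is just the example host):

```python
# List the first few dataset names from a CKAN instance.
import json
from urllib.request import urlopen

resp = json.load(urlopen("https://datahub.io/api/3/action/package_list"))
if resp.get("success"):
    for name in resp["result"][:10]:
        print(name)
```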
I plan to archive all the comics from http://xkcd.com/
I think I'll use (comicnumber)-(comictitle).png for the image, and figure out how to save the alt-text in the PNG metadata.
Please post if you want to keep a copy of the archive or you manage to create it before I do :)
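Here's a sketch of that plan, using xkcd's public JSON API and Pillow to embed the alt text in a PNG tEXt chunk (it assumes the comic image is a PNG, which most are):

```python
# Download one comic and save it as (num)-(title).png with the alt text
# stored in the PNG metadata.
import json
from io import BytesIO
from urllib.request import urlopen

from PIL import Image
from PIL.PngImagePlugin import PngInfo

def archive_comic(num: int) -> None:
    meta = json.load(urlopen("https://xkcd.com/{}/info.0.json".format(num)))
    img = Image.open(BytesIO(urlopen(meta["img"]).read()))
    info = PngInfo()
    info.add_text("alt", meta["alt"])  # alt-text into a tEXt chunk
    img.save("{}-{}.png".format(meta["num"], meta["safe_title"]),
             pnginfo=info)

archive_comic(1)  # e.g. the very first comic
```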
LICENSE: CC BY-NC-SA 3.0 [1]
Like the Stanford Encyclopedia of Philosophy (SEP), but for science, e.g. http://www.scholarpedia.org/article/Faddeev-Popov_ghosts by Faddeev himself.
There is an outdated archive at https://archive.org/details/wiki-scholarpediaorg_w.
Should be able to use dat to add datasets to ipfs soon. See http://kevinchai.net/datasets for ideas.
mafintosh: you just need to wrap ipfs in a blob store interface, https://github.com/maxogden/abstract-blob-store and you'll be able to use ipfs inside dat for file data
davidar: ah, so not quite there yet?
davidar: in terms of working out-of-the-box
mafintosh: no but as soon as there are good js bindings for ipfs that'll probably happen pretty fast
which they are already working on
CC: @mafintosh
Download Discogs Data
Here you will find monthly dumps of Discogs Release, Artist, Label, and Master Release data. The data is in XML format and formatted according to the API spec: http://www.discogs.com/developers/
This data is made available under the CC0 No Rights Reserved license: http://creativecommons.org/about/cc0
2015-11-04T17:54:32.000Z 154.1 MB discogs_20151101_artists.xml.gz
2015-11-04T17:54:32.000Z 26.7 MB discogs_20151101_labels.xml.gz
2015-11-04T17:54:32.000Z 99.4 MB discogs_20151101_masters.xml.gz
2015-11-04T17:54:32.000Z 3.3 GB discogs_20151101_releases.xml.gz
Every independent archival effort that we do should have a webpage. It would be useful for it to have certain things like:
There may be standards for this already. (Check the Internet Archive and OKFN?)
It may be doable as a package.json style metadata file, and a script to produce an index.html.
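The second idea could be as simple as this sketch: generate an index.html from the archive's _Metadata.json (per the draft schema in the metadata issue; file names are assumptions):

```python
# Render a minimal index.html from _Metadata.json.
import json

with open("_Metadata.json") as f:
    meta = json.load(f)

html = """<html><head><title>{title}</title></head><body>
<h1>{title}</h1><p>{description}</p>
<p>Sources: {sources}</p>
<p>License: {licenses}</p>
<p>Last synched: {synched}</p>
</body></html>""".format(
    title=meta["title"],
    description=meta.get("description", ""),
    sources=", ".join(meta.get("source", [])),
    licenses=", ".join(l["summary"] for l in meta.get("license", [])),
    synched=meta.get("last_synched", "unknown"))

with open("index.html", "w") as f:
    f.write(html)
```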