archive's Introduction

YouTube annotations were removed around 15:00 UTC on January 15th, 2019. The tracker was taken down around a day later, so workers are no longer required, and the URL they connect to will no longer work. See here for more information about the future of this project.


For cloudrac3r's work, see README.md in the node folder.


YouTube Annotation Archive

Provides scripts for archiving YouTube Annotations. See the wiki for information about how it works.

Annotations on every YouTube video will be deleted forever on the 15th of January, 2019. The purpose of this project is to archive as much annotation data as possible before that happens.

The current process is to scrape as many channel IDs as possible, then to scrape video IDs from those channels, then to download annotation data for those videos.
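
The three stages above can be sketched as a small Node.js pipeline. This is only an illustration under stated assumptions: `listChannelIds`, `listVideoIds`, and `fetchAnnotations` are hypothetical callbacks supplied by the caller, not functions from this repository, and the real workers are handed their work by a master server rather than walking channels themselves.

```javascript
// Hypothetical sketch of the channel -> video -> annotation flow.
// None of these callback names come from the project's code.
async function archiveAnnotations({ listChannelIds, listVideoIds, fetchAnnotations }) {
  const archive = new Map(); // video ID -> raw annotation data

  for (const channelId of await listChannelIds()) {
    for (const videoId of await listVideoIds(channelId)) {
      // The same video can be reached more than once; fetch it only once.
      if (archive.has(videoId)) continue;
      archive.set(videoId, await fetchAnnotations(videoId));
    }
  }
  return archive;
}
```

Passing the three stages in as callbacks keeps the sketch independent of how IDs are actually scraped, which this README does not specify.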

If you would like to make sure specific channels are archived before the 15th, you can use this tool.

Usage

Installing and running a worker (Node.js):

With Docker:

Download the Dockerfile located in the /docker folder with

$ wget https://github.com/omarroth/archive/raw/master/docker/Dockerfile

Then in the same directory run the following command to build the image:

$ docker build -t archive .

Use the following commands to create a container with the image and run it to begin the archiving process:

$ docker create --name=archive-worker archive:latest
$ docker container start archive-worker

On Ubuntu:

# Install dependencies
$ sudo apt-get install curl python-software-properties
$ curl -sL https://deb.nodesource.com/setup_10.x | sudo -E bash -
$ sudo apt-get install nodejs gcc g++ make

$ git clone https://github.com/omarroth/archive
$ cd archive/node
$ npm install
$ cd worker
$ node index.js

With Heroku:

Create a new Heroku app and point it to https://github.com/omarroth/archive on the branch "heroku", and trigger a manual deploy. You can do this by creating a Heroku account and visiting this link: https://dashboard.heroku.com/new?template=https://github.com/omarroth/archive/tree/heroku.

Enable automatic deploys to receive the latest updates automatically.

The webserver is just a placeholder; open the logs to see what's currently going on.

Installing and running a worker (Crystal):

On Ubuntu:

# Install dependencies
$ curl -sSL https://dist.crystal-lang.org/apt/setup.sh | sudo bash
$ sudo apt-get update
$ sudo apt-get install crystal libssl-dev libxml2-dev libyaml-dev libgmp-dev libreadline-dev librsvg2-dev

$ git clone https://github.com/omarroth/archive
$ cd archive
$ shards
$ crystal build src/worker.cr --release
$ ./worker -u https://archive.omar.yt -t 20
$ ./worker -h
    -u URL, --batch-url=URL          Master server URL
    -t THREADS, --max-threads=THREADS
                                     Number of threads for downloading annotations
    -h, --help                       Show this help
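
The `-t` option caps how many annotation downloads run at once. As a rough Node.js analogue (not the Crystal worker's actual implementation), the idea of a fixed download limit can be sketched as below. `mapWithLimit` and its parameters are hypothetical names introduced for illustration.

```javascript
// Hypothetical sketch: run `worker(item)` over `items` with at most
// `limit` calls in flight at any time, preserving result order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner() {
    while (next < items.length) {
      const index = next++; // claim an item; no await between check and claim
      results[index] = await worker(items[index], index);
    }
  }

  // Start up to `limit` runners that pull items until none remain.
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

With `limit` set to 20, this would roughly mirror `-t 20`: twenty downloads in flight, each runner picking up the next video as soon as its current one finishes.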

Contributors

cloudrac3r, mateon1, omarroth, tech234a

archive's Issues

Invalid size error

I am getting invalid size errors all the time now:

Continuing 57f8b4c2-16c9-4ef1-899f-f4063cf75054...
GC Warning: Repeated allocation of very large block (appr. size 33558528):
May lead to memory leak and poor performance
All annotations collected (39.5 MiB)
Compressing...
Committing...
Invalid size for 57f8b4c2-16c9-4ef1-899f-f4063cf75054
Continuing 57f8b4c2-16c9-4ef1-899f-f4063cf75054...
GC Warning: Repeated allocation of very large block (appr. size 67112960):
May lead to memory leak and poor performance
All annotations collected (39.5 MiB)
Compressing...
Committing...
Invalid size for 57f8b4c2-16c9-4ef1-899f-f4063cf75054
Continuing 57f8b4c2-16c9-4ef1-899f-f4063cf75054...
GC Warning: Repeated allocation of very large block (appr. size 67112960):
May lead to memory leak and poor performance
All annotations collected (39.5 MiB)
Compressing...
Committing...
Invalid size for 57f8b4c2-16c9-4ef1-899f-f4063cf75054

some items failed

No idea why, but some items at the beginning of the batch failed:

(node:10616) UnhandledPromiseRejectionWarning: RequestError: Error: socket hang up
    at new RequestError (/home/niemand/archive/node/node_modules/request-promise-core/lib/errors.js:14:15)
    at Request.plumbing.callback (/home/niemand/archive/node/node_modules/request-promise-core/lib/plumbing.js:87:29)
    at Request.RP$callback [as _callback] (/home/niemand/archive/node/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at self.callback (/home/niemand/archive/node/node_modules/request/request.js:185:22)
    at emitOne (events.js:116:13)
    at Request.emit (events.js:211:7)
    at Request.onRequestError (/home/niemand/archive/node/node_modules/request/request.js:881:8)
    at emitOne (events.js:116:13)
    at ClientRequest.emit (events.js:211:7)
    at TLSSocket.socketErrorListener (_http_client.js:387:9)
(node:10616) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 170)

I hope this doesn't lead to incomplete data.
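
The `UnhandledPromiseRejectionWarning` above means a rejected request promise had no handler attached. One possible way to deal with transient errors such as "socket hang up" is to catch the rejection and retry a few times before giving up. The sketch below assumes a hypothetical `withRetry` helper, which is not part of this repository, and it has not been checked against the worker's actual request code.

```javascript
// Hypothetical helper: retry an async task a limited number of times.
// `task` is any function returning a promise (for example, one HTTP request).
async function withRetry(task, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err; // e.g. a "socket hang up" RequestError
      if (attempt < attempts) {
        // Wait before the next attempt so the server is not hammered.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Rethrow so the caller can log the failure or requeue the item.
  throw lastError;
}
```

A caller would wrap each download, for instance `withRetry(() => download(videoId))` with `download` standing in for whatever request function the worker uses, and handle the final rejection explicitly so that failed items are recorded instead of silently dropped.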

cannot build with docker

niemand@akaralan:~/archive/docker$ sudo docker build -t archive .
Sending build context to Docker daemon 2.048kB
Step 1/6 : FROM node:10
10: Pulling from library/node
no matching manifest for unknown in the manifest list entries

missing dependencies

You should add:
sudo npm install request-promise-native
sudo npm install sqlite

to the Node.js part of the README, because those packages were not present when I installed npm on Linux Mint. The sqlite install failed, so unfortunately I cannot use the worker anyway, and there could be even more dependencies missing.

Everything completed?

It appears that everything has been backed up now. CONGRATULATIONS! The script also doesn't upload anything anymore. Wouldn't it be beneficial to still receive uploads for possible error corrections, or are there possibly other sources for video IDs? Especially since there are now so many active workers.

failed to upload

Using the Crystal version I got this error:

GC Warning: Repeated allocation of very large block (appr. size 33558528):
May lead to memory leak and poor performance
All annotations collected (28.1 MiB)
Compressing...
Committing...
All annotations compressed (2.6 MiB)
Uploading to S3...
Unhandled exception: Status 204 should not have a body (ArgumentError)
from /usr/share/crystal/src/http/client/response.cr:0:11 in 'exec_internal_single'
from /usr/share/crystal/src/http/client.cr:499:5 in 'exec'
from /usr/share/crystal/src/http/client.cr:342:3 in 'post'
from src/worker.cr:192:5 in '__crystal_main'
from /usr/share/crystal/src/kernel.cr:453:5 in 'main'
from /usr/share/crystal/src/string.cr:4202:5 in '__libc_start_main'
from ???
niemand@akaralan:~/archive$
