wisewebspider's Introduction

WISeWEBSpider version 0.4

Original Authors: Jerod Parrent, James Guillochon

### Description

wisewebspider is a simple program built to scrape and download all publicly available supernova spectra from the Weizmann Interactive Supernova data REPository (WISeREP). A bulk download option is not available through WISeREP, and the number of supernova spectra to download is in the tens of thousands. The script creates one main directory, sne-external-WISEREP/, where spectra are stored in individual subdirectories alongside README.json files. The README files detail event metadata for each spectrum collected and keep track of the number of private spectra. Also stored in sne-external-WISEREP/ are log files and a lists.json file that tracks the script's progress and records non-supernova events so they can be skipped on later runs.
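As a rough illustration of the layout described above, the sketch below walks a finished scrape and loads each event's README.json. The helper name `summarize_scrape` is hypothetical and not part of wisewebspider, and nothing is assumed about the fields inside README.json beyond it being valid JSON.

```python
import json
from pathlib import Path


def summarize_scrape(root):
    """Map each event subdirectory name to its parsed README.json.

    Hypothetical helper: only the directory layout described in this
    README is assumed (one subdirectory per event, each holding a
    README.json). Top-level files such as lists.json are ignored.
    """
    summary = {}
    for readme in sorted(Path(root).glob("*/README.json")):
        with readme.open() as fh:
            summary[readme.parent.name] = json.load(fh)
    return summary


if __name__ == "__main__":
    events = summarize_scrape("sne-external-WISEREP")
    print("events with a README.json:", len(events))
```

Run against an empty or missing directory, the helper simply returns an empty dictionary.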

The script guards against spectra already collected, duplicate files found on WISeREP, and events that are not supernovae. However, no effort has been made to collate spectra for objects with multiple aliases (e.g., SN2011fe and PTF11kly, both the same event, have separate directories), nor does the script determine supernova types for objects that are unspecified on WISeREP.
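Because aliases are not collated, anyone combining spectra across directories has to supply the alias mapping themselves. A minimal sketch of one way to do that follows. The `ALIASES` table and the function name are assumptions made for illustration; wisewebspider does not provide them, and the table would need to be maintained by hand or built from another catalog.

```python
from collections import defaultdict

# Hypothetical, hand-maintained alias table (alias -> preferred name).
# The single entry is the example pair mentioned in this README.
ALIASES = {"PTF11kly": "SN2011fe"}


def group_by_canonical_name(event_names, aliases=ALIASES):
    """Group event directory names that refer to the same object."""
    groups = defaultdict(list)
    for name in event_names:
        # Names without a known alias are their own canonical name.
        groups[aliases.get(name, name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_canonical_name(["SN2011fe", "PTF11kly", "SN1987A"]))
```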

Without excluding by event type and/or survey program (UCB, CfA, SuSpect, etc), the full runtime for scraping everything is about 18.7 hours. Fortunately this need only be done once. After an initial scrape, the script can be run in update mode, which at most takes a few minutes.

### Usage

For the initial scrape, check/edit the excluded lists in main.py, then run:

python3.5 -m wisewebspider

To run in update mode:

python3.5 -m wisewebspider --update --daysago 30

where the value after daysago can be 1, 2, 7, 14, 30, 180, or 365, i.e., the number of days since you last scraped.
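The restriction on daysago can be expressed with argparse's `choices` option. The sketch below is not taken from the wisewebspider source; it only shows how the two flags used above could be parsed and validated, and `build_parser` is a hypothetical name.

```python
import argparse

# The values this README lists as valid for --daysago.
ALLOWED_DAYS = (1, 2, 7, 14, 30, 180, 365)


def build_parser():
    """Build a parser for the two flags shown above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="wisewebspider")
    parser.add_argument("--update", action="store_true",
                        help="run in update mode instead of a full scrape")
    parser.add_argument("--daysago", type=int, choices=ALLOWED_DAYS,
                        help="number of days since the last scrape")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.update, args.daysago)
```

With this setup, a value outside the list (for example `--daysago 3`) makes argparse print an error and exit rather than starting a scrape.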

Dependencies and Credits


wisewebspider's Issues

Missing "reducer" field

Looks like there's only one metadata field that's currently missing from the spider that was in the old code: The "reducer" field, which should be a name. Is it possible to add this field to the spider?

Memory Use

Attempting to run a full rebuild on the OSC server, which has a rather modest memory budget. I'm seeing "Killed" errors, which generally suggest that the script is running out of memory. The process is killed after about an hour of running, so unfortunately a full rebuild requires multiple restarts (luckily, lists.json prevents it from repeatedly re-scraping the events already listed).

My guess is there's some variable that's growing continuously as the script runs that just needs to be purged between files. It's also possible it's related to a memory leak in RoboBrowser of some sort.
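One way to test the growing-variable guess is to compare memory snapshots taken before and after a batch of events using the standard library's tracemalloc module. This is a generic diagnostic sketch, not code from wisewebspider: `top_growth` is a hypothetical helper, and the list allocation stands in for a batch of scraping.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocations changed most between snapshots."""
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    # Stand-in for scraping a batch of events; replace with the real loop.
    retained = [bytes(1000) for _ in range(10000)]
    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)
    tracemalloc.stop()
```

If the same line keeps appearing at the top with a growing size across successive batches, the object allocated there is a likely candidate for clearing between files.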
