Beacon Data Collection

The tools in this repository provide data harvesting using the PKP Beacon, a lightweight mechanism to identify PKP software to us for security updates and research.

Installation

To install the tool:

  • Clone the repository locally
  • Install composer dependencies: composer install
  • Create an empty database called "beacon" (username=beacon, password=beacon). (This can be overridden using environment variables, see .)
  • Create the database schema by running php beacon.php database create

Usage

There are three stages involved in processing the beacon data:

  1. Parsing the PKP webserver's access logs to find installations using beacon pings
  2. Extracting the list of contexts (e.g. journals) from the installations
  3. Updating the beacon data for each context

Each stage is executed by running a different script.

Parsing the access logs

To look through the access logs for beacons that might identify new installations (or provide updates from already-seen ones), use the process-log command. For usage information, run:

php beacon.php process-log -h

This is a single-threaded tool that can handle a large number of log entries relatively quickly.
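
As a rough illustration of this kind of log scan, the sketch below walks through access log lines one at a time and collects the distinct values of a query parameter. It is not the tool's implementation: the parameter name `oai`, the `/check` path, and the function name `find_oai_urls` are all hypothetical, chosen only to show the idea of spotting repeated pings from the same installation.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches the request portion of a Common/Combined Log Format line,
# e.g.  "GET /path?query HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def find_oai_urls(log_lines, param="oai"):
    """Return the distinct values of one query parameter, in first-seen order.

    ASSUMPTION: the parameter name "oai" is hypothetical. The real beacon
    format is defined by the PKP software, not by this sketch.
    """
    seen = {}
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a recognisable request line
        query = urlsplit(match.group(1)).query
        for value in parse_qs(query).get(param, []):
            seen.setdefault(value, None)  # dict keeps insertion order
    return list(seen)


# Made-up sample lines; the second has no query string, the third repeats the first.
sample = [
    '203.0.113.5 - - [01/Jan/2024:00:00:01 +0000] "GET /check?oai=https%3A%2F%2Fexample.org%2Foai HTTP/1.1" 200 512',
    '203.0.113.9 - - [01/Jan/2024:00:00:02 +0000] "GET /favicon.ico HTTP/1.1" 404 0',
    '203.0.113.5 - - [01/Jan/2024:00:00:03 +0000] "GET /check?oai=https%3A%2F%2Fexample.org%2Foai HTTP/1.1" 200 512',
]
print(find_oai_urls(sample))
```

A single pass like this needs no concurrency, which is consistent with the single-threaded design described above.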

Additional processing done at this stage:

Extracting the context list

In this stage, a list of contexts (e.g. journals) is extracted from each OAI endpoint.

To extract the context list, use the scan command. For usage information, run:

php beacon.php scan -h

This is a multi-threaded tool that allows potentially several minutes for each beacon entry. Timeouts, concurrency, etc. are all configurable. Records can be selected for update by OAI URL. See the usage information for details.
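
The general pattern of bounded concurrency with a per-entry time limit can be sketched as follows. This is only an illustration under stated assumptions: `scan_all`, `fake_fetch`, and the parameter names are hypothetical, and the real option names should be taken from the command's usage information.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def scan_all(urls, fetch, concurrency=4, timeout=1.0):
    """Call fetch(url) for each URL using a pool of worker threads.

    Returns a dict mapping each URL to its result, or to None when the
    call failed or no result arrived within `timeout` seconds of waiting.
    Hypothetical helper; not part of the beacon tool.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result(timeout=timeout)
            except FutureTimeout:
                # The worker thread is not interrupted; we only stop waiting.
                results[url] = None
            except Exception:
                results[url] = None
    # Leaving the "with" block waits for any still-running workers to finish.
    return results


def fake_fetch(url):
    """Stand-in for a network request; URLs containing "slow" take 1 second."""
    if "slow" in url:
        time.sleep(1.0)
    return url.upper()


print(scan_all(["a", "slow"], fake_fetch, concurrency=2, timeout=0.2))
```

Note that a thread-based timeout of this kind only bounds how long the caller waits; a real scanner would also set a timeout on the underlying network request so that the worker itself gives up.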

Additional processing done at this stage:

  • The driver set specifier is excluded from further processing. Though it's possible that there's a journal called driver, this set is also added by the DRIVER plugin and can be misunderstood by the beacon as a journal set.
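
A minimal sketch of that exclusion, assuming the set specifiers have already been read from the OAI endpoint into a list. The names `EXCLUDED_SET_SPECS` and `context_set_specs` are hypothetical, and exact, case-sensitive matching is an assumption of this sketch.

```python
# Set specifiers that should not be treated as contexts.
EXCLUDED_SET_SPECS = {"driver"}


def context_set_specs(set_specs):
    """Return the set specifiers that remain after dropping excluded ones.

    Order is preserved. Matching is exact and case-sensitive (an assumption).
    """
    return [spec for spec in set_specs if spec not in EXCLUDED_SET_SPECS]


print(context_set_specs(["journal-a", "driver", "journal-b"]))
```

The trade-off is the one described above: a journal whose set specifier really is `driver` would also be skipped.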

Processing the beacon list

In this stage, the data stored in the beacon for each context (e.g. journal or preprint server) is enriched.

To update the beacon data for each context, use the synchronize command. For usage information, run:

php beacon.php synchronize -h

This is a multi-threaded tool that allows potentially several minutes for each beacon entry. Timeouts, concurrency, etc. are all configurable. Records can be selected for update by OAI URL. See the usage information for details.

Additional processing done at this stage:

  • If one is not yet stored, the ISSN is fetched for each context by looking for ISSN-like data in a DC record's source element. (The first available ISSN-like data is used.)
  • If one is not yet stored and an ISSN is available, the country is looked up from the ISSN using the ISSN API.
  • If one is not yet stored, a "count span" is fetched and stored for the specified year (default: the previous year from June onward in the current year) based on the number of modified records.
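
The ISSN step can be illustrated with a loose pattern match over the values of a record's Dublin Core `source` elements, returning the first hit. This is a sketch, not the tool's code: `first_issn` and `ISSN_RE` are hypothetical names, and the pattern only checks the shape of the value (four digits, a hyphen, three digits, then a digit or X), not the ISSN check digit.

```python
import re

# "ISSN-like": NNNN-NNNC where C is a digit or X. No check-digit validation.
ISSN_RE = re.compile(r"\b\d{4}-\d{3}[\dXx]\b")


def first_issn(source_values):
    """Return the first ISSN-like string found in the given values, or None."""
    for value in source_values:
        match = ISSN_RE.search(value)
        if match:
            return match.group(0).upper()
    return None


# Made-up example of the source values for one record.
sources = ["Journal of Examples; Vol. 1 No. 2 (2020)", "2049-3630"]
print(first_issn(sources))
```

Because the pattern is deliberately loose, something like a page range of the form `1000-1010` would also match; validating the check digit would be one way to tighten it.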

Extracting the data

Data can be exported as CSV using the export command:

php beacon.php export > beacon.csv

Additional processing done at this stage:

  • If selected in the export options, limits (e.g. on the number of active records in the last year) can be applied here.
  • Further deduplication is applied. Entries are grouped into a single row if they share the same:
    • application
    • version
    • administrator email
    • earliest record datestamp
    • OAI repository name
    • OAI set specifier
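
The grouping rule above can be sketched as building a key from those six fields and keeping one row per key. The field names in `DEDUP_FIELDS` and the choice to keep the first row seen are assumptions of this sketch; the document does not say how the tool merges the grouped entries.

```python
# Hypothetical column names standing in for the six fields listed above.
DEDUP_FIELDS = (
    "application",
    "version",
    "admin_email",
    "earliest_datestamp",
    "repository_name",
    "set_spec",
)


def deduplicate(rows):
    """Collapse rows that share all DEDUP_FIELDS values into a single row.

    ASSUMPTION: the first row seen for each key is the one kept.
    """
    groups = {}
    for row in rows:
        key = tuple(row.get(field) for field in DEDUP_FIELDS)
        groups.setdefault(key, row)
    return list(groups.values())


# Made-up rows: the first two differ only in the URL they were reached by.
base = {
    "application": "ojs2",
    "version": "3.3.0.8",
    "admin_email": "admin@example.org",
    "earliest_datestamp": "2015-01-01T00:00:00Z",
    "repository_name": "Example Journals",
    "set_spec": "journal-a",
}
rows = [
    {**base, "oai_url": "http://example.org/oai"},
    {**base, "oai_url": "https://www.example.org/oai"},
    {**base, "set_spec": "journal-b", "oai_url": "http://example.org/oai"},
]
print(len(deduplicate(rows)))
```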

beacon's Issues

Generate a dashboard with useful statistics

After collecting some data it might be interesting to generate some statistics/visualizations to give a better perception of what's being used or not, so we'll be better grounded when deciding what to prioritize.

Refactoring

@asmecher, I've created this issue just to keep track of my PR.

I did some initial refactoring in the code, mostly to remove duplicated things, updated the packages and added code-linting (I took the rules from Nate's PR), so nothing great to watch yet.

Convert MARC countries to an ISO format

To make it easier to filter and extract statistics, the countries should be stored in a more standard format.

Note: Prepare something to update the database.
