developers-italia-backend

Crawler for the OSS catalog of Developers Italia

Description

Developers Italia provides a catalog of Free and Open Source software aimed at Public Administrations.

This crawler finds and retrieves the publiccode.yml files from organizations that publish software and have registered through the onboarding procedure.

The generated YAML files are then used by the developers.italia.it build to generate its static pages.

Setup and deployment processes

The crawler can either be run manually on the target machine or be deployed as a Docker container in Kubernetes using its Helm chart.

Elasticsearch 6.8 is used to store the data and must be ready to accept connections before the crawler is started.
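Since Elasticsearch has to accept connections before the crawler starts, a start-up script could poll it first. The following is a minimal Python sketch of such a wait loop; it is not part of the crawler. The function names, the address http://localhost:9200, and the retry values are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET on `url` answers with status 200.

    The default URL is an assumed local Elasticsearch address, not a value
    taken from the crawler's configuration.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and HTTP error statuses all count
        # as "not ready yet".
        return False


def wait_until_ready(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `probe` up to `attempts` times, sleeping `delay` seconds after
    each failed try except the last. Return True on the first success,
    False if every attempt fails."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A wrapper script could call `wait_until_ready(http_probe)` and launch the crawler only when it returns True. Passing the probe and the sleep function as arguments keeps the loop easy to exercise without a running Elasticsearch.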

Manually configure and build the crawler

  1. cd crawler

  2. Save the auth tokens to domains.yml.

  3. Rename config.toml.example to config.toml and set the variables

    NOTE: The application also supports environment variables as a substitute for the config.toml file. Environment variables take priority over the values in the configuration file.

  4. Build the crawler binary with make
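The precedence rule from the note in step 3 can be pictured as a lookup that consults the environment first and falls back to the file. The sketch below is illustrative Python, not the crawler's own code. `BLACKLIST_FOLDER` is a variable this README mentions; both paths are hypothetical placeholders, and `file_config` stands in for the parsed contents of config.toml.

```python
import os


def resolve_setting(name, file_config, environ=None):
    """Return the value of `name`, preferring the environment over the file.

    `file_config` is a dict standing in for the parsed config.toml.
    Returns None when the setting is defined in neither place.
    """
    if environ is None:
        environ = os.environ
    if name in environ:
        return environ[name]
    return file_config.get(name)


# Hypothetical values, for illustration only.
file_config = {"BLACKLIST_FOLDER": "blacklists/"}
fake_env = {"BLACKLIST_FOLDER": "/srv/crawler/blacklists"}

from_env = resolve_setting("BLACKLIST_FOLDER", file_config, fake_env)
from_file = resolve_setting("BLACKLIST_FOLDER", file_config, {})
```

Here `from_env` holds the value from the environment, while `from_file` falls back to the value from the file because the environment passed in is empty.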

Docker

The repository has a Dockerfile, used to build the production image, and a docker-compose.yml file to set up the development environment.

  1. Copy the .env.example file to .env and edit the environment variables as needed. .env.example has detailed descriptions for each variable.

    cp .env.example .env

  2. Save your auth tokens to domains.yml:

    cp crawler/domains.yml.example crawler/domains.yml
    editor crawler/domains.yml

  3. Start the environment:

    docker-compose up

Run the crawler

Crawl mode (all items in whitelists): bin/crawler crawl whitelist/*.yml

Gets the list of organizations in whitelist/*.yml and starts to crawl their repositories.

If it finds a blacklisted repository, it removes it from Elasticsearch if it is present there.

It also generates:

One mode (single repository url): bin/crawler one [repo url] whitelist/*.yml

In this mode a single repository is evaluated at a time. If its organization is present in the whitelists, the matching iPA code is used; otherwise the iPA code is set to null and the slug ends with a random code instead of the iPA code.

Furthermore, the iPA code validation, a simple check within the whitelists to ensure that the code belongs to the selected PA, is skipped.

If it finds a blacklisted repository, it will exit immediately.
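The crawl and one modes treat a blacklisted repository differently, which the following Python sketch tries to make explicit. It is a simplified model under stated assumptions, not the crawler's implementation: the blacklist is a plain set of URLs, a dict stands in for the Elasticsearch index, and the function name and returned labels are invented for this example.

```python
def handle_repository(mode, repo_url, blacklist, index):
    """Apply the blacklist rule to one repository.

    `mode` is "crawl" or "one"; `blacklist` is a set of repository URLs;
    `index` is a dict keyed by URL that stands in for Elasticsearch.
    Returns a short label naming the action taken.
    """
    if repo_url not in blacklist:
        return "process"
    if mode == "crawl":
        # Crawl mode: drop the entry if it was indexed, then move on.
        index.pop(repo_url, None)
        return "removed-and-skipped"
    if mode == "one":
        # One mode: stop right away; the real command exits the process.
        return "exit"
    raise ValueError(f"unknown mode: {mode!r}")
```

Returning a label instead of exiting directly keeps the sketch side-effect free apart from the removal from `index`.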

Other commands

  • bin/crawler updateipa downloads iPA data and writes them into Elasticsearch

  • bin/crawler delete [URL] deletes software from Elasticsearch using its code hosting URL specified in publiccode.url

  • bin/crawler download-whitelist downloads organizations and repositories from the onboarding portal repository and saves them to a whitelist file

Crawler whitelists

The whitelist directory contains the lists of organizations to crawl.

whitelist/manual-reuse.yml is a list of Public Administration repositories that for various reasons were not onboarded with developers-italia-onboarding, while whitelist/thirdparty.yml contains the non-PA repos.

Here's an example of what the files might look like:

- id: "Comune di Bagnacavallo" # generic name of the organization.
  codice-iPA: "c_a547" # codice-iPA
  organizations: # list of organization urls.
    - "https://github.com/gith002"
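To illustrate how an entry like this could be used to find the iPA code for a repository, here is a minimal Python sketch. It assumes the YAML has already been parsed into a list of dicts and that matching is done by URL prefix. The README does not describe the crawler's actual matching logic, so the function and the repository names below are hypothetical.

```python
def find_ipa_code(repo_url, whitelist):
    """Return the codice-iPA of the first whitelist entry that lists an
    organization URL which is a path prefix of `repo_url`, or None when
    no organization matches."""
    repo_url = repo_url.rstrip("/")
    for entry in whitelist:
        for org_url in entry.get("organizations", []):
            # The trailing slash keeps "gith002" from matching "gith0021".
            if repo_url.startswith(org_url.rstrip("/") + "/"):
                return entry.get("codice-iPA")
    return None


# The example entry, as it would look once the YAML is parsed.
whitelist = [
    {
        "id": "Comune di Bagnacavallo",
        "codice-iPA": "c_a547",
        "organizations": ["https://github.com/gith002"],
    }
]

# "some-repo" and "someone-else" are made-up names for illustration.
matched = find_ipa_code("https://github.com/gith002/some-repo", whitelist)
unmatched = find_ipa_code("https://github.com/someone-else/some-repo", whitelist)
```

In this sketch `matched` is the iPA code of the example entry and `unmatched` is None, mirroring the case where the organization is absent and the iPA code is set to null.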

Crawler blacklists

Blacklists are needed to exclude individual repositories that are not in line with our guidelines.

You can set BLACKLIST_FOLDER in config.toml to point to a directory where blacklist files are located. Blacklisting is currently supported by the one and crawl commands.

Authors

Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.
