browsertrix-yaml-examples

Example YAML configuration files for including and excluding content in browsertrix crawls.

See the browsertrix project for more general documentation.

Default and handy starter file

This configuration covers all subdomains, such as abc.collection.litme.com.ua, and handles http:// to https:// conversions.

collection: "collection-litme-com-ua"
workers: 16
saveState: always

# A few examples that exclude parts of a page from being fetched and recorded. Feel free to add more
blockRules:
  - url: google-analytics.com
  - url: googletagmanager.com
  - url: yandex.ru
  - url: mycounter.ua
  - url: facebook.(com|net)
  #- url: youtube.com/embed/ # Uncomment this line to skip the recording of embedded YouTube videos

seeds:
    - url: http://collection.litme.com.ua/
      scopeType: "domain"

Excluding trouble

If you notice that a crawl is collecting duplicate links due to parameters, like

        - '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2022-4-1","seedId":0,"depth":2}'
        - '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2023-03-1","seedId":0,"depth":2}'
        - '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2022-2-1","seedId":0,"depth":2}'

where each URL loads the same base page in the browser, first check that the URL without the parameter leads to the same page. In this case, https://nmiu.org/index.php/novyny-museum loads that same page, so we can exclude every URL in which iccaldate= appears as a query parameter by adding the following to the YAML file:

exclude:
  - .*iccaldate=.*

This prevents the same page from consuming extra workers, crawl time, and space in the final file. The whole file then reads:

collection: "collection-nmiu-org"
workers: 16
saveState: always
seeds:
    - url: https://nmiu.org
      include: .*nmiu\.org.*
      scopeType: "host"
      exclude:
        - .*iccaldate=.*
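
Before starting a long crawl, it can help to try an exclusion pattern against a few URLs copied from the crawl queue. The following is a minimal sketch that uses Python's `re` module as a stand-in for the crawler's own regular-expression matching, on the assumption that a pattern may match anywhere in the URL. The helper name `is_excluded` is hypothetical and not part of browsertrix.

```python
import re

# Exclusion pattern taken from the YAML file above.
EXCLUDE_PATTERNS = [r".*iccaldate=.*"]


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if any exclusion pattern matches somewhere in the URL.

    Assumption: the crawler searches for the pattern anywhere in the URL.
    """
    return any(re.search(pattern, url) for pattern in patterns)


# Two parameterised URLs from the queue, plus the base page they duplicate.
queued = [
    "https://nmiu.org/index.php/novyny-museum?iccaldate=2022-4-1",
    "https://nmiu.org/index.php/novyny-museum?iccaldate=2023-03-1",
    "https://nmiu.org/index.php/novyny-museum",
]

for url in queued:
    print("skip" if is_excluded(url) else "keep", url)
```

The two parameterised URLs should be reported as skipped while the base page is kept, which is the behaviour wanted from the exclude rule.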

Making Sure You Don't Go Out Of Depth

Some websites have recursive links, which can look like this:

'{"url":"https://nmiu.org/lektoriy/181-ekskursiji-kontent/vysta/3/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/exponat-tyzhnya","seedId":0,"depth":48}'

If a site's URLs show this kind of seemingly endless pattern, avoid getting stuck in the trap by adding the "depth" setting to the YAML file:

collection: "collection-litme-com-ua"
workers: 16
saveState: always
seeds:
    - url: http://collection.litme.com.ua/
      include:
        - .*collection\.litme\.com\.ua.*
      scopeType: "host"
      depth: 25

This tells the crawler to follow links no further than 25 hops from the seed URL. Pages beyond that depth are not captured, so use it with care!

For the example given at the start of this section, exclude can also be used:

exclude:
  - .*index\.php/index\.php.*

The repeated index.php path segment was what caused the recursion.
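
To find out which path segment is repeating before writing an exclusion rule, the queued URLs can be scanned for runs of identical consecutive segments. This is a rough sketch using only the Python standard library. The function name `longest_segment_run` is hypothetical, and the sample URL is a shortened form of the one shown at the start of this section.

```python
from urllib.parse import urlsplit


def longest_segment_run(url):
    """Return (segment, count) for the longest run of identical
    consecutive path segments in the URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    best = ("", 0)
    run_start = 0
    for i in range(1, len(segments) + 1):
        # A run ends at the end of the path or when the segment changes.
        if i == len(segments) or segments[i] != segments[run_start]:
            if i - run_start > best[1]:
                best = (segments[run_start], i - run_start)
            run_start = i
    return best


# Shortened version of the recursive URL from this section.
sample = (
    "https://nmiu.org/lektoriy/181-ekskursiji-kontent/vysta/3/"
    "index.php/index.php/index.php/exponat-tyzhnya"
)

segment, count = longest_segment_run(sample)
if count > 1:
    print(f"'{segment}' repeats {count} times in a row; consider excluding it")
```

A segment that repeats more than once in a row is a good candidate for an exclude pattern like the index.php one above.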
