browsertrix-yaml-examples

Example YAML configuration files for including and excluding content in Browsertrix crawls.

See the browsertrix project for more general documentation.

Full sample configurations that were helpful for crawling certain websites are available in the subdirectories, such as biph-kiev-ua/crawl-config.yaml.

Default and handy starter file

With scopeType set to "domain", this configuration covers all subdomains, such as abc.collection.litme.com.ua, and handles http:// to https:// conversions.

collection: "collection-litme-com-ua"
workers: 16
saveState: always

# A few examples that exclude parts of a page from being fetched and recorded.
blockRules:

  # Unnecessary trackers
  - url: google-analytics.com
  - url: googletagmanager.com
  - url: googlesyndication.com
  - url: yandex.ru
  - url: liveinternet.ru
  - url: hotlog.ru
  - url: openstat.net
  - url: mycounter.ua
  - url: facebook.(com|net)

  # Malware
  - url: www.acint.net       # spam-seo.sape malware
  - url: news.2xclick.ru     # mwblacklisted35 malware
  - url: culturaltracking.ru # culturaltracking malware
  - url: sape.ru             # sape backlinks SEO malware

  # Non-threatened resources that are bandwidth-intensive and perhaps not
  # a current priority. Uncomment these rules to bypass recording.
  #- url: youtube.com/embed/ # Embedded YouTube videos
  #- url: w.soundcloud.com   # Embedded SoundCloud tracks

seeds:
  - url: http://collection.litme.com.ua/
    scopeType: "domain"

Excluding trouble

If you notice that a crawl is collecting duplicate links due to parameters, like

        - '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2022-4-1","seedId":0,"depth":2}'
        - '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2023-03-1","seedId":0,"depth":2}'
        - '{"url":"https://nmiu.org/index.php/novyny-museum?iccaldate=2022-2-1","seedId":0,"depth":2}'

where each is the same base page when loaded in the browser, then first check that dropping the parameter leads to the same webpage. In this case, https://nmiu.org/index.php/novyny-museum goes to the same webpage, so we can exclude every link that uses iccaldate= as a query parameter by adding the following to our YAML file:

exclude:
  - .*iccaldate=.*

This prevents the same page from taking up extra worker capacity, crawl time, and space in the final file. The whole file then reads:

collection: "collection-nmiu-org"
workers: 16
saveState: always
seeds:
    - url: https://nmiu.org
      include: .*nmiu\.org.*
      scopeType: "host"
      exclude:
        - .*iccaldate=.*
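
To see which queued URLs such a rule would drop, the pattern can be tried outside the crawler first. This is a minimal Python sketch under the assumption that an exclude rule removes any URL in which the pattern is found. The base_page helper is hypothetical; it groups URLs by scheme, host, and path, which is one way to spot pages that differ only in their query parameters.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# The exclude pattern from the configuration above.
EXCLUDE = re.compile(r".*iccaldate=.*")

# URLs taken from the queue excerpt above, plus the parameter-free base page.
queued = [
    "https://nmiu.org/index.php/novyny-museum?iccaldate=2022-4-1",
    "https://nmiu.org/index.php/novyny-museum?iccaldate=2023-03-1",
    "https://nmiu.org/index.php/novyny-museum?iccaldate=2022-2-1",
    "https://nmiu.org/index.php/novyny-museum",
]


def base_page(url: str) -> str:
    """Strip the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


# Count how many queued URLs collapse onto the same base page.
duplicates = Counter(base_page(url) for url in queued)

# Keep only the URLs that the exclude rule does not match.
kept = [url for url in queued if not EXCLUDE.search(url)]

print(duplicates.most_common(1))
print(kept)
```

A base page with a high count is a candidate for an exclude rule, provided the page really is the same without the parameter.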

Making Sure You Don't Go Out Of Depth

Some websites have recursive links. They can look like this:

'{"url":"https://nmiu.org/lektoriy/181-ekskursiji-kontent/vysta/3/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/index.php/exponat-tyzhnya","seedId":0,"depth":48}'

If you notice this kind of never-ending pattern in a site's URLs, avoid getting stuck in the trap by adding the depth setting to the seed in the YAML file:

collection: "collection-litme-com-ua"
workers: 16
saveState: always
seeds:
    - url: http://collection.litme.com.ua/
      include:
        - .*collection\.litme\.com\.ua.*
      scopeType: "host"
      depth: 25

This tells the crawler to follow links no further than 25 clicks from the seed URL, so use it with care: pages that lie deeper than that will not be captured.

For the recursive URL shown at the start of this section, an exclude rule also works, because the repeated index.php/index.php segment is what causes the recursion:

exclude:
  - .*index\.php/index\.php.*
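
Recursive links of this kind can also be spotted in a queue or log dump before choosing between depth and exclude. The sketch below is one hedged way to do that in Python: the max_segment_run helper is hypothetical and simply measures the longest run of identical consecutive path segments, and the trap URL is a shortened stand-in for the much longer one shown above. The pattern is the exclude rule with its dots escaped.

```python
import re
from urllib.parse import urlsplit

# The exclude rule for the repeated segment, with literal dots escaped.
TRAP = re.compile(r".*index\.php/index\.php.*")


def max_segment_run(url: str) -> int:
    """Return the length of the longest run of identical consecutive path segments."""
    segments = [segment for segment in urlsplit(url).path.split("/") if segment]
    longest = run = 0
    previous = None
    for segment in segments:
        run = run + 1 if segment == previous else 1
        longest = max(longest, run)
        previous = segment
    return longest


# Shortened version of the recursive URL, for illustration only.
trap_url = (
    "https://nmiu.org/lektoriy/181-ekskursiji-kontent/vysta/3/"
    + "index.php/" * 5
    + "exponat-tyzhnya"
)
normal_url = "https://nmiu.org/index.php/novyny-museum"

for url in (trap_url, normal_url):
    print(max_segment_run(url), bool(TRAP.search(url)), url)
```

A run longer than 1 suggests a link trap. The normal page contains index.php only once, so the exclude rule leaves it alone.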

Handling an Enormous Site like MediaWiki

This is an example crawl of a MediaWiki site. It reduces an enormous, unfinishable crawl to a manageable one by ignoring all wiki paraphernalia (history pages, etc.).

collection: "wiki-library-kr-ua"
workers: 16
saveState: always
seeds:
  - url: https://wiki.library.kr.ua/
    include: .*\.wiki\.library\.kr\.ua/
    exclude:
      - .*action\=.*
      - .*page\=.*
      - .*limit\=.*
      - .*oldid\=.*
      - .*title=%D0%A1%D0%BF%D0%B5%D1%86%D1%96%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0\:.*
      - .*returnto\=.*
    scopeType: "host"

Line 12 tells the crawler to ignore links whose title parameter begins with the percent-encoded Ukrainian word for "special" (Спеціальна) followed by a colon. This rule needs to be adapted to the language of each site.
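
The percent-encoded string in that rule can be decoded, and an equivalent rule for another wiki language can be built, with Python's standard urllib.parse module. This is a small sketch: special_page_rule is a hypothetical helper, and "Special" is used only as an example of the namespace name on an English-language wiki.

```python
from urllib.parse import quote, unquote

# The percent-encoded namespace name from line 12 of the configuration.
encoded = "%D0%A1%D0%BF%D0%B5%D1%86%D1%96%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0"

# Decode it to check which word the rule actually targets.
print(unquote(encoded))


def special_page_rule(namespace: str) -> str:
    """Build an exclude pattern for title=<namespace>: links (hypothetical helper)."""
    return rf".*title={quote(namespace)}\:.*"


# Rebuild the Ukrainian rule and build one for an English-language wiki.
print(special_page_rule("Спеціальна"))
print(special_page_rule("Special"))
```

Check how the target site writes its special-page links before relying on the generated rule, since the namespace name depends on the wiki's language.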

Contributors

magpiedin, kathrynn, quinnanya, starchy, cdchapman, storytracer, hawc2
