jakopako / goskyr
A configurable command-line web scraper written in Go with auto-configuration capability
License: GNU General Public License v3.0
In some cases it might be necessary to filter out certain events, as in the case of Backstage, where we don't want the CORONA TESTZENTRUM events in our list. Therefore, add a regex-based filter for text fields (currently title & comment).
It is better to separate the code development from the 'concert scraping' use case. That way we would have a repository solely for managing which concert websites get crawled, and could also accept pull requests for it.
Add comments to this example config file to explain the config options in detail.
It would be good to 'purify' this repo by providing only a generic crawler that extracts any structured data from websites. But since GitHub Actions are very handy for executing a regular crawl with a specific configuration, there could be a separate branch just for that purpose.
The respective POST request in the GitHub Action results in a 400.
Use fmt.Errorf instead of errMsg := fmt.Sprintf and then errors.New(errMsg)
Add information about which crawler logs which log string.
Not all crawlers have filters to remove canceled or postponed concerts. Add filters where necessary.
If there are two filters with match: true then we’d expect the item to be included in the results if at least one of those filters returns true. Currently, both have to return true for the item to be included.
Add a way to cope with different formats. Issue #35 possibly solves this as well.
I think it would be better to have separate field filters
per crawler like so:
filters:
- field: "title"
regex: "some regex"
- field: "comment"
regex: "some other regex"
This would make the implementation easier since we could simply loop over the array of filter items and apply each filter to the corresponding field. On the other hand, we would still need something like a switch statement to map the field string to the corresponding field in the crawler struct... Think about this and improve the implementation.
Currently, those two config keys logically do very similar things. So it makes sense to merge their functionality into one config snippet.
Example: the location 'Strom'. There the description can be found on the event-specific subpage but does not always have the same selector. Other locations probably have the same issue. Just look at the config.yml and check locations that don't have the comment field defined.
This was fixed already but had to be removed during refactoring. The problem is smart-guessing the year of a date if it is not provided by the crawled website. An example is the Helsinki crawler. During the year this does not really matter, but around the turn of the year the dates would be wrong: the year would always be the 'old' one instead of being increased by one for dates that lie in the new year.
Currently, only existing fields can be used for filtering. Sometimes, however, it is necessary to filter on other text elements that are not used for any field. One example would be the location 'LaCigale'.
Make field names customizable. Add a field type that defines what parameters are needed to extract this field. Currently, there would be a text type, a url type, and a date type.