goskyr's People

Contributors

dependabot[bot], jakopako, laerm, markjaroski

goskyr's Issues

Add filter option

In some cases it might be necessary to filter certain events, such as in the case of Backstage. There we don't want the CORONA TESTZENTRUM events in our list.

Therefore, add a filter for text fields (currently title & comment) that is based on a regex.
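A minimal sketch of such a regex filter in Go follows. The Event type and the excludeMatching function are hypothetical names chosen for illustration, not goskyr's actual types; the sketch assumes an item is dropped when the regex matches either its title or its comment.

```go
package main

import (
	"fmt"
	"regexp"
)

// Event is a hypothetical stand-in for a crawled item; only the two text
// fields named in the issue (title and comment) are modelled.
type Event struct {
	Title   string
	Comment string
}

// excludeMatching returns the events for which the pattern matches neither
// the title nor the comment. Assumption: one regex is applied to both fields.
func excludeMatching(events []Event, pattern string) ([]Event, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid filter regex %q: %w", pattern, err)
	}
	kept := make([]Event, 0, len(events))
	for _, e := range events {
		if re.MatchString(e.Title) || re.MatchString(e.Comment) {
			continue // filtered out
		}
		kept = append(kept, e)
	}
	return kept, nil
}

func main() {
	events := []Event{
		{Title: "CORONA TESTZENTRUM"},
		{Title: "Some Band", Comment: "Live concert"},
	}
	kept, err := excludeMatching(events, `(?i)corona testzentrum`)
	if err != nil {
		panic(err)
	}
	for _, e := range kept {
		fmt.Println(e.Title)
	}
}
```
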

Separate branch for event crawling action

It would be good to 'purify' this repo by providing only a generic crawler that extracts any structured data from websites. But since GitHub Actions are very handy for running a regular crawl with a specific configuration, there could be a separate branch just for this purpose.

Use fmt.Errorf

Use fmt.Errorf instead of errMsg := fmt.Sprintf followed by errors.New(errMsg).
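The difference can be sketched as follows. The function names, the error message and the errNotFound sentinel are made up for illustration and do not come from the goskyr code base. Besides being shorter, fmt.Errorf can also wrap an underlying error with the %w verb, which the two-step form cannot.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel error used to demonstrate wrapping.
var errNotFound = errors.New("selector not found")

// before shows the pattern the issue wants to get rid of.
func before(selector string) error {
	errMsg := fmt.Sprintf("could not extract field with selector %q", selector)
	return errors.New(errMsg)
}

// after produces the same message in a single call.
func after(selector string) error {
	return fmt.Errorf("could not extract field with selector %q", selector)
}

// wrapped additionally wraps an underlying error so callers can use errors.Is.
func wrapped(selector string) error {
	return fmt.Errorf("could not extract field with selector %q: %w", selector, errNotFound)
}

func main() {
	fmt.Println(before(".title"))
	fmt.Println(after(".title"))
	err := wrapped(".title")
	fmt.Println(err, errors.Is(err, errNotFound))
}
```
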

Filters don’t work as expected in all cases

If there are two filters with match: true then we’d expect the item to be included in the results if at least one of those filters returns true. Currently, both have to return true for the item to be included.
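The expected behaviour could look roughly like the sketch below. The Filter struct, newFilter and keep are hypothetical names, and the handling of filters with match: false (any match excludes the item) is an assumption that the issue does not spell out.

```go
package main

import (
	"fmt"
	"regexp"
)

// Filter is a hypothetical representation of one configured filter.
type Filter struct {
	Field string
	Regex *regexp.Regexp
	Match bool // true: the item is wanted if the regex matches
}

// newFilter is a small helper that panics on an invalid pattern.
func newFilter(field, pattern string, match bool) Filter {
	return Filter{Field: field, Regex: regexp.MustCompile(pattern), Match: match}
}

// keep reports whether an item should be included in the results.
// Filters with Match == true are OR-ed: one hit is enough.
// Assumption: a hit on any filter with Match == false excludes the item.
func keep(item map[string]string, filters []Filter) bool {
	hasInclude, included := false, false
	for _, f := range filters {
		matched := f.Regex.MatchString(item[f.Field])
		if f.Match {
			hasInclude = true
			if matched {
				included = true
			}
		} else if matched {
			return false
		}
	}
	// Without any match: true filter, everything not excluded is kept.
	return !hasInclude || included
}

func main() {
	filters := []Filter{
		newFilter("title", "Jazz", true),
		newFilter("comment", "Festival", true),
	}
	item := map[string]string{"title": "Jazz Night", "comment": "Doors open at 20:00"}
	fmt.Println(keep(item, filters)) // prints true: only the title filter matches
}
```
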

Change implementation of filter function

I think it would be better to have separate field filters per crawler like so:

filters:
  - field: "title"
    regex: "some regex"
  - field: "comment"
    regex: "some other regex"

This would make the implementation easier, since we could simply loop over the array of filter items and apply each filter to the corresponding field. On the other hand, we would still need something like a switch statement to map the field string to the corresponding field in the crawler struct. Think about this and improve the implementation.
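One way the loop and the switch could fit together is sketched below. Event, FieldFilter, fieldValue and matchesAny are hypothetical names, and the sketch only reports whether any filter matches; whether a match should include or exclude the item is left open, as in the issue.

```go
package main

import (
	"fmt"
	"regexp"
)

// Event is a hypothetical item with the two filterable text fields.
type Event struct {
	Title   string
	Comment string
}

// FieldFilter mirrors one entry of the proposed filters list.
type FieldFilter struct {
	Field string
	Regex string
}

// fieldValue maps the configured field name to the struct field.
func fieldValue(e Event, field string) (string, error) {
	switch field {
	case "title":
		return e.Title, nil
	case "comment":
		return e.Comment, nil
	default:
		return "", fmt.Errorf("unknown filter field %q", field)
	}
}

// matchesAny loops over the filters and reports whether at least one regex
// matches its corresponding field.
func matchesAny(e Event, filters []FieldFilter) (bool, error) {
	for _, f := range filters {
		value, err := fieldValue(e, f.Field)
		if err != nil {
			return false, err
		}
		matched, err := regexp.MatchString(f.Regex, value)
		if err != nil {
			return false, fmt.Errorf("invalid regex for field %q: %w", f.Field, err)
		}
		if matched {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	filters := []FieldFilter{
		{Field: "title", Regex: "some regex"},
		{Field: "comment", Regex: "some other regex"},
	}
	ok, err := matchesAny(Event{Title: "x", Comment: "some other regex here"}, filters)
	fmt.Println(ok, err)
}
```
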

How to handle more complex event comments (descriptions)?

Example: location 'Strom'. There the description can be found on the event-specific subpage, but it does not always have the same selector. Other locations probably have the same issue. Look at the config.yml and check the locations that don't have the comment field defined.

Fix date guessing issue

This was already fixed, but the fix had to be removed during refactoring. The problem is guessing the year of a date when the crawled website does not provide it. An example is the Helsinki crawler. For most of the year this does not matter, but around the turn of the year the dates are wrong: the year is always the 'old' one and is not increased by one for dates that lie in the new year.
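One possible heuristic, not necessarily the one that was used before the refactoring, is sketched below: take the current year, and switch to the next year if the resulting date would lie too far in the past. The function names and the two-month threshold are assumptions made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// utcDate is a small helper for building dates at midnight UTC.
func utcDate(year, month, day int) time.Time {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
}

// guessYear picks a year for a date that was crawled without one.
// Assumption: a site lists upcoming events, so a date that would lie more
// than two months before 'now' is taken to belong to the next year.
func guessYear(month, day int, now time.Time) int {
	candidate := time.Date(now.Year(), time.Month(month), day, 0, 0, 0, 0, now.Location())
	if candidate.Before(now.AddDate(0, -2, 0)) {
		return now.Year() + 1
	}
	return now.Year()
}

func main() {
	now := utcDate(2021, 12, 20)
	fmt.Println(guessYear(12, 31, now)) // 2021: still in the current year
	fmt.Println(guessYear(1, 15, now))  // 2022: January lies in the new year
}
```
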

Extend filtering mechanism

Currently, only existing fields can be used for filtering. Sometimes, however, it is necessary to filter on other text elements that are not used for any field. One example is the location 'LaCigale'.

Generalize configuration

Make field names customizable. Add a field type that defines which parameters are needed to extract the field. Currently, there would be a text type, a URL type and a date type.
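A rough sketch of what a typed field definition could look like follows. Everything here is an assumption made for illustration: the Field struct, the parameter names (Selector, Attr, Layout) and which parameters each type requires are not taken from goskyr.

```go
package main

import "fmt"

// Field is a hypothetical, generic field definition with a free-form name.
type Field struct {
	Name     string // customizable field name
	Type     string // "text", "url" or "date"
	Selector string // assumed: selector locating the element
	Attr     string // assumed: attribute holding the link (url type only)
	Layout   string // assumed: date layout string (date type only)
}

// Validate checks that the parameters assumed to be required for the
// field's type are present.
func (f Field) Validate() error {
	if f.Name == "" {
		return fmt.Errorf("field has no name")
	}
	if f.Selector == "" {
		return fmt.Errorf("field %q: selector is required", f.Name)
	}
	switch f.Type {
	case "text":
		return nil
	case "url":
		if f.Attr == "" {
			return fmt.Errorf("field %q: url type needs an attribute", f.Name)
		}
		return nil
	case "date":
		if f.Layout == "" {
			return fmt.Errorf("field %q: date type needs a layout", f.Name)
		}
		return nil
	default:
		return fmt.Errorf("field %q: unknown type %q", f.Name, f.Type)
	}
}

func main() {
	fields := []Field{
		{Name: "headline", Type: "text", Selector: "h2"},
		{Name: "link", Type: "url", Selector: "a", Attr: "href"},
		{Name: "when", Type: "date", Selector: ".date"}, // layout missing
	}
	for _, f := range fields {
		if err := f.Validate(); err != nil {
			fmt.Println(err)
		}
	}
}
```
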
