jakopako / goskyr
A configurable command-line web scraper written in Go with auto-configuration capability
License: GNU General Public License v3.0
In some cases it might be necessary to filter out certain events, as in the case of Backstage, where we don't want the CORONA TESTZENTRUM events in our list. Therefore, add a regex-based filter for text fields (currently title & comment).
It is better to separate the code development from the 'concert scraping' use case. That way we would have a repository solely for managing which concert websites get crawled, and could also accept pull requests for it.
Add comments to this example config file to explain the config options in detail.
It would be good to 'purify' this repo by providing only a generic crawler that extracts any structured data from websites. But since GitHub Actions are very handy for executing a regular crawl with a specific configuration, there could be a separate branch just for that purpose.
The respective POST request in the GitHub Action results in a 400.
Use fmt.Errorf instead of errMsg := fmt.Sprintf and then errors.New(errMsg)
Add information about which crawler logs which log string.
Not all crawlers have filters to remove canceled or postponed concerts. Add filters where necessary.
If there are two filters with match: true then we’d expect the item to be included in the results if at least one of those filters returns true. Currently, both have to return true for the item to be included.
Add a way to cope with different formats. Issue #35 possibly solves this as well.
I think it would be better to have separate field filters
per crawler like so:
filters:
- field: "title"
regex: "some regex"
- field: "comment"
regex: "some other regex"
This would make the implementation easier since we could simply loop over the array of filter items and apply each filter to the corresponding field. On the other hand, we would still need something like a switch statement to map the field string to the corresponding field in the crawler struct... Think about this and improve the implementation.
Currently, those two config keys logically do very similar things. So it makes sense to merge their functionality into one config snippet.
Example: the location 'Strom'. There the description can be found on the event-specific subpage but does not always have the same selector. Other locations probably have the same issue. Just look at the config.yml and check locations that don't have the comment field defined.
This was fixed already but had to be removed during refactoring. The problem is smart-guessing the year of a date if it is not provided by the crawled website. An example is the Helsinki crawler. During the year this does not really matter, but around the turn of the year the dates would be wrong: the year would always be the 'old' one instead of being increased by one for dates that lie in the new year.
Currently, only existing fields can be used for filtering. Sometimes, however, it is necessary to filter on other text elements that are not used for any field. One example would be the location 'LaCigale'.
Make field names customizable. Add a field type that defines what parameters are needed to extract this field. Currently, there would be a text type, a url type, and a date type.