Wikidata-based seed URLs will probably require significant deduplication, filtering, reranking, etc., but here's a version of the query that adds the language of each URL to account for sites that have different base URLs for different languages, like Blick. It also expands the language list for the label service (because * doesn't work), but it could be generalized further. As an example of the type of filtering needed, the Hubei Daily item has three URLs: a corporate site, an e-paper, and a 404.
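SELECT DISTINCT ?item ?itemLabel ?lang ?worklang ?url WHERE {
  # Items that are newspapers (Q11032), including instances of any subclass
  ?item (wdt:P31/(wdt:P279*)) wd:Q11032;
        p:P856 ?statement.
  # Official website (P856) value of each statement
  ?statement ps:P856 ?url.
  # Language qualifier (P407) on the URL statement, as an ISO 639-3 code (P220)
  OPTIONAL {
    ?statement pq:P407 ?worklanguage.
    ?worklanguage wdt:P220 ?worklang.
  }
  # Language of the item itself, as an ISO 639-3 code
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,uk,ru,fr,es,it,ja,zh,ar,hu,pt,be,rus,ce,br,cs,sv,dk,da,he,fi,nb,id,eu,pl,nl,az,mar,lv,hr,am,ba,r". }
}
LIMIT 100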
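To turn the query results into rows that can be filtered later, something like the following Python sketch might work. It assumes the public Wikidata Query Service endpoint and the standard SPARQL JSON results format; the function names, the `FIELDS` tuple, and the placeholder User-Agent string are illustrative and not part of news-crawl.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"
# Variables selected by the query above.
FIELDS = ("item", "itemLabel", "lang", "worklang", "url")


def fetch_bindings(query, user_agent="seed-url-sketch/0.1 (placeholder contact)"):
    """Send the SPARQL query and return the raw list of result bindings.

    The user_agent default is a placeholder; replace it with a descriptive
    string that identifies your own client.
    """
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": user_agent,
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


def flatten(bindings):
    """Convert SPARQL JSON bindings into plain dicts, one per result row.

    Variables from OPTIONAL blocks are simply absent from a binding when
    they are unbound, so they come out as None here.
    """
    return [
        {field: binding.get(field, {}).get("value") for field in FIELDS}
        for binding in bindings
    ]
```

With the query text in a string `query`, `rows = flatten(fetch_bindings(query))` would give a list of dicts keyed by the selected variables, with `lang` and `worklang` set to `None` where no language is recorded.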
As of today, there are 11,177 results. More than 200 languages are represented, plus a couple of thousand sites with no language tag. The distribution looks about like what you'd expect (the two-letter codes are TLDs, not language codes, e.g. hk, ru, uk, de, au, cn):
eng 3562
fra 826
spa 586
rus 467
deu 316
ita 177
ara 168
ukr 166
fin 152
zho 146
jpn 145
swe 140
nor 122
hk 112
ru 112
por 108
hun 103
nld 93
uk 90
de 86
kor 86
au 78
cn 78
pol 66
hin 60
bel 59
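A first pass at the deduplication and a tally like the one above could be sketched in Python as follows. This is only a guess at a workable procedure, not how the counts above were actually produced: it assumes rows are dicts with `url`, `lang`, and `worklang` keys, prefers the statement-level language over the item-level one, and falls back to a two-letter TLD when no language is tagged. All function names are hypothetical. Detecting dead links such as the 404 mentioned earlier would need an HTTP check and is not covered here.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, default the path to "/", drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def dedupe(rows):
    """Keep the first row seen for each normalized URL (an arbitrary choice)."""
    seen = set()
    unique = []
    for row in rows:
        key = normalize_url(row["url"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def bucket(row):
    """Pick a tally bucket: language code if tagged, else a two-letter TLD."""
    code = row.get("worklang") or row.get("lang")
    if code:
        return code
    host = urlsplit(row["url"]).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    # Assumption: only two-letter (country-code) TLDs are used as a fallback.
    if "." in host and len(tld) == 2 and tld.isalpha():
        return tld
    return "(none)"


def tally(rows):
    """Count rows per bucket after deduplication."""
    return Counter(bucket(row) for row in dedupe(rows))
```

Calling `tally(rows).most_common()` would then list buckets in descending order of count. Keeping the first row per URL means a URL tagged with several languages is counted only once, which may or may not be what you want for seeding.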
from news-crawl.