wellynews

Aggregates community news items and RSS feeds from my hometown Wellington, New Zealand into a newslog (see https://wellington.gen.nz and @wellynews).

The content is automatically categorised and can be output as customised RSS feeds.

This is a long-running project (more than 15 years of continuous operation) and the code base has changed a lot over the years.

Currently implemented as Scala controllers served from Spring Boot. MongoDB is used for persistence and Elasticsearch for indexing.

Related services

Specific concerns have been pushed into these potentially reusable services:

  • Whakaoko for RSS feed polling and aggregation.
  • Cards for decorating news items with social media images.
  • Nominatim AC for OpenStreetMap place name lookups.
  • RSS to Twitter for automatic publishing to Mastodon and Twitter.

Model

Website

The website of a content publisher such as Wellington City Council.
Newsitems and feeds found on this website will be attributed to this publisher.

Feed

An RSS feed published by a website. News items accepted from this feed will be attributed to this feed.

Newsitem

A news item published on a publisher's website.

This is a page with a unique URL, containing for example a press release or a match report.

Watchlist

A page on a publisher's website which is known to contain links to new news items. This might be a homepage or a news page.

A watchlist is used when a publisher with interesting content does not provide a feed, but their content is valuable enough to post manually.

Watchlist items are polled regularly to detect updates; see Detecting changes (below).

Tag

Tags are used to categorise content into useful categories (like consultation and transport). Tags can be applied to content types.

Tags can be arranged into a hierarchy which affects where tagged content appears.

For example, Trains is a child of Transport.

News items about trains are included in the transport tag's news items.
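
The parent/child rule above can be sketched as a walk up the hierarchy. This is a minimal illustration, not the project's actual model: the `Tag` case class, the `tagsByName` lookup and both function names are assumptions made for the example.

```scala
import scala.annotation.tailrec

object TagHierarchy {
  // Hypothetical minimal tag model; a parent is referenced by name.
  final case class Tag(name: String, parent: Option[String] = None)

  // Walks up the parent links and returns the ancestors, nearest first.
  // The `seen` set guards against accidental cycles in the hierarchy.
  def ancestors(tag: Tag, tagsByName: Map[String, Tag]): List[Tag] = {
    @tailrec
    def loop(current: Tag, seen: Set[String], acc: List[Tag]): List[Tag] =
      current.parent.flatMap(tagsByName.get) match {
        case Some(p) if !seen.contains(p.name) => loop(p, seen + p.name, p :: acc)
        case _                                 => acc.reverse
      }
    loop(tag, Set(tag.name), Nil)
  }

  // An item tagged with `applied` appears under those tags and all of their ancestors.
  def appearsUnder(applied: Seq[Tag], tagsByName: Map[String, Tag]): Set[String] =
    applied.flatMap(t => t +: ancestors(t, tagsByName)).map(_.name).toSet
}
```

With a `trains` tag whose parent is `transport`, `appearsUnder(Seq(trains), tags)` yields both names, which is why a trains item shows up in the transport listing.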

Hand tagging

Different people (or systems) may have different opinions about what tags should be applied to an item.

A hand tagging records that a user thinks a tag should be applied to a content item.

These taggings contribute to the item's actual tagging (see Tagging votes below).

Tagging votes

Being able to automatically arrange news items into meaningful categories like consultation and transport is something we really wanted. We can infer a lot about a news item by considering where it came from and who published it.

These signals are combined in a tagging vote to determine a news item's visible tags.

For example, this news item talks about an exhibition at a cafe in Newtown.

News items tags

The tagging votes show how we arrived at this set of tags.

Tagging votes

  • We know which suburb it's in because the publisher is tagged with Newtown.
  • We have a geotagged location because the publisher has a geotag.
  • We know it's about an exhibition because it was accepted from a feed tagged exhibition.
  • We know it's about art because exhibitions is a child of art.
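
One way to picture the votes listed above is as small records which each carry a tag and the reason it was suggested. The sketch below is illustrative only: the vote class names are hypothetical, and the rule that any tag with at least one vote becomes visible is an assumption of the example rather than a description of the real scoring.

```scala
object TaggingVotes {
  // Hypothetical vote types, named after the vote sources described in this README.
  sealed trait TaggingVote { def tag: String; def reason: String }
  final case class HandTagging(tag: String, user: String) extends TaggingVote {
    def reason: String = s"applied by $user"
  }
  final case class PublisherVote(tag: String, publisher: String) extends TaggingVote {
    def reason: String = s"publisher $publisher is tagged $tag"
  }
  final case class FeedVote(tag: String, feed: String) extends TaggingVote {
    def reason: String = s"accepted from feed $feed tagged $tag"
  }
  final case class AncestorVote(tag: String, child: String) extends TaggingVote {
    def reason: String = s"$child is a child of $tag"
  }

  // Assumption: every tag with at least one vote is visible.
  // Votes are grouped by tag so that each visible tag can be explained.
  def visibleTags(votes: Seq[TaggingVote]): Map[String, Seq[String]] =
    votes.groupBy(_.tag).map { case (tag, vs) => tag -> vs.map(_.reason) }
}
```

Keeping the reason alongside each vote is what makes a "how did we arrive at these tags" display possible.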

Index tags

Index tags determine which tags a given resource appears under. For example, an item tagged trains will also appear under transport. Some slightly interesting things happen when automatically calculating the index tags.

We can infer a lot about an item's tags from what we know about where it came from.

Hand tagging

Represents a tag applied directly by a user.

Publisher tags

News items inherit tags from their publisher. For well categorised publishers such as transport operators or sports clubs this approach can tag their newsitems with a high level of confidence.

Feed tags

News items accepted from a feed inherit the hand taggings applied to the feed.

Some publishers have multiple feeds which each cover a very specific topic (such as the city council's planning applications). In this case a feed tagging can very accurately tag the newsitems from that feed.

Ancestor tags

Includes the ancestors of applied tags. Being tagged as Trains implies that this newsitem is related to Transport.

Geotag votes

Votes which contribute to the visible location of a newsitem. This could be an explicit geocode, or it could be inferred from the news item's tags or publisher.

Autotagging

When newsitem text matches keywords associated with specific tags we apply an autotagging. This is represented as a hand tagging applied by the autotagger user.

RSS feed item categories

If an RSS feed item contains RSS category tags, the autotagger will try to match the values of these category tags to each tag's autotag hints.

An item with a category of 'events' will be matched to the tag Events.
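
The category matching described above might look something like the following. This is a sketch under stated assumptions: the `Tag` shape with an `autotagHints` set is hypothetical, and case-insensitive exact matching is a guess at reasonable behaviour rather than the autotagger's actual rule.

```scala
object CategoryAutotagger {
  // Hypothetical shape: a tag together with the hints used to autotag it.
  final case class Tag(name: String, autotagHints: Set[String])

  private def normalise(s: String): String = s.trim.toLowerCase

  // Returns the tags whose hints match any of the feed item's RSS category values.
  // Matching is exact after trimming and lower-casing (an assumption of this sketch).
  def matchCategories(categories: Seq[String], tags: Seq[Tag]): Seq[Tag] = {
    val wanted = categories.map(normalise).toSet
    tags.filter(_.autotagHints.map(normalise).exists(wanted.contains))
  }
}
```

A matched tag would then be recorded as a hand tagging applied by the autotagger user, as described under Autotagging.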

Detecting changes

Changes in pages can be detected by periodically downloading and checking them. Changes in content checksums indicate potential new content.

Pages often contain elements such as timestamps which make a page's checksum unstable even if it contains no new content. Comparing only the plain text content of a page helps to reduce these false positives.
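
As a rough illustration of the idea, the sketch below checksums only the visible text of a page. The regex-based text extraction is a deliberate simplification and an assumption of this example; a real implementation would more likely use an HTML parser. The object and function names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object PageChecksum {
  // Crude plain text extraction: drop script and style blocks, then all tags,
  // then collapse runs of whitespace.
  def plainText(html: String): String =
    html
      .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
      .replaceAll("(?s)<[^>]*>", " ")
      .replaceAll("\\s+", " ")
      .trim

  // SHA-256 hex digest of the plain text content only.
  def checksum(html: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(plainText(html).getBytes(StandardCharsets.UTF_8))
      .map("%02x".format(_))
      .mkString

  // True when the page's text differs from what was seen on the previous poll.
  def hasChanged(previousChecksum: String, html: String): Boolean =
    previousChecksum != checksum(html)
}
```

Two downloads which differ only in markup attributes or inline scripts produce the same checksum, so they are not reported as changes. A timestamp rendered in the visible text would still trigger one.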

Feed reading

Most news items are accepted from publishers' RSS feeds. Feed reading is the process of periodically polling these feeds and deciding which items to accept and distribute.

Feed acceptance policy

Each publisher feed is assigned a feed acceptance policy which describes how we should treat feed items in that particular feed.

Most feeds always contain relevant and appropriate content which can be automatically accepted; some don't. The feed acceptance policy helps document which feeds require manual moderation.

ACCEPT

A trusted source of relevant content. All items can be automatically accepted.

ACCEPT_EVEN_WITHOUT_DATES

Accept items even when they have no publication date.

ACCEPT_IGNORING_DATE

Trusted sources with good content but questionable publication dates. These feed items can be automatically accepted, but we ignore the publication date.

SUGGEST

Feeds with a mix of relevant and irrelevant content. New feed items should be suggested for manual moderation. The contents of suggested feeds appear on the feeds inbox screen (below).

IGNORE

Feeds with no relevant content at the moment. Ignore the contents of these feeds.
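
The policies above can be read as a small decision table. The sketch below is one possible encoding, with hypothetical `Decision` and `FeedItem` types. Note two assumptions: that plain ACCEPT skips undated items (inferred from the existence of ACCEPT_EVEN_WITHOUT_DATES, not stated in this README), and that ignoring a date means accepting the item with no date carried over from the feed.

```scala
import java.time.Instant

object FeedAcceptance {
  sealed trait FeedAcceptancePolicy
  case object ACCEPT extends FeedAcceptancePolicy
  case object ACCEPT_EVEN_WITHOUT_DATES extends FeedAcceptancePolicy
  case object ACCEPT_IGNORING_DATE extends FeedAcceptancePolicy
  case object SUGGEST extends FeedAcceptancePolicy
  case object IGNORE extends FeedAcceptancePolicy

  // Hypothetical outcome types for this sketch.
  sealed trait Decision
  final case class AutoAccept(date: Option[Instant]) extends Decision
  case object SuggestForModeration extends Decision
  case object Skip extends Decision

  // Hypothetical minimal feed item.
  final case class FeedItem(url: String, date: Option[Instant])

  def decide(policy: FeedAcceptancePolicy, item: FeedItem): Decision = policy match {
    case ACCEPT =>
      item.date match {
        case Some(d) => AutoAccept(Some(d))
        case None    => Skip // assumption: undated items need the more lenient policy
      }
    case ACCEPT_EVEN_WITHOUT_DATES => AutoAccept(item.date)
    case ACCEPT_IGNORING_DATE      => AutoAccept(None) // the feed's date is discarded
    case SUGGEST                   => SuggestForModeration
    case IGNORE                    => Skip
  }
}
```

Modelling the policy as a sealed trait means the compiler can warn if a new policy is added without a matching case.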

Accepted feed items view

Shows the news items which have been accepted from feeds on a particular day. This is useful for moderation and discovering items which could benefit from having additional tags applied.

Accepted feed items

Feeds inbox

The feeds inbox shows the feed items currently available in suggested feeds.

If a feed contains a mix of relevant and irrelevant items, we can't automatically accept all items from it.

The feeds inbox screen is used to quickly scan the feed items available in the feeds with a suggest acceptance policy.

Relevant items can be manually accepted using the accept action.

Feeds inbox

Promotion of recently accepted news items

The main news items page and RSS feed contain the 30 most recent news items ordered by publication date.

Reviewing the suggested feed items is an infrequent activity, and good content may miss out on being surfaced in the main feed if its publication date is not recent.

If a news item is manually accepted from a suggested feed, and its publication date is too old to show in the main feed but falls within the last 2 weeks, it will be appended to the main feed for 1 day.

This makes manually accepted content visible to feed consumers.
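
The promotion rule can be expressed as three date comparisons. The sketch below is an interpretation, not the project's code: the `AcceptedItem` shape and the `oldestInMainFeed` parameter are hypothetical, and it assumes that "the last 2 weeks" is measured from the current time and that the 1 day of promotion starts at the moment of acceptance.

```scala
import java.time.{Duration, Instant}

object Promotion {
  // Hypothetical minimal shape of a manually accepted news item.
  final case class AcceptedItem(published: Instant, accepted: Instant)

  val PromotionWindow: Duration = Duration.ofDays(14) // "within the last 2 weeks"
  val PromotionLength: Duration = Duration.ofDays(1)  // "for 1 day"

  // `oldestInMainFeed` is the publication date of the oldest of the 30 items
  // currently in the main feed; anything older would not normally be shown.
  def isPromoted(item: AcceptedItem, oldestInMainFeed: Instant, now: Instant): Boolean = {
    val tooOldForMainFeed = item.published.isBefore(oldestInMainFeed)
    val publishedRecently = !item.published.isBefore(now.minus(PromotionWindow))
    val stillPromoted = now.isBefore(item.accepted.plus(PromotionLength))
    tooOldForMainFeed && publishedRecently && stillPromoted
  }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the rule easy to test.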

RSS output

Example output RSS feed item including tags, media element and geotag.

Example RSS feed item

Social media Cards / Open Graph images

News items are decorated with Twitter Cards and Open Graph social media images using the Cards service.

We try to detect and filter out images which are a publisher's generic filler images.

For example, a publisher's generic logo should not be included, but an article-specific image should be.

Duplicate images

Admin actions

Backfill new tag

The autotag prompt allows a new tag to be backfilled with existing news items which match the new tag's autotagging rules.

Gather publisher resources

Given a publisher, find unassigned newsitems and feeds which probably belong to this publisher.

This decision is based on URL hostnames.
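
A hostname comparison of this kind could be sketched as follows. The function names are hypothetical, and exact hostname equality is an assumption of the example; the real matching may be looser (for instance treating `www.` variants as the same site).

```scala
import java.net.URI
import scala.util.Try

object PublisherMatcher {
  // Extracts a lower-cased hostname, or None for malformed or host-less URLs.
  def hostname(url: String): Option[String] =
    Try(new URI(url.trim)).toOption.flatMap(u => Option(u.getHost)).map(_.toLowerCase)

  // Returns the unassigned resource URLs which share a hostname with the publisher's URL.
  def probablyBelongsTo(publisherUrl: String, unassignedUrls: Seq[String]): Seq[String] =
    hostname(publisherUrl) match {
      case Some(host) => unassignedUrls.filter(u => hostname(u).contains(host))
      case None       => Seq.empty
    }
}
```

Malformed URLs simply fail to match rather than throwing, which suits a batch admin action run over old data.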

Local dev

Use Docker to provide local copies of the MongoDB, Elasticsearch, Memcached and RabbitMQ dependencies.

docker compose -f docker/docker-compose.yml up

Start locally.

mvn spring-boot:run

Cloud build

gcloud components install cloud-build-local
cloud-build-local --config=cloudbuild.yaml --dryrun=false --push=false .

wellynews's Issues

Allow pagination of feed items

For example, so that we can go back and accept the original library closure announcement from the WCL Blog.
This link is no longer in the head of the WCL feed but it will be in the persisted long tail in the Whakaoko feed.
Whakaoko may need to surface a total items count to populate the pagination UI.

Record locations of sitemaps seen in robots.txt while link checking

Are sitemaps useful for locating press releases on sites with no RSS?
Can we discover sitemaps?

Yes. robots.txt has been able to specify sitemap locations since 2007.
For example, https://wellington.govt.nz/robots.txt

Sitemap: sitemap.xml

User-agent: *
Disallow: /have-your-say/epetitions/view-signatures/
Disallow: /search/
Disallow: /~/search/

Allow: /

In the WCC example, when filtered by a known path prefix and decorated with each page's title or og:title, a sparse set of news items could be extracted.

2nd example:

<url>
<loc>
https://pataka.org.nz/whats/exhibitions/here-kupe-cook/
</loc>
<lastmod>2022-08-02</lastmod>
</url>

The lastmod date is not the publication date, but the og metadata is good.

It's probably useful for the link checker to record sitemap locations, and for these to be potentially fed into a sitemap based scraper in brownbag.
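
Recording sitemap locations could start with something as small as the sketch below, which pulls `Sitemap:` directives out of a robots.txt body. The object and function names are hypothetical. Relative values, as in the WCC example, are resolved against the robots.txt URL, which is an assumption about how such values should be treated.

```scala
import java.net.URI
import scala.util.Try

object RobotsSitemaps {
  // Finds `Sitemap:` directives (case-insensitive) and resolves them against the
  // robots.txt URL, so that relative values like `sitemap.xml` become absolute.
  def sitemapUrls(robotsTxtUrl: String, robotsTxt: String): List[String] = {
    val base = new URI(robotsTxtUrl)
    robotsTxt.linesIterator
      .map(_.trim)
      .filter(_.toLowerCase.startsWith("sitemap:"))
      .map(_.drop("sitemap:".length).trim)
      .filter(_.nonEmpty)
      .flatMap(value => Try(base.resolve(value).toString).toOption)
      .toList
  }
}
```

Values which cannot be parsed as URIs are dropped silently here; a link checker might prefer to log them.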

Title guessing from the HTML title tag is interesting, but what about the social meta tags?

<meta property="og:site_name" content="WELLINGTON UNITED"/>
    <meta property="og:title" content="HARD FOUGHT WIN GIVES UNITED FOUR IN A ROW"/>
    <meta property="og:description" content="The first half of the season, Wellington United couldn&#039;t get a win for  trying, with six single goal defeats seeing them rooted to the bottom of the table despite having a positive goal difference...."/>

Null pointer in feed is blocking all feed reading

2010-12-05 11:25:57,557 INFO [nz.co.searchwellington.feeds.rss.RssHttpFetcher] - Fetching rss from live url: http://feeds.feedburner.com/wgtnywca

marlow:/var/lib/tomcat5.5# less /var/lib/tomcat5.5/logs/searchwellington.log

lass [class nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher$$EnhancerByCGLIB$$cabab98f] failed
java.lang.NullPointerException
at nz.co.searchwellington.feeds.LiveRssfeedNewsitemService.extractThumbnail(LiveRssfeedNewsitemService.java:126)
at nz.co.searchwellington.feeds.LiveRssfeedNewsitemService.extractNewsitemFromFeedEntire(LiveRssfeedNewsitemService.java:96)
at nz.co.searchwellington.feeds.LiveRssfeedNewsitemService.getFeedNewsitems(LiveRssfeedNewsitemService.java:63)
at nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher.loadAndCacheFeedNewsitems(RssNewsitemPrefetcher.java:87)
at nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher.run(RssNewsitemPrefetcher.java:61)
at nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher$$FastClassByCGLIB$$18c82b60.invoke()
at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:149)
at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:695)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:139)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:107)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:630)
at nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher$$EnhancerByCGLIB$$cabab98f.run()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:283)
at org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:272)
at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:529)
2010-12-05 11:25:58,193 INFO [org.quartz.core.JobRunShell] - Job DEFAULT.rssPrefetcherRun threw a JobExecutionException:
org.quartz.JobExecutionException: Invocation of method 'run' on target class [class nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher$$EnhancerByCGLIB$$cabab98f] failed [See nested exception: java.lang.NullPointerException]
at org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:287)
at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:529)
Caused by: java.lang.NullPointerException
at nz.co.searchwellington.feeds.LiveRssfeedNewsitemService.extractThumbnail(LiveRssfeedNewsitemService.java:126)
at nz.co.searchwellington.feeds.LiveRssfeedNewsitemService.extractNewsitemFromFeedEntire(LiveRssfeedNewsitemService.java:96)
at nz.co.searchwellington.feeds.LiveRssfeedNewsitemService.getFeedNewsitems(LiveRssfeedNewsitemService.java:63)
at nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher.loadAndCacheFeedNewsitems(RssNewsitemPrefetcher.java:87)
at nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher.run(RssNewsitemPrefetcher.java:61)
at nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher$$FastClassByCGLIB$$18c82b60.invoke()
at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:149)
at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:695)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:139)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:107)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:630)
at nz.co.searchwellington.feeds.rss.RssNewsitemPrefetcher$$EnhancerByCGLIB$$cabab98f.run()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:283)
at org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:272)
... 3 more

Unblock Spring Boot updates

Spring Boot 1.2.8 was the last release which supported Velocity templates.
This Boot release uses Velocity view resolver classes from spring-webmvc-4.1.9.RELEASE.jar

The removal of Velocity classes from webmvc in Spring 5 is the problem.
We need to back port a VelocityViewResolver to work with Spring 5 / Spring Boot 2.

Rabbit consumers do not recover from Rabbit restart

This happens because these are in-memory queues, which disappear when Rabbit restarts.
They are recreated when next published to.
The consumers need to poll for the queue rather than stop.

2023-08-07T09:19:13.368Z ERROR 1 --- [On Write Thread] c.r.c.impl.ForgivingExceptionHandler     : Caught an exception when recovering topology Caught an exception while recovering consumer amq.ctag-CboLLsOfKCCU0y1yWJFo6g: null

com.rabbitmq.client.TopologyRecoveryException: Caught an exception while recovering consumer amq.ctag-CboLLsOfKCCU0y1yWJFo6g: null
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConsumer(AutorecoveringConnection.java:863) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverTopology(AutorecoveringConnection.java:727) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:597) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:519) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:821) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:798) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQConnection.handleIoError(AMQConnection.java:772) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$null$1(AutorecoveringConnection.java:138) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
Caused by: java.io.IOException: null
        at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:129) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:125) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.ChannelN.basicConsume(ChannelN.java:1384) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.recovery.RecordedConsumer.recover(RecordedConsumer.java:60) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.wrapRetryIfNecessary(AutorecoveringConnection.java:909) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.internalRecoverConsumer(AutorecoveringConnection.java:884) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConsumer(AutorecoveringConnection.java:859) ~[amqp-client-5.17.0.jar!/:5.17.0]
        ... 8 common frames omitted
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no queue 'wellynewselasticindex' in vhost '/', class-id=60, method-id=20)
        at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:66) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:502) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.ChannelN.basicConsume(ChannelN.java:1378) ~[amqp-client-5.17.0.jar!/:5.17.0]
        ... 12 common frames omitted
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no queue 'wellynewselasticindex' in vhost '/', class-id=60, method-id=20)
        at com.rabbitmq.client.impl.ChannelN.asyncShutdown(ChannelN.java:517) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.ChannelN.processAsync(ChannelN.java:341) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQChannel.handleCompleteInboundCommand(AMQChannel.java:182) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQChannel.handleFrame(AMQChannel.java:114) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:743) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:47) ~[amqp-client-5.17.0.jar!/:5.17.0]
        at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:670) ~[amqp-client-5.17.0.jar!/:5.17.0]

Feed page map does not display pins

/coastguard-mana news items have latitude and longitude in the RSS feed and in the main content panel.
The map has no pins and is centered on the default location.
The map should instead be fitted to a bounding box around the pins.
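
For reference, a bounding box around a set of pins is just the minimum and maximum of their coordinates. The sketch below uses hypothetical `LatLong` and `BoundingBox` types and ignores the antimeridian, which is an acceptable simplification for pins around Wellington.

```scala
object MapBounds {
  final case class LatLong(latitude: Double, longitude: Double)
  final case class BoundingBox(southWest: LatLong, northEast: LatLong)

  // Returns None when there are no pins, so the caller can fall back to the default location.
  def boundingBox(pins: Seq[LatLong]): Option[BoundingBox] =
    if (pins.isEmpty) None
    else {
      val lats = pins.map(_.latitude)
      val longs = pins.map(_.longitude)
      Some(BoundingBox(LatLong(lats.min, longs.min), LatLong(lats.max, longs.max)))
    }
}
```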
