blaggregator's People

Contributors

akaptur, alliejones, danluu, davidbalbert, dependabot[bot], esommer, graue, kenyavs, nnja, pnf, porterjamesj, puercopop, punchagan, santialbo, strickinato, strugee, sursh, thomasboyt, zachallaun

blaggregator's Issues

README out of date

The admin login info is out of date given the oauth changes. Need to add shell instructions to make your account a superuser.

AmbiguousTimeError on crawl

A handful of blogs are throwing the following error when crawled in production. It was only three blogs on last crawl, so this is not super high priority.

2014-11-02 01:00:00
Traceback (most recent call last):
  File "/app/home/management/commands/crawlposts.py", line 86, in handle_noargs
    self.crawlblog(blog)
  File "/app/home/management/commands/crawlposts.py", line 50, in crawlblog
    date = timezone.make_aware(date, timezone.get_default_timezone())
  File "/app/.heroku/python/lib/python2.7/site-packages/django/utils/timezone.py", line 280, in make_aware
    return timezone.localize(value, is_dst=None)
  File "/app/.heroku/python/lib/python2.7/site-packages/pytz/tzinfo.py", line 349, in localize
    raise AmbiguousTimeError(dt)
AmbiguousTimeError: 2014-11-02 01:00:00
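
The timestamp in the traceback is consistent with a wall-clock time that occurs twice when daylight saving time ends. The following is a minimal sketch of detecting and resolving such a time. It uses Python 3's standard zoneinfo module rather than the pytz call shown in the traceback, and the America/New_York zone and both helper names are assumptions, since the issue does not state the site's default timezone.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: the issue does not name the default timezone.
ASSUMED_TZ = ZoneInfo("America/New_York")


def is_ambiguous(naive, tz=ASSUMED_TZ):
    """Return True if the naive wall-clock time occurs twice in tz."""
    first = naive.replace(tzinfo=tz, fold=0).utcoffset()
    second = naive.replace(tzinfo=tz, fold=1).utcoffset()
    # In a repeated hour the first occurrence has the larger UTC offset.
    return first > second


def make_aware_lenient(naive, tz=ASSUMED_TZ):
    """Attach tz, picking the later (standard-time) occurrence when ambiguous."""
    return naive.replace(tzinfo=tz, fold=1 if is_ambiguous(naive, tz) else 0)


crawled = datetime(2014, 11, 2, 1, 0, 0)  # the value from the traceback
aware = make_aware_lenient(crawled)
```

With pytz itself, passing is_dst=False to localize() instead of is_dst=None makes the same choice without raising.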

Gather the most-viewed blog posts of the past week

Write a script that can be run weekly. It grabs the last week of LogEvents and tallies up the visits. Outputs a CSV to be emailed. Maybe with a structure like this:

post_id, post_title, post_url, visits

This is the first step toward weekly digest emails out to alums.
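
A rough sketch of the tally and CSV step, assuming each LogEvent can be reduced to a (post_id, post_title, post_url) tuple. The sample events and the weekly_report name are hypothetical; a real script would query the last week of LogEvents instead.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for a week of LogEvents: (post_id, post_title, post_url).
events = [
    (1, "First post", "http://example.com/first"),
    (2, "Second post", "http://example.com/second"),
    (1, "First post", "http://example.com/first"),
]


def weekly_report(events):
    """Return CSV text with one row per post, most-visited first."""
    visits = Counter(events)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["post_id", "post_title", "post_url", "visits"])
    for (post_id, title, url), count in visits.most_common():
        writer.writerow([post_id, title, url, count])
    return out.getvalue()


report = weekly_report(events)
```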

Oauth login fails from blaggregator.herokuapp.com

Here's what I get when I click the new login button:

An error has occurred

The redirect uri included is not valid.

Works fine from blaggregator.us, though. If this is part of a plan to get people to use the new URL, I suspect you could figure out a more direct way to do it :-).

Investigate timezone warning

/Users/sasha/code/blaggregator/lib/python2.7/site-packages/django/db/models/fields/__init__.py:827: RuntimeWarning: DateTimeField received a naive datetime (2013-03-04 22:31:00) while time zone support is active.

Make two separate crawler scripts

The current crawler script pulls the publication date from the post itself, so that the posts are correctly ordered when a new blog is registered. Otherwise, the entire history of one blog would appear to be the time when the blog was added to Blaggregator, which is confusing.

Now that there is a high volume of posts, someone can start a draft, take two days to post it, and find that when they do post it, it's already buried on the second page of Blaggregator listings. It also doesn't get Zuliped out, as there is a filter for posts <2 days old.

Solution: continue to use existing crawlposts.py to crawl posts on their initial add. Create a second script that looks for new posts every 10 minutes, and timestamps each new post it sees as datetime.now.

That way, a post will be at the top of Blaggregator and Zuliped when it becomes public, which is what users are expecting. This should obviously be throttled, in case a blog goes offline for a few days and then suddenly comes back.
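
The proposed split could be sketched as follows. The function names and the throttle value are hypothetical; only the rule itself (feed date on the initial add, current time for later discoveries, with a cap per crawl) comes from the description above.

```python
from datetime import datetime, timedelta

MAX_NEW_POSTS_PER_CRAWL = 2  # hypothetical throttle value


def choose_timestamp(published, initial_crawl, now):
    """Keep the feed's date on the first crawl; stamp later finds with now."""
    return published if initial_crawl else now


def posts_to_announce(new_posts, limit=MAX_NEW_POSTS_PER_CRAWL):
    """Throttle: announce at most `limit` posts from one blog per crawl."""
    return new_posts[:limit]


now = datetime(2015, 1, 10, 12, 0)
drafted = now - timedelta(days=2)  # draft started two days before publishing
```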

Clear out Blogs that don't parse

Each registered blog is pinged, in order. This is the slowest (and most expensive) part of the crawling, and the biggest reason why Blaggregator costs me $30/mo to run. There are a LOT of blogs that aren't parsing, which really slows down the crawling. I've been ignoring this, but as blaggregator grows this is becoming a bigger issue.

There are several possible ways to do this. One way:

  • each time a blog doesn't parse successfully, increment a counter
  • after 3 days, email the blog owner with some troubleshooting suggestions (does it exist, does it parse, watch out for redirects, etc)
  • after 6 days, remove the blog and email the blog owner (I believe this would also remove all associated Posts as well)
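
The escalation above can be sketched as a small decision function. This version tracks the time of the first consecutive failure instead of a raw counter, which is a substitution made here for simplicity; the action names are hypothetical.

```python
from datetime import datetime, timedelta

WARN_AFTER = timedelta(days=3)
REMOVE_AFTER = timedelta(days=6)


def next_action(first_failure, now):
    """Decide what to do with a blog that has been failing since first_failure.

    first_failure is None when the most recent crawl parsed successfully.
    """
    if first_failure is None:
        return "ok"
    failing_for = now - first_failure
    if failing_for >= REMOVE_AFTER:
        return "remove_and_email"
    if failing_for >= WARN_AFTER:
        return "email_troubleshooting"
    return "keep_counting"


now = datetime(2015, 1, 10)
```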

Thoughts?

All links have 500's when clicked from

@punchagan Links to view blog posts work great from blaggregator.us, but ALL links are broken when coming from Zulip. I'm at the end of my time allotted to work on this today so I haven't looked at this in detail. Going to roll back now.

Here is a portion of the logs; I believe this is 1.5 of these errors. It's a little hard to tell since Heroku doesn't record | display logs in order.

2015-03-08T23:02:10.017844+00:00 app[web.1]: Traceback (most recent call last):
2015-03-08T23:02:10.017838+00:00 app[web.1]: Internal Server Error: /post/ThNclE/view
2015-03-08T23:02:10.017850+00:00 app[web.1]:   File "/app/home/views.py", line 60, in view_post
2015-03-08T23:02:10.017848+00:00 app[web.1]:     response = callback(request, *callback_args, **callback_kwargs)
2015-03-08T23:02:10.017854+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/manager.py", line 149, in create
2015-03-08T23:02:10.017859+00:00 app[web.1]:     obj.save(force_insert=True, using=self.db)
2015-03-08T23:02:10.017852+00:00 app[web.1]:     user_agent=request.META.get('HTTP_USER_AGENT', None),
2015-03-08T23:02:10.017855+00:00 app[web.1]:     return self.get_query_set().create(**kwargs)
2015-03-08T23:02:10.017862+00:00 app[web.1]:     force_update=force_update, update_fields=update_fields)
2015-03-08T23:02:10.017873+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 937, in execute_sql
2015-03-08T23:02:10.017867+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/manager.py", line 215, in _insert
2015-03-08T23:02:10.017863+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py", line 650, in save_base
2015-03-08T23:02:10.017872+00:00 app[web.1]:     return query.get_compiler(using=using).execute_sql(return_id)
2015-03-08T23:02:10.017870+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 1661, in insert_query
2015-03-08T23:02:10.017846+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
2015-03-08T23:02:10.017878+00:00 app[web.1]:     six.reraise(utils.IntegrityError, utils.IntegrityError(*tuple(e.args)), sys.exc_info()[2])
2015-03-08T23:02:10.017865+00:00 app[web.1]:     result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
2015-03-08T23:02:10.017875+00:00 app[web.1]:     cursor.execute(sql, params)
2015-03-08T23:02:10.017876+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 56, in execute
2015-03-08T23:02:10.017885+00:00 app[web.1]: DETAIL:  Failing row contains (29, 6677, 2015-03-08 23:02:10.010438+00, null, 10.123.66.198, Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.3...).
2015-03-08T23:02:10.017869+00:00 app[web.1]:     return insert_query(self.model, objs, fields, **kwargs)
2015-03-08T23:02:10.017883+00:00 app[web.1]: IntegrityError: null value in column "referer" violates not-null constraint
2015-03-08T23:02:10.017860+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py", line 546, in save
2015-03-08T23:02:10.017881+00:00 app[web.1]:     return self.cursor.execute(query, args)
2015-03-08T23:02:10.017886+00:00 app[web.1]: 
2015-03-08T23:02:10.017880+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 54, in execute
2015-03-08T23:02:20.749020+00:00 heroku[router]: at=info method=GET path="/post/UHZQQj/view" host=www.blaggregator.us request_id=53f98df7-cb9f-400e-a04a-00e6ea1dd86b fwd="24.193.114.250" dyno=web.1 connect=1ms service=60ms status=500 bytes=225
2015-03-08T23:02:20.744323+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
2015-03-08T23:02:20.744330+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/manager.py", line 149, in create
2015-03-08T23:02:20.744333+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 402, in create
2015-03-08T23:02:20.744339+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py", line 650, in save_base
2015-03-08T23:02:20.744342+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/manager.py", line 215, in _insert
2015-03-08T23:02:20.744349+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 937, in execute_sql
2015-03-08T23:02:20.744353+00:00 app[web.1]:     six.reraise(utils.IntegrityError, utils.IntegrityError(*tuple(e.args)), sys.exc_info()[2])
2015-03-08T23:02:20.744356+00:00 app[web.1]:     return self.cursor.execute(query, args)
2015-03-08T23:02:20.744358+00:00 app[web.1]: IntegrityError: null value in column "referer" violates not-null constraint
2015-03-08T23:02:20.744332+00:00 app[web.1]:     return self.get_query_set().create(**kwargs)
2015-03-08T23:02:20.744328+00:00 app[web.1]:     user_agent=request.META.get('HTTP_USER_AGENT', None),
2015-03-08T23:02:20.744335+00:00 app[web.1]:     obj.save(force_insert=True, using=self.db)
2015-03-08T23:02:20.744338+00:00 app[web.1]:     force_update=force_update, update_fields=update_fields)
2015-03-08T23:02:20.744341+00:00 app[web.1]:     result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
2015-03-08T23:02:20.744346+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 1661, in insert_query
2015-03-08T23:02:20.744301+00:00 app[web.1]: Internal Server Error: /post/UHZQQj/view
2015-03-08T23:02:20.744306+00:00 app[web.1]: Traceback (most recent call last):
2015-03-08T23:02:20.744360+00:00 app[web.1]: DETAIL:  Failing row contains (30, 6593, 2015-03-08 23:02:20.735531+00, null, 10.123.66.198, Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.3...).
2015-03-08T23:02:20.744325+00:00 app[web.1]:     response = callback(request, *callback_args, **callback_kwargs)
2015-03-08T23:02:20.744327+00:00 app[web.1]:   File "/app/home/views.py", line 60, in view_post
2015-03-08T23:02:20.744344+00:00 app[web.1]:     return insert_query(self.model, objs, fields, **kwargs)
2015-03-08T23:02:20.744336+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py", line 546, in save
2015-03-08T23:02:20.744350+00:00 app[web.1]:     cursor.execute(sql, params)
2015-03-08T23:02:20.744347+00:00 app[web.1]:     return query.get_compiler(using=using).execute_sql(return_id)
2015-03-08T23:02:20.744361+00:00 app[web.1]: 
2015-03-08T23:02:20.744352+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 56, in execute
2015-03-08T23:02:20.744355+00:00 app[web.1]:   File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 54, in execute
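
The failing rows show null in the referer column, which suggests the view stores None when a request arrives without a Referer header. A minimal sketch of one possible fix is to default to an empty string. The log_fields helper is hypothetical, and a plain dict stands in for Django's request.META.

```python
def log_fields(meta):
    """Build log-entry fields from a request.META-like mapping.

    Missing headers become empty strings so NOT NULL columns accept them.
    """
    return {
        "referer": meta.get("HTTP_REFERER", ""),
        "user_agent": meta.get("HTTP_USER_AGENT", ""),
    }


# A request that carries a user agent but no Referer header.
fields = log_fields({"HTTP_USER_AGENT": "Mozilla/5.0"})
```

The alternative would be to allow nulls in the referer column, which needs a schema migration.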

Avatars are broken

When Sonali updated everyone's avatars for this batch, the URLs all changed. We currently hotlink to their avatar images rather than storing them ourselves (that's another Issue entirely) so they're all broken.

Solution: write a web scraping script to run once to pull all the URLs from hackerschool.com/private and update the avatar URLs in our database.

Follow HS's blog

Add a dummy 'HS' user and their blog feed so official posts are tracked.

Blog-post titles contain nasty HTML codes

Updating at Sasha's request on 20130719.

Summary

Some blog-post titles display certain Unicode symbols incorrectly. The symbols involved are those for which there are special HTML escapes. The Blaggregator is receiving these symbols correctly as Unicode strings but for some reason is converting them to HTML escapes, and then after that they are appearing literally as ASCII representations of those escapes, in which the ampersand element is being rendered as &amp;. This last issue is the reason for the incorrect appearance of the symbols on the website; if well-formed HTML escapes were being written to HTML, browsers would probably display them correctly.

This is not happening because of feedergrabber27.py or feedparser.parse(), both of which are passing Unicode strings correctly.

Example 1, the greater-than symbol:

In [1]: text = u'Now filtering (> /dev/null) some spam before it reaches the Gmail Spam folder'

In [2]: text
Out[2]: u'Now filtering (> /dev/null) some spam before it reaches the Gmail Spam folder'

In [3]: print text
Now filtering (> /dev/null) some spam before it reaches the Gmail Spam folder

It appears on the Blaggregator at http://blaggregator.herokuapp.com/post/4ruqWY/ (the HTML contains &amp;gt; rather than &gt; or >). Appears normally on WordPress: http://brannerchinese.wordpress.com/2013/06/26/freeman-halton-3x3-exact-test/

Example 2, the cross-product symbol:

In [4]: another = u'Freeman-Halton 3\xd73 exact test'

In [5]: another
Out[5]: u'Freeman-Halton 3\xd73 exact test'

In [6]: print another
Freeman-Halton 3×3 exact test

It appears on the Blaggregator at http://blaggregator.herokuapp.com/post/GQrsuX/ (the HTML contains &amp;#215; rather than &#215; or ×). Appears normally on WordPress: http://brannerchinese.wordpress.com/2013/07/07/now-filtering-devnull-some-spam-before-it-reaches-the-gmail-spam-folder/

Important: Note that the Blaggregator's HTML does not contain the actual HTML escapes, which would probably display correctly; it contains an ASCII rendering of the HTML escapes, with the ampersand element of each escape replaced by &amp;.
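
One way output like this can arise is escaping applied twice. The sketch below only reproduces the symptom with Python 3's standard html module; it does not identify where in Blaggregator the second escape happens, and normalize_title is a hypothetical helper.

```python
import html

title = "Now filtering (> /dev/null) some spam"

escaped_once = html.escape(title)          # what a page should contain
escaped_twice = html.escape(escaped_once)  # the symptom described above


def normalize_title(raw):
    """Decode entities already present so the title is escaped only once later."""
    return html.unescape(raw)
```

If titles arrive already escaped, decoding them before storage leaves a single escape at render time.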

Even if the Blaggregator were outputting Unicode, the Blaggregator page is not configured to display Unicode.

The HTML should contain the line

 <meta charset="UTF-8">

within <head>. At the moment it does not.

The nature of this updated error description

Previously, I had thought that this was a problem in feedergrabber27.py. I am now able to demonstrate that it is not.

[end]

Post periodic reminder to 'blogging' stream

Now that #67 is deployed, new folks to the blogging stream on Zulip won't know about the blaggregator.us site, and won't know how to add their own blogs.

A periodic (perhaps monthly?) reminder to the stream about the site, how to add or edit their blogs, how to contribute, etc, might be useful for everyone.

cc @punchagan

Feature request: remove a blog

Am I failing to find the UI element for this, or is there really no way to do it? I changed domains, and added my "new" blog, and now it seems to exist twice, which caused my most recent blog post to show up twice. I may try to figure out how to add the feature myself when I have time, but I know neither programming language used for blaggregator.

Also, for reasons that might be related (???), my second most recent blog post shows up more than just twice. Not sure what's going on there.

Blog dates being parsed incorrectly?

Whenever I look at Blaggregator, it's always the same articles at the top.

It looks like this is happening when blogs set their date as being in the future, thus nothing current will ever pass them.

Could Blaggregator filter out articles that have invalid dates, so that content doesn't get stuck at the top of the feed?
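
A minimal sketch of such a filter, assuming each post carries a naive publication datetime comparable with the crawler's clock. The tolerance value, the sample posts, and the helper name are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical allowance for clock skew and timezone mix-ups in feeds.
TOLERANCE = timedelta(days=1)


def has_plausible_date(published, now, tolerance=TOLERANCE):
    """Return False for posts dated further in the future than the tolerance."""
    return published <= now + tolerance


now = datetime(2015, 6, 1, 12, 0)
posts = [
    ("current post", datetime(2015, 6, 1, 9, 0)),
    ("post dated next year", datetime(2016, 6, 1, 9, 0)),
]
visible = [title for title, published in posts if has_plausible_date(published, now)]
```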

Thanks :)

Implement Celery on the backend for asynchronous tasks

I'm currently working on this. There's a lot of work that goes on in the background that we don't need to make a user wait for, like checking their blog for updates and sending messages to Humbug. Celery is a much less janky way to handle this.

It will also be able to handle upcoming features, like error notifications and email digests.

Deleting blog doesn't work since introduction of LogEntry

Looks like deleting the blog post violates the foreign key constraint on corresponding LogEntry instances.

Environment:

Request Method: GET
Request URL: http://blaggregator-staging.herokuapp.com/delete_blog/44/

Django Version: 1.5.1
Python Version: 2.7.4
Installed Applications:
('django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.sites',
'django.contrib.messages',
'django.contrib.staticfiles',
'home',
'django.contrib.admin',
'storages',
'south',
'django.contrib.humanize',
'social.apps.django_app.default')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
'social.apps.django_app.middleware.SocialAuthExceptionMiddleware')

Traceback:
File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
    response = callback(request, *callback_args, **callback_kwargs)
File "/app/.heroku/python/lib/python2.7/site-packages/django/contrib/auth/decorators.py" in _wrapped_view
    return view_func(request, *args, **kwargs)
File "/app/home/views.py" in delete_blog
    blog.delete()
File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py" in delete
    collector.delete()
File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/deletion.py" in decorated
    transaction.commit(using=self.using)
File "/app/.heroku/python/lib/python2.7/site-packages/django/db/transaction.py" in commit
    connection.commit()
File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/__init__.py" in commit
    self._commit()
File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py" in _commit
    six.reraise(utils.IntegrityError, utils.IntegrityError(*tuple(e.args)), sys.exc_info()[2])
File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py" in _commit
    return self.connection.commit()

Exception Type: IntegrityError at /delete_blog/44/
Exception Value: update or delete on table "home_post" violates foreign key constraint "post_id_refs_id_673a729446c2b217" on table "home_logentry"
DETAIL: Key (id)=(725) is still referenced from table "home_logentry".
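
The constraint failure can be reproduced in miniature. This sketch uses an in-memory SQLite database in place of the production Postgres and Django ORM; the table names come from the error message, while the column layout and the delete_post helper are assumptions. It shows one possible remedy, deleting the dependent log rows before the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE home_post (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE home_logentry ("
    "id INTEGER PRIMARY KEY, "
    "post_id INTEGER NOT NULL REFERENCES home_post(id))"
)
conn.execute("INSERT INTO home_post (id) VALUES (725)")
conn.execute("INSERT INTO home_logentry (post_id) VALUES (725)")


def delete_post(conn, post_id):
    """Delete dependent log rows first, then the post itself."""
    conn.execute("DELETE FROM home_logentry WHERE post_id = ?", (post_id,))
    conn.execute("DELETE FROM home_post WHERE id = ?", (post_id,))


# Deleting the post directly violates the foreign key, as in the report.
try:
    conn.execute("DELETE FROM home_post WHERE id = 725")
    naive_delete_failed = False
except sqlite3.IntegrityError:
    naive_delete_failed = True

delete_post(conn, 725)
remaining = conn.execute("SELECT COUNT(*) FROM home_post").fetchone()[0]
```

Whether log rows should be deleted along with their post or kept with the reference cleared is a design decision the issue leaves open.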

User adding second blog to profile throws 500

Line 127 of views.py uses get(), which assumes the queryset will return only one blog instance. If the user already has a blog on their profile, adding a second one saves the blog but throws a 500.

Not sure what the best solution is here. Probably to display the URLs associated with the logged-in user (so they know what blogs they've already added) and to make the logic more flexible.

New posts not going out to humbug on initial crawl.

When a user adds their blog URL, their blog is crawled properly but the newest posts don't go out to humbug.

Need to refactor the add_blog view: remove the auto-crawl feature and let the hourly crawlposts handle it.

Upgrade runtime to python-2.7.11

You are receiving this email because the following apps that you own are using an older release of the Python runtime (e.g. 2.7.0 to 2.7.10) that is not officially supported by Heroku:

blaggregator

No action is required, but using the latest stable release is highly recommended.

You can upgrade your app to Python 2.7.11 by adding a runtime.txt file (next to requirements.txt) with the contents: python-2.7.11. After deploying, this change will install the updated version of Python, as well as re-install all of your dependencies.

Not handling unicode properly

Exception:

2013-04-25T21:58:26.919945+00:00 app[scheduler.1307]: ** CRAWLING http://brannerchinese.wordpress.com/feed/atom/
2013-04-25T21:58:27.194885+00:00 app[scheduler.1307]: Retrieved 'ascii' codec can't encode character u'\u014d' in position 4: ordinal not in range(128)
2013-04-25T21:58:27.194885+00:00 app[scheduler.1307]: UnicodeEncodeError: 'ascii' codec can't encode character u'\u014d' in position 4: ordinal not in range(128)
2013-04-25T21:58:27.194885+00:00 app[scheduler.1307]:     self.crawlblog(blog)
2013-04-25T21:58:27.194885+00:00 app[scheduler.1307]: Traceback (most recent call last):
2013-04-25T21:58:27.194885+00:00 app[scheduler.1307]:   File "/app/home/management/commands/crawlposts.py", line 57, in crawlblog
2013-04-25T21:58:27.194885+00:00 app[scheduler.1307]:   File "/app/home/management/commands/crawlposts.py", line 78, in handle_noargs
2013-04-25T21:58:27.194885+00:00 app[scheduler.1307]:     print "Retrieved", title
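
The failure comes from encoding a non-ASCII title with the ascii codec while printing. A minimal sketch, written in Python 3 syntax, reproduces the strict-encoding error and shows one way to produce an always-printable string. The printable helper and the sample title are hypothetical.

```python
def printable(title):
    """Return an ASCII-only rendering of title that is always safe to print."""
    return title.encode("ascii", "backslashreplace").decode("ascii")


title = "T\u014dky\u014d notes"  # hypothetical title containing u'\u014d'

# Strict ASCII encoding fails the same way the crawler's print did.
try:
    title.encode("ascii")
    strict_encoding_failed = False
except UnicodeEncodeError:
    strict_encoding_failed = True

safe = printable(title)
```

Encoding to UTF-8 before printing would preserve the characters instead of escaping them, provided the log consumer accepts UTF-8.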

Email subscriptions

Want to contribute? People may not check in every day but still want to stay updated, so a daily email digest of the day's posts would be a useful feature. It should also help the community stay engaged, leading to better discussions on the site.

Consider retiring frames

Hey, great work on Blaggregator, it's awesome that it's still being contributed to and maintained.

I propose retiring frames and pointing all Blaggregator links directly to the source. This would fix three problems frames cause:

  1. On mobile browsers, if the blog has a responsive design, framing causes the responsive design to be ignored. Example: load this post of mine in a mobile browser. Note that zooming is required to read anything, but it becomes nicely formatted when you click "close frame".
  2. Some blogs use X-Frame-Options or similar CSP directives to prevent being framed. This causes, e.g., @wismer's recent post not to show up at all, and was previously an issue on @brannerchinese's blog, though changes to settings on Bitbucket's end seem to have worked around this.
  3. When you follow a link from a blog post to any site using X-Frame-Options (including GitHub), the link won't load. E.g., go to @madhuvishy's post here, click on "Yet another BitTorrent client" (which is a link to GitHub). Note that the frame goes blank.
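
The framing failures in items 2 and 3 come from response headers. As a rough illustration, the heuristic below checks a header mapping for the two mechanisms mentioned; the function name is hypothetical, and it deliberately over-reports by treating any frame-ancestors directive as a refusal.

```python
def forbids_framing(headers):
    """Return True if response headers ask browsers not to frame the page.

    SAMEORIGIN counts as a refusal because an aggregator is a different origin.
    """
    normalized = {key.lower(): value.lower() for key, value in headers.items()}
    xfo = normalized.get("x-frame-options", "")
    if xfo.startswith("deny") or xfo.startswith("sameorigin"):
        return True
    csp = normalized.get("content-security-policy", "")
    # Rough heuristic: any frame-ancestors directive is treated as a refusal.
    return "frame-ancestors" in csp
```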
