libo26 / feedparser
Automatically exported from code.google.com/p/feedparser
License: Other
>>> feedparser.parse('<rss xmlns:media="http://search.yahoo.com/mrss/"><channel><item><media:content medium="document"><media:description>foo</media:description></media:content></item></channel></rss>').entries[0]
{'content': [{'value': u'foo'}]}
I expected the content item to have type, language, and base attributes
and so on.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:26
Attachments:
>>> feedparser.parse("<rss><channel><item><description><a href=\"javascript:alert('foo')\">Link</a></description></item></channel></rss>").entries[0].summary
u'<a href="javascript:alert(\'foo\')">Link</a>'
The HTML sanitizer doesn't strip out HTML links that execute
JavaScript. A feed author could use this to embed a link in the feed
that executes arbitrary JavaScript as that user if the user clicks on it.
It's tempting to say "well, the user clicked on it, it's their fault". But
since the user probably subscribed to the feed in the first place, they're
probably tempted to click on its links as well, and most users are
unlikely to check the URL before clicking a link to make sure it's safe.
Depending on the software using the library, an unscrupulous feed
author could include a link that when clicked on first asks the feed
reader to delete all subscriptions to competing sites and then passes
the user on to the actual link. The user would likely not notice anything
for a while, then later think that the competing site mysteriously
disappeared.
Cal Henderson identifies a number of different types of URLs to strip:
"javascript:foo"
"java script:foo"
"java\tscript:foo"
"java\nscript:foo"
"java"+chr(1)+"script:foo"
"jscript:foo"
"vbscript:foo"
"view-source:foo"
(http://www.iamcal.com/publish/articles/php/processing_html_part_2/)
but it seems like the right strategy might be a whitelist here as well.
http, ftp, mailto, aim, etc. would all be passed through. Other links
would be treated as relative and the relative link resolution algorithm
would be run on them, resulting in links like:
http://example.org/blog/javascript:foo
which should be fairly safe.
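The whitelist-plus-relative-resolution strategy described above can be sketched like this (a rough sketch only: `ACCEPTABLE_SCHEMES` and `resolve_uri` are illustrative names, not feedparser API, and Python 3's urllib.parse stands in for the urlparse module of the era):

```python
from urllib.parse import urljoin, urlparse

# Illustrative scheme whitelist; anything else falls through to
# relative resolution against the feed's base URI.
ACCEPTABLE_SCHEMES = {'http', 'https', 'ftp', 'mailto', 'aim'}

def resolve_uri(base, uri):
    # Strip whitespace/control characters that attackers use to disguise
    # "javascript:" (e.g. "java\tscript:foo", "java" + chr(1) + "script:foo").
    cleaned = ''.join(c for c in uri if c.isprintable() and not c.isspace())
    scheme = urlparse(cleaned).scheme
    if scheme.lower() in ACCEPTABLE_SCHEMES:
        return uri
    # Anything else is treated as a relative reference; the './' prefix
    # forces urljoin to read the colon as part of the path, yielding
    # harmless URLs like http://example.org/blog/javascript:foo
    return urljoin(base, './' + cleaned)
```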
It's also worth noting that the same rules should be applied to all the
URIs in the document, like those in <link> tags.
Also, if the relative resolution algorithm is used for security purposes
as I suggest, then the base URIs must be sanitized too. For example:
>>> feedparser.parse("""<rss xml:base="http://safe.example.com/">
<channel><item>
<link>this</link>
</item></channel>
</rss>""").entries[0].link
u'http://safe.example.com/this'
should not be overwritable using something like:
>>> feedparser.parse("""<rss xml:base="http://safe.example.com/">
<channel><item xml:base="javascript:hack">
<link>this</link>
</item></channel>
</rss>""").entries[0].link
u'this'
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:26
I use feedparser in our project, and it works well. But when I
follow http://feedparser.org/docs/http-authentication.html
to add an HTTP authentication handler, I cannot get the
correct result; it always returns the HTTP 401 authentication-failed
page. I found out it's because feedparser always
puts the given handlers after the built-in handlers, so they
are never used. My patch only changes line 1817; after that,
everything works just fine.
1817c1817
< opener = apply(urllib2.build_opener, tuple([_FeedURLHandler()] + handlers))
---
> opener = apply(urllib2.build_opener, tuple(handlers + [_FeedURLHandler()]))
I am using python2.3 on Debian.
Please tell me if there is anything I missed. Thanks for
writing this useful package.
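For reference, the kind of handler the reporter is passing in looks like this in modern terms (a sketch using Python 3's urllib.request rather than the urllib2 of the report; the URL and credentials are placeholders):

```python
import urllib.request

# Placeholder URL and credentials for illustration only.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://example.org/feed.xml', 'user', 'secret')
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# build_opener installs the custom handler alongside the defaults;
# the reported bug was that feedparser's own handler ended up ahead of
# user-supplied handlers like this one, so they were never consulted.
opener = urllib.request.build_opener(auth_handler)
```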
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:29
There are some RSS feeds served with "SHIFT_JIS" encoding,
but this encoding often contains illegal multibyte
sequences... These are really "CP932" encoding, very similar to
"SHIFT_JIS" but with some extended character codes.
I added a piece of code to detect "cp932".
If possible, please apply this patch.
Regards.
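The fallback the reporter describes can be sketched like this (illustrative only, not the attached patch): bytes declared as Shift_JIS sometimes use CP932-only characters, so retry with the cp932 superset when shift_jis fails.

```python
def decode_japanese(data: bytes) -> str:
    """Decode as Shift_JIS, falling back to the CP932 superset."""
    try:
        return data.decode('shift_jis')
    except UnicodeDecodeError:
        # CP932 adds NEC/IBM extension rows that plain Shift_JIS lacks.
        return data.decode('cp932')
```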
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:23
Attachments:
Looking at this line of code (line 140):
sgmllib.charref = re.compile('&#(x?[0-9A-Fa-f]+)[^0-9A-Fa-f]')
Note that numeric character references support both &#xH; and &#XH;
syntax. This regex has left out the capital X.
I found this because we had a code conflict when
we both tried to 'fix' sgmllib.
PS. See http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1
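The difference is easy to demonstrate (both patterns here are illustrative, quoted from the report rather than from feedparser's current code):

```python
import re

# The pattern from line 140 only allows a lowercase 'x' prefix...
old_charref = re.compile('&#(x?[0-9A-Fa-f]+)[^0-9A-Fa-f]')
# ...while numeric character references also permit a capital 'X'.
new_charref = re.compile('&#([xX]?[0-9A-Fa-f]+)[^0-9A-Fa-f]')
```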
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:28
Feed elements, certainly in Atom, can contain not only
HTML but also XHTML. This means that, in addition to attributes
defined in (X)HTML, they can also contain XML special
attributes like xml:lang, xml:id, etc. These do not
introduce JavaScript/security risks and could be very
useful, so it would make sense to whitelist them for
(X)HTML sanitization.
XHTML content could also contain elements and
attributes from other namespaces, and the same could
probably be said for those. I'm not 100% sure about
the risk there, but it would seem that they are
harmless and, if present, almost certainly too
important to throw away. Obvious examples could be
MathML or SVG.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:34
At the moment, feedparser doesn't handle media:content
or media:thumbnail URLs, because they're defined in
attributes rather than in element values.
These methods make it work:

def _start_media_content(self, attrsD):
    url = attrsD.get('url')
    if url:
        self._save('media_content', url)

def _start_media_thumbnail(self, attrsD):
    url = attrsD.get('url')
    if url:
        self._save('media_thumbnail', url)

It would be nice to add this, so that feedparser could
get image URLs from Flickr feeds.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:18
When I try to parse the feed at: http://feeds.gawker.com/defamer/full
feedparser starts using a lot of resources and hangs.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:05
There is a bug when parsing a feed that has both title and
dc:title.
If dc:title comes after title in a feed, the value of
dc:title will replace title's value. But title's value
is the information we want, so when we use
feed.get("title","") to get the title, we get the
value of dc:title instead.
For example, parsing
"http://ajaxcn.org/exec/rss?snip=start":
feed.get("title","") returns "start", but what we want is
"Ajax**".
<channel>
<title>Ajax**</title>
<link>http://ajaxcn.org/space/start</link>
<description>Ajax lead the way!</description>
<dc:creator>dlee</dc:creator>
<dc:type>Text</dc:type>
<dc:title>start</dc:title>
<dc:identifier>http://ajaxcn.org/space/start</dc:identifier>
<dc:date>2006-08-26T14:41:05+08:00</dc:date>
<dc:language>zh</dc:language>
<!--
<blogChannel:changes>http://www.weblogs.com/rssUpdates/changes.xml</changes>
-->
<admin:generatorAgent
rdf:resource="http://www.snipsnap.org/space/version-1.0b3-uttoxeter"
/>
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:14
Attachments:
The Feed Parser always returns the tags for a
del.icio.us feed entry as a single-item list.
The following example first shows the tags returned for
a valid del.icio.us RSS 1.0 feed entry, then shows the
desired behavior with the tags returned for a valid
Atom 1.0 feed entry:
Python 2.3.5 (#2, Sep 4 2005, 22:01:42)
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for
more information.
>>> import feedparser
>>> mpurl = 'http://del.icio.us/rss/wearehugh'
>>> mp = feedparser.parse(mpurl)
>>> mp.entries[0].tags
[{'term': u'games nomic philosophy', 'scheme': None, 'label': None}]
>>> gmurl = 'http://groovymother.com/links/index.atom'
>>> gm = feedparser.parse(gmurl)
>>> gm.entries[0].tags
[{'term': u'backups', 'scheme': u'http://groovymother.com/links/tag/', 'label': u'backups'}, {'term': u'markpilgrim', 'scheme': u'http://groovymother.com/links/tag/', 'label': u'markpilgrim'}]
Here's the source of the first entry from the example
del.icio.us feed:
<item
rdf:about="http://www.earlham.edu/~peters/writing/nomic.htm">
<title>Peter Suber, "Nomic"</title>
<link>http://www.earlham.edu/~peters/writing/nomic.htm</link>
<dc:creator>wearehugh</dc:creator>
<dc:date>2006-05-09T21:39:12Z</dc:date>
<dc:subject>games nomic philosophy</dc:subject>
<taxo:topics>
<rdf:Bag>
<rdf:li resource="http://del.icio.us/tag/philosophy" />
<rdf:li resource="http://del.icio.us/tag/games" />
<rdf:li resource="http://del.icio.us/tag/nomic" />
</rdf:Bag>
</taxo:topics>
</item>
I'm not familiar with the RSS 1.0 taxonomy module, and
I don't know if del.icio.us's feeds are presenting tags
"correctly," but because of the popularity of
del.icio.us, it would be desirable to handle them
ultra-liberally.
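One "ultra-liberal" option would be to split a space-separated dc:subject term into individual tags, which can be sketched as follows (`split_space_tags` is an illustrative helper, not feedparser API):

```python
def split_space_tags(tags):
    """Expand each space-separated 'term' into one tag dict per word."""
    out = []
    for t in tags:
        for term in (t.get('term') or '').split():
            out.append({'term': term,
                        'scheme': t.get('scheme'),
                        'label': t.get('label')})
    return out
```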
Here's the source of the first entry from the example
Atom 1.0 feed:
<entry>
<title>Long-term backup [dive into mark]</title>
<link rel="alternate" type="text/html"
href="http://groovymother.com/links/archives/2006/05/07-week/#002448"
/>
<link rel="related" type="text/html"
title="Long-term backup [dive into mark]"
href="http://diveintomark.org/archives/2006/05/08/backup"
/>
<published>2006-05-09T21:49:30Z</published>
<updated>2006-05-09T21:49:30Z</updated>
<id>tag:arsecandle.org,2006:groovymother/links/2448</id>
<summary type="text">When you're building up
gigabytes of data, how can you realistically
back-it-up?</summary>
<category
scheme="http://groovymother.com/links/tag/"
term="backups" label="backups" />
<category
scheme="http://groovymother.com/links/tag/"
term="markpilgrim" label="markpilgrim" />
</entry>
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:25
Attachments:
It's been a while since I actually ran into this problem, but I just
remembered that I forgot to file a bug. This might not be completely
accurate, but it's what I remember.
On Ubuntu dapper, I tried to parse an RSS feed that used additional
namespaces, but these weren't accessible through feedparser. After poking
around for a while, I found out that feedparser can use several different
XML parsers, so I installed the preferred one, libxml2. Everything worked
from there.
It'd be nice if the namespace handling page in the documentation pointed
out this difference in behavior. Maybe I'll get around to actually fixing
the problem someday, making that unnecessary.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:08
When I attempt to parse the attached body with
feedparser, the whole process hangs.
It gets stuck during HTML sanitizing; the last thing the
debug output emits indicates it's looking at:
<p><a
href="http://www.flickr.com/photos/14155499@N00/110182459/"
title="Photo Sharing"><img
src="http://static.flickr.com/54/110182459_eced6c8a60_o.png"
width="366" height="102" alt=""Include the list of
links referenced in the article"" /></a></p>
<p>&#8212;</p>
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:11
Attachments:
Fix for: [ 1440553 ] misparses core elements within
extension elements
The attached patch to feedparser.py fixes the problem
reported as bug #1440553
(http://sourceforge.net/tracker/index.php?func=detail&aid=1440553&group_id=112328&atid=661937),
where core element values are taken from inside
extension elements. Also attached is a test case to
demonstrate the fix.
The solution isn't a catch-all; notably, it won't work
for core elements within doubly nested extension
elements, but handling that would require the already-requested
refactoring of state preservation.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:34
Attachments:
What steps will reproduce the problem?
1. Try to parse this feed from the command line with feedparser (check the
attached file).
I expect to see a parsed feed, but the script produces an error.
I took feedparser.py from the latest SVN version. I run it with Python 2.4.3 on
Windows XP.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 1:56
Attachments:
Go here:
http://www.feedparser.org/docs/annotated-rss20.html
Copy/paste the XML somewhere, then run Python:
>>> import feedparser
>>> feedparser.parse(path_to_rss20_example).feed.links
[{'href': u'http://example.org/', 'type': 'text/html', 'rel': 'alternate'}, {'type': 'text/html', 'rel': 'alternate'}, {'type': 'text/html', 'rel': 'alternate'}]
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:18
Imagine a tag <author>joe</author>. Clearly joe is the name of the
author (at least choosing from the limited possibilities we have).
Currently
feedparser gives up trying to make an author_detail element. But it should
at least set name.
That's what the attached patch does. (I don't think I did all the unit
tests I
should have, though.)
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:32
Attachments:
feedparser fails when processing this feed,
http://yaoke12345.bokee.com/rss2.xml, because of the
encoding.
gb2312 is a well-known and old (published in the 1980s)
character encoding standard in mainland China. It
has now been superseded by the new standard, gb18030.
gb18030 is backward compatible with gb2312, but
contains many more characters, such as
traditional Chinese characters, Japanese symbols, etc.
However, many Chinese websites still declare their
encoding as gb2312, although their pages are
actually gb18030 encoded. This is so common that
both MSIE and Firefox use gb18030 to decode gb2312
pages in order to render them correctly.
In the above example, bokee.com, the largest blog
service provider in mainland China, allows its users
to post gb18030 blog articles but declares its feed
encoding as gb2312.
I think feedparser should be able to handle this
situation, so I made a simple patch (see the
attachment). When it encounters the gb2312 encoding, it uses
gb18030 instead.
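The substitution the patch describes can be sketched like this (`resolve_encoding` is an illustrative name, not the attached patch): when a feed declares gb2312, decode with the backward-compatible gb18030 superset instead.

```python
def resolve_encoding(declared: str) -> str:
    """Map a declared gb2312 encoding to the gb18030 superset."""
    if declared.lower() == 'gb2312':
        return 'gb18030'
    return declared

# A traditional character like 龍 is representable in gb18030 but was
# not part of the original gb2312 repertoire.
data = '龍'.encode('gb18030')
text = data.decode(resolve_encoding('gb2312'))
```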
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:12
Attachments:
I apologize if the SF project bug database isn't the best place to report
this documentation "bug."
In the following example:
http://feedparser.org/docs/http-authentication.html#example.auth.required
line two currently reads:
>>> d = feedparser.parse('http://feedparser.org/docs/examples/digest_auth.xml')
but, I believe, should read:
>>> d = feedparser.parse('http://feedparser.org/docs/examples/basic_auth.xml')
On line six, d.headers['www-authenticate'] is inspected as:
'Basic realm="Use test/basic"'
and the digest authentication example is then given.
--
Alexander McCormmach
Email: my first name, all lower case, at tunicate.org
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:31
Universal character set detection crashes on the
attached text sample with an infinite recursion error.
Taking only a short sample of the text (say, 400 bytes)
works without error, but not the whole file at once.
Incremental detection as described in the advanced
usage section also fails with a recursion error.
The attachment contains a sample of text that produces
the problem along with a reproduction script. The text
FYI is the text of the president of Iran's new blog
that has been in the news recently.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:21
Attachments:
When parsed
http://feedparser.org/docs/annotated-rss20.html
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:17
In the feed http://feeds.feedburner.com/drivetime, each
entry has a "media:thumbnail" which has a url
attribute. However feedparser assigns it the value of
a blank string.
The feed source looks horrible in my browser, but
feedparser parses it fairly well otherwise.
example:
<media:thumbnail
url="http://ravijain.org/pressroom/drivetime_itunes_02.jpg"
/>
but:
f = feedparser.parse("http://feeds.feedburner.com/drivetime")
len(f['entries'][0]['media_thumbnail'])
0
:(
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:15
Attachments:
.etag is always present when a feed is fetched from the network, even if the
HTTP response does not contain an ETag header. This contradicts the
documentation.
Example:
>>> import feedparser
>>> d = feedparser.parse('http://www.nsu.ru/dynamic/news/rss.php?news_type=3')
>>> d.headers
{'x-powered-by': 'PHP/4.1.2', 'transfer-encoding': 'chunked', 'vary': 'accept-charset, user-agent', 'server': 'Apache', 'connection': 'close', 'date': 'Sun, 18 Feb 2007 16:25:51 GMT', 'content-type': 'text/xml; charset=koi8-r'}
>>> d.etag
>>>
This is due to a missing check for ETag presence in response parsing. Attached
is a patch which fixes it.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:03
Attachments:
A feed like:
http://www.aaronsw.com/2002/feeds/pg
has a header (sent by Apache) like:
Content-Location: pg.cgi
feedparser does:
baseuri = http_headers.get('content-location', result.get('href'))
which ends up setting the baseuri to "pg.cgi". It should instead probably
join the content-location with the feed document, getting the (more)
correct http://www.aaronsw.com/2002/feeds/pg.cgi
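The suggested fix can be sketched with a standard URL join (using Python 3's urllib.parse here for illustration):

```python
from urllib.parse import urljoin

# Resolve a relative Content-Location header against the URL the feed
# was fetched from, instead of using the header value verbatim.
feed_url = 'http://www.aaronsw.com/2002/feeds/pg'
content_location = 'pg.cgi'
baseuri = urljoin(feed_url, content_location)
```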
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:24
http://community.livejournal.com/dozory/data/rss and
http://community.livejournal.com/dozory/data/atom are
wrongly parsed as ISO-8859-2 instead of UTF-8.
http://community.livejournal.com/technion/data/rss is
parsed correctly as UTF-8.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:22
unicode error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31:
ordinal not in range(128)
reproduce:
import feedparser
d = feedparser.parse('http://flickr.com/services/feeds/photos_public.gne?tags=esperanto&format=rss_200')
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:05
On the main index, there are two entries for feed.icon.
The second one should actually be feed.logo, since it
links to reference-feed-logo.html.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:30
test case: last entry of
http://www.snellspace.com/public/ordertest.xml
entry updated/id/title pick up values from inside
x:foo, which is wrong.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:33
The problem is introduced by the new 'type' parameter
to _sanitizeHTML() method. It overrides/masks the
builtin type() that is used in the same method (only)
when Tidy is enabled.
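A minimal illustration of the masking (simplified; `sanitize` here is not feedparser's actual signature): inside a function whose parameter is named `type`, the builtin type() is unreachable by that name.

```python
def sanitize(data, type):
    try:
        # Intended: the builtin type(). With type='text/html' this is a
        # string, which is not callable, so TypeError is raised instead.
        return type(data)
    except TypeError:
        return 'builtin type() was shadowed'
```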
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:22
This bug was submitted to Debian BTS [1] by Mikhail Gusarov:
"""
python-feedparser documentation says "etag will only be present if the
feed was retrieved from a web server, and only if the web server
provided an ETag HTTP header for the feed."
However, .etag is present always if feed was fetched from network, and
contains None value, due to missing check in code.
"""
Please check whether Mikhail is right and the proposed patch is useful. If you
don't plan to release a new version soon, please tell me if you find it
suitable to include the changes as a patch in the Debian package until a new
upstream release is ready.
Thanks.
[1]http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411388
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 3:52
Attachments:
(Apparently I should put my email address in here. It's [email protected])
To my understanding of the namespace support, the media_content dictionary
key should be populated for entries which have such an XML tag set. This is
not occurring. For example, this RSS feed:
http://video.google.com/videofeed?type=search&q=engedu&so=0&num=20&output=rss
has, for the first entry:
<media:content url="http://video.google.com/videofile/DebuggingBackwardsin.flv?docid=3897010229726822034&itag=5" type="video/x-flv" medium="video" expression="full" duration="3079" width="320" height="240" />
<media:content url="http://video.google.com/videofile/DebuggingBackwardsin.avi?docid=3897010229726822034&itag=9" type="video/x-msvideo" medium="video" expression="full" duration="3079" width="480" height="360" />
<media:content url="http://video.google.com/videofile/DebuggingBackwardsin.mp4?docid=3897010229726822034&itag=7" type="video/mp4" medium="video" expression="full" duration="3079" width="320" height="240" />
but this code:
print 'Namespaces: %s' % parser.namespaces
print 'media:description length: %d' % len(parser.entries[0].media_description)
print 'Has media_content: %s' % repr(parser.entries[0].has_key('media_content'))
gives:
Namespaces: {'media': 'http://search.yahoo.com/mrss', 'opensearch': 'http://a9.com/-/spec/opensearchrss/1.0/'}
media:description length: 1238
Has media_content: False
This is using FeedParser 4.1-2ubuntu1 (an Ubuntu package).
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:06
STR:
Go to http://feedparser.org/docs/
Search for 'feed.icon'
Examine pages that are linked to with that text
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:07
the methods to handle "issued" are aliases for the
methods that handle "published". i think that is a bug
in itself, since it's fairly commonplace to include
both <issued/> and <published/> elements in atom, with
different values (e.g. livejournal's atom output).
anyway, imagine this scenario:
<issued>foo</issued>
<published>bar</published>
for the first node, the code will set published->foo
and published_parsed->f,o,o (lets pretend) - remember
issued is an alias for published. for the second,
it'll set published->bar and leave published_parsed as
f,o,o because of the setdefault in
_FeedParserMixin._save()!
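The setdefault pitfall described above, in isolation (a plain-dict sketch, not feedparser's _save() itself):

```python
# First node: <issued>foo</issued>, with issued aliased to published.
d = {}
d['published'] = 'foo'
d.setdefault('published_parsed', ('f', 'o', 'o'))

# Second node: <published>bar</published>. The value is overwritten,
# but setdefault is a no-op because published_parsed already exists.
d['published'] = 'bar'
d.setdefault('published_parsed', ('b', 'a', 'r'))
```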
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:16
>>> feedparser.parse("<rss><channel><item><description><blockquote></description></item></channel></rss>").entries[0].summary
u'<blockquote>'
A feed like this can cause the entire rest of a page to be indented if it's
used in the obvious manner that HTML sanitization appears to be
intended for. Instead, feedparser should close the tag at the end of the
item:
u'<blockquote></blockquote>'
And for some software, close tags without open tags can also be harmful.
Cal Henderson (Flickr)'s thoughts on HTML sanitization seem relatively
sensible and contain a long list of test cases:
http://www.iamcal.com/publish/articles/php/processing_html_part_2/
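The balancing behavior suggested above can be sketched with a toy parser (html.parser based and much simpler than feedparser's sanitizer; `balance` and `Balancer` are illustrative names):

```python
from html.parser import HTMLParser

class Balancer(HTMLParser):
    """Re-emit a fragment, closing open tags and dropping stray closes."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        self.out.append('<%s>' % tag)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # A close tag without a matching open tag is simply dropped.
        if tag in self.stack:
            self.stack.remove(tag)
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()
        # Close anything still open at the end of the item.
        while self.stack:
            self.out.append('</%s>' % self.stack.pop())

def balance(fragment):
    b = Balancer()
    b.feed(fragment)
    b.close()
    return ''.join(b.out)
```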
Date: 2006-04-20 17:17
Sender: aaronsw
Logged In: YES
user_id=122141
I've made a couple more fixes. The latest version will always be at:
http://www.aaronsw.com/2002/sanitize/
Date: 2006-04-18 15:29
Sender: aaronsw
Logged In: YES
user_id=122141
The patch is missing a comma at the end of a line after 'colgroup'. That
Python doesn't consider `'foo' 'bar'` a syntax error, the way `1 2` is, is
becoming an increasingly large annoyance for me.
Date: 2006-04-06 16:28
Sender: aaronsw
Logged In: YES
user_id=122141
Since I don't feel like I have a good grasp on the structure of the test
cases and
since I assume you have some sort of automated tool for making them, I
didn't
try to make the test cases for this myself. But attached is a Python
script that
runs a bunch of tests on the sanitize function. It should not be hard to
convert it
into the feedparser test case format.
Date: 2006-04-06 16:23
Sender: aaronsw
Logged In: YES
user_id=122141
Attached is a patch to add this functionality. Two test cases had
unbalanced tags
so I closed them in the test cases to minimize the code they tested.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:27
Attachments:
The RSS feeds by the BBC have links in the form:
<link>http://blah</link>
feedparser currently gives empty links for this:
{'type': 'text/html', 'rel': 'alternate'}
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:29
Attachments:
When installing http://chardet.feedparser.org/ it seems
that data is double encoded.
Sorry, this is what I was told: "What is strange though is that the parser is
returning a unicode string of a utf-8 encoded string. Somewhere along
the line, there is a double decoding happening."
This is not the case if we uninstall Universal Encoding
Detector.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:31
Typepad feeds with blog entries exported from Flickr usually include a
<style> tag. Feedparser strips out the opening and closing tags correctly,
but not the CSS in-between, and the resulting CSS spills into the text.
The quick and simple option is to just strip out everything in-between
style tags.
A second option would be to use the style attribute sanitizing code on the
tag content, but this would allow a feed to influence other feeds'
presentation on pages that have more than one feed aggregated in a single
HTML page.
The fix is quite simple: just add style to
_HTMLSanitizer.unacceptable_elements_with_end_tag
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:08
I think this is a separate bug to the other fractional seconds bug that's
open from the other day.
>>> feedparser._parse_date('2007-04-23T23:25:47+10:00')
(2007, 4, 23, 13, 25, 47, 0, 113, 0)
>>> feedparser._parse_date('2007-04-23T23:25:47.538+10:00')
(2007, 4, 24, 0, 25, 47, 1, 114, 1)
>>> feedparser._parse_date_w3dtf('2007-04-23T23:25:47.538+10:00')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/var/lib/python-support/python2.4/rawdoglib/feedparser.py", line
2245, in _parse_date_w3dtf
gmt = __extract_date(m) + __extract_time(m) + (0, 0, 0)
File "/var/lib/python-support/python2.4/rawdoglib/feedparser.py", line
2207, in __extract_time
seconds = int(seconds)
ValueError: invalid literal for int(): 47.538
It's because m.group('seconds') returns a string that is passed to int() in
_parse_date_w3dft.
I fixed it with the following but there are other uses of int() on output
from m.group around there.
--- feedparser.py.orig 2007-04-23 15:48:10.000000000 +0100
+++ feedparser.py 2007-04-23 15:47:40.000000000 +0100
@@ -2204,7 +2204,7 @@
     minutes = int(m.group('minutes'))
     seconds = m.group('seconds')
     if seconds:
-        seconds = int(seconds)
+        seconds = int(float(seconds))
     else:
         seconds = 0
     return hours, minutes, seconds
With this patch I get:
>>> feedparser._parse_date('2007-04-23T23:25:47+10:00')
(2007, 4, 23, 13, 25, 47, 0, 113, 0)
>>> feedparser._parse_date('2007-04-23T23:25:47.538+10:00')
(2007, 4, 23, 13, 25, 47, 0, 113, 0)
Discovered using rawdog and the feed at http://etbe.blogspot.com/atom.xml
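The crux of the patch in isolation: int() rejects a fractional-seconds string outright, while int(float()) truncates it to whole seconds.

```python
seconds = '47.538'   # m.group('seconds') with a fractional part
try:
    whole = int(seconds)          # raises ValueError: invalid literal
except ValueError:
    whole = int(float(seconds))   # parses, then truncates to 47
```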
Original issue reported on code.google.com by [email protected]
on 23 Apr 2007 at 2:50
Hi, I'm Zach Beane, <[email protected]>.
The W3CDTF says that the date format may include fractional seconds, but
the regular expression does not take that into account. This causes a
failure to detect a time zone that follows the fractional second.
Here's a patch that detects (but ignores) the fractional second value and
allows the correct processing of the timezone that follows.
diff -c /home/xach/tmp/feedparser.py~ /home/xach/tmp/feedparser.py
*** /home/xach/tmp/feedparser.py~ 2006-01-10 23:32:22.000000000 -0500
--- /home/xach/tmp/feedparser.py 2006-11-17 11:11:54.355518778 -0500
***************
*** 1860,1865 ****
--- 1860,1866 ----
      'CC', r'(?P<century>\d\d$)')
      + r'(T?(?P<hour>\d{2}):(?P<minute>\d{2})'
      + r'(:(?P<second>\d{2}))?'
+     + r'(\.(?P<fracsecond>\d+))?'
      + r'(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?'
      for tmpl in _iso8601_tmpl]
  del tmpl
Oops! This is fixed in CVS, but the pattern for fractional seconds should
be \.\d+ instead of \.\d*.
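The reason for preferring \.\d+ can be shown with two toy patterns (illustrative here, not feedparser's actual W3CDTF regex): with \d*, a bare trailing dot is accepted as an empty fractional part.

```python
import re

# \d* accepts "47." (dot with no digits); \d+ requires at least one.
loose = re.compile(r'(?P<second>\d{2})(\.(?P<frac>\d*))?')
strict = re.compile(r'(?P<second>\d{2})(\.(?P<frac>\d+))?')
```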
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:11
As in the attached file, feedparser cannot handle malformed
numeric entities such as &#a; and raises an exception. A
workaround for this is:
505,508c505,511
<         if ref[0] == 'x':
<             c = int(ref[1:], 16)
<         else:
<             c = int(ref)
---
>         try:
>             if ref[0] == 'x':
>                 c = int(ref[1:], 16)
>             else:
>                 c = int(ref)
>         except:
>             c = 0
Don't know if this is what the parser is supposed to do
in case of malformed entities...
Bye,
Giuseppe Ottaviano
[email protected]
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:19
Attachments:
I am using Feedparser to parse RSS feeds. When I run
the CGI scripts, it says feedparser is undefined
and cannot find several modules. I have downloaded
and installed it in my cgi-bin directory. Then I
use "import feedparser" in my CGI scripts, but it
does not work.
Date: 2007-01-09 20:01
Sender: nobody
Logged In: NO
same problem for me..
python 2.4.3
__version__ = "4.1"# + "$Revision: 1.92 $"[11:15] + "-cvs"
BUT, it only fails (sometimes) when importing from the interpreter. Ha,
what a bug! After the 7th time trying >>>import feedparser, it imported
just fine.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/feedparser.py", line 1958, in ?
    _korean_nate_date_re = \
  File "/usr/lib/python2.4/sre.py", line 180, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.4/sre.py", line 225, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.4/sre_compile.py", line 500, in compile
    code = _code(p, flags)
  File "/usr/lib/python2.4/sre_compile.py", line 484, in _code
    _compile(code, p.data, flags)
  File "/usr/lib/python2.4/sre_compile.py", line 96, in _compile
    _compile(code, av[1], flags)
  File "/usr/lib/python2.4/sre_compile.py", line 52, in _compile
    _compile_charset(av, flags, code, fixup)
  File "/usr/lib/python2.4/sre_compile.py", line 178, in _compile_charset
    for op, av in _optimize_charset(charset, fixup):
  File "/usr/lib/python2.4/sre_compile.py", line 221, in _optimize_charset
    return _optimize_unicode(charset, fixup)
  File "/usr/lib/python2.4/sre_compile.py", line 341, in _optimize_unicode
    mapping = array.array('b', mapping).tostring()
AttributeError: 'module' object has no attribute 'array'
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:20
In an RSS 1.0 feed, feedparser 4.1 fails to parse a dc:subject
element and expose it via either entry.dc_subject or entry['dc_subject'].
Attached there's a test case (an RSS 1.0 feed showing that behavior).
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:13
Attachments:
I've built two serious pieces of software on top of feedparser and both
have run into the same problem: feedparser gives no reliable indication of
whether it's successfully parsed a feed. I've tried to look to see if the
entries list exists, but sometimes a successfully-parsed feed has no
entries. I've tried to look at bozo_exception but that gets set even when
the feed was actually parsed.
I suggest a new attribute (bozo_status?) set by feedparser.parse to
indicate the status. It'd indicate whether an exception was thrown in
_open_resource or decompressing, causing feedparser to give up. Ideally
it'd also indicate if the document didn't even look anything like RSS, but
that doesn't seem to be easily doable within the current structure.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:25
The parser doesn't recognize Microsoft's 'msn' and 'live' namespaces, and
does not construct variables like 'msn_type' or 'live_type', which MSN Live
feeds use to distinguish photo entries from article entries.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:02
On the page
http://www.feedparser.org/docs/version-detection.html
the link for Atom 1.0 should be
http://www.ietf.org/rfc/rfc4287.txt
but is missing the .txt
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:23
The latest Atom-1.0 spec
(http://www.ietf.org/rfc/rfc4287) makes it legal for
multiple <author> elements to be present in <feed>,
<entry>, and <source>, the same as <contributor>.
Feedparser currently only supports a single <author>. A
little experimenting shows that with multiple <author>
elements, the subelements of the second and later
<author> overwrite those of the first <author>.
Suggested fix: Treat <author> like <contributor> and
create feed.authors[i].{name,href,email}. Leave
feed.author and feed.author_detail as-is, or have them
contain information for only the first <author>, for
backwards-compatibility.
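The suggested fix can be sketched with the standard library's ElementTree, collecting every <author> into a list so later elements no longer overwrite the first; the feed snippet and variable names below are illustrative only:

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <author><name>Alice</name><email>alice@example.com</email></author>
  <author><name>Bob</name></author>
</feed>"""

root = ET.fromstring(feed_xml)

# Collect every <author>, instead of letting later ones overwrite the first.
authors = []
for author in root.findall(ATOM + 'author'):
    detail = {}
    for field in ('name', 'email', 'uri'):
        el = author.find(ATOM + field)
        if el is not None:
            detail[field] = el.text
    authors.append(detail)

# Backwards-compatible single-author view: just the first <author>.
author_detail = authors[0] if authors else {}
print(authors)
```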
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:30
The line:
<?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?>
which appears in the beginning of Blogger's feeds causes the parser to fail
and return empty.
Can you please fix this problem?
My email is [email protected].
10x!
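Until the parser handles this, one caller-side workaround is to strip the xml-stylesheet processing instruction before parsing; the strip_stylesheet_pi helper below is a hypothetical sketch, not a feedparser API:

```python
import re

# Matches <?xml-stylesheet ...?> processing instructions (but not the
# <?xml version ...?> declaration, since the target name differs).
STYLESHEET_PI = re.compile(r'<\?xml-stylesheet\b.*?\?>', re.DOTALL)

def strip_stylesheet_pi(data):
    """Remove xml-stylesheet processing instructions from a feed string."""
    return STYLESHEET_PI.sub('', data)

feed = ('<?xml version="1.0"?>\n'
        '<?xml-stylesheet href="http://www.blogger.com/styles/atom.css" '
        'type="text/css"?>\n'
        '<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>')
print(strip_stylesheet_pi(feed))
```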
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:06
Don't see an automated way of doing it, will probably have to copy and paste.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 4:06
In iTunes feeds containing both the <image> tag and the <itunes:image> tag,
the feed image is overwritten by the iTunes image. All of image_detail is
also overwritten. The feed validates with both tags present; both are legal.
Feedparser does follow the iTunes application's behavior and falls back to
the <image> tag when no <itunes:image> tag exists, so it's not broken per se.
see http://www.freetalklive.com/netcast.xml for described behavior.
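One way to avoid the overwrite is to store each image under its own key; a minimal sketch with ElementTree, using illustrative feed data (the images dict and its key names are assumptions, not feedparser's actual structure):

```python
import xml.etree.ElementTree as ET

ITUNES = '{http://www.itunes.com/dtds/podcast-1.0.dtd}'

rss = """<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <image><url>http://example.com/rss.png</url></image>
    <itunes:image href="http://example.com/itunes.png"/>
  </channel>
</rss>"""

channel = ET.fromstring(rss).find('channel')

# Keep both images under separate keys so neither overwrites the other.
images = {}
rss_image = channel.find('image/url')
if rss_image is not None:
    images['image'] = rss_image.text
itunes_image = channel.find(ITUNES + 'image')
if itunes_image is not None:
    images['itunes_image'] = itunes_image.get('href')
print(images)
```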
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:01
With the feed http://www.democracynow.org/podcast.xml
feedparser is occasionally munging entry titles. With
the present feed contents, entries[0].title is
'\r\xe9\xa8r\xb6\x9c\xc8\xda02\x89\xddk"n\x97-\xf5\xdbM:'
entries[1].title is
u'Democracy Now! - Thursday, July 27, 2006'
FWIW, the feed does validate.
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:21
Version 4.1 is crashing on input with some upper ascii
characters and I guess no encoding. Traceback:
doc = feedparser.parse(self.url)
File "/var/lib/python-support/python2.4/feedparser.py", line 2623, in parse
    feedparser.feed(data)
File "/var/lib/python-support/python2.4/feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
File "/usr/lib/python2.4/sgmllib.py", line 100, in feed
    self.goahead(0)
File "/usr/lib/python2.4/sgmllib.py", line 139, in goahead
    k = self.parse_endtag(i)
File "/usr/lib/python2.4/sgmllib.py", line 297, in parse_endtag
    self.finish_endtag(tag)
File "/usr/lib/python2.4/sgmllib.py", line 337, in finish_endtag
    self.unknown_endtag(tag)
File "/var/lib/python-support/python2.4/feedparser.py", line 476, in unknown_endtag
    method()
File "/var/lib/python-support/python2.4/feedparser.py", line 1217, in _end_description
    value = self.popContent('description')
File "/var/lib/python-support/python2.4/feedparser.py", line 700, in popContent
    value = self.pop(tag)
File "/var/lib/python-support/python2.4/feedparser.py", line 641, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
File "/var/lib/python-support/python2.4/feedparser.py", line 1594, in _resolveRelativeURIs
    p.feed(htmlSource)
File "/var/lib/python-support/python2.4/feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
File "/usr/lib/python2.4/sgmllib.py", line 100, in feed
    self.goahead(0)
File "/usr/lib/python2.4/sgmllib.py", line 134, in goahead
    k = self.parse_starttag(i)
File "/usr/lib/python2.4/sgmllib.py", line 284, in parse_starttag
    self.finish_starttag(tag, attrs)
File "/usr/lib/python2.4/sgmllib.py", line 315, in finish_starttag
    self.unknown_starttag(tag, attrs)
File "/var/lib/python-support/python2.4/feedparser.py", line 1589, in unknown_starttag
    _BaseHTMLProcessor.unknown_starttag(self, tag, attrs)
File "/var/lib/python-support/python2.4/feedparser.py", line 1460, in unknown_starttag
    strattrs = u''.join([u' %s="%s"' % (key, value) for key, value in uattrs]).encode(self.encoding)
LookupError: unknown encoding:
XML that breaks it is attached.
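The crash comes from calling .encode() with an empty encoding string. A defensive pattern is to validate the declared encoding first and fall back to UTF-8; safe_encode below is a hypothetical helper to illustrate the idea, not feedparser code:

```python
import codecs

def safe_encode(text, encoding):
    """Encode text, falling back to UTF-8 when the declared encoding is
    empty or unknown -- the situation that raises LookupError above."""
    try:
        codecs.lookup(encoding)
    except (LookupError, TypeError):
        encoding = 'utf-8'
    return text.encode(encoding)

# An empty encoding (as in the traceback) now falls back instead of crashing.
print(safe_encode(u'caf\xe9', ''))
```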
Original issue reported on code.google.com by [email protected]
on 19 Apr 2007 at 5:16
Attachments: