libo26 / feedparser Goto Github PK

View Code? Open in Web Editor NEW

0.0 0.0 0.0 4.63 MB

Automatically exported from code.google.com/p/feedparser

License: Other

Python 95.37% ApacheConf 4.63%

feedparser's People

Contributors

Watchers

feedparser's Issues

[ 1466592 ] media:description elements parse to malformed content object

>>> feedparser.parse('<rss xmlns:media="http://search.yahoo.com/
mrss/"><channel><item><media:content
medium="document"><media:description>foo</media:description></
media:content></item></channel></rss>').entries[0]
{'content': [{'value': u'foo'}]}

I expected the content item to have type, language, and base attributes
and so on.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:26

Attachments:

alterman.xml

[ 1463296 ] sanitize does not strip out javascript links

>>> feedparser.parse("<rss><channel><item><description>&lt;a
href=\"javascript:alert('foo')\">Link&lt;/a></description></item></
channel></rss>").entries[0].summary
u'<a href="javascript:alert(\'foo\')">Link</a>'

The HTML sanitizer doesn't strip out HTML links that execute
JavaScript. An feed author could use this to embed a link in the feed
that executes arbitrary JavaScript as that user if the user clicks on it.

It's tempting to say "well, the user clicked on it, it's their fault". But

since the user probably subscribed to the feed in the first place, they're

probably tempted to click on its links as well, and most users are
unlikely to check the URL before clicking a link to make sure it's safe.

Depending on the software using the library, an unscrupulous feed
author could include a link that when clicked on first asks the feed
reader to delete all subscriptions to competing sites and then passes
the user on to the actual link. The user would likely not notice anything
for a while, then later think that the competing site mysteriously
disappeared.

Cal Henderson identifies a number of different types of URLs to strip:

"javascript:foo"
"java script:foo"
"java\tscript:foo"
"java\nscript:foo"
"java"+chr(1)+"script:foo"
"jscript:foo"
"vbscript:foo"
"view-source:foo"

(http://www.iamcal.com/publish/articles/php/
processing_html_part_2/)

but it seems like the right strategy might be a whitelist here as well.
http, ftp, mailto, aim, etc. would all be passed through. Other links
would be treated as relative and the relative link resolution algorithm
would be run on them, resulting in links like:

http://example.org/blog/javascript:foo

which should be fairly safe.




It's also worth noting that the same rules should be applied to all the
URIs in 
the document, like those in <link> tags.

Also, if the relative resolution algorithm becomes used for security
purposes 
as I suggest, then the base URIs too must be sanitized. For example:

>>> feedparser.parse("""<rss xml:base="http://safe.example.com/">
<channel><item>
<link>this</link>
</item></channel>
</rss>""").entries[0].link
u'http://safe.example.com/this'

should not be overwritable using something like:

>>> feedparser.parse("""<rss xml:base="http://safe.example.com/">
<channel><item xml:base="javascript:hack">
<link>this</link>
</item></channel>
</rss>""").entries[0].link
u'this'

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:26

[ 1459882 ] auth handlers not working

I use feedparser in our project, it works good. But when I
follow http://feedparser.org/docs/http-authentication.html
to add http authentication handler, can not get the
correct result, always return Http 401 authentication
failed page. I found out it's because feedparser always
put given handlers after the buildin handlers, then they
are not used, my patch only changed line 1817, after that,
everything works just fine.

1817c1817
< opener = apply(urllib2.build_opener,
tuple([_FeedURLHandler()] + handlers))

---
> opener = apply(urllib2.build_opener,
tuple(handlers + [_FeedURLHandler()]))

I am using python2.3 on Debian.

Please tell me if there is anything I missed, thanks for
writing this useful package.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:29

Merged into: #45

[ 1494288 ] CP932 detection

There are some RSSes feeded with "SHIFT_JIS" encoding,
but this encoding may often have some illegal multibyte
sequence... This is "CP932" encoding, very similar to
"SHIFT_JIS" but has some extended character codes.

Now, I added a piece of code for detection of "cp932".

If possible, apply this patch.

regards.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:23

Attachments:

feedparser-cp932.patch

[ 1463179 ] Numeric character references support both &#xH; and &#XH;

Looking at this line of code(140):

sgmllib.charref = re.compile('&#(x?[0-9A-Fa-f]+)[^0-9
A-Fa-f]')

Note that Numeric character references support both &#
xH; and &#XH; syntax. This re has left out the captial
X.

I found this because we have some code conflict when
we both tried to 'fix' the sgmllib.



PS. See http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:28

[ 1404219 ] extend sanitization whitelist for XHTML

Feed elements, certainly in atom, can contain not only
HTML but also XHTML. This means that next to attributes
defined in (X)HTML, they can also contain xml special
attributes like xml:lang, xml:id etc. These do not
introduce javascript/security risks and could be very
useful, so it would make sense to whitelist them for
(X)HTML sanitization.

XHTML content could also contain elements and
attributes from other namespaces, and the same could
probably be said for those. I'm not 100% sure there
about the risk, but it would seem that they are
harmless, and - if present - almost certainly too
important to throw away. Obvious examples could be
MathML or SVG.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:34

[ 1546854 ] Support for media urls

At the moment, feedparser doesn't handle media:content
or media:thumbnail URLs as they're defined in
attributes, and not by the element values.

These methods make it work

def _start_media_content(self, attrsD):
url = attrsD.get('url')
if url:
self._save('media_content', url)

def _start_media_thumbnail(self, attrsD):
url = attrsD.get('url')
if url:
self._save('media_thumbnail', url)

It would be nice to add it, so that feedparser could
get images URL from flickr feeds

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:18

[ 1651355 ] Feed cannot be parsed

When I try to parse the feed at: http://feeds.gawker.com/defamer/full
feedparser starts using a lot of resources and hangs.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:05

Merged into: #146

[ 1572566 ] have bug when parse title and dc:title

there is a bug when parse feed that have title and
dc:title.
if have dc:title after title in a feed,the value of
dc:title will replace title's value. But title's value
is information that we want, so that when we want use
feed.get("title","") to get title,however we get the
value of dc:title.
for example parse
"http://ajaxcn.org/exec/rss?snip=start":
feed.get("title","") get "start",but we want is "Ajax
**"
- <channel>
<title>Ajax**</title>
<link>http://ajaxcn.org/space/start</link>
<description>Ajax lead the way!</description>
<dc:creator>dlee</dc:creator>
<dc:type>Text</dc:type>
<dc:title>start</dc:title>
<dc:identifier>http://ajaxcn.org/space/start</dc:identifier>

<dc:date>2006-08-26T14:41:05+08:00</dc:date>
<dc:language>zh</dc:language>
- <!--
<blogChannel:changes>http://www.weblogs.com/rssUpdates/changes.xml</changes
>

-->
<admin:generatorAgent
rdf:resource="http://www.snipsnap.org/space/version-1.0b3-uttoxeter"
/>

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:14

Merged into: #76

Attachments:

rss.xml

[ 1485079 ] entries[i].tags always a 1-item list w/ del.icio.us entries

The Feed Parser always returns the tags for a
del.icio.us feed entry as a single-item list.

The following example first shows the tags returned for
a valid del.icio.us RSS 1.0 feed entry, then shows the
desired behavior with the tags returned for a valid
Atom 1.0 feed entry:

Python 2.3.5 (#2, Sep 4 2005, 22:01:42)
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for
more information.
>>> import feedparser
>>> mpurl = 'http://del.icio.us/rss/wearehugh'
>>> mp = feedparser.parse(mpurl)
>>> mp.entries[0].tags
[{'term': u'games nomic philosophy', 'scheme': None,
'label': None}]
>>> gmurl = 'http://groovymother.com/links/index.atom'
>>> gm = feedparser.parse(gmurl)
>>> gm.entries[0].tags
[{'term': u'backups', 'scheme':
u'http://groovymother.com/links/tag/', 'label':
u'backups'}, {'term': u'markpilgrim', 'scheme':
u'http://groovymother.com/links/tag/', 'label':
u'markpilgrim'}]

Here's the source of the first entry from the example
del.icio.us feed:

<item
rdf:about="http://www.earlham.edu/~peters/writing/nomic.htm">
<title>Peter Suber, "Nomic"</title>
<link>http://www.earlham.edu/~peters/writing/nomic.htm</link>
<dc:creator>wearehugh</dc:creator>
<dc:date>2006-05-09T21:39:12Z</dc:date>

<dc:subject>games nomic philosophy</dc:subject>
<taxo:topics>
<rdf:Bag>
<rdf:li resource="http://del.icio.us/tag/philosophy" />
<rdf:li resource="http://del.icio.us/tag/games" />
<rdf:li resource="http://del.icio.us/tag/nomic" />
</rdf:Bag>
</taxo:topics>
</item>

I'm not familiar with the RSS 1.0 taxonomy module, and
I don't know if del.icio.us's feeds are presenting tags
"correctly," but because of the popularity of
del.icio.us, it would be desirable to handle them
ultra-liberally.

Here's the source of the first entry from the example
Atom 1.0 feed:

<entry>
<title>Long-term backup [dive into mark]</title>
<link rel="alternate" type="text/html"
href="http://groovymother.com/links/archives/2006/05/07-week/#002448"
/>
<link rel="related" type="text/html"
title="Long-term backup [dive into mark]"
href="http://diveintomark.org/archives/2006/05/08/backup"
/>

<published>2006-05-09T21:49:30Z</published>
<updated>2006-05-09T21:49:30Z</updated>

<id>tag:arsecandle.org,2006:groovymother/links/2448</id>
<summary type="text">When you&apos;re building up
gigabytes of data, how can you realistically
back-it-up?</summary>
<category
scheme="http://groovymother.com/links/tag/"
term="backups" label="backups" />
<category
scheme="http://groovymother.com/links/tag/"
term="markpilgrim" label="markpilgrim" />

</entry>

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:25

Attachments:

[ 1614576 ] Namespace handling requires libxml2

It's been a while since I actually ran into this problem, but I just
remembered that I forgot to file a bug. This might not be completely
accurate, but it's what I remember.

On Ubuntu dapper, I tried to parse an RSS feed that used additional
namespaces, but these weren't accessible through feedparser. After poking
around for a while, I found out that feedparser can use several different
XML parsers, so I installed the preferred one, libxml2. Everything worked
from there.

It'd be nice if the namespace handling page in the documentation pointed
out this difference in behavior. Maybe I'll get around to actually fixing
the problem someday, making that unnecessary.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:08

[ 1587728 ] The attached feed causes a hang during HTML sanitizing

When I attempt to parse the attached body with
feedparser, the whole process hangs.

It's losing during HTML sanitizing -- the last thing
debug output emits indicates it's looking at:

&lt;p&gt;&lt;a
href="http://www.flickr.com/photos/14155499@N00/110182459/"
title="Photo Sharing"&gt;&lt;img
src="http://static.flickr.com/54/110182459_eced6c8a60_o.png"
width="366" height="102" alt=""Include the list of
links referenced in the article"" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;#8212;&lt;/p&gt;

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:11

Merged into: #146

Attachments:

hanging.xml

[ 1495173 ] Misparsing of core elements in extensions [1440553]

Fix for: [ 1440553 ] misparses core elements within
extension elements

The attached patch to feedparser.py fixes the problem
reported as bug #1440553 -
http://sourceforge.net/tracker/index.php?func=detail&aid=1440553&group_id=1
12328&atid=661937
- where core element values are taken from inside
extension elements. Also attached is test case to
demonstrate the viability.

The solution isn't a catch all, notable it won't work
for core elements with doubly nested extension
elements, but that would require the already requested
refactoring of state preservation.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:34

Merged into: #192

Attachments:

error when parsing this feed

What steps will reproduce the problem?
1. try to parse this feed from command line with feedparser. check attached 
file

I expect to see parsed feed, but script produces error.

I took feedparser.py from SVN latests verion. I run it with python 2.4.3 on 
WindowsXP.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 1:56

Attachments:

content.xml

[ 1559875 ] Link parsing is buggy, produces garbage for RSS 2.0 example

Go here:

http://www.feedparser.org/docs/annotated-rss20.html

copy/paste the xml somewhere. run python:

>>> import feedparser
>>> feedparser.parse(path_to_rss20_example).feed.links
[{'href': u'http://example.org/', 'type': 'text/html',
'rel': 'alternate'}, {'type': 'text/html', 'rel':
'alternate'}, {'type': 'text/html', 'rel': 'alternate'}]

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:18

[ 1443138 ] does not put author into author_detail.name [w/patch]

Imagine a tag <author>joe</author>. Clearly joe is the name of the
author (at least choosing from the limited possibilities we have).
Currently
feedparser gives up trying to make an author_detail element. But it should

at least set name.

That's what the attached patch does. (I don't think I did all the unit
tests I
should have, though.)

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:32

Attachments:

author_name_always.patch

[ 1573544 ] patch for gb2312 feeds

feedparser fails when processing this feed,
http://yaoke12345.bokee.com/rss2.xml, because of the
encoding.

gb2312 is a famous and old (published in 1980's)
charator encoding standard in mainland China. It is
has now been obseleted by the new standard, "gb18030".
gb18030 is downward compatible with gb2312, but
contains much more charactors than gb2312, such as
traditional Chinese charactors, Japanese symbols, etc.

However, many Chinese website still announce their
encoding as gb2312, although their web pages are
actually gb18030 encoded. This is a so common that
both MSIE and Firefox use gb18030 to decode gb2312
pages, to render correctly.

In the above example, bokee.com, the largest blog
service provider in mainland Chinese, allowing its user
to post gb18030 blog articles, announces its feed
encoding as gb2312, too.

I think feedparser should be able to handle this
situation, so I made a simple patch (see the
attachment). When testing the gb2312 encoding, it use
gb18030 instead.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:12

Attachments:

feedparser-gb2312.patch

[ 1449142 ] [Documentation] Wrong URL in auth example

I apologize if the SF project bug database isn't the best place to report
this documentation "bug."

In the following example:


http://feedparser.org/docs/http-authentication.html#example.auth.required

line two currently reads:

>>> d =
feedparser.parse('http://feedparser.org/docs/examples/digest_auth.xml')

but, I believe, should read:

>>> d =
feedparser.parse('http://feedparser.org/docs/examples/basic_auth.xml')

On line six, d.headers['www-authenticate'] is inspected as:

'Basic realm="Use test/basic"'

and the digest authentication example is then given.

--
Alexander McCormmach
Email: my first name, all lower case, at tunicate.org

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:31

[ 1540828 ] Charset detection infinite recursion on attached sample

Universal character set detection crashes on the
attached text sample with an infinite recursion error.
Taking only a short sample of the text (say, 400 bytes)
works without error, but not the whole file at once.
Incremental detection as described in the advanced
usage section also fails with a recursion error.

The attachment contains a sample of text that produces
the problem along with a reproduction script. The text
FYI is the text of the president of Iran's new blog
that has been in the news recently.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:21

Attachments:

charset_det_recursion.tgz

[ 1559904 ] RSS 2.0 example on website has no "entries" attribute

When parsed

http://feedparser.org/docs/annotated-rss20.html

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:17

[ 1568001 ] media_thumbnail gets no value

In the feed http://feeds.feedburner.com/drivetime, each
entry has a "media:thumbnail" which has a url
attribute. However feedparser assigns it the value of
a blank string.

The feed source looks horrible in my browser, but
feedparser parses it fairly well otherwise.

example:

<media:thumbnail
url="http://ravijain.org/pressroom/drivetime_itunes_02.jpg"
/>

but:

f =
feedparser.parse("http://feeds.feedburner.com/drivetime")

len(f['entries'][0]['media_thumbnail'])
0

:(

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:15

Attachments:

drivetime

[ 1662900 ] .etag is always present

.etag is always present when feed is fetched from the network even if HTTP
response does not contain ETag header. This contradicts the documentation.

Example:

>>> import feedparser
>>> d =
feedparser.parse('http://www.nsu.ru/dynamic/news/rss.php?news_type=3')
>>> d.headers
{'x-powered-by': 'PHP/4.1.2', 'transfer-encoding': 'chunked', 'vary':
'accept-charset, user-agent', 'server': 'Apache', 'connection': 'close',
'date': 'Sun, 18 Feb 2007 16:25:51 GMT', 'content-type': 'text/xml;
charset=koi8-r'}
>>> d.etag
>>>

This is due to missing check of ETag presence in response parsing. Attached
is the patch which fixes it.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:03

Attachments:

add-etag-only-if-etag-header-present.patch

[ 1491772 ] doesn't resolve content-location

A feed like:

http://www.aaronsw.com/2002/feeds/pg

has a header (sent by Apache) like:

Content-Location: pg.cgi

feedparser does:

baseuri = http_headers.get('content-location', result.get('href'))

which ends up setting the baseuri to "pg.cgi". It should instead probably
join the content-location with the feed document, getting the (more)
correct http://www.aaronsw.com/2002/feeds/pg.cgi

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:24

[ 1519145 ] Two feeds parsed ISO-8859-2 instead of UTF-8

http://community.livejournal.com/dozory/data/rss and
http://community.livejournal.com/dozory/data/atom are
wrongly parsed as ISO-8859-2 instead of UTF-8.

http://community.livejournal.com/technion/data/rss is
parsed correctly as UTF-8.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:22

[ 1654401 ] cannot parse feed: unicode error

unicode error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31:
ordinal not in range(128)

reproduce:
import feedparser
d =
feedparser.parse('http://flickr.com/services/feeds/photos_public.gne?tags=e
speranto&format=rss_200')

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:05

[ 1458377 ] 4.1 docs: feed.icon listed twice, second should be feed.logo

On the main index, there are two entries for feed.icon.
The second one should actually be feed.logo, since it
links to reference-feed-logo.html.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:30

Merged into: #11

[ 1440553 ] misparses core elements within extension elements

test case: last entry of
http://www.snellspace.com/public/ordertest.xml

entry updated/id/title pick up values from inside
x:foo, which is wrong.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:33

Merged into: #76

[ 1528241 ] revisions 1.123-HEAD don't work with Tidy

The problem is introduced by the new 'type' parameter
to _sanitizeHTML() method. It overrides/masks the
builtin type() that is used in the same method (only)
when Tidy is enabled.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:22

.etag is always present in feed, even if response does not contain ETag header


This bug was submitted to Debian BTS [1] by Mikhail Gusarov:

"""
python-feedparser documentation says "etag will only be present if the
feed was retrieved from a web server, and only if the web server
provided an ETag HTTP header for the feed."

However, .etag is present always if feed was fetched from network, and
contains None value, due to missing check in code.
"""

Please check if Mikhail is right and the proposed patch is useful. If you
don't plan to release a new version soon, please tell me if you find
suitable to include the changes as a patch in de debian package until a new
upstream release is ready.
Thanks.

[1]http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411388

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 3:52

Attachments:

feedparser.patch

[ 1627080 ] media_content is not populated

(Apparently I should put my email address in here. It's [email protected])

To my understanding of the namespace support, the media_content dictionary
key should be populated for entries which have such a XML tag set. This is
not occuring. For example, for this RSS feed:

http://video.google.com/videofeed?type=search&q=engedu&so=0&num=20&output=r
ss

We have:

<media:content url="
http://video.google.com/videofile/DebuggingBackwardsin.flv?docid=3897010229
726822034&amp;itag=5" type="video/x-flv" medium="video" expression="full"
duration="307
9" width="320" height="240" /><media:content
url="http://video.google.com/videofile/DebuggingBackwardsin.avi?docid=38970
10229726822034&amp;itag=9" type="video/x-m
svideo" medium="video" expression="full" duration="3079" width="480"
height="360" /><media:content
url="http://video.google.com/videofile/DebuggingBackwardsin.mp4
?docid=3897010229726822034&amp;itag=7" type="video/mp4" medium="video"
expression="full" duration="3079" width="320" height="240" />

For the first entry, but this code:

print 'Namespaces: %s' % parser.namespaces
print 'media:description lengh: %d' %
len(parser.entries[0].media_description)
print 'Has media_content: %s' %
repr(parser.entries[0].has_key('media_content'))

Gives:

Namespaces: {'media': 'http://search.yahoo.com/mrss', 'opensearch':
'http://a9.com/-/spec/opensearchrss/1.0/'}
media:description lengh: 1238
Has media_content: False

This is using FeedParser 4.1-2ubuntu1 (a Ubuntu package).

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:06

[ 1616754 ] TOC has 2 links with the same name. 1 doesn't match the href

STR:
Go to http://feedparser.org/docs/
Search for 'feed.icon'
Examine pages that are linked to with that text

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:07

[ 1564289 ] Dates get mixed up sometimes

the methods to handle "issued" are aliases for the
methods that handle "published". i think that is a bug
in itself, since it's fairly commonplace to include
both <issued/> and <published/> elements in atom, with
different values (e.g. livejournal's atom output).
anyway, imagine this scenario:

<issued>foo</issued>
<published>bar</published>

for the first node, the code will set published->foo
and published_parsed->f,o,o (lets pretend) - remember
issued is an alias for published. for the second,
it'll set published->bar and leave published_parsed as
f,o,o because of the setdefault in
_FeedParserMixin._save()!

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:16

[ 1463291 ] sanitize does not balance tags

>>> feedparser.parse
("<rss><channel><item><description>&lt;blockquote></
description></item></channel></rss>").entries[0].summary
u'<blockquote>'

A feed like this can cause the entire rest of a page to be indented if it's

used in the obvious manner that HTML sanitization appears to be
intended for. Instead, feedparser should close the tag and the end of the
item:

u'<blockquote></blockquote>'

And for some software, close tags without open tags can also be harmful.

Cal Henderson (Flickr)'s thoughts on HTML sanitization seem relatively
sensible and contain a long list of test cases:

http://www.iamcal.com/publish/articles/php/processing_html_part_2/




Date: 2006-04-20 17:17
Sender: aaronsw
Logged In: YES 
user_id=122141

I've made a couple more fixes. The latest version will always be at:

http://www.aaronsw.com/2002/sanitize/


Date: 2006-04-18 15:29
Sender: aaronsw
Logged In: YES 
user_id=122141

The patch is missing a comma at the end of a line after 'colgroup'. That
Python 
doesn't consider `'foo' 'bar'` a syntax error like `1 2` is becoming an 
increasingly large annoyance for me.


Date: 2006-04-06 16:28
Sender: aaronsw
Logged In: YES 
user_id=122141

Since I don't feel like I have a good grasp on the structure of the test
cases and 
since I assume you have some sort of automated tool for making them, I
didn't 
try to make the test cases for this myself. But attached is a Python
script that 
runs a bunch of tests on the sanitize function. It should not be hard to
convert it 
into the feedparser test case format.


Date: 2006-04-06 16:23
Sender: aaronsw
Logged In: YES 
user_id=122141

Attached is a patch to add this functionality. Two test cases had
unbalanced tags 
so I closed them in the test cases to minimize the code they tested.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:27

Attachments:

[ 1458648 ] Bug in parsing BBC feeds

The RSS feeds by the BBC have links in the form:
<link>http://blah</link>

feedparser currently gives empty links for this:
{'type': 'text/html', 'rel': 'alternate'}

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:29

Attachments:

883.xml

[ 1451139 ] Double encoded with chardet

When installing http://chardet.feedparser.org/ it seems
that data is double encoded.

Sorry, this is what I was told: "
What is strange though is that the parser is returning a
unicode string of a utf-8 encoded string.  Somewhere along
the line, there is a double decoding happening."

This is not the case if we uninstall Universal Encoding
Detector.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:31

[ 1615527 ] styles tags are not stripped or sanitized correctly

Typepad feeds with blog entries exported from Flickr usually include a
<style> tag. Feedparser strips out the opening and closing tags correctly,
but not the CSS in-between, and the resulting CSS spills into the text.

The quick and simple option is to just strip out everything in-between
style tags.

A second option would be to use the style attribute sanitizing code on the
tag content, but this would allow a feed to influence other feeds'
presentation on pages that have more than one feed aggregated in a single
HTML page.

The fix is quite simple: just add style to
_HTMLSanitizer.unacceptable_elements_with_end_tag

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:08

feedparser._parse_date gets dates with fractional seconds wrong in _parse_date_w3dtf [PATCH]

I think this is a separate bug to the other fractional seconds bug that's
open from the other day.

>>> feedparser._parse_date('2007-04-23T23:25:47+10:00')
(2007, 4, 23, 13, 25, 47, 0, 113, 0)
>>> feedparser._parse_date('2007-04-23T23:25:47.538+10:00')
(2007, 4, 24, 0, 25, 47, 1, 114, 1)

>>> feedparser._parse_date_w3dtf('2007-04-23T23:25:47.538+10:00')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/var/lib/python-support/python2.4/rawdoglib/feedparser.py", line
2245, in _parse_date_w3dtf
    gmt = __extract_date(m) + __extract_time(m) + (0, 0, 0)
  File "/var/lib/python-support/python2.4/rawdoglib/feedparser.py", line
2207, in __extract_time
    seconds = int(seconds)
ValueError: invalid literal for int(): 47.538

It's because m.group('seconds') returns a string that is passed to int() in
_parse_date_w3dft.

I fixed it with the following but there are other uses of int() on output
from m.group around there.

--- feedparser.py.orig  2007-04-23 15:48:10.000000000 +0100
+++ feedparser.py       2007-04-23 15:47:40.000000000 +0100
@@ -2204,7 +2204,7 @@
         minutes = int(m.group('minutes'))
         seconds = m.group('seconds')
         if seconds:
-            seconds = int(seconds)
+            seconds = int(float(seconds))
         else:
             seconds = 0
         return hours, minutes, seconds


With this patch I get:

>>> feedparser._parse_date('2007-04-23T23:25:47+10:00')
(2007, 4, 23, 13, 25, 47, 0, 113, 0)
>>> feedparser._parse_date('2007-04-23T23:25:47.538+10:00')
(2007, 4, 23, 13, 25, 47, 0, 113, 0)

Discovered using rawdog and the feed at http://etbe.blogspot.com/atom.xml

Original issue reported on code.google.com by [email protected] on 23 Apr 2007 at 2:50

[ 1598443 ] Failure to support ISO8601/W3CDTF fractional seconds + patch

Hi, I'm Zach Beane, <[email protected]>.

The W3CDTF says that the date format may include fractional seconds, but
the regular expression does not take that into account. This causes a
failure to detect a time zone that follows the fractional second.

Here's a patch that detects (but ignores) the fractional second value and
allows the correct processing of the timezone that follows.

diff -c /home/xach/tmp/feedparser.py\~ /home/xach/tmp/feedparser.py
*** /home/xach/tmp/feedparser.py~ 2006-01-10 23:32:22.000000000 -0500
--- /home/xach/tmp/feedparser.py 2006-11-17 11:11:54.355518778 -0500
***************
*** 1860,1865 ****
--- 1860,1866 ----
'CC', r'(?P<century>\d\d$)')
+ r'(T?(?P<hour>\d{2}):(?P<minute>\d{2})'
+ r'(:(?P<second>\d{2}))?'
+ + r'(\.(?P<fracsecond>\d+))?'
+ r'(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?'
for tmpl in _iso8601_tmpl]
del tmpl




Ooops! This is fixed in CVS, but the pattern for fractional seconds should
be \.\d+ instead of \.\d*.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:11

[ 1544440 ] Exception for malformed numeric entities in content

As in attached file, feedparser cannot handle malformed
numeric entities such &#a; and raises an exception. A
workaround for this is
505,508c505,511
< if ref[0] == 'x':
< c = int(ref[1:], 16)
< else:
< c = int(ref)

---
> try:
> if ref[0] == 'x':
> c = int(ref[1:], 16)
> else:
> c = int(ref)
> except:
> c = 0

Don't know if this is what the parser is supposed to do
in case of malformed entities...

Bye,
Giuseppe Ottaviano
[email protected]

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:19

Attachments:

atom_malformed.xml

[ 1541699 ] Feed parser can not be imported

I am using Feedparser to parser RSS feeds. When I run
the CGI scripts, it displays feedparser is undefined
and can not find several modules. I have downloaded
and installed in my cgi-bin directory. Then I
use "import feedparser" in my CGI scripts. But it
does not work.



Date: 2007-01-09 20:01
Sender: nobody
Logged In: NO 

same problem for me..

python 2.4.3
__version__ = "4.1"# + "$Revision: 1.92 $"[11:15] + "-cvs"

BUT, it only fails (sometimes) when importing from the interpreter. Ha,
what a bug!  After the 7th time trying >>>import feedparser, it imported
just fine.  

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/feedparser.py", line 1958, in ?
    _korean_nate_date_re = \
  File "/usr/lib/python2.4/sre.py", line 180, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.4/sre.py", line 225, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.4/sre_compile.py", line 500, in compile
    code = _code(p, flags)
  File "/usr/lib/python2.4/sre_compile.py", line 484, in _code
    _compile(code, p.data, flags)
  File "/usr/lib/python2.4/sre_compile.py", line 96, in _compile
    _compile(code, av[1], flags)
  File "/usr/lib/python2.4/sre_compile.py", line 52, in _compile
    _compile_charset(av, flags, code, fixup)
  File "/usr/lib/python2.4/sre_compile.py", line 178, in _compile_charset
    for op, av in _optimize_charset(charset, fixup):
  File "/usr/lib/python2.4/sre_compile.py", line 221, in
_optimize_charset
    return _optimize_unicode(charset, fixup)
  File "/usr/lib/python2.4/sre_compile.py", line 341, in
_optimize_unicode
    mapping = array.array('b', mapping).tostring()
AttributeError: 'module' object has no attribute 'array'

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:20

[ 1573251 ] RSS 1.0 dc:subject doesn’t get parsed into entry.dc_subject

In an RSS 1.0 feed, feedparser 4.1 doesn’t fails to parse a dc:subject
element and expose it via either entry.dc_subject or entry['dc_subject']

Attached there’s a test case (an RSS 1.0 feed showing that behavior).

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:13

Attachments:

feed-latin1.rdf

[ 1475524 ] feedparser gives no indication of parsing success

I've built two serious pieces of software on top of feedparser and both
have run into the same problem: feedparser gives no reliable indication of

whether it's successfully parsed a feed. I've tried to look to see if the
entries list exists, but sometimes a successfully-parsed feed has no
entries. I've tried to look at bozo_exception but that gets set even when
the feed was actually parsed.

I suggest a new attribute (bozo_status?) set by feedparser.parse to
indicate the status. It'd indicate whether an exception was thrown in
_open_resource or decompressing, causing feedparser to give up. Ideally
it'd also indicate if the document didn't even look anything like RSS, but

that doesn't seem to be easily doable within the current structure.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:25

[ 1696520 ] missing 'msn' and 'live' namespaces

the parser doesn't recognize microsoft's 'msn' and 'live' namespaces, and
do not construct variables like 'msn_type' or 'live_type', which msl-live's
feeds use to distinguish entry between photos and articles

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:02

[ 1501902 ] Atom 1.0 link missing .txt

On page
http://www.feedparser.org/docs/version-detection.html
the link for atom 10 should be
http://www.ietf.org/rfc/rfc4287.txt
but is missing .txt

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:23

[ 1458381 ] 4.1: No support for multiple authors

The latest Atom-1.0 spec
(http://www.ietf.org/rfc/rfc4287) makes it legal for
multiple <author> elements to be present in <feed>,
<entry>, and <source>, the same as <contributor>.
Feedparser currently only supports a single <author>. A
little experimenting shows that with multiple <author>
elements, the subelements of the second and later
<author> overwrite those of the first <author>.

Suggested fix: Treat <author> like <contributor> and
create feed.authors[i].{name,href,email}. Leave
feed.author and feed.author_detail as-is, or have them
contain information for only the first <author>, for
backwards-compatibility.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:30

[ 1649999 ] Blogger's feeds cannot be parsed

The line:
<?xml-stylesheet href="http://www.blogger.com/styles/atom.css"
type="text/css"?>
which appears in the beginning of Blogger's feeds causes the parser to fail
and return empty.
Can you please fix this problem?

My email is [email protected].
10x!

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:06

Migrate outstanding SourceForge issues to Google Code

Don't see an automated way of doing it, will probably have to copy and paste.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 4:06

[ 1697297 ] iTunes image overides feed.image in iTunes Namespace

In iTunes feeds containing both the <image> tag and <itunes:image> tag the
feed image is overwritten by the iTunes image. All the image_detail is also
overwritten. The feed validates containing both tags, both are legal.
Feedparser does follow the iTunes application behavior and falls back to
the <image> tag when no <itunes:image> tag exists, so it's not broken per
se.
see http://www.freetalklive.com/netcast.xml for described behavior.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:01

[ 1532607 ] Entry title munged

With the feed http://www.democracynow.org/podcast.xml
feedparser is occasionally munging entry titles. With
the present feed contents, entries[0].title is

'\r\xe9\xa8r\xb6\x9c\xc8\xda02\x89\xddk"n\x97-\xf5\xdbM:'

entries[1].title is

u'Democracy Now! - Thursday, July 27, 2006'

FWIW, the feed does validate.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:21

[ 1562102 ] Crash on upper ASCII

Version 4.1 is crashing on input with some upper ascii
characters and I guess no encoding. Traceback:

doc = feedparser.parse(self.url)
File
"/var/lib/python-support/python2.4/feedparser.py", line
2623, in parse
feedparser.feed(data)
File
"/var/lib/python-support/python2.4/feedparser.py", line
1441, in feed
sgmllib.SGMLParser.feed(self, data)
File "/usr/lib/python2.4/sgmllib.py", line 100, in feed
self.goahead(0)
File "/usr/lib/python2.4/sgmllib.py", line 139, in
goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.4/sgmllib.py", line 297, in
parse_endtag
self.finish_endtag(tag)
File "/usr/lib/python2.4/sgmllib.py", line 337, in
finish_endtag
self.unknown_endtag(tag)
File
"/var/lib/python-support/python2.4/feedparser.py", line
476, in unknown_endtag
method()
File
"/var/lib/python-support/python2.4/feedparser.py", line
1217, in _end_description
value = self.popContent('description')
File
"/var/lib/python-support/python2.4/feedparser.py", line
700, in popContent
value = self.pop(tag)
File
"/var/lib/python-support/python2.4/feedparser.py", line
641, in pop
output = _resolveRelativeURIs(output, self.baseuri,
self.encoding)
File
"/var/lib/python-support/python2.4/feedparser.py", line
1594, in _resolveRelativeURIs
p.feed(htmlSource)
File
"/var/lib/python-support/python2.4/feedparser.py", line
1441, in feed
sgmllib.SGMLParser.feed(self, data)
File "/usr/lib/python2.4/sgmllib.py", line 100, in feed
self.goahead(0)
File "/usr/lib/python2.4/sgmllib.py", line 134, in
goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.4/sgmllib.py", line 284, in
parse_starttag
self.finish_starttag(tag, attrs)
File "/usr/lib/python2.4/sgmllib.py", line 315, in
finish_starttag
self.unknown_starttag(tag, attrs)
File
"/var/lib/python-support/python2.4/feedparser.py", line
1589, in unknown_starttag
_BaseHTMLProcessor.unknown_starttag(self, tag, attrs)
File
"/var/lib/python-support/python2.4/feedparser.py", line
1460, in unknown_starttag
strattrs = u''.join([u' %s="%s"' % (key, value) for
key, value in uattrs]).encode(self.encoding)
LookupError: unknown encoding:

XML that breaks it is attached.

Original issue reported on code.google.com by [email protected] on 19 Apr 2007 at 5:16

Attachments:

bb.xml

libo26 / feedparser Goto Github PK

feedparser's People

Contributors

Watchers

feedparser's Issues

Recommend Projects

Recommend Topics

Recommend Org