
feed-extractor

To read & normalize RSS/ATOM/JSON feed data.


(This library is derived from feed-reader, which has been renamed.)

Demo

Install & Usage

Node.js

npm i @extractus/feed-extractor
import { extract } from '@extractus/feed-extractor'

// extract an RSS feed
const result = await extract('https://news.google.com/rss')
console.log(result)

Deno

import { extract } from 'npm:@extractus/feed-extractor'

Browser

import { extract } from 'https://esm.sh/@extractus/feed-extractor'

Please check the examples for reference.

Automate RSS feed extraction with GitHub Actions

RSS Feed Fetch Action is a GitHub Action designed to automate the fetching of RSS feeds. It fetches an RSS feed from a given URL and saves it to a specified file in your GitHub repository. This action is particularly useful for populating content on GitHub Pages websites or other static site generators.

CJS Deprecated

CJS is deprecated for this package. Calling require('@extractus/feed-extractor') now logs a deprecation warning. You should update your code to use the ESM export; a suppression sketch for legacy code follows the list below.

  • You can silence this warning via the environment variable FEED_EXTRACTOR_CJS_IGNORE_WARNING=true
  • To see where the warning is coming from, set the environment variable FEED_EXTRACTOR_CJS_TRACE_WARNING=true
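
For legacy CJS codebases that cannot migrate yet, here is a minimal sketch of silencing the warning; it assumes the environment variable is read at require time and that the CJS build exposes the same extract function:

// a minimal CJS sketch; assumes the env var is checked at require time
process.env.FEED_EXTRACTOR_CJS_IGNORE_WARNING = 'true'
const { extract } = require('@extractus/feed-extractor')

extract('https://news.google.com/rss').then(console.log)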

APIs

Note:

  • The old method read() has been deprecated and will be removed in the next major release.

extract()

Loads and extracts feed data from a given RSS/ATOM/JSON source. Returns a Promise.

Syntax

extract(String url)
extract(String url, Object parserOptions)
extract(String url, Object parserOptions, Object fetchOptions)

Example:

import { extract } from '@extractus/feed-extractor'

const result = await extract('https://news.google.com/atom')
console.log(result)

Without any options, the result should have the following structure:

{
  title: String,
  link: String,
  description: String,
  generator: String,
  language: String,
  published: ISO Datetime String,
  entries: Array[
    {
      id: String,
      title: String,
      link: String,
      description: String,
      published: ISO Datetime String
    },
    // ...
  ]
}
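
As a quick sketch of consuming that structure, assuming a reachable feed URL:

import { extract } from '@extractus/feed-extractor'

// walk the normalized result shown above
const feed = await extract('https://news.google.com/rss')
console.log(feed.title, feed.link)
for (const entry of feed.entries) {
  console.log(entry.published, '-', entry.title)
}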

Parameters

url required

URL of a valid feed source

Feed content must be accessible and conform to one of the following standards: RSS, ATOM, or JSON Feed.

parserOptions optional

Object with all or several of the following properties:

  • normalization: Boolean, normalize feed data or keep the original. Default true.
  • useISODateFormat: Boolean, convert datetime to ISO format. Default true.
  • descriptionMaxLen: Number, max length to truncate the description to. Default 250 characters. Set to 0 for no truncation.
  • xmlParserOptions: Object, passed to the XML parser; see fast-xml-parser's docs
  • getExtraFeedFields: Function, to get more fields from the feed data
  • getExtraEntryFields: Function, to get more fields from each feed entry
  • baseUrl: URL string, used to resolve relative links within the feed content

For example:

import { extract } from '@extractus/feed-extractor'

await extract('https://news.google.com/atom', {
  useISODateFormat: false
})

await extract('https://news.google.com/rss', {
  useISODateFormat: false,
  getExtraFeedFields: (feedData) => {
    return {
      subtitle: feedData.subtitle || ''
    }
  },
  getExtraEntryFields: (feedEntry) => {
    const {
      enclosure,
      category
    } = feedEntry
    return {
      enclosure: {
        url: enclosure['@_url'],
        type: enclosure['@_type'],
        length: enclosure['@_length']
      },
      category: typeof category === 'string' ? category : {
        text: category['@_text'],
        domain: category['@_domain']
      }
    }
  }
})

fetchOptions optional

fetchOptions is an object that can have the following properties:

  • headers: to set request headers
  • proxy: another endpoint to forward the request to
  • agent: an HTTP proxy agent
  • signal: AbortController signal or AbortSignal timeout to terminate the request

For example, you can use this parameter to set request headers as follows:

import { extract } from '@extractus/feed-extractor'

const url = 'https://news.google.com/rss'
await extract(url, null, {
  headers: {
    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
  }
})

You can also specify a proxy endpoint to load remote content through, instead of fetching directly.

For example:

import { extract } from '@extractus/feed-extractor'

const url = 'https://news.google.com/rss'

await extract(url, null, {
  headers: {
    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
  },
  proxy: {
    target: 'https://your-secret-proxy.io/loadXml?url=',
    headers: {
      'Proxy-Authorization': 'Bearer YWxhZGRpbjpvcGVuc2VzYW1l...'
    }
  }
})

Passing requests through a proxy is useful when running @extractus/feed-extractor in the browser. See examples/browser-feed-reader for a reference example.

Another way to work with a proxy is to use the agent option instead of proxy, as below:

import { extract } from '@extractus/feed-extractor'

import { HttpsProxyAgent } from 'https-proxy-agent'

const proxy = 'http://username:password@your-proxy-host:31113'

const url = 'https://news.google.com/rss'

const feed = await extract(url, null, {
  agent: new HttpsProxyAgent(proxy),
})
console.log('Run feed-extractor with proxy:', proxy)
console.log(feed)

For more info about https-proxy-agent, check its repo.

By default, there is no request timeout. You can use the signal option to cancel the request when needed.

The common way is to use AbortController:

const controller = new AbortController()

// stop after 5 seconds
setTimeout(() => {
  controller.abort()
}, 5000)

const data = await extract(url, null, {
  signal: controller.signal,
})

A newer solution is AbortSignal's timeout() static method:

// stop after 5 seconds
const data = await extract(url, null, {
  signal: AbortSignal.timeout(5000),
})

For more info, see the MDN documentation on AbortController and AbortSignal.

extractFromJson()

Extracts feed data from a JSON string. Returns an object containing the feed data.

Syntax

extractFromJson(String json)
extractFromJson(String json, Object parserOptions)

Example:

import { extractFromJson } from '@extractus/feed-extractor'

const url = 'https://www.jsonfeed.org/feed.json'
// this resource provides data in JSON feed format
// so we fetch remote content as json
// then pass to feed-extractor
const res = await fetch(url)
const json = await res.json()

const feed = extractFromJson(json)
console.log(feed)

Parameters

json required

JSON string loaded from a JSON feed resource.

parserOptions optional

See parserOptions above.

extractFromXml()

Extracts feed data from an XML string. Returns an object containing the feed data.

Syntax

extractFromXml(String xml)
extractFromXml(String xml, Object parserOptions)

Example:

import { extractFromXml } from '@extractus/feed-extractor'

const url = 'https://news.google.com/atom'
// this resource provides data in ATOM feed format
// so we fetch remote content as text
// then pass to feed-extractor
const res = await fetch(url)
const xml = await res.text()

const feed = extractFromXml(xml)
console.log(feed)

Parameters

xml required

XML string loaded from an RSS/ATOM feed resource.

parserOptions optional

See parserOptions above.

Test

git clone https://github.com/extractus/feed-extractor.git
cd feed-extractor
pnpm i
pnpm test


Quick evaluation

git clone https://github.com/extractus/feed-extractor.git
cd feed-extractor
pnpm i
pnpm eval https://news.google.com/rss

License

The MIT License (MIT)

Support the project

If you find value in this open source project, please consider supporting it.

Thank you.


feed-extractor's People

Contributors

almis90, ekoeryanto, eviltik, gouz, kahosan, m4rc3l05, ndaidong, neizod, olsonpm, turt2live


feed-extractor's Issues

Add options to get specific fields

Hi,

Your library is really simple and great, but I have a problem: I want to get the guid field of RSS items, but the library doesn't return it.

Would it be possible to add a parser option to specify which fields to return?

Thanks

Hardcoded attributeNamePrefix value in xmlParserOptions

Hi!

First and foremost, thanks for your work!

I've been using the library in my GitHub Action and I tried to change the attributeNamePrefix property in xmlParserOptions, but it didn't work. I had a look at the code and noticed it's hardcoded and thus impossible to change:

attributeNamePrefix: '@_',

Is there any reasoning behind this decision that I'm not aware of? Would it be possible to make it modifiable via xmlParserOptions like the rest of the properties?

I can provide a pull request for this if you don't mind.

Thanks a lot!

Empty description when content is wrapped in CDATA

Hi!

When I pass a feed whose content is wrapped in CDATA tags, the normalized feed entry contains an empty description.

Sample feeds:

For now I use a dirty workaround using getExtraEntryFields and some custom code to process HTML:

getExtraEntryFields: (feedEntry) => {
  // stripAndTruncateHTML and siteConfig are my own helpers
  const cdataDescription = feedEntry.description.includes("<![CDATA[")
    ? stripAndTruncateHTML(
        feedEntry.description
          .replaceAll("<![CDATA[", "")
          .replaceAll("]]>", ""),
        siteConfig.maxPostLength
      )
    : "";

  return { cdataDescription };
}

Also, do you have a donation link or something? I'd love to buy you a coffee because this project ROCKS. ❤️

Minor regression in v7.0.3

Hi!

After fixing #105, I noticed a small regression in RSS feeds that serve both content:encoded and description in their items.

content:encoded is used first (even though a human-friendly description is available), and that results in a pile of HTML/CSS code being served.

Feed that shows this problem: https://turystyka-niecodzienna.pl/rss

I suspect it may be possible to fix by switching the order from content || description into description || content (and perhaps htmlContent || description into description || htmlContent?) in 3e1d612#diff-79bdb3bf907b1dc8f0ca3b16390b8e93716d86d536837a4cbda4d9b0b2b19ee7

fetch error: cert invalid

Type:     FetchError
Message:  request to https://www.logseqtimes.com/rss/ failed, reason: Hostname/IP does not match certificate's altnames: Host: www.logseqtimes.com. is not in the cert's altnames: DNS:fallback.tls.fastly.net

ref: avelino/bots.clj.social#103


There should be an option to bypass the certificate check.
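
There is no documented option for this, but a possible workaround is the agent fetch option described above, combined with Node's built-in https.Agent. A minimal sketch, assuming the underlying fetch honors the agent option (do not use this in production, as it disables TLS verification):

import https from 'node:https'
import { extract } from '@extractus/feed-extractor'

const feed = await extract('https://www.logseqtimes.com/rss/', null, {
  // skip certificate validation; unsafe outside of debugging
  agent: new https.Agent({ rejectUnauthorized: false }),
})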

Missing optional entry fields?

Hi again :)

Below is a raw feed entry:

<entry>
    <author>
        <name>/u/0xdea</name>
        <uri>https://www.reddit.com/user/0xdea</uri>
    </author>
    <category term="netsec" label="r/netsec"/>
    <content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/0xdea&quot;&gt; /u/0xdea &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://security.humanativaspa.it/automating-binary-vulnerability-discovery-with-ghidra-and-semgrep/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content>
    <id>t3_vtcsdv</id>
    <link href="https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/" />
    <updated>2022-07-07T07:27:52+00:00</updated>
    <published>2022-07-07T07:27:52+00:00</published>
    <title>Automating binary vulnerability discovery with Ghidra and Semgrep</title>
</entry>

Below are the attributes returned by feed-reader; some fields are missing:

{
  title: 'Automating binary vulnerability discovery with Ghidra and Semgrep',
  link: 'https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/',
  description: 'submitted by /u/0xdea [link] [comments]',
  published: '2022-07-07T07:27:52.000Z',
}

We should expect something like this:

{
  id:'t3_vtcsdv',
  author: {
    name:'/u/0xdea',
    uri:'https://www.reddit.com/user/0xdea'
  },
  category: {
      term:'netsec',
      label:'r/netsec'
  },
  content:{
      type: 'html',
      rawValue:'&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/0xdea&quot;&gt; /u/0xdea &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://security.humanativaspa.it/automating-binary-vulnerability-discovery-with-ghidra-and-semgrep/&quot;&gt[link]&lt;/a&gt;&lt;/span&gt;&amp;#32;&lt;span&gt;&lt;ahref=&quot;https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;'
  },
  title: 'Automating binary vulnerability discovery with Ghidra and Semgrep',
  link: 'https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/',
  description: 'submitted by /u/0xdea [link] [comments]',
  published: '2022-07-07T07:27:52.000Z',
  updated: '2022-07-07T07:27:52.000Z',
}

see #36
see #13

So, before I start coding on my side, I'd like to know why you didn't implement all the fields. A missed opportunity? Lack of time? Or a deliberate choice for good reasons?

Your module could be a good one, because many of the others rely on the request module, which has been deprecated for a long time now. That's a good opportunity. But if we cannot access all the other fields, your module will stay invisible.

What do you think? Thank you!
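
For what it's worth, the getExtraEntryFields parser option documented above can surface these fields today. A hedged sketch, where the feed URL is illustrative and the raw property names (author, category, content, updated, the '@_' attribute prefix, and '#text') depend on the feed's XML and the parser defaults:

import { extract } from '@extractus/feed-extractor'

const feed = await extract('https://www.reddit.com/r/netsec/.rss', {
  getExtraEntryFields: (feedEntry) => {
    const { author, category, content, updated } = feedEntry
    return {
      author,
      updated,
      // attribute names assume the default '@_' prefix
      category: category ? {
        term: category['@_term'],
        label: category['@_label'],
      } : undefined,
      content: content ? {
        type: content['@_type'],
        rawValue: content['#text'],
      } : undefined,
    }
  },
})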

RSS Results Structure Changes Depending on Normalization

Hello!

I am glad to have found your module, it looks like it will make handling feeds easy.


The structure of the results from fetching an RSS feed depends on whether the normalization option is set.

If it is false, the object containing the feed items is called item, and if it is true, it is called entries.

I don't know if this is intended behaviour; the documentation doesn't mention it either way.
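
A small sketch of the difference the issue describes, assuming an RSS source:

import { extract } from '@extractus/feed-extractor'

const url = 'https://news.google.com/rss'

// normalized (default): items live under `entries`
const normalized = await extract(url)
console.log(normalized.entries.length)

// raw: the shape mirrors the XML, so RSS items live under `item`
const raw = await extract(url, { normalization: false })
console.log(raw.item.length)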

Some items being ignored due to hardcoded limits

Can I ask what the purpose behind the length checks at the beginning of normalize() are? Specifically this:

  if (!link || !title ||
    !isString(link) || !isString(title) ||
    link.length < 10 || title.length < 10) {
    return false;
  }

It took me ages to track down why some items in a feed were coming back as undefined, and it turned out to be because the title was short.

Cannot define extra entry fields to fetch

Hi,
The RSS feed I fetch includes, for each item, an illustration image for the published article. As this field is not provided by default, I tried to define it in the parser options as explained in the documentation, but I get the following error when executing my script:

TypeError: Cannot read properties of undefined (reading '@_url')

Here is the definition of my options (I use typescript):

const options = {
    getExtraEntryFields: (entryData: any /* What is the expected type ?? */) => {
        const { enclosure } = entryData
        return {
            enclosure: {
                url: enclosure['@_url'], // enclosure is undefined ...
                type: enclosure['@_type'], // enclosure is undefined ...
            }
        }
    }
}

const rss = await extract(url, options)

I'm new to using RSS feeds; can you help me understand the error and resolve it?
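
The error means some entries have no enclosure element at all, so the attribute lookup on undefined throws. A defensive sketch of the same options with a guard (the feed URL is hypothetical):

import { extract } from '@extractus/feed-extractor'

const url = 'https://example.com/feed'

const options = {
  getExtraEntryFields: (entryData) => {
    const { enclosure } = entryData
    // not every item carries an <enclosure>; skip those that don't
    if (!enclosure) {
      return {}
    }
    return {
      enclosure: {
        url: enclosure['@_url'],
        type: enclosure['@_type'],
      },
    }
  },
}

const rss = await extract(url, options)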

CDATA in description not parsed as desired

Hi Team,

Thanks for building this open-source tool. I'm new to dealing with RSS feeds and wanted an easy way to parse the data into typed objects. I'm having an issue with one feed that embeds a lot of CDATA in the description, containing a lot of HTML with styles, links to images, etc.

Here is an example:
(NOTE: some of this is hidden by the browser; opening this issue in Edit view should show all the data. If there is a way to prevent it from rendering as HTML in this issue, I don't know it.)

<description><![CDATA[<a href="https://someorg.org/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/" title="Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs" rel="nofollow"><img width="300" height="157" src="https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-300x157.jpg" class="webfeedsFeaturedVisual wp-post-image" alt="German Flag over building" decoding="async" style="float: left; margin-right: 5px;" link_thumbnail="1" loading="lazy" srcset="https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-300x157.jpg 300w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-1024x536.jpg 1024w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-768x402.jpg 768w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p>The German Supply Chain <span class="glossaryLink"  aria-describedby="tt"  data-cmtooltip="&#38;lt;!-- wp:paragraph --&#38;gt;Often the second stage in the third-party risk management life cycle. Due diligence involves conducting a review of a potential third party prior to signing a contract. This review should involve developing a deeper understanding of the third party&#8217;s ownership, operations, resources, financial status, relevant employees, risk and control framework, business continuity program, third-party risk management program, and other factors important to the third-party relationship. Due diligence helps ensure the organization selects an appropriate third party to partner with, and that the organization understands both the inherent and residual risks posed by the relationship. These residual risks should be within the organization&#8217;s risk appetite.&#38;lt;br/&#38;gt;&#38;lt;!-- /wp:paragraph --&#38;gt;"  data-gt-translate-attributes='[{"attribute":"data-cmtooltip", "format":"html"}]'>Due Diligence</span> Act goes into effect January 2023 and is already making waves within supply chain, risk management, and compliance communities. [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://someorg.org/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/">Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs</a> appeared first on <a rel="nofollow" href="https://someorg.org">Aravo</a>.</p>
]]></description>

Options: { descriptionMaxLen: 20000, xmlParserOptions: { /* I've tried a bunch... nothing "worked" */ } }

Output:

description:  "The German Supply Chain Due Diligence Act goes into effect January 2023 and is already making waves within supply chain, risk management, and compliance communities. [&#8230;] The post Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs appeared first on Aravo."
link:  "https://aravo.com/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/"
published:  "2022-12-01T14:33:07.000Z"
title:  "Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs"

Desired output: all contents of the description CDATA.

Questions:

  • Is this something that can be supported?
  • How unusual (to you) is this use of the description field (all CDATA of HTML)?

fast-xml-parser regex vulnerability patch could be improved from a safety perspective

Summary

This is a comment on GHSA-6w63-h3fj-q4vw and the patches fixing it.

ref GHSA-gpv5-7x3g-ghjv

Details

The code which validates a name calls the validator:
https://github.com/NaturalIntelligence/fast-xml-parser/blob/ecf6016f9b48aec1a921e673158be0773d07283e/src/xmlparser/DocTypeReader.js#L145-L153
This checks for the presence of an invalid character. Such an approach is always risky, as it is easy to forget to include an invalid character in the list. A safer approach is to validate entity names against the XML specification (https://www.w3.org/TR/xml11/#sec-common-syn); an ENTITY name is a Name:

[4]   NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
                        [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
                        [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]  NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name ::= NameStartChar (NameChar)*

so the safest way to validate an entity name is to build a regex to represent this expression and check whether the name given matches the regex. (Something along the lines of /^[name start char class][name char class]*$/.) There's probably a nice way to simplify the explicit list rather than typing it out verbatim using Unicode character properties, but I don't know enough to do so.

Disable item description trimming?

Is it possible to skip truncating the description and returning full contents in any way?

Right now I pass descriptionMaxLen with some impossibly large value (999999), but it's a bit of a hacky workaround.

It would be great if I could pass -1 or false to skip description truncation altogether.
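
Per the parserOptions section above, descriptionMaxLen: 0 now disables truncation, so the workaround reduces to:

import { extract } from '@extractus/feed-extractor'

// 0 means no truncation (see parserOptions above)
const feed = await extract('https://news.google.com/rss', {
  descriptionMaxLen: 0,
})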

Add `id` property to entries

Nice library!

It would be helpful to have an 'id' property added to each entry. This allows an entry to be uniquely tracked, and ensures that if the URL of a feed item updates, it's still considered the same entry.

  • JSON Feed has a required id property.
  • RSS has guid, but it is optional. If it's not set, the general recommendation is to use the URL as the unique identifier instead.
  • Atom has the id field for each entry, which is also required.

So it should be pretty easy to normalize this into an id field and make it non-optional in the type definition.
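
A minimal sketch of that fallback, where entry is a hypothetical raw entry object rather than this library's API:

// prefer an explicit id, then RSS guid, then the link as a last resort
const resolveId = (entry) => entry.id ?? entry.guid ?? entry.link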

IDE Type Error when adding optional FeedEntries like category

In the index.d.ts file:

export interface FeedEntry {
  /**
   * id, guid, or generated identifier for the entry
   */
  id: string;
  link?: string;
  title?: string;
  description?: string;
  published?: Date;
}

Since a feed entry allows for custom extra keys like category and enclosure, adding an optional parameter would stop the error that pops up in VSCode.


In my case, since I am using an array of strings for tags (fetching a Medium RSS feed), I added a line like:

interface FeedEntry {
  ...
  category?: Array<string>;
}

But I believe this would fail again if we had a custom object for categories, with text and domain, as mentioned in the examples on npm.

Add support for fetch options

I needed to be able to pass options to the underlying fetch to adjust the timeout, etc. Here's a patch to enable that, in case it's of use to anyone else:

@@ -15,9 +15,9 @@
 var isArray = bella.isArray;
 var isObject = bella.isObject;

-var toJSON = (source) => {
+var toJSON = (source, opts) => {
   return new Promise((resolve, reject) => {
-    fetch(source).then((res) => {
+    fetch(source, opts).then((res) => {
       if (res.ok && res.status === 200) {
         return res.text();
       }
@@ -174,9 +174,9 @@
 };


-var parse = (url) => {
+var parse = (url, opts = {}) => {
   return new Promise((resolve, reject) => {
-    toJSON(url).then((o) => {
+    toJSON(url, opts).then((o) => {
       let result;
       if (o.rss && o.rss.channel) {
         let t = o.rss.channel;

The link cannot be resolved when the hostname is not included

example:

<channel>
  <link>/</link>
  <language>en</language>
  <atom:link href="/index.xml" rel="self" type="application/rss+xml" />
  <item>
    <link>/posts/2023/06/piem/</link>
    <guid>/posts/2023/06/piem/</guid>
  </item>
</channel>

When the link is in the above format, it is resolved as null:

{
  "link": null,
  "language": "en",
  "atom:link": {
    "@_href": "/index.xml",
    "@_rel": "self",
    "@_type": "application/rss+xml"
  },
  "item": [
    {
      "link": null,
      "guid": "/posts/2023/06/piem/"
    }
  ]
}
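
The baseUrl parser option documented above targets exactly this case. A sketch, with hypothetical URLs:

import { extract } from '@extractus/feed-extractor'

// resolve relative links in the feed against the site origin
const feed = await extract('https://example.com/index.xml', {
  baseUrl: 'https://example.com',
})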

Support Get favicon

Hi, can feed-extractor and oEmbed Extractor support crawling the favicon URL, like article-extractor does?

CORS

Hi,
thanks for this awesome library.

Unfortunately, I can't fetch 90% of the RSS sources because of CORS issues.
Do you have any suggestions on how to solve it?
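
One approach is the proxy fetch option documented above: route requests through your own endpoint, which can add the necessary CORS headers. A sketch with a hypothetical proxy endpoint:

import { extract } from '@extractus/feed-extractor'

const feed = await extract('https://news.google.com/rss', null, {
  proxy: {
    // your own server fetches the feed and replies with CORS headers
    target: 'https://your-cors-proxy.example/loadXml?url=',
  },
})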

[Feature Request] Need more fields; want the result to be customizable via options

I use this tool to parse RSS feeds, but some fields I need are not in the result, such as image and owner.

I'd like two options, extraFeedFields and extraEntryFields, used as functions whose return values get merged into the feed and entry fields, so everyone can customize the result.

const feedData = await read('https://some-rss-feed-xml/', {
  extraFeedFields: (channel) => {
    return {
      image: channel['itunes:image'],
      owner: channel['itunes:owner']
    }
  },
})

result:

{
  "title": "xxx",
  "link": "xxx",
  "description": "xxx",
  "language": "",
  "generator": "",
  "published": "",
  "entries": [...],
  "image": {...},
  "owner": {...},
}

Atom is working, but rss2 and rss get an error message

Site: https://abikw.nvii-dev.de
When trying to fetch an Atom feed it works, but when trying to fetch an rss2 or rss feed I get the following error:

TypeError: item.map is not a function
parseRSS webpack-internal:///./node_modules/feed-reader/src/utils/parser.js:89
read webpack-internal:///./node_modules/feed-reader/src/main.js:39
getFeedFile webpack-internal:///./node_modules/cache-loader/dist/cjs.js?!./node_modules/babel-loader/lib/index.js!./node_modules/cache-loader/dist/cjs.js?!./node_modules/vue-loader-v16/dist/index.js?!./src/pages/News.vue?vue&type=script&lang=js:28
created webpack-internal:///./node_modules/cache-loader/dist/cjs.js?!./node_modules/babel-loader/lib/index.js!./node_modules/cache-loader/dist/cjs.js?!./node_modules/vue-loader-v16/dist/index.js?!./src/pages/News.vue?vue&type=script&lang=js:38
callWithErrorHandling webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6824
callWithAsyncErrorHandling webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6833
callHook webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:2419
applyOptions webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:2321
finishComponentSetup webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6561
setupStatefulComponent webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6473
setupComponent webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6403
mountComponent webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4258
processComponent webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4233
patch webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:3837
patchKeyedChildren webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4722
patchChildren webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4541
patchElement webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4057
processElement webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:3917
patch webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:3834
componentUpdateFn webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4443
run webpack-internal:///./node_modules/@vue/reactivity/dist/reactivity.esm-bundler.js:195
callWithErrorHandling webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6824
flushJobs webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:7060
cjs.js:31:17

Code I'm using:

getFeedFile() {
  const url = 'https://abikw.nvii-dev.de/feed/rss';

  this.read(url)
    .then((feed) => {
      console.log('News - getFeedFile - feed', feed);
    })
    .catch((err) => {
      console.log('News - getFeedFile - error: ', err);
    });
},

Issue with package types

Hi,

There seems to be a problem with the package types, with an error along the lines of:

There are types at '.../node_modules/@extractus/feed-extractor/index.d.ts', but this result could not be resolved when respecting package.json "exports". The '@extractus/feed-extractor' library may need to update its package.json or typings.

I was able to fix this locally by adding "types": "./index.d.ts" to the exports section of package.json.
I can make a PR for this.

Add content:encoded to FeedEntry

Thanks for a great tool. So far I've been using feed-extractor to get feed items and then passing each item's link to article-extractor to get the full article. However, I've noticed that in most of my feeds the full text of the article is included in the RSS feed under the content:encoded tag. Is there already a way to get this data using feed-extractor, so I wouldn't need to make a second call to article-extractor? It would be nice if encoded were added as a property on FeedEntry, so that when it exists we have access to it after parsing the feed. Is there a better way to do this?
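
Until such a property lands, the getExtraEntryFields parser option shown above can expose it. A hedged sketch; the raw key name 'content:encoded' depends on the feed's namespaces, and the feed URL is hypothetical:

import { extract } from '@extractus/feed-extractor'

const feed = await extract('https://example.com/feed', {
  getExtraEntryFields: (feedEntry) => {
    return {
      // pass the full article body through when the feed provides it
      content: feedEntry['content:encoded'],
    }
  },
})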

CERT_HAS_EXPIRED

Hey bro.
I am using your npm package in a project.
I am getting the following error:

Error
at Function.createFromInputFallback (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:320:98)
at configFromString (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2385:15)
at configFromInput (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2611:13)
at prepareConfig (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2594:13)
at createFromConfig (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2561:44)
at createLocalOrUTC (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2648:16)
at createLocal (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2652:16)
at hooks (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:12:29)
at normalize (/Users/wellington/Developer/zuntaz-bots/node_modules/feed-reader/src/main.js:62:16)
at modify (/Users/wellington/Developer/zuntaz-bots/node_modules/feed-reader/src/main.js:139:14)
at Array.map ()
at toRSS (/Users/wellington/Developer/zuntaz-bots/node_modules/feed-reader/src/main.js:142:20)
at /Users/wellington/Developer/zuntaz-bots/node_modules/feed-reader/src/main.js:224:18
at runMicrotasks ()
at processTicksAndRejections (internal/process/task_queues.js:85:5)
FetchError: request to https://www.muywindows.com/feed failed, reason: certificate has expired
at ClientRequest. (/Users/wellington/Developer/zuntaz-bots/node_modules/node-fetch/index.js:133:11)
at ClientRequest.emit (events.js:209:13)
at TLSSocket.socketErrorListener (_http_client.js:406:9)
at TLSSocket.emit (events.js:209:13)
at emitErrorNT (internal/streams/destroy.js:91:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)
at processTicksAndRejections (internal/process/task_queues.js:77:11) {
name: 'FetchError',
message: 'request to https://www.muywindows.com/feed failed, reason: certificate has expired',
type: 'system',
errno: 'CERT_HAS_EXPIRED',
code: 'CERT_HAS_EXPIRED'
}

better axios error handler

Hi!
hi @ndaidong

Thank you for your work.

I've forked your project; I'd like to improve error handling. Currently you return null on every axios error.

What do you think about that?

Making a PR right now

Thank you.
