
metascraper's Introduction


metascraper


A library to easily get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.

What is it

The metascraper library allows you to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.

It follows a few principles:

  • Have a high accuracy for online articles by default.
  • Make it simple to add new rules or override existing ones.
  • Don't restrict rules to CSS selectors or text accessors.

Getting started

Let's extract accurate information from a target website.

First, metascraper expects you to provide the HTML markup behind the target URL.

There are multiple ways to get the HTML markup. In our case, we are going to run a programmatic headless browser to simulate real user navigation, so the data obtained will be close to a real-world example.

const getHTML = require('html-get')

/**
 * `browserless` will be passed to `html-get`
 * as driver for getting the rendered HTML.
 */
const browserless = require('browserless')()

const getContent = async url => {
  // create a browser context inside the main Chromium process
  const browserContext = browserless.createContext()
  const promise = getHTML(url, { getBrowserless: () => browserContext })
  // close browser resources before returning the result
  promise.then(() => browserContext).then(browser => browser.destroyContext())
  return promise
}
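The getContent helper above drives a headless browser. When the target page doesn't rely on client-side rendering, a plain HTTP request can stand in. A minimal sketch using Node 18+'s global fetch (this helper is illustrative, not part of metascraper; the `{ url, html }` shape matches what metascraper receives):

```javascript
// Alternative sketch: fetch the raw markup directly (Node 18+ global fetch).
// Enough when the page doesn't rely on client-side rendering.
const getContent = async targetUrl => {
  const res = await fetch(targetUrl)
  // res.url reflects any redirects, which helps relative-URL resolution later
  return { url: res.url, html: await res.text() }
}
```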

/**
 * `metascraper` is a collection of tiny packages,
 * so you can just use what you actually need.
 */
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

/**
 * The main logic
 */
getContent('https://microlink.io')
  .then(metascraper)
  .then(metadata => console.log(metadata))
  .then(browserless.close)
  .then(process.exit)

The output will be something like:

{
  "author": "Microlink HQ",
  "date": "2022-07-10T22:53:04.856Z",
  "description": "Enter a URL, receive information. Normalize metadata. Get HTML markup. Take a screenshot. Identify tech stack. Generate a PDF. Automate web scraping. Run Lighthouse",
  "image": "https://cdn.microlink.io/logo/banner.jpeg",
  "logo": "https://cdn.microlink.io/logo/trim.png",
  "publisher": "Microlink",
  "title": "Turns websites into data — Microlink",
  "url": "https://microlink.io/"
}

What data it detects

Note: Custom metadata detection can be defined using a rule bundle.

Here is an example of the metadata that metascraper can detect:

  • audio — e.g. https://cf-media.sndcdn.com/U78RIfDPV6ok.128.mp3
    An audio URL that best represents the article.

  • author — e.g. Noah Kulwin
    A human-readable representation of the author's name.

  • date — e.g. 2016-05-27T00:00:00.000Z
    An ISO 8601 representation of the date the article was published.

  • description — e.g. Venture capitalists are raising money at the fastest rate...
    The publisher's chosen description of the article.

  • video — e.g. https://assets.entrepreneur.com/content/preview.mp4
    A video URL that best represents the article.

  • image — e.g. https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg
    An image URL that best represents the article.

  • lang — e.g. en
    An ISO 639-1 representation of the content language of the URL.

  • logo — e.g. https://entrepreneur.com/favicon180x180.png
    An image URL that best represents the publisher brand.

  • publisher — e.g. Fast Company
    A human-readable representation of the publisher's name.

  • title — e.g. Meet Wall Street's New A.I. Sheriffs
    The publisher's chosen title of the article.

  • url — e.g. http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion
    The URL of the article.

How it works

metascraper is built out of rules bundles.

It was designed to be easy to adapt. You can compose your own transformation pipeline using existing rules or write your own.

A rules bundle is a collection of HTML selectors for a specific property. When you load the library, the core rules are loaded implicitly.

Each set of rules loads a set of selectors in order to resolve a specific value.

Rules are sorted by priority: the first rule that successfully resolves the value stops the evaluation of the remaining rules for that property. Rules are intentionally ordered from most specific to most generic.

Rules act as fallbacks for one another:

  • If the first rule fails, the second rule is tried.
  • If the second rule fails, the third rule is tried.
  • And so on.

metascraper continues until it has exhausted all the rules or found the first rule that resolves the value.
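The priority-and-fallback behavior can be sketched in isolation. The rule signature `({ htmlDom, url }) => value` matches the documented shape, where `htmlDom` is a cheerio instance; the stub `$` below is a stand-in for cheerio, used only for illustration:

```javascript
// Sketch of metascraper-style prioritized rules: the first rule that
// resolves a truthy value wins; the rest are skipped.
const titleRules = [
  ({ htmlDom: $ }) => $('meta[property="og:title"]').attr('content'),
  ({ htmlDom: $ }) => $('title').text()
]

const resolveValue = (rules, context) => {
  for (const rule of rules) {
    const value = rule(context)
    if (value) return value
  }
}

// Stub page: no Open Graph tag, but a <title> element is present,
// so the second (more generic) rule resolves the value.
const $stub = selector => ({
  attr: () => undefined,
  text: () => (selector === 'title' ? 'Fallback Title' : '')
})

console.log(resolveValue(titleRules, { htmlDom: $stub }))
// → Fallback Title
```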

Importing rules

metascraper exports a constructor that must be initialized with the collection of rules to load:

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

Again, the order in which rules are loaded is important: only the first rule that resolves the value will be applied.

Use the first parameter to pass custom options to each rules bundle:

const metascraper = require('metascraper')([
  require('metascraper-clearbit')({
    size: 256,
    format: 'jpg'
  })
])

Rules bundles

Can't find the rules bundle you want? Open an issue to request it.

Official

Rules bundles maintained by metascraper maintainers.

Core essential

Vendor specific

Community

Rules bundles maintained by individual users.

See CONTRIBUTING for adding your own module!

API

constructor(rules)

Create a new metascraper instance, explicitly declaring the rules bundles to be used.

rules

Type: Array

The collection of rules bundles to be loaded.

metascraper(options)

Call the instance to extract content based on the rules bundles provided to the constructor.

options

url

Required
Type: String

The URL associated with the HTML markup.

It is used to resolve relative links that may be present in the HTML markup.

It can also be used as a fallback field for some rules.
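Why the url option matters for relative links: they can only be made absolute against the page URL. A sketch using the built-in WHATWG URL API (illustrative values):

```javascript
// Resolving a relative image path against the page URL with the
// WHATWG URL constructor: new URL(relative, base).
const pageUrl = 'https://example.com/articles/post.html'
const relativeImage = '../images/cover.jpg'

const absoluteImage = new URL(relativeImage, pageUrl).toString()
console.log(absoluteImage) // → https://example.com/images/cover.jpg
```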

html

Type: String

The HTML markup for extracting the content.

htmlDom

Type: object

The DOM representation of the HTML markup. When it's not provided, it's generated from the html parameter.

rules

Type: Array

You can pass additional rules to be applied at execution time.

These rules will be merged with the rules loaded at construction time.
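The merge semantics can be sketched in plain JavaScript. This is an illustration, not the library's actual implementation, and it assumes execution-time rules are merged in ahead of the constructor rules so they get first chance at each property:

```javascript
// Hypothetical merge sketch: per property, execution-time rules are
// prepended to constructor rules and tried in order.
const constructorRules = { title: [() => 'constructor title'] }
const executionRules = { title: [() => undefined] } // no match, falls through

const keys = new Set([...Object.keys(executionRules), ...Object.keys(constructorRules)])
const merged = {}
for (const key of keys) {
  merged[key] = [...(executionRules[key] ?? []), ...(constructorRules[key] ?? [])]
}

// First truthy result wins, exactly like the normal fallback chain.
const title = merged.title.map(rule => rule()).find(Boolean)
console.log(title) // → constructor title
```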

validateUrl

Type: boolean
Default: true

Ensure the provided URL is compliant with the WHATWG URL API.
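What WHATWG-compliant validation amounts to can be sketched with the built-in URL constructor (an illustration, not the library's actual implementation):

```javascript
// A URL is considered WHATWG-compliant when the URL constructor
// can parse it without throwing.
const isValidUrl = input => {
  try {
    new URL(input)
    return true
  } catch {
    return false
  }
}

console.log(isValidUrl('https://microlink.io')) // → true
console.log(isValidUrl('not a url'))            // → false
```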

Environment Variables

METASCRAPER_RE2

Type: boolean
Default: true

Attempt to load re2 to be used in place of RegExp.

Benchmark

To give you an idea of how accurate metascraper is, here is a comparison of similar libraries:

Library     metascraper  html-metadata  node-metainspector  open-graph-scraper  unfluff
Correct     95.54%       74.56%         61.16%              66.52%              70.90%
Incorrect   1.79%        1.79%          0.89%               6.70%               10.27%
Missed      2.68%        23.67%         37.95%              26.34%              8.95%

A big part of the reason for metascraper's higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly-used, spec-compliant pieces of metadata, like Open Graph.

metascraper's default settings are targeted specifically at parsing online articles, which is why it can be more highly tuned for that purpose than the other libraries.

If you're interested in the breakdown by individual pieces of metadata, check out the full comparison summary, or dive into the raw result data for each library.

License

metascraper © Microlink, released under the MIT License.
Authored and maintained by Microlink with help from contributors.

microlink.io · GitHub microlinkhq · X @microlinkhq


metascraper's Issues

Passing options to popsicle (e.g. headers)

We get unscrapable responses trying to access YouTube video .../watchv= URLs (via a cors-proxy).

Or rather, ten times out of eleven. The content shown in browser dev tools (network) is almost all <script> and no meta tags.

In PostMan, if I clear the user-agent header, I get consistently usable data. I want to try this for metascraper, but I can't see any way to tell it to use a popsicle instance with options (i.e. custom headers) specified.

Is there any way to override popsicle defaults for metascraper? If not, it may make sense to allow passing through either a prebuilt popsicle object or popsicle options. Is there a quick workaround to override the user-agent header when using metascraper from a browser?

EDIT:

Probably pointless, unless an expert can advise otherwise. Trying to set the user-agent header on popsicle just gives Refused to set unsafe header "user-agent" which is presumably enforced at the browser level. I've been trying to hold off on implementing a custom back-end cors/scrape-proxy, but maybe that's the only option.

Remove async/await

It'd be nicer on users if it were written in ES5 promises. Related to #7. Current implementation relies on a regenerator runtime from Babel to support async/await.

CORS issues

when I try to get tags from github, and many other sites, I get a similar error to the one below:

XMLHttpRequest cannot load https://github.com/ianstormtaylor/metascraper/issues/new. 
Redirect from 'https://github.com/ianstormtaylor/metascraper/issues/new' to 
  'https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fianstormtaylor%2Fmetascraper%2Fissues%2Fnew' 
has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. 
Origin 'http://localhost:4201' is therefore not allowed access.

Is there any way to ignore this, or a method of retrieving the page like a browser would?

add `tags` property

There are a few different metadata syntaxes out there for tags; this would return an array of them. Maybe even join across syntaxes to get the full list, and then dedupe case-insensitively.

Add metascraper-lang

Hi

I would be interested in language property. These are the places I know where the language can be stored:

  • html lang
  • og:locale
  • hreflang (https://moz.com/learn/seo/hreflang-tag) - but this part is needs most work and from what I see there could be errors here (many people only use href lang for other languanges than currently selected)

Thanks in advance.

Fallback rules cause an exception "rule is not a function"

This particular scenario throws an exception:

const MY_RULES = {
  title: [
    myPreferredTitleRule,
    myFallbackTitleRule,
    mySuperLastResortTitleRule,
  ]
}

scrapeMetadatum does not seem to anticipate receiving fallback rules and throws an exception on line 95: var next = rule($, url); when "rule" is an array of functions instead of a function itself.

ES6 syntax fails in node v4 and 5

I am really looking forward to trying this project out. My node version is v5.5.0. I am getting the following syntax error:

[screenshot of the syntax error]

I believe this is because the ES6 default parameters are not supported in my version of node.

Target Node 8.x

Node 8 is great and everybody should update ASAP, but unfortunately, it's not always that easy.

Node 6.x is still Active LTS status and some great services don't support 8.x yet.

Also, adoption isn't widespread yet:

[chart: Node.js downloads by release line over the year]

In this, you can see the three release lines represented by solid green (Node.js 4 "Argon"), solid blue (Node.js 6 “Boron”), and solid purple (Node.js 8 “Carbon”).

Throughout the year, you can see Node.js 6 reigning supreme—peaking at 388,417 downloads on October 25th. Unlike last year with Node.js 6 and Node.js 4, the downloads of Node.js 8 didn’t surpass the downloads of the Node.js 6 LTS. This is an interesting trend that we should watch into the new year. It seems that the majority of downloads of Node.js are still Node.js 6, even though Node.js 8 has been trending upward for the past few months.

Given the data, it may be safe to assume that the delayed LTS-adoption trend will continue, and as Node.js usage continues to grow, adoption of new LTS versions could very well take longer and longer with each new LTS cycle. We’ll be able to confirm or reassess this assumption with next year’s data, though!

Source: NodeSource

I found that the library aimed to support Node 4.x and browsers back in the old days by transpiling via Babel, but I couldn't find a notice about dropping support for these environments in the Changelog/Release Notes (although there is a "Remove browser" mention commit).

Cloud Function providers are probably going to support 8.x soon as this is The Right Thing™ but they're moving slowly.

So, would maintainers consider accepting a PR adding back Babel to enable this great library at least on 6.x runtime in the meantime or while marked as LTS?

Thanks for working on OSS projects!

URL and relative image URLs not working

Hi,

I set up a small test page (with nearly no metadata) to play around with metascraper and noticed two unexpected things:

  1. image is returned as null if the only image available on the page has a relative URL (it works fine with absolute URLs)
  2. url is always null (both for the direct URL to the page and for short URLs)

The test page can be found here: http://x.manuel-heidrich.com/tests/browse/
Short URLs for the test page: http://bit.ly/2zVanQ9 and https://goo.gl/gTYg35

HTML of the test page: http://sharedby.manuel-heidrich.com/Cs6cno/3QJaQeuE
Response from metascraper: http://sharedby.manuel-heidrich.com/WAH4q/1HE7KFYF

For the images, I guess it would be great if metascraper returned absolute URLs when it encounters relative ones.

Concerning the URL: is this a bug or intended behaviour? I think url should never be null.

Happy to help if I can!

add `url` property

Would favor canonical URLs, but fall all the way back to window.location.href.

add a cookie jar

@blakeembrey pointed out that the New York Times ends up perma-redirecting without one! We should switch to popsicle anyway, since I think it has built-in support.

CORS Issue...

I saw #27 and did what it says, but I still get a CORS error:

error: XMLHttpRequest cannot load http://www.naver.com/. Redirect from 'http://www.naver.com/' to 'https://www.naver.com/' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://www.lululala.co.kr:8000' is therefore not allowed access.

I'm a front-end developer, so I have to solve the CORS issue on the front end. How can I do that? Any help would be appreciated.

Remove dependency on Babel?

Currently the build actually relies on the end-user having Babel installed and all of Babel core. It'd be nicer on us users if we could use it without Babel.

TypeError: Cannot read property 'parent' of null

trying to run the basic example however it is logging TypeError: Cannot read property 'parent' of null

    import Metascraper from 'metascraper';

    Metascraper
      .scrapeUrl('http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance')
      .then((metadata) => {
        console.log('meta: ', metadata);
      })
      .catch((error) => {
        console.log('error: ', error);
      });

allow for functions instead of just arrays of functions

I think it would be simple to do, and kind of nice to allow rules to be defined with just a single function, instead of always needing to be an array, such that a simple case could look like...

title: async (window) => window.document.title.textContent

Or the complex ones could be:

title: [
  // tons of rules...
]

@blakeembrey that sound good to you?

How to integrate custom ruleset?

Are there any examples of a custom rule set in a project? The documentation, to me, isn't clear enough. Where do I save the custom ruleset? How do I require it in the rc file?

I'm running docker + node.

Async rules?

I noticed this uses async rules, any particular reason? Seems all the current rules are completely synchronous.

FWIW, if you just want to compose async functions I wrote https://github.com/blakeembrey/throwback. It takes a chain of functions that return promises and any can either return value or return next() to continue execution. Might be a nice middleware pattern here if you want to keep async middleware.

Add a timeout?

It would be nice if you could specify a timeout and reject the returning promise after said timeout.

Add metascraper-video

Add a new property for supporting video (mp4, avi, webm) and gif embeds.

Rules from og meta tags
  • og:video:secure_url
  • og:video:url
  • og:video

(a lot of good examples here!)

Rules from twitter meta tags
  • twitter:player:stream

Not sure if Twitter supports property tags to target video. Need to check:

Rules from sailthru meta tags

???

Rules from HTML

Error trying to scrape http://facebook.com

This is the code in my express router:

console.log('Scraping metadata...', req.query.url);
  Metascraper.scrapeUrl(req.query.url)
  .then((metadata) => {
    console.log(metadata);
    res.send(metadata);
  })
  .catch((err) => {
    console.log(err);
    res.send(err);
  });

It returns this error when trying to scrape http://facebook.com

{ [Error: incorrect header check] errno: -3, code: 'Z_DATA_ERROR' }

Any thoughts why this is happening?

Doesn't recognize a metascraperrc/package.json OR looks for them in a different path when running from node_modules?

I have a package published that uses metascraper, yet it doesn't seem to draw from either a metascraperrc or the metascraper section of package.json, when it's in node_modules. It works when testing in its own directory, but not as a package. Am I doing something wrong, or is this a scoping issue? It could also be that I'm using a local rule, such as ./rule.js, which it might be looking for in the root directory of the project that has it as its dependency.

Update cheerio to 1.0.0-rc.2

Getting this error:

ERROR in ./node_modules/cheerio/index.js
Module not found: Error: Can't resolve './package' in '....../node_modules/cheerio'

Product Information?

Hi @ianstormtaylor, I'm not sure if this is completely out of scope for this library - if yes, apologies.

But in case it isn't, it would be amazing to treat product pages as distinct from articles by getting product-specific information from the sites (at least the main ones have it standardized). Here is a library (though a bit outdated) I found which does some of that - https://github.com/hulihanapplications/fletcher/blob/master/lib/fletcher/models/

Thank you - your library looks great! :)

Replace JSDom with Cheerio?

JSDOM is kind of heavy here. Would you consider a PR that replaces JSDOM with Cheerio instead? Since it's really only using getAttribute and querySelector-like APIs, maybe use Cheerio for all text parsing use-cases and have a wrapper for the browser window use-case that defers to the document.* methods?

Module not found: Can't resolve package in cheerio.

Here is the full error:

web_1         | ERROR in ./~/cheerio/index.js
web_1         | Module not found: Error: Can't resolve './package' in '/src/app/node_modules/cheerio'
web_1         |  @ ./~/cheerio/index.js 11:18-38
web_1         |  @ ./~/metascraper/dist/index.js

trying to build with webpack.

SyntaxError while running under Node LTS (v4.x)

When I try to require('metascraper') I got this error:

/home/user/workspace/node_modules/metascraper/lib/index.js:2
let RULES = require('./rules')
^^^

SyntaxError: Block-scoped declarations (let, const, function, class) not yet supported outside strict mode

My suggestion is to add the 'use strict'; on top of each file containing a block-scoped declaration.

[metascraper-amazon] Image selector matches incorrect image

I'm running into issues with the image value not being the main image for metascraper-amazon. There are actually multiple .a-dynamic-image classes on the page, as seen in the attached photo. Can we add some rules with priority over this, like wrapUrl($ => $('#landingImage').attr('src')) or wrapUrl($ => $('.a-dynamic-image').first().attr('src'))?

[screenshot of the Amazon product page showing multiple .a-dynamic-image elements]

How to Implement your own Rule

I am having some difficulties creating my own rule. For test purposes, I copied the metascraper-publisher folder and renamed it to "metascraper-test". I changed the appropriate names in the package.json file. I then added "metascraper-test" to the Default_Rules section in the "load-rules.js" file. However, I keep getting a "metascraper-test" not found error, even though it is present.

Any ideas on what could be causing this?

Thanks!

application/ld+json

Are there any reasons why application/ld+json is not being utilized? It seems to be contained in the majority of large websites and includes information such as headline, description, date, authors, publisher and more!

An example site that utilizes this format is https://www.theverge.com/2017/11/16/16667366/tesla-semi-truck-announced-price-release-date-electric-self-driving.

I can work on this if others think it would be helpful. Just wanted to reach out first to see if anybody else has already done this, or if there is a reason not to.

Thanks!

Angular CLI Error: `Module not found: Error: Can't resolve 'module'`

Prerequisites

  • I'm using the last version.
  • [N/A] My node version is the same as declared as package.json.

Subject of the issue

We're trying to use Metascraper in an Angular 5 application since the README.md indicates Metascraper supports browser usage. However, we get Module not found: Error: Can't resolve 'module' in require-from-string and resolve-from from Angular CLI when calling Metascraper from our code:

ERROR in ./node_modules/metascraper/node_modules/require-from-string/index.js
Module not found: Error: Can't resolve 'module' in '/path-to/node_modules/metascraper/node_modules/require-from-string'
 @ ./node_modules/metascraper/node_modules/require-from-string/index.js 3:13-30
 @ ./node_modules/metascraper/node_modules/cosmiconfig/dist/loadRc.js
 @ ./node_modules/metascraper/node_modules/cosmiconfig/dist/createExplorer.js
 @ ./node_modules/metascraper/node_modules/cosmiconfig/dist/index.js
 @ ./node_modules/metascraper/src/load-rules.js
 @ ./node_modules/metascraper/src/index.js
 @ ./clients/browser/app/library/content-linked-edit/content-linked-edit.component.ts
 @ ./clients/browser/app/library/library.module.ts
 @ ./clients/browser/$$_lazy_route_resource lazy
 @ ./node_modules/@angular/core/esm5/core.js
 @ ./clients/browser/main.ts
 @ multi webpack-dev-server/client?http://0.0.0.0:0 ./clients/browser/main.ts

ERROR in ./node_modules/metascraper/node_modules/resolve-from/index.js
Module not found: Error: Can't resolve 'module' in '/path-to/node_modules/metascraper/node_modules/resolve-from'
 @ ./node_modules/metascraper/node_modules/resolve-from/index.js 3:15-32
 @ ./node_modules/metascraper/src/load-rules.js
 @ ./node_modules/metascraper/src/index.js
 @ ./clients/browser/app/library/content-linked-edit/content-linked-edit.component.ts
 @ ./clients/browser/app/library/library.module.ts
 @ ./clients/browser/$$_lazy_route_resource lazy
 @ ./node_modules/@angular/core/esm5/core.js
 @ ./clients/browser/main.ts
 @ multi webpack-dev-server/client?http://0.0.0.0:0 ./clients/browser/main.ts

The following warning precedes those errors:

WARNING in ./node_modules/metascraper/src/load-rules.js
52:15-34 Critical dependency: the request of a dependency is an expression
 @ ./node_modules/metascraper/src/load-rules.js
 @ ./node_modules/metascraper/src/index.js
 @ ./clients/browser/app/library/content-linked-edit/content-linked-edit.component.ts
 @ ./clients/browser/app/library/library.module.ts
 @ ./clients/browser/$$_lazy_route_resource lazy
 @ ./node_modules/@angular/core/esm5/core.js
 @ ./clients/browser/main.ts
 @ multi webpack-dev-server/client?http://0.0.0.0:0 ./clients/browser/main.ts

Does Metascraper work in a browser environment?

XMLHttpRequest cannot load https://github.com/ianstormtaylor/metascraper/issues/27. No 'Access-Control-Allow-Origin' header is present on the requested resource.

XMLHttpRequest cannot load #27. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://www.lululala.co.kr:8000' is therefore not allowed access.

:8000/#/:1 Uncaught (in promise)
PopsicleError {cause: undefined, code: "EUNAVAILABLE", popsicle: Request, message: "Unable to connect to "#27"", name: "PopsicleError"…}

What should I do?

[metascraper-author] Rules priorities

It appears that there is an unintended 'a' in line 50 of the author package. In my tests this is causing poor results for the author.

strict(wrap($ => getValue($, $('[class*="author"] a')))),

Using this url, you will find that metascraper returns "Shakespeare Videos" for author
http://www.sparknotes.com/shakespeare/macbeth/quotes.html

Removing the random "a" on line 50, returns the correct author, "William Shakespeare"

strict(wrap($ => getValue($, $('[class*="author"]')))),

However, there are other URLs that only successfully extract the author's name when that 'a' is present. What exactly is that 'a' doing?

Thanks,

Critical dependency: the request of a dependency is an expression

I believe this is an issue of how metascraper use of "require()" works with webpack design. I wish to ask if you can offer any direction to fix these errors I'm getting when using metascraper with vue.js, which uses webpack. Below is something I learned about webpack by referencing an example of how webpack expects require() to work:

var module_path= "./dir/"+somevariable+".js";
var foo= require(module_path);

  • require("./dir/"+somevariable+".js") is statically analyzable, so it works with webpack.
  • require(module_path) is not statically analyzable, so it doesn't work with webpack.

Webpack calls this feature "require context" and its behavior is by design.

Referencing one of the warnings -- warning in ./node_modules/metascraper/src/load-rules.js

  • 52:15-34 Critical dependency: the request of a dependency is an expression
  • the offending code in load-rules.js seems to be with all the "require()" statements that are referencing variables.

QUESTION: Is there a way to resolve this issue so metascraper can be used with webpack?

Thanks... Jim


Prerequisites

  • [ Yes ] I'm using the last version of metascraper.
  • [ 5.6.0 ] My npm version is the same as declared as package.json.
  • [ v6.11.1 ] My node version is the same as declared as package.json.
  • "vue": "^2.5.2",
  • "webpack": "^3.6.0",
  • "webpack-bundle-analyzer": "^2.9.0",
  • "webpack-dev-server": "^2.9.1",
  • "webpack-merge": "^4.1.0"

Critical dependency: the request of a dependency is an expression

PLEASE SEE THE SEVEN WARNINGS BELOW.

# npm run dev

 WARNING  Compiled with 7 warnings                                 

 warning  in ./node_modules/keyv/src/index.js

18:14-40 Critical dependency: the request of a dependency is an expression

 warning  in ./node_modules/metascraper/src/load-rules.js

52:15-34 Critical dependency: the request of a dependency is an expression

 warning  in ./node_modules/lazy-debug-legacy/src/functions.js

62:15-40 Critical dependency: the request of a dependency is an expression

 warning  in ./node_modules/lazy-debug-legacy/src/functions.js

87:8-28 Critical dependency: the request of a dependency is an expression

 warning  in ./node_modules/css/node_modules/source-map/lib/source-map/source-map-generator.js

8:45-52 Critical dependency: require function is used in a way in which dependencies cannot be statically extracted

 warning  in ./node_modules/css/node_modules/source-map/lib/source-map/source-map-consumer.js

8:45-52 Critical dependency: require function is used in a way in which dependencies cannot be statically extracted

 warning  in ./node_modules/css/node_modules/source-map/lib/source-map/source-node.js

8:45-52 Critical dependency: require function is used in a way in which dependencies cannot be statically extracted

2.0.0

We are working on a v2 iteration.

TODO

  • Close all issues related to HTTP problems.
  • Migrate documentation.
  • Update dependencies.
  • Migrate benchmark.
  • Add contributor guidelines.

Breaking Changes

The major breaking change in this new version is that the HTTP layer has been moved out of the library.

From now on, metascraper will be a main function that receives HTML and a URL for extracting the data:

const metascraper = require('metascraper')
const got = require('got')

const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

;(async () => {
  const {body: html, url} = await got(targetUrl)
  const metadata = await metascraper({html, url})
  console.log(metadata)
})()

Because this iteration refactors a lot of code, it is not possible to create a PR, so I'm going to use this issue as a tracking list of things to close before releasing the new version.

Improvements

  • Add new logo field.
  • Updated tests, now covering at least the top 50 most popular internet sites.
  • The codebase was simplified and is ready to support plugins in 3.0.0.

don't return early for empty string

Cheerio results often return empty strings when no matches occur, so it seems best not to match on that and keep going down the fallbacks array.

bundle rules as plugins?

A bit more to think about after we nail the main API, but I was thinking that it feels weird for the bundled default rules to have such prominence, when they are for a pretty specific use case. Wondering instead if it might be nicer to make rules more of a plugin-system.

It could even maintain the same (source, [rules]) API if that's still the simplest, although maybe it becomes a constructor with new Metascraper(rules) and then .scrapeHtml()... instead.

This would allow bundles of rules. The current set might be something like metascraper-plugin-articles but there could easily be a simple Open Graph one that doesn't handle fallbacks, as metascraper-plugin-open-graph, or whatever!

cc @blakeembrey for thoughts!

Need a logo!

Would be awesome to have a logo for identifying the project.

Ideas are welcome 😄

Questions: Rules priority

This is more of a question than an issue.

In the author's plugin,
wrap($ => $('meta[property="article:author"]').attr('content')),
is listed second in the module export. Does this mean that it has a higher priority than the other methods of extracting an author?

Because I have noticed that if "article:author" is equal to a website URL, then it is not used and a method further down the list is used to extract the author.

How does this work? I do not see any code that checks whether it's a URL or an actual author's name.
