mercury-parser-api's Introduction

Mercury Parser API

This repo provides a dockerized drop-in replacement for the Mercury Parser API.

Deploy

Pull And Run

docker run -p 3000:3000 -d wangqiru/mercury-parser-api

Build Your Own

docker build -t mercury-parser-api .

then

docker run -p 3000:3000 -d mercury-parser-api

Usage

GET /parser?url=[required:url]&contentType=[optional:contentType]&headers=[optional:url-encoded-headers]

curl 'localhost:3000/parser?url=https://www.bbc.co.uk/news/science-environment-35876621'
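When the article URL carries its own query string, it needs to be percent-encoded so that its parameters are not read as parameters of /parser. The sketch below builds such a request URL in Node.js. The helper name buildParserUrl, the example article URL, and the 'markdown' value for contentType are assumptions for illustration; the headers parameter is left out because its exact format is not described here.

```javascript
// Build a request URL for the GET /parser endpoint.
// buildParserUrl is a hypothetical helper, not part of this project.
function buildParserUrl(baseUrl, articleUrl, contentType) {
  const endpoint = new URL('/parser', baseUrl);
  // searchParams.set percent-encodes the value, so '?' and '&' inside
  // the article URL stay part of the `url` parameter.
  endpoint.searchParams.set('url', articleUrl);
  if (contentType) {
    endpoint.searchParams.set('contentType', contentType);
  }
  return endpoint.toString();
}

const requestUrl = buildParserUrl(
  'http://localhost:3000',
  'https://example.com/article?id=1&lang=en', // made-up article URL
  'markdown' // assumed contentType value
);
console.log(requestUrl);
```

The resulting string can be passed to curl or any HTTP client as-is.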

Response

{
    "title": "Ash tree set for extinction in Europe",
    "author": "Claire Marshall BBC Environment Correspondent",
    "date_published": null,
    "dek": null,
    "lead_image_url": "https://ichef.bbci.co.uk/news/1024/branded_news/9736/production/_88901783_88901782.jpg",
    "content": "<div><p class=\"byline\"> <span class=\"byline__name\">By Claire Marshall</span> <span class=\"byline__title\">BBC Environment Correspondent</span> </p><div class=\"story-body__inner\"> <figure class=\"media-landscape has-caption full-width lead\"> <span class=\"image-and-copyright-container\"> <img class=\"js-image-replace\" alt=\"Ash tree with suspected dieback\" src=\"https://ichef.bbci.co.uk/news/320/cpsprodpb/9736/production/_88901783_88901782.jpg\" width=\"976\"> <span class=\"off-screen\">Image copyright</span> <span class=\"story-image-copyright\">PA</span> </span> <figcaption class=\"media-caption\"> <span class=\"off-screen\">Image caption</span> <span class=\"media-caption__text\"> The chalara dieback has devastated ash trees across Europe </span> </figcaption> </figure><p class=\"story-body__introduction\">The ash tree is likely to be wiped out in Europe, according to a review of the evidence.</p><p>The trees are being killed off by the fungal disease ash-dieback along with an invasive beetle called the emerald ash borer.</p><p>According to the research, published in the Journal of Ecology, the British countryside will never look the same again.</p><p>The paper says that the ash will most likely be &quot;eliminated&quot; in Europe.</p><p>This could mirror the way Dutch elm disease largely wiped out the elm in the 1980s.</p><p><a href=\"http://www.bbc.co.uk/news/science-environment-33744042\" class=\"story-body__link\">Warning over ash dieback disease</a></p><p><a href=\"/news/uk-northern-ireland-33480275\" class=\"story-body__link\">100,000 trees destroyed over disease</a></p><p><a href=\"http://www.bbc.co.uk/news/science-environment-20171524\" class=\"story-body__link\">How to spot ash dieback</a></p><p>Ash trees are a key part of the treescape of Britain. You don&apos;t have to go to the countryside to see them. In and around towns and cities there are 2.2 million. 
In woodland, only the oak is more common.</p><p>However, according to a review led by Dr Peter Thomas of Keele University and published in the Journal of Ecology, &quot;between the fungal disease ash dieback and a bright green beetle called the emerald ash borer, it is likely that almost all ash trees in Europe will be wiped out - just as the elm was largely eliminated by Dutch elm disease&quot;.</p><p>Ash dieback, also known as Chalara, is a disease that was first seen in Eastern Europe in 1992. It now affects more than 2 million sq km, from Scandinavia to Italy.</p><figure class=\"media-landscape no-caption full-width\"> </figure><figure class=\"media-landscape has-caption full-width\"> <div class=\"image-and-copyright-container\"> <span class=\"off-screen\">Image copyright</span> <span class=\"story-image-copyright\">Getty Images</span> </div> <figcaption class=\"media-caption\"> <span class=\"off-screen\">Image caption</span> <span class=\"media-caption__text\"> The loss of ash trees won&apos;t just change the landscape, it will have a severe impact on biodiversity </span> </figcaption> </figure><p>It was identified in England in 2012 in a consignment of imported infected trees. It has since spread from Norfolk and Suffolk to South Wales. Caused by the fungus <i>Hymenoscyphus fraxineus</i>, it kills the leaves, then the branches, trunk and eventually the whole tree. It has the potential to destroy 95% of ash trees in the UK.</p><p>The emerald ash borer is a bright green beetle that, like ash dieback, is native to Asia. It&apos;s not yet in the UK but is spreading west from Moscow at a rate of 25 miles (41 km) a year and is thought to have reached Sweden.</p><p>The adult beetles feed on ash trees and cause little damage. However the larvae bore under the bark and in to the wood, killing the tree.</p><p>According to Dr Thomas: &quot;Our European ash is very susceptible to the beetle. 
It is only a matter of time before it spreads across the rest of Europe - including Britain - and the beetle is set to become the biggest threat faced by ash in Europe, potentially far more serious than ash dieback.&quot;</p><figure class=\"media-landscape has-caption full-width\"> <div class=\"image-and-copyright-container\"> <span class=\"off-screen\">Image copyright</span> <span class=\"story-image-copyright\">Science Photo Library</span> </div> <figcaption class=\"media-caption\"> <span class=\"off-screen\">Image caption</span> <span class=\"media-caption__text\"> The emerald ash borer also threatens ash trees </span> </figcaption> </figure><p>This won&apos;t just change our landscape - it will have a severe impact on biodiversity. 1,000 species are associated with ash or ash woodland, including 12 types of bird, 55 mammals and 239 invertebrates.</p><p>Mr Thomas said, &quot;Of these, over 100 species of lichens, fungi and insects are dependent upon the ash tree and are likely to decline or become extinct if the ash was gone.</p><p>&quot;Some other trees such as alder, small-leaved lime and rowan can provide homes for some of these species... but if the ash went, the British countryside would never look the same again.&quot;</p><p>One small hope is that some cloned ash trees have shown resistance against the fungus. But that won&apos;t protect them against the beetle.</p><p>Follow Claire <a href=\"http://twitter.com/bbcmarshall\" class=\"story-body__link-external\">on Twitter.</a></p> </div></div>",
    "next_page_url": null,
    "url": "https://www.bbc.co.uk/news/science-environment-35876621",
    "domain": "www.bbc.co.uk",
    "excerpt": "The ash tree is likely to be wiped out in Europe, according to the largest-ever survey of the species.",
    "word_count": 585,
    "direction": "ltr",
    "total_pages": 1,
    "rendered_pages": 1
}
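A client usually needs only a few of these fields. The following is a minimal sketch of reducing an already-parsed response body to a summary. The function name summarizeArticle is hypothetical, and the error shape ({ error: true, messages: "..." }) is assumed from error output that users have reported, not from a documented contract.

```javascript
// Reduce a parsed /parser response body to a short summary, or throw
// when the service reported a failure. summarizeArticle is a
// hypothetical helper; the field names come from the sample response.
function summarizeArticle(body) {
  if (body.error) {
    // Assumed error shape: { error: true, messages: '...' }
    throw new Error(`Parser failed: ${body.messages}`);
  }
  return {
    title: body.title,
    author: body.author || 'unknown',
    // date_published can be null, as in the sample response.
    published: body.date_published ? new Date(body.date_published) : null,
    words: body.word_count,
    multiPage: body.total_pages > 1,
  };
}

// A trimmed-down copy of the sample response.
const sample = {
  title: 'Ash tree set for extinction in Europe',
  author: 'Claire Marshall BBC Environment Correspondent',
  date_published: null,
  word_count: 585,
  total_pages: 1,
};
console.log(summarizeArticle(sample));
```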

Adding a custom extractor

You can add a custom extractor to the parser by mounting a directory that contains your customizer module at /app/customizer.

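# Use an absolute host path: a bare name such as my-customizer-dir
# would be treated by Docker as a named volume, not a directory.
docker run -p 3000:3000 -d \
    -v "$(pwd)/my-customizer-dir:/app/customizer" \
    mercury-parser-api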

In the above example, the my-customizer-dir directory must contain an index.js file, such as:

const NaverMobileBlogExtractor = {
  domain: 'm.blog.naver.com',
  title: {
    selectors: ['.se-title-text'],
  },
  author: {
    selectors: ['.blog_author'],
  },
  content: {
    selectors: ['.se-main-container'],
  },
  date_published: {
    selectors: ['.blog_date'],
    format: 'YYYY. MM. DD. HH:mm',
    timezone: 'Asia/Seoul',
  },
};

function customize(parser) {
  parser.addExtractor(NaverMobileBlogExtractor);
}

module.exports = { customize };

console.log('📜My custom extractor is loaded.');
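Before mounting a customizer into the container, it can help to run it against a stand-in parser and see what it registers. The sketch below does that. The stub models only addExtractor, the one method the example above calls; createStubParser, ExampleExtractor, and the blog.example.com domain are made up for illustration.

```javascript
// Hypothetical stand-in for the parser object. It records every
// extractor passed to addExtractor and does nothing else.
function createStubParser() {
  const extractors = [];
  return {
    extractors,
    addExtractor(extractor) {
      extractors.push(extractor);
    },
  };
}

// A minimal customizer in the same shape as index.js
// (the domain and selectors are made up).
const ExampleExtractor = {
  domain: 'blog.example.com',
  title: { selectors: ['h1.post-title'] },
  content: { selectors: ['article.post-body'] },
};

function customize(parser) {
  parser.addExtractor(ExampleExtractor);
}

// Run the customizer and inspect what it registered.
const stub = createStubParser();
customize(stub);
const incomplete = stub.extractors.filter((e) => !e.domain || !e.content);
console.log(`registered: ${stub.extractors.map((e) => e.domain).join(', ')}`);
console.log(`extractors missing domain or content: ${incomplete.length}`);
```

This only checks the shape of the objects; whether the selectors match the target site still has to be verified against the running service.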

License

Licensed under either of the below, at your preference:

FOSSA Status

mercury-parser-api's People

Contributors

adampash, dependabot-preview[bot], dependabot[bot], evgensk, fossabot, greenkeeper[bot], henryqw, nickwynja, renovate-bot, renovate[bot], snyk-bot, trustin

mercury-parser-api's Issues

Mercury API key

Where can I get the Mercury API key? I am using the Docker image wangqiru/mercury-parser-api.

Update to v2.2.1?

A great tool. Thanks a lot for the great work.
Is there any chance that this version will be updated to the current code base?
https://github.com/postlight/mercury-parser/releases/tag/v2.2.1

As far as I can see there, it is the first major update in a long time.
(the Chrome extension, which is probably based on the latest code, works much better for some sites than my self-hosted Docker version, hence my request).

docker pull fails

  • Platform: Debian 12, Docker v25.0.3

Expected Behavior

The image should be able to be pulled without a problem.

Current Behavior

When running docker pull wangqiru/mercury-parser-api the pull fails with the following error:
failed to register layer: failed to Lchown "/app/node_modules/content-type/HISTORY.md" for UID 1516583083, GID 0 (try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid): lchown /app/node_modules/content-type/HISTORY.md: invalid argument

Add custom certificate?

Hi,
I'm using the "latest" tag of the Mercury Parser API inside a stack of Awesome TTRSS.
Some feeds are failing to be parsed with error, for example:

curl service.mercury:3000/parser?url=https://www.punto-informatico.it/telegram-non-e-sicura-pochi-ingegneri-server-dubai/
{"error":true,"messages":"unable to verify the first certificate"}

I had the same issue with TTRSS itself and fixed it by using a custom certificate. Can I also supply a custom certificate to Mercury Parser? Thanks!

  • Platform: Raspberry OS
  • Mercury Parser API Version: latest
  • Node Version:

Expected Behavior

Get the parsed article

Current Behavior

curl service.mercury:3000/parser?url=https://www.punto-informatico.it/telegram-non-e-sicura-pochi-ingegneri-server-dubai/
{"error":true,"messages":"unable to verify the first certificate"}

Possible Solution

Provide a way to use a custom CA certificate

Handle the full-text content in other language

Non-English web pages are not returned as readable text. The response contains strings of hex digits (e.g. "&#x4E2D;&#x65B0;&#x7F51;" in place of "中新网") instead of the text itself. Is there a way to fix it? I tried the CLI version of Mercury Parser with the parameter --format markdown, which produced correct text, but I have no idea how to pass this kind of parameter when calling mercury-parser-api. Please try the example URLs below to reproduce the problem:

  1. https://news.sina.com.cn/c/2021-01-23/doc-ikftssan9988691.shtml
  2. http://www.chinanews.com/sh/2021/01-24/9395190.shtml
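If the "hex digits" are HTML numeric character references (an assumption based on the description, not something confirmed in this thread), a client can decode them after receiving the response. A minimal sketch of such a workaround follows; decodeNumericEntities is a hypothetical helper, and named entities such as &quot; are deliberately left untouched.

```javascript
// Decode numeric character references (&#x4E2D; or &#20013;) in a string.
// decodeNumericEntities is a hypothetical client-side workaround, not
// part of mercury-parser-api.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (match, ref) => {
    const isHex = ref[0] === 'x' || ref[0] === 'X';
    const codePoint = isHex ? parseInt(ref.slice(1), 16) : parseInt(ref, 10);
    // Leave out-of-range references as they are instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericEntities('&#x4E2D;&#x65B0;&#x7F51;')); // → 中新网
```

A full HTML parser handles more cases; this covers only the numeric form.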

Site Parsing Issue Q

Hi there Henry, I am running this on my NAS with a TTRSS build, and generally everything works well.

However, I have just noticed it is having issues with at least one site. I am a bit new to this and was hoping you, as a fellow user, had a tip or two on how it could be fixed. At a glance I did not see a way to customize anything on a per-feed basis, and I believe your build is the latest one there is (it feels like the postlight version has not been maintained for a good while).

For this site, the parser pulls the article text, but also the text of multiple articles below it:
https://www.thumbsticks.com/nintendo-switch-releases-september-21-25-2020-09242020/

Error, response message: getaddrinfo EAI_AGAIN (address)

  • Platform: Linux iZuf63oqyjp30wx07gi8jvZ 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Mercury Parser API Version: latest
  • Node Version: v10.19.0

Expected Behavior

A parsed result is returned

Current Behavior

{"error":true,"messages":"getaddrinfo EAI_AGAIN china.caixin.com"}

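Other links fail the same way, only the hostname at the end changes. The example in README.md fails too.

Steps to Reproduce

  1. Run docker run -p 3000:3000 -d mercury-parser-api
  2. Use the service on port 3000

Or use my instance directly: http://139.196.180.51:3000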

Detailed Description

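Go directly to this link:

http://139.196.180.51:3000/parser?url=

and append the article URL.

Possible Solution

I looked this up online.

The EAI_AGAIN return value from getaddrinfo means that DNS (the name server) returned a temporary failure, and that the request can be retried later.

But no matter how many times I retry, it still fails.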

[Question] Feed processing time

Just as a sanity check: I installed this plugin on a long-running, stable installation (I have been using readability for some years).

As soon as I enabled it, the installation's active multi-process daemon seemed to pause. Does adding or enabling the plugin trigger some table or data work that could explain this? Or is Tiny doing something once it is added? I am just trying to understand why things stalled.

After a while the daemon seemed to kick back in, but it feels like it is processing at a third of the speed it did before (I have 4k feeds). The plugin is simply enabled; no feed uses it, nor does any filter trigger it yet. I am keeping readability enabled until I can find out what is going on.

I'm concerned about the massive drop in feed processing speed after adding this plugin. Does this match anyone's experience? Can it be related to the plugin, or is that unlikely, so that feeds should process as fast as usual?
