andreskrey / readability.php

PHP port of Mozilla's Readability.js

License: Apache License 2.0

PHP 0.62% Makefile 0.01% Dockerfile 0.01% HTML 99.37%

readability domdocument php libxml

readability.php's Issues

Remove certain divs

Hi! Firstly, thanks so much for your amazing work!

I want to remove certain divs based on CSS classes or ids. For example, remove all div.container-to-remove elements.

I think that my question is more a feature request.

Do you think that it is possible? At the least, can you tell me where I should look to add that feature? (I know I have to read everything, but maybe you know where specifically.)

Thanks in advance!
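
One possible workaround, sketched here with PHP's built-in DOM extension only, is to strip the unwanted elements from the HTML before handing it to the parser. The helper name removeElementsByClass is hypothetical and not part of readability.php. The sketch assumes a plain class name with no quotes in it, and loadHTML() may mis-decode non-ASCII text unless the page declares its charset.

```php
<?php
// Hypothetical pre-processing helper (not part of readability.php):
// removes every element whose class list contains $className.
function removeElementsByClass(string $html, string $className): string
{
    $dom = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Common XPath idiom for matching one token inside a class attribute.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $className
    );

    // Copy the matches into an array before detaching them from the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $dom->saveHTML();
}
```

The cleaned string returned by removeElementsByClass($html, 'container-to-remove') could then be passed to parse() in place of the raw page.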

Feature: pass custom CSS selectors to Readability

Example HTML:

<p class="content">Main content</p>
<div>Share the word on social media!</div>

Passing p.content CSS selector to Readability's constructor should return Main content

Problem extracting images

Image extraction fails when

<meta property="og:image:width" content="373"/>

<meta property="og:image:height" content="280"/>

are present in the document.
When I request the main image, I get the value 280 instead of the file path.

Trying to get property 'nodeName' of non-object

Occasionally I get the following Error:

Trying to get property 'nodeName' of non-object in:
readability.php/src/Readability.php on line 1079

while ($parentOfTopCandidate->nodeName !== 'body') { ...
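
The notice suggests the loop dereferences a parent that is null, which happens once the walk runs past the top of the tree without ever meeting a body element. A minimal sketch of a null-safe walk follows; ancestorsUpToBody is a hypothetical helper for illustration and not the library's code.

```php
<?php
// Hypothetical helper: collects ancestor tag names up to (not including)
// <body>, and stops cleanly when the parent chain ends.
function ancestorsUpToBody(DOMNode $node): array
{
    $names = [];
    $parent = $node->parentNode;
    // Check for null first so a detached node never triggers the notice.
    while ($parent !== null && $parent->nodeName !== 'body') {
        $names[] = $parent->nodeName;
        $parent = $parent->parentNode;
    }
    return $names;
}
```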

getContent() returns HTML instead of text?

I am migrating from 0.3 to 1.2

My old wrapper:

<?php
/**
 * Created by PhpStorm.
 * User: ninoskopac
 * Date: 22/02/2018
 * Time: 22:44
 */
declare(strict_types=1);
namespace Read2Me\Content;

use andreskrey\Readability\HTMLParser;
use Goutte\Client;

class Readability
{
    /** @var  HTMLParser $readability */
    private $readability;
    /**
     * @var Client
     */
    private $goutte;
    private $response;

    public function __construct(Client $goutte)
    {
        $this->goutte = $goutte;

        $this->readability = new HTMLParser();
        $this->response = $this->readability->parse($this->goutte->getCrawler()->html());
    }

    public function getContent(): string {
        /** @var \DOMDocument $domDocument */
        $domDocument = $this->response['article'];

        return $domDocument->textContent ?? '';
    }
}

My new wrapper:

<?php
/**
 * Created by PhpStorm.
 * User: ninoskopac
 * Date: 22/02/2018
 * Time: 22:44
 */
declare(strict_types=1);
namespace Read2Me\Content;

use andreskrey\Readability\ParseException;
use andreskrey\Readability\Readability as ReadabilityReadability;
use andreskrey\Readability\Configuration as ReadabilityConfiguration;
use Goutte\Client;

class Readability
{
    /** @var  ReadabilityReadability $readability */
    private $readability;
    /**
     * @var Client
     */
    private $goutte;

    public function __construct(Client $goutte)
    {
        $this->goutte = $goutte;
        $this->readability = new ReadabilityReadability(new ReadabilityConfiguration());

        try {
            $this->readability->parse($this->goutte->getCrawler()->html());
        } catch (ParseException $e) {
        }
    }

    public function getContent(): string {
        $title = $this->readability->getTitle();
        $content = $this->readability->getContent();

        if (empty($content))
            return '';

        return $title . "\n\n" . $content;
    }
}

So, getContent() returns HTML now, and not text?
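
If getContent() does return HTML, as this report suggests, the old plain-text behaviour can be approximated in the wrapper itself. The sketch below uses only core PHP functions; htmlToText is a hypothetical helper name.

```php
<?php
// Hypothetical helper: reduces an HTML fragment to plain text.
function htmlToText(string $html): string
{
    // Drop the tags, then decode entities such as &amp; back to characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse the runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/u', ' ', $text));
}
```

Note that strip_tags() inserts no separator, so adjacent paragraphs with no whitespace between their tags run together.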

getContent issue!

Hi !
I have an issue with this site : https://www.the-star.co.ke/news/2018/03/13/warning-risks-of-food-security_c1728262

The function "parse()" returns only the title!

Fatal error

Hi !

I'm trying your script, but when I run the test script I get this error:

Fatal error: Class 'PHPUnit\Framework\TestCase' not found in /homepages/15/d202009493/htdocs/SERP2db/reader/test/ReadabilityTest.php on line 9

What am I doing wrong? Are there any dependencies?

Thanks !

PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given

Hi,

This URL returns a Fatal Error: https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27

$readability = new Readability(new Configuration());

$html = file_get_contents('https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27');

try {
    $readability->parse($html);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}

Result is:

PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php:324

How to remove tag class

When I use ->parse() I see div class='someClass' in the output.
How can I remove the class from the result?

'installation' error

I get the following error after running this command: $ composer require andreskrey/readability.php

Using version ^2.0 for andreskrey/readability.php
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

Problem 1
- The requested package andreskrey/readability.php No version set (parsed as 1.0.0) is satisfiable by andreskrey/readability.php[No version set (parsed as 1.0.0)] but these conflict with your requirements or minimum-stability.

Installation failed, reverting ./composer.json to its original content.

Aside tags

Hi Andres!

I have an issue with this site:
https://www.engadget.com/2017/11/03/xbox-one-x-review/
I think we could remove all content from :
/html/body/div[1]/div/div[2]/main/aside

By definition, it should be related to the content ( https://www.w3schools.com/tags/tag_aside.asp ),
but by experience, it's more "look at our other articles!"

What do you think?

Strip all images from text

Is there any way that the script could be configured to strip all images from the parsed content?

I neither need nor want them (screen reader) and they just take up space which leads to breaking results.
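
Until such an option exists, images can be removed from the returned HTML in a post-processing step. This is a sketch using PHP's DOM extension; stripImages is a hypothetical helper, not a readability.php method.

```php
<?php
// Hypothetical post-processing helper: removes every <img> element.
function stripImages(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = $dom->getElementsByTagName('img');
    // The node list is live, so walk it backwards while removing nodes.
    for ($i = $images->length - 1; $i >= 0; $i--) {
        $img = $images->item($i);
        $img->parentNode->removeChild($img);
    }

    return $dom->saveHTML();
}
```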

Division By Zero Error Php

I am getting an error message that breaks my stats.

A PHP Error was encountered

Severity: Warning

Message: Division by zero

Filename: controllers/Api.php

Line Number: 93

Here is line 93:
[90] $weekrate = $bht - $ohc;
[91] if ( empty( $ohc ) ) {
[92] $weekrate = 'N/A';
[93] } else $weekrate = floor( $weekrate / $ohc * 100 );

PHP 8 compatibility

I'm using readability.php in the feediron plugin for TT-RSS. TT-RSS recently switched to PHP 8, and there appear to be some deprecations to deal with in PHP 8. Issue originally reported here

plugins/af_readability/vendor/andreskrey/Readability/Readability.php:199
usort(): Returning bool from comparison function is deprecated, return an integer less than, equal to, or greater than zero

Configuration option for _cleanClasses

Hi, Andres!
Could we introduce a new configuration option that controls whether source classes are cleaned?
It's very important for me to be able to work with source classes and ids (ids are currently removed only in Mozilla's Readability).
Thank you!

Getting Top Image

newspaper.py handles this like:

    def get_meta_img_url(self, article_url, doc):
        """Returns the 'top img' as specified by the website
        """
        top_meta_image, try_one, try_two, try_three, try_four = [None] * 5
        try_one = self.get_meta_content(doc, 'meta[property="og:image"]')
        if try_one is None:
            link_icon_kwargs = {'tag': 'link', 'attr': 'rel', 'value': 'icon'}
            elems = self.parser.getElementsByTag(doc, **link_icon_kwargs)
            try_two = elems[0].get('href') if elems else None

        if try_two is None:
            link_img_src_kwargs = \
                {'tag': 'link', 'attr': 'rel', 'value': 'img_src'}
            elems = self.parser.getElementsByTag(doc, **link_img_src_kwargs)
            try_three = elems[0].get('href') if elems else None

        if try_three is None:
            try_four = self.get_meta_content(doc, 'meta[name="og:image"]')

        top_meta_image = try_one or try_two or try_three or try_four

        if top_meta_image:
            return urllib.parse.urljoin(article_url, top_meta_image)
        return ''

This would be pretty straightforward in your PHP as well...

If I get the time, I will update this issue and do it myself, but if you have time this would be a very useful feature.
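
A rough PHP equivalent of the lookup order above, using only the built-in DOM extension, might look like the following. findTopImage is a hypothetical helper, not an existing readability.php method. Unlike the Python version it does not resolve relative URLs against the article URL. Because each query matches the attribute value exactly, og:image:width or og:image:height cannot be mistaken for og:image.

```php
<?php
// Hypothetical helper: returns the first image URL found using the same
// lookup order as the Python snippet, or null when nothing matches.
function findTopImage(string $html): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $queries = [
        '//meta[@property="og:image"]/@content',
        '//link[@rel="icon"]/@href',
        '//link[@rel="img_src"]/@href',
        '//meta[@name="og:image"]/@content',
    ];

    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $value = trim($nodes->item(0)->nodeValue);
            if ($value !== '') {
                return $value;
            }
        }
    }

    return null;
}
```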

PHP Warning: Division by zero

Hello!

I have the warning:
PHP Warning: Division by zero ... andreskrey/readability.php/src/HTMLParser.php on line 1093
With the site:
https://www.womenshealthmag.com/life/macys-black-friday-deals-2017

Refer to lines in method prepArticle:

      $h2 = $article->getElementsByTagName('h2');
        if ($h2->length === 1) {
           $lengthSimilarRate = (mb_strlen($h2->item(0)->textContent) - mb_strlen($this->metadata['title'])) / mb_strlen($this->metadata['title']);

It's caused by an empty metadata['title'], but I don't know why it is empty.
I fixed that by adding !empty(metadata['title']) to the if statement, but I'm sure that isn't the right way.
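
The guard can also live next to the division itself. The sketch below mirrors the quoted calculation in a standalone function; lengthSimilarRate is a hypothetical helper name, and returning null for an empty title is an assumption about what the caller would want.

```php
<?php
// Hypothetical helper mirroring the length comparison quoted above.
// Returns null instead of dividing by zero when the title is empty.
function lengthSimilarRate(string $heading, string $title): ?float
{
    $titleLength = mb_strlen($title);
    if ($titleLength === 0) {
        return null;
    }
    return (mb_strlen($heading) - $titleLength) / $titleLength;
}
```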

Readability plugin as used in tt-rss has quote escape problem on nytimes.com

This mostly works well, but in about 50% of the articles I get from nytimes.com I get this weird artifact:

rss-error

Usually the article follows it regardless, and it works for the rest of the articles. I noticed that the included JavaScript snippet was the same every time, so I looked for where it came from and found it in the original URL:

https://www.nytimes.com/2020/12/03/us/election-officials-threats-trump.html

Since you may need a subscription to access this site, I've made a capture of a possible cause, taken around the point where I saw snippets like this in the HTML (when I search for the start of that JS):

rss-snippit

Regards,
Marius.

Parsing never ends + lot of PHP Notice

I'm trying to parse content of this article: http://www.jornalcorreiodonorte.com.br/2.1149/com-a-chegada-do-ver%C3%A3o-aten%C3%A7%C3%A3o-deve-ser-redobrada-com-os-c%C3%A3es-1.2191389
I'm using v2.1.0.

This produces a lot of PHP Notice:

PHP Notice:  Trying to get property 'nodeName' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1079
PHP Notice:  Trying to get property 'contentScore' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1080
PHP Notice:  Trying to get property 'contentScore' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1091
PHP Notice:  Trying to get property 'parentNode' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1092

Those 4 lines are repeated millions of times (I guess we are in an infinite loop).

My code is very simple:

$readability = new andreskrey\Readability\Readability(new andreskrey\Readability\Configuration());
$readability->parse(file_get_contents('http://www.jornalcorreiodonorte.com.br/2.1149/com-a-chegada-do-ver%C3%A3o-aten%C3%A7%C3%A3o-deve-ser-redobrada-com-os-c%C3%A3es-1.2191389'));

iterator_to_array() error

PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in /tt-rss/vendor//andreskrey/Readability/Nodes/NodeTrait.php:324.
Arch Linux, PHP 7.3.2

Stack trace:
#0 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php(324): iterator_to_array(NULL)
#1 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php(421): andreskrey\Readability\Nodes\DOM\DOMText->getChildren(true)
#2 /tt-rss/vendor/andreskrey/Readability/Readability.php(1272): andreskrey\Readability\Nodes\DOM\DOMText->hasSingleTagInsideElement('td')
#3 /tt-rss/vendor/andreskrey/Readability/Readability.php(1166): andreskrey\Readability\Readability->prepArticle(Object(andreskrey\Readability\Nodes\DOM\DOMDocument))
#4 /tt-rss/vendor/andreskrey/Readability/Readability.php(155): andreskrey\Readability\Readability->rateNodes(Array)
#5 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php on line 324

can get the published date of page

Add a getPublishedDate() method that automatically finds the published date.
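
As a starting point, many pages expose the date through the Open Graph article:published_time meta tag. The sketch below reads only that one tag with PHP's DOM extension; findPublishedDate is a hypothetical standalone function, and real pages would likely need more fallbacks.

```php
<?php
// Hypothetical helper: reads the article:published_time meta tag and
// returns it as a DateTimeImmutable, or null when absent or unparseable.
function findPublishedDate(string $html): ?DateTimeImmutable
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//meta[@property="article:published_time"]/@content');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $value = trim($nodes->item(0)->nodeValue);
    if ($value === '') {
        // An empty string would otherwise be parsed as "now".
        return null;
    }

    try {
        return new DateTimeImmutable($value);
    } catch (Exception $e) {
        return null;
    }
}
```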

Add custom tag to be included in the text

TorrentFreak.com articles are sometimes polluted with <g></g> tags by the spell checker they use. Readability consistently filters out all words encapsulated by these tags, so I'd like to prevent this. How should I do this properly?

I've already tried adding |g to the okMaybeItsACandidate regexp in NodeUtility.php, and tried to add 'g', to $phrasing_elems in NodeTrait.php. Neither of these operations did the trick.

(I am using Readability as part of Tiny Tiny RSS)

Readability return boolean instead html string or Readability instance

I'm using Laravel 5.7.
This is my code:

use andreskrey\Readability\Readability as PhpReadability;
use andreskrey\Readability\Configuration;

class Readability extends Controller
{
    protected $readability;

    public function __construct()
    {
        $this->readability = new PhpReadability(new Configuration());
    }

    public function parse($html)
    {
        return $this->readability->parse($html); // returns boolean instead of HTML string or Readability instance
    }
}

Telegraph articles failing

Repro URL: http://www.telegraph.co.uk/news/2017/11/16/zimbabwes-robert-mugabe-wife-grace-insisting-finishes-term-priest/

Problem: missing paragraphs from the article

Equal parts of article

Example article: https://www.fleetfeet.com/blog/shoe-review-nike-air-zoom-pegasus-36
The problem is that it detects only one part of the article as the main content, the part starting with "Nike Pegasus 36 Ride and Performance". The other parts of the article are omitted. StripUnlikelyCandidates has no effect at all. Also, the image/main image is not detected and is always null, not only in this case: I tried 100+ articles and the image is always null.
Is it possible to detect the full article, not just part of it ?

try {
            $readability = new Readability(new Configuration([
                'StripUnlikelyCandidates' => false
            ]));

            $readability->parse($this->html);

            return $readability;
        } catch (\Exception $exception) {
            throw new \Exception($exception->getMessage());
        }

JavaScript code included in returned content

Hello,

Great library! Congrats.

I found a small issue. For some articles, JavaScript code is returned in the content. Example: https://www.reuters.com/article/us-usa-trump-ivanka/ivanka-trump-closes-fashion-line-to-focus-on-helping-her-father-idUSKBN1KE2JN

Returned content:

WORLD IVANKA TRUMP CLOSES FASHION LINE TO FOCUS ON HELPING HER FATHER BREAKING NEWS July 24, 2018 0 (Reuters) – U.S. President Donald Trump’s daughter Ivanka Trump on Tuesday said she was shutting her fashion line to focus on her role as an informal White House adviser, where she is working on advancing working women. Since Trump’s surprise November 2016 election, his family has faced criticism that its portfolio of real estate and consumer goods businesses, which lean heavily on the Trump name as a marker of luxury, conflict with their roles as Washington officials. Critics of the president called the shutting of the brand a victory for a boycott of Trump-tied businesses that began late in the 2016 campaign. “After 17 months in Washington, I do not know when or if I will ever return to the business, but I do know that my focus for the foreseeable future will be the work I am doing here in Washington,” Ivanka Trump, 36, said in a statement on Tuesday. Amid criticism of potential conflicts of interest, Ivanka Trump in 2017 gave up day-to-day management of her clothing company and put its assets in a trust managed by family members. Her company said licensing contracts would not be renewed and those in place will be allowed to run their course. Mid-priced women’s clothing, shoes and accessories were sold under the label. The 18 people who work for the 11-year-old company will be laid off as it shuts down. Since Trump’s election, retailers including Nordstrom Inc , Hudson’s Bay Co and Sears Holdings Corp have dropped or sharply scaled back their assortment of Trump-branded products, though they typically attributed those decisions to poor sales rather than political messages. Read: US vehicle tariffs considered as heart of attention as Trump meets EU's Juncker Ivanka Trump’s brand said in a statement that retailers including Bloomingdale’s, owned by Macy’s Inc , Dillard’s Inc and Amazon.com Inc , continued to carry her wares. 
According to the Wall Street Journal, online sales of Ivanka Trump’s brand fell nearly 55 percent in the 12 months to June, compared with the year-earlier period, citing Rakuten Intelligence, which gathers email receipts from 5.5 million U.S. consumers. Trump’s combative style on the campaign trail and as president have drawn the family’s brands into political fights, with some supporters hosting events at the luxury Trump International Hotel blocks from the White House even as opponents stage boycotts. The president has been a loud advocate for domestic manufacturing, prompting criticism that much of the Ivanka Trump line was made overseas, as is the vast majority of clothing and footwear sold in the United States. According to media reports, much of her product line has been sourced from China, which is the target of tariffs imposed by the president in a trade conflict. The most organised boycott of Trump-related businesses, “Grab Your Wallet,” called the news about Ivanka’s business a victory. “This is the biggest possible win for Grab Your Wallet,” the group’s co-founder, Shannon Coulter, a San Francisco marketing executive, said in a phone interview. (Reporting by Scott Malone in Boston, Barbara Goldberg and Diana Kruzman in New York and Nivedita Balu in Bengaluru; editing by Richard Chang and Cynthia Osterman) This story has not been edited by Firstpost staff and is generated by auto-feed. 
Updated Date: Jul 25, 2018 04:05 AM Read: Secret tape may not add to legal jeopardy for Trump or Cohen Also Watch Social Media Star: Abhishek Bachchan, Varun Grover reveal how they handle selfies, trolls and broccoli Monday, July 16, 2018 It’s a Wrap: Soorma star Diljit Dosanjh and Hockey legend Sandeep Singh in conversation with Parul Sharma Monday, July 16, 2018 Watch: Dalit man in Uttar Pradesh defies decades of prejudice by taking out baraat in Thakur-dominated Nizampur village Monday, July 16, 2018 India’s water crisis: After govt apathy, Odisha farmer carves out 3-km canal from hills to tackle scarcity in village Sunday, July 15, 2018 Maurizio Sarri, named as new Chelsea manager, is owner Roman Abramovich’s latest gamble in quest for ‘perfect football’ ‘; console.log(titleHeading); $(‘.fp-first-video’).html(response+titleHeading); $(“#”+videoId).attr(“id”,liId); $(“#”+videoId+”-image”).attr(“src”,mainImage); $(“#”+videoId+”-title”).text(otherHead); $(“#”+videoId+”-date”).text(mainDate); $(“#”+videoId+”-image”).attr(“id”,liId+”-image”); $(“#”+videoId+”-title”).attr(“id”,liId+”-title”); $(“#”+videoId+”-date”).attr(“id”,liId+”-date”); $(“.fp-first-video”).attr(“id”,”alsoMain-“+videoId); $(“.fpvideo-wrap”).attr(“id”,”firstAlsoWatch_”+videoId) $(“#mainImage”).val(otherImage); $(“#mainDate”).val(otherDate); } }, error: function(xhr, ajaxOptions, thrownError) { console.log(‘Something went wrong..’); } }); });

As you can see, the end contains JavaScript code.

Thank you for looking into this!

Regards.

Class 'Readability' not found

If I try to install this with:

composer require andreskrey/readability.php

And run

use andreskrey\Readability\HTMLParser;
use andreskrey\Readability\Configuration;

$readability = new Readability(new Configuration());

I get:

PHP Fatal error: Class 'Readability' not found

My composer.json looks like this:

{
"require": {
"andreskrey/readability.php": "^0.3.1"
}
}

Can I set a user-agent string?

It would be useful to configure a user-agent when fetching pages. Is this possible (or a feature you'd be willing to consider)? I can't find anything about it in the docs.
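
In the snippets on this page the caller fetches the HTML itself with file_get_contents(), so the user-agent can be set at that step with a stream context. A minimal sketch follows; buildFetchContext and the MyReader/1.0 string are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: builds a stream context that sends a custom
// User-Agent header and applies a request timeout.
function buildFetchContext(string $userAgent, int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout' => $timeoutSeconds,
        ],
    ]);
}

// Usage (performs a network request, so it is left commented out):
// $html = file_get_contents('https://example.com/article', false, buildFetchContext('MyReader/1.0'));
```

The http options also apply to https URLs.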

Extract content issue

Hi! Is there a way to extract content from an official source? (Example: http://empres-i.fao.org/empres-i/2/obd?idOutbreak=229632&rss=t)

why this url lost image

http://www.mafengwo.cn/i/9359689.html

I get the content, but the images are missing. Please help me.

php7.4 - usort() needs int, not bool

https://github.com/andreskrey/readability.php/blob/master/src/Readability.php#L198

                    // No luck after removing flags, just return the longest text we found during the different loops
                    usort($this->attempts, function ($a, $b) {
                        return $a['textLength'] < $b['textLength'];   // L198
                    });

php 7.4 is complaining:

usort(): Returning bool from comparison function is deprecated, return an integer less than, equal to, or greater than zero

Suggested fix:

- return $a['textLength'] < $b['textLength'];
+ return $b['textLength'] - $a['textLength'];

(NB: replacing < with -, and swapped operands)

Testing:

<?php

function f1(&$x){ // original
	usort($x, function ($a, $b) {
		return $a['textLength'] < $b['textLength'];
	});
}

function f2(&$x){ // proposed change
	usort($x, function ($a, $b) {
		return $b['textLength'] - $a['textLength'];
	});
}

$a = [
	['textLength'=> 5, 'b'=>'1'],
	['textLength'=> 7, 'b'=>'2'],
	['textLength'=> 3, 'b'=>'3'],
	['textLength'=> 1, 'b'=>'4'],
];

$b = $a;
f1($b);
print_r($b);

$b = $a;
f2($b);
print_r($b);

Output:

$ php -e x.php
Array
(
    [0] => Array
        (
            [textLength] => 7
            [b] => 2
        )

    [1] => Array
        (
            [textLength] => 5
            [b] => 1
        )

    [2] => Array
        (
            [textLength] => 3
            [b] => 3
        )

    [3] => Array
        (
            [textLength] => 1
            [b] => 4
        )

)
Array
(
    [0] => Array
        (
            [textLength] => 7
            [b] => 2
        )

    [1] => Array
        (
            [textLength] => 5
            [b] => 1
        )

    [2] => Array
        (
            [textLength] => 3
            [b] => 3
        )

    [3] => Array
        (
            [textLength] => 1
            [b] => 4
        )

)
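
An alternative to subtraction is the spaceship operator (available since PHP 7.0), which always returns -1, 0, or 1 and also behaves correctly if the compared values are ever floats. This is a standalone sketch with made-up sample data, not a patch against the library.

```php
<?php
// Sample data standing in for the attempts list.
$attempts = [
    ['textLength' => 5],
    ['textLength' => 7],
    ['textLength' => 3],
];

// Sort by textLength, longest first. Swapping $b and $a gives descending order.
usort($attempts, function (array $a, array $b): int {
    return $b['textLength'] <=> $a['textLength'];
});

$lengths = array_column($attempts, 'textLength');
print_r($lengths); // 7, 5, 3
```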

HTML infiltrating in imported content

Hello,

First of all, congrats for the code!

On some content, HTML code is infiltrating in extracted textual content. Example for this article:

https://www.yahoo.com/entertainment/chris-evans-ignites-celeb-civil-085545038.html

Snippet of the extracted content:

Chris Evans causing people to choose sides in a way that hasn’t been seen since “Captain America: Civil War.”” data-reactid=”16″ type=”text”>There’s an issue that’s divided Twitter almost as much as anything in politics this week, with Chris Evans causing people to choose sides in a way that hasn’t been seen since “Captain America: Civil War.”

As you can see, this is infiltrating in content:

”” data-reactid=”16″ type=”text”>

This is my current code:

$readConf = new Configuration();
$readConf->setSummonCthulhu(true);
$readability = new Readability($readConf);
$readability->parse($html_string);
$return_me = $readability->getContent();

Any help is appreciated.

Regards,
Szabi.

Need img data-src attribute

Hi! Great work! But the data-src attribute is missing from this set of attributes:

readability.php/src/Readability.php

Line 1521 in 5bd54a8

$img->getAttribute('data-url')

First paragraph of many articles omitted

Some articles have their first paragraph omitted, regardless of the configuration values of CleanConditionally and StripUnlikelyCandidates. Examples below:

https://www.engadget.com/2019/05/10/lyft-just-started-experimenting-with-car-rentals-in-san-francisc/?utm_campaign=homepage&utm_medium=internal&utm_source=dl

The paragraph beginning with "Between offering on-demand rides..." at the beginning of the article is completely absent from cleaned content.

https://www.cnn.com/2019/04/12/us/andrew-chael-katie-bouman-black-hole-image-trnd/index.html

The short paragraph beginning with "When internet trolls tried..." is omitted from the article content. This section gets assigned to the article excerpt and is recoverable that way, but the first example's lost paragraph is not available in the excerpt.

I've looked at several pages where this behavior happens and I'm unable to determine why it is happening. HTML parsing is not my strong suit so I'm hoping someone can take a look. I really appreciate all the work on this library.

The code I'm using to scrape these pages:

$pageContent = file_get_contents($this->url);
    
    $readabilityConfig = new Configuration([
        
    ]);
    $readability = new Readability($readabilityConfig);

    try {
        $readability->parse($pageContent);
    } catch (ParseException $e) {
        echo('parse failed');
    }

Is there a way to rank the title meta tag above the og:title meta tag?

Hello, and thanks for your package.
Is there a way to get the title rather than og:title?

"Malware automatically quarantined" [/nextcloud/apps/news/vendor/....(readability)...]

Related issue: nextcloud/news#1321

Email notification on a cPanel based host where NC is installed via Softaculous, and the News app is enabled from the main apps page inside the NC instance, from the official repository provided there.

The email alert contains this information:

"We have detected malicious PHP script(s) within your web hosting account. To prevent system abuse, our system has automatically quarantined these file(s). This concerns the following:

Generic:HTML/Seospam.B (Generic)
/home//nextcloud/apps/news/vendor/andreskrey/readability.php/test/test-pages/yahoo-4/source.html

Existence of these scripts generally points to third parties having gained access to your web hosting account either by having exploiting a vulnerability in one of the software packages you are using or by a compromised password. We strongly recommend you check your hosting account for other files that appear out of place, which our automated detection system might have missed."

System Information

NC server News app version: 15.3.2
(ref. https://github.com/nextcloud/news/blob/master/CHANGELOG.md )
Android News app version: 0.9.9.50 (irrelevant?)
Nextcloud version: 20.0.7 (irrelevant, AFAICS)

PS. I have not added/activated any Yahoo news sources, hence this malware report is seemingly happening in the software "by itself"(?), not because of any user actions, so I am therefore reporting this both in the news app project and in the "readability" project, as I am unsure where this should be handled.

Using a different parser for HTML, and a few other small issues

Hi Andres, firstly, really great work on Readability.php! We're hoping to use it to replace an older PHP port we use at FiveFilters.org.

I'm not sure if you're still maintaining this and accepting pull requests, but if you are, we've written a blog post covering some issues that we encountered. The blog post is here: https://www.fivefilters.org/2021/readability/. The changes we've made so far are here: https://github.com/fivefilters/readability.php.

Happy to submit any of these changes as pull requests here if you'd like to integrate them into this repository.

PHP Warning: Division by zero (because $topCandidate->contentScore is zero)

Hello!

Here is my sample code:

$url = 'https://france.googleblog.com/2018/05/google-celebre-loeuvre-du-realisateur.html';
$html = file_get_contents($url);
$config = new Configuration;
$config->setSummonCthulhu(true);
$readability = new Readability($config);
$readability->parse($html);

As result, several PHP warnings:

PHP Warning: Division by zero in vendor/andreskrey/readability.php/src/Readability.php on line 986

Same issue with those URLs:

In some cases, disabling SummonCthulhu solves the problem.

Thanks for your help!

Infinite loop on search of parentTopCandidate

Hi!

I have another bug with parentTopCandidate's code when I try to parse:
http://www.motomag.com/Honda-Goldwing-2018-Tambour-battant-ca-donne-quoi.html

Trying to get property of non-object in /vendor/andreskrey/readability.php/src/Readability.php on line 955
Trying to get property of non-object in /vendor/andreskrey/readability.php/src/Readability.php on line 964

I tried to fix that by checking whether parentNode is null each time we call ->parentNode, but it seems more complicated than that, because parentOfTopCandidate is needed for "DOMElement $sibling", which fails with Call to a member function getChildren() on null.

I hope this helps.

When the IMG tag meets data-src, the SRC is lost, causing the image to not display

The test link is https://mp.weixin.qq.com/s/Y1QrPBn4W0RQDjf2YcZbPg. The image is not displayed because the tag has only data-src, not src.

<img data-ratio="1.6866667" data-src="https://mmbiz.qpic.cn/mmbiz_gif/ia1Z7HH4plnAspjkKfE7nb4bkEObaj69CzIEKJUkmevJFgZEgpVOSoUnjic59I5ZNdibeia6pnneT0tViblDtZAYLog/640?wx_fmt=gif" data-type="gif" data-w="300"></img>

code:

require 'vendor/autoload.php';

use andreskrey\Readability\Readability;
use andreskrey\Readability\Configuration;
use andreskrey\Readability\ParseException;

ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)');
$readability = new Readability(new Configuration());

$html = file_get_contents('https://mp.weixin.qq.com/s/Y1QrPBn4W0RQDjf2YcZbPg');

try {
    $readability->parse($html);
    //$data = $readability->getImages();
    //var_dump($data);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}
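
One workaround is to copy data-src into src before parsing, so lazy-loaded images carry a usable source. The sketch uses PHP's DOM extension; promoteLazyImageSources is a hypothetical helper and not part of readability.php.

```php
<?php
// Hypothetical pre-processing helper: gives every <img> that has only a
// data-src attribute a matching src attribute.
function promoteLazyImageSources(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        // Leave images alone when they already declare a src.
        if (!$img->hasAttribute('src') && $img->hasAttribute('data-src')) {
            $img->setAttribute('src', $img->getAttribute('data-src'));
        }
    }

    return $dom->saveHTML();
}
```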

Some domains not parsing correctly

Images problems

Doesn't take pictures from some sites

$readability = new \andreskrey\Readability\Readability((new \andreskrey\Readability\Configuration())
        ->setFixRelativeURLs(true)
        ->setOriginalURL('https://stayglam.com/life/sexy-tattoos/'));

    $html = file_get_contents('https://stayglam.com/life/sexy-tattoos/');

    try {
        $readability->parse($html);
        //$data = $readability->getImages();
        //var_dump($data);
        echo $readability;
    }
    catch (\andreskrey\Readability\ParseException $e) {
        echo sprintf('Error processing text: %s', $e->getMessage());
    }

Improve regexp unlikelyCandidates

Hello Andres!

In some cases, we need to change the unlikelyCandidates regexp in order to remove the bad content.
Example:
http://mashable.com/2017/11/03/xbox-one-x-review/
The class "shopping-disclaimer" needs to be removed.

In this case, I am thinking of making a PR to add the word "disclaimer" to the regexp.

But what do you think about using a configuration parameter to change the unlikelyCandidates regexp for other, trickier cases?

I also need to add some logs to know why some nodes are not removed from the content.
So I think it would be nice to have a logger like Monolog or a simple debug mode.
I would like to have your opinion about that too.

Thanks!

PS: You reached an awesome milestone with this v1!
Congrats! 👍

Question: can the library tell me if the website is a good candidate for extraction?

Howdy,

I've been testing both your and the original Node libraries, and I've noticed that for some websites (eg workplace.stackexchange.com) the results are incorrect.

https://workplace.stackexchange.com/questions/102524/one-of-my-subordinates-child-passed-away-how-can-i-inform-my-team => skips the original post
https://workplace.stackexchange.com/questions/102692/how-to-deal-with-flaws-in-tests-of-potential-employers => skips everything

Is there a way to ask the library "hey, what do you think how did you do regarding extracting the content for this URL?" or, more plainly, "how confident are you that the content you extracted is relevant?" ? Something like an overall score?

That way, I could fall back to my internal text extraction algorithm.

setFixRelativeURLs doesn't seem to apply to image URLs

Readability isn't setting relative URLs for the image source/src.

Type error: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in vendor/andreskrey/readability.php/src/Readability.php (line 1274)

Here is one reference html for this error: https://gist.github.com/Slityak/070b8e8e44bf87981892d29ff4db1a04

I'm not sure, but I think the error occurs when a <td> has only one child (firstChild) that has no children of its own, just simple text. Like this part:

<table width="100%" border="0" cellspacing="0" cellpadding="5"> 
  <tbody>
    <tr> 
      <td> 
        <div align="center">
          Summer Hockey Clinic Registration Form 
        </div>
      </td> 
    </tr> 
  </tbody>
</table>
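
Whatever the exact trigger, the TypeError can be avoided by not passing a non-traversable value to iterator_to_array(). A minimal null-safe sketch follows; childrenOf is a hypothetical helper and not the library's getChildren() method.

```php
<?php
// Hypothetical helper: returns a node's children as an array, falling back
// to an empty array when childNodes is not a traversable list.
function childrenOf(DOMNode $node): array
{
    $children = $node->childNodes;
    return $children instanceof Traversable ? iterator_to_array($children) : [];
}
```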
