readability.php's People

Contributors

andreskrey, davidfricker, ninoskopac, topotru

readability.php's Issues

Remove certain divs

Hi! Firstly, thanks so much for your amazing work!

I want to remove certain divs based on CSS classes or ids. For example, remove every div.container-to-remove.

I think that my question is more a feature request.

Do you think that is possible? At the least, can you tell me where I should look to add that feature? (Well, I know I have to look at everything, but maybe you know where, specifically.)

Thanks in advance!
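
One possible workaround, until such an option exists, is to strip the unwanted elements from the HTML before handing it to the parser. The sketch below is not part of readability.php: `removeElements()` is a hypothetical helper built only on PHP's DOM extension, and the class name comes from the example above.

```php
<?php
/**
 * Hypothetical helper (not a readability.php API): remove every element
 * matching an XPath expression and return the resulting HTML document.
 */
function removeElements(string $html, string $xpathExpression): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathExpression);
    if ($nodes === false) {
        throw new InvalidArgumentException('Invalid XPath expression');
    }
    // Copy the matches to an array so removals cannot disturb the iteration.
    foreach (iterator_to_array($nodes) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }
    return $doc->saveHTML();
}

// Match the class as a whole word, so "container-to-remove-2" is left alone.
$expression = '//div[contains(concat(" ", normalize-space(@class), " "), " container-to-remove ")]';
$html = '<html><body><div class="ad container-to-remove">x</div><p>Keep me</p></body></html>';
$cleaned = removeElements($html, $expression);
```

The cleaned string can then be passed to the parser in place of the original HTML. Note that `loadHTML()` assumes ISO-8859-1 unless the document declares its charset, so pages without a charset declaration may need extra handling.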

Problem extracting images

There is a problem extracting images when

<meta property="og:image:width" content="373"/> 

or

<meta property="og:image:height" content="280"/>

are present in the document.
When I request the main image, I get the value 280 instead of the file path.
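
The report does not say where the value 280 comes from. If the cause is a loose match on the property name (an assumption, not something confirmed here), an exact match on `og:image` avoids picking up the width and height tags. A minimal sketch using only PHP's DOM extension; `findOgImage()` is a hypothetical helper, not a library function:

```php
<?php
/**
 * Hypothetical helper: return the content of <meta property="og:image">,
 * ignoring og:image:width, og:image:height and other sub-properties.
 */
function findOgImage(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "=" is an exact comparison, so "og:image:height" does not match.
    $nodes = $xpath->query('//meta[@property="og:image"]/@content');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}
```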

Trying to get property 'nodeName' of non-object

Occasionally I get the following Error:

Trying to get property 'nodeName' of non-object in:
readability.php/src/Readability.php on line 1079

while ($parentOfTopCandidate->nodeName !== 'body') { ...

getContent() returns HTML instead of text?

I am migrating from 0.3 to 1.2

My old wrapper:

<?php
/**
 * Created by PhpStorm.
 * User: ninoskopac
 * Date: 22/02/2018
 * Time: 22:44
 */
declare(strict_types=1);
namespace Read2Me\Content;

use andreskrey\Readability\HTMLParser;
use Goutte\Client;

class Readability
{
    /** @var  HTMLParser $readability */
    private $readability;
    /**
     * @var Client
     */
    private $goutte;
    private $response;

    public function __construct(Client $goutte)
    {
        $this->goutte = $goutte;

        $this->readability = new HTMLParser();
        $this->response = $this->readability->parse($this->goutte->getCrawler()->html());
    }

    public function getContent(): string {
        /** @var \DOMDocument $domDocument */
        $domDocument = $this->response['article'];

        return $domDocument->textContent ?? '';
    }
}

My new wrapper:

<?php
/**
 * Created by PhpStorm.
 * User: ninoskopac
 * Date: 22/02/2018
 * Time: 22:44
 */
declare(strict_types=1);
namespace Read2Me\Content;

use andreskrey\Readability\ParseException;
use andreskrey\Readability\Readability as ReadabilityReadability;
use andreskrey\Readability\Configuration as ReadabilityConfiguration;
use Goutte\Client;

class Readability
{
    /** @var  ReadabilityReadability $readability */
    private $readability;
    /**
     * @var Client
     */
    private $goutte;

    public function __construct(Client $goutte)
    {
        $this->goutte = $goutte;
        $this->readability = new ReadabilityReadability(new ReadabilityConfiguration());

        try {
            $this->readability->parse($this->goutte->getCrawler()->html());
        } catch (ParseException $e) {
        }
    }

    public function getContent(): string {
        $title = $this->readability->getTitle();
        $content = $this->readability->getContent();

        if (empty($content))
            return '';

        return $title . "\n\n" . $content;
    }
}

So, getContent() returns HTML now, and not text?
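
If getContent() does return an HTML string, as the question suggests, plain text can be recovered from it with standard PHP functions. The following is a rough sketch, not a library feature; `htmlToPlainText()` is a hypothetical helper and the list of block-level tags is deliberately short.

```php
<?php
/**
 * Hypothetical helper: convert an HTML fragment to plain text while
 * keeping paragraph boundaries as newlines.
 */
function htmlToPlainText(string $html): string
{
    // Turn <br> and common block-level closing tags into newlines first,
    // otherwise strip_tags() would glue adjacent paragraphs together.
    $withBreaks = preg_replace('~<br\s*/?>|</(p|div|h[1-6]|li|tr)>~i', "\n", $html);
    $text = html_entity_decode(strip_tags($withBreaks), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of spaces and tabs, then runs of blank lines.
    $text = preg_replace('~[ \t]+~', ' ', $text);
    $text = preg_replace('~\s*\n\s*~', "\n", $text);
    return trim($text);
}

$text = htmlToPlainText('<div><p>Hello &amp; welcome</p><p>Second<br>line</p></div>');
```

In the wrapper above, the last line of getContent() could then return the title followed by `htmlToPlainText($content)`.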

Fatal error

Hi!

I'm trying your script, but when I run the test script I get this error:

Fatal error: Class 'PHPUnit\Framework\TestCase' not found in /homepages/15/d202009493/htdocs/SERP2db/reader/test/ReadabilityTest.php on line 9

What am I doing wrong? Are there any dependencies?

Thanks!

PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given

Hi,

This URL returns a Fatal Error: https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27

$readability = new Readability(new Configuration());

$html = file_get_contents('https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27');

try {
    $readability->parse($html);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}

Result is:

PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php:324

'installation' error

I get the following error after running this command: composer require andreskrey/readability.php

Using version ^2.0 for andreskrey/readability.php
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

Problem 1
- The requested package andreskrey/readability.php No version set (parsed as 1.0.0) is satisfiable by andreskrey/readability.php[No version set (parsed as 1.0.0)] but these conflict with your requirements or minimum-stability.

Installation failed, reverting ./composer.json to its original content.

Strip all images from text

Is there any way that the script could be configured to strip all images from the parsed content?

I neither need nor want them (screen reader) and they just take up space which leads to breaking results.
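
Whether the library offers a switch for this is not stated here. As a fallback, images can be removed from the extracted content after parsing. A minimal sketch, assuming the content is available as an HTML string; `stripImages()` is a hypothetical helper built on PHP's DOM extension:

```php
<?php
/**
 * Hypothetical helper: remove <img> and <picture> elements from an HTML
 * fragment and return the remaining markup.
 */
function stripImages(string $contentHtml): string
{
    if (trim($contentHtml) === '') {
        return '';
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($contentHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//img | //picture')) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // loadHTML() wraps fragments in <html><body>; return only the body's children.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```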

Division By Zero Error Php

I am getting an error message, and my stats now fail with this error.

A PHP Error was encountered

Severity: Warning

Message: Division by zero

Filename: controllers/Api.php

Line Number: 93

Here is line 93, with the lines around it:
[90] $weekrate = $bht - $ohc;
. if ( empty( $ohc ) ) {
. $weekrate = 'N/A';
[93] } else $weekrate = floor( $weekrate / $ohc * 100 );

PHP 8 compatibility

I'm utilising readability.php in the feediron plugin for TT-RSS. TT-RSS recently switched to PHP 8, and there appear to be some deprecations to deal with in PHP 8. Issue originally reported here

plugins/af_readability/vendor/andreskrey/Readability/Readability.php:199
usort(): Returning bool from comparison function is deprecated, return an integer less than, equal to, or greater than zero

Configuration option for _cleanClasses

Hi, Andres!
Could we introduce a new configuration option that controls whether source classes are cleaned?
It's very important for me to be able to work with the source classes and ids (at the moment, ids are removed only in Mozilla's Readability).
Thank you!

Getting Top Image

newspaper.py handles this like:

    def get_meta_img_url(self, article_url, doc):
        """Returns the 'top img' as specified by the website
        """
        top_meta_image, try_one, try_two, try_three, try_four = [None] * 5
        try_one = self.get_meta_content(doc, 'meta[property="og:image"]')
        if try_one is None:
            link_icon_kwargs = {'tag': 'link', 'attr': 'rel', 'value': 'icon'}
            elems = self.parser.getElementsByTag(doc, **link_icon_kwargs)
            try_two = elems[0].get('href') if elems else None

        if try_two is None:
            link_img_src_kwargs = \
                {'tag': 'link', 'attr': 'rel', 'value': 'img_src'}
            elems = self.parser.getElementsByTag(doc, **link_img_src_kwargs)
            try_three = elems[0].get('href') if elems else None

        if try_three is None:
            try_four = self.get_meta_content(doc, 'meta[name="og:image"]')

        top_meta_image = try_one or try_two or try_three or try_four

        if top_meta_image:
            return urllib.parse.urljoin(article_url, top_meta_image)
        return ''

This would be pretty straightforward in your PHP as well...

If I get the time, I will update this issue and do it myself, but if you have time this would be a very useful feature.
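
As a starting point, here is a rough PHP rendering of the same lookup order (og:image property, icon link, img_src link, og:image name). It is a sketch and not the library's implementation: `getMetaImageUrl()` and `resolveUrl()` are hypothetical names, and `resolveUrl()` is a simplified stand-in for Python's `urljoin` that does not normalise `../` segments.

```php
<?php
/** Simplified URL resolver (assumption: no "../" handling is needed). */
function resolveUrl(string $base, string $candidate): string
{
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $candidate)) {
        return $candidate; // already absolute
    }
    $parts = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (strpos($candidate, '//') === 0) {
        return $scheme . ':' . $candidate; // protocol-relative
    }
    if (strpos($candidate, '/') === 0) {
        return $origin . $candidate; // root-relative
    }
    $path = $parts['path'] ?? '/';
    $slash = strrpos($path, '/');
    $dir = $slash === false ? '/' : substr($path, 0, $slash + 1);
    return $origin . $dir . $candidate;
}

/** Return the "top image" URL declared by the page, or '' if none is found. */
function getMetaImageUrl(string $articleUrl, string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Same order of attempts as the Python code above.
    $queries = [
        '//meta[@property="og:image"]/@content',
        '//link[@rel="icon"]/@href',
        '//link[@rel="img_src"]/@href',
        '//meta[@name="og:image"]/@content',
    ];
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $value = trim($nodes->item(0)->nodeValue);
            if ($value !== '') {
                return resolveUrl($articleUrl, $value);
            }
        }
    }
    return '';
}
```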

PHP Warning: Division by zero

Hello!

I have the warning:
PHP Warning: Division by zero ... andreskrey/readability.php/src/HTMLParser.php on line 1093
With the site:
https://www.womenshealthmag.com/life/macys-black-friday-deals-2017

The warning refers to these lines in the prepArticle method:

      $h2 = $article->getElementsByTagName('h2');
        if ($h2->length === 1) {
           $lengthSimilarRate = (mb_strlen($h2->item(0)->textContent) - mb_strlen($this->metadata['title'])) / mb_strlen($this->metadata['title']);

It's caused by an empty metadata['title'], but I don't know why it is empty.
I worked around it by adding !empty($this->metadata['title']) to the if statement, but I'm sure that isn't the right way.
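
Whatever the reason for the empty title, the ratio itself can be computed defensively. A minimal sketch of that guard, written as a standalone function with a hypothetical name rather than as a patch to the library:

```php
<?php
/**
 * Hypothetical helper: relative length difference between a heading and the
 * title, or null when the title is empty and the ratio is undefined.
 */
function titleLengthSimilarity(string $headingText, string $title): ?float
{
    $titleLength = mb_strlen($title);
    if ($titleLength === 0) {
        return null; // nothing to compare against, so skip the division
    }
    return (mb_strlen($headingText) - $titleLength) / $titleLength;
}

$rate = titleLengthSimilarity('abcdef', 'abcd'); // (6 - 4) / 4
```

A caller would treat null as "no comparison possible" and leave the h2 untouched.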

Readability plugin as used in tt-rss has quote escape problem on nytimes.com

This mostly works well, but in about 50% of the articles I get from nytimes.com I get this weird artifact:

rss-error

Usually the article follows it regardless, and it works for the rest of the articles. I did notice that the included JavaScript snippet was the same every time, so I looked for where it came from and found it in the original URL:

https://www.nytimes.com/2020/12/03/us/election-officials-threats-trump.html

Since you may need a subscription to access this site, I’ve made a capture of a possible cause, taken around the point where I saw snippets like this in the HTML (found by searching for the start of that JS):

rss-snippit

Regards,
Marius.

Parsing never ends + lot of PHP Notice

I'm trying to parse content of this article: http://www.jornalcorreiodonorte.com.br/2.1149/com-a-chegada-do-ver%C3%A3o-aten%C3%A7%C3%A3o-deve-ser-redobrada-com-os-c%C3%A3es-1.2191389
I'm using v2.1.0.

This produces a lot of PHP Notice:

PHP Notice:  Trying to get property 'nodeName' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1079
PHP Notice:  Trying to get property 'contentScore' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1080
PHP Notice:  Trying to get property 'contentScore' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1091
PHP Notice:  Trying to get property 'parentNode' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1092

Those 4 lines are repeated millions of times (I guess we are in an infinite loop).

My code is very simple:

$readability = new andreskrey\Readability\Readability(new andreskrey\Readability\Configuration());
$readability->parse(file_get_contents('http://www.jornalcorreiodonorte.com.br/2.1149/com-a-chegada-do-ver%C3%A3o-aten%C3%A7%C3%A3o-deve-ser-redobrada-com-os-c%C3%A3es-1.2191389'));
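
The notices suggest that the loop keeps reading properties of a parent that is already null. The pattern that avoids this is to test for null in the loop condition. The sketch below shows that pattern on plain DOM nodes; it is not the library's code, and `ancestorsUntilBody()` is a hypothetical helper.

```php
<?php
/**
 * Hypothetical helper: collect the names of a node's ancestors, stopping at
 * <body> or when the parent chain ends, whichever comes first.
 */
function ancestorsUntilBody(DOMNode $node): array
{
    $names = [];
    $parent = $node->parentNode;
    // The null check ends the loop for detached nodes and for documents
    // where no <body> is ever reached.
    while ($parent !== null && $parent->nodeName !== 'body') {
        $names[] = $parent->nodeName;
        $parent = $parent->parentNode;
    }
    return $names;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><section><p>t</p></section></div></body></html>');
$paragraph = $doc->getElementsByTagName('p')->item(0);
$names = ancestorsUntilBody($paragraph);
```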

iterator_to_array() error

PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in /tt-rss/vendor//andreskrey/Readability/Nodes/NodeTrait.php:324.
Arch Linux, PHP 7.3.2

Stack trace:
#0 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php(324): iterator_to_array(NULL)
#1 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php(421): andreskrey\Readability\Nodes\DOM\DOMText->getChildren(true)
#2 /tt-rss/vendor/andreskrey/Readability/Readability.php(1272): andreskrey\Readability\Nodes\DOM\DOMText->hasSingleTagInsideElement('td')
#3 /tt-rss/vendor/andreskrey/Readability/Readability.php(1166): andreskrey\Readability\Readability->prepArticle(Object(andreskrey\Readability\Nodes\DOM\DOMDocument))
#4 /tt-rss/vendor/andreskrey/Readability/Readability.php(155): andreskrey\Readability\Readability->rateNodes(Array)
#5 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php on line 324

Add custom tag to be included in the text

TorrentFreak.com articles are sometimes polluted with <g></g> tags by the spell checker they use. Readability consistently filters out all words encapsulated by these tags, so I'd like to prevent this. How should I do this properly?

I've already tried adding |g to the okMaybeItsACandidate regexp in NodeUtility.php, and tried to add 'g', to $phrasing_elems in NodeTrait.php. Neither of these operations did the trick.

(I am using Readability as part of Tiny Tiny RSS)

Readability returns a boolean instead of an HTML string or Readability instance

I'm using Laravel 5.7.
This is my code:

use andreskrey\Readability\Readability as PhpReadability;
use andreskrey\Readability\Configuration;

class Readability extends Controller
{
    protected $readability;

    public function __construct()
    {
        $this->readability = new PhpReadability(new Configuration());
    }

    public function parse($html)
    {
        // Returns a boolean instead of an HTML string or Readability instance
        return $this->readability->parse($html);
    }
}

Equal parts of article

Example article: https://www.fleetfeet.com/blog/shoe-review-nike-air-zoom-pegasus-36
The problem is that only one part of the article, starting with "Nike Pegasus 36 Ride and Performance", is detected as the main content. The other parts of the article are omitted. StripUnlikelyCandidates has no effect at all. Also, the image/main image is not detected and is always null, not only in this case: I tried 100+ articles and the image is always null.
Is it possible to detect the full article, not just part of it?

try {
    $readability = new Readability(new Configuration([
        'StripUnlikelyCandidates' => false
    ]));

    $readability->parse($this->html);

    return $readability;
} catch (\Exception $exception) {
    throw new \Exception($exception->getMessage());
}

JavaScript code included in returned content

Hello,

Great library! Congrats.

I found a small issue. For some articles, JavaScript code is returned in the content. Example: https://www.reuters.com/article/us-usa-trump-ivanka/ivanka-trump-closes-fashion-line-to-focus-on-helping-her-father-idUSKBN1KE2JN

Returned content:

WORLD IVANKA TRUMP CLOSES FASHION LINE TO FOCUS ON HELPING HER FATHER BREAKING NEWS July 24, 2018 0 (Reuters) – U.S. President Donald Trump’s daughter Ivanka Trump on Tuesday said she was shutting her fashion line to focus on her role as an informal White House adviser, where she is working on advancing working women. Since Trump’s surprise November 2016 election, his family has faced criticism that its portfolio of real estate and consumer goods businesses, which lean heavily on the Trump name as a marker of luxury, conflict with their roles as Washington officials. Critics of the president called the shutting of the brand a victory for a boycott of Trump-tied businesses that began late in the 2016 campaign. “After 17 months in Washington, I do not know when or if I will ever return to the business, but I do know that my focus for the foreseeable future will be the work I am doing here in Washington,” Ivanka Trump, 36, said in a statement on Tuesday. Amid criticism of potential conflicts of interest, Ivanka Trump in 2017 gave up day-to-day management of her clothing company and put its assets in a trust managed by family members. Her company said licensing contracts would not be renewed and those in place will be allowed to run their course. Mid-priced women’s clothing, shoes and accessories were sold under the label. The 18 people who work for the 11-year-old company will be laid off as it shuts down. Since Trump’s election, retailers including Nordstrom Inc , Hudson’s Bay Co and Sears Holdings Corp have dropped or sharply scaled back their assortment of Trump-branded products, though they typically attributed those decisions to poor sales rather than political messages. Read: US vehicle tariffs considered as heart of attention as Trump meets EU's Juncker Ivanka Trump’s brand said in a statement that retailers including Bloomingdale’s, owned by Macy’s Inc , Dillard’s Inc and Amazon.com Inc , continued to carry her wares. 
According to the Wall Street Journal, online sales of Ivanka Trump’s brand fell nearly 55 percent in the 12 months to June, compared with the year-earlier period, citing Rakuten Intelligence, which gathers email receipts from 5.5 million U.S. consumers. Trump’s combative style on the campaign trail and as president have drawn the family’s brands into political fights, with some supporters hosting events at the luxury Trump International Hotel blocks from the White House even as opponents stage boycotts. The president has been a loud advocate for domestic manufacturing, prompting criticism that much of the Ivanka Trump line was made overseas, as is the vast majority of clothing and footwear sold in the United States. According to media reports, much of her product line has been sourced from China, which is the target of tariffs imposed by the president in a trade conflict. The most organised boycott of Trump-related businesses, “Grab Your Wallet,” called the news about Ivanka’s business a victory. “This is the biggest possible win for Grab Your Wallet,” the group’s co-founder, Shannon Coulter, a San Francisco marketing executive, said in a phone interview. (Reporting by Scott Malone in Boston, Barbara Goldberg and Diana Kruzman in New York and Nivedita Balu in Bengaluru; editing by Richard Chang and Cynthia Osterman) This story has not been edited by Firstpost staff and is generated by auto-feed. 
Updated Date: Jul 25, 2018 04:05 AM Read: Secret tape may not add to legal jeopardy for Trump or Cohen Also Watch Social Media Star: Abhishek Bachchan, Varun Grover reveal how they handle selfies, trolls and broccoli Monday, July 16, 2018 It’s a Wrap: Soorma star Diljit Dosanjh and Hockey legend Sandeep Singh in conversation with Parul Sharma Monday, July 16, 2018 Watch: Dalit man in Uttar Pradesh defies decades of prejudice by taking out baraat in Thakur-dominated Nizampur village Monday, July 16, 2018 India’s water crisis: After govt apathy, Odisha farmer carves out 3-km canal from hills to tackle scarcity in village Sunday, July 15, 2018 Maurizio Sarri, named as new Chelsea manager, is owner Roman Abramovich’s latest gamble in quest for ‘perfect football’ ‘; console.log(titleHeading); $(‘.fp-first-video’).html(response+titleHeading); $(“#”+videoId).attr(“id”,liId); $(“#”+videoId+”-image”).attr(“src”,mainImage); $(“#”+videoId+”-title”).text(otherHead); $(“#”+videoId+”-date”).text(mainDate); $(“#”+videoId+”-image”).attr(“id”,liId+”-image”); $(“#”+videoId+”-title”).attr(“id”,liId+”-title”); $(“#”+videoId+”-date”).attr(“id”,liId+”-date”); $(“.fp-first-video”).attr(“id”,”alsoMain-“+videoId); $(“.fpvideo-wrap”).attr(“id”,”firstAlsoWatch_”+videoId) $(“#mainImage”).val(otherImage); $(“#mainDate”).val(otherDate); } }, error: function(xhr, ajaxOptions, thrownError) { console.log(‘Something went wrong..’); } }); });

As you can see, the end contains JavaScript code.

Thank you for looking into this!

Regards.

Class 'Readability' not found

If I try to install this with:

composer require andreskrey/readability.php

And run

use andreskrey\Readability\HTMLParser;
use andreskrey\Readability\Configuration;

$readability = new Readability(new Configuration());

I get:

PHP Fatal error: Class 'Readability' not found

My composer.json looks like this:

{
"require": {
"andreskrey/readability.php": "^0.3.1"
}
}

Can I set a user-agent string?

It would be useful to configure a user-agent when fetching pages. Is this possible (or a feature you'd be willing to consider)? I can't find anything about it in the docs.
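
In the examples throughout these issues, the page is fetched by the caller (usually with file_get_contents) and only the HTML string is passed to the parser. Under that assumption, the user-agent can be set on the fetch itself with a stream context. A minimal sketch; `buildFetchContext()` and the user-agent string are made up for illustration:

```php
<?php
/**
 * Hypothetical helper: build a stream context that sends a custom
 * User-Agent header with http:// and https:// requests.
 */
function buildFetchContext(string $userAgent, int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [ // the "http" options also apply to https:// URLs
            'user_agent' => $userAgent,
            'timeout' => $timeoutSeconds,
            'follow_location' => 1,
        ],
    ]);
}

$context = buildFetchContext('MyReader/1.0 (+https://example.com/bot)');
// Usage, with $url supplied by the caller:
// $html = file_get_contents($url, false, $context);
```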

php7.4 - usort() needs int, not bool

https://github.com/andreskrey/readability.php/blob/master/src/Readability.php#L198

                    // No luck after removing flags, just return the longest text we found during the different loops
                    usort($this->attempts, function ($a, $b) {
                        return $a['textLength'] < $b['textLength'];   // L198
                    });

php 7.4 is complaining:

usort(): Returning bool from comparison function is deprecated, return an integer less than, equal to, or greater than zero

Suggested fix:

- return $a['textLength'] < $b['textLength'];
+ return $b['textLength'] - $a['textLength'];

(NB: replacing < with -, and swapped operands)
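
On PHP 7.0 and later, the spaceship operator is another way to return the required integer. A minimal sketch with made-up sample data, sorting in the same descending order as the suggested fix:

```php
<?php
$attempts = [
    ['textLength' => 5],
    ['textLength' => 7],
    ['textLength' => 3],
];

usort($attempts, function (array $a, array $b): int {
    // $b before $a gives descending order: longest text first.
    return $b['textLength'] <=> $a['textLength'];
});

$lengths = array_column($attempts, 'textLength');
```

Subtraction is fine for integer lengths. The operator form also stays correct if the compared values are ever floats, because usort() casts the callback's return value to int, so a difference such as 0.5 would be treated as 0.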

Testing:

<?php

function f1(&$x){ // original
	usort($x, function ($a, $b) {
		return $a['textLength'] < $b['textLength'];
	});
}

function f2(&$x){ // proposed change
	usort($x, function ($a, $b) {
		return $b['textLength'] - $a['textLength'];
	});
}

$a = [
	['textLength'=> 5, 'b'=>'1'],
	['textLength'=> 7, 'b'=>'2'],
	['textLength'=> 3, 'b'=>'3'],
	['textLength'=> 1, 'b'=>'4'],
];

$b = $a;
f1($b);
print_r($b);

$b = $a;
f2($b);
print_r($b);

Output:

$ php -e x.php
Array
(
    [0] => Array
        (
            [textLength] => 7
            [b] => 2
        )

    [1] => Array
        (
            [textLength] => 5
            [b] => 1
        )

    [2] => Array
        (
            [textLength] => 3
            [b] => 3
        )

    [3] => Array
        (
            [textLength] => 1
            [b] => 4
        )

)
Array
(
    [0] => Array
        (
            [textLength] => 7
            [b] => 2
        )

    [1] => Array
        (
            [textLength] => 5
            [b] => 1
        )

    [2] => Array
        (
            [textLength] => 3
            [b] => 3
        )

    [3] => Array
        (
            [textLength] => 1
            [b] => 4
        )

)

HTML infiltrating in imported content

Hello,

First of all, congrats for the code!

On some content, HTML code is leaking into the extracted text content. Example for this article:

https://www.yahoo.com/entertainment/chris-evans-ignites-celeb-civil-085545038.html

Snippet of the extracted content:

Chris Evans causing people to choose sides in a way that hasn’t been seen since “Captain America: Civil War.”” data-reactid=”16″ type=”text”>There’s an issue that’s divided Twitter almost as much as anything in politics this week, with Chris Evans causing people to choose sides in a way that hasn’t been seen since “Captain America: Civil War.”

As you can see, this is leaking into the content:

”” data-reactid=”16″ type=”text”>

This is my current code:

$readConf = new Configuration();
$readConf->setSummonCthulhu(true);
$readability = new Readability($readConf);
$readability->parse($html_string);
$return_me = $readability->getContent();

Any help is appreciated.

Regards,
Szabi.

First paragraph of many articles omitted

Some articles have their first paragraph omitted, regardless of the configuration values of CleanConditionally and StripUnlikelyCandidates. Examples below:

https://www.engadget.com/2019/05/10/lyft-just-started-experimenting-with-car-rentals-in-san-francisc/?utm_campaign=homepage&utm_medium=internal&utm_source=dl

The paragraph beginning with "Between offering on-demand rides..." at the beginning of the article is completely absent from cleaned content.

https://www.cnn.com/2019/04/12/us/andrew-chael-katie-bouman-black-hole-image-trnd/index.html

The short paragraph beginning with "When internet trolls tried..." is omitted from the article content. This section gets assigned to the article excerpt and is recoverable that way, but the first example's lost paragraph is not available in the excerpt.

I've looked at several pages where this behavior happens and I'm unable to determine why it is happening. HTML parsing is not my strong suit so I'm hoping someone can take a look. I really appreciate all the work on this library.

The code I'm using to scrape these pages:

$pageContent = file_get_contents($this->url);

$readabilityConfig = new Configuration([]);
$readability = new Readability($readabilityConfig);

try {
    $readability->parse($pageContent);
} catch (ParseException $e) {
    echo('parse failed');
}

"Malware automatically quarantined" [/nextcloud/apps/news/vendor/....(readability)...]

Related issue: nextcloud/news#1321

Email notification on a cPanel based host where NC is installed via Softaculous, and the News app is enabled from the main apps page inside the NC instance, from the official repository provided there.

The email alert contains this information:

"We have detected malicious PHP script(s) within your web hosting account. To prevent system abuse, our system has automatically quarantined these file(s). This concerns the following:

Generic:HTML/Seospam.B (Generic)
/home//nextcloud/apps/news/vendor/andreskrey/readability.php/test/test-pages/yahoo-4/source.html

Existence of these scripts generally points to third parties having gained access to your web hosting account either by having exploiting a vulnerability in one of the software packages you are using or by a compromised password. We strongly recommend you check your hosting account for other files that appear out of place, which our automated detection system might have missed."

System Information

PS. I have not added/activated any Yahoo news sources, hence this malware report is seemingly happening in the software "by itself"(?), not because of any user actions, so I am therefore reporting this both in the news app project and in the "readability" project, as I am unsure where this should be handled.

Using a different parser for HTML, and a few other small issues

Hi Andres, firstly, really great work on Readability.php! We're hoping to use it to replace an older PHP port we use at FiveFilters.org.

I'm not sure if you're still maintaining this and accepting pull requests, but if you are, we've written a blog post covering some issues that we encountered. The blog post is here: https://www.fivefilters.org/2021/readability/. The changes we've made so far are here: https://github.com/fivefilters/readability.php.

Happy to submit any of these changes as pull requests here if you'd like to integrate them into this repository.

PHP Warning: Division by zero (because $topCandidate->contentScore is zero)

Hello!

Here is my sample code:

$url = 'https://france.googleblog.com/2018/05/google-celebre-loeuvre-du-realisateur.html';
$html = file_get_contents($url);
$config = new Configuration;
$config->setSummonCthulhu(true);
$readability = new Readability($config);
$readability->parse($html);

As a result, I get several PHP warnings:

PHP Warning: Division by zero in vendor/andreskrey/readability.php/src/Readability.php on line 986

Same issue with those URLs:

In some cases, disabling SummonCthulhu solves the problem.

Thanks for your help!

Infinite loop on search of parentTopCandidate

Hi!

I have another bug with parentTopCandidate's code when I try to parse:
http://www.motomag.com/Honda-Goldwing-2018-Tambour-battant-ca-donne-quoi.html

Trying to get property of non-object in /vendor/andreskrey/readability.php/src/Readability.php on line 955
Trying to get property of non-object in /vendor/andreskrey/readability.php/src/Readability.php on line 964

I tried to fix that by checking whether parentNode is null each time ->parentNode is accessed, but it seems more complicated than that, because parentOfTopCandidate is also needed for "DOMElement $sibling", which then fails with: Call to a member function getChildren() on null.

I hope this helps.

When the IMG tag meets data-src, the SRC is lost, causing the image to not display

The test link is https://mp.weixin.qq.com/s/Y1QrPBn4W0RQDjf2YcZbPg. The image is not displayed because the tag has only data-src and no src.

<img data-ratio="1.6866667" data-src="https://mmbiz.qpic.cn/mmbiz_gif/ia1Z7HH4plnAspjkKfE7nb4bkEObaj69CzIEKJUkmevJFgZEgpVOSoUnjic59I5ZNdibeia6pnneT0tViblDtZAYLog/640?wx_fmt=gif" data-type="gif" data-w="300"></img>

code:

require 'vendor/autoload.php';

use andreskrey\Readability\Readability;
use andreskrey\Readability\Configuration;

ini_set('user_agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)');
$readability = new Readability(new Configuration());

$html = file_get_contents('https://mp.weixin.qq.com/s/Y1QrPBn4W0RQDjf2YcZbPg');

try {
    $readability->parse($html);
    //$data = $readability->getImages();

    //var_dump($data);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}
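
A possible workaround on the caller's side is to copy data-src into src before parsing, so that lazily loaded images carry a normal source. This is a sketch under that assumption, not a library feature; `promoteLazyImageSources()` is a hypothetical helper using PHP's DOM extension.

```php
<?php
/**
 * Hypothetical helper: for every <img> that has data-src but no src,
 * copy the data-src value into src and return the modified document.
 */
function promoteLazyImageSources(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        if (!$img->hasAttribute('src') && $img->hasAttribute('data-src')) {
            $img->setAttribute('src', $img->getAttribute('data-src'));
        }
    }
    return $doc->saveHTML();
}
```

The returned string would be passed to parse() in place of the raw page.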

Images problems

The library doesn't extract images from some sites.

$readability = new \andreskrey\Readability\Readability((new \andreskrey\Readability\Configuration())
    ->setFixRelativeURLs(true)
    ->setOriginalURL('https://stayglam.com/life/sexy-tattoos/'));

$html = file_get_contents('https://stayglam.com/life/sexy-tattoos/');

try {
    $readability->parse($html);
    //$data = $readability->getImages();
    //var_dump($data);
    echo $readability;
} catch (\andreskrey\Readability\ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}

Improve regexp unlikelyCandidates

Hello Andres!

In some cases, we need to change the unlikelyCandidates regexp in order to remove bad content.
Example:
http://mashable.com/2017/11/03/xbox-one-x-review/
The class "shopping-disclaimer" needs to be removed.

For this case, I'm thinking of making a PR that adds the word "disclaimer" to the regexp.

But what do you think about a configuration parameter for changing the unlikelyCandidates regexp, to cover other, trickier cases?

I also need some logging to find out why some nodes are not removed from the content.
So I think it would be nice to have a logger like Monolog, or a simple debug mode.
I would like to have your opinion about that too.

Thanks!

PS: You reached an awesome milestone with this v1!
Congrats! 👍

Question: can the library tell me if the website is a good candidate for extraction?

Howdy,

I've been testing both your and the original Node libraries, and I've noticed that for some websites (eg workplace.stackexchange.com) the results are incorrect.

https://workplace.stackexchange.com/questions/102524/one-of-my-subordinates-child-passed-away-how-can-i-inform-my-team => skips the original post
https://workplace.stackexchange.com/questions/102692/how-to-deal-with-flaws-in-tests-of-potential-employers => skips everything

Is there a way to ask the library "hey, what do you think how did you do regarding extracting the content for this URL?" or, more plainly, "how confident are you that the content you extracted is relevant?" ? Something like an overall score?

That way, I could fall back to my internal text extraction algorithm.

Type error: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in vendor/andreskrey/readability.php/src/Readability.php (line 1274)

Here is one reference html for this error: https://gist.github.com/Slityak/070b8e8e44bf87981892d29ff4db1a04

I'm not sure, but I think the error occurs when a <td> has only one child (firstChild) that itself has no child elements, only plain text, as in this part:

<table width="100%" border="0" cellspacing="0" cellpadding="5"> 
  <tbody>
    <tr> 
      <td> 
        <div align="center">
          Summer Hockey Clinic Registration Form 
        </div>
      </td> 
    </tr> 
  </tbody>
</table>
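
The stack traces in these reports show iterator_to_array() receiving NULL when the children of a text node are requested. A defensive way to list children, sketched on plain DOM nodes with a hypothetical helper name (this is not the library's getChildren()):

```php
<?php
/**
 * Hypothetical helper: return a node's children as an array, treating a
 * missing child list (as reported for text nodes) as "no children".
 */
function childrenOf(DOMNode $node): array
{
    $children = $node->childNodes;
    return $children === null ? [] : iterator_to_array($children);
}

$doc = new DOMDocument();
$doc->loadHTML('<table><tbody><tr><td><div align="center">Summer Hockey Clinic Registration Form</div></td></tr></tbody></table>');
$div = $doc->getElementsByTagName('div')->item(0);
$textNode = $div->firstChild;       // the plain text inside the <div>
$none = childrenOf($textNode);      // empty array instead of a TypeError
```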
