andreskrey / readability.php
PHP port of Mozilla's Readability.js
License: Apache License 2.0
Hi! Firstly, thanks so much for your amazing work!
I want to remove certain divs based on CSS classes or ids, for example all div.container-to-remove elements.
I think my question is really a feature request.
Do you think that is possible? If so, can you tell me where I should look to add that feature? (I know I would need to read everything, but maybe you know where specifically.)
Thanks in advance!
Example HTML:
<p class="content">Main content</p>
<div>Share the word on social media!</div>
Passing the p.content CSS selector to Readability's constructor should return Main content.
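Until an option like this exists, one possible workaround is to filter the HTML before handing it to the parser. The sketch below uses only PHP's DOM extension; removeElementsByClass() is a hypothetical helper, not part of readability.php, and it assumes $class is a plain class name without quotes.

```php
<?php
/**
 * Hypothetical helper (not part of readability.php): removes every element
 * whose class attribute contains $class as a whole word.
 */
function removeElementsByClass(string $html, string $class): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; keep libxml warnings out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Common XPath 1.0 idiom for matching one class among several.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    // Snapshot the result first, then remove each matched node from its parent.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $doc->saveHTML();
}
```

The cleaned string returned by the helper can then be passed to parse() in place of the raw page source.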
There is a problem extracting images when
<meta property="og:image:width" content="373"/>
or
<meta property="og:image:height" content="280"/>
are present in the document code.
When I get the main image, I receive the value 280 instead of the file path.
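The symptom would be consistent with og:image:height being matched as if it were og:image. As an illustration only (findOgImage() is a hypothetical helper and says nothing about how the library reads metadata), an exact attribute comparison avoids that kind of mix-up:

```php
<?php
/**
 * Hypothetical helper: returns the content of <meta property="og:image">,
 * ignoring og:image:width, og:image:height and similar sub-properties.
 */
function findOgImage(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Exact comparison, so "og:image:height" cannot match.
    $nodes = $xpath->query('//meta[@property="og:image"]/@content');

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim((string) $nodes->item(0)->nodeValue);
}
```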
Occasionally I get the following Error:
Trying to get property 'nodeName' of non-object in:
readability.php/src/Readability.php on line 1079
while ($parentOfTopCandidate->nodeName !== 'body') { ...
I am migrating from 0.3 to 1.2
My old wrapper:
<?php
/**
 * Created by PhpStorm.
 * User: ninoskopac
 * Date: 22/02/2018
 * Time: 22:44
 */
declare(strict_types=1);

namespace Read2Me\Content;

use andreskrey\Readability\HTMLParser;
use Goutte\Client;

class Readability
{
    /** @var HTMLParser $readability */
    private $readability;

    /**
     * @var Client
     */
    private $goutte;

    private $response;

    public function __construct(Client $goutte)
    {
        $this->goutte = $goutte;
        $this->readability = new HTMLParser();
        $this->response = $this->readability->parse($this->goutte->getCrawler()->html());
    }

    public function getContent(): string
    {
        /** @var \DOMDocument $domDocument */
        $domDocument = $this->response['article'];
        return $domDocument->textContent ?? '';
    }
}
My new wrapper:
<?php
/**
 * Created by PhpStorm.
 * User: ninoskopac
 * Date: 22/02/2018
 * Time: 22:44
 */
declare(strict_types=1);

namespace Read2Me\Content;

use andreskrey\Readability\ParseException;
use andreskrey\Readability\Readability as ReadabilityReadability;
use andreskrey\Readability\Configuration as ReadabilityConfiguration;
use Goutte\Client;

class Readability
{
    /** @var ReadabilityReadability $readability */
    private $readability;

    /**
     * @var Client
     */
    private $goutte;

    public function __construct(Client $goutte)
    {
        $this->goutte = $goutte;
        $this->readability = new ReadabilityReadability(new ReadabilityConfiguration());
        try {
            $this->readability->parse($this->goutte->getCrawler()->html());
        } catch (ParseException $e) {
        }
    }

    public function getContent(): string
    {
        $title = $this->readability->getTitle();
        $content = $this->readability->getContent();
        if (empty($content)) {
            return '';
        }
        return $title . "\n\n" . $content;
    }
}
So, getContent() returns HTML now, and not text?
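If getContent() does return HTML, as the question suggests, the old text-only behaviour can be approximated by flattening the markup afterwards. This is a minimal sketch with a hypothetical helper name, htmlToText(); note that strip_tags() does not insert spaces between adjacent block elements, so neighbouring paragraphs without whitespace between them would be glued together.

```php
<?php
/**
 * Hypothetical helper: reduces an HTML fragment to plain text.
 */
function htmlToText(string $html): string
{
    // Drop tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace left behind by the removed markup.
    $collapsed = preg_replace('/\s+/u', ' ', $text);

    // preg_replace() returns null on invalid UTF-8; fall back to the uncollapsed text.
    return trim($collapsed ?? $text);
}
```

In the wrapper above, this would be applied to the value returned by getContent() before concatenating it with the title.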
Hi !
I have an issue with this site : https://www.the-star.co.ke/news/2018/03/13/warning-risks-of-food-security_c1728262
The function "parse()" returns only the title!
Hi !
I'm trying your script, but when I run the test script I get this error:
Fatal error: Class 'PHPUnit\Framework\TestCase' not found in /homepages/15/d202009493/htdocs/SERP2db/reader/test/ReadabilityTest.php on line 9
What am I doing wrong? Are there any dependencies?
Thanks !
Hi,
This URL returns a Fatal Error: https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27
$readability = new Readability(new Configuration());
$html = file_get_contents('https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27');
try {
    $readability->parse($html);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}
Result is:
PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php:324
When I use ->parse(), the output contains div class='someClass'.
How can I remove the class attributes from the result?
I get the following error after running this command: composer require andreskrey/readability.php
Using version ^2.0 for andreskrey/readability.php
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.
Problem 1
- The requested package andreskrey/readability.php No version set (parsed as 1.0.0) is satisfiable by andreskrey/readability.php[No version set (parsed as 1.0.0)] but these conflict with your requirements or minimum-stability.
Installation failed, reverting ./composer.json to its original content.
Hi Andres!
I have an issue with this site:
https://www.engadget.com/2017/11/03/xbox-one-x-review/
I think we could remove all content from :
/html/body/div[1]/div/div[2]/main/aside
By definition, it should be related to the content ( https://www.w3schools.com/tags/tag_aside.asp ),
but by experience, it's more "look at our other articles!"
What do you think?
Is there any way that the script could be configured to strip all images from the parsed content?
I neither need nor want them (screen reader) and they just take up space which leads to breaking results.
I am getting an error message that is causing my stats to fail.
Severity: Warning
Message: Division by zero
Filename: controllers/Api.php
Line Number: 93
Here is the Line 93 :
[90] $weekrate = $bht - $ohc;
. if ( empty( $ohc ) ) {
. $weekrate = 'N/A';
[93] } else $weekrate = floor( $weekrate / $ohc * 100 );
I'm using readability.php in the feediron plugin for TT-RSS. TT-RSS recently switched to PHP 8, and there appear to be some deprecations to deal with in PHP 8. Issue originally reported here
plugins/af_readability/vendor/andreskrey/Readability/Readability.php:199
usort(): Returning bool from comparison function is deprecated, return an integer less than, equal to, or greater than zero
Hi, Andres!
Maybe we could introduce a new configuration option that controls whether source classes are cleaned?
It's very important for me to be able to work with source classes and ids (at the moment, ids are removed only in Mozilla's Readability).
Thank you!
newspaper.py handles this like:
def get_meta_img_url(self, article_url, doc):
    """Returns the 'top img' as specified by the website
    """
    top_meta_image, try_one, try_two, try_three, try_four = [None] * 5
    try_one = self.get_meta_content(doc, 'meta[property="og:image"]')
    if try_one is None:
        link_icon_kwargs = {'tag': 'link', 'attr': 'rel', 'value': 'icon'}
        elems = self.parser.getElementsByTag(doc, **link_icon_kwargs)
        try_two = elems[0].get('href') if elems else None
        if try_two is None:
            link_img_src_kwargs = \
                {'tag': 'link', 'attr': 'rel', 'value': 'img_src'}
            elems = self.parser.getElementsByTag(doc, **link_img_src_kwargs)
            try_three = elems[0].get('href') if elems else None
            if try_three is None:
                try_four = self.get_meta_content(doc, 'meta[name="og:image"]')
    top_meta_image = try_one or try_two or try_three or try_four
    if top_meta_image:
        return urllib.parse.urljoin(article_url, top_meta_image)
    return ''
This would be pretty straightforward in your PHP as well...
If I get the time, I will update this issue and do it myself, but if you have time this would be a very useful feature.
Hello!
I have the warning:
PHP Warning: Division by zero ... andreskrey/readability.php/src/HTMLParser.php on line 1093
With the site:
https://www.womenshealthmag.com/life/macys-black-friday-deals-2017
Refer to lines in method prepArticle:
$h2 = $article->getElementsByTagName('h2');
if ($h2->length === 1) {
$lengthSimilarRate = (mb_strlen($h2->item(0)->textContent) - mb_strlen($this->metadata['title'])) / mb_strlen($this->metadata['title']);
It's caused by an empty metadata['title'], but I don't know why it is empty.
I worked around it by adding !empty($this->metadata['title']) to the if statement, but I'm sure that isn't the right way.
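The guard described above can also be isolated in a small function, which makes the empty-title case explicit. This is only a sketch: lengthSimilarRate() is a hypothetical name that mirrors the quoted expression, and it assumes the mbstring extension is available, as the original code already does.

```php
<?php
/**
 * Hypothetical helper mirroring the quoted expression: relative length
 * difference between a heading and the title, or null when the title is empty.
 */
function lengthSimilarRate(string $heading, string $title): ?float
{
    $titleLength = mb_strlen($title);
    if ($titleLength === 0) {
        // Nothing to compare against; the caller can skip the check.
        return null;
    }

    return (mb_strlen($heading) - $titleLength) / $titleLength;
}
```

Returning null rather than 0 lets the caller distinguish "no title" from "identical length".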
This mostly works well, but in about 50% of the articles I get from nytimes.com I get this weird artifact:
Usually the article follows it regardless, and it works for the rest of the articles. I did notice that the JavaScript snippet that is included was the same every time, so I looked for where it came from, and I found it in the original URL:
https://www.nytimes.com/2020/12/03/us/election-officials-threats-trump.html
Since you may need a subscription to access this site, I've made a capture of a possible cause around the point where I saw snippets like this in the HTML (when I search for the start of that JS):
Regards,
Marius.
I'm trying to parse content of this article: http://www.jornalcorreiodonorte.com.br/2.1149/com-a-chegada-do-ver%C3%A3o-aten%C3%A7%C3%A3o-deve-ser-redobrada-com-os-c%C3%A3es-1.2191389
I'm using v2.1.0.
This produces a lot of PHP Notice:
PHP Notice: Trying to get property 'nodeName' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1079
PHP Notice: Trying to get property 'contentScore' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1080
PHP Notice: Trying to get property 'contentScore' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1091
PHP Notice: Trying to get property 'parentNode' of non-object in vendor/andreskrey/readability.php/src/Readability.php on line 1092
Those 4 lines are repeated millions of times (I guess we are in an infinite loop).
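The notices suggest the loop keeps running after the parent chain has ended, so nodeName is read on null forever. A null-guarded version of that kind of walk might look like the sketch below; ancestorsUntilBody() is a hypothetical function written against PHP's plain DOM classes, not the library's code.

```php
<?php
/**
 * Hypothetical helper: counts how many ancestors separate $node from <body>.
 * Stops cleanly when the chain ends before a <body> is found.
 */
function ancestorsUntilBody(DOMNode $node): int
{
    $steps = 0;
    $parent = $node->parentNode;

    // The null check comes first, so nodeName is never read on a non-object.
    while ($parent !== null && $parent->nodeName !== 'body') {
        $steps++;
        $parent = $parent->parentNode;
    }

    return $steps;
}
```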
My code is very simple:
$readability = new andreskrey\Readability\Readability(new andreskrey\Readability\Configuration());
$readability->parse(file_get_contents('http://www.jornalcorreiodonorte.com.br/2.1149/com-a-chegada-do-ver%C3%A3o-aten%C3%A7%C3%A3o-deve-ser-redobrada-com-os-c%C3%A3es-1.2191389'));
PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in /tt-rss/vendor//andreskrey/Readability/Nodes/NodeTrait.php:324.
Arch Linux, PHP 7.3.2
Stack trace:
#0 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php(324): iterator_to_array(NULL)
#1 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php(421): andreskrey\Readability\Nodes\DOM\DOMText->getChildren(true)
#2 /tt-rss/vendor/andreskrey/Readability/Readability.php(1272): andreskrey\Readability\Nodes\DOM\DOMText->hasSingleTagInsideElement('td')
#3 /tt-rss/vendor/andreskrey/Readability/Readability.php(1166): andreskrey\Readability\Readability->prepArticle(Object(andreskrey\Readability\Nodes\DOM\DOMDocument))
#4 /tt-rss/vendor/andreskrey/Readability/Readability.php(155): andreskrey\Readability\Readability->rateNodes(Array)
#5 /tt-rss/vendor/andreskrey/Readability/Nodes/NodeTrait.php on line 324
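The trace shows iterator_to_array() receiving NULL for a DOMText node, which means the child list was not a Traversable at that point. A defensive version of a getChildren()-style helper could treat that as "no children". The sketch below is hypothetical (childrenOf() and its $elementsOnly flag are illustrative names, not the library's signature):

```php
<?php
/**
 * Hypothetical stand-in for a getChildren()-style helper: returns child nodes
 * as an array, treating a missing child list as "no children".
 *
 * @return DOMNode[]
 */
function childrenOf(DOMNode $node, bool $elementsOnly = false): array
{
    $list = $node->childNodes;
    if (!$list instanceof Traversable) {
        // Some PHP versions report null here for text nodes.
        return [];
    }

    $children = iterator_to_array($list);
    if ($elementsOnly) {
        $children = array_filter($children, function (DOMNode $child): bool {
            return $child->nodeType === XML_ELEMENT_NODE;
        });
    }

    return array_values($children);
}
```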
Add a getPublishedDate() method that automatically finds the published date.
TorrentFreak.com articles are sometimes polluted with <g></g> tags by the spell checker they use. Readability consistently filters out all words encapsulated by these tags, so I'd like to prevent this. How should I do this properly?
I've already tried adding |g to the okMaybeItsACandidate regexp in NodeUtility.php, and tried to add 'g', to $phrasing_elems in NodeTrait.php. Neither of these operations did the trick.
(I am using Readability as part of Tiny Tiny RSS)
I'm using Laravel 5.7.
This is my code:
use andreskrey\Readability\Readability as PhpReadability;
use andreskrey\Readability\Configuration;

class Readability extends Controller
{
    protected $readability;

    public function __construct()
    {
        $this->readability = new PhpReadability(new Configuration());
    }

    public function parse($html)
    {
        return $this->readability->parse($html); // returns a boolean instead of an HTML string or a Readability instance
    }
}
Problem: missing paragraphs from the article
Example article: https://www.fleetfeet.com/blog/shoe-review-nike-air-zoom-pegasus-36
The problem is that only one part of the article, starting with "Nike Pegasus 36 Ride and Performance", is detected as the main content. The other parts of the article are omitted. StripUnlikelyCandidates has no effect at all. Also, the image/main image is not detected; it is always null, and not only in this case: I tried 100+ articles and the image is always null.
Is it possible to detect the full article, not just part of it?
try {
    $readability = new Readability(new Configuration([
        'StripUnlikelyCandidates' => false
    ]));
    $readability->parse($this->html);
    return $readability;
} catch (\Exception $exception) {
    throw new \Exception($exception->getMessage());
}
Hello,
Great library! Congrats.
I found a small issue. For some articles, I get a JavaScript code returned in content. Example: https://www.reuters.com/article/us-usa-trump-ivanka/ivanka-trump-closes-fashion-line-to-focus-on-helping-her-father-idUSKBN1KE2JN
Returned content:
WORLD IVANKA TRUMP CLOSES FASHION LINE TO FOCUS ON HELPING HER FATHER BREAKING NEWS July 24, 2018 0 (Reuters) – U.S. President Donald Trump’s daughter Ivanka Trump on Tuesday said she was shutting her fashion line to focus on her role as an informal White House adviser, where she is working on advancing working women. Since Trump’s surprise November 2016 election, his family has faced criticism that its portfolio of real estate and consumer goods businesses, which lean heavily on the Trump name as a marker of luxury, conflict with their roles as Washington officials. Critics of the president called the shutting of the brand a victory for a boycott of Trump-tied businesses that began late in the 2016 campaign. “After 17 months in Washington, I do not know when or if I will ever return to the business, but I do know that my focus for the foreseeable future will be the work I am doing here in Washington,” Ivanka Trump, 36, said in a statement on Tuesday. Amid criticism of potential conflicts of interest, Ivanka Trump in 2017 gave up day-to-day management of her clothing company and put its assets in a trust managed by family members. Her company said licensing contracts would not be renewed and those in place will be allowed to run their course. Mid-priced women’s clothing, shoes and accessories were sold under the label. The 18 people who work for the 11-year-old company will be laid off as it shuts down. Since Trump’s election, retailers including Nordstrom Inc , Hudson’s Bay Co and Sears Holdings Corp have dropped or sharply scaled back their assortment of Trump-branded products, though they typically attributed those decisions to poor sales rather than political messages. Read: US vehicle tariffs considered as heart of attention as Trump meets EU's Juncker Ivanka Trump’s brand said in a statement that retailers including Bloomingdale’s, owned by Macy’s Inc , Dillard’s Inc and Amazon.com Inc , continued to carry her wares. 
According to the Wall Street Journal, online sales of Ivanka Trump’s brand fell nearly 55 percent in the 12 months to June, compared with the year-earlier period, citing Rakuten Intelligence, which gathers email receipts from 5.5 million U.S. consumers. Trump’s combative style on the campaign trail and as president have drawn the family’s brands into political fights, with some supporters hosting events at the luxury Trump International Hotel blocks from the White House even as opponents stage boycotts. The president has been a loud advocate for domestic manufacturing, prompting criticism that much of the Ivanka Trump line was made overseas, as is the vast majority of clothing and footwear sold in the United States. According to media reports, much of her product line has been sourced from China, which is the target of tariffs imposed by the president in a trade conflict. The most organised boycott of Trump-related businesses, “Grab Your Wallet,” called the news about Ivanka’s business a victory. “This is the biggest possible win for Grab Your Wallet,” the group’s co-founder, Shannon Coulter, a San Francisco marketing executive, said in a phone interview. (Reporting by Scott Malone in Boston, Barbara Goldberg and Diana Kruzman in New York and Nivedita Balu in Bengaluru; editing by Richard Chang and Cynthia Osterman) This story has not been edited by Firstpost staff and is generated by auto-feed. 
Updated Date: Jul 25, 2018 04:05 AM Read: Secret tape may not add to legal jeopardy for Trump or Cohen Also Watch Social Media Star: Abhishek Bachchan, Varun Grover reveal how they handle selfies, trolls and broccoli Monday, July 16, 2018 It’s a Wrap: Soorma star Diljit Dosanjh and Hockey legend Sandeep Singh in conversation with Parul Sharma Monday, July 16, 2018 Watch: Dalit man in Uttar Pradesh defies decades of prejudice by taking out baraat in Thakur-dominated Nizampur village Monday, July 16, 2018 India’s water crisis: After govt apathy, Odisha farmer carves out 3-km canal from hills to tackle scarcity in village Sunday, July 15, 2018 Maurizio Sarri, named as new Chelsea manager, is owner Roman Abramovich’s latest gamble in quest for ‘perfect football’ ‘; console.log(titleHeading);
As you can see, the end contains JavaScript code.
Thank you for looking into this!
Regards.
If I try to install this with:
composer require andreskrey/readability.php
And run
use andreskrey\Readability\HTMLParser;
use andreskrey\Readability\Configuration;
$readability = new Readability(new Configuration());
I get:
PHP Fatal error: Class 'Readability' not found
My composer.json looks like this:
{
"require": {
"andreskrey/readability.php": "^0.3.1"
}
}
It would be useful to configure a user-agent when fetching pages. Is this possible (or a feature you'd be willing to consider)? I can't find anything about it in the docs.
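The snippets elsewhere on this page fetch pages with file_get_contents() and pass the resulting string to parse(), so one way to control the user-agent is on the fetching side. A minimal sketch, assuming PHP's built-in HTTP stream wrapper is used; userAgentContext() is a hypothetical helper and the URL in the usage comment is a placeholder:

```php
<?php
/**
 * Hypothetical helper: builds a stream context that sends a custom User-Agent.
 *
 * @return resource
 */
function userAgentContext(string $userAgent, int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds,
        ],
    ]);
}

// Usage (network call, placeholder URL):
// $html = file_get_contents('https://example.com/article', false, userAgentContext('MyReader/1.0'));
```

The 'http' options also apply to https:// URLs, and the fetched string can then be handed to parse() as usual.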
Hi! Is there any solution to extract content from official source (Example : http://empres-i.fao.org/empres-i/2/obd?idOutbreak=229632&rss=t)
http://www.mafengwo.cn/i/9359689.html
I get the content, but no images. Please help.
https://github.com/andreskrey/readability.php/blob/master/src/Readability.php#L198
// No luck after removing flags, just return the longest text we found during the different loops
usort($this->attempts, function ($a, $b) {
return $a['textLength'] < $b['textLength']; // L198
});
PHP 7.4 is complaining:
usort(): Returning bool from comparison function is deprecated, return an integer less than, equal to, or greater than zero
Suggested fix:
- return $a['textLength'] < $b['textLength'];
+ return $b['textLength'] - $a['textLength'];
(NB: < is replaced with -, and the operands are swapped.)
Testing:
<?php
function f1(&$x){ // original
    usort($x, function ($a, $b) {
        return $a['textLength'] < $b['textLength'];
    });
}
function f2(&$x){ // proposed change
    usort($x, function ($a, $b) {
        return $b['textLength'] - $a['textLength'];
    });
}
$a = [
    ['textLength'=> 5, 'b'=>'1'],
    ['textLength'=> 7, 'b'=>'2'],
    ['textLength'=> 3, 'b'=>'3'],
    ['textLength'=> 1, 'b'=>'4'],
];
$b = $a;
f1($b);
print_r($b);
$b = $a;
f2($b);
print_r($b);
Output:
$ php -e x.php
Array
(
[0] => Array
(
[textLength] => 7
[b] => 2
)
[1] => Array
(
[textLength] => 5
[b] => 1
)
[2] => Array
(
[textLength] => 3
[b] => 3
)
[3] => Array
(
[textLength] => 1
[b] => 4
)
)
Array
(
[0] => Array
(
[textLength] => 7
[b] => 2
)
[1] => Array
(
[textLength] => 5
[b] => 1
)
[2] => Array
(
[textLength] => 3
[b] => 3
)
[3] => Array
(
[textLength] => 1
[b] => 4
)
)
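As an alternative to subtraction, the spaceship operator (available since PHP 7) always yields -1, 0 or 1, which also avoids surprises if the compared values were ever non-integers. A small sketch using the same test data; the variable names are illustrative:

```php
<?php
$attempts = [
    ['textLength' => 5, 'b' => '1'],
    ['textLength' => 7, 'b' => '2'],
    ['textLength' => 3, 'b' => '3'],
    ['textLength' => 1, 'b' => '4'],
];

// Operands are swapped ($b before $a) to sort longest first.
usort($attempts, function (array $a, array $b): int {
    return $b['textLength'] <=> $a['textLength'];
});

$lengths = array_column($attempts, 'textLength');
```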
Hello,
First of all, congrats for the code!
On some content, HTML markup leaks into the extracted text. Example for this article:
https://www.yahoo.com/entertainment/chris-evans-ignites-celeb-civil-085545038.html
Snippet of the extracted content:
Chris Evans causing people to choose sides in a way that hasn’t been seen since “Captain America: Civil War.”” data-reactid=”16″ type=”text”>There’s an issue that’s divided Twitter almost as much as anything in politics this week, with Chris Evans causing people to choose sides in a way that hasn’t been seen since “Captain America: Civil War.”
As you can see, this fragment leaks into the content:
”” data-reactid=”16″ type=”text”>
This is my current code:
$readConf = new Configuration();
$readConf->setSummonCthulhu(true);
$readability = new Readability($readConf);
$readability->parse($html_string);
$return_me = $readability->getContent();
Any help is appreciated.
Regards,
Szabi.
Hi! Great work! However, the data-src attribute is missing from this set of attributes:
readability.php/src/Readability.php
Line 1521 in 5bd54a8
Some articles have their first paragraph omitted, regardless of the configuration values of CleanConditionally and StripUnlikelyCandidates. Examples below:
The paragraph beginning with "Between offering on-demand rides..." at the beginning of the article is completely absent from cleaned content.
https://www.cnn.com/2019/04/12/us/andrew-chael-katie-bouman-black-hole-image-trnd/index.html
The short paragraph beginning with "When internet trolls tried..." is omitted from the article content. This section gets assigned to the article excerpt and is recoverable that way, but the first example's lost paragraph is not available in the excerpt.
I've looked at several pages where this behavior happens and I'm unable to determine why it is happening. HTML parsing is not my strong suit so I'm hoping someone can take a look. I really appreciate all the work on this library.
The code I'm using to scrape these pages:
$pageContent = file_get_contents($this->url);
$readabilityConfig = new Configuration([
]);
$readability = new Readability($readabilityConfig);
try {
    $readability->parse($pageContent);
} catch (ParseException $e) {
    echo('parse failed');
}
Hello and thanks for your package
Is there a way to get the page title instead of og:title?
Related issue: nextcloud/news#1321
Email notification on a cPanel based host where NC is installed via Softaculous, and the News app is enabled from the main apps page inside the NC instance, from the official repository provided there.
The email alert contains this information:
"We have detected malicious PHP script(s) within your web hosting account. To prevent system abuse, our system has automatically quarantined these file(s). This concerns the following:
Generic:HTML/Seospam.B (Generic)
/home//nextcloud/apps/news/vendor/andreskrey/readability.php/test/test-pages/yahoo-4/source.html
Existence of these scripts generally points to third parties having gained access to your web hosting account either by having exploiting a vulnerability in one of the software packages you are using or by a compromised password. We strongly recommend you check your hosting account for other files that appear out of place, which our automated detection system might have missed."
PS. I have not added/activated any Yahoo news sources, hence this malware report is seemingly happening in the software "by itself"(?), not because of any user actions, so I am therefore reporting this both in the news app project and in the "readability" project, as I am unsure where this should be handled.
Hi Andres, firstly, really great work on Readability.php! We're hoping to use it to replace an older PHP port we use at FiveFilters.org.
I'm not sure if you're still maintaining this and accepting pull requests, but if you are, we've written a blog post covering some issues that we encountered. The blog post is here: https://www.fivefilters.org/2021/readability/. The changes we've made so far are here: https://github.com/fivefilters/readability.php.
Happy to submit any of these changes as pull requests here if you'd like to integrate them into this repository.
Hello!
Here is my sample code:
$url = 'https://france.googleblog.com/2018/05/google-celebre-loeuvre-du-realisateur.html';
$html = file_get_contents($url);
$config = new Configuration;
$config->setSummonCthulhu(true);
$readability = new Readability($config);
$readability->parse($html);
As result, several PHP warnings:
PHP Warning: Division by zero in vendor/andreskrey/readability.php/src/Readability.php on line 986
Same issue with those URLs:
In some cases, disabling SummonCthulhu solves the problem.
Thanks for your help!
Hi!
I have another bug with the parentOfTopCandidate code when I try to parse:
http://www.motomag.com/Honda-Goldwing-2018-Tambour-battant-ca-donne-quoi.html
Trying to get property of non-object in /vendor/andreskrey/readability.php/src/Readability.php on line 955
Trying to get property of non-object in /vendor/andreskrey/readability.php/src/Readability.php on line 964
I tried to fix it by checking whether parentNode is null each time ->parentNode is called, but it seems more complicated than that, because parentOfTopCandidate is needed for "DOMElement $sibling", which then fails with: Call to a member function getChildren() on null.
I hope this helps.
The test link is https://mp.weixin.qq.com/s/Y1QrPBn4W0RQDjf2YcZbPg. The image is not displayed because it only has data-src, not src.
<img data-ratio="1.6866667" data-src="https://mmbiz.qpic.cn/mmbiz_gif/ia1Z7HH4plnAspjkKfE7nb4bkEObaj69CzIEKJUkmevJFgZEgpVOSoUnjic59I5ZNdibeia6pnneT0tViblDtZAYLog/640?wx_fmt=gif" data-type="gif" data-w="300"></img>
code:
require 'vendor/autoload.php';

use andreskrey\Readability\Readability;
use andreskrey\Readability\Configuration;
use andreskrey\Readability\ParseException;

ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)');
$readability = new Readability(new Configuration());
$html = file_get_contents('https://mp.weixin.qq.com/s/Y1QrPBn4W0RQDjf2YcZbPg');
try {
    $readability->parse($html);
    //$data = $readability->getImages();
    //var_dump($data);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}
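For pages that only provide data-src, one possible workaround is to copy that attribute into src before parsing. The sketch below is a hypothetical pre-processing step (promoteLazyImages() is not part of readability.php) built on PHP's DOM extension:

```php
<?php
/**
 * Hypothetical pre-processing step: copies data-src into src for <img>
 * elements that have no usable src of their own.
 */
function promoteLazyImages(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('img') as $img) {
        // getAttribute() returns an empty string when the attribute is absent.
        $lazy = $img->getAttribute('data-src');
        if ($lazy !== '' && $img->getAttribute('src') === '') {
            $img->setAttribute('src', $lazy);
        }
    }

    return $doc->saveHTML();
}
```

In the snippet above, this would be called on $html before $readability->parse($html).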
Images are not extracted from some sites:
$readability = new \andreskrey\Readability\Readability((new \andreskrey\Readability\Configuration())
    ->setFixRelativeURLs(true)
    ->setOriginalURL('https://stayglam.com/life/sexy-tattoos/'));
$html = file_get_contents('https://stayglam.com/life/sexy-tattoos/');
try {
    $readability->parse($html);
    //$data = $readability->getImages();
    //var_dump($data);
    echo $readability;
} catch (\andreskrey\Readability\ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}
Hello Andres!
In some cases, we need to change the unlikelyCandidates regexp in order to remove the bad content.
Example:
http://mashable.com/2017/11/03/xbox-one-x-review/
The class "shopping-disclaimer" needs to be removed.
In this case, I am thinking of making a PR to add the word "disclaimer" to the regexp.
But what do you think about a configuration parameter to change the unlikelyCandidates regexp for other, trickier cases?
I also need to add some logging to find out why some nodes are not removed from the content.
So I think it would be nice to have a logger like Monolog, or a simple debug mode.
I would like to have your opinion about that too.
Thanks!
PS: You reached an awesome milestone with this v1!
Congrats! 👍
Howdy,
I've been testing both your and the original Node libraries, and I've noticed that for some websites (eg workplace.stackexchange.com) the results are incorrect.
https://workplace.stackexchange.com/questions/102524/one-of-my-subordinates-child-passed-away-how-can-i-inform-my-team => skips the original post
https://workplace.stackexchange.com/questions/102692/how-to-deal-with-flaws-in-tests-of-potential-employers => skips everything
Is there a way to ask the library "hey, what do you think how did you do regarding extracting the content for this URL?" or, more plainly, "how confident are you that the content you extracted is relevant?" ? Something like an overall score?
That way, I could fall back to my internal text extraction algorithm.
Readability isn't setting relative URLs for the image source/src.
Here is one reference html for this error: https://gist.github.com/Slityak/070b8e8e44bf87981892d29ff4db1a04
Not sure, but I think the error occurs when a <td> has only one child (firstChild) that has no children of its own, just simple text, like this part:
<table width="100%" border="0" cellspacing="0" cellpadding="5">
<tbody>
<tr>
<td>
<div align="center">
Summer Hockey Clinic Registration Form
</div>
</td>
</tr>
</tbody>
</table>