berkmancenter / amber_wordpress
Amber plugin for WordPress
Home Page: http://amberlink.org
License: GNU General Public License v3.0
Support nginx as a standalone webserver, in addition to Apache.
The current installation procedure depends on creating rewrite rules that reportedly work differently in Apache and nginx. This requires testing the plugin with nginx and identifying any necessary code, installation, or documentation changes.
Amber injects its link attributes into RSS feeds. This isn't ideal, since RSS readers can't use this information. It would be better to disable this attribute injection for RSS feeds.
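One low-risk way to do this is sketched below using WordPress's `is_feed()` conditional. The callback name `amber_filter_content` is a placeholder; the plugin's actual content-filter callback may be named differently.

```php
<?php
// Sketch only: skip Amber's attribute injection when rendering a feed.
// The function name amber_filter_content is a placeholder, not the plugin's
// real callback name.
function amber_filter_content( $content ) {
    if ( is_feed() ) {
        // Feed readers can't use Amber's data-* attributes, so leave feeds untouched.
        return $content;
    }
    // ... the existing attribute-injection logic would run here ...
    return $content;
}
```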
Only make the call if geo-specific features are enabled. Also set up the call so that execution is deferred until after the page loads, so it doesn't impact load time.
Upon trying to activate Amber on WordPress, I get: Parse error: syntax error, unexpected T_STRING in /home2/bcomeara/public_html/phylometh.org/wp-content/plugins/amberlink/libraries/backends/aws/AmazonS3Storage.php on line 15
I am using version 1.4.1 of Amber on WordPress 4.4.2. All plugins and themes are at their most recent versions; the active plugins are Category Specific RSS Menu, Google Analytics Dashboard for WP, and iframe. I just used the Add Plugin option within WordPress. When activation failed, I deleted the plugin and reinstalled it, and I'm getting the same error. I don't have (or plan to use) S3 storage.
Okay, this is extremely similar to #44, so please read that first.
In this case the answer is even simpler, though: just stop trying to fetch Facebook URLs at all.
They don't work and they will never work as long as we obey Facebook's robots.txt
which has the following:
# Notice: Crawling Facebook is prohibited unless you have express written
# permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php
[..]
User-agent: *
Disallow: /
This seems pretty unambiguous to me, and because our site does a lot of linking to Facebook posts, there are thousands of failed attempts to snapshot their URLs clogging up our queue.
Can we just reject Facebook URLs out of hand and have them skip the queue?
As with Twitter, it would be good to have a WP filter available for sites to quickly add domains that should be ignored completely by Amber.
I feel Facebook should be included in such a list for all plugin users, but the ability to filter them out for our site in particular is the most vital need.
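For illustration, here is what such a filter could look like from a site's perspective. The hook name `amber_excluded_domains` is hypothetical; no such filter exists in Amber today.

```php
<?php
// Sketch only: 'amber_excluded_domains' is a hypothetical filter name, not an
// existing Amber hook. If the plugin consulted such a list before queueing a
// URL, site code could exclude Facebook in a few lines:
add_filter( 'amber_excluded_domains', function ( $domains ) {
    $domains[] = 'facebook.com';
    $domains[] = 'www.facebook.com';
    return $domains;
} );
```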
Thanks for your attention and help.
I noticed that a CSS file and a JavaScript file were added to the head of my blog. I understand they are needed if you want to show an overlay on :hover or click, but when you deselect those options these extra requests don't seem necessary.
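A possible approach, assuming Amber registers its assets through the usual WordPress enqueue API. The option key and script/style handles below are guesses, not the plugin's actual identifiers.

```php
<?php
// Sketch with assumed names: skip Amber's frontend assets when the
// hover/click overlay is disabled. 'amber_options', 'amber_popup_behavior',
// and the 'amber' handles are assumptions, not Amber's real identifiers.
add_action( 'wp_enqueue_scripts', function () {
    $options = get_option( 'amber_options', array() );
    if ( empty( $options['amber_popup_behavior'] ) ) {
        wp_dequeue_script( 'amber' );
        wp_dequeue_style( 'amber' );
    }
}, 20 ); // priority 20 so this runs after the plugin's own enqueue calls
```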
I'm still tracing the issue, but wanted to get the issue here in case it pops up for someone else.
After installing the 1.4.1 update, it appears as if my WP permalinks are broken for any post in which I ran the Amber plugin. For example:
http://jasongriffey.net/wp/2016/01/12/libraries-in-the-exponential-age/
currently results in a 404 for the page (not for the CSS/JS, which I see another issue for), while
http://jasongriffey.net/wp/2016/01/12/state-of-the-union-2016-tag-cloud/
does not. The difference is that I told the former to amber-fy the links, and in the latter I didn't. I've disabled Amber on the blog, but the 404s remain.
Hi everyone, I wonder who else is still using Amber!?
We've still got it on globalvoices.org, where our installs cumulatively have millions of URLs being tracked by Amber.
We use it with Internet Archive as our backend, which has worked well for years now.
Unfortunately as of July 10, 2020, Amber has gone haywire, and is now marking all URLs as down, and showing the popup box via JS on them regardless of their actual up/down status. Obviously this is disastrous, as all our brand new posts are having their links falsely marked as down.
I think there are two key problems here, one with the actual IA API, and one with how Amber handles API failures:
After much spelunking in the amber_wordpress code, I found the place where Amber's attempts to fetch a cache from IA are failing. In Amber->fetch_item() it uses $fetcher->fetch($item), which in turn calls InternetArchiveFetcher->fetch().
At that point it goes to the endpoint for the fetched URL:
$api_endpoint = join("", array(
    $this->archiveUrl,
    "/save/",
    $url));
$ia_result = AmberNetworkUtils::open_single_url($api_endpoint, array(), FALSE);
This gives a URL like:
http://web.archive.org/save/http://google.com
It then takes the result, $ia_result, and analyzes its "info" and "headers".
['headers']['Content-Location'] is missing every time
For me the issue comes up when it looks for $ia_result['headers']['Content-Location']:
if (!isset($ia_result['headers']['Content-Location'])) {
    throw new RuntimeException("Internet Archive response did not include archive location");
}
For me, every single URL being requested is getting this error. "Internet Archive response did not include archive location" is being saved in the message field of amber_check.
So what do we do about that problem? Seems to me that there's something wrong with the endpoint:
http://web.archive.org/save/http://google.com
When I visit it I don't get a ['Content-Location'] header, or anything that really looks like it. Maybe this changed?
Okay, so that leads me to the other part of this problem, which I think is a genuine bug in Amber that just happens to be triggered by this particular "outage" of the IA API.
An API failure overwrites the amber_check record for the URL, making it seem down when it's not
So it makes sense that when Amber->fetch_item() fails to generate a cache, it will make a note of it in the database, but what currently happens is the most extreme possible version of that, and it is so extreme that I suspect it was unintentional.
In Amber->fetch_item(), after running $fetcher->fetch() (in this case, ultimately calling InternetArchiveFetcher->fetch()) it checks for errors, and if there are any, it re-saves the amber_check db record with $status->save_check():
} catch (RuntimeException $re) {
    $update['message'] = $re->getMessage();
    $update['url'] = $item;
    $status->save_check($update);
    return false;
}
The problem with this is that earlier in the execution we already generated and saved a save_check() record based on our direct checking of the URL.
The earlier save happens in Amber->cache_link() just after running $checker->check():
if (($update = $checker->check(empty($last_check) ? array('url' => $item) : $last_check, $force)) !== false) {
    $status->save_check($update);
When a URL is working, that code will save accurate information like this to the db:
id: c7b920f57e553df2bb68272f61570210
url: http://google.com
status: 1
last_checked: 1595885887
next_check: 1596058687
message:
This is what we want, of course. If a URL is working then we want its status to be 1 so that the frontend of our sites won't throw up the "this is probably down" popup when people click. We also want to keep the last_checked and next_check values.
But an API failure deletes all that info!
The issue is that by re-saving the amber_check value later, during Amber->fetch_item(), we end up obliterating all that info and replacing it with a nearly empty record (just message and url, because that's what we saved in the code above):
id: c7b920f57e553df2bb68272f61570210
url: http://google.com
status: 0
last_checked: 0
next_check: 0
message: Internet Archive response did not include archive location
It's possible this is intentional, but if so, it seems like a really bad idea.
The only new info we really have is the message, so updating only that is what makes the most sense to me.
The following update to Amber->fetch_item() fixes the problem for me, by first loading the full array of amber_check info for the URL (which was just generated a second ago!), then updating only the message before resaving it:
} catch (RuntimeException $re) {
    // $update['message'] = $re->getMessage();
    // $update['url'] = $item;
    // $status->save_check($update);
    $new_check_record = $status->get_check($item);
    $new_check_record['message'] = $re->getMessage();
    $status->save_check($new_check_record);
If we fix that part of the code then even when the API fails to satisfy our caching code, we at least still have an accurate picture of whether the URL is up or down, so we're not ending up with JS popups when people click links that should have just worked.
There are other bugs and weird stuff I've discovered while trying to figure this out, but these two are the ones that are totally ruining my ability to keep Amber active on our site.
If anyone is still working on this plugin, please consider updating the save_check() logic for the sake of everyone.
If anyone at all knows of a change I can make to InternetArchiveFetcher->fetch() that will make the actual caching start working again, I would love to know!
Thanks to anyone who took the time to read through this.
If a link is intended to be saved in multiple storage locations (e.g. Local Storage and Internet Archive), and one of the locations returns an error during storage (e.g. "Storage size too large", "Internet Archive did not return success code"), the error message may not be associated with the correct row on the dashboard.
admin-ajax.php calls that help render this box, and if we can fix that bug, the box should have more messages in it if nothing else. It seems to apply to Nginx in particular and we are going to propose a separate patch.

When a link is down, JavaScript makes a popup show when the link is clicked. It says the link isn't working and offers "View the Snapshot".
This is great, it's the essence of Amber.
This white box is exactly what we'd expect, since most modern sites will not allow arbitrary loading of their content in an iframe.
Overall, this seems like both bad usability, since it's usually an unlabeled, useless white box, and a security issue, because we are loading an iframe with what could be a compromised site.
What is the purpose of this box and is it necessary? Could it be made to load something more useful, or to only display if there is content to show?
All links to snapshots result in a 404. I tried deactivating certain plugins to see if they conflicted, but the issue remains. I tried removing things from my .htaccess file, but no effect. Any idea what might cause it?
My blog can be found here: https://vasilis.nl/nerd/
And here’s a link to a snapshot: https://vasilis.nl/nerd/amber/cache/02e9e44319ed3fda280172c8b1f86813/
Issue: Earlier this week I updated the plugin for WordPress to 1.4.3. After activating it, my WP site didn't load properly anymore. After the header, nothing loads. I asked the ISP for help and this is the bug they reported:
FastCGI: server "/var/run/php-fpm/php56/php-cgi" stderr: PHP message: PHP Warning: include_once(): Failed opening '/sites/warekennis.nl/www/wp-content/plugins/amberlink/libraries/AmberDB.php' for inclusion (include_path='.:/usr/local/php56/share/pear') in /sites/warekennis.nl/www/wp-content/plugins/amberlink/amber.php on line 106
Copied from berkmancenter/amber_common#35
I just installed the Amber plugin for WordPress (4.4.1) and preserved links for a sample post. When I then viewed the post in my browser, I got 404 errors for Amber's JS file and CSS stylesheet. In both cases, the issue seems to be that the requested path includes /plugins/amber/ rather than /plugins/amberlink/ (e.g. GET http://favorites.aribadernatal.com/wp-content/plugins/amber/js/amber.js). As a result, clicking the link does not trigger any Amber behavior.
This will allow us to improve XSS protection when displaying snapshots.
The project I'm currently most interested in Amber for keeps most of its actually-important content in a CPT, which Amber doesn't scan (only posts and pages at the moment). Ideally, there'd be an option to select which post types should be examined, since in a site using many of them they might not all be relevant.
I had a bit of a hack and slash at this and got it working (for me), with a couple of questions to look at. Will try to send a PR in the next couple of days.
Title
getallheaders() function is not defined
Affected Users
WordPress users whose sites run on nginx + php-fpm instead of Apache.
Description
getallheaders() is only available under Apache; on nginx + php-fpm it is undefined, so calling it causes a fatal error. A common polyfill:
if (!function_exists('getallheaders')) {
    function getallheaders() {
        $headers = [];
        foreach ($_SERVER as $name => $value) {
            if (substr($name, 0, 5) == 'HTTP_') {
                $headers[str_replace(' ', '-', ucwords(strtolower(str_replace('_', ' ', substr($name, 5)))))] = $value;
            }
        }
        return $headers;
    }
}
Place used
function validate_cache_referrer() in amber.php
Related links
Reference commit
22e0a39#diff-48a2bdcac7ee505e0aabaf11f1d698d8R794
Update:
Faced on
PHP Version 5.5.9-1ubuntu4.16
nginx version: nginx/1.8.1
Hello and thank you for this wonderful system.
Our site is enormous, with lots of content each day, and Amber can't keep up with our links.
We are trying to figure out how to clean up the Amber Dashboard results as best we can to keep things moving (UPDATE: See #47 for details on how the queue woes can be fixed).
In the process I noticed that almost all our Twitter links (of which there are many, we are a site that reports on citizen media) are in a "Down" state, but some are indeed up.
What I've discovered is that many/most of the Twitter URLs in our site have an annoying tracker variable embedded in them, like this one (full data from the dashboard):
Additionally, looking directly at the database, I see this extra and excruciatingly useful info in the message column of the amber_check table:
Sidenote: This info should be surfaced in the dashboard, which just says "could not capture snapshot", which is FAR less useful than the vital information that the URL is flat out banned by robots.txt. I created a ticket for this separate issue: #46
To test this, I resaved the post with the link, but having removed ?ref_src=twsrc%5Etfw. The result was a new entry in the amber_check and amber_cache tables for the clean version of the URL, this time with UP/1 as the state.
So clearly:
- Amber fails on Twitter URLs carrying ?ref_src=twsrc%5Etfw-style query arguments
- Amber succeeds on the same URLs once ?ref_src=twsrc%5Etfw is removed
In terms of the robots.txt, this section seems to be what applies:
# Every bot that might possibly read and respect this file.
User-agent: *
[..]
Disallow: /*?
If I'm reading this correctly, it means that any Twitter URL with a ? will get blocked; this is Twitter's way of denying search engines access to equivalent/duplicate URLs, since Twitter never seems to use URL query variables to determine the content that will show.
IMHO, the most logical behavior for Amber in this context is to strip URL variables out of Twitter URLs before handling them in the first place. The URL variables are ignored by Twitter and guarantee that the Amber process will fail, while removing them before fetching/sending will usually fix the problem and result in success.
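A minimal sketch of that stripping step in plain PHP. The helper name is hypothetical; no such function exists in Amber today.

```php
<?php
// Sketch: strip query arguments from Twitter URLs before Amber processes them.
// amber_strip_twitter_query is a hypothetical helper name.
function amber_strip_twitter_query( $url ) {
    $parts = parse_url( $url );
    if ( ! isset( $parts['host'] ) ) {
        return $url;
    }
    // Only touch twitter.com URLs; query args there never affect the content shown.
    if ( ! preg_match( '/(^|\.)twitter\.com$/i', $parts['host'] ) ) {
        return $url;
    }
    // Drop everything from the first '?' onward.
    return strtok( $url, '?' );
}
```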
Alternately, Amber can also choose to discard these URLs completely and not try to fetch them. This is "sad" because it means the URLs won't get preserved, but better than spinning our wheels (my server and archive.org) for no possible utility.
I realize that these considerations may snowball into a long list of exceptions within Amber, and this is undesirable, but the popularity of Twitter and its embed system makes Twitter a very important type of URL to not have clogging up our queues. IMHO it's worth solving this directly in the plugin for the sake of all users who will likely be affected.
If this will not be fixed in the core plugin, could you advise on how to safely create this filter ourselves?
Amber should have some WordPress filters/actions that allow us to accomplish these kind of transformations ourselves, both for common problems like this Twitter one, and for more unique problems (such as a similar issue that might affect our internal intersite linking).
If this filter already exists, it would be nice to have a reference to it and an example of its use in the documentation. If that existed, I could write a patch for my needs very quickly.
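For illustration, if Amber passed each candidate URL through a filter before queueing it, site code could normalize or reject URLs in a few lines. The filter name `amber_candidate_url` below is invented, not an existing hook.

```php
<?php
// Illustration only: 'amber_candidate_url' is an invented filter name, not an
// existing Amber hook. The callback shows both transformations discussed above.
add_filter( 'amber_candidate_url', function ( $url ) {
    // Returning false would tell Amber to skip the URL entirely.
    if ( false !== strpos( $url, 'facebook.com/' ) ) {
        return false;
    }
    // Strip tracking query arguments from Twitter links before fetching.
    if ( false !== strpos( $url, 'twitter.com/' ) ) {
        return strtok( $url, '?' );
    }
    return $url;
} );
```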
Thanks for your attention and work on this plugin! Hoping for a fruitful relationship going forward :)
It's possible that the website will not have permissions to write to the "wp-content/uploads" directory, or to create that directory if it does not exist.
This should be rare, because in this scenario the user would also not be able to upload images or any other media.
Regardless, we should provide better error messages when this is the case.
WordPress 4.5 isn't out yet, but it will be soon! WP 4.5 is scheduled to be released on April 12: https://make.wordpress.org/core/2016/03/30/wordpress-4-5-field-guide/
Is Amber compatible with WP 4.5? We need to test the current version of Amber to be sure.
(Copied from berkmancenter/amber_common#33)
I use the Advanced Custom Fields plugin to create custom fields on my custom post type.
Almost all outside links I have are placed in these custom fields and retrieved in the loop. (This makes it easier for my clients to add links...)
The problem is that Amber doesn't look in the custom fields.
FYI: The links are raw urls in one field as a string and a description in another field also as a string.
When setting the backend for storing snapshots to "Local" and the update strategy to "Update snapshots periodically" Amber will overwrite the old snapshot with the new one. It would be nice to have multiple snapshots for the same link. This is useful when you are referencing the same resource (whose content has changed over time) in different posts at different times.
Something that's probably pretty easy to implement and would make Amber immediately very useful to me is if it shows where broken links live, so I can go and fix them.
So it seems the devs foresaw this problem, but put it off for the future, or until it actually happened to someone. Well, it happened to us and rewarded us for preserving so many links with an Amber Dashboard that crashes whenever we try to load it.
The feature at fault is the table of all preserved links on the dashboard. It works fine at first, but once we'd preserved about 80k URLs the WSOD started happening reliably.
Here's the code causing the problem, from amber-dashboard.php
:
/** Load the data from the database **/
$data = $this->get_report();
/** Currently handling pagination within the PHP code **/
$current_page = $this->get_pagenum();
$total_items = count($data);
$data = array_slice($data,(($current_page-1)*$per_page),$per_page);
As you can imagine from the comments, get_report()
just fetches every row in the database, no matter how many there are, then the PHP does the pagination by picking out the current page of results.
As soon as there are enough results to fill your PHP memory limit, you get a memory error which completely crashes the page:
Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 20480 bytes) in /[...]/wp-includes/wp-db.php on line 1889
This was always a recipe for disaster. Our site has over 400k URLs in it that need to be preserved, and we only got to 80k before this happened, so it wasn't even close.
On top of that, our server has an extremely generous memory limit of 256 MB, so 80k is probably a LOT more than most users would be able to tolerate before the server starts struggling when using this page.
The answer seems pretty obvious: work out some pagination in the SQL queries and use the limited results. That way you could support an effectively infinite number of URLs with no RAM constraints.
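A sketch of what SQL-side pagination could look like in amber-dashboard.php, using $wpdb. The table name ({$wpdb->prefix}amber_check) and the last_checked column are assumptions based on the schema described in these issues.

```php
<?php
// Sketch: paginate in SQL instead of loading every row into PHP.
// Table and column names are assumptions, not verified against the plugin.
global $wpdb;
$per_page     = 20;
$current_page = isset( $_GET['paged'] ) ? max( 1, (int) $_GET['paged'] ) : 1;
$offset       = ( $current_page - 1 ) * $per_page;

// Count once for the pagination UI; fetch only the current page of rows.
$total_items = (int) $wpdb->get_var( "SELECT COUNT(*) FROM {$wpdb->prefix}amber_check" );
$data        = $wpdb->get_results( $wpdb->prepare(
    "SELECT * FROM {$wpdb->prefix}amber_check ORDER BY last_checked DESC LIMIT %d OFFSET %d",
    $per_page,
    $offset
) );
```

Memory use then stays bounded by $per_page regardless of how many URLs the site tracks.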
For now our solution has just been to disable the table entirely, since it's not very useful with 80k+ rows anyway. Maybe that is something you could implement in the core plugin if you don't have time for pagination: after some set number of rows in the DB, show a message saying the table has been hidden to avoid memory errors.
Here's the specific code we commented out that got the Amber Dashboard working again for us (again in amber-dashboard.php
):
function render_dashboard_page(){
//$this->list_table = new Amber_List_Table();
//$this->list_table->prepare_items();
Then, below:
<form id="amber_dashboard-2">
<input type="hidden" name="page" value="<?php echo $_REQUEST['page'] ?>" />
<?php //$this->list_table->display() ?>
Good day,
I received the following message when I attempted to upload the .zip file of the most recent plugin update (On the main Plugins page, I clicked on the Add New button then clicked on Upload Plugin button on the next page within my blog's Dashboard using Google Chrome):
"Unpacking the package…
Installing the plugin…
The package could not be installed. No valid plugins were found.
Plugin install failed."
Any assistance would be greatly appreciated.
Amber protects against XSS attacks from malicious JavaScript served by snapshots by serving all snapshots within a sandboxed iframe.
It should not be possible for users to access cached content other than through this iframe, since they could then be at risk of an XSS attack.
This article is very useful: https://github.com/berkmancenter/amber_wordpress/wiki/Default
But without reading it, it's a lot harder for a user to figure out Amber. We found the settings page on our own, but didn't realize the dashboard was there until reading all the docs. It would be very useful to have a link from settings to dashboard and back again so users can quickly find the various tools.
Right now Amber has a very sane and safe approach to slowly working down the queue of posts -- one at a time, once every five minutes -- which is a good default, as many sites might struggle with a heavier workload, and in the long run, most sites will be "stable" with the default slow rate of checking URLs.
The problem is that for larger sites -- like ours, which has ~100k posts with a dozen or more links each -- this may never be a stable rate of URL checking. If the volume of posts per day is high, and the volume of links per post is high, then in the long run, there is a perpetually growing queue with no warning to site administrators. This ever-growing queue gets in the way of the long-term rechecking of all URLs to determine if they have since gone offline.
Admittedly, this isn't insurmountable with the current code, as the Amber Dashboard allows us to see the queue size and, if it's growing, click "Snapshot all new links" to hopefully-quickly clear out the queue in one sitting.
That said, it would be much better if the plugin allowed site administrators to control the rate of dequeuing of URLs, since in many cases the rate can be increased significantly without any performance problems, and this can permanently remove the need for administrators to worry about queue length.
The Amber WordPress plugin hooks into the WordPress cron system with an "every five minutes" schedule and executes Amber::dequeue_link() once per run via the Amber::cron_event_hook() method during the amber_cron_event_hook action.
Both of these factors (the five-minute cron schedule and the single dequeue_link() call per run) are currently hardcoded in the plugin, making it impossible to alter them directly, even with the expertise and time for plugin development.
This rigidity is unnecessary, and IMHO both of these values can easily be made filterable in ways that would add only a few lines of PHP to your plugin, and subsequently would allow users to alter the plugin behavior with only a few lines of PHP on their end.
So my proposal is that you add a single location in the Amber plugin where the cron schedule is determined, and in that location use WP's apply_filters() function so the value can be modified by other plugins (site code would then hook in with add_filter()).
Similarly, the Amber::cron_event_hook() method should be modified to have a "number of URLs to dequeue" variable, likewise passed through apply_filters(), which is used to run Amber::dequeue_link() that many times on each run.
Finally, the documentation should be updated to point out that these filters exist for advanced users, and simple code examples should be given of their usage.
This would allow major sites like ours to solve the problem for ourselves, without requiring any additional UI that might confuse users, bloat the interface or create additional dev burden.
For the sake of completeness, I'll briefly outline the reasons someone would use one or the other of these two means of increasing the amount of URLs processed.
Increase cron frequency: by filtering the cron schedule from 5 minutes (300s) down to, e.g., 60s.
Increase URLs processed per cron run: by filtering the number of times Amber::dequeue_link() gets run, e.g. setting it to run 5 times per run.
So for most sites, altering the cron schedule to run Amber::cron_event_hook() more often will be the best solution. It will work well for large sites with a large corpus of links, which will almost always also get regular traffic (if only from search engines updating their caches of said corpus) to match their large URL queue in Amber.
If you were going to add a UI setting inside Amber to control the rate of URL dequeuing, I would definitely make it control the cron schedule, and leave the number of URLs dequeued per run to a filter as described above.
A pulldown menu with [Every 10 minutes|Every 5 minutes|Every 1 minute]
in Amber Settings > Storage Settings would be extremely useful for us and probably many other sites.
Like I said though, a filter and a little documentation would work just as well for power users, who are most likely to need this option.
Thanks for your attention and for considering this request.
Found this while writing up #44, but it's its own issue IMHO.
When a post fails, the Notes column on the dashboard always seems to say "Could not capture snapshot", even if the message field in the amber_check table of the database has a much more useful message.
What's up with that? Why wouldn't I want to see the extra info?
In my case, both Twitter (#44) and Facebook (#45) URLs are regularly broken because Amber is obeying the robots.txt files of those domains. Separate from that being annoying is that I had to look in the database directly to find that Amber knew exactly why they were broken (message: "Blocked by robots.txt"), but the Amber Dashboard didn't share that info with me.
This is excruciatingly useful info for someone debugging a Down state in the Dashboard, and it's actually useful to the user, unlike "Could not capture snapshot", which it turns out is the exact same information as the Down/Up info in the Status column.
So to me it seems obvious: the Dashboard should list the message value from the database if there is one, and only resort to the redundant "Could not capture snapshot" if for some reason the URL is Down but message has no info.
Thanks for your attention and help.
The Amber Dashboard link and URL are accessible from within the WordPress admin panel, but visiting the Amber Dashboard only shows a blank page.