gaffling / php-grab-favicon
Saves the favicon of the given URL and returns the image path.
Home Page: http://suchmaschine.biz
License: MIT License
Status:
June 23rd 2023
Haven't been able to do much work this week due to some unexpected household emergencies, should be back at it next week.
202306161401:
--showhttpwarnings
--showhttperrors. (With the options enabled, they will output as TYPE_WARNING and TYPE_ERROR.)
TYPE_OBJECTS, TYPE_TIMERS (full debug logging is now 1023)
202306121848:
curl, exif, get, put, mbstring, fileinfo, mimetype, gd, imagemagick, gmagick, hrtime. If an extension is listed as true in this section but is not loaded or available, it will change to false. (Please note, GD, ImageMagick and gmagick are not currently used at all.)
--sites as an alternate to --list
Some notes on this:
Having our own image identification is important should the PHP installation be limited (for whatever reason) and going by file extension is still the last resort.
The method used for this is looking for the "signature" of the image file. Most image formats have a header with signature data to be used by software trying to open it (this is also called a "magic number".) The new code knows PNG, GIF, JPEG, WEBP, BMP and ICO formats.
Some image formats are easier to identify than others. For example, the PNG format's "magic" is \x89PNG\r\n\x1A\n, which is pretty distinctive. BMP and ICO have very simple identifiers, so false positives are much more likely, which is why I've been adding a "certainty" rating. Eventually you'll be able to set a minimum acceptable "certainty" and reject possibly invalid files. (You can currently set it but nothing looks at it.)
Here's some sample trace logging showing this in action:
2023-06-12 18:47:21 [TRACE] [grap_favicon(20):listIcons:getMIMETypeFromFile] pathname='icons/whatsapp.png', content_type=image/png, confidence=certain, method=signature
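As a rough illustration of the signature ("magic number") check described above, here is a minimal sketch. It is not the actual getMIMETypeFromBinary from get-fav.php: the function name sniffImageSignature, the return shape, and every certainty label other than "certain" are assumptions made for this example.

```php
<?php
// Hypothetical helper: identify an image from its leading bytes.
// Returns array(mime type, certainty) or null when nothing matches.
function sniffImageSignature(string $bytes): ?array
{
    // PNG has a long, distinctive 8-byte signature.
    if (strncmp($bytes, "\x89PNG\r\n\x1A\n", 8) === 0) {
        return array('image/png', 'certain');
    }
    // GIF starts with one of two ASCII version strings.
    if (strncmp($bytes, "GIF87a", 6) === 0 || strncmp($bytes, "GIF89a", 6) === 0) {
        return array('image/gif', 'certain');
    }
    // JPEG starts with the SOI marker followed by another marker byte.
    if (strncmp($bytes, "\xFF\xD8\xFF", 3) === 0) {
        return array('image/jpeg', 'high');
    }
    // WEBP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if (strlen($bytes) >= 12 && substr($bytes, 0, 4) === 'RIFF' && substr($bytes, 8, 4) === 'WEBP') {
        return array('image/webp', 'certain');
    }
    // ICO: reserved word 0, then type 1 (little-endian). Only 4 bytes, so weak.
    if (strncmp($bytes, "\x00\x00\x01\x00", 4) === 0) {
        return array('image/vnd.microsoft.icon', 'low');
    }
    // BMP: just the two letters "BM", so false positives are likely.
    if (strncmp($bytes, "BM", 2) === 0) {
        return array('image/bmp', 'low');
    }
    return null;
}
```

A caller could compare the returned certainty against a minimum acceptable level and reject the file when it falls short, which is the direction the notes above describe.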
Ideally, if everything is available to get-fav.php the following methods are used, in order:
FileInfo
mime_content_type (local files only)
exif_imagetype (and image_type_to_mime_type if available)
getMIMETypeFromBinary (the new fallback function using "magic")
202306071311:
--checklocal / --nochecklocal, --storeifnew (requires --checklocal and --store) (Not implemented yet.)
--showconfig / --noshowconfig to show running configuration options
--showconfigonly (implies --showconfig), shows running configuration and exits.
--silent (console mode only) (turns off the console completely)
ENABLE_SAME_FOLDER_INI and ENABLE_SAME_FOLDER_API_INI. They default to false. If they are set to true and get-fav.ini and get-fav-api.ini, respectively, are in the same folder as get-fav.php, they will be read and used automatically. --configfile and --apiconfigfile, if specified, will be applied after.
It will likely be a few days before I do another git push as the next one is a big one:
storeifnew is enabled it will be replaced)
202306062230:
file_get_contents is not available, check if PHP.INI: allow_url_fopen is disabled, if so show an error message.
202306042323:
--apiconfigfile=PATHNAME to load API Definitions
202306021445:
202306011529:
--allowoctetstream / --disallowoctetstream, the default is false because if the more accurate content-type detection is not available most will return application/octet-stream. I may make the default true if mime_content_type and/or finfo_open are available. (.ini file is [global] allow_octet_stream=boolean)
202305312016:
202305281757:
202305251719:
202305242106:
get-fav-api.ini)
202305241803:
202305241420:
202305221634:
parse_ini_file with INI_SCANNER_RAW, does array_replace_recursive with the existing configuration structure and finally validates boolean/numeric (with range checks).)
202305231619:
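The INI handling mentioned in the changelog above (parse with INI_SCANNER_RAW, merge with array_replace_recursive, then validate booleans and numbers with range checks) could look roughly like the sketch below. It uses parse_ini_string instead of parse_ini_file so the example is self-contained. The function name applyIniOverrides, the [curl] timeout key, and the 1 to 600 range are assumptions for illustration; only [global] allow_octet_stream comes from the notes.

```php
<?php
// Hypothetical sketch: merge INI text over an existing configuration array.
function applyIniOverrides(array $config, string $iniText): array
{
    // INI_SCANNER_RAW keeps every value as the literal string from the file.
    $parsed = parse_ini_string($iniText, true, INI_SCANNER_RAW);
    if ($parsed === false) {
        return $config; // unreadable INI: keep the existing configuration
    }

    // INI values replace matching keys; everything else stays as it was.
    $merged = array_replace_recursive($config, $parsed);

    // Boolean validation: anything that is not a recognizable boolean falls back.
    $bool = filter_var(
        $merged['global']['allow_octet_stream'],
        FILTER_VALIDATE_BOOLEAN,
        FILTER_NULL_ON_FAILURE
    );
    $merged['global']['allow_octet_stream'] = $bool ?? $config['global']['allow_octet_stream'];

    // Numeric validation with a range check (range is an assumption).
    $timeout = filter_var(
        $merged['curl']['timeout'],
        FILTER_VALIDATE_INT,
        array('options' => array('min_range' => 1, 'max_range' => 600))
    );
    $merged['curl']['timeout'] = ($timeout === false) ? $config['curl']['timeout'] : $timeout;

    return $merged;
}
```

Validating after the merge means a bad value in the file cannot silently replace a good default.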
Stuff being worked on:
(I'm keeping my github fork up to date as I work on stuff, assuming it's not throwing horrible errors.)
--checklocal option will check the icon in the local path first and check online only if missing or otherwise invalid (size, type, blocklist). (in progress)
configuration.md for detailed help on options.
defines for easier maintenance.
--disableapis=google,faviconkit)
microsoft.com.ico becomes microsoft.ico)
--version (aka v and ver)
define $debug to a bool
--user-agent is passed in.
writeOutput)
Issues:
--help output takes more than one standard console screen (| more or | clip need to be used)
Before pull request:
Other Tasks:
Notes:
define block at the top,
I'm running into some situations where cURL seems to be ignoring the timeout.
This is a cURL issue and occurs outside of PHP and PHP-Grab-Favicon.
For example, when attempting to get the favicon for https://www.fredmeyer.com/.
When manually using cURL, not much light is shed on this:
http://fredmeyer.com/favicon.ico returns HTTP/1.0 301 Moved Permanently and redirects to http://www.fredmeyer.com/favicon.ico
In a web browser, going to http://fredmeyer.com/favicon.ico returns the icon with the url https://www.fredmeyer.com/favicon.ico
Manually using cURL with https://www.fredmeyer.com/favicon.ico results in an infinite wait.
Update:
This appears to be an SSL/TLS version issue. If I force TLS 1.3 it gets further.
Update 2:
get-fav was dropping the protocol (so if https was requested it would downgrade to http); fixed in 202305241340. Now the timeout and API fallback are working for fredmeyer, but https://www.fredmeyer.com/favicon.ico should work.
Update 3:
I'm not quite sure why the protocol (http/https) adjustment fixed the timeout; the timeout should just be the timeout. I'm just going to close this for now with a wary eye.
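For anyone reproducing this, the sketch below shows one way to set both a connection timeout and an overall timeout, and to optionally force TLS 1.3 as tried in the first update. It is not the code in get-fav.php. The function name buildFaviconCurlOptions and the default values are assumptions, and forcing TLS 1.3 is only the experiment described above, not a confirmed fix.

```php
<?php
// Hypothetical helper: build cURL options with explicit time limits.
function buildFaviconCurlOptions(int $timeoutSeconds = 10, bool $forceTls13 = false): array
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,            // follow the 301 to the www host
        CURLOPT_MAXREDIRS      => 5,               // avoid redirect loops
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // limit for the connection phase
        CURLOPT_TIMEOUT        => $timeoutSeconds, // limit for the whole transfer
    );

    // The TLS 1.3 constant only exists on newer PHP/libcurl builds.
    if ($forceTls13 && defined('CURL_SSLVERSION_TLSv1_3')) {
        $options[CURLOPT_SSLVERSION] = CURL_SSLVERSION_TLSv1_3;
    }

    return $options;
}

// Usage (requires the cURL extension and network access):
// $ch = curl_init('https://www.fredmeyer.com/favicon.ico');
// curl_setopt_array($ch, buildFaviconCurlOptions(10, true));
// $body = curl_exec($ch); // false on timeout or any other failure
```

Setting CURLOPT_CONNECTTIMEOUT alongside CURLOPT_TIMEOUT makes it easier to tell a stalled handshake apart from a slow transfer when reading curl_error afterwards.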
There is a problem if one of the "need to review the security of your connection" checks comes up when get-fav is attempting to find icons.
Unfortunately, I don't think this is fixable (other than trying again or hoping the API catches things) using cURL.
Some of this may be user-agent related; I will try to see if some sites are happy enough with the default cURL user agent.
Will look into some possible solutions.
The original design of get-fav.php seems to have been for use on a web server. The command line switches aren't very useful in this mode, so I'm designing a system for processing options in "web mode".
There will be a master switch to enable these new options; it will default to disabled for security reasons.
define('ENABLE_WEB_INPUT', false);
My current thought is to have options come in over the query string and/or form fields. I have not gone into any specifics yet, but they will likely follow the general syntax of the command line switches.
I don't really use the script this way so this is a chance for comments/suggestions/notes.
I don't plan on adding security checks to the script itself for incoming requests as that should be handled by the server configuration (htaccess or whatever). I do plan on having some checks with paths, especially for anything that can write.
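To make the idea concrete for discussion, here is a minimal sketch of how "web mode" options might be read. Only ENABLE_WEB_INPUT comes from the proposal above. The function name readWebOptions, the option names, the type labels, and the path rules are all assumptions, since no specifics have been decided.

```php
<?php
// Master switch from the proposal; disabled by default.
if (!defined('ENABLE_WEB_INPUT')) {
    define('ENABLE_WEB_INPUT', false);
}

// Hypothetical reader: $query would be $_GET and/or $_POST in real use.
// $allowed maps option name => 'bool' | 'path' | 'string'.
function readWebOptions(array $query, array $allowed, bool $enabled = ENABLE_WEB_INPUT): array
{
    if (!$enabled) {
        return array(); // master switch off: ignore all web input
    }

    $options = array();
    foreach ($allowed as $name => $type) {
        if (!array_key_exists($name, $query) || !is_string($query[$name])) {
            continue; // unknown shape (e.g. arrays) is ignored
        }
        $value = $query[$name];

        if ($type === 'bool') {
            $parsed = filter_var($value, FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
            if ($parsed !== null) {
                $options[$name] = $parsed;
            }
        } elseif ($type === 'path') {
            // Path checks for anything that can write: relative paths only,
            // no parent traversal, no NUL bytes, no drive letters.
            $unsafe = $value === ''
                || strpos($value, '..') !== false
                || strpos($value, "\0") !== false
                || $value[0] === '/'
                || $value[0] === '\\'
                || preg_match('#^[A-Za-z]:#', $value) === 1;
            if (!$unsafe) {
                $options[$name] = $value;
            }
        } else {
            $options[$name] = $value;
        }
    }
    return $options;
}
```

Using an allow-list means names that are not declared never reach the script, whatever arrives in the request.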
Split the download dir into several sub-dirs (MD5 segment of filename e.g. /af/cd/example.com.png) if there are a lot of favicons.
@gaffling How did you want this to work?
My general thought is that if the number of requested icons exceeds a threshold (set by default, ini or switch), then instead of just using local_path it will create sub-folders based on fragments of the md5 hashes of the icons for saving.
It might be better, though, if it was alphabetical based on the domain name, so "microsoft" would go in "m/microsoft.com.ico". There could also be multiple threshold levels, for example "mi/microsoft.com.ico" if there are a crazy number.
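Both variants discussed here (md5 fragments and alphabetical prefixes) can be sketched in a few lines. The function name shardedIconPath and its parameters are hypothetical, and the threshold logic is left out; this only shows how the sub-folder part of the path could be derived.

```php
<?php
// Hypothetical helper: compute a sharded save path for an icon file.
// $scheme 'md5'   -> $levels two-character folders from the md5 of the name
// $scheme 'alpha' -> one folder made of the first $levels letters/digits
function shardedIconPath(string $baseDir, string $fileName, string $scheme = 'md5', int $levels = 2): string
{
    $levels = max(1, $levels);

    if ($scheme === 'md5') {
        // e.g. hash "afcd..." with 2 levels gives "af/cd"
        $parts = str_split(substr(md5($fileName), 0, $levels * 2), 2);
    } else {
        // Keep only lowercase letters and digits so the folder name is safe.
        $clean = preg_replace('/[^a-z0-9]/', '', strtolower($fileName));
        if ($clean === '') {
            $clean = '_';
        }
        $parts = array(substr($clean, 0, $levels));
    }

    return rtrim($baseDir, '/') . '/' . implode('/', $parts) . '/' . $fileName;
}

// shardedIconPath('icons', 'microsoft.com.ico', 'alpha', 1) gives "icons/m/microsoft.com.ico"
// shardedIconPath('icons', 'microsoft.com.ico', 'alpha', 2) gives "icons/mi/microsoft.com.ico"
```

The md5 variant spreads files evenly across folders, while the alphabetical variant is easier to browse by hand but can be lopsided for common first letters.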
Domain:
https://www.ionos.de
Put the Domain in TestArray ...
$testURLs = array(
'https://aws.amazon.com',
'https://www.ionos.de',
'https://www.commerzbank.de',
'https://www.apple.com',
);
Start with:
php get-fav.php --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 "
Console-Output:
PHP Warning: Undefined global variable $curl_timeout in /var/www/vhosts/---/httpdocs/get-fav.php on line 572
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined global variable $curl_timeout in /var/www/vhosts/---/httpdocs/get-fav.php on line 572
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined variable $filePath in /var/www/vhosts/---/httpdocs/get-fav.php on line 484
PHP Warning: Undefined global variable $curl_timeout in /var/www/vhosts/---/httpdocs/get-fav.php on line 572
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/d---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined global variable $curl_timeout in /var/www/vhosts/---/httpdocs/get-fav.php on line 572
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
Icon: ./aws.amazon.com.ico
Icon: ./commerzbank.de.ico
Icon: ./apple.com.ico
Runtime: 2.71 Sec.
Is there a reason why I receive a 504 gateway timeout error whenever I try to fetch 50 favicon icons?
Originally posted by @raitman005 in #2 (comment)
There is a bug (my fault) where the curl timeout isn't being set properly; this will cause it to hang instead of timing out.
I have fixed it locally but I've changed other things as well and am not ready for a PR.
Quick fix is to insert the following two lines in the load function:
$timeOut = getGlobal('curl_timeout');
if (!isset($timeOut)) { $timeOut = 60; }
Current:
function load($url, $DEBUG, $consoleMode = false, $timeOut = 60) {
if (function_exists('curl_version')) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, getGlobal('curl_useragent'));
curl_setopt($ch, CURLOPT_VERBOSE, getGlobal('curl_verbose'));
curl_setopt($ch, CURLOPT_TIMEOUT, $timeOut);
Hot Fix:
function load($url, $DEBUG, $consoleMode = false, $timeOut = 60) {
if (function_exists('curl_version')) {
$timeOut = getGlobal('curl_timeout');
if (!isset($timeOut )) { $timeOut = 60; }
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, getGlobal('curl_useragent'));
curl_setopt($ch, CURLOPT_VERBOSE, getGlobal('curl_verbose'));
curl_setopt($ch, CURLOPT_TIMEOUT, $timeOut);
(It's handled much better in the newer code)
We are using exif_imagetype to determine image type; it takes a filename (or url) as a parameter.
Unfortunately, when it's a URL there is no way to set a timeout.
The solution may be to download the image to disk temporarily for testing but that has its own issues.
Figured out a way to do a timeout, implemented in 202305241420. It's still not ideal; the HTTP functions exif_imagetype uses aren't very good. (Some sites won't talk to it, even with the user agent being set.)
I also notice in some cases if we find an icon, it's technically downloaded twice. With icons this isn't a huge deal since they are tiny but maybe there's a more efficient way of doing this. I have something in mind and we'll see how it goes.
This won't work for the file exist check but this will apply the correct extension (as long as exif_imagetype knows it) when saving icons locally.
This has "ico" as the default extension although in reality the one it's most likely to be if it's none of the others is SVG.
// Write Favicon local
// Default to "ico" when exif_imagetype does not recognize the file.
$iconType = "ico";
$extensionByImageType = array(
    IMAGETYPE_GIF  => "gif",
    IMAGETYPE_JPEG => "jpg",
    IMAGETYPE_PNG  => "png",
    IMAGETYPE_ICO  => "ico",
    IMAGETYPE_WEBP => "webp",
    IMAGETYPE_BMP  => "bmp",
);
// Detect the type once instead of once per candidate format.
$detectedType = exif_imagetype($favicon);
if ($detectedType !== false && isset($extensionByImageType[$detectedType])) { $iconType = $extensionByImageType[$detectedType]; }
$filePath = preg_replace('#\/\/#', '/', $directory.'/'.$domain.'.' . $iconType);
The file exists check is going to have to check all the common extensions.
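A minimal sketch of such a check follows. The function name findLocalIcon and the extension list (including svg) are assumptions for illustration, not code from get-fav.php.

```php
<?php
// Hypothetical helper: look for an already saved icon under any common extension.
// Returns the first matching path, or null when no non-empty file exists.
function findLocalIcon(string $directory, string $domain, array $extensions = array('ico', 'png', 'gif', 'jpg', 'webp', 'bmp', 'svg')): ?string
{
    foreach ($extensions as $extension) {
        $candidate = rtrim($directory, '/') . '/' . $domain . '.' . $extension;
        // Treat empty files as missing so a failed download is retried.
        if (is_file($candidate) && filesize($candidate) > 0) {
            return $candidate;
        }
    }
    return null;
}
```

The order of the extension list decides which file wins when more than one exists for the same domain.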
Hello, what if I want to use the file location dir of the favicon instead of the domain name when saving the file? Where do I change it in the file?
domain/favicon href value.png