php-grab-favicon's Introduction

Igor Gaffling

40 years of experience in digital technology

gaffling

php-grab-favicon's People

Contributors

gaffling, leethompson

php-grab-favicon's Issues

Split Download Folder Into Several Sub-Dirs If There Are Lot of Icons

Split the download dir into several sub-dirs (MD5 segment of filename e.g. /af/cd/example.com.png) if there are a lot of favicons.

@gaffling How did you want this to work?

My general thought is that if the number of requested icons exceeds a threshold (set by default, ini or switch), then instead of just using local_path it will create sub-folders for saving, named after fragments of the icons' MD5 hashes.

Although it might be better if it was alpha based on the domain name, like "microsoft" would go in "m/microsoft.com.ico". Perhaps there could be multiple threshold levels so perhaps "mi/microsoft.com.ico" if there are a crazy number.
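
A minimal sketch of both naming schemes discussed above. The helper names (md5SubPath, alphaSubPath), the two-level split and the segment lengths are assumptions for illustration, not part of get-fav.php.

```php
<?php
// Hypothetical helper: build a sub-path from the leading characters of the
// MD5 hash of the file name, e.g. "af/cd/example.com.png".
function md5SubPath($fileName, $levels = 2, $segmentLength = 2) {
  $hash = md5($fileName);
  $parts = array();
  for ($i = 0; $i < $levels; $i++) {
    $parts[] = substr($hash, $i * $segmentLength, $segmentLength);
  }
  $parts[] = $fileName;
  return implode('/', $parts);
}

// Hypothetical helper: build a sub-path from the leading characters of the
// domain, e.g. "m/microsoft.com.ico" (or "mi/..." with $prefixLength = 2).
function alphaSubPath($fileName, $prefixLength = 1) {
  $prefix = strtolower(substr($fileName, 0, $prefixLength));
  return $prefix . '/' . $fileName;
}
```

The MD5 variant spreads files evenly across folders, while the alphabetical variant is easier to browse by hand but can be lopsided for popular first letters.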

Feature Ideas

  1. It would be great to give it a list of domains via stdin or a text file. I will likely add this for myself locally, but others may find it useful too.
  2. Should probably accept some other command line options too, such as save/don't save, save path, debug messages, etc.
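
A minimal sketch of the first idea, assuming one URL or domain per line. The helper name readDomainList and the "#" comment convention are hypothetical and not taken from the script.

```php
<?php
// Hypothetical helper: read one URL or domain per line from an open stream
// (STDIN or a file handle), skipping blank lines and "#" comment lines.
function readDomainList($handle) {
  $domains = array();
  while (($line = fgets($handle)) !== false) {
    $line = trim($line);
    if ($line === '' || $line[0] === '#') { continue; }
    $domains[] = $line;
  }
  return $domains;
}
```

From the console this could be called as readDomainList(STDIN), or with a handle returned by fopen() for a text file.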

exif_imagetype Issues

We are using exif_imagetype to determine image type, it takes a filename (or url) as a parameter.
Unfortunately, when it's a URL there is no way to set a timeout.

The solution may be to download the image to disk temporarily for testing but that has its own issues.

Figured out a way to do a timeout, implemented in 202305241420. It's still not ideal, as the HTTP functions exif_imagetype uses aren't very good. (Some sites won't talk to it, even with the user agent being set.)

I also notice in some cases if we find an icon, it's technically downloaded twice. With icons this isn't a huge deal since they are tiny but maybe there's a more efficient way of doing this. I have something in mind and we'll see how it goes.

cURL Not Obeying Timeout

I'm running into some situations where cURL seems to be ignoring the timeout.
This is a cURL issue and occurs outside of PHP and PHP-Grab-Favicon.

For example, when attempting to get the favicon for https://www.fredmeyer.com/.

When manually using cURL, not much light is shed on this:
http://fredmeyer.com/favicon.ico returns HTTP/1.0 301 Moved Permanently and returns http://www.fredmeyer.com/favicon.ico

In a web browser, going to http://fredmeyer.com/favicon.ico returns the icon with the url https://www.fredmeyer.com/favicon.ico
Manually using cURL with https://www.fredmeyer.com/favicon.ico results in an infinite wait.

Update:
This appears to be a SSL/TLS version issue. If I force TLS 1.3 it gets farther.

Update 2:
get-fav was dropping the protocol (so if https was requested it would downgrade to http). This is fixed in 202305241340. The timeout and API fallback now work for fredmeyer, but https://www.fredmeyer.com/favicon.ico should work.

Update 3:
I'm not quite sure why having the protocol (http/https) adjustment fixed the timeout, the timeout should just be the timeout. I'm just going to close this for now with a wary eye.

Checking Extensions

This won't work for the file exist check but this will apply the correct extension (as long as exif_imagetype knows it) when saving icons locally.

This has "ico" as the default extension although in reality the one it's most likely to be if it's none of the others is SVG.

      // Write Favicon local
      // Detect the type once; exif_imagetype returns false for unknown data.
      $imageType = exif_imagetype($favicon);
      $extensions = array(
        IMAGETYPE_GIF  => "gif",
        IMAGETYPE_JPEG => "jpg",
        IMAGETYPE_PNG  => "png",
        IMAGETYPE_ICO  => "ico",
        IMAGETYPE_WEBP => "webp",
        IMAGETYPE_BMP  => "bmp",
      );
      // Default to "ico" when the type is not in the map.
      $iconType = ($imageType !== false && isset($extensions[$imageType])) ? $extensions[$imageType] : "ico";
      
      $filePath = preg_replace('#\/\/#', '/', $directory.'/'.$domain.'.' . $iconType);

The file exists check is going to have to check all the common extensions.
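
A minimal sketch of such a check, assuming a hypothetical helper named findLocalIcon and an illustrative extension list. It reuses the path clean-up from the snippet above.

```php
<?php
// Hypothetical helper: return the first existing icon path for a domain,
// trying each common extension in turn, or false if none exists.
function findLocalIcon($directory, $domain, $extensions = array('ico', 'png', 'gif', 'jpg', 'webp', 'bmp', 'svg')) {
  foreach ($extensions as $extension) {
    $filePath = preg_replace('#//#', '/', $directory . '/' . $domain . '.' . $extension);
    if (file_exists($filePath)) { return $filePath; }
  }
  return false;
}
```

The order of the list decides which file wins if several extensions exist for the same domain.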

Web Mode (Not-Console/CLI)

The original design of get-fav.php seems to have been for use on a web server. The command line switches aren't very useful in this mode, so I'm designing a system for processing options in "web mode".

There will be a master switch to enable these new options, it will default to disabled for security reasons.

define('ENABLE_WEB_INPUT', false);

My current thought is to have options come in over the query string and/or form fields. I have not gone into any specifics yet but they will likely follow the general syntax of the command line switches.

I don't really use the script this way so this is a chance for comments/suggestions/notes.

I don't plan on adding security checks to the script itself for incoming requests as that should be handled by the server configuration (htaccess or whatever). I do plan on having some checks with paths, especially for anything that can write.
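
A minimal sketch of how the master switch could gate incoming options. The helper name readWebOptions and the idea of an allow-list of option names are assumptions, since the specifics have not been decided.

```php
<?php
// Master switch from the text above; only defined here if it is missing.
if (!defined('ENABLE_WEB_INPUT')) { define('ENABLE_WEB_INPUT', false); }

// Hypothetical helper: pick known options out of a request array
// (e.g. $_GET or $_POST). Unknown keys and non-string values are ignored,
// and nothing is read at all while the master switch is off.
function readWebOptions(array $request, array $allowed, $enabled = ENABLE_WEB_INPUT) {
  $options = array();
  if (!$enabled) { return $options; }
  foreach ($allowed as $name) {
    if (isset($request[$name]) && is_string($request[$name])) {
      $options[$name] = trim($request[$name]);
    }
  }
  return $options;
}
```

Path options read this way would still need the planned path checks before anything is written.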

Roadmap: More Enhancements in Development

Status:

June 23rd 2023
Haven't been able to do much work this week due to some unexpected household emergencies, should be back at it next week.

202306161401

  • Added a MIME database, too many functions were all doing different types of lookups and so I consolidated it into a database "object". Works pretty well so far.
  • Added a content buffer for internal processing. The goal is to prevent unnecessary reloading of data.
  • Added yet more switches, mostly for end users who want to see the HTTP warnings (4xx) and errors (5xx); they can be enabled in the ini or with --showhttpwarnings and --showhttperrors. (With the options enabled, they will output as TYPE_WARNING and TYPE_ERROR.)
  • Image size doesn't always work even if a valid image so until that gets resolved, if you specify a minimum size it's more of a 'goal'.
  • Added SVG detect to our own data check
  • Added new logging levels TYPE_OBJECTS, TYPE_TIMERS (full debug logging is now 1023)
  • convertRelativeToAbsolute now has one return path making for easier debugging.
  • Tightened up domain parsing and regex code, it should be a bit better dealing with subdomains.
  • Integrating MIME Database, checkIconAcceptance, and other new things to existing code. Then I'm going to do a battery of tests, after which I'm going to simplify/optimize and continue with adding the remaining new features (check local icon, etc).

202306121848:

  • Added new 'extensions' section to the ini, it's mostly for testing but could be used if something isn't working right. They are all simple boolean values (true or false), the list is: curl, exif, get, put, mbstring, fileinfo, mimetype, gd, imagemagick, gmagick, hrtime. If an extension is listed as true in this section but is not loaded or available, it will change to false. (Please note, GD, ImageMagick and gmagick are not currently used at all.)
  • Image identification fallbacks added to local file loading
  • Been testing/fixing up extension/function fallback code
  • Fixed an issue where the log could be initialized too soon and not honor some settings
  • Added --sites as an alternate to --list
  • Added raw datacheck for most common icon formats
  • Added a "confidence" level, not used other than logging yet
  • This isn't the "big update" yet, as I wanted to test some of the fallbacks before I started going down the bigger rabbit hole and that's probably going to continue this week.

Some notes on this:

Having our own image identification is important should the PHP installation be limited (for whatever reason) and going by file extension is still the last resort.

The method used for this is looking for the "signature" of the image file. Most image formats have a header with signature data to be used by software trying to open it (this is also called a "magic number".) The new code knows PNG, GIF, JPEG, WEBP, BMP and ICO formats.

Some image formats are easier to identify than others. For example, the PNG format's "magic" is \x89PNG\r\n\x1A\n, which is pretty distinctive. BMP and ICO have very simple identifiers, so false positives are much more likely, which is why I've been adding a "certainty" rating. Eventually you'll be able to set a minimum acceptable "certainty" and reject possibly invalid files. (You can currently set it but nothing looks at it.)

Here's some sample trace logging showing this in action:

2023-06-12 18:47:21 [TRACE] [grap_favicon(20):listIcons:getMIMETypeFromFile] pathname='icons/whatsapp.png', content_type=image/png, confidence=certain, method=signature
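
A minimal sketch of signature-based detection for the six formats named above. This is a hypothetical stand-in (detectImageSignature), not the project's own function. The confidence labels other than "certain" are assumptions chosen to show why short signatures deserve less trust.

```php
<?php
// Hypothetical stand-in for a signature check.
// Returns array(mime type, confidence) or false if nothing matches.
function detectImageSignature($data) {
  if (strncmp($data, "\x89PNG\r\n\x1A\n", 8) === 0) { return array('image/png', 'certain'); }
  if (strncmp($data, "GIF87a", 6) === 0 || strncmp($data, "GIF89a", 6) === 0) { return array('image/gif', 'certain'); }
  if (strncmp($data, "RIFF", 4) === 0 && substr($data, 8, 4) === "WEBP") { return array('image/webp', 'certain'); }
  // JPEG starts with only three fixed bytes.
  if (strncmp($data, "\xFF\xD8\xFF", 3) === 0) { return array('image/jpeg', 'likely'); }
  // ICO: reserved word 0, then type 1. Very short, so false positives are possible.
  if (strncmp($data, "\x00\x00\x01\x00", 4) === 0) { return array('image/vnd.microsoft.icon', 'possible'); }
  // BMP: just the two letters "BM".
  if (strncmp($data, "BM", 2) === 0) { return array('image/bmp', 'possible'); }
  return false;
}
```

Checking the long, distinctive signatures first keeps the weak BMP and ICO matches as a last resort.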

Ideally, if everything is available to get-fav.php the following methods are used, in order:

  1. The content-type returned by the server (remote only)
  2. FileInfo
  3. mime_content_type (local files only)
  4. exif_imagetype (and image_type_to_mime_type if available)
  5. getMIMETypeFromBinary (the new fallback function using "magic")
  6. file extension

202306071311:

  • Initial work for processing parameters in HTML mode created (completely untested)
  • Added --checklocal / --nochecklocal, --storeifnew (requires --checklocal and --store) ( Not implemented yet. )
  • Added --showconfig / --noshowconfig to show running configuration options
  • Added --showconfigonly (implies --showconfig), shows running configuration and exits.
  • Added --silent (console mode only) (turns off the console completely)
  • Near the top of the script there are two defines ENABLE_SAME_FOLDER_INI and ENABLE_SAME_FOLDER_API_INI. They default to false. If they are set to true, if get-fav.ini and get-fav-api.ini, respectively, are in the same folder as get-fav.php they will be read and used automatically. --configfile and --apiconfigfile, if specified, will be applied after.

It will likely be a few days before I do another git push as the next one is a big one:

  • Path write checking
  • Check local icons against criteria (if required, replacements will be downloaded; if the current icon is ok but there is a different icon online, if storeifnew is enabled it will be replaced)
  • Icons will also be tested for size criteria.
  • Blocklists will be applied.
  • Code will be put in place for storing local icons in sub-folders.
  • Some test HTTP mode variables will be parsed.
  • Documentation will be updated.

202306062230:

  • Refined HTTP Response Parsing (now includes general 'class' of response as part of the data)
  • PHP .ini values are now in defines so if something changes down the road it's easier to update
  • Added more parameter checking
  • Added major/minor to version
  • If cURL is disabled and file_get_contents is not available, check if PHP.INI: allow_url_fopen is disabled, if so show an error message.
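
A minimal sketch of status-line parsing that includes a general class for the response, as described in the first item above. The helper name parseStatusLine and the class labels are assumptions, not the script's actual structure.

```php
<?php
// Hypothetical helper: parse an HTTP status line into its parts and add the
// general class of the response. Returns false if it is not a status line.
function parseStatusLine($line) {
  if (!preg_match('#^HTTP/(\d+(?:\.\d+)?)\s+(\d{3})(?:\s+(.*))?$#', trim($line), $m)) {
    return false;
  }
  $classes = array(
    1 => 'informational',
    2 => 'success',
    3 => 'redirect',
    4 => 'client_error',
    5 => 'server_error',
  );
  $code = (int) $m[2];
  $family = (int) floor($code / 100);
  return array(
    'version' => $m[1],
    'code'    => $code,
    'reason'  => isset($m[3]) ? $m[3] : '',
    'class'   => isset($classes[$family]) ? $classes[$family] : 'unknown',
  );
}
```

For example, the line "HTTP/1.0 301 Moved Permanently" from the fredmeyer.com issue would be classed as a redirect.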

202306042323:

  • Bug fixing.
  • Added --apiconfigfile=PATHNAME to load API Definitions
  • Loading of 'same folder' API and config file can be controlled in the special runtime defines section. Default is OFF. (They can always be overridden with command line switch)
  • API: Updated favicongrabber's built-in definition
  • API: Added iconhorse to built-in definition
  • Added more to the capabilities structure
  • If exif is used content-type will be looked up using the image_type_to_mime_type function
  • Capability checking is more thorough and accurate. ("exif" requires "mbstring" etc)

202306021445:

  • Mostly "under the hood" work today. Mostly internal structures prepping for some of the features still being implemented.
  • Added a HTTP response parser for cleaner coding and better log messaging
  • Made some small changes for PHP 5.6.40 compatibility

202306011529:

  • Today one of the APIs was returning 502 errors which gave me the opportunity to add some error handling.
  • Rewrote the JSON parsing for APIs, this required a change to the .ini file for APIs but it should be more flexible (once all the bugs are fixed)
  • It will now go through more than one icon record (for APIs that support it) and return the first that matches criteria (size, format, etc) (I need to do the same for the regex search.)
  • Added another switch pair --allowoctetstream / --disallowoctetstream, the default is false because if the more accurate content-type detection is not available most will return application/octet-stream. I may make the default true if mime_content_type and/or finfo_open are available. (.ini file is [global] allow_octet_stream=boolean )
  • If in debugMode (--debug and/or debug/trace/special logging) active settings will be shown.
  • minor changes for PHP 8.2 compatibility
  • This version has only been tested with PHP 8.2.6.

202305312016:

  • Mostly bug fixing and optimization.
  • Debug logging is at about 80% complete.
  • changed more internal structures, probably not done with that (mostly to accommodate new features)
  • added tenacious mode will try all APIs until it gets a successful result (default is off)
  • added precision timers for internal use
  • it will now warn if, due to the PHP configuration, some functions that identify formats are not available and results may therefore be poor
  • you can now specify what icon types are acceptable (careful) (note: it is not wired in everywhere yet)
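
A minimal sketch of the tenacious mode item above. The helper name fetchFromApis and the shape of the $apis array (name mapped to a callable that returns icon data or false) are assumptions for illustration.

```php
<?php
// Hypothetical helper: try the APIs in the given order. With $tenacious off
// only the first API is tried; with it on, every API is tried until one
// returns something other than false.
function fetchFromApis($url, array $apis, $tenacious = false) {
  foreach ($apis as $name => $api) {
    $result = call_user_func($api, $url);
    if ($result !== false) {
      return array('api' => $name, 'data' => $result);
    }
    if (!$tenacious) { break; }
  }
  return false;
}
```

Shuffling $apis before the call would keep the random selection while still allowing the fallback.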

202305281757:

  • Added a 4th API (INI file only right now)
  • Rewrote API randomizer
  • Setting up proper debug logging which is about 20% complete
  • Unified output into the new logging function (automatically renders HTML if not in console mode). (It is possible now to have the script not output anything if you disable both file and console outputs.)
  • Added switch for icon size
  • Added switches for console output (timestamps, level, etc)
  • Debug/HTML mode icons should set the correct MIME type for display (not tested)
  • I know it's looking like a lot's been done, and it has, but very little has been tested. If you choose to try my branch out, please keep that in mind.

202305251719:

  • Debug logging added (not implemented much yet).
  • Greatly improved image detection although it uses fileinfo which may not be installed everywhere. It will fall back to exif etc.
  • Introduced HTTP Load buffering. If the load function gets a URL that it already loaded it will just return what it got last time. (can be disabled)
  • A lot of new "under the hood" functions, if you choose to play with it from my fork be very careful and please report bugs.
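
A minimal sketch of the HTTP load buffering described above. The wrapper name bufferedLoad and the injected $fetcher callable are assumptions; the real load function does its own cURL work.

```php
<?php
// Hypothetical wrapper: remember the result for each URL so a repeated
// request returns the earlier content instead of loading it again.
// $fetcher is any callable that takes a URL and returns its content.
function bufferedLoad($url, $fetcher, $useBuffer = true) {
  static $buffer = array();
  if ($useBuffer && array_key_exists($url, $buffer)) {
    return $buffer[$url];
  }
  $content = call_user_func($fetcher, $url);
  if ($useBuffer) { $buffer[$url] = $content; }
  return $content;
}
```

Passing false as the third argument bypasses the buffer, which mirrors the "can be disabled" note.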

202305242106:

  • APIs can be read in from an INI file (get-fav-api.ini)

202305241803:

  • Added remove TLD support (needs a lot of testing)
  • Made load function allow recursion for redirects (needs a lot of testing)

202305241420:

  • Bugfix. Now setting timeout for PHP level HTTP and socket operations. (#13)
  • Bugfix. Now keeps specified protocol active (http, https) (#12)
  • Preliminary support to keep port and user/password information (not hooked up yet)
  • Added a new direct try, it takes the url and adds favicon.ico to it and sees if it gets anything then falls back to previous behavior.
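
A minimal sketch of the direct try combined with the protocol fix. The helper name buildDirectFaviconUrl is hypothetical, and it assumes the icon is requested from the root of the host rather than appended to the full path.

```php
<?php
// Hypothetical helper: build the "direct try" URL by keeping the scheme,
// host and optional port of the requested URL and appending /favicon.ico.
// Returns false if the URL has no recognisable host.
function buildDirectFaviconUrl($url, $defaultScheme = 'http') {
  $parts = parse_url($url);
  if ($parts === false || !isset($parts['host'])) { return false; }
  $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : $defaultScheme;
  $port = isset($parts['port']) ? ':' . $parts['port'] : '';
  return $scheme . '://' . $parts['host'] . $port . '/favicon.ico';
}
```

Because the scheme comes from the request, an https URL is never silently downgraded to http.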

202305221634:

  • Reads config files, command line switches will always override any ini setting. (It is using parse_ini_file with INI_SCANNER_RAW, does array_replace_recursive with the existing configuration structure and finally validates boolean/numeric (with range checks).)
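
A minimal sketch of that sequence. It uses parse_ini_string instead of parse_ini_file to stay self-contained. The helper names, the [curl] timeout key and the 1 to 600 range are assumptions; only [global] allow_octet_stream appears elsewhere in these notes.

```php
<?php
// Hypothetical validators: INI_SCANNER_RAW leaves every value as a string,
// so booleans and numbers are converted by hand, falling back to a default.
function validateBool($value, $default) {
  if (is_bool($value)) { return $value; }
  $result = filter_var($value, FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
  return ($result === null) ? $default : $result;
}

function validateInt($value, $default, $min, $max) {
  $options = array('options' => array('min_range' => $min, 'max_range' => $max));
  $result = filter_var($value, FILTER_VALIDATE_INT, $options);
  return ($result === false) ? $default : $result;
}

// Hypothetical loader: defaults, then ini values, then command line overrides.
function loadConfig($iniText, array $defaults, array $cliOverrides = array()) {
  $parsed = parse_ini_string($iniText, true, INI_SCANNER_RAW);
  if ($parsed === false) { $parsed = array(); }
  $config = array_replace_recursive($defaults, $parsed, $cliOverrides);
  $config['global']['allow_octet_stream'] = validateBool(
    $config['global']['allow_octet_stream'], $defaults['global']['allow_octet_stream']);
  $config['curl']['timeout'] = validateInt(
    $config['curl']['timeout'], $defaults['curl']['timeout'], 1, 600);
  return $config;
}
```

Applying the command line values last in array_replace_recursive is what lets switches always override the ini file.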

202305231619:

  • Path and other settings are validated
  • Settings are checked against capabilities
  • Updated --help
  • Help menu now shows actual defaults from the defines
  • Help menu now shows available APIs (* by ones that are disabled)
  • Updated copyright notice (year changed to 2019-2023)
  • Individual APIs can be enabled/disabled

Stuff being worked on:

(I'm keeping my github fork up to date as I work on stuff, assuming it's not throwing horrible errors.)

  • New --checkicon --checklocal option will check the icon in the local path first and check online only if missing or otherwise invalid (size, type, blocklist). (in progress)
  • The main design of the script seems to be as a server side script so I plan to add options for it (passed in via query string or form, default will be disabled for security reasons) (#11) (in progress)
  • Icon validation where it can be checked with generic fallback icons (via md5 hash comparisons in a 'blocklist') (in progress)
  • Updating README.MD to reflect command line switches etc. (in progress)
  • Document functions and config file format (ini file). (in progress)
  • MD5 fragment sub-folder option (#14)
  • Configuration file support (command line switches will still override the config)
  • Added configuration.md for detailed help on options.
  • Redoing configuration throughout the code (to better handle config file overrides) (it's more of an array structure)
  • Add a configuration validation check for paths
  • Moved defaults & constants to defines for easier maintenance.
  • Improved error handling
  • Add code to enable/disable individual APIs by name (e.g. --disableapis=google,faviconkit)
  • Option to strip the TLD from the filename (e.g. microsoft.com.ico becomes microsoft.ico)
  • Investigate defining APIs in the ini file.
  • Adding more comments to code
  • Log file support with timestamp and append options (mostly for debugging purposes)
  • Final configuration validation check should include capabilities, so if you force enable curl but your PHP doesn't have it, it should use the fallback.
  • Added --version (aka v and ver)
  • Added a version as a define
  • Some bug fixing
  • More command line switches for troubleshooting and for specific situations allowing control over connection, http and dns timeouts.
  • Changed $debug to a bool
  • cURL path now handles http->https redirects.
  • PHP's user agent is now set as well as cURL's (not permanently) (#7) if --user-agent is passed in.
  • Allow manual disabling of curl.
  • New structure for APIs (will allow adding APIs in the future). (NOTE: it does not currently fallback if the randomly selected one fails)
  • Ability to enable/disable individual API methods
  • Unifying message output/debug messages (function writeOutput)
  • Update command line help.
  • API definitions should allow for apikey (untested)
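
A minimal sketch of the blocklist idea from the list above, where generic fallback icons are recognised by their MD5 hash. The helper name isBlockedIcon and the plain array of hashes are assumptions about how the blocklist might be stored.

```php
<?php
// Hypothetical helper: reject icon data whose MD5 hash appears in a
// blocklist of known generic fallback icons. Hashes are compared in
// lower case so the list may use either case.
function isBlockedIcon($iconData, array $blockedHashes) {
  $normalized = array_map('strtolower', $blockedHashes);
  return in_array(md5($iconData), $normalized, true);
}
```

An exact hash only catches byte-identical files, so a re-encoded copy of the same generic icon would still pass.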

Issues:

  • New API system allows for more APIs but currently doesn't allow fallbacks
  • --help output takes more than one standard console screen (| more or | clip need to be used)
  • exif_imagetype fails on some sites for some reason, probably because fopen isn't doing something it likes. May add a 'temporary' download of the potential icon file for analysis instead of a direct open. (#13) (Partial fix, should be used less.)

Before pull request:

  • Lots of testing
  • HTML mode testing
  • Regression testing with PHP 5, PHP 7 and PHP 8
  • Bug fixes

Other Tasks:

  • "How to use" will need to be updated.

Notes:

  • Most of the internal structure has changed. There are now functions to set (and validate) and get configuration data.
  • The main function now just needs a url, it gets the configuration data when it starts.
  • This will make reading an ini config file and applying it much easier which will be the next step.
  • Almost all constants are now in a define block at the top.
  • The "how to use" notes will need to be updated.
  • I am now testing with PHP 5.6.4, 7.4.33, 8.1.19 and 8.2.6.

cURL Timeout Not Working Properly

There is a bug (my fault) where the curl timeout isn't being set properly. This will cause it to hang instead of timing out.

I have fixed it locally but I've changed other things as well and am not ready for a PR.

Quick fix is to insert the following two lines in the load function

    $timeOut = getGlobal('curl_timeout');
    if (!isset($timeOut )) { $timeOut = 60; }

Current:

function load($url, $DEBUG, $consoleMode = false, $timeOut = 60) {
  if (function_exists('curl_version')) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, getGlobal('curl_useragent'));
    curl_setopt($ch, CURLOPT_VERBOSE, getGlobal('curl_verbose'));
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeOut);

Hot Fix:

function load($url, $DEBUG, $consoleMode = false, $timeOut = 60) {
  if (function_exists('curl_version')) {
    $timeOut = getGlobal('curl_timeout');
    if (!isset($timeOut )) { $timeOut = 60; }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, getGlobal('curl_useragent'));
    curl_setopt($ch, CURLOPT_VERBOSE, getGlobal('curl_verbose'));
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeOut);

(It's handled much better in the newer code)

Issue With Website Security Checks

There is a problem if one of the "need to review the security of your connection" checks comes up when get-fav is attempting to find icons.

Unfortunately, I don't think this is fixable (other than trying again or hoping the API catches things) using cURL.

Some of this may be user agent related, will try to see if some sites are happy enough with the default cURL user agent.

Will look into some possible solutions.

Cannot Grab

Domain:
https://www.ionos.de

Put the Domain in TestArray ...

$testURLs = array(
   'https://aws.amazon.com',
   'https://www.ionos.de',
   'https://www.commerzbank.de',
   'https://www.apple.com',
 );

Start with:

php get-fav.php --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 "
Console-Output:

PHP Warning: Undefined global variable $curl_timeout in /var/www/vhosts/---/httpdocs/get-fav.php on line 572
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined global variable $curl_timeout in /var/www/vhosts/---/httpdocs/get-fav.php on line 572
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined variable $filePath in /var/www/vhosts/---/httpdocs/get-fav.php on line 484
PHP Warning: Undefined global variable $curl_timeout in /var/www/vhosts/---/httpdocs/get-fav.php on line 572
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/d---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined global variable $curl_timeout in /var/www/vhosts/---/httpdocs/get-fav.php on line 572
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
PHP Warning: Undefined array key "SERVER_NAME" in /var/www/vhosts/---/httpdocs/get-fav.php on line 512
Icon: ./aws.amazon.com.ico
Icon: ./commerzbank.de.ico
Icon: ./apple.com.ico

Runtime: 2.71 Sec.
