phergie / phergie-irc-parser
PHP parser for messages conforming to the IRC protocol
License: BSD 3-Clause "New" or "Revised" License
Currently, the parser uses this regex for hostnames:
$host = "$name(?:\\.(?:$name)*)+";
According to RFC 1123, hostnames do not need a dot in them, but the parser currently requires one. This means that, say, nick!user@localhost is rejected by the parser. (It used to be fudged, since the @ character was accepted as a valid username character and so the username in this case would have been captured as user@localhost, but that behaviour goes away in #20.)
To be more correct (but retaining the non-standard trailing-dot caveat for cases like Freenode's services.) the regex should be
$host = "$name(?:\\.(?:$name)*)*";
but this runs into problems with the prefix parser, since hostnames without a dot might look identical to "naked" nicknames. (Try running the test suite with that change, which reveals the problem.)
In my opinion, the best option would be to add the regex change and re-order the $prefix regex so that nick matching is prioritised before servername matching, so that prefixes that are bare alphanumeric strings are treated as nicks rather than servernames. That would act on the assumption that it's less likely to see an IRC server name itself without a dot than it is to see a naked nickname appear in a prefix. The protocol is ambiguous, so it's up for comment.
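A minimal sketch of the idea above, for illustration only. The $name and $nick sub-patterns, the prefix layout and both helper functions are assumptions made for this example; they are not the parser's actual definitions.

```php
<?php
// Hypothetical stand-ins for the parser's sub-patterns (assumptions).
$name = '[a-zA-Z0-9](?:[a-zA-Z0-9\\-]*[a-zA-Z0-9])?';
$nick = '[a-zA-Z][a-zA-Z0-9\\-\\[\\]\\\\`^{}_]*';

$strictHost  = "{$name}(?:\\.(?:{$name})*)+"; // current: at least one dot required
$relaxedHost = "{$name}(?:\\.(?:{$name})*)*"; // proposed: dots optional, trailing dot still allowed

// Anchored match of a candidate hostname against one of the host patterns.
function hostMatches(string $hostPattern, string $host): bool
{
    return preg_match('/^(?:' . $hostPattern . ')$/', $host) === 1;
}

// The nick alternative is listed first, so a bare alphanumeric prefix is
// captured as a nick and only falls through to servername if that fails.
$prefix = '/^:(?:(?P<nick>' . $nick . ')(?:!(?P<user>[^@ ]+))?(?:@(?P<host>' . $relaxedHost . '))?'
        . '|(?P<servername>' . $relaxedHost . '))$/';

// Report which alternative of the prefix pattern matched.
function classifyPrefix(string $prefixPattern, string $prefix): string
{
    if (!preg_match($prefixPattern, $prefix, $m)) {
        return 'invalid';
    }
    return ($m['nick'] ?? '') !== '' ? 'nick' : 'servername';
}
```

With this ordering, :localhost is classified as a nick, while :irc.example.org can only match the servername alternative because a nick cannot contain a dot.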
See this article. Create a .gitattributes file in the root of the repo with entries for files like those under tests, .travis.yml, and others not needed by end users.
freenode.net sends the PART command with a comment, as described in http://tools.ietf.org/html/rfc2812#section-3.2.2:
:user!~user@host PART #channel :"Ex-Chat"
The syntax for numeric commands is such that the bot's nickname always appears after the initial numeric:
XXX BotNick Param1 Param2 :ParamTrail
However, the parser isn't configured to extract the bot's nickname within the numeric, and parses it as the first parameter.
(Changing this behaviour will impact any downstream modules that process numeric parameters.)
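As a rough illustration of extracting the target from a numeric's parameter string, the following sketch uses a hypothetical helper, splitNumericTarget(), which is not part of the parser's API.

```php
<?php
/**
 * Split the parameter string of a numeric reply into the target nickname
 * and the remaining parameters. Hypothetical helper for illustration.
 */
function splitNumericTarget(string $params): array
{
    // The first space-delimited token is the target; the rest is left untouched.
    $parts = explode(' ', ltrim($params), 2);

    return [
        'target' => $parts[0],
        'params' => $parts[1] ?? '',
    ];
}

$result = splitNumericTarget('BotNick Param1 Param2 :ParamTrail');
// $result['target'] is 'BotNick'; $result['params'] is 'Param1 Param2 :ParamTrail'
```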
If a string containing a non-RFC-compliant line is passed to parseAll(), it gets rejected by parse() and this breaks the parseAll() loop, on the assumption that the buffer does not yet contain a complete line. This causes the client to hang.
For example:
$buf = implode('', [
":server.name 375 BotNick :server.name message of the day\r\n", // valid line
":server.name 372 BotNick :Who left a carriage return \r in here?\r\n", // invalid line
":server.name 376 BotNick :End of message of the day.\r\n", // valid line
":server.name 251 BotNick :There are", // incomplete line
]);
$parsed = $myParser->consumeAll($buf);
echo str_replace("\r", '', var_export(['$buf' => $buf, '$parsed' => $parsed], true));
outputs:
array (
'$buf' => ':server.name 372 BotNick :Who left a carriage return in here?
:server.name 376 BotNick :End of message of the day.
:server.name 251 BotNick :There are',
'$parsed' =>
array (
0 =>
array (
'prefix' => ':server.name',
'servername' => 'server.name',
'command' => '375',
'params' =>
array (
1 => 'server.name message of the day',
'all' => 'server.name message of the day',
),
'message' => ':server.name 375 BotNick :server.name message of the day
',
'code' => 'RPL_MOTDSTART',
'target' => 'BotNick',
'tail' => ':server.name 372 BotNick :Who left a carriage return in here?
:server.name 376 BotNick :End of message of the day.
:server.name 251 BotNick :There are',
),
),
)
Calling consumeAll() again would return an empty array, and leave the buffer unchanged; it would never progress beyond the invalid line.
Perhaps parseAll() should have a way of failing gracefully if an invalid CRLF-terminated line is received.
minimum-stability
Hi there,
I believe I've discovered a bug in this library. When throwing a message to the parser sent by a nick ending in an underscore, the parser returns NULL.
I wrote a simple tester for this bug:
<?php
require 'src/Phergie/Irc/ParserInterface.php';
require 'src/Phergie/Irc/Parser.php';
$parser = new \Phergie\Irc\Parser();
var_dump($parser->parse(":[email protected] PRIVMSG #test :test\r\n"));
var_dump($parser->parse(":[email protected] PRIVMSG #test :test\r\n"));
Running this will output the following (truncated for brevity):
NULL
array(7) {
["prefix"]=>
string(31) ":henri!~[email protected]"
Thanks.
Currently, parse() returns NULL on both an incomplete line (no CRLF) and an invalid line (as per regex parsing).
This means that if a line is received where any of the line elements fail their respective regexes, parse() returns NULL. This causes parseAll() and consumeAll() to break, as they never receive any further output from parse(), and the client stops receiving any further input and freezes.
aa2ea5f is one way of solving this issue. It separates incomplete line detection, which is performed with simple string matching at the top of the function, from invalid line detection through preg_match(). Incomplete lines will still return NULL as before, whereas invalid lines will return an array containing the invalid line and the tail.
Such a change would require changes to the client to detect the presence of the 'invalid' key in the returned message array, and log it without emitting it as a message event.
Submitting here as a request for comment, more than anything. I'm on holiday currently, so just stopping by...!
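The separation described above could look roughly like the sketch below. It is not the code from aa2ea5f: takeLine(), takeAll() and the $isValid callback are hypothetical names, and the callback merely stands in for the parser's regex checks.

```php
<?php
/**
 * Classify the first line in a buffer.
 * Returns null when no CRLF-terminated line is available yet (incomplete),
 * otherwise an array holding the line under 'message' or 'invalid', plus the tail.
 */
function takeLine(string $buffer, callable $isValid): ?array
{
    $pos = strpos($buffer, "\r\n");
    if ($pos === false) {
        return null; // incomplete: wait for more data
    }
    $line = substr($buffer, 0, $pos + 2);
    $tail = (string) substr($buffer, $pos + 2);

    return $isValid($line)
        ? ['message' => $line, 'tail' => $tail]
        : ['invalid' => $line, 'tail' => $tail];
}

/** Consume every complete line, valid or not, leaving any partial line in the buffer. */
function takeAll(string &$buffer, callable $isValid): array
{
    $items = [];
    while (($item = takeLine($buffer, $isValid)) !== null) {
        $items[] = $item;
        $buffer = $item['tail'];
    }
    return $items;
}

// Stand-in validity rule (assumption): no stray CR or LF before the terminator.
$isValid = function (string $line): bool {
    return strpbrk(substr($line, 0, -2), "\r\n") === false;
};

$buf = ":a 375 BotNick :start\r\n:a 372 BotNick :stray \r here\r\n:a 376 BotNick :end\r\n:a 251 BotNick :There are";
$items = takeAll($buf, $isValid);
```

Because the invalid line is consumed along with the valid ones, the loop continues past it and only the incomplete final line remains in the buffer.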
Hostnames on the Rizon network appear to allow IPv6 address-like strings (i.e. hexadecimal numbers with segments delimited by colons) postfixed by :IP, which the parser presently can't handle.
2015-06-10 13:23:59 NOTICE [email protected] Parser unable to parse line: :hashworks!~hashworks@DCE7E23D:1D6D03E4:2248D1C4:IP PRIVMSG #moonbase :well, this is bad
The associated BNF notation from RFC 2812:
hostname = shortname *( "." shortname )
shortname = ( letter / digit ) *( letter / digit / "-" )
*( letter / digit ) ; as specified in RFC 1123 [HNAME]
From RFC 1123:
The syntax of a legal Internet host name was specified in RFC-952.
And from RFC-952:
A "name" (Net, Host, Gateway, or Domain name) is a text string up
to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
sign (-), and period (.)
We should probably modify the $host pattern to include a case for such hostnames.
I tried fixing it myself, but my Regexp skills are really rusty :(
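One possible shape for such a case is sketched below. The pattern is inferred only from the single log line above (hexadecimal segments joined by colons and ending in :IP), so both the pattern and the isCloakedHost() helper are assumptions rather than a description of Rizon's actual cloaking rules.

```php
<?php
/**
 * Check whether a host looks like HEX:HEX:...:IP.
 * Hypothetical helper; the pattern is a guess based on one observed example.
 */
function isCloakedHost(string $host): bool
{
    // One or more hexadecimal segments, each followed by a colon, then the literal "IP".
    return preg_match('/^(?:[0-9A-Fa-f]+:)+IP$/', $host) === 1;
}

$observed = isCloakedHost('DCE7E23D:1D6D03E4:2248D1C4:IP'); // true
$ordinary = isCloakedHost('irc.example.org');               // false
```

An alternative of this kind could be added alongside the existing hostname pattern rather than replacing it.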
ZNC distributes messages like "Buffer playback started" to the channel using nickname ***. The parser doesn't like * in the nickname and therefore marks the message as invalid:
:***[email protected] PRIVMSG #channel:Buffer Playback...
Adding * to the list of special characters works, but I'm not sure if this is a valid solution.
Plugin developers often need to validate things like user nicks or channel names. At the moment, there's no way to get at the patterns representing those things in the Parser class.
Today I made a small mistake in an outgoing message, which crashed my bot. Instead of:
$writeStream->ircPrivmsg(
$chan,
"\x01ACTION looks\x01"
);
I wrote (no space after ACTION):
$writeStream->ircPrivmsg(
$chan,
"\x01ACTIONlooks\x01"
);
This caused the following error:
PHP Warning: preg_match(): Empty regular expression in /home/bot/bot2/vendor/phergie/phergie-irc-parser/src/Phergie/Irc/Parser.php on line 369
PHP Catchable fatal error: Argument 1 passed to Phergie\Irc\Parser::removeIntegerKeys() must be of the type array, null given, called in /home/bot/bot2/vendor/phergie/phergie-irc-parser/src/Phergie/Irc/Parser.php on line 371 and defined in /home/bot/bot2/vendor/phergie/phergie-irc-parser/src/Phergie/Irc/Parser.php on line 300
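The warning suggests that preg_match() was handed an empty pattern, presumably because no pattern exists for the unrecognised CTCP command ACTIONlooks. A defensive approach might look like the sketch below; parseCtcp() and the $patterns map are hypothetical and do not mirror the parser's internals.

```php
<?php
/**
 * Split a CTCP payload into command and parameters.
 * $patterns is a hypothetical map of known CTCP commands to regexes.
 */
function parseCtcp(string $text, array $patterns): ?array
{
    if (strlen($text) < 2 || $text[0] !== "\x01" || substr($text, -1) !== "\x01") {
        return null; // not a CTCP message
    }
    // Everything up to the first space is the command; the rest is its parameters.
    $parts = explode(' ', substr($text, 1, -1), 2);
    $command = strtoupper($parts[0]);
    $result = ['command' => $command, 'params' => ['all' => $parts[1] ?? '']];

    // Sub-parse only when a pattern is known, so preg_match() never sees an empty pattern.
    if (isset($patterns[$command]) && preg_match($patterns[$command], $result['params']['all'], $m)) {
        $result['params'] += array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
    }
    return $result;
}

$patterns = ['ACTION' => '/^(?P<action>.+)$/'];
$good = parseCtcp("\x01ACTION looks\x01", $patterns);
$typo = parseCtcp("\x01ACTIONlooks\x01", $patterns);
```

In this sketch the mistyped message is returned as an unknown command with empty parameters instead of triggering an error.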
If a MODE message is received with a channel target, the parser will then attempt to sub-parse the message. This algorithm basically states "if the mode string contains a 'b', the trailing string is a banmask; else if it contains a 'k', it's a channel key; else..."
However, this assumes that only one mode change is being made. If a message such as MODE #channel +b-o *!*@some.host ChanOpUser is parsed, the string *!*@some.host ChanOpUser is assigned to $params['banmask'], and $params['user'] (where one would ordinarily find the ChanOpUser param) does not exist.
MODE messages may also carry a different number of trailing parameters than modes (e.g. MODE #channel +vvm TrustedUser1 TrustedUser2), so there is no uniform way to reliably associate modes with their respective parameters. (Clients may achieve this by parsing the CHANMODES and PREFIX elements of the 005 RPL_ISUPPORT numeric, which identify which modes take parameters and which don't, but a contextless parser cannot.)
As such, I wonder if it's truly valid for an RFC-compliant parser to be processing the MODE string in this way. At the least, some kind of common named parameter should be made available so that the trailing parameters can be accessed in a consistent way.
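To illustrate why context is needed, here is a sketch of how a client that already knows which modes take a parameter might pair them up. pairModes() is a hypothetical helper, and the $withParam list is simply supplied by the caller; it also glosses over modes that take a parameter only when being set.

```php
<?php
/**
 * Pair each mode change with its parameter.
 * $withParam lists the mode letters assumed to take a parameter; a real
 * client would derive this from the server's 005 tokens.
 */
function pairModes(string $modes, array $args, string $withParam): array
{
    $changes = [];
    $sign = '+';
    foreach (str_split($modes) as $char) {
        if ($char === '+' || $char === '-') {
            $sign = $char; // applies to every following letter until the next sign
            continue;
        }
        $needsArg = strpos($withParam, $char) !== false;
        $changes[] = [
            'mode'  => $sign . $char,
            'param' => $needsArg ? array_shift($args) : null,
        ];
    }
    return $changes;
}

$ban   = pairModes('+b-o', ['*!*@some.host', 'ChanOpUser'], 'bkov');
$voice = pairModes('+vvm', ['TrustedUser1', 'TrustedUser2'], 'bkov');
```

Here $ban pairs +b with *!*@some.host and -o with ChanOpUser, while in $voice the +m change has no parameter at all.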
The specification of the USER command was modified in RFC2812.
RFC1459: USER <username> <hostname> <servername> <realname>
RFC2812: USER <username> <mode> <unused> <realname>
where <mode> is an integer bitmask that can be composed of the following bit fields:
USER_MODE_WALLOPS = 4
USER_MODE_INVISIBLE = 8
(NB. most major ircds will just ignore the mode field and set their own default usermodes on connect anyway; most clients just send 8 or 12 as an unconfigurable default value.)
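A small sketch of how the bitmask might be combined when building an RFC 2812 style USER line. The buildUserCommand() helper is hypothetical and only shows the arithmetic; the constant names follow the ones given above.

```php
<?php
const USER_MODE_WALLOPS = 4;   // bit 2: request user mode 'w'
const USER_MODE_INVISIBLE = 8; // bit 3: request user mode 'i'

/** Build "USER <username> <mode> <unused> :<realname>" with "*" as the unused field. */
function buildUserCommand(string $username, int $mode, string $realname): string
{
    return sprintf("USER %s %d * :%s\r\n", $username, $mode, $realname);
}

// 4 | 8 = 12, one of the common default values mentioned above.
$line = buildUserCommand('bot', USER_MODE_WALLOPS | USER_MODE_INVISIBLE, 'Example Bot');
```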
While using your parser in our project we noticed that the message string requires \r\n to be at the end, for example:
Won't work: string: :user PRIVMSG #channel :message
Will work: string: :user PRIVMSG #channel :message\r\n
In our opinion, this should either be explicitly noted or be made optional.
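Until that is settled, a caller could normalise the terminator before handing a line to the parser. The ensureCrlf() helper below is a hypothetical workaround on the caller's side, not something the library provides.

```php
<?php
/** Make sure a single IRC line ends in exactly one CRLF. */
function ensureCrlf(string $line): string
{
    // Strip any existing CR/LF characters from the end, then append one terminator.
    return rtrim($line, "\r\n") . "\r\n";
}

$ready = ensureCrlf(':user PRIVMSG #channel :message');
```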
Do you remember I reported on #phergie that my bot unexpectedly crashes?
Today I found a solution. The problem is the nickname pattern, which is incorrect.
You can find it here:
$letter = 'a-zA-Z';
$number = '0-9';
$nick = "(?:[$letter][$letter$number\\-\\[\\]\\\\`^{}\\_]*)";
Why is it wrong? Rizon (an IRC network) allows nicknames such as ____, |blah, hello|there and so on.
There is no rule that the first character must be a letter or a number. Several networks allow other characters as well.
The connection freezes silently after such an unmatched message (when using phergie-irc-client-react).
I think the more appropriate pattern would be:
$nick = "(?:[$letter\_\[\]\\^{}|`][$letter$number\_\-\[\]\\^{}|`]*)";
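A quick way to check a pattern of this kind against the nicknames listed above is sketched below. The $special variable and isValidNick() are names introduced for this example. Note that the sketch writes the backslash as \\\\ in the PHP source so that a literal backslash stays in the character class; in a double-quoted string, \\^ reaches the regex engine as \^, which matches only a caret.

```php
<?php
$letter  = 'a-zA-Z';
$number  = '0-9';
// Single-quoted, so the regex engine sees: _\[\]\\^{}|`
$special = '_\\[\\]\\\\^{}|`';

// The first character may be a letter or a special character; digits and "-" only later.
$nick = "(?:[{$letter}{$special}][{$letter}{$number}{$special}\\-]*)";

// Anchored check of a whole nickname against the given nick pattern.
function isValidNick(string $nickPattern, string $candidate): bool
{
    return preg_match('/^' . $nickPattern . '$/', $candidate) === 1;
}

$results = [
    '____'        => isValidNick($nick, '____'),
    '|blah'       => isValidNick($nick, '|blah'),
    'hello|there' => isValidNick($nick, 'hello|there'),
    '9lives'      => isValidNick($nick, '9lives'),
];
```

In this sketch the three Rizon-style nicknames are accepted, while a nickname starting with a digit is still rejected.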
$params['all'] should contain a verbatim copy of the parameters string; however, if the first parameter is a trailing parameter with a leading colon, the colon is chomped by Parser::strip(). This makes it indiscernible from $params['all'] whether the parameters string is a set of individual middle parameters, or one trailing parameter.
The following patch would "correct" this behaviour. It affects 42 of the test parser cases. I can't see that anything that currently relies on $params['all'] would be affected by this change: any use of $params['all'] would already have to be colon-aware, since colons elsewhere in the params string were still retained; it was just leading colons that were being chomped.
diff --git a/src/Parser.php b/src/Parser.php
index 972f190..b299f06 100644
--- a/src/Parser.php
+++ b/src/Parser.php
@@ -407,7 +407,7 @@ class Parser implements ParserInterface
}
// Clean up and store the processed parameters
- $params = array_merge(array('all' => $params[0]), array_filter($params));
+ $params = array_merge(array('all' => ltrim($parsed['params'])), array_filter($params));
$params = $this->removeIntegerKeys($params);
$parsed['params'] = $params;
} elseif (ctype_digit($command)) {
@@ -421,7 +421,7 @@ class Parser implements ParserInterface
$temp = explode(' ', ltrim($parsed['params']), 2);
$parsed['target'] = array_shift($temp);
if ($parsed['params'] = (!empty($temp)) ? (' ' . array_shift($temp)) : '') {
- $all = $this->strip($parsed['params']);
+ $all = ltrim($parsed['params']);
if (strpos($parsed['params'], ' :') !== false) {
list($head, $tail) = explode(' :', $parsed['params'], 2);
} else {