phergie-irc-parser's Introduction

This project is abandoned

This repo is being kept for posterity and will be archived in a readonly state. If you're interested it can be forked under a new Composer namespace/GitHub organization.

phergie/phergie-irc-parser

A PHP-based parser for messages conforming to the IRC protocol as described in RFC 1459 and RFC 2812.

Install

The recommended method of installation is through composer.

{
    "require": {
        "phergie/phergie-irc-parser": "~1"
    }
}

Design goals

  • Minimal dependencies: PHP 5.4.2+ with the core PCRE extension
  • Can extract messages from a real-time data stream
  • Simple, easy-to-understand API

Usage

<?php
$stream = ":Angel PRIVMSG Wiz :Hello are you receiving this message ?\r\n"
        . "PRIVMSG Angel :yes I'm receiving it !receiving it !'u>(768u+1n) .br\r\n";
$parser = new Phergie\Irc\Parser();

// Get one message without modifying $stream
// or null if no complete message is found
$message = $parser->parse($stream);

// Get one message and remove it from $stream
// or null if no complete message is found
$message = $parser->consume($stream);

// Get all messages without modifying $stream
// or an empty array if no complete messages are found
$messages = $parser->parseAll($stream);

// Get all messages and remove them from $stream
// or an empty array if no complete messages are found
$messages = $parser->consumeAll($stream);

/*
One parsed message looks like this:
array(
    'prefix' => ':Angel',
    'nick' => 'Angel',
    'command' => 'PRIVMSG',
    'params' => array(
        'receivers' => 'Wiz',
        'text' => 'Hello are you receiving this message ?',
        'all' => 'Wiz :Hello are you receiving this message ?',
    ),
    'targets' => array('Wiz'),
    'message' => ":Angel PRIVMSG Wiz :Hello are you receiving this message ?\r\n",
)
*/

Tests

To run the unit test suite:

curl -s https://getcomposer.org/installer | php
php composer.phar install
./vendor/bin/phpunit Phergie/Irc/ParserTest.php

License

Released under the BSD License. See LICENSE.

Community

Check out #phergie on irc.freenode.net or e-mail us at [email protected].

Related Projects

phergie-irc-parser's People

Contributors

elazar, julien-c, matthewtrask, renegade334, rpasing, svpernova09, unlobito

phergie-irc-parser's Issues

$params['all'] strips off the leading colon if the first parameter is a trailing parameter

$params['all'] should contain a verbatim copy of the parameters string. However, if the first parameter is a trailing parameter with a leading colon, the colon is chomped by Parser::strip(). This makes it impossible to tell from $params['all'] whether the parameters string is a set of individual middle parameters or a single trailing parameter.

The following patch would "correct" this behaviour. It affects 42 of the parser test cases. I can't see that anything currently relying on $params['all'] would be affected by this change: any use of $params['all'] would already have to be colon-aware, since colons elsewhere in the params string were still retained and only leading colons were being chomped.

diff --git a/src/Parser.php b/src/Parser.php
index 972f190..b299f06 100644
--- a/src/Parser.php
+++ b/src/Parser.php
@@ -407,7 +407,7 @@ class Parser implements ParserInterface
             }

             // Clean up and store the processed parameters
-            $params = array_merge(array('all' => $params[0]), array_filter($params));
+            $params = array_merge(array('all' => ltrim($parsed['params'])), array_filter($params));
             $params = $this->removeIntegerKeys($params);
             $parsed['params'] = $params;
         } elseif (ctype_digit($command)) {
@@ -421,7 +421,7 @@ class Parser implements ParserInterface
                 $temp = explode(' ', ltrim($parsed['params']), 2);
                 $parsed['target'] = array_shift($temp);
                 if ($parsed['params'] = (!empty($temp)) ? (' ' . array_shift($temp)) : '') {
-                    $all = $this->strip($parsed['params']);
+                    $all = ltrim($parsed['params']);
                     if (strpos($parsed['params'], ' :') !== false) {
                         list($head, $tail) = explode(' :', $parsed['params'], 2);
                     } else {
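
The ambiguity described above can be illustrated with a minimal sketch. splitParams() is a hypothetical helper written for this illustration, not part of Phergie\Irc\Parser; it splits a raw parameter string that still carries its colon, which is only possible when the leading colon has been retained.

```php
<?php
// Hypothetical helper (not part of Phergie\Irc\Parser): split a raw IRC
// parameter string into its individual parameters.
function splitParams($raw)
{
    $raw = ltrim($raw, ' ');
    if ($raw === '') {
        return [];
    }
    // A leading ':' means the whole string is a single trailing parameter.
    if ($raw[0] === ':') {
        return [substr($raw, 1)];
    }
    // Otherwise the trailing parameter, if any, starts at the first ' :'.
    $parts = explode(' :', $raw, 2);
    $params = preg_split('/ +/', $parts[0], -1, PREG_SPLIT_NO_EMPTY);
    if (isset($parts[1])) {
        $params[] = $parts[1];
    }
    return $params;
}

var_dump(splitParams(':hello world')); // one trailing parameter
var_dump(splitParams('hello world'));  // two middle parameters
```

Once the colon has been stripped, both inputs collapse to the same string and the two cases can no longer be told apart.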

Crashing on CTCP messages

Today I made a small mistake in an outgoing message, which crashed my bot. Instead of:

$writeStream->ircPrivmsg(
    $chan,
    "\x01ACTION looks\x01"
);

I wrote (no space after ACTION):

$writeStream->ircPrivmsg(
    $chan,
    "\x01ACTIONlooks\x01"
);

This caused the following error:

PHP Warning:  preg_match(): Empty regular expression in /home/bot/bot2/vendor/phergie/phergie-irc-parser/src/Phergie/Irc/Parser.php on line 369
PHP Catchable fatal error:  Argument 1 passed to Phergie\Irc\Parser::removeIntegerKeys() must be of the type array, null given, called in /home/bot/bot2/vendor/phergie/phergie-irc-parser/src/Phergie/Irc/Parser.php on line 371 and defined in /home/bot/bot2/vendor/phergie/phergie-irc-parser/src/Phergie/Irc/Parser.php on line 300
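
As a rough sketch of how malformed CTCP input could be handled defensively, the snippet below extracts the CTCP command and its arguments and returns null when the body is not a delimited CTCP message. parseCtcp() is a hypothetical helper, not the library's code, and it makes no assumption about which commands the parser itself recognises.

```php
<?php
// Hypothetical helper: extract the CTCP command and arguments from a
// message body. Returns null when the body is not a well-formed CTCP
// message, so callers never receive an unexpected value.
function parseCtcp($text)
{
    // \x01 delimits a CTCP message; the command is a run of non-space
    // characters, optionally followed by a space and the arguments.
    if (!preg_match('/^\x01(\S+)(?: (.*))?\x01$/s', $text, $match)) {
        return null;
    }
    return [
        'command' => $match[1],
        'params'  => isset($match[2]) ? $match[2] : '',
    ];
}

var_dump(parseCtcp("\x01ACTION looks\x01")); // command ACTION, params "looks"
var_dump(parseCtcp("\x01ACTIONlooks\x01"));  // unknown command ACTIONlooks, no params
```

A caller can then decide what to do with an unrecognised command such as ACTIONlooks instead of failing on it.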

Valid hostnames

Currently, the parser uses this regex for hostnames:

$host = "$name(?:\\.(?:$name)*)+";

According to RFC 1123, hostnames do not need to contain a dot, but the parser currently requires one. This means that, say, nick!user@localhost is rejected by the parser. (It used to be fudged, since the @ character was accepted as a valid username character and so the username in this case would have been captured as user@localhost, but that behaviour goes away in #20.)

To be more correct (while retaining the non-standard trailing-dot allowance for cases like Freenode's services.), the regex should be

$host = "$name(?:\\.(?:$name)*)*";

but this runs into problems with the prefix parser, since hostnames without a dot might look identical to "naked" nicknames. (Try running the test suite with that change, which reveals the problem.)

In my opinion, the best option would be to add the regex change and re-order the $prefix regex so that nick matching is prioritised before servername matching, so that prefixes that are bare alphanumeric strings are treated as nicks rather than servernames. That would act on the assumption that it's less likely to see an IRC server name itself without a dot, than it is to see a naked nickname appear in a prefix. The protocol is ambiguous, so it's up for comment.
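
The difference between the two patterns, and the resulting ambiguity with bare nicknames, can be seen in a small sketch. The $name pattern here is a simplified assumption made for the example; the library's own definition may differ.

```php
<?php
// Assumed, simplified shortname pattern; the library's own $name may differ.
$name = '[a-zA-Z0-9][a-zA-Z0-9\\-]*';

$current  = "$name(?:\\.(?:$name)*)+"; // at least one dot required
$proposed = "$name(?:\\.(?:$name)*)*"; // dot optional

foreach (['localhost', 'irc.example.org', 'services.', 'Angel'] as $host) {
    printf(
        "%-16s current=%d proposed=%d\n",
        $host,
        preg_match("/^$current\$/", $host),
        preg_match("/^$proposed\$/", $host)
    );
}
```

With the proposed pattern, localhost is accepted, but so is Angel, which is why the ordering of the nick and servername alternatives in the prefix pattern starts to matter.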

Some user hostnames can't be parsed

Hostnames on the Rizon network appear to allow IPv6 address-like strings (i.e. hexadecimal numbers with segments delimited by colons) postfixed by :IP, which the parser presently can't handle.

2015-06-10 13:23:59 NOTICE [email protected] Parser unable to parse line: :hashworks!~hashworks@DCE7E23D:1D6D03E4:2248D1C4:IP PRIVMSG #moonbase :well, this is bad

The associated BNF notation from RFC 2812:

hostname   =  shortname *( "." shortname )
shortname  =  ( letter / digit ) *( letter / digit / "-" ) 
    *( letter / digit ) ; as specified in RFC 1123 [HNAME]

From RFC 1123:

The syntax of a legal Internet host name was specified in RFC-952.

And from RFC-952:

A "name" (Net, Host, Gateway, or Domain name) is a text string up
   to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
   sign (-), and period (.)

We should probably modify the $host pattern to include a case for such hostnames.
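
One possible shape for such a case is sketched below. $cloakedHost is an assumed, deliberately permissive pattern written for this example, not the library's; it simply adds ':' to the characters allowed after the first one.

```php
<?php
// Assumed permissive pattern (not the library's): letters, digits, '-',
// '.' and ':' are allowed in any position after the first character.
$cloakedHost = '[a-zA-Z0-9][a-zA-Z0-9.:\\-]*';

$host = 'DCE7E23D:1D6D03E4:2248D1C4:IP';
var_dump(preg_match("/^$cloakedHost\$/", $host) === 1);
```

Because a prefix ends at the first space, allowing ':' inside the host should not clash with the ':' that introduces a trailing parameter, although a looser pattern also accepts strings that are not valid hostnames under RFC 1123.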

PRIVMSGs sent from nicks ending in an underscore aren't parsed.

Hi there,

I believe I've discovered a bug in this library. When the parser is given a message sent by a nick ending in an underscore, it returns NULL.

I wrote a simple tester for this bug:

<?php
require 'src/Phergie/Irc/ParserInterface.php';
require 'src/Phergie/Irc/Parser.php';

$parser = new \Phergie\Irc\Parser();
var_dump($parser->parse(":[email protected] PRIVMSG #test :test\r\n"));

var_dump($parser->parse(":[email protected] PRIVMSG #test :test\r\n"));

Running this will output the following (truncated for brevity):

NULL
array(7) {
["prefix"]=>
string(31) ":henri!~[email protected]"

Thanks.

parseAll() hangs if an invalid line is received

If a string containing a non-RFC-compliant line is passed to parseAll(), it gets rejected by parse() and this breaks the parseAll() loop, on the assumption that the buffer does not yet contain a complete line. This causes the client to hang.

For example:

$buf = implode('', [
    ":server.name 375 BotNick :server.name message of the day\r\n", // valid line
    ":server.name 372 BotNick :Who left a carriage return \r in here?\r\n", // invalid line
    ":server.name 376 BotNick :End of message of the day.\r\n", // valid line
    ":server.name 251 BotNick :There are", // incomplete line
]);

$parsed = $myParser->consumeAll($buf);
echo str_replace("\r", '', var_export(['$buf' => $buf, '$parsed' => $parsed], true));

outputs:

array (
  '$buf' => ':server.name 372 BotNick :Who left a carriage return  in here?
:server.name 376 BotNick :End of message of the day.
:server.name 251 BotNick :There are',
  '$parsed' => 
  array (
    0 => 
    array (
      'prefix' => ':server.name',
      'servername' => 'server.name',
      'command' => '375',
      'params' => 
      array (
        1 => 'server.name message of the day',
        'all' => 'server.name message of the day',
      ),
      'message' => ':server.name 375 BotNick :server.name message of the day
',
      'code' => 'RPL_MOTDSTART',
      'target' => 'BotNick',
      'tail' => ':server.name 372 BotNick :Who left a carriage return  in here?
:server.name 376 BotNick :End of message of the day.
:server.name 251 BotNick :There are',
    ),
  ),
)

Calling consumeAll() again would return an empty array, and leave the buffer unchanged; it would never progress beyond the invalid line.

Perhaps parseAll() should have a way of failing gracefully if an invalid CRLF-terminated line is received.
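
One way to fail gracefully is to split the buffer on CRLF first and only then decide whether each complete line is valid. The sketch below is an illustration under stated assumptions: isValidLine() is a hypothetical stand-in for the parser's grammar check, and consumeLines() is not part of the library.

```php
<?php
// Hypothetical stand-in for the parser's validity check; a real
// implementation would apply the full message grammar instead.
function isValidLine($line)
{
    // Reject a bare CR or LF inside the line body.
    return strpbrk($line, "\r\n") === false;
}

// Sketch: consume every complete line from $buffer, returning valid and
// invalid lines separately and leaving only the incomplete tail behind.
function consumeLines(&$buffer)
{
    $result = ['valid' => [], 'invalid' => []];
    while (($pos = strpos($buffer, "\r\n")) !== false) {
        $line = substr($buffer, 0, $pos);
        $buffer = (string) substr($buffer, $pos + 2);
        $result[isValidLine($line) ? 'valid' : 'invalid'][] = $line;
    }
    return $result;
}
```

Applied to the buffer above, this would yield the 375 and 376 lines as valid, the 372 line as invalid, and leave only the incomplete 251 line in the buffer, so the loop always makes progress.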

Nickname regular expression is incorrect.

Do you remember that I reported on #phergie that my bot crashes unexpectedly?
Today I found the cause. The problem is the nickname pattern, which is incorrect.
You can find it here:

$letter = 'a-zA-Z';
$number = '0-9';

$nick = "(?:[$letter][$letter$number\\-\\[\\]\\\\`^{}\\_]*)";

Why is it wrong? Rizon (an IRC network) allows nicknames such as
____, |blah, hello|there and so on.

There is no rule that the first character must be a letter or number. Several networks allow other characters as well.
The connection freezes silently after such an unmatched message (when using phergie-irc-client-react).

I think a more appropriate pattern would be:

$nick =  "(?:[$letter\_\[\]\\^{}|`][$letter$number\_\-\[\]\\^{}|`]*)";
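
The effect of relaxing the first-character rule can be checked with a short sketch. $strict reproduces the pattern quoted earlier in this report; $loose is an assumed variant written for this example (it is not the pattern proposed above verbatim) that allows the RFC 2812 "special" characters in first position and escapes the backslash explicitly.

```php
<?php
$letter = 'a-zA-Z';
$number = '0-9';

// Pattern quoted in the report: the first character must be a letter.
$strict = "(?:[$letter][$letter$number\\-\\[\\]\\\\`^{}\\_]*)";

// Assumed permissive variant: the RFC 2812 "special" characters
// [ ] \ ` _ ^ { | } are also allowed in first position.
$special = '\\[\\]\\\\`_^{|}';
$loose = "(?:[$letter$special][$letter$number$special\\-]*)";

foreach (['Angel', '____', '|blah', 'hello|there'] as $nick) {
    printf(
        "%-12s strict=%d loose=%d\n",
        $nick,
        preg_match("/^$strict\$/", $nick),
        preg_match("/^$loose\$/", $nick)
    );
}
```

Only Angel passes the strict pattern, while all four nicknames pass the permissive one.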

RFC2812 USER command specification

The specification of the USER command was modified in RFC2812.

RFC1459: USER <username> <hostname> <servername> <realname>
RFC2812: USER <username> <mode> <unused> <realname>
where <mode> is an integer bitmask that can be composed of the following bit fields:

  • USER_MODE_WALLOPS = 4
  • USER_MODE_INVISIBLE = 8

(NB. most major ircds will just ignore the mode field and set their own default usermodes on connect anyway - most clients just send 8 or 12 as an unconfigurable default value)

ref: https://tools.ietf.org/html/rfc2812#section-3.1.3
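
A minimal sketch of building the RFC 2812 form of the command from these bit fields follows. buildUserCommand() is a hypothetical helper, not part of any Phergie package; the constant names are taken from the list above.

```php
<?php
// Bit values from RFC 2812 section 3.1.3.
const USER_MODE_WALLOPS = 4;
const USER_MODE_INVISIBLE = 8;

// Hypothetical helper: format an RFC 2812 USER command.
function buildUserCommand($username, $realname, $mode = 0)
{
    // The third parameter is unused and conventionally sent as '*'.
    return sprintf("USER %s %d * :%s\r\n", $username, $mode, $realname);
}

echo buildUserCommand('guest', 'Ronnie Reagan', USER_MODE_INVISIBLE | USER_MODE_WALLOPS);
// USER guest 12 * :Ronnie Reagan
```

Combining both flags with a bitwise OR gives 12, one of the default values mentioned above.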

Resilience against invalid lines

Currently, parse() returns NULL on both an incomplete line (no CRLF) and an invalid line (as per regex parsing).

This means that if a line is received where any of the line elements fail their respective regexes, parse() returns NULL. This causes parseAll() and consumeAll() to break, as they never receive any further output from parse(), and the client stops receiving any further input and freezes.

aa2ea5f is one way of solving this issue. It separates incomplete line detection, which is performed with simple string matching at the top of the function, from invalid line detection through preg_match(). Incomplete lines will still return NULL as before, whereas invalid lines will return an array containing the invalid line and the tail.

Such a change would require changes to the client to detect the presence of the 'invalid' key in the returned message array, and log it without emitting it as a message event.

Submitting here as a request for comment, more than anything. I'm on holiday currently, so just stopping by...!

The message string requires \r\n

While using your parser in our project we noticed that the message string requires \r\n to be at the end, for example:

Won't work: string: :user PRIVMSG #channel :message
Will work: string: :user PRIVMSG #channel :message\r\n

In our opinion, this should either be explicitly noted or be made optional.
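
Until then, a caller holding a single complete message can normalise the terminator before parsing. ensureCrlf() is a hypothetical helper for this illustration; it should not be applied to chunks read from a stream, where a missing CRLF means the message is still incomplete.

```php
<?php
// Hypothetical helper: make sure a single, complete message ends with
// CRLF before it is handed to a parser that requires the terminator.
function ensureCrlf($message)
{
    return rtrim($message, "\r\n") . "\r\n";
}

var_dump(ensureCrlf(':user PRIVMSG #channel :message'));
```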

Numeric parsing

The syntax for numeric commands is such that the bot's nickname always appears after the initial numeric:

XXX BotNick Param1 Param2 :ParamTrail

However, the parser isn't configured to extract the bot's nickname within the numeric, and parses it as the first parameter.

(Changing this behaviour will impact any downstream modules that process numeric parameters.)
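
The split being asked for could look roughly like the sketch below. splitNumeric() is a hypothetical helper that works on the part of the line following the prefix; it is not how the parser is implemented.

```php
<?php
// Hypothetical helper: split the part of a numeric reply that follows the
// prefix into code, target and remaining parameter string.
function splitNumeric($line)
{
    if (!preg_match('/^(\d{3}) (\S+)(?: (.*))?$/', $line, $m)) {
        return null;
    }
    return [
        'code'   => $m[1],
        'target' => $m[2],
        'params' => isset($m[3]) ? $m[3] : '',
    ];
}

print_r(splitNumeric('372 BotNick :- Welcome'));
```

Here BotNick ends up under its own 'target' key and the remaining parameters start at the text that follows it.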

ZNC buffer playback messages are not picked up on

ZNC distributes messages like "Buffer playback started" to the channel using nickname ***. The parser doesn't like * in the nickname and therefore marks the message as invalid:

:***[email protected] PRIVMSG #channel:Buffer Playback...

Adding * to the list of special characters works, but I'm not sure if this is a valid solution.

Channel MODE parsing and multiple mode changes

If a MODE message is received with a channel target, the parser will then attempt to sub-parse the message. This algorithm basically states "if the mode string contains a 'b', the trailing string is a banmask; else if it contains a 'k', it's a channel key; else..."

However, this assumes that only one mode change is being made. If a message such as MODE #channel +b-o *!*@some.host ChanOpUser is parsed, the string *!*@some.host ChanOpUser is assigned to $params['banmask'], and $params['user'] (where one would ordinarily find the ChanOpUser param) does not exist.

MODE messages may also have a different number of trailing parameters than modes (e.g. MODE #channel +vvm TrustedUser1 TrustedUser2), so there is no uniform way to reliably associate modes with their respective parameters. (Clients may achieve this by parsing the CHANMODES and PREFIX elements of the 005 RPL_ISUPPORT numeric, which identify which modes take parameters and which don't, but a contextless parser cannot.)

As such, I wonder if it's truly valid for an RFC-compliant parser to be processing the MODE string in this way. At the least, some kind of common named parameter should be made available so that the trailing parameters can be accessed in a consistent way.
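
To show what a context-aware consumer could do with a consistently exposed parameter list, here is a minimal sketch. pairModes() is a hypothetical helper, and $withParam (the set of modes that take a parameter) is an assumption supplied by the caller, for example derived from the server's 005 numeric. It deliberately ignores modes that take a parameter only when being set.

```php
<?php
// Hypothetical helper: pair each mode change with its parameter, given the
// set of modes that take one. $withParam is supplied by the caller.
function pairModes($modeString, array $params, $withParam)
{
    $sign = '+';
    $changes = [];
    foreach (str_split($modeString) as $char) {
        if ($char === '+' || $char === '-') {
            $sign = $char;
            continue;
        }
        $takesParam = strpos($withParam, $char) !== false;
        $changes[] = [
            'mode'  => $sign . $char,
            'param' => $takesParam ? array_shift($params) : null,
        ];
    }
    return $changes;
}

print_r(pairModes('+b-o', ['*!*@some.host', 'ChanOpUser'], 'bkov'));
```

With 'bkov' as the parameter-taking set, +b is paired with *!*@some.host and -o with ChanOpUser, which is the association the single-mode heuristic cannot make.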
