Comments (9)

SirUli commented on September 25, 2024

Alright, regarding writing the file: I found that files are opened for reading with :utf8 but are not written with it. See https://github.com/aichaos/rivescript-perl/blob/master/lib/RiveScript.pm#L1758:

open ($fh, ">", $file) or die "Can't write to $file: $!";

Replace it with:

open ($fh, ">:utf8", $file) or die "Can't write to $file: $!";
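The effect of the layer can be checked in isolation. The following is only a sketch, not RiveScript code: the helper name bytes_written is hypothetical, and it assumes no global I/O defaults (such as the open pragma or PERL_UNICODE) are in effect. It writes the same character string once without a layer and once with ">:utf8" and compares the file sizes:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to a temporary file using the given
# open mode and return the number of bytes that ended up on disk.
sub bytes_written {
    my ($mode, $text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;
    open(my $fh, $mode, $path) or die "Can't write to $path: $!";
    print {$fh} $text;
    close $fh or die "Can't close $path: $!";
    return -s $path;
}

my $text = "k\x{fc}che";                      # "küche" as 5 characters

my $raw  = bytes_written('>', $text);         # 5: ü is written as the single byte 0xFC
my $utf8 = bytes_written('>:utf8', $text);    # 6: ü is written as the two bytes 0xC3 0xBC

print "without layer: $raw bytes, with :utf8: $utf8 bytes\n";
```

Without the layer, the ü ends up on disk as a Latin-1 byte, which is not valid UTF-8 when the file is read back with :utf8.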

After changing that and running the above code, the result is:

// Written by RiveScript::deparse()
! version = 2.0

+ küchentemperatur
- In der Küche ist es warm.

+ wie warm ist es in der küche
- Die temperatur in der Küche beträgt 27 Grad

At least the initial sentence is now okay again in the file, and the file can be reused. Only the streamed data is now corrupted, but that could be worked around by converting the text before it is passed to stream():

sub
utf8ToLatin1($)
{
  my ($s)= @_;
  $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
  return $s;
}

# Stream in some RiveScript code.
$rs->stream (utf8ToLatin1(q~
  + küchentemperatur
  - In der Küche ist es warm.
~));

Which then finally gives me test-write.rive as:

// Written by RiveScript::deparse()
! version = 2.0

+ küchentemperatur
- In der Küche ist es warm.

+ wie warm ist es in der küche
- Die temperatur in der Küche beträgt 27 Grad

But now rivescript doesn't recognize the "küchentemperatur" anymore:

You> küchentemperatur
Bot> ERR: No Reply Matched

The other command of course is still recognized. When looking at the deparsed code (print Dumper($rs->deparse());) this is also obvious:

$VAR1 = {
          'begin' => {
                       'array' => {},
                       'triggers' => {},
                       'that' => {},
                       'person' => {},
                       'sub' => {},
                       'global' => {},
                       'var' => {}
                     },
          'inherit' => {},
          'topic' => {
                       'random' => {
                                     "wie warm ist es in der k\x{fc}che" => {
                                                                        'reply' => [
                                                                                     "Die temperatur in der K\x{fc}che betr\x{e4}gt 27 Grad"
                                                                                   ]
                                                                      },
                                     'k▒chentemperatur' => {
                                                             'reply' => [
                                                                          'In der K▒che ist es warm.'
                                                                        ]
                                                           }
                                   }
                     },
          'include' => {},
          'that' => {}
        };

The two triggers are not "wrong" in the same way: one is stored with \x{fc} escapes, the other with raw bytes. So that wasn't the solution. For now I have decided not to use the streaming function, but characters in the Latin-1 Supplement are only displayed correctly if I use something like this:

sub
latin1ToUtf8($)
{
  my ($s)= @_;
  $s =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
  return $s;
}

sub
utf8ToLatin1($)
{
  my ($s)= @_;
  $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
  return $s;
}
my $reply = latin1ToUtf8($rs->reply ('localuser',utf8ToLatin1($msg)));

Basically I send my message as Latin-1 and get Latin-1 back, although I pass utf8 => 1. Am I misunderstanding something?
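For comparison, the same byte-level conversions can probably be expressed with the core Encode module instead of hand-written regular expressions. This is a minimal sketch, assuming the inputs are byte strings; the function names latin1_to_utf8 and utf8_to_latin1 are hypothetical and not part of RiveScript:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Re-encode Latin-1 bytes as UTF-8 bytes.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

# Re-encode UTF-8 bytes as Latin-1 bytes.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    return encode('ISO-8859-1', decode('UTF-8', $bytes));
}

my $latin1 = "K\xFCche";        # "Küche" in Latin-1 (5 bytes)
my $utf8   = "K\xC3\xBCche";    # "Küche" in UTF-8   (6 bytes)

print "round trip ok\n"
    if latin1_to_utf8($latin1) eq $utf8
    && utf8_to_latin1($utf8)   eq $latin1;
```

Unlike the regex versions, this goes through a decoded character string in the middle, which is the form Perl's string functions expect to work on.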

from rivescript-perl.

kirsle commented on September 25, 2024

In the Data::Dumper output, iirc the \x{fc} syntax is what you should be seeing, and the other is because Perl doesn't know the data is UTF-8. Perl internally has a unicode bit on its data structures that lets it know whether a string is UTF-8 or not. If the bit isn't set, Perl tries to print it as-is, and you see it correctly on your terminal because your terminal emulator understands the character.

A way to test is to try doing length() on the string. If you get the number of characters back, Perl knows it's UTF-8. If you get a length longer than you expect, Perl doesn't know it's UTF-8 and it's counting the bytes in the string rather than the characters.
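The length() check can be sketched as follows. The variable names are illustrative only; the example uses the core Encode module to turn UTF-8 bytes into a character string:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "k\xC3\xBCche";             # UTF-8 encoded bytes for "küche"
my $chars = decode('UTF-8', $bytes);    # decoded character string

my $byte_len = length $bytes;           # 6: Perl counts bytes
my $char_len = length $chars;           # 5: Perl counts characters

printf "bytes: %d, characters: %d\n", $byte_len, $char_len;
```

If a string that should contain one ü reports the longer length, it has not been decoded yet.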

The write() function not opening files in UTF-8 mode does look like a bug, though. Reading files from disk should work with UTF-8. When you're inputting a message from the user for matching, make sure that the input message is a correct UTF-8 string. If you're reading from STDIN, you may need to call binmode(STDIN, ":utf8") to set it to UTF-8 mode. If you're reading from a web browser POST, make sure your server supports UTF-8 and sends the correct headers.
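A small sketch of what the input layer does. To keep it self-contained it reads from an in-memory filehandle instead of STDIN, which is an assumption made for illustration; it also uses ":encoding(UTF-8)", the stricter variant of ":utf8":

```perl
use strict;
use warnings;

# Stand-in for raw bytes arriving on STDIN: "küchentemperatur" in UTF-8.
my $input_bytes = "k\xC3\xBCchentemperatur\n";

open(my $in, '<', \$input_bytes) or die "Can't open in-memory handle: $!";
binmode($in, ':encoding(UTF-8)');   # decode incoming bytes into characters

my $msg = <$in>;
chomp $msg;
close $in;

# $msg is now a character string of 16 characters with a real U+00FC in it.
printf "message has %d characters\n", length $msg;
```

For a real console program the equivalent call would be binmode(STDIN, ":encoding(UTF-8)") before the first read.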

SirUli commented on September 25, 2024

Thanks Noah, that already helps a lot. I'm trying to figure out where the data goes wrong.

Perl & UTF8 is really hard.. :( Thanks for your input on this!

kirsle commented on September 25, 2024

You might have already found this on Google but it seems to be a pretty good resource for handling UTF-8 in Perl: http://ahinea.com/en/tech/perl-unicode-struggle.html

SirUli commented on September 25, 2024

Hi Noah,
I'm getting closer to my goal; it just takes ages to get all these things right. UTF-8 is really hard.

Just a question: when I enable utf8, it seems that punctuation characters are no longer dropped. I suddenly see question marks and exclamation marks in the matching:

[11:20:9] RiveScript: Sorting triggers...
[11:20:9] RiveScript: Sorting reverse triggers for %previous groups...
[11:20:9] RiveScript: Get reply to [TC_TALKTOUSER_ULI] Wie warm ist es im Bad?
[11:20:9] RiveScript: Checking topic random for any %previous's.
[11:20:9] RiveScript: Trying to match "wie warm ist es im bad?" against wie (heisst du|ist dein name) (wie (heisst du|ist dein name))
[11:20:9] RiveScript: Trying to match "wie warm ist es im bad?" against welches geschlecht hast du (welches geschlecht hast du)
[11:20:9] RiveScript: Trying to match "wie warm ist es im bad?" against wie alt bist du (wie alt bist du)
[11:20:9] RiveScript: Trying to match "wie warm ist es im bad?" against wo wohnst du (wo wohnst du)
[11:20:9] RiveScript: Trying to match "wie warm ist es im bad?" against hallo bot (hallo bot)
[11:20:9] RiveScript: Trying to match "wie warm ist es im bad?" against <reply> (undefined)
[11:20:9] RiveScript: Trying to match "wie warm ist es im bad?" against wie warm ist es [@ortsbeschreibung] * (wie warm ist es(?:\s*(?:im|in der|auf dem)\s*|\s*)(.+?))
[11:20:9] RiveScript: Found a match!
[11:20:9] RiveScript: Checking conditionals
[11:20:9] RiveScript:   Left: <star1>; EQ: ==; Right: wohnzimmer
[11:20:9] RiveScript:           Check if "bad?" == "wohnzimmer"
[11:20:9] RiveScript:   Left: <star1>; EQ: ==; Right: bad
[11:20:9] RiveScript:           Check if "bad?" == "bad"
[11:20:9] RiveScript:   Left: <star1>; EQ: ==; Right: büro
[11:20:9] RiveScript:           Check if "bad?" == "büro"
[11:20:9] RiveScript:   Left: <star1>; EQ: ==; Right: küche
[11:20:9] RiveScript:           Check if "bad?" == "küche"
[11:20:9] RiveScript:   Left: <star1>; EQ: ==; Right: kühlschrank
[11:20:9] RiveScript:           Check if "bad?" == "kühlschrank"
[11:20:9] RiveScript:   Left: <star1>; EQ: ==; Right: tiefgefrierfach
[11:20:9] RiveScript:           Check if "bad?" == "tiefgefrierfach"
[11:20:9] RiveScript:   Left: <star1>; EQ: ==; Right: balkon
[11:20:9] RiveScript:           Check if "bad?" == "balkon"
[11:20:9] RiveScript:   Left: <star1>; EQ: ==; Right: schlafzimmer
[11:20:9] RiveScript:           Check if "bad?" == "schlafzimmer"
[11:20:9] RiveScript: Processing responses to this trigger.
[11:20:9] RiveScript: Reply: Keine Ahnung wie warm es in "<star1>" ist.

kirsle commented on September 25, 2024

I have a fix for this that was implemented on the JS version of RiveScript but I haven't gotten to doing it for the Perl version yet.

It adds a unicodePunctuation attribute on the RiveScript object which holds a regexp of common punctuation characters to be stripped out when UTF-8 mode is on.
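The idea can be sketched in a few lines of Perl. This is not the library's implementation: strip_punctuation is a hypothetical helper, and the character class is an assumption taken from the usage example given later in this thread:

```perl
use strict;
use warnings;

# Assumed set of punctuation characters to remove before matching.
my $punctuation = qr/[.,!?;:]/;

# Hypothetical helper: remove every punctuation character from a message.
sub strip_punctuation {
    my ($msg) = @_;
    $msg =~ s/$punctuation//g;
    return $msg;
}

my $clean = strip_punctuation(lc "Wie warm ist es im Bad?");
print "$clean\n";   # wie warm ist es im bad
```

With the question mark removed, <star1> in the log above would capture "bad" rather than "bad?", so the conditional could match.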

aichaos/rivescript-js@789c5c2

I'll try getting it in the Perl version shortly.

SirUli commented on September 25, 2024

Thanks, I finally got it working. The home automation software I was working with passed the message to my module in a strange format.

What helped (extract from a function taking $msg as parameter and returning $reply):

  use RiveScript;
  use Encode qw(decode_utf8 encode_utf8);  # provides decode_utf8/encode_utf8 used below

  my %rivescriptconfig = ('utf8'       => 1);

  # Create a new RiveScript interpreter.
  my $rs = new RiveScript(%rivescriptconfig);

  # Load another file.
  $rs->loadFile ("./test.rive");

  # Stream in some RiveScript code.
  $rs->stream (q~
    + küchentemperatur
    - In der Küche ist es warm.
  ~);

  # Sort all the loaded replies.
  $rs->sortReplies;


  use Data::Dumper;
  print Dumper($rs->deparse());

  $rs->write ('./test-write.rive');

  $msg = decode_utf8( $msg);
  my $reply = $rs->reply ('localuser',$msg);
  $reply = encode_utf8($reply );
  return $reply;

So I'd close this issue. RiveScript works fine so far :) Regarding the punctuation: I handle that in my own function until you push the fix to the Perl module. So take your time!

THANKS for your support!

kirsle commented on September 25, 2024

I uploaded RiveScript v1.42 to CPAN which adds the unicode_punctuation configurable parameter.

Usage is like:

my $bot = new RiveScript(
    utf8 => 1,
    unicode_punctuation => qr/[.,!?;:]/,
);

The default regexp includes the characters listed there; you should only need to set your own regexp if you want other characters to be counted as punctuation and removed.

SirUli commented on September 25, 2024

Thank you!
