leont / file-slurp-sane

A simple, sane and efficient file slurper

Perl 100.00%

file-slurp-sane's Issues

Without utf8_strict, write_text fails with certain long texts

So, straight to the problem. Let's say I have a utf-8 Perl string (for example, a literal string constant in a file with use utf8;). Let's say its utf-8 byte representation is longer than 4096 bytes, and the 4096 byte threshold breaks the byte representation of a character into two parts.

Under these conditions, if I use write_text to write this string, and I don't have PerlIO::utf8_strict installed, I get the following errors:

"\x{00e2}" does not map to utf8 at /run/current-system/sw/lib/perl5/site_perl/5.24.3/File/Slurper.pm line 73.
Close with partial character at /run/current-system/sw/lib/perl5/site_perl/5.24.3/File/Slurper.pm line 73.

I'm attaching the Perl file that manifests this problem. I'm running perl 5, version 24, subversion 3 (v5.24.3) built for x86_64-linux-thread-multi, with File::Slurper version 0.010, under NixOS.

The underlying problem seems to lie inside the Perl IO layers that write_text uses when utf8_strict is not available and no encoding is specified explicitly. In this case, :raw:encoding(utf-8) is used as the IO layer when writing the file, and it fails to handle this kind of output properly. However, if I specify utf8 (without a dash) as the encoding in the write_text call, or just write directly with :raw:encoding(utf8), the problem seems to disappear. (By the way, :raw doesn't influence the behaviour here, so it can be taken out of the equation.)

While this seems to be an issue with core Perl, it's probably appropriate to address it in File::Slurper, either by defaulting to utf8 instead of utf-8, or by declaring utf8_strict as a required dependency.
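
A minimal reproduction sketch of the condition described above, using only core Perl rather than File::Slurper itself. The euro sign, the exact string length, and the helper names straddling_text and roundtrip are assumptions chosen for illustration. Whether the utf-8 layer actually fails depends on the installed Perl and Encode versions, so the script reports the result instead of assuming it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# 4095 one-byte characters followed by U+20AC (bytes E2 82 AC in UTF-8),
# so a 4096-byte boundary falls after the first byte of the euro sign.
sub straddling_text {
    return ('a' x 4095) . "\x{20ac}" . ('b' x 10);
}

# Write $text through the given PerlIO layer, then read it back the same way.
sub roundtrip {
    my ($text, $layer) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open my $out, ">$layer", $path or die "Couldn't open $path: $!";
    print {$out} $text or die "Couldn't write $path: $!";
    close $out or die "Couldn't close $path: $!";

    open my $in, "<$layer", $path or die "Couldn't open $path: $!";
    my $read = do { local $/; <$in> };
    close $in;
    return $read;
}

# Compare the layer reported as failing with the one reported as working.
for my $layer (':raw:encoding(utf-8)', ':raw:encoding(utf8)') {
    my $ok = eval { roundtrip(straddling_text(), $layer) eq straddling_text() };
    printf "%-22s %s\n", $layer, $ok ? 'round-trips' : 'fails';
}
```

On an affected setup the first line should report a failure, possibly with the warnings quoted above on STDERR. On other setups both layers may round-trip.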

write_binary always uses layers :raw:encoding('latin-1'), should be just :raw.

The write_binary code looks like:
sub write_binary { return write_text(@_[0,1], 'latin-1'); }
so it passes an encoding of 'latin-1', with a hyphen.

write_text calls _text_layers which uses encoding and crlf to determine the encoding layer.

The relevant part of _text_layers is:
if ($encoding =~ /^(latin|iso-8859-)1$/i) { return $crlf ? ':unix:crlf' : ':raw'; }

That matches "latin1" but not "latin-1". write_binary does not write using ':raw', but instead ":raw:encoding('latin-1')".

This produces a "wide character" warning for each non-ASCII line written.

To fix, either change your regex, maybe /^(latin-?|iso-8859-)1$/i, or change write_binary to: return write_text(@_[0,1], 'latin1'); , no hyphen.
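
A small check of the two patterns, as a sketch. The first regex is the one quoted from _text_layers and the second is the proposed replacement. The helper name takes_raw_branch and the list of sample encoding names are illustrative assumptions.

```perl
use strict;
use warnings;

# The pattern quoted from _text_layers, and the proposed replacement.
my $current  = qr/^(latin|iso-8859-)1$/i;
my $proposed = qr/^(latin-?|iso-8859-)1$/i;

# Return 1 if $name would take the plain ':raw' branch, 0 otherwise.
sub takes_raw_branch {
    my ($pattern, $name) = @_;
    return $name =~ $pattern ? 1 : 0;
}

# Show which branch each sample encoding name would take under each pattern.
for my $name (qw(latin1 latin-1 Latin-1 iso-8859-1 utf-8)) {
    printf "%-11s current=%d proposed=%d\n", $name,
        takes_raw_branch($current, $name), takes_raw_branch($proposed, $name);
}
```

Only the hyphenated spellings change: they miss the current pattern and match the proposed one, while utf-8 matches neither.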

I could branch and create a pull request but that seems silly for a 1-character fix.

read_text returns empty list for an empty file

Calling read_text in list context normally returns a single scalar, as expected, unless the file in question is empty, in which case it returns an empty list.

$ perl -MFile::Slurper=read_text -E 'my @foo = (read_text("/dev/null")); say 0+@foo'
0

So a statement like localise(read_text($file, "UTF-8"), $locale) passes $locale as the first argument if and only if $file is empty, assuming no prototype on the sub.

The documentation strongly implies that the return value is always a scalar:

Reads file $filename into a scalar
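
One way to guarantee exactly one return value, sketched with core Perl only. slurp_scalar is a hypothetical stand-in, not File::Slurper's code. It relies on the documented readline behaviour that, in scalar context with $/ undefined, an empty file yields '' on the first read.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for a slurper; not File::Slurper's implementation.
# Assigning to a scalar forces scalar context on the readline, so the
# caller receives exactly one value even when calling in list context.
sub slurp_scalar {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Couldn't open $path: $!";
    my $content = do { local $/; <$fh> };
    close $fh;
    return defined $content ? $content : '';
}

# Demo: an empty file still contributes one element to the list.
my ($demo_fh, $demo_path) = tempfile(UNLINK => 1);
close $demo_fh;
my @values = (slurp_scalar($demo_path), 'next-argument');
print scalar(@values), " values\n";
```

On the caller side, wrapping the call as scalar(read_text($file, "UTF-8")) forces scalar context and would be expected to avoid the argument shift, though that has not been checked here against every File::Slurper version.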

read_lines problem in a long running process

I recently had to remove read_lines from a little server I am developing (Phoebe, commit). It used read_lines to read a list of "pages" from an "index" file. When I started or restarted the server, everything worked as intended, the index showed one item per page. After a few hours, the behaviour changed: read_lines returned a single item containing all the pages concatenated with newlines.

Normal operations with an index containing three files:

After a few hours, with the same index:

A\nB\nC

And when I restart the process, it is back to three items.

Sadly, I have no idea how this is possible. I never pass $encoding, $crlf, or $skip_chomp, and read_text also calls _text_layers just like read_lines. I'm just confused about the situation right now and don't know how I should debug or log the issue.
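
One possibility worth ruling out, offered as an assumption and not a confirmed diagnosis: list-context readline splits on whatever the global $/ holds at call time, so if some other code in the process sets $/ to undef without localising it, a line reader that does not set $/ itself returns a single item. The sketch below uses core Perl only; all three function names are hypothetical.

```perl
use strict;
use warnings;

# List-context readline splits on whatever $/ holds at call time.
sub lines_with_current_separator {
    my ($path) = @_;
    open my $fh, '<', $path or die "Couldn't open $path: $!";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    return @lines;
}

# Same read, but insulated from any global change to $/.
sub lines_with_newline_separator {
    my ($path) = @_;
    local $/ = "\n";
    return lines_with_current_separator($path);
}

# Cheap value to log right before the read in a long-running process:
# 'undef', or the separator's character codes joined by dots.
sub describe_separator {
    return defined $/ ? sprintf('%vd', $/) : 'undef';
}
```

Logging describe_separator() next to the number of items returned would show whether the separator has changed by the time the behaviour flips.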

Provide read_*_if_exists() functions

A lot of workflows do not treat nonexistence of a file as an exceptional circumstance. While it’s commonplace for such code to check -e $path before reading, this sets up a race condition: the file can appear or disappear between the check and the open.

It would be useful for File::Slurper to expose read functions that return undef if the given path is nonexistent.
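
A rough sketch of what such a function could look like, using core Perl only. The name read_binary_if_exists and its behaviour are assumptions for illustration; File::Slurper does not provide it. Instead of testing with -e first, it attempts the open and inspects $! afterwards, which avoids the race.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);
use File::Temp qw(tempdir);

# Hypothetical helper: returns the file's bytes, or undef if the path
# does not exist. Any other open failure is still fatal.
sub read_binary_if_exists {
    my ($path) = @_;
    my $fh;
    unless (open $fh, '<:raw', $path) {
        return undef if $! == ENOENT;    # missing file: not an error here
        die "Couldn't open $path: $!";   # permissions etc. still are
    }
    my $content = do { local $/; <$fh> };
    close $fh;
    return defined $content ? $content : '';
}

# Demo: a path inside a fresh temporary directory does not exist.
my $dir     = tempdir(CLEANUP => 1);
my $missing = read_binary_if_exists("$dir/no-such-file");
print defined $missing ? "found\n" : "absent\n";
```

Returning undef explicitly means a list-context caller still gets one element, so a missing file and an empty file ('') remain distinguishable.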

Add a write_lines() function

I just wrote some code and assumed there was a write_lines() function, but it turns out there isn't :-)
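
For illustration, a minimal sketch of what a write_lines() could do, in core Perl. The argument order (path, array reference of lines, optional encoding), the UTF-8 default, and the choice to append "\n" to every line are all assumptions, not an existing File::Slurper interface.

```perl
use strict;
use warnings;

# Hypothetical write_lines: writes each element of @$lines followed by "\n".
# The signature and the UTF-8 default are assumptions for this sketch.
sub write_lines {
    my ($path, $lines, $encoding) = @_;
    $encoding = 'UTF-8' unless defined $encoding;

    open my $fh, ">:raw:encoding($encoding)", $path
        or die "Couldn't open $path: $!";
    print {$fh} map { $_ . "\n" } @{$lines}
        or die "Couldn't write $path: $!";
    close $fh or die "Couldn't close $path: $!";
    return;
}
```

A call such as write_lines($path, ['A', 'B', 'C']) would then produce a file that a chomping line reader turns back into the same three items.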
