Comments (9)

johnkerl commented on May 18, 2024

http://johnkerl.org/miller/doc/internationalization.html

Multi-byte needs to be handled better at some point.

from miller.

johnkerl commented on May 18, 2024

In particular, this task has two components:

  • The known issue is naive use of the C library's strlen, which counts bytes rather than characters; fixing that would resolve the reported issue.
  • Moreover, as elaborated on in http://johnkerl.org/miller/doc/internationalization.html, Miller's multi-byte support currently works by accident. So there also need to be unit-test cases for other functionality, along with fixes for any other issues they uncover beyond the pretty-print/xtab formatting reported here.
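
To illustrate the first point, here is a minimal sketch (not taken from Miller's source) of how far strlen overshoots on UTF-8 input. The helper name utf8_excess_bytes is hypothetical; it relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Hypothetical helper, not from Miller's source: the number of bytes by
 * which strlen() overcounts the characters in a UTF-8 string.
 * Continuation bytes always have the bit pattern 10xxxxxx. */
size_t utf8_excess_bytes(const char *s) {
    size_t excess = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80)
            excess++;
    }
    return excess;
}
```

For a field such as "3º" (bytes 0x33 0xC2 0xBA), strlen returns 3 although only 2 characters are displayed, so a formatter that pads by strlen emits one space too few for each such character.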

Marking as wishlist for now, just relative to other things more pressing -- namely, RFC-CSV and packaging. In the medium term, though, this will be a should-do rather than a wishlist item.

HerbCSO commented on May 18, 2024

Thanks @johnkerl

johnkerl commented on May 18, 2024

Turns out this is super-easy to deal with. Fixed in latest commit.

That handles case 1 above. Case 2, well, is a matter of going and looking for trouble. That's what issues are for ... so I'll consider this one closed.

Your pastebin file also lacks a final newline which I'm not handling well with (default) mmap I/O; tracked on #29.

Thanks for the report! :)

HerbCSO commented on May 18, 2024

Thanks @johnkerl for the quick response!

I just pulled down the latest version from master but it's failing one of the utf8 tests:

mlr --icsv --oxtab cat test/input/utf8-2.csv
1871c1871
< langue   nom      jour
---
> langue    nom       jour
1876c1876
< langue   nom      jour
---
> langue    nom       jour
1886c1886
< vendredi jour
---
> vendredi  jour
make[1]: *** [reg-test] Error 1
make: *** [c] Error 2

I think this is only because the test is just running mlr without the path, so it was using the old executable which was in my path. Not a biggie, I just copied the new binary into my path and then it worked, but probably worth fixing since it is a little confusing at first. ;]

I can also confirm that it works well for my original test case.

However for the larger file where I first noticed it, it's still an issue for some reason. That has data in it that I can't share so I'm trying to construct a test case for it now. I'll share that once I have one.

HerbCSO commented on May 18, 2024

New test file is at http://termbin.com/4fwz

You can download it and test it with:

curl http://termbin.com/4fwz | mlr --nidx --fs tab --opprint cat

My output from running that is the following:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   326  100   326    0     0    580      0 --:--:-- --:--:-- --:--:--   580
1           2          3            4                                                    5
customer_id premise_id secondary_id mail_address_line_1                                  mail_address_line_2
12345       12345      12345        CALLE LA VEGA 14 3º DCHA.                           ASTURIAS
12345       12345      12345        CALLE LUIS ARMIÑAN 19 1º A                         ASTURIAS
12345       12345      12345        CASERIO CARRUEBANO 2 BAJO                            ASTURIAS
12345       12345      12345        THIS IS A REALLY LONG LINE TO ILLUSTRATE THE PROBLEM ASTURIAS

Note that the "ASTURIAS" on the first two lines is still outdented.

johnkerl commented on May 18, 2024

Thanks; it's a bit of a puzzlement why my test cases were OK and yours were not. Nonetheless I'll add yours to the suite. I suspect it's an issue with fprintf(fp, "%.*s", len, string) having the same miscount issue as strlen. This is very easy to fix; just compute utf8_strlen and left/right-pad the excess number of spaces.
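
A rough sketch of that idea follows, under stated assumptions: utf8_strlen is the name used in the comment above, but its body here is assumed, and utf8_pad_spaces and fput_left_padded are hypothetical helpers for illustration. This is not the code from the actual fix.

```c
#include <stddef.h>
#include <stdio.h>

/* Name from the comment above; the body is an assumed implementation.
 * Counts code points by skipping continuation bytes (10xxxxxx). */
size_t utf8_strlen(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: spaces needed to fill a column `width` characters
 * wide, measured in code points rather than bytes. */
size_t utf8_pad_spaces(const char *s, size_t width) {
    size_t n = utf8_strlen(s);
    return n < width ? width - n : 0;
}

/* Hypothetical helper: write s left-aligned, then pad explicitly instead
 * of relying on a printf field width, which is counted in bytes. */
void fput_left_padded(FILE *fp, const char *s, size_t width) {
    fputs(s, fp);
    for (size_t i = utf8_pad_spaces(s, width); i > 0; i--)
        fputc(' ', fp);
}
```

With this approach "3º" padded to a width of 5 gets three trailing spaces, where a byte-counted field width would give it only two. Counting code points is still an approximation of display width: combining marks and double-width characters would need something like wcwidth.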

johnkerl commented on May 18, 2024

Fixed in 2de70a9

Thanks for the bug report! :)

HerbCSO commented on May 18, 2024

Awesome, thanks so much @johnkerl, I can confirm that it works now (including on my larger files)!
