Comments (9)
http://johnkerl.org/miller/doc/internationalization.html
Multi-byte needs to be handled better at some point.
from miller.
In particular, this task has two components:
- The known issue is naive use of the C library's strlen, which counts bytes rather than characters. Fixing that would fix the reported issue.
- Moreover, as elaborated on in http://johnkerl.org/miller/doc/internationalization.html, the Miller multi-byte support currently works by accident. So there additionally need to be unit-test cases for other functionality, plus fixes for any other issues uncovered beyond the pretty-print/xtab formatting reported here.
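To illustrate the first component, here is a minimal sketch (not Miller's actual code) of how a byte count and a character count diverge on UTF-8 text. The helper name utf8_strlen_sketch is hypothetical, and it assumes the input is valid UTF-8. It counts code points, which is not always the same as display columns (double-width or combining characters would need more work).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: count UTF-8 code points
 * by skipping continuation bytes, which all have the bit pattern 10xxxxxx.
 * Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_strlen_sketch(const char *s) {
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) {
            n++; /* lead byte or plain ASCII byte: starts a new character */
        }
    }
    return n;
}
```

For a field such as ARMIÑAN, strlen reports 8 because Ñ occupies two bytes, while the sketch above reports 7. A formatter that pads by the strlen value therefore emits one space too few.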
Marking as wishlist for now, just relative to other things more pressing -- namely, RFC-CSV and packaging. In the medium term, though, this will be a should-do rather than a wishlist item.
Thanks @johnkerl
Turns out this is super-easy to deal with. Fixed in latest commit.
That handles case 1 above. Case 2, well, is a matter of going and looking for trouble. That's what issues are for ... so I'll consider this one closed.
Your pastebin file also lacks a final newline which I'm not handling well with (default) mmap I/O; tracked on #29.
Thanks for the report! :)
Thanks @johnkerl for the quick response!
I just pulled down the latest version from master but it's failing one of the utf8 tests:
mlr --icsv --oxtab cat test/input/utf8-2.csv
1871c1871
< langue nom jour
---
> langue nom jour
1876c1876
< langue nom jour
---
> langue nom jour
1886c1886
< vendredi jour
---
> vendredi jour
make[1]: *** [reg-test] Error 1
make: *** [c] Error 2
I think this is only because the test is just running mlr without the path, so it was using the old executable which was in my path. Not a biggie, I just copied the new binary into my path and then it worked, but probably worth fixing since it is a little confusing at first. ;]
I can also confirm that it works well for my original test case.
However for the larger file where I first noticed it, it's still an issue for some reason. That has data in it that I can't share so I'm trying to construct a test case for it now. I'll share that once I have one.
New test file is at http://termbin.com/4fwz
You can download it and test it with:
curl http://termbin.com/4fwz | mlr --nidx --fs tab --opprint cat
My output from running that is the following:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 326 100 326 0 0 580 0 --:--:-- --:--:-- --:--:-- 580
1 2 3 4 5
customer_id premise_id secondary_id mail_address_line_1 mail_address_line_2
12345 12345 12345 CALLE LA VEGA 14 3º DCHA. ASTURIAS
12345 12345 12345 CALLE LUIS ARMIÑAN 19 1º A ASTURIAS
12345 12345 12345 CASERIO CARRUEBANO 2 BAJO ASTURIAS
12345 12345 12345 THIS IS A REALLY LONG LINE TO ILLUSTRATE THE PROBLEM ASTURIAS
1
Note that the "ASTURIAS" on the first two lines is still outdented.
Thanks; it's a bit of a puzzlement why my test cases were OK and yours were not. Nonetheless I'll add yours to the suite. I suspect it's an issue with fprintf(fp, "%.*s", len, string) having the same miscount issue as strlen. This is very easy to fix; just compute utf8_strlen and left/right-pad the excess number of spaces.
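A rough sketch of the padding idea described above, under some assumptions: the names count_code_points and format_left_aligned are made up for illustration and are not the functions in the actual commit, and the input is assumed to be valid UTF-8. Instead of letting printf compute the field width from bytes, the string is copied as-is and the pad length is derived from the character count.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of UTF-8 code points in s
 * (every byte that is not a 10xxxxxx continuation byte). */
static size_t count_code_points(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Hypothetical formatter: write s into out, left-aligned and padded with
 * spaces to `width` characters (not bytes). Returns the number of bytes
 * written excluding the terminating NUL, or -1 if out is too small. */
int format_left_aligned(char *out, size_t out_size, const char *s, size_t width) {
    size_t bytes = strlen(s);
    size_t chars = count_code_points(s);
    size_t pad = chars < width ? width - chars : 0;
    if (bytes + pad + 1 > out_size) {
        return -1;
    }
    memcpy(out, s, bytes);         /* copy the field unchanged */
    memset(out + bytes, ' ', pad); /* pad by characters, not bytes */
    out[bytes + pad] = '\0';
    return (int)(bytes + pad);
}
```

With this approach a field like 1º A (5 bytes, 4 characters) padded to width 6 gets two trailing spaces, the same as a 4-character ASCII field, so the next column lines up. Right alignment would work the same way with the spaces written first.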
Fixed in 2de70a9
Thanks for the bug report! :)
Awesome, thanks so much @johnkerl, I can confirm that it works now (including on my larger files)!