Comments (7)
Thanks for replying, but I can guarantee that my data is valid UTF-8. I can correctly output it to a file with out_file and to Kafka with out_kafka(_buffered), but not to HDFS with out_webhdfs. I compared the source code of out_file.rb and out_webhdfs.rb and found this difference:
out_file.rb:
Plugin.new_formatter(@format)
out_webhdfs.rb:
include Fluent::Mixin::PlainTextFormatter
The two formatters are different, and the error occurs exactly in Fluent::Mixin::PlainTextFormatter.
I wonder whether Fluent::Mixin::PlainTextFormatter causes the error?
Sincerely
from fluent-plugin-webhdfs.
Could you paste what you did, along with your Fluentd configuration? And do you have a stack trace for that error?
from fluent-plugin-webhdfs.
I have the same problem. Here is how it occurs.
Let's say I tail a file into HDFS using webhdfs.
td-agent.conf:
<source>
type tail
pos_file /home/lvjin/workspace/work/td-agent/pos_files/test_log.pos
format json
path /home/lvjin/workspace/work/log_producer/test_log/test_log.json
tag test_log
</source>
<match test_log>
type webhdfs
host 192.168.1.245
port 50070
path /ehualu/logs/watch_log/watch_log.%Y%m%d.json
output_include_time false
output_include_tag false
flush_interval 10s
</match>
Each line of the tailed file (test_log.json in this demo) is a UTF-8 encoded JSON object like:
{"close_time":"019:00","device_tags":[{"tag":"学习用品"},{"tag":"学校"},{"tag":"大屏"},{"tag":"学校"},{"tag":"汽车"},{"tag":"加油站"},{"tag":"教科书"},{"tag":"汽车"}],"start_time":"08:30","daily_h_traffic":7379,"device_size":"600*2000","device_intr":"位于CBD核心地带","screen_size":56,"device_ratio":"3:4","device_height":40,"visiable_angle":50,"device_resolution ":"1280*720","daily_car_traffic":5861,"visable_distance":10,"is_corner":false,"weekly_price":11452,"geo":[{"province":"安徽","city":"常州","coordinates":[{"lat":"12.96391","lon":"121.48462"}],"district":"锡山区"}],"id":1,"is_disturbed":true,"device_id":"09:2B:DD:6E:AD:F8"}
Here is the stack trace:
2015-12-10 09:15:28 +0800 [warn]: emit transaction failed: error_class=Encoding::UndefinedConversionError error="\"\\xE5\" from ASCII-8BIT to UTF-8" tag="test_log"
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `encode'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `to_json'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `stringify_record'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:115:in `format'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:551:in `block in emit'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:128:in `call'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:128:in `block in each'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:127:in `each'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:127:in `each'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:542:in `emit'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:88:in `emit_stream'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:230:in `receive_lines'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:322:in `call'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:322:in `wrap_receive_lines'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:514:in `call'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:514:in `on_notify'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:347:in `on_notify'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:448:in `call'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:448:in `on_change'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:215:in `run'
Additional info: Ruby 2.1.5, td-agent 2.2.x.
When I change the output from webhdfs to kafka, the problem doesn't occur, and I can read the correct data from Kafka. I think perhaps the webhdfs output plugin causes it. I read the source code a moment ago; I have almost no knowledge of Ruby, but I was wondering whether msgpack is needed here? If I'm wrong, forgive my ignorance. Looking forward to your reply!
Sincerely!
from fluent-plugin-webhdfs.
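The error message in the trace above ("\xE5" from ASCII-8BIT to UTF-8) can be reproduced with plain Ruby, without Fluentd. This is only a sketch under an assumption the thread does not confirm: that the record's strings hold UTF-8 bytes but are tagged as binary (ASCII-8BIT) when they reach `to_json`. The byte sequence below is the UTF-8 encoding of "学校", taken from the sample log line.

```ruby
# UTF-8 bytes for two CJK characters, tagged as binary (ASCII-8BIT).
# Assumption: this mimics how the strings look when they reach to_json.
raw = "\xE5\xAD\xA6\xE6\xA0\xA1".b

# Transcoding binary -> UTF-8 fails on the first byte above 0x7F.
transcode_error = nil
begin
  raw.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  transcode_error = e
end

# Relabeling the same bytes as UTF-8 (no transcoding) succeeds,
# because the bytes themselves form valid UTF-8.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)

puts transcode_error.class
puts relabeled.valid_encoding?
```

If the assumption holds, the data can be valid UTF-8 on disk and still trigger the error, because `encode` treats binary-tagged bytes above 0x7F as having no UTF-8 mapping.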
Your data contains characters that are invalid in UTF-8 (JSON requires valid UTF-8).
I can add an option to ignore and skip such records... but is that what you need?
Another option is to scrub the strings (convert invalid characters to '?').
from fluent-plugin-webhdfs.
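The scrub option mentioned above could look roughly like the following. This is a hypothetical sketch, not the plugin's code: `scrub_record` is an invented helper name, and it relies only on Ruby's standard `String#scrub` (Ruby 2.1+) and the `json` library.

```ruby
require "json"

# Hypothetical helper (not part of any plugin): walk a record and make
# every string valid UTF-8, replacing invalid bytes with `replacement`.
def scrub_record(value, replacement = "?")
  case value
  when String
    utf8 = value.dup.force_encoding(Encoding::UTF_8)
    utf8.valid_encoding? ? utf8 : utf8.scrub(replacement)
  when Hash
    value.each_with_object({}) do |(k, v), out|
      out[scrub_record(k, replacement)] = scrub_record(v, replacement)
    end
  when Array
    value.map { |v| scrub_record(v, replacement) }
  else
    value # numbers, booleans, nil pass through unchanged
  end
end

# One binary-tagged but valid UTF-8 value, one value with a stray byte.
record = { "tag" => "\xE5\xAD\xA6".b, "raw" => "ok\xAE".b, "id" => 1 }
clean = scrub_record(record)
json = JSON.generate(clean)
puts json
```

Valid UTF-8 bytes survive intact under this approach; only bytes that cannot be read as UTF-8 are replaced, so some information is lost for those records.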
I know about Fluent::Mixin::PlainTextFormatter because it is also my project...
PlainTextFormatter uses Ruby's JSON module, while Fluentd's default formatter (used in out_file) uses Yajl (yajl-ruby). Yajl always ignores invalid UTF-8 characters.
from fluent-plugin-webhdfs.
I replaced Ruby's JSON module in PlainTextFormatter with Yajl, and it works well:
record.to_json
replaced by
Yajl.dump(record)
Why did you choose Ruby's JSON module instead of Yajl?
Another question: have you tested the out_webhdfs plugin with non-Latin data, such as Japanese or Chinese?
Anyway, it works well now. Thanks for your help!
Sincerely,
A Ruby noob
from fluent-plugin-webhdfs.
I'm also having this issue. It took me a while to figure out, but I have some raw logs containing bytes that get escaped as '\xAE'.
The character conversion in both out_forward and out_file happens correctly.
This plugin is inconsistent with the others.
Replacing it with "?" isn't really an option for me, because I expect any "garbage" to be propagated through my system.
It would seem that \xAE doesn't get padded or interpreted as \u00AE for some reason. Any single-byte character code above 7F seems to fail at the "to_json" conversion, possibly because it expects a padded/wide character and is given a short one.
I get a "warn" in the logs relating to JSON::GeneratorError, and the record is not emitted.
This is a problem, because I'm now missing records in Hadoop that would otherwise have been written to file.
Because x00-xFF map identically to U+0000-U+00FF, why not allow them as valid Unicode characters, as many others do? I guess this is a bug in the Ruby JSON package, then.
I'm looking into making the above edit, but at this point it may be faster for me to deploy Kafka to my cluster and use that instead.
I hope this can be resolved in a future td-agent package release. Since I'm using the RPM, I seem to be locked into this bug.
Is there a reason you don't use the same Yajl record writer as other plugins?
from fluent-plugin-webhdfs.
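The idea in the comment above, treating each byte x00-xFF as the code point U+0000-U+00FF, corresponds to reading the bytes as ISO-8859-1 (Latin-1). The following is a minimal sketch of that interpretation using only standard Ruby; it is not what the plugin does, and the sample string is made up for illustration.

```ruby
require "json"

# Made-up raw log fragment containing a lone byte above 0x7F.
raw = "log \xAE end".b

# Read every byte as the Latin-1 character with the same code point,
# then transcode to UTF-8 so that JSON generation accepts it.
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)
utf8 = latin1.encode(Encoding::UTF_8)

json = JSON.generate({ "message" => utf8 })

# The original bytes can be recovered by reversing the transcoding.
roundtrip = JSON.parse(json)["message"].encode(Encoding::ISO_8859_1).b

puts utf8.codepoints.include?(0xAE)
puts roundtrip == raw
```

This keeps every byte recoverable, which suits the "propagate the garbage" requirement, but it would turn genuine multi-byte UTF-8 text (such as the Chinese tags earlier in the thread) into mojibake unless the reader reverses the transcoding.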