
Comments (7)

selerite commented on June 4, 2024

Thanks for replying, but I can guarantee my data is valid UTF-8. I can correctly output my data to a file with out_file, and to Kafka with out_kafka(_buffered), but not to HDFS with out_webhdfs. I compared the source code of out_file.rb and out_webhdfs.rb, and I found this difference:
out_file.rb:

Plugin.new_formatter(@format)

out_webhdfs.rb:

include Fluent::Mixin::PlainTextFormatter

The two formatters are different, and the error appears exactly in Fluent::Mixin::PlainTextFormatter.
I wonder whether Fluent::Mixin::PlainTextFormatter causes the error?

sincerely

from fluent-plugin-webhdfs.

tagomoris commented on June 4, 2024

Could you paste what you did, with fluentd's configuration? And do you have stack trace for that error?

selerite commented on June 4, 2024

I have the same problem, and I will show how it occurs.
Let's say I "tail" a file into HDFS using webhdfs.
td-agent.conf

<source>
  type tail
  pos_file /home/lvjin/workspace/work/td-agent/pos_files/test_log.pos
  format json
  path /home/lvjin/workspace/work/log_producer/test_log/test_log.json
  tag test_log
</source>
<match test_log>
  type webhdfs
  host 192.168.1.245
  port 50070
  path /ehualu/logs/watch_log/watch_log.%Y%m%d.json
  output_include_time false
  output_include_tag false
  flush_interval 10s
</match>

Each line of the tailed file (test_log.json in this demo) is a UTF-8-encoded JSON object like:

{"close_time":"019:00","device_tags":[{"tag":"学习用品"},{"tag":"学校"},{"tag":"大屏"},{"tag":"学校"},{"tag":"汽车"},{"tag":"加油站"},{"tag":"教科书"},{"tag":"汽车"}],"start_time":"08:30","daily_h_traffic":7379,"device_size":"600*2000","device_intr":"位于CBD核心地带","screen_size":56,"device_ratio":"3:4","device_height":40,"visiable_angle":50,"device_resolution ":"1280*720","daily_car_traffic":5861,"visable_distance":10,"is_corner":false,"weekly_price":11452,"geo":[{"province":"安徽","city":"常州","coordinates":[{"lat":"12.96391","lon":"121.48462"}],"district":"锡山区"}],"id":1,"is_disturbed":true,"device_id":"09:2B:DD:6E:AD:F8"}

here is the stack trace:

2015-12-10 09:15:28 +0800 [warn]: emit transaction failed: error_class=Encoding::UndefinedConversionError error="\"\\xE5\" from ASCII-8BIT to UTF-8" tag="test_log"
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `encode'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `to_json'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `stringify_record'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:115:in `format'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:551:in `block in emit'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:128:in `call'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:128:in `block in each'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:127:in `each'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:127:in `each'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:542:in `emit'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:88:in `emit_stream'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:230:in `receive_lines'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:322:in `call'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:322:in `wrap_receive_lines'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:514:in `call'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:514:in `on_notify'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:347:in `on_notify'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:448:in `call'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:448:in `on_change'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:215:in `run'

Additional info: ruby 2.1.5, td-agent 2.2.x
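The failure in the stack trace can be reproduced in isolation. Below is a minimal hedged sketch, not the plugin's code: in_tail hands records to the output as raw bytes, so string values can carry the ASCII-8BIT (binary) encoding, and Ruby's JSON module then fails on non-ASCII bytes. The exact error class varies by json gem version (older gems raised Encoding::UndefinedConversionError via String#encode, as in the trace; newer ones raise JSON::GeneratorError), so both are rescued here.

```ruby
require 'json'

# An invalid UTF-8 byte such as \xAE always triggers the failure; on the
# json gem bundled with this era of td-agent, even valid UTF-8 bytes
# tagged as ASCII-8BIT (e.g. "学校".b) failed the same way.
record = { "raw" => "caf\xAE".b }  # String#b tags the string as ASCII-8BIT
begin
  record.to_json
rescue Encoding::UndefinedConversionError, JSON::GeneratorError => e
  puts "to_json raised #{e.class}"
end
```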

When I change the output from webhdfs to Kafka, the problem doesn't occur, and I can read the correct data from Kafka. I think it may be the webhdfs output plugin that causes it. I read the source code a moment ago; I have almost no knowledge of Ruby, but I was wondering whether msgpack is needed here? If I am wrong, forgive my ignorance. Looking forward to your reply!
sincerely!

tagomoris commented on June 4, 2024

Your data contains characters that are invalid in UTF-8 (JSON requires valid UTF-8).
I can add an option to ignore and skip such records... but is that what you need?
Another option is to scrub strings (convert invalid characters to '?').
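The scrub option described here can be sketched with only Ruby's stdlib (this is an illustration, not the plugin's actual code; the helper name `scrub_utf8` is hypothetical, and String#scrub needs Ruby >= 2.1):

```ruby
# Re-tag a string's bytes as UTF-8 and replace any byte sequence that is
# not valid UTF-8 with '?'; valid UTF-8 passes through unchanged.
def scrub_utf8(value)
  return value unless value.is_a?(String)
  value.dup.force_encoding(Encoding::UTF_8).scrub('?')
end

puts scrub_utf8("bad\xAEbyte".b)  # => "bad?byte"
puts scrub_utf8("学习用品")        # => "学习用品" (unchanged)
```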

tagomoris commented on June 4, 2024

I know about Fluent::Mixin::PlainTextFormatter because it's also my product...
PlainTextFormatter uses Ruby's JSON module, while Fluentd's default formatter (used in out_file) uses Yajl (yajl-ruby). Yajl always ignores invalid UTF-8 characters.

selerite commented on June 4, 2024

I replaced Ruby's JSON module in PlainTextFormatter with Yajl, and it works well:

record.to_json

replaced by

Yajl.dump(record)

I am wondering why you chose Ruby's JSON module instead of Yajl?
Another question: have you tested your out_webhdfs plugin with non-Latin data, such as Japanese or Chinese?
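For what it's worth, when the bytes are already valid UTF-8 and merely tagged ASCII-8BIT (as with data read by in_tail), there is also a stdlib-only workaround that keeps Ruby's JSON module: re-tag the encoding before generation. A sketch under that assumption, not the plugin's code:

```ruby
require 'json'

record = { "tag" => "学校".b }  # valid UTF-8 bytes, tagged ASCII-8BIT
fixed = record.transform_values do |v|
  if v.is_a?(String) && v.encoding == Encoding::ASCII_8BIT
    v.dup.force_encoding(Encoding::UTF_8)  # re-tag, no byte changes
  else
    v
  end
end
puts fixed.to_json  # => {"tag":"学校"}
```

Note this only re-labels the encoding; genuinely invalid bytes would still need scrubbing or Yajl's more lenient generator.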

Anyway, it works well now, thanks for your help!

sincerely
A noob of ruby

btwood commented on June 4, 2024

I'm also having this issue. It took me a while to figure out, but I have some raw logs that contain bytes like '\xAE'.
The character conversion in both out_forward and out_file happens correctly.
This plugin is inconsistent with the others.
Replacing invalid characters with "?" isn't really an option for me, because I expect any "garbage" to be propagated through my system.

It would seem that \xAE doesn't get widened or interpreted as \u00AE for some reason. Any one-byte character code above 0x7F fails at the to_json conversion, possibly because a multi-byte sequence is expected and a single byte is given.

I get a "warn" in the logs relating to JSON::GeneratorError, and the record isn't emitted.
This is a problem, because now I'm missing records in Hadoop that would otherwise have been written to file.

Because bytes 0x00-0xFF map one-to-one onto U+0000-U+00FF (the Latin-1 mapping), why not allow them as valid Unicode characters, as many other tools do? I guess this is a bug in the Ruby JSON package, then.
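The mapping described here is exactly what transcoding through ISO-8859-1 (Latin-1) does: its 256 code points correspond one-to-one to U+0000..U+00FF. A short sketch of that interpretation (an illustration of the idea, not something the plugin does):

```ruby
# Interpret a raw byte as ISO-8859-1 and transcode it to UTF-8; byte
# 0xAE becomes code point U+00AE (the registered-trademark sign).
raw  = "\xAE".b                                           # ASCII-8BIT byte
utf8 = raw.encode(Encoding::UTF_8, Encoding::ISO_8859_1)  # byte -> U+00AE
puts utf8               # => "®"
puts utf8.ord.to_s(16)  # => "ae"
```

Whether that interpretation is *correct* for any given log line is another matter; the byte could equally be a fragment of some other encoding, which is presumably why JSON generators refuse to guess.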

I'm looking into making the above edit, but at this point it may be faster for me to deploy kafka to my cluster and use that instead.

I hope this can be resolved in future td-agent package releases. Since I'm using the rpm, I seem to be locked into this bug.

Is there a reason you don't use the same Yajl record writer as other plugins?
