utf8-cleaner's Introduction

UTF8Cleaner

Removes invalid UTF-8 characters from the environment so that your app doesn't choke on them. This prevents errors like "invalid byte sequence in UTF-8".

Installation

Add this line to your application's Gemfile:

gem 'utf8-cleaner'

And then execute:

$ bundle

Or install it yourself as:

$ gem install utf8-cleaner

If you're not running Rails, you'll have to add the middleware to your config.ru:

require 'utf8-cleaner'
use UTF8Cleaner::Middleware

Usage

There's nothing to "use". It just works!

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

Credits

Original middleware author: @phoet - https://gist.github.com/phoet/1336754

utf8-cleaner's Issues

ArgumentError: string contains null byte

Hi! We've been using utf8-cleaner for a bit and it's made a big difference in preventing our bug tracking services from being flooded, so thank you for sharing.

Unfortunately, as soon as our older UTF-8 errors stopped rolling in, we started getting a lot of these "string contains null byte" errors, and utf8-cleaner isn't treating these as invalid strings. Our app is running Rails 5.2, Ruby 2.5.1, and utf8-cleaner 0.2.5.

I created a branch to add a check for this null character %00 to utf8-cleaner and would love to submit a Pull Request if you all would be interested (PR available here). It is rather basic and just adds another regex check for NULL_CHARS = /(%00)/ right after valid_uri_encoded_utf8 checks for INVALID_PERCENT_ENCODING_REGEX.

Before changes:

curl -I https://localhost:5000/customers/somecustomer%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%afWindows%c0%afsystem%c0%aeini%00
HTTP/1.1 500 Internal Server Error
Content-Type: text/html; charset=UTF-8
~> 500 ArgumentError (string contains null byte):

After changes:

curl -I https://localhost:5000/customers/somecustomer%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%af%c0%ae%c0%ae%c0%afWindows%c0%afsystem%c0%aeini%00
HTTP/1.1 301 Moved Permanently
X-Frame-Options: SAMEORIGIN
~> 301 redirect

Reading the previous, still-open issue, I'd considered using a rescue_from as Leon suggested, but to his other point, I believe a fix for any null characters would be right in line with the main purpose of the gem; we're using utf8-cleaner to clean our incoming requests so we can at least handle/route them properly, even if they aren't properly formed or correct. That being said, I'm of course open to any feedback, suggestions, or constructive criticism.
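
The null-byte check described in this issue could look roughly like the sketch below. It is only an illustration: NULL_CHARS mirrors the constant named in the issue, while strip_encoded_nulls is a hypothetical helper name and not part of utf8-cleaner.

```ruby
# Hypothetical sketch, not the gem's actual code.
# NULL_CHARS follows the constant named in the issue; matching is made
# case-insensitive here only for consistency with other percent escapes.
NULL_CHARS = /%00/i

# Remove percent-encoded null bytes from a URI string.
def strip_encoded_nulls(uri_string)
  uri_string.gsub(NULL_CHARS, '')
end

puts strip_encoded_nulls('/customers/somecustomer%00')
```

Removing the sequence rather than rejecting the request matches the goal stated above: the request can still be routed even though it was malformed.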

ActionView::Template::Error: incompatible character encodings: UTF-8 and ASCII-8BIT

Hi there!

I've been using utf8-cleaner for quite a while. To be honest, I don't quite know if it has had any effect in the application. I added it "just in case", given that it seems to be a well-maintained gem and I was experiencing requests with problematic encodings.

Theoretically if I use utf8-cleaner, no request URL encoding should ever cause a 500, right?

Well, I am able to consistently reproduce this in my app:

curl -I `ruby -e "puts %|https://www.myapp.com/foo/bar\?abcdt\=\x80\xC2\\@7ok_id\=130|"`
HTTP/1.1 500 Internal Server Error

(anonymized domain/route/params)

Internally, the error is:

ActionView::Template::Error: incompatible character encodings: UTF-8 and ASCII-8BIT

Unfortunately I cannot reproduce this on my machine; I am able to consistently reproduce it in production though.

Setup 1 (localhost, not reproducible)

Plain Rails server:

curl -I `ruby -e "puts %|http://localhost:3000/foo/bar\?abcdt\=\x80\xC2\\@7ok_id\=130|"`
HTTP/1.1 200 OK

Setup 2 (localhost, not reproducible)

Rails server behind local instance of nginx.

curl -I `ruby -e "puts %|http://localhost:8080/foo/bar\?abcdt\=\x80\xC2\\@7ok_id\=130|"`
HTTP/1.1 200 OK

Setup 3 (production, reproducible)

Cloudflare -> AWS ELB -> nginx -> Rails server

curl -I `ruby -e "puts %|https://www.myapp.com/foo/bar\?abcdt\=\x80\xC2\\@7ok_id\=130|"`
HTTP/1.1 500 Internal Server Error

My point is that maybe Cloudflare/ELB are doing something funny.

Let me know if I can do anything to help debug the issue.

Cheers - Victor

body sanitization

Hi guys,
I'm testing the latest version (0.0.7) and I'm getting the following error:

==> /var/log/passenger.log <==
App 27370 stderr: [ 2014-04-09 06:50:27.2568 27714/0x00000006110868(Worker 1) utils.rb:68 ]: *** Exception NoMethodError in Rack application object (undefined method `reopen' for #<PhusionPassenger::Utils::TeeInput:0x000000075b32c0>) (process 27714, thread 0x00000006110868(Worker 1)):
App 27370 stderr:   from /home/deploy/luxfix/shared/bundle/ruby/1.9.1/gems/utf8-cleaner-0.0.7/lib/utf8-cleaner/middleware.rb:40:in `sanitize_env_rack_input'
App 27370 stderr:   from /home/deploy/luxfix/shared/bundle/ruby/1.9.1/gems/utf8-cleaner-0.0.7/lib/utf8-cleaner/middleware.rb:25:in `sanitize_env'
App 27370 stderr:   from /home/deploy/luxfix/shared/bundle/ruby/1.9.1/gems/utf8-cleaner-0.0.7/lib/utf8-cleaner/middleware.rb:18:in `call'
App 27370 stderr:   from /home/deploy/luxfix/shared/bundle/ruby/1.9.1/gems/railties-3.2.17/lib/rails/engine.rb:484:in `call'
App 27370 stderr:   from /home/deploy/luxfix/shared/bundle/ruby/1.9.1/gems/railties-3.2.17/lib/rails/application.rb:231:in `call'
App 27370 stderr:   from /home/deploy/luxfix/shared/bundle/ruby/1.9.1/gems/railties-3.2.17/lib/rails/railtie/configurable.rb:30:in `method_missing'
App 27370 stderr:   from /home/deploy/.rvm/gems/ruby-1.9.3-p484/gems/passenger-4.0.41/lib/phusion_passenger/rack/thread_handler_extension.rb:74:in `process_request'
App 27370 stderr:   from /home/deploy/.rvm/gems/ruby-1.9.3-p484/gems/passenger-4.0.41/lib/phusion_passenger/request_handler/thread_handler.rb:141:in `accept_and_process_next_request'
App 27370 stderr:   from /home/deploy/.rvm/gems/ruby-1.9.3-p484/gems/passenger-4.0.41/lib/phusion_passenger/request_handler/thread_handler.rb:109:in `main_loop'
App 27370 stderr:   from /home/deploy/.rvm/gems/ruby-1.9.3-p484/gems/passenger-4.0.41/lib/phusion_passenger/request_handler.rb:448:in `block (3 levels) in start_threads'
App 27370 stderr: [ 2014-04-09 06:50:27.2570 27714/0x00000006110868(Worker 1) request_handler/thread_handler.rb:153 ]: Request done.

Is UTF8-cleaner the right solution for this issue?

We've got a lot of user input that occasionally happens to be in the wrong encoding format (I'm not sure how it happens, and I'm not 100% sure why it doesn't get forced into UTF-8 by MongoDB). Recently I've taken to using a forced encode like below on strings that need to be displayed, but as you can imagine this is an untenable solution for anything except the most limited scenarios.

str.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

Is this the kind of situation that UTF8-cleaner was built to fix? Or does it only work for incoming strings?
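
For strings that are already stored, a less lossy variant of the forced encode above can be sketched with String#scrub (Ruby 2.1+). This is an illustration under the assumption that the data is mostly valid UTF-8 with occasional bad bytes; to_valid_utf8 is a hypothetical helper name. Unlike converting from 'binary', which drops every non-ASCII byte, scrub keeps valid multibyte characters and removes only the invalid sequences.

```ruby
# Hypothetical helper: reinterpret the bytes as UTF-8 and drop only the
# byte sequences that are not valid UTF-8.
def to_valid_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub('')
end

dirty = "caf\xE9 ok".b          # \xE9 is Latin-1 "é", invalid as UTF-8 here
clean = to_valid_utf8(dirty)
puts clean                      # "caf ok"
puts clean.valid_encoding?      # true
```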

still getting 'invalid byte sequence in UTF-8'

A client somewhere in China is sending these values to our app, causing the above error:

messages/new?content=Please%2Bcall%2Bme%2Bback%2Bto%2Bdiscuss%E2%80%A6+Result:+chosen+nickname+%22iuggoutleti%22;+registered;+logged+in;+success+-+posted+to+first+encountered+partition+%22/messages/new?content=Please+call+me+back+to+discuss%E2%80%A6%22;+Result:+%D1%A1%D4%F1%D7%A2%B2%E1%D3%C3%BB%A7%C3%FB+%22scuncimmuch%22;+%D7%A2%B2%E1%CD%EA%B3%C9;+%B3%C9%B9%A6;

Should your gem resolve this?

I tried it, but to no avail.

TypeError: expected Hash (got String) for param `_REQUEST'

Installed UTF-8 cleaner, seems to have resolved UTF-8 conversion errors in request, but now I'm seeing the following occasionally. Wondering if anyone else has seen the same?

TypeError: expected Hash (got String) for param `_REQUEST'
gems/rack-1.4.5/lib/rack/utils.rb:127:in normalize_params
    from gems/rack-1.4.5/lib/rack/utils.rb:96:in block in parse_nested_query
    from gems/rack-1.4.5/lib/rack/utils.rb:93:in each
    from gems/rack-1.4.5/lib/rack/utils.rb:93:in parse_nested_query
    from gems/rack-1.4.5/lib/rack/request.rb:332:in parse_query
    from gems/rack-1.4.5/lib/rack/request.rb:186:in GET
    from gems/rack-1.4.5/lib/rack/request.rb:221:in params
    from bundler/gems/remotipart-4b75722c2565/lib/remotipart/middleware.rb:12:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/params_parser.rb:21:in call
    from /mnt/tablesolution-production/current/gems/vesper_ext/lib/vesper_ext/middleware/catch_json_errors.rb:13:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/flash.rb:242:in call
    from gems/rack-1.4.5/lib/rack/session/abstract/id.rb:210:in context
    from gems/rack-1.4.5/lib/rack/session/abstract/id.rb:205:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/cookies.rb:341:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/callbacks.rb:28:in block in call
    from gems/activesupport-3.2.18/lib/active_support/callbacks.rb:405:in _run__1640190139283734739__call__3111014343171907194__callbacks
    from gems/activesupport-3.2.18/lib/active_support/callbacks.rb:405:in __run_callback
    from gems/activesupport-3.2.18/lib/active_support/callbacks.rb:385:in _run_call_callbacks
    from gems/activesupport-3.2.18/lib/active_support/callbacks.rb:81:in run_callbacks
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/callbacks.rb:27:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/remote_ip.rb:31:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/debug_exceptions.rb:16:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/show_exceptions.rb:56:in call
    from gems/lograge-0.2.2/lib/lograge/rails_ext/rack/logger.rb:15:in call_app
    from gems/railties-3.2.18/lib/rails/rack/logger.rb:16:in block in call
    from gems/activesupport-3.2.18/lib/active_support/tagged_logging.rb:22:in tagged
    from gems/railties-3.2.18/lib/rails/rack/logger.rb:16:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/request_id.rb:22:in call
    from gems/rack-1.4.5/lib/rack/methodoverride.rb:21:in call
    from gems/rack-1.4.5/lib/rack/runtime.rb:17:in call
    from gems/activesupport-3.2.18/lib/active_support/cache/strategy/local_cache.rb:72:in call
    from gems/actionpack-3.2.18/lib/action_dispatch/middleware/static.rb:63:in call
    from gems/rack-cache-1.2/lib/rack/cache/context.rb:136:in forward
    from gems/rack-cache-1.2/lib/rack/cache/context.rb:245:in fetch
    from gems/rack-cache-1.2/lib/rack/cache/context.rb:185:in lookup
    from gems/rack-cache-1.2/lib/rack/cache/context.rb:66:in call!
    from gems/rack-cache-1.2/lib/rack/cache/context.rb:51:in call
    from gems/utf8-cleaner-0.0.9/lib/utf8-cleaner/middleware.rb:18:in call
    from gems/railties-3.2.18/lib/rails/engine.rb:484:in call
    from gems/railties-3.2.18/lib/rails/application.rb:231:in call
    from gems/railties-3.2.18/lib/rails/railtie/configurable.rb:30:in method_missing
    from gems/unicorn-4.6.3/lib/unicorn/http_server.rb:552:in process_client
    from gems/unicorn-4.6.3/lib/unicorn/http_server.rb:632:in worker_loop
    from gems/unicorn-4.6.3/lib/unicorn/http_server.rb:500:in spawn_missing_workers
    from gems/unicorn-4.6.3/lib/unicorn/http_server.rb:511:in maintain_worker_count
    from gems/unicorn-4.6.3/lib/unicorn/http_server.rb:277:in join
    from gems/unicorn-4.6.3/bin/unicorn_rails:209:in <top (required)>
    from bin/unicorn_rails:23:in load
    from bin/unicorn_rails:23:in <main>

Params merged if there is invalid encoding

Hi, I started using this gem because it is the easiest way to get rid of the "Invalid % encoding" error, but I ran into a problem.

When I send these requests:

# 1
url?page=1&perPage=5&query=alcohol 70%

# 2
url?page=1&perPage=5&query=alcohol 70

In both cases, the params in the controller are:

<ActionController::Parameters {"page"=>"1", "perPage"=>"5", "query"=>"alcohol 70", "controller"=>"api/v1/search_context", "action"=>"index", "search_context"=>{}} permitted: false>

But when I move the query param to right after the URL, this is what I get:

# 1 (This works just fine)
url?query=alcohol 70&page=1&perPage=5
<ActionController::Parameters {"query"=>"alcohol 70", "page"=>"1", "perPage"=>"5", "controller"=>"api/v1/search_context", "action"=>"index", "search_context"=>{}} permitted: false>

# 2 (The query value gets merged with the page key)
url?query=alcohol 70%&page=1&perPage=5
<ActionController::Parameters {"query"=>"alcohol 70page=1", "perPage"=>"5", "controller"=>"api/v1/search_context", "action"=>"index", "search_context"=>{}} permitted: false>

Is there a way to make it work other than just changing the params order?

Thank you in advance.
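
One possible workaround for the merged-params case above is to escape stray percent signs before the query string is cleaned or parsed, so that a lone "%" becomes "%25" instead of swallowing the "&" that follows. The sketch below is an assumption-laden illustration, not part of utf8-cleaner: escape_stray_percents is a hypothetical helper, and it has not been tested against the gem itself.

```ruby
require 'uri'

# Hypothetical helper: turn any "%" that is not followed by two hex digits
# into "%25" so the following separator is left intact.
def escape_stray_percents(query)
  query.gsub(/%(?![0-9A-Fa-f]{2})/, '%25')
end

fixed = escape_stray_percents('query=alcohol+70%&page=1&perPage=5')
puts fixed
p URI.decode_www_form(fixed)
```

In a Rack app this could be applied to env['QUERY_STRING'] in a small middleware placed ahead of UTF8Cleaner::Middleware, assuming that ordering suits the rest of the stack.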

Allow developers to detect modified input

It would be good to allow developers to choose how they handle invalid input. We could set a header called X-utf8-cleaner-modified-input. This would allow developers to be strict if they wanted and present an error to the client.
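
A rough sketch of how such a flag could work in a Rack middleware is shown below. Everything in it is hypothetical: FlagModifiedInput, FLAG_KEY, and the default cleaner lambda are illustrative names, the cleaner only handles the query string, and the header is exposed under the usual Rack env key form (HTTP_X_UTF8_CLEANER_MODIFIED_INPUT).

```ruby
# Hypothetical middleware sketch; not part of utf8-cleaner.
class FlagModifiedInput
  # Rack exposes request headers as HTTP_* env keys.
  FLAG_KEY = 'HTTP_X_UTF8_CLEANER_MODIFIED_INPUT'

  # The default cleaner is a stand-in that drops invalid UTF-8 bytes.
  def initialize(app, cleaner: ->(s) { s.dup.force_encoding(Encoding::UTF_8).scrub('') })
    @app = app
    @cleaner = cleaner
  end

  def call(env)
    original = env['QUERY_STRING'].to_s
    cleaned = @cleaner.call(original)
    # Compare raw bytes so differing encodings do not affect the result.
    if cleaned.b != original.b
      env['QUERY_STRING'] = cleaned
      env[FLAG_KEY] = 'true'
    end
    @app.call(env)
  end
end
```

A strict application could then check env[FlagModifiedInput::FLAG_KEY] and respond with a 400 instead of processing the cleaned input.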

incompatible character encodings: UTF-8 and ASCII-8BIT

Hi,

Since I updated to 0.0.6, I get a lot of errors like
incompatible character encodings: UTF-8 and ASCII-8BIT

or invalid byte sequence in UTF-8
with GET params like \xE2\xE8\xEB\xFC\xED\xF3\xF1

I can't figure out why.

Versions < 0.0.4 were mangling cookies

Versions prior to 0.0.4 were URI-encoding cookie data, breaking Rails' CSRF verification and probably other things. Please upgrade if you're using one of those versions!

Invalid byte sequence in User-Agent request header

Hi! I'm currently working on a website with a lot of traffic in Latin America.

We've found some agents that send the "User-Agent" header with invalid characters, making our app crash. We rolled out a fix in our app to re-process this header with the corrected string before we noticed this library.

Looking at your middleware class, however, I see that you're not sanitizing the User-Agent header / HTTP_USER_AGENT env key.

Is that on purpose? I'm preparing a PR in case it's desired.
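
The kind of header clean-up described in this issue might be sketched as follows. This is not the gem's implementation: sanitize_headers and HEADER_KEYS are hypothetical names, and which env keys to include is an assumption.

```ruby
# Hypothetical list of Rack env keys to clean; adjust as needed.
HEADER_KEYS = %w[HTTP_USER_AGENT HTTP_REFERER].freeze

# Drop invalid UTF-8 byte sequences from the selected request headers.
def sanitize_headers(env, keys = HEADER_KEYS)
  keys.each do |key|
    value = env[key]
    next unless value.is_a?(String)

    env[key] = value.dup.force_encoding(Encoding::UTF_8).scrub('')
  end
  env
end

env = sanitize_headers('HTTP_USER_AGENT' => "Mozilla\xE1/5.0".b)
puts env['HTTP_USER_AGENT']
```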

data loss on CONTENT_TYPE = multipart

To reproduce, try to upload an image.
A monkeypatch for this (as an initializer) is:

module UTF8Cleaner
  class Middleware
    def call(env)
      if env['CONTENT_TYPE'] && env['CONTENT_TYPE'] =~ /multipart\/form-data/
        @app.call(env)
      else
        @app.call(sanitize_env(env))
      end
    end
  end
end

Please fix this ASAP!

Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT

We have a penetration testing service which bombards our app with traffic intended to discover vulnerabilities. I've found that utf8-cleaner can prevent most of the exceptions we used to see, but there is still one exception that is thrown very often.

…ctivesupport-4.0.13/lib/active_support/core_ext/uri.rb:   15:in `gsub'
…ctivesupport-4.0.13/lib/active_support/core_ext/uri.rb:   15:in `unescape'
                /usr/local/lib/ruby/2.2.0/uri/common.rb:  125:in `unescape'
…gems/utf8-cleaner-0.2.1/lib/utf8-cleaner/uri_string.rb:   88:in `valid_uri_encoded_utf8'
…gems/utf8-cleaner-0.2.1/lib/utf8-cleaner/uri_string.rb:   23:in `valid?'
…gems/utf8-cleaner-0.2.1/lib/utf8-cleaner/uri_string.rb:   15:in `cleaned'
…gems/utf8-cleaner-0.2.1/lib/utf8-cleaner/middleware.rb:   56:in `cleaned_string'
…gems/utf8-cleaner-0.2.1/lib/utf8-cleaner/middleware.rb:   44:in `sanitize_env_rack_input'
…gems/utf8-cleaner-0.2.1/lib/utf8-cleaner/middleware.rb:   30:in `sanitize_env'
…gems/utf8-cleaner-0.2.1/lib/utf8-cleaner/middleware.rb:   21:in `call'

I've found this article which talks about many things including this error:
http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

I'm wondering if there is anything that utf8-cleaner could do to prevent this particular exception.

You guys have any further insight on the subject? Thanks!

doesn't clean post body

Thanks for utf8-cleaner.

utf8-cleaner doesn't clean the body of the request, which can contain invalid UTF-8 too. Do you think it should? Invalid characters in the body can cause errors in our app, just like GET params.

Remove null bytes from JSON requests

JSON strings should not contain null bytes (\x00, \u0000).

This will cause exceptions in Rails - for example when you try to save that string in the database.

It would be nice to remove that invalid character when you are using this gem.
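
Until something like this exists in the gem, null bytes can be removed from a parsed JSON payload in application code. The sketch below is only an illustration: strip_null_bytes is a hypothetical helper, and it cleans string values but leaves hash keys untouched.

```ruby
require 'json'

# Hypothetical helper: recursively remove null bytes from every string
# value in a parsed JSON structure. Hash keys are left as they are.
def strip_null_bytes(value)
  case value
  when String then value.delete("\u0000")
  when Array  then value.map { |item| strip_null_bytes(item) }
  when Hash   then value.transform_values { |item| strip_null_bytes(item) }
  else value
  end
end

# Single quotes keep \u0000 as a JSON escape for the parser to decode.
payload = JSON.parse('{"name":"ab\u0000c","tags":["x\u0000"]}')
p strip_null_bytes(payload)
```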
