
Comments (18)

david-a-wheeler commented on July 3, 2024

Okay, glad that PYTHONUTF8=1 solves the immediate problem.

I think I need to modify flawfinder to note this as an option - so don't close this yet.

Take care!

from flawfinder.

david-a-wheeler commented on July 3, 2024

This looks like the text isn't actually UTF-8 in the file being analyzed. Have you verified that the file being examined actually complies with UTF-8?

If it doesn't comply with UTF-8 (seems likely), see the documentation on various options. Sadly, Python3 doesn't provide good tools for handling non-UTF-8 text files.

kuchungmsft commented on July 3, 2024

Notepad++ thinks the file is UTF-8.

VS Code thinks the file is UTF-8.

Notepad thinks the file is UTF-8.

I think the file does comply with UTF-8.

david-a-wheeler commented on July 3, 2024

Please run "iconv" or some other tool that does byte-by-byte checking.

I think the editors just look at a few lines, and they may accept badly formatted data anyway. Python3 is extremely picky and immediately fails any time the text isn't perfect. Workarounds are documented.
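
For anyone without iconv at hand, a byte-by-byte check can be sketched in a few lines of Python. This is only an illustration and not part of flawfinder; the helper names `first_invalid_utf8_offset` and `check_file` are made up for this example.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        # Strict decoding inspects every byte, unlike an editor's quick guess.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def check_file(path: str):
    """Read the file as raw bytes (no implicit decoding) and validate it."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())
```

A result of None means the whole file decoded cleanly as UTF-8; an integer is the offset of the first offending byte, which can then be inspected in a hex viewer.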

david-a-wheeler commented on July 3, 2024

Also: if the character is just a literal 0x81 byte, that is not valid UTF-8.

kuchungmsft commented on July 3, 2024

My bad, it's character U+0441; I have updated the title.

Tried "iconv"; there is no difference between the original file and the converted one.

david-a-wheeler commented on July 3, 2024

Huh. That doesn't make sense to me at all. The sequence 0xd1 0x81 seems like perfectly fine UTF-8 to me, it shouldn't give you that error message.

So we agree it shouldn't happen. But clearly it's happening anyway :-).

Can you send me a URL for a mishandled file so I can just use curl/wget to get it? Ideally make the test file as small as possible while still causing the problem. I want to reproduce the problem with the smallest possible failing test. If I can reproduce it, I should be able to fix it, or at least explain it & suggest a workaround.

david-a-wheeler commented on July 3, 2024

Oh, weird thought: This is on Windows. Is it possible it's actually being stored as UTF-16? I doubt that's what is going on, but I'm grasping at straws and maybe this is the straw I needed :-).

kuchungmsft commented on July 3, 2024

Here is the link to that file: https://github.com/Kitware/CMake/blob/master/Source/CTest/cmCTestBuildHandler.cxx

I am not sure how to check whether it's stored as UTF-16. It's just a plain text file, and I don't see any header when viewing it in hex.
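
One rough way to check for UTF-16 is to look for a byte order mark (BOM) at the very start of the file. The sketch below is an illustration only, not flawfinder code; `detect_bom` is a hypothetical helper. A file without a BOM can still be UTF-16, so a None result is a hint rather than proof.

```python
import codecs


def detect_bom(data: bytes):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None


def bom_of_file(path: str):
    """Inspect only the first four bytes of the file, read in binary mode."""
    with open(path, "rb") as handle:
        return detect_bom(handle.read(4))
```

For mostly-ASCII source code, UTF-16 without a BOM also tends to show a zero byte next to almost every character in a hex view, which is another quick thing to look for.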

kuchungmsft commented on July 3, 2024

This character is what you need to reproduce the issue.

image

david-a-wheeler commented on July 3, 2024

You posted an image showing the file. However, I need the file contents itself. Can you post it somewhere (ideally shortened) & share the URL to it? A small snippet would be best for my purposes.

david-a-wheeler commented on July 3, 2024

Oh, whups, you did provide a link. Thank you.

I ran it on macOS and it worked just fine. Below is the output.

Ugh, it seems to be a Windows 10 specific thing. I don't have any of those platforms.
I want to fix it, but I have to be able to reproduce it. Any ideas?

python3 ./flawfinder.py cmCTestBuildHandler.cxx 
Flawfinder version 2.0.19, (C) 2001-2019 David A. Wheeler.
Number of rules (primarily dangerous function names) in C/C++ ruleset: 222
Examining cmCTestBuildHandler.cxx

FINAL RESULTS:

cmCTestBuildHandler.cxx:6204:  [2] (misc) open:
  Check when opening files - can an attacker redirect it (via symlinks),
  force the opening of special file type (e.g., device files), move things
  around to create a race condition, control its ancestors, or change its
  contents? (CWE-362).

ANALYSIS SUMMARY:

Hits = 1
Lines analyzed = 6240 in approximately 0.28 seconds (22118 lines/second)
Physical Source Lines of Code (SLOC) = 364
Hits@level = [0]   0 [1]   0 [2]   1 [3]   0 [4]   0 [5]   0
Hits@level+ = [0+]   1 [1+]   1 [2+]   1 [3+]   0 [4+]   0 [5+]   0
Hits/KSLOC@level+ = [0+] 2.74725 [1+] 2.74725 [2+] 2.74725 [3+]   0 [4+]   0 [5+]   0
Minimum risk level = 1

Not every hit is necessarily a security vulnerability.
You can inhibit a report by adding a comment in this form:
// flawfinder: ignore
Make *sure* it's a false positive!
You can use the option --neverignore to show these.

There may be other security vulnerabilities; review your code!
See 'Secure Programming HOWTO'
(https://dwheeler.com/secure-programs) for more information.

kuchungmsft commented on July 3, 2024

Yeah, I couldn't repro it using WSL Ubuntu. It looks like this issue is not easy to tackle. I wonder if detecting the encoding with chardet before opening the file is an acceptable solution?

https://stackoverflow.com/questions/36303919/python-3-0-open-default-encoding

https://peps.python.org/pep-0597/

https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function

david-a-wheeler commented on July 3, 2024

Hmm, it appears that on Windows the default encoding isn't what files use. That seems like a bug in the Windows implementation. I'd prefer flawfinder to NOT always assume UTF-8, because some systems don't use UTF-8. See: https://peps.python.org/pep-0597/ - it seems the "solution" is that people writing code are supposed to magically know what the file encoding is from users. That's ridiculous. I have no magic available. I need users to tell me what the encoding is, and use the default if they don't specify something.
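
As a sketch of the behaviour described here (use the encoding the user supplies, otherwise the platform default), something like the following would do. `read_text` and its `encoding` parameter are hypothetical names for illustration; this is not how flawfinder is actually implemented.

```python
import locale


def read_text(path, encoding=None):
    """Read a text file with the caller's encoding, else the platform default."""
    if encoding is None:
        # Assumption for this sketch: "the default" means the locale's
        # preferred encoding, which differs between platforms.
        encoding = locale.getpreferredencoding(False)
    with open(path, encoding=encoding) as handle:
        return handle.read()
```

Passing the encoding explicitly keeps the choice with the user, while callers who say nothing get whatever the platform prefers.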

david-a-wheeler commented on July 3, 2024

Hmm, it appears you're trying to process UTF-8 files, but the Windows default is NOT UTF-8, and that's the mismatch.

Try this:

python3 -X utf8 flawfinder.py ....

or set PYTHONUTF8 to 1. In a shell do this:

On Linux / macOS:
export PYTHONUTF8=1

On Windows (cmd.exe):
set PYTHONUTF8=1

... then run flawfinder.
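
The mismatch can be reproduced on any platform by decoding the two bytes in question with both encodings. This sketch assumes the Windows default code page involved is cp1252, which is not stated anywhere in this thread; other Windows locales use other code pages.

```python
raw = b"\xd1\x81"  # the byte sequence discussed earlier in this thread

# Decoded as UTF-8, the two bytes form a single character.
as_utf8 = raw.decode("utf-8")
print(f"U+{ord(as_utf8):04X}")  # prints U+0441

# Decoded as cp1252 (assumed Windows default), byte 0x81 has no mapping.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc
print(cp1252_error)
```

With UTF-8 mode enabled, Python reads files as UTF-8 regardless of the code page, which is why the same file then decodes without error.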

kuchungmsft commented on July 3, 2024

That CMake repo has Tests/RunCMake/CommandLine/cmake_depends/test_UTF-16LE.h in UTF-16 encoding. If I force UTF-8 with set PYTHONUTF8=1, I get an encoding error on that file.

Is there a way to exclude certain folders that contain non-product code, i.e., in this case the Tests folder?

david-a-wheeler commented on July 3, 2024

There isn't an --exclude option, though that's not a bad idea. However, you can expressly list just the files and/or directories to scan, so just be more explicit about it.

However: can you tell me if PYTHONUTF8=1 resolves the problem with cmCTestBuildHandler.cxx ? If it does, then we're at least making progress.

Flawfinder doesn't have a way of scanning different files with different encodings. Most software developers wouldn't want to do that. If you have to do that, I suggest making a copy, changing all the source files to some consistent encoding, and then analyzing them.
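
A minimal sketch of the "make a copy in one consistent encoding" suggestion, assuming you already know each file's source encoding. `transcode_to_utf8` is a hypothetical helper, not something flawfinder provides.

```python
from pathlib import Path


def transcode_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Write a UTF-8 copy of src at dst, leaving the original untouched."""
    # Work on raw bytes so line endings are preserved exactly.
    text = src.read_bytes().decode(source_encoding)
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_bytes(text.encode("utf-8"))
```

For example, a UTF-16 header could be copied with `transcode_to_utf8(Path("in.h"), Path("copy/in.h"), "utf-16")`, where both paths are placeholders, and the analysis run on the copy directory.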

kuchungmsft commented on July 3, 2024

Yes, PYTHONUTF8=1 resolves problem with cmCTestBuildHandler.cxx, thanks. The suggestion to make a copy and have a consistent encoding would not work for me because test_UTF-16LE.h is meant to validate that CMake can handle UTF-16, just like compilers can handle inconsistent encoding of source files.

I guess I can work around it by analyzing only the Source folder instead of the entire repo. Thanks a lot for your help.
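
Since there is no --exclude option, one way to build an explicit file list that skips a folder is sketched below. `collect_sources`, its default excluded folder name, and the suffix list are all assumptions made for this example.

```python
import os


def collect_sources(root, excluded_dirs=("Tests",),
                    suffixes=(".c", ".h", ".cpp", ".cxx", ".hpp")):
    """Return source files under root, skipping any excluded directory names."""
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded_dirs)
        for name in sorted(filenames):
            if name.endswith(suffixes):
                selected.append(os.path.join(dirpath, name))
    return selected
```

The resulting paths could then be passed to flawfinder as explicit file arguments, which gives the same effect as scanning everything except the excluded folders.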
