Giter Club home page Giter Club logo

Comments (29)

bztsrc avatar bztsrc commented on August 24, 2024 1

demonstrated the issue via UBSan.

Which means absolutely nothing. Can you demonstrate the UB on any of the mainstream CPUs? On a real hardware, something more empirical than a known to be buggy sanitizer said so?

not going to go anywhere given that there's no willingness to learn.

On that, we agree. Keep believing that your buggy sanitizer knows better instead of learning how real world CPUs actually work.

Cheers,
bzt

from smu.

karlb avatar karlb commented on August 24, 2024

Thanks for the contribution! I haven't noticed it so far because I didn't realize that I won't get notifications for my repo by default if it is a fork. I'll have a closer look at it the next time I do changes on smu.

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

Thanks for the contribution! I haven't noticed it

No worries, and you're welcome!

BTW, I've used your smu fork as a base for my documentation generator tool, so actually I'm using a modified version of this, and just backported the table generator from gendoc tags to html tags and posted it here to say thanks! :-)

I'll have a closer look at it

Let me know if you encounter any trouble, I'll fix it for you. The thing is, I'm really grateful for your smu fork, it was extremely helpful and invaluable for my project!

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

Assuming this patch is the same as the one posted on the main smu repo, the "alignment markers" are treated as table entries. lowdown, md4c and gfm omits them for example.

And looking at babelmark it seems like every implementation which supports tables omits them as well.

Not sure weather this was an intentional decision or not, so just leaving it out here.

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

Assuming this patch is the same as the one posted on the main smu repo

Yes, it is.

the "alignment markers" are treated as table entries

Not sure what you mean. Alignment markers are not table entries, and they are not treated as such. Those are collected in a bitfield only (called calign), and no td tag is generated for them into the output. This was intentional, because you must not add the second line to the HTML table (first row in MD goes to the first HTML row as th, second just collected for alignment info, and third row in MD goes to the second HTML row).

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

I see, it seems like this happens when there's extra spaces around the alignment markers. Here's a test case:

$ cat test.md
| 1st field | 2nd field |
| :--       |       --: |
| 1st entry | 2nd entry |
$ ./smu-table-patch test.md
<table>
<tr><th>1st field </th><th>2nd field </th></tr>
<tr><td>:--       </td><td>--: </td></tr>
<tr><td>1st entry </td><td>2nd entry </td></tr>
</table>

Here's how it's rendered on github:

1st field 2nd field
1st entry 2nd entry

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

Btw, I haven't verified it but doing bitwise operation on signed type is a minefield of Undefined Behaviour.

I would suggest using unsigned long for calign instead, unless there's some special reason to use signed type.

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

I see, it seems like this happens when there's extra spaces around the alignment markers.

Yeah, I never use that syntax, could be. All it needs a check for space, that's all.

- if(begin < end && (begin[1] == '-' || begin[1] == ':')) {
+ if(begin < end && (begin[1] == '-' || begin[1] == ':' || (begin[1] == ' ' && (begin[2] == '-' || begin[2] == ':')))) {

Btw, I haven't verified it but doing bitwise operation on signed type is a minefield of Undefined Behaviour.

Now why on earth could it be an UD???? But feel free to use an unsigned type if you're afraid. It really doesn't matter, because claign is never used directly, just its bits, and it's never compared to any other value. And sign extension (which isn't an UD, rather a very well defined behaviour) doesn't matter here either because the shift value is explicitly checked before the shift.

Cheers,
bzt

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

Here's a version that allows any number of spaces (or tabs) in the second line, not just one. (It also allows some malformed tables, like | --- --- ---: |)

/* smu table parser, Copyright(C) bzt 2022 MIT */

static char intable = 0, inrow, incell;                         /* table state */
static long int calign;

int
dotable(const char *begin, const char *end, int newblock) {
    const char *p;
    int i, l = (int)sizeof(calign) * 4;

    if(*begin != '|')
        return 0;
    if(inrow && (begin + 1 >= end || begin[1] == '\n')) {       /* close cell and row and if ends, table too */
        fprintf(stdout, "</t%c></tr>", inrow == -1 ? 'h' : 'd');
        inrow = 0;
        if(begin + 2 >= end || begin[2] == '\n') {
            intable = 0;
            fputs("\n</table>", stdout);
            return 2;
        }
        return 1;
    }
    for(p = begin; p < end && (*p == '|' || *p == ' ' ||        /* only load cell aligns from 2nd line */
		*p == '\t' || *p == ':' || *p == '-'); p++);
    if(*p == '\r' || *p == '\n') {
        for(i = -1, p = begin; p < end && *p != '\n'; p++)
            if(*p == '|') {
                i++;
                do { p++; } while(p < end && (*p == ' ' || *p == '\t'));
                if(i < l && *p == ':')
                    calign |= 1 << (i * 2);
            } else
            if(i < l && *p == ':')
                calign |= 1 << (i * 2 + 1);
        return p - begin + 1;
    }
    if(!intable) {                                              /* open table */
        intable = 1; inrow = -1; incell = 0; calign = 0;
        fputs("<table>\n<tr>", stdout);
    }
    if(!inrow) {                                                /* open row */
        inrow = 1; incell = 0;
        fputs("<tr>", stdout);
    }
    if(incell)                                                  /* close cell */
        fprintf(stdout, "</t%c>", inrow == -1 ? 'h' : 'd');
    l = incell < l ? (calign >> (incell * 2)) & 3 : 0;          /* open cell */
    fprintf(stdout, "<t%c%s>", inrow == -1 ? 'h' : 'd',
        l == 2 ? " class=\"right\"" : (l == 3 ? " class=\"center\"" : ""));
    incell++;
    for(p = begin + 1; p < end && *p == ' '; p++);
    return p - begin;
}

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

Yeah, I never use that syntax, could be. All it needs a check for space, that's all.

That does the trick, yes.

Now why on earth could it be an UD?

Doing bitwise operation on singed type itself isn't undefined, but it's very easy to trigger it when doing so. For example,

  • Shifting a 1 onto the sign bit is undefined. 1
  • Shifting a negative number left << is undefined. 1
  • Shifting a negative number right >> is implementation defined. 2

That's just some of the many problems with doing bitwise operation on signed types which I can recall off the top of my head. There's plenty more.

Footnotes

  1. https://port70.net/~nsz/c/c11/n1570.html#6.5.7p4 2

  2. https://port70.net/~nsz/c/c11/n1570.html#6.5.7p5

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

Here's a version that allows any number of spaces (or tabs) in the second line, not just one. (It also allows some malformed tables, like | --- --- ---: |)

The result of malformed tables seems to vary among most other implementations (which supports tables). Some accept it, some don't. 1 So I think accepting it is probably fine.

Footnotes

  1. https://babelmark.github.io/?text=%7C+1st+field++++%7C+2nd+field+%7C%0A%7C+---+---+---%3A+%7C+%3A--+---+++%7C%0A%7C+1st+entry++++%7C+2nd+entry+%7C

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

Doing bitwise operation on singed type itself isn't undefined, but it's very easy to trigger it when doing so.

All of these only matter if you're about to reference the shifted value as a signed value, which we do not do here. We just use this variable to store some bits, never do a less than or bigger than relation on the entire value. But add unsigned if you want to, it won't break anything and it won't hurt.

That does the trick, yes.

Cool! I'm glad it works for you! Let's hope this patch gets merged soon.

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

All of these only matter if you're about to reference the shifted value as a signed value

Just the act of shifting a 1 into the sign bit is undefined. Weather the value is referenced or not is not relevant. This can be verified quite easily by using UBsan:

$ cat test.c
int main(void)
{
        long n = 0;
        n |= 1L << 63;
}
$ clang -fsanitize=undefined test.c
$ ./a.out
test.c:4:10: runtime error: left shift of 1 by 63 places cannot be represented in type 'long'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior test.c:4:10 in 

Also the 1s in the shifts need to be suffixed with UL because by default they're ints. E.g:

-      calign |= 1 << (i * 2);
+      calign |= 1UL << (i * 2);

from smu.

karlb avatar karlb commented on August 24, 2024

This looks good and I'll merge it. See #11 for current state.

@bztsrc gendoc looks cool, I'm glad that my contributions were helpful!

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

The following patch (can be applied on top of the latest iteration) should fix any UB due to signed int:

-static char intable = 0, inrow, incell;                         /* table state */
-static long int calign;
-
 int
 dotable(const char *begin, const char *end, int newblock) {
+	static char intable, inrow, incell; /* table state */
+	static unsigned long int calign;
+
 	const char *p;
 	int i, l = (int)sizeof(calign) * 4;
 
@@ -703,10 +703,10 @@ dotable(const char *begin, const char *end, int newblock) {
 				i++;
 				do { p++; } while(p < end && (*p == ' ' || *p == '\t'));
 				if(i < l && *p == ':')
-					calign |= 1 << (i * 2);
+					calign |= 1ul << (i * 2);
 			} else
 				if(i < l && *p == ':')
-					calign |= 1 << (i * 2 + 1);
+					calign |= 1ul << (i * 2 + 1);
 		return p - begin + 1;
 	}
 	if(!intable) {                                              /* open table */

Note that I've also moved the static states inside the function instead of leaving them at file-scope since they're only accessed in there.

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

The following patch (can be applied on top of the latest iteration) should fix any UB due to signed int:

There's no UB here (for the reasons I mentioned), but your patch looks good to me, it won't make any trouble for sure, and it will work even on some broken C compilers too. You should apply it if you're worried about using bits in a signed variable.

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

There's no UB here (for the reasons I mentioned)

There is and I've shown where exactly in the C standard as well as demonstrated the issue via UBSan.

Just saying "there isn't" doesn't make it true. But this conversation is probably not going to go anywhere given that there's no willingness to learn.

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

Hmm, I wonder why this is returning 76.

long
int_shift(void)
{
    long n = 76;
    n |= 1 << 32;
    return n;
}

https://godbolt.org/z/osrKsWfqf <- Are all these compilers buggy? Would you like to open a bug report? Do link to it if you do, because the thread will be entertaining.

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

Are all these compilers buggy?

Nope, it just demonstrates that you don't know what the sign bit is. :-)

FYI, 1 << 32 does not set the sign bit, neither for 32 bit architectures, nor for 64 bit LP64 or LLP64 memory models.

Look, you still have to learn a lot. The C standard was written in a time when processors that could store 0 as well as -0 (negative zero) still existed. Their market share was marginal, but existed. It wasn't that those couldn't shift to the sign bit, but if your code did, then the resulting value was different if you later accessed it as a whole value, therefore your code wasn't portable. Since those processors got extinct several decades ago, and now all CPUs implement negative numbers the same way, this simply isn't the case any more. This means not only that you can safely shift to the sign bit (which you could do anyway), but the resulting whole value is consistent on all architectures too (because all modern CPUs use 2's complement to store negative numbers these days).

But all of this doesn't matter, because this code never references the whole value, so it doesn't matter if the stored bitchunk is actually encodes negative or not. All we care about is the bits, and no CPU has (or ever had) a bit-shifting instruction that would care about sign encoding or that would refuse shifting to the sign bit. There's simply no such instruction, so you can be sure that the C compiler generated code won't misbehave. (Do not believe this because I said so, download and read CPU instruction set specs, like the DDI ARM specs, Intel specs, AMD64 specs, (even Z80 specs, ATMEL specs ...etc.) and see for yourself.)

And I have already said multiple times, add unsigned to your code if you are afraid. It won't make any difference, but feel free to do so.

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

Are all these compilers buggy?

Nope, it just demonstrates that you don't know what the sign bit is. :-)

FYI, 1 << 32 does not set the sign bit, neither for 32 bit architectures, nor for 64 bit LP64 or LLP64 memory models.

I pointed out two issues, the example above demonstrated the later one. Where 1 which is an int is being shifted more than it's width.

Look, you still have to learn a lot. The C standard was written in a time when processors that could store 0 as well as -0 (negative zero) still existed. Their market share was marginal, but existed. It wasn't that those couldn't shift to the sign bit, but if your code did, then the resulting value was different if you later accessed it as a whole value, therefore your code wasn't portable. Since those processors got extinct several decades ago, and now all CPUs implement negative numbers the same way, this simply isn't the case any more. This means not only that you can safely shift to the sign bit (which you could do anyway), but the resulting whole value is consistent on all architectures too (because all modern CPUs use 2's complement to store negative numbers these days).

But all of this doesn't matter, because this code never references the whole value, so it doesn't matter if the stored bitchunk is actually encodes negative or not. All we care about is the bits, and no CPU has (or ever had) a bit-shifting instruction that would care about sign encoding or that would refuse shifting to the sign bit. There's simply no such instruction, so you can be sure that the C compiler generated code won't misbehave. (Do not believe this because I said so, download and read CPU instruction set specs, like the DDI ARM specs, Intel specs, AMD64 specs, (even Z80 specs, ATMEL specs ...etc.) and see for yourself.)

And I have already said multiple times, add unsigned to your code if you are afraid. It won't make any difference, but feel free to do so.

All these details are irrelevant. There are many things which have well defined semantics in assembly but are undefined in C, such as signed overflow and shifting a one into the sign bit.

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

The C standard was written in a time when processors that could store 0 as well as -0 (negative zero) still existed.

Oh and btw, C2x which will be the next C standard still leaves signed overflow and shifting one into the sign bit undefined FYI. 1

C2x is finalized, but you're free to submit a proposal for the next standard for defining these things (which would be a proposal I can get behind). But the fact remains that they're still Undefined in C.

Footnotes

  1. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2596.pdf

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

Where 1 which is an int is being shifted more than it's width.

Obviously, nobody ever questioned that. And? What does this have to do with the sign bit or the negative encoding? Or, for that matter, with my code, which strictly checks if the shifting offset is smaller than the width? You make absolutely no sense here. Just admit you were wrong, and move on.

My last message to you, mind your own words:

not going to go anywhere given that there's no willingness to learn.

Byez,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

which strictly checks if the shifting offset is smaller than the width? You make absolutely no sense here. Just admit you were wrong, and move on.

I'll gladly admit I'm wrong if I were.

    int i, l = (int)sizeof(calign) * 4; // l is set to 32
    /* ... */
    if(i < l && *p == ':') // i is below 32
         calign |= 1 << (i * 2); // oops, `i` multiplied by 2, the shift width is no longer below 32.

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

I'll gladly admit I'm wrong if I were.

Oh, but you most certainly are wrong in this too. On 32 bits, 4 * 4 != 32 in the first place and if compiled on 64 bits, where sizeof(calign) = 8 and 8 * 4 = 32 but then the width is 64 and not 32...

And you're wrong about the sign bit too, look, here is the final ultimate demonstration unquestionably proving that adding unsigned simply does not matter if you just use bits:

signed.c:

#include <stdio.h>
int main(int argc, char **argv)
{
    int n = 3 << 30;
    printf("%x\n", (n >> 30) & 3);
}

unsigned.c:

#include <stdio.h>
int main(int argc, char **argv)
{
    unsigned int n = 3 << 30;
    printf("%x\n", (n >> 30) & 3);
}

Generate Assembly and compare:

$ gcc -S signed.c signed.s
$ gcc -S unsigned.c unsigned.s
$ diff signed.s unsigned.s
1c1
< 	.file	"signed.c"
---
> 	.file	"unsigned.c"
$

See? Absolutely no difference in the generated Assembly no matter unsigned was used or not.

The difference only appears if you reference n as a whole value (where decoding of the bitchunk as a signed or unsigned does matter), but not if you just use bits of it, which was my point, Q.E.D. If you still don't understand why you're wrong, or you can't see that there's simply no difference in the generated machine code, then I can't help you...

Seriously, I have better things to do than teaching someone who is unwilling to learn. Peace! No hard feelings!

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

and if compiled on 64 bits, where sizeof(calign) = 8 and 8 * 4 = 32 but then the width is 64 and not 32...

Width of calign (which is a long) would be 64 bits, not 1 (which is an int) which is what's being shifted: 1 << (i * 2). Not sure what's hard to understand here.

See? Absolutely no difference in the generated Assembly no matter unsigned was used or not.

Undefined behavior doesn't mean it's guaranteed to misbehave, it means there's no guarantee how it will behave. It very well might behave exactly as one expects, but that doesn't exempt it from being UB.

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

Not sure what's hard to understand here.

Same here.

You're totally off. Use a debugger and watch the variables, my code works perfectly, no matter what you say, you're wrong, no out of bounds shifting happens, period.

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

only handles 32 coloumns' alignments (more supported, just always will be left-aligned)

Column 17 and 18 are center aligned where they should've been right aligned.

[nrk smu table]% cat table.md
| 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|--:|--:|
| 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
[nrk smu table]% ./smu < table.md
<table>
<tr><th>01 </th><th>02 </th><th>03 </th><th>04 </th><th>05 </th><th>06 </th><th>07 </th><th>08 </th><th>09 </th><th>10 </th><th>11 </th><th>12 </th><th>13 </th><th>14 </th><th>15 </th><th>16 </th><th>17 </th><th>18 </th></tr>
<tr><td class="center">01 </td><td class="center">02 </td><td class="center">03 </td><td class="center">04 </td><td class="center">05 </td><td class="center">06 </td><td class="center">07 </td><td class="center">08 </td><td class="center">09 </td><td class="center">10 </td><td class="center">11 </td><td class="center">12 </td><td class="center">13 </td><td class="center">14 </td><td class="center">15 </td><td class="center">16 </td><td class="center">17 </td><td class="center">18 </td></tr>
</table>

And with the following patch applied, they work as intended.

                 if(i < l && p[1] == ':') {
-                    calign |= 1 << (i * 2); p++;
+                    calign |= 1ul << (i * 2); p++;
                 }
             } else
             if(i < l && *p == ':')
-                calign |= 1 << (i * 2 + 1);
+                calign |= 1ul << (i * 2 + 1);
         return p - begin + 1;
     }
     if(!intable) {                                              /* open table */
[nrk smu table]% ./smu < table.md
<table>
<tr><th>01 </th><th>02 </th><th>03 </th><th>04 </th><th>05 </th><th>06 </th><th>07 </th><th>08 </th><th>09 </th><th>10 </th><th>11 </th><th>12 </th><th>13 </th><th>14 </th><th>15 </th><th>16 </th><th>17 </th><th>18 </th></tr>
<tr><td class="center">01 </td><td class="center">02 </td><td class="center">03 </td><td class="center">04 </td><td class="center">05 </td><td class="center">06 </td><td class="center">07 </td><td class="center">08 </td><td class="center">09 </td><td class="center">10 </td><td class="center">11 </td><td class="center">12 </td><td class="center">13 </td><td class="center">14 </td><td class="center">15 </td><td class="center">16 </td><td class="right">17 </td><td class="right">18 </td></tr>
</table>

from smu.

bztsrc avatar bztsrc commented on August 24, 2024

And with the following patch applied, they work as intended.

Yeah, on 64 bit Linux, where calign |= 1l << (i * 2); works as well. This has nothing to do with the signedness, it's about the literal's data model and implicit casting.

You claimed there's an UB because of the signed variable. That's not true, and never was. You wrote, let me quote you:

I haven't verified it but doing bitwise operation on signed type is a minefield of Undefined Behaviour.

should fix any UB due to signed int:

And my answer was:

There's no UB here (for the reasons I mentioned), but your patch looks good to me, [...] should apply it if you're worried about using bits in a signed variable.

Now the l suffix has absolutely nothing to do with signedness (u does, and as I've demonstrated it works perfectly without it, doesn't matter). And whether you have to you use l or not, is totally compiler configuration specific. For example, it does absolutely nothing with MinGW and MSVC (which are configured for LLP64 therefore sizeof(int)==sizeof(long)==4).

For a truly universal and portable solution you would have to use int64_t (or uint64_t) and INT64_C() (or UINT64_C()) macros in the first place, there's no other way. I would have used those should smu.c have included stdint.h, but that's not the case.

Cheers,
bzt

from smu.

N-R-K avatar N-R-K commented on August 24, 2024

You claimed there's an UB because of the signed variable.

And I also said that 1 << .. was wrong because it's shifting an int larger than it's width. Which is what the above patch demonstrates. And I already explained this in here

I pointed out two issues, the example above demonstrated the later one. Where 1 which is an int is being shifted more than it's width.

As for shifting into the signed bit, C isn't a "try it and see" language. Just because something appears to be "working", doesn't mean it's actually correct.

And there's absolutely no ambiguity in the C standard when it comes to shifting a signed integer. It's very explicitly called out as UB as a form of arithmetic overflow even in C23. 1

If E1 has a signed type and nonnegative value, and E1 × 2E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.

And it is also very explicitly called out in Annex J 2

— An expression having signed promoted type is left-shifted and either the value of the expres- sion is negative or the result of shifting would not be representable in the promoted type (6.5.7).


But regardless, if you still want to stubbornly keep relying on undefined behaviour, feel free to do so. Doesn't matter here anymore since the issue has been fixed (in this repo's master).

Footnotes

  1. https://www.iso-9899.info/n3047.html#6.5.7

  2. https://www.iso-9899.info/n3047.html#J.

from smu.

Related Issues (5)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.