md4c's People

Contributors

1hyena, andponlin, andreasbaumann, aoloe, craigbarnes, cxw42, dangelog, data-man, davidkorczynski, dominickpastore, dtldarek, dyedgreen, ec1oud, eklitzke, karstenbriksoft, kharacternyk, kkoehne, l1mey112, mity, niblo, niclasr, nmgwddj, perezmeyer, rsms, snej, srinivas32, step-, tcknet, timgates42, tin-pot

md4c's Issues

Heap overflow in function md_build_attribute (md4c.c:1462)

commit cb7ecd7
./md2html --github crash2

In function md_build_attribute (https://github.com/mity/md4c/blob/master/md4c/md4c.c#L1462), raw_size is far too large (4294966501), so the loop reads past the end of the heap buffer.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000404f4b in md_build_attribute (ctx=ctx@entry=0x7fffffffdf30, 
    raw_text=0x62756b "http://meta.math.stackexchan\214\345\246\202\351\234\200\346\222\260\345\206\231\346\226\260\347\250\277\344\273\266\357\274\214\347\202\271\345\207\273\351\241\266\351\203\250\345\267\245\345\205\267\346\240\217\345\217\263\344\276\377\a\232\204 <i class=\"icon-file\"></i> **\346\226\260\346\226\207\347\250\277** \346\210\226\350\200\205\344\275\277\347\224\250\345\277\253\346\215\267\351\224\256 `Ctrl+Alt+N`\343\200\202\r\n\r\n------\r\n\r\n## \344\273\200\344\271\210\346\230\257 Markdown\r\n\r\n"..., 
    raw_size=raw_size@entry=4294966501, flags=<optimized out>, attr=attr@entry=0x7fffffffdd40, 
    build=build@entry=0x7fffffffdce0) at /opt/lxf/md4c/md4c/md4c.c:1462
1462	            if(raw_text[raw_off] == _T('\0')) {
(gdb) bt
#0  0x0000000000404f4b in md_build_attribute (ctx=ctx@entry=0x7fffffffdf30, 
    raw_text=0x62756b "http://meta.math.stackexchan\214\345\246\202\351\234\200\346\222\260\345\206\231\346\226\260\347\250\277\344\273\266\357\274\214\347\202\271\345\207\273\351\241\266\351\203\250\345\267\245\345\205\267\346\240\217\345\217\263\344\276\377\a\232\204 <i class=\"icon-file\"></i> **\346\226\260\346\226\207\347\250\277** \346\210\226\350\200\205\344\275\277\347\224\250\345\277\253\346\215\267\351\224\256 `Ctrl+Alt+N`\343\200\202\r\n\r\n------\r\n\r\n## \344\273\200\344\271\210\346\230\257 Markdown\r\n\r\n"..., 
    raw_size=raw_size@entry=4294966501, flags=<optimized out>, attr=attr@entry=0x7fffffffdd40, 
    build=build@entry=0x7fffffffdce0) at /opt/lxf/md4c/md4c/md4c.c:1462
#1  0x0000000000405446 in md_enter_leave_span_a (ctx=ctx@entry=0x7fffffffdf30, enter=4, 
    type=type@entry=MD_SPAN_A, dest=<optimized out>, dest_size=dest_size@entry=4294966501, 
    prohibit_escapes_in_dest=prohibit_escapes_in_dest@entry=1, title=0x0, title_size=0)
    at /opt/lxf/md4c/md4c/md4c.c:3846
#2  0x000000000040580c in md_process_inlines (ctx=ctx@entry=0x7fffffffdf30, lines=lines@entry=0x6333a8, 
    n_lines=n_lines@entry=1) at /opt/lxf/md4c/md4c/md4c.c:4003
#3  0x0000000000411ce3 in md_process_normal_block_contents (n_lines=1, lines=0x6333a8, ctx=0x7fffffffdf30)
    at /opt/lxf/md4c/md4c/md4c.c:4302
#4  md_process_leaf_block (block=0x6333a0, ctx=0x7fffffffdf30) at /opt/lxf/md4c/md4c/md4c.c:4472
#5  md_process_all_blocks (ctx=0x7fffffffdf30) at /opt/lxf/md4c/md4c/md4c.c:4547
#6  md_process_doc (ctx=0x7fffffffdf30) at /opt/lxf/md4c/md4c/md4c.c:5874
#7  md_parse (
    text=text@entry=0x627250 "# \346\254\242\350\277\216\344\275", '&' <repeats 17 times>, " \347\274\226\350\276\221\351\230\305\350\257\273\345\231\250\r\n\r\n------\r\n\r\n\346\310\221\344\273\254\347\220\206\350\247\r\n", size=size@entry=10251, renderer=renderer@entry=0x7fffffffe1c0, 
    userdata=userdata@entry=0x7fffffffe1a0) at /opt/lxf/md4c/md4c/md4c.c:5935
#8  0x0000000000403aa2 in md_render_html (
    input=input@entry=0x627250 "# \346\254\242\350\277\216\344\275", '&' <repeats 17 times>, " \347\274\226\350\276\221\351\230\305\350\257\273\345\231\250\r\n\r\n------\r\n\r\n\346\310\221\344\273\254\347\220\206\350\247\r\n", input_size=input_size@entry=10251, process_output=process_output@entry=0x402280 <process_output>, 
    userdata=userdata@entry=0x7fffffffe210, parser_flags=<optimized out>, renderer_flags=<optimized out>)
    at /opt/lxf/md4c/md2html/render_html.c:488
#9  0x0000000000401263 in process_file (out=0x7ffff7dd4400 <_IO_2_1_stdout_>, in=0x627010)
    at /opt/lxf/md4c/md2html/md2html.c:139
#10 main (argc=<optimized out>, argv=<optimized out>) at /opt/lxf/md4c/md2html/md2html.c:343
(gdb) p raw_off 
$1 = 187029
(gdb) p raw_text[raw_off]
Cannot access memory at address 0x655000
(gdb) p raw_text[raw_off-1]
$2 = 0 '\000'
(gdb) p raw_size 
$3 = 4294966501

poc file: poc
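
The value 4294966501 reported for raw_size equals 2^32 - 795, which is what an unsigned 32-bit subtraction yields when the end offset is smaller than the start offset. The sketch below only illustrates that arithmetic. The names span_size_unchecked and span_size_checked are hypothetical; this is not md4c's code and not a confirmed diagnosis of the bug.

```c
#include <stdint.h>

/* Hypothetical helper: computes a span length the naive way.
 * If end < beg, the unsigned subtraction wraps around modulo 2^32. */
static uint32_t span_size_unchecked(uint32_t beg, uint32_t end)
{
    return end - beg;
}

/* Hypothetical checked variant: reports failure instead of wrapping. */
static int span_size_checked(uint32_t beg, uint32_t end, uint32_t* p_size)
{
    if(end < beg)
        return -1;
    *p_size = end - beg;
    return 0;
}
```

With beg = 1000 and end = 205, the unchecked variant returns 4294966501, while the checked variant returns -1 and leaves the size untouched.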

Are the sizes of these two arrays correct?

I'm using VC++2015, and the compiler complains about several things. Most of them seem OK to ignore, but I'm worried about the sizes of two arrays:

Line 1210: static const CHAR open_str[9] = _T("<![CDATA[");

Line 4287: static const CHAR indent_str[16] = _T("                ");

It says the arrays are not big enough to hold the terminating '\0'.

Is it on purpose?

Thanks!
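
For context: in C (unlike C++), a character array may be initialized with a string literal whose length exactly matches the array size, in which case the terminating '\0' is simply not stored. Such an array is fine as long as it is only ever used together with an explicit length. A minimal sketch, assuming plain char instead of md4c's CHAR and _T() macros; starts_with_cdata is a hypothetical helper, not a function from md4c.

```c
#include <stddef.h>
#include <string.h>

/* Exactly 9 characters, no room for '\0'. Valid in C; an error in C++. */
static const char open_str[9] = "<![CDATA[";

/* Hypothetical helper: a length-based comparison never needs a terminator. */
static int starts_with_cdata(const char* text, size_t size)
{
    return size >= sizeof(open_str)
        && memcmp(text, open_str, sizeof(open_str)) == 0;
}
```

Passing open_str to strlen() or printf("%s") would be a bug, because those functions look for a terminator that is not there.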

Security audit, fuzzing, and more testing

Markdown implementations are often used to process untrusted input. md4c is written in C, which makes it very easy to introduce a security vulnerability. Hence, it is imperative that md4c is hardened against all possible exploits.

This includes:

  • Adding always-on assertions about things like array bounds
  • Intensive, repeated fuzzing with tools like afl-fuzz.
  • Security auditing
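
As an illustration of the first point, here is a minimal sketch of an always-on check. Unlike assert(), it is not compiled out when NDEBUG is defined. ALWAYS_CHECK and checked_char_at are hypothetical names, not part of md4c.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* Unlike assert(), this check stays active in release builds. */
#define ALWAYS_CHECK(cond)                                          \
    do {                                                            \
        if(!(cond)) {                                               \
            fprintf(stderr, "%s:%d: check failed: %s\n",            \
                    __FILE__, __LINE__, #cond);                     \
            abort();                                                \
        }                                                           \
    } while(0)

/* Hypothetical bounds-checked read from a buffer of known size. */
static char checked_char_at(const char* buf, size_t size, size_t off)
{
    ALWAYS_CHECK(off < size);
    return buf[off];
}
```

Aborting turns a potential memory corruption into a controlled crash, which is usually the safer failure mode when processing untrusted input.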

Lots of warnings

Hello,

Thank you for this awesome library. I get a lot of warnings when compiling with -Wall -Wextra:

$ gcc -std=c99 -Wall -Wextra md4c.c -c
md4c.c: In function ‘md_decode_unicode’:
md4c.c:844:52: warning: unused parameter ‘str_size’ [-Wunused-parameter]
     md_decode_unicode(const CHAR* str, OFF off, SZ str_size, SZ* p_size)
                                                    ^~~~~~~~
md4c.c: In function ‘md_merge_lines’:
md4c.c:864:73: warning: unused parameter ‘n_lines’ [-Wunused-parameter]
 md_merge_lines(MD_CTX* ctx, OFF beg, OFF end, const MD_LINE* lines, int n_lines,
                                                                         ^~~~~~~
md4c.c: In function ‘md_is_hex_entity_contents’:
md4c.c:1275:35: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 md_is_hex_entity_contents(MD_CTX* ctx, const CHAR* text, OFF beg, OFF max_end, OFF* p_end)
                                   ^~~
md4c.c: In function ‘md_is_dec_entity_contents’:
md4c.c:1291:35: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 md_is_dec_entity_contents(MD_CTX* ctx, const CHAR* text, OFF beg, OFF max_end, OFF* p_end)
                                   ^~~
md4c.c: In function ‘md_is_named_entity_contents’:
md4c.c:1307:37: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 md_is_named_entity_contents(MD_CTX* ctx, const CHAR* text, OFF beg, OFF max_end, OFF* p_end)
                                     ^~~
md4c.c: In function ‘md_free_attribute’:
md4c.c:1412:27: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 md_free_attribute(MD_CTX* ctx, MD_ATTRIBUTE_BUILD* build)
                           ^~~
md4c.c: In function ‘md_rollback’:
md4c.c:2618:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(i = 1; i < SIZEOF_ARRAY(ctx->mark_chains); i++) {
                  ^
md4c.c: In function ‘md_build_mark_char_map’:
md4c.c:2719:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         for(i = 0; i < sizeof(ctx->mark_char_map); i++) {
                      ^
md4c.c: In function ‘md_collect_marks’:
md4c.c:2936:52: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                 for(scheme_index = 0; scheme_index < SIZEOF_ARRAY(scheme_map); scheme_index++) {
                                                    ^
md4c.c: In function ‘md_process_verbatim_block_contents’:
md4c.c:4312:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         while(indent > SIZEOF_ARRAY(indent_chunk_str)) {
                      ^
md4c.c: In function ‘md_is_atxheader_line’:
md4c.c:4786:61: warning: unused parameter ‘p_end’ [-Wunused-parameter]
 md_is_atxheader_line(MD_CTX* ctx, OFF beg, OFF* p_beg, OFF* p_end, unsigned* p_level)
                                                             ^~~~~
md4c.c: At top level:
md4c.c:5336:1: warning: missing initializer for field ‘beg’ of ‘MD_LINE_ANALYSIS {aka const struct MD_LINE_ANALYSIS_tag}’ [-Wmissing-field-initializers]
 static const MD_LINE_ANALYSIS md_dummy_blank_line = { MD_LINE_BLANK, 0 };
 ^~~~~~
md4c.c:189:9: note: ‘beg’ declared here
     OFF beg;
         ^~~
md4c.c: In function ‘md_analyze_line’:
md4c.c:5481:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                ctx->n_block_bytes > sizeof(MD_BLOCK))
                                   ^
md4c.c:5499:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                ctx->n_block_bytes > sizeof(MD_BLOCK))
                                   ^
md4c.c: In function ‘md_parse’:
md4c.c:5908:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(i = 0; i < SIZEOF_ARRAY(ctx.mark_chains); i++) {
                  ^
md4c.c: In function ‘md_rollback’:
md4c.c:2669:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
                 if((mark_flags & MD_MARK_CLOSER)  &&  mark->prev > opener_index) {
                   ^
md4c.c:2676:13: note: here
             default:
             ^~~~~~~
md4c.c: In function ‘md_process_inlines’:
md4c.c:3956:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
                     if(!(mark->flags & MD_MARK_AUTOLINK)) {
                       ^
md4c.c:3966:17: note: here
                 case '@':       /* Permissive e-mail autolink. */
                 ^~~~
md4c.c: In function ‘md_enter_child_containers’:
md4c.c:5197:33: warning: this statement may fall through [-Wimplicit-fallthrough=]
                 is_ordered_list = TRUE;
                                 ^
md4c.c:5200:13: note: here
             case _T('-'):
             ^~~~
md4c.c: In function ‘md_leave_child_containers’:
md4c.c:5240:33: warning: this statement may fall through [-Wimplicit-fallthrough=]
                 is_ordered_list = TRUE;
                                 ^
md4c.c:5243:13: note: here
             case _T('-'):
             ^~~~
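
For reference, a sketch of the usual ways to silence three of these warning classes without changing behavior. All functions here are hypothetical stand-ins rather than md4c code, and the SIZEOF_ARRAY definition is an assumption about how such a macro is typically written.

```c
#include <stddef.h>

#define SIZEOF_ARRAY(a)  (sizeof(a) / sizeof((a)[0]))

/* -Wunused-parameter: state explicitly that the parameter is unused. */
static int is_digit_at(void* ctx, const char* text, size_t off)
{
    (void) ctx;
    return text[off] >= '0' && text[off] <= '9';
}

/* -Wsign-compare: use an unsigned index when the bound is a size_t. */
static int sum_all(void)
{
    static const int values[4] = { 1, 2, 3, 4 };
    size_t i;
    int sum = 0;
    for(i = 0; i < SIZEOF_ARRAY(values); i++)
        sum += values[i];
    return sum;
}

/* -Wimplicit-fallthrough: a "fall through" comment marks the intent. */
static int classify(char ch)
{
    int is_ordered = 0;
    switch(ch) {
        case '.':
            is_ordered = 1;
            /* fall through */
        case '-':
            return is_ordered ? 2 : 1;
        default:
            return 0;
    }
}
```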

Invalid HTML with --fpermissive-url-autolinks

With permissive-url-autolinks enabled, md2html appears to generate an extra closing tag after an ordinary inline link.

$ echo "This is a [link](http://github.com/)." | ./md2html
<p>This is a <a href="http://github.com/">link</a>.</p>

$ echo "This is a [link](http://github.com/)." | ./md2html --fpermissive-url-autolinks
<p>This is a <a href="http://github.com/">link</a></a>.</p>
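
A regression like this could be caught by a crude balance check on the output. The following is only a sketch; anchor_balance is a hypothetical helper that counts anchor tags and is in no way an HTML validator.

```c
#include <string.h>

/* Returns (number of "<a " or "<a>" openers) minus (number of "</a>"
 * closers). A well-formed fragment yields 0. */
static int anchor_balance(const char* html)
{
    int balance = 0;
    const char* p = html;
    while((p = strchr(p, '<')) != NULL) {
        if(strncmp(p, "</a>", 4) == 0)
            balance--;
        else if(strncmp(p, "<a ", 3) == 0 || strncmp(p, "<a>", 3) == 0)
            balance++;
        p++;
    }
    return balance;
}
```

For the first output above it returns 0, and for the second one it returns -1.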

Tag `<gi att1=tok1 att2=tok2>` not recognized

In md_is_html_tag(), state 41 ("in middle of unquoted attribute value") is not exited when whitespace is encountered. As a consequence, in the tag mentioned above, the second = throws the scanner off: we end up in the "unexpected" case and return FALSE, so the tag is not recognized as such.
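
To make the expected behavior concrete, here is a much simplified recognizer in which whitespace does end an unquoted value. It is a sketch only: the state names and is_simple_open_tag are hypothetical, quoted values and self-closing tags are not handled, and it does not mirror the numbered states in md_is_html_tag().

```c
#include <ctype.h>

/* Recognizes "<name attr=value ...>" with unquoted values only. */
static int is_simple_open_tag(const char* s)
{
    enum { TAG_NAME, BEFORE_ATTR, ATTR_NAME, UNQUOTED_VALUE } state = TAG_NAME;
    const unsigned char* p = (const unsigned char*) s;

    if(*p != '<' || !isalpha(p[1]))
        return 0;
    for(p += 2; *p != '\0'; p++) {
        switch(state) {
            case TAG_NAME:
                if(isalnum(*p) || *p == '-')  break;
                if(isspace(*p)) { state = BEFORE_ATTR; break; }
                return (*p == '>' && p[1] == '\0');
            case BEFORE_ATTR:
                if(isspace(*p))  break;
                if(isalpha(*p)) { state = ATTR_NAME; break; }
                return (*p == '>' && p[1] == '\0');
            case ATTR_NAME:
                if(isalnum(*p) || *p == '-')  break;
                if(isspace(*p)) { state = BEFORE_ATTR; break; }
                if(*p == '=') { state = UNQUOTED_VALUE; break; }
                return (*p == '>' && p[1] == '\0');
            case UNQUOTED_VALUE:
                /* Key step: whitespace ends the value, so that another
                 * attribute can follow. */
                if(isspace(*p)) { state = BEFORE_ATTR; break; }
                if(*p == '>')  return (p[1] == '\0');
                if(*p == '"' || *p == '\'' || *p == '=' || *p == '<' || *p == '`')
                    return 0;
                break;
        }
    }
    return 0;   /* Input ended before the closing '>'. */
}
```

With the whitespace transition in place, the second = is reached in the attribute-name state, where it is expected, instead of inside the first value.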

Multiple vulnerabilities in md4c

There are multiple vulnerabilities in md4c (git repository: https://github.com/mity/md4c, Latest commit 81e2a5c on Apr 12, 2018).

git log

commit 81e2a5cac2c8c2b1f8fe63b7bce3fe7e516e2891
Author: Martin Mitas <[email protected]>
Date:   Thu Apr 12 17:03:37 2018 +0200

Heap buffer overflow in md_split_simple_pairing_mark()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-81e2a5c/Heap_buffer_overflow_in_md_split_simple_pairing_mark

It seems that an overflow happens in memcpy() at md4c.c:3499:

memcpy(dummy, mark, sizeof(MD_MARK));

AddressSanitizer reports the following:

==27938==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61a000000684 at pc 0x0000004dd7f5 bp 0x7ffedcfedc30 sp 0x7ffedcfed3e0
WRITE of size 20 at 0x61a000000684 thread T0
    #0 0x4dd7f4 in __asan_memcpy /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cc:23
    #1 0x546dd3 in md_split_simple_pairing_mark /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3499:5
    #2 0x546dd3 in md_analyze_simple_pairing_mark /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3553
    #3 0x540584 in md_analyze_marks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c
    #4 0x53c9c8 in md_analyze_link_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3813:5
    #5 0x53c9c8 in md_analyze_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3802
    #6 0x550b95 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4283:5
    #7 0x52e7f7 in md_process_leaf_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4454:13
    #8 0x52e7f7 in md_process_all_blocks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4529
    #9 0x52e7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5856
    #10 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #11 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #12 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #13 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #14 0x7f17c6fc582f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #15 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

Address 0x61a000000684 is a wild pointer.
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cc:23 in __asan_memcpy
Shadow bytes around the buggy address:
0x0c347fff8080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c347fff8090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c347fff80a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c347fff80b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c347fff80d0:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff80e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff80f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff8100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff8110: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff8120: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==27938==ABORTING

Heap buffer overflow in md_process_inlines()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-81e2a5c/Heap_buffer_overflow_in_md_process_inlines

It seems that the mark variable accesses memory outside the allocated region at md4c.c:4004:

while(!(mark->flags & MD_MARK_RESOLVED) || mark->beg < off)

AddressSanitizer reports the following:

==29037==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61e000000a91 at pc 0x000000553328 bp 0x7fffe3bdfa70 sp 0x7fffe3bdfa68
READ of size 1 at 0x61e000000a91 thread T0
    #0 0x553327 in md_process_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4004:27
    #1 0x553327 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4284
    #2 0x52e7f7 in md_process_leaf_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4454:13
    #3 0x52e7f7 in md_process_all_blocks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4529
    #4 0x52e7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5856
    #5 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #6 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #7 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #8 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #9 0x7f49ca83682f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #10 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

0x61e000000a91 is located 17 bytes to the right of 2560-byte region [0x61e000000080,0x61e000000a80)
allocated by thread T0 here:
    #0 0x4ded00 in realloc /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:107
    #1 0x5369af in md_push_mark /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2496:21
    #2 0x5369af in md_collect_marks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2897
    #3 0x5369af in md_analyze_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3774
    #4 0x550b95 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4283:5

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4004:27 in md_process_inlines
Shadow bytes around the buggy address:
0x0c3c7fff8100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c3c7fff8110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c3c7fff8120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c3c7fff8130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c3c7fff8140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c3c7fff8150: fa fa[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff8160: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff8170: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff8180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff8190: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff81a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==29037==ABORTING

Angle brackets in link destinations

The spec recognizes two types of link destinations:

  1. a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or

  2. a nonempty sequence of characters that does not include ASCII space or control characters, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses. (Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.)

Although not explicitly stated, it's clear from various discussions (e.g. commonmark/cmark#193, commonmark/cmark#219) that if parsing with the type 1 fails, the parser should retry with type 2.

However, MD4C currently does that only at the link destination level, not for the link as a whole (see function md_is_link_destination()).

Hence we correctly parse

[a](<te<st>)

But we fail with

[a](<x>X)

because <x> is seen as type 1, but the unexpected character X that follows then prevents the whole construct from being recognized as a link.
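
The retry at the level of the whole link could look roughly like the sketch below. It follows the two destination types exactly as quoted above and ignores link titles. The function names are hypothetical, and this is not how md_is_link_destination() is structured.

```c
#include <stddef.h>

/* Type 1: "<...>" with no spaces, line breaks, or unescaped '<' / '>'
 * inside. Returns the length including both brackets, or 0 on failure. */
static size_t dest_type1(const char* s)
{
    size_t i = 1;
    if(s[0] != '<')
        return 0;
    while(s[i] != '\0') {
        if(s[i] == '\\' && (s[i+1] == '<' || s[i+1] == '>' || s[i+1] == '\\')) {
            i += 2;
            continue;
        }
        if(s[i] == '>')
            return i + 1;
        if(s[i] == '<' || s[i] == ' ' || s[i] == '\n' || s[i] == '\r')
            return 0;
        i++;
    }
    return 0;
}

/* Type 2: nonempty run without ASCII space or control characters and with
 * balanced unescaped parentheses. Returns the length, or 0 on failure. */
static size_t dest_type2(const char* s)
{
    size_t i = 0;
    int depth = 0;
    while(s[i] != '\0') {
        unsigned char ch = (unsigned char) s[i];
        if(ch == '\\' && (s[i+1] == '(' || s[i+1] == ')' || s[i+1] == '\\')) {
            i += 2;
            continue;
        }
        if(ch <= ' ' || ch == 0x7f)
            break;
        if(ch == '(')
            depth++;
        if(ch == ')') {
            if(depth == 0)
                break;
            depth--;
        }
        i++;
    }
    return (depth == 0) ? i : 0;
}

/* s points just past "](". Returns the destination length if the
 * destination is directly followed by ')', or 0 otherwise. */
static size_t inline_link_dest(const char* s)
{
    size_t n = dest_type1(s);
    if(n > 0 && s[n] == ')')
        return n;
    /* Type 1 did not yield a complete link: retry with type 2. */
    n = dest_type2(s);
    if(n > 0 && s[n] == ')')
        return n;
    return 0;
}
```

For the input <x>X) the type 1 attempt matches <x> but is not followed by the closing parenthesis, so the type 2 attempt takes over and accepts <x>X as the destination.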

Parsing <><><><><><>… takes quadratic time

$ python -c 'print("<>" * 10000)' | time md2html > /dev/null
1.41user 0.00system 0:01.45elapsed 96%CPU (0avgtext+0avgdata 1968maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
$ python -c 'print("<>" * 20000)' | time md2html > /dev/null
4.80user 0.00system 0:04.84elapsed 99%CPU (0avgtext+0avgdata 2608maxresident)k
0inputs+0outputs (0major+365minor)pagefaults 0swaps
$ python -c 'print("<>" * 40000)' | time md2html > /dev/null
19.52user 0.00system 0:19.65elapsed 99%CPU (0avgtext+0avgdata 3528maxresident)k
0inputs+0outputs (0major+640minor)pagefaults 0swaps
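
A quick way to read these numbers: when the input size doubles, a running time ratio near 2 suggests linear growth and a ratio near 4 suggests quadratic growth. The helper below is a trivial sketch of that arithmetic; doubling_ratio is a hypothetical name.

```c
/* Ratio of running times for two inputs where the second is twice as
 * large as the first: about 2 is linear, about 4 is quadratic. */
static double doubling_ratio(double t_small, double t_large)
{
    return t_large / t_small;
}
```

With the user times above, going from 10000 to 20000 repetitions gives roughly 3.4 (4.80 / 1.41), and going from 20000 to 40000 gives roughly 4.1 (19.52 / 4.80).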

Feature request: safe mode

This is a feature request for a “safe mode”: potentially-malicious content (such as certain URL schemes) is disallowed to prevent XSS attacks.

The output from this should be safely insertable into a webpage, without any further escaping or sanitization.
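
One building block of such a mode could be an allowlist of URL schemes. The sketch below is an assumption about what that might look like, not an existing md4c feature. The function names and the list of allowed schemes are hypothetical, and a real safe mode would also need to deal with raw HTML and attribute escaping.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of url[0..n) with a lowercase scheme name. */
static int scheme_equals(const char* url, size_t n, const char* scheme)
{
    size_t i;
    if(strlen(scheme) != n)
        return 0;
    for(i = 0; i < n; i++) {
        if(tolower((unsigned char) url[i]) != scheme[i])
            return 0;
    }
    return 1;
}

/* Returns 1 if the URL is relative or uses an allowed scheme. */
static int is_url_allowed(const char* url)
{
    static const char* const allowed[] = { "http", "https", "mailto" };
    size_t i, n;

    /* Reject spaces and control characters outright, so that a scheme
     * cannot be disguised with embedded tabs or line breaks. */
    for(i = 0; url[i] != '\0'; i++) {
        if((unsigned char) url[i] <= ' ')
            return 0;
    }

    /* A ':' that comes before any '/', '?' or '#' delimits a scheme. */
    n = strcspn(url, ":/?#");
    if(url[n] != ':')
        return 1;   /* No scheme at all: a relative URL. */

    for(i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if(scheme_equals(url, n, allowed[i]))
            return 1;
    }
    return 0;
}
```

An allowlist is used rather than a blocklist because the set of dangerous schemes is open-ended, while the set of schemes a typical document needs is small.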

Use MD4C for syntax highlighter: Passing document parts / single blocks

Hi,

MD4C's SAX-like callbacks sound a lot more useful than the reference CMark AST implementation to implement syntax highlighting in an editor. Skimming the code, it appears only reference-style links and images would cause problems if the client does not pass the whole document but only parts of it, like the visible portion on screen or the last edited block of text. Did I get that right?

-- Christian

Heap buffer overflow in md_merge_lines()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-387bd02/crash_md_merge_lines

AddressSanitizer reports the following:

=================================================================
==21464==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6040000000b1 at pc 0x00000054ff84 bp 0x7fff500be8d0 sp 0x7fff500be8c8
WRITE of size 1 at 0x6040000000b1 thread T0
    #0 0x54ff83 in md_merge_lines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:878:18
    #1 0x54ff83 in md_merge_lines_alloc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:910
    #2 0x54ff83 in md_is_link_reference_definition_helper /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2154
    #3 0x532108 in md_is_link_reference_definition /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2215:15
    #4 0x532108 in md_consume_link_reference_definitions /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4648
    #5 0x532108 in md_end_current_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4694
    #6 0x52c7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5850:5
    #7 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #8 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #9 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #10 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #11 0x7fda7443a82f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #12 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

0x6040000000b1 is located 0 bytes to the right of 33-byte region [0x604000000090,0x6040000000b1)
allocated by thread T0 here:
    #0 0x4de898 in __interceptor_malloc /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x54e91f in md_merge_lines_alloc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:904:22
    #2 0x54e91f in md_is_link_reference_definition_helper /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2154
    #3 0x532108 in md_is_link_reference_definition /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2215:15
    #4 0x532108 in md_consume_link_reference_definitions /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4648
    #5 0x532108 in md_end_current_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4694

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:878:18 in md_merge_lines
Shadow bytes around the buggy address:
0x0c087fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c087fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c087fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c087fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c087fff8000: fa fa fd fd fd fd fd fd fa fa 00 00 00 00 00 04
=>0x0c087fff8010: fa fa 00 00 00 00[01]fa fa fa fa fa fa fa fa fa
0x0c087fff8020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c087fff8030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c087fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c087fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c087fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==21464==ABORTING

Parsing '<><><><>...' takes quadratic time

$ python -c 'print("<>" * 10000)' | time md2html > /dev/null
1.41user 0.00system 0:01.45elapsed 96%CPU (0avgtext+0avgdata 1968maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
$ python -c 'print("<>" * 20000)' | time md2html > /dev/null
4.80user 0.00system 0:04.84elapsed 99%CPU (0avgtext+0avgdata 2608maxresident)k
0inputs+0outputs (0major+365minor)pagefaults 0swaps
$ python -c 'print("<>" * 40000)' | time md2html > /dev/null
19.52user 0.00system 0:19.65elapsed 99%CPU (0avgtext+0avgdata 3528maxresident)k
0inputs+0outputs (0major+640minor)pagefaults 0swaps

(the 1st case from #57)

Feature request: parse GitHub checkbox syntax

Another reason I want to add Markdown support to Qt is to enable combined notes and TODO lists like Evernote has, to be able to edit GitHub README.md files, etc.

https://github.com/blog/1825-task-lists-in-all-markdown-documents

As they show there, the syntax is

### Solar System Exploration, 1950s – 1960s

- [ ] Mercury
- [x] Venus
- [x] Earth (Orbit/Moon)
- [x] Mars
- [ ] Jupiter
- [ ] Saturn
- [ ] Uranus
- [ ] Neptune
- [ ] Comet Haley

Solar System Exploration, 1950s – 1960s

  • Mercury
  • Venus
  • Earth (Orbit/Moon)
  • Mars
  • Jupiter
  • Saturn
  • Uranus
  • Neptune
  • Comet Haley

It won't be quite so easy to add support for checking and unchecking them to any Qt text-editing component, but I'd like to try.

Currently for my personal todo lists and shopping lists etc., I use todo.txt format (https://f-droid.org/packages/nl.mpcjanssen.simpletask/ and https://f-droid.org/packages/com.nutomic.syncthingandroid/ are a good combination for this); but it has the disadvantage of not being able to mix free-form notes with the checklists. So I think markdown will be better.
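
On the parsing side, recognizing the marker at the start of a list item's text is simple. A minimal sketch, assuming the marker is a bracket pair containing a space or the letter x (accepting an uppercase X is my assumption), followed by a space; task_list_mark is a hypothetical helper, not an md4c function.

```c
/* Returns 'x' for a checked box, ' ' for an unchecked one, and 0 when the
 * list item text does not start with a task list marker. */
static char task_list_mark(const char* item_text)
{
    char mark;
    if(item_text[0] != '[')
        return 0;
    mark = item_text[1];
    if(mark != ' ' && mark != 'x' && mark != 'X')
        return 0;
    if(item_text[2] != ']' || item_text[3] != ' ')
        return 0;
    return (mark == ' ') ? ' ' : 'x';
}
```

The checks are ordered so that the function never reads past the string terminator, even for very short item texts.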

md2html does not have man page

As MD4C is currently being added to some Linux distros (see #48 or #55), the md2html tool should have better documentation, in particular a man page.

trouble with zero-based nested lists

Try to parse something like this: it seems to me that it combines some list items. As long as the indices start from 1, it doesn't happen.

0. Introduction
1. Chapter One
    0) One thing
    1) Another thing
        0. Subpoint
        1. Counterpoint
    2) Yet another thing

Assertion triggered

The input

***b* c*

triggers an assertion:

MD4C: ../md4c/md4c.c:2368: Assertion 'dummy->ch == 'D'' failed.

Heap buffer overflow in md_is_link_reference_definition_helper()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-387bd02/crash_md_is_link_reference_definition_helper

AddressSanitizer reports the following:

=================================================================
==7016==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x615000000280 at pc 0x00000054e1e4 bp 0x7ffdf438ab70 sp 0x7ffdf438ab68
READ of size 4 at 0x615000000280 thread T0
    #0 0x54e1e3 in md_is_link_reference_definition_helper /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1931:33
    #1 0x5320d5 in md_is_link_reference_definition /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2213:11
    #2 0x5320d5 in md_consume_link_reference_definitions /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4648
    #3 0x5320d5 in md_end_current_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4694
    #4 0x52c7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5850:5
    #5 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #6 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #7 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #8 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #9 0x7f20771c082f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #10 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

0x615000000280 is located 0 bytes to the right of 512-byte region [0x615000000080,0x615000000280)
allocated by thread T0 here:
    #0 0x4ded00 in realloc /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:107
    #1 0x527b65 in md_push_block_bytes /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4560:27
    #2 0x527b65 in md_start_new_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4587
    #3 0x527b65 in md_process_line /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5820
    #4 0x527b65 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5847
    #5 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #6 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #7 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #8 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #9 0x7f20771c082f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1931:33 in md_is_link_reference_definition_helper
Shadow bytes around the buggy address:
0x0c2a7fff8000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c2a7fff8050:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==7016==ABORTING

Entities in non-trivial contexts

Entities which are not inside a normal text flow are not translated.

This includes these situations:

  • Entity inside a link or image destination (URL).
  • Entity inside a link or image title text.
  • Entity inside a code fence info string.

This is responsible for the following CommonMark 0.27 test failures.

Consecutive lists without a blank line

100. foo
    * bar

is transformed into

<ol start="100"><li>foo
* bar</li>
</ol>

But it should be

<ol start="100"><li>foo</li>
</ol>
<ul>
<li>bar</li>
</ul>

(Interestingly, smaller or larger indentation of the 2nd list works correctly.)

Entity in direct link title entails mis-translation

This gets processed correctly:

Some [direct link](http://example.com "Direct link -- ie, inline URL") here.

But when an entity reference occurs in the title text, like this:

Some [direct link](http://example.com "Direct link &ndash; ie, inline URL") here.

the last portion of the title text (starting at the reference) gets repeated after the generated <a> element, so the output from md2html looks like this:

<p>Some <a href="http://example.com" title="Direct link – ie, inline URL">direct link</a>– ie, inline URL&quot;) here.</p>

As far as I have seen, this stems from md4c.c, not md2html.c: the &ndash; reference is indeed transmitted twice, once embedded in the attribute text, and a second time as a MD_TEXT_ENTITY item itself.

One can actually observe this behaviour in Babelmark 2, where MD4C 0.1.1 has recently been included.

Broken permissive autolink just before end-of-file

$ echo -n 'http://example.com' | md2html/md2html --github

generates

<p><a href="http://example.com�">http://example.com</p>

I.e. there is garbage in the link URL and the link is not correctly ended with </a>.
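The symptom is consistent with a scanner reading one character past the end of the input when the URL is the very last thing in the buffer. That is an assumption, not a confirmed diagnosis. A minimal sketch of a bounds-checked scan follows; `demo_scan_url_end` is a hypothetical helper, not part of MD4C, and its set of terminating characters is a simplification.

```c
#include <stddef.h>

/* Hypothetical helper (not MD4C's API): return the offset one past the
 * last character of a URL that starts at `beg`. The buffer is NOT assumed
 * to be NUL-terminated, so every read is guarded by `off < size`. */
size_t
demo_scan_url_end(const char* text, size_t size, size_t beg)
{
    size_t off = beg;

    while(off < size) {
        char ch = text[off];

        /* Simplified terminator set: whitespace and angle brackets. */
        if(ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n' ||
           ch == '<' || ch == '>')
            break;
        off++;
    }

    /* When the URL runs up to end-of-file, this is exactly `size`. */
    return off;
}
```

With input produced by `echo -n`, the returned offset equals the buffer size, so the caller has a well-defined end for both the href attribute and the closing </a>.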

Entities inside image contents are not rendered.

![alt text with *entity* &copy;](img.png 'title')

renders into

<p><img src="img.png" alt="alt text with entity " title="title"></p>

but it should render into

<p><img src="img.png" alt="alt text with entity ©" title="title"></p>

Incorrect emphasis parsing

(Copied from commonmark/cmark#177, we are hit with exactly the same issue.)

@raphlinus writes:

In the example a***b* c*, cmark produces a**<em>b</em> c*, where I believe the spec would say a*<em><em>b</em> c</em>. My reading of the spec is that the length of the opening delimiter run is 3, and the length of both closing runs is 1, so in neither case the sum is a multiple of 3.
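The "multiple of 3" condition quoted above can be sketched as a small predicate. This is a simplified reading of the rule as stated in this report, not cmark's or MD4C's implementation; the function name and parameters are hypothetical, and the exact wording of the rule may differ between spec versions.

```c
/* Hypothetical predicate: may an opening and a closing delimiter run of
 * the same character pair up?
 *
 * opener_can_close: non-zero if the opening run could also close emphasis
 * closer_can_open:  non-zero if the closing run could also open emphasis
 *
 * Simplified rule: if either run can both open and close, the sum of the
 * two run lengths must not be a multiple of 3. */
int
demo_runs_can_pair(unsigned opener_len, int opener_can_close,
                   unsigned closer_len, int closer_can_open)
{
    if(opener_can_close || closer_can_open) {
        if((opener_len + closer_len) % 3 == 0)
            return 0;
    }
    return 1;
}
```

For a***b* c*, the opening run has length 3 and each closing run has length 1, so `demo_runs_can_pair(3, 1, 1, 0)` returns 1: the sum is 4 and the pairing is not blocked, which matches the reading given above.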

Crash

American Fuzzy Lop has found a crash with a pattern I was able to minimize into this.

[x]:
x
- <?

  x

md2html doesn't generate HTML tables

I tested with the example.md from https://codereview.qt-project.org/#/c/214844/ . It simply put the raw table syntax into a paragraph:

<p>| | Development Tools | Programming Techniques | Graphical User Interfaces | | ------------: | ----------------- | ---------------------- | ------------------------- | | 9:00 - 11:00 | Introduction to Qt ||| | 11:00 - 13:00 | Using qmake | Object-oriented Programming | Layouts in Qt | | 13:00 - 15:00 | Qt Designer Tutorial | Extreme Programming | Writing Custom Styles | | 15:00 - 17:00 | Qt Linguist and Internationalization |   |   |</p>

This doesn't matter for Qt, and I don't have any plan to use md2html for anything ATM; but since I'm trying to add md4c to Arch AUR, if the package is to include md2html, the bugs will become noticeable when someone tries to use it for this basic purpose of generating HTML. If it's not intended to be a serious tool (after all, there are enough MD->HTML tools to choose from), I could rather leave it out of the package though.

NULL pointer dereference in md4c/md4c.c:5824

I found a segmentation fault when using md2html.
commit cb7ecd7
./md2html --github crash1

It is a NULL pointer dereference at https://github.com/mity/md4c/blob/master/md4c/md4c.c#L5824.
ctx->current_block is a null pointer.
There is an assert at https://github.com/mity/md4c/blob/master/md4c/md4c.c#L5822, but I don't know why it does not trigger.
I just cloned the repository and built it with cmake . and make.

(gdb) set args --github crash1 
(gdb) r
Starting program: /opt/lxf/md4c/md2html/md2html --github crash1 

Program received signal SIGSEGV, Segmentation fault.
md_process_line (line=0x7fffffffde80, p_pivot_line=<synthetic pointer>, ctx=0x7fffffffdf30)
    at /opt/lxf/md4c/md4c/md4c.c:5824
5824	        ctx->current_block->type = MD_BLOCK_TABLE;
(gdb) bt
#0  md_process_line (line=0x7fffffffde80, p_pivot_line=<synthetic pointer>, ctx=0x7fffffffdf30)
    at /opt/lxf/md4c/md4c/md4c.c:5824
#1  md_process_doc (ctx=0x7fffffffdf30) at /opt/lxf/md4c/md4c/md4c.c:5865
#2  md_parse (text=text@entry=0x627250 "", size=size@entry=8632, renderer=renderer@entry=0x7fffffffe1c0, 
    userdata=userdata@entry=0x7fffffffe1a0) at /opt/lxf/md4c/md4c/md4c.c:5935
#3  0x0000000000403aa2 in md_render_html (input=input@entry=0x627250 "", input_size=input_size@entry=8632, 
    process_output=process_output@entry=0x402280 <process_output>, userdata=userdata@entry=0x7fffffffe210, 
    parser_flags=<optimized out>, renderer_flags=<optimized out>) at /opt/lxf/md4c/md2html/render_html.c:488
#4  0x0000000000401263 in process_file (out=0x7ffff7dd4400 <_IO_2_1_stdout_>, in=0x627010)
    at /opt/lxf/md4c/md2html/md2html.c:139
#5  main (argc=<optimized out>, argv=<optimized out>) at /opt/lxf/md4c/md2html/md2html.c:343
(gdb) p ctx->current_block 
$1 = (MD_BLOCK *) 0x0

This is the crash file:
poc file
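On why the assertion did not fire: in C, assertion macros are commonly compiled out of non-debug builds, so a check that is visible in the source may do nothing in the binary. Whether that applies to this particular build is an assumption. The sketch below uses a hypothetical `DEMO_ASSERT` macro and `demo_block` type (neither is MD4C code) to show why an explicit runtime check is still needed.

```c
#include <stdlib.h>

/* Hypothetical assertion macro: active only when DEBUG is defined and a
 * no-op otherwise (the same idea as NDEBUG with <assert.h>). */
#ifdef DEBUG
    #define DEMO_ASSERT(cond)   do { if(!(cond)) abort(); } while(0)
#else
    #define DEMO_ASSERT(cond)   do { } while(0)
#endif

typedef struct demo_block {
    int type;
} demo_block;

/* Returns 0 on success, -1 if there is no current block. */
int
demo_set_block_type(demo_block* current_block, int type)
{
    /* Vanishes entirely when DEBUG is not defined. */
    DEMO_ASSERT(current_block != NULL);

    /* An explicit check survives every build configuration. */
    if(current_block == NULL)
        return -1;

    current_block->type = type;
    return 0;
}
```

In a build without DEBUG, calling `demo_set_block_type(NULL, ...)` returns -1 instead of dereferencing the null pointer.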

Feature request: support table cell merging

That is, it should be possible for a cell to span multiple columns and maybe even multiple rows.

For example MultiMarkdown does it like this:

https://fletcher.github.io/MultiMarkdown-5/tables.html

I'd suggest adding columnSpan and rowSpan to MD_BLOCK_TD_DETAIL.

I'm trying to use md4c to add Markdown support to Qt (the table patch is https://codereview.qt-project.org/#/c/214901/ ) and this is the main low-hanging fruit I've seen so far, since QTextTable supports it already.

Attributes of `CommonMark.dtd` omitted by parser

When porting some code (which up to now used cmark) to MD4C I missed some information from the parser which is represented as attributes in the CommonMark DTD:

  1. In the details for MD_BLOCK_UL, the information whether the list is "tight" or "loose" (attribute tight (true|false)).

  2. In the details for MD_BLOCK_OL, the information whether the list is "tight" or "loose" (attribute tight (true|false)), and which style of item marker was used (attribute delimiter (period|paren)).

Any chance to put the corresponding information into MD_BLOCK_UL_DETAIL and MD_BLOCK_OL_DETAIL, respectively?
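Purely as an illustration of the request, the detail structures could carry the extra fields as sketched below. All struct and field names here (`demo_ul_detail`, `is_tight`, `mark_delimiter`, and so on) are hypothetical and do not describe MD4C's actual headers.

```c
/* Hypothetical detail structures mirroring the CommonMark DTD attributes. */
typedef struct demo_ul_detail {
    int is_tight;           /* non-zero: tight list, zero: loose list */
} demo_ul_detail;

typedef struct demo_ol_detail {
    unsigned start;         /* number of the first list item */
    int is_tight;           /* non-zero: tight list, zero: loose list */
    char mark_delimiter;    /* '.' (period) or ')' (paren) */
} demo_ol_detail;

/* Map the delimiter character to the value used by the DTD attribute. */
const char*
demo_delimiter_name(const demo_ol_detail* det)
{
    return (det->mark_delimiter == ')') ? "paren" : "period";
}
```

A renderer receiving such a structure could then emit the tight and delimiter attributes without re-scanning the source text.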

Heap buffer overflow in md_is_named_entity_contents()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-387bd02/crash_md_is_named_entity_contents

AddressSanitizer provided information as below:

==16545==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200000001f at pc 0x0000005464c6 bp 0x7ffe90e1b080 sp 0x7ffe90e1b078
READ of size 1 at 0x60200000001f thread T0
    #0 0x5464c5 in md_is_named_entity_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1311:28
    #1 0x5464c5 in md_is_entity_str /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1341
    #2 0x553b62 in md_build_attribute /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1473:20
    #3 0x5562f9 in md_enter_leave_span_a /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3838:5
    #4 0x5510d2 in md_process_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3947:21
    #5 0x5510d2 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4284
    #6 0x52e7f7 in md_process_leaf_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4454:13
    #7 0x52e7f7 in md_process_all_blocks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4529
    #8 0x52e7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5856
    #9 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #10 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #11 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #12 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #13 0x7fd6be7ec82f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #14 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

0x60200000001f is located 0 bytes to the right of 15-byte region [0x602000000010,0x60200000001f)
allocated by thread T0 here:
    #0 0x4de898 in __interceptor_malloc /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x54bedb in md_merge_lines_alloc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:904:22
    #2 0x54bedb in md_is_inline_link_spec_helper /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2352
    #3 0x53b9bf in md_is_inline_link_spec /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2370:12
    #4 0x53b9bf in md_resolve_links /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3367
    #5 0x53b9bf in md_analyze_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3786
    #6 0x550b75 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4283:5

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1311:28 in md_is_named_entity_contents
Shadow bytes around the buggy address:
0x0c047fff7fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c047fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c047fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c047fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c047fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c047fff8000: fa fa 00[07]fa fa 00 07 fa fa fa fa fa fa fa fa
0x0c047fff8010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c047fff8020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c047fff8030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c047fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c047fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==16545==ABORTING

Let the `entity_lookup()` function return UTF-32, not UTF-8

The fact that entity_lookup() currently returns the replacement text as a UTF-8 octet sequence is convenient for UTF-8 output (of course), but clumsy otherwise: when generating UTF-16 output, md2html would have to convert UTF-8 into a code point, and write the code point as one or two UTF-16 code units.

While the latter step is trivial, the former is just an unnecessary burden (and in fact, md2html.c won't replace entity references when generating UTF-16 right now).

A better approach would return UTF-32 from entity_lookup(): from this, a renderer could easily

  • output the replacement text in UTF-8,
  • output the replacement text in UTF-16,
  • output the replacement text in ASCII (using numerical character references for non-ASCII code points);
  • output the replacement text in Latin 1 (ditto, for code points beyond U+00FF).
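To illustrate how little a renderer needs once it has a code point, here is a sketch of the two encoders. The function names are hypothetical; this is not code from md2html, only the standard UTF-8 and UTF-16 encoding rules written out.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8. Returns the number of bytes written
 * (1 to 4), or 0 if `cp` is a surrogate or beyond U+10FFFF. */
size_t
demo_encode_utf8(uint32_t cp, unsigned char out[4])
{
    if(cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if(cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if(cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if(cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Encode one code point as UTF-16. Returns the number of code units
 * written (1 or 2), or 0 if `cp` is not a valid scalar value. */
size_t
demo_encode_utf16(uint32_t cp, uint16_t out[2])
{
    if(cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if(cp < 0x10000) {
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;                                  /* remove the bias */
    out[0] = (uint16_t) (0xD800 | (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));    /* low surrogate */
    return 2;
}
```

For example, the en dash U+2013 (the replacement for &ndash;) encodes to the three bytes E2 80 93 in UTF-8 and to the single unit 0x2013 in UTF-16.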

Heap-buffer-overflow in md4c.c

./md2html md4c_heap-buffer-overflow_md4c

==26370==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60300000f000 at pc 0x7f8e75e343ca bp 0x7fff7ec8b0f0 sp 0x7fff7ec8b0e0
WRITE of size 4 at 0x60300000f000 thread T0
    #0 0x7f8e75e343c9 in md_build_attribute /home/github/md4cg/md4c/md4c.c:1491
    #1 0x7f8e75e4e584 in md_setup_fenced_code_detail /home/github/md4cg/md4c/md4c.c:4377
    #2 0x7f8e75e4ea94 in md_process_leaf_block /home/github/md4cg/md4c/md4c.c:4419
    #3 0x7f8e75e4fb28 in md_process_all_blocks /home/github/md4cg/md4c/md4c.c:4528
    #4 0x7f8e75e5b574 in md_process_doc /home/github/md4cg/md4c/md4c.c:5854
    #5 0x7f8e75e5b99c in md_parse /home/github/md4cg/md4c/md4c.c:5915
    #6 0x4045ac in md_render_html /home/github/md4cg/md2html/render_html.c:488
    #7 0x401b4a in process_file /home/github/md4cg/md2html/md2html.c:139
    #8 0x402394 in main /home/github/md4cg/md2html/md2html.c:343
    #9 0x7f8e75a7b82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #10 0x4012c8 in _start (/home//github/md4cg/md2html/md2html+0x4012c8)

0x60300000f000 is located 0 bytes to the right of 32-byte region [0x60300000efe0,0x60300000f000)
allocated by thread T0 here:
    #0 0x7f8e760fb961 in realloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98961)
    #1 0x7f8e75e3327f in md_build_attr_append_substr /home/github/md4cg/md4c/md4c.c:1392
    #2 0x7f8e75e33e6b in md_build_attribute /home/github/md4cg/md4c/md4c.c:1482
    #3 0x7f8e75e4e584 in md_setup_fenced_code_detail /home/github/md4cg/md4c/md4c.c:4377
    #4 0x7f8e75e4ea94 in md_process_leaf_block /home/github/md4cg/md4c/md4c.c:4419
    #5 0x7f8e75e4fb28 in md_process_all_blocks /home/github/md4cg/md4c/md4c.c:4528
    #6 0x7f8e75e5b574 in md_process_doc /home/github/md4cg/md4c/md4c.c:5854
    #7 0x7f8e75e5b99c in md_parse /home/github/md4cg/md4c/md4c.c:5915
    #8 0x4045ac in md_render_html /home/github/md4cg/md2html/render_html.c:488
    #9 0x401b4a in process_file /home/github/md4cg/md2html/md2html.c:139
    #10 0x402394 in main /home//github/md4cg/md2html/md2html.c:343
    #11 0x7f8e75a7b82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/github/md4cg/md4c/md4c.c:1491 md_build_attribute
Shadow bytes around the buggy address:
  0x0c067fff9db0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9dc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9dd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9de0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9df0: fa fa fa fa fa fa fa fa fa fa fa fa 00 00 00 00
=>0x0c067fff9e00:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
==26370==ABORTING

poc:https://github.com/xcainiao/poc/blob/master/md4c_heap-buffer-overflow_md4c

Invalid output.

The test case input in #38 generates unexpected output, even after fixing the invalid read.

A minimized version of the test case for the invalid output is as follows:

[x](((x
x]((C(&))x

It currently generates

<p><a href="" title="(x
x]((C(&amp;">x</a>
x]((C(&amp;))x</p>

That is clearly wrong. The input is likely not valid link syntax (some more analysis and spec study is needed to confirm). But even if it were a link, the link text should not be repeated after the link. Something is rotten here.

Flag MD_FLAG_NOHTMLSPANS disables also autolinks

MD_FLAG_NOHTMLSPANS is supposed to disable inline raw HTML but nothing else. However, it also disables (standard CommonMark) autolinks:

$ echo '<http://google.com>' | md2html 
<p><a href="http://google.com">http://google.com</a></p>

$ echo '<http://google.com>' | md2html --fno-html-spans
<p>&lt;http://google.com&gt;</p>
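One way to keep the two features independent is to classify a `<...>` span as a URI autolink before the raw-HTML flag is consulted at all. The sketch below roughly follows CommonMark's description of URI autolinks (a scheme of 2 to 32 characters, a colon, then no whitespace or angle brackets). `demo_is_uri_autolink` is a hypothetical helper, not MD4C's implementation.

```c
#include <stddef.h>

static int
demo_is_ascii_alpha(char ch)
{
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
}

static int
demo_is_scheme_char(char ch)
{
    return demo_is_ascii_alpha(ch) || (ch >= '0' && ch <= '9') ||
           ch == '+' || ch == '.' || ch == '-';
}

/* `text` spans the whole candidate including '<' and '>'. It does not
 * need to be NUL-terminated. Returns 1 for a URI autolink, else 0. */
int
demo_is_uri_autolink(const char* text, size_t size)
{
    size_t off = 1;
    size_t scheme_len;

    /* Shortest possible candidate is "<ab:>". */
    if(size < 5 || text[0] != '<' || text[size-1] != '>')
        return 0;

    /* Scheme: an ASCII letter, then letters, digits, '+', '.', '-'. */
    if(!demo_is_ascii_alpha(text[off]))
        return 0;
    while(off < size-1 && demo_is_scheme_char(text[off]))
        off++;
    scheme_len = off - 1;
    if(scheme_len < 2 || scheme_len > 32)
        return 0;

    if(off >= size-1 || text[off] != ':')
        return 0;
    off++;

    /* Rest: no spaces, control characters, '<' or '>'. */
    for(; off < size-1; off++) {
        unsigned char ch = (unsigned char) text[off];
        if(ch <= 0x20 || ch == 0x7f || ch == '<' || ch == '>')
            return 0;
    }
    return 1;
}
```

With such a check done first, `<http://google.com>` would be recognized as an autolink regardless of whether raw HTML spans are enabled, while `<b>` would still fall through to the raw-HTML path.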

Ref. definitions with same labels

[foo]: /foo
[qnptgbh]: /qnptgbh
[abgbrwcv]: /abgbrwcv
[abgbrwcv]: /abgbrwcv2
[abgbrwcv]: /abgbrwcv3
[abgbrwcv]: /abgbrwcv4
[alqadfgn]: /alqadfgn

translates to

<p><a href="/foo">foo</a>
<a href="/qnptgbh">qnptgbh</a>
<a href="/abgbrwcv2">abgbrwcv</a>
<a href="/alqadfgn">alqadfgn</a>
[axgydtdu]</p>

I.e. MD4C currently does not guarantee that the first reference definition with a given label is used.
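CommonMark says that when several definitions match, the one that comes first in the document is used. A minimal sketch of a table with that "first definition wins" behaviour is shown below. The types and functions are hypothetical and deliberately simple (linear search, exact label comparison without case folding); they are not how MD4C stores its definitions.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_DEFS   16

typedef struct demo_ref_def {
    const char* label;
    const char* dest;
} demo_ref_def;

typedef struct demo_ref_table {
    demo_ref_def defs[DEMO_MAX_DEFS];
    int n;
} demo_ref_table;

/* Return the destination for `label`, or NULL if it is not defined. */
const char*
demo_ref_lookup(const demo_ref_table* t, const char* label)
{
    int i;

    for(i = 0; i < t->n; i++) {
        if(strcmp(t->defs[i].label, label) == 0)
            return t->defs[i].dest;
    }
    return NULL;
}

/* Returns 1 if added, 0 if the label already exists (the earlier
 * definition is kept), -1 if the table is full. */
int
demo_ref_add(demo_ref_table* t, const char* label, const char* dest)
{
    if(demo_ref_lookup(t, label) != NULL)
        return 0;
    if(t->n >= DEMO_MAX_DEFS)
        return -1;

    t->defs[t->n].label = label;
    t->defs[t->n].dest = dest;
    t->n++;
    return 1;
}
```

Feeding the four abgbrwcv definitions from the example into this table in document order leaves /abgbrwcv as the destination, not /abgbrwcv2. A parser that sorts or hashes its definitions needs a stable tie-break to preserve the same guarantee.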

Erroneous UTF-16 Surrogate Decoding

The pertinent macros to detect and decode UTF-16 surrogate code units in md4c.c are (7d20152):

#define IS_UTF16_SURROGATE_HI(word)     (((WORD)(word) & 0xfc) == 0xd800)
#define IS_UTF16_SURROGATE_LO(word)     (((WORD)(word) & 0xfc) == 0xdc00)
#define UTF16_DECODE_SURROGATE(hi, lo)  ((((unsigned)(hi) & 0x3ff) << 10) | (((unsigned)(lo) & 0x3ff) << 0))

The constant 0xfc in the first two lines should read 0xfc00.

The cast to WORD seems pointless: the actual argument to these macros is always a wchar_t expression, which (in MSC) is promoted to a 32-bit int without sign extension. (Furthermore, WORD is defined in <windows.h> and is currently the only name used from there.) Thus the defining expression could be:

((word) & 0xfc00) == 0xd800

The expression to compose the Unicode code point from the two 10-bit fragments omits the bias value 0x10000; it should read:

0x10000 + ((((hi) & 0x3ff) << 10) | ((lo) & 0x3ff))

or - using only required parentheses -:

0x10000 + ( ((hi) & 0x3ff) << 10  |  (lo) & 0x3ff )
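Put together, the corrected macros might look like the sketch below. The `DEMO_` names and the helper function are hypothetical and only serve to make the fix testable in isolation; returning U+FFFD for an unpaired surrogate is a design choice of this sketch, not something the report prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask 0xfc00 keeps the top six bits, which identify a surrogate. */
#define DEMO_IS_SURROGATE_HI(u)     (((u) & 0xfc00) == 0xd800)
#define DEMO_IS_SURROGATE_LO(u)     (((u) & 0xfc00) == 0xdc00)

/* Two 10-bit fragments plus the 0x10000 bias. */
#define DEMO_DECODE_SURROGATE(hi, lo)                                   \
    (0x10000 + ((((uint32_t)(hi) & 0x3ff) << 10) | ((uint32_t)(lo) & 0x3ff)))

/* Decode one code point starting at units[*p_off] and advance *p_off.
 * The caller guarantees *p_off < n. */
uint32_t
demo_decode_utf16(const uint16_t* units, size_t n, size_t* p_off)
{
    uint16_t unit = units[(*p_off)++];

    if(DEMO_IS_SURROGATE_HI(unit) && *p_off < n &&
       DEMO_IS_SURROGATE_LO(units[*p_off]))
    {
        uint16_t lo = units[(*p_off)++];
        return DEMO_DECODE_SURROGATE(unit, lo);
    }

    /* Unpaired surrogate: substitute the replacement character. */
    if(DEMO_IS_SURROGATE_HI(unit) || DEMO_IS_SURROGATE_LO(unit))
        return 0xFFFD;

    return unit;
}
```

For the pair 0xD83D 0xDE00 this yields U+1F600. With the original 0xfc mask, `0xD83D & 0xfc` is 0x3c, which can never equal 0xd800, so no surrogate would ever be detected.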

Move HTML renderer from md2html.c to a separate source file

I'm currently working on porting an app from Discount to md4c and would like to reuse the HTML rendering code from md2html.c. However, copying the sources into my project repo currently requires manually stripping out code for option parsing, clock() calls, the main() function, etc., and makes it harder to track future upstream changes.

Would a pull request to move just the rendering code to a separate source file be accepted? If so, is it likely to conflict with any changes you are working on for issue #5?

P.S. Thanks for sharing this great library. I've been wanting to upgrade a few projects to a CommonMark compatible parser for a long time, but couldn't use cmark for lack of tables support. The feature set of md4c really hits the sweet spot I was looking for.

Get source line number for headings

Hello,
is it possible to retrieve the source line number for headings?
I would like to build a parser which can create a TOC with a navigation feature for a Markdown editor.

Emphasis parsed wrongly.

a*b**c* should translate to <p>a<em>b**c</em></p>, but currently it is translated to <p>a*b**c*</p>.

Parsing '\``\``\``\``...' takes quadratic time

$ python -c 'print("\\``" * 20000)' | time md2html > /dev/null
0.81user 0.00system 0:00.85elapsed 95%CPU (0avgtext+0avgdata 2480maxresident)k
0inputs+0outputs (0major+333minor)pagefaults 0swaps
$ python -c 'print("\\``" * 40000)' | time md2html > /dev/null
3.70user 0.00system 0:03.76elapsed 98%CPU (0avgtext+0avgdata 3240maxresident)k
0inputs+0outputs (0major+551minor)pagefaults 0swaps
$ python -c 'print("\\``" * 80000)' | time md2html > /dev/null
15.23user 0.00system 0:15.35elapsed 99%CPU (0avgtext+0avgdata 5008maxresident)k
0inputs+0outputs (0major+991minor)pagefaults 0swaps

(the 2nd case from #57)

A list item can begin with at most one blank line.

MD4C does not currently follow the rule of CommonMark spec 0.27 that

a list item can begin with at most one blank line.

Therefore we fail Example 241.

However when we literally implement the rule as per this patch:

diff --git a/md4c/md4c.c b/md4c/md4c.c
index aae2a63..a82d458 100644
--- a/md4c/md4c.c
+++ b/md4c/md4c.c
@@ -4833,6 +4833,18 @@ redo:
             ctx->last_line_has_list_loosening_effect = (n_parents > 0  &&
                     n_brothers + n_children == 0  &&
                     ctx->containers[n_parents-1].ch != _T('>'));
+
+            /* If current list item contains nothing but a single blank line
+             * and we would be second blank line in the same list item, then
+             * we end the list. */
+            if(n_parents > 0  &&  ctx->containers[n_parents-1].ch != _T('>')  &&
+               n_brothers + n_children == 0  &&  ctx->current_block == NULL  &&
+               ctx->n_block_bytes > sizeof(MD_BLOCK))
+            {
+                MD_BLOCK* top_block = (MD_BLOCK*) ((char*)ctx->block_bytes + ctx->n_block_bytes - sizeof(MD_BLOCK));
+                if(top_block->type == MD_BLOCK_LI)
+                    n_parents--;
+            }
         }
         goto done;
     } else {

then we stop passing Example 274.

After inspection, it looks like a contradiction in the spec to me. I filed the issue commonmark/commonmark-spec#443 for it.

Parsing '[ (]([ (]([ (]([ (](...' takes quadratic time

$ python -c 'print("[ (](" * 20000)' | time md2html > /dev/null
0.75user 0.00system 0:00.76elapsed 97%CPU (0avgtext+0avgdata 3440maxresident)k
0inputs+0outputs (0major+550minor)pagefaults 0swaps
$ python -c 'print("[ (](" * 40000)' | time md2html > /dev/null
2.96user 0.00system 0:03.00elapsed 98%CPU (0avgtext+0avgdata 4956maxresident)k
0inputs+0outputs (0major+989minor)pagefaults 0swaps
$ python -c 'print("[ (](" * 80000)' | time md2html > /dev/null
12.22user 0.00system 0:12.28elapsed 99%CPU (0avgtext+0avgdata 8568maxresident)k
0inputs+0outputs (0major+1867minor)pagefaults 0swaps

(the 3rd case from #57)
