ricklupton / rmscene

Read v6 .rm files from the reMarkable tablet

License: MIT License

Python 100.00%

rmscene's Introduction

rmscene

Python library to read v6 files from reMarkable tablets (software version 3).

In particular, this version introduces the ability to include text as well as drawn lines. Extracting this text was the original motivation for developing this library, but it can also read many of the other types of data in reMarkable files.

To convert rm files to other formats, you can use rmc, which combines this library with code for converting lines to SVG, PDF, and simple Markdown.

Changelog

Unreleased

New features:

Add support for new blocks: 0x0D (SceneInfo) and 0x08 (SceneTombstoneItemBlock) (#24)
Add support for move_id field on some SceneLineItems (#24)

v0.5.0

Breaking changes:

The start property of GlyphRange items is now optional (#15).
The representation of formatted text spans has changed. Rather than using nested structures like BoldSpan and ItalicSpan, the CrdtStr objects now have optional text properties like font-weight and font-style. This simplifies the parsing code and the resulting data structure.

New features:

Improved error recovery. An error during parsing, or an unknown block type, results in an UnreadableBlock containing the data that could not be read, so that parsing of other blocks can continue.
Compatible with new reMarkable software version 3.6 format for highlighted text (#15).
New methods read_bool_optional and similar of TaggedBlockReader which return a default value if no matching tagged value is present in the block.

Other changes and fixes:

The value attribute of scene item blocks, which was not being used, has been removed.
Check more carefully for sub-blocks (#17).
Type hints fixed for expand_text_items.

v0.4.0

Breaking changes:

Rename scene_items.TextFormat to ParagraphStyle to better describe its meaning, now that we have inline bold/italic text styles.
Remove methods from scene_items.Text object; use text.TextDocument instead.
Writer: experimental change to emulate different reMarkable software versions by passing {"version": "3.2.2"} options to write_blocks. This allows us to continue to test round-trip reading and writing of old test files as new data values are added. Replaces "line_version" option.

New features:

Parse text formatting information (bold and italic) introduced in reMarkable software version 3.3.

Other changes:

Allow empty text items and unknown text formats without throwing exceptions.
When extra data is present in the file, log the unrecognised bytes at DEBUG logging level along with the call stack, to make it easier to figure out where the code needs to be modified to read new data.
Parse new data values (with unknown meaning) in PageInfoBlock and MigrationInfoBlock.

v0.3.0

Introduce CrdtSequence type to handle the different places that CRDT sequences are used, not just for text.
Introduce scene_items module with data structures representing the data, independently from the Blocks used to serialize them to .rm files.
Introduce a SceneTree structure which holds the SceneItems in groups/layers.
Move Text data from RootTextBlock to scene_items.Text class, which includes methods for extracting lines of text and formatting.
Text lines now include the trailing newline character.
Read GlyphRange scene items, representing highlighted text in PDFs.

v0.2.0

Try to be more robust to unexpected data introduced by newer reMarkable software versions.
Only warn once if unknown data is present, rather than for every block.
Small API change to return type of read_block and read_subblock methods.

v0.1.0

Initial release

Acknowledgements

https://github.com/ddvk/reader helped a lot in figuring out the structure and meaning of the files. @adq discovered a means to get debug output (see issue 25) which is very helpful for understanding the format.

Contributors:

@Azeirah -- code and reporting issues
@adq -- code and reporting issues
@dotlambda -- packaging

rmscene's Issues

Unknown data `move_id` on line items

Tag index 7, after timestamp

Seen in #6

Better error handling within blocks

Currently any parsing error aborts parsing the whole file, but we should be able to skip to the next block and try again. Reporting the location in the file where the error happened would be useful for diagnosing the problem too (e.g. in #17)

In _read_blocks, the yield block_type.from_stream(stream) can be wrapped in a try/except.
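
The suggestion above can be sketched roughly as follows. This is a minimal, self-contained illustration and not rmscene's actual code: the framing is simplified to a 4-byte little-endian length followed by the payload, and `UnreadableBlock` (as defined here), `read_blocks_with_recovery` and `parse_block` are hypothetical names. Because the whole payload is read before parsing starts, the stream is already positioned at the next block when parsing fails.

```python
import io
import struct
from dataclasses import dataclass


@dataclass
class UnreadableBlock:
    """Hypothetical container for a block that could not be parsed."""
    offset: int   # position of the payload in the file
    data: bytes   # raw payload, kept for later diagnosis
    error: str


def read_blocks_with_recovery(stream, parse_block):
    """Yield parsed blocks; on a parse error yield an UnreadableBlock and continue.

    Assumes simplified framing: <uint32 length><payload>.
    """
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of file
        (length,) = struct.unpack("<I", header)
        offset = stream.tell()
        data = stream.read(length)
        # Only the parsing is wrapped, so errors raised by the consumer
        # of the generator are not swallowed.
        try:
            block = parse_block(data)
        except Exception as exc:
            block = UnreadableBlock(offset, data, f"{type(exc).__name__}: {exc}")
        yield block


def parse_ascii(data):
    """Toy parser used for the example: fails on non-ASCII payloads."""
    return data.decode("ascii")


payloads = [b"ok", b"\xff\xfe", b"fine"]
raw = b"".join(struct.pack("<I", len(p)) + p for p in payloads)
blocks = list(read_blocks_with_recovery(io.BytesIO(raw), parse_ascii))
```

Recording the offset alongside the raw bytes covers the second half of the request, since it tells you where in the file the failure happened.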

Remove unnecessary attribute in SceneLineItemBlock "value"

I am not sure whether I am the problem or I misunderstood something, but while trying to integrate rmscene into rmrl (see https://github.com/benneti/rmrl), I found that it is not possible to access block.value, only block.item.value.
But looking at the code (as someone unfamiliar with @dataclass notation), it looks like an optional value should be available directly on the block.
Some more insight would be appreciated!

A styling span (bold/italic) can cover more than one paragraph, causing an unbalanced stack

You can see the problem on...

page 1 on the attached RM document d133d282-93aa-4dfd-a7fb-00eba9a0f8d3.rm
as well as on page 5 of 1c08e4c9-f45e-4e73-8c5f-bfeb3bac42ad.rm (this one is clearer; I made this page intentionally to showcase the problem)

The code in text.py assumes that the bold/italic stack should be balanced per paragraph, but this is not necessarily the case. The stack is balanced over the entire text document. This is because italic or bold can span multiple paragraphs. As you can see on the attached images, I have three paragraphs of text in a row which all have italic styling applied.

The following two images show the italicized text and the selection I made to create the italic span respectively.

The per-character log

This log is for page 5.

I added print statements to TextDocument.from_scene_item, printing out each character per iteration.

(Note: I replaced each deleted character with "キ" to make them stand out during debugging; the symbol has no special meaning.)

As you can see in the log, the italic span opens in the first paragraph at CrdtId(1, 151) and ends at the third paragraph at CrdtId(1, 152)

CrdtId(1, 15) "キ"
CrdtId(1, 16) "キ"
CrdtId(1, 17) "キ"
CrdtId(1, 18) "キ"
CrdtId(1, 19) "キ"
CrdtId(1, 20) "キ"
CrdtId(1, 21) "キ"
CrdtId(1, 22) "キ"
CrdtId(1, 23) "キ"
CrdtId(1, 24) "キ"
CrdtId(1, 25) "キ"
CrdtId(1, 26) "キ"
CrdtId(1, 27) "キ"
CrdtId(1, 28) "キ"
CrdtId(1, 29) "キ"
CrdtId(1, 30) "T"
CrdtId(1, 31) "h"
CrdtId(1, 32) "r"
CrdtId(1, 33) "e"
CrdtId(1, 34) "e"
CrdtId(1, 35) " "
CrdtId(1, 36) "キ"
CrdtId(1, 37) "キ"
CrdtId(1, 38) "キ"
CrdtId(1, 39) "キ"
CrdtId(1, 40) "キ"
CrdtId(1, 41) "キ"
CrdtId(1, 42) "キ"
CrdtId(1, 151) 3
CrdtId(1, 43) "p"
CrdtId(1, 44) "a"
CrdtId(1, 45) "r"
CrdtId(1, 46) "a"
CrdtId(1, 47) "g"
CrdtId(1, 48) "r"
CrdtId(1, 49) "a"
CrdtId(1, 50) "p"
CrdtId(1, 51) "h"
CrdtId(1, 52) "s"
CrdtId(1, 53) " "
CrdtId(1, 54) "w"
CrdtId(1, 55) "i"
CrdtId(1, 56) "t"
CrdtId(1, 57) "h"
CrdtId(1, 58) " "
CrdtId(1, 59) "i"
CrdtId(1, 60) "t"
CrdtId(1, 61) "a"
CrdtId(1, 62) "l"
CrdtId(1, 63) "i"
CrdtId(1, 64) "キ"
CrdtId(1, 65) "キ"
CrdtId(1, 66) "キ"
CrdtId(1, 67) "キ"
CrdtId(1, 68) "キ"
CrdtId(1, 69) "キ"
CrdtId(1, 70) "キ"
CrdtId(1, 71) "c"
CrdtId(1, 72) " "
CrdtId(1, 73) "i"
CrdtId(1, 74) "n"
CrdtId(1, 75) " "
CrdtId(1, 76) "o"
CrdtId(1, 77) "n"
CrdtId(1, 153) "e"
CrdtId(1, 78) " "
CrdtId(1, 79) "s"
CrdtId(1, 80) "e"
CrdtId(1, 81) "l"
CrdtId(1, 82) "e"
CrdtId(1, 83) "キ"
CrdtId(1, 84) "キ"
CrdtId(1, 85) "c"
CrdtId(1, 86) "t"
CrdtId(1, 87) "i"
CrdtId(1, 88) "o"
CrdtId(1, 89) "n"
CrdtId(1, 90) "
"
End of paragraph
Breaking
Unbalanced stack! [(None, [CrdtStr(s='キキキキキキキキキキキキキキキThree キキキキキキキ', i=[CrdtId(1, 15), CrdtId(1, 16), CrdtId(1, 17), CrdtId(1, 18), CrdtId(1, 19), CrdtId(1, 20), CrdtId(1, 21), CrdtId(1, 22), CrdtId(1, 23), CrdtId(1, 24), CrdtId(1, 25), CrdtId(1, 26), CrdtId(1, 27), CrdtId(1, 28Id(1, 30), CrdtId(1, 31), CrdtId(1, 32), CrdtId(1, 33), CrdtId(1, 34), CrdtId(1, 35), CrdtId(1, 36), CrdtId(1, 37), CrdtId(1, 38), CrdtId(1, 39), CrdtId(1, 40), CrdtId(1, 41), CrdtId(1, 42)])]), (<class 'rmscene.text.ItalicSpan'>, [CrdtStr(s='paragraphs with italiキキキキキキキc in one seleキキction'Id(1, 43), CrdtId(1, 44), CrdtId(1, 45), CrdtId(1, 46), CrdtId(1, 47), CrdtId(1, 48), CrdtId(1, 49), CrdtId(1, 50), CrdtId(1, 51), CrdtId(1, 52), CrdtId(1, 53), CrdtId(1, 54), CrdtId(1, 55), CrdtId(1, 56), CrdtId(1, 57), CrdtId(1, 58), CrdtId(1, 59), CrdtId(1, 60), CrdtId(1, 61), CrdtId(1, 62), CrdtId(1, 63), CrdtId(1, 64), CrdtId(1, 65), CrdtId(1, 66), CrdtId(1, 67), CrdtId(1, 68), CrdtId(1, 69), CrdtId(1, 70), CrdtId(1, 71), CrdtId(1, 72), CrdtId(1, 73), CrdtId(1, 74), CrdtId(1, 75), CrdtId(1, 76), CrdtId(1, 77), CrdtId(1, 153), CrdtId(1, 78), CrdtId(1, 79), CrdtId(1, 80), CrdtId(1, 81), CrdtId(1, 82), CrdtId(1, 83), CrdtId(1, 84), CrdtId(1, 85), CrdtId(1, 86), CrdtId(1, 87), CrdtId(1, 88), CrdtId(1, 89)])])]
-------------
New paragraph
-------------
CrdtId(1, 91) "T"
CrdtId(1, 92) "h"
CrdtId(1, 93) "i"
CrdtId(1, 94) "s"
CrdtId(1, 95) " "
CrdtId(1, 96) "i"
CrdtId(1, 97) "s"
CrdtId(1, 98) " "
CrdtId(1, 99) "t"
CrdtId(1, 100) "h"
CrdtId(1, 101) "e"
CrdtId(1, 102) " "
CrdtId(1, 103) "m"
CrdtId(1, 104) "i"
CrdtId(1, 105) "d"
CrdtId(1, 106) "d"
CrdtId(1, 107) "l"
CrdtId(1, 108) "e"
CrdtId(1, 109) " "
CrdtId(1, 110) "p"
CrdtId(1, 111) "a"
CrdtId(1, 112) "r"
CrdtId(1, 113) "a"
CrdtId(1, 114) "g"
CrdtId(1, 115) "r"
CrdtId(1, 116) "a"
CrdtId(1, 117) "p"
CrdtId(1, 118) "h"
CrdtId(1, 119) "
"
End of paragraph
Breaking
-------------
New paragraph
-------------
CrdtId(1, 120) "T"
CrdtId(1, 121) "H"
CrdtId(1, 122) "i"
CrdtId(1, 123) "s"
CrdtId(1, 124) " "
CrdtId(1, 125) "i"
CrdtId(1, 126) "s"
CrdtId(1, 127) " "
CrdtId(1, 128) "t"
CrdtId(1, 129) "h"
CrdtId(1, 130) "e"
CrdtId(1, 131) " "
CrdtId(1, 132) "e"
CrdtId(1, 133) "n"
CrdtId(1, 134) "d"
CrdtId(1, 135) " "
CrdtId(1, 136) "キ"
CrdtId(1, 137) "キ"
CrdtId(1, 138) "キ"
CrdtId(1, 139) "p"
CrdtId(1, 140) "a"
CrdtId(1, 141) "r"
CrdtId(1, 142) "a"
CrdtId(1, 143) "キ"
CrdtId(1, 144) "キ"
CrdtId(1, 145) "キ"
CrdtId(1, 146) "g"
CrdtId(1, 147) "r"
CrdtId(1, 148) "a"
CrdtId(1, 152) 4
Unexpected end of span at CrdtId(1, 152): got <class 'rmscene.text.ItalicSpan'>, expected None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/text.py", line 179, in from_scene_item
    contents = parse_paragraph_contents()
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/text.py", line 154, in parse_paragraph_contents
    stack[-1][1].append(span_type(nested))
IndexError: list index out of range
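
One way to avoid the unbalanced stack is to track the active styles across the whole document instead of per paragraph. The sketch below is a minimal illustration under assumptions, not the code in text.py: the input is a flat sequence of characters and integer format codes, codes 3 and 4 are taken to open and close italic (consistent with the log above), codes 1 and 2 are assumed by analogy to do the same for bold, and `split_paragraphs` is a hypothetical name.

```python
# Assumed meaning of the integer format codes. 3 and 4 match the log above;
# 1 and 2 are a guess by analogy.
FORMAT_CODES = {
    1: ("bold", True),
    2: ("bold", False),
    3: ("italic", True),
    4: ("italic", False),
}


def split_paragraphs(items):
    """Split items into paragraphs of (text, styles) spans.

    The set of active styles is NOT reset at a newline, so a style opened
    in one paragraph can be closed in a later one.
    """
    paragraphs = []
    spans = []       # spans of the paragraph being built
    active = set()   # styles currently switched on
    buffer = []      # characters of the span being built

    def flush():
        if buffer:
            spans.append(("".join(buffer), frozenset(active)))
            buffer.clear()

    for item in items:
        if isinstance(item, int):
            flush()  # a style change ends the current span
            style, enabled = FORMAT_CODES[item]
            if enabled:
                active.add(style)
            else:
                active.discard(style)
        else:
            buffer.append(item)
            if item == "\n":
                flush()
                paragraphs.append(list(spans))
                spans.clear()
    flush()
    if spans:
        paragraphs.append(list(spans))
    return paragraphs


# Italic opens in the first paragraph and closes in the third.
example = split_paragraphs(["a", 3, "b", "\n", "c", "\n", "d", 4, "e"])
```

Each span carries its own set of styles, so no nesting stack is needed. This is similar in spirit to attaching style properties directly to each string rather than wrapping strings in nested span objects.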

Text formatting text notebook

Release version 0.5.0 to PyPI

I noticed that version 0.4.0 is the latest on PyPI.

It might be nice to have a GitHub Action that does this automatically when a new version is created?

Node does not exist for SceneGroupItemBlock: None

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/remarks/__main__.py", line 151, in <module>
    main()
  File "/app/remarks/__main__.py", line 147, in main
    run_remarks(input_dir, output_dir, **args_dict)
  File "/app/remarks/remarks.py", line 96, in run_remarks
    process_document(metadata_path, out_path, doc_type, **kwargs)
  File "/app/remarks/remarks.py", line 271, in process_document
    (ann_data, has_ann_hl), version = parse_rm_file(ann_rm_file)
  File "/app/remarks/conversion/parsing.py", line 259, in parse_rm_file
    return parse_v6(file_path), "V6"
  File "/app/remarks/conversion/parsing.py", line 128, in parse_v6
    dims = determine_document_dimensions(file_path)
  File "/app/remarks/conversion/parsing.py", line 183, in determine_document_dimensions
    build_tree(tree, blocks)
  File "/app/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 773, in build_tree
    raise ValueError(
ValueError: Node does not exist for SceneGroupItemBlock: None

def build_tree(tree: SceneTree, blocks: Iterable[Block]):
    for b in blocks:
        if ...:
            ...
        elif isinstance(b, SceneGroupItemBlock):
            # Add this entry to children of parent_id
            node_id = b.item.value
            # if node_id is None:
            #     continue
            if node_id not in tree:
                raise ValueError(
                    "Node does not exist for SceneGroupItemBlock: %s" % node_id
                )
            item = replace(b.item, value=tree[node_id])
            tree.add_item(item, b.parent_id)

Given that the value in the SceneGroupItemBlock is typed as tp.Optional[CrdtId], does that mean it is valid for it to be None?

I added the commented continue if None code and there are no further errors in the document.

Am I correct in assuming it's ok to just skip it if it's None given that the value is optionally typed?
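
For reference, the workaround described above can be written as a self-contained sketch. The names here (`GroupItem`, `link_groups`) are hypothetical stand-ins and not rmscene's classes, and whether skipping is the right behaviour is exactly the open question; the sketch only shows the control flow, counting skipped items so they can at least be reported.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple

NodeId = Tuple[int, int]


@dataclass
class GroupItem:
    """Hypothetical stand-in for a SceneGroupItemBlock."""
    parent_id: NodeId
    value: Optional[NodeId]  # id of the child node, or None


def link_groups(
    known_nodes: Set[NodeId],
    children: Dict[NodeId, List[NodeId]],
    items: Iterable[GroupItem],
) -> int:
    """Attach each item's node to its parent; return how many items were skipped."""
    skipped = 0
    for item in items:
        if item.value is None:
            # Assumption: an item without a value carries nothing to attach.
            skipped += 1
            continue
        if item.value not in known_nodes:
            raise ValueError("Node does not exist for group item: %s" % (item.value,))
        children.setdefault(item.parent_id, []).append(item.value)
    return skipped


known = {(0, 1), (0, 11)}
children: Dict[NodeId, List[NodeId]] = {}
skipped = link_groups(
    known,
    children,
    [GroupItem((0, 1), (0, 11)), GroupItem((0, 1), None)],
)
```

Keeping the "unknown node" error separate from the "no value" case means a genuinely missing node is still reported.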

ReMarkable version 3.3 support

ReMarkable released beta v3.3, introducing bold and italic text for the keyboard.

I made a test notebook with the following contents:

Bolded text
Italicized text
Text in a list

Processing this file results in the following output:

File: "bolded text.notebook" (325e6c07-7f7c-4ddb-b75a-2de392a05bf6)

Some data has not been read. The data may have been written using a newer format than this reader supports.
In MainBlockInfo(offset=84, size=7, extra_data=b'', block_type=0, min_version=1, current_version=1) only read 5 bytes
In MainBlockInfo(offset=99, size=25, extra_data=b'', block_type=10, min_version=0, current_version=1) only read 20 bytes
In SubBlockInfo(offset=1014, size=7, extra_data=b'') only read 2 bytes
In SubBlockInfo(offset=1078, size=7, extra_data=b'') only read 2 bytes
In SubBlockInfo(offset=1266, size=7, extra_data=b'') only read 2 bytes
In SubBlockInfo(offset=1382, size=7, extra_data=b'') only read 2 bytes
Unknown block type 8. Skipping 20 bytes.
In SubBlockInfo(offset=10332, size=233, extra_data=b'') only read 229 bytes
In SubBlockInfo(offset=10598, size=135, extra_data=b'') only read 131 bytes
In SubBlockInfo(offset=10766, size=4713, extra_data=b'') only read 4709 bytes
In SubBlockInfo(offset=15512, size=233, extra_data=b'') only read 229 bytes
In SubBlockInfo(offset=15778, size=135, extra_data=b'') only read 131 bytes
In SubBlockInfo(offset=15946, size=4713, extra_data=b'') only read 4709 bytes
In SubBlockInfo(offset=20692, size=233, extra_data=b'') only read 229 bytes
In SubBlockInfo(offset=20958, size=135, extra_data=b'') only read 131 bytes
In SubBlockInfo(offset=21126, size=4713, extra_data=b'') only read 4709 bytes

bolded text.zip

data structure

Hi, it's so good to see someone tackles the new file format :)
So it seems reMarkable has switched to a distributed-database approach to store and sync notebooks? An LWW-CRDT:
https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#LWW-Element-Set_(Last-Write-Wins-Element-Set) ?
I don't know much about this kind of data type, but I wonder whether it's generic enough that a de/serialization library already exists?

Extracting no longer works with the v3.2.3.1595 release

I'm not sure it's helpful to get bug reports from the bleeding edge release, but just in case:

poetry run python -m rmscene print-blocks \
    ~/backup/remarkable/latest/files/7de2ff35-356b-425e-bba1-5b1e0b5a7f94/809e181d-8974-461c-a824-da0ec4fb0713.rm

AuthorIdsBlock(author_uuids={1: UUID('624b4da6-5190-505d-ac2a-947f7e7505cf')})

MigrationInfoBlock(migration_id=CrdtId(1, 1), is_device=True)
Block starting at 84, length 7, only read 5
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/philipsd6/devel/remarkable/rmscene/src/rmscene/__main__.py", line 62, in <module>
    args.func(args)
  File "/home/philipsd6/devel/remarkable/rmscene/src/rmscene/__main__.py", line 35, in pprint_file
    for el in result:
  File "/home/philipsd6/devel/remarkable/rmscene/src/rmscene/scene_stream.py", line 738, in read_blocks
    yield from _read_blocks(stream)
  File "/home/philipsd6/devel/remarkable/rmscene/src/rmscene/scene_stream.py", line 723, in _read_blocks
    yield block_type.from_stream(stream)
  File "/home/philipsd6/devel/remarkable/rmscene/src/rmscene/scene_stream.py", line 99, in from_stream
    migration_id = stream.read_id(1)
  File "/home/philipsd6/devel/remarkable/rmscene/src/rmscene/tagged_block_reader.py", line 56, in read_id
    self.data.read_tag(index, TagType.ID)
  File "/home/philipsd6/devel/remarkable/rmscene/src/rmscene/tagged_block_common.py", line 93, in read_tag
    raise UnexpectedBlockError(
rmscene.tagged_block_common.UnexpectedBlockError: Expected index 1, got 0, at position 97

Add remarkable-tablet topic

https://remarkable.guide/devel/github.html#remarkable-tablet-topic

ReMarkable beta 3.6: UnexpectedBlockError

I'm getting this error in documents accessed with the ReMarkable beta v3.6. I haven't got a shareable file yet as I am not sure about how to reproduce it just yet, I will try to create a minimal example to reproduce it however.

This is the error I'm getting:

Unknown block type 13. Skipping 31 bytes.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/remarks/__main__.py", line 151, in <module>
    main()
  File "/app/remarks/__main__.py", line 147, in main
    run_remarks(input_dir, output_dir, **args_dict)
  File "/app/remarks/remarks.py", line 96, in run_remarks
    process_document(metadata_path, out_path, doc_type, **kwargs)
  File "/app/remarks/remarks.py", line 188, in process_document
    pages_map.append(determine_document_dimensions(path))
  File "/app/remarks/conversion/parsing.py", line 183, in determine_document_dimensions
    build_tree(tree, blocks)
  File "/app/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 750, in build_tree
    for b in blocks:
  File "/app/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 725, in read_blocks
    yield from _read_blocks(stream)
  File "/app/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 707, in _read_blocks
    yield block_type.from_stream(stream)
  File "/app/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 376, in from_stream
    value = subclass.value_from_stream(stream)
  File "/app/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 468, in value_from_stream
    value = glyph_range_from_stream(reader)
  File "/app/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 420, in glyph_range_from_stream
    start = stream.read_int(2)
  File "/app/.venv/lib/python3.10/site-packages/rmscene/tagged_block_reader.py", line 88, in read_int
    self.data.read_tag(index, TagType.Byte4)
  File "/app/.venv/lib/python3.10/site-packages/rmscene/tagged_block_common.py", line 103, in read_tag
    raise UnexpectedBlockError(
rmscene.tagged_block_common.UnexpectedBlockError: Expected index 2, got 4, at position 5992

From what I can tell, it looks like the glyph_range data has changed:

def glyph_range_from_stream(stream: TaggedBlockReader) -> si.GlyphRange:
    start = stream.read_int(2)
    length = stream.read_int(3)
    color_id = stream.read_int(4)  # ddvk has this as a byte?
    color = si.PenColor(color_id)
    text = stream.read_string(5)

I can fix each error as it comes along, making the final code something like this, bruteforcing my way to the new struct, but I get stuck at PenColor where the Error tells me it's a Byte2, but I get a value in the 20000s as a pencolor when reading the TaggedBlock as a short.

def glyph_range_from_stream(stream: TaggedBlockReader) -> si.GlyphRange:
    start = stream.read_int(4)
    length = stream.read_length(5) # has a tag 0xC == 12 bytes? Not sure what this is supposed to represent
    # i just bruteforced a function called read_length to skip the error but the returned data of course makes no sense.
    color_id = stream.read_short(0) 
    color = si.PenColor(color_id)
    text = stream.read_string(5)

Instead, I tried finding the struct definition using Ghidra. I'm stupendously inexperienced using tools like that so unfortunately I couldn't find much more than just references to the string GlyphRange :(

Do you have time to look at this? And if not, can you explain to me how to find the right values myself? It's been a while since I last did system programming in C/C++ but I'm not entirely unfamiliar with it either.
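
When a field may be absent in newer files, one option is to peek at the next tag and fall back to a default instead of raising. The sketch below is a toy illustration only: `TinyTaggedReader` and its methods are hypothetical, it handles single-byte tags only, the tag layout (index in the high nibble, type in the low nibble, with 0x4 meaning a 4-byte value) is an assumption, and it does not attempt to reproduce the actual 3.6 GlyphRange layout.

```python
import io
import struct

TAG_BYTE4 = 0x4  # assumed type code for a 4-byte integer value


def tagged_int(index, value):
    """Encode one tagged 4-byte integer (toy encoder for the example)."""
    return bytes([index << 4 | TAG_BYTE4]) + struct.pack("<I", value)


class TinyTaggedReader:
    """Hypothetical minimal reader; not rmscene's TaggedBlockReader."""

    def __init__(self, data):
        self.stream = io.BytesIO(data)

    def _peek_tag(self):
        """Return (index, type) of the next tag without consuming it."""
        pos = self.stream.tell()
        raw = self.stream.read(1)
        self.stream.seek(pos)
        if not raw:
            return None
        return raw[0] >> 4, raw[0] & 0xF

    def read_int(self, index):
        """Read a required integer; raise if the next tag does not match."""
        tag = self._peek_tag()
        if tag != (index, TAG_BYTE4):
            raise ValueError(f"Expected index {index}, got {tag}")
        self.stream.read(1)  # consume the tag byte
        (value,) = struct.unpack("<I", self.stream.read(4))
        return value

    def read_int_optional(self, index, default=None):
        """Read an integer if its tag is next, otherwise return the default."""
        if self._peek_tag() != (index, TAG_BYTE4):
            return default
        return self.read_int(index)


# Layout with a start field (index 2) followed by a length (index 3).
with_start = TinyTaggedReader(tagged_int(2, 10) + tagged_int(3, 5))
start_a = with_start.read_int_optional(2)
length_a = with_start.read_int(3)

# Layout where the start field is missing.
without_start = TinyTaggedReader(tagged_int(3, 5))
start_b = without_start.read_int_optional(2)
length_b = without_start.read_int(3)
```

Because the optional read does not consume anything when the tag is missing, the following required read still sees the same bytes.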

enabling debug dumping-as-json mode for v6 files

If you set SCENE_FILE_V6_DEBUG=1 before running xochitl, it will dump the .rm files as JSON files for debugging whenever it saves them; the output filenames are reported in journalctl.

If you wanna enable this persistently:

[Unit]
Description=reMarkable main application
StartLimitIntervalSec=600
StartLimitBurst=4
OnFailure=remarkable-fail.service
After=home.mount
Wants=rm-sync.service

[Service]
ExecStart=/usr/bin/xochitl --system
Restart=on-failure
WatchdogSec=60
Environment="SCENE_FILE_V6_DEBUG=1"

[Install]
WantedBy=multi-user.target

Block type 13

I was curious about block type 13, I noticed it tends to show up on pages with written text? I'm not sure about what it could be.

I did manage to parse one by sight-reading, but the data isn't self-explanatory either

1f00 0000                # block length = 31
0000 010d                # block header, min_version=0, max_version=1, 0x0d is our new block type
1c06 000000              # subblock of length=6
1f00 00                  #     first id 0
2f00 00                  #     second id 0
2c05 000000              # subblock of length=5
1f00 00                  #     id 0
2101                     #     byte val=1
3c05 0000 00             # subblock of length=5
1f00 00                  #     id 0
2101                     #     byte val=1
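
The sight-reading above can be reproduced mechanically with a small decoder. This is a sketch under assumptions taken from the dump's own annotations: each tag packs the index in the high nibble and the type in the low nibble, type 0xC introduces a subblock with a 4-byte length, 0xF an id (one byte plus a variable-length integer), and 0x1 a single byte. The function names are hypothetical and not part of rmscene.

```python
import io
import struct

# Bytes transcribed from the annotated dump above.
RAW = bytes.fromhex(
    "1f000000" "0000010d"            # block length, block header
    "1c06000000" "1f0000" "2f0000"   # subblock 1: two ids
    "2c05000000" "1f0000" "2101"     # subblock 2: id, byte
    "3c05000000" "1f0000" "2101"     # subblock 3: id, byte
)

TAG_BYTE1, TAG_LENGTH4, TAG_ID = 0x1, 0xC, 0xF  # assumed type codes


def read_varuint(stream):
    """Read a little-endian base-128 variable-length unsigned integer."""
    shift = result = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result


def read_values(stream, end):
    """Decode tagged values until the stream position reaches `end`."""
    values = []
    while stream.tell() < end:
        tag = read_varuint(stream)
        index, tag_type = tag >> 4, tag & 0xF
        if tag_type == TAG_ID:
            part1 = stream.read(1)[0]
            part2 = read_varuint(stream)
            values.append((index, "id", (part1, part2)))
        elif tag_type == TAG_BYTE1:
            values.append((index, "byte", stream.read(1)[0]))
        elif tag_type == TAG_LENGTH4:
            (length,) = struct.unpack("<I", stream.read(4))
            nested = read_values(stream, stream.tell() + length)
            values.append((index, "subblock", nested))
        else:
            raise ValueError(f"unhandled tag type {tag_type:#x}")
    return values


def decode_block(raw):
    """Decode one block: 8-byte header followed by tagged values."""
    stream = io.BytesIO(raw)
    length, _unknown, min_version, current_version, block_type = struct.unpack(
        "<IBBBB", stream.read(8)
    )
    header = {
        "length": length,
        "min_version": min_version,
        "current_version": current_version,
        "block_type": block_type,
    }
    return header, read_values(stream, stream.tell() + length)


header, values = decode_block(RAW)
```

The decoded structure has three subblocks at indices 1, 2 and 3, which lines up with the annotations in the dump.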

And this is my implementation

class SceneWIPItemBlock(Block):
    BLOCK_TYPE: tp.ClassVar = 0x0D

    @classmethod
    def from_stream(cls, reader: TaggedBlockReader) -> tp.Any:
            if reader.has_subblock(1):
                with reader.read_subblock(1):
                    b1id1 = reader.read_id(1)
                    b1id2 = reader.read_id(2)

            if reader.has_subblock(2):
                with reader.read_subblock(2):
                    b2id1 = reader.read_id(1)
                    b2unknown = reader.read_byte(2)

            if reader.has_subblock(3):
                with reader.read_subblock(3):
                    b3id1 = reader.read_id(1)
                    b3unknown = reader.read_byte(2)

Do you have any idea what this might be? I think if I can find out the name, it would be a good starting point. Maybe ddvk has the name in his repo, otherwise I might be able to find it using Ghidra.

There's no pressure behind finding out what it is, I was just curious.

AssertionError item_type == subclass.ITEM_TYPE

Back again with another error!

This one is a bit more difficult to figure out, but I feel like I got a decent idea about what's approximately going on.

Error

Traceback (most recent call last):
  File "/home/lb/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lb/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/__main__.py", line 151, in <module>
    main()
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/__main__.py", line 147, in main
    run_remarks(input_dir, output_dir, **args_dict)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/remarks.py", line 90, in run_remarks
    process_document(metadata_path, out_path, **kwargs)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/remarks.py", line 157, in process_document
    dims = determine_document_dimensions(rm_annotation_file)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/conversion/parsing.py", line 200, in determine_document_dimensions
    build_tree(tree, blocks)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 756, in build_tree
    for b in blocks:
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 731, in read_blocks
    yield from _read_blocks(stream)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 713, in _read_blocks
    yield block_type.from_stream(stream)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 375, in from_stream
    assert item_type == subclass.ITEM_TYPE
AssertionError

Alright, so we have a SceneItemBlock where the read block_info item_type doesn't match up with the subclass.ITEM_TYPE

I.e., this:

def from_stream(cls, stream: TaggedBlockReader) -> SceneItemBlock:
    "Group item block?"
    _logger.debug("Reading %s", cls.__name__)
    
    assert stream.current_block
    block_type = stream.current_block.block_type
    if block_type == SceneGlyphItemBlock.BLOCK_TYPE:
        subclass = SceneGlyphItemBlock
    elif block_type == SceneGroupItemBlock.BLOCK_TYPE:
        subclass = SceneGroupItemBlock
    elif block_type == SceneLineItemBlock.BLOCK_TYPE:
        subclass = SceneLineItemBlock
    elif block_type == SceneTextItemBlock.BLOCK_TYPE:
        subclass = SceneTextItemBlock
    else:
        raise ValueError(
            "unknown scene type %d in %s" % (block_type, stream.current_block)
        )

    parent_id = stream.read_id(1)
    item_id = stream.read_id(2)
    left_id = stream.read_id(3)
    right_id = stream.read_id(4)
    deleted_length = stream.read_int(5)

    if stream.has_subblock(6):
        with stream.read_subblock(6) as block_info:
            item_type = stream.data.read_uint8()
            # ---------------- HERE ---------------- #
            assert item_type == subclass.ITEM_TYPE
            value = subclass.value_from_stream(stream)
        # Keep known extra data
        extra_data = block_info.extra_data
    else:
        value = None
        extra_data = b""

The read item_type in this case was 1, whereas the subclass.ITEM_TYPE is 3. So we have a subclass of SceneLineItemBlock, but the item_type of the subblock corresponds to a SceneGlyphItemBlock.

What I found in the .rm file is that a single byte immediately following the read item_type contains the expected item type (at least in this scenario; maybe it doesn't always match up), so I modified the code to read that byte when the assertion fails.

if stream.has_subblock(6):
    with stream.read_subblock(6) as block_info:
        item_type = stream.data.read_uint8()
        if not item_type == subclass.ITEM_TYPE:
            override_item_type = stream.read_byte(0)
            assert override_item_type == subclass.ITEM_TYPE

Next, I found that the override_item_type was followed up by four IDs and an int, so clearly that has to be a block.

My guess is that somehow a block ends up becoming another block. What exactly's going on, I'm not super sure yet.

So I recurse to read the block again, starting after the override_item_type.

if stream.has_subblock(6):
    with stream.read_subblock(6) as block_info:
        item_type = stream.data.read_uint8()
        if not item_type == subclass.ITEM_TYPE:
            override_item_type = stream.read_byte(0)
            assert override_item_type == subclass.ITEM_TYPE
            # recur here
            return SceneItemBlock.from_stream(stream)

This works fine up until the point of trying to read the subclass' value in value_from_stream. This makes sense, since our subblocks' item_types weren't matching up. So my guess was that we ended up here with a SceneLineItemBlock with a GlyphRange subblock. Again, I'm not sure how that makes sense but it does look like it's what's happening here.

So I modified the code a bit more to override which reader class gets used. Instead of determining the block subclass by block_type in SceneItemBlock, I use the item_type as read in the second (recursive) SceneItemBlock call to determine the subclass. The code is a bit ugly, but as far as I can see it works as you would expect, with no more parsing errors after this. The code now looks like this:

    @classmethod
    def from_stream(
        cls, stream: TaggedBlockReader, override_item_type: int = -1
    ) -> SceneItemBlock:
        "Group item block?"
        _logger.debug("Reading %s", cls.__name__)

        assert stream.current_block
        block_type = stream.current_block.block_type
        if block_type == SceneGlyphItemBlock.BLOCK_TYPE:
            subclass = SceneGlyphItemBlock
        elif block_type == SceneGroupItemBlock.BLOCK_TYPE:
            subclass = SceneGroupItemBlock
        elif block_type == SceneLineItemBlock.BLOCK_TYPE:
            subclass = SceneLineItemBlock
        elif block_type == SceneTextItemBlock.BLOCK_TYPE:
            subclass = SceneTextItemBlock
        else:
            raise ValueError(
                "unknown scene type %d in %s" % (block_type, stream.current_block)
            )

        parent_id = stream.read_id(1)
        item_id = stream.read_id(2)
        left_id = stream.read_id(3)
        right_id = stream.read_id(4)
        deleted_length = stream.read_int(5)

        if stream.has_subblock(6):
            with stream.read_subblock(6) as block_info:
                item_type = stream.data.read_uint8()
                if not override_item_type == -1:
                    if override_item_type == SceneGlyphItemBlock.ITEM_TYPE:
                        subclass = SceneGlyphItemBlock
                    elif override_item_type == SceneGroupItemBlock.ITEM_TYPE:
                        subclass = SceneGroupItemBlock
                    elif override_item_type == SceneLineItemBlock.ITEM_TYPE:
                        subclass = SceneLineItemBlock
                    elif override_item_type == SceneTextItemBlock.ITEM_TYPE:
                        subclass = SceneTextItemBlock
                    else:
                        raise ValueError("unknown scene type %d in %s" % (override_item_type, stream.current_block))
                elif item_type != subclass.ITEM_TYPE:
                    override_item_type = stream.read_byte(0)
                    assert override_item_type == subclass.ITEM_TYPE
                    return subclass.from_stream(stream, item_type)
                value = subclass.value_from_stream(stream)
            # Keep known extra data
            extra_data = block_info.extra_data
        else:
            value = None
            extra_data = b""

        return subclass(
            parent_id,
            CrdtSequenceItem(item_id, left_id, right_id, deleted_length, value),
            extra_data=extra_data,
        )

I do have one exception left:

Traceback (most recent call last):
  File "/home/lb/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lb/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/__main__.py", line 151, in <module>
    main()
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/__main__.py", line 147, in main
    run_remarks(input_dir, output_dir, **args_dict)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/remarks.py", line 90, in run_remarks
    process_document(metadata_path, out_path, **kwargs)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/remarks.py", line 157, in process_document
    dims = determine_document_dimensions(rm_annotation_file)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/conversion/parsing.py", line 200, in determine_document_dimensions
    build_tree(tree, blocks)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 781, in build_tree
    for b in blocks:
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 756, in read_blocks
    yield from _read_blocks(stream)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 738, in _read_blocks
    yield block_type.from_stream(stream)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 375, in from_stream
    with stream.read_subblock(6) as block_info:
  File "/home/lb/miniconda3/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/tagged_block_reader.py", line 174, in read_subblock
    self._check_position(subblock)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/tagged_block_reader.py", line 185, in _check_position
    raise BlockOverflowError(
rmscene.tagged_block_reader.BlockOverflowError: <class 'rmscene.tagged_block_reader.SubBlockInfo'> starting at 698, length 0, read up to 809 (overflow by 111)

This makes sense: I read two blocks' worth of data but report only one block's length.
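To illustrate why the check fires, here is a minimal sketch of a length-checked sub-block reader. It is not rmscene's actual implementation: `read_subblock`, `BlockOverflowError` and `demo` below are simplified stand-ins written for this example, and the byte counts are chosen only to mirror the numbers in the traceback above.

```python
import io
from contextlib import contextmanager


class BlockOverflowError(Exception):
    """Stand-in error: more bytes were consumed than the sub-block declared."""


@contextmanager
def read_subblock(data: io.BytesIO, declared_length: int):
    """Yield the start offset, then verify how many bytes the body consumed."""
    start = data.tell()
    yield start
    consumed = data.tell() - start
    if consumed > declared_length:
        raise BlockOverflowError(
            f"sub-block starting at {start}, length {declared_length}, "
            f"read up to {data.tell()} (overflow by {consumed - declared_length})"
        )


def demo() -> str:
    """Read a nested block's worth of data inside a sub-block of length 0."""
    data = io.BytesIO(bytes(200))
    try:
        with read_subblock(data, declared_length=0):
            data.read(111)  # the nested read consumes an extra block's bytes
    except BlockOverflowError as exc:
        return str(exc)
    return "no overflow"


print(demo())
```

The check runs when the `with` body exits, which is why the traceback passes through `contextlib.__exit__` before reaching the position check.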

Tag releases on GitHub

Unknown bit assertion failed

Reproduction steps:

Open a previously working RM file (used Lines_v2.rm in this repository)
Using the desktop app, open the file
Double click to create a node for writing
Do not write text, and exit the editor

If this cannot be reproduced, here is a .rm file in which the unknown bit is non-zero.
(The file is zipped because GitHub does not accept .rm attachments.)
2d23993e-0422-4360-bce5-6c3e97a47d21.zip

Output:

// Snipped output
SceneGroupItemBlock(parent_id=CrdtId(0, 11),
                    item_id=CrdtId(2, 26),
                    left_id=CrdtId(1, 24),
                    right_id=CrdtId(0, 0),
                    deleted_length=0,
                    value=CrdtId(2, 25))
Sub-block starting at 559, length 64, only read 61

SceneLineItemBlock(parent_id=CrdtId(2, 25),
                   item_id=CrdtId(2, 30),
                   left_id=CrdtId(0, 0),
                   right_id=CrdtId(0, 0),
                   deleted_length=0,
                   value=Line(color=<PenColor.BLACK: 0>,
                              tool=<Pen.BALLPOINT_2: 15>,
                              points=[Point(x=-65.49798583984375,
                                            y=-176.16778564453125,
                                            speed=3,
                                            direction=0,
                                            width=8,
                                            pressure=0),
                                      Point(x=-65.70648193359375,
                                            y=-176.17727661132812,
                                            speed=0,
                                            direction=0,
                                            width=8,
                                            pressure=0)],
                              thickness_scale=1.0,
                              starting_length=0.0))
Block starting at 537, length 86, only read 83
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/redblueflame/test/rmscene/src/rmscene/__main__.py", line 62, in <module>
    args.func(args)
  File "/home/redblueflame/test/rmscene/src/rmscene/__main__.py", line 35, in pprint_file
    for el in result:
  File "/home/redblueflame/test/rmscene/src/rmscene/scene_stream.py", line 738, in read_blocks
    yield from _read_blocks(stream)
  File "/home/redblueflame/test/rmscene/src/rmscene/scene_stream.py", line 716, in _read_blocks
    with stream.read_block() as header:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/redblueflame/test/rmscene/src/rmscene/tagged_block_reader.py", line 125, in read_block
    assert unknown == 0
AssertionError

Bump version

I'd like to depend on the newest code. I'm currently depending on a git commit, which works, but I would prefer to depend on a versioned release that includes the latest changes, e.g. 0.3.1.

ValueError: 0 is not a valid TextFormat

Hi again, here's another Exception I encountered in the v3.3 beta.

I was creating a notebook to test the eraser tool for lucasrla/remarks#65, but I accidentally tapped the text tool. I typed nothing, closed the keyboard and pressed undo. The text tool expanded the page, but the expansion did not go away with the undos. There were some other weird bugs on the ReMarkable side of things (like one of my lines literally disappeared after this), but those are on ReMarkable, not for you.

I think the issue here is that ReMarkable did initialize a text block -- because I tapped the Text tool -- but didn't assign a text format to the text block because I didn't type any text.

If that's the case, text format = 0 can be safely ignored.

I think it would be reasonable to add an UNINITIALIZED = 0 enum value to the TextFormat enum.
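A minimal sketch of that suggestion, using a stand-in enum rather than rmscene's real `TextFormat`: the class name `TextFormatSketch`, the `PLAIN = 1` member and the `parse_format` helper are assumptions made for illustration only.

```python
from enum import IntEnum


class TextFormatSketch(IntEnum):
    """Hypothetical stand-in for the TextFormat enum."""

    UNINITIALIZED = 0  # proposed: text block created but nothing typed
    PLAIN = 1          # illustrative member; the value is an assumption


def parse_format(raw: int) -> TextFormatSketch:
    """Convert a raw byte to the enum; unknown values still raise ValueError."""
    return TextFormatSketch(raw)


print(parse_format(0).name)
```

With the extra member, a format byte of 0 parses cleanly, while values that are genuinely unknown keep raising `ValueError` so that new formats are not silently swallowed.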

(note that this output contains some extra data from remarks calling rmscene)

Found 1 documents in "tests/in/erasers", will process them now

File: "Erasers.notebook" (4404761d-89ea-4416-8579-f9d78f44cba1)
Some data has not been read. The data may have been written using a newer format than this reader supports.
In MainBlockInfo(offset=84, size=7, extra_data=b'', block_type=0, min_version=1, current_version=1) only read 5 bytes
In MainBlockInfo(offset=99, size=25, extra_data=b'', block_type=10, min_version=0, current_version=1) only read 20 bytes
Some data has not been read. The data may have been written using a newer format than this reader supports.
In MainBlockInfo(offset=84, size=7, extra_data=b'', block_type=0, min_version=1, current_version=1) only read 5 bytes
In MainBlockInfo(offset=99, size=25, extra_data=b'', block_type=10, min_version=0, current_version=1) only read 20 bytes
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/__main__.py", line 151, in <module>
    main()
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/__main__.py", line 147, in main
    run_remarks(input_dir, output_dir, **args_dict)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/remarks.py", line 94, in run_remarks
    process_document(metadata_path, out_path, doc_type, **kwargs)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/remarks.py", line 287, in process_document
    parsed_data, has_ann_hl = parse_rm_file(ann_rm_file)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/conversion/parsing.py", line 256, in parse_rm_file
    return parse_v6(file_path)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/conversion/parsing.py", line 126, in parse_v6
    dims = determine_document_dimensions(file_path)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/remarks/conversion/parsing.py", line 178, in determine_document_dimensions
    build_tree(tree, blocks)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 713, in build_tree
    for b in blocks:
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 690, in read_blocks
    yield from _read_blocks(stream)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 672, in _read_blocks
    yield block_type.from_stream(stream)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 603, in from_stream
    text_formats = dict(
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 604, in <genexpr>
    text_format_from_stream(stream) for _ in range(num_subblocks)
  File "/home/lb/PhpstormProjects/rm-notesync-laravel/binaries/remarks/.venv/lib/python3.10/site-packages/rmscene/scene_stream.py", line 555, in text_format_from_stream
    format_type = si.TextFormat(stream.data.read_uint8())
  File "/usr/lib/python3.10/enum.py", line 385, in __call__
    return cls.__new__(cls, value)
  File "/usr/lib/python3.10/enum.py", line 710, in __new__
    raise ve_exc
ValueError: 0 is not a valid TextFormat

Process finished with exit code 1

This is a zip of the notebook: Erasers.zip

Question regarding coordinates of SceneGlyphItemBlock

I am trying to export PDFs with highlighted text from my reMarkable so that the highlights show up as annotations in a reference manager, e.g. JabRef. I have already managed to extract the highlights with rmscene (thank you very much for your work on this, by the way!), but I am struggling to understand how to interpret the coordinates I get from:

from rmscene.scene_stream import SceneGlyphItemBlock, read_blocks

with open(
    "result/fe28f8d6-82d7-4ee2-a1df-0c6cbf1a1785/73efd3f4-f6b9-43c8-9507-db728a545848.rm",
    "rb",
) as f:
    # Keep only the highlight (glyph) blocks
    result = [
        block for block in read_blocks(f) if isinstance(block, SceneGlyphItemBlock)
    ]

# First rectangle of the first highlight
rect = result[0].item.value.rectangles[0]
x = rect.x
y = rect.y
w = rect.w
h = rect.h

The resulting values I got were the following:
x=-777.0586776928369,
y=154.8036158772884,
w=518.3922791561563,
h=113.49999618530273

I don't quite understand how to translate that into the coordinates I need for generating the annotations via pdf-annotate.
I would assume that the x and y values represent the top-left(?) corner of the highlighted area, while w is the width and h the height. What I don't understand is why the x value is negative, or how the coordinate system works in the reMarkable files: where is it centered, how do I find out the total size, etc.?

Maybe you could give me some pointers?

Exported SVG does not contain all the strokes inside viewbox

Sometimes, all the graphical elements are offset for some reason. Like this:

or this, if the canvas is landscape:

I can probably fix this myself. How do I start?
