dogtopus / minipb

Lightweight Protocol Buffer serialize/deserialize library

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

protobuf micropython

minipb's Introduction

MiniPB

Mini Protobuf library in pure Python.

Features

Pure Python.
Feature-rich yet lightweight. Even runs on MicroPython.
Supports a struct-like format string, a ctypes-like structure representation (i.e. Structure._fields_) and a dataclass-like message class as schema.
Supports schema-less inspection of a given serialized message via the Wire.{encode,decode}_raw API.
- Proudly doing this earlier than protoscope.

Getting started

MiniPB supports 3 different flavors of schema declaration methods: Message classes (object serialization), key-value schema (dict serialization) and format string (tuple serialization).

Message class

import minipb

### Encode/Decode a Message with schema defined via Fields
@minipb.process_message_fields
class HelloWorldMessage(minipb.Message):
    msg = minipb.Field(1, minipb.TYPE_STRING)

# Creating a Message instance
#   Method 1: initialize with keyword arguments
msg_obj = HelloWorldMessage(msg='Hello world!')

#   Method 2: from_dict, which iterates over all Fields in the order they are declared on the class
msg_obj = HelloWorldMessage.from_dict({'msg': 'Hello world!'})

# Encode a message
encoded_msg = msg_obj.encode()
# encoded_msg == b'\n\x0cHello world!'

# Decode a message
decoded_msg_obj = HelloWorldMessage.decode(encoded_msg)
# decoded_msg_obj == HelloWorldMessage(msg='Hello world!')

decoded_dict = decoded_msg_obj.to_dict()
# decoded_dict == {'msg': 'Hello world!'}

Key-value schema

import minipb

### Encode/Decode a message with the Wire object and Key-Value Schema
# Create the Wire object with schema
hello_world_msg = minipb.Wire([
    ('msg', 'U') # 'U' means UTF-8 string.
])

# Encode a message
encoded_msg = hello_world_msg.encode({
    'msg': 'Hello world!'
})
# encoded_msg == b'\n\x0cHello world!'

# Decode a message
decoded_msg = hello_world_msg.decode(encoded_msg)
# decoded_msg == {'msg': 'Hello world!'}

Format string

import minipb

### Encode/Decode a message with the Wire object and Format String
hello_world_msg = minipb.Wire('U')

# Encode a message
encoded_msg = hello_world_msg.encode('Hello world!')
# encoded_msg == b'\n\x0cHello world!'

# Decode a message
decoded_msg = hello_world_msg.decode(encoded_msg)
# decoded_msg == ('Hello world!',)

Refer to the Schema Representation for a detailed explanation of the schema formats accepted by MiniPB.
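All three flavors above produce the same bytes, b'\n\x0cHello world!'. The sketch below takes those bytes apart by hand using only the standard Protobuf wire rules (tag = field number << 3 | wire type, followed by a varint length for length-delimited fields). It does not use MiniPB at all, and read_varint is a hypothetical helper written for this illustration.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint from buf starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


encoded = b'\n\x0cHello world!'

# The first varint is the tag: field number in the upper bits, wire type in the low 3 bits.
tag, pos = read_varint(encoded, 0)
field_number, wire_type = tag >> 3, tag & 0x07

# Wire type 2 (length-delimited) is followed by a varint byte count and the payload.
length, pos = read_varint(encoded, pos)
payload = encoded[pos:pos + length]

print(field_number, wire_type, length, payload.decode('utf-8'))
# 1 2 12 Hello world!
```

The leading b'\n' is therefore not a newline in any textual sense; it is the tag byte 0x0A for field 1 with wire type 2.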

Installation

CPython, PyPy, etc.

Install via pip

pip install git+https://github.com/dogtopus/minipb

MicroPython

NOTE (old data, but still somewhat relevant): Despite being lightweight compared to the official Protobuf, the minipb module itself still uses around 15KB of RAM after being loaded via import. It is therefore recommended to use MiniPB on MicroPython instances with a minimum of 24KB of memory available to scripts. Instances with at least 48KB of free memory are recommended for more complex program logic.

On targets with plenty of RAM, such as Pyboards and the Unix build, installation consists of copying minipb.py to the filesystem and installing the logging and bisect modules from micropython-lib. For targets with restricted RAM there are two options: cross compilation and frozen bytecode. The latter offers the greatest saving. See the official docs for further explanation.

Cross compilation may be achieved as follows. First, you need a build of mpy-cross that is compatible with the mpy version you are using.

Compile MiniPB by using

mpy-cross -s minipb.py minipb/minipb.py -o /your/PYBFLASH/minipb.mpy

You also need the logging and bisect modules from micropython-lib. Compile them by using

mpy-cross -s logging.py micropython-lib/logging/logging.py -o /your/PYBFLASH/logging.mpy
mpy-cross -s bisect.py micropython-lib/bisect/bisect.py -o /your/PYBFLASH/bisect.mpy

Unmount PYBFLASH and reset the board once all three files are installed on your MicroPython instance.

For production deployments, mpy-cross can be run with -O set higher than 0 to reduce flash and RAM usage at the cost of some debuggability. For example, -O3 saves about 1KB of flash and library RAM but disables assertions and removes source line numbers from tracebacks.

mpy-cross -s minipb.py -O3 minipb/minipb.py -o /your/PYBFLASH/minipb.mpy
mpy-cross -s logging.py -O3 micropython-lib/logging/logging.py -o /your/PYBFLASH/logging.mpy
mpy-cross -s bisect.py -O3 micropython-lib/bisect/bisect.py -o /your/PYBFLASH/bisect.mpy

Usage

Detailed documentation can be found under the project Wiki. The module's pydoc contains some useful information about the API too.

minipb's Issues

"SAP" streaming parser API?

I might implement this if I feel like it but just wanted to drop the idea here. So basically this library is ideal for MicroPython on low-memory devices, but doesn't have a way to parse bigger-than-RAM inputs.

It seems like the core API up until _break_down is already an iterator that reads from a file-like object. So it seems like you'd just have to write an IterWire subclass that returns an iterator of (path, value) pairs instead of assembling a dict or tuple. Would be a fun puzzle.
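As a rough illustration of the idea, here is a minimal sketch of a generator that walks the top-level fields of a message from a file-like object, one record at a time. It is not the proposed IterWire class and does not touch MiniPB internals; iter_fields and its helpers are hypothetical names, and only read() is used on the stream. Note that each length-delimited value is still read into memory whole, so this only bounds memory per field, not per nested message.

```python
import io


def _read_varint(stream, allow_eof=False):
    """Read one varint byte by byte; return None on a clean end of stream."""
    result, shift = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            if allow_eof and shift == 0:
                return None  # nothing consumed yet: the message simply ended
            raise EOFError('stream ended inside a varint')
        result |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return result
        shift += 7


def _read_exact(stream, size):
    """Read exactly size bytes or fail loudly."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError('stream ended inside a field')
    return data


def iter_fields(stream):
    """Yield (field_number, wire_type, value) for each top-level record."""
    while True:
        tag = _read_varint(stream, allow_eof=True)
        if tag is None:
            return
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value = _read_varint(stream)
        elif wire_type == 1:  # 64-bit fixed, returned as raw bytes
            value = _read_exact(stream, 8)
        elif wire_type == 2:  # length-delimited
            value = _read_exact(stream, _read_varint(stream))
        elif wire_type == 5:  # 32-bit fixed, returned as raw bytes
            value = _read_exact(stream, 4)
        else:
            raise ValueError('unsupported wire type %d' % wire_type)
        yield field_number, wire_type, value


for record in iter_fields(io.BytesIO(b'\n\x0cHello world!\x10\x96\x01')):
    print(record)
# (1, 2, b'Hello world!')
# (2, 0, 150)
```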

Properly handle end-of-stream for decoder

Instead of relying on BytesIO.tell() and len(BytesIO.getvalue()) only at the beginning of _break_down, add an end-of-stream check to all decoder types so that we get a clear indication of both expected (e.g. hitting EOS while reading headers in _break_down) and unexpected (e.g. hitting EOS while decoding the actual field) end-of-stream situations. It is preferable to call read() and check the length of the returned data rather than use the old approach: this avoids copying the contents of the BytesIO every time and stays MicroPython-friendly (MPy doesn't have BytesIO.tell()!).
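One possible shape for such a check is sketched below. It is an illustration only, not MiniPB's actual code: read_exact, its at_boundary flag and UnexpectedEndOfStream are hypothetical names. The helper relies solely on read() and the length of what it returns, so it needs neither tell() nor getvalue().

```python
import io


class UnexpectedEndOfStream(EOFError):
    """Hypothetical error type for a stream that ends in the middle of a field."""


def read_exact(stream, n, at_boundary=False):
    """Read exactly n bytes from stream.

    Returns None when the stream is already exhausted and at_boundary is True
    (an expected end of stream, e.g. while looking for the next header).
    Raises UnexpectedEndOfStream on any other short read.
    """
    data = stream.read(n)
    if len(data) == n:
        return data
    if not data and at_boundary:
        return None
    raise UnexpectedEndOfStream('wanted %d bytes, got %d' % (n, len(data)))


# A message whose header promises 3 payload bytes but carries only 2.
stream = io.BytesIO(b'\n\x03ab')
header = read_exact(stream, 1, at_boundary=True)  # tag byte
length = read_exact(stream, 1)[0]                 # single-byte length
try:
    read_exact(stream, length)
    truncated = False
except UnexpectedEndOfStream:
    truncated = True

# Nothing is left, and we are between fields: this end of stream is expected.
clean_end = read_exact(stream, 1, at_boundary=True) is None
print(truncated, clean_end)
# True True
```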

Marking this as a bug because some of the early EOS errors are currently completely silent, and that needs to be fixed.

Rethink about stream decoding

Some old versions of MiniPB used stream decoding exclusively. However, because of how Protobuf handles repeated occurrences of non-repeated fields (i.e. multiple fields with the same ID while the field is non-repeated), this was changed to pre-indexing the whole message and decoding it in one go.

We might still be able to bring stream decoding back by making the decode states read-write and return a snapshot of it when the decoding is done (by reaching an end-of-message marker either set by the user or naturally occurring).

This could also make implementing #11 easier and much cleaner: instead of committing the states to a list/object, we just yield the state changes as we see them. It would be the user's responsibility to use only the last one that pops up.
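To make the "use only the last one" rule concrete, here is a small sketch of how a consumer could fold a stream of decode events into a final snapshot. It assumes scalar fields only and relies on the Protobuf rule that a later occurrence of a non-repeated scalar field replaces an earlier one; merge_events and the event format are hypothetical and not part of MiniPB.

```python
def merge_events(events, repeated=()):
    """Fold (field_number, value) events into a snapshot dict.

    Fields listed in `repeated` accumulate into a list; every other field
    keeps only the most recent value seen.
    """
    state = {}
    for field_number, value in events:
        if field_number in repeated:
            state.setdefault(field_number, []).append(value)
        else:
            state[field_number] = value  # later occurrence overrides the earlier one
    return state


# Field 1 is non-repeated but appears twice; field 2 is declared repeated.
events = [(1, 'first'), (2, 10), (1, 'second'), (2, 20)]
snapshot = merge_events(events, repeated={2})
print(snapshot)
# {1: 'second', 2: [10, 20]}
```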

This is awesome but the docs could be improved

A few usage examples of the Wire class in the main README would help greatly, and would help promote adoption of this library.

I'm puzzled that your example in doc/format_str uses the t type which it explicitly deprecates. If there is a reason to use this, it would be helpful if it were clarified.

I have run this on a Pyboard 1.1 and it works fine, using 13KiB of RAM. Encoding is efficient provided you understand the meaning of the format types - again some explanation may help those unfamiliar with serialisation especially the handling of negative integers. Efficiency is comparable to ustruct but it's much more usable.
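For readers unfamiliar with why the choice of integer type matters for negative numbers, the following sketch compares the two standard Protobuf approaches: a plain varint of the 64-bit two's complement value, and ZigZag encoding as used by sint32/sint64. It is independent of MiniPB; the function names are chosen for this illustration, and the mapping to specific MiniPB format characters is left to the format documentation.

```python
def zigzag_encode(n, bits=64):
    """Map a signed integer to an unsigned one: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_size(value):
    """Number of bytes an unsigned value occupies as a base-128 varint."""
    size = 1
    while value > 0x7F:
        value >>= 7
        size += 1
    return size


n = -1
as_twos_complement = n & 0xFFFFFFFFFFFFFFFF  # how a plain int64 varint sees -1
as_zigzag = zigzag_encode(n)

print(varint_size(as_twos_complement), varint_size(as_zigzag))
# 10 1
print([zigzag_encode(v) for v in (0, -1, 1, -2, 2)])
# [0, 1, 2, 3, 4]
```

A small negative number therefore costs ten bytes as a plain varint but a single byte with ZigZag, which is why the type choice has a visible effect on message size.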

LSP violation in RawWire

RawWire inherits from Wire but improperly overrides some of the common functions (i.e. encode and decode). This violates the Liskov Substitution Principle. Either merge the two (e.g. have static methods Wire.encode_raw and Wire.decode_raw) or introduce a common algorithm class that both classes can depend on.

Test on MicroPython

Since we are so small, porting to MicroPython might be a good idea. If this works, it might bring relatively decent Protobuf support to MicroPython as well.

Update type stub

Type t encoding is broken

~~It requires int.bit_length which MicroPython doesn't have~~ It's totally broken

Run test on actual MicroPython instead of CPython 3.4

There's https://hub.docker.com/r/micropython/unix but we need to either somehow run this on GitHub Actions or just come up with our own Actions image.

Recommendation for a separator character?

If a variety of data items are encoded with the same schema, the length of the resultant bytes object varies. To send this data down a stream such as a UART the simplest approach is to terminate each object with a separator byte (b'\n'?). But that byte needs to be one guaranteed not to be present in any possible encoding.

Is there such a byte?
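In general a Protobuf encoding can contain any byte value (the Hello world! message in the README even starts with b'\n'), so no separator is safe on its own. A common alternative is to prefix each message with its length. The sketch below shows one such framing scheme using a varint length prefix; write_frame and read_frame are hypothetical helpers, not MiniPB API, and a BytesIO stands in for the UART. On a real UART, read() may return fewer bytes than requested, so a production version would need to loop until the frame is complete.

```python
import io


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_frame(stream, payload):
    """Write payload preceded by its length as a varint."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Return the next payload, or None if the stream ended between frames."""
    length, shift = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            if shift == 0:
                return None
            raise EOFError('stream ended inside a length prefix')
        length |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            break
        shift += 7
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError('stream ended inside a frame')
    return payload


link = io.BytesIO()  # stand-in for the serial link
write_frame(link, b'\n\x0cHello world!')
write_frame(link, b'\n\x02hi')
link.seek(0)

frames = []
while True:
    frame = read_frame(link)
    if frame is None:
        break
    frames.append(frame)
print(frames)
# [b'\n\x0cHello world!', b'\n\x02hi']
```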

Protoscope/WireIR interoperability

It would be nice to have a way to convert Protoscope to WireIR and back.

This might be implemented in minipb or the upcoming minipbj project.

Switch to semver and auto publish releases to pypi

Maybe also move to pyproject.toml.

Oneof

How would you implement [oneof](https://developers.google.com/protocol-buffers/docs/proto3#oneof) here? As a required tag, and then a bunch of optional fields for each possible type? Would that correspond to the oneof keyword used in .proto?

Field seek suffix

Why

For extremely sparse schemas, the current syntax requires a lot of skip operators (i.e. x), which makes the schema less readable. Keeping track of the field offset by hand is also very error-prone, especially when hundreds of fields are involved. Adding a field seek operator as a suffix would solve this.

Proposal

For the format string, add a new suffix element @<field_id> after the field copy count, i.e. suffix := [field_copy][@<seek_to_field_id>]. This might need some special handling for [ (ideally we want to put the element after ]) since [ is technically a prefix character.

For the format list, append @<field_id> to the type string.

(How do we handle overlaps?)

Examples

String:

V2@2U@10U@20

List:

(
    ('arg1', 'V@2'),
    ('arg2', 'V'),
    ('arg3', 'U@10'),
    ('arg4', 'U@20'),
)

Proto:

message Example1 {
    uint64 arg1 = 2;
    uint64 arg2 = 3;
    string arg3 = 10;
    string arg4 = 20;
}
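To check the intended reading of the first example, here is a toy sketch that assigns field IDs to a flat format string under the proposed suffix grammar. It is purely an illustration of the proposal, not MiniPB code: assign_field_ids is a hypothetical function, every letter is treated as an ordinary single-character type code, and prefixes, nested messages in brackets and the open question about overlaps are deliberately left out.

```python
import re

# One token: a type letter, an optional copy count, and an optional @<field_id> seek.
_TOKEN = re.compile(r'([A-Za-z])(\d*)(?:@(\d+))?')


def assign_field_ids(fmt):
    """Return a list of (field_id, type_code) for a flat format string."""
    fields = []
    next_id = 1
    pos = 0
    while pos < len(fmt):
        match = _TOKEN.match(fmt, pos)
        if match is None:
            raise ValueError('unsupported syntax at offset %d' % pos)
        type_code, copies, seek = match.groups()
        if seek:
            next_id = int(seek)  # jump to the requested field ID first
        for _ in range(int(copies) if copies else 1):
            fields.append((next_id, type_code))
            next_id += 1
        pos = match.end()
    return fields


print(assign_field_ids('V2@2U@10U@20'))
# [(2, 'V'), (3, 'V'), (10, 'U'), (20, 'U')]
```

The result lines up with the .proto above: two uint64 fields at 2 and 3, then strings at 10 and 20.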

String:

[vU@10]@20+[U@2]@30

List:

(
    ('msg1', '[@20', (
        ('code', 'v'),
        ('desc', 'U@10'),
    )),
    ('msg2', '+[@30', (
        ('str', 'U@2'),
    )),
)

Proto:

message Example1 {
    message Sub1 {
        sint32 code = 1;
        string desc = 10;
    }
    message Sub2 {
        string str = 2;
    }
    Sub1 msg1 = 20;
    repeated Sub2 msg2 = 30;
}

protoc integration

So we finally have full compatibility with OG Protobuf.

More info about protoc plugin API: https://developers.google.com/protocol-buffers/docs/reference/other

Packed repeating fields: error under MicroPython

This script runs under CPython 3.6.9 but fails under MicroPython:

import minipb

data = {'txt': [b'abc', b'def', b'ghi']}
schema = (('txt', '#a'),)
w = minipb.Wire(schema)
tx = w.encode(data)
rx = w.decode(tx)
print(rx)
print('Length', len(tx))

This produces the following:

Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "minipb.py", line 465, in decode
  File "minipb.py", line 712, in decode_wire
  File "minipb.py", line 713, in <genexpr>
  File "minipb.py", line 541, in _break_down
AttributeError: 'BytesIO' object has no attribute 'tell'
>>>

The same error occurs with #U. Replacing the # with + results in working code.

Possibly this can be documented rather than fixed. Using packed repeating fields results in minuscule savings: just a single byte under CPython in this test. I doubt there is a need for them in a MicroPython context.

Field ids for non-sequential IDs in .proto messages

Hello,
Thank you for this great compact tool.
I am not sure if this is a substantial problem but wanted to get feedback and bring it up. This may be as much a Protocol Buffers question as a minipb library issue. I am trying to write a schema that can decode the GTFS Realtime .proto. I am running into decode issues (unexpected end of message for field X) and wondering whether they are related to the field ID issue described below.

From what I understand, when creating embedded messages or any key-value schema, this library gives them field IDs in sequence. For example, for the following .proto message:

message SearchRequest {
  string query = 1; 
  int32 page_number = 2;
  int32 result_per_page = 3;
  enum Corpus {
    UNIVERSAL = 0;
    WEB = 1;
    IMAGES = 2;
    LOCAL = 3;
    NEWS = 4;
    PRODUCTS = 5;
    VIDEO = 6;
  }
  Corpus corpus = 4;
}

Results in the following minipb schema:

search_req_schema = minipb.Wire((
    ('query', 'U'), # field id 1
    ('page_number', 't'), # field id 2
    ('result_per_page', 't'), # field id 3
    ('corpus', 't'), # field id 4
))

For the GTFS realtime .proto there is one Alert message that lists fields in non-sequential number order.

// An alert, indicating some sort of incident in the public transit network.
message Alert {
  // Time when the alert should be shown to the user. If missing, the
  // alert will be shown as long as it appears in the feed.
  // If multiple ranges are given, the alert will be shown during all of them.
  repeated TimeRange active_period = 1;

  // Entities whose users we should notify of this alert.
  repeated EntitySelector informed_entity = 5;  // *** SKIPS TO 5 FROM 1 ***

  // Cause of this alert.
  enum Cause {
    UNKNOWN_CAUSE = 1;
    OTHER_CAUSE = 2;  // Not machine-representable.
    TECHNICAL_PROBLEM = 3;
    STRIKE = 4;         // Public transit agency employees stopped working.
    DEMONSTRATION = 5;  // People are blocking the streets.
    ACCIDENT = 6;
    HOLIDAY = 7;
    WEATHER = 8;
    MAINTENANCE = 9;
    CONSTRUCTION = 10;
    POLICE_ACTIVITY = 11;
    MEDICAL_EMERGENCY = 12;
  }
  optional Cause cause = 6 [default = UNKNOWN_CAUSE];

  // What is the effect of this problem on the affected entity.
  enum Effect {
    NO_SERVICE = 1;
    REDUCED_SERVICE = 2;

    // We don't care about INsignificant delays: they are hard to detect, have
    // little impact on the user, and would clutter the results as they are too
    // frequent.
    SIGNIFICANT_DELAYS = 3;

    DETOUR = 4;
    ADDITIONAL_SERVICE = 5;
    MODIFIED_SERVICE = 6;
    OTHER_EFFECT = 7;
    UNKNOWN_EFFECT = 8;
    STOP_MOVED = 9;
  }
  optional Effect effect = 7 [default = UNKNOWN_EFFECT];

  // The URL which provides additional information about the alert.
  optional TranslatedString url = 8;  

  // Alert header. Contains a short summary of the alert text as plain-text.
  optional TranslatedString header_text = 10;  // *** SKIPS TO 10 FROM 8 ***

  // Full description for the alert as plain-text. The information in the
  // description should add to the information of the header.
  optional TranslatedString description_text = 11; 

  // The extensions namespace allows 3rd-party developers to extend the
  // GTFS Realtime Specification in order to add and evaluate new features
  // and modifications to the spec.
  extensions 1000 to 1999;
}

I don't know why the GTFS Realtime proto skips field numbers like this. When converting this to a minipb schema, could that be an issue for properly receiving and decoding messages?
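Some background that may help with this question: on the wire, each field is identified only by the number written in the .proto file, packed into a tag as (field_number << 3) | wire_type. Gaps in numbering are allowed by Protobuf, but a schema that numbers its entries sequentially would look for informed_entity under ID 2 rather than 5. The sketch below computes the tag bytes for the Alert fields using only standard wire rules; it does not use MiniPB, and whether this mismatch explains the specific decode error reported here is not established.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def field_tag(field_number, wire_type):
    """Tag bytes that precede a field on the wire."""
    return encode_varint((field_number << 3) | wire_type)


VARINT = 0            # enums such as Cause and Effect
LENGTH_DELIMITED = 2  # embedded messages such as TimeRange

# (name, field number from the .proto, wire type)
alert_fields = [
    ('active_period', 1, LENGTH_DELIMITED),
    ('informed_entity', 5, LENGTH_DELIMITED),
    ('cause', 6, VARINT),
    ('effect', 7, VARINT),
    ('url', 8, LENGTH_DELIMITED),
    ('header_text', 10, LENGTH_DELIMITED),
    ('description_text', 11, LENGTH_DELIMITED),
]
tags = {name: field_tag(number, wire_type) for name, number, wire_type in alert_fields}

print(tags['informed_entity'], field_tag(2, LENGTH_DELIMITED))
# b'*' b'\x12'
```

A sender writes b'*' (0x2A) before informed_entity, while a schema expecting it at ID 2 would be waiting for b'\x12'. The Message class flavor shown in the README takes the field number explicitly, as in minipb.Field(1, ...), and the format string has the x skip operator mentioned in the field seek issue, so either should allow the gaps to be expressed.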