google / emboss

Emboss is a tool for generating code that reads and writes binary data structures.

License: Apache License 2.0

Python 57.31% C++ 40.03% C 0.07% Starlark 2.27% Vim Script 0.31%

emboss's Introduction

Emboss

Emboss is a tool for generating code that reads and writes binary data structures. It is designed to help write code that communicates with hardware devices such as GPS receivers, LIDAR scanners, or actuators.

What does Emboss do?

Emboss takes specifications of binary data structures, and produces code that will efficiently and safely read and write those structures.

Currently, Emboss only generates C++ code, but the compiler is structured so that writing new back ends is relatively easy -- contact [email protected] if you think Emboss would be useful, but your project uses a different language.

When should I use Emboss?

If you're sitting down with a manual that looks something like this or this, Emboss is meant for you.

When should I not use Emboss?

Emboss is not designed to handle text-based protocols; if you can use minicom or telnet to connect to your device, and manually enter commands and see responses, Emboss probably won't help you.

Emboss is intended for cases where you do not control the data format. If you are defining your own format, you may be better off using Protocol Buffers or Cap'n Proto or BSON or some similar system.

Why not just use packed structs?

In C++, packed structs are the most common method of dealing with these kinds of structures; however, they have a number of drawbacks compared to Emboss views:

Access to packed structs is not checked. Emboss (by default) ensures that you do not read or write out of bounds.
It is easy to accidentally trigger C++ undefined behavior using packed structs, for example by not respecting the struct's alignment restrictions or by running afoul of strict aliasing rules. Emboss is designed to work with misaligned data, and is careful to use strict-aliasing-safe constructs.
Packed structs do not handle variable-size arrays, nor arrays of sub-byte-size fields, such as boolean flags.
Packed structs do not handle endianness; your code must be very careful to correctly convert stored endianness to native.
Packed structs do not handle variable-sized fields, such as embedded substructs with variable length.
Although unions can sometimes help, packed structs do not handle overlapping fields well.
Although unions can sometimes help, packed structs do not handle optional fields well.
Certain aspects of bitfields in C++, such as their exact placement within the larger containing block, are implementation-defined. Emboss always reads and writes bitfields in a portable way.
Packed structs do not have support for conversion to human-readable text format.
It is difficult to read the definition of a packed struct in order to generate documentation, alternate representations, or support in languages other than C and C++.
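
As a rough illustration of the bounds-checking, alignment, and endianness points above, the sketch below reads a big-endian 32-bit field from an arbitrary byte offset. It is hand-written C++ for comparison only, not Emboss-generated code, and `ReadBigEndianU32` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of Emboss: reads a big-endian 32-bit unsigned
// integer starting at `offset` in `buffer`. Returns false instead of reading
// out of bounds. Byte-wise access avoids alignment and strict-aliasing issues.
inline bool ReadBigEndianU32(const unsigned char *buffer, std::size_t size,
                             std::size_t offset, std::uint32_t *out) {
  assert(buffer != nullptr && out != nullptr);
  if (offset > size || size - offset < 4) return false;  // bounds check
  *out = (static_cast<std::uint32_t>(buffer[offset]) << 24) |
         (static_cast<std::uint32_t>(buffer[offset + 1]) << 16) |
         (static_cast<std::uint32_t>(buffer[offset + 2]) << 8) |
         static_cast<std::uint32_t>(buffer[offset + 3]);
  return true;
}
```

Every multi-byte field in a hand-written parser needs code along these lines, which is the kind of repetition a generator is meant to remove.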

What does Emboss not do?

Emboss does not help you transmit data over a wire -- you must use something else to actually transmit bytes back and forth. This is partly because there are too many possible ways of communicating with devices, but also because it allows you to manipulate structures independently of where they came from or where they are going.

Emboss does not help you interpret your data, or implement any kind of higher-level logic. It is strictly meant to help you turn bit patterns into something suitable for your programming language to handle.

What state is Emboss in?

Emboss is currently under development. While it should be entirely ready for many data formats, it may still be missing features. If you find something that Emboss can't handle, please contact [email protected] to see if and when support can be added.

Emboss is not an officially supported Google product: while the Emboss authors will try to answer feature requests, bug reports, and questions, there is no SLA (service level agreement).

Getting Started

Head over to the User Guide to get started.

emboss's Issues

Enums are always generated as [u]int64_t in C++

Emboss generates enums declared as uint64_t regardless of how many enum members are present or their ranges. When writing into structure fields, Emboss checks whether there are enough bytes to write without any loss of data.

For example, take the following enum definition:

enum OpCodeGroup:
  FOO = 0x08

Emboss generates the following code for the above definition:

enum class OpCodeGroup : ::std::uint64_t;

enum class OpCodeGroup : ::std::uint64_t {
  FOO = static_cast</**/::std::int32_t>(8LL),

};

There is only a single element in the enum and its range should easily fall within the range limits of a uint8_t. There is seemingly no way to control the generated size of an enum.

Potential Implementation
The .emb language syntax is updated to accept the maximum size in bytes for an enum in the generated code.

Array iterators should keep a copy of, not a pointer to, the underlying `ArrayView`

Keeping a pointer makes it very easy to make the pointer dangle, as in:

auto it = view.array().begin();
// it holds a pointer to the now-dead temporary `array()`

Further, since the equality checks in the iterator use the pointer, code like this walks off the end of the array:

assert(std::equal(view.array().begin(), view.array().end(), other.begin()));
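
A minimal sketch of the proposed fix, assuming the view is a cheap, copyable handle. `ByteArrayView` here is a hypothetical stand-in, not the real Emboss `ArrayView`: the iterator stores the view by value and compares storage and position rather than the address of a view object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an Emboss array view: a copyable handle
// (pointer + length) onto bytes owned elsewhere.
class ByteArrayView {
 public:
  ByteArrayView(const std::uint8_t *data, std::size_t size)
      : data_(data), size_(size) {}
  const std::uint8_t *data() const { return data_; }
  std::size_t size() const { return size_; }

  class Iterator;  // Defined below, once ByteArrayView is a complete type.
  Iterator begin() const;
  Iterator end() const;

 private:
  const std::uint8_t *data_;
  std::size_t size_;
};

class ByteArrayView::Iterator {
 public:
  // The view is stored by value, so the iterator stays usable after the
  // temporary view it came from has been destroyed.
  Iterator(ByteArrayView view, std::size_t index)
      : view_(view), index_(index) {}
  std::uint8_t operator*() const {
    assert(index_ < view_.size());
    return view_.data()[index_];
  }
  Iterator &operator++() {
    ++index_;
    return *this;
  }
  // Compares the underlying storage and position, so iterators obtained from
  // two different temporaries over the same bytes compare as expected.
  bool operator==(const Iterator &other) const {
    return view_.data() == other.view_.data() && index_ == other.index_;
  }
  bool operator!=(const Iterator &other) const { return !(*this == other); }

 private:
  ByteArrayView view_;
  std::size_t index_;
};

inline ByteArrayView::Iterator ByteArrayView::begin() const {
  return Iterator(*this, 0);
}
inline ByteArrayView::Iterator ByteArrayView::end() const {
  return Iterator(*this, size_);
}
```

With this shape, a loop that calls `begin()` and `end()` on separate temporaries terminates at the end of the array.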

Standardized file format for .emb files

Hello,
This project seems very promising. It would be great to use a standard format to allow more features built on .emb file declarations.
The Kaitai project, which could be seen as a heavier alternative, benefits from such an approach: multiple generic tools, like the IDE, are able to use the same definition files.

https://formats.kaitai.io/
https://ide.kaitai.io/

Thank you

Value bounds checking should respect `[requires]` attributes.

Currently, integer bounds checking ignores [requires], which means that $max_size_in_bytes and related values can be excessively large in some cases; e.g.:

struct Foo:
  0 [+4]     UInt      size
    [requires: this <= 16]
  4 [+4]     UInt      message_type
  8 [+size]  UInt:8[]  payload
  # $max_size_in_bytes == 0xffff_ffff
  # The real max size is 8 + 16 == 24.

The _compute_constraints_of_field_reference function in compiler/front_end/expression_bounds.py would need to be augmented to look for [requires] attributes and parse them for inferrable constraints (e.g., X && this <= 4, where X is any expression, puts an upper bound of 4 on the value). There is some similar expression analysis in write_inference.py, though the analysis there does not do what would be needed for bounds inference.

Autoformatter Hits Python Recursion Limit on `struct`s/`enum`s with many fields/entries

The compiler front end was previously fixed to avoid deep recursion, but the autoformatter was not.

Support Lists of Variable-Length Structures.

It is currently possible to (somewhat hackily) support lists by modeling them as recursive structures, like so:

struct Repeated(remaining_size : UInt:32):
  0 [+4]  UInt  length
  # ... other fields ...
  if remaining_size > length:
    length [+remaining_size - length]  Repeated  next

However, there should be a nicer syntax for specifying that there is a list of Repeated elements, with a C++ API that supports iteration (but not random access).
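
To make the request concrete, here is a hand-written sketch of forward-only iteration over such records. It is not an existing Emboss API: `RecordCursor` is a hypothetical name, and the sketch assumes `length` is little-endian and covers the whole record, as in the `Repeated` definition above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical forward-only cursor over records laid out back to back, where
// each record starts with a 4-byte little-endian `length` giving the size of
// the whole record (an assumption for this sketch).
class RecordCursor {
 public:
  RecordCursor(const unsigned char *data, std::size_t size)
      : data_(data), remaining_(size) {
    assert(data != nullptr || size == 0);
  }

  // True when a complete, well-formed record is available at the cursor.
  bool Valid() const {
    if (remaining_ < 4) return false;
    const std::uint32_t length = Length();
    return length >= 4 && length <= remaining_;
  }

  // Size in bytes of the current record, including its length field.
  std::uint32_t Length() const {
    assert(remaining_ >= 4);
    return static_cast<std::uint32_t>(data_[0]) |
           (static_cast<std::uint32_t>(data_[1]) << 8) |
           (static_cast<std::uint32_t>(data_[2]) << 16) |
           (static_cast<std::uint32_t>(data_[3]) << 24);
  }

  // Moves to the next record. Only call when Valid() is true.
  void Advance() {
    assert(Valid());
    const std::uint32_t length = Length();
    data_ += length;
    remaining_ -= length;
  }

 private:
  const unsigned char *data_;
  std::size_t remaining_;
};
```

Random access is not offered because finding record N requires reading the lengths of all records before it.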

Built in function to count the number of bits set

Suppose you have a bits type like the following:

bits Features:
  0     [+1] Flag foo
  $next [+1] Flag bar
  $next [+1] Flag baz
  $next [+1] Flag abc
  $next [+1] Flag xyz

A packet contains a bitset like the one above with the number of flags enabled determining the size of a variable length array.

struct Data:
  0     [+1] Features features

  let num_features = (features.foo ? 1 : 0) + (features.bar ? 1 : 0) + ...
  $next [+num_features] UInt feature_values

Calculating num_features this way is quite a bit of toil: every bit to be counted has to be written out by hand, and it would be easy to forget to update the expression if another flag needed to be counted in the future. It would be much easier to call a built-in function that does the counting for us.

Something like this would be so much better:

struct Data:
  0     [+1] Features features

  let num_features = bits_set(features)
  $next [+num_features] UInt feature_values
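
For reference, the computation such a built-in would perform is a population count. A minimal C++ sketch follows; `CountBitsSet` is a hypothetical name, and nothing here reflects how Emboss would actually implement `bits_set`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical equivalent of the proposed bits_set(): counts the 1 bits in
// the low `width` bits of `value`. Kernighan's trick clears the lowest set
// bit on each iteration, so the loop runs once per set bit.
inline unsigned CountBitsSet(std::uint64_t value, unsigned width) {
  assert(width <= 64);
  if (width < 64) value &= (std::uint64_t{1} << width) - 1;
  unsigned count = 0;
  while (value != 0) {
    value &= value - 1;
    ++count;
  }
  return count;
}
```

For the five-flag `Features` type above, `width` would be 5, so any stray high bits in the containing byte are ignored.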

Support MakeBitsViewOfInt()

It would be nice if the C++ generated code supported MakeBitsView() types with uint backing storage. We often need to save bitmasks from packet views as integers/enums for passing around in our code. We then need to do bitmasking later without Emboss, which often requires creating new C++ enums. One thing to consider is that we may want the backing storage to have host endianness.

uint64_t feature_mask;
auto feature_view = emboss::MakeFeatureView(feature_mask);
bool foo_supported = feature_view.foo_supported().Read();
uint16_t bar = feature_view.bar().Read();
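
Until something like this exists, the masking has to be written by hand. The sketch below shows roughly what such a view does over a host-endian integer. `FeatureMaskView` and its field layout (bit 0 for foo_supported, bits 1 through 16 for bar) are invented for illustration and do not come from any real .emb file.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view over an integer held in host byte order. The layout is
// made up for this sketch; generated code would take it from the .emb file.
class FeatureMaskView {
 public:
  explicit FeatureMaskView(std::uint64_t *storage) : storage_(storage) {
    assert(storage != nullptr);
  }

  bool foo_supported() const { return ReadBits(0, 1) != 0; }
  std::uint16_t bar() const {
    return static_cast<std::uint16_t>(ReadBits(1, 16));
  }
  void set_bar(std::uint16_t value) { WriteBits(1, 16, value); }

 private:
  // Both helpers require width < 64 so the mask computation cannot overflow.
  std::uint64_t ReadBits(unsigned offset, unsigned width) const {
    const std::uint64_t mask = (std::uint64_t{1} << width) - 1;
    return (*storage_ >> offset) & mask;
  }
  void WriteBits(unsigned offset, unsigned width, std::uint64_t value) {
    const std::uint64_t mask = (std::uint64_t{1} << width) - 1;
    *storage_ = (*storage_ & ~(mask << offset)) | ((value & mask) << offset);
  }

  std::uint64_t *storage_;
};
```

Because the shifts operate on the integer's value rather than on its bytes in memory, the result is the same on little-endian and big-endian hosts.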

Support Async Reads/Writes

C++ Emboss views are backed by a backing storage, which is already a template parameter, so non-RAM backing storage can be supported. (This is not currently documented, but the feature is probably complete. The MakeFooView() methods create views using the emboss::ContiguousBuffer backing storage type, which ultimately uses RAM.)

However, view types currently assume that all reads/writes are low-latency (or that it is safe to block the current thread indefinitely), and do not provide any methods for asynchronous reads or stores. For high-latency backing storage (e.g., a hardware register file accessed through SPI or MODBUS or similar), it would be useful to return a promise or a future that will eventually fulfill the read or write, and provide a notification when it is done.

Implicit narrowing conversion in OffsetBitBlock

In Fuchsia, implicit narrowing conversions are errors, so OffsetBitBlock fails to compile:

../../third_party/github.com/google/emboss/src/runtime/cpp/emboss_memory_util.h:917:9: error: implicit conversion loses integer precision: 'int' to 'ValueType' (aka 'unsigned short') [-Werror,-Wimplicit-int-conversion]
        ~(MaskToNBits(static_cast<ValueType>(~ValueType{0}), size_) << offset_);

I'll work on a merge request.
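
For context, the warning comes from integer promotion: when ValueType is narrower than int, the operands of << and ~ are promoted to int, so the result has to be narrowed explicitly. The sketch below illustrates one way to build such a mask without an implicit narrowing conversion. It is not the actual patch, and `ClearMask` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only; not the actual Emboss change. Returns a ValueType with
// `size` bits cleared starting at bit `offset` and all other bits set.
template <typename ValueType>
ValueType ClearMask(unsigned size, unsigned offset) {
  assert(size > 0 && size + offset <= sizeof(ValueType) * 8);
  // Do the arithmetic in a wide unsigned type, then narrow once, explicitly.
  const std::uint64_t all_ones = ~std::uint64_t{0};
  const std::uint64_t field = (all_ones >> (64 - size)) << offset;
  return static_cast<ValueType>(~field);
}
```

The single static_cast at the end makes the truncation intentional, which is what -Wimplicit-int-conversion asks for.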

PEP8-ify Emboss Code

In particular, 2-space indent should be switched over to 4-space indent, to match PEP8 and (current external) Google Python style.

The C++ compiler error when using `Write()` (and related methods) on a read-only view type is confusing

The compiler complains about passing the wrong type to memcpy() (passing const char * as the first argument) deep inside Emboss internal template code, which is not clear to an end user.

Using std::enable_if<... backing storage is writeable> on Write() would help. Ideally, we could coerce the C++ compiler to emit a message saying to use the ...Writer type alias instead of ...View.
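
A rough sketch of the idea, using a static_assert in place of std::enable_if so that the compiler prints a custom message. `MiniView` is a hypothetical miniature view, not Emboss code, and the message text is only a suggestion.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical miniature view; `Byte` is `char` for a writer and
// `const char` for a read-only view.
template <typename Byte>
class MiniView {
 public:
  MiniView(Byte *data, std::size_t size) : data_(data), size_(size) {}

  char Read(std::size_t i) const {
    assert(i < size_);
    return data_[i];
  }

  // Write() is a member template, so its body is only checked when it is
  // called. On a read-only view the static_assert then fires with a readable
  // message instead of an error from deep inside library code.
  template <typename B = Byte>
  void Write(std::size_t i, char value) const {
    static_assert(!std::is_const<B>::value,
                  "Write() requires writable backing storage; use the "
                  "...Writer alias instead of ...View.");
    assert(i < size_);
    data_[i] = value;
  }

 private:
  Byte *data_;
  std::size_t size_;
};
```

Calling `Write()` on a `MiniView<const char>` fails to compile with the message above near the top of the diagnostics, while `Read()` keeps working on both kinds of view.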

Support syntax to specify field is placed directly after previous one

Because Emboss supports overlapping fields, a fat-fingered offset is completely valid, which makes it an easy mistake to make. It would be nice to have syntax to simply say that a field is placed directly after the previously defined one.

Potential Implementation
The .emb language syntax is updated to indicate a field should be placed directly after the previous one. The generated C++ code follows this guidance and computes the offset of a field automatically. Overlapping field syntax is retained for backwards compatibility.

Support cross-type integral equality

In the text below, we consider the following Emboss definition:

enum OpCodeGroup:
  # ...
  FOO = 0x08
  # ...

enum OpCodeCommand:
  # ...
  BAR = 0x0039
  # ...

bits OpCode(group: UInt:6, command: UInt:10):
  0 [+6] UInt ogf
  6 [+10] UInt ocf

We would like to pass constant values to the group and command parameters of the OpCode type. We could store these values in virtual fields within the struct we are composing using the let keyword. However, these values are better placed inside the OpCodeGroup and OpCodeCommand enumerations, respectively. Doing so allows the values to be defined once across the entire codebase and made accessible within both Emboss and C++ code. The following situations arise here:

1. If we use the UInt:6 and UInt:10 types for parameters, the arguments must be hardcoded, and we no longer have a single definition of the constants across the codebase (they have to be defined elsewhere again).
2. If we use the enumeration types for both parameters and fields, Emboss generates an EnumView which doesn’t allow writes, making ogf and ocf read-only.
3. If we use the enumeration types for the parameters and UInt types for the fields, adding a check like [requires: ogf == group && ocf == command] won’t compile because the types are different.

We don’t want option 1 because it hardcodes constants. Enumerations shouldn’t be writable, so the behavior in option 2 makes sense as it is. To support this use case, Emboss should therefore support cross-type integral equality, as needed by option 3.

Potential Implementation
The .emb language syntax is updated to allow explicit int(Foo.Bar) conversions to allow for integral type coercion. We want to keep Emboss' strict type safety but want to be able to compare across only integral types.

Support generating an import depfile (.d) like GCC & Clang -MD option

When using the GN build system, we need to specify all imported emboss files (including recursively imported files) as inputs to the action target that runs embossc. It would be simpler and more maintainable to use a depfile that was generated by emboss. The depfile would use the same Makefile-style format that GCC and Clang generate when the -MD option is used.

I haven't tried, but I think this format might work:

a.emb: b.emb c.emb

Support Variable-Length/Offset fields within `bits`

"Bitstream"-style protocols tend to have variable-length fields -- particularly arrays -- and variable-offset fields within bits. Emboss should support variable-length and variable-offset fields within bits.

Example of `$size_in_bits` from documentation fails to compile

The example provided in https://github.com/google/emboss/blob/master/doc/language-reference.md#size_in_bits-size-in-bits (using bits FixedSize and struct Envelope) gives the following error when built with Emboss:

error: Fixed-size type 'FixedSize' cannot be placed in field of size 64 bits; requires 4 bits.
  0  [+8]  FixedSize  padded_payload
           ^^^^^^^^^

This seems to imply that FixedSize needs to be padded out to a whole byte increment. We should either:

relax the restrictions around padding (auto-pad by default?), or
fix the example to include the necessary padding

No way to add integral type safety on top of the currently available type system

Emboss provides no way to add integral type safety on top of the currently available type system. For example, suppose we wanted to store both an advertising handle and a connection handle within an Emboss definition. Both of these types are two bytes wide; we can use a UInt:16 to store them. However, this would mean that we can assign the values across the two ‘types’ as well. It would be better if we could tell Emboss to create a new pseudotype specific for an advertising handle and specific for a connection handle. Although they are still backed by a UInt:16, the Emboss compiler would be able to check types match across arguments both in .emb files as well as in generated code.

Native Support for Strings

Currently, strings are modeled as UInt:8[] -- that is, just arrays of bytes.

Direct support for an actual string type would help at the API layer and in the text format.

Clang Wunused-parameter warning

On a project where Wunused-parameter warnings are errors, I encountered the following build error:

hci.emb.h:43695:24: error: unused parameter 'emboss_reserved_local_value' [-Werror,-Wunused-parameter]
        ::std::int32_t emboss_reserved_local_value) {

I have a temporary fix of using the Wno-unused-parameter flag on targets that depend on Emboss, but this is not ideal.

I'm not 100% sure, but I think this can be fixed by just wrapping the parameter name in an inline comment.

Support Integer Division and Modulus in the Emboss Expression Sublanguage

Emboss should support integer division and modulus.

Note that there are several subtleties to these operations. reventlov to provide a proper design doc.

Support using `requires` with open enums.

I would like to be able to use requires to limit the range of values an open enum may have. For example:

enum PageScan:
  ...

struct Example:
  0 [+2] PageScan page_scan
    [requires: 12 < this < 1000]

Using overlapping fields in a `struct` to achieve C `union`-like functionality is not documented well enough

I'm attempting to define a simple union. At the moment, there aren't any examples of how to define a union in the Emboss repository. I've come up with my best guess as something like:

union Foo:
  0 [+1] UInt:8 a
  0 [+2] UInt:16 b

In any case, attempting to run this through embossc yields the following results:

17:27:54 INF hci_vendor.emb:260:1: error: Syntax error
17:27:54 INF hci_vendor.emb:260:1: note: Found 'union' (SnakeWord), expected "external", "enum", $, "bits", "struct".

It would appear that defining a union in Emboss is currently not possible. We should probably add the ability to do so.

Support Variable-Length `bits`

Emboss should support bits whose length is not known at compile time.

`requires` attribute on struct/bits has no effect

This code compiles and runs with no issue:

*protocol.emb*

  struct Packet:
    [requires: number < another]
    0 [+4] UInt number
    $next [+4] UInt another

*main.cc*

  char buffer[64];
  auto view = test::MakePacketView(buffer, 64);
  view.number().Write(12);
  view.another().Write(6);

It's unclear whether the following behavior should be allowed to compile at all (as it was our attempt to enforce default values, an upcoming feature for Emboss), but it's worth noting that this also has no effect:

struct Packet(number_value: UInt:32):
  [requires: number == number_value]
  0 [+4] UInt number

The requires attribute on individual fields works as intended.

Support Overlay Files

The Emboss attribute system was originally designed to allow overlay files, which would allow attributes to be specified without editing the original .emb file -- this is particularly useful in cases where the .emb file is provided by a third party, which is the main reason it has not yet been implemented.

The basic idea is that, given an .emb like:

struct Foo:
  0 [+1]  UInt  foo_field

enum Bar:
  BAR_VALUE = 1

You would be able to write a .emboverlay file like:

[$default (cpp) enum_translation: "kCamelCase"]
[(cpp) namespace: "xyz"]
struct Foo:
  [(cpp) name: "Foo_"]
  foo_field:
    [(cpp) name: "foo_field_"]

enum Bar:
  BAR_VALUE
    [(cpp) name: "kBarValue_"]

The exact details need to be ironed out.

Support `switch`/`case`

switch/case would help for the common case of an enumerated message type, where a message type field is used to discriminate between various message bodies.

Support `bits` wider than 64 bits

NovAtel's RANGECMP (pp 842-844) record contains a 192-bit bitfield.

GPS subframes are defined as 300-bit sequences.

RTCM defines some wide bitfields which -- even more problematically -- contain 64-bit integers that are not aligned to byte boundaries.

Emboss should support bits of arbitrary size.

Note: the restriction that individual integers within those fields are no more than 64 bits is a separate feature that should be evaluated and implemented on its own.

Support High-Bit-Zero Numbering on `bits`

A number of big-endian protocols number bits with bit 0 as the highest-order/first-read bit. This seems to be particularly common in protocols that are expressed in terms of bitstreams.

Emboss should provide a way to allow bits to be addressed with bit 0 as the highest-order bit.
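
The translation between the two numbering schemes is simple arithmetic, sketched below for a single field. `MsbFirstToLsbOffset` is a hypothetical helper name, and this says nothing about what syntax Emboss might adopt.

```cpp
#include <cassert>

// Converts the starting bit of a field numbered MSB-first (bit 0 is the
// highest-order bit) into the LSB-first offset of the field's lowest bit,
// for a container that is `container_bits` wide.
inline unsigned MsbFirstToLsbOffset(unsigned container_bits,
                                    unsigned msb_first_start,
                                    unsigned field_bits) {
  assert(field_bits > 0 && msb_first_start + field_bits <= container_bits);
  return container_bits - msb_first_start - field_bits;
}
```

For example, a 4-bit field that a datasheet places at bits 0 through 3 of a 16-bit word, counting from the high end, sits at LSB-first offset 12.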

Support Setting an Array from C++ Container

It would be convenient to be able to copy to/from an Emboss array from/to a C++ container, such as a std::vector.

Migrate to C++17

I see a lot of TODOs referencing C++17. Are we good to resolve them now that C++17 support is pretty much there?

Provide a way to read the offset and size of a field.

Given a structure like:

struct Foo:
  0 [+4]  UInt  a
  some_complex_offset [+some_complex_size]  Structure b

it would be useful to be able to get the values of some_complex_offset and some_complex_size, both in Emboss definitions and in C++.

The tricky part of this is that the offset and size of a field are properties of the structure that contains the field, not properties of the field itself, so a function like $offset_of(b) would need some new machinery to ensure that the argument was a physical field and that it was called in a context where it makes sense.

One thought I had was to make implicit virtual fields $offset_of_b and $size_of_b, similar to the existing implicit virtual field $size_in_bytes. (These might be spelled $offset_of b or $offset_of(b) in an .emb. The C++ names would need to be regular C++ identifiers, or there would need to be contortions in the C++ code generator to make something like view.OffsetOf().b() work.)

Support default values

Reading an unset field through the Emboss view API results in an assertion error on non-debug builds. This is because we are reading uninitialized memory. It would be great if Emboss could support default values for certain fields.

Use Case
Consider packets which have opcodes in their headers. These opcode values are unique to the packet/message being constructed and never change. They need to be defined in the packet but having the developer fill them out every time is redundant and error prone. Instead, it would be better to define the value in the .emb file and have Emboss take care of it for all uses.

Potential Implementation
Specify a [default: 0] attribute in the .emb files. Fields are initialized to their defaults via an initialize() method on views.

Negation (`not`) Operator

A not operator (not or !) would be clearer than the current workaround, == false.

Support `Int` and `UInt` wider than 64 bits.

Emboss should support types like UInt:128.

The main open issue is figuring out what type to use for wide integers in C++: there is no support for them in the C++11 (or 14, 17, 20, or 23) standards, so we would likely need to provide our own implementation and/or a way for users to insert their preferred wide integer library.

This feature requires #40 to be implemented, either first or in tandem.

Support using enum values in requires

It would be nice to be able to use enum values in requires expressions:

enum PageScan:
   MIN = 1
   MAX = 100

struct Example:
  0 [+2] PageScan page_scan
    [requires: PageScan.MIN < this < PageScan.MAX]

Support Sign-Magnitude Integers

Sign-magnitude integers show up in a few protocols, like RTCM and u-blox M10. Emboss should have a prelude type like SMInt to model these fields.
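
For readers unfamiliar with the encoding, the following sketch decodes a sign-magnitude value by hand. `DecodeSignMagnitude` is a hypothetical helper, not part of Emboss, and it assumes the sign occupies the highest bit of the field.

```cpp
#include <cassert>
#include <cstdint>

// Decodes a `width`-bit sign-magnitude value: the top bit is the sign and the
// remaining bits are the magnitude. Negative zero decodes to 0.
inline std::int64_t DecodeSignMagnitude(std::uint64_t raw, unsigned width) {
  assert(width >= 2 && width <= 64);
  const std::uint64_t sign_bit = std::uint64_t{1} << (width - 1);
  const std::int64_t magnitude =
      static_cast<std::int64_t>(raw & (sign_bit - 1));
  return (raw & sign_bit) ? -magnitude : magnitude;
}
```

Unlike two's complement, this encoding has two representations of zero, which is one of the details a prelude type would need to pin down for writes.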

Better Error Messages for Text Format Parsing

UpdateFromText() just returns a bool indicating success/failure. There is currently no way to find out where parsing failed, which makes it difficult to debug manually-written text format structs.

CRC/Checksum Support

CRCs and checksums are common enough that there should be built-in support for verifying and setting CRCs/checksums to the correct value. One possible implementation would be $crc(range : UInt:8[]) and $checksum(range : UInt:8[]) functions which can be used in [requires].
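
As a point of reference for what a `$crc` function would have to compute, here is a bitwise implementation of one common variant, the CRC-32 used by zlib and Ethernet. This is only a sketch: real protocols use many different polynomials, initial values, and bit orders, so a built-in would need to be parameterized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320, an initial value
// of 0xFFFFFFFF, and a final XOR of 0xFFFFFFFF.
inline std::uint32_t Crc32(const unsigned char *data, std::size_t size) {
  assert(data != nullptr || size == 0);
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < size; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // Shift out one bit; XOR in the polynomial when the bit was set.
      crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
  }
  return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for this variant is 0xCBF43926 for the ASCII string "123456789", which is a convenient sanity test.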

`ContiguousBuffer` should have `memcpy()` equivalents.

It currently has methods to copy out of other ContiguousBuffers, but not from raw byte buffers.

Back End for Translation Between Emboss `struct`s and Google Protocol Buffers

This would help in a number of places in Google.

Support kCamelCase names for enum values

Currently, enum values are always SHOUTY_CASE. The Google C++ style guide prefers kCamelCase for enum names, so we would like to migrate our Emboss enum names to this style eventually. A possible implementation is adding a configuration option for enum name style at the top of .emb files.

Documentation blocks should be placed on top of the entity they describe

Currently, documentation blocks are placed underneath the entities they describe. This is quite confusing and the opposite of most commenting conventions in various programming languages. I think it makes sense to have a separate symbol for documentation blocks (e.g. # for comments and -- for documentation blocks). However, the position should remain on top of the entity they describe.

I think the one exception here would be to allow a documentation block at the top of the body of a struct, enum, bits, etc., kind of like a docstring in Python. However, enabling that one exception may make the implementation more difficult, so we can default to the top of the entity instead, imo.

Make `.Ok()` on Arrays Faster

The generated code for .Ok() on an array is basically a loop, like:

bool Ok() const {
  for (int i = 0; i < ElementCount(); ++i) {
    if (!(*this)[i].Ok()) return false;
  }
  return true;
}

When array elements are simple types, such as UInt with no constraints, .Ok() on each element devolves to a bounds check; unfortunately, no C++ compiler appears to implement an optimization that hoists the bounds checks out of the loop, so the array .Ok() method ends up repeating a huge amount of busywork.

To remedy this, the C++ back end could emit static functions on views like bool OkIsEquivalentToIsComplete() and bool IsCompleteIsAStaticSizeCheck(), then the array's .Ok() could be something like:

bool Ok() const {
  if (decltype((*this)[0])::OkIsEquivalentToIsComplete() &&
      decltype((*this)[0])::IsCompleteIsAStaticSizeCheck()) {
    return (*this)[0].Ok();
  } else {
    for (int i = 0; i < ElementCount(); ++i) {
      if (!(*this)[i].Ok()) return false;
    }
    return true;
  }
}

In C++17, the outer if can become if constexpr.