Multibase

Self-identifying base encodings

Multibase is a protocol for disambiguating the "base encoding" used to express binary data in text formats (e.g., base32, base36, base64, base58, etc.) from the expression alone.

When text is encoded as bytes, we can usually use a one-size-fits-all encoding (UTF-8) because we're always encoding to the same set of 256 bytes (+/- the NUL byte). When that doesn't work, usually for historical or performance reasons, we can usually infer the encoding from the context.

However, when bytes are encoded as text (using a base encoding), the choice of base encoding (and alphabet, and other factors) is often restricted by the context. Worse, these restrictions can change based on where the data appears in the text. In some cases, we can only use [a-z0-9]; in others, we can use a larger set of characters but need a compact encoding. This has led to a large set of "base encodings", almost one for every use-case. Unlike the case of encoding text to bytes, it is impractical to standardize widely around a single base encoding because there is no optimal encoding for all cases.

As data travels beyond its context, it becomes quite hard to ascertain which of the many possible base encodings was used; that's where multibase comes in. Where the data has been prefixed before leaving its context behind, it answers the question:

Given binary data d encoded into text s, what base b was used to encode it?

To answer this question, a single code point is prepended to s at time of encoding, which signals in that new context which b can be used to reconstruct d.

Format
- Multibase Table
Specifications
Status
- Reserved Terms
Multibase By Example
FAQ
Implementations:
Disclaimers
Contribute
License

Format

The Format is:

<base-encoding-code-point><base-encoded-data>

Where <base-encoding-code-point> is a code representing an entry in the multibase table.
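
A minimal sketch of the format in Python, assuming only the standard library and a small subset of the table (f, F, B, M, u). The names `ENCODERS` and `multibase_encode` are hypothetical and do not come from any of the listed implementations.

```python
import base64

# Hypothetical subset of the multibase table: prefix code point -> encoder.
ENCODERS = {
    "f": lambda d: d.hex(),                                   # base16 (lowercase)
    "F": lambda d: d.hex().upper(),                           # base16upper
    "B": lambda d: base64.b32encode(d).decode().rstrip("="),  # base32upper, no padding
    "M": lambda d: base64.b64encode(d).decode(),              # base64pad
    "u": lambda d: base64.urlsafe_b64encode(d).decode().rstrip("="),  # base64url, no padding
}


def multibase_encode(prefix: str, data: bytes) -> str:
    """Return <base-encoding-code-point><base-encoded-data> for a known prefix."""
    if prefix not in ENCODERS:
        raise ValueError(f"unsupported multibase prefix: {prefix!r}")
    return prefix + ENCODERS[prefix](data)


print(multibase_encode("f", b"hi"))  # f6869
```

The prefix is prepended as a character of the output text, so the result stays a plain string in whatever text encoding the surrounding context uses.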

Multibase Table

The current multibase table is here:

Unicode,    character,  encoding,           description,                                                    status
U+0000,     NUL,        none,               (No base encoding),                                             reserved
U+0030,     0,          base2,              Binary (01010101),                                              experimental
U+0031,     1,          none,               (No base encoding),                                             reserved
U+0037,     7,          base8,              Octal,                                                          draft
U+0039,     9,          base10,             Decimal,                                                        draft
U+0066,     f,          base16,             Hexadecimal (lowercase),                                        final
U+0046,     F,          base16upper,        Hexadecimal (uppercase),                                        final
U+0076,     v,          base32hex,          RFC4648 case-insensitive - no padding - highest char,           experimental
U+0056,     V,          base32hexupper,     RFC4648 case-insensitive - no padding - highest char,           experimental
U+0074,     t,          base32hexpad,       RFC4648 case-insensitive - with padding,                        experimental
U+0054,     T,          base32hexpadupper,  RFC4648 case-insensitive - with padding,                        experimental
U+0062,     b,          base32,             RFC4648 case-insensitive - no padding,                          final
U+0042,     B,          base32upper,        RFC4648 case-insensitive - no padding,                          final
U+0063,     c,          base32pad,          RFC4648 case-insensitive - with padding,                        draft
U+0043,     C,          base32padupper,     RFC4648 case-insensitive - with padding,                        draft
U+0068,     h,          base32z,            z-base-32 (used by Tahoe-LAFS),                                 draft
U+006b,     k,          base36,             Base36 [0-9a-z] case-insensitive - no padding,                  draft
U+004b,     K,          base36upper,        Base36 [0-9a-z] case-insensitive - no padding,                  draft
U+007a,     z,          base58btc,          Base58 Bitcoin,                                                 final
U+005a,     Z,          base58flickr,       Base58 Flickr,                                                  experimental
U+006d,     m,          base64,             RFC4648 no padding,                                             final
U+004d,     M,          base64pad,          RFC4648 with padding - MIME encoding,                           experimental
U+0075,     u,          base64url,          RFC4648 no padding,                                             final
U+0055,     U,          base64urlpad,       RFC4648 with padding,                                           final
U+0070,     p,          proquint,           Proquint (https://arxiv.org/html/0901.4016),                    experimental
U+0051,     Q,          none,               (no base encoding),                                             reserved
U+002F,     /,          none,               (no base encoding),                                             reserved
U+1F680,    🚀,         base256emoji,       base256 with custom alphabet using variable-sized-codepoints,   experimental

NOTE: Multibase-prefixes are encoding agnostic. "z" is "z", not 0x7a ("z" encoded as ASCII/UTF-8). In UTF-32, for example, that same "z" would be [0x7a, 0x00, 0x00, 0x00] not [0x7a], so detecting and dropping an initial byte of 0x7a would not suffice to confirm the rest was base58btc-encoded bytes; [0x7a, 0x00, 0x00, 0x00] would instead be the UTF-32 bytes that correspond to the z codepoint for that entry, and the entire byte array would need to be detected and dropped. Also note the difference between 0x00 (codepoint 0 or 0x00) and 0 (codepoint 48 or 0x30).
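
The note above can be illustrated with a short Python sketch. The helper `strip_prefix` and the payload `YAjK` are hypothetical and only serve the illustration; the point is that the prefix is removed after decoding bytes to text, not by dropping a fixed byte.

```python
prefix = "z"  # base58btc prefix, as a code point rather than a byte

# The same code point serializes differently depending on the text encoding.
print(prefix.encode("utf-8"))      # b'z'
print(prefix.encode("utf-32-le"))  # b'z\x00\x00\x00'


def strip_prefix(raw: bytes, text_encoding: str) -> tuple[str, str]:
    """Decode the bytes to text first, then split off the first code point."""
    text = raw.decode(text_encoding)
    return text[0], text[1:]


raw = "zYAjK".encode("utf-32-le")      # illustrative payload, not a real value
print(strip_prefix(raw, "utf-32-le"))  # ('z', 'YAjK')

# NUL (code point 0) and the digit 0 (code point 48) are different prefixes.
print(ord("\x00"), ord("0"))  # 0 48
```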

Specifications

Below is a list of specs for the underlying base encodings:

base2 Base2 RFC
base8 Base8 RFC, similar to rfc4648
base10 Base10 RFC
base36 Base36 RFC
base16* RFC4648
base32* (Except for base32z) rfc4648
base32z Human-oriented base32 spec
base64* RFC4648
base58btc https://datatracker.ietf.org/doc/html/draft-msporny-base58-02
base58flickr https://datatracker.ietf.org/doc/html/draft-msporny-base58-02, but using a different alphabet
proquint Proquint RFC, which is the original spec with an added prefix for legibility

Status

Each multibase encoding has a status:

reserved - for functional reasons or to avoid collisions with other multi-* registries, this registry cannot accept registrations at this code-point and implementing one unregistered is discouraged for interoperability reasons
experimental - these encodings have been proposed but are not widely implemented and may be removed.
draft - these encodings are mature and widely implemented but may not be implemented by all implementations.
final - these encodings should be implemented by all implementations and are widely used.
deprecated - this entry will likely be removed and reassigned in the future and it will not likely become a final registration

Reserved Terms

The following codes are reserved and cannot be registered in the multibase table. Note that all three of the Unicode entries, expressed as the unsigned varint expression of that Unicode code-point in UTF-8, correspond to widely-used entries in the multiformats registry group that could cause confusion for some legacy systems handling both binary and multibased structures from other multiformats. While technically the multibase registry is not part of the multiformats registry group, these reservations minimize the risk of confusion when composing multiple multiformats in one data system.

NUL (n/a) - Legacy data may be found with null-byte-prefixed binary structures mixed in among multibase-encoded ones in arrays of data, although support for this is no longer mandated by conformant implementations.
/ (U+002F) - Separator used by multiaddr.
1 (U+0031) - Base58-encoded identity multihashes used by libp2p peer IDs.
Q (U+0051) - Base58-encoded sha2-256 multihashes used by libp2p/ipfs for peer IDs and CIDv0.

Multibase By Example

Consider the following encodings of the same binary string:

4D756C74696261736520697320617765736F6D6521205C6F2F # base16 (hex)
JV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP           # base32
3IY8QKL64VUGCX009XWUHKF6GBBTS3TVRXFRA5R            # base36
YAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt                 # base58
TXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==               # base64

And consider the same encodings with their multibase prefix

F4D756C74696261736520697320617765736F6D6521205C6F2F # base16 F
BJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP           # base32 B
K3IY8QKL64VUGCX009XWUHKF6GBBTS3TVRXFRA5R            # base36 K
zYAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt                 # base58 z
MTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==               # base64 M

The base prefixes used are: F, B, K, z, M.
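
As a rough sketch, three of the prefixed strings above can be decoded with Python's standard library by dispatching on the first character. `DECODERS` and `multibase_decode` are hypothetical names; base36 and base58 are left out because the standard library has no decoder for them.

```python
import base64

# Hypothetical subset of the table: prefix -> decoder for the text after it.
DECODERS = {
    "F": lambda s: base64.b16decode(s),                         # base16upper
    "B": lambda s: base64.b32decode(s + "=" * (-len(s) % 8)),   # base32upper (re-padded for the stdlib)
    "M": lambda s: base64.b64decode(s),                         # base64pad
}


def multibase_decode(text: str) -> bytes:
    """Split off the one-character prefix and decode the remainder."""
    prefix, payload = text[0], text[1:]
    if prefix not in DECODERS:
        raise ValueError(f"unsupported multibase prefix: {prefix!r}")
    return DECODERS[prefix](payload)


examples = [
    "F4D756C74696261736520697320617765736F6D6521205C6F2F",
    "BJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP",
    "MTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==",
]
for example in examples:
    print(multibase_decode(example))  # the same 25 bytes each time
```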

FAQ

Is this a real problem?

Yes. If I give you "1214314321432165", is that decimal? Or hex? Or something else?
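
To make the ambiguity concrete, here is a small Python sketch that reads the same string both ways. It assumes the decimal reading is a big-endian integer, which is an assumption for illustration.

```python
s = "1214314321432165"

# Read as base16: every two characters are one byte.
as_hex = bytes.fromhex(s)

# Read as base10: the whole string is one big-endian integer (assumption).
n = int(s, 10)
as_decimal = n.to_bytes((n.bit_length() + 7) // 8, "big")

print(len(as_hex), len(as_decimal))  # 8 7
print(as_hex == as_decimal)          # False
```

Without a prefix, nothing in the string itself tells a decoder which of the two byte sequences was meant.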

Why the strange selection of codes / characters?

The code values are selected such that they are included in the alphabets of the base they represent. For example, f is the base code for base16 (hex), because f is in hex's 16 character alphabet. Note that most of the alphabets used can be encoded in UTF-8, and most but not all can be encoded in ASCII. We have not yet found a case needing something else.

Don't we have to agree on a table of base encodings?

Yes, but we already have to agree on base encodings, so this is not hard. The table even leaves some room for custom encodings and is intended to work both in contexts where the encodings are known or agreed on and open-world or brownfield contexts where these may vary.

Implementations:

go-multibase
js-multibase
cs-multibase
rust-multibase
java-multibase
py-multibase
haskell-multibase
net-ipfs-core
elixir-multibase
scala-multibase
cpp-multibase
ruby-multibase
dart-multibase
yoclib-multibase-php
multibase sub-module of Python module multiformats
Kotlin
- kotlin-multibase
- multibase part of Kotlin project multiformat
Add yours here!

Disclaimers

Warning: obviously multibase changes the first character depending on the encoding. Do not expect the value to be exactly the same. Remove the multibase prefix before using the value.

Contribute

Contributions welcome. Please check out the issues and read the contributing document for the greater multiformats project before opening your first issue, as both the workflow and the relation of multibase to the greater project benefit from this context. That document has more information on how we work and about contributing in general.

If you'd like to switch a project over to multibase, whether by creating a new multibase implementation or building on one of those listed above, please file an issue in this repository using the "Interested in implementing" issue template. If you would also like to reserve a prefix for compatibility, please file a separate issue in this repository using the "New Registration" issue template.

License

multibase's Issues

Consider encoding: airtameg

Airtameg is a base 26 alphabetic encoding, designed to fit data relatively efficiently into Aztec Codes (which use 5 bits per character in long single-case alphabetic strings, before error-correcting codes are added). I'd suggest a as the code for airtameg since it represents 0 in airtameg (though it would still need to be stripped before decoding) and is the first character of its name.

Airtameg is specified as an instance of the more general chunky base b encodings, which provide a quick way to specify encodings of bytes into any alphabet of between 2 and 256 characters, instead of having to write a new specification for each new encoding scheme, going through all the details regarding endianness, leading zeros, etc.

Alphabets code value are no longer all ASCII

In the FAQ section of README.md it is stated that:

The code values are selected such that they are included in the alphabets of the base they represent. For example, f is the base code for base16 (hex), because f is in hex's 16 character alphabet. Note that the alphabets can be encoded in ASCII or UTF8. We have not found a case needing something else.

With the introduction of base256emoji, this is no longer the case, so it might be worth re-wording the section to avoid confusion.

Adding Base45 (aka ISO/IEC 18004:2006 Alphanumeric Mode QR code)

There is a special compression mode in the ISO QR code ( ISO/IEC 18004:2006 ) standard called "Alphanumeric Mode". When you stay within this limited character set, it will do its own compression and error correction (~44% or more).

From QR code standard ISO/IEC 18004:2000 §8.3.3

Alphanumeric Mode
Alphanumeric Mode encodes data from a set of 45 characters, i.e. 10 numeric digits (0 - 9) (ASCII values 30HEX to 39HEX), 26 alphabetic characters (A - Z) (ASCII values 41HEX to 5AHEX) , and 9 symbols (SP, $, %, *, +, -, ., /, :) (ASCII values 20HEX, 24HEX, 25HEX, 2AHEX, 2BHEX, 2D to 2FHEX, 3AHEX respectively). Normally, two input characters are represented by 11 bits.
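
The "two input characters in 11 bits" figure follows from 45 × 45 = 2025 ≤ 2048 = 2^11. A small Python sketch of the pair packing used by QR Alphanumeric Mode (value = 45 × first + second), with the hypothetical helper name `pair_to_bits`:

```python
# QR Alphanumeric Mode character set, in value order 0..44.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"
assert len(ALPHABET) == 45


def pair_to_bits(pair: str) -> str:
    """Pack two alphanumeric-mode characters into an 11-bit string."""
    first, second = (ALPHABET.index(c) for c in pair)
    return format(first * 45 + second, "011b")


print(45 ** 2, 2 ** 11)    # 2025 2048
print(pair_to_bits("AC"))  # 00111001110
```

This only shows how the QR symbol packs characters from that set; it is not a byte-to-text encoder for base45 itself.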

The advantage of encoding binary data in Base45 in QR code related scenarios is that the encoding does not require additional compression or checksums. But you do have to not use lower-case letters or other non-allowed characters, which can force you into the much less efficient QR binary mode.

I would like to see this mode added to multibase. I'm not sure if it is allowed for multibase, but ASCII 45 (-) is both allowed in a URL and is in the base45 character set. Second choice would be 4 or 5. Otherwise, some other available unreserved multibase character prefix is fine.

As this base45 is already an international standard, used not only by QR codes but also some fairly obscure things like satellite radio, I think it would be a good addition.

My particular usecase is for encoding cryptographic keys and signatures to be used in air-gapped offline QR code based scenarios, such as #LetheKit

Let me know if this would be considered something I should create an initial PR for, or if the maintainers of this repo want to add it.

-- Christopher Allen

Base 2, base 8, and base 10

The odd-balls in the current multibase spec are:

Base 2
Base 8
Base 10

That is, these are generally considered less useful than the other bases. The current situation is:

Base 2 is useful for bitfields.
One of base 8 or base 10 may be useful when only digits (0-9) are allowed.
- Base 10 has a spec.
- Base 10 is more compact.
- Base 8 may be simpler to decode/encode.

The question is: which of these should we keep, if any? This is relevant to multiformats/go-multibase#26 as, if we keep base 8, we need to define and implement it.

Suggestion: split the `code` column into `code_char` and `code_ascii`

Reading the table, it can be a bit confusing if the code field is a char or an integer or, even worse, an escape code (like \t and \b).

I think that reformatting the table in the following manner can avoid confusion.

encoding,          code_char, code_ascii, description,                                              status
identity,          <NUL>,     0x00,       8-bit binary (encoder and decoder keeps data unmodified), default
base2,             0,         0x30,       binary (01010101),                                        candidate
base8,             7,         0x37,       octal,                                                    draft
base10,            9,         0x39,       decimal,                                                  draft
base16,            f,         0x66,       hexadecimal,                                              default
base16upper,       F,         0x46,       hexadecimal,                                              default
base32hex,         v,         0x76,       rfc4648 case-insensitive - no padding - highest char,     candidate
base32hexupper,    V,         0x56,       rfc4648 case-insensitive - no padding - highest char,     candidate
base32hexpad,      t,         0x74,       rfc4648 case-insensitive - with padding,                  candidate
base32hexpadupper, T,         0x54,       rfc4648 case-insensitive - with padding,                  candidate
base32,            b,         0x62,       rfc4648 case-insensitive - no padding,                    default
base32upper,       B,         0x42,       rfc4648 case-insensitive - no padding,                    default
base32pad,         c,         0x63,       rfc4648 case-insensitive - with padding,                  candidate
base32padupper,    C,         0x43,       rfc4648 case-insensitive - with padding,                  candidate
base32z,           h,         0x68,       z-base-32 (used by Tahoe-LAFS),                           draft
base36,            k,         0x6b,       base36 [0-9a-z] case-insensitive - no padding,            draft
base36upper,       K,         0x4b,       base36 [0-9A-Z] case-insensitive - no padding,            draft
base58btc,         z,         0x7a,       base58 bitcoin,                                           default
base58flickr,      Z,         0x5a,       base58 flickr,                                            candidate
base64,            m,         0x6d,       rfc4648 no padding,                                       default
base64pad,         M,         0x4d,       rfc4648 with padding - MIME encoding,                     candidate
base64url,         u,         0x75,       rfc4648 no padding,                                       default
base64urlpad,      U,         0x55,       rfc4648 with padding,                                     default

Consider Encodings from this page

https://en.wikipedia.org/wiki/Binary-to-text_encoding

Incorrect test bytestring in `case_insensitivity.csv`

The bytestring for the case insensitivity test vectors is listed as b"hello world", but the encodings are actually those of b"yes mani !".

Add `multibase` in digest-algorithm iana table

I expect

to be able to use multibase for digest-algorithms

Note

It can be done by updating the IANA digest-algorithms table.

Multicodec names of multibase

My proposal is /multibase/$BASENAME/
Example: `/multibase/base16/`

Multicodec currently recommends /b16/ and so on but IMO this is occupying too much of shared namespace.

Haskell implementation for multibases.

Hello,
I would be interested in developing a multibase implementation in Haskell. I've read the contributing doc, but I still have a few questions:
- Is anyone working on this already, or can I go for it?
- I'm still pretty new to GitHub: do I pull the multiformats repo directly, or do I have to submit the code for checking first?

I hope I can be useful to this project. Even though I lack experience, I'm here to learn, so don't hesitate to reject my code if it's not good enough.

Add test vectors to test edge cases

Empty string.
All numbers.
Emoji.
Chinese.
Strings of length 1-8, 30, 31, 32, 33, 500, 1000.
Strings of all nulls.
Strings of all 0xff (if possible?)

Check alphabets for base32

Make sure we track all the alphabets for base32 (that are very widely used)

Typo in Base2 rfc

In https://github.com/multiformats/multibase/blob/master/rfcs/Base2.md#encoding:

For example, [0x58, 0x59, 0x60] can be converted to multibase base2 as follows:
map each byte to the base2 representation:
 ["01011000", "01011001", "01011010"]

concatenate:
 "010110000101100101011010"

prefix with '0':
 "0010110000101100101011010"

The binary representation of 0x60 is 0b01100000. The third byte 0b01011010 in hexadecimal is 0x5A.
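
The discrepancy can be checked with a few lines of Python. The function name `base2_multibase` is hypothetical, and the sketch assumes the fixed eight-bits-per-byte mapping shown in the quoted example.

```python
def base2_multibase(data: bytes) -> str:
    """Map each byte to 8 binary digits, concatenate, and prefix with '0'."""
    return "0" + "".join(format(byte, "08b") for byte in data)


# The quoted output matches 0x5A as the third byte, not 0x60.
print(base2_multibase(bytes([0x58, 0x59, 0x5A])))  # 0010110000101100101011010
print(format(0x60, "08b"))                         # 01100000
```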

Consider Encoding: Decimal (base10)

alphabet = "0123456789"

Add CI runs ensuring table is well formed and reservations (1/Q) are upheld

As per #67 (comment)

Case insensitivity

We currently have base encodings that are spec'ed as case-insensitive (Base32 and Base36). I can see that this is convenient from an application perspective. But wouldn't it make sense to be strict at the lowest level (the level of this spec) and reject wrongly cased encodings?

The convenient normalization can always happen at an application level, it would also always have enough information to do so.

Hence I'm in favour of being more strict and making things case sensitive.

PS: Thanks @hugomrdias for bringing this up.
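
As an illustration of the two behaviours being discussed, Python's `base64.b32decode` happens to offer both through its `casefold` flag. The wrapper names below are hypothetical, and the example uses the standard library's uppercase RFC 4648 alphabet, not any multibase implementation.

```python
import base64
import binascii

upper = base64.b32encode(b"yes mani !").decode()  # 10 bytes, so no padding is needed
lower = upper.lower()


def strict_decode(text: str) -> bytes:
    """Reject input whose case does not match the alphabet."""
    return base64.b32decode(text, casefold=False)


def lenient_decode(text: str) -> bytes:
    """Normalize case before decoding."""
    return base64.b32decode(text, casefold=True)


print(lenient_decode(lower))
try:
    strict_decode(lower)
except binascii.Error as err:
    print("strict decoder rejected lowercase input:", err)
```

With the strict variant at the bottom layer, an application can still opt in to normalization by upper-casing the text itself before decoding.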

test vectors

we should have a set of test vectors for implementations

Consider Encodings: base85 (multiple alphabets)

Ascii85 https://en.wikipedia.org/wiki/Ascii85
Z85 http://rfc.zeromq.org/spec:32/Z85/

It is used in ZMQ, postscript, PDF, btoa/atob, and a number of other things.

Consider encoding: Bech32

https://github.com/bitcoin/bips/blob/master/bip-0173.mediawiki

Consider Encoding: LEB128

https://en.wikipedia.org/wiki/LEB128

Remove canonization of Base2

Base2 should be taken as-is. If it isn't 8k+1 in length it should be considered invalid since the missing zeros could be either from the start or the end.

varint

@jbenet could you confirm what you mean by <varint-base-encoding-code>? A varint merged with the code itself? (high significant bit means that it requires more than one byte) How does that work for cases where the code is a specific char? @whyrusleeping is of the opinion that varint does not make sense in this case (see https://github.com/multiformats/go-multibase/blob/master/multibase.go)

Standard prefixes for other bases

We have multibase prefixes for standardized encodings like base64 or base16, but what about other bases like base 120 or base 3? What about encodings that support custom charsets?

For example, suppose a user encodes a UTF-8 string in base 35 (for some reason) and wants to use their own custom charset (for the same kind of reasons that things like base64 / base64URL exist). How would we prefix that? It's planned that haskell-multibases-2.0 will support encoding with bases 1 to 255 and custom charsets, but I'm kind of stumped on the whole prefix business.

Base94: all of ascii except control and whitespace

All of ascii in order from ! to ~ (inclusive) following a classical radix strategy, no padding.

Key: ~

This trades for maximum density while still being usable and exclusively ASCII.
Sadly this alphabet is breakable (following Unicode's rules, it contains characters that allow word breaks), but that's the kind of tradeoff you have to make when you want compactness.
It's probably also slow to encode and decode; again, tradeoffs have to be made.

No padded version because we lack a free character to do that with, also if you care about compactness you probably don't care about padding.

If no one has anything to object about it, I'll open a PR for a spec and a go implementation in the future.
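
A rough Python sketch of what such an encoding could look like, treating the input as one big-endian integer in radix 94. The handling of leading zero bytes (kept as leading `!` characters, mirroring what base58btc does with `1`) is an assumption, since the proposal does not specify it; the function names are hypothetical.

```python
# All printable, non-whitespace ASCII in order from '!' to '~'.
ALPHABET = "".join(chr(c) for c in range(0x21, 0x7F))
assert len(ALPHABET) == 94


def base94_encode(data: bytes) -> str:
    """Radix-94 encode with the proposed '~' key in front, no padding."""
    n = int.from_bytes(data, "big")
    digits = []
    while n:
        n, remainder = divmod(n, 94)
        digits.append(ALPHABET[remainder])
    # Assumption: each leading zero byte becomes one leading '!'.
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "~" + "!" * zeros + "".join(reversed(digits))


def base94_decode(text: str) -> bytes:
    """Inverse of base94_encode under the same assumptions."""
    if not text.startswith("~"):
        raise ValueError("missing '~' multibase key")
    payload = text[1:]
    zeros = len(payload) - len(payload.lstrip("!"))
    n = 0
    for ch in payload:
        n = n * 94 + ALPHABET.index(ch)
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * zeros + body


print(base94_encode(b"yes mani !"))
```

The big-integer arithmetic is also why such an encoding tends to be slow on long inputs: each digit requires a division over the whole remaining number.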

Consider encoding: Main recommendation issue

Module used for base encoding is not compliant with RFC3548

Folks, saw this today:

https://www.npmjs.com/package/base-x

This is rather concerning. However, it is unclear in what bits it is not compliant. Has anyone checked this to ensure that multibase is using something that will lead to the same base encodings as everyone else?

Please clarify "highest char/letter"

According to rfc4648 z is included in both base32, base64, and base64url and yet we have this:

base32        U, u    rfc4648 - highest letter
base64        y       rfc4648 highest char
base64url     Y       rfc4648 highest char

I'm confused by what "highest letter/char" is supposed to mean.

Consistency of numeral prefixes

When looking at the numeral prefixes, it appears to me that they are not fully consistent. This happens because the prefix for base2 is 0 and not 1.

Base	Current prefix	Consistent prefix	Consistent?
1	1	0	NO, but removed from the spec
2	0	1	NO
3	-	2	(not used)
4	-	3	(not used)
5	-	4	(not used)
6	-	5	(not used)
7	-	6	(not used)
8	7	7	YES
9	-	8	(not used)
10	9	9	YES
11	-	A/a	(not used)
12	-	B/b	NO (used by `base32`, but I think that is fine)
13	-	C/c	NO (used by `base32pad`, but I think that is fine)
14	-	D/d	(not used)
15	-	E/e	(not used)
16	F/f	F/f	YES

The table above shows what I mean. The bases marked NO have some inconsistency.

Consider Encoding: base36

https://en.wikipedia.org/wiki/Base36

widely used.

Consider encoding: base-emoji

Since the proposals were removed from the readme, I'm adding this issue so we don't lose track of it.

base-emoji would be helpful to get humans to more easily verify that things are correct on both sides of some communication. For example, see peer-base/peer-pad#134 (comment) of one use-case

Consider encoding: WordBase-2048

Consider use case and factors:

Legal documents, such as for a non-profit, corporation, or legally recognized DAO: the author or script wishes to reference a CID, DID, smart contract, key, or other identifier. I think a QR Code would be preferable. However, not all jurisdictions or filing processes support this. Documents might be printed, photocopied on an old copier, and then rescanned. OCR makes this easier, except when the document is difficult to read by machine, smudged, blurred, faded, or preferred to be checked by hand. Humans can read words more easily, and reading words provides an organic type of error correction. Words can also easily be read aloud and voice-recognized into another interface.

Proposal:

I suggest the word lists from BIP-39 be used to create a base 2048 in several languages. Primarily first in English. Perhaps a special indicator word/phrase could be used in the entire multiformat use, or the standard could rely entirely on the 2048 words. This would work similarly to a seedphrase. Perhaps in the future seed phrases could even have multiformat self-describing their parameters.

References:

https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md

Out-of-scope:

Machine error correction by additional word guessing and correction coding/checking might also be added to this but is outside the scope of this issue/feature. Seed phrase parameters would have to be another standard/proposal if enough people find this useful.

Switch Base36 encoding from `draft` to `default` mid-Q3 2020 or so

As per #65 (comment)

proposal: base4cloaky

This is a base4 (2bits per codepoint) made of:

U+2060 WORD JOINER ⁠ 0
U+2061 FUNCTION APPLICATION ⁡ 1
U+2062 INVISIBLE TIMES ⁢ 2
U+2063 INVISIBLE SEPARATOR ⁣ 3

With a special particularity, invalid codepoints are just skipped over.
U+2064 INVISIBLE PLUS ⁤ is then used to indicate the end of the data.
In case the data is incomplete when the terminator is reached, the remaining byte is zero padded.

The key would be U+2060.

This allows hiding those CIDs in other pieces of text (that includes other CIDs).
Here is a random example:

⁠Th⁠is is a⁡ to⁢t⁡al⁢⁠l⁢⁡y ⁡nor⁠⁢m⁡al p⁢⁠⁡ie⁡ce ⁢o⁡f t⁠⁡e⁡xt⁤.

Missing license file

Currently, the project is missing a license file 😅.

Consider Encoding: VLQ

https://en.wikipedia.org/wiki/Variable-length_quantity

Link to specifications for each encoding

Currently the Multibase encodings are described by a sentence. Ideally we would link every supported encoding to a specification, so that someone could implement Multibase based on those specs, rather than trying to hunt down what exactly is needed.

This issue was triggered by this excellent bug report on the CID spec.

Multihash encoding recommendations (base64url, base58)

I've been looking for a way to get the benefits of multihash but without base58 encoding. After reading some more I realized that while base58 appears to be the most common encoding for Multihash because of its use in IPFS, the multihash specification doesn't mention base58 (https://tools.ietf.org/html/draft-multiformats-multihash-00).

My understanding is that the best approach would be to avoid forcing a specific base encoding and to combine multibase with multihash. Is this the recommended option in the future, and will there be a few "recommended" encodings to be used with multihash out of the ones supported by multibase?

For instance, it would make sense to recommend using URL-safe base encoding such as base64url to avoid losing some of the interesting properties of base58 when switching to a different base encoding. In my case, I dislike base58 encoding because it doesn't align well to byte boundaries, which is why I would rather use multihash with base64url.

Code table mismatch

Hi all,

There is a mismatch between the code tables in the csv and the readme.

More specifically we have :

base	csv	readme
base32	Bb	Uu
base64	m	y
base64url	u	Y

Which one is authoritative ?

Consider encoding: NewBase60

http://tantek.pbworks.com/w/page/19402946/NewBase60

In heavy use by the IndieWeb community

Consider encoding: Safebase

https://github.com/kstenerud/safe-encoding has safe16, safe32, safe64, safe80 and safe85 for HTML/XML, JSON, URL and POSIX file names.

Base256emoji: why the leading null byte in the example?

in tests/basics.csv the string example "yes mani !" is encoded with "🚀🏃✋🌈😅🌷🤤😻🌟😅👏"

The rocket at the beginning corresponds to a null byte. Why is it here? I thought a leading null byte marked the identity encoding. Does this mean the base256emoji encoding also implies an identity encoding, like is it its default textual representation or something? In that case it would mean all base256emoji strings start with a rocket, right?

Canonical encoding rules for ambiguous situations

In some encodings without existing standards (like RFC4648), there is a potential for ambiguous encodings and this should be handled in some way. For example, the common test case "yes mani !" is usually base-2 encoded to "01111001011001010111001100100000011011010110000101101110011010010010000000100001" in implementations, implying that leading zeros are dropped. However, this means that "\x00yes mani !" also encodes to "01111001011001010111001100100000011011010110000101101110011010010010000000100001" which will absolutely cause problems for someone somewhere at some point.

There are a few different ways to resolve this ambiguity.

One would be to override the existing example and require fixed-length encoding so "yes mani !" would encode to "001111001011001010111001100100000011011010110000101101110011010010010000000100001" and "\x00yes mani !" to "00000000001111001011001010111001100100000011011010110000101101110011010010010000000100001". This would require decoding variable-width encodings as the shortest matching string, for backwards compatibility.

Another would be to permit dropping leading zeros as long as the encoding remains unambiguous. In this case the existing example would still be valid, "\x00yes mani !" would encode to "0001111001011001010111001100100000011011010110000101101110011010010010000000100001" and "\x00\x00yes mani !" to "000000000001111001011001010111001100100000011011010110000101101110011010010010000000100001".

A similar problem applies to base-8 and base-10, with the added complexity of the latter case not even mapping cleanly to bits to begin with. A canonical set of test vectors covering these and other edge cases would be extremely useful for implementers; #24
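
The base-2 collision described above can be reproduced with a short Python sketch. Both function names are hypothetical; the first models an integer-style encoder that drops leading zeros, the second the fixed-length alternative.

```python
def base2_drop_leading_zeros(data: bytes) -> str:
    """Integer-style encoding: leading zero bits (and whole zero bytes) vanish."""
    return "0" + format(int.from_bytes(data, "big"), "b")


def base2_fixed_width(data: bytes) -> str:
    """Fixed-length encoding: exactly 8 binary digits per input byte."""
    return "0" + "".join(format(byte, "08b") for byte in data)


a, b = b"yes mani !", b"\x00yes mani !"

print(base2_drop_leading_zeros(a) == base2_drop_leading_zeros(b))  # True
print(base2_fixed_width(a) == base2_fixed_width(b))                # False
print(len(base2_fixed_width(a)), len(base2_fixed_width(b)))        # 81 89
```

Only the fixed-width form lets a decoder recover the byte count from the text length, which is what makes the two inputs distinguishable.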

Check alphabets for base64

Make sure we track all the widely used alphabets for base64.

Base2 Zero Width

We should add a codec for encoding binary in zero-width joiners and zero-width non-joiners. Why? Because this would be awesome.


Proquint prefix in RFC

According to [https://github.com/multiformats/multibase/blob/master/rfcs/PRO-QUINT.md], the "full" prefix for proquints is pro-, but the multibase prefix is p. Because the multibase spec dictates that encoded strings take the form <base-encoding-character><base-encoded-data>, encoding the IP address 127.0.0.1 (as the 4-byte bytestring [0x7f, 0x00, 0x00, 0x01]) would give:

>>> import proquint
>>> ip = bytes([0x7f, 0x00, 0x00, 0x01])
>>> ip_int = int.from_bytes(ip, byteorder="big")
>>> encoded_data = proquint.uint2quint(ip_int)
>>> encoding_character = "p"
>>> encoding_character + encoded_data
'plusab-babad'

Of course, the multibase spec could be easily extended from a single code character to an arbitrary prefix code, as <base-encoding-prefix><base-encoded-data>. One could then legitimately use pro- as the multibase prefix for proquints, which would lead to the intended result:

>>> encoding_prefix = "pro-"
>>> encoding_prefix + encoded_data
'pro-lusab-babad'
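For readers without the proquint package at hand, both outputs can be reproduced with a self-contained sketch of the proquint alphabet (16 consonants, 4 vowels, one five-letter word per 16 bits). The function names here are hypothetical and only illustrate the two prefix choices.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 4 bits per consonant
VOWELS = "aiou"                  # 2 bits per vowel


def quint(word: int) -> str:
    """Encode one 16-bit word as consonant-vowel-consonant-vowel-consonant."""
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def proquint_encode(data: bytes) -> str:
    """Encode bytes as dash-separated proquints, two bytes per word."""
    if len(data) % 2:
        raise ValueError("proquints encode whole 16-bit words")
    words = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return "-".join(quint(word) for word in words)


ip = bytes([0x7F, 0x00, 0x00, 0x01])
print("p" + proquint_encode(ip))     # single-character multibase prefix
print("pro-" + proquint_encode(ip))  # multi-character prefix proposed above
```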

@Stebalien I'm happy to make the necessary changes to the readme, table and proquint RFC, if this sounds sensible.

Dealing with padded encodings

Should we discriminate between padded and unpadded versions of encodings? For example, base32 and base64 can be used with or without padding, but most parsers will fail to parse input that has padding when it's not expected, and vice versa.
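To illustrate the failure mode, here is a small sketch using Python's standard `base64` module, which expects padding by default. The lenient helper is hypothetical and shows only one way a parser could accept both forms.

```python
import base64
import binascii


def decode_base64_lenient(text: str) -> bytes:
    """Accept padded and unpadded base64 by restoring any missing '='."""
    stripped = text.rstrip("=")
    padded = stripped + "=" * (-len(stripped) % 4)
    return base64.b64decode(padded)


try:
    base64.b64decode("TQ")  # unpadded input, padding-expecting parser
    strict_ok = True
except binascii.Error:
    strict_ok = False

print(strict_ok, decode_base64_lenient("TQ"), decode_base64_lenient("TQ=="))
```

A lenient decoder hides the difference, but it does not tell an encoder which form to emit, which is the question raised here.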

Ruby implementation of multibase

Following the format of #29

I would be interested in developing a multibase implementation in Ruby. I've read the contributing document, and I have a WIP implementation at SleeplessByte/ruby-multibase.

What other steps do I have to take?

What about Base62?

Aka alphanumeric [0-9A-Za-z].

base32 and base64 examples are wrong

The README has the following example with the multibase prefix:

UJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP  # base32 U

But the prefix for base32 is not U. (That is for base64urlpad.) The README also has the example:

yTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==      # base64 y

But y is not in the multibase table. These examples need to be fixed.
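As a sketch of what corrected examples might look like, the payloads above can be decoded and re-prefixed. This assumes the current multibase table assigns B to base32upper (unpadded) and M to base64pad; those assignments should be checked against the table before being copied into the README.

```python
import base64

# Assumed prefix assignments; verify against the multibase table.
PREFIXES = {"base32upper": "B", "base64pad": "M"}

# Payloads copied from the two README examples, without their prefixes.
payload = base64.b64decode("TXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==")
same_payload = base64.b32decode("JV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP")

base32_text = base64.b32encode(payload).decode("ascii").rstrip("=")
base64_text = base64.b64encode(payload).decode("ascii")

print(PREFIXES["base32upper"] + base32_text)
print(PREFIXES["base64pad"] + base64_text)
```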

Clarifying the varint nature of multibase codecs

This is a follow-up on multiformats/multicodec#89 (I didn't want to side-track that issue).

Issue #12 seems to be related.

From the multibase spec:

The Format is:

<varint-base-encoding-code><base-encoded-data>

So the first few bytes are a multiformats varint.

But according to multiformats/multicodec#89 (comment), the multibase encoding table is about characters (or Symbols as @Stebalien names them). This means that they depend on the string encoding. An f, which stands for base16, encoded as UTF-8/ASCII is 0x66 as hex (102 decimal) which qualifies as a varint.

But if you encode f as UTF-32 (little-endian, with a byte order mark), its byte sequence in decimal is [255, 254, 0, 0, 102, 0, 0, 0], which clearly isn't a varint.
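The difference can be checked directly. The sketch below assumes a multiformats varint is read like an unsigned LEB128 value (seven payload bits per byte, high bit set on every byte except the last); `read_varint` is a hypothetical helper, not an API from any multiformats library.

```python
import codecs


def read_varint(data: bytes) -> tuple:
    """Read one unsigned LEB128-style varint; return (value, bytes consumed)."""
    value = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * index)
        if byte < 0x80:  # a clear high bit marks the final byte
            return value, index + 1
    raise ValueError("truncated varint")


utf8 = "f".encode("utf-8")
# Spell out the byte order so the result does not depend on the host machine.
utf32 = codecs.BOM_UTF32_LE + "f".encode("utf-32-le")

print(list(utf8), read_varint(utf8))
print(list(utf32), read_varint(utf32))
```

Read as UTF-8, the single byte yields 102, the code point of f. Read as UTF-32, the leading byte order mark is consumed as varint bytes and the result has nothing to do with f.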

I propose changing the spec to something like:

The format is:
<string-encoded-identifier><base-encoded-data>
Multibases are always strings with a certain character encoding, usually UTF-8. The <string-encoded-identifier> is a sequence of characters (currently only a single one) taken from the multibase table. If the string is encoded as UTF-8, the characters of the <string-encoded-identifier> can be interpreted as a multiformats varint. This makes it possible to determine where the <string-encoded-identifier> stops and the actual base-encoded data starts.

/cc @fluency03 @Stebalien


