multibase's Introduction

Multibase

Self-identifying base encodings

Multibase is a protocol for disambiguating the "base encoding" used to express binary data in text formats (e.g., base32, base36, base64, base58, etc.) from the expression alone.

When text is encoded as bytes, we can usually use a one-size-fits-all encoding (UTF-8) because we're always encoding to the same set of 256 bytes (+/- the NUL byte). When that doesn't work, usually for historical or performance reasons, we can usually infer the encoding from the context.

However, when bytes are encoded as text (using a base encoding), the choice of base encoding (and alphabet, and other factors) is often restricted by the context. Worse, these restrictions can change based on where the data appears in the text. In some cases, we can only use [a-z0-9]; in others, we can use a larger set of characters but need a compact encoding. This has led to a large set of "base encodings", almost one for every use case. Unlike the case of encoding text to bytes, it is impractical to standardize widely around a single base encoding because there is no optimal encoding for all cases.

As data travels beyond its context, it becomes quite hard to ascertain which of the many possible base encodings was used; that's where multibase comes in. Where the data has been prefixed before leaving its context behind, it answers the question:

Given binary data d encoded into text s, what base b was used to encode it?

To answer this question, a single code point is prepended to s at time of encoding, which signals in that new context which b can be used to reconstruct d.

Format

The Format is:

<base-encoding-code-point><base-encoded-data>

Where <base-encoding-code-point> is a code representing an entry in the multibase table.
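
As a minimal sketch of the format (not an official implementation), the prefix can be separated from the payload at the string level. The helper name `split_multibase` and the hand-picked `PREFIXES` subset are assumptions made for illustration only.

```python
# Hypothetical subset of the multibase table: prefix character -> encoding name.
PREFIXES = {
    "f": "base16",
    "F": "base16upper",
    "b": "base32",
    "B": "base32upper",
    "z": "base58btc",
    "m": "base64",
    "u": "base64url",
}


def split_multibase(text: str) -> tuple[str, str]:
    """Return (encoding name, base-encoded payload) for a multibase string."""
    if not text:
        raise ValueError("empty string has no multibase prefix")
    # Python strings index by code point, so this takes one code point, not one byte.
    prefix, payload = text[0], text[1:]
    if prefix not in PREFIXES:
        raise ValueError(f"unknown multibase prefix: {prefix!r}")
    return PREFIXES[prefix], payload


print(split_multibase("f4d756c7469"))  # ('base16', '4d756c7469')
```

Decoding the payload is then a matter of dispatching on the returned encoding name.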

Multibase Table

The current multibase table is here:

Unicode,    character,  encoding,           description,                                                    status
U+0000,     NUL,        none,               (No base encoding),                                             reserved
U+0030,     0,          base2,              Binary (01010101),                                              experimental
U+0031,     1,          none,               (No base encoding),                                             reserved
U+0037,     7,          base8,              Octal,                                                          draft
U+0039,     9,          base10,             Decimal,                                                        draft
U+0066,     f,          base16,             Hexadecimal (lowercase),                                        final
U+0046,     F,          base16upper,        Hexadecimal (uppercase),                                        final
U+0076,     v,          base32hex,          RFC4648 case-insensitive - no padding - highest char,           experimental
U+0056,     V,          base32hexupper,     RFC4648 case-insensitive - no padding - highest char,           experimental
U+0074,     t,          base32hexpad,       RFC4648 case-insensitive - with padding,                        experimental
U+0054,     T,          base32hexpadupper,  RFC4648 case-insensitive - with padding,                        experimental
U+0062,     b,          base32,             RFC4648 case-insensitive - no padding,                          final
U+0042,     B,          base32upper,        RFC4648 case-insensitive - no padding,                          final
U+0063,     c,          base32pad,          RFC4648 case-insensitive - with padding,                        draft
U+0043,     C,          base32padupper,     RFC4648 case-insensitive - with padding,                        draft
U+0068,     h,          base32z,            z-base-32 (used by Tahoe-LAFS),                                 draft
U+006b,     k,          base36,             Base36 [0-9a-z] case-insensitive - no padding,                  draft
U+004b,     K,          base36upper,        Base36 [0-9a-z] case-insensitive - no padding,                  draft
U+007a,     z,          base58btc,          Base58 Bitcoin,                                                 final
U+005a,     Z,          base58flickr,       Base58 Flickr,                                                  experimental
U+006d,     m,          base64,             RFC4648 no padding,                                             final
U+004d,     M,          base64pad,          RFC4648 with padding - MIME encoding,                           experimental
U+0075,     u,          base64url,          RFC4648 no padding,                                             final
U+0055,     U,          base64urlpad,       RFC4648 with padding,                                           final
U+0070,     p,          proquint,           Proquint (https://arxiv.org/html/0901.4016),                    experimental
U+0051,     Q,          none,               (no base encoding),                                             reserved
U+002F,     /,          none,               (no base encoding),                                             reserved
U+1F680,    🚀,         base256emoji,       base256 with custom alphabet using variable-sized-codepoints,   experimental

NOTE: Multibase-prefixes are encoding agnostic. "z" is "z", not 0x7a ("z" encoded as ASCII/UTF-8). In UTF-32, for example, that same "z" would be [0x7a, 0x00, 0x00, 0x00] not [0x7a], so detecting and dropping an initial byte of 0x7a would not suffice to confirm the rest was base58btc-encoded bytes; [0x7a, 0x00, 0x00, 0x00] would instead be the UTF-32 bytes that correspond to the z codepoint for that entry, and the entire byte array would need to be detected and dropped. Also note the difference between 0x00 (codepoint 0 or 0x00) and 0 (codepoint 48 or 0x30).
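
The note above can be illustrated with a short sketch: the prefix is handled as a code point after decoding the text, not as a raw byte. The helper name `strip_prefix` is hypothetical, and little-endian UTF-32 without a byte-order mark is assumed.

```python
prefix = "z"

# The prefix is a code point; its byte form depends on the text encoding.
utf8_bytes = prefix.encode("utf-8")       # one byte: 0x7a
utf32_bytes = prefix.encode("utf-32-le")  # four bytes: 0x7a 0x00 0x00 0x00


def strip_prefix(raw: bytes, encoding: str) -> tuple[str, str]:
    """Decode to text first, then split off the first code point."""
    text = raw.decode(encoding)
    return text[0], text[1:]


sample = "zYAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt"
for enc in ("utf-8", "utf-32-le"):
    code, payload = strip_prefix(sample.encode(enc), enc)
    print(enc, code, payload == sample[1:])

# NUL (code point 0) and the digit "0" (code point 0x30) are different prefixes.
print(ord("\x00"), ord("0"))  # 0 48
```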

Specifications

Below is a list of specs for the underlying base encodings:

Status

Each multibase encoding has a status:

  • reserved - for functional reasons or to avoid collisions with other multi-* registries, this registry cannot accept registrations at this code-point and implementing one unregistered is discouraged for interoperability reasons
  • experimental - these encodings have been proposed but are not widely implemented and may be removed.
  • draft - these encodings are mature and widely implemented but may not be implemented by all implementations.
  • final - these encodings should be implemented by all implementations and are widely used.
  • deprecated - this entry will likely be removed and reassigned in the future and it will not likely become a final registration

Reserved Terms

The following codes are reserved and cannot be registered in the multibase table. Note that all three of the Unicode entries, expressed as the unsigned varint expression of that Unicode code-point in UTF-8, correspond to widely-used entries in the multiformats registry group that could create confusion for some legacy systems handling both binary and multibased structures from other multiformats. While technically the multibase registry is not part of the multiformats registry group, these reservations minimize risk of confusion when composing multiple multiformats in one data system.

  • NUL (n/a) - Legacy data may be found with null-byte-prefixed binary structures mixed in among multibase-encoded ones in arrays of data, although support for this is no longer mandated by conformant implementations.
  • / (U+002F) - Separator used by multiaddr.
  • 1 (U+0031) - Base58-encoded identity multihashes used by libp2p peer IDs.
  • Q (U+0051) - Base58-encoded sha2-256 multihashes used by libp2p/ipfs for peer IDs and CIDv0.

Multibase By Example

Consider the following encodings of the same binary string:

4D756C74696261736520697320617765736F6D6521205C6F2F # base16 (hex)
JV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP           # base32
3IY8QKL64VUGCX009XWUHKF6GBBTS3TVRXFRA5R            # base36
YAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt                 # base58
TXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==               # base64

And consider the same encodings with their multibase prefix:

F4D756C74696261736520697320617765736F6D6521205C6F2F # base16 F
BJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP           # base32 B
K3IY8QKL64VUGCX009XWUHKF6GBBTS3TVRXFRA5R            # base36 K
zYAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt                 # base58 z
MTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==               # base64 M

The base prefixes used are: F, B, K, z, M.
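
Three of the prefixed examples can be reproduced with Python's standard library alone. This is a sketch rather than a reference implementation; base36 and base58 are left out because the standard library has no encoder for them.

```python
import base64

data = b"Multibase is awesome! \\o/"

encoded = {
    "base16upper": "F" + data.hex().upper(),
    # "B" is the unpadded variant, so any "=" padding is stripped
    # (none occurs for this 25-byte input).
    "base32upper": "B" + base64.b32encode(data).decode("ascii").rstrip("="),
    "base64pad": "M" + base64.b64encode(data).decode("ascii"),
}

for name, text in encoded.items():
    print(f"{name:12} {text}")
```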

FAQ

Is this a real problem?

Yes. If I give you "1214314321432165", is that decimal, hex, or something else? See also:

Why the strange selection of codes / characters?

The code values are selected such that they are included in the alphabets of the base they represent. For example, f is the base code for base16 (hex), because f is in hex's 16 character alphabet. Note that most of the alphabets used can be encoded in UTF-8, and most but not all can be encoded in ASCII. We have not yet found a case needing something else.
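
The selection rule can be checked mechanically for a handful of entries. In this sketch the alphabets are written out by hand for illustration (the `ALPHABETS` mapping is an assumption, not data taken from the registry), keyed by the prefixes from the table above.

```python
import string

# Prefix character -> alphabet of the encoding it identifies (illustrative subset).
ALPHABETS = {
    "0": "01",                                      # base2
    "7": "01234567",                                # base8
    "9": string.digits,                             # base10
    "f": string.digits + "abcdef",                  # base16
    "v": string.digits + "abcdefghijklmnopqrstuv",  # base32hex
    "b": string.ascii_lowercase + "234567",         # base32
    "k": string.digits + string.ascii_lowercase,    # base36
    "z": "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz",  # base58btc
    "m": string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/",  # base64
}

for prefix, alphabet in ALPHABETS.items():
    # Each prefix should be a member of its own alphabet.
    print(prefix, len(alphabet), prefix in alphabet)
```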

Don't we have to agree on a table of base encodings?

Yes, but we already have to agree on base encodings, so this is not hard. The table even leaves some room for custom encodings and is intended to work both in contexts where the encodings are known or agreed on and open-world or brownfield contexts where these may vary.

Implementations:

Disclaimers

Warning: obviously multibase changes the first character depending on the encoding. Do not expect the value to be exactly the same. Remove the multibase prefix before using the value.

Contribute

Contributions welcome. Please check out the issues and read the contributing document for the greater multiformats project before opening your first issue, as the workflow and the relation of multibase to the greater project both benefit from this context. The contributing document has more information on how we work and about contributing in general.

If you'd like to switch a project over to multibase, whether by creating a new multibase implementation or building on one of those listed above, please file an issue in this repository using the "Interested in implementing" issue template. If you would also like to reserve a prefix for compatibility, please file a separate issue in this repository using the "New Registration" issue template.

License

This repository is only for documents. All of these are licensed under the CC-BY-SA 3.0 license © 2016 Protocol Labs Inc. Any code is under an MIT license © 2016 Protocol Labs Inc.

multibase's People

Contributors

ben221199, bumblefudge, changjiashuai, daviddias, dignifiedquire, fabianhjr, gowthamgts, ianopolous, jbenet, jbrooker, jorropo, kevina, kubuxu, mateon1, nocursor, pdxjohnny, pombredanne, ribasushi, richardlitt, richardschneider, rvagg, sergey-shandar, sg495, stebalien, tabrath, theobat, vmx, vorburger, whyrusleeping, wigy-opensource-developer


multibase's Issues

Consider encoding: airtameg

Airtameg is a base 26 alphabetic encoding, designed to fit data relatively efficiently into Aztec Codes (which use 5 bits per character in long single-case alphabetic strings, before error-correcting codes are added). I'd suggest a as the code for airtameg since it represents 0 in airtameg (though it would still need to be stripped before decoding) and is the first character of its name.

Airtameg is specified as an instance of the more general chunky base b encodings, which provide a quick way to specify encodings of bytes into any alphabet of between 2 and 256 characters, instead of having to write a new specification for each new encoding scheme, going through all the details regarding endianness, leading zeros, etc.

Alphabet code values are no longer all ASCII

In the FAQ section of README.md it is stated that:

The code values are selected such that they are included in the alphabets of the base they represent. For example, f is the base code for base16 (hex), because f is in hex's 16 character alphabet. Note that the alphabets can be encoded in ASCII or UTF8. We have not found a case needing something else.

With the introduction of base256emoji, this is no longer the case, so it might be worth re-wording the section to avoid confusion.

Adding Base45 (aka ISO/IEC 18004:2006 Alphanumeric Mode QR code)

There is a special compression mode in the ISO QR code ( ISO/IEC 18004:2006 ) standard called "Alphanumeric Mode". When you stay within this limited character set, it will do its own compression and error correction (~44% or more).

From QR code standard ISO/IEC 18004:2000 §8.3.3

Alphanumeric Mode
Alphanumeric Mode encodes data from a set of 45 characters, i.e. 10 numeric digits (0 - 9) (ASCII values 30HEX to 39HEX), 26 alphabetic characters (A - Z) (ASCII values 41HEX to 5AHEX) , and 9 symbols (SP, $, %, *, +, -, ., /, :) (ASCII values 20HEX, 24HEX, 25HEX, 2AHEX, 2BHEX, 2D to 2FHEX, 3AHEX respectively). Normally, two input characters are represented by 11 bits.

The advantage of encoding binary data in Base45 in QR code related scenarios is that the encoding does not require additional compression or checksums. But you do have to not use lower-case letters or other non-allowed characters, which can force you into the much less efficient QR binary mode.
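
The issue does not say how bytes map onto the 45 characters. The sketch below assumes the scheme later standardized as RFC 9285 (two bytes become three characters, least significant digit first, and a trailing single byte becomes two characters). That choice and the function name `base45_encode` are assumptions, not part of the proposal.

```python
# QR alphanumeric-mode character set, in QR value order (0..44).
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def base45_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 2):
        chunk = data[i:i + 2]
        value = int.from_bytes(chunk, "big")
        # Two bytes become three characters; a trailing single byte becomes two.
        for _ in range(3 if len(chunk) == 2 else 2):
            value, digit = divmod(value, 45)
            out.append(ALPHABET[digit])  # least significant digit first
    return "".join(out)


print(base45_encode(b"AB"))       # BB8
print(base45_encode(b"Hello!!"))  # %69 VD92EX0
```

Note that the output stays within uppercase letters, digits, and the nine symbols, which is what keeps a QR encoder in alphanumeric mode.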

I would like to see this mode added to multibase. I'm not sure if it is allowed for multibase, but ASCII 45 (-) is both allowed in a URL and is in the base45 character set. Second choice would be 4 or 5; otherwise some other available unreserved multibase prefix character is fine.

As this base45 is already an international standard, used not only by QR codes but also some fairly obscure things like satellite radio, I think it would be a good addition.

My particular usecase is for encoding cryptographic keys and signatures to be used in air-gapped offline QR code based scenarios, such as #LetheKit

Let me know if this would be considered something I should create an initial PR for, or if the maintainers of this repo want to add it.

-- Christopher Allen

Base 2, base 8, and base 10

The odd-balls in the current multibase spec are:

  • Base 2
  • Base 8
  • Base 10

That is, these are generally considered less useful than the other bases. The current situation is:

  • Base 2 is useful for bitfields.
  • One of base 8 or base 10 may be useful when only digits (0-9) are allowed.
    • Base 10 has a spec.
    • Base 10 is more compact.
    • Base 8 may be simpler to decode/encode.

The question is: which of these should we keep, if any? This is relevant to multiformats/go-multibase#26 as, if we keep base 8, we need to define and implement it.

Suggestion: split the `code` column into `code_char` and `code_ascii`

Reading the table, it can be a bit confusing if the code field is a char or an integer or, even worse, an escape code (like \t and \b).

I think that reformatting the table in the following manner can avoid confusion.

encoding,          code_char, code_ascii, description,                                              status
identity,          <NUL>,     0x00,       8-bit binary (encoder and decoder keeps data unmodified), default
base2,             0,         0x30,       binary (01010101),                                        candidate
base8,             7,         0x37,       octal,                                                    draft
base10,            9,         0x39,       decimal,                                                  draft
base16,            f,         0x66,       hexadecimal,                                              default
base16upper,       F,         0x46,       hexadecimal,                                              default
base32hex,         v,         0x76,       rfc4648 case-insensitive - no padding - highest char,     candidate
base32hexupper,    V,         0x56,       rfc4648 case-insensitive - no padding - highest char,     candidate
base32hexpad,      t,         0x74,       rfc4648 case-insensitive - with padding,                  candidate
base32hexpadupper, T,         0x54,       rfc4648 case-insensitive - with padding,                  candidate
base32,            b,         0x62,       rfc4648 case-insensitive - no padding,                    default
base32upper,       B,         0x42,       rfc4648 case-insensitive - no padding,                    default
base32pad,         c,         0x63,       rfc4648 case-insensitive - with padding,                  candidate
base32padupper,    C,         0x43,       rfc4648 case-insensitive - with padding,                  candidate
base32z,           h,         0x68,       z-base-32 (used by Tahoe-LAFS),                           draft
base36,            k,         0x6b,       base36 [0-9a-z] case-insensitive - no padding,            draft
base36upper,       K,         0x4b,       base36 [0-9A-Z] case-insensitive - no padding,            draft
base58btc,         z,         0x7a,       base58 bitcoin,                                           default
base58flickr,      Z,         0x5a,       base58 flickr,                                            candidate
base64,            m,         0x6d,       rfc4648 no padding,                                       default
base64pad,         M,         0x4d,       rfc4648 with padding - MIME encoding,                     candidate
base64url,         u,         0x75,       rfc4648 no padding,                                       default
base64urlpad,      U,         0x55,       rfc4648 with padding,                                     default

Multicodec names of multibase

My proposal is /multibase/$BASENAME/
Example: `/multibase/base16/`

Multicodec currently recommends /b16/ and so on but IMO this is occupying too much of shared namespace.

Haskell implementation for multibases.

Hello,
I would be interested in developing a multibase implementation in Haskell. I've read the contributing doc but I still have a few questions:

  • Is anyone working on this already, or can I go for it?
  • I'm still pretty new to GitHub: do I directly pull the multiformats repo, or do I have to submit the code for checking first?

I hope I can be useful to this project, and even though I lack experience I'm here to learn, so don't hesitate to reject my code if it's not good enough.

Add test vectors to test edge cases

  • Empty string.
  • All numbers.
  • Emoji.
  • Chinese.
  • Strings of length 1-8, 30, 31, 32, 33, 500, 1000.
  • Strings of all nulls.
  • Strings of all 0xff (if possible?)
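
A sketch of how such inputs could be generated and pushed through a round-trip check. The helper name `edge_case_inputs`, the sample strings, and the 16-byte length chosen for the all-null and all-0xff cases are assumptions; the standard-library base64 codec stands in for whichever multibase codec is under test.

```python
import base64


def edge_case_inputs() -> dict[str, bytes]:
    """Build a labelled set of byte strings covering the edge cases listed above."""
    cases = {
        "empty": b"",
        "all_numbers": b"0123456789",
        "emoji": "🚀".encode("utf-8"),
        "chinese": "中文".encode("utf-8"),
        "all_nulls": bytes(16),
        "all_0xff": b"\xff" * 16,
    }
    for n in [*range(1, 9), 30, 31, 32, 33, 500, 1000]:
        cases[f"len_{n}"] = bytes(i % 256 for i in range(n))
    return cases


for name, raw in edge_case_inputs().items():
    text = "M" + base64.b64encode(raw).decode("ascii")  # multibase base64pad
    assert base64.b64decode(text[1:]) == raw, name
print("all round trips succeeded")
```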

Case insensitivity

We currently have base encodings that are spec'ed as case-insensitive (Base32 and Base36). I can see that this is convenient from an application perspective. But wouldn't it make sense to be strict at the lowest level (the level of this spec) and reject wrongly cased encodings?

The convenient normalization can always happen at an application level, it would also always have enough information to do so.

Hence I'm in favour of being more strict and making things case sensitive.

PS: Thanks @hugomrdias for bringing this up.

test vectors

we should have a set of test vectors for implementations

Remove canonization of Base2

Base2 should be taken as-is. If it isn't 8k+1 in length it should be considered invalid since the missing zeros could be either from the start or the end.
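
A minimal sketch of the strict behaviour proposed here, assuming the length rule counts the multibase prefix (one prefix character plus 8k bits). The function name `decode_base2_strict` is hypothetical.

```python
def decode_base2_strict(text: str) -> bytes:
    """Decode a multibase base2 string, accepting only '0' followed by 8k bits."""
    if not text or text[0] != "0":
        raise ValueError("missing base2 multibase prefix '0'")
    bits = text[1:]
    if len(bits) % 8 != 0:
        # Missing zeros could belong at the start or the end, so refuse to guess.
        raise ValueError("base2 payload must be a whole number of bytes")
    if set(bits) - {"0", "1"}:
        raise ValueError("base2 payload may only contain '0' and '1'")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(decode_base2_strict("0" + "01111001"))  # b'y'
```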

Standard prefixes for other bases

We have multibase prefixes for standardized encodings like base64 or base16, but what about other bases like base 120 or base 3? What about encodings that support custom charsets?

For example, if a user encodes a UTF-8 string in base 35 (for some reason) and wants to use their own custom charset (for the same kind of reasons that things like base64 / base64URL exist), how would we prefix that? It's planned that haskell-multibases-2.0 will support encoding with bases 1 to 255 and custom charsets, but I'm kind of stumped on the whole prefix business.

Base94: all of ascii except control and whitespace

All of ascii in order from ! to ~ (inclusive) following a classical radix strategy, no padding.

Key: ~

This trades for maximum density while still being usable and exclusively ASCII.
Sadly this alphabet is breakable (under Unicode's rules it contains characters that allow word breaks), but that's the kind of tradeoff you have to make when you want compactness.
It's probably also slow to encode and decode; again, tradeoffs have to be made.

No padded version because we lack a free character to do that with, also if you care about compactness you probably don't care about padding.
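
A rough sketch of the "classical radix strategy" with the proposed `~` key. The proposal does not say how leading zero bytes are handled; this sketch assumes they are kept as leading `!` characters (digit 0), in the way base58btc keeps them as `1`. Function names are hypothetical.

```python
ALPHABET = "".join(chr(c) for c in range(0x21, 0x7F))  # '!' .. '~', 94 characters


def base94_encode(data: bytes) -> str:
    number = int.from_bytes(data, "big")
    digits = []
    while number:
        number, rem = divmod(number, 94)
        digits.append(ALPHABET[rem])
    # Assumption: each leading zero byte is emitted as one leading '!' (digit 0).
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "~" + ALPHABET[0] * zeros + "".join(reversed(digits))


def base94_decode(text: str) -> bytes:
    if not text.startswith("~"):
        raise ValueError("missing '~' prefix")
    payload = text[1:]
    zeros = len(payload) - len(payload.lstrip(ALPHABET[0]))
    number = 0
    for ch in payload[zeros:]:
        number = number * 94 + ALPHABET.index(ch)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * zeros + body


sample = b"\x00yes mani !"
print(base94_decode(base94_encode(sample)) == sample)  # True
```

The big-integer conversion is quadratic in the input length, which matches the remark that this encoding is probably slow.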

If no one has anything to object about it, I'll open a PR for a spec and a go implementation in the future.

Please clarify "highest char/letter"

According to rfc4648, z is included in all of base32, base64, and base64url, and yet we have this:

base32        U, u    rfc4648 - highest letter
base64        y       rfc4648 highest char
base64url     Y       rfc4648 highest char

I'm confused by what "highest letter/char" is supposed to mean.

Consistency of numeral prefixes

When looking at the numeral prefixes, it appears to me that they are not fully consistent. This happens because the prefix for base2 is 0 and not 1.

Base,   current prefix,  consistent prefix,  consistent?
1,      1,               0,                  NO (but removed from the spec)
2,      0,               1,                  NO
3,      -,               2,                  (not used)
4,      -,               3,                  (not used)
5,      -,               4,                  (not used)
6,      -,               5,                  (not used)
7,      -,               6,                  (not used)
8,      7,               7,                  YES
9,      -,               8,                  (not used)
10,     9,               9,                  YES
11,     -,               A/a,                (not used)
12,     -,               B/b,                NO (used by base32, but I think that is fine)
13,     -,               C/c,                NO (used by base32pad, but I think that is fine)
14,     -,               D/d,                (not used)
15,     -,               E/e,                (not used)
16,     F/f,             F/f,                YES

The table above shows what I mean. The bases marked NO have some inconsistency.

Consider encoding: WordBase-2048

Consider use case and factors:

Legal documents such as for a non-profit, corporation, or legally recognized DAO: the author or script wishes to reference a CID, DID, smart contract, key, or other identifiers. I think a QR Code would be preferable. However, not all jurisdictions or filing processes support this. Documents might be printed, photocopied on an old copier, and then rescanned. OCR makes this easier, except when the document is difficult to read by machine, smudged, blurred, faded, or preferred to be checked by hand. Humans can read words more easily, and reading words provides an organic type of error correction. Words can also easily be read aloud and voice-recognized into another interface.

Proposal:

I suggest the word lists from BIP-39 be used to create a base 2048 in several languages. Primarily first in English. Perhaps a special indicator word/phrase could be used in the entire multiformat use, or the standard could rely entirely on the 2048 words. This would work similarly to a seedphrase. Perhaps in the future seed phrases could even have multiformat self-describing their parameters.

References:

https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md

Out-of-scope:

Machine error correction by additional word guessing and correction coding/checking might also be added to this but is outside the scope of this issue/feature. Seed phrase parameters would have to be another standard/proposal if enough people find this useful.

proposal: base4cloaky

This is a base4 (2bits per codepoint) made of:

  • U+2060 WORD JOINER 0
  • U+2061 FUNCTION APPLICATION 1
  • U+2062 INVISIBLE TIMES 2
  • U+2063 INVISIBLE SEPARATOR 3

One particularity: invalid code points are simply skipped over.
U+2064 INVISIBLE PLUS is then used to indicate the end of the data.
If the data is incomplete when the terminator is reached, the remaining byte is zero-padded.

The key would be U+2060.
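
A sketch of how the scheme described above might work. The proposal does not fix the bit order, so this assumes the most significant bit pair of each byte comes first; the terminator is taken to be INVISIBLE PLUS (U+2064 in Unicode), and the names `cloak` and `uncloak` are hypothetical.

```python
DIGITS = "\u2060\u2061\u2062\u2063"  # WORD JOINER .. INVISIBLE SEPARATOR = 0..3
TERMINATOR = "\u2064"                # INVISIBLE PLUS
KEY = "\u2060"                       # proposed multibase key


def cloak(data: bytes) -> str:
    """Encode bytes as key + four invisible code points per byte + terminator."""
    symbols = [KEY]
    for byte in data:
        # Assumption: most significant bit pair first.
        symbols += [DIGITS[(byte >> shift) & 3] for shift in (6, 4, 2, 0)]
    symbols.append(TERMINATOR)
    return "".join(symbols)


def uncloak(text: str) -> bytes:
    """Recover hidden bytes, skipping every code point that is not a digit."""
    start = text.find(KEY)
    if start < 0:
        raise ValueError("no base4cloaky key found")
    out, value, count = bytearray(), 0, 0
    for ch in text[start + 1:]:
        if ch == TERMINATOR:
            break
        if ch not in DIGITS:
            continue  # cover text is skipped
        value = (value << 2) | DIGITS.index(ch)
        count += 1
        if count == 4:
            out.append(value)
            value, count = 0, 0
    if count:
        out.append(value << (2 * (4 - count)))  # zero-pad an incomplete final byte
    return bytes(out)


hidden = cloak(b"hi")
mixed = "plain " + hidden[:4] + "looking " + hidden[4:] + "text"
print(uncloak(mixed))  # b'hi'
```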

This allows hiding those CIDs in other pieces of text (including other CIDs).
Here is a random example:

⁠Th⁠is is a⁡ to⁢t⁡al⁢⁠l⁢⁡y ⁡nor⁠⁢m⁡al p⁢⁠⁡ie⁡ce ⁢o⁡f t⁠⁡e⁡xt⁤.

Multihash encoding recommendations (base64url, base58)

I've been looking for a way to get the benefits of multihash but without base58 encoding. After reading some more I realized that while base58 appears to be the most common encoding for Multihash because of its use in IPFS, the multihash specification doesn't mention base58 (https://tools.ietf.org/html/draft-multiformats-multihash-00).

My understanding is that the best approach would be to avoid forcing a specific base encoding and to combine multibase with multihash. Is this the recommended option for the future, and will there be a few "recommended" encodings to be used with multihash out of the ones supported by multibase?

For instance, it would make sense to recommend using URL-safe base encoding such as base64url to avoid losing some of the interesting properties of base58 when switching to a different base encoding. In my case, I dislike base58 encoding because it doesn't align well to byte boundaries, which is why I would rather use multihash with base64url.

Code table mismatch

Hi all,

There is a mismatch between the code tables in the csv and the readme.

More specifically we have :

base,       csv,  readme
base32,     Bb,   Uu
base64,     m,    y
base64url,  u,    Y

Which one is authoritative ?

Base256emoji: why the leading null byte in the example?

in tests/basics.csv the string example "yes mani !" is encoded with "🚀🏃✋🌈😅🌷🤤😻🌟😅👏"

The rocket at the beginning corresponds to a null byte. Why is it here? I thought a leading null byte marked the identity encoding. Does this mean the base256emoji encoding also implies an identity encoding, like is it its default textual representation or something? In that case it would mean all base256emoji strings start with a rocket, right?

Canonical encoding rules for ambiguous situations

In some encodings without existing standards (like RFC4648), there is a potential for ambiguous encodings and this should be handled in some way. For example, the common test case "yes mani !" is usually base-2 encoded to "01111001011001010111001100100000011011010110000101101110011010010010000000100001" in implementations, implying that leading zeros are dropped. However, this means that "\x00yes mani !" also encodes to "01111001011001010111001100100000011011010110000101101110011010010010000000100001" which will absolutely cause problems for someone somewhere at some point.

There are a few different ways to resolve this ambiguity.

One would be to override the existing example and require fixed-length encoding so "yes mani !" would encode to "001111001011001010111001100100000011011010110000101101110011010010010000000100001" and "\x00yes mani !" to "00000000001111001011001010111001100100000011011010110000101101110011010010010000000100001". This would require decoding variable-width encodings as the shortest matching string, for backwards compatibility.

Another would be to permit dropping leading zeros as long as the encoding remains unambiguous. In this case the existing example would still be valid, "\x00yes mani !" would encode to "0001111001011001010111001100100000011011010110000101101110011010010010000000100001" and "\x00\x00yes mani !" to "000000000001111001011001010111001100100000011011010110000101101110011010010010000000100001".
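
The ambiguity, and the fixed-length resolution, can be shown in a few lines. Both function names are hypothetical. The first mimics a big-integer style conversion that drops leading zeros, while the second always emits eight bits per byte after the `0` prefix.

```python
def base2_via_integer(data: bytes) -> str:
    """Big-integer conversion: leading zero bits (and whole zero bytes) are lost."""
    return "0" + format(int.from_bytes(data, "big"), "b")


def base2_fixed_width(data: bytes) -> str:
    """Eight bits per byte, so every input has exactly one encoding."""
    return "0" + "".join(f"{byte:08b}" for byte in data)


a, b = b"yes mani !", b"\x00yes mani !"
print(base2_via_integer(a) == base2_via_integer(b))  # True: two inputs, one encoding
print(base2_fixed_width(a) == base2_fixed_width(b))  # False: encodings stay distinct
```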

A similar problem applies to base-8 and base-10, with the added complexity of the latter case not even mapping cleanly to bits to begin with. A canonical set of test vectors covering these and other edge cases would be extremely useful for implementers; see #24.

Base2 Zero Width

We should add a codec for encoding binary in zero-width joiners and zero-width non-joiners. Why? Because this would be awesome.

(mild :trollface:)

Proquint prefix in RFC

According to https://github.com/multiformats/multibase/blob/master/rfcs/PRO-QUINT.md, the "full" prefix for proquints is pro-, but the multibase prefix is p. As the multibase spec dictates that encoded strings be <base-encoding-character><base-encoded-data>, the result of encoding IP 127.0.0.1 (as the 4-byte bytestring [0x7f, 0x00, 0x00, 0x01]) would be:

>>> import proquint
>>> ip = bytes([0x7f, 0x00, 0x00, 0x01])
>>> ip_int = int.from_bytes(ip, byteorder="big")
>>> encoded_data = proquint.uint2quint(ip_int)
>>> encoding_character = "p"
>>> encoding_character+encoded_data
'plusab-babad'

Of course, the multibase spec could be easily extended from a single code character to an arbitrary prefix code, as <base-encoding-prefix><base-encoded-data>. One could then legitimately use pro- as the multibase prefix for proquints, which would lead to the intended result:

>>> encoding_prefix = "pro-"
>>> encoding_prefix +encoded_data
'pro-lusab-babad'
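
For readers without the third-party `proquint` package, the same strings can be reproduced with a small self-contained encoder. This is a sketch of the published proquint scheme (each 16-bit word becomes consonant, vowel, consonant, vowel, consonant); the function name `proquint_encode` is hypothetical.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 4 bits each
VOWELS = "aiou"                  # 2 bits each


def proquint_encode(data: bytes) -> str:
    if len(data) % 2:
        raise ValueError("proquints encode whole 16-bit words")
    words = []
    for i in range(0, len(data), 2):
        w = int.from_bytes(data[i:i + 2], "big")
        words.append(
            CONSONANTS[(w >> 12) & 0xF] + VOWELS[(w >> 10) & 0x3]
            + CONSONANTS[(w >> 6) & 0xF] + VOWELS[(w >> 4) & 0x3]
            + CONSONANTS[w & 0xF]
        )
    return "-".join(words)


ip = bytes([0x7F, 0x00, 0x00, 0x01])
print("p" + proquint_encode(ip))     # plusab-babad
print("pro-" + proquint_encode(ip))  # pro-lusab-babad
```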

@Stebalien I'm happy to make the necessary changes to the readme, table and proquint RFC, if this sounds sensible.

Dealing with padded encodings

Should we discriminate between padded and unpadded versions of encodings? For example, base32 and base64 can be used with or without padding, but most parsers will fail to parse one if it has padding when it's not expected, and vice versa.
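
The failure mode is easy to reproduce with Python's standard library, and separate prefixes give a decoder enough information to handle both forms. The sketch below assumes the `m` (base64, no padding) and `M` (base64pad) prefixes from the table; the helper name `decode_multibase_base64` is hypothetical.

```python
import base64
import binascii

try:
    base64.b64decode("TQ")  # unpadded input handed to a padding-expecting parser
except binascii.Error as exc:
    print("stdlib parser rejects it:", exc)


def decode_multibase_base64(text: str) -> bytes:
    """Handle 'm' (no padding) and 'M' (with padding) explicitly."""
    if not text:
        raise ValueError("empty string has no multibase prefix")
    prefix, payload = text[0], text[1:]
    if prefix == "m":
        if "=" in payload:
            raise ValueError("'m' payload must not carry padding")
        payload += "=" * (-len(payload) % 4)  # restore padding for the stdlib decoder
    elif prefix == "M":
        if len(payload) % 4:
            raise ValueError("'M' payload must be padded to a multiple of 4")
    else:
        raise ValueError(f"not a base64 multibase prefix: {prefix!r}")
    return base64.b64decode(payload, validate=True)


print(decode_multibase_base64("mTQ"))    # b'M'
print(decode_multibase_base64("MTQ=="))  # b'M'
```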

base32 and base64 examples are wrong

The README has the following example with the multibase prefix:

UJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP  # base32 U

But the prefix for base32 is not U . (That is for base64urlpad.) The README also has the example:

yTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==      # base64 y

But y is not in the multibase table. These examples need to be fixed.

Clarifying the varint nature of multibase codecs

This is a follow-up on multiformats/multicodec#89 (I didn't want to side-track that issue).

Issue #12 seems to be related.

From the multibase spec:

The Format is:

<varint-base-encoding-code><base-encoded-data>

So the first few bytes are a multiformats varint.

But according to multiformats/multicodec#89 (comment), the multibase encoding table is about characters (or Symbols as @Stebalien names them). This means that they depend on the string encoding. An f, which stands for base16, encoded as UTF-8/ASCII is 0x66 as hex (102 decimal) which qualifies as a varint.

But if you encode f as UTF-32, its byte sequence as decimals is [255, 254, 0, 0, 102, 0, 0, 0], which clearly isn't a varint.

I propose changing the spec to something like:

The format is:

<string-encoded-identifier><base-encoded-data>

Multibases are always strings with a certain character encoding, usually UTF-8. The <string-encoded-identifier> is a sequence of characters (currently only a single one) according to the multibase table. If the string is encoded as UTF-8 the sequence of the characters of the <string-encoded-identifier> can be interpreted as a multiformats varint. This way it can be determined where the <string-encoded-identifier> stops and the actual base encoded data starts.
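
The difference between the two encodings can be checked with a small unsigned-varint reader. This is a sketch: `read_uvarint` is a hypothetical helper implementing the usual 7-bits-per-byte, least-significant-group-first layout, and the UTF-32 bytes are built with an explicit little-endian byte-order mark to match the sequence quoted above.

```python
import codecs


def read_uvarint(buf: bytes) -> tuple[int, int]:
    """Decode an unsigned varint; return (value, number of bytes consumed)."""
    value = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << (7 * i)
        if byte < 0x80:  # a clear high bit marks the last byte
            return value, i + 1
    raise ValueError("truncated varint")


utf8 = "f".encode("utf-8")
print(read_uvarint(utf8))  # (102, 1): the prefix reads back as code point 102

utf32 = codecs.BOM_UTF32_LE + "f".encode("utf-32-le")
print(list(utf32))          # [255, 254, 0, 0, 102, 0, 0, 0]
print(read_uvarint(utf32))  # decodes the byte-order mark, not the prefix
```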

/cc @fluency03 @Stebalien
