emoun / racp
Revised ASCII Codes for Programming - A new character set for the modern programming environment
License: MIT License
Evaluate whether to change the tabbing rules to use Elastic Tabstops:
http://nickgravgaard.com/elastic-tabstops/
The key difference between Elastic Tabstops and the current tabbing rules:
The current rules dictate that given two succeeding lines, all tabs of both lines must align if both lines have the same number of tabs.
Elastic Tabstops dictate that given two succeeding lines, all the tabs that they share must align. I.e. if the first line has 1 tab and the second has 3, the first tab in both lines must align. If the first line has 4 tabs and the second has 3, then the first 3 tabs in both lines must align.
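The difference between the two rules can be sketched as a small Python function that counts how many leading tab positions two succeeding lines are required to align. The function name `tabs_required_to_align` and the rule labels `"current"` and `"elastic"` are illustrative assumptions, not part of the RACP specification.

```python
def tabs_required_to_align(line_a: str, line_b: str, rule: str) -> int:
    """Return how many leading tab positions must align between two succeeding lines."""
    tabs_a = line_a.count("\t")
    tabs_b = line_b.count("\t")
    if rule == "current":
        # Current rule: alignment is only required when the tab counts match.
        return tabs_a if tabs_a == tabs_b else 0
    if rule == "elastic":
        # Elastic Tabstops: the tabs that both lines share must align.
        return min(tabs_a, tabs_b)
    raise ValueError(f"unknown rule: {rule!r}")
```

With one tab on the first line and three on the second, the current rule requires nothing, while the elastic rule requires the first tab to align.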
This issue will be used to discuss the pros and cons of either specification.
An implementation of Elastic Tabstops should be done in a dedicated branch.
Thanks to @henrikh for bringing Elastic Tabstops to my attention.
The presence of the NULL character (code 0) is a legacy from ASCII. The character itself is weird, as its presence in a string implies the absence of a character at that point in the string. Shouldn't there then just not be a character?
The only use case for such a character, as far as I know, is in a null-terminated string. Such strings, however, are clearly a bad idea, as a simple search will prove. As such, I propose to remove NULL from RACP and repurpose the code point. I propose that the character for FALSE be moved to the 0 code point and the character for TRUE to the 1 code point. As such, the NULL character will be replaced by the logical FALSE.
Removing the NULL character will mitigate all problems connected to it, while putting the FALSE character at the 0 code point will allow for easy recognition of it.
One of the better parts of ASCII was its placement of the numerical characters. The decimal characters 0 through 9 occupy the code points (decimal) 48 to 57. This placement allows for easy conversion from the characters to their integer values. This is done by masking the character to extract only its 4 least significant bits (filling the rest with 0).
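As a minimal sketch of the masking step, using Python's `ord` on ASCII digits (the helper name `digit_value` is only illustrative):

```python
def digit_value(ch: str) -> int:
    """Convert a decimal digit character to its integer value by masking."""
    code_point = ord(ch)
    if not 48 <= code_point <= 57:
        raise ValueError(f"not a decimal digit: {ch!r}")
    # Keep only the 4 least significant bits: 0x30..0x39 -> 0..9.
    return code_point & 0x0F
```

The mask works because 48 is `0011 0000` in binary, so the low 4 bits of the code points 48 to 57 are exactly the values 0 to 9.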
The current placement of the hexadecimal characters (code points 11 through 25) does not allow for this same mechanism of conversion. Instead, the integer value is obtained by subtracting 10 from the code point. This, however, cannot be done for the zero character, which must use the masking method. Additionally, the hexadecimals do not have their own zero character, which means that checking whether a string contains only hexadecimals is more expensive than checking for only decimals: like the decimal check, the hexadecimal check must test whether the character is within 11 to 25, and must additionally test whether it is 48 (decimal 0).
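A sketch of what the current placement implies for conversion and validation, working directly on integer code points. The code points 11 to 25 and 48 are taken from the description above; the function names are hypothetical.

```python
HEX_FIRST, HEX_LAST = 11, 25  # current code points of the hexadecimals 1..F
ZERO = 48                     # the shared zero character (decimal 0)

def hex_value_current(code_point: int) -> int:
    """Integer value of a hexadecimal character under the current placement."""
    if code_point == ZERO:
        return 0  # special case: zero is not part of the 11..25 range
    if HEX_FIRST <= code_point <= HEX_LAST:
        return code_point - 10
    raise ValueError(f"not a hexadecimal code point: {code_point}")

def is_hex_string_current(code_points: list[int]) -> bool:
    """Two tests per character: the 11..25 range and the separate zero."""
    return all(cp == ZERO or HEX_FIRST <= cp <= HEX_LAST for cp in code_points)
```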
This issue proposes reorganising the characters, moving the hexadecimals up to be immediately before the decimals, i.e. 33 through 47, and moving the characters that currently occupy those code points down to where the hexadecimals currently reside.
This solution has the following advantages:
The disadvantages are:
The solution does not address the problem of the zero character having a higher code point than the hexadecimals. With decimals, without any conversion needed, one can compare the value of integers and e.g. sort them. This is not possible with the hexadecimals as the zero is not in order with the rest.
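The proposed placement can be checked with a short sketch. It assumes the hexadecimals 1 to F sit at code points 33 to 47 with the zero still at 48, as proposed above; the function names are hypothetical. Under that assumption the 4-bit mask used for decimals also yields the hexadecimal value, and the ordering problem with the zero shows up when sorting raw code points.

```python
HEX_FIRST, ZERO = 33, 48  # assumed proposed placement: 1..F at 33..47, zero at 48

def hex_value_proposed(code_point: int) -> int:
    """Integer value of a hexadecimal character under the proposed placement."""
    if not HEX_FIRST <= code_point <= ZERO:  # one contiguous range check
        raise ValueError(f"not a hexadecimal code point: {code_point}")
    return code_point & 0x0F  # 33 -> 1, ..., 47 -> 15, 48 -> 0

# Sorting raw code points puts the zero last instead of first.
sorted_code_points = sorted([48, 47, 33])
sorted_values = [hex_value_proposed(cp) for cp in sorted_code_points]
```

Here `sorted_values` comes out as `[1, 15, 0]`, which illustrates the disadvantage described above: the zero does not sort in order with the other hexadecimals.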
Extend the RACP specification to include support for Unicode à la the UTF standards.
We define the current RACP characters as being the Basic Characters and introduce Unicode Characters. When referencing a Unicode code point we use the U+ notation. Likewise, for the basic characters we will use B+ as a reference to a basic character code point.
The basic characters are encoded as previously defined using the 1 byte values 0-127.
The Unicode characters are then encoded using 2-4 bytes, where all the bytes have the highest order bit set to 1. Therefore, we can differentiate between a basic and a Unicode character by only looking at the highest order bit. The Unicode characters are encoded using the following specification:
We define a 'character sequence' as being 1-4 consecutive bytes that encode a basic character or a valid Unicode character. A string of characters is therefore a sequence of character sequences:
Number of bytes | Bits for code point | First code point | Last code point | Capacity | Header | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|---|
1 | 7 | B+0 | B+7F | 128 | 0xxx xxxx | | | |
2 | 11 | U+0 | U+7FF | 2048 | 110x xxxx | 10xx xxxx | | |
3 | 16 | U+800 | U+107FF | 65536 | 1110 xxxx | 10xx xxxx | 10xx xxxx | |
3 | 14 | U+10800 | U+147FF | 16384 | 1111 00xx | 10xx xxxx | 10xx xxxx | |
3 | 14 | U+14800 | U+187FF | 16384 | 1111 01xx | 10xx xxxx | 10xx xxxx | |
3 | 14 | U+18800 | U+1C7FF | 16384 | 1111 10xx | 10xx xxxx | 10xx xxxx | |
4 | 20 | U+1C800 | U+10FFFF | 997376 | 1111 11xx | 10xx xxxx | 10xx xxxx | 10xx xxxx |
The first byte in any character sequence encodes which type the sequence has and is called the header byte. As specified before, if the header byte has its highest order bit set to 0, then the sequence has the length 1 (meaning the sequence is 1 byte long, containing only the header byte) and encodes a basic character.
If a byte starts with '11' then it is the header byte in a sequence. If it starts with '10' it is not the header. In the header, the bits 5 and 6 (where 8 is the highest order bit) define the type of the sequence the header is part of. We'll call those two bits the type bits.
If the highest type bit is 0, the sequence has a length of two and the remaining 11 bits are used to encode the code points U+0 to U+7FF, as we will see below.
If the type bits are '10', the sequence has a length of 3 and the remaining 16 bits are used to encode the code points U+800 to U+107FF.
If the type bits are '11' we define the bits 3 and 4 as the specialization bits. If these bits are '00', '01', or '10' the sequence also has a length of 3 and the remaining 14 bits are used to encode the code points U+10800 to U+1C7FF, with each of the three specialization values covering its own range of 16384 code points.
If the specialization bits are '11' the sequence has a length of 4 and the remaining 20 bits are used to encode the code points U+1C800 to U+10FFFF. There is room for up to U+11C7FF, but since Unicode is restricted to U+10FFFF, so is this encoding.
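The header rules above can be sketched as a function that returns the sequence length from the header byte alone. This is a minimal illustration, not a reference implementation; the name `sequence_length` and the bit masks are derived from the bit numbering used in this proposal (bit 8 is the highest order bit).

```python
def sequence_length(header: int) -> int:
    """Return the length in bytes of the sequence that starts with this header byte."""
    if header & 0x80 == 0:
        return 1                      # 0xxx xxxx: basic character
    if header & 0xC0 == 0x80:
        raise ValueError("10xx xxxx is not a header byte")
    if header & 0x20 == 0:            # highest type bit (bit 6) is 0
        return 2                      # 110x xxxx
    if header & 0x10 == 0:            # type bits (bits 6 and 5) are '10'
        return 3                      # 1110 xxxx
    # Type bits are '11': inspect the specialization bits (bits 4 and 3).
    if header & 0x0C == 0x0C:
        return 4                      # 1111 11xx
    return 3                          # 1111 00xx, 1111 01xx, 1111 10xx
```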
Much like UTF-8, the integer value of a code point is spread out over the sequence bytes. First, the first code point of the sequence type the code point falls into is subtracted from the code point's integer value.
The lowest order bits in the resulting value are inserted, in order, into the remaining bits of the last byte. The next remaining lowest order bits are likewise inserted into the next to last byte, and so on until all the bits are inserted in the unassigned bits of each byte in the sequence. Examples:
Codepoint | Encoded value | u32 bitwise encoded value | encoding |
---|---|---|---|
B+E | 14 | ... 0000 1110 | 0 000 1110 |
U+E | 14 | ... 0000 1110 | 1100 0000 l 10 00 1110 |
U+44C | 1100 | ... 0000 0100 l 0100 1100 | 1101 0001 l 10 00 1100 |
U+270F | 7951 | ... 0001 1111 l 0000 1111 | 1110 0001 l 10 11 1100 l 10 00 1111 |
U+15B38 | 4920 | ... 0001 0011 l 0011 1000 | 1111 01 01 l 10 00 1100 l 10 11 1000 |
U+1B207 | 10759 | ... 0010 1010 l 0000 0111 | 1111 10 10 l 10 10 1000 l 10 00 0111 |
U+10F447 | 994375 | ... 0000 1111 l 0010 1100 l 0100 0111 | 1111 11 11 l 10 11 0010 l 10 11 0001 l 10 00 0111 |
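The encoding steps can be sketched in Python as follows. This is an illustrative encoder for the Unicode characters only, built from the range table of this proposal; the names `RANGES` and `encode_unicode` are assumptions, and the upper bound U+10FFFF is the last Unicode code point.

```python
# (first code point, last code point, header prefix, sequence length)
RANGES = [
    (0x00000, 0x0007FF, 0xC0, 2),  # 110x xxxx
    (0x00800, 0x0107FF, 0xE0, 3),  # 1110 xxxx
    (0x10800, 0x0147FF, 0xF0, 3),  # 1111 00xx
    (0x14800, 0x0187FF, 0xF4, 3),  # 1111 01xx
    (0x18800, 0x01C7FF, 0xF8, 3),  # 1111 10xx
    (0x1C800, 0x10FFFF, 0xFC, 4),  # 1111 11xx
]

def encode_unicode(code_point: int) -> bytes:
    """Encode one Unicode code point as a 2-4 byte character sequence."""
    for first, last, header, length in RANGES:
        if first <= code_point <= last:
            value = code_point - first           # offset within the range
            out = bytearray(length)
            for i in range(length - 1, 0, -1):   # fill from the last byte backwards
                out[i] = 0x80 | (value & 0x3F)   # 10xx xxxx
                value >>= 6
            out[0] = header | value              # remaining bits go in the header
            return bytes(out)
    raise ValueError(f"code point out of range: {code_point:#x}")
```

For example, `encode_unicode(0x270F).hex()` gives `e1bc8f`, matching the U+270F row of the table above.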
The current encoding of the basic RACP characters is also a valid encoding in the proposal.
Receiving an incomplete character sequence can be detected by looking at the first two bits in the first byte. If it's not a valid header or a basic character then the sequence is incomplete and the decoder can report the error. Since the header can be looked at to see how many bytes a sequence should have, receiving too few can also be detected.
The header byte contains all the information about the sequence, meaning a sequence can be instantly decoded without having to look for the header of the next sequence or end of stream.
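A decoder along these lines could look like the sketch below. It decodes a single character sequence starting at a given position and reports the error cases described above. The names `HEADERS` and `decode_at`, and the returned tuple shape, are assumptions made for illustration.

```python
# (header mask, header prefix, first code point, sequence length)
HEADERS = [
    (0xE0, 0xC0, 0x00000, 2),
    (0xF0, 0xE0, 0x00800, 3),
    (0xFC, 0xF0, 0x10800, 3),
    (0xFC, 0xF4, 0x14800, 3),
    (0xFC, 0xF8, 0x18800, 3),
    (0xFC, 0xFC, 0x1C800, 4),
]

def decode_at(data: bytes, pos: int) -> tuple[str, int, int]:
    """Decode the sequence at pos; return (kind, code point, next position)."""
    header = data[pos]
    if header < 0x80:
        return ("B", header, pos + 1)            # basic character, 1 byte
    for mask, prefix, first, length in HEADERS:
        if header & mask == prefix:
            break
    else:
        raise ValueError("byte at pos is not a header (starts with '10')")
    if pos + length > len(data):
        raise ValueError("sequence is truncated")
    value = header & ~mask & 0xFF                # payload bits of the header
    for byte in data[pos + 1:pos + length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a '10xx xxxx' byte")
        value = (value << 6) | (byte & 0x3F)
    code_point = first + value
    if code_point > 0x10FFFF:
        raise ValueError("beyond the last Unicode code point")
    return ("U", code_point, pos + length)
```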
The start of a character sequence can be found by reversing until a header is found. This can be used in jumping searches to ensure that you can always find the start of a character sequence you land on.
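Finding the start of a sequence by reversing can be sketched as below. The helper name `sequence_start` is hypothetical; the only property it relies on is that non-header bytes start with '10'.

```python
def sequence_start(data: bytes, pos: int) -> int:
    """Step backwards from pos until a header or basic character byte is found."""
    while data[pos] & 0xC0 == 0x80:   # 10xx xxxx: not the start of a sequence
        if pos == 0:
            raise ValueError("no header found before the start of the data")
        pos -= 1
    return pos
```

Since a sequence is at most 4 bytes long, the loop steps back at most 3 times on well-formed input.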
A list of encoded Unicode characters can be sorted in code point order by directly sorting the encoded value. Therefore, sorting can be done without having to decode the character. Pseudo-proof:
Unicode point | Encoding | Encoded value |
---|---|---|
0 | 1100 0000 1000 0000 | 49280 |
1023 | 1100 1111 1011 1111 | 53183 |
1024 | 1101 0000 1000 0000 | 53376 |
2047 | 1101 1111 1011 1111 | 57279 |
2048 | 1110 0000 1000 0000 1000 0000 | 14712960 |
67583 | 1110 1111 1011 1111 1011 1111 | 15712191 |
67584 | 1111 0000 1000 0000 1000 0000 | 15761536 |
83967 | 1111 0011 1011 1111 1011 1111 | 15974335 |
83968 | 1111 0100 1000 0000 1000 0000 | 16023680 |
100351 | 1111 0111 1011 1111 1011 1111 | 16236479 |
100352 | 1111 1000 1000 0000 1000 0000 | 16285824 |
116735 | 1111 1011 1011 1111 1011 1111 | 16498623 |
116736 | 1111 1100 1000 0000 1000 0000 1000 0000 | 4236279936 |
1165311 | 1111 1111 1011 1111 1011 1111 1011 1111 | 4290756543 |
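The ordering property can also be spot-checked in code. The sketch below re-declares a compact, illustrative encoder (the names `RANGES` and `encode_unicode` are assumptions) and sorts the range boundaries by their encoded bytes. It stops at U+10FFFF rather than the U+11C7FF shown in the last table row, since that is the last Unicode code point.

```python
# (first code point, last code point, header prefix, sequence length)
RANGES = [
    (0x00000, 0x0007FF, 0xC0, 2),
    (0x00800, 0x0107FF, 0xE0, 3),
    (0x10800, 0x0147FF, 0xF0, 3),
    (0x14800, 0x0187FF, 0xF4, 3),
    (0x18800, 0x01C7FF, 0xF8, 3),
    (0x1C800, 0x10FFFF, 0xFC, 4),
]

def encode_unicode(code_point: int) -> bytes:
    for first, last, header, length in RANGES:
        if first <= code_point <= last:
            value = code_point - first
            out = bytearray(length)
            for i in range(length - 1, 0, -1):
                out[i] = 0x80 | (value & 0x3F)
                value >>= 6
            out[0] = header | value
            return bytes(out)
    raise ValueError(f"code point out of range: {code_point:#x}")

boundaries = [0, 1023, 1024, 2047, 2048, 67583, 67584, 83967,
              83968, 100351, 100352, 116735, 116736, 0x10FFFF]

# Sort by the encoded bytes only, then compare with code point order.
in_byte_order = sorted(reversed(boundaries), key=encode_unicode)
order_preserved = in_byte_order == boundaries
```

Python compares `bytes` lexicographically, which agrees with comparing the encoded integer values here because longer sequences always have larger header bytes.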
Since the basic characters take up all the space in one byte, a Unicode character is encoded as at least 2 bytes. This contrasts with UTF-8, where the ASCII-equivalent characters take only 1 byte. On the other hand, this does allow the basic characters to be only 1 byte, which can be seen as equivalent to UTF-8.
Compared to UTF-8, which has only 4 sequence types, this encoding is more complex, having 5 effective sequence types (only 1 test is needed to identify the specialized 3 byte sequences). This will incur some performance hit regarding encoding/decoding. The increased number of sequence types was chosen to allow almost double the number of code points to be encoded using 3 or fewer bytes. This allows most characters in the supplementary multilingual plane to be 3 bytes, where they need to be 4 bytes in UTF-8. An additional subtraction/addition is needed when encoding/decoding that is not needed by UTF-8.
On the other hand, since UTF-8 requires that a valid encoding use the minimum number of bytes for a given code point, there might be some work there that this encoding does not need to do.
An experimental analysis should be done to measure the impact of encode/decode compared to UTF-8.
Many characters come in pairs: { and }, true and false, a and A.
The current ordering accounts for this, where some pairs' placements are connected:
This is fine, but has one major deficiency: To get from one pair member to the other, you have to know which one you already have. E.g. to get from A to a, you have to add 32. But, to get from a to A you have to subtract 32. You have to figure out which one of a or A you have before you can get the other. Wouldn't it be better if you could get the other without having to know which one you already have?
This is possible if the members of a pair were 64 places apart. Say A was at position 4 and a at position 68. To get from A to a, we could: (4+64)%128 which is 68. To get from a to A we use: (68+64)%128, which is 4. Therefore, we have the formula: q = (p + 64)%128, where p is the character you have, and q is its pair member you are looking for. Using this formula, you don't have to know which pair member you have to get the other.
This issue therefore proposes to reorder the characters of RACP so that all pairs conform to this formula, i.e. are 64 places apart.
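The formula is small enough to sketch directly. The function name `pair_of` is illustrative, and the positions 4 and 68 are the example positions used above, not actual RACP assignments.

```python
def pair_of(p: int) -> int:
    """Return the code point of the other pair member: q = (p + 64) % 128."""
    if not 0 <= p < 128:
        raise ValueError(f"not a basic character code point: {p}")
    return (p + 64) % 128

# For 7-bit code points, adding 64 modulo 128 is the same as flipping bit 7
# (value 64), so the mapping is its own inverse.
flips_one_bit = all(pair_of(p) == p ^ 0x40 for p in range(128))
```

Applying `pair_of` twice returns the original code point, which is why the caller never has to know which pair member it started with.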