A variable-length unsigned integer.
The bytes of a FlexUInt are written in little-endian byte order. This means that the first bytes will contain the FlexUInt's least significant bits.
The least significant bits in the FlexUInt indicate the number of bytes that were used to encode the integer. If a FlexUInt is N bytes long, its N-1 least significant bits will be 0; a terminal 1 bit will be in the next most significant position.
All bits that are more significant than the terminal 1 represent the magnitude of the FlexUInt.
FlexUInt encoding of 14
              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
 unsigned int 14
FlexUInt encoding of 729
             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0   0 0 0 0 1 0 1 1
└────┬────┘       └──────┬──────┘
lowest 6 bits     highest 8 bits
of the unsigned   of the unsigned
   integer            integer
FlexUInt encoding of 21,043
            ┌───── There are 2 zeros in the least significant bits, so this
            │      integer is three bytes wide.
          ┌─┴─┐
1 0 0 1 1 1 0 0   1 0 0 1 0 0 0 1   0 0 0 0 0 0 1 0
└───┬───┘         └──────┬──────┘   └──────┬──────┘
lowest 5 bits     next 8 bits of    highest 8 bits
of the unsigned   the unsigned      of the unsigned
   integer           integer            integer
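The rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function names `encode_flex_uint` and `decode_flex_uint` are hypothetical and do not come from the specification.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    # Each encoded byte carries 7 bits of magnitude; the remaining bit per
    # byte is spent on the length marker (N-1 zeros, then a terminal 1).
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def decode_flex_uint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the FlexUInt at data[offset:]."""
    num_bytes = 1
    pos = offset
    while data[pos] == 0:          # a zero byte contributes 8 leading zero bits
        num_bytes += 8
        pos += 1
    lowest_set = data[pos] & -data[pos]
    num_bytes += lowest_set.bit_length() - 1   # zero bits before the terminal 1
    if offset + num_bytes > len(data):
        raise ValueError("truncated FlexUInt")
    raw = int.from_bytes(data[offset:offset + num_bytes], "little")
    return raw >> num_bytes, num_bytes
```

Counting the trailing zero bits before reading the payload lets a reader fetch the whole integer in one step instead of looping over continuation bits.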
A variable-length signed integer.
From an encoding perspective, FlexInts are structurally similar to a FlexUInt (described above). Both encode their bytes using little-endian byte order, and both use the count of least-significant zero bits to indicate how many bytes were used to encode the integer. They differ in the interpretation of their bits; while a FlexUInt's bits are unsigned, a FlexInt's bits are encoded using two’s complement notation.
Tip
|
An implementation could choose to read a FlexInt by instead reading a FlexUInt and then reinterpreting its bits
as two’s complement.
|
FlexInt encoding of 14
              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
 2's comp. 14
FlexInt encoding of -14
              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
1 1 1 0 0 1 0 1
└─────┬─────┘
 2's comp. -14
FlexInt encoding of 729
             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0   0 0 0 0 1 0 1 1
└────┬────┘       └──────┬──────┘
lowest 6 bits     highest 8 bits
of the 2's        of the 2's
comp. integer     comp. integer
FlexInt encoding of -729
             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
1 0 0 1 1 1 1 0   1 1 1 1 0 1 0 0
└────┬────┘       └──────┬──────┘
lowest 6 bits     highest 8 bits
of the 2's        of the 2's
comp. integer     comp. integer
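As a rough sketch of the description above, a FlexInt can be written by masking the value to a 7*N-bit two's complement payload and read by decoding the bytes as an unsigned payload and then reinterpreting the sign bit. The function names are hypothetical.

```python
def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (hypothetical helper)."""
    # An N-byte FlexInt holds a (7 * N)-bit two's complement payload.
    num_bytes = 1
    while not -(1 << (7 * num_bytes - 1)) <= value < (1 << (7 * num_bytes - 1)):
        num_bytes += 1
    payload = value & ((1 << (7 * num_bytes)) - 1)   # two's complement bits
    marked = (payload << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def decode_flex_int(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the FlexInt at data[offset:]."""
    num_bytes = 1
    pos = offset
    while data[pos] == 0:
        num_bytes += 8
        pos += 1
    num_bytes += (data[pos] & -data[pos]).bit_length() - 1
    if offset + num_bytes > len(data):
        raise ValueError("truncated FlexInt")
    raw = int.from_bytes(data[offset:offset + num_bytes], "little")
    payload = raw >> num_bytes                # same bits a FlexUInt would yield
    payload_bits = 7 * num_bytes
    if payload & (1 << (payload_bits - 1)):   # sign bit set: reinterpret
        payload -= 1 << payload_bits
    return payload, num_bytes
```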
A fixed-width, little-endian, unsigned integer whose length is inferred from the context in which it appears.
FixedUInt
encoding of 3,954,261
0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ lowest 8 bits next 8 bits of highest 8 bits of the unsigned the unsigned of the unsigned integer integer integer
A fixed-width, little-endian, signed integer whose length is known from the context in which it appears. Its bytes are interpreted as two’s complement.
FixedInt
encoding of -3,954,261
1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 1 └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ lowest 8 bits next 8 bits of highest 8 bits of the 2's the 2's comp. of the 2's comp. comp. integer integer integer
A variable-length symbol token whose UTF-8 bytes can be inline, found in the symbol table, or derived from a macro expansion.
A FlexSym begins with a FlexInt; once this integer has been read, we can evaluate it to determine how to proceed. If the FlexInt is:
- greater than zero, it represents a symbol ID. The symbol’s associated text can be found in the local symbol table. No more bytes follow.
- less than zero, its absolute value represents a number of UTF-8 bytes that follow the FlexInt. These bytes represent the symbol’s text.
- exactly zero, another byte follows that is an opcode. The FlexSym parser is not responsible for evaluating this opcode, only returning it; the caller will decide whether the opcode is legal in the current context. Example usages of the opcode include:
  - Representing SID $0 as 0xA0.
  - Representing the empty string ("") as 0x90.
  - When used to encode a struct field name, the opcode can invoke a macro that will evaluate to a struct whose key/value pairs are spliced into the parent struct (TODO: Link).
  - In a delimited struct, terminating the sequence of (field name, value) pairs with 0xF0.
FlexSym encoding of symbol ID $10
              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 1 0 1 0 1
└─────┬─────┘
  2's comp.
 positive 10
FlexSym encoding of symbol text 'hello'
              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │      h        e        l        l        o
1 1 1 1 0 1 1 1   01101000 01100101 01101100 01101100 01101111
└─────┬─────┘     └─────────────────────┬────────────────────┘
  2's comp.              5-byte UTF-8 encoded "hello"
 negative 5
FlexSym encoding of '' (empty text) using an opcode
              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 0 0 0 0 1   10010000
└─────┬─────┘     └───┬──┘
  2's comp.      opcode 0x90:
    zero         empty symbol
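The three-way branch on the leading FlexInt might be implemented along the following lines. This is a sketch under stated assumptions: `read_flex_int` and `read_flex_sym` are hypothetical names, and the `(kind, payload, bytes_consumed)` return shape is an illustration choice, not something the specification prescribes.

```python
def read_flex_int(data: bytes, offset: int) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the FlexInt at data[offset:]."""
    num_bytes = 1
    pos = offset
    while data[pos] == 0:
        num_bytes += 8
        pos += 1
    num_bytes += (data[pos] & -data[pos]).bit_length() - 1
    raw = int.from_bytes(data[offset:offset + num_bytes], "little") >> num_bytes
    if raw & (1 << (7 * num_bytes - 1)):      # two's complement sign bit
        raw -= 1 << (7 * num_bytes)
    return raw, num_bytes


def read_flex_sym(data: bytes, offset: int = 0):
    """Return (kind, payload, bytes_consumed) for the FlexSym at data[offset:]."""
    value, used = read_flex_int(data, offset)
    if value > 0:
        # Positive: a symbol ID to be resolved against the symbol table.
        return ("symbol_id", value, used)
    if value < 0:
        # Negative: abs(value) UTF-8 bytes of inline text follow.
        start = offset + used
        text = data[start:start - value].decode("utf-8")
        return ("text", text, used - value)
    # Zero: hand the next byte back as an opcode; the caller decides
    # whether that opcode is legal in its context.
    return ("opcode", data[offset + used], used + 1)
```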
Note
|
From this point on in the document, example encodings are given in hexadecimal notation. |
An opcode is a 1-byte FixedUInt
that tells the reader what the next expression represents
and how the bytes that follow should be interpreted.
The meanings of each opcode are organized loosely by their high and low nibbles.
High nibble | Low nibble | Meaning |
---|---|---|
|
|
E-expression with the address in the opcode |
|
|
E-expression with the address as a trailing 1-byte |
|
|
E-expression with the address as a trailing 2-byte |
|
|
Integers up to 8 bytes wide |
|
Reserved |
|
|
Floats |
|
|
Booleans |
|
|
|
Decimals |
|
|
Timestamps |
|
Reserved |
|
|
|
Strings |
|
|
Symbols with inline text |
|
|
Lists |
|
|
S-expressions |
|
|
Empty struct |
|
Reserved |
|
|
Structs with symbol address field names |
|
|
|
Ion version marker |
|
Symbols with symbol address |
|
|
Annotations with symbol address |
|
|
Annotations with |
|
|
|
|
|
Typed nulls |
|
|
NOP |
|
|
E-expression with a variable-width address |
|
|
System macro invocation |
|
|
|
Delimited container end |
|
Delimited list start |
|
|
Delimited S-expression start |
|
|
Delimited struct with |
|
|
Reserved |
|
|
Variable length prefixed macro invocation |
|
|
Variable length integer |
|
|
Variable length decimal |
|
|
Variable length, long-form timestamp |
|
|
Variable length string |
|
|
Variable length symbol encoded as |
|
|
Variable length list |
|
|
Variable length S-expression |
|
|
Variable length struct with symbol address field names |
|
|
Variable length blob |
|
|
Variable length clob |
The encoding of E-expressions is designed to balance density and generality. For example, they enable encodings with minimal tag bits, even none at all given a thoughtful signature. This increases density, but limits generality at the point of macro invocation.
The text and binary forms of E-expressions enforce the same syntactic constraints on the type and range of data allowed as arguments. Any syntactically well-formed E-expression can be transcoded between text and binary, without expansion and without changing semantics, and independent of whether it can be expanded successfully.
If the value of the opcode is less than 64 (0x40), it represents an E-expression invoking the macro at the corresponding address, which is an offset within the local macro table.
7
┌──── Opcode in 00-3F range indicates an e-expression │ where the opcode value is the macro address │ 07 └── FixedUInt 7
31
┌──── Opcode in 00-3F range indicates an e-expression │ where the opcode value is the macro address │ 1F └── FixedUInt 31
Note that the opcode alone tells us which macro is being invoked, but it does not supply enough information for the reader to parse any arguments that may follow. The parsing of arguments is described in detail in the section Macro calling conventions. (TODO: Link)
While E-expressions invoking macro addresses in the range [0, 63] can be encoded in a single byte using E-expressions with the address in the opcode, many applications will benefit from defining more than 64 macros.
The 0x4_ and 0x5_ opcodes can be used to represent over 1 million macro addresses.
If the high nibble of the opcode is 0x4_, then a biased address follows as a 1-byte FixedUInt.
If the high nibble of the opcode is 0x5_, then a biased address follows as a 2-byte FixedUInt.
In both cases, the address is biased by the total number of addresses with lower opcodes.
For 0x4_, the bias is 256 * low_nibble + 64 (or (low_nibble shift-left 8) + 64).
For 0x5_, the bias is 65536 * low_nibble + 4160.
841
┌──── Opcode in range 40-4F indicates a macro address with 1-byte FixedUInt address │┌─── Low nibble 3 indicates bias of 832 ││ 43 09 │ └─── FixedUInt 9 Biased Address : 9 Bias : 832 Address : 841
142,918
┌──── Opcode in range 50-5F indicates a macro address with 2-byte FixedUInt address │┌─── Low nibble 2 indicates bias of 135232 ││ 52 06 1E └─┬─┘ └─── FixedUInt 7686 Biased Address : 7686 Bias : 135232 Address : 142918
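The bias arithmetic can be sketched as a small encoder that picks the shortest available form for a macro address. The text does not state that writers must choose the shortest form, so that policy is an assumption of this sketch, as are the function names.

```python
def encode_flex_uint(value: int) -> bytes:
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def encode_macro_address(address: int) -> bytes:
    """Encode the opcode and address bytes of an E-expression (no arguments)."""
    if address < 0:
        raise ValueError("macro addresses are non-negative")
    if address < 64:
        # 0x00-0x3F: the opcode itself is the address.
        return bytes([address])
    if address < 4160:
        # 0x4_: bias is 256 * low_nibble + 64, then a 1-byte FixedUInt.
        low_nibble, biased = divmod(address - 64, 256)
        return bytes([0x40 | low_nibble, biased])
    if address < 4160 + 16 * 65536:
        # 0x5_: bias is 65536 * low_nibble + 4160, then a 2-byte FixedUInt.
        low_nibble, biased = divmod(address - 4160, 65536)
        return bytes([0x50 | low_nibble]) + biased.to_bytes(2, "little")
    # Anything larger falls back to 0xEE with an unbiased FlexUInt address.
    return b"\xEE" + encode_flex_uint(address)
```

`divmod` splits the offset into the low nibble (which selects the bias) and the trailing FixedUInt in one step.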
0x4_
and 0x5_
opcodes
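Low Nibble | 0x4_ Bias | 0x5_ Bias |
---|---|---|
0 | 64 | 4160 |
1 | 320 | 69696 |
2 | 576 | 135232 |
3 | 832 | 200768 |
4 | 1088 | 266304 |
5 | 1344 | 331840 |
6 | 1600 | 397376 |
7 | 1856 | 462912 |
8 | 2112 | 528448 |
9 | 2368 | 593984 |
A | 2624 | 659520 |
B | 2880 | 725056 |
C | 3136 | 790592 |
D | 3392 | 856128 |
E | 3648 | 921664 |
F | 3904 | 987200 |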
Because the address is encoded using a FlexUInt, there is no (theoretical) limit to the number of addresses that can be invoked. However, larger addresses require more bytes to encode.
When using the 0xEE opcode, the address is unbiased; the 0xEE opcode can be used for any macro address.
0
┌──── Opcode EE indicates a macro address as trailing FlexUInt │ ┌─── FlexUInt 0 │ │ EE 01
2,097,151
┌──── Opcode EE indicates a macro address as trailing FlexUInt │ ┌─── FlexUInt 2097151 │ │ EE FC FF FF
0x6E represents boolean true, while 0x6F represents boolean false.
0xEB 0x00 represents null.bool.
true
6E
false
6F
null.bool
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: boolean │ │ EB 00
Opcodes in the range 0x60 to 0x68 represent an integer. The opcode is followed by a FixedInt that represents the integer value. The low nibble of the opcode (0x_0 to 0x_8) indicates the size of the FixedInt.
Opcode 0x60 represents integer 0; no more bytes follow.
Integers that require more than 8 bytes are encoded using the variable-length integer opcode 0xF6, followed by a FlexUInt indicating how many bytes of representation data follow.
0xEB 0x01 represents null.int.
0
┌──── Opcode in 60-68 range indicates integer │┌─── Low nibble 0 indicates ││ no more bytes follow. 60
17
┌──── Opcode in 60-68 range indicates integer │┌─── Low nibble 1 indicates ││ a single byte follows. 61 11 └── FixedInt 17
-944
┌──── Opcode in 60-68 range indicates integer │┌─── Low nibble 2 indicates ││ that two bytes follow. 62 50 FC └─┬─┘ FixedInt -944
-944
┌──── Opcode F6 indicates a variable-length integer, FlexUInt length follows │ ┌─── FlexUInt 2; a 2-byte FixedInt follows │ │ F6 05 50 FC └─┬─┘ FixedInt -944
null.int
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: integer │ │ EB 01
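A writer following these rules might look like the sketch below. It always chooses the narrowest FixedInt and only falls back to 0xF6 when more than 8 bytes are needed; that policy and the function names are assumptions made for illustration.

```python
def encode_flex_uint(value: int) -> bytes:
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def encode_int(value: int) -> bytes:
    """Encode an Ion integer value with its opcode."""
    if value == 0:
        return b"\x60"                      # zero needs no body bytes
    # Find the smallest two's complement width (in bytes) that holds value.
    width = 1
    while not -(1 << (8 * width - 1)) <= value < (1 << (8 * width - 1)):
        width += 1
    body = value.to_bytes(width, "little", signed=True)
    if width <= 8:
        return bytes([0x60 + width]) + body  # 0x61-0x68
    # More than 8 bytes: 0xF6, a FlexUInt length, then the FixedInt body.
    return b"\xF6" + encode_flex_uint(width) + body
```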
Float values are encoded using the IEEE-754 specification in little-endian byte order. Floats can be serialized in four sizes:
-
0 bits (0 bytes), representing the value 0e0 and indicated by opcode
0x6A
-
16 bits (2 bytes in little-endian order, half precision), indicated by opcode
0x6B
-
32 bits (4 bytes in little-endian order, single precision), indicated by opcode
0x6C
-
64 bits (8 bytes in little-endian order, double precision), indicated by opcode
0x6D
Note that in the Ion data model, float values are always 64 bits. However, if a value can be losslessly serialized in fewer than 64 bits, Ion implementations may choose to do so.
0xEB 0x02
represents null.float
.
0e0
┌──── Opcode in range 6A-6D indicates a float │┌─── Low nibble A indicates ││ a 0-length float; 0e0 6A
3.14e0
┌──── Opcode in range 6A-6D indicates a float │┌─── Low nibble B indicates a 2-byte float ││ 6B 47 42 └─┬─┘ half-precision 3.14
3.1415927e0
┌──── Opcode in range 6A-6D indicates a float │┌─── Low nibble C indicates a 4-byte, ││ single-precision value. 6C DB 0F 49 40 └────┬────┘ single-precision 3.1415927
3.141592653589793e0
┌──── Opcode in range 6A-6D indicates a float │┌─── Low nibble D indicates an 8-byte, ││ double-precision value. 6D 18 2D 44 54 FB 21 09 40 └──────────┬──────────┘ double-precision 3.141592653589793
null.float
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: float │ │ EB 02
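One way a writer could pick the smallest lossless float size is to pack the value at each width and check that it round-trips. The sketch below uses Python's `struct` module, whose `e`, `f`, and `d` format characters correspond to IEEE-754 half, single, and double precision; the function name and the round-trip strategy are assumptions of this example.

```python
import math
import struct


def encode_float(value: float) -> bytes:
    """Encode a float using the narrowest width that preserves it exactly."""
    if value == 0.0 and math.copysign(1.0, value) > 0:
        return b"\x6A"                       # positive zero: no body bytes
    for opcode, fmt in ((0x6B, "<e"), (0x6C, "<f")):
        try:
            packed = struct.pack(fmt, value)
        except OverflowError:
            continue                         # magnitude too large for this width
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([opcode]) + packed  # lossless at this width
    # Fallback: 8-byte double. NaN also lands here because NaN != NaN.
    return b"\x6D" + struct.pack("<d", value)
```

Negative zero is deliberately not mapped to 0x6A, since that opcode is described as representing 0e0.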
If an opcode has a high nibble of 0x7_, it represents a decimal. Low nibble values indicate the number of trailing bytes used to encode the decimal.
The body of the decimal is encoded as a FlexInt representing its exponent, followed by a FixedInt representing its coefficient. The width of the coefficient is the total length of the decimal encoding minus the length of the exponent. It is possible for the coefficient to have a width of zero, indicating a coefficient of 0. When the coefficient is present but has a value of 0, the coefficient is -0.
Decimal values that require more than 15 bytes can be encoded using the variable-length decimal opcode: 0xF7.
0xEB 0x03 represents null.decimal.
0d0
┌──── Opcode in range 70-7F indicates a decimal │┌─── Low nibble 0 indicates a zero-byte ││ decimal; 0d0 70
7d0
┌──── Opcode in range 70-7F indicates a decimal │┌─── Low nibble 2 indicates a 2-byte decimal ││ 72 01 07 | └─── Coefficient: 1-byte FixedInt 7 └─── Exponent: FlexInt 0
1.27
┌──── Opcode in range 70-7F indicates a decimal │┌─── Low nibble 2 indicates a 2-byte decimal ││ 72 FD 7F | └─── Coefficient: FixedInt 127 └─── Exponent: 1-byte FlexInt -2
1.27
┌──── Opcode F7 indicates a variable-length decimal │ F7 05 FD 7F | | └─── Coefficient: FixedInt 127 | └───── Exponent: 1-byte FlexInt -2 └─────── Decimal length: FlexUInt 2
0d3
, which has a coefficient of zero┌──── Opcode in range 70-7F indicates a decimal │┌─── Low nibble 1 indicates a 1-byte decimal ││ 71 07 └────── Exponent: FlexInt 3; no more bytes follow, so the coefficient is implicitly 0
-0d3
, which has a coefficient of negative zero┌──── Opcode in range 70-7F indicates a decimal │┌─── Low nibble 2 indicates a 2-byte decimal ││ 72 07 00 | └─── Coefficient: 1-byte FixedInt 0, indicating a coefficient of -0 └────── Exponent: FlexInt 3
null.decimal
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: decimal │ │ EB 03
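The decimal layout, including the two zero-coefficient cases, can be sketched as follows. The function names and the `negative_zero` flag are hypothetical; Python integers have no negative zero, so the flag stands in for it.

```python
def _flex(payload: int, num_bytes: int) -> bytes:
    # Shared FlexUInt/FlexInt framing: payload bits above the length marker.
    return ((payload << num_bytes) | (1 << (num_bytes - 1))).to_bytes(num_bytes, "little")


def encode_flex_uint(value: int) -> bytes:
    return _flex(value, max(1, -(-value.bit_length() // 7)))


def encode_flex_int(value: int) -> bytes:
    n = 1
    while not -(1 << (7 * n - 1)) <= value < (1 << (7 * n - 1)):
        n += 1
    return _flex(value & ((1 << (7 * n)) - 1), n)


def encode_decimal(coefficient: int, exponent: int, negative_zero: bool = False) -> bytes:
    """Encode coefficient * 10**exponent as an Ion decimal."""
    if coefficient == 0 and exponent == 0 and not negative_zero:
        return b"\x70"                        # 0d0: zero-byte body
    body = encode_flex_int(exponent)
    if coefficient != 0:
        width = 1
        while not -(1 << (8 * width - 1)) <= coefficient < (1 << (8 * width - 1)):
            width += 1
        body += coefficient.to_bytes(width, "little", signed=True)
    elif negative_zero:
        body += b"\x00"                       # explicit zero byte means -0
    # A positive-zero coefficient is implied by omitting the coefficient bytes.
    if len(body) <= 15:
        return bytes([0x70 + len(body)]) + body
    return b"\xF7" + encode_flex_uint(len(body)) + body
```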
Note
|
In Ion 1.0, text timestamp fields were encoded using the local time while binary timestamp fields were encoded using UTC time. This required applications to perform conversion logic when transcribing from one format to the other. In Ion 1.1, all binary timestamp fields are encoded in local time. |
Timestamps have two encodings:
- Short-form timestamps
-
A compact representation optimized for the most commonly used precisions and date ranges.
- Long-form timestamps
-
A less compact representation capable of representing any timestamp in the Ion data model.
0xEB 0x04 represents null.timestamp.
null.timestamp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: timestamp │ │ EB 04
If an opcode has a high nibble of 0x8_
, it represents a short-form timestamp. This encoding focuses on making the
most common timestamp precisions and ranges the most compact; less common precisions can still be expressed via
the variable-length long form timestamp encoding.
Timestamps may be encoded using the short form if they meet all of the following conditions:
- The year is between 1970 and 2097.
  The year subfield is encoded as the number of years since 1970. 7 bits are dedicated to representing the biased year, allowing timestamps through the year 2097 to be encoded in this form.
- The local offset is either UTC, unknown, or falls between -14:00 and +14:00 and is divisible by 15 minutes.
  7 bits are dedicated to representing the local offset as the number of quarter hours from -56 (that is: offset -14:00). The value 0b1111111 indicates an unknown offset. At the time of this writing (2023-05T), all real-world offsets fall between -12:00 and +14:00 and are multiples of 15 minutes.
- The fractional seconds are a common precision.
  The timestamp’s fractional second precision (if present) is either 3 digits (milliseconds), 6 digits (microseconds), or 9 digits (nanoseconds).
Each opcode with a high nibble of 0x8_
indicates a different precision and offset encoding pair.
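Opcode | Precision | Serialized size in bytes* | Offset encoding |
---|---|---|---|
0x80 | Year | 1 | Implicitly Unknown offset |
0x81 | Month | 2 | Implicitly Unknown offset |
0x82 | Day | 2 | Implicitly Unknown offset |
0x83 | Hour and minutes | 4 | 1 bit to indicate UTC or Unknown Offset |
0x84 | Seconds | 5 | 1 bit to indicate UTC or Unknown Offset |
0x85 | Milliseconds | 6 | 1 bit to indicate UTC or Unknown Offset |
0x86 | Microseconds | 7 | 1 bit to indicate UTC or Unknown Offset |
0x87 | Nanoseconds | 8 | 1 bit to indicate UTC or Unknown Offset |
0x88 | Hour and minutes | 5 | 7 bits to represent a known offset. |
0x89 | Seconds | 5 | 7 bits to represent a known offset. |
0x8A | Milliseconds | 7 | 7 bits to represent a known offset. |
0x8B | Microseconds | 8 | 7 bits to represent a known offset. |
0x8C | Nanoseconds | 9 | 7 bits to represent a known offset. |
0x8D | Reserved | | |
0x8E | Reserved | | |
0x8F | Reserved | | |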
* Serialized size in bytes does not include the opcode.
The body of a short-form timestamp is encoded as a FixedUInt
of the size specified by the opcode. This integer is
then partitioned into bit-fields representing the timestamp’s subfields. Note that endianness does not apply here because the
bit-fields are defined over the body interpreted as an integer.
The following letters are used to denote bits in each subfield in the diagrams that follow. Subfields occur in the same order in all encoding variants and consume the same number of bits, with the exception of the fractional bits, which consume only enough bits to represent the fractional precision supported by the opcode being used.
The Month
and Day
subfields are one-based; 0
is not a valid month or day.
Letter code | Number of bits | Subfield |
---|---|---|
Y | 7 | Year |
M | 4 | Month |
D | 5 | Day |
H | 5 | Hour |
m | 6 | Minute |
o | 7 | Offset |
U | 1 | Unknown or UTC offset |
s | 6 | Second |
f | 10 (ms), 20 (μs), 30 (ns) | Fractional second |
. | n/a | Unused |
We will denote the timestamp encoding as follows with each byte ordered vertically from top to bottom. The respective bits are denoted using the letter codes defined in the table above.
           7       0 <--- bit position
           |       |
          +=========+
byte 0    |  0xNN   |  <-- hex notation for constants like opcodes
          +=========+  <-- boundary between encoding primitives (e.g., opcode/`FlexUInt`)
     1    |nnnn:nnnn|  <-- bits denoted with a `:` as a delimiter to aid in reading
          +---------+  <-- octet boundary within an encoding primitive
              ...
          +---------+
     N    |nnnn:nnnn|
          +=========+
The bytes are read from top to bottom (least significant to most significant), while the bits within each byte should be read from right to left (also least significant to most significant.)
Note
|
While this encoding may complicate human reading, it guarantees that the timestamp’s subfields (year, month, etc.) occupy the same contiguous bit indexes regardless of how many bytes there are overall. (The last subfield, fractional_seconds, always begins at the same bit index when present, but can vary in length according to the precision.) This arrangement allows processors to read the little-endian bytes into an integer and then mask the appropriate bit ranges to access the subfields.
|
+=========+ byte 0 | 0x80 | +=========+ 1 |.YYY:YYYY| +=========+
+=========+ byte 0 | 0x81 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |....:.MMM| +=========+
+=========+ byte 0 | 0x82 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +=========+
+=========+ byte 0 | 0x83 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |....:Ummm| +=========+
+=========+ byte 0 | 0x84 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |ssss:Ummm| +---------+ 5 |....:..ss| +=========+
+=========+ byte 0 | 0x85 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |ssss:Ummm| +---------+ 5 |ffff:ffss| +---------+ 6 |....:ffff| +=========+
+=========+ byte 0 | 0x86 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |ssss:Ummm| +---------+ 5 |ffff:ffss| +---------+ 6 |ffff:ffff| +---------+ 7 |..ff:ffff| +=========+
+=========+ byte 0 | 0x87 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |ssss:Ummm| +---------+ 5 |ffff:ffss| +---------+ 6 |ffff:ffff| +---------+ 7 |ffff:ffff| +---------+ 8 |ffff:ffff| +=========+
+=========+ byte 0 | 0x88 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |oooo:ommm| +---------+ 5 |....:..oo| +=========+
+=========+ byte 0 | 0x89 | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |oooo:ommm| +---------+ 5 |ssss:ssoo| +=========+
+=========+ byte 0 | 0x8A | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |oooo:ommm| +---------+ 5 |ssss:ssoo| +---------+ 6 |ffff:ffff| +---------+ 7 |....:..ff| +=========+
+=========+ byte 0 | 0x8B | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |oooo:ommm| +---------+ 5 |ssss:ssoo| +---------+ 6 |ffff:ffff| +---------+ 7 |ffff:ffff| +---------+ 8 |....:ffff| +=========+
+=========+ byte 0 | 0x8C | +=========+ 1 |MYYY:YYYY| +---------+ 2 |DDDD:DMMM| +---------+ 3 |mmmH:HHHH| +---------+ 4 |oooo:ommm| +---------+ 5 |ssss:ssoo| +---------+ 6 |ffff:ffff| +---------+ 7 |ffff:ffff| +---------+ 8 |ffff:ffff| +---------+ 9 |..ff:ffff| +=========+
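The "read into an integer, then mask" approach described above might look like the following sketch. The bit positions are read off the layouts shown for opcodes 0x80 through 0x8C; the function name, the dictionary keys, and the choice to return the raw `U` bit without interpreting it are assumptions of this example.

```python
# Body sizes in bytes (opcode excluded), as listed for each short-form opcode.
BODY_SIZES = {0x80: 1, 0x81: 2, 0x82: 2, 0x83: 4, 0x84: 5, 0x85: 6, 0x86: 7,
              0x87: 8, 0x88: 5, 0x89: 5, 0x8A: 7, 0x8B: 8, 0x8C: 9}


def decode_short_timestamp(opcode: int, body: bytes) -> dict:
    """Split a short-form timestamp body into its subfields."""
    if opcode not in BODY_SIZES or len(body) != BODY_SIZES[opcode]:
        raise ValueError("not a valid short-form timestamp")
    bits = int.from_bytes(body, "little")

    def field(shift: int, width: int) -> int:
        return (bits >> shift) & ((1 << width) - 1)

    result = {"year": 1970 + field(0, 7)}     # year is biased by 1970
    if opcode >= 0x81:
        result["month"] = field(7, 4)
    if opcode >= 0x82:
        result["day"] = field(11, 5)
    if opcode < 0x83:
        return result
    result["hour"] = field(16, 5)
    result["minute"] = field(21, 6)
    if opcode <= 0x87:
        result["u_bit"] = field(27, 1)        # raw UTC/unknown flag
        next_bit, step = 28, opcode - 0x83
    else:
        raw = field(27, 7)                    # quarter hours counted from -56
        result["offset_minutes"] = None if raw == 0b1111111 else (raw - 56) * 15
        next_bit, step = 34, opcode - 0x88
    if step >= 1:
        result["second"] = field(next_bit, 6)
    if step >= 2:
        result["fraction_digits"] = 3 * (step - 1)
        result["fraction"] = field(next_bit + 6, 10 * (step - 1))
    return result
```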
Text | Binary |
---|---|
2023T |
|
2023-10-15T |
|
2023-10-15T11:22:33Z |
|
2023-10-15T11:22:33-00:00 |
|
2023-10-15T11:22:33+01:15 |
|
2023-10-15T11:22:33.444555666+01:15 |
|
Warning
|
Opcodes 0x8D , 0x8E , and 0x8F are illegal; they are reserved for future use.
|
Unlike the short-form timestamp encoding, which covers only the most commonly used timestamp ranges and precisions, the long-form timestamp encoding is capable of representing any valid timestamp.
The long form begins with opcode 0xF8
. A FlexUInt
follows indicating the number
of bytes that were needed to represent the timestamp. The encoding consumes the minimum number
of bytes required to represent the timestamp. The declared length can be mapped to the timestamp’s
precision as follows:
Length | Corresponding precision |
---|---|
0 | Illegal |
1 | Illegal |
2 | Year |
3 | Month or Day (see below) |
4 | Illegal; the hour cannot be specified without also specifying minutes |
5 | Illegal |
6 | Minutes |
7 | Seconds |
8 or more | Fractional seconds |
Unlike the short-form encoding, the long-form encoding reserves:
- 14 bits for the year (Y), which is not biased.
- 12 bits for the offset, which counts the number of minutes (not quarter-hours) from -1440 (that is: -24:00). An offset value of 0b111111111111 indicates an unknown offset.
Similar to short-form timestamps, with the exception of representing the fractional seconds, the components of the
timestamp are encoded as bit-fields on a FixedUInt
that corresponds to the length that followed the opcode.
If the timestamp’s overall length is greater than or equal to 8
, the FixedUInt
part of the timestamp is 7
bytes
and the remaining bytes are used to encode fractional seconds. The fractional seconds are encoded as a
(scale, coefficient)
pair, which is similar to a decimal. The primary difference is that the scale
represents a negative exponent because it is illegal for the fractional seconds value to be greater than or equal to
1.0
or less than 0.0
. The scale is encoded as a FlexUInt
(instead of FlexInt
) to discourage the
encoding of decimal numbers greater than 1.0
. The coefficient is encoded as a FixedUInt
(instead of FixedInt
) to
prevent the encoding of fractional seconds less than 0.0
. Note that validation is still required; namely:
- A scale value of 0 is illegal, as the fractional seconds would then be a whole number of seconds rather than a value below 1.0.
- If coefficient * 10^-scale >= 1.0, that (scale, coefficient) pair is illegal.
If the timestamp’s length is 3
, the precision is determined by inspecting the day (DDDDD
) bits. Like the short-form,
the Month
and Day
subfields are one-based (0
is not a valid month or day). If the day subfield is zero, that
indicates month precision. If the day subfield is any non-zero number, that indicates day precision.
          +=========+
byte 0    |YYYY:YYYY|
          +=========+
     1    |MMYY:YYYY|
          +---------+
     2    |HDDD:DDMM|
          +---------+
     3    |mmmm:HHHH|
          +---------+
     4    |oooo:oomm|
          +---------+
     5    |ssoo:oooo|
          +---------+
     6    |....:ssss|
          +=========+
     7    |FlexUInt |  <-- scale of the fractional seconds
          +---------+
              ...
          +=========+
     N    |FixedUInt|  <-- coefficient of the fractional seconds
          +---------+
              ...
Text | Binary |
---|---|
1947T |
|
1947-12T |
|
1947-12-23T |
|
1947-12-23T11:22:33-00:00 |
|
1947-12-23T11:22:33+01:15 |
|
1947-12-23T11:22:33.127+01:15 |
|
If the high nibble of the opcode is 0x9_, it represents a string. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0x90 represents a string with empty text ("").
Strings longer than 15 bytes can be encoded with the 0xF9 opcode, which takes a FlexUInt-encoded length after the opcode.
0xEB 0x05 represents null.string.
""
┌──── Opcode in range 90-9F indicates a string │┌─── Low nibble 0 indicates that no UTF-8 bytes follow 90
┌──── Opcode in range 90-9F indicates a string │┌─── Low nibble E indicates that 14 UTF-8 bytes follow ││ f o u r t e e n b y t e s 9E 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73 └──────────────────┬────────────────────┘ UTF-8 bytes
┌──── Opcode F9 indicates a variable-length string │ ┌─── Length: FlexUInt 24 │ │ v a r i a b l e l e n g t h e n c o d i n g F9 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67 └────────────────────────────────┬────────────────────────────────────┘ UTF-8 bytes
null.string
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: string │ │ EB 05
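The choice between the 0x9_ opcodes and 0xF9 can be sketched as below. The function names are hypothetical, and the length is measured in UTF-8 bytes rather than characters.

```python
def encode_flex_uint(value: int) -> bytes:
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def encode_string(text: str) -> bytes:
    """Encode an Ion string value with its opcode."""
    utf8 = text.encode("utf-8")
    if len(utf8) <= 15:
        # The low nibble of the opcode carries the byte length directly.
        return bytes([0x90 + len(utf8)]) + utf8
    # Longer text: 0xF9, a FlexUInt byte length, then the UTF-8 bytes.
    return b"\xF9" + encode_flex_uint(len(utf8)) + utf8
```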
If the high nibble of the opcode is 0xA_, it represents a symbol whose text follows the opcode. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0xA0 represents a symbol with empty text ('').
0xEB 0x06 represents null.symbol.
''
)┌──── Opcode in range A0-AF indicates a symbol with inline text │┌─── Low nibble 0 indicates that no UTF-8 bytes follow A0
┌──── Opcode in range A0-AF indicates a symbol with inline text │┌─── Low nibble E indicates that 14 UTF-8 bytes follow ││ f o u r t e e n b y t e s AE 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73 └──────────────────┬────────────────────┘ UTF-8 bytes
┌──── Opcode FA indicates a variable-length symbol with inline text │ ┌─── Length: FlexUInt 24 │ │ v a r i a b l e l e n g t h e n c o d i n g FA 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67 └────────────────────────────────┬────────────────────────────────────┘ UTF-8 bytes
null.symbol
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: symbol │ │ EB 06
Symbol values whose text can be found in the local symbol table are encoded using opcodes 0xE1 through 0xE3:
- 0xE1 represents a symbol whose address in the symbol table (aka its symbol ID) is a 1-byte FixedUInt that follows the opcode.
- 0xE2 represents a symbol whose address in the symbol table is a 2-byte FixedUInt that follows the opcode.
- 0xE3 represents a symbol whose address in the symbol table is a FlexUInt that follows the opcode.
Writers MUST encode a symbol address in the smallest number of bytes possible. For each opcode above, the symbol address that is decoded is biased by the number of addresses that can be encoded in fewer bytes.
Opcode | Symbol address range | Bias |
---|---|---|
0xE1 | 0 to 255 | 0 |
0xE2 | 256 to 65,791 | 256 |
0xE3 | 65,792 to infinity | 65,792 |
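The biases in the table translate into a short encoder. This is a sketch with hypothetical function names; it selects the smallest encoding for each address, as the text requires of writers.

```python
def encode_flex_uint(value: int) -> bytes:
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def encode_symbol_address(address: int) -> bytes:
    """Encode a symbol value that refers to a symbol table address."""
    if address < 0:
        raise ValueError("symbol addresses are non-negative")
    if address < 256:
        return b"\xE1" + bytes([address])                       # bias 0
    if address < 65792:
        return b"\xE2" + (address - 256).to_bytes(2, "little")  # bias 256
    return b"\xE3" + encode_flex_uint(address - 65792)          # bias 65,792
```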
Opcode 0xFE indicates a blob of binary data. A FlexUInt follows that represents the blob’s byte-length.
0xEB 0x07 represents null.blob.
┌──── Opcode FE indicates a blob, FlexUInt length follows │ ┌─── Length: FlexUInt 24 │ │ FE 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79 └────────────────────────────────┬────────────────────────────────────┘ 24 bytes of binary data
null.blob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: blob │ │ EB 07
Opcode 0xFF indicates a clob: binary character data of an unspecified encoding. A FlexUInt follows that represents the clob’s byte-length.
0xEB 0x08 represents null.clob.
┌──── Opcode FF indicates a clob, FlexUInt length follows │ ┌─── Length: FlexUInt 24 │ │ FF 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79 └────────────────────────────────┬────────────────────────────────────┘ 24 bytes of binary data
null.clob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: clob │ │ EB 08
Each of the container types (list, s-expression, and struct) has both a length-prefixed encoding and a delimited encoding.
The length-prefixed encoding places more burden on the writer, but simplifies reading and enables skipping over uninteresting values in the data stream. In contrast, the delimited encoding is simpler and faster for writers, but requires the reader to visit each child value in turn to skip over the container.
An opcode with a high nibble of 0xB_
indicates a length-prefixed list. The lower nibble of the
opcode indicates how many bytes were used to encode the child values that the list contains.
If the list’s encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFB
opcode
to write a variable-length list. The 0xFB
opcode is followed by a
FlexUInt
that indicates the list’s byte length.
0xEB 0x09
represents null.list
.
[]
)┌──── An Opcode in the range 0xB0-0xBF indicates a list. │┌─── A low nibble of 0 indicates that the child values of this list took zero bytes to encode. B0
[1, 2, 3]
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 6 indicates that the child values of this
││    list took six bytes to encode.
B6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
["variable length list"]
┌──── Opcode 0xFB indicates a variable-length list. A FlexUInt length follows. │ ┌───── Length: FlexUInt 22 │ │ ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows. │ │ │ ┌─────── Length: FlexUInt 20 │ │ │ │ v a r i a b l e l e n g t h l i s t FB 2d F9 29 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 6c 69 73 74 └─────────────────────────────┬─────────────────────────────────┘ Nested string element
null.list
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: list │ │ EB 09
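To make the trade-off between the two container encodings concrete, the sketch below builds both forms of a list from children that have already been encoded. The function names are hypothetical. The length-prefixed form needs the total child byte count before it can write the opcode, while the delimited form (opcodes 0xF1 and 0xF0) can be emitted as a stream.

```python
def encode_flex_uint(value: int) -> bytes:
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def encode_list(encoded_children: list[bytes]) -> bytes:
    """Length-prefixed list: the writer must know the body size up front."""
    body = b"".join(encoded_children)
    if len(body) <= 15:
        return bytes([0xB0 + len(body)]) + body
    return b"\xFB" + encode_flex_uint(len(body)) + body


def encode_delimited_list(encoded_children: list[bytes]) -> bytes:
    """Delimited list: 0xF1 opens, 0xF0 closes; no length is recorded."""
    return b"\xF1" + b"".join(encoded_children) + b"\xF0"
```

A reader can skip the length-prefixed form by advancing past the declared byte count, whereas the delimited form has to be walked child by child to find its 0xF0 terminator.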
Opcode 0xF1
begins a delimited list, while opcode 0xF0
closes the most recently opened delimited container
that has not yet been closed.
[]
)┌──── Opcode 0xF1 indicates a delimited list │ ┌─── Opcode 0xF0 indicates the end of the most recently opened container F1 F0
[1, 2, 3]
┌──── Opcode 0xF1 indicates a delimited list │ ┌─── Opcode 0xF0 indicates the end of │ │ the most recently opened container F1 61 01 61 02 61 03 F0 └─┬─┘ └─┬─┘ └─┬─┘ 1 2 3
[1, [2], 3]
┌──── Opcode 0xF1 indicates a delimited list │ ┌─── Opcode 0xF1 begins a nested delimited list │ │ ┌─── Opcode 0xF0 closes the most recently │ │ │ opened delimited container: the nested list. │ │ │ ┌─── Opcode 0xF0 closes the most recently opened (and still open) │ │ │ │ delimited container: the outer list. │ │ │ │ F1 61 01 F1 61 02 F0 61 03 F0 └─┬─┘ └─┬─┘ └─┬─┘ 1 2 3
S-expressions use the same encodings as lists, but with different opcodes.
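Opcode | Encoding |
---|---|
0xC_ | Length-prefixed S-expression; low nibble of the opcode represents the byte-length. |
0xFC | Variable-length prefixed S-expression; a FlexUInt following the opcode represents the byte-length. |
0xF2 | Starts a delimited S-expression; 0xF0 closes the most recently opened delimited container. |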
0xEB 0x0A
represents null.sexp
.
()
)┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression. │┌─── A low nibble of 0 indicates that the child values of this S-expression took zero bytes to encode. C0
(1 2 3)
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression. │┌─── A low nibble of 6 indicates that the child values of this S-expression took six bytes to encode. C6 61 01 61 02 61 03 └─┬─┘ └─┬─┘ └─┬─┘ 1 2 3
("variable length sexp")
┌──── Opcode 0xFC indicates a variable-length sexp. A FlexUInt length follows. │ ┌───── Length: FlexUInt 22 │ │ ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows. │ │ │ ┌─────── Length: FlexUInt 20 │ │ │ │ v a r i a b l e l e n g t h s e x p FC 2D F9 29 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 73 65 78 70 └─────────────────────────────┬─────────────────────────────────┘ Nested string element
()
┌──── Opcode 0xF2 indicates a delimited S-expression │ ┌─── Opcode 0xF0 indicates the end of the most recently opened container F2 F0
(1 2 3)
┌──── Opcode 0xF2 indicates a delimited S-expression │ ┌─── Opcode 0xF0 indicates the end of │ │ the most recently opened container F2 61 01 61 02 61 03 F0 └─┬─┘ └─┬─┘ └─┬─┘ 1 2 3
(1 (2) 3)
┌──── Opcode 0xF2 indicates a delimited S-expression │ ┌─── Opcode 0xF2 begins a nested delimited S-expression │ │ ┌─── Opcode 0xF0 closes the most recently │ │ │ opened delimited container: the nested S-expression. │ │ │ ┌─── Opcode 0xF0 closes the most recently opened (and still open) │ │ │ │ delimited container: the outer S-expression. │ │ │ │ F2 61 01 F2 61 02 F0 61 03 F0 └─┬─┘ └─┬─┘ └─┬─┘ 1 2 3
null.sexp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: sexp │ │ EB 0A
Structs have 3 available encodings:
- Structs with symbol address field names
- Structs with FlexSym field names
- Delimited structs
0xEB 0x0B
represents null.struct
.
null.struct
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type │ ┌─── Null type: struct │ │ EB 0B
An opcode with a high nibble of 0xD_
indicates a struct with symbol address field names (which is similar to the
only available encoding of structs in Ion 1.0).
The lower nibble of the opcode indicates how many bytes were used to encode all of its nested (field name, value)
pairs.
If the struct’s encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFD
opcode
to write a variable-length struct with symbol address field names. The 0xFD
opcode is followed by a
FlexUInt
that indicates the byte length.
Each field in the struct is encoded as a FlexUInt
representing the address of the field name’s
text in the symbol table, followed by an opcode-prefixed value.
The symbol address $0
cannot be encoded in this format because the FlexUInt
0
in the field name position is reserved for switching the struct to FlexSym
field names.
{}
┌──── An opcode in the range 0xD0-0xDF indicates a struct with symbol address field names │┌─── A lower nibble of 0 indicates that the struct's fields took zero bytes to encode D0
{$10: 1, $11: 2}
┌──── An opcode in the range 0xD0-0xDF indicates a struct with symbol address field names │ ┌─── Field name: FlexUInt 10 ($10) │ │ ┌─── Field name: FlexUInt 11 ($11) │ │ │ D6 15 61 01 17 61 02 └─┬─┘ └─┬─┘ 1 2
{$10: "variable length struct"}
┌───────────── Opcode `FD` indicates a variable length struct with symbol address field names │ ┌────────── Length: FlexUInt 25 │ │ ┌─────── Field name: FlexUInt 10 ($10) │ │ │ ┌──── Opcode `F9` indicates a variable length string │ │ │ │ ┌─ FlexUInt: 22 the string is 22 bytes long │ │ │ │ │ v a r i a b l e l e n g t h s t r u c t FD 33 15 F9 2D 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 73 74 72 75 63 74 └─────────────────────────────┬─────────────────────────────────┘ UTF-8 bytes
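The struct layout with symbol address field names can be sketched as a small encoder. The helper names `flex_uint` and `encode_struct` are hypothetical, field values are assumed to be already encoded with their opcodes, and the sketch does not cover switching to `FlexSym` field names.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    size = 1
    while value >= 1 << (7 * size):  # each byte contributes 7 magnitude bits
        size += 1
    return (((value << 1) | 1) << (size - 1)).to_bytes(size, "little")


def encode_struct(fields: list[tuple[int, bytes]]) -> bytes:
    """Encode (symbol address, encoded value) pairs as a length-prefixed struct."""
    body = bytearray()
    for address, value in fields:
        if address == 0:
            # FlexUInt 0 in the field name position is reserved for switching
            # the struct to FlexSym field names, so $0 is rejected here.
            raise ValueError("$0 cannot be written as a symbol address field name")
        body += flex_uint(address) + value
    if len(body) < 16:
        return bytes([0xD0 | len(body)]) + bytes(body)   # opcode 0xD0-0xDF
    return b"\xFD" + flex_uint(len(body)) + bytes(body)  # opcode 0xFD + length
```

For example, `encode_struct([(10, b"\x61\x01"), (11, b"\x61\x02")])` yields the bytes `D6 15 61 01 17 61 02` shown above for `{$10: 1, $11: 2}`.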
Note
|
This encoding is very similar to structs with symbol address
field names, but allows writers to choose between representing each field name as a symbol address
(for example: $10 ) or as inline UTF-8 bytes (for example: "foo" ). This encoding is potentially less
dense, but offers writers significant flexibility over whether and when field names are added to the
symbol table.
|
All length-prefixed structs begin as structs with symbol address field names.
However, they can be switched to use FlexSym
field names at any time by emitting the FlexUInt 0 (the byte 0x01
) in the field name position. Once a struct has been switched to the FlexSym
field name encoding, it cannot be switched back.
Each field in the struct is encoded as a FlexSym
field name, followed by an opcode-prefixed value.
{"foo": 1, $11: 2}
┌─── Opcode with high nibble `D` indicates a struct │┌── Length: 10 ││ ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode ││ │ ┌─ FlexSym -3 ┌─ FlexSym: 11 ($11) ││ │ │ f o o │ DA 01 FB 66 6F 6F 61 01 17 61 02 └──┬───┘ └─┬─┘ └─┬─┘ 3 UTF-8 1 2 bytes
{$11: 1, "foo": 2}
┌─── Opcode with high nibble `D` indicates a struct │┌── Length: 10 ││ ┌─ FlexSym: 11 ($11) ││ │ ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode ││ │ │ ┌─ FlexSym -3 ││ │ │ │ f o o DA 17 61 01 01 FB 66 6F 6F 61 02 └─┬─┘ └──┬───┘ └─┬─┘ 1 3 UTF-8 2 bytes
{$0: 1}
┌─── Opcode with high nibble `D` indicates a struct │┌── Length: 5 ││ ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode ││ │ ┌── FlexSym "escape" ││ │ │ ││ │ │ D5 01 01 A0 61 01 └─┬─┘ └─┬─┘ $0 1
0x01
.
Opcode 0xF3
indicates the beginning of a delimited struct with FlexSym
field names.
Unlike lists and S-expressions, structs cannot use opcode 0xF0
by itself to indicate the end of the delimited
container. This is because 0xF0
is a valid first byte of a FlexSym
, so a reader could not tell it apart from the start of another field name. To close the delimited
struct, the writer emits a 0x01
byte (a FlexSym
escape) followed by the opcode 0xF0
.
Note
|
While length-prefixed structs can choose between structs with
symbol address field names and structs with FlexSym field names,
delimited structs always use FlexSym -encoded field names.
|
{}
┌─── Opcode 0xF3 indicates the beginning of a delimited struct with `FlexSym` field names. │ ┌─── FlexSym escape code 0 (0x01): an opcode follows │ │ ┌─── Opcode 0xF0 indicates the end of the most │ │ │ recently opened delimited container F3 01 F0
{"foo": 1, $11: 2}
┌─── Opcode 0xF3 indicates the beginning of a delimited struct with `FlexSym` field names. │ │ ┌─ FlexSym -3 ┌─ FlexSym: 11 ($11) │ │ │ ┌─── FlexSym escape code 0 (0x01): an opcode follows │ │ │ │ ┌─── Opcode 0xF0 indicates the end of the most │ │ f o o │ │ │ recently opened delimited container F3 FB 66 6F 6F 61 01 17 61 02 01 F0 └──┬───┘ └─┬─┘ └─┬─┘ 3 UTF-8 1 2 bytes
The opcode 0xEA
indicates an untyped null (that is: null
, or its alias null.null
).
The opcode 0xEB
indicates a typed null; a byte follows whose value represents an offset into the following table:
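Byte | Type |
---|---|
0x00 | bool |
0x01 | int |
0x02 | float |
0x03 | decimal |
0x04 | timestamp |
0x05 | string |
0x06 | symbol |
0x07 | blob |
0x08 | clob |
0x09 | list |
0x0A | sexp |
0x0B | struct |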
All other byte values are reserved for future use.
Note
|
Future versions of Ion may decide to generalize this into a "constants" table. |
null
┌──── The opcode `0xEA` represents a null (null.null) EA
null.string
┌──── The opcode `0xEB` indicates a typed null; a byte indicating the type follows │ ┌──── Byte 0x05 indicates the type `string` EB 05
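Decoding the two null opcodes can be sketched as follows. `read_null` is a hypothetical function name, and the lookup table is deliberately partial: it contains only the type bytes that appear in this document's worked examples (`int`, `string`, `list`, `sexp`, `struct`).

```python
# Partial lookup table: only the type bytes used in this document's examples.
NULL_TYPES = {
    0x01: "int",
    0x05: "string",
    0x09: "list",
    0x0A: "sexp",
    0x0B: "struct",
}


def read_null(data: bytes, pos: int = 0):
    """Return (text form, next_position) for a null starting at `pos`."""
    opcode = data[pos]
    if opcode == 0xEA:
        return "null", pos + 1              # untyped null, no further bytes
    if opcode == 0xEB:
        type_byte = data[pos + 1]           # one byte selecting the null's type
        if type_byte not in NULL_TYPES:
            raise ValueError(f"type byte 0x{type_byte:02X} is not in this partial table")
        return f"null.{NULL_TYPES[type_byte]}", pos + 2
    raise ValueError(f"opcode 0x{opcode:02X} is not a null")
```

A complete reader would cover every defined type byte and treat all remaining values as reserved.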
Annotations can be encoded either as symbol addresses
or as FlexSym
s. In both encodings, the annotations sequence appears
just before the value that it decorates.
It is illegal for an annotations sequence to appear before any of the following:
- Another annotations sequence
- The end of the stream
- A NOP
- An E-expression (that is: a macro invocation). To add annotations to the expansion of an E-expression, see the annotate macro. (TODO: Link)
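Opcodes 0xE4
through 0xE6
indicate one or more annotations encoded as symbol addresses. If the opcode is:
- 0xE4, a single FlexUInt-encoded symbol address follows.
- 0xE5, two FlexUInt-encoded symbol addresses follow.
- 0xE6, a FlexUInt follows that represents the length of the annotations sequence, which is made up of any number of FlexUInt-encoded symbol addresses.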
$10::false
┌──── The opcode `0xE4` indicates a single annotation encoded as a symbol address follows │ ┌──── Annotation with symbol address: FlexUInt 10 E4 15 6F └── The annotated value: `false`
$10::$11::false
┌──── The opcode `0xE5` indicates that two annotations encoded as symbol addresses follow │ ┌──── Annotation with symbol address: FlexUInt 10 ($10) │ │ ┌──── Annotation with symbol address: FlexUInt 11 ($11) E5 15 17 6F └── The annotated value: `false`
$10::$11::$12::false
┌──── The opcode `0xE6` indicates a variable-length sequence of symbol address annotations; │ a FlexUInt follows representing the length of the sequence. │ ┌──── Annotations sequence length: FlexUInt 3 │ │ ┌──── Annotation with symbol address: FlexUInt 10 ($10) │ │ │ ┌──── Annotation with symbol address: FlexUInt 11 ($11) │ │ │ │ ┌──── Annotation with symbol address: FlexUInt 12 ($12) E6 07 15 17 19 6F └── The annotated value: `false`
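The opcode selection for symbol address annotations can be sketched in Python. The names `flex_uint` and `annotate` are hypothetical, the annotated value is assumed to be already encoded, and the sketch assumes that the `FlexUInt` following `0xE6` is a byte length, by analogy with the `FlexSym` form. In the three-annotation example above, each address occupies one byte, so a byte length and an annotation count coincide.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    size = 1
    while value >= 1 << (7 * size):  # each byte contributes 7 magnitude bits
        size += 1
    return (((value << 1) | 1) << (size - 1)).to_bytes(size, "little")


def annotate(addresses: list[int], value: bytes) -> bytes:
    """Prefix an encoded value with symbol address annotations."""
    if not addresses:
        raise ValueError("at least one annotation is required")
    encoded = [flex_uint(address) for address in addresses]
    if len(encoded) == 1:
        prefix = b"\xE4" + encoded[0]                # one symbol address
    elif len(encoded) == 2:
        prefix = b"\xE5" + encoded[0] + encoded[1]   # two symbol addresses
    else:
        sequence = b"".join(encoded)
        # Assumption: the FlexUInt after 0xE6 is the sequence's byte length.
        prefix = b"\xE6" + flex_uint(len(sequence)) + sequence
    return prefix + value
```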
Opcodes 0xE7
through 0xE9
indicate one or more annotations encoded as FlexSym
s.
If the opcode is:
- 0xE7, a single FlexSym-encoded symbol follows.
- 0xE8, two FlexSym-encoded symbols follow.
- 0xE9, a FlexUInt follows that represents the byte length of the annotations sequence, which is made up of any number of annotations encoded as FlexSyms.
While this encoding is more flexible than annotations with symbol addresses, it can be slightly less compact when all the annotations are encoded as symbol addresses.
$10::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows │ ┌──── Annotation with symbol address: FlexSym 10 ($10) E7 15 6F └── The annotated value: `false`
foo::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows │ ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow │ │ f o o E7 FB 66 6F 6F 6F └──┬───┘ └── The annotated value: `false` 3 UTF-8 bytes
Note that FlexSym
annotation sequences can switch between symbol address and inline text
on a per-annotation basis.
$10::foo::false
┌──── The opcode `0xE8` indicates two annotations encoded as FlexSyms follow │ ┌──── Annotation: FlexSym 10 ($10) │ │ ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow │ │ │ f o o E8 15 FB 66 6F 6F 6F └──┬───┘ └── The annotated value: `false` 3 UTF-8 bytes
$10::foo::$11::false
┌──── The opcode `0xE9` indicates a variable-length sequence of FlexSym-encoded annotations │ ┌──── Length: FlexUInt 6 │ │ ┌──── Annotation: FlexSym 10 ($10) │ │ │ ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow │ │ │ │ ┌──── Annotation: FlexSym 11 ($11) │ │ │ │ f o o │ E9 0D 15 FB 66 6F 6F 17 6F └──┬───┘ └── The annotated value: `false` 3 UTF-8 bytes
A NOP
(short for "no-operation") is the binary equivalent of whitespace. NOP
bytes have no meaning,
but can be used as padding to achieve a desired alignment.
An opcode of 0xEC
indicates a single-byte NOP
pad. An opcode of 0xED
indicates that a
FlexUInt
follows that represents the number of additional bytes to skip.
It is legal for a NOP
to appear anywhere that a value can be encoded. It is not legal for a NOP
to appear in
annotation sequences or struct field names. If a NOP
appears in place of a struct field value, then the associated
field name is ignored; the NOP
is immediately followed by the next field name, if any.
NOP
┌──── The opcode `0xEC` represents a 1-byte NOP pad │ EC
NOP
┌──── The opcode `0xED` represents a variable-length NOP pad; a FlexUInt length follows │ ┌──── Length: FlexUInt 2; two more bytes of NOP follow │ │ ED 05 93 C6 └─┬─┘ NOP bytes, values ignored
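A writer that needs exactly `n` bytes of padding could combine the two NOP forms as sketched below. `flex_uint` and `nop_pad` are hypothetical names. The sketch assumes that a `0xED` NOP with a length of zero is legal and fills the skipped bytes with zeros, since their values are ignored.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    size = 1
    while value >= 1 << (7 * size):  # each byte contributes 7 magnitude bits
        size += 1
    return (((value << 1) | 1) << (size - 1)).to_bytes(size, "little")


def nop_pad(n: int) -> bytes:
    """Return exactly `n` bytes of NOP padding."""
    if n < 0:
        raise ValueError("padding length cannot be negative")
    if n == 0:
        return b""
    if n == 1:
        return b"\xEC"  # single-byte NOP
    # Look for a payload size k such that opcode + FlexUInt(k) + k bytes == n.
    for k in range(n - 2, -1, -1):
        prefix = flex_uint(k)
        total = 1 + len(prefix) + k
        if total == n:
            return b"\xED" + prefix + bytes(k)
        if total < n:
            break  # smaller payloads only get shorter; no exact fit exists
    # No single variable-length NOP fits (the FlexUInt grew by a byte),
    # so emit a 1-byte NOP and pad the remainder.
    return b"\xEC" + nop_pad(n - 1)
```

The fallback branch matters at sizes such as 130 bytes, where a 127-byte payload gives 129 bytes in total and a 128-byte payload gives 131.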
The binary encoding of E-expressions (aka macro invocations) starts with the address of the macro to expand. The address
can be encoded as part of the opcode, as
a FixedUInt
that follows the opcode, or as
a FlexUInt
that follows the opcode.
The encoding of the E-expression’s arguments depends on their respective types. Argument types can be classified as belonging to one of two categories: tagged encodings and tagless encodings.
Tagged types are argument types whose encoding begins with an opcode, sometimes informally called a 'tag'. These include the core types and the abstract types.
The core types are the 13 types in the Ion data model:
null
| bool
| int
| float
| decimal
| timestamp
| string
| symbol
| blob
| clob
| list
| sexp
| struct
The abstract types are unions of two or more of the core types.
Abstract type | Included Ion types |
---|---|
|
All core Ion types |
|
|
|
|
|
|
|
|
|
|
When a macro parameter has a tagged type, the encoding of that parameter’s corresponding argument in an E-expression is identical to how it would be encoded anywhere else in an Ion stream: it has a leading opcode that dictates how many bytes follow and how they should be interpreted. This is very flexible, but makes it possible for writers to encode values that conflict with the parameter’s declared type. Because of this, the macro expander will read the argument and then check its type against the parameter’s declared type. If it does not match, the macro expander must raise an error.
Macro foo
(defined below) is used in this section’s subsequent examples to demonstrate the encoding of tagged-type
arguments.
foo
at address 0
(macro foo // Macro name (number::x!) // Parameters /*...*/ // Template (elided) )
(:foo 3.14e0)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows. │ ┌──── Opcode 0x6B indicates a 2-byte float; an IEEE-754 half-precision float follows │ │ 00 6B 47 42 └─┬─┘ 3.14e0 // The macro expander confirms that `3.14e0` (a `float`) matches the expected type: `number`.
(:foo 9)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows. │ ┌──── Opcode 0x61 indicates a 1-byte integer. A 1-byte FixedInt follows. │ │ ┌──── A 1-byte FixedInt: 9 00 61 09 // The macro expander confirms that `9` (an `int`) matches the expected type: `number`.
(:foo $10::9)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows. │ ┌──── Opcode 0xE4 indicates a single annotation with symbol address. A FlexUInt follows. │ │ ┌──── Symbol address: FlexUInt 10 ($10); an opcode for the annotated value follows. │ │ │ ┌──── Opcode 0x61 indicates a 1-byte integer │ │ │ │ ┌──── 1-byte FixedInt 9 00 E4 15 61 09 // The macro expander confirms that `$10::9` (an annotated `int`) matches the expected type: `number`.
(:foo null.int)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows. │ ┌──── Opcode 0xEB indicates a typed null. A 1-byte FixedUInt follows indicating the type. │ │ ┌──── Null type: FixedUInt: 1; integer 00 EB 01 // The macro expander confirms that `null.int` matches the expected type: `number`.
(:foo null)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows. │ ┌──── Opcode 0xEA represents an untyped null (aka `null.null`) 00 EA // The macro expander confirms that `null` matches the expected type: `number`
(:foo (:bar))
// A second macro definition at address 1 (macro bar // Macro name () // Parameters 5 // Template; invocations of `bar` always expand to `5`. ) ┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows. │ ┌──── Opcode 0x01 is less than 0x40, so it is an E-expression invoking the macro │ │ at address 1: `bar`. `bar` takes no parameters, so no bytes follow. 00 01 // The macro expander confirms that the expansion of `(:bar)` (that is: `5`) matches // the expected type: `number`.
(:foo "hello")
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0, `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows. │ ┌──── Opcode 0x95 indicates a 5-byte string. 5 UTF-8 bytes follow. │ │ h e l l o 00 95 68 65 6C 6C 6F └──────┬─────┘ UTF-8 bytes // ERROR: Expected a `number` for `foo` parameter `x`, but found `string`
In contrast to tagged encodings, tagless encodings do not begin with an opcode. This means
that they are potentially more compact than a tagged type, but are also less flexible. Because tagless encodings
do not have an opcode, they cannot represent E-expressions, annotation sequences, or null
values of any kind.
Tagless types include the primitive types and macro shapes.
Primitive types are self-delineating, either by having a statically known size in bytes or by including length information in their encoding.
Primitive types include:
Ion type | Primitive type | Size in bytes | Encoding |
---|---|---|---|
|
|
1 |
|
|
2 |
||
|
4 |
||
|
8 |
||
|
variable |
||
|
1 |
||
|
2 |
||
|
4 |
||
|
8 |
||
|
variable |
||
|
|
2 |
|
|
4 |
||
|
8 |
||
|
|
variable |
TODO:
- Finalize names for primitive types. (compact_? plain_?)
- Do we need a compact_string encoding? It saves a byte for string lengths >16 and <128.
- Do we need other int sizes? int24? int40?
The term macro shape describes a macro that is being used as the encoding of an E-expression argument. Macro shapes are considered "shapes" rather than types because, while their encoding is always statically known, the types of data produced by their expansion are not. A single macro can produce streams of varying length that contain values of different Ion types, depending on the arguments provided in the invocation.
See the Macro Shapes section of Macros by Example for more information.
E-expression arguments corresponding to each parameter are encoded one after the other moving from left to right.
foo
at address 0
(macro foo // Macro name ( // Parameters string::a compact_symbol::b uint16::c ) /* ... */ // Body (elided) )
(:0 "hello" baz 512)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0, `foo`. `foo`'s first parameter is a string, so an opcode follows. │ │ ┌──── Opcode 0x95 indicates a 5-byte string. 5 UTF-8 bytes follow. │ │ │ │ ┌──── `foo`'s second parameter is a compact_symbol, so a `FlexSym` follows. │ │ │ FlexSym -3: 3 bytes of UTF-8 text follow. │ │ │ │ │ │ ┌──── `foo`'s third parameter is a uint16, so a │ │ │ │ 2-byte `FixedUInt` follows. │ │ │ │ FixedUInt: 512 │ │ h e l l o │ b a z │ 00 95 68 65 6C 6C 6F FB 62 61 7A 00 02 └──────┬─────┘ └───┬──┘ UTF-8 bytes UTF-8 bytes
The examples in previous sections have only shown how to encode invocations of macros which have either no parameters at all (aka constants) or whose parameters all have a cardinality of exactly-one.
If a macro has any parameters with a cardinality of zero-or-one (?
), zero-or-more (*
), or one-or-more (+
),
then E-expressions invoking that macro will begin with an argument encoding bitmap (AEB).
An AEB is a series of bits in which each parameter is allotted its own group of bits. Those bits communicate additional
information about how the arguments corresponding to that parameter have been encoded in the current E-expression.
In particular, the AEB indicates whether a parameter that accepts (:void)
has any arguments at all, and how a
grouped parameter’s arguments have been delimited.
The number of bits allotted to each parameter is determined by its cardinality, as shown in the table below; each parameter can have 0, 1, or 2 bits.
Grouping Mode | Cardinality | Example parameter signature | Number of bits | Bit(s) value | Encoding |
---|---|---|---|---|---|
Ungrouped |
Exactly-one |
|
0 |
n/a |
One expression |
Zero-or-one |
|
1 |
|
No expression; equivalent to |
|
|
One expression |
||||
Zero-or-more |
|
|
No expression; equivalent to |
||
|
One expression |
||||
One-or-more |
|
0 |
n/a |
One expression |
|
Grouped |
Zero-or-more |
|
2 |
|
No expression; equivalent to |
|
One expression |
||||
|
Length-prefixed expression group |
||||
|
Delimited expression group |
||||
One-or-more |
|
|
Illegal. One-or-more forbids |
||
|
One expression |
||||
|
|||||
|
The total number of bits in the AEB can be calculated by analyzing the signature of the macro being invoked. If the macro has no parameters, or if all of its parameters are ungrouped with a cardinality of either exactly-one or one-or-more, no bits are required and the AEB is omitted altogether. If the macro has many parameters that require bits, the AEB may need more than one byte to encode; in such cases, the bytes are written in little-endian order. AEB bytes can contain unused bits.
Bits are assigned to the parameters in a macro’s signature from left to right. Bits are assigned from least significant to most significant (commonly: right-to-left).
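The bit assignment rule can be sketched as a small packing routine. `pack_aeb` and its parameter representation are hypothetical. The bit widths follow the table above as read here: 0 bits for ungrouped exactly-one and one-or-more, 1 bit for ungrouped zero-or-one, and 2 bits for grouped parameters. The 1-bit width for ungrouped zero-or-more is an assumption.

```python
# Bits per ungrouped parameter, keyed by cardinality suffix.
# Assumption: ungrouped zero-or-more ("*") takes 1 bit, like zero-or-one ("?").
UNGROUPED_BITS = {"!": 0, "?": 1, "*": 1, "+": 0}
GROUPED_BITS = 2


def pack_aeb(params, values) -> bytes:
    """Pack per-parameter bit values into argument encoding bitmap bytes.

    params: (cardinality, grouped) tuples in signature order, left to right.
    values: the bit value chosen for each parameter (0 if it has no bits).
    """
    bitmap = 0
    offset = 0
    for (cardinality, grouped), value in zip(params, values):
        width = GROUPED_BITS if grouped else UNGROUPED_BITS[cardinality]
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bit(s)")
        bitmap |= value << offset  # earlier parameters occupy less significant bits
        offset += width
    # Round up to whole bytes; an AEB with zero bits is omitted entirely.
    return bitmap.to_bytes((offset + 7) // 8, "little")
```

For a single grouped parameter with bit value `0b10`, `pack_aeb([("*", True)], [0b10])` produces the one-byte bitmap `02`. Five grouped parameters need ten bits and therefore two bytes, written least significant byte first.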
Example parameter sequence | Bit assignments | Total bits |
---|---|---|
|
No AEB |
0 |
|
No AEB |
0 |
|
|
1 |
|
|
1 |
|
|
2 |
|
|
3 |
|
|
5 |
|
|
6 |
|
|
9 |
Grouped parameters can be encoded using either a length-prefixed or delimited expression group encoding.
The example encodings in the following sections refer to this macro definition:
foo
at address 0
(macro foo // Macro name (int::x*) // Parameters; `x` is a grouped parameter /*...*/ // Body (elided) )
If a grouped parameter’s AEB bits are 0b10
, then the argument expressions belonging
to that parameter will be prefixed by a FlexUInt
indicating the number of bytes used to encode them.
(:foo [1, 2, 3])
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0: `foo`. `foo` takes a group of int expressions as a parameter (`x`), │ so an argument encoding bitmap (AEB) follows. │ ┌──── AEB: 0b0000_0010; the arguments for grouped parameter `x` have been encoded │ │ as a length-prefixed expression group. A FlexUInt length prefix follows. │ │ ┌──── FlexUInt: 6; the next 6 bytes are an `int` expression group. │ │ │ 00 02 0D 61 01 61 02 61 03 └─┬─┘ └─┬─┘ └─┬─┘ 1 2 3
If a grouped parameter’s AEB bits are 0b11
, then the argument expressions belonging
to that parameter will be encoded in a delimited sequence.
Delimited sequences are encoded differently for tagged types and
tagless types.
Tagged type encodings begin with an opcode; a delimited sequence of tagged arguments is terminated by
the closing delimiter opcode, 0xF0
.
(:foo [1, 2, 3])
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 0: `foo`. `foo` takes a group of int expressions as a parameter (`x`), │ so an argument encoding bitmap (AEB) follows. │ ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded │ │ as a delimited expression group. A series of tagged `int` expressions follow. │ │ ┌──── Opcode 0xF0 ends the expression group. │ │ │ 00 03 61 01 61 02 61 03 F0 └─┬─┘ └─┬─┘ └─┬─┘ 1 2 3
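The two expression group encodings for tagged arguments can be compared side by side in a short sketch. `flex_uint` and `encode_tagged_group` are hypothetical names, and each argument is assumed to be already encoded with its leading opcode.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    size = 1
    while value >= 1 << (7 * size):  # each byte contributes 7 magnitude bits
        size += 1
    return (((value << 1) | 1) << (size - 1)).to_bytes(size, "little")


def encode_tagged_group(arguments: list[bytes], delimited: bool):
    """Return (AEB bits, encoded group) for a grouped parameter's tagged arguments."""
    body = b"".join(arguments)
    if delimited:
        # AEB bits 0b11: the arguments are followed by the closing opcode 0xF0.
        return 0b11, body + b"\xF0"
    # AEB bits 0b10: a FlexUInt byte length precedes the arguments.
    return 0b10, flex_uint(len(body)) + body


# Usage: (:foo [1, 2, 3]) with `foo` at address 0 and a single grouped parameter.
ints = [b"\x61\x01", b"\x61\x02", b"\x61\x03"]
bits, group = encode_tagged_group(ints, delimited=False)
length_prefixed = b"\x00" + bytes([bits]) + group
bits, group = encode_tagged_group(ints, delimited=True)
delimited_form = b"\x00" + bytes([bits]) + group
```

The length-prefixed form lets a reader skip the whole group without parsing it, while the delimited form lets a writer start emitting arguments before it knows how many bytes they will occupy.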
Tagless type encodings do not have an opcode, and so cannot use the closing delimiter opcode, because 0xF0
is a valid first
byte for many tagless encodings.
Instead, tagless expressions are grouped into 'pages', each of which is prefixed by a FlexUInt
representing a count (not a byte-length) of the expressions that follow. If a prefix has a count of zero, that marks
the end of the sequence of pages.
compact_foo
at address 1
(macro compact_foo // Macro name (compact_int::x*) // Parameters; `x` is a grouped parameter /*...*/ // Body (elided) )
(:compact_foo [1, 2, 3])
using a single page
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 1: `compact_foo`. `compact_foo` takes a group of compact_int expressions as a parameter (`x`), │ so an argument encoding bitmap (AEB) follows. │ ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded │ │ as a delimited expression group. Count-prefixed pages of `compact_int` │ │ expressions follow. │ │ ┌──── Count prefix: FlexUInt 3; 3 `compact_int`s follow. │ │ │ ┌──── Count prefix: FlexUInt 0; no more pages follow. │ │ │ │ 01 03 07 03 05 07 01 └──┬───┘ First page: 1, 2, 3
(:compact_foo [1, 2, 3])
using two pages
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at │ address 1: `compact_foo`. `compact_foo` takes a group of compact_int expressions as a parameter (`x`), │ so an argument encoding bitmap (AEB) follows. │ ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded │ │ as a delimited expression group. Count-prefixed pages of `compact_int` │ │ expressions follow. │ │ ┌──── Count prefix: FlexUInt 2; 2 `compact_int`s follow. │ │ │ ┌──── Count prefix: FlexUInt 1; a single `compact_int` follows. │ │ │ │ ┌──── Count prefix: FlexUInt 0; no more pages follow. │ │ │ │ │ 01 03 05 03 05 03 07 01 └─┬─┘ └─ Second page: 3 │ First page: 1, 2
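The count-prefixed paging scheme for tagless arguments can be sketched as follows. `flex_uint`, `flex_int`, and `encode_pages` are hypothetical names, the page size is a writer-side choice invented for this illustration, and the output covers only the pages, not the opcode or the AEB that precede them.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    size = 1
    while value >= 1 << (7 * size):  # each byte contributes 7 magnitude bits
        size += 1
    return (((value << 1) | 1) << (size - 1)).to_bytes(size, "little")


def flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (two's complement magnitude bits)."""
    size = 1
    while not -(1 << (7 * size - 1)) <= value < (1 << (7 * size - 1)):
        size += 1
    encoded = ((value << 1) | 1) << (size - 1)
    return encoded.to_bytes(size, "little", signed=True)


def encode_pages(values: list[int], page_size: int) -> bytes:
    """Encode compact_int arguments as count-prefixed pages."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    out = bytearray()
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        out += flex_uint(len(page))   # a count of expressions, not a byte length
        for value in page:
            out += flex_int(value)    # tagless: no opcode precedes each value
    out += flex_uint(0)               # a count of zero ends the sequence of pages
    return bytes(out)
```

With a page size of 3, the values 1, 2, 3 are written as one page followed by the zero count. With a page size of 2 they are split into a page of two values and a page of one, matching the two layouts shown above.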