Ion 1.1 Binary Encoding

Encoding Primitives

`FlexUInt`

A variable-length unsigned integer.

The bytes of a FlexUInt are written in little-endian byte order. This means that the first bytes will contain the FlexUInt's least significant bits.

The least significant bits in the FlexUInt indicate the number of bytes that were used to encode the integer. If a FlexUInt is N bytes long, its N-1 least significant bits will be 0; a terminal 1 bit will be in the next most significant position. All bits that are more significant than the terminal 1 represent the magnitude of the FlexUInt.

Figure 1: FlexUInt encoding of 14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
unsigned int 14

Figure 2: FlexUInt encoding of 729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0  0 0 0 0 1 0 1 1
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the unsigned  of the unsigned
integer          integer

Figure 3: FlexUInt encoding of 21,043

            ┌───── There are 2 zeros in the least significant bits, so this
            │      integer is three bytes wide.
          ┌─┴─┐
1 0 0 1 1 1 0 0  1 0 0 1 0 0 0 1  0 0 0 0 0 0 1 0
└───┬───┘        └──────┬──────┘  └──────┬──────┘
lowest 5 bits    next 8 bits of   highest 8 bits
of the unsigned  the unsigned     of the unsigned
integer          integer          integer
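
The rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation; the function names `encode_flex_uint` and `decode_flex_uint` are hypothetical and do not come from any Ion library.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (illustrative sketch)."""
    if value < 0:
        raise ValueError("a FlexUInt cannot represent a negative value")
    # Every encoded byte carries 7 bits of magnitude plus 1 bit of length marker.
    num_bytes = max(1, -(-value.bit_length() // 7))
    # N-1 zero bits, then the terminal 1 bit, then the magnitude.
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def decode_flex_uint(data: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the FlexUInt at the start of data."""
    num_bytes = 1
    for byte in data:
        if byte:
            # Trailing zero bits of the first non-zero byte finish the count.
            num_bytes += (byte & -byte).bit_length() - 1
            break
        num_bytes += 8  # an all-zero byte contributes eight more zero bits
    else:
        raise ValueError("no terminal 1 bit found")
    if len(data) < num_bytes:
        raise ValueError("truncated FlexUInt")
    # Shifting right by N discards the N-bit length marker.
    return int.from_bytes(data[:num_bytes], "little") >> num_bytes, num_bytes
```

With these definitions, the values 14, 729, and 21,043 produce the byte sequences shown in Figures 1 through 3.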

`FlexInt`

A variable-length signed integer.

From an encoding perspective, a FlexInt is structurally similar to a FlexUInt (described above). Both encode their bytes using little-endian byte order, and both use the count of least-significant zero bits to indicate how many bytes were used to encode the integer. They differ in the interpretation of their bits: a FlexUInt's bits are unsigned, while a FlexInt's bits are encoded using two’s complement notation.

Tip	An implementation could choose to read a `FlexInt` by instead reading a `FlexUInt` and then reinterpreting its bits as two’s complement.

Figure 4: FlexInt encoding of 14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
 2's comp. 14

Figure 5: FlexInt encoding of -14

              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
1 1 1 0 0 1 0 1
└─────┬─────┘
 2's comp. -14

Figure 6: FlexInt encoding of 729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0  0 0 0 0 1 0 1 1
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the 2's       of the 2's
comp. integer    comp. integer

Figure 7: FlexInt encoding of -729

             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
1 0 0 1 1 1 1 0  1 1 1 1 0 1 0 0
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the 2's       of the 2's
comp. integer    comp. integer
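
A FlexInt can be handled the same way as a FlexUInt, with a two's complement interpretation of the payload. The sketch below assumes hypothetical helper names (`encode_flex_int`, `decode_flex_int`) and relies on Python's arithmetic right shift to preserve the sign.

```python
def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (illustrative sketch)."""
    num_bytes = 1
    # An N-byte FlexInt has 7*N payload bits, interpreted as two's complement.
    while not -(1 << (7 * num_bytes - 1)) <= value < (1 << (7 * num_bytes - 1)):
        num_bytes += 1
    # N-1 zero bits, then the terminal 1 bit, then the two's complement payload.
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little", signed=True)


def decode_flex_int(data: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the FlexInt at the start of data."""
    num_bytes = 1
    for byte in data:
        if byte:
            num_bytes += (byte & -byte).bit_length() - 1
            break
        num_bytes += 8
    else:
        raise ValueError("no terminal 1 bit found")
    if len(data) < num_bytes:
        raise ValueError("truncated FlexInt")
    raw = int.from_bytes(data[:num_bytes], "little", signed=True)
    # Python's >> on a negative int is an arithmetic shift, so the sign survives.
    return raw >> num_bytes, num_bytes
```

The only differences from the unsigned case are the `signed=True` arguments and the range check used to pick the width.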

`FixedUInt`

A fixed-width, little-endian, unsigned integer whose length is inferred from the context in which it appears.

Figure 8: FixedUInt encoding of 3,954,261

0 1 0 1 0 1 0 1  0 1 0 1 0 1 1 0  0 0 1 1 1 1 0 0
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
lowest 8 bits    next 8 bits of   highest 8 bits
of the unsigned  the unsigned     of the unsigned
integer          integer          integer

`FixedInt`

A fixed-width, little-endian, signed integer whose length is known from the context in which it appears. Its bytes are interpreted as two’s complement.

Figure 9: FixedInt encoding of -3,954,261

1 0 1 0 1 0 1 1  1 0 1 0 1 0 0 1  1 1 0 0 0 0 1 1
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
lowest 8 bits    next 8 bits of   highest 8 bits
of the 2's       the 2's comp.    of the 2's comp.
comp. integer    integer          integer
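
Because the width of a FixedUInt or FixedInt comes from context, encoding and decoding reduce to a little-endian byte conversion. A minimal sketch using Python's built-in `int.to_bytes` and `int.from_bytes` (the wrapper function names are hypothetical):

```python
def encode_fixed_uint(value: int, width: int) -> bytes:
    """Write an unsigned integer in exactly `width` little-endian bytes."""
    return value.to_bytes(width, "little", signed=False)


def encode_fixed_int(value: int, width: int) -> bytes:
    """Write a signed integer as `width` bytes of little-endian two's complement."""
    return value.to_bytes(width, "little", signed=True)


def decode_fixed_uint(data: bytes) -> int:
    """Interpret all of `data` as an unsigned little-endian integer."""
    return int.from_bytes(data, "little", signed=False)


def decode_fixed_int(data: bytes) -> int:
    """Interpret all of `data` as a little-endian two's complement integer."""
    return int.from_bytes(data, "little", signed=True)
```

The caller is responsible for slicing out the right number of bytes before decoding, since neither primitive carries its own length.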

`FlexSym`

A variable-length symbol token whose UTF-8 bytes can be inline, found in the symbol table, or derived from a macro expansion.

A FlexSym begins with a FlexInt; once this integer has been read, we can evaluate it to determine how to proceed. If the FlexInt is:

greater than zero, it represents a symbol ID. The symbol’s associated text can be found in the local symbol table. No more bytes follow.
less than zero, its absolute value represents a number of UTF-8 bytes that follow the FlexInt. These bytes represent the symbol’s text.
exactly zero, another byte follows that is an opcode. The FlexSym parser is not responsible for evaluating this opcode, only returning it—the caller will decide whether the opcode is legal in the current context. Example usages of the opcode include:
- Representing SID $0 as 0xA0.
- Representing the empty string ("") as 0x90.
- When used to encode a struct field name, the opcode can invoke a macro that will evaluate to a struct whose key/value pairs are spliced into the parent struct (TODO: Link)
- In a delimited struct, terminating the sequence of (field name, value) pairs with 0xF0.

Figure 10: FlexSym encoding of symbol ID $10

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 1 0 1 0 1
└─────┬─────┘
  2's comp.
  positive 10

Figure 11: FlexSym encoding of symbol text 'hello'

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │      h         e        l        l        o
1 1 1 1 0 1 1 1  01101000  01100101 01101100 01101100 01101111
└─────┬─────┘    └─────────────────────┬─────────────────────┘
  2's comp.               5-byte UTF-8 encoded "hello"
  negative 5

Figure 12: FlexSym encoding of '' (empty text) using an opcode

              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 0 0 0 0 1   10010000
└─────┬─────┘     └───┬──┘
  2's comp.      opcode 0x90:
  zero           empty symbol
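
The three cases can be sketched as a small reader. This is an illustration under assumptions: `read_flex_sym` and the tagged-tuple return shape are hypothetical, and, as described above, the opcode is returned to the caller rather than evaluated.

```python
def _read_flex_int(data: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for a leading FlexInt."""
    width = 1
    for byte in data:
        if byte:
            width += (byte & -byte).bit_length() - 1
            break
        width += 8
    else:
        raise ValueError("no terminal 1 bit found")
    if len(data) < width:
        raise ValueError("truncated FlexInt")
    return int.from_bytes(data[:width], "little", signed=True) >> width, width


def read_flex_sym(data: bytes) -> tuple[str, object, int]:
    """Return (kind, payload, bytes_consumed) for the FlexSym at the start of data."""
    value, used = _read_flex_int(data)
    if value > 0:
        # Positive: a symbol ID to be resolved against the symbol table.
        return ("symbol_id", value, used)
    if value < 0:
        # Negative: the absolute value is the number of UTF-8 bytes that follow.
        end = used - value
        if len(data) < end:
            raise ValueError("truncated symbol text")
        return ("text", data[used:end].decode("utf-8"), end)
    # Zero: the next byte is an opcode that the caller must interpret.
    if len(data) <= used:
        raise ValueError("missing opcode after FlexInt 0")
    return ("opcode", data[used], used + 1)
```

Returning the opcode unevaluated keeps the reader independent of context, so the caller can decide whether, for example, `0xF0` is legal at that position.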

Note	From this point on in the document, example encodings are given in hexadecimal notation.

Opcodes

An opcode is a 1-byte FixedUInt that tells the reader what the next expression represents and how the bytes that follow should be interpreted.

Opcode meanings are organized loosely by their high and low nibbles.

High nibble	Low nibble	Meaning
`0x0_` to `0x3_`	`0`-`F`	E-expression with the address in the opcode
`0x4_`	`0`-`F`	E-expression with the address as a trailing 1-byte `FixedUInt`.
`0x5_`	`0`-`F`	E-expression with the address as a trailing 2-byte `FixedUInt`.
`0x6_`	`0`-`8`	Integers up to 8 bytes wide
	`9`	Reserved
	`A`-`D`	Floats
	`E`-`F`	Booleans
`0x7_`	`0`-`F`	Decimals
`0x8_`	`0`-`C`	Timestamps
`0x8_`	`D`-`F`	Reserved
`0x9_`	`0`-`F`	Strings
`0xA_`	`0`-`F`	Symbols with inline text
`0xB_`	`0`-`F`	Lists
`0xC_`	`0`-`F`	S-expressions
`0xD_`	`0`	Empty struct
	`1`	Reserved
	`2`-`F`	Structs with symbol address field names
`0xE_`	`0`	Ion version marker
	`1`-`3`	Symbols with symbol address
	`4`-`6`	Annotations with symbol address
	`7`-`9`	Annotations with `FlexSym` text
	`A`	`null.null`
	`B`	Typed nulls
	`C`-`D`	NOP
	`E`	E-expression with a variable-width address
	`F`	System macro invocation
`0xF_`	`0`	Delimited container end
	`1`	Delimited list start
	`2`	Delimited S-expression start
	`3`	Delimited struct with `FlexSym` field names start
	`4`	Reserved
	`5`	Variable length prefixed macro invocation
	`6`	Variable length integer
	`7`	Variable length decimal
	`8`	Variable length, long-form timestamp
	`9`	Variable length string
	`A`	Variable length symbol encoded as `FlexSym`
	`B`	Variable length list
	`C`	Variable length S-expression
	`D`	Variable length struct with symbol address field names
	`E`	Variable length blob
	`F`	Variable length clob

Encoding Expressions

The encoding of E-expressions is designed to balance density and generality. For example, they enable encodings with minimal tag bits, even none at all given a thoughtful signature. This increases density, but limits generality at the point of macro invocation.

The text and binary forms of E-expressions enforce the same syntactic constraints on the type and range of data allowed as arguments. Any syntactically well-formed E-expression can be transcoded between text and binary, without expansion and without changing semantics, and independent of whether it can be expanded successfully.

E-expression With the Address in the Opcode

If the value of the opcode is less than 64 (0x40), it represents an E-expression invoking the macro at the corresponding address—an offset within the local macro table.

Figure 13: Invocation of macro address 7

┌──── Opcode in 00-3F range indicates an e-expression
│     where the opcode value is the macro address
│
07
└── FixedUInt 7

Figure 14: Invocation of macro address 31

┌──── Opcode in 00-3F range indicates an e-expression
│     where the opcode value is the macro address
│
1F
└── FixedUInt 31

Note that the opcode alone tells us which macro is being invoked, but it does not supply enough information for the reader to parse any arguments that may follow. The parsing of arguments is described in detail in the section Macro calling conventions. (TODO: Link)

E-expression With the Address as a Trailing `FixedUInt`

While E-expressions invoking macro addresses in the range [0, 63] can be encoded in a single byte using E-expressions with the address in the opcode, many applications will benefit from defining more than 64 macros.

The 0x4_ and 0x5_ opcodes can be used to represent over 1 million macro addresses. If the high nibble of the opcode is 0x4_, then a biased address follows as a 1-byte FixedUInt. If the high nibble of the opcode is 0x5_, then a biased address follows as a 2-byte FixedUInt. In both cases, the address is biased by the total number of addresses with lower opcodes. For 0x4_, the bias is 256 * low_nibble + 64 (or (low_nibble shift-left 8) + 64). For 0x5_, the bias is 65536 * low_nibble + 4160.

Figure 15: Invocation of macro address 841

┌──── Opcode in range 40-4F indicates a macro address with 1-byte FixedUInt address
│┌─── Low nibble 3 indicates bias of 832
││
43 09
   │
   └─── FixedUInt 9

Biased Address : 9
Bias : 832
Address : 841

Figure 16: Invocation of macro address 142,918

┌──── Opcode in range 50-5F indicates a macro address with 2-byte FixedUInt address
│┌─── Low nibble 2 indicates bias of 135232
││
52 06 1E
   └─┬─┘
     └─── FixedUInt 7686

Biased Address : 7686
Bias : 135232
Address : 142918

Table 1. Macro address bias for 0x4_ and 0x5_ opcodes

Low Nibble	`0x4_` Bias	`0x5_` Bias
`0`	`64`	`4160`
`1`	`320`	`69696`
`2`	`576`	`135232`
`3`	`832`	`200768`
`4`	`1088`	`266304`
`5`	`1344`	`331840`
`6`	`1600`	`397376`
`7`	`1856`	`462912`
`8`	`2112`	`528448`
`9`	`2368`	`593984`
`A`	`2624`	`659520`
`B`	`2880`	`725056`
`C`	`3136`	`790592`
`D`	`3392`	`856128`
`E`	`3648`	`921664`
`F`	`3904`	`987200`
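
The bias arithmetic can be checked with a short decoder. This is a sketch only; `decode_macro_address` is a hypothetical name, and it covers just the opcodes 0x00 through 0x5F discussed so far.

```python
def decode_macro_address(data: bytes) -> tuple[int, int]:
    """Return (macro_address, bytes_consumed) for opcodes 0x00-0x5F."""
    opcode = data[0]
    low_nibble = opcode & 0x0F
    if opcode < 0x40:
        # The opcode itself is the address.
        return opcode, 1
    if opcode < 0x50:
        # 1-byte FixedUInt, biased by the 64 addresses below plus 256 per low nibble.
        bias = 256 * low_nibble + 64
        return bias + data[1], 2
    if opcode < 0x60:
        # 2-byte FixedUInt, biased by the 4160 addresses below plus 65536 per low nibble.
        bias = 65536 * low_nibble + 4160
        return bias + int.from_bytes(data[1:3], "little"), 3
    raise ValueError(f"opcode 0x{opcode:02X} does not carry a biased macro address")
```

Applied to the bytes in Figures 15 and 16, this yields addresses 841 and 142,918 respectively.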

E-expression With the Address as a Trailing `FlexUInt`

Opcode 0xEE indicates an E-expression whose address follows the opcode as a FlexUInt. Because the address is encoded using a FlexUInt, there is no (theoretical) limit to the number of addresses that can be invoked. However, larger addresses require more bytes to encode.

Unlike the 0x4_ and 0x5_ opcodes, the 0xEE opcode uses an unbiased address, so it can be used for any macro address.

Figure 17: Invocation of macro address 0

┌──── Opcode EE indicates a macro address as trailing FlexUInt
│  ┌─── FlexUInt 0
│  │
EE 01

Figure 18: Invocation of macro address 2,097,151

┌──── Opcode EE indicates a macro address as trailing FlexUInt
│  ┌─── FlexUInt 2097151
│  │
EE FC FF FF
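
A writer has several ways to encode the same macro address. The sketch below picks the shortest form available and falls back to 0xEE for addresses beyond the biased ranges. Choosing the shortest form is an assumption made for illustration, not a requirement stated here, and the function names are hypothetical.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def encode_flex_address(address: int) -> bytes:
    """Encode any macro address with opcode 0xEE and an unbiased FlexUInt."""
    return b"\xEE" + encode_flex_uint(address)


def encode_macro_address(address: int) -> bytes:
    """Encode a macro address using the shortest opcode form (an assumed policy)."""
    if address < 0:
        raise ValueError("macro addresses are non-negative")
    if address < 64:
        return bytes([address])                      # address in the opcode
    if address < 4160:
        biased = address - 64                        # 0x4_ with 1-byte FixedUInt
        return bytes([0x40 | (biased >> 8), biased & 0xFF])
    if address < 4160 + 16 * 65536:
        biased = address - 4160                      # 0x5_ with 2-byte FixedUInt
        return bytes([0x50 | (biased >> 16)]) + (biased & 0xFFFF).to_bytes(2, "little")
    return encode_flex_address(address)              # 0xEE with FlexUInt
```

Here `encode_flex_address(0)` reproduces Figure 17, while `encode_macro_address(0)` would instead emit the single byte `00`.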

Booleans

0x6E represents boolean true, while 0x6F represents boolean false.

0xEB 0x00 represents null.bool.

Figure 19: Encoding of boolean true

6E

Figure 20: Encoding of boolean false

6F

Figure 21: Encoding of null.bool

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: boolean
│  │
EB 00

Numbers

Integers

Opcodes in the range 0x60 to 0x68 represent an integer. The opcode is followed by a FixedInt that represents the integer value. The low nibble of the opcode (0x_0 to 0x_8) indicates the size of the FixedInt. Opcode 0x60 represents integer 0; no more bytes follow.

Integers that require more than 8 bytes are encoded using the variable-length integer opcode 0xF6, followed by a FlexUInt indicating how many bytes of representation data follow.

0xEB 0x01 represents null.int.

Figure 22: Encoding of integer 0

┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 0 indicates
││    no more bytes follow.
60

Figure 23: Encoding of integer 17

┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 1 indicates
││    a single byte follows.
61 11
    └── FixedInt 17

Figure 24: Encoding of integer -944

┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 2 indicates
││    that two bytes follow.
62 50 FC
   └─┬─┘
FixedInt -944

Figure 25: Variable-length encoding of integer -944

┌──── Opcode F6 indicates a variable-length integer, FlexUInt length follows
│   ┌─── FlexUInt 2; a 2-byte FixedInt follows
│   │
F6 05 50 FC
      └─┬─┘
   FixedInt -944

Figure 26: Encoding of null.int

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: integer
│  │
EB 01
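
The integer encodings above can be sketched as a single function. The sketch assumes the writer picks the smallest FixedInt width that holds the value, which is a design choice made here for illustration; `encode_int` and `encode_flex_uint` are hypothetical names.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


def encode_int(value: int) -> bytes:
    """Encode an Ion integer value (illustrative sketch)."""
    if value == 0:
        return b"\x60"  # low nibble 0: no bytes follow
    # Two's complement needs one sign bit beyond the magnitude bits.
    magnitude_bits = (value if value >= 0 else ~value).bit_length()
    width = (magnitude_bits + 8) // 8
    body = value.to_bytes(width, "little", signed=True)
    if width <= 8:
        return bytes([0x60 + width]) + body
    # More than 8 bytes: opcode 0xF6, then a FlexUInt byte count, then the FixedInt.
    return b"\xF6" + encode_flex_uint(width) + body
```

Figure 25 shows that the 0xF6 form may also carry a short integer, so a reader should accept both forms even if a writer always prefers the compact one.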

Floats

Float values are encoded using the IEEE-754 specification in little-endian byte order. Floats can be serialized in four sizes:

0 bits (0 bytes), representing the value 0e0 and indicated by opcode 0x6A
16 bits (2 bytes in little-endian order, half precision), indicated by opcode 0x6B
32 bits (4 bytes in little-endian order, single precision), indicated by opcode 0x6C
64 bits (8 bytes in little-endian order, double precision), indicated by opcode 0x6D

Note that in the Ion data model, float values are always 64 bits. However, if a value can be losslessly serialized in fewer than 64 bits, Ion implementations may choose to do so.

0xEB 0x02 represents null.float.

Figure 27: Encoding of float 0e0

┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble A indicates
││    a 0-length float; 0e0
6A

Figure 28: Encoding of float 3.14e0

┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble B indicates a 2-byte float
││
6B 47 42
   └─┬─┘
half-precision 3.14

Figure 29: Encoding of float 3.1415927e0

┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble C indicates a 4-byte,
││    single-precision value.
6C DB 0F 49 40
   └────┬────┘
single-precision 3.1415927

Figure 30: Encoding of float 3.141592653589793e0

┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble D indicates an 8-byte,
││    double-precision value.
6D 18 2D 44 54 FB 21 09 40
   └──────────┬──────────┘
double-precision 3.141592653589793

Figure 31: Encoding of null.float

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: float
│  │
EB 02
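
One way a writer might choose among the four float sizes is to try each narrower format and keep it only if it round-trips exactly. The sketch below uses Python's standard `struct` module (format codes `e`, `f`, and `d` for half, single, and double precision); `encode_float` is a hypothetical name and the selection policy is an assumption.

```python
import math
import struct


def encode_float(value: float) -> bytes:
    """Encode a float in the smallest size that preserves it exactly (sketch)."""
    if value == 0.0 and math.copysign(1.0, value) > 0:
        return b"\x6A"  # positive zero: no bytes follow
    for opcode, fmt in ((0x6B, "<e"), (0x6C, "<f")):
        try:
            packed = struct.pack(fmt, value)
        except (OverflowError, struct.error):
            continue  # value is out of range for this width
        # Keep the narrower form only if decoding gives back the same value.
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([opcode]) + packed
    # NaN and anything not exactly representable above falls through to 64 bits.
    return b"\x6D" + struct.pack("<d", value)
```

Under this policy a value is narrowed only when the conversion is lossless, in line with the note above that floats are always 64 bits in the data model.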

Decimals

If an opcode has a high nibble of 0x7_, it represents a decimal. Low nibble values indicate the number of trailing bytes used to encode the decimal.

The body of the decimal is encoded as a FlexInt representing its exponent, followed by a FixedInt representing its coefficient. The width of the coefficient is the total length of the decimal encoding minus the length of the exponent. It is possible for the coefficient to have a width of zero, indicating a coefficient of 0. When the coefficient is present but has a value of 0, the coefficient is -0.

Decimal values that require more than 15 bytes can be encoded using the variable-length decimal opcode: 0xF7.

0xEB 0x03 represents null.decimal.

Figure 32: Encoding of decimal 0d0

┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 0 indicates a zero-byte
││    decimal; 0d0
70

Figure 33: Encoding of decimal 7d0

┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 01 07
   |  └─── Coefficient: 1-byte FixedInt 7
   └─── Exponent: FlexInt 0

Figure 34: Encoding of decimal 1.27

┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 FD 7F
   |  └─── Coefficient: FixedInt 127
   └─── Exponent: 1-byte FlexInt -2

Figure 35: Variable-length encoding of decimal 1.27

┌──── Opcode F7 indicates a variable-length decimal
│
F7 05 FD 7F
   |  |  └─── Coefficient: FixedInt 127
   |  └───── Exponent: 1-byte FlexInt -2
   └─────── Decimal length: FlexUInt 2

Figure 36: Encoding of 0d3, which has a coefficient of zero

┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 1 indicates a 1-byte decimal
││
71 07
   └────── Exponent: FlexInt 3; no more bytes follow, so the coefficient is implicitly 0

Figure 37: Encoding of -0d3, which has a coefficient of negative zero

┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 07 00
   |  └─── Coefficient: 1-byte FixedInt 0, indicating a coefficient of -0
   └────── Exponent: FlexInt 3

Figure 38: Encoding of null.decimal

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: decimal
│  │
EB 03
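
The decimal layout, including the distinction between an absent coefficient and a present zero coefficient, can be sketched as a decoder. The names `decode_decimal` and `_read_flex` are hypothetical, and the result is returned as a `(coefficient, exponent, is_negative_zero)` tuple purely for illustration.

```python
def _read_flex(data: bytes, signed: bool) -> tuple[int, int]:
    """Read a FlexUInt (signed=False) or FlexInt (signed=True)."""
    width = 1
    for byte in data:
        if byte:
            width += (byte & -byte).bit_length() - 1
            break
        width += 8
    else:
        raise ValueError("no terminal 1 bit found")
    if len(data) < width:
        raise ValueError("truncated Flex integer")
    return int.from_bytes(data[:width], "little", signed=signed) >> width, width


def decode_decimal(data: bytes) -> tuple[int, int, bool]:
    """Return (coefficient, exponent, is_negative_zero) for a decimal value."""
    opcode = data[0]
    if opcode == 0xF7:
        length, used = _read_flex(data[1:], signed=False)
        start = 1 + used
    elif 0x70 <= opcode <= 0x7F:
        length, start = opcode & 0x0F, 1
    else:
        raise ValueError("not a decimal opcode")
    body = data[start:start + length]
    if len(body) < length:
        raise ValueError("truncated decimal")
    if not body:
        return 0, 0, False                   # zero-byte decimal: 0d0
    exponent, used = _read_flex(body, signed=True)
    coefficient_bytes = body[used:]
    if not coefficient_bytes:
        return 0, exponent, False            # absent coefficient: positive zero
    coefficient = int.from_bytes(coefficient_bytes, "little", signed=True)
    # A coefficient that is present but zero denotes negative zero.
    return coefficient, exponent, coefficient == 0
```

For the bytes in Figures 34 and 35 this returns `(127, -2, False)` in both cases, since the two encodings differ only in how the length is expressed.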

Timestamps

Note

In Ion 1.0, text timestamp fields were encoded using the local time while binary timestamp fields were encoded using UTC time. This required applications to perform conversion logic when transcribing from one format to the other. In Ion 1.1, all binary timestamp fields are encoded in local time.

Timestamps have two encodings:

Short-form timestamps: A compact representation optimized for the most commonly used precisions and date ranges.
Long-form timestamps: A less compact representation capable of representing any timestamp in the Ion data model.

0xEB 0x04 represents null.timestamp.

Figure 39: Encoding of null.timestamp

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: timestamp
│  │
EB 04

Short-form Timestamp

If an opcode has a high nibble of 0x8_, it represents a short-form timestamp. This encoding focuses on making the most common timestamp precisions and ranges the most compact; less common precisions can still be expressed via the variable-length long-form timestamp encoding.

Timestamps may be encoded using the short form if they meet all of the following conditions:

The year is between 1970 and 2097. The year subfield is encoded as the number of years since 1970. 7 bits are dedicated to representing the biased year, allowing timestamps through the year 2097 to be encoded in this form.
The local offset is either UTC, unknown, or falls between -14:00 and +14:00 and is divisible by 15 minutes. 7 bits are dedicated to representing the local offset as the number of quarter hours from -56 (that is: offset -14:00). The value 0b1111111 indicates an unknown offset. At the time of this writing (2023-05T), all real-world offsets fall between -12:00 and +14:00 and are multiples of 15 minutes.
The fractional seconds are a common precision. The timestamp’s fractional second precision (if present) is either 3 digits (milliseconds), 6 digits (microseconds), or 9 digits (nanoseconds).

Opcodes by precision and offset

Each opcode with a high nibble of 0x8_ indicates a different precision and offset encoding pair.

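Opcode	Precision	Serialized size in bytes*	Offset encoding
`0x80`	Year	1	Implicitly Unknown offset
`0x81`	Month	2	Implicitly Unknown offset
`0x82`	Day	2	Implicitly Unknown offset
`0x83`	Hour and minutes	4	1 bit to indicate UTC or Unknown Offset
`0x84`	Seconds	5	1 bit to indicate UTC or Unknown Offset
`0x85`	Milliseconds	6	1 bit to indicate UTC or Unknown Offset
`0x86`	Microseconds	7	1 bit to indicate UTC or Unknown Offset
`0x87`	Nanoseconds	8	1 bit to indicate UTC or Unknown Offset
`0x88`	Hour and minutes	5	7 bits to represent a known offset. This encoding can also represent UTC and Unknown Offset, though it is less compact than opcodes `0x83`-`0x87` above.
`0x89`	Seconds	5	7 bits to represent a known offset (as for `0x88`)
`0x8A`	Milliseconds	7	7 bits to represent a known offset (as for `0x88`)
`0x8B`	Microseconds	8	7 bits to represent a known offset (as for `0x88`)
`0x8C`	Nanoseconds	9	7 bits to represent a known offset (as for `0x88`)
`0x8D`	Reserved
`0x8E`	Reserved
`0x8F`	Reserved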

* Serialized size in bytes does not include the opcode.

The body of a short-form timestamp is encoded as a FixedUInt of the size specified by the opcode. This integer is then partitioned into bit-fields representing the timestamp’s subfields. Note that endianness does not apply here because the bit-fields are defined over the body interpreted as an integer.

The following letters are used to denote the bits in each subfield in the diagrams that follow. Subfields occur in the same order in all encoding variants, and consume the same number of bits, with the exception of the fractional bits, which consume only enough bits to represent the fractional precision supported by the opcode being used.

The Month and Day subfields are one-based; 0 is not a valid month or day.

Letter code	Number of bits	Subfield
`Y`	7	Year
`M`	4	Month
`D`	5	Day
`H`	5	Hour
`m`	6	Minute
`o`	7	Offset
`U`	1	Unknown (`0`) or UTC (`1`) offset
`s`	6	Second
`f`	10 (ms) 20 (μs) 30 (ns)	Fractional second
`.`	n/a	Unused

We will denote the timestamp encoding as follows with each byte ordered vertically from top to bottom. The respective bits are denoted using the letter codes defined in the table above.

          7       0 <--- bit position
          |       |
         +=========+
byte 0   |  0xNN   | <-- hex notation for constants like opcodes
         +=========+ <-- boundary between encoding primitives (e.g., opcode/`FlexUInt`)
     1   |nnnn:nnnn| <-- bits denoted with a `:` as a delimiter to aid in reading
         +---------+ <-- octet boundary within an encoding primitive
         ...
         +---------+
     N   |nnnn:nnnn|
         +=========+

The bytes are read from top to bottom (least significant to most significant), while the bits within each byte should be read from right to left (also least significant to most significant).

Note

While this encoding may complicate human reading, it guarantees that the timestamp’s subfields (year, month, etc.) occupy the same contiguous bit indexes regardless of how many bytes there are overall. (The last subfield, fractional_seconds, always begins at the same bit index when present, but can vary in length according to the precision.) This arrangement allows processors to read the little-endian bytes into an integer and then mask the appropriate bit ranges to access the subfields.

Figure 40: Encoding of a timestamp with year precision

         +=========+
byte 0   |  0x80   |
         +=========+
     1   |.YYY:YYYY|
         +=========+

Figure 41: Encoding of a timestamp with month precision

         +=========+
byte 0   |  0x81   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |....:.MMM|
         +=========+

Figure 42: Encoding of a timestamp with day precision

         +=========+
byte 0   |  0x82   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +=========+

Figure 43: Encoding of a timestamp with hour-and-minutes precision at UTC or unknown offset

         +=========+
byte 0   |  0x83   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |....:Ummm|
         +=========+

Figure 44: Encoding of a timestamp with seconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x84   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |....:..ss|
         +=========+

Figure 45: Encoding of a timestamp with milliseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x85   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |....:ffff|
         +=========+

Figure 46: Encoding of a timestamp with microseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x86   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |..ff:ffff|
         +=========+

Figure 47: Encoding of a timestamp with nanoseconds precision at UTC or unknown offset

         +=========+
byte 0   |  0x87   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |ffff:ffff|
         +=========+

Figure 48: Encoding of a timestamp with hour-and-minutes precision at known offset

         +=========+
byte 0   |  0x88   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |....:..oo|
         +=========+

Figure 49: Encoding of a timestamp with seconds precision at known offset

         +=========+
byte 0   |  0x89   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +=========+

Figure 50: Encoding of a timestamp with milliseconds precision at known offset

         +=========+
byte 0   |  0x8A   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |....:..ff|
         +=========+

Figure 51: Encoding of a timestamp with microseconds precision at known offset

         +=========+
byte 0   |  0x8B   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |....:ffff|
         +=========+

Figure 52: Encoding of a timestamp with nanoseconds precision at known offset

         +=========+
byte 0   |  0x8C   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |ffff:ffff|
         +---------+
     9   |..ff:ffff|
         +=========+

Table 2. Examples of short-form timestamps

Text	Binary
2023T	`80 35`
2023-10-15T	`82 35 7D`
2023-10-15T11:22:33Z	`84 35 7D CB 1A 02`
2023-10-15T11:22:33-00:00	`84 35 7D CB 12 02`
2023-10-15T11:22:33+01:15	`89 35 7D CB EA 85`
2023-10-15T11:22:33.444555666+01:15	`8C 35 7D CB EA 85 92 61 7F 1A`
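
The read-then-mask approach described above can be sketched for the UTC or unknown-offset opcodes (0x80 through 0x87). The known-offset opcodes follow the same pattern with different shifts and are omitted for brevity. `decode_short_timestamp` and the dictionary keys are hypothetical names chosen for illustration.

```python
# Body sizes in bytes, from the table of opcodes by precision and offset.
BODY_SIZE = {0x80: 1, 0x81: 2, 0x82: 2, 0x83: 4, 0x84: 5, 0x85: 6, 0x86: 7, 0x87: 8}
# Width of the fractional-second field for the opcodes that carry one.
FRACTION_BITS = {0x85: 10, 0x86: 20, 0x87: 30}


def decode_short_timestamp(data: bytes) -> dict:
    """Decode a short-form timestamp with opcode 0x80-0x87 into its subfields."""
    opcode = data[0]
    if opcode not in BODY_SIZE:
        raise ValueError("unsupported opcode in this sketch")
    size = BODY_SIZE[opcode]
    if len(data) < 1 + size:
        raise ValueError("truncated timestamp")
    # Read the whole body as one little-endian integer, then mask out subfields.
    body = int.from_bytes(data[1:1 + size], "little")
    fields = {"year": 1970 + (body & 0x7F)}           # 7 bits, biased by 1970
    if opcode >= 0x81:
        fields["month"] = (body >> 7) & 0x0F          # 4 bits
    if opcode >= 0x82:
        fields["day"] = (body >> 11) & 0x1F           # 5 bits
    if opcode >= 0x83:
        fields["hour"] = (body >> 16) & 0x1F          # 5 bits
        fields["minute"] = (body >> 21) & 0x3F        # 6 bits
        fields["utc"] = bool((body >> 27) & 1)        # 1 = UTC, 0 = unknown offset
    if opcode >= 0x84:
        fields["second"] = (body >> 28) & 0x3F        # 6 bits
    if opcode in FRACTION_BITS:
        fields["fraction"] = (body >> 34) & ((1 << FRACTION_BITS[opcode]) - 1)
    return fields
```

Because every subfield starts at a fixed bit index, the same shifts work for every precision; only the set of fields that are present changes.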

Warning

Opcodes 0x8D, 0x8E, and 0x8F are illegal; they are reserved for future use.

Long-form Timestamp

Unlike the short-form encoding, which is limited to the most commonly used timestamp ranges and precisions, the long-form timestamp encoding is capable of representing any valid timestamp.

The long form begins with opcode 0xF8. A FlexUInt follows indicating the number of bytes that were needed to represent the timestamp. The encoding consumes the minimum number of bytes required to represent the timestamp. The declared length can be mapped to the timestamp’s precision as follows:

Length	Corresponding precision
0	Illegal
1	Illegal
2	Year
3	Month or Day (see below)
4	Illegal; the hour cannot be specified without also specifying minutes
5	Illegal
6	Minutes
7	Seconds
8 or more	Fractional seconds

Unlike the short-form encoding, the long-form encoding reserves:

14 bits for the year (Y), which is not biased.
12 bits for the offset, which counts the number of minutes (not quarter-hours) from -1440 (that is: -24:00). An offset value of 0b111111111111 indicates an unknown offset.

Similar to short-form timestamps, with the exception of representing the fractional seconds, the components of the timestamp are encoded as bit-fields on a FixedUInt that corresponds to the length that followed the opcode.

If the timestamp’s overall length is greater than or equal to 8, the FixedUInt part of the timestamp is 7 bytes and the remaining bytes are used to encode fractional seconds. The fractional seconds are encoded as a (scale, coefficient) pair, which is similar to a decimal. The primary difference is that the scale represents a negative exponent because it is illegal for the fractional seconds value to be greater than or equal to 1.0 or less than 0.0. The scale is encoded as a FlexUInt (instead of FlexInt) to discourage the encoding of decimal numbers greater than 1.0. The coefficient is encoded as a FixedUInt (instead of FixedInt) to prevent the encoding of fractional seconds less than 0.0. Note that validation is still required; namely:

A scale value of 0 is illegal, as any non-zero coefficient would then represent a fractional seconds value of 1.0 (a whole second) or more.
If coefficient * 10^-scale >= 1.0, that (coefficient, scale) pair is illegal.

If the timestamp’s length is 3, the precision is determined by inspecting the day (DDDDD) bits. Like the short-form, the Month and Day subfields are one-based (0 is not a valid month or day). If the day subfield is zero, that indicates month precision. If the day subfield is any non-zero number, that indicates day precision.

Figure 53: Encoding of the body of a long-form timestamp

         +=========+
byte 0   |YYYY:YYYY|
         +---------+
     1   |MMYY:YYYY|
         +---------+
     2   |HDDD:DDMM|
         +---------+
     3   |mmmm:HHHH|
         +---------+
     4   |oooo:oomm|
         +---------+
     5   |ssoo:oooo|
         +---------+
     6   |....:ssss|
         +=========+
     7   |FlexUInt | <-- scale of the fractional seconds
         +---------+
         ...
         +=========+
     N   |FixedUInt| <-- coefficient of the fractional seconds
         +---------+
         ...

Table 3. Examples of long-form timestamps

Text	Binary
1947T	`F8 05 9B 07`
1947-12T	`F8 07 9B 07 03`
1947-12-23T	`F8 07 9B 07 5F`
1947-12-23T11:22:33-00:00	`F8 0F 9B 07 DF 65 FD 7F 08`
1947-12-23T11:22:33+01:15	`F8 0F 9B 07 DF 65 AD 57 08`
1947-12-23T11:22:33.127+01:15	`F8 13 9B 07 DF 65 AD 57 08 07 7F`
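
The long-form layout can be sketched as an encoder that packs the bit-fields into one integer and truncates it to the length implied by the precision. This is an illustration under assumptions: the function and parameter names are hypothetical, the fractional coefficient is written in the fewest whole bytes (at least one), and only minimal validation is performed.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    num_bytes = max(1, -(-value.bit_length() // 7))
    marked = (value << num_bytes) | (1 << (num_bytes - 1))
    return marked.to_bytes(num_bytes, "little")


# Body length in bytes for each precision, from the length table above.
LENGTHS = {"year": 2, "month": 3, "day": 3, "minute": 6, "second": 7}


def encode_long_timestamp(precision, year, month=0, day=0, hour=0, minute=0,
                          offset_minutes=None, second=0, fraction=None) -> bytes:
    """Encode a long-form timestamp; fraction is an optional (scale, coefficient)."""
    # 12-bit offset: minutes from -24:00, or all ones for an unknown offset.
    offset = 0xFFF if offset_minutes is None else offset_minutes + 1440
    bits = (year | month << 14 | day << 18 | hour << 23
            | minute << 28 | offset << 34 | second << 46)
    length = LENGTHS[precision]
    # Keep only the low bytes needed for the requested precision.
    body = (bits & ((1 << (8 * length)) - 1)).to_bytes(length, "little")
    if fraction is not None:
        if precision != "second":
            raise ValueError("fractional seconds require second precision")
        scale, coefficient = fraction
        if scale <= 0 or not 0 <= coefficient < 10 ** scale:
            raise ValueError("fractional seconds must be in [0, 1) with scale > 0")
        width = max(1, (coefficient.bit_length() + 7) // 8)
        body += encode_flex_uint(scale) + coefficient.to_bytes(width, "little")
    return b"\xF8" + encode_flex_uint(len(body)) + body
```

For example, `encode_long_timestamp("second", 1947, 12, 23, 11, 22, 75, 33, fraction=(3, 127))` yields the bytes in the last row of Table 3.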

Text

Strings

If the high nibble of the opcode is 0x9_, it represents a string. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0x90 represents a string with empty text ("").

Strings longer than 15 bytes can be encoded with the 0xF9 opcode, which takes a FlexUInt-encoded length after the opcode.

0xEB 0x05 represents null.string.

Figure 54: Encoding of the empty string, ""

┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
90

Figure 55: Encoding of a 14-byte string

┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││  f  o  u  r  t  e  e  n     b  y  t  e  s
9E 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
   └──────────────────┬────────────────────┘
                 UTF-8 bytes

Figure 56: Encoding of a 24-byte string

┌──── Opcode F9 indicates a variable-length string
│  ┌─── Length: FlexUInt 24
│  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     e  n  c  o  d  i  n  g
F9 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6F 64 69 6E 67
      └────────────────────────────────┬────────────────────────────────────┘
                                  UTF-8 bytes

Figure 57: Encoding of null.string

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: string
│  │
EB 05
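
The string encodings described above can be read with a few lines of code. The following Python sketch is illustrative rather than normative: `read_flex_uint` and `read_string` are hypothetical helper names, and the FlexUInt reader only handles encodings up to 8 bytes wide.

```python
def read_flex_uint(buf, pos=0):
    """Decode the FlexUInt at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first == 0:
        raise ValueError("this sketch only handles FlexUInts up to 8 bytes wide")
    width = (first & -first).bit_length()   # trailing zero bits + 1
    raw = int.from_bytes(buf[pos:pos + width], "little")
    return raw >> width, pos + width        # drop the length bits


def read_string(buf, pos=0):
    """Decode the string at buf[pos]; return (text or None, next_pos)."""
    opcode = buf[pos]
    if opcode & 0xF0 == 0x90:               # 0x90-0x9F: length in the low nibble
        length, start = opcode & 0x0F, pos + 1
    elif opcode == 0xF9:                    # a FlexUInt length follows the opcode
        length, start = read_flex_uint(buf, pos + 1)
    elif opcode == 0xEB and buf[pos + 1] == 0x05:
        return None, pos + 2                # null.string
    else:
        raise ValueError(f"opcode {opcode:#04x} does not begin a string")
    end = start + length
    return buf[start:end].decode("utf-8"), end
```

Symbols with inline text follow the same pattern with opcodes 0xA0-0xAF and 0xFA.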

Symbols With Inline Text

If the high nibble of the opcode is 0xA_, it represents a symbol whose text follows the opcode. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0xA0 represents a symbol with empty text ('').

0xEB 0x06 represents null.symbol.

Figure 58: Encoding of a symbol with empty text ('')

┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
A0

Figure 59: Encoding of a symbol with 14 bytes of inline text

┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││  f  o  u  r  t  e  e  n     b  y  t  e  s
AE 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
   └──────────────────┬────────────────────┘
                 UTF-8 bytes

Figure 60: Encoding of a symbol with 24 bytes of inline text

┌──── Opcode FA indicates a variable-length symbol with inline text
│  ┌─── Length: FlexUInt 24
│  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     e  n  c  o  d  i  n  g
FA 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
      └────────────────────────────────┬────────────────────────────────────┘
                                  UTF-8 bytes

Figure 61: Encoding of null.symbol

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: symbol
│  │
EB 06

Symbols With a Symbol Address

Symbol values whose text can be found in the local symbol table are encoded using opcodes 0xE1 through 0xE3:

0xE1 represents a symbol whose address in the symbol table (aka its symbol ID) is a 1-byte FixedUInt that follows the opcode.
0xE2 represents a symbol whose address in the symbol table is a 2-byte FixedUInt that follows the opcode.
0xE3 represents a symbol whose address in the symbol table is a FlexUInt that follows the opcode.

Writers MUST encode a symbol address in the smallest number of bytes possible. For each opcode above, the symbol address that is decoded is biased by the number of addresses that can be encoded in fewer bytes.

Opcode	Symbol address range	Bias
`0xE1`	0 to 255	0
`0xE2`	256 to 65,791	256
`0xE3`	65,792 to infinity	65,792
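
The bias in the table above can be applied while decoding. The following Python sketch assumes that FixedUInts are little-endian, as defined in the Encoding Primitives section; `read_flex_uint` and `read_symbol_address` are hypothetical helper names, and the FlexUInt reader only handles encodings up to 8 bytes wide.

```python
def read_flex_uint(buf, pos=0):
    """Decode the FlexUInt at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first == 0:
        raise ValueError("this sketch only handles FlexUInts up to 8 bytes wide")
    width = (first & -first).bit_length()   # trailing zero bits + 1
    raw = int.from_bytes(buf[pos:pos + width], "little")
    return raw >> width, pos + width


def read_symbol_address(buf, pos=0):
    """Decode a symbol value encoded as an address; return (address, next_pos)."""
    opcode = buf[pos]
    if opcode == 0xE1:                      # 1-byte FixedUInt, bias 0
        return buf[pos + 1], pos + 2
    if opcode == 0xE2:                      # 2-byte FixedUInt, bias 256
        raw = int.from_bytes(buf[pos + 1:pos + 3], "little")
        return raw + 256, pos + 3
    if opcode == 0xE3:                      # FlexUInt, bias 65,792
        raw, end = read_flex_uint(buf, pos + 1)
        return raw + 65792, end
    raise ValueError(f"opcode {opcode:#04x} is not a symbol address")
```

Because each opcode's range begins where the previous one ends, every address has exactly one legal encoding under this scheme.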

Binary Data

Blobs

Opcode FE indicates a blob of binary data. A FlexUInt follows that represents the blob’s byte-length.

0xEB 0x07 represents null.blob.

Figure 62: Encoding of a blob with 24 bytes of data

┌──── Opcode FE indicates a blob, FlexUInt length follows
│   ┌─── Length: FlexUInt 24
│   │
FE 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
      └────────────────────────────────┬────────────────────────────────────┘
                            24 bytes of binary data

Figure 63: Encoding of null.blob

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: blob
│  │
EB 07

Clobs

Opcode FF indicates a clob—binary character data of an unspecified encoding. A FlexUInt follows that represents the clob’s byte-length.

0xEB 0x08 represents null.clob.

Figure 64: Encoding of a clob with 24 bytes of data

┌──── Opcode FF indicates a clob, FlexUInt length follows
│   ┌─── Length: FlexUInt 24
│   │
FF 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
      └────────────────────────────────┬────────────────────────────────────┘
                            24 bytes of binary data

Figure 65: Encoding of null.clob

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: clob
│  │
EB 08

Containers

Each of the container types (list, s-expression, and struct) has both a length-prefixed encoding and a delimited encoding.

The length-prefixed encoding places more burden on the writer, but simplifies reading and enables skipping over uninteresting values in the data stream. In contrast, the delimited encoding is simpler and faster for writers, but requires the reader to visit each child value in turn to skip over the container.

Lists

Length-prefixed encoding

An opcode with a high nibble of 0xB_ indicates a length-prefixed list. The lower nibble of the opcode indicates how many bytes were used to encode the child values that the list contains.

If the list’s encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFB opcode to write a variable-length list. The 0xFB opcode is followed by a FlexUInt that indicates the list’s byte length.

0xEB 0x09 represents null.list.

Figure 66: Length-prefixed encoding of an empty list ([])

┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 0 indicates that the child values of this list took zero bytes to encode.
B0

Figure 67: Length-prefixed encoding of [1, 2, 3]

┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 6 indicates that the child values of this list took six bytes to encode.
B6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3

Figure 68: Length-prefixed encoding of ["variable length list"]

┌──── Opcode 0xFB indicates a variable-length list. A FlexUInt length follows.
│  ┌───── Length: FlexUInt 22
│  │  ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│  │  │  ┌─────── Length: FlexUInt 20
│  │  │  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     l  i  s  t
FB 2d F9 29 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 6c 69 73 74
      └─────────────────────────────┬─────────────────────────────────┘
                          Nested string element

Figure 69: Encoding of null.list

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: list
│  │
EB 09
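
Because the byte length is known up front, a reader can step over a length-prefixed list without inspecting its children. The following Python sketch illustrates this; `read_flex_uint` and `list_payload_span` are hypothetical helper names, and the FlexUInt reader only handles encodings up to 8 bytes wide.

```python
def read_flex_uint(buf, pos=0):
    """Decode the FlexUInt at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first == 0:
        raise ValueError("this sketch only handles FlexUInts up to 8 bytes wide")
    width = (first & -first).bit_length()   # trailing zero bits + 1
    raw = int.from_bytes(buf[pos:pos + width], "little")
    return raw >> width, pos + width


def list_payload_span(buf, pos=0):
    """Return (start, end) offsets of the child bytes of the list at buf[pos]."""
    opcode = buf[pos]
    if opcode & 0xF0 == 0xB0:               # 0xB0-0xBF: length in the low nibble
        start, length = pos + 1, opcode & 0x0F
    elif opcode == 0xFB:                    # a FlexUInt length follows the opcode
        length, start = read_flex_uint(buf, pos + 1)
    else:
        raise ValueError(f"opcode {opcode:#04x} is not a length-prefixed list")
    return start, start + length
```

Skipping the list amounts to continuing at the returned `end` offset.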

Delimited Encoding

Opcode 0xF1 begins a delimited list, while opcode 0xF0 closes the most recently opened delimited container that has not yet been closed.

Figure 70: Delimited encoding of an empty list ([])

┌──── Opcode 0xF1 indicates a delimited list
│  ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F1 F0

Figure 71: Delimited encoding of [1, 2, 3]

┌──── Opcode 0xF1 indicates a delimited list
│                    ┌─── Opcode 0xF0 indicates the end of
│                    │    the most recently opened container
F1 61 01 61 02 61 03 F0
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3

Figure 72: Delimited encoding of [1, [2], 3]

┌──── Opcode 0xF1 indicates a delimited list
│        ┌─── Opcode 0xF1 begins a nested delimited list
│        │        ┌─── Opcode 0xF0 closes the most recently
│        │        │    opened delimited container: the nested list.
│        │        │        ┌─── Opcode 0xF0 closes the most recently opened (and still open)
│        │        │        │    delimited container: the outer list.
│        │        │        │
F1 61 01 F1 61 02 F0 61 03 F0
   └─┬─┘    └─┬─┘    └─┬─┘
     1        2        3
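
Skipping a delimited list requires visiting every child, as the following Python sketch illustrates. It is deliberately narrow: it understands only the opcodes used in the figures above (0x61 for a 1-byte int, 0xF1, and 0xF0), and `skip_delimited_list` is a hypothetical helper name.

```python
def skip_delimited_list(buf, pos=0):
    """Return the offset just past the delimited list that starts at buf[pos]."""
    if buf[pos] != 0xF1:
        raise ValueError("expected opcode 0xF1 (start of a delimited list)")
    depth = 0
    while True:
        opcode = buf[pos]
        if opcode == 0xF1:                  # open a (possibly nested) list
            depth += 1
            pos += 1
        elif opcode == 0xF0:                # close the most recently opened list
            depth -= 1
            pos += 1
            if depth == 0:
                return pos
        elif opcode == 0x61:                # 1-byte int: opcode plus one byte
            pos += 2
        else:
            raise ValueError(f"opcode {opcode:#04x} is not handled by this sketch")
```

A complete reader would need to know the length rules for every opcode in order to step over each child.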

S-Expressions

S-expressions use the same encodings as lists, but with different opcodes.

Opcode	Encoding
`0xC0`-`0xCF`	Length-prefixed S-expression; low nibble of the opcode represents the byte-length.
`0xFC`	Variable-length prefixed S-expression; a `FlexUInt` following the opcode represents the byte-length.
`0xF2`	Starts a delimited S-expression; `0xF0` closes the most recently opened delimited container.

0xEB 0x0A represents null.sexp.

Figure 73: Length-prefixed encoding of an empty S-expression (())

┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 0 indicates that the child values of this S-expression took zero bytes to encode.
C0

Figure 74: Length-prefixed encoding of (1 2 3)

┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 6 indicates that the child values of this S-expression took six bytes to encode.
C6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3

Figure 75: Length-prefixed encoding of ("variable length sexp")

┌──── Opcode 0xFC indicates a variable-length sexp. A FlexUInt length follows.
│  ┌───── Length: FlexUInt 22
│  │  ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│  │  │  ┌─────── Length: FlexUInt 20
│  │  │  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     s  e  x  p
FC 2D F9 29 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 73 65 78 70
      └─────────────────────────────┬─────────────────────────────────┘
                          Nested string element

Figure 76: Delimited encoding of an empty S-expression (())

┌──── Opcode 0xF2 indicates a delimited S-expression
│  ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F2 F0

Figure 77: Delimited encoding of (1 2 3)

┌──── Opcode 0xF2 indicates a delimited S-expression
│                    ┌─── Opcode 0xF0 indicates the end of
│                    │    the most recently opened container
F2 61 01 61 02 61 03 F0
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3

Figure 78: Delimited encoding of (1 (2) 3)

┌──── Opcode 0xF2 indicates a delimited S-expression
│        ┌─── Opcode 0xF2 begins a nested delimited S-expression
│        │        ┌─── Opcode 0xF0 closes the most recently
│        │        │    opened delimited container: the nested S-expression.
│        │        │        ┌─── Opcode 0xF0 closes the most recently opened (and still open)
│        │        │        │    delimited container: the outer S-expression.
│        │        │        │
F2 61 01 F2 61 02 F0 61 03 F0
   └─┬─┘    └─┬─┘    └─┬─┘
     1        2        3

Figure 79: Encoding of null.sexp

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: sexp
│  │
EB 0A

Structs

Structs have 3 available encodings:

Structs with symbol address field names
Structs with FlexSym field names
Delimited structs with FlexSym field names

0xEB 0x0B represents null.struct.

Figure 80: Encoding of null.struct

┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: struct
│  │
EB 0B

Structs With Symbol Address Field Names

An opcode with a high nibble of 0xD_ indicates a struct with symbol address field names (which is similar to the only available encoding of structs in Ion 1.0). The lower nibble of the opcode indicates how many bytes were used to encode all of its nested (field name, value) pairs.

If the struct’s encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFD opcode to write a variable-length struct with symbol address field names. The 0xFD opcode is followed by a FlexUInt that indicates the byte length.

Each field in the struct is encoded as a FlexUInt representing the address of the field name’s text in the symbol table, followed by an opcode-prefixed value.

The symbol address $0 cannot be encoded in this format because the FlexUInt 0 in the field name position is reserved for switching the struct to FlexSym field names.

Figure 81: Length-prefixed encoding of an empty struct ({})

┌──── An opcode in the range 0xD0-0xDF indicates a struct with symbol address field names
│┌─── A lower nibble of 0 indicates that the struct's fields took zero bytes to encode
D0

Figure 82: Length-prefixed encoding of {$10: 1, $11: 2}

┌──── An opcode in the range 0xD0-0xDF indicates a struct with symbol address field names
│  ┌─── Field name: FlexUInt 10 ($10)
│  │        ┌─── Field name: FlexUInt 11 ($11)
│  │        │
D6 15 61 01 17 61 02
      └─┬─┘    └─┬─┘
        1        2

Figure 83: Length-prefixed encoding of {$10: "variable length struct"}

 ┌───────────── Opcode `FD` indicates a variable length struct with symbol address field names
 │  ┌────────── Length: FlexUInt 25
 │  │  ┌─────── Field name: FlexUInt 10 ($10)
 │  │  │  ┌──── Opcode `F9` indicates a variable length string
 │  │  │  │  ┌─ Length: FlexUInt 22; the string is 22 bytes long
 │  │  │  │  │  v  a  r  i  a  b  l  e     l  e  n  g  t  h     s  t  r  u  c  t
FD 33 15 F9 2D 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 73 74 72 75 63 74
               └─────────────────────────────┬─────────────────────────────────┘
                                        UTF-8 bytes
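
The field layout described in this section can be walked as shown in the following Python sketch. To stay short it decodes only 1-byte int values (opcode 0x61) and rejects the FlexUInt 0 that switches a struct to FlexSym field names; `read_flex_uint` and `read_struct_fields` are hypothetical helper names.

```python
def read_flex_uint(buf, pos=0):
    """Decode the FlexUInt at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first == 0:
        raise ValueError("this sketch only handles FlexUInts up to 8 bytes wide")
    width = (first & -first).bit_length()   # trailing zero bits + 1
    raw = int.from_bytes(buf[pos:pos + width], "little")
    return raw >> width, pos + width


def read_struct_fields(buf, pos=0):
    """Return ([(symbol_address, int_value), ...], next_pos) for a struct."""
    opcode = buf[pos]
    if opcode & 0xF0 == 0xD0:               # 0xD0-0xDF: length in the low nibble
        length, pos = opcode & 0x0F, pos + 1
    elif opcode == 0xFD:                    # a FlexUInt length follows the opcode
        length, pos = read_flex_uint(buf, pos + 1)
    else:
        raise ValueError(f"opcode {opcode:#04x} is not a length-prefixed struct")
    end = pos + length
    fields = []
    while pos < end:
        address, pos = read_flex_uint(buf, pos)
        if address == 0:
            raise ValueError("FlexUInt 0 switches to FlexSym names (not handled)")
        if buf[pos] != 0x61:
            raise ValueError("this sketch only decodes 1-byte int values")
        value = int.from_bytes(buf[pos + 1:pos + 2], "little", signed=True)
        fields.append((address, value))
        pos += 2
    return fields, pos
```

The fields are returned as a list of pairs rather than a dictionary so that repeated field names are preserved.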

Structs With `FlexSym` Field Names

Note

This encoding is very similar to structs with symbol address field names, but allows writers to choose between representing each field name as a symbol address (for example: $10) or as inline UTF-8 bytes (for example: "foo"). This encoding is potentially less dense, but offers writers significant flexibility over whether and when field names are added to the symbol table.

All length-prefixed structs begin as structs with symbol address field names. However, they can be switched to use FlexSym field names at any time by emitting the FlexUInt 0 (the byte 0x01) in the field name position. Once a struct has been switched to the FlexSym field name encoding, it cannot be switched back.

Each field in the struct is encoded as a FlexSym field name, followed by an opcode-prefixed value.

Figure 84: Length-prefixed encoding of {"foo": 1, $11: 2}

┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 10
││ ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode
││ │  ┌─ FlexSym -3     ┌─ FlexSym: 11 ($11)
││ │  │   f  o  o       │
DA 01 FD 66 6F 6F 61 01 17 61 02
         └──┬───┘ └─┬─┘    └─┬─┘
         3 UTF-8    1        2
          bytes

Figure 85: Length-prefixed encoding of {$11: 1, "foo": 2}

┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 10
││ ┌─ FlexSym: 11 ($11)
││ │        ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode
││ │        │  ┌─ FlexSym -3
││ │        │  │   f  o  o
DA 17 61 01 01 FD 66 6F 6F 61 02
      └─┬─┘       └──┬───┘ └─┬─┘
        1         3 UTF-8    2
                   bytes

Figure 86: Length-prefixed encoding of {$0: 1}

┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 5
││ ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode
││ │  ┌── FlexSym "escape"
││ │  │
││ │  │
D5 01 01 A0 61 01
      └─┬─┘ └─┬─┘
       $0     1

TODO: Demonstrate splicing macro values into the struct via FlexSym escape code 0x01.

Delimited Structs

Opcode 0xF3 indicates the beginning of a delimited struct with FlexSym field names.

Unlike lists and S-expressions, structs cannot use opcode 0xF0 by itself to indicate the end of the delimited container. This is because 0xF0 is a valid FlexSym (a symbol with 16 bytes of inline text). To close the delimited struct, the writer emits a 0x01 byte (a FlexSym escape) followed by the opcode 0xF0.

Note	While length-prefixed structs can choose between structs with symbol address field names and structs with `FlexSym` field names, delimited structs always use `FlexSym`-encoded field names.

Figure 87: Delimited encoding of the empty struct ({})

┌─── Opcode 0xF3 indicates the beginning of a delimited struct with `FlexSym` field names.
│  ┌─── FlexSym escape code 0 (0x01): an opcode follows
│  │  ┌─── Opcode 0xF0 indicates the end of the most
│  │  │    recently opened delimited container
F3 01 F0

Figure 88: Delimited encoding of {"foo": 1, $11: 2}

┌─── Opcode 0xF3 indicates the beginning of a delimited struct with `FlexSym` field names.
│
│  ┌─ FlexSym -3     ┌─ FlexSym: 11 ($11)
│  │                 │        ┌─── FlexSym escape code 0 (0x01): an opcode follows
│  │                 │        │  ┌─── Opcode 0xF0 indicates the end of the most
│  │   f  o  o       │        │  │    recently opened delimited container
F3 FD 66 6F 6F 61 01 17 61 02 01 F0
      └──┬───┘ └─┬─┘    └─┬─┘
      3 UTF-8    1        2
       bytes

Nulls

The opcode 0xEA indicates an untyped null (that is: null, or its alias null.null).

The opcode 0xEB indicates a typed null; a byte follows whose value represents an offset into the following table:

Byte	Type
`0x00`	`null.bool`
`0x01`	`null.int`
`0x02`	`null.float`
`0x03`	`null.decimal`
`0x04`	`null.timestamp`
`0x05`	`null.string`
`0x06`	`null.symbol`
`0x07`	`null.blob`
`0x08`	`null.clob`
`0x09`	`null.list`
`0x0A`	`null.sexp`
`0x0B`	`null.struct`

All other byte values are reserved for future use.

Note	Future versions of Ion may decide to generalize this into a "constants" table.

Figure 89: Encoding of null

┌──── The opcode `0xEA` represents a null (null.null)
EA

Figure 90: Encoding of null.string

┌──── The opcode `0xEB` indicates a typed null; a byte indicating the type follows
│  ┌──── Byte 0x05 indicates the type `string`
EB 05
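
The table above maps directly onto an indexed lookup. The following Python sketch illustrates this; `NULL_TYPES` and `read_null` are hypothetical names.

```python
# Index in this list == the type byte that follows opcode 0xEB.
NULL_TYPES = [
    "bool", "int", "float", "decimal", "timestamp", "string",
    "symbol", "blob", "clob", "list", "sexp", "struct",
]


def read_null(buf, pos=0):
    """Decode a null at buf[pos]; return (text form, next_pos)."""
    opcode = buf[pos]
    if opcode == 0xEA:                      # untyped null
        return "null", pos + 1
    if opcode == 0xEB:                      # typed null; a type byte follows
        index = buf[pos + 1]
        if index >= len(NULL_TYPES):
            raise ValueError(f"null type byte {index:#04x} is reserved")
        return "null." + NULL_TYPES[index], pos + 2
    raise ValueError(f"opcode {opcode:#04x} is not a null")
```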

Annotations

Annotations can be encoded either as symbol addresses or as FlexSyms. In both encodings, the annotations sequence appears just before the value that it decorates.

It is illegal for an annotations sequence to appear before any of the following:

Another annotations sequence
The end of the stream
A NOP
An E-expression (that is: a macro invocation). To add annotations to the expansion of an E-expression, see the annotate macro. (TODO: Link)

Annotations With Symbol Addresses

Opcodes 0xE4 through 0xE6 indicate one or more annotations encoded as symbol addresses. If the opcode is:

0xE4, a single FlexUInt-encoded symbol address follows.
0xE5, two FlexUInt-encoded symbol addresses follow.
0xE6, a FlexUInt follows that represents the number of bytes needed to encode the annotations sequence, which can be made up of any number of FlexUInt symbol addresses.

Figure 91: Encoding of $10::false

┌──── The opcode `0xE4` indicates a single annotation encoded as a symbol address follows
│  ┌──── Annotation with symbol address: FlexUInt 10
E4 15 6F
      └── The annotated value: `false`

Figure 92: Encoding of $10::$11::false

┌──── The opcode `0xE5` indicates that two annotations encoded as symbol addresses follow
│  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
E5 15 17 6F
         └── The annotated value: `false`

Figure 93: Encoding of $10::$11::$12::false

┌──── The opcode `0xE6` indicates a variable-length sequence of symbol address annotations;
│     a FlexUInt follows representing the length of the sequence.
│  ┌──── Annotations sequence length: FlexUInt 3; three bytes of annotations follow
│  │  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
│  │  │  │  ┌──── Annotation with symbol address: FlexUInt 12 ($12)
E6 07 15 17 19 6F
               └── The annotated value: `false`
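
Opcodes 0xE4 and 0xE5 imply a count of annotations, while 0xE6 carries a byte length. The following Python sketch reads all three forms; `read_flex_uint` and `read_annotation_addresses` are hypothetical helper names, and the FlexUInt reader only handles encodings up to 8 bytes wide.

```python
def read_flex_uint(buf, pos=0):
    """Decode the FlexUInt at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first == 0:
        raise ValueError("this sketch only handles FlexUInts up to 8 bytes wide")
    width = (first & -first).bit_length()   # trailing zero bits + 1
    raw = int.from_bytes(buf[pos:pos + width], "little")
    return raw >> width, pos + width


def read_annotation_addresses(buf, pos=0):
    """Return ([symbol addresses], offset of the annotated value's opcode)."""
    opcode = buf[pos]
    pos += 1
    addresses = []
    if opcode in (0xE4, 0xE5):              # exactly one or two addresses
        for _ in range(opcode - 0xE3):
            address, pos = read_flex_uint(buf, pos)
            addresses.append(address)
    elif opcode == 0xE6:                    # byte length, then any number
        length, pos = read_flex_uint(buf, pos)
        end = pos + length
        while pos < end:
            address, pos = read_flex_uint(buf, pos)
            addresses.append(address)
        if pos != end:
            raise ValueError("annotation overran the declared sequence length")
    else:
        raise ValueError(f"opcode {opcode:#04x} is not an annotations sequence")
    return addresses, pos
```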

Annotations With `FlexSym` Text

Opcodes 0xE7 through 0xE9 indicate one or more annotations encoded as FlexSyms.

If the opcode is:

0xE7, a single FlexSym-encoded symbol follows.
0xE8, two FlexSym-encoded symbols follow.
0xE9, a FlexUInt follows that represents the byte length of the annotations sequence, which is made up of any number of annotations encoded as FlexSyms.

While this encoding is more flexible than annotations with symbol addresses, it can be slightly less compact when all the annotations are encoded as symbol addresses.

Figure 94: Encoding of $10::false

┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation with symbol address: FlexSym 10 ($10)
E7 15 6F
      └── The annotated value: `false`

Figure 95: Encoding of foo::false

┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │   f  o  o
E7 FD 66 6F 6F 6F
      └──┬───┘ └── The annotated value: `false`
      3 UTF-8
       bytes

Note that FlexSym annotation sequences can switch between symbol address and inline text on a per-annotation basis.

Figure 96: Encoding of $10::foo::false

┌──── The opcode `0xE8` indicates two annotations encoded as FlexSyms follow
│  ┌──── Annotation: FlexSym 10 ($10)
│  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │   f  o  o
E8 15 FD 66 6F 6F 6F
         └──┬───┘ └── The annotated value: `false`
         3 UTF-8
          bytes

Figure 97: Encoding of $10::foo::$11::false

┌──── The opcode `0xE9` indicates a variable-length sequence of FlexSym-encoded annotations
│  ┌──── Length: FlexUInt 6
│  │  ┌──── Annotation: FlexSym 10 ($10)
│  │  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │  │           ┌──── Annotation: FlexSym 11 ($11)
│  │  │  │   f  o  o │
E9 0D 15 FD 66 6F 6F 17 6F
            └──┬───┘    └── The annotated value: `false`
            3 UTF-8
             bytes

`NOP`s

A NOP (short for "no-operation") is the binary equivalent of whitespace. NOP bytes have no meaning, but can be used as padding to achieve a desired alignment.

An opcode of 0xEC indicates a single-byte NOP pad. An opcode of 0xED indicates that a FlexUInt follows that represents the number of additional bytes to skip.

It is legal for a NOP to appear anywhere that a value can be encoded. It is not legal for a NOP to appear in annotation sequences or struct field names. If a NOP appears in place of a struct field value, then the associated field name is ignored; the NOP is immediately followed by the next field name, if any.

Figure 98: Encoding of a 1-byte NOP

┌──── The opcode `0xEC` represents a 1-byte NOP pad
│
EC

Figure 99: Encoding of a 4-byte NOP

┌──── The opcode `0xED` represents a variable-length NOP pad; a FlexUInt length follows
│  ┌──── Length: FlexUInt 2; two more bytes of NOP follow
│  │
ED 05 93 C6
      └─┬─┘
NOP bytes, values ignored
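
A reader positioned where a value may appear can discard any run of NOPs before decoding. The following Python sketch illustrates this; `read_flex_uint` and `skip_nops` are hypothetical helper names, and the FlexUInt reader only handles encodings up to 8 bytes wide.

```python
def read_flex_uint(buf, pos=0):
    """Decode the FlexUInt at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first == 0:
        raise ValueError("this sketch only handles FlexUInts up to 8 bytes wide")
    width = (first & -first).bit_length()   # trailing zero bits + 1
    raw = int.from_bytes(buf[pos:pos + width], "little")
    return raw >> width, pos + width


def skip_nops(buf, pos=0):
    """Return the offset of the first byte at or after pos that is not NOP padding."""
    while pos < len(buf):
        if buf[pos] == 0xEC:                # single-byte NOP
            pos += 1
        elif buf[pos] == 0xED:              # FlexUInt count of additional bytes
            extra, pos = read_flex_uint(buf, pos + 1)
            pos += extra
        else:
            break
    return pos
```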

E-expression Arguments

The binary encoding of E-expressions (aka macro invocations) starts with the address of the macro to expand. The address can be encoded as part of the opcode, as a FixedUInt that follows the opcode, or as a FlexUInt that follows the opcode.

The encoding of the E-expression’s arguments depends on their respective types. Argument types can be classified as belonging to one of two categories: tagged encodings and tagless encodings.

Tagged Encodings

Tagged types are argument types whose encoding begins with an opcode, sometimes informally called a 'tag'. These include the core types and the abstract types.

Core types

The core types are the 13 types in the Ion data model: null, bool, int, float, decimal, timestamp, string, symbol, blob, clob, list, sexp, and struct.

Abstract types

The abstract types are unions of two or more of the core types.

Abstract type	Included Ion types
`any`	All core Ion types
`number`	`int`, `float`, `decimal`
`exact`	`int`, `decimal`
`text`	`string`, `symbol`
`lob`	`blob`, `clob`
`sequence`	`list`, `sexp`

Tagged E-expression Argument Encoding

When a macro parameter has a tagged type, the encoding of that parameter’s corresponding argument in an E-expression is identical to how it would be encoded anywhere else in an Ion stream: it has a leading opcode that dictates how many bytes follow and how they should be interpreted. This is very flexible, but makes it possible for writers to encode values that conflict with the parameter’s declared type. Because of this, the macro expander will read the argument and then check its type against the parameter’s declared type. If it does not match, the macro expander must raise an error.

Macro foo (defined below) is used in this section’s subsequent examples to demonstrate the encoding of tagged-type arguments.

Figure 100: Definition of example macro foo at address 0

(macro
    foo           // Macro name
    (number::x!)  // Parameters
    /*...*/       // Template (elided)
)

Figure 101: Encoding of E-expression (:foo 3.14e0)

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0x6B indicates a 2-byte float; an IEEE-754 half-precision float follows
│  │
00 6B 47 42
      └─┬─┘
      3.14e0

// The macro expander confirms that `3.14e0` (a `float`) matches the expected type: `number`.

Figure 102: Encoding of E-expression (:foo 9)

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0x61 indicates a 1-byte integer. A 1-byte FixedInt follows.
│  │  ┌──── A 1-byte FixedInt: 9
00 61 09

// The macro expander confirms that `9` (an `int`) matches the expected type: `number`.

Figure 103: Encoding of E-expression (:foo $10::9)

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0xE4 indicates a single annotation with symbol address. A FlexUInt follows.
│  │  ┌──── Symbol address: FlexUInt 10 ($10); an opcode for the annotated value follows.
│  │  │  ┌──── Opcode 0x61 indicates a 1-byte integer
│  │  │  │   ┌──── 1-byte FixedInt 9
00 E4 15 61 09

// The macro expander confirms that `$10::9` (an annotated `int`) matches the expected type: `number`.

Figure 104: Encoding of E-expression (:foo null.int)

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0xEB indicates a typed null. A 1-byte FixedUInt follows indicating the type.
│  │  ┌──── Null type: FixedUInt: 1; integer
00 EB 01

// The macro expander confirms that `null.int` matches the expected type: `number`.

Figure 105: Encoding of E-expression (:foo null)

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0xEA represents an untyped null (aka `null.null`)
00 EA

// The macro expander confirms that `null` matches the expected type: `number`.

Figure 106: Encoding of E-expression (:foo (:bar))

// A second macro definition at address 1
(macro
    bar // Macro name
    ()  // Parameters
    5   // Template; invocations of `bar` always expand to `5`.
)

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0x01 is less than 0x40, so it is an E-expression invoking the macro
│  │     at address 1: `bar`. `bar` takes no parameters, so no bytes follow.
00 01

// The macro expander confirms that the expansion of `(:bar)` (that is: `5`) matches
// the expected type: `number`.

Figure 107: Encoding of illegal E-expression (:foo "hello")

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0, `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0x95 indicates a 5-byte string. 5 UTF-8 bytes follow.
│  │  h  e  l  l  o
00 95 68 65 6C 6C 6F
      └──────┬─────┘
        UTF-8 bytes

// ERROR: Expected a `number` for `foo` parameter `x`, but found `string`
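
The type check performed for tagged arguments can be modeled as a set-membership test. The following Python sketch is one possible shape for that check; `CORE_TYPES`, `ABSTRACT_TYPES`, and `matches` are hypothetical names, and the handling of typed and untyped nulls is omitted.

```python
CORE_TYPES = {
    "null", "bool", "int", "float", "decimal", "timestamp", "string",
    "symbol", "blob", "clob", "list", "sexp", "struct",
}

# Each abstract type is a union of core types.
ABSTRACT_TYPES = {
    "any": CORE_TYPES,
    "number": {"int", "float", "decimal"},
    "exact": {"int", "decimal"},
    "text": {"string", "symbol"},
    "lob": {"blob", "clob"},
    "sequence": {"list", "sexp"},
}


def matches(parameter_type, value_type):
    """Return True if a value of core type value_type satisfies parameter_type."""
    if parameter_type in ABSTRACT_TYPES:
        return value_type in ABSTRACT_TYPES[parameter_type]
    return parameter_type == value_type
```

For example, `matches("number", "float")` is true, while `matches("number", "string")` is false and corresponds to the error raised for the illegal E-expression above.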

Tagless Encodings

In contrast to tagged encodings, tagless encodings do not begin with an opcode. This means that they are potentially more compact than a tagged type, but are also less flexible. Because tagless encodings do not have an opcode, they cannot represent E-expressions, annotation sequences, or null values of any kind.

Tagless types include the primitive types and macro shapes.

Primitive Types

Primitive types are self-delineating, either by having a statically known size in bytes or by including length information in their encoding.

Primitive types include:

Ion type	Primitive type	Size in bytes	Encoding
`int`	`uint8`	1	`FixedUInt`
	`uint16`	2
	`uint32`	4
	`uint64`	8
	`compact_uint`	variable	`FlexUInt`
	`int8`	1	`FixedInt`
	`int16`	2
	`int32`	4
	`int64`	8
	`compact_int`	variable	`FlexInt`
`float`	`float16`	2	IEEE-754 half-precision floating point format (little-endian)
	`float32`	4	IEEE-754 single-precision floating point format (little-endian)
	`float64`	8	IEEE-754 double-precision floating point format (little-endian)
`symbol`	`compact_symbol`	variable	`FlexSym`

TODO:

Finalize names for primitive types. (compact_? plain_?)
Do we need a compact_string encoding? It saves a byte for string lengths >16 and <128.
Do we need other int sizes? int24? int40?

Macro Shapes

The term macro shape describes a macro that is being used as the encoding of an E-expression argument. They are considered "shapes" rather than types because while their encoding is always statically known, the types of data produced by their expansion are not. A single macro can produce streams of varying length and containing values of different Ion types depending on the arguments provided in the invocation.

See the Macro Shapes section of Macros by Example for more information.

Encoding E-expressions With Multiple Arguments

E-expression arguments corresponding to each parameter are encoded one after the other moving from left to right.

Figure 108: Definition of macro foo at address 0

(macro foo             // Macro name
  (                    // Parameters
    string::a
    compact_symbol::b
    uint16::c
  )
  /* ... */            // Body (elided)
)

Figure 109: Encoding of E-expression for macro with multiple parameters: (:0 "hello" baz 512)

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0, `foo`. `foo`'s first parameter is a string, so an opcode follows.
│
│  ┌──── Opcode 0x95 indicates a 5-byte string. 5 UTF-8 bytes follow.
│  │
│  │                 ┌──── `foo`'s second parameter is a compact_symbol, so a `FlexSym` follows.
│  │                 │     FlexSym -3: 3 bytes of UTF-8 text follow.
│  │                 │
│  │                 │           ┌──── `foo`'s third parameter is a uint16, so a
│  │                 │           │     2-byte `FixedUInt` follows.
│  │                 │           │     FixedUInt: 512
│  │  h  e  l  l  o  │   b  a  z │
00 95 68 65 6C 6C 6F FD 62 61 7A 00 02
      └──────┬─────┘    └───┬──┘
        UTF-8 bytes    UTF-8 bytes

Argument Encoding Bitmap (AEB)

The examples in previous sections have only shown how to encode invocations of macros which have either no parameters at all (aka constants) or whose parameters all have a cardinality of exactly-one.

If a macro has any parameters with a cardinality of zero-or-one (?), zero-or-more (*), or one-or-more (+), then E-expressions invoking that macro will begin with an argument encoding bitmap (AEB). An AEB is a series of bits in which each parameter is allotted zero or more bits that communicate additional information about how the arguments corresponding to that parameter have been encoded in the current E-expression. In particular, the AEB indicates whether a parameter that accepts (:void) has any arguments at all, and how a grouped parameter’s arguments have been delimited.

The number of bits allotted to each parameter is determined by its cardinality, as shown in the table below; each parameter can have 0, 1, or 2 bits.

Grouping Mode	Cardinality	Example parameter signature	Number of bits	Bit(s) value	Encoding
Ungrouped	Exactly-one	`(x int!)`	0	n/a	One expression
	Zero-or-one	`(x int?)`	1	`0`	No expression; equivalent to `(:void)`
	Zero-or-one	`(x int?)`		`1`	One expression
	Zero-or-more	`(x int*)`		`0`	No expression; equivalent to `(:void)`
	Zero-or-more	`(x int*)`		`1`	One expression
	One-or-more	`(x int+)`	0	n/a	One expression
Grouped	Zero-or-more	`(x [int])` `(x int...)`	2	`00`	No expression; equivalent to `(:void)`
				`01`	One expression
				`10`	Length-prefixed expression group
				`11`	Delimited expression group
	One-or-more	`(x [int]+)` `(x int...+)`		`00`	Illegal. One-or-more forbids `(:void)`.
				`01`	One expression
				`10`	Length-prefixed expression group
				`11`	Delimited expression group

The total number of bits in the AEB can be calculated by analyzing the signature of the macro being invoked. If the macro has no parameters or all of its parameters have a cardinality of either exactly-one or one-or-more, no bits are required; the AEB will be omitted altogether. If the macro has many parameters with a cardinality other than exactly-one, it is possible for the AEB to require more than one byte to encode; in such cases, the bytes are written in little-endian order. AEB bytes can contain unused bits.

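Bits are assigned to parameters in the order the parameters appear in the macro’s signature, from left to right. Within the AEB, the first parameter's bits occupy the least significant position and each later parameter's bits occupy the next more significant positions (that is, right-to-left as a byte is conventionally written).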

Example parameter sequence	Bit assignments	Total bits
`()`	No AEB	0
`((a int!) (b string!) (c float!))`	No AEB	0
`((a int!) (b string!) (c float?))`	`-------c`	1
`((a int!) (b string?) (c float!))`	`-------b`	1
`((a int!) (b string*) (c float?))`	`------cb`	2
`((a int*) (b string!) (c [float]))`	`-----cca`	3
`((a int*) (b [string]) (c [float]))`	`---ccbba`	5
`((a [int]) (b [string]) (c [float]+))`	`--ccbbaa`	6
`((a int*) (b [string]) (c [float]) (d [bool]) (e blob...))`	`eddccbba` `-------e`	9
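The bit assignments in the table above can be reproduced with a short Python sketch. The function name `pack_aeb` and the `(width, value)` input format are hypothetical. The sketch also assumes that a two-bit field is placed as a single integer, so that when a field straddles a byte boundary (like `e` in the last row) its lower bit lands in the earlier byte.

```python
def pack_aeb(fields):
    """Pack per-parameter bit fields into AEB bytes.

    `fields` is a list of (width_in_bits, value) pairs in signature order
    (left to right). The first field lands in the least significant bits.
    """
    bitmap = 0
    offset = 0
    for width, value in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bit(s)")
        bitmap |= value << offset
        offset += width
    n_bytes = (offset + 7) // 8  # unused high bits in the last byte stay 0
    return bitmap.to_bytes(n_bytes, "little")


# ((a int*) (b [string]) (c [float]) (d [bool]) (e blob...)) with only `e`
# present, encoded as a delimited expression group (0b11).
aeb = pack_aeb([(1, 0), (2, 0), (2, 0), (2, 0), (2, 0b11)])
print(aeb.hex(" "))
```

Parameters that take no bits can be passed as `(0, 0)` or left out; either way they do not shift the fields that follow.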

Expression Groups

Grouped parameters can be encoded using either a length-prefixed or delimited expression group encoding.

The example encodings in the following sections refer to this macro definition:

Figure 110: Definition of macro foo at address 0

(macro
    foo          // Macro name
    (int::x*)    // Parameters; `x` is a grouped parameter
    /*...*/      // Body (elided)
)

Length-prefixed Expression Groups

If a grouped parameter’s AEB bits are 0b10, then the argument expressions belonging to that parameter will be prefixed by a FlexUInt indicating the number of bytes used to encode them.

Figure 111: Length-prefixed encoding of (:foo [1, 2, 3])

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a group of int expressions as a parameter (`x`),
│     so an argument encoding bitmap (AEB) follows.
│  ┌──── AEB: 0b0000_0010; the arguments for grouped parameter `x` have been encoded
│  │     as a length-prefixed expression group. A FlexUInt length prefix follows.
│  │  ┌──── FlexUInt: 6; the next 6 bytes are an `int` expression group.
│  │  │
00 02 0D 61 01 61 02 61 03
         └─┬─┘ └─┬─┘ └─┬─┘
           1     2     3
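The framing shown in Figure 111 can be sketched in Python as follows. The function names are hypothetical; the FlexUInt routines follow the description in the FlexUInt section, and the argument bytes (`61 01` and so on) are taken from the figure rather than produced by a real Ion writer.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    if value < 0:
        raise ValueError("FlexUInt cannot be negative")
    # Each encoded byte contributes 7 bits of magnitude.
    n_bytes = max(1, -(-value.bit_length() // 7))
    # N-1 zero bits, then a terminal 1 bit, then the magnitude.
    encoded = (value << n_bytes) | (1 << (n_bytes - 1))
    return encoded.to_bytes(n_bytes, "little")


def decode_flex_uint(buf: bytes, pos: int = 0):
    """Decode a FlexUInt at `pos`; return (value, position after it)."""
    n_bytes = 1
    i = pos
    while buf[i] == 0:  # a whole byte of zeros adds 8 to the width
        n_bytes += 8
        i += 1
    lowest_set_bit = buf[i] & -buf[i]
    n_bytes += lowest_set_bit.bit_length() - 1
    raw = int.from_bytes(buf[pos:pos + n_bytes], "little")
    return raw >> n_bytes, pos + n_bytes


def encode_length_prefixed_group(encoded_args) -> bytes:
    """Prefix already-encoded argument expressions with their byte length."""
    body = b"".join(encoded_args)
    return encode_flex_uint(len(body)) + body


def split_length_prefixed_group(buf: bytes, pos: int):
    """Return (group_bytes, position after the group)."""
    length, start = decode_flex_uint(buf, pos)
    end = start + length
    if end > len(buf):
        raise ValueError("expression group runs past the end of the buffer")
    return buf[start:end], end


args = [bytes.fromhex("61 01"), bytes.fromhex("61 02"), bytes.fromhex("61 03")]
message = bytes([0x00, 0x02]) + encode_length_prefixed_group(args)
print(message.hex(" "))
```

Because the prefix is a byte length, a reader that does not need the arguments can skip the whole group without decoding any of the expressions inside it.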

Delimited Expression Groups

If a grouped parameter’s AEB bits are 0b11, then the argument expressions belonging to that parameter will be encoded in a delimited sequence. Delimited sequences are encoded differently for tagged types and tagless types.

Delimited Tagged Expression Groups

Tagged type encodings begin with an opcode; a delimited sequence of tagged arguments is terminated by the closing delimiter opcode, 0xF0.

Figure 112: Delimited encoding of (:foo [1, 2, 3])

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a group of int expressions as a parameter (`x`),
│     so an argument encoding bitmap (AEB) follows.
│  ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded
│  │     as a delimited expression group. A series of tagged `int` expressions follows.
│  │                    ┌──── Opcode 0xF0 ends the expression group.
│  │                    │
00 03 61 01 61 02 61 03 F0
      └─┬─┘ └─┬─┘ └─┬─┘
        1     2     3
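A reader for the layout in Figure 112 might look like the Python sketch below. All names are hypothetical. The value reader is a toy that only understands the `61 xx` form used in the figure, and it assumes that the single body byte is a signed two's complement integer; a real reader would dispatch on every opcode.

```python
END_OF_GROUP = 0xF0  # closing delimiter opcode


def encode_delimited_tagged_group(encoded_args) -> bytes:
    """Append the closing delimiter to already-encoded tagged expressions."""
    return b"".join(encoded_args) + bytes([END_OF_GROUP])


def read_one_byte_int(buf: bytes, pos: int):
    """Toy value reader: handles only opcode 0x61 with a one-byte body."""
    if buf[pos] != 0x61:
        raise NotImplementedError(f"opcode 0x{buf[pos]:02X} not handled in this sketch")
    if pos + 2 > len(buf):
        raise ValueError("truncated int")
    # Assumption: the body byte is a signed two's complement integer.
    value = int.from_bytes(buf[pos + 1:pos + 2], "little", signed=True)
    return value, pos + 2


def read_delimited_tagged_group(buf: bytes, pos: int, read_value=read_one_byte_int):
    """Read tagged values until 0xF0; return (values, position after 0xF0)."""
    values = []
    while True:
        if pos >= len(buf):
            raise ValueError("missing closing delimiter 0xF0")
        if buf[pos] == END_OF_GROUP:
            return values, pos + 1
        value, pos = read_value(buf, pos)
        values.append(value)


message = bytes.fromhex("00 03 61 01 61 02 61 03 F0")
print(read_delimited_tagged_group(message, 2))
```

The delimiter check happens only at positions where an opcode is expected, so an `F0` byte inside a value's body would not end the group.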

Delimited Tagless Expression Groups

Tagless type encodings do not have an opcode, and so cannot use the closing delimiter opcode: 0xF0 is a valid first byte for many tagless encodings.

Instead, tagless expressions are grouped into 'pages', each of which is prefixed by a FlexUInt representing a count (not a byte-length) of the expressions that follow. If a prefix has a count of zero, that marks the end of the sequence of pages.

Figure 113: Definition of macro compact_foo at address 1

(macro
    compact_foo          // Macro name
    (compact_int::x*)    // Parameters; `x` is a grouped parameter
    /*...*/              // Body (elided)
)

Figure 114: Delimited encoding of (:compact_foo [1, 2, 3]) using a single page

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 1: `compact_foo`. `compact_foo` takes a group of `compact_int`
│     expressions as a parameter (`x`), so an argument encoding bitmap (AEB) follows.
│  ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded
│  │     as a delimited expression group. Count-prefixed pages of `compact_int`
│  │     expressions follow.
│  │   ┌──── Count prefix: FlexUInt 3; 3 `compact_int`s follow.
│  │   │          ┌──── Count prefix: FlexUInt 0; no more pages follow.
│  │   │          │
01 03 07 03 05 07 01
         └──┬───┘
         First page: 1, 2, 3

Figure 115: Delimited encoding of (:compact_foo [1, 2, 3]) using two pages

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 1: `compact_foo`. `compact_foo` takes a group of `compact_int`
│     expressions as a parameter (`x`), so an argument encoding bitmap (AEB) follows.
│  ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded
│  │     as a delimited expression group. Count-prefixed pages of `compact_int`
│  │     expressions follow.
│  │   ┌──── Count prefix: FlexUInt 2; 2 `compact_int`s follow.
│  │   │        ┌──── Count prefix: FlexUInt 1; a single `compact_int` follows.
│  │   │        │    ┌──── Count prefix: FlexUInt 0; no more pages follow.
│  │   │        │    │
01 03 05 03 05 03 07 01
         └─┬─┘    └─ Second page: 3
           │
         First page: 1, 2
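The paging scheme can be sketched in Python as follows. The function names and the `page_size` parameter are hypothetical; a writer is free to choose page boundaries however it likes. The sketch assumes `compact_int` values are encoded as FlexInts, which matches the bytes shown in the two figures above.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt."""
    if value < 0:
        raise ValueError("FlexUInt cannot be negative")
    n_bytes = max(1, -(-value.bit_length() // 7))
    encoded = (value << n_bytes) | (1 << (n_bytes - 1))
    return encoded.to_bytes(n_bytes, "little")


def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (two's complement payload)."""
    n_bytes = 1
    # N bytes carry 7*N payload bits, one of which is the sign bit.
    while not -(1 << (7 * n_bytes - 1)) <= value < (1 << (7 * n_bytes - 1)):
        n_bytes += 1
    encoded = (value << n_bytes) | (1 << (n_bytes - 1))
    encoded &= (1 << (8 * n_bytes)) - 1  # wrap negatives to two's complement
    return encoded.to_bytes(n_bytes, "little")


def encode_tagless_pages(values, page_size: int) -> bytes:
    """Encode FlexInt values as count-prefixed pages ending with a 0 count."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    out = bytearray()
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        out += encode_flex_uint(len(page))  # a count, not a byte length
        for value in page:
            out += encode_flex_int(value)
    out += encode_flex_uint(0)  # zero count: no more pages
    return bytes(out)


print(encode_tagless_pages([1, 2, 3], page_size=3).hex(" "))
print(encode_tagless_pages([1, 2, 3], page_size=2).hex(" "))
```

Paging lets a writer emit values as they are produced without knowing the total count in advance, at the cost of one extra prefix per page.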
