Ion 1.1 Binary Encoding

Encoding Primitives

FlexUInt

A variable-length unsigned integer.

The bytes of a FlexUInt are written in little-endian byte order. This means that the first bytes will contain the FlexUInt's least significant bits.

The least significant bits in the FlexUInt indicate the number of bytes that were used to encode the integer. If a FlexUInt is N bytes long, its N-1 least significant bits will be 0; a terminal 1 bit will be in the next most significant position. All bits that are more significant than the terminal 1 represent the magnitude of the FlexUInt.

Figure 1: FlexUInt encoding of 14
              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
unsigned int 14
Figure 2: FlexUInt encoding of 729
             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0  0 0 0 0 1 0 1 1
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the unsigned  of the unsigned
integer          integer
Figure 3: FlexUInt encoding of 21,043
            ┌───── There are 2 zeros in the least significant bits, so this
            │      integer is three bytes wide.
          ┌─┴─┐
1 0 0 1 1 1 0 0  1 0 0 1 0 0 0 1  0 0 0 0 0 0 1 0
└───┬───┘        └──────┬──────┘  └──────┬──────┘
lowest 5 bits    next 8 bits of   highest 8 bits
of the unsigned  the unsigned     of the unsigned
integer          integer          integer
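
The FlexUInt layout described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function names `encode_flex_uint` and `decode_flex_uint` are hypothetical and do not come from any Ion library.

```python
def encode_flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (illustrative sketch)."""
    if value < 0:
        raise ValueError("FlexUInt cannot represent negative values")
    # Every encoded byte carries 7 magnitude bits; 1 bit per byte is
    # spent on the length marker in the least significant bits.
    size = max(1, -(-value.bit_length() // 7))
    # Marker: (size - 1) zero bits followed by a terminal 1 bit.
    raw = (value << size) | (1 << (size - 1))
    return raw.to_bytes(size, "little")


def decode_flex_uint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a FlexUInt at data[pos:]; return (value, next_position)."""
    size = 1
    i = pos
    while i < len(data) and data[i] == 0:
        size += 8            # a whole byte of marker zeros
        i += 1
    if i >= len(data):
        raise ValueError("truncated FlexUInt")
    low = data[i]
    size += (low & -low).bit_length() - 1   # trailing zeros in this byte
    if pos + size > len(data):
        raise ValueError("truncated FlexUInt")
    raw = int.from_bytes(data[pos:pos + size], "little")
    # Discard the marker bits; what remains is the magnitude.
    return raw >> size, pos + size
```

For example, `encode_flex_uint(729)` produces the two bytes shown in Figure 2, and `decode_flex_uint` reverses the process while reporting how many bytes were consumed.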

FlexInt

A variable-length signed integer.

From an encoding perspective, a FlexInt is structurally similar to a FlexUInt (described above). Both encode their bytes using little-endian byte order, and both use the count of least-significant zero bits to indicate how many bytes were used to encode the integer. They differ in the interpretation of their bits: a FlexUInt's bits are unsigned, while a FlexInt's bits are encoded using two’s complement notation.

Tip
An implementation could choose to read a FlexInt by instead reading a FlexUInt and then reinterpreting its bits as two’s complement.
Figure 4: FlexInt encoding of 14
              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
0 0 0 1 1 1 0 1
└─────┬─────┘
 2's comp. 14
Figure 5: FlexInt encoding of -14
              ┌──── Lowest bit is 1 (end), indicating
              │     this is the only byte.
1 1 1 0 0 1 0 1
└─────┬─────┘
 2's comp. -14
Figure 6: FlexInt encoding of 729
             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
0 1 1 0 0 1 1 0  0 0 0 0 1 0 1 1
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the 2's       of the 2's
comp. integer    comp. integer
Figure 7: FlexInt encoding of -729
             ┌──── There's 1 zero in the least significant bits, so this
             │     integer is two bytes wide.
            ┌┴┐
1 0 0 1 1 1 1 0  1 1 1 1 0 1 0 0
└────┬────┘      └──────┬──────┘
lowest 6 bits    highest 8 bits
of the 2's       of the 2's
comp. integer    comp. integer
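
A signed counterpart can be sketched the same way. The only differences from the unsigned case are that the size calculation reserves a sign bit and that the raw bytes are interpreted as two's complement. The names below are hypothetical and chosen for illustration only.

```python
def encode_flex_int(value: int) -> bytes:
    """Encode a signed integer as a FlexInt (illustrative sketch)."""
    # Bits needed for the two's complement form, including the sign bit.
    needed = (value if value >= 0 else ~value).bit_length() + 1
    size = -(-needed // 7)                  # 7 payload bits per byte
    raw = (value << size) | (1 << (size - 1))
    # Mask to the encoded width so negative values become two's complement.
    raw &= (1 << (8 * size)) - 1
    return raw.to_bytes(size, "little")


def decode_flex_int(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a FlexInt at data[pos:]; return (value, next_position)."""
    size = 1
    i = pos
    while i < len(data) and data[i] == 0:
        size += 8
        i += 1
    if i >= len(data):
        raise ValueError("truncated FlexInt")
    low = data[i]
    size += (low & -low).bit_length() - 1   # trailing zeros in this byte
    if pos + size > len(data):
        raise ValueError("truncated FlexInt")
    raw = int.from_bytes(data[pos:pos + size], "little", signed=True)
    # Python's >> is an arithmetic shift, so the sign is preserved.
    return raw >> size, pos + size
```

Reading the bytes as a signed integer and then shifting away the marker bits is one way to realize the tip above about reinterpreting the bits as two's complement.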

FixedUInt

A fixed-width, little-endian, unsigned integer whose length is inferred from the context in which it appears.

Figure 8: FixedUInt encoding of 3,954,261
0 1 0 1 0 1 0 1  0 1 0 1 0 1 1 0  0 0 1 1 1 1 0 0
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
lowest 8 bits    next 8 bits of   highest 8 bits
of the unsigned  the unsigned     of the unsigned
integer          integer          integer

FixedInt

A fixed-width, little-endian, signed integer whose length is known from the context in which it appears. Its bytes are interpreted as two’s complement.

Figure 9: FixedInt encoding of -3,954,261
1 0 1 0 1 0 1 1  1 0 1 0 1 0 0 1  1 1 0 0 0 0 1 1
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
lowest 8 bits    next 8 bits of   highest 8 bits
of the 2's       the 2's comp.    of the 2's comp.
comp. integer    integer          integer
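
Because the width of a FixedUInt or FixedInt comes from context, decoding reduces to interpreting a byte slice of known length. A minimal Python sketch, with hypothetical helper names, might look like this:

```python
def decode_fixed_uint(data: bytes) -> int:
    """Interpret all of `data` as a little-endian unsigned integer."""
    return int.from_bytes(data, "little", signed=False)


def decode_fixed_int(data: bytes) -> int:
    """Interpret all of `data` as a little-endian two's complement integer."""
    return int.from_bytes(data, "little", signed=True)


def encode_fixed_uint(value: int, width: int) -> bytes:
    """Encode `value` in exactly `width` bytes; the caller supplies the width."""
    return value.to_bytes(width, "little", signed=False)


def encode_fixed_int(value: int, width: int) -> bytes:
    """Encode a signed `value` in exactly `width` bytes."""
    return value.to_bytes(width, "little", signed=True)
```

Applied to the three bytes in Figure 8, `decode_fixed_uint` yields 3,954,261; applied to the bytes in Figure 9, `decode_fixed_int` yields -3,954,261.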

FlexSym

A variable-length symbol token whose UTF-8 bytes can be inline, found in the symbol table, or derived from a macro expansion.

A FlexSym begins with a FlexInt; once this integer has been read, we can evaluate it to determine how to proceed. If the FlexInt is:

  • greater than zero, it represents a symbol ID. The symbol’s associated text can be found in the local symbol table. No more bytes follow.

  • less than zero, its absolute value represents a number of UTF-8 bytes that follow the FlexInt. These bytes represent the symbol’s text.

  • exactly zero, another byte follows that is an opcode. The FlexSym parser is not responsible for evaluating this opcode, only returning it—the caller will decide whether the opcode is legal in the current context. Example usages of the opcode include:

    • Representing SID $0 as 0xA0.

    • Representing the empty string ("") as 0x90.

    • When used to encode a struct field name, the opcode can invoke a macro that will evaluate to a struct whose key/value pairs are spliced into the parent struct (TODO: Link)

    • In a delimited struct, terminating the sequence of (field name, value) pairs with 0xF0.

Figure 10: FlexSym encoding of symbol ID $10
              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 1 0 1 0 1
└─────┬─────┘
  2's comp.
  positive 10
Figure 11: FlexSym encoding of symbol text 'hello'
              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │      h         e        l        l        o
1 1 1 1 0 1 1 1  01101000  01100101 01101100 01101100 01101111
└─────┬─────┘    └─────────────────────┬─────────────────────┘
  2's comp.               5-byte UTF-8 encoded "hello"
  negative 5
Figure 12: FlexSym encoding of '' (empty text) using an opcode
              ┌─── The leading FlexInt ends in a `1`,
              │    no more FlexInt bytes follow.
              │
0 0 0 0 0 0 0 1   10010000
└─────┬─────┘     └───┬──┘
  2's comp.      opcode 0x90:
  zero           empty symbol
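
The three FlexSym cases above can be sketched as a small reader. This is an illustration under stated assumptions: the function name `read_flex_sym` and the tagged-tuple return shape are hypothetical, and, as the text describes, the opcode is returned to the caller without being interpreted.

```python
def _read_flex_int(data: bytes, pos: int) -> tuple[int, int]:
    """Minimal FlexInt reader; returns (value, next_position)."""
    size = 1
    i = pos
    while i < len(data) and data[i] == 0:
        size += 8
        i += 1
    if i >= len(data):
        raise ValueError("truncated FlexInt")
    low = data[i]
    size += (low & -low).bit_length() - 1
    if pos + size > len(data):
        raise ValueError("truncated FlexInt")
    raw = int.from_bytes(data[pos:pos + size], "little", signed=True)
    return raw >> size, pos + size


def read_flex_sym(data: bytes, pos: int = 0):
    """Return ((kind, payload), next_position) for the FlexSym at data[pos:]."""
    value, pos = _read_flex_int(data, pos)
    if value > 0:
        # Positive: a symbol ID to look up in the local symbol table.
        return ("symbol_id", value), pos
    if value < 0:
        # Negative: the absolute value is the number of UTF-8 bytes that follow.
        end = pos - value
        if end > len(data):
            raise ValueError("truncated FlexSym text")
        return ("text", data[pos:end].decode("utf-8")), end
    # Zero: one opcode byte follows; the caller decides whether it is legal.
    if pos >= len(data):
        raise ValueError("missing FlexSym opcode")
    return ("opcode", data[pos]), pos + 1
```

Run against the bytes in Figures 10 through 12, this sketch yields a symbol ID of 10, the text `hello`, and the opcode `0x90`, respectively.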
Note
From this point on in the document, example encodings are given in hexadecimal notation.

Opcodes

An opcode is a 1-byte FixedUInt that tells the reader what the next expression represents and how the bytes that follow should be interpreted.

The meanings of the opcodes are organized loosely by their high and low nibbles.

High nibble    Low nibble   Meaning
0x0_ to 0x3_   0-F          E-expression with the address in the opcode
0x4_           0-F          E-expression with the address as a trailing 1-byte FixedUInt
0x5_           0-F          E-expression with the address as a trailing 2-byte FixedUInt
0x6_           0-8          Integers up to 8 bytes wide
               9            Reserved
               A-D          Floats
               E-F          Booleans
0x7_           0-F          Decimals
0x8_           0-C          Timestamps
               D-F          Reserved
0x9_           0-F          Strings
0xA_           0-F          Symbols with inline text
0xB_           0-F          Lists
0xC_           0-F          S-expressions
0xD_           0            Empty struct
               1            Reserved
               2-F          Structs with symbol address field names
0xE_           0            Ion version marker
               1-3          Symbols with symbol address
               4-6          Annotations with symbol address
               7-9          Annotations with FlexSym text
               A            null.null
               B            Typed nulls
               C-D          NOP
               E            E-expression with a variable-width address
               F            System macro invocation
0xF_           0            Delimited container end
               1            Delimited list start
               2            Delimited S-expression start
               3            Delimited struct with FlexSym field names start
               4            Reserved
               5            Variable length prefixed macro invocation
               6            Variable length integer
               7            Variable length decimal
               8            Variable length, long-form timestamp
               9            Variable length string
               A            Variable length symbol encoded as FlexSym
               B            Variable length list
               C            Variable length S-expression
               D            Variable length struct with symbol address field names
               E            Variable length blob
               F            Variable length clob

Encoding Expressions

The encoding of E-expressions is designed to balance density and generality. For example, they enable encodings with minimal tag bits, even none at all given a thoughtful signature. This increases density, but limits generality at the point of macro invocation.

The text and binary forms of E-expressions enforce the same syntactic constraints on the type and range of data allowed as arguments. Any syntactically well-formed E-expression can be transcoded between text and binary, without expansion and without changing semantics, and independent of whether it can be expanded successfully.

E-expression With the Address in the Opcode

If the value of the opcode is less than 64 (0x40), it represents an E-expression invoking the macro at the corresponding address—an offset within the local macro table.

Figure 13: Invocation of macro address 7
┌──── Opcode in 00-3F range indicates an e-expression
│     where the opcode value is the macro address
│
07
└── FixedUInt 7
Figure 14: Invocation of macro address 31
┌──── Opcode in 00-3F range indicates an e-expression
│     where the opcode value is the macro address
│
1F
└── FixedUInt 31

Note that the opcode alone tells us which macro is being invoked, but it does not supply enough information for the reader to parse any arguments that may follow. The parsing of arguments is described in detail in the section Macro calling conventions. (TODO: Link)

E-expression With the Address as a Trailing FixedUInt

While E-expressions invoking macro addresses in the range [0, 63] can be encoded in a single byte using E-expressions with the address in the opcode, many applications will benefit from defining more than 64 macros.

The 0x4_ and 0x5_ opcodes can be used to represent over 1 million macro addresses. If the high nibble of the opcode is 0x4_, then a biased address follows as a 1-byte FixedUInt. If the high nibble of the opcode is 0x5_, then a biased address follows as a 2-byte FixedUInt. In both cases, the address is biased by the total number of addresses with lower opcodes. For 0x4_, the bias is 256 * low_nibble + 64 (or (low_nibble shift-left 8) + 64). For 0x5_, the bias is 65536 * low_nibble + 4160.

Figure 15: Invocation of macro address 841
┌──── Opcode in range 40-4F indicates a macro address with 1-byte FixedUInt address
│┌─── Low nibble 3 indicates bias of 832
││
43 09
   │
   └─── FixedUInt 9

Biased Address : 9
Bias : 832
Address : 841
Figure 16: Invocation of macro address 142,918
┌──── Opcode in range 50-5F indicates a macro address with 2-byte FixedUInt address
│┌─── Low nibble 2 indicates bias of 135232
││
52 06 1E
   └─┬─┘
     └─── FixedUInt 7686

Biased Address : 7686
Bias : 135232
Address : 142918
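
The bias arithmetic for opcodes 0x00 through 0x5F can be sketched as a single function. This is an illustrative sketch only; `read_macro_address` is a hypothetical name, and the function covers only the opcode-embedded and trailing-FixedUInt forms described so far.

```python
def read_macro_address(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (macro_address, next_position) for opcodes 0x00-0x5F."""
    opcode = data[pos]
    high, low = opcode >> 4, opcode & 0x0F
    if opcode < 0x40:
        # The opcode itself is the address.
        return opcode, pos + 1
    if high == 0x4:
        # A 1-byte FixedUInt follows; the bias is 256 * low_nibble + 64.
        if pos + 2 > len(data):
            raise ValueError("truncated macro address")
        return (low << 8) + 64 + data[pos + 1], pos + 2
    if high == 0x5:
        # A 2-byte FixedUInt follows; the bias is 65536 * low_nibble + 4160.
        if pos + 3 > len(data):
            raise ValueError("truncated macro address")
        biased = int.from_bytes(data[pos + 1:pos + 3], "little")
        return (low << 16) + 4160 + biased, pos + 3
    raise ValueError(f"opcode 0x{opcode:02X} does not carry a macro address here")
```

With this scheme the largest 0x4F address (4159) is immediately followed by the smallest 0x50 address (4160), so each address in the covered range has one encoding among these opcodes.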
Table 1. Macro address bias for 0x4_ and 0x5_ opcodes
Low Nibble   0x4_ Bias   0x5_ Bias
0            64          4160
1            320         69696
2            576         135232
3            832         200768
4            1088        266304
5            1344        331840
6            1600        397376
7            1856        462912
8            2112        528448
9            2368        593984
A            2624        659520
B            2880        725056
C            3136        790592
D            3392        856128
E            3648        921664
F            3904        987200

E-expression With the Address as a Trailing FlexUInt

The 0xEE opcode is followed by a FlexUInt that encodes the macro address. Because the address is encoded using a FlexUInt, there is no (theoretical) limit to the number of addresses that can be invoked. However, larger addresses require more bytes to encode.

When using the 0xEE opcode, the address is unbiased; the 0xEE opcode can be used for any macro address.

Figure 17: Invocation of macro address 0
┌──── Opcode EE indicates a macro address as trailing FlexUInt
│  ┌─── FlexUInt 0
│  │
EE 01
Figure 18: Invocation of macro address 2,097,151
┌──── Opcode EE indicates a macro address as trailing FlexUInt
│  ┌─── FlexUInt 2097151
│  │
EE FC FF FF

Booleans

0x6E represents boolean true, while 0x6F represents boolean false.

0xEB 0x00 represents null.bool.

Figure 19: Encoding of boolean true
6E
Figure 20: Encoding of boolean false
6F
Figure 21: Encoding of null.bool
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: boolean
│  │
EB 00

Numbers

Integers

Opcodes in the range 0x60 to 0x68 represent an integer. The opcode is followed by a FixedInt that represents the integer value. The low nibble of the opcode (0x_0 to 0x_8) indicates the size of the FixedInt. Opcode 0x60 represents integer 0; no more bytes follow.

Integers that require more than 8 bytes are encoded using the variable-length integer opcode 0xF6, followed by a FlexUInt indicating how many bytes of representation data follow.

0xEB 0x01 represents null.int.

Figure 22: Encoding of integer 0
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 0 indicates
││    no more bytes follow.
60
Figure 23: Encoding of integer 17
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 1 indicates
││    a single byte follows.
61 11
    └── FixedInt 17
Figure 24: Encoding of integer -944
┌──── Opcode in 60-68 range indicates integer
│┌─── Low nibble 2 indicates
││    that two bytes follow.
62 50 FC
   └─┬─┘
FixedInt -944
Figure 25: Variable-length encoding of integer -944
┌──── Opcode F6 indicates a variable-length integer, FlexUInt length follows
│   ┌─── FlexUInt 2; a 2-byte FixedInt follows
│   │
F6 05 50 FC
      └─┬─┘
   FixedInt -944
Figure 26: Encoding of null.int
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: integer
│  │
EB 01
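
A writer that picks the most compact integer encoding could be sketched as follows. The function name `encode_int` is hypothetical, and the choice to always prefer the 0x60-0x68 opcodes when the value fits in 8 bytes is an assumption of this sketch; Figure 25 shows that the 0xF6 form can also carry small values.

```python
def _encode_flex_uint(value: int) -> bytes:
    """Minimal FlexUInt writer used for the 0xF6 length prefix."""
    size = max(1, -(-value.bit_length() // 7))
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_int(value: int) -> bytes:
    """Encode an Ion integer using the smallest FixedInt that holds it."""
    if value == 0:
        return b"\x60"                       # opcode 0x60: zero, no body
    # Bits needed in two's complement (including the sign bit), rounded
    # up to whole bytes.
    needed_bits = (value if value > 0 else ~value).bit_length() + 1
    width = (needed_bits + 7) // 8
    body = value.to_bytes(width, "little", signed=True)
    if width <= 8:
        return bytes([0x60 | width]) + body  # low nibble holds the width
    # Wider values use 0xF6 followed by a FlexUInt byte count.
    return b"\xF6" + _encode_flex_uint(width) + body
```

For instance, `encode_int(-944)` produces the three bytes shown in Figure 24 rather than the four-byte form in Figure 25.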

Floats

Float values are encoded using the IEEE-754 specification in little-endian byte order. Floats can be serialized in four sizes:

  • 0 bits (0 bytes), representing the value 0e0 and indicated by opcode 0x6A

  • 16 bits (2 bytes in little-endian order, half precision), indicated by opcode 0x6B

  • 32 bits (4 bytes in little-endian order, single precision), indicated by opcode 0x6C

  • 64 bits (8 bytes in little-endian order, double precision), indicated by opcode 0x6D

Note that in the Ion data model, float values are always 64 bits. However, if a value can be losslessly serialized in fewer than 64 bits, Ion implementations may choose to do so.

0xEB 0x02 represents null.float.

Figure 27: Encoding of float 0e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble A indicates
││    a 0-length float; 0e0
6A
Figure 28: Encoding of float 3.14e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble B indicates a 2-byte float
││
6B 47 42
   └─┬─┘
half-precision 3.14
Figure 29: Encoding of float 3.1415927e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble C indicates a 4-byte,
││    single-precision value.
6C DB 0F 49 40
   └────┬────┘
single-precision 3.1415927
Figure 30: Encoding of float 3.141592653589793e0
┌──── Opcode in range 6A-6D indicates a float
│┌─── Low nibble D indicates an 8-byte,
││    double-precision value.
6D 18 2D 44 54 FB 21 09 40
   └──────────┬──────────┘
double-precision 3.141592653589793
Figure 31: Encoding of null.float
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: float
│  │
EB 02
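
The idea of choosing the smallest lossless float width can be sketched with Python's standard `struct` module, whose `e`, `f`, and `d` format codes correspond to IEEE-754 half, single, and double precision. The function names are hypothetical. One assumption of this sketch is that negative zero is not written with opcode 0x6A, since that opcode is described as representing 0e0.

```python
import math
import struct

# (opcode, struct format) pairs, smallest width first.
_FORMATS = ((0x6B, "<e"), (0x6C, "<f"), (0x6D, "<d"))


def encode_float(value: float) -> bytes:
    """Encode a float in the fewest bytes that round-trip exactly."""
    if value == 0.0 and math.copysign(1.0, value) > 0:
        return b"\x6A"                        # positive zero: no body
    for opcode, fmt in _FORMATS:
        try:
            packed = struct.pack(fmt, value)
        except OverflowError:
            continue                          # too large for this width
        # Double precision is always exact; narrower widths must round-trip.
        if fmt == "<d" or struct.unpack(fmt, packed)[0] == value:
            return bytes([opcode]) + packed
    raise AssertionError("unreachable: double precision always succeeds")


def decode_float(data: bytes) -> float:
    """Decode a float from an opcode in the range 0x6A-0x6D."""
    opcode = data[0]
    if opcode == 0x6A:
        return 0.0
    fmt = dict(_FORMATS).get(opcode)
    if fmt is None:
        raise ValueError(f"opcode 0x{opcode:02X} is not a float")
    size = struct.calcsize(fmt)
    if len(data) < 1 + size:
        raise ValueError("truncated float")
    return struct.unpack(fmt, data[1:1 + size])[0]
```

Note that NaN never compares equal to itself, so this sketch always writes NaN at double precision; an implementation that inspects the bit pattern could do better.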

Decimals

If an opcode has a high nibble of 0x7_, it represents a decimal. Low nibble values indicate the number of trailing bytes used to encode the decimal.

The body of the decimal is encoded as a FlexInt representing its exponent, followed by a FixedInt representing its coefficient. The width of the coefficient is the total length of the decimal encoding minus the length of the exponent. It is possible for the coefficient to have a width of zero, indicating a coefficient of 0. When the coefficient is present but has a value of 0, the coefficient is -0.

Decimal values that require more than 15 bytes can be encoded using the variable-length decimal opcode: 0xF7.

0xEB 0x03 represents null.decimal.

Figure 32: Encoding of decimal 0d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 0 indicates a zero-byte
││    decimal; 0d0
70
Figure 33: Encoding of decimal 7d0
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 01 07
   |  └─── Coefficient: 1-byte FixedInt 7
   └─── Exponent: FlexInt 0
Figure 34: Encoding of decimal 1.27
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 FD 7F
   |  └─── Coefficient: FixedInt 127
   └─── Exponent: 1-byte FlexInt -2
Figure 35: Variable-length encoding of decimal 1.27
┌──── Opcode F7 indicates a variable-length decimal
│
F7 05 FD 7F
   |  |  └─── Coefficient: FixedInt 127
   |  └───── Exponent: 1-byte FlexInt -2
   └─────── Decimal length: FlexUInt 2
Figure 36: Encoding of 0d3, which has a coefficient of zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 1 indicates a 1-byte decimal
││
71 07
   └────── Exponent: FlexInt 3; no more bytes follow, so the coefficient is implicitly 0
Figure 37: Encoding of -0d3, which has a coefficient of negative zero
┌──── Opcode in range 70-7F indicates a decimal
│┌─── Low nibble 2 indicates a 2-byte decimal
││
72 07 00
   |  └─── Coefficient: 1-byte FixedInt 0, indicating a coefficient of -0
   └────── Exponent: FlexInt 3
Figure 38: Encoding of null.decimal
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: decimal
│  │
EB 03
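
Decoding the exponent and coefficient, including the distinction between an absent coefficient (0) and a present zero coefficient (-0), can be sketched as below. The function names are hypothetical, and mapping the result onto Python's `decimal.Decimal` is a choice made for this illustration.

```python
from decimal import Decimal


def _read_flex(data: bytes, pos: int, signed: bool) -> tuple[int, int]:
    """Minimal FlexInt/FlexUInt reader; returns (value, next_position)."""
    size = 1
    i = pos
    while i < len(data) and data[i] == 0:
        size += 8
        i += 1
    if i >= len(data):
        raise ValueError("truncated Flex integer")
    low = data[i]
    size += (low & -low).bit_length() - 1
    raw = int.from_bytes(data[pos:pos + size], "little", signed=signed)
    return raw >> size, pos + size


def decode_decimal(data: bytes) -> Decimal:
    """Decode a decimal from opcodes 0x70-0x7F or 0xF7."""
    opcode = data[0]
    if opcode >> 4 == 0x7:
        length, pos = opcode & 0x0F, 1
    elif opcode == 0xF7:
        length, pos = _read_flex(data, 1, signed=False)
    else:
        raise ValueError(f"opcode 0x{opcode:02X} is not a decimal")
    end = pos + length
    if end > len(data):
        raise ValueError("truncated decimal")
    if length == 0:
        return Decimal((0, (0,), 0))          # zero-byte body: 0d0
    exponent, pos = _read_flex(data, pos, signed=True)
    coefficient_bytes = data[pos:end]
    coefficient = int.from_bytes(coefficient_bytes, "little", signed=True)
    # A coefficient that is present but zero denotes negative zero.
    negative = coefficient < 0 or (len(coefficient_bytes) > 0 and coefficient == 0)
    digits = tuple(int(d) for d in str(abs(coefficient)))
    return Decimal((1 if negative else 0, digits, exponent))
```

Building the `Decimal` from a (sign, digits, exponent) tuple keeps the exponent and the sign of zero exactly as encoded, which arithmetic on the coefficient would not guarantee.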

Timestamps

Note
In Ion 1.0, text timestamp fields were encoded using the local time while binary timestamp fields were encoded using UTC time. This required applications to perform conversion logic when transcribing from one format to the other. In Ion 1.1, all binary timestamp fields are encoded in local time.

Timestamps have two encodings:

Short-form timestamps

A compact representation optimized for the most commonly used precisions and date ranges.

Long-form timestamps

A less compact representation capable of representing any timestamp in the Ion data model.

0xEB 0x04 represents null.timestamp.

Figure 39: Encoding of null.timestamp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: timestamp
│  │
EB 04

Short-form Timestamp

If an opcode has a high nibble of 0x8_, it represents a short-form timestamp. This encoding focuses on making the most common timestamp precisions and ranges the most compact; less common precisions can still be expressed via the variable-length long form timestamp encoding.

Timestamps may be encoded using the short form if they meet all of the following conditions:

The year is between 1970 and 2097.

The year subfield is encoded as the number of years since 1970. 7 bits are dedicated to representing the biased year, allowing timestamps through the year 2097 to be encoded in this form.

The local offset is either UTC, unknown, or falls between -14:00 and +14:00 and is divisible by 15 minutes.

7 bits are dedicated to representing the local offset as the number of quarter hours from -56 (that is: offset -14:00). The value 0b1111111 indicates an unknown offset. At the time of this writing (2023-05T), all real-world offsets fall between -12:00 and +14:00 and are multiples of 15 minutes.

The fractional seconds are a common precision.

The timestamp’s fractional second precision (if present) is either 3 digits (milliseconds), 6 digits (microseconds), or 9 digits (nanoseconds).

Opcodes by precision and offset

Each opcode with a high nibble of 0x8_ indicates a different precision and offset encoding pair.

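Opcode   Precision          Serialized size in bytes*   Offset encoding
0x80     Year               1                           Implicitly Unknown offset
0x81     Month              2                           Implicitly Unknown offset
0x82     Day                2                           Implicitly Unknown offset
0x83     Hour and minutes   4                           1 bit to indicate UTC or Unknown Offset
0x84     Seconds            5                           1 bit to indicate UTC or Unknown Offset
0x85     Milliseconds       6                           1 bit to indicate UTC or Unknown Offset
0x86     Microseconds       7                           1 bit to indicate UTC or Unknown Offset
0x87     Nanoseconds        8                           1 bit to indicate UTC or Unknown Offset
0x88     Hour and minutes   5                           7 bits to represent a known offset
0x89     Seconds            5                           7 bits to represent a known offset
0x8A     Milliseconds       7                           7 bits to represent a known offset
0x8B     Microseconds       8                           7 bits to represent a known offset
0x8C     Nanoseconds        9                           7 bits to represent a known offset
0x8D     Reserved
0x8E     Reserved
0x8F     Reserved

The 7-bit offset encoding used by opcodes 0x88-0x8C can also represent UTC and Unknown Offset, though it is less compact than opcodes 0x83-0x87 above.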

* Serialized size in bytes does not include the opcode.

The body of a short-form timestamp is encoded as a FixedUInt of the size specified by the opcode. This integer is then partitioned into bit-fields representing the timestamp’s subfields. Note that endianness does not apply here because the bit-fields are defined over the body interpreted as an integer.

The following letters are used to denote bits in each subfield in the diagrams that follow. Subfields occur in the same order in all encoding variants, and consume the same number of bits, with the exception of the fractional bits, which consume only enough bits to represent the fractional precision supported by the opcode being used.

The Month and Day subfields are one-based; 0 is not a valid month or day.

Letter code   Number of bits              Subfield
Y             7                           Year
M             4                           Month
D             5                           Day
H             5                           Hour
m             6                           Minute
o             7                           Offset
U             1                           Unknown (0) or UTC (1) offset
s             6                           Second
f             10 (ms), 20 (μs), 30 (ns)   Fractional second
.             n/a                         Unused

We will denote the timestamp encoding as follows with each byte ordered vertically from top to bottom. The respective bits are denoted using the letter codes defined in the table above.

          7       0 <--- bit position
          |       |
         +=========+
byte 0   |  0xNN   | <-- hex notation for constants like opcodes
         +=========+ <-- boundary between encoding primitives (e.g., opcode/`FlexUInt`)
     1   |nnnn:nnnn| <-- bits denoted with a `:` as a delimiter to aid in reading
         +---------+ <-- octet boundary within an encoding primitive
         ...
         +---------+
     N   |nnnn:nnnn|
         +=========+

The bytes are read from top to bottom (least significant to most significant), while the bits within each byte should be read from right to left (also least significant to most significant.)

Note
While this encoding may complicate human reading, it guarantees that the timestamp’s subfields (year, month, etc.) occupy the same contiguous bit indexes regardless of how many bytes there are overall. (The last subfield, fractional_seconds, always begins at the same bit index when present, but can vary in length according to the precision.) This arrangement allows processors to read the little-endian bytes into an integer and then mask the appropriate bit ranges to access the subfields.
Figure 40: Encoding of a timestamp with year precision
         +=========+
byte 0   |  0x80   |
         +=========+
     1   |.YYY:YYYY|
         +=========+
Figure 41: Encoding of a timestamp with month precision
         +=========+
byte 0   |  0x81   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |....:.MMM|
         +=========+
Figure 42: Encoding of a timestamp with day precision
         +=========+
byte 0   |  0x82   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +=========+
Figure 43: Encoding of a timestamp with hour-and-minutes precision at UTC or unknown offset
         +=========+
byte 0   |  0x83   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |....:Ummm|
         +=========+
Figure 44: Encoding of a timestamp with seconds precision at UTC or unknown offset
         +=========+
byte 0   |  0x84   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |....:..ss|
         +=========+
Figure 45: Encoding of a timestamp with milliseconds precision at UTC or unknown offset
         +=========+
byte 0   |  0x85   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |....:ffff|
         +=========+
Figure 46: Encoding of a timestamp with microseconds precision at UTC or unknown offset
         +=========+
byte 0   |  0x86   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |..ff:ffff|
         +=========+
Figure 47: Encoding of a timestamp with nanoseconds precision at UTC or unknown offset
         +=========+
byte 0   |  0x87   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |ssss:Ummm|
         +---------+
     5   |ffff:ffss|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |ffff:ffff|
         +=========+
Figure 48: Encoding of a timestamp with hour-and-minutes precision at known offset
         +=========+
byte 0   |  0x88   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |....:..oo|
         +=========+
Figure 49: Encoding of a timestamp with seconds precision at known offset
         +=========+
byte 0   |  0x89   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +=========+
Figure 50: Encoding of a timestamp with milliseconds precision at known offset
         +=========+
byte 0   |  0x8A   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |....:..ff|
         +=========+
Figure 51: Encoding of a timestamp with microseconds precision at known offset
         +=========+
byte 0   |  0x8B   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |....:ffff|
         +=========+
Figure 52: Encoding of a timestamp with nanoseconds precision at known offset
         +=========+
byte 0   |  0x8C   |
         +=========+
     1   |MYYY:YYYY|
         +---------+
     2   |DDDD:DMMM|
         +---------+
     3   |mmmH:HHHH|
         +---------+
     4   |oooo:ommm|
         +---------+
     5   |ssss:ssoo|
         +---------+
     6   |ffff:ffff|
         +---------+
     7   |ffff:ffff|
         +---------+
     8   |ffff:ffff|
         +---------+
     9   |..ff:ffff|
         +=========+
Table 2. Examples of short-form timestamps
Text                                  Binary
2023T                                 80 35
2023-10-15T                           82 35 7D
2023-10-15T11:22:33Z                  84 35 7D CB 1A 02
2023-10-15T11:22:33-00:00             84 35 7D CB 12 02
2023-10-15T11:22:33+01:15             89 35 7D CB EA 85
2023-10-15T11:22:33.444555666+01:15   8C 35 7D CB EA 85 92 61 7F 1A

Warning
Opcodes 0x8D, 0x8E, and 0x8F are illegal; they are reserved for future use.
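
The read-then-mask approach for short-form timestamps can be sketched as follows. The function name and the dictionary-based result are hypothetical. Bit positions follow the subfield widths given above, and the known-offset case applies the stated bias of 56 quarter hours, with `None` standing in for an unknown offset.

```python
# Body size in bytes for each short-form opcode.
BODY_SIZE = {0x80: 1, 0x81: 2, 0x82: 2, 0x83: 4, 0x84: 5, 0x85: 6, 0x86: 7,
             0x87: 8, 0x88: 5, 0x89: 5, 0x8A: 7, 0x8B: 8, 0x8C: 9}
# Width of the fractional-second subfield, where present.
FRACTION_BITS = {0x85: 10, 0x86: 20, 0x87: 30, 0x8A: 10, 0x8B: 20, 0x8C: 30}


def _bits(value: int, start: int, width: int) -> int:
    """Extract `width` bits of `value` starting at bit index `start`."""
    return (value >> start) & ((1 << width) - 1)


def decode_short_timestamp(data: bytes) -> dict:
    """Decode a short-form timestamp into a dict of its subfields."""
    opcode = data[0]
    if opcode not in BODY_SIZE:
        raise ValueError(f"opcode 0x{opcode:02X} is not a short-form timestamp")
    size = BODY_SIZE[opcode]
    if len(data) < 1 + size:
        raise ValueError("truncated timestamp")
    body = int.from_bytes(data[1:1 + size], "little")

    ts = {"year": 1970 + _bits(body, 0, 7)}
    if opcode >= 0x81:
        ts["month"] = _bits(body, 7, 4)
    if opcode >= 0x82:
        ts["day"] = _bits(body, 11, 5)
    if opcode >= 0x83:
        ts["hour"] = _bits(body, 16, 5)
        ts["minute"] = _bits(body, 21, 6)
        if opcode <= 0x87:
            # One U bit: 1 means UTC, 0 means unknown offset.
            ts["offset_minutes"] = 0 if _bits(body, 27, 1) else None
            next_bit = 28
        else:
            # Seven offset bits: quarter hours counted from -14:00.
            raw = _bits(body, 27, 7)
            ts["offset_minutes"] = None if raw == 0x7F else (raw - 56) * 15
            next_bit = 34
        if opcode not in (0x83, 0x88):
            ts["second"] = _bits(body, next_bit, 6)
            if opcode in FRACTION_BITS:
                ts["fraction"] = _bits(body, next_bit + 6, FRACTION_BITS[opcode])
    return ts
```

Because every subfield sits at a fixed bit index, the same masks work for every opcode within an offset family; only the presence of each subfield and the width of the fraction vary.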

Long-form Timestamp

Unlike the short-form timestamp encoding, which is limited to the most commonly used timestamp ranges and precisions, the long-form timestamp encoding is capable of representing any valid timestamp.

The long form begins with opcode 0xF8. A FlexUInt follows indicating the number of bytes that were needed to represent the timestamp. The encoding consumes the minimum number of bytes required to represent the timestamp. The declared length can be mapped to the timestamp’s precision as follows:

Length      Corresponding precision
0           Illegal
1           Illegal
2           Year
3           Month or Day (see below)
4           Illegal; the hour cannot be specified without also specifying minutes
5           Illegal
6           Minutes
7           Seconds
8 or more   Fractional seconds

Unlike the short-form encoding, the long-form encoding reserves:

  • 14 bits for the year (Y), which is not biased.

  • 12 bits for the offset, which counts the number of minutes (not quarter-hours) from -1440 (that is: -24:00). An offset value of 0b111111111111 indicates an unknown offset.

Similar to short-form timestamps, with the exception of representing the fractional seconds, the components of the timestamp are encoded as bit-fields on a FixedUInt that corresponds to the length that followed the opcode.

If the timestamp’s overall length is greater than or equal to 8, the FixedUInt part of the timestamp is 7 bytes and the remaining bytes are used to encode fractional seconds. The fractional seconds are encoded as a (scale, coefficient) pair, which is similar to a decimal. The primary difference is that the scale represents a negative exponent because it is illegal for the fractional seconds value to be greater than or equal to 1.0 or less than 0.0. The scale is encoded as a FlexUInt (instead of FlexInt) to discourage the encoding of decimal numbers greater than 1.0. The coefficient is encoded as a FixedUInt (instead of FixedInt) to prevent the encoding of fractional seconds less than 0.0. Note that validation is still required; namely:

  • A scale value of 0 is illegal, as that would result in a fractional seconds value greater than or equal to 1.0 (a whole second).

  • If coefficient * 10^-scale >= 1.0, that (coefficient, scale) pair is illegal.

If the timestamp’s length is 3, the precision is determined by inspecting the day (DDDDD) bits. Like the short-form, the Month and Day subfields are one-based (0 is not a valid month or day). If the day subfield is zero, that indicates month precision. If the day subfield is any non-zero number, that indicates day precision.

Figure 53: Encoding of the body of a long-form timestamp
         +=========+
byte 0   |YYYY:YYYY|
         +---------+
     1   |MMYY:YYYY|
         +---------+
     2   |HDDD:DDMM|
         +---------+
     3   |mmmm:HHHH|
         +---------+
     4   |oooo:oomm|
         +---------+
     5   |ssoo:oooo|
         +---------+
     6   |....:ssss|
         +=========+
     7   |FlexUInt | <-- scale of the fractional seconds
         +---------+
         ...
         +=========+
     N   |FixedUInt| <-- coefficient of the fractional seconds
         +---------+
         ...
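
A long-form decoder can follow the same read-then-mask pattern, with the precision taken from the declared length. This is a sketch under stated assumptions: the function names and dictionary result are hypothetical, an unknown offset is reported as `None`, and the fraction is returned as a (coefficient, scale) pair without further conversion.

```python
def _read_flex_uint(data: bytes, pos: int) -> tuple[int, int]:
    """Minimal FlexUInt reader; returns (value, next_position)."""
    size = 1
    i = pos
    while i < len(data) and data[i] == 0:
        size += 8
        i += 1
    if i >= len(data):
        raise ValueError("truncated FlexUInt")
    low = data[i]
    size += (low & -low).bit_length() - 1
    raw = int.from_bytes(data[pos:pos + size], "little")
    return raw >> size, pos + size


def _bits(value: int, start: int, width: int) -> int:
    return (value >> start) & ((1 << width) - 1)


def decode_long_timestamp(data: bytes) -> dict:
    """Decode a long-form (opcode 0xF8) timestamp into its subfields."""
    if data[0] != 0xF8:
        raise ValueError("not a long-form timestamp")
    length, pos = _read_flex_uint(data, 1)
    if length in (0, 1, 4, 5):
        raise ValueError(f"illegal long-form timestamp length {length}")
    if pos + length > len(data):
        raise ValueError("truncated timestamp")
    # At most 7 bytes hold bit-fields; any remainder is the fraction.
    body = int.from_bytes(data[pos:pos + min(length, 7)], "little")

    ts = {"year": _bits(body, 0, 14)}
    if length >= 3:
        ts["month"] = _bits(body, 14, 4)
        day = _bits(body, 18, 5)
        if day != 0:                          # day 0 means month precision
            ts["day"] = day
    if length >= 6:
        ts["hour"] = _bits(body, 23, 5)
        ts["minute"] = _bits(body, 28, 6)
        raw = _bits(body, 34, 12)             # minutes counted from -24:00
        ts["offset_minutes"] = None if raw == 0xFFF else raw - 1440
    if length >= 7:
        ts["second"] = _bits(body, 46, 6)
    if length >= 8:
        scale, p = _read_flex_uint(data, pos + 7)
        coefficient = int.from_bytes(data[p:pos + length], "little")
        # The fraction must be less than one whole second.
        if scale == 0 or coefficient >= 10 ** scale:
            raise ValueError("illegal fractional seconds")
        ts["fraction"] = (coefficient, scale)
    return ts
```

For the fractional example `.127`, the sketch reports `(127, 3)`, meaning 127 × 10^-3 seconds.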
Table 3. Examples of long-form timestamps
Text                            Binary
1947T                           F8 05 9B 07
1947-12T                        F8 07 9B 07 03
1947-12-23T                     F8 07 9B 07 5F
1947-12-23T11:22:33-00:00       F8 0F 9B 07 DF 65 FD 7F 08
1947-12-23T11:22:33+01:15       F8 0F 9B 07 DF 65 AD 57 08
1947-12-23T11:22:33.127+01:15   F8 13 9B 07 DF 65 AD 57 08 07 7F

Text

Strings

If the high nibble of the opcode is 0x9_, it represents a string. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0x90 represents a string with empty text ("").

Strings longer than 15 bytes can be encoded with the 0xF9 opcode, which takes a FlexUInt-encoded length after the opcode.

0xEB 0x05 represents null.string.

Figure 54: Encoding of the empty string, ""
┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
90
Figure 55: Encoding of a 14-byte string
┌──── Opcode in range 90-9F indicates a string
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││  f  o  u  r  t  e  e  n     b  y  t  e  s
9E 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
   └──────────────────┬────────────────────┘
                 UTF-8 bytes
Figure 56: Encoding of a 24-byte string
┌──── Opcode F9 indicates a variable-length string
│  ┌─── Length: FlexUInt 24
│  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     e  n  c  o  d  i  n  g
F9 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6F 64 69 6E 67
      └────────────────────────────────┬────────────────────────────────────┘
                                  UTF-8 bytes
Figure 57: Encoding of null.string
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: string
│  │
EB 05
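
The string rules above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the helper names `flex_uint` and `encode_string` are hypothetical, and `flex_uint` follows the FlexUInt layout described earlier in this document (N-1 low zero bits, a terminal 1 bit, then the magnitude, written little-endian).

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    if value < 0:
        raise ValueError("FlexUInt cannot encode negative values")
    size = 1
    while value >= 1 << (7 * size):   # each encoded byte carries 7 magnitude bits
        size += 1
    # size-1 zero bits, then a terminal 1 bit, then the magnitude
    encoded = (value << size) | (1 << (size - 1))
    return encoded.to_bytes(size, "little")


def encode_string(text: str) -> bytes:
    """Encode a non-null string using opcodes 0x90-0x9F or 0xF9."""
    utf8 = text.encode("utf-8")
    if len(utf8) <= 15:
        # The low nibble of the opcode carries the UTF-8 byte count.
        return bytes([0x90 | len(utf8)]) + utf8
    # Longer strings: opcode F9, then a FlexUInt length, then the UTF-8 bytes.
    return b"\xF9" + flex_uint(len(utf8)) + utf8


print(encode_string("fourteen bytes").hex(" "))
```

Note that the length counts UTF-8 bytes rather than characters, so a short string containing non-ASCII text may still need the F9 form.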

Symbols With Inline Text

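If the high nibble of the opcode is 0xA_, it represents a symbol whose text follows the opcode. The low nibble of the opcode indicates how many UTF-8 bytes follow. Opcode 0xA0 represents a symbol with empty text ('').

Symbols with more than 15 bytes of inline text can be encoded with the FA opcode, which takes a FlexUInt-encoded length after the opcode.

0xEB 0x06 represents null.symbol.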

Figure 58: Encoding of a symbol with empty text ('')
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble 0 indicates that no UTF-8 bytes follow
A0
Figure 59: Encoding of a symbol with 14 bytes of inline text
┌──── Opcode in range A0-AF indicates a symbol with inline text
│┌─── Low nibble E indicates that 14 UTF-8 bytes follow
││  f  o  u  r  t  e  e  n     b  y  t  e  s
AE 66 6F 75 72 74 65 65 6E 20 62 79 74 65 73
   └──────────────────┬────────────────────┘
                 UTF-8 bytes
Figure 60: Encoding of a symbol with 24 bytes of inline text
┌──── Opcode FA indicates a variable-length symbol with inline text
│  ┌─── Length: FlexUInt 24
│  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     e  n  c  o  d  i  n  g
FA 31 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 65 6E 63 6f 64 69 6E 67
      └────────────────────────────────┬────────────────────────────────────┘
                                  UTF-8 bytes
Figure 61: Encoding of null.symbol
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: symbol
│  │
EB 06

Symbols With a Symbol Address

Symbol values whose text can be found in the local symbol table are encoded using opcodes 0xE1 through 0xE3:

  • 0xE1 represents a symbol whose address in the symbol table (aka its symbol ID) is a 1-byte FixedUInt that follows the opcode.

  • 0xE2 represents a symbol whose address in the symbol table is a 2-byte FixedUInt that follows the opcode.

  • 0xE3 represents a symbol whose address in the symbol table is a FlexUInt that follows the opcode.

Writers MUST encode a symbol address in the smallest number of bytes possible. For each opcode above, the symbol address that is decoded is biased by the number of addresses that can be encoded in fewer bytes.

Opcode   Symbol address range   Bias
0xE1     0 to 255               0
0xE2     256 to 65,791          256
0xE3     65,792 to infinity     65,792
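
The bias rule can be illustrated with a small Python sketch. The function names are hypothetical, and the sketch assumes that a FixedUInt is written in little-endian byte order, which this section does not state explicitly.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_symbol_address(address: int) -> bytes:
    """Pick the smallest opcode for a symbol address and subtract its bias."""
    if address < 0:
        raise ValueError("symbol addresses are non-negative")
    if address < 256:
        return b"\xE1" + address.to_bytes(1, "little")           # bias 0
    if address < 65_792:
        # Assumption: the 2-byte FixedUInt is little-endian.
        return b"\xE2" + (address - 256).to_bytes(2, "little")   # bias 256
    return b"\xE3" + flex_uint(address - 65_792)                 # bias 65,792


for address in (10, 256, 65_791, 65_792):
    print(address, encode_symbol_address(address).hex(" "))
```

Because each opcode starts counting where the previous one ends, every address has exactly one encoding in this scheme.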

Binary Data

Blobs

Opcode FE indicates a blob of binary data. A FlexUInt follows that represents the blob’s byte-length.

0xEB 0x07 represents null.blob.

Figure 62: Encoding of a blob with 24 bytes of data
┌──── Opcode FE indicates a blob, FlexUInt length follows
│   ┌─── Length: FlexUInt 24
│   │
FE 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
      └────────────────────────────────┬────────────────────────────────────┘
                            24 bytes of binary data
Figure 63: Encoding of null.blob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: blob
│  │
EB 07
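
From the reader's side, decoding a blob means reading the opcode, decoding the FlexUInt length, and slicing out that many bytes. The following Python sketch shows one way to do this; `read_flex_uint` and `read_blob` are hypothetical names, not part of any Ion library.

```python
def read_flex_uint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a FlexUInt at data[pos]; return (value, position after it)."""
    size = 1
    i = pos
    while data[i] == 0:          # an all-zero byte contributes 8 low zero bits
        size += 8
        i += 1
    lowest_set_bit = data[i] & -data[i]
    size += lowest_set_bit.bit_length() - 1   # count of remaining low zero bits
    raw = int.from_bytes(data[pos:pos + size], "little")
    return raw >> size, pos + size            # drop the zero bits and terminal 1


def read_blob(data: bytes, pos: int = 0) -> tuple[bytes, int]:
    """Read a non-null blob at data[pos]; return (payload, position after it)."""
    if data[pos] != 0xFE:
        raise ValueError("expected blob opcode 0xFE")
    length, start = read_flex_uint(data, pos + 1)
    end = start + length
    if end > len(data):
        raise ValueError("blob is truncated")
    return data[start:end], end


payload, next_pos = read_blob(bytes.fromhex("FE 31") + b"I applaud your curiosity")
print(payload, next_pos)
```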

Clobs

Opcode FF indicates a clob—​binary character data of an unspecified encoding. A FlexUInt follows that represents the clob’s byte-length.

0xEB 0x08 represents null.clob.

Figure 64: Encoding of a clob with 24 bytes of data
┌──── Opcode FF indicates a clob, FlexUInt length follows
│   ┌─── Length: FlexUInt 24
│   │
FF 31 49 20 61 70 70 6c 61 75 64 20 79 6f 75 72 20 63 75 72 69 6f 73 69 74 79
      └────────────────────────────────┬────────────────────────────────────┘
                            24 bytes of binary data
Figure 65: Encoding of null.clob
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: clob
│  │
EB 08

Containers

Each of the container types (list, s-expression, and struct) has both a length-prefixed encoding and a delimited encoding.

The length-prefixed encoding places more burden on the writer, but simplifies reading and enables skipping over uninteresting values in the data stream. In contrast, the delimited encoding is simpler and faster for writers, but requires the reader to visit each child value in turn to skip over the container.

Lists

Length-prefixed encoding

An opcode with a high nibble of 0xB_ indicates a length-prefixed list. The lower nibble of the opcode indicates how many bytes were used to encode the child values that the list contains.

If the list’s encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFB opcode to write a variable-length list. The 0xFB opcode is followed by a FlexUInt that indicates the list’s byte length.

0xEB 0x09 represents null.list.

Figure 66: Length-prefixed encoding of an empty list ([])
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 0 indicates that the child values of this list took zero bytes to encode.
B0
Figure 67: Length-prefixed encoding of [1, 2, 3]
┌──── An Opcode in the range 0xB0-0xBF indicates a list.
│┌─── A low nibble of 6 indicates that the child values of this list took six bytes to encode.
B6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Figure 68: Length-prefixed encoding of ["variable length list"]
┌──── Opcode 0xFB indicates a variable-length list. A FlexUInt length follows.
│  ┌───── Length: FlexUInt 22
│  │  ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│  │  │  ┌─────── Length: FlexUInt 20
│  │  │  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     l  i  s  t
FB 2d F9 29 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 6c 69 73 74
      └─────────────────────────────┬─────────────────────────────────┘
                          Nested string element
Figure 69: Encoding of null.list
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: list
│  │
EB 09
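
A writer for the length-prefixed form has to know the total byte-length of the children before it can emit the opcode. The Python sketch below illustrates this with hypothetical helpers; it takes child values that have already been encoded.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_list(children: list[bytes]) -> bytes:
    """Length-prefixed list from already-encoded child values."""
    body = b"".join(children)
    if len(body) <= 15:
        # The low nibble holds the byte-length of the body, not the child count.
        return bytes([0xB0 | len(body)]) + body
    return b"\xFB" + flex_uint(len(body)) + body


one, two, three = b"\x61\x01", b"\x61\x02", b"\x61\x03"
print(encode_list([one, two, three]).hex(" "))
```

The same shape would apply to S-expressions by swapping in their opcodes.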
Delimited Encoding

Opcode 0xF1 begins a delimited list, while opcode 0xF0 closes the most recently opened delimited container that has not yet been closed.

Figure 70: Delimited encoding of an empty list ([])
┌──── Opcode 0xF1 indicates a delimited list
│  ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F1 F0
Figure 71: Delimited encoding of [1, 2, 3]
┌──── Opcode 0xF1 indicates a delimited list
│                    ┌─── Opcode 0xF0 indicates the end of
│                    │    the most recently opened container
F1 61 01 61 02 61 03 F0
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Figure 72: Delimited encoding of [1, [2], 3]
┌──── Opcode 0xF1 indicates a delimited list
│        ┌─── Opcode 0xF1 begins a nested delimited list
│        │        ┌─── Opcode 0xF0 closes the most recently
│        │        │    opened delimited container: the nested list.
│        │        │        ┌─── Opcode 0xF0 closes the most recently opened (and still open)
│        │        │        │    delimited container: the outer list.
│        │        │        │
F1 61 01 F1 61 02 F0 61 03 F0
   └─┬─┘    └─┬─┘    └─┬─┘
     1        2        3
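
The trade-off between the two list encodings shows up when a reader wants to skip a list. The sketch below is a deliberately tiny Python illustration: the hypothetical `skip_value` function understands only the opcodes that appear in the figures above (0x61 for a 1-byte int, 0xB0-0xBF, 0xF1, and 0xF0) and rejects everything else.

```python
END_DELIMITED = 0xF0


def skip_value(data: bytes, pos: int) -> int:
    """Return the position just past the value that starts at data[pos]."""
    opcode = data[pos]
    if opcode == 0x61:
        # 1-byte int, as used in the figures: opcode plus one payload byte.
        return pos + 2
    if 0xB0 <= opcode <= 0xBF:
        # Length-prefixed list: jump over the body without looking inside.
        return pos + 1 + (opcode & 0x0F)
    if opcode == 0xF1:
        # Delimited list: every child must be visited until 0xF0 is found.
        pos += 1
        while data[pos] != END_DELIMITED:
            pos = skip_value(data, pos)
        return pos + 1
    raise ValueError(f"opcode 0x{opcode:02X} is outside this sketch's subset")


print(skip_value(bytes.fromhex("B6 61 01 61 02 61 03"), 0))
print(skip_value(bytes.fromhex("F1 61 01 F1 61 02 F0 61 03 F0"), 0))
```

The length-prefixed branch is a single addition, while the delimited branch recurses into nested containers.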

S-Expressions

S-expressions use the same encodings as lists, but with different opcodes.

Opcode      Encoding
0xC0-0xCF   Length-prefixed S-expression; low nibble of the opcode represents the byte-length.
0xFC        Variable-length prefixed S-expression; a FlexUInt following the opcode represents the byte-length.
0xF2        Starts a delimited S-expression; 0xF0 closes the most recently opened delimited container.

0xEB 0x0A represents null.sexp.

Figure 73: Length-prefixed encoding of an empty S-expression (())
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 0 indicates that the child values of this S-expression took zero bytes to encode.
C0
Figure 74: Length-prefixed encoding of (1 2 3)
┌──── An Opcode in the range 0xC0-0xCF indicates an S-expression.
│┌─── A low nibble of 6 indicates that the child values of this S-expression took six bytes to encode.
C6 61 01 61 02 61 03
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Figure 75: Length-prefixed encoding of ("variable length sexp")
┌──── Opcode 0xFC indicates a variable-length sexp. A FlexUInt length follows.
│  ┌───── Length: FlexUInt 22
│  │  ┌────── Opcode 0xF9 indicates a variable-length string. A FlexUInt length follows.
│  │  │  ┌─────── Length: FlexUInt 20
│  │  │  │   v  a  r  i  a  b  l  e     l  e  n  g  t  h     s  e  x  p
FC 2D F9 29 76 61 72 69 61 62 6C 65 20 6C 65 6E 67 74 68 20 73 65 78 70
      └─────────────────────────────┬─────────────────────────────────┘
                          Nested string element
Figure 76: Delimited encoding of an empty S-expression (())
┌──── Opcode 0xF2 indicates a delimited S-expression
│  ┌─── Opcode 0xF0 indicates the end of the most recently opened container
F2 F0
Figure 77: Delimited encoding of (1 2 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│                    ┌─── Opcode 0xF0 indicates the end of
│                    │    the most recently opened container
F2 61 01 61 02 61 03 F0
   └─┬─┘ └─┬─┘ └─┬─┘
     1     2     3
Figure 78: Delimited encoding of (1 (2) 3)
┌──── Opcode 0xF2 indicates a delimited S-expression
│        ┌─── Opcode 0xF2 begins a nested delimited S-expression
│        │        ┌─── Opcode 0xF0 closes the most recently
│        │        │    opened delimited container: the nested S-expression.
│        │        │        ┌─── Opcode 0xF0 closes the most recently opened (and still open)
│        │        │        │    delimited container: the outer S-expression.
│        │        │        │
F2 61 01 F2 61 02 F0 61 03 F0
   └─┬─┘    └─┬─┘    └─┬─┘
     1        2        3
Figure 79: Encoding of null.sexp
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: sexp
│  │
EB 0A

Structs

Structs have 3 available encodings:

  • Structs with symbol address field names

  • Structs with FlexSym field names

  • Delimited structs

0xEB 0x0B represents null.struct.

Figure 80: Encoding of null.struct
┌──── Opcode 0xEB indicates a typed null; a byte follows specifying the type
│  ┌─── Null type: struct
│  │
EB 0B
Structs With Symbol Address Field Names

An opcode with a high nibble of 0xD_ indicates a struct with symbol address field names (which is similar to the only available encoding of structs in Ion 1.0). The lower nibble of the opcode indicates how many bytes were used to encode all of its nested (field name, value) pairs.

If the struct’s encoded byte-length is too large to be encoded in a nibble, writers may use the 0xFD opcode to write a variable-length struct with symbol address field names. The 0xFD opcode is followed by a FlexUInt that indicates the byte length.

Each field in the struct is encoded as a FlexUInt representing the address of the field name’s text in the symbol table, followed by an opcode-prefixed value.

The symbol address $0 cannot be encoded in this format because the FlexUInt 0 in the field name position is reserved for switching the struct to FlexSym field names.

Figure 81: Length-prefixed encoding of an empty struct ({})
┌──── An opcode in the range 0xD0-0xDF indicates a struct with symbol address field names
│┌─── A lower nibble of 0 indicates that the struct's fields took zero bytes to encode
D0
Figure 82: Length-prefixed encoding of {$10: 1, $11: 2}
┌──── An opcode in the range 0xD0-0xDF indicates a struct with symbol address field names
│  ┌─── Field name: FlexUInt 10 ($10)
│  │        ┌─── Field name: FlexUInt 11 ($11)
│  │        │
D6 15 61 01 17 61 02
      └─┬─┘    └─┬─┘
        1        2
Figure 83: Length-prefixed encoding of {$10: "variable length struct"}
 ┌───────────── Opcode `FD` indicates a variable length struct with symbol address field names
 │  ┌────────── Length: FlexUInt 25
 │  │  ┌─────── Field name: FlexUInt 10 ($10)
 │  │  │  ┌──── Opcode `F9` indicates a variable length string
 │  │  │  │  ┌─ Length: FlexUInt 22; the string is 22 bytes long
 │  │  │  │  │  v  a  r  i  a  b  l  e     l  e  n  g  t  h     s  t  r  u  c  t
FD 33 15 F9 2D 76 61 72 69 61 62 6c 65 20 6c 65 6e 67 74 68 20 73 74 72 75 63 74
               └─────────────────────────────┬─────────────────────────────────┘
                                        UTF-8 bytes
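
As a rough illustration of the rules above, the following Python sketch encodes a struct whose field names are all symbol addresses. The helper names are hypothetical, and field values are passed in already encoded.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def encode_struct(fields: list[tuple[int, bytes]]) -> bytes:
    """Struct with symbol address field names: (address, encoded value) pairs."""
    body = bytearray()
    for address, value in fields:
        if address <= 0:
            # FlexUInt 0 in the field name position is reserved for switching
            # the struct to FlexSym field names, so $0 cannot appear here.
            raise ValueError("field name address must be greater than 0")
        body += flex_uint(address) + value
    if len(body) <= 15:
        return bytes([0xD0 | len(body)]) + bytes(body)
    return b"\xFD" + flex_uint(len(body)) + bytes(body)


print(encode_struct([(10, b"\x61\x01"), (11, b"\x61\x02")]).hex(" "))
```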
Structs With FlexSym Field Names
Note
This encoding is very similar to structs with symbol address field names, but allows writers to choose between representing each field name as a symbol address (for example: $10) or as inline UTF-8 bytes (for example: "foo"). This encoding is potentially less dense, but offers writers significant flexibility over whether and when field names are added to the symbol table.

All length-prefixed structs begin as structs with symbol address field names. However, they can be switched to use FlexSym field names at any time by emitting the FlexUInt 0 (the byte 0x01) in the field name position. Once a struct has been switched to the FlexSym field name encoding, it cannot be switched back.

Each field in the struct is encoded as a FlexSym field name, followed by an opcode-prefixed value.

Figure 84: Length-prefixed encoding of {"foo": 1, $11: 2}
┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 10
││ ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode
││ │  ┌─ FlexSym -3     ┌─ FlexSym: 11 ($11)
││ │  │   f  o  o       │
DA 01 FD 66 6F 6F 61 01 17 61 02
         └──┬───┘ └─┬─┘    └─┬─┘
         3 UTF-8    1        2
          bytes
Figure 85: Length-prefixed encoding of {$11: 1, "foo": 2}
┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 10
││ ┌─ FlexSym: 11 ($11)
││ │        ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode
││ │        │  ┌─ FlexSym -3
││ │        │  │   f  o  o
DA 17 61 01 01 FD 66 6F 6F 61 02
      └─┬─┘       └──┬───┘ └─┬─┘
        1         3 UTF-8    2
                   bytes
Figure 86: Length-prefixed encoding of {$0: 1}
┌─── Opcode with high nibble `D` indicates a struct
│┌── Length: 5
││ ┌── FlexUInt 0 in the field name position indicates that the struct is switching to FlexSym mode
││ │  ┌── FlexSym "escape"
││ │  │
││ │  │
D5 01 01 A0 61 01
      └─┬─┘ └─┬─┘
       $0     1
TODO: Demonstrate splicing macro values into the struct via FlexSym escape code 0x01.
Delimited Structs

Opcode 0xF3 indicates the beginning of a delimited struct with FlexSym field names.

Unlike lists and S-expressions, structs cannot use opcode 0xF0 by itself to indicate the end of the delimited container. This is because 0xF0 is a valid FlexSym (a symbol with 16 bytes of inline text). To close the delimited struct, the writer emits a 0x01 byte (a FlexSym escape) followed by the opcode 0xF0.

Note
While length-prefixed structs can use either symbol address field names or FlexSym field names, delimited structs always use FlexSym-encoded field names.
Figure 87: Delimited encoding of the empty struct ({})
┌─── Opcode 0xF3 indicates the beginning of a delimited struct with `FlexSym` field names.
│  ┌─── FlexSym escape code 0 (0x01): an opcode follows
│  │  ┌─── Opcode 0xF0 indicates the end of the most
│  │  │    recently opened delimited container
F3 01 F0
Figure 88: Delimited encoding of {"foo": 1, $11: 2}
┌─── Opcode 0xF3 indicates the beginning of a delimited struct with `FlexSym` field names.
│
│  ┌─ FlexSym -3     ┌─ FlexSym: 11 ($11)
│  │                 │        ┌─── FlexSym escape code 0 (0x01): an opcode follows
│  │                 │        │  ┌─── Opcode 0xF0 indicates the end of the most
│  │   f  o  o       │        │  │    recently opened delimited container
F3 FD 66 6F 6F 61 01 17 61 02 01 F0
      └──┬───┘ └─┬─┘    └─┬─┘
      3 UTF-8    1        2
       bytes

Nulls

The opcode 0xEA indicates an untyped null (that is: null, or its alias null.null).

The opcode 0xEB indicates a typed null; a byte follows whose value represents an offset into the following table:

Byte   Type
0x00   null.bool
0x01   null.int
0x02   null.float
0x03   null.decimal
0x04   null.timestamp
0x05   null.string
0x06   null.symbol
0x07   null.blob
0x08   null.clob
0x09   null.list
0x0A   null.sexp
0x0B   null.struct

All other byte values are reserved for future use.

Note
Future versions of Ion may decide to generalize this into a "constants" table.
Figure 89: Encoding of null
┌──── The opcode `0xEA` represents a null (null.null)
EA
Figure 90: Encoding of null.string
┌──── The opcode `0xEB` indicates a typed null; a byte indicating the type follows
│  ┌──── Byte 0x05 indicates the type `string`
EB 05
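
The typed-null table maps directly onto a lookup list. The Python sketch below uses hypothetical helper names and treats any type byte outside the table as an error, since those values are reserved.

```python
# Index in this list == type byte that follows opcode 0xEB.
NULL_TYPES = [
    "bool", "int", "float", "decimal", "timestamp", "string",
    "symbol", "blob", "clob", "list", "sexp", "struct",
]


def encode_null(type_name: str = "null") -> bytes:
    """Encode null (or null.null) as EA, and null.<type> as EB plus a type byte."""
    if type_name == "null":
        return b"\xEA"
    return bytes([0xEB, NULL_TYPES.index(type_name)])


def decode_null(data: bytes) -> str:
    """Decode an untyped or typed null from the start of data."""
    if data[0] == 0xEA:
        return "null"
    if data[0] == 0xEB and data[1] < len(NULL_TYPES):
        return "null." + NULL_TYPES[data[1]]
    raise ValueError("not a null, or a reserved null type byte")


print(encode_null("string").hex(" "), decode_null(b"\xEB\x0A"))
```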

Annotations

Annotations can be encoded either as symbol addresses or as FlexSyms. In both encodings, the annotations sequence appears just before the value that it decorates.

It is illegal for an annotations sequence to appear before any of the following:

  • Another annotations sequence

  • The end of the stream

  • A NOP

  • An E-expression (that is: a macro invocation). To add annotations to the expansion of an E-expression, see the annotate macro. (TODO: Link)

Annotations With Symbol Addresses

Opcodes 0xE4 through 0xE6 indicate one or more annotations encoded as symbol addresses. If the opcode is:

  • 0xE4, a single FlexUInt-encoded symbol address follows.

  • 0xE5, two FlexUInt-encoded symbol addresses follow.

  • 0xE6, a FlexUInt follows that represents the number of bytes needed to encode the annotations sequence, which can be made up of any number of FlexUInt symbol addresses.

Figure 91: Encoding of $10::false
┌──── The opcode `0xE4` indicates a single annotation encoded as a symbol address follows
│  ┌──── Annotation with symbol address: FlexUInt 10
E4 15 6F
      └── The annotated value: `false`
Figure 92: Encoding of $10::$11::false
┌──── The opcode `0xE5` indicates that two annotations encoded as symbol addresses follow
│  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
E5 15 17 6F
         └── The annotated value: `false`
Figure 93: Encoding of $10::$11::$12::false
┌──── The opcode `0xE6` indicates a variable-length sequence of symbol address annotations;
│     a FlexUInt follows representing the length of the sequence.
│  ┌──── Annotations sequence length: FlexUInt 3; 3 bytes of symbol addresses follow
│  │  ┌──── Annotation with symbol address: FlexUInt 10 ($10)
│  │  │  ┌──── Annotation with symbol address: FlexUInt 11 ($11)
│  │  │  │  ┌──── Annotation with symbol address: FlexUInt 12 ($12)
E6 07 15 17 19 6F
               └── The annotated value: `false`
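
The choice between opcodes 0xE4, 0xE5, and 0xE6 depends only on how many annotations there are. A minimal Python sketch, with hypothetical helper names and an already-encoded value:

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def annotate(addresses: list[int], value: bytes) -> bytes:
    """Prefix an encoded value with annotations given as symbol addresses."""
    if not addresses:
        raise ValueError("at least one annotation is required")
    encoded = [flex_uint(address) for address in addresses]
    if len(encoded) == 1:
        prefix = b"\xE4" + encoded[0]
    elif len(encoded) == 2:
        prefix = b"\xE5" + encoded[0] + encoded[1]
    else:
        sequence = b"".join(encoded)
        # 0xE6 is followed by the byte-length of the sequence, not its count.
        prefix = b"\xE6" + flex_uint(len(sequence)) + sequence
    return prefix + value


FALSE = b"\x6F"  # the encoding of `false` used in the figures
print(annotate([10, 11, 12], FALSE).hex(" "))
```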

Annotations With FlexSym Text

Opcodes 0xE7 through 0xE9 indicate one or more annotations encoded as FlexSyms.

If the opcode is:

  • 0xE7, a single FlexSym-encoded symbol follows.

  • 0xE8, two FlexSym-encoded symbols follow.

  • 0xE9, a FlexUInt follows that represents the byte length of the annotations sequence, which is made up of any number of annotations encoded as FlexSyms.

While this encoding is more flexible than annotations with symbol addresses, it can be slightly less compact when all the annotations are encoded as symbol addresses.

Figure 94: Encoding of $10::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation with symbol address: FlexSym 10 ($10)
E7 15 6F
      └── The annotated value: `false`
Figure 95: Encoding of foo::false
┌──── The opcode `0xE7` indicates a single annotation encoded as a FlexSym follows
│  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │   f  o  o
E7 FD 66 6F 6F 6F
      └──┬───┘ └── The annotated value: `false`
      3 UTF-8
       bytes

Note that FlexSym annotation sequences can switch between symbol address and inline text on a per-annotation basis.

Figure 96: Encoding of $10::foo::false
┌──── The opcode `0xE8` indicates two annotations encoded as FlexSyms follow
│  ┌──── Annotation: FlexSym 10 ($10)
│  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │   f  o  o
E8 15 FD 66 6F 6F 6F
         └──┬───┘ └── The annotated value: `false`
         3 UTF-8
          bytes
Figure 97: Encoding of $10::foo::$11::false
┌──── The opcode `0xE9` indicates a variable-length sequence of FlexSym-encoded annotations
│  ┌──── Length: FlexUInt 6
│  │  ┌──── Annotation: FlexSym 10 ($10)
│  │  │  ┌──── Annotation: FlexSym -3; 3 bytes of UTF-8 text follow
│  │  │  │           ┌──── Annotation: FlexSym 11 ($11)
│  │  │  │   f  o  o │
E9 0D 15 FD 66 6F 6F 17 6F
            └──┬───┘    └── The annotated value: `false`
            3 UTF-8
             bytes

NOPs

A NOP (short for "no-operation") is the binary equivalent of whitespace. NOP bytes have no meaning, but can be used as padding to achieve a desired alignment.

An opcode of 0xEC indicates a single-byte NOP pad. An opcode of 0xED indicates that a FlexUInt follows that represents the number of additional bytes to skip.

It is legal for a NOP to appear anywhere that a value can be encoded. It is not legal for a NOP to appear in annotation sequences or struct field names. If a NOP appears in place of a struct field value, then the associated field name is ignored; the NOP is immediately followed by the next field name, if any.

Figure 98: Encoding of a 1-byte NOP
┌──── The opcode `0xEC` represents a 1-byte NOP pad
│
EC
Figure 99: Encoding of a 4-byte NOP
┌──── The opcode `0xED` represents a variable-length NOP pad; a FlexUInt length follows
│  ┌──── Length: FlexUInt 2; two more bytes of NOP follow
│  │
ED 05 93 C6
      └─┬─┘
NOP bytes, values ignored
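
One practical wrinkle is that a single 0xED NOP cannot produce every total length, because the FlexUInt length field itself grows from one byte to two once more than 127 bytes are skipped. The Python sketch below, using hypothetical helper names, builds padding of an exact size by falling back to a leading 0xEC when needed. It fills the skipped bytes with zeros, which is an arbitrary choice since their values are ignored.

```python
def flex_uint(value: int) -> bytes:
    """Encode a non-negative integer as a FlexUInt (hypothetical helper)."""
    size = 1
    while value >= 1 << (7 * size):
        size += 1
    return ((value << size) | (1 << (size - 1))).to_bytes(size, "little")


def nop_pad(total: int) -> bytes:
    """Return exactly `total` bytes of NOP padding."""
    if total < 0:
        raise ValueError("padding length must be non-negative")
    if total == 0:
        return b""
    if total == 1:
        return b"\xEC"
    # Try one variable-length NOP: opcode + FlexUInt(skipped) + skipped bytes.
    for length_size in range(1, min(total, 10)):
        skipped = total - 1 - length_size
        if len(flex_uint(skipped)) == length_size:
            return b"\xED" + flex_uint(skipped) + bytes(skipped)
    # No single 0xED NOP has this size; emit a 1-byte NOP and pad the rest.
    return b"\xEC" + nop_pad(total - 1)


print(nop_pad(4).hex(" "))
```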

E-expression Arguments

The binary encoding of E-expressions (aka macro invocations) starts with the address of the macro to expand. The address can be encoded as part of the opcode, as a FixedUInt that follows the opcode, or as a FlexUInt that follows the opcode.

The encoding of the E-expression’s arguments depends on their respective types. Argument types can be classified as belonging to one of two categories: tagged encodings and tagless encodings.

Tagged Encodings

Tagged types are argument types whose encoding begins with an opcode, sometimes informally called a 'tag'. These include the core types and the abstract types.

Core types

The core types are the 13 types in the Ion data model:

null | bool | int | float | decimal | timestamp | string | symbol | blob | clob | list | sexp | struct

Abstract types

The abstract types are unions of two or more of the core types.

Abstract type   Included Ion types
any             All core Ion types
number          int, float, decimal
exact           int, decimal
text            string, symbol
lob             blob, clob
sequence        list, sexp

Tagged E-expression Argument Encoding

When a macro parameter has a tagged type, the encoding of that parameter’s corresponding argument in an E-expression is identical to how it would be encoded anywhere else in an Ion stream: it has a leading opcode that dictates how many bytes follow and how they should be interpreted. This is very flexible, but makes it possible for writers to encode values that conflict with the parameter’s declared type. Because of this, the macro expander will read the argument and then check its type against the parameter’s declared type. If it does not match, the macro expander must raise an error.

Macro foo (defined below) is used in this section’s subsequent examples to demonstrate the encoding of tagged-type arguments.

Figure 100: Definition of example macro foo at address 0
(macro
    foo           // Macro name
    (number::x!)  // Parameters
    /*...*/       // Template (elided)
)
Figure 101: Encoding of E-expression (:foo 3.14e0)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0x6B indicates a 2-byte float; an IEEE-754 half-precision float follows
│  │
00 6B 47 42
      └─┬─┘
      3.14e0

// The macro expander confirms that `3.14e0` (a `float`) matches the expected type: `number`.
Figure 102: Encoding of E-expression (:foo 9)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0x61 indicates a 1-byte integer. A 1-byte FixedInt follows.
│  │  ┌──── A 1-byte FixedInt: 9
00 61 09

// The macro expander confirms that `9` (an `int`) matches the expected type: `number`.
Figure 103: Encoding of E-expression (:foo $10::9)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0xE4 indicates a single annotation with symbol address. A FlexUInt follows.
│  │  ┌──── Symbol address: FlexUInt 10 ($10); an opcode for the annotated value follows.
│  │  │  ┌──── Opcode 0x61 indicates a 1-byte integer
│  │  │  │   ┌──── 1-byte FixedInt 9
00 E4 15 61 09

// The macro expander confirms that `$10::9` (an annotated `int`) matches the expected type: `number`.
Figure 104: Encoding of E-expression (:foo null.int)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0xEB indicates a typed null. A 1-byte FixedUInt follows indicating the type.
│  │  ┌──── Null type: FixedUInt: 1; integer
00 EB 01

// The macro expander confirms that `null.int` matches the expected type: `number`.
Figure 105: Encoding of E-expression (:foo null)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0xEA represents an untyped null (aka `null.null`)
00 EA

// The macro expander confirms that `null` matches the expected type: `number`
Figure 106: Encoding of E-expression (:foo (:bar))
// A second macro definition at address 1
(macro
    bar // Macro name
    ()  // Parameters
    5   // Template; invocations of `bar` always expand to `5`.
)

┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0x01 is less than 0x40, so it is an E-expression invoking the macro
│  │     at address 1: `bar`. `bar` takes no parameters, so no bytes follow.
00 01

// The macro expander confirms that the expansion of `(:bar)` (that is: `5`) matches
// the expected type: `number`.
Figure 107: Encoding of illegal E-expression (:foo "hello")
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0, `foo`. `foo` takes a tagged number as a parameter (`x`), so an opcode follows.
│  ┌──── Opcode 0x95 indicates a 5-byte string. 5 UTF-8 bytes follow.
│  │  h  e  l  l  o
00 95 68 65 6C 6C 6F
      └──────┬─────┘
        UTF-8 bytes

// ERROR: Expected a `number` for `foo` parameter `x`, but found `string`

Tagless Encodings

In contrast to tagged encodings, tagless encodings do not begin with an opcode. This means that they are potentially more compact than a tagged type, but are also less flexible. Because tagless encodings do not have an opcode, they cannot represent E-expressions, annotation sequences, or null values of any kind.

Tagless types include the primitive types and macro shapes.

Primitive Types

Primitive types are self-delineating, either by having a statically known size in bytes or by including length information in their encoding.

Primitive types include:

Ion type Primitive type Size in bytes Encoding

int

uint8

1

uint16

2

uint32

4

uint64

8

compact_uint

variable

int8

1

int16

2

int32

4

int64

8

compact_int

variable

float

float16

2

float32

4

float64

8

symbol

compact_symbol

variable

TODO:

  • Finalize names for primitive types. (compact_? plain_?)

  • Do we need a compact_string encoding? It saves a byte for string lengths >16 and <128.

  • Do we need other int sizes? int24? int40?

Macro Shapes

The term macro shape describes a macro that is being used as the encoding of an E-expression argument. They are considered "shapes" rather than types because while their encoding is always statically known, the types of data produced by their expansion are not. A single macro can produce streams of varying length, containing values of different Ion types, depending on the arguments provided in the invocation.

See the Macro Shapes section of Macros by Example for more information.

Encoding E-expressions With Multiple Arguments

E-expression arguments corresponding to each parameter are encoded one after the other moving from left to right.

Figure 108: Definition of macro foo at address 0
(macro foo             // Macro name
  (                    // Parameters
    string::a
    compact_symbol::b
    uint16::c
  )
  /* ... */            // Body (elided)
)
Figure 109: Encoding of E-expression for macro with multiple parameters: (:0 "hello" baz 512)
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0, `foo`. `foo`'s first parameter is a string, so an opcode follows.
│
│  ┌──── Opcode 0x95 indicates a 5-byte string. 5 UTF-8 bytes follow.
│  │
│  │                 ┌──── `foo`'s second parameter is a compact_symbol, so a `FlexSym` follows.
│  │                 │     FlexSym -3: 3 bytes of UTF-8 text follow.
│  │                 │
│  │                 │           ┌──── `foo`'s third parameter is a uint16, so a
│  │                 │           │     2-byte `FixedUInt` follows.
│  │                 │           │     FixedUInt: 512
│  │  h  e  l  l  o  │   b  a  z │
00 95 68 65 6C 6C 6F FD 62 61 7A 00 02
      └──────┬─────┘    └───┬──┘
        UTF-8 bytes    UTF-8 bytes

Argument Encoding Bitmap (AEB)

The examples in previous sections have only shown how to encode invocations of macros which have either no parameters at all (aka constants) or whose parameters all have a cardinality of exactly-one.

If a macro has any parameters with a cardinality of zero-or-one (?), zero-or-more (*), or one-or-more (+), then E-expressions invoking that macro will begin with an argument encoding bitmap (AEB). An AEB is a series of bits in which each group of bits corresponds to one macro parameter and communicates additional information about how the arguments for that parameter have been encoded in the current E-expression. In particular, the AEB indicates whether a parameter that accepts (:void) has any arguments at all, and how a grouped parameter’s arguments have been delimited.

The number of bits allotted to each parameter is determined by its cardinality, as shown in the table below; each parameter can have 0, 1, or 2 bits.

Grouping Mode Cardinality Example parameter signature Number of bits Bit(s) value Encoding

Ungrouped

Exactly-one

(x int!)

0

n/a

One expression

Zero-or-one

(x int?)

1

0

No expression; equivalent to (:void)

1

One expression

Zero-or-more

(x int*)

0

No expression; equivalent to (:void)

1

One expression

One-or-more

(x int+)

0

n/a

One expression

Grouped

Zero-or-more

(x [int])
(x int...)

2

00

No expression; equivalent to (:void)

01

One expression

10

Length-prefixed expression group

11

Delimited expression group

One-or-more

(x )` + `(x int\...)

00

Illegal. One-or-more forbids (:void).

01

One expression

10

11

The total number of bits in the AEB can be calculated by analyzing the signature of the macro being invoked. If the macro has no parameters or all of its parameters have a cardinality of either exactly-one or one-or-more, no bits are required; the AEB will be omitted altogether. If the macro has many parameters with a cardinality other than exactly-one, it is possible for the AEB to require more than one byte to encode; in such cases, the bytes are written in little-endian order. AEB bytes can contain unused bits.

Bits are assigned to the parameters in a macro’s signature from left to right. Bits are assigned from least significant to most significant (commonly: right-to-left).

Example parameter sequence                                  Bit assignments  Total bits
----------------------------------------------------------  ---------------  ----------
()                                                          No AEB           0
((a int!) (b string!) (c float!))                           No AEB           0
((a int!) (b string!) (c float?))                           -------c         1
((a int!) (b string?) (c float!))                           -------b         1
((a int!) (b string*) (c float?))                           ------cb         2
((a int*) (b string!) (c [float]))                          -----cca         3
((a int*) (b [string]) (c [float]))                         ---ccbba         5
((a [int]) (b [string]) (c [float]+))                       --ccbbaa         6
((a int*) (b [string]) (c [float]) (d [bool]) (e blob...))  eddccbba         9
                                                            -------e
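The bit assignment rule can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `pack_aeb`, the `(name, width)` list, and the sample bit values are all assumptions chosen for the example.

```python
def pack_aeb(widths, values):
    """Pack per-parameter bit values into AEB bytes.

    widths: list of (name, bit_width) pairs in signature order (left to right).
    values: dict mapping a parameter name to its bit value.
    """
    bitmap = 0
    offset = 0  # next free bit position, starting at the least significant bit
    for name, width in widths:
        if width == 0:
            continue  # this parameter has no AEB bits
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}: {value:#b} does not fit in {width} bit(s)")
        bitmap |= value << offset
        offset += width
    # Little-endian: the first byte holds the least significant bits.
    return bitmap.to_bytes((offset + 7) // 8, "little")


# Widths matching the last row of the table: a=1, b=2, c=2, d=2, e=2 (9 bits).
widths = [("a", 1), ("b", 2), ("c", 2), ("d", 2), ("e", 2)]
# Hypothetical bit values chosen only to make each field visible.
values = {"a": 0b1, "b": 0b10, "c": 0b11, "d": 0b00, "e": 0b11}
aeb = pack_aeb(widths, values)
print(aeb.hex(" "))  # 9d 01
```

In the output, the first byte `0x9D` is `1001_1101`, which reads as `e d d c c b b a` from most to least significant bit, and the second byte carries the remaining high bit of `e`.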

Expression Groups

Grouped parameters can be encoded using either a length-prefixed or delimited expression group encoding.

The example encodings in the following sections refer to this macro definition:

Figure 110: Definition of macro foo at address 0
(macro
    foo          // Macro name
    (int::x*)    // Parameters; `x` is a grouped parameter
    /*...*/      // Body (elided)
)

Length-prefixed Expression Groups

If a grouped parameter’s AEB bits are 0b10, then the argument expressions belonging to that parameter will be prefixed by a FlexUInt indicating the number of bytes used to encode them.

Figure 111: Length-prefixed encoding of (:foo [1, 2, 3])
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a group of int expressions as a parameter (`x`),
│     so an argument encoding bitmap (AEB) follows.
│  ┌──── AEB: 0b0000_0010; the arguments for grouped parameter `x` have been encoded
│  │     as a length-prefixed expression group. A FlexUInt length prefix follows.
│  │  ┌──── FlexUInt: 6; the next 6 bytes are an `int` expression group.
│  │  │
00 02 0D 61 01 61 02 61 03
         └─┬─┘ └─┬─┘ └─┬─┘
           1     2     3
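The byte sequence in Figure 111 can be reproduced with a short Python sketch. The helper names `flex_uint` and `encode_length_prefixed_group` are hypothetical; the FlexUInt logic follows the description in the Encoding Primitives section, and the arguments are assumed to be already-encoded tagged values.

```python
def flex_uint(value):
    """Encode a non-negative integer as a FlexUInt."""
    # Each encoded byte carries 7 bits of magnitude; use at least one byte.
    n = max(1, -(-value.bit_length() // 7))
    # n-1 low zero bits, then a terminal 1 bit, then the magnitude.
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def encode_length_prefixed_group(encoded_args):
    """Prefix the concatenated argument bytes with their total byte length."""
    body = b"".join(encoded_args)
    return flex_uint(len(body)) + body


# Figure 111: opcode 0x00 (macro `foo`), AEB 0b10, then the group.
ints = [b"\x61\x01", b"\x61\x02", b"\x61\x03"]  # tagged ints 1, 2, 3
expr = bytes([0x00, 0b10]) + encode_length_prefixed_group(ints)
print(expr.hex(" "))  # 00 02 0d 61 01 61 02 61 03
```

Because the prefix is a byte length, a reader that does not need the arguments can skip the whole group without decoding any of its members.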

Delimited Expression Groups

If a grouped parameter’s AEB bits are 0b11, then the argument expressions belonging to that parameter will be encoded in a delimited sequence. Delimited sequences are encoded differently for tagged types and tagless types.

Delimited Tagged Expression Groups

Tagged type encodings begin with an opcode; a delimited sequence of tagged arguments is terminated by the closing delimiter opcode, 0xF0.

Figure 112: Delimited encoding of (:foo [1, 2, 3])
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 0: `foo`. `foo` takes a group of int expressions as a parameter (`x`),
│     so an argument encoding bitmap (AEB) follows.
│  ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded
│  │     as a delimited expression group. A series of tagged `int` expressions follow.
│  │                    ┌──── Opcode 0xF0 ends the expression group.
│  │                    │
00 03 61 01 61 02 61 03 F0
      └─┬─┘ └─┬─┘ └─┬─┘
        1     2     3
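A minimal Python sketch of writing and reading such a group is shown below. It is deliberately limited: the reader only understands opcode 0x61 followed by a single payload byte, as seen in Figure 112, and the function names are assumptions made for this example.

```python
END_OF_GROUP = 0xF0  # closing delimiter opcode


def encode_delimited_tagged_group(encoded_args):
    """Concatenate already-encoded tagged arguments and append the delimiter."""
    return b"".join(encoded_args) + bytes([END_OF_GROUP])


def read_delimited_one_byte_ints(data, pos):
    """Read a delimited group whose members are all 0x61-tagged one-byte ints.

    Returns the decoded values and the position just past the delimiter.
    """
    values = []
    while data[pos] != END_OF_GROUP:
        if data[pos] != 0x61:
            raise ValueError("this sketch only understands opcode 0x61")
        values.append(int.from_bytes(data[pos + 1:pos + 2], "little", signed=True))
        pos += 2
    return values, pos + 1


# Figure 112: opcode 0x00 (macro `foo`), AEB 0b11, then the delimited group.
group = encode_delimited_tagged_group([b"\x61\x01", b"\x61\x02", b"\x61\x03"])
data = bytes([0x00, 0b11]) + group
print(read_delimited_one_byte_ints(data, 2))  # ([1, 2, 3], 9)
```

Unlike the length-prefixed form, the writer does not need to know the group's size up front, but a reader must decode each member to find the end.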

Delimited Tagless Expression Groups

Tagless type encodings do not have an opcode, and so cannot use the closing delimiter opcode: 0xF0 is a valid first byte for many tagless encodings.

Instead, tagless expressions are grouped into 'pages', each of which is prefixed by a FlexUInt representing a count (not a byte-length) of the expressions that follow. If a prefix has a count of zero, that marks the end of the sequence of pages.

Figure 113: Definition of macro compact_foo at address 1
(macro
    compact_foo          // Macro name
    (compact_int::x*)    // Parameters; `x` is a grouped parameter
    /*...*/              // Body (elided)
)
Figure 114: Delimited encoding of (:compact_foo [1, 2, 3]) using a single page
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 1: `compact_foo`. `compact_foo` takes a group of `compact_int`
│     expressions as a parameter (`x`), so an argument encoding bitmap (AEB) follows.
│  ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded
│  │     as a delimited expression group. Count-prefixed pages of `compact_int`
│  │     expressions follow.
│  │   ┌──── Count prefix: FlexUInt 3; 3 `compact_int`s follow.
│  │   │          ┌──── Count prefix: FlexUInt 0; no more pages follow.
│  │   │          │
01 03 07 03 05 07 01
         └──┬───┘
         First page: 1, 2, 3
Figure 115: Delimited encoding of (:compact_foo [1, 2, 3]) using two pages
┌──── The opcode is less than 0x40, so it is an E-expression invoking the macro at
│     address 1: `compact_foo`. `compact_foo` takes a group of `compact_int`
│     expressions as a parameter (`x`), so an argument encoding bitmap (AEB) follows.
│  ┌──── AEB: 0b0000_0011; the arguments for grouped parameter `x` have been encoded
│  │     as a delimited expression group. Count-prefixed pages of `compact_int`
│  │     expressions follow.
│  │   ┌──── Count prefix: FlexUInt 2; 2 `compact_int`s follow.
│  │   │        ┌──── Count prefix: FlexUInt 1; a single `compact_int` follows.
│  │   │        │    ┌──── Count prefix: FlexUInt 0; no more pages follow.
│  │   │        │    │
01 03 05 03 05 03 07 01
         └─┬─┘    └─ Second page: 3
           │
         First page: 1, 2
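The paging scheme can be sketched in Python as shown below. This is an illustration under stated assumptions: `compact_int` is assumed to be encoded as a FlexInt (which matches the bytes 03, 05, 07 for 1, 2, 3 in the figures), the fixed `page_size` parameter is an invention of this example (a writer may choose page boundaries however it likes), and all function names are hypothetical.

```python
def flex_uint(value):
    """Encode a non-negative integer as a FlexUInt."""
    n = max(1, -(-value.bit_length() // 7))  # 7 magnitude bits per byte
    return ((value << n) | (1 << (n - 1))).to_bytes(n, "little")


def flex_int(value):
    """Encode a signed integer as a FlexInt (two's complement payload)."""
    n = 1
    # Grow until the value fits in 7*n bits of two's complement.
    while not -(1 << (7 * n - 1)) <= value < (1 << (7 * n - 1)):
        n += 1
    raw = ((value << n) | (1 << (n - 1))) & ((1 << (8 * n)) - 1)
    return raw.to_bytes(n, "little")


def encode_pages(values, page_size):
    """Encode tagless ints as count-prefixed pages, ending with a zero count."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    out = bytearray()
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        out += flex_uint(len(page))  # a count of expressions, not a byte length
        for v in page:
            out += flex_int(v)
    out += flex_uint(0)  # a count of zero ends the sequence of pages
    return bytes(out)


print(encode_pages([1, 2, 3], 3).hex(" "))  # 07 03 05 07 01
print(encode_pages([1, 2, 3], 2).hex(" "))  # 05 03 05 03 07 01
```

The two calls produce the group bytes that follow the opcode and AEB in Figures 114 and 115 respectively. Paging lets a writer emit arguments in batches without knowing the final count in advance.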