From 07299e170532a24fbaefa6a87aa3c427b79f765e Mon Sep 17 00:00:00 2001 From: Jake Lishman Date: Wed, 1 Sep 2021 19:21:17 +0100 Subject: [PATCH] Optimise CommutationAnalysis transpiler pass MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The `CommutationAnalysis` transpiler pass for large circuits with many gates is typically one of the longer passes, along with the mapping passes. It spends most of its time deciding whether any two given operators commute, which it does by matrix multiplication and maintaining a cache. This commit maintains the same general method (as opposed to, say, maintaining a known-good lookup table), but performs the following optimisations to improve it, in approximate order from most to least impactful: - we store the _result_ of "do these two matrices commute?" instead of the previous method of two matrices individually in the cache. With the current way the caching is written, the keys depend on both matrices, so it is never the case that one key can be a cache hit and the other a cache miss. This means that storing only the result does not cause any more cache misses than before, and saves the subsequent matrix operations (multiplication and comparison). In real-world usage, this is the most major change. - the matrix multiplication is slightly reorganised to halve the number of calls to `Operator.compose`. Instead of first constructing an identity operator over _all_ qubits, and composing both operators onto it, we reorganise the indices of the qubit arguments so that all the "first" operator's qubits come first, and it is tensored with a small identity to bring it up to size. Then the other operator is composed with it. This is generally faster, since it replaces a call to `compose`, which needs to do a matmul-einsum step, with a simple `kron`, which need not concern itself with order. It also results in fewer operations, since the input matrices are smaller. 
- the cache-key algorithm is changed to avoid string-ification as much as possible. This generally has a very small impact for most real-world applications, but has _massive_ impact on circuits with large numbers of unsynthesised `UnitaryGate` elements (like quantum volume circuits being transpiled without `basis_gates`). This is because the gate parameters were previously being converted to string, which for `UnitaryGate` meant string-ifying the whole matrix. This was much slower than just doing the dot product, so was defeating the purpose of the cache. On my laptop (i7 Macbook Pro 2019), this gives a 15-35% speed increase on the `time_quantum_volume_transpile` benchmark at transpiler optimisation levels 2 and 3 (the only levels at which the `CommutationAnalysis` pass is run by default), over the whole transpiler pass. The improvement in runtime of the pass itself depends strongly on the type of circuit itself, but in the worst (highly non-realistic) cases, can be nearly an order of magnitude improvement (for example just calling `transpile(optimization_level=3)` on a `QuantumVolume(128)` circuit drops from 27s to 5s). 
`asv` `time_quantum_volume_transpile` benchmarks: - This commit: =============================== ============ transpiler optimization level ------------------------------- ------------ 0 3.57±0s 1 7.16±0.03s 2 10.8±0.02s 3 33.3±0.08s =============================== ============ - Previous commit: =============================== ============ transpiler optimization level ------------------------------- ------------ 0 3.56±0.02s 1 7.24±0.05s 2 16.8±0.04s 3 38.9±0.1s =============================== ============ --- .../optimization/commutation_analysis.py | 99 ++++++++++++++----- 1 file changed, 72 insertions(+), 27 deletions(-) diff --git a/qiskit/transpiler/passes/optimization/commutation_analysis.py b/qiskit/transpiler/passes/optimization/commutation_analysis.py index 82b1b0a699a1..51f1562b3801 100644 --- a/qiskit/transpiler/passes/optimization/commutation_analysis.py +++ b/qiskit/transpiler/passes/optimization/commutation_analysis.py @@ -87,42 +87,87 @@ def run(self, dag): self.property_set["commutation_set"][(current_gate, wire)] = temp_len - 1 -def _commute(node1, node2, cache): +_COMMUTE_ID_OP = {} + + +def _hashable_parameters(params): + """Convert the parameters of a gate into a hashable format for lookup in a dictionary. + + This aims to be fast in common cases, and is not intended to work outside of the lifetime of a + single commutation pass; it does not handle mutable state correctly if the state is actually + changed.""" + try: + hash(params) + return params + except TypeError: + pass + if isinstance(params, (list, tuple)): + return tuple(_hashable_parameters(x) for x in params) + if isinstance(params, np.ndarray): + # We trust that the arrays will not be mutated during the commutation pass, since nothing + # would work if they were anyway. 
Using the id can potentially cause some additional cache + # misses if two UnitaryGate instances are being compared that have been separately + # constructed to have the same underlying matrix, but in practice the cost of string-ifying + # the matrix to get a cache key is far more expensive than just doing a small matmul. + return (np.ndarray, id(params)) + # Catch anything else with a slow conversion. + return ("fallback", str(params)) + +def _commute(node1, node2, cache): if not isinstance(node1, DAGOpNode) or not isinstance(node2, DAGOpNode): return False - for nd in [node1, node2]: if nd.op._directive or nd.name in {"measure", "reset", "delay"}: return False - if node1.op.condition or node2.op.condition: return False - if node1.op.is_parameterized() or node2.op.is_parameterized(): return False - qarg = list(set(node1.qargs + node2.qargs)) - qbit_num = len(qarg) - - qarg1 = [qarg.index(q) for q in node1.qargs] - qarg2 = [qarg.index(q) for q in node2.qargs] - - id_op = Operator(np.eye(2 ** qbit_num)) - - node1_key = (node1.op.name, str(node1.op.params), str(qarg1)) - node2_key = (node2.op.name, str(node2.op.params), str(qarg2)) - if (node1_key, node2_key) in cache: - op12 = cache[(node1_key, node2_key)] + # Assign indices to each of the qubits such that all `node1`'s qubits come first, followed by + # any _additional_ qubits `node2` addresses. This helps later when we need to compose one + # operator with the other, since we can easily expand `node1` with a suitable identity. 
+ qarg = {q: i for i, q in enumerate(node1.qargs)} + num_qubits = len(qarg) + for q in node2.qargs: + if q not in qarg: + qarg[q] = num_qubits + num_qubits += 1 + qarg1 = tuple(qarg[q] for q in node1.qargs) + qarg2 = tuple(qarg[q] for q in node2.qargs) + + node1_key = (node1.op.name, _hashable_parameters(node1.op.params), qarg1) + node2_key = (node2.op.name, _hashable_parameters(node2.op.params), qarg2) + try: + # We only need to try one orientation of the keys, since if we've seen the compound key + # before, we've set it in both orientations. + return cache[node1_key, node2_key] + except KeyError: + pass + + operator_1 = Operator(node1.op, input_dims=(2,) * len(qarg1), output_dims=(2,) * len(qarg1)) + operator_2 = Operator(node2.op, input_dims=(2,) * len(qarg2), output_dims=(2,) * len(qarg2)) + + if qarg1 == qarg2: + # Use full composition if possible to get the fastest matmul paths. + op12 = operator_1.compose(operator_2) + op21 = operator_2.compose(operator_1) else: - op12 = id_op.compose(node1.op, qargs=qarg1).compose(node2.op, qargs=qarg2) - cache[(node1_key, node2_key)] = op12 - if (node2_key, node1_key) in cache: - op21 = cache[(node2_key, node1_key)] - else: - op21 = id_op.compose(node2.op, qargs=qarg2).compose(node1.op, qargs=qarg1) - cache[(node2_key, node1_key)] = op21 - - if_commute = op12 == op21 - - return if_commute + # Expand operator_1 to be large enough to contain operator_2 as well; this relies on qargs1 + # being the lowest possible indices so the identity can be tensored before it. 
+ extra_qarg2 = num_qubits - len(qarg1) + if extra_qarg2: + try: + id_op = _COMMUTE_ID_OP[extra_qarg2] + except KeyError: + id_op = _COMMUTE_ID_OP[extra_qarg2] = Operator( + np.eye(2 ** extra_qarg2), + input_dims=(2,) * extra_qarg2, + output_dims=(2,) * extra_qarg2, + ) + operator_1 = id_op.tensor(operator_1) + op12 = operator_1.compose(operator_2, qargs=qarg2, front=False) + op21 = operator_1.compose(operator_2, qargs=qarg2, front=True) + cache[node1_key, node2_key] = cache[node2_key, node1_key] = ret = op12 == op21 + return ret