The folly of fidelity and how I learned to love randomness

Suppose a quantum device would ideally prepare a state \psi , but instead prepares some approximating state \rho . How should the error be quantified?

For many people, the gold standard is to use the fidelity. This is defined as

F =   \langle \psi \vert \rho \vert \psi \rangle

One of the main goals of this blog post is to explain why the fidelity is a terrible, terrible error metric. You should be using the “trace norm distance”. I will not appeal to mathematical beauty (though the trace norm wins on these grounds also), but simply to what is experimentally and observationally meaningful. What do I mean by this? We never observe quantum states. We observe measurement results. Ultimately, we care that the observed statistics of experiments are a good approximation to the exact, target statistics. In particular, every measurement outcome is governed by a projector \Pi , and if we have two quantum states \psi and \rho then the difference in outcome probabilities is

\epsilon_\Pi = | \mathrm{Tr} [ \Pi \rho ] - \langle \psi \vert \Pi \vert \psi \rangle |

Error in outcome probabilities is the only physically meaningful kind of error. Anything else is an abstraction. The above error depends on the measurement we perform, so to get a single number we need to consider either the average or maximum over all possible projectors.

The second goal of this post is to convince you that randomness is really, super useful! This will link into the fidelity vs trace norm story. Several of my papers use randomness in a subtle way to outperform deterministic protocols (see here, here and most recently here). I’ve also written another blog post on the subject (here). Nevertheless, I still find many people are wary of randomness. Consider a protocol for approximating \psi that is probabilistic and prepares states \phi_k with probability p_k . What we will see is that the error (when correctly quantified) of a probabilistic protocol can be considerably less than that of any of the pure states \phi_k in the ensemble. Strange but true. Remember, I only care about measurement probabilities. And you should only care about measurement probabilities! Given a random protocol, the probability of a certain outcome will be
\sum_k  p_k \langle \phi_k  \vert \Pi \vert \phi_k  \rangle = \mathrm{Tr}[ \Pi \rho ]
where
\rho = \sum_k  p_k \vert \phi_k  \rangle \langle \phi_k  \vert
so our measure of error must depend on the averaged density matrix. Even though we might know after the event which pure state we prepared, the measurement statistics are entirely governed by the density matrix that averages over the ensemble of pure states. When I toss a coin, I know after the event with probability 100% whether it is heads or tails, but this does not prevent it from being an excellent 50/50 (pseudo) random number generator. Say I want to build a random number generator with a 25/75 split of different outcomes. I could toss a coin once: if I get heads then I output heads; otherwise I toss a second time and output the second toss result. This clearly does the job. We do not hear people object, “ah ha, but after the first toss it is now more likely to give tails, so your algorithm is broken”. Similarly, quantum computers are random number generators and are allowed a “first coin toss” that determines what they do next.
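To make this concrete, here is a minimal numpy check (my own illustration, with randomly chosen states and a randomly chosen projector) that the ensemble-averaged outcome probability coincides with \mathrm{Tr}[\Pi \rho] :

```python
# Check that sum_k p_k <phi_k|Pi|phi_k> equals Tr[Pi rho] with
# rho = sum_k p_k |phi_k><phi_k|. States, probabilities and the projector
# are all randomly generated for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 5                       # dimension and number of ensemble states

# random normalised pure states and a random probability distribution
phis = [v / np.linalg.norm(v)
        for v in rng.normal(size=(K, d)) + 1j * rng.normal(size=(K, d))]
p = rng.random(K)
p /= p.sum()

# a random rank-1 projector Pi = |u><u|
u = rng.normal(size=d) + 1j * rng.normal(size=d)
u /= np.linalg.norm(u)
Pi = np.outer(u, u.conj())

avg_prob = sum(pk * np.real(phi.conj() @ Pi @ phi) for pk, phi in zip(p, phis))
rho = sum(pk * np.outer(phi, phi.conj()) for pk, phi in zip(p, phis))
print(np.isclose(avg_prob, np.real(np.trace(Pi @ rho))))   # True
```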

Let’s think about the simplest possible example: a single qubit. Imagine the ideal, or target state, is \vert \psi \rangle = \vert 0 \rangle . We consider states of the form \vert \theta \rangle = \cos( \theta / 2 )  \vert 0 \rangle + \sin( \theta / 2 )  \vert 1 \rangle . The states \vert \theta \rangle and \vert -\theta \rangle both have the same fidelity with respect to the target state. So too does the mixed state
\rho(\theta) = \frac{1}{2} ( \vert \theta \rangle \langle \theta \vert + \vert -\theta \rangle \langle -\theta \vert  ) .
Performing a measurement in the X-Z plane, we get the following measurement statistics.
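Here is a minimal sketch of how such curves can be computed (the preparation angle \theta and the grid of measurement angles are my own choices, so the numbers will differ from the plotted ones): for each measurement direction in the X-Z plane we compute the error in the outcome probability for \vert \theta \rangle , \vert -\theta \rangle and \rho(\theta) .

```python
# Measurement error |Tr[Pi rho] - <psi|Pi|psi>| for projectors Pi = |m><m| in
# the X-Z plane, for |theta>, |-theta> and the 50/50 mixture rho(theta).
import numpy as np

def ket(angle):
    return np.array([np.cos(angle / 2), np.sin(angle / 2)])

theta = 0.2                                   # example preparation angle
psi = ket(0.0)                                # target |0>
plus, minus = ket(theta), ket(-theta)
rho_mix = 0.5 * (np.outer(plus, plus) + np.outer(minus, minus))

for phi_m in np.linspace(0, 2 * np.pi, 9):
    m = ket(phi_m)
    Pi = np.outer(m, m)
    target_prob = psi @ Pi @ psi
    err_plus  = abs(plus  @ Pi @ plus  - target_prob)
    err_minus = abs(minus @ Pi @ minus - target_prob)
    err_mixed = abs(np.trace(Pi @ rho_mix) - target_prob)
    print(f"phi={phi_m:5.2f}  |+t>: {err_plus:.4f}  |-t>: {err_minus:.4f}  mixed: {err_mixed:.4f}")
```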

We have three different lines for three states that all have the same fidelity, but the measurement errors behave very differently. Two things jump out.

Firstly, the fidelity is not a reliable measure of measurement error. For the pure states, in the worst case, the measurement error is 20 times higher than the error as quantified by fidelity. Indeed, for almost all measurement angles the error quantified by fidelity considerably understates the measurement error for the pure states. Just about the only thing the fidelity tells you is that there is one measurement (e.g. with \varphi=0 ) for which it correctly captures the measurement error. But it is not generally the case that the quantum computer will perform precisely this measurement. This closes the case on fidelity.

Secondly, the plot highlights how the mixed state performs considerably better than the pure states with the same fidelity! Clearly, we should choose to prepare the mixed state over either of the pure states. For each pure state, there are very small windows where the measurement error is less than for the mixed state. But when you perform a quantum computation you will not have prior access to a plot like this to help you decide what states to prepare. Rather, you need to design your computation to provide a promise that if your error is \delta  (correctly quantified) then all outcome probabilities are correct up to an error of at most \delta  . The mixed state gives the best promise of this kind.

Now we can see that the “right” metric for measuring errors should be something like:
\delta =   \mathrm{max}_{\Pi} | \mathrm{Tr} [ \Pi \rho ] - \langle \psi \vert \Pi \vert \psi \rangle |
Nothing could be more grounded in experiments. This is precisely the trace norm (also called the 1-norm) measure of error. Though it needs a bit of manipulation to get it into its most recognisable form. One might initially worry that the optimisation over projections is tricky to compute, but it is not! We rearrange as follows

\delta =   \mathrm{max}_{\Pi} | \mathrm{Tr} [ \Pi (\rho- \vert \psi \rangle\langle \psi \vert) ]  |

Let

M =   \rho- \vert \psi \rangle\langle \psi \vert = \sum_j \lambda_j  \vert \Phi_j \rangle\langle \Phi_j  \vert

where we have given the eigenvalue/vector decomposition of this operator. The eigenvalues will be real numbers because M  is Hermitian.

It takes a little bit of thinking to realise that the maximum is achieved by a projection onto the subspace with positive eigenvalues
\delta =   \mathrm{max}_{\Pi} | \mathrm{Tr} [ \Pi M ] | = \sum_{j : \lambda_j > 0} \lambda_j
Because M  is traceless, the eigenvalues sum to zero, so the positive and negative eigenvalues have equal total magnitude and we can simplify this to
\delta =  \frac{1}{2}\sum_{j }  | \lambda_j |
If we calculated this \delta   quantity and added it to the plots above, we would indeed see that it gives a strict upper bound on the measurement error for all measurements.
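As a sanity check, here is a small numpy sketch (using the single-qubit example above, with my own choice of \theta ) that computes \delta from the eigenvalues of M and confirms that no randomly chosen projector produces a larger measurement error:

```python
# Compute delta = (1/2) sum_j |lambda_j| from the eigenvalues of
# M = rho - |psi><psi|, then check it upper bounds |Tr[Pi M]| over many
# randomly chosen rank-1 projectors.
import numpy as np

rng = np.random.default_rng(1)
theta = 0.2
plus  = np.array([np.cos(theta / 2),  np.sin(theta / 2)])
minus = np.array([np.cos(theta / 2), -np.sin(theta / 2)])
psi   = np.array([1.0, 0.0])

rho = 0.5 * (np.outer(plus, plus) + np.outer(minus, minus))
M = rho - np.outer(psi, psi)

delta = 0.5 * np.sum(np.abs(np.linalg.eigvalsh(M)))

worst = 0.0
for _ in range(10000):
    u = rng.normal(size=2) + 1j * rng.normal(size=2)
    u /= np.linalg.norm(u)
    Pi = np.outer(u, u.conj())
    worst = max(worst, abs(np.real(np.trace(Pi @ M))))

print(delta, worst)   # worst approaches delta from below, never exceeds it
```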

We have just been talking about measurement errors, but have actually derived the trace norm error. Let us connect this to the usual mathematical way of introducing the trace norm. For an operator M   , we write || M ||_1   or || M ||_{\mathrm{tr}}   for the trace norm, which (for Hermitian M   ) simply means the sum of the absolute values of the eigenvalues of M   . Given two states, the trace norm distance is defined as
\frac{1}{2} || \rho -\sigma ||_{\mathrm{tr}}
Hopefully, the reader can see that this is exactly what we have found above. In other words, we have the equivalence
\frac{1}{2} || \rho -\sigma ||_{\mathrm{tr}} = \mathrm{max}_{\Pi} | \mathrm{Tr} [ \Pi ( \rho -\sigma) ] |

Note that we need the density matrix to evaluate this error metric. If we have a simulation that works with pure states and we perform statistical sampling over some probability distribution to find the average trace norm error
\sum_{k} p_k \frac{1}{2} \big|\big| \vert \phi_k \rangle  \langle \phi_k \vert  - \vert \psi \rangle  \langle \psi \vert  \big|\big|_{\mathrm{tr}}
then this can massively overestimate the true error of the density matrix
\rho = \sum_{k} p_k \vert \phi_k \rangle  \langle \phi_k \vert
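Here is a quick numerical illustration of the gap, again using the single-qubit example with a \theta of my own choosing: the ensemble average of pure-state trace distances is of order \sin(\theta/2) , while the trace distance of the averaged density matrix is only \sin^2(\theta/2) .

```python
# Average of pure-state trace distances versus the trace distance of the
# averaged density matrix, for the |+theta>, |-theta> ensemble.
import numpy as np

def trace_dist(a, b):
    return 0.5 * np.sum(np.abs(np.linalg.eigvalsh(a - b)))

theta = 0.2
plus  = np.array([np.cos(theta / 2),  np.sin(theta / 2)])
minus = np.array([np.cos(theta / 2), -np.sin(theta / 2)])
psi_proj = np.outer([1.0, 0.0], [1.0, 0.0])

avg_of_pure = 0.5 * (trace_dist(np.outer(plus, plus), psi_proj)
                     + trace_dist(np.outer(minus, minus), psi_proj))
dist_of_avg = trace_dist(0.5 * (np.outer(plus, plus) + np.outer(minus, minus)),
                         psi_proj)

print(avg_of_pure, np.sin(theta / 2))      # ~0.0998
print(dist_of_avg, np.sin(theta / 2)**2)   # ~0.00997
```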

I hope this convinces you all to use the trace norm in future error quantification.

I want to finish by tying this back into my recent preprint:
arXiv:1811.08017
That preprint looks not at state preparation but the synthesis of a unitary. There I make use of the diamond norm distance for quantum channels. The reason for using this is essentially the same as above. If we have a channel that has error \delta   in diamond norm distance, it ensures that for any state preparation routine the final state will have no more than error \delta   in trace norm distance. And therefore, it correctly upper bounds the error in measurement statistics.

PhD applicant F.A.Q.

Many of the emails I get regarding PhD projects under my supervision ask similar questions. To reduce the number of emails sent and received, I will address the most common questions below.

  1. When should I apply? Applications are best submitted in either late December or early January. Funding decisions regarding university studentships are typically made around February, so you want to apply before then. The university accepts applications all year round and sometimes funding becomes available at other times of year, but your chances are significantly reduced if you apply later than mid-January.
  2. How do I apply? All applications must be made by completing the online application form, here. For this you will need: a CV, a transcript of your exam results and two referees who will provide a letter of recommendation. You are very welcome to also informally email me to introduce yourself or ask further questions about the project. But please be aware that an email is not considered a formal application.
  3. When is the start date? A successful January applicant would typically start in September the same year.
  4. Do you have funding for a PhD studentship? The department has a limited number of EPSRC funded studentships. If you have read an advert for a PhD position in my group, then (unless otherwise stated) it will be for an EPSRC funded studentship. These studentships are allocated competitively, which means the allocation is partly decided by the track record of the applicant. Therefore, I have to first shortlist and interview applicants, and then decide whether to put them forward for a studentship award. UK citizens are eligible for EPSRC studentships. EU (but non-UK) citizens are also eligible and encouraged, but currently (as of 2018) only 10 percent of EPSRC studentships can be allocated to EU (non-UK) citizens.
  5. What is the Sheffield University Prize scholarship? This is a more prestigious award that is very competitive and you can read more by following this link. It has the advantage that all nationalities are eligible. However, since it is extremely competitive, it is only worth applying if you are an exceptional candidate. For instance, if you scored top of the year in your exams and already have a published piece of research.
  6. The job advert asks for a degree in physics, computer science or mathematics, but I only have one of these degrees; will I struggle? I am only expecting applicants to have one degree. Quantum computing and information is an interdisciplinary topic and so draws on tools and techniques from different degree courses. This makes it an exciting and rewarding area of research that makes you more aware of the connections between different disciplines. Indeed, I now consider the division of universities into departments a historical accident rather than a reflection of a fundamental division. Nevertheless, undergraduates often feel nervous about choosing an interdisciplinary research topic. As a supportive tale, I often point out that I started my PhD in quantum computing with an undergraduate degree in Physics and Philosophy, and so I only had half a relevant degree. However, what you lack in experience must be replaced with enthusiasm. I expect PhD students to be fascinated by different aspects of physics, computer science and mathematics.
  7. Can I do my PhD remotely? An important part of a PhD is becoming part of a research group and interacting with your peers on a daily basis. As such, I expect students to be based in Sheffield and only permit remote supervision in exceptional circumstances. I might be more flexible on this point if you have your own external funding.
  8. Can you recommend any other PhD programmes? In the UK, I highly recommend the quantum themed CDTs (centres for doctoral training) run by University College London, Imperial College and Bristol. They all have one extra year of training at the beginning, after which you choose a 3 year project (so 4 years total). An important point is that after the first year these CDTs (usually) provide the option of doing your research project at another university, giving you a lot of options. There are also a lot of great places internationally, too many to list.

QIP 2018 talks

You can find all the QIP 2018 talks here:
https://collegerama.tudelft.nl/Mediasite/Showcase/qip2018

I gave a talk on random compiling that you can watch here:
https://collegerama.tudelft.nl/Mediasite/Play/ea10646f4d494cdaac3b17207d68cf601d?playfrom=0&player=0381c08c03db488d8bdbd4a5dfd3478b0a
The relevant paper is published here in Phys. Rev A. After giving the talk, I wrote a pair of blog posts (post 1 and post 2) to address the most common questions that arose during QIP coffee breaks.

QIP random circuits mystery resolved! Part 2

Here I will follow on from my previous post. Circuit compilers usually come with a promise that the cost is upper bounded by some function

f(\epsilon) = A \log ( 1 / \epsilon)^\gamma

with constants depending on the details. In my talk, I asserted that: (1) for single-qubit gate sets, the Solovay-Kitaev algorithm achieves \gamma \sim 3.97  ; and (2) modern tools can efficiently solve the single-qubit problem using Clifford+T gates and achieve \gamma = 1  .
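To get a feel for what the exponent \gamma means in practice, here is a rough back-of-the-envelope comparison (the prefactor A = 1 and the target accuracies are arbitrary choices of mine):

```python
# Compare the cost scaling f(eps) = A * log(1/eps)^gamma for the Solovay-Kitaev
# exponent (~3.97) versus the optimal exponent (1). A = 1 is an arbitrary
# normalisation for illustration only.
import numpy as np

A = 1.0
for eps in (1e-2, 1e-6, 1e-10):
    sk      = A * np.log(1 / eps) ** 3.97
    optimal = A * np.log(1 / eps) ** 1.0
    print(f"eps={eps:.0e}:  SK ~ {sk:10.0f}   optimal ~ {optimal:5.1f}   ratio ~ {sk / optimal:8.0f}")
```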

I further speculated that for multiqubit compiling problems, the optimal scaling (and hence \gamma value) could be much worse! But a few results have been pointed out to me. Firstly, the famous Dawson-Nielsen paper actually shows that \gamma \sim 3.97  is achieved by Solovay-Kitaev for all compiling problems, and so is independent of the number of qubits. Secondly, this neat paper by Harrow-Recht-Chuang showed that an optimal compiler will always achieve \gamma \sim 1  scaling independent of the number of qubits.

However, the Harrow-Recht-Chuang result is non-constructive, so it doesn’t give us a compiler. It just says an optimal one exists. Also, it doesn’t say anything about how the classical runtime scales with the number of qubits. Therefore, if we restrict to compilers with polynomial runtime (polynomial in \log(1 / \epsilon)  ), all we can say is that the optimal scaling sits somewhere in the interval 1 \leq \gamma \leq 3.97  . Finding where the optimum sits in this interval (and writing the software for a few-qubit compiler) is clearly one of the most important open problems in the field.

The above discussion seems to entail that multiqubit compiling isn’t very different from single-qubit compiling in terms of overhead scaling. However, we have the prefactor A  , which is constant with respect to \epsilon  but could increase with the number of qubits. Indeed, we know that there are classical functions whose circuits need an exponential number of gates, which tells us that the prefactor for a quantum compiler should also scale exponentially with qubit number.

QIP random circuits mystery resolved! Part 1

Thank you QIP audience! On Tuesday, I gave a presentation on this paper
Phys. Rev. A 95, 042306 (2017)
https://arxiv.org/abs/1612.02689
I had some great questions, but in retrospect don’t think my answers were the best. Many questions focused on how to interpret results showing that random circuits improve on purely unitary circuits. I often get this question and so tried to pre-empt it in the middle of my talk, but clearly failed to convey my point. I am still getting this question every coffee break, so let me try again. Another interesting point is how the efficiency of an optimal compiler scales with the number of qubits (see Part 2). In what follows I have to credit Andrew Doherty, Robin Blume-Kohout, Scott Aaronson and Adam Bouland, for their insights and questions. Thanks!

First, let’s recap. The setting is that we have some gate set \mathcal{G}  where each gate in the set has a cost. If the gate set is universal then for any target unitary V  and any \epsilon > 0 we can find some circuit U = G_1 G_2 \ldots G_n  built from gates in \mathcal{G}  such that the distance between the unitaries is less than \epsilon . For the distance measure we take the diamond norm distance because it has nice composition properties. Typically, compilers come with a promise that the cost of the circuit is upper bounded by some function f(\epsilon) = A \log ( 1 / \epsilon)^\gamma  for some constants A  and \gamma  depending on the details (see Part 2 for details).

The main result I presented was that we can find a probability distribution of circuits \{ U_k , p_k \} such that the channel

\mathcal{E}(\rho) = \sum_k p_k U_k \rho U_k^\dagger

is O(\epsilon^2) close to the target unitary V  even though the individual circuits have cost upper bounded by f(\epsilon) . So using random circuits gets you free quadratic error suppression!
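Here is a toy single-qubit illustration of the flavour of this result (my own example, not the construction from the paper): the target is the identity, and the "random compiler" outputs e^{\pm i \epsilon Z} with equal probability. As a simple stand-in for the diamond distance I use the trace distance between Choi states, which lower bounds it.

```python
# Each unitary alone is ~eps-far from the target identity channel, but the
# 50/50 mixture of exp(+i*eps*Z) and exp(-i*eps*Z) is only ~eps^2-far.
import numpy as np

def choi(kraus_ops, probs):
    """Choi state of the single-qubit channel rho -> sum_k p_k K_k rho K_k^dag."""
    phi = np.array([1, 0, 0, 1]) / np.sqrt(2)          # maximally entangled state
    J = np.zeros((4, 4), dtype=complex)
    for p, K in zip(probs, kraus_ops):
        v = np.kron(np.eye(2), K) @ phi
        J += p * np.outer(v, v.conj())
    return J

def trace_dist(a, b):
    return 0.5 * np.sum(np.abs(np.linalg.eigvalsh(a - b)))

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
eps = 0.01
U_plus  = np.cos(eps) * I2 + 1j * np.sin(eps) * Z      # exp(+i*eps*Z)
U_minus = np.cos(eps) * I2 - 1j * np.sin(eps) * Z      # exp(-i*eps*Z)

J_target = choi([I2], [1.0])
print(trace_dist(choi([U_plus], [1.0]), J_target))                 # ~ eps
print(trace_dist(choi([U_plus, U_minus], [0.5, 0.5]), J_target))   # ~ eps^2
```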

But what the heck is going on here!? Surely, each individual run of the compiler gives a particular circuit U_k  and the experimentalist knows that this unitary has been performed. But this particular instance has an error of no more than \epsilon , not O( \epsilon^2) . Is it that each circuit has error upper bounded by \epsilon , but that somehow the typical or average circuit has error O(\epsilon^2) ? No! The theorem holds even when every unitary has exactly \epsilon error. However, typicality does resolve the mystery, but only when we think about the quantum computation as a whole.

Each time we use a random compiler we get some circuit U_k = V e^{i \delta_k} where e^{i \delta_k} is a coherent noise term with small || \delta_k || \leq O(\epsilon) . However, these are just subcircuits of a larger computation. Therefore, we really want to implement some large computation

V^{(n)} \ldots V^{(2)} V^{(1)} .

For each subcircuit, compiling is feasible (e.g. it acts nontrivially on only a few qubits), but the whole computation acts on too many qubits to compile optimally or even to compute the matrix representation. Then using random compiling we implement some sequence

U^{(n)}_{a_n} \ldots U^{(2)}_{a_2} U^{(1)}_{a_1}

with some probability

p_{a_n}^{(n)} \ldots p^{(2)}_{a_2} p^{(1)}_{a_1} .

OK, now let’s see what happens with the coherent noise terms. For the k^{th} subcircuit we have
U^{(k)}_{a_k} = V^{(k)} e^{i \delta_{a_k}^{(k)}}

so the whole computation we implement is

U^{(n)}_{a_n} \ldots U^{(2)}_{a_2} U^{(1)}_{a_1} = V^{(n)} e^{i \delta_{a_n}^{(n)}}\ldots V^{(2)} e^{i \delta_{a_2}^{(2)}} V^{(1)} e^{i \delta_{a_1}^{(1)}}

We can conjugate the noise terms through the circuits. For instance,

e^{i \delta_{a_2}^{(2)} } V^{(1)} =  V^{(1)} e^{i \Delta_{a_2}^{(2)}}

where

\Delta_{a_2}^{(2)}= (V^{(1)})^\dagger \delta_{a_2}^{(2)} V^{(1)} .

Since norms are unitarily invariant we still have

||\Delta_{a_2}^{(2)}|| = || \delta_{a_2}^{(2)} || \leq O(\epsilon)
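If you want to convince yourself of this conjugation step numerically, here is a short numpy check on a randomly generated example (the dimension and the noise size below are arbitrary choices of mine):

```python
# Verify exp(i*delta) V == V exp(i*Delta) with Delta = V^dagger delta V,
# and that the operator norm of Delta equals that of delta.
import numpy as np

def expi(H):
    """exp(i*H) for Hermitian H, computed via its eigendecomposition."""
    w, U = np.linalg.eigh(H)
    return (U * np.exp(1j * w)) @ U.conj().T

rng = np.random.default_rng(2)
d = 4

# random unitary V via the QR decomposition of a random complex matrix
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
V, _ = np.linalg.qr(A)

# small Hermitian "coherent noise" delta, and its conjugated version Delta
H = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
delta = 0.01 * (H + H.conj().T) / 2
Delta = V.conj().T @ delta @ V

print(np.allclose(expi(delta) @ V, V @ expi(Delta)))                    # True
print(np.isclose(np.linalg.norm(delta, 2), np.linalg.norm(Delta, 2)))   # True
```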

Repeating this conjugation process we can collect all the coherent noise terms together

U^{(n)}_{a_n} \ldots U^{(2)}_{a_2} U^{(1)}_{a_1} = (V^{(n)} \ldots V^{(2)} V^{(1)} ) ( e^{i\Delta_{a_n}^{(n)}} \ldots e^{i\Delta_{a_2}^{(2)}} e^{i\Delta_{a_1}^{(1)}})

Since the noise terms are small, we can approximate

e^{i\Delta_{a_n}^{(n)}} \ldots e^{i\Delta_{a_2}^{(2)}} e^{i\Delta_{a_1}^{(1)}} \sim e^{i\Delta}

where

\Delta = \sum_k \Delta_{a_k}^{(k)}

Using the triangle inequality one has

|| \Delta ||  \leq \sum_k  || \Delta_{a_k}^{(k)}|| \leq n O(\epsilon)  .

But this noise term could be much, much smaller than this bound implies. Indeed, one would only get close to equality when the noise terms coherently add up. In some sense, our circuits must conspire to align their coherent noise terms so that they all point in the same direction. Conversely, one might find that the coherent noise terms cancel out, and one could possibly even have \Delta = 0 . This would be the ideal situation. But we are talking about a large unitary, too large to compute; otherwise we would have simulated the whole quantum computation. For a fixed \Delta , we can’t say much more. But if we remember that \Delta comes from a random ensemble, we can make probabilistic arguments about its size. A key point in the paper is that we choose the probabilities such that the expectation of each random term is zero:

\mathbb{E} (  \Delta_{a_k}^{(k)} )=   \sum_{a_k} p^{(k)}_{a_k} \Delta_{a_k}^{(k)} = 0  .

Furthermore, we are summing a series of such terms (sampled independently). A sum of independent random variables will converge (via a central limit theorem) to a Gaussian distribution centred around the mean (which is zero). Of course, there will be some spread about the mean, but its typical size is of order \sqrt{n} \epsilon  rather than the n \epsilon  bound above that limits the tails of the distribution. This gives us a rough intuition that \Delta will (with high probability) have quadratically smaller size. Indeed, this is how Hastings frames the problem in his related paper arXiv:1612.01011. Based on this intuition one could imagine trying to upper bound \mathbb{E} (  || \Delta|| )  and make the above discussion rigorous; this might be an interesting exercise to work through. However, both Hastings and I instead tackled the problem by bounding the diamond distance of the channel, which implicitly entails that the coherent errors compose (with high probability) in an incoherent way!
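To see this \sqrt{n} -versus- n behaviour in the simplest possible setting, here is a Monte Carlo sketch with scalar stand-ins for the \Delta terms (my own toy model, not the argument from the paper):

```python
# Sum n independent, zero-mean terms of size eps (scalar +/-eps with equal
# probability, standing in for the Hermitian Delta terms) and compare the
# typical size of the sum with the triangle-inequality bound n*eps.
import numpy as np

rng = np.random.default_rng(0)
eps, n, trials = 0.01, 10000, 2000

sums = np.abs(rng.choice([-eps, eps], size=(trials, n)).sum(axis=1))

print("triangle-inequality bound:  ", n * eps)             # 100.0
print("typical |Delta| (mean):     ", sums.mean())         # ~ 0.8
print("sqrt(n)*eps for comparison: ", np.sqrt(n) * eps)    # 1.0
```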

More in part 2

QCDA consortium

Last year the EU recommended our QCDA (Quantum Code Design & Architectures) network for funding via its QuantERA programme! The consortium has some really amazing scientists involved and we will be recruiting 5 more people to join (4 postdocs and 1 PhD student). If you want to learn more about the network, I’ve set up a website dedicated to QCDA

http://www.qcda.eu/

We are still waiting to hear from the national funding agencies when the project can start but it could be as soon as February 2018.