A Note on Quantum Approximate Counting and Random Musings

Ok, today we’ll tackle something that I’ve never talked about on my blog before: quantum computing. There might be a few more quantum computing posts in the future if I am in the mood for it.

The problem at hand is simple: There is an unordered list of N items of which K are marked; approximate the value of K. Throughout, we’ll denote the set of N items by [N] and the set of K marked items by [K].

This is actually a rather old problem which has been tackled before, and it is deeply connected to amplitude estimation (see [1]), which uses the Quantum Fourier transform. Recently though, a simpler proof which uses only Grover-like techniques was published by Scott Aaronson, a professor at UT Austin (Hook ’em Horns!), together with Patrick Rall. Note that the algorithm is probabilistic rather than deterministic.

The heart of the technique is Grover’s algorithm, so perhaps I should spend a few words on that. Consider an unordered list of size N in which you want to find a particular element, where any given element can be checked to be a solution in constant time. Grover’s algorithm finds such an element in O(\sqrt{N}) queries.

In the classical case, one is forced to do this (both the above search problem and QAC) in O(N) time. The full strength of Grover’s algorithm is that one can actually find a pre-image in f^{-1}(1) for f:[N] \to \{0,1\}, where f maps x to 1 if and only if x is a solution to the given problem. This is all encoded as a quantum black-box unitary operator U defined by U\lvert x \rangle =\lvert x \rangle if f(x)=0 and U\lvert x \rangle=-\lvert x \rangle if f(x)=1.

The idea of the algorithm is that we start with the uniform superposition of all states:

\lvert s \rangle =\frac{1}{\sqrt{N}} \sum\limits_{x=1}^{N} \lvert x \rangle

followed by applying the operator U above, which selects the marked items by flipping the sign of their amplitudes,

and then applying the diffusion operator:

D=2 \lvert s \rangle \langle s \rvert-I. This, in effect, reflects every amplitude about the mean, selectively increasing the relative amplitude of the marked items.

Explicitly, the mapping of the diffusion gate is:

\sum\limits_{x \in \{0,1\}^{n}}\alpha_{x}\lvert x \rangle  \mapsto \sum\limits_{x \in \{0,1\}^{n}}(2\mu-\alpha_{x})\lvert x \rangle where \mu is the mean of the amplitudes \alpha_{x}.

One repeats this pair of steps a certain number of times (specifically O(\sqrt{N}) of them) and then measures, obtaining a marked item with high probability.
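To make this concrete, here is a tiny classical simulation of the iteration (a numpy sketch of my own; the list size, marked index and iteration count are toy values):

```python
import numpy as np

def grover(N, marked, iterations):
    """Toy simulation of Grover's algorithm on a list of size N."""
    s = np.full(N, 1 / np.sqrt(N))           # uniform superposition |s>
    amp = s.copy()
    for _ in range(iterations):
        amp[list(marked)] *= -1              # oracle U: flip the sign of marked amplitudes
        amp = 2 * s * (s @ amp) - amp        # diffusion D = 2|s><s| - I: reflect about the mean
    return amp ** 2                          # measurement probabilities

N, marked = 64, {17}
iterations = int(np.pi / 4 * np.sqrt(N))     # ~ (pi/4) sqrt(N) iterations for a single marked item
probs = grover(N, marked, iterations)
print(probs[17])                             # close to 1: the marked item is measured w.h.p.
```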

A Random Aside?

In the case of a non-local hidden-variable quantum computer, one can actually do this in at most O(\sqrt[3]{N}) time. I’m not really going to talk about this. However, it spawned a simple question: What if you could perform the search in O(\log N) time? What would that actually imply?

Consider the SAT problem: there are boolean variables x_{1},\cdots,x_{n} and expressions l_{1},\cdots,l_{k} in these variables. Find an assignment of true/false to each x_{i} such that all the expressions l_{p} evaluate to true, i.e. l_{1} \land l_{2} \land \cdots \land l_{k} is true.

Anyways, we could store all O(2^{n}) possible assignments and search through them in O(\log(2^{n}))=O(n) time in our hypothetical case, which is linear. This would mean that NP-complete decision problems like SAT and the Hamiltonian path problem could be solved in polynomial time, which implies P=NP. So it is unlikely that this is the case.
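For concreteness, the brute-force enumeration over all 2^{n} assignments looks like this (a sketch of my own, with the formula written in CNF clause form and a made-up instance):

```python
from itertools import product

def brute_force_sat(n, clauses):
    """Walk all 2^n assignments; return one satisfying every clause, else None.
    A clause is a list of literals: +i stands for x_i, -i for NOT x_i (1-indexed)."""
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in clauses):
            return bits
    return None

# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)  -- a made-up instance
print(brute_force_sat(3, [[1, -2], [2, 3], [-1, -3]]))
```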

I’ll get back to these things at the end of this post.

Quantum Approximate Counting

Consider an unordered collection of size N in which K items are marked. Estimate the value of K. This is the problem at hand.

Firstly, let’s clarify the exact problem statement. We wish to find an approximate solution, so we select a multiplicative error \epsilon, i.e. find \tilde{K} such that

1-\epsilon \leq \frac{\tilde{K}}{K} \leq 1+\epsilon

Moreover, the algorithm is supposed to be probabilistic, so we also select a small probability of failure \delta.

Let’s first review some obvious solutions before getting to the actual algorithm. Note that in the quantum case, we only care about how many queries we make, i.e. the query complexity. The real difference is that on a quantum computer we can measure observables and obtain a probability distribution of outcomes, so we want to make clever choices for the observables we measure in order to extract useful information; this is the same philosophy behind Grover’s algorithm. As such, it seems reasonable that in a good solution, the query complexity should depend somehow on the value K itself.

Classical solution

To find K exactly, the only classical choice is a linear scan, which takes O(N) queries. The approximate version can be handled classically by random sampling, using O(\frac{1}{\epsilon^{2}}\frac{N}{K}) queries.
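Here’s what that classical sampling estimate looks like (a sketch of my own; the instance and parameters are made up): sample items uniformly, count the marked fraction, and scale by N. Roughly O(\frac{1}{\epsilon^{2}}\frac{N}{K}) samples give a multiplicative \epsilon-estimate with constant probability.

```python
import random

def estimate_K(N, is_marked, num_samples):
    """Classical estimate: K_hat = N * (fraction of uniformly sampled items that are marked)."""
    hits = sum(is_marked(random.randrange(N)) for _ in range(num_samples))
    return N * hits / num_samples

N, K, eps = 10_000, 250, 0.1                     # toy instance: the first K items are marked
num_samples = round(N / (eps ** 2 * K))          # ~ N / (eps^2 K) queries
print(estimate_K(N, lambda i: i < K, num_samples))
```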

‘Easy’ quantum solution

Use Grover’s algorithm, where f maps an input to 1 if and only if it is marked; this takes O(\sqrt{N}) queries.


Query call complexity of the ideal solution

Perhaps it is best if I reveal the complexity before moving on. The ideal query complexity to obtain a \tilde{K} which \epsilon-approximates K with probability of success at least 1-\delta is O(\frac{1}{\epsilon} \sqrt{\frac{N}{K}} \log(\frac{1}{\delta})).

Initial steps

Instead of approximating K, we equivalently approximate \theta = \arcsin(\sqrt{\frac{K}{N}}), so that K=N\sin^{2}(\theta); a multiplicative estimate of \theta translates into a multiplicative estimate of K.

There are really two steps:

First, we find a good constant-factor approximation to K and then we cleverly shave off the bounds to get an \epsilon-approximation.

So, first we obtain some interval [\theta_{min},\theta_{max}] containing \theta.

This is explained in part 1 of Aaronson’s overview of the algorithm:

Here, \lvert \psi \rangle is the uniform superposition of all states and G is the Grover iteration (the oracle followed by the diffusion operator).

Note that:

G^{\frac{r-1}{2}}\lvert \psi \rangle=\frac{\sin(r\theta)}{\sqrt{K}} \sum\limits_{x \in [K] } \lvert x \rangle +\frac{\cos(r\theta)}{\sqrt{N-K}} \sum\limits_{x \not\in [K]} \lvert x \rangle. In the computational basis, the probability of measuring a marked item is \sin^{2}(r\theta). Step 1 essentially returns a constant-factor approximation to \theta with high probability. This can be proven using Chernoff bounds, which I’ll omit from the discussion.

It is also proven on page 5 that, with high probability, one will not observe enough marked items to finish step 1 while t<t_{0}; also with high probability, we will see enough marked items at t=t_{0} or t_{0}+1. The interesting part is how to get an \epsilon-approximation.

Final solution

Now that we have \theta_{min},\theta_{max}, we use the following clever result on page 9:

Let 0<\theta_{min}\leq \theta\leq \theta_{max} \leq \frac{\pi}{1000} and \theta_{max}=\theta_{min}(1+\gamma) where \gamma \leq \frac{1}{5}. There exists an integer r such that the following is true:

Consider tossing a biased coin which lands heads with probability \sin^{2}(r\theta) at least 1000\ln(\frac{1}{\delta}) times.

If more heads are observed, set \theta_{min}:=\frac{\theta_{max}}{1+0.9 \gamma} else set

\theta_{max}:=\theta_{min}(1+0.9\gamma).

This process fails to maintain \theta_{min} \leq \theta \leq \theta_{max} only with low probability \delta. The value r also has to satisfy some non-trivial bounds.

The probability of getting heads above is exactly the probability of measuring a marked item in G^{\frac{r-1}{2}}\lvert \psi \rangle.

The upshot is that we can distinguish \theta_{min} from \theta_{max} by measurement, by considering instead r\theta_{min},r\theta_{max}, which are nearly orthogonal. Based on the results of measurement, we shave off either \theta_{min} or \theta_{max} by a factor of 0.9, getting closer and closer (with high probability) to our required \epsilon-estimate of \theta.
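Here is a purely classical toy simulation of that shaving process (my own sketch: the quantum measurement is replaced by a biased coin with heads probability \sin^{2}(r\theta), and r is picked by a crude numerical search rather than by the closed-form choice in the paper):

```python
import math, random

def shave(theta, theta_min, theta_max, eps, flips=200):
    """Shrink [theta_min, theta_max] around the unknown theta until the gap is below eps."""
    while theta_max / theta_min - 1 > eps:
        gamma = theta_max / theta_min - 1
        # Pick r so that sin^2(r*theta_min) is small and sin^2(r*theta_max) is large,
        # i.e. the two endpoint angles become "nearly orthogonal".
        r = max(range(1, int(math.pi / (gamma * theta_min)) + 2),
                key=lambda r: math.sin(r * theta_max) ** 2 - math.sin(r * theta_min) ** 2)
        # Stand-in for the quantum subroutine: a coin with bias sin^2(r*theta).
        heads = sum(random.random() < math.sin(r * theta) ** 2 for _ in range(flips))
        if heads > flips / 2:
            theta_min = theta_max / (1 + 0.9 * gamma)   # mostly heads: theta is near theta_max
        else:
            theta_max = theta_min * (1 + 0.9 * gamma)   # mostly tails: theta is near theta_min
    return theta_min, theta_max

N, K = 10_000, 250
theta = math.asin(math.sqrt(K / N))                     # the angle we pretend not to know
lo, hi = shave(theta, 0.8 * theta, 1.1 * theta, eps=0.01)
print(lo, hi, N * math.sin((lo + hi) / 2) ** 2)         # final interval and the resulting K estimate
```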

Philosophical musings on Energy and Algorithms

I beg the reader to ignore everything here and close this tab immediately as I have no idea what I’m saying(not that I ever do).

Recently, I came across a few interesting things:

One was an experimental method to tackle classic difficult computer science problems: using chemical reactions with DNA molecules to solve the Hamiltonian path problem, which is NP-hard. The problem is that it requires an amount of DNA factorial in the number of vertices. Another solution uses an optical contraption of cables and beam splitters; again, the drawback is that the amount of required energy is exponential in the number of vertices. There is also the example of a slime mold which rearranges itself to give an approximate solution to the Travelling Salesman Problem.

In fact, there is an entire field of DNA computing. The aim is to use massive parallelism to speed up computations. The drawback is that each individual operation is slow.

Again, even for Grover’s algorithm, note that we must first ‘store’/prepare the entire quantum state, unlike a classical linear search where the auxiliary space needed is O(1). Across all these models of computation, it seems that we can, to some degree, substitute time for space/energy. An obvious thing to note is that you can’t decide anything about new data if you haven’t even stored it. More often than not, we are willing to make this sacrifice because we care more about efficiency.

Parallel computing speedups require increasing the number of components. Of course, there is Amdahl’s law, which puts some kind of restriction on this.

Aaronson talked about this in one of his informal talks (I think an interview or a TED talk).

The study of complexity theory really may reveal things about physics, nature and energy it seems. In classical computing, there is a limit to how much we can trade space for time so we consider other unconventional types of computing where the threshold for this tradeoff is higher. This is truly an interesting direction to think in.

Fast generalized DFT for finite groups

Almost 6 months ago, I came across an interesting update through the UT CS theory seminar group on the classic problem of computing the Discrete Fourier Transform (DFT) over finite groups. Chris Umans, from Caltech, made some pretty nice improvements to the complexity of the general DFT through some slick combinatorics, which I’ll shortly explain. The paper is here.

Introduction

Well, before I even start, perhaps I should make some kind of a comment to motivate the problem at hand. A sequence of n complex numbers (x_{i}) can be represented by a function f:\mathbb{Z}_{n} \to \mathbb{C}.

The Discrete Fourier Transform of (x_{j}) sends it to the sequence (F_{k}) where F_{k}=\sum\limits_{j=0}^{n-1}x_{j}e^{\frac{-2\pi i k j}{n}}.

This motivates the definition of the Fourier transform \hat{f} of a function from a finite group f : G \to \mathbb{C} which takes as input a representation \rho:G \to GL(d_{\rho},\mathbb{C}), i.e

\hat{f}(\rho)=\sum\limits_{g \in G} f(g)\rho(g)

where one must note that, more generally, \rho(g) can be matrix-valued. From Maschke’s theorem, the DFT of a finite group reduces to a map which takes as input an element \alpha \in \mathbb{C}[G] (the group ring, consisting of all functions from G to \mathbb{C} under convolution) and sends it to \sum\limits_{g} \alpha(g) \bigoplus_{\rho \in Irr(G)}\rho(g). Throughout, one assumes that we already have the \rho and have chosen a basis for each \rho.
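As a sanity check on the definition, here is a toy sketch (my own numpy code) for the cyclic group \mathbb{Z}_{n}, whose irreducible representations are the one-dimensional characters \rho_{k}(g)=e^{-2\pi i k g/n}; each ‘block’ of the output is a scalar and the result is just the classical DFT vector:

```python
import numpy as np

def group_dft_cyclic(alpha):
    """Generalized DFT over G = Z_n: alpha_hat(rho_k) = sum_g alpha(g) * rho_k(g),
    where rho_k(g) = exp(-2*pi*i*k*g/n).  Costs O(|G|^2) operations, as in the text."""
    n = len(alpha)
    g = np.arange(n)
    return np.array([np.sum(alpha * np.exp(-2j * np.pi * k * g / n)) for k in range(n)])

alpha = np.array([1.0, 2.0, 0.0, -1.0])   # an element of C[Z_4]
print(group_dft_cyclic(alpha))
print(np.fft.fft(alpha))                  # the classical DFT agrees
```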

Quite simply, one trivially gets that this requires O(|G|^{2}) operations. Exponent-one algorithms exist for abelian groups (as one would expect, using the decomposition into cyclic groups), supersolvable groups and certain classes of symmetric and alternating groups.

Throughout the post, let \omega be the exponent of matrix multiplication (currently it is around 2.38, I guess).

The main result is that Chris Umans achieves an operation complexity of O(|G|^{\frac{\omega}{2}+\epsilon}) for every \epsilon>0. Some basic facts from representation theory will be used throughout, though I guess most readers will already be familiar with them. For example, it is easy to prove, through the Schur orthogonality relations, that \sum\limits_{i} \dim(V_{i})^{2}=|G| where the V_{i} range over the irreducible representations of G. This gives us the following useful lemma.

Lemma 1:

For every real number \alpha \geq 2, \sum\limits_{\rho \in Irr(G)} \dim(\rho)^{\alpha} \leq |G|^{\frac{\alpha}{2}}. (Proof omitted.)


Some facts about the induced representation and Clifford theory are used but it shouldn’t be too much of a hassle to discuss it on-the-fly.

The Support Trick

Let G be a group and S a subset. Let’s assume that we can calculate the generalized DFT with respect to G for all inputs \alpha \in \mathbb{C}[G] supported on S (i.e. with components \alpha_{g}=0 for g \not\in S) in m operations. With some extra computational cost (involving the index of S), one can extend this to all inputs: if we can compute the DFT of inputs supported on S, then we can compute the DFT of inputs supported on Sg' by multiplying \sum\limits_{g} \alpha(g) \bigoplus_{\rho \in Irr(G)}\rho(g) by \bigoplus_{\rho \in Irr(G)} \rho(g'), which has an additional operation cost of \sum\limits_{\rho \in Irr(G)} O(\dim(\rho)^{\omega+\epsilon}). We can apply Lemma 1 and take an appropriate number of ‘shifts/translates’ by elements of G to get the following result (see Section 2.3):

Continue reading “Fast generalized DFT for finite groups”

Monads I: The Eilenberg-Moore category, Adjunction and Kleisli categories

This series (probably three posts) will be on the monad. The first part will deal only with an introduction to monads, algebras over monads and the adjunction problem. The second part will deal with coequalizers and Beck’s theorem in connection with Kleisli categories. In keeping with this blog’s slight CS tilt, the third part will deal with examples of monads from functional programming, which I think are crucial to wholly understand the monad. I’ve noticed that I am running too many ‘series posts’ simultaneously. I am still working on the third part of the Grothendieck Spectral Sequence series and will continue the Hopf fibration post, which I haven’t updated for a year!


 

I think that before I introduce the main players of today’s post, I should review the definition of an adjoint functor, as it’ll be quite useful as a tool to see the power of monads.

Consider two categories C,D and functors F:D \to C and G:C \to D. F and G are said to be adjoint, i.e. F \dashv G, if there exists a natural isomorphism of Hom-sets Hom_{C}(Fd,c) \simeq Hom_{D}(d,Gc) for all d \in D,c \in C. Equivalently, by the unit-counit definition, F \dashv G if there exists a pair of natural transformations \eta:1_{D} \Rightarrow GF (the unit) and \epsilon:FG \Rightarrow 1_{C} (the counit) which satisfy the following commutative diagrams:

[Diagram: the triangle identities for the unit \eta and counit \epsilon]
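Since the diagram doesn’t render here, let me spell out the two triangle identities it encodes:

\epsilon F \circ F\eta=1_{F} \quad and \quad G\epsilon \circ \eta G=1_{G},

i.e. componentwise, \epsilon_{Fd} \circ F(\eta_{d})=1_{Fd} for every d \in D and G(\epsilon_{c}) \circ \eta_{Gc}=1_{Gc} for every c \in C.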

 

Throughout the post, I’ll represent natural transformations by the symbol \Rightarrow. I recommend the confused reader refer to the section on category theory in Aluffi’s Algebra: Chapter 0 and also my post on the importance of adjoint functors.

Suppose \eta:F \Rightarrow G is a natural transformation of functors F,G: A \to B whose components are given by \eta_{X}:F(X) \to G(X) for an object X in A. If H:B \to C is another functor, then I’ll represent the components of the induced natural transformation HF \Rightarrow HG by (H\eta)_{X}=H(\eta_{X}).

If instead there is a functor K:Z \to A, then I’ll represent the components of the induced natural transformation \eta K:FK \Rightarrow GK by (\eta K)_{X}=\eta_{K(X)}, where X is an object in Z.


Monad

Let’s say that T:C \to C is an endofunctor equipped with two natural transformations \eta:1_{C} \Rightarrow T (the unit) and \mu:TT \Rightarrow T (the multiplication) such that the following diagrams commute:

[Diagram: the unit and associativity laws for the monad (T,\eta,\mu)]
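Since this diagram doesn’t render here either, the commuting diagrams say precisely that

\mu \circ T\mu=\mu \circ \mu T (associativity) \quad and \quad \mu \circ T\eta=\mu \circ \eta T=1_{T} (unit laws),

i.e. componentwise, \mu_{X} \circ T(\mu_{X})=\mu_{X} \circ \mu_{T(X)} and \mu_{X} \circ T(\eta_{X})=\mu_{X} \circ \eta_{T(X)}=1_{T(X)} for every object X of C.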

 

The commutative diagrams are a kind of generalization of associativity and of a unit law. I direct the reader to the nLab page to learn more about the basics. Much of the importance of monads is derived from their connection to adjoint functors and to programming paradigms (see the continuation monad for more information).

Let’s see how monads are connected to adjoint functors.

Continue reading “Monads I: The Eilenberg-Moore category, Adjunction and Kleisli categories”

Proof of the Sensitivity Conjecture without Cauchy’s Interlacing Theorem

If you aren’t aware, one month ago the Sensitivity Conjecture, a 30-year-old problem, was proven by Hao Huang, an assistant professor at Emory University, in just a little more than 2 pages, using particularly simple methods, albeit cleverly implemented. The proof is very easy to understand and requires nothing more than basic linear algebra. You can find it here.

A few days ago, Donald Knuth simplified the already simple proof of H. Huang and could fit all his arguments in half a page. What really shocked me when I heard the news is that Knuth is still alive. I could have sworn that it was only a few days ago that I was skimming through a textbook on discrete mathematics and saw a bio with a black-and-white picture of him in one of the chapters related to computational complexity, in the same manner that I’d see pictures and mini-biographies of other 20th-century giants such as Grothendieck and Nash relegated to the margins of math textbooks and online articles romantically detailing the course of their fabled lives.

Now that I’ve realized that he’s well and alive, it shocks me equally to learn that at the age of 81, he is still able to contribute to research.

I’m not exactly going to regurgitate Knuth’s proof, but what I’ll present certainly uses the same ideas. I must note here that Knuth’s proof itself is inspired by a comment on Scott Aaronson’s blog which provided a method to prove the conjecture without using Cauchy’s Interlacing Theorem.

If you don’t know what the Sensitivity Conjecture is, I’ll state it below.

Let f:\{0,1\}^{n} \to \{0,1 \} be a boolean function whose domain is the set of vertices of the n-dimensional cube, denoted by Q^{n}=\{0,1 \}^{n}. So, a particular input x is just a string of 0s and 1s. The local sensitivity of f at an input x is the number of indices i in the ‘string’ of x such that flipping the i-th bit changes the output. The local block sensitivity of f at input x is the maximum number of disjoint subsets (‘blocks’) of the index set \{1,2, \cdots, n \} such that, for each block, flipping every index of that block in the input string of x changes the output. The sensitivity and block sensitivity of f are defined to be the maxima of these corresponding local measures over all inputs.
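To make the definitions concrete, here is a small brute-force sketch (my own illustrative code, with a made-up example function) that computes local sensitivities and the sensitivity s(f):

```python
from itertools import product

def local_sensitivity(f, x):
    """Number of indices i such that flipping x_i changes f(x)."""
    return sum(f(x[:i] + (1 - x[i],) + x[i+1:]) != f(x) for i in range(len(x)))

def sensitivity(f, n):
    """s(f): the maximum local sensitivity over all inputs in {0,1}^n."""
    return max(local_sensitivity(f, x) for x in product((0, 1), repeat=n))

# Toy example: OR of three bits -- the all-zeros input is sensitive to every bit.
f = lambda x: int(any(x))
print(sensitivity(f, 3))   # 3
```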

Continue reading “Proof of the Sensitivity Conjecture without Cauchy’s Interlacing Theorem”

Topological Constraints on the Universal Approximation of Neural Networks

The title may seem like a contradiction, given that there is such a thing as the Universal Approximation Theorem, which states that a neural network with a single hidden layer of finite width (i.e. a finite number of neurons) can approximate any continuous function on a compact subset of \mathbb{R}^{n}, provided that the activation function is non-constant, bounded and continuous.
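As a quick illustration of the single-hidden-layer statement (a sketch of my own, not the construction used in the actual proofs): fix random sigmoid hidden units and fit only the output layer by least squares to a target function on [0,1].

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

target = lambda x: np.sin(2 * np.pi * x)      # function to approximate on the compact set [0, 1]

width = 200                                   # neurons in the single hidden layer
W = rng.normal(scale=20.0, size=width)        # random hidden weights ...
b = rng.uniform(-20.0, 0.0, size=width)       # ... and biases, both kept fixed

x = np.linspace(0, 1, 500)
H = sigmoid(np.outer(x, W) + b)               # hidden-layer activations, shape (500, width)
c, *_ = np.linalg.lstsq(H, target(x), rcond=None)   # fit only the output weights

print(np.max(np.abs(H @ c - target(x))))      # sup-norm error on the grid; small for large width
```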

Needless to say, I haven’t found any kind of flaw in the existing proofs (see Kolmogorov or Cybenko). However, I thought of something rather simple and scoured the internet for an answer.

What if we allow an arbitrary number of hidden layers but bound the dimension of the hidden layers, making them ‘Deep, Skinny Neural Networks’? Would that be a universal approximator?

Continue reading “Topological Constraints on the Universal Approximation of Neural Networks”