# Introduction to coding theory (CMU, Spr 2010)

## January 25, 2010

### Notes 3: Stochastic channels and noisy coding theorem

Filed under: Lecture notes — Venkat Guruswami @ 11:17 pm

We now turn to the basic elements of Shannon’s theory of communication over an intervening noisy channel.

1. Model of information communication and noisy channel

To quote Shannon from his paper A Mathematical theory of communication: “The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.” The basic setup of the communication problem consists of a source that generates digital information which is to reliably communicated to a destination through a channel, preferably in the most efficient manner possible. This “destination” could be spatially separated (eg., a distant satellite is sending images of Jupiter back to the space station on Earth), or could be temporally separated (eg., we want to retrieve date stored on our hard disk at a later point of time).

The following is a schematic of the communication model:

The first step in the communication model is to exploit the redundancy in the output of the source and compress the information to economize the amount of “raw, non-redundant” data that must be transmitted across the channel. This data compression step in called source coding. If at each time step the source outputs an i.i.d copy of a random variable ${Z}$ supported on a finite set ${{\cal Z}}$, then Shannon’s source coding theorem states that one can compress its output to ${H(Z)}$ bits per time step (on average, over ${n}$ i.i.d samples from the source ${Z}$, as ${n \rightarrow \infty}$). In other words ${n}$ samples from the source can be coded as one of ${M \approx 2^{H(Z) n}}$ possible outputs. Here ${H(Z)}$ is the fundamental Shannon entropy defined as

$\displaystyle H(Z) = \sum_{z \in {\cal Z}} \mathop{\mathbb P}[Z=z] \log \frac{1}{\mathop{\mathbb P}[Z=z]} \ . \ \ \ \ \ (1)$

where ${\log}$ is to the base ${2}$. Thus the entropy of a fair coin toss is ${1}$, and that of a ${p}$-biased coin toss is ${h(p) = -p\log p - (1-p) \log(1-p)}$. The source decoder at the other end of the communication channel then decompresses the received information into (hopefully) the original output of the source.

The output of the source coder, say ${m}$, must then be communicated over a noisy channel. The channel’s noisy behavior causes errors in the received symbols at the destination. To recover from the errors incurred due to the channel, one should encode the information output by the source coder by adding systematic redundancy to it. This is done through channel coding which maps ${m}$ to a codeword ${c}$ of some suitable error-correcting code (the study of channel coding will be our focus in this course).

1.1. Modeling the noisy channel

The basic channel model consists of an input alphabet ${{\cal X}}$ and output alphabet ${{\cal Y}}$. We will focus on memoryless channels — for each ${x \in {\cal X}}$ there is a distribution ${D_x}$ on ${{\cal Y}}$ such that when input ${x \in {\cal X}}$ is fed at one end of the channel, the channel distorts it to ${y \in {\cal Y}}$ according to an independent sample drawn according to ${D_x}$. (In particular, the channel has no “state,” and its behavior is independent of the history of previously transmitted symbols.) The collection of the distributions ${D_x}$ comprise the “channel law” for the behavior of the channel. In a discrete memoryless channel (DMC), given by a triple ${\Lambda=({\cal X},{\cal Y},\Pi)}$, the input and output alphabets ${{\cal X}, {\cal Y}}$ are finite, and therefore the channel law can be specified by a ${|\cal{X}| \times |\cal{Y}|}$ conditional probability matrix ${\Pi}$ which is a stochastic matrix where each row sums to 1:

$\displaystyle \ \ \qquad \quad |\cal{Y}|$

$\displaystyle |\cal{X}| \left \{\overbrace{\left( \begin{array}{ccc} & & \\ & \Pi(y|x) & \\ & & \end{array} \right)}\right.$

The ${(x,y)}$‘th entry ${\Pi(y|x)}$ is the conditional probability ${\mathop{\mathbb P}(Y=y | X=x)}$ of the receiving ${y}$ when ${x}$ was transmitted on the channel.

1.2. Noisy coding and joint source-channel coding theorems

Suppose at the output of the source coder, we have a message ${m}$ from one of ${M \approx 2^{H(Z) n}}$ possible messages (that encode ${n}$ samples from the source ${Z}$), which is to be communicated across a channel ${({\cal X}, {\cal Y}, \Pi)}$. Then the channel encoder it into a sequence ${\mathbf{x} = (x_1,x_2,\dots,x_n) \in C \subseteq {\cal X}^n}$ for some error-correcting code ${C}$ and the information is sent via ${n}$ uses of the channel. At the other end, a sequence ${\mathbf{y} = (y_1,y_2,\dots,y_n) \in {\cal Y}^n}$ is received with conditional probability $\displaystyle p(\mathbf{y} | \mathbf{x}) = \prod_{i=1}^n \Pi(y_i | x_i) \ \ \ \ \ (2)$

(due to the memoryless nature of the channel). The decoder must then map this sequence ${\mathbf{y}}$ into a legal codeword ${c \in C}$ (or equivalently into a message ${m \in {\cal M}}$).

A piece of notation: For a DMC ${({\cal X}, {\cal Y}, \Pi)}$, a positive integer ${n}$, and ${\mathbf{x} \in {\cal X}^n}$, let us denote by ${\Pi(\mathbf{x})}$ the above distribution (2) on ${{\cal Y}^n}$ induced by the ${\Pi}$ on input sequence ${\mathbf{x}}$.

Theorem 1 (Shannon’s noisy coding theorem) For every discrete memoryless channel ${\Lambda=({\cal X}, {\cal Y}, \Pi)}$, there exists a real number ${C_0=C_0(\Lambda)}$ called its channel capacity, such that the following holds for every ${R < C_0}$. For all large enough ${n}$, there exists an integer ${M \ge 2^{Rn}}$ and

1. an encoding map ${{\rm Enc}: \{1,2,\dots,M\} \rightarrow {\cal X}^n}$ (of some error-correcting code over alphabet ${{\cal X}}$ of rate ${R/\log |{\cal X}|}$), and
2. a decoding map ${{\rm Dec} : {\cal Y}^n \rightarrow \{1,2,\dots,M\} \cup \{{\rm fail}\}}$

such that for every ${m \in \{1,2,\dots,M\}}$ $\displaystyle \mathop{\mathbb P}_{\Pi} [ {\rm Dec}(\Pi({\rm Enc}(m))) = m ] \ge 1 - 2^{-\Omega_{R,C_0}(n)} \$

where the probability is over the behavior of the channel ${\Pi}$ (on input ${{\rm Enc}(m)}$). Further, the capacity ${C_0}$ is given by the expression

$\displaystyle \max_{p \in {\rm Dist}_{{\cal X}}} H(Y) - H(Y | X)$

where the maximum is taken over all probability distributions ${p}$ on ${{\cal X}}$. In the above, ${H(Y)}$ is the entropy of the ${{\cal Y}}$-valued random variable ${Y}$ with distribution function

$\displaystyle \mathop{\mathbb P}[Y=y] = \sum_{x \in {\cal X}} \mathop{\mathbb P}(Y=y | X=x) p(x) = \sum_{x \in {\cal X}} \Pi(y | x) p(x)$

and ${H(Y | X)}$ is the conditional entropy $\displaystyle H(Y | X) = \sum_{x \in {\cal X}} p(x) H(Y | X=x) = \sum_{x \in {\cal X}} p(x) \sum_{y \in {\cal Y}} \Pi(y|x) \log \frac{1}{\Pi(y | x)} \ .$

Remark 1 The quantity ${H(Y) - H(Y | X)}$ is called the mutual information between ${X}$ and ${Y}$, and denoted ${I(X,Y)}$. It represents the decrease in uncertainty of a random variable ${Y}$ given the knowledge of random variable ${X}$, which intuitively captures how much information ${X}$ reveals about ${Y}$. If ${Y}$ is independent of ${X}$, then ${H(Y | X) = H(Y)}$, and ${I(X,Y) = 0}$. On the other hand if ${Y = f(X)}$ for some function ${f}$ (i.e., ${Y}$ is determined by ${X}$), then ${H(Y |X)=0}$ and ${I(X,Y) = H(Y)}$.

Combining Shannon’s source coding and noisy coding theorems, and the two-stage communication process comprising a separate source coding stage followed by channel coding stage, one can conclude that reliable communication of the output of a source ${Z}$ on a noisy channel ${\Lambda}$ is possible as long as ${H(Z) < C_0(\Lambda)}$, i.e., the source outputs data at a rate that is less than the capacity of the channel. This result has a converse (called the converse to the joint source-channel coding theorem) that says that if ${H(Z) > C_0(\Lambda)}$ then reliable communication is not possible.

Together, these imply a “separation theorem,” namely that it is information-theoretically optimal to do source and channel coding separately, and thus one can gain modularity in communication system design without incurring any loss in rate of data transfer. While this converse to the joint source-channel coding theorem is rather intuitive in the setting of point-to-point communication between a sender and a receiver, it is worth remarking that the separation theorem breaks down in some scenarios with multiple users and correlated sources.

We will not prove Shannon’s theorem in the above generality here, but content ourselves with establishing a special case (for the binary symmetric channel). The proof for the general case follows the same general structure once some basic information theory tools are set up, and we will remark briefly about this at the end. But first we will see some important examples of noisy channels.

2. Examples of channels

A discrete channel with finite input and output alphabets ${{\cal X}}$ and ${{\cal Y}}$ respectively, specified by the conditional probability matrix ${\Pi(y | x)}$, can also be represented pictorially by an input-output diagram, which is a bipartite graph with nodes on left identified with ${{\cal X}}$ and nodes on right identified with ${{\cal Y}}$ and a directed edge between ${x \in {\cal X}}$ and ${y \in {\cal Y}}$ with weight ${\Pi( y | x)}$.

2.1. Binary symmetric channel

The Binary Symmetric Channel (BSC) has input alphabet ${{\cal X} = \{0,1\}}$ and output alphabet ${{\cal Y} = \{0,1\}}$. The BSC is parameterized by a real number ${p}$, ${0 \le p \le 1/2}$ called the crossover probability, and often denoted ${{\rm BSC}_p}$. The channel flips its input with probability ${p}$, in other words,

$\displaystyle \Pi(y | x) = \left\{ \begin{array}{cc} p & \text{if } y=x \\ 1-p & \text{if } y =1-x \end{array} \right.$

Pictorially, ${{\rm BSC}_p}$ can be represented as

If a uniform input ${X \in \{0,1\}}$ is fed as input to ${{\rm BSC}_p}$, then the output ${Y}$ is also uniformly distributed. Given ${X=x}$, ${Y}$ is distributed as a ${p}$-biased coin, and ${H(Y | X =x) = h(p)}$. Thus ${H(Y|X) = h(p)}$, and therefore ${I(X,Y) = H(Y) = H(Y|X) = 1 -h(p)}$. It can be checked that the uniformly distributed ${X}$ maximizes ${I(X,Y)}$, and so Shannon’s theorem implies that ${1-h(p)}$ is the capacity of ${{\rm BSC}_p}$. We will shortly prove this special case of Shannon’s theorem.

2.2. Binary erasure channel

The Binary Erasure Channel (BEC) is parameterized by a real ${\epsilon}$, ${0 \le \epsilon \le 1}$, which is called the erasure probability, and is denoted ${{\rm BEC}_{\epsilon}}$. Its input alphabet is ${{\cal X} = \{0,1\}}$ and output alphabet is ${{\cal Y} = \{0,1,?\}}$. Upon input ${x \in {\cal X}}$, the channel outputs ${x}$ with probability ${1-\epsilon}$, and outputs ${?}$ (corresponding to erasing the symbol) with probability ${\epsilon}$. (It never flips the value of a bit.) Pictorially:

When a bit string of length ${n}$ for large ${n}$ is transmitted across ${{\rm BEC}_\epsilon}$, with high probability only ${\approx (1-\epsilon) n}$ bits are received unerased at the other end. This suggests that the maximum rate at which reliable communication is possible is at most ${1-\epsilon}$. It turns out that a rate approaching ${1-\epsilon}$ can be achieved, and the capacity of ${{\rm BEC}_\epsilon}$ equals ${1-\epsilon}$.

2.3. Noisy Typewriter Channel

The noisy typewriter channel is given by the following diagram:

If we restrict the code to send only one of the symbols ${\{ A, C, E,\ldots, Y\}}$ in each channel use, we can communicate one of ${13}$ possible messages with zero error. Therefore the capacity of the channel is at least ${\log_2 13}$. One can prove that this rate is the maximum possible and the capacity of the channel is exactly ${\log 13}$. (Indeed, this follows from Shannon’s capacity formula: Since ${|{\cal Y}|=26}$, ${H(Y)}$ is at most ${\log 26}$. Also ${H(Y|X) =1}$ for every distribution of the channel input ${X}$. Hence ${H(Y) - H(Y|X) \le \log 13}$.)

Note that we can achieve a rate equal to capacity with zero probability of miscommunication. For the ${{\rm BSC}_p}$ with ${p > 0}$ on the other hand, zero error communication is not possible at any positive rate, since for every pair of strings ${x,x' \in \{0,1\}^n}$, there is a positive probability that ${x}$ will get distorted to ${x'}$ by the noise caused by the ${{\rm BSC}_p}$.

The study of zero error capacity of channels was introduced in another classic work of Shannon. Estimating the zero error capacity of even simple channels (such as the ${5}$-cycle) has led to some beautiful results in combinatorics, including Lovász’s celebrated work on the Theta function.

2.4. Continuous Output Channel

We now see an example of a continuous output channel that is widely used to model noise and compare the performance (typically via simulation) of different coding schemes. The binary input additive white Gaussian noise (BIAWGN) channel has input alphabet ${{\cal X}=\{1,-1\}}$ (it is more convenient to encode binary symbols by ${\pm 1}$ instead of ${\{0,1\}}$) and output alphabet ${{\cal Y} = {\mathbb R}}$. The input ${x \in \{1,-1\}}$ is “modulated” into the real number ${\beta x}$ and the channel adds additive noise distributed according to ${N(0,\sigma^2)}$ to ${\beta x}$. Thus the output distribution is a Gaussian with mean ${\beta x}$ and variance ${\sigma^2}$. Formally

$\displaystyle \mathop{\mathbb P} [ Y \le y | X = x ] = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(z-\beta x)^2/(2\sigma^2)} \ dz \ .$

The quantity ${(\beta/\sigma)^2}$ is commonly referred to as the signal-to-noise ratio (SNR for short), with ${\beta^2}$ corresponding to the energy per input bit and ${\sigma^2}$ corresponding to the amount of noise. The SNR is usually measured in decibel units (dB), and expressed as the value ${10 \log_{10} (\beta/\sigma)^2}$. As one might expect, the capacity of the AWGN channel increases as its SNR increases.

3. Shannon’s capacity theorem for the binary symmetric channel

We now turn to establishing the capacity of ${{\rm BSC}_p}$ to be ${1-h(p)}$.

3.1. Connection to minimum distance

First, let us connect this question to the Hamming world. If we have a family of binary codes of relative distance more than ${(2p+\epsilon)}$, then we claim that this enables communicating on the ${{\rm BSC}_p}$ with exponentially small probability of miscommunication. The reason is that by the Chernoff bound for independent Bernoulli random variables (stated below), the probability that at least ${(p+\epsilon/2) n}$ are corrupted out of ${n}$ bits transmitted on a ${{\rm BSC}_p}$ is exponentially small. When the number of errors is less than ${(p+\epsilon/2) n}$, the received word has a unique closest codeword in Hamming distance, which is also the original transmitted codeword.

Lemma 2 (Chernoff bound for i.i.d. Bernoulli random variables) If ${X_1,X_2,\ldots X_n}$ are i.i.d. ${\{0,1\}}$-valued random variables with ${\mathop{\mathbb P}[X_i=1]=p}$, then for every ${\epsilon > 0}$, for large enough ${n}$ the following tail estimates hold: $\displaystyle \mathop{\mathbb P}[\sum_{i=1}^n X_i \geq (p+\epsilon)n] \leq 2^{\frac{- \epsilon^2 n}{3}}$

$\displaystyle \mathop{\mathbb P}[\sum_{i=1}^n X_i \leq (p-\epsilon)n] \leq 2^{\frac{-\epsilon^2 n}{3}}$

Together with the Gilbert-Varshamov bound, we conclude the existence of codes of rate at least ${1-h(2p)}$ for reliable communication on ${{\rm BSC}_p}$. This rate is positive only for ${p < 1/4}$, and falls short of the bound ${1-h(p)}$ which we “know” to be the capacity of ${{\rm BSC}_p}$ from Shannon’s general theorem.

The Hamming upper bound on rate for codes of relative distance ${2p}$ was also equal to ${1-h(p)}$. So if the Hamming bound could be attained, we could achieve the capacity of ${{\rm BSC}_p}$ simply by using codes of relative distance ${2p}$. However, we will soon see that the Hamming upper bound can be improved, and there are no codes of positive rate for relative distance ${2p}$ for ${p \ge 1/4}$ or of rate ${1-h(p)}$ for ${p < 1/4}$.

3.2. Intuition: mostly disjoint packings

The key to Shannon’s theorem is that we do not need every pair of codewords to differ in a fraction ${2p}$ of locations, but only that for most (as opposed to for all) points obtained by flipping about a ${p}$ fraction of bits of a codeword ${c}$ have no other codeword closer than ${c}$. In other words, it suffices to be able to pack ${\approx 2^{(1-h(p))n}}$ “mostly-disjoint” Hamming balls of radius ${pn}$ so that most points in ${\{0,1\}^n}$ belong to at most one such Hamming ball. Indeed, we will show below (Theorem 3) that such a packing exists, and therefore one can reliably communicate on ${{\rm BSC}_p}$ with rate approaching ${1-h(p)}$.

The intuition for the case of general discrete memoryless channels as stated in Theorem 1 is similar. For a typical sequence ${x \in {\cal X}^n}$ (chosen according to the product distribution ${p^{\otimes n}}$), when ${x}$ is transmitted, there are ${\approx 2^{H(Y|X) n}}$ possible received sequences in ${{\cal Y}^n}$ (call this the “neighborhood” of ${x}$), out of a total volume of ${2^{H(Y)n}}$. It turns out it is possible to pick a collection of ${\approx 2^{(H(Y)-H(Y|X))n}}$ sequences in ${{\cal X}^n}$ whose neighborhoods are mostly disjoint. This enables reliable communication at rate approaching ${H(Y) - H(Y|X)}$.

3.3. Converse to capacity theorem for BSC

We now give an explanation for why ${1-h(p)}$ ought to be an upper bound on capacity of the ${{\rm BSC}_p}$. Suppose a code ${C \subset \{0,1\}^n}$ achieves negligible error probability for communication on ${{\rm BSC}_p}$ with some decoding rule ${D : \{0,1\}^n \rightarrow C \cup \{{\rm fail}\}}$. When ${c}$ is transmitted, with overwhelming probability the received word belongs to a set ${{\rm Typical}_c}$ of ${\approx 2^{h(p)n}}$ possible strings whose Hamming distance to ${c}$ is close to ${p n}$ (say in the range ${[(p-o(1))n,(p+o(1))n]}$, and these possibilities are all roughly equally likely. Therefore, in order to ensure that ${c}$ is recovered with high probability from its noisy version, the decoding rule ${D}$ must map most of the strings in ${{\rm Typicakl}_c}$ to ${c}$. Thus we must have ${|D^{-1}(c)| \approx 2^{h(p)n}}$ for each ${c \in C}$, leading to the upper bound ${|C| \le 2^{(1-h(p)+o(1))n}}$.

A different way to argue about the ${1-h(p)}$ upper bound is related to a discussion in our very first lecture. It is based on the observation that when communication is successful, the decoder not only recovers the transmitted codeword but also the locations of the (typically around ${pn}$) errors. The former carries ${\log |C|}$ bits of information, whereas the latter typically conveys ${\approx h(p) n}$ bits of information. Since the total amount of non-redundant information that can be reliably conveyed by ${n}$ bits cannot exceed ${n}$, we again get the upper bound ${|C| \le 2^{(1-h(p)+o(1))n}}$.

Exercise 1 Develop the above arguments into a formal proof that communication at a rate of ${1-h(p)+\epsilon}$ on ${{\rm BSC}_p}$ incurs a probability of error bounded below by an absolute constant, and in fact by ${1-2^{-\Omega_{\epsilon,p}(n)}}$ where ${n}$ is the block length of the code.

3.4. The theorem

We conclude these notes with the formal statement and proof of the capacity theorem for ${{\rm BSC}_p}$.

Theorem 3 For every ${p \in (0,1/2)}$ such that ${0\le p<\frac{1}{2}}$, and every ${0 < \gamma <1/2-p}$ and all large enough integers ${n}$, there exists a ${\delta = \delta(\gamma,p)}$ and a code with encoding map ${{\rm Enc}: \{0,1\}^k \rightarrow \{0,1\}^n}$ for ${k = (1-h(p+\gamma)) n}$ and a decoding rule ${D: \{0,1\}^n \rightarrow \{0,1\}^k \cup \{{\rm fail}\}}$ such that$\displaystyle \mathop{\mathbb P}_{z} [ D(E(m)+z)=m ] \ge 1- 2^{-\delta n}$

where the probability is over the noise ${z}$ caused by ${{\rm BSC}_p}$.

Proof: The construction is by the probabilistic method. Let ${\ell=k+1}$. The encoding function ${{\rm Enc} :\{0,1\}^{\ell} \rightarrow \{0,1\}^n}$ is chosen uniformly at random from all possible functions. In other words, for every message ${m\in\{0, 1\}^{\ell}}$, the corresponding codeword, ${{\rm Enc}(m)}$ is chosen uniformly at random from ${\{0,1\}^n}$. (Note that this might assign the same codeword to two different messages but this (tiny) probability will be absorbed into the decoding error probability.)

Pick ${\epsilon=\epsilon(\gamma) > 0}$ to be a small enough constant. The decoding function ${D}$ is defined as follows: ${D(y) = m}$ if ${{\rm Enc}(m)}$ is the unique codeword such that ${\Delta(y,{\rm Enc}(m))\le (p+\epsilon)n}$ and ${D(y) = {\rm fail}}$ otherwise.

For ${z\in \{0,1\}^n}$, let ${{\rm prob}(z)}$ denote the probability that the noise caused by ${{\rm BSC}_p}$ on input the all ${0}$‘s vector equals ${z}$ (note that ${{\rm prob}(z) = p^{{\rm wt}(z)} (1-p)^{n-{\rm wt}(z)}}$).

Fix a message ${m}$. For each possible ${{\rm Enc}}$, the probability that ${D({\rm Enc}(m) + z) \neq m}$ taken over the noise ${z}$ caused by ${{\rm BSC}_p}$ is at most

$\displaystyle \begin{array}{rcl} \mathop{\mathbb P}_z[D({\rm Enc}(m)+z) \neq m)] & \le & \mathop{\mathbb P}_z [ {\rm wt}(z) > (p+\epsilon)n ] + \sum_{z \in B(0,(p+\epsilon)n)} {\rm prob}(z) \mathbf{1}(D({\rm Enc}(m)+z) \neq m) \\ & \le & 2^{-\Omega(\epsilon^2 n)} + \sum_{z \in B(0,(p+\epsilon)n)} {\rm prob}(z) \sum_{m' \neq m} \mathbf{1}(\Delta({\rm Enc}(m)+z,{\rm Enc}(m')) \le (p+\epsilon) n) \end{array}$

where the notation ${\mathbf{1}(E)}$ stands for the indicator random variable of the event ${E}$, the first estimate follows from the Chernoff bound, and the second estimate because when the decoding is unsuccessful when at most ${(p+\epsilon)n}$ errors occur, there must be some other codeword besides ${{\rm Enc}(m)}$ that is close to the received word ${{\rm Enc}(m)+z}$.

Now let us bound the expected value of this probability of miscommunication over the random choice of ${{\rm Enc}}$. For each fixed ${z}$, and ${m \neq m'}$,

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}_{{\rm Enc}}[\mathbf{1}(\Delta({\rm Enc}(m)+z,{\rm Enc}(m'))] & \le & \mathop{\mathbb P}_{{\rm Enc}} [ \Delta({\rm Enc}(m)+z,{\rm Enc}(m')) \le (p+\epsilon)n ] \\ & = & \frac{{\rm Vol}(n,(p+\epsilon)n)}{2^n} \\ & \le & 2^{-(1-h(p+\epsilon))n} \\ \end{array}$

Therefore, by linearity of expectation

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}_{{\rm Enc}}\mathop{\mathbb P}_z[D({\rm Enc}(m)+z) \neq m)] & \le & 2^{-\Omega(\epsilon^2 n)} + \sum_{z \in B(0,(p+\epsilon)n)} {\rm prob}(z) 2^{\ell} 2^{-(1-h(p+\epsilon))n} \\ & \le & 2^{-\Omega(\epsilon^2 n)} + 2 \cdot 2^{- (h(p+\gamma)-h(p+\epsilon))n} < \frac{1}{2} \cdot 2^{-\delta n} \end{array}$

for some ${\delta = \delta(\gamma,p) > 0}$ when ${\epsilon}$ is chosen small enough.

We can conclude from the above for each fixed ${m}$ that the probability over ${{\rm Enc}}$ that the error probability in communicating ${m}$ (over the channel noise) exceeds ${2^{-\delta n/2}}$ is ${2^{-\delta n/2}}$. We would like to find an encoding ${{\rm Enc}}$ for which the error probability is low for every ${m}$ simultaneously. The bound is too weak to do a union bound over all ${2^{\ell}}$ messages. So we proceed as follows.

Since ${\mathop{\mathbb E}_{{\rm Enc}}\mathop{\mathbb P}_z[D({\rm Enc}(m)+z) \neq m)] < 2^{-1-\delta n}}$ for each fixed ${m}$, this also holds on average over all choices of ${m}$. That is

$\displaystyle \mathop{\mathbb E}_{m} \mathop{\mathbb E}_{{\rm Enc}}\mathop{\mathbb P}_z[D({\rm Enc}(m)+z) \neq m)] < 2^{-1-\delta n} \ .$

Changing the order of expectations

$\displaystyle \mathop{\mathbb E}_{{\rm Enc}} \mathop{\mathbb E}_{m}\mathop{\mathbb P}_z[D({\rm Enc}(m)+z) \neq m)] < 2^{-1-\delta n} \ .$

Therefore there must exist an encoding ${{\rm Enc}^*}$ for which

$\displaystyle \mathop{\mathbb E}_m \mathop{\mathbb P}_z[D({\rm Enc}^*(m)+z) \neq m)] < 2^{-1-\delta n} \ .$

By an averaging argument, for at most ${1/2}$ the messages ${m \in \{0,1\}^{\ell}}$ one can have ${\mathop{\mathbb P}_z[D({\rm Enc}^*(m)+z) \neq m)] \ge 2^{-\delta n}}$. Expurgating these messages, we get an encoding ${{\rm Enc}' : \{0,1\}^{\ell-1} \rightarrow \{0,1\}^n}$ and a decoding function ${D'}$ such that for every ${m' \in \{0,1\}^k}$, ${\mathop{\mathbb P}_z[D'({\rm Enc}'(m')+z) \neq m'] < 2^{-\delta n}}$. This finishes the proof of the theorem. $\Box$

We remark that neither the encoding function nor the decoding function in the above proof are efficiently computable. The challenge put forth by Shannon’s work is to “constructivize” his result and find explicit codes with polynomial time encoding and decoding that achieve capacity.

Exercise 2 Prove that Theorem 3 also holds with a linear code, and that a random linear code achieves capacity of the BSC with high probability. (In fact, the proof becomes easier in this case, as no expurgation is needed at the end.)

We end these notes by noting another connection between the Shannon and Hamming worlds. Though minimum distance is not the governing factor for achieving capacity on the BSC, a large minimum distance is necessary to to have a positive error exponent (i.e., achieve exponentially small error probability). We leave it as an exercise to justify this claim.