We now turn to the basic elements of Shannon’s theory of communication over an intervening noisy channel.

**1. Model of information communication and noisy channel**

To quote Shannon from his paper *A Mathematical Theory of Communication*: “The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.” The basic setup of the communication problem consists of a source that generates digital information which is to be reliably communicated to a destination through a channel, preferably in the most efficient manner possible. This “destination” could be spatially separated (e.g., a distant satellite sending images of Jupiter back to the space station on Earth), or could be temporally separated (e.g., we want to retrieve data stored on our hard disk at a later point in time).

The following is a schematic of the communication model: the source’s output is compressed by a *source encoder*, protected against channel errors by a *channel encoder*, sent over the noisy channel, and then passed through the corresponding *channel decoder* and *source decoder* at the destination.

The first step in the communication model is to exploit the redundancy in the output of the source and compress the information to economize the amount of “raw, non-redundant” data that must be transmitted across the channel. This data compression step is called *source coding*. If at each time step the source outputs an i.i.d. copy of a random variable $Z$ supported on a finite set $\mathcal{Z}$, then Shannon’s source coding theorem states that one can compress its output to $H(Z)$ bits per time step (on average, over $n$ i.i.d. samples from the source $Z$, as $n \to \infty$). In other words $n$ samples from the source $Z$ can be coded as one of about $2^{H(Z)n}$ possible outputs. Here $H(Z)$ is the fundamental Shannon entropy defined as

$$H(Z) = \sum_{z \in \mathcal{Z}} \Pr[Z = z] \log \frac{1}{\Pr[Z = z]}$$

where $\log$ is to the base $2$. Thus the entropy of a fair coin toss is $1$, and that of a $p$-biased coin toss is $h(p) = p \log \frac{1}{p} + (1-p) \log \frac{1}{1-p}$ (the binary entropy function). The *source decoder* at the other end of the communication channel then decompresses the received information into (hopefully) the original output of the source.
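As a quick sanity check on these definitions, here is a small Python sketch (my own illustration, not part of the original notes) computing the entropy of a distribution and the binary entropy function $h(p)$:

```python
import math

def entropy(dist):
    """Shannon entropy H(Z) = sum_z p(z) * log2(1/p(z)) of a distribution
    given as a list of probabilities (zero-probability outcomes contribute 0)."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

def h(p):
    """Binary entropy function h(p): entropy of a p-biased coin toss."""
    return entropy([p, 1 - p])

print(h(0.5))   # fair coin: 1.0 bit
print(h(0.11))  # p-biased coin: ~0.4999 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
```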

The output of the source coder, say $m$, must then be communicated over a noisy channel. The channel’s noisy behavior causes errors in the received symbols at the destination. To recover from the errors incurred due to the channel, one should *encode* the information output by the source coder by adding systematic redundancy to it. This is done through *channel coding*, which maps $m$ to a codeword $c$ of some suitable error-correcting code (the study of channel coding will be our focus in this course).
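For a first feel for what “adding systematic redundancy” can buy, here is a toy Python sketch (an illustration of mine, not from the notes) using the simplest error-correcting code of all, the repetition code, against a bit-flipping channel:

```python
import random

def encode_repetition(bits, r=3):
    """Channel-encode by repeating each bit r times (a trivial error-correcting code)."""
    return [b for bit in bits for b in [bit] * r]

def decode_repetition(received, r=3):
    """Decode by majority vote within each block of r received bits."""
    return [int(sum(received[i:i + r]) > r // 2) for i in range(0, len(received), r)]

def bsc(bits, p):
    """Simulate a binary symmetric channel: flip each bit independently w.p. p."""
    return [bit ^ (random.random() < p) for bit in bits]

random.seed(0)
msg = [random.randint(0, 1) for _ in range(1000)]
received = bsc(encode_repetition(msg, r=5), p=0.1)
decoded = decode_repetition(received, r=5)
print(sum(m != d for m, d in zip(msg, decoded)) / len(msg))  # far below the raw 0.1 flip rate
```

The price, of course, is a rate of only $1/5$; the whole subject is about doing far better than this.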

**1.1. Modeling the noisy channel**

The basic channel model consists of an input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$. We will focus on *memoryless channels* — for each $x \in \mathcal{X}$ there is a distribution $D_x$ on $\mathcal{Y}$ such that when input $x$ is fed at one end of the channel, the channel distorts it to $y \in \mathcal{Y}$ according to an independent sample drawn from $D_x$. (In particular, the channel has no “state,” and its behavior is independent of the history of previously transmitted symbols.) The collection of the distributions $\{D_x\}_{x \in \mathcal{X}}$ comprises the “channel law” for the behavior of the channel. In a discrete memoryless channel (DMC), given by a triple $(\mathcal{X}, \mathcal{Y}, \Pi)$, the input and output alphabets are finite, and therefore the channel law can be specified by a conditional probability matrix $\Pi$, which is an $|\mathcal{X}| \times |\mathcal{Y}|$ stochastic matrix where each row sums to 1:

$$\Pi(x, y) = \Pr[\, y \text{ received} \mid x \text{ transmitted} \,].$$

The $(x, y)$’th entry $\Pi(x, y)$ is thus the conditional probability of receiving $y$ when $x$ was transmitted on the channel.
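Concretely, a DMC can be represented by its stochastic matrix, and one use of the channel amounts to sampling a row; a minimal Python sketch (with a made-up channel, purely for illustration):

```python
import random

# A DMC with X = {0, 1} and Y = {0, 1, 2}, specified by its stochastic
# matrix Pi: row x gives the distribution of the output when x is input.
# (The numbers are made up for illustration.)
Pi = [
    [0.8, 0.1, 0.1],  # Pi(0, y) for y = 0, 1, 2
    [0.1, 0.1, 0.8],  # Pi(1, y) for y = 0, 1, 2
]
assert all(abs(sum(row) - 1.0) < 1e-9 for row in Pi)  # each row sums to 1

def transmit(x):
    """One memoryless use of the channel: sample y with probability Pi(x, y)."""
    return random.choices(population=[0, 1, 2], weights=Pi[x])[0]

random.seed(1)
print([transmit(0) for _ in range(10)])
```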

**1.2. Noisy coding and joint source-channel coding theorems**

Suppose at the output of the source coder, we have a message $m$ from one of $M$ possible messages (that encode samples from the source $Z$), which is to be communicated across a channel $(\mathcal{X}, \mathcal{Y}, \Pi)$. Then the channel encoder encodes it into a sequence $x = (x_1, \dots, x_n) \in \mathcal{X}^n$ of some error-correcting code and the information is sent via $n$ uses of the channel. At the other end, a sequence $y = (y_1, \dots, y_n) \in \mathcal{Y}^n$ is received with conditional probability

$$p(y \mid x) = \prod_{i=1}^{n} \Pi(x_i, y_i) \qquad (2)$$

(due to the memoryless nature of the channel). The decoder must then map this sequence $y$ into a legal codeword (or equivalently into a message $m' \in \{1, \dots, M\}$).

A piece of notation: For a DMC $(\mathcal{X}, \mathcal{Y}, \Pi)$, a positive integer $n$, and $x \in \mathcal{X}^n$, let us denote by $\Pi(\cdot \mid x)$ the above distribution (2) on $\mathcal{Y}^n$ induced by the channel on input sequence $x$.

Theorem 1 (Shannon’s noisy coding theorem) For every discrete memoryless channel $(\mathcal{X}, \mathcal{Y}, \Pi)$, there exists a real number $C_0 = C_0(\Pi)$ called its *channel capacity*, such that the following holds for every $R < C_0$. For all large enough $n$, there exists an integer $M \ge 2^{Rn}$ and

- an encoding map $\mathrm{Enc} : \{1, \dots, M\} \to \mathcal{X}^n$ (of some error-correcting code over alphabet $\mathcal{X}$ of rate at least $R$), and
- a decoding map $\mathrm{Dec} : \mathcal{Y}^n \to \{1, \dots, M\} \cup \{\mathrm{fail}\}$,

such that for every $m \in \{1, \dots, M\}$

$$\Pr_{y \sim \Pi(\cdot \mid \mathrm{Enc}(m))}\big[\mathrm{Dec}(y) \ne m\big] \le 2^{-\Omega(n)},$$

where the probability is over the behavior of the channel (on input $\mathrm{Enc}(m)$). Further, the capacity $C_0$ is given by the expression

$$C_0 = \max_{p}\ \big( H(Y) - H(Y \mid X) \big),$$

where the maximum is taken over all probability distributions $p$ on $\mathcal{X}$. In the above, $H(Y)$ is the entropy of the $\mathcal{Y}$-valued random variable $Y$ with distribution function

$$\Pr[Y = y] = \sum_{x \in \mathcal{X}} p(x)\, \Pi(x, y),$$

and $H(Y \mid X)$ is the conditional entropy

$$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x) = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} \Pi(x, y) \log \frac{1}{\Pi(x, y)}.$$
Remark 1 The quantity $H(Y) - H(Y \mid X)$ is called the *mutual information* between $X$ and $Y$, and denoted $I(X; Y)$. It represents the decrease in uncertainty about a random variable $Y$ given the knowledge of random variable $X$, which intuitively captures how much information $X$ reveals about $Y$. If $Y$ is independent of $X$, then $H(Y \mid X) = H(Y)$, and $I(X; Y) = 0$. On the other hand if $Y = f(X)$ for some function $f$ (i.e., $Y$ is determined by $X$), then $H(Y \mid X) = 0$ and $I(X; Y) = H(Y)$.
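The two extreme cases in the remark are easy to check numerically; here is a small Python sketch (mine, for illustration) computing $I(X;Y) = H(Y) - H(Y \mid X)$ from an input distribution and a channel matrix:

```python
import math

def mutual_information(px, Pi):
    """I(X;Y) = H(Y) - H(Y|X) for input distribution px and channel matrix Pi."""
    H = lambda dist: sum(q * math.log2(1 / q) for q in dist if q > 0)
    ny = len(Pi[0])
    py = [sum(px[x] * Pi[x][y] for x in range(len(px))) for y in range(ny)]
    HY_given_X = sum(px[x] * H(Pi[x]) for x in range(len(px)))
    return H(py) - HY_given_X

# A noiseless channel: Y = X, so I(X;Y) = H(Y) = 1 for a uniform input.
print(mutual_information([0.5, 0.5], [[1, 0], [0, 1]]))  # 1.0
# A useless channel: Y independent of X, so I(X;Y) = 0.
print(mutual_information([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]]))  # 0.0
```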

Combining Shannon’s source coding and noisy coding theorems, and the two-stage communication process comprising a separate source coding stage followed by a channel coding stage, one can conclude that reliable communication of the output of a source $Z$ on a noisy channel of capacity $C_0$ is possible as long as $H(Z) < C_0$, i.e., the source outputs data at a rate that is less than the capacity of the channel. This result has a converse (called the converse to the joint source-channel coding theorem) that says that if $H(Z) > C_0$ then reliable communication is not possible.

Together, these imply a “separation theorem,” namely that it is information-theoretically optimal to do source and channel coding separately, and thus one can gain modularity in communication system design without incurring any loss in rate of data transfer. While this converse to the joint source-channel coding theorem is rather intuitive in the setting of point-to-point communication between a sender and a receiver, it is worth remarking that the separation theorem breaks down in some scenarios with multiple users and correlated sources.

We will not prove Shannon’s theorem in the above generality here, but content ourselves with establishing a special case (for the binary symmetric channel). The proof for the general case follows the same general structure once some basic information theory tools are set up, and we will remark briefly about this at the end. But first we will see some important examples of noisy channels.

**2. Examples of channels**

A discrete channel with finite input and output alphabets $\mathcal{X}$ and $\mathcal{Y}$ respectively, specified by the conditional probability matrix $\Pi$, can also be represented pictorially by an input-output diagram, which is a bipartite graph with nodes on the left identified with $\mathcal{X}$, nodes on the right identified with $\mathcal{Y}$, and a directed edge between $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ with weight $\Pi(x, y)$.

**2.1. Binary symmetric channel**

The *Binary Symmetric Channel* (BSC) has input alphabet $\mathcal{X} = \{0,1\}$ and output alphabet $\mathcal{Y} = \{0,1\}$. The BSC is parameterized by a real number $p$, $0 \le p \le 1/2$, called the *crossover probability*, and often denoted $\mathrm{BSC}_p$. The channel flips its input with probability $p$; in other words,

$$\Pi(x, x) = 1 - p \quad \text{and} \quad \Pi(x, 1-x) = p \qquad \text{for } x \in \{0,1\}.$$

Pictorially, $\mathrm{BSC}_p$ can be represented as an input-output diagram with “straight” edges $0 \to 0$ and $1 \to 1$ of weight $1-p$ and “crossover” edges $0 \to 1$ and $1 \to 0$ of weight $p$.

If a uniformly distributed input $X$ is fed to $\mathrm{BSC}_p$, then the output $Y$ is also uniformly distributed, so $H(Y) = 1$. Given $X$, $Y$ is distributed as a $p$-biased coin toss, and $H(Y \mid X) = h(p)$. Thus $I(X; Y) = H(Y) - H(Y \mid X) = 1 - h(p)$. It can be checked that the uniformly distributed $X$ maximizes $I(X; Y)$, and so Shannon’s theorem implies that $1 - h(p)$ is the capacity of $\mathrm{BSC}_p$. We will shortly prove this special case of Shannon’s theorem.
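One can confirm numerically that the uniform input is the maximizer; a short Python sketch (an illustration of mine) that grid-searches over input distributions with $\Pr[X = 1] = q$:

```python
import math

def h(p):
    """Binary entropy function."""
    return 0.0 if p in (0, 1) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def I_bsc(q, p):
    """I(X;Y) on BSC_p when Pr[X=1] = q: the output is 1 w.p. q(1-p) + (1-q)p."""
    return h(q * (1 - p) + (1 - q) * p) - h(p)

p = 0.11
# Grid search over input distributions: the maximum sits at q = 1/2.
best_q = max((i / 1000 for i in range(1001)), key=lambda q: I_bsc(q, p))
print(best_q, I_bsc(best_q, p), 1 - h(p))  # 0.5, and both values ~0.5001
```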

**2.2. Binary erasure channel**

The Binary Erasure Channel (BEC) is parameterized by a real $\alpha$, $0 \le \alpha \le 1$, which is called the *erasure probability*, and is denoted $\mathrm{BEC}_\alpha$. Its input alphabet is $\mathcal{X} = \{0,1\}$ and output alphabet is $\mathcal{Y} = \{0, 1, ?\}$. Upon input $x \in \mathcal{X}$, the channel outputs $x$ with probability $1 - \alpha$, and outputs $?$ (corresponding to erasing the symbol) with probability $\alpha$. (It never flips the value of a bit.) Pictorially, the input-output diagram has edges $x \to x$ of weight $1 - \alpha$ and edges $x \to\, ?$ of weight $\alpha$ for each $x \in \{0,1\}$.

When a bit string of length $n$, for large $n$, is transmitted across $\mathrm{BEC}_\alpha$, with high probability only about $(1-\alpha)n$ bits are received unerased at the other end. This suggests that the maximum rate at which reliable communication is possible is at most $1 - \alpha$. It turns out that a rate approaching $1 - \alpha$ can in fact be achieved, and the capacity of $\mathrm{BEC}_\alpha$ equals $1 - \alpha$.
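A two-line simulation (mine, purely illustrative) of the concentration that drives this heuristic:

```python
import random

random.seed(0)
alpha, n = 0.3, 100_000
# Transmit n bits through BEC_alpha and count how many survive unerased.
unerased = sum(random.random() >= alpha for _ in range(n))
print(unerased / n)  # concentrates around 1 - alpha = 0.7
```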

**2.3. Noisy Typewriter Channel**

The noisy typewriter channel is given by the following diagram: the input alphabet consists of the $26$ letters $A, \dots, Z$, and the channel outputs either the input letter or the next letter (cyclically), each with probability $1/2$.

If we restrict the code to send only one of the $13$ symbols $A, C, E, \dots, Y$ in each channel use, the sets of possible outputs for distinct inputs do not overlap, so we can communicate one of $13$ possible messages with **zero** error. Therefore the capacity of the channel is at least $\log_2 13$. One can prove that this rate is the maximum possible and the capacity of the channel is exactly $\log_2 13$. (Indeed, this follows from Shannon’s capacity formula: Since $|\mathcal{Y}| = 26$, $H(Y)$ is at most $\log_2 26$. Also $H(Y \mid X) = 1$ for every distribution of the channel input $X$. Hence $C_0 \le \log_2 26 - 1 = \log_2 13$.)
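To double-check the capacity computation, here is a small Python sketch (mine) that builds the $26 \times 26$ channel matrix and evaluates $H(Y) - H(Y \mid X)$ at the uniform input:

```python
import math

# Noisy typewriter on 26 letters: input i comes out as i or i+1 (mod 26),
# each with probability 1/2.
n = 26
Pi = [[0.0] * n for _ in range(n)]
for i in range(n):
    Pi[i][i] = Pi[i][(i + 1) % n] = 0.5

def H(dist):
    return sum(q * math.log2(1 / q) for q in dist if q > 0)

px = [1 / n] * n  # uniform input
py = [sum(px[x] * Pi[x][y] for x in range(n)) for y in range(n)]
HY_given_X = sum(px[x] * H(Pi[x]) for x in range(n))  # = 1 bit for every x
print(H(py) - HY_given_X, math.log2(13))  # both ~3.7004
```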

Note that we can achieve a rate equal to capacity with *zero* probability of miscommunication. For the $\mathrm{BSC}_p$ with $p > 0$, on the other hand, zero error communication is not possible at *any* positive rate, since for every pair of strings $x, y \in \{0,1\}^n$, there is a positive probability that $x$ will get distorted to $y$ by the noise caused by the $\mathrm{BSC}_p$.

The study of zero error capacity of channels was introduced in another classic work of Shannon. Estimating the zero error capacity of even simple channels (such as the $5$-cycle) has led to some beautiful results in combinatorics, including Lovász’s celebrated work on the Theta function.

**2.4. Continuous Output Channel**

We now see an example of a continuous output channel that is widely used to model noise and compare the performance (typically via simulation) of different coding schemes. The binary input additive white Gaussian noise (BIAWGN) channel has input alphabet $\{1, -1\}$ (it is more convenient to encode binary symbols by $\pm 1$ instead of $\{0, 1\}$) and output alphabet $\mathbb{R}$. A binary input $b \in \{0,1\}$ is “modulated” into the real number $x = (-1)^b$ and the channel adds additive noise distributed according to $N(0, \sigma^2)$ to $x$. Thus the output distribution is a Gaussian with mean $x$ and variance $\sigma^2$. Formally

$$p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - x)^2}{2\sigma^2}}.$$

The quantity $1/\sigma^2$ is commonly referred to as the *signal-to-noise ratio* (SNR for short), with $1$ corresponding to the energy per input bit and $\sigma^2$ corresponding to the amount of noise. The SNR is usually measured in decibel units (dB), and expressed as the value $10 \log_{10}(1/\sigma^2)$. As one might expect, the capacity of the AWGN channel increases as its SNR increases.
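Since the BIAWGN channel is typically studied via simulation, here is a minimal Python sketch (my own illustration; the parameters are arbitrary) that transmits uncoded $\pm 1$ symbols and estimates the raw bit error rate at a given SNR:

```python
import math
import random

random.seed(0)
sigma, n = 0.8, 200_000
snr_db = 10 * math.log10(1 / sigma**2)  # SNR in decibels

# Uncoded transmission: modulate a bit to x = +1/-1, add N(0, sigma^2) noise,
# and demodulate by the sign of the received real number.
errors = 0
for _ in range(n):
    x = random.choice([1, -1])
    y = x + random.gauss(0, sigma)
    errors += (1 if y >= 0 else -1) != x
print(f"SNR = {snr_db:.2f} dB, raw bit error rate = {errors / n:.4f}")
# The true error rate is Pr[N(0, sigma^2) < -1], about 0.1056 for sigma = 0.8.
```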

**3. Shannon’s capacity theorem for the binary symmetric channel**

We now turn to establishing that the capacity of $\mathrm{BSC}_p$ is $1 - h(p)$.

**3.1. Connection to minimum distance**

First, let us connect this question to the Hamming world. If we have a family of binary codes of relative distance more than $2p + \epsilon$ (for some $\epsilon > 0$), then we claim that this enables communicating on the $\mathrm{BSC}_p$ with exponentially small probability of miscommunication. The reason is that by the Chernoff bound for independent Bernoulli random variables (stated below), the probability that at least $(p + \epsilon/2)n$ errors are caused out of $n$ bits transmitted on a $\mathrm{BSC}_p$ is exponentially small. When the number of errors is less than $(p + \epsilon/2)n$, which is less than half the minimum distance of the code, the received word has a unique closest codeword in Hamming distance, which is also the original transmitted codeword.

Lemma 2 (Chernoff bound for i.i.d. Bernoulli random variables) If $X_1, \dots, X_n$ are i.i.d. $\{0,1\}$-valued random variables with $\Pr[X_i = 1] = p$, then for every $\epsilon > 0$, for large enough $n$ the following tail estimates hold:

$$\Pr\Big[\sum_{i=1}^{n} X_i \ge (p + \epsilon) n\Big] \le 2^{-\epsilon^2 n / 2} \qquad \text{and} \qquad \Pr\Big[\sum_{i=1}^{n} X_i \le (p - \epsilon) n\Big] \le 2^{-\epsilon^2 n / 2}.$$

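An empirical check of the upper tail (a Python sketch of mine; it compares the observed tail frequency against the Hoeffding-style bound $e^{-2\epsilon^2 n}$, a slightly stronger form of the same estimate):

```python
import math
import random

random.seed(0)
p, eps, n, trials = 0.1, 0.05, 500, 20_000
# Empirical tail Pr[sum X_i >= (p + eps) n] vs. the bound e^{-2 eps^2 n}.
tail = sum(
    sum(random.random() < p for _ in range(n)) >= (p + eps) * n
    for _ in range(trials)
) / trials
print(tail, math.exp(-2 * eps**2 * n))  # empirical tail (~1e-4) vs. bound (~0.082)
```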
Together with the Gilbert-Varshamov bound, we conclude the existence of codes of rate at least $1 - h(2p + \epsilon)$ for reliable communication on $\mathrm{BSC}_p$. This rate is positive only for $p < 1/4$, and falls short of the bound of $1 - h(p)$ which we “know” to be the capacity of $\mathrm{BSC}_p$ from Shannon’s general theorem.

The Hamming upper bound on rate for codes of relative distance $2p$ was also equal to $1 - h(p)$. So if the Hamming bound could be attained, we could achieve the capacity of $\mathrm{BSC}_p$ simply by using codes of relative distance $2p$. However, we will soon see that the Hamming upper bound can be improved, and there are no codes of positive rate for relative distance $2p$ for $p \ge 1/4$, or of rate close to $1 - h(p)$ for $0 < p < 1/4$.

**3.2. Intuition: mostly disjoint packings**

The key to Shannon’s theorem is that we do not need every pair of codewords to differ in a $2p$ fraction of locations; it suffices that *most* (as opposed to all) points obtained by flipping about a $p$ fraction of bits of a codeword $c$ have no other codeword closer than $c$. In other words, it suffices to be able to pack $2^{(1 - h(p))n}$ “mostly-disjoint” Hamming balls of radius $\approx pn$ so that most points in $\{0,1\}^n$ belong to at most one such Hamming ball. Indeed, we will show below (Theorem 3) that such a packing exists, and therefore one can reliably communicate on $\mathrm{BSC}_p$ with rate approaching $1 - h(p)$.

The intuition for the case of general discrete memoryless channels as stated in Theorem 1 is similar. For a typical sequence $x \in \mathcal{X}^n$ (chosen according to the product distribution $p^{\otimes n}$), when $x$ is transmitted, there are $\approx 2^{H(Y \mid X) n}$ possible received sequences in $\mathcal{Y}^n$ (call this the “neighborhood” of $x$), out of a total volume of $\approx 2^{H(Y) n}$. It turns out it is possible to pick a collection of $\approx 2^{(H(Y) - H(Y \mid X)) n}$ sequences in $\mathcal{X}^n$ whose neighborhoods are mostly disjoint. This enables reliable communication at a rate approaching $H(Y) - H(Y \mid X)$.

**3.3. Converse to capacity theorem for BSC**

We now give an explanation for why $1 - h(p)$ ought to be an *upper bound* on the capacity of the $\mathrm{BSC}_p$. Suppose a code $C \subseteq \{0,1\}^n$ achieves negligible error probability for communication on $\mathrm{BSC}_p$ with some decoding rule $\mathrm{Dec} : \{0,1\}^n \to C$. When $c \in C$ is transmitted, with overwhelming probability the received word belongs to a set $T_c$ of $\approx 2^{h(p)n}$ possible strings whose Hamming distance to $c$ is close to $pn$ (say in the range $[(p - o(1))n, (p + o(1))n]$), and these possibilities are all roughly equally likely. Therefore, in order to ensure that $c$ is recovered with high probability from its noisy version, the decoding rule must map most of the strings in $T_c$ to $c$. Thus we must have $|\mathrm{Dec}^{-1}(c)| \ge 2^{(h(p) - o(1))n}$ for each $c \in C$; since these sets are disjoint for distinct codewords, $|C| \le 2^{(1 - h(p) + o(1))n}$, leading to the upper bound of $1 - h(p)$ on the rate.

A different way to argue about the upper bound is related to a discussion in our very first lecture. It is based on the observation that when communication is successful, the decoder not only recovers the transmitted codeword but also the locations of the (typically around $pn$) errors. The former carries $\log_2 |C|$ bits of information, whereas the latter typically conveys $\log_2 \binom{n}{pn} \approx h(p) n$ bits of information. Since the total amount of non-redundant information that can be reliably conveyed by $n$ bits cannot exceed $n$, we again get the upper bound $\frac{\log_2 |C|}{n} \le 1 - h(p) + o(1)$.
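The estimate $\log_2 \binom{n}{pn} \approx h(p) n$ underlying both arguments is easy to verify numerically (a Python sketch of mine; the small gap is the $O(\log n)$ correction from Stirling’s formula):

```python
import math

def h(p):
    """Binary entropy function."""
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# The locations of pn errors carry about log2 C(n, pn) ~ h(p) n bits.
n, p = 10_000, 0.11
print(math.log2(math.comb(n, int(p * n))) / n)  # ~0.4993
print(h(p))                                     # ~0.4999
```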

Exercise 1 Develop the above arguments into a formal proof that communication at a rate of $1 - h(p) + \epsilon$ on $\mathrm{BSC}_p$ incurs a probability of error bounded below by an absolute constant, and in fact by $1 - 2^{-\Omega(n)}$ where $n$ is the block length of the code.

**3.4. The theorem**

We conclude these notes with the formal statement and proof of the capacity theorem for $\mathrm{BSC}_p$.

Theorem 3 For every real $p$ such that $0 \le p < 1/2$, and every $\epsilon$ with $0 < \epsilon \le 1/2 - p$, and all large enough integers $n$, there exists a $\delta = \delta(p, \epsilon) > 0$ and a code with encoding map $\mathrm{Enc} : \{0,1\}^k \to \{0,1\}^n$ for $k = \lfloor (1 - h(p + \epsilon)) n \rfloor$ and a decoding rule $\mathrm{Dec} : \{0,1\}^n \to \{0,1\}^k \cup \{\mathrm{fail}\}$ such that for every $m \in \{0,1\}^k$

$$\Pr_{z}\big[\mathrm{Dec}(\mathrm{Enc}(m) \oplus z) \ne m\big] \le 2^{-\delta n},$$

where the probability is over the noise $z$ caused by $\mathrm{BSC}_p$.

*Proof:* The construction is by the probabilistic method. Let $k = \lfloor (1 - h(p + \epsilon)) n \rfloor$. The encoding function $\mathrm{Enc} : \{0,1\}^k \to \{0,1\}^n$ is chosen uniformly at random from all possible functions. In other words, for every message $m \in \{0,1\}^k$, the corresponding codeword, $\mathrm{Enc}(m)$, is chosen uniformly at random from $\{0,1\}^n$. (Note that this might assign the same codeword to two different messages but this (tiny) probability will be absorbed into the decoding error probability.)

Pick $\epsilon' > 0$ to be a small enough constant (depending on $\epsilon$, to be fixed later). The decoding function $\mathrm{Dec}$ is defined as follows: $\mathrm{Dec}(y) = m$ if $\mathrm{Enc}(m)$ is the unique codeword such that $\Delta(y, \mathrm{Enc}(m)) \le (p + \epsilon') n$ (where $\Delta(\cdot, \cdot)$ denotes Hamming distance), and $\mathrm{Dec}(y) = \mathrm{fail}$ otherwise.

For $z \in \{0,1\}^n$, let $\pi(z)$ denote the probability that the noise caused by $\mathrm{BSC}_p$ on input the all-$0$s vector equals $z$ (note that $\pi(z) = p^{\mathrm{wt}(z)} (1-p)^{n - \mathrm{wt}(z)}$, where $\mathrm{wt}(z)$ denotes the Hamming weight of $z$; by the symmetry of the channel, the noise has the same distribution regardless of the transmitted word).

Fix a message $m$. For each fixed choice of the encoding $\mathrm{Enc}$, the probability that $\mathrm{Dec}(\mathrm{Enc}(m) \oplus z) \ne m$, taken over the noise $z$ caused by $\mathrm{BSC}_p$, is at most

$$\Pr_{z}\big[\mathrm{wt}(z) > (p + \epsilon') n\big] + \sum_{z \,:\, \mathrm{wt}(z) \le (p + \epsilon') n} \pi(z)\, \mathbf{1}\big(\exists\, m' \ne m : \Delta(\mathrm{Enc}(m) \oplus z, \mathrm{Enc}(m')) \le (p + \epsilon') n\big)$$

$$\le\ 2^{-\epsilon'^2 n / 2} + \sum_{z \,:\, \mathrm{wt}(z) \le (p + \epsilon') n} \pi(z)\, \mathbf{1}\big(\exists\, m' \ne m : \Delta(\mathrm{Enc}(m) \oplus z, \mathrm{Enc}(m')) \le (p + \epsilon') n\big),$$

where the notation $\mathbf{1}(E)$ stands for the indicator random variable of the event $E$, the first estimate follows from the Chernoff bound, and the second term accounts for the fact that if the decoding is unsuccessful when at most $(p + \epsilon') n$ errors occur, there must be some other codeword besides $\mathrm{Enc}(m)$ that is close to the received word $\mathrm{Enc}(m) \oplus z$.

Now let us bound the expected value of this probability of miscommunication over the random choice of $\mathrm{Enc}$. For each fixed $z$ with $\mathrm{wt}(z) \le (p + \epsilon') n$ and each $m' \ne m$, the codeword $\mathrm{Enc}(m')$ is uniformly distributed on $\{0,1\}^n$ and independent of $\mathrm{Enc}(m)$, so

$$\Pr_{\mathrm{Enc}}\big[\Delta(\mathrm{Enc}(m) \oplus z, \mathrm{Enc}(m')) \le (p + \epsilon') n\big] \le \frac{|B(0, (p + \epsilon') n)|}{2^n} \le \frac{2^{h(p + \epsilon') n}}{2^n},$$

where $B(0, r)$ denotes the Hamming ball of radius $r$. Therefore, by linearity of expectation and a union bound over the at most $2^k$ messages $m' \ne m$,

$$\mathbb{E}_{\mathrm{Enc}}\Big[\Pr_{z}\big[\mathrm{Dec}(\mathrm{Enc}(m) \oplus z) \ne m\big]\Big] \le 2^{-\epsilon'^2 n / 2} + 2^k \cdot 2^{(h(p + \epsilon') - 1) n} \le 2^{-\delta' n}$$

for some $\delta' = \delta'(p, \epsilon) > 0$ when $\epsilon'$ is chosen small enough (recall that $k \le (1 - h(p + \epsilon)) n$, so the second term is at most $2^{-(h(p + \epsilon) - h(p + \epsilon')) n}$).

We can conclude from the above, for each fixed $m$, that the probability over $\mathrm{Enc}$ that the error probability in communicating $m$ (over the channel noise) exceeds $2^{-\delta' n / 2}$ is at most $2^{-\delta' n / 2}$ (by Markov’s inequality). We would like to find an encoding for which the error probability is low for every $m$ simultaneously. The bound $2^{-\delta' n / 2}$ is too weak to do a union bound over all $2^k$ messages. So we proceed as follows.

Since $\mathbb{E}_{\mathrm{Enc}}\big[\Pr_{z}[\mathrm{Dec}(\mathrm{Enc}(m) \oplus z) \ne m]\big] \le 2^{-\delta' n}$ for each fixed $m$, this also holds on average over all choices of $m$. That is,

$$\mathbb{E}_{m}\, \mathbb{E}_{\mathrm{Enc}}\Big[\Pr_{z}\big[\mathrm{Dec}(\mathrm{Enc}(m) \oplus z) \ne m\big]\Big] \le 2^{-\delta' n}.$$

Changing the order of expectations,

$$\mathbb{E}_{\mathrm{Enc}}\, \mathbb{E}_{m}\Big[\Pr_{z}\big[\mathrm{Dec}(\mathrm{Enc}(m) \oplus z) \ne m\big]\Big] \le 2^{-\delta' n}.$$

Therefore there must exist an encoding $\mathrm{Enc}^*$ for which

$$\mathbb{E}_{m}\Big[\Pr_{z}\big[\mathrm{Dec}(\mathrm{Enc}^*(m) \oplus z) \ne m\big]\Big] \le 2^{-\delta' n}.$$

By an averaging argument, for at most half the messages $m$ one can have $\Pr_{z}[\mathrm{Dec}(\mathrm{Enc}^*(m) \oplus z) \ne m] \ge 2 \cdot 2^{-\delta' n}$. Expurgating these messages, we get an encoding $\mathrm{Enc}' : \{0,1\}^{k-1} \to \{0,1\}^n$ (of essentially the same rate) and a decoding function such that for every $m \in \{0,1\}^{k-1}$, $\Pr_{z}[\mathrm{Dec}(\mathrm{Enc}'(m) \oplus z) \ne m] \le 2 \cdot 2^{-\delta' n} \le 2^{-\delta n}$ for $\delta = \delta'/2$ and large enough $n$. This finishes the proof of the theorem.
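To see the probabilistic argument “in action,” here is a toy Monte Carlo sketch (my own illustration, not part of the notes; it uses nearest-codeword decoding instead of the unique-decoding rule from the proof, and the parameters are arbitrary):

```python
import random

random.seed(0)
p, n, k = 0.11, 16, 4  # rate k/n = 1/4, safely below capacity 1 - h(0.11) ~ 0.5

# A random code: each of the 2^k messages gets an independent uniform n-bit codeword.
code = [random.getrandbits(n) for _ in range(2**k)]

def decode(y):
    """Decode to the nearest codeword in Hamming distance (a brute-force search
    over all 2^k codewords -- this exhaustive search is why the scheme is slow)."""
    return min(range(2**k), key=lambda m: bin(code[m] ^ y).count("1"))

def noise():
    """A BSC_p noise pattern: each of the n bits is flipped independently w.p. p."""
    return sum((random.random() < p) << i for i in range(n))

trials = 20_000
errors = sum(decode(code[m] ^ noise()) != m
             for m in (random.randrange(2**k) for _ in range(trials)))
print(errors / trials)  # estimated error probability: small, but nonzero at this toy size
```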

We remark that neither the encoding function nor the decoding function in the above proof is efficiently computable (the encoding is given by an exponentially large table, and the decoding rule involves a search over all codewords). The challenge put forth by Shannon’s work is to “constructivize” his result and find explicit codes with polynomial time encoding and decoding that achieve capacity.

Exercise 2 Prove that Theorem 3 also holds with a linear code, and that a random linear code achieves the capacity of the BSC with high probability. (In fact, the proof becomes easier in this case, as no expurgation is needed at the end.)

We end these notes by noting another connection between the Shannon and Hamming worlds. Though minimum distance is not the governing factor for achieving capacity on the BSC, a large minimum distance is necessary to have a positive error exponent (i.e., to achieve exponentially small error probability). We leave it as an exercise to justify this claim.
