B.3  Probability Concepts for Signal Processing

This appendix provides a brief overview of probability concepts relevant to digital signal processing.

B.3.1  Joint and Conditional Probability

Let A and B be two events. Their joint probability, denoted by P(A,B) (or equivalently P(A ∩ B)), quantifies the probability that both events occur.

The conditional probability of B given A is defined as

\[ P(B \mid A) = \frac{P(A,B)}{P(A)}, \quad \text{for } P(A) > 0. \]

Equivalently, the joint probability can be written as

\[ P(A,B) = P(A)\,P(B \mid A). \]

Two events A and B are said to be independent if

\[ P(B \mid A) = P(B), \]

which implies that

P(A,B) = P(A)P(B).

Example B.1. Fair coin. Consider two independent tosses of a fair coin. Let A be the event “the first toss results in heads” and B the event “the second toss results in heads.” Then P(A) = P(B) = 0.5, and

P(A,B) = 0.5 × 0.5 = 0.25.

Moreover, P(B|A) = P(B), reflecting the independence of the two events.
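The probabilities in Example B.1 can also be estimated by simulation. The following is a minimal sketch, assuming NumPy is available; the seed and the number of trials are arbitrary choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (arbitrary) for reproducibility
num_trials = 100_000            # number of simulated experiments (arbitrary)

# Each row is one experiment: two independent fair tosses, True means "heads"
tosses = rng.random((num_trials, 2)) < 0.5
A = tosses[:, 0]  # event A: first toss results in heads
B = tosses[:, 1]  # event B: second toss results in heads

p_A = A.mean()             # estimate of P(A)
p_B = B.mean()             # estimate of P(B)
p_AB = (A & B).mean()      # estimate of the joint probability P(A,B)
p_B_given_A = p_AB / p_A   # estimate of P(B|A) = P(A,B) / P(A)

print(f"P(A) ~ {p_A:.3f}, P(B) ~ {p_B:.3f}")
print(f"P(A,B) ~ {p_AB:.3f}, P(A)P(B) ~ {p_A * p_B:.3f}")
print(f"P(B|A) ~ {p_B_given_A:.3f}")
```

With this many trials, the estimates should be close to 0.25 for P(A,B) and to 0.5 for P(B|A), up to statistical fluctuation.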

Starting from the two equivalent factorizations of the joint probability,

\[ P(A,B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B), \]

we obtain

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \tag{B.25} \]

which is known as Bayes’ rule (see, e.g., [urlBMpro]).

Although the notation P(B|A) may suggest that A influences B, conditional probability does not, in general, imply a causal relationship.
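Bayes' rule can be checked with exact arithmetic on a small sample space. The sketch below uses a fair die and two events chosen only for illustration (A: even face, B: face larger than 3); these events are hypothetical and were chosen to be dependent, so that P(B|A) differs from P(B).

```python
from fractions import Fraction

omega = range(1, 7)                    # sample space of a fair die
A = {w for w in omega if w % 2 == 0}   # hypothetical event A: even face
B = {w for w in omega if w > 3}        # hypothetical event B: face larger than 3

def prob(event):
    """Probability of an event (set of faces) under equiprobable outcomes."""
    return Fraction(len(event), 6)

p_A, p_B = prob(A), prob(B)
p_AB = prob(A & B)                 # joint probability P(A,B)
p_B_given_A = p_AB / p_A           # definition of conditional probability
p_A_given_B_direct = p_AB / p_B    # P(A|B) directly from its definition
p_A_given_B_bayes = p_B_given_A * p_A / p_B   # P(A|B) via Bayes' rule

print(p_A_given_B_direct, p_A_given_B_bayes)  # both are 2/3
```

Both routes give the same value, and P(A,B) = 1/3 differs from P(A)P(B) = 1/4, which confirms that these two events are not independent.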

B.3.2  Random variables

Applying probability concepts to signal processing requires distinguishing three concepts: experiment outcomes, events, and random variables. A random variable maps outcomes into numbers, as detailed below.

The outcome of a probabilistic experiment is not necessarily a number, but an element ω of a set Ω, called the sample space. For example, when tossing a coin, the outcome can be ω = heads or ω = tails.

A random variable provides a way to assign numerical values to these outcomes. Formally, a random variable X is a function

\[ X : \Omega \to \mathbb{R}, \]

which associates a real number X(ω) with each outcome ω ∈ Ω. (More generally, random variables may take complex values, vector values, etc., but here we restrict attention to real-valued variables.)

A common source of confusion is the notation. In standard mathematical notation, a function is written as y = f(x), where f is the function and y is its output. In probability, however, it is customary to use the same symbol X both for the function (the random variable) and for its values. Strictly speaking, the numerical value is X(ω), but in practice we often write simply X. This is an abuse of notation that must be understood from context.

Every random variable induces a probability distribution on the real line. If the set of possible values of X is finite or countable, X is called a discrete random variable and is described by a probability mass function. If X takes values in a continuum, it is called a continuous random variable and is described (when it exists) by a probability density function.

In summary, a random variable is a function and its “randomness” comes from the randomness of its input ω, not from the function itself.
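The idea that a random variable is a deterministic function applied to a random outcome can be sketched in a few lines. The numerical values assigned to each outcome (1.0 for heads, 0.0 for tails), the seed, and the number of draws are assumptions made only for this illustration.

```python
import random

# Sample space of a coin toss (the outcomes are not numbers)
OMEGA = ["heads", "tails"]

def X(omega):
    """Random variable: a deterministic map from outcomes to real numbers.
    The values 1.0 and 0.0 are an arbitrary choice for this illustration."""
    return 1.0 if omega == "heads" else 0.0

random.seed(0)  # arbitrary seed for reproducibility
outcomes = [random.choice(OMEGA) for _ in range(10_000)]  # random inputs
values = [X(w) for w in outcomes]                         # numbers X(omega)

# Empirical pmf that X induces on the real line
pmf = {v: values.count(v) / len(values) for v in sorted(set(values))}
print(pmf)
```

Calling X("heads") always returns the same number; only the choice of the outcome is random.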

Mapping outcomes into numbers is convenient because the resulting values can be manipulated with mathematical operations such as integrals and derivatives. In the remainder of this appendix, "random variable" is abbreviated as r.v.

There are two types of r.v.: discrete and continuous. Assume a discrete r.v. X and a continuous r.v. Y. The former is typically described by a probability mass function (pmf), while the latter can be described by a probability density function (pdf).

Say that X represents the outcome of rolling a die. Its pmf is shown in Figure B.1 and indicates that each face of a fair die has a probability of 1/6.

Figure B.1: The pmf of the outcome of a die roll.

Now consider that Y represents the amplitude of a Gaussian noise source with mean 2 and variance 1. Its pdf is shown in Figure B.2. A common mistake is to interpret the value of a density function at a specific point as a probability. For example, it is wrong to say that the probability of Y = 2 is 0.4, even though this is approximately the value of the pdf at that point. The function represents a density, and the probability of Y = 2, or of any other single point, is 0. One can extract a probability from a pdf only by integrating it over a range of nonzero width along its abscissa. For example, over the range [2, 3] the probability is approximately 0.34, as indicated by the shaded area in Figure B.2.

Figure B.2: Obtaining a probability from a pdf (density function) requires integrating over a range. The example shows that the probability of this Gaussian variable being within [2, 3] is approximately 0.34.
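The values quoted for this Gaussian example can be reproduced with the Python standard library. The sketch below writes the Gaussian cumulative distribution in terms of the error function; the helper names gaussian_pdf and gaussian_cdf are introduced here only for illustration.

```python
import math

MU, SIGMA = 2.0, 1.0  # mean and standard deviation of Y (variance 1)

def gaussian_pdf(y, mu=MU, sigma=SIGMA):
    """Value of the Gaussian density at y (a density, not a probability)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gaussian_cdf(y, mu=MU, sigma=SIGMA):
    """P(Y <= y), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2))))

density_at_2 = gaussian_pdf(2.0)                      # about 0.399
prob_2_to_3 = gaussian_cdf(3.0) - gaussian_cdf(2.0)   # about 0.341
prob_at_2 = gaussian_cdf(2.0) - gaussian_cdf(2.0)     # zero-width interval

print(f"{density_at_2:.4f} {prob_2_to_3:.4f} {prob_at_2}")
```

The density at y = 2 is close to 0.4, but the probability assigned to the single point y = 2 is zero, while the interval [2, 3] carries a probability of about 0.34.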

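When dealing with ratios of pdfs, the width Δx of the abscissa range can cancel out. For example, assume a discrete binary r.v. represents two classes, A and B, with prior probabilities P(A) and P(B), and each class has an associated pdf, f(x|A) and f(x|B), respectively. The probability of observing a value within a small range of width Δx around x is then approximately f(x|A)Δx for class A and f(x|B)Δx for class B, and Bayes’ rule gives

\[ P(A \mid x) = \frac{f(x \mid A)\,P(A)}{f(x \mid B)\,P(B)}\, P(B \mid x). \tag{B.26} \]

In this case, Δx cancels because it appears in both the numerator and the denominator. When the two classes are equiprobable, the priors cancel as well.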
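A small numerical sketch may help to see how posterior probabilities are obtained from class-conditional densities without ever using Δx. The two Gaussian class densities, their means, and the equal priors below are hypothetical values chosen only for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical class-conditional densities and priors (not from the text)
MU_A, MU_B = -1.0, 1.0
P_A, P_B = 0.5, 0.5

def posterior_A(x):
    """P(A|x) by Bayes' rule; the interval width cancels in the ratio."""
    num = gaussian_pdf(x, MU_A) * P_A
    den = num + gaussian_pdf(x, MU_B) * P_B
    return num / den

x = 0.3
p_A_x = posterior_A(x)
p_B_x = 1.0 - p_A_x
# Ratio of (density times prior) for the two classes
ratio = gaussian_pdf(x, MU_A) * P_A / (gaussian_pdf(x, MU_B) * P_B)

print(p_A_x, ratio * p_B_x)  # the two numbers should agree
```

The second printed value computes P(A|x) as the ratio of weighted densities multiplied by P(B|x), and it should match the direct evaluation up to rounding.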

B.3.3  Expected value

The expected value operator 𝔼[·] is the most common mathematical formalism for calculating an average (or mean).

The expected value is a linear operator (see Appendix B.7.1 for more information about linearity), such that

𝔼[αX + βY] = α𝔼[X] + β𝔼[Y].
(B.27)

The expected value 𝔼[X] of a random variable X can be estimated as a conventional average when realizations of X are available as a finite-length vector. For example, if x = [3, 5, 4, 4, 5, 3] is a vector with random samples from X, the expected value can be estimated as $\mathbb{E}[X] \approx \bar{x} = (3 + 5 + 4 + 4 + 5 + 3)/6 = 4$, where $\bar{x}$ is the conventional mean value.

If one has realizations of a random variable X, for instance organized as a vector x, and is looking for the expected value $\mathbb{E}[g(X)]$ of a function g(X), it is possible to apply the function g(·) to each realization (element of x) and then take the average. For example, assume $g(X) = X^2$ and that one is interested in estimating $\mathbb{E}[X^2]$ based on the realizations $x = [-1, 3, -3, 4, -2]$ of X. In this case, applying $g(X) = X^2$ to the elements of x leads to the vector $y = [(-1)^2, 3^2, (-3)^2, 4^2, (-2)^2]$, which has the average $\bar{y} = (1 + 9 + 9 + 16 + 4)/5 = 7.8$. The estimate is $\mathbb{E}[X^2] \approx \bar{y} = 7.8$.

As another example, assume that $\mu_x = \mathbb{E}[X]$ is known (or has been previously estimated), and one is interested in estimating the variance $\mathbb{E}[(X - \mu_x)^2]$. In this case, $g(X) = (X - \mu_x)^2$ and, if the realizations are still $x = [-1, 3, -3, 4, -2]$, which have the average $\mu_x = 0.2$, applying g(X) leads to the vector $z = [(-1 - 0.2)^2, (3 - 0.2)^2, (-3 - 0.2)^2, (4 - 0.2)^2, (-2 - 0.2)^2]$ with mean value $\bar{z} = 7.76$. In this case the estimated variance is $\mathbb{E}[(X - \mu_x)^2] \approx \bar{z} = 7.76$.

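The variance is often denoted as $\sigma_x^2$, and can be written as

\[
\begin{aligned}
\sigma_x^2 &= \mathbb{E}[(X - \mu_x)^2] && \text{(definition of variance)} \\
&= \mathbb{E}[X^2 - 2X\mu_x + \mu_x^2] && \text{(by expanding the square)} \\
&= \mathbb{E}[X^2] - \mathbb{E}[2X\mu_x] + \mathbb{E}[\mu_x^2] && \text{(the expected value is a linear operation)} \\
&= \mathbb{E}[X^2] - 2\mu_x\mathbb{E}[X] + \mu_x^2 && \text{($\mu_x$ is a constant)} \\
&= \mathbb{E}[X^2] - 2\mu_x^2 + \mu_x^2 && \text{($\mathbb{E}[X] = \mu_x$)} \\
&= \mathbb{E}[X^2] - \mu_x^2,
\end{aligned}
\tag{B.28}
\]

which is often interpreted as $\mathbb{E}[X^2] = \sigma_x^2 + \mu_x^2$.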
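The sample-based estimates discussed above, and the relation between variance, second moment, and mean, can be checked numerically. This is a minimal sketch, assuming NumPy and the realizations x = [−1, 3, −3, 4, −2]; the variable names are chosen only for this illustration.

```python
import numpy as np

x = np.array([-1, 3, -3, 4, -2], dtype=float)  # realizations of X

mean_x = x.mean()                       # estimate of E[X]
second_moment = np.mean(x ** 2)         # estimate of E[X^2]
variance = np.mean((x - mean_x) ** 2)   # estimate of E[(X - mu_x)^2]

print(mean_x, second_moment, variance)
# Check the identity: variance = E[X^2] - mu_x^2
print(np.isclose(variance, second_moment - mean_x ** 2))
```

The three estimates should be approximately 0.2, 7.8, and 7.76, and the identity holds because 7.8 − 0.2² = 7.76.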

When a discrete random variable X has V distinct values, its mean 𝔼[X] can be estimated with

\[ \mathbb{E}[X] \approx \bar{x} = \sum_{i=1}^{V} p_i x_i, \tag{B.29} \]

where $p_i$ is the probability of the i-th possible value $x_i$. For instance, the realizations x = [3, 5, 4, 4, 5, 3] have only V = 3 distinct values: $x_1 = 3$, $x_2 = 4$ and $x_3 = 5$, each one with estimated probability $p_i = 1/3, \forall i$. Hence,

\[ \mathbb{E}[X] \approx \bar{x} = \sum_{i=1}^{V} p_i x_i = (1/3)(3 + 4 + 5) = 4. \]

The same result could be obtained by directly taking the mean value of the N = 6 elements with

\[ \mathbb{E}[X] \approx \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x[n] = (3 + 5 + 4 + 4 + 5 + 3)/6 = 4. \]

Note that x[n] corresponds to the n-th element in x, while $x_i$ is the i-th distinct value of x.
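The equivalence between the two ways of estimating the mean can be verified with exact fractions. The sketch below estimates each probability as a relative frequency, which is an assumption consistent with the example but not the only possible estimator.

```python
from collections import Counter
from fractions import Fraction

x = [3, 5, 4, 4, 5, 3]   # realizations of X
N = len(x)

# Estimated probability of each distinct value: its relative frequency
counts = Counter(x)
pmf = {value: Fraction(count, N) for value, count in counts.items()}

mean_from_pmf = sum(p * value for value, p in pmf.items())   # sum of p_i x_i
mean_from_samples = Fraction(sum(x), N)                      # (1/N) sum of x[n]

print(pmf)
print(mean_from_pmf, mean_from_samples)  # both equal 4
```

Grouping the samples by distinct value only reorders the terms of the sum, which is why both estimates coincide.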

When X is a continuous random variable, instead of Eq. (B.29) one has its continuous version:

\[ \mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx, \]

where $f_X(x)$ is the probability density function of X. In this case, for each value of x, the role played by the probability $p_i$ in the discrete case is taken by the value $f_X(x)$, which corresponds to a “weight” that depends on the likelihood of the specific value x.

In order to find 𝔼[g(X)] for a given function g() one can use:

\[ \mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\,dx. \tag{B.30} \]
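The integral for 𝔼[g(X)] can be approximated numerically by sampling the pdf on a fine grid. The following is a rough sketch, assuming a Gaussian pdf with mean 2 and variance 1 and a simple Riemann sum; the grid limits and resolution are arbitrary choices, and the helper expected_value is hypothetical.

```python
import numpy as np

MU, SIGMA = 2.0, 1.0   # Gaussian with mean 2 and variance 1 (assumed example)

# Fine grid covering MU +/- 10 standard deviations, where the tails are negligible
x = np.linspace(MU - 10 * SIGMA, MU + 10 * SIGMA, 200_001)
dx = x[1] - x[0]
f_X = np.exp(-0.5 * ((x - MU) / SIGMA) ** 2) / (SIGMA * np.sqrt(2 * np.pi))

def expected_value(g):
    """Riemann-sum approximation of the integral of g(x) f_X(x) dx."""
    return np.sum(g(x) * f_X) * dx

total_prob = expected_value(lambda t: np.ones_like(t))   # should be close to 1
mean = expected_value(lambda t: t)                       # should be close to MU
second_moment = expected_value(lambda t: t ** 2)         # close to SIGMA^2 + MU^2

print(total_prob, mean, second_moment)
```

For this pdf the second moment should be close to 5, which is the sum of the variance (1) and the squared mean (4).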

B.3.4  Orthogonal versus uncorrelated

Two random variables X and Y are said to be orthogonal to each other if

𝔼[XY] = 0.

They are said to be uncorrelated with each other if

\[ \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = 0. \]

The above condition is equivalent to

𝔼[XY] = 𝔼[X]𝔼[Y].

Note that if one or both of X and Y have zero mean, then the orthogonal and uncorrelated conditions are equivalent.
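The difference between the two conditions can be illustrated with exact arithmetic. In the sketch below, X and Y are independent and each is uniformly distributed over a small set of values; these distributions, and the helper moments, are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from itertools import product

def moments(values_x, values_y):
    """E[X], E[Y], E[XY] for independent X and Y, each uniform over its values."""
    p = Fraction(1, len(values_x) * len(values_y))   # probability of each pair
    e_x = sum(Fraction(v) for v in values_x) / len(values_x)
    e_y = sum(Fraction(v) for v in values_y) / len(values_y)
    e_xy = sum(p * vx * vy for vx, vy in product(values_x, values_y))
    return e_x, e_y, e_xy

# Case 1: nonzero means, so uncorrelated does not imply orthogonal
e_x, e_y, e_xy = moments([0, 2], [0, 2])
uncorrelated_1 = (e_xy == e_x * e_y)   # True: E[XY] = E[X]E[Y] = 1
orthogonal_1 = (e_xy == 0)             # False: E[XY] = 1

# Case 2: zero means, so both conditions coincide
e_x, e_y, e_xy = moments([-1, 1], [-1, 1])
uncorrelated_2 = (e_xy == e_x * e_y)   # True
orthogonal_2 = (e_xy == 0)             # True

print(uncorrelated_1, orthogonal_1, uncorrelated_2, orthogonal_2)
```

In the first case the variables are uncorrelated but not orthogonal, while in the zero-mean case they are both.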

B.3.5  PDF of a sum of two independent random variables

If X and Y are independent random variables and Z = X + Y, then the probability density function (pdf) of Z is given by the convolution

\[ f_z(z) = (f_x * f_y)(z) = \int_{-\infty}^{\infty} f_x(\tau)\, f_y(z - \tau)\, d\tau. \]

Example B.2. Probability distribution of noisy bipolar signal. As an example, consider a bipolar signal that takes values ±5 V with equal probability, whose pdf is

\[ f_x(x) = \frac{1}{2}\left[\delta(x - 5) + \delta(x + 5)\right], \]

and additive white Gaussian noise (AWGN) with pdf $f_y(y)$. In this case, the pdf of Z = X + Y is obtained by convolution and results in two shifted replicas of $f_y(y)$:

\[ f_z(z) = \frac{1}{2}\left[f_y(z - 5) + f_y(z + 5)\right]. \]

That is, the output pdf consists of two Gaussian components, each scaled by 1/2 and centered at 5 and −5, respectively.
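The result of Example B.2 can be checked with a discretized convolution. The sketch below assumes zero-mean noise with unit variance (the text does not specify the noise power) and approximates each Dirac impulse by a single grid sample with area 1/2; the grid limits and resolution are arbitrary.

```python
import numpy as np

SIGMA = 1.0   # assumed noise standard deviation (not specified in the text)
AMP = 5.0     # bipolar amplitude in volts

# Symmetric grid with an odd number of points, so that mode="same" stays aligned
z = np.linspace(-15.0, 15.0, 3001)
dz = z[1] - z[0]

# Discrete stand-in for the two Dirac impulses: each one has area 1/2
f_x = np.zeros_like(z)
f_x[np.argmin(np.abs(z - AMP))] = 0.5 / dz
f_x[np.argmin(np.abs(z + AMP))] = 0.5 / dz

def f_y(y):
    """Zero-mean Gaussian pdf of the noise."""
    return np.exp(-0.5 * (y / SIGMA) ** 2) / (SIGMA * np.sqrt(2 * np.pi))

f_z_numeric = np.convolve(f_x, f_y(z), mode="same") * dz   # discretized convolution
f_z_closed = 0.5 * (f_y(z - AMP) + f_y(z + AMP))           # two shifted replicas

max_error = np.max(np.abs(f_z_numeric - f_z_closed))
total_area = np.sum(f_z_closed) * dz

print(max_error)    # small rounding error
print(total_area)   # close to 1, as required for a pdf
```

The numerical convolution and the closed-form expression should agree up to rounding, and the resulting pdf should integrate to approximately 1.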