A.19 Probability and Stochastic Processes
This appendix provides a very brief overview of key topics. Readers with extra time can find many good textbooks on probability and stochastic (or random) processes.
A.19.1 Joint and conditional probability
If two events $A$ and $B$ are statistically independent, their joint probability is the product of their individual probabilities: $P(A, B) = P(A)P(B)$. For instance, the probability of obtaining two heads when tossing a fair coin twice is $P(A, B) = 1/4$, given that $P(A) = P(B) = 1/2$. Given that the event $B$ is tossing the coin for the second time, we can say that the conditional probability of $B$ given that the first tossing event $A$ occurred is $P(B|A) = P(B) = 1/2$ because $A$ does not influence $B$.
In general, the conditional probability is obtained from its definition: $P(A|B) = P(A, B)/P(B)$. Writing this equation as $P(A, B) = P(B)P(A|B)$, one can imagine a causal relation between $A$ and $B$, with $B$ occurring with probability $P(B)$ before $A$, and then $A$ occurring with probability $P(A|B)$ given that $B$ happened.
Starting from the joint probability $P(A, B)$, one can instead pick $A$ as the first event that happens and write $P(A, B) = P(A)P(B|A)$. Combining the two alternatives of writing $P(A, B)$, i.e., $P(B)P(A|B) = P(A)P(B|A)$, leads to
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \quad \text{(A.64)}$$
which is called Bayes' rule (see, e.g., [urlBMpro]).
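As a minimal numerical check of Bayes' rule, the Python sketch below (assuming NumPy is available) estimates the probabilities of two events defined on a fair die, $A$ = "even face" and $B$ = "face greater than 3" (these events, the sample size, and the seed are illustrative choices, not taken from the text), and verifies that Eq. (A.64) gives the same conditional probability as the direct definition.

```python
import numpy as np

# Monte Carlo check of Bayes' rule (Eq. A.64) on a fair die.
# Illustrative events: A = "even face", B = "face greater than 3".
rng = np.random.default_rng(seed=0)
faces = rng.integers(1, 7, size=100_000)   # simulated rolls of a fair die

A = (faces % 2 == 0)
B = (faces > 3)

P_A, P_B = A.mean(), B.mean()
P_A_given_B = A[B].mean()          # P(A|B) estimated by conditioning on B
P_B_given_A = B[A].mean()          # P(B|A) estimated by conditioning on A

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
print(f"P(A|B) directly:     {P_A_given_B:.3f}")
print(f"P(B|A) P(A) / P(B):  {P_B_given_A * P_A / P_B:.3f}")
```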
A.19.2 Random variables
First note that the outcome of a probabilistic experiment need not be a number but can be any element of a set of possible outcomes. For example, the outcome when a coin is tossed can be “heads” or “tails”. Basically, random variables allow us to map any probabilistic event into numbers, which can then be conveniently manipulated using mathematical operations such as integrals and derivatives. A source of confusion is that, strictly, a random variable (e.g., $X$ or $Y$) is a function. More specifically, a random variable (r.v.) is a function that associates a unique numerical value with every outcome of an experiment (an r.v. can be a complex number, a vector, etc., but here it will be assumed to be a real number). In math, a function output is often represented as $y = f(x)$. When dealing with a r.v., instead of adopting a notation such as $x = X(\omega)$, where $\omega$ is the outcome, both the random variable (equivalent to the function $X(\cdot)$) and its output value (equivalent to $x$) are represented by a single letter (e.g., $X$ or $x$).
There are two types of r.v.: discrete and continuous. Hence, a r.v. has either an associated probability distribution (discrete r.v.) or probability density function (continuous r.v.).
Assume a discrete r.v. $X$ and a continuous r.v. $Y$. While the former is typically described by a probability mass function (pmf), the latter can be described by a probability density function (pdf).
Say that $X$ represents the outcome of rolling a die. Its pmf is shown in Figure A.7 and indicates that each face of a fair die has a probability of $1/6$.
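This pmf can also be estimated by simulation. The short sketch below (assuming NumPy; the number of rolls and the seed are arbitrary choices) computes the relative frequency of each face, which approaches $1/6 \approx 0.167$.

```python
import numpy as np

# Estimating the pmf of a fair die from simulated rolls; each relative
# frequency should approach 1/6 as the number of rolls grows.
rng = np.random.default_rng(seed=0)
rolls = rng.integers(1, 7, size=60_000)

for face in range(1, 7):
    print(f"P(X = {face}) ≈ {(rolls == face).mean():.3f}")
```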
Now consider that $Y$ represents the amplitude of a Gaussian noise source with mean 2 and variance equal to 1. Its pdf $f_Y(y)$ is shown in Figure A.8. A common mistake is to interpret the value of a density function at a specific point as a probability. For example, it is wrong to say that the probability of $Y = 2$ is approximately 0.4, in spite of this being the value of $f_Y(2)$. The function represents a density, and the correct answer is that the probability of $Y = 2$, or of any other single point, is 0. One can extract a probability from a pdf only by integrating it over a non-zero range of its abscissa. For example, over the range $[2, 3]$ the probability is approximately 0.34, as indicated by the shaded area in Figure A.8.
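To illustrate the distinction between a density value and a probability, the sketch below (assuming SciPy is available) evaluates the pdf of Figure A.8 at $y = 2$ and integrates it over $[2, 3]$ via the cumulative distribution function.

```python
from scipy.stats import norm

# Gaussian r.v. Y with mean 2 and variance 1, as in Figure A.8.
mu, sigma = 2.0, 1.0

# The density at y = 2 is about 0.4, but it is not a probability:
print(f"f_Y(2) = {norm.pdf(2.0, loc=mu, scale=sigma):.3f}")

# Probabilities come from integrating the pdf. Over the range [2, 3]:
p = norm.cdf(3.0, loc=mu, scale=sigma) - norm.cdf(2.0, loc=mu, scale=sigma)
print(f"P(2 <= Y <= 3) = {p:.3f}")   # approximately 0.34
```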
When dealing with ratios of pdfs it is possible to have the abscissa range cancel out. For example, if a discrete binary r.v. $C$ is used to represent two classes (A and B, for example), and each class has a pdf associated with it ($f(y|\text{A})$ and $f(y|\text{B})$, respectively), Bayes' rule states
$$P(\text{A}|y) = \frac{f(y|\text{A})\,P(\text{A})}{f(y|\text{A})\,P(\text{A}) + f(y|\text{B})\,P(\text{B})} \quad \text{(A.65)}$$
In this case, the abscissa range $\Delta y$ that would convert each density value $f(y|\cdot)$ into a probability $f(y|\cdot)\,\Delta y$ cancels because it appears in both the numerator and the denominator.
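A small sketch of Eq. (A.65) is given below. The two Gaussian class-conditional pdfs and the equal priors are illustrative assumptions, not values from the text; the posterior of class A is computed directly as a ratio of weighted density values, with no $\Delta y$ needed.

```python
from scipy.stats import norm

# Posterior probability of class A given an observation y (Eq. A.65).
# The Gaussian class-conditional pdfs and the equal priors below are
# illustrative assumptions.
P_A, P_B = 0.5, 0.5

def posterior_A(y):
    f_y_A = norm.pdf(y, loc=-1.0, scale=1.0)  # f(y|A)
    f_y_B = norm.pdf(y, loc=+1.0, scale=1.0)  # f(y|B)
    return (f_y_A * P_A) / (f_y_A * P_A + f_y_B * P_B)

for y in (-2.0, 0.0, 2.0):
    print(f"P(A | y = {y:+.1f}) = {posterior_A(y):.3f}")
```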
A.19.3 Expected value
The expected value operator $\mathbb{E}[\cdot]$ is the most common mathematical formalism for calculating an average (or mean).
The expected value is a linear operator (see Appendix A.27.1 for more information about linearity), such that
$$\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]. \quad \text{(A.66)}$$
The expected value of a random variable $X$ can be estimated as a typical average when its realizations can be organized in a finite-dimension vector. For example, if $\mathbf{x} = [x_1, x_2, \ldots, x_N]$ is a vector with $N$ random samples of $X$, the expected value can be estimated as $\mathbb{E}[X] \approx \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$, where $\bar{x}$ is the conventional mean value.
If one has $N$ realizations of a random variable $X$, for instance organized as a vector $\mathbf{x} = [x_1, x_2, \ldots, x_N]$, and is looking for the expected value of a function $g(X)$, it is possible to apply the function to each realization (each element of $\mathbf{x}$) and then take their average. For example, assume $g(X) = X^2$ and one is interested in estimating $\mathbb{E}[X^2]$ based on the realizations of $X$. In this case, applying $g(\cdot)$ to the elements of $\mathbf{x}$ leads to the vector $[x_1^2, x_2^2, \ldots, x_N^2]$, whose average provides the estimate $\mathbb{E}[X^2] \approx \frac{1}{N}\sum_{n=1}^{N} x_n^2$.
As another example, consider that $\mathbb{E}[X]$ is known (or has been previously estimated), and one is interested in estimating the variance. In this case, $g(X) = (X - \mathbb{E}[X])^2$ and, if the realizations are still the elements of $\mathbf{x}$, applying $g(\cdot)$ leads to the vector $[(x_1 - \mathbb{E}[X])^2, \ldots, (x_N - \mathbb{E}[X])^2]$, whose mean value $\frac{1}{N}\sum_{n=1}^{N} (x_n - \mathbb{E}[X])^2$ is the estimated variance.
The variance is often denoted as $\sigma^2$ and can be written as

$$\sigma^2 = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2, \quad \text{(A.67)}$$

which is often interpreted as “the mean of the square minus the square of the mean”.
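The sketch below (assuming NumPy; the Gaussian samples and the seed are illustrative choices) estimates the variance from realizations of $X$ in the two equivalent ways: by the definition $\mathbb{E}[(X - \mathbb{E}[X])^2]$ and by $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

```python
import numpy as np

# Estimating the variance from N realizations of X in two equivalent ways.
# The Gaussian samples (mean 2, variance 1) and the seed are illustrative.
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)

mean_est = x.mean()                       # estimate of E[X]
second_moment_est = (x ** 2).mean()       # estimate of E[X^2], using g(X) = X^2

var_definition = ((x - mean_est) ** 2).mean()        # E[(X - E[X])^2]
var_identity = second_moment_est - mean_est ** 2     # E[X^2] - (E[X])^2

print(f"variance via definition: {var_definition:.3f}")
print(f"variance via identity:   {var_identity:.3f}")   # both close to 1
```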
When a discrete random variable $X$ assumes only $M$ distinct values $\mathcal{X}_1, \ldots, \mathcal{X}_M$, its mean can be estimated with
$$\mathbb{E}[X] \approx \sum_{i=1}^{M} P_i\,\mathcal{X}_i, \quad \text{(A.68)}$$
where $P_i$ is the probability of the $i$-th possible value $\mathcal{X}_i$. For instance, if the realizations in $\mathbf{x}$ assume only $M = 2$ distinct values $\mathcal{X}_1$ and $\mathcal{X}_2$, each one with an estimated probability given by its relative frequency in $\mathbf{x}$, then $\mathbb{E}[X] \approx P_1\mathcal{X}_1 + P_2\mathcal{X}_2$.
The same result could be obtained by directly taking the mean value of the $N$ elements with
$$\mathbb{E}[X] \approx \frac{1}{N}\sum_{n=1}^{N} x_n.$$
Note that $x_n$ corresponds to the $n$-th element in $\mathbf{x}$, while $\mathcal{X}_i$ is the $i$-th distinct value of $X$.
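A minimal sketch of the two equivalent estimates, Eq. (A.68) and the direct sample average, is shown below; the realization vector is an illustrative choice with $M = 2$ distinct values.

```python
import numpy as np

# Mean of a discrete r.v. estimated via Eq. (A.68) and via the direct
# sample average; the realization vector below is an illustrative choice.
x = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # N = 8 realizations, M = 2 distinct values

# Eq. (A.68): weight each distinct value by its estimated probability
values, counts = np.unique(x, return_counts=True)
P = counts / x.size
mean_from_pmf = np.sum(P * values)

# Direct average over the N realizations
mean_direct = x.mean()

print(mean_from_pmf, mean_direct)   # identical results
```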
When $X$ is a continuous random variable, instead of Eq. (A.68) one has its continuous version:
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx,$$
where $f_X(x)$ is the probability density function of $X$. In this case, for each value of $x$, the role of the probability $P_i$ in the discrete r.v. case is played by $f_X(x)\,dx$, which corresponds to a “weight” that depends on the likelihood of the specific value $x$.
In order to find $\mathbb{E}[g(X)]$ for a given function $g(\cdot)$, one can use:
$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx. \quad \text{(A.69)}$$
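As a sketch of Eq. (A.69), the code below (assuming SciPy and NumPy) computes $\mathbb{E}[X^2]$ for a Gaussian $X$ with mean 2 and variance 1 by numerical integration and compares it with a Monte Carlo average; the exact value is $\mu^2 + \sigma^2 = 5$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# E[g(X)] for g(X) = X^2 and X Gaussian with mean 2 and variance 1,
# via numerical integration of Eq. (A.69) and via a Monte Carlo average.
mu, sigma = 2.0, 1.0

def g(x):
    return x ** 2

# Integral of g(x) f_X(x) over the real line
expected_g, _ = quad(lambda x: g(x) * norm.pdf(x, loc=mu, scale=sigma),
                     -np.inf, np.inf)

# Monte Carlo estimate: average of g over random samples of X
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=mu, scale=sigma, size=100_000)
mc_estimate = g(samples).mean()

print(f"numerical integration: {expected_g:.3f}")   # exact value is 5
print(f"Monte Carlo estimate:  {mc_estimate:.3f}")
```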
A.19.4 Orthogonal versus uncorrelated
Two random variables $X$ and $Y$ are said to be orthogonal to each other if
$$\mathbb{E}[XY] = 0.$$
They are said to be uncorrelated with each other if
$$\mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = 0.$$
The above condition is equivalent to
$$\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y].$$
Note that if one or both of $X$ and $Y$ have zero mean, then the orthogonality and uncorrelatedness conditions are equivalent.
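The sketch below (assuming NumPy; the means, sample size, and seed are illustrative choices) draws two independent r.v., one of them with zero mean, and checks numerically that $\mathbb{E}[XY]$, $\mathbb{E}[X]\mathbb{E}[Y]$, and the covariance are all close to zero, so the two conditions coincide in this case.

```python
import numpy as np

# Orthogonality (E[XY] = 0) versus uncorrelatedness (E[XY] = E[X]E[Y]) for
# two independent r.v.; the means, sample size, and seed are illustrative.
rng = np.random.default_rng(seed=0)
N = 200_000
x = rng.normal(loc=0.0, scale=1.0, size=N)   # zero-mean X
y = rng.normal(loc=3.0, scale=1.0, size=N)   # nonzero-mean Y, independent of X

E_xy = (x * y).mean()
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()

print(f"E[XY]      = {E_xy:+.3f}")                  # close to 0: orthogonal
print(f"E[X]E[Y]   = {x.mean() * y.mean():+.3f}")
print(f"covariance = {cov_xy:+.3f}")                # close to 0: uncorrelated
```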
A.19.5 PDF of a sum of two independent random variables
If $X$ and $Y$ are independent, then $Z = X + Y$ has a pdf given by the convolution of the individual pdfs: $f_Z(z) = f_X(z) * f_Y(z)$. For example, consider the sum of a bipolar signal ($-5$ and $+5$ V, with probability 0.5 each), whose pdf is $f_X(x) = 0.5\,\delta(x+5) + 0.5\,\delta(x-5)$, and additive white Gaussian noise (AWGN) with pdf $f_Y(y)$. The convolution gives the Gaussian pdf scaled by 0.5 and shifted to the position of each of the original impulses, at $-5$ and $+5$, i.e., $f_Z(z) = 0.5\,f_Y(z+5) + 0.5\,f_Y(z-5)$.
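The sketch below (assuming NumPy and SciPy; the unit noise variance and the grid spacing are illustrative choices) approximates this convolution numerically, representing the two impulses of $f_X$ as narrow spikes of area 0.5 and convolving them with the Gaussian pdf; the result is the bimodal pdf described above.

```python
import numpy as np
from scipy.stats import norm

# pdf of Z = X + Y via numerical convolution: X is the bipolar signal
# (-5 or +5 V with probability 0.5 each) and Y is zero-mean Gaussian noise;
# the unit noise variance and the grid spacing dz are illustrative choices.
dz = 0.01
z = np.arange(-10, 10, dz)

# Represent the two impulses of f_X as narrow spikes with area 0.5 each
f_x = np.zeros_like(z)
f_x[np.argmin(np.abs(z + 5))] = 0.5 / dz
f_x[np.argmin(np.abs(z - 5))] = 0.5 / dz

f_y = norm.pdf(z, loc=0.0, scale=1.0)          # Gaussian noise pdf

f_z = np.convolve(f_x, f_y, mode="same") * dz  # f_Z = f_X * f_Y

# f_z shows two Gaussian lobes scaled by 0.5, centered near -5 and +5,
# and its total area is still (approximately) 1:
print(f"area under f_Z ≈ {f_z.sum() * dz:.3f}")
```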