Introduction
Set theory
A set is a collection of elements, such as the possible outcomes of an experiment. $S$ is the universal set, also known as the sample space. $\varnothing$ is the empty set. $A\cup B$ is the union of two sets, and $A\cap B$ is their intersection. $A-B$ is the set difference (also written $A\backslash B$). $A^C$ is the complement of $A$ (also $S-A$). We write $x\in A$ to say that $x$ is an element of $A$, and $A\subset B$ to say that $A$ is a subset of $B$. Two sets are mutually exclusive (disjoint) if $A\cap B=\varnothing$, meaning they have no elements in common. Union and intersection are commutative, associative and distributive. De Morgan's Laws: $$(A\cup B)^C=A^C\cap B^C$$ $$(A\cap B)^C=A^C\cup B^C$$
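As a quick illustration, these set operations map directly onto Python's built-in set type; the sample space and events below (a six-sided die) are made-up examples, a minimal sketch rather than anything from the notes.
<code python>
# Minimal sketch of the set operations above using Python's built-in sets.
# The sample space S (a six-sided die) and the events A, B are made-up examples.
S = {1, 2, 3, 4, 5, 6}          # universal set / sample space
A = {2, 4, 6}                   # event: "even outcome"
B = {4, 5, 6}                   # event: "outcome of at least 4"

union = A | B                   # A union B
intersection = A & B            # A intersect B
difference = A - B              # A - B (set difference)
complement_A = S - A            # A^C = S - A

# De Morgan's laws hold element-wise on finite sets:
assert S - (A | B) == (S - A) & (S - B)   # (A u B)^C = A^C n B^C
assert S - (A & B) == (S - A) | (S - B)   # (A n B)^C = A^C u B^C
</code>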
A partition of a set is a grouping of its elements into non-empty, mutually exclusive subsets. Each time an experiment is run is known as a trial and the result is the outcome. A random experiment is one where the outcome is nondeterministic, and we use probability to model the outcome. Changing what is observed can result in a different sample space. An event is a collection of outcomes, and hence events are sets and subsets of $S$. An event is said to have occurred if the outcome is in the event. If the outcome is not in the event, we say that the event's complement has occurred. Events that are not mutually exclusive can occur simultaneously. An event containing a single outcome is called an elementary event. $S$ is a subset of itself, and hence the event $S$ always occurs. The empty set is an event which never occurs and is included as a convention.
Probability
Models
A probabilistic model is a mathematical description of an unknown situation. It contains:
- The sample space, containing all possible outcomes
- A probability law that assigns probabilities to all possible events.
The probability of an event is the chance or likelihood of the event occurring when an experiment is performed. We write this set-function (measure) as $P[A]$. $P[\cdot]$ is assumed to satisfy the three axioms of probability:
- Non-negativity $P[A]\geq 0$ for every event $A$
- Additivity If $A\cap B=\varnothing$, then $P[A\cup B]=P[A]+P[B]$
- Normalisation $P[S]=1$
From these we can derive:
- $P[A^C]=1-P[A]$
- $P[\varnothing]=0$
- $\forall A, 0\leq P[A]\leq 1$
- $P[A\cup B]=P[A]+P[B]-P[A\cap B]$
- $A\subset B\implies P[A]\leq P[B]$
For discrete (countable) sample spaces, the probability of an event is: $$P[A]=P[\{s_i:s_i\in A\}]=P\left[\bigcup_{i:s_i\in A}\{s_i\}\right]=\sum_{i:s_i\in A}P[s_i]$$ For continuous (uncountable) sample spaces, we must use probabilities of events rather than outcomes, as the probability of an individual outcome is zero.
Conditional Probability
If an experiment is performed and the exact outcome is known, then there is no more randomness. The outcome tells us whether each event has occurred or not. If we instead know whether an event has occurred, but not the outcome, conditional probability lets us find the chance that another event has also occurred. We write the conditional probability of B given A as $P[B|A]$, whereas $P[B]$ is the original (unconditional) probability. We can find this probability by normalising the probability of the intersection by the probability of A. $$P[B|A]=\frac{P[A\cap B]}{P[A]}$$ We can use this to find: $$P[A\cap B]=P[A|B]*P[B]=P[B|A]*P[A]$$ Which gives us Bayes' rule: $$P[B|A]=\frac{P[A|B]*P[B]}{P[A]}$$
For a set with a partition $B_1,...,B_n$, we can find the probability of an event as: $$P[A]=P[(A\cap B_1)\cup(A\cap B_2)\cup...\cup(A\cap B_n)]=\sum_{i=1}^n P[A\cap B_i]$$ Which gives us the theorem of total probability: $$P[A]=P[A|B_1]*P[B_1]+P[A|B_2]*P[B_2]+...+P[A|B_n]*P[B_n]=\sum_{i=1}^n P[A|B_i]*P[B_i]$$ We can also extend Bayes' rule with partitions to: $$P[B_j|A]=\frac{P[A|B_j]*P[B_j]}{\sum_{k=1}^n P[A|B_k]*P[B_k]}$$ This is finding the posterior (a posteriori) probabilities [post experiment] from the prior (a priori) probabilities [pre experiment]. From the observed event we can infer the conditional probabilities of unobserved events.
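As a sketch of how the theorem of total probability and Bayes' rule combine, the numbers below are invented: a three-part partition of causes with assumed priors and likelihoods.
<code python>
# Minimal sketch of total probability and Bayes' rule over a partition.
# The priors and likelihoods below are invented for illustration.
priors = {"B1": 0.5, "B2": 0.3, "B3": 0.2}          # P[B_i], must sum to 1
likelihoods = {"B1": 0.9, "B2": 0.5, "B3": 0.1}      # P[A | B_i]

# Theorem of total probability: P[A] = sum_i P[A|B_i] P[B_i]
p_A = sum(likelihoods[b] * priors[b] for b in priors)

# Bayes' rule: posterior P[B_j | A] = P[A|B_j] P[B_j] / P[A]
posteriors = {b: likelihoods[b] * priors[b] / p_A for b in priors}

print(p_A)          # 0.62
print(posteriors)   # posteriors sum to 1
</code>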
Independence
Two events are said to be independent if: $$P[A\cap B]=P[A]*P[B]$$ This means that knowing event A occurs does not alter the probability of B occurring, i.e. $P[B|A]=P[B]$. Mutual independence and disjointness/exclusivity are not the same thing. Mutually independent events may occur simultaneously, unlike mutually disjoint events. Independence is often a natural assumption which can be reasoned about.
Tree diagrams
Tree diagrams can be used to visualise probabilities. The probabilities on the leaves of the tree (intersections) are the product of the probabilities of all branches to the leaf. Each branch has probability equal to its choice given all the previous branches. Tree diagrams can be useful in visualising sequential experiments. To construct the tree we start with one of the sub-experiments and branch out from the root, then we branch from each node with the next sub-experiment, which is repeated for all sub-experiments. At each step, the probability of that branch is equal to the probability of that choice given all the previous choices, and the probability at the node is equal to the probability of the intersection of all previous choices. In most sequential experiments, the number of outcomes at each step does not depend on previous results, so we can use the fundamental principle of counting. The fundamental principle of counting states: “If sub-experiment A has n possible outcomes and sub-experiment B has m possible outcomes, then there are n*m possible experiment outcomes”. If the possible outcomes are equally likely, the probability of an event A occurring comes down to the total number of outcomes and the number of outcomes in A. We need to determine how to count the number of possible outcomes for this to work.
Sampling without replacement results in sequential experiments whose sample spaces depend on the outcomes of previous experiments. Sampling with replacement gives identical, independent trials. In both cases, whether or not the order of the sub-experiment outcomes matters can change the probabilities. If sampling k times from n items with replacement, the sample space has $n^k$ outcomes. If sampling k times from n items without replacement but with ordering, the sample space has $\frac{n!}{(n-k)!}=(n)_k$ outcomes. If sampling k times from n items without replacement and without ordering, the sample space has $\frac{n!}{(n-k)!k!}=\begin{pmatrix}n\\k\end{pmatrix}$ outcomes; this is the binomial coefficient. We can consider the more general case of the multinomial coefficient, where $n=n_0+n_1+...+n_{m-1}$, and the number of different ways a set of n objects can be split up into m subsets equals: $$\begin{pmatrix}n\\n_0,n_1,...,n_{m-1}\end{pmatrix}=\frac{n!}{n_0!n_1!... n_{m-1}!}$$
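A short sketch of these counting formulas using Python's standard library; the specific values of n, k and the multinomial split are arbitrary examples.
<code python>
# Minimal sketch of the counting formulas above using the standard library.
# The values n = 5, k = 3 and the multinomial split are arbitrary examples.
from math import comb, perm, factorial, prod

n, k = 5, 3
with_replacement = n ** k                  # ordered, with replacement: n^k
ordered_no_replacement = perm(n, k)        # n! / (n-k)!  =  (n)_k
unordered_no_replacement = comb(n, k)      # n! / ((n-k)! k!), binomial coefficient

def multinomial(*parts):
    """Number of ways to split sum(parts) objects into groups of the given sizes."""
    return factorial(sum(parts)) // prod(factorial(p) for p in parts)

print(with_replacement, ordered_no_replacement, unordered_no_replacement)  # 125 60 10
print(multinomial(2, 2, 1))   # ways to split 5 objects into groups of 2, 2, 1 -> 30
</code>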
Discrete Random Variables
Random variables can be used to map outcomes to numbers. More formally, a random variable $X$ is a function that assigns a real number $X(\omega)$ to each outcome $\omega$ in $S$. The sample space is the domain of X, and the range $S_X$ is the collection of all possible outputs $X(\omega)$. Two outcomes can map to the same value. Events can be defined in terms of a single value or a set of output values that a random variable can take. $$A=\{\omega\in S:X(\omega)\in\mathcal{I}\}$$
The Cumulative Distribution Function
The Cumulative Distribution Function (CDF) is defined for $x\in\mathbb{R}$ as: $$F_X(x)=P[X\leq x]$$ Thus the CDF is the probability of the event $[X\leq x]$. A CDF is built by working through the regions where $X$ takes values. The function jumps in value where $P[X=x]\neq0$, with a jump of height $P[X=x]$.
The CDF has some properties:
- $F_X(x)\in[0,1]$
- $x_1\geq x_2\implies F_X(x_1)\geq F_X(x_2)$
- $\lim_{x\to\infty}F_X(x)=1$
- $\lim_{x\to-\infty}F_X(x)=0$
- $F_X(x)=\lim_{h\to0}F_X(x+h)=F_X(x^+)$
This means the CDF is a non-decreasing function (property 2) that is right-continuous (property 5).
We can use the CDF to find:
- $P[a<X\leq b]=F_X(b)-F_X(a)$
- $P[X>a]=1-F_X(a)$
- $P[X=a]=F_X(a)-F_X(a^-)$
- $P[X<a]=F_X(a^-)$
$X$ is a discrete random variable if its range contains a finite or countably infinite number of points. Alternatively, $X$ with CDF $F_X(x)$ is discrete if $F_X(x)$ only changes at discrete points and is constant in-between them. In other words, $F_X(x)$ is a staircase function when $X$ is discrete.
Probability Mass Function
The Probability Mass Function (PMF) is defined to be: $$p_X(x)=P[X=x]$$ It has the following properties:
- $0\leq p_X(x_i)\leq 1$ for all $x_i\in S_X$
- $p_X(x_i)=0$ if $x_i\notin S_X$
- $\sum_{\forall x_i\in S_X}p_X(x_i)=1$
- For any event $A\subset S_X,P[A]=\sum_{\forall x_i\in A}p_X(x_i)$
When plotted, vertical lines or arrows topped with dots are usually drawn to represent each value of the function.
We can relate the PMF to the CDF as follows: $$p_X(x)=P[X=x]=F_X(x)-F_X(x^-)$$ $$F_X(x)=\sum_{\forall x_i\leq x}p_X(x_i)$$
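As a sketch of the PMF-to-CDF relationship, the small PMF below is a made-up example; the CDF at any point is the running sum of the PMF over values up to that point.
<code python>
# Minimal sketch: building a discrete CDF from a PMF by accumulating probabilities.
# The PMF below (values 1, 2, 4 with the given probabilities) is a made-up example.
pmf = {1: 0.25, 2: 0.5, 4: 0.25}

def cdf(x, pmf=pmf):
    """F_X(x) = P[X <= x] = sum of p_X(x_i) over all x_i <= x."""
    return sum(p for xi, p in pmf.items() if xi <= x)

print(cdf(0))    # 0     (below the smallest value)
print(cdf(2))    # 0.75  (= 0.25 + 0.5)
print(cdf(3))    # 0.75  (constant between jumps: staircase function)
print(cdf(4))    # 1.0
</code>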
Bernoulli Random Variable
Used for an experiment where an event $A$, usually called a “success”, has occurred or not. Defined by a “success” (1) with probability $p$ and “failure” (0) with probability $1-p$. $$p_X(k)=P[X=k]=p^k(1-p)^{1-k},k=0,1$$ We'll denote a Bernoulli random variable as: $$X\sim\text{Bernoulli}(p)$$
Binomial Random Variable
Used for repeated random, independent and identical experiments made of Bernoulli trials. It counts the number of successes over $n$ trials. Denoted as: $$X\sim\text{Binomial}(n,p)$$ $$P[X=k]=\begin{pmatrix}n\\k\end{pmatrix}p^k(1-p)^{n-k},k=0,1,...,n$$
Geometric Random Variable
Used when interested in knowing when the first success will occur and repeating until then. The waiting time (number of trials) is a geometric random variable. The first success on the kth trial means the first $k-1$ trials are failures followed by a success. We can wait forever to get a success. $$X\sim\text{Geometric}(p)$$ $$P[X=k]=(1-p)^{k-1}p,k=1,2,3,...$$ Called a geometric variable as the probabilities progress as a geometric progression. The geometric random variable has the following interesting property: $$P[X=k+j|X>j]=P[X=k]$$ This is known as the memory-less property (geometric is the only memory-less discrete random variable).
Poisson Random Variable
$N$ is a Poisson Random Variable with parameter $\lambda$ if: $$p_N(k)=P[N=k]=\frac{\lambda^k}{k!}e^{-\lambda},k=0,1,2,...$$ Used often for counting the number of events in a given time period or region of space. Denoted as: $$N\sim\text{Poisson}(\lambda)$$
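A compact sketch of the four PMFs defined above, written directly from their formulas; the parameter values (p = 0.3, n = 10, lam = 2.5) are arbitrary examples.
<code python>
# Minimal sketch of the four discrete PMFs above, written from their formulas.
# The parameter values (p = 0.3, n = 10, lam = 2.5) are arbitrary examples.
from math import comb, exp, factorial

def bernoulli_pmf(k, p):          # k in {0, 1}
    return p**k * (1 - p)**(1 - k)

def binomial_pmf(k, n, p):        # k in {0, 1, ..., n}
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):          # k in {1, 2, 3, ...}
    return (1 - p)**(k - 1) * p

def poisson_pmf(k, lam):          # k in {0, 1, 2, ...}
    return lam**k / factorial(k) * exp(-lam)

p, n, lam = 0.3, 10, 2.5
print(bernoulli_pmf(1, p))                                  # 0.3
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))     # 1.0 (normalisation)
print(sum(geometric_pmf(k, p) for k in range(1, 200)))      # ~1.0
print(sum(poisson_pmf(k, lam) for k in range(50)))          # ~1.0
</code>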
Statistics
A discrete RV is completely described by its PMF or CDF, but we are not always interested in so much information. Statistics are a summary of the RV used for making comparisons. Three different statistics are the mode, median and mean.
The mode is a number $x_{mod}\in S_X$ such that: $$p_X(x_{mod})\geq p_X(x),\forall x$$ That is, the mode is the most probable outcome.
The median is a number $x_{med}$ such that: $$P[X\geq x_{med}]\geq 0.5\text{ and }P[X\leq x_{med}]\geq 0.5$$ Ideally it exactly balances the probability of lying above or below, but this is not always achievable for a discrete RV. It does not need to be in the range $S_X$.
The mean, or average, is the expected value of the distribution. $$\mu_x=E[X]=\sum_{x_i\in S_X}x_iP[X=x_i]$$ This is the average empirical outcome if we were to perform the experiment many times.
The expected value of a Bernoulli RV is: $$E[X]=0*(1-p)+1*p=p$$ For a binomial RV, writing it as a sum of $n$ independent indicators $I_k\sim\text{Bernoulli}(p)$: $$E[X]=E\left[\sum_{k=1}^n I_k\right]=np$$
The expectation operator has the following properties:
- $E[a]=a$
- $E[aX]=aE[X]$
- $E[X+Y]=E[X]+E[Y]$
- $E[aX+bY]=aE[X]+bE[Y]$
Hence expectation is a linear operator.
The expected value of a geometric variable is: $$E[X]=\sum_{k=1}^\infty k[(1-p)^{k-1}p]=\frac{1}{p}$$ The mean may not be in the range $S_X$, i.e. it is possible that $E[X]\notin S_X$. For a Poisson RV: $$E[N]=\sum_{k=0}^{\infty}k\frac{\lambda^k}{k!}e^{-\lambda}=0+\sum_{k=1}^{\infty}k\frac{\lambda^k}{k!}e^{-\lambda}=\lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!}=\lambda e^{-\lambda}e^{\lambda}=\lambda$$
The variance of a RV is: $$\sigma^2_X=\text{Var}(X)=E[(X-E[X])^2]=\sum_{x_i\in S_X}(x_i-\mu_X)^2p_X(x_i)$$ This is an indication of the spread of the values about its mean. Variance can also be calculated as: $$\text{Var}(X)=E[X^2]-(E[X])^2$$ Both of these values can be found separately. The variance has two useful properties:
- $\text{Var}(aX)=a^2\text{Var}(X)$
- If $X$ and $Y$ are independent, $\text{Var}(X+Y)=\text{Var}(X)+\text{Var}(Y)$
The variance of a Bernoulli trial is: $$\text{Var}(X)=p(1-p)$$ For a Binomial: $$\text{Var}(X)=np(1-p)$$ For a Geometric: $$\text{Var}(X)=\frac{1-p}{p^2}$$ For a Poisson: $$\text{Var}(X)=\lambda$$
The Poisson RV can be used to approximate the binomial RV, which can be hard to calculate when $n$ is large. For distributions where $n$ is large and $p$ is small, we can set $\lambda=np$. We do need to note that the Poisson distribution counts over an infinite range, whereas the binomial's range is finite, so the two only coincide in the limit $n\to\infty$.
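A quick numerical sketch of this approximation: with assumed values n = 1000 and p = 0.002, the Binomial(n, p) and Poisson(np) PMFs are compared at a few points.
<code python>
# Minimal sketch comparing Binomial(n, p) with its Poisson(np) approximation.
# The values n = 1000, p = 0.002 (so lambda = np = 2) are assumed for illustration.
from math import comb, exp, factorial

n, p = 1000, 0.002
lam = n * p   # Poisson parameter lambda = np

def binomial_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return lam**k / factorial(k) * exp(-lam)

for k in range(6):
    print(k, round(binomial_pmf(k), 5), round(poisson_pmf(k), 5))
# The two columns agree closely because n is large and p is small.
</code>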
Continuous Random Variables
If the results of an experiment are uncountable, forming a continuum, we call the experiment continuous. Analogue measurements, for example, are continuous. Here, the probability that the outcome is an exact number is zero, so we assess probabilities on a range. A continuous random variable has a CDF that is continuous everywhere and has a derivative (which may fail to exist at a finite number of points). The equivalent of the PMF for discrete random variables is the probability density function (PDF). The CDF can be written as: $$F_X(x)=\int_{-\infty}^xf_X(u)du$$ Hence the PDF can be written as: $$f_X(x)=\frac{dF_X(x)}{dx},x\in\mathbb{R}$$ We can say we have an element of probability: $$f_X(x)dx=P[x<X\leq x+dx]$$ The PDF has the following properties:
- $f_X(x)\geq0$
- $\int_{-\infty}^\infty f_X(x)dx=1$
- $F_X(a)=P[X\leq a]=\int_{-\infty}^af_X(u)du$
- $P[a<X\leq b]=\int_a^bf_X(u)du$
The probability on an open or closed interval is the same, as the probability of the bound is 0. The mean and variance are written as: $$\mu_X=E[X]=\int_{-\infty}^\infty xf_X(x)dx$$ $$\sigma^2=\text{Var}[X]=\int_{-\infty}^\infty(x-\mu_X)^2f_X(x)dx$$
Uniform Random Variables
A uniform random variable is one where all outcomes on a range have equal probability: $$f_X(x)=\begin{cases}\frac{1}{b-a},&x\in(a,b)\\0,&\text{else}\end{cases}$$ We can find the CDF to be the following: $$F_X(x)=\begin{cases}\frac{x-a}{b-a},&a\leq x\leq b\\0,&x<a\\1,&x>b\end{cases}$$ The mean and variance are: $$\mu_X=\frac{b+a}{2}$$ $$\text{Var}(X)=\frac{(b-a)^2}{12}$$
Exponential Random Variable
The exponential random variable has the following PDF: $$f_X(x)=\begin{cases}\lambda e^{-\lambda x},&x\geq 0\\0,&\text{else}\end{cases}$$ Its CDF is: $$F_X(x)=\begin{cases}0,&x<0\\1-e^{-\lambda x},&x\geq 0\end{cases}$$ Its mean and variance are: $$\mu_X=\frac{1}{\lambda}$$ $$\sigma^2_X=\frac{1}{\lambda^2}$$ Used in studies of waiting and service times. $\lambda$ is in units of number per unit time, so the expected value is the average waiting time for one occurrence. The exponential RV is the continuous equivalent of the geometric RV. It also has the same memory-less property. The CDF of an exponential RV can be constructed by modelling the number of arrivals in the first $t$ units of time with a Poisson RV $N_t$ whose parameter is the expected number of arrivals in that time, $\lambda t$. The CDF is then one minus the probability that no arrivals happen in the first $t$ units of time. $$F_X(t)=P[X\leq t]=1-P[X>t]=1-P[N_t=0]=1-e^{-\lambda t}$$ The Poisson RV models the number of arrivals whereas the exponential RV models the inter-arrival time. The memory-less property means that when conditioned on a past event, time is restarted and a new time 0 is established.
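A simulation sketch of the memory-less property, using the standard library's exponential sampler; the rate value and the conditioning time are assumed for illustration.
<code python>
# Minimal sketch: checking the memory-less property of the exponential RV by simulation.
# The rate lam = 0.5 and the conditioning time j = 2.0 are assumed for illustration.
import random
from math import exp

random.seed(0)
lam, j, k = 0.5, 2.0, 3.0
samples = [random.expovariate(lam) for _ in range(200_000)]

# P[X > j + k | X > j] should equal P[X > k] = e^{-lam * k}
survived_j = [x for x in samples if x > j]
cond_prob = sum(x > j + k for x in survived_j) / len(survived_j)
uncond_prob = sum(x > k for x in samples) / len(samples)

print(round(cond_prob, 3), round(uncond_prob, 3), round(exp(-lam * k), 3))
# All three values should be close to e^{-1.5} ~ 0.223.
</code>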
Gaussian (Normal) Random Variable
Many phenomena have a bell-shaped empirical distribution. This is often modelled with a Gaussian or normal PDF with mean $\mu$ and variance $\sigma^2$: $$f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)$$ $$X\sim\mathcal{N}(\mu,\sigma^2)$$ As the variance increases, the curve becomes shallower and wider. A closed form solution doesn't exist for the CDF, so numerical tables exist to find values of the standard ($\mu=0,\sigma=1$) CDF $\Phi$. For nonstandard normal RVs, we can standardise them as $F_X(x)=\Phi\left(\frac{x-\mu}{\sigma}\right)$. There is also the Q function, defined as the complement of $\Phi$. We can often model noise by adding a Gaussian with zero mean to a signal. For a standard normal $X$, the Q function is reflective: $P[X\leq y]=1-P[X>y]=1-Q(y)=Q(-y)$. For probabilities involving the absolute value: $P[|X|\geq y]=P[X\leq -y]+P[X\geq y]=2P[X\geq y]=2Q(y)$. Likewise: $P[|X|\leq y]=1-P[|X|\geq y]=1-2Q(y)$.
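A short sketch of the standard normal CDF and Q function written with the standard library's complementary error function; the evaluation points and the nonstandard parameters are arbitrary examples.
<code python>
# Minimal sketch of the standard normal CDF (Phi) and Q function via math.erfc.
# The evaluation points and the parameters mu, sigma below are arbitrary examples.
from math import erfc, sqrt

def Phi(x):
    """Standard normal CDF: P[X <= x] for X ~ N(0, 1)."""
    return 0.5 * erfc(-x / sqrt(2))

def Q(x):
    """Q function: P[X > x] = 1 - Phi(x)."""
    return 0.5 * erfc(x / sqrt(2))

# For a nonstandard X ~ N(mu, sigma^2), F_X(x) = Phi((x - mu) / sigma).
mu, sigma = 3.0, 2.0
x = 5.0
print(Phi((x - mu) / sigma))       # P[X <= 5] ~ 0.841
print(Q(1.0), Q(-1.0))             # Q(1) ~ 0.159, Q(-1) = 1 - Q(1) ~ 0.841
print(2 * Q(2.0))                  # P[|X| >= 2] for a standard normal ~ 0.0455
</code>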
Mixed random variables
These have CDFs made of continuous increasing segments together with jumps. As the CDF isn't purely discrete, a PMF isn't possible. We can treat a discrete RV's PMF as a PDF by representing the probability points as Dirac delta functions. A delta function has infinite height and infinitesimal width, such that the area under it is 1. We can express a PMF as a PDF by: $$f_X(x)=\sum_{x_k\in S_X}p_X(x_k)\delta(x-x_k)$$ We can go from a CDF to a PDF with a generalised derivative, by defining the derivative of a jump to be a Dirac delta of area equal to the jump. If a delta exists at a point, then that exact value has non-zero probability, equal to the delta's area.
Functions of a random variable
If we consider $Y=g(X)$, how can we express the PMF/CDF/PDF of $Y$ in terms of that of $X$? $Y$ inherits its randomness from $X$.
Discrete Random Variables
$Y$ will also be a discrete RV, taking on the value of $g(x_k),\forall x_k\in S_X$. As such, the PMF of $Y$ can be determined by: $$p_Y(k)=P[Y=k]=\sum_ip_X(x_i)$$ for all $i$ such that $k=g(x_i)$.
Continuous Random Variables
$$F_Y(y)=P[Y\leq y]=P[g(X)\leq y]=P[g(X)\in(-\infty,y]]=P[X\in g^{-1}((-\infty,y])]=\int_{g^{-1}((-\infty,y])}f_X(x)dx$$
Generate a RV with a Prescribed CDF
We want to generate $X$ with CDF $F_X(x)$, so we start with a standard uniform RV $U\sim U[0,1]$. We then set $X=g(U)=F_X^{-1}(U)$.
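A sketch of this inverse-CDF method for the exponential distribution, whose CDF inverts in closed form; the rate value is assumed for illustration.
<code python>
# Minimal sketch of generating an RV with a prescribed CDF via X = F_X^{-1}(U).
# Example target: Exponential(lam) with F_X(x) = 1 - e^{-lam x}, so
# F_X^{-1}(u) = -ln(1 - u) / lam. The rate lam = 2.0 is assumed for illustration.
import random
from math import log

random.seed(0)
lam = 2.0

def inverse_cdf(u):
    return -log(1.0 - u) / lam

samples = [inverse_cdf(random.random()) for _ in range(100_000)]

# Sample mean and variance should be close to 1/lam = 0.5 and 1/lam^2 = 0.25.
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(round(mean, 3), round(var, 3))
</code>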
Expectation of a function of a RV
$$E[Y]=E[g(X)]=\int_{-\infty}^\infty g(x)f_X(x)dx$$ $$E[Y]=E[g(X)]=\sum_{\forall x_k\in S_X}g(x_k)p_X(x_k)$$ For the continuous and discrete cases respectively. The variance is itself the expectation of a function of a random variable, with $g(X)=(X-\mu_X)^2$.
Joint random variables
It can be useful to visualise joint variables as a random vector. We can visualise a pair of random variables as two functions mapping an outcome to two different number lines. The joint mapping takes the outcome to a point on the plane. This point can contain more information than either mapping function alone. As with the single RV case, we don't know the outcome beforehand, so we want to look at the joint probabilities of points and regions on the plane. $$P[(X\leq x)\cap(Y\leq y)]=P[X\leq x,Y\leq y]$$ $$P[(X=x)\cap(Y=y)]=P[X=x,Y=y]$$ If both functions are discrete RVs, the joint probability mass function is: $$p_{X,Y}(x,y)=P[X=x,Y=y]$$ The range of the joint PMF is $S_{X,Y}$. The sum of the probabilities over the entire sample space is 1. The joint probabilities have the following properties:
- $0\leq p_{X,Y}(x,y)\leq 1,\forall(x,y)\in\mathbb{R}^2$
- $p_{X,Y}(x_i,y_j)=0$ if $(x_i,y_j)\notin S_{X,Y}$
- $\sum_{x\in S_X}\sum_{y\in S_Y}p_{X,Y}(x,y)=1$
Joint RVs are said to be independent if the probability of the point is equal to the product of the individual probabilities: $$p_{X,Y}(x,y)=P[X=x,Y=y]=p_X(x)p_Y(y)$$ Whether or not the variables are independent, we can recover the individual (marginal) probabilities by summing over the other variable, e.g. $$P[X=1]=\sum_{i}P[X=1,Y=i]$$
With 2 RVs, we need 3D plots to express PMFs, CDFs and PDFs, which can be difficult. We use level curves with labels to represent them in a 2D plane. Joint PMFs are easy to show in 2D, joint PDFs need level curves making them harder, and joint CDFs are hard to plot and usually not meaningful.
Given a joint RV, we can consider each random variable individually by computing the marginal PMFs of X and Y. $$p_X(x)=P[X=x]=P[X=x,Y=\text{anything}]=\sum_{\forall y_k\in S_Y}P[X=x,Y=y_k]$$ This turns the 2D function into a 1D function. Given a joint PMF, we can always compute the marginal PMFs.
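A small sketch of marginalising a joint PMF stored as a dictionary; the joint probabilities are a made-up example.
<code python>
# Minimal sketch: computing marginal PMFs from a joint PMF.
# The joint PMF below (over pairs (x, y)) is a made-up example that sums to 1.
from collections import defaultdict

joint_pmf = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.25, (1, 1): 0.15,
    (2, 0): 0.05, (2, 1): 0.25,
}

marginal_x = defaultdict(float)
marginal_y = defaultdict(float)
for (x, y), p in joint_pmf.items():
    marginal_x[x] += p     # p_X(x) = sum over y of p_{X,Y}(x, y)
    marginal_y[y] += p     # p_Y(y) = sum over x of p_{X,Y}(x, y)

print(dict(marginal_x))   # {0: 0.3, 1: 0.4, 2: 0.3}
print(dict(marginal_y))   # {0: 0.4, 1: 0.6}
</code>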
Joint CDF
The joint CDF is: $$F_{X,Y}(x,y)=P[X\leq x,Y\leq y]$$ This is the probability of the semi-infinite rectangular region below and to the left of $(x,y)$. If the random variables are independent, the joint CDF is the product of the variables' CDFs: $$F_{X,Y}(x,y)=F_X(x)F_Y(y)$$ We can also get marginal CDFs: $$F_X(x)=P[X\leq x]=P[X\leq x,Y\leq\infty]=F_{X,Y}(x,\infty)$$ If both X and Y are discrete, there will be jumps, forming a 3D staircase function. If both X and Y are continuous, the joint CDF will be a continuous surface everywhere. We can then define a joint PDF which gives the concentration (density) of probability per unit area, used with double integrals. $$f_{X,Y}(x,y)=\frac{\partial^2F_{X,Y}(x,y)}{\partial x\partial y}$$ $$F_{X,Y}(x,y)=\int_{-\infty}^x\int_{-\infty}^yf_{X,Y}(u,v)dvdu$$ The joint PDF is the product of the marginal PDFs when independent. The marginal PDF can be computed from the joint PDF: $$f_X(x)=\frac{dF_{X}(x)}{dx}=\frac{dF_{X,Y}(x,\infty)}{dx}=\frac{d}{dx}\int_{-\infty}^x\int_{-\infty}^\infty f_{X,Y}(u,v)dvdu=\int_{-\infty}^\infty f_{X,Y}(x,y)dy$$
The joint CDF has the following properties:
- $0\leq F_{X,Y}(x,y)\leq 1$
- $\lim_{x,y\to\infty}F_{X,Y}(x,y)=F_{X,Y}(\infty,\infty)=1$
- If $x_1\leq x_2$ and $y_1\leq y_2$, then $F_{X,Y}(x_1,y_1)\leq F_{X,Y}(x_2,y_1)\leq F_{X,Y}(x_2,y_2)$
As a consequence, we get: $$P[x_1<X\leq x_2,y_1<Y\leq y_2]=F_{X,Y}(x_2,y_2)-F_{X,Y}(x_1,y_2)-F_{X,Y}(x_2,y_1)+F_{X,Y}(x_1,y_1)$$
- $\lim_{x\to-\infty}F_{X,Y}(x,y)=F_{X,Y}(-\infty,y)=0$
- $\lim_{x\to a^+}F_{X,Y}(x,y)=F_{X,Y}(a^+,y)=F_{X,Y}(a,y)$
The last two mean the CDF is top and right continuous. If both X and Y are continuous RVs, then the CDF is continuous from all directions.
Joint PDF
The joint PDF has the following properties:
- $f_{X,Y}(x,y)\geq 0$
- $\int_{-\infty}^\infty\int_{-\infty}^\infty f_{X,Y}(x,y)dxdy=1$
- $P[(X,Y)\in A]=\iint_{(x,y)\in A}f_{X,Y}(x,y)dxdy$
- $P[a<X\leq b,c<Y\leq d]=\int_c^d\int_a^bf_{X,Y}(x,y)dxdy$
- Continuous at all values except a countable set
Two or more joint random variables are mutually independent iff their joint PMF/CDF/PDF is a product of their marginals (they are separable). $$p_{X,Y}(x,y)=p_X(x)p_Y(y)$$ $$F_{X,Y}(x,y)=F_X(x)F_Y(y)$$ $$f_{X,Y}(x,y)=f_X(x)f_Y(y)$$
Conditional Probability
For a given event $B$ in the sample space, we can express the conditional CDF of $X$ given $B$ as: $$F_{X|B}(x)=P[X\leq x|B]=\frac{P[(X\leq x)\cap B]}{P[B]}$$ For a discrete RV with PMF $p_X(x)$, the conditional PMF is: $$p_{X|B}(x)=\begin{cases}\frac{p_X(x)}{P[B]},&x\in B\\0,&x\notin B\end{cases}$$ We can note that the theorem of total probability applies as well: $$F_X(x)=F_{X|B}(x)P[B]+F_{X|B^C}(x)P[B^C]$$ $$p_X(x)=p_{X|B}(x)P[B]+p_{X|B^C}(x)P[B^C]$$ For a continuous RV, the conditional PDF is: $$f_{X|B}(x)=\begin{cases}\frac{f_X(x)}{P[B]},&x\in B\\0,&x\notin B\end{cases}$$
The conditional PMF of Y given X is: $$p_{Y|X}(y|x)=\frac{P[Y=y,X=x]}{P[X=x]}=\frac{p_{X,Y}(x,y)}{p_{X}(x)}$$ We can do the same thing for the CDF: $$F_{Y|X}(y|x)=\frac{P[Y\leq y,X=x]}{P[X=x]}=\frac{\sum_{y_j\leq y}p_{X,Y}(x,y_j)}{p_{X}(x)}$$ We can also specify conditional probabilities where we condition on a set: $$F_{Y|X}(y|A)=\frac{P[Y\leq y,X\in A]}{P[X\in A]}$$ For a continuous RV this definition of the conditional CDF doesn't make sense, as $P[X=x]=0$. Instead we define the conditional CDF with a limiting argument, given $f_X(x)$ is continuous at $x$. $$F_{Y|X}(y|x)=\frac{\int_{-\infty}^yf_{X,Y}(x,\bar{y})d\bar{y}}{f_{X}(x)}$$ This lets us define the conditional PDF. $$f_{Y|X}(y|x)=\frac{d}{dy}F_{Y|X}(y|x)=\frac{f_{X,Y}(x,y)}{f_X(x)},f_X(x)\neq 0$$
Bayes' recap
For an experiment, there is an unknown cause and observed evidence. Bayes' rule seeks to find the probability of the cause given the evidence. $$P[\text{Cause}|\text{Evidence}]=\frac{P[\text{Evidence}|\text{Cause}]P[\text{Cause}]}{P[\text{Evidence}]}$$ The diagnostic (posterior) probability is what is being found. The likelihood is the probability of the evidence given the cause. The prior probability is the probability of the cause before the evidence is observed. The probability of the evidence is the marginal probability.
Conditional expectation
$$E[X|B]=\int_{-\infty}^\infty xf_{X|B}(x|B)dx$$ $$E[X|B]=\sum_{x_k\in S_X}x_kp_{X|B}(x_k|B)$$ The first case is continuous and the second is discrete. Both of these result in a constant, just like an ordinary expectation. We can exchange B for another random variable: $$E[Y|X=x]=\int_{-\infty}^\infty yf_{Y|X}(y|x)dy$$ $$E[Y|X=x]=\sum_{y_j\in S_Y}y_jp_{Y|X}(y_j|x)$$ Here conditioning effectively changes the sample space; the expectation is still a constant for each fixed $x$, albeit a function of $x$. If the value of $X$ is not specified, the conditional expectation is itself a random variable. $$g(X)=E[Y|X]$$ For each outcome, there is a corresponding value of the conditional expectation $E[Y|X(\omega)]$, meaning we can define a mapping $S\to\mathbb{R}$. We can find its expected value like that of any random variable: $$E[E[Y|X]]=\int_{-\infty}^\infty E[Y|X=x]f_X(x)dx=\int_{-\infty}^\infty\int_{-\infty}^\infty yf_{Y|X}(y|x)dy f_X(x)dx=\int_{-\infty}^\infty y\int_{-\infty}^\infty f_{X,Y}(x,y)dxdy=\int_{-\infty}^\infty yf_Y(y)dy=E[Y]$$ This is known as the law of total expectation. The effect of conditioning on $X$ is removed by averaging (taking the expectation) over $X$.
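A small numerical sketch of the law of total expectation using a made-up discrete joint PMF: E[Y] is computed directly and as the average of E[Y|X=x] over the marginal of X.
<code python>
# Minimal sketch of the law of total expectation E[E[Y|X]] = E[Y].
# The joint PMF over pairs (x, y) is a made-up example.
joint_pmf = {
    (0, 1): 0.10, (0, 2): 0.30,
    (1, 1): 0.25, (1, 2): 0.35,
}

# Marginal PMF of X and direct expectation of Y.
p_X = {}
E_Y = 0.0
for (x, y), p in joint_pmf.items():
    p_X[x] = p_X.get(x, 0.0) + p
    E_Y += y * p

# Conditional expectation E[Y | X = x] for each x, then average over X.
def cond_expectation(x):
    return sum(y * p for (xi, y), p in joint_pmf.items() if xi == x) / p_X[x]

E_of_cond = sum(cond_expectation(x) * px for x, px in p_X.items())

print(round(E_Y, 4), round(E_of_cond, 4))   # both equal 1.65
</code>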
Correlation and Covariance
The correlation of two random variables is the expected value of their product. $$E[XY]=\iint_{-\infty}^\infty xyf_{X,Y}(x,y)dxdy$$ The covariance is a related measure: $$\text{cov}(X,Y)=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y]=E[XY]-\mu_X\mu_Y$$ If $X$ and $Y$ are independent, then the correlation is the product of their means ($E[XY]=E[X]E[Y]=\mu_X\mu_Y$), and hence their covariance is zero. If either $X$ or $Y$ has zero mean, then $\text{cov}(X,Y)=E[XY]$. The covariance of a variable with itself is its variance, $\text{cov}(X,X)=E[X^2]-E[X]^2=\text{var}(X)$. If $Z=X+Y$, then $\text{var}(Z)=E[(X+Y)^2]-E[X+Y]^2=\text{var}(X)+\text{var}(Y)+2\text{cov}(X,Y)$.
$X$ and $Y$ are uncorrelated if the correlation is equal to the product of their means, i.e. the covariance is zero ($E[XY]=E[X]E[Y]\implies\text{cov}(X,Y)=0$). Likewise if the covariance is non-zero, $X$ and $Y$ are correlated. If $E[XY]=0$, we say $X$ and $Y$ are orthogonal. Independence implies uncorrelatedness but not the converse. The converse is guaranteed true when $X$ and $Y$ are jointly Gaussian. We define the correlation coefficient as follows: $$\rho_{X,Y}=\frac{E[XY]-\mu_X\mu_Y}{\sigma_X\sigma_Y}=\frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y}$$
Joint Gaussian
Random variables $X$ and $Y$ are defined as jointly Gaussian if their joint density can be written as: $$f_{X,Y}(x,y)=\frac{\exp\left\{\frac{-1}{2(1-\rho_{X,Y}^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2-2\rho_{X,Y}\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right)+\left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right]\right\}}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho_{X,Y}^2}}$$ Their marginal densities are also Gaussian, with means $\mu_X,\mu_Y$ and variances $\sigma_X^2,\sigma_Y^2$. If the correlation coefficient $\rho_{X,Y}$ is zero, the joint density can be written as the product of the marginal densities, so $X$ and $Y$ are independent and uncorrelated. It is important to note that two (marginally) Gaussian RVs need not be jointly Gaussian.
Sum of 2 random variables
Given $X$ and $Y$ with joint PDF $f_{X,Y}(x,y)$, we want the PDF of $Z=X+Y$. We do this by first finding the CDF and then the PDF from the derivative of the CDF. $$F_Z(z)=P[X+Y\leq z]=\int_{y=-\infty}^{y=\infty}\int_{x=-\infty}^{x=z-y}f_{X,Y}(x,y)dxdy$$ $$f_Z(z)=\frac{d}{dz}F_Z(z)=\int_{-\infty}^\infty\left[\frac{d}{dz}\int_{-\infty}^{z-y}f_{X,Y}(x,y)dx\right]dy=\int_{-\infty}^\infty f_{X,Y}(z-y,y)dy$$
Given $X$ and $Y$ with a joint PMF $p_{X,Y}(x,y)$, we want the PMF of $Z=X+Y$. We consider all points on the boundary, and knowing the events are disjoint: $$P[Z=z]=P\left[\bigcup_{\forall x\in S_X}(X=x)\cap(Y=z-x)\right]=\sum_{\forall x\in S_X}P[(X=x)\cap(Y=z-x)]=\sum_{\forall x\in S_X}p_{X,Y}(x,z-x)$$
Where $X$ and $Y$ are independent RVs, we can separate the joint PDF into the product of marginal PDFs: $$f_Z(z)=\int_{-\infty}^\infty f_{X,Y}(x,z-x)dx=\int_{-\infty}^\infty f_X(x)f_Y(z-x)dx=(f_X\star f_Y)(z)$$ The PDF of the sum is the convolution of the individual PDFs. Likewise for discrete RVs, the PMF of the sum is the convolution of the individual PMFs.
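A discrete sketch of this convolution: the PMF of the sum of two independent fair dice, computed from $p_Z(z)=\sum_x p_X(x)p_Y(z-x)$; the dice example is an assumed illustration.
<code python>
# Minimal sketch: PMF of Z = X + Y for independent discrete RVs via convolution.
# The example uses two fair six-sided dice, an assumed illustration.
from fractions import Fraction

p_X = {k: Fraction(1, 6) for k in range(1, 7)}
p_Y = {k: Fraction(1, 6) for k in range(1, 7)}

p_Z = {}
for x, px in p_X.items():
    for y, py in p_Y.items():
        p_Z[x + y] = p_Z.get(x + y, Fraction(0)) + px * py   # p_X(x) * p_Y(z - x)

for z in sorted(p_Z):
    print(z, p_Z[z])        # triangular shape: 1/36, 2/36, ..., 6/36, ..., 1/36
print(sum(p_Z.values()))    # 1
</code>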
Random Vectors and transformations
Matrix and vector preliminaries
- $\mathbb{R}^n$ is the set of column vectors with $n$ real components
- $\mathbb{R}^{m\times n}$ is the set of real matrices with m rows and n columns
- The transpose of the matrix ($\mathbf{A}^T$) is the swapping of the entries $ij$ with $ji$
- If $\mathbf{A}$ is a column vector, then $\mathbf{A}^T$ is a row vector, and vice-versa
- $\mathbf{X,Y}\in\mathbb{R}^n\implies\mathbf{X}^T\mathbf{Y}\in\mathbb{R}$
- $\mathbf{X}\in\mathbb{R}^n,\mathbf{Y}\in\mathbb{R}^m\implies\mathbf{X}\mathbf{Y}^T\in\mathbb{R}^{n\times m}$
- For a square matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$, $\text{trace}(\mathbf{A})=\sum_{i=1}^na_{ii}$
- $\mathbf{X,Y}\in\mathbb{R}^n\implies\mathbf{X}^T\mathbf{Y}=\text{trace}(\mathbf{XY}^T)$
Random vectors
A random vector is a convenient way to represent a set of random variables: $$\mathbf{X}=\begin{bmatrix}X_1\\X_2\\\vdots\\X_n\end{bmatrix},\mathbf{X}\in\mathbb{R}^n$$ The CDF of a random vector is: $$F_{\mathbf{X}}(\mathbf{x})=P[X_1\leq x_1,X_2\leq x_2,...,X_n\leq x_n]$$ The PMF of a random vector is: $$p_{\mathbf{X}}(x)=P[X_1=x_1,X_2=x_2,...,X_n=x_n]$$ The PDF of a random vector is: $$f_\mathbf{X}(\mathbf{x})=\frac{\partial^nF_\mathbf{X}(\mathbf{x})}{\partial x_1\partial x_2...\partial x_n}$$ The mean of a random vector is: $$E[\mathbf{X}]=\begin{bmatrix}E[X_1]\\E[X_2]\\\vdots\\E[X_n]\end{bmatrix}$$ The variance in 1-D measures the average squared distance from the mean, so an analogous definition for random vectors is: $$\text{var}(\mathbf{X})=E[||\mathbf{X}-\mu_\mathbf{X}||^2]=E\left[\sum_{i=1}^n|X_i-\mu_i|^2\right]=\sum_{i=1}^n\text{var}(X_i)=E[(\mathbf{X}-\mu_\mathbf{X})^T(\mathbf{X}-\mu_\mathbf{X})]$$
The (auto)correlation matrix is used to measure the correlation between elements of random vectors: $$\mathbf{R_X}=E[\mathbf{XX}^T]$$ $${R_X}(i,j)=E[X_iX_j]$$ The autocorrelation matrix is symmetric. The diagonals are $R_X(i,i)=E[X_i^2]$.
The (auto)covariance matrix is: $$\mathbf{C_X}=E[(\mathbf{X}-\mu_\mathbf{X})(\mathbf{X}-\mu_\mathbf{X})^T]=\mathbf{R_X}-\mu_\mathbf{X}\mu_\mathbf{X}^T$$ $${C_X}(i,j)=\text{cov}(X_i,X_j)$$ The covariance matrix is a symmetric collection of the covariances. The diagonals are $C_{X}(i,i)=\text{var}(X_i)$. The variance of a random vector is $\text{var}(\mathbf{X})=\text{trace}(\mathbf{C_X})$.
If all the components of $\mathbf{X}$ are mutually independent, the components are uncorrelated: the off-diagonal entries satisfy $R_X(i,j)=\mu_i\mu_j$ and the covariance matrix is a diagonal matrix.
When given two random vectors, we can find the cross-correlation of $\mathbf{X}$ and $\mathbf{Y}$: $$\mathbf{R_{XY}}=E[\mathbf{XY}^T]$$ This is an $n\times m$ matrix that is not necessarily symmetric. Note that $\mathbf{R_{XY}}=\mathbf{R_{YX}}^T$. Likewise we can find the cross-covariance: $$\mathbf{C_{XY}}=\mathbf{R_{XY}}-\mu_\mathbf{X}\mu_\mathbf{Y}^T$$ Again, this is not symmetric in general and $\mathbf{C_{XY}}=\mathbf{C_{YX}}^T$.
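A short numpy sketch of the correlation and covariance matrices estimated from samples of a random vector; the data are randomly generated for illustration, so the estimates only approximate the definitions above.
<code python>
# Minimal sketch: sample estimates of the correlation and covariance matrices.
# The data are randomly generated (an assumed example); with N samples x_1..x_N
# stored as columns of X, R_X ~ (1/N) X X^T and C_X ~ R_X - mu mu^T.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
# A made-up 3-dimensional random vector with correlated components.
Z = rng.standard_normal((3, N))
X = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.3, 1.0]]) @ Z + np.array([[1.0], [2.0], [0.0]])

mu = X.mean(axis=1, keepdims=True)              # E[X], a 3x1 column vector
R = (X @ X.T) / N                               # autocorrelation matrix E[X X^T]
C = R - mu @ mu.T                               # covariance matrix R_X - mu mu^T

print(np.round(C, 2))                           # symmetric, diagonals are variances
print(round(np.trace(C), 2))                    # var(X) = trace(C_X)
</code>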