======= ELEN90054 - Probability and Random Models =======

{{tag>notes unimelb math electrical_engineering}}

====== Introduction ======

===== Set theory =====

A set is a collection of elements, such as the possible outcomes of an experiment. \(S\) is the universal set, also known as the sample space. \(\varnothing\) is the empty set. \(A\cup B\) is the union of two sets, and \(A\cap B\) is their intersection. \(A-B\) is the set subtraction (also \(A\backslash B\)). \(A^C\) is the complement of \(A\) (also \(S-A\)). We can say that \(x\) is an element of \(A\) by \(x\in A\). We can also say \(A\) is a subset of \(B\) as \(A\subset B\). Two sets are mutually exclusive (disjoint) if \(A\cap B=\varnothing\), meaning they have no elements in common. Union and intersection are commutative, associative and distributive.

De Morgan's Laws:
\[(A\cup B)^C=A^C\cap B^C\]
\[(A\cap B)^C=A^C\cup B^C\]

A partition of a set is a grouping of its elements into non-empty, mutually exclusive subsets.

Every time an experiment is run is known as a trial, and the result is the outcome. A random experiment is one where the outcome is nondeterministic, and we use probability to model the outcome. Changing what is observed can result in a different sample space. An event is a collection of outcomes, and hence events are sets and subsets of \(S\). An event is said to have occurred if the outcome is in the event. If the outcome is not in the event, we say that the event's complement has occurred. Events that are not mutually exclusive can occur simultaneously. An event containing a single outcome is called an elementary event. \(S\) is a subset of itself, and hence the event \(S\) always occurs. The empty set is an event which never occurs and is included as a convention.

===== Probability =====

==== Models ====

A probabilistic model is a mathematical description of an unknown situation. It contains:
  - The sample space, containing all possible outcomes
  - A probability law that assigns probabilities to all possible events

The probability of an event is the chance or likelihood of the event occurring when an experiment is performed. We write this set-function (measure) as \(P[A]\). \(P[\cdot]\) is assumed to satisfy the three axioms of probability:
  * **Non-negativity:** \(P[A]\geq 0\) for every event \(A\)
  * **Additivity:** If \(A\cap B=\varnothing\), then \(P[A\cup B]=P[A]+P[B]\)
  * **Normalisation:** \(P[S]=1\)

From these we can derive:
  * \(P[A^C]=1-P[A]\)
  * \(P[\varnothing]=0\)
  * \(\forall A, 0\leq P[A]\leq 1\)
  * \(P[A\cup B]=P[A]+P[B]-P[A\cap B]\)
  * \(A\subset B\implies P[A]\leq P[B]\)

For discrete (countable) sample spaces, the probability of an event is:
\[P[A]=P[\{s_i\in A\}]=P\left[\bigcup_{i:s_i\in A}\{s_i\}\right]=\sum_{i:s_i\in A}P[s_i]\]

For continuous (uncountable) sample spaces, we must use probabilities of events rather than outcomes, as the probability of an individual outcome is zero.

==== Conditional Probability ====

If an experiment is performed and the exact outcome is known, then there is no more randomness. The outcome says whether an event has occurred or not. If we instead know whether an event has occurred, but not the outcome, conditional probability lets us find the chance that another event has also occurred. We write the conditional probability of \(B\) given \(A\) as \(P[B|A]\), whereas \(P[B]\) is the original (unconditioned) probability. We can find this probability by normalising the probability of the intersection by the probability of \(A\).
\[P[B|A]=\frac{P[A\cap B]}{P[A]}\]

We can use this to find:
\[P[A\cap B]=P[A|B]\cdot P[B]=P[B|A]\cdot P[A]\]
Which gives us Bayes' rule:
\[P[B|A]=\frac{P[A|B]\cdot P[B]}{P[A]}\]

For a sample space partitioned into \(B_1,...,B_n\), we can find the probability of an event as:
\[P[A]=P[(A\cap B_1)\cup(A\cap B_2)\cup...\cup(A\cap B_n)]=\sum_{i=1}^n P[A\cap B_i]\]
Which gives us the theorem of total probability:
\[P[A]=P[A|B_1]\cdot P[B_1]+P[A|B_2]\cdot P[B_2]+...+P[A|B_n]\cdot P[B_n]=\sum_{i=1}^n P[A|B_i]\cdot P[B_i]\]
We can also extend Bayes' rule with partitions to:
\[P[B_j|A]=\frac{P[A|B_j]\cdot P[B_j]}{\sum_{k=1}^n P[A|B_k]\cdot P[B_k]}\]
This is finding the posterior (a posteriori) probabilities [post experiment] from the prior (a priori) probabilities [pre experiment]. From the observed event we can infer the conditional probabilities of unobserved events.

==== Independence ====

Two events are said to be independent if:
\[P[A\cap B]=P[A]\cdot P[B]\]
This means that knowing event A occurs does not alter the probability of B occurring, i.e. \(P[B|A]=P[B]\). Independence and disjointness/exclusivity are not the same thing. Independent events may occur simultaneously, unlike mutually disjoint events. Independence is often a natural assumption which can be reasoned about.

==== Tree diagrams ====

Tree diagrams can be used to visualise probabilities. The probabilities on the leaves of the tree (intersections) are the product of the probabilities of all branches to the leaf. Each branch has probability equal to its choice given all the previous branches. Tree diagrams can be useful in visualising sequential experiments. To construct the tree we start with one of the sub-experiments and branch out from the root, then we branch from each node with the next sub-experiment, which is repeated for all sub-experiments. At each step, the probability of a branch is equal to the probability of that choice given all the previous choices, and the probability at a node is equal to the probability of the intersection of all previous choices.

In most sequential experiments, the number of outcomes at each step does not depend on previous results, so we can use the fundamental principle of counting. The fundamental principle of counting states: "If sub-experiment A has \(n\) possible outcomes and sub-experiment B has \(m\) possible outcomes, then there are \(n\times m\) possible experiment outcomes". If the possible outcomes are equally likely, the probability of an event A occurring comes down to the total number of outcomes and the number of outcomes in A. We need to determine how to count the number of possible outcomes for this to work.

Sampling without replacement results in sequential experiments' sample spaces depending on the outcome of previous experiments. Sampling with replacement causes all experiments to be identical independent trials. In both cases, whether or not the order of the sub-experiment outcomes matters can change the probabilities. If sampling k times from n items with replacement, the sample space has \(n^k\) outcomes. If sampling k times from n items without replacement but with ordering, the sample space has \(\frac{n!}{(n-k)!}=(n)_k\) outcomes. If sampling k times from n items without replacement and without ordering, the sample space has \(\frac{n!}{(n-k)!k!}=\begin{pmatrix}n\\k\end{pmatrix}\) outcomes; this is the binomial coefficient. We can consider the more general case of the multinomial coefficient, where \(n=n_0+n_1+...+n_{m-1}\), and the number of different ways a set of n objects can be split up into m subsets equals:
\[\begin{pmatrix}n\\n_0,n_1,...,n_{m-1}\end{pmatrix}=\frac{n!}{n_0!n_1!\cdots n_{m-1}!}\]
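To make the counting formulas above concrete, here is a minimal sketch (assuming Python with only the standard library; the values of \(n\) and \(k\) are arbitrary) that checks the three sampling counts against brute-force enumeration.

<code python>
# Brute-force check of the counting formulas for small n and k.
from itertools import product, permutations, combinations
from math import comb, factorial

n, k = 5, 3
items = range(n)

# Sampling k times from n items with replacement: n**k ordered sequences.
assert len(list(product(items, repeat=k))) == n**k

# Without replacement, with ordering: n!/(n-k)! ordered sequences.
assert len(list(permutations(items, k))) == factorial(n) // factorial(n - k)

# Without replacement, without ordering: the binomial coefficient C(n, k).
assert len(list(combinations(items, k))) == comb(n, k)

print("all counting formulas agree with enumeration")
</code>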
====== Discrete Random Variables ======

Random variables can be used to map outcomes to numbers. More formally, a random variable \(X\) is a function that assigns a real number \(X(\omega)\) to each outcome \(\omega\) in \(S\). The sample space is the domain of X, and the range \(S_X\) is the collection of all possible outputs \(X(\omega)\). Two outcomes can map to the same value. Events can be defined in terms of a single value or a set of output values that a random variable can take.
\[A=\{\omega\in S:X(\omega)\in\mathcal{I}\}\]

===== The Cumulative Distribution Function =====

The Cumulative Distribution Function (CDF) is defined for \(x\in\mathbb{R}\) as:
\[F_X(x)=P[X\leq x]\]
Thus the CDF is the probability of the event \(\{X\leq x\}\). A CDF is built by working through the regions of values that X can take. The function jumps in value where \(P[X=x]\neq0\), with jump height \(P[X=x]\).

The CDF has some properties:
  - \(F_X(x)\in[0,1]\)
  - \(x_1\geq x_2\implies F_X(x_1)\geq F_X(x_2)\)
  - \(\lim_{x\to\infty}F_X(x)=1\)
  - \(\lim_{x\to-\infty}F_X(x)=0\)
  - \(F_X(x)=\lim_{h\to0^+}F_X(x+h)=F_X(x^+)\)

This means the CDF is a non-decreasing function (property 2) that is right continuous (property 5). We can use the CDF to find:
  * \(P[a<X\leq b]=F_X(b)-F_X(a)\)
  * \(P[X>a]=1-F_X(a)\)
  * \(P[X=a]=F_X(a)-F_X(a^-)\)
  * \(P[X<a]=F_X(a^-)\)

===== The Probability Mass Function =====

For a discrete RV, the Probability Mass Function (PMF) gives the probability of each value in the range: \(p_X(x)=P[X=x]\). Common discrete random variables include the Bernoulli RV with parameter \(p\), which takes the value 1 with probability \(p\) and 0 with probability \(1-p\); the binomial RV with parameters \(n\) and \(p\), which counts the number of successes in \(n\) independent Bernoulli trials; and the geometric RV with parameter \(p\), with PMF \(p_X(k)=(1-p)^{k-1}p\), which counts the number of trials up to and including the first success. For a geometric RV:
\[P[X=k+j|X>j]=P[X=k]\]
This is known as the memory-less property (geometric is the only memory-less discrete random variable).

===== Poisson Random Variable =====

\(N\) is a Poisson Random Variable with parameter \(\lambda\) if:
\[p_N(k)=P[N=k]=\frac{\lambda^k}{k!}e^{-\lambda},k=0,1,2,...\]
Used often for counting the number of events in a given time period or region of space. Denoted as:
\[N\sim\text{Poisson}(\lambda)\]

===== Statistics =====

A discrete RV is completely described by its PMF or CDF, but we are not always interested in so much information. Statistics are a summary of the RV used for making comparisons. Three different statistics are the mode, median and mean.

The mode is a number \(x_{mod}\in S_X\) such that:
\[p_X(x_{mod})\geq p_X(x),\forall x\]
That is, the mode is the most probable outcome.

The median is a number \(x_{med}\) such that:
\[P[X\geq x_{med}]\geq 0.5\text{ and }P[X\leq x_{med}]\geq 0.5\]
Ideally it exactly balances the probability of lying above or below, but this is not always satisfied by a discrete RV. The median does not always need to be in the range \(S_X\).

The mean, or average, is the expected value of the distribution.
\[\mu_X=E[X]=\sum_{x_i\in S_X}x_iP[X=x_i]\]
This is the average empirical outcome if we were to perform the experiment many times. The expected value of a Bernoulli RV is:
\[E[X]=0\cdot(1-p)+1\cdot p=p\]
For a binomial RV, which is a sum of \(n\) independent \(I_k\sim\text{Bernoulli}(p)\) RVs:
\[E[X]=E\left[\sum_{k=1}^n I_k\right]=np\]

The expectation operator has the following properties:
  - \(E[a]=a\)
  - \(E[aX]=aE[X]\)
  - \(E[X+Y]=E[X]+E[Y]\)
  - \(E[aX+bY]=aE[X]+bE[Y]\)

Hence expectation is a linear operator. The expected value of a geometric variable is:
\[E[X]=\sum_{k=1}^\infty k[(1-p)^{k-1}p]=\frac{1}{p}\]
The mean may not be in the range \(S_X\), i.e. \(E[X]\notin S_X\). For a Poisson RV:
\[E[N]=\sum_{k=0}^{\infty}k\frac{\lambda^k}{k!}e^{-\lambda}=0+\sum_{k=1}^{\infty}k\frac{\lambda^k}{k!}e^{-\lambda}=\lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!}=\lambda\]

The variance of a RV is:
\[\sigma^2_X=\text{Var}(X)=E[(X-E[X])^2]=\sum_{x_i\in S_X}(x_i-\mu_X)^2p_X(x_i)\]
This is an indication of the spread of the values about its mean. Variance can also be calculated as:
\[\text{Var}(X)=E[X^2]-(E[X])^2\]
Both of these values can be found separately.
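As a quick numerical illustration of these statistics, the sketch below (a minimal example assuming Python with NumPy; the PMF values are made up) computes the mean from \(\sum_i x_ip_X(x_i)\) and checks that the two forms of the variance agree.

<code python>
# Mean and variance of a discrete RV computed directly from its PMF.
import numpy as np

x = np.array([0, 1, 2, 3])          # values in S_X (hypothetical)
p = np.array([0.1, 0.4, 0.3, 0.2])  # PMF values, must sum to 1
assert np.isclose(p.sum(), 1.0)

mean = np.sum(x * p)                          # E[X]
var_def = np.sum((x - mean) ** 2 * p)         # E[(X - E[X])^2]
second_moment = np.sum(x ** 2 * p)            # E[X^2]
var_alt = second_moment - mean ** 2           # E[X^2] - (E[X])^2

print(mean, var_def, var_alt)
assert np.isclose(var_def, var_alt)           # both variance formulas agree
</code>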
The variance has two useful properties:
  - \(\text{Var}(aX)=a^2\text{Var}(X)\)
  - If \(X\) and \(Y\) are independent, \(\text{Var}(X+Y)=\text{Var}(X)+\text{Var}(Y)\)

The variance of a Bernoulli trial is:
\[\text{Var}(X)=p(1-p)\]
For a Binomial:
\[\text{Var}(X)=np(1-p)\]
For a Geometric:
\[\text{Var}(X)=\frac{1-p}{p^2}\]
For a Poisson:
\[\text{Var}(X)=\lambda\]

The Poisson RV can be used to approximate the binomial RV, which can be hard to calculate when \(n\) is large. For distributions where \(n\) is large and \(p\) is small, we can set \(\lambda=np\). We do need to note that the Poisson distribution counts over an infinite range, whereas the binomial is finite, so we have problems as \(n\to\infty\).

====== Continuous Random Variables ======

If the results of an experiment are uncountable, forming a continuum, we call the experiment continuous. Analogue measurements, for example, are continuous. Here, the probability that the outcome is an exact number is zero, so we assess probabilities on a range. The continuous random variable has a CDF that is continuous everywhere and has a derivative (which may fail to exist at a finite number of points). The equivalent of the PMF for discrete random variables is the probability density function (PDF). The CDF can be written as:
\[F_X(x)=\int_{-\infty}^xf_X(u)du\]
Hence the PDF can be written as:
\[f_X(x)=\frac{dF_X(x)}{dx},x\in\mathbb{R}\]
We can say we have an element of probability:
\[f_X(x)dx=P[x<X\leq x+dx]\]

===== Uniform Random Variable =====

The uniform random variable on \([a,b]\) has the following PDF:
\[f_X(x)=\begin{cases}\frac{1}{b-a},&a\leq x\leq b\\0,&\text{else}\end{cases}\]
Its CDF is:
\[F_X(x)=\begin{cases}0,&x<a\\\frac{x-a}{b-a},&a\leq x\leq b\\1,&x>b\end{cases}\]
The mean and variance are:
\[\mu_X=\frac{b+a}{2}\]
\[\text{Var}(X)=\frac{(b-a)^2}{12}\]

===== Exponential Random Variable =====

The exponential random variable has the following PDF:
\[f_X(x)=\begin{cases}\lambda e^{-\lambda x},&x\geq 0\\0,&\text{else}\end{cases}\]
Its CDF is:
\[F_X(x)=\begin{cases}0,&x<0\\1-e^{-\lambda x},&x\geq 0\end{cases}\]
Its mean and variance are:
\[\mu_X=\frac{1}{\lambda}\]
\[\sigma^2_X=\frac{1}{\lambda^2}\]
Used in studies of waiting and service times. \(\lambda\) is in units of number per time, so the expected value is the average waiting time for one occurrence. The exponential RV is the continuous equivalent of the geometric RV. It also has the same memory-less property.

The CDF of an exponential RV can be constructed by modelling the number of arrivals in the first \(t\) units of time as a Poisson RV \(N_t\) with parameter \(\lambda t\). The CDF is then 1 less the probability that no arrivals happen in the first \(t\) units of time.
\[F_X(t)=P[X\leq t]=1-P[X>t]=1-P[N_t=0]=1-e^{-\lambda t}\]
The Poisson RV models the number of arrivals, whereas the exponential RV models the inter-arrival time. The memory-less property means that when conditioned on a past event, time is restarted and a new time 0 is established.

===== Gaussian (Normal) Random Variable =====

Many phenomena have a bell-shaped empirical distribution. This is often modelled with a Gaussian or normal PDF with mean \(\mu\) and variance \(\sigma^2\):
\[f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)\]
\[X\sim\mathcal{N}(\mu,\sigma^2)\]
As the variance increases, the curve becomes shallower and wider. A closed form solution doesn't exist for the CDF, so there are numerical tables to find values of the standard (\(\mu=0,\sigma=1\)) CDF \(\Phi\). For nonstandard normal RVs, we can standardise them and use \(F_X(x)=\Phi\left(\frac{x-\mu}{\sigma}\right)\). There is also the Q function, defined as the complement of \(\Phi\). We can often model noise by adding a Gaussian with zero mean to a signal.
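The standard-normal tables can be reproduced numerically. The sketch below (assuming Python; \(\Phi\) is built from the error function, and the values of \(\mu\), \(\sigma\), \(a\) and \(b\) are arbitrary) standardises a non-standard Gaussian and evaluates probabilities through \(\Phi\) and \(Q\).

<code python>
# Gaussian probabilities via the standard normal CDF Phi and the Q function.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def Q(z):
    """Complement of the standard normal CDF."""
    return 1.0 - phi(z)

mu, sigma = 5.0, 2.0   # X ~ N(mu, sigma^2), values chosen for illustration
a, b = 4.0, 9.0

# P[a < X <= b] by standardising: Phi((b-mu)/sigma) - Phi((a-mu)/sigma)
p_interval = phi((b - mu) / sigma) - phi((a - mu) / sigma)

# P[X > b] using the Q function
p_tail = Q((b - mu) / sigma)

print(p_interval, p_tail)
</code>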
The Q function is reflective: for a standard normal RV, \(P[X\leq y]=1-P[X>y]=1-Q(y)=Q(-y)\). For probabilities involving the magnitude \(|X|\) (with \(y\geq 0\)): \(P[|X|\geq y]=P[X\leq -y]+P[X\geq y]=2P[X\geq y]=2Q(y)\). Likewise: \(P[|X|\leq y]=1-P[|X|\geq y]=1-2Q(y)\).

====== Mixed random variables ======

These have CDFs made of continuous increasing segments together with jumps. As the CDF isn't purely discrete, a PMF isn't possible. We can consider a discrete RV's PMF as a PDF by considering the probability points as Dirac delta functions. A delta function has infinite height with infinitesimal width, such that the area under it is 1. We can express a PMF as a PDF by:
\[f_X(x)=\sum_{x_k\in S_X}p_X(x_k)\delta(x-x_k)\]
We can go from a CDF to a PDF with a generalised derivative, by defining the derivative of a jump to be equal to a Dirac delta of area equal to the jump. If a delta exists at a point, then that exact value has non-zero probability, which we can calculate.

====== Functions of a random variable ======

If we consider \(Y=g(X)\), \(Y\) inherits randomness from \(X\), so how can we express \(Y\) based on the PDF/CDF/PMF of \(X\)?

===== Discrete Random Variables =====

\(Y\) will also be a discrete RV, taking on the values \(g(x_k),\forall x_k\in S_X\). As such, the PMF of \(Y\) can be determined by:
\[p_Y(y)=P[Y=y]=\sum_{i:g(x_i)=y}p_X(x_i)\]

===== Continuous Random Variables =====

\[F_Y(y)=P[Y\leq y]=P[g(X)\leq y]=P[g(X)\in(-\infty,y]]=P[X\in g^{-1}((-\infty,y])]=\int_{g^{-1}((-\infty,y])}f_X(x)dx\]

===== Generate a RV with a Prescribed CDF =====

We want to generate \(X\) with CDF \(F_X(x)\), so we start with a standard uniform RV \(U\sim U[0,1]\). We then set \(X=g(U)=F_X^{-1}(U)\).

===== Expectation of a function of a RV =====

\[E[Y]=E[g(X)]=\int_{-\infty}^\infty g(x)f_X(x)dx\]
\[E[Y]=E[g(X)]=\sum_{x_k\in S_X}g(x_k)p_X(x_k)\]
For the continuous and discrete cases respectively. Variance is an example of the expectation of a function of a random variable.

====== Joint random variables ======

It can be useful to visualise joint variables as a random vector. We can visualise a pair of random variables as two functions mapping an outcome to two different number lines. The joint mapping takes the outcome to a point on the plane. This point can contain more information than only one of the mapping functions. As with the single RV case, we don't know the outcome beforehand, so we want to look at the joint probabilities of points and regions on the plane.
\[P[(X\leq x)\cap(Y\leq y)]=P[X\leq x,Y\leq y]\]
\[P[(X=x)\cap(Y=y)]=P[X=x,Y=y]\]
If both functions are discrete RVs, the joint probability mass function is:
\[p_{X,Y}(x,y)=P[X=x,Y=y]\]
The range of the pair is \(S_{X,Y}\). The sum of the probabilities over the entire sample space is 1. The joint probabilities have the following properties:
  - \(0\leq p_{X,Y}(x,y)\leq 1,\forall(x,y)\in\mathbb{R}^2\)
  - \(p_{X,Y}(x_i,y_j)=0\) if \((x_i,y_j)\notin S_{X,Y}\)
  - \(\sum_{x\in S_X}\sum_{y\in S_Y}p_{X,Y}(x,y)=1\)

Joint RVs are said to be independent if the probability of the point is equal to the product of the marginal probabilities:
\[p_{X,Y}(x,y)=P[X=x,Y=y]=p_X(x)p_Y(y)\]
Whether or not the variables are independent, we can recover the individual PMFs by summing over the other variable, e.g.
\[P[X=1]=\sum_{i}P[X=1,Y=y_i]\]
With 2 RVs, we need 3D plots to express PMFs, CDFs and PDFs, which can be difficult. We use level curves and labels to represent them in a 2D plane. Joint PMFs are easy to show in 2D, joint PDFs need level curves, making them harder, and joint CDFs are hard to plot and usually not meaningful.
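The joint PMF, its marginal PMFs (formalised below) and the independence condition can all be checked numerically. A minimal sketch, assuming Python with NumPy and a made-up joint PMF table:

<code python>
# Joint PMF of (X, Y) as a table: rows index values of X, columns values of Y.
import numpy as np

p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])      # hypothetical joint PMF, sums to 1
assert np.isclose(p_xy.sum(), 1.0)

p_x = p_xy.sum(axis=1)               # marginal PMF of X: sum over y
p_y = p_xy.sum(axis=0)               # marginal PMF of Y: sum over x

# Independence check: does p_{X,Y}(x, y) = p_X(x) p_Y(y) hold for every (x, y)?
independent = np.allclose(p_xy, np.outer(p_x, p_y))
print(p_x, p_y, independent)
</code>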
Given a joint RV, we can consider each random variable individually by computing the marginal PMFs of X and Y.
\[p_X(x)=P[X=x]=P[X=x,Y=\text{anything}]=\sum_{y_k\in S_Y}P[X=x,Y=y_k]\]
This turns the 2D function into a 1D function. Given a joint PMF, we can always compute the marginal PMFs.

===== Joint CDF =====

The joint CDF is:
\[F_{X,Y}(x,y)=P[X\leq x,Y\leq y]\]
The event is a semi-infinite rectangular region of the plane. If the random variables are independent, the product of the variables' CDFs is the joint CDF:
\[F_{X,Y}(x,y)=F_X(x)F_Y(y)\]
We can also get marginal CDFs:
\[F_X(x)=P[X\leq x]=P[X\leq x,Y\leq\infty]=F_{X,Y}(x,\infty)\]
If both X and Y are discrete, there will be jumps, forming a 3D staircase function. If both X and Y are continuous, the joint CDF will be a continuous surface everywhere. We can then define a joint PDF, which gives the concentration (density) of the probability per unit area, and work with double integrals.
\[f_{X,Y}(x,y)=\frac{\partial^2F_{X,Y}(x,y)}{\partial x\partial y}\]
\[F_{X,Y}(x,y)=\int_{-\infty}^x\int_{-\infty}^y f_{X,Y}(u,v)\,dv\,du\]
The joint PDF is the product of the marginal PDFs when independent. The marginal PDF can be computed from the joint PDF:
\[f_X(x)=\frac{dF_{X}(x)}{dx}=\frac{dF_{X,Y}(x,\infty)}{dx}=\frac{d}{dx}\int_{-\infty}^x\int_{-\infty}^\infty f_{X,Y}(u,v)\,dv\,du=\int_{-\infty}^\infty f_{X,Y}(x,y)dy\]

The joint CDF has the following properties:
  - \(0\leq F_{X,Y}(x,y)\leq 1\)
  - \(\lim_{x,y\to\infty}F_{X,Y}(x,y)=F_{X,Y}(\infty,\infty)=1\)
  - If \(x_1\leq x_2\) and \(y_1\leq y_2\), then \(F_{X,Y}(x_1,y_1)\leq F_{X,Y}(x_2,y_1)\leq F_{X,Y}(x_2,y_2)\)
  - \(\lim_{x\to-\infty}F_{X,Y}(x,y)=F_{X,Y}(-\infty,y)=0\)
  - \(\lim_{x\to a^+}F_{X,Y}(x,y)=F_{X,Y}(a^+,y)=F_{X,Y}(a,y)\)

As a consequence of the third property, we get:
\[P[x_1<X\leq x_2,y_1<Y\leq y_2]=F_{X,Y}(x_2,y_2)-F_{X,Y}(x_1,y_2)-F_{X,Y}(x_2,y_1)+F_{X,Y}(x_1,y_1)\]
The last two properties mean the CDF is top and right continuous. If both X and Y are continuous RVs, then the CDF is continuous from all directions.

===== Joint PDF =====

The joint PDF has the following properties:
  - \(f_{X,Y}(x,y)\geq 0\)
  - \(\int_{-\infty}^\infty\int_{-\infty}^\infty f_{X,Y}(x,y)dxdy=1\)
  - \(P[(X,Y)\in A]=\iint_{(x,y)\in A}f_{X,Y}(x,y)dxdy\)
  - \(P[a<X\leq b,c<Y\leq d]=\int_c^d\int_a^b f_{X,Y}(x,y)\,dx\,dy\)

===== Markov inequality =====

For a non-negative random variable \(X\) and a constant \(a>0\):
\[P[X\geq a]\leq\frac{E[X]}{a}\]
In general this provides a loose upper bound, although little information is needed to provide the bound.

===== Chebyshev inequality =====

For a given random variable \(X\) and a constant \(a>0\):
\[P[|X-\mu_X|\geq a]\leq\frac{\text{var}(X)}{a^2}\]
This uses more information than the Markov inequality (it needs the variance). It tells us that if the variance is small, \(X\) is unlikely to be far from the mean. We can also tell that most of the probability mass/density of a RV is within a few standard deviations of the mean.

===== Sum of random variables =====

Many macroscopic phenomena can be seen as the sum of a number of probabilistic microscopic actions. When we want to add two independent random variables, we convolve their PDFs to get the sum's PDF. We can repeat this to sum an arbitrary number of variables. If we have a sum of iid random variables, the expected value of the sum is \(E[S_n]=n\mu\) and its variance is \(\text{Var}(S_n)=n\sigma^2\). If we take the mean of n iid random variables, the expected value is \(E[M_n]=\mu\) and the variance is \(\text{Var}(M_n)=\sigma^2/n\). The latter can be thought of as taking a sample from a distribution.
The sample has a mean equal to that of the original distribution, but has a variance that approaches 0 as the sample size increases. In sampling, we can get outliers that can distort the sample mean, but sampling allows us to create bounds on the sample to give confidence. For an example Bernoulli variable which we sample, we can place bounds due to the Chebyshev inequality.
\[P[|M_n-p|\geq a]\leq\frac{\text{Var}(M_n)}{a^2}=\frac{\text{Var}(X_n)}{na^2}=\frac{p(1-p)}{na^2}\leq\frac{1/4}{na^2}\]

==== Weak law of large numbers ====

\[\text{Var}(M_n)=\frac{\sigma^2}{n}\implies\lim_{n\to\infty}\text{Var}(M_n)=0\]
By Chebyshev's inequality (for \(\epsilon>0\)):
\[P[|M_n-\mu|\geq\epsilon]\leq\frac{\text{Var}(M_n)}{\epsilon^2}\]
Taking this limit gives the weak law of large numbers:
\[\lim_{n\to\infty}P[|M_n-\mu|\geq\epsilon]=0\]
Alternately we can express this as:
\[\lim_{n\to\infty}P[|M_n-\mu|\leq\epsilon]=1\]

==== Empirical data and probabilities ====

We get probabilistic models for data from statistical inference. For a given event, we can calculate an empirical probability by running \(n\) trials and counting the number of times the event occurs. We can consider our experiment as a Bernoulli trial, where either the event occurs or it doesn't, which lets us use the above example about the polling problem.
\[P[|\tilde{P}_n[A]-P[A]|<\epsilon]\geq 1-\frac{P[A](1-P[A])}{n\epsilon^2}=1-\frac{\text{Var}(\tilde{P}_n[A])}{\epsilon^2}\]
From this we can see that to double our precision, we need 4 times the number of trials.

===== Central limit theorem =====

For a given sum \(S_n\) of \(n\) iid random variables, each with mean \(\mu\) and variance \(\sigma^2\), we can define:
\[Z_n=\frac{S_n-n\mu}{\sigma\sqrt{n}}\]
The mean of \(Z_n\) is \(E[Z_n]=0\), and its variance is \(\text{Var}(Z_n)=\frac{\text{Var}(S_n)}{(\sigma\sqrt{n})^2}=1\). The central limit theorem states that:
\[\lim_{n\to\infty}P[Z_n\leq z]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^ze^{-x^2/2}dx=\Phi(z)\]
In other words, \(Z_n\) converges to \(\mathcal{N}(0,1)\) as \(n\to\infty\). This is regardless of the underlying distribution, only requiring a finite mean and variance, provided the variance is non-zero.

The central limit theorem allows us to approximate sums and averages of iid random variables with Gaussian distributions, provided \(n\) is sufficiently large. We can see this in a system where, if the noise is due to a large number of events, the noise will be Gaussian. Additionally the empirical mean can be modelled as Gaussian for sufficiently large \(n\). As a loose guideline, \(n=30\) can be used, but this depends on the underlying distribution, with some resembling a Gaussian after fewer samples and some after more. Sometimes the CLT will be used for fewer samples as it is easier than using the underlying distribution. We can find that the original sum approaches \(F_{S_n}(a)\approx\Phi\left(\frac{a-n\mu}{\sigma\sqrt{n}}\right)\), meaning that \(S_n\sim\mathcal{N}(n\mu,n\sigma^2)\) approximately. As the sample size increases, the mean of the sample tends to be distributed as \(M_n\sim\mathcal{N}(\mu,\sigma^2/n)\).

==== De Moivre-Laplace Approximation ====

Because the CLT approximates with a continuous random variable, the probability of a single outcome is 0, which limits its use for discrete random variables. We can use the De Moivre-Laplace approximation here to find the probability of a single value (or a range of values) for binomial variables.
If \(S_n\sim\text{Binomial}(n,p)\), for sufficiently large \(n\) and \(k_1\leq k_2\), we can approximate:
\[P[k_1\leq S_n\leq k_2]\approx\Phi\left(\frac{k_2+0.5-np}{\sqrt{np(1-p)}}\right)-\Phi\left(\frac{k_1-0.5-np}{\sqrt{np(1-p)}}\right)\]
This is equivalent to the approximation:
\[P[k_1\leq S_n\leq k_2]\approx P[k_1-0.5\leq S_n\leq k_2+0.5]\]

==== Random vectors ====

For a sequence of iid random vectors, with mean \(E[\mathbf{X}_i]=\mathbf{\mu_X}\) and covariance \(\mathbf{C_X}\), we can define:
\[\mathbf{S}_n=\sum_{i=1}^n\mathbf{X}_i\]
\[\mathbf{Z}_n=\frac{1}{\sqrt{n}}(\mathbf{S}_n-n\mathbf{\mu_X})\]
So as \(n\to\infty\), \(\mathbf{Z}_n\sim\mathcal{N}(\mathbf{0},\mathbf{C_X})\).

===== Moment Generating Function (MGF) =====

The Moment Generating Function is:
\[\phi_X(s)=E[e^{sX}]\]
So for a discrete RV:
\[\phi_X(s)=\sum_{x\in S_X}e^{sx}p_X(x)\]
And for a continuous RV:
\[\phi_X(s)=\int_{-\infty}^\infty e^{sx}f_X(x)dx\]
In the continuous case, \(\phi_X(s)\) is the two sided Laplace transform of \(f_X(x)\), with \(s\) instead of \(-s\). In the discrete case, \(\phi_X(s)\) is a Z-transform of \(p_X\), with \(z=e^{-s}\). The MGF has use in proving the CLT.

The kth moment of a random variable \(X\) is defined as \(E[X^k]\), where \(k\) is a positive integer. For some distributions finding the moment will involve integration by parts \(k-1\) times. Instead, the kth derivative of the MGF, evaluated at \(s=0\), gives the moment:
\[\left.\frac{d^k\phi_X(s)}{ds^k}\right|_{s=0}=E[X^k]\]
\(\phi_X(s)\) is only defined where \(E[e^{sX}]\) is finite, meaning within the region of convergence. As summing two independent random variables results in the convolution of their PDFs, in the MGF domain we can multiply the individual MGFs to get the MGF of the sum.

===== Random Processes =====

A stochastic process (random process) is a mapping from the sample space \(S\) to the set of time-domain waveforms (either discrete or continuous time). In that sense, the experiment results in an outcome that is a time domain signal.

==== Continuous time ====

In continuous time a random process is usually denoted as \(X(t)\) or \(\{X(t)\}\). The process is a function of two variables \(X(t,\omega)\), where \(t\) is the time and \(\omega\) is the outcome, although \(\omega\) is commonly dropped. For a fixed time \(t_0\), \(X(t_0,\omega)\) is a standard RV. For a fixed outcome \(\omega_0\), \(X(t,\omega_0)\) is a deterministic signal. Fixing both results in \(X(t_0,\omega_0)\) being a constant. For a fixed outcome, the deterministic signal is referred to as a sample function, sample path or realisation of \(X(t)\). For every time, \(X(t,\omega)\) is a random variable, so we can think of \(X(t)\) as an infinite set of random variables. The ensemble of a random process is the set of all possible time domain functions that can result from the experiment, analogous to the range of a RV.

==== Discrete time ====

Discrete time processes are similar to continuous time, but using discrete time instants \(n\in\mathbb{Z}\) instead of \(t\in\mathbb{R}\). It is written as \(X(n)\) or \(\{X(n)\}\). A discrete time random process is a sequence of random variables, and the ensemble is all possible sequences.

==== Types of random processes ====

Random processes can be either:
  * Discrete time, discrete valued (DT DV)
  * Discrete time, continuous valued (DT CV)
  * Continuous time, discrete valued (CT DV)
  * Continuous time, continuous valued (CT CV)

Random processes are not just limited to time, and can be generalised to a random field.
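As a small illustration of the two views of a random process (a sketch assuming Python with NumPy; the choice of a random-walk process is just an example), the code below generates several realisations of a discrete-time process: each row is one sample path \(X(n,\omega_0)\), while each column, read across realisations, is the ordinary random variable \(X(n_0)\).

<code python>
# A discrete-time random process: a random walk built from iid +/-1 steps.
import numpy as np

rng = np.random.default_rng(0)
num_realisations, num_steps = 5, 100

steps = rng.choice([-1, 1], size=(num_realisations, num_steps))
paths = np.cumsum(steps, axis=1)     # each row is one sample path (realisation)

# Fixing the outcome gives a deterministic sequence (one row).
sample_path = paths[0]

# Fixing the time gives an ordinary random variable (one column across outcomes).
X_at_n0 = paths[:, 50]

print(sample_path[:10], X_at_n0)
</code>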
==== Specifying random processes ====

It can be difficult to fully statistically characterise a random process. The first-order distribution of a random process \(\{X(t)\}\) is:
\[F_{X(t)}(x)=P[X(t)\leq x]\]
This is the first-order CDF, and is a function of \(x\) and \(t\). The first-order PDF is:
\[f_{X(t)}(x)=\frac{\partial}{\partial x}F_{X(t)}(x)\]
We can also define mean and variance functions:
\[\mu_X(t)=E[X(t)]=\int_{-\infty}^\infty xf_{X(t)}(x)dx\]
\[\text{Var}[X(t)]=E\left\{[X(t)-\mu_X(t)]^2\right\}\]

The second-order distribution compares how two times are related. The CDF is:
\[F_{X(t_1),X(t_2)}(x_1,x_2)=P[X(t_1)\leq x_1,X(t_2)\leq x_2]\]
The PDF is:
\[f_{X(t_1),X(t_2)}(x_1,x_2)=\frac{\partial^2}{\partial x_1\partial x_2}F_{X(t_1),X(t_2)}(x_1,x_2)\]
\(X(t_1)\) and \(X(t_2)\) are a pair of random variables, so we can look at their autocorrelation and autocovariance.
\[R_X(t_1,t_2)=E[X(t_1)\cdot X(t_2)]\]
\[C_X(t_1,t_2)=\text{Cov}[X(t_1),X(t_2)]=R_X(t_1,t_2)-\mu_X(t_1)\mu_X(t_2)\]
For the nth order distribution, we can write the CDF and PDF as:
\[F_{X(t_1),...,X(t_n)}(x_1,...,x_n)=P[X(t_1)\leq x_1,...,X(t_n)\leq x_n]\]
\[f_{X(t_1),...,X(t_n)}(x_1,...,x_n)=\frac{\partial^n}{\partial x_1...\partial x_n}F_{X(t_1),...,X(t_n)}(x_1,...,x_n)\]
A stochastic process is specified by the collection of nth order joint distribution functions for any \(n\) and any choice of sampling times \(t_1,...,t_n\). This means we may need an infinite collection of distribution functions to fully characterise a stochastic process. Some processes have simple models that are easier to deal with.

==== Stationarity ====

Some processes have (strict) stationarity. That means that for any \(n\in\mathbb{N}\), times \(t_1,...,t_n\) and time shift \(\tau\):
\[F_{X(t_1),...,X(t_n)}(x_1,...,x_n)=F_{X(t_1+\tau),...,X(t_n+\tau)}(x_1,...,x_n)\]
\[f_{X(t_1),...,X(t_n)}(x_1,...,x_n)=f_{X(t_1+\tau),...,X(t_n+\tau)}(x_1,...,x_n)\]
Then the random process \(\{X(t)\}\) is strictly stationary. This means that the joint distributions are time-shift invariant. This must hold true for any number and location of time instants and any \(\tau\). Stationarity implies that the CDF/PDF/PMF only depends on the relative times between instants.

For a first-order distribution:
\[F_{X(t)}(x)=F_{X(t+\tau)}(x)\]
\[f_{X(t)}(x)=f_{X(t+\tau)}(x)\]
Meaning that the distribution is equal for all times, letting us write:
\[f_{X(t)}(x)=f_{X(t+\tau)}(x)=f_{\overline{X}}(x),\forall t,\tau\]
\[X(t)\stackrel{d}{=}\overline{X},\forall t\]
As a result we have the mean and variance being constant:
\[E[X(t)]=\mu\]
\[\text{Var}(X(t))=\sigma^2\]

For a second-order distribution:
\[f_{X(t_1),X(t_1+\tau)}(x_1,x_2)=f_{X(t_2),X(t_2+\tau)}(x_1,x_2)=f_{X(0),X(\tau)}(x_1,x_2)\]
As these PDFs are the same, so are the autocorrelation and autocovariance functions.
\[R_X(t_1,t_1+\tau)=R_X(0,\tau)=R_X(\tau)\]
\[C_X(t_1,t_1+\tau)=C_X(0,\tau)=C_X(\tau)\]
\[R_X(s,t)=R_X(t-s)\]
\[C_X(s,t)=C_X(t-s)\]
The joint distribution of samples of an iid random process can be written without any dependency on \(t\), so an iid process must be stationary.

=== Wide Sense Stationarity (WSS) ===

A random process \(\{X(t)\}\) is WSS if:
  - \(E[X(t)]=\mu,\forall t\in\mathbb{R}\)
  - \(R_X(t_1,t_2)=E[X(t_1)X(t_2)]\) depends only on the time difference \(t_2-t_1\), i.e. \(R_X(\tau)=R_X(t,t+\tau)\)

Wide sense stationarity is weaker than strict stationarity.
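A quick way to get a feel for wide sense stationarity is to estimate the mean and autocorrelation functions from many realisations. The sketch below (assuming Python with NumPy; an iid zero-mean process is used purely as an example, and ensemble averages stand in for expectations) checks that \(R_X(t_1,t_2)\) estimated at different absolute times but the same lag comes out roughly equal.

<code python>
# Empirical check of WSS-like behaviour for an iid, zero-mean process.
import numpy as np

rng = np.random.default_rng(1)
num_realisations, num_samples = 20000, 50
X = rng.normal(0.0, 1.0, size=(num_realisations, num_samples))  # iid N(0,1) samples

def R_hat(t1, t2):
    """Ensemble-average estimate of R_X(t1, t2) = E[X(t1) X(t2)]."""
    return np.mean(X[:, t1] * X[:, t2])

# Same lag (t2 - t1 = 3) at different absolute times: estimates should agree.
print(R_hat(5, 8), R_hat(30, 33))            # both near 0 for this process
print(R_hat(10, 10), R_hat(40, 40))          # lag 0: both near the variance, 1.0
print(np.mean(X[:, 5]), np.mean(X[:, 40]))   # mean function, constant (near 0)
</code>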
For a WSS process, the autocorrelation has the following properties:
  - \(R_X(\tau)=R_X(-\tau)\), i.e. \(R_X\) is an even function
  - \(R_X(0)=E[X^2(t)]=\text{average power of }X(t)\)
  - \(|R_X(\tau)|\leq R_X(0)\)
  - \(P[|X(t+\tau)-X(t)|>\epsilon]\leq\frac{2(R_X(0)-R_X(\tau))}{\epsilon^2}\)

Properties 1, 3 and 4 also apply to the autocovariance, which differs from the autocorrelation by a constant. Property 4 is a bound on how fast the process can change: the larger \(R_X(\tau)\), the closer \(X(t+\tau)\) and \(X(t)\) must be in a probabilistic sense. There is also the WSS periodic property, being that if \(R_X(T)=R_X(0)\) for some \(T\neq 0\), then \(R_X(\tau)\) is periodic with period \(T\) and \(X(t+T)=X(t)\) with probability 1.

==== Gaussian Process ====

\(\{X(t)\}\) is a Gaussian process if \(\forall n\in\mathbb{N}\) and \(t_1,...,t_n\), the samples \(X(t_1),...,X(t_n)\) are jointly Gaussian.
\[\mathbf{X}=[X(t_1),...,X(t_n)]^T\sim\mathcal{N}(\mathbf{\mu_X},\mathbf{C_X})\]
If a Gaussian random process is wide sense stationary, then it is strictly stationary as well. This is because the joint density is specified by the mean and autocovariance, which for a WSS process are time-shift invariant, so the joint density does not change with a time shift.

==== White noise ====

A WSS discrete time process \(\{W(n)\}\) is white noise if its mean is 0 and \(R_W(l)=0\) when \(l\neq0\).
\[R_W(l)=C_W(l)=\sigma^2_W\cdot\delta(l)\]
Here \(\sigma^2_W\) is the average noise power, and any two distinct time samples are always uncorrelated.

A WSS continuous time process \(\{W(t)\}\) is white noise if its mean is 0 and:
\[R_W(\tau)=C_W(\tau)=\alpha\delta(\tau)\]
Again any two distinct samples are uncorrelated, but \(R_W(0)=\infty\) due to the \(\delta\) function. White noise is white as its power spectral density is constant for all frequencies, like white light is formed from all visible colours. For a WSS random process, the power spectral density is the Fourier transform of the autocorrelation.

==== Counting Process ====

A counting process \(\{N(t)\}\) counts the occurrence of events, taking on the values of the non-negative integers. The process usually starts at \(t=0\), so \(N(t)=0,t<0\). Discontinuities occur at the times of arrival, so \(N(t)\) counts the number of arrivals in \([0,t]\) and is non-decreasing. The inter-arrival times are also random, so this is a discrete valued continuous time random process.

=== Poisson Process ===

A counting process \(\{N(t)\}\) is a Poisson Process if:
  - The number of arrivals in any interval \((s,t]\) is a Poisson RV with parameter \(\lambda(t-s)\)
  - Any pair of non-overlapping intervals has an independent number of arrivals: the Independent Increments Property

The parameter \(\lambda\) is the rate or intensity of the process, expressed as number of arrivals per period of time. For a given period, we can write the expectation, variance and first-order PMF:
\[N(t)\sim\text{Poisson}(\lambda t)\]
\[E[N(t)]=\lambda t\]
\[\text{Var}[N(t)]=\lambda t\]
\[p_{N(t)}(k)=P[N(t)=k]=\frac{(\lambda t)^k}{k!}e^{-\lambda t}\]
We can see that \(\{N(t)\}\) is neither stationary nor WSS. When intervals don't overlap, the arrival counts are independent, so it is easy to calculate the joint probability. When intervals overlap, we need to split into independent intervals. This can be done by splitting into conditional probabilities:
\[P[A\cap B]=P[B|A]P[A]\]
where \(A\) and \(B\) are events on overlapping intervals of the process. If \(A:N(s)=i\) and \(B:N(t)=j\) with \(s<t\), then \(B\) given \(A\) is the event of \(j-i\) arrivals in \((s,t]\), i.e. \(N(t)-N(s)=j-i\).
Where only part of the intervals overlap, we need to split into 3 disjoint intervals and sum over all possible interval combinations.

=== Inter-Arrival times ===

For a Poisson Process, the inter-arrival times are also random variables:
\[X_i=T_i-T_{i-1}\]
We can model this as an exponential RV.
\[X_i\sim\text{exponential}(\lambda)\]
\[E[X_i]=\frac{1}{\lambda}\]
\[F_{X_i}(t)=1-e^{-\lambda t}\]
Because of the independent increments property, the inter-arrival times are independent. We can simulate a Poisson random process by generating exponential inter-arrival times and summing them to obtain the arrival times:
  - Generate a sequence of \([0,1]\) uniform random variables \(U_i\)
  - Transform each \(U_i\) to an exponential RV: \(X_i=\frac{-\ln(1-U_i)}{\lambda}\)
  - Take the cumulative sum to find the arrival times: \(T_i=\sum_{n=1}^i X_n\)
  - For all \(t>0\), set \(N(t)=i\) for \(T_i\leq t<T_{i+1}\)

===== Markov chains =====

A Markov chain is a random process whose future state depends only on the current state. The state PMF at time \(n\) can be collected into a vector \(\mathbf{p}(n)\), and a stationary distribution \(\mathbf{p}^*\) is one that is unchanged from one time step to the next. We can see that shifting the times by the same amount does not change the PMF. As such a Markov chain is stationary when \(\mathbf{p}(0)=\mathbf{p}^*\).

====== Statistical inference ======

Statistical inference is using data to draw conclusions about distributions of random variables. Parameter estimation is estimating unknown or random parameters based on random samples. Hypothesis testing tests for specific values of an unknown or random parameter based on random samples; it is trying to classify the parameter from a set of choices. Hypothesis testing is decision making under uncertainty, and is extensively used.

===== Hypothesis testing =====

For hypothesis testing, we need to design a test with the objective of either accepting or rejecting \(H_0\). The test's outcome will be the test statistic. We also need a decision rule for choosing each hypothesis based on the test statistic. We then must assess the quality of the decision rules. For the decision rule, we partition the sample space into an acceptance set and the complementary rejection set. We then accept \(H_0\) iff the outcome is in the acceptance set.

==== Errors ====

This can give rise to errors:
  * **Type I:** We reject \(H_0\) when \(H_0\) is true
  * **Type II:** We accept \(H_0\) when \(H_0\) is false

There is a trade-off between these errors.

==== Tests ====

A two tail test is one where we accept if the test statistic is within a given distance of the expected value. A standard approach to picking the distance is choosing a significance level, and then setting the distance such that the probability of a type I error is less than the significance. The upper value on the lower rejection set is the left critical value, and the lower value on the upper rejection set is the right critical value. When we are only concerned with the test statistic moving in one direction, we use a one tailed test. This sets the rejection set as the region from 0 to a critical value, or from a critical value to infinity. We then need to set the critical value according to the level of significance.

We can also use a p-value for significance, which in the one tailed case is:
\[p=P[X\leq x|H_0]\]
The p-value expresses the probability of obtaining a value at least as extreme as the one observed, given \(H_0\), and measures the evidence for the rejection of \(H_0\). The smaller the p-value, the stronger the evidence. In practice the p-value is often used, as a fixed significance level is fairly arbitrary.

The type of hypothesis testing where we test if a test statistic reasonably fits a distribution is called significance testing. The two hypotheses used form an acceptance or rejection of a given distribution.
This is not binary hypothesis testing, as we do not have probability models for each hypothesis.

==== Binary hypothesis testing ====

Significance testing checks whether a test statistic fits a known distribution, and we either accept or reject the given model subject to the type I error being below a significance level. Binary hypothesis testing checks, given \(H_0\) and \(H_1\), both with their own distributions, which better fits a given test statistic. Sometimes, prior probabilities \(P[H_0]\) and \(P[H_1]\) are known.

Given a random variable \(X\) parameterised by \(\theta\), which is either an unknown constant or sampled from a random variable \(\Theta\), we need to express how \(X\) depends on the parameter. If \(\theta\) is a constant, we can write the parameterised CDF as \(F_X(x;\theta)\), and if sampled, \(F_{X|\Theta}(x|\theta)\). If \(\theta\) is hidden, meaning not directly observable, then given a realisation of \(X\) we need to make deductions about \(\theta\). We need to construct our two hypotheses, involving different realisations of \(\theta\), then design decision rules. Four decision rule criteria are:
  * **Maximum A Posteriori (MAP):** Accept \(H_0\) iff \(P[H_0|X=x]\geq P[H_1|X=x]\), else accept \(H_1\)
  * **Maximum Likelihood (ML):** MAP with uncertainty about prior probabilities
  * **Minimum Cost (MC):** Assigns costs to errors and minimises expected costs
  * **Neyman-Pearson (NP):** Puts a tolerance level on the false alarm probability

=== MAP ===

Alternately we can write the MAP rule as:
\[\frac{P[H_0|X=x]}{P[H_1|X=x]}\underset{H_1}{\overset{H_0}{\gtrless}}1\]
Using Bayes' rule, we can express the a posteriori probabilities in terms of the a priori probabilities, which then lets us express the decision rule as:
\[\frac{P[H_0|X=x]}{P[H_1|X=x]}=\frac{f_{X|H_0}(x)\cdot P[H_0]}{f_{X|H_1}(x)\cdot P[H_1]}\underset{H_1}{\overset{H_0}{\gtrless}}1\]
We can define the likelihood ratio as \(\Lambda(x)=\frac{f_{X|H_0}(x)}{f_{X|H_1}(x)}\) and the MAP decision threshold as \(\gamma=\frac{P[H_1]}{P[H_0]}\). The decision rule then becomes:
\[\Lambda(x)\underset{H_1}{\overset{H_0}{\gtrless}}\gamma\]
In binary hypothesis testing there are two errors:
  * **Type I:** Accept \(H_1\) when \(H_0\) is true (False alarm)
  * **Type II:** Accept \(H_0\) when \(H_1\) is true (Missed detection)

\[P_{FA}=P[\text{Type I error}]=P[\text{Accept }H_1|H_0\text{ true}]\]
\[P_{miss}=P[\text{Type II error}]=P[\text{Accept }H_0|H_1\text{ true}]\]
The overall error probability is:
\[P_{error}=P_{FA}\cdot P[H_0]+P_{miss}\cdot P[H_1]\]
The MAP decision rule minimises the overall error.

=== ML ===

When the a priori probabilities are unknown, we can assume they are equiprobable, i.e. \(\gamma=1\). This results in the Maximum Likelihood decision rule:
\[\Lambda(x)\underset{H_1}{\overset{H_0}{\gtrless}}1\]
This is equivalent to the MAP rule when \(P[H_0]=P[H_1]=\frac{1}{2}\). The MAP rule can be unreliable when we have poor estimates of the prior probabilities. The probabilities of a miss and a false alarm are inversely related, such that decreasing one increases the other. The further apart the distributions, the lower the ML error probability. This is due to the increased separation making it easier to distinguish the distributions.

The receiver operating characteristic (ROC) curve is a plot of \(P_{FA}\) against \(P_{miss}\). The closer the curve is to the bottom left \((0,0)\), the better. The worst case is a straight line from \((1,0)\) to \((0,1)\).

=== MC ===

A convenient way of handling the trade off between false alarms and misses is to assign costs to each.
This then lets us prioritise one type of error over another, with the goal being to minimise cost. The expected overall cost is:
\[E[C]=c_{FA}P_{FA}P[H_0]+c_{miss}P_{miss}P[H_1]\]
The decision rule minimising this is:
\[\Lambda(x)\underset{H_1}{\overset{H_0}{\gtrless}}\frac{c_{miss}P[H_1]}{c_{FA}P[H_0]}\]
We can see that this is a weighted form of the MAP threshold, and the cost function is a weighted form of the overall error probability. If the costs are equal, we get the MAP rule. The difficulty comes in choosing suitable costs for each error type.

=== NP ===

The Neyman-Pearson decision rule puts a tolerance level on the false alarm probability.
\[P_{FA}=P[\text{accept }H_1|H_0\text{ true}]\leq\alpha\]
With this constraint, the secondary objective is to find the decision rule that minimises \(P_{miss}\). This requires creating the decision rule \(\Lambda(x)\underset{H_1}{\overset{H_0}{\gtrless}}\gamma\), where \(\gamma\) is the largest number such that \(P_{FA}\leq\alpha\).

==== Multiple hypothesis testing ====

Often the number of possible decisions is larger than 2. We can still have MAP and ML rules, but the decision rule will not be as simple as in the binary case. There are \((M+1)^2\) decision probabilities. We can write a generalised MAP rule as:
\[\text{Choose }H_j\text{ if }f_{X|H_j}(x)P[H_j]\geq f_{X|H_i}(x)P[H_i],\forall i\]
This is choosing the hypothesis with the maximum a posteriori probability. This is equivalent to:
\[\text{Choose }H_j=\arg\max_if_{X|H_i}(x)P[H_i]\]
The ML rule is the same, with \(P[H_i]=P[H_j],\forall i,j\).

===== Statistical inferencing =====

==== Finite value estimation ====

When the number of possible values for a parameter is finite, we can formulate the problem of deciding a value for the parameter as an M-ary hypothesis problem. We can assign a particular \(\theta_i\) to hypothesis \(H_i\), and formulate the MAP rule as:
\[\hat{\theta}_{MAP}=\theta_i\text{ iff }P[\Theta=\theta_i|X=x]=\max_{j\in\{0,...,M-1\}}P[\Theta=\theta_j|X=x]\]
\[\hat{\theta}_{MAP}=\arg\max_{\theta\in S_\Theta}P[\Theta=\theta|X=x]\]
When \(X\) is a continuous RV, this becomes:
\[\hat{\theta}_{MAP}=\arg\max_{\theta\in S_\Theta}f_{X|\Theta}(x|\theta)P[\Theta=\theta]\]
This is the same as the hypothesis notation, using \(H_i:\Theta=\theta_i\).

When \(\theta\) is discrete, we want to maximise the a posteriori PMF to satisfy the MAP rule:
\[\hat{\theta}_{MAP}=\arg\max_{\theta}P[\Theta=\theta|X=x]\]
\[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)p_\Theta(\theta)\]
When \(\theta\) is a continuous RV, the a posteriori probability of any exact value is 0, so we use the conditional PDF instead.
\[\hat{\theta}_{MAP}=\arg\max_\theta f_{\Theta|X}(\theta|x)\]
\[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)f_\Theta(\theta)\]

==== Continuous MAP and ML estimation ====

When \(\theta\) is a continuous RV, we try to estimate the value of \(\Theta\) from \(X\). The MAP rule minimises the overall probability of error:
\[P[\hat{\theta}\neq\Theta|X=x]\]
As \(\Theta\) is a continuous RV, perfectly estimating the exact parameter is a zero probability event. The MAP rule is:
\[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)f_\Theta(\theta)\]
For the ML rule, we drop the prior \(f_\Theta(\theta)\).
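To tie the decision rules together, here is a minimal simulation sketch (assuming Python with NumPy; the two Gaussian likelihoods, the prior \(P[H_0]=0.7\) and all numbers are made up for illustration) that applies the MAP threshold to the likelihood ratio and estimates \(P_{FA}\) and \(P_{miss}\); setting \(\gamma=1\) instead gives the ML rule.

<code python>
# MAP binary hypothesis test for two Gaussian hypotheses, by Monte Carlo.
import numpy as np

rng = np.random.default_rng(2)

def f(x, mu, sigma):
    """Gaussian likelihood f_{X|H}(x)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mu0, mu1, sigma = 0.0, 2.0, 1.0      # H0: N(0,1), H1: N(2,1) -- illustrative only
P_H0, P_H1 = 0.7, 0.3                # assumed prior probabilities
gamma = P_H1 / P_H0                  # MAP threshold; use gamma = 1 for the ML rule

n = 200_000
x0 = rng.normal(mu0, sigma, n)       # observations generated under H0
x1 = rng.normal(mu1, sigma, n)       # observations generated under H1

def accept_H0(x):
    """MAP rule: accept H0 iff the likelihood ratio exceeds gamma."""
    return f(x, mu0, sigma) / f(x, mu1, sigma) > gamma

P_FA = np.mean(~accept_H0(x0))       # type I:  accept H1 although H0 is true
P_miss = np.mean(accept_H0(x1))      # type II: accept H0 although H1 is true
P_error = P_FA * P_H0 + P_miss * P_H1
print(P_FA, P_miss, P_error)
</code>

Moving the means further apart or changing the costs/priors and re-running shows the trade-off between the false alarm and miss probabilities discussed above.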
==== Minimum Mean Square Error estimation ====

We can measure the performance of an estimator with the expected mean square error:
\[E[(\Theta-\hat{\theta}(X))^2]\]
The Minimum Mean Square Error (MMSE) estimator is defined as:
\[\hat{\theta}_{MMSE}(X)=\arg\min_{\hat{\theta}}E[(\Theta-\hat{\theta}(X))^2]\]
We can write the mean square error as:
\[MSE=E[E[(\Theta-\hat{\theta}(X))^2|X]]\]
The aim is to find a function \(\hat{\theta}(\cdot)\) that minimises the inner expectation for all \(x\in S_X\), which gives the MMSE estimator. For a given \(X=x\), we can write the inner expectation as:
\[E[(\Theta-\hat{\theta}(x))^2|X=x]=\underbrace{\text{var}(\Theta-\hat{\theta}(x)|X=x)}_{=\text{var}(\Theta|X=x)}+\Big(\underbrace{E[\Theta-\hat{\theta}(x)|X=x]}_{=E[\Theta|X=x]-\hat{\theta}(x)}\Big)^2\]
In minimising this, we find \(\hat{\theta}(x)=E[\Theta|X=x]\), for any \(X=x\). Using Bayes' rule to express the posterior in terms of the prior and the likelihood (the denominator \(f_X(x)\) appears in both the numerator and denominator integrals), we can find the MMSE estimator to be:
\[\hat{\theta}_{MMSE}(x)=\frac{\int_{-\infty}^{\infty}\theta\cdot f_{X|\Theta}(x|\theta)f_\Theta(\theta)d\theta}{\int_{-\infty}^{\infty}f_{X|\Theta}(x|\theta)f_\Theta(\theta)d\theta}\]
The MMSE estimator is also known as the Least Mean Squares (LMS) estimator. Like the MAP estimator, the MMSE estimator requires a prior distribution on \(\Theta\), which may not always be available. The MMSE estimator may be difficult to compute compared to the MAP and ML estimators.

ML estimation is classical statistical inference, where the underlying parameter is deterministic but unknown. MAP and MMSE estimation are Bayesian statistical inference, using posterior probabilities which, through Bayes' rule, are based on prior distributions.
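As a closing sketch of the three estimators (assuming Python with NumPy; the Gaussian prior \(\Theta\sim\mathcal{N}(0,1)\), the noise variance, the observed value and the grid are all illustrative choices), the code below estimates \(\Theta\) from a single noisy observation. For this Gaussian setup the posterior is symmetric, so the MAP and MMSE estimates coincide, while the ML estimate ignores the prior.

<code python>
# MAP, ML and MMSE estimates of Theta from one observation X = x, on a grid.
import numpy as np

def gauss(z, mu, var):
    """Gaussian PDF evaluated at z."""
    return np.exp(-(z - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

sigma2 = 0.5                     # variance of the observation noise (illustrative)
x = 1.2                          # the observed value (illustrative)

theta = np.linspace(-5, 5, 10_001)       # grid over possible parameter values
prior = gauss(theta, 0.0, 1.0)           # f_Theta(theta): N(0, 1) prior
likelihood = gauss(x, theta, sigma2)     # f_{X|Theta}(x | theta)

weights = likelihood * prior             # proportional to the posterior
theta_map = theta[np.argmax(weights)]    # maximises the posterior
theta_ml = theta[np.argmax(likelihood)]  # ignores the prior
theta_mmse = np.sum(theta * weights) / np.sum(weights)   # E[Theta | X = x] on the grid

print(theta_map, theta_ml, theta_mmse)
# For this Gaussian model the closed form for MAP and MMSE is x / (1 + sigma2).
print(x / (1 + sigma2))
</code>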