======= ELEN90054 - Probability and Random Models =======

{{tag>notes unimelb math electrical_engineering}}

====== Introduction ======

===== Set theory =====

A set is a collection of elements, such as the possible outcomes of an experiment. \(S\) is the universal set, also known as the sample space. \(\varnothing\) is the empty set. \(A\cup B\) is the union of two sets, and \(A\cap B\) is their intersection. \(A-B\) is the set subtraction (also \(A\backslash B\)). \(A^C\) is the complement of \(A\) (also \(S-A\)). We can say that \(x\) is an element of \(A\) by \(x\in A\). We can also say \(A\) is a subset of \(B\) as \(A\subset B\). Two sets are mutually exclusive (disjoint) if \(A\cap B=\varnothing\), meaning they have no elements in common. Union and intersection are commutative, associative and distributive.

De Morgan's Laws:
\[(A\cup B)^C=A^C\cap B^C\]
\[(A\cap B)^C=A^C\cup B^C\]

A partition of a set is a grouping of its elements into non-empty, mutually exclusive subsets.

Every time an experiment is run is known as a trial, and the result is the outcome. A random experiment is one where the outcome is nondeterministic, and we use probability to model the outcome. Changing what is observed can result in a different sample space. An event is a collection of outcomes, and hence events are sets and subsets of \(S\). An event is said to have occurred if the outcome is in the event. If the outcome is not in the event, we say that the event's complement has occurred. Events that are not mutually exclusive can occur simultaneously. An event containing a single outcome is called an elementary event. \(S\) is a subset of itself, and hence the event \(S\) always occurs. The empty set is an event which never occurs and is included as a convention.

===== Probability =====

==== Models ====

A probabilistic model is a mathematical description of an unknown situation. It contains:
  - The sample space, containing all possible outcomes
  - A probability law that assigns probabilities to all possible events

The probability of an event is the chance or likelihood of the event occurring when an experiment is performed. We write this set-function (measure) as \(P[A]\). \(P[\cdot]\) is assumed to satisfy the three axioms of probability:
  * **Non-negativity:** \(P[A]\geq 0\) for every event \(A\)
  * **Additivity:** If \(A\cap B=\varnothing\), then \(P[A\cup B]=P[A]+P[B]\)
  * **Normalisation:** \(P[S]=1\)

From these we can derive:
  * \(P[A^C]=1-P[A]\)
  * \(P[\varnothing]=0\)
  * \(\forall A, 0\leq P[A]\leq 1\)
  * \(P[A\cup B]=P[A]+P[B]-P[A\cap B]\)
  * \(A\subset B\implies P[A]\leq P[B]\)

For discrete (countable) sample spaces, the probability of an event is:
\[P[A]=P[\{s_i\in A\}]=P\left[\bigcup_{i:s_i\in A}\{s_i\}\right]=\sum_{i:s_i\in A}P[s_i]\]

For continuous (uncountable) sample spaces, we must use probabilities of events rather than outcomes, as the probability of an individual outcome is zero.

==== Conditional Probability ====

If an experiment is performed and the exact outcome is known, then there is no more randomness. The outcome says whether an event has occurred or not. If we instead know whether an event has occurred, but not the outcome, conditional probability lets us find the chance that another event has also occurred. We write the conditional probability of \(B\) given \(A\) as \(P[B|A]\), whereas \(P[B]\) is the original (unconditioned) probability. We can find this probability by normalising the probability of the intersection by the probability of \(A\).
\[P[B|A]=\frac{P[A\cap B]}{P[A]}\]

We can use this to find:
\[P[A\cap B]=P[A|B]\cdot P[B]=P[B|A]\cdot P[A]\]
Which gives us Bayes' rule:
\[P[B|A]=\frac{P[A|B]\cdot P[B]}{P[A]}\]

For a sample space partitioned into \(B_1,...,B_n\), we can find the probability of an event as:
\[P[A]=P[(A\cap B_1)\cup(A\cap B_2)\cup...\cup(A\cap B_n)]=\sum_{i=1}^n P[A\cap B_i]\]
Which gives us the theorem of total probability:
\[P[A]=P[A|B_1]\cdot P[B_1]+P[A|B_2]\cdot P[B_2]+...+P[A|B_n]\cdot P[B_n]=\sum_{i=1}^n P[A|B_i]\cdot P[B_i]\]
We can also extend Bayes' rule with partitions to:
\[P[B_j|A]=\frac{P[A|B_j]\cdot P[B_j]}{\sum_{k=1}^n P[A|B_k]\cdot P[B_k]}\]
This is finding the posterior (a posteriori) probabilities [post experiment] from the prior (a priori) probabilities [pre experiment]. From the observed event we can infer the conditional probabilities of unobserved events.

==== Independence ====

Two events are said to be independent if:
\[P[A\cap B]=P[A]\cdot P[B]\]
This means that knowing event A occurs does not alter the probability of B occurring, i.e. \(P[B|A]=P[B]\). Independence and disjointness/exclusivity are not the same thing. Independent events may occur simultaneously, unlike mutually disjoint events. Independence is often a natural assumption which can be reasoned about.

==== Tree diagrams ====

Tree diagrams can be used to visualise probabilities. The probabilities on the leaves of the tree (intersections) are the product of the probabilities of all branches to the leaf. Each branch has probability equal to its choice given all the previous branches. Tree diagrams can be useful in visualising sequential experiments. To construct the tree we start with one of the sub-experiments and branch out from the root, then we branch from each node with the next sub-experiment, which is repeated for all sub-experiments. At each step, the probability of a branch is equal to the probability of that choice given all the previous choices, and the probability at a node is equal to the probability of the intersection of all previous choices.

In most sequential experiments, the number of outcomes at each step does not depend on previous results, so we can use the fundamental principle of counting. The fundamental principle of counting states: "If sub-experiment A has \(n\) possible outcomes and sub-experiment B has \(m\) possible outcomes, then there are \(n\times m\) possible experiment outcomes". If the possible outcomes are equally likely, the probability of an event A occurring comes down to the total number of outcomes and the number of outcomes in A. We need to determine how to count the number of possible outcomes for this to work.

Sampling without replacement results in sequential experiments' sample spaces depending on the outcome of previous experiments. Sampling with replacement causes all experiments to be identical independent trials. In both cases, whether or not the order of the sub-experiment outcomes matters can change the probabilities. If sampling k times from n items with replacement, the sample space has \(n^k\) outcomes. If sampling k times from n items without replacement but with ordering, the sample space has \(\frac{n!}{(n-k)!}=(n)_k\) outcomes. If sampling k times from n items without replacement and without ordering, the sample space has \(\frac{n!}{(n-k)!k!}=\begin{pmatrix}n\\k\end{pmatrix}\) outcomes; this is the binomial coefficient. We can consider the more general case of the multinomial coefficient, where \(n=n_0+n_1+...+n_{m-1}\), and the number of different ways a set of n objects can be split up into m subsets equals:
\[\begin{pmatrix}n\\n_0,n_1,...,n_{m-1}\end{pmatrix}=\frac{n!}{n_0!n_1!\cdots n_{m-1}!}\]
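To make the counting formulas above concrete, here is a minimal sketch (assuming Python with only the standard library; the values of \(n\) and \(k\) are arbitrary) that checks the three sampling counts against brute-force enumeration.

<code python>
# Brute-force check of the counting formulas for small n and k.
from itertools import product, permutations, combinations
from math import comb, factorial

n, k = 5, 3
items = range(n)

# Sampling k times from n items with replacement: n**k ordered sequences.
assert len(list(product(items, repeat=k))) == n**k

# Without replacement, with ordering: n!/(n-k)! ordered sequences.
assert len(list(permutations(items, k))) == factorial(n) // factorial(n - k)

# Without replacement, without ordering: the binomial coefficient C(n, k).
assert len(list(combinations(items, k))) == comb(n, k)

print("all counting formulas agree with enumeration")
</code>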
====== Discrete Random Variables ======

Random variables can be used to map outcomes to numbers. More formally, a random variable \(X\) is a function that assigns a real number \(X(\omega)\) to each outcome \(\omega\) in \(S\). The sample space is the domain of X, and the range \(S_X\) is the collection of all possible outputs \(X(\omega)\). Two outcomes can map to the same value. Events can be defined in terms of a single value or a set of output values that a random variable can take.
\[A=\{\omega\in S:X(\omega)\in\mathcal{I}\}\]

===== The Cumulative Distribution Function =====

The Cumulative Distribution Function (CDF) is defined for \(x\in\mathbb{R}\) as:
\[F_X(x)=P[X\leq x]\]
Thus the CDF is the probability of the event \(\{X\leq x\}\). A CDF is built by working through the regions of values that X can take. The function jumps in value where \(P[X=x]\neq0\), with jump height \(P[X=x]\).

The CDF has some properties:
  - \(F_X(x)\in[0,1]\)
  - \(x_1\geq x_2\implies F_X(x_1)\geq F_X(x_2)\)
  - \(\lim_{x\to\infty}F_X(x)=1\)
  - \(\lim_{x\to-\infty}F_X(x)=0\)
  - \(F_X(x)=\lim_{h\to0^+}F_X(x+h)=F_X(x^+)\)

This means the CDF is a non-decreasing function (property 2) that is right continuous (property 5). We can use the CDF to find:
  * \(P[a<X\leq b]=F_X(b)-F_X(a)\)
  * \(P[X>a]=1-F_X(a)\)
  * \(P[X=a]=F_X(a)-F_X(a^-)\)
  * \(P[X<a]=F_X(a^-)\)

===== The Probability Mass Function =====

For a discrete RV, the Probability Mass Function (PMF) gives the probability of each value in the range: \(p_X(x)=P[X=x]\). Common discrete random variables include the Bernoulli RV with parameter \(p\), which takes the value 1 with probability \(p\) and 0 with probability \(1-p\); the binomial RV with parameters \(n\) and \(p\), which counts the number of successes in \(n\) independent Bernoulli trials; and the geometric RV with parameter \(p\), with PMF \(p_X(k)=(1-p)^{k-1}p\), which counts the number of trials up to and including the first success. For a geometric RV:
\[P[X=k+j|X>j]=P[X=k]\]
This is known as the memory-less property (geometric is the only memory-less discrete random variable).

===== Poisson Random Variable =====

\(N\) is a Poisson Random Variable with parameter \(\lambda\) if:
\[p_N(k)=P[N=k]=\frac{\lambda^k}{k!}e^{-\lambda},k=0,1,2,...\]
Used often for counting the number of events in a given time period or region of space. Denoted as:
\[N\sim\text{Poisson}(\lambda)\]

===== Statistics =====

A discrete RV is completely described by its PMF or CDF, but we are not always interested in so much information. Statistics are a summary of the RV used for making comparisons. Three different statistics are the mode, median and mean.

The mode is a number \(x_{mod}\in S_X\) such that:
\[p_X(x_{mod})\geq p_X(x),\forall x\]
That is, the mode is the most probable outcome.

The median is a number \(x_{med}\) such that:
\[P[X\geq x_{med}]\geq 0.5\text{ and }P[X\leq x_{med}]\geq 0.5\]
Ideally it exactly balances the probability of lying above or below, but this is not always satisfied by a discrete RV. The median does not always need to be in the range \(S_X\).

The mean, or average, is the expected value of the distribution.
\[\mu_X=E[X]=\sum_{x_i\in S_X}x_iP[X=x_i]\]
This is the average empirical outcome if we were to perform the experiment many times. The expected value of a Bernoulli RV is:
\[E[X]=0\cdot(1-p)+1\cdot p=p\]
For a binomial RV, which is a sum of \(n\) independent \(I_k\sim\text{Bernoulli}(p)\) RVs:
\[E[X]=E\left[\sum_{k=1}^n I_k\right]=np\]

The expectation operator has the following properties:
  - \(E[a]=a\)
  - \(E[aX]=aE[X]\)
  - \(E[X+Y]=E[X]+E[Y]\)
  - \(E[aX+bY]=aE[X]+bE[Y]\)

Hence expectation is a linear operator. The expected value of a geometric variable is:
\[E[X]=\sum_{k=1}^\infty k[(1-p)^{k-1}p]=\frac{1}{p}\]
The mean may not be in the range \(S_X\), i.e. \(E[X]\notin S_X\). For a Poisson RV:
\[E[N]=\sum_{k=0}^{\infty}k\frac{\lambda^k}{k!}e^{-\lambda}=0+\sum_{k=1}^{\infty}k\frac{\lambda^k}{k!}e^{-\lambda}=\lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!}=\lambda\]

The variance of a RV is:
\[\sigma^2_X=\text{Var}(X)=E[(X-E[X])^2]=\sum_{x_i\in S_X}(x_i-\mu_X)^2p_X(x_i)\]
This is an indication of the spread of the values about its mean. Variance can also be calculated as:
\[\text{Var}(X)=E[X^2]-(E[X])^2\]
Both of these values can be found separately.
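As a quick numerical illustration of these statistics, the sketch below (a minimal example assuming Python with NumPy; the PMF values are made up) computes the mean from \(\sum_i x_ip_X(x_i)\) and checks that the two forms of the variance agree.

<code python>
# Mean and variance of a discrete RV computed directly from its PMF.
import numpy as np

x = np.array([0, 1, 2, 3])          # values in S_X (hypothetical)
p = np.array([0.1, 0.4, 0.3, 0.2])  # PMF values, must sum to 1
assert np.isclose(p.sum(), 1.0)

mean = np.sum(x * p)                          # E[X]
var_def = np.sum((x - mean) ** 2 * p)         # E[(X - E[X])^2]
second_moment = np.sum(x ** 2 * p)            # E[X^2]
var_alt = second_moment - mean ** 2           # E[X^2] - (E[X])^2

print(mean, var_def, var_alt)
assert np.isclose(var_def, var_alt)           # both variance formulas agree
</code>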
The variance has two useful properties:
  - \(\text{Var}(aX)=a^2\text{Var}(X)\)
  - If \(X\) and \(Y\) are independent, \(\text{Var}(X+Y)=\text{Var}(X)+\text{Var}(Y)\)

The variance of a Bernoulli trial is:
\[\text{Var}(X)=p(1-p)\]
For a Binomial:
\[\text{Var}(X)=np(1-p)\]
For a Geometric:
\[\text{Var}(X)=\frac{1-p}{p^2}\]
For a Poisson:
\[\text{Var}(X)=\lambda\]

The Poisson RV can be used to approximate the binomial RV, which can be hard to calculate when \(n\) is large. For distributions where \(n\) is large and \(p\) is small, we can set \(\lambda=np\). We do need to note that the Poisson distribution counts over an infinite range, whereas the binomial is finite, so we have problems as \(n\to\infty\).

====== Continuous Random Variables ======

If the results of an experiment are uncountable, forming a continuum, we call the experiment continuous. Analogue measurements, for example, are continuous. Here, the probability that the outcome is an exact number is zero, so we assess probabilities on a range. The continuous random variable has a CDF that is continuous everywhere and has a derivative (which may fail to exist at a finite number of points). The equivalent of the PMF for discrete random variables is the probability density function (PDF). The CDF can be written as:
\[F_X(x)=\int_{-\infty}^xf_X(u)du\]
Hence the PDF can be written as:
\[f_X(x)=\frac{dF_X(x)}{dx},x\in\mathbb{R}\]
We can say we have an element of probability:
\[f_X(x)dx=P[x<X\leq x+dx]\]

===== Uniform Random Variable =====

The uniform random variable on \([a,b]\) has the following PDF:
\[f_X(x)=\begin{cases}\frac{1}{b-a},&a\leq x\leq b\\0,&\text{else}\end{cases}\]
Its CDF is:
\[F_X(x)=\begin{cases}0,&x<a\\\frac{x-a}{b-a},&a\leq x\leq b\\1,&x>b\end{cases}\]
The mean and variance are:
\[\mu_X=\frac{b+a}{2}\]
\[\text{Var}(X)=\frac{(b-a)^2}{12}\]

===== Exponential Random Variable =====

The exponential random variable has the following PDF:
\[f_X(x)=\begin{cases}\lambda e^{-\lambda x},&x\geq 0\\0,&\text{else}\end{cases}\]
Its CDF is:
\[F_X(x)=\begin{cases}0,&x<0\\1-e^{-\lambda x},&x\geq 0\end{cases}\]
Its mean and variance are:
\[\mu_X=\frac{1}{\lambda}\]
\[\sigma^2_X=\frac{1}{\lambda^2}\]
Used in studies of waiting and service times. \(\lambda\) is in units of number per time, so the expected value is the average waiting time for one occurrence. The exponential RV is the continuous equivalent of the geometric RV. It also has the same memory-less property.

The CDF of an exponential RV can be constructed by modelling the number of arrivals in the first \(t\) units of time as a Poisson RV \(N_t\) with parameter \(\lambda t\). The CDF is then 1 less the probability that no arrivals happen in the first \(t\) units of time.
\[F_X(t)=P[X\leq t]=1-P[X>t]=1-P[N_t=0]=1-e^{-\lambda t}\]
The Poisson RV models the number of arrivals, whereas the exponential RV models the inter-arrival time. The memory-less property means that when conditioned on a past event, time is restarted and a new time 0 is established.

===== Gaussian (Normal) Random Variable =====

Many phenomena have a bell-shaped empirical distribution. This is often modelled with a Gaussian or normal PDF with mean \(\mu\) and variance \(\sigma^2\):
\[f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)\]
\[X\sim\mathcal{N}(\mu,\sigma^2)\]
As the variance increases, the curve becomes shallower and wider. A closed form solution doesn't exist for the CDF, so there are numerical tables to find values of the standard (\(\mu=0,\sigma=1\)) CDF \(\Phi\). For nonstandard normal RVs, we can standardise them and use \(F_X(x)=\Phi\left(\frac{x-\mu}{\sigma}\right)\). There is also the Q function, defined as the complement of \(\Phi\). We can often model noise by adding a Gaussian with zero mean to a signal.
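The standard-normal tables can be reproduced numerically. The sketch below (assuming Python; \(\Phi\) is built from the error function, and the values of \(\mu\), \(\sigma\), \(a\) and \(b\) are arbitrary) standardises a non-standard Gaussian and evaluates probabilities through \(\Phi\) and \(Q\).

<code python>
# Gaussian probabilities via the standard normal CDF Phi and the Q function.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def Q(z):
    """Complement of the standard normal CDF."""
    return 1.0 - phi(z)

mu, sigma = 5.0, 2.0   # X ~ N(mu, sigma^2), values chosen for illustration
a, b = 4.0, 9.0

# P[a < X <= b] by standardising: Phi((b-mu)/sigma) - Phi((a-mu)/sigma)
p_interval = phi((b - mu) / sigma) - phi((a - mu) / sigma)

# P[X > b] using the Q function
p_tail = Q((b - mu) / sigma)

print(p_interval, p_tail)
</code>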
The Q function is reflective: for a standard normal RV, \(P[X\leq y]=1-P[X>y]=1-Q(y)=Q(-y)\). For probabilities involving the magnitude \(|X|\) (with \(y\geq 0\)): \(P[|X|\geq y]=P[X\leq -y]+P[X\geq y]=2P[X\geq y]=2Q(y)\). Likewise: \(P[|X|\leq y]=1-P[|X|\geq y]=1-2Q(y)\).

====== Mixed random variables ======

These have CDFs made of continuous increasing segments together with jumps. As the CDF isn't purely discrete, a PMF isn't possible. We can consider a discrete RV's PMF as a PDF by considering the probability points as Dirac delta functions. A delta function has infinite height with infinitesimal width, such that the area under it is 1. We can express a PMF as a PDF by:
\[f_X(x)=\sum_{x_k\in S_X}p_X(x_k)\delta(x-x_k)\]
We can go from a CDF to a PDF with a generalised derivative, by defining the derivative of a jump to be equal to a Dirac delta of area equal to the jump. If a delta exists at a point, then that exact value has non-zero probability, which we can calculate.

====== Functions of a random variable ======

If we consider \(Y=g(X)\), \(Y\) inherits randomness from \(X\), so how can we express \(Y\) based on the PDF/CDF/PMF of \(X\)?

===== Discrete Random Variables =====

\(Y\) will also be a discrete RV, taking on the values \(g(x_k),\forall x_k\in S_X\). As such, the PMF of \(Y\) can be determined by:
\[p_Y(y)=P[Y=y]=\sum_{i:g(x_i)=y}p_X(x_i)\]

===== Continuous Random Variables =====

\[F_Y(y)=P[Y\leq y]=P[g(X)\leq y]=P[g(X)\in(-\infty,y]]=P[X\in g^{-1}((-\infty,y])]=\int_{g^{-1}((-\infty,y])}f_X(x)dx\]

===== Generate a RV with a Prescribed CDF =====

We want to generate \(X\) with CDF \(F_X(x)\), so we start with a standard uniform RV \(U\sim U[0,1]\). We then set \(X=g(U)=F_X^{-1}(U)\).

===== Expectation of a function of a RV =====

\[E[Y]=E[g(X)]=\int_{-\infty}^\infty g(x)f_X(x)dx\]
\[E[Y]=E[g(X)]=\sum_{x_k\in S_X}g(x_k)p_X(x_k)\]
For the continuous and discrete cases respectively. Variance is an example of the expectation of a function of a random variable.

====== Joint random variables ======

It can be useful to visualise joint variables as a random vector. We can visualise a pair of random variables as two functions mapping an outcome to two different number lines. The joint mapping takes the outcome to a point on the plane. This point can contain more information than only one of the mapping functions. As with the single RV case, we don't know the outcome beforehand, so we want to look at the joint probabilities of points and regions on the plane.
\[P[(X\leq x)\cap(Y\leq y)]=P[X\leq x,Y\leq y]\]
\[P[(X=x)\cap(Y=y)]=P[X=x,Y=y]\]
If both functions are discrete RVs, the joint probability mass function is:
\[p_{X,Y}(x,y)=P[X=x,Y=y]\]
The range of the pair is \(S_{X,Y}\). The sum of the probabilities over the entire sample space is 1. The joint probabilities have the following properties:
  - \(0\leq p_{X,Y}(x,y)\leq 1,\forall(x,y)\in\mathbb{R}^2\)
  - \(p_{X,Y}(x_i,y_j)=0\) if \((x_i,y_j)\notin S_{X,Y}\)
  - \(\sum_{x\in S_X}\sum_{y\in S_Y}p_{X,Y}(x,y)=1\)

Joint RVs are said to be independent if the probability of the point is equal to the product of the marginal probabilities:
\[p_{X,Y}(x,y)=P[X=x,Y=y]=p_X(x)p_Y(y)\]
Whether or not the variables are independent, we can recover the individual PMFs by summing over the other variable, e.g.
\[P[X=1]=\sum_{i}P[X=1,Y=y_i]\]
With 2 RVs, we need 3D plots to express PMFs, CDFs and PDFs, which can be difficult. We use level curves and labels to represent them in a 2D plane. Joint PMFs are easy to show in 2D, joint PDFs need level curves, making them harder, and joint CDFs are hard to plot and usually not meaningful.
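The joint PMF, its marginal PMFs (formalised below) and the independence condition can all be checked numerically. A minimal sketch, assuming Python with NumPy and a made-up joint PMF table:

<code python>
# Joint PMF of (X, Y) as a table: rows index values of X, columns values of Y.
import numpy as np

p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])      # hypothetical joint PMF, sums to 1
assert np.isclose(p_xy.sum(), 1.0)

p_x = p_xy.sum(axis=1)               # marginal PMF of X: sum over y
p_y = p_xy.sum(axis=0)               # marginal PMF of Y: sum over x

# Independence check: does p_{X,Y}(x, y) = p_X(x) p_Y(y) hold for every (x, y)?
independent = np.allclose(p_xy, np.outer(p_x, p_y))
print(p_x, p_y, independent)
</code>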
Given a joint RV, we can consider each random variable individually by computing the marginal PMFs of X and Y.
\[p_X(x)=P[X=x]=P[X=x,Y=\text{anything}]=\sum_{y_k\in S_Y}P[X=x,Y=y_k]\]
This turns the 2D function into a 1D function. Given a joint PMF, we can always compute the marginal PMFs.

===== Joint CDF =====

The joint CDF is:
\[F_{X,Y}(x,y)=P[X\leq x,Y\leq y]\]
The event is a semi-infinite rectangular region of the plane. If the random variables are independent, the product of the variables' CDFs is the joint CDF:
\[F_{X,Y}(x,y)=F_X(x)F_Y(y)\]
We can also get marginal CDFs:
\[F_X(x)=P[X\leq x]=P[X\leq x,Y\leq\infty]=F_{X,Y}(x,\infty)\]
If both X and Y are discrete, there will be jumps, forming a 3D staircase function. If both X and Y are continuous, the joint CDF will be a continuous surface everywhere. We can then define a joint PDF, which gives the concentration (density) of the probability per unit area, and work with double integrals.
\[f_{X,Y}(x,y)=\frac{\partial^2F_{X,Y}(x,y)}{\partial x\partial y}\]
\[F_{X,Y}(x,y)=\int_{-\infty}^x\int_{-\infty}^y f_{X,Y}(u,v)\,dv\,du\]
The joint PDF is the product of the marginal PDFs when independent. The marginal PDF can be computed from the joint PDF:
\[f_X(x)=\frac{dF_{X}(x)}{dx}=\frac{dF_{X,Y}(x,\infty)}{dx}=\frac{d}{dx}\int_{-\infty}^x\int_{-\infty}^\infty f_{X,Y}(u,v)\,dv\,du=\int_{-\infty}^\infty f_{X,Y}(x,y)dy\]

The joint CDF has the following properties:
  - \(0\leq F_{X,Y}(x,y)\leq 1\)
  - \(\lim_{x,y\to\infty}F_{X,Y}(x,y)=F_{X,Y}(\infty,\infty)=1\)
  - If \(x_1\leq x_2\) and \(y_1\leq y_2\), then \(F_{X,Y}(x_1,y_1)\leq F_{X,Y}(x_2,y_1)\leq F_{X,Y}(x_2,y_2)\)
  - \(\lim_{x\to-\infty}F_{X,Y}(x,y)=F_{X,Y}(-\infty,y)=0\)
  - \(\lim_{x\to a^+}F_{X,Y}(x,y)=F_{X,Y}(a^+,y)=F_{X,Y}(a,y)\)

As a consequence of the third property, we get:
\[P[x_1<X\leq x_2,y_1<Y\leq y_2]=F_{X,Y}(x_2,y_2)-F_{X,Y}(x_1,y_2)-F_{X,Y}(x_2,y_1)+F_{X,Y}(x_1,y_1)\]
The last two properties mean the CDF is top and right continuous. If both X and Y are continuous RVs, then the CDF is continuous from all directions.

===== Joint PDF =====

The joint PDF has the following properties:
  - \(f_{X,Y}(x,y)\geq 0\)
  - \(\int_{-\infty}^\infty\int_{-\infty}^\infty f_{X,Y}(x,y)dxdy=1\)
  - \(P[(X,Y)\in A]=\iint_{(x,y)\in A}f_{X,Y}(x,y)dxdy\)
  - \(P[a<X\leq b,c<Y\leq d]=\int_c^d\int_a^b f_{X,Y}(x,y)\,dx\,dy\)

===== Markov inequality =====

For a non-negative random variable \(X\) and a constant \(a>0\):
\[P[X\geq a]\leq\frac{E[X]}{a}\]
In general this provides a loose upper bound, although little information is needed to provide the bound.

===== Chebyshev inequality =====

For a given random variable \(X\) and a constant \(a>0\):
\[P[|X-\mu_X|\geq a]\leq\frac{\text{var}(X)}{a^2}\]
This uses more information than the Markov inequality (it needs the variance). It tells us that if the variance is small, \(X\) is unlikely to be far from the mean. We can also tell that most of the probability mass/density of a RV is within a few standard deviations of the mean.

===== Sum of random variables =====

Many macroscopic phenomena can be seen as the sum of a number of probabilistic microscopic actions. When we want to add two independent random variables, we convolve their PDFs to get the sum's PDF. We can repeat this to sum an arbitrary number of variables. If we have a sum of iid random variables, the expected value of the sum is \(E[S_n]=n\mu\) and its variance is \(\text{Var}(S_n)=n\sigma^2\). If we take the mean of n iid random variables, the expected value is \(E[M_n]=\mu\) and the variance is \(\text{Var}(M_n)=\sigma^2/n\). The latter can be thought of as taking a sample from a distribution.
The sample has a mean equal to that of the original distribution, but has a variance that approaches 0 as the sample size increases. In sampling, we can get outliers that can distort the sample mean, but sampling allows us to create bounds on the sample to give confidence. For an example Bernoulli variable which we sample, we can place bounds due to the Chebyshev inequality.
\[P[|M_n-p|\geq a]\leq\frac{\text{Var}(M_n)}{a^2}=\frac{\text{Var}(X_n)}{na^2}=\frac{p(1-p)}{na^2}\leq\frac{1/4}{na^2}\]

==== Weak law of large numbers ====

\[\text{Var}(M_n)=\frac{\sigma^2}{n}\implies\lim_{n\to\infty}\text{Var}(M_n)=0\]
By Chebyshev's inequality (for \(\epsilon>0\)):
\[P[|M_n-\mu|\geq\epsilon]\leq\frac{\text{Var}(M_n)}{\epsilon^2}\]
Taking this limit gives the weak law of large numbers:
\[\lim_{n\to\infty}P[|M_n-\mu|\geq\epsilon]=0\]
Alternately we can express this as:
\[\lim_{n\to\infty}P[|M_n-\mu|\leq\epsilon]=1\]

==== Empirical data and probabilities ====

We get probabilistic models for data from statistical inference. For a given event, we can calculate an empirical probability by running \(n\) trials and counting the number of times the event occurs. We can consider our experiment as a Bernoulli trial, where either the event occurs or it doesn't, which lets us use the above example about the polling problem.
\[P[|\tilde{P}_n[A]-P[A]|<\epsilon]\geq 1-\frac{P[A](1-P[A])}{n\epsilon^2}=1-\frac{\text{Var}(\tilde{P}_n[A])}{\epsilon^2}\]
From this we can see that to double our precision, we need 4 times the number of trials.

===== Central limit theorem =====

For a given sum \(S_n\) of \(n\) iid random variables, each with mean \(\mu\) and variance \(\sigma^2\), we can define:
\[Z_n=\frac{S_n-n\mu}{\sigma\sqrt{n}}\]
The mean of \(Z_n\) is \(E[Z_n]=0\), and its variance is \(\text{Var}(Z_n)=\frac{\text{Var}(S_n)}{(\sigma\sqrt{n})^2}=1\). The central limit theorem states that:
\[\lim_{n\to\infty}P[Z_n\leq z]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^ze^{-x^2/2}dx=\Phi(z)\]
In other words, \(Z_n\) converges to \(\mathcal{N}(0,1)\) as \(n\to\infty\). This is regardless of the underlying distribution, only requiring a finite mean and variance, provided the variance is non-zero.

The central limit theorem allows us to approximate sums and averages of iid random variables with Gaussian distributions, provided \(n\) is sufficiently large. We can see this in a system where, if the noise is due to a large number of events, the noise will be Gaussian. Additionally the empirical mean can be modelled as Gaussian for sufficiently large \(n\). As a loose guideline, \(n=30\) can be used, but this depends on the underlying distribution, with some resembling a Gaussian after fewer samples and some after more. Sometimes the CLT will be used for fewer samples as it is easier than using the underlying distribution. We can find that the original sum approaches \(F_{S_n}(a)\approx\Phi\left(\frac{a-n\mu}{\sigma\sqrt{n}}\right)\), meaning that \(S_n\sim\mathcal{N}(n\mu,n\sigma^2)\) approximately. As the sample size increases, the mean of the sample tends to be distributed as \(M_n\sim\mathcal{N}(\mu,\sigma^2/n)\).

==== De Moivre-Laplace Approximation ====

Because the CLT approximates with a continuous random variable, the probability of a single outcome is 0, which limits its use for discrete random variables. We can use the De Moivre-Laplace approximation here to find the probability of a single value (or a range of values) for binomial variables.
If \(S_n\sim\text{Binomial}(n,p)\), for sufficiently large \(n\) and \(k_1\leq k_2\), we can approximate:
\[P[k_1\leq S_n\leq k_2]\approx\Phi\left(\frac{k_2+0.5-np}{\sqrt{np(1-p)}}\right)-\Phi\left(\frac{k_1-0.5-np}{\sqrt{np(1-p)}}\right)\]
This is equivalent to the approximation:
\[P[k_1\leq S_n\leq k_2]\approx P[k_1-0.5\leq S_n\leq k_2+0.5]\]

==== Random vectors ====

For a sequence of iid random vectors, with mean \(E[\mathbf{X}_i]=\mathbf{\mu_X}\) and covariance \(\mathbf{C_X}\), we can define:
\[\mathbf{S}_n=\sum_{i=1}^n\mathbf{X}_i\]
\[\mathbf{Z}_n=\frac{1}{\sqrt{n}}(\mathbf{S}_n-n\mathbf{\mu_X})\]
So as \(n\to\infty\), \(\mathbf{Z}_n\sim\mathcal{N}(\mathbf{0},\mathbf{C_X})\).

===== Moment Generating Function (MGF) =====

The Moment Generating Function is:
\[\phi_X(s)=E[e^{sX}]\]
So for a discrete RV:
\[\phi_X(s)=\sum_{x\in S_X}e^{sx}p_X(x)\]
And for a continuous RV:
\[\phi_X(s)=\int_{-\infty}^\infty e^{sx}f_X(x)dx\]
In the continuous case, \(\phi_X(s)\) is the two sided Laplace transform of \(f_X(x)\), with \(s\) instead of \(-s\). In the discrete case, \(\phi_X(s)\) is a Z-transform of \(p_X\), with \(z=e^{-s}\). The MGF has use in proving the CLT.

The kth moment of a random variable \(X\) is defined as \(E[X^k]\), where \(k\) is a positive integer. For some distributions finding the moment will involve integration by parts \(k-1\) times. Instead, the kth derivative of the MGF, evaluated at \(s=0\), gives the moment:
\[\left.\frac{d^k\phi_X(s)}{ds^k}\right|_{s=0}=E[X^k]\]
\(\phi_X(s)\) is only defined where \(E[e^{sX}]\) is finite, meaning within the region of convergence. As summing two independent random variables results in the convolution of their PDFs, in the MGF domain we can multiply the individual MGFs to get the MGF of the sum.

===== Random Processes =====

A stochastic process (random process) is a mapping from the sample space \(S\) to the set of time-domain waveforms (either discrete or continuous time). In that sense, the experiment results in an outcome that is a time domain signal.

==== Continuous time ====

In continuous time a random process is usually denoted as \(X(t)\) or \(\{X(t)\}\). The process is a function of two variables \(X(t,\omega)\), where \(t\) is the time and \(\omega\) is the outcome, although \(\omega\) is commonly dropped. For a fixed time \(t_0\), \(X(t_0,\omega)\) is a standard RV. For a fixed outcome \(\omega_0\), \(X(t,\omega_0)\) is a deterministic signal. Fixing both results in \(X(t_0,\omega_0)\) being a constant. For a fixed outcome, the deterministic signal is referred to as a sample function, sample path or realisation of \(X(t)\). For every time, \(X(t,\omega)\) is a random variable, so we can think of \(X(t)\) as an infinite set of random variables. The ensemble of a random process is the set of all possible time domain functions that can result from the experiment, analogous to the range of a RV.

==== Discrete time ====

Discrete time processes are similar to continuous time, but using discrete time instants \(n\in\mathbb{Z}\) instead of \(t\in\mathbb{R}\). It is written as \(X(n)\) or \(\{X(n)\}\). A discrete time random process is a sequence of random variables, and the ensemble is all possible sequences.

==== Types of random processes ====

Random processes can be either:
  * Discrete time, discrete valued (DT DV)
  * Discrete time, continuous valued (DT CV)
  * Continuous time, discrete valued (CT DV)
  * Continuous time, continuous valued (CT CV)

Random processes are not just limited to time, and can be generalised to a random field.
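As a small illustration of the two views of a random process (a sketch assuming Python with NumPy; the choice of a random-walk process is just an example), the code below generates several realisations of a discrete-time process: each row is one sample path \(X(n,\omega_0)\), while each column, read across realisations, is the ordinary random variable \(X(n_0)\).

<code python>
# A discrete-time random process: a random walk built from iid +/-1 steps.
import numpy as np

rng = np.random.default_rng(0)
num_realisations, num_steps = 5, 100

steps = rng.choice([-1, 1], size=(num_realisations, num_steps))
paths = np.cumsum(steps, axis=1)     # each row is one sample path (realisation)

# Fixing the outcome gives a deterministic sequence (one row).
sample_path = paths[0]

# Fixing the time gives an ordinary random variable (one column across outcomes).
X_at_n0 = paths[:, 50]

print(sample_path[:10], X_at_n0)
</code>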
==== Specifying random processes ====

It can be difficult to fully statistically characterise a random process. The first-order distribution of a random process \(\{X(t)\}\) is:
\[F_{X(t)}(x)=P[X(t)\leq x]\]
This is the first-order CDF, and is a function of \(x\) and \(t\). The first-order PDF is:
\[f_{X(t)}(x)=\frac{\partial}{\partial x}F_{X(t)}(x)\]
We can also define mean and variance functions:
\[\mu_X(t)=E[X(t)]=\int_{-\infty}^\infty xf_{X(t)}(x)dx\]
\[\text{Var}[X(t)]=E\left\{[X(t)-\mu_X(t)]^2\right\}\]

The second-order distribution compares how two times are related. The CDF is:
\[F_{X(t_1),X(t_2)}(x_1,x_2)=P[X(t_1)\leq x_1,X(t_2)\leq x_2]\]
The PDF is:
\[f_{X(t_1),X(t_2)}(x_1,x_2)=\frac{\partial^2}{\partial x_1\partial x_2}F_{X(t_1),X(t_2)}(x_1,x_2)\]
\(X(t_1)\) and \(X(t_2)\) are a pair of random variables, so we can look at their autocorrelation and autocovariance.
\[R_X(t_1,t_2)=E[X(t_1)\cdot X(t_2)]\]
\[C_X(t_1,t_2)=\text{Cov}[X(t_1),X(t_2)]=R_X(t_1,t_2)-\mu_X(t_1)\mu_X(t_2)\]
For the nth order distribution, we can write the CDF and PDF as:
\[F_{X(t_1),...,X(t_n)}(x_1,...,x_n)=P[X(t_1)\leq x_1,...,X(t_n)\leq x_n]\]
\[f_{X(t_1),...,X(t_n)}(x_1,...,x_n)=\frac{\partial^n}{\partial x_1...\partial x_n}F_{X(t_1),...,X(t_n)}(x_1,...,x_n)\]
A stochastic process is specified by the collection of nth order joint distribution functions for any \(n\) and any choice of sampling times \(t_1,...,t_n\). This means we may need an infinite collection of distribution functions to fully characterise a stochastic process. Some processes have simple models that are easier to deal with.

==== Stationarity ====

Some processes have (strict) stationarity. That means that for any \(n\in\mathbb{N}\), times \(t_1,...,t_n\) and time shift \(\tau\):
\[F_{X(t_1),...,X(t_n)}(x_1,...,x_n)=F_{X(t_1+\tau),...,X(t_n+\tau)}(x_1,...,x_n)\]
\[f_{X(t_1),...,X(t_n)}(x_1,...,x_n)=f_{X(t_1+\tau),...,X(t_n+\tau)}(x_1,...,x_n)\]
Then the random process \(\{X(t)\}\) is strictly stationary. This means that the joint distributions are time-shift invariant. This must hold true for any number and location of time instants and any \(\tau\). Stationarity implies that the CDF/PDF/PMF only depends on the relative times between instants.

For a first-order distribution:
\[F_{X(t)}(x)=F_{X(t+\tau)}(x)\]
\[f_{X(t)}(x)=f_{X(t+\tau)}(x)\]
Meaning that the distribution is equal for all times, letting us write:
\[f_{X(t)}(x)=f_{X(t+\tau)}(x)=f_{\overline{X}}(x),\forall t,\tau\]
\[X(t)\stackrel{d}{=}\overline{X},\forall t\]
As a result we have the mean and variance being constant:
\[E[X(t)]=\mu\]
\[\text{Var}(X(t))=\sigma^2\]

For a second-order distribution:
\[f_{X(t_1),X(t_1+\tau)}(x_1,x_2)=f_{X(t_2),X(t_2+\tau)}(x_1,x_2)=f_{X(0),X(\tau)}(x_1,x_2)\]
As these PDFs are the same, so are the autocorrelation and autocovariance functions.
\[R_X(t_1,t_1+\tau)=R_X(0,\tau)=R_X(\tau)\]
\[C_X(t_1,t_1+\tau)=C_X(0,\tau)=C_X(\tau)\]
\[R_X(s,t)=R_X(t-s)\]
\[C_X(s,t)=C_X(t-s)\]
The joint distribution of samples of an iid random process can be written without any dependency on \(t\), so an iid process must be stationary.

=== Wide Sense Stationarity (WSS) ===

A random process \(\{X(t)\}\) is WSS if:
  - \(E[X(t)]=\mu,\forall t\in\mathbb{R}\)
  - \(R_X(t_1,t_2)=E[X(t_1)X(t_2)]\) depends only on the time difference \(t_2-t_1\), i.e. \(R_X(\tau)=R_X(t,t+\tau)\)

Wide sense stationarity is weaker than strict stationarity.
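A quick way to get a feel for wide sense stationarity is to estimate the mean and autocorrelation functions from many realisations. The sketch below (assuming Python with NumPy; an iid zero-mean process is used purely as an example, and ensemble averages stand in for expectations) checks that \(R_X(t_1,t_2)\) estimated at different absolute times but the same lag comes out roughly equal.

<code python>
# Empirical check of WSS-like behaviour for an iid, zero-mean process.
import numpy as np

rng = np.random.default_rng(1)
num_realisations, num_samples = 20000, 50
X = rng.normal(0.0, 1.0, size=(num_realisations, num_samples))  # iid N(0,1) samples

def R_hat(t1, t2):
    """Ensemble-average estimate of R_X(t1, t2) = E[X(t1) X(t2)]."""
    return np.mean(X[:, t1] * X[:, t2])

# Same lag (t2 - t1 = 3) at different absolute times: estimates should agree.
print(R_hat(5, 8), R_hat(30, 33))            # both near 0 for this process
print(R_hat(10, 10), R_hat(40, 40))          # lag 0: both near the variance, 1.0
print(np.mean(X[:, 5]), np.mean(X[:, 40]))   # mean function, constant (near 0)
</code>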
For a WSS process, the autocorrelation has the following properties:
  - \(R_X(\tau)=R_X(-\tau)\), i.e. \(R_X\) is an even function
  - \(R_X(0)=E[X^2(t)]=\text{average power of }X(t)\)
  - \(|R_X(\tau)|\leq R_X(0)\)
  - \(P[|X(t+\tau)-X(t)|>\epsilon]\leq\frac{2(R_X(0)-R_X(\tau))}{\epsilon^2}\)

Properties 1, 3 and 4 also apply to the autocovariance, which differs from the autocorrelation by a constant. Property 4 is a bound on how fast the process can change: the larger \(R_X(\tau)\), the closer \(X(t+\tau)\) and \(X(t)\) must be in a probabilistic sense. There is also the WSS periodic property, being that if \(R_X(T)=R_X(0)\) for some \(T\neq 0\), then \(R_X(\tau)\) is periodic with period \(T\) and \(X(t+T)=X(t)\) with probability 1.

==== Gaussian Process ====

\(\{X(t)\}\) is a Gaussian process if \(\forall n\in\mathbb{N}\) and \(t_1,...,t_n\), the samples \(X(t_1),...,X(t_n)\) are jointly Gaussian.
\[\mathbf{X}=[X(t_1),...,X(t_n)]^T\sim\mathcal{N}(\mathbf{\mu_X},\mathbf{C_X})\]
If a Gaussian random process is wide sense stationary, then it is strictly stationary as well. This is because the joint density is specified by the mean and autocovariance, which for a WSS process are time-shift invariant, so the joint density does not change with a time shift.

==== White noise ====

A WSS discrete time process \(\{W(n)\}\) is white noise if its mean is 0 and \(R_W(l)=0\) when \(l\neq0\).
\[R_W(l)=C_W(l)=\sigma^2_W\cdot\delta(l)\]
Here \(\sigma^2_W\) is the average noise power, and any two distinct time samples are always uncorrelated.

A WSS continuous time process \(\{W(t)\}\) is white noise if its mean is 0 and:
\[R_W(\tau)=C_W(\tau)=\alpha\delta(\tau)\]
Again any two distinct samples are uncorrelated, but \(R_W(0)=\infty\) due to the \(\delta\) function. White noise is white as its power spectral density is constant for all frequencies, like white light is formed from all visible colours. For a WSS random process, the power spectral density is the Fourier transform of the autocorrelation.

==== Counting Process ====

A counting process \(\{N(t)\}\) counts the occurrence of events, taking on the values of the non-negative integers. The process usually starts at \(t=0\), so \(N(t)=0,t<0\). Discontinuities occur at the times of arrival, so \(N(t)\) counts the number of arrivals in \([0,t]\) and is non-decreasing. The inter-arrival times are also random, so this is a discrete valued continuous time random process.

=== Poisson Process ===

A counting process \(\{N(t)\}\) is a Poisson Process if:
  - The number of arrivals in any interval \((s,t]\) is a Poisson RV with parameter \(\lambda(t-s)\)
  - Any pair of non-overlapping intervals has an independent number of arrivals: the Independent Increments Property

The parameter \(\lambda\) is the rate or intensity of the process, expressed as number of arrivals per period of time. For a given period, we can write the expectation, variance and first-order PMF:
\[N(t)\sim\text{Poisson}(\lambda t)\]
\[E[N(t)]=\lambda t\]
\[\text{Var}[N(t)]=\lambda t\]
\[p_{N(t)}(k)=P[N(t)=k]=\frac{(\lambda t)^k}{k!}e^{-\lambda t}\]
We can see that \(\{N(t)\}\) is neither stationary nor WSS. When intervals don't overlap, the arrival counts are independent, so it is easy to calculate the joint probability. When intervals overlap, we need to split into independent intervals. This can be done by splitting into conditional probabilities:
\[P[A\cap B]=P[B|A]P[A]\]
where \(A\) and \(B\) are events on overlapping intervals of the process. If \(A:N(s)=i\) and \(B:N(t)=j\) with \(s<t\), then \(B\) given \(A\) is the event of \(j-i\) arrivals in \((s,t]\), i.e. \(N(t)-N(s)=j-i\).
Where only part of the intervals overlap, we need to split into 3 disjoint intervals and sum over all possible interval combinations.

=== Inter-Arrival times ===

For a Poisson Process, the inter-arrival times are also random variables:
\[X_i=T_i-T_{i-1}\]
We can model this as an exponential RV.
\[X_i\sim\text{exponential}(\lambda)\]
\[E[X_i]=\frac{1}{\lambda}\]
\[F_{X_i}(t)=1-e^{-\lambda t}\]
Because of the independent increments property, the inter-arrival times are independent. We can simulate a Poisson random process by generating exponential inter-arrival times and summing them to obtain the arrival times:
  - Generate a sequence of \([0,1]\) uniform random variables \(U_i\)
  - Transform each \(U_i\) to an exponential RV: \(X_i=\frac{-\ln(1-U_i)}{\lambda}\)
  - Take the cumulative sum to find the arrival times: \(T_i=\sum_{n=1}^i X_n\)
  - For all \(t>0\), set \(N(t)=i\) for \(T_i\leq t<T_{i+1}\)

===== Markov chains =====

A Markov chain is a random process whose future state depends only on the current state. The state PMF at time \(n\) can be collected into a vector \(\mathbf{p}(n)\), and a stationary distribution \(\mathbf{p}^*\) is one that is unchanged from one time step to the next. We can see that shifting the times by the same amount does not change the PMF. As such a Markov chain is stationary when \(\mathbf{p}(0)=\mathbf{p}^*\).

====== Statistical inference ======

Statistical inference is using data to draw conclusions about distributions of random variables. Parameter estimation is estimating unknown or random parameters based on random samples. Hypothesis testing tests for specific values of an unknown or random parameter based on random samples; it is trying to classify the parameter from a set of choices. Hypothesis testing is decision making under uncertainty, and is extensively used.

===== Hypothesis testing =====

For hypothesis testing, we need to design a test with the objective of either accepting or rejecting \(H_0\). The test's outcome will be the test statistic. We also need a decision rule for choosing each hypothesis based on the test statistic. We then must assess the quality of the decision rules. For the decision rule, we partition the sample space into an acceptance set and the complementary rejection set. We then accept \(H_0\) iff the outcome is in the acceptance set.

==== Errors ====

This can give rise to errors:
  * **Type I:** We reject \(H_0\) when \(H_0\) is true
  * **Type II:** We accept \(H_0\) when \(H_0\) is false

There is a trade-off between these errors.

==== Tests ====

A two tail test is one where we accept if the test statistic is within a given distance of the expected value. A standard approach to picking the distance is choosing a significance level, and then setting the distance such that the probability of a type I error is less than the significance. The upper value on the lower rejection set is the left critical value, and the lower value on the upper rejection set is the right critical value. When we are only concerned with the test statistic moving in one direction, we use a one tailed test. This sets the rejection set as the region from 0 to a critical value, or from a critical value to infinity. We then need to set the critical value according to the level of significance.

We can also use a p-value for significance, which in the one tailed case is:
\[p=P[X\leq x|H_0]\]
The p-value expresses the probability of obtaining a value at least as extreme as the one observed, given \(H_0\), and measures the evidence for the rejection of \(H_0\). The smaller the p-value, the stronger the evidence. In practice the p-value is often used, as a fixed significance level is fairly arbitrary.

The type of hypothesis testing where we test if a test statistic reasonably fits a distribution is called significance testing. The two hypotheses used form an acceptance or rejection of a given distribution.
This is not binary hypothesis testing, as we do not have probability models for each hypothesis.

==== Binary hypothesis testing ====

Significance testing checks whether a test statistic fits a known distribution, and we either accept or reject the given model subject to the type I error being below a significance level. Binary hypothesis testing checks, given \(H_0\) and \(H_1\), both with their own distributions, which better fits a given test statistic. Sometimes, prior probabilities \(P[H_0]\) and \(P[H_1]\) are known.

Given a random variable \(X\) parameterised by \(\theta\), which is either an unknown constant or sampled from a random variable \(\Theta\), we need to express how \(X\) depends on the parameter. If \(\theta\) is a constant, we can write the parameterised CDF as \(F_X(x;\theta)\), and if sampled, \(F_{X|\Theta}(x|\theta)\). If \(\theta\) is hidden, meaning not directly observable, then given a realisation of \(X\) we need to make deductions about \(\theta\). We need to construct our two hypotheses, involving different realisations of \(\theta\), then design decision rules. Four decision rule criteria are:
  * **Maximum A Posteriori (MAP):** Accept \(H_0\) iff \(P[H_0|X=x]\geq P[H_1|X=x]\), else accept \(H_1\)
  * **Maximum Likelihood (ML):** MAP with uncertainty about prior probabilities
  * **Minimum Cost (MC):** Assigns costs to errors and minimises expected costs
  * **Neyman-Pearson (NP):** Puts a tolerance level on the false alarm probability

=== MAP ===

Alternately we can write the MAP rule as:
\[\frac{P[H_0|X=x]}{P[H_1|X=x]}\underset{H_1}{\overset{H_0}{\gtrless}}1\]
Using Bayes' rule, we can express the a posteriori probabilities in terms of the a priori probabilities, which then lets us express the decision rule as:
\[\frac{P[H_0|X=x]}{P[H_1|X=x]}=\frac{f_{X|H_0}(x)\cdot P[H_0]}{f_{X|H_1}(x)\cdot P[H_1]}\underset{H_1}{\overset{H_0}{\gtrless}}1\]
We can define the likelihood ratio as \(\Lambda(x)=\frac{f_{X|H_0}(x)}{f_{X|H_1}(x)}\) and the MAP decision threshold as \(\gamma=\frac{P[H_1]}{P[H_0]}\). The decision rule then becomes:
\[\Lambda(x)\underset{H_1}{\overset{H_0}{\gtrless}}\gamma\]
In binary hypothesis testing there are two errors:
  * **Type I:** Accept \(H_1\) when \(H_0\) is true (False alarm)
  * **Type II:** Accept \(H_0\) when \(H_1\) is true (Missed detection)

\[P_{FA}=P[\text{Type I error}]=P[\text{Accept }H_1|H_0\text{ true}]\]
\[P_{miss}=P[\text{Type II error}]=P[\text{Accept }H_0|H_1\text{ true}]\]
The overall error probability is:
\[P_{error}=P_{FA}\cdot P[H_0]+P_{miss}\cdot P[H_1]\]
The MAP decision rule minimises the overall error.

=== ML ===

When the a priori probabilities are unknown, we can assume they are equiprobable, i.e. \(\gamma=1\). This results in the Maximum Likelihood decision rule:
\[\Lambda(x)\underset{H_1}{\overset{H_0}{\gtrless}}1\]
This is equivalent to the MAP rule when \(P[H_0]=P[H_1]=\frac{1}{2}\). The MAP rule can be unreliable when we have poor estimates of the prior probabilities. The probabilities of a miss and a false alarm are inversely related, such that decreasing one increases the other. The further apart the distributions, the lower the ML error probability. This is due to the increased separation making it easier to distinguish the distributions.

The receiver operating characteristic (ROC) curve is a plot of \(P_{FA}\) against \(P_{miss}\). The closer the curve is to the bottom left \((0,0)\), the better. The worst case is a straight line from \((1,0)\) to \((0,1)\).

=== MC ===

A convenient way of handling the trade off between false alarms and misses is to assign costs to each.
This then lets us prioritise one type of error over another, with the goal being to minimise cost. The expected overall cost is:
\[E[C]=c_{FA}P_{FA}P[H_0]+c_{miss}P_{miss}P[H_1]\]
The decision rule minimising this is:
\[\Lambda(x)\underset{H_1}{\overset{H_0}{\gtrless}}\frac{c_{miss}P[H_1]}{c_{FA}P[H_0]}\]
We can see that this is a weighted form of the MAP threshold, and the cost function is a weighted form of the overall error probability. If the costs are equal, we get the MAP rule. The difficulty comes in choosing suitable costs for each error type.

=== NP ===

The Neyman-Pearson decision rule puts a tolerance level on the false alarm probability.
\[P_{FA}=P[\text{accept }H_1|H_0\text{ true}]\leq\alpha\]
With this constraint, the secondary objective is to find the decision rule that minimises \(P_{miss}\). This requires creating the decision rule \(\Lambda(x)\underset{H_1}{\overset{H_0}{\gtrless}}\gamma\), where \(\gamma\) is the largest number such that \(P_{FA}\leq\alpha\).

==== Multiple hypothesis testing ====

Often the number of possible decisions is larger than 2. We can still have MAP and ML rules, but the decision rule will not be as simple as in the binary case. There are \((M+1)^2\) decision probabilities. We can write a generalised MAP rule as:
\[\text{Choose }H_j\text{ if }f_{X|H_j}(x)P[H_j]\geq f_{X|H_i}(x)P[H_i],\forall i\]
This is choosing the hypothesis with the maximum a posteriori probability. This is equivalent to:
\[\text{Choose }H_j=\arg\max_if_{X|H_i}(x)P[H_i]\]
The ML rule is the same, with \(P[H_i]=P[H_j],\forall i,j\).

===== Statistical inferencing =====

==== Finite value estimation ====

When the number of possible values for a parameter is finite, we can formulate the problem of deciding a value for the parameter as an M-ary hypothesis problem. We can assign a particular \(\theta_i\) to hypothesis \(H_i\), and formulate the MAP rule as:
\[\hat{\theta}_{MAP}=\theta_i\text{ iff }P[\Theta=\theta_i|X=x]=\max_{j\in\{0,...,M-1\}}P[\Theta=\theta_j|X=x]\]
\[\hat{\theta}_{MAP}=\arg\max_{\theta\in S_\Theta}P[\Theta=\theta|X=x]\]
When \(X\) is a continuous RV, this becomes:
\[\hat{\theta}_{MAP}=\arg\max_{\theta\in S_\Theta}f_{X|\Theta}(x|\theta)P[\Theta=\theta]\]
This is the same as the hypothesis notation, using \(H_i:\Theta=\theta_i\).

When \(\theta\) is discrete, we want to maximise the a posteriori PMF to satisfy the MAP rule:
\[\hat{\theta}_{MAP}=\arg\max_{\theta}P[\Theta=\theta|X=x]\]
\[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)p_\Theta(\theta)\]
When \(\theta\) is a continuous RV, the a posteriori probability of any exact value is 0, so we use the conditional PDF instead.
\[\hat{\theta}_{MAP}=\arg\max_\theta f_{\Theta|X}(\theta|x)\]
\[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)f_\Theta(\theta)\]

==== Continuous MAP and ML estimation ====

When \(\theta\) is a continuous RV, we try to estimate the value of \(\Theta\) from \(X\). The MAP rule minimises the overall probability of error:
\[P[\hat{\theta}\neq\Theta|X=x]\]
As \(\Theta\) is a continuous RV, perfectly estimating the exact parameter is a zero probability event. The MAP rule is:
\[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)f_\Theta(\theta)\]
For the ML rule, we drop the prior \(f_\Theta(\theta)\).
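To tie the decision rules together, here is a minimal simulation sketch (assuming Python with NumPy; the two Gaussian likelihoods, the prior \(P[H_0]=0.7\) and all numbers are made up for illustration) that applies the MAP threshold to the likelihood ratio and estimates \(P_{FA}\) and \(P_{miss}\); setting \(\gamma=1\) instead gives the ML rule.

<code python>
# MAP binary hypothesis test for two Gaussian hypotheses, by Monte Carlo.
import numpy as np

rng = np.random.default_rng(2)

def f(x, mu, sigma):
    """Gaussian likelihood f_{X|H}(x)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mu0, mu1, sigma = 0.0, 2.0, 1.0      # H0: N(0,1), H1: N(2,1) -- illustrative only
P_H0, P_H1 = 0.7, 0.3                # assumed prior probabilities
gamma = P_H1 / P_H0                  # MAP threshold; use gamma = 1 for the ML rule

n = 200_000
x0 = rng.normal(mu0, sigma, n)       # observations generated under H0
x1 = rng.normal(mu1, sigma, n)       # observations generated under H1

def accept_H0(x):
    """MAP rule: accept H0 iff the likelihood ratio exceeds gamma."""
    return f(x, mu0, sigma) / f(x, mu1, sigma) > gamma

P_FA = np.mean(~accept_H0(x0))       # type I:  accept H1 although H0 is true
P_miss = np.mean(accept_H0(x1))      # type II: accept H0 although H1 is true
P_error = P_FA * P_H0 + P_miss * P_H1
print(P_FA, P_miss, P_error)
</code>

Moving the means further apart or changing the costs/priors and re-running shows the trade-off between the false alarm and miss probabilities discussed above.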
==== Minimum Mean Square Error estimation ====

We can measure the performance of an estimator with the expected mean square error:
\[E[(\Theta-\hat{\theta}(X))^2]\]
The Minimum Mean Square Error (MMSE) estimator is defined as:
\[\hat{\theta}_{MMSE}(X)=\arg\min_{\hat{\theta}}E[(\Theta-\hat{\theta}(X))^2]\]
We can write the mean square error as:
\[MSE=E[E[(\Theta-\hat{\theta}(X))^2|X]]\]
The aim is to find a function \(\hat{\theta}(\cdot)\) that minimises the inner expectation for all \(x\in S_X\), which gives the MMSE estimator. For a given \(X=x\), we can write the inner expectation as:
\[E[(\Theta-\hat{\theta}(x))^2|X=x]=\underbrace{\text{var}(\Theta-\hat{\theta}(x)|X=x)}_{=\text{var}(\Theta|X=x)}+\Big(\underbrace{E[\Theta-\hat{\theta}(x)|X=x]}_{=E[\Theta|X=x]-\hat{\theta}(x)}\Big)^2\]
In minimising this, we find \(\hat{\theta}(x)=E[\Theta|X=x]\), for any \(X=x\). Using Bayes' rule to express the posterior in terms of the prior and the likelihood (the denominator \(f_X(x)\) appears in both the numerator and denominator integrals), we can find the MMSE estimator to be:
\[\hat{\theta}_{MMSE}(x)=\frac{\int_{-\infty}^{\infty}\theta\cdot f_{X|\Theta}(x|\theta)f_\Theta(\theta)d\theta}{\int_{-\infty}^{\infty}f_{X|\Theta}(x|\theta)f_\Theta(\theta)d\theta}\]
The MMSE estimator is also known as the Least Mean Squares (LMS) estimator. Like the MAP estimator, the MMSE estimator requires a prior distribution on \(\Theta\), which may not always be available. The MMSE estimator may be difficult to compute compared to the MAP and ML estimators.

ML estimation is classical statistical inference, where the underlying parameter is deterministic but unknown. MAP and MMSE estimation are Bayesian statistical inference, using posterior probabilities which, through Bayes' rule, are based on prior distributions.
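As a closing sketch of the three estimators (assuming Python with NumPy; the Gaussian prior \(\Theta\sim\mathcal{N}(0,1)\), the noise variance, the observed value and the grid are all illustrative choices), the code below estimates \(\Theta\) from a single noisy observation. For this Gaussian setup the posterior is symmetric, so the MAP and MMSE estimates coincide, while the ML estimate ignores the prior.

<code python>
# MAP, ML and MMSE estimates of Theta from one observation X = x, on a grid.
import numpy as np

def gauss(z, mu, var):
    """Gaussian PDF evaluated at z."""
    return np.exp(-(z - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

sigma2 = 0.5                     # variance of the observation noise (illustrative)
x = 1.2                          # the observed value (illustrative)

theta = np.linspace(-5, 5, 10_001)       # grid over possible parameter values
prior = gauss(theta, 0.0, 1.0)           # f_Theta(theta): N(0, 1) prior
likelihood = gauss(x, theta, sigma2)     # f_{X|Theta}(x | theta)

weights = likelihood * prior             # proportional to the posterior
theta_map = theta[np.argmax(weights)]    # maximises the posterior
theta_ml = theta[np.argmax(likelihood)]  # ignores the prior
theta_mmse = np.sum(theta * weights) / np.sum(weights)   # E[Theta | X = x] on the grid

print(theta_map, theta_ml, theta_mmse)
# For this Gaussian model the closed form for MAP and MMSE is x / (1 + sigma2).
print(x / (1 + sigma2))
</code>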