===== Set theory =====
A set is a collection of elements, such as the possible outcomes of an experiment.
\(S\) is the universal set, also known as the sample space.
\(\varnothing\) is the empty set.
\(A\cup B\) is the union of two sets, and \(A\cap B\) is their intersection.
\(A-B\) is the set subtraction (also written \(A\backslash B\)).
\(A^C\) is the complement of \(A\) (also \(S-A\)).
We can say that \(x\) is an element of \(A\) by \(x\in A\).
We can also say \(A\) is a subset of \(B\) as \(A\subset B\).
Two sets are mutually exclusive (disjoint) if \(A\cap B=\varnothing\), meaning they have no elements in common.
Union and intersection are commutative, associative and distributive.
De Morgan's laws:
\[(A\cup B)^C=A^C\cap B^C\]
\[(A\cap B)^C=A^C\cup B^C\]
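As a quick illustration, De Morgan's laws can be checked on a small finite sample space with Python's built-in set operations; the sets below are arbitrary examples, not course data.
<code python>
# Verify De Morgan's laws on a small finite universal set.
S = set(range(10))            # universal set / sample space
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(E):
    return S - E              # E^C = S - E

assert complement(A | B) == complement(A) & complement(B)   # (A u B)^C = A^C n B^C
assert complement(A & B) == complement(A) | complement(B)   # (A n B)^C = A^C u B^C
print("De Morgan's laws hold for this example")
</code>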
A partition of a set is a grouping of its elements into non-empty, mutually exclusive subsets.
Every time an experiment is run is known as a trial and the result is the outcome.
A random experiment is one where the outcome is nondeterministic, meaning it cannot be predicted with certainty.
Changing what is observed can result in a different sample space.
An event is a collection of outcomes, and hence events are sets and subsets of \(S\).
An event is said to have occurred if the outcome is in the event.
If the outcome is not in the event, we say that the event's complement has occurred.
Events that are not mutually exclusive can occur simultaneously.
An event containing a single outcome is called an elementary event.
\(S\) is a subset of itself, and hence the event \(S\) always occurs.
The empty set is an event which never occurs and is included as a convention.
===== Probability =====

==== Models ====
A probabilistic model is a mathematical description of an unknown situation.
It contains:
  - The sample space, containing all possible outcomes
  - A probability law that assigns probabilities to all possible events.
The probability of an event is the chance or likelihood of the event occurring when an experiment is performed.
We write this set-function (measure) as \(P[A]\).
\(P[\cdot]\) is assumed to satisfy the three axioms of probability:
  * **Non-negativity:** \(P[A]\geq 0\) for every event \(A\)
  * **Additivity:** If \(A\cap B=\varnothing\), then \(P[A\cup B]=P[A]+P[B]\)
  * **Normalisation:** \(P[S]=1\)
From these we can imply:
  * \(P[\varnothing]=0\)
  * \(\forall A, 0\leq P[A]\leq 1\)
  * \(P[A\cup B]=P[A]+P[B]-P[A\cap B]\)
  * \(A\subset B\implies P[A]\leq P[B]\)
For discrete (countable) sample spaces, the probability of an event is:
\[P[A]=P\left[\bigcup_{i:s_i\in A}\{s_i\}\right]=\sum_{i:s_i\in A}P[\{s_i\}]\]
For continuous (uncountable) sample spaces, we must use probabilities of events rather than outcomes, as the probability of an individual outcome is virtually 0.
==== Conditional Probability ====
If an experiment is performed and the exact outcome is known, then there is no more randomness.
The outcome says whether an event has occurred or not.
If we instead know whether an event has occurred, but not the outcome, conditional probability can let us find the chance another event has also occurred.
We can write the conditional probability of B given A as \(P[B|A]\), whereas \(P[B]\) is the original probability.
We can find this probability by normalising the probability of the intersection by the probability of A.
\[P[B|A]=\frac{P[A\cap B]}{P[A]}\]
We can use this to find:
\[P[A\cap B]=P[A|B]*P[B]=P[B|A]*P[A]\]
Which gives us Bayes' rule:
\[P[B|A]=\frac{P[A|B]*P[B]}{P[A]}\]
For a set with partitions, we can find the probability of an event as such:
\[P[A]=P[(A\cap B_1)\cup(A\cap B_2)\cup...\cup(A\cap B_n)]=\sum_{i=1}^n P[A\cap B_i]\]
Which gives us the theorem of total probability:
\[P[A]=P[A|B_1]*P[B_1]+P[A|B_2]*P[B_2]+...+P[A|B_n]*P[B_n]=\sum_{i=1}^n P[A|B_i]*P[B_i]\]
We can also extend Bayes' rule with partitions to:
\[P[B_j|A]=\frac{P[A|B_j]*P[B_j]}{\sum_{k=1}^n P[A|B_k]*P[B_k]}\]
This finds the posterior (a posteriori) probabilities [post experiment] from the prior (a priori) probabilities [pre experiment].
From the observed event we can infer the conditional probabilities of unobserved events.
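As a numeric sketch of the theorem of total probability and Bayes' rule, the snippet below uses a made-up three-set partition with assumed priors and likelihoods (the numbers are purely illustrative).
<code python>
# Priors P[B_j] over a partition, and likelihoods P[A | B_j].
priors = [0.5, 0.3, 0.2]          # must sum to 1
likelihoods = [0.9, 0.5, 0.1]

# Theorem of total probability: P[A] = sum_j P[A|B_j] P[B_j]
p_A = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' rule: P[B_j | A] = P[A|B_j] P[B_j] / P[A]
posteriors = [l * p / p_A for l, p in zip(likelihoods, priors)]

print(p_A)           # 0.62
print(posteriors)    # posterior probabilities, which sum to 1
</code>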
==== Independence ====
Two events are said to be independent if:
\[P[A\cap B]=P[A]*P[B]\]
This means that knowing event A occurs does not alter the probability of B occurring, i.e. \(P[B|A]=P[B]\).
Mutual independence and disjointness/mutual exclusivity are different concepts.
Mutually independent events may occur simultaneously, whereas mutually exclusive events cannot.
Independence is often a natural assumption which can be reasoned about.
==== Tree diagrams ====
Tree diagrams can be used to visualise probabilities.
The probabilities on the leaves of the tree (intersections) are the product of the probabilities of all branches to the leaf.
Each branch has probability equal to its choice given all the previous branches.
Tree diagrams can be useful in visualising sequential experiments.
To construct the tree we start with one of the sub-experiments and branch out from the root, then we branch from each node with the next sub-experiment, repeating until all sub-experiments are covered.
At each step, the probability of that branch is equal to the probability of that choice given all the previous choices, and the probability at the node is equal to the intersection of all previous choices.
In many sequential experiments, the branch probabilities depend on the outcomes of earlier sub-experiments.
The fundamental principle of counting states: "If sub-experiment A has n possible outcomes and sub-experiment B has m possible outcomes, then there are \(n\times m\) possible experiment outcomes".
If the possible outcomes are equally likely, the probability of an event A occurring comes down to the total number of outcomes and the number of outcomes in A.
We need to determine how to count the number of possible outcomes for this to work.
Sampling without replacement results in sequential experiments whose trials depend on the previous trials.
Sampling with replacement causes all experiments to have identical independent trials.
In both cases, whether or not the order of the sub-experiment outcomes is taken into account can change the counting and hence the probabilities.
If sampling k times from n items with replacement, the sample space has \(n^k\) outcomes.
If sampling k times from n items without replacement but with ordering, the sample space is \(\frac{n!}{(n-k)!}=(n)_k\).
If sampling k times from n items without replacement and without ordering, the sample space is \(\frac{n!}{(n-k)!k!}=\begin{pmatrix}n\\k\end{pmatrix}\), the binomial coefficient.
We can consider the more general case of the multinomial coefficient, which counts the ways to split \(n\) items into groups of sizes \(n_0,n_1,...,n_m\):
\[\begin{pmatrix}n\\n_0,n_1,...,n_m\end{pmatrix}=\frac{n!}{n_0!n_1!...n_m!}\]
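These counting formulas map directly onto Python's standard library; the sketch below uses math.perm, math.comb and math.factorial, plus a hypothetical multinomial helper (not a standard function).
<code python>
from math import comb, factorial, perm

n, k = 5, 3
print(n ** k)        # with replacement, ordered: n^k = 125
print(perm(n, k))    # without replacement, ordered: n!/(n-k)! = 60
print(comb(n, k))    # without replacement, unordered: C(n, k) = 10

def multinomial(n, *groups):
    """Multinomial coefficient n!/(n_0! n_1! ... n_m!); group sizes must sum to n."""
    assert sum(groups) == n
    result = factorial(n)
    for g in groups:
        result //= factorial(g)
    return result

print(multinomial(10, 3, 3, 4))  # 4200 ways to split 10 items into groups of 3, 3 and 4
</code>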
====== Discrete Random Variables ======
Random variables can be used to map outcomes to numbers.
More formally, a random variable \(X\) is a function mapping each outcome \(\omega\in S\) to a real number \(X(\omega)\).
The sample space is the domain of X, and the range \(S_X\) is the collection of all possible outputs of X.
Two outcomes can map to the same value.
Events can be defined in terms of a single value or a set of output values that a random variable can take.
\[A=\{\omega\in S:X(\omega)\in B\},\quad B\subset S_X\]
===== The Cumulative Distribution Function =====
The Cumulative Distribution Function (CDF) is defined for \(x\in\mathbb{R}\) as:
\[F_X(x)=P[X\leq x]\]
Thus the CDF is the probability of the event \([X\leq x]\).
A CDF is built by working from the regions where X gives a value.
The function jumps in value where \(P[X=x]\neq0\), of height \(P[X=x]\).
The CDF has some properties:
  - \(0\leq F_X(x)\leq 1\)
  - \(x_1\geq x_2\implies F_X(x_1)\geq F_X(x_2)\)
  - \(\lim_{x\to\infty}F_X(x)=1\)
  - \(\lim_{x\to-\infty}F_X(x)=0\)
  - \(F_X(x)=\lim_{h\to0}F_X(x+h)=F_X(x^+)\)
This means the CDF is an increasing function (2), that is right continuous (5).
We can use the CDF to find:
  * \(P[a<X\leq b]=F_X(b)-F_X(a)\)
  * \(P[X>a]=1-F_X(a)\)
  * \(P[X=a]=F_X(a)-F_X(a^-)\)
  * \(P[X<a]=F_X(a^-)\)

\(X\) is a discrete random variable if its range contains a finite or countably infinite number of points.
Alternatively, \(X\) is discrete if its CDF is a piecewise-constant staircase function.
In other words, the CDF changes value only at the jump points.
===== Probability Mass Function =====
The Probability Mass Function (PMF) is defined to be:
\[p_X(x)=P[X=x]\]
It has the following properties:
  - \(0\leq p_X(x_i)\leq 1\) for all \(x_i\in S_X\)
  - \(p_X(x_i)=0\) if \(x_i\notin S_X\)
  - \(\sum_{\forall x_i\in S_X}p_X(x_i)=1\)
  - For any event \(A\subset S_X\), \(P[A]=\sum_{\forall x_i\in A}p_X(x_i)\)
When plotted, usually vertical lines or arrows to dots are drawn representing each value of the function.
We can relate the PMF to the CDF as follows:
\[p_X(x)=P[X=x]=F_X(x)-F_X(x^-)\]
\[F_X(x)=\sum_{\forall x_i\leq x}p_X(x_i)\]
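A small sketch of the PMF-to-CDF relation, using a made-up PMF on \(\{0,1,2,3\}\): the CDF is the running sum of PMF values at points up to x, constant between jumps.
<code python>
support = [0, 1, 2, 3]
pmf = [0.1, 0.4, 0.3, 0.2]   # must sum to 1

def cdf(x):
    """F_X(x) = sum of p_X(x_i) over all x_i <= x (a staircase function)."""
    return sum(p for xi, p in zip(support, pmf) if xi <= x)

print(cdf(1))     # 0.5
print(cdf(2.5))   # 0.8, constant between the jump points
print(cdf(-1))    # 0.0; cdf(3) == 1.0
</code>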
===== Bernoulli Random Variable =====
Used for an experiment where an event \(A\), usually called a "success", either occurs or does not.
Defined by a "success" probability \(p=P[A]\), with \(X=1\) if \(A\) occurs and \(X=0\) otherwise.
\[p_X(k)=P[X=k]=p^k(1-p)^{1-k},\quad k\in\{0,1\}\]
We'll denote a Bernoulli random variable as:
\[X\sim\text{Bernoulli}(p)\]
===== Binomial Random Variable =====
Used for repeated random, independent and identical experiments made of Bernoulli trials.
It counts the number of successes over \(n\) trials.
Denoted as:
\[X\sim\text{Binomial}(n,p)\]
\[P[X=k]=\begin{pmatrix}n\\k\end{pmatrix}p^k(1-p)^{n-k},\quad k=0,1,...,n\]
===== Geometric Random Variable =====
Used when interested in knowing when the first success will occur, repeating trials until then.
The waiting time (number of trials) is a geometric random variable.
The first success on the kth trial means the first \(k-1\) trials are failures followed by a success.
We can wait forever to get a success.
\[X\sim\text{Geometric}(p)\]
\[P[X=k]=(1-p)^{k-1}p,\quad k=1,2,...\]
Called a geometric variable as the probabilities progress like a geometric progression.
The geometric random variable has the following interesting property:
\[P[X=k+j|X>j]=P[X=k]\]
This is known as the memory-less property (the geometric is the only memory-less discrete random variable).
===== Poisson Random Variable =====
\(N\) is a Poisson Random Variable with parameter \(\lambda>0\) if:
\[p_N(k)=P[N=k]=\frac{\lambda^k}{k!}e^{-\lambda},\quad k=0,1,2,...\]
Used often for counting the number of events in a given time period or region of space.
Denoted as:
\[N\sim\text{Poisson}(\lambda)\]
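The four PMFs above can be written directly as small Python functions; this is a reference sketch, with the infinite supports truncated when checking normalisation.
<code python>
from math import comb, exp, factorial

def bernoulli_pmf(k, p):           # k in {0, 1}
    return p**k * (1 - p)**(1 - k)

def binomial_pmf(k, n, p):         # k in {0, ..., n}
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):           # k in {1, 2, ...}
    return (1 - p)**(k - 1) * p

def poisson_pmf(k, lam):           # k in {0, 1, 2, ...}
    return lam**k / factorial(k) * exp(-lam)

# Each PMF sums to 1 over its support (truncated here for the infinite ones):
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))   # ~1.0
print(sum(poisson_pmf(k, 4.0) for k in range(100)))       # ~1.0
</code>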
===== Statistics =====
A discrete RV is completely described by its PMF or CDF, but we are not always interested in so much information.
Statistics are a summary of the RV used for making comparisons.
Three different statistics are the mode, median and mean.
The mode is a number \(x_{mod}\) satisfying:
\[p_X(x_{mod})\geq p_X(x),\quad\forall x\in S_X\]
That is, the mode is the most probable outcome.
The median is a number \(x_{med}\) satisfying:
\[P[X\geq x_{med}]\geq 0.5\text{ and }P[X\leq x_{med}]\geq 0.5\]
Ideally it exactly balances the probability of lying above or below, but this is not always satisfied by a discrete RV.
It does not always need to be in the range \(S_X\).
The mean, or average, is the expected value of the distribution.
\[\mu_X=E[X]=\sum_{x_i\in S_X}x_iP[X=x_i]\]
This is the average empirical outcome if we were to perform the experiment many times.

The expected value of a Bernoulli RV is:
\[E[X]=0*(1-p)+1*p=p\]
For a binomial RV, writing it as a sum of indicators \(I_k\sim\text{Bernoulli}(p)\):
\[E[X]=E\left[\sum_{k=1}^n I_k\right]=np\]
The expectation operator has the following properties:
  - \(E[aX]=aE[X]\)
  - \(E[X+Y]=E[X]+E[Y]\)
  - \(E[aX+bY]=aE[X]+bE[Y]\)
Hence expectation is a linear operator.
The expected value of a geometric variable is:
\[E[X]=\sum_{k=1}^\infty k(1-p)^{k-1}p=\frac{1}{p}\]
The mean may not be in the range \(S_X\), i.e. possibly \(E[X]\notin S_X\).
For a Poisson RV:
\[E[N]=\sum_{k=0}^{\infty}k\frac{\lambda^k}{k!}e^{-\lambda}=\sum_{k=1}^{\infty}k\frac{\lambda^k}{k!}e^{-\lambda}=\lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!}=\lambda e^{-\lambda}e^{\lambda}=\lambda\]
The variance of a RV is:
\[\sigma^2_X=\text{Var}(X)=E[(X-E[X])^2]=\sum_{x_i\in S_X}(x_i-\mu_X)^2p_X(x_i)\]
This is an indication of the spread of the values about its mean.
Variance can also be calculated as:
\[\text{Var}(X)=E[X^2]-(E[X])^2\]
Here \(E[X^2]\) and \(E[X]\) can be found separately.
The variance has two useful properties:
  - \(\text{Var}(aX)=a^2\text{Var}(X)\)
  - If \(X\) and \(Y\) are independent, \(\text{Var}(X+Y)=\text{Var}(X)+\text{Var}(Y)\)

The variance of a Bernoulli trial is:
\[\text{Var}(X)=p(1-p)\]
For a Binomial:
\[\text{Var}(X)=np(1-p)\]
For a Geometric:
\[\text{Var}(X)=\frac{1-p}{p^2}\]
For a Poisson:
\[\text{Var}(X)=\lambda\]
The Poisson RV can be used to approximate the binomial RV, which can be hard to calculate when \(n\) is large.
For distributions where \(n\) is large and \(p\) is small, we can set \(\lambda=np\).
We do need to note that the Poisson distribution counts on an infinite space, whereas the binomial is finite, so care is needed as \(n\to\infty\).
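A sketch comparing the exact binomial PMF against the Poisson approximation with \(\lambda=np\), for an assumed large \(n\) and small \(p\) (the values below are arbitrary).
<code python>
from math import comb, exp, factorial

n, p = 1000, 0.002
lam = n * p   # 2.0

for k in range(6):
    exact = comb(n, k) * p**k * (1 - p)**(n - k)   # binomial PMF
    approx = lam**k / factorial(k) * exp(-lam)     # Poisson PMF with lambda = n*p
    print(f"k={k}: binomial={exact:.6f}, poisson={approx:.6f}")
</code>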
====== Continuous Random Variables ======
If the results of an experiment are uncountable, forming a continuum, we call the experiment continuous.
Analogue measurements, such as voltages and times, are typical examples.
Here, the probability that the outcome is an exact number is virtually 0, so we assess probabilities on a range.
The continuous random variable has a CDF that is continuous everywhere and has a derivative (which may fail to exist at a finite number of points).
The equivalent of the PMF for discrete random variables is the probability density function (PDF).
The CDF can be written as:
\[F_X(x)=\int_{-\infty}^xf_X(u)du\]
Hence the PDF can be written as:
\[f_X(x)=\frac{dF_X(x)}{dx},\quad x\in\mathbb{R}\]
We can say we have an element of probability:
\[f_X(x)dx=P[x<X\leq x+dx]\]
The PDF has the following properties:
  - \(f_X(x)\geq0\)
  - \(\int_{-\infty}^\infty f_X(x)dx=1\)
  - \(F_X(a)=P[X\leq a]=\int_{-\infty}^af_X(u)du\)
  - \(P[a<X\leq b]=\int_a^bf_X(x)dx\)
The probability on an open or closed interval is the same, as the probability of the bound is 0.
The mean and variance are written as:
\[\mu_X=E[X]=\int_{-\infty}^\infty xf_X(x)dx\]
\[\sigma^2_X=\text{Var}(X)=\int_{-\infty}^\infty(x-\mu_X)^2f_X(x)dx\]
===== Uniform Random Variables =====
A uniform random variable is one where all outcomes in a range have equal probability density:
\[f_X(x)=\begin{cases}\frac{1}{b-a},&a\leq x\leq b\\0,&\text{otherwise}\end{cases}\]
We can find the CDF to be the following:
\[F_X(x)=\begin{cases}0,&x<a\\\frac{x-a}{b-a},&a\leq x\leq b\\1,&x>b\end{cases}\]
The mean and variance are:
\[\mu_X=\frac{b+a}{2}\]
\[\text{Var}(X)=\frac{(b-a)^2}{12}\]
===== Exponential Random Variable =====
The exponential random variable has the following PDF:
\[f_X(x)=\begin{cases}\lambda e^{-\lambda x},&x\geq0\\0,&x<0\end{cases}\]
Its CDF is:
\[F_X(x)=\begin{cases}0,&x<0\\1-e^{-\lambda x},&x\geq0\end{cases}\]
Its mean and variance are:
\[\mu_X=\frac{1}{\lambda}\]
\[\sigma^2_X=\frac{1}{\lambda^2}\]
Used in studies of waiting and service times.
\(\lambda\) is in units of number per time, so the expected value is the average waiting time for one occurrence.
The exponential RV is the continuous equivalent of the geometric RV.
It also has the same memory-less property.
The CDF of an exponential RV can be constructed by modelling the number of arrivals in the first t time as a Poisson RV \(N\sim\text{Poisson}(\lambda t)\).
The CDF is then one minus the probability that no arrivals happen in the first t time.
\[F_X(t)=P[X\leq t]=1-P[X>t]=1-P[N=0]=1-e^{-\lambda t}\]
The Poisson RV models the number of arrivals whereas the exponential RV models the inter-arrival time.
The memory-less property means that when conditioned on a past event, time is restarted and a new time 0 is established.
===== Gaussian (Normal) Random Variable =====
Many phenomena have a bell-shaped empirical distribution.
This is often modelled with a Gaussian or normal PDF with mean \(\mu\) and variance \(\sigma^2\):
\[f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)\]
\[X\sim\mathcal{N}(\mu,\sigma^2)\]
As the variance increases, the curve becomes shallower and wider.
A closed-form solution doesn't exist for the CDF, so we use tables of the standard normal CDF \(\Phi(x)\) for \(\mathcal{N}(0,1)\).
For nonstandard normal RVs, we can adjust them to work as \(F_X(x)=\Phi\left(\frac{x-\mu}{\sigma}\right)\).
There is also the Q function, defined as the complement of \(\Phi\): \(Q(x)=1-\Phi(x)\).
We can often model noise by adding a Gaussian with zero mean to a signal.
The Q function is reflective: \(Q(-x)=1-Q(x)\).
For probabilities involving the maximum value: \(P[X\leq x_{max}]=\Phi\left(\frac{x_{max}-\mu}{\sigma}\right)\).
Likewise: \(P[X>x_{max}]=Q\left(\frac{x_{max}-\mu}{\sigma}\right)\).
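Although \(\Phi\) has no closed form, it can be evaluated numerically through the error function, \(\Phi(x)=\frac{1}{2}\left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)\); a minimal sketch using only the standard library:
<code python>
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def Q(x):
    """Q function: complement of the standard normal CDF."""
    return 1.0 - phi(x)

def normal_cdf(x, mu, sigma):
    """P[X <= x] for X ~ N(mu, sigma^2), by standardising."""
    return phi((x - mu) / sigma)

print(phi(0.0))                    # 0.5
print(Q(-1.0), 1 - Q(1.0))         # reflection: Q(-x) = 1 - Q(x)
print(normal_cdf(7.0, 5.0, 2.0))   # P[X <= 7] for N(5, 4)
</code>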
====== Mixed random variables ======
These contain CDFs of continuous increasing segments with jumps.
As the CDF isn't discrete, a PMF isn't possible.
We can consider a discrete RV's PMF as a PDF by considering the probability points as Dirac delta functions.
A delta function has infinite height with infinitesimal width, such that the area under it is 1.
We can express a PMF as a PDF by:
\[f_X(x)=\sum_{x_k\in S_X}p_X(x_k)\delta(x-x_k)\]
We can go from a CDF to a PDF with a generalised derivative, by defining the derivative of a jump to be equal to a Dirac delta of area equal to the jump.
If a delta exists at a point, then we can calculate the probability of that exact value.
====== Functions of a random variable ======
If we were to consider a function of a random variable, \(Y=g(X)\), then \(Y\) is also a random variable.
===== Discrete Random Variables =====
\(Y\) will also be a discrete RV, taking on the value \(g(x_k)\) when \(X=x_k\).
As such, the PMF of \(Y\) can be determined by:
\[p_Y(k)=P[Y=k]=\sum_{i:g(x_i)=k}p_X(x_i)\]
where the sum is over all \(i\) such that \(k=g(x_i)\).
===== Continuous Random Variables =====
\[F_Y(y)=P[Y\leq y]=P[g(X)\leq y]=P[X\in g^{-1}((-\infty,y])]\]
===== Generate a RV with a Prescribed CDF =====
We want to generate a random variable \(X\) with a prescribed CDF \(F_X(x)\), starting from \(U\sim\text{Uniform}(0,1)\).
We then set \(X=g(U)=F_X^{-1}(U)\).
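A minimal sketch of this inverse-transform method, assuming an exponential target whose inverse CDF is \(F_X^{-1}(u)=-\ln(1-u)/\lambda\):
<code python>
import random
from math import log

def sample_exponential(lam):
    """Draw X ~ Exponential(lam) by inverting its CDF at a uniform sample."""
    u = random.random()            # U ~ Uniform(0, 1)
    return -log(1.0 - u) / lam     # X = F_X^{-1}(U)

lam = 2.0
samples = [sample_exponential(lam) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to 1/lambda = 0.5
</code>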
===== Expectation of a function of a RV =====
\[E[Y]=E[g(X)]=\int_{-\infty}^\infty g(x)f_X(x)dx\]
\[E[Y]=E[g(X)]=\sum_{\forall x_k\in S_X}g(x_k)p_X(x_k)\]
For the continuous and discrete cases respectively.
Variance is an example: it is the expectation of a function of a random variable, \(g(X)=(X-\mu_X)^2\).
====== Joint random variables ======
Can be useful to visualise joint variables as a random vector.
Can visualise a pair of random variables as two functions mapping an outcome to two different number lines.
The joint mapping is taking the outcome to a point on the plane.
This point can contain more information than only one of the mapping functions.
As with the single RV case, we don't know the outcome beforehand so we want to look at the joint probabilities of points and regions on the plane.
\[P[(X\leq x)\cap(Y\leq y)]=P[X\leq x,Y\leq y]\]
\[P[(X=x)\cap(Y=y)]=P[X=x,Y=y]\]
If both functions are discrete RVs, the joint probability mass function is:
\[p_{X,Y}(x,y)=P[X=x,Y=y]\]
The range of the PMF is \(S_{X,Y}\).
The sum of the probabilities over the entire sample space is 1.
The joint probabilities have the following properties:
  - \(0\leq p_{X,Y}(x_i,y_j)\leq 1\)
  - \(p_{X,Y}(x_i,y_j)=0\) if \((x_i,y_j)\notin S_{X,Y}\)
  - \(\sum_{x\in S_X}\sum_{y\in S_Y}p_{X,Y}(x,y)=1\)

Joint RVs are said to be independent if the probability of the point is equal to the product of the probabilities of the individual variables:
\[p_{X,Y}(x,y)=p_X(x)p_Y(y)\]
Whether or not the variables are independent, we can recover the individual functions by summing over the other variable, e.g.
\[P[X=1]=\sum_{i}P[X=1,Y=i]\]

With 2 RVs, we need 3D plots to express PMFs, CDFs and PDFs, which can be difficult.
We use level curves and labels to represent them in a 2-D plane.
Joint PMFs are easy to do in 2D, joint PDFs need level curves making them harder, and joint CDFs are hard to plot and usually not meaningful.

Given a joint RV, we can consider each random variable individually by computing the marginal PMFs of X and Y.
\[p_X(x)=P[X=x]=\sum_{y_j\in S_Y}p_{X,Y}(x,y_j)\]
This turns the 2D function into a 1D function.
Given a joint PMF, we can always compute the marginal PMFs.
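A sketch of marginalisation with a joint PMF stored as a 2-D array (the numbers are made up): summing over an axis removes the other variable, and independence can be checked against the outer product of the marginals.
<code python>
import numpy as np

# Made-up joint PMF p_{X,Y}(x, y); rows index x, columns index y.
p_xy = np.array([[0.10, 0.20],
                 [0.05, 0.25],
                 [0.30, 0.10]])
assert abs(p_xy.sum() - 1.0) < 1e-12

p_x = p_xy.sum(axis=1)   # marginal PMF of X: sum over y
p_y = p_xy.sum(axis=0)   # marginal PMF of Y: sum over x
print(p_x)               # [0.3 0.3 0.4]
print(p_y)               # [0.45 0.55]

# X and Y are independent iff the joint equals the outer product of marginals.
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # False for this example
</code>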
===== Joint CDF =====
The joint CDF is:
\[F_{X,Y}(x,y)=P[X\leq x,Y\leq y]\]
It results in a semi-infinite rectangular region.
If the random variables are independent, the joint CDF is the product of the marginal CDFs:
\[F_{X,Y}(x,y)=F_X(x)F_Y(y)\]
We can also get marginal CDFs:
\[F_X(x)=P[X\leq x]=P[X\leq x,Y<\infty]=\lim_{y\to\infty}F_{X,Y}(x,y)\]
If both X and Y are discrete, there will be jumps, forming a 3D staircase function.
If both X and Y are continuous, the joint CDF will be a continuous surface everywhere.
We can then define a joint PDF which gives the concentration (density) of the probability per unit area, done with double integrals.
\[f_{X,Y}(x,y)=\frac{\partial^2F_{X,Y}(x,y)}{\partial x\partial y}\]
\[F_{X,Y}(x,y)=\int_{-\infty}^x\int_{-\infty}^yf_{X,Y}(u,v)\,dv\,du\]
The joint PDF is the product of the marginal PDFs when independent.
The marginal PDF can be computed from the joint PDF:
\[f_X(x)=\frac{dF_X(x)}{dx}=\int_{-\infty}^\infty f_{X,Y}(x,y)\,dy\]
The joint CDF has the following properties:
  - \(0\leq F_{X,Y}(x,y)\leq 1\)
  - \(\lim_{x,y\to\infty}F_{X,Y}(x,y)=1\)
  - If \(x_1\leq x_2\) and \(y_1\leq y_2\), then \(F_{X,Y}(x_1,y_1)\leq F_{X,Y}(x_2,y_2)\)
As a consequence, the probability of a rectangular region is:
\[P[x_1\leq X\leq x_2,y_1\leq Y\leq y_2]=F_{X,Y}(x_2,y_2)-F_{X,Y}(x_1,y_2)-F_{X,Y}(x_2,y_1)+F_{X,Y}(x_1,y_1)\]
  - \(\lim_{x\to-\infty}F_{X,Y}(x,y)=0\) and \(\lim_{y\to-\infty}F_{X,Y}(x,y)=0\)
  - \(\lim_{x\to a^+}F_{X,Y}(x,y)=F_{X,Y}(a,y)\)
  - \(\lim_{y\to b^+}F_{X,Y}(x,y)=F_{X,Y}(x,b)\)
The last two mean the CDF is top and right continuous.
If both X and Y are continuous RVs, then the CDF is continuous from all directions.
===== Joint PDF =====
The joint PDF has the following properties:
  - \(f_{X,Y}(x,y)\geq0\)
  - \(\int_{-\infty}^\infty\int_{-\infty}^\infty f_{X,Y}(x,y)\,dx\,dy=1\)
  - \(P[(X,Y)\in A]=\iint_{(x,y)\in A}f_{X,Y}(x,y)\,dx\,dy\)
  - \(P[a<X\leq b,c<Y\leq d]=\int_a^b\int_c^d f_{X,Y}(x,y)\,dy\,dx\)
  - Continuous at all values except a countable set
Two or more joint random variables are mutually independent iff their joint PMF/CDF/PDF is a product of their marginals (they are separable).
\[p_{X,Y}(x,y)=p_X(x)p_Y(y)\]
\[F_{X,Y}(x,y)=F_X(x)F_Y(y)\]
\[f_{X,Y}(x,y)=f_X(x)f_Y(y)\]
===== Conditional Probability =====
For a given event in the sample space, we can express the conditional CDF of another event given the first as:
\[F_{X|B}(x)=P[X\leq x|B]=\frac{P[(X\leq x)\cap B]}{P[B]}\]
For a discrete RV with PMF \(p_X(x)\), the conditional PMF is:
\[p_{X|B}(x)=\begin{cases}\frac{p_X(x)}{P[B]},&x\in B\\0,&x\notin B\end{cases}\]
We can note that the theorem of total probability applies as well:
\[F_X(x)=F_{X|B}(x)P[B]+F_{X|B^C}(x)P[B^C]\]
\[p_X(x)=p_{X|B}(x)P[B]+p_{X|B^C}(x)P[B^C]\]
For a continuous RV, the conditional PDF is:
\[f_{X|B}(x)=\begin{cases}\frac{f_X(x)}{P[B]},&x\in B\\0,&x\notin B\end{cases}\]
The conditional PMF of Y given X is:
\[p_{Y|X}(y|x)=\frac{P[Y=y,X=x]}{P[X=x]}=\frac{p_{X,Y}(x,y)}{p_X(x)}\]
We can do the same thing for the CDF:
\[F_{Y|X}(y|x)=\frac{P[Y\leq y,X=x]}{P[X=x]}\]
We can also specify conditional probabilities where we condition on a set:
\[F_{Y|X}(y|A)=\frac{P[Y\leq y,X\in A]}{P[X\in A]}\]
For a continuous RV, this definition of the conditional CDF doesn't work when conditioning on \(X=x\), as \(P[X=x]=0\).
Instead we define the conditional CDF with a limiting argument, given \(f_X(x)\) is continuous at \(x\).
\[F_{Y|X}(y|x)=\frac{\int_{-\infty}^yf_{X,Y}(x,v)\,dv}{f_X(x)}\]
This lets us define the conditional PDF.
\[f_{Y|X}(y|x)=\frac{d}{dy}F_{Y|X}(y|x)=\frac{f_{X,Y}(x,y)}{f_X(x)}\]
==== Bayes' recap ====
For an experiment, there is an unknown cause and observed evidence.
Bayes' rule seeks to find the probability of the cause given the evidence.
\[P[Cause|Evidence]=\frac{P[Evidence|Cause]P[Cause]}{P[Evidence]}\]
The diagnostic (posterior) probability is what is being found.
The likelihood function is the other conditional argument.
The prior probability is the likelihood of the cause.
The probability of the evidence is the marginal probability.
===== Conditional expectation =====
\[E[X|B]=\int_{-\infty}^\infty xf_{X|B}(x|B)dx\]
\[E[X|B]=\sum_{x_k\in S_X}x_kp_{X|B}(x_k|B)\]
The first case is continuous and the second is discrete.
Both of these result in a constant, the same as if B hadn't occurred.
We can exchange B for another random variable:
\[E[Y|X=x]=\int_{-\infty}^\infty yf_{Y|X}(y|x)dy\]
\[E[Y|X=x]=\sum_{y_j\in S_Y}y_jp_{Y|X}(y_j|x)\]
Here the expectation is dependent on \(x\), effectively changing the sample space; the expectation is still constant, albeit a function of \(x\).
If \(X\) is not specified, the conditional expectation is itself a random variable.
\[g(X)=E[Y|X]\]
For each outcome \(\omega\), there is a corresponding value of the conditional expectation \(g(X(\omega))\).
We can find its expected value like that of any random variable:
\[E[E[Y|X]]=\int_{-\infty}^\infty E[Y|X=x]f_X(x)dx=\int_{-\infty}^\infty\int_{-\infty}^\infty yf_{Y|X}(y|x)dy\,f_X(x)dx=\int_{-\infty}^\infty y\int_{-\infty}^\infty f_{X,Y}(x,y)\,dx\,dy=\int_{-\infty}^\infty yf_Y(y)dy=E[Y]\]
This is known as the law of total expectation.
The effect of conditioning on \(X\) is removed by averaging (taking the expectation) over \(X\).
===== Correlation and Covariance =====
The correlation of two random variables is the expected value of their product.
\[E[XY]=\iint_{-\infty}^\infty xyf_{X,Y}(x,y)\,dx\,dy\]
The covariance is its own measure:
\[\text{cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]=E[XY]-E[X]E[Y]\]
If \(X\) and \(Y\) are independent, their covariance is zero.
If either \(X\) or \(Y\) has zero mean, the covariance equals the correlation.
The covariance of a variable with itself is its variance, \(\text{cov}(X,X)=\text{var}(X)\).
If \(Z=X+Y\), then \(\text{var}(Z)=E[(X+Y)^2]-E[X+Y]^2=\text{var}(X)+\text{var}(Y)+2\text{cov}(X,Y)\).
\(X\) and \(Y\) are uncorrelated if the correlation is equal to the product of their means (\(E[XY]=E[X]E[Y]\implies\text{cov}(X,Y)=0\)).
Likewise if the covariance is non-zero, \(X\) and \(Y\) are correlated.
If \(E[XY]=0\), \(X\) and \(Y\) are called orthogonal.
Independence implies uncorrelatedness, but not the converse.
The converse is guaranteed true when \(X\) and \(Y\) are jointly Gaussian.
We define the correlation coefficient as follows:
\[\rho_{X,Y}=\frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y},\quad -1\leq\rho_{X,Y}\leq 1\]
===== Joint Gaussian =====
Random variables \(X\) and \(Y\) are jointly Gaussian if their joint PDF has the form (with \(\rho=\rho_{X,Y}\)):
\[f_{X,Y}(x,y)=\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_X)^2}{\sigma_X^2}-\frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y}+\frac{(y-\mu_Y)^2}{\sigma_Y^2}\right]\right)\]
Their marginal densities are Gaussian also, with means \(\mu_X,\mu_Y\) and variances \(\sigma_X^2,\sigma_Y^2\).
If the correlation coefficient is zero, the joint density can be written as the product of the marginal densities, so \(X\) and \(Y\) are independent and uncorrelated.
It is important to note that two Gaussian RVs need not produce a joint Gaussian.
===== Sum of 2 random variables =====
Given \(X\) and \(Y\) with joint PDF \(f_{X,Y}(x,y)\), we want the distribution of \(Z=X+Y\).
We do this by first finding the CDF and then the PDF from the derivative of the CDF.
\[F_Z(z)=P[X+Y\leq z]=\int_{y=-\infty}^{y=\infty}\int_{x=-\infty}^{x=z-y}f_{X,Y}(x,y)\,dx\,dy\]
\[f_Z(z)=\frac{d}{dz}F_Z(z)=\int_{-\infty}^\infty\left[\frac{d}{dz}\int_{-\infty}^{z-y}f_{X,Y}(x,y)\,dx\right]dy=\int_{-\infty}^\infty f_{X,Y}(z-y,y)\,dy\]

Given \(X\) and \(Y\) with a joint PMF \(p_{X,Y}(x,y)\), we can do the same for discrete RVs.
We consider all points on the boundary, and knowing the events are disjoint:
\[P[Z=z]=P\left[\bigcup_{\forall x\in S_X}(X=x)\cap(Y=z-x)\right]=\sum_{\forall x\in S_X}P[(X=x)\cap(Y=z-x)]=\sum_{\forall x\in S_X}p_{X,Y}(x,z-x)\]
Where \(X\) and \(Y\) are independent RVs, we can separate the joint PDF into the product of marginal PDFs:
\[f_Z(z)=\int_{-\infty}^\infty f_X(z-y)f_Y(y)\,dy=(f_X*f_Y)(z)\]
The sum is the convolution of the individual PDFs.
Likewise for discrete RVs, we also get the convolution of the PMFs.
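For independent discrete RVs whose supports start at 0, this convolution is a direct array operation; a sketch using numpy.convolve with made-up PMFs:
<code python>
import numpy as np

p_x = np.array([0.5, 0.5])            # fair coin: values 0, 1
p_y = np.array([0.25, 0.5, 0.25])     # values 0, 1, 2

p_z = np.convolve(p_x, p_y)           # PMF of Z = X + Y on {0, ..., 3}
print(p_z)                            # [0.125 0.375 0.375 0.125]
print(p_z.sum())                      # 1.0
</code>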
====== Random Vectors and transformations ======

===== Matrix and vector preliminaries =====
  - \(\mathbb{R}^n\) is the set of column vectors with \(n\) real components
  - \(\mathbb{R}^{m\times n}\) is the set of real matrices with m rows and n columns
  - The transpose of a matrix (\(\mathbf{A}^T\)) swaps its rows and columns
  - If \(\mathbf{A}\) is a column vector, then \(\mathbf{A}^T\) is a row vector, and vice-versa
  - For \(\mathbf{X},\mathbf{Y}\in\mathbb{R}^n\), the inner product \(\mathbf{X}^T\mathbf{Y}=\sum_{i=1}^nX_iY_i\) is a scalar
  - For \(\mathbf{X}\in\mathbb{R}^n\) and \(\mathbf{A}\in\mathbb{R}^{m\times n}\), \(\mathbf{AX}\in\mathbb{R}^m\)
  - For square matrix \(\mathbf{A}\in\mathbb{R}^{n\times n}\), the trace \(\text{trace}(\mathbf{A})\) is the sum of its diagonal entries
  - For \(\mathbf{X},\mathbf{Y}\in\mathbb{R}^n\), the outer product \(\mathbf{XY}^T\) is an \(n\times n\) matrix
===== Random vectors =====
A random vector is a convenient way to represent a set of random variables:
\[\mathbf{X}=\begin{bmatrix}X_1\\X_2\\\vdots\\X_n\end{bmatrix}\]
The CDF of a random vector is:
\[F_{\mathbf{X}}(\mathbf{x})=P[X_1\leq x_1,X_2\leq x_2,...,X_n\leq x_n]\]
The PMF of a random vector is:
\[p_{\mathbf{X}}(\mathbf{x})=P[X_1=x_1,X_2=x_2,...,X_n=x_n]\]
The PDF of a random vector is:
\[f_\mathbf{X}(\mathbf{x})=\frac{\partial^nF_\mathbf{X}(\mathbf{x})}{\partial x_1\partial x_2...\partial x_n}\]
The mean of a random vector is:
\[E[\mathbf{X}]=\begin{bmatrix}E[X_1]\\E[X_2]\\\vdots\\E[X_n]\end{bmatrix}\]
The variance in 1-D measures the average squared distance from the mean, so an analogous definition for random vectors is:
\[\text{var}(\mathbf{X})=E[||\mathbf{X}-\mu_\mathbf{X}||^2]=E\left[\sum_{i=1}^n|X_i-\mu_i|^2\right]=\sum_{i=1}^n\text{var}(X_i)=E[(\mathbf{X}-\mu_\mathbf{X})^T(\mathbf{X}-\mu_\mathbf{X})]\]

The (auto)correlation matrix is used to measure the correlation between elements of random vectors:
\[\mathbf{R_X}=E[\mathbf{XX}^T]\]
\[{R_X}(i,j)=E[X_iX_j]\]
The autocorrelation matrix is symmetric, \(\mathbf{R_X}=\mathbf{R_X}^T\).
The diagonals are \(R_X(i,i)=E[X_i^2]\).

The (auto)covariance matrix is:
\[\mathbf{C_X}=E[(\mathbf{X}-\mu_\mathbf{X})(\mathbf{X}-\mu_\mathbf{X})^T]=\mathbf{R_X}-\mu_\mathbf{X}\mu_\mathbf{X}^T\]
\[{C_X}(i,j)=\text{cov}(X_i,X_j)\]
The covariance matrix is a symmetric collection of the covariances.
The diagonals are \(C_X(i,i)=\text{var}(X_i)\).
The variance of a random vector is \(\text{var}(\mathbf{X})=\text{trace}(\mathbf{C_X})\).

If all the components of \(\mathbf{X}\) are mutually independent, the covariance matrix is diagonal and the off-diagonal entries of \(\mathbf{R_X}\) are those of \(\mu_\mathbf{X}\mu_\mathbf{X}^T\).
When given two random vectors \(\mathbf{X}\in\mathbb{R}^n\) and \(\mathbf{Y}\in\mathbb{R}^m\), we can find their cross-correlation.
\[\mathbf{R_{XY}}=E[\mathbf{XY}^T]\]
This is an \(n\times m\) matrix that is not necessarily symmetric.
Note that \(\mathbf{R_{XY}}=\mathbf{R_{YX}}^T\).
Likewise we can find the cross-covariance.
\[\mathbf{C_{XY}}=\mathbf{R_{XY}}-\mu_\mathbf{X}\mu_\mathbf{Y}^T\]
Again, this is not symmetric in general and \(\mathbf{C_{XY}}=\mathbf{C_{YX}}^T\).
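A sketch of estimating \(\mathbf{R_X}\) and \(\mathbf{C_X}\) from samples; the samples here are drawn from an assumed independent Gaussian model, so the estimated \(\mathbf{C_X}\) should come out nearly diagonal.
<code python>
import numpy as np

rng = np.random.default_rng(0)
# Rows are realisations of a 2-D random vector with independent components.
samples = rng.normal(loc=[1.0, -2.0], scale=[1.0, 0.5], size=(100_000, 2))

mu = samples.mean(axis=0)                  # estimate of E[X]
R = samples.T @ samples / len(samples)     # R_X = E[X X^T]
C = R - np.outer(mu, mu)                   # C_X = R_X - mu mu^T
print(R)
print(C)                                   # close to diag(1, 0.25)
print(np.cov(samples.T, bias=True))        # numpy's estimate of C_X, for comparison
</code>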
===== Gaussian Random Vectors =====
\[\mathbf{X}\sim\mathcal{N}(\mu_\mathbf{X},\mathbf{C_X})\]
A set of Gaussian variables defined by their means and covariances defines a Gaussian random vector.
A random vector is said to be Gaussian if its density function can be written as:
\[f_\mathbf{X}(\mathbf{x})=\frac{\exp\left\{-\frac{1}{2}(\mathbf{x}-\mu_\mathbf{X})^T\mathbf{C_X}^{-1}(\mathbf{x}-\mu_\mathbf{X})\right\}}{(2\pi)^{n/2}\det(\mathbf{C_X})^{1/2}}\]
A Gaussian random vector consists of random variables that are jointly Gaussian.
A subset of the variables from a Gaussian vector must be jointly Gaussian.
The vector has the following properties:
  * **Uncorrelated implies independent:** if \(\mathbf{C_X}\) is diagonal, the components are mutually independent
  * **Affine transform:** \(\mathbf{Y}=\mathbf{AX+b}\) is also Gaussian if \(\mathbf{X}\) is Gaussian, with \(\mu_\mathbf{Y}=\mathbf{A}\mu_\mathbf{X}+\mathbf{b}\) and \(\mathbf{C_Y}=\mathbf{AC_XA}^T\)
===== PDF Level sets =====
A level set is a set of the form:
\[L_C(f)=\{(x_1,...,x_n):f(x_1,...,x_n)=C\}\]
The level set plots the set of points for which the function takes the same value.
The contour plot can help to visualise the density function.

For a Gaussian RV, everything outside of the exponent is a constant, and the exponent defines an n-dimensional ellipse.
The ellipse is centred at \(\mu_\mathbf{X}\) and its orientation is determined by \(\mathbf{C_X}\).
The variance determines the width of the ellipse.
The covariance determines the rotation of the ellipse.
We can apply a rotational transform by multiplying a vector by a rotation matrix:
\[\begin{bmatrix}\hat{X_1}\\\hat{X_2}\end{bmatrix}=\begin{bmatrix}\cos\theta&\sin\theta\\-\sin\theta&\cos\theta\end{bmatrix}\begin{bmatrix}X_1\\X_2\end{bmatrix}=\mathbf{AX}\]
This applies a rotation of \(\theta\) to the original vector.
The challenge is finding a rotation such that the resulting vector has uncorrelated components.
In doing so, the new vector has covariance equal to \(\mathbf{AC_XA}^T\).
This gives the covariance as:
\[\text{cov}(\hat{X_1},\hat{X_2})=\frac{1}{2}\sin(2\theta)[\text{var}(X_2)-\text{var}(X_1)]+\cos(2\theta)\text{cov}(X_1,X_2)\]
Setting this to zero yields:
\[\tan 2\theta=\frac{2\text{cov}(X_1,X_2)}{\text{var}(X_1)-\text{var}(X_2)}\]
\[\theta=\frac{1}{2}\arctan\left(\frac{2\text{cov}(X_1,X_2)}{\text{var}(X_1)-\text{var}(X_2)}\right)\]
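A numeric sketch of this decorrelating rotation for a made-up 2-D covariance matrix: after rotating by \(\theta\), the off-diagonal entries of \(\mathbf{AC_XA}^T\) should be near zero.
<code python>
import numpy as np

C = np.array([[2.0, 0.8],
              [0.8, 1.0]])   # made-up covariance matrix

# theta = (1/2) arctan(2 cov / (var1 - var2)); arctan2 handles var1 == var2.
theta = 0.5 * np.arctan2(2 * C[0, 1], C[0, 0] - C[1, 1])
A = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

C_rot = A @ C @ A.T    # covariance of the rotated vector
print(C_rot)           # off-diagonal entries ~ 0
</code>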
===== Transforms of random vectors =====
A system can be modelled as having random inputs and therefore has random outputs.
The outputs can be thought of as transformations of the inputs.
The question arises as to how to model the outputs for a system with known parameters taking random inputs.

For discrete random vectors:
\[p_\mathbf{Y}(\mathbf{y})=P[\mathbf{Y=y}]=P[g(\mathbf{X})=\mathbf{y}]=\sum_{\forall\mathbf{x}_k:g(\mathbf{x}_k)=\mathbf{y}}p_\mathbf{X}(\mathbf{x}_k)\]
The CDF can then be obtained from the PMF, although this calculation can be tedious.
For continuous random vectors, we can use a similar direct method for the CDF as for a single variable.
Suppose the function is invertible and differentiable, mapping from \(\mathbb{R}^n\to\mathbb{R}^n\).
Then \(\mathbf{Y}=g(\mathbf{X})\) has the PDF:
\[f_\mathbf{Y}(\mathbf{y})=f_\mathbf{X}(g^{-1}(\mathbf{y}))\left|\det\frac{\partial\mathbf{x}}{\partial\mathbf{y}}\right|\]
Where \(\frac{\partial\mathbf{x}}{\partial\mathbf{y}}\) is the \(n\times n\) Jacobian.
If \(\mathbf{Y}\) has dimension \(m<n\), the analysis is more complex.
Where \(n=1\), the analysis simplifies to:
\[f_Y(y)=f_X(g^{-1}(y))\left|\frac{dx}{dy}\right|\]
For the conversion via the CDF, we can think of \(g(\cdot)\) as \(m\) separate functions mapping from \(\mathbb{R}^n\to\mathbb{R}\).
\[F_\mathbf{Y}(\mathbf{y})=P[Y_1\leq y_1,...,Y_m\leq y_m]=P[g_1(\mathbf{X})\leq y_1,...,g_m(\mathbf{X})\leq y_m]\]
====== Inequalities, the Law of Large Numbers and the Central Limit Theorem ======
===== Markov inequality =====
We may be interested in a bound on the probability of a random variable.
Of particular interest is a bound on the probability for values far from the mean (tail probabilities).
The Markov inequality states that for a non-negative random variable \(X\) and constant \(a>0\):
\[P[X\geq a]\leq\frac{E[X]}{a}\]
In general this provides a loose upper bound, although little information is needed to compute it.
===== Chebyshev inequality =====
For a given random variable \(X\) with mean \(\mu_X\) and any \(a>0\), the Chebyshev inequality states:
\[P[|X-\mu_X|\geq a]\leq\frac{\text{var}(X)}{a^2}\]
This uses more information than the Markov inequality (it needs the variance), and typically gives a tighter bound.
This tells us that if the variance is small, the probability of being far from the mean is small.
We can also tell that most of the probability mass/density is concentrated about the mean.
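A sketch comparing the two bounds with the exact tail probability for an assumed \(X\sim\text{Exponential}(1)\) (mean 1, variance 1); both bounds hold but can be quite loose.
<code python>
from math import exp

mean, var = 1.0, 1.0
for a in [2.0, 3.0, 5.0]:
    exact = exp(-a)                   # P[X >= a] for Exponential(1)
    markov = mean / a                 # Markov: P[X >= a] <= E[X]/a
    chebyshev = var / (a - mean)**2   # Chebyshev: P[|X - mu| >= a - mu] <= var/(a-mu)^2
    print(f"a={a}: exact={exact:.4f}, markov={markov:.4f}, chebyshev={chebyshev:.4f}")
</code>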
===== Sum of random variables =====
Many macroscopic phenomena can be seen as the sum of a number of probabilistic microscopic actions.
When we want to add two random variables, we convolve their PDFs to get the sum's PDF.
We can repeat this to sum an arbitrary number of variables.
If we have a sum of iid random variables, the expected value of the sum is \(E[S_n]=n\mu\) and its variance is \(\text{Var}(S_n)=n\sigma^2\).
If we take the mean of n iid random variables, the expected value is \(E[M_n]=\mu\) and the variance is \(\text{Var}(M_n)=\sigma^2/n\).
The latter can be thought of as taking a sample from a distribution.
The sample has a mean equal to that of the original distribution, with a variance that shrinks as the sample size grows.
In sampling, we can get outliers that can distort the sample mean, but sampling allows us to create bounds on the sample to give confidence.
For an example Bernoulli variable which we sample, we can place bounds due to the Chebyshev inequality, using \(p(1-p)\leq 1/4\):
\[P[|M_n-p|\geq a]\leq\frac{\text{Var}(M_n)}{a^2}=\frac{\text{Var}(X_i)}{na^2}=\frac{p(1-p)}{na^2}\leq\frac{1}{4na^2}\]
==== Weak law of large numbers ====
\[\text{Var}(M_n)=\frac{\sigma^2}{n}\implies\lim_{n\to\infty}\text{Var}(M_n)=0\]
By Chebyshev's inequality:
\[P[|M_n-\mu|\geq\epsilon]\leq\frac{\text{Var}(M_n)}{\epsilon^2}\]
Taking this limit gives the weak law of large numbers:
\[\lim_{n\to\infty}P[|M_n-\mu|\geq\epsilon]=0\]
Alternately we can express this as:
\[\lim_{n\to\infty}P[|M_n-\mu|\leq\epsilon]=1\]
==== Empirical data and probabilities ====
We get probabilistic models for data from statistical inference.
For a given event, we can calculate an empirical probability by running \(n\) trials and counting how often the event occurs, \(\tilde{P}_n[A]=N_n(A)/n\).
We can consider our experiment as a Bernoulli trial, where either the event occurs or it doesn't, so the bound from above applies:
\[P[|\tilde{P}_n[A]-P[A]|<\epsilon]\geq1-\frac{1}{4n\epsilon^2}\]
From this we can get that to double our precision, we need 4 times the number of trials.
===== Central limit theorem =====
For a given sum of \(n\) iid random variables, each with mean \(\mu\) and variance \(\sigma^2\), we can define the standardised sum:
\[Z_n=\frac{S_n-n\mu}{\sigma\sqrt{n}}\]
The mean of \(Z_n\) is \(E[Z_n]=0\), and its variance is \(\text{Var}(Z_n)=\frac{\text{Var}(S_n)}{(\sigma\sqrt{n})^2}=1\).
The central limit theorem states that:
\[\lim_{n\to\infty}P[Z_n\leq z]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^ze^{-x^2/2}dx=\Phi(z)\]
In other words, the CDF of \(Z_n\) approaches that of a standard normal.
This is regardless of the underlying distribution, provided the mean and variance are finite.
The central limit theorem allows us to approximate sums and averages of iid random variables with Gaussian distributions, when the number of variables is sufficiently large.
We can see this in a system: if noise is due to a large number of events, the noise will be Gaussian.
Additionally the empirical mean can be modelled as Gaussian for sufficiently large \(n\).
As a loose guideline, \(n=30\) can be used, but this depends on the underlying distribution, and more or fewer samples may be needed.
Sometimes the CLT will be used for fewer samples as it is easier than using the underlying distribution.
We can find that the original sum approaches \(F_{S_n}(a)\approx\Phi\left(\frac{a-n\mu}{\sigma\sqrt{n}}\right)\).
As the sample size increases, the mean of the sample tends to be distributed as \(M_n\sim\mathcal{N}(\mu,\sigma^2/n)\).
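A simulation sketch of the CLT, using sums of \(n=30\) iid Uniform(0,1) variables (an arbitrary choice of distribution and seed) and comparing the empirical CDF of \(Z_n\) with \(\Phi\):
<code python>
import numpy as np
from math import erf, sqrt

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(1)
n, trials = 30, 200_000
mu, sigma = 0.5, sqrt(1.0 / 12.0)        # mean and std of Uniform(0, 1)

S = rng.random((trials, n)).sum(axis=1)  # sums of n iid uniforms
Z = (S - n * mu) / (sigma * sqrt(n))     # standardised sums

for z in [-1.0, 0.0, 1.0]:
    print(f"z={z}: empirical={np.mean(Z <= z):.4f}, Phi(z)={phi(z):.4f}")
</code>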
==== De Moivre-Laplace Approximation ====
- | Because the CLT approximates with a continuous random variable, getting the probability of a single outcome is 0, which limits its use for discrete random variables. We can use the De Moivre-Laplace approximation here to find for a single value for binomial variables. If $S_n\sim\text{Binomial}(n, | + | Because the CLT approximates with a continuous random variable, getting the probability of a single outcome is 0, which limits its use for discrete random variables. |
+ | We can use the De Moivre-Laplace approximation here to find the probability of a single value (or a range of values) of a binomial variable. | ||
+ | If \(S_n\sim\text{Binomial}(n,p)\): | ||
+ | \[P[k_1\leq S_n\leq k_2]\approx\Phi\left(\frac{k_2+0.5-np}{\sqrt{np(1-p)}}\right)-\Phi\left(\frac{k_1-0.5-np}{\sqrt{np(1-p)}}\right)\] | ||
+ | This is equivalent to the approximation: | ||
+ | \[P[k_1\leq S_n\leq k_2]\approx P[k_1-0.5\leq S_n\leq k_2+0.5]\] | ||
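A sketch checking the continuity correction against the exact binomial CDF (scipy; the values of \(n\), \(p\), \(k_1\) and \(k_2\) are arbitrary):
<code python>
import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.3
mu, sd = n * p, np.sqrt(n * p * (1 - p))
k1, k2 = 25, 35

exact = binom.cdf(k2, n, p) - binom.cdf(k1 - 1, n, p)   # P[k1 <= S_n <= k2]
approx = norm.cdf((k2 + 0.5 - mu) / sd) - norm.cdf((k1 - 0.5 - mu) / sd)
print(exact, approx)   # the two values agree closely
</code>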
==== Random vectors ==== | ==== Random vectors ==== | ||
- | For a sequence of iid random vectors, witha mean $E[\mathbf{X}_i]=\mu_\mathbf{X_i}$ and covariance | + | For a sequence of iid random vectors, with mean \(E[\mathbf{X}_i]=\mu_\mathbf{X}\) and covariance matrix \(\mathbf{K}_\mathbf{X}\), we define: | ||
+ | \[\mathbf{S}_n=\sum_{i=1}^n\mathbf{X}_i\] | ||
+ | \[\mathbf{Z}_n=\frac{1}{\sqrt{n}}(\mathbf{S}_n-n\mu_\mathbf{X})\] | ||
+ | So as \(n\to\infty\), \(\mathbf{Z}_n\sim\mathcal{N}(\mathbf{0},\mathbf{K}_\mathbf{X})\), i.e. the standardised sum becomes jointly Gaussian with the covariance matrix of the individual vectors. | ||
===== Moment Generating Function (MGF) ===== | ===== Moment Generating Function (MGF) ===== | ||
- | The Moment Generating Function is: $$\phi_X(s)=E[e^{sX}]$$ So for a discrete RV: $$\phi_X(s)=\sum_{x\in S_X}e^{sx}p_X(x)$$ And for a continuous RV: $$\phi_X(s)=\int_{-\infty}^\infty e^{sx}f_X(x)dx$$ In the continuous case, $\phi(s)$ is the two sided Laplace transform of $f_X(x)$, with $s$ instead of $-s$. In the discrete case, $\phi(s)$ is a Z-transform of $p_X$, with $z=e^{-s}$. | + | The Moment Generating Function is: |
- | + | \[\phi_X(s)=E[e^{sX}]\] | |
- | The MGF has use in proving the CLT. The kth moment of a random variable $X$ is defined as $E[X^k]$, where $k$ is a positive integer. For some distributions finding the moment will involve integration by parts $k-1$ times. Instead, the kth derivative of the MGF gives the moment, where $s=0$. $$\left.\frac{d^k\phi_X(s)}{ds^k}\right|_{s=0}=E[X^k]$$ $\phi_X(s)$ is only defined where $E[e^{sX}]$ is finite, meaning within the region of convergence. As summing two random variables results in the convolution of their PDFs, in the MGF domain, we can multiply the individual MGFs to get the resultant MGF. | + | So for a discrete RV: |
+ | \[\phi_X(s)=\sum_{x\in S_X}e^{sx}p_X(x)\] | ||
+ | And for a continuous RV: | ||
+ | \[\phi_X(s)=\int_{-\infty}^\infty e^{sx}f_X(x)dx\] | ||
+ | In the continuous case, \(\phi(s)\) is the two sided Laplace transform of \(f_X(x)\), with \(s\) instead of \(-s\). | ||
+ | In the discrete case, \(\phi(s)\) is a Z-transform of \(p_X\), with \(z=e^{-s}\). | ||
+ | The MGF has use in proving the CLT. | ||
+ | The kth moment of a random variable \(X\) is defined as \(E[X^k]\), where \(k\) is a positive integer. | ||
+ | For some distributions finding the moment will involve integration by parts \(k-1\) times. | ||
+ | Instead, the kth derivative of the MGF, evaluated at \(s=0\), gives the kth moment. | ||
+ | \[\left.\frac{d^k\phi_X(s)}{ds^k}\right|_{s=0}=E[X^k]\] | ||
+ | \(\phi_X(s)\) is only defined where \(E[e^{sX}]\) is finite, meaning within the region of convergence. | ||
+ | As summing two random variables results in the convolution of their PDFs, in the MGF domain, we can multiply the individual MGFs to get the resultant MGF. | ||
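As an illustration (a sketch using sympy; the exponential MGF \(\phi_X(s)=\frac{\lambda}{\lambda-s}\) for \(s<\lambda\) is a standard result), we can recover the moments \(E[X^k]=k!/\lambda^k\) by differentiation:
<code python>
import sympy as sp

s, lam = sp.symbols('s lambda', positive=True)

# MGF of an exponential(lambda) RV: phi(s) = lambda / (lambda - s), for s < lambda
phi = lam / (lam - s)

for k in (1, 2, 3):
    moment = sp.diff(phi, s, k).subs(s, 0)   # kth derivative at s = 0
    print(k, sp.simplify(moment))            # prints k! / lambda**k
</code>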
===== Random Processes ===== | ===== Random Processes ===== | ||
- | A stochastic process (random process) is a mapping from the sample space $S$ to the set of time-domain waveforms (either discrete or continuous time). In that sense, the experiment results in an outcome that is a time domain signal. | + | A stochastic process (random process) is a mapping from the sample space \(S\) to the set of time-domain waveforms (either discrete or continuous time). |
+ | In that sense, the experiment results in an outcome that is a time domain signal. | ||
==== Continuous time ==== | ==== Continuous time ==== | ||
- | In continuous time a random process is usually denoted as $X(t)$ or $\{X(t)\}$. The process is a function of two variables | + | In continuous time a random process is usually denoted as \(X(t)\) or \(\{X(t)\}\). |
+ | The process is a function of two variables: the time \(t\) and the outcome \(\zeta\) of the underlying experiment, written \(X(t,\zeta)\). | ||
- | For a fixed outcome, the deterministic signal is referred to as a sample function, sample path or realisation of $X(t)$. For every time, $X(t, | + | For a fixed time \(t_0\), \(X(t_0,\zeta)\) is a random variable. | ||
+ | For a fixed outcome \(\zeta_0\), \(X(t,\zeta_0)\) is a deterministic function of time. | ||
+ | Fixing both results in the single number \(X(t_0,\zeta_0)\). | ||
+ | For a fixed outcome, the deterministic signal is referred to as a sample function, sample path or realisation of \(X(t)\). | ||
+ | For every time \(t\), \(X(t,\cdot)\) is a random variable. | ||
+ | The ensemble of a random process is the set of all possible time domain functions that can result from the experiment, analogous to the range of the RV. | ||
==== Discrete time ==== | ==== Discrete time ==== | ||
- | Discrete time processes are similar to continuous time, but using discrete time instants | + | Discrete time processes are similar to continuous time, but using discrete time instants \(n\). | ||
+ | It is written as \(X(n)\) or \(\{X(n)\}\). | ||
+ | A discrete time random process is a sequence of random variables, and the ensemble is all possible sequences. | ||
==== Types of random processes ==== | ==== Types of random processes ==== | ||
Random processes can be either: | Random processes can be either: | ||
- | |||
* Discrete time, discrete valued (DT DV) | * Discrete time, discrete valued (DT DV) | ||
* Discrete time, continuous valued (DT CV) | * Discrete time, continuous valued (DT CV) | ||
* Continuous time, discrete valued (CT DV) | * Continuous time, discrete valued (CT DV) | ||
* Continuous time, continuous valued (CT CV) | * Continuous time, continuous valued (CT CV) | ||
- | |||
Random processes are not just limited to time, and can be generalised to a random field. | Random processes are not just limited to time, and can be generalised to a random field. | ||
- | |||
==== Specifying random processes ==== | ==== Specifying random processes ==== | ||
- | Can be difficult to fully statistically characterise a random process. The First-Order Distribution is a random process | + | It can be difficult to fully statistically characterise a random process. | ||
+ | The First-Order Distribution of a random process is: | ||
+ | \[F_{X(t)}(x)=P[X(t)\leq x]\] | ||
+ | This is the First-Order CDF, and is a function of \(x\) and \(t\). | ||
+ | The First-Order PDF is: | ||
+ | \[f_{X(t)}(x)=\frac{\partial}{\partial x}F_{X(t)}(x)\] | ||
+ | We can also define mean and variance functions: | ||
+ | \[\mu_X(t)=E[X(t)]=\int_{-\infty}^\infty xf_{X(t)}(x)dx\] | ||
+ | \[\text{Var}[X(t)]=E\left\{[X(t)-\mu_X(t)]^2\right\}\] | ||
- | The Second-Order distribution compares how two times are related. The CDF is: $$F_{X(t_1), | + | The Second-Order distribution compares how two times are related. |
- | + | The CDF is: | |
- | For the nth order distribution, | + | \[F_{X(t_1),X(t_2)}(x_1,x_2)=P[X(t_1)\leq x_1,X(t_2)\leq x_2]\] | ||
+ | The PDF is: | ||
+ | \[f_{X(t_1),X(t_2)}(x_1,x_2)=\frac{\partial^2}{\partial x_1\partial x_2}F_{X(t_1),X(t_2)}(x_1,x_2)\] | ||
+ | \(X(t_1)\) and \(X(t_2)\) are a pair of random variables, so we can look at their autocorrelation and autocovariance. | ||
+ | \[R_X(t_1,t_2)=E[X(t_1)X(t_2)]\] | ||
+ | \[C_X(t_1,t_2)=E[(X(t_1)-\mu_X(t_1))(X(t_2)-\mu_X(t_2))]=R_X(t_1,t_2)-\mu_X(t_1)\mu_X(t_2)\] | ||
+ | For the nth order distribution, the CDF and PDF are: | ||
+ | \[F_{X(t_1),...,X(t_n)}(x_1,...,x_n)=P[X(t_1)\leq x_1,...,X(t_n)\leq x_n]\] | ||
+ | \[f_{X(t_1),...,X(t_n)}(x_1,...,x_n)=\frac{\partial^n}{\partial x_1\cdots\partial x_n}F_{X(t_1),...,X(t_n)}(x_1,...,x_n)\] | ||
+ | A stochastic process is specified by the collection of nth order joint distribution functions for any \(n\) and any choice of sampling times \(t_1,...,t_n\). | ||
+ | This means we may need an infinite collection of distribution functions to fully characterise a stochastic process. | ||
+ | Some processes have simple models that are easier to deal with. | ||
==== Stationarity ==== | ==== Stationarity ==== | ||
- | Some processes have (strict) stationarity. That means that for any $n\in\mathbb{N}$ and $t_1,...,t_n$ and time shift $\tau$: $$F_{X(t_1), | + | Some processes have (strict) stationarity. |
- | + | That means that for any \(n\in\mathbb{N}\) and \(t_1,...,t_n\) and time shift \(\tau\): | |
- | For a First-Order distribution: | + | \[F_{X(t_1),...,X(t_n)}(x_1,...,x_n)=F_{X(t_1+\tau),...,X(t_n+\tau)}(x_1,...,x_n)\] | ||
+ | \[f_{X(t_1),...,X(t_n)}(x_1,...,x_n)=f_{X(t_1+\tau),...,X(t_n+\tau)}(x_1,...,x_n)\] | ||
+ | If this holds, the random process is (strictly) stationary. | ||
+ | This means that the joint distributions are time-shift invariant. | ||
+ | This must hold true for any number and location of time instants and \(\tau\). | ||
+ | Stationarity implies that the CDF/PDF/PMF only depends on the relative times between instants. | ||
- | For a Second-Order distribution: | + | For a First-Order distribution: |
+ | \[F_{X(t)}(x)=F_{X(t+\tau)}(x)\] | ||
+ | \[f_{X(t)}(x)=f_{X(t+\tau)}(x)\] | ||
+ | Meaning that the distribution is equal for all times, letting us write: | ||
+ | \[f_{X(t)}(x)=f_{X(t+\tau)}(x)=f_{\overline{X}}(x),\forall t,\tau\] | ||
+ | \[X(t)\stackrel{d}{=}\overline{X},\forall t\] | ||
+ | As a result we have the mean and variance being constant: | ||
+ | \[E[X(t)]=\mu\] | ||
+ | \[\text{Var}(X(t))=\sigma^2\] | ||
- | A joint distribution | + | For a Second-Order distribution: | ||
+ | \[f_{X(t_1),X(t_2)}(x_1,x_2)=f_{X(0),X(t_2-t_1)}(x_1,x_2),\forall t_1,t_2\] | ||
+ | As these PDFs are the same, so are the autocorrelation and autocovariance functions. | ||
+ | \[R_X(t_1,t_2)=R_X(t_2-t_1)\] | ||
+ | \[C_X(t_1,t_2)=C_X(t_2-t_1)\] | ||
+ | \[R_X(s,s+\tau)=R_X(\tau),\forall s\] | ||
+ | \[C_X(s,s+\tau)=C_X(\tau),\forall s\] | ||
+ | The joint distribution of an iid random process has no dependency on \(t\), so an iid process must be stationary. | ||
=== Wide Sense Stationarity (WSS) === | === Wide Sense Stationarity (WSS) === | ||
- | A random process $\{X(t)\}$ is WSS if: | ||
- | * $E[X(t)]=\mu, | ||
- | * $R_X(t_1, | ||
+ | A random process \(\{X(t)\}\) is WSS if: | ||
- \(E[X(t)]=\mu,\forall t\) (constant mean) | ||
- \(R_X(t_1,t_2)=R_X(t_2-t_1),\forall t_1,t_2\) (autocorrelation depends only on the time difference) | ||
Wide sense stationarity is weaker than strict stationarity. | Wide sense stationarity is weaker than strict stationarity. | ||
+ | |||
For a WSS process, the autocorrelation has the following properties: | For a WSS process, the autocorrelation has the following properties: | ||
- | - $R_X(\tau)=R_X(-\tau)$, $R_X$ is an even function | + | - \(R_X(\tau)=R_X(-\tau)\), \(R_X\) is an even function |
- | - $R_X(0)=E[X^2(t)]=\text{average power }X(t)$ | + | - \(R_X(0)=E[X^2(t)]=\text{average power }X(t)\) |
- | - $|R_X(\tau)|\leq R_X(0)$ | + | - \(|R_X(\tau)|\leq R_X(0)\) |
- | - $P[|X(t+\tau)-X(\tau)|> | + | - \(P[|X(t+\tau)-X(t)|>\epsilon]\leq\frac{2(R_X(0)-R_X(\tau))}{\epsilon^2}\) | ||
- | + | Properties 1, 3 and 4 also apply to the autocovariance. | ||
- | Properties 1, 3 and 4 also apply to the autocovariance, | + | Property 4 is a bound on how fast the process can change: the larger \(R_X(0)-R_X(\tau)\), the more likely the process is to change significantly over an interval of length \(\tau\). | ||
- | There is also the WSS periodic property, being that if $R_X(T)=R_X(0)$ for a $T\neq 0$, then $R_X(\tau)$ is periodic with period $T$ and $X(t+T)=X(t)$ with probability 1. | + | |
+ | There is also the WSS periodic property, being that if \(R_X(T)=R_X(0)\) for a \(T\neq 0\), then \(R_X(\tau)\) is periodic with period \(T\) and \(X(t+T)=X(t)\) with probability 1. | ||
==== Gaussian Process ==== | ==== Gaussian Process ==== | ||
- | $\{X(t)\}$ is a Gaussian process if $\forall n\in\mathbb{N}$ and $t_1,...,t_n$, $X(t_1), | + | \(\{X(t)\}\) is a Gaussian process if \(\forall n\in\mathbb{N}\) and \(t_1,...,t_n\), the samples \(X(t_1),...,X(t_n)\) are jointly Gaussian, i.e. the following is a Gaussian random vector: | ||
+ | \[\mathbf{X}=[X(t_1),...,X(t_n)]^T\] | ||
+ | If a Gaussian random process is wide sense stationary, then it is strictly stationary as well. | ||
+ | This is because the joint densities are fully specified by the mean and autocovariance functions, which for a WSS process are time-shift invariant. | ||
==== White noise ==== | ==== White noise ==== | ||
- | A WSS discrete time process | + | A WSS discrete time process is white noise if it is zero mean and its autocorrelation and autocovariance are: | ||
+ | \[R_X(l)=C_X(l)=\sigma^2_W\cdot\delta(l)\] | ||
+ | Here \(\sigma^2_W\) is the average noise power. | ||
+ | Here any two distinct time samples are always uncorrelated. | ||
- | A WSS continuous time process | + | A WSS continuous time process \(\{W(t)\}\) is similarly white noise if it is zero mean with: | ||
- | + | \[R_W(\tau)=C_W(\tau)=\alpha\delta(\tau)\] | |
- | White noise is white as its power spectral density is constant for all frequencies, | + | Again any two distinct samples are uncorrelated, however the average power \(R_W(0)\) is infinite, so continuous time white noise is an idealisation. | ||
+ | White noise is white as its power spectral density is constant for all frequencies, analogous to white light. | ||
+ | For a WSS random process, the power spectral density is the Fourier transform of the autocorrelation. | ||
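A short numerical sketch (numpy; the noise power \(\sigma^2_W=2\) and sample count are arbitrary choices) estimating the autocorrelation of discrete time Gaussian white noise, which should be \(\sigma^2_W\) at lag 0 and near 0 at all other lags:
<code python>
import numpy as np

rng = np.random.default_rng(2)
sigma2, N = 2.0, 100_000
W = rng.normal(0, np.sqrt(sigma2), N)   # discrete-time Gaussian white noise

# Estimate R_X(l) = E[X(n) X(n+l)] for a few lags
for l in range(4):
    R_l = np.mean(W[:N - l] * W[l:])
    print(l, R_l)   # approx sigma2 at l = 0, approx 0 otherwise
</code>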
==== Counting Process ==== | ==== Counting Process ==== | ||
- | A counting process | + | A counting process |
+ | The process usually starts at \(t=0\), so \(N(t)=0,\forall t\leq0\). | ||
+ | The discontinuity occurs at times of arrival, so \(N(0,t)\) counts the number of arrivals in \([0,t]\) and is non-decreasing. | ||
+ | The inter-arrival times are also random, so this is a discrete valued continuous time random process. | ||
+ | === Poisson Process === | ||
- | === Poisson Process === | + | A counting process is a Poisson process with rate \(\lambda\) if: | ||
- | A counting process | - The number of arrivals in any interval of length \(t\) is \(\text{Poisson}(t\lambda)\) distributed | ||
- Any pair of non-overlapping intervals has an independent number of arrivals: Independent Increments Property | - Any pair of non-overlapping intervals has an independent number of arrivals: Independent Increments Property | ||
- | The parameter | + | The parameter \(\lambda\) is the arrival rate: the average number of arrivals per unit time. | ||
- | For a given period, we can write the expectation, | + | For a given period, we can write the distribution, expectation, variance and PMF: | ||
- | + | \[N(t)\sim\text{Poisson}(t\lambda)\] | |
- | When intervals don't overlap, the intervals are independent, | + | \[E[N(t)]=t\lambda\] |
+ | \[\text{Var}[N(t)]=t\lambda\] | ||
+ | \[p_{N(t)}(k)=P[N(t)=k]=\frac{(t\lambda)^k}{k!}e^{-t\lambda}\] | ||
+ | We can see that \(\{N(t)\}\) is neither stationary nor WSS. | ||
+ | When intervals don't overlap, the numbers of arrivals in them are independent, so joint probabilities factorise into products. | ||
+ | When intervals overlap, we need to split into independent intervals. | ||
+ | This can be done by splitting into conditional probabilities: | ||
+ | \[P[A\cap B]=P[B|A]P[A]\] | ||
+ | Where \(A\) and \(B\) overlap, and both are Poisson processes. | ||
+ | If \(A\) is an event on \([0,t_1]\) and \(B\) an event on \([0,t_2]\) with \(t_1<t_2\), we can write \(P[B|A]\) in terms of the independent increment over \((t_1,t_2]\). | ||
+ | Where only part of the intervals overlap, we need to split into 3 disjoint intervals and sum over all possible interval combinations. | ||
=== Inter-Arrival times === | === Inter-Arrival times === | ||
- | For a Poisson Process, the Inter-Arrival times are also random variables: | + | |
+ | For a Poisson Process, the Inter-Arrival times are also random variables: | ||
+ | \[X_i=T_i-T_{i-1}\] | ||
+ | We can model this as an exponential RV. | ||
+ | \[X_i\sim\text{exponential}(\lambda)\] | ||
+ | \[E[X_i]=\frac{1}{\lambda}\] | ||
+ | \[F_{X_i}(t)=1-e^{-\lambda t}\] | ||
+ | Because of the independent increments property, the inter-arrival times are independent. | ||
We can simulate a Poisson random process using the exponential generation and summing to obtain the inter-arrival times. | We can simulate a Poisson random process using the exponential generation and summing to obtain the inter-arrival times. | ||
- | - Generate a sequence of $[0,1]$ uniform random variables | + | - Generate a sequence of \([0,1]\) uniform random variables |
- | - Transform each $U_i$ to exponential RVs: $X_i=\frac{-\ln(1-U)}{\lambda}$ | + | - Transform each \(U_i\) to exponential RVs: \(X_i=\frac{-\ln(1-U_i)}{\lambda}\) | ||
- | - Cumulative sum to find the inter-arrival times: | + | - Cumulatively sum the inter-arrival times to find the arrival times: \(T_i=X_1+...+X_i\) | ||
- | - For all $t>0$, $N(t)=i$ for $T_{i}< | + | - For all \(t>0\), \(N(t)=i\) for \(T_{i}\leq t<T_{i+1}\) | ||
- | + | Here the \(T_i\) are Erlang random variables. | |
- | Here the $T_i$ are Erlang random variables. | + | |
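The four steps above translate directly into a short simulation (a sketch in Python with numpy; \(\lambda\) and the number of arrivals generated are arbitrary choices):
<code python>
import numpy as np

rng = np.random.default_rng(3)
lam, n_arrivals = 2.0, 20

U = rng.random(n_arrivals)      # step 1: Uniform(0,1) samples
X = -np.log(1 - U) / lam        # step 2: exponential inter-arrival times
T = np.cumsum(X)                # step 3: arrival times (Erlang distributed)

def N(t):
    # step 4: N(t) = number of arrival times T_i <= t
    return np.searchsorted(T, t, side='right')

print(T[:5])                            # first few arrival times
print([N(t) for t in (0.5, 1.0, 2.0)])  # counts at a few time instants
</code>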
=== Bernoulli Process === | === Bernoulli Process === | ||
- | The process is a sequence of iid Bernoulli RVs $X(k)$. It is a DT DV process, where at each time instant, the value is either 0 or 1. $$P[X(k)=1]=P[\text{Success at kth trial}]=p$$ $$P[X(k)=0]=P[\text{Failiure at kth trail}]=1-p$$ As a sequence of iid RVs, it is a iid random process, and thus is a stationary random process. | ||
+ | The process is a sequence of iid Bernoulli RVs \(X(k)\). | ||
+ | It is a DT DV process, where at each time instant, the value is either 0 or 1. | ||
+ | \[P[X(k)=1]=P[\text{Success at kth trial}]=p\] | ||
+ | \[P[X(k)=0]=P[\text{Failure at kth trial}]=1-p\] | ||
+ | As a sequence of iid RVs, it is an iid random process, and thus is a stationary random process. | ||
=== Binomial Counting Process === | === Binomial Counting Process === | ||
- | The Binomial Counting Process is the result of cumulatively summing a Bernoulli Process. $$S(k)=\sum_{n=1}^k X(n)$$ Where $X(k)$ are iid Bernoulli random variables. $$S(k)\sim\text{Bernoulli}(k, | ||
- | We can think of the inter-arrival time $Y_k=1$ as being the time where $i-1$ failures occur, followed by a success. Each trial that is an iid Bernoulli | + | The Binomial Counting Process is the result of cumulatively summing |
+ | \[S(k)=\sum_{n=1}^k X(n)\] | ||
+ | Where \(X(k)\) are iid Bernoulli | ||
+ | \[S(k)\sim\text{Binomial}(k,p)\] | ||
+ | We can think of this as counting arrivals, where an arrival is a Bernoulli success. | ||
+ | We can think of \(T_k=Y_1+Y_2+...+Y_k\) as the time of the kth arrival (success). | ||
+ | Here \(Y_k=T_k-T_{k-1}\) is the inter-arrival time. | ||
+ | We can think of the inter-arrival time \(Y_k=i\) as being the time where \(i-1\) failures occur, followed by a success. | ||
+ | Each trial that is an iid Bernoulli RV, and the resultant distribution for the inter-arrival time is geometric. | ||
+ | \[Y_k\sim\text{Geometric}(p)\] | ||
+ | \[P[Y_k=i]=(1-p)^{i-1}p\] | ||
+ | Due to the trials being iid, the sequence of geometric RVs \(Y_1,Y_2,...\) is also iid. | ||
+ | Here \(k\) indexes the arrival number rather than time, but the process is still a discrete sequence of random variables. | ||
=== Poisson vs Binomial === | === Poisson vs Binomial === | ||
- | We can think of the Binomial process as a discretised version of the Poisson process. Both have independent increments and inter-arrival times. Both the exponential and geometric distributions share the memoryless property. | ||
- | ^ | ||
- | |Times of arrival | ||
- | |Arrival rate | ||
- | |No. of arrivals PMF|Poisson | ||
- | |Inter-arrival times|Exponential | ||
- | |Time to kth arrival|Erlang | ||
- | === Sum Process | + | We can think of the Binomial process as a discretised version of the Poisson process. |
- | For an iid $\{X(k)\}, | + | Both have independent increments and inter-arrival times. |
+ | Both the exponential and geometric distributions share the memoryless property. | ||
+ | ^ ^ Poisson^ Binomial^ | ||
+ | |Times of arrival|Continuous|Discrete| | ||
+ | |Arrival rate|\(\lambda\) per unit time|\(p\) per trial| | ||
+ | |No. of arrivals PMF|Poisson|Binomial| | ||
+ | |Inter-arrival times|Exponential|Geometric| | ||
+ | |Time to kth arrival|Erlang|Pascal| | ||
+ | === Sum Process === | ||
+ | |||
+ | For an iid \(\{X(k)\}, | ||
+ | \[S(k)=S(k-1)+X(k)\] | ||
+ | Where \(S(1)=X(1)\) and \(S(0)=0\). | ||
+ | The Binomial counting process is a sum process. | ||
+ | Sum processes have the important property: | ||
+ | \[P[S(k+1)\leq a|S(k)=s_k,...,S(1)=s_1]=P[S(k+1)\leq a|S(k)=s_k]\] | ||
+ | This is that the next value is dependent only on the most recent value. | ||
+ | The Pascal counting process, the arrival time sequence of the Binomial process, is a sum process. | ||
+ | The Erlang counting process, the arrival time sequence of the Poisson process, is also a sum process. | ||
+ | Random walks in 1D are sum processes. | ||
+ | |||
+ | Sum processes have the following properties: | ||
+ | \[E[S(k)]=kE[X]\] | ||
+ | \[\text{Var}[S(k)]=k\text{Var}[X]\] | ||
+ | \[C_S(n, | ||
+ | Non-overlapping intervals have independent sum increments. | ||
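These properties can be checked by simulation (a minimal sketch using a \(\pm1\) random walk, for which \(E[X]=0\) and \(\text{Var}[X]=1\); the sizes are arbitrary):
<code python>
import numpy as np

rng = np.random.default_rng(4)
k, trials = 40, 50_000

# Random walk: sum process of iid steps X in {-1, +1}
X = rng.choice([-1, 1], size=(trials, k))
S = X.cumsum(axis=1)

print(S[:, -1].var(), k * 1.0)            # Var[S(k)] ~= k Var[X] = k
n, m = 10, 25
cov = np.mean(S[:, n - 1] * S[:, m - 1])  # E[S(n)S(m)], means are zero here
print(cov, min(n, m) * 1.0)               # C_S(n, m) ~= Var[X] * min{n, m}
</code>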
+ | === Markov Process === | ||
+ | |||
+ | A Markov Process is a process where conditioning on the past is equivalent to conditioning on the immediate past. | ||
+ | \[f_{X(t_{n+1})|X(t_n),...,X(t_1)}(x_{n+1}|x_n,...,x_1)=f_{X(t_{n+1})|X(t_n)}(x_{n+1}|x_n)\] | ||
+ | This also applies to discrete time processes, replacing \(t\) with \(k\). | ||
+ | This can be thought of as only the present matters in predicting the future. | ||
+ | Sum processes are Markov processes, as is any iid process. | ||
+ | |||
+ | It is generally easy to write down high order distributions for Markov processes. | ||
+ | We can reduce a high order PMF/PDF/CDF into a product of first order conditional distributions. | ||
+ | |||
+ | There are 4 types of Markov processes: | ||
+ | * Discrete time Markov chain (DT DV) | ||
+ | * Continuous time Markov chain (CT DV) | ||
+ | * Discrete time Markov process (DT CV) | ||
+ | * Continuous time Markov process (CT CV) | ||
+ | A discrete valued Markov process is called a chain as the outcomes or states are finite or countably infinite. | ||
+ | If we were to draw a state diagram for the outcomes of a Markov chain, they form a weighted directed graph, with nodes of states and edges representing transitions. | ||
+ | The joining of discrete states creates the "chain". | ||
+ | As such, a finite state Markov chain can be viewed as a finite state machine. | ||
+ | We can define a state PMF and a transition probability matrix for a Markov chain. | ||
+ | \[\mathbf{p}(k)=\begin{bmatrix}p_{X(k)}(0)&p_{X(k)}(1)&\cdots\end{bmatrix}\] | ||
+ | \[\mathbf{P}=\begin{bmatrix}p_{00}&p_{01}&\cdots\\p_{10}&p_{11}&\cdots\\\vdots&\vdots&\ddots\end{bmatrix},\quad p_{ij}=P[X(k+1)=j|X(k)=i]\] | ||
+ | This then lets us calculate the next state from the previous with the transition matrix. | ||
+ | \[\mathbf{p}(k+1)=\mathbf{p}(k)\cdot\mathbf{P}\] | ||
+ | This then can be expanded to give: | ||
+ | \[\mathbf{p}(k)=\mathbf{p}(0)\cdot\mathbf{P}^k\] | ||
+ | We have the initial state vector \(\mathbf{p}(0)\), and as \(k\to\infty\) the state PMF may converge to a steady state vector \(\mathbf{p}^*=\mathbf{p}^*\mathbf{P}\). | ||
+ | The steady state vector needs to have components that sum to 1, and is independent of the initial state. | ||
+ | A time homogeneous discrete time Markov chain is one where the transition matrix is constant with time. | ||
+ | If all components of the transition probability matrix are strictly positive, the steady state vector is guaranteed to exist and be uniquely determined by \(\mathbf{P}\). | ||
+ | If there are zeros present, the steady state vector may exist, may be dependent on initial conditions or may not exist. | ||
+ | Where the initial state vector is equal to the steady state vector, all future state vectors are the same for all times. | ||
+ | Conditioning causes the state vector to depend only on the time difference from the conditioned step. | ||
+ | \[\mathbf{p}(k)=\mathbf{u}_i\mathbf{P}^{k-l}\] | ||
+ | Where \(\mathbf{u}_i\) has a 1 in the ith entry and 0 elsewhere, and we are conditioning on being in state \(i\) at time \(l\) and finding the state PMF for \(k>l\). | ||
+ | We can see that shifting the times by the same amount does not change the PMF. | ||
+ | As such a Markov chain is stationary when \(\mathbf{p}(0)=\mathbf{p}^*\). | ||
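A minimal sketch of these state-vector calculations (numpy; the two-state transition matrix is a hypothetical example with strictly positive entries, so a unique steady state exists):
<code python>
import numpy as np

# Hypothetical two-state transition matrix, rows sum to 1
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

p = np.array([1.0, 0.0])   # initial state PMF p(0)
for _ in range(100):
    p = p @ P              # p(k+1) = p(k) P
print(p)                   # converges to the steady state vector

# Steady state directly: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
print(pi / pi.sum())       # same vector, normalised to sum to 1
</code>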
+ | ====== Statistical inference ====== | ||
+ | |||
+ | Statistical inference is using data to draw conclusions about distributions of random variables. | ||
+ | Parameter estimation is estimating unknown or random parameters based on random samples. | ||
+ | Hypothesis testing tests for specific values of an unknown or random parameter based on random samples. | ||
+ | It is trying to classify the parameter from a set of choices. | ||
+ | Hypothesis testing is decision making under uncertainty: choosing between competing explanations for the observed data. | ||
+ | ===== Hypothesis testing ===== | ||
+ | |||
+ | For hypothesis testing, we need to design a test with the objective of either accepting or rejecting \(H_0\). | ||
+ | The test's outcome will be the test statistic. | ||
+ | We also need a decision rule for choosing each hypothesis based on the test statistic. | ||
+ | We then must assess the quality of the decision rules. | ||
+ | For the decision rule, we partition the sample space into an acceptance set and the complementary rejection set. | ||
+ | We then accept \(H_0\) iff the outcome is in the acceptance set. | ||
+ | ==== Errors ==== | ||
+ | |||
+ | This can give rise to errors: | ||
+ | * **Type I:** We reject \(H_0\) when \(H_0\) is true | ||
+ | * **Type II:** We accept \(H_0\) when \(H_0\) is false | ||
+ | There is a trade off with these errors. | ||
+ | ==== Tests ==== | ||
+ | |||
+ | A two tail test is one where we accept if the test statistic is within a given distance of the expected value. | ||
+ | A standard approach in picking the distance is by choosing a significance level such that the probability of a type I error is less than the significance. | ||
+ | The upper value on the lower rejection set is the left critical value, and the lower value on the upper rejection set is the right critical value. | ||
+ | |||
+ | When we are only concerned with the test statistic moving in one direction, we use a one tailed test. | ||
+ | This sets the rejection set as the region from 0 to a critical value, or from a critical value to infinity. | ||
+ | We then need to set the critical value according to the level of significance. | ||
+ | |||
+ | We can also use a p-value for significance, rather than fixing critical values in advance: | ||
+ | \[p=P[X\leq x|H_0]\] | ||
+ | The p-value expresses the probability of getting a value at least as extreme as the one observed given \(H_0\), quantifying the evidence for the rejection of \(H_0\). | ||
+ | The smaller the p-value, the stronger the evidence. | ||
+ | In practice the p-value is often reported directly, as the choice of a significance level is fairly arbitrary. | ||
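For example (a sketch using scipy; the Gaussian null model and the observed value are hypothetical), an upper one-tailed p-value is computed directly from the null distribution:
<code python>
from scipy.stats import norm

# Hypothetical one-tailed test: under H0, the test statistic X ~ N(0, 1),
# and we observed x = 2.1; small p-values are evidence against H0
x = 2.1
p_value = 1 - norm.cdf(x)   # P[X >= x | H0] for an upper one-tailed test
print(p_value)              # ~0.018, rejected at a 5% significance level
</code>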
+ | |||
+ | The type of hypothesis testing where we test if a test statistic reasonably fits a distribution is called significance testing. | ||
+ | The two hypotheses used form an acceptance or rejection of a given distribution. | ||
+ | This is not binary hypothesis testing, as we do not have probability models for each hypothesis. | ||
+ | ==== Binary hypothesis testing ==== | ||
+ | |||
+ | Significance testing checks whether a test statistic fits a known distribution, using a probability model for \(H_0\) only. | ||
+ | Binary hypothesis testing instead decides between \(H_0\) and \(H_1\), both with their own probability models. | ||
+ | Sometimes, prior probabilities \(P[H_0]\) and \(P[H_1]\) can be known. | ||
+ | |||
+ | Given a random variable \(X\) parameterised by \(\theta\), an unknown constant or sampled from a random variable \(\Theta\), we need to express the variable. | ||
+ | If \(\theta\) is constant, we can write the parameterised CDF as \(F_X(X; | ||
+ | If \(\theta\) is hidden, being not directly observable, given a realisation of \(X\), we need to make deductions about \(\theta\). | ||
+ | We need to construct our two hypotheses in terms of the parameter \(\theta\). | ||
+ | |||
+ | Four decision rule criterion are: | ||
+ | * **Maximum A Posteriori (MAP):** Accept \(H_0\) iff \(P[H_0|X=x]\geq P[H_1|X=x]\), | ||
+ | * **Maximum Likelihood (ML):** MAP with uncertainty about prior probabilities | ||
+ | * **Minimum Cost (MC):** Assigns costs to errors and minimises expected costs | ||
+ | * **Neyman-Pearson (NP):** Puts a tolerance level on the false alarm probability | ||
+ | === MAP === | ||
+ | |||
+ | Alternately we can write the MAP rule as: | ||
+ | \[\frac{P[H_0|X=x]}{P[H_1|X=x]}\mathop{\stackrel{H_0}{\lessgtr}}_{H_1}1\] | ||
+ | Using Bayes' rule, we can express the a posteriori probabilities in terms of the a priori, which then lets us express the decision rule as: | ||
+ | \[\frac{P[H_0|X=x]}{P[H_1|X=x]}=\frac{f_{X|H_0}(x)\cdot P[H_0]}{f_{X|H_1}(x)\cdot P[H_1]}\mathop{\stackrel{H_0}{\lessgtr}}_{H_1}1\] | ||
+ | We can define the likelihood ratio as \(\Lambda(x)=\frac{f_{X|H_0}(x)}{f_{X|H_1}(x)}\) and the MAP decision threshold as \(\gamma=\frac{P[H_1]}{P[H_0]}\). | ||
+ | The decision rule then becomes: | ||
+ | \[\Lambda(x)\underset{H_1}{\stackrel{H_0}{\lessgtr}}\gamma\] | ||
+ | |||
+ | In binary hypothesis testing there are two errors: | ||
+ | * **Type I:** Accept \(H_1\) when \(H_0\) true (False alarm) | ||
+ | * **Type II:** Accept \(H_0\) when \(H_1\) is true (Missed detection) | ||
+ | \[P_{FA}=P[\text{Type I error}]=P[\text{Accept }H_1|H_0\text{ true}]\] | ||
+ | \[P_{miss}=P[\text{Type II error}]=P[\text{Accept }H_0|H_1\text{ true}]\] | ||
+ | The overall error probability is: | ||
+ | \[P_{error}=P_{FA}\cdot P[H_0]+P_{miss}\cdot P[H_1]\] | ||
+ | The MAP decision rule minimises the overall error. | ||
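A sketch of the MAP rule in action (numpy/scipy; the two Gaussian likelihoods and the priors are hypothetical choices, not from the notes), estimating \(P_{FA}\), \(P_{miss}\) and the overall error by Monte Carlo:
<code python>
import numpy as np
from scipy.stats import norm

# Hypothetical setup: X | H0 ~ N(0, 1), X | H1 ~ N(2, 1), priors 0.7 / 0.3
p0, p1 = 0.7, 0.3
f0 = lambda x: norm.pdf(x, 0, 1)    # likelihood under H0
f1 = lambda x: norm.pdf(x, 2, 1)    # likelihood under H1
gamma = p1 / p0                     # MAP decision threshold

def accept_h0(x):
    return f0(x) / f1(x) > gamma    # MAP rule: Lambda(x) >< gamma

rng = np.random.default_rng(5)
x0 = rng.normal(0, 1, 100_000)      # samples generated under H0
x1 = rng.normal(2, 1, 100_000)      # samples generated under H1
P_FA = np.mean(~accept_h0(x0))      # Type I: accept H1 when H0 true
P_miss = np.mean(accept_h0(x1))     # Type II: accept H0 when H1 true
print(P_FA, P_miss, P_FA * p0 + P_miss * p1)
</code>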
+ | === ML === | ||
+ | |||
+ | When the a priori probabilities are unknown, we can assume they are equiprobable, giving a threshold of \(\gamma=1\). | ||
+ | This results in the Maximum Likelihood decision rule: | ||
+ | \[\Lambda(x)\underset{H_1}{\stackrel{H_0}{\lessgtr}}1\] | ||
+ | This is equivalent to the MAP rule when \(P[H_0]=P[H_1]=\frac{1}{2}\). | ||
+ | The MAP rule can be unreliable when we have poor estimates of prior probabilities. | ||
+ | |||
+ | The probabilities of a miss and false alarm are inversely related, such that decreasing one increases the other. | ||
+ | The further apart the distributions, | ||
+ | This is due to the increased separation making it easier to distinguish the distributions. | ||
+ | |||
+ | The receiver operating curve (ROC) is a plot of \(P_{FA}\) against \(P_{miss}\). | ||
+ | The closer the curve is to the bottom left \((0,0)\), the better. | ||
+ | The worst case is a straight line from \((1,0)\) to \((0,1)\), corresponding to randomly guessing. | ||
+ | === MC === | ||
+ | |||
+ | A convenient way of handling the trade off between false alarms and misses is to assign costs to each. | ||
+ | This then lets us prioritise one type of error over another, with the goal to minimise cost. | ||
+ | The expected overall cost is: | ||
+ | \[E[C]=c_{FA}P_{FA}P[H_0]+c_{miss}P_{miss}P[H_1]\] | ||
+ | The decision rule minimising this is: | ||
+ | \[\Lambda(x)\underset{H_1}{\overset{H_0}{\lessgtr}}\frac{c_{miss}P[H_1]}{c_{FA}P[H_0]}\] | ||
+ | We can see that this is a weighted form of the MAP threshold, and the cost function is a weighted form of the overall error probability. | ||
+ | If the costs are equal, we get the MAP rule. | ||
+ | The difficulty comes in choosing suitable costs for each error type. | ||
+ | === NP === | ||
+ | |||
+ | The Neyman-Pearson decision rule puts a tolerance level on the false alarm probability. | ||
+ | \[P_{FA}=P[\text{accept }H_1|H_0\text{ true}]\leq\alpha\] | ||
+ | With this constraint, the secondary objective is to find the decision rule that minimises \(P_{miss}\). | ||
+ | This requires creating the decision rule \(\Lambda(x)\overset{H_0}{\underset{H_1}{\lessgtr}}\gamma\), with \(\gamma\) chosen so that the false alarm constraint is met with equality. | ||
+ | ==== Multiple hypothesis testing ==== | ||
+ | |||
+ | Often the number of possible decisions is larger than 2. | ||
+ | We can still have MAP and ML rules, but the decision rule will not be as simple as in the binary case. | ||
+ | There are \((M+1)^2\) decision probabilities. | ||
+ | |||
+ | We can write a generalised MAP rule as: | ||
+ | \[\text{Choose }H_j\text{ if }f_{X|H_j}(x)P[H_j]\geq f_{X|H_i}(x)P[H_i],\forall i\neq j\] | ||
+ | This is choosing the hypothesis with the maximum a posteriori probability. | ||
+ | This is equivalent to: | ||
+ | \[\text{Choose }H_j=\arg\max_i f_{X|H_i}(x)P[H_i]\] | ||
+ | The ML rule is the same, with \(P[H_i]=P[H_j],\forall i,j\). | ||
+ | ===== Statistical inferencing ===== | ||
+ | |||
+ | ==== Finite value estimation ==== | ||
+ | |||
+ | When the number of possible values for a parameter is finite, we can formulate the problem of deciding a value for the parameter as an M-ary hypothesis problem. | ||
+ | We can assign a particular \(\theta_i\) to hypothesis \(H_i\), or formulate the MAP rule as: | ||
+ | \[\hat{\theta}_{MAP}=\theta_i\text{ iff }P[\Theta=\theta_i|X=x]=\max_{j\in\{0,...,M\}}P[\Theta=\theta_j|X=x]\] | ||
+ | \[\hat{\theta}_{MAP}=\arg\max_{\forall\theta\in S_\Theta}P[\Theta=\theta|X=x]\] | ||
+ | For a continuous RV, this reduces to: | ||
+ | \[\hat{\theta}_{MAP}=\arg\max_{\forall\theta\in S_\Theta}f_{X|\Theta}(x|\theta)P[\Theta=\theta]\] | ||
+ | This is the same as the hypothesis notation, using \(H_i=\theta_i\). | ||
+ | When \(\theta\) is discrete, we want to maximise the a posteriori PMF/PDF to satisfy the MAP rule: | ||
+ | \[\hat{\theta}_{MAP}=\arg\max_{\theta}P[\Theta=\theta|X=x]\] | ||
+ | \[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)p_\Theta(\theta)\] | ||
+ | When \(\theta\) is a continuous RV, the a posteriori probability is 0, so we use the conditional PDF instead. | ||
+ | \[\hat{\theta}_{MAP}=\arg\max_\theta f_{\Theta|X}(\theta|x)\] | ||
+ | \[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)f_\Theta(\theta)\] | ||
+ | ==== Continuous MAP and ML estimation ==== | ||
+ | |||
+ | When \(\theta\) is a continuous RV, we try to estimate the value of \(\Theta\) from \(X\). | ||
+ | The MAP rule minimises overall probability of error: | ||
+ | \[P[\hat{\theta}\neq\Theta|X=x]\] | ||
+ | As \(\Theta\) is a continuous RV, perfectly estimating the exact parameter is a zero probability event. | ||
+ | The MAP rule is: | ||
+ | \[\hat{\theta}_{MAP}=\arg\max_\theta f_{X|\Theta}(x|\theta)f_\Theta(\theta)\] | ||
+ | For the ML rule, we can drop \(f_\Theta(\theta)\), maximising only the likelihood \(f_{X|\Theta}(x|\theta)\). | ||
+ | ==== Minimum Mean Square Error estimation ==== | ||
+ | |||
+ | We can measure the performance of an estimation indicator with the expected mean square error: | ||
+ | \[E[(\Theta-\hat{\theta}(X))^2]\] | ||
+ | The Minimum Mean Square Error (MMSE) estimator is defined as: | ||
+ | \[\hat{\theta}_{MMSE}(X)=\arg\min_{\hat{\theta}}E[(\Theta-\hat{\theta}(X))^2]\] | ||
+ | We can write the mean square error as: | ||
+ | \[MSE=E[E[(\Theta-\hat{\theta}(X))^2|X]]\] | ||
+ | The aim is to find a function \(\hat{\theta}(\cdot)\) that minimises the inner expectation for all \(x\in S_X\), which gives the MMSE estimator. | ||
+ | For a given \(X=x\), we can write the inner expectation as: | ||
+ | \[E[(\Theta-\hat{\theta}(x))^2|X=x]=\underbrace{\text{var}(\Theta-\hat{\theta}(x)|X=x)}_{=\text{var}(\Theta|X=x)}+\Big(\underbrace{E[\Theta-\hat{\theta}(x)|X=x]}_{=E[\Theta|X=x]-\hat{\theta}(x)}\Big)^2\] | ||
+ | In minimising this, we find \(\hat{\theta}(x)=E[\Theta|X=x]\), | ||
+ | With Bayes' rule and conditioning on the denominator, | ||
+ | \[\hat{\theta}_{MMSE}(x)=\frac{\int_{-\infty}^{\infty}\theta\cdot f_{X|\Theta}(x|\theta)f_\Theta(\theta)d\theta}{\int_{-\infty}^{\infty}f_{X|\Theta}(x|\theta)f_\Theta(\theta)d\theta}\] | ||
+ | The MMSE estimator can also be known as the Least Mean Squares (LMS) estimator. | ||
+ | Like with the MAP estimator, the MMSE estimator requires a prior distribution on \(\Theta\), which may not always be available. | ||
+ | The MMSE estimator may be difficult to compute compared to the MAP and ML estimators. | ||
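A numerical sketch of the MMSE and MAP estimators (numpy/scipy; the Gaussian prior and likelihood form a hypothetical example where both estimators happen to coincide at \(x/2\)):
<code python>
import numpy as np
from scipy.stats import norm

# Hypothetical model: Theta ~ N(0, 1) prior, X | Theta = theta ~ N(theta, 1)
x = 1.5                                  # observed value
theta = np.linspace(-10, 10, 20001)
dtheta = theta[1] - theta[0]

# Unnormalised posterior: likelihood times prior, then normalise numerically
posterior = norm.pdf(x, theta, 1) * norm.pdf(theta, 0, 1)
posterior /= posterior.sum() * dtheta

theta_mmse = (theta * posterior).sum() * dtheta   # posterior mean E[Theta|X=x]
theta_map = theta[np.argmax(posterior)]           # posterior mode
print(theta_mmse, theta_map)             # both approximately x/2 = 0.75
</code>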
- | Sum processes have the following processes: $$E[S(k)]=kE[X]$$ $$\text{Var}[S(k)]=k\text{Var}[X]$$ $$C_S(n,k)=\text{Var}[X]\cdot\min\{n,k\}$$ Non-overlapping intervals have independent sum increments. | + | ML estimation is classical statistical inference, being that the underlying parameter is deterministic and unknown. |
+ | MAP and MMSE estimation is Bayesian statistical inference, using posterior probabilities and through Bayes' rule is based on prior distributions. | ||
notes/elen90054.1620536059.txt.gz · Last modified: 2023/05/30 22:32 (external edit)