COMP20007 - Design of Algorithms
$$\require{algorithm2e,amsmath,amssymb,mathtools}$$
Problem solving steps
- Understand the problem
- Decide on the computational means
- Decide on method to use
- Design the necessary data structures and algorithms
- Check for correctness, trace input example
- Evaluate analytically (time, space, worst case, average case)
- Code it
- Evaluate empirically
Sum of the first n integers
\[\sum^n_{i=1}i=\frac{n(n+1)}{2}\] The average of the numbers: \(\frac{n+1}{2}\). The number of numbers: \(n\). Therefore the sum of the numbers: \(\frac{n+1}{2}\cdot n\).
Fundamental data structures
Abstract data types: specified by the operations they promise to support, not by any particular implementation.
Stack (LIFO)
Last in first out. Operations:
- CreateStack
- Push
- Pop
- Top
- EmptyStack?
- …
Can be implemented as an array or a linked list.
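A minimal sketch of the stack operations in Python, backed by a list (the class and method names are illustrative, not from the notes):

```python
# Stack sketch: a Python list used as the underlying array.
class Stack:
    def __init__(self):
        self.items = []          # CreateStack

    def push(self, x):
        self.items.append(x)     # O(1) amortised

    def pop(self):
        return self.items.pop()  # removes and returns the last item pushed

    def top(self):
        return self.items[-1]    # peek without removing

    def is_empty(self):          # EmptyStack?
        return not self.items
```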
Queue (FIFO)
First in first out. Operations (sketched after the list):
- CreateQueue
- Enqueue
- Dequeue
- Head
- Empty
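A matching queue sketch, assuming `collections.deque` as the backing store (a plain list would make dequeue O(n)):

```python
from collections import deque

# Queue sketch: deque gives O(1) operations at both ends.
class Queue:
    def __init__(self):
        self.items = deque()        # CreateQueue

    def enqueue(self, x):
        self.items.append(x)        # add at the back

    def dequeue(self):
        return self.items.popleft() # remove from the front (FIFO)

    def head(self):
        return self.items[0]        # peek at the front

    def is_empty(self):             # Empty
        return not self.items
```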
Others
Priority Queue, trees, graphs
Growth rate and Algorithm Efficiency
Assume basic operations take the same amount of time, abstracting away from the architecture. Compare algorithm growth rates in terms of asymptotic analysis (an order hierarchy). \(O(g)\) denotes the set of functions that grow no faster than \(g\) (asymptotically bounded above). \(\Omega(g)\) denotes the set of functions that grow no slower than \(g\) (asymptotically bounded below). \(\Theta(g)\) denotes the set of functions that grow exactly as fast as \(g\) (asymptotically bounded both above and below).
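Formally (the standard definitions, with \(c\) a positive constant and \(n_0\) a threshold):
\[f(n)\in O(g(n)) \iff \exists c, n_0 : f(n)\le c\,g(n)\ \text{for all}\ n\ge n_0\]
\[f(n)\in \Omega(g(n)) \iff \exists c, n_0 : f(n)\ge c\,g(n)\ \text{for all}\ n\ge n_0\]
\[f(n)\in \Theta(g(n)) \iff f(n)\in O(g(n))\ \text{and}\ f(n)\in \Omega(g(n))\]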
Graphs
Common set of problems, such as network design, flow design, planning, scheduling, etc.
Definitions
Denoted as \(G=<V,E>\), where V is the set of nodes (vertices) and E is the set of edges (pairs of nodes).
- If the direction of edges matters, the graph is a directed graph.
- If \((u,v)\in E\) then u and v are adjacent (neighbours); \((u,v)\) connects u and v, and u and v are incident to \((u,v)\).
- The degree of a node is the number of edges incident to it.
- A path in \(<V,E>\) is a sequence of nodes \(v_0,v_1,...,v_k\) from V such that \((v_i,v_{i+1})\in E\). The path \(v_0,v_1,...,v_k\) has length k.
- A simple path has no repeated nodes. A cycle is like a simple path, except that \(v_0=v_k\), i.e. the last node is the same as the first.
- Edges can have weights, making a weighted graph.
- A dense graph has many edges connecting nodes, whereas a sparse graph has few. A complete graph has every node connected to every other node.
- A graph is connected if there is a path between every pair of nodes; otherwise it is disconnected.
- A directed graph is strongly connected if there is a path from every node to every other node, following the edge directions.
- Acyclic graphs are “trees”: they have no way to cycle back to previous points. Directed acyclic graphs are “dags”.
- A rooted tree is a tree with one node identified as special (the root) in that every other node is reachable from it. If the root is removed, a set of rooted sub-trees remains.
Depth-first search
Visit nodes along a path before visiting other neighbours along that path. Push vertices onto a stack as we advance; as we backtrack, we pop them off. Start with every node marked 0 (unvisited), then mark each node with the order in which it is found. Produces a DFS tree/forest: tree edges denote the paths followed, back edges denote alternative connections to already-visited nodes. With an adjacency matrix, complexity is \(\Theta(|V|^2)\). With adjacency lists, complexity is \(\Theta(|V|+|E|)\).
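A DFS sketch on an adjacency-list graph, here assumed to be a dict mapping each node to a list of neighbours; nodes are marked with the order in which they are first visited, as above:

```python
# Depth-first search: returns {node: visit number}.
def dfs(graph):
    order = {}                       # doubles as the "visited" mark
    counter = 0

    def explore(v):
        nonlocal counter
        counter += 1
        order[v] = counter
        for w in graph[v]:
            if w not in order:       # tree edge: first time we reach w
                explore(w)

    for v in graph:                  # restart to cover every component
        if v not in order:
            explore(v)
    return order

# e.g. dfs({'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []})
```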
Breadth-first search
Proceeds by visiting every node that is one step away from the start, then those that are two steps away and so on. Uses a queue to process the nodes to be visited. Produces a BFS tree/forest. Tree edges denote the direct connections followed. Cross edges denote alternative connections. BFS has the same complexity as DFS.
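A BFS sketch in the same style (the dict-of-adjacency-lists representation is again an assumption):

```python
from collections import deque

# Breadth-first search from start: returns {node: visit number}.
def bfs(graph, start):
    order = {start: 1}
    queue = deque([start])
    counter = 1
    while queue:
        v = queue.popleft()          # process nodes in discovery order
        for w in graph[v]:
            if w not in order:       # tree edge; repeats would be cross edges
                counter += 1
                order[w] = counter
                queue.append(w)
    return order
```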
Topological sorting
Assume a directed edge (a,b) means that a must come before b; for a linearisation to exist, the directed graph must be acyclic, i.e. a dag. We can linearise the graph so that the nodes form a sequence \(v_1,v_2,...,v_n\) in which, for each edge \((v_i,v_j)\in E\), we have \(i<j\). We can solve this with depth-first search:
- Perform DFS and note the order in which nodes are popped off the stack
- List nodes in reverse of that order
Works because of stack discipline.
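A sketch of this DFS-based method: nodes are appended as their recursive call finishes ("popped off the stack"), then the list is reversed. The input is assumed to be a dag in the dict-of-adjacency-lists form used above:

```python
# Topological sort by DFS finishing times.
def topological_sort(graph):
    finished = []                    # nodes in order of DFS completion
    visited = set()

    def explore(v):
        visited.add(v)
        for w in graph[v]:
            if w not in visited:
                explore(w)
        finished.append(v)           # all of v's descendants are done

    for v in graph:
        if v not in visited:
            explore(v)
    return list(reversed(finished))  # reverse of finishing order
```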
An alternative method is to repeatedly select a source (a node with no incoming edges, only outgoing), output it, and remove it from the graph. Repeat until the graph is empty. This is a natural approach, but has the drawback of repeatedly scanning for sources. This is known as decrease-and-conquer.
Binary trees
A binary tree has every node either empty or with a left and a right sub-tree. A full binary tree has every node with 0 or 2 children. A complete tree has every level full, except possibly the last, which is filled left to right. A non-empty tree has a root, a left sub-tree and a right sub-tree. The height can be calculated recursively. External nodes (leaves) have no children; they mark the ends of the tree.
Traversal
Pseudo-code
```
Inject(T)
while the data structure is non-empty:
    T ← Eject()
    if T_left is non-empty:
        Inject(T_left)
    if T_right is non-empty:
        Inject(T_right)
```
Pre-order
Visit the root, then the left sub-tree, and finally the right sub-tree. Done using a stack.
In-order
Visit the left sub-tree, then the root, and finally the right sub-tree.
Post-order
Visit the left sub-tree, then the right sub-tree, and finally the root.
Level-order
Visit the nodes level by level, starting from the root. Done using a queue.
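A sketch of the four orders on a tree represented as nested tuples `(value, left, right)`, with `None` for an empty tree (the representation is an assumption for illustration):

```python
from collections import deque

def preorder(t):
    if t is None:
        return []
    value, left, right = t
    return [value] + preorder(left) + preorder(right)

def inorder(t):
    if t is None:
        return []
    value, left, right = t
    return inorder(left) + [value] + inorder(right)

def postorder(t):
    if t is None:
        return []
    value, left, right = t
    return postorder(left) + postorder(right) + [value]

def levelorder(t):
    result, queue = [], deque([t])   # queue of sub-trees to visit
    while queue:
        node = queue.popleft()
        if node is not None:
            value, left, right = node
            result.append(value)
            queue.append(left)
            queue.append(right)
    return result
```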
Greedy Algorithms
Strategy based on making the locally best choice at each step. For some problems this yields the optimal solution; for others it does not. Even when not optimal, greedy algorithms can be simple and fast, and often give a good approximation of the optimal solution.
Priority Queue
A set of elements such that the element with the highest priority is at the top, i.e. ejected first. Operations
- Find an item with maximal priority
- Insert a new item
- Test for empty
- Eject largest item
With insertion time as the priority, a priority queue becomes a stack (latest first) or a queue (earliest first).
Costs
- Unsorted array or list: Inject O(1), Eject O(n)
- Sorted array or list: Inject O(n), Eject O(1) (the largest item sits at a known end)
Spanning trees
A spanning tree of a graph \(<V,E>\) is a tree \(<V,E'>\) with \(E'\subseteq E\). In some applications edges have costs, so some spanning trees are more desirable than others. A minimum spanning tree is a spanning tree with minimal total edge cost.
Prim's Algorithm
Prim's algorithm is a greedy algorithm for constructing minimum spanning trees. It constructs a sequence of sub-trees, each obtained by adding a node together with an edge connecting it to the previous sub-tree. At each step it picks the node closest to the tree and adds it.
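A sketch of Prim's algorithm with a priority queue of candidate edges; `graph` is assumed to map each node to a list of `(neighbour, weight)` pairs:

```python
import heapq

# Prim's algorithm: returns the MST as a list of (parent, node, weight).
def prim(graph, start):
    in_tree = set()
    tree_edges = []
    pq = [(0, start, start)]         # (edge weight, node, parent)
    while pq:
        weight, v, parent = heapq.heappop(pq)
        if v in in_tree:
            continue                 # stale entry: v was added earlier
        in_tree.add(v)
        if v != parent:              # skip the artificial start entry
            tree_edges.append((parent, v, weight))
        for w, edge_weight in graph[v]:
            if w not in in_tree:     # candidate edge leaving the tree
                heapq.heappush(pq, (edge_weight, w, v))
    return tree_edges
```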
Dijkstra's Algorithm
Similar structure to Prim's, but it finds the shortest path between nodes. As in Prim's, nodes are added greedily, but the priority of a node is the length of the whole path to it from the start, not just the connecting edge. Only works for non-negative weights.
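A Dijkstra sketch in the same style as the Prim sketch above; the only structural change is the priority, which is the distance from the start along the whole path rather than the single connecting edge:

```python
import heapq

# Dijkstra's algorithm: returns shortest distances from start.
def dijkstra(graph, start):
    dist = {}
    pq = [(0, start)]                # (path length from start, node)
    while pq:
        d, v = heapq.heappop(pq)
        if v in dist:
            continue                 # already finalised with a shorter path
        dist[v] = d
        for w, edge_weight in graph[v]:
            if w not in dist:
                heapq.heappush(pq, (d + edge_weight, w))
    return dist
```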
Divide and Conquer
Divide a problem into smaller instances, solve the instances recursively, then combine the smaller solutions to solve the original instance. The recursive calls form a tree of sub-problems.
Warshall's Algorithm
Computes the transitive closure of a binary relation (or a directed graph), presented as a matrix. An edge (a,z) is in the transitive closure of graph G iff there is a path in G from a to z.
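A sketch of Warshall's algorithm on an \(n\times n\) boolean adjacency matrix (a list of lists here): after considering intermediate node k, `reach[i][j]` is true iff there is a path from i to j using only intermediates 0..k.

```python
# Warshall's algorithm: transitive closure of a boolean matrix.
def warshall(adjacency):
    n = len(adjacency)
    reach = [row[:] for row in adjacency]   # copy; don't mutate the input
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach
```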
Sorts
Basic sorts
Selection Sort
- Key idea: find the maximum of the unsorted part, move it into its final place, repeat
- Simple, consistent, slow (\(O(n^2)\))
- Not stable in the usual array implementation (swapping can reorder equal elements)
Insertion Sort
- Key idea: maintain a sorted list, insert one element at a time
- Mostly slow (\(O(n^2)\)), but good best case (\(O(n)\))
- Fast on almost sorted arrays
Counting Sort
- Key idea: count the number of appearances of each key, then rebuild the sorted array
- Great performance when key values are limited to a small range: \(O(n+k)\) for keys in \(0..k\) (see the sketch after this list)
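A counting sort sketch for integer keys in a known range \(0..k\); the known range bound is the assumption that makes this linear:

```python
# Counting sort: O(n + k) for n keys drawn from 0..k.
def counting_sort(keys, k):
    counts = [0] * (k + 1)
    for key in keys:
        counts[key] += 1             # tally each key value
    result = []
    for value in range(k + 1):
        result.extend([value] * counts[value])
    return result
```

The stable "distribution counting" variant computes running totals of the counts and places whole records from right to left, which matters when records carry data beyond the key.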
Mergesort
Split the list in half, recursively sort the halves, then merge the sorted sub-lists. \(O(n\log(n))\) time, \(O(n)\) space. A stable sort; requires extra space, and although not 100% input-insensitive, it is relatively robust. Efficient for sorting linked lists. Good for (extremely) large data.
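A mergesort sketch (returns a new list; the merge step is where the \(O(n)\) extra space goes):

```python
def merge_sort(items):
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # Merge the two sorted halves, taking from left on ties
    # so that the sort is stable.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]
```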
Quicksort
Choose a pivot and partition the list so that all elements to the left of the pivot are smaller and all to the right are larger. Repeat recursively on the left and right parts. \(O(n\log(n))\) best/average time, in-place apart from the recursion stack, \(O(n^2)\) worst time.
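A quicksort sketch with in-place Lomuto partitioning; picking the last element as pivot is an arbitrary assumption here (real implementations choose pivots more carefully to dodge the \(O(n^2)\) worst case):

```python
# In-place quicksort on items[lo..hi].
def quicksort(items, lo=0, hi=None):
    if hi is None:
        hi = len(items) - 1
    if lo >= hi:
        return
    pivot = items[hi]
    i = lo                            # next slot for a small element
    for j in range(lo, hi):
        if items[j] < pivot:
            items[i], items[j] = items[j], items[i]
            i += 1
    items[i], items[hi] = items[hi], items[i]   # pivot to its final place
    quicksort(items, lo, i - 1)
    quicksort(items, i + 1, hi)
```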
Heap
Data structure for maintaining a collection of items with priorities. Maintains a complete binary tree such that each parent is not smaller than its children. Naturally represented as a tree, but usually stored in an array: with 1-based indexing, the left child of position \(i\) is at \(2i\) and the right child at \(2i+1\). (A sift-down sketch follows the properties below.)
Important properties
- Root is the largest element
- Each sub-tree is a heap
- A heap with n nodes has \(1+\lfloor\log_2{n}\rfloor\) levels, i.e. height \(\lfloor\log_2{n}\rfloor\)
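A sift-down sketch for a max-heap stored in an array with 1-based indexing (index 0 unused), matching the \(2i\)/\(2i+1\) child positions above:

```python
# Restore the heap property at index i, in a heap of n items.
def sift_down(heap, i, n):
    while 2 * i <= n:                     # while i has at least one child
        child = 2 * i
        if child + 1 <= n and heap[child + 1] > heap[child]:
            child += 1                    # pick the larger child
        if heap[i] >= heap[child]:
            break                         # heap property holds
        heap[i], heap[child] = heap[child], heap[i]
        i = child
```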
Dynamic programming
Dependent on the principle of optimality: The optimal solution can be found by finding the optimal solution to simpler sub-problems.
Example: The Coin-Row Problem
- Each coin has a value
- The coins are laid in a row
- Adjacent coins can not be selected
- Select the coins such that set has the maximum sum of values
Algorithm: apply the following formula iteratively, where \(v_n\) is the value of the n-th coin in the row. \[S(n)=\max\{S(n-1),S(n-2)+v_n\}\] \(S(0)=0\) and \(S(1)=v_1\) are used as starting points.
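A sketch following the recurrence; the dummy `values[0]` entry is an assumption so that `values[i]` is the i-th coin, as in the formula:

```python
# Coin-row: maximum sum of non-adjacent coins.
def coin_row(values):
    n = len(values) - 1
    S = [0] * (n + 1)
    S[1] = values[1]
    for i in range(2, n + 1):
        # either skip coin i, or take it plus the best up to i-2
        S[i] = max(S[i - 1], S[i - 2] + values[i])
    return S[n]

# e.g. coin_row([0, 5, 1, 2, 10, 6, 2]) == 17  (coins 5, 10, 2)
```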
Example: The Knapsack Problem
- You are given a set of items
- Each item has a value and weight
- Select a subset to maximise value such that the sum of the weights does not exceed a given capacity
Algorithm
```
for i ← 0 to n:
    K[i, 0] ← 0
for j ← 1 to W:
    K[0, j] ← 0
for i ← 1 to n:
    for j ← 1 to W:
        if j < w_i:
            K[i, j] ← K[i-1, j]
        else:
            K[i, j] ← max(K[i-1, j], K[i-1, j - w_i] + v_i)
return K[n, W]
```
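A runnable version of the same table-filling scheme; the 1-indexed weights/values convention (dummy 0th entries) is an assumption to match the pseudo-code:

```python
# 0/1 knapsack: maximum value within capacity W.
def knapsack(weights, values, W):
    n = len(weights) - 1
    K = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, W + 1):
            if j < weights[i]:
                K[i][j] = K[i - 1][j]          # item i cannot fit
            else:
                K[i][j] = max(K[i - 1][j],     # leave item i out
                              K[i - 1][j - weights[i]] + values[i])
    return K[n][W]
```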
Binary Search tree
Binary tree where keys in the left sub-tree are less than the node's key and keys in the right sub-tree are greater. Insertion and search take \(O(height)\). Imbalanced trees are not ideal: worst case \(O(n)\). The key idea is to keep the tree balanced, guaranteeing \(O(\log(n))\) search (a sketch of the basic operations follows this list):
- Self balancing trees
- AVL trees
- Red-black trees
- Splay trees
- Representational changes
- 2-3 trees
- 2-3-4 trees
- B-trees
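A sketch of plain (unbalanced) BST search and insertion; representing each node as a dict with `'key'`, `'left'`, `'right'` is a choice made here for brevity:

```python
def bst_search(node, key):
    while node is not None:
        if key == node['key']:
            return node
        node = node['left'] if key < node['key'] else node['right']
    return None                       # not found

def bst_insert(node, key):
    if node is None:                  # found the empty spot for key
        return {'key': key, 'left': None, 'right': None}
    if key < node['key']:
        node['left'] = bst_insert(node['left'], key)
    elif key > node['key']:
        node['right'] = bst_insert(node['right'], key)
    return node
```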
AVL Trees
In an AVL tree, every node's left and right sub-trees differ in height by at most one: the balance factor (left height minus right height) is \(-1, 0\) or \(1\) at every node. The tree can be re-balanced by rotations; it is always the lowest unbalanced sub-tree that is re-balanced. This notion of balance guarantees the depth of an AVL tree with n nodes is \(\Theta(\log(n))\). Insertion is \(O(\log(n))\). Deletion is also \(O(\log(n))\) but is harder to implement.
Red-Black Trees
Has coloured nodes such that:
- No red node has a red child
- Every path from the root to the fringe has the same number of black nodes
In the worst case, the longest root-to-fringe path in a Red-Black tree is twice as long as the shortest.
Splay Trees
Not only self adjusting, but also adaptive. Frequently accessed items are brought closer to the root so their access cost becomes cheaper.
2-3 and 2-3-4 Trees
Allow for more than one item to be stored in a node. 2-node holds 1 item and has at most 2 children. 3-node holds at most 2 items and has at most 3 children. 4-node holds at most 3 items and has at most 4 children. To insert we pretend we are searching for the key, k. This will take us to where k should be, whereupon we can insert it.
Hashing
A hash function is a mapping. \[h: K_1\to K_2\] Some poor examples:
- h(string) = length of the string
- h(array) = sum of elements in array
- h(website) = list of key words
Hash table
A data structure for storing records for fast retrieval by key. Operations:
- Insert
- Remove
- Find
Each element has a key used to find it. Usually the hash function is not injective: distinct keys can map to the same hash value.
Challenges:
- Design of function
- Collision handling
Often use \(h(n)=n\mod m\), where m is chosen to be prime. Computing this directly can be difficult as n becomes large. Instead, we can utilise theorems about modular arithmetic: \[(x+y)\mod m = ((x\mod m)+(y\mod m))\mod m\] \[(x*y)\mod m = ((x\mod m)*(y\mod m))\mod m\] i.e. it suffices to perform the modular arithmetic on each sub-expression, then recombine at the end. This can reduce the complexity of the calculations.
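A string-hash sketch applying these identities with Horner's rule: the running value is reduced mod m at each step, so intermediate numbers stay small. The radix 31 is an illustrative choice, not from the notes:

```python
# Polynomial string hash, reduced modulo m as it goes.
def string_hash(s, m):
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % m   # reduce at every step
    return h
```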
Collisions
Collisions can occur when the hashing function produces the same hash for two different pieces of data.
Chaining
A solution to collisions is hashing with chaining: all values that hash to the same slot are stored in a list at that slot. It is a very easy method of collision handling. The load factor (\(\alpha\)) is the average number of elements stored per slot. Number of probes in a successful search \(\approx 1+\alpha/2\). Number of probes in an unsuccessful search \(\approx \alpha\).
Linear probing
If a collision occurs between two keys, the new record is inserted at the next available cell. The load factor is \(\alpha=n/m\), where n is the number of records stored and m is the table size. On average the number of probes required is: Successful: \(\frac{1}{2}+\frac{1}{2(1-\alpha)}\). Unsuccessful: \(\frac{1}{2}+\frac{1}{2(1-\alpha)^2}\). This means only a small number of probes is required, even when the hash table is relatively full. This method of collision handling is relatively space efficient, although it has a very bad worst case, so \(\alpha\) should not be allowed to exceed 0.9. Clustering, i.e. long runs of full adjacent cells, is a big problem. Deletion is almost impossible.
Double hashing
Uses a second hash function \(s\) to determine the offset when probing for a free cell: if \(h(k)\) is occupied, \(h(k)+s(k)\) is tried, then \(h(k)+2s(k)\), etc. The table size should be a prime number, so the probe sequence does not cycle before visiting every cell.
Rehashing
Keep track of a load factor and rehash the whole table when it exceeds a certain number, say 0.9. This means allocating a larger table, revisiting each element in the old table and moving it to its new position in the new table. Introduces long delays at unpredictable times, “stop the world” operation.
Rabin-Karp string search
String search algorithm based on string hashing. It searches for a pattern by hashing every window of the text with the pattern's length and testing whether the hash matches the hash of the pattern. If the hashes are the same, the strings are compared in the usual way. Since false positives are rare with a good hash function, the \(O(m)\) time, where m is the pattern length, to compare the strings can be ignored. The hash can be computed incrementally (a rolling hash), reducing the total hashing time.
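A Rabin-Karp sketch with a rolling polynomial hash modulo a prime; the constants (radix 256, modulus 997) are illustrative choices:

```python
# Rabin-Karp: index of the first occurrence of pattern, or -1.
def rabin_karp(text, pattern):
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    radix, q = 256, 997
    high = pow(radix, m - 1, q)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * radix + ord(pattern[i])) % q
        t_hash = (t_hash * radix + ord(text[i])) % q
    for i in range(n - m + 1):
        # hashes match: confirm with a direct O(m) comparison
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:                    # roll the window by one character
            t_hash = ((t_hash - ord(text[i]) * high) * radix
                      + ord(text[i + m])) % q
    return -1
```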
Drawbacks
- Not good for traversal in sorted order
- Unless separate chaining is used, deletion is almost impossible
- It may be hard to predict the volume of data, rehashing is an expensive “stop the world” procedure
Compression
Computer files can contain redundancy, often introduced for practical purposes. Compression aims to remove that redundancy.
Run length encoding
For a piece of text that has long runs of repeated characters, we can compress by counting runs. For example: \[AAAABBBCABCAAABBBBBBBCCCCCCCCCCCCC\] could be encoded as \[4A3BCABC3A7B13C\] Not appropriate for English text, but can work well with binary files.
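An encoder sketch matching the example above: runs of length at least 2 are written as count + character, single characters are written as-is (this convention is inferred from the example):

```python
# Run-length encoding in the style of the example.
def rle_encode(text):
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                    # extend the current run
        run = j - i
        out.append(text[i] if run == 1 else f"{run}{text[i]}")
        i = j
    return "".join(out)

# rle_encode("AAAABBBCABCAAABBBBBBBCCCCCCCCCCCCC") == "4A3BCABC3A7B13C"
```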
Variable length encoding
Instead of having 8 bits for every character, use codes such that common characters have shorter codes. For example, E could have code 0; but then no other character's code can start with 0. In general, no code may be a prefix of another (the codes must be prefix-free).
Trie (prefix tree)
Binary tree for search applications. Leaves store elements; the path to a leaf encodes its code, e.g. left=0, right=1. Using a trie means no code will be the prefix of another.
Huffman encoding
If we know the letter frequencies well, what are the optimal code assignments for each letter? We can find them by repeatedly finding the two smallest weights and fusing them (Huffman's algorithm, a greedy algorithm). This results in a Huffman tree. For this type of compression, both the compressor and decompressor must agree on the codes being used. Alternatively, the trie can be sent along with the message, taking negligible space with large files. Modern variants encode recurring patterns as well as individual letters.
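A sketch of the fusing step with a priority queue; `freqs` maps each symbol to its frequency, and the tie-breaker counter (an implementation detail added here) keeps the heap from ever comparing trees:

```python
import heapq

# Huffman's algorithm: returns {symbol: code string}.
def huffman_codes(freqs):
    # heap entries: (weight, tie_breaker, tree); a tree is either
    # a symbol or a (left, right) pair of sub-trees
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # fuse the two smallest weights
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, code):
        if isinstance(tree, tuple):       # internal node
            walk(tree[0], code + "0")     # left edge = 0
            walk(tree[1], code + "1")     # right edge = 1
        else:
            codes[tree] = code or "0"     # single-symbol edge case
    walk(heap[0][2], "")
    return codes
```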
Space-Time trade off
Use extra space to reduce the time needed for a process. This is a type of preprocessing.
String matching
Strings are built from a small alphabet and often have structure that search algorithms can exploit.
Horspool's String Search Algorithm
Compares the pattern against the text right to left within each alignment. Very good on random text strings. Shifts the pattern based on the rightmost occurrence, within the pattern, of the text character aligned with the pattern's last position: a shift table built from the pattern determines how far to shift. Inspired by the Boyer-Moore algorithm, which uses a more sophisticated shifting strategy.
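A Horspool sketch: the shift table maps each character to the distance from its rightmost occurrence (excluding the pattern's last position) to the end of the pattern; absent characters shift by the full pattern length:

```python
# Horspool's algorithm: index of the first match, or -1.
def horspool(text, pattern):
    n, m = len(text), len(pattern)
    # later indices overwrite earlier ones -> rightmost occurrence
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = m - 1                            # index of the window's last character
    while i < n:
        k = 0                            # compare right to left
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1             # full match found
        i += shift.get(text[i], m)       # slide pattern by the table entry
    return -1
```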
Knuth-Morris-Pratt (KMP) algorithm
Good for patterns utilising a small alphabet. Constructs a finite state machine to do the string matching.
Complexity theory
There are problems without simple solutions, and even some without any solution. Some problems have no algorithms with a reasonable worst-case running time:
- Shortest path with additional constraints
- Timetabling
- Production planning
Complexity theory asks:
- What is an inherently difficult problem?
- Can we categorise problems by their difficulty?
- How can we know an algorithm is optimal in the asymptotic sense?