COMP20007 - Design of Algorithms
$$\require{algorithm2e,amsmath,amssymb,mathtools}$$
Problem solving steps
- Understand the problem
- Decide on the computational means
- Decide on method to use
- Design the necessary data structures and algorithms
- Check for correctness, trace input example
- Evaluate analytically (time, space, worst case, average case)
- Code it
- Evaluate empirically
Sum of the first n integers
\[\sum^n_{i=1}i=\frac{n(n+1)}{2}\] The average of the numbers: \(\frac{n+1}{2}\). The number of numbers: \(n\). Therefore the sum of the numbers: \(\frac{n+1}{2}\cdot n\).
Fundamental data structures
Abstract data types: specified by the operations they promise to support, not by any particular implementation.
Stack (LIFO)
Last in first out. Operations:
- CreateStack
- Push
- Pop
- Top
- EmptyStack?
- …
Can be implemented as an array or a linked list.
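A minimal sketch of the stack operations in Python, backed by a list (the class and method names are illustrative, not from the notes):

```python
# Stack sketch: a Python list used as the underlying array.
class Stack:
    def __init__(self):
        self.items = []          # CreateStack

    def push(self, x):
        self.items.append(x)     # O(1) amortised

    def pop(self):
        return self.items.pop()  # removes and returns the last item pushed

    def top(self):
        return self.items[-1]    # peek without removing

    def is_empty(self):          # EmptyStack?
        return not self.items
```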
Queue (FIFO)
First in first out. Operations (sketched after the list):
- CreateQueue
- Enqueue
- Dequeue
- Head
- Empty
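A matching queue sketch, assuming `collections.deque` as the backing store (a plain list would make dequeue O(n)):

```python
from collections import deque

# Queue sketch: deque gives O(1) operations at both ends.
class Queue:
    def __init__(self):
        self.items = deque()        # CreateQueue

    def enqueue(self, x):
        self.items.append(x)        # add at the back

    def dequeue(self):
        return self.items.popleft() # remove from the front (FIFO)

    def head(self):
        return self.items[0]        # peek at the front

    def is_empty(self):             # Empty
        return not self.items
```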
Others
Priority Queue, trees, graphs
Growth rate and Algorithm Efficiency
Assume basic operations take the same amount of time, abstracting away from the architecture. Compare algorithm growth rates in terms of asymptotic analysis (an order hierarchy). \(O(g)\) denotes the set of functions that grow no faster than \(g\) (asymptotically bounded above). \(\Omega(g)\) denotes the set of functions that grow no slower than \(g\) (asymptotically bounded below). \(\Theta(g)\) denotes the set of functions that grow exactly as fast as \(g\) (asymptotically bounded both above and below).
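Formally (the standard definitions, with \(c\) a positive constant and \(n_0\) a threshold):
\[f(n)\in O(g(n)) \iff \exists c, n_0 : f(n)\le c\,g(n)\ \text{for all}\ n\ge n_0\]
\[f(n)\in \Omega(g(n)) \iff \exists c, n_0 : f(n)\ge c\,g(n)\ \text{for all}\ n\ge n_0\]
\[f(n)\in \Theta(g(n)) \iff f(n)\in O(g(n))\ \text{and}\ f(n)\in \Omega(g(n))\]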
Graphs
Common set of problems, such as network design, flow design, planning, scheduling, etc.
Definitions
Denoted as \(G=<V,E>\), where V is the set of nodes (vertices) and E is the set of edges (pairs of nodes).
- If the direction of edges matters, the graph is a directed graph.
- If \((u,v)\in E\) then u and v are adjacent (neighbours); \((u,v)\) connects u and v, and u and v are incident to \((u,v)\).
- The degree of a node is the number of edges incident to it.
- A path in \(<V,E>\) is a sequence of nodes \(v_0,v_1,...,v_k\) from V such that \((v_i,v_{i+1})\in E\). The path \(v_0,v_1,...,v_k\) has length k.
- A simple path has no repeated nodes. A cycle is like a simple path, except that \(v_0=v_k\), i.e. the last node is the same as the first.
- Edges can have weights, making a weighted graph.
- A dense graph has many edges connecting nodes, whereas a sparse graph has few. A complete graph has every node connected to every other node.
- A graph is connected if there is a path between every pair of nodes; otherwise it is disconnected.
- A directed graph is strongly connected if there is a path from every node to every other node, following the edge directions.
- Acyclic graphs are “trees”: they have no way to cycle back to previous points. Directed acyclic graphs are “dags”.
- A rooted tree is a tree with one node identified as special (the root) in that every other node is reachable from it. If the root is removed, a set of rooted sub-trees remains.
Depth-first search
Visit nodes along a path before visiting other neighbours along that path. Push vertices onto a stack as we advance; as we backtrack, we pop them off. Start with every node marked 0 (unvisited), then mark each node with the order in which it is found. Produces a DFS tree/forest: tree edges denote the paths followed, back edges denote alternative connections to already-visited nodes. With an adjacency matrix, complexity is \(\Theta(|V|^2)\). With adjacency lists, complexity is \(\Theta(|V|+|E|)\).
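A DFS sketch on an adjacency-list graph, here assumed to be a dict mapping each node to a list of neighbours; nodes are marked with the order in which they are first visited, as above:

```python
# Depth-first search: returns {node: visit number}.
def dfs(graph):
    order = {}                       # doubles as the "visited" mark
    counter = 0

    def explore(v):
        nonlocal counter
        counter += 1
        order[v] = counter
        for w in graph[v]:
            if w not in order:       # tree edge: first time we reach w
                explore(w)

    for v in graph:                  # restart to cover every component
        if v not in order:
            explore(v)
    return order

# e.g. dfs({'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []})
```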
Breadth-first search
Proceeds by visiting every node that is one step away from the start, then those that are two steps away and so on. Uses a queue to process the nodes to be visited. Produces a BFS tree/forest. Tree edges denote the direct connections followed. Cross edges denote alternative connections. BFS has the same complexity as DFS.
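A BFS sketch in the same style (the dict-of-adjacency-lists representation is again an assumption):

```python
from collections import deque

# Breadth-first search from start: returns {node: visit number}.
def bfs(graph, start):
    order = {start: 1}
    queue = deque([start])
    counter = 1
    while queue:
        v = queue.popleft()          # process nodes in discovery order
        for w in graph[v]:
            if w not in order:       # tree edge; repeats would be cross edges
                counter += 1
                order[w] = counter
                queue.append(w)
    return order
```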
Topological sorting
Assume a directed edge (a,b) means that a must come before b; for a linearisation to exist, the directed graph must be acyclic, i.e. a dag. We can linearise the graph so that the nodes form a sequence \(v_1,v_2,...,v_n\) in which, for each edge \((v_i,v_j)\in E\), we have \(i<j\). We can solve this with depth-first search:
- Perform DFS and note the order in which nodes are popped off the stack
- List nodes in reverse of that order
Works because of stack discipline.
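A sketch of this DFS-based method: nodes are appended as their recursive call finishes ("popped off the stack"), then the list is reversed. The input is assumed to be a dag in the dict-of-adjacency-lists form used above:

```python
# Topological sort by DFS finishing times.
def topological_sort(graph):
    finished = []                    # nodes in order of DFS completion
    visited = set()

    def explore(v):
        visited.add(v)
        for w in graph[v]:
            if w not in visited:
                explore(w)
        finished.append(v)           # all of v's descendants are done

    for v in graph:
        if v not in visited:
            explore(v)
    return list(reversed(finished))  # reverse of finishing order
```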
An alternative method is to repeatedly select a source (a node with no incoming edges, only outgoing), output it, and remove it from the graph. Repeat until the graph is empty. This is a natural approach, but has the drawback of repeatedly scanning for sources. This is known as decrease-and-conquer.
Binary trees
A binary tree has every node either empty or with a left and a right sub-tree. A full binary tree has every node with 0 or 2 children. A complete tree has every level full, except possibly the last, which is filled left to right. A non-empty tree has a root, a left sub-tree and a right sub-tree. The height can be calculated recursively. External nodes (leaves) have no children; they mark the ends of the tree.
Traversal
Pseudo-code
```
Inject(T)
while the data structure is non-empty:
    T ← Eject()
    if T_left is non-empty:
        Inject(T_left)
    if T_right is non-empty:
        Inject(T_right)
```
Pre-order
Visit the root, then the left sub-tree, and finally the right sub-tree. Done using a stack.
In-order
Visit the left sub-tree, then the root, and finally the right sub-tree.
Post-order
Visit the left sub-tree, then the right sub-tree, and finally the root.
Level-order
Visit the nodes level by level, starting from the root. Done using a queue.
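A sketch of the four orders on a tree represented as nested tuples `(value, left, right)`, with `None` for an empty tree (the representation is an assumption for illustration):

```python
from collections import deque

def preorder(t):
    if t is None:
        return []
    value, left, right = t
    return [value] + preorder(left) + preorder(right)

def inorder(t):
    if t is None:
        return []
    value, left, right = t
    return inorder(left) + [value] + inorder(right)

def postorder(t):
    if t is None:
        return []
    value, left, right = t
    return postorder(left) + postorder(right) + [value]

def levelorder(t):
    result, queue = [], deque([t])   # queue of sub-trees to visit
    while queue:
        node = queue.popleft()
        if node is not None:
            value, left, right = node
            result.append(value)
            queue.append(left)
            queue.append(right)
    return result
```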
Greedy Algorithms
Strategy based on making the locally best choice at each step. For some problems this yields the optimal solution; for others it does not. Even when not optimal, greedy algorithms can be simple and fast, and often give a good approximation of the optimal solution.
Priority Queue
A set of elements such that the element with the highest priority is at the top, i.e. ejected first. Operations
- Find an item with maximal priority
- Insert a new item
- Test for empty
- Eject largest item
With insertion time as the priority, a priority queue becomes a stack (latest first) or a queue (earliest first).
Costs
- Unsorted array or list: Inject O(1), Eject O(n)
- Sorted array or list: Inject O(n), Eject O(1) (the largest item sits at a known end)
Spanning trees
A spanning tree of a graph \(<V,E>\) is a tree \(<V,E'>\) with \(E'\subseteq E\). In some applications edges have costs, so some spanning trees are more desirable than others. A minimum spanning tree is a spanning tree with minimal total edge cost.
Prim's Algorithm
Prim's algorithm is a greedy algorithm for constructing minimum spanning trees. It constructs a sequence of sub-trees, each obtained by adding a node together with an edge connecting it to the previous sub-tree. At each step it picks the node closest to the tree and adds it.
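A sketch of Prim's algorithm with a priority queue of candidate edges; `graph` is assumed to map each node to a list of `(neighbour, weight)` pairs:

```python
import heapq

# Prim's algorithm: returns the MST as a list of (parent, node, weight).
def prim(graph, start):
    in_tree = set()
    tree_edges = []
    pq = [(0, start, start)]         # (edge weight, node, parent)
    while pq:
        weight, v, parent = heapq.heappop(pq)
        if v in in_tree:
            continue                 # stale entry: v was added earlier
        in_tree.add(v)
        if v != parent:              # skip the artificial start entry
            tree_edges.append((parent, v, weight))
        for w, edge_weight in graph[v]:
            if w not in in_tree:     # candidate edge leaving the tree
                heapq.heappush(pq, (edge_weight, w, v))
    return tree_edges
```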
Dijkstra's Algorithm
Similar structure to Prim's, but it finds the shortest path between nodes. As in Prim's, nodes are added greedily, but the priority of a node is the length of the whole path to it from the start, not just the connecting edge. Only works for non-negative weights.
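A Dijkstra sketch in the same style as the Prim sketch above; the only structural change is the priority, which is the distance from the start along the whole path rather than the single connecting edge:

```python
import heapq

# Dijkstra's algorithm: returns shortest distances from start.
def dijkstra(graph, start):
    dist = {}
    pq = [(0, start)]                # (path length from start, node)
    while pq:
        d, v = heapq.heappop(pq)
        if v in dist:
            continue                 # already finalised with a shorter path
        dist[v] = d
        for w, edge_weight in graph[v]:
            if w not in dist:
                heapq.heappush(pq, (d + edge_weight, w))
    return dist
```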
Divide and Conquer
Divide a problem into smaller instances, solve the instances recursively, then combine the smaller solutions to solve the original instance. The recursive calls form a tree of sub-problems.
Warshall's Algorithm
Computes the transitive closure of a binary relation (or a directed graph), presented as a matrix. An edge (a,z) is in the transitive closure of graph G iff there is a path in G from a to z.
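A sketch of Warshall's algorithm on an \(n\times n\) boolean adjacency matrix (a list of lists here): after considering intermediate node k, `reach[i][j]` is true iff there is a path from i to j using only intermediates 0..k.

```python
# Warshall's algorithm: transitive closure of a boolean matrix.
def warshall(adjacency):
    n = len(adjacency)
    reach = [row[:] for row in adjacency]   # copy; don't mutate the input
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach
```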
Sorts
Basic sorts
Selection Sort
- Key idea: find the maximum of the unsorted part, move it into its final place, repeat
- Simple, consistent, slow (\(O(n^2)\))
- Not stable in the usual array implementation (swapping can reorder equal elements)
Insertion Sort
- Key idea: maintain a sorted list, insert one element at a time
- Mostly slow (\(O(n^2)\)), but good best case (\(O(n)\))
- Fast on almost sorted arrays
Counting Sort
- Key idea: count the number of appearances of each key, then rebuild the sorted array
- Great performance when key values are limited to a small range: \(O(n+k)\) for keys in \(0..k\) (see the sketch after this list)
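A counting sort sketch for integer keys in a known range \(0..k\); the known range bound is the assumption that makes this linear:

```python
# Counting sort: O(n + k) for n keys drawn from 0..k.
def counting_sort(keys, k):
    counts = [0] * (k + 1)
    for key in keys:
        counts[key] += 1             # tally each key value
    result = []
    for value in range(k + 1):
        result.extend([value] * counts[value])
    return result
```

The stable "distribution counting" variant computes running totals of the counts and places whole records from right to left, which matters when records carry data beyond the key.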
Mergesort
Split the list in half, recursively sort the halves, then merge the sorted sub-lists. \(O(n\log(n))\) time, \(O(n)\) space. A stable sort; requires extra space, and although not 100% input-insensitive, it is relatively robust. Efficient for sorting linked lists. Good for (extremely) large data.
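A mergesort sketch (returns a new list; the merge step is where the \(O(n)\) extra space goes):

```python
def merge_sort(items):
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # Merge the two sorted halves, taking from left on ties
    # so that the sort is stable.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]
```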
Quicksort
Choose a pivot and partition the list so that all elements to the left of the pivot are smaller and all to the right are larger. Repeat recursively on the left and right parts. \(O(n\log(n))\) best/average time, in-place apart from the recursion stack, \(O(n^2)\) worst time.
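A quicksort sketch with in-place Lomuto partitioning; picking the last element as pivot is an arbitrary assumption here (real implementations choose pivots more carefully to dodge the \(O(n^2)\) worst case):

```python
# In-place quicksort on items[lo..hi].
def quicksort(items, lo=0, hi=None):
    if hi is None:
        hi = len(items) - 1
    if lo >= hi:
        return
    pivot = items[hi]
    i = lo                            # next slot for a small element
    for j in range(lo, hi):
        if items[j] < pivot:
            items[i], items[j] = items[j], items[i]
            i += 1
    items[i], items[hi] = items[hi], items[i]   # pivot to its final place
    quicksort(items, lo, i - 1)
    quicksort(items, i + 1, hi)
```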
Heap
Data structure for maintaining a collection of items with priorities. Maintains a complete binary tree such that each parent is not smaller than its children. Naturally represented as a tree, but usually stored in an array: with 1-based indexing, the left child of position \(i\) is at \(2i\) and the right child at \(2i+1\). (A sift-down sketch follows the properties below.)
Important properties
- Root is the largest element
- Each sub-tree is a heap
- A heap with n nodes has \(1+\lfloor\log_2{n}\rfloor\) levels, i.e. height \(\lfloor\log_2{n}\rfloor\)
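A sift-down sketch for a max-heap stored in an array with 1-based indexing (index 0 unused), matching the \(2i\)/\(2i+1\) child positions above:

```python
# Restore the heap property at index i, in a heap of n items.
def sift_down(heap, i, n):
    while 2 * i <= n:                     # while i has at least one child
        child = 2 * i
        if child + 1 <= n and heap[child + 1] > heap[child]:
            child += 1                    # pick the larger child
        if heap[i] >= heap[child]:
            break                         # heap property holds
        heap[i], heap[child] = heap[child], heap[i]
        i = child
```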
Dynamic programming
Dependent on the principle of optimality: The optimal solution can be found by finding the optimal solution to simpler sub-problems.
Example: The Coin-Row Problem
- Each coin has a value
- The coins are laid in a row
- Adjacent coins can not be selected
- Select the coins such that set has the maximum sum of values
Algorithm: apply the following formula iteratively, where \(v_n\) is the value of the n-th coin in the row. \[S(n)=\max\{S(n-1),S(n-2)+v_n\}\] \(S(0)=0\) and \(S(1)=v_1\) are used as starting points.
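A sketch following the recurrence; the dummy `values[0]` entry is an assumption so that `values[i]` is the i-th coin, as in the formula:

```python
# Coin-row: maximum sum of non-adjacent coins.
def coin_row(values):
    n = len(values) - 1
    S = [0] * (n + 1)
    S[1] = values[1]
    for i in range(2, n + 1):
        # either skip coin i, or take it plus the best up to i-2
        S[i] = max(S[i - 1], S[i - 2] + values[i])
    return S[n]

# e.g. coin_row([0, 5, 1, 2, 10, 6, 2]) == 17  (coins 5, 10, 2)
```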
Example: The Knapsack Problem
- You are given a set of items
- Each item has a value and weight
- Select a subset to maximise value such that the sum of the weights does not exceed a given capacity
Algorithm
```
for i ← 0 to n:
    K[i, 0] ← 0
for j ← 1 to W:
    K[0, j] ← 0
for i ← 1 to n:
    for j ← 1 to W:
        if j < w_i:
            K[i, j] ← K[i-1, j]
        else:
            K[i, j] ← max(K[i-1, j], K[i-1, j - w_i] + v_i)
return K[n, W]
```
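A runnable version of the same table-filling scheme; the 1-indexed weights/values convention (dummy 0th entries) is an assumption to match the pseudo-code:

```python
# 0/1 knapsack: maximum value within capacity W.
def knapsack(weights, values, W):
    n = len(weights) - 1
    K = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, W + 1):
            if j < weights[i]:
                K[i][j] = K[i - 1][j]          # item i cannot fit
            else:
                K[i][j] = max(K[i - 1][j],     # leave item i out
                              K[i - 1][j - weights[i]] + values[i])
    return K[n][W]
```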
Binary Search tree
Binary tree where keys in the left sub-tree are less than the node's key and keys in the right sub-tree are greater. Insertion and search take \(O(height)\). Imbalanced trees are not ideal: worst case \(O(n)\). The key idea is to keep the tree balanced, guaranteeing \(O(\log(n))\) search (a sketch of the basic operations follows this list):
- Self balancing trees
- AVL trees
- Red-black trees
- Splay trees
- Representational changes
- 2-3 trees
- 2-3-4 trees
- B-trees
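A sketch of plain (unbalanced) BST search and insertion; representing each node as a dict with `'key'`, `'left'`, `'right'` is a choice made here for brevity:

```python
def bst_search(node, key):
    while node is not None:
        if key == node['key']:
            return node
        node = node['left'] if key < node['key'] else node['right']
    return None                       # not found

def bst_insert(node, key):
    if node is None:                  # found the empty spot for key
        return {'key': key, 'left': None, 'right': None}
    if key < node['key']:
        node['left'] = bst_insert(node['left'], key)
    elif key > node['key']:
        node['right'] = bst_insert(node['right'], key)
    return node
```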
AVL Trees
In an AVL tree, every node's left and right sub-trees differ in height by at most one: the balance factor (left height minus right height) is \(-1, 0\) or \(1\) at every node. The tree can be re-balanced by rotations; it is always the lowest unbalanced sub-tree that is re-balanced. This notion of balance guarantees the depth of an AVL tree with n nodes is \(\Theta(\log(n))\). Insertion is \(O(\log(n))\). Deletion is also \(O(\log(n))\) but is harder to implement.
Red-Black Trees
Has coloured nodes such that:
- No red node has a red child
- Every path from the root to the fringe has the same number of black nodes
In the worst case, the longest root-to-fringe path in a Red-Black tree is twice as long as the shortest.
Splay Trees
Not only self adjusting, but also adaptive. Frequently accessed items are brought closer to the root so their access cost becomes cheaper.
2-3 and 2-3-4 Trees
Allow for more than one item to be stored in a node. 2-node holds 1 item and has at most 2 children. 3-node holds at most 2 items and has at most 3 children. 4-node holds at most 3 items and has at most 4 children. To insert we pretend we are searching for the key, k. This will take us to where k should be, whereupon we can insert it.
Hashing
A hash function is a mapping. \[h: K_1\to K_2\] Some poor examples:
- h(string) = length of the string
- h(array) = sum of elements in array
- h(website) = list of key words
Hash table
A data structure for storing records for fast retrieval by key. Operations:
- Insert
- Remove
- Find
Each element has a key used to find it. Usually the hash function is not injective: distinct keys can map to the same hash value.
Challenges:
- Design of function
- Collision handling
Often use \(h(n)=n\mod m\), where m is chosen to be prime. Computing this directly can be difficult as n becomes large. Instead, we can utilise theorems about modular arithmetic: \[(x+y)\mod m = ((x\mod m)+(y\mod m))\mod m\] \[(x*y)\mod m = ((x\mod m)*(y\mod m))\mod m\] i.e. it suffices to perform the modular arithmetic on each sub-expression, then recombine at the end. This can reduce the complexity of the calculations.
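A string-hash sketch applying these identities with Horner's rule: the running value is reduced mod m at each step, so intermediate numbers stay small. The radix 31 is an illustrative choice, not from the notes:

```python
# Polynomial string hash, reduced modulo m as it goes.
def string_hash(s, m):
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % m   # reduce at every step
    return h
```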
Collisions
Collisions can occur when the hashing function produces the same hash for two different pieces of data.
Chaining
A solution to collisions is hashing with chaining: all values that hash to the same slot are stored in a list at that slot. It is a very easy method of collision handling. The load factor (\(\alpha\)) is the average number of elements stored per slot. Number of probes in a successful search \(\approx 1+\alpha/2\). Number of probes in an unsuccessful search \(\approx \alpha\).
Linear probing
If a collision occurs between two keys, the new record is inserted at the next available cell. The load factor is \(\alpha=n/m\), where n is the number of records stored and m is the table size. On average the number of probes required is: Successful: \(\frac{1}{2}+\frac{1}{2(1-\alpha)}\). Unsuccessful: \(\frac{1}{2}+\frac{1}{2(1-\alpha)^2}\). This means only a small number of probes is required, even when the hash table is relatively full. This method of collision handling is relatively space efficient, although it has a very bad worst case, so \(\alpha\) should not be allowed to exceed 0.9. Clustering, i.e. long runs of full adjacent cells, is a big problem. Deletion is almost impossible.
Double hashing
Uses a second hash function \(s\) to determine the offset when probing for a free cell: if \(h(k)\) is occupied, \(h(k)+s(k)\) is tried, then \(h(k)+2s(k)\), etc. The table size should be a prime number, so the probe sequence does not cycle before visiting every cell.
Rehashing
Keep track of a load factor and rehash the whole table when it exceeds a certain number, say 0.9. This means allocating a larger table, revisiting each element in the old table and moving it to its new position in the new table. Introduces long delays at unpredictable times, “stop the world” operation.
Rabin-Karp string search
String search algorithm based on string hashing. It searches for a pattern by hashing every window of the text with the pattern's length and testing whether the hash matches the hash of the pattern. If the hashes are the same, the strings are compared in the usual way. Since false positives are rare with a good hash function, the \(O(m)\) time, where m is the pattern length, to compare the strings can be ignored. The hash can be computed incrementally (a rolling hash), reducing the total hashing time.
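A Rabin-Karp sketch with a rolling polynomial hash modulo a prime; the constants (radix 256, modulus 997) are illustrative choices:

```python
# Rabin-Karp: index of the first occurrence of pattern, or -1.
def rabin_karp(text, pattern):
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    radix, q = 256, 997
    high = pow(radix, m - 1, q)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * radix + ord(pattern[i])) % q
        t_hash = (t_hash * radix + ord(text[i])) % q
    for i in range(n - m + 1):
        # hashes match: confirm with a direct O(m) comparison
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:                    # roll the window by one character
            t_hash = ((t_hash - ord(text[i]) * high) * radix
                      + ord(text[i + m])) % q
    return -1
```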
Drawbacks
- Not good for traversal in sorted order
- Unless separate chaining is used, deletion is almost impossible
- It may be hard to predict the volume of data, rehashing is an expensive “stop the world” procedure
Compression
Computer files can contain redundancy, often introduced for practical purposes. Compression aims to remove that redundancy.
Run length encoding
For a piece of text that has long runs of repeated characters, we can compress by counting runs. For example: \[AAAABBBCABCAAABBBBBBBCCCCCCCCCCCCC\] could be encoded as \[4A3BCABC3A7B13C\] Not appropriate for English text, but can work well with binary files.
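An encoder sketch matching the example above: runs of length at least 2 are written as count + character, single characters are written as-is (this convention is inferred from the example):

```python
# Run-length encoding in the style of the example.
def rle_encode(text):
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                    # extend the current run
        run = j - i
        out.append(text[i] if run == 1 else f"{run}{text[i]}")
        i = j
    return "".join(out)

# rle_encode("AAAABBBCABCAAABBBBBBBCCCCCCCCCCCCC") == "4A3BCABC3A7B13C"
```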
Variable length encoding
Instead of having 8 bits for every character, use codes such that common characters have shorter codes. For example, E could have code 0; but then no other character's code can start with 0. In general, no code may be a prefix of another (the codes must be prefix-free).
Trie (prefix tree)
Binary tree for search applications. Leaves store elements; the path to a leaf encodes its code, e.g. left=0, right=1. Using a trie means no code will be the prefix of another.
Huffman encoding
If we know the letter frequencies well, what are the optimal code assignments for each letter? We can find them by repeatedly finding the two smallest weights and fusing them (Huffman's algorithm, a greedy algorithm). This results in a Huffman tree. For this type of compression, both the compressor and decompressor must agree on the codes being used. Alternatively, the trie can be sent along with the message, taking negligible space with large files. Modern variants encode recurring patterns as well as individual letters.
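A sketch of the fusing step with a priority queue; `freqs` maps each symbol to its frequency, and the tie-breaker counter (an implementation detail added here) keeps the heap from ever comparing trees:

```python
import heapq

# Huffman's algorithm: returns {symbol: code string}.
def huffman_codes(freqs):
    # heap entries: (weight, tie_breaker, tree); a tree is either
    # a symbol or a (left, right) pair of sub-trees
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # fuse the two smallest weights
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, code):
        if isinstance(tree, tuple):       # internal node
            walk(tree[0], code + "0")     # left edge = 0
            walk(tree[1], code + "1")     # right edge = 1
        else:
            codes[tree] = code or "0"     # single-symbol edge case
    walk(heap[0][2], "")
    return codes
```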
Space-Time trade off
Use extra space to reduce the time needed for a process. This is a type of preprocessing.
String matching
Strings are built from a small alphabet and often have structure that search algorithms can exploit.
Horspool's String Search Algorithm
Compares the pattern against the text right to left within each alignment. Very good on random text strings. Shifts the pattern based on the rightmost occurrence, within the pattern, of the text character aligned with the pattern's last position: a shift table built from the pattern determines how far to shift. Inspired by the Boyer-Moore algorithm, which uses a more sophisticated shifting strategy.
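A Horspool sketch: the shift table maps each character to the distance from its rightmost occurrence (excluding the pattern's last position) to the end of the pattern; absent characters shift by the full pattern length:

```python
# Horspool's algorithm: index of the first match, or -1.
def horspool(text, pattern):
    n, m = len(text), len(pattern)
    # later indices overwrite earlier ones -> rightmost occurrence
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = m - 1                            # index of the window's last character
    while i < n:
        k = 0                            # compare right to left
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1             # full match found
        i += shift.get(text[i], m)       # slide pattern by the table entry
    return -1
```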
Knuth-Morris-Pratt (KMP) algorithm
Good for patterns utilising a small alphabet. Constructs a finite state machine to do the string matching.
Complexity theory
There are problems without simple solutions, and even some without any solution. Some problems have no algorithms with a reasonable worst-case running time:
- Shortest path with additional constraints
- Timetabling
- Production planning
Complexity theory asks:
- What is an inherently difficult problem?
- Can we categorise problems by their difficulty?
- How can we know an algorithm is optimal in the asymptotic sense?