====== COMP20007 - Design of Algorithms ======

{{tag>notes unimelb computer_science}}

$$\require{algorithm2e,amsmath,amssymb,mathtools}$$

====== Problem solving steps ======

  * Understand the problem
  * Decide on the computational means
  * Decide on the method to use
  * Design the necessary data structures and algorithms
  * Check for correctness, trace an input example
  * Evaluate analytically (time, space, worst case, average case)
  * Code it
  * Evaluate empirically

===== Sum of the first n integers =====

\[\sum^n_{i=1}i=\frac{n(n+1)}{2}\]

The average of the numbers is \(\frac{n+1}{2}\) and there are \(n\) numbers, so the sum is \(\frac{n+1}{2}\cdot n\).

====== Fundamental data structures ======

These are abstract types: they are not tied to a particular implementation, but promise a set of possible operations.

===== Stack (LIFO) =====

Last in, first out. Operations:

  * CreateStack
  * Push
  * Pop
  * Top
  * EmptyStack?
  * ...

Can be implemented as an array or a linked list.

===== Queue (FIFO) =====

First in, first out. Operations:

  * CreateQueue
  * Enqueue
  * Dequeue
  * Head
  * Empty?

===== Others =====

Priority queues, trees, graphs.

====== Growth rate and Algorithm Efficiency ======

Assume basic operations take the same amount of time, abstracting away from the architecture. Compare algorithms by their growth rates, in terms of asymptotic analysis (the order hierarchy).

\(O(g(n))\) denotes the set of functions that grow no faster than \(g(n)\) (asymptotically bounded above).

\(\Omega(g(n))\) denotes the set of functions that grow no slower than \(g(n)\) (asymptotically bounded below).

\(\Theta(g(n))\) denotes the set of functions that grow as fast as \(g(n)\) (asymptotically bounded both above and below).

====== Graphs ======

Graphs model a common set of problems, such as network design, flow design, planning and scheduling.

===== Definitions =====

A graph is denoted \(G=\langle V,E\rangle\), where V is the set of nodes or vertices and E is the set of edges (pairs of nodes).

The direction of edges can matter; this gives a directed graph. A graph can be unconnected: there may be pairs of nodes with no possible path between them through other nodes.

If \((u,v)\in E\) then u and v are adjacent, or neighbours. \((u,v)\) connects u and v, and u and v are incident to \((u,v)\). The degree of a node is the number of edges incident to it.

A path in \(\langle V,E\rangle\) is a sequence of nodes \(v_0,v_1,\ldots,v_k\) from V such that \((v_i,v_{i+1})\in E\). The path \(v_0,v_1,\ldots,v_k\) has length k. A simple path is one with no repeated nodes. A cycle is a simple path, except that \(v_0=v_k\), i.e. the last node is the same as the first.

Graph edges can have weights, giving a weighted graph. A dense graph has many edges connecting its nodes, whereas a sparse graph has few. A complete graph is one where every node is connected to every other node.

Acyclic graphs have no cycles, so there is no way to go "up" to a previously visited node; a connected acyclic graph is a "tree". Directed acyclic graphs are "dags". A rooted tree is a tree with one node identified as special (the root) in that every other node is reachable from it. If the root is removed, a set of rooted sub-trees remains.

A directed graph is strongly connected if there exists a path from every node to every other node, keeping with the edge directions. A graph is connected if there is a path between every pair of nodes.

===== Depth-first search =====

Visits nodes by following a path as deep as possible before backtracking to visit the remaining neighbours. Vertices are pushed onto a stack as they are visited and popped off as we backtrack. Start with every node marked 0 (unvisited) and mark each node with the order in which it is found.

Produces a DFS tree/forest. Tree edges denote the paths followed; back edges denote alternative connections.

With an adjacency matrix, the complexity is \(\Theta(|V|^2)\). With adjacency lists, the complexity is \(\Theta(|V|+|E|)\).
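A minimal sketch of depth-first search in C, assuming the graph is stored as an adjacency matrix and using recursion in place of an explicit stack; the small graph built in main is only an illustrative example.

<code c>
#include <stdio.h>

#define N 6                      /* number of vertices in the example graph */

int order[N];                    /* order[v] = 0 means unvisited, otherwise discovery order */
int counter = 0;

void dfs(int adj[N][N], int v) {
    order[v] = ++counter;        /* mark v with the order in which it was found */
    for (int u = 0; u < N; u++) {
        if (adj[v][u] && order[u] == 0) {
            dfs(adj, u);         /* tree edge: follow the path deeper */
        }
    }
}

int main(void) {
    /* example graph: undirected edges 0-1, 0-2, 1-3, 2-4, 4-5 */
    int adj[N][N] = {{0}};
    int edges[5][2] = {{0,1},{0,2},{1,3},{2,4},{4,5}};
    for (int i = 0; i < 5; i++) {
        adj[edges[i][0]][edges[i][1]] = 1;
        adj[edges[i][1]][edges[i][0]] = 1;
    }
    /* restart from every unvisited vertex, so an unconnected graph yields a forest */
    for (int v = 0; v < N; v++) {
        if (order[v] == 0) dfs(adj, v);
    }
    for (int v = 0; v < N; v++) {
        printf("vertex %d found at step %d\n", v, order[v]);
    }
    return 0;
}
</code>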
===== Breadth-first search =====

Proceeds by visiting every node that is one step away from the start, then those that are two steps away, and so on. Uses a queue to hold the nodes still to be visited.

Produces a BFS tree/forest. Tree edges denote the direct connections followed; cross edges denote alternative connections. BFS has the same complexity as DFS.

===== Topological sorting =====

Assume a directed edge (a,b) means that a must come before b; this implies a directed graph without cycles, i.e. a dag. We can linearise the graph so that the nodes form a sequence \(v_1,v_2,\ldots,v_n\) such that for each edge \((v_i,v_j)\in E\) we have \(i < j\).

====== Tree traversal ======

A general traversal scheme, where the data structure may be a stack or a queue:

<code>
Inject(T)
while the data structure is non-empty:
    T <- Eject
    if T.left is non-empty:  Inject(T.left)
    if T.right is non-empty: Inject(T.right)
</code>

=== Pre-order ===

Visit the root, then the left sub-tree, and finally the right sub-tree. Done using a stack.

=== In-order ===

Visit the left sub-tree, then the root, and finally the right sub-tree.

=== Post-order ===

Visit the left sub-tree, then the right sub-tree, and finally the root.

=== Level-order ===

Visit the nodes level by level, starting from the root. Done using a queue.

====== Greedy Algorithms ======

A strategy based on making the locally best choice. For some problems this yields the optimal solution, for others it does not. A greedy algorithm can be both correct and fast; even when it is not optimal, it may give a good approximation of the optimal solution.

====== Priority Queue ======

A set of elements such that the element with the highest priority is at the top, i.e. it is ejected first.

Operations:

  * Find an item with maximal priority
  * Insert a new item
  * Test for emptiness
  * Eject the largest item

Becomes a stack if the priority is the latest insertion time, and a queue if it is the earliest.

Costs:

  * **Unsorted array or list:** Inject O(1), Eject O(n)
  * **Sorted array or list:** Inject O(n), Eject O(1)

====== Spanning trees ======

A spanning tree of a graph \(\langle V,E\rangle\) is a tree \(\langle V,E'\rangle\) with \(E'\subseteq E\). In some applications edges have costs, so some spanning trees are more desirable than others. A minimum spanning tree is a spanning tree with the minimal total edge cost.

===== Prim's Algorithm =====

Prim's algorithm is a greedy algorithm for constructing minimum spanning trees. It constructs a sequence of sub-trees, each adding one node together with an edge to a node in the previous sub-tree. In each step it picks a closest node from outside the tree and adds it.

===== Dijkstra's Algorithm =====

Similar structure to Prim's, but computes the shortest path between two nodes. Where Prim's compares single edge weights, Dijkstra's compares the length of the whole path from the source through the candidate node. Only works for non-negative weights.

====== Divide and Conquer ======

Divide a problem into smaller instances, solve the instances recursively, then combine the smaller solutions to solve the original instance. The recursive calls form a tree of sub-problems.

====== Warshall's Algorithm ======

Computes the transitive closure of a binary relation (or a directed graph), presented as a matrix. An edge (a,z) is in the transitive closure of graph G iff there is a path in G from a to z.
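A minimal sketch of Warshall's algorithm in C: the reachability matrix is updated in place, allowing one more vertex to act as an intermediate point on each pass. The graph and its size are a made-up example.

<code c>
#include <stdio.h>

#define N 4                      /* number of vertices in the example graph */

/* After the call, r[i][j] is 1 iff there is a path from i to j in the original graph. */
void warshall(int r[N][N]) {
    for (int k = 0; k < N; k++)          /* allow vertex k as an intermediate node */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (r[i][k] && r[k][j])
                    r[i][j] = 1;         /* i reaches k and k reaches j, so i reaches j */
}

int main(void) {
    /* example digraph: edges 0->1, 1->2, 2->3 */
    int r[N][N] = {{0}};
    r[0][1] = r[1][2] = r[2][3] = 1;
    warshall(r);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%d ", r[i][j]);
        printf("\n");
    }
    return 0;
}
</code>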
====== Sorts ======

===== Basic sorts =====

Selection Sort

  * **Key idea:** find the smallest remaining element, place it at the front of the unsorted part, repeat
  * Simple, consistent, slow (\(O(n^2)\))
  * Not stable in the usual array implementation

Insertion Sort

  * **Key idea:** maintain a sorted list, insert one element at a time
  * Mostly slow (\(O(n^2)\)), but good best case (\(O(n)\))
  * Fast on __almost__ sorted arrays

Counting Sort

  * **Key idea:** count the number of appearances of each key, then rebuild the sorted array
  * Great performance when key values are limited (\(O(n)\))

===== Mergesort =====

Divide the list up, then recursively merge the sorted sub-lists. \(O(n\log(n))\) time, \(O(n)\) space.

A stable sort; requires extra space; not 100% input-insensitive, but relatively robust. Efficient for sorting linked lists. Good for (extremely) large data.

===== Quicksort =====

Choose a pivot and partition the list so that all elements to the left are smaller than the pivot and all elements to the right are larger. Repeat recursively on the left and right parts.

\(O(n\log(n))\) best/average time, \(O(n^2)\) worst time. Partitioning is done in place, so the only extra space is the recursion stack.

====== Heap ======

A data structure for storing a collection of elements. Maintains a complete binary tree such that each parent is not smaller than its children.

Naturally represented as a tree, but can be stored in an array. As an array (with the root at position 1), the left child of position \(i\) is at position \(2i\) and the right child is at \(2i+1\).

===== Important properties =====

  * The root is the largest element
  * Each sub-tree is a heap
  * A heap with n elements has \(1+\lfloor\log_2{n}\rfloor\) levels

====== Dynamic programming ======

Dependent on the principle of optimality: //the optimal solution can be found by finding the optimal solutions to simpler sub-problems//.

===== Example: The Coin-Row Problem =====

  * Each coin has a value
  * The coins are laid in a row
  * Adjacent coins can not be selected
  * Select the coins such that the set has the maximum sum of values

Algorithm: apply the following formula iteratively, where \(v_n\) is the value of the n-th coin in the row:

\[S(n)=\max\{S(n-1),S(n-2)+v_n\}\]

\(S(0)=0\) and \(S(1)=v_1\) are used as starting points.

===== Example: The Knapsack Problem =====

  * You are given a set of items
  * Each item has a value and a weight
  * Select a subset to maximise value such that the sum of the weights does not exceed a given capacity

==== Algorithm ====

Fill a table where K[i,j] is the best value achievable using the first i items with capacity j:

<code>
for i <- 0 to n: K[i,0] <- 0
for j <- 1 to W: K[0,j] <- 0
for i <- 1 to n:
    for j <- 1 to W:
        if j < w[i]:
            K[i,j] <- K[i-1,j]
        else:
            K[i,j] <- max(K[i-1,j], K[i-1,j-w[i]] + v[i])
return K[n,W]
</code>
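A runnable C sketch of the table-filling above, on a made-up instance (the weights, values and capacity are chosen only for illustration).

<code c>
#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void) {
    /* example instance: 4 items, capacity 5; arrays are 1-indexed to match K[i][j] */
    int n = 4, W = 5;
    int w[] = {0, 2, 1, 3, 2};
    int v[] = {0, 12, 10, 20, 15};
    int K[5][6] = {{0}};                 /* K[i][j]: best value using items 1..i with capacity j */

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= W; j++) {
            if (j < w[i])
                K[i][j] = K[i-1][j];     /* item i does not fit */
            else                         /* either skip item i or take it */
                K[i][j] = MAX(K[i-1][j], K[i-1][j - w[i]] + v[i]);
        }
    }
    printf("best value: %d\n", K[n][W]); /* prints 37 for this instance */
    return 0;
}
</code>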
===== Binary Search Trees =====

A binary tree where every key in the left sub-tree is less than the node's key and every key in the right sub-tree is greater.

Insertion is \(O(height)\). Imbalanced trees are not ideal; the worst case is \(O(n)\). The key idea is to keep the tree balanced, which guarantees \(O(\log(n))\) search.

  * Self-balancing trees
    * AVL trees
    * Red-black trees
    * Splay trees
  * Representational changes
    * 2-3 trees
    * 2-3-4 trees
    * B-trees

==== AVL Trees ====

A balanced tree keeps its left and right sub-trees of (nearly) equal height. At every node of an AVL tree the balance factor, i.e. the difference between the heights of the left and right sub-trees, is \(-1, 0\) or \(1\).

The tree can be re-balanced by a rotation; it is always the lowest unbalanced sub-tree that is re-balanced. This notion of balance guarantees that the depth of an AVL tree with n nodes is \(\Theta(\log(n))\). Insertion is \(O(\log(n))\). Deletion is also \(O(\log(n))\) but is harder to implement.

==== Red-Black Trees ====

Has coloured nodes such that:

  - No red node has a red child
  - Every path from the root to the fringe has the same number of black nodes

In the worst case, the longest root-to-fringe path in a Red-Black tree is twice as long as the shortest.

==== Splay Trees ====

Not only self-adjusting, but also adaptive: frequently accessed items are brought closer to the root so their access cost becomes cheaper.

==== 2-3 and 2-3-4 Trees ====

Allow more than one item to be stored in a node. A 2-node holds 1 item and has at most 2 children. A 3-node holds at most 2 items and has at most 3 children. A 4-node holds at most 3 items and has at most 4 children.

To insert, we pretend we are searching for the key k. This takes us to where k should be, whereupon we can insert it.

====== Hashing ======

A hash function is a mapping

\[h: K_1\to K_2\]

Some poor examples:

  * h(string) = length of the string
  * h(array) = sum of elements in the array
  * h(website) = list of key words

===== Hash table =====

A type of data structure for storing data. Operations:

  * Insert
  * Remove
  * Find

Each element has a key used to find it. Usually the hash function is many-to-one (not injective), so different keys may hash to the same value.

Challenges:

  * Design of the hash function
  * Collision handling

Often \(h(n)=n\mod m\) is used, where m is chosen to be prime. Computing this directly can be difficult as n becomes large. Instead, we can utilise theorems about modular arithmetic:

\[(x+y)\mod m = ((x\mod m)+(y\mod m))\mod m\]

\[(x*y)\mod m = ((x\mod m)*(y\mod m))\mod m\]

i.e. it suffices to perform the modular arithmetic on each sub-expression, then recombine at the end. This can reduce the complexity of the calculations.

===== Collisions =====

Collisions occur when the hash function produces the same hash for two different pieces of data.

==== Chaining ====

One solution to collisions is hashing by chaining: all values that hash to the same slot are stored in a list at that slot. It is a very easy method of collision handling.

The **load factor** (\(\alpha\)) is the average number of elements stored at a slot. The number of probes in a successful search is \(\approx 1+\alpha/2\); the number of probes in an unsuccessful search is \(\approx \alpha\).

==== Linear probing ====

If a collision occurs between two keys, the new object is inserted at the next available slot. The load factor is \(\alpha=n/m\), where n is the number of records stored and m is the table size. On average the number of probes required is

Successful: \(\frac{1}{2}+\frac{1}{2(1-\alpha)}\)

Unsuccessful: \(\frac{1}{2}+\frac{1}{2(1-\alpha)^2}\)

This means only a small number of probes is required, even when the hash table is relatively full. This method of collision handling is relatively space efficient, although it has a very bad worst case, so \(\alpha\) should not be allowed to exceed about 0.9. Clustering, i.e. long runs of occupied adjacent cells, is a big problem. Deletion is almost impossible.

==== Double hashing ====

Uses a second hash function \(s(k)\) to determine the offset used when probing for a free cell. If \(h(k)\) is occupied, \(h(k)+s(k)\) is tried, then \(h(k)+2s(k)\), etc. The table size should be a prime number, to avoid cycles in the offsets.

==== Rehashing ====

Keep track of the load factor and rehash the whole table when it exceeds a certain threshold, say 0.9. This means allocating a larger table, then revisiting each element in the old table and moving it to its new position in the new table. It introduces long delays at unpredictable times: a "stop the world" operation.
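A minimal sketch of hashing by chaining in C, assuming integer keys and the simple \(h(k)=k\mod m\) function discussed above; the keys inserted in main are chosen so that some of them collide.

<code c>
#include <stdio.h>
#include <stdlib.h>

#define M 11                     /* table size, chosen to be prime */

/* each table slot holds a linked list (chain) of the keys that hash there */
typedef struct node {
    int key;
    struct node *next;
} node;

node *table[M];

int hash(int key) { return key % M; }

void insert(int key) {
    node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[hash(key)]; /* prepend to the chain for this slot */
    table[hash(key)] = n;
}

int find(int key) {
    for (node *p = table[hash(key)]; p != NULL; p = p->next)
        if (p->key == key) return 1;
    return 0;
}

int main(void) {
    insert(7); insert(18); insert(29);      /* 7, 18 and 29 all hash to slot 7 */
    printf("%d %d\n", find(29), find(30));  /* prints "1 0" */
    return 0;
}
</code>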
===== Rabin-Karp string search =====

A string search algorithm based on string hashing. It searches for a pattern by hashing every window of the text that has the same length as the pattern, and comparing each window's hash against the hash of the pattern. If the hashes are the same, the strings are compared in the usual way.

Since false positives are rare with a good hash function, the \(O(m)\) time, where m is the pattern length, needed to compare the strings can be ignored. The hash can be computed incrementally as the window slides along the text, reducing the time spent hashing.

===== Drawbacks =====

  * Not good for traversal in sorted order
  * Unless separate chaining is used, deletion is almost impossible
  * It may be hard to predict the volume of data, and rehashing is an expensive "stop the world" procedure

====== Compression ======

Computer files can contain redundancy, often introduced for practical purposes. Compression aims to remove redundancy.

===== Run length encoding =====

A piece of text that has long runs of repeated characters can be compressed by counting the runs. For example:

\[AAAABBBCABCAAABBBBBBBCCCCCCCCCCCCC\]

could be encoded as

\[4A3BCABC3A7B13C\]

Not appropriate for English text, but can work for binary files.

===== Variable length encoding =====

Instead of using 8 bits for every character, use codes such that common characters have shorter codes. For example, E could have the code 0, but then no other character's code can start with 0: the codes must be prefix-free, i.e. no code may be a prefix of another.

==== Trie (prefix tree) ====

A binary tree for search applications. Leaves store the elements; the path from the root to a leaf encodes the element's code, e.g. left = 0, right = 1. Using a trie means no code will be the prefix of another.

===== Huffman encoding =====

If we know the letter frequencies, what is the optimal assignment of codes to letters? We can find it by repeatedly taking the two smallest weights and fusing them (Huffman's algorithm, a greedy algorithm). This results in a Huffman tree.

For this type of compression, both the compressor and the decompressor must agree on the codes being used. Alternatively, the trie can be sent along with the message, taking negligible space for large files. Modern variants encode recurring patterns as well as individual letters.

====== Space-Time trade off ======

Use extra space to reduce the time needed for a computation. This is a type of preprocessing.

====== String matching ======

Strings are built from a small alphabet, in a structured order.

===== Horspool's String Search Algorithm =====

Compares the pattern against the text from right to left within the pattern. Very good on random text strings.

The pattern is shifted according to the last occurrence, within the pattern, of the text character aligned with the pattern's final position. A shift table is built for the pattern string to determine how far the pattern should shift.

A simplification of the Boyer-Moore algorithm, which uses a more sophisticated shifting strategy.

===== Knuth-Morris-Pratt (KMP) algorithm =====

Good for patterns over a small alphabet. Constructs a finite state machine to do the string matching.

====== Complexity theory ======

There are problems without simple solutions, and even some without any solution at all. Some problems without reasonable worst-case time algorithms:

  * Shortest path with additional constraints
  * Timetabling
  * Production planning

Complexity theory asks:

  * What is an inherently difficult problem?
  * Can we categorise problems by their difficulty?
  * How can we know an algorithm is optimal in the asymptotic sense?
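As a closing sketch tying back to Horspool's algorithm above, a minimal C version that builds the shift table from the pattern and then scans the text, comparing right to left; the text and pattern in main are only an illustrative example.

<code c>
#include <stdio.h>
#include <string.h>

#define ALPHABET 256

/* Horspool's algorithm: returns the index of the first match, or -1 if none. */
int horspool(const char *text, const char *pattern) {
    int n = strlen(text), m = strlen(pattern);
    int shift[ALPHABET];

    for (int c = 0; c < ALPHABET; c++)
        shift[c] = m;                                 /* default shift: the full pattern length */
    for (int j = 0; j < m - 1; j++)
        shift[(unsigned char)pattern[j]] = m - 1 - j; /* distance from last occurrence to the end */

    int i = m - 1;                      /* text index aligned with the last pattern character */
    while (i < n) {
        int k = 0;
        while (k < m && pattern[m - 1 - k] == text[i - k])
            k++;                        /* compare pattern and text right to left */
        if (k == m)
            return i - m + 1;           /* full match found */
        i += shift[(unsigned char)text[i]];  /* shift by the table entry for the aligned text char */
    }
    return -1;
}

int main(void) {
    printf("%d\n", horspool("JIM_SAW_ME_IN_A_BARBERSHOP", "BARBER"));  /* prints 16 */
    return 0;
}
</code>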