Algorithms, Part I - Princeton University

This course covers programming and problem solving with applications as well as data structures and algorithms:

topic	data structures and algorithms	part
data types	stack, queue, bag, union-find, priority queue	part I
sorting	quicksort, mergesort, heapsort	part I
searching	BST, red-black BST, hash table	part I
graphs	BFS, DFS, Prim, Kruskal, Dijkstra	part II
strings	radix sorts, tries, KMP, regexps, data compression	part II
advanced	B-tree, suffix array, maxflow	part II

Week 1:
- Union-find
- Analysis of Algorithms
Week 2:
- Stacks and Queues
- Elementary Sorts
Week 3:
- Mergesort
- Quicksort
Week 4:
- Priority Queues
- Elementary Symbol Tables
Week 5:
- Balanced Search Trees
- Geometric Applications of BSTs
Week 6:
- Hash Tables
- Symbol Table Applications

Week 1: Union-Find

We illustrate our basic approach to developing and analyzing algorithms by considering the dynamic connectivity problem. We introduce the union–find data type and consider several implementations (quick find, quick union, weighted quick union, and weighted quick union with path compression). Finally, we apply the union–find data type to the percolation problem from physical chemistry.

Steps to developing a usable algorithm

Model the problem
Find an algorithm to solve it
Fast enough? Fits in memory?
If not, figure out why
Find a way to address the problem
Iterate until satisfied

Dynamic Connectivity

Given a set of N objects
- Union command: connect two objects
- Find/connected query: is there a path connecting the two objects?
When programming, it is convenient to name to name objects 0 to N - 1
- Use integers as array index
- Suppress details not relevant to union-find
Find query will check if two objects are in the same component
Replace components containing two objects with their union
The skeleton java program might look something like this:

public class UF {
  // Initialize union-find data structure with N objects (0 to N-1)
  UF(int N)
  // Add connection between p and q
  void union(int p, int, q) {}
  // Are p and q in the same component?
  boolean connected(int p, int, q) {}
}

We can check the API design before getting too far:

public static void main(String[] args) {
  int N = StdIn.readInt();
  UF uf = new UF(N);
  while (!StdIn.isEmpty()) {
    int p = StdIn.readInt();
    int q = StdIn.readInt();
    if (!uf.connected(p, q)) {
      uf.union(p, q);
      StdOut.println(p + ' ' + q);
    }
  }
}

Quick-find

Also called an eager algorithm, has a data structure consisting of:
- Integer array id[] of size N
- p and q are connected if and only if they have the same id
Find operation becomes: check if p and q have the same id
Union operation is a bit trickier: to merge components containing p and q, change all entries whose id equals id[p] to id[q]
The java implementation for quick find might look something like this:

public class QuickFindUF {
  private int[] id;

  public QuickFindUF(int N) {
    id = new int[N];
    for (int i = 0; i < N; i++) {
      id[i] = i;
    }
  }

  public boolean connected(int p, int q) {
    return id[p] == id[q];
  }

  public void union(int p, int q) {
    int pid = id[p];
    int qid = id[q];
    for (int i = 0; i < id.length; i++) {
      if (id[i] == pid) id[i] = qid
    }
  }
}

However, quick-find is too slow, it would take N^2 array accesses to process sequence of N union commands on N objects

Quick-union

Also known as lazy approach, has data structure consisting of:
- Integer array id[] of size N
- id[i] is parent of i
- Root of i is id[id[id[...id[i]...]]]
Find: check if p and q have the same root
Union: Merge components containing p and q, set the id of p's root to the id of q's root
Implementation in java:

public class QuickUnionUF {
  private int[] id;

  public QuickUnionUF(int N) {
    id = new int[N];
    for (int i = 0; i < N; i++) {
      id[i] = i;
    }
  }

  private int root(int i) {
    while (i != id[i]) {
      i = id[i];
    }
    return i;
  }

  public boolean connected(int p, int q) {
    return root(p) == root(q);
  }

  public void union(int p, int q) {
    int i = root(p);
    int j = root(q);
    id[i] = j;
  }
}

Unfortunately, quick-union is also too slow:
- Trees can get tall
- Find too expensive (could be N array accesses)

Quick-union Improvements

Weighted quick-union improves quick-union by:
- Modifying quick-union to avoid tall trees
- Keeping track of size of each tree (number of objects)
- Balance by linking root of smaller tree to root of larger tree
Data structure for weighted quick-union is the same as quick-union but we maintain extra array sz[i] to count number of objects in the tree rooted at i
Find is identical to quick-union:

return root(p) == root(q);

Union will link root of smaller tree to root of larger tree and update the sz[] array:

int i = root(p);
int j = root(q);
if (i == j) {
  return;
}
if (sz[i] < sz[j]) {
  id[i] = j;
  sz[j] += sz[i];
}
else  {
  id[j] = i;
  sz[i] += sz[j];
}

This weighted improvement allows the following:
- Find takes time proportional to the depth of p and q
- Union takes constant time, given roots
Another improvement is to implement path compression:
- Just after computing the root of p, set the id of each examined node to point to that root
The java implementation of path compression is very simple, just add one line of code:

private int root (int i) {
  while (i != id[i]) {
    id[i] = id[id[i]]; // extra line of code
    i = id[i];
  }
  return i;
}

Friedman and Sachs proved that there is no linear-time algorithm for the union-find problem but weighted quick-union with path compression gets us pretty close!

Percolation

N-by-N grid of sites
Each site is open with probability p (or blocked with probability 1 - p)
System percolates if and only if top and bottom are connected by open sites

Week 1: Analysis of Algorithms

The basis of our approach for analyzing the performance of algorithms is the scientific method. We begin by performing computational experiments to measure the running times of our programs. We use these measurements to develop hypotheses about performance. Next, we create mathematical models to explain their behavior. Finally, we consider analyzing the memory usage of our Java programs.

Reasons to analyze algorithms

Predict performance
Compare algorithms
Provide guarantees
Understand theoretical basis
Avoid performance bugs

Scientific method applied to analysis of algorithms

A framework for predicting performance and comparing algorithms
- Observe some feature of the natural world (run-time of a program on a computer)
- Hypothesize a mode that is consistent with the observations
- Predict events using the hypothesis (run-time of a larger problem-size)
- Verify the predictions by making further observations
- Validate by repeating until the hypothesis and observations agree
Some principles of the scientific method include:
- Experiments must be reproducible
- Hypotheses must be falsifiable

Experimental Algorithmics

Observing what is happening to the run-time of our programs as we vary input size gives us a way to predict performance
System independent effects:
- Algorithm
- Input data
System dependent effects:
- Hardware: CPU, memory, cache, ...
- Software: compiler, interpreter, garbage collector, ...
- System: operating system, network, other apps, ...
Bad news:
- It is difficult to get precise measurements
Good news:
- Much easier and cheaper than other sciences

Mathematical Models

Although we could observe the run-time of our programs, it does not help us understand what our algorithms are doing
By looking at some basic operation as a proxy for running time we can come up with a general run-time of our operation
Once we have an estimate we can ignore lower order terms (tilde notation)
We can estimate a discrete sum through discrete mathematics or calculus

Order-of-growth Classifications

There are only a small set of functions we care about regarding performance:
- log(N) - Fast
- N - OK
- _N _ log(N)* - OK
- N^2 - Slow
- N^3 - Slower
- 2^N - Slowest
We try the best we can to make sure an algorithm is not quadratic or cubic

Binary Search: Java implementation

Below is a Java implementation of binary search which takes logarithmic time to complete:

public static int binarySearch(int[] a, int key) {
  int lo = 0, hi = a.length - 1;
  while (lo <= hi) {
    int mid = lo + (hi - lo) / 2;
    if (key < a[mid]) {
      hi = mid - 1;
    } else if (key > a[mid]) {
      lo = mid + 1;
    } else {
      return mid;
    }
    return -1;
  }
}

Types of Analysis

There are different ways of analyzing the algorithm depending on the input, this breaks down algorithm run-time into best case, average case, and worst case:
- Best case is determined by the "easiest" input and provides a goal for all inputs
- Worse case is determined by the "hardest" input and provides a guarantee for all inputs
- Best and worst case can be mapped to the lower and upper bound of run-time respectively
- Average case is the expected cost for a random input and provides a way to predict run-time
- Important to note that Big-O notation provides only an upper bound on the growth rate of a function as n gets large

Algorithm design approach

Develop an algorithm
Prove a lower bound
Lower the upper bound (discover a new algorithm)
Raise the lower bound (more difficult)

Memory

Although run-time costs of operations are important, memory cost is also important
Let's look at an example of memory cost in Java, how much memory does this union-find function use as a function of N? About 8N bytes

public class WeightedQuickUnionUF { // 16 bytes (object overhead)
  private int[] id; // 8 + (4N + 24) each reference + int[] array
  private int[] sz;
  private int count; // 4 bytes (int)
  // 4 bytes (padding)
  public WeightedQuickUnionUF(int N) {
    id = new int[N];
    sz = new int[N];
    for (int i = 0; i < N; i++) {
      id[i] = i;
    }
    for (int i = 0; i < N; i++) {
      sz[i] = i;
    }
  }
}

Week 2: Stacks and Queues

We consider two fundamental data types for storing collections of objects: the stack and the queue. We implement each using either a singly-linked list or a resizing array. We introduce two advanced Java features—generics and iterators—that simplify client code. Finally, we consider various applications of stacks and queues ranging from parsing arithmetic expressions to simulating queueing systems.

Stacks and Queues

There are some fundamental data types:
- Value: collection of objects
- Operations: insert, remove, iterate, test if empty
Stack - LIFO (last-in-first-out)
Queue - FIFO (first-in-first-out)

Stacks

An example of a stack test client implemented in Java:

public static void main(String[] args) {
  StackOfStrings stack = new StackOfStrings();
  while (!StdIn.isEmpty()) {
    String s = StdIn.readString();
    if (s.equals("-")) {
      StdOut.print(stack.pop());
    }
    else {
      stack.push(s);
    }
  }
}

What if we wanted to implement a stack with a linked-list?

public class LinkedStackOfStrings {
  private Node first = null;
  // Private inner class
  private class Node {
    String item;
    Node next;
  }

  public boolean isEmpty() {
    return first == null;
  }

  public void push(String item) {
    Node oldfirst = first;
    first = new Node();
    first.item = item;
    first.next = oldfirst;
  }

  public String pop() {
    String item = first.item;
    first = first.next;
    return item;
  }
}

Every operation takes constant time in the worst case
A stack with N items uses ~40 N bytes
What if we wanted to implement a stack with an array?

public class FixedCapacityStackOfStrings {
  private String[] s;
  private int N = 0;
  // Cheating a bit here since we are requiring the client to input a capacity
  public FixedCapacityStackOfStrings(int capacity) {
    s = new String[capacity];
  }

  public boolean isEmpty() {
    return N == 0;
  }

  public void push(String item) {
    s[N++] = item; // Use N++ to index into array then increment N
  }

  public String pop() {
    return s[--N]; // Decrement N, then use to index into array
  }
}

Stack considerations include:
- Underflow: throw an exception if pop from an empty stack
- Overflow: use resizing array for array implementation
Null items can be inserted into a stack
Loitering: holding reference to an object when it is no longer needed

// Here we are still holding references
public String pop() {
  return s[--N];
}

// To address loitering
public String pop() {
  String item = s[--N];
  s[N] = null
  return item;
}

Resizing Arrays

Problem: looking back at our stack array implementation, we require the client to provide capacity when we need to know how to grow and shrink the array
If array is full, create a new array of twice the size:

public ResizingArrayStackOfStrings() {
  s = new String[1];
}

public void push(String item) {
  if (N == s.length) {
    resize(2 * s.length);
  }
  s[N++] = item;
}

private void resize(int capacity) {
  String[] copy = new String[capacity];
  for (int i = 0; i < N; i++) {
    copy[i] = s[i];
  }
  s = copy;
}

To shrink the array we would implement the logic:
- push(): double size of array s[] when array is full
- pop(): halve the size of array s[] when array is one-half full
This sequence is too expensive in the worst case however since each operation takes N time, instead it is more efficient to implement the following:
- push(): double size of array s[] when array is full
- pop(): halve the size of array s[] when array is one-quarter full

public String pop() {
  String item = s[--N];
  s[N] = null;
  if (N > 0 && N == s.length/4) {
    resize(s.length/2);
  }
  return item;
}

So when would we implement a resizing array vs a linked-list?
- Linked-list implementation:
  - Every operation takes constant time in the worst case
  - Uses extra time and space to deal with the links
- Resizing-array implementation:
  - Every operation takes constant amortized time (constant time overall)
  - Less wasted space

Queues

What if we wanted to implement a queue with a linked-list?

public class LinkedQueueOfStrings {
  private Node first = null;
  private Node last = null;

  private class Node {
    String item;
    Node next;
  }

  public boolean isEmpty() {
    return first == null;
  }

  public void enqueue(String item) {
    Node oldlast = last;
    last = new Node();
    last.item = item;
    last.next = null;
    // Check if empty
    if (isEmpty()) {
      first = last;
    } else {
      oldlast.next = last;
    }
  }

  public String dequeue() {
    String item = first.item;
    first = first.next;
    // Check if empty
    if (isEmpty()) {
      last = null;
    }
    return item;
  }
}

Generics

So far our implementations only covers strings not other data types, we want to implement a separate stack class for each type
One solution is to implement casting, however casting is error-prone since run-time error could occur if types mismatch
A better approach is to use generics:
- Avoids casting in the client
- Discovers type mismatch errors at compile-time instead of run-time
A linked-list implementation for stacks using generics:

// Here we can use our stack implementation for the generic type name Item
public class Stack<Item> {
  private Node first = null;

  private class Node {
    Item item;
    Node next;
  }

  public boolean isEmpty() {
    return first == null;
  }

  public void push(Item item) {
    Node oldfirst = first;
    first = new Node();
    first.item = item;
    first.next = oldfirst;
  }

  public Item pop() {
    Item item = first.item;
    first = first.next;
    return item;
  }
}

However Java does not allow for array implementation using generics:

// This is not allowed in Java
public class FixedCapacityStack<Item> {
  private Item[] s;
  private int N = 0;

  public FixedCapacityStack(int capacity) {
    s = new Item[capacity];
  }

  public boolean isEmpty() {
    return N == 0;
  }

  public void push(Item item) {
    s[N++] = item;
  }

  public Item pop() {
    return s[--N];
  }
}

Therefore, for arrays we are stuck with casting:

// This is allowed in Java
public class FixedCapacityStack<Item> {
  private Item[] s;
  private int N = 0;

  public FixedCapacityStack(int capacity) {
    s = (Item[]) new Object[capacity]; // Cast using (Item[])
  }

  public boolean isEmpty() {
    return N == 0;
  }

  public void push(Item item) {
    s[N++] = item;
  }

  public Item pop() {
    return s[--N];
  }
}

What about primitive types?
- Each primitive type has a wrapper object type
- There is a feature known as autoboxing which acts as an automatic cast between a primitive type and its wrapper:

// This is what happens behind the scenes
Stack<Integer> s = new Stack<Integer>();
s.push(17); // s.push(new Integer(17));
int a = s.pop(); // int a = s.pop().intValue();

Iterators

What is an Iterable?
- Has a method that returns an Iterator

public interface Iterable<Item> {
  Iterator<Item> iterator();
}

What is an Iterator?
- Has methods hasNext() and next()

public interface Iterable<Item> {
  boolean hasNext();
  Item next();
}

Why make data structures Iterable?
- Java supports elegant client code

for (String s: stack) {
  StdOut.println(s);
}

// longhand foreach statement
Iterator<String> i = stack.iterator();
while (i.hasNext()) {
  String s = i.next();
  StdOut.println(s);
}

Below is an implementation of a stack iterator for linked-lists:

public class Stack<Item> implements Iterable<Item> {

  // ...

  public Iterator<Item> iterator() {
    return new ListIterator();
  }

  private class ListIterator implements Iterator<Item> {
    private Node current = first;

    public boolean hasNext() {
      return current != null;
    }

    // Not supported
    public void remove() {}

    public Item next() {
      Item item = current.item;
      current = current.next;
      return item
    }
  }
}

For arrays it is similar:

public class Stack<Item> implements Iterable<Item> {

  // ...

  public Iterator<Item> iterator() {
    return new ReverseArrayIterator();
  }

  private class ReverseArrayIterator implements Iterator<Item> {
    private int i = N;

    public boolean hasNext() {
      return i > 0;
    }

    // Not supported
    public void remove() {}

    public Item next() {
      return s[--i];
    }
  }
}

Week 2: Elementary Sorts

We introduce the sorting problem and Java's Comparable interface. We study two elementary sorting methods (selection sort and insertion sort) and a variation of one of them (shellsort). We also consider two algorithms for uniformly shuffling an array. We conclude with an application of sorting to computing the convex hull via the Graham scan algorithm.

Selection Sort

Start out with an unsorted array and in the ith iteration we go through the array to try to find the smallest remaining entry
Scans from left to right and tries to find the smallest to swap
Below is an implementation of selection sort in Java:

public class Selection {
  public class static void sort(Comparable[] a) {
    int N = a.length;
    for (int i = 0; i < N; i++) {
      int min = i;
      for (int j = i + 1; j < N; j++) {
        if (less(a[j], a[min])) {
          min = j;
        }
        exch(a, i, min);
      }
    }
  }

  private static boolean less(Comparable v, Comparable w) {
    // ...
  }

  private static void exch(Comparable[] a, int i, int j) {
    // ...
  }
}

Selection sort has a quadratic run-time even if input is sorted
Linear number of exchanges

Insertion Sort

Move an index i from left to right as before but now move the element into position among the other elements to the left (can implement this least-to-greatest or greatest-to-least)
The implementation for insertion sort in Java is as follows:

public class Insertion {
  public static void sort(Comparable[] a) {
    int N = a.length;
    for (int i = 0; i < N; i++) {
      for (int j = i; j > 0; j--) {
        if (less(a[j], a[j - 1])) {
          exch(a, j, j - 1);
        }
        else {
          break;
        }
      }
    }
  }

  private static boolean less(Comparable v, Comparable w) {
    // ...
  }

  private static void exch(Comparable[] a, int i, int j) {
    // ...
  }
}

Insertion sort does depend on the initial order of the data
In most cases, the data is partially sorted and the run-time is linear for this case

Shellsort

One of the oldest sorting methods invented by Shell in 1959
The idea of shellsort is to move entries more than one position at a time by h-sorting the array
Very similar to insertion sort except when going back left to check order we go back by h and not index i
A g-sorted array remains g-sorted after h-sorting it
The implementation for shellsort in Java is as follows:

public class Shell {
  public static void sort(Comparable[] a) {
    int N = a.length;

    int h = 1;
    while (h < N/3) {
      h = 3 * h + 1
    }

    while (h >= 1) {
      for (int i = h; i < N; i++) {
        for (int j = i; j >= h && less(a[j], a[j - h]); j -= h) {
          exch(a, j, j - h)
        }
      }

      h = h/3
    }
  }

  private static boolean less(Comparable v, Comparable w) {
    // ...
  }

  private static void exch(Comparable[] a, int i, int j) {
    // ...
  }
}

In general, the worst-case number of compares used by shellsort with the 3x + 1 increments is about O(n^3/2), however in practice the run-time is much less but there is no known model of that
Why use shellsort?
- Fast unless array size is huge
- Tiny, fixed footprint for code (used in embedded systems)
- Hardware sort prototype

Shuffling

How would we shuffle a deck of cards in linear-time such that the result is a uniformly random permutation?
- Use Knuth's shuffling algorithm which picks a random integer r between 0 and i and applies that to shuffle the deck on each iteration
The implementation of Knuth's shuffling algorithm in Java:

public class StdRandom {
  // ...
  public static void shuffle(Object[] a) {
    int N = a.length;
    for (int i = 0; i < N; i++) {
      int r = StdRandom.uniform(i + 1);
      exch(a, i , r);
    }
  }
}

Convex Hull

The convex hull is a geometric object which is the smallest polygon that encloses all points
How to write a program which gives us a convex hull?
- Consider the geometric properties of the convex hull:
  - Can transverse convex hull only by making counterclockwise turns
  - The vertices of convex hull appear in increasing order of polar angle
From what we know about the geometric properties of the convex hull we can apply Graham scan:
- Choose point p with smallest y-coordinate
- Sort points by polar angle with p
- Consider points in order, discard unless it can create a CCW turn
Below is a sample implementation of Graham scan in Java:

Stack<Point2D> hull = new Stack<Point>();

Arrays.sort(p, Point2D.Y_ORDER); // p[0] is now point with lowest y-coordinate
Arrays.sort(p, p[0].BY_POLAR_ORDER); // Sort by polar angle with respect to p[0]

hull.push(p[0]);
hull.push(p[1]);

for (int i = 2; i < N; i++) {
  Point2D top = hull.pop();
  // Discard points that would create CW turns
  while (Point2D.ccw(hull.peek(), top, p[i]]) <= 0) {
    top = hull.pop()
  }
  hull.push(top);
  hull.push(p[i]); // Add p[i] to putative hull
}

Week 3: Mergesort

We study the mergesort algorithm and show that it guarantees to sort any array of N _ N* items with at most *N _ log(N) compares. We also consider a non-recursive, bottom-up version. We prove that any compare-based sorting algorithm must make at least ∼*N * log(N)* compares in the worst case. We discuss using different orderings for the objects that we are sorting and the related concept of stability.

Mergesort

Mergesort could be broken down into parts:
- Divide array into halves
- Recursively sort each half
- Merge two halves
The Java implementation of mergesort could be implemented as follows:

private static void merge(Comparable[] a, Comparable[] aux, int lo, int mid, int hi) {
  assert isSorted(a, lo, mid); // Precondition: a[lo... mid] sorted
  assert isSorted(a, mid + 1, hi); // Precondition: a[mid + 1... hi] sorted

  for (int k = lo; k <= hi; k++) {
    aux[k] = a[k];
  }

  int i = lo, j = mid + 1;
  for ( int k = lo; k <= hi; k++) {
    if (i > mid) {
      a[k] = aux[j++];
    } else if (j > hi) {
      a[k] = aux[i++];
    } else if (less(aux[j], aux[i])) {
      a[k] = aux[j++];
    } else {
      a[k] = aux[i++];
    }
  }

  assert isSorted(a, lo, hi); // Post-condition: a[lo... hi] sorted
}

To add to the above implementation:

public class Merge {
  private static void merge(...) {
    /* as before */
  }

  private static void sort(Comparable[] a, Comparable[] aux, int lo, int mid, int hi) {
    if (hi <= lo) {
      return;
    }

    int mid = lo + (hi - lo) / 2;
    sort(a, aux, lo, mid);
    sort(a, aux, mid + 1, hi);
    merge(a, aux, low, mid, hi);
  }

  public static void sort(Comparable[] a) {
    aux = new Comparable[a.length];
    sort(a, aux, 0, a.length - 1);
  }
}

The run-time estimate of mergesort is 10^8 compares/second
A supercomputer executes 10^12 compares/second
The lesson learned is that a good algorithm is better than investing in a supercomputer
Mergesort uses at most N _ log(N)* compares and about *N _ log(N) array accesses to sort any array of size N
One improvement we could make is using insertion sort as part of our sort procedure:

private static void sort(Comparable[] a, Comparable[] aux, int lo, int mid, int hi) {
  if (hi <= lo + CUTOFF - 1) {
    Insertion.sort(a, lo, hi);
    return;
  }

  int mid = lo + (hi - lo) / 2;
  sort(a, aux, lo, mid);
  sort(a, aux, mid + 1, hi);
  // Another improvement is to stop if sorted
  if (!less(a[mid + 1], a[mid])) {
    return;
  }
  merge(a, aux, low, mid, hi);
}

Bottom-up Mergesort

Although mergesort is easy to understand as recursive program, bottom-up version has no recursion:
- Pass through array, merging sub-arrays of size 1
- Repeat for sub-arrays of size 2, 4, 8, 16, ...
A Java implementation of bottom-up mergesort:

public class MergeBU {
  private static Comparable[] aux;

  private static void merge(Comparable[] a, int lo, int mid, int hi) {
    /* as before */
  }

  private static void sort(Comparable[] a) {
    int N = a.length;
    aux = new Comparable[N];
    for (int sz = 1; sz < N; sz = sz + sz) {
      for (int lo = 0; lo < N - sz; lo += sz + sz) {
        merge(a, lo, lo + sz - 1, Math.min(lo + sz + sz - 1, N - 1));
      }
    }
  }
}

Sort Complexity

Computational complexity is a framework to study efficiency of algorithms for solving a particular problem X
There are several thing we care about when studying complexity:
- Model of computation - allowable operations
- Cost model - operation counts
- Upper bound - cost guarantee provided by some algorithm for X
- Lower bound - proven limit on cost guarantee of all algorithms for X
- Optimal algorithm - algorithm with best possible cost guarantee for X
Here is an example with sorting:
- Model of computation: decision tree
- Cost model: # compares
- Upper bound: ~_N _ log(N)* (mergesort)
- Lower bound: ~_N _ log(N)*
- Optimal algorithm: mergesort
Mergesort is optimal with respect to # compares but not optimal with respect to space usage

Comparators

We previously saw how Comparables allow us to implement sorts with any type of data, with Comparators, we can help sort using some alternate order or many different orders on the same data
Here is an insertion sort implementation using Comparators:

private static void sort(Object[] a, Comparator comparator) {
  int N = a.length;
  for (int i = 0; i < N; i++) {
    for (int j = i; j > 0 && less(comparator, a[j], a[j - 1]); j--) {
      exch(a, j, j - 1);
    }
  }
}

private static boolean less(Comparator c, Object v, Object w) {
  return c.compare(v, w) < 0;
}

private static void exch(Object[] a; int i; int j) {
  Object swap = a[i];
  a[i] = a[j];
  a[j] = swap;
}

In general, to implement a comparator:
- Define a (nested) class that implements the Comparator interface
- Implement the compare() method

Stability

Stability refers to how well an algorithm is able to preserve the relative order of items with equal keys
Which one of our previously covered sorts are stable?
- Insertion and merge sort are stable but not selection or shellsort
Why are insertion sort and merge sort stable?
- Equal items never move past each other
Why are selection sort and shellsort not stable?
- There may be long-distance exchanges which might move an item past some equal item

Week 3: Quicksort

We introduce and implement the randomized quicksort algorithm and analyze its performance. We also consider randomized quick-select, a quicksort variant which finds the kth smallest item in linear time. Finally, consider 3-way quicksort, a variant of quicksort that works especially well in the presence of duplicate keys.

Quicksort

Quicksort is a recursive method which works in several steps:
- Shuffle the array
- Partition so that, for some j:
  - Entry a[j] is in place
  - No larger entry to the left of j
  - No smaller entry to the right of j
- Sort each piece recursively
The code for partitioning is as follows:

private static int partition(Comparable[] a, int lo, int hi) {
  int i = lo, j = hi + 1;
  while (true) {
    // Find item on left to swap
    while (less(a[++i], a[lo]) {
      if (i = hi) {
        break;
      }
    })
    // Find item on right to swap
    while (less(a[lo], a[--j]) {
      if (j = lo) {
        break;
      }
    })
    // Check if pointers cross
    if (i >= j) {
      break;
      exch(a, i, j);
    }
  }
  // Swap with partitioning item
  exch(a, lo, j);
  // Return index of item now known to be in place
  return j;
}

Remember that quicksort itself is a recursive program:

public class Quick {
  private static int partition(Comparable[] a, int lo, int hi) {
    /* as before */
  }

  public static void sort(Comparable[] a) {
    StdRandom.shuffle(a); // Needed for performance guarantee
    sort(a, 0, a.length - 1);
  }

  public static void sort(Comparable[] a, int lo, int hi) {
    if (hi <= lo) {
      return;
    }
    int j = partition(a, lo, hi);
    sort(a, lo, j - 1);
    sort(a, j + 1, hi);
  }
}

Shuffling is needed for performance guarantee, this preserves randomness
Best case number of compares for quicksort is ~_N _ log(N)*
Worst case number of compares for quicksort is ~_(1/2) _ N^2*
The average number of compares for quicksort is ~2N _ log(N)_
On average, quicksort does about 40% more compares than mergesort but faster than mergesort due to less data movement
Quicksort is an in-place sorting algorithm
Quicksort is not stable
There is a way to make quicksort faster by using insertion sort as a step for small arrays:

public static void sort(Comparable[] a, int lo, int hi) {
  if (hi <= lo + CUTOFF - 1) {
    Insertion.sort(a, lo, hi);
    return;
  }
  int j = partition(a, lo, hi);
  sort(a, lo, j - 1);
  sort(a, j + 1, hi);
}

Another improvement is to take median of sample (not worth for large samples):

public static void sort(Comparable[] a, int lo, int hi) {
  if (hi <= lo) {
    return;
  }

  int m = medianOfThree(a, lo, lo + (hi - lo) / 2, hi);
  swap(a, lo, m);

  int j = partition(a, lo, hi);
  sort(a, lo, j - 1);
  sort(a, j + 1, hi);
}

Selection

Similar to quicksort we want a method to quickly select a particular item in a set
The idea for selection is to partition an array so that:
- Entry* a[j]* is in place
- No larger entry to the left of j
- No smaller entry to the right of j
Repeat in one sub-array, depending on j
Finished when j equals k
An implementation of quick-select is as follows:

public static Comparable select(Comparable[] a, int k) {
  StdRandom.shuffle(a);
  int lo = 0, hi = a.length - 1;
  while (hi < lo) {
    int j = partition(a, lo, hi);
    if (j < k) {
      lo = j + 1;
    } else if (j > k) {
      hi = j - 1;
    } else {
      return a[k];
    }
    return a[k];
  }
}

Quick-select takes linear-time on average

Duplicate Keys

Mergesort with duplicate keys always took between (1/2) _ N _ log(N) and _N _ log(N)* compares
Quicksort with duplicate keys goes quadratic unless partitioning stops on equal keys
We want to partition array into three parts such that:
- Entries between a and b are equal to partition item c
- No larger entries to left to a (lower bound)
- No smaller entries to right of b (upper bound)
Here's how we would implement three-way quicksort:

private static void sort(Comparable[] a, int lo, int hi) {
  if (hi <= lo) {
    return;
  }

  int lt = lo, gt = hi;
  Comparable v = a[lo];
  int i = lo;

  while (i <= gt) {
    int cmp = a[i].compareTo(v);

    if (cmp < 0) {
      exch(a, lt++, i++);
    } else if (cmp > 0) {
      exch(a, i, gt--);
    } else {
      i++;
    }

    sort(a, lo, lt - 1);
    sort(a, gt + 1, hi);
  }
}

Randomized quicksort with three-way partitioning reduces running time from linearithmic to linear in broad class of applications

System Sorts

Java has a sort() method which:
- Has a different method for each primitive type
- Has a method for data types that implement Comparable
- Has a method that uses a Comparator
- Uses tuned quicksort for primitive types; tuned mergesort for objects
A sort function widely used in C, C++ and Java has the following:
- Cutoff to insertion sort for small sub-arrays
- Partitioning scheme: three-way partitioning (Dijkstra's)
- Partitioning item:
  - Small arrays: middle entry
  - Medium arrays: median of three
  - Large arrays: Tukey's ninther
Tukey's ninther - median of three samples, each three entries:
- Approximates the median of nine
- Uses at most twelve compares

Week 4: Priority Queues

We introduce the priority queue data type and an efficient implementation using the binary heap data structure. This implementation also leads to an efficient sorting algorithm known as heapsort. We conclude with an applications of priority queues where we simulate the motion of N particles subject to the laws of elastic collision.

APIs and elementary implementations

Let's compare inserts and deletes:
- Stack - remove the item most recently added
- Queue - remove the item least recently added
- Randomized queue - remove a random item
- Priority queue - remove the largest or smallest item
A requirement for priority queues - generic types are Comparable
Let's look at a common scenario where a priority queue would be implemented:
- Find the largest M items in a stream of N items (e.g. fraud detection, file maintenance)
- Constraint: not enough memory to store N items
Here's how this would be implemented in Java:

// Use a min-oriented pq
// Transaction data type is Comparable (ordered by $)
MinPQ<Transaction> pq = new MinPQ<Transaction>();

while (Std.hasNextLine()) {
  String line = StdIn.readLine();
  Transaction item = new Transaction(line);
  pq.insert(item);
  // pq contains largest *M* items
  if (pq.size() > M) {
    pq.delMin();
  }
}

For unordered arrays, priority queue would look something like this in Java:

public class UnorderedMaxPQ<Key extends Comparable<key>> {
  private Key[] pq; // pq[i] = ith element on pq
  private int N; // Number of elements on pq

  public UnorderedMaxPQ(int capacity) {
    pq = (Key[]) new Comparable[capacity]; // No generic array creation
  }

  public boolean isEmpty() {
    return N == 0;
  }

  public void insert(Key x) {
    pq[N++] = x;
  }

  public Key delMax() {
    int max = 0;
    for (int i = 1; i < N; i++) {
      if (less(max, i)) {
        max = i;
      }
      exch(max, n - 1);
      return pq[--N]; // Null out entry to prevent loitering
    }
  }
}

Binary Heaps

A binary heap is based on the binary tree (empty or node with lines to left and right binary trees)
Here are some characteristics of trees:
- A complete tree is perfectly balanced except for bottom level
- Height of complete tree with N nodes is log(n)
- Height only increases when N is a power of two
Binary heaps are an array representation of a heap-ordered complete binary tree, here's what we mean by that:
- Heap-ordered binary tree:
  - Keys in nodes
  - Parent's key no smaller than children's keys
- Array representation:
  - Indices start at one
  - Take nodes in level order
  - No explicit links needed
The largest key in a binary heap is a[1] which is the root of the binary tree
Here's an implementation of binary heap:

public class MaxPQ<Key extends Comparable<Key>> {
  private Key[] pq;
  private int N;

  public MaxPQ(int capacity) {
    pq = (Key[]) new Comparable[capacity + 1];
  }

  public boolean isEmpty() {
    return N == 0;
  }

  public void insert(Key key) {
    pq[++N] = x;
    swim(N);
  }

  public Key delMax() {
    Key max = pq[1];
    exch(1, N--);
    sink(1);
    pq[N + 1] = null;

    return max;
  }

  private void swim(int k) {
    while (k > 1 && less(k / 2, k)) {
      exch(k , k / 2);
      k = k / 2;
    }
  }

  private void sink(int k) {
    while (2 * k <= N) {
      int j = 2 * k;
      if (j < N && less(j, j + 1)) {
        j++;
      }
      if (!less(k, j)) {
        break;
      }

      exch(k, j);
      k = j;
    }
  }

  private boolean less(int i, int j) {
    return pq[i].compareTo(pq[j]) < 0;
  }

  private void exch(int i, int j) {
    Key t = pq[i];
    pq[i] = pq[j];
    pq[j] = t;
  }
}

Heap Sort

Here's the basic idea of heap sort:
- Create max-heap with all N keys
- Repeatedly remove the maximum key
We want to build max heap using bottom-up method (start from the bottom of the tree)
Then we want to repeatedly delete the largest remaining item
Here is what the implementation looks like:

public class Heap {
  public static void sort(Comparable[] pq) {
    int N = pq.length;
    for (int k = N / 2; k >= 1; k--) {
      sink(pq, k, N);
    }
    while (N > 1) {
      exch(pq, 1, N);
      sink(pq, 1, --N);
    }
  }

  private static void sink(Comparable[] pq, int k, int N) {
    /* as before */
  }

  private static boolean less(Comparable[] pq, int i, int j) {
    /* as before */
  }

  private static void exch(Comparable[] pq, int i, int j) {
    /* as before */
  }
}

Heap construction uses less than or equal to 2 N compares and exchanges
Heap sort uses less than or equal to 2 * _(N _ log(n))* compares and exchanges
Heap sort has an in-place sorting algorithm with _N _ log(n)* worse-case
The bottom line is that heap sort is optimal for both time and space but:
- Inner loop is longer than quicksort
- Makes poor use of cache memory
- Not stable

Week 4: Elementary Symbol Tables

We define an API for symbol tables (also known as associative arrays) and describe two elementary implementations using a sorted array (binary search) and an unordered list (sequential search). When the keys are Comparable, we define an extended API that includes the additional methods min, max floor, ceiling, rank, and select. To develop an efficient implementation of this API, we study the binary search tree data structure and analyze its performance.

Symbol Table API

Symbol tables are key-value pair abstractions:
- Insert a value with a specified key
- Given a key, search for the corresponding value
A good way to set up a basic symbol table API is to implement an associative array abstraction (associate one value with each key)
Some conventions of this implementation are:
- Values are not null
- Method get() returns null if key not present
- Method put() overwrites old value with new value
For keys and values:
- Assume keys are Comparable, use compareTo()
- Assume keys are any generic type, use equals() to test equality
- Assume keys are any generic type, use equals() to test equality and then hashCode() to scramble key
Best practices:
- Best to use immutable types for symbol table keys
- Immutable types in Java include: String, Integer, Double, java.io.File, ...
- Mutable in Java: StringBuilder, java.net.URL, arrays, ...
Lastly, all java classes inherit a method equals()

Elementary Implementations

One elementary implementation is sequential search in a linked list:
- Data structure - maintain an unordered linked list of key-value pairs
- Search - scan through all keys until match is found
- Insert - scan through all keys until match is found, and add to front if there is no match
Another implementation is binary search with an ordered array:
- Data structure - maintain an ordered array of key-value pairs
- Rank helper function - how many keys are less than k
Here is what the binary search implementation looks like in Java:

public Value get(Key key) {
  if (isEmpty()) {
    return null;
  }
  int i = rank(key);
  if (i < N && keys[i].compareTo(key) == 0) {
    return vals[i];
  } else {
    return null;
  }
}

private int rank(Key key) {
  int lo = 0, hi = N - 1;
  while (lo <= hi) {
    int mid = lo + (hi - lo) / 2;
    int cmp = key.compareTo(key[mid]);
    if (cmp < 0) {
      hi = mid - 1;
    } else if (cmp > 0) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return lo;
}

Ordered Operations

When keys are comparable we can put them in order, we saw that we can use binary search to get an efficient symbol table implementation
Below is a summary of sequential search versus binary search for the following operations:
- Search
- Insert/delete
- Min/max
- Floor/ceiling
- Rank
- Select
- Ordered iteration

operations	sequential search	binary search
search	N	log(N)
insert/delete	N	N
min/max	N	1
floor/ceiling	N	log(N)
rank	N	log(N)
select	N	log(N)
ordered iteration	_N _ log(N)*	N

Binary Search Tree

A BST (Binary Search Tree) is a binary tree in symmetric order
A binary tree tree is either:
- Empty
- Two disjoint binary trees (left and right)
BSTs have symmetric order - each node has a key and every node's key is:
- Larger than all keys in its left subtree
- Smaller than all keys in its right subtree
BSTs is a reference to a root Node
A Node is comprised of four fields:
- A Key and a Value
- A reference to the left and right subtree
Here is what the Node class looks like in Java:

private class Node {
  private Key key;
  private Value val;
  private Node left, right;
  public Node(Key key, Value val) {
    this.key = key;
    this.val = val;
  }
}

BST search in Java is:

public Value get(Key key) {
  Node x = root;
  while (x != null) {
    int cmp = key.compareTo(x.key);
    if (cmp < 0) {
      x = x.left;
    } else if (cmp > 0) {
      x = x.right;
    } else {
      return x.val;
    }
  }
  return null;
}

BST insert implemented in Java is:

public void put(Key key, Value val) {
  root = put(root, key, val);
}
// This implementation of put() is recursive
private Node put(Node x, Key key, Value val) {
  if (x == null) {
    return new Node(key, val);
  }
  int cmp = key.compareTo(x.key);
  if (cmp < 0) {
    x.left = put(x.left, key, val);
  } else if (cmp > 0) {
    x.right = put(x.right, key, val);
  } else {
    x.val = val;
  }
  return x;
}

Ordered operations in BSTs

There are many different ordered operations when it comes to BSTs
How do we find the minimum or maximum in a BST?
- Minimum - move left from the root
- Maximum - move right from the root
How do we determine the floor and ceiling?
- Floor - largest key less than or equal to a given key
- Ceiling - smallest key greater than or equal to a given key
How do we determine rank (How many keys less than k)?
- Implement recursive algorithm (3 cases)
- Need to implement subtree count as part of the Node class for this to work
How to implement in-order traversal?
- Traverse left subtree
- Enqueue key
- Traverse right subtree
For all ordered operations, BSTs have a run-time efficiency proportional to the height of the tree except for ordered iteration
The height of the BST is also proportional to log(N) if keys are inserted in random order

Deletion in BSTs

To remove a node with a given key (lazy approach):
- Set its value to null
- Leave key in tree to guide searches (but don't consider it equal in search)
- The problem with this lazy approach is that memory will be overload
How to delete the minimum (also applicable to maximum):
- Go left until finding a node with a null left link
- Replace that node by its right link
- Update subtree counts
Another way to implement min/max deletion is using Hibbard deletion:
- To delete a node with key k, search for node t containing key k
The main problem with Hibbert deletion is that it unbalances the tree leading to sqrt(N) height

Week 5: Balanced Search Trees

In this lecture, our goal is to develop a symbol table with guaranteed logarithmic performance for search and insert (and many other operations). We begin with 2-3 trees, which are easy to analyze but hard to implement. Next, we consider red-black binary search trees, which we view as a novel way to implement 2-3 trees as binary search trees. Finally, we introduce B-trees, a generalization of 2-3 trees that are widely used to implement file systems.

2-3 Search Trees

In a two-three search tree we allow one or two keys per node:
- Two-node: one key, two children
- Three-node, two keys, three children
A two-three tree has perfect balance - every path from root to null has same length
In-order traversal yields keys in ascending order
Splitting a four-node is a local transformation: constant number of operations
Each transformation maintains symmetric order and perfect balance
In terms of tree height:
- Worst case performance for all two-nodes is log(N)
- Best case performance for all three-nodes is about _0.631 _ log(N)*
- A two-three tree is guaranteed logarithmic performance for search and insert

Red-black BSTs

In red-black BSTs, we:
- Represent a two-three tree as a BST
- Use "internal" left-leaning links as "glue" for three-nodes
Another way of looking at red-black trees is that they are BSTs where:
- No node has two red links connected to it
- Every path from root to null link has the same number of black links
- Red links lean left
Search is the same for elementary BSTs as for red-black BSTs (however red-black BSTs run faster due to better balance)
The basic operations for red-black BST are:
- Rotate left
- Rotate right
- Swap colors
A red-black tree is represented as the following:

private static final boolean RED = true;
private static final boolean BLACK = false;

private class Node {
  Key key;
  Value val;
  Node left, right;
  Boolean color; // Color of parent link
}

private boolean isRed(Node x) {
  // Null links are black
  if (x == null) {
    return false;
  }
  return x.color == RED;
}

To rotate a right-leaning red link left:

private Node rotateLeft(Node h) {
  Node x = h.right;
  h.right = x.left;
  x.left = h;
  x.color = h.color;
  h.color = RED;
  return x;
}

To rotate a left-leaning red link right:

private Node rotateRight(Node h) {
  Node x = h.left;
  h.left = x.right;
  x.right = h;
  x.color = h.color;
  h.color = RED;
  return x;
}

If we want to swap colors we would implement:

private void flipColors(Node h) {
  h.color = RED;
  h.left.color = BLACK;
  h.right.color = BLACK;
}

For left-leaning red-black trees, insertion is implemented as:

private Node put(Node h, Key key, Value val) {
  // Insert at bottom and color red
  if (h == null) {
    return new Node(key, val, RED);
  }

  int cmp = key.compareTo(h.key);

  if (cmp < 0) {
    h.left = put(h.left, key, val);
  } else if (cmp > 0) {
    h.right = put(h.right, key, val);
  } else {
    h.val = val;
  }

  // Lean left
  if (isRed(h.right) && !isRed(h.left)) {
    h = rotateLeft(h);
  }
  // Balance four-node
  if (isRed(h.left) && isRed(h.left.left)) {
    h = rotateRight(h);
  }
  // Split four-node
  if (isRed(h.left) && isRed(h.right)) {
    flipColors(h);
  }

  return h;
}

Height of tree is less than or equal to _2 _ log(N)* in the worst case
Every path from root to null link has same number of black links
Never two red links in-a-row

B-Trees

B-trees are a generalization of balance trees which allow for data access using a minimum number of probes
B-trees generalize two-three trees by allowing up to M - 1 key-link pairs per node
- At least 2 key-link pairs at root
- At least M / 2 key-link pairs in other nodes
- External nodes contain client keys
- Internal nodes contain copies of keys to guide search
To search in a B-tree:
- Start at root
- Find interval for search key and take corresponding link
- Search terminates in external node
To insert in a B-tree
- Search for new key
- Insert at bottom
- Split nodes with M key-link pairs on the way up the tree
A search or insertion in a B-tree of order M with N keys requires between log_{M-1}(N) and log_{M/2}(N) probes
All internal nodes (besides root) have between M / 2 and M - 1 links
In practice number of probes is at most four
To optimize a B-tree, always keep root page in memory

Week 5: Geometric Applications of BSTs

We start with 1D and 2D range searching, where the goal is to find all points in a given 1D or 2D interval. To accomplish this, we consider kd-trees, a natural generalization of BSTs when the keys are points in the plane (or higher dimensions). We also consider intersection problems, where the goal is to find all intersections among a set of line segments or rectangles.

1D Range Search

We will look at intersections among geometric objects where BSTs are the basis for solving these problems efficiently
A 1D range search is an extension of ordered symbol table, here are some operations:
- Insert key-value pair
- Search for key k
- Delete key k
- Range search: find all keys between k1 and k2
- Range count: number of keys between k1 and k2
An application for 1D range search is using it for database queries
A geometric interpretation of 1D range search is:
- Keys are a point on a line
- Find/count points in a given 1D interval

Line Segment Intersection

Generally given N horizontal and vertical line segments, find all intersections
A brute force method for counting each intersection will produce a quadratic algorithm (not practical or scalable)
A better way to find the intersections between two pairs of lines is to implement the sweep-line algorithm:
- Sweep a vertical line from left to right
- x-coordinates define events
- h-segment (left endpoint): insert y-coordinate into BST
- h-segment (right endpoint): remove y-coordinate into BST
- v-segment: range search for interval of y-endpoints
The sweep-line algorithm takes time proportional to _N _ log(N + R)* to fin all R intersections among *N* orthogonal line segments
In general the run-time of each operation for sweep-line is as follows:
- Put x-coordinates on a PQ (or sort): _N _ log(N)*
- Insert y-coordinates into BST: _N _ log(N)*
- Delete y-coordinates from BST: _N _ log(N)*
- Range searches in BST: _N _ log(N + R)*
In summary, using the sweep-line algorithm reduces 2D orthogonal line segment intersection search to 1D range search

KD-trees

There exist an extension of the 1D range search known as 2D orthogonal range search:
- Insert a 2D key
- Delete a 2D key
- Search for a 2D key
- Range search: find all keys that lie in a 2D range
- Range count: number of keys that lie in a 2D range
The geometric interpretation of the 2D orthogonal range search includes:
- Keys which are a point in the plane
- Operations to find/count the points in a given h-v rectangle (axis-aligned)
In grids, we want to use a _sqrt(N) _ sqrt(N)* size square
However, the main challenge with the 2D orthogonal range search is the problem of clustering:
- Lists are too long, even though average length is short
- Need data structure that adapts gracefully to data
How to address clustering? Use 2D-tree (binary search tree) to find all points in a query axis-aligned rectangle:
- Check if point in node lies in given rectangle
- Recursively search left/bottom (if any could fall in rectangle)
- Recursively search right/top (if any could fall in rectangle)
Typical run-time for 2D-tree is R + log(N)
Worst case for a balanced tree is R + sqrt(N)
Similarly, to find the nearest neighbor in a 2D-tree:
- Check distance from point in node to query point
- Recursively search left/bottom (if any could fall in rectangle)
- Recursively search right/top (if any could fall in rectangle)
- Organize method so that it begins by searching for query point
Typical run-time for 2D-tree when finding nearest neighbor is log(N)
Worst case for a balanced tree when finding nearest neighbor is N
So what is a KD-tree? It's simply an extension of what we have already seen before:
- We want to recursively partition k-dimensional space into two half-spaces
- To implement a KD-tree we use a BST but cycle through dimensions with 2D trees

Interval Search

1D interval search is a data structure to hold set of overlapping intervals
Some operations include:
- Insert an interval (lo, hi)
- Search for an interval (lo, hi)
- Delete an interval (lo, hi)
- Interval intersection query: given an interval (lo, hi), find all intervals (or one interval) in data structure that intersects (lo, hi)
For interval search trees we want to:
- Create BST where each node stores an interval (lo, hi)
- Use the left endpoint as BST key
- Store max endpoint in subtree rooted at node
To get performance guarantee we want use a red-black BST

Rectangle Intersection

With KD-trees we can find all intersections among a set of N orthogonal rectangles
The rectangle intersection problem became a very important problem when building computers since we needed a way to figure out how to draw on a computer, particularly microprocessor design became a geometric problem
Since processing power doubles every 18 months (Moore's Law), we need to use linearithmic algorithm to sustain Moore's Law
To solve the orthogonal rectangle search we implement the sweep-line algorithm but slightly modified for rectangles (interval search instead of range search):
- x-coordinates of left and right endpoints define events
- Maintain set of rectangles that intersect the sweep line in an interval search tree (using y-intervals of rectangle)
- Left endpoint: interval search for y-interval of rectangle; insert y-interval
- Right endpoint: remove y-interval
In general the run-time of each operation for sweep-line for orthogonal rectangles is as follows:
- Put x-coordinates on a PQ (or sort): _N _ log(N)*
- Insert y-coordinates into ST: _N _ log(N)*
- Delete y-coordinates from ST: _N _ log(N)*
- Interval searches for y-intervals: N _ log(N) + R _ log(N)

Week 6: Hash Tables

We begin by describing the desirable properties of hash function and how to implement them in Java, including a fundamental tenet known as the uniform hashing assumption that underlies the potential success of a hashing application. Then, we consider two strategies for implementing hash tables—separate chaining and linear probing. Both strategies yield constant-time performance for search and insert under the uniform hashing assumption. We conclude with applications of symbol tables including sets, dictionary clients, indexing clients, and sparse vectors.

Hash Tables

Another approach to implementing symbol tables that can also be very effective in a practical application is hashing
Hashing involves using a hash function - a method for computing array index from a key
When using a hash function we want to scramble the keys uniformly to produce a table index:
- Efficiently computable
- Each table index equally likely for each key
Some examples of bad hashing:
- Using the first three digits of a phone number
- Using the first three digits of SSN
Some examples of good hashing:
- Using the last digits of a phone number or SSN
We also need a different approach for each key type
All Java classes inherit a method hashCode() which returns a 32-bit int
In general if x.equals(y), then (x.hashCode() == y.hashCode())
It is highly desirable to have !x.equals(y), then (x.hashCode() != y.hashCode())
The default implementation for hashCode() uses the memory address of x
Java implementation is different for Integer, Boolean, Double, etc. below is an implementation of hashCode() for the Integer type:

public final class Integer {
  private final int value {
    // ...
    public int hashCode() {
      return value;
    }
  }
}

The standard recipe for hash code design is as follows:
- Combine each significant field using the _31 _ x + y* rule
- If field is a primitive type, use wrapper type hashCode()
- If field is null, return 0
- If field is a reference type, use hashCode()
- If field is an array, apply to each entry

Separate Chaining

Separate chaining is a collision red solution strategy that makes use of elementary linked list
During a collision, two distinct keys hashing to same index:
- Birthday problem - can't avoid collisions unless you have a ridiculous (quadratic) amount of memory
- Coupon collector + load balancing - collisions will be evenly distributed
How do we deal with collisions efficiently?
- Use an array of M < N linked lists:
  - Hash: map key to integer i between 0 and M - 1
  - Insert: put at front of ith chain (if not already there)
  - Search: need to search only ith chain
Separate chaining implementation in Java is as follows:

public class SeparateChainHashST<Key, Value> {
  // NOTE: Array doubling and halving code omitted
  private int M = 97; // Number of chains
  private Node[] st = new Node[M]; // Array of chains

  private static class Node {
    private Object key; // No generic array creation
    private Object val; // Declare key and value of type Object
    private Node next;
    // ...
  }

  public Value get(Key key) {
    int i = hash(key);
    for (Node x = st[i]; x != null; x = x.next) {
      if (key.equals(x.key)) {
        return (Value) x.val;
      }
    }
    return null;
  }
}

For insertion the code becomes:

public class SeparateChainHashST<Key, Value> {
  // NOTE: Array doubling and halving code omitted
  private int M = 97; // Number of chains
  private Node[] st = new Node[M]; // Array of chains

  private static class Node {
    private Object key; // No generic array creation
    private Object val; // Declare key and value of type Object
    private Node next;
    // ...
  }

  public void put(Key key, Value val) {
    int i = hash(key);
    for (Node x = st[i]; x != null; x = x.next) {
      if (key.equals(x.key)) {
        x.val = val;
        return;
      }
    }
    st[i] = new Node(key, val, st[i]);
  }
}

The number of probes (equals() and hashCode()) for search/insert is proportional to N / M:
- M too large -> too many empty chains
- M too small -> chains too long
- Typical choice: M ~ N / 5 -> constant-time operation

Linear Probing

Another closure resolution method is known as linear probing
Linear probing is also known as open addressing - when a new key collides, find next empty slot, and put it there
The methods for linear probing for hash tables is as follows:
- Hash - map key to integer i between 0 and M - 1
- Insert - put a table at index i if free; if not try i + 1, i + 2, etc.
- Search - search table index i; if occupied but no match, try i + 1, i + 2, etc.
- Note: array size M must be greater than number of key-value pairs N
The Java implementation for linear probing:

public class LinearProbingHashST<Key, Value> {
  // NOTE: Array doubling and halving code omitted
  private int M = 30001;
  private Value[] vals = (Value[]) new Object[M];
  private Key[] keys = (Key[]) new Object[M];

  private int hash(Key key) {
    /* as before */
  }

  private void put (Key key, Value val) {
    int i;
    for (i = hash(key); keys[i] != null; i = (i + 1) % M) {
      if (keys[i].equals(key)) {
        break;
      }
    }
    keys[i] = key;
    vals[i] = val;
  }

  private Value get(Key key) {
    for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
      if (key.equals(keys[i])) {
        return vals[i];
      }
    }
    return null;
  }
}

Clustering became a problem:
- Cluster - a contiguous block of items
- Observation - new keys likely to hash into middle of big clusters
Under uniform hashing assumption, the average of number of probes in a linear probing hash table of size M that contains _N = a _ M* keys is:
- Search hit ~ _(1 / 2) _ (1 + (1 / (1 - a)))*
- Search miss/insert ~ _(1 / 2) _ (1 + (1 / (1 - a)^2))*
In summary, for linear probing:
- M too large -> too many empty array entries
- M too small -> search time blows up
- Typical choice: a ~ N / M ~ 1/2 -> constant-time operation

Hash Table Context

For String hashCode() in Java 1.1:
- Only examine 8-9 evenly spaced characters for long strings
- The benefit is saving time in performing arithmetic
An implementation of hashCode() in Java:

public int hashCode() {
  int hash = 0;
  int skip = Math.max(1, length() / 8);
  for (int i = 0; i < length(); i += skip) {
    hash = s[i] + (37 * hash);
  }
  return hash;
}

However, there was great potential for bad collision patterns
Separate chaining versus linear probing:
- Separate chaining:
  - Easier to implement delete
  - Performance degrades gracefully
  - Clustering less sensitive to poorly-designed hash function
- Linear probing:
  - Less wasted space
  - Better cache performance
Hash tables versus balanced search trees:
- Hash tables:
  - Simpler to code
  - No effective alternative for unordered keys
  - Faster for simple keys (a few arithmetic operations versus log(N) compares)
  - Better system support in Java for strings (e.g. cache hash code)
- Balanced search trees:
  - Stronger performance guarantee
  - Support for ordered ST (symbol table) operations
  - Easier to implement compareTo() correctly than equals() and hashCode()

Week 6: Symbol Table Applications

This section will cover topics (optional) on symbol table applications including sets, dictionary clients, indexing clients, and sparse vectors

Sets

Mathematically, sets are a collection of distinct keys
To implement a set:
- Read in a list of words from one file
- Print out all words from standard input that are { in, not in } the list
An example of such implementation in the real world is a whitelist:

public class WhiteList {
  public static void main(String[] args) {
    // Create empty set of strings
    SET<String> set = new SET<String>();

    // Read in whitelist
    In in = new In(args[0]);
    while (!in.isEmpty()) {
      set.add(in.readString());
    }

    // Print words in list
    while (!StdIn.isEmpty()) {
      String word = StdIn.readString();
      if (set.contains(word)) {
        StdOut.println(word);
      }
    }
  }
}

Conversely, there is also a blacklist implementation:

public class BlackList {
  public static void main(String[] args) {
    // Create empty set of strings
    SET<String> set = new SET<String>();

    // Read in blacklist
    In in = new In(args[0]);
    while (!in.isEmpty()) {
      set.add(in.readString());
    }

    // Print words not in list
    while (!StdIn.isEmpty()) {
      String word = StdIn.readString();
      if (!set.contains(word)) {
        StdOut.println(word);
      }
    }
  }
}

Dictionary Clients

Dictionary clients refer to the dictionary lookup problem:
- A comma-separated value (CSV) file
- Key field
- Value field
Example of such a problem in practice include DNS lookups, amino acids, and class lists
Below is an example of a dictionary lookup in a CSV file:

public class LookupCSV {
  public static void main(String[] args) {
    // Process input file
    In in = new In(args[0]);
    int keyField = Integer.parseInt(args[1]);
    int valField = Integer.parseInt(args[2]);

    // Build symbol table
    ST<String, String> st = new ST<String, String>();
    while (!in.isEmpty()) {
      String line = in.readLine();
      String[] tokens = line.split(",");
      String key = tokens[keyField];
      String val = tokens[valField];
      st.put(key, val);
    }

    // Process lookups with standard I/O
    while (!StdIn.isEmpty()) {
      String s = StdIn.readString();
      if (!st.contains(s)) {
        StdOut.println("Not found");
      } else {
        StdOut.println(st.get(s));
      }
    }
  }
}

Indexing Clients

Indexing clients refer to the file indexing problem:
- Our goal is to index a PC or the web
- Given a list of files specified, create an index so that you can efficiently find all files containing a given query string
- Solution: key = query string, value = set of files containing that string
Below is an example of the file indexing implementation:

import java.io.file;
public class FileIndex {
  public static void main(String[] args) {
    // Symbol table
    ST<String, SET<File>> st = new ST<String, SET<File>>();

    // List of file names from command line
    for (String filename : args) {
      File file = new File(filename);
      In in = new In(file);
      // For each word in file, add file to corresponding set
      while (!in.isEmpty()) {
        String key = in.readString();
        if (!st.contains(key)) {
          st.put(key, new SET<File>());
        }
        SET<File> set = st.get(key);
        set.add(file);
      }
    }

    // Process queries
    while (!StdIn.isEmpty()) {
      String query = StdIn.readString();
      StdOut.println(st.get(query));
    }
  }
}

Sparse Vectors

For nested loops the running time is N^2, how do we address this issue for matrices and matrix multiplication?
- Use symbol tables!
The standard representation of a vector is done using a 1D array:
- Constant time access to elements
- Space proportional to N
The symbol table representation for a vector could be as follows:
- Key = index, value = entry
- Efficient iterator
- Space proportional to number of non-zeroes
The implementation in java is as follows:

public class SparseVector {
  // HashST because order not important
  private HashST<Integer, Double> v;

  // Empty ST represents all zeros vector
  public SparseVector() {
    v = new HashST<Integer, Double>();
  }

  // a[i] = value
  public void put (int i, double x) {
    v.put(i, x);
  }

  // Return a[i]
  public double get(int i) {
    if (!v.contains(i)) {
      return 0.0;
    } else {
      return v.get(i);
    }
  }

  public Iterable<Integer> indices() {
    return v.keys();
  }

  // Dot product is constant time for sparse vectors
  public double dot(double[] that) {
    double sum = 0.0;
    for (int i : indices()) {
      sum += that[i] * this.get(i);
    }
    return sum;
  }
}

For 2D arrays, the standard matrix representation is:
- Each row of matrix is an array
- Constant time access to elements
- Space proportional to N^2
The sparse matrix representation is as follows:
- Each row of matrix is a sparse vector
- Efficient access to elements
- Space proportional to number of non-zeroes plus N
The code for 1D sparse vector just needs the following:

// ...
SparseVector[] a = new SparseVector[N];
double[] x = new double[N];
double[] b = new double[N];
// ...
// Initialize a[] and x[]
// ...
// Linear running time for sparse matrix
for (int i = 0; i < N; i++) {
  b[i] = a[i].dot(x);
}

stevenxchung / algorithms-part-i Goto Github PK

algorithms-part-i's Introduction