6.006 data structures

Binary search tree

Balanced: a tree on n nodes is balanced if its height is O(log n). A plain BST is not always balanced; in the worst case its height is Θ(n).

  • AVL tree: height-balanced tree

    • the skew of a node is the height of its right subtree minus the height of its left subtree.
    • in an AVL tree every node has skew -1, 0, or 1 (see the sketch below).
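
A minimal sketch of these definitions, assuming a bare node class with left/right pointers (the names Node, height, skew, and is_avl are illustrative, not from the notes; a real AVL tree stores heights at the nodes instead of recomputing them):

```python
class Node:
    def __init__(self, item, left=None, right=None):
        self.item = item
        self.left = left
        self.right = right

def height(node):
    # Height of the empty tree is -1, so a single leaf has height 0.
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))

def skew(node):
    # Skew = height of right subtree minus height of left subtree.
    return height(node.right) - height(node.left)

def is_avl(node):
    # Height-balanced (AVL): every node has skew in {-1, 0, 1}.
    if node is None:
        return True
    return skew(node) in (-1, 0, 1) and is_avl(node.left) and is_avl(node.right)
```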

sort

Selection sort, insertion sort, heap sort, merge sort, quick sort

  • stability: when the list contains items with equal keys, does sorting preserve their relative order? (The in-place property, by contrast, means the sort uses only O(1) extra space.)

    • Whether merge sort is stable depends on how an implementation breaks ties when merging (see the sketch below)
      • Stable: items with duplicate keys appear in the same order in the output as in the input
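
A minimal merge sort sketch showing the tie-breaking rule: the `<=` comparison (prefer the left run on equal keys) is what makes it stable, whereas taking from the right run on ties would not be. The function name and `key` parameter are illustrative:

```python
def merge_sort(A, key=lambda x: x):
    # Stable merge sort: items with equal keys keep their input order.
    if len(A) <= 1:
        return list(A)
    mid = len(A) // 2
    left, right = merge_sort(A[:mid], key), merge_sort(A[mid:], key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(left[i]) <= key(right[j]):   # "<=" takes from the left run on ties
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

# Equal keys (2) keep their original order, 'a' before 'c':
# merge_sort([(2, 'a'), (1, 'b'), (2, 'c')], key=lambda p: p[0])
# -> [(1, 'b'), (2, 'a'), (2, 'c')]
```
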
  • Comparison model

    • worst-case analysis

    The worst-case number of comparisons made by any comparison-based search algorithm is the height of that algorithm’s decision tree. Comparisons are limiting because each operation can branch into at most a constant number of outcomes, so the decision tree has constant branching factor; since it must distinguish at least n + 1 search outcomes, its height is Ω(log n). This bound can be beaten by data structures that are not limited to comparisons.
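
Spelling out the counting argument (b is the constant branching factor; a search must distinguish at least n + 1 outcomes, one per stored item plus "key not present"):

$$
\#\text{leaves} \le b^{h}, \qquad \#\text{leaves} \ge n + 1
\;\Longrightarrow\; h \ge \log_b(n+1) = \Omega(\log n).
$$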

  • comparison sorting

    • Analysis to lower bound the worst-case running time of any sorting algorithm that only uses comparisons

    To sort n items, the algorithm must be able to produce every one of the n! permutations of the items, so its decision tree has at least n! leaves and height Ω(log(n!)) = Ω(n log n). Hence any comparison sort takes Ω(n log n) time in the worst case.
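
The step from log(n!) to n log n, keeping only the larger half of the factors:

$$
\log(n!) = \sum_{i=1}^{n} \log i \;\ge\; \sum_{i=\lceil n/2 \rceil}^{n} \log i \;\ge\; \frac{n}{2}\log\frac{n}{2} = \Omega(n \log n).
$$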

  • Direct access array: use the key itself as an array index, skipping comparisons to gain efficiency; the costs are the large key range (keys should be unique integers that fit in a machine word) and, once keys are compressed into fewer slots, collisions.

    Space: u slots, one per possible key in the universe, to store only n items (Θ(u) even when n is much smaller than u).

    • solution: make the structure dynamic and keep m = O(n) slots (resize as n changes)

      We then need a function that maps item keys to slots of this smaller direct access array; if no two stored keys map to the same index, we can support worst-case constant-time search.

    • If m < u: collisions occur (by pigeonhole, some keys in the universe must share a slot).

      • Solutions: 1. Open addressing. 2. Chaining.

      Chaining: each slot stores a separate data structure that supports the dynamic set interface, with operations find, insert, and delete. It is common to implement chains with a linked list or dynamic array, but any implementation supporting those operations would work (see the sketch below).

      What makes a good hash function? One that minimizes the frequency of collisions, so as to minimize the maximum size of any chain (the worst case is every item hashed into one chain).

      • hash function: chosen (e.g., at random from a universal family) so that the data structure's expected performance is independent of the keys we choose to store
      • amortized bounds for dynamic operations
        • Resize: rebuild the table with a larger (or smaller) m whenever n strays from Θ(m), so m = Θ(n) is maintained and the rebuild cost amortizes to O(1) per operation
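
A minimal sketch of hashing with chaining, using Python lists as chains and the built-in hash reduced mod m as a stand-in for a carefully chosen hash function; the class and method names are illustrative, and resizing/amortization is omitted:

```python
class ChainHashTable:
    def __init__(self, m=8):
        self.m = m
        self.chains = [[] for _ in range(m)]    # one chain (list) per slot

    def _slot(self, key):
        return hash(key) % self.m               # stand-in for a chosen hash function

    def find(self, key):
        for k, v in self.chains[self._slot(key)]:
            if k == key:
                return v
        return None

    def insert(self, key, value):
        chain = self.chains[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                        # overwrite an existing key
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def delete(self, key):
        chain = self.chains[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain.pop(i)
                return

# With a good hash function and m = Theta(n), expected chain length is O(1).
```
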
    • Problem: given an array A of n positive integers, find the duplicates.
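
A sketch of one direct-access solution to this problem, assuming the values are small enough that an array of size max(A) + 1 is acceptable (Θ(n + u) time and space):

```python
def find_duplicates(A):
    if not A:
        return []
    seen = [False] * (max(A) + 1)   # direct access: index by the value itself
    dups = []
    for x in A:
        if seen[x]:
            dups.append(x)
        else:
            seen[x] = True
    return dups

# find_duplicates([3, 1, 4, 1, 5, 3]) -> [1, 3]
```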

    • direct access array sort

      n items, an array of u slots (n ≤ u)

      • runtime analysis: initialize the u slots, insert each item at its key, then scan the slots in order: Θ(n + u); if u = O(n), the algorithm is linear
      • Drawbacks: can’t handle duplicate keys or large key ranges (see the sketch below)
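
A direct access array sort sketch under the stated assumptions (unique non-negative integer keys, each less than u); Θ(n + u) time, so linear when u = O(n):

```python
def direct_access_sort(A, u):
    # Assumes the items of A are unique integers in range(u).
    D = [None] * u                          # initialize u slots: Theta(u)
    for x in A:                             # insert each item at index x: Theta(n)
        D[x] = x
    return [x for x in D if x is not None]  # scan slots in key order: Theta(u)

# direct_access_sort([5, 2, 7, 0, 3], u=8) -> [0, 2, 3, 5, 7]
```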

Improvements:

  • Allow duplicate keys on top of direct access array sort while preserving the linear runtime -> counting sort
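
A counting sort sketch (the chaining variant): it stores equal keys in one slot in input order, so it is stable and can serve as the subroutine for tuple/radix sort below. The key parameter, which should return a non-negative integer below u, is illustrative:

```python
def counting_sort(A, u, key=lambda x: x):
    # Stable: items landing in the same slot keep their input order.
    D = [[] for _ in range(u)]   # one chain per possible key
    for x in A:
        D[key(x)].append(x)
    out = []
    for chain in D:              # read the chains back in key order: Theta(n + u)
        out.extend(chain)
    return out

# counting_sort([3, 1, 3, 0, 2], u=4) -> [0, 1, 2, 3, 3]
```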

  • sort keys from a larger integer range: break each integer key into parts and then sort part by part -> tuple sort

    • Example: LSD/MSD radix sort, which sorts one digit at a time, either least-significant-digit first or most-significant-digit first.

    Tuple sort uses a stable sorting algorithm as a subroutine to repeatedly sort the objects: first by the least important key, then by the second least important key, and so on up to the most important key, thus sorting the objects lexicographically.

  • Radix sort: counting sort + tuple sort (see the sketch at the end of this section)

    • Problems
      • n integers from the range [-n^2, n^3]
      • n strings, each of k English characters
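
A minimal LSD radix sort sketch tying the pieces together: treat each key as a tuple of base-n digits and apply a stable counting pass per digit, least significant first. It assumes non-negative integer keys; for the first problem above, keys in [-n^2, n^3] can be shifted by +n^2 first (as in the usage lines). Names are illustrative:

```python
def radix_sort(A):
    # Sort non-negative integers using base n = len(A) digits.
    # Keys bounded by n^c need only O(c) stable counting passes.
    n = len(A)
    if n <= 1:
        return list(A)
    keys = list(A)
    digit, max_key = 1, max(A)
    while digit <= max_key:
        slots = [[] for _ in range(n)]          # stable counting pass on one digit
        for x in keys:
            slots[(x // digit) % n].append(x)
        keys = [x for slot in slots for x in slot]
        digit *= n
    return keys

# n integers from [-n^2, n^3]: shift into a non-negative range, sort, shift back.
A = [5, -3, 12, 0, -1, 7]
offset = len(A) ** 2
print([x - offset for x in radix_sort([x + offset for x in A])])   # [-3, -1, 0, 5, 7, 12]
```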