6.046 Course Notes
Wanlin Li
Spring 2019
1 February 5
• Very similar problems can have very different solutions and complexity
• Eulerian cycle (use all edges exactly once) is in P but Hamiltonian cycle is NP-complete
• Interval scheduling problem: given a list of requests, find the maximum number of compatible requests with a single resource
• Greedy approach: use some strategy to select the next request r_i
• Include r_i, remove all r_j not compatible with r_i, and repeat until done
• Definition. Greedy algorithm: repeatedly make locally best choice with no look-ahead
• Possible rules for greedy: choose the smallest interval, choose the interval with the earliest start, choose the interval with the fewest conflicts
• Rule that actually works is to select interval that finishes first
• Hybrid/exchange argument: among optimal solutions, take the one with the longest common prefix with the greedy solution, and show that if the two are not identical, the common prefix can be extended
• Idea: transform any optimal solution into the greedy solution with no loss in quality
• Can be done in O(n log n) time
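The earliest-finish-time rule can be sketched in a few lines (a hypothetical helper with requests as (start, finish) pairs; the O(n log n) cost is the sort):

```python
def max_compatible(requests):
    """Greedy interval scheduling: sort by finish time, then take each
    request whose start is at or after the last accepted finish."""
    chosen = []
    last_finish = float("-inf")
    for start, finish in sorted(requests, key=lambda r: r[1]):
        if start >= last_finish:    # compatible with everything chosen
            chosen.append((start, finish))
            last_finish = finish
    return chosen
```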
• Weighted interval scheduling: want maximum value of compatible requests
• Greedy no longer appears to work, use dynamic programming instead
• Sorting by start time gives O(n) subproblems because each recursive call passes a suffix
• Dynamic programming also works in O(n log n) time, with binary search to find the next compatible request
• Adding the complexity of multiple time slots for the same class makes the problem NP-hard
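A sketch of the O(n log n) DP in the notes' suffix formulation, sorting by start time with `bisect` providing the binary search (function name and tuple format are illustrative):

```python
from bisect import bisect_left

def max_weight_schedule(requests):
    """Weighted interval scheduling in O(n log n).
    requests: (start, finish, weight) triples.
    dp[i] = best achievable value using the suffix i..n-1."""
    reqs = sorted(requests)                # sort by start time
    starts = [s for s, _, _ in reqs]
    n = len(reqs)
    dp = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        s, f, w = reqs[i]
        j = bisect_left(starts, f)         # first request starting >= f
        dp[i] = max(dp[i + 1], w + dp[j])  # skip request i, or take it
    return dp[0]
```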
2 February 7: Divide and Conquer
• Break problem into smaller subproblems, not necessarily a partition
• Solve each subproblem, combine subproblems into final solution
• Combination is the difficult step requiring creativity
• Runtime analysis via the Master Theorem: T(n) = aT(n/b) + combination time
• Example. Median finding problem: given set S of n distinct numbers, find the median
• Define rank of element x as the number of elements of S that are at most x
• Example. Rank finding problem: given set S and some index i, find the element of rank i
• Possible solutions: sort S in O(n log n)
• Result [BFPRT ’73] in O(n) time
• Pick x ∈ S, compute L = {y ∈ S | y < x} and G = {y ∈ S | y > x}; rank of x is |L| + 1
• If rank of x is i, return x; if rank is > i, find the element of rank i in L; otherwise find the element of rank i − |L| − 1 in G
• Need to pick x well or worst-case runtime is O(n^2)
• Define x as c-balanced if max{rank(x) − 1, n − rank(x)} ≤ c · n
• Then T(n) = T(cn) + O(n), which gives T(n) = O(n) for constant c < 1
• Ideally would want x to be median but that is original problem
• Bootstrapping: using one rough solution to find faster algorithms
• Assume n = 10^k for convenience; divide S into n/5 groups of size 5 each, find the median of each group in O(n) time across all groups, recursively find the median x of the n/5 group medians, and continue as in the description above
• Claim: x is 3/4-balanced, easily shown by counting; only an approximate median but still works effectively
• Runtime recurrence is T(n) = O(n) + O(n) + T(n/5) + T(3n/4); since 1/5 + 3/4 = 19/20 < 1, this behaves like T(19n/20) + O(n), which is O(n)
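The BFPRT scheme just described can be sketched as follows (a minimal version assuming distinct elements; 1-indexed rank and the function name are illustrative):

```python
def select(arr, i):
    """Return the element of rank i (1-indexed) via median-of-medians.
    Worst-case O(n): groups of 5 give a 3/4-balanced pivot."""
    if len(arr) <= 5:
        return sorted(arr)[i - 1]
    # median of each group of 5 (last group may be smaller)
    medians = [sorted(arr[j:j + 5])[len(arr[j:j + 5]) // 2]
               for j in range(0, len(arr), 5)]
    x = select(medians, (len(medians) + 1) // 2)  # approximate median pivot
    L = [y for y in arr if y < x]
    G = [y for y in arr if y > x]
    rank_x = len(L) + 1
    if rank_x == i:
        return x
    if rank_x > i:
        return select(L, i)
    return select(G, i - rank_x)
```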
• Example. Problem of integer multiplication: given two n-bit numbers a, b, compute the product ab
• Standard multiplication algorithm is Θ(n^2)
• Cut the bit strings of a and b in half so that a = 2^(n/2) x + y and b = 2^(n/2) w + z; then ab = 2^n (xw) + 2^(n/2) (xz + yw) + yz
• Runtime is T(n) = 4 · T(n/2) + O(n), which is still Θ(n^2)
• Result [Karatsuba ’62]: compute xw, yz, (x + y)(z + w), which uses only three multiplications and linear-time addition/subtraction
• Then T(n) = 3 · T(n/2) + O(n), which is Θ(n^(log2 3))
• Faster and more complicated algorithms exist, down to O(n log n · 2^(Θ(log* n))), where log* is the number of times log must be applied to reach a value < 1
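Karatsuba's three-multiplication trick can be sketched over Python integers (bit shifts play the role of multiplying by powers of 2; the split point `half` is an assumed implementation choice):

```python
def karatsuba(a, b):
    """Karatsuba multiplication: three recursive products instead of
    four, giving Theta(n^log2(3)) for n-bit inputs."""
    if a < 10 or b < 10:
        return a * b
    half = max(a.bit_length(), b.bit_length()) // 2
    x, y = a >> half, a & ((1 << half) - 1)    # a = 2^half * x + y
    w, z = b >> half, b & ((1 << half) - 1)    # b = 2^half * w + z
    xw = karatsuba(x, w)
    yz = karatsuba(y, z)
    cross = karatsuba(x + y, w + z) - xw - yz  # = xz + yw
    return (xw << (2 * half)) + (cross << half) + yz
```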
3 February 12: Fast Fourier Transform and Polynomial Multiplication
• Fast Fourier Transform: shows up in numerous contexts, including signal processing, integer multiplication, multiplication of polynomials
• Evaluation of a polynomial: naive calculation (assuming each binary operation requires constant time) takes O(n^2); Horner's rule A(x_0) = a_0 + x_0(a_1 + x_0(a_2 + · · · + x_0 a_{n−1})) takes only O(n), which is optimal
• Addition of polynomials: ck = ak + bk, easily O(n) time
• Multiplication of polynomials: naive approach in O(n^2)
• Polynomial multiplication equivalent to convolution of vectors a,b
• Vector (padded with zeros) is a good representation of a signal; convolution for signal processing
• Example. Boxcar filter computing running average of last t signals
• Difficulty of polynomial multiplication is mostly in the chosen representation of the polynomial
• Another way of representing polynomials is by keeping track of roots and leading coefficient
• With representation by roots, evaluation takes O(n) and multiplication takes O(n), but addition is too difficult
• Final representation of polynomial by values at x1, x2, . . . , xn
• Addition is O(n); multiplication requires evaluation at sufficiently many points but is O(n) otherwise
• Lagrange interpolation recovers coefficients from samples in O(n^2) time
• Each representation has flaws, but optimizing for a single operation is efficient
• Goal: find conversion between coefficient and sample representations in O(n log n) time (FFT) to take advantage of the best of both worlds
• Convert coefficients to samples: given polynomial A = (a_0, a_1, . . . , a_{n−1}) and set of points X = {x_0, . . . , x_{m−1}}, compute A(x) ∀x ∈ X
• Idea to use divide and conquer
• Split A into even and odd degree coefficients, compute recursively using the squared points x_i^2
• Runtime T(n, |X|) = 2T(n/2, |X^2|) + Θ(|X|) where X^2 = {x^2 : x ∈ X}; with an arbitrary choice of X, |X^2| = |X| and T(n, n) is O(n^2)
• Choose X to be the set of 2^k roots of unity where k = ⌈log2(n)⌉
• This gives Θ(n log n) time algorithm
• Important part was choosing X to be collapsible; this is why the even/odd coefficient split worked, and |X| needed to decrease as well for divide and conquer
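The even/odd split with X chosen as roots of unity is the standard recursive FFT; a minimal sketch, assuming the number of coefficients is a power of two:

```python
import cmath

def fft(coeffs):
    """Evaluate a polynomial at the n-th roots of unity in
    Theta(n log n) via the even/odd coefficient split."""
    n = len(coeffs)
    if n == 1:
        return coeffs[:]
    even = fft(coeffs[0::2])                  # A_even at the squared points
    odd = fft(coeffs[1::2])                   # A_odd at the squared points
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)  # n-th root of unity
        out[k] = even[k] + w * odd[k]         # A(w^k)
        out[k + n // 2] = even[k] - w * odd[k]  # A(-w^k)
    return out
```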
• Discrete Fourier transform taking coefficients to sample is linear transformation
• Convert samples to coefficients: there exists a fixed matrix V, independent of A, such that V·A = A*, where A is the coefficient vector and A* the sample vector
• V is the Vandermonde matrix with rows (1, x_i, x_i^2, . . . , x_i^(n−1)) for i = 0, . . . , n − 1
• Coefficients to samples is given by V ·A, fast Fourier transform gives computation
in O(n log n) time
• Claim V^(−1) = (1/n)·V̄, where V̄ is V with each root of unity conjugated; the inverse transform can then be done in O(n log n) because V̄·A has the same structure as V·A
4 February 14: Amortized Analysis and Union-Find
• Standard table doubling is O(1) most of the time and O(n) every once in a while
• Amortization is idea of spreading expensive cost across all cheap costs
• Aggregate method: sum costs of all steps and find average
• Accounting method: pre-pay for expensive step on each earlier step
• Union-find problem: maintain a dynamic collection of pairwise disjoint sets S = {S_1, . . . , S_r} with one representative element per set, R[S_i]
• Supported operations:
1. make-set(x): add the set {x} to the collection with x as representative
2. find-set(x): return the representative of the set S(x) containing element x
3. union(x, y): replace the sets S(x), S(y) containing elements x, y with S(x) ∪ S(y), which has a single representative
• Possible representation: linked list with head of list as representative; allows make-set in Θ(1) time, find-set in Θ(n) time, union in Θ(n) time
• Augment the linked list representation with every element pointing to the head, keeping track of the tail as well; allows find-set and make-set in Θ(1) time by following pointers to the head; union almost works, except updating the pointers of S(y) takes O(n) time
• Worst case would be Ω(n) union operations taking Ω(n) time each
• Potential improvements: always concatenate the smaller list into the larger list by maintaining list lengths; an adversary could still select two sets of size Ω(n), and union would take Ω(n)
• Let n be the total number of elements (number of make-set operations) and m the total number of operations, with m ≥ n; claim the cost of all unions is O(n log n) and the total cost is O(m + n log n)
• Proof: focus on element u; make-set creation of S(u) results in a list of size 1; when S(u) merges with S(v), updating u's head pointer means the length of S(u) at least doubles
• S(u) can double at most log n times so paid cost for u is at most log n
• Total cost of unions is O(n log n) so total cost is O(m+ n log n)
• Average cost per operation is O(log n) because m ≥ n
• Potential method
• Union-find with forest of trees: each set is a (possibly unbalanced, not necessarily binary) tree with root as representative
• Make-set in O(1) time, find-set in O(h(S[u])) where h is the height of the tree, union(u, v) is O(h(S[u]) + h(S[v]))
• Tree representation differs from linked list by allowing multiple branches; at the extreme, resembles a direct access array
• Rearrange only the parts that are affected because they were traversed anyway; in a find-set operation, reattach every node reached to the representative element
• Path compression/flattening the tree results in amortized cost O(log n) per operation
• Potential function φ maps a data structure configuration to a non-negative integer, with "make-believe" (amortized) cost ĉ = c + Δφ
• Σ ĉ = Σ c + φ_f − φ_i where φ_f is the final potential and φ_i the initial potential
• Select φ(DS) = Σ_u log(u.size) where u.size is the size of the subtree rooted at u
• Amortized cost found to be O(log n) per operation
• Combining path compression and union by rank is O(m·α(n)) where α is the inverse Ackermann function
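A sketch of the disjoint-set forest with path compression, using union by size rather than rank (a common equivalent variant; the class name and dict-based storage are illustrative):

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size;
    m operations on n elements run in O(m * alpha(n)) amortized."""
    def __init__(self):
        self.parent = {}
        self.size = {}

    def make_set(self, x):
        self.parent[x] = x
        self.size[x] = 1

    def find_set(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:       # path compression pass
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:   # attach smaller under larger
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
```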
5 February 21: Amortized Analysis - Competitive Analysis
• Aggregate, accounting, potential
• Self-organizing list: list L of n elements with the single operation of accessing the element with key x, costing rank(x) where rank is the index into the list
• After every access, can use transpositions of adjacent elements to reorganize the list, with each transposition costing 1
• Does there exist some sequence of transpositions minimizing access cost in an on-line manner?
• On-line: can only see keys one at a time, must respond immediately before seeing more of the input sequence (e.g. Tetris)
• Off-line: can see the whole sequence of inputs in advance and make possibly better choices
• In worst case, adversary always picks key of last element, CA(s) = Ω(|s| · n) andany algorithm does poorly even ignoring cost of re-ordering
• Average case analysis: suppose key x is accessed with probability p(x); expected cost of an input sequence is E[C_A(S)] = |S| Σ_{x∈L} p(x) · rank_L(x), which is minimized when L is sorted in decreasing order of p(x)
• Heuristic: keep count of the number of times each element is accessed and adjust L in decreasing order of count
• While an adversary can still produce poor worst-case performance, in practice this algorithm works well
• Move to front algorithm: after accessing x, move x to head of L
• Cost of access is rank_L(x) and cost of transpositions is rank_L(x) − 1
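A small simulation of move-to-front under this cost model (an illustrative helper; each access is charged 2·rank − 1, matching the access cost plus rank − 1 transpositions):

```python
def mtf_cost(sequence, initial):
    """Total cost of serving `sequence` with move-to-front,
    starting from list `initial`."""
    L = list(initial)
    total = 0
    for x in sequence:
        r = L.index(x) + 1         # rank = 1-based position of x
        total += 2 * r - 1         # access cost r plus r-1 transpositions
        L.insert(0, L.pop(r - 1))  # move x to the front
    return total
```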
• Definition. Competitive analysis: an on-line algorithm is α-competitive if there exists a constant k such that for any sequence S of operations, C_A(S) ≤ α·C_OPT(S) + k, where OPT represents the optimal off-line algorithm
• Claim move to front is 4-competitive for self-organizing lists
• Proof: let L_i be the MTF list after the i-th access and L*_i the OPT list after the i-th access; C_i, the MTF cost of the i-th operation, equals 2 · rank_{L_{i−1}}(x_i) − 1, and C*_i, the OPT cost of the i-th operation, equals rank_{L*_{i−1}}(x_i) + t_i, where t_i is the number of transpositions in the OPT algorithm
• Amortized analysis with a potential function: want lots of potential built up whenever MTF needs a step that is much more expensive than OPT's
• Try potential Φ_i = 2 × (number of inversions between L_i and L*_i) = 2 |{(x, y) : x <_{L_i} y, y <_{L*_i} x}|
• Example. L_i = [E, C, A, D, B] and L*_i = [C, A, B, D, E] have 5 inversions; 5 transpositions can make the lists equal
• If lists are the same, potential is 0; transposition changes Φ by ±2
• When x is accessed, all other elements fall into 4 categories relative to L_{i−1} and L*_{i−1}:
– A: elements before x in Li−1 and L∗i−1
– B: elements before x in Li−1 and after x in L∗i−1
– C: elements after x in Li−1 but before x in L∗i−1
– D: elements after x in Li−1 and L∗i−1
• r = rank_{L_{i−1}}(x) = |A| + |B| + 1; r* is the analogous rank in L*_{i−1} and equals |A| + |C| + 1
• When MTF moves x to the front, Φ(L_i) − Φ(L_{i−1}) ≤ 2(|A| − |B| + t_i) because OPT creates at most t_i inversions
• Per-access cost for the i-th access: ĉ_i is the amortized cost and c_i the actual cost
• ĉ_i = c_i + Φ(L_i) − Φ(L_{i−1}) ≤ 2r − 1 + 2(|A| − |B| + t_i) = 2r − 1 + 2(|A| − (r − 1 − |A|) + t_i) = 4|A| + 1 + 2t_i ≤ 4(r* + t_i) = 4C*_i
• Summing over the sequence of operations: C_MTF = Σ_{i=1}^{|S|} c_i = Σ (ĉ_i + Φ(L_{i−1}) − Φ(L_i)) ≤ Σ_{i=1}^{|S|} 4C*_i + Φ(L_0) − Φ(L_{|S|}) ≤ 4·C_OPT
6 February 26
• Minimum spanning tree (MST) problem: given G = (V, E) and edge weights w : E → R, find a spanning tree T ⊆ E of minimum weight w(T) = Σ_{e∈T} w(e)
• Applicable to planning minimum-length networks for connecting cities
• Heuristics: avoid large weights and include small weights; some edges are inevitable
Theorem. G = (V, E) is a connected graph with a cost function defined on its edges, and U is a proper nonempty subset of V. If (u, v) is an edge of lowest cost with u ∈ U and v ∈ V \ U, then there is an MST containing (u, v).
• Definition. Cut: partition of V into U and V \U
• Cut respects set of edges if no edge in the set crosses the cut
• If (u, v) is the unique lightest edge, (u, v) is in all MSTs
• Definition. Kruskal's Algorithm: initially T = (V, ∅); examine edges in increasing weight order, breaking ties arbitrarily. If an edge connects two different connected components, add the edge to T; otherwise discard the edge (it would form a cycle) and continue. Terminate when all vertices are in a single connected tree.
• Implementation: use union-find data structure to maintain the connected components of the MST
• For each v ∈ V, make-set(v): Θ(|V|) iterations times the make-set time
• Sort E by weight in O(|E| log |E|)
• For each edge (u, v) ∈ E, if find-set(u) ≠ find-set(v), add (u, v) to T and union(u, v): O(|E|) iterations times the sum of the find-set and union times
• Overall time is O(E log E) + O(V)·Θ(1) + O(E)·O(α(V)) = O(E log E) + O((V + E)α(V)) = O(E log V) because |E| < |V|^2
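Kruskal's algorithm as described, sketched with an inlined union-find (path compression only, for brevity; edges as (weight, u, v) tuples are an illustrative format):

```python
def kruskal(vertices, edges):
    """Kruskal's MST: scan edges in increasing weight order, adding an
    edge iff its endpoints lie in different components (union-find)."""
    parent = {v: v for v in vertices}

    def find(v):                    # find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(edges):   # increasing weight order
        ru, rv = find(u), find(v)
        if ru != rv:                # different components: no cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree
```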
Theorem. Given G = (V, E) a connected, undirected graph with real-valued weights on the edges, a subset A of E included in some MST for G, (U, V \ U) a cut of G respecting A, and (u, v) a light edge crossing the cut, then edge (u, v) can be added to A and the new edge set is still included in some MST of G.
• Definition. Prim's Algorithm: select a vertex r to start and add r to T. On each subsequent step, select a light edge (u, v) connecting T to a vertex not yet in T and add it to T.
• Implementation: use min-priority queue data structure
• Put all vertices into the queue with initial distance ∞; extract the vertex with minimum distance and update the distances of the remaining vertices to the MST
• Fibonacci heap as min-priority queue gives O(log V ) extraction, amortized O(1)decrease-key, total run time O(E + V log V )
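A sketch of Prim's algorithm using Python's `heapq` as the min-priority queue; since `heapq` has no decrease-key, this lazy variant pushes duplicate entries and skips stale ones, giving the O(E log V) binary-heap bound rather than the Fibonacci-heap bound:

```python
import heapq

def prim(adj, r):
    """Prim's MST weight. adj maps each vertex to a list of
    (weight, neighbor) pairs; r is the start vertex."""
    in_tree = {r}
    heap = list(adj[r])
    heapq.heapify(heap)
    total = 0
    while heap:
        w, v = heapq.heappop(heap)      # lightest edge leaving the tree
        if v in in_tree:
            continue                    # stale entry: v already added
        in_tree.add(v)
        total += w
        for edge in adj[v]:
            if edge[1] not in in_tree:
                heapq.heappush(heap, edge)
    return total
```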
7 February 28: Network Flows
• Definition. Network: directed graph G = (V, E) with source vertex s ∈ V, sink vertex t ∈ V, and edge capacities c : E → R≥0; if edge (u, v) does not exist, c(u, v) = 0
• If vertex is not source or sink, same amount of flow enters and leaves the vertex
• Definition. Gross flow: g : E → R≥0 such that 0 ≤ g(u, v) ≤ c(u, v) for all edges and Σ_u [g(u, v) − g(v, u)] = 0 for all v ≠ s, t
• Definition. Net flow: f : V × V → R such that f(u, v) ≤ c(u, v) ∀u, v ∈ V (feasibility), Σ_u f(u, v) = 0 ∀v ≠ s, t (flow conservation), and f(u, v) = −f(v, u) (skew symmetry)
• Value of a flow is |f| = Σ_v f(s, v)
• Max flow problem: given G(V,E, s, t, c), find a flow of maximum value
• Claim: any flow can be constructed from f = 0 by adding flow cycles (cycles not exceeding the capacity of any edge, contributing value 0) and s → t paths
• Definition. Support: supp_f(G) is the subgraph of G consisting of edges (u, v) with f(u, v) > 0
Flow Decomposition Lemma. For any flow f with |f| ≥ 0, supp_f(G) can be decomposed into a collection of s−t paths and flow cycles.
• Let f∗ be a maximum flow in G and F ∗ = |f∗| the max flow value
• G+ is subgraph of G with edges of positive capacity
• If there exists an s → t path P in G+, then F* > 0 because P can support positive flow
• Use a cut to certify F* = 0; let S = {v ∈ V | ∃ s → v path in G+}, noting s ∈ S
• If F* = 0, then t ∉ S and instead t ∈ V \ S; then S is an s−t cut that separates s from t
• Definition. Cut: s− t cut is cut (S, V \S) such that s ∈ S and t ∈ V \S
• Definition. Capacity of a cut: c(S) = Σ_{u∈S} Σ_{v∈V\S} c(u, v)
• If F* = 0, there is no s−t path in G+ but there is an s−t cut S with c(S) = 0
• Min cut problem: given G(V,E, s, t, c), find an s− t cut of minimum capacity.
• Given an s−t cut S and flow f, f(S) = f(S, V\S) = Σ_{u∈S} Σ_{v∈V\S} f(u, v), so f(S) ≤ c(S)
• Then |f| = f(S) for any s−t cut
• F ∗ = |f∗| = f∗(S∗) ≤ c(S∗)
• Cannot always iteratively increase the value by identifying an s−t path and adding it to the current flow; may need to undo some existing flow
• Residual network G_f(V, E_f, s, t, c_f) of flow f in network G, with residual capacities c_f(u, v) = c(u, v) − f(u, v) if (u, v) ∈ E, f(v, u) if (v, u) ∈ E, and 0 otherwise (i.e. how much extra net u → v flow can be sent)
• By feasibility of f, 0 ≤ cf (u, v) ≤ c(u, v) + c(v, u)
• Edge (u, v) ∈ E_f whenever c_f(u, v) > 0; this discards saturated edges
• If f is a flow in G and f′ is a flow in G_f, then f + f′ is a flow in G; reduces improving flow f to finding a nonzero flow in G_f
• If no non-zero flow in Gf , ∃s− t cut S with cf (S) = 0 (residual capacity)
• For any s−t cut S, c_f(S) = c(S) − f(S), so if c_f(S) = 0 then c(S) = |f|; then c(S) = |f| ≤ F* ≤ c(S*) ≤ c(S), so f is a max flow and S is a min s−t cut
• Max-flow min-cut Theorem: F ∗ = c(S∗)
• Max flow algorithm: an augmenting path is a directed s−t path in G_f; can push additional flow along an augmenting path up to its bottleneck capacity
• Total runtime O(EVC) if capacities are integers in [0, C]; a pseudopolynomial algorithm
8 March 5
• Find max flow from the residual network of a flow: find an augmenting path from s to t in G_f (up to the residual bottleneck capacity)
• Definition. Ford-Fulkerson Algorithm: start with zero flow; while an augmenting path P exists in G_f, augment f along P
Max Flow Min Cut. The following are equivalent:
1. |f | = c(S) for some s− t cut S
2. f is a max flow
3. f admits no augmenting path
• Weak duality: if S* is a minimum s−t cut and f* is a max flow, then F* = |f*| ≤ c(S*)
• Strong duality F ∗ = c(S∗)
• Runtime of Ford-Fulkerson: each iteration/augmentation takes O(E) time; if capacities are integers in [0, C] then total runtime is O(EVC) (pseudopolynomial)
• If capacities rational then runtime still finite and pseudopolynomial
• If capacities are real this could run for infinite time
Flow Integrality Theorem. If G = (V, E, s, t, c) has all capacities integral, then there exists a flow f such that |f| = F* and both F* and f are integral.
• A max flow that is not integral can still exist even when capacities are integral
• Ford-Fulkerson picks any augmenting path, may be smarter choice
• Definition. Maximum bottleneck path: augmenting path P that maximizes the bottleneck capacity c_f(P)
• Maximum bottleneck path can be found in O(E log V) time: binary search to find the maximum capacity c*; c ≤ c* iff ∃ s−t path in G_f after removing all edges with c_f(u, v) < c
• Definition. Maximum bottleneck path algorithm: start with f(u, v) = 0 ∀u, v; while an augmenting path exists, find an augmenting path P with maximum bottleneck capacity and augment the flow along P
Lemma. In any graph G = (V, E, s, t, c), there exists an s−t path P in G with c(P) = min_{e∈P} c(e) ≥ F*/m, where m = |E|.
• Proof by flow decomposition lemma
• Corollary: MBP runs in O(m^2 log n log(nC)) time
• Ford-Fulkerson still better when C is small
• MBP weakly polynomial time
• Definition. Edmonds-Karp algorithm: variant of Ford-Fulkerson that always chooses the augmenting path with the fewest edges
• Edmonds-Karp runs in O(m2n) time, no dependency on edge capacities
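A sketch of Edmonds-Karp (Ford-Fulkerson with BFS shortest augmenting paths); the dict-of-dicts residual representation is an illustrative choice, and the input capacities are mutated into residual capacities:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow, O(V E^2). `cap` maps u -> {v: capacity}
    and is updated in place to hold residual capacities."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:    # BFS for an augmenting path
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow                     # no augmenting path: done
        b, v = float("inf"), t              # bottleneck along the path
        while parent[v] is not None:
            b = min(b, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:        # push flow, update residuals
            u = parent[v]
            cap[u][v] -= b
            rev = cap.setdefault(v, {})
            rev[u] = rev.get(u, 0) + b
            v = u
        flow += b
```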
• Applications of max flow: maximum bipartite graph matching problem (reduce to max flow); Ford-Fulkerson gives O(mn) time
9 March 7: Linear Programming
• Linear programming example: how to campaign to win an election based on affected demographics and votes gained/lost from ads, with the goal to win a majority in each demographic while spending as little as possible
• Definition. Linear programming: minimize/maximize a linear objective function subject to a set of linear constraints; variables form a vector x ∈ R^m with objective function c · x and constraints Ax ≤ b, b ∈ R^n, where A is an n × m constraint matrix
• Standard LP form: maximize ~c · ~x subject to A · ~x ≤ ~b and ~x ≥ 0
• Transformations can change any LP to standard form, e.g. min to max by c → −c
• Strict equality can be enforced using ≤ and ≥ combined; unconstrained x_i ∈ R can be handled by introducing x_i^+ ≥ 0 and x_i^− ≥ 0 and substituting x_i = x_i^+ − x_i^−
• Geometric view: x is a point in R^m and c is a direction vector; want the most extreme x in the direction of c subject to the constraints, which form a polytope with at most n facets
• Polytope may be unbounded (possibly no best solution) or empty (no solution, LP infeasible)
Theorem. If the polytope is bounded and nonempty, an optimal solution is attained at a vertex of the polytope.
• Simplex algorithm (greedy): start at any vertex of the polytope and walk from vertex to vertex of the feasible polytope in the direction of c; very practical but exponential in the worst case
• Ellipsoid method: maintain an ellipsoid containing the optimal x*; at each step cut the ellipsoid with a hyperplane and find a smaller ellipsoid containing the optimal solution; geometric binary search, polynomial in the worst case and useful in theory but poor in practice
• Interior point method: start inside the polytope and move vaguely in the direction of c; polynomial in the worst case and quite practical
• Simplex moves on the edges of the polytope, highly attuned to the combinatorial structure of the constraints
• Interior-point moves inside the polytope, ignoring most combinatorial structure of the constraints
• Integer linear programming (additional constraint that all x_i are integers) is NP-complete
• Given an LP in standard form (maximize c · x such that Ax ≤ b and x ≥ 0), consider the dual LP: minimize b · y such that A^T y ≥ c and y ≥ 0
• Corresponds to finding coefficients of the linear constraints that sum to an inequality proving optimality of the original LP
Weak LP duality. If x and y are feasible solutions to the primal and dual programs, then c · x ≤ b · y.
Strong LP duality. If x* and y* are optimal feasible solutions to the primal and dual programs, then c · x* = b · y*. Moreover, exactly one of the following four possibilities holds:
1. Both (P) and (D) have optimal solutions
2. (P) is unbounded and (D) is infeasible
3. (D) is unbounded and (P) is infeasible
4. Both (P) and (D) are infeasible
• Roles of (P) and (D) are completely symmetric
• Max flow min cut is special case of above strong duality
10 March 14
• Game theory: performing thought experiments to help predict the behavior of a rational agent in a situation of conflict
• Two-player game: A_ij is the utility of player R if R plays i and C plays j; B_ij is the utility of player C if R plays i and C plays j
• Definition. Two-player zero-sum games: A = −B, where matrix A represents the utility of the row player and matrix B the utility of the column player
• Example. Rock paper scissors
• RPS has a randomized stable outcome where each option is chosen with probability 1/3
• Definition. Nash equilibrium: state of a game such that no player has incentive to deviate from their current strategy; no player can improve expected utility by unilaterally changing strategy
• Example. (Testify, testify) is the deterministic Nash equilibrium of the prisoner's dilemma game
Nash Equilibrium. Any game with a finite number of players and a finite number of actions has a Nash equilibrium.
Min-Max Theorem. For any matrix A, if V_R = max_{x∈P} min_{y∈Q} x^T A y and V_C = min_{y∈Q} max_{x∈P} x^T A y, then V_R = V_C.
14
6.046 Notes Wanlin Li
• P, Q are sets of nonnegative vectors with components summing to 1; they correspond to probability distributions over the rows and columns of the matrix
• V_R is the expected utility of the row player if the row player commits first; V_C is the expected negative utility of the column player if the column player commits first
• V_R ≤ V_C because V_R corresponds to the row player playing with a handicap
• (x*, y*) corresponding to V_R, V_C is always a Nash equilibrium of the two-person zero-sum game described by A
• Nash equilibrium always exists for two person zero sum game
• Proof of min-max theorem by expressing VR, VC as linear program
• Need x ≥ 0 with Σ x_i = 1; if z = V_R then for any column action j the expected utility must be ≥ z: Σ_i A_ij x_i ≥ z
• Want to maximize z
• By similar reasoning, V_C = min u such that Σ_j A_ij y_j − u ≤ 0 for all i, Σ_j y_j = 1, y ≥ 0
• Observation: the (R) and (C) programs are dual to each other, so strong LP duality implies the Min-Max Theorem
• C∗ = R∗ so C∗ ≥ VC ≥ VR ≥ R∗ and VC = VR
• Nash equilibrium always exists but finding it might be extremely difficult
• Simple stock market model: X_t is the stock market index on day t with X_0 = 0; each day predict whether X_t = X_{t−1} + 1 or X_t = X_{t−1} − 1
• Correct prediction gains one million, otherwise lose one million
• Given n experts, get up/down advice from each expert (not necessarily competent); goal is to do well when at least one expert is consistently providing decent advice
• Define regret as the number of mispredictions minus the number of mistakes of the best expert
• Difficulty is best expert can only be identified in hindsight
• If the best expert never makes a mistake, use the Halving algorithm: maintain a pool of trustworthy experts, at each step go with the majority vote of trustworthy experts, and remove all experts that mispredicted
• Regret of Halving algorithm is log n
• In the general setting, even the best expert makes m* mistakes; can use the iterated halving algorithm and replenish the trusted pool when emptied by putting all experts back
• Iterated halving algorithm has regret (m∗ + 1) log n
• Replenishing S fails to distinguish between very bad experts and decent experts
• Idea: use weights to capture the trustworthiness of experts; start each expert at weight 1 and in each round update the estimate, halving the weight of an expert upon each mistake
• Aggregate predictions by taking weighted majority of answers
• Weighted majority algorithm has regret of ≤ 2.4(m∗ + log n)
• Reducing weights by a factor (1 − ε) instead of 1/2, for 0 < ε ≤ 1/2, gives regret ≤ (1 + ε)m* + (2/ε) log n
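The weighted majority algorithm can be sketched as follows (eps = 1/2 recovers the halving-style weight update; encoding predictions and outcomes as ±1 is an illustrative choice):

```python
def weighted_majority(predictions, outcomes, eps=0.5):
    """Run weighted majority and return the number of mistakes made.
    predictions[t][i] is expert i's +1/-1 call at step t; outcomes[t]
    is the true +1/-1 outcome. Each wrong expert's weight is
    multiplied by (1 - eps)."""
    n = len(predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(predictions, outcomes):
        vote = sum(wi * p for wi, p in zip(w, preds))
        guess = 1 if vote >= 0 else -1     # weighted majority vote
        if guess != y:
            mistakes += 1
        for i, p in enumerate(preds):
            if p != y:                     # penalize wrong experts
                w[i] *= 1 - eps
    return mistakes
```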
11 March 19: Randomized Algorithms
• Randomized/probabilistic algorithm: generates random numbers and makes decisions based on their values; given the same input, different executions may have different runtimes or produce different outputs
• Definition. Monte Carlo algorithms: always run in polynomial time, with high probability of correct output
• Definition. Las Vegas algorithms: run in expected polynomial time and alwaysgive correct output
• Matrix multiplication requires a certain number of scalar multiplications (Θ(n^3) naively)
• Matrix product verification: check whether C = A × B by multiplying both sides by the same random vector and checking agreement
• Definition. Freivalds' algorithm: choose a random binary vector r with P(r_i = 1) = 1/2 independently; if A(Br) = Cr return yes, otherwise no
• Freivalds' algorithm is Monte Carlo: always time efficient but may be incorrect
• Runtime is O(n^2) for three matrix-vector multiplications
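A sketch of Freivalds' verification with the three matrix-vector products per trial; repeating the test drives the false-positive probability down to at most 2^(−trials) (the list-of-lists matrix format is illustrative):

```python
import random

def freivalds(A, B, C, trials=20):
    """Probabilistically check C == A*B in O(n^2) per trial.
    'No' answers are always correct; 'yes' answers are wrong with
    probability at most 2**-trials."""
    n = len(A)
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]   # random 0/1 vector
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False
    return True
```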
• Definition. Sensitivity: true positive rate
• Definition. Specificity: true negative rate
• Freivalds' algorithm has excellent sensitivity: claim C ≠ AB implies P(ABr ≠ Cr) ≥ 1/2; let D = C − AB with D ≠ 0, so we want to show there are many r with Dr ≠ 0
• For any vector r with Dr = 0, there exists r′ (differing in one coordinate) such that Dr′ ≠ 0
• Definition. Quicksort: divide and conquer algorithm with work mostly done in the dividing step; sorts in place
• Basic, median-based pivoting, randomized quicksort (Las Vegas)
• Core quicksort: given n element array A, output sorted array
• Pick a pivot element x_p in A and partition into subarrays L = {x_i | x_i < x_p}, G = {x_i | x_i > x_p}; recursively sort the subarrays
• Basic quicksort: choose the pivot to be the first or last element; remove each element x_i from A in turn and insert it into L, E, G based on comparison with x_p; can be done in place
• Partition in Θ(n) time; worst case Θ(n^2) (sorted or reverse-sorted input), but in practice does well on random inputs
• Median-based pivoting: guarantees a balanced split, Θ(n log n), but loses to mergesort in practice
• Randomized quicksort: x_p chosen at random from A, with a new random choice made at each recursion; expected running time O(n log n) for all input arrays
• Average-case analysis: average over inputs
• Expected-case analysis: average over random choices
• Paranoid quicksort: repeatedly choose the pivot as a random element of A and perform the partition, exiting once |L| ≤ (3/4)|A| and |G| ≤ (3/4)|A|; then recurse
• A call is good with probability ≥ 1/2
• T(n) includes the time to sort the left and right subarrays, plus the number of iterations to find a good call times cn per partition
• T(n) ≤ T(n/4) + T(3n/4) + 2cn, giving expected Θ(n log n) runtime
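Randomized quicksort as described, with a fresh random pivot per call (this out-of-place version is a sketch, not the in-place partition):

```python
import random

def quicksort(arr):
    """Randomized quicksort (Las Vegas): expected Theta(n log n)
    on every input, always correct."""
    if len(arr) <= 1:
        return arr
    p = random.choice(arr)             # random pivot each recursion
    L = [x for x in arr if x < p]
    E = [x for x in arr if x == p]
    G = [x for x in arr if x > p]
    return quicksort(L) + E + quicksort(G)
```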
Markov Inequality. For a nonnegative random variable X with positive expectation, P[X ≥ c·E[X]] ≤ 1/c for all constants c > 0.
• Markov inequality bounds the probability that a random variable exceeds its expectation by some proportion
• Proof by integration computation
• Markov inequality provides a way to convert a Las Vegas algorithm into a Monte Carlo one: run the Las Vegas algorithm for time cT, where T is the expected running time; the new algorithm completes efficiently but may not give a correct answer
Chernoff Bound. For a binomial random variable Y = B(n, p), P(Y ≥ E[Y] + r) ≤ e^(−2r²/n) ∀r > 0, where n is the number of trials and p is the probability of success.
12 March 21: Random Walks and Markov Chain Monte Carlo Methods
• Definition. Random graph walk: for an undirected graph G = (V, E) and starting vertex s, start at s and repeat t times the process of randomly moving to a neighbor of the current vertex
• If the graph has non-negative edge weights, move to neighbor v′ with probability proportional to the weight of (v, v′)
• Representation of a random walk as a trajectory: list of vertices visited in order of visiting
• Distribution: probability distribution on vertices induced by walks
• p_v^t is the probability that the walk visits vertex v at step t
• Generally represent the set of p_v^t across v as a vector p^t ∈ R^V whose v-th coordinate is p_v^t
• Lazy random walk: allow random walk to remain at current vertex
• As t → ∞, the lazy walk can eliminate oscillation and lead to convergence of the probabilities
• Given an undirected graph, the adjacency matrix A of G is the n × n matrix with A_{u,v} = 1 if (v, u) ∈ E and 0 otherwise; the degree matrix D is the n × n diagonal matrix with D_{u,v} = d(u) if u = v and 0 otherwise
• Walk matrix W = AD^(−1) with W_{u,v} = 1/d(v) if (v, u) ∈ E; then p^{t+1} = W p^t = W^{t+1} p^0
• For lazy random walks, replace W by W′ = p_l · I + (1 − p_l) · W, where p_l is the probability of staying put
• Many graphs converge to a stationary distribution π independent of the starting vertex
• Wπ = π (and likewise W′π = π for the lazy walk); represents a steady state that exists whether or not walks actually converge to it
• π_v = d(v) / Σ_{u∈V} d(u) is a stationary distribution
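Convergence to π_v = d(v) / Σ_u d(u) can be checked numerically by iterating the lazy walk update (a sketch with adjacency lists; the function name and `lazy` parameter are illustrative):

```python
def walk_distribution(adj, start, steps, lazy=0.5):
    """Compute p^t exactly for a lazy random walk:
    p^{t+1} = lazy * p^t + (1 - lazy) * W p^t.
    adj maps each vertex to a list of neighbors (undirected graph)."""
    p = {v: 0.0 for v in adj}
    p[start] = 1.0                      # walk starts at `start`
    for _ in range(steps):
        q = {v: lazy * p[v] for v in adj}
        for v, nbrs in adj.items():
            share = (1 - lazy) * p[v] / len(nbrs)
            for u in nbrs:
                q[u] += share           # mass moves v -> u w.p. 1/d(v)
        p = q
    return p
```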
Theorem. Every connected, non-bipartite, undirected graph has a stationary distribution to which random walks in the graph are guaranteed to converge as t → ∞. For lazy random walks, the non-bipartiteness condition can be dropped.
• For directed graphs, the graph must be strongly connected (every vertex reachable from every other vertex) and aperiodic
• Example. Process of diffusion, self loops at end of chain
• Example. Card shuffling: start at one vertex in a graph of 52! possibilities and perform a random walk
• Riffle shuffle and top-to-random shuffle are both strongly connected and have stationary distributions
• Mixing time for n cards is approximately (3/2) log_2(n) riffle shuffles
• Problem of ranking web pages
• Count rank: make the rank of a page proportional to the number of incoming edges (links to the page); adjacency matrix times the all-ones vector
• Weight rank: the weight of a recommendation is the inverse of the number of recommendations made by the page: w_u = Σ_v (1/d(v)) A_{u,v}, so WR = AD^(−1)·1 = W·1, where D is the outgoing degree matrix
• Weight rank does not depend on importance of recommending page
• RecRank: RR_u = ∑_{v∈V} A_{u,v} (1/d(v)) RR_v; rec rank is a stationary distribution for W
• PageRank: Pr = (1 − α)W · Pr + (α/n)1 where α is a parameter of choice and n = |V|; stationary distribution for random process taking a random step with probability (1 − α) and jumping to a random page with probability α
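A minimal power-iteration sketch of the PageRank update (the 3-page link graph is a toy example and α = 0.15 a conventional choice, both assumptions rather than lecture material):

```python
import numpy as np

alpha = 0.15
A = np.array([[0, 1, 1],      # A[u, v] = 1 if page v links to page u
              [1, 0, 0],      # page 0 links to 1 and 2; pages 1, 2 link back to 0
              [1, 0, 0]], dtype=float)
n = A.shape[0]
W = A / A.sum(axis=0)         # column-stochastic walk matrix A D^{-1}

pr = np.full(n, 1.0 / n)      # start from the uniform distribution
for _ in range(200):          # Pr <- (1 - alpha) W Pr + (alpha/n) 1
    pr = (1 - alpha) * (W @ pr) + alpha / n

print(pr.argmax())            # page 0, the most-linked page, ranks highest
```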
• Definition. Markov chain: process for which future state of system depends probabilistically on current state of system without dependence on past states
• Definition. Metropolis-Hastings algorithm: states and stationary distribution are known, want to calculate transition probabilities; start at arbitrary vertex x_t = x_0, randomly pick neighbor x_r as transition candidate and evaluate f_r = P(x_r)/P(x_t); if f_r ≥ 1 (trial vertex at least as probable in stationary distribution as x_t), accept trial move and let x_{t+1} = x_r, repeat; otherwise accept trial with probability f_r, if rejected set x_{t+1} = x_t and try again
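A small simulation sketch of these steps (the 4-state ring and target weights are assumed for illustration; all ring vertices have equal degree, so the symmetric neighbor proposal needs no correction):

```python
import random

P = [1, 2, 3, 4]                       # unnormalized target stationary weights
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # 4-cycle

random.seed(0)
x, counts = 0, [0, 0, 0, 0]
for _ in range(200_000):
    xr = random.choice(neighbors[x])   # propose a random neighbor as candidate
    fr = P[xr] / P[x]                  # acceptance ratio f_r = P(x_r)/P(x_t)
    if fr >= 1 or random.random() < fr:
        x = xr                         # accept the trial move
    counts[x] += 1                     # rejected trials stay at x

freqs = [c / sum(counts) for c in counts]
print([round(f, 2) for f in freqs])    # close to the normalized weights 1:2:3:4
```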
• Metropolis-Hastings mimics random walk with appropriate stationary distribution
13 April 2: Universal and Perfect Hashing
• Dictionary problem: abstract data type to maintain set of keyed items subject to insertion/deletion of item, search for key (return item with key if it exists)
• Items have distinct keys
• Easier than predecessor/successor problem solved by AVL trees or skip lists (O(log n)) or van Emde Boas (O(log log u))
• Hashing: goal of O(1) time per operation and O(n) space
• u is number of keys over all possible items, n is number of keys/items currently in table and m is number of slots in table
• Hashing with chaining achieves O(1 + α) time per operation where α is the load factor n/m
• With simple uniform hashing, probability of collision is 1/m but requires assumption that inputs are random, works only in average case (like basic quicksort)
• Universal hashing: choose random hash function h ∈ H where H is a universal hash family such that P_{h∈H}[h(k) = h(k′)] ≤ 1/m for all k ≠ k′
• Then h is random, no assumption needed about input keys (like randomized quicksort)
Theorem. For n arbitrary distinct keys and a random h ∈ H, the expected number of keys colliding in a slot is at most 1 + α where α = n/m.
• Then insert, delete, search are expected to cost O(1 + α)
• Existence of universal hash families: e.g. the set of all hash functions h : [u] → [m] is universal but useless because storing an arbitrary h takes u log m bits
• Definition. Dot product hash family: assume m prime (find nearby prime), assume u = m^r for some integer r (round up), view keys in base m as k = ⟨k0, k1, . . . , k_{r−1}⟩ (cut up) and for key a = ⟨a0, . . . , a_{r−1}⟩ define h_a(k) = a · k mod m (mix); then H = {h_a | a ∈ {0, 1, . . . , u − 1}}
• Storing ha requires storing just one key a
• Word RAM model: manipulating O(1) machine words takes O(1) time and every object of interest (key) fits in a machine word
• Then ha(k) computation takes O(1) time
Theorem. The dot product hash family H is universal.
• Another universal hash family: choose prime p ≥ u and let h_{ab}(k) = [(ak + b) mod p] mod m, H = {h_{ab} | a, b ∈ {0, 1, . . . , u − 1}}
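A quick empirical sketch of this family (the specific keys, prime, and trial count are assumptions; note that the standard presentation draws a from {1, . . . , p − 1} and b from {0, . . . , p − 1}):

```python
import random

u, m, p = 1000, 13, 1009               # p is a prime >= u
k1, k2 = 42, 777                       # two arbitrary distinct keys

random.seed(1)
trials, collisions = 20_000, 0
for _ in range(trials):
    a = random.randrange(1, p)         # draw a random member h_{ab} of H
    b = random.randrange(0, p)
    h = lambda k: ((a * k + b) % p) % m
    collisions += (h(k1) == h(k2))     # did the fixed pair collide?

print(collisions / trials)             # near the universal bound 1/m ≈ 0.077
```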
• Static dictionary problem: given n keys to store in table, support search(k)
• Perfect hashing (no collisions): polynomial build time, O(1) worst case search and O(n) worst case space
• Idea of two-level hashing: first pick h1 : {0, 1, . . . , u − 1} → {0, 1, . . . , m − 1} from a universal hash family for m = Θ(n) (e.g. nearby prime) and hash all items with chaining using h1
• For each slot j ∈ {0, 1, . . . , m − 1} let lj be the number of items in slot j, pick h2,j : {0, 1, . . . , u − 1} → {0, 1, . . . , mj − 1} from a universal hash family for lj^2 ≤ mj ≤ O(lj^2) (nearby prime)
• Replace chain in slot j with hashing with chaining using h2,j
• Space is O(n + ∑_{j=0}^{m−1} lj^2); if ∑_{j=0}^{m−1} lj^2 > cn then redo first step
• Search time is O(1) for first table h1 plus O(max chain size in second table)
• While h2,j(ki) = h2,j(ki′) for any i ≠ i′, repick h2,j and rehash those lj items so there are no collisions at the second level
• First and second steps are both O(n) build time
• Second hashing collision avoidance: expected to require at most 2 trials to reach good h2,j so number of trials is O(log n) with high probability
• Chernoff bound: lj = O(log n) with high probability and each trial in O(log n) time, overall O(n log^2 n) time with high probability
• Expected size of ∑_{j=0}^{m−1} lj^2 is O(n) because m = Θ(n)
• For sufficiently large constant c, by Markov inequality probability that h1 is not O(n) space is ≤ 1/2 so first step is O(n log n) with high probability
14 April 4: Streaming Algorithms
• Definition. Streaming algorithms: with very limited memory (usually o(n) or O(log n)) and sequential access to data, characterize data stream; typically only output at end of input, correctness only approximate or probable
• n may refer to number of elements in stream or size of largest output of data stream
• Applications: networking (e.g. IP packet routing) to characterize network flows, identify threats; database modification and access to characterize patterns
• One pass through data stream (x1, x2, . . . , xn) with small local memory
• Exact algorithms (rare): statistics of input data stream (average, max, min, majority, etc.), reservoir sampling (keep collection of elements that are a uniform sample of input stream up to that point)
• Probably approximate algorithms: number of distinct elements, additional frequency moments F_p = ∑_{i=1}^m (f_i)^p
• Want some degree of correctness guarantee
• For simple statistics, computing average or max requires keeping partial answer which requires only O(log n) space and one pass
• Given input stream with majority element (occurs > n/2 times): each instance of non-majority element can be cancelled by some other element, majority element will remain
• Reservoir sampling: given input stream of elements x_i, keep one representative element for output with probability 1/n but don't know n in advance
• Solution: keep x1 in storage, when x_i is read replace storage element with x_i with probability 1/i; at each step, a uniform random sample from x1, . . . , x_i is stored
• Instead of storing single element, store reservoir of k elements each with probability k/n
• Keep first k elements, for each further element x_{i+1} with probability k/(i+1) keep it and remove a random reservoir element
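The k-element reservoir above can be sketched as follows (the toy stream and the empirical uniformity check are illustrative assumptions):

```python
import random

def reservoir_sample(stream, k, rng):
    """One-pass, O(k)-memory reservoir sample of k elements."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(x)              # keep the first k elements
        elif rng.random() < k / i:           # keep x_i with probability k/i
            reservoir[rng.randrange(k)] = x  # evict a random stored element
    return reservoir

rng = random.Random(0)
counts = {v: 0 for v in range(10)}
for _ in range(30_000):                      # empirically: uniform over the stream
    for v in reservoir_sample(range(10), 3, rng):
        counts[v] += 1
# each value should be sampled about 30_000 * 3/10 = 9_000 times
print(min(counts.values()), max(counts.values()))
```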
• Weighted sampling: each element comes with weight w_i, output according to weighted probability; keep x_i with probability w_i/∑_{j=1}^i w_j
• Definition. Frequency moment: F_p = ∑_{i=1}^m f_i^p where f_i is number of times item i appears in input stream and each x_i ∈ [m]
• Frequency moments tend to be approximate rather than exact for streaming algorithms
• Example. F0 is number of distinct elements under convention 0^0 = 0
• Example. F2 corresponds to size of database joins
• Probabilistic approximation: compute (ε, δ)-approximation F̂0 to F0 such that with probability ≥ 1 − δ, (1 − ε)F0 ≤ F̂0 ≤ (1 + ε)F0
• Deterministic approximate algorithms and randomized exact algorithms are both impossible here; both δ and ε are needed for a streaming algorithm
• Estimate F0 by pairwise-independent hash functions: family of hash functions H = {h : X → Y} such that h(x1) and h(x2) are independent for all x1 ≠ x2, where randomness is over choice of hash function
• Equivalent condition: for every x1 ≠ x2 ∈ X and y1, y2 ∈ Y, P[h(x1) = y1 AND h(x2) = y2] = 1/|Y|^2
• Many possible constructions
• H = {h : [m] → [m]} is a pairwise independent family of hash functions, z(x) is number of trailing zeros of x in binary
• Algorithm: start with z = 0, for each item j, compute h(j) and test if z(h(j)) > z; if so set z to be z(h(j)); return 2^z as estimate for F0
• With d distinct elements, Y is set of bins of elements ending with each binary string of length log d
• With d unique elements, there is good chance one will hash to first bin (element ending with log d zeros) making output ≥ d with good chance
• With ≫ d bins, good chance that no element will hash to first bin so output is < 2^{log(cd)} = cd with good chance
• Claim: for any c > 1, 1/c ≤ F̂0/F0 ≤ c with probability at least 1 − 2/c
• P(z(h(j)) ≥ r) = 1/2^r and P(z(h(j)) ≥ r AND z(h(k)) ≥ r) = 1/2^{2r}
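A rough simulation sketch of the trailing-zeros estimator (the linear hash is a stand-in for a pairwise-independent family, and taking a median of repeated runs is a standard tightening; both are assumptions beyond the notes):

```python
import random

def trailing_zeros(x):
    # number of trailing zero bits of x in binary (convention: z(0) = 0 here)
    z = 0
    while x > 0 and x % 2 == 0:
        x //= 2
        z += 1
    return z

def estimate_f0(stream, rng, p=2**31 - 1):
    # h(x) = (a*x + b) mod p, a stand-in pairwise-independent hash (p prime)
    a, b = rng.randrange(1, p), rng.randrange(p)
    z = 0
    for x in stream:                           # track max trailing zeros seen
        z = max(z, trailing_zeros((a * x + b) % p))
    return 2 ** z                              # estimate of F0

rng = random.Random(7)
stream = [i % 1000 for i in range(10_000)]     # 1000 distinct elements
estimates = [estimate_f0(stream, rng) for _ in range(31)]
median = sorted(estimates)[15]                 # median over independent hashes
print(1000 / 32 <= median <= 32 * 1000)        # within a coarse constant factor
```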
15 April 9: Dynamic Programming I
• Definition. Memoization: use some form of look-up table to store previously-solved subproblem solutions
• Iterative approach: solve subproblems in smaller-to-larger order
• Dynamic programming: essentially clever brute force solution, reduce generally exponential problem to polynomial one through reuse of subproblem solutions
• For DP to be effective (polynomial), need polynomial number of unique subproblems, polynomial number of cases per subproblem, polynomial time to compute problem solution given subproblem solutions
• Subproblem dependency graph must be DAG
• Top down approach: corresponds to DFS of subproblem dependency graph, generally larger asymptotic constants
• Bottom up approach: systematically fill subproblem solution storage in order dictated by subproblem dependency graph, only need to consider each subproblem once but does not skip unnecessary subproblems
• Alternating coins game: row of n coins of value v1, . . . , vn with n even, 2 players take turns; in each turn a player removes the first or last coin and receives the corresponding value
• Must have one function for each player
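The two per-player functions can equivalently be folded into one value function for whoever moves next (a standard reformulation, not the two-function version from lecture; the example coin values are assumed):

```python
from functools import lru_cache

def coin_game(v):
    """Max value the first player can guarantee against optimal play."""
    @lru_cache(maxsize=None)
    def total(i, j):
        return sum(v[i:j + 1])

    @lru_cache(maxsize=None)
    def best(i, j):
        if i > j:
            return 0
        # take v[i] or v[j]; the opponent then gets best of what remains
        return max(v[i] + total(i + 1, j) - best(i + 1, j),
                   v[j] + total(i, j - 1) - best(i, j - 1))

    return best(0, len(v) - 1)

print(coin_game([4, 4, 9, 3]))  # 13: first player can secure the 9
```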
• Optimal BST problem: given set of keys k1, . . . , kn and search probabilities p1, . . . , pn, construct optimal binary search tree to store keys minimizing expected search cost ∑ p_i (d(k_i) + 1) where d(k_i) is depth of k_i
• Enumeration of all BSTs is too large, greedy approach not guaranteed to be correct
• Split subproblems through choice of key at root, Θ(n^2) subproblems and Θ(n) per subproblem
16 April 11: Dynamic Programming II
• Edit distance: given two sequences and catalog of edit functions and their associated costs (insert, delete, substitute), find minimum cost for converting one string into the other
• Optimal alignment contains optimal subalignments, e.g. prefixes X_{1..i} → Y_{1..j}
• Subproblems involve removing single character from right-hand end of one or both strings
• Runtime Θ(mn) where m, n are lengths of sequences; mn subproblems each requiring O(1) time
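The Θ(mn) table can be sketched as follows (unit costs for all three edit operations are assumed):

```python
def edit_distance(X, Y):
    """dp[i][j] = min cost to convert X[:i] into Y[:j], unit edit costs."""
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                            # delete all of X[:i]
    for j in range(n + 1):
        dp[0][j] = j                            # insert all of Y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if X[i - 1] == Y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete X[i-1]
                           dp[i][j - 1] + 1,        # insert Y[j-1]
                           dp[i - 1][j - 1] + sub)  # substitute (or match)
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```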
• Knapsack problem: want to fill knapsack with goods of n types of various value and weight, produce sack of maximal value without exceeding given weight capacity W
• Subproblem structure based on smaller weight capacity
• O(nW ) runtime
• Definition. Pseudopolynomial runtime: polynomial in number of items and in the value W, but exponential in the number of bits used to store weights/values
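A sketch of the O(nW) table (a 0/1 variant with one copy of each item is assumed; iterating capacities downward is what enforces the one-copy rule):

```python
def knapsack(values, weights, W):
    """dp[w] = max value achievable within capacity w; O(nW) time."""
    dp = [0] * (W + 1)
    for v, wt in zip(values, weights):
        # iterate capacities downward so each item is used at most once
        for w in range(W, wt - 1, -1):
            dp[w] = max(dp[w], dp[w - wt] + v)
    return dp[W]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220
```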
• General longest path in graph lacks optimal substructure but longest path in DAG has optimal substructure
• DFS can be used to sort DAG into topologically sorted order
• Topologically sort G; for each vertex v ∈ V that appears before s in sorted order, set its distance to −∞ (unreachable from s)
17 April 18
• Seemingly related problems can have vastly different difficulties
• Example. Shortest path (polynomial time) vs longest path (no polynomial time algorithm known)
• MST (given weighted graph, find spanning tree of minimum weight) vs TSP (find spanning simple cycle of minimum weight)
• Bipartite vs tripartite matching
• Optimization version of problem (MST): given weighted graph, find spanning tree of minimum weight; result is tree or report that graph is not connected
• Search version of problem: given weighted graph and budget K, find spanning tree whose weight is ≤ K or report that none exists; result is tree or report that K is too small or that graph is not connected
• Decision version of problem: given weighted graph and budget K, decide whether there exists spanning tree with weight ≤ K; result is yes or no
• Existence of polynomial time solution to optimization implies polynomial time solution to search; existence of polynomial time solution to search implies one for decision
• For showing intractability, generally focus on decision version because the decision version being unsolvable in polynomial time implies the others are also unsolvable in polynomial time
• Decision problem π is solvable in polynomial time if there exists a polynomial-time algorithm A such that for all x, x is a yes input for π iff A(x) outputs yes
• NP: non-deterministic polynomial time captures problems with polynomially short and polynomial time verifiable certificates of yes instances
• π ∈ NP if there exists a polynomial-time verification algorithm Vπ and a constant c such that for all x, π(x) is yes iff there exists a certificate y such that |y| ≤ |x|^c and Vπ(x, y) is yes
• Reduction: for input x to problem π1, an algorithm A for π2, and some function R(x) such that A(R(x)) is a solution to π1
• Polynomial time reduction from π1 to π2 useful when π2 ∈ P and want to show π1 ∈ P or when π1 is hard and want to deduce that π2 is also hard
• Definition. Reduction: polynomial-time reduction from π1 to π2 is polynomial-time algorithm R such that if x is an input to π1, then R(x) is an input to π2 and π1(x) is yes iff π2(R(x)) is yes
• If polynomial time reduction from π1 to π2 exists, π1 ≤p π2 (π2 at least as hard as π1)
• Definition. NP-hard: problem π such that for all π′ ∈ NP, π′ ≤p π
• Definition. NP-complete: problem π such that π ∈ NP and π is NP-hard
Cook’s Theorem. Imagine a circuit made of 3 types of boolean logic gates: AND,OR, and NOT, where OR takes exactly 2 arguments. Inputs and outputs are bi-nary variables xi ∈ 0, 1, and assume no feedback so the graph is a DAG withone output. The circuit-SAT problem is as follows: given a circuit C(x1, . . . , xn),is there an input for which the output of C is 1? The circuit-SAT problem is NP -complete.
• For any problem π ∈ NP, need to find reduction to circuit-SAT
• Reduction builds circuit Cx satisfiable iff π(x) is yes; Cx(y) is implementation of Vπ(x, y)
18 April 25
Cook’s Theorem. Circui-SAT is NP-complete.
• cSAT: given boolean circuit of AND, OR, NOT gates and no feedback, is there aset of 0, 1 input values to produce an output of 1
• Definition. SAT problem: given boolean formula, is it satisfiable? e.g. φ =(x1 ∨ x2) ∧ x3 ∧ (x3 ∨ x1 ∨ x2)
• Formulas of n boolean variables x1, . . . , xn, m boolean connectives ∧, ∨, ¬, ⇒, ⇔, and parentheses
• To show SAT is NP-hard, only need to show SAT is at least as hard as cSAT
• Given reduction, need to show circuit is satisfiable iff φ is satisfiable
• Definition. 3-SAT: given formula φ in conjunctive normal form (AND of ORs)with 3 literals per clause, is φ satisfiable?
• Example. φ = (x1 ∨ x2) ∧ x3 ∧ (x3 ∨ x1 ∨ x2) is not valid 3-SAT input because thefirst two clauses do not have three literals each
• Karp showed 3-SAT is NP-complete
• Definition. Vertex cover problem: given a graph G = (V, E) and an integer k, does there exist a subset S ⊆ V such that |S| ≤ k and every edge e ∈ E is incident to at least one vertex in S?
• Would like to reduce 3-SAT to VC (vertex cover)
• Gadget construction: for each variable, assign gadget (subgraph of G) representing its truth value; for each clause assign gadget representing that at least one literal must be true, assign edges connecting these kinds of gadgets
• For each variable x_i create edge with two vertices px_i, nx_i; for each clause c_i create 3-cycle fc_i, sc_i, tc_i (all these vertices distinct)
• If first literal of clause c_i is x_j, add edge (fc_i, px_j) and if first literal is ¬x_j, add edge (fc_i, nx_j) corresponding to positive or negative; similarly for the second and third literals via sc_i, tc_i
• If clause is satisfied, at least one of its outgoing incident edges (to variable) is covered; remaining 2 edges covered by picking two nodes from triangle
• Covering outgoing edge from variable node equivalent to satisfying corresponding literal in clause
• Exists vertex cover of size k = 2m+ n iff φ is satisfiable
• If px_i is in the vertex cover, set x_i = 1 and otherwise x_i = 0
• If ~x is a satisfying assignment and xi = 1, include pxi in VC and nxi otherwise,then pick 2 other vertices from each of the m clause gadgets to cover all edges
• Therefore VC is NP-complete
• Beyond NP-completeness: approximation algorithms, intelligent exponential search, average case analysis, special input cases
19 April 30: Approximation Algorithms
• Want to solve hard (NP-hard) problems using fast algorithms that obtain exact solutions
• Can achieve any two of these three conditions (hard problems, fast algorithms, exact solutions) but not all three (currently)
• Solving hard problems in polynomial time requires approximation algorithms; work with optimization version of problem
• Given optimization problem of size n, c* is cost of optimal solution and c is cost of approximate solution
• Ratio bound: max{c/c*, c*/c} ≤ ρ(n) for all n gives a ρ(n)-approximation algorithm
• Approximation scheme takes ε > 0 as input and provides (1 + ε)-approximationalgorithm
• Polynomial time approximation scheme provides algorithm polynomial in n but not necessarily in 1/ε, e.g. O(n^{2/ε})
• Fully polynomial time approximation scheme: provides algorithm polynomial in n and 1/ε, e.g. O(n/ε^2)
• Vertex cover optimization version: input graph G(V, E) and output set of vertices S ⊆ V such that for all edges e ∈ E, S ∩ e ≠ ∅ with objective to minimize |S|
• 2-approximation algorithm for VC: pick any edge (u, v) ∈ E, add both u, v to S and remove all edges from E incident to one of the two vertices; repeat until E is empty
• Runtime O(V + E), non-deterministic output (depends on order of edge selection), S always a valid vertex cover
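A sketch of the 2-approximation (the path graph is an assumed toy instance; on it OPT = 2, so the returned cover has at most 4 vertices):

```python
def vc_2approx(edges):
    """Pick an uncovered edge, add both endpoints, drop covered edges."""
    S, remaining = set(), set(edges)
    while remaining:
        u, v = remaining.pop()           # pick any remaining edge
        S |= {u, v}                      # add both endpoints to the cover
        remaining = {e for e in remaining if u not in e and v not in e}
    return S

edges = {(0, 1), (1, 2), (2, 3)}          # a path on 4 vertices (OPT = 2)
S = vc_2approx(edges)
print(sorted(S))                          # a valid cover of size <= 2 * OPT
```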
• Set cover: given set X of n points and m subsets S_i of X whose union is X, find cover C ⊆ [m] such that ⋃_{i∈C} S_i = X and |C| is minimized
• Greedy algorithm: while some element is not covered, choose new set S_i containing maximum number of uncovered elements and add i to cover
• Number of iterations is O(min(n, m)) and overall runtime is O(mn · min(m, n))
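The greedy rule can be sketched as follows (the instance is an assumed toy example where the two "large" sets suffice):

```python
def greedy_set_cover(X, sets):
    """Repeatedly add the set covering the most uncovered elements."""
    uncovered, cover = set(X), []
    while uncovered:
        # index of the set with the maximum number of uncovered elements
        i = max(range(len(sets)), key=lambda j: len(sets[j] & uncovered))
        cover.append(i)
        uncovered -= sets[i]
    return cover

X = range(6)
sets = [{0, 1, 2}, {3, 4, 5}, {0, 3}, {1, 4}, {2, 5}]
cover = greedy_set_cover(X, sets)
print(cover)  # [0, 1]: the two size-3 sets cover everything
```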
• Greedy set cover is a (ln n + 1)-approximation; on each iteration, at least a 1/|C_OPT| fraction of remaining elements are covered
• If t = |C_OPT| and X_i is the set of remaining elements at iteration i, X_i can be covered by ≤ t sets so there exists a set that covers ≥ |X_i|/t elements
• Partition problem (NP-hard): given sorted list of n positive numbers s1 ≥ · · · ≥ sn, find partition of indices [n] into two sets A, B such that max{∑_{i∈A} s_i, ∑_{j∈B} s_j} is minimized; find most balanced partition
• Let m = ⌈1/ε⌉ − 1 so ε ≥ 1/(m + 1); find optimal partition A′, B′ for first m numbers s1, . . . , sm, which takes constant time with respect to n
• For each successive element si, add si to partition with smaller sum
• (1 + ε)-approximation algorithm, takes O(n) time
• Greedy algorithm on vertex cover selecting vertices of maximum degree: polynomial time and returns valid vertex cover
• There exist inputs on which greedy vertex cover is extremely suboptimal, e.g. bipartite graph where k! vertices have degree k in the first group and second group has k!/i vertices of degree i for each i; greedy may pick all vertices in second group
• Linear programming relaxation for vertex cover: assign indicator x_i to each vertex v_i ∈ V
• Seek to minimize ∑ x_i subject to 0 ≤ x_i ≤ 1 (temporarily relax integer constraint) and x_i + x_j ≥ 1 for all edges (v_i, v_j) ∈ E
• Take x*_i = 1 iff x_i ≥ 1/2 for a 2-approximation algorithm
20 May 2: Distributed Algorithms
• Computing paradigms: parallel computing (multiple processor cores), parallelize task when benefit from more identical workers; distributed computing (computer networks/internet), cooperate to solve joint task even when some components are not cooperating
• n processors/players each with input x_i, for each processor want to compute y_i = f_i(x1, . . . , xn)
• Each fi might depend on all inputs so cooperation is essential
• Message passing model: processors connected in undirected graph, in each round can send/receive messages along edges; each processor knows its ports by arbitrary local name
• Shared memory model: processors communicate by reading/writing to shared memory in each round
• Synchronous model assumed, i.e. things happen in rounds
• Leader election: run protocol (algorithm) so exactly one processor outputs that it is the leader
• Impossible if protocol is deterministic and processors are truly identical
• Fundamental problem is lack of symmetry breaking mechanism
• First solution: make processors non-identical so each processor has a unique ID
• Simple protocol (assuming connected graph): each processor has local variable max_i, in each round send max_i to all neighbors and update max_i to the maximum message known to processor; after ∆ rounds (∆ upper bound on diameter of graph), output leader if max_i = ID_i
• ∆ usually known to processors in order to ensure termination
• Second solution: no unique IDs but use randomness; idea to use randomness to manufacture unique IDs
• Protocol: choose random ID from set {1, 2, . . . , K} for some K and run the protocol for unique ID setting
• Probability of collision is upper bounded by (n choose 2) · (1/K) ≤ ε if K ≥ (1/ε) · (n choose 2) for some ε > 0, so protocol succeeds with probability ≥ 1 − ε
• Processors do not know if they succeed so this is a Monte Carlo algorithm; unknown how to obtain Las Vegas algorithm
• Ignore time for local computations, focus on complexity of reaching consensus (number of rounds of communication)
• Maximal independent set problem: independent set to which no new vertex can be added while remaining independent
• Goal to obtain protocol such that each processor outputs yes/no decision and the yes-decision processors form a maximal independent set (no two yes-processors are neighbors)
• Maximal ≠ maximum independent set
• Maximum independent set is NP-hard but maximal independent set is in P
• Simple protocol: do leader election, add leader to MIS and make its neighbors inactive, repeat for O(n∆) rounds
• Luby’s randomized MIS protocol: all processors start active, protocol proceedsin phases with each phase consisting of 2 rounds;
• Round 1 of phase: choose random value ri ∈ [K] and send to all neighbors,receive values from neighbors, if all received values < r then join MIS (outputYES)
• Round 2: if processor joined MIS, announce to all neighbors; if announcement isreceived, do not join MIS (output NO); if YES/NO decided in this phase, becomeinactive
• Final set is independent and maximal
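A centralized simulation sketch of the two-round phases (the small path graph is an assumed example, and Python floats stand in for values drawn from [K], so ties have probability 0):

```python
import random

def luby_mis(adj, rng):
    """Simulate Luby's protocol: returns a maximal independent set."""
    active, mis = set(adj), set()
    while active:
        # round 1: every active vertex draws a random value
        r = {v: rng.random() for v in active}
        joined = {v for v in active
                  if all(r[v] > r[u] for u in adj[v] if u in active)}
        # round 2: joiners announce; joiners and their neighbors go inactive
        mis |= joined
        active -= joined | {u for v in joined for u in adj[v]}
    return mis

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}        # a path 0-1-2-3
mis = luby_mis(adj, random.Random(3))
independent = all(not (v in mis and u in mis) for v in adj for u in adj[v])
maximal = all(v in mis or any(u in mis for u in adj[v]) for v in adj)
print(independent and maximal)                      # a valid maximal independent set
```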
• If K ≫ n^3, O(log n) rounds needed to terminate; terminate within 4 log n phases with probability ≥ 1 − 1/n
• Proof for linear graph
• In each phase, if i ≠ j then P[r_i = r_j] = 1/K ≪ 1/n^3; union bounding over all pairs and the first 4 log n phases gives P[∃ i ≠ j : r_i = r_j] ≤ 4 log n · (n choose 2)/K ≤ 2n^2 log n/K ≪ 1/n so WLOG all r_i always distinct
• Call edge (u, v) active iff both u, v are still active; for any edge (u, v) and any phase in which (u, v) starts active, P[(u, v) becomes inactive] ≥ 1/2 by casework on active edges incident to u, v
• Therefore probability that (u, v) is still active after t phases is ≤ (1/2)^t

21 May 7: Continuous Optimization I
• Optimization in continuous spaces (e.g. LP) but with greater generality
• Problems involve set of control variables that are continuous and multi-dimensional with continuous scalar objective function generally non-linear in control variables
• Model of some type required to specify relationship between control and objective
• Definition. Unconstrained minimization: given a real-valued function f : Rn →R, find its minimum, assuming it exists.
• For maximum, take min of −f ; if constraints required, minimize g(x) = f(x) +ψ(x) where ψ(x)→∞ when constraints violated and ψ(x) = 0 otherwise
• Assume f continuous and infinitely differentiable
• Definition. Gradient descent: iterative approach of locally using gradient to continuously attempt to make improvements by walking downhill; locally greedy approach
• Linear expansion about current point x, move in opposite direction of ∇f(x) because ∇f is direction of greatest local increase
• Algorithm: begin at some starting point x^(0), for each i set x^(i+1) = x^(i) − η_i ∇f(x^(i))
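The update rule can be sketched as follows (the quadratic f(x, y) = x^2 + 2y^2, its gradient, the step size, and the iteration count are all assumptions for illustration; here β = 4 so η = 1/β = 0.25):

```python
def grad_descent(grad, x0, eta, steps):
    """Iterate x <- x - eta * grad f(x) from the starting point x0."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

grad = lambda x: [2 * x[0], 4 * x[1]]      # gradient of x^2 + 2y^2
x = grad_descent(grad, [5.0, -3.0], eta=0.25, steps=100)
print(x)  # very close to the unique minimum at the origin
```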
• Local optimality when gradient is 0; does not improve on local optimum but can locally perturb and continue from maximum or saddle point
• f(x + δ) ≈ f(x) + [∇f(x)]^T δ + (1/2) δ^T ∇^2 f(x) δ + · · ·
• f is β-smooth if δ^T ∇^2 f(x) δ ≤ β‖δ‖^2 for all x, δ ∈ R^n
• Plugging in choice of step δ = −η∇f(x) gives f(x + δ) ≤ f(x) − η‖∇f(x)‖^2 + (βη^2/2)‖∇f(x)‖^2 where first term is expected progress and second term is expected error; if η ≈ 1/β, progress should exceed error in estimated progress (need η ≤ 2/β)
• Expected minimum progress in ith step is (1/(2β))‖∇f(x^(i))‖^2
• Gradient descent converges to point x with ∇f(x) = 0 (local min/max or saddle) or diverges to f(x) → −∞
• Local min may not be global min
• Gradient descent guaranteed to converge to global min when f is convex
• Definition. Convex function: f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all 0 ≤ λ ≤ 1 or equivalently, f(x + δ) ≥ f(x) + [∇f(x)]^T δ for all x, δ
• Convergence analysis: how quickly gradient descent converges
• Let x* be the minimum of convex function f; f(x*) ≥ f(x) + [∇f(x)]^T (x* − x) so f(x) − f(x*) ≤ −[∇f(x)]^T (x* − x) ≤ ‖∇f(x)‖ · ‖x* − x‖ ≤ ε by Cauchy-Schwarz
• This is dependent on the unknown distance between ~x and ~x∗
• Definition. α-strong convexity: f is α-strongly convex if y^T ∇^2 f(x) y ≥ α‖y‖^2 for all x, y ∈ R^n with α ≥ 0
• Normal convexity is α = 0
• For α-strongly convex function, f(x + δ) ≥ f(x) + [∇f(x)]^T δ + (α/2)‖δ‖^2 so convergence to minimum has f(x) − f(x*) ≥ (α/2)‖x − x*‖^2
• Getting within ε of optimum requires O(K log((f(x^(0)) − f(x*))/ε)) steps where K = β/α ≥ 1 is the condition number of f
• If function not strongly convex, idea to construct new function based on f that is α-strongly convex by adding α‖x − x^(0)‖^2 (regularization)
• New function has possibly different optimum; reduce regularization as x* is approached
22 May 9: Applications of Gradient Descent
• Unconstrained minimization given smooth and continuous f : R^n → R by locally greedy gradient descent method
• Hessian matrix ∇2f(x)
• As t→∞, x(t) either diverges to −∞ or converges to critical point
• If f is convex, every critical point is a global minimum
• Convergence analysis for strongly convex function
• Definition. β-smooth: δ^T ∇^2 f(x) δ ≤ β‖δ‖^2 for some β ≥ 0
• Definition. α-strongly convex: δ^T ∇^2 f(x) δ ≥ α‖δ‖^2 for some α ≥ 0
• α ≤ β always; β-smoothness is an upper bounding parabola and α-strong convexity is a lower bounding parabola
Theorem. If f is β-smooth and α-strongly convex, then for any ε > 0, f(x^(T)) − f(x*) ≤ ε whenever T = Ω(K log((f(x^(0)) − f(x*))/ε)) where K = β/α is the condition number of f.
• Step size η ≈ 1/β; can also binary search to minimize error term
• K measures quality of local approximation of f at x
• Condition number K is main factor affecting growth of number of steps
• Key application domain of gradient descent is training machine learning models
• Linear regression as illustrative example: m data points x^(1), . . . , x^(m) ∈ R^n and m labels y^(1), . . . , y^(m) ∈ R; goal to find linear function that predicts each y^(j) given x^(j)
• Popular choice for measure of best fit given by mean squared error L(w) = (1/m) ∑_j E_j(w)^2
• Goal to compute arg min_w L(w)
• Gradient update combines the contributions of all data points
• Classic approach to solving linear system is by directly computing inverse matrix but this is fairly slow and numerically problematic
• Iterative approach: start with some x^(0), iteratively improve solution by gradient descent on function f_A(x) = (1/2) x^T (A^T A) x − (A^T b)^T x so any critical point is a solution
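A sketch of this iterative approach (the small overdetermined system is an assumed example; the gradient of f_A is A^T A x − A^T b, and the step size is chosen below 2/λ_max(A^T A)):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([6.0, 0.0, 0.0])

x = np.zeros(2)
eta = 0.05                                 # below 2 / lambda_max(A^T A) ≈ 0.12
for _ in range(20_000):
    x = x - eta * (A.T @ A @ x - A.T @ b)  # gradient step on f_A

print(x)  # satisfies the normal equations A^T A x = A^T b
```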
• Beyond linear classifications by increasing number of dimensions, e.g. mapping x to (x, x^2) so that lines in the lifted space correspond to parabolas in the original space
• Deep learning: work with mapping expressed by neural network with parameters, resulting optimization problems are highly non-convex but gradient descent still delivers good solutions for unknown reason
23 May 14: Quantum Computation
• Represent each step of computation as n-bit binary string with transitions given by 2^n × 2^n matrix
• Randomized model of computation has states as probability distributions in 2^n-dimensional space, transitions are stochastic matrices
• In quantum model, universes interact and can cancel each other out
• Quantum operations must be invertible, restricts operations that can be carried out
• Qubit: state represented as linear combinations of basis vectors over C
• Transition matrix is unitary, MM* = I to preserve lengths; state vectors have length 1
• Quantum computing enables operations nonexistent in classical computations
• 2-qubit means n = 2
• States cannot always be separated by qubit (states of individual qubits not necessarily independent)
• Paradox of entanglement: measurement of first qubit can determine outcome of second qubit arbitrarily far away
No-cloning Theorem. There is no quantum operation U such that (α |0〉 +β |1〉)(|0〉)→ (α |0〉+ β |1〉)(α |0〉+ β |1〉) for all α, β ∈ C.
• Can be used to design quantum money that is impossible to counterfeit or design perfectly secure protocols
• Computing XOR: given f : {0, 1} → {0, 1}, compute f(0) ⊕ f(1) in as few queries as possible
• Classical model requires two queries
• Quantum query to f is given by query transformation U_f : |x y⟩ → |x (y ⊕ f(x))⟩ for all x, y ∈ {0, 1} to ensure U_f is reversible
• Quantum algorithm to compute XOR with single query: start with |0 0⟩, flip the second bit, apply Hadamard gate to both bits, apply U_f, then apply Hadamard gate again to the first bit
• Hadamard gate sends |0⟩ to (1/√2)(|0⟩ + |1⟩) and |1⟩ to (1/√2)(|0⟩ − |1⟩)
• This algorithm causes cancellation of undesirable states and measuring the first qubit gives the correct answer
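A linear-algebra simulation sketch of this one-query algorithm (this is the standard Deutsch algorithm; the NumPy state-vector simulation and basis ordering |x y⟩ → index 2x + y are assumptions, not lecture material):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate
I = np.eye(2)

def deutsch(f):
    """Return f(0) XOR f(1) using a single application of U_f."""
    # U_f |x y> = |x (y XOR f(x))> as a 4x4 permutation matrix
    Uf = np.zeros((4, 4))
    for x in (0, 1):
        for y in (0, 1):
            Uf[2 * x + (y ^ f(x)), 2 * x + y] = 1
    state = np.zeros(4)
    state[0b01] = 1                    # |0 1>: start |0 0>, flip second bit
    state = np.kron(H, H) @ state      # Hadamard on both qubits
    state = Uf @ state                 # the single quantum query
    state = np.kron(H, I) @ state      # Hadamard on the first qubit
    p1 = state[2] ** 2 + state[3] ** 2  # P[first qubit measures 1]
    return int(round(p1))

for f in (lambda x: 0, lambda x: 1, lambda x: x, lambda x: 1 - x):
    assert deutsch(f) == f(0) ^ f(1)
print("one query suffices for f(0) XOR f(1)")
```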