Machine Learning Notes: Week 06

Author: 我的名字叫清阳 | Published 2016-02-19 14:52

    Week 06 tasks:

    • Lectures: Randomized Optimization.
    • Reading: Chapter 9 of Mitchell, as well as the overview on the No Free Lunch Theorem.
    Randomized Optimization

    Optimization
    • Optimization is to find an x* in the input space X that makes the objective function f(x) as large as possible (the maximum, or something close to it).
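    In symbols (a standard formulation, not copied from the slides):

    ```latex
    x^{*} = \operatorname*{arg\,max}_{x \in X} f(x), \qquad f : X \to \mathbb{R}
    ```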

    Optimize Me

    Quiz: best x
    • For the first function, since the input space is small, we can simply plot (or enumerate) f(x) over all of X and read off the largest value.
    • For the second function, we can try calculus: set the derivative f'(x) to 0 and solve (which is hard here because the derivative is a cubic). Alternatively, we can plot f over a subset of the input values to narrow down where the maximum lies (see the sketch below).
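    A minimal sketch of the "plot / enumerate" idea, assuming a small discrete input space and a made-up objective f (the quiz's actual functions are not reproduced here):

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical objective; stands in for the quiz's function.
    def f(x):
        return -(x - 3.0) ** 2 + 5.0

    X = np.arange(1, 101)                        # small discrete input space: integers 1..100
    scores = f(X)
    print(X[np.argmax(scores)], scores.max())    # read off x* and f(x*)

    plt.plot(X, scores)                          # or just look at the plot and spot the peak
    plt.show()
    ```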

    Optimization Approaches

    Optimization Approaches
    • Three approaches:
      • generate & test: when we have a small input space but a complex function
      • using calculus: when the function has a derivative and the equation f'(x) = 0 can be solved analytically
      • Newton's method: when the function has a derivative and a guess can be improved iteratively (a sketch follows below)
    • When none of these approaches works, we turn to Randomized Optimization.
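    A minimal sketch of the Newton's-method idea for maximizing a smooth 1-D function (the example function and starting point are illustrative assumptions):

    ```python
    def newton_maximize(df, d2f, x0, iters=20):
        """Improve a guess x by Newton steps applied to f'(x) = 0."""
        x = x0
        for _ in range(iters):
            x = x - df(x) / d2f(x)      # Newton update on the derivative
        return x

    # Illustrative example: f(x) = -(x - 3)**2 + 5, so f'(x) = -2(x - 3) and f''(x) = -2.
    x_star = newton_maximize(lambda x: -2.0 * (x - 3.0), lambda x: -2.0, x0=0.0)
    print(x_star)   # converges to 3.0
    ```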
    Hill Climbing
    • the hill climbing algorithm can get stuck at a local maximum (a sketch follows below).
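    A minimal hill-climbing sketch over a discrete 1-D input space; the neighborhood definition and the bumpy objective are illustrative assumptions, not the lecture's example:

    ```python
    import random

    def hill_climb(f, x, neighbors):
        """Greedily move to the best neighbor until no neighbor improves f(x)."""
        while True:
            best = max(neighbors(x), key=f)
            if f(best) <= f(x):
                return x                 # stuck: this is a (possibly only local) maximum
            x = best

    # Illustrative setup: input space is the integers 0..99, neighbors are x-1 and x+1.
    def neighbors(x):
        return [n for n in (x - 1, x + 1) if 0 <= n <= 99]

    f = lambda x: x - (x % 25 - 12) ** 2     # bumpy objective with several local maxima
    print(hill_climb(f, random.randrange(100), neighbors))   # often a local, not global, maximum
    ```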
    Quiz 2: find my word

    Random Restart Hill Climbing

    Random Restart Hill Climbing
    • how do we avoid an unlucky starting point?
      • check that the new starting point has not been visited before
      • make sure the starting point is far away from known local maxima (a restart wrapper is sketched below)
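    A minimal restart wrapper around the hill_climb, f, and neighbors defined in the previous sketch; the number of restarts is an arbitrary assumption, and a visited-set or distance check could be added along the lines of the bullets above:

    ```python
    def random_restart_hill_climb(f, neighbors, random_start, restarts=10):
        """Run hill climbing from several random starting points and keep the best result."""
        best = None
        for _ in range(restarts):
            candidate = hill_climb(f, random_start(), neighbors)
            if best is None or f(candidate) > f(best):
                best = candidate
        return best

    print(random_restart_hill_climb(f, neighbors, lambda: random.randrange(100)))
    ```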
    Quiz 3: randomized hill climbing
    • The six points on the global-maximum hill are special:
      • the 4 points on the slope of the global maximum take an average of 4 steps to reach it; that is the 4/28 * 4 term.
      • the 2 points at the boundary of the basin lead up to the global maximum half of the time, and to the local maximum (and a restart) the other half.
    • all of the other 22 points lead to the local maximum and then start a new round from a new random starting point.
    • if we keep track of the points already visited, the algorithm can do better than 28 evaluations on average.
    • if the attraction basin under the local maximum is large, hill climbing will do better.

    Annealing algorithm

    Simulated Annealing
    • One step beyond hill climbing: allow downhill moves in order to "explore".
    Annealing algorithm
    • if f(xt) and f(x) are very close, then f(xt) - f(x) is close to 0 and P is close to 1, so the downhill move is almost always accepted.
    • if T is very large (approaching infinity), then f(xt) - f(x) hardly matters: P approaches 1 and x will jump to xt regardless.
    • if T is very small (close to 0), P for downhill moves goes to 0, so the algorithm only climbs and never jumps down. The acceptance rule is written out below.
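    The acceptance probability these bullets refer to, in the standard simulated-annealing form (notation matches the lecture's x, xt, T):

    ```latex
    P(x, x_t, T) =
    \begin{cases}
    1 & \text{if } f(x_t) \ge f(x) \\[4pt]
    \exp\!\left(\dfrac{f(x_t) - f(x)}{T}\right) & \text{otherwise}
    \end{cases}
    ```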
    Properties of Simulated Annealing
    • T -> 0: like hill climbing
    • T-> ∞: random walk
    • Decrease T slowly: as T decreases it becomes harder and harder to jump out of a basin, and the algorithm tends to settle at the global maximum (a code sketch follows below).
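    A minimal simulated-annealing sketch, reusing the illustrative f and neighbors from the hill-climbing example; the cooling schedule and iteration count are assumptions:

    ```python
    import math
    import random

    def simulated_annealing(f, x, neighbors, T=100.0, cooling=0.99, iters=5000):
        """Hill climbing that accepts a downhill move with probability exp((f(xt) - f(x)) / T)."""
        for _ in range(iters):
            xt = random.choice(neighbors(x))
            delta = f(xt) - f(x)
            if delta >= 0 or random.random() < math.exp(delta / T):
                x = xt                       # always go uphill; sometimes accept a downhill jump
            T = max(T * cooling, 1e-8)       # decrease T slowly
        return x

    print(simulated_annealing(f, random.randrange(100), neighbors))
    ```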

    Genetic Algorithms

    GA
    • GA is like random restart, but the restarts are done in parallel across a population.
    • Crossover: the population as a whole holds information.
    GA Skeleton
    • Define "most fit"
      • top-half / truncation selection: keep the half of the individuals with the highest fitness
      • weighted probability / roulette-wheel selection: individuals are chosen with probability proportional to their fitness
    • what's crossover?
    Crossover Example
    • Uniform crossover: at each position, randomly choose which parent the value comes from (a sketch follows below).
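    A minimal sketch of one GA generation with truncation selection and uniform crossover on bit strings; the population size, string length, and mutation rate are illustrative assumptions:

    ```python
    import random

    def one_generation(population, fitness, mutation_rate=0.01):
        """Truncation selection, uniform crossover, and bit-flip mutation on bit strings."""
        population = sorted(population, key=fitness, reverse=True)
        parents = population[: len(population) // 2]        # "most fit": keep the top half
        children = []
        while len(children) < len(population):
            a, b = random.sample(parents, 2)
            child = [ai if random.random() < 0.5 else bi    # uniform crossover: pick per position
                     for ai, bi in zip(a, b)]
            child = [1 - bit if random.random() < mutation_rate else bit
                     for bit in child]                       # occasional mutation
            children.append(child)
        return children

    # Example: "one-max" -- maximize the number of 1s in a 20-bit string.
    population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
    for _ in range(50):
        population = one_generation(population, fitness=sum)
    print(max(population, key=sum))
    ```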
    Recap

    MIMIC

    MIMIC Quiz 4:
    • when θ is at its minimum, Pθ(x) is the uniform distribution over X
    • when θ is at its maximum, Pθ(x) is concentrated on the optima (the points that achieve the maximum)
    Pseudo code
    • Given a threshold θt, generate samples from Pθt(x); then set θt+1 to the n-th percentile of those samples' fitness values and retain only the samples with f(x) >= θt+1.
    • From the retained samples we estimate Pθt+1(x), generate new samples from it, and repeat until the process converges (see the sketch after this list).
    • Two assumptions:
      • Pθt(x) can be estimated from samples
      • Pθt+1 is close enough to Pθt
    • this is a lot like a genetic algorithm, but the move from Pθ to Pθ+ε preserves structure, because the estimated distribution carries it from one iteration to the next.
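    A simplified sketch of the sample → keep top percentile → re-estimate → resample loop. For brevity it models each bit independently instead of with the dependency tree MIMIC actually uses, so it only illustrates the outer loop, not the full algorithm; the population size, keep fraction, and iteration count are assumptions:

    ```python
    import random

    def sample_keep_reestimate(f, n_bits=20, pop=200, keep=0.5, iters=30):
        """Sample from the current model, keep the fittest fraction, re-estimate, repeat."""
        probs = [0.5] * n_bits                        # theta_min: start from the uniform distribution
        for _ in range(iters):
            samples = [[1 if random.random() < p else 0 for p in probs]
                       for _ in range(pop)]
            samples.sort(key=f, reverse=True)
            elite = samples[: int(pop * keep)]        # the samples with f(x) >= theta_{t+1}
            probs = [sum(s[i] for s in elite) / len(elite)   # re-estimate the distribution
                     for i in range(n_bits)]
        return max(samples, key=f)

    print(sample_keep_reestimate(f=sum))              # one-max again: approaches the all-1s string
    ```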
    Estimating Distributions
    • P(x) is the product of the conditional probabilities of each xi in x = [x1, x2, ..., xn] given the others. Estimating P(x) this way needs a huge amount of data (exponential in n).
    • Dependency tree: each node has only one parent.
      • it can represent relationships
      • and sampling from a dependency tree is easy
      • it is a generalization of crossover, and it does not require locality (the factorization is written out below).
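    The factorization these bullets refer to, written out in standard notation (π(xi) denotes the parent of xi in the tree):

    ```latex
    p(x) = p(x_1 \mid x_2, \ldots, x_n)\; p(x_2 \mid x_3, \ldots, x_n) \cdots p(x_{n-1} \mid x_n)\; p(x_n)
    \qquad\text{vs.}\qquad
    \hat{p}_{\pi}(x) = \prod_{i=1}^{n} p\big(x_i \mid \pi(x_i)\big)
    ```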

    Finding Dependency Trees

    Finding Dependency Trees
    • Minimize the KL divergence between p and p̂π; the divergence can be written in terms of entropies.
    • Build a cost function by dropping the -h(p) term (it does not depend on π) and subtracting the unconditional entropies h(xi); what remains is -Σ I(xi; π(xi)), the negative mutual information between each xi and its parent.
    • Finding the dependency tree is thus converted into the problem of maximizing the mutual information between each node and its parent; in other words, finding for each node the parent that knows the most about it (the derivation is written out below).
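    The derivation these bullets summarize, in standard notation (h is entropy, I is mutual information, π(xi) is the parent of xi):

    ```latex
    D_{\mathrm{KL}}\big(p \,\|\, \hat{p}_{\pi}\big)
      = \sum_{x} p(x)\,\big[\log p(x) - \log \hat{p}_{\pi}(x)\big]
      = -h(p) + \sum_{i} h\big(x_i \mid \pi(x_i)\big)

    \hat{J}_{\pi}
      = \sum_{i} \Big[ h\big(x_i \mid \pi(x_i)\big) - h(x_i) \Big]
      = -\sum_{i} I\big(x_i ;\, \pi(x_i)\big)
    ```

    Minimizing Ĵπ over all trees is therefore the same as maximizing the total mutual information Σi I(xi; π(xi)).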
    Finding Dependency Trees
    • This can be solved as a maximum spanning tree problem over the fully connected mutual-information graph, e.g. with Prim's algorithm.
    • Prim's algorithm is fairly fast (polynomial in the number of edges) and works well on dense, fully connected graphs (a sketch follows below).
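    A minimal sketch of this step, assuming binary-valued samples: estimate pairwise mutual information from counts and grow a maximum spanning tree with a simple Prim-style loop (mutual information is recomputed per edge for clarity rather than cached):

    ```python
    import math
    from collections import Counter

    def mutual_information(a, b):
        """I(A;B) for two aligned sequences of binary values, estimated from counts."""
        n = len(a)
        pa, pb = Counter(a), Counter(b)
        pab = Counter(zip(a, b))
        mi = 0.0
        for (va, vb), c in pab.items():
            p_ab = c / n
            mi += p_ab * math.log(p_ab / ((pa[va] / n) * (pb[vb] / n)))
        return mi

    def max_spanning_tree(samples):
        """Prim-style maximum spanning tree on the complete graph weighted by pairwise MI.
        Returns a parent map: feature index -> parent index (root maps to None)."""
        cols = list(zip(*samples))               # one sequence of values per feature
        n = len(cols)
        in_tree = {0}
        parent = {0: None}
        while len(in_tree) < n:
            _, i, j = max((mutual_information(cols[i], cols[j]), i, j)
                          for i in in_tree for j in range(n) if j not in in_tree)
            in_tree.add(j)
            parent[j] = i                        # j's parent is the node that knows most about it
        return parent

    # Illustrative data: 4 binary features, where feature 1 copies feature 0.
    data = [[0, 0, 1, 0], [1, 1, 0, 1], [1, 1, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
    print(max_spanning_tree(data))
    ```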
    Back to the pseudo code
    • the dependency tree is what is used to estimate Pθt+1(x) from the samples retained in the previous step.
    Quiz 5: Probability Distribution
    • which probability distribution can represent the maxima of each of the problems in the left column?
    Practical Matters
    • MIMIC evaluates the cost function far fewer times than simulated annealing, but each iteration is more expensive, so it usually takes longer to run overall.
    • MIMIC gets more information into, and out of, each iteration.
    2016-02-17: finished the Genetic Algorithms part
    2016-02-19: first draft
    
