NVIDIA CUDA Learning Note 1

Author: 戬杨Jason | Published: 2018-02-08 22:53

    1) CPU Architecture

    • Pipelining
    • Branch Prediction
    • Superscalar
    • Out-of-Order Execution
    • Memory Hierarchy
    • Vector Operation
    • Multi-core

    What is a CPU?

    • Executes instructions and processes data
    • Provides additional complex functions
    • Contains many transistors

    What is an instruction?

    For example:
    arithmetic: add r3, r4 -> r4
    memory access: load [r4] -> r7
    control: jz end

    Optimization objective:

    execution time = instructions * cycles/instruction * seconds/cycle

    CPI (clock cycles per instruction) & clock period
    The factors are not independent; sometimes an increase in CPI is accompanied by a decrease in the number of instructions.
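    The formula can be made concrete with a small worked example (the numbers below are illustrative, not from the note):

```python
# Execution time = instructions * CPI * seconds per cycle.
# All numbers are illustrative, not measurements.
instructions = 1_000_000   # dynamic instruction count
cpi = 1.5                  # average clock cycles per instruction
clock_hz = 2_000_000_000   # 2 GHz clock -> seconds/cycle = 1 / clock_hz

exec_time = instructions * cpi / clock_hz
print(exec_time)  # 0.00075 (seconds)
```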

    Desktop Programs

    Lightly threaded
    Lots of branches
    Lots of memory accesses
    Most desktop programs deal with data movement rather than numeric computation.

    Moore's Law

    The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
    What do we do with our transistor budget?



    An 8-core processor contains about 2.2 billion transistors; most of the chip area is devoted to I/O and storage rather than computation.

    Pipelining

    Several steps involved in executing an instruction:
    Fetch -> Decode -> Execute -> Memory -> Writeback
    These steps can be split into separate pipeline stages.
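    The throughput benefit can be sketched with an idealized cycle count (assuming no stalls or hazards; a sketch, not a model of any real CPU):

```python
# Idealized cycle counts for executing n instructions on a k-stage pipeline.
def unpipelined_cycles(n, k):
    return n * k          # each instruction uses all k stages before the next starts

def pipelined_cycles(n, k):
    return k + (n - 1)    # fill the pipeline once, then retire one per cycle

print(unpipelined_cycles(100, 5))  # 500
print(pipelined_cycles(100, 5))    # 104
```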



    Pros

    • Instruction level parallelism (ILP)
    • Significantly reduced clock period.

    Cons

    • Slight latency & area increase (pipeline latches)
    • Dependencies between instructions
    • Branches must be handled
    • Diminishing returns as pipelines lengthen

    Bypassing


    If two instructions are dependent (for example, an ADD that needs the R7 result of a preceding SUB), bypassing forwards R7 directly to the later instruction instead of waiting for the SUB to drain the whole pipeline.

    Stalls


    If a load has not finished, the pipeline must stall and wait.

    Branch


    Branch Prediction

    Guess which instruction comes next, based on branch history.
    Example: two-level predictor with global history

    • Maintain a history table of outcomes for M successive branches
    • Index it with the past N outcomes (history register)
    • Sandy Bridge employs a 32-bit history register

    Modern predictors achieve > 90% accuracy.
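    A minimal sketch of such a two-level predictor (the history width and 2-bit counters are illustrative choices, not Sandy Bridge's actual design):

```python
# Two-level branch predictor: a global history register (GHR) of the
# last m outcomes indexes a table of saturating 2-bit counters.
class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.m = history_bits
        self.ghr = 0                               # global history register
        self.counters = [1] * (1 << history_bits)  # start weakly not-taken

    def predict(self):
        return self.counters[self.ghr] >= 2        # taken if counter in upper half

    def update(self, taken):
        c = self.counters[self.ghr]
        self.counters[self.ghr] = min(3, c + 1) if taken else max(0, c - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)

p = TwoLevelPredictor()
hits = 0
for taken in [True, False] * 50:   # perfectly alternating branch
    hits += p.predict() == taken
    p.update(taken)
print(hits)  # learns the alternating pattern after a short warm-up
```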

    Pros:
    Raise performance and energy efficiency
    Cons;
    Area increase
    Potential fetch stage latency increase

    Predication

    Replace branches with conditional instructions
    Avoids branch predictor

    • Avoids the predictor's area penalty and the misprediction penalty

    GPUs also use predication
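    The idea can be illustrated in Python, computing both paths and selecting with the predicate (on real hardware this maps to conditional-move/select instructions; the Python form only illustrates the concept):

```python
# Branchy version: control flow depends on the data.
def abs_branch(x):
    if x < 0:
        return -x
    return x

# "Predicated" version: both values are formed, a predicate selects one,
# so no branch (and no branch predictor) is involved.
def abs_predicated(x):
    neg = x < 0                      # the predicate
    return -x * neg + x * (not neg)  # select via arithmetic

print(abs_predicated(-7))  # 7
print(abs_predicated(3))   # 3
```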

    Increase IPC

    • A scalar pipeline is limited to an IPC of 1 instruction/clock
    • Superscalar: increase the width of the pipeline

    Superscalar

    Peak IPC is N (for N-way superscalar)



    Scheduling

    xor r1, r2 -> r3
    add r3, r4 -> r4

    sub r5, r2 -> r3
    addi r3, 1 -> r1

    xor and add: Read-After-Write (RAW)
    sub and addi: RAW
    xor and sub: Write-After-Write (WAW)
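    The hazard classification above can be sketched as a small checker (instruction encodings are made up for illustration):

```python
# Classify hazards between two instructions, each given as
# (set of registers read, set of registers written).
def hazards(first, second):
    r1, w1 = first
    r2, w2 = second
    kinds = set()
    if w1 & r2: kinds.add("RAW")  # second reads what first writes
    if w1 & w2: kinds.add("WAW")  # both write the same register
    if r1 & w2: kinds.add("WAR")  # second writes what first reads
    return kinds

xor = ({"r1", "r2"}, {"r3"})
add = ({"r3", "r4"}, {"r4"})
sub = ({"r5", "r2"}, {"r3"})

print(hazards(xor, add))  # {'RAW'}
print(hazards(xor, sub))  # {'WAW'}
```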

    Register Renaming

    xor r1, r2 -> r6
    add r6, r4 -> r7

    sub r5, r2 -> r8
    addi r8, 1 -> r9

    After renaming, xor and sub can execute in parallel.
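    A minimal renaming pass can be sketched as follows (register names and the free-list scheme are simplified assumptions): each destination gets a fresh physical register, which removes the WAW/WAR hazards while true RAW dependences remain.

```python
# Rename architectural registers to fresh physical registers.
# Each instruction is (list of source regs, destination reg).
def rename(instrs, num_arch_regs=6):
    mapping = {f"r{i}": f"r{i}" for i in range(1, num_arch_regs + 1)}
    fresh = num_arch_regs
    out = []
    for srcs, dst in instrs:
        srcs = [mapping[s] for s in srcs]  # sources read the current mapping
        fresh += 1
        mapping[dst] = f"r{fresh}"         # destination gets a fresh register
        out.append((srcs, mapping[dst]))
    return out

prog = [(["r1", "r2"], "r3"),  # xor
        (["r3", "r4"], "r4"),  # add
        (["r5", "r2"], "r3"),  # sub
        (["r3"],       "r1")]  # addi
print(rename(prog))
```

    Note that sub now writes a different physical register than xor, so the two no longer conflict.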

    Out-of-Order(OoO) Execution

    Reorder instructions so they execute as soon as their operands are ready.
    Fetch -> Decode -> Rename -> Dispatch -> Issue ->
    Register-Read - > Execute -> Memory -> Writeback ->
    Commit

    Reorder Buffer
    Issue Queue/Scheduler

    Pros:
    IPC close to the ideal
    Cons:
    Area increase
    Power cost

    Modern Desktop/Mobile In-order CPUs
    • Intel Atom
    • ARM Cortex-A8
    • Qualcomm Scorpion
    Modern Desktop/Mobile OoO CPUs
    • Intel Pentium Pro and onwards
    • ARM Cortex-A9
    • Qualcomm Krait

    Memory Hierarchy


    Caching

    Keep data as close to the processor as possible.

    • Temporal locality
    • Spatial locality
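    A toy direct-mapped cache shows why spatial locality pays off: a sequential scan misses once per line and then hits. The cache parameters below are made up for illustration.

```python
# Direct-mapped cache model: one tag per line, no associativity.
def hit_rate(addresses, num_lines=64, line_size=8):
    tags = [None] * num_lines
    hits = 0
    for a in addresses:
        block = a // line_size      # which memory block this address is in
        idx = block % num_lines     # which cache line the block maps to
        if tags[idx] == block:
            hits += 1
        else:
            tags[idx] = block       # miss: fill the line
    return hits / len(addresses)

seq = list(range(512))              # sequential scan: 1 miss per 8-byte line
print(hit_rate(seq))                # 0.875
```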

    CPU Parallelism

    • Instruction-Level Parallelism (ILP)
    • Data-Level Parallelism (vectors)
    • Thread-Level Parallelism (TLP)

    Vector Motivation

    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i];

    Single Instruction, Multiple Data (SIMD)
    // in parallel
    A[i]   = B[i]   + C[i];
    A[i+1] = B[i+1] + C[i+1];
    A[i+2] = B[i+2] + C[i+2];
    A[i+3] = B[i+3] + C[i+3];
    A[i+4] = B[i+4] + C[i+4];
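    The idea can be sketched in Python, with a list slice standing in for a hardware vector register (the lane count W is an assumption for illustration, e.g. 4 x 32-bit floats in an SSE register):

```python
# SIMD sketch: one "vector instruction" processes W adjacent elements.
W = 4  # assumed vector width

def vector_add(A, B, C):
    for i in range(0, len(A), W):
        # one vector add covers lanes i .. i+W-1
        A[i:i+W] = [b + c for b, c in zip(B[i:i+W], C[i:i+W])]

A = [0] * 8
vector_add(A, [1, 2, 3, 4, 5, 6, 7, 8], [10] * 8)
print(A)  # [11, 12, 13, 14, 15, 16, 17, 18]
```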

    x86 Vector Extensions

    • SSE2
    • AVX

    Thread-Level Parallelism

    Programmers can create and destroy threads.
    The programmer or the OS dispatches them.
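    A minimal create/dispatch/join cycle, sketched with Python's threading module (the work function is a made-up example):

```python
import threading

# The programmer creates the threads; the OS schedules (dispatches) them.
results = [0] * 4

def work(i):
    results[i] = i * i  # each thread handles its own slice of work

threads = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()           # create and hand off to the OS scheduler
for t in threads:
    t.join()            # wait for completion before using the results
print(results)          # [0, 1, 4, 9]
```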

    Multicore

    Locks, Coherence and Consistency

    • Multiple threads access the same data
    • Coherence: which copy of the data is the correct one
    • Consistency: in what order updates become visible
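    A lock illustrates the first point: without one, a concurrent read-modify-write on shared data can lose updates, while the lock serializes the critical section. A minimal sketch:

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # critical section: one thread at a time
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # 40000: no updates lost
```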

    Power Wall

    Raising the CPU's clock frequency raises its power consumption, so power density cannot be increased without limit.

    CPUs are optimized for serial programs.

    Source: https://www.haomeiwen.com/subject/pdmxtftx.html