NVIDIA CUDA Learning Note 1

Author: 戬杨Jason | Published: 2018-02-08 22:53

    1) CPU Architecture

    • Pipelining
    • Branch Prediction
    • Superscalar
    • Out-of-Order Execution
    • Memory Hierarchy
    • Vector Operation
    • Multi-core

    What is a CPU?

    • Executes instructions and processes data
    • Provides additional complex functions
    • Contains many transistors

    What is an instruction?

    For example:
    arithmetic: add r3, r4 -> r4
    memory access: load [r4] -> r7
    control: jz end

    Optimization objective:

    execution time = instructions * cycles/instruction * seconds/cycle

    CPI (clock cycles per instruction) & clock period
    The factors are not independent; sometimes an increase in CPI is accompanied by a decrease in the number of instructions.
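    The formula can be made concrete with a small worked example (the numbers below are illustrative, not from the note):

```python
# Execution time = instructions * CPI * seconds per cycle.
# All numbers are illustrative, not measurements.
instructions = 1_000_000   # dynamic instruction count
cpi = 1.5                  # average clock cycles per instruction
clock_hz = 2_000_000_000   # 2 GHz clock -> seconds/cycle = 1 / clock_hz

exec_time = instructions * cpi / clock_hz
print(exec_time)  # 0.00075 (seconds)
```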

    Desktop Programs

    Lightly threaded
    Lots of branches
    Lots of memory accesses
    Most desktop programs deal with data movement rather than numeric computation.

    Moore's Law

    The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
    What do we do with our transistor budget?



    An 8-core processor contains about 2.2 billion transistors; most of the chip area is devoted to I/O and storage rather than computation.

    Pipelining

    Several steps involved in executing an instruction:
    Fetch -> Decode -> Execute -> Memory -> Writeback
    These steps can be split into separate pipeline stages.
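    The throughput benefit can be sketched with an idealized cycle count (assuming no stalls or hazards; a sketch, not a model of any real CPU):

```python
# Idealized cycle counts for executing n instructions on a k-stage pipeline.
def unpipelined_cycles(n, k):
    return n * k          # each instruction uses all k stages before the next starts

def pipelined_cycles(n, k):
    return k + (n - 1)    # fill the pipeline once, then retire one per cycle

print(unpipelined_cycles(100, 5))  # 500
print(pipelined_cycles(100, 5))    # 104
```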



    Pros

    • Instruction level parallelism (ILP)
    • Significantly reduced clock period.

    Cons

    • Slight latency & area increase (pipeline latches)
    • Dependencies between instructions
    • Branches must be handled
    • Diminishing returns as pipelines lengthen

    Bypassing


    If two instructions are dependent (for example, an ADD that needs the R7 result of a preceding SUB), bypassing forwards R7 directly to the later instruction instead of waiting for the SUB to drain the whole pipeline.

    Stalls


    If a load has not finished, the pipeline must stall and wait.

    Branch


    Branch Prediction

    Guess which instruction comes next, based on branch history.
    Example: two-level predictor with global history

    • Maintain a history table of outcomes for M successive branches
    • Index it with the past N outcomes (history register)
    • Sandy Bridge employs a 32-bit history register

    Modern predictors achieve > 90% accuracy.
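    A minimal sketch of such a two-level predictor (the history width and 2-bit counters are illustrative choices, not Sandy Bridge's actual design):

```python
# Two-level branch predictor: a global history register (GHR) of the
# last m outcomes indexes a table of saturating 2-bit counters.
class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.m = history_bits
        self.ghr = 0                               # global history register
        self.counters = [1] * (1 << history_bits)  # start weakly not-taken

    def predict(self):
        return self.counters[self.ghr] >= 2        # taken if counter in upper half

    def update(self, taken):
        c = self.counters[self.ghr]
        self.counters[self.ghr] = min(3, c + 1) if taken else max(0, c - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)

p = TwoLevelPredictor()
hits = 0
for taken in [True, False] * 50:   # perfectly alternating branch
    hits += p.predict() == taken
    p.update(taken)
print(hits)  # learns the alternating pattern after a short warm-up
```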

    Pros:
    Raise performance and energy efficiency
    Cons;
    Area increase
    Potential fetch stage latency increase

    Predication

    Replace branches with conditional instructions
    Avoids branch predictor

    • Avoids the predictor's area penalty and the misprediction penalty

    GPUs also use predication
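    The idea can be illustrated in Python, computing both paths and selecting with the predicate (on real hardware this maps to conditional-move/select instructions; the Python form only illustrates the concept):

```python
# Branchy version: control flow depends on the data.
def abs_branch(x):
    if x < 0:
        return -x
    return x

# "Predicated" version: both values are formed, a predicate selects one,
# so no branch (and no branch predictor) is involved.
def abs_predicated(x):
    neg = x < 0                      # the predicate
    return -x * neg + x * (not neg)  # select via arithmetic

print(abs_predicated(-7))  # 7
print(abs_predicated(3))   # 3
```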

    Increase IPC

    • A scalar pipeline is limited to an IPC of 1 instruction/clock
    • Superscalar: increase the width of the pipeline

    Superscalar

    Peak IPC is N (for N-way superscalar)



    Scheduling

    xor r1, r2 -> r3
    add r3, r4 -> r4

    sub r5, r2 -> r3
    addi r3, 1 -> r1

    xor and add: Read-After-Write (RAW)
    sub and addi: RAW
    xor and sub: Write-After-Write (WAW)
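    The hazard classification above can be sketched as a small checker (instruction encodings are made up for illustration):

```python
# Classify hazards between two instructions, each given as
# (set of registers read, set of registers written).
def hazards(first, second):
    r1, w1 = first
    r2, w2 = second
    kinds = set()
    if w1 & r2: kinds.add("RAW")  # second reads what first writes
    if w1 & w2: kinds.add("WAW")  # both write the same register
    if r1 & w2: kinds.add("WAR")  # second writes what first reads
    return kinds

xor = ({"r1", "r2"}, {"r3"})
add = ({"r3", "r4"}, {"r4"})
sub = ({"r5", "r2"}, {"r3"})

print(hazards(xor, add))  # {'RAW'}
print(hazards(xor, sub))  # {'WAW'}
```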

    Register Renaming

    xor r1, r2 -> r6
    add r6, r4 -> r7

    sub r5, r2 -> r8
    addi r8, 1 -> r9

    After renaming, xor and sub can execute in parallel.
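    A minimal renaming pass can be sketched as follows (register names and the free-list scheme are simplified assumptions): each destination gets a fresh physical register, which removes the WAW/WAR hazards while true RAW dependences remain.

```python
# Rename architectural registers to fresh physical registers.
# Each instruction is (list of source regs, destination reg).
def rename(instrs, num_arch_regs=6):
    mapping = {f"r{i}": f"r{i}" for i in range(1, num_arch_regs + 1)}
    fresh = num_arch_regs
    out = []
    for srcs, dst in instrs:
        srcs = [mapping[s] for s in srcs]  # sources read the current mapping
        fresh += 1
        mapping[dst] = f"r{fresh}"         # destination gets a fresh register
        out.append((srcs, mapping[dst]))
    return out

prog = [(["r1", "r2"], "r3"),  # xor
        (["r3", "r4"], "r4"),  # add
        (["r5", "r2"], "r3"),  # sub
        (["r3"],       "r1")]  # addi
print(rename(prog))
```

    Note that sub now writes a different physical register than xor, so the two no longer conflict.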

    Out-of-Order(OoO) Execution

    Reorder instructions so they execute as soon as their operands are ready.
    Fetch -> Decode -> Rename -> Dispatch -> Issue ->
    Register-Read - > Execute -> Memory -> Writeback ->
    Commit

    Reorder Buffer
    Issue Queue/Scheduler

    Pros:
    IPC close to the ideal
    Cons:
    Area increase
    Power cost

    Modern Desktop/Mobile In-order CPUs
    • Intel Atom
    • ARM Cortex-A8
    • Qualcomm Scorpion
    Modern Desktop/Mobile OoO CPUs
    • Intel Pentium Pro and onwards
    • ARM Cortex-A9
    • Qualcomm Krait

    Memory Hierarchy


    Caching

    Keep data as close to the processor as possible.

    • Temporal locality
    • Spatial locality
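    A toy direct-mapped cache shows why spatial locality pays off: a sequential scan misses once per line and then hits. The cache parameters below are made up for illustration.

```python
# Direct-mapped cache model: one tag per line, no associativity.
def hit_rate(addresses, num_lines=64, line_size=8):
    tags = [None] * num_lines
    hits = 0
    for a in addresses:
        block = a // line_size      # which memory block this address is in
        idx = block % num_lines     # which cache line the block maps to
        if tags[idx] == block:
            hits += 1
        else:
            tags[idx] = block       # miss: fill the line
    return hits / len(addresses)

seq = list(range(512))              # sequential scan: 1 miss per 8-byte line
print(hit_rate(seq))                # 0.875
```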

    CPU Parallelism

    • Instruction-Level Parallelism (ILP)
    • Data-Level Parallelism (vectors)
    • Thread-Level Parallelism (TLP)

    Vector Motivation

    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i];

    Single Instruction, Multiple Data (SIMD)
    // in parallel
    A[i]   = B[i]   + C[i];
    A[i+1] = B[i+1] + C[i+1];
    A[i+2] = B[i+2] + C[i+2];
    A[i+3] = B[i+3] + C[i+3];
    A[i+4] = B[i+4] + C[i+4];
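    The idea can be sketched in Python, with a list slice standing in for a hardware vector register (the lane count W is an assumption for illustration, e.g. 4 x 32-bit floats in an SSE register):

```python
# SIMD sketch: one "vector instruction" processes W adjacent elements.
W = 4  # assumed vector width

def vector_add(A, B, C):
    for i in range(0, len(A), W):
        # one vector add covers lanes i .. i+W-1
        A[i:i+W] = [b + c for b, c in zip(B[i:i+W], C[i:i+W])]

A = [0] * 8
vector_add(A, [1, 2, 3, 4, 5, 6, 7, 8], [10] * 8)
print(A)  # [11, 12, 13, 14, 15, 16, 17, 18]
```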

    x86 Vector Extensions

    • SSE2
    • AVX

    Thread-Level Parallelism

    Programmers can create and destroy threads.
    The programmer or the OS dispatches them.
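    A minimal create/dispatch/join cycle, sketched with Python's threading module (the work function is a made-up example):

```python
import threading

# The programmer creates the threads; the OS schedules (dispatches) them.
results = [0] * 4

def work(i):
    results[i] = i * i  # each thread handles its own slice of work

threads = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()           # create and hand off to the OS scheduler
for t in threads:
    t.join()            # wait for completion before using the results
print(results)          # [0, 1, 4, 9]
```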

    Multicore

    Locks, Coherence and Consistency

    • Multiple threads access the same data
    • Coherence: which copy of the data is the correct one
    • Consistency: in what order updates become visible
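    A lock illustrates the first point: without one, a concurrent read-modify-write on shared data can lose updates, while the lock serializes the critical section. A minimal sketch:

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # critical section: one thread at a time
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # 40000: no updates lost
```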

    Power Wall

    Raising the CPU's clock frequency raises its power consumption, so power density cannot be increased without limit.

    CPUs are optimized for serial programs.

    Source: https://www.haomeiwen.com/subject/pdmxtftx.html