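@[TOC](Embedded Learning, Part 1 (ARM architecture optimization))
ARM 32-bit architecture optimization
When compiling for embedded devices (that is, ARM boards), it is best to add -fsigned-char, because plain char defaults to unsigned char on ARM toolchains rather than to signed char as on x86. In addition, when compiling ARM assembly optimization code, the -c option needs to be added to the compile options.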
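The effect of the default char signedness can be checked in plain C. The following is a minimal sketch, not taken from the original post; the helper names `plain_char_is_signed`, `byte_as_s8` and `byte_through_plain_char` are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/* Returns 1 if plain `char` is signed for this compiler and target,
 * 0 if it is unsigned (the usual ARM default without -fsigned-char). */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Reads a raw byte as a signed 8-bit value using explicit arithmetic,
 * so the result does not depend on the signedness of plain `char`. */
int byte_as_s8(uint8_t b)
{
    return (b >= 128) ? (int)b - 256 : (int)b;
}

/* Reads the same byte through plain `char`; the result follows the
 * target default or the -fsigned-char / -funsigned-char option. */
int byte_through_plain_char(uint8_t b)
{
    char c = (char)b;
    return (int)c;
}
```

Code that must behave the same on x86 and ARM can either be built with -fsigned-char, as suggested above, or spell out `signed char` / `unsigned char` wherever the sign matters.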
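Introduction to ARM assembly syntax
GNU assembler learning resources:
Pure ARM assembly comes in two syntaxes: armasm syntax and GNU asm syntax.
- Common ARM syntax
- Defining a function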
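.text
.align 4                  // on ARM the argument is a power of two: 2^4 = 16-byte alignment
.global name
.type name, %function     // mark the symbol `name` as a function
name:
FUNCTION STATEMENT
bx lr                     // return to the caller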
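- Defining a macro
.macro name arg1, arg2, arg3
ldr r0, \arg1
vst1.32 {\arg2\()[0]}, [r0]
.endm
This example shows that a macro parameter is referenced with \, and that \() is used as a separator in special cases. Suppose arg2 is the d0 register and the data in d0[0] has to be stored to the address held in r0: writing \arg2\()[0] marks the end of the parameter name explicitly, so the assembler cannot treat the following [0] as part of the parameter name, which is the risk with \arg2[0].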
- Using .ltorg
If the constant area is too far from the code area and the current function needs to access a constant stored there, add .ltorg before the start of the current function and after the end of the previous one; otherwise the assembler reports an error.
.ltorg Insert the literal pool of constants at this point in the program. The literal pool is used by the ldr = and adrl assembly language pseudo-instructions and is specific to the ARM. Using this assembler directive is almost always optional, as the GNU Assembler is smart enough to figure out when and where to put any literal pool. However, there are situations when it is very useful to include this directive, such as when you need absolute control over where the assembler places your code.
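- Comments
Comments can be written in several forms, but to make it easier to port ARM32 optimized code to ARM64, prefer "//" or "/* */", because ARM64 assembly does not support comments that start with @. Use /* */ for multi-line comments and // for single-line comments; note that // requires the file to have the .S suffix, since .S files are run through the C preprocessor.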
Inline comment char: ‘@’
Line comment char: ‘#’
Statement separator: ‘;’
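Introduction to the ARM 32-bit architecture
ARM registers
ARM has sixteen 32-bit general-purpose registers (R0-R15); the register list is shown in Figure 3-5. Note that R14 (LR) holds the return address when a subroutine is called, and R0~R3 are used to pass function arguments. The remaining registers (R4~R11, and LR if the callee makes further calls) must be pushed and restored when the callee uses them. R12 is special: the callee may use it without pushing it. For the detailed calling rules see ATPCS (reference URL 2, 5.1.1 Core registers). ATPCS uses a full descending stack (STMFD/LDMFD).
References: URL 1, URL 2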
NEON registers
NEON was first implemented on the ARM Cortex-A8 processor, in the ARMv7 architecture (ARMv7-A and ARMv7-R profiles). There are sixteen 128-bit Q registers and thirty-two 64-bit D registers (from reference URL 1, 5.1.1 Core registers); the register list is shown in Figure A2-1 (from reference URL 2, A2.6.1 Advanced SIMD and VFP extension registers). Note that S0 is the low 32 bits of D0 and S1 is the high 32 bits of D0; likewise, D0 is the low 64 bits of Q0 and D1 is the high 64 bits of Q0. The relationship among the S, D and Q registers is:
The mapping between the registers is as follows:
• S<2n> maps to the least significant half of D<n>
• S<2n+1> maps to the most significant half of D<n>
• D<2n> maps to the least significant half of Q<n>
• D<2n+1> maps to the most significant half of Q<n>.
For example, you can access the least significant half of the elements of a vector in Q6 by referring to D12,
and the most significant half of the elements by referring to D13.
Note: d8-d15 (q4-q7) must be saved on the stack when they are used in a subroutine. See reference URL 3: 5.1.2.1 VFP register usage conventions (VFP v2, v3 and the Advanced SIMD Extension)
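The S/D/Q aliasing rules above reduce to simple index arithmetic. Below is a small illustrative sketch; the function names are hypothetical and only model the register numbering, they do not touch any hardware register.

```c
#include <assert.h>

/* Index of the D register that contains S<s> (S0..S31 alias D0..D15). */
unsigned d_of_s(unsigned s)
{
    assert(s < 32);
    return s / 2;
}

/* 1 if S<s> is the most significant half of its D register, 0 if it is
 * the least significant half. */
unsigned s_is_high_half(unsigned s)
{
    assert(s < 32);
    return s % 2;
}

/* Index of the Q register that contains D<d> (D0..D31 alias Q0..Q15). */
unsigned q_of_d(unsigned d)
{
    assert(d < 32);
    return d / 2;
}

/* D register holding the least significant half of Q<q>. */
unsigned d_low_of_q(unsigned q)
{
    assert(q < 16);
    return 2 * q;
}

/* D register holding the most significant half of Q<q>. */
unsigned d_high_of_q(unsigned q)
{
    assert(q < 16);
    return 2 * q + 1;
}
```

For instance, `d_low_of_q(6)` and `d_high_of_q(6)` give 12 and 13, matching the Q6/D12/D13 example quoted above.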
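NEON instruction set
- ARMv7/AArch32 instruction format
All NEON instructions carry the mnemonic prefix V. Taking 32-bit instructions as an example, the general format is as follows (see reference URL 1, Armv7-A/AArch32 instruction syntax):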
V{<mod>}<op>{<shape>}{<cond>}{.<dt>}{<dest>}, src1, src2
- <mod>:
  - Q: The instruction uses saturating arithmetic, so that the result is saturated within the range of the specified data type, such as VQABS, VQSHL etc.
  - H: The instruction will halve the result. It does this by shifting right by one place (effectively a divide by two with truncation), such as VHADD, VHSUB.
  - D: The instruction doubles the result, such as VQDMULL, VQDMLAL, VQDMLSL and VQ{R}DMULH.
  - R: The instruction will perform rounding on the result, equivalent to adding 0.5 to the result before truncating, such as VRHADD, VRSHR.
- <op>: the operation (for example, ADD, SUB, MUL).
- <cond>: Condition, used with IT instruction.
- <.dt>: Data type, such as s8, u8, f32 etc.
- <dest>: Destination.
- <src1>: Source operand 1.
- <src2>: Source operand 2.
- <shape>: Shape, that is, the NEON data processing type: Long (L), Wide (W), Narrow (N).
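The Q, H and R modifiers in the field list above can be pictured per lane with scalar C. This is only a sketch of the arithmetic for unsigned 8-bit lanes; the function names are hypothetical and the real instructions apply the same operation to every lane of a vector at once.

```c
#include <stdint.h>

/* One lane of VQADD.U8: saturating add, the result is clamped to 255. */
uint8_t qadd_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (sum > 255u) ? 255u : (uint8_t)sum;
}

/* One lane of VHADD.U8: halving add, (a + b) >> 1 with truncation.
 * The sum is formed in a wider type, so it cannot overflow. */
uint8_t hadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b) >> 1);
}

/* One lane of VRHADD.U8: rounding halving add, (a + b + 1) >> 1,
 * i.e. 0.5 is added before truncating. */
uint8_t rhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}
```

With a = 3 and b = 4 the halving add yields 3 while the rounding variant yields 4, which is the whole difference between the H and RH forms.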
NEON data processing types are divided into Normal, Long, Wide and Narrow:
- Normal instructions can operate on any vector types, and produce result vectors the same size, and usually the same type, as the operand vectors.
- Long instructions operate on doubleword vector operands and produce a quadword vector result. The result elements are usually twice the width of the operands, and of the same type. Long instructions are specified using an L appended to the instruction.
- Wide instructions operate on a doubleword vector operand and a quadword vector operand, producing a quadword vector result. The result elements and the first operand are twice the width of the second operand elements. Wide instructions have a W appended to the instruction.
- Narrow instructions operate on quadword vector operands, and produce a doubleword vector result. The result elements are usually half the width of the operand elements. Narrow instructions are specified using an N appended to the instruction.
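The Long, Wide and Narrow shapes differ only in the element widths of their operands and results. The sketch below models one lane of each in scalar C, using VADDL.U8, VADDW.U8 and VMOVN.I16 as representative instructions; the function names are hypothetical.

```c
#include <stdint.h>

/* Long (VADDL.U8): two 8-bit elements in, one 16-bit element out.
 * The widened result can hold every possible sum. */
uint16_t addl_u8(uint8_t a, uint8_t b)
{
    return (uint16_t)((uint16_t)a + (uint16_t)b);
}

/* Wide (VADDW.U8): a 16-bit element plus an 8-bit element gives a
 * 16-bit element; the result wraps modulo 2^16. */
uint16_t addw_u8(uint16_t a, uint8_t b)
{
    return (uint16_t)(a + b);
}

/* Narrow (VMOVN.I16): a 16-bit element in, its low 8 bits out. */
uint8_t movn_i16(uint16_t a)
{
    return (uint8_t)(a & 0xFFu);
}
```

A Long add is the usual way to sum 8-bit pixels without overflow: 200 + 100 does not fit in 8 bits but fits in the 16-bit result lane.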
ARM 32-bit architecture instruction manual
Optimization
NEON optimization tips
- Skill1: Reduce data dependencies
On the ARMv7-A platform, NEON instructions usually take more cycles than ARM instructions. To reduce instruction latency, avoid using the destination register of the current instruction as the source register of the next instruction.
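The same idea can be shown at the C level: splitting one accumulator into several independent ones removes the chain in which every operation waits for the previous result. This is an illustrative sketch with hypothetical function names; whether it actually helps on a given core has to be confirmed by profiling.

```c
#include <stddef.h>
#include <stdint.h>

/* Single accumulator: every addition depends on the previous one. */
uint32_t sum_dependent(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
    }
    return sum;
}

/* Four independent accumulators: consecutive additions write different
 * variables, so none of them has to wait for the one just before it. */
uint32_t sum_independent(const uint8_t *p, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; i++) {   /* leftover elements */
        s0 += p[i];
    }
    return s0 + s1 + s2 + s3;
}
```

In hand-written NEON code the equivalent is to interleave instructions that work on different registers, for example by processing two rows of data alternately.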
- Skill2: Reduce branches
The NEON instruction set has no branch instruction; when a branch is needed, the ARM branch instructions are used. Branch prediction is widely used in ARM processors, but a mispredicted branch is expensive, so it is better to avoid branch instructions in optimized assembly. In some cases logical operations can replace a branch.
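Replacing a branch with logical operations usually means building an all-ones or all-zeros mask from the condition and using it to select between two values, which is what a NEON compare (for example VCGT) followed by a bitwise select (VBSL) does per lane. A scalar sketch, with hypothetical function names:

```c
#include <stdint.h>

/* Returns a when cond is non-zero and b otherwise, without branching.
 * mask is 0xFFFFFFFF when the condition holds and 0 when it does not. */
uint32_t select_u32(uint32_t cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}

/* Branch-free maximum built on top of the select. */
uint32_t max_u32(uint32_t a, uint32_t b)
{
    return select_u32(a > b, a, b);
}

/* Branch-free clamp of a value to the 0..255 range of a u8 pixel. */
uint32_t clamp_to_u8(uint32_t v)
{
    return select_u32(v > 255u, 255u, v);
}
```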
- Skill3: Using the preload instruction PLD
ARM processors are a load/store system. Except load/store instructions, all operations perform on registers. Therefore increasing the efficiency of load/store instructions is very important for optimizing an application.
The preload instruction allows the processor to signal the memory system that a data load from an address is likely in the near future. If the data is preloaded into the cache correctly, it helps to improve the cache hit rate, which can boost performance significantly. But preload is not a panacea: it is very hard to use well on recent processors, and it can be harmful too. A bad preload will reduce performance.
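From C, a preload hint can be issued with the GCC/Clang builtin `__builtin_prefetch`, which these compilers typically lower to PLD on ARM. The sketch below is an assumption-laden illustration: the function name and the prefetch distance of 64 bytes are hypothetical, and the right distance (or whether to prefetch at all) has to come from profiling on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in bytes; tune by measurement. */
#define PREFETCH_DISTANCE 64

/* Sums n bytes and hints the memory system about data that will be
 * needed PREFETCH_DISTANCE bytes ahead. Requires GCC or Clang. */
uint32_t sum_with_prefetch(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Issue one hint per PREFETCH_DISTANCE bytes, and only for
         * addresses that are still inside the buffer. */
        if ((i % PREFETCH_DISTANCE) == 0 && i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(p + i + PREFETCH_DISTANCE, 0, 3);
        }
        sum += p[i];
    }
    return sum;
}
```

The second argument (0) marks the access as a read and the third (3) asks for the data to be kept in all cache levels; both must be compile-time constants.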
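- Skill4: Misc
In ARM NEON programming, different instruction sequences can implement the same operation, but fewer instructions do not always mean better performance. The choice has to be based on benchmark and profiling results for the specific case. One such case from practice is described below.
Floating-point VMLA/VMLS instruction
Usually VMUL+VADD / VMUL+VSUB can be replaced by VMLA / VMLS, because fewer instructions are needed. However, compared with a floating-point VMUL, the floating-point VMLA/VMLS has a longer latency. If no other instructions can be scheduled into that latency gap, the floating-point VMUL+VADD / VMUL+VSUB sequence performs better.
Reference URL: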
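Debugging optimized code
- Add the following code to the assembly source (that is, the .S file)
.macro print_m in1=r0, in2=d0
push {r0-r3, r12, lr}        // save caller-saved registers; six words keep SP 8-byte aligned
vst1.64 {\in2}, [\in1]       // store the NEON register to the memory pointed to by in1
mov r0, \in1                 // first argument of cprintf
bl cprintf
pop {r0-r3, r12, lr}         // restore lr (not pc) so execution continues after the macro
.endm
Note: in1 should be an ARM register that holds a memory address, and in2 a NEON register such as D0.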
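- Add the following code to the C file
#include <stdio.h>
/* Prints the 16 bytes at srcu8, first as unsigned and then as signed values.
 * A D register fills only the first 8 bytes; a Q register fills all 16. */
void cprintf(unsigned char *srcu8)
{
    int i = 0;
    signed char *srcs8 = (signed char *)srcu8;
    for (i = 0; i < 16; i++) {
        printf("%d ", srcu8[i]);
    }
    for (i = 0; i < 16; i++) {
        printf("%d ", srcs8[i]);
    }
    printf("\n");
}
Reference URL: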
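ARM 64-bit architecture optimization
Introduction to the ARM 64-bit registers
ARM registers
- There are 31 64-bit general-purpose registers (X0~X30). Their low 32 bits are called the W registers (W0~W30), so Wn is the low half of Xn.
Reference: B1.2.1 Register in AArch64 state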
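- Note that register usage follows the AAPCS calling convention:
X0~X7: used to pass function arguments and return results. Generally, a single 64-bit result is returned in X0 and a single 128-bit result in X1:X0;
X8: holds the address of an indirect result (for example a large structure returned by value) and, for Linux system calls, the syscall number. The return address of a subroutine (here meaning the callee; the same applies below unless stated otherwise) is held in X30 (LR), not in X8;
X19~X28: callee-saved registers; a subroutine that uses them must save and restore them;
X18 (Platform Register, PR): a platform-specific register reserved for special purposes; do not use it.
Note: SP must stay 16-byte aligned, so take particular care when pushing Xn registers. For more information see: General language issues
English original, from: https://wiki.cdot.senecacollege.ca/wiki/Aarch64_Register_and_Instruction_Quick_Start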
r0-r7 are used for arguments and return values; additional arguments are on the stack
For syscalls, the syscall number is in r8
r9-r15 are for temporary values (may get trampled)
r16-r18 are used for intra-procedure-call and platform values (avoid)
The called routine is expected to preserve r19-r28 *** These registers are generally safe to use in your program.
r29 and r30 are used as the frame register and link register (avoid)
For details see: 5.1.1 General-purpose Registers
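The 16-byte SP alignment rule mentioned above explains why an odd number of X registers still consumes a whole extra 16-byte slot. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <stddef.h>

/* Rounds a byte count up to the next multiple of 16. */
size_t align_up_16(size_t bytes)
{
    return (bytes + 15u) & ~(size_t)15u;
}

/* Stack space needed to save n 64-bit X registers while keeping SP
 * 16-byte aligned: registers are 8 bytes each and are stored in pairs. */
size_t stack_bytes_for_xregs(unsigned n)
{
    return align_up_16((size_t)n * 8u);
}
```

Saving three X registers therefore takes 32 bytes, the same as saving four, which is why hand-written prologues usually push registers with paired stores.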
NEON registers
Scalar registers
- Depending on the data type, each register can be viewed as a scalar register of a different width:
a 128-bit register (Q0~Q31);
a 64-bit register (D0~D31);
a 32-bit register (S0~S31);
a 16-bit register (H0~H31);
an 8-bit register (B0~B31).
Note: S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half of D1, which is the bottom half of Q1, and so on.
Reference: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf page 54
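Unlike on ARM32, every AArch64 scalar view starts at bit 0 of the same 128-bit register. The sketch below models one register as two 64-bit halves and extracts the B, H, S and D views; the type `vreg_model` and the function names are hypothetical.

```c
#include <stdint.h>

/* Model of one 128-bit V register (hypothetical type). */
typedef struct {
    uint64_t lo; /* bits 63..0   */
    uint64_t hi; /* bits 127..64 */
} vreg_model;

/* Builds a model register from its high and low 64-bit halves. */
vreg_model make_vreg(uint64_t hi, uint64_t lo)
{
    vreg_model v;
    v.lo = lo;
    v.hi = hi;
    return v;
}

/* Bn, Hn, Sn and Dn are the low 8, 16, 32 and 64 bits of Vn. */
uint8_t  view_b(vreg_model v) { return (uint8_t)(v.lo & 0xFFu); }
uint16_t view_h(vreg_model v) { return (uint16_t)(v.lo & 0xFFFFu); }
uint32_t view_s(vreg_model v) { return (uint32_t)(v.lo & 0xFFFFFFFFu); }
uint64_t view_d(vreg_model v) { return v.lo; }
```

Because all views share bit 0, writing S1 on AArch64 does not disturb D0, in contrast with the ARM32 mapping where S1 is the upper half of D0.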
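Vector registers
- A 64-bit or 128-bit wide vector register holds one or more elements, and an index is used to access a particular element, for example V0.2D[0].
- Calling rules
V0~V7: used to pass function arguments and return results;
V8~V15: callee-saved; they must be pushed when used in a subroutine (only the low 64 bits, that is D8~D15, have to be preserved);
V0~V7 and V16~V31: caller-saved; the caller saves them if it still needs their values after the call.
Reference: 5.1.2 SIMD and Floating-Point Registers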
NEON instruction set
ARMv8/AArch64 instruction format
In the AArch64 execution state, the syntax of NEON instruction has changed. It can be described as follows:
{<prefix>}<op>{<suffix>} Vd.<T>, Vn.<T>, Vm.<T>
Where:
- <prefix> - prefix, such as S/U/F/P to represent the signed/unsigned/float/polynomial data type.
- <op> - operation, such as ADD, AND etc.
- <T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents a double-word (64-bit).
  For example:
  UADDLP V0.8H, V0.16B
  FADD V0.4S, V0.4S, V0.4S
- <suffix> - suffix
  - P: "pairwise" operations, such as ADDP.
  - V: the new reduction (across-all-lanes) operations, such as FMAXV.
  - 2: new widening/narrowing "second part" instructions, such as ADDHN2, SADDL2.
    ADDHN2: add two 128-bit vectors and produce a 64-bit vector result which is stored as high 64-bit part of NEON register.
    SADDL2: add two high 64-bit vectors of NEON register and produce a 128-bit vector result.
For more information, please refer to the documents listed in the Appendix.
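The P, V and 2 suffixes describe which lanes are combined. The following scalar sketch models individual result lanes of ADDV, ADDP, SADDL2 and UADDLP; the function names are hypothetical and arrays stand in for vector registers, with index 0 as the lowest lane.

```c
#include <assert.h>
#include <stdint.h>

/* ADDV Sd, Vn.4S: reduction across all four lanes (wraps modulo 2^32). */
uint32_t addv_4s(const uint32_t n[4])
{
    return n[0] + n[1] + n[2] + n[3];
}

/* ADDP Vd.4S, Vn.4S, Vm.4S: lane i of the pairwise result. Lanes 0-1
 * come from adjacent pairs of Vn, lanes 2-3 from adjacent pairs of Vm. */
uint32_t addp_4s_lane(const uint32_t n[4], const uint32_t m[4], unsigned i)
{
    assert(i < 4);
    const uint32_t *src = (i < 2) ? n : m;
    unsigned k = (i % 2) * 2;
    return src[k] + src[k + 1];
}

/* SADDL2 Vd.4S, Vn.8H, Vm.8H: lane i of the result. Only the upper four
 * 16-bit lanes of each source are used, widened to 32 bits. */
int32_t saddl2_4s_lane(const int16_t n[8], const int16_t m[8], unsigned i)
{
    assert(i < 4);
    return (int32_t)n[4 + i] + (int32_t)m[4 + i];
}

/* UADDLP Vd.8H, Vn.16B: lane i of the result, the widened sum of the
 * adjacent byte pair (2i, 2i+1). */
uint16_t uaddlp_8h_lane(const uint8_t n[16], unsigned i)
{
    assert(i < 8);
    return (uint16_t)((uint16_t)n[2 * i] + (uint16_t)n[2 * i + 1]);
}
```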
Reference URL 1:
Reference URL 2:
Post-index and pre-index addressing in instructions
Reference: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf page 150
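The two addressing modes differ only in when the base register is updated relative to the memory access. The sketch below models that with plain integers; the struct `index_result` and the function names are hypothetical, and the example instructions in the comments are the ones used by the push/pop macros later in this article.

```c
#include <stdint.h>

typedef struct {
    uint64_t access_addr; /* address used for the memory access        */
    uint64_t new_base;    /* base register value after the instruction */
} index_result;

/* Pre-index, e.g. `stp d8, d9, [sp, #-16]!`:
 * the base is updated first and the access uses the updated value. */
index_result pre_index(uint64_t base, int64_t offset)
{
    index_result r;
    r.new_base = base + (uint64_t)offset;
    r.access_addr = r.new_base;
    return r;
}

/* Post-index, e.g. `ldp d8, d9, [sp], #16`:
 * the access uses the old base and the base is updated afterwards. */
index_result post_index(uint64_t base, int64_t offset)
{
    index_result r;
    r.access_addr = base;
    r.new_base = base + (uint64_t)offset;
    return r;
}
```

A pre-indexed store with a negative offset followed later by a post-indexed load with the matching positive offset therefore behaves like push and pop: both touch the same address and SP ends up where it started.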
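ARM 64-bit architecture instruction manual
Porting ARM32 optimizations to AArch64
For details see: Porting ARM32 optimizations to AArch64
- Function return
- Register push
  - General-purpose register push
  Because SP must be 16-byte aligned, AArch64 pushes registers in pairs.
  - NEON register push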
.macro push_v_regs
stp d8, d9, [sp, #-16]!
stp d10, d11, [sp, #-16]!
stp d12, d13, [sp, #-16]!
stp d14, d15, [sp, #-16]!
.endm
.macro pop_v_regs
ldp d14, d15, [sp], #16
ldp d12, d13, [sp], #16
ldp d10, d11, [sp], #16
ldp d8, d9, [sp], #16
.endm
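As for why d8~d15 are pushed when it is the v8~v15 registers that are used, see the "Calling rules" item in the 64-bit NEON register section above.
Unfortunately, when debugging with GDB, this way of pushing triggers the following error:
tbreak _Unwind_RaiseException aarch64-tdep.c:335: internal-error: CORE_ADDR aarch64_analyze_prologue(gdbarch*, CORE_ADDR, CORE_ADDR, aarch64_prologue_cache*): Assertion `inst.operands[0].type == AARCH64_OPND_Rt' failed.
Workaround: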
.macro push_v_regsd
// Move SP down before each store so the saved data never lies below SP.
sub sp, sp, #32
st1 {v14.8h, v15.8h}, [sp]
sub sp, sp, #32
st1 {v12.8h, v13.8h}, [sp]
sub sp, sp, #32
st1 {v10.8h, v11.8h}, [sp]
sub sp, sp, #32
st1 {v8.8h, v9.8h}, [sp]
.endm
.macro pop_v_regsd
// Reload in the reverse order; post-index moves SP back up by 32 each time.
ld1 {v8.8h, v9.8h}, [sp], #32
ld1 {v10.8h, v11.8h}, [sp], #32
ld1 {v12.8h, v13.8h}, [sp], #32
ld1 {v14.8h, v15.8h}, [sp], #32
.endm
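Note: although this method avoids the problem seen while debugging with GDB, once debugging is finished you still need to switch back to pushing the d registers (that is, push_v_regs); otherwise timing statistics cannot be collected. To switch between the two easily, a macro definition can be used: #define push_v_regs push_v_regsd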