Android Stability - Exception Handling on ARMv8-A

Author: HuangTao_Zoey | Published: 2018-05-18 08:48

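One class of Android stability problems can be called Android Native Errors; traditionally they go by names such as segmentation faults or memory access violations. Anyone who has worked on stability knows that these are hard to analyze. Android does produce a tombstone log for the crashing process, but it only shows the line of code where the crash finally happened. Working out why it crashed is usually harder: you need to know the code in question well and be comfortable with common debugging tools and techniques such as GDB, Crash, and coredump files. Later articles will cover those. This article looks at how a Native Error arises and how ARM and Linux handle it underneath, so that with the underlying principles understood, such problems can be analyzed with more confidence.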

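• ARMv8-A has two execution states, AArch32 and AArch64. This article only analyzes exception handling in the AArch64 execution state.
• ARMv8-A supports four Exception Levels (EL0, EL1, EL2, EL3). Android applications run at EL0 and Linux runs at EL1; virtualization and Secure software run at EL2 and EL3 respectively. This article only analyzes exceptions taken from EL0. For exception handling and routing configuration at the other levels, see the official ARMv8-A documentation and other online material.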
ARMv8-A basics - Exception Level
Figure: ARMv8-A Exception Levels

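The Exception Level is an important concept in ARMv8-A. All code, whether application or kernel, runs at some Exception Level, and different levels mean different privileges and different views of the system: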

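• Code at EL0 has less privilege than code at EL1, EL1 less than EL2, and so on. Android applications therefore normally run at EL0, while Linux kernel code runs at EL1 or EL2. Linux generally does not run at EL3, because EL3 is normally reserved for a small piece of trusted firmware such as the Secure Monitor, and a kernel like Linux is not suited to that level.
  EL0: Normal user applications. EL0 corresponds to the lowest privilege level and is often described as unprivileged, whereas execution at any Exception level above EL0 is often referred to as privileged execution.
  EL1: An operating system kernel typically described as privileged.
  EL2: Hypervisor.
  EL3: Low-level firmware, including the Secure Monitor.
• Each Exception Level sees a different set of registers, and registers with the same role are banked per level. For example, every level that can take exceptions has its own SPSR_ELx (x = 1, 2, 3; there is no SPSR_EL0 because exceptions are never taken to EL0), which holds the PSTATE saved on entry to ELx.
• Exception Levels can be switched. An application that makes a system call deliberately switches to EL1, and the exceptions this article focuses on, such as data aborts and instruction aborts, also cause an Exception Level change.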
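
The saved PSTATE in SPSR_ELx records which level and which stack pointer the interrupted code was using. The following is a minimal sketch of decoding that mode field, assuming the architectural M[4:0] layout for AArch64 (bit 4 set means the interrupted code was in AArch32 state, bits 3:2 give the Exception Level, bit 0 selects SP_ELx). The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the mode field SPSR_ELx.M[4:0] (names are illustrative). */
struct spsr_mode {
    bool aarch32;     /* M[4] == 1: interrupted code ran in AArch32 state */
    int  el;          /* M[3:2]: Exception Level, valid only when !aarch32 */
    bool use_sp_elx;  /* M[0]: 0 = SP_EL0 ("t" modes), 1 = SP_ELx ("h" modes) */
};

struct spsr_mode decode_spsr_mode(uint64_t spsr)
{
    struct spsr_mode m;

    m.aarch32    = ((spsr >> 4) & 1u) != 0;
    m.el         = (int)((spsr >> 2) & 3u);
    m.use_sp_elx = (spsr & 1u) != 0;
    return m;
}
```

A value whose low four bits are 0b0101 decodes as EL1 using SP_EL1, the mode that the Linux vector table comments later in this article call EL1h; 0b0000 decodes as EL0.
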
ARMv8-A basics - Execution State

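ARMv8 has two execution states, AArch64 and AArch32. AArch64 is new and provides 31 64-bit general-purpose registers. AArch32 exists for compatibility with ARMv7 and only uses 32-bit general-purpose registers. The two states can be switched, for example when the CPU moves from running a 64-bit process to running a 32-bit one: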

• Changing to AArch32 requires going from a higher to a lower Exception level. This is the result of exiting an exception handler by executing the ERET instruction.
• Changing to AArch64 requires going from a lower to a higher Exception level. The exception can be the result of an instruction execution or an external signal.
• If, when taking an exception or returning from an exception, the Exception level remains the same, then the Execution state also cannot change.
• Both AArch64 and AArch32 Execution states have Exception levels that are similar, but there are some differences between Secure and Non-secure operation. The Execution state the processor is in when the exception is generated can limit the Exception levels available to the other Execution state.
• Where an ARMv8-A processor operates in AArch32 Execution state at a particular Exception level, it uses the same exception model as in ARMv7-A for exceptions that are taken to that Exception level.
• Code at EL3 cannot take an exception to a higher Exception level, so cannot change Execution state, except by going through a reset.
• When the processor moves from a higher to a lower Exception level, the Execution state can stay the same, or it can switch from AArch64 to AArch32.
• When moving from a lower to a higher Exception level, the Execution state can stay the same or switch from AArch32 to AArch64.
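
The direction rules quoted above can be sketched as a small predicate. This is an illustrative model only: the names are hypothetical, and it ignores which Exception Levels a given core implements and how a higher level configures the execution state of the level below it.

```c
#include <assert.h>
#include <stdbool.h>

enum exec_state { AARCH32, AARCH64 };

/*
 * Returns true if moving from (from_el, from) to (to_el, to) is consistent
 * with the direction rules: the state may only change together with the
 * Exception Level, to AArch64 when going up and to AArch32 when going down.
 */
bool state_change_allowed(int from_el, enum exec_state from,
                          int to_el, enum exec_state to)
{
    if (from == to)
        return true;           /* no execution state change requested */
    if (to_el == from_el)
        return false;          /* same level: the state cannot change */
    if (to_el > from_el)
        return to == AARCH64;  /* exception entry: only AArch32 -> AArch64 */
    return to == AARCH32;      /* exception return: only AArch64 -> AArch32 */
}
```

Under this model a 32-bit application at EL0 can trap into a 64-bit kernel at EL1 and be resumed afterwards, but a 64-bit application can never trap into a 32-bit kernel.
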

The two figures below show the Exception Levels available in each execution state.


Figures: Exception levels in AArch64; Exception levels in AArch32

ARMv8-A basics - Important registers

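For the registers available in the AArch32 and AArch64 execution states, the article "ARMv8 架构与指令集.学习笔记" (ARMv8 Architecture and Instruction Set: Study Notes) already gives a detailed comparison. Two of that author's tables are borrowed here: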

Table: Important AArch32 registers
Table: Important AArch64 registers

ARM64 exception types

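With that background we can turn to the main topic. As noted above, a system call changes the CPU's Exception Level. System calls, interrupts, and similar events share a technical name: exceptions. An exception is a condition or system event that requires the running code to be suspended so that a different code path can run; once it has been handled, the original code resumes. Exceptions can therefore occur at any point, as the figure below shows: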

Figure: Exceptions

Exception types and their descriptions:

• Aborts: Aborts can be generated either on failed instruction fetches (Instruction Aborts) or failed data accesses (Data Aborts). They can come from the external memory system giving an error response on a memory access (indicating perhaps that the specified address does not correspond to real memory in the system).

  Alternatively, the Memory Management Unit (MMU) of the core generates the abort. An OS can use MMU aborts to allocate memory to applications dynamically.

  An instruction that cannot be fetched causes an abort. The Instruction Abort exception is taken only if the core then tries to execute it. A Data Abort exception is caused by a load or store instruction and happens after the data read or write has been attempted.

  An abort is described as being synchronous if it is generated by direct execution of instructions and the return address indicates the instruction which caused it.

  Otherwise, an abort is described as asynchronous.

  In AArch64, synchronous aborts cause a Synchronous exception. Asynchronous aborts cause an SError interrupt exception.

• Reset: Reset is treated as a special case because it has its own vector that always targets the highest implemented Exception level. This vector uses an IMPLEMENTATION DEFINED address which is typically set by configuration input signals.

  The address can be read from the Reset Vector Base Address Register RVBAR_ELn, where n is the number of the highest implemented Exception level.

  All cores have a reset input and take the reset exception after they have been reset. It is the highest priority exception and cannot be masked. This exception is used to execute code on the core to initialize it, after the system has powered up.

• Exception generating instructions: Execution of these instructions can generate exceptions. They are typically executed to request a service from software that runs at a higher privilege level:

  The Supervisor Call (SVC) instruction enables User mode programs to request an OS service.

  The Hypervisor Call (HVC) instruction enables the guest OS to request hypervisor services.

  The Secure monitor Call (SMC) instruction enables the Normal world to request Secure world services.

• Interrupts: There are three types of interrupts, IRQ, FIQ and SError. IRQ and FIQ are general purpose compared to SError, which is associated specifically with external asynchronous Data Aborts. So typically, the term 'interrupts' refers only to IRQ and FIQ.

  FIQ is higher priority than IRQ. Both of these interrupts are typically associated with individual input pins for each core. External hardware asserts an interrupt request line and the corresponding exception type is raised when the current instruction finishes executing (although some instructions, those that can load multiple values, can be interrupted), assuming that the interrupt is not disabled.

  On almost all systems, various interrupt sources are connected using an interrupt controller. The interrupt controller arbitrates and prioritizes interrupts, and in turn, provides a serialized single signal that is then connected to the FIQ or IRQ signal of the core.

  Because IRQ and FIQ interrupts are not directly related to the software running on the core at any given time, they are classified as asynchronous exceptions.

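In Android stability work, the exceptions we meet most often are the Instruction Aborts and Data Aborts described under Aborts. According to the table, an Instruction Abort is taken when the core tries to execute an instruction that could not be fetched. For example, if a function is called through a function pointer that has been corrupted to an invalid address, so that the target is unmapped or is a data region rather than executable code, an Instruction Abort may occur. A Data Abort is taken when a load or store instruction accesses data. For example, reading or writing a variable through a pointer that holds an invalid address may cause a Data Abort.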

What the hardware does on an ARM64 exception

When an exception occurs, the ARM core automatically performs the following steps:

• The SPSR_ELn is updated (where n is the Exception level where the exception is taken), to store the PSTATE information that is required to correctly return at the end of the exception.
• PSTATE is updated to reflect the new processor status (and this can mean that the Exception level is raised, or it can stay the same).
• The address to return to at the end of the exception is stored in ELR_ELn.

Figure: Hardware behavior on exception entry
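
The three steps above, together with the branch into the vector table described in the next section, can be modeled in a few lines of C. This is a toy model, not a full description of the hardware: the struct, the helper names, and the choice to track only the mode bits and the DAIF mask bits of PSTATE are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PSTATE_MODE_MASK 0xfull         /* M[3:0] */
#define PSTATE_MODE_EL0t 0x0ull
#define PSTATE_MODE_EL1h 0x5ull
#define PSTATE_DAIF      (0xfull << 6)  /* D, A, I, F mask bits */

/* Hypothetical, heavily reduced view of one core. */
struct cpu_model {
    uint64_t pc;
    uint64_t pstate;
    uint64_t spsr_el1;
    uint64_t elr_el1;
    uint64_t vbar_el1;
};

/* Synchronous exception taken from AArch64 EL0 to EL1. */
void take_sync_exception_from_el0(struct cpu_model *cpu)
{
    cpu->spsr_el1 = cpu->pstate;        /* save PSTATE for the return */
    cpu->elr_el1  = cpu->pc;            /* for an abort: the faulting instruction */
    cpu->pstate   = (cpu->pstate & ~PSTATE_MODE_MASK)
                    | PSTATE_MODE_EL1h  /* now at EL1, using SP_EL1 */
                    | PSTATE_DAIF;      /* exceptions masked on entry */
    cpu->pc       = cpu->vbar_el1 + 0x400; /* "Synchronous, lower EL using AArch64" */
}

/* ERET: restore what exception entry saved. */
void exception_return(struct cpu_model *cpu)
{
    cpu->pstate = cpu->spsr_el1;
    cpu->pc     = cpu->elr_el1;
}
```

Because ELR_EL1 holds the address of the faulting instruction for an abort, returning without fixing the cause simply re-executes that instruction. This is why the kernel either repairs the mapping or delivers a signal.
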
What software does on an exception

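When an exception occurs, the processor must respond by running exception handling code. On ARM64 this code lives in the exception vector table, which is stored in memory. Every Exception Level except EL0 (exceptions are never taken to EL0) has its own vector table, and the base addresses are held in VBAR_EL3, VBAR_EL2, and VBAR_EL1. A typical vector table is shown below; see also the article "ARM64的启动过程之(六):异常向量表的设定" (The ARM64 boot process, part 6: setting up the exception vector table).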

Address             Exception type    Description
VBAR_ELn + 0x000    Synchronous       Current EL with SP0
VBAR_ELn + 0x080    IRQ/vIRQ          Current EL with SP0
VBAR_ELn + 0x100    FIQ/vFIQ          Current EL with SP0
VBAR_ELn + 0x180    SError/vSError    Current EL with SP0
VBAR_ELn + 0x200    Synchronous       Current EL with SPx
VBAR_ELn + 0x280    IRQ/vIRQ          Current EL with SPx
VBAR_ELn + 0x300    FIQ/vFIQ          Current EL with SPx
VBAR_ELn + 0x380    SError/vSError    Current EL with SPx
VBAR_ELn + 0x400    Synchronous       Lower EL using AArch64
VBAR_ELn + 0x480    IRQ/vIRQ          Lower EL using AArch64
VBAR_ELn + 0x500    FIQ/vFIQ          Lower EL using AArch64
VBAR_ELn + 0x580    SError/vSError    Lower EL using AArch64
VBAR_ELn + 0x600    Synchronous       Lower EL using AArch32
VBAR_ELn + 0x680    IRQ/vIRQ          Lower EL using AArch32
VBAR_ELn + 0x700    FIQ/vFIQ          Lower EL using AArch32
VBAR_ELn + 0x780    SError/vSError    Lower EL using AArch32

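
Each group of four entries in the table is 0x200 bytes apart and each entry is 0x80 bytes long, so the address of an entry can be computed from where the exception came from and what type it is. A minimal sketch follows; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Where the exception was taken from (row group in the table). */
enum vec_source { CUR_EL_SP0 = 0, CUR_EL_SPX = 1, LOWER_EL_A64 = 2, LOWER_EL_A32 = 3 };

/* Exception type (entry within a group). */
enum vec_type { VEC_SYNC = 0, VEC_IRQ = 1, VEC_FIQ = 2, VEC_SERROR = 3 };

/* Address the core branches to for a given VBAR_ELn, source and type. */
uint64_t vector_address(uint64_t vbar, enum vec_source src, enum vec_type type)
{
    return vbar + (uint64_t)src * 0x200u + (uint64_t)type * 0x80u;
}
```

A synchronous exception raised by a 64-bit application therefore lands at offset 0x400 from VBAR_EL1. The whole table spans 16 entries of 0x80 bytes, that is 0x800 bytes, which matches the 2 KB alignment (.align 11) used for the table in the Linux source.
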
How arm64 Linux responds to exceptions

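The ARM64 Linux exception vector table is defined in kernel-src/arch/arm64/kernel/entry.S, as shown below. If a Data Abort or Instruction Abort occurs at EL0, that is, in an application, the current code is suspended and execution continues at el0_sync. Likewise, if an interrupt arrives while code is running at EL0, the CPU is redirected to el0_irq.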

    /*
     * Exception vectors.
     */
    
        .align  11
    ENTRY(vectors)
        ventry  el1_sync_invalid        // Synchronous EL1t
        ventry  el1_irq_invalid         // IRQ EL1t
        ventry  el1_fiq_invalid         // FIQ EL1t
        ventry  el1_error_invalid       // Error EL1t
    
        ventry  el1_sync            // Synchronous EL1h
        ventry  el1_irq             // IRQ EL1h
        ventry  el1_fiq_invalid         // FIQ EL1h
        ventry  el1_error_invalid       // Error EL1h
    
        ventry  el0_sync            // Synchronous 64-bit EL0
        ventry  el0_irq             // IRQ 64-bit EL0
        ventry  el0_fiq_invalid         // FIQ 64-bit EL0
        ventry  el0_error_invalid       // Error 64-bit EL0
    
    #ifdef CONFIG_COMPAT
        ventry  el0_sync_compat         // Synchronous 32-bit EL0
        ventry  el0_irq_compat          // IRQ 32-bit EL0
        ventry  el0_fiq_invalid_compat      // FIQ 32-bit EL0
        ventry  el0_error_invalid_compat    // Error 32-bit EL0
    #else
        ventry  el0_sync_invalid        // Synchronous 32-bit EL0
        ventry  el0_irq_invalid         // IRQ 32-bit EL0
        ventry  el0_fiq_invalid         // FIQ 32-bit EL0
        ventry  el0_error_invalid       // Error 32-bit EL0
    #endif
    END(vectors)
    
    /*
     * EL0 mode handlers.
     */
        .align  6
    el0_sync:
        kernel_entry 0
        mrs x25, esr_el1            // read the syndrome register
        lsr x24, x25, #ESR_ELx_EC_SHIFT // exception class: taken from ESR to select the right handler
        cmp x24, #ESR_ELx_EC_SVC64      // SVC in 64-bit state: system calls go to el0_svc
        b.eq    el0_svc
        cmp x24, #ESR_ELx_EC_DABT_LOW   // data abort in EL0: invalid data accesses from EL0 go to el0_da
        b.eq    el0_da
        cmp x24, #ESR_ELx_EC_IABT_LOW   // instruction abort in EL0
        b.eq    el0_ia
        cmp x24, #ESR_ELx_EC_FP_ASIMD   // FP/ASIMD access
        b.eq    el0_fpsimd_acc
        cmp x24, #ESR_ELx_EC_FP_EXC64   // FP/ASIMD exception
        b.eq    el0_fpsimd_exc
        cmp x24, #ESR_ELx_EC_SYS64      // configurable trap
        b.eq    el0_sys
        cmp x24, #ESR_ELx_EC_SP_ALIGN   // stack alignment exception
        b.eq    el0_sp_pc
        cmp x24, #ESR_ELx_EC_PC_ALIGN   // pc alignment exception
        b.eq    el0_sp_pc
        cmp x24, #ESR_ELx_EC_UNKNOWN    // unknown exception in EL0
        b.eq    el0_undef
        cmp x24, #ESR_ELx_EC_BREAKPT_LOW    // debug exception in EL0
        b.ge    el0_dbg
        b   el0_inv
    
    el0_dbg:
        /*
         * Debug exception handling
         */
        tbnz    x24, #0, el0_inv        // EL0 only
        mrs x0, far_el1
        mov x1, x25
        mov x2, sp
        bl  do_debug_exception
        enable_dbg
        ct_user_exit
        b   ret_to_user
    el0_inv:
        enable_dbg
        ct_user_exit
        mov x0, sp
        mov x1, #BAD_SYNC
        mov x2, x25
        bl  bad_mode
        b   ret_to_user
    
    el0_da:  // invalid data memory accesses normally take this path
        /*
         * Data abort handling
         */
        mrs x26, far_el1
        // enable interrupts before calling the main handler
        enable_dbg_and_irq
        ct_user_exit
        bic x0, x26, #(0xff << 56)  // note: x0, x1 and x2 carry the arguments to do_mem_abort
        mov x1, x25
        mov x2, sp
        bl  do_mem_abort // call do_mem_abort for further handling
        b   ret_to_user // return to user space
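
The dispatch at the top of el0_sync can be restated in C as a sketch. The EC constants mirror the kernel's ESR_ELx_EC_* definitions for the classes shown; the enum, the function names, and the reduced set of cases are simplifications made here and are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define ESR_ELx_EC_SHIFT    26
#define ESR_ELx_EC_MASK     0x3Fu
#define ESR_ELx_EC_UNKNOWN  0x00u
#define ESR_ELx_EC_SVC64    0x15u
#define ESR_ELx_EC_IABT_LOW 0x20u
#define ESR_ELx_EC_DABT_LOW 0x24u

/* Hypothetical labels for a few of the branch targets in el0_sync. */
enum el0_target { EL0_SVC, EL0_DA, EL0_IA, EL0_UNDEF, EL0_OTHER };

/* Same decision as the cmp/b.eq chain: pick a handler from the exception class. */
enum el0_target el0_sync_target(uint32_t esr)
{
    switch ((esr >> ESR_ELx_EC_SHIFT) & ESR_ELx_EC_MASK) {
    case ESR_ELx_EC_SVC64:    return EL0_SVC;
    case ESR_ELx_EC_DABT_LOW: return EL0_DA;
    case ESR_ELx_EC_IABT_LOW: return EL0_IA;
    case ESR_ELx_EC_UNKNOWN:  return EL0_UNDEF;
    default:                  return EL0_OTHER; /* remaining classes omitted here */
    }
}

/* Equivalent of "bic x0, x26, #(0xff << 56)": drop the top byte of FAR_EL1. */
uint64_t untagged_fault_address(uint64_t far)
{
    return far & ~(0xffull << 56);
}
```

For instance, an ESR value of 0x92000006 has exception class 0x24, so it is routed to the data abort path, while its low bits carry the fault status code that do_mem_abort uses next.
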
    

do_mem_abort, in kernel-src/arch/arm64/mm/fault.c:

    asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
                         struct pt_regs *regs)
    {
        const struct fault_info *inf = fault_info + (esr & 63);
        struct siginfo info;
    
        if (!inf->fn(addr, esr, regs)) // try the handler registered in the fault_info array
            return;
    
        pr_alert("Unhandled fault: %s (0x%08x) at 0x%016lx\n",
             inf->name, esr, addr);
    
        info.si_signo = inf->sig;
        info.si_errno = 0;
        info.si_code  = inf->code;
        info.si_addr  = (void __user *)addr;
        arm64_notify_die("", regs, &info, esr); // not handled above, so send the signal to the process
    }
    
    static struct fault_info {
        int (*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
        int sig;
        int code;
        const char *name;
    } fault_info[] = {
        { do_bad,       SIGBUS,  0,     "ttbr address size fault"   },
        { do_bad,       SIGBUS,  0,     "level 1 address size fault"    },
        { do_bad,       SIGBUS,  0,     "level 2 address size fault"    },
        { do_bad,       SIGBUS,  0,     "level 3 address size fault"    },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 0 translation fault" },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 1 translation fault" },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,   "level 2 translation fault" },
        { do_page_fault,    SIGSEGV, SEGV_MAPERR,   "level 3 translation fault" },
        { do_bad,       SIGBUS,  0,     "unknown 8"         },
        { do_page_fault,    SIGSEGV, SEGV_ACCERR,   "level 1 access flag fault" },
        { do_page_fault,    SIGSEGV, SEGV_ACCERR,   "level 2 access flag fault" },
        { do_page_fault,    SIGSEGV, SEGV_ACCERR,   "level 3 access flag fault" },
        { do_bad,       SIGBUS,  0,     "unknown 12"            },
        { do_page_fault,    SIGSEGV, SEGV_ACCERR,   "level 1 permission fault"  },
        { do_page_fault,    SIGSEGV, SEGV_ACCERR,   "level 2 permission fault"  },
        { do_page_fault,    SIGSEGV, SEGV_ACCERR,   "level 3 permission fault"  },
        { do_bad,       SIGBUS,  0,     "synchronous external abort"    },
        { do_bad,       SIGBUS,  0,     "unknown 17"            },
        { do_bad,       SIGBUS,  0,     "unknown 18"            },
        { do_bad,       SIGBUS,  0,     "unknown 19"            },
        { do_bad,       SIGBUS,  0,     "synchronous external abort (translation table walk)" },
        { do_bad,       SIGBUS,  0,     "synchronous external abort (translation table walk)" },
        { do_bad,       SIGBUS,  0,     "synchronous external abort (translation table walk)" },
        { do_bad,       SIGBUS,  0,     "synchronous external abort (translation table walk)" },
        { do_bad,       SIGBUS,  0,     "synchronous parity error"  },
        { do_bad,       SIGBUS,  0,     "unknown 25"            },
        { do_bad,       SIGBUS,  0,     "unknown 26"            },
        { do_bad,       SIGBUS,  0,     "unknown 27"            },
        { do_bad,       SIGBUS,  0,     "synchronous parity error (translation table walk)" },
        { do_bad,       SIGBUS,  0,     "synchronous parity error (translation table walk)" },
        { do_bad,       SIGBUS,  0,     "synchronous parity error (translation table walk)" },
        { do_bad,       SIGBUS,  0,     "synchronous parity error (translation table walk)" },
        { do_bad,       SIGBUS,  0,     "unknown 32"            },
        { do_alignment_fault,   SIGBUS,  BUS_ADRALN,    "alignment fault"       },
        { do_bad,       SIGBUS,  0,     "unknown 34"            },
        { do_bad,       SIGBUS,  0,     "unknown 35"            },
        { do_bad,       SIGBUS,  0,     "unknown 36"            },
        { do_bad,       SIGBUS,  0,     "unknown 37"            },
        { do_bad,       SIGBUS,  0,     "unknown 38"            },
        { do_bad,       SIGBUS,  0,     "unknown 39"            },
        { do_bad,       SIGBUS,  0,     "unknown 40"            },
        { do_bad,       SIGBUS,  0,     "unknown 41"            },
        { do_bad,       SIGBUS,  0,     "unknown 42"            },
        { do_bad,       SIGBUS,  0,     "unknown 43"            },
        { do_bad,       SIGBUS,  0,     "unknown 44"            },
        { do_bad,       SIGBUS,  0,     "unknown 45"            },
        { do_bad,       SIGBUS,  0,     "unknown 46"            },
        { do_bad,       SIGBUS,  0,     "unknown 47"            },
        { do_bad,       SIGBUS,  0,     "TLB conflict abort"        },
        { do_bad,       SIGBUS,  0,     "unknown 49"            },
        { do_bad,       SIGBUS,  0,     "unknown 50"            },
        { do_bad,       SIGBUS,  0,     "unknown 51"            },
        { do_bad,       SIGBUS,  0,     "implementation fault (lockdown abort)" },
        { do_bad,       SIGBUS,  0,     "implementation fault (unsupported exclusive)" },
        { do_bad,       SIGBUS,  0,     "unknown 54"            },
        { do_bad,       SIGBUS,  0,     "unknown 55"            },
        { do_bad,       SIGBUS,  0,     "unknown 56"            },
        { do_bad,       SIGBUS,  0,     "unknown 57"            },
        { do_bad,       SIGBUS,  0,     "unknown 58"            },
        { do_bad,       SIGBUS,  0,     "unknown 59"            },
        { do_bad,       SIGBUS,  0,     "unknown 60"            },
        { do_bad,       SIGBUS,  0,     "section domain fault"      },
        { do_bad,       SIGBUS,  0,     "page domain fault"     },
        { do_bad,       SIGBUS,  0,     "unknown 63"            },
    };
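
do_mem_abort indexes fault_info with the low six bits of ESR, the fault status code. Below is a minimal sketch of that lookup over a hand-picked subset of the table above. The struct layout and function name are illustrative, and the signal and code are kept as strings so that the example does not depend on <signal.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the subset: fault status code plus what the kernel table lists. */
struct fsc_info {
    unsigned    fsc;
    const char *signal;
    const char *code;
    const char *name;
};

static const struct fsc_info fsc_subset[] = {
    { 0x04, "SIGSEGV", "SEGV_MAPERR", "level 0 translation fault" },
    { 0x05, "SIGSEGV", "SEGV_MAPERR", "level 1 translation fault" },
    { 0x06, "SIGSEGV", "SEGV_MAPERR", "level 2 translation fault" },
    { 0x07, "SIGSEGV", "SEGV_MAPERR", "level 3 translation fault" },
    { 0x0d, "SIGSEGV", "SEGV_ACCERR", "level 1 permission fault"  },
    { 0x0e, "SIGSEGV", "SEGV_ACCERR", "level 2 permission fault"  },
    { 0x0f, "SIGSEGV", "SEGV_ACCERR", "level 3 permission fault"  },
    { 0x21, "SIGBUS",  "BUS_ADRALN",  "alignment fault"           },
};

/* Returns the matching row, or NULL if the code is not in this subset. */
const struct fsc_info *lookup_fsc(uint32_t esr)
{
    unsigned fsc = esr & 63u;  /* same index expression as do_mem_abort */

    for (size_t i = 0; i < sizeof(fsc_subset) / sizeof(fsc_subset[0]); i++) {
        if (fsc_subset[i].fsc == fsc)
            return &fsc_subset[i];
    }
    return NULL;
}
```

The kernel uses a dense 64-entry array so the lookup is a single index; the linear search here only keeps the subset short.
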
    

A typical call stack for page fault handling looks like this:

    [<ffffff800808bbfc>] bug_handler+0x60/0x90
    [<ffffff80080839f4>] brk_handler+0xf4/0x208
    [<ffffff800808255c>] do_debug_exception+0x4c/0x114
    [<ffffff8008085708>] el1_dbg+0x18/0x8c 
    [<ffffff8008b2a280>] aee_wdt_atf_entry+0xdc/0xe8
    [<ffffff8008166110>] smp_call_function_many+0x254/0x2f4
    [<ffffff8008166404>] on_each_cpu_mask+0x48/0xec
    [<ffffff80081d5454>] drain_all_pages+0xfc/0x118
    [<ffffff80081da110>] __alloc_pages_nodemask+0x764/0xc54
    [<ffffff80081df0ac>] __do_page_cache_readahead+0x164/0x314
    [<ffffff80081cfcc4>] filemap_fault+0x374/0x45c
    [<ffffff80082bf72c>] ext4_filemap_fault+0x34/0x50
    [<ffffff80081fedac>] __do_fault+0x48/0xdc
    [<ffffff8008202e14>] handle_mm_fault+0x85c/0x1160
    [<ffffff800809c6b4>] do_page_fault+0x2ec/0x3c4 
    [<ffffff8008082354>] do_mem_abort+0x50/0x10c
    [<ffffff8008085c24>] el0_da+0x18/0x1c
    

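For the detailed meaning of the code above, see the article "armv8 Linux内核异常处理相关文件" (ARMv8 Linux kernel exception handling source files), which covers it thoroughly, so it is not repeated here. One detail is worth noting: if a signal such as signal 11 is sent to a process directly with the shell kill command, the exception handling path described here is not taken, so the fault addr in that process's tombstone log is shown as -------- .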
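
The difference between a SIGSEGV generated by the kernel's fault path and one sent with kill can be observed from user space through siginfo_t. The sketch below is a Linux-only demo written for this article, not a description of how debuggerd produces tombstones: it installs a SIGSEGV handler, sends the signal with kill(2), then triggers a real fault by reading a PROT_NONE page, and records si_code and si_addr each time. All function names are hypothetical, and it assumes a single-threaded process.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* MAP_ANONYMOUS and siginfo_t under strict -std modes */
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;
static volatile sig_atomic_t jump_armed;
static volatile sig_atomic_t last_si_code;
static void *volatile last_si_addr;

static void on_sigsegv(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    last_si_code = info->si_code;
    last_si_addr = info->si_addr;      /* only meaningful for real faults */
    if (jump_armed)                    /* returning would re-run the faulting load */
        siglongjmp(recover_point, 1);
}

static int install_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigsegv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* si_code seen when SIGSEGV is sent with kill(2), as the shell command does. */
int sigsegv_code_from_kill(void)
{
    if (install_handler() != 0)
        return -1;
    jump_armed = 0;
    kill(getpid(), SIGSEGV);           /* delivered before kill() returns */
    return last_si_code;
}

/* si_code seen for a real fault; also reports si_addr and the page touched. */
int sigsegv_code_from_fault(void **fault_addr, void **page_addr)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *map = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (map == MAP_FAILED || install_handler() != 0)
        return -1;
    jump_armed = 1;
    if (sigsetjmp(recover_point, 1) == 0)
        (void)*(volatile char *)map;   /* load from an inaccessible page */
    jump_armed = 0;
    *fault_addr = last_si_addr;
    *page_addr = map;
    munmap(map, len);
    return last_si_code;
}
```

On Linux the first function is expected to report SI_USER, which carries the sender's pid rather than a fault address, while the second reports a SEGV_* code together with the address of the protected page. That missing address in the kill case is consistent with the dashes seen in the tombstone.
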

Article link: https://www.haomeiwen.com/subject/adnqkftx.html