
Debugging early boot issues on ARM64 (without console!)

In the last blog post, about booting the upstream kernel on the Inforce 6640, I mentioned an interesting issue with ftrace that led to a kind of debug-by-rebooting approach, and said it was worth a blog post of its own. Of course it only took me more than a year to write it (!), so apologies for the huge delay, and let’s debug some early boot failures!
First of all, a reminder of what the issue was: while trying to boot the upstream kernel on the Inforce 6640 board, I found that for some reason ftrace doesn’t work in such a setup, starting around kernel 5.7. The bisect wasn’t precise, and even talking with the ftrace maintainer on IRC didn’t provide many clues (since more debugging was required).
In order to be accurate and provide up-to-date results, I re-tested this with kernel v6.2-rc7 and…the same issue is observed: the kernel just can’t boot with ftrace, and not a single hint shows up on the serial console!
A side note here: kernel 6.2-rc7 does boot on the Inforce 6640, but with no graphics (likely due to some changes in the msm driver), with some SCSI timeouts at boot time (sd 0:0:0:0: [sda] tag#13 timing out command, waited 120s) and…no USB(!), unless the “old” device-tree (from v5.16) is used. I think that is related to some dwc3/usb changes merged in the qcom device-trees. But anyway, this might be a topic for a whole new blog post, so let’s get back to debugging early boot issues.
Once we face a completely empty serial output after the usual bootloader messages, a basic question comes up: has the kernel even booted? Or, in more technically meaningful terms: did the bootloader properly load the kernel and jump to its code? Did the kernel code actually start executing?
The first idea is always “let’s print something”, but how, if I don’t know where I’d put the print statement? Not to mention…that super early code has nothing initialized, so even a printk could be “too much” for such code. Hence, the only alternative seemed to me…to reboot the machine from kernel code! With that, at least I’d validate that kernel code was running. And thinking about that…I even extended this logic: if I can reboot the kernel, I could kinda bisect the code to determine where the failure is. One question remains though – how to reboot a kernel so early?
Rebooting/shutdown is definitely a non-trivial task. By checking the kernel reboot code, one can see it’s full of callbacks to the architecture code; naturally, it’s a low-level platform/arch process. Checking machine_restart() under arch/arm64, we can see it goes through EFI calls (if EFI is supported). Noticing I was walking an unknown and potentially daunting path, I decided to first seek help on IRC, and that brought me gold: Marc Zyngier (an ARM64 maintainer) introduced me to PSCI (Power State Coordination Interface). If my device’s firmware supports this specification (and the Inforce 6640’s does!), I could issue a PSCI reset SMC (Secure Monitor Call) to get a board reset. In other words, with a few assembly instructions I could perhaps reboot the kernel! Marc even told me which registers I should write, and after some tinkering (and more code study), I came up with this function:
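/* Ask the firmware for a board reset through a PSCI SMC call. */
static inline void arm64_smc_reset(void)
{
    asm("mov x0, #0x9");                /* low 16 bits of the function ID */
    asm("mov x3, #0x0");                /* x1-x3: arguments, all zero */
    asm("mov x2, #0x0");
    asm("mov x1, #0x0");
    asm("movk x0, #0x8400, lsl #16");   /* x0 = 0x84000009 (PSCI SYSTEM_RESET) */
    asm("smc #0");                      /* trap into the secure monitor */
}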
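To make the register values less magic: the mov/movk pair builds the 32-bit function ID 0x84000009 in x0, which is PSCI SYSTEM_RESET, while x1 to x3 are simply zeroed. The host-side C sketch below (not kernel code; the helper names are made up for illustration) reproduces how the two instructions compose the value and splits it into the fields defined by the SMC Calling Convention.

```c
#include <stdint.h>

/*
 * Models `mov x0, #lo16` followed by `movk x0, #hi16, lsl #16`:
 * mov sets the whole register, movk then replaces bits [31:16] only.
 * (Hypothetical helper, for illustration on a host machine.)
 */
static uint64_t mov_movk(uint16_t lo16, uint16_t hi16)
{
    uint64_t x0 = lo16;                        /* mov  x0, #lo16 */

    x0 &= ~(UINT64_C(0xffff) << 16);           /* movk keeps the other bits */
    x0 |= (uint64_t)hi16 << 16;                /* movk x0, #hi16, lsl #16 */
    return x0;
}

/* Fields of an SMC Calling Convention function ID. */
static unsigned smccc_is_fast_call(uint32_t id) { return (id >> 31) & 0x1; }
static unsigned smccc_is_smc64(uint32_t id)     { return (id >> 30) & 0x1; }
static unsigned smccc_owner(uint32_t id)        { return (id >> 24) & 0x3f; }
static unsigned smccc_func_num(uint32_t id)     { return id & 0xffff; }
```

Here mov_movk(0x9, 0x8400) yields 0x84000009: a fast call, using the 32-bit convention, owned by the standard secure service range (owner 4), with function number 9.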
I first tested this as a replacement for the sysrq-b reset handler, and it worked like a charm! Now, where to plug such code into the kernel’s early path? The entry point seemed to make sense, so I tried this hack in head.S:
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -87,6 +87,12 @@
         */
 SYM_CODE_START(primary_entry)
        bl      preserve_boot_args
+       mov x0, #0x9
+       mov x3, #0x0
+       mov x2, #0x0
+       mov x1, #0x0
+       movk x0, #0x8400, lsl #16
+       smc #0
        bl      init_kernel_el    // w0=cpu_boot_mode
        mov     x20, x0
        bl      create_idmap
And voilà! With that, the kernel was bootlooping…meaning the board reached kernel code, so the first question was answered and the debugging could proceed! I decided to take a lucky step and jump directly to start_kernel(), which is C code and way easier to play with, without fear of causing another issue while debugging! And the lucky step paid off: the kernel was indeed executing that function. So, through a set of attempts using the arm64_smc_reset() above and annotating the results, I got the following “bisect” from the code:
[Image: diff of the early boot code, annotated with the result of each reset attempt]
It was a bit ugly to show the diff above as code, hence I’ve added it as an image. Basically, it tells us the issue likely happens in setup_machine_fdt(), and by digging more (with more “bisect” reboots), I found the issue happens in fdt_check_header(). As a next step I really wanted to print some values from the guilty function, but the earlycon kernel command-line option didn’t help: it turned out not to be early enough.
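The reboot-based “bisect” is a plain binary search where each probe costs one kernel build and one boot: if the board resets, execution reached the probe; if it hangs silently, the failure is earlier. A minimal host-side sketch of that logic follows, assuming the boot path can be treated as an ordered list of steps; the oracle type and the simulated board are hypothetical stand-ins for “build, flash, and watch whether it bootloops”.

```c
#include <stddef.h>

/*
 * Hypothetical oracle: returns nonzero if the board reboots when the reset
 * call is placed right after step `idx`, i.e. execution got that far.
 */
typedef int (*reboots_after_fn)(size_t idx, void *ctx);

/*
 * Returns the index of the first step that never completes, or `n` if every
 * step completes. Assumes that once a step hangs, no later probe is reached.
 */
static size_t first_hanging_step(size_t n, reboots_after_fn reboots_after,
                                 void *ctx)
{
    size_t lo = 0, hi = n;  /* steps [0, lo) are known to complete */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (reboots_after(mid, ctx))
            lo = mid + 1;   /* reset fired: the failure is after mid */
        else
            hi = mid;       /* silent hang: the failure is at or before mid */
    }
    return lo;
}

/* Stand-in for a real board: ctx points at the index of the hanging step. */
static int simulated_board(size_t idx, void *ctx)
{
    size_t hanging = *(size_t *)ctx;

    return idx < hanging;
}
```

With 8 steps and the hang at step 3, first_hanging_step(8, simulated_board, &hang) needs only three probes, which matters when every probe is a rebuild and a reboot.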
But what if I could write a very simple, really early console, or take an even easier approach: copy/port the earlycon code for this board to a slightly “earlier” point in the code? And that’s what I tried next: by inspecting (and instrumenting) the msm_serial driver, I came up with the following alternative (a big code snippet here would be just terrible to read, so I uploaded the patch instead): https://people.igalia.com/gpiccoli/arm64-really-early-console-msm-qcom.patch
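For readers who don’t want to open the patch, the heart of any such really early console is a polled write loop on the UART registers. The sketch below only illustrates that idea: the two-register layout and all names are hypothetical, not the msm_serial register map, and it runs against a struct in ordinary memory rather than a real MMIO address.

```c
#include <stdint.h>

/* Hypothetical UART register block; NOT the msm_serial layout. */
struct fake_uart {
    volatile uint32_t status;   /* bit 0 set: transmitter can take a byte */
    volatile uint32_t tx;       /* writing a byte here sends it */
};

#define FAKE_UART_TX_READY 0x1u

/* Send one character, busy-waiting until the transmitter is ready. */
static void early_putc(struct fake_uart *uart, char c)
{
    while (!(uart->status & FAKE_UART_TX_READY))
        ;   /* polling only: no interrupts or timers exist this early */
    uart->tx = (uint8_t)c;
}

/* Send a NUL-terminated string, expanding "\n" to "\r\n" for terminals. */
static void early_puts(struct fake_uart *uart, const char *s)
{
    for (; *s; s++) {
        if (*s == '\n')
            early_putc(uart, '\r');
        early_putc(uart, *s);
    }
}
```

On real hardware the struct pointer would be replaced by the board’s pre-configured serial MMIO address, which is the part that makes such a console board-specific and hacky.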
Notice the code is quite hacky: it ports the __msm_console_write() function, with a pre-configured MMIO address for the serial port, to early arch code. But despite the code being merely a “mock-up”, using it with sprintf() (for formatting) allowed me to print stuff, which was quite nice! Unfortunately I didn’t have time to debug the ftrace issue further on this board. I guess the best path forward would be to first submit a proper ifc6640 device-tree, and then continue with the debugging. I think the issue is likely related to the bootloader version and kernel/DT offsets, so it’s not an easy debug and would definitely be time-consuming.
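As an illustration of the kind of values worth printing from a function like fdt_check_header(), the sketch below decodes the first two fields of a flattened device-tree header (the big-endian magic 0xd00dfeed and the total size) and formats them into a buffer that an early console could then write out. It is a host-side approximation with made-up helper names, it uses snprintf() instead of sprintf() to bound the write, and the real kernel function performs more checks than this.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define FDT_MAGIC 0xd00dfeedu   /* first 32-bit word of a valid FDT blob */

/* Read a big-endian 32-bit value regardless of host endianness. */
static uint32_t be32_at(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Format the magic (offset 0) and totalsize (offset 4) header fields into
 * buf. Returns 1 if the magic matches, 0 otherwise. `blob` must point to at
 * least 8 readable bytes. (Hypothetical helper, not a libfdt function.)
 */
static int describe_fdt_header(const uint8_t *blob, char *buf, size_t len)
{
    uint32_t magic = be32_at(blob);
    uint32_t totalsize = be32_at(blob + 4);

    snprintf(buf, len, "fdt magic=0x%08x totalsize=%u",
             (unsigned)magic, (unsigned)totalsize);
    return magic == FDT_MAGIC;
}
```

If the pointer handed over by the bootloader were off by some offset, a line like this would show a garbage magic immediately, which is the sort of hint the empty serial console never gave.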
But I hope that at least the PSCI reset trick, and maybe this really early serial console prototype/idea, could be useful to somebody in the world of ARM board debugging! Thanks for reading and see you in the next blog post (which I expect to take waaay less than 1 year heheh).

Author: gpiccoli

I enjoy the low-level world, like kernel, firmware, virtualization and all sorts of HW/SW interactions. Free software is both part of my work and personal beliefs - I’m an enthusiast of Linux overall!
