Once I had to track down an issue where, very rarely and with no discernible pattern, the web app would produce garbled PDFs. It turned out this happened when an admin account remotely connected to the app server, which reset the default screen resolution, which in turn broke the PDF library that relied on a specific resolution (it was doing HTML-to-PDF conversion). It happened rarely and randomly because there were multiple web servers, and they were occasionally restarted, which would fix the problem until the next time.
Another fun problem I dealt with was moving my employer's codebase from Subversion to Mercurial ages ago. Everything looked good, except a directory named CVS (after the pharmacy, a customer) was missing. I was banging my head on the table before realizing that the default .hgignore file instructed Mercurial to ignore all contents of .*/CVS (CVS being another old version control system).
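For anyone curious, the ignore file in question would have contained a pattern along these lines (illustrative, not the exact file that shipped with the conversion):

```
syntax: regexp
# Meant to exclude CVS version-control metadata directories, but it
# also hides any real source directory that happens to be named CVS:
.*/CVS
```

In regexp mode that pattern matches the customer's CVS directory just as happily as version-control metadata, so the fix is to delete the line or anchor it more narrowly.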
Recycling a comment, where part of the annoyance came from the feeling that they should have been asking someone else to solve it: https://news.ycombinator.com/item?id=37859771
_____
[That's like] Me, with zero C/C++ experience, being asked to figure out why the newer version of the Linux kernel is randomly crash-panicking after getting cross-compiled for a custom hardware box.
("He's familiar with the build-system scripts, so he can see what changed.")
-----
I spent weeks testing slightly different code versions, different compile settings, different kconfig options, knocking out particular drivers, waiting for recompiles, walking back and forth to reboot the machine, and generally puzzling over extremely obscure and shifting error traces... And guess what? The new kernel was fine.
What was not fine were some long-standing hexadecimal arguments to the hypervisor, which had been memory-corrupting a spot in all kernels we'd ever loaded. It just happened to be that the newer compiles shifted bytes around so that something very important was in the blast zone.
Anyway, that's how 3 weeks of frustrating work can turn into a 2-character change.
Recently, this one, which I'm still investigating -- if you want to help :) https://github.com/anza-xyz/agave/pull/4585
Combination PIT/serial interrupt issue involving microsecond-resolution system programmable interval timer and multi-port serial driver. Would crash every day or so.
Had to create a stress test to reproduce in minutes rather than days. Then trace code paths through timers and serial events to find the problematic path. Turned out there were many - a timer interrupt callback could cancel the interrupt, reschedule the timer, change the interval, or cancel then reschedule, all in the presence of other channel interrupts occurring and overlapping unpredictably. Timers got rescheduled for intervals that had already passed by the time the callback completed. And on and on.
Took a weekend alone with the code and a set of machines, desk-time getting my head around it all, then coding bullet-proof paths for all calls and callbacks for every related system call.
Once it worked, it worked for days, then months, under test. No bug is hard enough to resist a methodical approach.
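The "rescheduled for an interval that has already passed" race above can be sketched roughly like this (a hypothetical Python illustration with made-up names, nothing like the original driver code):

```python
import threading
import time

class GuardedTimer:
    """Illustrative sketch of a 'bullet-proof' timer path: the callback
    may cancel or reschedule the timer, and an interval that has already
    elapsed by the time we reschedule is clamped to fire immediately
    instead of being silently lost."""

    def __init__(self):
        self._lock = threading.Lock()
        self._deadline = None      # absolute time the timer should fire
        self._cancelled = False

    def schedule(self, interval, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            self._cancelled = False
            # Clamp: a deadline already in the past fires right away
            # rather than being scheduled into history.
            self._deadline = max(now, now + interval)

    def cancel(self):
        with self._lock:
            self._cancelled = True
            self._deadline = None

    def due(self, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            return (not self._cancelled
                    and self._deadline is not None
                    and now >= self._deadline)
```

The point of the sketch is only the clamping and the single lock around every state transition; the real fix covered every call/callback combination, not just this one.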
Upgrading from Qt 4 to Qt 5 broke appending QStrings to QByteArrays, such that only half the data from a QString was stored (some wonkiness with UTF-8 vs. UTF-16, IIRC). It took a rewrite of the RTMP/AMF layer in the codebase to figure it out.
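I can't reproduce the exact Qt 4-to-5 behavior change here, but the general failure mode (a byte length computed for one encoding applied to a buffer holding another) can be shown in a few lines of Python:

```python
# Illustrative sketch, not the actual Qt code. QString stores UTF-16
# (two bytes per BMP character); a QByteArray holds raw bytes, often
# UTF-8 (one byte per ASCII character). Mixing up the two counts keeps
# only half the characters.
s = "data"
utf16 = s.encode("utf-16-le")   # 8 bytes: b'd\x00a\x00t\x00a\x00'
utf8 = s.encode("utf-8")        # 4 bytes: b'data'

assert len(utf16) == 2 * len(utf8)

# Copying a UTF-8-sized byte count out of the UTF-16 buffer
# silently drops half the text:
truncated = utf16[:len(utf8)]
assert truncated.decode("utf-16-le") == "da"
```

Which is exactly the "stored half the data" symptom: nothing crashes, the payload is just quietly shorter than it should be.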
Any flaky Selenium test.
A rendering-corruption or performance issue in Wayland that involves 100 processes.