|
|
Created:
8 years ago by Scott Hess - ex-Googler Modified:
7 years, 7 months ago CC:
chromium-reviews, sail+watch_chromium.org Base URL:
http://git.chromium.org/chromium/src.git@master Visibility:
Public. |
Description[Mac] Override abort() to crash into breakpad.
abort() sends SIGABRT, breakpad unfortunately does not intercept
signal crashes. Override abort() to crash in a way that breakpad will
intercept.
BUG=57504
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=201978
Patch Set 1 #Patch Set 2 : Catch signal handler instead. #Patch Set 3 : unmask signals in the handler. #
Total comments: 6
Patch Set 4 : tweak comment and signal mask. #
Total comments: 2
Messages
Total messages: 24 (0 generated)
Initially, I was going to go with process_util_mac.mm, which wouldn't need the DEPS change, but also would allow setting the abort() override when breakpad wasn't even enabled. So it put it here. I guess there could be a helper in base/, but that doesn't seem awesomer by much. an example crash caused by |std::string s(NULL);|: https://crash.corp.google.com/reportdetail?reportid=91496122c1f4e48d there are only a few crashes with __cxa_throw on the stack, and most of them look like utter insanity. Except this one: https://crash.corp.google.com/reportdetail?reportid=c769859f917c364b#crashing... which I cannot explain. Perhaps it crashed dereferencing something internal?
mach_override is an awfully crude (and awfully awful) hammer, but this problem doesn’t need to be an awfully crude nail. The usual way to intercept abort() would be to handle SIGABRT. Can we do that?
On 2012/12/21 00:55:55, Mark Mentovai wrote: > mach_override is an awfully crude (and awfully awful) hammer, but this problem > doesn’t need to be an awfully crude nail. The usual way to intercept abort() > would be to handle SIGABRT. Can we do that? For some reason, which escapes me at the moment, I thought we intentionally weren't doing anything to catch this at the signal level. But that would be a sensible way to do it, of course. I'll have to poke around to make sure this is done after the various places where SIGABRT is reset. While I'm digging, any other signals that might be worth catching? Based on crash reports I've seen, I'm guessing that Breakpad catches the precursor exception for SIGILL, SIGFPE, SIGBUS, and SIGSEGV, though SIGFPE I'm not entirely certain on. SIGSYS? [Do we ever make system calls directly?]
SIGFPE is already handled. SIGSYS neet not concern you. You can scope this to just SIGABRT.
OK, WDYT of this? Sigh, I'm pretty sure it is wrong. I tested this by injecting: std::string s(NULL); LOG(ERROR) << "100'th char: " << s[100]; it doesn't get to the LOG line, I see: libc++abi.dylib: terminate called throwing an exception and a breadcrumb in the SIGABRTHandler fires. Then I see: ** Unknown exception behavior: 0 and a .dmp file is created (and earlier one was uploaded). BUT, then my Chrome icon sits there bouncing, and never seems to finish crashing. If I pull the *death_ptr code from SIGABRTHandler() into the place where I put the code above, I get: ** Unknown exception behavior: 0 Bus error: 10 I suspect that being in the signal context is mattering, here, but since I'm not masking anything, I'm not sure why it should matter.
What happens if the breakpads are off? What happens if you put your abort-caller after the app has finished launching and bouncing?
Also note that Breakpad has SIGABRT handling on iOS already for the in-process handler. https://code.google.com/p/google-breakpad/source/browse/trunk/src/client/mac/...
On 2013/05/17 16:07:47, Mark Mentovai wrote: > What happens if the breakpads are off? > > What happens if you put your abort-caller after the app has finished launching > and bouncing? OK, the signal mask in the signal handler has all signals masked except for SIGKILL and SIGSTOP (the unmaskable signals). Unmasking SIGBUS before death lets the process die with "Bus error: 10" like what happens if the error happens outside a signal handler.
On 2013/05/17 16:47:11, rsesek wrote: > Also note that Breakpad has SIGABRT handling on iOS already for the in-process > handler. > https://code.google.com/p/google-breakpad/source/browse/trunk/src/client/mac/... WRT my previous comment, I at this point have modified the sa_mask along the lines of that code, and it doesn't affect the mask in place when the signal is handled (AFAICT, there's a union involved, so that makes sense). AFAICT from reading sigaction(2), this doesn't make sense - it _should_ be the union of the process' current signal mask, the signal being delivered, and sa_mask. In fact, this is the basic difference between sigaction(2) and signal(2), maybe I should try that API and see if the results differ.
Nevermind, I see the problem, Libc-825.25/stdlib/FreeBSD/abort.c: void abort() { struct sigaction act; if (!CRGetCrashLogMessage()) CRSetCrashLogMessage("abort() called"); /* * POSIX requires we flush stdio buffers on abort. * XXX ISO C requires that abort() be async-signal-safe. */ if (__cleanup) (*__cleanup)(); sigfillset(&act.sa_mask); /* * Don't block SIGABRT to give any handler a chance; we ignore * any errors -- ISO C doesn't allow abort to return anyway. */ sigdelset(&act.sa_mask, SIGABRT); /* <rdar://problem/7397932> abort() should call pthread_kill to deliver a signal \ to the aborting thread * This helps gdb focus on the thread calling abort() */ if (__is_threaded) { /* Block all signals on all other threads */ sigset_t fullmask; sigfillset(&fullmask); (void)_sigprocmask(SIG_SETMASK, &fullmask, NULL); /* <rdar://problem/8400096> Set the workqueue killable */ __pthread_workqueue_setkill(1); (void)pthread_sigmask(SIG_SETMASK, &act.sa_mask, NULL); (void)pthread_kill(pthread_self(), SIGABRT); } else { (void)_sigprocmask(SIG_SETMASK, &act.sa_mask, NULL); (void)kill(getpid(), SIGABRT); } usleep(TIMEOUT); /* give time for signal to happen */ /* * If SIGABRT was ignored, or caught and the handler returns, do * it again, only harder. */ __abort(); }
OK, this version unmasks in the handler. I cannot craft a reasoned argument for why any signals should be masked at this point - other than, perhaps, SIGABRT, though even that is debatable given that the process-level mask still masks it. So I can go either way on that one. I intentionally did not try to set the process-level mask, or only poke specific holes through the mask. I'd rather crash faster at this point than end up in a weird hung state. Even infinite signal-handler recursion crashes faster than that :-).
pingalingaling?
https://codereview.chromium.org/11645055/diff/14001/chrome/app/breakpad_mac.mm File chrome/app/breakpad_mac.mm (right): https://codereview.chromium.org/11645055/diff/14001/chrome/app/breakpad_mac.m... chrome/app/breakpad_mac.mm:142: // The OSX abort() implementation masks all signals for the process, You may want to refer to the implementation by reference. In 10.8.3 it’s Libc-825.26/stdlib/FreeBSD/abort.c. https://codereview.chromium.org/11645055/diff/14001/chrome/app/breakpad_mac.m... chrome/app/breakpad_mac.mm:145: // and SIGSTOP can be delivered. Don’t you think you should keep SIGABRT masked? https://codereview.chromium.org/11645055/diff/14001/chrome/app/breakpad_mac.m... chrome/app/breakpad_mac.mm:148: // concert with Breakpad. ** Unknown exception behavior: 0 is telling. Breakpad prints that when it’s trying to forward an exception to a previously-installed exception handler, and it thinks it’s found one but it doesn’t know which of the various behaviors (“default”=identity, state, or state+identity) the previous handler expects. Because it was 0. Because there wasn’t really any previous handler. Breakpad should exit in this case, just like it does when it thinks it hasn’t found a previous handler. In fact, when the behavior is 0, it hasn’t found a handler at all. I bet that the handler port is 0 too. Why is this happening? Breakpad’s loop through previous handlers just looks at masks in the task exception ports list, and it finds one that matches. It’s ignorant of the fact that the handler itself is null. It should check the handler port (and perhaps the behavior). What happens next? The EXC_BAD_ACCESS handler, in effect, fails. Exception handling is dropped on the floor. Mach lets you do this because, in theory, you just don’t have enough information to handle the exception right now, but maybe you saved off some of the data you need and you’ll come back and handle the exception properly later. Until then, the victim thread is stuck. Why is there no previous handler? The system’s own handler for EXC_BAD_ACCESS, which is what you’re trying to trigger here, is a host exception port, and Breakpad only looks at task exception ports. It would be nice if it could forward to a host port, but it can’t, because examining the host exception ports is restricted to root.
PTAL? https://codereview.chromium.org/11645055/diff/14001/chrome/app/breakpad_mac.mm File chrome/app/breakpad_mac.mm (right): https://codereview.chromium.org/11645055/diff/14001/chrome/app/breakpad_mac.m... chrome/app/breakpad_mac.mm:142: // The OSX abort() implementation masks all signals for the process, On 2013/05/22 14:23:01, Mark Mentovai wrote: > You may want to refer to the implementation by reference. In 10.8.3 it’s > Libc-825.26/stdlib/FreeBSD/abort.c. Done. https://codereview.chromium.org/11645055/diff/14001/chrome/app/breakpad_mac.m... chrome/app/breakpad_mac.mm:145: // and SIGSTOP can be delivered. On 2013/05/22 14:23:01, Mark Mentovai wrote: > Don’t you think you should keep SIGABRT masked? Done. https://codereview.chromium.org/11645055/diff/14001/chrome/app/breakpad_mac.m... chrome/app/breakpad_mac.mm:148: // concert with Breakpad. On 2013/05/22 14:23:01, Mark Mentovai wrote: > ** Unknown exception behavior: 0 > > is telling. Breakpad prints that when... OK, so I'll drop the TODO, but I'm not going to add commentary because I can't honestly think of a way to not be mis-leading about it.
LGTM
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/shess@chromium.org/11645055/20001
Retried try job too often on chromium_presubmit for step(s) presubmit http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=chromium_p...
thakis for chrome/OWNERS review, since cpu for chrome/app/OWNERS seems less applicable to this change.
Apologies, but autocomplete gave me @google.com over @chromium.org.
lgtm, but: https://chromiumcodereview.appspot.com/11645055/diff/20001/chrome/app/breakpa... File chrome/app/breakpad_mac.mm (right): https://chromiumcodereview.appspot.com/11645055/diff/20001/chrome/app/breakpa... chrome/app/breakpad_mac.mm:298: // intercept. Why doesn't breakpad do that?
Short answer: Breakpad hooks Mach exceptions, which are triggered by hardware traps. SIGABRT is generated in software. No hardware trap, no Mach exception, no Breakpad catchery.
thanks https://chromiumcodereview.appspot.com/11645055/diff/20001/chrome/app/breakpa... File chrome/app/breakpad_mac.mm (right): https://chromiumcodereview.appspot.com/11645055/diff/20001/chrome/app/breakpa... chrome/app/breakpad_mac.mm:298: // intercept. On 2013/05/23 21:27:50, Nico wrote: > Why doesn't breakpad do that? One day on the bug, Mark said "Mac Breakpad doesn’t and shouldn’t deal with things in terms of POSIX signals." [insert picture of doggie scientist here.]
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/shess@chromium.org/11645055/20001
Message was sent while issue was closed.
Change committed as 201978 |