543 lines
		
	
	
		
			24 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
			
		
		
	
	
			543 lines
		
	
	
		
			24 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
This document describes Linux ptrace implementation in Linux kernels
 | 
						|
version 3.0.0. (Update this notice if you update the document
 | 
						|
to reflect newer kernels).
 | 
						|
 | 
						|
 | 
						|
		Ptrace userspace API.
 | 
						|
 | 
						|
Ptrace API (ab)uses standard Unix parent/child signaling over waitpid.
 | 
						|
An unfortunate effect of it is that resulting API is complex and has
 | 
						|
subtle quirks. This document aims to describe these quirks.
 | 
						|
 | 
						|
Debugged processes (tracees) first need to be attached to the debugging
 | 
						|
process (tracer). Attachment and subsequent commands are per-thread: in
 | 
						|
multi-threaded process, every thread can be individually attached to a
 | 
						|
(potentially different) tracer, or left not attached and thus not
 | 
						|
debugged. Therefore, "tracee" always means "(one) thread", never "a
 | 
						|
(possibly multi-threaded) process". Ptrace commands are always sent to
 | 
						|
a specific tracee using ptrace(PTRACE_foo, pid, ...), where pid is a
 | 
						|
TID of the corresponding Linux thread.
 | 
						|
 | 
						|
After attachment, each tracee can be in two states: running or stopped.
 | 
						|
 | 
						|
There are many kinds of states when tracee is stopped, and in ptrace
 | 
						|
discussions they are often conflated. Therefore, it is important to use
 | 
						|
precise terms.
 | 
						|
 | 
						|
In this document, any stopped state in which tracee is ready to accept
 | 
						|
ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can
 | 
						|
be further subdivided into signal-delivery-stop, group-stop,
 | 
						|
syscall-stop and so on. They are described in detail later.
 | 
						|
 | 
						|
 | 
						|
	1.x Death under ptrace.
 | 
						|
 | 
						|
When a (possibly multi-threaded) process receives a killing signal (a
 | 
						|
signal set to SIG_DFL and whose default action is to kill the process),
 | 
						|
all threads exit. Tracees report their death to the tracer(s). This is
 | 
						|
not a ptrace-stop (because tracer can't query tracee status such as
 | 
						|
register contents, cannot restart tracee etc) but the notification
 | 
						|
about this event is delivered through waitpid API similarly to
 | 
						|
ptrace-stop.
 | 
						|
 | 
						|
Note that killing signal will first cause signal-delivery-stop (on one
 | 
						|
tracee only), and only after it is injected by tracer (or after it was
 | 
						|
dispatched to a thread which isn't traced), death from signal will
 | 
						|
happen on ALL tracees within multi-threaded process.
 | 
						|
 | 
						|
SIGKILL operates similarly, with exceptions. No signal-delivery-stop is
 | 
						|
generated for SIGKILL and therefore tracer can't suppress it. SIGKILL
 | 
						|
kills even within syscalls (syscall-exit-stop is not generated prior to
 | 
						|
death by SIGKILL). The net effect is that SIGKILL always kills the
 | 
						|
process (all its threads), even if some threads of the process are
 | 
						|
ptraced.
 | 
						|
 | 
						|
Tracer can kill a tracee with ptrace(PTRACE_KILL, pid, 0, 0). This
 | 
						|
operation is deprecated, use kill/tgkill(SIGKILL) instead.
 | 
						|
 | 
						|
^^^ Oleg prefers to deprecate it instead of describing (and needing to
 | 
						|
support) PTRACE_KILL's quirks.
 | 
						|
 | 
						|
When tracee executes exit syscall, it reports its death to its tracer.
 | 
						|
Other threads are not affected.
 | 
						|
 | 
						|
When any thread executes exit_group syscall, every tracee in its thread
 | 
						|
group reports its death to its tracer.
 | 
						|
 | 
						|
If PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen
 | 
						|
before actual death. This applies to exits on exit syscall, group_exit
 | 
						|
syscall, signal deaths (except SIGKILL), and when threads are torn down
 | 
						|
on execve in multi-threaded process.
 | 
						|
 | 
						|
Tracer cannot assume that ptrace-stopped tracee exists. There are many
 | 
						|
scenarios when tracee may die while stopped (such as SIGKILL).
 | 
						|
Therefore, tracer must always be prepared to handle ESRCH error on any
 | 
						|
ptrace operation. Unfortunately, the same error is returned if tracee
 | 
						|
exists but is not ptrace-stopped (for commands which require stopped
 | 
						|
tracee), or if it is not traced by process which issued ptrace call.
 | 
						|
Tracer needs to keep track of stopped/running state, and interpret
 | 
						|
ESRCH as "tracee died unexpectedly" only if it knows that tracee has
 | 
						|
been observed to enter ptrace-stop. Note that there is no guarantee
 | 
						|
that waitpid(WNOHANG) will reliably report tracee's death status if
 | 
						|
ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead.
 | 
						|
IOW: tracee may be "not yet fully dead" but already refusing ptrace ops.
 | 
						|
 | 
						|
Tracer can not assume that tracee ALWAYS ends its life by reporting
 | 
						|
WIFEXITED(status) or WIFSIGNALED(status).
 | 
						|
 | 
						|
??? or can it? Do we include such a promise into ptrace API?
 | 
						|
 | 
						|
 | 
						|
	1.x Stopped states.
 | 
						|
 | 
						|
When running tracee enters ptrace-stop, it notifies its tracer using
 | 
						|
waitpid API. Tracer should use waitpid family of syscalls to wait for
 | 
						|
tracee to stop. Most of this document assumes that tracer waits with:
 | 
						|
 | 
						|
	pid = waitpid(pid_or_minus_1, &status, __WALL);
 | 
						|
 | 
						|
Ptrace-stopped tracees are reported as returns with pid > 0 and
 | 
						|
WIFSTOPPED(status) == true.
 | 
						|
 | 
						|
??? Do we require __WALL usage, or will just using 0 be ok? Are the
 | 
						|
rules different if user wants to use waitid? Will waitid require
 | 
						|
WEXITED?
 | 
						|
 | 
						|
__WALL value does not include WSTOPPED and WEXITED bits, but implies
 | 
						|
their functionality.
 | 
						|
 | 
						|
Setting of WCONTINUED bit in waitpid flags is not recommended: the
 | 
						|
continued state is per-process and consuming it can confuse real parent
 | 
						|
of the tracee.
 | 
						|
 | 
						|
Use of WNOHANG bit in waitpid flags may cause waitpid return 0 ("no
 | 
						|
wait results available yet") even if tracer knows there should be a
 | 
						|
notification. Example: kill(tracee, SIGKILL); waitpid(tracee, &status,
 | 
						|
__WALL | WNOHANG);
 | 
						|
 | 
						|
??? waitid usage? WNOWAIT?
 | 
						|
 | 
						|
??? describe how wait notifications queue (or not queue)
 | 
						|
 | 
						|
The following kinds of ptrace-stops exist: signal-delivery-stops,
 | 
						|
group-stop, PTRACE_EVENT stops, syscall-stops [, SINGLESTEP, SYSEMU,
 | 
						|
SYSEMU_SINGLESTEP]. They all are reported as waitpid result with
 | 
						|
WIFSTOPPED(status) == true. They may be differentiated by checking
 | 
						|
(status >> 8) value, and if looking at (status >> 8) value doesn't
 | 
						|
resolve ambiguity, by querying PTRACE_GETSIGINFO. (Note:
 | 
						|
WSTOPSIG(status) macro returns ((status >> 8) & 0xff) value).
 | 
						|
 | 
						|
 | 
						|
	1.x.x Signal-delivery-stop
 | 
						|
 | 
						|
When (possibly multi-threaded) process receives any signal except
 | 
						|
SIGKILL, kernel selects a thread which handles the signal (if signal is
 | 
						|
generated with t[g]kill, thread selection is done by user). If selected
 | 
						|
thread is traced, it enters signal-delivery-stop. By this point, signal
 | 
						|
is not yet delivered to the process, and can be suppressed by tracer.
 | 
						|
If tracer doesn't suppress the signal, it passes signal to tracee in
 | 
						|
the next ptrace request. This second step of signal delivery is called
 | 
						|
"signal injection" in this document. Note that if signal is blocked,
 | 
						|
signal-delivery-stop doesn't happen until signal is unblocked, with the
 | 
						|
usual exception that SIGSTOP can't be blocked.
 | 
						|
 | 
						|
Signal-delivery-stop is observed by tracer as waitpid returning with
 | 
						|
WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If
 | 
						|
WSTOPSIG(status) == SIGTRAP, this may be a different kind of
 | 
						|
ptrace-stop - see "Syscall-stops" and "execve" sections below for
 | 
						|
details. If WSTOPSIG(status) == stopping signal, this may be a
 | 
						|
group-stop - see below.
 | 
						|
 | 
						|
 | 
						|
	1.x.x Signal injection and suppression.
 | 
						|
 | 
						|
After signal-delivery-stop is observed by tracer, tracer should restart
 | 
						|
tracee with
 | 
						|
 | 
						|
	ptrace(PTRACE_rest, pid, 0, sig)
 | 
						|
 | 
						|
call, where PTRACE_rest is one of the restarting ptrace ops. If sig is
 | 
						|
0, then signal is not delivered. Otherwise, signal sig is delivered.
 | 
						|
This operation is called "signal injection" in this document, to
 | 
						|
distinguish it from signal-delivery-stop.
 | 
						|
 | 
						|
Note that sig value may be different from WSTOPSIG(status) value -
 | 
						|
tracer can cause a different signal to be injected.
 | 
						|
 | 
						|
Note that suppressed signal still causes syscalls to return
 | 
						|
prematurely. Kernel should always restart the syscall in this case:
 | 
						|
tracer would observe a new syscall-enter-stop for the same syscall,
 | 
						|
or, in case of syscalls returning ERESTART_RESTARTBLOCK,
 | 
						|
tracer would observe a syscall-enter-stop for restart_syscall(2)
 | 
						|
syscall. There may still be bugs in this area which cause some syscalls
 | 
						|
to instead return with -EINTR even though no observable signal
 | 
						|
was injected to the tracee.
 | 
						|
 | 
						|
This is a cause of confusion among ptrace users. One typical scenario
 | 
						|
is that tracer observes group-stop, mistakes it for
 | 
						|
signal-delivery-stop, restarts tracee with ptrace(PTRACE_rest, pid, 0,
 | 
						|
stopsig) with the intention of injecting stopsig, but stopsig gets
 | 
						|
ignored and tracee continues to run.
 | 
						|
 | 
						|
SIGCONT signal has a side effect of waking up (all threads of)
 | 
						|
group-stopped process. This side effect happens before
 | 
						|
signal-delivery-stop. Tracer can't suppress this side-effect (it can
 | 
						|
only suppress signal injection, which only causes SIGCONT handler to
 | 
						|
not be executed in the tracee, if such handler is installed). In fact,
 | 
						|
waking up from group-stop may be followed by signal-delivery-stop for
 | 
						|
signal(s) *other than* SIGCONT, if they were pending when SIGCONT was
 | 
						|
delivered. IOW: SIGCONT may be not the first signal observed by the
 | 
						|
tracee after it was sent.
 | 
						|
 | 
						|
Stopping signals cause (all threads of) process to enter group-stop.
 | 
						|
This side effect happens after signal injection, and therefore can be
 | 
						|
suppressed by tracer.
 | 
						|
 | 
						|
PTRACE_GETSIGINFO can be used to retrieve siginfo_t structure which
 | 
						|
corresponds to delivered signal. PTRACE_SETSIGINFO may be used to
 | 
						|
modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t,
 | 
						|
si_signo field and sig parameter in restarting command must match,
 | 
						|
otherwise the result is undefined.
 | 
						|
 | 
						|
 | 
						|
	1.x.x Group-stop
 | 
						|
 | 
						|
When a (possibly multi-threaded) process receives a stopping signal,
 | 
						|
all threads stop. If some threads are traced, they enter a group-stop.
 | 
						|
Note that stopping signal will first cause signal-delivery-stop (on one
 | 
						|
tracee only), and only after it is injected by tracer (or after it was
 | 
						|
dispatched to a thread which isn't traced), group-stop will be
 | 
						|
initiated on ALL tracees within multi-threaded process. As usual, every
 | 
						|
tracee reports its group-stop separately to corresponding tracer.
 | 
						|
 | 
						|
Group-stop is observed by tracer as waitpid returning with
 | 
						|
WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result
 | 
						|
is returned by some other classes of ptrace-stops, therefore the
 | 
						|
recommended practice is to perform
 | 
						|
 | 
						|
	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
 | 
						|
 | 
						|
call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP,
 | 
						|
SIGTTIN or SIGTTOU - only these four signals are stopping signals. If
 | 
						|
tracer sees something else, it can't be group-stop. Otherwise, tracer
 | 
						|
needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails with
 | 
						|
EINVAL, then it is definitely a group-stop. (Other failure codes are
 | 
						|
possible, such as ESRCH "no such process" if SIGKILL killed the tracee).
 | 
						|
 | 
						|
As of kernel 2.6.38, after tracer sees tracee ptrace-stop and until it
 | 
						|
restarts or kills it, tracee will not run, and will not send
 | 
						|
notifications (except SIGKILL death) to tracer, even if tracer enters
 | 
						|
into another waitpid call.
 | 
						|
 | 
						|
Currently, it causes a problem with transparent handling of stopping
 | 
						|
signals: if tracer restarts tracee after group-stop, SIGSTOP is
 | 
						|
effectively ignored: tracee doesn't remain stopped, it runs. If tracer
 | 
						|
doesn't restart tracee before entering into next waitpid, future
 | 
						|
SIGCONT will not be reported to the tracer. Which would make SIGCONT to
 | 
						|
have no effect.
 | 
						|
 | 
						|
 | 
						|
	1.x.x PTRACE_EVENT stops
 | 
						|
 | 
						|
If tracer sets TRACE_O_TRACEfoo options, tracee will enter ptrace-stops
 | 
						|
called PTRACE_EVENT stops.
 | 
						|
 | 
						|
PTRACE_EVENT stops are observed by tracer as waitpid returning with
 | 
						|
WIFSTOPPED(status) == true, WSTOPSIG(status) == SIGTRAP. Additional bit
 | 
						|
is set in a higher byte of status word: value ((status >> 8) & 0xffff)
 | 
						|
will be (SIGTRAP | PTRACE_EVENT_foo << 8). The following events exist:
 | 
						|
 | 
						|
PTRACE_EVENT_VFORK - stop before return from vfork/clone+CLONE_VFORK.
 | 
						|
When tracee is continued after this, it will wait for child to
 | 
						|
exit/exec before continuing its execution (IOW: usual behavior on
 | 
						|
vfork).
 | 
						|
 | 
						|
PTRACE_EVENT_FORK - stop before return from fork/clone+SIGCHLD
 | 
						|
 | 
						|
PTRACE_EVENT_CLONE - stop before return from clone
 | 
						|
 | 
						|
PTRACE_EVENT_VFORK_DONE - stop before return from
 | 
						|
vfork/clone+CLONE_VFORK, but after vfork child unblocked this tracee by
 | 
						|
exiting or exec'ing.
 | 
						|
 | 
						|
For all four stops described above: stop occurs in parent, not in newly
 | 
						|
created thread. PTRACE_GETEVENTMSG can be used to retrieve new thread's
 | 
						|
tid.
 | 
						|
 | 
						|
PTRACE_EVENT_EXEC - stop before return from exec.
 | 
						|
 | 
						|
PTRACE_EVENT_EXIT - stop before exit (including death from exit_group),
 | 
						|
signal death, or exit caused by execve in multi-threaded process.
 | 
						|
PTRACE_GETEVENTMSG returns exit status. Registers can be examined
 | 
						|
(unlike when "real" exit happens). The tracee is still alive, it needs
 | 
						|
to be PTRACE_CONTed or PTRACE_DETACHed to finish exit.
 | 
						|
 | 
						|
PTRACE_GETSIGINFO on PTRACE_EVENT stops returns si_signo = SIGTRAP,
 | 
						|
si_code = (event << 8) | SIGTRAP.
 | 
						|
 | 
						|
 | 
						|
	1.x.x Syscall-stops
 | 
						|
 | 
						|
If tracee was restarted by PTRACE_SYSCALL, tracee enters
 | 
						|
syscall-enter-stop just prior to entering any syscall. If tracer
 | 
						|
restarts it with PTRACE_SYSCALL, tracee enters syscall-exit-stop when
 | 
						|
syscall is finished, or if it is interrupted by a signal. (That is,
 | 
						|
signal-delivery-stop never happens between syscall-enter-stop and
 | 
						|
syscall-exit-stop, it happens *after* syscall-exit-stop).
 | 
						|
 | 
						|
Other possibilities are that tracee may stop in a PTRACE_EVENT stop,
 | 
						|
exit (if it entered exit or exit_group syscall), be killed by SIGKILL,
 | 
						|
or die silently (if execve syscall happened in another thread).
 | 
						|
 | 
						|
Syscall-enter-stop and syscall-exit-stop are observed by tracer as
 | 
						|
waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) ==
 | 
						|
SIGTRAP. If PTRACE_O_TRACESYSGOOD option was set by tracer, then
 | 
						|
WSTOPSIG(status) == (SIGTRAP | 0x80).
 | 
						|
 | 
						|
Syscall-stops can be distinguished from signal-delivery-stop with
 | 
						|
SIGTRAP by querying PTRACE_GETSIGINFO: si_code <= 0 if sent by usual
 | 
						|
suspects like [tg]kill/sigqueue/etc; or = SI_KERNEL (0x80) if sent by
 | 
						|
kernel, whereas syscall-stops have si_code = SIGTRAP or (SIGTRAP |
 | 
						|
0x80). However, syscall-stops happen very often (twice per syscall),
 | 
						|
and performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat
 | 
						|
expensive.
 | 
						|
 | 
						|
Some architectures allow to distinguish them by examining registers.
 | 
						|
For example, on x86 rax = -ENOSYS in syscall-enter-stop. Since SIGTRAP
 | 
						|
(like any other signal) always happens *after* syscall-exit-stop, and
 | 
						|
at this point rax almost never contains -ENOSYS, SIGTRAP looks like
 | 
						|
"syscall-stop which is not syscall-enter-stop", IOW: it looks like a
 | 
						|
"stray syscall-exit-stop" and can be detected this way. But such
 | 
						|
detection is fragile and is best avoided.
 | 
						|
 | 
						|
Using PTRACE_O_TRACESYSGOOD option is a recommended method, since it is
 | 
						|
reliable and does not incur performance penalty.
 | 
						|
 | 
						|
Syscall-enter-stop and syscall-exit-stop are indistinguishable from
 | 
						|
each other by tracer. Tracer needs to keep track of the sequence of
 | 
						|
ptrace-stops in order to not misinterpret syscall-enter-stop as
 | 
						|
syscall-exit-stop or vice versa. The rule is that syscall-enter-stop is
 | 
						|
always followed by syscall-exit-stop, PTRACE_EVENT stop or tracee's
 | 
						|
death - no other kinds of ptrace-stop can occur in between.
 | 
						|
 | 
						|
If after syscall-enter-stop tracer uses restarting command other than
 | 
						|
PTRACE_SYSCALL, syscall-exit-stop is not generated.
 | 
						|
 | 
						|
PTRACE_GETSIGINFO on syscall-stops returns si_signo = SIGTRAP, si_code
 | 
						|
= SIGTRAP or (SIGTRAP | 0x80).
 | 
						|
 | 
						|
 | 
						|
	1.x.x SINGLESTEP, SYSEMU, SYSEMU_SINGLESTEP
 | 
						|
 | 
						|
??? document PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP
 | 
						|
 | 
						|
 | 
						|
	1.x Informational and restarting ptrace commands.
 | 
						|
 | 
						|
Most ptrace commands (all except ATTACH, TRACEME, KILL) require tracee
 | 
						|
to be in ptrace-stop, otherwise they fail with ESRCH.
 | 
						|
 | 
						|
When tracee is in ptrace-stop, tracer can read and write data to tracee
 | 
						|
using informational commands. They leave tracee in ptrace-stopped state:
 | 
						|
 | 
						|
longv = ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
 | 
						|
	ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
 | 
						|
	ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
 | 
						|
	ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
 | 
						|
	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
 | 
						|
	ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
 | 
						|
	ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
 | 
						|
	ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
 | 
						|
 | 
						|
Note that some errors are not reported. For example, setting siginfo
 | 
						|
may have no effect in some ptrace-stops, yet the call may succeed
 | 
						|
(return 0 and don't set errno).
 | 
						|
 | 
						|
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags) affects one tracee.
 | 
						|
Current flags are replaced. Flags are inherited by new tracees created
 | 
						|
and "auto-attached" via active PTRACE_O_TRACE[V]FORK or
 | 
						|
PTRACE_O_TRACECLONE options.
 | 
						|
 | 
						|
Another group of commands makes ptrace-stopped tracee run. They have
 | 
						|
the form:
 | 
						|
 | 
						|
	ptrace(PTRACE_cmd, pid, 0, sig);
 | 
						|
 | 
						|
where cmd is CONT, DETACH, SYSCALL, SINGLESTEP, SYSEMU, or
 | 
						|
SYSEMU_SINGLESTEP. If tracee is in signal-delivery-stop, sig is the
 | 
						|
signal to be injected. Otherwise, sig may be ignored.
 | 
						|
 | 
						|
 | 
						|
	1.x Attaching and detaching
 | 
						|
 | 
						|
A thread can be attached to tracer using ptrace(PTRACE_ATTACH, pid, 0,
 | 
						|
0) call. This also sends SIGSTOP to this thread. If tracer wants this
 | 
						|
SIGSTOP to have no effect, it needs to suppress it. Note that if other
 | 
						|
signals are concurrently sent to this thread during attach, tracer may
 | 
						|
see tracee enter signal-delivery-stop with other signal(s) first! The
 | 
						|
usual practice is to reinject these signals until SIGSTOP is seen, then
 | 
						|
suppress SIGSTOP injection. The design bug here is that attach and
 | 
						|
concurrent SIGSTOP are racing and SIGSTOP may be lost.
 | 
						|
 | 
						|
??? Describe how to attach to a thread which is already group-stopped.
 | 
						|
 | 
						|
Since attaching sends SIGSTOP and tracer usually suppresses it, this
 | 
						|
may cause stray EINTR return from the currently executing syscall in
 | 
						|
the tracee, as described in "signal injection and suppression" section.
 | 
						|
 | 
						|
ptrace(PTRACE_TRACEME, 0, 0, 0) request turns current thread into a
 | 
						|
tracee. It continues to run (doesn't enter ptrace-stop). A common
 | 
						|
practice is to follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and
 | 
						|
allow parent (which is our tracer now) to observe our
 | 
						|
signal-delivery-stop.
 | 
						|
 | 
						|
If PTRACE_O_TRACE[V]FORK or PTRACE_O_TRACECLONE options are in effect,
 | 
						|
then children created by (vfork or clone(CLONE_VFORK)), (fork or
 | 
						|
clone(SIGCHLD)) and (other kinds of clone) respectively are
 | 
						|
automatically attached to the same tracer which traced their parent.
 | 
						|
SIGSTOP is delivered to them, causing them to enter
 | 
						|
signal-delivery-stop after they exit syscall which created them.
 | 
						|
 | 
						|
Detaching of tracee is performed by ptrace(PTRACE_DETACH, pid, 0, sig).
 | 
						|
PTRACE_DETACH is a restarting operation, therefore it requires tracee
 | 
						|
to be in ptrace-stop. If tracee is in signal-delivery-stop, signal can
 | 
						|
be injected. Othervice, sig parameter may be silently ignored.
 | 
						|
 | 
						|
If tracee is running when tracer wants to detach it, the usual solution
 | 
						|
is to send SIGSTOP (using tgkill, to make sure it goes to the correct
 | 
						|
thread), wait for tracee to stop in signal-delivery-stop for SIGSTOP
 | 
						|
and then detach it (suppressing SIGSTOP injection). Design bug is that
 | 
						|
this can race with concurrent SIGSTOPs. Another complication is that
 | 
						|
tracee may enter other ptrace-stops and needs to be restarted and
 | 
						|
waited for again, until SIGSTOP is seen. Yet another complication is to
 | 
						|
be sure that tracee is not already ptrace-stopped, because no signal
 | 
						|
delivery happens while it is - not even SIGSTOP.
 | 
						|
 | 
						|
??? Describe how to detach from a group-stopped tracee so that it
 | 
						|
    doesn't run, but continues to wait for SIGCONT.
 | 
						|
 | 
						|
If tracer dies, all tracees are automatically detached and restarted,
 | 
						|
unless they were in group-stop. Handling of restart from group-stop is
 | 
						|
currently buggy, but "as planned" behavior is to leave tracee stopped
 | 
						|
and waiting for SIGCONT. If tracee is restarted from
 | 
						|
signal-delivery-stop, pending signal is injected.
 | 
						|
 | 
						|
 | 
						|
	1.x execve under ptrace.
 | 
						|
 | 
						|
During execve, kernel destroys all other threads in the process, and
 | 
						|
resets execve'ing thread tid to tgid (process id). This looks very
 | 
						|
confusing to tracers:
 | 
						|
 | 
						|
All other threads stop in PTRACE_EXIT stop, if requested by active
 | 
						|
ptrace option. Then all other threads except thread group leader report
 | 
						|
death as if they exited via exit syscall with exit code 0. Then
 | 
						|
PTRACE_EVENT_EXEC stop happens, if requested by active ptrace option
 | 
						|
(on which tracee - leader? execve-ing one?).
 | 
						|
 | 
						|
The execve-ing tracee changes its pid while it is in execve syscall.
 | 
						|
(Remember, under ptrace 'pid' returned from waitpid, or fed into ptrace
 | 
						|
calls, is tracee's tid). That is, pid is reset to process id, which
 | 
						|
coincides with thread group leader tid.
 | 
						|
 | 
						|
If thread group leader has reported its death by this time, for tracer
 | 
						|
this looks like dead thread leader "reappears from nowhere". If thread
 | 
						|
group leader was still alive, for tracer this may look as if thread
 | 
						|
group leader returns from a different syscall than it entered, or even
 | 
						|
"returned from syscall even though it was not in any syscall". If
 | 
						|
thread group leader was not traced (or was traced by a different
 | 
						|
tracer), during execve it will appear as if it has become a tracee of
 | 
						|
the tracer of execve'ing tracee. All these effects are the artifacts of
 | 
						|
pid change.
 | 
						|
 | 
						|
PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this
 | 
						|
case. It enables PTRACE_EVENT_EXEC stop which occurs before execve
 | 
						|
syscall return.
 | 
						|
 | 
						|
Pid change happens before PTRACE_EVENT_EXEC stop, not after.
 | 
						|
 | 
						|
When tracer receives PTRACE_EVENT_EXEC stop notification, it is
 | 
						|
guaranteed that except this tracee and thread group leader, no other
 | 
						|
threads from the process are alive.
 | 
						|
 | 
						|
On receiving this notification, tracer should clean up all its internal
 | 
						|
data structures about all threads of this process, and retain only one
 | 
						|
data structure, one which describes single still running tracee, with
 | 
						|
pid = tgid = process id.
 | 
						|
 | 
						|
Currently, there is no way to retrieve former pid of execve-ing tracee.
 | 
						|
If tracer doesn't keep track of its tracees' thread group relations, it
 | 
						|
may be unable to know which tracee  execve-ed and therefore no longer
 | 
						|
exists under old pid due to pid change.
 | 
						|
 | 
						|
Example: two threads execve at the same time:
 | 
						|
 | 
						|
  ** we get syscall-entry-stop in thread 1: **
 | 
						|
 PID1 execve("/bin/foo", "foo" <unfinished ...>
 | 
						|
  ** we issue PTRACE_SYSCALL for thread 1 **
 | 
						|
  ** we get syscall-entry-stop in thread 2: **
 | 
						|
 PID2 execve("/bin/bar", "bar" <unfinished ...>
 | 
						|
  ** we issue PTRACE_SYSCALL for thread 2 **
 | 
						|
  ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
 | 
						|
  ** we get syscall-exit-stop for PID0: **
 | 
						|
 PID0 <... execve resumed> )             = 0
 | 
						|
 | 
						|
In this situation there is no way to know which execve succeeded.
 | 
						|
 | 
						|
If PTRACE_O_TRACEEXEC option is NOT in effect for the execve'ing
 | 
						|
tracee, kernel delivers an extra SIGTRAP to tracee after execve syscall
 | 
						|
returns. This is an ordinary signal (similar to one which can be
 | 
						|
generated by "kill -TRAP"), not a special kind of ptrace-stop.
 | 
						|
GETSIGINFO on it has si_code = 0 (SI_USER). It can be blocked by signal
 | 
						|
mask, and thus can happen (much) later.
 | 
						|
 | 
						|
Usually, tracer (for example, strace) would not want to show this extra
 | 
						|
post-execve SIGTRAP signal to the user, and would suppress its delivery
 | 
						|
to the tracee (if SIGTRAP is set to SIG_DFL, it is a killing signal).
 | 
						|
However, determining *which* SIGTRAP to suppress is not easy. Setting
 | 
						|
PTRACE_O_TRACEEXEC option and thus suppressing this extra SIGTRAP is
 | 
						|
the recommended approach.
 | 
						|
 | 
						|
 | 
						|
	1.x Real parent
 | 
						|
 | 
						|
Ptrace API (ab)uses standard Unix parent/child signaling over waitpid.
 | 
						|
This used to cause real parent of the process to stop receiving several
 | 
						|
kinds of waitpid notifications when child process is traced by some
 | 
						|
other process.
 | 
						|
 | 
						|
Many of these bugs have been fixed, but as of 2.6.38 several still
 | 
						|
exist.
 | 
						|
 | 
						|
As of 2.6.38, the following is believed to work correctly:
 | 
						|
 | 
						|
- exit/death by signal is reported first to tracer, then, when tracer
 | 
						|
consumes waitpid result, to real parent (to real parent only when the
 | 
						|
whole multi-threaded process exits). If they are the same process, the
 | 
						|
report is sent only once.
 | 
						|
 | 
						|
 | 
						|
	1.x Known bugs
 | 
						|
 | 
						|
Following bugs still exist:
 | 
						|
 | 
						|
Group-stop notifications are sent to tracer, but not to real parent.
 | 
						|
Last confirmed on 2.6.38.6.
 | 
						|
 | 
						|
If thread group leader is traced and exits by calling exit syscall,
 | 
						|
PTRACE_EVENT_EXIT stop will happen for it (if requested), but subsequent
 | 
						|
WIFEXITED notification will not be delivered until all other threads
 | 
						|
exit. As explained above, if one of other threads execve's, thread
 | 
						|
group leader death will *never* be reported. If execve-ed thread is not
 | 
						|
traced by this tracer, tracer will never know that execve happened.
 | 
						|
 | 
						|
??? need to test this scenario
 | 
						|
 | 
						|
One possible workaround is to detach thread group leader instead of
 | 
						|
restarting it in this case. Last confirmed on 2.6.38.6.
 | 
						|
 | 
						|
SIGKILL signal may still cause PTRACE_EVENT_EXIT stop before actual
 | 
						|
signal death. This may be changed in the future - SIGKILL is meant to
 | 
						|
always immediately kill tasks even under ptrace. Last confirmed on
 | 
						|
2.6.38.6.
 |