# heapprofd: Sampling for Memory Profiles

_tomlewin, fmayer **·** 2021-04-14_

## Background

A heap profiler associates memory allocations with the callstacks on which they
happen ([example visualization]). It is prohibitively expensive to handle every
allocation done by a program, so the [Android Heap Profiler] employs a sampling
approach that handles a statistically meaningful subset of allocations. This
document explores the statistical properties thereof.
## Conceptual motivation

Conceptually the sampling is implemented such that each byte has some
probability p of being sampled. In theory we can think of each byte as
undergoing a Bernoulli trial. The reason we use a random sampling approach, as
opposed to taking every nth byte, is that there may be regular allocation
patterns in the code that a correspondingly regular sampling would miss.

To scale the sampled bytes back up to an estimate of the true total, we
multiply by 1 / p: if we sample a byte with probability 10%, then each sampled
byte represents 10 allocated bytes.
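
To make this concrete, the deliberately naive sketch below implements the
conceptual model literally: every byte undergoes a Bernoulli trial and the
sampled bytes are scaled back up by 1 / p. The function name and signature are
made up for illustration; this is not how heapprofd does it, precisely because
touching every byte is prohibitively expensive.

```cpp
// Naive sketch of the conceptual per-byte model (NOT the real implementation):
// every byte undergoes a Bernoulli trial with probability p, and the sampled
// bytes are scaled back up by 1 / p.
#include <cstdint>
#include <random>

uint64_t EstimateAllocationSize(uint64_t alloc_size, double p,
                                std::default_random_engine& rng) {
  std::bernoulli_distribution byte_is_sampled(p);
  uint64_t sampled_bytes = 0;
  for (uint64_t i = 0; i < alloc_size; ++i) {
    if (byte_is_sampled(rng))
      ++sampled_bytes;
  }
  // Scale by 1 / p: with p = 10%, each sampled byte represents 10 bytes.
  return static_cast<uint64_t>(static_cast<double>(sampled_bytes) / p);
}
```

The implementation described below achieves the same statistical behaviour
with only one exponential draw per sample instead of one trial per byte.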
## Implementation

In practice, the [algorithm] works as follows:

1. We look at an allocation.

2. If the size of the allocation is large enough that there’s a greater than
   99% chance of it being sampled at least once, we return the size of the
   allocation directly. This is a performance optimization.

3. If the size of the allocation is smaller, we compute the number of times we
   would draw a sample if we sampled bytes at the given sampling rate:

   * In practice we do this by keeping track of the arrival time of the next
     sample. When an allocation happens, we subtract its size from the arrival
     time of the next sample and check whether we have brought it below zero.
     We then repeatedly draw from the exponential distribution (which is the
     interarrival time of a Poisson process) until the arrival time is brought
     back above zero. The number of draws we needed is the number of samples
     the allocation should count as.

   * We multiply the number of samples we drew within the allocation by the
     sampling rate to get an estimate of the size of the allocation.

We could instead draw directly from the Poisson / binomial distribution to see
how many samples we get for a given allocation size, but this is not as
computationally efficient: because most allocations are small and our sampling
rate is low, most allocations are not sampled at all, and with the
exponential-draw approach a non-sample only costs us decrementing a counter.
Sampling from the Poisson for every allocation (rather than drawing from the
exponential once per sample) is more expensive. A sketch of this scheme is
shown below.
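
The following is a hedged sketch of the counter-based scheme described by the
steps above; the class name, members, and the exact passthrough threshold are
illustrative rather than lifted from the actual [algorithm] source.

```cpp
// Illustrative sketch of the sampling scheme (not the real sampler.h).
#include <cmath>
#include <cstdint>
#include <random>

class PoissonSampler {
 public:
  explicit PoissonSampler(uint64_t sampling_rate)
      : rate_(sampling_rate),
        // -ln(0.01) * rate (~4.6x the rate): above this size an allocation is
        // sampled at least once with > 99% probability, so we can return its
        // exact size (step 2 above).
        passthrough_threshold_(static_cast<uint64_t>(
            -std::log(0.01) * static_cast<double>(sampling_rate))),
        exp_dist_(1.0 / static_cast<double>(sampling_rate)),
        bytes_until_sample_(NextInterval()) {}

  // Returns the number of bytes this allocation should be recorded as;
  // 0 means the allocation was not sampled.
  uint64_t SampleSize(uint64_t alloc_size) {
    if (alloc_size >= passthrough_threshold_)
      return alloc_size;  // Step 2: fast path for large allocations.
    // Step 3: count how many exponential interarrivals fall inside this
    // allocation. For most (unsampled) allocations this is just a decrement.
    bytes_until_sample_ -= static_cast<int64_t>(alloc_size);
    uint64_t num_samples = 0;
    while (bytes_until_sample_ <= 0) {
      ++num_samples;
      bytes_until_sample_ += NextInterval();
    }
    return num_samples * rate_;  // Scale back up by the sampling rate.
  }

 private:
  int64_t NextInterval() {
    // Interarrival times of a Poisson process are exponentially distributed.
    return static_cast<int64_t>(exp_dist_(rng_)) + 1;
  }

  uint64_t rate_;
  uint64_t passthrough_threshold_;
  std::default_random_engine rng_;
  std::exponential_distribution<double> exp_dist_;
  int64_t bytes_until_sample_;
};
```

A caller would invoke SampleSize() on every allocation and record the
callstack with the returned size whenever it is non-zero.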
## Theoretical performance

If we sample bytes with probability p (i.e. at a sampling rate of 1 / p bytes
per sample), then to set p reasonably we need to know the probability of
sampling an allocation of any given size, as well as the expected error when we
do sample it. If we set p = 1 we sample everything and have no error. Lowering
p (raising the sampling rate) costs us coverage and accuracy.

### Sampling probabilities

We will sample an allocation with probability P(Exponential(sampling rate) <
allocation size). This is the probability that, sampling bytes at our sampling
rate, we do not miss every byte within the allocation, i.e. that at least one
byte of it is sampled.
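
Spelled out (this closed form follows from the model above, with the
exponential parameterised by its mean, the sampling rate in bytes):

$$
P(\text{sampled at least once}) = P\big(\mathrm{Exp}(\text{rate}) < \text{size}\big) = 1 - e^{-\text{size}/\text{rate}}
$$

For example, at a 512KiB sampling rate a 1.5MiB allocation is sampled with
probability 1 - e^(-1536/512) = 1 - e^(-3) ≈ 95%, which matches the chart
below.
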
Because the exponential distribution is memoryless, for the purposes of these
probabilities we can add together allocations of the same type even if they
occur at different times. This means that if we have an allocation of 1MB and
then another of 1MB, the probability of taking at least one sample is the same
as the probability of taking at least one sample of a contiguous 2MB
allocation.

We can see from the chart below that if we 16X our sampling rate from 32KiB to
512KiB we still have a 95% chance of sampling anything above 1.5MiB. If we 64X
it to 2048KiB we still have an 80% chance to sample anything above 3.5MiB.


### Expected errors

We can also consider the expected error we’ll make at given allocation sizes
and sampling rates. As before, it doesn’t matter where the allocations happen:
two allocations of the same type occurring at different times have the same
statistical properties as a single allocation whose size is their sum, because
the exponential distribution we use is memoryless.

For expected error we report something like the [mean absolute percentage
error]: we look at how wrong, in percent, we might be relative to the true
allocation size, and weight these deviations by their probability of
occurrence. This estimator has some issues (it is biased such that
underestimates are penalised more heavily), but it is easy to reason about and
useful as a gauge of how wrong we might be on average. We recommend reading it
as analogous to a standard deviation.
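
As a rough way to reproduce this kind of figure, the sketch below estimates the
expected error for a single allocation size by simulation, assuming samples
arrive as a Poisson process with exponential interarrivals. It is not the code
used to produce the charts; the function name and parameters are illustrative.

```cpp
// Monte Carlo estimate of the mean absolute percentage error for a single
// allocation of `alloc_size` bytes sampled at `rate` bytes per sample.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>

double ExpectedError(double alloc_size, double rate, int trials = 100000) {
  std::default_random_engine rng(42);
  std::exponential_distribution<double> interarrival(1.0 / rate);
  double total_error = 0;
  for (int i = 0; i < trials; ++i) {
    // Walk exponential interarrivals across the allocation and count samples.
    uint64_t samples = 0;
    for (double pos = interarrival(rng); pos < alloc_size;
         pos += interarrival(rng)) {
      ++samples;
    }
    double estimate = static_cast<double>(samples) * rate;
    total_error += std::fabs(estimate - alloc_size) / alloc_size;
  }
  return total_error / trials;  // MAPE, as a fraction of the true size.
}

int main() {
  // E.g. a 1MiB allocation at a 32KiB sampling rate.
  std::printf("expected error: %.3f\n",
              ExpectedError(1024.0 * 1024.0, 32.0 * 1024.0));
}
```

Restricting the average to trials with at least one sample gives the
conditional expected error discussed below.
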
The charts below show both the expected error and the conditional expected
error, i.e. the expected error given that we take at least one sample within
the allocation. There is periodicity in both, in line with the sampling rate
used. The thing to take away is that, while the estimates are unbiased such
that their expected value is equal to their true value, substantial errors are
still possible.



Assuming that we take at least one sample of an allocation, what error might we
expect? We can answer that using the following chart, which shows the expected
error conditional on at least one sample being taken. This is the error we can
expect for the allocations that actually end up in our heap profile. It’s
important to emphasise that this expected error methodology is not exact, and
also that the underlying sampling remains unbiased: the expected value is the
true value. This should be read as a measure akin to a standard deviation.


## Performance Considerations
### STL Distributions

Benchmarking of the STL distributions on a Pixel 4 reinforces our approach of
approximating the geometric distribution with an exponential distribution, as
its performance is ~8x better ([full results]).

| Draw (p = 1 / 32000)     | ARM64                     | ARM32                     |
| ------------------------ | ------------------------- | ------------------------- |
| Exponential distribution | 21.3ns (sd 0.066ns)       | 33.2ns (sd 0.011ns)       |
| Geometric distribution   | 169ns (sd 0.023ns), 7.93x | 222ns (sd 0.118ns), 6.69x |

All standard deviations are negligibly small (smaller than a CPU cycle).

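For reference, the two draws being compared look like this in code. This is an
illustrative snippet, not the benchmark harness behind [full results].

```cpp
#include <cstdint>
#include <random>

int main() {
  std::default_random_engine rng;
  // Exponential approximation used by heapprofd: mean of 32000 bytes between
  // samples, i.e. lambda = 1 / 32000.
  std::exponential_distribution<double> exp_dist(1.0 / 32000.0);
  double bytes_to_next_sample = exp_dist(rng);
  // Exact geometric draw, roughly 7-8x slower per draw in the results above.
  std::geometric_distribution<int64_t> geo_dist(1.0 / 32000.0);
  int64_t exact_bytes_to_next_sample = geo_dist(rng);
  (void)bytes_to_next_sample;
  (void)exact_bytes_to_next_sample;
  return 0;
}
```
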
## Appendix

### Improvements made

We previously (before Android 12) returned the size of the allocation
accurately and immediately if the allocation size was greater than our sampling
rate. This had several impacts.

The most obvious impact is that with the old approach we would expect to sample
an allocation equal in size to our sampling rate only ~63% of the time, rather
than 100%. This creates a discontinuity in coverage between an allocation one
byte smaller than our sampling rate and one a byte larger. The result is still
unbiased from a statistical perspective, but it’s important to note.
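
The ~63% figure follows from the sampling probability derived earlier: for an
allocation exactly as large as the sampling rate,

$$
P(\text{sampled at least once}) = 1 - e^{-\text{rate}/\text{rate}} = 1 - e^{-1} \approx 0.63
$$
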
Another unusual impact is that the sampling rate depends not only on the size
of the memory being used, but also on how it’s allocated. If our sampling rate
were 10KB and we have an allocation that’s 10KB, we sample it with certainty.
If instead that 10KB is split among two 5KB allocations, we sample it with
probability 63%. Again this doesn’t bias our results, but it means that our
measurements of memory where there are many small allocations might be noisier
than where there are a few large allocations, even if total memory used is the
same. If we didn’t return allocations greater than the sampling rate every
time, the memorylessness property of the exponential distribution would make
our method insensitive to how the memory is split amongst allocations, which
seems a desirable property.

We therefore altered the cutoff at which we simply return the allocation size.
Previously, as discussed, the cutoff was at the sampling rate, which led to a
discontinuity. Now the cutoff is the size at which we have a >99% chance of
sampling an allocation at least once, given the sampling rate we’re using. This
resolves the above issues.
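
For reference, that cutoff can be derived from the sampling probability above;
it works out to roughly 4.6 times the sampling rate:

$$
1 - e^{-\text{cutoff}/\text{rate}} > 0.99 \iff \text{cutoff} > -\ln(0.01)\cdot\text{rate} \approx 4.6\cdot\text{rate}
$$
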
[algorithm]: https://cs.android.com/android/platform/superproject/+/master:external/perfetto/src/profiling/memory/sampler.h
[example visualization]: /docs/images/syssrv-apk-assets-two.png
[Android Heap Profiler]: /docs/design-docs/heapprofd-design
[mean absolute percentage error]: https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
[full results]: https://gist.github.com/fmayer/3aafcaf58f8db09714ba09aa4ac2b1ac