In 2024 we undertook an effort to optimize the time taken to navigate to our Select Payment Method screen. This is one of our most common navigations — it’s critical that it completes in a timely manner.
We found that devices with USB barcode scanners occasionally had egregiously bad performance. We attached USB barcode scanners to our development devices and captured Perfetto traces while simulating payments.
We weren’t able to reproduce the egregiously bad performance every time, but we found instances where our main thread was blocked on a somewhat mysterious Perfetto span named IncrementalDisableThreadFlip, which isn't part of our own codebase, hinting at a deeper system-level problem.
In this particular trace, this span took 867ms, which is much longer than the entire navigation takes under normal circumstances. Clearly this was problematic.
Digging deeper to find the root cause
To better understand the cause, we examined what other threads in the process were doing during the delay. Two stood out — both were blocked at the same time:
The thread name HeapTaskDaemon and the span name Background concurrent copying GC were a strong hint that this was garbage-collection related. The other thread involved was Sq-usb_barcode_30652. This was not surprising, since we knew the problem involved a USB barcode scanner, but we didn’t know what this had to do with garbage collection, and the Perfetto trace provided very little detail. We collected a Simpleperf trace so we could see the stack trace when the thread was blocked, looking through each method on the stack for hints:
    at __schedule
    at __schedule
    at schedule
    at schedule_timeout
    at wait_for_common
    at wait_for_completion_timeout
    at usb_start_wait_urb
    at usb_bulk_msg
    at proc_bulk
    at usbdev_do_ioctl
    at usbdev_ioctl
    at do_vfs_ioctl
    at sys_ioctl
    at __sys_trace
    at __ioctl
    at ioctl
    at usb_device_bulk_transfer
    at android_hardware_UsbDeviceConnection_bulk_request
    at android.hardware.usb.UsbDeviceConnection.bulkTransfer
    at android.hardware.usb.UsbDeviceConnection.bulkTransfer
    at com.squareup.cdx.barcodescanners.UsbBarcodeBulkTransferCommunication.transferBytes
    at com.squareup.cdx.barcodescanners.UsbBarcodeScanner$start$usbRequestRunnable$1.run
    at java.util.concurrent.ThreadPoolExecutor.runWorker
    at java.util.concurrent.ThreadPoolExecutor$Worker.run
    at com.squareup.thread.Threads$namedThreadFactory$1$newThread$1.invoke
    at com.squareup.thread.Threads$namedThreadFactory$1$newThread$1.invoke
    at kotlin.concurrent.ThreadsKt$thread$thread$1.run
    at __pthread_start
    at __start_thread
One method that caught our attention was android_hardware_UsbDeviceConnection_bulk_request, which is native code in the Android framework:
    ...
    jbyte* bufferBytes = NULL;
    if (buffer) {
        bufferBytes = (jbyte*)env->GetPrimitiveArrayCritical(buffer, NULL);
    }

    jint result = usb_device_bulk_transfer(device, endpoint, bufferBytes + start, length, timeout);

    if (bufferBytes) {
        env->ReleasePrimitiveArrayCritical(buffer, bufferBytes, 0);
    }
    ...
GetPrimitiveArrayCritical/ReleasePrimitiveArrayCritical are Java Native Interface (JNI) APIs. The Critical suffix indicates that special caution is warranted. The API documentation states:
After calling GetPrimitiveArrayCritical, the native code should not run for an extended period of time before it calls ReleasePrimitiveArrayCritical. We must treat the code inside this pair of functions as running in a "critical region." Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread. (For example, the current thread must not call read on a stream being written by another Java thread.)
These restrictions make it more likely that the native code will obtain an uncopied version of the array, even if the VM does not support pinning. For example, a VM may temporarily disable garbage collection when the native code is holding a pointer to an array obtained via GetPrimitiveArrayCritical.
Garbage Collection theory review
Garbage Collection (GC) is the process by which a runtime (like ART — the Android Runtime) automatically reclaims unused memory. In a "concurrent copying GC" the garbage collector copies live objects from one part of the heap to another, updating object references in the process. These parts of the heap are commonly called from-space and to-space. At the end of the copying process, all the live objects are in the to-space — allowing the from-space to be completely reclaimed.
The GC is considered concurrent because the garbage collector closely cooperates with the rest of the VM to allow the threads to continue to run in parallel with the garbage collector. The mechanisms that do this are interesting, but out of scope for this blog post. Suffice it to say there is a critical point at which the garbage collector must briefly pause each thread in order to flip it: that is, scan through its stack and update each object reference to point to the corresponding location in the to-space.
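To make the from-space/to-space idea concrete, here is a deliberately tiny, single-threaded sketch of a copying collector. It is not how ART is implemented: the `Obj` layout, the index-based "references", and the `copyObject`/`collect` names are all assumptions made for illustration. It shows only the two ideas from the paragraphs above: live objects are copied into a fresh space, and the roots (standing in for a thread's stack) are rewritten to point at the new copies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy heap object. All names here are illustrative, not ART's.
struct Obj {
    int value;
    int ref;      // index of a referenced object in the same space, or -1
    int forward;  // to-space index once the object has been copied, else -1
};

using Space = std::vector<Obj>;

// Copies the object at `idx` (and everything it references) into to-space and
// returns its new index. Already-copied objects are found via `forward`.
int copyObject(Space& from, Space& to, int idx) {
    if (idx < 0) return -1;                                // null reference
    assert(static_cast<std::size_t>(idx) < from.size());
    if (from[idx].forward >= 0) return from[idx].forward;  // already moved

    const int newIdx = static_cast<int>(to.size());
    to.push_back(Obj{from[idx].value, -1, -1});
    from[idx].forward = newIdx;  // record the new location before recursing

    const int child = copyObject(from, to, from[idx].ref);
    to[newIdx].ref = child;      // the reference now points into to-space
    return newIdx;
}

// Copies everything reachable from `roots` and rewrites the roots themselves,
// which loosely corresponds to "flipping" a thread's stack.
Space collect(Space& from, std::vector<int>& roots) {
    Space to;
    for (int& root : roots) {
        root = copyObject(from, to, root);
    }
    return to;
}
```

Given a from-space of three objects where only the first is a root and it references the third, `collect` returns a to-space holding two objects; the unreferenced second object is never copied, so the whole from-space can be discarded afterwards.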
Most JNI APIs are carefully designed to include a level of indirection: C/C++ code cannot access Java objects directly. Instead, C/C++ code calls JNI methods like GetByteField to retrieve a field of type byte. This level of indirection allows the VM to direct the access to the correct space, cooperating with the GC to allow objects to be moved.
Critical JNI APIs throw a wrench in the works. The GC updates object references using temporary forwarding pointers placed in the original (from-space) object headers, but these cannot be created until the object is moved, and the object cannot be moved while it is held in place via GetPrimitiveArrayCritical. Thus, the thread flip is blocked until a corresponding ReleasePrimitiveArrayCritical is called for each prior GetPrimitiveArrayCritical. This is what the following line from the API documentation refers to:
a VM may temporarily disable garbage collection when the native code is holding a pointer to an array obtained via GetPrimitiveArrayCritical.
When used carefully, Critical JNI APIs can avoid unnecessary copies, leading to better performance. But if used incorrectly, these APIs can cause problems far worse than the copies they strive to avoid.
Applying the theory to our problem at hand
android_hardware_UsbDeviceConnection_bulk_request was calling usb_device_bulk_transfer, which was taking a long time to complete. This was a clear violation of the GetPrimitiveArrayCritical contract.
The particular barcode scanner we were using would block waiting for new data. We called bulkTransfer with a timeout value of one second. That's not a problem with the barcode scanner: USB transfers are allowed to block and/or take a long time. It's incorrect for android_hardware_UsbDeviceConnection_bulk_request to hold an array from GetPrimitiveArrayCritical while calling an API that can block like this.
The last piece of the puzzle: the GC cannot complete the thread flip until all objects have been relocated to to-space, and it can't relocate arrays until ReleasePrimitiveArrayCritical is called for each outstanding GetPrimitiveArrayCritical. The main thread was attempting to call GetPrimitiveArrayCritical, and the GC was blocking that call until it finished the thread flip, as the main thread's stack trace shows:
    at art::JNI::GetPrimitiveArrayCritical
    at art::(anonymous namespace)::CheckJNI::GetPrimitiveArrayCritical
    at android::NativeApplyStyle
    at android.content.res.AssetManager.applyStyle
    at android.content.res.ResourcesImpl$ThemeImpl.obtainStyledAttributes
    at android.content.res.Resources$Theme.obtainStyledAttributes
    at android.content.Context.obtainStyledAttributes
    at android.view.View.<init>
    at android.widget.TextView.<init>
    ...
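The three-way stall described in this section can be replayed with a small model built from a mutex and a condition variable. This is a sketch under stated assumptions, not ART's code: `ThreadFlipGate`, its methods, and `simulateStall` are hypothetical names, and the model reduces the runtime's behavior to two rules taken from the text above. The flip waits until no critical section is outstanding, and a new critical section waits while a flip is pending.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Toy model of the coordination between critical JNI sections and the GC's
// thread flip. Class and method names are hypothetical, not ART's.
class ThreadFlipGate {
public:
    // Mutator side: stands in for GetPrimitiveArrayCritical.
    void enterCritical(const std::string& who) {
        std::unique_lock<std::mutex> lock(mu_);
        if (flipPending_) log_.push_back(who + " blocked by pending flip");
        ++waiting_;
        cv_.notify_all();
        cv_.wait(lock, [this] { return !flipPending_; });
        --waiting_;
        ++holders_;
        log_.push_back(who + " enters critical");
    }

    // Mutator side: stands in for ReleasePrimitiveArrayCritical.
    void exitCritical(const std::string& who) {
        std::lock_guard<std::mutex> lock(mu_);
        assert(holders_ > 0);
        --holders_;
        log_.push_back(who + " exits critical");
        cv_.notify_all();
    }

    // GC side: announce the flip, then wait for every critical section to end.
    void runFlip() {
        std::unique_lock<std::mutex> lock(mu_);
        flipPending_ = true;
        log_.push_back("gc requests flip");
        cv_.notify_all();
        cv_.wait(lock, [this] { return holders_ == 0; });
        log_.push_back("gc flips");
        flipPending_ = false;
        cv_.notify_all();
    }

    // Simulation helpers so the scenario below needs no sleeps.
    void awaitFlipPending() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return flipPending_; });
    }
    void awaitWaitingMutator() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return waiting_ > 0; });
    }
    std::vector<std::string> log() {
        std::lock_guard<std::mutex> lock(mu_);
        return log_;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    int holders_ = 0;         // outstanding critical sections
    int waiting_ = 0;         // mutators currently inside enterCritical
    bool flipPending_ = false;
    std::vector<std::string> log_;
};

// Replays the stall: a slow "usb" critical section delays the flip, and the
// pending flip in turn delays "main".
std::vector<std::string> simulateStall() {
    ThreadFlipGate gate;
    gate.enterCritical("usb");                    // buffer pinned, transfer starts
    std::thread gc([&gate] { gate.runFlip(); });
    gate.awaitFlipPending();                      // GC is now stuck behind "usb"
    std::thread mainThread([&gate] {
        gate.enterCritical("main");
        gate.exitCritical("main");
    });
    gate.awaitWaitingMutator();                   // "main" is stuck behind the GC
    gate.exitCritical("usb");                     // the transfer finally returns
    gc.join();
    mainThread.join();
    return gate.log();
}
```

The log returned by `simulateStall` shows the main thread making no progress until the slow critical section ends, even though the main thread never touches the USB device. The time the "usb" section spends blocked is effectively added to the main thread's wait.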
Fixing the problem in AOSP
This problematic use of GetPrimitiveArrayCritical has been in AOSP since 2013, but it didn't actually become a problem until a year and a half later when Lollipop was released, which included Android's first moving garbage collector (you can learn about the history of garbage collection on Android here).
We prepared a change to use a non-Critical API variant within android_hardware_UsbDeviceConnection_bulk_request. The problem occurred on our own hardware, so we could ship a fix independently of AOSP, but we chose to upstream our fix too. In addition to keeping our sources in sync with AOSP, we also benefitted from more thorough code reviews and testing, which uncovered a subtle flaw in our first attempt. Others had reported the same issue years earlier, so they will benefit from this fix too.
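For readers unfamiliar with the non-Critical alternatives, one common pattern is to copy through a native staging buffer using the JNI region calls GetByteArrayRegion and SetByteArrayRegion, so that no Java array is pinned while the transfer blocks. The sketch below illustrates that general pattern only; it is not the code from our changesets. It is written as a template so it compiles without `<jni.h>`; `transferWithCopy`, the `transfer` callback, and `FakeEnv` are hypothetical names, and the sketch assumes `jbyte` is a signed 8-bit type compatible with `std::int8_t` when used with a real `JNIEnv*` and `jbyteArray`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copy-based transfer helper. Env must provide GetByteArrayRegion and
// SetByteArrayRegion with the JNI argument order (array, start, len, buf).
// `transfer` stands in for a blocking call such as usb_device_bulk_transfer
// and returns the number of bytes moved, or a negative error code.
template <typename Env, typename ByteArray, typename Transfer>
int transferWithCopy(Env* env, ByteArray array, int start, int length,
                     bool deviceToHost, Transfer transfer) {
    assert(start >= 0 && length >= 0);
    std::vector<std::int8_t> staging(static_cast<std::size_t>(length));

    if (!deviceToHost && length > 0) {
        // Outbound data: snapshot the Java array before blocking.
        env->GetByteArrayRegion(array, start, length, staging.data());
    }

    // No array is pinned here, so the GC is free to move it while we block.
    const int result = transfer(staging.data(), length);

    if (deviceToHost && result > 0) {
        // Inbound data: copy back only the bytes actually received.
        const int received = std::min(result, length);
        env->SetByteArrayRegion(array, start, received, staging.data());
    }
    return result;
}

// Hypothetical stand-in for JNIEnv so the sketch compiles without <jni.h>.
// A std::vector plays the role of the Java byte[].
struct FakeEnv {
    void GetByteArrayRegion(std::vector<std::int8_t>* array, int start,
                            int len, std::int8_t* buf) const {
        std::copy(array->begin() + start, array->begin() + start + len, buf);
    }
    void SetByteArrayRegion(std::vector<std::int8_t>* array, int start,
                            int len, const std::int8_t* buf) const {
        std::copy(buf, buf + len, array->begin() + start);
    }
};
```

The trade-off is one extra copy per transfer in exchange for keeping the blocking call outside any critical region. For a call that may wait a full second on a device, the copy is negligible by comparison.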
We eventually identified this problem in some of our production ANR traces, which led us to identify the same problem in android_hardware_UsbDeviceConnection_control_request.
Our changesets:
- https://android-review.googlesource.com/c/platform/frameworks/base/+/3056082
- https://android-review.googlesource.com/c/platform/frameworks/base/+/3341974
Both are in Android 16, ensuring the fix is available to a wide audience.
Results
Even before we fixed the problem at the OS level, we were able to work around the issue by using UsbRequest instead of bulkTransfer. We saw our P90 latency decrease by nearly 40%.
Takeaways
- Monitor your application's performance in production so you know when your customers experience problems.
- Follow API contracts, but when things go wrong it pays to also understand the theory behind those contracts.
- Android provides powerful tools — like Perfetto and Simpleperf — for understanding your application's behavior.
- No code of non-trivial size is perfect. Platform bugs are rare, but they do occur.
- Contributing to open-source is in everyone's best interest.