GPU Profiling has landed

A quick remainder that one of the biggest benefit to having our own built-in profiler is that individual teams and project can add their own performance reporting features. The graphics team just landed a feature to measure how much GPU time is consumed when compositing.

I already started using this in bug 1087530 where I used it to measure the improvement from recycling our temporary intermediate surfaces.

Screenshot 2014-10-23 14.35.29Here we can see that the frame had two rendering phases (group opacity test case) totaling 7.42ms of GPU time. After applying the patch from the bug and measuring again I get:

Screenshot 2014-10-23 14.38.15Now with retaining the surface the rendering GPU time drops to 5.7ms of GPU time. Measuring the GPU time is important because timing things on the CPU time is not accurate.

Currently we still haven’t completed the D3D implementation or hooked it up to WebGL, we will do that as the need arises. To implement this, when profiling, we insert a query object into the GPU pipeline for each rendering phase (framebuffer switches).

B2G Performance Polish: Unlock Animation (Part 2)

Our destination

Our starting point is a 200ms unlock delay, and an uniform ~25 FPS animation. Our aim should be a 0ms unlock delay and a uniform 60 FPS (or whatever vsync is). The former we will minimize as much as we can but the latter is entirely possible.

Let’s talk about how we should design a lock screen animation in the optimal case. When we go and apply it in practice we often hit some requirements and constraint that make it impossible to behave like we want but let’s ignore that for a second and discuss we want to get to.

In the ideal case we would have the lockscreen rendered offscreen to a GPU texture(s) set. We would have the background app ready in another GPU texture(s) set. These are ‘Layers’. We place the background app behind the lockscreen. When the transition begins we notify the compositor to start fading out the lockscreen. Keep them around costs memory but if we keep the right thing around we can reduce or eliminate repaints entirely.

Our memory requirement is what’s required by the background app + about one fullscreen layer required for the lockscreen. This should be fine for even low end of B2G phones. Our overdraw should be about 200%-300%, again enough for mobile GPUs to keep up at 60 FPS/vsync.

Ideal Lockscreen Layer Tree

Now let’s look at what we hope our timeline for our Main Thread and our Compositor Thread to look like:

Ideal Unlock Timeline

We want to use Off-Main-Thread-Animation to perform the fade entirely on the Compositor. This will be initiated on the main thread and will require a style flush to set a CSS transform transition. If done right we don’t expect to have to reflow or repaint any part of the page if we properly built the layer like shown in the first figure. Note that the style flush will contribute to the unlock delay (and so will the first composite time as incorrectly shown in the diagram). If we can keep that style flush + first composite in under say 50ms and each composite in 16ms or less then we should have a smooth unlock animation.

Next up let’s look at what’s actually happening in the unlock animation in practice…

B2G Performance Polish: Unlock Animation (Part 1)

I’ve decided to start a blog series documenting my workflow for performance investigation. Let me know if you find this useful and I’ll try to make this a regular thing. I’ll update this blog to track the progress made by myself, and anyone who wants to jump in and help.

I wanted to start with the b2g unlock animation. The animation is O.K. but not great and is core to the phone experience. I can notice a delay from the touch up to the start of the animation. I can notice that the animation isn’t smooth. Where do we go from here? First we need to quantity how things stand.

Measuring The Starting Point

The first thing is to grab a profile. From the profile we can extract a lot of information (we will look at it again in future parts). I run the following command:

./profile.sh start -p b2g -t GeckoMain,Compositor
*unlock the screen, wait until the end of the animation*
./profile.sh capture
*open profile_captured.sym in http://people.mozilla.org/~bgirard/cleopatra/*

This results in the following profile. I recommend that you open it and follow along. The lock animation starts from t=7673 and runs until 8656. That’s about 1 second. We can also note that we don’t see any CPU idle time so we’re working really hard during the unlock animation. Things aren’t looking great from a first glance.

I said that there was a long delay at the start of the animation. We can see a large transaction at the start near t=7673. The first composition completes at t=7873. That mean that our unlock delay is about 200ms.

Now let’s look at how the frame rate is for this animation. In the profile open the ‘Frames’ tab. You should see this (minus my overlay):

Lockscreen Frame Uniformity

Alright so our starting point is:

Unlock delay: 200ms

Frame Uniformity: ~25 FPS, poor uniformity

Next step

In part 2 we’re going to discuss the ideal implementation for a lock screen. This is useful because we established a starting point in part 1, part 2 will establish a destination.

Visual warning for slow B2G transaction landed

With the landing of bug 1055050, if you turn on the FPS counter on mobile you will now notice a rectangle around the screen edge to warning you that a transaction was out of budget.

  • The visual warning will appear if a transaction took over 200ms from start to finish.
  • Yellow indicates the transaction took over 200ms.
  • Orange will indicate the transaction took about 500ms.
  • Red will indicate the transaction is about 1000ms or over.

What’s a transaction?

It’s running the rendering pipeline and includes (1) Running request animation frame and other refresh observers, (2) Flushing pending style changes, (3) Flushing and reflow any pending layout changes, (4) Building a display list, (5) Culling, (6) Updating the layer tree, (7) Sending the final to the compositor, (8) Syncing resources with the GPU. It does NOT include compositing which isn’t part of the main thread transaction. It does not warn for other events like slow running JS events.

Why is this important?

A transaction, just like any other gecko event, blocks the main thread. This means that anything else queued and waiting to be service will be delayed. This means that many things on the page/app will be delayed like: animations, typing, canvas, js callbacks, timers, the next frame.

Why 200ms?

200ms is already very high. If we want anything in the app to run at 60 FPS that doesn’t use a magical async path then any event taking 16ms or more will cause noticeable stutter. However we’re starting with a 200ms threshold to focus on the bigger items first.

How do I fix a visual warning?

The warning as just provided as a visual tool.

Profiler: Frame performance and debugging!

When we designed the platform profiler our goal was to expose internal information to answer performance questions that other tools could never practically answer without. I’m very excited to show off these prototypes because I feel that we are without a doubt finally there.

NOTE: We’re still missing at least two un-landed platform patches before this is ready in Nightly; Bug 926922 and Bug 1004899. Feel free to apply these patches locally and give me feedback! They should work on any platform when running with OMTCompositing.

Paint Marker Improvements

I’ve redesigned how frame markers work to expose more details. Painting in Gecko is triggered off a refresh timer. When that timer fires we will execute scripts that are listening for the refresh like pending requestAnimationFrame() callbacks for the document. This is denoted by the Scripts phase. If any changes were made to the styles or flow of the document we will show a marker for each, if applicable, as Styles and Reflow markers. Next we will need to build a display list and a layer tree to prepare for painting denoted by DisplayList marker. The next phase is rasterizing, the process of turning the display list into pixels, any changed region of the page. This is denoted by the, you guessed it, Rasterize phase. And finally, on the compositor thread, we will take the results and composite them to the phase under the Composite phase.

This further break down was necessary to facilitate exposing more information about each of these phases. Note that I’m exposing internal gecko decisions. The primary use should be to debug and optimize the platform. It’s not meant for web developers to micro optimize their pages to implementation details of our rendering engine.

New Frame Markers

New Frame Markers

Styles and Reflow Causes

As platform engineers we’re often asked to look into performance problems of complex web applications. It’s difficult to pin point what’s causing excessive style and reflow flushes in an app we’re not familiar with. Careful breakpoints may get you this information for simple performance problems but in my experience that’s not enough to diagnose the hard problems. One frame tends to have a disproportionately longer style or reflow flush and finding which exactly can be very tricky. With flush causes in the profiler we can now find what code in the past triggered the flush that’s taking so long to execute now.

Here’s a real world example of the b2g homescreen as of the current master tip. See Style #0? We’re spending 35ms in there so understanding why we’re flushing is critical to fixing it. Now it’s a pain free process with a simple mouse-over the problematic Styles block:

B2G Homescreen Style Flush Cause

B2G Homescreen Style Flush Cause

Here the flush is triggered by handleEvent() in grid.js. A closer inspection will point out several style changes that are occurring from that function. After taking a closer look at grid.js, we can pinpoint the problem to a CSS transform.

Display List & Composite Visualization

Sometimes the display list and layer tree is the key to the performance problem. It can point to rendering bugs, severe platform performance problems, an overly complicated page to paint and many more problems. The visualization is still an early implementation. In the future I’m hoping to display a preview of the display items themselves but we need to extend how they are dumped from gecko. This information is logged for each frame so it’s useful to track the evolution and how it corresponds with other things like slow paths being hit.

NOTE: This feature is only enabled if you’ve flipped a preference to dump the display list and/or layer tree for each frame. This will slow down your build and skew profiling sample but will still serve as a great debugging tool.

DisplayList Preview

DisplayList Preview

Frame Uniformity

And to finish off, the profiler now has a simple feature to visualize how consistent our frame rate is. Each dot represents the time between frames. If everything is working as expected, during scrolling each dot should be near 16ms. Below is a graph of fairly smooth scrolling.

Profiler Composite Uniformity Graph

Profiler Composite Uniformity Graph

Graphics Meetup 2014Q01

I just arrived from the Graphics Meetup in early 2014. Before the week we wrapped up the port of tiling from Fennec OpenGL specific code to the abstract Compositor API. Here a summary of the projects we discussed (from my point of view, I’m missing things that I couldn’t attend):

GFX Taipei

  • Off main thread compositing on desktop (OMTCompositing): We discussed our plan for shipping OMTCompositing to desktop and unify our compositing code. Moving compositing off the main thread is a prerequisite for the many projects that build on it such as OMTAnimation, OMTVideo, tiling and Async Pan Zoom. Matt Woodrow managed to make some sizable progress at the end of the week. Our plan is to double down on our resources to get this shipped on desktop.
  • Tiling: Bringing tiling to desktop will be important to better support 4k displays and to support Async Pan Zoom. We decided to focus on OMTCompositing before shipping tiling on desktop.
  • Async Pan Zoom: We discussed upcoming improvements to Async Pan Zoom like hit testing, scroll snap requirements. We discussed our plan to have Async Pan Zoom on the desktop. Mstange has a working prototype of APZ on mac. For now we will first focus on shipping OMTCompositing separately. Changes to the input event queue and dealing with the plugins window on Windows will be a significant problem.
  • Graphics regression test on b2g: We discussed with mchang from the b2g performance team the best way to get b2g performance regressions tests. We decided to focus on some micro benchmarks to isolate platform regressions from gaia regressions by using the Gfx Test App. Kats convinced me that FrameMetrics could be use to accurately measure ‘checkerboarding’ so we will be rolling out some tests based on that as well.
  • VSync: Vincent has been leading the effort of getting Gecko to correctly VSync. This project is very important because no matter how fast we render our animations will never be fluid if we don’t follow vsync carefully. We had a long design review and I’m fairly happy with the result. TL;DR: We will be interpolating input events and driving the refresh driver off the vsync signal.
  • Eideticker: We discussed the challenges of supporting Eideticker using an external camera instead of MHL.
  • WebGL: We reaffirmed our plans to continue to support new WebGL extensions, focus on conformance issues, update the conformance testsuite and continue to work on WebGL 2.
  • Skia: We decided to try to rebase once every 6 weeks. We will be focusing on Skia content on android and SkiaGL canvas on mac.
  • RR with graphics: Roc presented RR (blog). It really blew me away that RR already supported Firefox on Linux. We had a discussion on some of the challenges with using RR with graphics (OpenGL, X) and how it could benefit us.
  • LayerScope: LayerScope will be extended to show frame tree dumps and which display items are associated with which layer.
  • Task Tracer: Shelly presented Task Tracer. We discussed how to integrate it with the profiler and Cleopatra.
  • Ownerships: We’re looking into different approaches to add ownership of sub-modules within graphics and how it can help with improving design and reviews.
  • Designs: We discussed on how to bring better design to the graphics module. We’re going to perform design reviews in bugzilla and keep the final design in a docs folder in the graphics components. This means that design changes will be peer reviewed and versioned.

C/C++ Eclipse Project Generation

With the landing of bug 973770 (and a few touch up landing this week) the Mozilla build system can now generate a fully working out-of-the-box no hassle eclipse workspace for C++ gecko development. This has been tested for Mac, Fennec and B2G. For Windows please use the Visual Studio project generator. If you find any bugs please file and link off bug 973770.

Didn’t we already support eclipse projects? Yes we did but the setup was quite tedious to setup and maintain. My approach automates some of these instructions and uses the information now available from within moz.build to generate the project. This also means we can remove the build scanning phase which tends to make Eclipse unresponsive.

To use this:

  • Download the latest Eclipse CDT 8.3 package for your platform
  • Build your tree
  • Run |./mach build-backend -b CppEclipse|. For b2g cd into objdir-gecko and use |../gecko/mach build-backend -b CppEclipse|
  • Start eclipse, use $OBJDIR/eclipse_workspace as the workspace
  • File->Import->Existing Project with path $OBJDIR/eclipse_workspace/gecko
  • Let the indexer run on the first start. This can take about 20 mins (long but worth it, see below). This will index your exact mozconfig setup (example -DDEBUG, b2g system libs).

Here are some benefits:

Autocomplete

Code Completion

Call Hierarchy

Call Hierarchy

Build, Flash (b2g), Launch

Build, Flash (b2g), Launch

Type Hierarchy

Type Hierarchy

Mozilla Formatter

Mozilla Formatter

If you find yourself tweaking the project let me know so that I can integrate changes into the project generator instead where everyone can benefit.