GPU Profiling has landed

A quick reminder that one of the biggest benefits of having our own built-in profiler is that individual teams and projects can add their own performance reporting features. The graphics team just landed a feature to measure how much GPU time is consumed when compositing.

I have already started using this in bug 1087530 to measure the improvement from recycling our temporary intermediate surfaces.

Screenshot 2014-10-23 14.35.29

Here we can see that the frame had two rendering phases (group opacity test case) totaling 7.42ms of GPU time. After applying the patch from the bug and measuring again I get:

Screenshot 2014-10-23 14.38.15

Now that the surface is retained, the rendering GPU time drops to 5.7ms. Measuring GPU time directly is important because timing GPU work from the CPU is not accurate: the GPU executes its commands asynchronously.
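To put the two measurements above side by side, here is a quick arithmetic sketch. The helper name percent_reduction is made up for illustration; only the 7.42ms and 5.7ms figures come from the screenshots.

```python
def percent_reduction(before_ms, after_ms):
    """Relative drop in GPU time, as a percentage of the original cost."""
    return (before_ms - after_ms) / before_ms * 100.0


before_ms = 7.42  # GPU time before the patch (from the first screenshot)
after_ms = 5.7    # GPU time with the surface retained (second screenshot)

saved_ms = before_ms - after_ms
reduction = percent_reduction(before_ms, after_ms)

# prints: saved 1.72 ms per frame (23.2% less GPU time)
print(f"saved {saved_ms:.2f} ms per frame ({reduction:.1f}% less GPU time)")
```

At 60 FPS the whole frame budget is about 16.7ms, so 1.72ms is a noticeable share of it.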

We still haven’t completed the D3D implementation or hooked it up to WebGL; we will do that as the need arises. To implement this, when profiling, we insert a query object into the GPU pipeline for each rendering phase (framebuffer switch).
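As a rough model of the idea (not the actual Gecko code), each rendering phase can be thought of as a timer query with a start and end GPU timestamp, and the per-frame number is the sum of the phases. The PhaseQuery class, the phase names and the split between the two phases are all hypothetical; only the 7.42ms total comes from the measurement above.

```python
from dataclasses import dataclass


@dataclass
class PhaseQuery:
    """One timer query wrapped around a rendering phase (hypothetical model)."""
    name: str
    start_ns: int  # GPU timestamp when the phase began
    end_ns: int    # GPU timestamp when the phase finished

    @property
    def elapsed_ns(self):
        return self.end_ns - self.start_ns


def frame_gpu_time_ms(queries):
    """Sum the elapsed GPU time of every phase in a frame, in milliseconds."""
    total_ns = sum(q.elapsed_ns for q in queries)
    return total_ns / 1_000_000


# Made-up split of a 7.42ms frame into its two phases.
frame = [
    PhaseQuery("intermediate surface", 0, 4_300_000),
    PhaseQuery("final framebuffer", 4_300_000, 7_420_000),
]

total_ms = frame_gpu_time_ms(frame)
print(f"{len(frame)} phases, {total_ms} ms of GPU time")
```

The key design point is that the timestamps are taken by the GPU itself, so the result does not depend on when the CPU happened to submit or wait on the work.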

Graphics Meetup 2014Q01

I just arrived from the Graphics Meetup in early 2014. Before the week we wrapped up the port of tiling from Fennec OpenGL specific code to the abstract Compositor API. Here is a summary of the projects we discussed (from my point of view; I’m missing things that I couldn’t attend):

GFX Taipei

  • Off main thread compositing on desktop (OMTCompositing): We discussed our plan for shipping OMTCompositing to desktop and unifying our compositing code. Moving compositing off the main thread is a prerequisite for the many projects that build on it, such as OMTAnimation, OMTVideo, tiling and Async Pan Zoom. Matt Woodrow managed to make some sizable progress at the end of the week. Our plan is to double down on our resources to get this shipped on desktop.
  • Tiling: Bringing tiling to desktop will be important to better support 4k displays and to support Async Pan Zoom. We decided to focus on OMTCompositing before shipping tiling on desktop.
  • Async Pan Zoom: We discussed upcoming improvements to Async Pan Zoom such as hit testing and scroll snap requirements. We discussed our plan to have Async Pan Zoom on the desktop. Mstange has a working prototype of APZ on Mac. For now we will first focus on shipping OMTCompositing separately. Changes to the input event queue and dealing with the plugins window on Windows will be a significant problem.
  • Graphics regression test on b2g: We discussed with mchang from the b2g performance team the best way to get b2g performance regression tests. We decided to focus on some micro benchmarks to isolate platform regressions from gaia regressions by using the Gfx Test App. Kats convinced me that FrameMetrics could be used to accurately measure ‘checkerboarding’ so we will be rolling out some tests based on that as well.
  • VSync: Vincent has been leading the effort of getting Gecko to correctly VSync. This project is very important because no matter how fast we render, our animations will never be fluid if we don’t follow vsync carefully. We had a long design review and I’m fairly happy with the result. TL;DR: We will be interpolating input events and driving the refresh driver off the vsync signal.
  • Eideticker: We discussed the challenges of supporting Eideticker using an external camera instead of MHL.
  • WebGL: We reaffirmed our plans to continue to support new WebGL extensions, focus on conformance issues, update the conformance testsuite and continue to work on WebGL 2.
  • Skia: We decided to try to rebase once every 6 weeks. We will be focusing on Skia content on android and SkiaGL canvas on mac.
  • RR with graphics: Roc presented RR (blog). It really blew me away that RR already supported Firefox on Linux. We had a discussion on some of the challenges with using RR with graphics (OpenGL, X) and how it could benefit us.
  • LayerScope: LayerScope will be extended to show frame tree dumps and which display items are associated with which layer.
  • Task Tracer: Shelly presented Task Tracer. We discussed how to integrate it with the profiler and Cleopatra.
  • Ownerships: We’re looking into different approaches to add ownership of sub-modules within graphics and how it can help with improving design and reviews.
  • Designs: We discussed how to bring better design to the graphics module. We’re going to perform design reviews in bugzilla and keep the final design in a docs folder in the graphics components. This means that design changes will be peer reviewed and versioned.
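To illustrate what "interpolating input events" in the VSync item could look like, here is a minimal sketch that linearly interpolates between two touch samples to estimate the position at the vsync timestamp. This is an assumption about the general technique, not the design that was reviewed; the function name and the tuple layout are made up.

```python
def interpolate_touch(prev, nxt, vsync_time_ms):
    """Estimate the touch position at vsync_time_ms.

    prev and nxt are (time_ms, x, y) samples surrounding the vsync time.
    The result is clamped so it never extrapolates past either sample.
    """
    (t0, x0, y0), (t1, x1, y1) = prev, nxt
    if t1 == t0:
        return (x1, y1)
    alpha = (vsync_time_ms - t0) / (t1 - t0)
    alpha = max(0.0, min(1.0, alpha))
    return (x0 + alpha * (x1 - x0), y0 + alpha * (y1 - y0))


# Two input samples 10ms apart, with vsync landing halfway between them.
position = interpolate_touch((0, 0, 0), (10, 100, 50), 5)
print(position)  # (50.0, 25.0)
```

The point of resampling to the vsync time is that each frame sees a position consistent with when it will be displayed, rather than whatever input event happened to arrive last.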

Efficient Multi-Process profiling on B2G

Until a few days ago profiling on b2g was either off or on for the whole system. Worse, profiling secondary threads would profile the secondary threads of every process. These limitations caused profiling to overwhelm the system and skew performance numbers. Additionally, it was difficult to follow how processes waited on each other.

With the landing of Bug 914654 it is now possible to profile specific threads on specific processes and merge the results with little effort. Currently profiling secondary threads is disabled on b2g, but if you locally remove the gonk #ifdef from mozilla_sampler_register_thread and mozilla_sampler_unregister_thread you’re good to go. Once that’s ready use the to start profiling the important threads of your choice. For example, if you’re looking into animation delay with the Homescreen you want to run ‘./ start b2g Compositor && ./ start Homescreen’ then run ./ pull, which will prepare and merge the data into profile_captured.sym.

Here’s a sample profile collected for the Homescreen swipe animations. You can see the b2g compositor waiting for the paint from the Homescreen, then furiously compositing afterwards at nearly 60 FPS. Thus the delay in starting the Homescreen swipe is not caused by the compositor but rather by the Homescreen taking too long to rasterize the layers containing the app icons. In this case it’s taking 100ms to prepare, which means we’ve missed the first 6 frames of the animation! Happy profiling!
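The "6 frames" figure follows directly from the frame rate. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def frames_missed(delay_ms, fps=60):
    """Number of whole frames that fit into a delay at the given frame rate."""
    # Integer math avoids rounding surprises from 1000 / 60.
    return (delay_ms * fps) // 1000


missed = frames_missed(100)
print(f"a 100ms paint delay costs {missed} frames at 60 FPS")
```

At 60 FPS each frame lasts about 16.7ms, so a 100ms rasterization stall swallows six of them before the first animated frame can be shown.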

Multi-Process b2g profile

GTest Has Landed. Start Writing Your Unit Tests.

This weekend GTest landed in the mozilla-central tree. See Bug 767231 for the changes and follow-up bugs. We have some follow-up changes coming, such as adding a mach target, replacing --enable-gtest with --enable-tests and adding gtests to tinderbox. Everything is now ready for developers to start adding their own unit tests to libxul.

All the unit tests will be linked within libxul. This means that you don’t have to export any symbols you intend on testing. See Bug 767231 for the pros and cons of this solution. The summary is that unit tests will run against a different libxul library (libxul+unittest) than the one we will ship (libxul only) at the benefit of having access to all the symbols. Unit tests will not be shipped in a normal release build.

To run GTest, build with ‘--enable-gtest’ and simply add the -unittest option when running firefox. To add a unit test, create a test .cpp file, declare it in the makefile with ‘GTEST_CPPSRCS = TestFoo.cpp’ and you’re done.

For more details see the up-to-date documentation:


Video Synced Profiling

Just a quick update on the Eideticker profiling support William and I have been working on. All the changes needed to sync a video recording with a profile have landed. Each frame now shows a binary counter in its top left corner. This counter is read back from the video and the samples collected for that frame are highlighted. It’s simple but effective and very useful for optimizing how we draw.
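A minimal sketch of how such a counter could be read back and matched to samples. The encoding here is an assumption made for illustration (one brightness value per bit, most significant bit first, bright means 1), and the sample dictionaries are a made-up stand-in for real profile data.

```python
def decode_frame_counter(cells, threshold=128):
    """Turn a row of brightness values (0-255) into a frame number.

    Assumes one cell per bit, most significant bit first, bright = 1.
    """
    value = 0
    for brightness in cells:
        bit = 1 if brightness >= threshold else 0
        value = (value << 1) | bit
    return value


def samples_for_frame(samples, frame_number):
    """Select the profiler samples tagged with the given frame number."""
    return [s for s in samples if s["frame"] == frame_number]


# Brightness of the counter cells sampled from one video frame (made up).
cells = [250, 10, 240, 230]          # reads as binary 1011
frame_number = decode_frame_counter(cells)

samples = [
    {"frame": 10, "label": "Paint"},
    {"frame": 11, "label": "Composite"},
    {"frame": 11, "label": "Upload"},
]
highlighted = samples_for_frame(samples, frame_number)
print(frame_number, [s["label"] for s in highlighted])
```

Thresholding each cell rather than comparing exact colors keeps the decoding tolerant of video compression noise.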

Video correlation allows stepping samples frame by frame

You can try this yourself by checking out this real life example recorded this morning. When stepping through the video the selection will be updated to match the current frame in the top left. You can then filter samples for the current frame. Note that on mobile, because of Off-Main-Thread-Compositing, we typically present many intermediate frames before getting an update from the main thread.

Tiling Improvements in Fennec

With the native release we refactored how we render to use a tiling approach. This is beneficial because it lets us minimize the work needed to paint as we pan and zoom. The goal is to be able to increase and decrease the size of our view and move its position in a logically unbounded page without having to reallocate and copy our retained page buffer.

This refactoring was also a blocker for other optimizations that I am currently working on implementing. First I landed a patch to add the ability to draw progressively and interrupt drawing in chunks of tiles (bug 771219). This lets our content thread and compositor paint+upload in parallel instead of serially. This opens up the possibility of showing painting progressively, tile by tile. Interrupting drawing will let us decide that the user panned outside of where we are painting, abort the operation and re-target our paint.

Next up I’m working on drawing tiles at a low resolution to replace the ‘screenshot code’. Currently we try to detect when the page changed and we paint it into a small offscreen ‘screenshot’ buffer. This ‘screenshot’ is drawn in areas of the page that we’re still working on painting. This is a huge improvement to the user experience. However the current code isn’t integrated into our Layers system, which means that the page change notifications are not reliable and updates are more expensive. The goal is to move this code inside the Layers system, where it can overcome these limitations, and make it tile based so we can improve it.

Once all of these tile improvements are ready they will let us move away from our current approach of predicting where the view is going to be, painting it start to finish, sending it to the GPU all at once and hoping that what we painted is still inside the view. With these changes we will be able to improve our heuristics by aborting painting if it’s outside the view, drawing quickly at a reduced resolution first if we’re panning quickly, drawing the most important tiles first, presenting the painting progressively and uploading it to the GPU in parallel piece by piece.
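Here is a toy sketch of the interruptible, prioritized painting loop described above. It is not the Fennec implementation: the 256 pixel tile size, the function names and the callback-based viewport are all assumptions chosen to keep the example small.

```python
TILE = 256  # tile edge in pixels (assumed value for the sketch)


def tiles_covering(rect, tile=TILE):
    """Return (col, row) indices of the tiles intersecting rect=(x, y, w, h)."""
    x, y, w, h = rect
    cols = range(x // tile, (x + w - 1) // tile + 1)
    rows = range(y // tile, (y + h - 1) // tile + 1)
    return [(c, r) for r in rows for c in cols]


def paint_progressively(target_rect, get_viewport, paint_tile, chunk=2):
    """Paint tiles chunk by chunk, re-checking the viewport between chunks.

    Returns (painted_tiles, aborted). Painting aborts as soon as none of
    the remaining tiles is visible, so the caller can re-target the paint.
    """
    pending = tiles_covering(target_rect)
    painted = []
    while pending:
        visible = set(tiles_covering(get_viewport()))
        if not visible.intersection(pending):
            return painted, True
        # Stable sort: visible tiles move to the front of the queue.
        pending.sort(key=lambda t: t not in visible)
        batch, pending = pending[:chunk], pending[chunk:]
        for t in batch:
            paint_tile(t)  # in a real system this is where upload would start
            painted.append(t)
    return painted, False


uploaded = []
painted, aborted = paint_progressively(
    target_rect=(0, 0, 512, 512),          # predicted area: 2x2 tiles
    get_viewport=lambda: (0, 0, 512, 256), # the user only sees the top row
    paint_tile=uploaded.append,
)
print(painted, aborted)
```

Because the viewport is re-read between chunks, a pan that happens mid-paint changes which tiles go next, or stops the paint entirely, instead of wasting time on tiles nobody will see.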

Here’s a demonstration of my current set of patches. Note that the performance isn’t tweaked and it’s tested on a slow page (i.e. gradient) to better demonstrate the progressive and reduced resolution painting.

This demonstrates progressive tile painting and a simple heuristic that draws new tiles first at low resolution and then at full resolution. It does not abort or prioritize tile painting yet, which would be useful between the 2 and 3 second marks where we’re still painting outside the screen.

Correlating Power Usage with Performance Data using the Gecko Profiler and Intel Sandy Bridge

I ran a quick experiment after someone pointed out to me that second generation Intel Core processors (Sandy Bridge) provide features for querying the power management status of the CPU, such as its current frequency and power usage in Watts. I re-purposed the responsiveness correlation in the Gecko Profiler to instead use the average power consumption of the CPU over the last few milliseconds. This lets you see the current power usage of Firefox as it does different tasks (JS, layout, gfx). The idea is similar to looking at throughput performance: find the area of the code that has the highest power and energy consumption and optimize it. Here’s a sample:

Power Usage in the Gecko Profiler (View the profile yourself) Color=power usage, Height=Call depth, x=time

I should note that the data is expected to be noisy because I had applications running along with the typical set of background processes you’d find on a Mac. Nevertheless every profile I ran showed the power usage drop significantly at every point where Firefox was waiting for events, which strongly suggests that it is in fact working. I haven’t done much analysis on the data but a quick look at the profiles suggests that our SSE2 code is particularly power hungry.

A neat idea would be to compute the energy consumption from the power consumption and break it down into Gecko Modules.
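A minimal sketch of that computation, assuming each profiler sample carries a timestamp, a power reading in Watts and the module that was running. Energy is power integrated over time, so each sample's power is held until the next sample and the result is accumulated per module. The function name and the sample values are made up for illustration.

```python
from collections import defaultdict


def energy_by_module(samples):
    """Integrate power over time and total it per module, in Joules.

    samples is a time-sorted list of (time_s, power_w, module). Each
    sample's power is assumed to hold until the next sample; the last
    sample only marks the end of the interval.
    """
    energy = defaultdict(float)
    for (t0, power_w, module), (t1, _, _) in zip(samples, samples[1:]):
        energy[module] += power_w * (t1 - t0)
    return dict(energy)


# Hypothetical samples, not measured data.
samples = [
    (0.0, 10.0, "JS"),
    (0.5, 20.0, "gfx"),
    (1.0, 10.0, "layout"),
    (1.5, 0.0, "idle"),
]

breakdown = energy_by_module(samples)
for module, joules in sorted(breakdown.items(), key=lambda kv: -kv[1]):
    print(f"{module}: {joules} J")
```

Ranking by energy rather than peak power matters because a module that draws moderate power for a long time can cost more battery than a short, intense burst.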


The biggest roadblock to implementing this is that the power information isn’t available in user mode and I don’t think that the APIs are widely exposed by operating systems. Luckily Intel provides a sample library and driver that let you access this information. Once I had this in place it was simply a matter of querying this information rather than the event loop status like the Profiler normally does. Because this data requires a driver, you won’t see this feature hit the Profiler unless I see a big demand for it.