Profiler: Frame performance and debugging!

When we designed the platform profiler our goal was to expose internal information to answer performance questions that other tools could never practically answer without. I’m very excited to show off these prototypes because I feel that we are without a doubt finally there.

NOTE: We’re still missing at least two un-landed platform patches before this is ready in Nightly; Bug 926922 and Bug 1004899. Feel free to apply these patches locally and give me feedback! They should work on any platform when running with OMTCompositing.

Paint Marker Improvements

I’ve redesigned how frame markers work to expose more details. Painting in Gecko is triggered off a refresh timer. When that timer fires we will execute scripts that are listening for the refresh like pending requestAnimationFrame()¬†callbacks for the document. This is denoted by the Scripts phase. If any changes were made to the styles or flow of the document we will show a marker for each, if applicable, as Styles and Reflow markers. Next we will need to build a display list and a layer tree to prepare for painting denoted by DisplayList marker. The next phase is rasterizing, the process of turning the display list into pixels, any changed region of the page. This is denoted by the, you guessed it, Rasterize phase. And finally, on the compositor thread, we will take the results and composite them to the phase under the Composite phase.

This further break down was necessary to facilitate exposing more information about each of these phases. Note that I’m exposing internal gecko decisions. The primary use should be to debug and optimize the platform. It’s not meant for web developers to micro optimize their pages to implementation details of our rendering engine.

New Frame Markers

New Frame Markers

Styles and Reflow Causes

As platform engineers we’re often asked to look into performance problems of complex web applications. It’s difficult to pin point what’s causing excessive style and reflow flushes in an app we’re not familiar with. Careful breakpoints may get you this information for simple performance problems but in my experience that’s not enough to diagnose the hard problems. One frame tends to have a disproportionately longer style or reflow flush and finding which exactly can be very tricky. With flush causes in the profiler we can now find what code in the past triggered the flush that’s taking so long to execute now.

Here’s a real world example of the b2g homescreen as of the current master tip. See Style #0? We’re spending 35ms in there so understanding why we’re flushing is critical to fixing it. Now it’s a pain free process with a simple mouse-over the problematic Styles block:

B2G Homescreen Style Flush Cause

B2G Homescreen Style Flush Cause

Here the flush is triggered by handleEvent() in grid.js. A closer inspection will point out several style changes that are occurring from that function. After taking a closer look at grid.js, we can pinpoint the problem to a CSS transform.

Display List & Composite Visualization

Sometimes the display list and layer tree is the key to the performance problem. It can point to rendering bugs, severe platform performance problems, an overly complicated page to paint and many more problems. The visualization is still an early implementation. In the future I’m hoping to display a preview of the display items themselves but we need to extend how they are dumped from gecko. This information is logged for each frame so it’s useful to track the evolution and how it corresponds with other things like slow paths being hit.

NOTE: This feature is only enabled if you’ve flipped a preference to dump the display list and/or layer tree for each frame. This will slow down your build and skew profiling sample but will still serve as a great debugging tool.

DisplayList Preview

DisplayList Preview

Frame Uniformity

And to finish off, the profiler now has a simple feature to visualize how consistent our frame rate is. Each dot represents the time between frames. If everything is working as expected, during scrolling each dot should be near 16ms. Below is a graph of fairly smooth scrolling.

Profiler Composite Uniformity Graph

Profiler Composite Uniformity Graph

Graphics Meetup 2014Q01

I just arrived from the Graphics Meetup in early 2014. Before the week we wrapped up the port of tiling from Fennec OpenGL specific code to the abstract Compositor API. Here a summary of the projects we discussed (from my point of view, I’m missing things that I couldn’t attend):

GFX Taipei

  • Off main thread compositing on desktop (OMTCompositing): We discussed our plan for shipping OMTCompositing to desktop and unify our compositing code. Moving compositing off the main thread is a prerequisite for the many projects that build on it such as OMTAnimation, OMTVideo, tiling and Async Pan Zoom. Matt Woodrow managed to make some sizable progress at the end of the week. Our plan is to double down on our resources to get this shipped on desktop.
  • Tiling: Bringing tiling to desktop will be important to better support 4k displays and to support Async Pan Zoom. We decided to focus on OMTCompositing before shipping tiling on desktop.
  • Async Pan Zoom: We discussed upcoming improvements to Async Pan Zoom like hit testing, scroll snap requirements. We discussed our plan to have Async Pan Zoom on the desktop. Mstange has a working prototype of APZ on mac. For now we will first focus on shipping OMTCompositing separately. Changes to the input event queue and dealing with the plugins window on Windows will be a significant problem.
  • Graphics regression test on b2g: We discussed with mchang from the b2g performance team the best way to get b2g performance regressions tests. We decided to focus on some micro benchmarks to isolate platform regressions from gaia regressions by using the Gfx Test App. Kats convinced me that FrameMetrics could be use to accurately measure ‘checkerboarding’ so we will be rolling out some tests based on that as well.
  • VSync: Vincent has been leading the effort of getting Gecko to correctly VSync. This project is very important because no matter how fast we render our animations will never be fluid if we don’t follow vsync carefully. We had a long design review and I’m fairly happy with the result. TL;DR: We will be interpolating input events and driving the refresh driver off the vsync signal.
  • Eideticker: We discussed the challenges of supporting Eideticker using an external camera instead of MHL.
  • WebGL: We reaffirmed our plans to continue to support new WebGL extensions, focus on conformance issues, update the conformance testsuite and continue to work on WebGL 2.
  • Skia: We decided to try to rebase once every 6 weeks. We will be focusing on Skia content on android and SkiaGL canvas on mac.
  • RR with graphics: Roc presented RR (blog). It really blew me away that RR already supported Firefox on Linux. We had a discussion on some of the challenges with using RR with graphics (OpenGL, X) and how it could benefit us.
  • LayerScope: LayerScope will be extended to show frame tree dumps and which display items are associated with which layer.
  • Task Tracer: Shelly presented Task Tracer. We discussed how to integrate it with the profiler and Cleopatra.
  • Ownerships: We’re looking into different approaches to add ownership of sub-modules within graphics and how it can help with improving design and reviews.
  • Designs: We discussed on how to bring better design to the graphics module. We’re going to perform design reviews in bugzilla and keep the final design in a docs folder in the graphics components. This means that design changes will be peer reviewed and versioned.

C/C++ Eclipse Project Generation

With the landing of bug 973770 (and a few touch up landing this week) the Mozilla build system can now generate a fully working out-of-the-box no hassle eclipse workspace for C++ gecko development. This has been tested for Mac, Fennec and B2G. For Windows please use the Visual Studio project generator. If you find any bugs please file and link off bug 973770.

Didn’t we already support eclipse projects? Yes we did but the setup was quite tedious to setup and maintain. My approach automates some of these instructions and uses the information now available from within moz.build to generate the project. This also means we can remove the build scanning phase which tends to make Eclipse unresponsive.

To use this:

  • Download the latest Eclipse CDT 8.3 package for your platform
  • Build your tree
  • Run |./mach build-backend -b CppEclipse|. For b2g cd into objdir-gecko and use |../gecko/mach build-backend -b CppEclipse|
  • Start eclipse, use $OBJDIR/eclipse_workspace as the workspace
  • File->Import->Existing Project with path $OBJDIR/eclipse_workspace/gecko
  • Let the indexer run on the first start. This can take about 20 mins (long but worth it, see below). This will index your exact mozconfig setup (example -DDEBUG, b2g system libs).

Here are some benefits:

Autocomplete

Code Completion

Call Hierarchy

Call Hierarchy

Build, Flash (b2g), Launch

Build, Flash (b2g), Launch

Type Hierarchy

Type Hierarchy

Mozilla Formatter

Mozilla Formatter

If you find yourself tweaking the project let me know so that I can integrate changes into the project generator instead where everyone can benefit.

Introducing Scroll-Graph

For awhile I’ve been warning people that the FPS counter is very misleading. There’s many case where the FPS might be high but the scroll may still not be smooth. To give a better indicator of the scrolling performance I’ve written Scroll-Graph which can be enabled via layers.scroll-graph. Currently it requires OMTC+OGL but if this is useful we can easily port this to OMTC+D3D. Currently it works best on Desktop, it will need a bit more work for mobile to be compatible with APZC.

To understand how Scroll-Graph works lets look at scrolling from the point of view of a¬†physicist. Scrolling and pan should be smooth. The ideal behavior is for scrolling to behave like you have a page on a low friction surface. Imagine that the page is on a low friction ice arena and that every time you fling the page with your finger or the trackpad you’re applying force to the page. The friction of the ice is applying a small force in the opposite direction. If you plot the velocity graph you’d expect to see something like this roughly:

Expected Velocity Graph

Expected Velocity Graph

Now if we scroll perfectly then we expect the velocity of a layer to follow a similar curve. The important part is *not* the position of the velocity curve but it’s the smoothness of the velocity graph. Smooth scrolling means the page has a smooth velocity curve.

Now on to the fun part. Here’s some examples of Scroll-Graph in the browser. Here’s I pan a page and get smooth scrolling with 2 minor inperfections (Excuse the low FPS gif, the scrolling was smooth):

Smooth ScrollGraph velocity = Smooth Scrolling

Smooth ScrollGraph velocity = Smooth Scrolling

Now here’s an example of a jerky page scrolling. Note the Scroll-Graph is not smooth at all but very ‘spiky’ and it was noticeable to the eye:

Jerky scrolling = Jerky Scroll-Graph

Jerky scrolling = Jerky Scroll-Graph

Efficient Multi-Process profiling on B2G

Until a few days ago profiling on b2g was either off or on for the whole system. Worse profiling secondary threads would profile secondary threads of every process. These limitations caused profiling to overwhelm the system and skew performance numbers. Additionally it was difficult to follow how processes waited on each other.

With the landing Bug 914654 it is now possible to profile specific threads on specific processes and merge the results with little effort. Currently profiling secondary threads is disabled on b2g but just locally remove the gonk #ifdef from mozilla_sampler_register_thread and mozilla_sampler_unregister_thread and you’re good to go. Once that’s ready use the profile.sh to start profiling the important threads of your choice. For example if you’re looking into animation delay with the Homescreen you want to run ‘./profile.sh start b2g Compositor && ./profile.sh start Homescreen’ then run ./profile.sh pull which will prepare and merge the data into profile_captured.sym.

Here’s a sample profile collected for the Homescreen swipe animations. Here you can notice the b2g compositor waiting for the paint from the Homescreen then furiously compositing afterwards at nearly 60 FPS. Thus the delay in starting the Homescreen swipe is not caused by the compositor but rather caused by the Homescreen taking too long to rasterize the layers containing the app icons. In this case it’s taking 100ms to prepare which means we’ve missed the first 6 frames of the animation! Happy profiling!

Multi-Process b2g profile

Multi-Process b2g profile

Better Visual Studio Projects

I just wanted to add to Vlad’s post and show the current progress of generating the Visual Studio Project made this weekend by Vlad and I.

By using the approach mentioned in Vlad’s post we can target Visual Studio directly instead of a build system like Ninja. This gives Visual Studio the correct build flags for each file allowing features like IntelliSense and building from Visual Studio using MSBuild to work perfectly. I’m generating a project per shared library and a Visual Studio solution to group the results together. The results are quite neat because it shows at a glance which sources files are included in each library. Any files listed in the visual studio project are safe to edit which should cover the common developer workflow of editing cpp/headers and recompiling. Other rarer changes for platform developers like adding a new file and tweaking build settings should be done by closing visual studio, doing a local change and rebuilding with the mozilla build system and re-openning the updated Visual Studio Solution.

Visual Studio Build

Visual Studio IntelliSense + Build

This isn’t ready for general use yet because we still don’t generate the Link step properly and handle some corner cases in XUL yet but simpler libraries like gkmedias.dll is already working well. Supporting XUL should only require a few days of work. Once this is fixed we can move to supporting advanced Visual Studio features like “Edit and Continue” and distributed compiling.

Profiler @ Snappy Work Week

Last week I attended the snappy work week with goal of syncing up with Julian Seward and Mike Hommey regarding the new Breakpad unwinding feature, planning and prioritizing new feature work. I also got lots of valuable feedback. Here’s what you need to know if you’re a frequent user of the Gecko Profiler.

Breakpad Unwinding (Mobile)

For the last few months Julian has been spearheading integrating the breakpad unwinding feature to the Gecko Profiler. This turned out to be a very complicated task because we’re trying to use Breakpad for something it wasn’t meant to do. This work is now landed but it only works in a very specific build configuration (unwind tables, unpacking the libraries, no elfhack). We are currently working on addressing these problems to make it work in the default Nightly configuration.

Our current target is to use Breakpad unwinding in the Linux May 1st nightly. A week later have the Fennec nightly support breakpad unwinding. I’m hoping afterwards to support Breakpad unwinding on B2G central (not b2g18) but I expect this may be difficult on a memory constraint phone so I currently have no estimate for this at this time.

If you looked at the code for the profiler you may notice TableTicker1+TableTicker2. During the work week we refactored the code significantly to address this code duplication while resolving inconsistencies between the frame-pointer unwinder and the Breakpad unwinder.

Multi-Thread/Multi-Process Support

The profiler will soon support showing samples and timelines for multiple threads. This is already somewhat supported on B2G if you use a script written by Vlad. This will become easier as we begin to add multi-thread support to all the platform backends (already in progress for Mac and Linux) thanks to the help of James Willcox. This ties in heavily with the next two section (IPDL/Plugins).

Plugin Support

Plugin issues have long been identified as the single biggest contributor to responsiveness issues by Vladan‘s Chrome Hang work. This was the topic of the break out discussion headed by Benjamin Smedberg. We discussed a plan of attack by adding profiling to the plugin process which should land this week but require a manual step. This feature will let us know what’s going on in the plugin process when the browser’ main thread is waiting for an IPC response (and vice versa). It will improve as we add profiling support for profiling IPDL messages.

Flash Profiling

Flash initializing audio on plugin-container startup.

IPDL Message Support

During the week we decided to add profiler labels to each IPDL call automatically. That will help where the unwinder isn’t supported but doesn’t give us the data we need. In particular we’re looking to get better data about how IPDL messages are queued, how long they take to get dequeued and to processed. Combined with the Plugin support this should help us diagnose complex issues with our plugin-container or get better data on how particular plugins are hurting the responsiveness of the browser. Georg Fritzsche has an excellent plan in bug 853864 on how to implement this.