Using RecordReplay to investigate intermittent oranges

This is a quick write-up summarizing my and Jeff's experience using rr to debug a fairly rare intermittent reftest failure. There's still a lot to be learned about how to use rr effectively, so I'm hoping that sharing this will help others.

Finding the root of the bad pixel

First, given an offending pixel, I was able to set a breakpoint on it using these instructions. Next, using rr-dataflow, I was able to step from the offending bad pixel to the display item responsible for it. Let me emphasize this for a second, because it's incredibly impressive: rr + rr-dataflow allows you to go from a buffer, through an intermediate surface, to the compositor on another thread, through another intermediate surface, back to the main thread and eventually back to the relevant display item. All of this was automated except for the step where two pixels are blended together, which is logically ambiguous. The speed at which rr was able to reverse-continue through this execution was very impressive!
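
If you haven't used this workflow before: at its core it's just hardware watchpoints plus reverse execution, which rr and gdb provide out of the box; rr-dataflow automates chaining those steps across buffer copies and threads (I believe it exposes this as an origin command, but treat the exact spelling as an assumption). A minimal sketch of one manual step, with a hypothetical pixel address:

rr replay                                   # replay the recorded failing run
(gdb) watch -l *(uint32_t*)0x7fab12340000   # watch the bad pixel's memory (hypothetical address)
(gdb) reverse-continue                      # run backwards to the last write to that pixel
(gdb) bt                                    # see what produced the value; repeat on its source buffer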

Here’s the trace of this part: rr-trace-reftest-pixel-origin

Understanding the decoding step

From here I started comparing a replay of a failing run against a non-failing run, and it was clear that the DisplayList was different: in one we have an nsDisplayBackgroundColor, in the other we don't. From here I was able to step through the decoder and compare the sequences. This was very useful in ruling out possible theories. It was easy to step forward and backward in the good and bad replay debugging sessions to test out various theories about race conditions and to understand at which part of the decode process the image was rejected. It turned out that we send two decodes: one for the metadata that is used to size the frame tree, and the other for the image data itself.

Comparing the frame tree

In hindsight, it would have been quicker to start debugging this test by looking at the frame tree first (and, for other tests, I imagine the display list and layer tree). It works even better if you have a good and a bad trace so you can compare the differences in the frame tree. From here, I found that the difference in the layer tree came from a change hint that wasn't guaranteed to arrive before the draw.

The problem is now well understood: when we do a sync decode on reftest draw, if there's an image error we won't flush the style hints since we're already too deep in the painting pipeline.

Takeaways

  • Finding the root cause of a bad pixel is very easy and fast using rr-dataflow.
  • However it might be better to look for obvious frame tree/display list/layer tree difference(s) first.
  • Debugging a replay is a lot simpler than debugging non-deterministic re-runs, and a lot less frustrating too.
  • rr is really useful for race conditions, especially rare ones.

Multi-threaded WebGL on Mac

(Follow Bug 1232742 for more details)

Today we enabled CGL's 'Multi-threaded OpenGL Execution' mode on Nightly. There's a lot of good information on that page so I won't repeat it all here.

In short, even though OpenGL is already an asynchronous API, the driver must do some work to prepare the OpenGL command queue, which can be a bottleneck for certain content. Right now Firefox makes OpenGL calls on the main thread on all platforms, so it's up to the driver where and how to schedule this work. Typically the command buffers are prepared synchronously at the moment of the call, which in some cases carries a lot of overhead. If a WebGL application needs the full refresh window, say ~15ms, then the work done in the driver can start causing missed frames, degrading performance. Turning on this feature moves this work to another thread.
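
To make "CPU-bound on GL calls" concrete, here's a minimal illustrative snippet (not taken from the Unity benchmark) where the per-frame cost is dominated by submitting many small draw calls, the kind of content this feature helps:

var canvas = document.createElement('canvas');
var gl = canvas.getContext('webgl');

function compile(type, src) {
  var shader = gl.createShader(type);
  gl.shaderSource(shader, src);
  gl.compileShader(shader);
  return shader;
}

// Trivial program: each draw call is cheap on the GPU; the cost we're
// measuring is the CPU-side command-buffer preparation in the driver.
var prog = gl.createProgram();
gl.attachShader(prog, compile(gl.VERTEX_SHADER,
  'attribute vec2 p; void main() { gl_Position = vec4(p, 0.0, 1.0); }'));
gl.attachShader(prog, compile(gl.FRAGMENT_SHADER,
  'void main() { gl_FragColor = vec4(1.0); }'));
gl.linkProgram(prog);
gl.useProgram(prog);

var buf = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, buf);
gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([0, 0, 1, 0, 0, 1]), gl.STATIC_DRAW);
var loc = gl.getAttribLocation(prog, 'p');
gl.enableVertexAttribArray(loc);
gl.vertexAttribPointer(loc, 2, gl.FLOAT, false, 0, 0);

function frame() {
  var start = performance.now();
  for (var i = 0; i < 5000; i++) {
    gl.drawArrays(gl.TRIANGLES, 0, 3);
  }
  // If this approaches the ~15ms frame budget, frames get missed; the
  // multi-threaded engine moves the driver's share of it off the main thread.
  console.log('GL submission took ' + (performance.now() - start) + ' ms');
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);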

We decided to turn it on because we believe it will be overall beneficial for well-optimized WebGL content that is CPU-bound, as measured in the Unity benchmark:

Unity Multi-thread WebGL Benchmark

Overall this leads to a 15% score improvement. However, in some cases the scores are worse, because the feature only speeds up demos that are bound by the CPU overhead of certain OpenGL calls.

Turning this on is an experiment to see the benefit of making OpenGL calls asynchronous. Based on the results, we may consider ‘remoting’ the OpenGL calls manually on more platforms for performance reasons, but we're still undecided since it would be a large new project.

Image Decoding on the GPU now in Nightly

This was posted on April 1st as an April Fools’ Day hoax.

In 2013-2014 a lot of effort was put into moving image decoding to a background thread. However, it became obvious that even parallel off-main-thread decoding was still the critical path for presenting image-heavy pages. The biggest problem we faced was that on B2G, keeping active images uncompressed in main memory was something we simply could not afford on a 128 MB device, even if it was just for visible images.

Enter image decoding on the GPU. The goal is to use the GPU to parallelize the decoding of each visible (and only the visible) pixel, instead of just getting per-image parallelization and doing full-image decodes. The biggest advantage, however, comes from the reduced GPU upload bandwidth: we can upload a compressed texture instead of a large 32-bit RGB bitmap.

We first explored using s3tc compressed textures. However, this still required decoding the image and re-compressing it to s3tc on the CPU, thus regressing page load times.

The trick we ended up using instead was providing the raw JPEG stream encoded as a much smaller RGB texture plane. Using a clever shader, we sample from the compressed JPEG stream when compositing the texture to the frame buffer. This means that we never have to fit the uncompressed texture in main memory, so pages that would normally cause a memory usage spike leading to an OOM no longer have any memory spike at all.

GPU Image Decoding

The non-trivial bit was designing a shader that can sample from a JPEG texture and composite the decompressed result on the fly without any GPU driver modification. We bind a 3D LUT texture to the second texture unit to perform some approximations when doing the DCT lookup, speeding up the shader units; this requires a single 64KB 3D lookup texture that is shared for the whole system. The challenging part of this project, however, is taking the texture coordinates S&T and looking up the relevant DCT block in the JPEG stream. Since the JPEG stream uses Huffman encoding, it's not trivial to map an (x, y) coordinate in the decompressed image to a position in the stream. For the lookup, our technique uses the work of D. Charles et al.

Testing a JS WebApp

This post is available in Russian on softdroid.net: Тестирование приложения JS WebApp (Testing a JS WebApp).

Test Requirements

I've been putting off testing my cleopatra project (https://github.com/bgirard/cleopatra) for a while now because I wanted to take the time to find a solution that would satisfy the following:

  1. The tests can be executed by visiting a particular URL.
  2. The tests can be executed headless using a script.
  3. No server side component or proxy is required.
  4. Stretch goal: Continuous integration tests.

After a bit of research I came up with a solution that addressed my requirements. I'm sharing it here in case it helps others.

First, I found that the easiest way to achieve this was to pick a test framework to get 1), and a headless browser solution to cover 2) and 3).

Picking a test framework

For the Test Framework I picked QUnit. I didn’t have any strong requirements there so you may want to review your options if you do. With QUnit I load my page in an iframe and inspect the resulting document after performing operations. Here’s an example:

QUnit.test("Select Filter", function(assert) {
  loadCleopatra({
    query: "?report=4c013822c9b91ffdebfbe6b9ef300adec6d5a99f&select=200,400",
    assert: assert,
    testFunc: function(cleopatraObj) {
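      // Nothing to verify at this stage for this test.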
    },
    profileLoadFunc: function(cleopatraObj) {
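      // Nothing to verify here either; the real check happens below.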
    },
    updatedFiltersFunc: function(cleopatraObj) {
      var samples = shownSamples(cleopatraObj);

      // The sample counts for the two threads in the profile are both 150
      assert.ok(samples === 150, "Loaded profile");
    }
  });
});

Here I just load a profile, and once the document fires an updateFilters event I check that the right number of samples is selected.

You can run the latest cleopatra test here: http://people.mozilla.org/~bgirard/cleopatra/test.html

Picking a browser (test) driver

Now that we have a page that can run our test suite, we just need a way to automate the execution. It turns out that PhantomJS (for WebKit) and SlimerJS (for Gecko) provide exactly this. With a small driver script we can load our test.html page and set the process return code based on the result of our test framework, QUnit in this case.
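
Here's a minimal driver sketch along those lines (illustrative, not the exact run_qunit.js from the repo; it assumes test.html stashes QUnit's results on the window, e.g. QUnit.done(function(details) { window.qunitDone = details; })):

var system = require('system');
var page = require('webpage').create();

// Forward console output from the test page so failures show up in CI logs.
page.onConsoleMessage = function(msg) { console.log(msg); };

page.open(system.args[1], function(status) {
  if (status !== 'success') {
    console.log('Failed to load ' + system.args[1]);
    phantom.exit(1);
  }
  // Poll until QUnit reports completion, then mirror its result in the
  // process return code.
  var interval = setInterval(function() {
    var failed = page.evaluate(function() {
      return window.qunitDone ? window.qunitDone.failed : null;
    });
    if (failed !== null) {
      clearInterval(interval);
      phantom.exit(failed > 0 ? 1 : 0);
    }
  }, 100);
});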

Stretch goal: Continuous integration

If you hook up the browser driver to run via a simple test.sh script, adding continuous integration is simple. Thanks to Travis-CI and GitHub, it's easy to set up your test script to run per check-in and send notifications.

All you need is to configure Travis-CI to look at your repo and to check in an appropriate .travis.yml config file. Your .travis.yml should configure the environment. PhantomJS is pre-installed and should just work. SlimerJS requires a Firefox binary and a virtual display, so it just requires a few more configuration lines. Here's the final configuration:

env:
  - SLIMERJSLAUNCHER=$(which firefox) DISPLAY=:99.0 PATH=$TRAVIS_BUILD_DIR/slimerjs:$PATH
addons:
  firefox: "33.1"
before_script:
  - "sh -e /etc/init.d/xvfb start"
  - "echo 'Installing Slimer'"
  - "wget http://download.slimerjs.org/releases/0.9.4/slimerjs-0.9.4.zip"
  - "unzip slimerjs-0.9.4.zip"
  - "mv slimerjs-0.9.4 ./slimerjs"

notifications:
  irc:
    channels:
      - "irc.mozilla.org#perf"
    template:
     - "BenWa: %{repository} (%{commit}) : %{message} %{build_url}"
    on_success: change
    on_failure: change

script: phantomjs js/tests/run_qunit.js test.html && ./slimerjs/slimerjs js/tests/run_qunit.js $PWD/test.html

Happy testing!

Gecko Bootcamp Talks

Last summer we held a short bootcamp crash course for Gecko. The talks have been posted to air.mozilla.org and collected under the TorontoBootcamp tag. The talks are about an hour each and will be very informative to some; they are aimed at people wanting a deeper understanding of Gecko.

View the talks here: https://air.mozilla.org/search/?q=tag%3A+TorontoBootcamp

Gecko Pipeline

In the talks you'll find my first talk covering an overall discussion of the pipeline: what stages run when, and how to skip stages for better performance. Kannan's talk discusses Baseline, our first-tier JIT. Boris' talk discusses Restyle and Reflow. Benoit Jacob's talk discusses the graphics stack (rasterization + compositing + the IPC layer), but sadly the camera is off-center for the first half. Jeff's talk goes into depth on rasterization, particularly path drawing. My second talk discusses performance analysis in Gecko using the Gecko Profiler, where we look at profiles of real performance problems.

I'm trying to locate two more videos from another session: one that elaborates more on the DisplayList/Layer Tree/Invalidation phase, and another on compositing.

CallGraph Added to the Gecko Profiler

In the profiler you'll now find a new tab called 'CallGraph'. This will construct a call graph from the sample data. It's the same information that you can extract from the tree view and the timeline, just formatted so that it can be scanned more easily. Keep in mind that this is only a call graph of what occurred between sample points, not a fully instrumented call graph dump: it has a lower collection overhead, but misses anything that occurs between sample points. You'll still want to use the Tree view to get aggregate costs. You can interact with the view using your mouse or with the W/A/S/D-equivalent keys of your keyboard layout.
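
Conceptually the construction is simple: every sample carries a stack, each frame becomes a node, and adjacent frames within a stack become caller-to-callee edges, weighted by how many samples they appear in. A rough sketch of the idea (illustrative, not the profiler's actual code):

// Each sample is an array of frame names from root to leaf,
// e.g. ['main', 'Paint', 'DrawRect'].
function buildCallGraph(samples) {
  var nodes = {}, edges = {};
  samples.forEach(function(stack) {
    stack.forEach(function(frame, i) {
      nodes[frame] = nodes[frame] || { total: 0, self: 0 };
      nodes[frame].total++;   // on the stack at this sample (recursion double-counts; ignored here)
      if (i === stack.length - 1) {
        nodes[frame].self++;  // executing at the sample point
      }
      if (i > 0) {
        // A caller -> callee relationship observed at a sample point.
        var edge = stack[i - 1] + ' -> ' + frame;
        edges[edge] = (edges[edge] || 0) + 1;
      }
    });
  });
  return { nodes: nodes, edges: edges };
}

// Example: buildCallGraph([['main', 'Paint', 'DrawRect'], ['main', 'Paint']])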

Profiler CallGraph

Big thanks to Victor Porof for writing the initial widget. This visualization will be coming to the devtools profiler shortly.

Improving Layer Dump Visualization

I've blogged before about adding a feature to visualize platform log dumps, including the layer tree. This week, while working on bug 1097941, I had no idea which module the bug was coming from. I used this opportunity to improve the layer visualization features, hoping it would help me identify the bug. Here are the results (working for both desktop and mobile):

Layer Tree Visualization Demo – Maximize me

This tool works by parsing the output of layers.dump and layers.dump-texture (not yet landed). I reconstruct the data as DOM nodes, which can quite trivially support the features of a layer tree because layer trees are designed to map to CSS. From there, some JavaScript or the browser devtools can be used to inspect the tree. In my case all I had to do was locate which layer my bad texture data was coming from: 0xAC5F2C00.
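
The parsing itself is mostly indentation bookkeeping: each line's indent depth identifies its parent in the tree, and each layer becomes a DOM node. A rough sketch (illustrative; the real tool parses the actual layers.dump fields):

// Turn indented layers.dump-style output into nested DOM nodes.
function parseLayerDump(text) {
  var rootElem = document.createElement('div');
  var stack = [{ depth: -1, elem: rootElem }];
  text.split('\n').forEach(function(line) {
    if (!line.trim()) return;
    var depth = line.match(/^\s*/)[0].length;  // indentation == tree depth
    var elem = document.createElement('div');
    elem.textContent = line.trim();            // e.g. 'PaintedLayer (0xAC5F2C00) ...'
    elem.style.marginLeft = '1em';
    // Pop back to this line's parent, then append under it.
    while (stack[stack.length - 1].depth >= depth) stack.pop();
    stack[stack.length - 1].elem.appendChild(elem);
    stack.push({ depth: depth, elem: elem });
  });
  return rootElem;
}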

If you want to give it a spin, just copy this pastebin and paste it here and hit 'Parse'. Note: I don't intend to keep backwards compatibility with this format, so this pastebin may break after I go through review for the new layers.dump-texture format.