Using RecordReplay to investigate intermittent oranges, bug #2 part 2

As promised, here’s the follow-up to part 1.

Getting Lost in a Replay

In part 1 I made a mistake. I accidentally got lost in the replay and started debugging from the wrong point. I believe I might have started with the wrong event number.

The best way to make sure you’re not lost in a replay is to use the ‘when’ command, which prints the timeline event ID you’re currently at. You can compare this event ID against the event marks in stdout to make sure you’re at the point in the replay that you expect. Because of issue #1653 it wasn’t possible to use ‘when’ without risking a crash, so I couldn’t see that I was lost. However, this is now fixed on trunk! The fix landed today, and I was able to use ‘when’ from time to time to make sure I was where I wanted to be.

On this note, I think RR needs a feature that makes it easier to stay within certain bounds. When doing continue/reverse-continue it’s easy to jump past the reftest you’re interested in, and suddenly you’re debugging something unrelated.

Restarting from the Display List

This time the display list was reporting that both the good and the bad frame had the correct Image Rotation. However the frame size was wrong.

I had originally assumed that the image wasn’t rotated because of its proportions. On closer inspection, however, the image was in fact rotated properly but stretched out in the opposite direction, giving the impression that it was not rotated.

With this information it was now clear that the rotation was correct but the image frame size was not.

Building a Timeline and Exploiting Checkpoints

It turns out that there are a few interesting events in how the nsImageFrame is sized: ‘nsLayoutUtils::OrientImage’, ‘nsImageFrame::Reflow’, ‘nsImageFrame::GetIntrinsicSize’ and ‘nsImageFrame::UpdateIntrinsicSize’.

Since we suspected a race condition, it was likely that these were being called in a bad sequence in the ‘bad’ trace. That guess turned out to be correct.

You can build a timeline manually by using ‘when’ and ‘when-ticks’ and keeping notes. Instead, we exploited the checkpoint feature, which also records the ‘when’. Checkpoints keep track of the events you care about and ‘when’ they occur, and they make it easy to jump back to these key moments. Once we had created checkpoints at the interesting locations, we could call ‘info checkpoint’ and read off the call sequence by manually sorting on the ‘when’. This gives better notes than keeping them by hand. We ended up doing a lot of back and forth with these checkpoints, trying to understand the timeline differences. Here’s the good trace:

(rr) info checkpoint
ID    When    Where
1    646393      PresShell::RenderDocument (this=0x3c831e67f000, aRect=..., aFlags=12, aBackgroundColor=4294967295, aThebesContext=0x2f701d8a5280) at /home/bgirard/mozilla-central/tree/layout/base/nsPresShell.cpp:4497
2    646484      nsDisplayList::PaintRoot (this=0x7ffede3eaf50, aBuilder=0x7ffede3eb180, aCtx=0x7ffede3eb970, aFlags=1) at /home/bgirard/mozilla-central/tree/layout/base/nsDisplayList.cpp:1565
3    644554      nsLayoutUtils::OrientImage (aContainer=0xf3568bed200, aOrientation=...) at /home/bgirard/mozilla-central/tree/layout/base/nsLayoutUtils.cpp:6702
4    644598      0x00003c831654ff9e in nsImageFrame::Reflow (this=0x20940978ad38, aPresContext=0x3c831e512000, aMetrics=..., aReflowState=..., aStatus=@0x7ffede3f10b4: 0) at /home/bgirard/mozilla-central/tree/layout/generic/nsImageFrame.cpp:959
6    644566      nsImageFrame::GetIntrinsicSize (this=0x20940978ad38) at /home/bgirard/mozilla-central/tree/layout/generic/nsImageFrame.cpp:919
7    644554      nsImageFrame::UpdateIntrinsicSize (this=0x20940978ad38, aImage=0xf3568b17070) at /home/bgirard/mozilla-central/tree/layout/generic/nsImageFrame.cpp:289

Here we can tell that the ‘good’ call sequence is UpdateIntrinsicSize, GetIntrinsicSize, Reflow. Doing the same for the ‘bad’ replay gives us GetIntrinsicSize, UpdateIntrinsicSize, Reflow. Note how in the ‘bad’ replay we call UpdateIntrinsicSize with the right values but never query them afterwards. That’s bad!
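The manual sort on ‘when’ is easy to script. Here’s a minimal sketch in Python, assuming you’ve pasted the ‘info checkpoint’ output into a string. The function names (parse_checkpoints, timeline) are my own for illustration and not part of rr, and the listing is a trimmed copy of the good trace with the arguments cut out.

```python
import re

# Trimmed copy of the 'good' trace: checkpoint ID, 'when', then the location.
# Arguments are cut down to "..." here; only the first three fields matter.
INFO_CHECKPOINT = """\
ID    When    Where
1    646393      PresShell::RenderDocument (this=...)
2    646484      nsDisplayList::PaintRoot (this=...)
3    644554      nsLayoutUtils::OrientImage (aContainer=...)
4    644598      0x00003c831654ff9e in nsImageFrame::Reflow (this=...)
6    644566      nsImageFrame::GetIntrinsicSize (this=...)
7    644554      nsImageFrame::UpdateIntrinsicSize (this=...)
"""

# ID, 'when', an optional "0x... in " prefix, then the qualified function name.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(?:0x[0-9a-fA-F]+ in )?([\w:~]+)")


def parse_checkpoints(text):
    """Return (when, checkpoint_id, function) tuples; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((int(match.group(2)), int(match.group(1)), match.group(3)))
    return rows


def timeline(text):
    """Sort checkpoints by 'when'. Ties fall back to checkpoint ID, which is
    only the order the checkpoints were created in, not the real call order."""
    return sorted(parse_checkpoints(text))


if __name__ == "__main__":
    for when, checkpoint_id, function in timeline(INFO_CHECKPOINT):
        print(f"{when:>8}  #{checkpoint_id}  {function}")
```

Note that OrientImage and UpdateIntrinsicSize share the same ‘when’ (644554), so the script can’t order them relative to each other. You still have to resolve ties like that by hand.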

At some point I think RR could use some UI improvements to make it easier to build timelines (and deal with ‘when’ ties), but not today.

Testing Timeline Theories

At this point we had several theories about the timeline. We tested them mostly with reverse-continue, to get a better idea of the timeline differences, plus a few reverse-continues with a memory watchpoint (watch -l).

The Bug at Last

Turns out that the bug is pretty complicated. There’s a race condition around when we fire the ‘load’ event on the ImageDocument. At this point we’ve actually fully decoded the image and have the correct size and rotation for it. However, in the ‘bad’ trace we’re missing something very important: we haven’t yet built the proper StyleVisibility for the image frame, so it is missing an important ‘image-orientation’ property. Without this CSS property we ignore the image rotation when computing the size of the ImageDocument, even though the rotation is set correctly and on time. We also have no hooks to invalidate the size of the ImageDocument when the CSS property finally comes in, so in this race condition the bad value sticks forever. We do get the correct image rotation value when we paint, so the image is rotated correctly, but since the ImageDocument has the wrong size it is stretched out vertically.
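To make the ordering problem concrete, here’s a toy model of the race in Python. It is emphatically not Gecko code: ToyImageDocument, its fields and the 400x200 image are all made up for illustration. The only things it borrows from the real bug are that the size is computed once on ‘load’, that the rotation is ignored until the style is resolved, and that nothing recomputes the size afterwards.

```python
class ToyImageDocument:
    """Hypothetical stand-in for the real classes: sizes itself once, on 'load'."""

    def __init__(self, decoded_size, exif_rotated_90):
        self.decoded_size = decoded_size        # (width, height) as decoded
        self.exif_rotated_90 = exif_rotated_90  # rotation is known at decode time
        self.style_ready = False                # 'image-orientation' resolved yet?
        self.document_size = None

    def resolve_style(self):
        # Deliberately does not touch document_size: this models the
        # missing invalidation hook described above.
        self.style_ready = True

    def oriented_size(self):
        width, height = self.decoded_size
        # The rotation is only honoured once the style is available.
        if self.style_ready and self.exif_rotated_90:
            return (height, width)
        return (width, height)

    def on_load(self):
        # Computed once and never revisited.
        self.document_size = self.oriented_size()


def run(style_first):
    """Return (size stored on load, size used at paint time) for one ordering."""
    doc = ToyImageDocument((400, 200), exif_rotated_90=True)
    steps = [doc.resolve_style, doc.on_load]
    if not style_first:
        steps.reverse()
    for step in steps:
        step()
    return doc.document_size, doc.oriented_size()


if __name__ == "__main__":
    print("good:", run(style_first=True))
    print("bad: ", run(style_first=False))
```

In the ‘good’ ordering both sizes agree. In the ‘bad’ ordering the stored size keeps the unrotated dimensions while paint uses the rotated ones, which is the mismatch that shows up as stretching.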

The Fix

Turns out that we can avoid sizing the image frame entirely and just use CSS to implement the shrinkToFit while preserving the aspect ratio. The fix was to use ‘object-fit: contain;’ instead of trying to sync up the width/height of ImageDocument, nsImageFrame and nsIDOMHTMLImageElement as the CSS styles are changing. Using CSS to implement the shrinkToFit behavior is much more robust.
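For anyone who hasn’t used it, the arithmetic behind ‘contain’ is simple: scale both dimensions by the same factor, picking the largest factor that still fits the box. Here’s a small Python sketch of that calculation. It only illustrates the math, not how Gecko implements ‘object-fit’, and contain_fit is a name I made up.

```python
def contain_fit(image_w, image_h, box_w, box_h):
    """Scale (image_w, image_h) to fit inside (box_w, box_h), keeping the aspect ratio.

    Illustrative helper only; it is not part of any browser API.
    """
    if image_w <= 0 or image_h <= 0:
        raise ValueError("image dimensions must be positive")
    # One scale factor for both axes is what preserves the aspect ratio.
    scale = min(box_w / image_w, box_h / image_h)
    return image_w * scale, image_h * scale


if __name__ == "__main__":
    # A 400x200 image in a 100x100 box is limited by its width.
    print(contain_fit(400, 200, 100, 100))
    # The same image rotated (200x400) is limited by its height instead.
    print(contain_fit(200, 400, 100, 100))
```

The point is that the layout engine does this with whatever intrinsic size it currently has, so there is no separately stored width/height that can go stale.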

The good news is that this wasn’t just a fix for an infrastructure race. The bad behavior can actually be seen manually by toggling ‘image-orientation’ in an image document with EXIF rotation. This intermittent was warning us about a real problem.
