Performance Regressions are Painful

I think it is a well known fact that performance regressions are really painful. They cause pain on more than one front too! You have users who have a less responsive application, drivers who have to figure out who and what caused the regression, and developers who have to backout or come up with a fix for the regression.

Until recently, we only had a heavy handed tool (the graph server) that is slow and painful to use. Recently, Johnathan has revamped his performance dashboard which is a very quick and easy way for people to see the current status of some of the more important performance graphs (it’s also easy to hack on!). This has made spotting a regression much easier and faster, which great increases the odds of the offending change(s) being backed out. The longer a performance bug is left in the tree, the harder it becomes to do a straight backout.

Today I decided I was going to spend the day eliminating the rest of our open performance bugs (or make sure they had blocking requested for the current release). However, I was amazed at how many old performance bugs we had open that hadn’t been touched in six months or more. I probably closed about 20 bugs as INCOMPLETE since there was virtually no way we were going to be able fix those bugs anymore. One bug I was actually able to mark as FIXED, and there were a few more that were recent regressions that I posted comments on to make sure people were still working on.

This made me realize that we have a serious problem though. We currently have no way for people who care about performance regressions to easily be aware of new bugs filed. To help this, I went ahead and filed bug 467170 which will allow folks to add an e-mail address to the cc list of all performance regression bugs so the folks who care about this can watch the address and get mail about these issues. Once bug 464609 gets resolved the sheriffs will also have a place to bring these issues up so the next sheriff is aware of what is going on as well.

I think we are starting to move in the right direction when it comes to performance monitoring, but I think we also have a long ways to go. Remember kids, only you can help stop performance regressions.

Mozilla Personal Technology

I’m in a podcast!

A little while ago I got interviewed by Anthony Bryan from the metalink project. I feel sorry for him because it took me several long months to actually get the time to sit down and talk to him. Anyway, you can check out the podcast here. They used an old facebook photo of mine, so, uh, pardon the odd image of me. I figured it could be worse though.

It’s worth a listen though. I talk about some of the features of the new download manager (old news now, but he ask for this interview a while ago…), how I got involved with the Mozilla project, and a few other interesting tidbits including what I’ve currently been working on. There is other interesting things in there too!


DTrace Awesomeness

Yesterday dietrich was telling me he was seeing a lot of writes to places.sqlite and places.sqlite-journal. I wanted to get good, hard data to see what was writing, and how often we were doing it. I figured this was a good option for DTrace, but I've had mixed experiences with it in the past. I first turned to Instruments on OS X, but that can't give you stacks for calls, so I had to dump it.

I talked to dolske, who happens to be the resident DTrace expert around here. With his help, I was able to put together this little D script to track writes to the files in question, and give me user stack traces so we know who is writing and when. With this, we've figured out what was writing, and are working on how to make it write less as we speak.

DTrace really is an awesome tool, even if it can be a bit awkward to use from time to time.


DOM Inspector 2.0.1

I’ve just released DOM Inspector 2.0.1. This contains a number of stability and UI fixes. This is available on AMO, but you should automatically receive the update.


Determining a Ts Regression

For those who have been following the tree status of mozilla-central as of late, you probably noticed that I tried to land SQLite once again, but it was backed out due to a nasty Ts regression on Linux. When I had run this through the try server, it had shown no regression so I had thought it was safe (just like the past three or four other times I’ve tried to land this). Luckily, Johnathan, who was the sheriff when I landed, found a linux box that we could use that reproduced this problem. With a lot of his help, we got standalone talos running just Ts, to get strace logs during startup.

Once I had those logs, I needed some way to parse the files for data so I can use it in a reasonable way. I wrote a python script to parse the strace logs, and then insert them into a sqlite database file (26.8 MB) so I could run some interesting queries on the data.

With that data, I decided to generate some graphs to easily see what was going on. All of these graphs compose the data from the six runs of Firefox that talos ran – the data is all summed up. All the graphs have larger versions available if you click on them.

I figured that the most useful graph for investigating this Ts regression would be execution time:
Total execution time

Note that that is six runs of Firefox, which is why it is as long as it is. Next, I looked at the average execution time for each function call:
Average execution time

And finally, I looked at the number of calls of each of these functions:
Number of calls

We are clearly seeing an increase in the number of fsync calls, and we know that on Linux those can be more painful than they are on other operating systems. My next step is to see if we also see this increase on OS X. If we do, I’m going to assume we see it on windows as well, and get backtraces of every single fsync call to determine why we’ve double the number of calls by upgrading.

I’ll make a new post as more data comes in.