Crash! Boom! Bang! But I just wanted to delete some history…
If you’ve been experiencing crashes when trying to delete a large set of history, I have great news for you! We’ve identified the issue, and a fix will be coming to you shortly!
Read more
Asynchronous Location Bar has Landed
About two weeks ago the asynchronous location bar work landed in mozilla-central without much issue. It’s also in the Firefox 3.6 alpha we just recently released. This has the potential to impact all of our users, but those on slower hard drives will notice this the most. Your location bar searches may not complete any faster than before, but they certainly won’t be hanging your browser and locking up the UI.
Background
We’ve been getting reports for some time about the location bar hanging the application for some users when they are typing in it. This wasn’t a problem that was reproducible on every machine, and even on machines that saw it, it wasn’t always 100% reproducible. Clearly, this behavior is not desirable, so we set out to fix it.
I had a theory to the cause almost a year ago and filed a bug that I was hoping we could work on and fix for Firefox 3.5. We knew that reading data off a disk can be slow (and certainly would complete in a non-deterministic amount of time). Since SQLite uses blocking read calls (no more code can execute until the data is read from disk), this could certainly be the cause of the slowdown our users were seeing. Some simple profiling showed that this was largely the cause of the hanging. Work began on the project, but it was clear that enough issues were cropping up that we were not going to be able to safely take this change for Firefox 3.5, and resources were diverted elsewhere.
Process and Solution
This section is a bit technical, so feel free to skip it. The short answer is “do not block the main thread while reading from the hard drive.”
In order to not block the main thread while reading from disk we either need to make SQLite use non-blocking read system calls, or call into SQLite off of the main thread. Changing the SQLite code isn’t something we want to do, so that solution was out of the question. Luckily, we had solved a similar problem with writes and fsyncs earlier in the Firefox 3.5 development with the asynchronous Storage API.
The first implementation that we tried essentially did the same thing that the old code did. We would execute a query, but this time asynchronously, and then process the results and see if they match. There were two issues with this approach, however. The first issue was that we were filtering every history and bookmark entry on the main thread for a given search. That could be a lot of work we end up doing, and with the additional overhead of moving data across threads, the common case would see no win. The second issue was that once we selected a result in the location bar, and a search was not yet complete, there would be a hang as the main thread processed a bunch of events that Storage had posted to it containing results.
At this point, we realized we needed to do the filtering on a thread other than the main thread. After some thought, we was figured that the easiest way to do that would be to use a SQL function that we define in the WHERE clause of our autocomplete queries. This way, all the filtering is done on a background thread, and the code that runs on the main thread only deals with results we will actually use. This solution exposed some things in the Storage backend like lock contention and a few other subtle issues, but nothing major came up.
For more details on how the location bar search results are generated, see my explanation here.
If you weren’t having a problem before, chances are you won’t notice any difference at all.
Query Performance Monitoring
Over in bug 481261 I am adding support in mozStorage to warn when a query doesn’t use an index to sort. This is a debug only warning that uses NS_WARNING to tell you what query is at fault, and the total number of sort operations that were used on it.
Luckily, SQLite exposes a handy function that tells us this information. In general, you want to use a index in your ORDER BY clause so SQLite can use it to generate the order the results a given back. If you do not use an index, SQLite has to first get all the results in memory, then it sorts them, and then it starts to return results. If you expect a lot of results, this can get very expensive.
There were suggestions on the newsgroups to also add an API so debuggers such as ChromeBug or SQLite Manager could listen for these types of errors. I’ll be filing follow-up bugs for some of these suggestions later on.
Determining a Ts Regression
For those who have been following the tree status of mozilla-central as of late, you probably noticed that I tried to land SQLite once again, but it was backed out due to a nasty Ts regression on Linux. When I had run this through the try server, it had shown no regression so I had thought it was safe (just like the past three or four other times I’ve tried to land this). Luckily, Johnathan, who was the sheriff when I landed, found a linux box that we could use that reproduced this problem. With a lot of his help, we got standalone talos running just Ts, to get strace logs during startup.
Once I had those logs, I needed some way to parse the files for data so I can use it in a reasonable way. I wrote a python script to parse the strace logs, and then insert them into a sqlite database file (26.8 MB) so I could run some interesting queries on the data.
With that data, I decided to generate some graphs to easily see what was going on. All of these graphs compose the data from the six runs of Firefox that talos ran – the data is all summed up. All the graphs have larger versions available if you click on them.
I figured that the most useful graph for investigating this Ts regression would be execution time:

Note that that is six runs of Firefox, which is why it is as long as it is. Next, I looked at the average execution time for each function call:

And finally, I looked at the number of calls of each of these functions:

We are clearly seeing an increase in the number of fsync calls, and we know that on Linux those can be more painful than they are on other operating systems. My next step is to see if we also see this increase on OS X. If we do, I’m going to assume we see it on windows as well, and get backtraces of every single fsync call to determine why we’ve double the number of calls by upgrading.
I’ll make a new post as more data comes in.