Benchmarking sqlite-simple
Introduction
I recently benchmarked my Haskell SQLite bindings package sqlite-simple. I was curious to know how sqlite-simple performance compares to native C, Python and other Haskell database bindings. Initial results for sqlite-simple were extremely poor but improved significantly after optimizations. This post presents the results of this benchmarking and discusses some of the optimizations that resulted from the performance analysis.
Initially sqlite-simple scored barely over 50K rows/s. Optimizations brought this up to 1.8M rows/s, a nice 34x improvement.
Setup
Here’s the setup I used for running my experiments:
- 64-bit Debian running inside VirtualBox (with only a single core enabled)
- Quad-core Intel Core i7 at 3.2GHz
- GHC 7.4.2 from the Haskell Platform
- MySQL 5.5.28, PostgreSQL 9.1.7
- Package versions:
What was benchmarked
The high-level operation of the benchmark is:
- Setup: Initialize a table `testdata` with 10000 rows of data
- Measure the time it takes to query the first column of all these rows
The schema consists of a single `testdata` table, defined as:

```sql
CREATE TABLE testdata (id INTEGER PRIMARY KEY, str TEXT, t TIMESTAMP)
```
See db-bench/db-util for more details.
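The table initialization isn't shown here. As a rough sketch of what it involves, here's an equivalent using Python's built-in sqlite3 module (illustrative only: the actual setup code lives in db-bench/db-util, and the `init_testdata` function and row contents below are made up):

```python
import sqlite3

def init_testdata(conn, nrows=10000):
    # Create the benchmark schema described above.
    conn.execute(
        "CREATE TABLE testdata (id INTEGER PRIMARY KEY, str TEXT, t TIMESTAMP)")
    # Fill it inside a single transaction; committing per row would be far slower.
    conn.executemany(
        "INSERT INTO testdata (id, str, t) VALUES (?, ?, CURRENT_TIMESTAMP)",
        ((i, "row %d" % i) for i in range(1, nrows + 1)))
    conn.commit()

conn = sqlite3.connect(":memory:")
init_testdata(conn)
```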
The sqlite-simple query being measured is defined as:
```haskell
selectInts :: S.Connection -> IO ()
selectInts conn = do
  rows <- S.query_ conn "SELECT id FROM testdata" :: IO [(S.Only Int)]
  checksum $ foldl' (\acc (S.Only v) -> v + acc) 0 rows
```
Basically, it SELECTs all the rows from the `testdata` table, converts the result into a Haskell list of `Int`s and, as a sanity check, sums the resulting integers together.
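For comparison, a rough Python equivalent of this query benchmark using the built-in sqlite3 module (a sketch in the spirit of the db-bench/python version, not its actual code; the `select_ints` name and the in-memory setup are made up for illustration):

```python
import sqlite3

def select_ints(conn):
    # Query the first column of every row and sum the ids as a sanity check,
    # mirroring what the Haskell selectInts benchmark does.
    return sum(row[0] for row in conn.execute("SELECT id FROM testdata"))

# Minimal in-memory setup so the query has something to run against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (id INTEGER PRIMARY KEY, str TEXT, t TIMESTAMP)")
conn.executemany("INSERT INTO testdata (id) VALUES (?)",
                 ((i,) for i in range(1, 10001)))
checksum = select_ints(conn)  # sum of 1..10000 = 50005000
```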
Several variants of this function are benchmarked. The main variants for SQLite are:
- direct-sqlite `selectIntsDS`: lowest possible API level, least memory allocated
- sqlite-simple `selectIntsFold`: convert rows into Haskell data but fold over rows to avoid allocating memory for all rows
- sqlite-simple `selectInts`: convert rows into a Haskell list containing all the rows
We’ll focus mostly on `selectInts` when comparing against other implementations.
You can find implementations of the same for various Haskell database libraries under db-bench/haskell. C and Python implementations are under db-bench/native and db-bench/python, respectively.
Establishing performance targets
My benchmarking goal was to figure out how much overhead the sqlite-simple library adds on top of raw SQLite performance. Ideally, a query should spend all its time in native SQLite and zero time in Haskell bindings.
To better focus optimization work, I first set out to establish some reasonable performance targets to compare against. Establishing targets was straightforward. As sqlite-simple runs on top of direct-sqlite, sqlite-simple can only be as fast as direct-sqlite. Similarly, as direct-sqlite runs on top of the native SQLite library, the fastest sqlite-simple and direct-sqlite can possibly go is as fast as SQLite itself.
To turn this into numbers, I implemented multiple versions of my query benchmark (in order of fastest to slowest):
- A native C benchmark on top of the SQLite library (source)
- A Haskell version using direct-sqlite (source, see function `selectIntsDS`)
- A Haskell version using sqlite-simple (source, see function `selectInts`)
Here’s how they perform:
Native C vs Haskell bindings (rows/s)
Results analysis
The collected benchmark data was used to identify various performance improvement opportunities in sqlite-simple.
Original performance without optimizations was just barely over 50K rows/s. This was a performance bug I caused when I forked sqlite-simple from postgresql-simple. The problem was in a function called `stepStmt` that should’ve been tail recursive but wasn’t. Fixing this in d2b6f6a5 nicely bumped up the score from 53K to 750K rows/s.
The next couple of optimizations dealt mostly with clean up to reduce allocation rate. With 3239d474f0 and 0ee050807d, the benchmark score went up from 750K to 764K rows/s.
At this point I ran out of low-hanging fruit in sqlite-simple and started to look elsewhere for optimizations. A low-level direct-sqlite benchmark was clocking around 2.43M rows/s, which seemed a little low when a C implementation of the same was processing rows at almost 6.9M rows/s. Even a Python reimplementation of my benchmark case was faster, at 2.5M rows/s. To highlight how large the performance delta between C and direct-sqlite was, it’s helpful to turn the comparison into absolute clock cycles. On my 3.2GHz machine, 6.9M rows/s means SQLite spends roughly 460 clock cycles per row.
Similarly, at 2.43M rows/s on direct-sqlite, each row cost roughly 1300 clock cycles, out of which roughly 460 were spent in the native SQLite library. Somehow roughly 840 clock cycles per row were spent in the Haskell SQLite bindings. The overhead of just calling into SQLite from Haskell was higher than the actual cost of computing the result set inside SQLite! Yet, there wasn’t much going on in the wrapper library.
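These cycle figures fall straight out of dividing the clock rate by the measured throughput. As a quick sanity check of the arithmetic:

```python
clock_hz = 3.2e9                       # CPU clock from the setup section

native_cycles = clock_hz / 6.9e6       # ~460 cycles/row inside native SQLite
ds_cycles = clock_hz / 2.43e6          # ~1300 cycles/row via direct-sqlite
overhead = ds_cycles - native_cycles   # roughly 840-850 cycles/row in the bindings
```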
Consider the inner loop of the direct-sqlite benchmark:
```haskell
sumRowsColumnInt64 :: DS.Statement -> Int64 -> IO Int64
sumRowsColumnInt64 stmt !acc = do
  r <- DS.step stmt
  case r of
    DS.Row ->
      do
        i <- DS.columnInt64 stmt 0
        sumRowsColumnInt64 stmt (acc + i)
    DS.Done ->
      return acc
```
The direct-sqlite calls `DS.step` and `DS.columnInt64` map quite directly to their native SQLite counterparts `sqlite3_step` and `sqlite3_column_int`. Thus the expectation is that their cost should be roughly the same as in the C version of this benchmark. No matter how bad a compiler you might have, there’s no way the simple Haskell code around `sqlite3_step` and `sqlite3_column_int` would add up to 840 clocks per row.
Turns out FFI call overhead dominated this benchmark. It was possible to reduce this overhead by using the `unsafe` FFI calling convention for some of the SQLite column accessors. This change was made in direct-sqlite pull request #20. As the name `unsafe` implies, it’s not OK to simply mark all FFI functions as `unsafe` – please refer to the pull request discussion for details.

The tl;dr of the FFI discussion was that it’s OK to mark result data accessors like `sqlite3_column_int64` and `sqlite3_column_count` as `unsafe`, but that any functions doing the actual query heavy lifting, like `sqlite3_step`, should be kept safe.
The FFI optimizations were a big deal: the direct-sqlite benchmark score went up from 2.43M to 3.6M rows/s. As expected, this bumped the sqlite-simple score up to 1.8M rows/s.
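As a quick sanity check of the scores quoted above against the 34x figure from the introduction:

```python
scores = [53e3, 750e3, 764e3, 1.8e6]    # rows/s after each optimization step
total_speedup = scores[-1] / scores[0]  # ~34x end to end, as quoted in the intro
ffi_gain = 3.6e6 / 2.43e6               # ~1.5x for the direct-sqlite FFI change
```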
The following chart summarizes how performance progressed through the above optimizations:
Optimization progress (rows/s)
Comparing to other implementations and databases
As an extra bonus, I also compared my results to other databases and to other Haskell database bindings. Furthermore, I wanted to know how well Python fares with its built-in sqlite3 module. I thus implemented a few more variants of my benchmark:
Database/library comparison (rows/s)
I didn’t do any performance analysis for non-sqlite benchmark results, and wouldn’t draw too many conclusions about the results. These results do seem to confirm though that the HDBC library adds a fairly significant overhead to database row access. Comparing HDBC against sqlite-simple, we get 105K vs 1.8M rows/s.
Next steps
There are still many areas that would need further analysis:
- Use more columns in the result set
- Use other column types than `Int`s
- Per-query overhead
I expect low per-query overhead to be particularly important for web applications. The typical usage in web apps would be that a single page load performs several queries, with a relatively low number of rows returned by each query.
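A per-query benchmark would look quite different from the row-throughput benchmark above: many small queries instead of one large result set. Here’s a rough sketch of how per-query cost could be measured with Python’s sqlite3 module (illustrative only, not part of db-bench):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (id INTEGER PRIMARY KEY, str TEXT, t TIMESTAMP)")
conn.executemany("INSERT INTO testdata (id) VALUES (?)",
                 ((i,) for i in range(1, 101)))

n_queries = 1000
start = time.perf_counter()
for i in range(n_queries):
    # Each iteration is one tiny query, so total time is dominated by
    # per-query overhead rather than row conversion cost.
    conn.execute("SELECT id FROM testdata WHERE id = ?",
                 (1 + i % 100,)).fetchone()
elapsed = time.perf_counter() - start
per_query_us = elapsed / n_queries * 1e6  # average microseconds per query
```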
Thanks to Emmanuel Surleau, Irene Knapp and Joey Adams for their optimization ideas, contributions and code reviews.
Until next time.