Benchmarking sqlite-simple

Introduction

I recently benchmarked my Haskell SQLite bindings package sqlite-simple. I was curious to know how sqlite-simple performance compares to native C, Python and other Haskell database bindings. Initial results for sqlite-simple were extremely poor but improved significantly after optimizations. This post presents the benchmark results and discusses some of the optimizations that came out of the performance analysis.

Initially sqlite-simple scored barely over 50K rows/s. Optimizations brought this up to 1.8M rows/s, a nice 34x improvement.

Setup

Here’s the setup I used for running my experiments. The benchmark machine runs at 3.2GHz; that figure matters for the cycles-per-row analysis later on.

What was benchmarked

The high-level operation of the benchmark is:

  1. Setup: Initialize a table testdata with 10000 rows of data
  2. Measure the time it takes to query the first column of all these rows.

The schema consists of a single testdata table, defined as:

CREATE TABLE testdata (id INTEGER PRIMARY KEY, str TEXT, t TIMESTAMP)

See db-bench/db-util for more details.
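
The real setup code lives in db-bench/db-util; as a rough sketch of what the population step might look like with sqlite-simple (populateTestData is a hypothetical name, not the actual db-util code):

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (forM_)
import Data.Time (getCurrentTime)
import qualified Database.SQLite.Simple as S

-- Hypothetical sketch of the setup step: create testdata and insert
-- 10000 rows of data.
populateTestData :: S.Connection -> IO ()
populateTestData conn = do
  S.execute_ conn
    "CREATE TABLE testdata (id INTEGER PRIMARY KEY, str TEXT, t TIMESTAMP)"
  now <- getCurrentTime
  forM_ [1 .. 10000 :: Int] $ \i ->
    S.execute conn "INSERT INTO testdata (str, t) VALUES (?,?)"
      ("row " ++ show i :: String, now)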

The sqlite-simple query being measured is defined as:

{-# LANGUAGE OverloadedStrings #-}

import Data.List (foldl')
import qualified Database.SQLite.Simple as S

selectInts :: S.Connection -> IO ()
selectInts conn = do
  rows <- S.query_ conn "SELECT id FROM testdata" :: IO [S.Only Int]
  checksum $ foldl' (\acc (S.Only v) -> v + acc) 0 rows

Basically, it SELECTs all the rows from the testdata table, converts the result into a Haskell list of Ints and, as a sanity check, sums the integers together. (The checksum function is a small db-bench helper.)

Several variants of this function are benchmarked. The main variants for SQLite are:

  • direct-sqlite selectIntsDS: Lowest possible API level - least memory allocated
  • sqlite-simple selectIntsFold: Convert rows into Haskell data but fold over the rows to avoid allocating memory for the whole result set (see the sketch after this list)
  • sqlite-simple selectInts: Convert rows into Haskell list containing all the rows
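
For reference, here's a minimal sketch of what the fold-based variant looks like, using sqlite-simple's fold_ (not the exact db-bench code; checksum is the same helper as above):

-- Sketch of the fold variant: fold_ feeds each row to the accumulator
-- function as it arrives, so the full result set is never alive as a
-- Haskell list.
selectIntsFold :: S.Connection -> IO ()
selectIntsFold conn = do
  total <- S.fold_ conn "SELECT id FROM testdata" 0 $
             \acc (S.Only v) -> return (acc + v :: Int)
  checksum total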

We’ll focus mostly on selectInts when comparing against other implementations.

You can find implementations of the same for various Haskell database libraries under db-bench/haskell. C and Python implementations are under db-bench/native and db-bench/python, respectively.

Establishing performance targets

My benchmarking goal was to figure out how much overhead the sqlite-simple library adds on top of raw SQLite performance. Ideally, a query should spend all its time in native SQLite and zero time in Haskell bindings.

To better focus optimization work, I first set out to establish some reasonable performance targets to compare against. Establishing targets was straightforward: as sqlite-simple runs on top of direct-sqlite, sqlite-simple can only be as fast as direct-sqlite, and as direct-sqlite runs on top of the native SQLite library, neither can possibly be faster than SQLite itself.

To turn this into numbers, I implemented multiple versions of my query benchmark (in order of fastest to slowest):

  • A native C benchmark on top of the SQLite library (source)
  • A Haskell version using direct-sqlite (source, see function selectIntsDS)
  • A Haskell version using sqlite-simple (source, see function selectInts)

Here’s how they perform:

[Chart: Native C vs Haskell bindings (rows/s)]

Results analysis

The collected benchmark data was used to identify various performance improvement opportunities in sqlite-simple.

Original performance without optimizations was just barely over 50K rows/s. This turned out to be a performance bug I had introduced when I forked sqlite-simple from postgresql-simple. The problem was in a function called stepStmt that should’ve been tail recursive but wasn’t. Fixing this in d2b6f6a5 nicely bumped up the score from 53K to 750K rows/s.
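
To illustrate the general shape of this kind of bug (hypothetical code, not the actual stepStmt), compare a loop whose recursive call has work pending after it with one where the recursion is a genuine tail call:

-- Hypothetical illustration only. Here the recursive call is not in
-- tail position: every iteration pushes a pending (>> return ())
-- continuation, so cost and space grow with the row count.
loopBad :: IO Bool -> IO ()
loopBad step = do
  more <- step
  if more
    then loopBad step >> return ()  -- work remains after the call
    else return ()

-- Fixed: the recursive call is the very last thing the function does.
loopGood :: IO Bool -> IO ()
loopGood step = do
  more <- step
  if more then loopGood step else return ()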

The next couple of optimizations were mostly cleanup to reduce the allocation rate. With 3239d474f0 and 0ee050807d, the benchmark score went up from 750K to 764K rows/s.

At this point I ran out of low-hanging fruit in sqlite-simple and started to look elsewhere for optimizations. A low-level direct-sqlite benchmark was clocking in at around 2.43M rows/s, which seemed a little low when a C implementation of the same was processing rows at almost 6.9M rows/s. Even a Python reimplementation of my benchmark case was faster, at 2.5M rows/s. To highlight how large the performance delta between C and direct-sqlite was, it’s helpful to turn the comparison into absolute clock cycles. On my 3.2GHz machine, 6.9M rows/s means SQLite spends roughly 460 clock cycles per row.

Similarly, at 2.43M rows/s with direct-sqlite, each row cost roughly 1300 clock cycles, of which about 460 were spent in the native SQLite library. Somehow some 840 clock cycles per row went into the Haskell SQLite bindings. The overhead of just calling into SQLite from Haskell was higher than the actual cost of computing the result set inside SQLite! Yet there wasn’t much going on in the wrapper library.
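
To spell out the arithmetic: 3.2×10⁹ cycles/s ÷ 6.9×10⁶ rows/s ≈ 460 cycles per row for the C benchmark, 3.2×10⁹ ÷ 2.43×10⁶ ≈ 1300 for direct-sqlite, and 1300 − 460 = 840 cycles per row of binding overhead.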

Consider the inner loop of the direct-sqlite benchmark:

{-# LANGUAGE BangPatterns #-}

import Data.Int (Int64)
import qualified Database.SQLite3 as DS

-- Step through the result set, summing column 0 into a strict
-- accumulator.
sumRowsColumnInt64 :: DS.Statement -> Int64 -> IO Int64
sumRowsColumnInt64 stmt !acc = do
  r <- DS.step stmt
  case r of
    DS.Row -> do
      i <- DS.columnInt64 stmt 0
      sumRowsColumnInt64 stmt (acc + i)
    DS.Done ->
      return acc

The direct-sqlite calls DS.step and DS.columnInt64 map quite directly to their native SQLite counterparts sqlite3_step and sqlite3_column_int64. Thus the expectation is that their cost should be roughly the same as in the C version of this benchmark.

No matter how bad a compiler you might have, there’s no way the simple Haskell code around sqlite3_step and sqlite3_column_int64 would add up to 840 clocks per row.

Turns out FFI call overhead dominated this benchmark. It was possible to reduce this overhead by using the unsafe FFI calling convention for some of the SQLite column accessors. This change was made in direct-sqlite pull request #20. As the name unsafe implies, it’s not OK to simply mark all FFI functions as unsafe – please refer to the pull request discussion for details. The tl;dr of the FFI discussion was that it’s OK to mark result data accessors like sqlite3_column_int64 and sqlite3_column_count as unsafe but that any functions doing the actual query heavy lifting like sqlite3_step should be kept safe.
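
To make the distinction concrete, here's a hedged sketch of what the two conventions look like at the binding level (the names c_step, c_columnInt64 and CStmt are made up for this example; see the direct-sqlite source for the real declarations):

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.Ptr (Ptr)
import Foreign.C.Types (CInt(..), CLLong(..))

-- Opaque stand-in for sqlite3_stmt (hypothetical name).
data CStmt

-- sqlite3_step does the actual query work and can run for a long
-- time, so it keeps the default "safe" convention.
foreign import ccall "sqlite3_step"
  c_step :: Ptr CStmt -> IO CInt

-- A cheap result accessor: "unsafe" skips the bookkeeping GHC does
-- around safe calls, a big win for calls that return almost instantly.
foreign import ccall unsafe "sqlite3_column_int64"
  c_columnInt64 :: Ptr CStmt -> CInt -> IO CLLong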

The FFI optimizations were a big deal: the direct-sqlite benchmark score went up from 2.43M to 3.6M rows/s. As expected, this also bumped the sqlite-simple score up to 1.8M rows/s.

The following chart summarizes how performance progressed through the above optimizations:

[Chart: Optimization progress (rows/s)]

Comparing to other implementations and databases

As an extra bonus, I also compared my results to other databases and to other Haskell database bindings. Furthermore, I wanted to know how well Python fares with its built-in sqlite3 module. I thus implemented a few more variants of my benchmark:

[Chart: Database/library comparison (rows/s)]

I didn’t do any performance analysis on the non-SQLite benchmark results, and wouldn’t draw too many conclusions from them. These results do seem to confirm, though, that the HDBC library adds fairly significant overhead to database row access: comparing HDBC against sqlite-simple, we get 105K vs 1.8M rows/s.
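
For context, here's a rough sketch of what the same query looks like via HDBC (not the exact db-bench code; "test.db" is an assumed file name):

import Database.HDBC (fromSql, quickQuery', disconnect)
import Database.HDBC.Sqlite3 (connectSqlite3)

-- Sketch of an HDBC variant: every cell comes back as a SqlValue that
-- has to be converted per row, which adds per-row overhead on top of
-- the underlying SQLite calls.
selectIntsHDBC :: IO Int
selectIntsHDBC = do
  conn <- connectSqlite3 "test.db"  -- assumed database file name
  rows <- quickQuery' conn "SELECT id FROM testdata" []
  disconnect conn
  return $ sum [ fromSql v :: Int | [v] <- rows ]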

Next steps

There are still many areas that would need further analysis:

  • Use more columns in the result set
  • Use other column types than Ints
  • Per-query overhead

I expect low per-query overhead to be particularly important for web applications. Typical usage in a web app is that a single page load performs several queries, each returning a relatively low number of rows.

Thanks to Emmanuel Surleau, Irene Knapp and Joey Adams for their optimization ideas, contributions and code reviews.

Until next time.