Use the NumPy/Pandas design pattern of moving core data processing to well-tested and optimized C code
Available since Python 2.5 (released 2006)
Implementing a query planner, optimizer, executor and B-tree index is hard. Use the experts' code!
SQLite's features set a high bar
Consider using it as your data's complexity goes beyond Python lists and dicts
Keep using it until you need features, concurrency or speed it doesn't have
| | Default (old) | New |
|---|---|---|
| `PRAGMA journal_mode=` | `DELETE` | `WAL` |
| Concurrent reader processes | ✔ | ✔ |
| Concurrent writer processes | ✘ | ✘ |
| Reading while writing | ✘ | ✔ |
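A minimal sketch of switching a database to WAL from Python (the file name is just an example; `journal_mode` is persistent, so one statement changes the file for all future connections):

```python
import sqlite3

conn = sqlite3.connect("db.sqlite3")  # hypothetical file name
# The PRAGMA returns the mode now in effect; "wal" on success.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)
conn.close()
```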
Learn the common subset of SQL and relational databases
What you learn with SQLite can be used on future projects, even with different databases
Protects you from production failure modes like bad data and system crashes
Speeds up iterative development
SQLite is usable from every language
Distributing data is as easy as copying a file
The author lives in CLT!
```python
import sqlite3

conn = sqlite3.connect("db.sqlite3")
cur = conn.cursor()
with conn:
    cur.execute("""
        SELECT actor_id, actor_name
        FROM actor
        WHERE actor_type = ?""", ("Stage",))
    for actor_id, actor_name in cur:
        print(f"{actor_name} ({actor_id})")
```
`Connection.row_factory`: implement your own constructor for query results
`sqlite3.Row` implements a dict-like interface
A NamedTuple factory isn't hard to implement
Anyone want to contribute a dataclass factory to the standard library?
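A minimal NamedTuple row factory sketch (the `actor` table here is a made-up example; the factory itself only relies on `cursor.description`):

```python
import sqlite3
from collections import namedtuple

def namedtuple_factory(cursor, row):
    # Build a NamedTuple class from the query's column names.
    fields = [col[0] for col in cursor.description]
    Row = namedtuple("Row", fields)
    return Row(*row)

conn = sqlite3.connect(":memory:")
conn.row_factory = namedtuple_factory
conn.execute("CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT)")
conn.execute("INSERT INTO actor (actor_name) VALUES ('Ada')")
row = conn.execute("SELECT actor_id, actor_name FROM actor").fetchone()
print(row.actor_name)  # attribute access instead of index access
```

Creating the class per row is wasteful; a production version would cache it keyed on the field names.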
`register_adapter`: store Python objects in the database
`register_converter`: re-hydrate Python objects from the database
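A round-trip sketch using `Decimal` (the table and column names are illustrative; converters only fire when the connection is opened with `detect_types`):

```python
import sqlite3
from decimal import Decimal

# Store Decimal exactly as text; re-hydrate it on the way out.
sqlite3.register_adapter(Decimal, str)
sqlite3.register_converter("DECIMAL", lambda b: Decimal(b.decode()))

# PARSE_DECLTYPES applies converters based on the declared column type.
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE prices (amount DECIMAL)")
conn.execute("INSERT INTO prices VALUES (?)", (Decimal("19.99"),))
value = conn.execute("SELECT amount FROM prices").fetchone()[0]
print(type(value), value)  # <class 'decimal.Decimal'> 19.99
```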
Not an ORM
Memoizing the adapter and converter functions is often a speed and memory win in ETL
functools.lru_cache
You can't mutate cached values but that typically isn't a problem
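A sketch of the memoization idea with a hypothetical converter function: repeated byte strings hit the cache instead of being re-parsed, and duplicates share one object in memory.

```python
import functools
from decimal import Decimal

@functools.lru_cache(maxsize=None)
def to_decimal(raw: bytes) -> Decimal:
    # Parsing happens once per distinct input; later calls are dict lookups.
    return Decimal(raw.decode())

a = to_decimal(b"19.99")
b = to_decimal(b"19.99")
print(a is b)  # True: the cached object is reused, so don't mutate it
```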
Candidates include `datetime`, JSON, `Decimal`
`datetime.datetime` and `date` have a default `.isoformat()` serialization
2012-12-12T12:12:12.123456
Sensible lexicographical ordering and binning (`LIKE '2019-01-01%'`)
Supported by SQLite's (few) datetime functions
Can't do math on a string
Space inefficient
`strptime()` is slow
Example for adapting `datetime` to an integer (UNIX timestamp) is in the docs
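A sketch of that integer approach, with made-up table and type names (storing aware datetimes as UNIX timestamps, converting back on read):

```python
import sqlite3
from datetime import datetime, timezone

# Adapter: datetime -> int seconds. Converter: bytes -> aware UTC datetime.
sqlite3.register_adapter(datetime, lambda dt: int(dt.timestamp()))
sqlite3.register_converter(
    "UNIXTIME", lambda b: datetime.fromtimestamp(int(b), tz=timezone.utc)
)

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE events (created UNIXTIME)")
now = datetime(2019, 1, 1, tzinfo=timezone.utc)
conn.execute("INSERT INTO events VALUES (?)", (now,))
fetched = conn.execute("SELECT created FROM events").fetchone()[0]
print(fetched)  # 2019-01-01 00:00:00+00:00
```

Sub-second precision is lost by `int()`; that's the trade for compact, math-friendly storage.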
Affinity is how data is stored on disk
Any kind of value can be stored in any column
Column types are just a suggestion on what affinity to use
```sql
CREATE TABLE keys (
    key_id INTEGER PRIMARY KEY,
    email TEXT,
    pubkey SSH_PUBLIC_KEY,
    privkey SSH_PRIVATE_KEY
);
```
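A quick demonstration of "any value in any column" (toy table; `typeof()` reports how the value was actually stored):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
# INTEGER is only an affinity: a non-numeric string is stored as text.
conn.execute("INSERT INTO t VALUES (?)", ("not a number",))
value, kind = conn.execute("SELECT n, typeof(n) FROM t").fetchone()
print(value, kind)  # not a number text
```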
Table data is kept in a B-tree
The key for a table tree is the rowid
Use `INTEGER PRIMARY KEY` whenever possible
When no indices are available to aid the evaluation of a query, SQLite might create an automatic index that lasts only for the duration of a single SQL statement. [...]
The query planner can create a temporary index
This can be disabled
Read the logs to find index suggestions
[...] Since the cost of constructing the automatic index is $O(N\log{}N)$ (where $N$ is the number of entries in the table) and the cost of doing a full table scan is only $O(N)$, an automatic index will only be created if SQLite expects that the lookup will be run more than $\log{}N$ times during the course of the SQL statement.
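You can watch the planner make this decision with `EXPLAIN QUERY PLAN`. A sketch with two toy, un-indexed tables (whether the automatic index actually appears depends on the SQLite version and table sizes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x)")
conn.execute("CREATE TABLE b (x)")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1000)])

# Each plan row's last column is a human-readable step description;
# look for "AUTOMATIC COVERING INDEX" in the join step.
plan = list(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM a JOIN b ON a.x = b.x"
))
for row in plan:
    print(row[3])
```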
| Big O | Step | $a=10^5$, $b=10^5$ | $a=50$, $b=10^5$ | $a=3$, $b=10^5$ |
|---|---|---|---|---|
| $O(ab)$ | Loop | $10^{10}$ | $10^{6.7}$ | $10^{5.5}$ |

| Big O | Step | $a=10^5$, $b=10^5$ | $a=50$, $b=10^5$ | $a=3$, $b=10^5$ |
|---|---|---|---|---|
| $O(b\log{}b)$ | Index creation | $10^{5.7}$ | $10^{5.7}$ | $10^{5.7}$ |
| $O(a\log{}b)$ | Index join | $10^{5.7}$ | $250$ | $15$ |
| | Sum | $10^{6.0}$ | $10^{5.7}$ | $10^{5.7}$ |

| Join method | $a=10^5$, $b=10^5$ | $a=50$, $b=10^5$ | $a=3$, $b=10^5$ |
|---|---|---|---|
| Nested for loop | $10^{10}$ | $10^{6.7}$ | $10^{5.5}$ |
| Automatic index | $10^{6.0}$ | $10^{5.7}$ | $10^{5.7}$ |
| Automatic index faster? | ✔ | ✔ | ✘ |
Up-to-date table statistics help
The query planner tries different join orders, which helps
This is the SQLite version of a hash join
ETL: Collect and clean up data
Output: Excel
Memoization
SQL Server: merge joins and clustered indexes
SQLite: nested for loop joins, materializing joins and subqueries
A real programming language, in source control
Reliable automation
Many hours faster even if individual parts were slower
openpyxl replaced fragile Excel APIs
ETL: data collection, parsing, stats
Flask app: runs against the 3NF database
events.sqlite3: 34GB, 67MM rows of gzipped JSON, 15 minutes to SELECT COUNT(*)
db.sqlite3: 1.2GB, 3NF integers and strings
Master API server opens 3NF database
Worker API servers open events.sqlite3
Workers POST their results back to the master server
JSON and gzip decoding is embarrassingly parallel
Primary keys need to be looked up centrally
Heavy cache rates in the workers and bulk inserts
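A sketch of the bulk-insert side of that pattern (table and column names are invented): accumulate a batch in the worker, then write it in a single transaction with `executemany`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (event_id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(f"payload-{i}",) for i in range(10_000)]
with conn:  # one transaction for the whole batch, not one per row
    conn.executemany("INSERT INTO event (payload) VALUES (?)", rows)

count = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(count)  # 10000
```

Batching matters because each transaction is an fsync; one commit per batch amortizes that cost across all the rows.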