Use the NumPy/Pandas design pattern of moving core data processing to well-tested and optimized C code
Available since Python 2.5 (released 2006)
Implementing a query planner, optimizer, executor and B-tree index is hard. Use the experts' code!
SQLite's features set a high bar
Consider using it as your data's complexity goes beyond Python lists and dicts
Keep using it until you need features, concurrency or speed it doesn't have
| | Default (old) | New |
|---|---|---|
| `PRAGMA journal_mode=` | `DELETE` | `WAL` |
| Concurrent reader processes | ✔ | ✔ |
| Concurrent writer processes | ✘ | ✘ |
| Reading while writing | ✘ | ✔ |
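A minimal sketch of switching a database to WAL from Python (the file name is just an example; `journal_mode` is persistent, so one statement changes the file for all future connections):

```python
import sqlite3

conn = sqlite3.connect("db.sqlite3")  # hypothetical file name
# The PRAGMA returns the mode now in effect; "wal" on success.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)
conn.close()
```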
Learn the common subset of SQL and relational databases
What you learn with SQLite can be used on future projects, even with different databases
Protects you from production failure modes like bad data and system crashes
Speeds up iterative development
SQLite is usable from every language
Distributing data is as easy as copying a file
The author lives in CLT!
```python
import sqlite3

conn = sqlite3.connect("db.sqlite3")
cur = conn.cursor()
with conn:
    cur.execute("""
        SELECT actor_id, actor_name
        FROM actor
        WHERE actor_type = ?""", ("Stage",))
    for actor_id, actor_name in cur:
        print(f"{actor_name} ({actor_id})")
```
`Connection.row_factory`: implement your own constructor for query results
`sqlite3.Row` implements a dict-like interface
A NamedTuple factory isn't hard to implement
Anyone want to contribute a dataclass factory to the standard library?
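A minimal NamedTuple row factory sketch (the `actor` table here is a made-up example; the factory itself only relies on `cursor.description`):

```python
import sqlite3
from collections import namedtuple

def namedtuple_factory(cursor, row):
    # Build a NamedTuple class from the query's column names.
    fields = [col[0] for col in cursor.description]
    Row = namedtuple("Row", fields)
    return Row(*row)

conn = sqlite3.connect(":memory:")
conn.row_factory = namedtuple_factory
conn.execute("CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT)")
conn.execute("INSERT INTO actor (actor_name) VALUES ('Ada')")
row = conn.execute("SELECT actor_id, actor_name FROM actor").fetchone()
print(row.actor_name)  # attribute access instead of index access
```

Creating the class per row is wasteful; a production version would cache it keyed on the field names.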
`register_adapter`: store Python objects in the database
`register_converter`: re-hydrate Python objects from the database
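A round-trip sketch using `Decimal` (the table and column names are illustrative; converters only fire when the connection is opened with `detect_types`):

```python
import sqlite3
from decimal import Decimal

# Store Decimal exactly as text; re-hydrate it on the way out.
sqlite3.register_adapter(Decimal, str)
sqlite3.register_converter("DECIMAL", lambda b: Decimal(b.decode()))

# PARSE_DECLTYPES applies converters based on the declared column type.
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE prices (amount DECIMAL)")
conn.execute("INSERT INTO prices VALUES (?)", (Decimal("19.99"),))
value = conn.execute("SELECT amount FROM prices").fetchone()[0]
print(type(value), value)  # <class 'decimal.Decimal'> 19.99
```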
Not an ORM
Memoizing the adapter and converter functions is often a speed and memory win in ETL
functools.lru_cache
You can't mutate cached values but that typically isn't a problem
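A sketch of the memoization idea with a hypothetical converter function: repeated byte strings hit the cache instead of being re-parsed, and duplicates share one object in memory.

```python
import functools
from decimal import Decimal

@functools.lru_cache(maxsize=None)
def to_decimal(raw: bytes) -> Decimal:
    # Parsing happens once per distinct input; later calls are dict lookups.
    return Decimal(raw.decode())

a = to_decimal(b"19.99")
b = to_decimal(b"19.99")
print(a is b)  # True: the cached object is reused, so don't mutate it
```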
Candidates include `datetime`, JSON, `Decimal`
`datetime.datetime` and `date` have a default `.isoformat()` serialization
2012-12-12T12:12:12.123456
Sensible lexicographical ordering and binning (`LIKE '2019-01-01%'`)
Supported by SQLite's (few) datetime functions
Can't do math on a string
Space inefficient
`strptime()` is slow
Example for adapting `datetime` to an integer (UNIX timestamp) is in the docs
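A sketch of that integer approach, with made-up table and type names (storing aware datetimes as UNIX timestamps, converting back on read):

```python
import sqlite3
from datetime import datetime, timezone

# Adapter: datetime -> int seconds. Converter: bytes -> aware UTC datetime.
sqlite3.register_adapter(datetime, lambda dt: int(dt.timestamp()))
sqlite3.register_converter(
    "UNIXTIME", lambda b: datetime.fromtimestamp(int(b), tz=timezone.utc)
)

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE events (created UNIXTIME)")
now = datetime(2019, 1, 1, tzinfo=timezone.utc)
conn.execute("INSERT INTO events VALUES (?)", (now,))
fetched = conn.execute("SELECT created FROM events").fetchone()[0]
print(fetched)  # 2019-01-01 00:00:00+00:00
```

Sub-second precision is lost by `int()`; that's the trade for compact, math-friendly storage.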
Affinity is how data is stored on disk
Any kind of value can be stored in any column
Column types are just a suggestion on what affinity to use
```sql
CREATE TABLE keys (
    key_id INTEGER PRIMARY KEY,
    email TEXT,
    pubkey SSH_PUBLIC_KEY,
    privkey SSH_PRIVATE_KEY
);
```
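A quick demonstration of "any value in any column" (toy table; `typeof()` reports how the value was actually stored):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
# INTEGER is only an affinity: a non-numeric string is stored as text.
conn.execute("INSERT INTO t VALUES (?)", ("not a number",))
value, kind = conn.execute("SELECT n, typeof(n) FROM t").fetchone()
print(value, kind)  # not a number text
```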
Table data is kept in a B-tree
The key for a table tree is the rowid
Use `INTEGER PRIMARY KEY` whenever possible
When no indices are available to aid the evaluation of a query, SQLite might create an automatic index that lasts only for the duration of a single SQL statement. [...]
The query planner can create a temporary index
This can be disabled
Read the logs to find index suggestions
[...] Since the cost of constructing the automatic index is $O(N\log{}N)$ (where $N$ is the number of entries in the table) and the cost of doing a full table scan is only $O(N)$, an automatic index will only be created if SQLite expects that the lookup will be run more than $\log{}N$ times during the course of the SQL statement.
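You can watch the planner make this decision with `EXPLAIN QUERY PLAN`. A sketch with two toy, un-indexed tables (whether the automatic index actually appears depends on the SQLite version and table sizes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x)")
conn.execute("CREATE TABLE b (x)")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1000)])

# Each plan row's last column is a human-readable step description;
# look for "AUTOMATIC COVERING INDEX" in the join step.
plan = list(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM a JOIN b ON a.x = b.x"
))
for row in plan:
    print(row[3])
```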
| Big O | Step | $a=10^5$, $b=10^5$ | $a=50$, $b=10^5$ | $a=3$, $b=10^5$ |
|---|---|---|---|---|
| $O(ab)$ | Loop | $10^{10}$ | $10^{6.7}$ | $10^{5.5}$ |

| Big O | Step | $a=10^5$, $b=10^5$ | $a=50$, $b=10^5$ | $a=3$, $b=10^5$ |
|---|---|---|---|---|
| $O(b\log{}b)$ | Index creation | $10^{5.7}$ | $10^{5.7}$ | $10^{5.7}$ |
| $O(a\log{}b)$ | Index join | $10^{5.7}$ | $250$ | $15$ |
| | Sum | $10^{6.0}$ | $10^{5.7}$ | $10^{5.7}$ |

| Join method | $a=10^5$, $b=10^5$ | $a=50$, $b=10^5$ | $a=3$, $b=10^5$ |
|---|---|---|---|
| Nested for loop | $10^{10}$ | $10^{6.7}$ | $10^{5.5}$ |
| Automatic index | $10^{6.0}$ | $10^{5.7}$ | $10^{5.7}$ |
| Automatic index faster? | ✔ | ✔ | ✘ |
Up-to-date table statistics help
The query planner tries different join orders, which helps
This is the SQLite version of a hash join
ETL: Collect and clean up data
Output: Excel
Memoization
SQL Server: merge joins and clustered indexes
SQLite: nested for loop joins, materializing joins and subqueries
A real programming language, in source control
Reliable automation
Many hours faster even if individual parts were slower
openpyxl replaced fragile Excel APIs
ETL: data collection, parsing, stats
Flask app: runs against the 3NF database
events.sqlite3: 34GB, 67MM rows of gzipped JSON, 15 minutes to SELECT COUNT(*)
db.sqlite3: 1.2GB, 3NF integers and strings
Master API server opens 3NF database
Worker API servers open events.sqlite3
Workers POST their results back to the master server
JSON and gzip decoding is embarrassingly parallel
Primary keys need to be looked up centrally
Heavy cache rates in the workers and bulk inserts
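A sketch of the bulk-insert side of that pattern (table and column names are invented): accumulate a batch in the worker, then write it in a single transaction with `executemany`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (event_id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(f"payload-{i}",) for i in range(10_000)]
with conn:  # one transaction for the whole batch, not one per row
    conn.executemany("INSERT INTO event (payload) VALUES (?)", rows)

count = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(count)  # 10000
```

Batching matters because each transaction is an fsync; one commit per batch amortizes that cost across all the rows.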