Integrating SQL Into your Software
Video
Motivation
Data analysts can use the SQL interface, on its own, together with csv imports and exports to do everything they need.
However, most software involves layers - only one of which is the database.
Python Database Interfaces
Python contains a standard for database interfaces, laid out in PEP 249.
This means that - while we will be discussing sqlite
usage - any other compliant database interface will work equivalently. For example, this common interface can be connected to Pandas.
sqlite3 Module
There are two main objects in a database interface; a connection, which is your hub for the database, which can provide you cursors which run your statements and hold more stateful information.
Connections
Connections are the main place where interfaces are allowed to differ, and sqlite3 is characteristically simple.
A connection requires opening and closing - and as such, is best managed as a context:
import sqlite3
with sqlite3.connect("bibliography.sqlite") as con:
pass
Cursors
Cursors represent a particular position in the database, and hold the rows - as an iterator, or accessed through methods - when you select.
With Cursors, we are ready to run our SQL commands:
with sqlite3.connect("bibliography.sqlite") as con:
cur = con.cursor()
cur.execute("SELECT DISTINCT author_name FROM ArticleAuthors")
for row in cur:
print(row)
The execute function runs exactly one SQL operation, so there is no need to add semicolons - and if you want to run multiple SQL operations (for example, to CREATE TABLE
in a query) you must split them up between .execute()
calls.
Values
We know how to use values in SQL statements, and build queries. You may be compelled to use your python string-formatting and assembling skills to build queries.
You may have heard the phrase "Sanitize your Database Inputs." This is a more specific instance of a broader piece of advice - be deliberate in what your code will do, even when passed strange inputs.
The Python database interface makes it easy to sanitize your variables; instead of using .format
, f
-strings, or string concatenation, simply use ?
for your values, and pass them in as a tuple:
with sqlite3.connect("bibliography.sqlite") as con:
cur = con.cursor()
cur.execute("""
SELECT title
FROM ArticleAuthors
NATURAL JOIN Articles
WHERE author_name=?
""",("Wagstaff, Samuel S",))
for row in cur:
print(row)
And let the module place them in.
Streamlining SQL with Python
Note that each SELECT
on the cursor clears the buffer of rows; if you want to do multiple queries nested, you should extract the data to a table first:
with sqlite3.connect("bibliography.sqlite") as con:
cur = con.cursor()
cur.execute("SELECT DISTINCT author_name FROM ArticleAuthors")
authors = cur.fetchall() # Equivalent to [i for i in cur]
coauthors = dict()
for author, in authors:
# Here's our old fetch-coauthors function!
cur.execute("""
SELECT DISTINCT author_name
FROM ArticleAuthors
WHERE article_id
IN (
SELECT article_id
FROM ArticleAuthors
WHERE author_name=?
)
AND NOT author_name=?;
""",(author,author))
coauthors[author]=(set(i for i, in cur)) #
coauthors
You can construct this information within SQL with a complex query, involving GROUP BYs and whatnot, but it is complex enough to look at. A well-designed and optimized SQL query will almost always be faster, but that comes at a cost of readability and time.
Inserting Data
From this, you should be able to piece together how to insert data. However, it is worth noting that there is an abbreviated - and faster to execute - syntax for running the same operation, but varying the data:
article_id = "baillie1980lucas"
title="Lucas pseudoprimes"
journal="Mathematics of Computation"
authors=("Baillie, Robert","Wagstaff, Samuel S")
with sqlite3.connect("bibliography.sqlite") as con:
cur = con.cursor()
cur.execute("""
INSERT INTO Articles(
article_id,
title,
journal
)
VALUES (?,?,?)
""",(article_id, title, journal))
article_authors = [(article_id, name, index)
for index,name in enumerate(authors,start=1)]
cur.executemany("""
INSERT INTO ArticleAuthors(
article_id,
author_name,
author_number
)
VALUES (?,?,?)
""", article_authors)
con.commit()
Commits
However, as an artefact from more complex database architectures - and from concurrent applications - we must commit our changes, or they might not show up in the database.
After any changes, run con.commit()
.
SQLite does autocommits - and largely works without this. However, strange things can happen, and it makes your code less portable to other databases with more explicit transaction models.
Note that only one running proccess can edit a SQLite database at a time, and others may throw errors or wait for a close. If you need higher concurrency, you neeed something other than SQLite.
Worksheet
Today's project, which spans both today and Monday, will take you through using a SQL backend, and creating a Python frontend for a Gradebook.