Integrating SQL Into your Software

Integrating SQL Into your Software

Video

Motivation

Data analysts can use the SQL interface, on its own, together with csv imports and exports to do everything they need.

However, most software involves layers - only one of which is the database.

Python Database Interfaces

Python contains a standard for database interfaces, laid out in PEP 249.

This means that - while we will be discussing sqlite usage - any other compliant database interface will work equivalently. For example, this common interface can be connected to Pandas.

sqlite3 Module

There are two main objects in a database interface; a connection, which is your hub for the database, which can provide you cursors which run your statements and hold more stateful information.

Connections

Connections are the main place where interfaces are allowed to differ, and sqlite3 is characteristically simple.

A connection requires opening and closing - and as such, is best managed as a context:

import sqlite3

with sqlite3.connect("bibliography.sqlite") as con:
    pass

Cursors

Cursors represent a particular position in the database, and hold the rows - as an iterator, or accessed through methods - when you select.

With Cursors, we are ready to run our SQL commands:

with sqlite3.connect("bibliography.sqlite") as con:
    cur = con.cursor()
    cur.execute("SELECT DISTINCT author_name FROM ArticleAuthors")
    for row in cur:
        print(row)

The execute function runs exactly one SQL operation, so there is no need to add semicolons - and if you want to run multiple SQL operations (for example, to CREATE TABLE in a query) you must split them up between .execute() calls.

Values

We know how to use values in SQL statements, and build queries. You may be compelled to use your python string-formatting and assembling skills to build queries.

Exploits of a Mom XKCD

Don't.

You may have heard the phrase "Sanitize your Database Inputs." This is a more specific instance of a broader piece of advice - be deliberate in what your code will do, even when passed strange inputs.

The Python database interface makes it easy to sanitize your variables; instead of using .format, f-strings, or string concatenation, simply use ? for your values, and pass them in as a tuple:

with sqlite3.connect("bibliography.sqlite") as con:
    cur = con.cursor()
    cur.execute("""
        SELECT title
        FROM ArticleAuthors
        NATURAL JOIN Articles
        WHERE author_name=?
        """,("Wagstaff, Samuel S",))
    for row in cur:
        print(row)

And let the module place them in.

Streamlining SQL with Python

Note that each SELECT on the cursor clears the buffer of rows; if you want to do multiple queries nested, you should extract the data to a table first:

with sqlite3.connect("bibliography.sqlite") as con:
    cur = con.cursor()
    cur.execute("SELECT DISTINCT author_name FROM ArticleAuthors")
    authors = cur.fetchall() # Equivalent to [i for i in cur]
    coauthors = dict()
    for author, in authors:
        # Here's our old fetch-coauthors function!
        cur.execute("""
            SELECT DISTINCT author_name
            FROM ArticleAuthors
            WHERE article_id
            IN (
              SELECT article_id
              FROM ArticleAuthors
              WHERE author_name=?
            )
            AND NOT author_name=?;
        """,(author,author))
        coauthors[author]=(set(i for i, in cur)) #
coauthors

You can construct this information within SQL with a complex query, involving GROUP BYs and whatnot, but it is complex enough to look at. A well-designed and optimized SQL query will almost always be faster, but that comes at a cost of readability and time.

Inserting Data

From this, you should be able to piece together how to insert data. However, it is worth noting that there is an abbreviated - and faster to execute - syntax for running the same operation, but varying the data:

article_id = "baillie1980lucas"
title="Lucas pseudoprimes"
journal="Mathematics of Computation"
authors=("Baillie, Robert","Wagstaff, Samuel S")
with sqlite3.connect("bibliography.sqlite") as con:
    cur = con.cursor()
    cur.execute("""
        INSERT INTO Articles(
          article_id,
          title,
          journal
        )
        VALUES (?,?,?)
        """,(article_id, title, journal))
    article_authors = [(article_id, name, index)
                       for index,name in enumerate(authors,start=1)]
    cur.executemany("""
        INSERT INTO ArticleAuthors(
            article_id,
            author_name,
            author_number
        )
        VALUES (?,?,?)
        """, article_authors)
    con.commit()

Commits

However, as an artefact from more complex database architectures - and from concurrent applications - we must commit our changes, or they might not show up in the database.

After any changes, run con.commit().

SQLite does autocommits - and largely works without this. However, strange things can happen, and it makes your code less portable to other databases with more explicit transaction models.

Note that only one running proccess can edit a SQLite database at a time, and others may throw errors or wait for a close. If you need higher concurrency, you neeed something other than SQLite.

Worksheet

Today's project, which spans both today and Monday, will take you through using a SQL backend, and creating a Python frontend for a Gradebook.