The Structured Query Language and SQL Console
Motivation
Now that we have a good description for our data, we are going to talk about the predicates on that data.
Using the full predicate calculus notation would be exhausting and absurd. To make things simpler, we have a set of concise statements that perform common tasks.
But first, let's get ourselves an actual tool:
SQLite
SQLite is an attempt at making a size-minimal, (mostly) full-featured SQL implementation. As a strange consequence of its small filesize, it has turned out to be:
- remarkably fast - often far faster than server-based competitors on small data
- easy to use, allowing for storing your data in a simple file with no server config
- portable, able to embed on nearly any platform without an install
- poor at scaling once a database grows past a few gigabytes (10-figure byte counts)
- completely unable to handle datasets larger than a single file can hold - the lesser of your filesystem's maximum file size and SQLite's own limit (roughly 140TB)
For anything which requires distributed databases, or advanced concurrency features, you need to go to more complex solutions.
But for pretty much all small-user-count, single-storage projects, SQLite is the way to go.
A quick search of my operating system for .sqlite (the most common extension):
find /c -name "*.sqlite" | wc -l
turned up 843 examples (and another 155 for .db), of which only a few dozen were directly made by me. They ranged from my hardware vendor's utilities to complex 3d modeling software and mobile development tools, but most belonged to web browsers.
SQLite is so small (just over 1.6MB at time of writing) that many applications - rather than relying on install chains and dynamic linking - simply bundle an appropriate copy. If you are sharing a dataset of any real size, the cost of including a sqlite3 executable is negligible; find suggests I have at least a dozen copies littered about my ProgramFiles and AppData directories by various software.
The Structured Query Language
The evolving specification for relational data has come to be known as SQL, the Structured Query Language. The specification isn't as faithfully implemented as most we have talked about - like TeX or Python - so scripts written for one implementation likely won't work on others. In addition, several functions are left mathematically ambiguous - which leads to inconsistency between implementations.
However, the core functions remain largely the same. For today, we are going to be exploring the table example from yesterday.
Schemas
It can help to know what the table design is. This is - frustratingly - outside the SQL spec, and every implementation does it a little differently.
In sqlite, you can get it quickly with the .schema directive. For more of these directives, you can run .help.
You may want to store some of the output to a file, or read it with a tool like less.
For that, we can use the sqlite3 command line syntax:
sqlite3 bibliography.sqlite ".help" | less
Statements
SQL syntax involves various statements - like any formal language.
SELECT
SELECT is the statement used to return data. It operates on a Table, and returns rows - in the theory, the tuples of a relation:
SELECT * FROM Table;
Often, you will want just some of the data:
SELECT Column1,Column2 FROM Table;
WHERE
For filtering, we can use the WHERE clause:
SELECT * FROM Table WHERE Column="Value";
Note the quotation marks - that is how you denote an explicit value. (Strictly, standard SQL uses single quotes 'Value' for literals and reserves double quotes for identifiers; SQLite forgivingly accepts both.) You can also match against other column values.
As our queries get more complex, we will want to spread them out across multiple lines. SQL isn't whitespace sensitive, and only evaluates at the ;, so we are free - and encouraged - to move statements onto new lines.
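Everything in this lesson can also be driven from Python's built-in sqlite3 module, which we will lean on tomorrow. A minimal sketch, using a made-up in-memory table so it is self-contained - note how the multi-line SELECT ... WHERE reads the same as at the console:

```python
import sqlite3

# An in-memory database, so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Articles (article_id TEXT PRIMARY KEY, title TEXT, journal TEXT)")
conn.execute("INSERT INTO Articles VALUES ('a1', 'First Paper', 'J1')")
conn.execute("INSERT INTO Articles VALUES ('a2', 'Second Paper', 'J2')")

# A multi-line SELECT ... WHERE; whitespace is free.
rows = conn.execute("""
    SELECT title
    FROM Articles
    WHERE journal = 'J1'
""").fetchall()
# rows is a list of tuples, one per matching row
```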
Logical Operators
To model the predicate calculus, we need more than matching equality; we need logical operators.
SQL gives us a variety; far more than we need to define a (finite) first-order logical system.
For example, AND, NOT, OR, and parentheses:
SELECT *
FROM Table
WHERE Column1="Value1"
AND NOT (
Column2="Value2"
OR
Column2="Value3"
);
And even a shorthand for comparing to a list:
SELECT *
FROM Table
WHERE Column1="Value1"
AND NOT Column2
IN ("Value2", "Value3");
Do not trust operator precedence. It differs from implementation to implementation.
For now, parentheses are your friends.
Subqueries
Our predicate calculus wouldn't be complete without the ability to make broad, sweeping statements. For that, we have the ability to construct subqueries.
They are queries within queries. For example, we could get all the collaborators of a given author:
SELECT DISTINCT author_name
FROM ArticleAuthors
WHERE article_id
IN (
SELECT article_id
FROM ArticleAuthors
WHERE author_name="Pomerance, Carl"
)
AND NOT author_name="Pomerance, Carl";
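To see the subquery in action without the full bibliography, here is a sketch against a toy ArticleAuthors table (the co-author names here are made-up sample rows, not real data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ArticleAuthors (article_id TEXT, author_name TEXT)")
conn.executemany(
    "INSERT INTO ArticleAuthors VALUES (?, ?)",
    [("p1", "Pomerance, Carl"), ("p1", "Crandall, Richard"),
     ("p2", "Pomerance, Carl"), ("p2", "Selfridge, John L"),
     ("p3", "Other, Author")],
)

# The inner query finds Pomerance's articles; the outer query
# finds everyone else named on those articles.
collaborators = conn.execute("""
    SELECT DISTINCT author_name
    FROM ArticleAuthors
    WHERE article_id IN (
        SELECT article_id
        FROM ArticleAuthors
        WHERE author_name = 'Pomerance, Carl'
    )
    AND NOT author_name = 'Pomerance, Carl'
""").fetchall()
```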
And now we have enough SQL to state our predicates, and - as an added bonus - read out matching columns!
INSERT INTO
One thing not particularly covered by the basic math, but of great importance, is modifying the data.
For that, we have INSERT INTO.
Suppose I want to add a new paper into the list. For example, this paper (in BibTeX format):
@article{baillie1980lucas,
title={Lucas pseudoprimes},
author={Baillie, Robert and Wagstaff, Samuel S},
journal={Mathematics of Computation},
volume={35},
number={152},
pages={1391--1417},
year={1980}
}
We can start by adding the article:
INSERT INTO Articles(
article_id,
title,
journal
)
VALUES
(
"baillie1980lucas",
"Lucas Pseudoprimes",
"Mathematics of Computation"
);
We can even insert multiple rows:
INSERT INTO ArticleAuthors(
article_id,
author_name,
author_number
)
VALUES
(
"baillie1980lucas",
"Baillie, Robert",
"1"
),
(
"baillie1980lucas",
"Wagstaff, Samuel S",
"2"
)
;
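From Python, the safer way to write these inserts is with ? placeholders rather than pasting values into the query string - sqlite3 handles the quoting, and executemany covers the multi-row case. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ArticleAuthors (
    article_id TEXT,
    author_name TEXT,
    author_number INTEGER DEFAULT 1,
    PRIMARY KEY (author_name, article_id))""")

# ? placeholders let sqlite3 quote the values for us;
# executemany inserts several rows in one call.
conn.executemany(
    "INSERT INTO ArticleAuthors(article_id, author_name, author_number) VALUES (?, ?, ?)",
    [("baillie1980lucas", "Baillie, Robert", 1),
     ("baillie1980lucas", "Wagstaff, Samuel S", 2)],
)
n = conn.execute("SELECT COUNT(*) FROM ArticleAuthors").fetchone()[0]
```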
UPDATE Table SET Column=Value WHERE
Let's say we made a mistake - perhaps a simple one, like misspelling the title.
For that, we can UPDATE our values:
UPDATE Articles
SET title="Lucas pseudoprimes"
WHERE article_id="baillie1980lucas";
Note that the WHERE is technically optional; if omitted, the update sets that column to the new value in every row of the table. This is virtually never what you want, and a good argument for backing up your .sqlite file somehow (such as with git).
DELETE FROM
Let's say I regret adding that paper.
I can delete it with:
DELETE FROM Articles
WHERE article_id="baillie1980lucas";
DELETE FROM ArticleAuthors
WHERE article_id="baillie1980lucas";
Note that the WHERE is still technically optional here, but omitting it deletes every row in the table - serious loss of data.
Joins
The predicate calculus largely concerns itself with single-table representations - though with possible Null values.
Joins are how we achieve single tables from our multitude. For example, to get a list of articles by one author, we can do a JOIN:
SELECT author_name, title FROM ArticleAuthors JOIN Articles ON ArticleAuthors.article_id=Articles.article_id;
If the column names match up, there is a nice shorthand - NATURAL JOIN finds the shared columns itself, so it takes no ON clause:
SELECT author_name, title FROM ArticleAuthors NATURAL JOIN Articles;
There are many geometries of Joins, but most of the joins you will do - key-to-key matching - are Inner Joins.
Another common type of join is the Left Join, which keeps every row of the left table - filling in Null where the right table has no match.
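A sketch of that Null-filling behavior, with a made-up article that has no author row - from Python, the missing value comes back as None:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (article_id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE ArticleAuthors (article_id TEXT, author_name TEXT);
    INSERT INTO Articles VALUES ('a1', 'Matched Paper');
    INSERT INTO Articles VALUES ('a2', 'Orphan Paper');
    INSERT INTO ArticleAuthors VALUES ('a1', 'Some, Author');
""")

# LEFT JOIN keeps every Articles row; the orphan gets Null (None) for its author.
rows = conn.execute("""
    SELECT title, author_name
    FROM Articles
    LEFT JOIN ArticleAuthors
    ON Articles.article_id = ArticleAuthors.article_id
    ORDER BY title
""").fetchall()
```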
Aggregators
Often, complicated queries involve arithmetic. You can perform simple arithmetic with the standard operators, but for more complex operations, SQL provides aggregate functions such as COUNT, SUM, MIN, MAX, and GROUP_CONCAT.
By default, they collapse everything in a query:
SELECT COUNT(DISTINCT author_name) FROM ArticleAuthors;
But these make the final piece of the puzzle for returning to our original table, if you prefer to look at it:
SELECT group_concat(author_name, " and "),
title,
Journals.publisher,
Articles.journal
FROM ArticleAuthors
JOIN Articles
ON ArticleAuthors.article_id=Articles.article_id
JOIN Journals
ON Articles.journal=Journals.journal
GROUP BY Articles.article_id
ORDER BY author_number;
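If the big query is hard to see through, here is GROUP BY in miniature on toy data - one COUNT per author rather than one COUNT overall:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ArticleAuthors (article_id TEXT, author_name TEXT)")
conn.executemany("INSERT INTO ArticleAuthors VALUES (?, ?)",
                 [("p1", "A"), ("p2", "A"), ("p2", "B")])

# GROUP BY splits the aggregation into one row per author.
counts = conn.execute("""
    SELECT author_name, COUNT(*)
    FROM ArticleAuthors
    GROUP BY author_name
    ORDER BY author_name
""").fetchall()
```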
Modifying the Schema
Create Table
The CREATE TABLE
directive does exactly what it says - creates a table.
You have already seen examples of it in the schema:
CREATE TABLE Journals (
journal TEXT PRIMARY KEY,
publisher TEXT
);
CREATE TABLE Articles (
article_id TEXT PRIMARY KEY,
title TEXT,
journal TEXT,
FOREIGN KEY (journal) REFERENCES Journals(journal)
);
CREATE TABLE ArticleAuthors (
article_id TEXT,
author_name TEXT,
author_number INTEGER DEFAULT 1,
PRIMARY KEY (author_name, article_id),
FOREIGN KEY(article_id) REFERENCES Articles(article_id)
);
Note that we can tell it our keys and foreign keys - and where they come from.
This is not strictly necessary, and databases you are investigating might not do it - but it can really help when reading about a database, or when validating it.
For example, every FOREIGN KEY value must match an existing value in the column it references.
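One SQLite quirk worth knowing: foreign keys are recorded in the schema but not enforced unless you turn enforcement on, per connection, with PRAGMA foreign_keys. A sketch of the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is OFF by default in SQLite
conn.executescript("""
    CREATE TABLE Journals (journal TEXT PRIMARY KEY, publisher TEXT);
    CREATE TABLE Articles (
        article_id TEXT PRIMARY KEY,
        title TEXT,
        journal TEXT,
        FOREIGN KEY (journal) REFERENCES Journals(journal)
    );
""")

# With the pragma on, an article pointing at a nonexistent journal fails loudly.
try:
    conn.execute("INSERT INTO Articles VALUES ('x', 'T', 'No Such Journal')")
    violated = False
except sqlite3.IntegrityError:
    violated = True
```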
Modifying Tables
There is an ALTER TABLE directive, and it is fairly versatile, but it produces a strange .schema and struggles with some modifications.
For small data, it is often easier to trash the old table and start over, copying the data across:
ALTER TABLE ArticleAuthors RENAME TO oldArticleAuthors;
CREATE TABLE ArticleAuthors (
article_id TEXT,
author_name TEXT,
author_number INTEGER DEFAULT 1,
PRIMARY KEY (author_name, article_id),
FOREIGN KEY(article_id) REFERENCES Articles(article_id)
);
INSERT INTO ArticleAuthors(
article_id,
author_name,
author_number
)
SELECT article_id, author_name, author_number
FROM oldArticleAuthors;
DROP TABLE oldArticleAuthors;
I find it useful to write my schemas in separate files, and initialize a database using the command-line interface (or, as we will see, the Python library).
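From Python, a whole schema file can be run in one call with executescript. A sketch - in practice the string would come from your schema file (say, open("bibliography.schema").read(); the filename is just an illustration), but it is inlined here to keep the example self-contained:

```python
import sqlite3

# In practice, read this from a schema file instead of inlining it.
schema = """
CREATE TABLE Journals (
    journal TEXT PRIMARY KEY,
    publisher TEXT
);
CREATE TABLE Articles (
    article_id TEXT PRIMARY KEY,
    title TEXT,
    journal TEXT,
    FOREIGN KEY (journal) REFERENCES Journals(journal)
);
"""

conn = sqlite3.connect(":memory:")  # or a real .sqlite file path
conn.executescript(schema)          # runs every statement in the script

# sqlite_master is SQLite's own catalog of what exists in the database.
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
```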
Assignment
Play around with the data, and if you want, with other examples from across the internet. W3Schools, opened in a WebSQL-compliant browser (non-Firefox), gives you a sqlite3 session in the browser with an editable input and prettyprinted output.
Your proper assignment, however, is to:
Design a (3NF compliant) schema to store the Student Data in SQLite syntax. Save it as gradebooks.schema, and upload the file to brightspace - we will need it tomorrow as we get into using Python to interface with the data.
This is going to be the most complicated schema yet - so I will offer some advice.
- You should use your python file reading skills to get your column names, and python string manipulation skills to turn them into a schema. Don't try to write all the students.csv columns by hand! Write them by computer, and tweak them by hand.
- Storing the different gradebooks is going to be an adventure - with their different numbers of HWs, Quizzes, etc.
- You can have a table for assignment descriptions - with assignment names (like "HW1"), assignment type (like "Quiz"), maximum values, and course id.
- You can then many-many join students to that table, to record their score in each assignment.
- You can discard the final_grade column. Externally computed columns are almost always lies.