The Structured Query Language and SQL Console
Motivation
Now that we have a good description for our data, we are going to talk about the predicates on that data.
Using the full predicate calculus notation would be exhausting and absurd. To make things simpler, we have a set of concise statements that perform common tasks.
But first, let's get ourselves an actual tool:
SQLite
SQLite is an attempt at making a size-minimal, (mostly) full-featured SQL implementation. As a strange consequence of its small filesize, it has turned out to be:
- remarkably fast - often far faster than server-based competitors on small data
- easy to use, allowing for storing your data in a simple file with no server config
- portable, able to embed on nearly any platform without an install
- poor at scaling once a database grows past a few gigabytes (10-figure byte counts)
- completely unable to handle datasets larger than a single file can hold - the lesser of your filesystem's maximum file size and SQLite's own limit (roughly 140TB)
For anything which requires distributed databases, or advanced concurrency features, you need to go to more complex solutions.
But for pretty much all small-user-count, single-storage projects, SQLite is the way to go.
A quick search of my operating system for .sqlite (the most common extension):
find /c -name "*.sqlite" | wc -l
turned up 843 examples (and another 155 for .db), of which only a few dozen were directly made by me. They ranged from my hardware vendor's utilities to complex 3d modeling software and mobile development tools, but most belonged to web browsers.
SQLite is so small (just over 1.6MB at time of writing) that many applications - rather than relying on install chains and dynamic linking - simply bundle an appropriate copy. If you are sharing a dataset of any real size, the cost of including a sqlite3 executable is negligible; find suggests I have at least a dozen copies littered about my ProgramFiles and AppData directories by various software.
The Structured Query Language
The evolving specification for relational data has come to be known as SQL, the Structured Query Language. The specification isn't as faithfully implemented as most we have talked about - like TeX or Python - so scripts written for one implementation likely won't work on others. In addition, several functions are left mathematically ambiguous - which leads to inconsistency between implementations.
However, the core functions remain largely the same. For today, we are going to be exploring the table example from yesterday.
Schemas
It can help to know what the table design is. This is - frustratingly - outside the SQL spec, and every implementation does it a little differently.
In sqlite, you can get it quickly with the .schema directive. For more of these directives, you can run .help.
You may want to store some of the output to a file, or read it with a tool like less.
For that, we can use the sqlite3 command line syntax:
sqlite3 bibliography.sqlite ".help" | less
Statements
SQL syntax involves various statements - like any formal language.
SELECT
SELECT is the statement used to return data. It operates on a Table, and returns rows - in the theory, the tuples of a relation:
SELECT * FROM Table;
Often, you will want just some of the data:
SELECT Column1,Column2 FROM Table;
WHERE
For filtering, we can use the WHERE clause:
SELECT * FROM Table WHERE Column="Value";
Note the quotation marks - that is how you denote an explicit value. (Strictly, standard SQL uses single quotes 'Value' for literals and reserves double quotes for identifiers; SQLite forgivingly accepts both.) You can also match against other column values.
As our queries get more complex, we will want to spread them out across multiple lines. SQL isn't whitespace sensitive, and only evaluates at the ;, so we are free - and encouraged - to move statements onto new lines.
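Everything in this lesson can also be driven from Python's built-in sqlite3 module, which we will lean on tomorrow. A minimal sketch, using a made-up in-memory table so it is self-contained - note how the multi-line SELECT ... WHERE reads the same as at the console:

```python
import sqlite3

# An in-memory database, so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Articles (article_id TEXT PRIMARY KEY, title TEXT, journal TEXT)")
conn.execute("INSERT INTO Articles VALUES ('a1', 'First Paper', 'J1')")
conn.execute("INSERT INTO Articles VALUES ('a2', 'Second Paper', 'J2')")

# A multi-line SELECT ... WHERE; whitespace is free.
rows = conn.execute("""
    SELECT title
    FROM Articles
    WHERE journal = 'J1'
""").fetchall()
# rows is a list of tuples, one per matching row
```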
Logical Operators
To model the predicate calculus, we need more than matching equality; we need logical operators.
SQL gives us a variety; far more than we need to define a (finite) first-order logical system.
For example, AND, NOT, OR, and parentheses:
SELECT *
FROM Table
WHERE Column1="Value1"
AND NOT (
Column2="Value2"
OR
Column2="Value3"
);
And even a shorthand for comparing to a list:
SELECT *
FROM Table
WHERE Column1="Value1"
AND NOT Column2
IN ("Value2", "Value3");
Do not trust operator precedence. It differs from implementation to implementation.
For now, parentheses are your friends.
Subqueries
Our predicate calculus wouldn't be complete without the ability to make broad, sweeping statements. For that, we have the ability to construct subqueries.
They are queries within queries. For example, we could get all the collaborators of a given author:
SELECT DISTINCT author_name
FROM ArticleAuthors
WHERE article_id
IN (
SELECT article_id
FROM ArticleAuthors
WHERE author_name="Pomerance, Carl"
)
AND NOT author_name="Pomerance, Carl";
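To see the subquery in action without the full bibliography, here is a sketch against a toy ArticleAuthors table (the co-author names here are made-up sample rows, not real data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ArticleAuthors (article_id TEXT, author_name TEXT)")
conn.executemany(
    "INSERT INTO ArticleAuthors VALUES (?, ?)",
    [("p1", "Pomerance, Carl"), ("p1", "Crandall, Richard"),
     ("p2", "Pomerance, Carl"), ("p2", "Selfridge, John L"),
     ("p3", "Other, Author")],
)

# The inner query finds Pomerance's articles; the outer query
# finds everyone else named on those articles.
collaborators = conn.execute("""
    SELECT DISTINCT author_name
    FROM ArticleAuthors
    WHERE article_id IN (
        SELECT article_id
        FROM ArticleAuthors
        WHERE author_name = 'Pomerance, Carl'
    )
    AND NOT author_name = 'Pomerance, Carl'
""").fetchall()
```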
And now we have enough SQL to state our predicates, and - as an added bonus - read out matching columns!
INSERT INTO
One thing not particularly covered by the basic math, but of great importance, is modifying the data.
For that, we have INSERT INTO.
Suppose I want to add a new paper into the list. For example, this paper (in BibTeX format):
@article{baillie1980lucas,
title={Lucas pseudoprimes},
author={Baillie, Robert and Wagstaff, Samuel S},
journal={Mathematics of Computation},
volume={35},
number={152},
pages={1391--1417},
year={1980}
}
We can start by adding the article:
INSERT INTO Articles(
article_id,
title,
journal
)
VALUES
(
"baillie1980lucas",
"Lucas Pseudoprimes",
"Mathematics of Computation"
);
We can even insert multiple rows:
INSERT INTO ArticleAuthors(
article_id,
author_name,
author_number
)
VALUES
(
"baillie1980lucas",
"Baillie, Robert",
"1"
),
(
"baillie1980lucas",
"Wagstaff, Samuel S",
"2"
)
;
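From Python, the safer way to write these inserts is with ? placeholders rather than pasting values into the query string - sqlite3 handles the quoting, and executemany covers the multi-row case. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ArticleAuthors (
    article_id TEXT,
    author_name TEXT,
    author_number INTEGER DEFAULT 1,
    PRIMARY KEY (author_name, article_id))""")

# ? placeholders let sqlite3 quote the values for us;
# executemany inserts several rows in one call.
conn.executemany(
    "INSERT INTO ArticleAuthors(article_id, author_name, author_number) VALUES (?, ?, ?)",
    [("baillie1980lucas", "Baillie, Robert", 1),
     ("baillie1980lucas", "Wagstaff, Samuel S", 2)],
)
n = conn.execute("SELECT COUNT(*) FROM ArticleAuthors").fetchone()[0]
```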
UPDATE Table SET Column=Value WHERE
Let's say we made a mistake - perhaps a simple one, like misspelling the title.
For that, we can UPDATE our values:
UPDATE Articles
SET title="Lucas pseudoprimes"
WHERE article_id="baillie1980lucas";
Note that the WHERE is technically optional; if omitted, the update sets that column to the new value in every row of the table. This is virtually never what you want, and a good argument for backing up your .sqlite file somehow (such as with git).
DELETE FROM
Let's say I regret adding that paper.
I can delete it with:
DELETE FROM Articles
WHERE article_id="baillie1980lucas";
DELETE FROM ArticleAuthors
WHERE article_id="baillie1980lucas";
Note that the WHERE is still technically optional here, but omitting it deletes every row in the table - serious loss of data.
Joins
The predicate calculus largely concerns itself with single-table representations - though with possible Null values.
Joins are how we achieve single tables from our multitude. For example, to get a list of articles by one author, we can do a JOIN:
SELECT author_name, title FROM ArticleAuthors JOIN Articles ON ArticleAuthors.article_id=Articles.article_id;
If the column names match up, there is a nice shorthand - NATURAL JOIN finds the shared columns itself, so it takes no ON clause:
SELECT author_name, title FROM ArticleAuthors NATURAL JOIN Articles;
There are many geometries of Joins, but most of the joins you will do - key-to-key matching - are Inner Joins.
Another common type of join is the Left Join, which keeps every row of the left table - filling in Null where the right table has no match.
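A sketch of that Null-filling behavior, with a made-up article that has no author row - from Python, the missing value comes back as None:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (article_id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE ArticleAuthors (article_id TEXT, author_name TEXT);
    INSERT INTO Articles VALUES ('a1', 'Matched Paper');
    INSERT INTO Articles VALUES ('a2', 'Orphan Paper');
    INSERT INTO ArticleAuthors VALUES ('a1', 'Some, Author');
""")

# LEFT JOIN keeps every Articles row; the orphan gets Null (None) for its author.
rows = conn.execute("""
    SELECT title, author_name
    FROM Articles
    LEFT JOIN ArticleAuthors
    ON Articles.article_id = ArticleAuthors.article_id
    ORDER BY title
""").fetchall()
```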
Aggregators
Often, complicated queries involve arithmetic. You can perform simple arithmetic with the standard operators, but for more complex operations, SQL provides aggregate functions such as COUNT, SUM, MIN, MAX, and GROUP_CONCAT.
By default, they collapse everything in a query:
SELECT COUNT(DISTINCT author_name) FROM ArticleAuthors;
But these make the final piece of the puzzle for returning to our original table, if you prefer to look at it:
SELECT group_concat(author_name, " and "),
title,
Journals.publisher,
Articles.journal
FROM ArticleAuthors
JOIN Articles
ON ArticleAuthors.article_id=Articles.article_id
JOIN Journals
ON Articles.journal=Journals.journal
GROUP BY Articles.article_id
ORDER BY author_number;
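If the big query is hard to see through, here is GROUP BY in miniature on toy data - one COUNT per author rather than one COUNT overall:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ArticleAuthors (article_id TEXT, author_name TEXT)")
conn.executemany("INSERT INTO ArticleAuthors VALUES (?, ?)",
                 [("p1", "A"), ("p2", "A"), ("p2", "B")])

# GROUP BY splits the aggregation into one row per author.
counts = conn.execute("""
    SELECT author_name, COUNT(*)
    FROM ArticleAuthors
    GROUP BY author_name
    ORDER BY author_name
""").fetchall()
```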
Modifying the Schema
Create Table
The CREATE TABLE
directive does exactly what it says - creates a table.
You have already seen examples of it in the schema:
CREATE TABLE Journals (
journal TEXT PRIMARY KEY,
publisher TEXT
);
CREATE TABLE Articles (
article_id TEXT PRIMARY KEY,
title TEXT,
journal TEXT,
FOREIGN KEY (journal) REFERENCES Journals(journal)
);
CREATE TABLE ArticleAuthors (
article_id TEXT,
author_name TEXT,
author_number INTEGER DEFAULT 1,
PRIMARY KEY (author_name, article_id),
FOREIGN KEY(article_id) REFERENCES Articles(article_id)
);
Note that we can tell it our keys and foreign keys - and where they come from.
This is not strictly necessary, and databases you are investigating might not do it - but it can really help when reading about a database, or when validating it.
For example, every FOREIGN KEY value must match an existing value in the column it references.
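One SQLite quirk worth knowing: foreign keys are recorded in the schema but not enforced unless you turn enforcement on, per connection, with PRAGMA foreign_keys. A sketch of the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is OFF by default in SQLite
conn.executescript("""
    CREATE TABLE Journals (journal TEXT PRIMARY KEY, publisher TEXT);
    CREATE TABLE Articles (
        article_id TEXT PRIMARY KEY,
        title TEXT,
        journal TEXT,
        FOREIGN KEY (journal) REFERENCES Journals(journal)
    );
""")

# With the pragma on, an article pointing at a nonexistent journal fails loudly.
try:
    conn.execute("INSERT INTO Articles VALUES ('x', 'T', 'No Such Journal')")
    violated = False
except sqlite3.IntegrityError:
    violated = True
```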
Modifying Tables
There is an ALTER TABLE directive, and it is fairly versatile, but it produces a strange .schema and struggles with some modifications.
For small data, it is often easier to trash the old table and start over, copying the data across:
ALTER TABLE ArticleAuthors RENAME TO oldArticleAuthors;
CREATE TABLE ArticleAuthors (
article_id TEXT,
author_name TEXT,
author_number INTEGER DEFAULT 1,
PRIMARY KEY (author_name, article_id),
FOREIGN KEY(article_id) REFERENCES Articles(article_id)
);
INSERT INTO ArticleAuthors(
article_id,
author_name,
author_number
)
SELECT article_id, author_name, author_number
FROM oldArticleAuthors;
DROP TABLE oldArticleAuthors;
I find it useful to write my schemas in separate files, and initialize a database using the command-line interface (or, as we will see, the Python library).
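From Python, a whole schema file can be run in one call with executescript. A sketch - in practice the string would come from your schema file (say, open("bibliography.schema").read(); the filename is just an illustration), but it is inlined here to keep the example self-contained:

```python
import sqlite3

# In practice, read this from a schema file instead of inlining it.
schema = """
CREATE TABLE Journals (
    journal TEXT PRIMARY KEY,
    publisher TEXT
);
CREATE TABLE Articles (
    article_id TEXT PRIMARY KEY,
    title TEXT,
    journal TEXT,
    FOREIGN KEY (journal) REFERENCES Journals(journal)
);
"""

conn = sqlite3.connect(":memory:")  # or a real .sqlite file path
conn.executescript(schema)          # runs every statement in the script

# sqlite_master is SQLite's own catalog of what exists in the database.
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
```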
Assignment
Play around with the data, and if you want, with other examples from across the internet. W3Schools, opened in a WebSQL-compliant browser (non-Firefox), gives you a sqlite3 session in the browser with an editable input and prettyprinted output.
Your proper assignment, however, is to:
Design a (3NF compliant) schema to store the Student Data in SQLite syntax. Save it as gradebooks.schema, and upload the file to brightspace - we will need it tomorrow as we get into using Python to interface with the data.
This is going to be the most complicated schema yet - so I will offer some advice.
- You should use your python file reading skills to get your column names, and python string manipulation skills to turn them into a schema. Don't try to write all the students.csv columns by hand! Write them by computer, and tweak them by hand.
- Storing the different gradebooks is going to be an adventure - with their different numbers of HWs, Quizzes, etc.
- You can have a table for assignment descriptions - with assignment names (like "HW1"), assignment type (like "Quiz"), maximum values, and course id.
- You can then many-many join students to that table, to record their score in each assignment.
- You can discard the final_grade column. Externally computed columns are almost always lies.