The Math of SQL
Video
Motivation
Few technologies have changed the way I see the world more than SQL Databases. Deeply mathematical but surprisingly versatile, the storage of information exists in a back-and-forth relationship with your understanding of the information stored.
By working from good principles - of robust design, minimizing unnecessary limitations, and a permissive outlook - what may seem impossible to change could be very easy.
Mathematics
Relational Data
Tabular data is easy to store. Lists of numbers, dates, or names are well-studied.
Real world data is almost never so simple. Data, like the people who generate it, comes with a complex web of relationships.
In the abstract context, mathematicians who study the idea of data define a few core concepts for a Dataset, which contains:
- A (finite) set of Columns values $S_i$, indexed by some (finite) set $I$.
- Each column set represents some Data Type (or Data Domain) : (finite) equality-comparable sets of possible Data Values.
- A (finite) set of Relations $R$ : A "named tuple" of column IDs and Values for some subset of the columns.
Data is manipulated in terms of finite first-order predicate calculus. In those terms, the Relations are typically defined as a truth function:
\begin{align} \prod_i S_i &\overset f \mapsto \{T,F\} & f((s_1,\dots,s_n)) = T \iff (s_1,\dots, s_n) \in R \end{align}
Fundamentally, all finite data can be viewed this way, and complex statements about the data can be phrased - and thus resolved - as predicate statements about the truth function.
It can be helpful to group the Relations according to their Names and Data Domains. By analogy, we refer to these groups as Tables, the Names as Column Names, the Values as Attributes, and the Data Domains/Types as Column Domains/Types.
Note that the same real-world data can be represented in many ways, that are equivalent (up to a translation of the predicate calculus). The study of the relative costs and benefits of certain data representations has many real-world applications to the storage of data, as we will see.
Part of real world data is that some things are impossible - start dates occur after end dates, and so on - so we can define Data Spaces $D \subset \prod S_i$, which contain only "valid" data under some reasonable conditions.
Example
Let's say you had a lot of data - for example, a dataset on mathematical publications:
author | title | publisher | journal |
---|---|---|---|
Wagstaff, Samuel S | Divisors of Mersenne numbers | Amer. Math. Soc. | Mathematics of Computation |
Pomerance, Carl and Selfridge, John L and Wagstaff, Samuel S | The pseudoprimes to 25⋅ 10⁹ | Amer. Math. Soc. | Mathematics of Computation |
Pomerance, Carl | On the distribution of pseudoprimes | Amer. Math. Soc. | Mathematics of Computation |
Wagstaff, Samuel S | Infinite matroids | Amer. Math. Soc. | Transactions of the American Mathematical Society |
Alford, William R and Granville, Andrew and Pomerance, Carl | There are infinitely many Carmichael numbers | JSTOR | Annals of Mathematics |
Using this table works in general, for a tabular representation of small data - this is basically how BibTex stores it, after all.
This can work for some computations; we could group articles by the same single author - providing there are no typos or alternate spellings - with a predicate operation on the authors. However, we can do better.
Normalization
Atomicity and the 1st Normal Form
Let's pick out some flaws from this that we can improve upon. First, the Author section - searching for all entries by the same author is tedious, involving substring matching and errors.
Enter the First Normal Form:
A Dataset is said to be in 1st Normal Form iff every cell is Atomic, that is, no cell represents more than one Data Element.
Example
We could attempt a simple Atomization - conversion to first normal form - by setting some reasonable bound on the number of authors, and making a separate column for each:
Author 1 | Author 2 | Author 3 | title | publisher | journal |
---|---|---|---|---|---|
Wagstaff, Samuel S | None | None | Divisors of Mersenne numbers | Amer. Math. Soc. | Mathematics of Computation |
Pomerance, Carl | Selfridge, John L | Wagstaff, Samuel S | The pseudoprimes to 25⋅ 10⁹ | Amer. Math. Soc. | Mathematics of Computation |
Pomerance, Carl | None | None | On the distribution of pseudoprimes | Amer. Math. Soc. | Mathematics of Computation |
Wagstaff, Samuel S | None | None | Infinite matroids | Amer. Math. Soc. | Transactions of the American Mathematical Society |
Alford, William R | Granville, Andrew | Pomerance, Carl | There are infinitely many Carmichael numbers | JSTOR | Annals of Mathematics |
But there may be some very large collaborations out there - and this makes it hard to search by author.
A more robust Atomization might be to expand the author column into rows:
author name | author number | title | publisher | journal |
---|---|---|---|---|
Alford, William R | 1 | There are infinitely many Carmichael numbers | JSTOR | Annals of Mathematics |
Granville, Andrew | 2 | There are infinitely many Carmichael numbers | JSTOR | Annals of Mathematics |
Pomerance, Carl | 3 | There are infinitely many Carmichael numbers | JSTOR | Annals of Mathematics |
Pomerance, Carl | 1 | The pseudoprimes to 25⋅ 10⁹ | Amer. Math. Soc. | Mathematics of Computation |
Selfridge, John L | 2 | The pseudoprimes to 25⋅ 10⁹ | Amer. Math. Soc. | Mathematics of Computation |
Wagstaff, Samuel S | 3 | The pseudoprimes to 25⋅ 10⁹ | Amer. Math. Soc. | Mathematics of Computation |
Pomerance, Carl | 1 | On the distribution of pseudoprimes | Amer. Math. Soc. | Mathematics of Computation |
Wagstaff, Samuel S | 1 | Infinite matroids | Amer. Math. Soc. | Transactions of the American Mathematical Society |
Wagstaff, Samuel S | 1 | Divisors of Mersenne numbers | Amer. Math. Soc. | Mathematics of Computation |
This allows us to search for all papers by a given author!
Uniqueness and the 2nd Normal Form
However, now we have even more repeated information. To fix that, we will continue Normalization.
To define the 2nd Normal Form, we have to talk a bit about keys.
Under the mathematical definition, no two rows can be identical. Often, in practice, there are smaller subsets which also should be unique in the space - such as the ISBN for books, or author-title combination for journal articles.
Definition: Suppose we have a Data Space $D \subset \prod S_i$, and a set of relations $R \subset D$.
A set of Attributes $A \subset I$ of $r \in R$ is called a Candidate Key if and only if both:
- $\not \exists s \in D \backslash r$ such that $s|A = r|A$ - that is, $r$ restricted to the columns $A$ uniquely identifies $r$ in $D$.
- $A$ is minimal with respect to this condition; that is, $\not \exists B \subsetneq A$ such that $B$ satisfies the first condition.
Candidate Keys will come up a lot, and they show up in the formal definition of the 2nd normal form. However, we need another quick definition:
Definition: A Functional Dependency from $A$ to $B$ in a Data Space $D \subset \prod S_i$ is a relation between disjoint attibute sets $A,B \subset I$, $A \cap B = \emptyset$ such that such that: \[ \forall r,s \in D: r|A=s|A \implies r|B=s|B \]
Or, more colloquially, the columns in B depend completely on the columns in A.
With these definitions, we can build the second normal form:
Definition: A dataset is said to be in 2nd Normal Form if there are no (nontrivial) functional dependencies from a proper subset of a candidate key, to a non-key attribute.
This is kind of technical, but basically it means - don't store conclusions based on part of your key.
Example
Our current table violates this, because the journal title - irrespective of author - determines most of the information.
This can cause extra storage, but the worse thing is inconsistencies - If one of the values is changed, such as an updated publisher, we could get the same title resolving to two different possibilities.
To Normalize further, we can take our data and split it into multiple tables. These tables can store data specific to objects; for example, we could have a table for articles without redundancy:
title | publisher | journal | |
---|---|---|---|
ID | |||
wagstaff1983divisors | Divisors of Mersenne numbers | Amer. Math. Soc. | Mathematics of Computation |
pomerance1980pseudoprimes | The pseudoprimes to 25⋅ 10⁹ | Amer. Math. Soc. | Mathematics of Computation |
pomerance1981distribution | On the distribution of pseudoprimes | Amer. Math. Soc. | Mathematics of Computation |
wagstaff1973infinite | Infinite matroids | Amer. Math. Soc. | Transactions of the American Mathematical Society |
alford1994there | There are infinitely many Carmichael numbers | JSTOR | Annals of Mathematics |
And a many-to-many join table which links them together, along with any information relevant to the join:
paper id | author name | author number |
---|---|---|
wagstaff1983divisors | Wagstaff, Samuel S | 1 |
pomerance1980pseudoprimes | Pomerance, Carl | 1 |
pomerance1980pseudoprimes | Selfridge, John L | 2 |
pomerance1980pseudoprimes | Wagstaff, Samuel S | 3 |
pomerance1981distribution | Pomerance, Carl | 1 |
wagstaff1973infinite | Wagstaff, Samuel S | 1 |
alford1994there | Alford, William R | 1 |
alford1994there | Granville, Andrew | 2 |
alford1994there | Pomerance, Carl | 3 |
Third Normal Form and the End of Redundancy
There's one last thing to notice. All of these articles contain both a journal and a publisher. However, (in an ideal world), the journal determines the publisher.
This doesn't violate 2nd normal form - the journal is not a key - but it is still redundant. To get rid of it, we will define Third Normal Form:
A dataset is said to be in Third Normal Form if:
- The dataset is in Second Normal Form
- Every non-key attribute depends on every key attribute
- No non-key attribute implies another non-key attribute which does not imply it
The classic saying about 3NF is that every attribute should determine "The key, the whole key, and nothing but the key, so help me Codd."
Fundamentally, 3NF is a restriction - and guide - on the design of your tables. A database in 3NF has tables that represent data at the minimal scope necessary. This gives not just a normal form, but close to a standard form.
In practicality, of course, functional dependencies crop up in unexpected ways - in the real world, true 3NF compliance is more of an ideal than a guarantee.
Example
Our remaining 3NF violation is - as hinted above - the journal $\Rightarrow$ publisher dependency.
We can abstract it away by adding a Journals table:
publisher | journal |
---|---|
Amer. Math. Soc. | Mathematics of Computation |
Amer. Math. Soc. | Transactions of the American Mathematical Society |
JSTOR | Annals of Mathematics |
And modifying our Articles to no longer contain redundant information:
title | journal | |
---|---|---|
ID | ||
wagstaff1983divisors | Divisors of Mersenne numbers | Mathematics of Computation |
pomerance1980pseudoprimes | The pseudoprimes to 25⋅ 10⁹ | Mathematics of Computation |
pomerance1981distribution | On the distribution of pseudoprimes | Mathematics of Computation |
wagstaff1973infinite | Infinite matroids | Transactions of the American Mathematical Society |
alford1994there | There are infinitely many Carmichael numbers | Annals of Mathematics |
Normal Forms 3.5-6 And Beyond
Database researchers have imagined countless normal forms, including some that lie outside this sequence. They all have their own advantages or disadvantages, but 3rd normal form ensures us a lack of redundancy - and thus, the ability to change non-key data in one place and have it changed in all uses.
I won't explain them here - they get more and more cumbersome - but to see the limiting case, 6th normal form ("No Nontrivial Functional Dependencies") only allows key-value stores.
For example, we would have to break our journal table further into:
title | |
---|---|
ID | |
wagstaff1983divisors | Divisors of Mersenne numbers |
pomerance1980pseudoprimes | The pseudoprimes to 25⋅ 10⁹ |
pomerance1981distribution | On the distribution of pseudoprimes |
wagstaff1973infinite | Infinite matroids |
alford1994there | There are infinitely many Carmichael numbers |
journal | |
---|---|
ID | |
wagstaff1983divisors | Mathematics of Computation |
pomerance1980pseudoprimes | Mathematics of Computation |
pomerance1981distribution | Mathematics of Computation |
wagstaff1973infinite | Transactions of the American Mathematical Society |
alford1994there | Annals of Mathematics |
Assignment
Consider the following list of columns, that you might get out of an LMS:
Student Name | Student ID | Student Nickname | Course Name | Course ID | HW Grade | Quiz Grade | Exam Grade |
---|---|---|---|---|---|---|---|
Graded Assignment: Submit to Brightspace a database "schema" - lists of column names for one or more tables, seperated by new lines - for a data model that would satisfy the third normal form. Place the key(s) first.
As an example, to submit the schema derived in class, write:
paper id, author name, author number
paper id, title, journal
journal, publisher