The Math of SQL

The Math of SQL

Video

Motivation

Few technologies have changed the way I see the world more than SQL Databases. Deeply mathematical but surprisingly versatile, the storage of information exists in a back-and-forth relationship with your understanding of the information stored.

By working from good principles - of robust design, minimizing unnecessary limitations, and a permissive outlook - what may seem impossible to change could be very easy.

Mathematics

Relational Data

Tabular data is easy to store. Lists of numbers, dates, or names are well-studied.

Real world data is almost never so simple. Data, like the people who generate it, comes with a complex web of relationships.

In the abstract context, mathematicians who study the idea of data define a few core concepts for a Dataset, which contains:

A (finite) set of Columns values $S_i$, indexed by some (finite) set $I$.
- Each column set represents some Data Type (or Data Domain) : (finite) equality-comparable sets of possible Data Values.
A (finite) set of Relations $R$ : A "named tuple" of column IDs and Values for some subset of the columns.

Data is manipulated in terms of finite first-order predicate calculus. In those terms, the Relations are typically defined as a truth function:

\begin{align} \prod_i S_i &\overset f \mapsto \{T,F\} & f((s_1,\dots,s_n)) = T \iff (s_1,\dots, s_n) \in R \end{align}

Fundamentally, all finite data can be viewed this way, and complex statements about the data can be phrased - and thus resolved - as predicate statements about the truth function.

It can be helpful to group the Relations according to their Names and Data Domains. By analogy, we refer to these groups as Tables, the Names as Column Names, the Values as Attributes, and the Data Domains/Types as Column Domains/Types.

Note that the same real-world data can be represented in many ways, that are equivalent (up to a translation of the predicate calculus). The study of the relative costs and benefits of certain data representations has many real-world applications to the storage of data, as we will see.

Part of real world data is that some things are impossible - start dates occur after end dates, and so on - so we can define Data Spaces $D \subset \prod S_i$, which contain only "valid" data under some reasonable conditions.

Example

Let's say you had a lot of data - for example, a dataset on mathematical publications:

author	title	publisher	journal
Wagstaff, Samuel S	Divisors of Mersenne numbers	Amer. Math. Soc.	Mathematics of Computation
Pomerance, Carl and Selfridge, John L and Wagstaff, Samuel S	The pseudoprimes to 25⋅ 10⁹	Amer. Math. Soc.	Mathematics of Computation
Pomerance, Carl	On the distribution of pseudoprimes	Amer. Math. Soc.	Mathematics of Computation
Wagstaff, Samuel S	Infinite matroids	Amer. Math. Soc.	Transactions of the American Mathematical Society
Alford, William R and Granville, Andrew and Pomerance, Carl	There are infinitely many Carmichael numbers	JSTOR	Annals of Mathematics

Using this table works in general, for a tabular representation of small data - this is basically how BibTex stores it, after all.

This can work for some computations; we could group articles by the same single author - providing there are no typos or alternate spellings - with a predicate operation on the authors. However, we can do better.

Normalization

Atomicity and the 1st Normal Form

Let's pick out some flaws from this that we can improve upon. First, the Author section - searching for all entries by the same author is tedious, involving substring matching and errors.

Enter the First Normal Form:

A Dataset is said to be in 1st Normal Form iff every cell is Atomic, that is, no cell represents more than one Data Element.

Example

We could attempt a simple Atomization - conversion to first normal form - by setting some reasonable bound on the number of authors, and making a separate column for each:

Author 1	Author 2	Author 3	title	publisher	journal
Wagstaff, Samuel S	None	None	Divisors of Mersenne numbers	Amer. Math. Soc.	Mathematics of Computation
Pomerance, Carl	Selfridge, John L	Wagstaff, Samuel S	The pseudoprimes to 25⋅ 10⁹	Amer. Math. Soc.	Mathematics of Computation
Pomerance, Carl	None	None	On the distribution of pseudoprimes	Amer. Math. Soc.	Mathematics of Computation
Wagstaff, Samuel S	None	None	Infinite matroids	Amer. Math. Soc.	Transactions of the American Mathematical Society
Alford, William R	Granville, Andrew	Pomerance, Carl	There are infinitely many Carmichael numbers	JSTOR	Annals of Mathematics

But there may be some very large collaborations out there - and this makes it hard to search by author.

A more robust Atomization might be to expand the author column into rows:

author name	author number	title	publisher	journal
Alford, William R	1	There are infinitely many Carmichael numbers	JSTOR	Annals of Mathematics
Granville, Andrew	2	There are infinitely many Carmichael numbers	JSTOR	Annals of Mathematics
Pomerance, Carl	3	There are infinitely many Carmichael numbers	JSTOR	Annals of Mathematics
Pomerance, Carl	1	The pseudoprimes to 25⋅ 10⁹	Amer. Math. Soc.	Mathematics of Computation
Selfridge, John L	2	The pseudoprimes to 25⋅ 10⁹	Amer. Math. Soc.	Mathematics of Computation
Wagstaff, Samuel S	3	The pseudoprimes to 25⋅ 10⁹	Amer. Math. Soc.	Mathematics of Computation
Pomerance, Carl	1	On the distribution of pseudoprimes	Amer. Math. Soc.	Mathematics of Computation
Wagstaff, Samuel S	1	Infinite matroids	Amer. Math. Soc.	Transactions of the American Mathematical Society
Wagstaff, Samuel S	1	Divisors of Mersenne numbers	Amer. Math. Soc.	Mathematics of Computation

This allows us to search for all papers by a given author!

Uniqueness and the 2nd Normal Form

However, now we have even more repeated information. To fix that, we will continue Normalization.

To define the 2nd Normal Form, we have to talk a bit about keys.

Under the mathematical definition, no two rows can be identical. Often, in practice, there are smaller subsets which also should be unique in the space - such as the ISBN for books, or author-title combination for journal articles.

Definition: Suppose we have a Data Space $D \subset \prod S_i$, and a set of relations $R \subset D$.

A set of Attributes $A \subset I$ of $r \in R$ is called a Candidate Key if and only if both:

$\not \exists s \in D \backslash r$ such that $s|A = r|A$ - that is, $r$ restricted to the columns $A$ uniquely identifies $r$ in $D$.

$A$ is minimal with respect to this condition; that is, $\not \exists B \subsetneq A$ such that $B$ satisfies the first condition.

Candidate Keys will come up a lot, and they show up in the formal definition of the 2nd normal form. However, we need another quick definition:

Definition: A Functional Dependency from $A$ to $B$ in a Data Space $D \subset \prod S_i$ is a relation between disjoint attibute sets $A,B \subset I$, $A \cap B = \emptyset$ such that such that: \[ \forall r,s \in D: r|A=s|A \implies r|B=s|B \]

Or, more colloquially, the columns in B depend completely on the columns in A.

With these definitions, we can build the second normal form:

Definition: A dataset is said to be in 2nd Normal Form if there are no (nontrivial) functional dependencies from a proper subset of a candidate key, to a non-key attribute.

This is kind of technical, but basically it means - don't store conclusions based on part of your key.

Example

Our current table violates this, because the journal title - irrespective of author - determines most of the information.

This can cause extra storage, but the worse thing is inconsistencies - If one of the values is changed, such as an updated publisher, we could get the same title resolving to two different possibilities.

To Normalize further, we can take our data and split it into multiple tables. These tables can store data specific to objects; for example, we could have a table for articles without redundancy:

	title	publisher	journal
ID
wagstaff1983divisors	Divisors of Mersenne numbers	Amer. Math. Soc.	Mathematics of Computation
pomerance1980pseudoprimes	The pseudoprimes to 25⋅ 10⁹	Amer. Math. Soc.	Mathematics of Computation
pomerance1981distribution	On the distribution of pseudoprimes	Amer. Math. Soc.	Mathematics of Computation
wagstaff1973infinite	Infinite matroids	Amer. Math. Soc.	Transactions of the American Mathematical Society
alford1994there	There are infinitely many Carmichael numbers	JSTOR	Annals of Mathematics

And a many-to-many join table which links them together, along with any information relevant to the join:

paper id	author name	author number
wagstaff1983divisors	Wagstaff, Samuel S	1
pomerance1980pseudoprimes	Pomerance, Carl	1
pomerance1980pseudoprimes	Selfridge, John L	2
pomerance1980pseudoprimes	Wagstaff, Samuel S	3
pomerance1981distribution	Pomerance, Carl	1
wagstaff1973infinite	Wagstaff, Samuel S	1
alford1994there	Alford, William R	1
alford1994there	Granville, Andrew	2
alford1994there	Pomerance, Carl	3

Third Normal Form and the End of Redundancy

There's one last thing to notice. All of these articles contain both a journal and a publisher. However, (in an ideal world), the journal determines the publisher.

This doesn't violate 2nd normal form - the journal is not a key - but it is still redundant. To get rid of it, we will define Third Normal Form:

A dataset is said to be in Third Normal Form if:

The dataset is in Second Normal Form

Every non-key attribute depends on every key attribute

No non-key attribute implies another non-key attribute which does not imply it

The classic saying about 3NF is that every attribute should determine "The key, the whole key, and nothing but the key, so help me Codd."

Fundamentally, 3NF is a restriction - and guide - on the design of your tables. A database in 3NF has tables that represent data at the minimal scope necessary. This gives not just a normal form, but close to a standard form.

In practicality, of course, functional dependencies crop up in unexpected ways - in the real world, true 3NF compliance is more of an ideal than a guarantee.

Example

Our remaining 3NF violation is - as hinted above - the journal $\Rightarrow$ publisher dependency.

We can abstract it away by adding a Journals table:

publisher	journal
Amer. Math. Soc.	Mathematics of Computation
Amer. Math. Soc.	Transactions of the American Mathematical Society
JSTOR	Annals of Mathematics

And modifying our Articles to no longer contain redundant information:

	title	journal
ID
wagstaff1983divisors	Divisors of Mersenne numbers	Mathematics of Computation
pomerance1980pseudoprimes	The pseudoprimes to 25⋅ 10⁹	Mathematics of Computation
pomerance1981distribution	On the distribution of pseudoprimes	Mathematics of Computation
wagstaff1973infinite	Infinite matroids	Transactions of the American Mathematical Society
alford1994there	There are infinitely many Carmichael numbers	Annals of Mathematics

Normal Forms 3.5-6 And Beyond

Database researchers have imagined countless normal forms, including some that lie outside this sequence. They all have their own advantages or disadvantages, but 3rd normal form ensures us a lack of redundancy - and thus, the ability to change non-key data in one place and have it changed in all uses.

I won't explain them here - they get more and more cumbersome - but to see the limiting case, 6th normal form ("No Nontrivial Functional Dependencies") only allows key-value stores.

For example, we would have to break our journal table further into:

	title
ID
wagstaff1983divisors	Divisors of Mersenne numbers
pomerance1980pseudoprimes	The pseudoprimes to 25⋅ 10⁹
pomerance1981distribution	On the distribution of pseudoprimes
wagstaff1973infinite	Infinite matroids
alford1994there	There are infinitely many Carmichael numbers

	journal
ID
wagstaff1983divisors	Mathematics of Computation
pomerance1980pseudoprimes	Mathematics of Computation
pomerance1981distribution	Mathematics of Computation
wagstaff1973infinite	Transactions of the American Mathematical Society
alford1994there	Annals of Mathematics

Assignment

Consider the following list of columns, that you might get out of an LMS:

Student Name	Student ID	Student Nickname	Course Name	Course ID	HW Grade	Quiz Grade	Exam Grade

Graded Assignment: Submit to Brightspace a database "schema" - lists of column names for one or more tables, seperated by new lines - for a data model that would satisfy the third normal form. Place the key(s) first.

As an example, to submit the schema derived in class, write:

paper id, author name, author number
paper id, title, journal
journal, publisher