Databases

Up to this point we’ve been storing our resources - like our images for our galleries and our server configuration - in files. While this works reasonably well for small projects with simple organization, it does have its drawbacks. What potential drawbacks can you think of?

File-based Data Storage Drawbacks

Speed of Access

One of the largest challenges is the amount of time involved in file I/O. Files in the filesystem are typically stored on hard drives. For magnetic disk drives, this means physically spinning the platter and adjusting the spindle to the right point, and spinning the platter while reading - a very time consumptive process. While solid-state drives perform much faster, they add significant cost to storage, are more failure-prone, and are still time-expensive in terms of the distance signals have to travel from the drive to the CPU.

Concurrent Access

Another important challenge is the need for scalability. File I/O is a blocking operation - only one thread can access a file on that disc at a time - so it becomes a bottleneck in our system. In Node, this will manifest in slow responses as more requests needing to read the file system pile up - we’ll keep accepting new requests due to the structure of the event loop and asynchronous functions, but it will take longer and longer for new file operations to complete and push their results to the event queue as they get backed up.

A second concern from the scalability standpoint comes from horizontal scaling. Horizontal scaling refers to the practice of running your webserver on multiple machines, with a load balancer pushing requests to the least busy server. This allows us to handle far more visitors than a single machine would, while only requiring more reasonably-priced servers. But if we are horizontally scaling a file-bound web app, we’d need our file system accessible and shared by each of these machines - essentially a network drive. This only exacerbates the problems, as network drives are even slower, and now we have more requests piling up that can only be serviced one at a time.

Relational Data

One last weakness we’ll discuss in file-based systems is that they have a limited ability to organize contents - typically in hierarchical form. Consider the homework assignment - if we have multiple picture galleries, how do we establish the relationship between the picture file and the gallery it belongs to?

The picture files are placed in corresponding gallery folders - creating a hierarchical relationship.

What if we wanted to add tags to our galleries, where multiple pictures could share a tag? And what if we wanted to show all the pictures with that tag? How can we implement this functionality using files?

One possibility would be to have a tags directory or configuration file that lists each tag, and an array of image locations that share that tag... there are many other possibilities, but each further complicates the structure of our applicaiton.

What we are talking here about is relational data - the combination of pieces of data and their relationships. If both the data and their relationships are well-defined, they form what we often call structured data, which lends itself to computer-based representation.

Enter the Database

These are not new challenges, nor are they unique to the Web - enterprise applications have tackled them for years. Considering the challenges we’ve outlined, the ideal solution would:

The most common implementation of this solution is what we call a relational database - a program that stores and retrieves structured data in as optimal a way possible.

Tables

Data in relational databases is organized into tables with well-defined structure. Consider our image galleries. We have galleries, which have a name, and a description. They contain images, which are an image file and a caption. Images may be tagged with a tag - which is basically text. We can use this knowledge to define structures to contain them (note, this is psuedo - C# code):

struct Gallery { char[50] name; char[500] description; } struct Image { char[50] filename; char[100] location; char[500] caption; } struct Tag { char[5]] text; }

With these structures defined, we could store our data as arrays or vectors of these structs, i.e.:

List<Gallery> galleries = new List<Gallery>(); List<Image> images = new List<Images>(); List<Tag> tags = new List<Tag>();

And access them via their index, i.e. galleries[5] returns the 5th gallery structure in our list.

Querying Tables

But what if we wanted to find the Gallery with the name “Technohican”? How would you search the galleries list to find it, and how long (in big O notation) would that search take?

Because we potentially have to look through the whole list, the time order would be O(n) where n is the number of items. If we knew that gallery names were in order, we could use a binary search approach for O(nlog(n)).

Clearly it would be handy to have columns we need to search regularly be sorted - but once we sort by that column, the other columns would no longer be sorted. How could we provide the same speed boost, without modifying our table?

One approach would be to create an index for each column we typically would search against - a separate data structure organized around the column’s data.

For example, we could use a sorted dictionary to map the gallery names to their row numbers:

SortedDictionary<char[50], unsigned int> galleryNameIndex = new SortedDictionary<char[50], unsigned int>();

Because the sorted dictionary uses sorted keys, retrieval on the key is O(n log n). However, it does mean that we are storing more data - specifically n (50sizeof(char) + sizeof(unsigned int)). Because of this extra memory cost we only want to create indices where they will be commonly used.

Keys

While in the discussion above, we used the row’s position in the list to access rows, in practice this is not such a good idea - can you imagine why?

Because when we delete a row, we end up shifting the positions - leaving all rows after that one in different positions.

Instead, we typically have a special column of integers we call the primary keys - these need to be unique, and we want this column to be sorted (as it is the one we access the most. The easiest implementation is that when we add a row, we make the primary key one number greater than the previous one - this ensures they are always sorted and unique.

One way to think about implementing this requirement is to use a SortedDictionary or similar structure for the tables as well, i.e.:

SortedDictionary<unsigned integer, Gallery> galleries = new SortedDictionary<unsigned int, Gallery>(); SortedDictionary<unsigned integer, Image> images = new SortedDictionary<unsinged int, Images>(); SortedDictionary<unsigned, Tag> tags = new SortedDictionary<unsigned int, Tag>(); Inter-Table Relationships Keys also become the core way to define relationships between tables, which can be describes as one-to-one (one row in one table corresponds to one row in another), one-to-many (one row in one table corresponds to many in another), and many-to-many (many rows in one table are related to many rows in another).

The secret is to store a primary key of a row in another table - what we call a foreign key. For example, to define the one-to-many relationship between our Galleries and Images (one gallery contains multiple images), we would put the gallery’s primary id into the images:

struct Image { unsigned integer id; unsigned integer gallery_id; char[50] filename; char[100] location; char[500] caption; }

Similarly, to define one-to-one relationships, we create a foreign key in one of the rows referring to the other - and we only add one matching row (what we call a constraint in database parlance).

Many-to-many relationships, like our tags and images (as many images can share the same tag, and an image can have many tags) are a bit more challenging - they require a join table - a table with two foreign keys that connect the two tables.

The Database Schema

The description of the table structures and their relationships is known as a schema, and can be represented graphically in a database diagram:

Structured Query Language

While there are many competing database products, a great majority support Structured Query Language (SQL). SQL is a language for specifying, manipulating, and querying databases. It is a standard maintained by ANSI and ISO. SQL commands and queries consist of text sent to a database through a communications channel (often a socket), and responded to with text as well. Gotcha: While SQL is a standard, it is often implemented slightly differently by different database systems, so you may have to tweak your syntax if you switch database systems.

An example of a SQL query to retrieve all gallery names would be:

SELECT name FROM galleries;

A more complex query, to grab all images tagged with “football” would be:

SELECT images.filename FROM images INNER JOIN tags_images ON images.id = tags_images.tag_id INNER JOIN tags ON tags_images.image_id = tag.id WHERE tag.text = ‘football’;

There are many more SQL commands that you can learn and use, though for this course the main ones we will worry about are SELECT, UPDATE, DESTROY, and INSERT, along with table creation and some joins. Databases are explored in far greater depth in CIS 560, and you can also find many resources online, such as this codecademy course: https://www.codecademy.com/learn/learn-sql

Using a Database in Node

Let’s consider a new problem - we have a lab full of equipment we would like to loan out to students, clubs, and K-12 teachers (we actually do). We need a website to display what equipment we have, allow it to be booked, and keep track of who has it. For today, we’ll start with just the first problem - how do we maintain and display a list of equipment?

SQLite3

We’ll use SQLite3 as our database system for now - it is a good choice for small projects like this equipment loaning web application where we don’t expect to have many users. It is also useful for prototyping web applications that we will transition to other database systems in production. The reason in both cases is ease-of-use; it doesn’t require us to run a separate database server, and it stores its data in a simple file that is very portable.

However, it does not scale well to large numbers of concurrent users - it faces many of the same constraints as a file-based database, and too much concurrency will bog it down. Thankfully, it does a good job following the SQL standards, so what we write for SQLite today can be directed to a different datadase server later. There is a solid package for working with Sqlite3 available via npm - sqlite3. We can install it with the command:

>npm install sqlite3

Then we can utilize it in our scripts. For example, the code:

var sqlite3 = require('sqlite3'); var db = new sqlite3.Database('equipment.sqlite3');

creates a new Sqlite3 database (or opens an existing one). If we want to create a table, we can use the SQL CREATE TABLE command in a sqlite3.run() method:

db.run("CREATE TABLE equipment (id INTEGER PRIMARY KEY, name TEXT, serial TEXT, image_url TEXT)");

If we wanted to seed (insert some preliminary values) this table, we could do something like:

for(var i = 0; i < 10; i++) { db.run(“INSERT INTO equipment (name, serial) VALUES (?,?)”, ‘WiiMote’, i); }

And to make sure that those values are in the database, we could try:

db.each("SELECT id, name, serial FROM equipment", function(err, row) { if(err) return console.error(err); console.log(row); });

Which should print out the records corresponding to the ten rows we created in the equipment table.

Now, if you ran these in the console, they will have worked fine, but if you put them into a single script and ran it - you will have gotten a SQL error complaining that the table ‘equipment’ did not exist. What is going on? You just created it, didn’t you?

The issue is synchronization!

Like most Node libraries, the sqlite3 module uses asynchronous functions for long-running and potentially blocking operations. Sending commands to a database that could potentially be used by many, many Node instances and even other programs definitely falls into this category. So when we run the above code, it tries to create the table, insert rows, and read rows more or less at the same time! Clearly that isn’t going to work (though if you used the console, you probably didn’t type fast enough for the previous command to still be running, which is why you wouln’t have seen the error).

Thankfully, the sqlite3 creators anticipated this problem and provide a function, sqlite3.serialize(), which pushes the database commands into a queue which is sent to the database one at a time, enforcing strict ordering:

db.serialize(function() { db.run("CREATE TABLE equipment (id INTEGER PRIMARY KEY, name TEXT, serial TEXT, image_url TEXT)"); db.run("INSERT INTO equipment (name, serial) values ('WiiMote', 'ABC22')"); db.run("INSERT INTO equipment (name, serial) values ('WiiMote', 'ABC23')"); db.run("INSERT INTO equipment (name, serial) values ('WiiMote', 'ABC24')"); db.each("SELECT id, name, serial FROM equipment", function(err, row) { if(err) return console.error(err); console.log(row); }); });

There is also a sister function that executes queries in parallel: sqlite3.parallelize() - but this is also the default operation for the sqlite3 library.

What About SQL Injection?

You may wonder why we used the question-mark syntax of sqlite3:

for(var i = 0; i < 10; i++) { db.run(“INSERT INTO equipment (name, serial) VALUES (?,?)”, ‘WiiMote’, i); }

instead of simply concatenating strings:

for(var i = 0; i < 10; i++) { db.run(“INSERT INTO equipment (name, serial) VALUES (‘WiiMote’,” + i +”)”); }

As you might imagine, these both accomplish the same thing. But the question-mark syntax does an additional step - it escapes any SQL code in the variables it is concatenating. This helps protect us from SQL injection attacks, where a user-supplied variable is used in the expression. Imagine that we were applying a user-supplied serial number in the command above, and a malicious user typed:

(SELECT FIRST(username) FROM users)

This would have the effect of setting the serial number to the first user’s username (typically an admin), and making it available to the user. There are many, many more exploits that can be done by passing SQL into a user-supplied field (known as SQL Injection). The bottom line is we should never supply unsanitized input to a SQL command. And while in our example, we didn’t have user-supplied input, it is still a very good idea to treat all non-static data in a query as potentially unsafe, so that we are in the habit of always guarding against it.

Exploits of a Mom

Exploits of a Mom, XKCD Web Comic, https://xkcd.com/327/