Combining Data

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • How can I combine data from multiple tables?

Objectives
  • Explain the operation of a query that joins two tables.

  • Explain how to restrict the output of a query containing a join to only include meaningful combinations of values.

  • Write queries that join tables on equal keys.

  • Explain what primary and foreign keys are, and why they are useful.

In order to submit our data to a web site that aggregates historical meteorological data, we might need to format it as latitude, longitude, date, quantity, and reading. However, our latitudes and longitudes are in the Site table, while the dates of measurements are in the Visit table and the readings themselves are in the Measurement table. We need to combine these tables somehow.

This figure shows the relations between the tables:

Survey Database Structure

The SQL command to do this is JOIN. To see how it works, let’s start by joining the Site and Visit tables:

SELECT * FROM Site JOIN Visit;
site_name lat long visit_id site_name visit_date
DR-1 -49.85 -128.57 619 DR-1 1927-02-08
DR-1 -49.85 -128.57 622 DR-1 1927-02-10
DR-1 -49.85 -128.57 734 DR-3 1930-01-07
DR-1 -49.85 -128.57 735 DR-3 1930-01-12
DR-1 -49.85 -128.57 751 DR-3 1930-02-26
DR-1 -49.85 -128.57 752 DR-3 -null-
DR-1 -49.85 -128.57 837 MSK-4 1932-01-14
DR-1 -49.85 -128.57 844 DR-1 1932-03-22
DR-3 -47.15 -126.72 619 DR-1 1927-02-08
DR-3 -47.15 -126.72 622 DR-1 1927-02-10
DR-3 -47.15 -126.72 734 DR-3 1930-01-07
DR-3 -47.15 -126.72 735 DR-3 1930-01-12
DR-3 -47.15 -126.72 751 DR-3 1930-02-26
DR-3 -47.15 -126.72 752 DR-3 -null-
DR-3 -47.15 -126.72 837 MSK-4 1932-01-14
DR-3 -47.15 -126.72 844 DR-1 1932-03-22
MSK-4 -48.87 -123.4 619 DR-1 1927-02-08
MSK-4 -48.87 -123.4 622 DR-1 1927-02-10
MSK-4 -48.87 -123.4 734 DR-3 1930-01-07
MSK-4 -48.87 -123.4 735 DR-3 1930-01-12
MSK-4 -48.87 -123.4 751 DR-3 1930-02-26
MSK-4 -48.87 -123.4 752 DR-3 -null-
MSK-4 -48.87 -123.4 837 MSK-4 1932-01-14
MSK-4 -48.87 -123.4 844 DR-1 1932-03-22

JOIN creates the cross product of two tables, i.e., it joins each record of one table with each record of the other table to give all possible combinations. Since there are three records in Site and eight in Visit, the join’s output has 24 records (3 * 8 = 24) . And since each table has three fields, the output has six fields (3 + 3 = 6).

What the join hasn’t done is figure out if the records being joined have anything to do with each other. It has no way of knowing whether they do or not until we tell it how. To do that, we add a clause specifying that we’re only interested in combinations that have the same site name, thus we need to use a filter:

SELECT
  *
FROM
  Site
  JOIN Visit ON Site.site_name = Visit.site_name;
site_name lat long visit_id site_name visit_date
DR-1 -49.85 -128.57 619 DR-1 1927-02-08
DR-1 -49.85 -128.57 622 DR-1 1927-02-10
DR-1 -49.85 -128.57 844 DR-1 1932-03-22
DR-3 -47.15 -126.72 734 DR-3 1930-01-07
DR-3 -47.15 -126.72 735 DR-3 1930-01-12
DR-3 -47.15 -126.72 751 DR-3 1930-02-26
DR-3 -47.15 -126.72 752 DR-3 -null-
MSK-4 -48.87 -123.4 837 MSK-4 1932-01-14

ON is very similar to WHERE, and for all the queries in this lesson you can use them interchangeably. There are differences in how they affect outer joins, but that’s beyond the scope of this lesson. Once we add this to our query, the database manager throws away records that combined information about two different sites, leaving us with just the ones we want.

Notice that we used Table.field to specify field names in the output of the join. We do this because tables can have fields with the same name, and we need to be specific which ones we’re talking about. For example, if we joined the Person and Visit tables, the result would inherit a field called id from each of the original tables.

We can now use the same dotted notation to select the three columns we actually want out of our join:

SELECT
  Site.lat,
  Site.long,
  Visit.visit_date
FROM
  Site
  JOIN Visit ON Site.site_name = Visit.site_name;
lat long visit_date
-49.85 -128.57 1927-02-08
-49.85 -128.57 1927-02-10
-49.85 -128.57 1932-03-22
-47.15 -126.72 -null-
-47.15 -126.72 1930-01-07
-47.15 -126.72 1930-01-12
-47.15 -126.72 1930-02-26
-48.87 -123.4 1932-01-14

If joining two tables is good, joining many tables must be better. In fact, we can join any number of tables simply by adding more JOIN clauses to our query, and more ON tests to filter out combinations of records that don’t make sense:

SELECT
  Site.lat,
  Site.long,
  Visit.visit_date,
  Measurement.type,
  Measurement.value
FROM
  Site
  JOIN Visit
  JOIN Measurement ON Site.site_name = Visit.site_name
  AND Visit.visit_id = Measurement.visit_id
  AND Visit.visit_date IS NOT NULL;
lat long visit_date type value
-49.85 -128.57 1927-02-08 rad 9.82
-49.85 -128.57 1927-02-08 sal 0.13
-49.85 -128.57 1927-02-10 rad 7.8
-49.85 -128.57 1927-02-10 sal 0.09
-47.15 -126.72 1930-01-07 rad 8.41
-47.15 -126.72 1930-01-07 sal 0.05
-47.15 -126.72 1930-01-07 temp -21.5
-47.15 -126.72 1930-01-12 rad 7.22
-47.15 -126.72 1930-01-12 sal 0.06
-47.15 -126.72 1930-01-12 temp -26.0
-47.15 -126.72 1930-02-26 rad 4.35
-47.15 -126.72 1930-02-26 sal 0.1
-47.15 -126.72 1930-02-26 temp -18.5
-48.87 -123.4 1932-01-14 rad 1.46
-48.87 -123.4 1932-01-14 sal 0.21
-48.87 -123.4 1932-01-14 sal 22.5
-49.85 -128.57 1932-03-22 rad 11.25

We can tell which records from Site, Visit, and Measurement correspond with each other because those tables contain primary keys and foreign keys. A primary key is a value, or combination of values, that uniquely identifies each record in a table. A foreign key is a value (or combination of values) from one table that identifies a unique record in another table. Another way of saying this is that a foreign key is the primary key of one table that appears in some other table. In our database, Person.person_id is the primary key in the Person table, while Measurement.person_id is a foreign key relating the Measurement table’s entries to entries in Person.

Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like “student numbers” and “patient numbers”, and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite automatically numbers records as they’re added to tables, and we can use those record numbers in queries:

SELECT rowid, * FROM Person;
rowid person_id personal_name family_name
1 dyer William Dyer
2 pb Frank Pabodie
3 lake Anderson Lake
4 roe Valentina Roerich
5 danforth Frank Danforth

Listing Radiation Readings

Write a query that lists all radiation readings from the DR-1 site.

Solution

SELECT
   Measurement.value
FROM
   Measurement
   JOIN
      Visit
      ON Visit.visit_id = Measurement.visit_id
WHERE
   Visit.site_name = 'DR-1'
   AND Measurement.type = 'rad';
value
9.82
7.8
11.25

Where’s Frank?

Write a query that lists all sites visited by people named “Frank”.

Solution

SELECT
  DISTINCT Visit.site_name
FROM
  Visit
  JOIN Measurement
  JOIN Person ON
    Visit.visit_id = Measurement.visit_id
    AND Measurement.person_id = Person.person_id
  WHERE
    Person.personal_name = 'Frank';
name
DR-3

Reading Queries

Describe in your own words what the following query produces:

SELECT Site.site_name FROM Site JOIN Visit
ON Site.lat < -49.0 AND Site.site_name = Visit.site_name AND Visit.visit_date >= '1932-01-01';

Who Has Been Where?

Write a query that shows each site with exact location (lat, long) ordered by visited date, followed by personal name and family name of the person who visited the site and the type of measurement taken and its reading. Please avoid all null values. Tip: you should get 15 records with 8 fields.

Solution

SELECT Site.site_name, Site.lat, Site.long, Person.personal_name, Person.family_name, Measurement.type, Measurement.value, Visit.visit_date
FROM
   Site
   JOIN
      Visit
   JOIN
      Measurement
   JOIN
      Person
      ON Site.site_name = Visit.site_name
      AND Visit.visit_id = Measurement.visit_id
      AND Measurement.person_id = Person.person_id
WHERE
   Measurement.person_id IS NOT NULL
   AND Visit.visit_date IS NOT NULL
ORDER BY
   Visit.visit_date;
site_name lat long personal_name family_name type value visit_date
DR-1 -49.85 -128.57 William Dyer rad 9.82 1927-02-08
DR-1 -49.85 -128.57 William Dyer sal 0.13 1927-02-08
DR-1 -49.85 -128.57 William Dyer rad 7.8 1927-02-10
DR-1 -49.85 -128.57 William Dyer sal 0.09 1927-02-10
DR-3 -47.15 -126.72 Anderson Lake sal 0.05 1930-01-07
DR-3 -47.15 -126.72 Frank Pabodie rad 8.41 1930-01-07
DR-3 -47.15 -126.72 Frank Pabodie temp -21.5 1930-01-07
DR-3 -47.15 -126.72 Frank Pabodie rad 7.22 1930-01-12
DR-3 -47.15 -126.72 Anderson Lake sal 0.1 1930-02-26
DR-3 -47.15 -126.72 Frank Pabodie rad 4.35 1930-02-26
DR-3 -47.15 -126.72 Frank Pabodie temp -18.5 1930-02-26
MSK-4 -48.87 -123.4 Anderson Lake rad 1.46 1932-01-14
MSK-4 -48.87 -123.4 Anderson Lake sal 0.21 1932-01-14
MSK-4 -48.87 -123.4 Valentina Roerich sal 22.5 1932-01-14
DR-1 -49.85 -128.57 Valentina Roerich rad 11.25 1932-03-22

For a visual explanation of joins see the SQL Joins Visualizer.

Key Points

  • Use JOIN to combine data from two tables.

  • Use table.field notation to refer to fields when doing joins.

  • Every fact should be represented in a database exactly once.

  • A join produces all combinations of records from one table with records from another.

  • A primary key is a field (or set of fields) whose values uniquely identify the records in a table.

  • A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.

  • We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.

  • The most common join condition is matching keys.