Homework: ETL Street Poles

Overview

In this assignment, you will script a complete ETL process to import nonspatial data to PostGIS and create a spatial table, with some rudimentary normalization. The source data is the Street Poles dataset from OpenDataPhilly. Note that while this data is available as a shapefile, we are going to work with the CSV. One of the things you have to do in this assignment is construct the PostGIS geometries from the raw coordinates.

Begin by downloading http://data.phl.opendata.arcgis.com/datasets/9059a7546b6a4d658bef9ce9c84e4b03_0.csv, then use ogr2ogr to import this file into PostGIS. See further information on this step below.

The fields in this table, taken from the metadata, are listed below. Not listed are the x and y coordinate fields, which are included in the CSV but not the shapefile. Note that the field pole_num is described as a unique identifier, but it is not actually unique. The file contains duplicate records. Cleaning up the duplicate data is a common part of data cleaning, but would add considerable complexity to this assignment. If time permits I will discuss how this can be accomplished after the assignment is graded.

Your ETL process should be created in one SQL script. SQL statements should be properly terminated so that the script can be executed in one go. Do not use schema-qualified table names anywhere in your script. Instead, each script should begin with a SET search_path statement at the top (remember to include the public schema at the end!) so that it can easily be run in an alternate schema. I will run your scripts in the scratch shema. Your script should complete without any syntax errors.

The fields nlumin, lum_size, height, block, and plate have many NULL values. A large number of NULLs is usually a sign that the table is not properly normalized. According to the metadata, the nlumin, lum_size, and height fields all contain information about pole lighting, and the block and plate both reference “a map that was used to plot Alley (AEL) Poles”. Thus, you will create two related tables to hold this data.

The type column contains an acronym, expanded in the metadata below. You will create a lookup table with the acronym as its primary key and an additional column for the description.

The owner column contains owner names in a combination of descriptive categories (e.g. “Private”), acronyms (e.g. “SEPTA”, “PWD”), and abbreviated names (e.g. “U of Penn”). Mostly these are not good keys and subject to misspellings. You will replace this column with an owner_id column and create a lookup table populated with the names (this code is given to you below), then replace the owner names with the IDs from the lookup table.

Detailed Instructions

1. Use ogr2ogr to import Street_Poles.csv into PostGIS.

Refer to the lecture slides and ogr2ogr documentation to construct this command. Import the file to a staging table named import_street_poles. This is the temporary home for your data.

Use the following command line switches:

  • -lco PRECISION=NO: This will prevent ogr2ogr from constructing numerics for quantitative data, which is highly undesirable, or adding field size limits to varchar, which is not awful but kind of unnecessary.
  • -oo EMPTY_STRING_AS_NULL=YES (not demonstrated in class): this makes sure that empty strings (two commas in the CSV file with no characters between them) get imported as SQL NULLs.

Paste the ogr2ogr command that you construct in a comment block at the top of your SQL script.

Do not use data type autodetect (an ogr2ogr option). Do not use ogr2ogr to create geometries from the coordinates during the import. This is an extremely useful feature of ogr2ogr, but I want you to get practice creating geometries in SQL.

If you cannot complete this step, use a GUI tool such as QGIS to import the CSV, and continue with the rest of the assignment.

2. Create a pole_owner lookup table.

Run the following commands to create a lookup table based on the data in the owner column:

DROP TABLE IF EXISTS pole_owner CASCADE;
CREATE TABLE pole_owner (
    pole_owner_id serial PRIMARY KEY,
    pole_owner varchar
);

INSERT INTO pole_owner (pole_owner)
SELECT DISTINCT owner FROM import_street_pole;

As this table does not depend on any other table, you should be able to run this DDL without any problems.

3. Create a pole_type lookup table.

Create a pole_type table with a pole_type varchar primary key column, and a description varchar column. Use INSERT statements to populate this table with rows taken from the metadata (given below) for the TYPE field in the original data. Note that two codes are missing. (It is not uncommon for metadata to be incomplete in this way.) When you attempt to insert rows into the street_pole table in step 6, you will get foreign key constraint errors. Using these error messages, you will be able to identify the missing codes and add them to your INSERTs.

Refer to shakespeare_insert.sql, or your own work on the SQL INSERTs exercise for examples on how to populate a newly created table.

4. Create the main street_pole table.

Write a create table statements for the street_pole table. You will find it easier to test your script if you include the following line immediately prior to the CREATE TABLE:

DROP TABLE IF EXISTS street_pole CASCADE;

This will let you drop and create the table in one execute, without having to drop the table separately or check to see whether it exists.

Create the table as follows:

  • Discard the ogc_fid column, or any other ID column created during import.
  • Discard the x and y coordinate columns, as you will be constructing a geometry instead.
  • Rename the objectid column to gid. Make this a serial primary key column.
  • Discard the oid (or oid_) column.
  • Retain the pole_numcolumn.
  • Rename the type column to pole_type, since type is a SQL keyword.
  • Do not include the columns that will be moved to pole_light_info (see previous step).
  • As pole_date and up_date are the same for all poles, discard the up_date column.
  • Rename the owner column to pole_owner_id.
  • Retain the tap_id column.
  • Do not include the columns that will be moved to alley_pole (see previous step).
  • Add a geom column. This should be of type geometry(POINT, 2272), which corresponds to Pennsylvania State Plane South.

All columns should be of appropriate types. The types should be obvious from perusal of the data in import_street_pole, i.e. there are no integer columns that have a string value like 'missing' hidden somewhere deep in the table.

Additionally, the pole_owner_id and pole_type columns should be foreign keys to the pole_owner and pole_type tables.

5. Create related tables for lighting and alley pole info.

Construct two CREATE TABLE statements for tables named pole_light_info and alley_pole. Each table should have an int primary key column named gid. pole_light_info should have three int columns named nlumin, lumin_size, and height. alley_pole should have two varchar fields named block and plate.

6. Transform and insert the data into street_pole.

Use an INSERT statement to select data from import_street_pole and insert it into street_pole. Use the instructions from the last step as your guide. For example, where the instructions say “Rename the objectid column to gid“, this means that gid should appear in your target column list and objectid should appear in your SELECT list.

Columns will have to be transformed to the appropriate data type. Use the date type (rather than timestamp) for the pole_date column, since the times are all midnight (which really means “We don’t care about the time, we only care about the date.”).

Use ST_Point to construct geometries from the x and y coordinates. The coordinates are in WGS84 lat-long. Therefore to complete this step you will need to nest three functions:

  • ST_Point to construct the geometry.
  • ST_SetSRID to assign the CRS of the source data (i.e. the CRS of the coordinates as imported).
  • ST_Transform to transform the data to the target CRS.

Insert the pole_owner_id (int), rather than the owner (text) into the table. To do this, you need to use a correlated subquery to return the pole_owner_id from the pole_owner table. Refer to:

  • Scalar Subqueries in the Posgres documentation.
  • “What is a correlated subquery?” in PostGIS in Action, p. 498.
  • “Correlated Subquery” in the lecture slides, slide #145.

7. Insert related data into pole_light_info and alley_pole.

Insert the appropriate columns from import_street_pole into pole_light_info and alley_pole. Remember to include the gid column, as this is the foreign key that links the tables. Remember to transform the data types of the columns where necessary.

The point of doing this is to remove the large number of NULL values from the tables, so your INSERT…SELECT statement for pole_light_info, should only include those rows where the additional columns are not NULL. (If any of the three columns are NULL, they all are, so you can check any one of them for the NOT NULL criteria.) Similarly, for inserting into alley_pole, only include those rows where the additional columns are not NULL.

Grading Rubric

  • (1 point) Submit a SQL script in one file which runs from start to finish with no errors.
  1. (1 points) Create the ogr2ogr command to import Street_Poles.csv into PostGIS.
  2. (0 points) Create a pole_owner lookup table.
  3. (1 point) Create a pole_type lookup table.
  4. (2 point) Create the main street_pole table.
  5. (1 points) Create related tables for the lighting info and the alley pole info.
  6. (3 points) Transform and insert the data into street_pole.
    • 1 point for correctly constructing the point geometry.
    • 1 point for using a correlated subquery to populate the pole_owner_id column. If you find that you are unable to construct the correlated subquery, change the data type of the pole_owner_id column back to varchar (Step 4) and insert the owner data directly.
    • 1 point for correctly populating all other columns.
  7. (1 point) Insert related data into pole_light_info and alley_pole.

Metadata

Source: http://metadata.phila.gov/#home/datasetdetails/555f8135f15fcb6c6ed4413a/representationdetails/5571b1c4e4fb1d91393c2183/?view_287_page=1

+------------+--------------------------------------------------------------------------------------------------+---------+
| Field Name | Description                                                                                      | Type    |
+------------+--------------------------------------------------------------------------------------------------+---------+
| BLOCK      | References a map that was used to plot Alley (AEL) Poles                                         | Text    |
+------------+--------------------------------------------------------------------------------------------------+---------+
| HEIGHT     | light height                                                                                     | Integer |
+------------+--------------------------------------------------------------------------------------------------+---------+
| LUM_SIZE   | light wattages                                                                                   | Integer |
+------------+--------------------------------------------------------------------------------------------------+---------+
| NLUMIN     | light lumin count                                                                                | Integer |
+------------+--------------------------------------------------------------------------------------------------+---------+
| OBJECTID   |                                                                                                  | Numeric |
+------------+--------------------------------------------------------------------------------------------------+---------+
| OID        |                                                                                                  | Integer |
+------------+--------------------------------------------------------------------------------------------------+---------+
| OWNER      | pole owner                                                                                       | Text    |
+------------+--------------------------------------------------------------------------------------------------+---------+
| PLATE      | References a map that was used to plot Alley (AEL) Poles                                         | Text    |
+------------+--------------------------------------------------------------------------------------------------+---------+
| POLE_DATE  | creation date                                                                                    | Date    |
+------------+--------------------------------------------------------------------------------------------------+---------+
| POLE_NUM   | Unique identifier                                                                                | Integer |
+------------+--------------------------------------------------------------------------------------------------+---------+
| TAP_ID     | identifies the poles tap pole                                                                    | Integer |
+------------+--------------------------------------------------------------------------------------------------+---------+
| TYPE       | Acronym that describes the pole type and style. Contact GIS Data Contact below for translations. | Text    |
|            |                                                                                                  |         |
|            | * AAMP - Avenue of the Arts Mast Arm Pole                                                        |         |
|            | * AAPT - Avenue of the Arts Street Light Pole                                                    |         |
|            | * AEL - Alley Pole                                                                               |         |
|            | * C13 - Traffic C Post 13'                                                                       |         |
|            | * C20 - Traffic C Post 20'                                                                       |         |
|            | * CCP - Center City Pedestrian Pole                                                              |         |
|            | * CCR - Center City Roadway Pole                                                                 |         |
|            | * CHP - Chestnut Hill Street Light Pole                                                          |         |
|            | * D30 - Traffic D Pole                                                                           |         |
|            | * FLP - Franklin Light Pole                                                                      |         |
|            | * MAP - Traffic Mast Arm Pole                                                                    |         |
|            | * MAPT - Traffic Mast Arm Pole (Twin Light)                                                      |         |
|            | * MLP - Millenium Light Pole                                                                     |         |
|            | * MLPT - Millenium Light Pole (Twin Light) * OTH - Other                                         |         |
|            | * OTHT - Other (Twin Light)                                                                      |         |
|            | * PDT - PennDot Pole                                                                             |         |
|            | * PDTT - PennDot Pole (Twin Light)                                                               |         |
|            | * PKY - Parkway Street Light Pole                                                                |         |
|            | * PKYT - Parkway Street Light Pole (Twin Pole)                                                   |         |
|            | * PTA - Post Top Pole (Aluminum)                                                                 |         |
|            | * PTC - Post Top Pole (Concrete)                                                                 |         |
|            | * PTF - Post Top Pole (Fiberglass)                                                               |         |
|            | * PVT - Private Pole                                                                             |         |
|            | * PVTT - Private Pole (Twin Light)                                                               |         |
|            | * RP - Radar Traffic Pole                                                                        |         |
|            | * SLA - Street Light Aluminum                                                                    |         |
|            | * SLAT - Street Light Aluminum (Twin Light)                                                      |         |
|            | * SLF - Street Light Fiberglass                                                                  |         |
|            | * SLFT - Street Light Fiberglass (Twin Light)                                                    |         |
|            | * SM - Structure Mounted                                                                         |         |
|            | * SMP - Strawberry Mansion Bridge Pole                                                           |         |
|            | * SNP - Sign Pole                                                                                |         |
|            | * SWP - Span Wire Pole                                                                           |         |
|            | * TP - Trolley Pole                                                                              |         |
|            | * WP - Wood Pole                                                                                 |         |
|            | * WPT - Wood Pole (Twin Light)                                                                   |         |
|            | * CTP - Chinatown Pole                                                                           |         |
|            | * SSP - South Street Tower Pole                                                                  |         |
|            | * SMB - Septa Millennia Bridge Pole                                                              |         |
|            | * JMB - JFK Blvd Millennia Bridge Pole                                                           |         |
|            | * MMB - Market St Millennia Bridge Pole                                                          |         |
|            | * CMB - Chestnut St Millennia Bridge Pole                                                        |         |
|            | * WMB - Walnut St Millennia Bridge Pole                                                          |         |
|            | * SMAB - Strawberry Mansion Arch Bridge Pole                                                     |         |
|            | * FB - Falls Bridge Pole                                                                         |         |
|            | * RLP - Red Light Camera Pole                                                                    |         |
|            | * TCB - Traffic Control Box                                                                      |         |
|            | * AASP - Avenue of the Arts Signal Pole                                                          |         |
|            | * CP - Carrier Pole                                                                              |         |
+------------+--------------------------------------------------------------------------------------------------+---------+
| UP_DATE    | data update date                                                                                 | Date    |
+------------+--------------------------------------------------------------------------------------------------+---------+