Overview
For the final assignment, you need to identify a publicly available data set, upload it to PostGIS, test alternative database designs, and combine it with existing data to answer an interesting spatial question.
For example, imagine that you want to investigate a potential correlation between crime and alcohol-serving establishments, as some students of mine did in a previous course. You will probably start with tables of all business locations, which may or may not be geocoded. You may have crimes in aggregate (reported by police precinct, council district, ZIP code…) or maybe you will have individual crime locations. The data may already be spatial, or you may have to spatialize it.
Your workflow may involve the following.
- Upload the data.
- Geocode by address (possibly before uploading), geolocate by coordinates, or join by geographic identifier to an existing spatial layer.
- Join to other appropriate layers. For example, the crimes data should be joined to a population layer (such as census tracts) to turn raw numbers into a crime rate. One approach would be a point-in-polygon operation of crimes in census tracts (a sketch of this appears after this list).
- Use intersections and buffers to determine the crime rate inside and outside of a reasonable target distance.
- Connect to the layers via client applications (GIS software or statistical software) for visualization.
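For instance, a point-in-polygon crime rate might look like the minimal sketch below, which assumes hypothetical tables named crimes (point geometries) and tracts (polygon geometries with a population column); your table and column names will differ.

```sql
-- Hypothetical point-in-polygon crime rate: crimes per 1,000 residents by tract.
SELECT
    t.tract_id,
    count(c.*) AS crime_count,
    count(c.*)::numeric / NULLIF(t.population, 0) * 1000 AS crimes_per_1000
FROM tracts t
LEFT JOIN crimes c
    ON ST_Contains(t.geom, c.geom)
GROUP BY t.tract_id, t.population;
```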
The final assignment has three parts: database design, Extract-Transform-Load (ETL), and analysis. The first two parts are more important. The analysis piece is primarily there to motivate your thinking about how the data will be used.
The requirements and specific deliverables are described in detail below.
The final assignment is worth 50 points. Each milestone is worth the following:
- Milestone 1: 5 pts.
- Milestone 2: 10 pts. You must complete milestones 1 & 2 to earn a D in this course.
- Milestone 3: 10 pts. You must complete this milestone to earn a C in this course.
- Milestone 4: 10 pts. You must complete this milestone to earn a B in this course.
- Milestone 5: 15 pts. You must complete this milestone to earn higher than a B (anything in the B+/A-/A range) in this course.
Milestone 1: Proof of Concept (5 points)
This milestone involves obtaining data, demonstrating familiarity with it, loading it into PostgreSQL, and suggesting analytical questions that it could be used to answer.
To begin, you must identify a dataset that you will use for your final project. This should be a vector dataset* that is publicly available. Examples include business registrations, cadastral data (i.e., property data maintained by an assessor’s office), crime data, sales data, survey data from any number of government or private sources, etc. It may not be US Census data (either Decennial Census or American Community Survey). It must be in a spatial format (e.g., shapefile, geodatabase, GeoJSON) or in a form that can be spatialized (e.g., business addresses which can be geocoded, survey data which can be joined to ZIP Codes, etc.).
Some data sources are more complex than others. Some will be large flat files, while others will be split into multiple tables. The multiple tables may represent annual releases, or may be a rudimentary normalization. The data may also be split merely to make file sizes smaller for distribution purposes. If you find that you are struggling with this milestone, you might have to consider finding a less complex dataset.
This is your project, and you need to familiarize yourself with the dataset. Public data (especially data intended for research purposes) should come with a data dictionary, which lists the data type, meaning, and allowable values (the domain) of all variables. Some datasets will come with hundreds or thousands of columns, and you will not be expected to know what they all mean, but you need to know the important ones; part of getting familiar with your data is determining which ones those are.
Import your data “as is” into PostgreSQL. Most likely your data are not normalized; do not normalize them at this stage. Do not create a surrogate key. Do create a primary key if suitable column(s) exist. Do launder column names (lowercase, with no spaces or special characters).
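If your data arrive as a flat CSV rather than a spatial format, a plain COPY is usually enough for the “as is” load. The sketch below assumes a hypothetical businesses.csv with the columns shown; spatial formats are instead loaded with a tool such as shp2pgsql or ogr2ogr.

```sql
-- Hypothetical "as is" import of a flat CSV file.
CREATE TABLE businesses (
    license_id  text PRIMARY KEY,  -- only because a suitable natural key happens to exist
    name        text,
    address     text,
    category    text,
    longitude   double precision,
    latitude    double precision
);

-- COPY reads a file on the database server; use psql's \copy if the file is on your own machine.
COPY businesses FROM '/path/to/businesses.csv' WITH (FORMAT csv, HEADER true);
```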
If your data are not already spatial (that is, if you are not importing from a spatial format), make your data spatial:
a. If the data has coordinates, create a geometry column based on the coordinates (see the sketch after this list).
b. If the data has an address, geocode it to obtain coordinates, then create a geometry column based on the coordinates.
c. If the data can be georeferenced (it has an ID or name that corresponds to a geographic entity such as a county, ZIP code, school, etc.), identify an additional, spatial data source for the geographic entity, and include that in your import and your ERD.
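For case (a), the usual pattern is to add a typed geometry column and populate it from the coordinate columns. This sketch continues the hypothetical businesses table above and assumes the coordinates are WGS 84 longitude/latitude (EPSG:4326); check your data dictionary for the actual coordinate system.

```sql
-- Hypothetical spatialization from coordinate columns.
ALTER TABLE businesses ADD COLUMN geom geometry(Point, 4326);

UPDATE businesses
SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
WHERE longitude IS NOT NULL
  AND latitude IS NOT NULL;
```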
Deliverables:
A report with the following:
- A description of your data, including the geometry type (as is or after joining or geocoding) and the variables or data dictionary. If there are a large number of variables (more than 20), describe categories of information rather than listing them all. Provide some description of the variables you are interested in analyzing. Include the data source (publisher, institution, research team) and geographic and temporal coverage (e.g., Africa, 1960 – Present).
- An ERD for the data “as is”. Show the primary key if a suitable key exists. If the data are already (partially) normalized, show relationships between tables. If the data are nonspatial and you have identified a suitable geographic data source (see 2c), show the relationship in the ERD.
- Three examples of spatial questions which you would be able to answer using this data.
- One SQL statement with a WHERE clause which selects no more than 10 records from your data, along with the query result. Do not use a LIMIT clause.
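The point of the last deliverable is that the WHERE clause itself, not a LIMIT, restricts the output. A hypothetical example against the businesses table sketched above:

```sql
-- The filter must be selective enough on its own to return 10 or fewer rows.
SELECT name, address, category
FROM businesses
WHERE category = 'Distillery'
  AND address LIKE '%Main St%';
```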
* If you are already quite familiar with raster data, and it will be useful to your future career development to work with raster data, I will consider the possibility of a final project involving a raster dataset, but you must clear it with me ahead of time.
Milestone 2: Normalization Plan (10 points)
Make your data conform to 3NF.
If your data has coded categorical data, you should create lookup tables based on the data dictionary. If your data has categorical data which is not coded, you should create lookup tables and replace the values in the “base” table with the codes.
For example, assume you have a table with a column named education. This could appear in your data in two ways:
- There may be numeric codes 1-4, which the data dictionary indicates mean “Less than high school”, “High school or GED”, “Bachelor’s degree”, “Graduate degree”. You would need to create a lookup table with the codes 1-4 and the values they translate to.
- The table may already contain the values “Less than high school”, etc. In this case, you still need to create the lookup table, but you also need to update the base table and replace all instances of “Less than high school” with 1, all instances of “High school or GED” with 2, etc.
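A minimal sketch of the second (uncoded) case, assuming a hypothetical base table named respondents with a text column named education:

```sql
-- Create and populate the lookup table.
CREATE TABLE education_lookup (
    education_code  integer PRIMARY KEY,
    education_label text NOT NULL UNIQUE
);

INSERT INTO education_lookup VALUES
    (1, 'Less than high school'),
    (2, 'High school or GED'),
    (3, 'Bachelor''s degree'),
    (4, 'Graduate degree');

-- Add a coded column to the base table, fill it from the lookup, and drop the text column.
ALTER TABLE respondents ADD COLUMN education_code integer
    REFERENCES education_lookup (education_code);

UPDATE respondents r
SET education_code = l.education_code
FROM education_lookup l
WHERE r.education = l.education_label;

ALTER TABLE respondents DROP COLUMN education;
```

In the first (already coded) case, only the lookup table and the foreign key reference are needed.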
The only deliverable is the ERD of the normalized tables.
Milestone 3: Normalization Scripts (10 points)
After your normalization plan is approved, you need to actually implement it. This involves creating SQL scripts which create all necessary tables, primary keys, foreign key references, etc. Lookup tables must be populated with data. Columns which are no longer necessary due to normalization should be dropped. Data may need to be moved from the base table to related tables.
There is a significant danger of wasting a lot of time implementing a poorly conceived normalization plan. Do not begin this task until your normalization plan is approved (Milestone 2).
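One common pattern in such a script is moving repeated attribute values from the base table into their own related table. The sketch below uses hypothetical owner columns on the businesses table from Milestone 1; wrapping the steps in a transaction makes it easy to rerun from scratch if something fails.

```sql
BEGIN;

-- Build the related table from the distinct values in the base table.
CREATE TABLE owners (
    owner_id    serial PRIMARY KEY,
    owner_name  text NOT NULL,
    owner_phone text,
    UNIQUE (owner_name, owner_phone)
);

INSERT INTO owners (owner_name, owner_phone)
SELECT DISTINCT owner_name, owner_phone
FROM businesses
WHERE owner_name IS NOT NULL;

-- Point the base table at the new table, then drop the now-redundant columns.
ALTER TABLE businesses ADD COLUMN owner_id integer REFERENCES owners (owner_id);

UPDATE businesses b
SET owner_id = o.owner_id
FROM owners o
WHERE b.owner_name = o.owner_name
  AND b.owner_phone IS NOT DISTINCT FROM o.owner_phone;

ALTER TABLE businesses
    DROP COLUMN owner_name,
    DROP COLUMN owner_phone;

COMMIT;
```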
Deliverables:
- One multi-statement SQL script with the necessary DDL and DML to create the final, normalized schema from your data as originally imported.
- Three queries using table joins and WHERE clauses that each return 10 rows or fewer.
Milestone 4: Analysis and Optimization (10 points)
Construct three spatial analytical queries based on your data, each representing an interesting real-world question that could be answered with this data. These queries may combine the data with other spatial datasets. In order to focus on your primary dataset of interest, you should (a) choose a second dataset that is not overly complex; (b) coordinate with another student who is working with data that you would like to analyze, and share your results from Milestones 1 – 3 with each other; or (c) upload your data to Kropotkin and use a dataset available there.
After constructing the analytical queries, time them and think about how their speed can be improved. You should consider the following options:
- Rewrite the query, using the query planner to investigate where the time is being spent.
- Index columns that are included in joins or filters.
- Denormalize your tables.
Regardless of whether you find indexing to be helpful for these particular queries, assess your tables for possible improvements using indexing. Which columns will frequently be used in joins or WHERE clause criteria? Create all indexes that you think are appropriate.
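A minimal sketch of timing and indexing, reusing the hypothetical tables from earlier milestones. EXPLAIN ANALYZE runs the query and reports actual execution time, which is a convenient way to capture before-and-after timings (psql's \timing setting is another).

```sql
-- Time a spatial join before and after optimization.
EXPLAIN ANALYZE
SELECT t.tract_id, count(c.*) AS crime_count
FROM tracts t
JOIN crimes c ON ST_Contains(t.geom, c.geom)
GROUP BY t.tract_id;

-- Spatial predicates such as ST_Contains benefit from GiST indexes on the
-- geometry columns; ordinary join and filter columns use the default B-tree.
CREATE INDEX crimes_geom_idx    ON crimes USING gist (geom);
CREATE INDEX tracts_geom_idx    ON tracts USING gist (geom);
CREATE INDEX businesses_cat_idx ON businesses (category);
```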
Deliverables:
A report with the following:
- Discuss how you optimized the queries, with timings before and after optimization. The report should include the queries before and after optimization.
- Discuss your indexing assessment, including which columns you chose to index and why. The report should include the CREATE INDEX statements.
Milestone 5: Final Report (15 points)
The final report is a tutorial-style report describing the dataset, the kinds of analytical questions it can be used to answer, and how to work with it. Much of the work from previous milestones will be repurposed here, but some of it will be briefer and some of it will need to be expanded. The report should be conversational, and assume an audience that is familiar with GIS and basic SQL, but not necessarily familiar with the ins and outs of PostGIS or spatial SQL.
The report should include some visualizations (constructed in a client software such as QGIS or R) of the data, and the analytical results.
The report should be structured so that DDL is relegated to appendices, while the body of the report describes the ETL process at a high level and goes into detail on the analytical process.
The report should include the following:
- Describe the dataset generally, including its source and purpose. Describe some of the spatial questions you will use it to answer. If you combine your main dataset with other data for your analytical queries, describe those additional data sources, but more briefly.
- Describe the structure of the data and your normalization process. Provide an ERD of the final schema.
- Discuss optimizations such as index construction or denormalization.
- Provide the 3 analytical queries that you developed for Milestone 4. Use only the final, optimized version of each query. For each query, begin by describing the question you are trying to answer, followed by the query itself, followed by a discussion of how the query works. The discussion of the query operation should assume the reader understands basic SQL but not spatial SQL. Explain any subqueries, CTEs, spatial functions or operators, and complex joins.
- Your ETL process should be documented in appendices:
a. One appendix should provide a detailed explanation of how to obtain and load the data.
b. A second appendix should include your normalization scripts submitted as Milestone 3.
c. A third appendix should include scripts for any further transformations of your data, such as denormalizations or index construction.
d. A final appendix should include your data dictionary (summary of all columns, by table, including data type and extremely brief description; make sure to include units for numerical data).
The report should be written in Markdown. All queries should appear formatted as code. Do not submit a report with screenshots of any SQL queries.