MY573 Half Unit
Managing and Visualising Data
This information is for the 2017/18 session.
Teacher responsible
Prof Kenneth Benoit COL8.11
Availability
This course is available on the PhD in Methodology. This course is available with permission as an outside option to students on other programmes where regulations permit.
This course is available to all research students where regulations permit.
Course content
The course will be divided into two halves.
The first five weeks will focus on data structures and databases, covering the principles of digital methods for storing
and structuring data, including data types, relational and non-relational database design, and query languages.
Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of
interest. The course will also cover workflow management for typical data transformation and cleaning projects,
frequently the starting point and most time-consuming part of any data science project.
This part of the course will introduce principles and applications of the electronic storage, structuring, manipulation,
transformation, extraction, and dissemination of data. This includes data types, database design, database
implementation, and data analysis through structured queries. Through joining operations, we will also cover the
challenges of data linkage and how to combine datasets from different sources. We begin by discussing concepts in
fundamental data types, and how data is stored and recorded electronically. We will cover database design,
especially relational databases, using substantive examples across a variety of fields. Students are introduced to SQL
through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn
to create, populate and query an SQL database. We will introduce NoSQL using MongoDB and the JSON data format
for comparison. For both types of database, students will be encouraged to work with data relevant to their own
interests as they learn to create, populate and query data. In the final section of this half of the course, we will
step through a complete workflow including data cleaning and transformation, illustrating many of the practical
challenges faced at the outset of any data analysis or data science project.
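The create, populate and query cycle described above, including a joining operation, can be sketched in a few lines. This is a minimal illustration only, not course material: it uses Python's built-in sqlite3 module in place of the MySQL server used on the course, and the table names, columns and figures are hypothetical.

```python
import sqlite3

# In-memory SQLite database. Assumption: SQLite stands in for MySQL here
# so that the sketch runs without a database server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: two related tables (hypothetical schema, for illustration only).
cur.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE surveys ("
    " id INTEGER PRIMARY KEY,"
    " country_code TEXT REFERENCES countries(code),"
    " year INTEGER,"
    " respondents INTEGER)"
)

# Populate: parameterised inserts keep the data separate from the SQL text.
cur.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [("GB", "United Kingdom"), ("IE", "Ireland")],
)
cur.executemany(
    "INSERT INTO surveys (country_code, year, respondents) VALUES (?, ?, ?)",
    [("GB", 2016, 1200), ("GB", 2017, 1350), ("IE", 2017, 900)],
)
conn.commit()

# Query: join the two tables on the shared key and aggregate per country.
rows = cur.execute(
    "SELECT c.name, SUM(s.respondents) AS total"
    " FROM surveys AS s"
    " JOIN countries AS c ON c.code = s.country_code"
    " GROUP BY c.name"
    " ORDER BY c.name"
).fetchall()

print(rows)  # [('Ireland', 900), ('United Kingdom', 2550)]
conn.close()
```

The join on country_code is a small instance of data linkage: the two tables could come from different sources, provided they share a consistently coded key.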
The second five weeks will focus on visualising data, starting with univariate and bivariate data, discussing the
advantages/disadvantages of some commonly used graphics, then turning to more sophisticated tools, including
three-dimensional tools, maps and interactive and dynamic graphics.
This part of the course will cover: data visualisation basics (history and classic examples; best practice for univariate
and bivariate data; image formats and resolution); data visualisation principles (cognition and human visual
perception; grammar of graphics; application to examples); design principles (graphic design; layout; visual style; titles and annotations; animations; interactive and dynamic graphics); statistical analysis and maps (binwidths/bandwidths
for histograms and kernel density estimation; regression diagnostics; maps).
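As an illustration of the binwidth topic listed above, the following sketch computes a histogram bin width with the Freedman-Diaconis rule (twice the interquartile range divided by the cube root of the sample size) and then counts observations per bin. The rule is one common choice and is an assumption here, not necessarily the method taught on the course; the function names and sample data are hypothetical.

```python
import math
import statistics


def freedman_diaconis_width(values):
    """Bin width h = 2 * IQR / n^(1/3) (Freedman-Diaconis rule)."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return 2 * (q3 - q1) / n ** (1 / 3)


def bin_counts(values, width):
    """Count observations in equal-width bins starting at the minimum.

    Assumes width > 0, i.e. the data have a non-zero interquartile range.
    """
    lo = min(values)
    n_bins = max(1, math.ceil((max(values) - lo) / width))
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample, for illustration only.
sample = [2.1, 2.4, 2.5, 3.0, 3.2, 3.3, 3.9, 4.1, 4.8, 5.0, 5.5, 7.2]
width = freedman_diaconis_width(sample)
counts = bin_counts(sample, width)
print(round(width, 3), counts)
```

A narrower width shows more local detail at the cost of noisier counts; a wider one smooths the shape. The bandwidth in kernel density estimation involves the same trade-off.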
Teaching
20 hours of lectures and 15 hours of lectures in the MT.
Formative coursework
Students will be expected to produce 6 problem sets in the MT.
Indicative reading
Wilkinson, Leland. The Grammar of Graphics, 2nd Ed., Springer, 2005.
Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis, Springer, 2009.
Cook, Dianne and Swayne, Deborah. Interactive and Dynamic Graphics for Data Analysis - with R and GGobi,
Springer, 2007.
Murray, Scott. Interactive Data Visualization for the Web, O'Reilly, 2013.
Assessment
Project (60%) and continuous assessment (40%) in the MT.
Four of the problem sets submitted by students weekly will be assessed (40% in total). In addition, there will be a
take-home exam (60%) in the form of an individual project in which they will demonstrate the ability to manage data
and visualise it through effective statistical graphics using principles they have learnt on the course. This may be
done by publishing the visualisation and code to a GitHub repository and GitHub Pages website.
Marking of these assessments will be at a level appropriate for PhD students. For the project, PhD students are expected to submit a more detailed project than what is expected of students taking the MSc-level course.
Key facts
Department: Methodology
Total students 2016/17: Unavailable
Average class size 2016/17: Unavailable
Value: Half Unit
Personal development skills
- Self-management
- Problem solving
- Application of information skills
- Communication
- Application of numeracy skills
- Specialist skills