MY572      Half Unit
Data for Data Scientists

This information is for the 2020/21 session.

Teacher responsible

Friedrich Geiecke

Availability

This course is available on the MPhil/PhD in Social Research Methods. This course is available with permission as an outside option to students on other programmes where regulations permit.

Course content

This course will cover the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest. The course will also cover workflow management for typical data transformation and cleaning projects, which are frequently the starting point and most time-consuming part of any data science project. This course uses a project-based learning approach towards the study of online publishing and group-based collaboration, essential ingredients of modern data science projects. The coverage of data sharing will include key skills in online publishing, including the elements of web design, the technical elements of web technologies and web programming, as well as the use of revision-control and group collaboration tools such as GitHub. Each student will build one or more interactive websites based on content relevant to their domain-related interests, and will use GitHub for accessing and submitting course materials and assignments.

In this course, we introduce principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, database design, database implementation, and data analysis through structured queries. Through joining operations, we will also cover the challenges of data linkage and how to combine datasets from different sources. We begin by discussing concepts in fundamental data types, and how data is stored and recorded electronically. We will cover database design, especially relational databases, using substantive examples across a variety of fields. Students are introduced to SQL through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will introduce NoSQL using MongoDB and the JSON data format for comparison. For both types of database, students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data. In the final section of the data part of the course, we will step through a complete workflow including data cleaning and transformation, illustrating many of the practical challenges faced at the outset of any data analysis or data science project.
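
As a rough illustration of the create, populate, query and join steps described above, the following minimal sketch uses Python's built-in sqlite3 module rather than MySQL or MongoDB, so that it runs without a database server. The table names, column names and values are hypothetical and are not course materials. The final lines show how the same result might be written as JSON-style documents of the kind a document store holds.

```python
import json
import sqlite3

# In-memory SQLite database; the schema and data below are hypothetical.
conn = sqlite3.connect(":memory:")

# Create: two related tables linked by a country code.
conn.executescript("""
    CREATE TABLE country (
        code TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE survey (
        id INTEGER PRIMARY KEY,
        country_code TEXT REFERENCES country(code),
        respondents INTEGER
    );
""")

# Populate: parameterised inserts keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO country (code, name) VALUES (?, ?)",
    [("GB", "United Kingdom"), ("DE", "Germany")],
)
conn.executemany(
    "INSERT INTO survey (country_code, respondents) VALUES (?, ?)",
    [("GB", 1200), ("DE", 950), ("GB", 800)],
)

# Query: join the two tables and total the respondents per country.
rows = conn.execute("""
    SELECT c.name, SUM(s.respondents) AS total
    FROM survey AS s
    JOIN country AS c ON c.code = s.country_code
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# The same result expressed as JSON-style documents, for comparison.
documents = [{"country": name, "total_respondents": total} for name, total in rows]
print(json.dumps(documents, indent=2))
```

The join on the country code is a small instance of data linkage: the survey table only stores a code, and the country name is recovered from a second source.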

Online publishing and collaboration tools form the second part of this course, along with the tools and technologies that underlie them. Students will develop interactive, secure and powerful projects for the World Wide Web using both client- and server-side technologies. Collaboration and the dissemination and submission of course assignments will use GitHub, the popular code repository and version control system. This part of the course begins with an in-depth look at the markup and styling languages that form the foundations of building websites, through a study of HTML and CSS. Students next study basic programming in JavaScript to provide client- and server-side tools, including the customization of web content with Bootstrap and the publishing of web pages with Jekyll, which will provide the basis for a class project.

Teaching

This course is delivered through a combination of classes and lectures totalling a minimum of 20 hours across Michaelmas Term. This year, some or all of this teaching may be delivered through a combination of virtual classes and flipped lectures delivered as short online videos.

This course has a reading week in Week 6 of MT.

Formative coursework

Students will be expected to produce 10 problem sets in the MT.

Students will work on weekly, structured problem sets in the staff-led class sessions. Example solutions will be provided at the end of each week.

Indicative reading

  • Chodorow, Kristina. MongoDB: The Definitive Guide, 2nd Edition. O’Reilly, 2013.
  • Churcher, Clare. Beginning Database Design: From Novice to Professional. Apress, 2007.
  • Tahaghoghi, Seyed M. and Hugh E. Williams. Learning MySQL. O’Reilly, 2006.
  • Karumanchi, Narasimha. Data Structures and Algorithms Made Easy: Data Structure and Algorithmic Puzzles, Second Edition. CreateSpace Independent Publishing Platform, 2011.
  • Lee, Kent. Data Structures and Algorithms with Python. Springer, 2015.
  • Lake, Peter. Concise Guide to Databases: A Practical Introduction. Springer, 2013.
  • Nield, Thomas. Getting Started with SQL: A hands-on approach for beginners. O’Reilly, 2016.
  • Byron, Angela, Addison Berry, Nathan Haug, Jeff Eaton, James Walker and Jeff Robbins. Using Drupal: Choosing and Configuring Modules to Build Dynamic Websites. O’Reilly Media, 2008.
  • Duckett, Jon. HTML and CSS: Design and Build Websites. New York: Wiley, 2011.
  • Duckett, Jon. JavaScript and JQuery: Interactive Front-End Web Development. New York: Wiley, 2014.
  • Rice, Dylan. Twitter Bootstrap In Your Pocket. CreateSpace Independent Publishing Platform, 2016.
  • Sklar, David. Learning PHP 5. O’Reilly, 2004.
  • GitHub Guides at https://guides.github.com, including: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.
  • Jacobson, Daniel. APIs: A Strategy Guide. O’Reilly, 2012.
  • London, Kyle. Developing Large Web Applications: Producing Code That Can Grow and Thrive. O’Reilly, 2010.

Assessment

Take-home assessment (50%) and problem sets (50%) in the MT.

Marking of these assessments will be at a level appropriate for PhD students.

Important information in response to COVID-19

Please note that during the 2020/21 academic year some variation to teaching and learning activities may be required to respond to changes in public health advice and/or to account for the situation of students in attendance on campus and those studying online during the early part of the academic year. For assessment, this may involve changes to mode of delivery and/or the format or weighting of assessments. Changes will only be made if required and students will be notified about any changes to teaching or assessment plans at the earliest opportunity.

Key facts

Department: Methodology

Total students 2019/20: Unavailable

Average class size 2019/20: Unavailable

Value: Half Unit

Guidelines for interpreting course guide information

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills