MY572      Half Unit
Data for Data Scientists

This information is for the 2018/19 session.

Teacher responsible

Dr Pablo Barbera Aranguena COL7.10

Availability

This course is available on the MPhil/PhD in Social Research Methods. This course is available with permission as an outside option to students on other programmes where regulations permit.

This course is available to all research students where regulations permit.

Course content

This course will cover the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest. The course will also cover workflow management for typical data transformation and cleaning projects, which are frequently the starting point and most time-consuming part of any data science project. This course uses a project-based learning approach to the study of online publishing and group-based collaboration, essential ingredients of modern data science projects. The coverage of data sharing will include key skills in online publishing, including the elements of web design, the technical elements of web technologies and web programming, as well as the use of revision-control and group collaboration tools such as GitHub. Each student will build one or more interactive websites based on content relevant to their domain-related interests, and will use GitHub for accessing and submitting course materials and assignments.

Teaching

20 hours of lectures and 15 hours of computer workshops in the MT.

In this course, we introduce principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, database design, database implementation, and data analysis through structured queries. Through joining operations, we will also cover the challenges of data linkage and how to combine datasets from different sources. We begin by discussing fundamental data types and how data is stored and recorded electronically. We will cover database design, especially relational databases, using substantive examples across a variety of fields. Students are introduced to SQL through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will introduce NoSQL using MongoDB and the JSON data format for comparison. For both types of database, students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data. In the final section of the data part of the course, we will step through a complete workflow including data cleaning and transformation, illustrating many of the practical challenges faced at the outset of any data analysis or data science project.
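As a rough illustration of the create, populate and query cycle described above, and of linking two tables through a join, the sketch below uses Python's built-in sqlite3 module rather than MySQL (an assumption made so that the example runs without a database server). The table names, columns and rows are hypothetical. The last function returns the same information as a nested record that can be serialised to JSON, the document style used by stores such as MongoDB.

```python
import json
import sqlite3


def build_database():
    """Create and populate a small in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    # Two related tables: each paper points to one author via author_id.
    conn.executescript("""
        CREATE TABLE authors (
            author_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL
        );
        CREATE TABLE papers (
            paper_id  INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES authors(author_id),
            title     TEXT NOT NULL,
            year      INTEGER NOT NULL
        );
    """)
    # Hypothetical rows, inserted with parameter placeholders.
    conn.executemany(
        "INSERT INTO authors VALUES (?, ?)",
        [(1, "Author A"), (2, "Author B")],
    )
    conn.executemany(
        "INSERT INTO papers VALUES (?, ?, ?, ?)",
        [
            (10, 1, "First paper", 2012),
            (11, 2, "Second paper", 2015),
            (12, 2, "Third paper", 2018),
        ],
    )
    conn.commit()
    return conn


def papers_by_author(conn):
    """Join the two tables and count the papers written by each author."""
    rows = conn.execute("""
        SELECT a.name, COUNT(p.paper_id) AS n_papers
        FROM authors AS a
        JOIN papers AS p ON p.author_id = a.author_id
        GROUP BY a.name
        ORDER BY a.name
    """)
    return rows.fetchall()


def author_as_document(conn, author_id):
    """Return one author with their papers nested, in document style."""
    name = conn.execute(
        "SELECT name FROM authors WHERE author_id = ?", (author_id,)
    ).fetchone()[0]
    papers = conn.execute(
        "SELECT title, year FROM papers WHERE author_id = ? ORDER BY year",
        (author_id,),
    ).fetchall()
    return {"name": name, "papers": [{"title": t, "year": y} for t, y in papers]}


if __name__ == "__main__":
    conn = build_database()
    print(papers_by_author(conn))
    print(json.dumps(author_as_document(conn, 2), indent=2))
```

The relational version keeps authors and papers in separate tables and combines them at query time, whereas the document version stores each author's papers inside the author record. This is one common way to frame the relational versus non-relational comparison, not necessarily how the course presents it.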

Online publishing and collaboration tools form the second part of this course, along with the technologies that underlie them. Students will develop interactive, secure and powerful projects for the World Wide Web using both client-side and server-side technologies. Collaboration and the dissemination and submission of course assignments will use GitHub, the popular code hosting and version control platform. This part of the course begins with an in-depth look at HTML and CSS, the mark-up and style sheet languages that form the foundations of building websites. Students next study basic programming in JavaScript to provide client-side and server-side tools, including the customization of web content with Bootstrap and the use of Jekyll to publish web pages, which will provide the basis for a class project.

Formative coursework

Students will be expected to produce 10 problem sets in the MT.

Type: Weekly, structured problem sets. Each has an initial component that is started in the staff-led lab sessions and completed by the student outside of class. Answers should be formatted and submitted for assessment.

Indicative reading

Chodorow, Kristina. MongoDB: The Definitive Guide, 2nd Edition. O’Reilly, 2013.

Churcher, Clare. Beginning Database Design: From Novice to Professional. Apress, 2007.

Tahaghoghi, Seyed M. and Hugh E. Williams. Learning MySQL. O'Reilly, 2006.

Karumanchi, Narasimha. Data Structures and Algorithms Made Easy: Data Structure and Algorithmic Puzzles, Second Edition. CreateSpace Independent Publishing Platform, 2011.

Lee, Kent. Data Structures and Algorithms with Python. Springer, 2015.

Lake, Peter. Concise Guide to Databases: A Practical Introduction. Springer, 2013.

Nield, Thomas. Getting Started with SQL: A hands-on approach for beginners. O’Reilly, 2016.

Byron, Angela, Addison Berry, Nathan Haug, Jeff Eaton, James Walker, and Jeff Robbins. Using Drupal: Choosing and Configuring Modules to Build Dynamic Websites. O'Reilly Media, 2008.

Duckett, Jon. HTML and CSS: Design and Build Websites. New York: Wiley, 2011.

Duckett, Jon. JavaScript and JQuery: Interactive Front-End Web Development. New York: Wiley, 2014.

Rice, Dylan. Twitter Bootstrap In Your Pocket. CreateSpace Independent Publishing Platform, 2016.

Sklar, David. Learning PHP 5. O’Reilly, 2004.

GitHub Guides at https://guides.github.com, including: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.

Jacobson, Daniel. APIs: A Strategy Guide. O’Reilly, 2012.

Loudon, Kyle. Developing Large Web Applications: Producing Code That Can Grow and Thrive. O’Reilly, 2010.

Assessment

Take-home exam (50%) and in-class assessment (50%) in the MT.

Student problem sets will be marked each week and will provide 50% of the mark.

Marking of these assessments will be at a level appropriate for PhD students.

Key facts

Department: Methodology

Total students 2017/18: Unavailable

Average class size 2017/18: Unavailable

Value: Half Unit


Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills