ST446      Half Unit
Distributed Computing for Big Data

This information is for the 2024/25 session.

Teacher responsible

Dr Marcos Barreto

Availability

This course is available on the MPA in Data Science for Public Policy, MSc in Applied Social Data Science, MSc in Data Science, MSc in Econometrics and Mathematical Economics, MSc in Geographic Data Science, MSc in Health Data Science, MSc in Operations Research & Analytics, MSc in Quantitative Methods for Risk Management, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

This course has a limited number of places (it is controlled access) and demand is typically high. This may mean that you are not able to get a place on this course. Students on the MSc in Data Science are given priority for enrolment in this course.

Pre-requisites

Basic knowledge of Python or another programming language is desirable.

Course content

The course covers principles of distributed processing systems for big data, including distributed file systems (such as Hadoop); distributed computation models (such as MapReduce); resilient distributed datasets (Spark RDDs); structured querying over large datasets (Spark DataFrames and SQL); graph data processing systems (Spark GraphX and Neo4j); stream data processing systems (Kafka and MongoDB); scalable machine learning models (Spark MLlib and TensorFlow); and distributed and federated machine learning models (Spark MLlib and TensorFlow Federated Learning).

The course enables students to learn the principles of, and gain hands-on experience with, state-of-the-art computing technologies such as Apache Spark, a general engine for large-scale data processing, and TensorFlow, a popular software library for (distributed) learning of deep neural networks. Through weekly exercises and course project work, students can gain experience in performing data analytics tasks on their laptops and on cloud computing platforms.
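
To give a flavour of the MapReduce computation model named in the course content, the following is a minimal single-machine sketch of a word count in plain Python. It is an illustration only and is not taken from the course materials: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical, and a real distributed system would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    docs = ["big data big ideas", "data systems"]
    print(word_count(docs))
```

Keeping the three phases as separate functions mirrors how the model splits work: the map step can be applied to each document independently, and only the shuffle step needs to see pairs from every document.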

Teaching

This course will be delivered through a combination of classes, lectures and Q&A sessions totalling a minimum of 35 hours across the Winter Term (WT). This course includes a reading week in Week 6 of Winter Term.

Formative coursework

Students will be expected to produce 10 problem sets in the WT.

Eight of the weekly problem sets will represent formative coursework. The other two will represent summative assessment.

Indicative reading

  • Damji, J., Wenig, B., Das, T., Lee, D. Learning Spark: Lightning-Fast Data Analytics, O’Reilly, 2nd Edition, 2020
  • Karau, H. and Warren, R., High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark, O’Reilly, 2017
  • Drabas, T. and Lee, D., Learning PySpark, Packt, 2016
  • White, T., Hadoop: The Definitive Guide, O’Reilly, 4th Edition, 2015
  • Triguero, I. and Galar, M. Large-Scale Data Analytics with Python and Spark: a hands-on guide to implementing machine learning solutions. Cambridge, 2024. 


Additional reading:

  • Marz, N., Warren, J. Big Data: Principles and best practices of scalable realtime data systems. Manning, 2015.
  • Kleppmann, M. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly, 2016.
  • Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F., Lane, J. (Eds.). Big Data and Social Science: Data Science Methods and Tools for Research and Practice. CRC Press, 2nd edition, 2021.
  • Li, K-C., Jiang, H., Zomaya, A. (Eds.). Big Data Management and Processing. CRC Press, 2017.
  • Huang, S., Deng, H. Data Analytics: A Small Data Approach. CRC Press, 2021.
  • Apache Spark Documentation https://spark.apache.org/docs/latest
  • TensorFlow Documentation https://www.tensorflow.org

Assessment

Project (80%) in the WT.
Continuous assessment (10%) in the WT Week 4.
Continuous assessment (10%) in the WT Week 9.

Summative assessments: a problem set submitted in WT Week 6 (10%), a problem set submitted in WT Week 11 (10%) and a project (80%) in the WT. Each summative problem set is composed of theory and coding components and carries an individual mark of 10%. The project is a take-home exam (80%) in the form of a group project in which students demonstrate their ability to develop a big data solution for a task of their choice.

Formative assessments: short weekly coding problem sets which lay the groundwork for the seminar sessions.

Student performance results

(2020/21 - 2022/23 combined)

Classification % of students
Distinction 36
Merit 47.6
Pass 15.2
Fail 1.2

Key facts

Department: Statistics

Total students 2023/24: 78

Average class size 2023/24: 26

Controlled access 2023/24: Yes

Value: Half Unit

Personal development skills

  • Self-management
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills