ST446 Half Unit
Distributed Computing for Big Data
This information is for the 2017/18 session.
Teacher responsible
Prof Milan Vojnovic COL 2.05A
Availability
This course is available on the MSc in Data Science. This course is available with permission as an outside option to students on other programmes where regulations permit.
Pre-requisites
Basic programming knowledge of Python or another programming language is desirable.
Course content
The course covers the basic principles and techniques for distributed processing of large-scale datasets across clusters of computers, with an emphasis on machine learning tasks. It covers the computation paradigms developed for batch, streaming, iterative, and graph data processing. The course is largely based on the Apache Hadoop computing framework and, in particular, Apache Spark, a popular, fast, general-purpose engine for large-scale data processing. The course also covers the basic principles of numerical computation using data flow graphs, which are used to train deep neural networks on multiple CPUs and GPUs by popular open-source software libraries such as TensorFlow, developed by Google, and the Microsoft Cognitive Toolkit, developed by Microsoft.
The course covers canonical machine learning tasks that arise in real-world applications such as recommendation of items to users, k-means clustering for anomaly detection, latent semantic analysis of textual data sources such as Wikipedia, analysis of online social networks, analysis of geospatial data, estimation of financial risk, and deep learning for image recognition.
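As a rough illustration of the batch paradigm mentioned above, the following is a minimal sketch of a map-and-reduce word count written in plain Python. It is not taken from the course materials and does not use Hadoop or Spark; the function names (map_partition, merge_counts, word_count) and the sample data are hypothetical, and the "partitions" are ordinary lists standing in for data blocks that a cluster would hold on different machines.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within a single partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(partitions):
    """Apply the map step to each partition, then reduce the partial results."""
    # On a cluster each partition would be processed on a different worker;
    # here the partitions are processed one after another (an assumption
    # made to keep the sketch self-contained).
    partial = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partial, Counter())


# Hypothetical sample data split into two partitions.
partitions = [
    ["big data big clusters"],
    ["data flows across clusters", "big graphs"],
]
totals = word_count(partitions)
```

Because the reduce step is associative and commutative, the partial counts can be merged in any order, which is what allows this kind of computation to be spread across many machines.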
Teaching
20 hours of lectures and 15 hours of computer workshops in the LT.
Formative coursework
Students will be expected to produce 10 problem sets in the LT.
Eight of the weekly problem sets will represent formative coursework. The other two will represent summative assessment.
Indicative reading
Francesco Pierfederici. Distributed Computing with Python. Packt Publishing, 2016.
Tom White. Hadoop: The Definitive Guide. O'Reilly, 2015.
Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.
Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O'Reilly, 2015.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Open source software
Dispy: Distributed and parallel computing with Python: http://dispy.sourceforge.net
Spark: http://spark.apache.org
TensorFlow: Open Source Software Library for Machine Intelligence: https://www.tensorflow.org
The Microsoft Cognitive Toolkit: https://www.microsoft.com/en-us/research/product/cognitive-toolkit
NVIDIA GPUs – The Engine of Deep Learning: https://developer.nvidia.com/deep-learning
Assessment
Project (80%) in the LT.
Continuous assessment (10%) in Week 4.
Continuous assessment (10%) in Week 7.
The main assessment will consist of an individual project to develop a package for fitting statistical models of the student's own choice to big data sets.
In addition, two of the 10 weekly problem sets (in Weeks 4 and 7) will contribute to summative assessment (10% each).
Key facts
Department: Statistics
Total students 2016/17: Unavailable
Average class size 2016/17: Unavailable
Controlled access 2016/17: No
Value: Half Unit
Personal development skills
- Self-management
- Problem solving
- Application of information skills
- Communication
- Application of numeracy skills
- Specialist skills