ST446 Half Unit
Distributed Computing for Big Data
This information is for the 2018/19 session.
Teacher responsible
Prof Milan Vojnovic COL 5.05
Availability
This course is available on the MSc in Applied Social Data Science, MSc in Data Science and MSc in Operations Research & Analytics. This course is available with permission as an outside option to students on other programmes where regulations permit.
Pre-requisites
Basic knowledge of Python or another programming language is desirable.
Course content
The course covers the basic principles of systems for distributed processing of big data, including distributed file systems; distributed computation models such as MapReduce, resilient distributed datasets, and distributed dataflow graph computations; structured querying over large datasets; graph data processing systems; stream data processing systems; and scalable machine learning algorithms for classification, regression, collaborative filtering, topic modelling and other tasks. The course enables students to learn these principles and to gain hands-on experience with state-of-the-art big data computing technologies such as Apache Spark, a general engine for large-scale data processing, and TensorFlow, a popular software library for (distributed) learning of deep neural networks. Through weekly exercises and course project work, students can gain experience in performing data analytics tasks on their laptops and on cloud computing platforms.
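As a rough illustration of the MapReduce computation model named above, the following single-process Python sketch counts words in a small collection of documents. It is not taken from the course materials and does not use the Spark or Hadoop APIs; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical, chosen only to show the map, shuffle and reduce steps that a distributed framework would run across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    In a real distributed system this grouping happens across machines;
    here it is a plain in-memory dictionary (an assumption of the sketch).
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    docs = ["big data big compute", "data flows"]
    print(word_count(docs))
```

Because each call to map_phase depends only on its own document, and each call to reduce_phase depends only on the values for its own key, both steps could in principle run in parallel on different machines, which is the property that distributed frameworks exploit.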
For more information, please see the course handout: http://lse-st446.github.io
Teaching
20 hours of lectures and 15 hours of computer workshops in the LT.
Formative coursework
Students will be expected to produce 10 problem sets in the LT.
Eight of the weekly problem sets will represent formative coursework. The other two will represent summative assessment.
Indicative reading
Karau, H., Konwinski, A., Wendell, P. and Zaharia, M., Learning Spark: Lightning-Fast Big Data Analysis, O’Reilly, 2015
Karau, H. and Warren, R., High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark, O’Reilly, 2017
Drabas, T. and Lee, D., Learning PySpark, Packt, 2016
White, T., Hadoop: The Definitive Guide, O’Reilly, 4th Edition, 2015
Apache Spark Documentation https://spark.apache.org/docs/latest
TensorFlow Documentation https://www.tensorflow.org/get_started
Assessment
Project (80%) in the LT.
Continuous assessment (10%) in the LT Week 4.
Continuous assessment (10%) in the LT Week 7.
The main assessment will consist of an individual project to develop a package for fitting statistical models of the student's own choice to big data sets.
In addition, among the 10 weekly problem sets, there will be two (in weeks 4 and 7) which will contribute to summative assessment (10% each).
Key facts
Department: Statistics
Total students 2017/18: 29
Average class size 2017/18: 30
Controlled access 2017/18: Yes
Lecture capture used 2017/18: Yes (LT)
Value: Half Unit
Personal development skills
- Self-management
- Problem solving
- Application of information skills
- Communication
- Application of numeracy skills
- Specialist skills