DS105W      Half Unit
Data for Data Science

This information is for the 2024/25 session.

Teacher responsible

Dr Jonathan Cardoso-Silva COL.1.03

Availability

This course is available on the BSc in Politics and Data Science. This course is available as an outside option to students on other programmes where regulations permit and to General Course students.

While this course is not capped and, in principle, any student who requests a place is likely to be given one, restrictions might need to be imposed if the demand is too high.

Pre-requisites

There are no pre-requisites. A willingness to learn how to code is all you need.

Note: although there have never been any pre-requisites, previous iterations of this course assumed students would have learned the basics of programming from pre-sessional courses. Starting in the 2024/25 academic year, this assumption will no longer be in place.

Course content

The main goal of this course is to teach students how to manipulate and store 'real data' in a hands-on manner. The first few weeks of the course will cover theoretical concepts through traditional lectures with slides, after which the format will shift to a more practical approach. Live coding demonstrations will guide students through the material, which they can follow in real time on their laptops. Python will be the primary programming language used in staff-led lectures and classes, but some exercises will involve a mixture of Python and R.

The 🎯 intended learning outcomes of this course are:

  • Understand the basic structure of data types and common data formats.
  • Show familiarity with international standards for common data types.
  • Manage a typical data cleaning, structuring, and analysis workflow using practical examples.
  • Clean data and diagnose common problems involved in data corruption and how to fix them.
  • Understand the concept of databases.
  • Link data from various sources.
  • Learn to use Python for the data manipulation workflow.
  • Be exposed to how R is used in the data manipulation workflow and data visualisation.
  • Use GitHub, a collaboration platform built on the Git version control system.
  • Use markup languages and the Markdown format to format documents and web pages.
  • Create and maintain simple websites using HTML and CSS.
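
To give a flavour of what "cleaning" and "linking" data can mean in practice, here is a minimal sketch in plain Python. It is not taken from the course materials: the two tiny datasets, the field names, the population figures and the `clean_code` helper are all hypothetical. The join key uses ISO 3166-1 alpha-2 country codes as one example of an international standard for a common data type.

```python
def clean_code(raw):
    """Normalise a country code: strip whitespace and upper-case it."""
    return raw.strip().upper()


# Hypothetical source 1: population figures with inconsistently typed codes.
# The numbers are illustrative only, not official statistics.
populations = [
    {"code": " gb", "population": "67000000"},
    {"code": "FR ", "population": "68000000"},
    {"code": "xx", "population": ""},  # missing value, dropped during cleaning
]

# Hypothetical source 2: country names keyed by ISO 3166-1 alpha-2 code.
names = {"GB": "United Kingdom", "FR": "France"}

# Clean: drop rows with no population, normalise codes, convert text to int.
cleaned = [
    {"code": clean_code(row["code"]), "population": int(row["population"])}
    for row in populations
    if row["population"].strip()
]

# Link: attach the country name from the second source via the shared code.
linked = [
    {**row, "name": names[row["code"]]}
    for row in cleaned
    if row["code"] in names
]
```

In this sketch, `linked` holds one record per country that appears in both sources, with the code normalised and the population stored as a number rather than text.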

Older iterations of this course can be seen on the course's public website: http://lse-dsi.github.io/DS105

Note, however, that starting this 2024/25 academic year, the concepts of data acquisition will not be fully covered in DS105A. If you want to learn more about advanced data collection techniques, such as web scraping and API queries, you should consider taking DS205, which covers the topic in more detail.

Teaching

40 hours of lectures and 15 hours of classes in the WT.

Reading Week in Week 6.

Formative coursework

Achieving proficiency in data science skills, much like programming in general, relies heavily on consistent and continuous practice. To facilitate this, we release two structured problem sets very early in the course (around Weeks 02 & 04). These exercises are closely tied to in-class activities and follow the same submission structure as the graded problem sets that will be introduced after Reading Week.

Example exercises include navigating the computer terminal, accessing computer servers, and writing code to read and save data.
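
As a rough illustration of the last kind of exercise, the sketch below reads a small CSV file and saves its rows as JSON using only Python's standard library. It is not an actual course exercise: the `csv_to_json` function, the file names and the data values are invented for this example.

```python
import csv
import json
import tempfile
from pathlib import Path


def csv_to_json(csv_path, json_path):
    """Read rows from a CSV file and save them as a JSON list of records."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # each row becomes a dict keyed by header
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    return rows


# Hypothetical example data, written to a temporary folder.
workdir = Path(tempfile.mkdtemp())
csv_file = workdir / "cities.csv"
json_file = workdir / "cities.json"
csv_file.write_text("city,population\nLondon,8800000\nLeeds,810000\n", encoding="utf-8")

records = csv_to_json(csv_file, json_file)

# Read the saved file back to confirm the round trip.
saved = json.loads(json_file.read_text(encoding="utf-8"))
```

Note that `csv.DictReader` returns every value as text, so the population column would still need converting to numbers before any analysis.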

Indicative reading

  • Janssens, Jeroen. Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools. Second edition. Sebastopol, CA: O’Reilly Media, Inc., 2021.
  • Lutz, Mark. Learning Python. Fifth edition. Beijing: O’Reilly, 2013.
  • Scavetta, Rick J. Python and R for the Modern Data Scientist: The Best of Both Worlds. Sebastopol: O’Reilly Media, Incorporated, 2021.
  • VanderPlas, Jake. Python Data Science Handbook: Essential Tools for Working with Data. Second edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly, 2023.

Assessment

Problem sets (60%) in the WT.
Group project (40%) in the ST.

The problem sets involve creating computational notebooks (Jupyter or Quarto notebooks) to showcase the coding and documentation skills gained throughout the course. Problem sets typically consist of two parts, with one submission around Week 07 and another around Week 09.

The group project will consist of a pitch presentation (Week 11) and a final public report in the form of a public website (Spring Term, around Week 04).

Key facts

Department: Data Science Institute

Total students 2023/24: 52

Average class size 2023/24: 10

Capped 2023/24: No

Value: Half Unit

Course selection videos

Some departments have produced short videos to introduce their courses. Please refer to the course selection videos index page for further information.

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills