Digital Data Collection: Web scraping for the Humanities and Social Sciences

Wed 6 Feb 2019

Description

The internet is a great resource for humanities and social science data, but most of that information appears chaotic. In this course we will explore how to programmatically access information stored online, typically in HTML, to create neat, tabulated data ready for analysis. The uses of web scraping are diverse: previous versions of this course used the programming language R to collect data directly from newspapers and to access live data streams through APIs (YouTube, Facebook, Google Maps, Wikipedia).

The one-day course is structured as follows. In the morning, we will consider general principles of web scraping, illustrated through examples. This session is designed to build the toolkit needed to collect different types of online data effectively. In the afternoon, the session will take a workshop format, in which students may choose to begin applying web scraping to their own research or to work through a structured set of exercises.

If there are any particular data sources you are interested in accessing, do email me at dt444@cam.ac.uk, as I may be able to integrate an example directly relevant to your research into the session.

Unlike in past years, this course will be taught using Python, Jupyter Notebooks and the BeautifulSoup library. The course will not assume any prior knowledge of Python, but students are encouraged to familiarise themselves with these tools before the course. Any introductory Python MOOC (for example on edX or Coursera) will provide an excellent introduction.
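
As a rough illustration of what turning HTML into "neat, tabulated data" can look like in Python, here is a minimal sketch. It is not taken from the course materials. It uses only Python's standard library (html.parser and csv) so that it runs without installing anything; BeautifulSoup offers a more convenient interface for the same task. The HTML snippet, the class name TableParser and the helper functions are all hypothetical and exist only for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a page downloaded from the web.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>First article</td><td>2017</td></tr>
  <tr><td>Second article</td><td>2018</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row being built, or None when outside <tr>
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_rows(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialise rows as CSV text, ready to be saved or analysed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(html_table_to_rows(SAMPLE_HTML)))
```

In practice the HTML would come from a live page rather than a hard-coded string, and real pages are usually messier than this example, which is where a dedicated parsing library becomes useful.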

Prerequisites

Familiarity with R and an interest in online data collection. Any programming knowledge or understanding of HTML is a bonus.
Students should be comfortable with the RStudio interface (R is covered in the first ten videos of Roger Peng’s course Computing for Data Analysis, available on YouTube).
You must have a University Information Services (Computing) Desktop Services password (http://www.ucs.cam.ac.uk/linkpages/newcomers).
You must have access to CamTools.

Sessions

Number of sessions: 2

#	Date	Time	Venue	Trainer
1	Wed 6 Feb 2019	09:00 - 13:00	Titan Teaching Room 2, New Museums Site	Daniel Tanis
2	Wed 6 Feb 2019	14:00 - 18:00	Titan Teaching Room 2, New Museums Site	Daniel Tanis

Aims

To provide students with the skillsets necessary to use web scraping in their own research.

Format

Presentations, demonstrations and practicals

Readings

No readings are assigned, but students should ensure they are comfortable with the basics of R. This is covered in the first ten videos of Roger Peng’s course Computing for Data Analysis, available on YouTube.

Assessment

There may be an online open-book test at the end of the module; for most students, the test is not compulsory.
