Wed 6 Feb 2019
09:00 - 18:00

Venue: Titan Teaching Room 2, New Museums Site

Provided by: Social Sciences Research Methods Programme


Booking

Bookings cannot be made on this event (Event is completed).


Other dates:

No more events





Booking / availability

Digital Data Collection: Web scraping for the Humanities and Social Sciences

Description

The internet is a great resource for humanities and social science data, but much of that information appears chaotic. In this course we will explore how to programmatically access information stored online, typically in HTML, to create neat, tabulated data ready for analysis. The uses of web scraping are diverse: previous versions of this course used the programming language R to access data directly from newspapers and to tap live data streams through APIs (YouTube, Facebook, Google Maps, Wikipedia).

The one-day course is structured as follows. In the morning, we will consider general principles of web scraping, illustrated through examples. This session is designed to build the toolkit needed to collect different types of online data effectively. In the afternoon, the session will take a workshop format, where students may choose to begin applying web scraping to their own research, or to work through a structured set of exercises.

If there are any particular data sources you are interested in accessing, do email me at dt444@cam.ac.uk, as I may be able to integrate an example directly relevant to your research into the session.

Unlike in past years, this course will be taught using Python, Jupyter Notebooks and the BeautifulSoup library. The course will not assume any prior knowledge of Python, but students are encouraged to familiarise themselves with these tools before the course. Any introductory MOOC on Python (such as those on edX or Coursera) will provide an excellent introduction.
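As a rough illustration of the kind of task described above (turning HTML into tabulated data), here is a minimal sketch. It is not taken from the course materials. It uses only Python's standard library (`html.parser` and `csv`) rather than BeautifulSoup, so that it runs without installing anything, and the HTML snippet, the `TableParser` class and the helper function names are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>First article</td><td>2017</td></tr>
  <tr><td>Second article</td><td>2018</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_rows(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialise rows as CSV text, one line per row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = html_table_to_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

In practice the HTML would be downloaded from a live page rather than written inline, and a library such as BeautifulSoup makes the parsing step considerably shorter. The overall shape (fetch, parse, tabulate) stays the same.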

Prerequisites
  • Familiarity with R and an interest in online data collection. Any programming knowledge or understanding of HTML is a bonus.
  • Students should be comfortable with the RStudio interface (R is covered in the first ten videos of Roger Peng’s course Computing for Data Analysis, available on YouTube)
  • You must have a University Information Services (Computing) Desktop Services password (http://www.ucs.cam.ac.uk/linkpages/newcomers)
  • You must have access to CamTools
Sessions

Number of sessions: 2

#  Date            Time           Venue                                    Trainer
1  Wed 6 Feb 2019  09:00 - 13:00  Titan Teaching Room 2, New Museums Site  Daniel Tanis
2  Wed 6 Feb 2019  14:00 - 18:00  Titan Teaching Room 2, New Museums Site  Daniel Tanis
Aims

To provide students with the skills necessary to use web scraping in their own research.

Format

Presentations, demonstrations and practicals

Readings

No readings are assigned, but students should ensure they are comfortable with the basics of R. This is covered in the first ten videos of Roger Peng’s course Computing for Data Analysis, available on YouTube.

Assessment

There may be an online open-book test at the end of the module; for most students, the test is not compulsory.

