After further research I found a way to get more and better-quality data.
March 21, 2021
A note on the {reclin} package: I kept the code chunks without actually running the code. Fortunately the output data was stored, which allowed me to continue porting the series to quarto.
Introduction
I should have known better. When I couldn’t find a proper API to access the Bundestag data, I immediately went for what looked like the second-best option: scraping the xls files that are hosted on the website.1 I started scraping the website right away, set up Selenium and wrote a script to automatically click through the dynamic website and download all available xls files.
The result wasn’t too bad, actually: I obtained around 480 datasets, including the exact dates of the polls, all individual votes, and the representatives’ names and party/parliamentary groups. Yet I couldn’t get the script to finish correctly: I had hoped to scrape all ~700 available polls, not only ~480.
Had I stepped back for a second and done some further research first, I would have found a comfortable way to access all polling data via an easily iterable URL pattern, yielding the same data for all ~700 polls. And better yet, someone has been doing the work of web scraping the Bundestag data in a quite professional way: the project behind https://www.democracy-deutschland.de/ provides several tools for scraping various parliamentary websites in Germany (e.g. the national and the state parliaments). Some of the data scraped with these tools is regularly updated and hosted in the repositories under https://github.com/bundestag.
As I still learned a lot during my own scraping process, I decided to keep the now obsolete post and make this update a separate post.
Preview Image: © 2022 Christian A. Gebhard
The data
The Bundestag provides polling data going back quite a long time as PDF reports, but only since around 2012 as machine-readable tables. Roll call data since 2009 is available as lists that can be scraped from the website. Unfortunately no specific license is mentioned for the open data, but the data is offered to “interested users” for “further machine processing”. Sounds good enough for my use.
I’ll be using the data hosted in a GitHub repo, which was generated by scraping the www.bundestag.de website. Among the different datasets there are JSON files with the individual votes linked to the ID number of the ballot, and others with the metadata on the ballots, including title, date, etc. Both are unofficial data, but I probably couldn’t scrape it in better quality myself, so I will continue to use this data.
Both datasources (NamedPolls and NamedPollDeputies) are published under “The Unlicense” with no conditions whatsoever.
I still want to give credit to the contributors of this repo and thank them for their great and helpful work!
I want to get a single dataset containing the names and dates of the ballots, the names and votes of the representatives, and their party membership. Unfortunately the data is split across two “folders”: one with the poll metadata and one with the actual votes. Additionally, within each folder each poll has its own file.
In this post I will read in the poll metadata and the individual votes, clean the deputy names, and save the combined dataset for the upcoming analysis.
So let’s get started with loading the required packages:
library("tidyverse")
library("jsonlite")
library("reclin")
library("rmarkdown")
I cloned the above-mentioned repositories “NamedPolls” and “NamedPollDeputies” to my data_sources/2021_bundestag folder. Using Sys.glob() I obtain the file paths for all data files, loop over these and rbind() them to create the desired metadata tibble.
# obtain a vector with all relevant file paths
poll_files <- Sys.glob(here::here("data","2021_bundestag", "NamedPolls", "data", "*.json"))
# instantiate an empty tibble
poll_meta <- tibble()
# loop over the file paths, read each file, extract information of interest and append to the tibble
for (file in poll_files) {
json_temp <- read_json(file)
p_id <- as.numeric(json_temp$data$id)
p_title <- json_temp$data$title
p_date <- as.Date(json_temp$data$date)
poll_meta <- rbind(poll_meta, tibble(p_id, p_date, p_title))
}
Let’s get the dimensions and a summary of the metadata.
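A minimal check, using only the poll_meta tibble built above, could look like this:
# quick sanity check on the collected poll metadata
dim(poll_meta)     # number of polls (rows) and columns
summary(poll_meta) # ranges of the ids and dates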
To read in the individual votes I use jsonlite::fromJSON(), as it can automatically simplify the data to a data frame. The data frame with the votes of each deputy is nested within the resulting list, as there is other data “surrounding” the actual voting data:
# obtain a vector with all relevant file paths
vote_files <- Sys.glob(here::here("data","2021_bundestag", "NamedPollDeputies", "data", "*.json"))
# instantiate an empty tibble
voting_data_raw <- tibble()
# loop over the files, extract the poll id and the list item with the
# nested data frame. Then append the dataframe to the rest of the data.
for (file in vote_files) {
votes_temp <- fromJSON(file, simplifyVector = TRUE, simplifyDataFrame = TRUE)
temp_id <- votes_temp$data$id
df_temp <- as_tibble(votes_temp$data$votes$deputies) %>%
mutate(p_id = as.numeric(temp_id))
voting_data_raw <- rbind(voting_data_raw, df_temp)
}
print("Dimensions of the dataset:")
dim(voting_data_raw)
For the further analysis in this series I want to make sure that each deputy is identified by a unique identifier and not listed under, e.g., slightly different names. The scraped data isn’t perfect in this regard:
…so I can use neither the ID nor the URL.
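One possible check (a sketch using the id, URL and name columns of the raw data) is to see whether a raw name maps to more than one id or URL:
# sketch: how many distinct ids and biography URLs occur per raw deputy name?
voting_data_raw %>%
  group_by(name) %>%
  summarise(n_ids = n_distinct(id), n_urls = n_distinct(URL), .groups = "drop") %>%
  filter(n_ids > 1 | n_urls > 1)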
My first idea was to use the official basic deputy data provided by the Bundestag2, because it lists all name variants, name changes, academic titles and titles of nobility, linked to a constant deputy ID number. However, the votes dataset above has 427039 entries and the deputy dataset is also of that magnitude. As the order of first/last name etc. differs between the current voting data and the “reference” dataset, I wanted to use record linkage / fuzzy string matching. Unfortunately my midrange 2013 notebook ran out of memory comparing all possible pairings, and I had to take a different approach.
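As a rough illustration (an added sketch, not from the original analysis): without a blocking variable, record linkage considers every possible pair of records, so two tables of this size produce on the order of n^2 candidate pairs:
# order-of-magnitude estimate of the candidate pairs in a full comparison
n_votes <- as.numeric(nrow(voting_data_raw))           # 427039 entries
format(n_votes^2, big.mark = ",", scientific = FALSE)  # roughly 1.8e11 pairs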
In the end I decided to use the name as identifier. To avoid titles causing different names for the same person, I simply removed from the names all variants of academic titles that I could find or think of:
voting_data_no_titles <- voting_data_raw %>%
mutate(name_clean = str_replace_all(name, "Prof\\.", ""),
name_clean = str_replace_all(name_clean, "Dr\\.", ""),
name_clean = str_replace_all(name_clean, "h\\.\\s?c\\.", ""),
name_clean = str_replace_all(name_clean, "rer\\.\\s?nat\\.", ""),
name_clean = str_replace_all(name_clean, "rer\\.\\s?pol\\.", ""),
name_clean = str_replace_all(name_clean, "iur\\.", ""),
name_clean = str_replace_all(name_clean, ",\\s*", ", ") # replace excessive spaces with a single space as this is missing in some cases causing name-duplicates
)
This resulted in quite a reduction of unique names from 1269 to 1207.
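For reference, these counts can be recomputed from the two tibbles (a small sketch):
# unique raw names vs. unique cleaned names
n_distinct(voting_data_raw$name)              # 1269 according to the text above
n_distinct(voting_data_no_titles$name_clean)  # 1207 after removing the titles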
As quality checks I answered the following questions (written in the code comments):
# Q1: Are there rare name variations, e.g. caused by typos (occurring
# once or twice), missed by the above cleaning?
voting_data_no_titles %>%
count(name_clean) %>%
arrange(n) %>%
head(10) %>%
paged_table()
# Q2: Did the above process cause "name collision" for two deputies,
# resulting in >1 vote in a poll for the same name?
voting_data_no_titles %>%
group_by(p_id) %>%
count(name_clean, sort = TRUE) %>%
head(10) %>%
paged_table()
# Q3: Are there different names for the same biography-URL?
voting_data_no_titles %>%
distinct(URL, name_clean) %>%
count(URL, sort = TRUE) %>%
head(10) %>%
paged_table()
# Q4: Are there different names for the same id number?
voting_data_no_titles %>%
distinct(id, name_clean) %>%
count(id, sort = TRUE) %>%
head(10) %>%
paged_table()
I can answer all four questions with “no”.
To check for systematic spelling differences that do not appear as “single exceptions” above, I went back to record linkage and in fact identified a few names, mostly due to accents or non-standard characters.3
unique_names <- voting_data_no_titles %>%
  distinct(name_clean)
# compare every unique cleaned name against every other one
reclin::pair_blocking(unique_names, unique_names) %>% # all candidate pairs, no blocking variable
  reclin::compare_pairs(by = "name_clean", default_comparator = reclin::lcs()) %>% # similarity based on the longest common subsequence
  reclin::score_problink() %>% # probabilistic linkage score
  reclin::select_threshold(threshold = 0.6) %>% # permissive cut-off, see below
  reclin::link() %>%
  filter(name_clean.x != name_clean.y) %>% # keep only pairs with differing spellings
  arrange(name_clean.x)
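A quick aside (an added illustration, not part of the original analysis): the point about short names in the next paragraph can be made concrete with base R's adist(), a plain Levenshtein distance rather than the lcs comparator used above.
# relative edit distance: the same kind of suffix costs ~31% vs. 20% of the string
adist("Aken, Jan", "Aken, Jan van") / nchar("Aken, Jan van")
adist("Lerchenfeld, Philipp", "Lerchenfeld, Philipp Graf") / nchar("Lerchenfeld, Philipp Graf")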
Above I chose a quite permissive threshold of 0.6. I wouldn’t normally do this, but as some names are quite short, even a few changed characters might have a strong impact on the string distance. Many of the pairings above are in fact different deputies, but some names have to be corrected:
voting_data_cleaned <- voting_data_no_titles %>%
mutate(name_clean = case_when(
name_clean == "Brugger, Agnieszka" ~ "Brugger, Agnes", # http://webarchiv.bundestag.de/archive/2013/1212/bundestag/abgeordnete17/biografien/B/brugger_agnes.html
name_clean == "Dagdelen, Sevim" ~ "Dağdelen, Sevim",
name_clean == "Özoguz, Aydan" ~ "Özoğuz, Aydan",
name_clean == "Jantz, Christina" ~ "Jantz-Herrmann, Christina", # name changed due to marriage?
name_clean == "Aken, Jan" ~ "Aken, Jan van", # van forgotten in some records
name_clean == "Lerchenfeld, Philipp" ~ "Lerchenfeld, Philipp Graf", # title forgotten in some records
name_clean == "Neskovic, Wolfgang" ~ "Nešković, Wolfgang", # wrong spelling in some polls
name_clean == "Korkmaz, Elvan" ~ "Korkmaz-Emre, Elvan", # name changed due to marriage?
name_clean == "Wadephul, Johann" ~ "Wadephul, Johann David", # David forgotten in some records
TRUE ~ name_clean
))
voting_data_cleaned %>%
distinct(name_clean) %>%
nrow()
As expected this reduces the unique names by another 9 to 1198 distinct deputy names.
One factor I’ll be missing is deputies who completely changed their last name, e.g. due to marriage or divorce. To catch those I would have to check against the reference basic deputy data mentioned above.
But I decided to keep the rest of the data as it is. Since I’m not planning on doing any world-shaking inferences, the effort of manually checking every record against the official deputy data with all official name changes, earned or lost titles etc. seemed disproportionate. I’m going for an exploratory analysis.
After reading and cleaning the data, it’s time to save it and head on to the next questions to be answered!
poll_meta %>%
write_csv("../../../data_sources/2021_bundestag/parsed/poll_meta.csv")
voting_data_cleaned %>%
write_csv("../../../data_sources/2021_bundestag/parsed/voting_data_cleaned.csv")
# combine the cleaned votes with the poll metadata and drop columns not needed downstream
voting_data_complete <- voting_data_cleaned %>%
left_join(poll_meta, by = "p_id") %>%
select(-c(imgURL, URL, id))
voting_data_complete %>%
write_csv("../../../data_sources/2021_bundestag/parsed/polls_voting_complete.csv")
This concludes this blog post. I didn’t do any “analysis” of the data or visualization yet. I still hope it was an interesting read, and I’m looking forward to the next post, where - finally - I can do some actual analysis.
Actually I hadn’t tried web scraping before, so it also intrigued me as an opportunity to learn it.↩︎
I used this dataset in the first part of this series.↩︎
The differences might be caused by different systems/encodings when collecting/publishing or scraping the data.↩︎