.. _Corpora:

#################
Corpora
#################

This page lists corpora relevant to political science research, categorized by country and key source, with a link to each and a note where access is not free. We are working with many of these to develop Texti.

Parties and elections
^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Item
     - Country
     - Description
     - Access
     - Link
   * - Manifesto Project
     - 51 inc. OECD
     - All political manifestos from the first democratic election onwards.
     - API; stata, spss, csv, xlsx
     - `Here `_
   * - Speeches
     - UK
     - Speeches from party leaders from 1895 to today
     - HTML on site
     - `Here `_
   * - Regional manifesto
     - Spain
     - 1980 to 2019, all regional parties
     - Download
     - `Here `_
   * - Regional manifesto
     - Wales and Scotland
     - 1999 and 2016
     - Download
     - `Here `_
   * - Regional manifesto
     - Italy
     - Fragmented depending on region
     - Download
     - `Here `_

Parliament Activity
^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Item
     - Country
     - Description
     - Access
     - Link
   * - Parliamentary Questions Answered
     - UK
     - 278,428 questions; csv
     - API
     - `Here `_
   * - EP Plenary
     - European Union
     - 1997 to 2019
     - HTTP resolvable URIs
     - `Here `_
   * - Parliament Debates
     - France
     - Debates of the 15th legislature
     - HTTP resolvable URIs; XML
     - `Here `_
   * - Lords Written Questions
     - UK
     - 52,004 questions
     - API; csv
     - `Here `_
   * - Commons Written Questions
     - UK
     - 275,929 questions
     - API; csv
     - `Here `_
   * - Questions to the Government
     - France
     - Since 2017
     - HTTP resolvable URIs
     - `Here `_
   * - Questions to the Government - without debates
     - France
     - Since 2017
     - HTTP resolvable URIs
     - `Here `_
   * - Written questions to the Government
     - France
     - Since 2017
     - HTTP resolvable URIs
     - `Here `_
   * - Parliamentary Debates on Europe
     - France
     - 2002 to 2012
     - HTTP resolvable URIs
     - `Here `_
   * - Parliamentary speeches
     - Austria, Czech Republic, Germany, Denmark, Netherlands, NZ, Spain, Sweden, UK, Ireland
     - 21 to 32 years of data
     - API on DataVerse; full-text vectors in rds
     - `Here `_
   * - Parliament Rules
     - UK
     - 1811 to 2019
     - Download
     - `Here `_
   * - Parliament Rules
     - Ireland
     - 1922 to 2020
     - Download
     - `Here `_
   * - Debates and Replies to Questions
     - Ireland
     - All
     - API
     - `Here `_
   * - Senate "Dossiers Legislatifs"
     - France
     - Documents discussed since 1977
     - Download
     - `Here `_
   * - Amendments by the Senate
     - France
     - Amendments since 2001
     - Download
     - `Here `_
   * - Lords Bill Amendments
     - UK
     - 11,727 amendments
     - API
     - `Here `_
   * - Questions to the Government (Senate)
     - France
     - Since 1978
     - Download
     - `Here `_
   * - Research Briefings
     - UK
     - 9,739 briefings
     - API; csv with a 500-record limit
     - `Here `_
   * - Proceedings
     - European Union
     - 1996 to 2011
     - Download; xml
     - `Here `_

Legislative Documents
^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Item
     - Country
     - Description
     - Access
     - Link
   * - All legislation
     - European Union
     - Summaries of EU legislation (the full corpus exists, but under an unsuitable license)
     - HTML on site (can email Dimiter Toshkov for ``Python`` script)
     - `Here `_
   * - Trade agreements
     - European Union
     - All free trade agreements
     - List of linked PDFs
     - `Here `_
   * - Bills
     - UK
     - All bills since 2007
     - API
     - `Here `_
   * - All Legal Texts
     - France
     - Constitution, laws and decrees, court rulings, treaties (in French and translated)
     - Downloadable + beta API
     - `Here `_
   * - Legislation
     - Wales
     - All Bills, Acts, Marshalled lists
     - XML export
     - `Here `_
   * - The Record of Proceedings
     - Wales
     - All proceedings
     - XML export
     - `Here `_
   * - International Environment Agency
     - World
     - Most environmental treaties and agreements
     - List of .txt files on the website
     - `Here `_
   * - Bills and Acts
     - Ireland
     - All
     - API
     - `Here `_
   * - All trade agreements
     - All
     - All
     - Download
     - `Here `_

Identity and Culture
^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Item
     - Country
     - Description
     - Access
     - Link
   * - National Anthems
     - World
     - 194 countries
     - Download
     - `Here `_

Presidential & Governmental Activity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Item
     - Country
     - Description
     - Access
     - Link
   * - Political speeches
     - UK
     - 8,000+ political speeches on British politics
     - HTML
     - `Here `_
   * - Official correspondence
     - UK
     - All official correspondence of PMs
     - API
     - `Here `_
   * - PM transcripts
     - Australia
     - Ministerial transcripts from the 1940s to date
     - API; xml
     - `Here `_
   * - Speeches
     - EU
     - All ECB President / VP speeches
     - Download; csv
     - `Here `_
   * - Speeches
     - Germany
     - 6,685 speeches by 71 officials, from 1984 to 2017
     - Download; xml
     - `Here `_
   * - Speeches
     - EU
     - 18,403 speeches from EU leaders from 2007 to 2015
     - API from DataVerse; raw speeches in csv, and term-document matrices in R
     - `Here `_
   * - State of the Nation
     - South Africa
     - 1990 to 2018
     - Download from Kaggle; one txt file per speech
     - `Here `_

Participative democracy
^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Item
     - Country
     - Description
     - Access
     - Link
   * - Public consultations
     - France
     - Recent public consultations
     - HTTP resolvable URIs
     - `Here `_
   * - E-petitions
     - UK
     - All official e-petitions
     - API; JSON, xml, csv, HTML
     - `Here `_

News and Media
^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Item
     - Country
     - Description
     - Access
     - Link
   * - EUvsDisinfo
     - Europe
     - News articles debunked by the European External Action Service
     - API; HTML
     - `Here `_
   * - New York Times
     - All
     - Archive metadata, books, comments, reviews, most popular articles
     - API; JSON
     - e.g. `Here `_
   * - Public debates over European integration
     - Austria, Britain, France, Germany, Sweden, and Switzerland
     - 1970s to 2012, from newspapers
     - csv, dta
     - `Here `_
   * - Public debates over globalization issues
     - Austria, Britain, France, Germany, the Netherlands, and Switzerland
     - 2004 to 2006, from newspapers
     - csv, dta
     - `Here `_
   * - Archive of Political Emails
     - Australia, Canada, France, Germany, Ireland, Italy, NZ, UK, USA
     - 348,680 emails
     - HTML
     - `Here `_
   * - News articles
     - Not specified
     - 9+ million articles, with metadata for each
     - CSV split into 1 GB zip files; download from GitHub
     - `Here `_
   * - Politwoops
     - Many countries including USA, UK and most European countries
     - Deleted tweets by public officials and politicians
     - API; JSON
     - `Here `_

Messy list of promising websites
--------------------------------

Websites that might be goldmines but would require some time to explore.

* European Language Resource Coordination

  * A lot of legal and official documents, translated and sometimes already processed, e.g. IP case law, audits, and many legal texts from EU countries (it is unclear how useful they are, but there are a *lot* of them, so some may be interesting)
  * https://elrc-share.eu

* Clarin

  * List of 24 parliamentary corpora, not all easy to access
  * https://www.clarin.eu/resource-families/parliamentary-corpora

* EveryCRSReport.com

  * Reports from the Congressional Research Service, essentially the national legislature's think tank.
  * https://www.everycrsreport.com/

* Supreme court transcripts

  * https://www.oyez.org/

Complementary text data
-----------------------

Texts that are not necessarily directly relevant to political science research but can be used for context or as a complement, e.g. for annotation.

* Wikipedia or other "ground truth" sources
* Network data
* Dictionaries, e.g. of sentiment or emotions, for applying automated dictionary methods with one click

----

US Political Science focus
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Item
     - Country
     - Description
     - Access
     - Link
   * - General Social Survey
     - US
     - The General Social Survey (GSS) monitors societal change in the US
     - Download: for SPSS, STATA
     - `Here `_
   * - The Supreme Court Database
     - US
     - Case-centered data; total rows: 13,533
     - Download: CSV, DTA (STATA), POR (SPSS), RDATA, XLSX
     - `Here `_
   * - The Supreme Court Database
     - US
     - Justice-centered data; total rows: 121,224
     - Download: CSV, DTA (STATA), POR (SPSS), RDATA, XLSX
     - `Here `_
   * - Congressional speech data
     - US
     - Congressional speech corpus with labels for whether the speaker supported or opposed, by-name references between speakers, agreement/disagreement classifier scores, and debate and related extracted information (9.8 MB, tar.gz format)
     - Download: compressed tar.gz, multiple types including CSV
     - `Here `_
   * - ANES
     - US
     - Electoral behavior, political participation, and public opinion studies: Time Series Studies, Pilot Studies, Special Studies
     - Download
     - `Here `_
   * - CorPS
     - US
     - CORPS is a corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER.
     - Request from marco.guerini[at]trentorise.eu and strappa[at]fbk.eu
     - `Here `_
   * - Congressional Record for the 43rd-114th Congresses
     - US
     - Parsed speeches and phrase counts
     - Download: zip of organized txt files
     - `Here `_
   * - GDELT
     - US
     - All events from broadcast, print, and web news from nearly every corner of every country in over 100 languages
     - Download: CSV
     - `Here `_
   * - The American Presidency Project
     - US
     - Presidential documents, papers, press, orders, memoranda, etc.
     - HTML
     - `Here `_
   * - Full text corpus data
     - US
     - 10 large corpora of English: iWeb, COCA, COHA, NOW, Coronavirus, GloWbE, TV Corpus, Movie Corpus, Soap Corpus, Wikipedia
     - Purchase raw data in 3 formats
     - `Here `_
   * - GovInfo
     - US
     - Congressional Bills; Bill Status; Bill Summaries; Commerce Business Daily; Code of Federal Regulations (Annual Edition); Electronic Code of Federal Regulations; Federal Register; United States Government Manual; House Rules and Manual; Privacy Act Issuances; Public Papers of the Presidents of the United States; Supreme Court Decisions 1937-1975 (FLITE)
     - Download: XML
     - `Here `_
   * - DIME PLUS
     - US
     - Database on Ideology, Money in Politics, and Elections: Public version 2.0
     - Download: compressed CSV
     - `Here `_
   * - Replication data for: Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach
     - US
     - Replication data
     - Download: compressed archive
     - `Here `_
   * - CONGRESSIONAL & FEDERAL - Government Web Harvests
     - US
     - The National Archives and Records Administration (NARA) web harvests (i.e. captures) of Federal Agency public web sites since 2004
     - Web harvests
     - `Here `_
   * - Congress.gov - Bill Status
     - US
     - Bill Status data includes all data from the existing Bill Summaries data set
     - XML bulk data; API
     - `Here `_