.. _Corpora:
#################
Corpora
#################
This page lists corpora relevant to political science research, grouped by topic. Each entry gives the country covered, a short description, the access method and a link to the source, with a note if the corpus is not free. We are working with many of these to develop Texti.
Parties and elections
^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
* - Item
- Country
- Description
- Access
- Link
* - Manifesto Project
- 51 inc. OECD
- All political manifestos from the first democratic election onwards.
- API; Stata, SPSS, csv, xlsx
- `Here `_
* - Speeches
- UK
- Speeches from party leaders from 1895 to today
- HTML on site
- `Here `_
* - Regional manifesto
- Spain
- 1980 to 2019, all regional parties
- Download
- `Here `_
* - Regional manifesto
- Wales and Scotland
- 1999 and 2016
- Download
- `Here `_
* - Regional manifesto
- Italy
- Fragmented depending on region
- Download
- `Here `_
Parliament Activity
^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
* - Item
- Country
- Description
- Access
- Link
* - Parliamentary Questions Answered
- UK
- 278428 questions; csv
- API
- `Here `_
* - EP Plenary
- European Union
- 1997 to 2019
- HTTP resolvable URIs
- `Here `_
* - Parliament Debates
- France
- Debates of the 15th legislature
- HTTP resolvable URIs; XML
- `Here `_
* - Lords Written Questions
- UK
- 52004 questions
- API; csv
- `Here `_
* - Commons Written Questions
- UK
- 275929 questions
- API; csv
- `Here `_
* - Questions to the Government
- France
- Since 2017
- HTTP resolvable URIs
- `Here `_
* - Questions to the Government - without debates
- France
- Since 2017
- HTTP resolvable URIs
- `Here `_
* - Written questions to the Government
- France
- Since 2017
- HTTP resolvable URIs
- `Here `_
* - Parliamentary Debates on Europe
- France
- 2002 to 2012
- HTTP resolvable URIs
- `Here `_
* - Parliamentary speeches
- Austria, Czech Republic, Germany, Denmark, Netherlands, NZ, Spain, Sweden, UK, Ireland
- 21 to 32 years of data
- API on DataVerse; full-text vectors in rds
- `Here `_
* - Parliament Rules
- UK
- 1811 to 2019
- Download
- `Here `_
* - Parliament Rules
- Ireland
- 1922 to 2020
- Download
- `Here `_
* - Debates and Replies to Questions
- Ireland
- All
- API
- `Here `_
* - Senate "Dossiers Legislatifs"
- France
- Documents discussed since 1977
- Download
- `Here `_
* - Amendments by the Senate
- France
- Amendments since 2001
- Download
- `Here `_
* - Lords Bill Amendments
- UK
- 11727 Amendments
- API
- `Here `_
* - Questions to the Government (Senate)
- France
- Since 1978
- Download
- `Here `_
* - Research Briefings
- UK
- 9739 briefings
- API; csv limited to 500 records
- `Here `_
* - Proceedings
- European Union
- 1996-2011
- Download, xml
- `Here `_
Legislative Documents
^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
* - Item
- Country
- Description
- Access
- Link
* - All legislation
- European Union
- Summaries of EU legislation (a full corpus exists, but under an unsuitable license)
- HTML on site (can email Dimiter Toshkov for ``Python`` script)
- `Here `_
* - Trade agreements
- European Union
- All free trade agreements
- List of linked PDFs
- `Here `_
* - Bills
- UK
- All bills since 2007
- API
- `Here `_
* - All Legal Texts
- France
- Constitution, laws and decrees, court rulings, treaties (in French and translated)
- Downloadable + beta API
- `Here `_
* - Legislation
- Wales
- All Bills, Acts, Marshalled lists
- XML export
- `Here `_
* - The Record of Proceedings
- Wales
- All proceedings
- XML export
- `Here `_
* - International Environment Agency
- World
- Most environmental treaties and agreements
- List of .txt on the website
- `Here `_
* - Bills and Acts
- Ireland
- All
- API
- `Here `_
* - All trade agreements
- All
- All
- Download
- `Here `_
Identity and Culture
^^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
* - Item
- Country
- Description
- Access
- Link
* - National Anthems
- World
- 194 countries
- Download
- `Here `_
Presidential & Governmental Activity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
* - Item
- Country
- Description
- Access
- Link
* - Political speeches
- UK
- 8000+ political speeches on British Politics
- HTML
- `Here `_
* - Official correspondence
- UK
- All official correspondence of PMs
- API
- `Here `_
* - PM transcripts
- Australia
- Ministerial transcripts from 1940s to date
- API; xml
- `Here `_
* - Speeches
- EU
- All ECB President / VP speeches
- Download; csv
- `Here `_
* - Speeches
- Germany
- 6,685 speeches by 71 officials, from 1984 to 2017
- Download, xml
- `Here `_
* - Speeches
- EU
- 18,403 speeches from EU leaders from 2007 to 2015
- API from DataVerse; csv raw speeches, and term-document matrices in R
- `Here `_
* - State of the Nation
- South Africa
- 1990 to 2018
- Download from Kaggle; txt per speech
- `Here `_
Participative democracy
^^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
* - Item
- Country
- Description
- Access
- Link
* - Public consultations
- France
- Recent public consultations
- HTTP-resolvable URIs
- `Here `_
* - E-petitions
- UK
- All official e-petitions
- API; JSON, xml, csv, HTML
- `Here `_
News and Media
^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
* - Item
- Country
- Description
- Access
- Link
* - EUvsDisinfo
- Europe
- News articles debunked by the European External Action Service
- API; HTML
- `Here `_
* - New York Times
- All
- Archive metadata, books, comments, reviews, most popular articles
- API; JSON
- e.g. `Here `_
* - Public debates over European integration
- Austria, Britain, France, Germany, Sweden, and Switzerland
- 1970s to 2012 from newspapers
- csv, dta
- `Here `_
* - Public debates over globalization issues
- Austria, Britain, France, Germany, the Netherlands, and Switzerland
- 2004-2006 from newspapers
- csv, dta
- `Here `_
* - Archive of Political Emails
- Australia, Canada, France, Germany, Ireland, Italy, NZ, UK, USA
- 348,680 emails
- HTML
- `Here `_
* - News articles
- Not specified
- 9+ million articles and metadata for each
- CSV split in 1GB zip files, download from GitHub
- `Here `_
* - Politwoops
- Many countries including USA, UK and most European countries
- Deleted tweets by public officials and politicians
- API; JSON
- `Here `_
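The News articles entry above is distributed as CSV files packed in large zip archives, so it can be convenient to read rows straight from the archive rather than unpacking everything first. The following is a minimal sketch using only the Python standard library. The function name ``iter_csv_rows`` and the column names ``title`` and ``publication`` are assumptions made for illustration, not the actual layout of that corpus.

```python
import csv
import io
import zipfile


def iter_csv_rows(zip_path, encoding="utf-8"):
    """Yield each row of every CSV member of a zip archive as a dict.

    zip_path may be a filesystem path or a binary file-like object.
    Rows are streamed, so the archive is never fully loaded into memory.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip READMEs and other non-CSV members
            with archive.open(name) as raw:
                # Wrap the binary stream so the csv module receives text.
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                yield from csv.DictReader(text)


# Build a tiny in-memory archive with hypothetical columns to show usage.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("part1.csv", "title,publication\nBudget vote,Example Daily\n")
buffer.seek(0)

rows = list(iter_csv_rows(buffer))
print(rows)
```

With a real download, the in-memory buffer would be replaced by the path to one of the zip files. Long article bodies may exceed the ``csv`` module's default field size limit, which ``csv.field_size_limit`` can raise.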
Messy list of promising websites
--------------------------------
Websites that might be goldmines but would require some time to explore.
* European Language Resource Coordination

  * Many legal and official documents, translated and sometimes already processed, e.g. IP case law, audits and a large number of legal texts from EU countries (usefulness unclear, but there are a *lot* of them, so some may be interesting)
  * https://elrc-share.eu

* Clarin

  * List of 24 parliamentary corpora, not all easy to access
  * https://www.clarin.eu/resource-families/parliamentary-corpora

* EveryCRSReport.com

  * Reports from the Congressional Research Service, essentially the national legislature's think tank.
  * https://www.everycrsreport.com/

* Supreme Court transcripts

  * https://www.oyez.org/

Complementary text data
-----------------------
Texts that are not necessarily relevant to political science research on their own but can provide context for, or complement, the corpora above, e.g. for annotation.
* Wikipedia or other "ground truth" sources
* Network data
* Dictionaries, e.g. sentiment or emotion lexicons, so that automated dictionary methods can be applied with one click
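
As an illustration of the dictionary methods mentioned in the last item, the sketch below scores a text by the share of its tokens that appear in each category's word list. It is a minimal example under stated assumptions: ``TOY_DICTIONARY`` is a made-up word list for demonstration, not a published lexicon, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy dictionary; a real analysis would load a published lexicon.
TOY_DICTIONARY = {
    "positive": {"good", "great", "support", "welcome"},
    "negative": {"bad", "crisis", "oppose", "fail"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_scores(text, dictionary=TOY_DICTIONARY):
    """Return, per category, the share of tokens found in its word list."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    scores = {}
    for category, words in dictionary.items():
        hits = sum(counts[word] for word in words)
        # Guard against empty texts to avoid dividing by zero.
        scores[category] = hits / total if total else 0.0
    return scores


speech = "We welcome this great reform, but we oppose the cuts."
print(dictionary_scores(speech))
```

Normalizing by the token count keeps scores comparable across texts of different lengths. Exact word matching like this ignores negation and word forms, so stemming or a richer lexicon may be needed in practice.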
----
US Political Science focus
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
* - Item
- Country
- Description
- Access
- Link
* - General Social Survey
- US
- General Social Survey (GSS) monitors societal change in the US
- Download: for SPSS, STATA
- `Here `_
* - The Supreme Court Database
- US
- Case-centered data; 13,533 rows in total
- Download: CSV, DTA (STATA), POR (SPSS), RDATA, XLSX
- `Here `_
* - The Supreme Court Database
- US
- Justice-centered data; 121,224 rows in total
- Download: CSV, DTA (STATA), POR (SPSS), RDATA, XLSX
- `Here `_
* - Congressional speech data
- US
- Congressional speech corpus with labels for whether each speaker supported or opposed the legislation under debate, by-name references between speakers, the scores assigned by the authors' agreement/disagreement classifier(s), and related extracted debate information (9.8 MB, tar.gz format)
- Download: compressed tar.gz, multiple types including CSV
- `Here `_
* - ANES
- US
- Electoral behavior, political participation, and public opinion studies: Time Series Studies, Pilot Studies, Special Studies
- Download
- `Here `_
* - CorPS
- US
- CORPS is a corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER.
- Request from marco.guerini[at]trentorise.eu and strappa[at]fbk.eu
- `Here `_
* - Congressional Record for the 43rd-114th Congresses
- US
- Parsed Speeches and Phrase Counts
- Download: zip of organized txt files
- `Here `_
* - GDELT
- US
- All events from broadcast, print, and web news from nearly every corner of every country in over 100 languages
- Download: CSV
- `Here `_
* - The American Presidency Project
- US
- Presidential documents, papers, press, orders, memoranda etc
- HTML
- `Here `_
* - Full text corpus data
- US
- 10 large corpora of English: iWeb, COCA, COHA, NOW, Coronavirus, GloWbE, TV Corpus, Movie Corpus, Soap Corpus, Wikipedia
- Purchase raw data in 3 formats
- `Here `_
* - GovInfo
- US
- Congressional Bills; Bill Status; Bill Summaries; Commerce Business Daily; Code of Federal Regulations (Annual Edition); Electronic Code of Federal Regulations; Federal Register; United States Government Manual; House Rules and Manual; Privacy Act Issuances; Public Papers of the Presidents of the United States; Supreme Court Decisions 1937-1975 (FLITE)
- Download: XML
- `Here `_
* - DIME PLUS
- US
- Database on Ideology, Money in Politics, and Elections: Public version 2.0
- Download: compressed CSV
- `Here `_
* - Replication data for: Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach
- US
- Replication Data
- Download: compressed archive
- `Here `_
* - CONGRESSIONAL & FEDERAL - Government Web Harvests
- US
- The National Archives and Records Administration (NARA) web harvests (i.e. capture) of Federal Agency public web sites since 2004
- Web harvests
- `Here `_
* - Congress.gov - Bill Status
- US
- Bill Status data includes all data from the existing Bill Summaries data set
- XML bulk data; API
- `Here `_