Resources

Data Sets

EmpiriST Corpus

See the data set (and the description) on GitHub.

Corpora

Albanian Corpus (AlCo)

The Albanian Corpus (AlCo) contains one hundred million word tokens (text words), making it the first Albanian corpus of this size. It is a reference corpus: it covers different domains of language and contains different text types. Work on the corpus is still in progress, and some texts still need to be replaced or recategorized. Since 2015, the corpus has been annotated with a morpho-syntactic tagset of 77 tags. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCoPress (2017-2019)

The Albanian Corpus of Press Texts (AlCoPress) contains around 32 million word tokens (text words). It is annotated in the same way as AlCo. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

Buzuku (1555) Corpus

The Buzuku Corpus contains the text of "Missale" (1555) by Gjon Buzuku. The corpus is not annotated.

GeRedE: A Corpus of German Reddit Exchanges

GeRedE is a 270-million-token German corpus of computer-mediated communication (CMC) containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018.

Created in collaboration with Andreas Blombach, Natalie Dykes, Philipp Heinrich and Thomas Proisl.

See the data set (and the description) on GitHub.