Resources
Data Sets
EmpiriST Corpus 2.0
The EmpiriST Corpus 2.0 is a manually annotated corpus consisting of German web pages and German computer-mediated communication (CMC), i.e. written discourse. Examples of CMC genres include monologic and dialogic tweets, social and professional chats, threads from Wikipedia talk pages, WhatsApp interactions and blog comments.
The dataset was originally created by Beißwenger et al. (2016) for the EmpiriST 2015 shared task and featured manual tokenization and part-of-speech tagging. Subsequently, Rehbein et al. (2018) incorporated the dataset into their harmonised testsuite for POS tagging of German social media data, manually added sentence boundaries, and automatically mapped the part-of-speech tags to UD POS tags. In our own annotation efforts (Proisl et al., in preparation), we manually normalized and lemmatized the data and converted the corpus into a “vertical” format suitable for importing into the Open Corpus Workbench, CQPweb, SketchEngine, or similar corpus tools.
Normalization and lemmatization added in collaboration with Thomas Proisl, Natalie Dykes, Philipp Heinrich, and Stefan Evert.
–> See the data set (and the description) on GitHub.
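For readers unfamiliar with the “vertical” format mentioned above: corpus tools such as the Open Corpus Workbench generally expect one token per line, with annotations in tab-separated columns and XML-style tags on their own lines marking regions such as texts and sentences. The following is a minimal sketch of that idea only. The column order (word, POS tag, lemma), the `<text>` and `<s>` tag names, the function `to_vertical` and the toy sentence are assumptions made for illustration and do not describe the actual layout of the EmpiriST release.

```python
from typing import Iterable, List, Tuple

# Hypothetical column layout: (word, pos, lemma). The real corpus may differ.
Token = Tuple[str, str, str]


def to_vertical(text_id: str, sentences: Iterable[List[Token]]) -> str:
    """Render sentences as verticalized text.

    One token per line, annotations separated by tabs, and XML-style tags
    on their own lines to mark the text and sentence regions.
    """
    lines = [f'<text id="{text_id}">']
    for sentence in sentences:
        lines.append("<s>")
        for token in sentence:
            lines.append("\t".join(token))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


# Invented toy sentence, not taken from the corpus.
example = [[("Hallo", "ITJ", "hallo"), ("Welt", "NN", "Welt"), ("!", "$.", "!")]]
vertical = to_vertical("demo_001", example)
print(vertical, end="")
```

A file produced this way could then be passed to a corpus encoder, with each column declared as a positional attribute and each tag as a structural attribute.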
Corpora
GeRedE: A Corpus of German Reddit Exchanges
GeRedE is a 270-million-token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Created in collaboration with Andreas Blombach, Natalie Dykes, Philipp Heinrich and Thomas Proisl.
–> See the data set (and the description) on GitHub.
Albanian Corpus (AlCo)
The Albanian Corpus (AlCo) contains a hundred million word tokens (text words), making it the first Albanian corpus of this size. The corpus covers different domains of language and contains different text types; it is a reference corpus. The work is still in progress: some texts still need to be replaced or recategorized. Since 2015, the corpus has been annotated with a morpho-syntactic tagset of 77 tags. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.
AlCo-Press (2017-2019)
The Albanian Corpus of Press Texts (AlCo-Press) contains around 32 million word tokens (text words). The corpus is annotated in the same way as AlCo. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.
AlCo-Tweets (sample 2020-2021)
The Albanian Corpus of tweets contains around 1 million word tokens (text words), covering both standard and non-standard Albanian. The corpus is not annotated yet. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.
AlCo-Tweets (selected users 2020-2021)
The Albanian Corpus of tweets from selected users contains around 10 million word tokens (text words) in standard Albanian. The corpus is not annotated yet. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.
AlCo-Literature
[Project just started]: The Albanian Corpus of literary texts (AlCo-Literature) contains around 2.5 million word tokens (text words). The corpus contains literary works (prose) by the most famous Albanian authors. The corpus is not annotated yet. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.
Buzuku (1555) Corpus
The Buzuku Corpus contains the text of the “Missale” (1555) by Gjon Buzuku. The corpus is not annotated.