Resources

Data Sets

EmpiriST Corpus 2.0

The EmpiriST Corpus 2.0 is a manually annotated corpus consisting of German web pages and German computer-mediated communication (CMC), i.e. written discourse. Examples of CMC genres are monologic and dialogic tweets, social and professional chats, threads from Wikipedia talk pages, WhatsApp interactions, and blog comments.

The dataset was originally created by Beißwenger et al. (2016) for the EmpiriST 2015 shared task and featured manual tokenization and part-of-speech tagging. Subsequently, Rehbein et al. (2018) incorporated the dataset into their harmonised test suite for POS tagging of German social media data, manually added sentence boundaries, and automatically mapped the part-of-speech tags to UD POS tags. In our own annotation efforts (Proisl et al., in preparation), we manually normalized and lemmatized the data and converted the corpus into a “vertical” format suitable for importing into the Open Corpus Workbench, CQPweb, Sketch Engine, or similar corpus tools.
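
A “vertical” format in the Corpus Workbench tradition is a plain-text layout with one token per line, tab-separated annotation columns, and XML-style tags on their own lines for structural units such as texts and sentences. The Python sketch below illustrates the idea. The column order (word form, normalized form, POS tag, lemma), the function name `to_vertical`, the `<text>` and `<s>` tag names, and the example sentence are assumptions chosen for illustration; they are not taken from the released corpus, whose actual layout is documented in the GitHub repository.

```python
from typing import Iterable, List, Tuple
from xml.sax.saxutils import quoteattr

# Hypothetical token record: (word form, normalized form, POS tag, lemma).
# The real column layout of the corpus may differ.
Token = Tuple[str, str, str, str]


def to_vertical(text_id: str, sentences: Iterable[List[Token]]) -> str:
    """Render one text as verticalized output.

    Each token occupies one line with tab-separated columns; the text and
    each sentence are wrapped in XML-style tags on lines of their own.
    """
    lines = [f"<text id={quoteattr(text_id)}>"]
    for sentence in sentences:
        lines.append("<s>")
        for token in sentence:
            lines.append("\t".join(token))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


# Invented example sentence (not from the corpus), tagged with STTS-style tags.
example = [[
    ("Hab", "Habe", "VAFIN", "haben"),
    ("ich", "ich", "PPER", "ich"),
    ("auch", "auch", "ADV", "auch"),
]]

print(to_vertical("demo_001", example), end="")
```

A file in this shape can then be passed to the encoding step of the corpus tool in use, with each column declared as a positional attribute and each tag as a structural attribute.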

Normalization and lemmatization added in collaboration with Thomas Proisl, Natalie Dykes, Philipp Heinrich, and Stefan Evert.

–> See the data set (and the description) on GitHub.

Corpora

GeRedE: A Corpus of German Reddit Exchanges

GeRedE is a 270-million-token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Created in collaboration with Andreas Blombach, Natalie Dykes, Philipp Heinrich, and Thomas Proisl.

–> See the data set (and the description) on GitHub.

Albanian Corpus (AlCo)

The Albanian Corpus (AlCo) contains 100 million word tokens (text words), making it the first Albanian corpus of this size. The corpus covers different domains of language and contains different text types; it is a reference corpus. The work is still in progress: some texts still need to be replaced or recategorized. Since 2015, the corpus has been annotated with a morphosyntactic tagset of 77 tags. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCo-Press (2017-2019)

The Albanian Corpus of Press Texts (AlCo-Press 2017-2019) contains around 32 million word tokens (text words). The corpus is annotated in the same way as AlCo. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCo-Press (2021-2022)

The Albanian Corpus of Press Texts (AlCo-Press 2021-2022) contains around 60.7 million word tokens (text words). The corpus is annotated in the same way as AlCo. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCo-Tweets (selected users 2020-2021)

The Albanian Corpus of tweets from selected users contains around 10 million word tokens (text words), written in standard Albanian. The corpus is not yet annotated. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCo-Tweets (sample 2020-2021)

The Albanian Corpus of tweets contains around 1 million word tokens (text words), covering both standard and non-standard Albanian. The corpus is not yet annotated. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCo-Literature

The Albanian Corpus of literary texts (AlCo-Literature) contains around 2.5 million word tokens (text words). It comprises literary works (prose) by the most famous Albanian authors. The corpus is not yet annotated. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCo-Literature 2

The Albanian Corpus of literary texts (AlCo-Literature 2) contains around 7.8 million word tokens (text words). It comprises literary works (prose) by the most famous Albanian authors. The corpus is annotated. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCo-Parliament (1991-2020)

The Albanian Corpus of Parliament texts and debates (AlCo-Parliament 1991-2020) contains around 51.2 million word tokens (text words). The corpus is annotated. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

Buzuku (1555) Corpus

The Buzuku Corpus contains the text of the “Missale” (1555) by Gjon Buzuku. The corpus is not annotated.

Besim Kabashi

[ Human Language Processing and Technology · Knowledge Resources · Intelligence ]
