See the data set (and the description) on GitHub.
Albanian Corpus (AlCo)
The Albanian Corpus (AlCo) contains one hundred million word tokens (text words), making it the first Albanian corpus of this size. It is a reference corpus: it covers different domains of language and contains different text types. Work on the corpus is still in progress, and some texts still need to be replaced or recategorized. Since 2015, the corpus has been annotated with a morpho-syntactic tagset of 77 tags. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.
The Albanian Corpus of Press Texts (AlCo) contains around 32 million word tokens (text words). It is annotated in the same way as the Albanian Corpus described above. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.
Buzuku (1555) Corpus
The Buzuku Corpus contains the text of the "Missale" (1555) by Gjon Buzuku. The corpus is not annotated.