Name
|
|
Creator
|
-
Koshchanka, Uladzimir
-
Bulojchyk, Alies
|
Description
|
-
The Belarusian N-corpus is a corpus of texts in modern Belarusian with structural and grammatical marking and certification. The corpus consists of the following subcorpora: 1) Basic corpus (14.8 thousand texts, 43.4 million word usages); 2) Concordance of Belarusian of the 19th century (515 texts, 278 thousand word usages); 3) Belarusian Wikipedia corpus (287 thousand texts, 126 million word usages); 4) Translations corpus (1.22 thousand texts, 6.91 million word usages); 5) Unprocessed texts corpus (68.7 thousand texts, 892 million word usages); 6) Biblical corpus (16 Bible translations into Belarusian and other languages (Latin, Jewish, Ukrainian, Polish) for comparison). In total, the corpus comprises 372 thousand texts and 1.07 billion word usages. The basic corpus contains texts in 5 styles: artistic, scientific, publicistic, official, and religious. Within each style, texts are classified into genres; for example, the artistic style is divided into the following genres: narrative, short novel, ballad, fable, verse, fairy tale, ode, poem, play, novel, plot. As in most Slavic corpora, the Belarusian N-corpus encodes morphological (grammatical) information: initial word forms and grammatical characteristics. Grammatical marking of the corpus relies on the Lexical and Grammatical Base, a collection of words with morphological and other tags. Each paradigm header provides the identification number of the paradigm, the initial form, and the grammatical feature of the token. Where necessary, additional information is recorded: government (for verbs), meaning, and remarks. Each declensional form has its own characteristics. The source of the word or word form, stress, spelling, and non-canonical forms are also indicated. To date, the Lexical and Grammatical Base has approximately 304 thousand paradigms and more than 4.4 million word forms.
|
Collection
|
|
Language
|
|
Resource type
|
|
Data provider
|
|