General information

The text base of the «National Corpus of the Kazakh Language» contains 30 million words. Of these, texts totalling 14 million word usages carry metatext markup with 16-21 parameters (author of the text, age of the author, text title, style, genre, type of text, source, etc.). The texts were collected from five styles of the Kazakh language (fiction, scientific, journalistic, business, and conversational).

Texts in the fiction style comprise works by Kazakh poets and writers. They are divided by genre into prose and poetry and form a separate subcorpus.

Journalistic texts comprise articles published in newspapers and magazines. They are collected in the subcorpus of Kazakh newspaper texts.

Texts in the scientific style are drawn mainly from works in the sciences and humanities, and texts in the business style from business documents; both are included in the subcorpus base.

Colloquial texts are taken from newspapers, magazines, and interviews published on websites. The corpus also includes educational texts.

Metatext markup has been added to the collected texts of the National Corpus. Metatext information appears in a window when the cursor points to the author shown above the sentences, and when you left-click while searching for the desired word.
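
As an illustration only, metatext markup of this kind can be pictured as a record of fields attached to each text. The sketch below is not the corpus's actual data model: the class name, the field names, and the sample records are all assumptions, chosen from the parameters listed earlier (author, age of the author, title, style, genre, source).

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    # Hypothetical field names, modelled on the metatext parameters
    # named in the description; the real corpus uses 16-21 parameters.
    author: str
    author_age: int
    title: str
    style: str
    genre: str
    source: str
    body: str


def filter_by(records, **criteria):
    """Return the records whose metatext fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Invented placeholder records, not real corpus entries.
records = [
    TextRecord("Author A", 45, "Sample novel", "fiction", "prose", "book", "text one"),
    TextRecord("Author B", 30, "Sample poem", "fiction", "poetry", "book", "text two"),
    TextRecord("Author C", 52, "Sample article", "journalistic", "article", "newspaper", "text three"),
]

fiction_poetry = filter_by(records, style="fiction", genre="poetry")
print([r.title for r in fiction_poetry])
```

Storing the parameters as named fields is what would make it possible to narrow a search to, say, one style or one genre before looking for a word.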

The corpus runs the following computer programs:

Retrieval of the sentences in which the search word occurs (concordance);
Automatic division of any word form in the concordance into root and affix (lemmatization);
A program for applying linguistic markup:

morphological markup;
word-formative markup;
lexical markup (meaning);
phonetic markup (description of sounds and automatic division into syllables);
morpho-semantic markup;

A search system over the metatext and linguistic markup.

This information is also displayed in a window when the cursor points to the search word and when you click the left mouse button.
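
The concordance step listed above can be sketched as a simple keyword-in-context lookup. This is a minimal illustration, not the program the corpus actually runs: the function name and the three sample sentences are invented, and the matching is limited to the exact word form, ignoring case.

```python
import re


def concordance(sentences, word):
    """Return (left context, match, right context) for each occurrence of word."""
    # \b is Unicode-aware for str patterns, so it also delimits Cyrillic words.
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = []
    for sentence in sentences:
        for m in pattern.finditer(sentence):
            left = sentence[:m.start()].strip()
            right = sentence[m.end():].strip()
            hits.append((left, m.group(0), right))
    return hits


# Invented sample sentences, not taken from the corpus.
sentences = [
    "Бала кітап оқыды.",
    "Кітап үстелде жатыр.",
    "Мен мектепке бардым.",
]

for left, match, right in concordance(sentences, "кітап"):
    print(f"{left:>10} [{match}] {right}")
```

Because only the exact form is matched, an inflected form such as "кітапты" would not be found; finding every form of a word is where the division into root and affix becomes necessary.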

Thus, when a word is searched for in the National Corpus, the screen displays a list of metatext-annotated texts in which the search word occurs, that is, examples. In addition, linguistic information about the same word is shown in separate cells on the other side of the screen.

The site is available to everyone!

