2025 | 107 | 1 | 7-18

Article title

Pasti dat: srovnatelnost dat jazykových korpusů

Title variants

EN
Data Traps: Comparability of Language Corpus Data

Languages of publication

CS

Abstracts

Although corpus data appear unambiguous, they reflect differences in corpus composition, in how the synchronic period of a given language is defined, in linguistic traditions, in orthography, and in other factors. We focus on the most common factors that undermine the comparability of data in parallel corpora, such as inconsistent lemmatization, tagging, and tokenization, and illustrate them with examples from Czech, German, and Russian. For example, Russian and Czech verb forms and lemmas cannot be compared directly: in Russian, unlike in Czech, reflexive and non-reflexive forms are assigned to different lemmas, and the verb lemma includes participles, whereas the corresponding Czech forms are tagged as adjectives, in accordance with Czech philological tradition. Differing approaches to tokenization also affect the overall size of a corpus and thus, indirectly, the comparability of relative frequencies.
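
The effect of tokenization on relative frequencies can be illustrated with a minimal sketch. It assumes that relative frequency is expressed as instances per million tokens (ipm), a common convention that the abstract does not itself specify. The function name and all figures are hypothetical and are not taken from the article.

```python
def per_million(count: int, corpus_tokens: int) -> float:
    """Relative frequency as instances per million tokens (ipm)."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return count * 1_000_000 / corpus_tokens


# Hypothetical figures: the same texts and the same number of hits,
# but two tokenization schemes that yield different corpus sizes.
hits = 500
tokens_scheme_a = 1_000_000  # assumed: punctuation not counted as tokens
tokens_scheme_b = 1_250_000  # assumed: punctuation counted as tokens

ipm_a = per_million(hits, tokens_scheme_a)
ipm_b = per_million(hits, tokens_scheme_b)

print(f"scheme A: {ipm_a:.1f} ipm")
print(f"scheme B: {ipm_b:.1f} ipm")
```

In this sketch the absolute count is identical under both schemes, yet the relative frequency differs because the denominator changes. Relative frequencies from corpora with different tokenization would therefore need to be normalized against a comparable token count before being compared.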

Identifiers

YADDA identifier

bwmeta1.element.desklight-acf9376b-bbbe-472e-8170-336961d8dfc8