Pasti dat: srovnatelnost dat jazykových korpusů

Giger, Markus; Kocková, Jana

Article details

Journal

Časopis pro moderní filologii (Journal for Modern Philology)

2025 | 107 | 1 | 7-18

Title variants

EN

Data Traps: Comparability of Language Corpus Data

Languages of publication

CS

Abstracts

Although corpus data appear unambiguous, they reflect differences in corpus composition, in how the synchronic period of a given language is defined, in linguistic traditions, in orthography, and in other factors. We focus on the most common factors that limit the comparability of data in parallel corpora, such as inconsistent lemmatization, tagging and tokenization, and illustrate them with examples from Czech, German and Russian. For example, corpus data on Russian and Czech verb forms and lemmas are not directly comparable: in Russian, unlike in Czech, reflexive and non-reflexive forms are assigned to different lemmas, and the verb lemma includes participles, whereas the corresponding Czech forms are tagged as adjectives, in accordance with Czech philological tradition. Differences in tokenization also affect the overall size of a corpus and thereby indirectly affect the comparability of relative frequencies.
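
The tokenization point in the abstract can be illustrated with a minimal sketch. It is not taken from the article: the two tokenizers, the toy sentence, and the function names are assumptions chosen only to show how the same text yields different corpus sizes, and therefore different relative frequencies, depending on whether punctuation marks count as tokens.

```python
import re


def tokenize_whitespace(text: str) -> list[str]:
    # Hypothetical scheme A: split on whitespace, so punctuation stays
    # attached to the preceding word and adds no tokens of its own.
    return text.split()


def tokenize_with_punctuation(text: str) -> list[str]:
    # Hypothetical scheme B: every punctuation mark is a separate token.
    return re.findall(r"\w+|[^\w\s]", text)


def per_million(hits: int, corpus_size: int) -> float:
    # Relative frequency expressed as instances per million tokens.
    return hits * 1_000_000 / corpus_size


# Toy "corpus" of one sentence, invented for illustration.
text = "Ptal se, kde je."
hits = 1  # occurrences of the word "kde", identical under both schemes

size_a = len(tokenize_whitespace(text))
size_b = len(tokenize_with_punctuation(text))

freq_a = per_million(hits, size_a)
freq_b = per_million(hits, size_b)

print(size_a, size_b)
print(freq_a, freq_b)
```

The absolute count of the word is the same in both cases; only the denominator changes, so the relative frequency drops under the scheme that counts punctuation. Real corpora differ in more ways than this, so the sketch shows the direction of the effect, not its size.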

Keywords

CS

korpusy, komparativní lingvistika, tagování, srovnatelnost dat, vyváženost korpusů

EN

corpora, comparative linguistics, tagging, data comparability, corpus balance

Publisher

Charles University in Prague, Faculty of Arts, Czech Republic

Year

2025

Volume

107

Issue

1

Pages

7-18

Identifiers

YADDA identifier

bwmeta1.element.desklight-acf9376b-bbbe-472e-8170-336961d8dfc8