Kvantitativní určení lexikálního jádra jazyka

Cvrček, Václav

Article details

Journal

Časopis pro moderní filologii (Journal for Modern Philology)

2014 | 96 | 1 | 9-26

Article title

Kvantitativní určení lexikálního jádra jazyka

Authors

Václav Cvrček

Title variants

EN

QUANTITATIVE DELIMITATION OF THE CORE OF A LANGUAGE

Languages of publication

CS

Abstracts

EN

Hapax legomena, i.e. word or lemma types that occur in a corpus only once, are usually overlooked in language description. They cannot be used systematically in the vast majority of analyses because they provide no basis for generalization. On the other hand, the overall number of hapaxes can serve as an indicator of the lexical periphery of the language system. This paper suggests that the ratio of the number of hapaxes to the number of all types, tracked as the corpus grows (hapax-type ratio, HTR), can be used to delimit the lexical core of a language. Previous research (Fengxiang 2010) has shown that the HTR curve for English has the shape of a pipe or chibouque, which means that the rates at which new hapaxes and new types emerge while a corpus is being built differ before and after the corpus reaches a certain size. In a hypothetical small corpus (a few sentences), the hapax-type ratio equals one (each word type is also a hapax). As texts are added to the corpus (up to a few million words), the ratio decreases from its maximum value (= 1) to a local minimum: the number of new words, including hapaxes, keeps increasing, but the majority of added tokens are new instances of words already present in the corpus. Beyond this turning point, extending the corpus increases the ratio because the number of hapaxes grows faster than the number of non-hapaxes (i.e. types with a frequency higher than one). This empirical finding, tested on corpora of Czech and English, brings us closer to an exact determination of the range of the core lexicon. From it we can then deduce the approximate size of a corpus sufficient for compiling a dictionary that covers the core lexicon.
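The hapax-type ratio described in the abstract can be illustrated with a minimal sketch. It assumes the corpus is already tokenized (or lemmatized) into a list of strings. The function names `hapax_type_ratio_curve` and `htr_minimum` and the sampling parameter `step` are hypothetical, introduced here for illustration only; they are not taken from the article, and the article's own procedure for locating the turning point may differ.

```python
from collections import Counter


def hapax_type_ratio_curve(tokens, step):
    """Return (corpus_size, HTR) pairs sampled every `step` tokens.

    HTR = number of types with frequency 1 / number of all types,
    computed on the first `corpus_size` tokens of the corpus.
    """
    counts = Counter()
    hapaxes = 0  # running number of types seen exactly once
    curve = []
    for size, token in enumerate(tokens, start=1):
        counts[token] += 1
        if counts[token] == 1:
            hapaxes += 1   # a new type starts out as a hapax
        elif counts[token] == 2:
            hapaxes -= 1   # a second occurrence ends hapax status
        if size % step == 0 or size == len(tokens):
            curve.append((size, hapaxes / len(counts)))
    return curve


def htr_minimum(curve):
    """Return the sampled (corpus_size, HTR) point with the lowest HTR.

    Assumption: the global minimum of the sampled curve stands in for
    the turning point; real curves are noisy and may need smoothing.
    """
    return min(curve, key=lambda point: point[1])


if __name__ == "__main__":
    toy_corpus = "a b c a b d".split()
    for size, htr in hapax_type_ratio_curve(toy_corpus, step=1):
        print(size, round(htr, 3))
```

On a toy input of this size the ratio only shows the initial value of one and the subsequent drop; the rise after the turning point reported in the abstract would only be expected in corpora of several million words.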

Keywords

CS

korpus; kvantitativní lingvistika; hapax legomenon; lexikon; token-type poměr

EN

corpus; quantitative linguistics; hapax legomenon; lexicon; token-type ratio

Year

2014

Volume

96

Issue

1

Pages

9-26

Contributors

author

Václav Cvrček

vaclav.cvrcek@ff.cuni.cz

Ústav Českého národního korpusu, FFUK | nám. Jana Palacha 2, 116 38 Praha 1, Czech Republic

Identifiers

YADDA identifier

bwmeta1.element.desklight-41c6aa51-3179-4977-9a89-4491e68a6292
