Automatická morfologická disambiguace korpusů řady SYN: spolupráce lingvistické introspekce a strojového učení

Petkevič, Vladimír; Jelínek, Tomáš

Article details

Journal

Naše řeč (Our Speech)

2025 | 108 | 1 | 3-40

Article title

Automatická morfologická disambiguace korpusů řady SYN: spolupráce lingvistické introspekce a strojového učení

Authors

Vladimír Petkevič, Tomáš Jelínek

Content

Full texts:

http://kramerius.lib.cas.cz/search/handle/uuid:cb69bb3c-09e2-4de1-8f37-e186dc28be22 [remote]

Title variants

EN

Automatic morphological disambiguation of the SYN series corpora: combination of linguistic introspection and machine learning

Languages of publication

CS

Abstracts

EN

The paper deals with the current method of automatic morphological tagging of corpora of contemporary Czech of the SYN series and other corpora within the Czech National Corpus. From SYN2020 onwards, the corpora are annotated on the basis of an improved concept. The paper starts with a brief description of newly introduced features concerning tokenization and lemmatization (introduction of sublemmata), and of the tagging of multiword tokens (i.e. compound forms like abys, cos); a new attribute, verbtag, is also presented. Then the successive steps of the entire annotation process are described. The core of the paper provides a detailed description of the procedure of automatic morphological disambiguation, namely the combination of two methodologically different approaches: the LanGr system of linguistically motivated disambiguation rules based on introspection, and the MorphoDiTa tool based on machine learning – we call this combination a hybrid approach. Particular emphasis is laid on the detailed characterization of the LanGr system, primarily on compiling specific lists of bigrams and trigrams of lemmas and forms labeled as global identifiers and on using these identifiers in disambiguation rules. The success rate of the hybrid system compared to the success rate of the stand-alone MorphoDiTa system is also presented and plans are briefly outlined for further development of our hybrid morphological tagging approach.

Keywords

CS

automatické morfologické značkování; globální identifikátory; korpusy řady SYN; LanGr jako programovací jazyk; LanGr jako systém pravidel; lingvisticky motivovaná pravidla; morfologická disambiguace; MorphoDiTa; proces anotace; strojové učení

EN

annotation process; automatic morphological tagging; corpora of the SYN series; global identifiers; LanGr as a programming language; LanGr as a rule system; linguistically motivated rules; machine learning; MorphoDiTa; morphological disambiguation

Publisher

Institute of the Czech Language, Czech Academy of Sciences

Journal

Naše řeč (Our Speech)

Year

2025

Volume

108

Issue

1

Pages

3-40

Physical description

Document type

ARTICLE

Contributors

author

Vladimír Petkevič

Ústav pro jazyk český AV ČR, v. v. i., Letenská 123/4, 118 51 Praha 1, Czech Republic

author

Tomáš Jelínek

Ústav pro jazyk český AV ČR, v. v. i., Letenská 123/4, 118 51 Praha 1, Czech Republic
