
Results found: 6


Search results

Search in the keywords: lemmatization
Język Polski | 2012 | vol. 92 | issue 1 | 11–19
EN
The paper describes how an electronic corpus of the Polish dialect spoken in the village of Maćkowce in Ukraine was built. The FonOrt software package, written by M. Wieczorek, was created for this purpose. The texts, transcribed phonetically in MS Word files, were converted to XML and then lemmatized. First, each word form in the text (a string of characters) was automatically assigned a corresponding string that a morphological analyzer of Polish can interpret, usually the equivalent standard Polish form (e.g. kubita → kobieta, chudz’ima → chodzimy). Next, each form obtained in this way was assigned a lemma using the library of the Morfeusz SIaT analyzer by M. Woliński. To lemmatize lexical borrowings and Polish dialectal words (extracted from the texts manually), a list of their word forms was generated automatically. In the resulting corpus, each token is annotated with its lemma and with additional information such as the speaker. The corpus can be searched with the Poliqarp tool.
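The two-step lemmatization described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the FonOrt code: the tables and the function `lemmatize` are hypothetical, the dictionary `ANALYZER` merely stands in for a morphological analyzer such as Morfeusz SIaT, and only the two dialect forms are taken from the abstract (written here with a plain apostrophe).

```python
# Step 1: dialect form -> standard Polish form (the two pairs from the abstract).
DIALECT_TO_STANDARD = {
    "kubita": "kobieta",
    "chudz'ima": "chodzimy",
}

# Step 2: standard form -> lemma. A toy stand-in for a morphological analyzer;
# a real pipeline would query the analyzer instead of a dictionary.
ANALYZER = {
    "kobieta": "kobieta",
    "chodzimy": "chodzić",
}

# Dialect-only words collected by hand: word form -> lemma.
# Left empty here because the abstract gives no examples (assumption).
DIFFERENTIAL_FORMS = {}


def lemmatize(token):
    """Return (standard_form, lemma) for a token, or (None, None) if unknown."""
    # Dialect-only words bypass the analyzer and use the hand-built list.
    if token in DIFFERENTIAL_FORMS:
        return token, DIFFERENTIAL_FORMS[token]
    # Otherwise map to the standard form (or keep the token unchanged)...
    standard = DIALECT_TO_STANDARD.get(token, token)
    # ...and look the standard form up in the analyzer stand-in.
    lemma = ANALYZER.get(standard)
    if lemma is None:
        return None, None
    return standard, lemma
```

Keeping the intermediate standard form alongside the lemma mirrors the abstract's description, in which the normalized form is what makes the token interpretable by the analyzer.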
EN
Web-Application for the Presentation of Bilingual Corpora (Focusing on Bulgarian as One of the Two Paired Languages)
This paper briefly presents a web application for the presentation of bilingual aligned corpora, focusing on Bulgarian as one of the two paired languages. It concentrates on the description of the software tools and the user interface. The software is being developed at IMI-BAS and will be hosted on a server there. Some examples of using the web application to present a Bulgarian-Polish aligned corpus are included.
EN
This paper introduces some major conceptual enhancements to the morphological annotation of the SYN series corpora of the Czech National Corpus. Apart from minor changes in tokenization and in the positional tagset, three major conceptual changes have been applied which affect the representation of various lexical and grammatical patterns. In the paper, we present the actual impact of the changes in linguistic data and search for possibilities in three linguistic areas. First, the treatment of phonic, graphemic, and morphological variants via a two-tier lemma structure is discussed; second, a new approach to periphrastic verb forms, auxiliaries, participles and the interpretation of verbal grammatical categories through a new attribute, called verbtag, is explained; and third, a complex multi-value treatment of multiword tokens is introduced.
EN
The objective of the paper is to describe the principles for building the one-million-word DIA1900 Corpus, which consists of Czech texts published between 1851 and 1900 and is designed to be both balanced and representative. Two main goals determined the methods of corpus building and the decision to develop new tools tailored to the special needs of 19th-century Czech: 1) to present the variability of Czech in the second half of the 19th century (including spelling, morphology, and word formation) and 2) to link the detected variants to the appropriate lemmas. The paper presents the phases of text processing, including transcription and manual pre-annotation, as well as the construction of a large morphological dictionary and the selection of a suitable set of paradigms. Further sections focus on annotation, morphological tagging, and manual disambiguation. The objective was to create a gold standard intended for use in the automatic annotation of both the DIA1900 corpus and the planned corpus of Czech texts from the years 1800–1850.

Nová automatická morfologická analýza češtiny (A New Automatic Morphological Analysis of Czech)

EN
A detailed morphological description of word forms is one of the necessary conditions for the successful automatic processing of linguistic data in any language. This paper presents a project to produce a new description of Czech morphology, with particular attention to the planned changes in the tagset. The key changes are as follows: 1) the unambiguous description of variants; 2) the concept of a multiple lemma; 3) the revision of part-of-speech definitions.
EN
This paper introduces the description of Old Czech common nouns developed and used in a tool for tagging and lemmatizing common nouns occurring in transcribed digital editions of Old Czech texts. This description consists of four parts: the first features an overview of all declension type endings (approx. 100 declension patterns), the second part analyses alternations in the morphological basis accompanying declension (approx. 120 types of alternations), the third part deals with formal changes connected mainly with the language’s historical development (approx. 100 formal changes) and, finally, the fourth part contains a list of lemmas extracted from modern dictionaries of Old Czech (approx. 29 000 lemmas). Furthermore, the paper introduces the software developed and used for this purpose, namely i) the tool which makes it possible a) to generate word forms and subsequently search for multiple word forms in the texts at once, b) to create lists of word forms filtered by sequences of characters occurring at the end of the word forms, ii) the tool for assigning a declension pattern to a lemma, and iii) the tool enabling work with large databases. Finally, the paper describes two applications developed on the basis of Old Czech common noun description, i.e. i) a database of Old Czech common noun declension patterns connected with Old Czech dictionaries and the Old Czech text bank, ii) a tool for generating word forms, which is used for the lemmatization and tagging of Old Czech texts.
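The idea of generating word forms from lemmas and declension patterns, filtering them by ending, and using them for lemmatization can be sketched as below. This is only an illustration under stated assumptions: the names `PATTERNS`, `LEXICON`, `generate_forms`, `build_form_index`, and `forms_ending_with` are hypothetical, and the single pattern uses placeholder endings rather than an attested Old Czech paradigm.

```python
from collections import defaultdict

# Hypothetical declension pattern: how many characters to strip from the lemma
# and which endings to attach. Placeholder values, not an attested paradigm.
PATTERNS = {
    "demo": {"strip": 0, "endings": ["", "a", "u", "em"]},
}

# Lemma -> declension pattern (illustrative assignment).
LEXICON = {"hrad": "demo"}


def generate_forms(lemma):
    """Generate all word forms of a lemma from its declension pattern."""
    pattern = PATTERNS[LEXICON[lemma]]
    stem = lemma[: len(lemma) - pattern["strip"]]
    return [stem + ending for ending in pattern["endings"]]


def build_form_index(lexicon):
    """Map every generated word form to the set of lemmas it may belong to."""
    index = defaultdict(set)
    for lemma in lexicon:
        for form in generate_forms(lemma):
            index[form].add(lemma)
    return index


def forms_ending_with(forms, suffix):
    """Keep only the word forms that end in the given character sequence."""
    return [form for form in forms if form.endswith(suffix)]
```

A form index of this kind maps each form to a set of lemmas because one surface form may be generated from several lemmas, so disambiguation would still be a separate step.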