Full-text resources of CEJSH and other databases are now available in the new Library of Science.
Visit https://bibliotekanauki.pl

Results found: 10


Search results

Search:
in the keywords:  corpus annotation
EN
Bulgarian sense-annotated corpus – between the tradition and novelty
The Bulgarian Sense-annotated Corpus (BulSemCor) is compiled according to the general methodology established by the SemCor project. It is a subset of the Brown Corpus of Bulgarian, semantically annotated with the corresponding synonym sets (synsets) in the Bulgarian wordnet. Unlike the bulk of sense-annotated corpora, where only (sets of) content words are annotated, in BulSemCor each lexical unit has been assigned a sense. The main contributions of the work on BulSemCor are briefly described in the paper: definition of an annotation schema, compilation of an input corpus, development of a sense-annotated corpus, and enlargement of the Bulgarian wordnet.
EN
Studies in L2 intonation and phrasal phonology are interesting not only for understanding how second languages (L2) are acquired, but also for gaining better insights into the phonology of the target language itself. Indeed, clear descriptions are still missing for many intonational and phrasal phenomena, and analysing the speech production of L2 learners may help in analysing phenomena that have remained unnoticed up to now (e.g. grammatical and prosodic constraints that operate in the case of self-repairs, and the phonological status of some prosodic events). We propose a close look at a well-designed corpus as an introduction to this research perspective. The aim of the present contribution is twofold: (i) to present the COREIL corpus, an electronic oral learner corpus that has been designed to study the acquisition of phrasal phonology and intonation in French and English as a foreign language; (ii) to explain the principles used to collect and annotate the data. The data collection protocol is developed to be as modular as possible: for instance, it can be used to gather data produced by children as well as by adults. The protocol is intended as an easy-to-use tool that can be modified by the research community. It allows for a comparison of acquisition processes along several dimensions (L1 vs. L2, differences among L1 learners, etc.).
EN
Towards an event annotated corpus of Polish
The paper presents a typology of events built on the basis of the TimeML specification adapted to the Polish language. Some changes were introduced to the definitions of the event categories, and a motivation for event categorization was formulated. The event annotation task is presented on two levels: the ontology level (language independent) and text mentions (language dependent). The various types of event mentions in Polish text are discussed. A procedure for the annotation of event mentions in Polish texts is presented and evaluated. In the evaluation, a randomly selected set of documents from the Corpus of Wrocław University of Technology (called KPWr) was annotated by two linguists and the inter-annotator agreement was calculated. The evaluation was done in two iterations. After the first evaluation we revised and improved the annotation procedure. The second evaluation showed a significant improvement in the agreement between annotators. The current work focused on the annotation and categorisation of event mentions in text. Future work will focus on the description of events with a set of attributes, arguments and relations.
EN
Application of multilingual corpus in contrastive studies (on the example of the Bulgarian-Polish-Lithuanian parallel corpus)
In this paper we present applications of a trilingual corpus in language research. Comparative and contrastive studies of Polish and Bulgarian as well as of Polish and Lithuanian have already been conducted, but to the best of our knowledge no such studies exist for Bulgarian and Lithuanian. On the one hand, it is interesting to note that two Slavic languages are compared to a Baltic language (Lithuanian). On the other hand, the three languages are marginally present in the EU because of the later accession of the three countries to the EU. The paper briefly describes the first electronic Bulgarian–Polish–Lithuanian experimental corpus, currently under development for research purposes only. We also focus on the morphosyntactic annotation of the parallel trilingual corpus according to the Corpus Encoding Standard: we present a review of the Part-of-Speech (POS) classification of the participle in the three languages (Bulgarian, Polish, and Lithuanian) in comparison with another POS, the adjective. We briefly discuss, with some examples, tagsets for corpus annotation from the point of view of possible future unification.
EN
Multilingual digital resources with Bulgarian language
The paper briefly presents Bulgarian language resources as a part of multilingual digital resources developed within the framework of several international projects, among them parallel annotated and aligned corpora, comparable corpora, morpho-syntactic specifications for corpus annotation and dictionary encoding, lexicons, lexical databases, and electronic dictionaries.
EN
Shallow syntactic annotation in the corpus of Wrocław University of Technology
In this paper we present the shallow syntactic annotation of the Wrocław University of Technology Corpus. We discuss some theoretical and practical considerations related to shallow parsing of Polish, then we present our annotation guidelines. The proposed annotation scheme includes chunking – four chunk types are defined with reference to the notions of accommodation and syntactic connotation – as well as the annotation of four inter-chunk predicate-argument relations. So far, almost 18k chunk instances and 4k relation instances have been annotated. We believe that both the corpus and the annotation guidelines will prove their applicability in the construction of automatic shallow parsers.
EN
In the present contribution, the main constitutive features of the Functional Generative Description as proposed by Petr Sgall and his collaborators are introduced together with a brief characterization of selected Czech grammatical phenomena within this framework. These phenomena include above all verbal and nominal valency and related issues and topic-focus articulation, esp. in relation to negation and presupposition. Criteria for the determination of valency members are proposed together with the changes in the valency structure connected with the application of different diatheses and alternations. The role of valency requirements in complex predicates is described and exemplified by means of derived structures. The other phenomenon investigated is connected with reflexive and reciprocal constructions. Furthermore, attention is devoted to the categorization of deletions and the related phenomenon of the general participant, and also to various comparative constructions that are described as constructions with surface deletions. The constructions introduced by the Czech preposition kromě ‘besides/instead of’ are used as an illustration of how their deep representation looks. The main tenets of FGD have been applied, verified and further refined in the Prague Dependency Treebank family and valency lexicons, which are briefly characterized here as well.
EN
The paper deals with a phenomenon frequently encountered in the syntax of spoken Czech, namely one-syllable words, mostly of pronominal or verbal nature (se, si, sem, ste, sme, mě, mi, mu, tě, ti, bych, bys, by…) at the beginning of syntactic segments. At this stage, the analysis focuses on three forms: by, si, ti. The authors address the issue of the difficult identification of segment boundaries, including the influence of turn-taking in dialogue. The data was taken from the ORAL2013 corpus; the paper further looks into the usefulness of this corpus for the investigation of dialogue syntax, its query options and the possible interpretation of the presented evidence. The results have shown so far that the one-syllable beginnings in question are based on the elision of certain, mostly pronominal, expressions, or less frequently on word-order inversion. Furthermore, to a certain extent, they correlate with selected non-verbal discourse phenomena (longer pauses, silence, laughter), with syntactic phenomena (repetitions, corrections, parentheses, aposiopesis, etc.) and also with speaker turn-taking and topic change.
EN
The existence of AI-complete problems has led to a growth in research into alternative ways of solving artificial intelligence problems that are not based solely on the work of a computer. Although communication is something obvious to humans, there is still no way to automate it. The approach currently in widespread use for solving NLP problems is a statistical one, whose success depends on the size of the training corpus. The preparation of a reliable data set is therefore a key aspect of creating a statistical artificial intelligence system. Because it involves a large number of specialists, this is a very time-consuming and expensive process. One promising approach to reducing the time and cost of creating a tagged corpus is the use of games with a purpose. The objective of this paper is to present the stages of creating games with a purpose used for obtaining annotated language resources and to discuss their effectiveness. This analysis is based on the Wordrobe project, a collection of games created to support the annotation of a natural language corpus.
EN
The present contribution is a theoretical and methodological study of the possibilities of processing discourse through the use of corpus methods. Despite the complexity of describing phenomena “beyond the sentence boundary”, we argue that more ways of systematic analysis are possible. Taking into account various attempts during the last decade to create discourse-annotated corpora, we show that a reliable way to proceed in any such analysis is to distinguish between different layers of discourse analysis (in particular between “semantic” and “pragmatic” aspects) and to stick with the linguistic form as opposed to classifying phenomena with no surface realization.