
Results found: 11


Search results

Search:
in the keywords:  quantitative linguistics
EN
The article attempts to reconstruct the field of German-language discourse studies and to analyse it critically. Because the discourse-analysis models examined in the article draw very strongly on Michel Foucault’s concept, they are regarded as post-Foucault. The authors present the main threads in German-language discourse studies: (1) approaches whose objective is to formulate a theoretical and methodological basis for a post-Foucault discourse analysis (these are primarily the “discipline-specific” schools of discourse analysis, linguistic and sociological, as well as the programme of so-called critical discourse analysis); (2) “dispositive” approaches, which constitute a novelty in the debate over the category of discourse and regard the category of the dispositive as a way of finding a supradiscursive “system.” The authors also reflect on critical remarks about the various threads, including those formulated by the scholars themselves. The main conclusion of the authors’ reconstruction is that German-language discourse studies tend to understand the category of discourse quite narrowly, within specific disciplines, and thus lack an integrated, interdisciplinary model of discourse analysis.
EN
The article raises the question of whether discourse analysis can be carried out by means of corpus-based and quantitative methods. Discourse studies use various research methods, most of them qualitative. Corpus-based and statistical linguistics, on the other hand, offer many tools that can be used to study discourse, from electronic concordances to calculations that make it possible to discover similarities between texts and genres, marked lexis, keywords, etc. The author briefly describes these methods in the article. She also provides basic information about the structure of a specialist corpus and the selection of a representative sample of the texts that constitute a given discourse.
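The electronic concordance mentioned in the abstract can be illustrated with a minimal keyword-in-context (KWIC) routine. This is a sketch only; the sample sentence and the node word are illustrative, not taken from the study.

```python
# Minimal keyword-in-context (KWIC) concordance: each occurrence of a
# node word is shown with a few words of left and right context.
import re

def kwic(text, node, window=4):
    """Return each occurrence of `node` with `window` words of context."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample = "Discourse studies use various methods. Corpus methods support discourse analysis."
for line in kwic(sample, "discourse"):
    print(line)
```

Real concordancers add sorting by left or right context and frequency filters, but the core operation is this windowed lookup.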
PL
This article presents a statistical, comparative analysis of four spelling conventions that represent different stages in the development of the Polish graphic system: the graphic system of a late-medieval manuscript (a hand-written text), the standard spelling convention typical of the first half of the sixteenth century, the accepted standard modern spelling of the first half of the twentieth century, and the innovative set of graphic features used in electronic media. The statistical parameters examined encompass dispersion and entropy in the first and second rows of letters, as well as in two-element sets (dyads). The analysis shows that: 1) as regards the degree of differentiation in the distribution of signs, the history of the Polish spelling convention prior to the solidification of the modern standard manifested a self-organizing tendency based on a reduction of letter signs and two-element letter combinations (ligatures) with a frequency of 1; 2) the innovative solutions used in the set of graphic features characteristic of electronic media do not violate the statistical proportion between letters and their dyads that is operative in the modern standard graphic system; 3) in information-theoretic terms, the transformations of the graphic substance (graphic system) within the analysed chronological timeframe constituted neither progress (evolution) nor degradation.
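The entropy parameter applied above to letter distributions can be sketched as follows. This is a generic Shannon-entropy computation, not the study's own code, and the sample strings are illustrative, not drawn from the analysed spelling systems.

```python
# Shannon entropy (bits per letter) of a letter distribution, one of
# the statistical parameters the abstract applies to successive Polish
# spelling conventions. The sample text is illustrative only.
import math
from collections import Counter

def letter_entropy(text):
    """Shannon entropy of the letter distribution of `text`."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

print(letter_entropy("abab"))  # two equiprobable letters: 1 bit/letter
```

The same computation applied to dyads (two-letter combinations) instead of single letters gives the second-row parameter the abstract mentions.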
PL
The article presents the results of a quantitative analysis of the vocabulary attested in the Polish translation of Pietro de‘ Crescenzi’s (Piotr Krescentyn’s) Księgi o gospodarstwie (original title: Opus ruralium commodorum) against the statistical characteristics of translations of the New Testament from the latter half of the sixteenth century and the beginning of the seventeenth century, and of Wizerunek własny żywota człowieka poczciwego (The Life of the Honest Man) by Mikołaj Rej. The following statistical parameters were studied: the number of words, the number of entries, the K quantity factor, the arithmetic mean of the frequency of entries, the dispersion of entries, the vocabulary-originality parameter I, the distribution of the autosemantic parts of speech (lexical categories), and the distribution of individual autosemantic parts of speech. The analysis shows that, in statistical terms, the vocabulary of Księgi... is closer to Wizerunek than to the translations of the New Testament; in comparison to the latter, Księgi... is characterized by a far more ample vocabulary, a greater number of entries and autosemantic words, and a greater number of attested nouns and especially adverbs (the latter group is characterized by a higher percentage of high-frequency entries). Compared to the translations of the New Testament, Księgi... also abounds in participles, whose repertory is more numerous with regard to both words and entries. The analysis confirms Władysław Kuraszkiewicz’s observation on the dependence between the quantity of vocabulary and the content of a text: the closer the text is to real life and the more diverse the aspects of life it refers to, the richer the vocabulary used to render it.
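Three of the parameters listed above (the number of words, the number of entries, and the arithmetic mean frequency of entries) reduce to simple token and type counts. A hedged sketch, with an illustrative sample text:

```python
# Basic vocabulary statistics of the kind compared in the abstract:
# token count ("words"), type count ("entries"), and the arithmetic
# mean frequency of entries. The sample text is illustrative.
import re
from collections import Counter

def vocab_stats(text):
    tokens = re.findall(r"\w+", text.lower())
    freqs = Counter(tokens)
    n_tokens, n_types = len(tokens), len(freqs)
    mean_freq = n_tokens / n_types  # arithmetic mean frequency of entries
    return n_tokens, n_types, mean_freq

print(vocab_stats("to be or not to be"))  # → (6, 4, 1.5)
```

The richer parameters (the K factor, dispersion, the originality parameter I) build on these same frequency counts.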
PL
This article discusses the automatic extraction of relevant words from sets of texts. The author briefly presents three methods for extracting, from a corpus, words that stand out with regard to their frequency, or words whose occurrence next to each other is not random. First, he focuses on the keyword-analysis method; then he discusses the Zeta method developed by John Burrows and Hugh Craig; the third method covered in the article is topic modelling, which has recently become very popular and consists in finding clusters of words that co-occur in similar contexts. Topic modelling was intended for quick content searches in large collections of documents. Using 100 Polish novels, the article shows how this method can also be applied in linguistic studies.
Polonica | 2018 | vol. 38 | 51-66
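The keyword-analysis method named above compares word frequencies in a target corpus against a reference corpus. The sketch below scores words by log-likelihood, one common variant of the technique (the article does not specify the exact statistic it uses), and the toy corpora are illustrative.

```python
# Keyword analysis sketch: rank the words of a target corpus by the
# log-likelihood of their frequency against a reference corpus.
# The two tiny "corpora" are illustrative.
import math
from collections import Counter

def keywords(target_tokens, reference_tokens):
    t, r = Counter(target_tokens), Counter(reference_tokens)
    nt, nr = sum(t.values()), sum(r.values())
    scores = {}
    for w in t:
        a, b = t[w], r.get(w, 0)
        e1 = nt * (a + b) / (nt + nr)  # expected frequency in target
        e2 = nr * (a + b) / (nt + nr)  # expected frequency in reference
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        scores[w] = ll
    return sorted(scores, key=scores.get, reverse=True)

# Words overrepresented in the target rank first.
print(keywords("law court judge judge".split(), "the cat sat judge".split()))
```

Zeta and topic modelling work differently (Zeta from document-segment presence, topic modelling from co-occurrence clusters), but all three start from the same per-word frequency counts.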
EN
This article investigates whether the characters of Bolesław Prus’s novel “The Doll” speak in distinctive ways. It is well known that the author endowed his characters with different social backgrounds, views, and ethics. Literary critics have often emphasized differences in vocabulary and numerous language stylizations across social classes, but in most cases only the most characteristic words were taken into consideration, and function words (prepositions, conjunctions, personal pronouns and others) were omitted from the research. The aim of this study, in contrast to previous work, is to examine individual characters’ ways of speaking by measuring the frequencies of the most frequent words and of given parts of speech. The tests were performed using the Delta method together with several data-visualization techniques. The results show significant variation among individual idiolects.
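The Delta method used in the study can be sketched as follows: z-score the relative frequencies of the most frequent words across all speakers, then average the absolute differences between two frequency profiles. This is a minimal illustration of Burrows's Delta with toy profiles, not the study's actual data.

```python
# Minimal Burrows's Delta: mean absolute difference of z-scored
# most-frequent-word frequencies. Profiles map word -> relative
# frequency; the toy values below are illustrative.
from statistics import mean, pstdev

def delta(profile_a, profile_b, corpus_profiles):
    """Mean absolute z-score difference over the shared word list."""
    words = profile_a.keys()
    dist = 0.0
    for w in words:
        vals = [p[w] for p in corpus_profiles]
        mu, sigma = mean(vals), pstdev(vals)
        za = (profile_a[w] - mu) / sigma
        zb = (profile_b[w] - mu) / sigma
        dist += abs(za - zb)
    return dist / len(words)

a = {"the": 0.06, "and": 0.03}   # hypothetical speaker profiles
b = {"the": 0.04, "and": 0.05}
c = {"the": 0.05, "and": 0.04}
print(delta(a, b, [a, b, c]))    # low values = similar idiolects
```

In an attribution setting, each character's dialogue would yield one profile over the few hundred most frequent words, and pairwise Delta values would feed the visualizations the abstract mentions.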
EN
The article describes corpus research together with its theoretical background. The basis of our research is a press corpus of 27 million tokens in which the epistemic terms are annotated. The grammaticalized and lexical epistemic expressions are consistently kept apart on the basis of the relevant literature. The next step is to investigate, with a fully automatic method, which expressions often occur together in a text. Semantic subclasses of epistemic expressions are defined on the basis of these quantitatively proven solidarities. Finally, concrete examples illustrate how the epistemic expressions structure the text.
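The co-occurrence step can be sketched as counting, over a corpus, how often pairs of expressions appear in the same text. This is a simplified illustration assuming plain substring matching; the expression list and texts are illustrative, not the annotated epistemic inventory of the study.

```python
# Sketch of the co-occurrence step: count how often pairs of
# expressions appear together in the same text. Substring matching is
# a simplification of the corpus annotation described in the abstract.
from collections import Counter
from itertools import combinations

def cooccurrences(texts, expressions):
    pair_counts = Counter()
    for text in texts:
        present = sorted(e for e in expressions if e in text)
        pair_counts.update(combinations(present, 2))
    return pair_counts

texts = ["perhaps it may rain", "it may rain, probably", "perhaps so"]
print(cooccurrences(texts, ["perhaps", "may", "probably"]))
```

Pairs with counts well above chance ("solidarities") would then be grouped into the semantic subclasses the abstract describes.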
EN
The exploitation of hapax legomena, i.e. word or lemma types which occur in a corpus only once, is usually overlooked in language description. These types cannot be used systematically for the vast majority of analyses, as they do not provide a basis for any type of generalization. On the other hand, the overall number of hapaxes can be used as an indicator of the lexical periphery of the language system. This paper suggests that the ratio between the number of hapaxes and the number of all types in relation to growing corpus size (the hapax-type ratio, HTR) can be used to delimit the lexical core of a language. Previous research (Fengxiang 2010) has shown that the HTR curve in English has the shape of a pipe or chibouque, which means that the rates at which new hapaxes and new types emerge in the process of building a corpus differ before and after a certain size is reached. In a hypothetical small corpus (a few sentences) the hapax-type ratio will be equal to one (each word type is also a hapax). As texts are added to the corpus (up to a few million words), the hapax-type ratio decreases from its maximal value (= 1) to a local minimum: the number of new words, including hapaxes, continuously increases, but the majority of added tokens are new instances of words already present in the corpus. After this turning point, extending the corpus increases the ratio, because the number of hapaxes grows at a faster pace than the number of non-hapaxes (i.e. types with a frequency higher than one). This empirical finding, tested on corpora of Czech and English, brings us closer to an exact determination of the range of the core lexicon. Subsequently, we can deduce the approximate size of a corpus sufficient for compiling a dictionary that covers the core lexicon.
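The hapax-type ratio described above can be sketched as the share of types occurring exactly once, tracked as the corpus grows. The token stream below is illustrative; a real corpus of millions of tokens is needed to observe the turning point the paper reports.

```python
# Hapax-type ratio (HTR) tracked over a growing token stream: the
# number of types with frequency 1 divided by the number of all types.
from collections import Counter

def htr_curve(tokens, step=1):
    """HTR after every `step` tokens of a growing corpus."""
    freqs = Counter()
    curve = []
    for i, tok in enumerate(tokens, 1):
        freqs[tok] += 1
        if i % step == 0:
            hapaxes = sum(1 for f in freqs.values() if f == 1)
            curve.append(hapaxes / len(freqs))
    return curve

print(htr_curve(["a", "b", "a", "c"]))  # starts at 1.0, dips as words repeat
```

On a corpus large enough, plotting this curve should show the fall to a local minimum and the subsequent rise that the paper uses to delimit the core lexicon.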
EN
Underestimated and often questioned in Poland, Witold Mańczak’s theory of irregular sound change due to frequency has, from the beginning, been well received abroad and has contributed noticeably to the development of quantitative methods in linguistics. This widely acknowledged statistical tool is frequently used by linguists all over the world to study and explain many different linguistic phenomena, synchronic as well as diachronic, in which the frequency parameter plays a significant role.
EN
The article aims to recreate the tourist’s view on the basis of a quantitative observation of the lexical layer of texts. The subject of the research is the most frequent lexis, excerpted in the form of a frequency list from Listy z podróży po Włoszech by Konstanty Gaszyński. The collected material consists of 300 words. The most common vocabulary was grouped into 9 semantic-lexical circles and analyzed using the research tools of cultural linguistics. The research indicates that, on the basis of an analysis of the most common lexis, it is possible to reconstruct both the points of view present in the texts and the correlated profiles of Italian space.
EN
The paper focuses on frequency and collocation analyses of Česko (“Czechia”), the short, geographical name of our country, in the opinion-journalism section of the eight-version SYN corpus, which comprises texts from the period 1990−2018. Within the scope of the research, the period was divided into several sections delineated by breakthrough political and cultural events (the Czech Republic entering NATO, the Czech Republic entering the EU, the climax of the first season of the Pop Idol-based contest Czechia Is Looking for a SuperStar, etc.). The frequency analysis is based on relativization via i.p.m. (instances per million); the collocability force is computed using the logDice index, which is easy to interpret linguistically and independent of corpus size. The goal of the study is to capture the basic motivations that led to the popularisation of the name and its expansion in the given discourse (e.g. the influence of other one-word names of states, sports commentaries, popular contests, and generational change). In sum, the name Česko is employed in a variety of contexts, and its usage can be seen as unmarked.
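Both measures named in the abstract have compact definitions: i.p.m. is frequency normalized per million tokens, and logDice is defined as 14 + log2(2·f(x,y) / (f(x) + f(y))). The sketch below uses illustrative frequencies, not counts from the SYN corpus.

```python
# The two measures from the abstract: relative frequency in instances
# per million (i.p.m.) and the logDice collocation score. Input
# frequencies are illustrative.
import math

def ipm(freq, corpus_size):
    """Frequency normalized to instances per million tokens."""
    return freq / corpus_size * 1_000_000

def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2*f(x,y) / (f(x) + f(y)))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

print(ipm(500, 10_000_000))           # 500 hits in a 10M-token corpus
print(log_dice(100, 200, 200))        # theoretical maximum is 14
```

Because logDice depends only on the three frequencies in ratio, not on corpus size, scores are comparable across corpora of different sizes, which is the property the abstract highlights.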