Google Books jako korpus językowy (Google Books as a language corpus)

Podhajecka, Mirosława

Article details

Journal

Biuletyn Polskiego Towarzystwa Językoznawczego

Year 2018 | Volume 74 | Pages 31-46

Title variants

Google Books as a language corpus

Languages of publication

PL

Abstracts

PL

This article discusses Google Books, a virtual library available online that comprises scans of 30 million books. It is currently the world's richest source of textual data in digital form. The Google Books collections can be called a corpus, but they differ fundamentally from traditional language corpora. The classification problems stem from specific limitations that must be faced in the course of research. Among other things, some sources are full-text versions while others are limited-preview versions, the bibliographic data are often erroneous, and the quality of optical character recognition, especially for older texts, is far from perfect. The paper briefly discusses research problems concerning Google Books.

EN

This article concerns Google Books, a digital library available on the Internet, which contains scans of 30 million books. At present, it is the largest source of textual data in digital format worldwide. Google Books may be called a corpus, but it is markedly different from traditional language corpora. Classification difficulties arise from specific limitations encountered during research. Among other things, some sources are available as full texts, while others offer limited preview; bibliographic metadata are often wrong; and the quality of optical character recognition is far from perfect, especially when applied to older texts. The article briefly discusses research problems involved in using Google Books.

Keywords

PL

Google Books; korpus; analiza; problemy badawcze

EN

Google Books; corpus; analysis; research problems

Publisher

Polskie Towarzystwo Językoznawcze; UNIVERSITAS Towarzystwo Autorów i Wydawców Prac Naukowych; Wydawnictwo LEXIS

Contributors

author

Podhajecka Mirosława

Uniwersytet Opolski

Identifiers

YADDA identifier

bwmeta1.element.desklight-cf568038-359c-4eb2-82e4-7f9c142baae7
