The IMPACT project Polish Ground-Truth texts as a DjVu corpus

Bień, Janusz S.

doi:10.11649/cs.2014.008

Article details

Journal

Cognitive Studies

2014 | 14 | 75-84

Languages of publication

EN

Abstracts

EN

The paper has two aims. The first is to describe the already implemented idea of DjVu corpora, i.e. corpora that consist of both scanned images and a transcription of the texts, with each word linked to its occurrences in the scans. The second is to present a case study of a corpus of almost 5,000 pages of Polish historical texts dating from 1570 to 1756 (effectively the first corpus of historical Polish). The tools described are general-purpose and freely available under the GNU GPL, so they can also be used for other purposes.

Keywords

EN

Polish language corpora; DjVu; OCR; PAGE (Page Analysis and Ground-Truth Elements); GNU GPL

Publisher

Polska Akademia Nauk. Instytut Slawistyki PAN

Journal

Cognitive Studies

Year

2014

Issue

14

Pages

75-84

Dates

published

2014-09-04

Contributors

author

Bień Janusz S.

Katedra Lingwistyki Formalnej, Uniwersytet Warszawski [Formal Linguistics Department, University of Warsaw], Warszawa [Warsaw], Poland

References

Bień, J. S. (2009). Facilitating access to digitalized dictionaries in DjVu format. Cognitive Studies | Études cognitives, 9, 161–170. Retrieved from http://bc.klf.uw.edu.pl/160/
Bień, J. S. (2011). Efficient search in hidden text of large DjVu documents. In R. Bernardi, S. Chambers, B. Gottfried, F. Segond & I. Zaihrayeu (Eds.), Advanced Language Technologies for Digital Libraries, volume 6699 of Lecture Notes in Computer Science (pp. 1–14). Berlin/Heidelberg: Springer. Retrieved from http://dx.doi.org/10.1007/978-3-642-23160-5_1, http://bc.klf.uw.edu.pl/177/
Breuel, T. (2007). The hOCR microformat for OCR workflow and results. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (pp. 1063–1067). IEEE Computer Society. Retrieved from http://madm.dfki.de/publication&pubid=4373
Kenter, T., Erjavec, T., Žorga Dulmin, M., & Fišer, D. (2012). Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (pp. 1–6). Avignon: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W/W12/W12-1001.pdf
Le Cun, Y., Bottou, L., Haffner, P., & Howard, P. G. (1998). DjVu: a compression method for distributing scanned documents in color over the internet. In Sixth Color Imaging Conference: Color Science, Systems and Applications (pp. 220–223). Scottsdale, Arizona: IS&T. Retrieved from http://leon.bottou.org/papers/lecun-98c
Pletschacher, S. & Antonacopoulos, A. (2010). The PAGE (Page Analysis and Ground-Truth Elements) format framework. In International Conference on Pattern Recognition (pp. 257–260). Los Alamitos, CA, USA: IEEE Computer Society. Retrieved from http://www.impact-project.eu/fileadmin/Editorial/Documents/ICPR2010_The_PAGE_Format_Framework_USAL.pdf
Przepiórkowski, A., Krynicki, Z., Dębowski, Ł., Woliński, M., Janus, D., & Bański, P. (2004). A search tool for corpora with positional tagsets and ambiguities. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC2004 (pp. 1235–1238). Retrieved from http://nlp.ipipan.waw.pl/~adamp/Papers/2004-lrec/fcqp.pdf

Identifiers

DOI

10.11649/cs.2014.008

YADDA identifier

bwmeta1.element.desklight-84503002-fdaf-43a1-a12c-ed8a534f7a7e
