
2014 | 14 | 75-84

Article title

The IMPACT project Polish Ground-Truth texts as a DjVu corpus

Authors

Languages of publication

EN

Abstracts

EN
The paper has two aims. The first is to describe the already implemented idea of DjVu corpora, i.e. corpora that consist of both scanned images and a transcription of the texts, with each word linked to its occurrences in the scans. The second is to present a case study of a corpus of almost 5,000 pages of Polish historical texts dating from 1570 to 1756 (it is practically the very first corpus of historical Polish). The tools described are general-purpose and freely available under the GNU GPL license, so they can also be used for other purposes.

Year

2014

Issue

14

Pages

75-84

Dates

published
2014-09-04

Contributors

  • Katedra Lingwistyki Formalnej, Uniwersytet Warszawski [Formal Linguistics Department, University of Warsaw], Warszawa [Warsaw], Poland

References

  • Bień, J. S. (2009). Facilitating access to digitalized dictionaries in DjVu format. Cognitive Studies | Études cognitives, 9, 161–170. Retrieved from http://bc.klf.uw.edu.pl/160/
  • Bień, J. S. (2011). Efficient search in hidden text of large DjVu documents. In R. Bernardi, S. Chambers, B. Gottfried, F. Segond & I. Zaihrayeu (Eds.), Advanced Language Technologies for Digital Libraries, volume 6699 of Lecture Notes in Computer Science (pp. 1–14). Berlin/Heidelberg: Springer. Retrieved from http://dx.doi.org/10.1007/978-3-642-23160-5_1, http://bc.klf.uw.edu.pl/177/
  • Breuel, T. (2007). The hOCR microformat for OCR workflow and results. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (pp. 1063–1067). IEEE Computer Society. Retrieved from http://madm.dfki.de/publication&pubid=4373
  • Kenter, T., Erjavec, T., Žorga Dulmin, M., & Fišer, D. (2012). Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (pp. 1–6). Avignon: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W/W12/W12-1001.pdf
  • Le Cun, Y., Bottou, L., Haffner, P., & Howard, P. G. (1998). DjVu: a compression method for distributing scanned documents in color over the internet. In Sixth Color Imaging Conference: Color Science, Systems and Applications (pp. 220–223). Scottsdale, Arizona: IS&T. Retrieved from http://leon.bottou.org/papers/lecun-98c
  • Pletschacher, S. & Antonacopoulos, A. (2010). The PAGE (Page Analysis and Ground-Truth Elements) format framework. In International Conference on Pattern Recognition (pp. 257–260). Los Alamitos, CA, USA: IEEE Computer Society. Retrieved from http://www.impact-project.eu/fileadmin/Editorial/Documents/ICPR2010_The_PAGE_Format_Framework_USAL.pdf
  • Przepiórkowski, A., Krynicki, Z., Dębowski, Ł., Woliński, M., Janus, D., & Bański, P. (2004). A search tool for corpora with positional tagsets and ambiguities. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC2004 (pp. 1235–1238). Retrieved from http://nlp.ipipan.waw.pl/~adamp/Papers/2004-lrec/fcqp.pdf

Identifiers

YADDA identifier

bwmeta1.element.desklight-84503002-fdaf-43a1-a12c-ed8a534f7a7e