

2013 | 1 | 5(254) |

Article title

Zmienność treści na forach internetowych

Title variants

EN
Variability in the content of Internet forums

Abstracts

PL (English translation)
In this paper we present the results of an experiment carried out on a sample of more than 27,900 Web pages collected from 16 forums at 2-hour intervals (4,256 independent download processes) in order to examine how these pages evolve over time. The results can inform design decisions for incremental crawlers that specialise in retrieving documents from Internet forums and aim to keep the collected documents highly up to date. As the analyses show, Internet forums differ from general-purpose portals, and identifying the places in their navigational structure where new documents appear more often can improve crawler efficiency while keeping the local document collection fresh.
EN
In this article we present the results of a study conducted on a sample of Polish Web forums to investigate how these sites evolve over time. We analysed more than 27,900 Web pages from 16 sources at two-hour intervals (4,256 data points) over the 22 days of the experiment. The results can serve as a basis for improving Web crawler design and provide insight into the nature of Web forums. The variability of Web forum content appears to differ significantly from that of general-purpose Web sites, so Web crawlers need to adjust their document extraction policies to deal with this kind of Web source.
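
The sampling scheme described in the abstract (repeated snapshots of each page at fixed intervals) lends itself to a simple per-page change-rate estimate. The sketch below is only an illustration and is not taken from the article: it assumes each snapshot is available as a string, treats any difference in the SHA-256 digest as a change, and uses hypothetical function names (`change_rate`, `rank_by_change_rate`). A real forum crawler would likely compare extracted post content rather than raw pages, since ads and timestamps change on every download.

```python
import hashlib
from typing import Dict, List


def digest(content: str) -> str:
    """Return a SHA-256 hex digest of one page snapshot."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def change_rate(snapshots: List[str]) -> float:
    """Fraction of consecutive snapshot pairs whose content differs.

    `snapshots` is assumed to be ordered by download time, e.g. one
    entry per two-hour interval.
    """
    if len(snapshots) < 2:
        return 0.0  # a single snapshot gives no evidence of change
    digests = [digest(s) for s in snapshots]
    changes = sum(1 for a, b in zip(digests, digests[1:]) if a != b)
    return changes / (len(digests) - 1)


def rank_by_change_rate(pages: Dict[str, List[str]]) -> List[str]:
    """Order page URLs from most to least frequently changing."""
    return sorted(pages, key=lambda url: change_rate(pages[url]), reverse=True)
```

An incremental crawler could use such a ranking to revisit frequently changing pages (for example thread lists) more often than rarely changing ones.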

Year

2013

Volume

1

Issue

5(254)

Contributors

  • Uniwersytet Ekonomiczny w Poznaniu
  • Uniwersytet Ekonomiczny w Poznaniu


Identifiers

YADDA identifier

bwmeta1.element.desklight-ce5d1acf-aa04-46ed-9855-7e6d36a98b9b