
Distributed Hypertext Resource Discovery Through Examples.

Soumen Chakrabarti, Martin van den Berg, Byron Dom: Distributed Hypertext Resource Discovery Through Examples. VLDB 1999: 375-386
@inproceedings{DBLP:conf/vldb/ChakrabartiBD99,
  author    = {Soumen Chakrabarti and
               Martin van den Berg and
               Byron Dom},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Distributed Hypertext Resource Discovery Through Examples},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-7},
  pages     = {375-386},
  ee        = {http://www.vldb.org/conf/1999/P37.pdf},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, meta-data, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead we exploit two properties: pages tend to cite pages with related topics, and, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the user's interest, and (2) a distiller that evaluates a page as an access point for a large neighborhood of relevant pages. Our implementation uses IBM's Universal Database, not only for robust data storage, but also for integrating the computations of the classifier and distiller into the database. This results in a significant increase in I/O efficiency: a factor of ten for the classifier and a factor of three for the distiller. In addition, ad-hoc SQL queries can be used to monitor the crawler and dynamically change crawling strategies. We report on experiments to establish that our system is efficient, effective, and robust.
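
The two citation properties named in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and this record does not specify their classifier or distiller; everything below is an assumption made for illustration. The link graph is a plain in-memory dictionary standing in for page fetching, the relevance scores are a hypothetical stand-in for classifier output, and the functions focused_crawl and hub_scores are invented names. The crawler expands pages cited by relevant pages first, and the hub scoring is a simple HITS-style iteration that rewards pages citing many relevant pages.

```python
import heapq


def focused_crawl(seeds, links, relevance, budget):
    """Best-first crawl: visit pages cited by relevant pages first.

    links: dict mapping a page to the pages it cites (toy stand-in for fetching).
    relevance: dict mapping a page to a hypothetical classifier score in [0, 1].
    budget: maximum number of pages to visit.
    """
    frontier = [(-1.0, s) for s in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        score = relevance.get(page, 0.0)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                # A cited page inherits the relevance of the page citing it.
                heapq.heappush(frontier, (-score, target))
    return visited


def hub_scores(pages, links, relevance, iterations=20):
    """HITS-style hub scores over the crawled pages, weighted by relevance."""
    pages = list(pages)
    in_set = set(pages)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of citing pages, scaled by relevance.
        auth = {p: 0.0 for p in pages}
        for u in pages:
            for v in links.get(u, []):
                if v in in_set:
                    auth[v] += hub[u] * relevance.get(v, 0.0)
        # Hub: sum of authority scores of the pages it cites.
        hub = {
            u: sum(auth[v] for v in links.get(u, []) if v in in_set)
            for u in pages
        }
        norm = sum(hub.values()) or 1.0
        hub = {u: h / norm for u, h in hub.items()}
    return hub
```

In this sketch a page's priority depends only on the page that first cited it, which is the simplest reading of "pages tend to cite pages with related topics"; a real system would presumably update priorities as more citing pages are discovered.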

Copyright © 1999 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.


Printed Edition

Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.): VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann 1999, ISBN 1-55860-615-7

References

[1]
Chidanand Apté, Fred Damerau, Sholom M. Weiss: Automated Learning of Decision Rules for Text Categorization. ACM Trans. Inf. Syst. 12(3): 233-251 (1994)
[2]
...
[3]
Krishna Bharat, Andrei Z. Broder: A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines. Computer Networks 30(1-7): 379-388 (1998)
[4]
Krishna Bharat, Monika Rauch Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment. SIGIR 1998: 104-111
[5]
Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30(1-7): 107-117 (1998)
[6]
Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan: Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies. VLDB J. 7(3): 163-178 (1998)
[7]
Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, Jon M. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Computer Networks 30(1-7): 65-74 (1998)
[8]
Soumen Chakrabarti, Byron Dom, Piotr Indyk: Enhanced Hypertext Categorization Using Hyperlinks. SIGMOD Conference 1998: 307-318
[9]
...
[10]
...
[11]
Donald D. Chamberlin: A Complete Guide to DB2 Universal Database. Morgan Kaufmann 1998, ISBN 1-55860-482-0
[12]
...
[13]
Junghoo Cho, Hector Garcia-Molina, Lawrence Page: Efficient Crawling Through URL Ordering. Computer Networks 30(1-7): 161-172 (1998)
[14]
William W. Cohen: Fast Effective Rule Induction. ICML 1995: 115-123
[15]
...
[16]
Paul De Bra, R. D. J. Post: Information Retrieval in the World-Wide Web: Making Client-Based Searching Feasible. Computer Networks and ISDN Systems 27(2): 183-192 (1994)
[17]
Susan T. Dumais, John C. Platt, David Heckerman, Mehran Sahami: Inductive Learning Algorithms and Representations for Text Categorization. CIKM 1998: 148-155
[18]
Roy Goldman, Narayanan Shivakumar, Suresh Venkatasubramanian, Hector Garcia-Molina: Proximity Search in Databases. VLDB 1998: 26-37
[19]
Joachim Hammer, Hector Garcia-Molina, Kelly Ireland, Yannis Papakonstantinou, Jeffrey D. Ullman, Jennifer Widom: Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System. SIGMOD Conference 1995: 483
[20]
Thorsten Joachims, Dayne Freitag, Tom M. Mitchell: Web Watcher: A Tour Guide for the World Wide Web. IJCAI (1) 1997: 770-777
[21]
...
[22]
Thomas Kistler, Hannes Marais: WebL - A Programming Language for the Web. Computer Networks 30(1-7): 259-270 (1998)
[23]
Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. SODA 1998: 668-677
[24]
David Konopnicki, Oded Shmueli: Information Gathering in the World-Wide Web: The W3QL Query Language and the W3QS System. ACM Trans. Database Syst. 23(4): 369-410 (1998)
[25]
...
[26/27]
Alberto O. Mendelzon, Tova Milo: Formal Models of Web Queries. PODS 1997: 134-143
[28]
...
[29]
Wayne Niblack, Xiaoming Zhu, James L. Hafner, Thomas M. Breuel, Dulce B. Ponceleon, Dragutin Petkovic, Myron Flickner, Eli Upfal, Sigfredo I. Nin, Sanghoon Sull, Byron Dom, Boon-Lock Yeo, Savitha Srinivasan, Dan Zivkovic, Mike Penner: Updates to the QBIC System. Storage and Retrieval for Image and Video Databases (SPIE) 1998: 150-161
[30]
...
[31]
Jacques Savoy: An Extended Vector-Processing Scheme for Searching Information in Hypertext Systems. Inf. Process. Manage. 32(2): 155-170 (1996)
[32]
Loren G. Terveen, William C. Hill: Finding and Visualizing Inter-Site Clan Graphs. CHI 1998: 448-455
[33]
...

Last update Mon Sep 17 22:01:05 2012 CET by the DBLP Team. This material is Open Data released under the ODC-BY 1.0 license.