dblp.uni-trier.de www.dagstuhl.de www.uni-trier.de

Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies.

Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan: Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies. VLDB J. 7(3): 163-178(1998)
@article{DBLP:journals/vldb/ChakrabartiDAR98,
  author    = {Soumen Chakrabarti and
               Byron Dom and
               Rakesh Agrawal and
               Prabhakar Raghavan},
  title     = {Scalable Feature Selection, Classification and Signature Generation
               for Organizing Large Text Databases into Hierarchical Topic Taxonomies},
  journal   = {VLDB J.},
  volume    = {7},
  number    = {3},
  year      = {1998},
  pages     = {163-178},
  ee        = {db/journals/vldb/ChakrabartiDAR98.html},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of on-line textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand.

We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy.

To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from thenoise words at each node of the taxonomy. Using these, we build a multilevel classifier. At each node, this classifier can ignore the large number of "noise" words in a document. Thus, the classifier has a small model size and is very fast. Owing to the use of context-sensitive features, the classifier is very accurate. As a by-product, we can compute for each document a set of terms that occur significantly more often in it than in the classes to which it belongs.

We describe the design and implementation of our system, stressing how to exploit standard, efficient relational operations like sorts and joins. We report on experiences with the Reuters newswire benchmark, the US patent database, and web document samples from Yahoo!. We discuss applications where our system can improve searching and filtering capabilities.

Copyright © 1998 by Springer, Berlin, Heidelberg. Permission to make digital or hard copies of the abstract is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice along with the full citation.


Online Edition (Springer)

Citation Page

ACM SIGMOD Anthology

CDROM Version: Load the CDROM "Volume 5 Issue 2, JACM, VLDB-J, POS, ..." and ... DVD Version: Load ACM SIGMOD Anthology DVD 2" and ...

References

[1]
Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. STOC 1996: 20-29 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[2]
Peter G. Anick, Shivakumar Vaithyanathan: Exploiting Clustering and Phrases for Context-based Information Retrieval. SIGIR 1997: 314-323 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[3]
Chidanand Apté, Fred Damerau, Sholom M. Weiss: Automated Learning of Decision Rules for Text Categorization. ACM Trans. Inf. Syst. 12(3): 233-251(1994) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[4]
Marko Balabanovic, Yoav Shoham: Content-Based, Collaborative Recommendation. Commun. ACM 40(3): 66-72(1997) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[5]
James O. Berger: Statistical Decision Theory and Bayesian Analysis, 2nd Edition. Springer 1985, ISBN 3-540-96098-8
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[6]
Leo Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone: Classification and Regression Trees. Wadsworth 1984, ISBN 0-534-98053-8
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[7]
Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan: Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. VLDB 1997: 446-455 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[8]
Soumen Chakrabarti, Byron Dom, Piotr Indyk: Enhanced Hypertext Categorization Using Hyperlinks. SIGMOD Conference 1998: 307-318 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[9]
...
[10]
William W. Cohen, Yoram Singer: Context-sensitive Learning Methods for Text Categorization. SIGIR 1996: 307-315 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[11]
...
[12]
Douglas R. Cutting, David R. Karger, Jan O. Pedersen: Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections. SIGIR 1993: 126-134 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[13]
Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, Richard A. Harshman: Indexing by Latent Semantic Analysis. JASIS 41(6): 391-407(1990) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[14]
...
[15]
...
[16]
William B. Frakes, Ricardo A. Baeza-Yates (Eds.): Information Retrieval: Data Structures & Algorithms. Prentice-Hall 1992, ISBN 0-13-463837-9
Contents CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[17]
Jerome H. Friedman: On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Data Min. Knowl. Discov. 1(1): 55-77(1997) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[18]
...
[19]
David Goldberg, David A. Nichols, Brian M. Oki, Douglas B. Terry: Using Collaborative Filtering to Weave an Information Tapestry. Commun. ACM 35(12): 61-70(1992) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[20]
Donna Harman: Ranking Algorithms. Information Retrieval: Data Structures & Algorithms 1992: 363-392 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[21]
...
[22]
Anil K. Jain, Jianchang Mao, K. Moidin Mohiuddin: Artificial Neural Networks: A Tutorial. IEEE Computer 29(3): 31-44(1996) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[23]
...
[24]
...
[25]
Pat Langley: Elements of Machine Learning. Morgan Kaufmann 1994, ISBN 1-55860-301-8
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[26]
...
[27]
...
[28]
...
[29]
...
[30]
Dunja Mladenic: Feature Subset Selection in Text-Learning. ECML 1998: 95-100 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[31]
...
[32]
Balas K. Natarajan: Machine Learning: A Theoretical Approach. Morgan Kaufmann 1991, ISBN 1-55860-148-1
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[33]
Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala: Latent Semantic Indexing: A Probabilistic Analysis. PODS 1998: 159-168 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[34]
J. Ross Quinlan: C4.5: Programs for Machine Learning. Morgan Kaufmann 1993, ISBN 1-55860-238-0
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[35]
Prabhakar Raghavan: Information Retrieval Algorithms: A Survey. SODA 1997: 11-18 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[36]
...
[37]
Stephen E. Robertson, Steve Walker: Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. SIGIR 1994: 232-241 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[38]
Gerard Salton, Chris Buckley: Term-Weighting Approaches in Automatic Text Retrieval. Inf. Process. Manage. 24(5): 513-523(1988) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[39]
Gerard Salton, Michael McGill: Introduction to Modern Information Retrieval. McGraw-Hill Book Company 1984, ISBN 0-07-054484-0
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[40]
Hinrich Schütze, David A. Hull, Jan O. Pedersen: A Comparison of Classifiers and Document Representations for the Routing Problem. SIGIR 1995: 229-237 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[41]
Upendra Shardanand, Pattie Maes: Social Information Filtering: Algorithms for Automating "Word of Mouth". CHI 1995: 210-217 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[42]
C. J. van Rijsbergen: Information Retrieval. Butterworth 1979, ISBN 0-408-70929-4
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[43]
Ellen M. Voorhees: Using WordNet to Disambiguate Word Senses for Text Retrieval. SIGIR 1993: 171-180 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[44]
A. Wald: Statistical Decision Funtions. John Wiley 1950
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[45]
Sholom M. Weiss, Casimir A. Kulikowski: Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems. Morgan Kaufmann 1990, ISBN 1-55860-065-5
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[46]
...
[47]
Tzay Y. Young, Thomas W. Calvert: Classification, Estimation and Pattern Recognition. Elsevier 1974
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[48]
George Kingsley Zipf: Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology. Addison-Wesley 1949
CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML

Last update Fri Sep 14 18:29:12 2012 CET by the DBLP TeamThis material is Open Data Data released under the ODC-BY 1.0 license — See also our legal information page