
Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity.

William W. Cohen: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. SIGMOD Conference 1998: 201-212
@inproceedings{DBLP:conf/sigmod/Cohen98,
  author    = {William W. Cohen},
  editor    = {Laura M. Haas and
               Ashutosh Tiwary},
  title     = {Integration of Heterogeneous Databases Without Common Domains
               Using Queries Based on Textual Similarity},
  booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
               on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
  publisher = {ACM Press},
  year      = {1998},
  isbn      = {0-89791-995-5},
  pages     = {201-212},
  ee        = {http://doi.acm.org/10.1145/276304.276323},
  crossref  = {DBLP:conf/sigmod/98},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIRL is much faster than naive inference methods, even for short queries. We also show that inferences made by WHIRL are surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.
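
The abstract refers to measuring the similarity of name constants with the vector-space model. As a rough illustration only, and not the paper's implementation, the sketch below scores pairs of names by the cosine of their TF-IDF vectors. The function names, the tokenizer, the sample names, and the exact weighting formula (a smoothed log variant) are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def tfidf_vectors(names):
    """Return one unit-length TF-IDF vector (a dict) per name."""
    counts = [Counter(tokenize(name)) for name in names]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each name counts once per distinct token
    n = len(names)
    vectors = []
    for c in counts:
        # Assumed weighting for this sketch: log(tf + 1) * log(1 + N / df).
        vec = {t: math.log(tf + 1) * math.log(1 + n / doc_freq[t])
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical name constants, chosen only to illustrate the idea.
names = ["William W. Cohen", "Cohen, William", "Donald E. Knuth"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # names sharing tokens score above zero
print(cosine(vecs[0], vecs[2]))  # names with no shared token score zero
```

Because rarer tokens receive larger weights, two names that share an uncommon token score higher than two that share only a common one, which is the property that lets a similarity score stand in for exact matching on a normalized global domain.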

Copyright © 1998 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.



Printed Edition

Laura M. Haas, Ashutosh Tiwary (Eds.): SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press 1998, ISBN 0-89791-995-5, SIGMOD Record 27(2), June 1998


References

[Abiteboul and Vianu, 1997]
Serge Abiteboul, Victor Vianu: Regular Path Queries with Constraints. PODS 1997: 122-133
[Arens et al., 1996]
...
[Atzeni et al., 1997]
...
[Barbara et al., 1992]
Daniel Barbará, Hector Garcia-Molina, Daryl Porter: The Management of Probabilistic Data. IEEE Trans. Knowl. Data Eng. 4(5): 487-502(1992)
[Bartell et al., 1994]
Brian T. Bartell, Garrison W. Cottrell, Richard K. Belew: Automatic Combination of Multiple Ranked Retrieval Systems. SIGIR 1994: 173-181
[Bayardo et al., 1997]
Roberto J. Bayardo Jr., William Bohrer, Richard S. Brice, Andrzej Cichocki, Jerry Fowler, Abdelsalam Helal, Vipul Kashyap, Tomasz Ksiezyk, Gale Martin, Marian H. Nodine, Mosfeq Rashid, Marek Rusinkiewicz, Ray Shea, C. Unnikrishnan, Amy Unruh, Darrell Woelk: InfoSleuth: Semantic Integration of Information in Open and Dynamic Environments (Experience Paper). SIGMOD Conference 1997: 195-206
[Boyan et al., 1994]
...
[Chaudhuri et al., 1995]
Surajit Chaudhuri, Umeshwar Dayal, Tak W. Yan: Join Queries with External Text Sources: Execution and Optimization Techniques. SIGMOD Conference 1995: 410-422
[Cohen and Singer, 1996]
William W. Cohen, Yoram Singer: Context-sensitive Learning Methods for Text Categorization. SIGIR 1996: 307-315
[Cohen et al., 1997]
...
[Cohen, 1997a]
...
[Cohen, 1997b]
...
[Duschka and Genesereth, 1997a]
Oliver M. Duschka, Michael R. Genesereth: Answering Recursive Queries Using Views. PODS 1997: 109-116
[Duschka and Genesereth, 1997b]
...
[Fang et al., 1994]
...
[Felligi and Sunter, 1969]
...
[Fiebig et al., 1997]
...
[Fuhr, 1995]
Norbert Fuhr: Probabilistic Datalog - A Logic For Powerful Retrieval Methods. SIGIR 1995: 282-290
[Garcia-Molina et al., 1995]
Hector Garcia-Molina, Dallan Quass, Yannis Papakonstantinou, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Jennifer Widom: The TSIMMIS Approach to Mediation: Data Models and Languages. NGITS 1995: 0-
[Hernandez and Stolfo, 1995]
Mauricio A. Hernández, Salvatore J. Stolfo: The Merge/Purge Problem for Large Databases. SIGMOD Conference 1995: 127-138
[Huffman and Steier, 1995]
...
[Kilss and Alvey, 1985]
...
[Knuth, 1975]
Donald E. Knuth: The Art of Computer Programming, Volume I: Fundamental Algorithms, 2nd Edition. Addison-Wesley 1973
[Konopnicki and Shmueli, 1995]
David Konopnicki, Oded Shmueli: W3QS: A Query System for the World-Wide Web. VLDB 1995: 54-65
[Korf, 1993]
Richard E. Korf: Linear-Space Best-First Search. Artif. Intell. 62(1): 41-78(1993)
[Levy et al., 1996a]
Alon Y. Levy, Anand Rajaraman, Joann J. Ordille: Querying Heterogeneous Information Sources Using Source Descriptions. VLDB 1996: 251-262
[Levy et al., 1996b]
Alon Y. Levy, Anand Rajaraman, Joann J. Ordille: Query-Answering Algorithms for Information Agents. AAAI/IAAI, Vol. 1 1996: 40-47
[Lewis, 1992]
...
[Mendelzon and Milo, 1997]
Alberto O. Mendelzon, Tova Milo: Formal Models of Web Queries. PODS 1997: 134-143
[Monge and Elkan, 1996]
Alvaro E. Monge, Charles Elkan: The Field Matching Problem: Algorithms and Applications. KDD 1996: 267-270
[Monge and Elkan, 1997]
Alvaro E. Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. DMKD 1997: 0-
[Newcombe et al., 1959]
...
[Nilsson, 1987]
...
[Porter, 1980]
...
[Quinlan, 1990]
...
[Salton, 1989]
Gerard Salton: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley 1989, ISBN 0-201-12227-8
[Schäuble, 1993]
Peter Schäuble: SPIDER: A Multiuser Information Retrieval System for Semistructured and Dynamic Data. SIGIR 1993: 318-327
[Suciu, 1996]
Dan Suciu: Query Decomposition and View Maintenance for Query Languages for Unstructured Data. VLDB 1996: 227-238
[Suciu, 1997]
...
[Tomasic et al., 1997]
Anthony Tomasic, Rémy Amouroux, Philippe Bonnet, Olga Kapitskaia, Hubert Naacke, Louiqa Raschid: The Distributed Information Search Component (Disco) and the World Wide Web. SIGMOD Conference 1997: 546-548
[Turtle and Flood, 1995]
Howard R. Turtle, James Flood: Query Evaluation: Strategies and Optimizations. Inf. Process. Manage. 31(6): 831-850(1995)

Last update Tue Sep 18 00:25:17 2012 CET by the DBLP Team. This material is Open Data released under the ODC-BY 1.0 license.