
NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents.

Brad Adelberg: NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. SIGMOD Conference 1998: 283-294
@inproceedings{DBLP:conf/sigmod/Adelberg98,
  author    = {Brad Adelberg},
  editor    = {Laura M. Haas and
               Ashutosh Tiwary},
  title     = {NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured
               Data from Text Documents},
  booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
               on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
  publisher = {ACM Press},
  year      = {1998},
  isbn      = {0-89791-995-5},
  pages     = {283-294},
  ee        = {http://doi.acm.org/10.1145/276304.276330, db/conf/sigmod/Adelberg98.html},
  crossref  = {DBLP:conf/sigmod/98},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described and experiences parsing a variety of documents are reported.
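The workflow in the abstract (the user outlines regions hierarchically, and a mining component guesses the rest of the structure from what has been marked so far) can be illustrated with a small sketch. This is not NoDoSE's algorithm or API. The `Node` class and the functions `infer_delimiter`, `extend_siblings`, and `extract` are hypothetical names invented here, and the "mining" step is reduced to one simplifying assumption: sibling regions are separated by a single constant delimiter string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A user-outlined region of the document (hypothetical structure)."""
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    children: List["Node"] = field(default_factory=list)


def infer_delimiter(text: str, marked: List[Node]) -> Optional[str]:
    """Guess the separator between sibling regions from the gaps between them.

    Returns the gap string if all gaps between consecutive marked regions
    are identical and non-empty, otherwise None.
    """
    gaps = {text[a.end:b.start] for a, b in zip(marked, marked[1:])}
    if len(gaps) == 1:
        return gaps.pop() or None
    return None


def extend_siblings(text: str, parent: Node, label: str) -> None:
    """Propose the remaining sibling regions inside `parent`.

    Assumes the user has marked at least two children; does nothing if no
    consistent delimiter can be inferred from them.
    """
    delim = infer_delimiter(text, parent.children)
    if delim is None:
        return
    pos = parent.children[-1].end
    while text.startswith(delim, pos) and pos + len(delim) < parent.end:
        start = pos + len(delim)
        nxt = text.find(delim, start, parent.end)
        end = nxt if nxt != -1 else parent.end
        parent.children.append(Node(label, start, end))
        pos = end


def extract(text: str, node: Node):
    """Turn the region tree into nested Python data (leaves become strings)."""
    if not node.children:
        return text[node.start:node.end]
    return [extract(text, child) for child in node.children]
```

For example, given the text `"alice,30\nbob,25\ncarol,41\ndave,19"` and a root node whose first two records have been marked by hand at offsets 0-8 and 9-15, calling `extend_siblings` adds the remaining two records, and `extract` then returns the four record strings as a list. A real structure miner would have to handle nested lists, typed fields, and inconsistent separators, none of which this sketch attempts.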

Copyright © 1998 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.



Printed Edition

Laura M. Haas, Ashutosh Tiwary (Eds.): SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press 1998, ISBN 0-89791-995-5, SIGMOD Record 27(2), June 1998



Copyright © Mon Dec 14 20:19:14 2009 by Michael Ley (ley@uni-trier.de)