NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents.
Brad Adelberg:
NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents.
SIGMOD Conference 1998: 283-294
@inproceedings{DBLP:conf/sigmod/Adelberg98,
author = {Brad Adelberg},
editor = {Laura M. Haas and
Ashutosh Tiwary},
title = {NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured
Data from Text Documents},
booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
publisher = {ACM Press},
year = {1998},
isbn = {0-89791-995-5},
pages = {283-294},
ee = {http://doi.acm.org/10.1145/276304.276330, db/conf/sigmod/Adelberg98.html},
crossref = {DBLP:conf/sigmod/98},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
Abstract
Often interesting structured or semistructured data is not in database systems but in HTML pages,
text files, or on paper.
The data in these formats is not usable by standard query processing engines and hence users
need a way of extracting data from these sources into a DBMS or of writing wrappers around the
sources.
This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is
an interactive tool for semi-automatically determining the structure of such documents
and then extracting their data.
Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions
and then describing their semantics.
This task is expedited by a mining component that attempts to infer the grammar of the file
from the information the user has input so far.
Once the format of a document has been determined, its data can be extracted into a number of
useful forms.
This paper describes both the NoDoSE architecture, which can be used as a test bed for structure
mining algorithms in general, and the mining algorithms that have been developed by the author.
The prototype, which is written in Java, is described and experiences parsing a variety of documents
are reported.
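The workflow in the abstract (the user outlines a few regions, and a mining component generalizes them to the rest of the file) can be illustrated with a minimal Python sketch. This is not NoDoSE's actual algorithm or API, and the prototype itself is written in Java; the names `Region`, `infer_separator`, and `extrapolate`, as well as the single-separator heuristic, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """A labeled span [start, end) of the document; children decompose it further."""
    label: str
    start: int
    end: int
    children: List["Region"] = field(default_factory=list)


def infer_separator(text: str, marked: List[Region]) -> Optional[str]:
    """Hypothetical heuristic: if the text between every pair of consecutive
    user-marked regions is identical, treat it as the separator."""
    gaps = {text[a.end:b.start] for a, b in zip(marked, marked[1:])}
    return gaps.pop() if len(gaps) == 1 else None


def extrapolate(text: str, parent: Region, marked: List[Region], label: str) -> List[Region]:
    """Extend the user's marked regions across the rest of the parent region."""
    sep = infer_separator(text, marked)
    if not sep:  # nothing inferred (or empty separator): keep the user's input as-is
        return list(marked)
    regions = list(marked)
    pos = marked[-1].end
    while pos < parent.end and text.startswith(sep, pos):
        start = pos + len(sep)
        if start >= parent.end:  # trailing separator, no further region
            break
        nxt = text.find(sep, start, parent.end)
        end = nxt if nxt != -1 else parent.end
        regions.append(Region(label, start, end))
        pos = end
    return regions


# The user marks only the first two records; the rest are inferred.
text = "alice,30\nbob,25\ncarol,41\ndave,19"
root = Region("file", 0, len(text))
marked = [Region("record", 0, 8), Region("record", 9, 15)]
root.children = extrapolate(text, root, marked, "record")
records = [text[r.start:r.end] for r in root.children]
print(records)
```

The same two steps could be applied again inside each record (for example, marking fields separated by a comma) to obtain the hierarchical decomposition the abstract describes.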
Copyright © 1998 by the ACM,
Inc., used by permission. Permission to make
digital or hard copies is granted provided that
copies are not made or distributed for profit or
direct commercial advantage, and that copies show
this notice on the first page or initial screen of
a display along with the full citation.
Printed Edition
Laura M. Haas, Ashutosh Tiwary (Eds.):
SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA.
ACM Press 1998, ISBN 0-89791-995-5, SIGMOD Record 27(2), June 1998
References
- [Abi97] Serge Abiteboul: Querying Semi-Structured Data. ICDT 1997: 1-18
- [Ade98] ...
- [AK97a] ...
- [AK97b] Naveen Ashish, Craig A. Knoblock: Wrapper Generation for Semi-structured Internet Sources. SIGMOD Record 26(4): 8-15 (1997)
- [CGMH+97] Sudarshan S. Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey D. Ullman, Jennifer Widom: The TSIMMIS Project: Integration of Heterogeneous Information Sources. IPSJ 1994: 7-18
- [Gol90] ...
- [HGMC+97] ...
- [KGP88] ...
- [KWD97] ...
- [Liv90] ...
Copyright © Mon Dec 14 20:19:14 2009
by Michael Ley (ley@uni-trier.de)