Database support for species extraction from the biosystematics literature: a feasibility demonstration

Authors: Ralf Duckstein, Klemens Böhm

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management

Pages 515 - 522

https://doi.org/10.1145/1031171.1031269

Published: 13 November 2004

Abstract

Part of the biosystematics literature is currently being digitized and manually marked up with XML, with the goal of making fast search over such documents feasible. But marking up such documents incurs high costs, and biologists would like to know the value of such an activity in advance. Deploying standard XML database technology in a straightforward way is not feasible because of two characteristics of biosystematics documents. The first is that descriptions of taxa are related, i.e., a more specific taxon should inherit from a more general one. Combining inheritance with information-retrieval mechanisms gives rise to difficulties that this article addresses. The second is the frequent occurrence of very specific technical terms in such documents, e.g., geographical information or biological terms. To investigate the characteristics of search in the presence of these difficulties, we have designed and implemented a system based on relational database technology. We use a collection of XML documents that mimics the characteristics of biosystematics documents, as we will explain. We propose two query-evaluation alternatives and compare them by means of performance experiments. It turns out that our techniques can administer the envisioned corpus of documents efficiently and cope with both problems at the same time.
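
The abstract does not describe the data model or the ranking function, so the following is only a minimal sketch of why inheritance complicates keyword search over taxon descriptions. It assumes a toy in-memory hierarchy and a plain term-frequency score. The names `PARENT`, `OWN_TEXT`, `inherited_terms`, `score`, and `search`, as well as the description texts, are made up for illustration and are not taken from the paper, which builds on relational database technology.

```python
from collections import Counter

# Hypothetical toy hierarchy: child taxon -> parent taxon (None for the root).
PARENT = {
    "Formicidae": None,
    "Formica": "Formicidae",
    "Formica rufa": "Formica",
}

# Hypothetical description text stored directly with each taxon.
OWN_TEXT = {
    "Formicidae": "petiole present antennae elbowed",
    "Formica": "mandibles triangular",
    "Formica rufa": "thorax red found in European forest",
}


def inherited_terms(taxon):
    """Collect term counts from the taxon itself and all of its ancestors."""
    terms = Counter()
    while taxon is not None:
        terms.update(OWN_TEXT[taxon].lower().split())
        taxon = PARENT[taxon]
    return terms


def score(taxon, query):
    """Sum the term frequencies of the query words (a plain tf score)."""
    terms = inherited_terms(taxon)
    return sum(terms[word] for word in query.lower().split())


def search(query):
    """Return (taxon, score) pairs with a nonzero score, best first."""
    hits = [(taxon, score(taxon, query)) for taxon in PARENT]
    return sorted((h for h in hits if h[1] > 0), key=lambda h: (-h[1], h[0]))
```

In this sketch, a query for `petiole red` ranks `Formica rufa` first even though `petiole` occurs only in the family-level description. A search restricted to each taxon's own text would miss that match, which is the kind of interaction between inheritance and retrieval the abstract refers to.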

Index Terms

Database support for species extraction from the biosystematics literature: a feasibility demonstration
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management

November 2004

678 pages

ISBN: 1581138741

DOI: 10.1145/1031171

General Chair: David Grossman (Illinois Institute of Technology)

Program Chairs: Luis Gravano (Columbia University), ChengXiang Zhai (University of Illinois at Urbana-Champaign), Otthein Herzog (University of Bremen, Germany), David A. Evans (Clairvoyance Corporation)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tag

biosystematics

Qualifiers

Article

Conference

CIKM04: Conference on Information and Knowledge Management

November 8 - 13, 2004

Washington, D.C., USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
