Information filtering and query indexing for an information retrieval model

Published: 09 March 2009

Abstract

In the information filtering paradigm, clients subscribe to a server with continuous queries or profiles that express their information needs. Clients can also publish documents to servers. Whenever a document is published, the continuous queries satisfied by this document are found and notifications are sent to the appropriate clients. This article deals with the filtering problem that needs to be solved efficiently by each server: Given a database of continuous queries db and a document d, find all queries q ∈ db that match d. We present data structures and indexing algorithms that enable us to solve the filtering problem efficiently for large databases of queries expressed in the model AWP. AWP is based on named attributes with values of type text, and its query language includes Boolean and word proximity operators.
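To make the filtering problem concrete, here is a minimal Python sketch of a brute-force baseline that tests every query in db against a published document d. It is not the indexing approach of the article, which is designed precisely to avoid such a linear scan. The query representation is a simplified, hypothetical stand-in for AWP: the names `Word`, `Near`, `holds`, and `filter_document` are assumptions made for illustration, each query is treated as a plain conjunction of conditions, and the proximity semantics (the first word precedes the second with at most `max_gap` words in between) is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Word:
    """Hypothetical condition: the attribute's text contains the word."""
    attribute: str
    word: str


@dataclass(frozen=True)
class Near:
    """Hypothetical proximity condition: `first` occurs before `second`
    with at most `max_gap` words between them (assumed semantics)."""
    attribute: str
    first: str
    second: str
    max_gap: int


Condition = Union[Word, Near]
Document = Dict[str, str]  # attribute name -> text value


def tokens(text: str) -> List[str]:
    # Naive tokenization: lowercase and split on whitespace.
    return text.lower().split()


def holds(cond: Condition, doc: Document) -> bool:
    words = tokens(doc.get(cond.attribute, ""))
    if isinstance(cond, Word):
        return cond.word in words
    firsts = [i for i, w in enumerate(words) if w == cond.first]
    seconds = [i for i, w in enumerate(words) if w == cond.second]
    return any(0 < j - i <= cond.max_gap + 1 for i in firsts for j in seconds)


def filter_document(db: Dict[str, List[Condition]], doc: Document) -> List[str]:
    """Return the ids of all queries in db whose conditions all hold for doc."""
    return [qid for qid, conds in db.items() if all(holds(c, doc) for c in conds)]


# Example data (invented for illustration).
db = {
    "q1": [Word("title", "filtering")],
    "q2": [Near("abstract", "continuous", "queries", 0)],
    "q3": [Word("title", "xml")],
}
doc = {
    "title": "Information filtering and query indexing",
    "abstract": "clients subscribe with continuous queries or profiles",
}
matched = filter_document(db, doc)
```

The cost of this scan grows linearly with the number of stored queries for every published document, which is the inefficiency that query indexing is meant to address for large databases.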

Published In

ACM Transactions on Information Systems  Volume 27, Issue 2
February 2009
184 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1462198

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 March 2009
Accepted: 01 June 2008
Revised: 01 July 2007
Received: 01 February 2006
Published in TOIS Volume 27, Issue 2

Author Tags

  1. Information filtering
  2. performance evaluation
  3. query indexing algorithms
  4. selective dissemination of information

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2020) Research on personalized recommendation hybrid algorithm for interactive experience equipment. Computational Intelligence 36:3 (1348-1373). DOI: 10.1111/coin.12375. Online publication date: 20-Jul-2020
  • (2019) The object model of the selective information dissemination. Scientific and Technical Libraries (61-75). DOI: 10.33186/1027-3689-2019-4-61-75. Online publication date: 5-Apr-2019
  • (2019) Ping - A customizable, open-source information filtering system for textual data. Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems (228-231). DOI: 10.1145/3328905.3332512. Online publication date: 24-Jun-2019
  • (2018) Technological Features of the Renewed System of Selective Dissemination of Information in the Library for Natural Sciences of the RAS. Bibliotekovedenie [Library and Information Science (Russia)] 67:5 (513-522). DOI: 10.25281/0869-608X-2018-67-5-513-522. Online publication date: 7-Dec-2018
  • (2018) A distributed full-text top-k document dissemination system in distributed hash tables. World Wide Web 14:5-6 (545-572). DOI: 10.1007/s11280-010-0106-0. Online publication date: 25-Dec-2018
  • (2018) A language model-based framework for multi-publisher content-based recommender systems. Information Retrieval 21:5 (369-409). DOI: 10.1007/s10791-018-9327-0. Online publication date: 1-Oct-2018
  • (2017) Query Reorganization Algorithms for Efficient Boolean Information Filtering. IEEE Transactions on Knowledge and Data Engineering 29:2 (418-432). DOI: 10.1109/TKDE.2016.2620140. Online publication date: 1-Feb-2017
  • (2016) Full-Text Support for Publish/Subscribe Ontology Systems. Proceedings of the 13th International Conference on The Semantic Web. Latest Advances and New Domains - Volume 9678 (233-249). DOI: 10.1007/978-3-319-34129-3_15. Online publication date: 29-May-2016
  • (2015) MTAF: An Adaptive Design for Keyword-Based Content Dissemination on DHT Networks. IEEE Transactions on Parallel and Distributed Systems 26:4 (1071-1084). DOI: 10.1109/TPDS.2014.5. Online publication date: Apr-2015
  • (2015) Pubsub: An Efficient Publish/Subscribe System. IEEE Transactions on Computers 64:4 (1119-1132). DOI: 10.1109/TC.2014.2315636. Online publication date: 1-Apr-2015
