Abstract
Query expansion is a technique utilized to improve the performance of information retrieval systems by automatically adding related terms to the initial query. These additional terms can be obtained from documents stored in a database. Usually, this task is performed by clustering the documents and then extracting representative terms from the clusters. Afterwards, a new search is performed in the whole database using the expanded set of terms. Recently, the authors have proposed an immune-inspired algorithm, namely BIC-aiNet, to perform biclustering of texts. Biclustering differs from standard clustering algorithms in the sense that the former can detect partial similarities in the attributes. The preliminary results indicated that our proposal is able to group similar texts effectively and the generated biclusters consistently presented relevant words to represent a category of texts. Motivated by this promising scenario, this paper better formalizes the proposal and investigates the usefulness of the whole methodology on larger datasets. The BIC-aiNet was applied to a set of documents aiming at identifying the set of relevant terms associated with each bicluster, giving rise to a query expansion tool. The obtained results were compared with those produced by two alternative proposals in the literature, and they indicate that these techniques tend to generate complementary results, as a consequence of the use of distinct similarity metrics.
Similar content being viewed by others
References
Ada GL, Nossal GJV (1987) The clonal selection theory. Sci Am 257:50–57
Agrawal R, Gehrke J, Gunopulus D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM/SIGMOD international conference on management of data, pp 94–105
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503–510. doi:10.1038/35000501
Castro PAD, de França FO, Ferreira HM, Von Zuben FJ (2007a) Applying Biclustering to Perform Collaborative Filtering. In: Proc. of the 7th International Conference on Intelligent Systems Design and Applications (ISDA), p 421–426, Brazil
Castro PAD, de França FO, Ferreira HM, Von Zuben FJ (2007b) Evaluating the Performance of a Biclustering Algorithm Applied to Collaborative Filtering—A Comparative Analysis. In: Proc. of the 7th International Conference on Hybrid Intelligent Systems (HIS), p 65–70, Germany
Castro PAD, de França FO, Ferreira HM, Von Zuben FJ (2007c) Applying biclustering to text mining: an immune-inspired approach. In: Proceedings of the 6th international conference on artificial immune systems (ICARIS), Brazil, pp 83–94
Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the 8th international conference on intelligent systems for molecular biology, pp 93–103
Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, Lockhart D, Davis R (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2:65–73. doi:10.1016/S1097-2765(00)80114-8
Coelho GP, de França FO, Von Zuben FJ (2008) A multi-objective multipopulation approach for biclustering. In: Proceedings of 7th international conference on artificial immune systems (ICARIS), vol 5132, pp 71–82
Croft WB, Cook R, Wilder D (1995) Providing government information on the Internet: experiences with THOMAS. In: Proceedings of the 2nd international conference on the theory and practice of digital libraries, pp 19–24
de Castro LN, Timmis J (2002) Artificial immune systems: a new computational intelligence approach. Springer Verlag, London
de Castro LN, Von Zuben FJ (2001) aiNet: an artificial immune network for data analysis. In: Data mining: a heuristic approach, pp 231–259
de Castro LN, Von Zuben FJ (2002) Learning and optimization using the clonal selection principle. IEEE Trans Evol Comput 6(3):239–251. doi:10.1109/TEVC.2002.1011539
Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197. doi:10.1109/4235.996017
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th international conference on knowledge discovery and data mining, pp 269–274
Divina F, Aguilar–Ruiz JS (2007). A multi-objective approach to discover biclusters in microarray data. In: Proceedings of the genetic and evolutionary computation conference (GECCO’07), London, UK,pp 385–392
Feldman R, Sanger J (2006) The Text Mining Handbook. Cambridge University Press
Giráldez R, Divina F, Pontes B, Aguilar–Ruiz JS (2007). Evolutionary search of biclusters by minimal intrafluctuation. In Proceedings of the IEEE international fuzzy systems conference (FUZZ–IEEE 2007), London, UK, pp 1–6
Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67(337):123–129. doi:10.2307/2284710 JASA
Jerne NK (1974) Towards a network theory of the immune system. Ann Immunol (Inst Pasteur) 125C:373–389
Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21. doi:10.1108/eb026526
Lang K (1995) Newsweeder: learning to filter netnews. In: Proceedings of the twelfth international conference on machine learning, pp 331–339
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1:24–25. doi:10.1109/TCBB.2004.2
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Maulik U, Mukhopadhyay A, Bandyopadhyay S, Zhang MQ, Zhang X (2008). Multiobjective fuzzy biclustering in microarray data: method and a new performance measure. In: Proceedings of the 2008 IEEE congress on evolutionary computation (CEC 2008), Hong Kong, China, pp 1536–1543
Mitra S, Banka H (2006) Multi-objective evolutionary biclustering of gene expression data. Pattern Recognit 39:2464–2477. doi:10.1016/j.patcog.2006.03.003
Mitra S, Banka H, Pal SK (2006). A more framework for biclustering of microarray data. In: Proceedings of the 18th international conference on pattern recognition (ICPR’06), Hong Kong, China, pp 1154–1157
Sheng Q, Moreau Y, De Moor B (2003) Biclustering microarray data by Gibbs sampling. Bioinformatics 19(2):196–205. doi:10.1093/bioinformatics/btg1078
Symeonidis P, Nanopoulos A, Papadopoulos A, Manolopoulos Y (2007). Nearestbiclusters collaborative filtering with constant values. In: Advances in web mining and web usage analysis, vol 4811, Lecture notes in computer science. Springer-Verlag, Philadelphia, pp 36–55
Tanay A, Sharan R, Shamir R (2005) Biclustering algorithms: a survey. In: Aluru S (ed) Handbook of computational molecular biology. Chapman & Hall/CRC Computer and Information Science Series, Boca Raton, FL
Tang C, Zhang L, Zhang I, Ramanathan M (2001) Interrelated two-way clustering: an unsupervised approach for gene expression data analysis. In: Proceedings of the 2nd IEEE international symposium on bioinformatics and bioengineering, pp 41–48
Zeimpekis D, Gallopoulos E (2005) TMG: a MATLAB toolbox for generating term-document matrices from text collections. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping multidimensional data: recent advances in clustering. Springer, Berlin, pp 187–210
Acknowledgements
This research was sponsored by UOL (www.uol.com.br), through its UOL Bolsa Pesquisa program. The authors would also like to thank CNPq and CAPES for their additional financial support.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
de Castro, P.A.D., de França, F.O., Ferreira, H.M. et al. Query expansion using an immune-inspired biclustering algorithm. Nat Comput 9, 579–602 (2010). https://doi.org/10.1007/s11047-009-9127-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11047-009-9127-y