ABSTRACT
Genealogy research is the study of family history using available resources such as historical records. Ancestry provides its customers with one of the world's largest online genealogical indexes, with billions of records from a wide range of sources, including vital records such as birth and death certificates, census records, and court and probate records, among many others. Search at Ancestry aims to return relevant records from various record types, allowing our subscribers to build their family trees, research their family history, and make meaningful discoveries about their ancestors from diverse perspectives.
In a modern search engine designed for genealogical study, ranking search results so that the most relevant information appears first is a daunting challenge. In particular, the disparity among historical records makes it inherently difficult to score records in an equitable fashion. Herein, we provide an overview of our solutions to such record disparity problems in the Ancestry search engine. Specifically, we introduce customized coordinate ascent (customized CA) to speed up ranking within a specific record type. We then propose stochastic search (SS), which linearly combines ranked results federated across content from various record types. Furthermore, we propose a novel information retrieval metric, normalized cumulative entropy (NCE), to measure the diversity of results. We demonstrate the effectiveness of these two algorithms in terms of relevance (measured by NDCG) and, where applicable, diversity (measured by NCE) in offline experiments using real customer data at Ancestry.
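As a rough illustration of the ideas named in the abstract, the sketch below shows a linear fusion of per-record-type ranked lists, the standard NDCG relevance metric, and a simple entropy-based diversity score. It is a minimal sketch under stated assumptions, not the method used at Ancestry: the function names, the per-type weights, and the sample scores are all hypothetical, and `normalized_type_entropy` is plain normalized Shannon entropy over record types, not NCE, whose definition the abstract does not give.

```python
import math
from collections import Counter


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def fuse_linear(results_by_type, weights):
    """Merge per-type ranked lists by scaling each score with a per-type weight.

    results_by_type: {record_type: [(record_id, score), ...]}  (hypothetical shape)
    weights: {record_type: float}  (hypothetical; how they are learned is not shown)
    """
    fused = [(record_id, record_type, weights[record_type] * score)
             for record_type, results in results_by_type.items()
             for record_id, score in results]
    # Highest fused score first.
    return sorted(fused, key=lambda item: item[2], reverse=True)


def normalized_type_entropy(record_types, num_types):
    """Shannon entropy of the record-type mix, scaled to [0, 1].

    Illustrative stand-in for a diversity score; NOT the paper's NCE.
    """
    if num_types < 2 or not record_types:
        return 0.0
    total = len(record_types)
    counts = Counter(record_types)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(num_types)


# Hypothetical per-type results and weights.
results = {"census": [("c1", 0.9), ("c2", 0.5)], "birth": [("b1", 0.8)]}
weights = {"census": 0.5, "birth": 1.0}
merged = fuse_linear(results, weights)
merged_ids = [record_id for record_id, _, _ in merged]
merged_types = [record_type for _, record_type, _ in merged]
```

Here the weights rescale scores that are not directly comparable across record types before the lists are merged; a relevance metric such as `ndcg` and a diversity score over `merged_types` can then be computed on the fused list.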
Index Terms
- Ranking in Genealogy: Search Results Fusion at Ancestry