A process model for information retrieval context learning and knowledge discovery

Artificial Intelligence and Law

Abstract

In this paper we take a fresh look at the information retrieval (IR) problem of balancing recall with precision in electronic document extraction. We examine the IR constructs of uncertainty, context, and relevance; propose a new process model for context learning; and introduce a new IT artifact designed to support user-driven learning by leveraging explicit knowledge to discover implicit knowledge within a corpus of documents. The IT artifact is a prototype that presents a small set of documents extracted from a targeted corpus according to user-supplied criteria. The prototype lets the user balance exploration and exploitation through iterative relevance feedback, addressing the imprecision that results from uncertainty. We model the problem as an exploration–exploitation dilemma and apply it to a specific case of IR called eDiscovery. We conduct a series of behavioral experiments to evaluate the model and the artifact. Our initial findings indicate that the proposed model and the artifact improve performance in the IR result.
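The exploration–exploitation framing of iterative relevance feedback can be illustrated with a small sketch. This is not the authors' prototype: it assumes an epsilon-greedy selection rule and a simple additive term-weight update, and the names `feedback_loop`, `score`, `judge`, and `epsilon` are hypothetical. The `judge` callable stands in for the human reviewer who marks each presented document as relevant or not.

```python
import random
from collections import Counter


def score(doc_terms, weights):
    """Sum of the learned weights of the distinct terms in a document."""
    return sum(weights[t] for t in set(doc_terms))


def feedback_loop(corpus, seed_terms, judge, rounds=5, batch=2,
                  epsilon=0.3, rng=None):
    """Present small batches of documents and learn term weights from feedback.

    corpus:     dict mapping document id -> list of terms
    seed_terms: explicit knowledge supplied by the user
    judge:      callable(doc_id) -> bool, a stand-in for the human reviewer
    epsilon:    probability of exploring instead of exploiting (assumed rule)
    """
    rng = rng or random.Random(0)
    weights = Counter({t: 1.0 for t in seed_terms})
    unseen = set(corpus)
    relevant = []
    for _ in range(rounds):
        for _ in range(batch):
            if not unseen:
                return relevant, weights
            candidates = sorted(unseen)  # stable order for reproducibility
            if rng.random() < epsilon:
                # Explore: sample a document the current weights may rank poorly.
                doc_id = rng.choice(candidates)
            else:
                # Exploit: take the best-scoring unseen document.
                doc_id = max(candidates, key=lambda d: score(corpus[d], weights))
            unseen.discard(doc_id)
            if judge(doc_id):
                relevant.append(doc_id)
                # Promote terms seen in relevant documents, so that terms the
                # user never typed can start to influence the ranking.
                for t in set(corpus[doc_id]):
                    weights[t] += 0.5
    return relevant, weights
```

With `epsilon=0.0` the loop only exploits what it has already learned; raising `epsilon` spends more of the reviewer's attention on documents the current term weights would not surface, which is one way to trade precision in the short run for recall later.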

Notes

  1. Recall and Precision are measures of IR performance explained later in this paper.

  2. We found in our early user interviews that many eDiscovery practitioners would like a tool that offers an easy way to “take a quick peek” inside a collection without having to use a heavy processing application.

  3. For more information on Zubulake and its effects, the reader can consult the book written in 2012 by the plaintiff in the case.
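Note 1 defers the definitions of recall and precision to the body of the paper, which is not reproduced here. As a minimal sketch, assuming the standard set-based definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved), the two measures can be computed as follows. The function name `precision_recall` and the document identifiers are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, two of them among the five truly relevant ones.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d2", "d4", "d7", "d8", "d9"])
print(p, r)  # 0.5 0.4
```

Retrieving more documents can only hold recall steady or raise it, but it usually lowers precision, which is the balance the abstract refers to.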

Author information

Correspondence to Harvey Hyman.

Appendix: eDiscovery IR request adapted from the TREC legal track 2011 conference problem set #401

The purpose of this task is to retrieve documents that match the request for production below. The company in this case is Enron, a now-defunct energy trading company that was the subject of a large body of civil and criminal litigation.

The following is the request for production:

You are requested to produce all documents or communications that describe, discuss, refer to, report on, or relate to the design, development, operation, or marketing of enrononline, or any other online service offered, provided, or used by the Company (or any of its subsidiaries, predecessors, or successors-in-interest), for the purchase, sale, trading, or exchange of financial or other instruments or products, including but not limited to, derivative instruments, commodities, futures, and swaps.

1.1 Additional guidance for relevance

The above request broadly seeks documents concerning EnronOnline, the Company’s general-purpose trading system, or any other online financial or commodities services offered, provided, or used by the Company and its agents.

In this case, attorney-client communication or otherwise privileged information is not an issue.

This request is seeking information specifically about an online system for trading financial instruments. A document is not relevant if it refers to the purchase, sale, trading, or exchange of a financial instrument or product, but does not involve the use of an online system.

A document is relevant if it describes, discusses, refers to, reports on, or relates to the design, development, operation, or marketing of “enrononline” or any other online service offered, provided, or used. This includes how the system was set up, how the system worked on a day-to-day basis, how the Company developed or modified the system, how the Company marketed or advertised the system, and the actual use of the system by the Company, its subsidiaries, predecessors, or successors in interest.

A relevant document can concern the purchase, sale, trading, or exchange of financial instruments or financial products, including derivative instruments, commodities, futures, or swaps. These instruments and products are distinguished from other goods and services by the fact that their value depends on future events and their purchase incurs financial risk.

A document is relevant even if it makes only implicit reference to these parameters. No particular transaction (i.e., purchase or sale) need be cited specifically. If the document generally references such activities, transactions, or a system whose function is to execute such transactions, and it otherwise meets the criteria, it is relevant.

Examples of responsive documents include correspondence, policy statements, press releases, contact lists, and EnronOnline guest access emails.

1.2 Additional guidance for non-relevance

Examples of non-relevant documents include documents about the purchase, sale, trading, or exchange of products or services other than financial instruments or products, and any documents referring to employee stock options or stock purchase plans offered as incentives or compensation, or the exercise thereof. Documents relating to structured finance deals or swaps that are specified explicitly by written contracts are also not relevant, even if the contracts themselves are electronic or electronically signed. Documents related to the use of online systems by Enron employees for their personal use are likewise outside this request and are not relevant.
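The guidance above has a conjunctive structure: a document must involve an online system and the trading of financial instruments, and certain categories are carved out. A rough keyword-based first pass over that structure might look like the sketch below. The term lists, the function names `mentions` and `first_pass_label`, and the two labels are all hypothetical choices made for illustration; they are not part of the TREC request. Substring matching of this kind is crude, it does not encode the carve-out for deals specified by written contracts, and it cannot replace human relevance judgment.

```python
import re

# Hypothetical term lists chosen for illustration only.
SYSTEM_NAMES = ("enrononline", "enron online")
ONLINE_TERMS = ("online", "web-based", "internet")
INSTRUMENT_TERMS = ("derivative", "commodit", "futures", "swap", "trading")
EXCLUSION_TERMS = ("employee stock option", "stock purchase plan", "personal use")


def mentions(text, terms):
    """True if any term occurs in the lower-cased, whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return any(term in normalized for term in terms)


def first_pass_label(text):
    """Return 'review' for likely candidates, otherwise 'deprioritize'."""
    # Carve-outs from section 1.2 take precedence over everything else.
    if mentions(text, EXCLUSION_TERMS):
        return "deprioritize"
    # Any reference to the named system is a candidate, even an implicit one.
    if mentions(text, SYSTEM_NAMES):
        return "review"
    # Otherwise require both an online system and a financial instrument.
    if mentions(text, ONLINE_TERMS) and mentions(text, INSTRUMENT_TERMS):
        return "review"
    return "deprioritize"
```

For example, a message about gas futures sold by phone would be deprioritized because no online system is involved, while a message about executing swaps through an internet portal would be queued for review.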

Cite this article

Hyman, H., Sincich, T., Will, R. et al. A process model for information retrieval context learning and knowledge discovery. Artif Intell Law 23, 103–132 (2015). https://doi.org/10.1007/s10506-015-9165-y

  • DOI: https://doi.org/10.1007/s10506-015-9165-y
