Abstract
While traditional document analysis has focused on printed media, an increasingly large portion of the documents today are generated in digital form from the start. Such “documents born digital” range from plain text documents such as emails to more sophisticated forms such as PDF documents and Web documents. On the one hand, the existence of the digital encoding of documents eliminates the need for scanning, image processing, and character recognition in most situations (a notable exception being the prevalent use of text embedded in images for Web documents, as elaborated upon in section “Analysis of Text in Web Images”). On the other hand, many higher-level processing tasks remain due to the fact that the design purpose of almost existing digital document encoding systems (i.e., HTML, PDF) is for display or printing for human consumption, not for machine-level information exchange and extraction. As such, significant amount of processing is still required for automatic information extraction, indexing, and content repurposing from such documents, and many challenges exist in this process. This chapter describes in detail the key technologies for processing documents born digital, with a focus on PDF and Web document processing.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Adelberg B (1998) NoDoSE – a tool for semi-automatically extracting structured and Semi-structured data from text documents. In: ACM SIGMOD international conference on management of data (SIGMOD’98), Seattle, pp 283–294
Ailon N, Charikar M, Newman A (2005) Aggregating inconsistent information: ranking and clustering. In: 37th STOC, Baltimore, pp 684–693
Anjewierden A (2001) AIDAS: incremental logical structure discovery in PDF documents. In: 6th international conference on document analysis and recognition (ICDAR), Seattle, Sept 2001, pp 374–378
Antonacopoulos A, Hu J (ed) (2004) Web document analysis: challenges and opportunities. World Scientific, Singapore
Cai D, Yu S, Wen J-R, Ma W-Y (2003) Extracting content structure for web pages based on visual representation. In 5th Asia Pacific Web Conference, pp 406–415
Califf ME, Mooney RJ (1999) Relational learning of pattern-match rules for information extraction. In: Proceedings of the sixteenth national conference on artificial intelligence and the eleventh innovative applications of artificial intelligence conference, AAAI’99/IAAI’99, Orlando. Menlo Park, pp 6–11
Chakrabarti D, Kumar R, Punera K (2008) A graph-theoretic approach to webpage segmentation. In: WWW 2008, Beijing, pp 377–386
Chao H, Fan J (2004) Layout and content extraction for pdf documents. In: Marinai S, Dengel A (eds) Document analysis systems VI. Lecture notes in computer science, vol 3163. Springer, New York/Berlin, pp 13–224
Chen JS, Tseng DC (1996) Overlapped-character separation and construction for table-form documents. In: IEEE international conference on image processing (ICIP), Lausanne, pp 233–236
Chen Y, Xie X, Ma W-Y, Zhang H-J (2005) Adapting web pages for small-screen devices. Internet Computing, 9(1):50–56
Cohn AG (1997) Qualitative spatial representation and reasoning techniques, vol 1303. Springer, Berlin, pp 1–30
Duygulu P, Atalay V (2002) A hierarchical representation of form documents for identification and retrieval. IJDAR 5(1):17–27
Embley DW, Campbell DM, Jiang YS, Liddle SW, Lonsdale DW, Ng Y-K, Smith RD (1999) Conceptual-model-based data extraction from multiple-record web pages. Data Knowl Eng 31(3):227–251
Finn A, Kushmerick N, Smyth B (2001) Fact or fiction: content classification for digital libraries. In: Joint DELOS-NSF workshop on personalisation and recommender systems in digital libraries, Dublin, p 1
Futrelle RP, Shao M, Cieslik C, Grimes AE (2003) Extraction, layout analysis and classification of diagrams in PDF documents. In: International conference on document analysis and recognition (ICDAR) 2003, proceedings, Edinburgh, vol 2, p 1007
Gatterbauer W, Bohunsky P (2006) Table extraction using spatial reasoning on the CSS2 visual box model. In: Proceedings of the 21st national conference on artificial intelligence (AAAI), Boston, vol 2, pp 1313–1318
Gupta N, Hilal S Dr (2011) A heuristic approach for web content extraction. Int J Comput Appl 15(5):20–24
Hadjar K, Rigamonti M, Lalanne D, Ingold R (2004) Xed: a new tool for extracting hidden structures from electronic documents. In: Document image analysis for libraries, Palo Alto, pp 212–224
Hassan T (2009) Object-level document analysis of PDF files. In: Proceedings of the 9th ACM symposium on document engineering (DocEng’09), Munich. ACM, New York, pp 47–55
Hassan T (2009) User-guided wrapping of PDF documents using graph matching techniques. In: International conference on document analysis and recognition – ICDAR, Barcelona, pp 631–635
Hurst M (2001) Layout and language: challenges for table understanding on the web. In: Proceedings of the 1st international workshop on web document analysis, Seattle
Jain AK, Yu B (1998) Automatic text location in images and video frames. Pattern Recognit 31(12):2055–2076
Karatzas D (2002) Text segmentation in web images using colour perception and topological features. PhD Thesis, University of Liverpool
Karatzas D, Anotnacopoulos A (2007) Colour text segmentation in web images based on human perception. Image Vis Comput 25(5):564–577
Kong J, Zhang K, Zeng X (2006) Spatial graph grammars for graphical user interfaces. CHI 13:268–307
Krupl B, Herzog M, Gatterbauer W (2005) Using visual cues for extraction of tabular data from arbitrary HTML documents. In: Proceedings of the 14th international conference on world wide web (WWW), Chiba
Kushmerick N (2000) Wrapper induction: efficiency and expressiveness. Artif Intell Spec Issue Intell Internet Syst 118(1–2):15–68
Laender AHF, Ribeiro-Neto BA, da Silva AS, Teixeira JS (2002) A brief survey of web data extraction tools. ACM SIGMOD Rec Homepage Arch 31(2):84–93
Laender AHF, Ribeiro-Neto B, da Silva AS (2002) DEByE – date extraction by example. Data Knowl Eng 40(2):121–154
Lien Y-LL (1989) Apparatus and method for vectorization of incoming scanned image data. United States Patent US4,817,187, assigned to GTX Corporation, Phoenix, Arizona, 28 Mar 1989
Liu Y, Bai K, Mitra P, Lee Giles C (2007) TableSeer: automatic table metadata extraction and searching in digital libraries. In: ACM/IEEE joint conference on digital libraries, Vancouver, pp 91–100
Lopresti D, Zhou J (2000) Locating and recognizing text in WWW images. Inf Retr 2(2/3):177–206
Lovegrove W, Brailsford D (1995) Document analysis of PDF files: methods, results and implications. Electron Publ Orig Dissem Des 8(3):207–220
Luo P, Fan J, Liu S, Lin F, Xiong Y, Liu J (2009) Web article extraction for web printing: a DOM+visual based approach. In: Proceedings of the DocEng, Munich. ACM, pp 66–69
Marinai S (2009) Metadata extraction from PDF papers for digital library ingest. In: Proceedings of the 10th international conference on document analysis and recognition (ICDAR), Barcelona, pp 251–255
McKeown KR, Barzilay R, Evans D, Hatzivassiloglou V, Kan MY, Schiffman B, Teufel S (2001) Columbia multi-document summarization: approach and evaluation. In: Document understanding conference, New Orleans
Okun O, Doermann D, Pietikainen M (1999) Page segmentation and zone classification: the state of the art. Technical report: LAMP-TR-036/CAR-TR-927/CS-TR-4079, University of Maryland, College Park, Nov 1999
Oro E, Ruffolo M (2009) PDF-TREX: an approach for recognizing and extracting tables from PDF documents. In: ICDAR’09 proceedings of the 2009 10th international conference on document analysis and recognition, Barcelona, pp 906–910
Petrie H, Harrison C, Dev S (2005) Describing images on the web: a survey of current practice and prospects for the future. In: Proceedings of human computer interaction international (HCII), Las Vegas, July 2005
Smith PN, Brailsford DF (1995) Towards structured, block-based PDF. Electron Publ Orig Dissem Des 8(2–3):153–165
Soderland S, Cardie C, Mooney R (1999) Learning information extraction rules for semi-structured and free text. Mach Learn Spec Issue Nat Lang Learn 34(1–3):233–272
Wang Y, Hu J (2002) Detecting tables in HTML documents. In: Fifth IAPR international workshop on document analysis systems, Princeton, Aug 2002. Lecture notes in computer science, vol 2423, pp 249–260
Wang Y, Phillips IT, Haralick RM (2000) Statistical-based approach to word segmentation, In: 15th international conference on pattern recognition, ICPR2000, vol 4. Barcelona, Spain, pp 555–558
Wasserman HC, Yukawa K, Sy BK, Kwok K-L, Phillips IT (2002) A theoretical foundation and a method for document table structure extraction and decomposition. In: Lopresti DP, Hu J, Kashi R (eds) Document analysis systems. Lecture notes in computer science, vol 2423. Springer, Berlin/New York, pp 29–294
Wyszecki G, Stiles W (1982) Color science: concepts and methods, quantitative data and formulae, 2nd edn. Wiley, New York
Yildiz B, Kaiser K, Miksch S (2005) pdf2table: a method to extract table information from PDF files. In: Proceedings of the 2nd Indian international conference on artificial intelligence (IICAI05), Pune, pp 1773–1785
Zanibbi R, Blostein D, Cordy JR (2004) A survey of table recognition: models, observations, transformations, and inferences. Int J Doc Anal Recognit 7(1):1–16
Zhu J, Nie Z, Wen J-R, Zhang B, Ma W-Y (2005) 2D conditional random fields for web information extraction. In: Proceedings of the ICML’05, Bonn. ACM, pp 1044–1051
Further Reading
Web Document Analytics, Apostolos Antonacopoulos and Jianying Hu (Editors), World Scientific, 2004.
PDF Explained, John Whitington, O’Reilly Media, 2011.
Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Bing Liu, Springer, 2011.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer-Verlag London
About this entry
Cite this entry
Hu, J., Liu, Y. (2014). Analysis of Documents Born Digital. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_26
Download citation
DOI: https://doi.org/10.1007/978-0-85729-859-1_26
Published:
Publisher Name: Springer, London
Print ISBN: 978-0-85729-858-4
Online ISBN: 978-0-85729-859-1
eBook Packages: Computer ScienceReference Module Computer Science and Engineering