Creating Hardware Component Knowledge Bases with Training Data Generation and Multi-task Learning

Authors:

Nicholas Chiang,

Christopher Ré,

Philip Levis

ACM Transactions on Embedded Computing Systems (TECS), Volume 19, Issue 6

Article No.: 42, Pages 1 - 26

https://doi.org/10.1145/3391906

Published: 29 September 2020

Abstract

Hardware component databases are vital resources in designing embedded systems. Since creating these databases requires hundreds of thousands of hours of manual data entry, they are proprietary, limited in the data they provide, and have random data entry errors.

We present a machine learning based approach for creating hardware component databases directly from datasheets. Extracting data directly from datasheets is challenging because: (1) the data is relational in nature and relies on non-local context, (2) the documents are filled with technical jargon, and (3) the datasheets are PDFs, a format that decouples visual locality from locality in the document. Addressing this complexity has traditionally relied on human input, making it costly to scale. Our approach uses a rich data model, weak supervision, data augmentation, and multi-task learning to create these knowledge bases in a matter of days.

We evaluate the approach on datasheets of three types of components and achieve an average quality of 77 F1 points—quality comparable to existing human-curated knowledge bases. We perform application studies that demonstrate the extraction of multiple data modalities including numerical properties and images. We show how different sources of supervision such as heuristics and human labels have distinct advantages that can be utilized together to improve knowledge base quality. Finally, we present a case study to show how this approach changes the way practitioners create hardware component knowledge bases.
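
As an illustration of how heuristic supervision and F1 scoring fit together, the following is a minimal sketch and not the authors' implementation. It combines a few hand-written labeling heuristics by simple majority vote (a stand-in for whatever label-combination method the paper actually uses) and scores the result against hand labels. The candidate fields, the heuristics, the numeric bounds, and the sample data are all hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a heuristic may decline to vote on a candidate


def lf_has_unit(candidate):
    """Vote True when the value carries a voltage unit (hypothetical heuristic)."""
    return True if candidate["unit"] in {"V", "mV"} else ABSTAIN


def lf_row_header(candidate):
    """Vote based on the header of the table row the value came from."""
    header = candidate["row_header"].lower()
    if "voltage" in header:
        return True
    if "temperature" in header:
        return False
    return ABSTAIN


def lf_plausible_range(candidate):
    """Vote False for values outside an assumed plausible range."""
    return ABSTAIN if 0 < candidate["value"] <= 100 else False


LABELING_FUNCTIONS = [lf_has_unit, lf_row_header, lf_plausible_range]


def majority_vote(candidate):
    """Combine heuristic votes; ties and all-abstain cases stay unlabeled."""
    votes = [v for v in (lf(candidate) for lf in LABELING_FUNCTIONS)
             if v is not ABSTAIN]
    counts = Counter(votes)
    if counts[True] == counts[False]:
        return ABSTAIN
    return counts[True] > counts[False]


def f1_score(predicted, gold):
    """F1 of positive predictions; unlabeled candidates count as not extracted."""
    tp = sum(1 for p, g in zip(predicted, gold) if p is True and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p is True and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if p is not True and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical candidates: is this number a maximum voltage rating?
CANDIDATES = [
    {"value": 5.0, "unit": "V", "row_header": "Collector-emitter voltage"},
    {"value": 150, "unit": "°C", "row_header": "Junction temperature"},
    {"value": 25, "unit": "mV", "row_header": "Temperature drift"},
    {"value": 60, "unit": "", "row_header": "VCEO"},
]
GOLD = [True, False, False, True]  # hypothetical human labels

PREDICTED = [majority_vote(c) for c in CANDIDATES]
SCORE = f1_score(PREDICTED, GOLD)
```

In this toy data the heuristics agree on the first two candidates, conflict on the third, and all abstain on the fourth, which is where a human label would add information the heuristics cannot.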

Cited By

Inayatulloh Arafah S, Murtani A, Kurniawan R, Ritonga S, Nazly P, Rizki S. (2023). The Effect of Using Mobile Applications, Using Social Media, Using E-Commerce, and Having IT Knowledge on The Performance of SMEs. 2023 International Conference on Information Management and Technology (ICIMTech), 621-626. Online publication date: 24-Aug-2023.
https://doi.org/10.1109/ICIMTech59029.2023.10277907
Li B, Dong W. (2022). Edge-Centric Programming for IoT Applications With Automatic Code Partitioning. IEEE Transactions on Computers 71:10, 2408-2422. Online publication date: 1-Oct-2022.
https://dl.acm.org/doi/10.1109/TC.2021.3129367

Index Terms

Creating Hardware Component Knowledge Bases with Training Data Generation and Multi-task Learning
1. Hardware
  1. Electronic design automation
2. Information systems
  1. Information systems applications
    1. Data mining

Information & Contributors

Information

Published In

ACM Transactions on Embedded Computing Systems Volume 19, Issue 6

Special Issue on LCETES, Part 2, Learning, Distributed, and Optimizing Compilers

November 2020

271 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3427195

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 29 September 2020

Online AM: 07 May 2020

Accepted: 01 March 2020

Revised: 01 March 2020

Received: 01 October 2019

Published in TECS Volume 19, Issue 6

Qualifiers

Research-article
Research
Refereed

Funding Sources

Intel/NSF CPS Security
Okawa Foundation; American Family Insurance; Google Cloud; Swiss Re
Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys
NIH
Stanford Secure Internet of Things Project, and the Stanford System X Alliance
FA86501827865
ONR
NSF
Moore Foundation; NXP; Xilinx; LETICEA; Intel; IBM; Microsoft; NEC; Toshiba; TSMC; ARM; Hitachi; BASF; Accenture; Ericsson; Qualcomm; Analog Devices
HAI-AWS Cloud Credits for Research program, and members of the Stanford DAWN

Bibliometrics

Article Metrics

Total Citations: 2
Total Downloads: 672
Downloads (Last 12 months): 273
Downloads (Last 6 weeks): 28

Reflects downloads up to 23 Jan 2025
