Creating Hardware Component Knowledge Bases with Training Data Generation and Multi-task Learning

Published: 29 September 2020

Abstract

Hardware component databases are vital resources for designing embedded systems. Because creating these databases requires hundreds of thousands of hours of manual data entry, they are proprietary, limited in the data they provide, and prone to random data entry errors.
We present a machine-learning-based approach for creating hardware component databases directly from datasheets. Extracting data directly from datasheets is challenging because (1) the data is relational in nature and relies on non-local context, (2) the documents are filled with technical jargon, and (3) the datasheets are PDFs, a format that decouples visual locality from locality in the document. Addressing this complexity has traditionally relied on human input, making it costly to scale. Our approach uses a rich data model, weak supervision, data augmentation, and multi-task learning to create these knowledge bases in a matter of days.
We evaluate the approach on datasheets of three types of components and achieve an average quality of 77 F1 points, a quality comparable to existing human-curated knowledge bases. We perform application studies that demonstrate the extraction of multiple data modalities, including numerical properties and images. We show how different sources of supervision, such as heuristics and human labels, have distinct advantages that can be combined to improve knowledge base quality. Finally, we present a case study to show how this approach changes the way practitioners create hardware component knowledge bases.
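
For readers unfamiliar with the metric, an F1 score is the harmonic mean of precision and recall, and "77 F1 points" corresponds to 0.77 on a 0-to-1 scale. The sketch below is illustrative only. The abstract does not say how extracted entries are matched against ground truth, so exact matching of (part, attribute, value) tuples is an assumption, and the function name and toy entries are hypothetical.

```python
def f1_score(extracted, gold):
    """Return F1 (harmonic mean of precision and recall) on a 0-to-1 scale.

    Assumption: an extracted entry counts as correct only if it exactly
    matches a gold entry. The paper may use a different matching rule.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)  # share of extracted entries that are correct
    recall = true_positives / len(gold)          # share of gold entries that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy entries: (part, attribute, value). Not taken from any datasheet.
gold_entries = {
    ("PART-A", "attr_1", 10),
    ("PART-A", "attr_2", 20),
    ("PART-B", "attr_1", 30),
    ("PART-B", "attr_2", 40),
}
extracted_entries = {
    ("PART-A", "attr_1", 10),
    ("PART-A", "attr_2", 20),
    ("PART-B", "attr_1", 30),
    ("PART-B", "attr_2", 99),  # wrong value: counts against both precision and recall
}

toy_f1 = f1_score(extracted_entries, gold_entries)
```

In this toy case three of the four extracted entries are correct and three of the four gold entries are found, so precision and recall are both 0.75 and so is their harmonic mean.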

      Published In

      ACM Transactions on Embedded Computing Systems  Volume 19, Issue 6
      Special Issue on LCETES, Part 2, Learning, Distributed, and Optimizing Compilers
      November 2020
      271 pages
      ISSN:1539-9087
      EISSN:1558-3465
      DOI:10.1145/3427195

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 September 2020
      Online AM: 07 May 2020
      Accepted: 01 March 2020
      Revised: 01 March 2020
      Received: 01 October 2019
      Published in TECS Volume 19, Issue 6

      Author Tags

      1. Knowledge base construction
      2. design tools
      3. machine learning

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • Intel/NSF CPS Security
      • Okawa Foundation; American Family Insurance; Google Cloud; Swiss Re
      • Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys
      • NIH
      • Stanford Secure Internet of Things Project, and the Stanford System X Alliance
      • FA86501827865
      • ONR
      • NSF
      • Moore Foundation; NXP; Xilinx; LETICEA; Intel; IBM; Microsoft; NEC; Toshiba; TSMC; ARM; Hitachi; BASF; Accenture; Ericsson; Qualcomm; Analog Devices
      • HAI-AWS Cloud Credits for Research program, and members of the Stanford DAWN
