Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python

  • Conference paper
Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications (FDSE 2020)

Abstract

Py_ape is a Python package that integrates a number of string and text processing algorithms for collecting, extracting, and cleaning text data from websites; creating frames for text corpora; matching entities; and matching, mapping, and merging two schemas. Built on several popular Python libraries, its functions guide the user step by step through data integration and data preparation. In particular, in the entity matching function of the schema matching and merging phase, we use the Hamming distance algorithm to identify similar string pairs and the longest common substring similarity algorithm to map data between the columns of the schemas. These algorithms help to increase the accuracy of the schema matching process. In addition, we present experimental results from using Py_ape to scrape, clean, match, and merge two data sets related to aviation crashes, taken from two different sources, Kaggle and Wikipedia. The results of the experiment are evaluated in detail in the rest of the paper.
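
The two string measures named in the abstract can be illustrated with a minimal, standard-library sketch. This is not Py_ape's own code or API: the function names, the handling of unequal-length strings in the Hamming distance, the normalization by the longer string, the 0.5 threshold, and the column names are all assumptions chosen for illustration.

```python
from typing import Dict, List


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two strings differ.

    The classic Hamming distance is defined only for equal-length strings.
    ASSUMPTION: for unequal lengths, every extra trailing character of the
    longer string counts as one mismatch.
    """
    shorter, longer = sorted((a, b), key=len)
    mismatches = sum(x != y for x, y in zip(shorter, longer))
    return mismatches + (len(longer) - len(shorter))


def hamming_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - hamming_distance(a, b) / longest


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match inside a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_similarity(a: str, b: str) -> float:
    """Length of the longest common substring over the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return len(longest_common_substring(a, b)) / longest


def match_columns(left: List[str], right: List[str],
                  threshold: float = 0.5) -> Dict[str, str]:
    """Map each left column to its most similar right column (hypothetical helper)."""
    mapping: Dict[str, str] = {}
    for name in left:
        best = max(right, key=lambda r: lcs_similarity(name.lower(), r.lower()))
        if lcs_similarity(name.lower(), best.lower()) >= threshold:
            mapping[name] = best
    return mapping


# Hypothetical column names, not taken from the paper's data sets.
columns_a = ["Date", "Location", "Operator"]
columns_b = ["date", "crash_location", "operator_name"]
mapping = match_columns(columns_a, columns_b)
```

In this sketch the Hamming-based score would compare candidate value pairs, while the substring-based score aligns column names; a left column whose best score falls below the threshold is simply left unmapped.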


Notes

  1. https://pypi.org/project/py-ape.

  2. https://www.biggorilla.org.

  3. https://scrapy.org/.

  4. https://github.com/biggorilla-gh/usagi.

  5. http://pandas.pydata.org/.

  6. https://github.com/biggorilla-gh/koko.

  7. https://github.com/gpoulter/pydedupe.

  8. https://sourceforge.net/projects/febrl/.

  9. https://sites.google.com/site/anhaidgroup/projects/magellan.

  10. https://github.com/davidfoerster/schema-matching.

  11. https://github.com/bkj/wit.

  12. https://airflow.apache.org/.

  13. https://github.com/spotify/luigi.

  14. https://numpy.org/.

  15. https://www.scipy.org/.

  16. https://matplotlib.org/.

  17. https://docs.python.org/3/library/csv.html.

  18. https://pypi.org/project/beautifulsoup4/.

  19. https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227.

  20. https://www.wikipedia.org.

  21. https://en.wikipedia.org/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft.

  22. http://www.baaa-acro.com.

  23. http://www.airdisaster.com/cgi-bin/database.cgi.

  24. https://www.jacdec.de.

  25. http://www.planecrashinfo.com.

  26. https://www.kaggle.com.

  27. https://www.kaggle.com/imdevskp/air-passengers-and-departures-data-from-19702018.

  28. https://en.wikipedia.org/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft.


Author information

Correspondence to Bich-Ngan T. Nguyen or Vu Thanh Nguyen.


Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Nguyen, BN.T., Phạm, P.N.H., Nguyen, V.T., Viet, P.Q., Tuan, L.D., Snasel, V. (2020). Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python. In: Dang, T.K., Küng, J., Takizawa, M., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2020. Communications in Computer and Information Science, vol 1306. Springer, Singapore. https://doi.org/10.1007/978-981-33-4370-2_6


  • DOI: https://doi.org/10.1007/978-981-33-4370-2_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-4369-6

  • Online ISBN: 978-981-33-4370-2

  • eBook Packages: Computer Science, Computer Science (R0)
