A Study on Corpus Content Display and IP Protection

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 902)

Abstract

Corpora play an important role in most research fields, especially natural language processing. Some research demos display detailed corpus content to highlight their contributions while overlooking the security of the corpus. In this paper, we explore the content leakage that results from such content display by means of a crawler. A website that displays a corpus is selected and crawled with a simple crawler algorithm combined with several strategies we present. It is estimated that over 85% of the corpus can be downloaded, which poses a substantial threat to its IP rights. Finally, we discuss protection measures for content display and give some suggestions for protecting information content through both technology and law.
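To make the leakage mechanism concrete, the following is a minimal sketch of how a query-driven crawler can harvest a corpus through a display interface. It is not the crawler or the strategies used in the paper, which the abstract does not describe. The toy `CORPUS`, the `search` function standing in for a demo website, the per-query `limit`, and the breadth-first query expansion are all assumptions made for illustration.

```python
from collections import deque

# Toy stand-in for a corpus hidden behind a search-and-display demo page.
# (Hypothetical data, invented for this illustration.)
CORPUS = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "the dog barked at a stranger",
    "stranger things happen at night",
    "quantum entanglement puzzles physicists",
]


def search(query, limit=2):
    """Hypothetical display interface: return at most `limit` sentences
    that contain the word `query`."""
    return [s for s in CORPUS if query in s.split()][:limit]


def crawl(seed_words, limit=2):
    """Breadth-first query expansion: every word of a newly harvested
    sentence becomes a new query (an assumed strategy, not the paper's)."""
    queue = deque(seed_words)
    seen_queries = set(seed_words)
    harvested = set()
    while queue:
        word = queue.popleft()
        for sentence in search(word, limit):
            if sentence in harvested:
                continue
            harvested.add(sentence)
            for w in sentence.split():
                if w not in seen_queries:
                    seen_queries.add(w)
                    queue.append(w)
    return harvested


harvested = crawl(["cat"])
# The true corpus size is known in this toy; a real attacker would have
# to estimate it.
coverage = len(harvested) / len(CORPUS)
print(f"recovered {len(harvested)} of {len(CORPUS)} sentences ({coverage:.0%})")
```

In this toy setting a single seed word is enough to recover every sentence that shares vocabulary with the seed, directly or through other sentences, even though each query shows only two results. Only the sentence with entirely disjoint vocabulary stays hidden.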

Acknowledgments

This work is funded by the National Natural Science Foundation of China (No. 2017YFB1002102) and the National Key Research and Development Program of China (No. 91520204).

Corresponding author

Correspondence to Muyun Yang.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ma, J., Yang, M., Wang, H., Zhu, C., Xu, B. (2018). A Study on Corpus Content Display and IP Protection. In: Zhou, Q., Miao, Q., Wang, H., Xie, W., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 902. Springer, Singapore. https://doi.org/10.1007/978-981-13-2206-8_10

  • DOI: https://doi.org/10.1007/978-981-13-2206-8_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2205-1

  • Online ISBN: 978-981-13-2206-8

  • eBook Packages: Computer Science, Computer Science (R0)
