Abstract
With the growing use of computers and the Internet, it has become difficult for organizations to locate and effectively manage sensitive personally identifiable information (PII). The problem is even more evident in collaborative computing environments. PII may be hidden anywhere within a computer's file system, and in the course of different activities, collaborative or not, it may migrate from computer to computer. This makes meeting organizational privacy requirements all the more complex. Our particular interest is in developing technology that automatically discovers workflows across organizational collaborators that involve private data. Because it is important in this context to understand where and when private data appears, this paper focuses on PII discovery, i.e., automatically identifying private data present in semi-structured and unstructured (free-text) documents. The first part of the process identifies PII via named entity recognition. The second part determines relationships between those entities using a supervised machine learning method. We present test results of our methods on publicly available data generated from different collaborative activities, to provide an assessment of scalability in cooperative computing environments.
National Research Council Paper Number 50386.
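As a rough illustration of the first step described in the abstract (locating candidate PII entities in free text), the following minimal sketch scans a string for e-mail addresses and payment card numbers. It is not the paper's recognizer: the patterns `EMAIL_RE` and `CARD_RE` and the function names `luhn_valid` and `find_pii` are hypothetical, chosen only for this example, and a real named entity recognizer would cover many more entity types. Card candidates are filtered with the standard Luhn checksum to reduce false positives from arbitrary digit runs.

```python
import re

# Hypothetical patterns for illustration only; not the paper's recognizers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str):
    """Return (label, matched_text, offset) tuples ordered by offset."""
    hits = []
    for m in EMAIL_RE.finditer(text):
        hits.append(("EMAIL", m.group(), m.start()))
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        # Keep only digit runs that satisfy the checksum.
        if luhn_valid(digits):
            hits.append(("CARD", m.group(), m.start()))
    return sorted(hits, key=lambda h: h[2])


if __name__ == "__main__":
    sample = ("Contact jane.doe@example.com, card 4111 1111 1111 1111, "
              "ref 1234 5678 9012 3456.")
    for label, value, offset in find_pii(sample):
        print(label, value, offset)
```

Pattern matching of this kind only yields isolated entities. Linking them, for example associating a card number with a named person, is the job of the second, supervised relation extraction step mentioned in the abstract, which this sketch does not attempt.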
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Korba, L. et al. (2008). Private Data Discovery for Privacy Compliance in Collaborative Environments. In: Luo, Y. (eds) Cooperative Design, Visualization, and Engineering. CDVE 2008. Lecture Notes in Computer Science, vol 5220. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88011-0_18
DOI: https://doi.org/10.1007/978-3-540-88011-0_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88010-3
Online ISBN: 978-3-540-88011-0
eBook Packages: Computer Science (R0)