Abstract
A broad class of data, ranging from similarity networks and workflow networks to protein networks, can be modeled as graphs with data values as vertex labels. Both vertex labels and neighbors can be dirty for various reasons, such as typos or erroneous reporting of results in scientific experiments. Neighborhood constraints, which specify the label pairs that are allowed to appear on adjacent vertices in the graph, are employed to detect and repair erroneous vertex labels and neighbors. In this paper, we study the problem of repairing vertex labels and neighbors so that graphs satisfy neighborhood constraints. Unfortunately, the problem is generally hard, which motivates us to devise approximation methods for repairing and to identify interesting special cases (star and clique constraints) that can be solved efficiently. First, we propose several approximation algorithms for label repairing, including greedy heuristics, a contraction method, and an approach combining both. The performance of the algorithms is also analyzed for the special case. Moreover, we devise a cubic-time, constant-factor graph repairing algorithm that performs both label and neighbor repairs (given degree-bounded instance graphs). Our extensive experimental evaluation on real data demonstrates the effectiveness of the proposed methods in eliminating frauds in several types of application networks.
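To make the notion of a neighborhood constraint concrete, the following minimal Python sketch checks which edges of a labeled graph violate a set of allowed label pairs. It illustrates violation detection only and is not one of the repairing algorithms proposed in the paper. The function name `find_violations`, the data layout, and the workflow labels are hypothetical, and the sketch assumes that the constraint is symmetric and that only explicitly listed pairs are allowed.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def find_violations(
    labels: Dict[Hashable, str],
    edges: Iterable[Edge],
    allowed_pairs: Set[Tuple[str, str]],
) -> List[Edge]:
    """Return the edges whose endpoint labels are not an allowed pair."""
    # Assumption: the constraint is undirected, so (a, b) implies (b, a).
    allowed = set(allowed_pairs) | {(b, a) for a, b in allowed_pairs}
    # Assumption: only listed pairs are allowed; identical labels on
    # adjacent vertices must be listed explicitly, e.g. ("review", "review").
    return [(u, v) for u, v in edges if (labels[u], labels[v]) not in allowed]


# Hypothetical workflow network: vertex -> label.
labels = {1: "submit", 2: "review", 3: "approve"}
edges = [(1, 2), (2, 3), (1, 3)]
allowed_pairs = {("submit", "review"), ("review", "approve")}

violations = find_violations(labels, edges, allowed_pairs)
print(violations)  # the edge (1, 3) pairs "submit" with "approve"
```

An edge reported here could be resolved either by changing one endpoint's label or by changing the neighbor relation, which corresponds to the two kinds of repair considered in the abstract.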
Notes
For example, edit distance (see [25] for a survey of string similarity).
References
Arenas, M., Bertossi, L.E., Chomicki, J.: Consistent query answers in inconsistent databases. In: PODS, pp. 68–79 (1999)
Bhattacharya, I., Getoor, L.: Entity Resolution in Graph Data. University of Maryland technical report CS-TR-4758 (2005)
Bohannon, P., Flaster, M., Fan, W., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD Conference, pp. 143–154 (2005)
Boobna, U., de Rougemont, M.: Correctors for XML data. In: XSym, pp. 97–111 (2004)
Cheng, J., Ke, Y., Fu, A.W.-C., Yu, J.X., Zhu, L.: Finding maximal cliques in massive networks by h*-graph. In: SIGMOD Conference, pp. 447–458 (2010)
Cheng, J., Ke, Y., Ng, W., Lu, A.: Fg-index: towards verification-free query processing on graph databases. In: SIGMOD Conference, pp. 857–872 (2007)
Cheng, J., Yu, J.X., Ding, B., Yu, P.S., Wang, H.: Fast graph pattern matching. In: ICDE, pp. 913–922 (2008)
Chomicki, J., Marcinkowski, J.: Minimal-change integrity maintenance using tuple deletions. Inf. Comput. 197(1–2), 90–121 (2005)
Conesa, A., Götz, S., García-Gómez, J.M., Terol, J., Talón, M., Robles, M.: Blast2go: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21(18), 3674–3676 (2005)
Dinur, I., Safra, S.: The importance of being biased. In: STOC, pp. 33–42 (2002)
Fan, W., Fan, Z., Tian, C., Dong, X.L.: Keys for graphs. PVLDB 8(12), 1590–1601 (2015)
Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. PVLDB 2(1), 407–418 (2009)
Fan, W., Li, J., Luo, J., Tan, Z., Wang, X., Wu, Y.: Incremental graph pattern matching. In: SIGMOD Conference, pp. 925–936 (2011)
Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractable to polynomial time. PVLDB 3(1), 264–275 (2010)
Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. PVLDB 3(1), 173–184 (2010)
Flesca, S., Furfaro, F., Parisi, F.: Querying and repairing inconsistent numerical databases. ACM Trans. Database Syst. 35(2), 14 (2010)
Gilchrist, M.A., Salter, L.A., Wagner, A.: A statistical framework for combining and interpreting proteomic datasets. Bioinformatics 20(5), 689–700 (2004)
Hassanzadeh, O., Chiang, F., Miller, R.J., Lee, H.C.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1), 1282–1293 (2009)
Isele, R., Bizer, C.: Learning expressive linkage rules using genetic programming. PVLDB 5(11), 1638–1649 (2012)
Jin, C., Bhowmick, S.S., Xiao, X., Cheng, J., Choi, B.: Gblender: towards blending visual query formulation and query processing in graph databases. In: SIGMOD Conference, pp. 111–122 (2010)
Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum Press, Berlin (1972)
Kolahi, S., Lakshmanan, L.V.S.: On approximating optimum repairs for functional dependency violations. In: ICDT, pp. 53–62 (2009)
Koudas, N., Saha, A., Srivastava, D., Venkatasubramanian, S.: Metric functional dependencies. In: ICDE, pp. 1275–1278 (2009)
Minton, S., Johnston, M.D., Philips, A.B., Laird, P.: Solving large-scale constraint-satisfaction and scheduling problems using a heuristic repair method. In: AAAI, pp. 17–24 (1990)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Song, S., Cao, Y., Wang, J.: Cleaning timestamps with temporal constraints. PVLDB 9(10), 708–719 (2016)
Song, S., Chen, L.: Differential dependencies: reasoning and discovery. ACM Trans. Database Syst. 36(3), 16 (2011)
Song, S., Chen, L., Cheng, H.: Parameter-free determination of distance thresholds for metric distance constraints. In: ICDE, pp. 846–857 (2012)
Song, S., Chen, L., Cheng, H.: Efficient determination of distance thresholds for differential dependencies. IEEE Trans. Knowl. Data Eng. 26(9), 2179–2192 (2014)
Song, S., Cheng, H., Yu, J.X., Chen, L.: Repairing vertex labels under neighborhood constraints. PVLDB 7(11), 987–998 (2014)
Suzuki, N.: Finding an optimum edit script between an XML document and a DTD. In: SAC, pp. 647–653 (2005)
van Rijsbergen, C.J.: Information Retrieval. Butterworth, Oxford (1979)
Vazirani, V.V.: Approximation Algorithms. Springer, Berlin (2001)
Wang, J., Song, S., Lin, X., Zhu, X., Pei, J.: Cleaning structured event logs: a graph repair approach. In: ICDE, pp. 30–41 (2015)
Wijsen, J.: Database repairing using updates. ACM Trans. Database Syst. 30(3), 722–768 (2005)
Zhang, A., Song, S., Wang, J.: Sequential data cleaning: a statistical approach. In: SIGMOD Conference, pp. 909–924 (2016)
Zhang, B., Park, B.-H., Karpinets, T.V., Samatova, N.F.: From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics 24(7), 979–986 (2008)
Zhu, X., Song, S., Lian, X., Wang, J., Zou, L.: Matching heterogeneous event data. In: SIGMOD Conference, pp. 1211–1222 (2014)
Zhu, X., Song, S., Wang, J., Yu, P.S., Sun, J.: Matching heterogeneous events with patterns. In: ICDE, pp. 376–387 (2014)
Acknowledgements
This work is supported in part by the National Key Research Program of China under Grant 2016YFB1001101; the China NSFC under Grants 61572272, 61325008, 61370055, 61672313 and 61202008; and the Tsinghua University Initiative Scientific Research Program.
Cite this article
Song, S., Liu, B., Cheng, H. et al. Graph repairing under neighborhood constraints. The VLDB Journal 26, 611–635 (2017). https://doi.org/10.1007/s00778-017-0466-5
DOI: https://doi.org/10.1007/s00778-017-0466-5