Abstract
Uncertain graphs are an important data model for many real-world applications. In an uncertain graph, each edge is associated with an existential probability that represents the likelihood that the edge exists. Almost all work in this area focuses on improving the efficiency of query processing. However, another issue deserves attention: query results on uncertain graphs are sometimes uninformative because of edge uncertainty. We adopt a crowdsourcing-based approach to make query results more informative. To save the monetary and time cost of crowdsourcing, we should select the optimal edges to clean so as to maximize the quality improvement. The noise in crowdsourcing results, however, makes the problem more complex. We prove that the problem is #P-hard and propose an efficient algorithm to derive the optimal edge. Our experimental results show that the proposed algorithm outperforms random selection by up to 22 times in quality improvement and is up to 5 times faster in elapsed time than comparing every edge, which indicates that the algorithm is both effective and efficient.
References
Aggarwal, C.C.: Managing and mining uncertain data. Springer, US (2009)
Ball, M.O.: Computational complexity of network reliability analysis: an overview. IEEE Trans. Reliab. 35(3), 230–239 (1986)
Brabham, D.C.: Crowdsourcing as a model for problem solving: an introduction and cases. Convergence: The International Journal of Research into New Media Technologies 14(1), 75–90 (2008)
Chen, M., Gu, Y., Bao, Y., Yu, G.: Label and distance-constraint reachability queries in uncertain graphs. In: Database Systems for Advanced Applications, pp. 188–202. Springer International Publishing, Cham (2014)
Cheng, J., Huang, S., Wu, H., Fu, W.C.: TF-Label: a topological-folding labeling scheme for reachability querying in a large graph. In: ACM SIGMOD International Conference on Management of Data, pp. 193–204 (2013)
Cheng, R.: Querying and cleaning uncertain data. Springer, Berlin (2009)
Cheng, R., Chen, J., Xie, X.: Cleaning uncertain data with quality guarantees. Proceedings of the VLDB Endowment 1(1), 722–735 (2008)
Doan, A.H., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the world-wide Web. Commun. ACM 54(4), 86–96 (2011)
Fishman, G.S.: A comparison of four Monte Carlo methods for estimating the probability of s-t connectedness. IEEE Trans. Reliab. 35(2), 145–155 (1986)
Jin, R., Hong, H., Wang, H., Ning, R., Xiang, Y.: Computing label-constraint reachability in graph databases. In: ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June, pp. 123–134 (2010)
Jin, R., Liu, L., Ding, B., Wang, H.: Distance-constraint reachability computation in uncertain graphs. Very Large Data Bases 4(9), 551–562 (2011)
Jin, R., Liu, L., Ding, B., Wang, H.: Distance-constraint reachability computation in uncertain graphs. Proceedings of the VLDB Endowment 4(9), 551–562 (2011)
Karp, R.M., Luby, M.G.: A new monte-carlo method for estimating the failure probability of an (1983)
Khan, A., Chen, L.: On uncertain graphs modeling and queries. VLDB Endowment (2015)
Krogan, N.J., Cagney, G., Yu, H., Zhong, G., Guo, X., Ignatchenko, A., Li, J., Pu, S., Datta, N., Tikuisis, A.P.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084), 637–643 (2006)
Lin, X., Xu, J., Hu, H.: Range-based skyline queries in mobile environments. IEEE Trans. Knowl. Data Eng. 25(4), 835–849 (2013)
Lin, X., Peng, Y., Choi, B., Xu, J.: Human-powered data cleaning for probabilistic reachability queries on uncertain graphs. IEEE Trans. Knowl. Data Eng. 29(7), 1452–1465 (2017)
Marcus, A., Wu, E., Karger, D., Madden, S., Miller, R.: Human-powered sorts and joins. Proceedings of the VLDB Endowment 5(1), 13–24 (2011)
Mo, L., Cheng, R., Li, X., Cheung, D.W.: Cleaning uncertain data for top-k queries. In: IEEE International Conference on Data Engineering, pp. 134–145 (2013)
Niedermayer, J., Emrich, T., Renz, M., Mamoulis, N., Chen, L., Kriegel, H.P.: Probabilistic nearest neighbor queries on uncertain moving object trajectories. Proceedings of the VLDB Endowment 7(3), 205–216 (2013)
Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive skyline computation in database systems. ACM Trans. Database Syst. 30(1), 41–82 (2005)
Jin, R., Liu, L., Ding, B., Wang, H.: Distance-constraint reachability computation in uncertain graphs. PVLDB 4(9) (2011)
Solecki, B., Solecki, B., Solecki, B.: KDD Cup 2013 - author-paper identification challenge: second place team. In: KDD Cup 2013 Workshop, pp. 3 (2013)
Soliman, M.A., Ilyas, I.F., Chang, C.C.: Top-k query processing in uncertain databases. In: IEEE International Conference on Data Engineering, pp. 896–905 (2007)
Tao, Y., Xiao, X., Pei, J.: Efficient skyline and top-k retrieval in subspaces. IEEE Trans. Knowl. Data Eng. 19(8), 1072–1088 (2007)
Tong, Y., Chen, L., Cheng, Y., Yu, P.S.: Mining frequent itemsets over uncertain databases. Proceedings of the VLDB Endowment 5(11), 1650–1661 (2012)
Tong, Y., Chen, L., Ding, B.: Discovering threshold-based frequent closed itemsets over probabilistic data. In: IEEE International Conference on Data Engineering, pp. 270–281 (2012)
Tong, Y., Cao, C.C., Zhang, C.J., Li, Y.: Crowdcleaner: Data cleaning for multi-version data on the Web via crowdsourcing. In: IEEE International Conference on Data Engineering, pp. 1182–1185 (2014)
Verroios, V., Garcia-Molina, H.: Entity resolution with crowd errors. In: IEEE International Conference on Data Engineering, pp. 219–230 (2015)
Wang, J., Li, G., Kraska, T., Franklin, M.J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: ACM SIGMOD International Conference on Management of Data, pp. 229–240 (2013)
Widom, J., Agrawal, A.P., Benjelloun, O., Ch, A., Chaumond, J., Murthy, R., Mutsuzaki, M., Sugihara, T., Theobald, M.: Chapter 5 trio: A system for data, uncertainty, and lineage (2013)
Xu, K., Zou, L., Yu, J.X., Chen, L., Xiao, Y., Zhao, D.: Answering label-constraint reachability in large graphs. In: ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October, pp. 1595–1600 (2011)
Zhang, C.J., Chen, L., Jagadish, H.V., Cao, C.C.: Reducing uncertainty of schema matching via crowdsourcing. Proceedings of the VLDB Endowment 6(9), 757–768 (2013)
Zhang, C.J., Chen, L., Tong, Y., Liu, Z.: Cleaning uncertain data with a noisy crowd. In: IEEE International Conference on Data Engineering, pp. 6–17 (2015)
Acknowledgments
This research is funded by NSFC (No. 61773167) and the Natural Science Foundation of Shanghai (No. 17ZR1444900).
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Special Issue on Web and Big Data
Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao
Appendix
1.1 New Reachability
The reachability when the crowd returns ‘yes’ is
and the reachability when the crowd returns ‘no’ is
where \({p_{e}^{y}}\) is the new probability of edge e if the crowd’s answer is ‘yes’, and \({p_{e}^{n}}\) is defined analogously for the answer ‘no’.
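The formulas for \({p_{e}^{y}}\) and \({p_{e}^{n}}\) are not reproduced in this appendix. Purely as an illustration, the sketch below assumes a simple noisy-crowd model in which the crowd reports the true state of an edge with a single accuracy `acc`, and it updates the edge probability with Bayes' rule. The function name and the `acc` parameter are hypothetical and may differ from the model used in the main text.

```python
def posterior_edge_probability(p_e: float, acc: float, answer_yes: bool) -> float:
    """Bayes update of an edge's existential probability after one crowd answer.

    Assumption (not taken from the source): the crowd reports the true state
    of the edge with probability `acc`, independently of everything else.
    """
    if not 0.0 < p_e < 1.0:
        raise ValueError("p_e must lie strictly between 0 and 1")
    if not 0.0 <= acc <= 1.0:
        raise ValueError("acc must lie in [0, 1]")
    if answer_yes:
        # Likelihood of a 'yes' answer when the edge exists / does not exist.
        like_exists, like_absent = acc, 1.0 - acc
    else:
        # Likelihood of a 'no' answer when the edge exists / does not exist.
        like_exists, like_absent = 1.0 - acc, acc
    evidence = like_exists * p_e + like_absent * (1.0 - p_e)
    return like_exists * p_e / evidence


p_yes = posterior_edge_probability(0.5, 0.8, answer_yes=True)   # stand-in for p_e^y
p_no = posterior_edge_probability(0.5, 0.8, answer_yes=False)   # stand-in for p_e^n
```

Under this assumed model, a perfectly accurate crowd (`acc = 1`) drives the probability to 1 or 0, and an uninformative crowd (`acc = 0.5`) leaves it unchanged.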
Proof
First, we divide G into two parts: graphs containing e (\(G_{e}\)) and graphs without e (\(G_{\overline {e}}\)). For every \(pg \in PG_{G_{e}}\), there is a corresponding \(pg^{\prime } \in PG_{G_{\overline {e}}}\) whose edge set \(E_{pg^{\prime }} \subseteq E\) is the same as \(E_{pg} \subseteq E\) except that \(E_{pg^{\prime }}\) does not contain e. Then, we have
By (2), \(R_{G,q}^{0}\) can also be represented as:
Furthermore, \(G_{e}\) can be divided into \(G(P_{e},P_{\overline e})\) and \(G_{e}-G(P_{e},P_{\overline e})\). Obviously, for \(pg \in PG_{G(P_{e}, P_{\overline e})}\), \(r_{q}^{pg}= 1\) and \(r_{q}^{pg^{\prime }}= 0\). For \(pg \in PG_{G_{e}-G(P_{e}, P_{\overline e})}\), since whether e exists or not does not influence \(r_{q}\), we have \(r_{q}^{pg}=r_{q}^{pg^{\prime }}\).
Then, for \(pg \in PG_{G_{e}}\), the new graph probabilities are \(Pr_{e \rightarrow y}(pg) = \frac {{p_{e}^{y}} Pr(pg)}{p_{e}}\) and \(Pr_{e \rightarrow n}(pg) = \frac {{p_{e}^{n}} Pr(pg)}{p_{e}}\), where \(e \rightarrow y\) (\(e \rightarrow n\)) denotes that edge e is cleaned to existence (nonexistence). For the corresponding \(pg^{\prime } \in PG_{G_{\overline e}}\), \(Pr_{e \rightarrow y}(pg^{\prime }) = \frac {(1-{p_{e}^{y}}) Pr(pg)}{p_{e}}\) and \(Pr_{e \rightarrow n}(pg^{\prime }) = \frac {(1-{p_{e}^{n}}) Pr(pg)}{p_{e}}\). Hence, for \(pg \in PG_{G_{e}}\),
From the above analysis, for \(pg \in PG_{G_{e}-G(P_{e}, P_{\overline e})}\) we have
For \(pg \in PG_{G(P_{e}, P_{\overline e})}\), we have
Summing these elements over all possible graphs, \(R_{G,q}^{0}\) equals (29) plus the first line of (28), \({R_{q}^{y}}\) equals (30) plus the second line of (28), and \({R_{q}^{n}}\) equals (31) plus the third line of (28). Therefore,
□
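The displayed equations of this proof are not reproduced above, but its case analysis suggests the relation \(R^{new} = R_{G,q}^{0} + \frac{p^{new}-p_{e}}{p_{e}} M\), where \(p^{new}\) stands for \({p_{e}^{y}}\) or \({p_{e}^{n}}\) and M is the total probability of the possible graphs that contain e and in which the query is satisfied only because e is present. The following sketch checks this relation by brute-force enumeration on a small directed toy graph. The graph, its probabilities, all function names, and the symbol M are assumptions made for illustration; M is not claimed to coincide with the paper's \(P_{q,e}^{*}\).

```python
from itertools import product


def reachable(edges, s, t):
    """Return True if t can be reached from s over the given directed edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    stack, seen = [s], {s}
    while stack:
        node = stack.pop()
        if node == t:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def possible_graphs(probs):
    """Yield (present_edges, probability) for every possible graph."""
    edges = list(probs)
    for mask in product((False, True), repeat=len(edges)):
        pr = 1.0
        for present, edge in zip(mask, edges):
            pr *= probs[edge] if present else 1.0 - probs[edge]
        yield [edge for present, edge in zip(mask, edges) if present], pr


def reachability(probs, s, t):
    """Exact s-t reachability by exhaustive enumeration (exponential time)."""
    return sum(pr for world, pr in possible_graphs(probs) if reachable(world, s, t))


def critical_mass(probs, e, s, t):
    """Mass M of possible graphs that contain e and are s-t connected only with e."""
    total = 0.0
    for world, pr in possible_graphs(probs):
        if e in world and reachable(world, s, t):
            if not reachable([x for x in world if x != e], s, t):
                total += pr
    return total


def predicted_reachability(probs, e, new_p, s, t):
    """R^0 + ((new_p - p_e) / p_e) * M, the relation suggested by the proof."""
    p_e = probs[e]
    return reachability(probs, s, t) + (new_p - p_e) / p_e * critical_mass(probs, e, s, t)


# Toy graph with made-up probabilities; ("s", "a") is the edge being cleaned.
toy = {("s", "a"): 0.6, ("a", "t"): 0.7, ("s", "b"): 0.5, ("b", "t"): 0.4, ("s", "t"): 0.3}
edge = ("s", "a")
for new_p in (0.9, 0.2):  # stand-ins for p_e^y and p_e^n
    direct = reachability({**toy, edge: new_p}, "s", "t")
    print(new_p, direct, predicted_reachability(toy, edge, new_p, "s", "t"))
```

The two printed values on each line should agree up to floating-point error, because reachability is linear in the probability of a single edge when all other edges are held fixed.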
1.2 NECP is #P-hard
Proof
To prove that NECP is #P-hard, we reduce ECP [17] to NECP, since Lin et al. [17] have proven that ECP is #P-hard. ECP corresponds to the special case in which the crowd is perfectly accurate.
In detail, for B to-clean edges, there are \(2^{B}\) possible cleaning results in total: \(CS=\{000...000,000...001,000...011,......,111...111\}\), each element of which is a B-bit sequence where each bit represents the cleaning result of the corresponding edge. For simplicity, we assume that all edges of an uncertain graph G have the same existential probability p.
For ECP, the expected query result quality after cleaning is
where \(Q_{1}\), \(Q_{2}\), \(Q_{3}\), ..., \(Q_{2^{B}}\) are the query result qualities corresponding to the elements of CS.
For NECP, the expected query result quality after cleaning is
Simplifying (5), we have
We can see that \(Pr(C_{r})\) is a linear expression with respect to p. Therefore, being able to compute \(Q^{N}\) implies being able to compute Q, which shows that solving NECP is #P-hard. □
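The enumeration of CS used in the proof can be sketched as follows. The probability formula covers only the ECP setting (a perfectly accurate crowd, so each bit equals the true state of its edge) under the uniform edge probability p assumed in the proof, with edges taken to be independent. The `quality` callback and all function names are hypothetical placeholders, and the noisy-crowd weighting of NECP is not modelled here.

```python
from itertools import product


def cleaning_results(num_edges: int):
    """Return an iterator over every B-bit cleaning result as a tuple of 0/1 flags."""
    return product((0, 1), repeat=num_edges)


def result_probability(bits, p: float) -> float:
    """Probability of one cleaning result under a perfectly accurate crowd.

    Assumes every edge exists independently with the same probability p,
    so a result with k ones has probability p^k * (1 - p)^(B - k).
    """
    k = sum(bits)
    return p ** k * (1.0 - p) ** (len(bits) - k)


def expected_quality(num_edges: int, p: float, quality) -> float:
    """Sum over all cleaning results of Pr(result) * quality(result).

    `quality` is a hypothetical callback mapping a bit tuple to the query
    result quality after that cleaning outcome.
    """
    return sum(result_probability(bits, p) * quality(bits)
               for bits in cleaning_results(num_edges))


# With B = 3 there are 2^3 = 8 cleaning results whose probabilities sum to 1.
total = sum(result_probability(bits, 0.4) for bits in cleaning_results(3))
```

The loop visits \(2^{B}\) results, so the sketch is only practical for very small B.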
1.3 Calculating \(P_{q,e}^{*}\)
Proposition 2: Calculating \(P_{q,e}^{*}\) can be reduced to calculating reachability.
Proof
Similar to calculating the reachability \(R_{q}\) in (3), which in theory requires enumerating all \(sg \in SG_{G,q}\), calculating \(P_{q,e}^{*}\) requires enumerating all \(sg \in SG_{G(P_{e},P_{\overline e}),q}\). Moreover, identifying \(SG_{G(P_{e},P_{\overline e})}\) is infeasible in practice and is even more troublesome than enumerating \(SG_{G,q}\).
In Example 2, we mentioned that the Monte Carlo method can approximate \(Pr(p_{1} \vee p_{2} \vee p_{3} \vee ... \vee p_{n})\) (where \(n=|AP|\)). Similarly, the Monte Carlo method is also applicable to calculating \(P_{q,e}^{*}\). First, we denote the set of all paths passing through (not passing through) edge e by \(AP_{e}\) (\(AP_{\overline e}\)). Then, we have
By (34), we only need to compute two parts separately: \(Pr(\bigvee _{p \in AP_{e}} p)\) and \(Pr(\bigvee _{p \in AP_{\overline e}} p)\), each of which is equivalent to calculating reachability. □
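As a rough illustration of estimating a probability of the form \(Pr(p_{1} \vee ... \vee p_{n})\), the sketch below samples possible graphs and counts how often at least one path has all of its edges present. It uses naive possible-world sampling, not the estimators of Karp and Luby or Fishman cited in the references, and it does not reproduce (34). The function names, the path representation (a list of edges), and the toy probabilities are assumptions.

```python
import random


def estimate_path_union(paths, edge_prob, samples=20000, seed=0):
    """Naive Monte Carlo estimate of Pr(p_1 or ... or p_n).

    Each path is a list of edges; a path exists in a sampled graph when all
    of its edges are present. Edges are sampled independently.
    """
    rng = random.Random(seed)
    edges = sorted({e for path in paths for e in path})
    hits = 0
    for _ in range(samples):
        present = {e for e in edges if rng.random() < edge_prob[e]}
        if any(all(e in present for e in path) for path in paths):
            hits += 1
    return hits / samples


def split_paths(paths, e):
    """Split paths into those passing through edge e and those avoiding it."""
    through = [path for path in paths if e in path]
    avoiding = [path for path in paths if e not in path]
    return through, avoiding


# Hypothetical toy input: two s-t paths, one of which uses the edge ("s", "a").
toy_prob = {("s", "a"): 0.6, ("a", "t"): 0.7, ("s", "t"): 0.3}
toy_paths = [[("s", "a"), ("a", "t")], [("s", "t")]]
ap_e, ap_not_e = split_paths(toy_paths, ("s", "a"))
estimate_e = estimate_path_union(ap_e, toy_prob)          # exact value: 0.6 * 0.7 = 0.42
estimate_not_e = estimate_path_union(ap_not_e, toy_prob)  # exact value: 0.3
```

The standard error of such an estimate shrinks roughly with the square root of `samples`, so tighter accuracy requires many more samples.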
Cite this article
Wu, Y., Lin, X., Yang, Y. et al. Cleaning uncertain graphs via noisy crowdsourcing. World Wide Web 22, 1523–1553 (2019). https://doi.org/10.1007/s11280-018-0624-8
DOI: https://doi.org/10.1007/s11280-018-0624-8