Tuple Reconstruction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10829)

Abstract

The Set of Tuples Expansion system (STEP) extracts information from the Web in the form of tuples. It builds a graph of entities whose nodes are Web pages, wrappers, seeds, domains, and candidates, and whose edges are the relationships between them. The final weight assigned to each node after running random walks on the graph is used to rank the extracted candidates. Because the wrappers are regular expressions, some of the extracted candidates may contain “noise” and can therefore be considered “false”. These false candidates may rank higher than the “true” ones because they are extracted from many Web pages or produced by many different wrappers. Minimizing these false candidates is necessary to ensure the validity of the results presented.
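To make the ranking step concrete, here is a minimal sketch of a PageRank-style random walk on a small entity graph. It is not STEP's implementation: the abstract does not specify the walk, so the function `rank_nodes`, the damping factor, the undirected edges, and the toy node names are all assumptions made for illustration.

```python
def rank_nodes(edges, damping=0.85, iterations=50):
    """Score the nodes of an undirected graph with a PageRank-style walk.

    Assumption: edges are treated as undirected and unweighted; the real
    system may use typed or weighted relationships.
    """
    # Build symmetric adjacency sets so the walk can cross each edge both ways.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    nodes = sorted(neighbours)
    weight = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(iterations):
        # Every node keeps a small baseline from random restarts.
        updated = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node splits its damped weight evenly among its neighbours.
            share = damping * weight[n] / len(neighbours[n])
            for m in neighbours[n]:
                updated[m] += share
        weight = updated
    return weight


# Hypothetical toy graph: two pages, two wrappers, two candidates.
edges = [
    ("page1", "wrapperA"),
    ("page2", "wrapperA"),
    ("page2", "wrapperB"),
    ("wrapperA", "cand:<Bali, Indonesia>"),
    ("wrapperB", "cand:<Bali, Indonesia>"),
    ("wrapperB", "cand:<Bali, menu>"),
]

weights = rank_nodes(edges)
# Order only the candidate nodes by their final weight, highest first.
candidates = sorted(
    (n for n in weights if n.startswith("cand:")),
    key=weights.get,
    reverse=True,
)
```

In this toy graph the candidate reached through both wrappers accumulates more weight than the one reached through a single wrapper. The same mechanism explains the problem described above: a noisy candidate produced by many wrappers would also accumulate weight.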

In this research, we propose a method to tackle this problem in STEP by reconstructing tuples. We begin by extracting binary tuples from the Web. Each binary tuple consists of a key attribute and a property of that attribute. To validate the truthfulness of the binary tuples, we apply truth-finding algorithms, which helps us build a credible list of binary tuples. We propose two methods to reconstruct tuples from binary ones. We use the reconstructed tuples to enrich the graph of entities of STEP such that the “true” candidates receive more confidence and rank higher in the graph. We show that our approach is efficient and significantly improves the confidence level of the tuples extracted by STEP. We also conduct an experiment on a real-world case of populating a database relation from the Web with our proposed approach.
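The pipeline described here (extract binary tuples, keep the credible ones, join them back into wider tuples) can be sketched as follows. This is an illustration under stated assumptions, not either of the two reconstruction methods the paper proposes: majority voting stands in for the unspecified truth-finding algorithms, and the functions `majority_vote` and `reconstruct`, the claim format, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(claims):
    """Pick one value per (key, attribute) slot by simple majority voting.

    claims: iterable of (source, key, attribute, value) tuples.
    Majority voting is only a baseline truth-finding strategy, used here
    as a stand-in for the algorithms the paper applies.
    """
    votes = defaultdict(Counter)
    # Drop exact duplicates so a source repeating the same value counts once;
    # sorting makes tie-breaking deterministic (first value in sorted order).
    for _source, key, attribute, value in sorted(set(claims)):
        votes[(key, attribute)][value] += 1
    return {slot: counter.most_common(1)[0][0] for slot, counter in votes.items()}


def reconstruct(truths, schema):
    """Join trusted binary tuples that share a key into one tuple per key."""
    by_key = defaultdict(dict)
    for (key, attribute), value in truths.items():
        by_key[key][attribute] = value
    # Attributes with no trusted value are left as None rather than guessed.
    return {
        key: (key,) + tuple(attrs.get(a) for a in schema)
        for key, attrs in by_key.items()
    }


# Hypothetical binary tuples as claimed by three Web sources.
claims = [
    ("site1", "Indonesia", "capital", "Jakarta"),
    ("site2", "Indonesia", "capital", "Jakarta"),
    ("site3", "Indonesia", "capital", "Bali"),
    ("site1", "Indonesia", "currency", "Rupiah"),
    ("site2", "Singapore", "capital", "Singapore"),
]

truths = majority_vote(claims)
tuples = reconstruct(truths, schema=("capital", "currency"))
```

With this sample data, the minority claim that the capital of Indonesia is "Bali" is outvoted, and each key ends up with one reconstructed tuple; Singapore's currency stays `None` because no source supplied it.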

Acknowledgment

This work has been partially funded by the Big Data and Market Insights Chair of Télécom ParisTech and supported by the National University of Singapore under a grant from Singapore Ministry of Education for research project number T1 251RES1607.

Author information

Corresponding author

Correspondence to Ngurah Agus Sanjaya Er.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Er, N.A.S., Ba, M.L., Abdessalem, T., Bressan, S. (2018). Tuple Reconstruction. In: Liu, C., Zou, L., Li, J. (eds) Database Systems for Advanced Applications. DASFAA 2018. Lecture Notes in Computer Science, vol 10829. Springer, Cham. https://doi.org/10.1007/978-3-319-91455-8_21

  • DOI: https://doi.org/10.1007/978-3-319-91455-8_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91454-1

  • Online ISBN: 978-3-319-91455-8

  • eBook Packages: Computer Science, Computer Science (R0)
