Informationsunschärfe in Big Data

Erkenntnisse aus sozialen Medien in Stadtgebieten

Taming Uncertainty in Big Data

Evidence from Social Media in Urban Areas

  • Article
WIRTSCHAFTSINFORMATIK

Summary

While the classic definition of Big Data originally comprised only the three dimensions volume, velocity, and variety, veracity has recently moved increasingly into the focus of research and practice as a further dimension. The still-growing field of social media and the associated volumes of user-generated data call for new methods that can estimate and control the uncertainty these data contain. This paper addresses one aspect of data uncertainty and presents a novel approach that estimates the reliability of user-generated data on the basis of recurring patterns. To this end, a large set of geo-tagged Twitter status messages from San Francisco is examined and linked to points of interest (POIs) such as bars, restaurants, or parks. The proposed model is validated by causal relationships between points of interest and the Twitter messages in their vicinity. In addition, the temporal dimension of this relationship is taken into account in order to identify recurring patterns depending on the type of POI. The analyses culminate in an indicator that estimates the reliability of the available data in the spatial and temporal dimensions.

Abstract

While the classic definition of Big Data included the dimensions volume, velocity, and variety, a fourth dimension, veracity, has recently come to the attention of researchers and practitioners. The increasing amount of user-generated data associated with the rise of social media emphasizes the need for methods to deal with the uncertainty inherent in these data sources. In this paper we address one aspect of uncertainty by developing a new methodology to establish the reliability of user-generated data based upon causal links with recurring patterns. We associate a large data set of geo-tagged Twitter messages in San Francisco with points of interest, such as bars, restaurants, or museums, within the city. This model is validated by causal relationships between a point of interest and the number of messages in its vicinity. We subsequently analyze the behavior of these messages over time using a jackknifing procedure to identify categories of points of interest that exhibit consistent patterns over time. Ultimately, we condense this analysis into an indicator that provides evidence of the certainty of a data set based on these causal relationships and recurring patterns in temporal and spatial dimensions.
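The two building blocks named in the abstract, counting messages in the vicinity of a point of interest and jackknifing, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the haversine distance, the 200 m radius, the coordinates, the weekly counts, and the function names `haversine_m`, `count_nearby`, and `jackknife_mean` are all assumptions chosen for illustration, and the jackknife shown is the generic leave-one-out estimate of a mean and its standard error.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_nearby(poi, messages, radius_m):
    """Count messages whose coordinates lie within radius_m of the POI."""
    return sum(
        1 for lat, lon in messages
        if haversine_m(poi[0], poi[1], lat, lon) <= radius_m
    )


def jackknife_mean(samples):
    """Leave-one-out jackknife estimate of the mean and its standard error."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    total = sum(samples)
    # Recompute the mean n times, each time leaving one observation out.
    loo_means = [(total - x) / (n - 1) for x in samples]
    estimate = sum(loo_means) / n
    # Jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((m - estimate) ** 2 for m in loo_means)
    return estimate, math.sqrt(variance)


# Hypothetical POI and message coordinates (invented for illustration).
poi = (37.7749, -122.4194)
messages = [
    (37.7749, -122.4194),  # at the POI
    (37.7750, -122.4194),  # roughly 11 m north
    (37.8049, -122.4194),  # roughly 3.3 km north
]
nearby = count_nearby(poi, messages, radius_m=200)

# Hypothetical weekly message counts near one POI (invented numbers).
weekly_counts = [10, 12, 14, 16]
estimate, std_error = jackknife_mean(weekly_counts)
```

A small standard error relative to the estimate would suggest that the message volume near a POI is stable from week to week, which is the kind of consistency the abstract refers to.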

Notes

  1. A week of 7 days with 24 hours each yields 168 time slices in the observation period.
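The arithmetic in the note above (7 × 24 = 168) can be sketched as a mapping from a timestamp to an hour-of-week index. The function name `week_slice` and the choice of Monday 00:00 as slice 0 are assumptions made for this illustration; the source does not state where the week starts.

```python
from datetime import datetime

HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7
SLICES_PER_WEEK = DAYS_PER_WEEK * HOURS_PER_DAY  # 7 * 24 = 168


def week_slice(ts: datetime) -> int:
    """Map a timestamp to an hour-of-week index in 0..167.

    Assumption: the week starts on Monday at 00:00 (slice 0).
    """
    return ts.weekday() * HOURS_PER_DAY + ts.hour


# Example timestamps (chosen only for illustration).
first_slice = week_slice(datetime(2013, 11, 4, 0, 30))    # a Monday, 00:30
last_slice = week_slice(datetime(2013, 11, 10, 23, 59))   # a Sunday, 23:59
```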

Author information

Corresponding author

Correspondence to Johannes Bendler.

Additional information

Accepted after one revision by the editors of the special focus.

This article is also available in English via http://www.springerlink.com and http://www.bise-journal.org: Bendler J, Wagner S, Brandt T, Neumann D (2014) Taming Uncertainty in Big Data. Evidence from Social Media in Urban Areas. Bus Inf Syst Eng. doi: 10.1007/s12599-014-0342-4.

About this article

Cite this article

Bendler, J., Wagner, S., Brandt, T. et al. Informationsunschärfe in Big Data. Wirtschaftsinf 56, 303–313 (2014). https://doi.org/10.1007/s11576-014-0431-5

  • DOI: https://doi.org/10.1007/s11576-014-0431-5
