Elsevier

Information Systems

Volume 53, October–November 2015, Pages 87-106
Advanced topic modeling for social business intelligence

https://doi.org/10.1016/j.is.2015.04.005

Abstract

Social business intelligence combines corporate data with user-generated content (UGC) to make decision-makers aware of the trends perceived from the environment. A key role in the analysis of textual UGC is played by topics, meant as specific concepts of interest within a subject area. To enable aggregations of topics at different levels, a topic hierarchy has to be defined. Some attempts have been made to address the peculiarities of topic hierarchies, but no comprehensive solution has been found so far. The approach we propose to model topic hierarchies in ROLAP systems is called meta-stars. Its basic idea is to use meta-modeling coupled with navigation tables and with dimension tables: navigation tables support hierarchy instances with different lengths and with non-leaf facts, and allow different roll-up semantics to be explicitly annotated; meta-modeling enables hierarchy heterogeneity and dynamics to be accommodated; dimension tables are easily integrated with standard business hierarchies. After outlining a reference architecture for social business intelligence and describing the meta-star approach, we formalize its querying expressiveness and give a cost model for the main query execution plans. Then, we evaluate meta-stars by presenting experimental results for query performance and disk space.

Introduction

The planetary success of social networks and the widespread diffusion of portable devices have enabled simplified and ubiquitous forms of communication and have contributed, during the last decade, to a significant shift in human communication patterns towards the voluntary sharing of personal information. Most of us are able to connect to the Internet anywhere, anytime, and continuously send messages to a virtual community centered around blogs, forums, social networks, and the like. This has resulted in the accumulation of enormous amounts of user-generated content (UGC), which includes geolocation, preferences, opinions, news, etc. This huge wealth of information about people's tastes, thoughts, and actions is obviously raising an increasing interest from decision makers because it can give them a fresh and timely perception of the market mood; besides, the diffusion of UGC is often so widespread as to directly and decisively influence the phenomena of business and society [1], [2], [3].

Some commercial tools are available for analyzing the UGC from a few predefined points of view (e.g., brand reputation and topics correlation) and using some ad hoc KPIs (e.g., topic presence counting and topic sentiment). These tools do not rely on any standard data schema; often they do not even lean on a relational DBMS but rather on in-memory or non-SQL ones. Currently, they are perceived by companies as self-standing applications, so UGC-related analyses are run separately from those strictly related to business, which are carried out on corporate data using traditional business intelligence platforms. To give decision makers an unprecedentedly comprehensive picture of the ongoing events and of their motivation, this gap must be bridged [4].

How to extract the most information from the UGC and use it is a hot research theme in different areas, such as information retrieval, text mining, and natural language processing; each community contributes to this common goal by employing different techniques. The perspective we focus on in this paper is that of social business intelligence (SBI), that is, the discipline of effectively and efficiently combining corporate data with UGC to let decision-makers analyze and improve their business based on the trends and moods perceived from the environment [5]. The data to be combined have very different features: while corporate data are structured, reliable, and accurate, UGC is unstructured or poorly structured, possibly fake, often vague and imprecise; however, both types of data are crucial for an effective decision-making process. As in traditional business intelligence, the goal of SBI is to enable powerful and flexible analyses for users with limited expertise in databases and ICT; this goal is typically achieved by storing information in a data warehouse, in the form of multidimensional cubes to be accessed through OLAP techniques.

In the context of SBI, the category of UGC that most significantly contributes to the decision-making process in the broadest variety of application domains is the one coming in the form of textual clips [2], [3]. Clips can either be messages posted on social media (such as Twitter, Facebook, blogs, and forums) or articles taken from on-line newspapers and magazines. Digging information useful for decision-makers out of textual UGC requires first crawling the web to extract the clips related to a subject area, then enriching them in order to let as much information as possible emerge from the raw text. The subject area defines the project scope and extent, and can be for instance related to a brand or a specific market. Enrichment activities range from the simple identification of relevant parts (e.g., author, title, and language) if the clip is semi-structured, to the use of either natural language processing or text analysis techniques to interpret each sentence and if possible assign a polarity to it (i.e., sentiment analysis or opinion mining [6]). Though the issues related to the overall process have been thoroughly investigated in the literature starting from the early 2000s and some commercial tools are available to support all or parts of it, the analysis capabilities of the results delivered to end-users are typically very limited: only static or poorly flexible reports are provided, and historical data are not made available. Besides, in standard architectures the flow of textual UGC is separate from the ETL flows carrying business data, which forces an unnatural dividing line in the decision-making process and dramatically reduces its effectiveness.

A key role in the analysis of textual UGC is played by topics, meant as specific concepts of interest within the subject area [4]. Users are interested in knowing how much people talk about a topic, which words are related to it, if it has a good or bad reputation, etc. Thus, topics are obvious candidates to become a dimension of the cubes for SBI. In this subsection we explain what we mean by topic and topic hierarchy to avoid misunderstandings due to heterogeneous backgrounds of readers.

A topic could be a word having a specific role in the users' business glossary (e.g., a product, a product type, or a brand), or it could be a common word that at some time becomes relevant to the subject area. In SBI projects, a first list of relevant topics and relationships is often manually provided by decision makers and by experts of the subject area, to be then iteratively refined and enriched by analyzing the dynamics of the subject area [7]. In other situations, this task is automated by employing topic discovery algorithms (e.g., [8], [9], [10]). When topics are manually defined, their relevance to the subject area is normally higher [11]. Conversely, if topics are automatically discovered, we can expect a wide range of heterogeneous concepts (to be then manually restricted to better focus analyses). For example, COBRA [12] by IBM mixes the two approaches to identify the set of topics that best define a company profile. Depending on the tool or technique adopted for UGC analysis, each topic is normally coupled with some measures taken either at the clip/sentence level (e.g., number of occurrences of that topic in each clip) or for each single occurrence (e.g., sentiment of each occurrence of that topic in each clip). Such detailed information is useful, e.g., for early-alerting applications [11] in which users need to react in a timely manner to some specific message; however, to effectively summarize the mood raised by a topic, topic measures must be aggregated using clip metadata, e.g., by author, media type, or language, which can be easily done through OLAP-style analyses.

On the other hand, OLAP analyses of single, specific topics are often not sufficient to give users a clear and comprehensive picture of the social mood. As for any other dimension, users are also very interested in grouping topics together in different ways to carry out more general and effective analyses, which requires the definition of a topic hierarchy that specifies inter-topic roll-up (i.e., grouping) relationships so as to enable aggregations of topic measures at different levels. Hierarchically organized topics have been found to be useful in many contexts such as group profiling at varying granularity [13] and semantic comparison of documents [14]. Many solutions to create such hierarchies have been proposed: in some cases higher hierarchy levels are associated with topics occurring more frequently [15], while in others [16] higher levels are associated with more general topics from an ontological point of view (e.g., the arcs in the hierarchy represent part-of and instance-of relationships). Consistently with the OLAP metaphor, we opt for the second interpretation of hierarchy. As for topic discovery, a topic hierarchy can be manually defined, or it can be automatically derived from the business glossary or from a mining process. For example, in [16] a methodology is proposed to extract information from big data, convert it into a human-comprehensible format, and build a hierarchical ontological tree based on the extracted metadata.

Once a topic hierarchy is available, it can be used to compute aggregated measures. In particular, the way an aggregated sentiment is computed depends on how the sentiment of each topic occurrence is returned by the sentiment analysis engine. If a numerical score is returned as in [17], [18], a traditional weighted average can be applied, while if the engine adopts a qualitative (e.g., ordinal) score, more complex solutions must be applied [19].
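The numerical case can be illustrated with a minimal sketch. It assumes that each fact carries a topic, an occurrence count, and a mean sentiment score; the `TopicFact` type, its field names, and the sample values are hypothetical and are not taken from the paper or from the engines in [17], [18]. Under these assumptions, the aggregated sentiment of a group of topics is the occurrence-weighted average of the per-topic means:

```python
from typing import Iterable, NamedTuple, Optional


class TopicFact(NamedTuple):
    """Hypothetical fact row: one topic with its occurrence count and mean sentiment."""
    topic: str
    occurrences: int
    avg_sentiment: float  # assumed numerical score, e.g. in [-1, 1]


def aggregate_sentiment(facts: Iterable[TopicFact]) -> Optional[float]:
    """Occurrence-weighted average of per-topic mean sentiments.

    Returns None when there are no occurrences to average over.
    """
    total = 0
    weighted = 0.0
    for fact in facts:
        total += fact.occurrences
        weighted += fact.occurrences * fact.avg_sentiment
    return weighted / total if total else None


# Made-up values, for illustration only.
sample = [
    TopicFact("Galaxy III", 30, 0.5),
    TopicFact("Galaxy Tab", 10, -0.5),
]
print(aggregate_sentiment(sample))
```

Weighting by occurrences matters here: a plain mean of the two per-topic averages would give both topics the same influence, regardless of how often each one is mentioned.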

Discussing how topics and their relationships are provided or discovered is out of the scope of this paper, and henceforth we will simply assume that a topic hierarchy exists.

Topic hierarchies are quite different from traditional hierarchies (such as the temporal and geographical ones) for several reasons [20]:

  1. From the point of view of their instances, topic hierarchies are irregular in several ways. According to the terminology introduced in [21] they are non-onto, which means that hierarchy instances can have different lengths and also non-leaf topics can be related to facts (e.g., clips may talk of smartphones as well as of the Galaxy III). Topic hierarchies are also non-covering (some hierarchy levels may be missing in some instances) and non-strict (many-to-many relationships between topics may exist). Note that, in ROLAP (Relational OLAP) contexts, a non-onto hierarchy can be represented either by coupling a classic dimension table with a navigation table that explicitly represents the transitive closure of the node relationships, or by creating a parent–child (i.e., recursive) dimension table [24]. Conversely, a non-strict hierarchy can be dealt with using many-to-many bridge tables [24].

  2. Trendy topics are heterogeneous (e.g., they could include names of famous people, products, places, brands, etc.) and change quickly over time (e.g., if at some time it were announced that using smartphones can cause finger pathologies, a brand new set of hot, unpredicted topics would emerge during the following days), so a comprehensive schema for topics cannot be anticipated at design time and must be dynamically defined. For some topics a classification could even be hard due to their fuzzy nature, or unnecessary due to their transitoriness. So, from the point of view of their schemata, we can intuitively say that topic hierarchies are fluid.

  3. Some topics (e.g., products) are normally part of the hierarchies that store business data. Thus, modeling those topics in such a way as to enable direct integration with the EDW (enterprise data warehouse) is highly desirable.

  4. Relationships between topics can have different roll-up semantics: for instance, the relationship semantics in “Galaxy III has brand Samsung” and “Galaxy III has type smartphone” is quite different. In the multidimensional model this distinction can only be (implicitly) enforced by leaning on the semantics of aggregation levels, which is possible for a regular hierarchy (“Smartphone” is a member of level Type, “Samsung” is a member of level Brand) but not for a non-onto hierarchy because all topics are members of the same Topic level.
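To make the notion of a navigation table more concrete, the following sketch derives the transitive closure of a small set of direct inter-topic relationships and records, for each (descendant, ancestor) pair, the roll-up semantics encountered along the path. It is only an illustration under stated assumptions: the row layout, the `build_navigation_table` function, the semantics labels `hasBrand`, `hasType`, and `isA`, and the "Smartphone → Mobile Tech" arc are hypothetical, and the sketch does not reproduce the actual meta-star schema.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]               # (child, parent, roll-up semantics)
NavRow = Tuple[str, str, FrozenSet[str]]  # (descendant, ancestor, semantics on the path)


def build_navigation_table(topics: Set[str], edges: List[Edge]) -> Set[NavRow]:
    """Transitive closure of the topic relationships, one row per distinct path signature.

    Every topic also reaches itself through an empty path, so that facts attached
    to non-leaf topics (non-onto hierarchies) are still counted for that topic.
    Several parents per topic are allowed (non-strict hierarchies).
    """
    parents: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for child, parent, semantics in edges:
        parents[child].append((parent, semantics))

    rows: Set[NavRow] = set()
    for topic in topics:
        stack: List[Tuple[str, FrozenSet[str]]] = [(topic, frozenset())]
        while stack:
            node, used = stack.pop()
            row = (topic, node, used)
            if row in rows:
                continue  # already expanded with the same semantics signature
            rows.add(row)
            for parent, semantics in parents[node]:
                stack.append((parent, used | {semantics}))
    return rows


# Illustrative arcs; the last one is a made-up arc added to show transitivity.
topics = {"Galaxy III", "Galaxy Tab", "Samsung", "Smartphone", "Mobile Tech"}
edges = [
    ("Galaxy III", "Samsung", "hasBrand"),
    ("Galaxy Tab", "Samsung", "hasBrand"),
    ("Galaxy III", "Smartphone", "hasType"),
    ("Smartphone", "Mobile Tech", "isA"),
]
nav = build_navigation_table(topics, edges)
for row in sorted(nav, key=lambda r: (r[0], r[1])):
    print(row[0], "->", row[1], sorted(row[2]))
```

Because each row carries the semantics used to reach the ancestor, a query can choose which kinds of arcs to navigate, something a single Topic level cannot express on its own.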

Example 1

In our motivating example, a marketing analyst wants to analyze people's feelings about mobile devices. A basic cube she could use to this purpose is the one counting, within the textual UGC, the number of occurrences of each topic related to subject area “mobile technologies”, distinguishing between those expressing positive/negative/neutral sentiment as labeled by an opinion mining algorithm (e.g., the one in [18]). Fig. 1 shows a sample set of topics for mobile technologies and chain stores and the roll-up relationships the analyst deems interesting (e.g., when analyzing topic “Samsung”, the analyst may wish to also include occurrences of topics “Galaxy III” and “Galaxy Tab”), while Table 1 gives some sample facts with four measures (the total number of occurrences is higher than the sum of positive and negative ones, because most occurrences are normally unbiased; the average sentiment is computed by averaging the numerical sentiment scores for the occurrences). Note that, since this example is from the marketing area, most topics reasonably correspond to values of attributes (e.g., products) in the EDW; in other settings, topics could also be more common words such as “finger pathologies” in our example, or emerging trends like “hands-free” and “wearable device”. Now, let the analyst be specifically interested in three types of analysis of the UGC: (i) brand reputation, aimed at assessing people's perception of each brand; (ii) talking volume, whose goal is to count the overall occurrences of mobile tech topics; and (iii) health rumors, aimed at capturing the customers’ concerns about touchscreens and the possible pathologies they may cause.
In the first case, the perception of Samsung will be measured by counting the positive and negative occurrences of topics “Samsung”, “Galaxy III”, and “Galaxy Tab”; in the second case, all occurrences of all tech-related topics except “Nokia” and “Samsung” will be counted; in the third case, only the occurrences of “Touchscreen” and “Finger Pathologies” will be considered. The results are shown in Table 2; it appears that, depending on the analyst's goals, facts can be aggregated in different ways by navigating or not navigating inter-topic relationships with the different semantics shown in gray in Fig. 1.
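The brand reputation analysis of Example 1 can be sketched as a relational query over a fact table joined with a navigation table. The sketch below uses Python's built-in `sqlite3` module; the table and column names (`fact`, `nav`, `pos_occ`, `neg_occ`, `semantics`), the `self` and `hasBrand` labels, and all the counts are assumptions made up for illustration, and they reproduce neither Table 1 nor the actual meta-star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact (topic TEXT, pos_occ INTEGER, neg_occ INTEGER);
    CREATE TABLE nav  (descendant TEXT, ancestor TEXT, semantics TEXT);
""")

# Made-up occurrence counts, for illustration only.
conn.executemany("INSERT INTO fact VALUES (?, ?, ?)", [
    ("Samsung", 5, 2),
    ("Galaxy III", 7, 1),
    ("Galaxy Tab", 3, 4),
    ("Nokia", 6, 6),
])

# Hypothetical navigation rows: each topic reaches itself ('self'),
# and products reach their brand through a 'hasBrand' arc.
conn.executemany("INSERT INTO nav VALUES (?, ?, ?)", [
    ("Samsung", "Samsung", "self"),
    ("Galaxy III", "Galaxy III", "self"),
    ("Galaxy Tab", "Galaxy Tab", "self"),
    ("Nokia", "Nokia", "self"),
    ("Galaxy III", "Samsung", "hasBrand"),
    ("Galaxy Tab", "Samsung", "hasBrand"),
    ("Galaxy III", "Smartphone", "hasType"),
])


def reputation(brand, navigate=True):
    """Return (positive, negative) occurrences for a brand.

    With navigate=True, occurrences of topics that roll up to the brand
    through 'hasBrand' arcs are included; otherwise only the brand itself.
    """
    allowed = ("self", "hasBrand") if navigate else ("self",)
    placeholders = ",".join("?" * len(allowed))
    sql = f"""
        SELECT COALESCE(SUM(f.pos_occ), 0), COALESCE(SUM(f.neg_occ), 0)
        FROM fact f JOIN nav n ON f.topic = n.descendant
        WHERE n.ancestor = ? AND n.semantics IN ({placeholders})
    """
    return conn.execute(sql, (brand, *allowed)).fetchone()


print(reputation("Samsung"))                  # navigating hasBrand arcs
print(reputation("Samsung", navigate=False))  # the topic "Samsung" alone
```

The two calls differ only in the set of semantics admitted by the selection on the navigation table, which is the sense in which facts can be aggregated by navigating or not navigating inter-topic relationships.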

In light of the above, topic hierarchies in ROLAP contexts must clearly be modeled with more sophisticated solutions than traditional star schemata. Though some attempts have been made in the literature to address some of the mentioned issues (e.g., [3], [20]), no solution to all of them has been found so far. The approach we propose in this paper is called meta-stars, and it couples meta-modeling with navigation tables and with dimension tables to improve flexibility and expressiveness when modeling topic hierarchies. Meta-stars are flexible because, differently from traditional solutions that mainly model regular hierarchies with a fixed and pre-defined structure, they use a single schema to cope with hierarchies with very different features. Indeed, navigation tables support non-onto, non-covering, and non-strict hierarchies (requirement #1); meta-modeling enables heterogeneity and dynamics of topic classification to be accommodated without requiring changes to the underlying schema (requirement #2); finally, dimension tables are easily integrated with standard business hierarchies (requirement #3). On the other hand, meta-stars are more expressive than traditional solutions because they can model unclassified topics (requirement #2) and because different roll-up semantics can be explicitly annotated (requirement #4), which in turn enables a brand new class of semantics-aware OLAP queries.

As discussed in Section 7, an obvious consequence of the adoption of navigation tables is that the total size of the solution increases exponentially with the size of the topic hierarchy. This clearly limits the applicability of the meta-star approach to topic hierarchies of small-medium size; however, we argue that this limitation is not really penalizing because topic hierarchies are normally created and maintained manually by domain experts, which suggests that their size can hardly become too large.

The work we present in this paper is based on [5], where meta-stars were originally motivated and introduced. Here we improve our previous work in several substantial respects: (i) the approach is extended to also take non-covering and non-strict hierarchies into account; (ii) the techniques for dealing with slowly-changing topics and levels when using meta-stars are discussed; (iii) a semantics for queries on topic hierarchies is formally stated; (iv) the possible execution plans for queries on meta-stars are presented and the related cost model is provided; (v) a wider set of tests is proposed to evaluate the meta-star approach. Our focus is on topic hierarchies and their effective modeling; the related methodological issues are out of scope here, and the interested reader is referred to [7], where a methodology for designing and maintaining SBI applications is presented.

In the remainder of the paper, after discussing the related literature in Section 2, we sketch an architecture to support SBI in Section 3. Then, in Sections 4 and 5 we present our approach and the types of queries it supports, while in Section 6 we show query execution plans and propose a cost model. Experimental tests are presented in Section 7, while Section 8 draws the conclusions.

Related work

OLAP techniques are normally applied to multidimensional cubes storing structured business data. However, several research papers focus on the possibility of enhancing OLAP analyses and broadening its scope to unstructured content. Overall, as sketched in Table 3, the literature related to our approach can be classified into three partially overlapping areas: OLAP analyses on textual documents, advanced data warehouse modeling, and analysis of social contents. In the following we briefly

Architectural overview

The reference architecture we propose to support our approach to SBI, which we also adopted for the case studies described in [7], is depicted in Fig. 2. Its main highlight is the integration between sentiment and business data, achieved in a non-invasive way by extracting some business flows from the EDW and integrating them with those carrying textual UGC to provide users with 360° decisional capabilities. In the following we briefly comment on each component.

The Crawling component carries out a

Meta-stars

Different multidimensional cubes can be stored in the DM component of Fig. 2, focused for instance on the perceived sentiment for the topics in the subject area, on the correlations between topics, on the trending topics, and so on as determined by the semantic enrichment process (Fig. 3 shows a simple cube for Example 1, which can also be used for any other subject area). Typical indicators associated with these cubes are the topic share (ratio between the number of occurrences of a topic and the

Querying meta-stars

In this section we show how meta-stars support OLAP queries with increasing expressiveness and complexity, starting from queries using only static levels and ending with semantics-aware queries.

OLAP queries normally return the values of one or more measures aggregated according to a group-by set including a few hierarchy levels, optionally filtered according to a selection. For instance, with reference to the conceptual schema depicted in Fig. 3, a possible query could return the total number of

Query execution plans and cost model for meta-stars

As made clear in the previous sections, meta-stars entail higher flexibility and expressiveness than classical star schemata both from the point of view of the hierarchy structures supported and from that of the queries enabled. However, this also leads to extra costs in terms of storage space and, in some cases, query execution time. To quantitatively evaluate the querying efficiency of meta-stars vs. the one of traditional star schemata (see Section 7), in this section we discuss the main

Evaluation

In this section we discuss how meta-stars perform both in absolute terms and with respect to traditional star schema implementations. Execution times are computed using the cost model described in Section 6 with a fact table of 10 million tuples, and considering an average read/write time of 2.27×10⁻⁴ s per disk page (as measured on the Oracle 11g RDBMS with disk page size set to 8 KB, on a 64-bit AMD Opteron quad-core 2.09 GHz virtual machine with 8 GB RAM and RAID 10 disk architecture,

Final remarks

In this paper we have introduced SBI as a relevant area for business and research, and we have proposed an expressive solution to model topic hierarchies based on some specific requirements: irregularity and fluidity of hierarchies, integrability with business hierarchies, and semantics-aware aggregation. Notably, the choice of the subset of levels to be modeled as static governs the trade-off between the fluidity of topic classification and aggregation and the efficiency of integrating

References (38)

  • A. Saha, V. Sindhwani, Learning evolving and emerging topics in social media: a dynamic nmf approach with temporal...
  • S.B. Meftah, K. Khrouf, J. Feki, M.B. Kraiem, C. Soulé-Dupuy, Document warehouse: integration of semantic structures,...
  • B. Fortuna, D. Mladenic, M. Grobelnik, Semi-automatic construction of topic ontologies, in: Proceedings of the EWMF,...
  • N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton, T. Tomokiyo, Deriving marketing intelligence from online...
  • S. Spangler et al., Cobra-mining web for corporate brand and reputation analysis, Web Intell. Agent Syst. (2009)
  • L. Tang et al., Topic taxonomy adaptation for group profiling, ACM Trans. Knowl. Discov. Data (2008)
  • A. Gelbukh, G. Sidorov, A. Guzman-Arenas, Document comparison with a weighted topic hierarchy, in: Proceedings of the...
  • S.-L. Chuang, L.-F. Chien, A practical web-based approach to generating topic hierarchy for text segments, in:...
  • A. Aras et al., Ontological tree generation for enhanced information retrieval, Int. J. Artif. Intell. Appl. (2013)