
Integrating learned and explicit document features for reputation monitoring in social media

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Currently, monitoring reputation in social media is probably one of the most lucrative applications of information retrieval methods. However, this task poses new challenges due to the dynamic nature of the content and the need for early detection of topics that affect the reputation of companies. Addressing this problem with learning mechanisms based on training data sets is challenging, given that unseen features play a crucial role. However, learning processes are necessary to capture domain features and dependency phenomena. In this work, based on observational information theory, we define a document representation framework that enables the combination of explicit text features and supervised and unsupervised signals into a single representation model. Our theoretical analysis demonstrates that the observation information quantity (OIQ) generalizes the most popular representation methods, in addition to capturing quantitative values, which is required for integrating signals from learning processes. In other words, the OIQ allows us to give the same treatment to features that are currently managed separately. Empirically, our experiments on the reputation-monitoring scenario demonstrate that progressively adding features from supervised (in particular, Bayesian inference over annotated data) and unsupervised (in particular, proximity to clusters) learning methods improves similarity estimation performance. This result is verified under various similarity criteria (pointwise mutual information, the Jaccard and Lin distances, and the information contrast model). According to our formal analysis, the OIQ is the first representation model that captures the informativeness (specificity) of quantitative features in the document representation.
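As a concrete illustration of the framework summarized above, the following is a minimal sketch of the observation information quantity as it is used in the appendix: the negative logarithm of the probability that a random message dominates the observed feature values. The toy collection, the feature functions and the `alert_score` signal are hypothetical; they merely show how an explicit text feature and a quantitative learned signal receive the same treatment.

```python
import math

def oiq(d, collection, features):
    """Observation information quantity of message d: negative base-2 log of the
    fraction of messages whose feature values are >= those of d for every feature
    (the definition used in the appendix)."""
    dominating = sum(
        1 for d2 in collection
        if all(f(d2) >= f(d) for f in features)
    )
    return -math.log2(dominating / len(collection))

# Hypothetical toy collection of messages; the features mix an explicit text
# signal (term frequency of "battery") and a quantitative learned signal
# (a classifier's alert score attached to each message).
collection = [
    {"text": "great battery battery life", "alert_score": 0.1},
    {"text": "battery exploded, avoid this phone", "alert_score": 0.9},
    {"text": "nice camera", "alert_score": 0.2},
    {"text": "ok phone", "alert_score": 0.3},
]
features = [
    lambda d: d["text"].split().count("battery"),  # explicit feature
    lambda d: d["alert_score"],                    # learned, quantitative feature
]

for d in collection:
    print(round(oiq(d, collection, features), 2), d["text"])
```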


Notes

  1. For instance, requiring the occurrence of just a few words as text features is enough to yield an empty result set in a standard web search engine.

  2. Note that considering words as information pieces is not equivalent to considering words as features, due to the effect of repeated words (a numeric illustration follows these notes):

    $$\begin{aligned} {\mathcal {I}}\big (\{w_1,w_2,w_2\}\big ) = -\log \big (P_{d\in \mathcal {D}}(tf(d,w_1)\ge 1)\cdot P_{d\in \mathcal {D}}(tf(d,w_2)\ge 2)\big ) \ne -\log \big (P(w_1)P(w_2)P(w_2)\big ). \end{aligned}$$
  3. That is, pairs with equal similarity under the measure count as one half in the probability estimation.
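The sketch below illustrates footnote 2 on a hypothetical toy collection: the observation-level quantity built from document-level probabilities of term frequencies differs, in general, from the quantity obtained by treating each word occurrence as an independent event. The word choices, the collection and the token-count estimate of P(w) are illustrative assumptions.

```python
import math
from collections import Counter

# Toy collection; w1 = "service", w2 = "down".
docs = [
    "service down down",
    "service is fine",
    "network down",
    "great service",
]
tfs = [Counter(d.split()) for d in docs]
N = len(docs)

# Observation-level estimate (left-hand side of the footnote):
p_w1_ge1 = sum(tf["service"] >= 1 for tf in tfs) / N
p_w2_ge2 = sum(tf["down"] >= 2 for tf in tfs) / N
oiq = -math.log2(p_w1_ge1 * p_w2_ge2)

# Treating each occurrence as an independent word (right-hand side),
# with P(w) estimated here over token counts:
total = sum(sum(tf.values()) for tf in tfs)
p_w1 = sum(tf["service"] for tf in tfs) / total
p_w2 = sum(tf["down"] for tf in tfs) / total
indep = -math.log2(p_w1 * p_w2 * p_w2)

print(oiq, indep)  # the two quantities differ in general
```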


Acknowledgements

We thank the anonymous reviewers for their very useful comments, which have added value to the manuscript. The work was supported by the Ministerio de Economía y Competitividad, TIN Program (Vemodalen), under Grant Number: TIN2015-71785-R.

Author information

Corresponding author

Correspondence to Enrique Amigó.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: formal proofs

Proposition 3.1

The proof is straightforward. According to the fuzzy set operators:

$$\begin{aligned} {\mathcal {O}}_{\Gamma }(d_1)\cup {\mathcal {O}}_{\Gamma }(d_2) \cup \ldots \cup {\mathcal {O}}_{\Gamma }(d_n)=(\Gamma ,f) \ , \end{aligned}$$

where

$$\begin{aligned} f(\gamma )=\max \limits _{ 1 \le i \le n} {\gamma }(d_i), \forall \gamma \in \Gamma \ . \end{aligned}$$

\(\square \)
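A minimal sketch of this union operator, assuming observations are encoded as feature-to-value maps and that a feature absent from a map takes the value 0; the feature names are hypothetical.

```python
# Fuzzy union of observations (Proposition 3.1): for each feature, keep the
# maximum value over the combined documents.
def union(*observations):
    features = set().union(*(o.keys() for o in observations))
    return {g: max(o.get(g, 0.0) for o in observations) for g in features}

o1 = {"tf:battery": 2.0, "cluster_proximity": 0.4}
o2 = {"tf:battery": 1.0, "cluster_proximity": 0.7, "tf:fire": 1.0}
print(union(o1, o2))
# e.g. {'tf:battery': 2.0, 'cluster_proximity': 0.7, 'tf:fire': 1.0} (key order may vary)
```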

Property 3.1

From \({\gamma }(d_1) \ge {\gamma }(d_2), \ \ \forall \gamma \in \Gamma \), it follows:

$$\begin{aligned} \Big \{d \in \mathcal {D} : {\gamma }(d) \ge {\gamma }(d_1), \ \forall \gamma \in \Gamma \Big \} \subseteq \Big \{d \in \mathcal {D} :{\gamma }(d) \ge {\gamma }(d_2), \ \forall \gamma \in \Gamma \Big \} \ . \end{aligned}$$

Then,

$$\begin{aligned} P_{d \in \mathcal {D}}\big ( {\gamma }(d) \ge {\gamma }(d_1), \ \forall \gamma \in \Gamma \big ) \le P_{d \in \mathcal {D}}\big ( {\gamma }(d) \ge {\gamma }(d_2), \ \forall \gamma \in \Gamma \big ) \ . \end{aligned}$$

This implies that

$$\begin{aligned} P_{d \in \mathcal {D}}\Big ({\mathcal {O}}_{\Gamma }(d)\supseteq {\mathcal {O}}_{\Gamma }(d_1)\Big )\le P_{d \in \mathcal {D}}\Big ({\mathcal {O}}_{\Gamma }(d)\supseteq {\mathcal {O}}_{\Gamma }(d_2)\Big ). \end{aligned}$$

And therefore, according to Definition 3.3:

$$\begin{aligned} {\mathcal {I}}_{\Gamma }\big (d_1\big ) \ge {\mathcal {I}}_{\Gamma }\big (d_2\big ). \end{aligned}$$

\(\square \)

Property 3.2

Notice that if we add a feature, the new observation is more restrictive than the initial observation, and thus, the set of messages which verify the new observation is contained in the set of messages which verify the initial observation, \({\mathcal {O}}_{\Gamma \cup \{\gamma '\}}(d) \subseteq {\mathcal {O}}_{\Gamma }(d)\). Then,

$$\begin{aligned} P_{d' \in \mathcal {D}}\Big ({\mathcal {O}}_{\Gamma \cup \{\gamma '\}}(d')\supseteq {\mathcal {O}}_{\Gamma \cup \{\gamma '\}}(d)\Big )\le P_{d' \in \mathcal {D}}\Big ({\mathcal {O}}_{\Gamma }(d')\supseteq {\mathcal {O}}_{\Gamma }(d)\Big ). \end{aligned}$$

And therefore, according to Definition 3.3:

$$\begin{aligned} {\mathcal {I}}_{\Gamma \cup \{\gamma '\}}\big (d\big ) \ge {\mathcal {I}}_{\Gamma }\big (d\big ). \end{aligned}$$

\(\square \)

Property 3.3

By Proposition 3.1:

$$\begin{aligned} {\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1)\cup {\mathcal {O}}_{\Gamma }(d_2)\big ) = - \log \bigg (P_{d \in \mathcal {D}}\Big ( {\gamma }(d) \ \ge \ \max \big \{{\gamma }(d_1),{\gamma }(d_2)\big \}, \ \forall \gamma \in \Gamma \Big ) \bigg ) \ . \end{aligned}$$

Given that

$$\begin{aligned} P_{d \in \mathcal {D}}\Big ( {\gamma }(d) \ \ge \ \max \big \{{\gamma }(d_1),{\gamma }(d_2)\big \}, \ \forall \gamma \in \Gamma \Big ) \le P_{d \in \mathcal {D}}\Big ( {\gamma }(d) \ \ge \ {\gamma }(d_1), \ \forall \gamma \in \Gamma \Big ) \ , \end{aligned}$$

we finally get, \( {\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1)\cup {\mathcal {O}}_{\Gamma }(d_2)\big )\ge {\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1)\big )\). Similarly, we can get the same result for \(d_2\). \(\square \)

Property 3.4

By hypothesis,

$$\begin{aligned} P_{d' \in \mathcal {D}}\big ({\gamma _1}(d') \ge {\gamma _1}(d)\big ) \le P_{d' \in \mathcal {D}}\big ({\gamma _2}(d') \ge {\gamma _2}(d) \big ) \ , \end{aligned}$$

which is equivalent to

$$\begin{aligned}&\frac{1}{P_{d' \in \mathcal {D}}\big ({\gamma _1}(d') \ge {\gamma _1}(d)\big )} \ge \frac{1}{P_{d' \in \mathcal {D}}\big ({\gamma _2}(d') \ge {\gamma _2}(d) \big )} \Rightarrow \\&\quad \log \left( \frac{1}{P_{d' \in \mathcal {D}}\big ({\gamma _1}(d') \ge {\gamma _1}(d)\big )}\right) \ge \log \left( \frac{1}{P_{d' \in \mathcal {D}}\big ({\gamma _2}(d') \ge {\gamma _2}(d) \big )}\right) \Rightarrow \\&\quad {\mathcal {I}}_{\{\gamma _1\}}\big (d\big )\ge {\mathcal {I}}_{\{\gamma _2\}}\big (d\big ) \ . \end{aligned}$$

\(\square \)

Property 3.5

Consider two features, \(\gamma _1, \gamma _2 \in \Gamma \). Given a message, \(d \in \mathcal {D}\), its observation under \(\gamma _1\), \({\mathcal {O}}_{\gamma _1}(d)\), has Observation Information Quantity:

$$\begin{aligned} {\mathcal {I}}_{\{\gamma _1\}}\big (d\big ) = -\log \Big ( P_{d' \in \mathcal {D}}\big ({\gamma _1}(d') \ge {\gamma _1}(d)\big ) \Big ) = -\log \Big ( P_{d' \in \mathcal {D}}\Big ( g\big ( {\gamma _2}(d')\big ) \ge g\big ( {\gamma _2}(d)\big ) \Big ) \Big ) \ . \end{aligned}$$

Given that g is a strictly monotonic (increasing) function, the event \(g\big ({\gamma _2}(d')\big ) \ge g\big ({\gamma _2}(d)\big )\) coincides with the event \({\gamma _2}(d')\ge {\gamma _2}(d)\); therefore,

$$\begin{aligned} {\mathcal {I}}_{\{\gamma _1\}}\big (d\big ) = -\log \Big ( P_{d' \in \mathcal {D}}\Big ( g\big ( {\gamma _2}(d')\big ) \ge g\big ( {\gamma _2}(d)\big ) ,\ {\gamma _2}(d')\ge {\gamma _2}(d)\Big ) \Big ) = {\mathcal {I}}_{\{\gamma _1,\gamma _2\}}\big (d\big ). \end{aligned}$$

\(\square \)

Property 3.6

Assume that we have a finite set of messages. The proof of this proposition is a direct consequence of the messages being fully represented by the features: with an infinite set of features, every message is described, and each message is unequivocally determined by the values of a set of features. \(\square \)

Property 3.7

Given a fixed message, \(d \in \mathcal {D}\), consider all the messages, \(d' \in \mathcal {D}\), which verify the inequalities:

$$\begin{aligned} {\gamma }(d') \le {\gamma }(d) \ \wedge \ {\gamma ^{-1}}(d') \le {\gamma ^{-1}}(d) \ . \end{aligned}$$

These inequalities are equivalent to (by definition of \(\gamma ^{-1}\)):

$$\begin{aligned} {\gamma }(d') \le {\gamma }(d) \ \wedge \ \frac{1}{{\gamma }(d')} \le \frac{1}{{\gamma }(d)} \ . \end{aligned}$$

Notice that \({\gamma }(d)\) and \({\gamma }(d')\) are positive numbers; therefore, these inequalities imply that \({\gamma }(d) = {\gamma }(d')\). Then, the Observation Information Quantity is:

$$\begin{aligned} {\mathcal {I}}_{\left\{ \gamma , \gamma ^{-1}\right\} }\big (d\big )= -\log \bigg (P_{d' \in \mathcal {D}}\Big ( {\gamma }(d') \le {\gamma }(d) \wedge {\gamma ^{-1}}(d') \le {\gamma ^{-1}}(d) \Big ) \bigg ) \ \end{aligned}$$

which is equivalent to:

$$\begin{aligned} {\mathcal {I}}_{\left\{ \gamma , \gamma ^{-1}\right\} }\big (d\big )= -\log \Big (P_{d' \in \mathcal {D}}\big ( {\gamma }(d') = {\gamma }(d) \big ) \Big ) \ . \end{aligned}$$

\(\square \)

Proposition 5.1

Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), consider the set of features as \(\Gamma = \{occ_{w_1},\ldots ,occ_{w_n}\}\), and given a message from the collection, \(d \in \mathcal {D}\), we are interested in computing the described OIQ, \({\mathcal {I}}_{occ_{w_i}}\big (d\big )\).

Assuming information additivity and considering text words as basic linguistic units, we have

$$\begin{aligned} {\mathcal {I}}_{occ_{w_i}}\big (d\big ) = \sum _{w_j \in d} {\mathcal {I}}_{occ_{w_i}}\big (w_j\big ) = \sum _{ w_j \in d} - \log \Big ( P_{w' \in \mathcal {V}}\big ({occ_{w_i}}(w') \ge {occ_{w_i}}(w_j) \big ) \Big ) \ . \end{aligned}$$

Notice that, if \(w_j \ne w_i\), then \({occ_{w_i}}(w_{j}) = 0\). Thus, \(P\big ({occ_{w_i}}(w') \ge 0 \big ) = 1\), since by definition \({occ_{w_i}}(d) \ge 0\), \(\forall d \in \mathcal {D}\). Therefore, in the last summation all the terms are null, except for \(w_{j} = w_{i}\). In this case, we have that \({occ_{w_i}}(w_i) = 1\), and given that by definition of the function \({occ_{w_i}}(.)\), its maximum value is 1, we can say that \({occ_{w_i}}(w') \ge 1\) is equivalent to \({occ_{w_i}}(w') = 1\). Therefore, the probability \(P\big ({occ_{w_i}}(w') = 1 \big )\) is exactly \(P(w' = w_i) = P(w_i)\). And, \({\mathcal {I}}_{occ_{w_i}}\big (d\big ) \propto - \log \big ( P(w_i) \big )\).

One of the assumptions is that every word is equiprobable, i.e. \(P(w_i) = k\), \(1 \le i \le n\), for an arbitrary k. In order to achieve the result, we can choose k in such a way that \(- \log (k) = 1\). And finally, the summation gives us the \(tf(w_i, d)\). \(\square \)

Proposition 5.2

Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), considering the set of features as, \(\Gamma = \{occ_{w_1},\ldots ,occ_{w_n}\}\), and given a message from the collection, \(d \in \mathcal {D}\), we are interested in computing the described OIQ, \({\mathcal {I}}_{occ_{w_i}}\big (d\big )\).

Assuming information additivity and considering messages as basic linguistic units, we have

$$\begin{aligned} {\mathcal {I}}_{occ_{w_i}}\big (d\big ) = \sum _{w_j \in d} {\mathcal {I}}_{occ_{w_i}}\big (w_j\big ) = \sum _{ w_j \in d} - \log \Big ( P_{d' \in \mathcal {D}}\big ({occ_{w_i}}(d') \ge {occ_{w_i}}(w_j) \big ) \Big ) \ . \end{aligned}$$

Notice that, if \(w_{j} \ne w_{i}\), then \({occ_{w_i}}(w_j) = 0\). Thus, \(P_{d' \in \mathcal {D}}\big ({occ_{w_i}}(d') \ge 0 \big ) = 1\), since by definition \({occ_{w_i}}(d') \ge 0\), \(\forall d' \in \mathcal {D}\). Therefore, in the last summation all the terms are null, except for \(w_{j} = w_{i}\). In this case, we have as many terms as the number of times that the word \(w_i\) appears in the message d, i.e. \(tf(w_i, d)\). Moreover, we have that \({occ_{w_i}}(w_j) = 1\), and given that by definition of \({occ_{w_i}}(.)\), its maximum value is 1, we can say that \({occ_{w_i}}(d') \ge 1\) is equivalent to \({occ_{w_i}}(d') = 1\). Therefore, the expression \(-\log \Big (P_{d' \in \mathcal {D}}\big ({occ_{w_i}}(d') = 1 \big ) \Big )\) is exactly \(-\log \Big ( P_{d' \in \mathcal {D}}(w_i \in d')\Big ) = idf(w_i)\). And thus, \({\mathcal {I}}_{occ_{w_i}}\big (d\big ) = tf(w_i, d) \cdot idf(w_i)\). \(\square \)
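The following numeric check illustrates Proposition 5.2 on a hypothetical toy collection: summing, over the occurrences of a word in a message, the negative log-probability of observing that word in a random message yields exactly \(tf \cdot idf\).

```python
import math
from collections import Counter

# With occurrence features and messages as basic units, the OIQ contributed by a
# word reduces to tf * idf (base-2 logs are used throughout).
docs = ["battery battery fire", "battery ok", "nice camera", "camera fire"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(w):
    df = sum(w in d for d in tokenized)          # message frequency of w
    return -math.log2(df / N)

def oiq_word(w, message):
    # Each occurrence of w contributes -log P_{d' in D}(occ_w(d') >= 1) = idf(w).
    return sum(idf(w) for token in message if token == w)

message = tokenized[0]
for w in set(message):
    assert abs(oiq_word(w, message) - Counter(message)[w] * idf(w)) < 1e-12
    print(w, round(oiq_word(w, message), 3))
```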

Proposition 5.3

Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), and considering as features:

$$\begin{aligned} {\gamma _{i,j}}(d)= {\left\{ \begin{array}{ll} 1, &{} \quad \text {if the }j^{th}\text { element in }d\text { is the word}~~ w_i \\ 0, &{} \quad \text {otherwise}\\ \end{array}\right. }, \ \ 1 \le i \le n, \ 1 \le j \le m. \end{aligned}$$

Assuming feature independence, we have

$$\begin{aligned} {\mathcal {I}}_{\Gamma }\big ((w_1,\ldots ,w_m)\big ) =&\sum _{\gamma _{i,j} \in \Gamma } {\mathcal {I}}_{\{\gamma _{i,j}\}}\big ((w_1, \ldots , w_m)\big ) = \\ =&\sum _{\gamma _{i,j} \in \Gamma } -\log \Big ( P_{d\in {\mathcal {D}}}\Big ( {\gamma _{i,j}}(d)\ge {\gamma _{i,j}}(w_1, \ldots , w_m)\Big )\Big ) \ . \end{aligned}$$

Let \(\mathcal {S}\) be the set of all possible word sequences that form a message. In the previous formula, we have:

$$\begin{aligned}&P_{d\in {\mathcal {D}}}\Big ( {\gamma _{i,j}}(d)\ge {\gamma _{i,j}}(w_1, \ldots , w_m)\Big ) = \\&\quad = P_{(w_1', \ldots ,w_k') \in \mathcal {S}}\Big ( {\gamma _{i,j}}(w_1', \ldots , w_k') \ge {\gamma _{i,j}}(w_1, \ldots , w_m) \Big ) \ . \end{aligned}$$

Notice that \({\gamma _{i,j}}(w_1, \ldots , w_m)\) is equal to zero for all the sequences of the form \((w_1, \ldots , w_m)\) except for the sequences which verify that \(w_j = w_i\). Since by definition, \({\gamma _{i,j}}(.) \ge 0\), in the summation all the terms are null, except for the sequences which verify \(w_j = w_i\). In these cases, we have that \({\gamma _{i,j}}(w_1, \ldots , w_m) = 1\), and given that by definition of \({\gamma _{i,j}}(.)\), its maximum value is 1, we can say that \({\gamma _{i,j}}(w_1', \ldots , w_k') \ge 1\) is equivalent to \({\gamma _{i,j}}(w_1', \ldots , w_k') = 1\). Therefore, we have the next equality on probabilities:

$$\begin{aligned} P_{(w_1', \ldots ,w_k') \in \mathcal {S}}\Big ( {\gamma _{i,j}}(w_1', \ldots , w_k') = 1 \Big ) = P_{(w_1', \ldots ,w_k') \in \mathcal {S}}\Big ( w_j' = w_i \Big ) \ . \end{aligned}$$

And finally, with trivial algebraic operations, we have:

$$\begin{aligned} Perplexity(w_1,\ldots ,w_m)=2^{\frac{1}{m}{\mathcal {I}}_{\Gamma }\big ((w_1, \ldots , w_m)\big )} \ . \end{aligned}$$

\(\square \)
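A small numeric check of this relation, assuming a unigram language model so that the positional feature matching the word at each position contributes \(-\log P(w_j)\), as in standard unigram perplexity; the probabilities and the sentence are hypothetical.

```python
import math

P = {"the": 0.5, "battery": 0.3, "exploded": 0.2}   # hypothetical unigram probabilities
sentence = ["the", "battery", "exploded"]
m = len(sentence)

oiq = sum(-math.log2(P[w]) for w in sentence)               # I_Gamma((w_1,...,w_m))
perplexity = math.prod(P[w] for w in sentence) ** (-1 / m)  # standard unigram perplexity

assert abs(perplexity - 2 ** (oiq / m)) < 1e-9
print(round(perplexity, 4))
```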

Proposition 5.4

Considering the definition of Lin’s distance and assuming information additivity,

$$\begin{aligned} Lin(d_1, d_2) = \frac{ \displaystyle \sum _{w\in {d_1 \cap d_2}} {\mathcal {I}}_{\Gamma }\big (w\big )}{\displaystyle \sum _{w\in d_1} {\mathcal {I}}_{\Gamma }\big (w\big )+\displaystyle \sum _{w \in d_2} {\mathcal {I}}_{\Gamma }\big (w\big )} \ . \end{aligned}$$

Assuming feature independence, it is equivalent to:

$$\begin{aligned} \frac{{\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1) \cap {\mathcal {O}}_{\Gamma }(d_2)\big )}{{\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1)\big ) + {\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_2)\big )} \ . \end{aligned}$$

\(\square \)
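A minimal sketch of Lin's measure as written in Proposition 5.4: the information shared by two messages over the sum of their individual information. Here the informativeness of a word is estimated, as an assumption, by the negative log of its document frequency over a hypothetical toy collection.

```python
import math

collection = [set(d.split()) for d in
              ["battery fire phone", "battery ok phone", "nice camera", "camera fire"]]
N = len(collection)

def info(w):
    # -log2 of the fraction of messages containing w
    return -math.log2(sum(w in d for d in collection) / N)

def lin(t1, t2):
    d1, d2 = set(t1.split()), set(t2.split())
    shared = sum(info(w) for w in d1 & d2)
    return shared / (sum(info(w) for w in d1) + sum(info(w) for w in d2))

print(round(lin("battery fire phone", "battery ok phone"), 3))  # ~0.286
```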

Proposition 5.5

We will start from the \(IDF_{N\hbox {-}gram}\) term weighting of an n-gram g, using the notation \(|\phi (g)|\) for the message frequency of g and \(|\mu (g)|\) for the number of messages containing at least one subsequence of the n-gram g.

$$\begin{aligned} IDF_{N\hbox {-}gram}(g)= & {} \log \frac{|\mathcal {D}|}{|\mu (g)|} - \log \frac{|\mu (g)|}{|\phi (g)|} = \log \frac{|\mathcal {D}|}{|\mu (g)|} - \log \frac{|\mu (g)| \cdot |\mathcal {D}|}{|\phi (g)| \cdot |\mathcal {D}|} = \\= & {} \log \frac{|\mathcal {D}|}{|\mu (g)|} - \bigg ( \log \frac{|\mathcal {D}|}{|\phi (g)|} - \log \frac{|\mathcal {D}|}{|\mu (g)|} \bigg ) = 2 \cdot \log \frac{|\mathcal {D}|}{|\mu (g)|} - \log \frac{|\mathcal {D}|}{|\phi (g)|} \ . \end{aligned}$$

Considering both sets of features, \(\Gamma \) and \(\Gamma '\), the OIQ of an n-gram in each set of features is computed by:

$$\begin{aligned} {\mathcal {I}}_{\Gamma }\big (g\big )=\log \frac{|\mathcal {D}|}{|\phi (g)|} \ \ \ \wedge \ \ \ {\mathcal {I}}_{\Gamma '}\big (g\big )= \log \frac{|\mathcal {D}|}{|\mu (g)|} \ . \end{aligned}$$

Replacing these expressions in the definition of the \(IDF_{N\hbox {-}gram}\):

$$\begin{aligned} IDF_{N\hbox {-}gram}(g) = 2 \cdot {\mathcal {I}}_{\Gamma '}\big (g\big ) - {\mathcal {I}}_{\Gamma }\big (g\big ) \ . \end{aligned}$$

\(\square \)
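The rearrangement above is purely algebraic, so it can be checked directly from the counts; in the sketch below the values of \(|\mathcal {D}|\), \(|\phi (g)|\) and \(|\mu (g)|\) are hypothetical.

```python
import math

# phi(g): number of messages containing the n-gram g; mu(g): number of messages
# containing at least one of its subsequences (so mu >= phi). Counts are hypothetical.
D, phi, mu = 10_000, 40, 250

idf_ngram = math.log2(D / mu) - math.log2(mu / phi)      # original expression
rewritten = 2 * math.log2(D / mu) - math.log2(D / phi)   # form derived in the proof

assert abs(idf_ngram - rewritten) < 1e-12
print(round(idf_ngram, 4))
```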

Proposition 5.6

We will start from the \(\mu _{d}(\gamma _i)\) term weighting of a message.

$$\begin{aligned} \mu _{d}(\gamma _i)= & {} \log \frac{P(\gamma _i \in d \ | \ \mathcal {D}_1)}{P(\gamma _i \in d \ | \ \mathcal {D}_2)} = \log \frac{P(\gamma _i \in d \ | \ \mathcal {D}_1) \cdot |\mathcal {D}|}{P(\gamma _i \in d \ | \ \mathcal {D}_2) \cdot |\mathcal {D}|} = \\= & {} \log \frac{|\mathcal {D}|}{P(\gamma _i \in d \ | \ \mathcal {D}_2)} - \log \frac{|\mathcal {D}|}{P(\gamma _i \in d \ | \ \mathcal {D}_1)} \ . \end{aligned}$$

Considering two different scenarios regarding the relevance of messages, \(\mathcal {D}_1\) and \(\mathcal {D}_2\), the OIQ of a message is computed as:

$$\begin{aligned} {\mathcal {I}}_{\{\gamma _i\}}^{{\mathcal {D}}_1}\big (d\big )= \log \frac{|\mathcal {D}|}{P(\gamma _i \in d \ | \ \mathcal {D}_1)} \ \ \ \wedge \ \ \ {\mathcal {I}}_{\{\gamma _i\}}^{{\mathcal {D}}_2}\big (d\big )= \log \frac{|\mathcal {D}|}{P(\gamma _i \in d \ | \ \mathcal {D}_2)} \ . \end{aligned}$$

Replacing these expressions, we get:

$$\begin{aligned} \mu _{d}(\gamma _i) = {\mathcal {I}}_{\{\gamma _i\}}^{{\mathcal {D}}_2}\big (d\big ) - {\mathcal {I}}_{\{\gamma _i\}}^{{\mathcal {D}}_1}\big (d\big ) \ . \end{aligned}$$

\(\square \)
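A small numeric check of this identity with hypothetical values; note that the \(|\mathcal {D}|\) factor cancels in the difference, so the result is independent of the collection size.

```python
import math

D = 5_000                 # collection size (hypothetical)
p_given_D1 = 0.30         # P(gamma_i in d | D_1), hypothetical
p_given_D2 = 0.05         # P(gamma_i in d | D_2), hypothetical

mu_weight = math.log2(p_given_D1 / p_given_D2)   # log-likelihood-ratio weight
oiq_D1 = math.log2(D / p_given_D1)               # OIQ-style term for D_1
oiq_D2 = math.log2(D / p_given_D2)               # OIQ-style term for D_2

assert abs(mu_weight - (oiq_D2 - oiq_D1)) < 1e-9
print(round(mu_weight, 4))
```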


About this article


Cite this article

Giner, F., Amigó, E. & Verdejo, F. Integrating learned and explicit document features for reputation monitoring in social media. Knowl Inf Syst 62, 951–985 (2020). https://doi.org/10.1007/s10115-019-01383-w
