
Dynamic wavelet fingerprint for differentiation of tweet storm types

  • Original Article
  • Social Network Analysis and Mining

Abstract

We describe a novel method for analyzing topics extracted from Twitter by utilizing the dynamic wavelet fingerprint technique (DWFT). Topics are derived from the seven tweet storms analyzed in this study using a dynamic topic model. Using the time series of each topic, we run DWFT analyses to get a two-dimensional, time-scale, binary image. Gaussian mixture model clustering is used to identify individual objects, or storm cells, that are characteristic of specific local behaviors commonly occurring in topics. The DWFT time series transformation is volume agnostic, meaning we can compare tweet storms of different intensities. We find that we can identify behavior, localized in time, that is characteristic of how different topics propagate through Twitter. The use of dynamic topic models and the DWFT creates the basis for future applications as a real-time Twitter analysis system for flagging fake news.



Notes

  1. The free Twitter API only allows users to pull tweets from the past 7 days.


Acknowledgements

We would like to acknowledge Dr. William Fehlman for innumerable conversations about topic modeling and its many applications, particularly to Twitter data. This work was performed [in part] using computing facilities at the College of William and Mary which were provided by contributions from the National Science Foundation, the Commonwealth of Virginia Equipment Trust Fund and the Office of Naval Research.

Author information


Corresponding author

Correspondence to Mark K. Hinders.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix: Data sets

1.1 Brett Kavanaugh confirmation hearings

The confirmation hearings of Supreme Court Justice Brett Kavanaugh were a highly partisan and contentious affair. Shortly after President Donald Trump tapped Kavanaugh as his choice to fill an open seat on the Supreme Court, many people, particularly on the political left, were angry. Kavanaugh, known as a conservative judge, was replacing Justice Anthony Kennedy, who had been a swing vote for most of his 30 years on the court. During the confirmation hearings, an accusation of sexual assault from Kavanaugh's time in high school emerged. The accusation from Dr. Christine Blasey Ford was impossible to verify because the alleged assault occurred in the 1980s (Abramson 2018; Hanson 2018; Sorkin 2018). Against the background of the #MeToo movement, a Twitter firestorm emerged. Many on the left, most of whom would have already been against Kavanaugh, argued that he was unfit to be on the Supreme Court due to these allegations as well as his demeanor in the hearings. On the right, many argued that there was no concrete evidence proving the allegations, so Kavanaugh should not be treated as if he were guilty.

Fig. 16

Time series of the tweet volume for the Brett Kavanaugh confirmation hearings. Important points in the time series are marked A, B, and C. A represents the spike in volume when the Senate passed the procedural vote to continue with Kavanaugh’s confirmation process. B represents the spike in volume when Senator Susan Collins of Maine announced she would vote to confirm Kavanaugh. Finally, C represents the spike in volume when Kavanaugh was officially confirmed to the Supreme Court. Time steps are in 5-min intervals, represented on the y axis. The initial interval occurs on October 4, 2018, 23:45 GMT, and the final interval occurs at midnight GMT October 8, 2018

Figure 16 shows the time series of tweets through the three days of Kavanaugh's confirmation hearings in the Senate. Three key spikes occur in the time series, labeled A, B, and C. The first two occur on October 5th in quick succession. First, the Senate passed the procedural vote to proceed with the confirmation process; this spike is labeled A. Three hours later, Senator Susan Collins of Maine, seen as a swing vote in the confirmation process, announced that she would vote in favor of confirming Kavanaugh; this spike is labeled B. Finally, on October 6th, the largest spike occurs, labeled C, when the Senate officially confirmed Kavanaugh to fill Anthony Kennedy's seat on the Supreme Court.

It is important to note that this dataset filtered out all retweets. The data were collected using the Python Tweepy module by streaming tweets that included the terms ‘Kavanaugh’ and ‘Supreme Court’ (Tweepy 2017).
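
As a rough illustration, a minimal sketch of this kind of filtered stream, assuming the Tweepy 3.x StreamListener API that was current at the time, might look like the following; the credentials and in-memory storage are placeholders.

import tweepy

class StorageListener(tweepy.StreamListener):
    """Collect incoming statuses in memory (a real collector would write to disk)."""
    def __init__(self):
        super().__init__()
        self.tweets = []

    def on_status(self, status):
        self.tweets.append(status)

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. on HTTP 420 (rate limited).
        return status_code != 420

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")   # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth=auth, listener=StorageListener())
stream.filter(track=["Kavanaugh", "Supreme Court"], languages=["en"])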

1.2 Freddie Gray riots

On April 19, 2015, Freddie Gray died from spinal injuries sustained while in Baltimore police custody. Gray suffered spinal cord injuries during a ride in a police van in which he was handcuffed but not buckled in, leaving him unable to brace himself when he fell during rough portions of the ride (Mathis-Lilley 2015; McCarthy 2016). There were questions about why the Baltimore police arrested him as well as about his treatment while in custody. The questionable arrest coupled with the apparent mistreatment of Gray in custody was enough to cause an uproar, and, set in the context of other recent police killings of people of color such as Tamir Rice and Eric Garner, it sparked riots in the streets of Baltimore and protests across the nation (MacGillis 2019; Wallace-Wells 2016). This reignited ongoing feuds between groups who either defended the victims as being wrongfully treated or defended the police as trying to do their job. Much of this played out through protests, but it can also be seen playing out on Twitter. American University's Center for Media and Social Impact (CMSI) did a comprehensive study on hashtag activism, in particular hashtag movements that centered around Black Lives Matter in 2014 and 2015 (Freelon et al. 2016). CMSI published their dataset, which is what we use in this study. We specifically focus on the tweets included in the CMSI dataset from April 25, 2015 until April 30, 2015 with time steps of 5 min. We filtered the dataset to ensure we were only including tweets relevant to Freddie Gray, using the filter terms: Freddie, Gray, #Baltimore, #Baltimoreriots, #Baltimoreuprising, #freddiegray, and Baltimore. This leads to a total of about 811,000 tweets over the 5-day span, shown in Fig. 17.
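
A minimal sketch of this keyword filter, assuming the CMSI tweets have been loaded into a pandas DataFrame with a 'text' column (a hypothetical layout and file name), might be:

import pandas as pd

TERMS = ["freddie", "gray", "#baltimore", "#baltimoreriots",
         "#baltimoreuprising", "#freddiegray", "baltimore"]

def is_relevant(text):
    # A tweet is kept if its text mentions any of the Freddie Gray filter terms.
    lowered = str(text).lower()
    return any(term in lowered for term in TERMS)

tweets = pd.read_csv("cmsi_tweets.csv")                 # hypothetical file name
relevant = tweets[tweets["text"].apply(is_relevant)]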

Fig. 17

Time series plot of tweet volume over the 5 day period of tweet collection for the Freddie Gray riots. Data acquired through American University’s Center for Media and Social Impact (Freelon et al. 2016). Time steps are in 5-min intervals, represented on the y axis. The initial interval occurs at midnight GMT, April 25, 2015, and the final interval occurs on April 30, 2015 at 4:55 GMT

1.3 Michael Cohen testimony and North Korea summit

On February 27, 2019, two major events in American politics coincided. Beginning in the late morning, Michael Cohen, President Donald Trump's former personal lawyer, testified to the House Oversight and Reform Committee about his work with Trump before and during the election. Much of this testimony included accusations of criminal and unethical activity by Donald Trump and his family. Later in the afternoon, President Trump held a summit with Kim Jong Un, the Supreme Leader of North Korea. The summit ended with President Trump walking away from the table with no deal between the two nations, and with vastly different interpretations of the day's events depending on one's political leanings (Glasser 2019; Gunia 2019; Raleigh 2019).

A dataset consisting of roughly 5 million tweets over the 24-hour period in which both of these events took place was streamed using the Python module Tweepy (2017) and filtered using the terms: Trump, Cohen, North Korea, Hanoi, and Kim. It is clear that the time series in Fig. 18 does not have the same rhythmic nature as Figs. 16 and 17. There are a few reasons for this. First, those datasets span multiple days; thus, there is a natural rhythm of night and day that affects tweet volume. Second, we added the term 'Trump' to the query items, which is too general and caused the streamer to be rate limited to roughly 15,000 tweets per 5 min. Twitter does not publish what the actual rate limit is, but the flat line right around 15,000 over most of the time series is a good indicator that we were capped at this volume.

Fig. 18

Time series plot of tweet volume over the 27.5 h of streaming tweets during the Michael Cohen hearing and Trump's North Korea summit. Twitter's streaming API implements a rate limit on streamers; while Twitter does not publish what the rate limit is, the flat line around 15,000 for most of the streaming time is a good indicator that our streamer was rate limited. Time steps are in 5-min intervals, represented on the y axis. The initial interval occurs at 13:55 GMT on February 27, 2019, and the final interval occurs on February 28, 2019 at 17:30 GMT

1.4 Winter Olympics

The Winter Olympics were held in February 2018 in Pyeongchang, South Korea. The games featured many stars of their respective sports, both established and new, who captivated the audience. Harvard's Dataverse has published a dataset of about 13 million tweets that contained one of the hashtags #olympics, #pyeongchang2018, or #winterolympics, or the Korean hashtag that translates to "Pyeongchang Winter Olympics" (Littman 2018b).

Fig. 19

Time series of tweet volume over the 28-day span of the 2018 Winter Olympics. Data were obtained from the Harvard Dataverse repository (Littman 2018b). Time steps are in 5-min intervals, represented on the y axis. The initial interval occurs at 13:55 GMT on January 31, 2018, and the final interval occurs on February 27, 2018 at 5:00 GMT

Figure 19 shows what the time series of tweet volume looks like over the 28-day time period. There is periodicity in the tweet volume, likely due to how the Olympics are aired on tape delay during prime time (8 p.m.–11 p.m. Eastern) every night in the US. The tweet volume begins to spike on February 9th, which coincides with the opening ceremonies. After that, there are regular spikes every day until the closing ceremonies on February 25.

1.5 Charlottesville riots

On August 11 and 12 of 2017, a white supremacist group held its Unite the Right rally in Charlottesville, VA. The rally was met with thousands of protesters in the streets of Charlottesville, culminating in the killing of a protester named Heather Heyer (Barone 2017; Green 2017; Heim 2017).

Fig. 20

Time series plot of tweets referring to the riots in Charlottesville, VA in August 2017. Three important points in the time series are labeled A, B, and C. A shows the activity on the evening before the rally. B shows the activity during the rally. C shows the activity on the day after the rally. Tweet IDs were published through Harvard Dataverse (Littman 2018a). Time steps are in 5-min intervals, represented on the y axis. The initial interval occurs at 12:45 GMT, on August 8, 2017, and the final interval occurs on August 14, 2017 at 16:35 GMT

Harvard Dataverse published a set of Tweet IDs about this event (Littman 2018a). They searched Twitter for tweets containing the hashtags #charlottesville, #standwithcharlottesville, #defendCville, #HeatherHeyer, or #UnityCville. Altogether they produced about 3.5 million tweets, shown as a time series in Fig. 20. The dataset spans the few days leading up to the rally until the day after. Of the three peaks seen in the data, the first, labeled A, occurs on the evening of August 11, when the rally goers marched through the University of Virginia campus carrying tiki torches. The next peak, labeled B, corresponds to the march and ensuing riots throughout the day of August 12. Finally, the last peak, labeled C, comes from the residual fallout of the day and ongoing discussions on social media about the events that occurred.

1.6 Mueller report

After the 2016 U.S. presidential election, Deputy Attorney General Rod Rosenstein appointed Robert Mueller as Special Counsel to investigate any collusion the Trump campaign might have had with Russia to aid in his election. The Special Counsel also looked into the possibility that Donald Trump obstructed justice by attempting to derail the investigation (Berenson and Abramson 2019; Cassidy 2019; Ross 2019). After almost two years of work, Mueller delivered his report to Attorney General William Barr on March 22, 2019 (Mueller III 2019). Given the great interest in the Mueller investigation over those two years and the divisive nature of the Trump administration, this set off a major tweet storm. Many on the left were hoping this would lead to an indictment of President Trump and cause the Democrat-led House of Representatives to begin impeachment proceedings, while many on the right were hoping to see President Trump exonerated of all charges.

Figure 21 shows the time series of the tweet volume over the weekend the Mueller Report was submitted.

Fig. 21

Time series of tweet volume in response to the release of the Mueller Report. The stream began when it was announced that Mueller had turned his report over to Attorney General William Barr. Two days later, a plateau occurs when Barr published a letter to Congress announcing that Trump was exonerated of any collusion with Russia, but that there was not enough evidence to either exonerate or indict him on obstruction of justice. The sharp dip around the afternoon of March 23rd is due to an error in the stream requiring it to be reset. In total we streamed about 8 million tweets over a roughly three-and-a-half-day period. Time steps are in 5-min intervals, represented on the y axis. The initial interval occurs at 22:05 GMT on March 22, 2019, and the final interval occurs on March 26, 2019 at 12:15 GMT

On the afternoon of March 22nd, it was announced that Mueller had turned his report over to William Barr. The findings of the report were not made public until Barr sent a letter to Congress on the afternoon of March 24th summarizing what Robert Mueller put in his report. The actual report Mueller submitted to Barr was not released to the public over the duration of the dataset. In his letter, Barr said that Mueller had exonerated President Trump of any collusion with Russia; however, there was not enough evidence to either indict or exonerate President Trump on obstruction of justice. In Fig. 21, the data began streaming almost as soon as it was announced that Mueller had submitted his report to Barr. After this there is the normal variation of tweet volume until Barr sent his letter to Congress; the sharp dip in data shortly after noon on March 23rd is because the tweet stream needed to be reset. There is a sharp spike and plateau around the evening of March 24th when Barr submitted his letter. The plateau is likely due to rate limits on Twitter's API.

1.7 World Cup

Fig. 22

Time series for tweets referencing the World Cup from June 16 through June 18. Time steps are in 5-min intervals, represented on the y axis. The initial interval occurs at 6:30 GMT, on June 16, 2018, and the final interval occurs on June 18, 2018 at 18:35 GMT

Soccer is the most popular sport in the world, and the World Cup brings the greatest soccer players into one competition to play for national pride. We wanted to analyze the volume of English-language tweets about the first few days of the Group Stage of the 2018 World Cup. Within this time frame there were some memorable games, such as Mexico's surprising victory over the defending World Cup champions Germany and France's late victory over the underdog Australia (Allen 2018; Goff and Wallace 2018). Figure 22 shows the time series of tweet volume for the first few days of the World Cup. In total we were able to stream 15,936 tweets from those days. All tweets were gathered using Twitter's API and Tweepy (2017). The low volume is likely due to the absence of the United States, which failed to qualify; by filtering for English tweets only, most of our data came from the United States, where there was less interest in the competition. However, this lower volume leads to an interesting comparison of tweet storms based on overall tweet volume.

DTM details

1.1 U,V initialization

The U and V matrices are initialized using the method of Saha and Sindhwani (2012). To initialize the U(t) matrix,

$$\begin{aligned} U_\mathrm{init} = [U(t-1), U_\mathrm{emerge}] \end{aligned}$$
(28)

where \(U(t-1)\) is the set of all non-faded topics from the previous time step, and \(U_\mathrm{emerge}\) is an \(M\times k_\mathrm{emerge}\) matrix with random, non-negative entries, where \(k_\mathrm{emerge}\) is a parameter setting the number of topics to add each time step. To initialize V(t),

$$\begin{aligned} V_\mathrm{init} = \begin{bmatrix} V_{11}&V_{12}\\ V_{21}&V_{22} \end{bmatrix} \end{aligned}$$
(29)

where \(V_{11}\) represents old documents and old topics, \(V_{12}\) is new documents and old topics, \(V_{21}\) is old documents and new topics, and \(V_{22}\) is new documents and new topics. Both \(V_{12}\) and \(V_{22}\) are randomly initialized, as we have no assumption about what topics the new documents will contain. \(V_{11}\) is initialized as \(V(t-1)\), and \(V_{21}\) is initialized to all 0s because we have already derived the topic distribution for those documents. On the first time step the model is run, both U and V are entirely randomly initialized. All random initializations are drawn from normal distributions of non-negative numbers with mean equal to the average value of D(w) divided by the total number of topics. This method of random initialization is used in the NMF implementation of Pedregosa et al. (2011).
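
A minimal numpy sketch of this initialization, assuming V is stored with topics as rows and documents as columns (consistent with removing a column of U and a row of V per topic in Sect. 1.3) and using an illustrative scale for the random draws, might be:

import numpy as np

def init_factors(U_prev, V_prev, n_new_docs, k_emerge, d_mean, rng=None):
    """U_prev: (M, k_old) surviving topics; V_prev: (k_old, N_old) topic-document weights.
    d_mean is the average value of D(w) divided by the total number of topics."""
    rng = rng or np.random.default_rng()
    M, k_old = U_prev.shape

    def rand(shape):
        # Non-negative, normally distributed entries centered on d_mean (scale is illustrative).
        return np.abs(rng.normal(loc=d_mean, scale=d_mean, size=shape))

    # Eq. (28): append k_emerge random topic columns to the surviving topics.
    U_init = np.hstack([U_prev, rand((M, k_emerge))])

    # Eq. (29): block structure with topics as rows and documents as columns.
    V11 = V_prev                                    # old topics, old documents
    V12 = rand((k_old, n_new_docs))                 # old topics, new documents
    V21 = np.zeros((k_emerge, V_prev.shape[1]))     # new topics, old documents (already derived, so zeros)
    V22 = rand((k_emerge, n_new_docs))              # new topics, new documents
    V_init = np.block([[V11, V12], [V21, V22]])
    return U_init, V_init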

1.2 Topic streams

Topics are tracked through time using topic streams. There are two topic streams: the evolving topic stream and the faded topic stream. Each entry in a stream contains information about that specific topic, such as the topic terms, weights, coherence, the number of tweets mentioning the topic, the time stamp the topic began, and the time stamp the topic faded. The number-of-tweets entry gives the raw number of tweets with a nonzero entry in the V matrix, which is what is used for time series representations of topics. Coherence is calculated using Eq. (26).
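
One way to picture a topic-stream entry is as a simple record; the field names below are illustrative rather than the authors' actual implementation.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TopicStreamEntry:
    terms: List[str]                                        # top topic terms
    weights: List[float]                                    # corresponding term weights
    coherence: float                                        # topic coherence from Eq. (26)
    tweet_counts: List[int] = field(default_factory=list)   # tweets with a nonzero V entry, per time step
    t_start: Optional[int] = None                           # time stamp the topic began
    t_faded: Optional[int] = None                           # time stamp the topic faded (None while evolving)

evolving_stream: List[TopicStreamEntry] = []
faded_stream: List[TopicStreamEntry] = []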

1.3 Removing fading topics

Before updating the model, all fading topics need to be eliminated from U and V. A fading topic is defined as a topic that is no longer representative of the documents in D(w). A topic is no longer representative of D(w) when less than some predefined percentage of tweets mention that topic, in our case 0.5% of the tweets in D(w). Once a topic is said to have faded, its entry in the active topic stream is moved to the faded topic stream and the corresponding column in U and row in V are removed.
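
A minimal sketch of this fading check, assuming the same topics-by-documents layout of V used above, might be:

import numpy as np

def remove_faded(U, V, evolving, faded, threshold=0.005):
    """Move topics mentioned by fewer than `threshold` of tweets in D(w) to the faded stream."""
    n_tweets = V.shape[1]
    mention_frac = (V > 0).sum(axis=1) / n_tweets        # fraction of tweets mentioning each topic
    keep = mention_frac >= threshold

    for idx in np.where(~keep)[0]:
        faded.append(evolving[idx])                      # archive the faded topic-stream entry
    evolving[:] = [entry for entry, k in zip(evolving, keep) if k]

    # Drop the corresponding column of U and row of V.
    return U[:, keep], V[keep, :], evolving, faded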

1.4 Checking emerging topics

Faded topics are saved so they can be compared to future emerging topics. A topic that has faded can become an evolving topic again if the two are similar enough; this is referred to as a reemerging topic (Brüggermann et al. 2016). Cosine similarity is one method to measure the similarity between the two topics,

$$\begin{aligned} 1 - \cos \theta = 1-\frac{v_\mathrm{emerging}\cdot v_\mathrm{faded}}{||v_\mathrm{emerging}|| \cdot ||v_\mathrm{faded}||} \end{aligned}$$
(30)

where \(v_\mathrm{faded}\) is the term-topic vector for the faded topic and \(v_\mathrm{emerging}\) is the term-topic vector for the emerging topic. However, the vocabulary in the model is updated in time, and terms no longer in use are dropped to save memory. This means that topic vectors from one point in time cannot be compared to topic vectors at another point, because the entries will correspond to different vocabulary terms. To combat this we save the top n terms from the topic vector at each time step, where n is usually 10. Topic similarity is then calculated by comparing the top terms of two different topics. If they have enough terms in common, i.e., 8 out of the top 10 are the same, then they are considered the same topic and the faded topic is reclassified as a reemerged topic.
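
A minimal sketch of the top-term comparison, using the 8-of-10 overlap from the example above as the threshold, might be:

def is_reemerging(emerging_terms, faded_terms, min_overlap=8):
    """Compare the top-n term lists of an emerging and a faded topic."""
    return len(set(emerging_terms) & set(faded_terms)) >= min_overlap

# Hypothetical usage: scan the faded stream for a match to a newly emerged topic.
def find_reemerged(emerging_terms, faded_stream):
    for entry in faded_stream:
        if is_reemerging(emerging_terms, entry.terms[:10]):
            return entry
    return None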

Feature extraction

DWFT creates a black-and-white image with objects that look similar to human fingerprints. Humans are adept at finding patterns in images, so utilizing wavelet fingerprints of time series allows us to use our own acuity to identify where patterns are and what features might be important for identifying those patterns. Before feature extraction we need to identify each individual object in the fingerprint. Figure 3 shows the feature extraction process; each shade of gray at the bottom left of Fig. 3 represents a different object. Identifying objects becomes important when extracting features for analysis. To do this we use 8-connectivity (Bertoncini 2010), which identifies groups of nonzero pixels touching each other at any point and gives them a common label. Due to the nature of fingerprints, inner ridges do not always touch outer ridges; though they represent the same object, they are labeled as two different objects. Thus, we check each object in a fingerprint to ensure it is not surrounded by another object. If it is, both objects are relabeled to be the same.
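
A hedged sketch of this labeling step, using skimage's connected-component labeling with connectivity=2 (8-connectivity) and a fill-holes test for the surrounded-object check, might be:

import numpy as np
from scipy import ndimage
from skimage.measure import label

def label_fingerprint_objects(binary_img):
    labels = label(binary_img, connectivity=2)            # 8-connected components
    for obj in range(1, labels.max() + 1):
        for outer in range(1, labels.max() + 1):
            if outer == obj:
                continue
            # If every pixel of `obj` falls inside the hole-filled region of `outer`,
            # it is an inner ridge of the same fingerprint and takes `outer`'s label.
            filled = ndimage.binary_fill_holes(labels == outer)
            if np.all(filled[labels == obj]):
                labels[labels == obj] = outer
                break
    return labels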

Feature extraction from wavelet fingerprints follows Bertoncini (2010) and Dieckman (2014). Let I(a, b) represent the binary image matrix for a wavelet fingerprint, where a is the scale coordinate and b is the translation coordinate, and let P be a \(2 \times N\) matrix that represents all nonzero pixels in I(a, b), with P(b, i) representing the b value for the ith entry and P(a, i) representing the a value of the ith entry. The first features extracted are the parameters of the ellipse that most closely matches the shape of the fingerprint. To calculate these we use the formula for central moments given by

$$\begin{aligned} \mu _{p, q} = \sum _x \sum _y (x - \bar{x})^p (y - \bar{y})^q f(x,y) \end{aligned}$$
(31)

where \((\bar{x}, \bar{y})\) is the center of the object and f(x, y) is the value of the pixel in the image. Since the image being analyzed is a binary image, (31) can be simplified to

$$\begin{aligned} \mu _{p, q} = \sum _i (P(b, i) - c_b)^p (P(a, i) - c_a)^q \end{aligned}$$
(32)

where (\(c_a, c_b\)) is the location of the centroid of the wavelet fingerprint as calculated by

$$\begin{aligned} \begin{aligned} c_a = \frac{1}{N} \sum _{i=1}^{N} P(a, i) \\ c_b = \frac{1}{N} \sum _{i=1}^{N} P(b, i). \end{aligned} \end{aligned}$$
(33)

Using the central moments of the fingerprint, the properties of the ellipse can be found using

$$\begin{aligned} \begin{aligned}&x_\mathrm{maj} = \sqrt{2}\sqrt{\mu _{2,0} + \mu _{0,2} + \gamma } \\&x_\mathrm{min} = \sqrt{2}\sqrt{\mu _{2,0} + \mu _{0,2} - \gamma } \\&\hbox {ecc} = \sqrt{1 - \frac{x_\mathrm{min}^2}{x_\mathrm{maj}^2}} \\&\theta = \frac{1}{2}\arctan \frac{\mu _{1,1}}{\mu _{2,0} - \mu _{0,2}} \end{aligned} \end{aligned}$$
(34)

where

$$\begin{aligned} \gamma = \sqrt{\mu _{1,1}^2 + (\mu _{2,0} - \mu _{0,2})^2}, \end{aligned}$$
(35)

\(x_\mathrm{maj}\) is the semimajor axis, \(x_\mathrm{min}\) is the semiminor axis, ecc is the eccentricity, and \(\theta \) is the orientation angle of the ellipse. After the ellipse is derived for the fingerprint, degree 2 and degree 4 polynomials are calculated using the polyfit function in the Numpy library in Python (Oliphant 2006). To calculate the polynomial coefficients, the outermost values of the wavelet fingerprint are found. If there are multiple outer b values for a single a value, then the lowest value for a is used as the outer point for the polynomial fit. The image on the right side of Fig. 3 shows the ellipse (red), degree 2 (blue), and degree 4 (green) polynomials fit to a single object in the fingerprint shown at the bottom left of Fig. 3; the object is the one centered near \(b = 475\) in the fingerprint.
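
A hedged sketch of these shape features for a single labeled object, using skimage's regionprops for the moment-based ellipse quantities and np.polyfit for the boundary polynomials (our reading of the outer-boundary rule is an assumption), might be:

import numpy as np
from skimage.measure import regionprops

def shape_features(object_mask):
    props = regionprops(object_mask.astype(int))[0]
    x_maj = props.major_axis_length          # ellipse quantities of Eq. (34), up to regionprops' conventions
    x_min = props.minor_axis_length
    ecc = props.eccentricity
    theta = props.orientation

    # Outermost (smallest a) pixel for each translation b, then degree-2 and degree-4 fits.
    a_idx, b_idx = np.nonzero(object_mask)
    outer_b = np.unique(b_idx)
    outer_a = np.array([a_idx[b_idx == b].min() for b in outer_b])
    p2 = np.polyfit(outer_b, outer_a, deg=2)
    p4 = np.polyfit(outer_b, outer_a, deg=4)
    return [x_maj, x_min, ecc, theta, *p2, *p4]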

Features based on the area of the fingerprint are also calculated. First is the area of the fingerprint, A, which is simply the number of on pixels in the fingerprint image. Next is the area of the bounding box, \(A_{BB}\), the box that completely surrounds the wavelet fingerprint in the image space. These measures are used to calculate the ratio of the area of the fingerprint to the area of the bounding box, also known as the extent

$$\begin{aligned} E_x = \frac{A}{A_{BB}}. \end{aligned}$$
(36)

Filled area is the total number of nonzero pixels in I(a, b) if all the holes inside the fingerprint are set to one. Convex image area \(A_\mathrm{C}\) is defined as the area of the smallest convex polygon that can contain the fingerprint; this is calculated using the skimage library in Python (van der Walt et al. 2014). Solidity is the ratio of the area of I(a, b) to the area of the convex image

$$\begin{aligned} s = \frac{A}{A_\mathrm{C}}. \end{aligned}$$
(37)
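
A minimal sketch of these area-based features, assuming skimage's regionprops naming, might be:

from skimage.measure import regionprops

def area_features(object_mask):
    props = regionprops(object_mask.astype(int))[0]
    A = props.area                         # number of on pixels in the object
    A_bb = props.bbox_area                 # bounding-box area
    extent = A / A_bb                      # Eq. (36)
    filled_area = props.filled_area        # area with internal holes filled
    A_c = props.convex_area                # smallest convex polygon containing the object
    solidity = A / A_c                     # Eq. (37)
    return [A, A_bb, extent, filled_area, A_c, solidity]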

A topological feature called the Euler number is also added. The Euler number measures the difference between the number of items and the number of holes in an image (Pratt 2013). It is calculated as

$$\begin{aligned} E = \frac{1}{4} \left( n\{Q_1\} - n\{Q_3\} - 2n\{Q_D\}\right) \end{aligned}$$
(38)

where \(n\{Q_i\}\) represents the number of bit quads, or \(2\times 2\) segments of the fingerprint I(a, b), that have i nonzero entries. \(Q_D\) is a special type of \(Q_2\) bit quad in which the nonzero entries lie on either diagonal

$$\begin{aligned} Q_D = \begin{bmatrix} 1&0 \\ 0&1 \end{bmatrix} \quad \text {or}\quad \begin{bmatrix} 0&1 \\ 1&0 \end{bmatrix} . \end{aligned}$$
(39)
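
A hedged numpy sketch of the bit-quad counting in Eqs. (38) and (39) might be:

import numpy as np

def euler_number(binary_img):
    # Pad with a zero border so boundary pixels form complete 2x2 quads.
    I = np.pad(binary_img.astype(int), 1)
    a = I[:-1, :-1]; b = I[:-1, 1:]; c = I[1:, :-1]; d = I[1:, 1:]   # the four corners of every quad
    counts = a + b + c + d
    n_q1 = np.count_nonzero(counts == 1)
    n_q3 = np.count_nonzero(counts == 3)
    n_qd = np.count_nonzero((counts == 2) & (a == d))                # the two diagonal patterns of Eq. (39)
    return (n_q1 - n_q3 - 2 * n_qd) / 4.0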

Three more features are added onto the feature vector: \(c_a\), the length in time of the object, and the diameter of a circle with the same area as the fingerprint, calculated by

$$\begin{aligned} D = \sqrt{\frac{4 N}{\pi }}. \end{aligned}$$
(40)

This gives a total feature vector of length 21 for each fingerprint.

Lastly, we want to create a set of features describing the gradient of the object. To do this we use Histograms of Oriented Gradients (HOG) (Dalal and Triggs 2005). HOG calculates the gradient of an image at all points and then pools the gradients into bins to create a histogram describing the distribution of gradients over some window. For our application we are using HOG features to describe the shape of an object in a fingerprint I(a, b).

Usually the first step in calculating HOG features is to normalize the image, but since I(a, b) is a binary image, normalization will have no effect. For us, the first step is to calculate the gradients. A kernel is used in image convolution to calculate the gradient at each point by defining two matrices \(g_x\) and \(g_y\), both of the same shape as I, where

$$\begin{aligned} \begin{aligned} g_x(a, b)&= [-\,1, 0, 1] \cdot I(a, b-1:b+1) \\ g_y(a, b)&= [-\,1; 0; 1] \cdot I(a-1:a+1, b). \end{aligned} \end{aligned}$$
(41)

If in either case one of the indices goes out of bounds on I, then it is set to the value of the nearest pixel. Then the gradient and angle can be calculated at each point by

$$\begin{aligned} \begin{aligned} G(a, b)&= \sqrt{g_x(a, b)^2 + g_y(a, b)^2} \\ \varTheta (a, b)&= \tan ^{-1} \frac{g_y(a, b)}{g_x(a, b)} \end{aligned} \end{aligned}$$
(42)

where, again, G and \(\varTheta \) are both of the same dimension as I. The final step is to create the histograms. In traditional HOG a window, w, is defined and a weighted histogram is computed for every \(w\times w\) window. However, this requires all images to have the same shape, which is too restrictive for our case: either we would routinely cut off information from objects by enforcing a limit on T, or we would leave too much empty space in images, leading to too much useless information. So we create one histogram for all pixels in I(a, b). There are two different methods for binning gradients, signed and unsigned. Signed bins gradient vectors for all angles from 0 to \(2\pi \), while unsigned only goes from 0 to \(\pi \) and antiparallel gradient vectors are placed in the same bin; i.e., a gradient of \(\pi /2\) is binned with \(-\,\pi /2\). Due to the nature of binary images, in the unsigned case there are only four possible angles: 0, \(\pi /4\), \(\pi /2\), and \(3\pi /4\). Thus, we define a four-dimensional HOG feature vector \(\mathbf {h}\), with one entry for each possible angle. For each instance of one of these angles, the corresponding magnitude in G(a, b) is added to the matching bin in \(\mathbf {h}\). All HOG vectors, \(\mathbf {h}\), are then normalized by the total time, T, of the given object to ensure all HOG feature values are weighted similarly. We then append \(\mathbf {h}\) to the full feature vector for I(a, b).
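
A minimal sketch of this single-histogram, unsigned HOG computation for a binary object image might be:

import numpy as np

def binary_hog(binary_img, total_time):
    I = binary_img.astype(float)
    padded = np.pad(I, 1, mode="edge")                 # out-of-bounds indices take the nearest pixel
    g_x = padded[1:-1, 2:] - padded[1:-1, :-2]         # [-1, 0, 1] along b, Eq. (41)
    g_y = padded[2:, 1:-1] - padded[:-2, 1:-1]         # [-1, 0, 1] along a
    G = np.hypot(g_x, g_y)                             # gradient magnitude, Eq. (42)
    theta = np.mod(np.arctan2(g_y, g_x), np.pi)        # unsigned angles in [0, pi)

    # Only four unsigned angles occur for a binary image: 0, pi/4, pi/2, 3pi/4.
    bins = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    h = np.array([G[np.isclose(theta, angle)].sum() for angle in bins])
    return h / total_time                              # normalize by the object's length in time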

A feature vector of dimension m, where m is the total number of features, is derived for every object with more than 200 nonzero pixels. The value 200 was selected because many small objects represent noise in the data, and below about 200 pixels it was difficult to fit well-defined polynomials to the objects. All feature vectors are then combined into the matrix \(F \in \mathbb {R}^{O \times m}\), where O is the total number of objects. Each vector \(\mathbf {f}_o\) in F will be clustered to find the predominant types of objects created in tweet storms.
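
A hedged sketch of the subsequent Gaussian mixture model clustering of F (the number of components here is purely illustrative) might be:

from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def cluster_objects(F, n_components=6, random_state=0):
    """F is the O x m matrix of object feature vectors; returns one cluster label per object."""
    X = StandardScaler().fit_transform(F)              # put heterogeneous features on a common scale
    gmm = GaussianMixture(n_components=n_components, random_state=random_state)
    return gmm.fit_predict(X)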


About this article


Cite this article

Kirn, S.L., Hinders, M.K. Dynamic wavelet fingerprint for differentiation of tweet storm types. Soc. Netw. Anal. Min. 10, 4 (2020). https://doi.org/10.1007/s13278-019-0617-3

