
1 Introduction

The human visual system perceives the surrounding environment by processing information in the visible spectrum. Light reflected from the scene is sensed by the eyes, and the brain interprets it through a complex network of neurons, receptors, and other specialized cells. Visualizing motion is the process of interpreting the speed and direction of small particles or objects in a given scene. Human eyes can perceive the motion of objects only when it is sufficiently large. However, motion that cannot be seen with the naked eye may still be important and may reveal information hidden in the scene. Video motion magnification has been an active research area over the past few years; it amplifies imperceptible object motion and generates a synthetic video in which the small motions become visible to the eyes.

Wu et al. proposed temporal-filtering based motion magnification, called the Eulerian method [19], as an improvement over the Lagrangian method based on motion tracking [7]. The terminology follows the Lagrangian and Eulerian perspectives of fluid dynamics. Most prior work on motion magnification applies a uniform magnification to the entire scene. The proposed method instead considers only selected regions of interest in a video, those that exhibit imperceptible motion and therefore need to be magnified. This approach is also shown to reduce noise and remove outliers in the generated synthetic magnified video. Consequently, the input video is not constrained to specific conditions, such as containing a single object, and any given video can be processed to magnify the regions of interest.

The approach is not completely automatic and requires user intervention to specify the object of interest. It is challenging because it consists of several steps, and an error incurred in one step can propagate to an erroneous output; hence, an appropriate method must be chosen at each step. Recently, researchers have used Eulerian motion magnification in tremor assessment [5] and endoscopic surgery [8]. Interest region based motion magnification can be very helpful in such applications to magnify a particular region or object.

The primary contributions of the proposed work are listed below.

  1. Interest region based motion magnification is performed for a given video containing objects that exhibit imperceptible motion.

  2. In addition, we show that noise from other sources present in the magnified region is reduced.

  3. The approach is shown to work on videos of different natural scenes with objects exhibiting different kinds of motion, and a video quality assessment is presented to evaluate the magnified videos for noise.

The rest of the paper is organized as follows. In Sect. 2, related work is discussed. The framework of the proposed method including brief description of the techniques used is explained in Sect. 3. Experiments along with the results are discussed in Sect. 4. Finally, the paper is concluded in Sect. 5.

2 Related Work

Researchers have worked on artificial motion manipulation over the past decade and proposed different optical-flow based approaches for many applications. Liu et al. proposed motion magnification for subtle changes in video [7]. They use video registration to suppress camera-shake motion, feature tracking to group correlated object motions, segmentation of motion trajectories, and motion magnification followed by rendering of the magnified video to fill the gaps. Wang et al. proposed a “Cartoon Animation Filter” that exaggerates the motion of an input video to produce a synthetic video which they claim looks more animated and alive; it subtracts a smoothed and time-shifted version of the second-order derivative of the signal from the original signal [18]. These are Lagrangian approaches, which make use of optical flow for motion magnification. The first Eulerian motion magnification approach was proposed by Wu et al. [19]. Instead of computing optical flow explicitly, they magnify the temporal difference between frames. This work was extended to phase-based magnification using complex steerable pyramids for noise reduction [16], where local phase variations are magnified in all sub-bands of the complex steerable pyramid. To improve the time complexity, Riesz pyramids were proposed for phase-based motion magnification [17].

Motion magnification has been utilized in many applications. Deviation magnification of geometric structures was proposed by Wadhwa et al. [15]: basic parametric shapes (e.g., lines and circles) are fitted to objects in still images, sampling and image matting are performed on the particular object shape, and the deviation from the fitted shape is computed, magnified by a factor, and rendered into an image in which the deviation is exaggerated. Raja et al. proposed a presentation attack detection scheme for iris recognition systems based on motion magnification of the phase information in the eye region [9]. Motion magnification has also been used for face spoofing detection [2], where subtle facial motions are magnified and texture based features are used to detect the spoofing. Davis et al. used motion magnification to infer material properties by emphasizing small vibrations of the object [3].

Interest region based motion magnification is beneficial in terms of reducing outliers and noise. To the best of our knowledge, two works have been proposed in this direction. The first is by Kooij et al., who used depth maps to magnify objects of interest specified by their depth [5]. However, it requires depth maps as extra information, which is an additional acquisition task, whereas our method works with only the given video. In the second work, Elgharib et al. proposed motion magnification in the presence of large motions, called DVMAG (Dynamic Video MAGnification) [4]. They calculate an alpha matte of each frame from user-specified scribbles, magnify the motion within the respective alpha mattes, and apply texture synthesis to fill the gaps in the magnified videos. Our method differs from [4] in that DVMAG requires a large amount of user interaction to draw scribbles, whereas the proposed method requires only two coordinates of the region of interest, which could easily be automated in the future using object proposals [10, 11]. Unlike [4], the proposed method also does not require texture synthesis to fill detail gaps.

3 Interest Region Based Motion Magnification

Most of the motion magnification techniques proposed in the past operate on the whole video frame irrespective of the object of interest and therefore require the video to be recorded such that most of the frame contains the object of interest with minimal background. Hence, they cannot be applied to standard videos recorded in regular conditions, e.g., with fast moving objects in the background, as this leads to more noise during magnification. In the proposed work, motion magnification is applied only to specific objects in the video. Since the magnification targets imperceptible motions, the region of interest is assumed to be static in the video.

Challenges: Our method is based on the observation that large motions in the background may affect the motion magnified video and introduce extensive noise. Solving this issue brings new challenges. The main challenge of this work is to obtain the object of interest and perform motion magnification automatically. The two previous works [4, 5] require additional information to obtain the object of interest, whereas we make it possible with only two pixel locations marked by the user on the first frame; the rest of the work is handled by an automated algorithm with no user intervention. The extracted object has a sharp and distorted boundary and hence cannot be used directly as a mask. To obtain a mask with a fuzzy boundary, image matting is performed, for which scribbles are drawn automatically on the background and foreground objects. From extraction of the object of interest to scribble drawing, image matting, and motion magnification, all steps are performed automatically.

The algorithm has three main steps. In the first step, an object of interest is extracted from the first frame of the video using the kernel K-means approach [14], discussed in Sect. 3.1. In the second step, image matting is performed on the input frame as explained in Sect. 3.2; to perform the matting, scribbles are drawn automatically on the foreground and background image parts. In the third step, video magnification is performed using the Eulerian video magnification approach [19], as discussed in Sect. 3.3.

3.1 Object Segmentation

The K-means segmentation approach is a partitioning method based on the sum of squared errors within each cluster. In the case of two segments C and \(\bar{C}\), the corresponding energy term can be written as

$$\begin{aligned} \sum _{p \in C}||I_p-\mu _C||^2+\sum _{p \in \bar{C}}||I_p-\mu _{\bar{C}}||^2 \end{aligned}$$
(1)

The kernel K-means (kKM) segmentation approach is adopted in the proposed work [14]. kKM is a well-proven data clustering technique in machine learning that uses the kernel trick to separate complex structures which are not linearly separable in the input space. Kernel K-means maps the data into a higher-dimensional Hilbert space using a non-linear mapping \(\psi \). The energy function of standard K-means segmentation is replaced in kKM by

$$\begin{aligned} E_k(C) = \sum _{p \in C}||\psi (I_p)-\mu _C||^2+\sum _{p \in \bar{C}}||\psi (I_p)-\mu _{\bar{C}}||^2 \end{aligned}$$
(2)

where C and \(\bar{C}\) are the two segments, \(I_p\) is the data point at pixel p, and \(\mu _C\) and \(\mu _{\bar{C}}\) are the cluster means of C and \(\bar{C}\), respectively. A detailed explanation of kKM, adaptive kKM, and kernel bandwidth selection can be found in [14]. Object segmentation using kKM is shown in Figs. 1(b) and 4.
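For illustration, a minimal kernel K-means sketch is given below. It clusters pixel colours inside the user-drawn box into two groups with an RBF kernel; the kernel choice, the bandwidth `sigma`, and the plain RGB features are simplifying assumptions and do not reproduce the adaptive kKM or kernel bandwidth selection of [14].

```python
# Minimal kernel K-means sketch (assumption: RBF kernel on RGB pixel values,
# two clusters for foreground/background inside the user-drawn box).
import numpy as np

def kernel_kmeans(X, k=2, sigma=0.1, n_iter=20, seed=0):
    """X: (n, d) pixel features in [0, 1]; returns cluster labels (n,)."""
    n = X.shape[0]
    # RBF (Gaussian) kernel matrix
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            mask = labels == c
            nc = max(mask.sum(), 1)
            # ||psi(x_i) - mu_c||^2 expressed with kernel evaluations only
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc**2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Usage on the cropped box defined by the two user coordinates (hypothetical names):
# crop = frame[y1:y2, x1:x2] / 255.0
# labels = kernel_kmeans(crop.reshape(-1, 3)).reshape(crop.shape[:2])
```

Since the kernel matrix is quadratic in the number of pixels, this sketch is only practical for a small crop or subsampled pixels.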

Fig. 1. (a) Original frame with manually drawn bounding box using coordinates \((x_1,y_1)\) and \((x_2,y_2)\), (b) Segmented object using kernel K-means, (c) Eroded background image, (d) Eroded foreground image, (e) Black and white scribbles drawn on background and foreground image using Bezier curves, and (f) Extracted alpha matte using [6]

3.2 Scribble Drawing and Alpha Matting

Scribbles are used to perform the image matting that separates the foreground and background image regions. They mark the image regions that can be considered clearly foreground (white scribbles) and clearly background (black scribbles), as shown in Fig. 1(e). After the extraction of the foreground object, black (background) and white (foreground) scribbles need to be drawn at the most feasible spatial locations in the image, in different shapes, so that diverse image regions are covered by the scribbles. To achieve this, superpixel over-segmentation [1] and Bezier curves [13] are employed.

Initially, morphological operations are applied so that scribbles are drawn only in the foreground or background part of the image and not on the boundary, which could lead to an erroneous alpha matte. The eroded background and foreground images are shown in Fig. 1(c) and (d). Bezier curves are drawn using six points chosen near the centroid of each superpixel, as mentioned in [13]. The scribbled image is shown in Fig. 1(e).
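The following is a hedged sketch of the automatic scribble placement using off-the-shelf erosion, SLIC superpixels, and a Bezier curve through six points sampled near each superpixel centroid; the parameter values and the point-sampling rule are illustrative assumptions rather than the exact procedure of [13].

```python
# Hedged sketch: erode the foreground/background mask, over-segment the masked
# region with SLIC superpixels, and draw a Bezier curve through six points
# jittered around each superpixel centroid.
import numpy as np
from scipy.special import comb
from skimage.segmentation import slic
from skimage.morphology import binary_erosion, disk

def bezier_curve(ctrl, n_samples=100):
    """Evaluate a Bezier curve defined by (m+1, 2) control points."""
    m = len(ctrl) - 1
    t = np.linspace(0.0, 1.0, n_samples)[:, None]
    # Bernstein basis polynomials
    basis = [comb(m, i) * t**i * (1 - t)**(m - i) for i in range(m + 1)]
    return sum(b * p for b, p in zip(basis, ctrl))

def scribbles_for_mask(image, mask, n_superpixels=75, erode_radius=10, seed=0):
    """Return a list of (n_samples, 2) scribble curves inside the eroded mask."""
    rng = np.random.default_rng(seed)
    eroded = binary_erosion(mask, disk(erode_radius))
    segments = slic(image, n_segments=n_superpixels, mask=eroded, start_label=1)
    curves = []
    for label in np.unique(segments[segments > 0]):
        ys, xs = np.nonzero(segments == label)
        cy, cx = ys.mean(), xs.mean()
        # six control points jittered around the superpixel centroid (assumption)
        ctrl = np.stack([cy + rng.normal(0, 5, 6), cx + rng.normal(0, 5, 6)], axis=1)
        curves.append(bezier_curve(ctrl))
    return curves

# White scribbles: scribbles_for_mask(frame, fg_mask)
# Black scribbles: scribbles_for_mask(frame, ~fg_mask)
```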

Motion magnification in the interest region is a challenging problem, especially at the object boundary, and it should be applied to a finely segmented object; otherwise it may lead to false video magnification. Hence, matting is required to produce the best segmentation of the video frame with fuzzy boundaries. Matting smoothens the boundaries of segmented objects and makes their appearance more natural when blending. The intensity of a pixel in the image can be expressed as a linear combination of the foreground F and background B intensities,

$$\begin{aligned} I(x,y) = \alpha (x,y) F(x,y) + (1-\alpha (x,y))B(x,y) \end{aligned}$$
(3)

where \(\alpha (x,y)\) is the foreground opacity. We use the closed-form solution of [6] to extract the alpha matte. In this approach, F and B are assumed to be smooth in a local window around each pixel, so Eq. 3 can be rewritten as

$$\begin{aligned} \alpha (x,y) \approx a I(x,y) + b, \forall (x,y) \in w \end{aligned}$$
(4)

where \(a=\frac{1}{F-B}\), \(b=-\frac{B}{F-B}\), and w is a small image window. A cost function J is then minimized over \(\alpha \), a, and b,

$$\begin{aligned} \begin{aligned} J(\alpha ,a,b) = \sum _{(p,q)\in I}\Big (\sum _{(x,y)\in w}(\alpha (x,y) -a(p,q) I(x,y)-b(p,q))^2 +\varepsilon a(p,q)^{2}\Big ) \end{aligned} \end{aligned}$$
(5)

which can be further modified in terms of only \(\alpha \). More details regarding the closed form matting can be found in [6]. An example of image matting using this approach is shown in Fig. 1(f).
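As a concrete illustration, the sketch below builds a grayscale matting Laplacian and solves for the alpha matte from the automatic scribbles. It is a simplified single-channel variant of the closed-form matting of [6], written for small frames; the colour formulation and the efficiency tricks of [6] are omitted.

```python
# Hedged sketch: grayscale closed-form matting. Each 3x3 window contributes
# a scalar-variance term to the matting Laplacian; scribbled pixels act as
# soft constraints when solving the sparse linear system.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def matting_laplacian_gray(I, eps=1e-5, r=1):
    h, w = I.shape
    win = (2 * r + 1) ** 2
    idx = np.arange(h * w).reshape(h, w)
    rows, cols, vals = [], [], []
    for y in range(r, h - r):
        for x in range(r, w - r):
            win_idx = idx[y - r:y + r + 1, x - r:x + r + 1].ravel()
            win_I = I[y - r:y + r + 1, x - r:x + r + 1].ravel()
            mu, var = win_I.mean(), win_I.var()
            # G_ij = (1/|w|) * (1 + (I_i - mu)(I_j - mu) / (var + eps/|w|))
            G = (1.0 + np.outer(win_I - mu, win_I - mu) / (var + eps / win)) / win
            contrib = np.eye(win) - G
            rows.append(np.repeat(win_idx, win))
            cols.append(np.tile(win_idx, win))
            vals.append(contrib.ravel())
    rows, cols, vals = map(np.concatenate, (rows, cols, vals))
    # duplicate (row, col) entries are summed, accumulating window contributions
    return sp.csr_matrix((vals, (rows, cols)), shape=(h * w, h * w))

def solve_alpha(I, scribble_mask, scribble_vals, lam=100.0):
    """scribble_mask: 1 where a scribble is drawn; scribble_vals: 0 (bg) or 1 (fg)."""
    L = matting_laplacian_gray(I)
    D = sp.diags(scribble_mask.ravel().astype(float))
    alpha = spsolve(L + lam * D, lam * (scribble_mask * scribble_vals).ravel())
    return np.clip(alpha.reshape(I.shape), 0.0, 1.0)
```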

Fig. 2. Block diagram of the proposed framework. Blue lines show the processes that are performed only for the first frame. Red lines show the processes that are performed for each frame of the video. (Color figure online)

3.3 Video Magnification

We use the Eulerian motion magnification approach, which amplifies the temporal difference between consecutive frames [19]. It combines spatial and temporal processing to highlight the small motions present in the video. Initially, the video frames are decomposed into different spatial frequency bands using a Laplacian pyramid. In temporal processing, a band-pass filter is applied to select particular frequencies depending on the application; the temporal filter is applied uniformly to all spatial sub-bands and all pixels. The temporally filtered signal is then amplified by a factor \(\gamma _{mag}\). The theory behind motion magnification using temporal filtering follows a first-order Taylor series expansion of the image signal, as commonly used in optical flow estimation.

If I(x; t) denotes the image signal at position x and time t, then the modified signal with magnification factor \(\gamma _{mag}\) is given by

$$\begin{aligned} \hat{I}(x; t) = I(x; t) + \gamma _{mag} B(x; t) \end{aligned}$$
(6)

where B(x; t) is the result of the temporal band-pass filter. The motion magnification factor \(\gamma _{mag}\) is constrained by the following bound

$$\begin{aligned} (1+\gamma _{mag})\delta (t)< \frac{\lambda _c}{8} \end{aligned}$$
(7)

where \(\lambda _c\) is the spatial cut-off wavelength beyond which an attenuated version of \(\gamma _{mag}\) is used, and \(\delta (t)\) is the displacement of the image signal. A detailed mathematical explanation of Eulerian motion magnification can be found in [19].
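A minimal sketch of this step is given below: an ideal temporal band-pass filter is applied per pixel via the FFT, and the filtered signal is amplified and added back as in Eq. 6. The Laplacian-pyramid spatial decomposition and the attenuation beyond \(\lambda _c\) are omitted for brevity, so this is only an approximation of the full method of [19].

```python
# Hedged sketch: per-pixel ideal temporal band-pass magnification.
# `frames` is a (T, H, W) grayscale float array; `fs` is the frame rate;
# `fl`, `fh` define the pass band in Hz.
import numpy as np

def eulerian_magnify(frames, fs, fl, fh, gamma_mag):
    T = frames.shape[0]
    freqs = np.fft.rfftfreq(T, d=1.0 / fs)
    F = np.fft.rfft(frames, axis=0)
    band = (freqs >= fl) & (freqs <= fh)
    F[~band] = 0.0                       # ideal temporal band-pass filter
    B = np.fft.irfft(F, n=T, axis=0)     # B(x; t) in Eq. (6)
    return frames + gamma_mag * B        # I_hat = I + gamma_mag * B

# Example (illustrative parameter values):
# magnified = eulerian_magnify(gray_frames, fs=30, fl=0.4, fh=3.0, gamma_mag=20)
```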

3.4 Pipeline

The proposed work follows a sequential approach using the three steps discussed above to achieve interest region based motion magnification in videos. Only a small amount of user input is required: in the first frame, the user is asked to draw a box around the object of interest or to provide two coordinates for the same. In Fig. 1(a), the coordinates \(((x_1,y_1),(x_2,y_2))\) are shown with blue markers and the bounding box in red. Next, the kernel K-means approach is used to segment the object from the background. It gives an approximate object segment, which is used to obtain the alpha matte. The eroded background image is fed to superpixel over-segmentation, and scribbles are drawn on the foreground and background image regions using Bezier curves near the centroid of each superpixel. After this, image matting is performed on the image and the corresponding alpha matte is calculated.

The object of interest to be magnified is assumed to be static in the video and to exhibit only tiny motion. This assumption is valid since the motivation of the proposed work is to magnify small motions. It also relaxes the algorithm: scribbles are required only in the first frame, and these scribbles suffice for the other frames because the object does not exhibit large movements. On the basis of this assumption, the following two frameworks are adopted in this work, depending on the object of interest in the given video.

  1. If the motion of the object lies inside the object boundary or is extremely tiny, then the alpha matte of only the first frame may work for all frames. When this condition is satisfied, calculating the alpha matte for a single frame is computationally much cheaper than the second framework.

  2. In the second framework, the alpha matte of each frame is calculated and applied to the corresponding magnified video frame.

In both frameworks, the alpha matte is calculated either for the first frame only or for all frames. Apart from this, the video is magnified by a magnification factor using temporal filtering, and the alpha matte is multiplied with the magnified temporal difference. In the first framework, the alpha matte of the first frame is multiplied with the temporal differences of all frames; in the second framework, the alpha matte of each frame is multiplied with the corresponding temporal difference. Finally, the magnified temporal difference of only the foreground object (via the alpha matte multiplication) is added to the original frame. A block diagram of both frameworks is illustrated in Fig. 2.
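The combination step of the two frameworks can be summarised by the following sketch, where B denotes the band-passed temporal difference of Sect. 3.3 and the function and variable names are illustrative.

```python
# Hedged sketch of the two frameworks: the amplified temporal difference B is
# restricted to the interest region through the alpha matte before being added
# back to the original frames.
import numpy as np

def interest_region_magnify(frames, B, gamma_mag, alpha_first, alpha_all=None):
    """frames, B: (T, H, W) float; alpha_first: (H, W); alpha_all: (T, H, W) or None."""
    out = np.empty_like(frames)
    for t in range(frames.shape[0]):
        # Framework 1 reuses the first frame's matte; framework 2 uses per-frame mattes.
        alpha = alpha_first if alpha_all is None else alpha_all[t]
        out[t] = frames[t] + gamma_mag * alpha * B[t]
    return out
```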

Fig. 3. (a) Baby, (b) Camera, (c) Eye, (d) Woman, (e) Wrist, (f) Hand, and (g) Person.

4 Experiments and Discussion

The proposed method is tested on videos with subtle motion. The first frame of each video is shown in Fig. 3. We have used videos similar to those of [16, 19], and some videos are recorded in conditions with a moving background. For object extraction using kKM, the hard constraint and smoothness parameters can be set according to the objects of interest. In most cases, hard constraints are set to ‘on’ since the objects of interest must lie inside the box provided by the user. The smoothness weight should be chosen above zero to obtain a smoother segmentation. An example of object extraction with and without the smoothness constraint is shown in Fig. 4.

Fig. 4. (a) Original frame of hand sequence, (b) Extracted object with no smoothness, and (c) with 0.1 smoothness.

To remove the boundary between the background and foreground images, erosion is performed with a disk structuring element of radius 10 or 20 pixels. Next, the scribbles are drawn on the background and foreground using superpixels and Bezier curves. The superpixel count is set between 50 and 100, depending on the variability in the size of the interest regions. The parameters used in magnification, i.e., band-pass frequencies, sampling rate, magnification factor, and cut-off frequency, are adopted from [19]. IIR, Butterworth, and ideal temporal filters are used in the experiments to obtain the temporal difference between frames. Results can be accessed online at: https://sites.google.com/site/manishaverma89/publications/int-reg-motion-mag.
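For reference, the parameter choices described above can be summarised as a configuration sketch; the kKM, erosion, and superpixel values are taken from the text, while the specific temporal-filter numbers are only example values in the spirit of [19].

```python
# Illustrative experiment settings; magnification numbers are example values,
# chosen per video following [19], not values reported in this paper.
params = {
    "kkm": {"hard_constraint": True, "smoothness_weight": 0.1},
    "erosion_disk_radius": 10,          # 10 or 20 depending on the sequence
    "superpixel_count": 75,             # chosen between 50 and 100
    "magnification": {
        "temporal_filter": "ideal",     # IIR, Butterworth, or ideal
        "gamma_mag": 20,                # example value
        "band_hz": (0.4, 3.0),          # example pass band
        "sampling_rate_hz": 30,
    },
}
```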

Results in the form of space-time plots are illustrated in Figs. 5, 6, 7 and 8. In Fig. 5, plots of the camera sequence are shown. Pixels of a random column (shown as a black line in Figs. 5, 6, 7 and 8) are plotted over time for each frame, with time and pixel intensities on the x and y axes respectively. The first image in Fig. 5 is the first frame of the camera sequence. Figure 5(a) is the time-space plot of the original sequence, and there is no variation in pixel intensities over time. In Fig. 5(b), the intensity variation of the Eulerian method [19] is shown; it is clearly visible that motion magnification adds noise to the video as it magnifies the background. This problem would not appear if the background were ideally motionless, but that is a very unlikely situation. In Fig. 5(c) and (d), only the camera motion is visible as the background motion is not magnified; they follow the first and the second framework respectively. It is noticeable that, since the camera sequence has very tiny motion, there is no appreciable difference between the first and second framework for this example.

Fig. 5. Comparison of Wu et al. [19] and the proposed method on the camera video sequence. Original first frame and space-time plots of pixel intensities of (a) Original frame (no motion), (b) Wu et al. method [19] (uniform motion magnification), (c) Proposed method - first framework (interest region based motion magnification using the first frame's alpha matte), and (d) Proposed method - second framework (interest region based motion magnification using each frame's alpha matte).

In a similar way, time-space plots are drawn for the eye sequence. The iris is extracted as the foreground object and magnified throughout all frames. Magnification using the alpha matte (Fig. 6(c) and (d)) leads to noiseless magnification, where only the iris is magnified and the other regions remain unchanged from the original sequence. In contrast, Eulerian motion magnification magnifies the whole frame (Fig. 6(b)).

Fig. 6. Motion magnification comparison on the eye video sequence. Original first frame and space-time plots of pixel intensities of (a) Original frame, (b) Wu et al. method [19], (c) Proposed method - first framework, and (d) Proposed method - second framework.

In the next two experiments, we use videos with a moving background. A video is recorded in which a still hand is placed in front of a monitor displaying a waterfall video; hence the video contains a subtle motion (of the hand) with a moving background (the waterfall). The first frame of the video is shown with a horizontal black line for which the spatial intensity is plotted over time. Since the background is moving, the approach of Wu et al. [19] leads to high noise in the background, as shown in Fig. 7(b). On the other hand, our approaches (Fig. 7(c) and (d)) provide motion magnification with less noise.

Fig. 7. Motion magnification comparison on the hand video sequence. Original first frame and space-time plots of pixel intensities of (a) Original frame, (b) Wu et al. method [19], (c) Proposed method - first framework, and (d) Proposed method - second framework.

Fig. 8. Motion magnification comparison on the person video sequence with (i) vertical and (ii) horizontal motions. Original first frame and space-time plots of pixel intensities of (a) Original frame, (b) Wu et al. method [19], (c) Proposed method - first framework, and (d) Proposed method - second framework.

In the last experiment, a video is recorded where one person sits motionless while another person moves behind them. In Fig. 8, we show vertical and horizontal movements over time. In the first column of Fig. 8(i) and (ii), the first frame of the video is shown with vertical and horizontal black lines respectively, and the corresponding motion plots are shown in the respective rows. The space-time plot of the original video is shown in Fig. 8(a) for both vertical and horizontal motions. The moving person is seen in the middle of all space-time plots, since in the middle of the video sequence the background person comes into contact with the foreground person. Extreme noise due to the background is obtained with the approach of Wu et al. In the proposed approach, the head of the foreground person is considered as the foreground and extracted for motion magnification, as shown in Fig. 8(c) and (d). A minor difference between the first and the second framework can be seen at the boundary of the foreground object when it comes into contact with the background person.

We have presented a no-reference video quality assessment based on Video BLIINDS [12], which computes video statistics and perceptual features and feeds them to a learned support vector regressor for video quality prediction. The quality of four videos, i.e., the original video, the motion magnified video produced by [19], and the motion magnified videos of the proposed frameworks 1 and 2, is measured using Video BLIINDS [12] and reported in Table 1. The algorithm predicts a differential mean opinion score (DMOS index), hence a lower score implies better video quality. The DMOS index of the original video is the lowest for almost all videos. It is clearly visible that the index of the Wu et al. method greatly exceeds that of both proposed frameworks on all videos. There is a minor variation between proposed frameworks 1 and 2, which depends on various factors, e.g., the movement of the object, the scribbles drawn in the first frame, and the background motion. For two videos, hand and person, the score of the proposed magnified video is lower than that of the original video, which could be due to the training of Video BLIINDS. Apart from that, for all videos, the score of the proposed method is lower than that of the Wu et al. [19] method.

Table 1. Video quality assessment using video BLIINDS [12]

5 Conclusion

In the proposed work, interest region based motion magnification is proposed, which helps in reducing noise and removing outliers in motion magnification. The proposed work makes use of object extraction, automatic scribble drawing, image matting, and motion magnification to achieve the task. The proposed method is particularly favourable for videos where the object of interest is not the focus of the recording and other motions (besides the object of interest) are present in the video. The proposed method is shown to work well on different videos as compared to uniform motion magnification.

In future work, we will try to employ semantic object detection techniques to build a fully automatic system for magnifying specific objects with no user intervention. Any existing motion magnification method (e.g., phase-based complex steerable pyramids or Riesz pyramids) can then be employed to process the interest region. Region based motion magnification can be helpful in many applications.