Deepfake video from one frame
Is it possible to make an entire movie from a single photograph? And, having recorded the movements of one person, replace them with another person in the video? The answer to these questions matters greatly for areas such as cinema, photography, and computer game development. One possible solution is digital image processing with specialized software. Among specialists in this field, the problem in question is known as the task of automatic video synthesis, or image animation.
To obtain the expected result, existing approaches combine an object extracted from the source image with motion taken from a separate "driving" (donor) video.
Today, in most areas, image animation is done with computer graphics tools. This approach requires additional knowledge about the object we want to animate – usually its 3D model (how this currently works in the film industry can be read here). Most recent solutions to this problem are based on deep learning models built on generative adversarial networks (GANs) and variational autoencoders (VAEs). These models usually rely on pre-trained modules that find the key points of objects in an image. The main drawback of this approach is that such modules can only recognize the objects they were trained on.
How can the described problem be solved for arbitrary objects in the frame? One way is suggested in the article "First Order Motion Model for Image Animation". The authors propose a neural network model – the First Order Motion Model – that solves the image animation problem without pre-training on the animated object. After training on many videos of objects from one category (for example, faces or human bodies), the network can animate any object belonging to that category.
Let’s see how it works in more detail …
To model complex motion, a set of object key points learned without supervision is used, together with local affine transformations in their neighborhoods.
To account for the parts of the object that are not visible in the source image, an occlusion map is used. Since these parts are absent from the source image, the neural network must generate them on its own. The authors also extend the equivariance loss, commonly used to train key point detectors, to improve the estimation of the local affine transformations.
The framework consists of two main modules: a motion estimation module and an image generation module. The motion estimation module predicts a dense motion field from a frame $\mathbf{D}$ of the driving video to the source image $\mathbf{S}$. This motion field is later used to align the feature maps computed from the source image with the pose of the object in the driving frame.
The key point detector receives the source image and a frame from the driving video. It extracts a first-order motion representation consisting of sparse key points and local affine transformations defined relative to an abstract reference frame $\mathbf{R}$. The dense motion network then uses this representation to produce a backward optical flow $\hat{\mathcal{T}}_{S\leftarrow D}$ from $\mathbf{D}$ to $\mathbf{S}$ and an occlusion map $\hat{\mathcal{O}}_{S\leftarrow D}$. The source image and the outputs of the dense motion network are used by the image generation module to render the target image.
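To make the interaction of the modules clearer, here is a minimal sketch of one forward pass of such a pipeline in PyTorch-style code. The module names (`kp_detector`, `dense_motion`, `generator`) and their interfaces are illustrative assumptions, not the authors' actual API:

```python
import torch

def animate_frame(source, driving, kp_detector, dense_motion, generator):
    """One step of the pipeline: source and driving are image tensors (B, 3, H, W).

    kp_detector, dense_motion and generator stand in for the three trainable
    parts described above; their exact interfaces are illustrative.
    """
    # 1. Sparse motion representation: key points + a local affine (Jacobian) per point
    kp_source = kp_detector(source)     # e.g. {'value': (B, K, 2), 'jacobian': (B, K, 2, 2)}
    kp_driving = kp_detector(driving)

    # 2. Dense motion: backward optical flow D -> S plus an occlusion map
    flow, occlusion = dense_motion(source, kp_source, kp_driving)

    # 3. Generation: warp source features with the flow, mask with occlusion, decode
    generated = generator(source, flow, occlusion)
    return generated
```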
Next, we consider the features of this solution in more detail.
Local affine transforms for approximating motion
The motion estimation module estimates the backward optical flow $\mathcal{T}_{S\leftarrow D}$ from the driving frame $\mathbf{D}$ to the source frame $\mathbf{S}$. The authors approximate $\mathcal{T}_{S\leftarrow D}$ by its Taylor expansion in the neighborhoods of key points. It is assumed that there exists an abstract reference frame $\mathbf{R}$, so the estimate is expressed through $\mathcal{T}_{S\leftarrow R}$ and $\mathcal{T}_{D\leftarrow R}$. Thus, given a driving frame $\mathbf{D}$, each transformation is evaluated in the neighborhood of the learned key points. The Taylor expansion is taken at the key points $p_1, \dots, p_K$, which denote the coordinates of the key points in $\mathbf{R}$.
To estimate $\mathcal{T}_{X\leftarrow R}$ (for $X \in \{S, D\}$), it is assumed to be locally bijective in the neighborhood of each key point.
The key point predictor outputs $\mathcal{T}_{S\leftarrow R}$ and $\mathcal{T}_{D\leftarrow R}$ (the key point locations and the Jacobians of the local transformations). The authors use a standard U-Net architecture that estimates $K$ heatmaps, one for each key point.
The final layer of the decoder applies softmax to predict heatmaps that can be interpreted as confidence maps for key point detection.
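As an illustration of how softmax heatmaps can be turned into coordinates, here is a hedged sketch of the usual "soft-argmax" step (the exact layer layout of the authors' U-Net may differ):

```python
import torch
import torch.nn.functional as F

def heatmaps_to_keypoints(raw_heatmaps):
    """raw_heatmaps: (B, K, H, W) logits from the final decoder layer.

    Softmax over the spatial dimensions turns each map into a distribution;
    its expectation gives the key point coordinates in [-1, 1].
    """
    b, k, h, w = raw_heatmaps.shape
    heatmaps = F.softmax(raw_heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)

    # Coordinate grids in [-1, 1]
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w)

    # Expected coordinates under each heatmap
    x = (heatmaps * xs).sum(dim=(2, 3))
    y = (heatmaps * ys).sum(dim=(2, 3))
    return torch.stack([x, y], dim=-1)  # (B, K, 2)
```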
The authors use a convolutional neural network to estimate the dense flow $\hat{\mathcal{T}}_{S\leftarrow D}$ from the sparse key point representation (the key point coordinates are denoted $p_k$) and the source frame $\mathbf{S}$. Importantly, local patterns of $\hat{\mathcal{T}}_{S\leftarrow D}$, such as edges or texture, are pixel-aligned with $\mathbf{D}$, not with $\mathbf{S}$. To align the input data with $\hat{\mathcal{T}}_{S\leftarrow D}$, the source frame is warped according to the local transformations, producing $K$ transformed images $\mathbf{S}^1, \dots, \mathbf{S}^K$, each aligned with $\mathbf{D}$ in the neighborhood of one key point. The heatmaps and the transformed images are concatenated and processed by a U-Net.
The dense flow $\hat{\mathcal{T}}_{S\leftarrow D}$ is expressed by the formula:

$$\hat{\mathcal{T}}_{S\leftarrow D}(z) = M_0 z + \sum_{k=1}^{K} M_k \left( \mathcal{T}_{S\leftarrow R}(p_k) + J_k \left( z - \mathcal{T}_{D\leftarrow R}(p_k) \right) \right)$$
Here $M_k$ is a mask that selects the neighborhood of the key point in which the corresponding transformation is applied ($M_0$ accounts for the background), and the Jacobian $J_k$ is expressed by the formula:

$$J_k = \left( \frac{d}{dp}\mathcal{T}_{S\leftarrow R}(p)\Big|_{p=p_k} \right) \left( \frac{d}{dp}\mathcal{T}_{D\leftarrow R}(p)\Big|_{p=p_k} \right)^{-1}$$
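A sketch of how this combination formula could be computed over a whole image grid, assuming the masks, key points and Jacobians are already predicted (shapes and names are my assumptions for illustration):

```python
import torch

def combine_local_flows(masks, kp_source, kp_driving, jacobians, h, w):
    """Illustrative implementation of the formula above.

    masks:      (B, K+1, H, W) softmax masks, channel 0 = background (M_0)
    kp_source:  (B, K, 2) key points T_{S<-R}(p_k)
    kp_driving: (B, K, 2) key points T_{D<-R}(p_k)
    jacobians:  (B, K, 2, 2) the matrices J_k
    Returns a dense backward flow field (B, H, W, 2).
    """
    b, k = kp_source.shape[:2]
    # Identity grid z, in the same normalized coordinates as the key points
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing='ij')
    z = torch.stack([xs, ys], dim=-1).view(1, 1, h, w, 2)  # (1, 1, H, W, 2)

    # Local affine approximation around each key point:
    # T_{S<-R}(p_k) + J_k (z - T_{D<-R}(p_k))
    diff = z - kp_driving.view(b, k, 1, 1, 2)
    local = torch.einsum('bkij,bkhwj->bkhwi', jacobians, diff) \
            + kp_source.view(b, k, 1, 1, 2)                     # (B, K, H, W, 2)

    # The background channel keeps the identity grid; blend everything with the masks
    flows = torch.cat([z.expand(b, 1, h, w, 2), local], dim=1)  # (B, K+1, H, W, 2)
    return (masks.unsqueeze(-1) * flows).sum(dim=1)             # (B, H, W, 2)
```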
Recall that the source image $\mathbf{S}$ is not pixel-aligned with the generated image $\hat{\mathbf{D}}$. To cope with this, the authors use a feature deformation strategy. After two down-sampling blocks we obtain a feature map $\xi$. We then warp $\xi$ according to $\hat{\mathcal{T}}_{S\leftarrow D}$. If occlusions are present in $\mathbf{D}$, optical flow alone may not be enough to generate $\hat{\mathbf{D}}$. Here the occlusion map $\hat{\mathcal{O}}_{S\leftarrow D}$ is introduced to mark the regions of the feature map that need to be inpainted because they are missing from the source image. The new feature map looks like this:

$$\xi' = \hat{\mathcal{O}}_{S\leftarrow D} \odot f_w\!\left(\xi, \hat{\mathcal{T}}_{S\leftarrow D}\right)$$

where $f_w(\cdot, \cdot)$ is the backward warping operation and $\odot$ is the Hadamard product (element-wise multiplication of the corresponding entries).
The occlusion mask $\hat{\mathcal{O}}_{S\leftarrow D}$ is estimated from the sparse key point representation by adding a channel to the final layer of the dense motion network. The feature map $\xi'$ is fed to the subsequent layers of the image generation module to render the resulting frame.
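The feature deformation step can be sketched as follows; `deform_features` is a hypothetical helper, with backward warping done via `grid_sample`:

```python
import torch
import torch.nn.functional as F

def deform_features(feature_map, flow, occlusion):
    """feature_map: (B, C, H', W') source features after the down-sampling blocks
    flow:        (B, H', W', 2) backward flow in normalized [-1, 1] coordinates
    occlusion:   (B, 1, H', W') occlusion map with values in [0, 1]
    """
    # f_w(xi, T_hat): backward warping, i.e. sample source features at the flow locations
    warped = F.grid_sample(feature_map, flow, align_corners=True)

    # Hadamard (element-wise) product with the occlusion map: occluded regions are
    # zeroed out and must be inpainted by the following generator layers
    return occlusion * warped
```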
The network is trained end-to-end with a combination of several loss functions. The main one is a reconstruction loss based on the perceptual loss of Johnson et al., computed on features of a pre-trained VGG-19 network. The reconstruction loss formula is:

$$L_{rec}\left(\hat{\mathbf{D}}, \mathbf{D}\right) = \sum_{i=1}^{I} \left| N_i\!\left(\hat{\mathbf{D}}\right) - N_i\!\left(\mathbf{D}\right) \right|$$

where $\hat{\mathbf{D}}$ is the reconstructed frame, $\mathbf{D}$ is the driving frame with the original motion, $N_i(\cdot)$ is the $i$-th channel of the features extracted from a specific VGG-19 layer, and $I$ is the number of feature channels in that layer.
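A minimal sketch of such a perceptual reconstruction loss on a single VGG-19 layer (the authors combine several layers and resolutions; the layer choice below is an assumption for illustration):

```python
import torch
import torchvision

# Feature extractor: layers of a pre-trained VGG-19 up to relu2_2
vgg_features = torchvision.models.vgg19(pretrained=True).features[:9].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def reconstruction_loss(generated, driving):
    """L1 distance between VGG-19 feature maps of the generated and driving frames."""
    return torch.abs(vgg_features(generated) - vgg_features(driving)).mean()
```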
Imposing equivariance constraints
The key point predictor does not use any key point annotations during training, which can lead to unstable results. The equivariance constraint is one of the most important factors for learning key point locations without supervision: it forces the model to predict key points that are consistent with known geometric transformations. Since the motion estimation module predicts not only key points but also Jacobians, the authors extend the equivariance loss to additionally constrain the Jacobians.
Suppose the image $\mathbf{X}$ undergoes a known spatial deformation $\mathcal{T}_{X\leftarrow Y}$, which can be either an affine transformation or a thin plate spline. After this deformation we obtain a new image $\mathbf{Y}$. Applying the extended motion estimator to both images, we obtain a set of local approximations for $\mathcal{T}_{X\leftarrow R}$ and $\mathcal{T}_{Y\leftarrow R}$. The standard equivariance constraint is written as follows:

$$\mathcal{T}_{X\leftarrow R} \equiv \mathcal{T}_{X\leftarrow Y} \circ \mathcal{T}_{Y\leftarrow R}$$
Expanding both sides of the equation in a Taylor series, we obtain the following constraints (here $\mathbb{1}$ is the identity matrix):

$$\mathcal{T}_{X\leftarrow R}(p_k) \equiv \mathcal{T}_{X\leftarrow Y}\!\left(\mathcal{T}_{Y\leftarrow R}(p_k)\right)$$

$$\mathbb{1} \equiv \left( \frac{d}{dp}\mathcal{T}_{X\leftarrow R}(p)\Big|_{p=p_k} \right)^{-1} \left( \frac{d}{dp}\mathcal{T}_{X\leftarrow Y}(p)\Big|_{p=\mathcal{T}_{Y\leftarrow R}(p_k)} \right) \left( \frac{d}{dp}\mathcal{T}_{Y\leftarrow R}(p)\Big|_{p=p_k} \right)$$
To constrain the key point positions, an $L_1$ loss is used. The authors use equal weights when combining the loss functions in all experiments, since the model turned out not to be sensitive to the relative weights of the reconstruction loss and the two equivariance losses.
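To make the equivariance idea concrete, here is a hedged sketch of the key point position term, using a random affine deformation for simplicity (the authors use thin plate spline deformations; the detector here is assumed to return (B, K, 2) coordinates in [-1, 1]):

```python
import torch
import torch.nn.functional as F

def keypoint_equivariance_loss(kp_detector, image):
    """Sketch of the key point equivariance term; image: (B, 3, H, W)."""
    b = image.shape[0]
    # Random known affine deformation T_{X<-Y} (identity plus small noise)
    theta = torch.eye(2, 3).unsqueeze(0).repeat(b, 1, 1)
    theta = theta + 0.05 * torch.randn_like(theta)

    grid = F.affine_grid(theta, size=list(image.shape), align_corners=True)
    deformed = F.grid_sample(image, grid, align_corners=True)  # image Y

    kp_x = kp_detector(image)      # key points of X, (B, K, 2)
    kp_y = kp_detector(deformed)   # key points of Y, (B, K, 2)

    # Map the key points of Y back through T_{X<-Y}: p -> A p + t
    a, t = theta[:, :, :2], theta[:, :, 2:]
    kp_y_in_x = torch.einsum('bij,bkj->bki', a, kp_y) + t.transpose(1, 2)

    # L1 constraint on positions: T_{X<-R}(p_k) should match T_{X<-Y} o T_{Y<-R}(p_k)
    return torch.abs(kp_x - kp_y_in_x).mean()
```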
To animate an object from the source frame $\mathbf{S}$ using the frames of a driving video $\mathbf{D}_1, \dots, \mathbf{D}_T$, each frame $\mathbf{D}_t$ is processed independently to obtain the output frame $\mathbf{S}_t$. For this, the relative motion between $\mathbf{D}_1$ and $\mathbf{D}_t$ is transferred to $\mathbf{S}$: that is, the transformation $\mathcal{T}_{D_1\leftarrow D_t}(p)$ is applied in the neighborhood of each key point $p_k$.
It is important to note a limitation that follows from this: the objects in $\mathbf{S}$ and $\mathbf{D}_1$ should have similar poses.
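The relative key point transfer for a single driving frame can be sketched like this (the full method also adjusts the Jacobians and can scale the displacement to the source object's size; those details are omitted here):

```python
import torch

def transfer_relative_motion(kp_source, kp_driving_initial, kp_driving_t):
    """Move the source key points by the displacement the driving key points
    have made since the first driving frame.

    All arguments: (K, 2) tensors of key point coordinates.
    """
    displacement = kp_driving_t - kp_driving_initial
    return kp_source + displacement
```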
The model sets records!
The model was trained and tested on 4 different data sets:
- VoxCeleb – a face dataset of 22,496 videos extracted from YouTube;
- UvA-Nemo – dataset for face analysis, consisting of 1240 videos;
- BAIR robot pushing – a dataset of videos collected by a Sawyer robotic arm pushing various objects on a table; it contains 42,880 training and 128 test videos;
- Tai-Chi-HD – 280 Tai Chi videos taken from YouTube.
As can be seen from the comparison table in the paper, the First Order Motion Model outperforms the other approaches on all metrics.
The long-awaited examples
Now try it yourself! It is very simple – everything you need is already prepared here.