Conditional Image Generation Final Report

Part I. Project Analysis

From the project description: “The project for this course is to generate the middle region of images conditioned on the outside border of the image and a caption describing the image. To be as successful as possible, the model needs to be able to understand the specific meaning of the caption in the context of a specific image.”

The model needs to accomplish two tasks: 1) generate the missing centre region of the image from the uncorrupted outer border, and 2) learn the meaning of the image caption. The following image (from the course website) shows the basic idea of this project.

Objectives of this project:
1) Train a model that can generate the middle regions and make use of the captions.
2) Study the relationship between the caption and the generated region.
3) Learn how to quantitatively evaluate the performance of generative models in this project (as reported before, the squared loss does not work well).

Part II. Potential Solutions and Existing Work

The objective of this task can be summarised as: inpaint the image given its uncorrupted outer area and its caption. There is prior work on image inpainting but, to my knowledge, no prior work on inpainting conditioned on an image caption, which is one of the novelties of this project. Previous work on both image inpainting and text-to-image generation is helpful and enlightening for this project.

Existing work on image inpainting falls mainly into three categories:
1) Local methods [1-3]. These methods fill the missing part from the uncorrupted part of the same image, for example by finding the nearest patches within the image [1]. They work well when the image has a low-rank or planar structure.
2) Non-local methods. As discussed in [4-6], non-local inpainting methods predict the missing part with the help of external training images. In [4], the authors fill a hole in an image with a semantically similar patch from a huge database; in [5], the authors use internet images of the same scene.
3) Image inpainting as an image generation problem. This view is discussed in [7]: a DCGAN (Deep Convolutional Generative Adversarial Network) is first trained to generate images, and the missing part is then filled in using the uncorrupted part of the image together with the learned DCGAN.

Image generation from text descriptions has gained interest in the research community but is still far from producing meaningful, high-resolution images. The main challenge is that the space of plausible images given a text description is multimodal. Recently, some work [8-9] has tried to address this problem with DCGANs. In [8], the authors use a DCGAN conditioned on text to generate images. In [9], a stacked DCGAN-based architecture is proposed to generate high-resolution images.

Part III. Image Inpainting with Different Models

In this project, I am interested in the third approach discussed above: formulating image inpainting as an image generation problem and filling the missing part of the image using the uncorrupted part. In particular, I am interested in tackling this problem with DCGANs. However, a standard DCGAN will not work directly for image inpainting, since with high probability it will produce an image unrelated to the uncorrupted border. The following are the three methods used in this project to fill the missing part of the image.

3.1 DCGAN with Explicit Loss on Contextual Similarity

This method is proposed in [7] and has two main steps. First, we train a DCGAN on uncorrupted data until convergence. Then we reuse the learned discriminator D and generator G for image inpainting. In the second step, we define two losses: the contextual loss, which measures the similarity between the reconstructed image and the uncorrupted part of the image, and the perceptual loss, which measures how similar the reconstructed image is to the images in the training set. The perceptual loss is defined as log(1 - D(G(z))). In the inpainting step, the total loss is the sum of the contextual and perceptual losses, and we back-propagate this loss to find the latent vector z whose generated image best matches the uncorrupted region. Note that the caption is not used in this method.
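Below is a minimal sketch of the inpainting step, assuming a trained tf.keras generator G and discriminator D (with D outputting a probability in (0, 1)) and a binary mask that is 1 on the uncorrupted border; the masked L1 contextual loss and the weight lam are my choices, roughly following [7]:

```python
import tensorflow as tf

def inpaint(G, D, corrupted_image, mask, z_dim=100, steps=1000, lam=0.1, lr=0.01):
    """Search for a latent vector z whose generated image matches the border.

    G, D            -- trained generator / discriminator (assumed tf.keras models)
    corrupted_image -- (1, 64, 64, 3) image with the centre region missing
    mask            -- (1, 64, 64, 1) binary mask, 1 on the uncorrupted border
    lam             -- weight of the perceptual loss relative to the contextual loss
    """
    z = tf.Variable(tf.random.uniform([1, z_dim], -1.0, 1.0))
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            g = G(z, training=False)
            # contextual loss: masked L1 distance to the uncorrupted border
            contextual = tf.reduce_sum(tf.abs(mask * (g - corrupted_image)))
            # perceptual loss: log(1 - D(G(z))), pushes G(z) toward "real" images
            perceptual = tf.reduce_sum(tf.math.log(1.0 - D(g, training=False) + 1e-8))
            loss = contextual + lam * perceptual
        opt.apply_gradients([(tape.gradient(loss, z), z)])
    g = G(z, training=False)
    # keep the original border and paste in the generated centre
    return mask * corrupted_image + (1.0 - mask) * g
```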

The model structure of this method is shown in the following figure. The architecture of the original DCGAN paper [11] is used. The generator takes as input a 100-dimensional noise vector sampled from a uniform distribution on (-1, +1), which is projected and reshaped to 4*4*1024. Each following layer is a deconvolutional (transposed-convolution) layer in which the number of channels is halved and the spatial dimension is doubled, giving a final output of 64*64*3. The discriminator takes a 64*64*3 image as input, followed by a series of convolutional layers in which the spatial dimension is halved and the number of channels is doubled relative to the previous layer. The optimization algorithm is Adam, the activation function is ReLU for the generator and leaky ReLU for the discriminator, and batch normalization is adopted.
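A minimal tf.keras sketch of the generator stack described above (the kernel size and the final tanh output are my assumptions, following the DCGAN paper [11]):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(z_dim=100):
    """DCGAN generator: 100-d noise -> 4x4x1024 -> 8x8x512 -> 16x16x256 -> 32x32x128 -> 64x64x3."""
    return tf.keras.Sequential([
        layers.Dense(4 * 4 * 1024, input_shape=(z_dim,)),
        layers.Reshape((4, 4, 1024)),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(512, 5, strides=2, padding='same'),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(256, 5, strides=2, padding='same'),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(128, 5, strides=2, padding='same'),
        layers.BatchNormalization(), layers.ReLU(),
        # final 64x64x3 output; tanh keeps pixel values in (-1, 1)
        layers.Conv2DTranspose(3, 5, strides=2, padding='same', activation='tanh'),
    ])
```

The discriminator would mirror this stack with strided Conv2D layers and leaky ReLU activations, ending in a single real/fake output.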

3.2 Conditional DCGAN

Different from the previous method, here we propose another method for image inpainting with DCGAN: a conditional DCGAN that combines the ideas in [10] and [11]. The idea of the conditional GAN [10] is to impose a constraint, side information y, on the data x being generated. By conditioning both the generator and the discriminator on y, it is possible to direct the data generation process. For image inpainting, we can treat the uncorrupted part of the image as the condition and feed it to both the generator and the discriminator of the DCGAN. Note that the caption is not used in this method. I am still working on the implementation of this model. The model structure is the same as the first stage of method three and is therefore omitted here.
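A minimal sketch of how the conditioning could be wired, assuming the uncorrupted border is encoded by a small convolutional network and concatenated with the noise vector (the encoder layers and sizes are my assumptions, not a fixed design):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_conditional_generator(z_dim=100):
    """Generator conditioned on the uncorrupted border of the image."""
    z = layers.Input(shape=(z_dim,))
    border = layers.Input(shape=(64, 64, 3))  # image with the centre region zeroed out

    # encode the border into a feature vector and concatenate it with the noise
    h = layers.Conv2D(64, 5, strides=2, padding='same', activation='relu')(border)
    h = layers.Conv2D(128, 5, strides=2, padding='same', activation='relu')(h)
    cond = layers.Flatten()(h)
    x = layers.Concatenate()([z, cond])

    # same deconvolutional stack as the unconditional DCGAN generator
    x = layers.Dense(4 * 4 * 1024)(x)
    x = layers.Reshape((4, 4, 1024))(x)
    for channels in (512, 256, 128):
        x = layers.Conv2DTranspose(channels, 5, strides=2, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    out = layers.Conv2DTranspose(3, 5, strides=2, padding='same', activation='tanh')(x)
    return tf.keras.Model([z, border], out)
```

The discriminator would take the same border as a second input, so that it judges whether the generated centre is consistent with the given context.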

3.3 Stacked Conditional DCGAN

In the previous two methods, the influence of the caption on image generation is not studied. Here I propose a new method to address this: the Stacked Conditional DCGAN (SC-DCGAN). The idea of SC-DCGAN is mainly inspired by [9]. As discussed in the previous section, the main challenge of generating images from text is that the space of plausible images given a text description is multimodal, which makes it hard to generate meaningful images. SC-DCGAN therefore uses the uncorrupted border and the image caption in different stages. In the first stage, we generate the missing part of the image using a DCGAN conditioned on the uncorrupted border; in the second stage, we refine the image using a DCGAN conditioned on both the stage-I output and the caption. I am still working on the implementation of this model. The model structure is shown in the following figure.
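A minimal sketch of the second stage, assuming the stage-I generator from the previous section and a pre-computed caption embedding (the embedding size and the refinement layers are my assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_stage2_generator(caption_dim=128):
    """Stage II: refine the stage-I image, conditioned on the caption embedding."""
    stage1_img = layers.Input(shape=(64, 64, 3))
    caption = layers.Input(shape=(caption_dim,))

    # broadcast the caption embedding over the spatial grid and concatenate it
    cap = layers.Dense(64 * 64)(caption)
    cap = layers.Reshape((64, 64, 1))(cap)
    x = layers.Concatenate()([stage1_img, cap])

    # a few convolutional refinement layers (residual blocks would also be possible)
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    out = layers.Conv2D(3, 3, padding='same', activation='tanh')(x)
    return tf.keras.Model([stage1_img, caption], out)
```

At training time, stage I would be conditioned on the border only, while stage II would be trained with its own discriminator that also sees the caption embedding, following the stacked structure of [9].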

Part IV. Possible Methods to Quantitatively Evaluate the Generated Images

How to quantify the quality of generated images is still an open question. One possible measure is the squared loss between the generated part and the original part. Another is the loss used in [7]: a loss (L1) measuring the similarity between the generated part and the uncorrupted part, and a loss (L2) measuring the similarity between the generated part and other images in the training set.

In my opinion, our final objective is to generate an image that looks good to a human. My own judgement would be based on three aspects: 1) the similarity of the generated whole image (generated missing part plus uncorrupted part) to the original whole image, 2) the similarity of the generated whole image to other images in the training set, and 3) the similarity of the generated whole image to semantically similar images from a huge database (e.g. images from the internet). A natural way to quantify this may be a weighted combination of the three similarity losses: Loss = w1*L1 + w2*L2 + w3*L3, as sketched below.
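A minimal sketch of this weighted score, assuming the three similarity losses have already been computed (the function name and default weights are my assumptions and would need to be tuned):

```python
def combined_loss(l1, l2, l3, w1=0.5, w2=0.3, w3=0.2):
    """Weighted combination of the three similarity losses.

    l1 -- similarity to the original whole image
    l2 -- similarity to other images in the training set
    l3 -- similarity to semantically similar images from an external database
    """
    return w1 * l1 + w2 * l2 + w3 * l3
```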

 

Part V. Simulation Results

For the implementation of the project, I have finished the first method for image inpainting and am still working on the other two. The following shows the results of the first method, which explicitly uses the contextual similarity loss. Note that the implementation of method one is mainly based on the previous work in [10-11]. The results show the 64 original 64*64 images and the corresponding inpainted images (after 20 training epochs). We can see that for some images the generated missing parts are quite similar to the original content; however, most of the generated missing parts resemble the uncorrupted parts only in colour.

Part VI. Discussion and Future Work

This project is very interesting and challenging. It would be interesting to show whether image captions can improve image inpainting. So far, only the first method is finished, and I will try to finish the other two methods in the coming weeks. Another interesting idea may be multi-task learning: learning both the missing part of the image and the caption of the image.

References
[1] Efros A A, Leung T K. Texture synthesis by non-parametric sampling[C], Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on. IEEE, 1999, 2: 1033-1038.

[2] Huang J B, Kang S B, Ahuja N, et al. Image completion using planar structure guidance[J]. ACM Transactions on Graphics (TOG), 2014, 33(4): 129.

[3] Hu Y, Zhang D, Ye J, et al. Fast and accurate matrix completion via truncated nuclear norm regularization[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(9): 2117-2130.

[4] Hays J, Efros A A. Scene completion using millions of photographs[C], ACM Transactions on Graphics (TOG). ACM, 2007, 26(3): 4.

[5] Hays J, Efros A A. Scene completion using millions of photographs[C], ACM Transactions on Graphics (TOG). ACM, 2007, 26(3): 4.

[6] Mairal J, Elad M, Sapiro G. Sparse representation for color image restoration[J]. IEEE Transactions on image processing, 2008, 17(1): 53-69.

[7] Yeh R, Chen C, Lim T Y, et al. Semantic Image Inpainting with Perceptual and Contextual Losses[J]. arXiv preprint arXiv:1607.07539, 2016.

[8] Reed S, Akata Z, Yan X, et al. Generative adversarial text to image synthesis[C], Proceedings of The 33rd International Conference on Machine Learning. 2016, 3.

[9] Zhang H, Xu T, Li H, et al. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks[J]. arXiv preprint arXiv:1612.03242, 2016.

[10] Mirza M, Osindero S. Conditional generative adversarial nets[J]. arXiv preprint arXiv:1411.1784, 2014.

[11] Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks[J]. arXiv preprint arXiv:1511.06434, 2015.

[12] https://github.com/carpedm20/DCGAN-tensorflow

[13] https://github.com/bamos/dcgan-completion.tensorflow

 
