
Multimodal Residual Learning for Visual Question-Answering

  1. 1. Jin-Hwa Kim, BI Lab, Seoul National University. Multimodal Residual Learning for Visual Question-Answering
  2. 2. Table of Contents 1. VQA: Visual Question Answering 2. Vision Part 3. Question Part 4. Multimodal Residual Learning 5. Results 6. Discussion 7. Recent Works 8. Q&A
  3. 3. 1. VQA: Visual Question Answering
  4. 4. 1. VQA: Visual Question Answering VQA is a dataset of open-ended questions about images that require an understanding of vision, language, and commonsense knowledge to answer. VQA Challenge, Antol et al., ICCV 2015
  5. 5. 1. VQA: Visual Question Answering Examples of human answers given the image and the question, or given the question only. VQA Challenge, Antol et al., ICCV 2015
  6. 6. 1. VQA: Visual Question Answering
# images: 204,721 (MS COCO)
# questions: 760K (3 per image)
# answers: 10M (10 per question + α)
Split       Images  Questions  Answers
Training    80K     240K       2.4M
Validation  40K     120K       1.2M
Test        80K     240K       -
Antol et al., ICCV 2015
  7. 7. 1. VQA: Visual Question Answering Test Dataset: 80K test images, split into four subsets of 20K images each. Test-dev (development): for debugging and validation, limited to 10 submissions per day to the evaluation server. Test-standard (publications): used to score entries for the public leaderboard. Test-challenge (competitions): used to rank challenge participants. Test-reserve (check overfitting): used to estimate overfitting; scores on this set are never released. Slide adapted from: MSCOCO Detection/Segmentation Challenge, ICCV 2015
  8. 8. 2. Vision Part
  9. 9. 2. Vision Part ResNet: A Thing among Convolutional Neural Networks He et al., CVPR 2016 1st place on the ImageNet 2015 classification task 1st place on the ImageNet 2015 detection task 1st place on the ImageNet 2015 localization task 1st place on the COCO object detection task 1st place on the COCO object segmentation task
  10. 10. 2. Vision Part ResNet-152 (He et al., CVPR 2016): a 152-layer convolutional neural network. Input size 3x224x224; final stages: Conv 1x1, 512 / Conv 3x3, 512 / Conv 1x1, 2048; Average Pooling (7x7x2048 to 1x1x2048); Linear; Softmax.
  11. 11. 2. Vision Part ResNet-152 as a visual feature extractor (He et al., CVPR 2016): use the 1x1x2048 activation after the 7x7 average pooling, before the final Linear and Softmax layers.
  12. 12. 2. Vision Part ResNet-152 He et al., CVPR 2016 Pre-trained models are available! For Torch, TensorFlow, Theano, Caffe, Lasagne, Neon, and MatConvNet https://github.com/KaimingHe/deep-residual-networks
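As an illustration of how the 2048-d visual feature from the preceding slides can be extracted, here is a minimal sketch using torchvision's pre-trained ResNet-152 (PyTorch rather than the original Torch7 checkpoint linked above; the image path is a placeholder):

```python
# Minimal sketch (recent torchvision): extract a 2048-d visual feature from a
# pre-trained ResNet-152 by taking the activation after the global average
# pooling, before the final classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()   # drop the 1000-way classification head
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # 1x3x224x224
with torch.no_grad():
    v = resnet(img)               # 1x2048 visual feature vector
```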
  13. 13. 3. Question Part
  14. 14. 3. Question Part Word-Embedding. "What color are her eyes?" is preprocessed and tokenized to "what color are her eyes ?", indexed as 53 7 44 127 2 6177, and each index looks up a row vector w53, w7, w44, w127, w2, w6177 in a lookup table. The rows {wi} are learnable parameters trained by the back-propagation algorithm.
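A minimal sketch of the lookup-table step above; the vocabulary, the indices (53, 7, ...), and the embedding size are illustrative placeholders:

```python
# Minimal sketch: tokenize a question, map tokens to indices, and look up
# learnable word vectors. Vocabulary and indices here are placeholders, not
# the actual VQA vocabulary; 620 mirrors the Skip-Thought word-vector size.
import torch
import torch.nn as nn

word2idx = {"what": 53, "color": 7, "are": 44, "her": 127, "eyes": 2, "?": 6177}
tokens = "what color are her eyes ?".split()
indices = torch.tensor([word2idx[t] for t in tokens])               # shape: (6,)

embedding = nn.Embedding(num_embeddings=10000, embedding_dim=620)   # lookup table {w_i}
word_vectors = embedding(indices)                                   # shape: (6, 620)
# embedding.weight is a learnable parameter updated by back-propagation.
```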
  15. 15. 3. Question Part Question-Embedding. The word vectors are fed to an RNN one step at a time: h1 = RNN(w53 (what), h0) at step 0, h2 = RNN(w7 (color), h1) at step 1, ..., h6 = RNN(w6177 (?), h5) at step 5; the final hidden state h6 is the one used as the question embedding.
  16. 16. 3. Question Part Choice of RNN: Gated Recurrent Units (GRU), Cho et al., EMNLP 2014; Chung et al., arXiv 2014
z = σ(x_t U^z + s_{t-1} W^z)
r = σ(x_t U^r + s_{t-1} W^r)
h = tanh(x_t U^h + (s_{t-1} ∘ r) W^h)
s_t = (1 - z) ∘ h + z ∘ s_{t-1}
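A minimal sketch of one GRU step following the equations above, plus the question-encoding loop from the previous slide; the 620/2400 dimensions mirror the Skip-Thought setup mentioned next, and the random weights are placeholders:

```python
# Minimal sketch of one GRU step (update gate z, reset gate r, candidate h,
# new state s_t). Frameworks provide optimized GRUs; this only mirrors the
# equations on the slide.
import torch

def gru_step(x_t, s_prev, Uz, Wz, Ur, Wr, Uh, Wh):
    z = torch.sigmoid(x_t @ Uz + s_prev @ Wz)          # update gate
    r = torch.sigmoid(x_t @ Ur + s_prev @ Wr)          # reset gate
    h = torch.tanh(x_t @ Uh + (s_prev * r) @ Wh)       # candidate state
    return (1 - z) * h + z * s_prev                    # new hidden state s_t

# Usage: encode a question by applying gru_step over the word vectors and
# keeping the last hidden state as the question embedding.
dim_x, dim_s = 620, 2400
params = [torch.randn(dim_x, dim_s) * 0.01 if i % 2 == 0 else torch.randn(dim_s, dim_s) * 0.01
          for i in range(6)]                           # Uz, Wz, Ur, Wr, Uh, Wh
s = torch.zeros(1, dim_s)
for x_t in torch.randn(6, 1, dim_x):                   # 6 word vectors
    s = gru_step(x_t, s, *params)                      # s is the question vector after the loop
```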
  17. 17. 3. Question Part Skip-Thought Vectors: a pre-trained model for word-embedding and question-embedding (Kiros et al., NIPS 2015). Given a sentence, e.g. "I could see the cat on the steps." from "I got back home. I could see the cat on the steps. This was strange.", the model tries to reconstruct the previous sentence and the next sentence. It is trained on the BookCorpus dataset (Zhu et al., arXiv 2015).
  18. 18. 3. Question Part Skip-Thought Vectors: a pre-trained model for word-embedding and question-embedding, with its encoder used as a Sent2Vec model (Kiros et al., NIPS 2015). The pre-trained lookup table (w1, w2, w3, w4, ...) provides the word embeddings and the pre-trained GRU (Gated Recurrent Units) provides the sentence encoder.
  19. 19. 3. Question Part Skip-Thought Vectors Pre-trained model (Theano) and porting code (Torch) are available! https://github.com/ryankiros/skip-thoughts https://github.com/HyeonwooNoh/DPPnet/tree/master/ 003_skipthoughts_porting Noh et al., CVPR 2016 Kiros et al., NIPS 2015
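For reference, the Theano repository linked above exposes a small encoding interface; a minimal sketch of turning questions into sentence vectors with it, assuming the repository is on the Python path and its pre-trained model files have been downloaded as described in its README:

```python
# Minimal sketch using the ryankiros/skip-thoughts Theano code to turn
# sentences into vectors. Requires the repository and its downloaded model
# files; the API below follows that repo's README.
import skipthoughts

model = skipthoughts.load_model()
questions = ["What color are her eyes ?", "How many cats are here ?"]
vectors = skipthoughts.encode(model, questions)   # one 4800-d combine-skip vector per sentence
```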
  20. 20. 4. Multimodal Residual Learning
  21. 21. 4. Multimodal Residual Learning Idea 1: Deep Residual Learning. Extend the idea of deep residual learning to multimodal learning (He et al., CVPR 2016). Figure 2 (residual learning building block): two weight layers with ReLU compute F(x), which is added to the identity shortcut x to give F(x) + x.
  22. 22. 4. Multimodal Residual Learning Idea 2: Hadamard product for Joint Residual Mapping. One modality is directly involved in the gradient with respect to the other modality: ∂(σ(x) ∘ σ(y)) / ∂x = diag(σ'(x) ∘ σ(y)). In the baseline pipeline (https://github.com/VT-vision-lab/VQA_LSTM_CNN), vQ and vI each pass through tanh, are combined by an element-wise product ◉, and fed to a softmax. Scaling problem? (Wu et al., NIPS 2016)
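A minimal sketch verifying the gradient property stated above with autograd, taking σ to be the logistic sigmoid:

```python
# Minimal check of the gradient property: for the element-wise product of two
# squashed inputs, the Jacobian with respect to x is diagonal with entries
# sigma'(x) * sigma(y), so one modality directly scales the gradient of the other.
import torch

x = torch.randn(4)
y = torch.randn(4)

jac = torch.autograd.functional.jacobian(
    lambda x_: torch.sigmoid(x_) * torch.sigmoid(y), x)      # d(sigma(x) ∘ sigma(y)) / dx

s = torch.sigmoid(x)
expected = torch.diag(s * (1 - s) * torch.sigmoid(y))         # diag(sigma'(x) ∘ sigma(y))
print(torch.allclose(jac, expected))                          # True
```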
  23. 23. 4. Multimodal Residual Learning Multimodal Residual Networks (Kim et al., NIPS 2016). The question Q (e.g. "What kind of animals are these ?") is processed by word embedding and an RNN, the image V by a CNN; the two are fused with Hadamard products and question shortcuts, and a softmax predicts the answer A (e.g. "sheep"). Building blocks: word2vec (Mikolov et al., 2013), skip-thought vectors (Kiros et al., 2015), ResNet (He et al., 2015).
  24. 24. 4. Multimodal Residual Learning Multimodal Residual Networks (Kim et al., NIPS 2016). Three stacked learning blocks: in each block, the question path (Q, then H1, H2) passes through Linear-Tanh, the visual feature V through Linear-Tanh-Linear-Tanh, the two are fused by a Hadamard product ⊙ and a Linear layer, and the result is added ⊕ to a linearly mapped shortcut of the question path, giving H1, H2, H3 in turn; a final Linear and Softmax over H3 produce the answer A.
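A hedged sketch of the block stack above in PyTorch: per block, Linear-Tanh on the question path, Linear-Tanh-Linear-Tanh on the visual path, a Hadamard product followed by a Linear layer, and a learned linear shortcut. The dimensions (2400/2048/1200, 2k answers, 3 blocks) follow the slides; other details are simplified rather than taken from the released code.

```python
# Hedged sketch of a Multimodal Residual Network block stack, following the
# diagram above; not the authors' exact implementation.
import torch
import torch.nn as nn

class MRNBlock(nn.Module):
    def __init__(self, dim_q, dim_v, dim_h):
        super().__init__()
        self.q_map = nn.Sequential(nn.Linear(dim_q, dim_h), nn.Tanh())
        self.v_map = nn.Sequential(nn.Linear(dim_v, dim_h), nn.Tanh(),
                                   nn.Linear(dim_h, dim_h), nn.Tanh())
        self.joint = nn.Linear(dim_h, dim_h)
        self.shortcut = nn.Linear(dim_q, dim_h)          # learned (non-identity) shortcut

    def forward(self, q, v):
        f = self.joint(self.q_map(q) * self.v_map(v))    # Hadamard product fusion
        return self.shortcut(q) + f                      # residual addition

class MRN(nn.Module):
    def __init__(self, dim_q=2400, dim_v=2048, dim_h=1200, num_answers=2000, depth=3):
        super().__init__()
        dims = [dim_q] + [dim_h] * depth
        self.blocks = nn.ModuleList(MRNBlock(dims[i], dim_v, dims[i + 1]) for i in range(depth))
        self.classifier = nn.Linear(dim_h, num_answers)

    def forward(self, q, v):
        h = q
        for block in self.blocks:
            h = block(h, v)                              # V is fed to every block
        return self.classifier(h)                        # logits; softmax applied in the loss

logits = MRN()(torch.randn(8, 2400), torch.randn(8, 2048))   # batch of 8 -> (8, 2000)
```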
  25. 25. 5. Results
  26. 26. 5. Results
Alternative block designs (a)-(e): variants of the MRN block that differ in where the extra Linear-Tanh embeddings are placed and in the shortcut mapping (identity vs. linear, applied only when l=1 or in every block).
Table 1: The results of the alternative models (a)-(e) on the test-dev (Open-Ended).
Model  All    Y/N    Num.   Other
(a)    60.17  81.83  38.32  46.61
(b)    60.53  82.53  38.34  46.78
(c)    60.19  81.91  37.87  46.70
(d)    59.69  81.67  37.23  46.00
(e)    60.20  81.98  38.25  46.57
Table 2: The effect of the visual features and the number of target answers on the test-dev results (Open-Ended). Vgg stands for VGG-19 and Res for ResNet-152 features described in Section 4.
Features, #answers  All    Y/N    Num.   Other
Vgg, 1k             60.53  82.53  38.34  46.78
Vgg, 2k             60.79  82.13  38.87  47.52
Vgg, 3k             60.68  82.40  38.69  47.10
Res, 1k             61.45  82.36  38.40  48.81
Res, 2k             61.68  82.28  38.82  49.25
Res, 3k             61.47  82.28  39.09  48.76
The VQA Challenge, which released the VQA dataset, provides evaluation servers for the test-dev and test-standard splits. For test-dev, the evaluation server permits unlimited submissions for validation, while test-standard permits limited submissions for the competition. Accuracies are reported in percentage.
Alternative Models: The test-dev results for the Open-Ended task of the alternative models are shown in Table 1. (a) shows a significant improvement over SAN; however, (b) is only marginally better than (a). Compared to (b), (c) deteriorates the performance: an extra embedding for the question vector may easily cause overfitting, leading to overall degradation. Identity shortcuts in (d) cause the degradation problem, too; the extra parameters of the linear mappings may effectively support the task. (e) shows reasonable performance, but the extra shortcut is not essential.
Appendix Table 1: The effects of various options for VQA test-dev, using the model of Figure 3a (these experiments were conducted preliminarily) with VGG-19 features and 1k target answers. s: Skip-Thought Vectors [6] used to initialize the GRU question-embedding model; b: Bayesian Dropout [3]; c: postprocessing with an image captioning model [5].
           Open-Ended                    Multiple-Choice
Option     All    Y/N    Num.   Other    All    Y/N    Num.   Other
baseline   58.97  81.11  37.63  44.90    63.53  81.13  38.91  54.06
s          59.38  80.65  38.30  45.98    63.71  80.68  39.73  54.65
s,b        59.74  81.75  38.13  45.84    64.15  81.77  39.54  54.67
s,b,c      59.91  81.75  38.13  46.19    64.18  81.77  39.51  54.72
  27. 27. 5. Results
Table 3: The effects of shortcut connections of MRN for VQA test-dev, with ResNet-152 features and 2k target answers. MN stands for Multimodal Networks without residual learning, which have no shortcut connections. Dim. stands for the common embedding vector's dimension. The number of parameters for word embedding (9.3M) and question embedding (21.8M) is subtracted from the total number of parameters in this table.
Open-Ended
Model  L  Dim.  #params  All    Y/N    Num.   Other
MN     1  4604  33.9M    60.33  82.50  36.04  46.89
MN     2  2350  33.9M    60.90  81.96  37.16  48.28
MN     3  1559  33.9M    59.87  80.55  37.53  47.25
MRN    1  3355  33.9M    60.09  81.78  37.09  46.78
MRN    2  1766  33.9M    61.05  81.81  38.43  48.43
MRN    3  1200  33.9M    61.68  82.28  38.82  49.25
MRN    4  851   33.9M    61.02  82.06  39.02  48.04
  28. 28. 5. Results
The VQA test-standard results. The precision of some accuracies [30, 1] is one digit less than the others, so they are zero-filled to match.
                Open-Ended                    Multiple-Choice
Model           All    Y/N    Num.   Other    All    Y/N    Num.   Other
DPPnet [21]     57.36  80.28  36.92  42.24    62.69  80.35  38.79  52.79
D-NMN [1]       58.00  -      -      -        -      -      -      -
Deep Q+I [11]   58.16  80.56  36.53  43.73    63.09  80.59  37.70  53.64
SAN [30]        58.90  -      -      -        -      -      -      -
ACK [27]        59.44  81.07  37.12  45.83    -      -      -      -
FDA [9]         59.54  81.34  35.67  46.10    64.18  81.25  38.30  55.20
DMN+ [28]       60.36  80.43  36.82  48.33    -      -      -      -
MRN             61.84  82.39  38.23  49.41    66.33  82.41  39.57  58.40
Human [2]       83.30  95.77  83.39  72.67    -      -      -      -
  29. 29. 6. Discussions
  30. 30. 6. Discussions Visualization: on the pre-trained model, a visualization is produced at each of the three blocks of the MRN architecture shown earlier (1st, 2nd, and 3rd visualization).
  31. 31. 6. Discussions Examples: (a) What kind of animals are these ? sheep (b) What animal is the picture ? elephant (c) What is this animal ? zebra (d) What game is this person playing ? tennis (e) How many cats are here ? 2 (f) What color is the bird ? yellow (g) What sport is this ? surfing (h) Is the horse jumping ? yes
  32. 32. 6. Discussions Examples (f) and (h): (f) What color is the bird ? yellow (h) Is the horse jumping ? yes
  33. 33. Acknowledgments This work was supported by Naver Corp. and partly by the Korea government (IITP-R0126-16-1072- SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086- RISF, ADD-UD130070ID-BMRR).
  34. 34. Recent Works
  35. 35. Recent Works Low-rank Bilinear Pooling
  36. 36. Low-rank Bilinear Pooling (Kim et al., arXiv 2016)
Bilinear model: f_i = Σ_{j=1..N} Σ_{k=1..M} w_ijk x_j y_k + b_i = x^T W_i y + b_i
Low-rank restriction (U_i is N x D, V_i^T is D x M, rank(W_i) <= min(N, M)):
f_i = x^T W_i y + b_i = x^T U_i V_i^T y + b_i = 1^T (U_i^T x ∘ V_i^T y) + b_i
Vectorized with a pooling matrix P: f = P^T (U^T x ∘ V^T y) + b
  37. 37. Low-rank Bilinear Pooling (Kim et al., arXiv 2016). The same bilinear model and low-rank restriction as the previous slide, f = P^T (U^T x ∘ V^T y) + b, applied to the question vector vQ and the image vector vI combined with the Hadamard product ◉.
  38. 38. Low-rank Bilinear Pooling (Kim et al., arXiv 2016). A single linear projection: W^T maps (x1, x2, ..., xN) to (Σ w_i x_i, Σ w_j x_j, ..., Σ w_k x_k).
  39. 39. Low-rank Bilinear Pooling (Kim et al., arXiv 2016). Two linear projections, Wx^T mapping (x1, ..., xN) to (Σ w_i x_i, Σ w_j x_j, ..., Σ w_k x_k) and Wy^T mapping (y1, ..., yN) to (Σ w_l y_l, Σ w_m y_m, ..., Σ w_n y_n); their Hadamard product (element-wise multiplication) gives (ΣΣ w_i w_l x_i y_l, ΣΣ w_j w_m x_j y_m, ..., ΣΣ w_k w_n x_k y_n), so each output element is a bilinear form in x and y.
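A minimal numeric sketch of low-rank bilinear pooling, f = P^T (U^T x ∘ V^T y) + b, including a check that each output equals a full bilinear form with a rank-restricted weight matrix; the sizes are small placeholders:

```python
# Minimal sketch of low-rank bilinear pooling and a check that output i equals
# the explicit bilinear form x^T W_i y + b_i with W_i = U diag(p_i) V^T,
# i.e. a W_i of rank at most D. VQA uses much larger sizes (e.g. 2048/2400-d inputs).
import torch

N, M, D, K = 64, 80, 20, 10                  # x dim, y dim, joint (low-rank) dim, outputs
U, V, P = torch.randn(N, D), torch.randn(M, D), torch.randn(D, K)
b = torch.randn(K)
x, y = torch.randn(N), torch.randn(M)

f = P.T @ ((U.T @ x) * (V.T @ y)) + b        # low-rank bilinear pooling

i = 3
W_i = U @ torch.diag(P[:, i]) @ V.T          # rank(W_i) <= D
print(torch.allclose(f[i], x @ W_i @ y + b[i], atol=1e-4))   # True
```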
  40. 40. Recent Works Multimodal Low-rank Bilinear Attention Networks (MLB)
  41. 41. MLB Attention Networks (Kim et al., arXiv 2016). Attention stage: the question Q is linearly mapped, Tanh-ed, and replicated over spatial locations; the visual feature map V goes through Conv-Tanh; their Hadamard product passes through a Conv and a spatial Softmax to produce attention weights over locations. Answer stage: the attended visual feature and Q are each passed through Linear-Tanh, fused again by a Hadamard product, and fed to a Linear layer and Softmax over answers A.
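A hedged single-glimpse sketch of low-rank bilinear attention in the spirit of MLB: Linear layers stand in for the 1x1 convolutions, and all sizes are illustrative placeholders, so this is not the exact published configuration.

```python
# Hedged sketch: low-rank bilinear attention over a 14x14 grid of visual
# features, then a second low-rank bilinear fusion for classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim_q, dim_v, dim_h, S, num_answers = 2400, 2048, 1200, 14 * 14, 2000

q_att, v_att = nn.Linear(dim_q, dim_h), nn.Linear(dim_v, dim_h)
att_score = nn.Linear(dim_h, 1)
q_cls, v_cls = nn.Linear(dim_q, dim_h), nn.Linear(dim_v, dim_h)
classifier = nn.Linear(dim_h, num_answers)

q = torch.randn(1, dim_q)                       # question vector
V = torch.randn(1, S, dim_v)                    # 14x14 grid of visual features

# Attention: fuse the (replicated) question with every location, softmax over locations.
joint = torch.tanh(q_att(q)).unsqueeze(1) * torch.tanh(v_att(V))     # (1, S, dim_h)
alpha = F.softmax(att_score(joint).squeeze(-1), dim=1)               # (1, S) attention weights
v_hat = (alpha.unsqueeze(-1) * V).sum(dim=1)                         # attended feature (1, dim_v)

# Answer: fuse the attended feature with the question once more, then classify.
logits = classifier(torch.tanh(q_cls(q)) * torch.tanh(v_cls(v_hat)))  # (1, num_answers)
```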
  42. 42. MLB Attention Networks (MLB), Kim et al., arXiv 2016
Table 2: The VQA test-standard results compared with the state of the art. Notice that these results are trained on the provided VQA train and validation splits, without any data augmentation.
                                              Open-Ended                  MC
Model                                         All    Y/N    Num    Etc    All
iBOWIMG (Zhou et al., 2015)                   55.89  76.76  34.98  42.62  61.97
DPPnet (Noh et al., 2015)                     57.36  80.28  36.92  42.24  62.69
Deeper LSTM+Normalized CNN (Lu et al., 2015)  58.16  80.56  36.53  43.73  63.09
SMem (Xu & Saenko, 2016)                      58.24  80.80  37.53  43.48  -
Ask Your Neuron (Malinowski et al., 2016)     58.43  78.24  36.27  46.32  -
SAN (Yang et al., 2015)                       58.85  79.11  36.41  46.42  -
D-NMN (Andreas et al., 2016)                  59.44  80.98  37.48  45.81  -
ACK (Wu et al., 2016b)                        59.44  81.07  37.12  45.83  -
FDA (Ilievski et al., 2016)                   59.54  81.34  35.67  46.10  64.18
HYBRID (Kafle & Kanan, 2016b)                 60.06  80.34  37.82  47.56  -
DMN+ (Xiong et al., 2016)                     60.36  80.43  36.82  48.33  -
MRN (Kim et al., 2016b)                       61.84  82.39  38.23  49.41  66.33
HieCoAtt (Lu et al., 2016)                    62.06  79.95  38.22  51.95  66.07
RAU (Noh & Han, 2016)                         63.2   81.7   38.2   52.8   67.3
MLB (ours)                                    65.07  84.02  37.90  54.77  68.89
The rate of the divided answers is approximately 16.40%, and only 0.23% of questions have more than two divided answers in the VQA dataset. We assume that it eases the difficulty of convergence without severe degradation of performance.
  43. 43. MLB Attention Networks (MLB), Kim et al., arXiv 2016
The major improvements on the Open-Ended task of VQA are from yes-or-no (Y/N) and other (ETC)-type answers. In Table 3, we also report the accuracy of our ensemble model to compare with other ensemble models on VQA test-standard, which won 1st to 5th places in VQA Challenge 2016. We beat the previous state of the art with a margin of 0.42%.
Table 3: The VQA test-standard results for ensemble models, compared with the state of the art. For unpublished entries, their team names are used instead of their model names. Some of their figures were updated after the challenge.
                             Open-Ended                  MC
Model                        All    Y/N    Num    Etc    All
RAU (Noh & Han, 2016)        64.12  83.33  38.02  53.37  67.34
MRN (Kim et al., 2016b)      63.18  83.16  39.14  51.33  67.54
DLAIT (not published)        64.83  83.23  40.80  54.32  68.30
Naver Labs (not published)   64.79  83.31  38.70  54.79  69.26
MCB (Fukui et al., 2016)     66.47  83.24  39.47  58.00  70.10
MLB (ours)                   66.89  84.61  39.07  57.79  70.29
Human (Antol et al., 2015)   83.30  95.77  83.39  72.67  91.54
  44. 44. Recent Works DEMO
  45. 45. DEMO Q: Wait, what is this? A: It is a refrigerator.
  46. 46. DEMO
  47. 47. Q&A
  48. 48. Thank You