Visual Captioning and Q&A reading list


  • Virginia Tech / MSR [Web] [Paper]
    • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, VQA: Visual Question Answering, CVPR 2015, SUNw: Scene Understanding Workshop.
  • MPI / Berkeley [Web] [Paper]
    • Mateusz Malinowski, Marcus Rohrbach, Mario Fritz, Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, ICCV 2015.
  • Toronto [Paper] [Dataset]
    • Mengye Ren, Ryan Kiros, Richard Zemel, Image Question Answering: A Visual Semantic Embedding Model and a New Dataset, ICML 2015 DL workshop.
  • Baidu / UCLA [Paper] [Dataset]
    • Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu, Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, NIPS 2015.
  • POSTECH [Paper] [Project Page]
    • Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han, Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction, arXiv:1511.05765.
  • Berkeley: [Paper]
    • Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein, Learning to Compose Neural Networks for Question Answering, NAACL 2016.
    • Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein, Deep Compositional Question Answering with Neural Module Networks, CVPR 2016.
  • MIT + Facebook: [Paper]
    • Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus, Simple Baseline for Visual Question Answering, arXiv:1512.02167v2, Dec 2015.
  • Adelaide Uni: [Web] [Paper1] [Paper2]
    • Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, Anthony Dick, Explicit Knowledge-based Reasoning for Visual Question Answering, arXiv:1511.02570v2, Nov 2015.
    • Qi Wu, Peng Wang, Chunhua Shen, Anton van den Hengel, Anthony Dick, Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources, CVPR 2016.
  • Stanford + Dresden: [Paper]
    • Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei, Visual7W: Grounded Question Answering in Images, CVPR 2016.
  • NICTA + Toyota: [Paper]
    • Aiwen Jiang, Fang Wang, Fatih Porikli, Yi Li, Compositional Memory for Visual Question Answering, arXiv:1511.05676v1, Nov 2015.
  • CMU + MSR: [Web] [Paper]
    • Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, Stacked Attention Networks for Image Question Answering, arXiv:1511.02274v2, Jan 2016.
  • USC + Baidu: [Paper]
    • Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia, ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering, arXiv:1511.05960v1, Nov 2015.
  • UMass: [Paper]
    • Huijuan Xu, Kate Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, arXiv:1511.05234v1, Nov 2015.

Dynamic Memory Network: [Paper]

  • Caiming Xiong, Stephen Merity, Richard Socher, Dynamic Memory Networks for Visual and Textual Question Answering, arXiv:1603.01417v1, Mar 2016.


Researchers to keep an eye on: Dhruv Batra, Marcus Rohrbach, Chunhua Shen.





  • Stanford
    • [Paper] Justin Johnson*, Andrej Karpathy*, Li Fei-Fei, DenseCap: Fully Convolutional Localization Networks for Dense Captioning, CVPR, 2015.
    • [Paper] Justin Johnson*, Andrej Karpathy*, Li Fei-Fei, DenseCap: Fully Convolutional Localization Networks for Dense Captioning, CVPR, 2016.
    • [Project] [Paper] Andrej Karpathy, Li Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR, 2015.



  • Datasets
    • Instructional Video Captions – (Google Inc, 2015 & University of Rochester, 2015)
      • References
        • What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision. Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nick Johnston, Andrew Rabinovich, and Kevin Murphy. NAACL 2015
        • Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments. I. Naim, Y. Song, Q. Liu, L. Huang, H. Kautz, J. Luo, and D. Gildea. NAACL 2015
    • MPII Movie Description dataset
      • Featuring movie snippets aligned to scripts and DVS (Descriptive Video Service). DVS is a linguistic description that allows visually impaired people to follow a movie. The authors benchmark state-of-the-art computer vision algorithms to recognize scenes, human activities, and participating objects, and achieve encouraging results in video description on this new, challenging dataset. The dataset contains a parallel corpus of over 68K sentences and video snippets from 94 HD movies.
      • References:
        • Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele, A Dataset for Movie Description, CVPR 2015
        • Anna Rohrbach, Marcus Rohrbach, Bernt Schiele, The Long-Short Story of Movie Description, GCPR 2015
    • Microsoft Research Video Description Corpus (MS VDC) (UT Austin & MSR, 2011)
      • MS VDC contains parallel descriptions (85,550 of them in English) of 2,089 short video snippets (10-25 seconds long). The descriptions are one-sentence summaries of the action or event in the video, as written by Amazon Mechanical Turk workers. The dataset captures both paraphrase and bilingual alternatives, so it can be useful for translation, paraphrasing, and video description purposes.