[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in Advances in Neural Information Processing Systems (NIPS), vol. 28, 2015.
[2] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global Vectors for Word Representation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.
[3] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, vol. 65, no. 6, p. 386, 1958.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, 1985, pp. 318-362.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[6] Z. S. Harris, "Distributional Structure," Word, vol. 10, no. 2-3, pp. 146-162, 1954.
[7] G. E. Hinton, "Learning Distributed Representations of Concepts," in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," in Proceedings of the International Conference on Learning Representations (ICLR), 2013.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Advances in Neural Information Processing Systems (NIPS), 2013.
[10] T. Mikolov, W.-t. Yih, and G. Zweig, "Linguistic Regularities in Continuous Space Word Representations," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2013.
[11] J. Han and C. Moraga, "The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning," in Proceedings of the International Workshop on Artificial Neural Networks: From Natural to Artificial Neural Computation, 1995.
[12] V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines," in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Representations by Back-propagating Errors," Nature, vol. 323, pp. 533-536, 1986.
[15] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," arXiv preprint arXiv:1412.3555, 2014.
[17] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent Models of Visual Attention," in Advances in Neural Information Processing Systems (NIPS), 2014.
[18] D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," in International Conference on Learning Representations (ICLR), 2015.
[19] J. Hu, L. Shen, and G. Sun, "Squeeze-and-Excitation Networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, USA, June 2018, pp. 7132-7141.
[20] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional Block Attention Module," in European Conference on Computer Vision (ECCV), Munich, Germany, September 2018, pp. 3-19.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is All You Need," in Advances in Neural Information Processing Systems (NIPS), 2017.
[22] J. Gao, C. Sun, Z. Yang, and R. Nevatia, "TALL: Temporal Activity Localization via Language Query," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5267-5275.
[23] B. Jiang, X. Huang, C. Yang, and J. Yuan, "Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention," in Proceedings of the 2019 International Conference on Multimedia Retrieval (ICMR), 2019, pp. 217-225.
[24] J. Wang, L. Ma, and W. Jiang, "Temporally Grounding Language Queries in Videos by Contextual Boundary-Aware Prediction," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, 2020, pp. 12168-12175.
[25] S. Zhang, H. Peng, J. Fu, and J. Luo, "Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 12870-12877.
[26] H. Tang, J. Zhu, M. Liu, Z. Gao, and Z. Cheng, "Frame-Wise Cross-Modal Matching for Video Moment Retrieval," IEEE Transactions on Multimedia, vol. 24, pp. 1338-1349, 2021.
[27] G. Wang, X. Xu, F. Shen, H. Lu, Y. Ji, and H. T. Shen, "Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query," IEEE Transactions on Multimedia, vol. 24, pp. 1221-1232, 2022.
[28] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning Spatiotemporal Features With 3D Convolutional Networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489-4497.
[29] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, "Skip-Thought Vectors," in Advances in Neural Information Processing Systems (NIPS), 2015.
[30] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang, "Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework," arXiv preprint arXiv:2202.03052, 2022.
[31] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, "The Stanford CoreNLP Natural Language Processing Toolkit," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014.
[32] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6077-6086.
[33] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei, "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations," International Journal of Computer Vision, vol. 123, no. 1, pp. 32-73, 2017.
[34] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching Word Vectors with Subword Information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[35] M. Schuster and K. K. Paliwal, "Bidirectional Recurrent Neural Networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[36] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815-823.