|
[1]R. Gao, T. H. Oh, K. Grauman, and L. Torresani, “Listen to look: Action recognition by previewing audio,” CVPR, pp. 10457-10467, 2020. [2]J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CVPR, pp. 4724-4733, 2017. [3]J. Donahue, L.A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” CVPR, 2017. [4]Y.H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: deep networks for video classification,”CVPR, 2015. [5]Z. Qiu, T. Yao, and T. Mei, “Learning spatio temporal representation with pseudo3d residual networks,” ICCV, pp. 5534-5542, 2017. [6]G. Thung and H. Jiang, “A torch library for action recognition and detection using CNNs and LSTMs,” 2016. [7]D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” CVPR, pp. 4489-4497, 2015. [8]S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition.,” TPAMI, Vol.35, No.1, pp. 221-231, 2012. [9]D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” CVPR, pp. 6450-6459, 2018. [10]K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” NIPS, 2014. [11]C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” CVPR, pp. 1933-1941, 2016. [12]L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” CVPR, pp. 20-36, 2016. [13]S. Karaman, L. Sei denari, and A.D. Bimbo, “Fast saliency based pooling of Fisher encoded dense trajectories,” ECCV, 2014. [14]D. Oneata, J. Verbeek, and C. Schmid, “The LEAR submission at Thumos 2014,” ECCV, 2014. [15]L. Wang, Y. Yu Qiao, and X. Tang, “Action recognition and detection by combining motion and appearance features,” ECCV, 2014. [16]R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR, 2014. [17]S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real time object detection with region proposal networks” NIPS, pp. 91-99, 2015. [18]Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang, “A pursuit of temporal accuracy in general activity detection,” arXiv:1703.02716, 2017. [19]R. Hou, C. Chen, and M. Shah, “Tube convolutional neural network (T-CNN) for action detection in videos,” ICCV, 2017. [20]T. Lin, X. Liu, X. Li, E. Ding, and S. Wen, “Bmn: Boundary-matching network for temporal action proposal generation,” ICCV, pp. 3889-3898, 2019. [21]Z. Shou, D. Wang, and S.F. Chang, “Temporal action localization in untrimmed videos via multi-stage cnns,” CVPR, pp. 1049-1058, 2016. [22]V. Escorcia, F. Heilbron, J. Niebles, and B. Ghanem, “DAPs: deep action Proposals for Action Understanding,” ECCV, pp. 768-784, 2016. [23]A. Montes, A. Salvador, and X. Nieto, “Temporal activity detection in untrimmed videos with recurrent neural networks,” arXiv:1608.08128, 2016. [24]B. Singh, T. Marks, M. Jones, O. Tuzel, and M. Shao, “A Multi stream bidirectional recurrent neural network for fine-grained action detection,” CVPR, 2016. [25]S. Yeung, O. Russakovsky, G. Mori, and L. FeiFei, “End-to-end learning of action detection from frame glimpses in videos,” CVPR, pp. 2678-2687, 2016. [26]S. Ma, L. Sigal, and S. Sclaroff, “Learning activity progression in LSTMs for activity detection and early detection,” CVPR, pp. 1942-1950, 2016. [27]S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput, Vol.9, No.8, pp. 1735-1780, 1997. [28]National Taiwan University Ph.D Hung-yi Lee Official website, https://speech.ee.ntu.edu.tw/~hylee/. [29]V. Mnih, N. Heess, and A. Graves, “Recurrent models of visual attention,” NIPS, pp. 2204-2212, 2014. [30]D. Bandanna, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv: 1409.0473, 2014. [31]A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, and I. Polosukhin, “Attention is all you need,” NIPS, pp. 5998-6008, 2017. [32]A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv: 2010.11929, 2020. [33]pytorch Official website, https://pytorch.org/ [34]K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CVPR, pp. 770-778, 2016. [35]K. Soomro, A.R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv: 1212.0402, 2012. [36]F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” CVPR, pp. 961-970, 2015. [37]H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a large video database for human motion recognition,” ICCV, 2011. [38]Kingma, P. Diederik, and B. Jimmy. "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
|