[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in Advances in Neural Information Processing Systems (NIPS), vol. 28, 2015.
[2] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global Vectors for Word Representation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.
[3] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, vol. 65, no. 6, p. 386, 1958.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, 1985, pp. 318-362.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[6] Z. S. Harris, "Distributional Structure," Word, vol. 10, no. 2-3, pp. 146-162, 1954.
[7] G. E. Hinton, "Learning Distributed Representations of Concepts," in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," in Proceedings of the International Conference on Learning Representations (ICLR), 2013.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Advances in Neural Information Processing Systems (NIPS), 2013.
[10] T. Mikolov, W.-t. Yih, and G. Zweig, "Linguistic Regularities in Continuous Space Word Representations," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2013.
[11] J. Han and C. Moraga, "The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning," in Proceedings of the International Workshop on Artificial Neural Networks: From Natural to Artificial Neural Computation, 1995.
[12] V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines," in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Representations by Back-propagating Errors," Nature, vol. 323, pp. 533-536, 1986.
[15] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," arXiv preprint arXiv:1412.3555, 2014.
[17] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent Models of Visual Attention," in Advances in Neural Information Processing Systems (NIPS), 2014.
[18] D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," in International Conference on Learning Representations (ICLR), 2015.
[19] J. Hu, L. Shen, and G. Sun, "Squeeze-and-Excitation Networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, USA, June 2018, pp. 7132-7141.
[20] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional Block Attention Module," in European Conference on Computer Vision (ECCV), Munich, Germany, September 2018, pp. 3-19.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is All You Need," in Advances in Neural Information Processing Systems (NIPS), 2017.
[22] J. Gao, C. Sun, Z. Yang, and R. Nevatia, "TALL: Temporal Activity Localization via Language Query," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5267-5275.
[23] B. Jiang, X. Huang, C. Yang, and J. Yuan, "Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention," in Proceedings of the 2019 International Conference on Multimedia Retrieval (ICMR), 2019, pp. 217-225.
[24] J. Wang, L. Ma, and W. Jiang, "Temporally Grounding Language Queries in Videos by Contextual Boundary-Aware Prediction," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, 2020, pp. 12168-12175.
[25] S. Zhang, H. Peng, J. Fu, and J. Luo, "Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 12870-12877.
[26] H. Tang, J. Zhu, M. Liu, Z. Gao, and Z. Cheng, "Frame-Wise Cross-Modal Matching for Video Moment Retrieval," IEEE Transactions on Multimedia, vol. 24, pp. 1338-1349, 2021.
[27] G. Wang, X. Xu, F. Shen, H. Lu, Y. Ji, and H. T. Shen, "Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query," IEEE Transactions on Multimedia, vol. 24, pp. 1221-1232, 2022.
[28] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning Spatiotemporal Features With 3D Convolutional Networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489-4497.
[29] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, "Skip-Thought Vectors," in Advances in Neural Information Processing Systems (NIPS), 2015.
[30] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang, "Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework," arXiv preprint arXiv:2202.03052, 2022.
[31] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, "The Stanford CoreNLP Natural Language Processing Toolkit," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014.
[32] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6077-6086.
[33] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei, "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations," International Journal of Computer Vision, vol. 123, no. 1, pp. 32-73, 2017.
[34] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching Word Vectors with Subword Information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[35] M. Schuster and K. K. Paliwal, "Bidirectional Recurrent Neural Networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[36] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815-823.