TY - JOUR
T1 - Understanding Video Semantic Structure with Spatiotemporal Graph Random Walk
AU - Yun, Hoyeoung
AU - Kim, Minseo
AU - Kim, Eun-Sol
JO - Journal of KIISE, JOK
PY - 2024
DA - 2024/1/14
DO - 10.5626/JOK.2024.51.9.801
KW - video understanding
KW - compositional learning
KW - spatiotemporal graph
KW - random walk
KW - semantic unit
AB - Understanding a long video requires identifying the semantic units present in the video and interpreting the complex relationships among them. Conventional approaches use CNN- or transformer-based models to encode contextual information for short clips and then model the temporal relationships among those clips. However, such approaches struggle to capture complex relationships among smaller semantic units within video clips. In this paper, we represent video inputs as a spatiotemporal graph with objects as vertices and relative space-time information between objects as edges, to explicitly express the relationships among these semantic units. Additionally, we propose a novel method to represent major semantic units as compositions of smaller units, using high-order relationship information obtained by spatiotemporal random walks on the graph. Through experiments on the CATER dataset, which involves complex actions of multiple objects, we demonstrate that our approach effectively captures semantic units.