2. Related Works

Capturing motion and temporal information has long been studied for activity recognition. Early, hand-crafted approaches such as dense trajectories [24] captured motion information by tracking points through time. Many algorithms have been developed to compute optical flow as a way to capture motion in video [8]. Other works have explored learning the ordering of frames to summarize a video in a single 'dynamic image' used for activity recognition [1].

Convolutional neural networks (CNNs) have been applied to activity recognition. Initial approaches explored methods to combine temporal information based on pooling or temporal convolution [12, 17]. Other works have explored using attention to capture sub-events of activities [18]. Two-stream networks have been very popular: they take as input a single RGB frame (capturing appearance information) and a stack of optical flow frames (capturing motion information). Often, the two streams of the model are trained separately and their final predictions are averaged together [20]. Other two-stream CNN works explored different ways to 'fuse' or combine the motion CNN with the appearance CNN [7, 6]. There have also been large 3D XYT CNNs that learn spatio-temporal patterns [26, 3], enabled by large video datasets such as Kinetics [13]. However, these approaches still rely on optical flow input to maximize their accuracy.
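The late-fusion scheme used by two-stream networks [20] can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the logits are hypothetical stand-ins for the outputs of an appearance CNN (one RGB frame) and a motion CNN (a stack of optical flow frames), and the fusion is a simple average of the per-stream class probabilities.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical per-class scores for one video clip; in a real two-stream
# model these would come from the separately trained RGB and flow CNNs.
rgb_logits = np.array([2.0, 0.5, -1.0])   # appearance stream
flow_logits = np.array([1.0, 1.5, -0.5])  # motion stream

# Late fusion: average the two streams' class probabilities.
fused = 0.5 * (softmax(rgb_logits) + softmax(flow_logits))
predicted_class = int(np.argmax(fused))
```

The fusion works [7, 6] cited above replace this final averaging with learned combinations at intermediate layers of the two CNNs.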