resentation on features from a 3D CNN. As 3D CNNs are expensive to train, we follow the method of I3D [3] to inflate a ResNet-18 pretrained on ImageNet to a 3D CNN for videos. We also compare to the (2+1)D method of [26], in which a spatial convolution is followed by a temporal convolution, producing a similar feature that combines spatial and temporal information. We find that our flow layer increases performance even though 3D and (2+1)D CNNs already capture some temporal information (Tables 6 and 7). These experiments used 10 iterations and learned the flow parameters; FcF was not used.
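The inflation step borrowed from I3D [3] can be illustrated with a minimal NumPy sketch. It assumes the usual I3D recipe: each pretrained 2D kernel is repeated along a new temporal axis and rescaled by the temporal extent, so that a clip of identical frames yields the same response as the 2D filter on a single frame. The function name `inflate_kernel`, the temporal size `t=3`, and the random weights are hypothetical choices for illustration and are not taken from the experiments above.

```python
import numpy as np


def inflate_kernel(w2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate a 2D conv kernel (out, in, kh, kw) to 3D (out, in, t, kh, kw).

    The 2D weights are copied t times along a new temporal axis and divided
    by t, so summing the 3D kernel over time recovers the 2D kernel.
    """
    if t < 1:
        raise ValueError("temporal extent t must be >= 1")
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)
    return w3d / t


# Hypothetical stand-in for pretrained weights: 4 filters, 3 channels, 3x3.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((4, 3, 3, 3))
w3d = inflate_kernel(w2d, t=3)  # t=3 is an assumed temporal size

# A static clip: the same (in, kh, kw) patch repeated over 3 frames.
patch = rng.standard_normal((3, 3, 3))
clip = np.repeat(patch[:, None, :, :], 3, axis=1)  # (in, t, kh, kw)

# Filter responses at a single location (no sliding window needed here).
resp2d = np.einsum("oihw,ihw->o", w2d, patch)
resp3d = np.einsum("oithw,ithw->o", w3d, clip)
```

On the static clip, `resp3d` matches `resp2d` up to floating-point error, which is the property that lets the inflated network start from the ImageNet-pretrained behaviour before it is fine-tuned on video.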