RGB frames as an input. T is the number of frames the model processes, and W and H are the spatial dimensions. The CNN outputs a prediction per timestep, and these are temporally averaged to produce a probability for each class. The model is trained to minimize cross-entropy: