It is necessary to have a model capable of predicting marginal heatmaps in order to use the prediction strategy outlined in Section. Since pose estimation data is inherently spatial, convolutional layers are a natural foundation for the model.

The calculation performed by each convolutional layer is spatially local: for any given output pixel, the value of that pixel is calculated using input pixels that lie within a fixed spatial neighbourhood. This is appropriate when both the input and output images exist in the same coordinate space and there is a correlation between the locations of input and output features. For example, in 2D pose estimation the output heatmaps and the input RGB image both exist in xy coordinate space, and the ground-truth target spherical Gaussians align with the joints in the input image.

However, we require our model to output not only an xy heatmap, Ĥ(xy), but also heatmaps that have one axis in the z-direction, Ĥ(zy) and Ĥ(xz). This poses a challenge for convolution-based computation. Consider the case of predicting a heatmap in the zy-plane, Ĥ(zy), from an input image in the xy-plane. In general, a location in the z-direction does not correspond to a location in the x-direction. This means that there may be quite some distance between visual evidence in the input image and the desired prediction location in the output image. Such an arrangement is generally not ideal for convolutional neural networks. For 3D pose estimation, however, the spatial discrepancy is never along both axes at once (Table shows axis correspondences for each of the three heatmaps). It is therefore desirable to preserve spatial locality of computation along the appropriate axes.

Axis permutation. By transposing the intermediate activations, it is possible to permute the axis undergoing spatially-local calculations with the axis undergoing densely-connected calculations.
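As a rough illustration of this idea, the permutation for the Ĥ(zy) branch can be sketched as a single axis transpose of the activation tensor. The NCHW layout and tensor sizes here are assumptions for the sketch, not the paper's exact implementation:

```python
import numpy as np

# Intermediate activations in NCHW layout: (batch, channels, height=y, width=x).
feats = np.zeros((1, 64, 32, 32))

# Swap the channel axis with the x axis. Reading the result as NCHW again,
# the old x axis is now the densely-connected channel axis, while the old
# channel axis becomes a spatial axis that the network can use for z.
# Subsequent convolutions are therefore spatially local over the (y, z) plane.
permuted = np.transpose(feats, (0, 3, 2, 1))  # (batch, x, y, channels -> z)

print(permuted.shape)  # (1, 32, 32, 64)
```

Because the transpose is a pure re-indexing of the activations, it is differentiable and parameter-free.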
The model can therefore be built using convolutional layers without depending on spatial correspondence between mismatched axes. This allows the model to aggregate depth cues into feature maps, which then become pixel values along the z-axis. Figure 5 illustrates the axis permutation operation for Ĥ(zy). Note that the permutation operation is simply a fixed manipulation of the activations, and does not add any parameters to the model.

Overall model architecture. Figure 6 illustrates the arrangement of residual blocks we used to produce heatmaps from image features. Residual blocks are constructed as per ResNet, using "option C" shortcut connections. For the network paths predicting Ĥ(zy) and Ĥ(xz), the axis permutation operation is applied midway through the stage.

The complete model is assembled according to Figure 4. Features are extracted from 256 × 256 pixel input images using a truncated Inception v4 model. Multiple heatmap prediction stages are stacked together after the feature extractor to increase the capacity of the model. "Adapter" 1 × 1 convolution layers are placed in between the stages to combine the previous heatmap predictions into feature maps, which are added to the previous stage's input to form a large skip connection. This stacking technique is inspired by the Stacked Hourglass architecture for 2D pose estimation.
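The stacking-with-adapters pattern can be sketched as follows. The `stage` placeholder, channel counts, and random weights are all assumptions for illustration; in the real model each stage is a series of residual blocks (with axis permutation on the zy/xz paths), and the adapter is a learned 1 × 1 convolution:

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is an independent linear map over channels at
    # each pixel. x: (C_in, H, W), w: (C_out, C_in).
    return np.einsum('oc,chw->ohw', w, x)

def stage(x):
    # Placeholder for one heatmap prediction stage (really a series of
    # residual blocks); any shape-preserving map works for the sketch.
    return np.tanh(x)

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
stage_in = rng.standard_normal((C, H, W))  # features from the extractor

heatmaps = []
for _ in range(3):                          # three stacked stages
    hm = stage(stage_in)                    # predict heatmaps from current input
    heatmaps.append(hm)
    adapter = rng.standard_normal((C, C)) * 0.1
    # The "adapter" 1x1 conv turns the heatmap predictions back into
    # feature maps; adding them to the stage's input forms the large
    # skip connection between stages.
    stage_in = stage_in + conv1x1(hm, adapter)
```

Intermediate supervision can then be applied to every entry of `heatmaps`, as in the Stacked Hourglass training scheme.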