First, we trained and tested all three models on SQuAD 2.0, as shown in Table 3. Following Rajpurkar et al. (2016), we report average exact match and F1 scores. The best model, DocQA + ELMo, achieves only 66.3 F1 on the test set, 23.2 points lower than the human accuracy of 89.5 F1. Note that a baseline that always abstains gets 48.9 test F1; existing models are closer to this baseline than they are to human performance. Therefore, we see significant room for model improvement on this task. We also compare with reported test numbers for analogous model architectures on SQuAD 1.1. There is a much larger gap between humans and machines on SQuAD 2.0 compared to SQuAD 1.1, which confirms that SQuAD 2.0 is a much harder dataset for existing models.
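For reference, per-example exact match and token-level F1 are typically computed along the lines of the sketch below (a minimal illustration, not the official evaluation script; function names and the normalization details are assumptions). Under this scoring, an unanswerable question is credited only when the system abstains, which is why an always-abstain baseline already attains a nontrivial F1 on a dataset with many negative examples.

```python
# Minimal sketch of SQuAD-style per-example scoring (illustrative only;
# not the official evaluation script).
import collections
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Token-overlap F1; empty gold means the question is unanswerable."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Unanswerable case: credit only an empty (abstaining) prediction.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```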