We confirm that SQuAD 2.0 is both challenging and high-quality. A state-of-the-art model achieves only 66.3% F1 score when trained and tested on SQuAD 2.0, whereas human accuracy is 89.5% F1, a full 23.2 points higher. The same model architecture trained on SQuAD 1.1 gets 85.8% F1, only 5.4 points worse than humans. We also show that our unanswerable questions are more challenging than ones created automatically, either via distant supervision (Clark and Gardner, 2017) or a rule-based method (Jia and Liang, 2017). We release SQuAD 2.0 to the public as the new version of SQuAD, and make it the primary benchmark on the official SQuAD leaderboard. We are optimistic that this new dataset will encourage the development of reading comprehension systems that know what they don’t know.
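The F1 numbers above are token-overlap F1 between a predicted answer span and the gold answer. As a minimal sketch (simplified from the official SQuAD evaluation script, with no lowercasing or punctuation normalization), the metric can be computed as follows; the function name `token_f1` and the example strings are illustrative, not from the dataset:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 in the style of the SQuAD evaluation
    (simplified: whitespace tokenization, no answer normalization)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # For unanswerable questions (SQuAD 2.0), the gold answer is empty,
    # so a prediction scores 1.0 only if it is also empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Denver Broncos", "the Denver Broncos"))  # 0.8
```

Under this metric, abstaining on an unanswerable question counts as a correct prediction, which is what lets a system be rewarded for knowing what it doesn't know.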