These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models