While we’ve feared the effects of ChatGPT on English class essays, artificial intelligence has apparently made a gigantic leap in knowledge — and it may change how we do testing forever. OpenAI’s latest language model, GPT-4, released earlier this week, can now easily ace bar exams, LSATs and other higher-learning tests.
As noted in a series of tweets by Wharton professor Ethan Mollick (based on information culled from a publicly available whitepaper by OpenAI), GPT-4 scored in the 90th percentile on the Uniform Bar Exam, the 88th percentile on the LSAT and the 93rd percentile on SAT Evidence-Based Reading and Writing, in total earning nearly top marks on over two dozen well-known exams (it seemed to have some issues with AP English).
“We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning,” explains OpenAI in a blog post. “GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks.”
No kidding. And it’s improving fast: GPT-4 passed a simulated bar exam with a score around the top 10% of test takers, while the earlier GPT-3.5 scored “around the bottom 10%.”
The really scary (or fascinating, or beneficial, depending on your perspective) aspect of GPT-4, however, is how easily it accepts visual prompts. In a series of seven samples showcased by OpenAI, GPT-4 was able to explain jokes, memes and charts, and summarize a paper that was presented as an image.
So, should you be worried? In some ways, yes. Even OpenAI admits this: “GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it ‘hallucinates’ facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.”
On the other hand, the language model has vastly improved in just a few months, meaning AI is closer to becoming a reliable tool that could be used responsibly in a number of different fields.
And maybe this will lead to a move away from standardized testing, which has its own issues. As another Wharton professor noted, “Don’t forget scores are a big chunk (maybe 1/3) of the info we use to assess humans for admissions to higher education programs — may need revisiting.”
We should ask GPT-4. It may have a suggestion.