More than half of adults in the United States now use large language models (LLMs) such as ChatGPT, Gemini, and Copilot for daily tasks, ranging from drafting grocery lists to sharing personal thoughts with AI chatbots. Research suggests that people may turn to these tools because their responses can make users feel understood.
A recent study from Northwestern University compared three LLMs with expert and non-expert human judges on how well each could assess empathy in text conversations. The study, published in Nature Machine Intelligence, found that the LLMs judged empathy almost as accurately as trained experts and more consistently than laypeople.
“We believe evaluating AI models in this way could potentially teach humans something new about empathy — how we measure it and how we apply it,” said Matthew Groh, assistant professor at Kellogg School of Management and co-author of the study.
The research focused on empathy not just as a personality trait but as a communication skill—specifically, the ways people express understanding through language. “We assume that we all just understand empathy since we are humans, but communicating it is a skill,” Groh explained. “And just like any skill, you need to practice to get better at it. If someone hasn’t trained that muscle and learned the patterns behind empathic communication, then they won’t be able to truly recognize it in conversations. Our research shows that LLMs can learn the patterns and basically master the skill set.”
To conduct the study, researchers analyzed 200 real-world text message exchanges where one person shared a problem and another offered support. These included common issues such as work difficulties or family disputes, as well as sensitive topics like mental health struggles or discrimination.
Groh’s team asked three LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.7 Sonnet), three experts in empathic communication, and hundreds of non-experts to rate these conversations on factors such as encouraging elaboration and demonstrating understanding.
“Large language models’ judgments on whether someone was effective at communicating empathically mirror the judgment of our experts,” Groh said. “LLMs might not catch every nuance that an expert would recognize, but they are substantially better at it than a typical person.”
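The study's actual rubric and scoring pipeline are not detailed here, but the kind of comparison it describes can be illustrated with a toy sketch: given empathy ratings of the same conversations from an expert, an LLM judge, and a layperson (all numbers invented for illustration, not taken from the paper), one way to measure how closely each judge tracks the expert is the mean absolute difference between their scores.

```python
# Toy sketch of comparing judges against expert empathy ratings.
# All scores below are invented for illustration; they are not the study's data.

def mean_abs_diff(ratings_a, ratings_b):
    """Average absolute gap between two judges' scores on the same conversations."""
    return sum(abs(a - b) for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)

# Hypothetical 1-5 empathy scores for five support conversations.
expert    = [4, 2, 5, 3, 4]  # expert consensus rating per conversation
llm       = [4, 3, 5, 3, 4]  # an LLM judge's ratings
layperson = [5, 4, 3, 2, 5]  # a non-expert's ratings

# A smaller gap means the judge tracks the expert more closely.
print("LLM vs expert:", mean_abs_diff(llm, expert))
print("Layperson vs expert:", mean_abs_diff(layperson, expert))
```

On these made-up numbers the LLM lands much closer to the expert (0.2 vs 1.4), mirroring the pattern the study reports: near-expert accuracy, well ahead of a typical layperson.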
He added that LLMs excel because “they have seen many instances of attempts to respond in a way that makes another feel heard, allowing them to get quite good at identifying the grammar and idioms of empathic expression.”
However, there are concerns about excessive empathy from chatbots, a phenomenon known as sycophancy, or insincere flattery, which can lead AI systems to avoid difficult truths or reinforce negative feelings without proper context.
“There’s such a thing as over-validation… That’s where LLMs still need to learn from expert humans on appropriate confrontation,” Groh noted.
Groh also distinguished current commercial uses of LLMs, as companions designed for engagement, from their potential role as impartial judges that offer transparency while maintaining privacy.
Looking ahead, Groh hopes this research will help improve training for professionals who rely on empathetic communication—including psychologists, teachers, doctors, and customer service workers—and promote greater accountability when using AI chatbots for companionship.
“We hope to see carefully designed LLMs being used to help train psychologists, teachers, doctors, customer service workers in being more effective communicators,” he said. “In addition… [we] see this research as demonstrating the potential for the LLMs-as-judge paradigm to create transparency and accountability into LLMs as companions.”
Groh concluded: “We live in a better world when people feel seen, heard and validated… It sounds crazy but there’s a potential to learn from AI how to be more human.”
Other authors on the paper include Aakriti Kumar, Nalin Poungpeth and Bruce Lambert from Northwestern; Diyi Yang from Stanford; and Erina Farrell from Pennsylvania State University.


