Decisions in seconds: Why Copilot & Co. are not taking over a recruiter’s job tomorrow
- Marcus

- Feb 13

Microsoft Copilot, ChatGPT & Co. are increasingly being tested in recruiting. Upload a résumé, analyze a LinkedIn profile, assess personality and intelligence – and the “smart pre-selection” is done. Sounds efficient. But is it?
A recent study by Tobias Marc Härtel in the Journal of Business and Psychology examined exactly that. Microsoft Copilot analyzed 406 LinkedIn profiles. The individuals behind those profiles had previously completed validated psychological assessments. In other words, researchers knew their actual personality traits and intelligence levels.
The result: the AI was accurate only to a limited extent. And in some cases, clearly wrong. This raises some interesting and important considerations for recruiters looking to make their work easier with Copilot and other LLMs:
First problem: The AI is not stable in its evaluations.
The same profile was evaluated twice. Sometimes the results were almost identical. Sometimes they were not. For several personality traits, the ratings varied noticeably.
Translated: The evaluation depends to a meaningful extent on how the model “thinks” at that moment. For a tool that might influence hiring or rejection decisions, that is problematic.
If two runs yield different outcomes, this is not a sound basis for decision-making.
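To make that concrete: if you can export two rating runs over the same profiles, the stability check takes a few lines of Python. This is a minimal sketch, not the study's method; the file names and column layout are hypothetical placeholders.

```python
# Minimal test-retest stability check for LLM ratings.
# File names and columns are hypothetical placeholders.
import pandas as pd
from scipy.stats import pearsonr

# Two rating runs over the same profiles: one row per profile,
# one column per Big Five trait.
run1 = pd.read_csv("copilot_ratings_run1.csv")
run2 = pd.read_csv("copilot_ratings_run2.csv")

traits = ["conscientiousness", "extraversion", "openness",
          "agreeableness", "neuroticism"]

for trait in traits:
    r, _ = pearsonr(run1[trait], run2[trait])
    print(f"{trait}: test-retest r = {r:.2f}")
```

Validated questionnaires typically reach test-retest reliabilities around .80 or higher; anything far below that is a shaky basis for decisions about people.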
Second problem: It reflects reality only moderately.
The key question is: How closely do the AI’s assessments match the actual test results?
The correlations were roughly:
- Intelligence: weak positive
- Openness: weak positive
- Extraversion: weak positive
- Other traits: barely or not at all aligned
What does that mean?
The AI detects slight tendencies. But it does not capture a robust personality structure. Its accuracy is somewhere between “better than guessing” and “far from diagnostically useful.”
The data does not support the idea that Copilot can reliably determine how conscientious or emotionally stable someone is.
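For readers who want to run this kind of convergent-validity check on their own data: a minimal sketch, assuming a table that pairs AI ratings with validated test scores (all file and column names are hypothetical).

```python
# Convergent validity: do AI ratings track validated test scores?
# Column and file names are hypothetical placeholders.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("ratings_vs_tests.csv")

for trait in ["intelligence", "openness", "extraversion"]:
    r, p = pearsonr(df[f"ai_{trait}"], df[f"test_{trait}"])
    print(f"{trait}: r = {r:.2f} (p = {p:.3f})")
```

As a rule of thumb in this literature, r ≈ .10 is weak, .30 moderate, .50 strong. "Weak positive" means a faint signal, not a usable diagnosis.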
Third problem: The “everyone looks good” effect
One striking finding: The AI systematically rated people more positively than their validated test results indicated.
On average, candidates were assessed as:
- more conscientious
- more intelligent
- more open
- less neurotic
Why?
LinkedIn is a self-presentation platform. And AI models are trained to recognize positive, socially desirable patterns. The result is a double bias. Almost everyone appears “above average.”
For recruiting, this is critical. Selection depends on differentiation. If everyone ends up in the upper range, the tool stops being useful.
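A positivity-bias check is equally simple, assuming AI ratings and questionnaire scores share the same rating scale (a hypothetical setup; all names are placeholders):

```python
# Positivity bias: are AI ratings shifted upward relative to
# validated scores? Assumes both share the same rating scale.
import pandas as pd
from scipy.stats import ttest_rel

df = pd.read_csv("ratings_vs_tests.csv")  # hypothetical file

for trait in ["conscientiousness", "intelligence", "openness"]:
    ai, test = df[f"ai_{trait}"], df[f"test_{trait}"]
    shift = (ai - test).mean()
    _, p = ttest_rel(ai, test)
    print(f"{trait}: shift = {shift:+.2f} (p = {p:.3f}), "
          f"AI sd = {ai.std():.2f} vs test sd = {test.std():.2f}")
```

The standard deviations matter as much as the shift: if the AI compresses everyone into the upper range, the spread collapses, and with it the differentiation that selection depends on.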
Fourth problem: The halo effect in algorithmic form
The study also shows that the AI does not clearly separate different traits.
Candidates with many career steps, international exposure, or academic credentials were rated higher across multiple dimensions. Intelligence, openness, and conscientiousness begin to blur together.
In essence, this is a digital halo effect: One positive signal influences everything else. The difference is that this time it is not human bias; it is embedded in the model.
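One way to spot this halo pattern in your own data is to compare the inter-trait correlations of the AI ratings with those of the validated scores (hypothetical columns again):

```python
# Halo check: if one positive signal bleeds into everything,
# AI-rated traits will intercorrelate far more strongly than
# the validated traits do. Column names are placeholders.
import pandas as pd

df = pd.read_csv("ratings_vs_tests.csv")

traits = ["intelligence", "openness", "conscientiousness"]
print(df[[f"ai_{t}" for t in traits]].corr().round(2))
print(df[[f"test_{t}" for t in traits]].corr().round(2))
```

Big Five traits are constructed to be largely independent, so uniformly high off-diagonal values on the AI side are the halo effect in numbers.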
Fifth problem: More text = better personality?
A few examples from the analysis:
- Longer profiles → better overall ratings
- Profiles written in English → more positive evaluations
- Many followers → higher extraversion scores
This sounds less like a structured personality assessment and more like surface-level pattern recognition. More content gets interpreted as more competence. Whether that assumption holds true is another matter entirely.
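Whether surface features drive the ratings is testable. A minimal sketch: regress the AI's overall rating on variables that should be irrelevant to personality (all names hypothetical; is_english is a 0/1 flag).

```python
# Surface-feature audit: does presentation effort predict the
# AI's rating? File and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("profiles.csv")
df["log_followers"] = np.log1p(df["followers"])

model = smf.ols(
    "ai_overall_rating ~ profile_length + log_followers + is_english",
    data=df,
).fit()
print(model.summary().tables[1])
```

Significant coefficients on length, follower count, or profile language mean the model is scoring self-presentation, not personality.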
Sixth problem: Bias does not disappear just because it is algorithmic
Differences across gender and age groups were also observed: some traits were systematically rated differently. The effect sizes were not dramatic, but they were consistent.
And this is where it becomes sensitive.
If organizations use such systems in pre-selection, they carry responsibility for the outcomes – not the software provider. In the European context, with GDPR, anti-discrimination law, and the upcoming EU AI Act, this is not a side issue. (See also: https://www.talentacquisitionleader.com/post/bad-ki-bunny-die-klagen-gegen-workday-und-eightfold-und-ihre-wirkung-f%C3%BCr-recruiting-teams)
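Given that responsibility, a basic group-difference audit belongs in any deployment checklist. A minimal sketch using standardized mean differences (Cohen's d), with hypothetical column names:

```python
# Group-difference audit: standardized mean differences in AI
# ratings by group. Assumes a binary 'gender' column for
# simplicity; column and file names are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("ratings.csv")

def cohens_d(a, b):
    # Simple unweighted pooled SD; fine for similar group sizes.
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

g1, g2 = (g for _, g in df.groupby("gender"))
for trait in ["conscientiousness", "intelligence", "openness"]:
    print(f"{trait}: d = {cohens_d(g1[trait], g2[trait]):+.2f}")
```

Small but consistent d values across traits are exactly the pattern described above, and exactly what an audit should surface before a tool touches pre-selection.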
What does this mean for practice?
The study does not say: “AI in recruiting is nonsense.”
It says: AI is not a substitute for sound assessment practice – at least not in this form.
Where are the limits?
- No reliable personality diagnosis from LinkedIn profiles
- High susceptibility to bias (positivity bias, halo effects)
- Low differentiation between candidates
- Risk of systematic group differences
- Limited transparency in how decisions are formed
Where can AI add value?
- Structuring profile information
- Creating summaries
- Matching against hard criteria
- Supporting interview preparation
- Complementing – not replacing – human judgment
The core point
Recruiting is not a text analysis exercise. It is a decision-making process with real consequences. LLMs are excellent at processing language. But personality is more than a linguistic pattern.
The temptation is obvious: fast, cost-efficient, scalable. Yet “AI-supported” does not automatically mean “quality assured.” Anyone using Copilot & Co. for personality evaluation should be aware that the scientific foundation is currently thin. And convenience is not a substitute for evidence.