Human content moderators still outperform AI when it comes to recognizing policy-violating material, but they also cost significantly more.
Marketers looking to ensure that their ads do not surface in a toxic slurry face a dilemma – spend more money or see more Hitler.
Researchers affiliated with AI brand protection biz Zefr did the math, detailed in a preprint paper titled “AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety.”
The paper, accepted at the upcoming Computer Vision in Advertising and Marketing (CVAM) workshop at the 2025 International Conference on Computer Vision, presents an analysis of the cost and effectiveness of multimodal large language models (MLLMs) for brand safety tasks.
The researchers’ calculations show human moderation to be a premium indulgence, at almost 40x the cost of the most cost-efficient machine learning labor.
Brand safety means preventing inappropriate content from becoming associated with a brand and damaging that brand’s reputation. It has become something of a moving target in the wake of the Trump administration’s rollback of diversity, equity, and inclusion initiatives. It’s distinct from consumer-facing content moderation on social media sites like Meta’s Instagram, which stands accused of knowingly distributing content that’s harmful [PDF] and faces related litigation.
The Zefr team explains, “Advertisers define content categories they wish to avoid; ranging from violent or adult-themed material to controversial political discourse. While general content moderation aims to identify and manage policy-violating content, brand safety is specifically concerned with aligning ad placements with advertiser preferences.”
Typically, the authors say, brand safety efforts involve a combination of human review and machine learning-based analysis of imagery, audio, and text. The purpose of the study was to look at whether MLLMs can do the job well and at what cost.
They evaluated six models – GPT-4o, GPT-4o-mini, Gemini-1.5-Flash, Gemini-2.0-Flash, Gemini-2.0-Flash-Lite, and Llama-3.2-11B-Vision – and human review using a dataset of 1500 videos, consisting of 500 videos from each of the following categories: Drugs, Alcohol and Tobacco (DAT); Death, Injury and Military Conflict (DIMC); and Kid’s Content.
Researchers scored performance in each of three categories: precision, recall, and F1, which are common metrics for machine learning evaluation. Precision is the share of content flagged as violating (predicted positives) that actually violates policy; recall is the percentage of actual violations that get flagged correctly; and F1 is the harmonic mean of precision and recall.
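For readers unfamiliar with these metrics, here's a minimal sketch (not code from the paper) showing how precision, recall, and F1 fall out of true positives, false positives, and false negatives, with 1 denoting a policy violation:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = violation)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # flagged but clean
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed violations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A moderator that flags two of four videos, catching one real violation
# and missing one, scores 0.5 on all three metrics:
print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5, 0.5)
```

F1 is the usual single-number summary because it punishes a model that games either side alone — flagging everything (perfect recall, poor precision) or flagging almost nothing (high precision, poor recall).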
Overall scores (precision, recall, F1) were as follows, where 1.00 would represent 100 percent accuracy, with no false positives or false negatives:
[Table: precision, recall, and F1 for each model and for human review — values not reproduced here.]
“Among the MLLMs the Gemini models emerge as the best overall models, outperforming the others in terms of F1-score,” the researchers state in their paper, adding that, interestingly, the compact versions of these models do not perform significantly worse.
“These results underscore the effectiveness of MLLMs in automating content moderation but also highlight the continued superiority of human reviewers in accuracy, particularly in more complex or nuanced classifications where context and deep understanding are required,” the paper states.
The researchers also observed that these models often failed due to incorrect associations, a lack of contextual understanding, and language differences. One example they cited is a Japanese-language video discussing caffeine addiction that all the models incorrectly flagged as a drug category violation. The authors attributed this to flawed associations with the term addiction and gaps in the contextual understanding of Japanese. Generally, they said, these models exhibit poorer performance on non-English content.
In terms of cost, superior human moderation looks like a luxury. Here’s how the models compare in terms of F1 score and price.
[Table: F1 score and cost for each model and for human review — values not reproduced here.]
“We showed that the compact MLLMs offer a significantly cheaper alternative compared to their larger counterparts without sacrificing accuracy,” the authors conclude. “However, human reviewers remain superior in accuracy, particularly in complex or nuanced classifications.”
“While multimodal large language models like Gemini and GPT can handle brand safety video moderation across text, audio and visuals with surprising accuracy and far lower costs than human reviewers alone, they still fall short on nuanced, context-heavy cases – making a hybrid human and AI approach the most effective and economical path forward for content moderation in the brand safety and suitability landscape,” said Jon Morra, Zefr’s Chief AI Officer, in an emailed statement.
The dataset and prompts used have been published to GitHub. ®