Human content moderation is around 40 times more expensive than AI-based moderation services, even though humans remain somewhat better at identifying and handling objectionable content online.
A new preprint paper titled "AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety" has shown that multimodal large language models (MLLMs) are significantly more cost-effective than human moderators for brand safety tasks.
The study, written by researchers affiliated with AI brand protection biz Zefr, evaluated six models - GPT-4o, GPT-4o-mini, Gemini-1.5-Flash, Gemini-2.0-Flash, Gemini-2.0-Flash-Lite, and Llama-3.2-11B-Vision - alongside human review, using a dataset of 1,500 videos.
The researchers found that the most cost-efficient AI models, such as GPT-4o-mini ($25) and Gemini-2.0-Flash-Lite ($28), achieved relatively high accuracy (F1-scores of 0.88 and 0.91 respectively) at a fraction of the cost of human moderation. Human moderators' cost was calculated by multiplying their total review time by an hourly rate, while AI model costs were assessed based on token consumption and operational expenses.
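As a rough illustration of that cost methodology, the comparison boils down to two simple formulas. The sketch below uses made-up review times, hourly rates, and per-token prices - they are placeholders chosen for illustration, not figures from the paper or any provider's actual pricing.

```python
# Illustrative cost comparison. All numbers here are hypothetical placeholders,
# not the study's figures or real provider prices.

def human_cost(total_review_hours: float, hourly_rate: float) -> float:
    """Human moderation cost: total review time multiplied by an hourly rate."""
    return total_review_hours * hourly_rate

def model_cost(input_tokens: int, output_tokens: int,
               price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Model cost: token consumption priced at per-token rates."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# Example: 1,500 videos at ~2 minutes of human review each, versus a multimodal
# model consuming a large number of (mostly visual) input tokens.
print(human_cost(total_review_hours=1500 * 2 / 60, hourly_rate=20.0))        # 1000.0
print(model_cost(input_tokens=40_000_000, output_tokens=1_000_000,
                 price_in_per_1k=0.0006, price_out_per_1k=0.0006))           # 24.6
```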
Larger models such as Llama-3.2-11B-Vision ($459) and GPT-4o ($419) incurred much higher operational costs but did not deliver superior accuracy. Processing visual inputs significantly increases costs compared with text-only LLMs, so practitioners must weigh the performance gains of multimodal analysis against the added expense.
Although human moderators slightly outperform AI in detecting policy-violating content, the substantial cost savings with MLLMs present a compelling trade-off. The paper suggests hybrid human-AI approaches as a promising path to balance performance and cost.
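The paper does not prescribe a particular hybrid design, but a common pattern is to let the model handle the bulk of reviews and escalate only its low-confidence calls to humans. The sketch below shows that routing idea; the 0.8 threshold and the classify_with_mllm helper are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical confidence-threshold routing for a hybrid human-AI pipeline.
# The cutoff value and classify_with_mllm() are assumptions for illustration.

def moderate(video, classify_with_mllm, human_review_queue, threshold: float = 0.8):
    """Accept the model's verdict when it is confident; otherwise escalate to a human."""
    label, confidence = classify_with_mllm(video)   # e.g. ("violative", 0.93)
    if confidence >= threshold:
        return label                                # auto-decision, no human cost
    human_review_queue.append(video)                # low-confidence cases go to people
    return "pending_human_review"
```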
The Gemini models, particularly Gemini-2.0-Flash, achieved the highest F1 scores among the evaluated models. Performance was measured with three standard machine-learning metrics: precision, recall, and F1. Precision is the share of items the model flagged as positive that were actually positive, while recall is the share of actual positive instances the model correctly identified. F1 is the harmonic mean of precision and recall.
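For readers less familiar with these metrics, here is a minimal sketch of how they are computed from a set of predictions; the labels are invented for illustration and are not the study's data.

```python
# Minimal precision / recall / F1 computation over binary labels.
# The example labels are made up for illustration only.

def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0   # flagged items that were truly violations
    recall = tp / (tp + fn) if tp + fn else 0.0      # true violations that were flagged
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 1 = policy-violating, 0 = safe
print(precision_recall_f1([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
# (0.666..., 0.666..., 0.666...)
```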
The dataset and prompts used in the study have been published on GitHub. The paper will be presented at the upcoming Computer Vision in Advertising and Marketing (CVAM) workshop at the 2025 International Conference on Computer Vision.
This conclusion aligns with the broader observation that while humans still lead in nuanced content moderation, AI models offer significant economic advantages for brand safety applications. The research highlights that MLLMs provide a highly cost-efficient alternative for brand safety moderation, delivering near-human accuracy at dramatically reduced costs and making them attractive for at-scale deployment compared with traditional human moderation.
- The study found that the most cost-efficient AI models, such as GPT-4o-mini and Gemini-2.0-Flash-Lite, achieved relatively high accuracy in brand safety tasks, at a fraction of human moderation cost.
- Despite the higher operational costs of larger models like Llama-3.2-11B-Vision and GPT-4o, they did not achieve superior accuracy in the study.
- The paper suggests hybrid human-AI approaches as a promising path to balance performance and cost in content moderation, particularly for brand safety applications.
- The research highlights that multimodal large language models (MLLMs) provide a highly cost-efficient alternative for brand safety moderation, offering near-human accuracy at dramatically reduced costs compared to traditional human moderation.