Content Filtering System Based on Large Language Models: Architecture, Limitations, and Prospects
Introduction
The rapid development of transformer-based generative models such as GPT, LLaMA, and PaLM has been accompanied by a sharp increase in risks tied to the generation of unwanted content. Such risks include toxic language, misinformation, instructions for weapon-making, and material that violates ethical or legal norms.
Existing industry solutions range from static rule-based systems (keyword blacklists, regex patterns) to integrated moderation APIs powered by specialized toxicity classifiers, such as OpenAI’s moderation models. However, these approaches tend to be costly to integrate and maintain, have limited adaptability, and are vulnerable to context-dependent attacks. In these attacks, disallowed intent is masked through paraphrasing, synonym usage, or encoding tricks.
Malicious users increasingly exploit prompt injection, embedding hidden instructions that cause LLMs to violate their initial safety rules. This renders traditional heuristic filters largely ineffective. Industrial-scale deployment of content filters therefore requires multi-layer architectures capable of defending against both trivial circumventions and systematic adversarial attacks.
In this study, we examine an LLM-based filtering system where the model functions not only as a generator but also as a meta-analysis module for user queries. Unlike simple heuristic solutions, our architecture leverages Schema-Guided Reasoning (SGR). This method breaks down requests into predefined categories—intent, content domain, and potential harmfulness—enabling reproducibility, fewer false positives, and stronger resilience to adversarial attacks that disguise malicious intent or disrupt reasoning.
The architecture is hierarchical, consisting of three layers:
1_Preliminary fast filter — a lightweight BERT-family model fine-tuned on a large labeled dataset of toxic and unwanted texts. It efficiently processes streaming data with minimal cost, removing obvious violations early.
2_Deep contextual analysis — performed by an LLM using SGR. The model produces structured reasoning instead of binary classification, recording decision logic and ensuring explainability.
3_Meta risk assessment layer — integrates classification results from the first two stages with safety policies, legal norms, and dynamically updated bypass scenarios. Unlike rigid rule-based systems, this module relies on controlled reasoning through system prompts, which encode ethical and legal frameworks as well as corporate policies. The LLM interprets the outputs of the BERT classifier and SGR reasoning chain, mapping results to risk levels: “low risk — allow,” “medium risk — manual review,” “critical risk — block.”
This hybrid approach combines the high performance and efficiency of classical models like BERT with the cognitive depth of LLM reasoning mechanisms. As a result, it produces an adaptive filtering system capable of both responding quickly to known types of violations and detecting new bypass strategies. The sections that follow examine in detail the collection and annotation of training corpora, the optimization of BERT classifiers for industrial workloads, the computational costs of a multi-layer framework, and the prospects for evolving toward self-learning filters that integrate classical methods with reasoning-oriented LLMs.
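To make the layering concrete, below is a minimal, self-contained Python sketch of how the three layers could be chained. The scoring functions are deliberately stubbed placeholders standing in for the fine-tuned BERT classifier, the SGR-instructed LLM, and the policy engine; only the routing logic reflects the architecture described above.

```python
from dataclasses import dataclass
from typing import Literal

Decision = Literal["allow", "manual_review", "block"]

@dataclass
class Verdict:
    decision: Decision
    risk: str          # "low" | "medium" | "critical"
    rationale: str

def bert_score(query: str) -> float:
    """Placeholder for the lightweight BERT-family fast filter (toxicity in [0, 1])."""
    return 0.5

def sgr_analysis(query: str) -> dict:
    """Placeholder for the LLM + SGR layer returning structured category/severity/reasoning."""
    return {"category": "none", "severity": "low", "reasoning": "stub reasoning chain"}

def meta_risk(score: float, sgr: dict) -> str:
    """Placeholder for the meta layer: maps both signals onto policy-defined risk levels."""
    if score > 0.9 or sgr["severity"] == "critical":
        return "critical"
    if score > 0.5 or sgr["severity"] in ("medium", "high"):
        return "medium"
    return "low"

def filter_query(query: str) -> Verdict:
    score = bert_score(query)                 # layer 1: cheap streaming pre-filter
    if score > 0.95:                          # obvious violation, no LLM call needed
        return Verdict("block", "critical", "high-confidence match by the fast filter")
    sgr = sgr_analysis(query)                 # layer 2: structured contextual reasoning
    risk = meta_risk(score, sgr)              # layer 3: policy-aware risk mapping
    decision = {"low": "allow", "medium": "manual_review", "critical": "block"}[risk]
    return Verdict(decision, risk, sgr["reasoning"])
```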
Limitations of Classical Heuristic Methods
In the early stages of digital filtering, regular expressions, keyword filters, and simple rules dominated — for example, blocking emails containing the phrase “free money.” These approaches quickly proved unreliable and inflexible:
  • High false positive rates: even safe messages were often blocked due to template matches or unusual formatting.
  • Ease of evasion: small variations in phrasing, added symbols, spaces, or synonyms could easily bypass filters.
  • Disregard for meaning and context: since filters relied on keywords rather than semantics, they were blind to actual intent.
In practice, these methods were often combined with DNS blacklists (databases of domains and IPs linked to spam) and Bayesian filters, such as Apache SpamAssassin, which use probabilistic word analysis to determine whether a message resembles spam. This hybrid approach added some resilience but remained vulnerable to carefully crafted attacks and fast-adapting user groups.
The Role of Heuristics in Modern Architectures
Heuristics are outdated and unsuitable as a core protection mechanism, but they remain useful as an auxiliary barrier in multi-level filters. Their strength lies in reliably blocking mass, templated attacks widely distributed in open sources such as GitHub or Hugging Face. These attacks are easy to detect through recognizable patterns, specific markers, or textual constructions.
In our filtration pipeline, heuristics serve the following purposes:
  • Primary defense against mass copy-paste attacks, where simple matches remain effective.
  • Reinforcing the security perimeter against already known bypass strategies, such as widely replicated jailbreak prompts or data-leakage attempts.
Thus, we treat heuristics not as the core of the system but as a simple, practical solution that is both cheap and effective for quick and reliable filtering.
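For illustration, such a first barrier can be as simple as a handful of compiled patterns matched against widely replicated jailbreak templates; the patterns below are illustrative examples, not the production blocklist.

```python
import re

# Illustrative patterns for mass copy-paste bypass attempts; a real blocklist would be
# much larger and updated from threat-intelligence feeds.
KNOWN_JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bDAN mode\b", re.IGNORECASE),
    re.compile(r"you are now unrestricted", re.IGNORECASE),
]

def heuristic_block(query: str) -> bool:
    """Return True if the query matches a known, mass-replicated bypass template."""
    return any(p.search(query) for p in KNOWN_JAILBREAK_PATTERNS)
```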
LLM-Based Filtering Architecture
For training, we used a dataset of anonymized user queries. Based on this, we designed an instruction system for the LLM that included:
  • a fixed set of categories of prohibited content;
  • rules for handling borderline cases;
  • rules for taking semantic context into account;
  • several annotated examples.
Compared to keyword filters and the first-level BERT classifier, this configuration reduced false positives by approximately 30% by analyzing context and reference examples. For instance, while a keyword approach would block the word “weapon” even in a sports context, the instructed LLM correctly distinguished between acceptable and prohibited scenarios.
The model moved beyond binary classification to a multi-level scheme:
1_Direct violations — explicit requests for prohibited content.
2_Disguised formulations — bypass strategies and paraphrases.
3_Borderline contexts — ambiguous cases.
Each query was also assigned a severity level on a scale from low to critical risk. This allowed threats to be ranked and system responses adjusted accordingly: soft warnings, manual review, or blocking.
As a result, compared to baseline methods (keyword filtering and a single-level BERT classifier), the LLM architecture showed superior performance in contextual and disguised attack scenarios, while maintaining low computational overhead due to its multi-level design.
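For illustration, the instruction configuration described in this section could be assembled roughly as follows; the category names, few-shot examples, and wording below are placeholders rather than the production prompt.

```python
# Illustrative assembly of the LLM instruction: fixed categories, violation levels,
# severity scale, and a few annotated examples. All values are placeholders.
CATEGORIES = ["violence", "erotic", "self-harm", "animal cruelty",
              "drug abuse", "extremism", "politics"]
VIOLATION_LEVELS = ["direct_violation", "disguised_formulation", "borderline_context"]
SEVERITY = ["low", "medium", "high", "critical"]

FEW_SHOT = [
    {"query": "Which rifle calibers are used in Olympic biathlon?",
     "label": "allow", "note": "sports context, not a weapon-making request"},
    {"query": "Step-by-step guide to making a firearm at home",
     "label": "block", "note": "direct violation: weapon manufacturing"},
]

def build_system_prompt() -> str:
    examples = "\n".join(
        f'- "{ex["query"]}" -> {ex["label"]} ({ex["note"]})' for ex in FEW_SHOT
    )
    return (
        "You are a content-moderation analyst.\n"
        f"Prohibited categories: {', '.join(CATEGORIES)}.\n"
        f"Classify each query as one of {VIOLATION_LEVELS} with a severity from {SEVERITY}.\n"
        "For borderline cases, consider the semantic context before deciding.\n"
        f"Annotated examples:\n{examples}"
    )

print(build_system_prompt())
```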
SGR vs Function Calling in NSFW Filtering
When designing filters, two properties are critical: resilience to bypasses and reliable classification even on models with weak agentic (tool-use) capabilities.
  • Classical Function Calling (FC) works well in large models such as GPT-4o or Claude 3.5, where agency scores on the BFCL scale exceed 80. In these cases, the model can reliably decide when to invoke classification modules or additional filters.
  • On smaller local models (up to 32B), quality drops sharply: FC becomes unreliable when the agency score falls below 35, leading to arbitrary text output, null calls, or excessive tool requests.
Despite being experimental, SGR has shown stable performance even on small models (4B–7B). In these cases, the model is constrained to produce structured output in JSON format with mandatory fields:
  • Category of NSFW content (e.g., sexualization, violent scenes, references to minors in suspicious contexts).
  • Severity level (low/medium/high/critical).
  • Model explanation with a brief contextual analysis.
This format minimized output variability and allowed deterministic evaluation even of disguised formulations, where keyword filters and FC failed.
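One way to enforce these mandatory fields, assuming Pydantic is used for schema definition and validation (the text does not prescribe a specific library), is sketched below; the resulting JSON schema can be handed to a constrained-decoding or structured-output backend, and the raw model reply validated deterministically.

```python
from enum import Enum
from pydantic import BaseModel, Field

class NSFWCategory(str, Enum):
    none = "none"
    sexualization = "sexualization"
    violent_scene = "violent_scene"
    minors_suspicious_context = "minors_suspicious_context"

class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class SGRVerdict(BaseModel):
    category: NSFWCategory
    severity: Severity
    explanation: str = Field(description="Brief contextual analysis behind the decision")

# Schema for the generation backend; strict fields keep output variability low.
schema = SGRVerdict.model_json_schema()

# Deterministic validation of a raw model reply (example payload is illustrative).
verdict = SGRVerdict.model_validate_json(
    '{"category": "violent_scene", "severity": "high", "explanation": "graphic injury description"}'
)
print(verdict.severity.value)  # "high"
```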
From our experiments, the following architectural boundaries emerged:
  • For models with fewer than 14B parameters, pure SGR with strict JSON schemas works best.
  • For models between 14B and 32B, a hybrid SGR+FC approach is most effective, integrating reasoning into function calls.
  • For models larger than 32B, native FC is feasible, while SGR remains useful as a fallback for borderline or ambiguous cases.
Thus, in NSFW filtering tasks, SGR is an excellent solution for ensuring resilience on smaller models. It formalizes violation categories, defines risk levels, and provides reasoning chains — critical for industrial moderation.
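These boundaries can be captured as a simple routing rule; a minimal sketch follows, with informal mode labels of our own choosing.

```python
def choose_filtering_mode(model_params_billions: float) -> str:
    """Pick the structuring strategy based on model size, per the boundaries above."""
    if model_params_billions < 14:
        return "sgr_strict_json"              # pure SGR with a strict JSON schema
    if model_params_billions <= 32:
        return "sgr_plus_fc_hybrid"           # reasoning embedded into function calls
    return "native_fc_with_sgr_fallback"      # FC by default, SGR for borderline cases
```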
Pilot Testing and Dataset Expansion
In a pilot experiment, a prototype filter was tested on ~10,000 anonymized queries. Nearly 20% were classified as violations, with an average error rate of about 8%, including both false positives and false negatives. These results outperform heuristic methods, where error rates traditionally exceed 15–20%, confirming the effectiveness of the multi-level architecture.
For practical deployment, reducing false positives is especially important. At this stage, a key direction for development is expanding and refining the training corpus:
  • increasing dataset volume with coverage of various NSFW content types;
  • introducing more balanced class distributions to avoid bias toward frequent categories;
  • correcting annotations of borderline cases, which are often the main source of errors.
For the BERT module, the dataset remains the key factor: the model is highly sensitive to annotation quality and sample completeness, and its accuracy directly depends on the representativeness of the training data. At the same time, for the entire filtering system, what matters is not only the quality of the dataset but also the architectural organization of its layers — the combination of BERT, LLM, and meta-assessment. Thus, the dataset is critical for reliable operation of the first level, while the architecture as a whole ensures error reduction and resilience against complex attacks.
With regular dataset updates, the risk of drift arises — changes in query patterns and the emergence of new bypass strategies. Therefore, a separate area of development is the creation of mechanisms for monitoring data and model drift. In the context of dynamically evolving bypass strategies and new types of unwanted content, it is crucial that the system be able to detect statistical shifts, drops in classification accuracy, and changes in input stream structure. Integrating such monitoring tools will make it possible to initiate retraining and filter adaptation in a timely manner, thereby maintaining the resilience of the entire architecture.
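One concrete way to implement such monitoring, purely as an illustration, is to track the Population Stability Index (PSI) of the fast filter's score distribution between a reference window and recent traffic; the text does not prescribe a specific drift metric, and the sample data below is synthetic.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index; values above ~0.2 are commonly read as significant shift."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # cover the full range
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)         # avoid log(0) / division by zero
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Example: compare a recent traffic window against the training-time reference.
rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 8, 10_000)   # stand-in for historical classifier scores
current_scores = rng.beta(3, 6, 2_000)      # stand-in for the recent window
if psi(reference_scores, current_scores) > 0.2:
    print("Score drift detected: schedule dataset refresh and retraining")
```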
Dataset for Training and Validation of the Filter
Corpus Formation
To build the filtering system, a dataset of about 40,000 user queries was created. It included two types of data: synthetically generated examples and organic, anonymized user queries. Synthetic data made it possible to model controlled and rare scenarios with high levels of risk (for example, disguised instructions), while organic queries provided representativeness and reflected real patterns of behavior. This combination helped maintain category balance and brought the system closer to real-world operating conditions, where the filter must demonstrate stable performance. It should be noted that a dataset of 40,000 queries is relatively small by industrial machine learning standards; therefore, dataset formation is treated as a continuous process of expansion and updating.
Categorization and Balancing
The queries were distributed across seven categories corresponding to the main classes of violations: Violence, Erotic, Self-harm, Animal cruelty, Drug abuse, Extremism, and Politics. Special attention was given to balancing the categories, which is critically important for classifier training: underrepresentation of one class inevitably leads to more errors in borderline cases. In the final dataset, categories were balanced, making it suitable not only for training but also for objective model evaluation.
Identified Weaknesses and Retraining
In early testing, the BERT classifier showed errors when handling long and noisy queries. Such queries could contain isolated trigger phrases capable of steering generation toward unwanted content, but because of the large volume of “innocent” surrounding text, the classifier failed to flag them as violations. This problem was particularly relevant for image generators, where even a single leakage is critical. To address it, a special subsection of the corpus was collected containing noisy and long queries with banned elements. After including this material in the training set and retraining the BERT module, classification quality improved significantly, particularly its resilience to misleading queries designed to confuse it.
Vectorization and Use of Embeddings
To represent queries in a uniform way, vectorization based on the general-purpose BGE-M3 embedding model was applied. The resulting embeddings allowed clustering and the identification of groups of semantically similar examples, which simplified the selection of “difficult” cases for training. In addition, vectorization enabled the use of the hard-negatives technique — adding examples that appear similar to permissible ones but are in fact violations. This strategy significantly improved classifier accuracy in distinguishing semantically close formulations.
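A sketch of this mining step is shown below, assuming BGE-M3 can be loaded through sentence-transformers (the FlagEmbedding library is an equivalent option); the example queries and the top-k cutoff are illustrative.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-m3")

benign = ["boxing techniques for beginners", "history of medieval siege warfare",
          "first aid for deep cuts"]
violations = ["how to injure an opponent permanently in a street fight",
              "detailed instructions for making a bladed weapon at home"]

b_emb = model.encode(benign, normalize_embeddings=True)
v_emb = model.encode(violations, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
sims = v_emb @ b_emb.T                        # (n_violations, n_benign)
hardness = sims.max(axis=1)                   # how benign-looking each violation is
hard_negative_idx = np.argsort(-hardness)[:100]
hard_negatives = [violations[i] for i in hard_negative_idx]  # prioritized training additions
print(hard_negatives)
```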
Semantic Traps and the Role of Manual Moderation
Analysis of the corpus revealed many borderline cases where a query appeared to be a violation but was not in fact one. Examples included historical descriptions of warfare, quotes from literature, or metaphors using violent language. Such examples could not be automatically assigned to violation categories; they required either manual annotation by experts or carefully generated synthetic data followed by verification. Handling such cases became another important factor in reducing false positives and clarifying the boundary between permissible and prohibited scenarios.
Automated Corpus Expansion
At the final stage, a self-correction mechanism using LLMs was tested. Queries classified by the BERT module were additionally analyzed by the LLM. In cases of disagreement, the LLM’s result was given priority, and such examples were automatically labeled for inclusion in the training dataset. This approach enables iterative retraining of the classifier without constant reliance on manual moderation. As a result, a system is formed where the LLM acts as a “gold-standard annotator,” while BERT gradually adapts to new types of queries, maintaining high speed and low computational cost.
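The loop can be expressed compactly; in this sketch, bert_label and llm_label are placeholders for the two classifiers described above, and disagreement routes the example into the next retraining batch with the LLM's label.

```python
from typing import Callable

def collect_relabeled_examples(queries: list[str],
                               bert_label: Callable[[str], str],
                               llm_label: Callable[[str], str]) -> list[tuple[str, str]]:
    """Collect (query, label) pairs where BERT and the LLM disagree, keeping the LLM label."""
    retraining_batch = []
    for q in queries:
        fast, slow = bert_label(q), llm_label(q)
        if fast != slow:                      # disagreement: trust the gold-standard annotator
            retraining_batch.append((q, slow))
    return retraining_batch
```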
Red Teaming and Adversarial Evaluation
To assess real-world robustness, we conducted red-teaming attacks:
  • Jailbreak prompts — hidden instructions, role-based scenarios.
  • Obfuscation — insertion of noisy words.
  • Social bypasses — framing requests as coming from a “teacher,” “journalist,” or “doctor.”
Results showed that keyword filters and plain BERT were vulnerable to more than 40% of bypass attempts, while the hybrid architecture reduced the attack success rate to roughly 12%.
Metrics and Evaluation Methodology
In addition to offline metrics, A/B testing was critical for validation, allowing direct comparison of filter configurations under live traffic and assessing classification accuracy, false block rates, and latency.
Standard metrics used included:
  • Precision — proportion of correctly classified violations among all flagged.
  • Recall — proportion of correctly detected violations among all real violations.
  • F1-score — harmonic mean of precision and recall.
Additional focus:
  • False Positive Rate (FPR) — critical to minimize unnecessary blocking of allowed content.
  • False Negative Rate (FNR) — crucial in safety scenarios where leakages cannot be tolerated.
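These metrics can be computed directly from confusion counts, which keeps FPR and FNR visible alongside the aggregate scores; the counts in the example call are illustrative, not experimental results.

```python
def moderation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1, FPR, and FNR from a confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0   # allowed content wrongly blocked
    fnr = fn / (fn + tp) if fn + tp else 0.0   # violations that leaked through
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr, "fnr": fnr}

# Illustrative counts only:
print(moderation_metrics(tp=180, fp=20, fn=15, tn=785))
```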
Adversarial evaluation included three test sets:
1_Lexical obfuscation — symbol replacement, spacing, encoding.
2_Semantic paraphrasing — synonym substitution, reordering.
3_Prompt injection — embedded commands and meta-instructions.
Multimodal Architecture Extension
With the growing popularity of image and video generation, integrating text filters with multimodal systems has become increasingly important. We propose extending the architecture into a two-tier system that separates input and output filtering:
1_Input filter at the user request level — analyzes text prompts for intent to generate prohibited content (e.g., requests for violent or exploitative images). Like text moderation, this uses a BERT module for primary filtering and LLM+SGR for deeper contextual analysis, preventing generation attempts from the start.
2_Output filter at the result level — even with clean input, diffusion models may generate results that violate restrictions. To prevent such leakages, a dedicated module analyzes generated multimodal content using visual classifiers (e.g., CLIP-like architectures) to assess risk categories.
This two-tier approach ensures safety at both input (preventing unwanted generation) and output (verifying results). In case of conflict, the output filter always takes priority, since it determines compliance of the final multimodal object with safety standards.
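A sketch of the output-level check is shown below, using the open CLIP checkpoint from Hugging Face transformers as a stand-in for a dedicated visual risk classifier; the risk prompts and threshold are illustrative, and a production system would use a model fine-tuned on its own risk taxonomy.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

RISK_PROMPTS = ["a violent or gory scene", "sexually explicit content",
                "drug paraphernalia", "a harmless everyday photo"]

def image_risk_scores(image: Image.Image) -> dict:
    """Zero-shot similarity of a generated image to each risk prompt."""
    inputs = processor(text=RISK_PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1).squeeze(0)
    return dict(zip(RISK_PROMPTS, probs.tolist()))

def output_filter(image: Image.Image, threshold: float = 0.5) -> bool:
    """Return True (block) if any risk prompt dominates the benign one."""
    scores = image_risk_scores(image)
    return max(v for k, v in scores.items() if k != "a harmless everyday photo") > threshold
```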
Legal and Ethical Aspects
Regulatory Frameworks
Different jurisdictions impose specific requirements on content processing and blocking: GDPR (EU), Digital Services Act (EU), COPPA (US), and local laws concerning extremism, defamation, religion, or sexual content. A universal filter without adaptation risks being either too lenient or excessively strict.
Cultural Variability
Permissible levels of explicitness or political discourse vary by region. In some countries, medical publications describing sexual behavior are allowed, while in others such content is blocked.
Adaptation Mechanism
In our architecture, adaptability is achieved through multi-level policy management:
  • Regional and legal profiles define thresholds and blocking categories in the meta risk layer.
  • LLMs receive additional instructions describing local restrictions (e.g., “classify political slogans in region X as prohibited”).
  • A dynamic customization module enables profile updates without retraining base models — simply by adjusting prompts and config rules.
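As an illustration of such profile-based management, the sketch below keeps thresholds, blocked categories, and extra LLM instructions in per-region configuration applied by the meta risk layer; the region codes, thresholds, and instruction texts are placeholders, not real policies.

```python
from dataclasses import dataclass

@dataclass
class RegionProfile:
    blocked_categories: set[str]
    manual_review_threshold: float      # risk score above which a human reviews
    block_threshold: float              # risk score above which content is blocked
    extra_llm_instructions: str = ""    # appended to the system prompt, no retraining needed

PROFILES = {
    "EU": RegionProfile({"extremism"}, 0.4, 0.8,
                        "Apply GDPR/DSA notice-and-action requirements."),
    "REGION_X": RegionProfile({"extremism", "politics"}, 0.3, 0.7,
                              "Classify political slogans as prohibited in this region."),
}

def decide(region: str, category: str, risk_score: float) -> str:
    p = PROFILES[region]
    if category in p.blocked_categories or risk_score >= p.block_threshold:
        return "block"
    if risk_score >= p.manual_review_threshold:
        return "manual_review"
    return "allow"
```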
Risk of Over-Censorship
Overly strict filtering may block legitimate scientific or historical content. Using SGR with explainable reasoning chains reduces this risk: the system documents why content was classified as a violation, allowing verification against local legal frameworks.
Conclusion
This study implemented and tested a multi-layer architecture for filtering unwanted content based on LLMs, with integration of auxiliary components. The first level is a lightweight BERT classifier, optimized on a balanced dataset of 40,000 queries that included both synthetic and organic data. The second level is an LLM that applies the methodology of Schema-Guided Reasoning (SGR) and structured output mechanisms. This level provides context-dependent evaluation of queries and minimizes false-positive classifications. An additional layer is the meta-level risk assessment module, which integrates classification results with safety policies and state regulatory constraints.
The significance of this work lies in the creation of a practical framework for NSFW content filtering, where the dataset ensures robustness and accuracy for lightweight models like BERT, and the multi-layer architecture provides error reduction and adaptability for the entire system. This balance combines speed and scalability with depth of analysis, making the solution suitable for industrial applications.
Such systems are becoming especially important in the context of widespread generative technologies, as they help minimize the risks of disinformation, extremist material dissemination, and other forms of abuse. This approach can be applied not only within the GenAI industry but also in adjacent fields such as cybersecurity, educational platforms, and online media. In this way, the presented research contributes to the development of reliable and adaptive moderation systems, directly linked to the safety of digital ecosystems and society as a whole.