ReText.AI Study: How AI Text Humanization Works — 20,000 Texts, 8 Models, 20 Categories
At ReText.AI, we are developing text humanization technology — an algorithm that takes text generated by a neural network and rewrites it to read as if written by a human. Not just synonym replacement, but a complete reworking of style, structure, and vocabulary.
But how well does it actually work? We decided to test it on a large scale — and conducted a study analyzing 19,804 pairs of texts in 20 thematic categories. This article presents the full results with figures, charts, and conclusions.
💡 Briefly for the impatient: In more than 90% of cases, humanization successfully reduces the likelihood of the text being identified as AI-generated. In 14 out of 20 categories, more than half of the texts completely "fool" the detector after processing.
Why We Conducted This Study
With the growing popularity of ChatGPT, GigaChat, YandexGPT, and other generative models, a reverse problem has emerged: more and more platforms, universities, and editorial offices are implementing AI text detectors. Students receive lower grades, authors are accused of dishonesty, and SEO texts are filtered by search engines.
At ReText.AI, we created a text humanization feature specifically to solve this problem: so that people using AI as a tool for drafts and ideas can bring the text to a quality indistinguishable from that written by a human.
But we didn't just want to claim "it works." We wanted to prove it — with data.
Data: What We Analyzed
Text Sources
Human texts were collected from two academic datasets:
- COLING-2025 (Workshop on MGT Detection, Subtask B: Multilingual MGT detection) — Russian and English texts
- AINL-eval — scientific texts
Thematic Domains
The original texts cover a wide range of topics:
- Social Media — posts, comments, discussions
- Wikipedia — encyclopedic articles
- Fiction — prose, stories
- Administrative Documents — business correspondence, regulations
- Scientific Texts — articles, studies, abstracts
Generation of "Machine" Variants
For each human text, we created a "machinized" version — a text rewritten by one of 8 neural networks:
| Model | Parameters | Developer |
|---|---|---|
| Llama-3.2-3B-Instruct | 3B | Meta |
| Qwen3-8B | 8B | Alibaba |
| GigaChat-2-Max | — | Sber |
| GLM-4.6 | — | ZAI |
| Llama-3.3-70B-Instruct | 70B | Meta |
| GPT-oss-120B | 120B | OpenAI |
| Qwen3-235B-A22B-Instruct | 235B | Alibaba |
| T-pro-it-1.0 | — | T-Bank |
Why 8 models? We wanted to ensure that humanization works not just on the output of a single model like ChatGPT, but on any popular model. Each one generates text with its own characteristics — and our algorithm must handle them all.
Final Dataset
After filtering and quality control, we obtained 19,804 pairs of texts (original human text + machinized version). The predominant language is Russian (~80%), with the rest being English and multilingual texts.
All texts were automatically classified into 20 thematic clusters: from recipes and cooking to IT and software development.
Methodology: How We Trained the Model
Base Model
At the core of our humanizer is the Gemma-2-9B-IT model (Google). We chose it for several reasons:
- Good generation quality in Russian
- Relatively compact size (9B parameters) — fast inference
- Architecture optimized for text processing tasks
Training Method: SimPO
For fine-tuning, we used SimPO (Simple Preference Optimization) — a reference-free preference-optimization method from the RLHF (Reinforcement Learning from Human Feedback) family. The pipeline works as follows:
- The model generates several variants of "humanized" text for each input text
- Each variant is run through our AI detector — assessing the likelihood that the text was written by a neural network
- Variants are ranked by the humanizer_score metric, calculated as:
humanizer_score = (prob_ai_original − prob_ai_humanized) × confidence_weight
Where:
- prob_ai_original — probability of AI authorship of the original text
- prob_ai_humanized — probability of AI authorship after humanization
- confidence_weight — detector confidence coefficient
- The best pairs (successful vs. unsuccessful humanization) form the training dataset for SimPO
In simple terms: we taught the model with examples of "this is good, this is bad" — and it learned to distinguish which stylistic techniques make the text more "human."
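The ranking step above can be sketched in a few lines of Python. This is a hypothetical illustration, not ReText.AI's actual code: humanizer_score follows the formula given above, while build_simpo_pair and the hard-coded candidate probabilities are our own illustrative stand-ins for the detector calls.

```python
def humanizer_score(prob_ai_original, prob_ai_humanized, confidence_weight=1.0):
    """Reward for how much humanization lowered the detector's AI probability."""
    return (prob_ai_original - prob_ai_humanized) * confidence_weight

def build_simpo_pair(prob_ai_original, candidates, confidence_weight=1.0):
    """Rank candidate rewrites and return a (chosen, rejected) pair for SimPO.

    `candidates` is a list of (text, prob_ai_humanized) tuples, where the
    probability would normally come from running the rewrite through a detector.
    """
    ranked = sorted(
        candidates,
        key=lambda c: humanizer_score(prob_ai_original, c[1], confidence_weight),
        reverse=True,
    )
    # Best rewrite becomes the "chosen" example, worst becomes "rejected".
    return ranked[0][0], ranked[-1][0]

# Mock example: original text scored 0.93 by the detector, three rewrites.
chosen, rejected = build_simpo_pair(
    0.93,
    [("rewrite A", 0.41), ("rewrite B", 0.72), ("rewrite C", 0.55)],
)
```

The rewrite with the largest probability drop becomes the preferred example, and the one with the smallest drop becomes the rejected one; SimPO then trains the model to favor the former.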
Distribution of humanizer_score by Categories
The visualization below shows how our key metric humanizer_score is distributed across all 20 thematic clusters. The higher the value, the more effective the humanization in that category:
Results: Reduction in AI Detection Probability
Overall Picture
The main question: how much does humanization reduce the likelihood that a detector will identify the text as AI-generated?
Answer: radically.
| Metric | Before Humanization | After Humanization |
|---|---|---|
| Median AI Probability | ~0.93 (93%) | ~0.47 (47%) |
| Texts with AI Probability > 0.8 | Overwhelming majority | <20% |
| Texts with AI Probability < 0.5 | Few | >50% |
Before processing, the probability distribution is compressed toward 1: almost all texts are confidently identified as machine-generated. After processing, the distribution shifts to the left, and half of the texts are now classified as human.
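Aggregates like these can be reproduced from per-text probability pairs. A minimal sketch, assuming each text is represented as a (prob_before, prob_after) tuple; detection_summary and the sample values are illustrative, not the study's code:

```python
from statistics import median

def detection_summary(pairs):
    """Aggregate detector probabilities before and after humanization.

    `pairs` is a list of (prob_before, prob_after) tuples, one per text.
    """
    before = [b for b, _ in pairs]
    after = [a for _, a in pairs]
    n = len(pairs)
    return {
        "median_before": median(before),
        "median_after": median(after),
        # Share of processed texts still confidently flagged as AI.
        "share_after_above_0.8": sum(a > 0.8 for a in after) / n,
        # Share of processed texts now classified as human.
        "share_after_below_0.5": sum(a < 0.5 for a in after) / n,
    }

# Toy sample of four texts.
summary = detection_summary([(0.95, 0.40), (0.92, 0.55), (0.90, 0.30), (0.96, 0.85)])
```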
By Thematic Categories: Share of Improved Texts
In every category, the humanizer reduces the probability of AI detection for the large majority of texts, from 83.5% of texts in the hardest category to 100% in the easiest:
| Category | Share of Texts with Improvement |
|---|---|
| 🍳 Recipes and Cooking | 100.0% |
| 🏠 Everyday Life and Reflections | 93.8% |
| 👥 Personnel Management and Organization | 93.7% |
| ⚖️ Legal Systems and Legislation | 93.4% |
| 📣 Marketing and Advertising | 93.0% |
| 🔬 Scientific Research and Methods | 92.7% |
| 🧠 Psychology and Society | 91.9% |
| 📚 Literature and Text Analysis | 91.5% |
| 🎓 Education and Training | 90.9% |
| 💼 Business and Market Analysis | 90.4% |
| 📊 Data Analysis and ML | 89.8% |
| 🏙️ Urban Systems and Services | 89.0% |
| 💰 Economics | 88.7% |
| ✍️ Personal Stories and Narratives | 88.1% |
| 📰 News about Russia | 87.9% |
| 🎮 Computer Games | 87.3% |
| 🖥️ Information Systems and Software Development | 85.2% |
| 🎨 Culture and Art | 85.0% |
| 🔧 Digital Technologies and Innovations | 84.7% |
| 🌐 Multilingual Texts | 83.5% |
Key observation: even for the most challenging categories (IT, digital technologies, multilingual texts), more than 83% of texts show improvement.
Results: Who Completely "Fooled" the Detector
Reducing probability is one thing. But we were also interested in a stricter question: what share of texts after humanization completely changes the detector's verdict — from "AI" to "human"?
We called this the Hard Flip Rate — the percentage of texts that were identified as AI-generated before processing, and as human after (AI probability < 0.5).
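Computing this metric from per-text probability pairs is straightforward. A minimal sketch; hard_flip_rate is our illustrative name for the calculation, with the 0.5 threshold matching the definition above:

```python
def hard_flip_rate(pairs, threshold=0.5):
    """Share of texts flagged as AI before humanization (prob >= threshold)
    whose post-humanization probability drops below the threshold."""
    flagged = [(b, a) for b, a in pairs if b >= threshold]
    if not flagged:
        return 0.0
    flips = sum(a < threshold for _, a in flagged)
    return flips / len(flagged)

# Toy sample: three texts flagged as AI, two of them flip to "human".
rate = hard_flip_rate([(0.90, 0.40), (0.95, 0.60), (0.30, 0.20), (0.85, 0.45)])
```

Note that texts never flagged as AI in the first place (the third pair) are excluded from the denominator, since there is no verdict for them to flip.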
| Category | Hard Flip Rate |
|---|---|
| 🍳 Recipes and Cooking | 66.7% |
| ⚖️ Legal Systems and Legislation | 64.2% |
| 💼 Business and Market Analysis | 61.0% |
| 🔬 Scientific Research and Methods | 60.8% |
| 👥 Personnel Management and Organization | 59.9% |
| 📣 Marketing and Advertising | 58.9% |
| 🧠 Psychology and Society | 57.8% |
| 🏙️ Urban Systems and Services | 57.1% |
| 📚 Literature and Text Analysis | 56.4% |
| 📊 Data Analysis and ML | 56.2% |
| ✍️ Personal Stories and Narratives | 55.7% |
| 🎓 Education and Training | 52.9% |
| 💰 Economics | 51.6% |
| 📰 News about Russia | 51.1% |
| 🎮 Computer Games | 49.3% |
| 🖥️ Information Systems and Software Development | 49.3% |
| 🎨 Culture and Art | 47.4% |
| 🔧 Digital Technologies and Innovations | 44.0% |
| 🌐 Multilingual Texts | 43.6% |
| 🏠 Everyday Life and Reflections | 41.0% |
Result: for 14 out of 20 categories, more than half of the texts completely "fool" the detector. For the top 6 categories (law, business, science, HR, marketing, cooking), the rate exceeds 58%.
Where Humanization Works Best
🍳 Recipes and Cooking — Leader (100% improvement, 66.7% hard flip)
Culinary texts showed absolutely the best results. Reason: recipes have a characteristic conversational structure that our humanizer reproduces especially well — "add a pinch," "by eye," "when browned" instead of "add 2.5 g NaCl."
⚖️ Legal Texts — 64.2% hard flip
Legal texts are effectively humanized because the legal style has established formulations. The model learned to make them more "lively" while maintaining accuracy of meaning.
💼 Business and Science — 60%+
Business and scientific texts have a formalized structure that can be naturally "diluted" with human constructions — introductory words, authorial remarks, syntactic variability.
Average AI Probability Delta by Categories
Another metric — how much the probability of AI authorship decreases on average:
| Category | Average Delta |
|---|---|
| Recipes and Cooking | +0.518 |
| Personnel Management | +0.457 |
| Legal Systems | +0.451 |
| Business and Market Analysis | +0.450 |
| Psychology and Society | +0.429 |
| Literature and Text Analysis | +0.414 |
| Marketing and Advertising | +0.413 |
| Scientific Research | +0.404 |
On average, the probability of AI authorship decreases by 0.35–0.52 points — this is a radical change.
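The per-category delta above is just the mean drop in AI probability over each category's texts. A minimal sketch of how it could be computed, assuming (category, prob_before, prob_after) triples; the function name and sample data are illustrative:

```python
from collections import defaultdict
from statistics import mean

def average_delta_by_category(records):
    """Mean drop in AI probability per category.

    `records` are (category, prob_before, prob_after) triples. Delta is
    prob_before - prob_after, so positive values mean the processed text
    looks "more human" to the detector.
    """
    by_cat = defaultdict(list)
    for category, before, after in records:
        by_cat[category].append(before - after)
    return {cat: mean(deltas) for cat, deltas in by_cat.items()}

# Toy sample: two recipe texts and one IT text.
deltas = average_delta_by_category([
    ("recipes", 0.95, 0.40),
    ("recipes", 0.90, 0.45),
    ("it", 0.92, 0.70),
])
```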
Where There Are Challenges
🏠 Everyday Life and Reflections — 41.0% hard flip
Paradox: this category shows an excellent improvement percentage (93.8%) but low hard flip. The reason is that everyday texts already have a diverse style, making it difficult to "switch" the detector over the 0.5 threshold.
🌐 Multilingual Texts — 43.6% hard flip
Our dataset predominantly contained Russian and English texts. For other languages, the model needs a larger training set. We plan to expand multilingual support in future versions.
🖥️ IT and Software Development — 49.3% hard flip
Technical texts with terminology, code fragments, and specific syntax are the most challenging category for humanization. Nevertheless, for almost half of the texts, the detector is completely "fooled."
General Conclusions
- For more than 90% of texts, the humanizer successfully reduces the probability of AI detection — regardless of the topic
- For 14 out of 20 categories, more than half of the texts completely change the detector's verdict (from "AI" to "human")
- Despite the relatively compact model (9B parameters) and dataset (~20K pairs), fine-tuning with SimPO showed high efficiency
- The best results are in structured domains (law, business, science, marketing)
- There is potential for improvement in multilingual texts and IT topics
What This Means for Users
If You Are a Student or Graduate Student
Using ChatGPT or GigaChat for drafts of term papers and essays? Text humanization is not "cheating," but a tool to bring a draft to a human level. In more than 60% of cases, text on legal and scientific topics becomes indistinguishable from your own writing.
If You Are a Copywriter or Marketer
Generating content for clients using AI? Marketing and business texts are humanized with a 58–61% hard flip rate. Run the text through the humanizer, then check it in our AI detector — and be confident in the result.
If You Are an SEO Specialist
Search engines are learning to identify AI content. Humanization reduces the likelihood of filtering — and preserves your work with organic search results.
Optimal Workflow
- Generate a draft in ChatGPT / GigaChat / any neural network
- Humanize in ReText.AI
- Check through the AI detector
- Final Edit manually
For Developers and Business (API)
We provide an API for both the AI detector and text humanization. If you want to integrate our technologies into your product, simply reach out to us at team@retext.ai.
More about our tools: read the review TOP-20 Neural Networks Online in 2026, where we thoroughly analyze the entire ReText.AI product stack.
FAQ
What is AI Text Humanization?
It is a technology that rewrites text generated by a neural network so that it reads as if written by a human. Not a simple synonym replacement, but a deep reworking of style, structure, and vocabulary with a specially trained model.
Which Neural Networks Did You Test?
We tested texts generated by 8 models: Llama-3.2-3B, Qwen3-8B, GigaChat-2-Max, GLM-4.6, Llama-3.3-70B, GPT-oss-120B, Qwen3-235B, and T-pro-it-1.0. This covers the current landscape from 3B to 235B parameters.
What Was the Size of the Test Set?
19,804 pairs of texts (original human text + machinized version), classified into 20 thematic categories.
How Effective Is It?
For more than 90% of texts, the likelihood of being identified as AI decreases. For 14 out of 20 categories, more than half of the texts completely change the detector's verdict from "AI" to "human."
In Which Categories Does Humanization Work Best?
Leaders: recipes and cooking (66.7% hard flip), legal texts (64.2%), business (61.0%), science (60.8%), HR (59.9%), and marketing (58.9%).
Where Does Humanization Work Worse?
The most challenging categories: everyday texts (41.0%), multilingual texts (43.6%), digital technologies (44.0%), and culture/art (47.4%). But even for them, 83%+ of texts show improvement.
Is It Ethical?
We believe that AI is a tool, not an author. Just as a calculator does not replace a mathematician, humanization does not replace a writer — it helps bring a draft to a publication-worthy quality. Important: we do not encourage passing off AI text as your own without refinement. We encourage using technology as part of the workflow.
How to Try Humanization?
Go to retext.ai/ai-text-humanizer — up to 1,000 characters can be processed for free.