ReText.AI Study: How AI Text Humanization Works — 20,000 Texts, 8 Models, 20 Categories

Olga Shkryaba
April 8, 2026
The ReText.AI team analyzed 19,804 texts generated by 8 neural networks (Llama, Qwen, GigaChat, GLM, GPT-oss, T-pro). Result: in 90%+ of cases, humanization reduces the likelihood of AI detection. Full data, charts, and conclusions below.
Contents:
Why We Conducted This Study
Data: What We Analyzed
Text Sources
Thematic Domains
Generation of "Machine" Variants
Final Dataset
Methodology: How We Trained the Model
Base Model
Training Method: SimPO
Distribution of humanizer_score by Categories
Results: Reduction in AI Detection Probability
Overall Picture
By Thematic Categories: Share of Improved Texts
Results: Who Completely "Fooled" the Detector
Where Humanization Works Best
🍳 Recipes and Cooking — Leader (100% improvement, 66.7% hard flip)
⚖️ Legal Texts — 64.2% hard flip
💼 Business and Science — 60%+
Average AI Probability Delta by Categories
Where There Are Challenges
🏠 Everyday Life and Reflections — 41.0% hard flip
🌐 Multilingual Texts — 43.6% hard flip
🖥️ IT and Software Development — 49.3% hard flip
General Conclusions
What This Means for Users
If You Are a Student or Graduate Student
If You Are a Copywriter or Marketer
If You Are an SEO Specialist
Optimal Workflow
For Developers and Business (API)
FAQ
What is AI Text Humanization?
Which Neural Networks Did You Test?
What Was the Size of the Test Set?
How Effective Is It?
In Which Categories Does Humanization Work Best?
Where Does Humanization Work Worse?
Is It Ethical?
How to Try Humanization?

At ReText.AI, we are developing text humanization technology — an algorithm that takes text generated by a neural network and rewrites it to read as if written by a human. Not just synonym replacement, but a complete reworking of style, structure, and vocabulary.

But how well does it actually work? We decided to test it on a large scale — and conducted a study analyzing 19,804 pairs of texts in 20 thematic categories. This article presents the full results with figures, charts, and conclusions.

💡 Briefly for the impatient: In 90%+ cases, humanization successfully reduces the likelihood of the text being identified as AI-generated. For 14 out of 20 categories, more than half of the texts after processing completely "fool" the detector.

Why We Conducted This Study

With the growing popularity of ChatGPT, GigaChat, YandexGPT, and other generative models, a reverse problem has emerged: more and more platforms, universities, and editorial offices are implementing AI text detectors. Students receive lower grades, authors are accused of dishonesty, and SEO texts are filtered by search engines.

At ReText.AI, we created a text humanization feature specifically to solve this problem: so that people using AI as a tool for drafts and ideas can bring the text to a quality indistinguishable from that written by a human.

But we didn't just want to claim "it works." We wanted to prove it — with data.

Data: What We Analyzed

Text Sources

Human texts were collected from two academic datasets:

  1. COLING-2025 (Workshop on MGT Detection, Subtask B: Multilingual MGT detection) — Russian and English texts
  2. AINL-eval — scientific texts

Thematic Domains

The original texts cover a wide range of topics:

  • Social Media — posts, comments, discussions
  • Wikipedia — encyclopedic articles
  • Fiction — prose, stories
  • Administrative Documents — business correspondence, regulations
  • Scientific Texts — articles, studies, abstracts

Generation of "Machine" Variants

For each human text, we created a "machinized" version — a text rewritten by one of 8 neural networks:

| Model | Parameters | Developer |
| --- | --- | --- |
| Llama-3.2-3B-Instruct | 3B | Meta |
| Qwen3-8B | 8B | Alibaba |
| GigaChat-2-Max | n/a | Sber |
| GLM-4.6 | n/a | ZAI |
| Llama-3.3-70B-Instruct | 70B | Meta |
| GPT-oss-120B | 120B | OpenAI |
| Qwen3-235B-A22B-Instruct | 235B | Alibaba |
| T-pro-it-1.0 | n/a | T-Bank |

Why 8 models? We wanted to make sure humanization works not just on the output of a single model such as ChatGPT, but on any popular model. Each generates text with its own characteristics — and our algorithm must handle them all.

Final Dataset

After filtering and quality control, we obtained 19,804 pairs of texts (original human text + machinized version). The predominant language is Russian (~80%), with the rest being English and multilingual texts.

All texts were automatically classified into 20 thematic clusters: from recipes and cooking to IT and software development.

Methodology: How We Trained the Model

Base Model

At the core of our humanizer is the Gemma-2-9B-IT model (Google). We chose it for several reasons:

  • Good generation quality in Russian
  • Relatively compact size (9B parameters) — fast inference
  • Architecture optimized for text processing tasks

Training Method: SimPO

For fine-tuning, we used SimPO (Simple Preference Optimization) — a method from the RLHF (Reinforcement Learning from Human Feedback) family. The essence:

  1. The model generates several variants of "humanized" text for each input text
  2. Each variant is run through our AI detector — assessing the likelihood that the text was written by a neural network
  3. Variants are ranked by the humanizer_score metric, calculated as:

humanizer_score = (prob_ai_original − prob_ai_humanized) × confidence_weight

Where:

  • prob_ai_original — probability of AI authorship of the original text
  • prob_ai_humanized — probability of AI authorship after humanization
  • confidence_weight — detector confidence coefficient
  4. The best pairs (successful vs. unsuccessful humanization) form the training dataset for SimPO

In simple terms: we taught the model with examples of "this is good, this is bad" — and it learned to distinguish which stylistic techniques make the text more "human."
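The scoring and ranking step described above can be sketched in a few lines. A minimal illustration — the candidate texts, probabilities, and confidence values below are made up, not data from the study:

```python
# Sketch of the candidate-ranking step. The detector is assumed to return a
# probability of AI authorship in [0, 1]; all numbers here are illustrative.

def humanizer_score(prob_ai_original, prob_ai_humanized, confidence_weight):
    """Higher score = humanization lowered the detector's AI probability more."""
    return (prob_ai_original - prob_ai_humanized) * confidence_weight

prob_ai_original = 0.93  # detector's verdict on the input text

# Several candidate rewrites of the same input, each scored by the detector.
candidates = [
    {"text": "variant A", "prob_ai": 0.41, "confidence": 0.90},
    {"text": "variant B", "prob_ai": 0.78, "confidence": 0.80},
    {"text": "variant C", "prob_ai": 0.55, "confidence": 0.95},
]

ranked = sorted(
    candidates,
    key=lambda c: humanizer_score(prob_ai_original, c["prob_ai"], c["confidence"]),
    reverse=True,
)

# SimPO trains on (chosen, rejected) preference pairs:
# the most successful rewrite vs. the least successful one.
chosen, rejected = ranked[0], ranked[-1]
```

With these illustrative numbers, variant A (score 0.468) becomes the "chosen" example and variant B (score 0.120) the "rejected" one.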

Distribution of humanizer_score by Categories

The visualization below shows how our key metric humanizer_score is distributed across all 20 thematic clusters. The higher the value, the more effective the humanization in that category:

Distribution of humanizer_score across 20 text types — boxplot from Jupyter study

Results: Reduction in AI Detection Probability

Overall Picture

The main question: how much does humanization reduce the likelihood that a detector will identify the text as AI-generated?

Answer: radically.

| Metric | Before Humanization | After Humanization |
| --- | --- | --- |
| Median AI probability | ~0.93 (93%) | ~0.47 (47%) |
| Texts with AI probability > 0.8 | Overwhelming majority | < 20% |
| Texts with AI probability < 0.5 | Few | > 50% |

Before processing, the probability distribution is squeezed up against 1 — almost all texts are confidently identified as machine-generated. After processing, the distribution shifts to the left — half of the texts are now classified as human.

Distribution of AI probabilities for original texts (median 0.92) and for humanized texts (median 0.45) — ReText.AI study results

By Thematic Categories: Share of Improved Texts

In every category, the humanizer reduces the probability of AI detection for the large majority of texts — from 83.5% to 100% of them, depending on the topic:

| Category | Share of Texts with Improvement |
| --- | --- |
| 🍳 Recipes and Cooking | 100.0% |
| 🏠 Everyday Life and Reflections | 93.8% |
| 👥 Personnel Management and Organization | 93.7% |
| ⚖️ Legal Systems and Legislation | 93.4% |
| 📣 Marketing and Advertising | 93.0% |
| 🔬 Scientific Research and Methods | 92.7% |
| 🧠 Psychology and Society | 91.9% |
| 📚 Literature and Text Analysis | 91.5% |
| 🎓 Education and Training | 90.9% |
| 💼 Business and Market Analysis | 90.4% |
| 📊 Data Analysis and ML | 89.8% |
| 🏙️ Urban Systems and Services | 89.0% |
| 💰 Economics | 88.7% |
| ✍️ Personal Stories and Narratives | 88.1% |
| 📰 News about Russia | 87.9% |
| 🎮 Computer Games | 87.3% |
| 🖥️ Information Systems and Software Development | 85.2% |
| 🎨 Culture and Art | 85.0% |
| 🔧 Digital Technologies and Innovations | 84.7% |
| 🌐 Multilingual Texts | 83.5% |

Share of texts improved by the humanizer — from 83.5% for multilingual texts to 100% for recipes

Key observation: even for the most challenging categories (IT, digital technologies, multilingual texts), more than 83% of texts show improvement.

Results: Who Completely "Fooled" the Detector

Reducing probability is one thing. But we were also interested in a stricter question: what share of texts after humanization completely changes the detector's verdict — from "AI" to "human"?

We called this the Hard Flip Rate — the percentage of texts that were identified as AI-generated before processing, and as human after (AI probability < 0.5).
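The definition above is straightforward to compute. A minimal sketch — the before/after probability pairs are made up for illustration:

```python
# Hard Flip Rate: among texts flagged as AI before processing (prob >= 0.5),
# the share whose AI probability drops below 0.5 after humanization.
# The sample pairs below are illustrative, not the study's raw data.

pairs = [  # (prob_ai_before, prob_ai_after)
    (0.95, 0.30),  # flipped: "AI" -> "human"
    (0.90, 0.65),  # improved, but still flagged as AI
    (0.88, 0.42),  # flipped
    (0.97, 0.96),  # barely changed
]

flagged_before = [(b, a) for b, a in pairs if b >= 0.5]
hard_flips = [(b, a) for b, a in flagged_before if a < 0.5]
hard_flip_rate = len(hard_flips) / len(flagged_before)

# The softer metric from the previous section: any reduction counts.
improved = sum(1 for b, a in pairs if a < b)
improvement_share = improved / len(pairs)
```

Note how the two metrics diverge: in this toy sample every text improves (improvement share 100%), but only half cross the 0.5 threshold (hard flip rate 50%) — the same pattern seen in the "Everyday Life" category below.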

| Category | Hard Flip Rate |
| --- | --- |
| 🍳 Recipes and Cooking | 66.7% |
| ⚖️ Legal Systems and Legislation | 64.2% |
| 💼 Business and Market Analysis | 61.0% |
| 🔬 Scientific Research and Methods | 60.8% |
| 👥 Personnel Management and Organization | 59.9% |
| 📣 Marketing and Advertising | 58.9% |
| 🧠 Psychology and Society | 57.8% |
| 🏙️ Urban Systems and Services | 57.1% |
| 📚 Literature and Text Analysis | 56.4% |
| 📊 Data Analysis and ML | 56.2% |
| ✍️ Personal Stories and Narratives | 55.7% |
| 🎓 Education and Training | 52.9% |
| 💰 Economics | 51.6% |
| 📰 News about Russia | 51.1% |
| 🎮 Computer Games | 49.3% |
| 🖥️ Information Systems and Software Development | 49.3% |
| 🎨 Culture and Art | 47.4% |
| 🔧 Digital Technologies and Innovations | 44.0% |
| 🌐 Multilingual Texts | 43.6% |
| 🏠 Everyday Life and Reflections | 41.0% |

Hard Flip Rate by 20 categories — recipes (66.7%), law (64.2%), business (61.0%) lead

Result: for 14 out of 20 categories, more than half of the texts completely "fool" the detector. For the top 6 categories (law, business, science, HR, marketing, cooking), the rate exceeds 58%.

Where Humanization Works Best

🍳 Recipes and Cooking — Leader (100% improvement, 66.7% hard flip)

Culinary texts showed the best results of any category. The reason: recipes have a characteristically conversational structure that our humanizer reproduces especially well — "add a pinch," "by eye," "once browned" instead of "add 2.5 g of NaCl."

⚖️ Legal Texts — 64.2% hard flip

Legal texts are effectively humanized because the legal style has established formulations. The model learned to make them more "lively" while maintaining accuracy of meaning.

💼 Business and Science — 60%+

Business and scientific texts have a formalized structure that can be naturally "diluted" with human constructions — introductory words, authorial remarks, syntactic variability.

Average AI Probability Delta by Categories

Another metric — how much the probability of AI authorship decreases on average:

| Category | Average Delta |
| --- | --- |
| Recipes and Cooking | +0.518 |
| Personnel Management | +0.457 |
| Legal Systems | +0.451 |
| Business and Market Analysis | +0.450 |
| Psychology and Society | +0.429 |
| Literature and Text Analysis | +0.414 |
| Marketing and Advertising | +0.413 |
| Scientific Research | +0.404 |

On average, the probability of AI authorship decreases by 0.35–0.52 points — this is a radical change.
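The per-category averages above boil down to a group-by-and-mean over (before, after) probability pairs. A minimal sketch, with made-up records in place of the study's raw data:

```python
# Average AI-probability delta per category:
# delta = prob_ai_before - prob_ai_after, averaged within each category.
# The records below are illustrative, not the study's dataset.
from collections import defaultdict

records = [
    {"category": "Recipes", "before": 0.95, "after": 0.40},
    {"category": "Recipes", "before": 0.90, "after": 0.45},
    {"category": "IT",      "before": 0.92, "after": 0.70},
]

sums = defaultdict(lambda: [0.0, 0])  # category -> [delta sum, count]
for r in records:
    sums[r["category"]][0] += r["before"] - r["after"]
    sums[r["category"]][1] += 1

avg_delta = {cat: total / n for cat, (total, n) in sums.items()}
# "Recipes" averages a delta of 0.50 here; "IT" only 0.22 —
# the same ordering of structured vs. technical domains as in the table.
```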

Δ prob_isfake by text type — boxplot shows distribution of AI probability changes for each category

Where There Are Challenges

🏠 Everyday Life and Reflections — 41.0% hard flip

Paradox: this category shows an excellent improvement rate (93.8%) but a low hard flip rate. The reason is that everyday texts already have a varied style, so it is hard to push the detector's score all the way below the 0.5 threshold.

🌐 Multilingual Texts — 43.6% hard flip

Our dataset predominantly contained Russian and English texts. For other languages, the model needs a larger training set. We plan to expand multilingual support in future versions.

🖥️ IT and Software Development — 49.3% hard flip

Technical texts with terminology, code fragments, and specific syntax are the most challenging category for humanization. Nevertheless, for almost half of the texts, the detector is completely "fooled."

General Conclusions

  1. For more than 90% of texts, the humanizer successfully reduces the probability of AI detection — regardless of the topic
  2. For 14 out of 20 categories, more than half of the texts completely change the detector's verdict (from "AI" to "human")
  3. Despite the relatively compact model (9B parameters) and dataset (~20K pairs), fine-tuning with SimPO showed high efficiency
  4. The best results are in structured domains (law, business, science, marketing)
  5. There is potential for improvement in multilingual texts and IT topics

What This Means for Users

If You Are a Student or Graduate Student

Using ChatGPT or GigaChat for drafts of term papers and essays? Text humanization is not "cheating," but a tool to bring a draft to a human level. In 60%+ of cases, text on legal and scientific topics will become indistinguishable from what you wrote.

If You Are a Copywriter or Marketer

Generating content for clients using AI? Marketing and business texts are humanized with a 58–61% hard flip rate. Run the text through the humanizer, then check it in our AI detector — and be confident in the result.

If You Are an SEO Specialist

Search engines are learning to identify AI content. Humanization reduces the likelihood of filtering — and protects your positions in organic search results.

Optimal Workflow

  1. Generate a draft in ChatGPT / GigaChat / any neural network
  2. Humanize in ReText.AI
  3. Check through the AI detector
  4. Do a final manual edit

For Developers and Business (API)

We provide an API for both the AI detector and text humanization. If you want to integrate our technologies into your product, simply reach out to us at team@retext.ai.
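This article does not publish the API's endpoint names or request schema, so everything in the sketch below — the base URL, path, payload fields, and auth header — is a hypothetical illustration of what an integration might look like; the actual contract comes from the team:

```python
# HYPOTHETICAL integration sketch: the URL, endpoint path, payload fields,
# and header names below are assumptions for illustration only.
import json
import urllib.request

API_BASE = "https://api.retext.ai"  # hypothetical base URL

def build_humanize_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to a hypothetical endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/v1/humanize",  # hypothetical path
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_humanize_request("Draft generated by an LLM...", "YOUR_API_KEY")
# urllib.request.urlopen(req) would send it; omitted here.
```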

More about our tools: read the review TOP-20 Neural Networks Online in 2026, where we thoroughly analyze the entire ReText.AI product stack.

FAQ

What is AI Text Humanization?

It is a technology that rewrites text generated by a neural network so that it reads as if written by a human. Not a simple synonym replacement, but a deep reworking of style, structure, and vocabulary with a specially trained model.

Which Neural Networks Did You Test?

We tested texts generated by 8 models: Llama-3.2-3B, Qwen3-8B, GigaChat-2-Max, GLM-4.6, Llama-3.3-70B, GPT-oss-120B, Qwen3-235B, and T-pro-it-1.0. This covers the current landscape from 3B to 235B parameters.

What Was the Size of the Test Set?

19,804 pairs of texts (original human text + machinized version), classified into 20 thematic categories.

How Effective Is It?

For more than 90% of texts, the likelihood of being identified as AI decreases. For 14 out of 20 categories, more than half of the texts completely change the detector's verdict from "AI" to "human."

In Which Categories Does Humanization Work Best?

Leaders: recipes and cooking (66.7% hard flip), legal texts (64.2%), business (61.0%), science (60.8%), HR (59.9%), and marketing (58.9%).

Where Does Humanization Work Worse?

The most challenging categories: everyday texts (41.0%), multilingual texts (43.6%), digital technologies (44.0%), and culture/art (47.4%). But even for them, 83%+ of texts show improvement.

Is It Ethical?

We believe that AI is a tool, not an author. Just as a calculator does not replace a mathematician, humanization does not replace a writer — it helps bring a draft to a publication-worthy quality. Important: we do not encourage passing off AI text as your own without refinement. We encourage using technology as part of the workflow.

How to Try Humanization?

Go to retext.ai/ai-text-humanizer — up to 1,000 characters can be processed for free.

Olga Shkryaba
Founder and CEO of Retext.ai
