ReText.AI Study: How AI Text Humanization Works — 20,000 Texts, 8 Models, 20 Categories
At ReText.AI, we are developing text humanization technology — an algorithm that takes text generated by a neural network and rewrites it to read as if written by a human. Not just synonym replacement, but a complete reworking of style, structure, and vocabulary.
But how well does it actually work? We decided to test it on a large scale — and conducted a study analyzing 19,804 pairs of texts in 20 thematic categories. This article presents the full results with figures, charts, and conclusions.
💡 Briefly for the impatient: In more than 90% of cases, humanization successfully reduces the likelihood of the text being identified as AI-generated. In 14 out of 20 categories, more than half of the texts completely "fool" the detector after processing.
Why We Conducted This Study
With the growing popularity of ChatGPT, GigaChat, YandexGPT, and other generative models, a reverse problem has emerged: more and more platforms, universities, and editorial offices are implementing AI text detectors. Students receive lower grades, authors are accused of dishonesty, and SEO texts are filtered by search engines.
At ReText.AI, we created a text humanization feature specifically to solve this problem: so that people using AI as a tool for drafts and ideas can bring the text to a quality indistinguishable from that written by a human.
But we didn't just want to claim "it works." We wanted to prove it — with data.
Data: What We Analyzed
Text Sources
Human texts were collected from two academic datasets:
- COLING-2025 (Workshop on MGT Detection, Subtask B: Multilingual MGT detection) — Russian and English texts
- AINL-eval — scientific texts
Thematic Domains
The original texts cover a wide range of topics:
- Social Media — posts, comments, discussions
- Wikipedia — encyclopedic articles
- Fiction — prose, stories
- Administrative Documents — business correspondence, regulations
- Scientific Texts — articles, studies, abstracts
Generation of "Machine" Variants
For each human text, we created a "machinized" version — a text rewritten by one of 8 neural networks:
| Model | Parameters | Developer |
|---|---|---|
| Llama-3.2-3B-Instruct | 3B | Meta |
| Qwen3-8B | 8B | Alibaba |
| GigaChat-2-Max | — | Sber |
| GLM-4.6 | — | ZAI |
| Llama-3.3-70B-Instruct | 70B | Meta |
| GPT-oss-120B | 120B | OpenAI |
| Qwen3-235B-A22B-Instruct | 235B | Alibaba |
| T-pro-it-1.0 | — | T-Bank |
Why 8 models? We wanted to ensure that humanization works not just on the output of a single model like ChatGPT, but on any popular model. Each one generates text with its own characteristics — and our algorithm must handle them all.
Final Dataset
After filtering and quality control, we obtained 19,804 pairs of texts (original human text + machinized version). The predominant language is Russian (~80%), with the rest being English and multilingual texts.
All texts were automatically classified into 20 thematic clusters: from recipes and cooking to IT and software development.
Methodology: How We Trained the Model
Base Model
At the core of our humanizer is the Gemma-2-9B-IT model (Google). We chose it for several reasons:
- Good generation quality in Russian
- Relatively compact size (9B parameters) — fast inference
- Architecture optimized for text processing tasks
Training Method: SimPO
For fine-tuning, we used SimPO (Simple Preference Optimization) — a reference-free preference-optimization method from the RLHF (Reinforcement Learning from Human Feedback) family. The pipeline works as follows:
- The model generates several variants of "humanized" text for each input text
- Each variant is run through our AI detector — assessing the likelihood that the text was written by a neural network
- Variants are ranked by the humanizer_score metric, calculated as:
humanizer_score = (prob_ai_original − prob_ai_humanized) × confidence_weight
Where:
- prob_ai_original — probability of AI authorship of the original text
- prob_ai_humanized — probability of AI authorship after humanization
- confidence_weight — detector confidence coefficient
- The best pairs (successful vs. unsuccessful humanization) form the training dataset for SimPO
In simple terms: we taught the model with examples of "this is good, this is bad" — and it learned to distinguish which stylistic techniques make the text more "human."
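The ranking step above can be sketched in a few lines of Python. This is a hypothetical illustration, not ReText.AI's actual code: humanizer_score follows the formula given above, while build_simpo_pair and the hard-coded candidate probabilities are our own illustrative stand-ins for the detector calls.

```python
def humanizer_score(prob_ai_original, prob_ai_humanized, confidence_weight=1.0):
    """Reward for how much humanization lowered the detector's AI probability."""
    return (prob_ai_original - prob_ai_humanized) * confidence_weight

def build_simpo_pair(prob_ai_original, candidates, confidence_weight=1.0):
    """Rank candidate rewrites and return a (chosen, rejected) pair for SimPO.

    `candidates` is a list of (text, prob_ai_humanized) tuples, where the
    probability would normally come from running the rewrite through a detector.
    """
    ranked = sorted(
        candidates,
        key=lambda c: humanizer_score(prob_ai_original, c[1], confidence_weight),
        reverse=True,
    )
    # Best rewrite becomes the "chosen" example, worst becomes "rejected".
    return ranked[0][0], ranked[-1][0]

# Mock example: original text scored 0.93 by the detector, three rewrites.
chosen, rejected = build_simpo_pair(
    0.93,
    [("rewrite A", 0.41), ("rewrite B", 0.72), ("rewrite C", 0.55)],
)
```

The rewrite with the largest probability drop becomes the preferred example, and the one with the smallest drop becomes the rejected one; SimPO then trains the model to favor the former.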
Distribution of humanizer_score by Categories
The visualization below shows how our key metric humanizer_score is distributed across all 20 thematic clusters. The higher the value, the more effective the humanization in that category:
Results: Reduction in AI Detection Probability
Overall Picture
The main question: how much does humanization reduce the likelihood that a detector will identify the text as AI-generated?
Answer: radically.
| Metric | Before Humanization | After Humanization |
|---|---|---|
| Median AI Probability | ~0.93 (93%) | ~0.47 (47%) |
| Texts with AI Probability > 0.8 | Overwhelming majority | <20% |
| Texts with AI Probability < 0.5 | Few | >50% |
Before processing, the probability distribution is compressed toward 1: almost all texts are confidently identified as machine-generated. After processing, the distribution shifts to the left, and half of the texts are now classified as human.
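Aggregates like these can be reproduced from per-text probability pairs. A minimal sketch, assuming each text is represented as a (prob_before, prob_after) tuple; detection_summary and the sample values are illustrative, not the study's code:

```python
from statistics import median

def detection_summary(pairs):
    """Aggregate detector probabilities before and after humanization.

    `pairs` is a list of (prob_before, prob_after) tuples, one per text.
    """
    before = [b for b, _ in pairs]
    after = [a for _, a in pairs]
    n = len(pairs)
    return {
        "median_before": median(before),
        "median_after": median(after),
        # Share of processed texts still confidently flagged as AI.
        "share_after_above_0.8": sum(a > 0.8 for a in after) / n,
        # Share of processed texts now classified as human.
        "share_after_below_0.5": sum(a < 0.5 for a in after) / n,
    }

# Toy sample of four texts.
summary = detection_summary([(0.95, 0.40), (0.92, 0.55), (0.90, 0.30), (0.96, 0.85)])
```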
By Thematic Categories: Share of Improved Texts
In every category, the humanizer reduces the probability of AI detection for the large majority of texts, from 83.5% of texts in the hardest category to 100% in the easiest:
| Category | Share of Texts with Improvement |
|---|---|
| 🍳 Recipes and Cooking | 100.0% |
| 🏠 Everyday Life and Reflections | 93.8% |
| 👥 Personnel Management and Organization | 93.7% |
| ⚖️ Legal Systems and Legislation | 93.4% |
| 📣 Marketing and Advertising | 93.0% |
| 🔬 Scientific Research and Methods | 92.7% |
| 🧠 Psychology and Society | 91.9% |
| 📚 Literature and Text Analysis | 91.5% |
| 🎓 Education and Training | 90.9% |
| 💼 Business and Market Analysis | 90.4% |
| 📊 Data Analysis and ML | 89.8% |
| 🏙️ Urban Systems and Services | 89.0% |
| 💰 Economics | 88.7% |
| ✍️ Personal Stories and Narratives | 88.1% |
| 📰 News about Russia | 87.9% |
| 🎮 Computer Games | 87.3% |
| 🖥️ Information Systems and Software Development | 85.2% |
| 🎨 Culture and Art | 85.0% |
| 🔧 Digital Technologies and Innovations | 84.7% |
| 🌐 Multilingual Texts | 83.5% |
Key observation: even for the most challenging categories (IT, digital technologies, multilingual texts), more than 83% of texts show improvement.
Results: Who Completely "Fooled" the Detector
Reducing probability is one thing. But we were also interested in a stricter question: what share of texts after humanization completely changes the detector's verdict — from "AI" to "human"?
We called this the Hard Flip Rate — the percentage of texts that were identified as AI-generated before processing, and as human after (AI probability < 0.5).
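Computing this metric from per-text probability pairs is straightforward. A minimal sketch; hard_flip_rate is our illustrative name for the calculation, with the 0.5 threshold matching the definition above:

```python
def hard_flip_rate(pairs, threshold=0.5):
    """Share of texts flagged as AI before humanization (prob >= threshold)
    whose post-humanization probability drops below the threshold."""
    flagged = [(b, a) for b, a in pairs if b >= threshold]
    if not flagged:
        return 0.0
    flips = sum(a < threshold for _, a in flagged)
    return flips / len(flagged)

# Toy sample: three texts flagged as AI, two of them flip to "human".
rate = hard_flip_rate([(0.90, 0.40), (0.95, 0.60), (0.30, 0.20), (0.85, 0.45)])
```

Note that texts never flagged as AI in the first place (the third pair) are excluded from the denominator, since there is no verdict for them to flip.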
| Category | Hard Flip Rate |
|---|---|
| 🍳 Recipes and Cooking | 66.7% |
| ⚖️ Legal Systems and Legislation | 64.2% |
| 💼 Business and Market Analysis | 61.0% |
| 🔬 Scientific Research and Methods | 60.8% |
| 👥 Personnel Management and Organization | 59.9% |
| 📣 Marketing and Advertising | 58.9% |
| 🧠 Psychology and Society | 57.8% |
| 🏙️ Urban Systems and Services | 57.1% |
| 📚 Literature and Text Analysis | 56.4% |
| 📊 Data Analysis and ML | 56.2% |
| ✍️ Personal Stories and Narratives | 55.7% |
| 🎓 Education and Training | 52.9% |
| 💰 Economics | 51.6% |
| 📰 News about Russia | 51.1% |
| 🎮 Computer Games | 49.3% |
| 🖥️ Information Systems and Software Development | 49.3% |
| 🎨 Culture and Art | 47.4% |
| 🔧 Digital Technologies and Innovations | 44.0% |
| 🌐 Multilingual Texts | 43.6% |
| 🏠 Everyday Life and Reflections | 41.0% |
Result: for 14 out of 20 categories, more than half of the texts completely "fool" the detector. For the top 6 categories (law, business, science, HR, marketing, cooking), the rate exceeds 58%.
Where Humanization Works Best
🍳 Recipes and Cooking — Leader (100% improvement, 66.7% hard flip)
Culinary texts showed absolutely the best results. Reason: recipes have a characteristic conversational structure that our humanizer reproduces especially well — "add a pinch," "by eye," "when browned" instead of "add 2.5 g NaCl."
⚖️ Legal Texts — 64.2% hard flip
Legal texts are effectively humanized because the legal style has established formulations. The model learned to make them more "lively" while maintaining accuracy of meaning.
💼 Business and Science — 60%+
Business and scientific texts have a formalized structure that can be naturally "diluted" with human constructions — introductory words, authorial remarks, syntactic variability.
Average AI Probability Delta by Categories
Another metric — how much the probability of AI authorship decreases on average:
| Category | Average Delta |
|---|---|
| Recipes and Cooking | +0.518 |
| Personnel Management | +0.457 |
| Legal Systems | +0.451 |
| Business and Market Analysis | +0.450 |
| Psychology and Society | +0.429 |
| Literature and Text Analysis | +0.414 |
| Marketing and Advertising | +0.413 |
| Scientific Research | +0.404 |
On average, the probability of AI authorship decreases by 0.35–0.52 points — this is a radical change.
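The per-category delta above is just the mean drop in AI probability over each category's texts. A minimal sketch of how it could be computed, assuming (category, prob_before, prob_after) triples; the function name and sample data are illustrative:

```python
from collections import defaultdict
from statistics import mean

def average_delta_by_category(records):
    """Mean drop in AI probability per category.

    `records` are (category, prob_before, prob_after) triples. Delta is
    prob_before - prob_after, so positive values mean the processed text
    looks "more human" to the detector.
    """
    by_cat = defaultdict(list)
    for category, before, after in records:
        by_cat[category].append(before - after)
    return {cat: mean(deltas) for cat, deltas in by_cat.items()}

# Toy sample: two recipe texts and one IT text.
deltas = average_delta_by_category([
    ("recipes", 0.95, 0.40),
    ("recipes", 0.90, 0.45),
    ("it", 0.92, 0.70),
])
```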
Where There Are Challenges
🏠 Everyday Life and Reflections — 41.0% hard flip
Paradox: this category shows an excellent improvement percentage (93.8%) but low hard flip. The reason is that everyday texts already have a diverse style, making it difficult to "switch" the detector over the 0.5 threshold.
🌐 Multilingual Texts — 43.6% hard flip
Our dataset predominantly contained Russian and English texts. For other languages, the model needs a larger training set. We plan to expand multilingual support in future versions.
🖥️ IT and Software Development — 49.3% hard flip
Technical texts with terminology, code fragments, and specific syntax are the most challenging category for humanization. Nevertheless, for almost half of the texts, the detector is completely "fooled."
General Conclusions
- For more than 90% of texts, the humanizer successfully reduces the probability of AI detection — regardless of the topic
- For 14 out of 20 categories, more than half of the texts completely change the detector's verdict (from "AI" to "human")
- Despite the relatively compact model (9B parameters) and dataset (~20K pairs), fine-tuning with SimPO showed high efficiency
- The best results are in structured domains (law, business, science, marketing)
- There is potential for improvement in multilingual texts and IT topics
What This Means for Users
If You Are a Student or Graduate Student
Using ChatGPT or GigaChat for drafts of term papers and essays? Text humanization is not "cheating," but a tool to bring a draft to a human level. In more than 60% of cases, text on legal and scientific topics becomes indistinguishable from your own writing.
If You Are a Copywriter or Marketer
Generating content for clients using AI? Marketing and business texts are humanized with a 58–61% hard flip rate. Run the text through the humanizer, then check it in our AI detector — and be confident in the result.
If You Are an SEO Specialist
Search engines are learning to identify AI content. Humanization reduces the likelihood of filtering — and preserves your work with organic search results.
Optimal Workflow
- Generate a draft in ChatGPT / GigaChat / any neural network
- Humanize in ReText.AI
- Check through the AI detector
- Final Edit manually
For Developers and Business (API)
We provide an API for both the AI detector and text humanization. If you want to integrate our technologies into your product, simply reach out to us at team@retext.ai.
More about our tools: read the review TOP-20 Neural Networks Online in 2026, where we thoroughly analyze the entire ReText.AI product stack.
FAQ
What is AI Text Humanization?
It is a technology that rewrites text generated by a neural network so that it reads as if written by a human. Not a simple synonym replacement, but a deep reworking of style, structure, and vocabulary with a specially trained model.
Which Neural Networks Did You Test?
We tested texts generated by 8 models: Llama-3.2-3B, Qwen3-8B, GigaChat-2-Max, GLM-4.6, Llama-3.3-70B, GPT-oss-120B, Qwen3-235B, and T-pro-it-1.0. This covers the current landscape from 3B to 235B parameters.
What Was the Size of the Test Set?
19,804 pairs of texts (original human text + machinized version), classified into 20 thematic categories.
How Effective Is It?
For more than 90% of texts, the likelihood of being identified as AI decreases. For 14 out of 20 categories, more than half of the texts completely change the detector's verdict from "AI" to "human."
In Which Categories Does Humanization Work Best?
Leaders: recipes and cooking (66.7% hard flip), legal texts (64.2%), business (61.0%), science (60.8%), HR (59.9%), and marketing (58.9%).
Where Does Humanization Work Worse?
The most challenging categories: everyday texts (41.0%), multilingual texts (43.6%), digital technologies (44.0%), and culture/art (47.4%). But even for them, 83%+ of texts show improvement.
Is It Ethical?
We believe that AI is a tool, not an author. Just as a calculator does not replace a mathematician, humanization does not replace a writer — it helps bring a draft to a publication-worthy quality. Important: we do not encourage passing off AI text as your own without refinement. We encourage using technology as part of the workflow.
How to Try Humanization?
Go to retext.ai/ai-text-humanizer — up to 1,000 characters can be processed for free.