The bottleneck in building better AI models has never been compute alone; it has always been data quality. Meta AI's RAM (Reasoning, Alignment, and Memory) team is now tackling that problem directly. Meta researchers have introduced AutoData, a framework that puts AI agents in the role of an autonomous data scientist, tasked with iteratively building, testing, and refining training and evaluation datasets, without relying on expensive human annotation at every step.
And the results, tested on complex scientific reasoning problems, show that this method does more than match older synthetic data generation approaches; it far surpasses them.

Why Synthetic Data Generation Has Always Been Hard
To understand what AutoData solves, you need to understand how AI training data is typically created today.

Most modern AI systems started with data written by humans. As models matured, researchers began supplementing it with synthetic data: data generated by models themselves. Synthetic data is attractive because it can cover rare cases, reduce the cost of manual labeling, and produce more challenging examples than exist naturally in public corpora.
The best-known way to generate synthetic data is Self-Instruct: prompting a large language model (LLM) with a handful of seed examples to generate new training samples. Grounded Self-Instruct methods extended this by anchoring generation in documents and other sources to reduce hallucinations and increase diversity. Chain-of-Thought Self-Instruct (CoT Self-Instruct) went further, using step-by-step reasoning during generation to create harder, more precise tasks. More recently, "self-challenging" methods let an adversarial agent interact with tools before proposing tasks and corresponding evaluation criteria; this line of work is the closest precursor to what AutoData does.
The problem? None of these methods gave researchers a feedback-driven way to control or iteratively improve data quality during the generation process itself. You could filter, transform, or refine the data after the fact, but the generation pipeline remained static and one-pass.
AutoData changes that.


What AutoData Actually Does
AutoData is an approach that lets AI agents act as data scientists, iteratively generating high-quality training and test data. Instead of generating data in a single pass, the agent runs a closed-loop pipeline modeled on how a real human data scientist works:
- Data Generation – The agent ingests the provided source documents (research papers, code, legal text, etc.) and uses tools and learned skills to create training or evaluation examples.
- Data Analysis – The agent then inspects what it created: Is this example correct? High quality? Challenging enough? It aggregates findings at the instance level and, eventually, at the dataset level (Is the dataset diverse? Does it improve the model when used as training data?).
- Iteration – Using those findings, the agent updates its data generation recipe and loops back to create better data. This continues until a stopping condition is met.
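In code, the closed loop is easy to picture. Below is a minimal sketch of the generate–analyze–iterate cycle; the function names and stub logic are assumptions for illustration, not Meta's implementation.

```python
from dataclasses import dataclass

# Minimal sketch of the AutoData closed loop. Illustrative only: the stub
# functions stand in for LLM-agent calls and are assumptions, not Meta's code.

@dataclass
class Report:
    accepted: list   # examples that passed the agent's checks
    findings: str    # dataset-level feedback for the next round
    stop: bool       # stopping condition reached?

def generate_examples(docs, recipe):
    # Real system: an agent reads source documents and drafts QA examples.
    return [f"QA drafted from {d} with recipe '{recipe}'" for d in docs]

def analyze(candidates, dataset):
    # Real system: checks correctness, difficulty, and dataset diversity.
    return Report(accepted=candidates, findings="raise difficulty", stop=False)

def update_recipe(recipe, findings):
    # Real system: the agent rewrites its own generation instructions.
    return f"{recipe}; {findings}"

def autodata_loop(docs, recipe, max_rounds=3):
    dataset = []
    for _ in range(max_rounds):
        candidates = generate_examples(docs, recipe)      # 1. data generation
        report = analyze(candidates, dataset)             # 2. data analysis
        dataset.extend(report.accepted)
        recipe = update_recipe(recipe, report.findings)   # 3. iteration
        if report.stop:
            break
    return dataset

print(autodata_loop(["paper.pdf"], "ask paper-specific questions"))
```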
Agentic data creation offers a way to convert extended inference-time computation into high-quality model training: the more test-time compute you give the agent, the better the data it produces – an important insight for anyone budgeting compute.
First Implementation: Agentic Self-Instruct
Meta's first instantiation of AutoData is called Agentic Self-Instruct, and it is built around an orchestrator LLM that coordinates four specialized subagents:
- Challenger LLM – generates a candidate training example (a question plus its evaluation rubric) based on guidance from the orchestrator
- Weak Solver – a smaller, less capable model that is expected to fail on the generated example
- Strong Solver – a highly capable model that is expected to succeed in general
- Verifier/Judge – checks whether each solver's output meets the quality criteria, using the rubrics produced by the Challenger LLM
An important design note: the Weak and Strong Solvers can be the same LLM operating under different conditions. For example, the strong variant might be allowed extra test-time compute, including scaffolding or ensembling, plus access to privileged information – giving practitioners flexibility in how they define the capability gap (a hypothetical configuration is sketched below).
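For instance, a same-model split might look like the following configuration sketch; the field names and values here are hypothetical, chosen only to illustrate varying test-time compute and privileged access.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical configs for one base LLM acting as both solvers.
# All field names and values are illustrative assumptions.

@dataclass
class SolverConfig:
    n_samples: int                     # ensemble size (majority vote)
    temperature: float                 # sampling temperature
    use_scaffolding: bool              # allow multi-step tool use
    privileged_context: Optional[str]  # e.g., the full source paper

weak_solver = SolverConfig(
    n_samples=1, temperature=0.0,
    use_scaffolding=False, privileged_context=None,
)

strong_solver = SolverConfig(
    n_samples=8, temperature=0.7,
    use_scaffolding=True, privileged_context="full source paper text",
)
```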
The acceptance criteria are precise and multidimensional. For an instance to be accepted into the dataset, all four of the following must hold:
- The example must pass quality verification (QV)
- weak_avg ≤ 65% and max_weak ≤ 75%, with no zero scores – the question must be hard for the weak solver but not broken
- strong_avg ≥ 60% and strong_avg < 95% – the question must be solvable for the strong solver but not trivially easy
- strong_avg − weak_avg ≥ 20% – a real capability gap between the two solvers
If any of those limits is not met, the orchestrator sends targeted feedback to the Challenger and tries again from a different angle of attack. This loop typically runs a few cycles per paper (median 3–5) before producing an accepted query or exhausting its step budget.
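Those thresholds translate almost directly into a filter. Here is a minimal sketch of the acceptance check, assuming scores are percentages in [0, 100] and with function and argument names of my own choosing:

```python
# Sketch of the four acceptance checks. Thresholds come from the article;
# the function shape and score representation are assumptions.

def accept(passes_qv: bool, weak_scores: list[float],
           strong_scores: list[float]) -> bool:
    weak_avg = sum(weak_scores) / len(weak_scores)
    strong_avg = sum(strong_scores) / len(strong_scores)
    return (
        passes_qv                                    # 1. quality verification
        and weak_avg <= 65
        and max(weak_scores) <= 75
        and all(s > 0 for s in weak_scores)          # 2. hard but not broken
        and 60 <= strong_avg < 95                    # 3. solvable, not trivial
        and strong_avg - weak_avg >= 20              # 4. real capability gap
    )

print(accept(True, weak_scores=[40, 55], strong_scores=[70, 80]))  # True
```

When the check fails, the orchestrator's feedback to the Challenger can target whichever specific condition was violated.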
Important Numbers
The quality benefits over traditional CoT Self-Instruct are measurable and significant.
Under CoT Self-Instruct, the two solvers score almost identically – the weak solver at 71.4% and the strong at 73.3%, a gap of only 1.9 percentage points – indicating that single-pass generation fails to produce tasks that separate the two models. Agentic Self-Instruct lowered the weak solver's score to 43.7% while raising the strong solver's to 77.8%, widening the gap to 34 points. The agentic generation loop produces questions that preferentially reward strong-model capabilities, rather than questions both models can answer equally well.
The dataset itself was built by processing more than 10,000 CS papers from the S2ORC corpus (2022 onward), yielding 2,117 QA pairs that satisfy all of the quality and performance-gap constraints.
When Qwen-3.5-4B was then trained with GRPO for roughly one epoch (group size 32, learning rate 1e-6) on Agentic Self-Instruct data versus CoT Self-Instruct data – using Kimi-K2.6 as a reward model that scores responses against the generated rubrics – the model trained on the agentic data clearly outperformed on both in-distribution and out-of-distribution test sets.
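Rubric-based rewards like this are straightforward to compute once a judge verdict is available. A hedged sketch, where judge_meets_criterion is a toy stand-in for the LLM judge and the rubric shape is an assumption:

```python
# Sketch of a rubric-based reward for RL training. The rubric shape and the
# judge stub are illustrative assumptions, not Meta's API; in the article's
# setup an LLM judge (Kimi-K2.6) issues the per-criterion verdicts.

def judge_meets_criterion(response: str, criterion: str) -> bool:
    # Toy stand-in: substring match instead of an LLM judge call.
    return criterion.lower() in response.lower()

def rubric_reward(response: str, rubric: list[dict]) -> float:
    """Weighted fraction of rubric criteria the response satisfies."""
    total = sum(item["weight"] for item in rubric)
    earned = sum(item["weight"] for item in rubric
                 if judge_meets_criterion(response, item["criterion"]))
    return earned / total if total else 0.0

rubric = [
    {"criterion": "explains why the baseline fails", "weight": 2},
    {"criterion": "cites the key ablation result",   "weight": 3},
]
print(rubric_reward("This explains why the baseline fails: ...", rubric))  # 0.4
```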
Meta-Optimization: Teaching the Agent to Be a Better Data Scientist
AutoData goes one level deeper. Beyond the inner data creation loop, the framework supports meta-optimization of the data science agent itself – using the same inner-loop quality criteria to evolve the outer-loop agent harness (the agent's code scaffolding, prompts, and evaluation logic).
Using an evolution-based optimization framework, the meta-optimizer ran 233 iterations, of which 126 were accepted (a candidate harness is only added to the population if its validation result strictly exceeds its parent's). The meta-optimizer used Kimi-K2.6 both as an analyzer – studying full evaluation traces to identify systematic failure patterns – and as a mutator that modified the agent harness through a code-editing agent. The setup used 50 training papers and 25 validation papers.
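The outer loop amounts to a simple evolutionary search over harnesses. A minimal sketch of the acceptance rule just described, where mutate_harness and evaluate are stand-ins for the code-editing agent and the validation run:

```python
import random

# Sketch of the evolutionary outer loop over agent harnesses. mutate_harness
# and evaluate are stand-ins for the code-editing agent and validation runs.

def evolve(seed_harness, evaluate, mutate_harness, iterations=233):
    population = [(seed_harness, evaluate(seed_harness))]
    for _ in range(iterations):
        parent, parent_score = random.choice(population)
        child = mutate_harness(parent)      # analyzer + code-editing agent
        child_score = evaluate(child)       # pass rate on validation papers
        if child_score > parent_score:      # kept only if it beats its parent
            population.append((child, child_score))
    return max(population, key=lambda pair: pair[1])

# Toy usage with stand-in functions:
best, score = evolve(
    seed_harness="v0 prompts",
    evaluate=lambda h: random.random(),    # stand-in for a validation run
    mutate_harness=lambda h: h + "+edit",  # stand-in for an agent code edit
)
```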
Starting from an initial harness with a 12.8% validation pass rate, the meta-optimizer improved steadily. Four key harness changes emerged:
- Use of paper-specific information: questions should test paper-specific knowledge, not general ML/CS knowledge. A self-test was introduced: "If the solver can answer correctly without reading this particular paper, the question is too easy."
- Prevention of solution leakage: strict rules requiring the provided context to describe only the problem background and setup, never the paper's proposed solution.
- Positive-only rubrics with weight capping: the optimizer removed negatively weighted rubric terms entirely, finding that they systematically confused judges and penalized strong models without improving discrimination. All terms now use positive integer weights capped at 7.
- Structured rubric format: a strict JSON format for rubrics with explicit weights, eliminating the parsing errors that caused evaluation failures in earlier iterations (a hypothetical example follows below).
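As a concrete illustration, a rubric in such a format might look like the JSON below; the schema and field names are my own rendering of "structured JSON with positive integer weights capped at 7," not the paper's exact specification.

```python
import json

# Hypothetical structured rubric. Schema and field names are illustrative
# assumptions; only the "positive integer weights capped at 7" constraint
# comes from the article.
rubric_json = """
{
  "criteria": [
    {"criterion": "Identifies the paper-specific mechanism being tested",
     "weight": 7},
    {"criterion": "States the paper's key quantitative result",
     "weight": 5},
    {"criterion": "Avoids restating the paper's proposed solution",
     "weight": 3}
  ]
}
"""

rubric = json.loads(rubric_json)
assert all(isinstance(c["weight"], int) and 1 <= c["weight"] <= 7
           for c in rubric["criteria"])  # enforce the weight cap
```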
The progression from a 12.8% to a 42.4% validation pass rate shows that meta-optimizing the data science agent's harness can substantially improve data quality without manual harness engineering.
Check out the paper for full technical details.