
Releases: aiming-lab/AutoResearchClaw

v0.4.0: Human-in-the-Loop Co-Pilot System

01 Apr 18:37


AutoResearchClaw is no longer purely autonomous. The new HITL Co-Pilot system transforms the pipeline into a human-AI collaborative research engine.

Highlights

  • 7 Intervention Modes: full-auto, gate-only, checkpoint, step-by-step, co-pilot, custom, express
  • Idea Workshop: Brainstorm and refine hypotheses collaboratively (Stages 7-8)
  • Baseline Navigator: Review and customize experiment designs (Stage 9)
  • Paper Co-Writer: Section-by-section collaborative drafting (Stages 16-19)
  • SmartPause: Confidence-driven dynamic intervention
  • ALHF Intervention Learning: Learns from your review patterns
  • Claim Verification: Inline fact-checking against collected literature
  • Cost Guardrails: Budget monitoring with threshold alerts
  • Pipeline Branching: Fork to explore multiple research directions
  • CLI Commands: attach, status, approve, reject, guide
  • 3 Adapters: CLI, WebSocket, MCP
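
The SmartPause idea above can be sketched as a simple confidence gate. The function name, threshold, and mode handling here are illustrative assumptions, not the actual SmartPause API:

```python
# Illustrative sketch of confidence-driven pausing (hypothetical names,
# not the real SmartPause interface).

def should_pause(stage_confidence: float, mode: str, threshold: float = 0.7) -> bool:
    """Pause for human review when the agent's self-reported confidence
    for a stage drops below the threshold, unless the mode forces a policy."""
    if mode == "full-auto":
        return False          # never interrupt a fully autonomous run
    if mode == "step-by-step":
        return True           # always hand control back to the human
    return stage_confidence < threshold

# A low-confidence stage triggers a pause in co-pilot mode:
print(should_pause(0.55, "co-pilot"))   # True
print(should_pause(0.92, "co-pilot"))   # False
```

The point of the gate is that intervention frequency adapts to uncertainty rather than being fixed per run.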

New Files

  • researchclaw/hitl/ — 34 modules (7,500+ lines)
  • tests/test_hitl_*.py — 9 test files (242 tests)
  • docs/HITL_GUIDE.md — 620-line guide
  • 3 new builtin skills

Testing

  • 2,753 tests passed, 0 failures

Full Changelog: v0.3.2...v0.4.0

v0.3.2: Cross-Platform Support + Major Stability

23 Mar 03:27


What's New

Cross-Platform Support

  • ACP-compatible agent backends: Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI
  • OpenClaw bridge: messaging platform integration (Discord, Telegram, Lark, WeChat)
  • CLI-agent code generation backend: delegates Stages 10 & 13 to external CLI agents with budget control and timeout management

Anti-Fabrication System

  • VerifiedRegistry: ground-truth whitelist from experiment results with tolerance matching
  • Experiment diagnosis & repair loop: 13 deficiency categories, auto-repair with best-result selection
  • Always-on sanitization: unverified numbers replaced in paper tables
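
The VerifiedRegistry mechanism can be sketched as a whitelist of numbers that actually came out of experiment runs, with tolerance matching for rounding. The class shape and tolerance value below are assumptions for illustration:

```python
# Sketch of a ground-truth whitelist with tolerance matching
# (hypothetical fields; the real VerifiedRegistry may differ).
import math

class VerifiedRegistry:
    def __init__(self, rel_tol: float = 1e-3):
        self.rel_tol = rel_tol
        self._values: list[float] = []

    def register(self, value: float) -> None:
        """Record a number produced by an actual experiment run."""
        self._values.append(value)

    def is_verified(self, claimed: float) -> bool:
        """A number in the paper passes only if it matches a recorded
        result within the relative tolerance (allows rounding)."""
        return any(math.isclose(claimed, v, rel_tol=self.rel_tol)
                   for v in self._values)

reg = VerifiedRegistry()
reg.register(0.9312)
print(reg.is_verified(0.931))   # True  (rounded, within tolerance)
print(reg.is_verified(0.95))    # False (no matching experiment result)
```

Unverified numbers would then be the ones sanitization replaces in paper tables.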

Stability & Quality

  • 100+ bug fixes across 8 deep audit rounds
  • Modular executor refactoring (10K-line module → 400-line facade)
  • --resume auto-detection for interrupted runs
  • LLM retry hardening with exponential backoff
  • Community-reported fixes (macOS M3, math/theoretical topics)
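
The retry-hardening pattern is a standard retry loop with exponentially growing delays; this minimal sketch shows the shape (the project's actual retry policy and exception handling may differ):

```python
# Minimal sketch of retry-with-exponential-backoff for flaky LLM calls
# (illustrative; real code would catch specific transient errors).
import time

def call_with_retry(fn, max_attempts: int = 4, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                           # out of retries
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Example: a call that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(call_with_retry(flaky, base_delay=0.01))  # ok
```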

New Subsystems

  • Assessor (paper quality scoring + venue recommendation)
  • Calendar (conference deadline tracking)
  • Collaboration (multi-user research coordination)
  • Copilot (interactive steering modes)
  • Dashboard (real-time metrics broadcasting)
  • Knowledge Graph (entity extraction + visualization)
  • Memory (cross-run experiment/ideation/writing memory)
  • MCP (Model Context Protocol server)
  • Overleaf (live sync with conflict resolution)
  • Project Manager (multi-project scheduling)
  • Remote Servers (SSH/SLURM/cloud execution)
  • Skills Library (12 built-in domain/tooling skills)
  • Trends (daily arXiv digest + opportunity finder)
  • Voice (speech-to-text commands)
  • Wizard (guided project setup)

Testing

  • 1,935 tests passing

Full Changelog: v0.3.1...v0.3.2

v0.3.1 — OpenCode Beast Mode + Community Contributions

18 Mar 15:41


What's New

OpenCode Beast Mode

New "Beast Mode" routes complex code generation to OpenCode with automatic 6-signal complexity scoring and graceful fallback to CodeAgent.
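
The routing decision can be sketched as a weighted sum of boolean signals compared against a threshold. The six signal names, weights, and threshold below are assumptions for illustration, not the real heuristics:

```python
# Illustrative 6-signal complexity score for routing code generation.
# Signal names and weights are hypothetical.

SIGNALS = {
    "num_files": 2.0,             # multi-file experiments are harder
    "uses_gpu": 1.5,
    "custom_model": 2.5,
    "external_data": 1.0,
    "multi_stage_training": 2.0,
    "novel_metric": 1.0,
}

def route(task_signals: dict, threshold: float = 4.0) -> str:
    """Send complex tasks to OpenCode; everything else (including the
    fallback path) stays on CodeAgent."""
    score = sum(w for name, w in SIGNALS.items() if task_signals.get(name))
    return "opencode" if score >= threshold else "codeagent"

print(route({"num_files": True, "custom_model": True}))  # opencode (score 4.5)
print(route({"uses_gpu": True}))                         # codeagent (score 1.5)
```

Graceful fallback then means: if the OpenCode path fails, the task is retried on CodeAgent anyway.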

Universal Cross-Domain Support

  • Domain detector for 7 research domains (ML, physics, chemistry, economics, math, biology, security)
  • 25+ domain-specific experiment profiles with tailored datasets, metrics, and evaluation protocols
  • Domain-aware prompt adapters and Docker images

Code Searcher Agent

GitHub-integrated code search for experiment design reference — query generation, pattern extraction, and result caching.

Web Integration Layer

Web search, crawling, PDF extraction, and Google Scholar support for enhanced literature discovery.

Community Contributions

  • Novita AI provider — added as built-in LLM provider preset (#80)
  • Thread-safety hardening — all module-level globals protected with locks (#77)
  • Figure agent config — fixed 6 missing config fields (#75)
  • Robust LLM output parsing — 4-strategy JSON extractor, ACP-aware YAML extraction (#69)
  • Test collection fix — safe skipif guard for Anthropic tests (#53)

Bug Fixes

  • 20+ edge-case fixes from 3-round deep audit
  • Fixed sleep-inside-lock contention in Semantic Scholar batch API
  • Fixed false-positive regex on Markdown links in thinking-tag stripper
  • Caught RecursionError in JSON parser for deeply nested payloads
  • Convergence evaluator with early stopping recommendations
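
The RecursionError guard mentioned above can be sketched as follows; CPython's json parser raises RecursionError on pathologically nested input, and the fix is simply to catch it alongside normal decode errors (the project's real parser layers several extraction strategies on top of this):

```python
# Sketch of catching RecursionError on deeply nested JSON payloads
# (the actual 4-strategy extractor is more elaborate; this shows the guard).
import json

def safe_json_loads(text: str):
    try:
        return json.loads(text)
    except (json.JSONDecodeError, RecursionError):
        return None  # caller falls back to other extraction strategies

print(safe_json_loads('{"a": 1}'))                     # {'a': 1}
print(safe_json_loads("[" * 200_000 + "]" * 200_000))  # None
```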

Full Changelog: v0.3.0...v0.3.1

v0.3.0 — MetaClaw Integration, CodeAgent v2, 50+ Bug Fixes

17 Mar 13:50


What's New in v0.3.0

MetaClaw Cross-Run Learning Integration

  • New researchclaw/metaclaw_bridge/ module: skill injection, lesson-to-skill conversion, PRM quality gates, session lifecycle management
  • Pipeline failures → structured lessons → reusable skills, injected into all 23 stages
  • +18.3% pipeline robustness in controlled experiments
  • Opt-in via metaclaw_bridge.enabled: true, fully backward-compatible

CodeAgent v2 — Enhanced Code Generation

  • Enhanced Blueprint: deep implementation specs with per-file pseudocode, tensor shapes, generation order
  • Sequential File Generation: dependency-ordered with AST-based CodeMem
  • Hard Validation Gates: block identical ablations, hardcoded metrics, cross-file import errors
  • Targeted Error Repair: parse traceback to fix surgically instead of full regeneration
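
Targeted repair starts from locating the failing file and line in the traceback; a minimal sketch (the regex and return shape are illustrative, not the project's actual parser):

```python
# Sketch of extracting the failing file/line from a Python traceback,
# so repair can target one file instead of regenerating everything.
import re

def locate_failure(traceback_text: str):
    """Return (filename, lineno, error_message) from the last frame."""
    frames = re.findall(r'File "([^"]+)", line (\d+)', traceback_text)
    if not frames:
        return None
    error = traceback_text.strip().splitlines()[-1]
    filename, lineno = frames[-1]
    return filename, int(lineno), error

tb = '''Traceback (most recent call last):
  File "main.py", line 42, in train
    loss = model(batch)
  File "model.py", line 17, in forward
    return self.head(x)
AttributeError: 'Model' object has no attribute 'head'
'''
print(locate_failure(tb))
# ('model.py', 17, "AttributeError: 'Model' object has no attribute 'head'")
```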

BenchmarkAgent & FigureAgent Improvements

  • BenchmarkAgent: domain-aware benchmarks, import validation, pretrained resize
  • FigureAgent: LLM output type safety, Paul Tol colorblind-safe palette, heatmap/ablation chart types
  • visualize.py full rewrite: academic styling, 300 DPI, 6 enhanced chart types

50+ Pipeline Bug Fixes (BUG-06 through BUG-51)

  • Metric direction, citation verify, CodeGen guard, condition drift, RL stability
  • BST ordering, raw metrics, bracket citations, arXiv categories
  • Docker HF permission, KD stability, ablation detection
  • references.bib fallback generation, ICML LaTeX template fix

Paper Quality Hardening (4 Rounds)

  • Post-compilation quality checks, weasel/duplicate word lint, NeurIPS checklist
  • LaTeX escaping, 7-dim AI-Scientist-style review, AI-slop detection (50+ phrase blocklist)
  • Related work depth checker, stats rigor validator, anti-boilerplate prompts
  • Cross-discipline support for 7 domains

Docker Sandbox & Infrastructure

  • Network-policy-aware code generation & sandbox execution
  • Rate-limit defense for literature search APIs (OpenAlex → Semantic Scholar → arXiv cascade)
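
The cascade behavior can be sketched as trying each backend in order and falling through on rate limits. The provider stubs and exception type here are stand-ins, not the project's actual clients:

```python
# Sketch of the search-provider cascade: try each backend in order,
# falling through when one is rate-limited (stubbed providers).

class RateLimited(Exception):
    pass

def cascade_search(query: str, providers) -> list:
    """providers: ordered callables, e.g. OpenAlex first,
    then Semantic Scholar, then arXiv."""
    for provider in providers:
        try:
            return provider(query)
        except RateLimited:
            continue  # fall through to the next backend
    return []

# Example with stubbed backends:
def openalex(q):        raise RateLimited()
def semanticscholar(q): return [f"result for {q}"]
def arxiv(q):           return []

print(cascade_search("graph transformers", [openalex, semanticscholar, arxiv]))
# ['result for graph transformers']
```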

Full Changelog: v0.2.0...v0.3.0

v0.2.0 — Multi-Agent Pipeline, Docker Sandbox & Quality Hardening

16 Mar 16:27


Highlights

This release introduces three multi-agent subsystems, a hardened Docker sandbox, and 4 rounds of paper quality auditing — significantly improving the end-to-end quality of generated research papers.

New Multi-Agent Subsystems

CodeAgent (4-phase architecture)

  • LLM generates multi-file experiment code (main.py + setup.py + requirements.txt)
  • Static analysis & deep validation (AST-based class/method checks)
  • LLM-guided code review with structured JSON feedback
  • Iterative repair loop (up to 3 rounds) with automatic UnboundLocalError fix
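
The AST-based validation step can be sketched with the standard library's ast module: parse the generated source and check that the expected classes and methods exist (the expected-names dict is an illustrative input, not the project's actual schema):

```python
# Sketch of AST-based validation that generated code defines the
# expected classes and methods (expected names are illustrative).
import ast

def missing_definitions(source: str, expected: dict) -> list:
    """expected maps class name -> set of required method names."""
    tree = ast.parse(source)
    found = {
        node.name: {m.name for m in node.body
                    if isinstance(m, ast.FunctionDef)}
        for node in ast.walk(tree) if isinstance(node, ast.ClassDef)
    }
    problems = []
    for cls, methods in expected.items():
        if cls not in found:
            problems.append(f"missing class {cls}")
        else:
            problems.extend(f"missing {cls}.{m}"
                            for m in methods - found[cls])
    return problems

code = "class Trainer:\n    def train(self): pass\n"
print(missing_definitions(code, {"Trainer": {"train", "evaluate"}}))
# ['missing Trainer.evaluate']
```

A non-empty result would feed the review/repair loop rather than failing the run outright.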

BenchmarkAgent (4 sub-agents: Surveyor → Selector → Acquirer → Validator)

  • Domain-aware dataset and baseline selection from 13-domain knowledge base
  • Automatic benchmark acquisition with Docker compatibility validation
  • Integrated at Stage 9 (experiment_design), output injected into Stage 10

FigureAgent (5 sub-agents: Planner → CodeGen → Renderer → Critic → Integrator)

  • Academic-quality chart generation with SciencePlots, 300 DPI, colorblind-safe palette
  • 6 built-in chart templates + LLM fallback for custom visualizations
  • Tri-modal critic review (data accuracy, aesthetics, academic convention)

Docker Sandbox Enhancements

  • Network-policy-aware code generation: none | setup_only | pip_only | full
  • Dynamic dependency installation via requirements.txt
  • Pre-cached datasets: CIFAR-10/100, MNIST, FashionMNIST, STL-10, SVHN
  • Extended ML stack: torch, torchvision, timm, einops, transformers, etc.
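
The four network policies can be read as a phase table: what is allowed during environment setup versus during the experiment run. The semantics below are inferred from the policy names and may not match the sandbox exactly:

```python
# Sketch of what each network policy permits per phase
# (semantics inferred from the policy names; the real sandbox may differ).

POLICIES = {
    #             setup phase  experiment run
    "none":       (False,      False),
    "setup_only": (True,       False),
    "pip_only":   (True,       False),   # pip install only, then offline
    "full":       (True,       True),
}

def may_use_network(policy: str, phase: str) -> bool:
    setup_ok, run_ok = POLICIES[policy]
    return setup_ok if phase == "setup" else run_ok

print(may_use_network("setup_only", "setup"))  # True
print(may_use_network("setup_only", "run"))    # False
```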

Paper Quality Hardening (4-round audit)

  • Post-compilation quality checks, weasel/duplicate word lint
  • 7-dimension AI-Scientist-style review scoring
  • AI-slop detection (50+ phrases), statistical rigor validator
  • Cross-discipline support for 7 research domains (ML/physics/chem/econ/math/eng/bio)
  • NeurIPS checklist integration

Bug Fixes (15+)

  • Fix baselines dict-to-list crash in BenchmarkAgent
  • Fix Gymnasium environment versions (v4 → v5)
  • Fix experiment condition drift in iterative refinement (anchor to exp_plan.yaml)
  • Fix compute budget constraint for experiment design
  • Fix metric direction mismatch, citation verification batching
  • Fix LaTeX output sanitization, figure plan format handling
  • Add RL stability guidance (gradient clipping, NaN guard)
  • And more — see full commit message for details

Compatibility

All changes are backward-compatible with v0.1.0 configuration files.

Full Changelog: v0.1.0...v0.2.0

v0.1.0 — Initial Release

15 Mar 04:35


AutoResearchClaw v0.1.0

Fully autonomous research pipeline: one message in, full conference paper out. 🦞

Highlights

  • 23-stage pipeline: Research Scoping → Literature Discovery → Knowledge Synthesis → Hypothesis Generation → Experiment Design → Self-Healing Execution → Analysis & Decision → Paper Writing → Citation Verification
  • Multi-agent debate: 3 agents (Innovator, Pragmatist, Contrarian) argue over hypotheses; adversarial analysis panel reviews results
  • Self-healing executor: autonomous crash diagnosis, code repair, and Pivot/Refine decisions
  • Cross-run evolution: time-decayed lesson store that improves future runs
  • Citation verification: 4-layer pipeline (arXiv, DOI, Semantic Scholar, LLM relevance check)
  • OpenClaw integration: trigger full runs from a chat message

Results (6 end-to-end runs)

  • 100% pipeline completion (124/124 steps)
  • 94.3% citation integrity
  • Mean quality 6.2/10 on conference review scale

Requirements

  • Python 3.9+
  • OpenAI-compatible LLM API