Releases: aiming-lab/AutoResearchClaw
v0.4.0: Human-in-the-Loop Co-Pilot System
AutoResearchClaw is no longer purely autonomous. The new HITL Co-Pilot system transforms the pipeline into a human-AI collaborative research engine.
Highlights
- 7 Intervention Modes: `full-auto`, `gate-only`, `checkpoint`, `step-by-step`, `co-pilot`, `custom`, `express`
- Idea Workshop: Brainstorm and refine hypotheses collaboratively (Stages 7-8)
- Baseline Navigator: Review and customize experiment designs (Stage 9)
- Paper Co-Writer: Section-by-section collaborative drafting (Stages 16-19)
- SmartPause: Confidence-driven dynamic intervention
- ALHF Intervention Learning: Learns from your review patterns
- Claim Verification: Inline fact-checking against collected literature
- Cost Guardrails: Budget monitoring with threshold alerts
- Pipeline Branching: Fork to explore multiple research directions
- CLI Commands: `attach`, `status`, `approve`, `reject`, `guide`
- 3 Adapters: CLI, WebSocket, MCP
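SmartPause's confidence-driven intervention boils down to a runtime decision: hand control back to the human when the agent's self-assessed confidence is low, or when the stage is configured to always need sign-off. A minimal sketch of the idea; the function name, threshold, and stage names here are illustrative assumptions, not the shipped API:

```python
def should_pause(confidence: float, stage: str,
                 threshold: float = 0.6,
                 always_pause_stages: frozenset = frozenset({"hypothesis",
                                                             "paper_draft"})) -> bool:
    """Pause for human review when self-reported confidence drops
    below the threshold, or when the stage always requires sign-off."""
    if stage in always_pause_stages:
        return True
    return confidence < threshold

print(should_pause(0.42, "experiment_design"))  # True: low confidence
print(should_pause(0.91, "experiment_design"))  # False: proceed autonomously
print(should_pause(0.99, "hypothesis"))         # True: stage always gated
```

In `gate-only` or `checkpoint` modes the gating set would dominate; in `co-pilot` mode the confidence term does most of the work.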
New Files
- `researchclaw/hitl/`: 34 modules (7,500+ lines)
- `tests/test_hitl_*.py`: 9 test files (242 tests)
- `docs/HITL_GUIDE.md`: 620-line guide
- 3 new built-in skills
Testing
- 2,753 tests passed, 0 failures
Full Changelog: v0.3.2...v0.4.0
v0.3.2: Cross-Platform Support + Major Stability
What's New
Cross-Platform Support
- ACP-compatible agent backends: Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI
- OpenClaw bridge: messaging platform integration (Discord, Telegram, Lark, WeChat)
- CLI-agent code generation backend: delegates Stages 10 & 13 to external CLI agents with budget control and timeout management
Anti-Fabrication System
- VerifiedRegistry: ground-truth whitelist from experiment results with tolerance matching
- Experiment diagnosis & repair loop: 13 deficiency categories, auto-repair with best-result selection
- Always-on sanitization: unverified numbers replaced in paper tables
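The tolerance matching behind VerifiedRegistry can be pictured as a whitelist lookup with a relative-error budget: a number is allowed into a paper table only if it matches some actually measured result. A rough sketch; the function name, field names, and 1% tolerance are illustrative assumptions:

```python
def is_verified(value: float, ground_truth: list[float],
                rel_tol: float = 0.01) -> bool:
    """Accept a number destined for a paper table only if it matches
    some experimentally measured value within a relative tolerance."""
    return any(abs(value - gt) <= rel_tol * max(abs(gt), 1e-12)
               for gt in ground_truth)

measured = [92.4, 88.1, 75.0]        # results actually produced by experiments
print(is_verified(92.4, measured))   # True: exact match
print(is_verified(92.5, measured))   # True: within 1% of 92.4
print(is_verified(97.3, measured))   # False: fabricated-looking number
```

Numbers that fail the check are what the always-on sanitization pass replaces before the table reaches the paper.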
Stability & Quality
- 100+ bug fixes across 8 deep audit rounds
- Modular executor refactoring (10K → 400-line facade)
- `--resume` auto-detection for interrupted runs
- LLM retry hardening with exponential backoff
- Community-reported fixes (macOS M3, math/theoretical topics)
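The retry hardening follows the standard exponential-backoff pattern: retry transient LLM API failures with a delay that doubles on each attempt. A self-contained sketch; the delay schedule, retry count, and broad `Exception` catch are assumptions, not the shipped defaults:

```python
import time

def call_with_backoff(fn, max_retries: int = 4, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Retry fn() on failure, doubling the delay each attempt
    (1s, 2s, 4s, ...). Re-raise after the final attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(call_with_backoff(flaky, sleep=lambda s: None))  # "ok" after two retries
```

Injecting `sleep` makes the policy testable without real delays.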
New Subsystems
- Assessor (paper quality scoring + venue recommendation)
- Calendar (conference deadline tracking)
- Collaboration (multi-user research coordination)
- Copilot (interactive steering modes)
- Dashboard (real-time metrics broadcasting)
- Knowledge Graph (entity extraction + visualization)
- Memory (cross-run experiment/ideation/writing memory)
- MCP (Model Context Protocol server)
- Overleaf (live sync with conflict resolution)
- Project Manager (multi-project scheduling)
- Remote Servers (SSH/SLURM/cloud execution)
- Skills Library (12 built-in domain/tooling skills)
- Trends (daily arXiv digest + opportunity finder)
- Voice (speech-to-text commands)
- Wizard (guided project setup)
Testing
- 1,935 tests passing
Full Changelog: v0.3.1...v0.3.2
v0.3.1: OpenCode Beast Mode + Community Contributions
What's New
OpenCode Beast Mode
New "Beast Mode" routes complex code generation to OpenCode with automatic 6-signal complexity scoring and graceful fallback to CodeAgent.
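The 6-signal complexity score can be thought of as a checklist over the code-generation request: when enough signals fire, the task is routed to OpenCode, otherwise it stays with CodeAgent. The signal names and cutoff below are invented for illustration; only the "six signals plus fallback" shape comes from the release:

```python
def route_backend(signals: dict[str, bool], cutoff: int = 3) -> str:
    """Score a request on six boolean complexity signals and route
    to 'opencode' when enough of them fire, else fall back to
    'codeagent'."""
    keys = ["multi_file", "custom_cuda", "distributed", "large_dataset",
            "novel_architecture", "long_horizon"]
    score = sum(signals.get(k, False) for k in keys)
    return "opencode" if score >= cutoff else "codeagent"

print(route_backend({"multi_file": True, "distributed": True,
                     "novel_architecture": True, "long_horizon": True}))
# -> opencode
print(route_backend({"multi_file": True}))
# -> codeagent
```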
Universal Cross-Domain Support
- Domain detector for 7 research domains (ML, physics, chemistry, economics, math, biology, security)
- 25+ domain-specific experiment profiles with tailored datasets, metrics, and evaluation protocols
- Domain-aware prompt adapters and Docker images
Code Searcher Agent
GitHub-integrated code search for experiment design reference: query generation, pattern extraction, and result caching.
Web Integration Layer
Web search, crawling, PDF extraction, and Google Scholar support for enhanced literature discovery.
Community Contributions
- Novita AI provider: added as a built-in LLM provider preset (#80)
- Thread-safety hardening: all module-level globals protected with locks (#77)
- Figure agent config: fixed 6 missing config fields (#75)
- Robust LLM output parsing: 4-strategy JSON extractor, ACP-aware YAML extraction (#69)
- Test collection fix: safe `skipif` guard for Anthropic tests (#53)
Bug Fixes
- 20+ edge-case fixes from 3-round deep audit
- Fixed sleep-inside-lock contention in Semantic Scholar batch API
- Fixed false-positive regex on Markdown links in thinking-tag stripper
- Caught `RecursionError` in JSON parser for deeply nested payloads
- Convergence evaluator with early stopping recommendations
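Guarding a JSON parser against `RecursionError` amounts to wrapping the parse and failing soft instead of crashing the pipeline. A minimal sketch using the stdlib parser; the real extractor layers several fallback strategies on top:

```python
import json

def safe_parse(text: str):
    """Parse JSON, returning None instead of crashing on
    pathologically deep payloads or plain syntax errors."""
    try:
        return json.loads(text)
    except (RecursionError, json.JSONDecodeError):
        return None

print(safe_parse('{"a": 1}'))    # {'a': 1}
print(safe_parse("[" * 100000))  # None: too deep (and truncated); caught either way
```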
Full Changelog: v0.3.0...v0.3.1
v0.3.0: MetaClaw Integration, CodeAgent v2, 50+ Bug Fixes
What's New in v0.3.0
MetaClaw Cross-Run Learning Integration
- New `researchclaw/metaclaw_bridge/` module: skill injection, lesson-to-skill conversion, PRM quality gates, session lifecycle management
- Pipeline failures → structured lessons → reusable skills, injected into all 23 stages
- +18.3% pipeline robustness in controlled experiments
- Opt-in via `metaclaw_bridge.enabled: true`; fully backward-compatible
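In the run configuration, that opt-in might look like the fragment below. Only the `metaclaw_bridge.enabled: true` key comes from the release notes; whether it is written flat or nested is an assumption:

```yaml
# Enable MetaClaw cross-run learning (off by default)
metaclaw_bridge:
  enabled: true
```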
CodeAgent v2: Enhanced Code Generation
- Enhanced Blueprint: deep implementation specs with per-file pseudocode, tensor shapes, generation order
- Sequential File Generation: dependency-ordered with AST-based CodeMem
- Hard Validation Gates: block identical ablations, hardcoded metrics, cross-file import errors
- Targeted Error Repair: parse traceback to fix surgically instead of full regeneration
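Targeted repair starts from the traceback: locate the last stack frame inside the generated code and patch only that file, rather than regenerating everything. A toy sketch of the parsing step; the regex and last-frame policy are simplifications of whatever the real repair loop does:

```python
import re

def locate_failure(traceback_text: str):
    """Return (file, line) of the last stack frame in a Python
    traceback -- the natural target for a surgical fix."""
    frames = re.findall(r'File "([^"]+)", line (\d+)', traceback_text)
    if not frames:
        return None
    path, line = frames[-1]
    return path, int(line)

tb = '''Traceback (most recent call last):
  File "main.py", line 42, in <module>
    run()
  File "model.py", line 17, in run
    x = undefined_name
NameError: name 'undefined_name' is not defined'''

print(locate_failure(tb))  # ('model.py', 17)
```

The located file and line, plus the exception message, is then enough context to prompt for a local fix instead of a full regeneration.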
BenchmarkAgent & FigureAgent Improvements
- BenchmarkAgent: domain-aware benchmarks, import validation, pretrained resize
- FigureAgent: LLM output type safety, Paul Tol colorblind-safe palette, heatmap/ablation chart types
- `visualize.py` full rewrite: academic styling, 300 DPI, 6 enhanced chart types
50+ Pipeline Bug Fixes (BUG-06 through BUG-51)
- Metric direction, citation verify, CodeGen guard, condition drift, RL stability
- BST ordering, raw metrics, bracket citations, arXiv categories
- Docker HF permission, KD stability, ablation detection
- references.bib fallback generation, ICML LaTeX template fix
Paper Quality Hardening (4 Rounds)
- Post-compilation quality checks, weasel/duplicate word lint, NeurIPS checklist
- LaTeX escaping, 7-dim AI-Scientist-style review, AI-slop detection (50+ phrase blocklist)
- Related work depth checker, stats rigor validator, anti-boilerplate prompts
- Cross-discipline support for 7 domains
Docker Sandbox & Infrastructure
- Network-policy-aware code generation & sandbox execution
- Rate-limit defense for literature search APIs (OpenAlex → Semantic Scholar → arXiv cascade)
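The cascade tries each literature API in priority order and falls through on a 429-style failure. A hedged sketch of the control flow; the provider callables are stand-ins, not the real clients:

```python
def search_with_fallback(query: str, providers: list) -> list:
    """Try providers in priority order; on any failure, fall through
    to the next one. Raise only if every provider fails."""
    last_err = None
    for provider in providers:
        try:
            return provider(query)
        except Exception as err:  # e.g. HTTP 429 rate limit
            last_err = err
    raise RuntimeError("all literature providers failed") from last_err

def openalex(q):         raise RuntimeError("429 Too Many Requests")
def semantic_scholar(q): return [{"title": "A relevant paper"}]
def arxiv(q):            return []

print(search_with_fallback("graph transformers",
                           [openalex, semantic_scholar, arxiv]))
# -> [{'title': 'A relevant paper'}]
```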
Full Changelog: v0.2.0...v0.3.0
v0.2.0: Multi-Agent Pipeline, Docker Sandbox & Quality Hardening
Highlights
This release introduces three multi-agent subsystems, a hardened Docker sandbox, and 4 rounds of paper quality auditing, significantly improving the end-to-end quality of generated research papers.
New Multi-Agent Subsystems
CodeAgent (4-phase architecture)
- LLM generates multi-file experiment code (main.py + setup.py + requirements.txt)
- Static analysis & deep validation (AST-based class/method checks)
- LLM-guided code review with structured JSON feedback
- Iterative repair loop (up to 3 rounds) with automatic UnboundLocalError fix
BenchmarkAgent (4 sub-agents: Surveyor → Selector → Acquirer → Validator)
- Domain-aware dataset and baseline selection from 13-domain knowledge base
- Automatic benchmark acquisition with Docker compatibility validation
- Integrated at Stage 9 (experiment_design), output injected into Stage 10
FigureAgent (5 sub-agents: Planner → CodeGen → Renderer → Critic → Integrator)
- Academic-quality chart generation with SciencePlots, 300 DPI, colorblind-safe palette
- 6 built-in chart templates + LLM fallback for custom visualizations
- Tri-modal critic review (data accuracy, aesthetics, academic convention)
Docker Sandbox Enhancements
- Network-policy-aware code generation: `none | setup_only | pip_only | full`
- Dynamic dependency installation via requirements.txt
- Pre-cached datasets: CIFAR-10/100, MNIST, FashionMNIST, STL-10, SVHN
- Extended ML stack: torch, torchvision, timm, einops, transformers, etc.
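The four network policies above would be selected in the sandbox configuration; a hypothetical fragment, with key names assumed rather than taken from the docs:

```yaml
# Sandbox network policy: none | setup_only | pip_only | full
sandbox:
  network_policy: pip_only   # allow pip installs, block all other egress
```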
Paper Quality Hardening (4-round audit)
- Post-compilation quality checks, weasel/duplicate word lint
- 7-dimension AI-Scientist-style review scoring
- AI-slop detection (50+ phrases), statistical rigor validator
- Cross-discipline support for 7 research domains (ML/physics/chem/econ/math/eng/bio)
- NeurIPS checklist integration
Bug Fixes (15+)
- Fix baselines dict-to-list crash in BenchmarkAgent
- Fix Gymnasium environment versions (v4 → v5)
- Fix experiment condition drift in iterative refinement (anchor to exp_plan.yaml)
- Fix compute budget constraint for experiment design
- Fix metric direction mismatch, citation verification batching
- Fix LaTeX output sanitization, figure plan format handling
- Add RL stability guidance (gradient clipping, NaN guard)
- And more; see the full commit message for details
Compatibility
All changes are backward-compatible with v0.1.0 configuration files.
Full Changelog: v0.1.0...v0.2.0
v0.1.0: Initial Release
AutoResearchClaw v0.1.0
Fully autonomous research pipeline: one message in, full conference paper out.
Highlights
- 23-stage pipeline: Research Scoping → Literature Discovery → Knowledge Synthesis → Hypothesis Generation → Experiment Design → Self-Healing Execution → Analysis & Decision → Paper Writing → Citation Verification
- Multi-agent debate: 3 agents (Innovator, Pragmatist, Contrarian) argue over hypotheses; adversarial analysis panel reviews results
- Self-healing executor: autonomous crash diagnosis, code repair, and Pivot/Refine decisions
- Cross-run evolution: time-decayed lesson store that improves future runs
- Citation verification: 4-layer pipeline (arXiv, DOI, Semantic Scholar, LLM relevance check)
- OpenClaw integration: trigger full runs from a chat message
Results (6 end-to-end runs)
- 100% pipeline completion (124/124 steps)
- 94.3% citation integrity
- Mean quality 6.2/10 on conference review scale
Requirements
- Python 3.9+
- OpenAI-compatible LLM API