
Releases: aiming-lab/AutoResearchClaw

v0.4.0: Human-in-the-Loop Co-Pilot System

01 Apr 18:37


AutoResearchClaw is no longer purely autonomous. The new HITL Co-Pilot system transforms the pipeline into a human-AI collaborative research engine.

Highlights

  • 7 Intervention Modes: full-auto, gate-only, checkpoint, step-by-step, co-pilot, custom, express
  • Idea Workshop: Brainstorm and refine hypotheses collaboratively (Stages 7-8)
  • Baseline Navigator: Review and customize experiment designs (Stage 9)
  • Paper Co-Writer: Section-by-section collaborative drafting (Stages 16-19)
  • SmartPause: Confidence-driven dynamic intervention
  • ALHF Intervention Learning: Learns from your review patterns
  • Claim Verification: Inline fact-checking against collected literature
  • Cost Guardrails: Budget monitoring with threshold alerts
  • Pipeline Branching: Fork to explore multiple research directions
  • CLI Commands: attach, status, approve, reject, guide
  • 3 Adapters: CLI, WebSocket, MCP
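
The SmartPause idea above can be sketched as a simple confidence gate. The function name, threshold, and mode handling here are illustrative assumptions, not the actual SmartPause API:

```python
# Illustrative sketch of confidence-driven pausing (hypothetical names,
# not the real SmartPause interface).

def should_pause(stage_confidence: float, mode: str, threshold: float = 0.7) -> bool:
    """Pause for human review when the agent's self-reported confidence
    for a stage drops below the threshold, unless the mode forces a policy."""
    if mode == "full-auto":
        return False          # never interrupt a fully autonomous run
    if mode == "step-by-step":
        return True           # always hand control back to the human
    return stage_confidence < threshold

# A low-confidence stage triggers a pause in co-pilot mode:
print(should_pause(0.55, "co-pilot"))   # True
print(should_pause(0.92, "co-pilot"))   # False
```

The point of the gate is that intervention frequency adapts to uncertainty rather than being fixed per run.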

New Files

  • researchclaw/hitl/ — 34 modules (7,500+ lines)
  • tests/test_hitl_*.py — 9 test files (242 tests)
  • docs/HITL_GUIDE.md — 620-line guide
  • 3 new builtin skills

Testing

  • 2,753 tests passed, 0 failures

Full Changelog: v0.3.2...v0.4.0

v0.3.2: Cross-Platform Support + Major Stability

23 Mar 03:27


What's New

Cross-Platform Support

  • ACP-compatible agent backends: Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI
  • OpenClaw bridge: messaging platform integration (Discord, Telegram, Lark, WeChat)
  • CLI-agent code generation backend: delegates Stages 10 & 13 to external CLI agents with budget control and timeout management

Anti-Fabrication System

  • VerifiedRegistry: ground-truth whitelist from experiment results with tolerance matching
  • Experiment diagnosis & repair loop: 13 deficiency categories, auto-repair with best-result selection
  • Always-on sanitization: unverified numbers replaced in paper tables
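
The VerifiedRegistry mechanism can be sketched as a whitelist of numbers that actually came out of experiment runs, with tolerance matching for rounding. The class shape and tolerance value below are assumptions for illustration:

```python
# Sketch of a ground-truth whitelist with tolerance matching
# (hypothetical fields; the real VerifiedRegistry may differ).
import math

class VerifiedRegistry:
    def __init__(self, rel_tol: float = 1e-3):
        self.rel_tol = rel_tol
        self._values: list[float] = []

    def register(self, value: float) -> None:
        """Record a number produced by an actual experiment run."""
        self._values.append(value)

    def is_verified(self, claimed: float) -> bool:
        """A number in the paper passes only if it matches a recorded
        result within the relative tolerance (allows rounding)."""
        return any(math.isclose(claimed, v, rel_tol=self.rel_tol)
                   for v in self._values)

reg = VerifiedRegistry()
reg.register(0.9312)
print(reg.is_verified(0.931))   # True  (rounded, within tolerance)
print(reg.is_verified(0.95))    # False (no matching experiment result)
```

Unverified numbers would then be the ones sanitization replaces in paper tables.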

Stability & Quality

  • 100+ bug fixes across 8 deep audit rounds
  • Modular executor refactoring (10K-line module → 400-line facade)
  • --resume auto-detection for interrupted runs
  • LLM retry hardening with exponential backoff
  • Community-reported fixes (macOS M3, math/theoretical topics)
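
The retry-hardening pattern is a standard retry loop with exponentially growing delays; this minimal sketch shows the shape (the project's actual retry policy and exception handling may differ):

```python
# Minimal sketch of retry-with-exponential-backoff for flaky LLM calls
# (illustrative; real code would catch specific transient errors).
import time

def call_with_retry(fn, max_attempts: int = 4, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                           # out of retries
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Example: a call that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(call_with_retry(flaky, base_delay=0.01))  # ok
```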

New Subsystems

  • Assessor (paper quality scoring + venue recommendation)
  • Calendar (conference deadline tracking)
  • Collaboration (multi-user research coordination)
  • Copilot (interactive steering modes)
  • Dashboard (real-time metrics broadcasting)
  • Knowledge Graph (entity extraction + visualization)
  • Memory (cross-run experiment/ideation/writing memory)
  • MCP (Model Context Protocol server)
  • Overleaf (live sync with conflict resolution)
  • Project Manager (multi-project scheduling)
  • Remote Servers (SSH/SLURM/cloud execution)
  • Skills Library (12 built-in domain/tooling skills)
  • Trends (daily arXiv digest + opportunity finder)
  • Voice (speech-to-text commands)
  • Wizard (guided project setup)

Testing

  • 1,935 tests passing

Full Changelog: v0.3.1...v0.3.2

v0.3.1 — OpenCode Beast Mode + Community Contributions

18 Mar 15:41


What's New

OpenCode Beast Mode

New "Beast Mode" routes complex code generation to OpenCode with automatic 6-signal complexity scoring and graceful fallback to CodeAgent.
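
The routing decision can be sketched as a weighted sum of boolean signals compared against a threshold. The six signal names, weights, and threshold below are assumptions for illustration, not the real heuristics:

```python
# Illustrative 6-signal complexity score for routing code generation.
# Signal names and weights are hypothetical.

SIGNALS = {
    "num_files": 2.0,             # multi-file experiments are harder
    "uses_gpu": 1.5,
    "custom_model": 2.5,
    "external_data": 1.0,
    "multi_stage_training": 2.0,
    "novel_metric": 1.0,
}

def route(task_signals: dict, threshold: float = 4.0) -> str:
    """Send complex tasks to OpenCode; everything else (including the
    fallback path) stays on CodeAgent."""
    score = sum(w for name, w in SIGNALS.items() if task_signals.get(name))
    return "opencode" if score >= threshold else "codeagent"

print(route({"num_files": True, "custom_model": True}))  # opencode (score 4.5)
print(route({"uses_gpu": True}))                         # codeagent (score 1.5)
```

Graceful fallback then means: if the OpenCode path fails, the task is retried on CodeAgent anyway.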

Universal Cross-Domain Support

  • Domain detector for 7 research domains (ML, physics, chemistry, economics, math, biology, security)
  • 25+ domain-specific experiment profiles with tailored datasets, metrics, and evaluation protocols
  • Domain-aware prompt adapters and Docker images

Code Searcher Agent

GitHub-integrated code search for experiment design reference — query generation, pattern extraction, and result caching.

Web Integration Layer

Web search, crawling, PDF extraction, and Google Scholar support for enhanced literature discovery.

Community Contributions

  • Novita AI provider — added as built-in LLM provider preset (#80)
  • Thread-safety hardening — all module-level globals protected with locks (#77)
  • Figure agent config — fixed 6 missing config fields (#75)
  • Robust LLM output parsing — 4-strategy JSON extractor, ACP-aware YAML extraction (#69)
  • Test collection fix — safe skipif guard for Anthropic tests (#53)

Bug Fixes

  • 20+ edge-case fixes from 3-round deep audit
  • Fixed sleep-inside-lock contention in Semantic Scholar batch API
  • Fixed false-positive regex on Markdown links in thinking-tag stripper
  • Caught RecursionError in JSON parser for deeply nested payloads
  • Convergence evaluator with early stopping recommendations
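
The RecursionError guard mentioned above can be sketched as follows; CPython's json parser raises RecursionError on pathologically nested input, and the fix is simply to catch it alongside normal decode errors (the project's real parser layers several extraction strategies on top of this):

```python
# Sketch of catching RecursionError on deeply nested JSON payloads
# (the actual 4-strategy extractor is more elaborate; this shows the guard).
import json

def safe_json_loads(text: str):
    try:
        return json.loads(text)
    except (json.JSONDecodeError, RecursionError):
        return None  # caller falls back to other extraction strategies

print(safe_json_loads('{"a": 1}'))                     # {'a': 1}
print(safe_json_loads("[" * 200_000 + "]" * 200_000))  # None
```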

Full Changelog: v0.3.0...v0.3.1

v0.3.0 — MetaClaw Integration, CodeAgent v2, 50+ Bug Fixes

17 Mar 13:50


What's New in v0.3.0

MetaClaw Cross-Run Learning Integration

  • New researchclaw/metaclaw_bridge/ module: skill injection, lesson-to-skill conversion, PRM quality gates, session lifecycle management
  • Pipeline failures → structured lessons → reusable skills, injected into all 23 stages
  • +18.3% pipeline robustness in controlled experiments
  • Opt-in via metaclaw_bridge.enabled: true, fully backward-compatible

CodeAgent v2 — Enhanced Code Generation

  • Enhanced Blueprint: deep implementation specs with per-file pseudocode, tensor shapes, generation order
  • Sequential File Generation: dependency-ordered with AST-based CodeMem
  • Hard Validation Gates: block identical ablations, hardcoded metrics, cross-file import errors
  • Targeted Error Repair: parse traceback to fix surgically instead of full regeneration
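
Targeted repair starts from locating the failing file and line in the traceback; a minimal sketch (the regex and return shape are illustrative, not the project's actual parser):

```python
# Sketch of extracting the failing file/line from a Python traceback,
# so repair can target one file instead of regenerating everything.
import re

def locate_failure(traceback_text: str):
    """Return (filename, lineno, error_message) from the last frame."""
    frames = re.findall(r'File "([^"]+)", line (\d+)', traceback_text)
    if not frames:
        return None
    error = traceback_text.strip().splitlines()[-1]
    filename, lineno = frames[-1]
    return filename, int(lineno), error

tb = '''Traceback (most recent call last):
  File "main.py", line 42, in train
    loss = model(batch)
  File "model.py", line 17, in forward
    return self.head(x)
AttributeError: 'Model' object has no attribute 'head'
'''
print(locate_failure(tb))
# ('model.py', 17, "AttributeError: 'Model' object has no attribute 'head'")
```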

BenchmarkAgent & FigureAgent Improvements

  • BenchmarkAgent: domain-aware benchmarks, import validation, pretrained resize
  • FigureAgent: LLM output type safety, Paul Tol colorblind-safe palette, heatmap/ablation chart types
  • visualize.py full rewrite: academic styling, 300 DPI, 6 enhanced chart types

50+ Pipeline Bug Fixes (BUG-06 through BUG-51)

  • Metric direction, citation verify, CodeGen guard, condition drift, RL stability
  • BST ordering, raw metrics, bracket citations, arXiv categories
  • Docker HF permission, KD stability, ablation detection
  • references.bib fallback generation, ICML LaTeX template fix

Paper Quality Hardening (4 Rounds)

  • Post-compilation quality checks, weasel/duplicate word lint, NeurIPS checklist
  • LaTeX escaping, 7-dim AI-Scientist-style review, AI-slop detection (50+ phrase blocklist)
  • Related work depth checker, stats rigor validator, anti-boilerplate prompts
  • Cross-discipline support for 7 domains

Docker Sandbox & Infrastructure

  • Network-policy-aware code generation & sandbox execution
  • Rate-limit defense for literature search APIs (OpenAlex → Semantic Scholar → arXiv cascade)
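
The cascade behavior can be sketched as trying each backend in order and falling through on rate limits. The provider stubs and exception type here are stand-ins, not the project's actual clients:

```python
# Sketch of the search-provider cascade: try each backend in order,
# falling through when one is rate-limited (stubbed providers).

class RateLimited(Exception):
    pass

def cascade_search(query: str, providers) -> list:
    """providers: ordered callables, e.g. OpenAlex first,
    then Semantic Scholar, then arXiv."""
    for provider in providers:
        try:
            return provider(query)
        except RateLimited:
            continue  # fall through to the next backend
    return []

# Example with stubbed backends:
def openalex(q):        raise RateLimited()
def semanticscholar(q): return [f"result for {q}"]
def arxiv(q):           return []

print(cascade_search("graph transformers", [openalex, semanticscholar, arxiv]))
# ['result for graph transformers']
```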

Full Changelog: v0.2.0...v0.3.0

v0.2.0 — Multi-Agent Pipeline, Docker Sandbox & Quality Hardening

16 Mar 16:27


Highlights

This release introduces three multi-agent subsystems, a hardened Docker sandbox, and 4 rounds of paper quality auditing — significantly improving the end-to-end quality of generated research papers.

New Multi-Agent Subsystems

CodeAgent (4-phase architecture)

  • LLM generates multi-file experiment code (main.py + setup.py + requirements.txt)
  • Static analysis & deep validation (AST-based class/method checks)
  • LLM-guided code review with structured JSON feedback
  • Iterative repair loop (up to 3 rounds) with automatic UnboundLocalError fix
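
The AST-based validation step can be sketched with the standard library's ast module: parse the generated source and check that the expected classes and methods exist (the expected-names dict is an illustrative input, not the project's actual schema):

```python
# Sketch of AST-based validation that generated code defines the
# expected classes and methods (expected names are illustrative).
import ast

def missing_definitions(source: str, expected: dict) -> list:
    """expected maps class name -> set of required method names."""
    tree = ast.parse(source)
    found = {
        node.name: {m.name for m in node.body
                    if isinstance(m, ast.FunctionDef)}
        for node in ast.walk(tree) if isinstance(node, ast.ClassDef)
    }
    problems = []
    for cls, methods in expected.items():
        if cls not in found:
            problems.append(f"missing class {cls}")
        else:
            problems.extend(f"missing {cls}.{m}"
                            for m in methods - found[cls])
    return problems

code = "class Trainer:\n    def train(self): pass\n"
print(missing_definitions(code, {"Trainer": {"train", "evaluate"}}))
# ['missing Trainer.evaluate']
```

A non-empty result would feed the review/repair loop rather than failing the run outright.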

BenchmarkAgent (4 sub-agents: Surveyor → Selector → Acquirer → Validator)

  • Domain-aware dataset and baseline selection from 13-domain knowledge base
  • Automatic benchmark acquisition with Docker compatibility validation
  • Integrated at Stage 9 (experiment_design), output injected into Stage 10

FigureAgent (5 sub-agents: Planner → CodeGen → Renderer → Critic → Integrator)

  • Academic-quality chart generation with SciencePlots, 300 DPI, colorblind-safe palette
  • 6 built-in chart templates + LLM fallback for custom visualizations
  • Tri-modal critic review (data accuracy, aesthetics, academic convention)

Docker Sandbox Enhancements

  • Network-policy-aware code generation: none | setup_only | pip_only | full
  • Dynamic dependency installation via requirements.txt
  • Pre-cached datasets: CIFAR-10/100, MNIST, FashionMNIST, STL-10, SVHN
  • Extended ML stack: torch, torchvision, timm, einops, transformers, etc.
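
The four network policies can be read as a phase table: what is allowed during environment setup versus during the experiment run. The semantics below are inferred from the policy names and may not match the sandbox exactly:

```python
# Sketch of what each network policy permits per phase
# (semantics inferred from the policy names; the real sandbox may differ).

POLICIES = {
    #             setup phase  experiment run
    "none":       (False,      False),
    "setup_only": (True,       False),
    "pip_only":   (True,       False),   # pip install only, then offline
    "full":       (True,       True),
}

def may_use_network(policy: str, phase: str) -> bool:
    setup_ok, run_ok = POLICIES[policy]
    return setup_ok if phase == "setup" else run_ok

print(may_use_network("setup_only", "setup"))  # True
print(may_use_network("setup_only", "run"))    # False
```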

Paper Quality Hardening (4-round audit)

  • Post-compilation quality checks, weasel/duplicate word lint
  • 7-dimension AI-Scientist-style review scoring
  • AI-slop detection (50+ phrases), statistical rigor validator
  • Cross-discipline support for 7 research domains (ML/physics/chem/econ/math/eng/bio)
  • NeurIPS checklist integration

Bug Fixes (15+)

  • Fix baselines dict-to-list crash in BenchmarkAgent
  • Fix Gymnasium environment versions (v4 → v5)
  • Fix experiment condition drift in iterative refinement (anchor to exp_plan.yaml)
  • Fix compute budget constraint for experiment design
  • Fix metric direction mismatch, citation verification batching
  • Fix LaTeX output sanitization, figure plan format handling
  • Add RL stability guidance (gradient clipping, NaN guard)
  • And more — see full commit message for details

Compatibility

All changes are backward-compatible with v0.1.0 configuration files.

Full Changelog: v0.1.0...v0.2.0

v0.1.0 — Initial Release

15 Mar 04:35


AutoResearchClaw v0.1.0

Fully autonomous research pipeline: one message in, full conference paper out. 🦞

Highlights

  • 23-stage pipeline: Research Scoping → Literature Discovery → Knowledge Synthesis → Hypothesis Generation → Experiment Design → Self-Healing Execution → Analysis & Decision → Paper Writing → Citation Verification
  • Multi-agent debate: 3 agents (Innovator, Pragmatist, Contrarian) argue over hypotheses; adversarial analysis panel reviews results
  • Self-healing executor: autonomous crash diagnosis, code repair, and Pivot/Refine decisions
  • Cross-run evolution: time-decayed lesson store that improves future runs
  • Citation verification: 4-layer pipeline (arXiv, DOI, Semantic Scholar, LLM relevance check)
  • OpenClaw integration: trigger full runs from a chat message

Results (6 end-to-end runs)

  • 100% pipeline completion (124/124 steps)
  • 94.3% citation integrity
  • Mean quality 6.2/10 on conference review scale

Requirements

  • Python 3.9+
  • OpenAI-compatible LLM API