I recently watched Shunyu Yao give a talk on his seminal ReAct paper, the work that introduced the Thought → Action → Observation loop now sitting underneath nearly every agentic LLM workflow. Watching it reminded me of something I’ve been turning over for a while: classical AI — the whole 1970s–90s tradition of expert systems, logic programming, planners, knowledge graphs, rule engines — was largely written off when LLMs took the spotlight. But the structured reasoning flows those old systems defined seem like exactly the kind of scaffolding that could guide an LLM to reason more reliably. And conversely, the LLM is now good enough to fix the brittleness that killed expert systems in the first place: natural-language understanding, common-sense gaps, the knowledge-acquisition bottleneck.

So I went looking. Two threads:

  1. Is anyone seriously combining LLMs with classical reasoning systems?
  2. What significant papers grew out of ReAct?

This post is the result of that survey. Every paper is linked to the source.

Framing first: a sharpened version of the intuition

The intuition is largely on the mark, with one qualification. Classical AI is being revived inside LLM stacks, but mostly where it already had legitimate wins (formal logic, planning, theorem proving, knowledge graphs). The grand “expert systems for everything, fixed by LLMs” vision is not yet empirically supported. The knowledge-acquisition bottleneck got relocated, not solved.

Today’s neuro-symbolic LLM hybrids tend to win on logic, math, and planning benchmarks (FOLIO, ProofWriter, GSM8K, Game-of-24, Blocksworld) where an off-the-shelf solver already exists and the problem reduces to translation. They’re far less developed in the messy, open-ended domains (medicine, law, common-sense reasoning) that killed expert systems originally. With that caveat front and center, here are the two threads.


Thread 1 — LLM + classical reasoning (neuro-symbolic)

LLM as semantic parser → symbolic solver

The dominant pattern. The LLM translates natural language into a formal language; a deterministic solver does the inference; the LLM optionally repairs from solver errors. It’s the modern incarnation of “natural-language interface to expert system” — except the LLM is finally good enough that the translation step works.

  • Faithful Chain-of-Thought (Lyu et al., Jan 2023). NL → symbolic chain (Python, PDDL, Datalog) → deterministic solver. Frames the lack of faithfulness in CoT — that the chain doesn’t actually compute the answer — as the core problem, and fixes it by making the chain executable.
  • Logic-LM (Pan et al., EMNLP 2023). NL → first-order logic / constraint program / Boolean formula → solver (Prover9, Z3, python-constraint), with a self-refinement loop driven by solver error messages. Reports an average gain of 39.2% over standard prompting and 18.4% over chain-of-thought.
  • LINC (Olausson et al., Oct 2023). Targets first-order logic with a theorem prover. Notably, small open models + LINC beat much bigger closed models doing pure CoT.
  • SatLM (Ye et al., NeurIPS 2023). LLM emits a declarative SMT spec (not imperative code), passed to Z3. Argues declarative specs mirror the problem statement and are easier for LLMs. +23% on hard arithmetic over Program-of-Thought.
  • SymbCoT (Xu et al., ACL 2024). Three stages: translate → solve via explicit logical rules → verify both translation and chain.

The common theme reassigns one half of the GOFAI division of labor. In old-school AI, modeling was the hard human-expert step and inference was automated. Here inference is still handed to the symbolic system, but the LLM takes over the modeling.
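The translate → solve → repair pattern can be sketched in a few lines. This is a minimal illustration under loud assumptions, not any paper's implementation: the Horn-clause text format, `parse_program`, `forward_chain`, and the scripted `toy_translate` (a stand-in for the LLM) are all hypothetical, and a real system would call a solver such as Z3 or Prover9 instead of the toy forward chainer.

```python
from typing import Callable, List, Optional, Set, Tuple

Rule = Tuple[Tuple[str, ...], str]  # (premises, conclusion)


def parse_program(text: str) -> Tuple[Set[str], List[Rule]]:
    """Parse lines of the form 'fact' or 'p1, p2 -> q'; raise ValueError otherwise."""
    facts: Set[str] = set()
    rules: List[Rule] = []
    for n, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if "->" in line:
            body, head = (part.strip() for part in line.split("->", 1))
            premises = tuple(p.strip() for p in body.split(","))
            if not head or not all(premises):
                raise ValueError(f"line {n}: malformed rule {line!r}")
            rules.append((premises, head))
        elif " " not in line:
            facts.add(line)
        else:
            raise ValueError(f"line {n}: expected a fact or a rule, got {line!r}")
    return facts, rules


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Deterministic inference: apply rules until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, head in rules:
            if head not in derived and all(p in derived for p in premises):
                derived.add(head)
                changed = True
    return derived


def solve_with_refinement(problem: str, query: str,
                          translate: Callable[[str, Optional[str]], str],
                          max_rounds: int = 3) -> Optional[bool]:
    """Ask the translator for a program; feed solver errors back until it parses."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        program = translate(problem, feedback)
        try:
            facts, rules = parse_program(program)
        except ValueError as err:
            feedback = str(err)  # the error message drives the next attempt
            continue
        # Closed-world reading: anything not derivable counts as False.
        return query in forward_chain(facts, rules)
    return None  # translator never produced a valid program


feedback_log: List[Optional[str]] = []


def toy_translate(problem: str, feedback: Optional[str]) -> str:
    """Scripted stand-in for an LLM: the first draft is invalid, the second is fixed."""
    feedback_log.append(feedback)
    if feedback is None:
        return "rainy\nrainy then wet_ground"
    return "rainy\nrainy -> wet_ground\nwet_ground -> slippery"


result = solve_with_refinement("It is rainy. Rain wets the ground. Wet ground is slippery.",
                               "slippery", toy_translate)
```

The point of the split is that the answer comes from `forward_chain`, not from the translator, so a wrong answer can only come from a wrong translation.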

LLM + theorem provers (formal mathematics)

LLM as planner — and the LLM-Modulo critique

The sharpest argument in the field. Subbarao Kambhampati’s group has been the loudest skeptic of “LLMs can plan.”

  • PlanBench (Valmeekam et al., June 2022). Real planning-community domains (Blocksworld, Logistics) showing LLM planning collapses outside common-sense scenarios.
  • On the Planning Abilities of LLMs: A Critical Investigation (Valmeekam et al., NeurIPS 2023 Spotlight). GPT-4 manages ~12% on autonomous plan generation, but there’s a significant lift when LLMs advise a classical planner instead of generating plans themselves.
  • LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks (Kambhampati et al., ICML 2024). The constructive proposal: LLMs generate candidate plans / model fragments / heuristics, and a bank of symbolic critics (plan validators, simulators, model-checkers) verifies and reprompts. The LLM is “an approximate knowledge source,” never the final arbiter. The most rigorously specified neuro-symbolic agent architecture I know of.
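The generate, verify, reprompt loop described above can be sketched on a toy Blocksworld. Everything here is an assumption made for illustration: the state encoding, `validate_plan` as the single symbolic critic, and the scripted `toy_propose` standing in for the LLM. The paper describes a bank of critics; this sketch uses just one.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]   # block -> what it sits on ("table" or another block)
Move = Tuple[str, str]   # (block, destination)


def is_clear(state: State, block: str) -> bool:
    """A block is clear when no other block sits on it."""
    return block not in state.values()


def validate_plan(start: State, goal: State, plan: List[Move]) -> Optional[str]:
    """Symbolic critic: simulate the plan, return None if valid, else a critique."""
    state = dict(start)
    for i, (block, dest) in enumerate(plan, start=1):
        if block not in state:
            return f"step {i}: unknown block {block}"
        if not is_clear(state, block):
            return f"step {i}: {block} is not clear"
        if dest != "table" and (dest == block or dest not in state
                                or not is_clear(state, dest)):
            return f"step {i}: cannot place {block} on {dest}"
        state[block] = dest
    unmet = [b for b, on in goal.items() if state.get(b) != on]
    return f"goal not reached for blocks {unmet}" if unmet else None


def llm_modulo(start: State, goal: State,
               propose: Callable[[List[str]], List[Move]],
               max_rounds: int = 5) -> Tuple[Optional[List[Move]], List[str]]:
    """The proposer only suggests; the critic decides what is accepted."""
    critiques: List[str] = []
    for _ in range(max_rounds):
        plan = propose(critiques)
        critique = validate_plan(start, goal, plan)
        if critique is None:
            return plan, critiques
        critiques.append(critique)  # reprompt with the accumulated critiques
    return None, critiques


# Scripted stand-in for an LLM: a naive plan first, a repaired plan after one critique.
SCRIPT: List[List[Move]] = [[("A", "B")], [("B", "table"), ("A", "B")]]


def toy_propose(critiques: List[str]) -> List[Move]:
    return SCRIPT[min(len(critiques), len(SCRIPT) - 1)]


start = {"A": "table", "B": "A", "C": "table"}   # B is stacked on A
goal = {"A": "B"}                                # we want A on B
plan, critiques = llm_modulo(start, goal, toy_propose)
```

Note that no plan leaves `llm_modulo` without passing the validator, which is the sense in which the LLM is never the final arbiter.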

Knowledge-graph-augmented reasoning

The modern descendant of expert-system knowledge bases. Instead of curating a brittle ontology by hand, drive LLM traversal over an existing graph.

  • Think-on-Graph (Sun et al., ICLR 2024). LLM acts as an agent doing iterative beam search over a KG (Wikidata, Freebase). No fine-tuning, and the trace is auditable: you see exactly which edges were traversed, so any claim that isn’t backed by a traversed edge is easy to spot. That mitigates hallucination rather than eliminating it.
  • CRITIC (Gou et al., ICLR 2024). LLM critiques its own outputs by calling external tools (search engines, code interpreters) and revising. Adjacent in spirit: the “rule” is “verify factually before answering.”
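To make the "beam search over a KG with an auditable trace" idea concrete, here is a minimal sketch. The toy graph, the `RELEVANCE` table, and `score_path` are assumptions for illustration. In Think-on-Graph the LLM does the pruning; here a hand-written relevance table stands in for it.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str, str]              # (head, relation, tail)
KG = Dict[str, List[Tuple[str, str]]]    # head -> [(relation, tail), ...]

TOY_KG: KG = {
    "Canberra": [("capital_of", "Australia"), ("located_in", "ACT")],
    "Australia": [("head_of_government", "Prime Minister"), ("continent", "Oceania")],
    "ACT": [("type", "territory")],
}

# Hypothetical stand-in for LLM relevance judgments, for the question
# "Which continent is the country whose capital is Canberra on?"
RELEVANCE = {"capital_of": 1.0, "continent": 1.0,
             "located_in": 0.2, "head_of_government": 0.1}


def score_path(path: List[Edge]) -> float:
    return sum(RELEVANCE.get(rel, 0.0) for _, rel, _ in path)


def beam_search_paths(kg: KG, start: str,
                      score: Callable[[List[Edge]], float],
                      width: int = 2, depth: int = 2) -> List[List[Edge]]:
    """Extend the kept paths one edge at a time, keeping the `width` best each round."""
    beams: List[List[Edge]] = [[]]
    for _ in range(depth):
        expanded: List[List[Edge]] = []
        for path in beams:
            node = path[-1][2] if path else start
            for rel, tail in kg.get(node, []):
                expanded.append(path + [(node, rel, tail)])
        if not expanded:
            break  # nothing left to explore; keep the last beams
        expanded.sort(key=score, reverse=True)
        beams = expanded[:width]
    return beams


best = beam_search_paths(TOY_KG, "Canberra", score_path)[0]
answer = best[-1][2]
```

The returned path is itself the audit trail: every hop in `best` is an edge that exists in the graph.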

Process-level guidance (the soft-rules cousin)

  • Let’s Verify Step by Step (Lightman et al., OpenAI, May 2023). Process Reward Models score each reasoning step, not just final answers. Released the PRM800K dataset of 800k step labels. A learned step-checker is neuro-symbolic in spirit — and it’s the through-line from process supervision to today’s o1/o3-style reasoning models.
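The difference between scoring steps and scoring final answers is easy to show in miniature. In this sketch, `toy_step_score` is a hypothetical stand-in for a learned PRM (it simply checks arithmetic and returns a made-up probability), and the solution score is the product of step scores, which is how I understand the paper to aggregate them.

```python
import math
from typing import Callable, List


def toy_step_score(step: str) -> float:
    """Stand-in for a learned PRM: high score if 'a op b = c' is arithmetically right."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    value = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}[op]
    return 0.95 if value == int(rhs) else 0.05


def solution_score(steps: List[str], step_score: Callable[[str], float]) -> float:
    """Probability that every step is correct, taken as the product of step scores."""
    return math.prod(step_score(s) for s in steps)


def best_of_n(candidates: List[List[str]],
              step_score: Callable[[str], float]) -> List[str]:
    """Rerank sampled solutions by their process-level score."""
    return max(candidates, key=lambda steps: solution_score(steps, step_score))


# Three sampled solutions to (2 + 3) * 4. The last one lands on the right
# final answer (20) through two wrong steps, so outcome-only checking accepts it.
wrong_answer = ["2 + 3 = 6", "6 * 4 = 24"]
sound = ["2 + 3 = 5", "5 * 4 = 20"]
lucky = ["2 + 3 = 6", "6 * 4 = 20"]

best = best_of_n([wrong_answer, sound, lucky], toy_step_score)
lucky_step_scores = [toy_step_score(s) for s in lucky]
```

An outcome-only reward cannot separate `sound` from `lucky`; the step-level score can, which is the whole argument for process supervision.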

General surveys


Thread 2 — The ReAct lineage

ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., ICLR 2023) is the root. Below, descendants are grouped by what they extend or critique in ReAct.
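For reference, the Thought → Action → Observation loop that the rest of this thread extends or critiques looks roughly like this. It is a minimal sketch, not the paper's code: `scripted_policy` is a hypothetical stand-in for the LLM call, and the `lookup` tool and `FACTS` table are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an LLM call: given the transcript so far, return
# (thought, action_name, action_argument). A real agent parses this from text.
Policy = Callable[[List[str]], Tuple[str, str, str]]


def react_loop(question: str, policy: Policy,
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Tuple[str, List[str]]:
    """Run Thought -> Action -> Observation until the policy emits 'finish'."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        # The whole transcript is passed back to the policy on every step.
        thought, action, arg = policy(transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{arg}]")
        if action == "finish":
            return arg, transcript
        # Unknown tools become an observation so the policy can recover.
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool {action!r}"
        transcript.append(f"Observation: {observation}")
    return "", transcript  # step budget exhausted without an answer


def scripted_policy(transcript: List[str]) -> Tuple[str, str, str]:
    """Toy policy: look something up once, then answer with what was observed."""
    prefix = "Observation: "
    observations = [line for line in transcript if line.startswith(prefix)]
    if not observations:
        return ("I need the capital of France.", "lookup", "capital of France")
    return ("The lookup answered the question.", "finish",
            observations[-1][len(prefix):])


FACTS = {"capital of France": "Paris"}
answer, trace = react_loop(
    "What is the capital of France?",
    scripted_policy,
    {"lookup": lambda q: FACTS.get(q, "no result")},
)
```

Two properties of this loop come up repeatedly below: each step is committed greedily with no backtracking, and the transcript grows with every step.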

Reflection / verbal RL (extends the loop with self-critique)

  • Self-Refine (Madaan et al., Mar 2023). Same model is generator + critic + refiner. ~20% lift across seven tasks, no extra training.
  • Reflexion (Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao; NeurIPS 2023). The flagship descendant: a failed ReAct trajectory → LLM writes a verbal lesson into episodic memory → next attempt conditions on it. This is gradient-free policy improvement, or “verbal RL.” 91% pass@1 on HumanEval, against the 80% reported for GPT-4 alone at the time.
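The Reflexion loop (try, evaluate, write a lesson, retry with the lesson in context) can be sketched as follows. The names `toy_actor`, `run_unit_tests`, and `toy_reflect` are hypothetical and scripted. In the real method all three roles are LLM calls or environment feedback.

```python
from typing import Callable, List, Optional, Tuple

Solution = Callable[[int], int]


def reflexion(actor: Callable[[List[str]], Solution],
              evaluate: Callable[[Solution], Tuple[bool, str]],
              reflect: Callable[[str], str],
              max_trials: int = 3) -> Tuple[Optional[Solution], List[str]]:
    """Retry with verbal lessons in memory; no weights are updated anywhere."""
    memory: List[str] = []  # episodic memory of lessons from failed trials
    for _ in range(max_trials):
        attempt = actor(memory)
        ok, feedback = evaluate(attempt)
        if ok:
            return attempt, memory
        memory.append(reflect(feedback))
    return None, memory


def toy_actor(memory: List[str]) -> Solution:
    """Scripted stand-in for the LLM: only handles negatives once told to."""
    if any("negative" in lesson for lesson in memory):
        return lambda x: -x if x < 0 else x
    return lambda x: x


def run_unit_tests(solution: Solution) -> Tuple[bool, str]:
    """Evaluator for the task 'implement absolute value', HumanEval-style."""
    for arg, expected in [(3, 3), (-2, 2), (0, 0)]:
        got = solution(arg)
        if got != expected:
            return False, f"f({arg}) returned {got}, expected {expected}"
    return True, "all tests passed"


def toy_reflect(feedback: str) -> str:
    """Scripted stand-in for the self-reflection call."""
    return f"Lesson: handle negative inputs. Evidence: {feedback}"


solution, memory = reflexion(toy_actor, run_unit_tests, toy_reflect)
```

The only thing that changes between trials is the text in `memory`, which is why the pattern is cheap to bolt onto an existing agent.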

Search-based extensions (replace the linear loop with a tree or graph)

  • Self-Consistency (Wang et al., ICLR 2023). Sample many CoT paths, majority-vote the answer. Conceptual ancestor of every sample-and-verify approach.
  • Tree of Thoughts (Yao et al., NeurIPS 2023). Yao’s own follow-up. Reasoning as deliberate search over a tree of thoughts with an LLM-based value function, replacing the single left-to-right chain that CoT and ReAct follow. 4% (GPT-4 with CoT) → 74% on Game-of-24.
  • Graph of Thoughts (Besta et al., AAAI 2024). Generalizes ToT: thoughts as graph vertices with aggregation and feedback loops. +62% over ToT on sorting.
  • Algorithm of Thoughts (Sel et al., ICML 2024). Embeds algorithmic exemplars (DFS/BFS pseudocode) directly in prompts so the LLM internalizes the search trace in a single pass. Critique of ToT/GoT inference cost.
  • LATS — Language Agent Tree Search (Zhou et al., ICML 2024). The most explicit ReAct + MCTS marriage. Tree search over ReAct trajectories with a value function and self-reflection at each node. 92.7% HumanEval.
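The shared skeleton of these search-based methods is: propose several next thoughts, score them, keep the best few, repeat. Below is a minimal breadth-first beam version on a deliberately trivial puzzle (reach 10 from 1 using "+1" and "×2"). The puzzle and the `propose`, `value`, and `is_goal` functions are assumptions for illustration; in Tree of Thoughts both proposing and valuing are LLM calls.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Path = Tuple[int, ...]  # a partial solution: the sequence of numbers visited


def tot_beam_search(root: Path,
                    propose: Callable[[Path], Iterable[Path]],
                    value: Callable[[Path], float],
                    is_goal: Callable[[Path], bool],
                    beam_width: int = 2,
                    max_depth: int = 6) -> Optional[Path]:
    """Breadth-first search over thoughts, pruned to `beam_width` per level."""
    frontier: List[Path] = [root]
    for _ in range(max_depth):
        # Expand every kept thought; dict.fromkeys dedupes while keeping order.
        candidates = list(dict.fromkeys(
            child for path in frontier for child in propose(path)))
        for path in candidates:
            if is_goal(path):
                return path
        # Keep only the most promising thoughts (the value-function step).
        frontier = sorted(candidates, key=value, reverse=True)[:beam_width]
    return None


TARGET = 10


def propose(path: Path) -> List[Path]:
    """Stand-in for the LLM proposing next thoughts: add one, or double."""
    x = path[-1]
    return [path + (x + 1,), path + (x * 2,)]


def value(path: Path) -> float:
    """Stand-in for the LLM value function: closer to the target is better."""
    return -abs(TARGET - path[-1])


def is_goal(path: Path) -> bool:
    return path[-1] == TARGET


solution = tot_beam_search((1,), propose, value, is_goal)
```

With `beam_width=2` the search keeps two partial solutions alive per level instead of committing to one. The cost is that every level pays for proposing and scoring all candidates, which is the inference-cost objection Algorithm of Thoughts raises.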

Tool use (the “Action” half, refined)

  • Toolformer (Schick et al., Meta, Feb 2023). LLM teaches itself when and how to call APIs via self-supervised data generation. Tool-use as learned behavior rather than prompted — a parallel formulation to ReAct’s prompted version.
  • ReWOO (Xu et al., May 2023). The clearest critique of ReAct: ReAct re-prompts the full history every step, blowing tokens. ReWOO plans the whole tool-call DAG up front (Planner), executes (Worker), then synthesizes (Solver). ~5× token efficiency, +4% on HotpotQA, and lets you distill a 175B planner into a 7B model.
  • CodeAct (Wang et al., ICML 2024). Replaces JSON tool-call schemas with executable Python as the action space. Up to +20% success rate. Now the dominant pattern in OpenHands and many production agents.
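ReWOO's plan-then-execute structure is worth seeing in code, because it explains the token savings. In this sketch the `#E1`-style evidence variables follow the paper's notation as I understand it, but `toy_planner`, `toy_solver`, the `lookup` tool, and the `FACTS` table are hypothetical stand-ins. The planner and solver would each be a single LLM call.

```python
import re
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (evidence variable, tool name, argument template)


def run_plan(plan: List[Step],
             tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Worker: execute tool calls in order, substituting earlier evidence. No LLM calls."""
    evidence: Dict[str, str] = {}
    for var, tool, template in plan:
        # Replace references such as #E1 with the evidence collected so far.
        arg = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], template)
        evidence[var] = tools[tool](arg)
    return evidence


def toy_planner(question: str) -> List[Step]:
    """Stand-in for the single planning call: emits the whole plan up front."""
    return [
        ("#E1", "lookup", "author of Dune"),
        ("#E2", "lookup", "birth year of #E1"),
    ]


def toy_solver(question: str, evidence: Dict[str, str]) -> str:
    """Stand-in for the single synthesis call: here it returns the last evidence."""
    return evidence[max(evidence, key=lambda v: int(v[2:]))]


FACTS = {
    "author of Dune": "Frank Herbert",
    "birth year of Frank Herbert": "1920",
}
TOOLS = {"lookup": lambda q: FACTS.get(q, "no result")}

question = "In what year was the author of Dune born?"
evidence = run_plan(toy_planner(question), TOOLS)
answer = toy_solver(question, evidence)
```

However many tool steps the plan contains, the model is called twice (plan, then solve). A ReAct loop would call it once per step with an ever-growing history, which is the cost ReWOO is attacking. The trade-off is that the plan cannot adapt to surprising observations mid-run.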

Autonomous, lifelong, multi-agent (ReAct loops in the wild)

  • Generative Agents (Park et al., UIST 2023). 25 Sims-style agents with an observation→reflection→planning memory architecture. Demonstrated emergent social behavior (the famous Valentine’s Day party).
  • Voyager (Wang et al., NVIDIA, May 2023). Minecraft agent with a growing skill library of reusable code snippets and an automatic curriculum. 3.3× more unique items vs. prior SOTA. The first convincing demo of LLM-agent lifelong learning.
  • AutoGen (Wu et al., Microsoft, Aug 2023). Multi-agent conversation framework; the inter-agent protocol is the contribution.
  • MetaGPT (Hong et al., ICLR 2024). Encodes Standard Operating Procedures as prompt scaffolds — explicit symbolic structure on top of LLM agents.
  • AutoGPT and BabyAGI — no formal papers; treat as artifacts, not citable research.

Coding agents

  • SWE-agent (Yang, …, Yao, Narasimhan, Press; NeurIPS 2024). Introduces the Agent-Computer Interface (ACI) concept: design tools for agents, not adapt human tools. 12.5% on SWE-bench at release — a step-change at the time.
  • OpenHands (Wang et al., ICLR 2025; formerly OpenDevin). Open platform with sandboxed execution, multi-agent coordination, evaluation harnesses. (Cognition Labs’ Devin, Mar 2024, has no paper — product, not research.)

Yao’s own framing papers

  • WebShop (Yao et al., July 2022). The benchmark that motivated ReAct. 1.18M products, 12k human instructions; agents at 29% vs. humans at 59%.
  • Cognitive Architectures for Language Agents (CoALA) (Sumers, Yao, Narasimhan, Griffiths; TMLR). The conceptual map of the field. Decomposes language agents into memory modules (working / episodic / semantic / procedural), action space (internal vs. external), and decision procedure. Explicitly draws the line back to Soar and ACT-R. The paper that connects Thread 1 and Thread 2 most directly.
  • τ-bench (Yao, Shinn, Razavi, Narasimhan; June 2024). Tool + user + rule benchmarks; even GPT-4o is below 50% success. The “ReAct works in lab demos but breaks under real rules and real users” paper.

Industry framing

  • Anthropic, Building Effective Agents (Schluntz & Zhang, Dec 2024). Not a paper, but cited everywhere. Argues the field oversold “agents” and most production systems should be workflows (predefined LLM/tool orchestration), reserving true agents (ReAct-style dynamic loops) for genuinely open-ended tasks. Names five composable patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer. The “evaluator-optimizer” pattern is essentially Reflexion-lite.

What to read first

Ranked by signal-to-noise for someone who already groks ReAct:

  1. CoALA — the map. Read this before anything else; it tells you where every other paper sits.
  2. Reflexion — the cleanest single extension of ReAct; verbal RL is a pattern you’ll re-implement.
  3. LLM-Modulo — the rigorous neuro-symbolic-agent architecture; takes the planning critique seriously and proposes a real fix.
  4. Logic-LM — the prototypical NL→formal-language→solver paper.
  5. Tree of Thoughts — Yao’s own pivot from ReAct’s chain to deliberate search.
  6. ReWOO — the critique of ReAct’s token economics; matters enormously in production.
  7. Building Effective Agents — sober field-state read, free of academic incentives.

Honorable mention: Let’s Verify Step by Step — the through-line from process supervision to today’s o1-style reasoning models.


Where the threads converge — open research questions

This is where it gets interesting and underexplored.

  1. Symbolic planners inside the ReAct loop. LLM-Modulo describes verifiers as critics outside the agent loop; ReAct describes a free-form LLM loop. The clean synthesis — every Thought checked by a PDDL planner / SMT solver / type-checker before becoming an Action — is implied by LATS and LLM-Modulo, but I don’t know of a paper that demonstrates it cleanly as a single architecture on a hard benchmark. Logic-LM and SatLM are single-shot; LLM-Modulo discusses the architecture but is more position than implementation. Genuine research direction.
  2. Symbolic memory for agents. CoALA proposes typed memories (semantic / episodic / procedural). Voyager has a code skill library; Generative Agents has a natural-language memory. Nobody has convincingly demonstrated a symbolic / KG-structured long-term memory inside a ReAct-style agent that updates itself reliably. This would be the direct revival of frame-based and knowledge-base systems inside a working LLM agent, and it remains open.
  3. PRMs vs. symbolic verifiers as step-level critics. Both score each step. PRMs are learned and noisy; symbolic verifiers are exact but narrow. There’s no clean answer yet on when to use which, or how to combine them.
  4. The knowledge-acquisition bottleneck, redux. Expert systems died because building the rule base for any new domain was prohibitive. LLMs make rule extraction cheap-but-unreliable. There is surprisingly little work on LLMs auto-constructing the symbolic model that a downstream symbolic system then verifies and uses. LLM-Modulo gestures at this; Voyager’s skill library is a code-based version; Think-on-Graph leverages existing KGs rather than building new ones. Arguably the most important open problem at the intersection.
  5. Benchmarks beyond logic puzzles. Almost all neuro-symbolic LLM benchmarks are FOLIO, ProofWriter, GSM8K, Game-of-24, Blocksworld. τ-bench is the rare evaluation that punishes failures of rule-following in messy domains — and current agents fail it. Translating Thread 1’s solver-aided techniques into Thread 2’s messy-environment benchmarks is wide open.

A single research question that ties everything together: Can an agent that uses ReAct-style loops, with a symbolic verifier critic (LLM-Modulo) and a KG-structured episodic memory (CoALA-typed), beat current agentic baselines on τ-bench? Nobody has shown that yet, and the pieces all exist.

That’s the thread I’ll be pulling on.