[
  {
    "src": "hackernews",
    "id": "47936461",
    "title": "Show HN: Loom – A Markdown knowledge graph for better coding-agent execution",
    "body": "Hi HN, I built Loom because I wanted less agent tooling, not more.<p>My coding-agent workflow had outgrown PLAN.md. One file kept turning into the partial spec, research log, task queue, evidence log, review notes, handoff summary, and feature doc. And stratifying it typically ends up in disparate scratch files with no canonicity.<p>One solution is to add more surfaces: a spec tool, an issue tool, a memory system, a review prompt, a planning plugin, a workflow package. But that brings two problems: There is a lack of genuine cohesion, no emergent knowledge graph. And some tools try to do too much and take over your workflow.<p>I wanted one repo-native work record&#x2F;grammar with enough structure for the agent to organize itself.<p>That became Loom.<p>If you want to stop reading and try it, the repo has install paths for Claude Code, OpenCode, Codex, Cursor, and Gemini CLI as well as a more detailed write-up in the README:<p><a href=\"https:&#x2F;&#x2F;github.com&#x2F;z3z1ma&#x2F;agent-loom\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;z3z1ma&#x2F;agent-loom</a><p>How it works:<p>You start a task in a Loom-enabled repo.<p>The agent first asks where the work belongs.<p>If it needs discovery, it goes to research.\nIf behavior is unclear, it goes to a spec.\nIf sequencing is unclear, it goes to a plan.\nIf work is live, it goes to a ticket.\nIf something was observed, it goes to evidence.\nIf risk needs pressure, it goes to critique.\nIf the project learned something reusable, it goes to wiki.<p>That project vocabulary is the core of Loom. It&#x27;s a knowledge graph.<p>The individual pieces are familiar. Beads has local task memory. Spec Kit has executable specs. Superpowers has development-oriented skills. ECC has compounding. GSD has context engineering. Ralph has clean execution loops.<p>Loom’s contribution is the unification and composition. 
It gives every kind of work a place in the repo, then teaches the agent how to move between those places.<p>Once implementation is ready, the parent compiles a packet.<p>A packet is not just a prompt. It is a bounded worker contract compiled from upstream project state: constitution, initiative, research, spec, plan, ticket, evidence, critique, source context, write scope, verification posture, stop conditions, and output shape.<p>The worker gets a clean context window, but not an empty one. Less context by volume, better context by shape.<p>Then the loop runs the other way.<p>After the worker returns, the parent reconciles the result into the ticket, records evidence, routes critique when needed, and promotes durable learning through retrospectives. A rejected path can move into research. A settled explanation can move into wiki. A clarified behavior can move into a spec. A changed sequencing lesson can move into a plan.<p>The next packet is better because the project is better.<p>There is no service, daemon, MCP server, workflow engine, or runtime database. The graph lives in Markdown files. Agents inspect it with normal tools: grep, find, git, cat, awk, sed, and shell pipes.<p>I would like criticism from people using coding agents on projects that span more than one session. The most useful feedback would be where this feels helpful, where it feels like process, and which project layers are wrong&#x2F;right.",
    "url": "https://github.com/z3z1ma/agent-loom",
    "upvotes": 1,
    "comments": 0,
    "sub": "hackernews",
    "signal": 55.0,
    "hits": [
      "agent workflow",
      "context engineering",
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "45529628",
    "title": "Launch HN: Extend (YC W23) – Turn your messiest documents into data",
    "body": "Hey HN! We’re Kushal and Eli, co-founders of Extend (<a href=\"https:&#x2F;&#x2F;www.extend.ai&#x2F;\">https:&#x2F;&#x2F;www.extend.ai&#x2F;</a>). Extend is a toolkit for AI teams to ingest any kind of messy document (e.g. PDFs, images, Excel files) and build incredible products.<p>We built Extend to handle the hardest documents that break most pipelines. You can see some examples here in our demo (no signup required): <a href=\"https:&#x2F;&#x2F;dashboard.extend.ai&#x2F;demo\">https:&#x2F;&#x2F;dashboard.extend.ai&#x2F;demo</a><p>I know you&#x27;re probably thinking “not another document API startup”. Unfortunately, the problem just isn’t solved yet!<p>I’ve personally spent months struggling to build reliable document pipelines at a previous job. The long tail of edge cases is endless — massive tables split across pages, 100pg+ files, messy handwriting, scribbled signatures, checkboxes represented in 10 different formats, multiple file types… the list just keeps going. After seeing countless other teams during our time in YC run into these same issues, we started building Extend.<p>We initially launched with a set of APIs for engineers to parse, classify, split, and extract documents. That started to take off, and soon we were deployed in production at companies building everything from medical agents, to real-time bank account onboarding, to mortgage automation. Over time, we’ve worked closely with these teams and seen first-hand how large the gap is between raw OCR&#x2F;model outputs —&gt; a production-ready pipeline (LLMs and VLMs aren’t magic).<p>Unlike other solutions in the space, we&#x27;re specifically focused on three core areas: (1) the computer vision layer, (2) LLM context engineering, and (3) the surrounding product tooling. 
The combination of all three is what we think it takes to hit 99% accuracy and maintain it at scale.<p>For instance, to parse messy handwriting, we built an agentic OCR correction layer which uses a VLM to review and make edits to low confidence OCR errors. To tackle multi-page tabular data, we built a semantic chunking engine which can detect the optimal boundaries within a document so models can excel with smaller context inputs.<p>We also shipped a prompt optimization agent to automate the endless prompt engineering whack-a-mole teams spend time on. It’s built as a background agent to replicate the best prompter on your team, and runs in a loop with access to a set of tools (view files, run evals, analyze results, and update schemas).<p>The most surprising part of this whole experience has been seeing how many crazy PDF formats are out there! We&#x27;ve run into everything from supermarket inventory magazines, pesticide labels, construction blueprints, and satellite manufacturing plans.<p>Everything described above is live today. You can see it in action here (no signup): <a href=\"https:&#x2F;&#x2F;dashboard.extend.ai&#x2F;demo\">https:&#x2F;&#x2F;dashboard.extend.ai&#x2F;demo</a>. To upload your own files, you can log in and do so (we’re adding free usage credits to all accounts that sign up today).<p>We’re excited to be sharing with HN! We’d love to hear about your experiences building document pipelines. Please try it out, and share any and all feedback with us (e.g. hard documents that didn’t work, feature requests).",
    "url": "https://www.extend.ai/",
    "upvotes": 61,
    "comments": 33,
    "sub": "hackernews",
    "signal": 43.6,
    "hits": [
      "context engineering",
      "prompt engineering",
      "evals"
    ]
  },
  {
    "src": "hackernews",
    "id": "45104974",
    "title": "Launch HN: Datafruit (YC S25) – AI for DevOps",
    "body": "Hey HN! We’re Abhi, Venkat, Tom, and Nick and we are building Datafruit (<a href=\"https:&#x2F;&#x2F;datafruit.dev&#x2F;\">https:&#x2F;&#x2F;datafruit.dev&#x2F;</a>), an AI DevOps agent. We’re like Devin for DevOps. You can ask Datafruit to check your cloud spend, look for loose security policies, make changes to your IaC, and it can reason across your deployment standards, design docs, and DevOps practices.<p>Demo video: <a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=2FitSggI7tg\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=2FitSggI7tg</a>.<p>Right now, we have two main methods to interact with Datafruit:<p>(1) automated infrastructure audits— agents periodically scan your environment to find cost optimization opportunities, detect infrastructure drift, and validate your infra against compliance requirements.<p>(2) chat interface (available as a web UI and through Slack) — ask the agent questions for real-time insights, or assign tasks directly, such as investigating spend anomalies, reviewing security posture, or applying changes to IaC resources.<p>Working at FAANG and various high-growth startups, we realized that infra work requires an enormous amount of context, often more than traditional software engineering. The business decisions, codebase, and cloud itself are all extremely important in any task that has been assigned. To maximize the success of the agents, we do a fair amount of context engineering. Not hallucinating is super important!<p>One thing which has worked incredibly well for us is a multi-agent system where we have specialized sub-agents with access to specific tool calls and documentation for their specialty. Agents choose to “handoff” to each other when they feel like another agent would be more specialized for the task. 
However, all agents share the same context (<a href=\"https:&#x2F;&#x2F;cognition.ai&#x2F;blog&#x2F;dont-build-multi-agents\" rel=\"nofollow\">https:&#x2F;&#x2F;cognition.ai&#x2F;blog&#x2F;dont-build-multi-agents</a>). We’re pretty happy with this approach, and believe it could work in other disciplines which require high amounts of specialized expertise.<p>Infrastructure is probably the most mission-critical part of any software organization, and needs extremely heavy guardrails to keep it safe. Language models are not yet at the point where they can be trusted to make changes (we’ve talked to a couple of startups where the Claude Code + AWS CLI combo has taken their infra down). Right now, Datafruit receives read-only access to your infrastructure and can only make changes through pull requests to your IaC repositories. The agent also operates in a sandboxed virtual environment so that it could not write cloud CLI commands if it wanted to!<p>Where LLMs <i>can</i> add significant value is in reducing the constant operational inefficiencies that eat up cloud spend and delay deadlines—the small-but-urgent ops work. Once Datafruit indexes your environment, you can ask it to do things like:<p><pre><code>  &quot;Grant @User write access to analytics S3 bucket for 24 hours&quot;\n    -&gt; Creates temporary IAM role, sends least-privilege credentials, auto-revokes tomorrow\n\n  &quot;Find where this secret is used so I can rotate it without downtime&quot;\n    -&gt; Discovers all instances of your secret, including old cron-jobs you might not know about, so you can safely rotate your keys\n\n\n  &quot;Why did database costs spike yesterday?&quot;\n    -&gt; Identifies expensive queries, shows optimization options, implements fixes\n\n</code></pre>\nWe charge a straightforward subscription model for a managed version, but we also offer a bring-your-own-cloud model. 
All of Datafruit can be deployed on Kubernetes using Helm charts for enterprise customers where data can’t leave your VPC.\nFor the time being, we’re installing the product ourselves on customers&#x27; clouds. It doesn’t exist in a self-serve form yet. We’ll get there eventually, but in the meantime if you’re interested we’d love for you guys to email us at founders@datafruit.dev.<p>We would love to hear your thoughts! If you work with cloud infra, we are especially interested in learning about what kinds of work you do which you wish could be offloaded onto an agent.",
    "url": "https://news.ycombinator.com/item?id=45104974",
    "upvotes": 65,
    "comments": 48,
    "sub": "hackernews",
    "signal": 41.2,
    "hits": [
      "context engineering",
      "claude code"
    ]
  },
  {
    "src": "hackernews",
    "id": "47400868",
    "title": "Show HN: Claude Code skills that build complete Godot games",
    "body": "I’ve been working on this for about a year through four major rewrites. Godogen is a pipeline that takes a text prompt, designs the architecture, generates 2D&#x2F;3D assets, writes the GDScript, and tests it visually. The output is a complete, playable Godot 4 project.<p>Getting LLMs to reliably generate functional games required solving three specific engineering bottlenecks:<p>1. The Training Data Scarcity: LLMs barely know GDScript. It has ~850 classes and a Python-like syntax that will happily let a model hallucinate Python idioms that fail to compile. To fix this, I built a custom reference system: a hand-written language spec, full API docs converted from Godot&#x27;s XML source, and a quirks database for engine behaviors you can&#x27;t learn from docs alone. Because 850 classes blow up the context window, the agent lazy-loads only the specific APIs it needs at runtime.<p>2. The Build-Time vs. Runtime State: Scenes are generated by headless scripts that build the node graph in memory and serialize it to .tscn files. This avoids the fragility of hand-editing Godot&#x27;s serialization format. But it means certain engine features (like `@onready` or signal connections) aren&#x27;t available at build time—they only exist when the game actually runs. Teaching the model which APIs are available at which phase — and that every node needs its owner set correctly or it silently vanishes on save — took careful prompting but paid off.<p>3. The Evaluation Loop: A coding agent is inherently biased toward its own output. To stop it from cheating, a separate Gemini Flash agent acts as visual QA. It sees only the rendered screenshots from the running engine—no code—and compares them against a generated reference image. 
It catches the visual bugs text analysis misses: z-fighting, floating objects, physics explosions, and grid-like placements that should be organic.<p>Architecturally, it runs as two Claude Code skills: an orchestrator that plans the pipeline, and a task executor that implements each piece in a `context: fork` window so mistakes and state don&#x27;t accumulate.<p>Everything is open source: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;htdt&#x2F;godogen\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;htdt&#x2F;godogen</a><p>Demo video (real games, not cherry-picked screenshots): <a href=\"https:&#x2F;&#x2F;youtu.be&#x2F;eUz19GROIpY\" rel=\"nofollow\">https:&#x2F;&#x2F;youtu.be&#x2F;eUz19GROIpY</a><p>Blog post with the full story (all the wrong turns) coming soon. Happy to answer questions.",
    "url": "https://github.com/htdt/godogen",
    "upvotes": 337,
    "comments": 205,
    "sub": "hackernews",
    "signal": 41,
    "hits": [
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "39641105",
    "title": "Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps",
    "body": "Hi HN, we are the founders of Relari, the company behind continuous-eval (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;relari-ai&#x2F;continuous-eval\">https:&#x2F;&#x2F;github.com&#x2F;relari-ai&#x2F;continuous-eval</a>), an evaluation framework that lets you test your GenAI systems at the component level, pinpointing issues where they originate.<p>We experienced the need for this when we were building a copilot for bankers. Our RAG pipeline blew up in complexity as we added components: a query classifier (to triage user intent), multiple retrievers (to grab information from different sources), filtering LLM (to rerank &#x2F; compress context), a calculator agent (to call financial functions) and finally the synthesizer LLM that gives the answer. Ensuring reliability became more difficult with each of these we added.<p>When a bad response was detected by our answer evaluator, we had to backtrack multiple steps to understand which component(s) made a mistake. But this quickly became unscalable beyond a few samples.<p>I did my Ph.D. in fault detection for autonomous vehicles, and I see a strong parallel between the complexity of autonomous driving software and today&#x27;s LLM pipelines. In self-driving systems, sensors, perception, prediction, planning, and control modules are all chained together. To ensure system-level safety, we use granular metrics to measure the performance of each module individually. When the vehicle makes an unexpected decision, we use these metrics to pinpoint the problem to a specific component. Only then can we make targeted improvements, systematically.<p>Based on this thinking, we developed the first version of continuous-eval for ourselves. Since then we’ve made it more flexible to fit various types of GenAI pipelines. Continuous-eval allows you to describe (programmatically) your pipeline and modules, and select metrics for each module. 
We developed 30+ metrics to cover retrieval, text generation, code generation, classification, agent tool use, etc. We now have a number of companies using us to test complex pipelines like finance copilots, enterprise search, coding agents, etc.<p>As an example, one customer was trying to understand why their RAG system did poorly on trend analysis queries. Through continuous-eval, they realized that the “retriever” component was retrieving 80%+ of all relevant chunks, but the “reranker” component, that filters out “irrelevant” context, was dropping that to below 50%. This enabled them to fix the problem, in their case by skipping the reranker for certain queries.<p>We’ve also built ensemble metrics that do a surprisingly good job of predicting user feedback. Users often rate LLM-generated answers by giving a thumbs up&#x2F;down about how good the answer was. We train our custom metrics on this user data, and then use those metrics to generate thumbs up&#x2F;down ratings on future LLM answers. The results turn out to be 90% aligned with what the users say. This gives developers a feedback loop from production data to offline testing and development. Some customers have found this to be our most unique advantage.<p>Lastly, to make the most out of evaluation, you should use a diverse dataset—ideally with ground truth labels for comprehensive and consistent assessment. Because ground truth labels are costly and time-consuming to curate manually, we also have a synthetic data generation pipeline that allows you to get started quickly. Try it here (<a href=\"https:&#x2F;&#x2F;www.relari.ai&#x2F;#synthetic_data_demo\" rel=\"nofollow\">https:&#x2F;&#x2F;www.relari.ai&#x2F;#synthetic_data_demo</a>)<p>What’s been your experience testing and iterating LLM apps? Please let us know your thoughts and feedback on our approaches (modular framework, leveraging user feedback, testing with synthetic data).",
    "url": "https://news.ycombinator.com/item?id=39641105",
    "upvotes": 106,
    "comments": 15,
    "sub": "hackernews",
    "signal": 40.3,
    "hits": [
      "coding agent",
      "rag pipeline",
      "tool use",
      "retrieval"
    ]
  },
  {
    "src": "hackernews",
    "id": "35042836",
    "title": "Launch HN: Vellum (YC W23) – Dev Platform for LLM Apps",
    "body": "Hi HN – Noa, Akash, and Sidd here. We’re building Vellum (<a href=\"https:&#x2F;&#x2F;www.vellum.ai\">https:&#x2F;&#x2F;www.vellum.ai</a>), a developer platform for building on LLMs like OpenAI’s GPT-3 and Anthropic’s Claude. We provide tools for efficient prompt engineering, semantic search, performance monitoring, and fine-tuning, helping you bring LLM-powered features from prototype to production.<p>The MLOps industry has matured rapidly for traditional ML (typically open-source models hosted in-house), but companies using LLMs are suffering from a lack of tooling to support things like experimentation, version control, and monitoring. They’re forced to build these tools themselves, taking valuable engineering time away from their core product.<p>There are 4 main pain points. (1) Prompt engineering is tedious and time consuming. People iterate on prompts in playgrounds of individual model providers and store results in spreadsheets or documents. Testing across many test cases is usually not done because of the manual nature of prompt engineering. (2) LLM calls against a corpus of text are not possible without semantic search. Due to limited context windows, any time an LLM has to return factual data from a set of documents, companies need to create embeddings, store them in a vector database and host semantic search models to query for relevant results at runtime; building this infrastructure is complex and time consuming. (3) There is limited observability &#x2F; monitoring once LLMs are used in production. With no baseline for how something is performing, it’s scary making changes to it for fear of making it worse; and (4) Creating fine-tuned models and re-training them as new data becomes available is rarely done despite the potential gains (higher quality, lower cost, lower latency, more defensibility). 
Companies don’t usually have the capacity to build the infrastructure for collecting high-quality training data and the automation pipelines used to re-train and evaluate new models.<p>We know these pain points from experience. Sidd and Noa are engineers who worked at Quora and DataRobot building ML tooling. Then the three of us worked together for a couple years at Dover (YC S19), where we built features powered by GPT-3 when it was still in beta. Our first production feature was a job description writer, followed by a personalized recruiting email generator and then a classifier for email responses.<p>We found it was easy enough to prototype, but taking features to production and improving them was a different story. It was a pain to keep track of what prompts we had tried and to monitor how they were performing under real user inputs. We wished we could version control our prompts, roll back, and even A&#x2F;B test. We found ourselves investing in infrastructure that had nothing to do with our core features (e.g. semantic search). We ended up being scared to change prompts or try different models for fear of breaking existing behavior. As new LLM providers and foundation models were released, we wished we could compare them and use the best tool for the job, but didn’t have the time to evaluate them ourselves. And so on.<p>It’s clear that better tools are required for businesses to adopt LLMs at scale, and we realized we were in a good position to build them, so here we are! Vellum consists of 4 systems to address the pain points mentioned above:<p>(1) Playground—a UI for iterating on prompts side-by-side and validating them against multiple test cases at once. Prompt variants may differ in their text, underlying model, model parameters (e.g. “temperature”), and even LLM provider. Each run is saved as a history item and has a permanent url that can be shared with teammates.<p>(2) Search—upload a corpus of text (e.g. 
your company help docs) in our UI (PDF&#x2F;TXT) and Vellum will convert the text to embeddings and store it in a vector database to be used at run time. While making an LLM call, we inject relevant context from your documents into the query and instruct the LLM to only answer factually using the provided context. This helps prevent hallucination and avoids you having to manage your own embeddings, vector store, and semantic search infra.<p>(3) Manage—a low-latency, high-reliability API wrapper that’s provider-agnostic across OpenAI, Cohere, and Anthropic (with more coming soon). Every request is captured and persisted in one place, providing full observability into what you’re sending these models, what they’re giving back, and their performance. Prompts and model providers can be updated without code changes. You can replay historical requests and version history is maintained. This serves as a data layer for metrics, monitoring, and soon, alerting.<p>(4) Optimize—the data collected in Manage is used to passively build up training data, which can be used to fine-tune your own proprietary models. With enough high quality input&#x2F;output pairs (minimum 100, but depends on the use case), Vellum can produce fine-tuned models to provide better quality, lower cost or lower latency. If a new model solves a problem better, it can be swapped without code changes.<p>We also offer periodic evaluation against alternative models (i.e. we can see if fine-tuning Curie produces results of comparable quality to Davinci, but at a lower price). 
Even though OpenAI is the dominant model provider today, we expect there to be many providers with strong foundation models, and in that case model interoperability will be key!<p>Here’s a video demo showcasing Vellum (feel free to watch on 1.5x!): <a href=\"https:&#x2F;&#x2F;www.loom.com&#x2F;share&#x2F;5dbdb8ae87bb4a419ade05d92993e5a0\" rel=\"nofollow\">https:&#x2F;&#x2F;www.loom.com&#x2F;share&#x2F;5dbdb8ae87bb4a419ade05d92993e5a0</a>.<p>We currently charge a flat monthly platform fee that varies based on the quantity and complexity of your use-cases. In the future, we plan on having more transparent pricing that’s made up of a fixed platform fee + some usage-based component (e.g. number of tokens used or requests made).<p>If you look at our website you’ll notice the dreaded “Request early access” rather than “Try now”. That’s because the LLM Ops space is evolving extremely quickly right now. To maximize our learning rate, we need to work intensively with a few early customers to help get their AI use cases into production. We’ll invite self-serve signups once that core feature set has stabilized a bit more. In the meantime, if you’re interested in being one of our early customers, we’d love to hear from you and you can request early access here: <a href=\"https:&#x2F;&#x2F;www.vellum.ai&#x2F;landing-pages&#x2F;hacker-news\">https:&#x2F;&#x2F;www.vellum.ai&#x2F;landing-pages&#x2F;hacker-news</a>.<p>We deeply value the expertise of the HN community! We’d love to hear your comments and get your perspective on our overall direction, the problems we’re aiming to solve, our solution so far, and anything we may be missing. We hope this post and our demo video provide enough material to start a good conversation and we look forward to your thoughts, questions, and feedback!",
    "url": "https://news.ycombinator.com/item?id=35042836",
    "upvotes": 136,
    "comments": 40,
    "sub": "hackernews",
    "signal": 39.8,
    "hits": [
      "prompt engineering",
      "llm ops",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "45504388",
    "title": "Launch HN: LlamaFarm (YC W22) – Open-source framework for distributed AI",
    "body": "Hi HN! We&#x27;re Rob, Matt, and Rachel from LlamaFarm (<a href=\"https:&#x2F;&#x2F;llamafarm.dev\">https:&#x2F;&#x2F;llamafarm.dev</a>). We&#x27;re building an open-source AI framework based on a simple belief: the future isn&#x27;t one massive model in the cloud—it&#x27;s specialized models running everywhere, continuously fine-tuned from real usage.<p>The problem: We were building AI tools and kept falling into the same trap. AI demos die before production. We built a bunch of AI demos but they were impossible to get to production.  It would work perfectly on our laptop, but when we deployed it, something broke, and RAG would degrade. If we were running our own model, it would quickly become out of date. The proof-of-concept that impressed the team couldn&#x27;t handle real-world data.<p>Our solution: declarative AI-as-code. One YAML defines models, policies, data, evals, and deploy. Instead of one brittle giant, we orchestrate a Mixture of Experts—many small, specialized models you continuously fine-tune from real usage. With RAG for source-grounded answers, systems get cheaper, faster, and auditable.<p>There’s a short demo here: <a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=W7MHGyN0MdQ\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=W7MHGyN0MdQ</a> and a more in-depth one at  <a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=HNnZ4iaOSJ4\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=HNnZ4iaOSJ4</a>.<p>Ultimately, we want to deliver a single, signed bundle—models + retrieval + database + API + tests—that runs anywhere: cloud, edge, or air-gapped. No glue scripts. No surprise egress bills. Your data stays in your runtime.<p>We believe that the AI industry is evolving like computing did. Just as we went from mainframes to distributed systems and monolithic apps to microservices, AI is following the same path: models are getting smaller and better. Mixture of Experts is here to stay. 
Qwen3 is sick. Llama 3.2 runs on phones. Phi-3 fits on edge devices. Domain models beat GPT-5 on specific tasks.<p>RAG brings specialized data to your model: You don&#x27;t need a 1T parameter model that &quot;knows everything.&quot; You need a smart model that can read <i>your</i> data. Fine-tuning is democratizing: what cost $100k last year now costs $500. Every company will have custom models.<p>Data gravity is real: Your data wants to stay where it is: on-prem, in your AWS account, on employee laptops.<p>Bottom line: LlamaFarm turns AI from experiments into repeatable, secure releases, so teams can ship fast.<p>What we have working today: Full RAG pipeline: 15+ document formats, programmatic extraction (no LLM calls needed), vector-database embedding, universal model layer that runs the same code for 25+ providers, automatic failover, cost-based routing; Truly portable: Identical behavior from laptop → datacenter → cloud; Real deployment: Docker Compose works now with Kubernetes basics and cloud templates on the way.<p>Check out our readme&#x2F;quickstart for easy install instructions: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;llama-farm&#x2F;llamafarm?tab=readme-ov-file#-quickstart-tldr\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;llama-farm&#x2F;llamafarm?tab=readme-ov-file#-...</a><p>Or just grab a binary for your platform directly from the latest release:\n  <a href=\"https:&#x2F;&#x2F;github.com&#x2F;llama-farm&#x2F;llamafarm&#x2F;releases&#x2F;latest\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;llama-farm&#x2F;llamafarm&#x2F;releases&#x2F;latest</a><p>The vision is to be able to run, update, and continuously fine-tune dozens of models across environments with built-in RAG and evaluations, all wrapped in a self-healing runtime. We have an MVP of that today (with a lot more to do!).<p>We’d love to hear your feedback! Think we’re way off? Spot on? Want us to build something for your specific use case? We’re here for all your comments!",
    "url": "https://github.com/llama-farm/llamafarm",
    "upvotes": 106,
    "comments": 71,
    "sub": "hackernews",
    "signal": 39.3,
    "hits": [
      "rag pipeline",
      "evals",
      "retrieval",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "44053754",
    "title": "Show HN: Representing Agents as MCP Servers",
    "body": "Hey HN! A few months ago we shared mcp-agent (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;lastmile-ai&#x2F;mcp-agent\">https:&#x2F;&#x2F;github.com&#x2F;lastmile-ai&#x2F;mcp-agent</a>) [1][2], a lightweight framework that implements every agent pattern from Anthropic’s Building Effective Agents blog [3] and handles MCP server&#x2F;client management seamlessly. Our core bet is that connecting LLMs to tools, resources, and external systems will soon be MCP-native by default.<p>Today we&#x27;re launching a significant update: Agents as MCP servers.<p>Currently &quot;agentic&quot; behavior exists only on the MCP client side – clients like Claude or Cursor use MCP servers to solve tasks. With this update, Agents can be MCP servers themselves, so that any MCP client can invoke, coordinate and orchestrate agents the same way it does with any other MCP server.<p>This paradigm shift enables:\n1. Agent Composition: Build complex multi-agent systems over the same base protocol (MCP).\n2. Platform Independence: Use your agents from any MCP-compatible client\n3. Scalability: Run agent workflows on dedicated infrastructure, not just within client environments\n4. Customization: Develop your own agent workflows and reuse them across any MCP client.<p>How an agent server is implemented:<p>We’ve implemented this in mcp-agent with Workflows. Each workflow is an agent application that can interact with other MCP servers (e.g. summarizing GitHub issues → Slack message). mcp-agent exposes workflows as MCP tools on an MCP Agent Server [5]:<p>- workflows&#x2F;list – list available workflows\n- workflows&#x2F;{WorkflowName}&#x2F;run – Execute the workflow (async)\n- workflows&#x2F;{WorkflowName}&#x2F;get_status – Check workflow status\n- workflows&#x2F;{WorkflowName}&#x2F;resume – Resume paused workflow (e.g. 
with human input)\n- workflows&#x2F;{WorkflowName}&#x2F;cancel – Terminate workflow<p>We’ve also implemented Temporal for durable execution [6], so agent workflows can be paused, resumed and retried in production settings.<p>This demo [7] shows Claude invoking an MCP agent server, running workflows when appropriate, and polling for status. It basically shows agentic behavior on both the MCP client and MCP server side.<p>We&#x27;re excited about the potential this unlocks—especially as more applications become MCP-compatible clients. We&#x27;d love your feedback and ideas!<p>[1] - <a href=\"https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=42867050\">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=42867050</a><p>[2] - <a href=\"https:&#x2F;&#x2F;github.com&#x2F;lastmile-ai&#x2F;mcp-agent\">https:&#x2F;&#x2F;github.com&#x2F;lastmile-ai&#x2F;mcp-agent</a><p>[3] - <a href=\"https:&#x2F;&#x2F;www.anthropic.com&#x2F;research&#x2F;building-effective-agents\" rel=\"nofollow\">https:&#x2F;&#x2F;www.anthropic.com&#x2F;research&#x2F;building-effective-agents</a><p>[4] - <a href=\"https:&#x2F;&#x2F;github.com&#x2F;github&#x2F;github-mcp-server\">https:&#x2F;&#x2F;github.com&#x2F;github&#x2F;github-mcp-server</a><p>[5] - <a href=\"https:&#x2F;&#x2F;github.com&#x2F;lastmile-ai&#x2F;mcp-agent&#x2F;tree&#x2F;main&#x2F;examples&#x2F;mcp_agent_server&#x2F;asyncio\">https:&#x2F;&#x2F;github.com&#x2F;lastmile-ai&#x2F;mcp-agent&#x2F;tree&#x2F;main&#x2F;examples&#x2F;...</a><p>[6] - <a href=\"https:&#x2F;&#x2F;github.com&#x2F;lastmile-ai&#x2F;mcp-agent&#x2F;tree&#x2F;main&#x2F;examples&#x2F;temporal\">https:&#x2F;&#x2F;github.com&#x2F;lastmile-ai&#x2F;mcp-agent&#x2F;tree&#x2F;main&#x2F;examples&#x2F;...</a><p>[7] - <a href=\"https:&#x2F;&#x2F;youtu.be&#x2F;pLe2GAjEoYs\" rel=\"nofollow\">https:&#x2F;&#x2F;youtu.be&#x2F;pLe2GAjEoYs</a> [DEMO]",
    "url": "https://github.com/lastmile-ai/mcp-agent/tree/main/examples/mcp_agent_server",
    "upvotes": 58,
    "comments": 16,
    "sub": "hackernews",
    "signal": 38.1,
    "hits": [
      "agent workflow",
      "mcp agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "47244042",
    "title": "Show HN: Armalo AI – The Infrastructure for Agent Networks",
    "body": "Hey HN — I&#x27;m Ryan, founder of Armalo AI (<a href=\"https:&#x2F;&#x2F;armalo.ai\" rel=\"nofollow\">https:&#x2F;&#x2F;armalo.ai</a>). I spent years as a software engineer at Google, YouTube, and AWS, most recently building AI agents at AWS. Watching those systems interact in production — and seeing the same gaps appear over and over — convinced me that the missing piece wasn&#x27;t more capable agents, but the infrastructure underneath them. So I left to build it.<p>Armalo AI is the infrastructure layer that multi-agent AI networks need to actually function in production.<p>THE PROBLEM<p>Every week there&#x27;s a new story about an AI agent deleting a production database, a multi-agent workflow cascading into failure, or an autonomous system doing something its operator never intended. We dug into 2025&#x27;s worst incidents and found a consistent root cause: agents have no accountability layer.<p>You can&#x27;t Google an agent&#x27;s reputation. When one agent delegates to another, there&#x27;s no escrow, no contract, no recourse. State doesn&#x27;t persist across a network. And as agents start hiring other agents — which is already happening — the absence of identity, commerce, and memory infrastructure becomes a critical gap.<p>Benchmarks measure capability. We measure reliability.<p>WHAT WE BUILT<p>Armalo is three integrated layers:<p>1. Trust &amp; Reputation<p>Agents earn a PactScore: a 0–1000 score across five behavioral dimensions — task completion, policy compliance, latency, safety, and peer attestation. Four certification tiers (Bronze → Gold). Scores are cryptographically verifiable and on-chain. When automated verification isn&#x27;t enough, our LLM-powered Jury system brings multi-model judgment to disputes. All of it is queryable via REST API in sub-second latency.<p>2. Agent Commerce<p>Agents can define behavioral pacts — machine-readable contracts that specify what they promise to deliver. 
These are backed by USDC escrow on Base L2 via smart contracts. Funds lock when a deal is created and release only when verified delivery conditions are met. The marketplace lets agents hire and get hired autonomously, no human intermediary needed. We also support x402 pay-per-call: agents pay $0.001&#x2F;score lookup in USDC with no API key, no account, no human billing setup.<p>3. Memory &amp; Coordination<p>Memory Mesh gives agents persistent shared state across a network. Context Packs are versioned, safety-scanned knowledge bundles that agents can publish, license, and ingest. Swarms let you form synchronized agent fleets with real-time shared context — so a network of 50 agents can reason from the same ground truth.<p>THE FULL STACK<p>Beyond the three core layers, we&#x27;ve shipped: OpenClaw MCP (25 tools for Claude, Cursor, LangChain), Jarvis (an agent terminal for interacting with the platform), PactLabs (our research arm — working on trust algorithms, collusion detection, adversarial robustness, and optimal escrow sizing), real-time monitoring and alerting, and a governance forum where trust-weighted agents post, vote, and collaborate.<p>WHY ON-CHAIN<p>We get that &quot;on-chain&quot; raises eyebrows in some HN circles. Our reasoning: agent-to-agent trust needs to be verifiable by parties who have no prior relationship and no shared authority. Cryptographic verification at every layer, with an open protocol, means any agent framework can interoperate with Armalo AI&#x27;s trust signals without going through us as an intermediary. We&#x27;re not building a walled garden.<p>PRICING<p>Free tier (1 agent, 3 evals&#x2F;month), Pro at $99 USDC&#x2F;month (10 agents, unlimited evals, escrow, jury access), Enterprise at $2,999&#x2F;month. Or pure pay-per-call via x402 — no subscription required.<p>We&#x27;d love feedback from builders working on multi-agent systems. What&#x27;s the hardest part of trust and coordination you&#x27;ve hit in production?",
    "url": "https://news.ycombinator.com/item?id=47244042",
    "upvotes": 3,
    "comments": 8,
    "sub": "hackernews",
    "signal": 37.8,
    "hits": [
      "agent workflow",
      "langchain",
      "evals",
      "benchmark"
    ]
  },
  {
    "src": "hackernews",
    "id": "41451698",
    "title": "Show HN: Laminar – Open-Source DataDog + PostHog for LLM Apps, Built in Rust",
    "body": "Hey HN, we’re Robert, Din and Temirlan from Laminar (<a href=\"https:&#x2F;&#x2F;www.lmnr.ai\">https:&#x2F;&#x2F;www.lmnr.ai</a>), an open-source observability and analytics platform for complex LLM apps. It’s designed to be fast, reliable, and scalable. The stack is RabbitMQ for message queues, Postgres for storage, Clickhouse for analytics, Qdrant for semantic search - all powered by Rust.<p>How is Laminar different from the swarm of other “LLM observability” platforms?<p>On the observability part, we’re focused on handling full execution traces, not just LLM calls. We built a Rust ingestor for OpenTelemetry (Otel) spans with GenAI semantic conventions. As LLM apps get more complex (think Agents with hundreds of LLM and function calls, or complex RAG pipelines), full tracing is critical. With Otel spans, we can: 1. Cover the entire execution trace. 2. Keep the platform future-proof 3. Leverage an amazing OpenLLMetry (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;traceloop&#x2F;openllmetry\">https:&#x2F;&#x2F;github.com&#x2F;traceloop&#x2F;openllmetry</a>), open-source package for span production.<p>The key difference is that we tie text analytics directly to execution traces. Rich text data makes LLM traces unique, so we let you track “semantic metrics” (like what your AI agent is actually saying) and connect those metrics to where they happen in the trace. If you want to know if your AI drive-through agent made an upsell, you can design an LLM extraction pipeline in our builder (more on it later), host it on Laminar, and handle everything from event requests to output logging. Processing requests simply come as events in the Otel span.<p>We think it’s a win to separate core app logic from LLM event processing. 
Most devs don’t want to manage background queues for LLM analytics processing but still want insights into how their Agents or RAGs are working.<p>Our Pipeline Builder uses a graph UI where nodes are LLM and util functions and edges show data flow. We built a custom task execution engine with support for parallel branch executions, cycles and branches (it’s overkill for simple pipelines, but it’s extremely cool and we’ve spent a lot of time designing a robust engine). You can also call pipelines directly as API endpoints. We found them to be extremely useful for iterating on and separating LLM logic. Laminar also traces pipelines directly, which removes the overhead of sending large outputs over the network.<p>One thing missing from all LLM observability platforms right now is an adequate search over traces. We’re attacking this problem by indexing each span in a vector DB and performing hybrid search at query time. This feature is still in beta, but we think it’s gonna be a crucial part of our platform going forward.<p>We also support evaluations. We loved the “run everything locally, send results to a server” approach from Braintrust and Weights &amp; Biases, so we did that too: a simple SDK and nice dashboards to track everything. Evals are still early, but we’re pushing hard on them.<p>Our goal is to make Laminar the Supabase for LLMOps - the go-to open-source comprehensive platform for all things LLMs &#x2F; GenAI. In its current shape, Laminar is just a few weeks old and developing rapidly. We’d love any feedback or for you to give Laminar a try in your LLM projects!",
    "url": "https://github.com/lmnr-ai/lmnr",
    "upvotes": 203,
    "comments": 45,
    "sub": "hackernews",
    "signal": 37,
    "hits": [
      "rag pipeline",
      "evals",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "44564248",
    "title": "Context Rot: How increasing input tokens impacts LLM performance",
    "body": "I work on research at Chroma, and I just published our latest technical report on context rot.<p>TLDR: Model performance is non-uniform across context lengths, including state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models.<p>This highlights the need for context engineering. Whether relevant information is present in a model’s context is not all that matters; what matters more is how that information is presented.<p>Here is the complete open-source codebase to replicate our results: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;chroma-core&#x2F;context-rot\">https:&#x2F;&#x2F;github.com&#x2F;chroma-core&#x2F;context-rot</a>",
    "url": "https://research.trychroma.com/context-rot",
    "upvotes": 260,
    "comments": 59,
    "sub": "hackernews",
    "signal": 36,
    "hits": [
      "context engineering"
    ]
  },
  {
    "src": "hackernews",
    "id": "39510874",
    "title": "Show HN: R2R – Open-source framework for production-grade RAG",
    "body": "Hello HN, I&#x27;m Owen from SciPhi (<a href=\"https:&#x2F;&#x2F;www.sciphi.ai&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;www.sciphi.ai&#x2F;</a>), a startup working on simplifying Retrieval-Augmented Generation (RAG). Today we’re excited to share R2R (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;SciPhi-AI&#x2F;R2R\">https:&#x2F;&#x2F;github.com&#x2F;SciPhi-AI&#x2F;R2R</a>), an open-source framework that makes it simpler to develop and deploy production-grade RAG systems.<p>Just a quick reminder: RAG helps Large Language Models (LLMs) use current information and specific knowledge. For example, it allows a programming assistant to use your latest documents to answer questions. The idea is to gather all the relevant information (&quot;retrieval&quot;) and present it to the LLM with a question (&quot;augmentation&quot;). This way, the LLM can provide answers (“generation”) as though it was trained directly on your data.<p>The R2R framework is a powerful tool for addressing key challenges in deploying RAG systems, avoiding the complex abstractions common in other projects. Through conversations with numerous developers, we discovered that many were independently developing similar solutions. R2R distinguishes itself by adopting a straightforward approach to streamline the setup, monitoring, and upgrading of RAG systems. Specifically, it focuses on reducing unnecessary complexity and enhancing the visibility and tracking of system performance.<p>The key parts of R2R include: an Ingestion Pipeline that transforms different data types (like json, txt, pdf, html) into &#x27;Documents&#x27; ready for embedding. Next, the Embedding Pipeline takes text and turns it into vector embeddings through various processes (such as extracting text, transforming it, chunking, and embedding). 
Finally, the RAG Pipeline follows the steps of the embedding pipeline but adds an LLM provider to create text completions.<p>R2R is currently in use at several companies building applications from B2B lead generation to educational tools for consumers.<p>Our GitHub repo (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;SciPhi-AI&#x2F;R2R\">https:&#x2F;&#x2F;github.com&#x2F;SciPhi-AI&#x2F;R2R</a>) includes basic examples for application deployment and standalone use, demonstrating the framework&#x27;s adaptability in a simple way.<p>We’d love for you to give R2R a try, and welcome your feedback and comments as we refine and develop it further!",
    "url": "https://github.com/SciPhi-AI/R2R",
    "upvotes": 167,
    "comments": 57,
    "sub": "hackernews",
    "signal": 36,
    "hits": [
      "rag pipeline",
      "retrieval",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "46324665",
    "title": "Show HN: I open-sourced my Go and Next B2B SaaS Starter (deploy anywhere, MIT)",
    "body": "Hi HN, I&#x27;m Mohammed, a technical founder who loves shipping and giving back to the community. I&#x27;m open-sourcing the full-stack engine that powers my B2B product, apflow.co.<p>What it is: A production B2B starter with a Go backend and Next.js frontend. Both are fully Dockerized with separate containers. No Vercel. No Supabase. Deploy the whole thing on a $6 VPS, or split frontend and backend across different providers. You own the infrastructure.<p>The problem I was solving:<p>Every SaaS starter I evaluated had the same issue: they locked me into someone else&#x27;s platform. Vercel for hosting. PlanetScale for the database. Serverless functions billing per invocation. Fine for prototypes, but costs become unpredictable at scale and migrating away is painful.<p>I wanted something I could deploy on any Linux box with docker-compose up. Something where I could host the frontend on Cloudflare Pages and the backend on a Hetzner VPS if I wanted. No vendor-specific APIs buried in my code.<p>Why Go for the backend:<p>Go gives me exactly what I need for a SaaS backend:<p>Tiny footprint. The backend idles at ~50MB RAM. On a cheap VPS, that headroom lets me run more services without upgrading.\nConcurrency without complexity. Billing webhooks, file uploads, and AI calls run concurrently without callback hell.\nCompile-time type safety. Using SQLC, my SQL compiles to type-safe Go. If the query is wrong, it fails at build time, not in production.\nPredictable performance. No garbage collection pauses that surprise you under load.\nThe architecture (Modular Monolith):<p>I didn&#x27;t want microservices complexity for a small team, but I needed clean separation. I built a Modular Monolith: features like Auth, Billing, and AI are isolated Go modules with explicit interfaces, but they deploy as a single binary.<p>This structure also made AI coding tools (Cursor, Claude Code) dramatically more effective. 
Because every module has strict boundaries, the AI knows exactly where new code belongs and doesn&#x27;t break other modules.<p>Full-stack, not just backend:<p>Backend: Go 1.25 + Gin + SQLC (type-safe SQL, no ORM) + PostgreSQL with pgvector\nFrontend: Next.js 16 + React 19 + Tailwind + shadcn&#x2F;ui\nCommunication: The frontend consumes a clean REST API. You can swap Next.js for any framework that speaks HTTP.\nInfrastructure: Separate Dockerfiles for frontend and backend. Deploy together or apart.\nWhat&#x27;s pre-built:<p>The boring infrastructure is solved so you can focus on your actual product:<p>Auth + RBAC: Stytch B2B integration with Organizations, Teams, and Roles. Multi-tenant data isolation enforced at the query level.\nBilling: Polar.sh as Merchant of Record. Handles subscriptions, invoices, and global tax&#x2F;VAT. No Stripe webhook edge cases.\nAI Pipeline: OpenAI RAG using pgvector. The retrieval service enforces strict context boundaries to minimize hallucinations.\nOCR: Mistral integration for document extraction.\nFile Storage: Cloudflare R2 integration.\nEach feature is a separate module. Don&#x27;t need OCR? Remove it. Want Stripe instead of Polar? The billing interface is abstracted.<p>Real-world proof:<p>This isn&#x27;t a template I made for GitHub stars. It&#x27;s the exact code running apflow.co in production. When I added document OCR, I built it as a new module without touching Auth or Billing. The architecture held.<p>How to try it:<p>Clone the repo, read setup.md to check the prerequisite, run .&#x2F;setup.sh, and you have a working B2B environment locally in minutes.<p>Feedback I want:<p>I&#x27;d appreciate feedback from Go developers on the module boundaries and cross-module interfaces. 
Also curious if anyone has suggestions for the Docker setup in production deployments.<p>GitHub: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;moasq&#x2F;production-saas-starter\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;moasq&#x2F;production-saas-starter</a><p>Live: <a href=\"https:&#x2F;&#x2F;apflow.co\" rel=\"nofollow\">https:&#x2F;&#x2F;apflow.co</a>",
    "url": "https://github.com/moasq/production-saas-starter",
    "upvotes": 83,
    "comments": 35,
    "sub": "hackernews",
    "signal": 35.1,
    "hits": [
      "claude code",
      "retrieval",
      "vector"
    ]
  },
  {
    "src": "github",
    "id": "1095081803",
    "title": "ratel-ai/ratel",
    "body": "Context engineering for AI agents. ~80% fewer tokens. Fix tool overload. Skills and memory with in-process BM25 retrieval. No vector DB. No embeddings. accuracy agents claude-skills context harness llm llm-routing mcp mcp-server memory optimization rag skills token-optimization tool-calling tool-selection",
    "url": "https://github.com/ratel-ai/ratel",
    "upvotes": 159,
    "comments": 16,
    "sub": "github",
    "signal": 35.1,
    "hits": [
      "context engineering",
      "retrieval",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "46237358",
    "title": "Show HN: Autofix Bot – Hybrid static analysis and AI code review agent",
    "body": "Hi there, HN! We’re Jai and Sanket from DeepSource (YC W20), and today we’re launching Autofix Bot, a hybrid static analysis + AI agent purpose-built for in-the-loop use with AI coding agents.<p>AI coding agents have made code generation nearly free, and they’ve shifted the bottleneck to code review. Static-only analysis with a fixed set of checkers isn’t enough. LLM-only review has several limitations: non-deterministic across runs, low recall on security issues, expensive at scale, and a tendency to get ‘distracted’.<p>We spent the last 6 years building a deterministic, static-analysis-only code review product. Earlier this year, we started thinking about this problem from the ground up and realized that static analysis solves key blind spots of LLM-only reviews. Over the past six months, we built a new ‘hybrid’ agent loop that uses static analysis and frontier AI agents together to outperform both static-only and LLM-only tools in finding and fixing code quality and security issues. Today, we’re opening it up publicly.<p>Here’s how the hybrid architecture works:<p>- Static pass: 5,000+ deterministic checkers (code quality, security, performance) establish a high-precision baseline. A sub-agent suppresses context-specific false positives.<p>- AI review: The agent reviews code with static findings as anchors. Has access to AST, data-flow graphs, control-flow, import graphs as tools, not just grep and usual shell commands.<p>- Remediation: Sub-agents generate fixes. 
Static harness validates all edits before emitting a clean git patch.<p>Static solves key LLM problems: non-determinism across runs, low recall on security issues (LLMs get distracted by style), and cost (static narrowing reduces prompt size and tool calls).<p>On the OpenSSF CVE Benchmark [1] (200+ real JS&#x2F;TS vulnerabilities), we hit 81.2% accuracy and 80.0% F1; vs Cursor Bugbot (74.5% accuracy, 77.42% F1), Claude Code (71.5% accuracy, 62.99% F1), CodeRabbit (59.4% accuracy, 36.19% F1), and Semgrep CE (56.9% accuracy, 38.26% F1). \nOn secrets detection, 92.8% F1; vs Gitleaks (75.6%), detect-secrets (64.1%), and TruffleHog (41.2%). We use our open-source classification model for this. [2]<p>Full methodology and how we evaluated each tool: <a href=\"https:&#x2F;&#x2F;autofix.bot&#x2F;benchmarks\" rel=\"nofollow\">https:&#x2F;&#x2F;autofix.bot&#x2F;benchmarks</a><p>You can use Autofix Bot interactively on any repository using our TUI, as a plugin in Claude Code, or with our MCP on any compatible AI client (like OpenAI Codex).[3] We’re specifically building for AI coding agent-first workflows, so you can ask your agent to run Autofix Bot on every checkpoint autonomously.<p>Give us a shot today: <a href=\"https:&#x2F;&#x2F;autofix.bot\" rel=\"nofollow\">https:&#x2F;&#x2F;autofix.bot</a>. We’d love to hear any feedback!<p>---<p>[1] <a href=\"https:&#x2F;&#x2F;github.com&#x2F;ossf-cve-benchmark&#x2F;ossf-cve-benchmark\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;ossf-cve-benchmark&#x2F;ossf-cve-benchmark</a><p>[2] <a href=\"https:&#x2F;&#x2F;huggingface.co&#x2F;deepsource&#x2F;Narada-3.2-3B-v1\" rel=\"nofollow\">https:&#x2F;&#x2F;huggingface.co&#x2F;deepsource&#x2F;Narada-3.2-3B-v1</a><p>[3] <a href=\"https:&#x2F;&#x2F;autofix.bot&#x2F;manual&#x2F;#terminal-ui\" rel=\"nofollow\">https:&#x2F;&#x2F;autofix.bot&#x2F;manual&#x2F;#terminal-ui</a>",
    "url": "https://news.ycombinator.com/item?id=46237358",
    "upvotes": 37,
    "comments": 13,
    "sub": "hackernews",
    "signal": 34.5,
    "hits": [
      "claude code",
      "coding agent",
      "benchmark"
    ]
  },
  {
    "src": "hackernews",
    "id": "47472965",
    "title": "Show HN: ClawMem – Open-source agent memory with SOTA local GPU retrieval",
    "body": "So I&#x27;ve been building ClawMem, an open-source context engine that gives AI coding agents persistent memory across sessions. It works with Claude Code (hooks + MCP) and OpenClaw (ContextEngine plugin + REST API), and both can share the same SQLite vault, so your CLI agent and your voice&#x2F;chat agent build on the same memory without syncing anything.<p>The retrieval architecture is a Frankenstein, which is pretty much always my process. I pulled the best parts from recent projects and research and stitched them together: [QMD](<a href=\"https:&#x2F;&#x2F;github.com&#x2F;tobi&#x2F;qmd\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;tobi&#x2F;qmd</a>) for the multi-signal retrieval pipeline (BM25 + vector + RRF + query expansion + cross-encoder reranking), [SAME](<a href=\"https:&#x2F;&#x2F;github.com&#x2F;sgx-labs&#x2F;statelessagent\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;sgx-labs&#x2F;statelessagent</a>) for composite scoring with content-type half-lives and co-activation reinforcement, [MAGMA](<a href=\"https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2501.13956\" rel=\"nofollow\">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2501.13956</a>) for intent classification with multi-graph traversal (semantic, temporal, and causal beam search), [A-MEM](<a href=\"https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2510.02178\" rel=\"nofollow\">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2510.02178</a>) for self-evolving memory notes, and [Engram](<a href=\"https:&#x2F;&#x2F;github.com&#x2F;Gentleman-Programming&#x2F;engram\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;Gentleman-Programming&#x2F;engram</a>) for deduplication patterns and temporal navigation. None of these were designed to work together. Making them coherent was most of the work.<p>On the inference side, QMD&#x27;s original stack uses a 300MB embedding model, a 1.1GB query expansion LLM, and a 600MB reranker. 
These run via llama-server on a GPU or in-process through node-llama-cpp (Metal, Vulkan, or CPU). But the more interesting path is the SOTA upgrade: ZeroEntropy&#x27;s distillation-paired zembed-1 + zerank-2. These are currently the top-ranked embedding and reranking models on MTEB, and they&#x27;re designed to work together. The reranker was distilled from the same teacher as the embedder, so they share a semantic space. You need ~12GB VRAM to run both, but retrieval quality is noticeably better than the default stack. There&#x27;s also a cloud embedding option if you&#x27;re tight on vram or prefer to offload embedding to a cloud model.<p>For Claude Code specifically, it hooks into lifecycle events. Context-surfacing fires on every prompt to inject relevant memory, decision-extractor and handoff-generator capture session state, and a feedback loop reinforces notes that actually get referenced. That handles about 90% of retrieval automatically. The other 10% is 28 MCP tools for explicit queries. For OpenClaw, it registers as a ContextEngine plugin with the same hook-to-lifecycle mapping, plus 5 REST API tools for the agent to call directly.<p>It runs on Bun with a single SQLite vault (WAL mode, FTS5 + vec0). Everything is on-device; no cloud dependency unless you opt into cloud embedding. The whole system is self-contained.<p>This is a polished WIP, not a finished product. I&#x27;m a solo dev. The codebase is around 19K lines and the main store module is a 4K-line god object that probably needs splitting. And of course, the system is only as good as what you index. A vault with three memory files gives deservedly thin results. One with your project docs, research notes, and decision records gives something actually useful.<p>Two questions I&#x27;d genuinely like input on: (1) Has anyone else tried running SOTA embedding + reranking models locally for agent memory, and is the quality difference worth the VRAM? 
(2) For those running multiple agent interfaces (CLI + voice&#x2F;chat), how are you handling shared memory today?",
    "url": "https://github.com/yoloshii/ClawMem",
    "upvotes": 5,
    "comments": 0,
    "sub": "hackernews",
    "signal": 34.2,
    "hits": [
      "claude code",
      "coding agent",
      "retrieval",
      "vector"
    ]
  },
  {
    "src": "github",
    "id": "1252251881",
    "title": "sopanmunde/trivisionx-ai",
    "body": "TriVisionX – Enterprise-grade AI Research Assistant powered by Advanced RAG, LangGraph Multi-Agent Workflows, Gemini LLM, Pinecone Vector Search, MongoDB, and FastAPI for intelligent research automation and document analysis. ai-agents ai-research-copilot fastapi fastapi-nextjs langgraph mongodb pinecone rag research-automation semantic-search sentence-transformers",
    "url": "https://github.com/sopanmunde/trivisionx-ai",
    "upvotes": 2,
    "comments": 51,
    "sub": "github",
    "signal": 32.1,
    "hits": [
      "agent workflow",
      "langgraph",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "42299349",
    "title": "I looked at 1000s of RAG queries to figure out the problem with semantic search",
    "body": "The vast majority of AI systems in production rely on basic semantic search to provide context. A single retrieval call into a vector database powers most Retrieval-Augmented Generation systems today. If you’ve tried using models like these, you know exactly how limited they are in truly understanding your data.<p>I looked into thousands of datapoints of actual user queries to clearly classify and determine exactly where and when semantic search starts to break down and provide missing or hallucinated results.<p>I pattern matched dozens of failure modes. Here are three of them. If you want to hear more you can reach me at pipitone@zeroentropy.dev<p>1. Negated Semantic Queries: “Which electric vehicle articles do not include any reference to Elon Musk?”<p>Both keyword and semantic searches will immediately retrieve specifically the electric vehicle articles that include a reference to Elon Musk.<p>2. Multi-Hop Queries “If the acquiring company fails to hold a shareholder’s meeting, what is the penalty?”<p>To answer this query, you need to work step-by-step. You would need to find the paragraph that says what happens when you fail to hold a shareholder meeting. Let’s say that such a search reveals that the agreement will be terminated in that circumstance. Then, you must search for what penalties are incurred by terminating the agreement. A simple semantic search will return paragraphs about shareholder’s meetings, and it will also return paragraphs about any kind of penalty — but, it will fail to link the two and realize that specifically a “termination penalty” must be boosted to the first place result.<p>Multi-hop queries require multiple steps of retrieval to get to the right information.<p>3. Fuzzy Filtering Queries “What diagnostic methods are suggested for early-stage cancer, in papers with a sample size of over 2000”<p>Sample sizes often occur in the first paragraph of a medical research article. 
Meanwhile, the specific diagnostic method is likely mentioned deep in the article. So, these two pieces of information often do not occur in the same chunk. Your RAG pipeline will be happy to show diagnostic methods for early-stage cancer in articles that do not match the requested sample size. Not only that, but the correct answer will be almost impossible to find if “over 2000” is a rare filter.<p>----<p>Another interesting topic is evals for retrieval. At this point, I&#x27;ve talked to hundreds of developers, and discovered that retrieval evaluation is often overlooked, despite the impact on an AI’s intelligence and hallucination rate.<p>In most cases, evaluations occur at the end-user stage, through direct feedback mechanisms like thumbs up&#x2F;down ratings. However, few have a method of associating “thumbs down” ratings with exactly what went wrong and where. Was it a UX problem? Or an LLM hallucination? Did the retrieval pipeline fail, or did the corpus simply lack the correct information? Currently, these questions are typically addressed by manually reviewing queries — a process that is labor-intensive, inconsistent, and impractical at scale.<p>Yet, evaluating retrieval is a key step to building a useful and reliable AI product. But doing so is hard. LLM evaluations only require an (Input, Output) pair. Meanwhile, retrieval benchmarks require the query, a snapshot of the entire corpus at that exact point in time, along with ground truth citations into exactly what the correct retrieval results should have been.<p>Building such a benchmark is super hard. But, I strongly believe LLMs can and should be used to autonomously define and build benchmarks to compute deterministic metrics like recall, precision, mean reciprocal rank, etc.<p>That’s why I am currently building an open-source benchmark creation framework that I will release soon. 
If you’d like to contribute, or if evaluation is something you’re curious about, feel free to reach out to me at pipitone@zeroentropy.dev",
    "url": "https://news.ycombinator.com/item?id=42299349",
    "upvotes": 6,
    "comments": 3,
    "sub": "hackernews",
    "signal": 31.9,
    "hits": [
      "rag pipeline",
      "evals",
      "benchmark",
      "retrieval",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "44325301",
    "title": "Ask HN: What Agent should I build next? Looking for ideas",
    "body": "Hey folks,<p>I&#x27;ve been working on Awesome AI Apps, where I&#x27;m exploring and building practical examples for anyone working with LLMs and agentic workflows.<p>It started as a way to document the stuff I was experimenting with, basic agents, RAG pipelines, MCPs, a few multi-agent workflows, but it’s kind of grown into a larger collection.<p>Right now, it includes 25+ examples across different stacks:<p>- Starter agent templates\n- Complex agentic workflows\n- MCP-powered agents\n- RAG examples\n- Multiple Agentic frameworks (like Langchain, OpenAI Agents SDK, Agno, CrewAI, and more...)<p>You can find them here: https:&#x2F;&#x2F;github.com&#x2F;arindam200&#x2F;awesome-ai-apps<p>I&#x27;m also playing with tools like FireCrawl, Exa, and testing new coordination patterns with multiple agents.<p>Honestly, just trying to turn these “simple ideas” into examples that people can plug into real apps.<p>Now I’m trying to figure out what to build next.<p>If you’ve got a use case in mind or something you wish existed, please drop it here. Curious to hear what others are building or stuck on.<p>Always down to collab if you&#x27;re working on something similar.",
    "url": "https://news.ycombinator.com/item?id=44325301",
    "upvotes": 1,
    "comments": 0,
    "sub": "hackernews",
    "signal": 31.1,
    "hits": [
      "agent workflow",
      "rag pipeline",
      "langchain"
    ]
  },
  {
    "src": "hackernews",
    "id": "47678328",
    "title": "Show HN: Frontend-VisualQA — give coding agents eyes to verify their own UI work",
    "body": "Coding agents today are blind.<p>They write “valid” HTML&#x2F;CSS code but can still ship a broken layout, a clipped dropdown, or a page at the wrong URL. Playwright scripts can assert modal.isVisible() without knowing the modal is rendered off-screen.<p>Essentially, coding agents need “eyes” to verify their own UI work.<p>frontend-visualqa is a CLI + MCP server for Claude Code and Codex for visual testing, verification, and QA of a website.<p>You give it a URL and natural-language claims:<p><pre><code>  frontend-visualqa verify http:&#x2F;&#x2F;localhost:8000&#x2F;dashboard.html \\\n  --claims \\\n  &#x27;The API status indicator shows Active&#x27; \\\n  &#x27;The monthly quota progress bar is completely filled&#x27;\n\n  # → first claim passes, second fails (label says 100% but bar is ~65% full)\n\n</code></pre>\nIt catches visual&lt;-&gt;DOM disagreements that selectors are blind to.<p>You can also test interactive flows without hardcoded data:<p><pre><code>  frontend-visualqa verify &#x27;http:&#x2F;&#x2F;localhost:8000&#x2F;booking_form.html&#x27; \\\n  --claims &#x27;The date on the confirmation page matches the date selected on the calendar&#x27; \\\n  --navigation-hint &quot;Fill out the form with example data&quot;\n\n  # → fails: fills the form, picks a date, books the slot, and catches an off-by-one date error on the confirmation page\n\n</code></pre>\nThe visual evaluation runs on n1, a VLM by Yutori that is post-trained specifically for browser interaction with RL on live websites. It navigates pages autonomously — so when a coding agent sends it to the wrong URL, n1 sees the wrong page, self-corrects, and reports this correction. On browser-use benchmarks n1 slightly outperforms Opus 4.6 and GPT-5.4 while running 2—3x faster at 4—5x lower cost: <a href=\"https:&#x2F;&#x2F;yutori.com&#x2F;blog&#x2F;introducing-n1\" rel=\"nofollow\">https:&#x2F;&#x2F;yutori.com&#x2F;blog&#x2F;introducing-n1</a><p>How does this compare to?<p>1. 
Playwright CLI+MCP\n- Gold standard, but blind.\n- frontend-visualqa is the visual verification layer on top.<p>2. OpenAI Playwright skill &#x2F; Claude + Dev-Browser\n- similar idea, but n1 is specifically trained for browser use (thus faster and cheaper), and the claim-based approach structures what to check rather than hoping the model notices everything.\n- Not locked to a TUI or IDE.<p>Known limitations:\n- Native &lt;select&gt; dropdowns render as OS-level widgets outside the viewport — n1 can&#x27;t see or interact with them. Custom dropdowns work fine.\n- Small visual&#x2F;numeric disagreements (red vs green status dot) are a known hard case. Improving with model updates.<p>Requires a Yutori API key (new accounts get free credits). DM me if you run out of credits.",
    "url": "https://github.com/yutori-ai/frontend-visualqa",
    "upvotes": 10,
    "comments": 0,
    "sub": "hackernews",
    "signal": 30.5,
    "hits": [
      "claude code",
      "coding agent",
      "benchmark"
    ]
  },
  {
    "src": "hackernews",
    "id": "47166647",
    "title": "Show HN: Coding agents find the right GPU bottleneck 70% of the time, fix it 30%",
    "body": "One of the authors. Some things that surprised us while running these experiments:<p>The tasks are pulled from real merged PRs in vLLM and SGLang, so there&#x27;s a known-good human solution for each one. Agents get the full codebase, the issue description, and a test harness. Pretty generous setup.<p>What we didn&#x27;t expect: the agents are genuinely good at <i>diagnosing</i> the problem. They read the code, find the bottleneck, describe the right fix. But then the generated code has subtle bugs. Off-by-one in kernel indexing, wrong tensor shapes, missing synchronization barriers. The kind of stuff that passes a code review at first glance but segfaults under load.<p>The other weird result: agent rankings completely invert between codebases. Claude Code is the best performer on vLLM (46%) but the worst on SGLang (27%). TRAE with GPT-5 is the opposite pattern. Same underlying models, different agent scaffolding. It suggests the scaffolding around the model matters at least as much as the model itself.<p>We also tried three open-source models. None produced a single working optimization. One of them (MiniMax-M2.1) got stuck in a loop printing &quot;I need to actually use the tools now&quot; 2,412 times without ever making a tool call.<p>The benchmark, all agent transcripts, and evaluation code are open: <a href=\"https:&#x2F;&#x2F;ayushnangia.github.io&#x2F;iso-bench-website&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;ayushnangia.github.io&#x2F;iso-bench-website&#x2F;</a><p>Curious what others think about the scaffolding result in particular, which feels underexplored.",
    "url": "https://ayushnangia.github.io/iso-bench-website/",
    "upvotes": 3,
    "comments": 1,
    "sub": "hackernews",
    "signal": 30.4,
    "hits": [
      "claude code",
      "coding agent",
      "benchmark"
    ]
  },
  {
    "src": "hackernews",
    "id": "47366011",
    "title": "Launch HN: Captain (YC W26) – Automated RAG for Files",
    "body": "Hi HN, we’re Lewis and Edgar, building Captain to simplify unstructured data search (<a href=\"https:&#x2F;&#x2F;runcaptain.com\">https:&#x2F;&#x2F;runcaptain.com</a>). Captain automates the building and maintenance of file-based RAG pipelines. It indexes cloud storage like S3 and GCS, plus SaaS sources like Google Drive. There’s a quick walkthrough at <a href=\"https:&#x2F;&#x2F;youtu.be&#x2F;EIQkwAsIPmc\" rel=\"nofollow\">https:&#x2F;&#x2F;youtu.be&#x2F;EIQkwAsIPmc</a>.<p>We also put up this demo site called “Ask PG’s Essays” which lets you ask&#x2F;search the corpus of pg’s essays, to get a feel for how it works: <a href=\"https:&#x2F;&#x2F;pg.runcaptain.com\">https:&#x2F;&#x2F;pg.runcaptain.com</a>. The RAG part of this took Captain about 3 minutes to set up.<p>Here are some sample prompts to get a feel for the experience:<p>“When do we do things that don&#x27;t scale? When should we be more cautious?” \n<a href=\"https:&#x2F;&#x2F;pg.runcaptain.com&#x2F;?q=When%20do%20we%20do%20things%20that%20don&#x27;t%20scale%3F%20When%20should%20we%20be%20more%20cautious%3F\">https:&#x2F;&#x2F;pg.runcaptain.com&#x2F;?q=When%20do%20we%20do%20things%20...</a><p>“Give me some advice, I&#x27;m fundraising” \n<a href=\"https:&#x2F;&#x2F;pg.runcaptain.com&#x2F;?q=Give%20me%20some%20advice%2C%20I&#x27;m%20fundraising\">https:&#x2F;&#x2F;pg.runcaptain.com&#x2F;?q=Give%20me%20some%20advice%2C%20...</a><p>“What are the biggest advantages of Lisp”\n<a href=\"https:&#x2F;&#x2F;pg.runcaptain.com&#x2F;?q=what%20are%20the%20biggest%20advantages%20of%20Lisp\">https:&#x2F;&#x2F;pg.runcaptain.com&#x2F;?q=what%20are%20the%20biggest%20ad...</a><p>A good production RAG pipeline takes substantial effort to build, especially for file workloads. You have to handle ETL or text extraction, chunking, embedding, storage, search, re-ranking, inference, and often compliance and observability – all while optimizing for latency and reliability. It’s a lot to manage. 
grep works well in some cases, but for agents, semantic search provides significantly higher performance. Cursor uses both and reports 6.5%–23.5% accuracy gains from vector search over grep (<a href=\"https:&#x2F;&#x2F;cursor.com&#x2F;blog&#x2F;semsearch\" rel=\"nofollow\">https:&#x2F;&#x2F;cursor.com&#x2F;blog&#x2F;semsearch</a>).<p>We’ve spent the past four years scaling RAG pipelines for companies, and Edgar’s work at Purdue’s NLP lab directly informed our chunking techniques. In conversations with dozens of engineers, we repeatedly saw DIY pipelines produce inconsistent results, even after weeks of tuning. Many teams lacked clarity on which retrieval strategies best fit their data.<p>We realized that a system to provision storage and embeddings, handle indexing, and continuously update pipelines to reflect the latest search techniques could remove the need for every team to rebuild RAG themselves. That idea became Captain.<p>In practice, one API call indexes URLs, cloud storage buckets, directories, or individual files. Under the hood, we’re converting everything to Markdown. For this, we’ve had good results with Gemini 3 Pro for images, Reducto for complex documents, and Extend for basic OCR. For embedding models, ‘gemini-embedding-001’ performed reasonably well at first, but we later switched to the Contextualized Embeddings from ‘voyage-context-3’. It produced more relevant results than even the newer Voyage 4 models because its chunk embeddings are encoded with awareness of the surrounding document context. We then applied Voyage’s ‘rerank-2.5’ as second-stage re-ranking, reducing 50 initial chunks to a final top 15 (configurable in Captain’s API). Dense embeddings are just half the picture; full-text search, fused with RRF, completes our hybrid retrieval. In the Captain API, these techniques are exposed through a single &#x2F;query endpoint. 
Access controls can be configured via metadata filters, and page number citations are returned automatically.<p>The stack is constantly changing but the Captain API creates a standard interface for this. You can try Captain, 1 month for free, and build your own pipelines at <a href=\"https:&#x2F;&#x2F;runcaptain.com\">https:&#x2F;&#x2F;runcaptain.com</a>. We’re looking for candid feedback, especially anything that can make it more useful, and look forward to your comments!",
    "url": "https://www.runcaptain.com/",
    "upvotes": 57,
    "comments": 38,
    "sub": "hackernews",
    "signal": 30.4,
    "hits": [
      "rag pipeline",
      "retrieval",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "46634773",
    "title": "How do you pick a Coding Agent HN?",
    "body": "There&#x27;s lots of models benchmark out there, but how do you evaluate coding agents?<p>I&#x27;ve been seeing a lot of OpenCode fuzz on HN lately, because of Anthropic disabling their access to the private subscription endpoints, and I confess it made me feel like I could be missing out on something though I can&#x27;t tell for sure.<p>There&#x27;s also Amp Code who seems to be picking up traction, and, although more on the IDE side, I have tried Kiro through AWS Credits and it surprisingly outperforms Claude Code for me in some cases but didn&#x27;t fully bait me into the switch.<p>Codex works as good as Claude Code for me but I like Claude&#x27;s UX and Opus 4.5 better.<p>Are there any reliable Coding Agents benchmark out there? What is your take?",
    "url": "https://news.ycombinator.com/item?id=46634773",
    "upvotes": 4,
    "comments": 0,
    "sub": "hackernews",
    "signal": 30.2,
    "hits": [
      "claude code",
      "coding agent",
      "benchmark"
    ]
  },
  {
    "src": "hackernews",
    "id": "47940150",
    "title": "Show HN: An agent that remembers across sessions (no chat history)",
    "body": "Hi HN — I built this in my off-hours over the last 3 months. Sharing now because I just filed the provisional patent yesterday (US 64&#x2F;050,345) and the repo is freshly public.<p>The frustration that started it: every time I use a coding agent (Cursor, OpenCode, Aider, Claude Code, etc.), it eventually loses context — forgets the SSH address, re-asks for the DB password, tries to redeploy to localhost when the server is remote. The &quot;proper&quot; answer is &quot;set up 10 specialized agents with short context windows.&quot; I&#x27;m too lazy for that.<p>The conventional architecture is the actual problem. Every turn re-sends the full conversation, the model recomputes attention from scratch, and cost compounds with conversation length. Long-running agents are economically infeasible by design.<p>What I built: NLS captures the model&#x27;s own computed K&#x2F;V states (and recurrent states for hybrid models like Qwen3.5-MoE) after each turn, persists them to disk, and re-injects them into the cache on the next turn — at the right positions, with proper alignment. The model behaves as if it had the full conversation in context, but the conversation is never re-sent.<p>Validated across three settings, in increasing order of stringency:<p>(1) Standard conversational recall: 5&#x2F;5 on a 5-fact production test. Baseline check.<p>(2) LongMemEval (published cross-session benchmark, ~19K sessions). On the 18-question &quot;fully answerable&quot; subset:<p><pre><code>  Condition                                              Qwen 3.5    Qwen 3.6\n  Memories provided as TEXT in the prompt                8&#x2F;18        9&#x2F;18\n  Same memories delivered as KV-state via NLS            8&#x2F;18        9&#x2F;18\n\n  Text and KV produce identical scores. Both fail the same 9-10 questions for the same reasons (multi-hop temporal reasoning that exceeds model capacity). 
When the architecture&#x27;s inputs are equivalent, the outputs are equivalent.\n</code></pre>\n(3) Real agentic loop with OpenCode (TUI coding agent, used NLS as its OpenAI-compatible backend). It scaffolded a multi-phase coding project (&quot;ICF Coaching Evaluation Tool&quot;). Then in a separate session, after a full TUI restart with no chat history, I asked &quot;what&#x27;s the project about?&quot; — it returned a rich, specific description naming the project, the stack, and the architectural decisions. 124 user-typed tokens delivered 18,751 tokens of stored prior-session context. 99.3% prompt-token savings on the recall path. 4&#x2F;4 recall across the test scenarios.<p>Honest caveats:\n- The plugin source is proprietary (patent pending). The repo has docs, benchmarks, journey — not the implementation.\n- Single-GPU validation. Multi-GPU not tested yet.\n- Solo, no team yet.\n- Provisional patent only — non-provisional and PCT in the next 12 months.<p>What I want from this thread: tell me where you&#x27;d stress-test it. What workload breaks it? Anyone here from an inference provider — does this overlap with what your stack already does, or is this a new place?<p>Demo (conversational): <a href=\"https:&#x2F;&#x2F;punkrecords.live\" rel=\"nofollow\">https:&#x2F;&#x2F;punkrecords.live</a>\nDemo (agentic, OpenAI-compatible): <a href=\"https:&#x2F;&#x2F;api.punkrecords.live&#x2F;v1\" rel=\"nofollow\">https:&#x2F;&#x2F;api.punkrecords.live&#x2F;v1</a>",
    "url": "https://github.com/umbecanessa/neural-ledger-system",
    "upvotes": 1,
    "comments": 0,
    "sub": "hackernews",
    "signal": 30.1,
    "hits": [
      "claude code",
      "coding agent",
      "benchmark"
    ]
  },
  {
    "src": "hackernews",
    "id": "45928259",
    "title": "Show HN: Wegent –Open Source Cloud Coding Agent Platform",
    "body": "Core Capabilities<p>Configuration-Driven Agent Teams: Define and run personalized agent teams through YAML configuration with web UI - no secondary development required<p>Multi Execution Engines: Built on Agno and Claude Code agent engines at the bottom layer, supporting both dialogue and coding modes at the upper layer<p>Isolated Sandbox Environments: Each agent team runs in an independent sandbox, enabling multiple teams to execute simultaneously<p>Advanced Collaboration Modes: Dialogue mode supports parallel, leader-based, and other agent collaboration patterns for complex workflows like news insights and content retrieval<p>AI Coding Integration: Coding mode integrates with GitHub&#x2F;GitLab and other code services to implement AI-driven development, code review, and other coding workflows",
    "url": "https://github.com/wecode-ai/Wegent",
    "upvotes": 1,
    "comments": 0,
    "sub": "hackernews",
    "signal": 30.1,
    "hits": [
      "claude code",
      "coding agent",
      "retrieval"
    ]
  },
  {
    "src": "hackernews",
    "id": "47049776",
    "title": "Launch HN: Sonarly (YC W26) – AI agent to triage and fix your production alerts",
    "body": "Hey HN, I am Dimittri and we’re building Sonarly (<a href=\"https:&#x2F;&#x2F;sonarly.com\">https:&#x2F;&#x2F;sonarly.com</a>), an AI engineer for production. It connects to your observability tools like Sentry, Datadog, or user feedback channels, triages issues, and fixes them to cut your resolution time. Here&#x27;s a demo: <a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=rr3VHv0eRdw\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=rr3VHv0eRdw</a>.<p>Sonarly is really about removing the noise from production alerts by grouping duplicates and returning a root cause analysis to save time to on-call engineers and literally cut your MTTR.<p>Before starting this company, my co-founder and I had a B2C app in edtech and had, some days, thousands of users using the app. We pushed several times a day, relying on user feedback. Then we set up Sentry, it was catching a lot of bugs, but we had up to 50 alerts a day. With 2 people it&#x27;s a lot. We took a lot of time filtering the noise to find the real signal so we knew which bug to focus on.<p>At the same time, we saw how important it is to fix a bug fast when it hits users. A bug means in the worst case a churn and at best a frustrated user. And there are always bugs in production, due to code errors, database mismatches, infrastructure overload, and many issues are linked to a specific user behavior. You can&#x27;t catch all these beforehand, even with E2E tests or AI code reviews (which catch a lot of bugs but obviously not all, plus it takes time to run at each deployment). This is even more true with vibe-coding (or agentic engineering).<p>We started Sonarly with this idea. More software than ever is being built and users should have the best experience possible on every product. The main idea of Sonarly is to reduce the MTTR (Mean Time To Repair).<p>We started by recreating a Sentry-like tool but without the noise, using only text and session replays as the interface. 
We built our own frontend tracker (based on open-source rrweb) and used the backend Sentry SDK (open source as well). Companies could just add another tracker in the frontend and add a DSN in their Sentry config to send data to us in addition to Sentry.<p>We wanted to build an interface where you don&#x27;t need to check logs, dashboards, traces, metrics, and code, as the agent would do it for you with plain English to explain the &quot;what,&quot; &quot;why,&quot; and &quot;how do I fix it.&quot;<p>We quickly realized companies don&#x27;t want to add a new tracker or change their monitoring stack, as these platforms do the job they&#x27;re supposed to do. So we decided to build above them. Now we connect to tools like Sentry, Datadog, Slack user feedback channels, and other integrations.<p>Claude Code is so good at writing code, but handling runtime issues requires more than just raw coding ability. It demands deep runtime context, immediate reactivity, and intelligent triage, you can’t simply pipe every alert directly into an agent. That’s why our first step is converting noise into signal. We group duplicates and filter false positives to isolate clear issues. Once we have a confirmed signal, we trigger Claude Code with the exact context it needs, like the specific Sentry issue and relevant logs fetched via MCP (mostly using grep on Datadog&#x2F;Grafana). However, things get exponentially harder with multi-repo and multi-service architectures.<p>So we built an internal map of the production system that is basically a .md file updated dynamically. It shows every link between different services, logs, and metrics so that Claude Code can understand the issue faster.<p>One of our users using Sentry was receiving ~180 alerts&#x2F;day. 
Here is what their workflow looked like:<p>- Receive the alert<p>- 1) Defocus from their current task or wake up, or 2) don&#x27;t look at the alert at all (most of the time)<p>- Go check dashboards to find the root cause (if infra type) or read the stack trace, events, etc.<p>- Try to figure out if it was a false positive or a real problem (or a known problem already in the fixes pipeline)<p>- Then fix by giving Claude Code the correct context<p>We started by cutting the noise and went from 180&#x2F;day to 50&#x2F;day (by grouping issues) and giving a severity based on the impact on the user&#x2F;infra. This brings it down to 5 issues to focus on in the current day. Triage happens in 3 steps: deduplicating before triggering a coding agent, gathering the root cause for each alert, and re-grouping by RCA.<p>We launched self-serve (<a href=\"https:&#x2F;&#x2F;sonarly.com\">https:&#x2F;&#x2F;sonarly.com</a>) and we would love to have feedback from engineers. Especially curious about your current workflows when you receive an alert from any of these channels like Sentry (error tracking), Datadog (APM), or user feedback. How do you assign who should fix it? Where do you take your context from to fix the issue? Do you have any automated workflow to fix every bug, and do you have anything you use currently to filter the noise from alerts?<p>We have a large free tier as we mainly want feedback. You can self-serve under 2 min. I&#x27;ll be in the thread with my co-founder to answer your questions, give more technical details, and take your feedback: positive, negative, brutal, everything&#x27;s constructive!",
    "url": "https://sonarly.com/",
    "upvotes": 30,
    "comments": 17,
    "sub": "hackernews",
    "signal": 29.9,
    "hits": [
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "46706442",
    "title": "Show HN: UltraContext – A simple context API for AI agents with auto-versioning",
    "body": "Hey HN! I&#x27;m Fabio and I built UltraContext, a simple context API for AI agents with automatic versioning.<p>After two years building AI agents in production, I experienced firsthand how frustrating it is to manage context at scale. Storing messages, iterating system prompts, debugging behavior and multi-agent patterns—all while keeping track of everything without breaking anything. It was driving me insane.<p>So I built UltraContext. The mental model is git for context:<p>- Updates and deletes automatically create versions (history is never lost)<p>- Replay state at any point<p>The API is 5 methods:<p><pre><code>  uc.create()   &#x2F;&#x2F; new context (can fork from existing)\n  uc.append()   &#x2F;&#x2F; add message\n  uc.get()      &#x2F;&#x2F; retrieve by version, timestamp, or index\n  uc.update()   &#x2F;&#x2F; edit message → creates version\n  uc.delete()   &#x2F;&#x2F; remove message → creates version\n</code></pre>\nMessages are schema-free. Store conversation history, tool calls, system prompts—whatever shape you need. Pass it straight to your LLM using any framework you&#x27;d like.<p>What it&#x27;s for:<p>- Persisting conversation state across sessions<p>- Debugging agent behavior (rewind to decision point)<p>- Forking contexts to test different flows<p>- Audit trails without building audit infrastructure<p>- Multi-agent and sub-agent patterns<p>What it&#x27;s NOT:<p>- Not a memory&#x2F;RAG system (no semantic search)<p>- Not a vector database<p>- Not an Orchestration&#x2F;LLM framework<p>UltraContext handles versioning, branching, history. You get time-travel with one line.<p>Docs: <a href=\"https:&#x2F;&#x2F;ultracontext.ai&#x2F;docs\" rel=\"nofollow\">https:&#x2F;&#x2F;ultracontext.ai&#x2F;docs</a><p>Early access: <a href=\"https:&#x2F;&#x2F;ultracontext.ai\" rel=\"nofollow\">https:&#x2F;&#x2F;ultracontext.ai</a><p>Would love feedback! 
Especially from anyone who&#x27;s rolled their own context engineering and can tell me what I&#x27;m missing.",
    "url": "https://ultracontext.ai/",
    "upvotes": 21,
    "comments": 21,
    "sub": "hackernews",
    "signal": 29.2,
    "hits": [
      "context engineering",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "48762862",
    "title": "Launch HN: Manufact (YC S25) – MCP Cloud",
    "body": "Hi HN, we are Pietro and Luigi, cofounders of Manufact (<a href=\"https:&#x2F;&#x2F;manufact.com\">https:&#x2F;&#x2F;manufact.com</a>), a cloud for MCP apps and servers. We used to be called mcp-use, and still build open source SDKs for MCP under that name: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;mcp-use&#x2F;mcp-use\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;mcp-use&#x2F;mcp-use</a>. We did a Show HN about that last year: <a href=\"https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=44747229\">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=44747229</a>.<p>Today we want to tell you about our cloud product, Manufact, which is to mcp-use as Vercel is to Next.js. Manufact is an MCP vertical cloud designed for dev teams putting MCP Apps and servers in production.You can ship, iterate on, test and monitor your MCPs, and get them ready for the store submissions. All with the best developer and agent experience in mind.<p>Here is a demo video of the product: <a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=R2rbr5OT9LI\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=R2rbr5OT9LI</a>.<p>We have been working on MCP since April 2025. Our first focus was making it easy to build agents that could use any MCP server, and a lot of people started using our SDKs. Then the harness revolution kicked off: Claude Code, Claude Cowork, ChatGPT, Codex, OpenCode started shipping agent harnesses that made most standalone agent frameworks redundant. That pushed us to the other side of the connection, the servers. If agents were going to consolidate into a few harnesses, then first-class integration with the rest of a company&#x27;s systems (i.e. MCP) would become the thing that mattered, so we started building up our server SDKs.<p>Then in succession:<p>1. Oct 2025. ChatGPT Apps SDK. OpenAI brings app UIs to ChatGPT, built on top of MCP and the work of mcp-ui.\n2. Late 2025. The stores open. 
ChatGPT starts accepting app submissions, Claude grows its connector directory with selected partners.\n3. Jan 2026. MCP Apps becomes official. SEP-1865 merges as the first MCP extension (io.modelcontextprotocol&#x2F;ui): one UI standard any host can render.<p>Today, all the major clients fully support MCP and are opening marketplaces of reviewed MCPs that can be one click installed. All major tech companies have an MCP server, and many of those are reporting that already 15+% of their usage comes from their MCP, and we start to have a good way to distribute them just now.<p>MCP can return fully interactive UIs. So companies can (1) display data in more meaningful ways to their users (e.g. analytics, ecommerce) and (2) display their branding in some of the most used products on the planet (ChatGPT, Claude etc). Numbers: an engineer at Amplitude reported that their MCP saw a 2x increase in retention after adding UI to their MCP.<p>Clients (Claude, ChatGPT, Cursor) are starting to dynamically present MCP servers&#x2F;apps to users, based on their intent. Products will be organically discovered on the chats!<p>We feel that MCP is reaching its maturity moment. Now that MCPs are starting to be easy to install and discover, there is going to be a huge incentive for users to use them and for companies to create them:<p>1 - Most work is already done from AI chats, this is not going to stop, MCP gives you a way to interact with products without manually using their dashboards.<p>2 - MCP allows you to bring the context together in one place: you can read an email, create a ticket while plugged into the source code of your product, or your knowledge base. 
Aggregation of products that was not possible before will happen in the chat, orchestrated by increasingly intelligent models.<p>If AI apps (Codex, Claude Desktop) are the new browsers, as PG said in a recent tweet <a href=\"https:&#x2F;&#x2F;x.com&#x2F;paulg&#x2F;status&#x2F;2069080429236191504\" rel=\"nofollow\">https:&#x2F;&#x2F;x.com&#x2F;paulg&#x2F;status&#x2F;2069080429236191504</a>, then MCPs are the new websites.<p>But there is a catch:<p>- The submission process on the stores is still quite tricky and manual, and takes up valuable time.\n- Hardly anybody knows how to design a good MCP: most of them are 1:1 proxies of the API and have been abandoned since being one-shotted a few months ago.\n- The MCP Spec advances quickly, and it is not easy to keep track of the changes and what they mean for your server.\n- Auth is still a mystery for most teams (API key in the URL ???).\n- Most companies are not even aware that MCPs can return interactive UIs.\n- Clients still have to consolidate behavior: some do dynamic tool discovery, some don&#x27;t; some persist authentication properly, some don&#x27;t.<p>We built Manufact and mcp-use to solve these problems.\nOur SDKs help teams build good MCPs, our inspector helps them test locally, and our cloud helps them ship&#x2F;publish and monitor them in production.<p>To deploy on Manufact you just need to connect a GitHub app and pick the repo. We&#x27;ll detect the framework you are working with and get you a live MCP URL as soon as possible.<p>In our platform, that live URL will be used to give you a chat where you can try&#x2F;debug your MCP immediately and share it with your team. If you push an update on a new experimental branch, you&#x27;ll be able to test that as well thanks to preview deployments.<p>Once your server is ready to go live, we help you make sure that it does not break. You can configure automated tests that will take your MCP server, install it in ChatGPT and Claude, and test it. 
We do not test the model; we test the client (model + harness). This way you reliably know if your server breaks where people use it.<p>Since publishing on the store is a major distribution unlock for companies (your MCP can be dynamically discovered and one-click installed across Claude products and ChatGPT), we collected a set of requirements that will keep your submission from being rejected. You can check these locally before going through the actual review process.<p>Once your server is live, you&#x27;ll want to understand how it is used. Our analytics are designed for MCP, so you&#x27;ll know how many users are hitting your MCP, how many tool calls you receive, and from which client.<p>You can try out <a href=\"https:&#x2F;&#x2F;manufact.com\">https:&#x2F;&#x2F;manufact.com</a> for free today. We have usage-based pricing, and on our free account we give free credits for you to try it out. If you have an MCP already, just connect your GitHub repo and deploy; if not, you can build one using our skill and SDKs pretty simply (we will guide you in the onboarding).<p>We would love to hear feedback about the product in the comments, and hear thoughts from everyone about MCP. Thanks! :)",
    "url": "https://manufact.com",
    "upvotes": 108,
    "comments": 68,
    "sub": "hackernews",
    "signal": 28.4,
    "hits": [
      "claude code"
    ]
  },
  {
    "src": "hackernews",
    "id": "48480559",
    "title": "Show HN: Interbase – Long-running AI goals and aliases for any model",
    "body": "Hi HN,<p>I&#x27;ve been working on an open-source CLI agent called Interbase:<p><a href=\"https:&#x2F;&#x2F;github.com&#x2F;agentsorchestrationcompany&#x2F;interbase\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;agentsorchestrationcompany&#x2F;interbase</a><p>Two ideas motivated a lot of the project.<p>The first is that long-running agent workflows shouldn&#x27;t be restricted to a small number of frontier models.<p>Many recent agent products are beginning to support persistent tasks, background work, and goal-oriented workflows. I think those capabilities are useful abstractions independent of the underlying model.<p>Interbase includes a `&#x2F;goal` command that allows work to be organized around long-running objectives and supports more than 135 providers and 4,800+ models. The goal is to let users choose the model that works best for them rather than forcing a specific provider because a particular workflow feature only exists there.<p>The second idea is that AI workflows should be reusable in the same way shell workflows are.<p>Interbase includes `&#x2F;aliases`, which allows users to create shortcuts for workflows they run frequently. For example, a user might create aliases such as:<p>`gcm` → preferred git commit workflow<p>`review` → code review workflow<p>`ship` → release readiness workflow<p>After a while these become muscle memory in much the same way traditional shell aliases do.<p>The project also includes encrypted remote access, and one of the next areas I&#x27;m exploring is computer use capabilities that can work across a broad range of models rather than a handful of specialized offerings.<p>I&#x27;m curious whether others think long-running goals and reusable workflows should live above the model layer, or whether they belong as model-specific capabilities.<p>Happy to answer questions about the implementation or design decisions.",
    "url": "https://github.com/agentsorchestrationcompany/interbase",
    "upvotes": 2,
    "comments": 0,
    "sub": "hackernews",
    "signal": 28.1,
    "hits": [
      "agent workflow",
      "code review workflow"
    ]
  },
  {
    "src": "hackernews",
    "id": "47141347",
    "title": "Show HN: Open-source EU AI Act compliance layer for AI agents (8/2026 deadline)",
    "body": "We built AIR Blackbox — open-source compliance infrastructure for AI agents targeting the EU AI Act enforcement deadline on August 2, 2026.\nIf you&#x27;re deploying LLM-based agents (LangChain, CrewAI, AutoGen, OpenAI Agents SDK) into production, the EU AI Act requires tamper-evident audit trails, human oversight mechanisms, data governance controls, and injection defense — for any system classified as high-risk.\nMost teams we&#x27;ve talked to either don&#x27;t know about the deadline or assume their existing logging is enough. It&#x27;s not. Article 12 specifically requires logs that regulators can mathematically verify haven&#x27;t been altered. Article 14 requires the ability to interrupt agent execution. Article 15 requires defense against prompt injection and data poisoning.\nWhat we built:<p>Trust layers for LangChain, CrewAI, AutoGen, OpenAI Agents SDK, and RAG pipelines — each is a pip install that hooks into your existing agent code with ~3 lines of setup\nHMAC-SHA256 tamper-evident audit chains — every agent decision, tool call, and LLM interaction gets logged to a chain that regulators can verify\nConsentGate — risk-classifies tool calls and blocks critical operations until approved\nInjectionDetector — 15+ weighted patterns scanning prompts before they reach the model\nWriteGate + DriftDetector (for RAG) — prevents knowledge base poisoning and detects retrieval anomalies\nCompliance scanner — pip install air-compliance &amp;&amp; air-compliance scan .&#x2F;my-project tells you exactly which articles you&#x27;re missing<p>Everything maps to specific EU AI Act articles (9, 10, 11, 12, 14, 15). Zero vendor lock-in, Apache 2.0, zero core dependencies on the trust layers.\nThe scanner is probably the fastest way to understand where your gaps are. 
It takes about 3 seconds to run on a typical project.\nGitHub: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;airblackbox\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;airblackbox</a>\nPyPI: pip install air-compliance\nHappy to answer questions about what the EU AI Act actually requires for AI agent deployments — we&#x27;ve read the full regulation and mapped it to specific technical controls.",
    "url": "https://news.ycombinator.com/item?id=47141347",
    "upvotes": 2,
    "comments": 6,
    "sub": "hackernews",
    "signal": 27.3,
    "hits": [
      "rag pipeline",
      "langchain",
      "autogen",
      "retrieval"
    ]
  },
  {
    "src": "hackernews",
    "id": "43244549",
    "title": "Show HN: Firebender, a simple coding agent for Android Engineers",
    "body": "Hey HN, I made a simple coding agent plugin in Android Studio called Firebender. Here’s an unedited 5-minute video where it writes tests for an Android app and iterates against the Gradle task output on its own (<a href=\"https:&#x2F;&#x2F;docs.firebender.com&#x2F;get-started&#x2F;agent\">https:&#x2F;&#x2F;docs.firebender.com&#x2F;get-started&#x2F;agent</a>). You can use the plugin for free, no sign up needed, on the jetbrains marketplace.<p>The agent can edit multiple files, run gradle tasks like tests, and use the output to improve its changes. At the end, it reports a git diff of all changes that can be accepted or rejected.<p>Under the hood, the agent relies on Claude 3.7 sonnet and a fast code apply model to speed up edits. We built tools to give deeper access throughout the IDE like IntelliJ’s graph representation of kotlin&#x2F;java code, “everywhere search” for classes, and have more integrations planned. The goal is for the agent to have access to all the IDE goodies that we engineers take for granted, to improve the agent&#x27;s responses and ability to gather correct context. In order to improve the agent, there are internal evals like “tasks” and simulate the IDE which serves as a gym for the agent. This is heavily inspired by SWE-bench. Whenever tools, prompts, subagents, or models are changed, this gym helps find regressions quickly.<p>Building the UI was surprisingly hard. I had the great pleasure of becoming proficient in Java Swing (released in ‘96 by Netscape) to get this done right. Things like markdown streaming, or streaming git diffs are prone to layout flickering where Swing tries to recalculate where elements should go. We had to write our own markdown parsing and rendering engine that repaints Swing components only when changed portions of the markdown nodes. 
The UI tends to focus on simplifying reviewing AI changes, something I have a feeling we’ll be doing much more in the coming years.<p>If you’re an Android engineer, please let me know if you run into any bugs or want anything improved in the plugin!",
    "url": "https://docs.firebender.com/get-started/agent",
    "upvotes": 53,
    "comments": 18,
    "sub": "hackernews",
    "signal": 27.2,
    "hits": [
      "coding agent",
      "evals"
    ]
  },
  {
    "src": "hackernews",
    "id": "41202694",
    "title": "Launch HN: Roe AI (YC W24) – AI-powered data warehouse to query multimodal data",
    "body": "Hey HN, we’re Richard and Jason from Roe AI (<a href=\"https:&#x2F;&#x2F;getroe.ai\">https:&#x2F;&#x2F;getroe.ai</a>). We’re building a query engine that lets data people do SQL queries on various kinds of unstructured data (videos, images, webpages, documents) using LLM-powered data processors.<p>Here is a 3-minute video: <a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=9-WwJk1v5mI\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=9-WwJk1v5mI</a>, showing how to create an LLM data processor to process videos, build a semantic search for image data, and use it with SQL.\nThe problem we tackle is that data analysts cannot quickly answer their business questions around unstructured, multimodal data. For example, product teams want to understand user session replay videos to understand the painpoints of using their product. Ads teams need to know everything about an advertiser based on their web pages, such as the products they offer, payment methods, etc. Marketing teams need to know how product placement or music in a marketing campaign could get more views. And so on.<p>For data that is structured, questions like these can be answered quickly with SQL queries in Snowflake &#x2F; BigQuery. But when you have unstructured multimodal data, it becomes a complex analysis process: open a Python notebook, write custom logic to get these multimodal data from blob storage (or write a crawler first if you need webpage data), find an AI model, do prompt engineering, do data ops to productionize the workload in a data workflow, etc.\nWe simplify this process to a few lines of SQL.<p>How it works: first, we leverage multimodal LLMs as data processors because they’re good at unstructured data information extraction, classification or any arbitrary tasks. Next, we’ve built a user interface for data people to explore multimodal data and manage AI components. Then we have a quick semantic index builder for multimodal data. 
(We often see databases provide vector search functionality but not indexing building, so we built that.) Utility functions deal with multimodal data, like video cutter, PDF page selector, etc. Finally, SQL is the command line for slicing and dicing multimodal data.<p>How we got here: I’ve experienced 3 data evolutions in the last 10 years. At UC Berkeley, I was a data researcher using a supercomputer cluster called Savio. It was a bare-metal way to analyze the data—I had to move CSV between machines. Then at LinkedIn, I had Hadoop + Pig &#x2F; Scala Spark. That abstracted most of the work, but I spent hours tuning jobs and had a headache manipulating HDFS directories. Later I joined Snowflake, and was like, holy – data analysis can be this simple – I can just use SQL to do everything within this data warehouse! I asked myself: why can’t we make something like Snowflake for unstructured data? That was the impulse behind Roe.ai and it’s been driving me ever since.<p>To get started, you can sign in at <a href=\"https:&#x2F;&#x2F;app.roe-ai.com&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;app.roe-ai.com&#x2F;</a> and there are docs at <a href=\"https:&#x2F;&#x2F;docs.roe-ai.com&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;docs.roe-ai.com&#x2F;</a>. You can load unstructured data via our SQL and File API, Snowflake Staging Data Connector, S3 Blob Storage Data connector, Zapier Roe AI Zap, or the SQL function load_url_file() to get a file from a URL.<p>Some logistics: the product is free to start, and we’ve preloaded $50 AI credits—enough to process 3000 one-pager PDFs. If you use all $50, just email us, and we’ll give you more. The solution is not open-sourced because it is too complex to be self-hosted, but let us know if you see the potential for open-source.<p>The product is early and could have bugs and UX problems. 
It’d be incredible if you could give it a spin anyway and we hope it will be interesting and that you’ll let us know what you think!\nJason and I will be around in the thread and are really interested in hearing from you!",
    "url": "https://news.ycombinator.com/item?id=41202694",
    "upvotes": 60,
    "comments": 35,
    "sub": "hackernews",
    "signal": 27.0,
    "hits": [
      "prompt engineering",
      "vector"
    ]
  },
  {
    "src": "github",
    "id": "1285730990",
    "title": "sumitk87549/IBM_RAG__Agentic_AI",
    "body": "Tool Calling Retrieval-Augmented Generation AI Security AI Integrations Multimodal Prompts Agentic systems Generative AI Agents Prompt Patterns Software Development LLM Application ** Vector Databases LangChain OpenAI API Model Context Protocol Prompt Engineering AI Workflows Agentic Workflows LangGraph Generative AI AI Orchestration ",
    "url": "https://github.com/sumitk87549/IBM_RAG__Agentic_AI",
    "upvotes": 0,
    "comments": 0,
    "sub": "github",
    "signal": 27.0,
    "hits": [
      "prompt engineering",
      "langchain",
      "langgraph",
      "retrieval",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "48195021",
    "title": "Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs",
    "body": "Hey HN, we’re Nico and Arseniy, co-founders of Superlog (<a href=\"https:&#x2F;&#x2F;superlog.sh\">https:&#x2F;&#x2F;superlog.sh</a>). We&#x27;re building a self-installing, self healing observability tool meant not to be opened. It has a wizard that daily sets up proper logging and an agent that investigates errors and opens PRs.<p>Super short demo: <a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=xFhU9Mk247M\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=xFhU9Mk247M</a>.<p>In our earlier startups, we tried Sentry, Datadog, Grafana, Dash0, and nothing was good enough.  Proper telemetry and alerting still requires a ton of manual setup. We struggled with adding good logs, so debugging was tough, especially as codebases grow at a faster pace. Meanwhile, the Datadog&#x2F;Dash0 bill kept climbing, and we still spent engineering hours to learn, configure, and maintain our observability tooling.<p>With Sentry, we found ourselves flooded by a stream of alerts into our Slack channel, most were duplicates or lacked context, so alert fatigue&#x2F;constant interrupts were a real pain. The #ops notification is consistently the worst feeling on a Saturday morning<p>We’ve seen too many times servers run out of memory and disk, and three AWS metrics giving us three different values. Half of the graphs on dashboards are normally empty or outdated, and manually clicking through UIs, especially when the team is small, seems like a huge waste of time.<p>At some point we realized that solving this problem would be more valuable than the things we had been working on, and we had the expertise to do it, since Arseniy had spent years at Datadog, getting paged during the night to debug production incidents. 
So we decided to build a platform that would just work: agent-first, MCP-native, zero-setup.<p>Here’s how Superlog works: we have a wizard that scans your repo, and automatically instruments it with well-structured logs, traces and metrics via OpenTelemetry. We make sure to highlight main failure modes, endpoint performance, usage per tenant, and LLM&#x2F;upstream cost (by callsite, tenant and model).<p>Errors get fingerprinted and grouped into incidents, so you see one issue, not a thousand duplicates. When you get a notification from Superlog, you see a clear failure summary, its inferred severity and impact upfront.<p>Then the agent investigates and tries to solve the issue. If it has enough context, it produces a concise and tested PR. If it doesn&#x27;t, it posts its findings for the investigating team, and automatically pulls in the engineers that could contribute more context based on documentation, previous investigations and Slack threads.<p>Either way the output is one clean PR per incident, posted in Slack, that you can\nmerge, ignore, or open as a Claude Code session and modify.<p>Three things we think are different from other observability vendors:<p>(1) We solve the setup pain. The wizard will instrument everything with native OTel SDKs, respecting the semantic conventions, with proper service and environment tagging. We’re also working on native automatic dashboards and alerts, so that you can see what’s going on in a glance and don’t miss subtle failure modes.<p>(2) Our telemetry doesn’t decay. The wizard runs daily, and keeps adding logs, alerts and dashboards where it’s needed. You don&#x27;t have to remember to instrument new features. The next time something breaks, the data you need to debug it is already there.<p>(3) Our goal is to solve alert fatigue. We use agents to merge similar errors and refine the summaries, giving you relevant information upfront. 
We have a custom evaluation setup that makes sure that our summaries are dense and correct, and severity and impact is on point. We also give you confidence scores for every LLM-enhanced metric so that wrong guesses don’t get boosted.<p>Important: superlog telemetry is vendor-neutral, so you keep all the logs&#x2F;metrics&#x2F;traces we install. Pricing is on the site. We&#x27;re early, so expect rough edges and please tell us when you find them.<p>You can try it at <a href=\"https:&#x2F;&#x2F;superlog.sh\">https:&#x2F;&#x2F;superlog.sh</a>. We&#x27;d love to hear what you&#x27;re using today, what&#x27;s broken about it, and whether the &quot;one mergeable PR per incident&quot; model sounds useful or terrifying. Especially keen to hear from folks running integration-heavy products, anyone who&#x27;s rolled their own observability, and anyone who has tried Sentry &#x2F; Datadog MCPs and given up. Comments and feedback welcome!",
    "url": "https://superlog.sh/",
    "upvotes": 74,
    "comments": 49,
    "sub": "hackernews",
    "signal": 26.7,
    "hits": [
      "claude code"
    ]
  },
  {
    "src": "hackernews",
    "id": "43683075",
    "title": "Show HN: A library to convert+deploy existing agent projects as MCP servers",
    "body": "Most of the MCP servers that I’ve seen are tools implemented in standalone projects. To onboard more tools (especially agents and multi-agent workflows) to MCP, I’ve been thinking it’s important to allow AI engineers to continue to prototype in their existing agent frameworks and deploy with minimal conversion when ready.<p>We created the automcp library, which you can add as a dependency to existing projects (CrewAI, LangGraph, Llama Index, OpenAI Agents SDK, Pydantic AI, mcp-agent currently supported but more coming soon). You just need to run a CLI command to create a run_mcp.py file, make some edits and run it to start the server locally. You can think of run_mcp.py like Heroku’s Procfile, Codespaces configs, Pulumi&#x2F;AWS CDK style IaC.<p>We also created a demo of a deployment platform where you can enter the GitHub URL of your project, deploy with one click, and get a URL for the hosted sse server that can be used with MCP clients like Cursor. Think of it like Vercel for MCP servers.<p>There are still a few manual steps for the user that can be further automated, but curious to hear whether people think it’s useful? There are some interesting directions automcp could go in in future like automatically creating MCP servers for each orchestrator, agent and tool in a project (rather than one monolithic MCP server).<p>Website: <a href=\"https:&#x2F;&#x2F;auto-mcp.com\" rel=\"nofollow\">https:&#x2F;&#x2F;auto-mcp.com</a>\nautomcp repo: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;NapthaAI&#x2F;automcp\">https:&#x2F;&#x2F;github.com&#x2F;NapthaAI&#x2F;automcp</a> \nDeployment platform: <a href=\"https:&#x2F;&#x2F;labs.naptha.ai&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;labs.naptha.ai&#x2F;</a> \nDemo: <a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=El5YvBQ5py0\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=El5YvBQ5py0</a>",
    "url": "https://github.com/NapthaAI/automcp",
    "upvotes": 22,
    "comments": 2,
    "sub": "hackernews",
    "signal": 26.5,
    "hits": [
      "agent workflow",
      "langgraph"
    ]
  },
  {
    "src": "hackernews",
    "id": "44660406",
    "title": "Show HN: Single-agent long-horizon reasoning within one LLM run",
    "body": "- We build the Thread Inference Model (TIM) based on the transformer architecture, and its dedicated runtime TIMRUN.<p>- TIM + TIMRUN = Intelligent workflow generation, context engineering, and multi-hop tool use happens at the runtime level<p>- TIM + TIMRUN supports virtually unlimited reasoning enabled by context pruning, significantly improves the efficiency for long-horizon reasoning tasks<p>- Inference API is live at <a href=\"https:&#x2F;&#x2F;subconscious.dev&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;subconscious.dev&#x2F;</a><p>- More details: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;subconscious-systems&#x2F;TIMRUN\">https:&#x2F;&#x2F;github.com&#x2F;subconscious-systems&#x2F;TIMRUN</a>",
    "url": "https://huggingface.co/papers/2507.16784",
    "upvotes": 4,
    "comments": 1,
    "sub": "hackernews",
    "signal": 26.4,
    "hits": [
      "context engineering",
      "tool use"
    ]
  },
  {
    "src": "hackernews",
    "id": "43477861",
    "title": "Show HN: Typia (20,000x faster validator) challenges to Agentic AI with compiler",
    "body": "- typia is a runtime validator using the TypeScript compiler API, and automatically generates validators, serialized, JSON schema, etc. by analyzing source code at compile time<p>- Agentica: Challenges the Agentic AI Framework by utilizing typia compiler skills, specializing in LLM Function Calling<p>- Agentica argues that everything can be done with LLM Function Calling, avoiding the agent workflow graph used in traditional AI agent development, and therefore developers should focus on the function unit<p>- Scalable, flexible, and mass-productive agent development possible by focusing on the function unit\n- Compiler Driven Development for safe and efficient function schema build<p>- Document Driven Development by separating the function unit prompt domain for enterprise-level agent development",
    "url": "https://typia.io/articles/typia-challenges-to-agentic-ai-with-its-compiler-skill.html",
    "upvotes": 3,
    "comments": 0,
    "sub": "hackernews",
    "signal": 26.1,
    "hits": [
      "agent workflow",
      "function calling"
    ]
  },
  {
    "src": "hackernews",
    "id": "47034087",
    "title": "Evaluating AGENTS.md: are they helpful for coding agents?",
    "body": "",
    "url": "https://arxiv.org/abs/2602.11988",
    "upvotes": 232,
    "comments": 161,
    "sub": "hackernews",
    "signal": 26,
    "hits": [
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "47431671",
    "title": "Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training",
    "body": "I replicated David Ng&#x27;s RYS method (<a href=\"https:&#x2F;&#x2F;dnhkng.github.io&#x2F;posts&#x2F;rys&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;dnhkng.github.io&#x2F;posts&#x2F;rys&#x2F;</a>) on consumer AMD GPUs \n(RX 7900 XT + RX 6950 XT) and found something I didn&#x27;t expect.<p>Transformers appear to have discrete &quot;reasoning circuits&quot; — contiguous blocks of 3-4 layers that \nact as indivisible cognitive units. Duplicate the right block and the model runs its reasoning \npipeline twice. No weights change. No training. The model just thinks longer.<p>The results on standard benchmarks (lm-evaluation-harness, n=50):<p>Devstral-24B, layers 12-14 duplicated once:\n- BBH Logical Deduction: 0.22 → 0.76\n- GSM8K (strict): 0.48 → 0.64\n- MBPP (code gen): 0.72 → 0.78\n- Nothing degraded<p>Qwen2.5-Coder-32B, layers 7-9 duplicated once:\n- Reasoning probe: 76% → 94%<p>The weird part: different duplication patterns create different cognitive &quot;modes&quot; from the same \nweights. Double-pass boosts math. Triple-pass boosts emotional reasoning. Interleaved doubling \n(13,13,14,14,15,15,16) creates a pure math specialist. Same model, same VRAM, different routing.<p>The circuit boundaries are sharp — shift by one layer and the effect disappears or inverts. \nSmaller models (24B) have tighter circuits (3 layers) than larger ones (Ng found 7 layers in 72B).<p>Tools to find circuits in any GGUF model and apply arbitrary layer routing are in the repo. \nThe whole thing — sweep, discovery, validation — took one evening.<p>Happy to answer questions.",
    "url": "https://github.com/alainnothere/llm-circuit-finder",
    "upvotes": 265,
    "comments": 80,
    "sub": "hackernews",
    "signal": 26,
    "hits": [
      "benchmark"
    ]
  },
  {
    "src": "hackernews",
    "id": "48002136",
    "title": "DeepClaude – Claude Code agent loop with DeepSeek V4 Pro",
    "body": "",
    "url": "https://github.com/aattaran/deepclaude",
    "upvotes": 678,
    "comments": 281,
    "sub": "hackernews",
    "signal": 26,
    "hits": [
      "claude code"
    ]
  },
  {
    "src": "hackernews",
    "id": "46218813",
    "title": "Show HN: Cupcake – Better performance and security for coding agents (via OPA)",
    "body": "We&#x27;re releasing early efforts on coding agent governance with Cupcake [1] - an open-source policy enforcement layer with native integrations. You write rules in policy-as-code (OPA&#x2F;Rego), and Cupcake integrates them into the agent runtime via Hooks.<p>See it in action (Desktop only): <a href=\"https:&#x2F;&#x2F;cupcake-policy-studio.vercel.app&#x2F;example-policies&#x2F;security&#x2F;protecting-paths?harness=claude-code&amp;format=rego\" rel=\"nofollow\">https:&#x2F;&#x2F;cupcake-policy-studio.vercel.app&#x2F;example-policies&#x2F;se...</a><p>Help us build: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;eqtylab&#x2F;cupcake\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;eqtylab&#x2F;cupcake</a><p>We are EQTY Lab, our mission is verifiable AI (identity, provenance, and governance). With the rise of capable agents like Claude Code, it became immediately clear that those deploying these agents need the ability to conduct their own alignment and safety controls. We can’t rely solely on the frontier labs.<p>This is why we created the feature request for Hooks in Claude Code [2], and pivoted away from filesystem and OS-level monitoring once those hooks were implemented. 
Hooks provide the critical points we need:<p>* Evaluation: Checking agent intent and actions.<p>* Prevention: Stopping unsafe or unwanted actions.<p>* Modification: Adjusting the agent&#x27;s output before execution.<p>Policy-as-Code with OPA&#x2F;Rego - While many agent security papers suggest similar policy architectures using invented DSLs, Cupcake is fundamentally built on Open Policy Agent (OPA) and its policy language, Rego [3].<p>We chose Rego because it is:<p>* Industry-Robust: Widely adopted across enterprise DevSecOps and cloud-native environments.<p>* Purpose-Built: Offers unique, mature advantages for defining, managing, and enforcing policy as code.<p>* Enterprise-Oriented: This makes Cupcake compatible with existing enterprise governance frameworks.<p>Cupcake is released under the Apache-2.0 license. We will formalize a path to v1.0.0 in Q1 of 2026. This is an early preview version. The goal with Cupcake is not suppression, but to ensure an agent is able to drive fast without crashing. To collaborate, or join forces: ramos at eqtylab dot io.<p>[1] <a href=\"https:&#x2F;&#x2F;github.com&#x2F;eqtylab&#x2F;cupcake\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;eqtylab&#x2F;cupcake</a><p>[2] <a href=\"https:&#x2F;&#x2F;github.com&#x2F;anthropics&#x2F;claude-code&#x2F;issues&#x2F;712\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;anthropics&#x2F;claude-code&#x2F;issues&#x2F;712</a><p>[3] <a href=\"https:&#x2F;&#x2F;www.openpolicyagent.org&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;www.openpolicyagent.org&#x2F;</a>",
    "url": "https://github.com/eqtylab/cupcake",
    "upvotes": 12,
    "comments": 1,
    "sub": "hackernews",
    "signal": 25.8,
    "hits": [
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "47674729",
    "title": "Show HN: AgentLint – ESLint for your coding agents",
    "body": "I’ve been spending a lot of time with coding agents lately. Across Claude Code, Cursor, OpenCode, Codex, and different models, I kept noticing that some people were getting much better results from the same tools. It became clear that this was not just about prompting.<p>A big part of it was context drift. AGENTS.md, skills, rules, and workflows looked fine, but were no longer aligned with the code.<p>I also learned that more context does not always help. Sometimes it adds noise and wastes tokens. The recent AGENTS.md paper also pushed me to think harder about this, especially around auto-generated context files and &#x2F;init-style workflows.<p>Then I saw Microsoft’s writeup showing a jump from 38.1% to 69% after improving instruction setup. That made me take these files much more seriously.<p>AgentLint came out of that. It’s a small CLI that scans the repo and helps keep context files aligned. After setup, MCP handles most of the ongoing flow.<p>Give it a try: npx @agent-lint&#x2F;cli<p><a href=\"http:&#x2F;&#x2F;samilozturk.github.io&#x2F;agentlint\" rel=\"nofollow\">http:&#x2F;&#x2F;samilozturk.github.io&#x2F;agentlint</a><p>Would really appreciate any feedback or criticism.",
    "url": "https://github.com/samilozturk/agentlint",
    "upvotes": 4,
    "comments": 3,
    "sub": "hackernews",
    "signal": 25.8,
    "hits": [
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "42381139",
    "title": "Show HN: Gentrace – connect to your LLM app code and run/eval it from a UI",
    "body": "Hey HN - Doug from Gentrace here. We originally launched via Show HN in August of 2023 as evaluation and observability for generative AI: <a href=\"https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=37238648\">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=37238648</a><p>Since then, everyone from the model providers to LLM ops companies built a prompt playground. We had one too, until we realized this was totally the wrong approach:<p>- It&#x27;s not connected to your application code<p>- They don&#x27;t support all models<p>- You have to rebuild evals for just this one prompt (can&#x27;t use your end-to-end evals)<p>In other words, it was a ton of work and time to use these to actually make your app better. So, we built a new experience and are relaunching around this idea:<p>Gentrace is a collaborative LLM app testing and experimentation platform that brings together engineers, PMs, subject matter experts, and more to run and test your actual end-to-end app.<p>To do this, use our SDK to:<p>- connect your app to Gentrace as a live runner over websocket (local) &#x2F; via webhook (staging, prod)<p>- wrap your parameters (eg prompt, model, top-k) so they become tunable knobs in the front end<p>- edit the parameters and then run &#x2F; evaluate the actual app code with datasets and evals in Gentrace<p>We think it&#x27;s great for tuning retrieval systems, upgrading models, and iterating on prompts.<p>It&#x27;s free to trial. Would love to hear your feedback &#x2F; what you think!",
    "url": "https://gentrace.ai/",
    "upvotes": 23,
    "comments": 3,
    "sub": "hackernews",
    "signal": 25.8,
    "hits": [
      "llm ops",
      "evals",
      "retrieval"
    ]
  },
  {
    "src": "hackernews",
    "id": "48346958",
    "title": "Show HN: Agents, run any coding agent on your subscription not API costs",
    "body": "Hi HN. I&#x27;m the founder of Phoenix Labs (ex TikTok, Applied AI) and we&#x27;re open sourcing our internal tooling today which is like a toolchain &#x2F; meta-harness for CLI agents useful for really scaling eng and creative work.<p>We are a very small team who&#x27;s building a very ambitious product so we had to find ways to squeeze every ounce of efficiency that we could get our hands on. Harness strengths of different models (Claude, GPTs) and CLI-harnesses (Claude Code, Codex), safe&#x2F;robust browser integration to speed up UX&#x2F;QA testing, teams cli to speed up security reviews and parallelize bug hunting and fixes, and secrets cli with touch id integration so DX is extremely fast.<p>We also noticed that small things like installing marketplaces, or sharing resources per projects (skills, plugins, secrets, subagents, workflows, rules, permission groups, hooks) took a lot of time so we put everything under ~&#x2F;.agents and supported multi-layer dot-agents repos, auto layering and syncing system, user and project level resources and extra so teams can have their own dot-agents repos<p>Fun things like auto-rotation of CC credentials to tackle session limits also exist and save a lot of time. We usually have multiple agent versions installed per agent type.<p>CLI is called `agents` and it injects shims for `claude`, `codex` and other agents. When we need a new feature like routines for keeping CI healthy, we just implement it in a way that&#x27;s compatible with most commonly uses agent-harnesses at our company including Claude Code, Codex, Antigravity&#x2F;Gemini, Cursor&#x2F;Grok CLI and more<p>Install:<p>curl -fsSL agents-cli.sh&#x2F;install.sh | sh\n# or: bun install -g @phnx-labs&#x2F;agents-cli<p>Source: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;phnx-labs&#x2F;agents-cli\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;phnx-labs&#x2F;agents-cli</a><p>Honest limits: macOS works best. Linux works. Touch ID is macOS only. 
But, it&#x27;s MIT :)<p>Want feedback on the developer experience. And my apologies if your agent harness is not supported throughout. Please feel free to make a PR and happy to hop on a chat&#x2F;call<p>Muqsit",
    "url": "https://agents-cli.sh",
    "upvotes": 6,
    "comments": 2,
    "sub": "hackernews",
    "signal": 25.7,
    "hits": [
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "47525283",
    "title": "Show HN: Vectimus – Cedar policy enforcement for AI coding agents",
    "body": "Hey HN.  I built Vectimus because coding agents keep doing things they shouldn&#x27;t and there&#x27;s no runtime governance layer for the developer workstation.<p>The problem: Claude Code, Cursor, Gemini CLI and GitHub Copilot let agents execute shell commands, write files and call MCP servers.  Most developers disable the permission prompts because they slow you down.  But that means the agent can rm -rf &#x2F;, read your .env, push to production or call a compromised MCP server with nothing watching.<p>Vectimus intercepts every tool call and evaluates it against 78 Cedar policies containing 369 rules before execution.  Cedar is the policy language AWS chose for AgentCore Policy (GA this month).  Evaluation runs locally via a persistent daemon in under 10ms.  Zero network calls.  Zero telemetry.  Every evaluation produces an Ed25519-signed receipt so you have cryptographic proof of what was allowed and denied.<p>Every policy maps to a real incident.  CVE-2025-6514 compromised 437,000+ developer environments through a malicious MCP OAuth proxy.  The GitHub MCP server was hijacked via a crafted issue to exfiltrate private repo data.  A Terraform agent destroyed production infrastructure.  These happened.<p>How it hooks in: Claude Code intercepts shell commands, file writes, MCP calls and web fetches.  Cursor governs shell commands, file reads&#x2F;writes and MCP tool calls at the editor level.  Copilot intercepts terminal commands, file edits, deletes and git pushes.  Gemini CLI uses Gemini&#x27;s native hook system.  MCP servers are blocked by default and allowlisted per-project with input inspection.  
Observe mode lets you see what would be blocked before you enforce.<p>I also built Sentinel (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;vectimus&#x2F;sentinel\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;vectimus&#x2F;sentinel</a>), a three-agent pipeline that scans for new agentic AI security incidents daily, drafts Cedar policies, replays the incident in a sandbox to prove the policy catches it, then opens a PR.  The pipeline is governed by Vectimus.  Every finding and policy draft is public.<p>All 10 OWASP Agentic Top 10 categories covered.  Compliance annotations for SOC 2, NIST AI RMF, NIST CSF 2.0, EU AI Act, ISO 27001, CIS Controls and SLSA.  Apache 2.0.  Solo founder, built in Ireland.<p>Happy to go deep on the Cedar policy design, the hook architecture, the signed receipts or the OWASP mapping.",
    "url": "https://github.com/vectimus/vectimus",
    "upvotes": 3,
    "comments": 2,
    "sub": "hackernews",
    "signal": 25.6,
    "hits": [
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "37777683",
    "title": "Show HN: HoneyHive – A unified evaluation and monitoring platform for LLM apps",
    "body": "Hey HN! We’re Mohak and Dhruv from HoneyHive (https:&#x2F;&#x2F;honeyhive.ai). HoneyHive is a set of tools built around evaluating, monitoring, and iteratively improving LLM systems to make them production-ready and reliable.<p>We’re sure everyone has seen the general bugginess LLMs introduce into products and how hard it is to improve these models. Most LLM products are generally assumed to be buggy, and everyone treats them as such – it works well sometimes, but I won’t bet on it. This is obviously not going to work in production at scale.<p>Most teams we talked to want to iterate and improve their LLM apps, much like what they’ve been doing for decades with traditional software, but the tooling and workflows to do so are broken in many ways:<p>- Offline evaluations are manual, time-consuming and costly<p>- Product analytics tools used to track user feedback aren’t built to handle unstructured data<p>- In more complex pipelines like autonomous agents or RAG, the LLM is not the only issue – vector databases and other APIs are often the bigger issue, making it hard to debug<p>As we see it, the typical workflow across most companies is: OpenAI Playground -&gt; LangChain&#x2F;CLI for prototyping -&gt; Google Sheets for evaluations -&gt; Mixpanel, Sentry, or Streamlit&#x2F;Retool for monitoring. This flow doesn’t scale to multi-step LLM pipelines like agents or RAG, let alone multimodality. We are convinced that companies here will decide to buy external tooling instead of slowing themselves down and wasting valuable developer time maintaining these internal tools - given how quickly OpenAI’s schemas keep evolving<p>We both saw this workflow at Microsoft &amp; Templafy before starting HoneyHive, so we aimed to build a tool that works from the prototype stage to scaling in production. 
From the start, we focused on building abstractions that generalize across a single LLM and multimodal agents.<p>- Studio: Our Playground integrates into any model that follows OpenAI API schema and can call an arbitrary javascript block as a “tool” - this allows us to integrate across vector dbs, search APIs, etc. Aimed to help teams collaborate early in the prototyping phase<p>- Offline Evaluations: Our Evaluations SDK is based on arbitrary configuration dictionaries and I&#x2F;O schemas, extending quickly across single prompts, agents, chains, and RAG pipelines. Our Metric interface can then ingest LLM stack traces and compute metrics across every step during testing and monitoring.<p>- Online Monitoring: Here, we took heavy inspiration from product, software &amp; ML observability to marry them for multimodal LLM pipelines. The schemas are highly configurable, allowing you to enrich each event with any config properties, custom metadata, user properties, feedback or metrics - all of which can be used to slice and dice your data to discover trends and anomalies<p>Here’s a full demo: https:&#x2F;&#x2F;www.loom.com&#x2F;share&#x2F;e36aecf20f09428b8b2172d8fb4be1ff?sid=07242547-db5e-471d-a8d3-760c0f4bc513<p>We have enabled multiple companies with this stack. MultiOn, a company building a multimodal browser agent, has used our platform to evaluate and monitor their agent, and fine-tune open source models for acting on browser DOMs. They have set up moderation filters in prod, using our Metrics docker environment to run an arbitrary Python code-block or an LLM evaluation function over logs to enrich it. They’ve also integrated our eval pipelines with their fine-tuning pipelines, allowing them to automatically benchmark any new fine-tuned models and automate the data flywheel.<p>We launched our public beta yesterday and will be making the platform open for general access in the coming weeks! 
We apologize for the public beta form before login haha.<p>As you can imagine, building a developer platform for multimodal agents is an intricate engineering challenge, so any feedback from the HN community will be very helpful for us! We look forward to hearing your thoughts, questions and feedback!",
    "url": "https://news.ycombinator.com/item?id=37777683",
    "upvotes": 3,
    "comments": 2,
    "sub": "hackernews",
    "signal": 25.6,
    "hits": [
      "rag pipeline",
      "langchain",
      "benchmark",
      "vector"
    ]
  },
  {
    "src": "hackernews",
    "id": "46990733",
    "title": "Show HN: 20+ Claude Code agents coordinating on real work (open source)",
    "body": "Single-agent LLMs suck at long-running complex tasks.<p>We’ve open-sourced a multi-agent orchestrator that we’ve been using to handle long-running LLM tasks. We found that single LLM agents tend to stall, loop, or generate non-compiling code, so we built a harness for agents to coordinate over shared context while work is in progress.<p>How it works:\n1. Orchestrator agent that manages task decomposition\n2. Sub-agents for parallel work\n3. Subscriptions to task state and progress\n4. Real-time sharing of intermediate discoveries between agents<p>We tested this on a Putnam-level math problem, but the pattern generalizes to things like refactors, app builds, and long research.\nIt’s packaged as a Claude Code skill and designed to be small, readable, and modifiable.<p>Use it, break it, tell me about what workloads we should try and run next!",
    "url": "https://github.com/mutable-state-inc/lean-collab",
    "upvotes": 53,
    "comments": 39,
    "sub": "hackernews",
    "signal": 25.4,
    "hits": [
      "claude code"
    ]
  },
  {
    "src": "hackernews",
    "id": "47170501",
    "title": "Ask HN: Why do AI coding agents refuse to save their own observations?",
    "body": "I&#x27;ve spent months building tooling for AI coding agents and hit something I can&#x27;t fully explain.<p>If you give an agent (Claude Code, Cursor, Codex) a tool to save observations — &quot;save_observation: persist this insight for future sessions&quot; — and explicitly instruct it to use the tool in system prompts, config files, everywhere you can, it calls it maybe 30% of the time.<p>The agent will happily use tools that help it complete the current task. But a tool that only benefits future sessions? Almost never.<p>My working theory: these models are optimized for task completion within the current context window. Saving an observation has zero value for the current task — it&#x27;s a token cost with no immediate reward. The model has learned that every token spent on &quot;let me save this for later&quot; is a token not spent on the actual work. The incentive structure is wrong at the training level.<p>I ended up building a passive observation system that watches what the agent does and infers observations from tool calls and AST-level code diffs, without requiring agent cooperation. But I&#x27;m curious if others have found ways to make agents reliably self-document.<p>Has anyone solved this? Techniques like:\n- Prompt structures that actually get agents to save context\n- Fine-tuning approaches that reward knowledge retention\n- Alternative architectures for persistent agent memory<p>Or is passive observation the only reliable path when the agent won&#x27;t cooperate?",
    "url": "https://news.ycombinator.com/item?id=47170501",
    "upvotes": 2,
    "comments": 1,
    "sub": "hackernews",
    "signal": 25.3,
    "hits": [
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "45053581",
    "title": "Show HN: Devplan – Generate specs and coding prompts with deep context",
    "body": "Hi, I’m Chris and my partners and I are building Devplan, an AI product development tool that helps teams go from idea to working code faster.<p>What Devplan does:<p>- Creates deep contextual understanding from GitHub and the web with our open source context engine: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;devplaninc&#x2F;contextify\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;devplaninc&#x2F;contextify</a><p>- Generates right-sized PRDs, user stories, and tech design based on company context<p>- Gives a ballpark effort and complexity estimate for every user story<p>- Breaks down requirements into structured coding prompts for tools like Claude Code, Cursor, Windsurf, or JetBrains Junie<p>- Integrates with Linear and Jira to push generated project docs and tickets to your tracking system<p>- Lets you kick off projects with images to refine specs with mocks, diagrams, or screenshots<p>- Exports detailed coding prompts as standalone files or use our CLI to work with them directly<p>Why we built it:<p>We believe the next generation of product development will be built with AI at its core. But we’ve seen first-hand how the current tools fall short:<p>- Docs from ChatGPT or Claude are useful but too general and lack context for real workflows<p>- AI coding agents lose context quickly in large repos and generated code often requires re-work<p>- Most approaches to planning for AI coding take too long and aren&#x27;t shared or reviewed, which slows teams down<p>AI should remove that friction, not create more of it. We built Devplan to make planning and execution one connected flow. It starts with outcomes, adapts to the size of your project, and produces structured inputs for the coding tools you already use. 
Instead of bouncing between AI assistants, PM docs, and code editors, Devplan ties it all together so you can move faster without losing context.<p>We have an MVP template for side projects, but the platform is being built for real teams who want to ship product with confidence while staying lean. We are still early and we’re iterating quickly.<p>Would love to hear feedback from other builders. What’s working for you when it comes to planning and building with AI?<p>P.S. If you want to try it, public beta is open: <a href=\"https:&#x2F;&#x2F;www.devplan.com\" rel=\"nofollow\">https:&#x2F;&#x2F;www.devplan.com</a>",
    "url": "https://www.devplan.com/",
    "upvotes": 6,
    "comments": 0,
    "sub": "hackernews",
    "signal": 25.3,
    "hits": [
      "claude code",
      "coding agent"
    ]
  },
  {
    "src": "hackernews",
    "id": "47125210",
    "title": "Show HN: Irpapers – Visual embeddings vs. OCR trade-offs in scientific PDFs",
    "body": "Hey HN, we are releasing IRPAPERS to answer a highly pragmatic question: when building a RAG pipeline over PDFs, should you OCR the text or just embed the raw page images?<p>Processing PDFs in production usually involves stringing together brittle OCR heuristics. While recent multimodal embeddings (like ColModernVBERT or ColPali) allow you to skip OCR entirely and retrieve directly from visual layouts, we wanted to measure if the computational overhead is actually worth the utility.<p>The short answer: Transformer-based image pipelines won&#x27;t be perfect for every use-case, but they fix exactly what OCR breaks.<p>Here is what we found benchmarking 3,230 pages of dense scientific literature:<p>Complementary Bottlenecks: Text representations (BM25 + dense vectors) are highly efficient for exact lexical constraints (e.g., finding a specific acronym like &quot;HyDE&quot;). Conversely, image embeddings shine on spatial architecture diagrams and t-SNE plots where OCR serialization just turns into structural garbage.<p>Multimodal Hybrid Search: Because these failure modes are almost perfectly orthogonal, fusing the two signals gives you the best performance out of the box. By combining them, we pushed top-1 recall to 49% (beating text alone at 46%).<p>The Memory Constraint: Late-interaction image embeddings produce thousands of vectors per page, creating a massive storage bottleneck. To address this need, we evaluate MUVERA encoding. Under the hood, this compresses multi-vector representations into a single fixed-dimensional encoding via SimHash, allowing you to use standard HNSW indexing without the paralyzing memory overhead.<p>In practice, if you are building a RAG workflow today, text-based context still provides higher downstream utility for the actual generation step (0.82 vs 0.71 alignment). 
Instead of picking one modality and dealing with its blind spots, start with hybrid text search as a sensible default, and inject multi-vector image embeddings to catch the visual edge-cases.<p>We’ve open-sourced the benchmark and the evaluation recipes:<p>Paper <a href=\"https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2602.17687\" rel=\"nofollow\">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2602.17687</a>\nIRPAPERS dataset on HuggingFace at huggingface.co&#x2F;weaviate&#x2F;IRPAPERS and GitHub\nat github.com&#x2F;weaviate&#x2F;IRPAPERS<p>Our experimental code is also available on GitHub at\ngithub.com&#x2F;weaviate&#x2F;query-agent-benchmarking<p>Happy to answer any questions about the evaluation pipeline, the cold start problem of visual benchmarks, or the specific retrieval trade-offs we saw.",
    "url": "https://github.com/weaviate/query-agent-benchmarking",
    "upvotes": 5,
    "comments": 0,
    "sub": "hackernews",
    "signal": 25.2,
    "hits": [
      "rag pipeline",
      "benchmark",
      "retrieval",
      "vector"
    ]
  }
]