Article
The report compares IDOR (insecure direct object reference) detection by Semgrep’s in-house multimodal harness against a plain prompt-only setup for several commercial and open-weight models. All systems are evaluated on the same real open-source dataset using a fixed IDOR prompt and scored by F1. The post explains F1, precision, and recall, and emphasizes that Semgrep’s pipeline includes endpoint enumeration and guided context selection, while the open-weight runs rely only on the base prompt plus minimal instructions. Results place Semgrep Multimodal with GPT-5.5 first at 61% and Opus 4.8 second at 53%; GLM-5.2 scores 39% and Claude Code Opus 4.6 scores 37%, with other models such as MiniMax M3 and Kimi K2.7 trailing. GLM-5.2 is presented as notable because it is open weight, uses a mixture-of-experts (MoE) architecture with about 750B total and 40B active parameters, supports contexts of up to 1M tokens, performs strongly on coding benchmarks, and is cheap to run, costing about $0.17 per true positive in this test. The article adds operational context by noting that open-weight availability can improve deployment control and security posture, even though the training data and full stack remain proprietary. Readers in the comments broadly agree it is an impressive value story but challenge the methodology, especially the apples-to-oranges comparison between Semgrep’s harnessed multimodal setup and prompt-only models, and the benchmark’s mix of older and newer models. They also point out a metric inconsistency: Claude Code is reported as 32% in the prose and 37% in the table. A number of commenters share cost and workflow anecdotes, some praising GLM’s speed and lower price, while others report unreliable behavior, provider friction, and questions about reproducibility. The thread also reflects skepticism toward potential commercial motivation, benchmark trust, and U.
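Because every score in the summary is an F1 value, it may help to see how precision, recall, and F1 relate. The sketch below uses the standard definitions; the function name and the counts are hypothetical illustrations, not figures from the report.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts. Empty denominators yield 0.0."""
    # Precision: share of reported findings that are real vulnerabilities.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real vulnerabilities that were reported.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.75 recall=0.60 F1=0.67
```

Since F1 is a harmonic mean, a scanner that reports many false positives or misses many real issues is pulled toward the weaker of the two numbers, which is why a single F1 figure can hide quite different precision and recall profiles.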
Commenters are split between practical endorsement and skepticism. Several report switching daily coding and security-adjacent work to GLM-5.2 because it felt faster, cheaper, and sufficiently capable for common tasks, with some saying token costs were far below closed alternatives. Others caution that open-weight results are not stable across tasks, citing mixed experiences, occasional model “nonsense,” provider instability, and concerns that benchmark scores can vary by setup and verification tooling. Multiple participants object to treating Claude Code as an LLM when it is an agent harness, argue for fairer head-to-head comparisons with matching harnesses, and ask for full cost-per-vulnerability reporting across all models. A few raise data-quality concerns, including the inconsistent Claude percentages in the post and over-reliance on IDOR as a narrow vulnerability class, and question whether knowledge cutoffs or tool access could shift outcomes. Security practitioners also discuss local deployment constraints, export-control risk, and whether broader open-weight progress explains a market and geopolitical shift. Despite doubts, most comments converge on one point: harness quality and economics may matter more than raw model choice, while open-weight viability remains context dependent and is not yet uniformly proven.
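The cost-per-vulnerability reporting that commenters ask for reduces to simple arithmetic: total spend divided by confirmed true positives. A minimal sketch follows, assuming each run records a total cost and a true-positive count; the function name, model labels, and numbers are hypothetical placeholders, not data from the post.

```python
from typing import Optional


def cost_per_true_positive(total_cost_usd: float, true_positives: int) -> Optional[float]:
    """Return spend per confirmed finding, or None when nothing was found."""
    if true_positives <= 0:
        return None  # undefined: the run produced no true positives
    return total_cost_usd / true_positives


# Hypothetical runs: (total cost in USD, confirmed true positives).
runs = {
    "model_a": (5.00, 25),
    "model_b": (12.00, 30),
    "model_c": (3.00, 0),
}

for name, (cost, tp) in runs.items():
    cpt = cost_per_true_positive(cost, tp)
    label = f"${cpt:.2f}" if cpt is not None else "n/a"
    print(f"{name}: {label} per true positive")
```

Reporting this figure alongside F1 for every model would let readers weigh detection quality against spend, which is the comparison the thread says is missing.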