In the space of detection engineering, I’ve been experimenting with crafting detection rules using AI agents, and one thing I quickly realized was the need for a Self-Criticizing (or Reviewing) Agent. The initial output is almost always too simplistic, and having a second agent review the output and feed its critique back to the first agent invariably improves the quality of the detection rule.
Setup
Here’s the rough architecture. Not the exact prompts, but the workflow is fairly similar.
For this experiment, I used OpenAI's openai-agents-python library.
We have a Detection Engineering Agent to write the Elastic Queries based on the input scenario
(Not built yet) The generated EQL should have its syntax and fields validated against the actual dataset to make sure the query can run.
The Reviewer Agent takes in the scenario and EQL, and generates a score and a list of recommendations
The final score is a weighted average of individual evaluations
- Coverage (40%)
- Durability (25%)
- Noise Levels (35%)
I decided we should optimize the rule for Noise Levels and Coverage, while Durability can be improved by adding other detections to complement this rule, or via composite detections (detections made from an aggregation of other detections).
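The whole loop can be sketched in a few lines of Python. The two agent callables here are hypothetical stubs standing in for the real LLM-backed agents (the actual workflow uses the openai-agents SDK), but the weights are the ones above and the passing threshold is the 0.8 described next.

```python
# Sketch of the generate -> review -> revise loop.
# The engineer/reviewer callables are hypothetical stand-ins for LLM agents.
from typing import Callable, Tuple

WEIGHTS = {"coverage": 0.40, "durability": 0.25, "noise": 0.35}
PASSING_SCORE = 0.8
MAX_ITERATIONS = 5

def weighted_score(coverage: float, durability: float, noise: float) -> float:
    """Weighted average of the three evaluations (noise is inverted here:
    1.0 means no noise at all)."""
    return (WEIGHTS["coverage"] * coverage
            + WEIGHTS["durability"] * durability
            + WEIGHTS["noise"] * noise)

def refine_detection(
    scenario: str,
    engineer: Callable[[str, str], str],
    reviewer: Callable[[str, str], Tuple[float, float, float, str]],
) -> str:
    """Iterate engineer -> reviewer until the weighted score passes."""
    feedback, eql = "", ""
    for _ in range(MAX_ITERATIONS):
        eql = engineer(scenario, feedback)                   # Detection Engineer Agent
        cov, dur, noise, feedback = reviewer(scenario, eql)  # Reviewer Agent
        if weighted_score(cov, dur, noise) >= PASSING_SCORE:
            break
    return eql
```

The iteration cap matters in practice: since scores plateau, an unbounded loop would just burn tokens chasing a perfect score that never arrives.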
We set the passing score to >= 0.8 to get a good-enough detection. From experience (and a lot of wasted tokens), the score seems to plateau around 0.85, which makes sense: our goal is to create a “good-enough” detection, and there’s no perfect rule with 100% Coverage, 100% Durability and 0% Noise.

Let’s analyze the output of each iteration with the prompt:
Detect the usage of mimikatz.exe using powershell or cmd.exe
Detection Engineer Iteration 1
This initial iteration only gets the scenario as a prompt and generates a detection in EQL
```
# Det eng: I have created this EQL for detecting mimikatz usage on powershell or cmd.exe:
process where process_name in ("powershell.exe", "cmd.exe")
  and process_command_line != null
  and stringcontains(process_command_line, "sekurlsa")
  or stringcontains(process_command_line, "lsadump")
  or stringcontains(process_command_line, "kerberos")
  or stringcontains(process_command_line, "mimikatz")
  or stringcontains(process_command_line, "privilege::debug")
  or stringcontains(process_command_line, "logonpasswords")
```
Is this good enough? At first glance, yes, and it’s probably what a junior analyst might write. But let’s see what our “Senior Expert Detection Engineer Reviewer” has to say about it.
Reviewer Feedback 1
This detection has a quality rating of 0.55

Feedback:

Coverage:
- Consider expanding the process list to include commonly used alternatives or renamed binaries often used for LOLBins (e.g., powershell_ise.exe, pwsh.exe). Mimikatz may also be loaded using other scripting hosts or rundll32.exe with reflective DLL injection techniques, so detection based solely on cmd or powershell is narrower than ideal.
- Rewrite the condition logic such that all command line keyword checks are paired with the process condition to avoid capturing unrelated events where e.g. process_command_line contains 'sekurlsa' for a non-cmd/powershell process.

Durability:
- Attackers can easily evade this detection by obfuscating or encoding keywords, using alternate argument formats, or changing the case (e.g., 'SEKURLSA', base64 encoding in PowerShell). Implement case-insensitive matching and consider using regex patterns to catch more obfuscated strings.
- Include monitoring for encodedCommand or -enc in PowerShell, which is commonly used for evasion.

False Positive Rate:
- This query will likely generate FPs from legitimate tools, test scripts, or documentation mentions in command lines. Tighten detection by combining multiple keywords indicative of Mimikatz together (e.g., require 'sekurlsa' and 'logonpasswords' together), or by noting when questionable keywords occur with suspicious parent processes rather than for every mention.
- Consider excluding common and known benign script paths from the detection to reduce noise (e.g., trusted admin workstations or script directories).
- Add constraints such as these keywords occurring at the start or in specific argument positions, rather than anywhere within a command line, to reduce FPs.
0.55... That’s not too good. But we see some good feedback: the Reviewer keeps the scope to powershell.exe and cmd.exe, while also commenting that the scenario might be too narrow if we only focus on PowerShell and cmd. It also suggests reducing FP rates by using keywords in conjunction rather than in disparate conditions, and enhancing the detection to catch evasion attempts via encoding.
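The conjunction issue the Reviewer flags comes down to operator precedence: in EQL, as in most languages, `and` binds tighter than `or`, so the process check in Iteration 1 only guards the first keyword. A minimal illustration in Python, which follows the same precedence rule:

```python
# `and` binds tighter than `or`, so a guard followed by a chain of
# `or` clauses only applies to the first clause.
is_shell = False      # the process is NOT powershell/cmd
has_sekurlsa = False
has_lsadump = True    # "lsadump" appears in some unrelated command line

# How Iteration 1's query is actually parsed: (guard and kw1) or kw2 ...
fires_as_written = is_shell and has_sekurlsa or has_lsadump   # fires anyway!

# What was intended: the process guard applies to every keyword.
fires_as_intended = is_shell and (has_sekurlsa or has_lsadump)
```

Without the explicit parentheses, any event mentioning 'lsadump' fires the rule regardless of process, which is exactly the unscoped-keyword problem the Reviewer calls out.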
We pump this feedback back into the Detection Engineer Agent and get it to improve the detection. We run this loop multiple times until we get a passing score of ≥ 0.8.

Detection Engineer Iteration 2
```
# I have created this EQL for detecting mimikatz usage on powershell or cmd.exe:
process where process_name in ("powershell.exe", "pwsh.exe", "powershell_ise.exe", "cmd.exe")
  and process_command_line != null
  and (
    regexp(process_command_line, "(?i)mimikatz")
    or (
      regexp(process_command_line, "(?i)sekurlsa")
      and regexp(process_command_line, "(?i)logonpasswords")
    )
    or regexp(process_command_line, "(?i)privilege::debug")
    or (
      process_name in ("powershell.exe", "pwsh.exe", "powershell_ise.exe")
      and (
        regexp(process_command_line, "(?i)-enc(odedCommand)?")
        or regexp(process_command_line, "(?i)--encodedcommand")
      )
    )
  )
  and not (
    /* Exclude known benign admin or script directories */
    process_command_line =~ "C:\\AdminTools\\*"
    or process_command_line =~ "C:\\Scripts\\*"
  )
```
Reviewer Iteration 2
This detection has a quality rating of 0.68

These are the list of recommendations:

Detection Coverage:
- The detection does well to look for Mimikatz-specific keywords and common switches, with special attention to both cmd and PowerShell-based invocations. However, it misses other typical Mimikatz-related terms (such as 'kerberos::', 'lsadump', 'token::', etc.) and alternative obfuscation tricks (e.g., using different case, spacing, or splitting). To increase coverage, include more keywords and extend regular expressions to be more resilient.
- Consider also looking for script block logs in PowerShell logs that contain suspicious base64-encoded blocks, since -EncodedCommand alone is generic. Extracting, decoding, and searching for suspicious tokens inside can improve coverage for encoded usage.

Durability:
- The rule relies mostly on specific keywords and common switches; an attacker could evade by obfuscating command arguments, e.g., using alternate character encoding, splitting commands (e.g., 'mi' + 'mikatz'), or loading Mimikatz as a DLL under another name. You may improve durability by adding regex patterns for split or unicode-encoded words (e.g., m[i]mik[a]tz), or by incorporating detection for suspicious DLL or PE file loads from PowerShell or cmd.

False Positive Rate:
- The rule may cause false positives, especially due to the broad detection of '-EncodedCommand' and '--encodedcommand', which are frequently used in legitimate automation and admin scripts. To reduce noise, restrict these to cases when other evidence of credential access exists in the same command, or require the presence of both the switch and Mimikatz-like terms in the decoded data.
- Make the exclusion on script/admin tools path case-insensitive, and consider expanding exclusions with additional known-good script directories if appropriate.
Overall, recommend splitting the encoded command check: only escalate if encodings are present AND the base64-decoded payload includes known credential access terms. Augment the Mimikatz-specific regexes to be more robust, and add more generally abused terms. Incorporate path-based or parent-process-based context to cut down on admin tool noise.
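The Reviewer's suggestion to decode -EncodedCommand payloads before keyword matching is straightforward outside of EQL: PowerShell encodes the command as base64 over UTF-16LE text. A hedged sketch of that decode-then-scan step (the function names are my own; this is pre-processing logic, not something EQL itself can express):

```python
import base64
import re

# Credential-access terms from the rule's keyword list.
CREDENTIAL_TERMS = re.compile(
    r"(?i)sekurlsa|lsadump|kerberos::|token::|logonpasswords|mimikatz"
)

def decode_encoded_command(b64: str) -> str:
    """PowerShell's -EncodedCommand argument is base64 over UTF-16LE."""
    return base64.b64decode(b64).decode("utf-16-le")

def encoded_payload_is_suspicious(b64: str) -> bool:
    """Escalate only if the DECODED payload contains credential-access terms."""
    try:
        decoded = decode_encoded_command(b64)
    except Exception:
        return False  # not valid base64/UTF-16LE; leave it for other rules
    return bool(CREDENTIAL_TERMS.search(decoded))

# Example: the kind of payload an attacker would pass to -EncodedCommand
payload = base64.b64encode("sekurlsa::logonpasswords".encode("utf-16-le")).decode()
```

This is the "require both the switch and Mimikatz-like terms in the decoded data" recommendation: the `-enc` switch alone stays benign; only the combination escalates.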
Detection Engineer Iteration 3
```
# I have created this EQL for detecting mimikatz usage on powershell or cmd.exe:
process where process_name in ("powershell.exe", "pwsh.exe", "powershell_ise.exe", "cmd.exe")
  and process_command_line != null
  and (
    (
      /* Robust regex for classic and obfuscated Mimikatz patterns */
      regexp(process_command_line, "(?i)m[!1il\s\W]*i[!1il\s\W]*m[!1il\s\W]*i[!1il\s\W]*k[xk\s\W]*a[t\s\W]*[z2]")
      or regexp(process_command_line, "(?i)(sekurlsa|lsadump|kerberos::|token::|privilege::debug|crypto::|dpapi::|logonpasswords|sam::)")
    )
    and (
      /* Combine with at least one further suspicious signal: suspicious
         module/technique, EncodedCommand, or multiple unique modules in one command */
      regexp(process_command_line, "(?i)lsadump|sekurlsa|kerberos::|token::")
      or regexp(process_command_line, "(?i)-enc(odedCommand)?|--encodedcommand|/enc")
      or regexp(process_command_line, "(?i)invoke-expression|reflection|amsi.bypass|[\w]*add-type[\W]*\{")
    )
    or (
      /* EncodedCommand with Mimikatz terms in decoded PowerShell (if available as 'script_block_decoded') */
      regexp(process_command_line, "(?i)-enc(odedCommand)?|--encodedcommand|/enc")
      and script_block_decoded != null
      and regexp(script_block_decoded, "(?i)m[!1il]*i[!1il]*m[!1il]*i[!1il]*k[xk]*a[t]*[z2]|sekurlsa|lsadump|kerberos::|token::|logonpasswords")
    )
  )
  and not (
    /* Case-insensitive exclusion for common benign tooling and organization-specific dirs */
    regexp(process_command_line, "(?i)c:\\admin[^\\]*\\|c:\\scripts\\|c:\\windows\\system32\\windows-powerShell\\v1.0\\modules|c:\\users\\[^\\]+\\automation-scripts|c:\\it-tools|d:\\prod-scripts|z:\\trusted-scripts")
  )
```
Reviewer Iteration 3
This detection has a quality rating of 0.8

These are the list of recommendations:

Coverage: The detection covers both process names for PowerShell, pwsh, and cmd.exe, and robustly checks for several obfuscated and classic Mimikatz keyword patterns and module terms. It also checks for suspicious switches (e.g., encodedcommand) and attempts to match decoded script blocks (if present). However, consider adding additional PowerShell host variants (e.g., wsmprovhost.exe) and potentially supporting cases where attackers might use dropped script files or indirect execution modules. When meaningful, monitor child process creation from these shells targeting LSASS or similar credential dumping behaviors, if available in the logs.

Durability: The regex patterns for mimikatz, module arguments, and command obfuscation attempt to combat simple evasion. However, sophisticated attackers could still evade by chunking arguments (e.g., string concatenation or indirect invocation), using alternate casing/encodings, or proxy scripts that invoke the core payload. To improve durability, look at not just command-line content but also script block content (where available), and consider chained behaviors (e.g., process access to LSASS after these suspicious commands) in your logic. Add additional regex for common string splitting and alternate character representations.

False Positive Rate: The current noise mitigation includes exclusions for known admin/script folders, but the breadth of keyword matching (especially in organizations using legitimate PowerShell modules for admin/automation) may still capture some benign use, especially scripts mentioning relevant dlls, modules, or using encoded commands for valid reasons. Tighten exclusions with additional context: for example, require a combination of rarely seen processes, odd parent-child relationships, or suspicious process ancestry. Consider whitelisting environment-specific patterns seen in benign automation after review.
If feasible, add more context on user-interactive sessions versus scheduled/scripted, or look for simultaneous anomalous process ancestry.
We see the score inching upwards on each iteration until it hits 0.8, at which point the loop breaks and we have our final query!

As mentioned earlier, after running the loop multiple times, I’ve seen it plateau around 0.85, and expecting a perfect score of 1.0 is probably not feasible. (0% FP rates? Come on…) The final output also stays in scope, only detecting mimikatz usage over powershell.exe and cmd.exe, while addressing the Reviewer’s inputs iteratively.

Human Evaluation
So the detection is written, it triggers the CI/CD pipeline, and you, as a human reviewer, receive a PR with this detection code.
```
process where process_name in ("powershell.exe", "pwsh.exe", "powershell_ise.exe", "cmd.exe")
  and process_command_line != null
  and (
    (
      /* Robust regex for classic and obfuscated Mimikatz patterns */
      regexp(process_command_line, "(?i)m[!1il\s\W]*i[!1il\s\W]*m[!1il\s\W]*i[!1il\s\W]*k[xk\s\W]*a[t\s\W]*[z2]")
      or regexp(process_command_line, "(?i)(sekurlsa|lsadump|kerberos::|token::|privilege::debug|crypto::|dpapi::|logonpasswords|sam::)")
    )
    and (
      /* Combine with at least one further suspicious signal: suspicious
         module/technique, EncodedCommand, or multiple unique modules in one command */
      regexp(process_command_line, "(?i)lsadump|sekurlsa|kerberos::|token::")
      or regexp(process_command_line, "(?i)-enc(odedCommand)?|--encodedcommand|/enc")
      or regexp(process_command_line, "(?i)invoke-expression|reflection|amsi.bypass|[\w]*add-type[\W]*\{")
    )
    or (
      /* EncodedCommand with Mimikatz terms in decoded PowerShell (if available as 'script_block_decoded') */
      regexp(process_command_line, "(?i)-enc(odedCommand)?|--encodedcommand|/enc")
      and script_block_decoded != null
      and regexp(script_block_decoded, "(?i)m[!1il]*i[!1il]*m[!1il]*i[!1il]*k[xk]*a[t]*[z2]|sekurlsa|lsadump|kerberos::|token::|logonpasswords")
    )
  )
  and not (
    /* Case-insensitive exclusion for common benign tooling and organization-specific dirs */
    regexp(process_command_line, "(?i)c:\\admin[^\\]*\\|c:\\scripts\\|c:\\windows\\system32\\windows-powerShell\\v1.0\\modules|c:\\users\\[^\\]+\\automation-scripts|c:\\it-tools|d:\\prod-scripts|z:\\trusted-scripts")
  )
```
Personally, I think it’s pretty good, but without proper test cases, emulations and observations of real data, I can’t make a concrete judgement. (a cop out, I know).
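That said, a few spot checks are cheap. The core obfuscation regex from the rule translates directly into Python's `re` syntax, so we can at least confirm it behaves as intended on obvious inputs; this is no substitute for real emulation data:

```python
import re

# Core Mimikatz-obfuscation pattern copied from the generated rule, unchanged.
MIMIKATZ = re.compile(
    r"(?i)m[!1il\s\W]*i[!1il\s\W]*m[!1il\s\W]*i[!1il\s\W]*k[xk\s\W]*a[t\s\W]*[z2]"
)

# Command lines the pattern should catch.
hits = [
    "mimikatz.exe sekurlsa::logonpasswords",
    "MiMiKaTz",              # case variation
    "m i m i k a t z",       # spacing obfuscation
]

# Benign command lines it should ignore.
misses = [
    "cmd.exe /c dir",
    "powershell Get-Process",
]
```

Note the pattern still requires every letter of "mimikatz" to appear in order; substitutions like `m1m1katz` (digits for letters) would slip past it, which is worth flagging in the PR review.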
Human input is definitely required, since the rule added exclusions for file paths and tools; this should be done carefully in the context of the environment the rule runs in.
What we need are test cases from Red Team emulations, or even artificial test cases generated by a QA Agent (also going through the same critique cycle to refine the test cases). The test cases should also be evaluated by a human, and injected into a rolling X (30-90?) days of production data for a soft launch where its performance can be reviewed.
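A minimal sketch of that soft-launch evaluation, assuming we can replay the rule over labeled events; the `Event` shape and the `rule` callable are my own hypothetical stand-ins for actually running the EQL against the datastore:

```python
# Replay a detection over labeled events (emulation injects + production data)
# and report the confusion-matrix counts a human reviewer would look at.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Event:
    command_line: str
    is_attack: bool   # label from the Red Team emulation / QA Agent test case

def backtest(rule: Callable[[Event], bool], events: Iterable[Event]) -> dict:
    tp = fp = fn = tn = 0
    for ev in events:
        fired = rule(ev)
        if fired and ev.is_attack:
            tp += 1
        elif fired:
            fp += 1
        elif ev.is_attack:
            fn += 1
        else:
            tn += 1
    return {"true_positives": tp, "false_positives": fp,
            "false_negatives": fn, "true_negatives": tn}
```

The false-positive count over the rolling production window is the number that matters most for the soft launch: it is the alert load the SOC actually pays for.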
Out-of-Band Evaluation
Instead, what we can do is use another model for evaluation.

To make sure the model is not praising itself (ownself check ownself), we manually copy and paste the EQL to Claude and ask it to rate the query:
Rate this EQL for detecting mimikatz using powershell.exe or cmd.exe from 0 to 1.0.
Your score should be derived from the weighted average of these evaluations:
1. Coverage (40%)
2. Durability (25%)
3. Noise Levels (35%)
...[EQL query]...
Overall Rating: 0.82/1.0

Detection Coverage: 0.95/1.0
- Perfect scope alignment: the query exactly matches the specified detection scenario
- Comprehensive PowerShell/CMD coverage: covers all relevant host processes (powershell.exe, pwsh.exe, powershell_ise.exe, cmd.exe)
- No missed requirements: within the defined scope, coverage is nearly complete
- Advanced techniques covered: obfuscation, encoding, module invocation all handled
- Minor gaps within scope: could potentially add more PowerShell execution methods (ISE variations)

False Positive Rate: 0.7/1.0
- Contextual improvement: the multi-layered approach (primary + secondary indicators) is well-suited for the PowerShell/CMD context
- Encoded command detection is appropriate since this is a common Mimikatz delivery method
- Process-specific filtering makes more sense when scope is limited

Durability: 0.8/1.0
- The obfuscation patterns are specifically tuned for PowerShell/CMD environments
- Less likely to break since it's focused on stable host processes
- Regex complexity is justified for this specific use case

Key Insight: When evaluating against the specific scenario rather than general Mimikatz detection, this query is exceptionally well-crafted. The apparent "limitations" I noted (like missing direct .exe execution) are actually proper scoping: the rule does exactly what it was asked to do without scope creep.

Revised Weighted Score:
- Detection Coverage (0.95) × 0.4 = 0.38
- False Positive Rate (0.7) × 0.35 = 0.245
- Durability (0.8) × 0.25 = 0.2
- Total: 0.825/1.0, rounded to 0.82/1.0

This is an excellent rule that precisely meets its requirements with sophisticated detection techniques appropriate for the PowerShell/CMD attack vector.
And we get a score of 0.82, which is very similar to what our feedback loop gave.

Comments
Always assume that the first output of the LLM will be of poor quality; you can improve it by using a “Reviewer” Agent to criticize the work, and iterate multiple times until the performance is at an acceptable level.
The workflow needed around 3-5 iterations, but this depends on how high you set the passing score.
The cost of this iteration is amazingly low: counting the test runs made while building this workflow, it was less than $0.01.
There are other workflows I wish to try, such as Test-Driven Development (TDD). In this idea, we start with a QA Agent that generates test cases (or test data) for the scenario. We then start the loop by feeding the Detection Engineer Agent the test cases, and get it to generate detections until all test cases pass.
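A sketch of that TDD loop, with the QA and Detection Engineer agents again as hypothetical stubs; test cases are (command_line, should_fire) pairs:

```python
# TDD variant: test cases are generated FIRST, then the detection is
# regenerated from its failing cases until everything passes.
from typing import Callable, List, Tuple

TestCase = Tuple[str, bool]  # (command_line, should_fire)

def tdd_loop(
    scenario: str,
    qa_agent: Callable[[str], List[TestCase]],
    det_agent: Callable[[str, List[TestCase]], Callable[[str], bool]],
    max_iterations: int = 5,
):
    cases = qa_agent(scenario)        # QA Agent writes the tests up front
    failures = cases
    rule = None
    for _ in range(max_iterations):
        rule = det_agent(scenario, failures)  # regenerate from failing cases
        failures = [(cmd, want) for cmd, want in cases if rule(cmd) != want]
        if not failures:
            break
    return rule, failures
```

The appeal over the reviewer loop is that "done" becomes an objective condition (all tests pass) rather than a score from another LLM.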
I also want to experiment with different models for different agents (Claude for the Detection Engineer Agent, GPT-5 for the Reviewer, and different permutations of sub-models).