How to Design Real AI Agents: A Step-by-Step Guide from Level 0 to Level 7

Transitioning from "Persona" to Tool-Using Decision Mechanisms
A step-by-step senior-level AI Engineer guide from Level 0 to Level 7.

Published: June 2026 | Reading time: ~25 min | Category: AI Engineering

🎯 Introduction: Why "You Are the Best AI Engineer in the World" Is Not Enough

Many online tutorials tell you to create an agent by writing something like this:

"You are the best software engineer in the world..."

If you are deploying an application to production, persona-only prompts will turn your system into a hallucination bomb.

The real power of agents does not come from the roles you assign them; it comes from Tool Calling, Grounding, and Explainability.

💡 What Will You Learn in This Post?

In this post we will take a simple problem — "measuring text similarity" — and show you step by step how to evolve it from a novice approach to an autonomous level designed by a senior AI Engineer. At each level we will compare prompts, architecture, and cost.

🎬 Learn with Video: Murat Karakaya Akademi

You can also watch this training on the Murat Karakaya Akademi YouTube channel. Follow the same journey from Level 0 to Level 7 with step-by-step live demos, code explanations, and architectural analysis.

📌 Level 0 — Novice Approach: Black Box & Hallucination

Beginner

Scenario: We need to find how similar Text 1 and Text 2 are on a 0–100 scale. (For example, in a RAG system, we measure how faithfully a generated answer stays true to the reference text).

Novice Prompt:
"You are a text analyst. Compare the given Text 1 and Text 2 and decide how similar they are. Return the similarity as a numeric value between 0–100. 0: They do not match at all, 100: They are completely identical word for word."

Why Is This Bad?

Black Box: The model only returns "75". Why 75? Not 74 or 80? We don't know.
Subjectivity: The same texts might get 60 one day and 85 the next.
Hallucination Risk: An LLM does not compute a similarity metric; it predicts the next token. "75" is a fabricated (hallucinated) value, not the result of a measurement.
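One way to make the subjectivity problem visible is to score the same text pair several times and look at the spread. The sketch below only does the arithmetic; the function name `score_spread` is hypothetical, and the two scores are the 60 and 85 mentioned above, not measured results.

```python
from statistics import mean, pstdev


def score_spread(scores: list[int]) -> dict:
    """Summarize how much repeated scores for the same text pair disagree."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        # Population standard deviation of the repeated scores.
        "stdev": pstdev(scores),
    }


# The two scores from the example above (illustrative, not measured).
spread = score_spread([60, 85])
print(spread)
```

A large gap between `min` and `max` for identical inputs is a sign that the score reflects sampling noise more than the texts themselves.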

📌 Level 1 — Structured Output & Chain of Thought

Improvement

The first thing we want is Explainability. We tell the model: "Don't just give me the result, show me your thinking process."

"You are a text analyst. Your goal is to measure the semantic similarity between two texts.
Your output MUST be a valid JSON object. Do not use Markdown or additional text.

Required JSON fields:
- reasoning: The specific justification for similarity and differences between these two texts.
- similarity_score: An integer between 0-100.

Expected JSON schema:
{
  "type": "object",
  "required": ["reasoning", "similarity_score"],
  "properties": {
    "reasoning": {"type": "string"},
    "similarity_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}"

Why Is This Better?

Chain of Thought: We did not ask for the score first; the reasoning field comes first. Because the model generates its output token by token, the score is produced after, and conditioned on, the written justification.
Debugging: If the score is 40, we can look at the logs and say "The model missed this detail, that's why it gave a low score."
Integration: The JSON output can be parsed by application code.
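The integration point can be sketched in a few lines of application code. This is a minimal example using only the standard library; the function name `parse_level1_output` and the sample response are made up for illustration, and the checks simply mirror the Level 1 schema.

```python
import json


def parse_level1_output(raw: str) -> dict:
    """Parse a Level 1 response and check it against the schema in the prompt."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")

    reasoning = data.get("reasoning")
    score = data.get("similarity_score")

    if not isinstance(reasoning, str):
        raise ValueError("'reasoning' must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("'similarity_score' must be an integer")
    if not 0 <= score <= 100:
        raise ValueError("'similarity_score' must be between 0 and 100")
    return {"reasoning": reasoning, "similarity_score": score}


# Illustrative model output (made up for this example).
sample = '{"reasoning": "Both texts stress data quality.", "similarity_score": 82}'
result = parse_level1_output(sample)
print(result["similarity_score"])
```

Rejecting malformed output at this boundary keeps a bad response from silently flowing into downstream code.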

⚠️ What Is Missing: The score is still subjective. The 0–100 scale is too broad and unstructured for an LLM.

📌 Level 2 — Rubric and Objectivity (Divide & Conquer)

Increased Autonomy

We no longer let the agent interpret the abstract concept of "similarity" on its own. We give it rules (a rubric) and break the problem into parts.

Rubric:

1. Main Idea (0-20): Do both texts convey the same core message?
2. Tone and Style (0-20): Do the texts have the same formality and emotion?
3. Entities (0-20): Do the names, dates, and numbers in the texts match?
4. Missing Information (0-20): Does Text 2 preserve the important details of Text 1? (Full points when nothing important is omitted.)
5. Fluency (0-20): How structurally coherent is Text 2?

"You are an expert evaluator. Evaluate the two texts according to the 5 criteria below. Each criterion is scored 0-20.
Your output MUST be a valid JSON object. Do not use Markdown or additional text.

Required JSON fields:
- evaluations: A list of 5 items [criterion, score, explanation]
- total_score: The sum of the five criterion scores.

Expected JSON schema:
{
  "type": "object",
  "required": ["evaluations", "total_score"],
  "properties": {
    "evaluations": {
      "type": "array",
      "minItems": 5,
      "maxItems": 5,
      "items": {
        "type": "object",
        "required": ["criterion", "score", "explanation"],
        "properties": {
          "criterion": {"type": "string"},
          "score": {"type": "integer", "minimum": 0, "maximum": 20},
          "explanation": {"type": "string"}
        }
      }
    },
    "total_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}"

Why Is This Better?

Grounding: The model no longer has to judge an abstract 0–100 scale; it answers five narrower questions, each scored 0-20.
Objectivity: Scores tend to be more consistent across repeated runs (variance decreases).
Comparability: We can evaluate different prompts against the same rubric.
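Because the Level 2 prompt asks the model to add up the five scores itself, it is worth recomputing the sum in code. The sketch below is one possible check, not the notebook's implementation; the function name `check_level2_output` and the sample scores are hypothetical.

```python
import json


def check_level2_output(raw: str) -> int:
    """Validate a Level 2 response and return the recomputed total score."""
    data = json.loads(raw)
    evaluations = data["evaluations"]
    if len(evaluations) != 5:
        raise ValueError(f"Expected 5 evaluations, got {len(evaluations)}")

    for item in evaluations:
        score = item["score"]
        # bool is a subclass of int, so exclude it before the range check.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 20:
            raise ValueError(f"Invalid score for {item.get('criterion')}: {score!r}")

    # Never trust the model's arithmetic: recompute the sum deterministically.
    recomputed = sum(item["score"] for item in evaluations)
    if recomputed != data["total_score"]:
        raise ValueError(f"total_score {data['total_score']} != sum of scores {recomputed}")
    return recomputed


# Illustrative response built in code (scores are made up for this example).
criteria = ["Main Idea", "Tone and Style", "Entities", "Missing Information", "Fluency"]
sample = json.dumps({
    "evaluations": [
        {"criterion": name, "score": score, "explanation": "..."}
        for name, score in zip(criteria, [18, 16, 17, 15, 14])
    ],
    "total_score": 80,
})
print(check_level2_output(sample))
```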

📌 Level 3 — Example-Based Rubric Calibration (One-Shot / Few-Shot)

Calibration

In Level 2 we gave the rubric; however, this was still zero-shot prompting. The model read the criteria but never saw examples that answer questions such as "When do I give 20 points?", "What is the boundary case for 10?", or "When is 0 appropriate?"

At this level we provide small, representative examples for each criterion.

Scoring Calibration Examples:

Main Idea — 20 points: "Data cleaning is critical for model success" and "Model quality heavily depends on clean data" convey the same core message.
Main Idea — 10 points: Both texts discuss data quality but one focuses on security risks while the other focuses on model performance.
Main Idea — 0 points: One discusses data cleaning, the other discusses a sports match result.
Tone and Style — 20 points: Both texts are written in an academic and formal tone.
Tone and Style — 10 points: One is formal, the other more conversational, but the meaning is preserved.
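A simple way to keep these examples maintainable is to store them as data and render them into the prompt. The sketch below assumes that approach; `CALIBRATION_EXAMPLES` and `build_calibration_block` are hypothetical names, and the example texts are the ones listed above.

```python
# Calibration examples from the list above, stored as (criterion, points, description).
CALIBRATION_EXAMPLES = [
    ("Main Idea", 20, '"Data cleaning is critical for model success" and '
                      '"Model quality heavily depends on clean data" convey the same core message.'),
    ("Main Idea", 10, "Both texts discuss data quality but one focuses on security risks "
                      "while the other focuses on model performance."),
    ("Main Idea", 0, "One discusses data cleaning, the other discusses a sports match result."),
    ("Tone and Style", 20, "Both texts are written in an academic and formal tone."),
    ("Tone and Style", 10, "One is formal, the other more conversational, but the meaning is preserved."),
]


def build_calibration_block(examples) -> str:
    """Render the calibration examples as a short text block for the system prompt."""
    lines = ["Scoring calibration examples:"]
    for criterion, points, description in examples:
        lines.append(f"- {criterion}, {points} points: {description}")
    return "\n".join(lines)


calibration_block = build_calibration_block(CALIBRATION_EXAMPLES)
print(calibration_block)
```

The rendered block can be appended to the Level 2 rubric prompt, which keeps the rubric text itself unchanged between levels.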

Why Is This Better?

Consistency: The model uses the same score ranges more reliably.
Teachability: The rubric is now supported by behavioral examples, not just an abstract list.
Cost: Few-shot examples increase input tokens — so examples must be short and clear.

Expected JSON Schema:

{
  "type": "object",
  "required": ["evaluations", "calibration_note", "total_score"],
  "properties": {
    "evaluations": {
      "type": "array",
      "minItems": 5,
      "maxItems": 5,
      "items": {
        "type": "object",
        "required": ["criterion", "score", "explanation"],
        "properties": {
          "criterion": {"type": "string"},
          "score": {"type": "integer", "minimum": 0, "maximum": 20},
          "explanation": {"type": "string"}
        }
      }
    },
    "calibration_note": {"type": "string"},
    "total_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}

📌 Level 4 — Tool Calling and Workflow

Grounding Begins

In Level 3 we calibrated the rubric with examples; however, the evaluation still relied solely on LLM interpretation. Now we add deterministic metrics from external systems as evidence.

Metrics Used:

ROUGE-L F1: Measures word-sequence overlap based on the longest common subsequence (LCS) of the two texts.
Lightweight Similarity Score: A combination of Token cosine + Token Jaccard + Character 3-gram cosine + Sequence ratio.
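The two metrics can be sketched with the standard library alone. This is a simplified illustration, not the notebook's code: the tokenizer (lowercased `\w+` matches) and the equal-weight average used for the combined score are assumptions, and a dedicated ROUGE package may tokenize or stem differently.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase word tokens; real ROUGE tooling may differ.
    return re.findall(r"\w+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    ref, cand = tokenize(reference), tokenize(candidate)
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def char_ngrams(text: str, n: int = 3) -> Counter:
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def lightweight_similarity(text_1: str, text_2: str) -> dict:
    t1, t2 = tokenize(text_1), tokenize(text_2)
    s1, s2 = set(t1), set(t2)
    scores = {
        "token_cosine": cosine(Counter(t1), Counter(t2)),
        "token_jaccard": len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0,
        "char_3gram_cosine": cosine(char_ngrams(text_1), char_ngrams(text_2)),
        "sequence_ratio": SequenceMatcher(None, text_1.lower(), text_2.lower()).ratio(),
    }
    # Assumption: equal weights; the source does not state how the parts are combined.
    scores["combined"] = sum(scores.values()) / len(scores)
    return scores


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
print(lightweight_similarity("the cat sat on the mat", "the cat is on the mat"))
```

All four components are deterministic, so the same text pair always yields the same numbers.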

Metric Interpretation Rules:

If ROUGE-L F1 is low, this indicates low word-sequence overlap; on its own, it is not evidence of low semantic similarity.
If the Lightweight Similarity Score is higher than ROUGE-L, the texts may convey a similar message with different words.
The Lightweight Similarity Score is not a real semantic embedding; it should be used as a decision-support signal, not as the sole decision-maker.
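In a Level 4 workflow, the computed metrics and these interpretation rules are passed to the model as context. The sketch below shows one way to build that context; `build_metric_context` is a hypothetical helper, both scores are assumed to be on a 0-1 scale, and the numbers in the usage line are placeholders rather than results.

```python
def build_metric_context(rouge_l_f1: float, lightweight_score: float) -> str:
    """Render deterministic metric values plus interpretation rules for the prompt."""
    lines = [
        "External metric evidence (computed deterministically in Python):",
        f"- ROUGE-L F1: {rouge_l_f1:.3f}",
        f"- Lightweight Similarity Score: {lightweight_score:.3f}",
        "Interpretation rules:",
        "- A low ROUGE-L F1 means low word-sequence overlap; on its own it is not "
        "evidence of low semantic similarity.",
        "- The Lightweight Similarity Score is a decision-support signal, not a "
        "semantic embedding.",
    ]
    # Only add this hint when the rule's condition actually holds.
    if lightweight_score > rouge_l_f1:
        lines.append(
            "- The lightweight score exceeds ROUGE-L, so the texts may convey a "
            "similar message with different words."
        )
    return "\n".join(lines)


# Placeholder values for illustration only.
context = build_metric_context(0.2, 0.4)
print(context)
```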

🌍 Why Is This Used in the Real World?

Reliability: ROUGE and lightweight similarity scores are deterministic — they give the same scores to the same text pair every time.
Traceability: Since the LLM's opinion is grounded in external evidence, evaluation becomes more auditable.
Cost Control: Lightweight metrics are fast and do not require heavy model dependencies.

🔬 Experiment Hygiene Note

Level 4 uses the same rubric text as Level 3. This is a deliberate decision: the difference between the two levels is not a rubric change, but only external metric context.

Additional Required JSON Fields in Level 4:

metric_interpretation: Explain how you interpreted the ROUGE-L F1 and Lightweight Similarity Score values.
calibration_note: Explain how the rubric calibration examples affected your scoring.

Expected JSON Schema (Level 4):

{
  "type": "object",
  "required": ["evaluations", "metric_interpretation", "calibration_note", "total_score"],
  "properties": {
    "evaluations": { /* Same 5-item list as Level 3 */ },
    "metric_interpretation": {"type": "string"},
    "calibration_note": {"type": "string"},
    "total_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}

📌 Level 5 — ReAct and Ollama Tool Calling with Real Agent Loop

Real Agentic Loop

We built a strong workflow in Level 4, but was our system a real agent? Not exactly, because the developer decided which metrics to compute and the Python code always ran them. In real agent behavior, the model determines what evidence it needs, the software layer executes the tool, and the result returns to the model.

ReAct Loop:

Reasoning: The model determines what external evidence it needs to evaluate text similarity.
Action: Instead of writing plain text, the model produces Ollama's native tool_calls field.
Observation: Python executes the relevant function and returns the result to the model as role="tool".
Final Answer: The model uses the tool results to produce the rubric-based JSON evaluation.

Ollama Native Tool Calling Flow:

Functions are presented to the model via the `tools=[...]` list.
If needed, the model produces `response.message.tool_calls`.
Python executes these tool calls.
Results are added to the conversation as `role="tool"` messages.
The model produces the final answer based on tool results.

Tool Definitions (Python):

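import json

# calculate_rouge, calculate_lightweight_similarity, METIN_1, and METIN_2 are
# assumed to be defined earlier in the notebook.

def calculate_rouge_tool(reference_text: str = "", candidate_text: str = "") -> str:
    """Calculate ROUGE scores for the fixed Text 1 and Text 2 in the notebook.

    Args:
        reference_text: Ignored in the training demo; the tool uses the fixed METIN_1.
        candidate_text: Ignored in the training demo; the tool uses the fixed METIN_2.

    Returns:
        JSON string containing the ROUGE scores.
    """
    return json.dumps(calculate_rouge(METIN_1, METIN_2), ensure_ascii=False)

def calculate_lightweight_similarity_tool(reference_text: str = "", candidate_text: str = "") -> str:
    """Calculate lightweight similarity score for the fixed Text 1 and Text 2 in the notebook.

    Args:
        reference_text: Ignored in the training demo; the tool uses the fixed METIN_1.
        candidate_text: Ignored in the training demo; the tool uses the fixed METIN_2.

    Returns:
        JSON string containing Token cosine, Token Jaccard, character 3-gram cosine, sequence ratio,
        and combined score.
    """
    return json.dumps(calculate_lightweight_similarity(METIN_1, METIN_2), ensure_ascii=False)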

Tool Calling Loop (Pseudo-Python):

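from ollama import Client

# MODEL and TOOL_CALLING_SYSTEM_PROMPT are assumed to be defined earlier in the notebook.
client = Client()

# Map tool names (as the model emits them) to the Python functions that run them.
available_tools = {
    "calculate_rouge_tool": calculate_rouge_tool,
    "calculate_lightweight_similarity_tool": calculate_lightweight_similarity_tool,
}

messages = [
    {"role": "system", "content": TOOL_CALLING_SYSTEM_PROMPT},
    {"role": "user", "content": "Text 1: ..., Text 2: ..."},
]

# Unbounded for clarity; production code should cap the number of steps.
while True:
    response = client.chat(
        model=MODEL,
        messages=messages,
        tools=[calculate_rouge_tool, calculate_lightweight_similarity_tool],
    )
    # Keep the assistant turn (including any tool calls) in the history.
    messages.append(response.message)

    if not response.message.tool_calls:
        print(response.message.content)  # Final answer
        break

    for tool_call in response.message.tool_calls:
        tool_name = tool_call.function.name
        # Execute the requested tool in Python and return the result as an observation.
        result = available_tools[tool_name](**tool_call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_name": tool_name,
            "content": str(result),
        })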

Why Is This a Mastery-Level Skill?

ReAct is not just writing Thought / Action / Observation; it is connecting thought to real tool execution.
An agent is the joint design of prompt, tool calling, execution loop, grounding, and error control layers.
The model determines tool needs, Python executes the tool, and the final evaluation is supported by external evidence.

📊 Level 5 Token Cost:

"Since Level 5 is a multi-turn agent loop, the input token count is not just the length of the first user prompt. With each client.chat call, the system message, user message, previous assistant messages, and tool results are sent again as context. Therefore, the Level 5 input token total is not a count of unique tokens; it is a cumulative measure of processed tokens and cost."

🛡️ In the Real World: Production Guardrails

In real production, these protections are added: schema validation, maximum step limit, tool allowlist, retry, and tracing/logging.
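Two of these guardrails, the maximum step limit and the tool allowlist, can be sketched without any model dependency. In the example below the chat function is injected and messages are plain dicts; that is a simplification for illustration, not Ollama's response objects, and `run_guarded_loop` and `MAX_STEPS` are hypothetical names. Schema validation, retry, and tracing are left out.

```python
import json
from typing import Callable

MAX_STEPS = 5  # hypothetical limit; tune per application


def run_guarded_loop(chat: Callable[[list], dict], messages: list,
                     allowed_tools: dict, max_steps: int = MAX_STEPS) -> str:
    """Run a tool-calling loop with a step limit and a tool allowlist."""
    for _ in range(max_steps):
        reply = chat(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"]  # final answer

        for call in tool_calls:
            name = call["name"]
            if name not in allowed_tools:
                # Report the refusal to the model instead of executing anything.
                result = json.dumps({"error": f"tool '{name}' is not allowed"})
            else:
                try:
                    result = allowed_tools[name](**call.get("arguments", {}))
                except Exception as exc:  # surface tool failures as observations
                    result = json.dumps({"error": str(exc)})
            messages.append({"role": "tool", "tool_name": name, "content": str(result)})

    raise RuntimeError(f"No final answer after {max_steps} steps")


# Usage with a stub model: one tool call, then a final answer.
state = {"calls": 0}

def stub_chat(history: list) -> dict:
    state["calls"] += 1
    if state["calls"] == 1:
        return {"role": "assistant", "content": "",
                "tool_calls": [{"name": "echo", "arguments": {"text": "hi"}}]}
    return {"role": "assistant", "content": "done"}

history = [{"role": "user", "content": "Text 1: ..., Text 2: ..."}]
answer = run_guarded_loop(stub_chat, history, {"echo": lambda text: text})
print(answer)
```

Returning errors to the model as tool observations, rather than raising immediately, gives it a chance to recover within the step budget.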

Level 5 Tool Calling JSON Schema:

{
  "type": "object",
  "required": ["evaluations", "metric_interpretation", "calibration_note", "total_score"],
  "properties": {
    "evaluations": {
      "type": "array",
      "minItems": 5,
      "maxItems": 5,
      "items": {
        "type": "object",
        "required": ["criterion", "score", "explanation"],
        "properties": {
          "criterion": {"type": "string"},
          "score": {"type": "integer", "minimum": 0, "maximum": 20},
          "explanation": {"type": "string"}
        }
      }
    },
    "metric_interpretation": {"type": "string"},
    "calibration_note": {"type": "string"},
    "total_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}

📌 Level 6 — Rubric-Based Sub-Agents and Python Aggregator

Modular Architecture

In Level 5, a single agent both called tools and interpreted the entire rubric on its own. At this level we try a different architecture: instead of one large prompt, we give each rubric criterion to a separate sub-agent.

Architecture:

1 generic sub-agent function is written.
This function is called 5 times with 5 different rubric configurations.
Each sub-agent evaluates only its own criterion.
Python aggregator validates, sorts, and calculates the total score.
The aggregator makes no LLM calls — it is deterministic.
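The "one generic function, five configurations" idea can be sketched as follows. The names `RUBRIC_AGENT_CONFIGS` and `build_sub_agent_system_prompt` appear in the source, but their contents here are assumptions, as are `run_sub_agent` and `run_all_sub_agents`; the LLM is injected as a plain callable so the example runs with a stub.

```python
import json
from typing import Callable

# Assumed shape of the configs: one entry per rubric criterion from Level 2.
RUBRIC_AGENT_CONFIGS = [
    {"criterion": "Main Idea", "question": "Do both texts convey the same core message?"},
    {"criterion": "Tone and Style", "question": "Do the texts have the same formality and emotion?"},
    {"criterion": "Entities", "question": "Do the names, dates, and numbers in the texts match?"},
    {"criterion": "Missing Information", "question": "Does Text 2 omit an important detail from Text 1?"},
    {"criterion": "Fluency", "question": "How structurally coherent is Text 2?"},
]


def build_sub_agent_system_prompt(config: dict) -> str:
    # Hypothetical prompt wording; the notebook's actual prompt is not shown in the source.
    return (
        f"You are an evaluator for one criterion only: {config['criterion']}.\n"
        f"Question: {config['question']}\n"
        "Score it 0-20. Return a JSON object with the fields "
        "criterion, score, explanation, evidence."
    )


def run_sub_agent(config: dict, text_1: str, text_2: str,
                  llm: Callable[[str, str], str]) -> dict:
    """Run one sub-agent: one LLM call for one criterion."""
    system_prompt = build_sub_agent_system_prompt(config)
    user_prompt = f"Text 1: {text_1}\nText 2: {text_2}"
    raw = llm(system_prompt, user_prompt)
    return {"criterion": config["criterion"], "parsed": json.loads(raw)}


def run_all_sub_agents(text_1: str, text_2: str, llm: Callable[[str, str], str]) -> list:
    # Same function, five configurations, in rubric order.
    return [run_sub_agent(config, text_1, text_2, llm) for config in RUBRIC_AGENT_CONFIGS]


# Stub LLM for demonstration: echoes the criterion with a fixed placeholder score.
def stub_llm(system_prompt: str, user_prompt: str) -> str:
    criterion = system_prompt.split("criterion only: ")[1].split(".\n")[0]
    return json.dumps({"criterion": criterion, "score": 10,
                       "explanation": "stub", "evidence": "stub"})

results = run_all_sub_agents("Text 1 ...", "Text 2 ...", stub_llm)
print([r["parsed"]["criterion"] for r in results])
```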

The Pedagogical Message of This Level:

✅ "Don't make everything an agent!"
LLM is for subjective evaluation. Python is for validation, aggregation, formatting, and deterministic computation.

Limitations:

5 sub-agents = 5 LLM calls = more expensive than Level 5.
Not necessary in every case; it makes sense when the rubric grows or when auditability is critical.

Level 6 — Sub-Agent JSON Output (single criterion):

{
  "criterion": "Main Idea",
  "score": 18,
  "explanation": "Specific evaluation for this criterion",
  "evidence": "Basis from text or metrics"
}

Python Aggregator Total Output (all criteria):

{
  "evaluations": [
    {"criterion": "Main Idea", "score": 18, "explanation": "...", "evidence": "..."},
    {"criterion": "Tone and Style", "score": 16, "explanation": "...", "evidence": "..."},
    {"criterion": "Entities", "score": 17, "explanation": "...", "evidence": "..."},
    {"criterion": "Missing Information", "score": 15, "explanation": "...", "evidence": "..."},
    {"criterion": "Fluency", "score": 14, "explanation": "...", "evidence": "..."}
  ],
  "metric_interpretation": "ROUGE and lightweight similarity metrics were provided as context to each sub-agent; each criterion was interpreted by its own expert sub-agent.",
  "aggregation_note": "Total score was computed deterministically by the Python aggregator; no LLM orchestrator was used.",
  "total_score": 80
}

Level 6 Python Aggregator Function Example:

def aggregate_sub_agent_results(sub_agent_results, metrics_context):
    # The sub-agents must report exactly the configured criteria, in the configured order.
    expected_criteria = [config['criterion'] for config in RUBRIC_AGENT_CONFIGS]
    actual_criteria = [result['parsed'].get('criterion') for result in sub_agent_results]

    if actual_criteria != expected_criteria:
        raise ValueError(
            "Sub-agent criterion order does not match expected: "
            f"{actual_criteria} != {expected_criteria}"
        )

    evaluations = []
    for result in sub_agent_results:
        parsed = result['parsed']
        score = parsed.get('score')
        # bool is a subclass of int, so reject it explicitly before the range check.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 20:
            raise ValueError(f"Invalid score: {parsed}")
        evaluations.append({
            'criterion': parsed['criterion'],
            'score': score,
            'explanation': parsed['explanation'],
            'evidence': parsed['evidence'],
        })

    # The total is computed in Python, never by an LLM.
    total_score = sum(item['score'] for item in evaluations)
    return {
        'evaluations': evaluations,
        'metric_interpretation': '...',
        'aggregation_note': 'Total score was computed deterministically by Python.',
        'total_score': total_score,
    }

Level 6 — Token Cost:

5 sub-agents = 5 independent LLM calls
Each call includes system prompt + user prompt + metric context
Total input tokens ≈ 5 × (system prompt + user prompt + metric context)
Aggregator token cost = 0 (Python code runs)
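The difference between the Level 5 and Level 6 cost models can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and the token counts in the usage lines are placeholders, not measurements from the experiments.

```python
def level5_cumulative_input_tokens(initial_tokens: int, per_turn_added: list[int]) -> int:
    """Input tokens processed by a multi-turn loop that resends the full history.

    initial_tokens: system + user message tokens sent on the first call.
    per_turn_added: tokens appended to the history after each call that is
                    followed by another call (assistant message + tool results).
    """
    context = initial_tokens
    total = context  # first call
    for added in per_turn_added:
        context += added   # the history grows
        total += context   # and the whole history is sent again
    return total


def level6_input_tokens(system_tokens: int, user_tokens: int,
                        metric_context_tokens: int, n_sub_agents: int = 5) -> int:
    """Input tokens for independent sub-agent calls; the Python aggregator adds none."""
    return n_sub_agents * (system_tokens + user_tokens + metric_context_tokens)


# Placeholder numbers for illustration only.
print(level5_cumulative_input_tokens(600, [150]))   # 600 + 750
print(level6_input_tokens(400, 200, 50))            # 5 * 650
```

The Level 5 total grows with every extra turn, while the Level 6 total grows with the number of sub-agents.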

Level 6 — Sub-Agent JSON Schema (inside build_sub_agent_system_prompt):

{
  "type": "object",
  "required": ["criterion", "score", "explanation", "evidence"],
  "properties": {
    "criterion": {"type": "string"},
    "score": {"type": "integer", "minimum": 0, "maximum": 20},
    "explanation": {"type": "string"},
    "evidence": {"type": ["string", "array", "object"]}
  }
}

📌 Level 7 — Orchestrator Agents and When They Are Not Needed

Architectural Decision

After Level 6 the natural question is: "Wouldn't it be better if an orchestrator agent managed these sub-agents?"

An orchestrator agent is a powerful architecture in the real world. An orchestrator can break down tasks, decide which sub-agent to run, select appropriate tools, initiate retries on missing or contradictory results, and convert results from different agents into a final decision.

However, in this example we deliberately do not need it, because:

The 5 rubric criteria are predetermined.
Every criterion must run.
Each sub-agent evaluates only its own criterion.
Missing-criterion checks, sorting, and the total score can be handled reliably in Python.

🏗️ When Does an Orchestrator Agent Make Sense?

When the set of sub-agents to run changes from task to task.
When tool selection, data source selection, or workflow branching is needed.
When there are contradictions between sub-agent responses and an interpretive reconciliation is needed.
When there are dynamic steps such as quality control, retry, missing information completion, or human approval.

Level 7's Message: Orchestrator + sub-agent architectures exist and are important; however, they are not necessary in every problem. In this example, the Python aggregator is the correct, simple, and instructive choice.

💡 Level 7 Note: This level does not make a new LLM call. It is an architectural decision-making and boundary-setting section. Performance graphs show the cost of Level 0-6 experiments.

📊 Comparison of All Levels

Level	Approach	Added Layer	Gain	Limitation / Lesson
Lvl 0	Persona / Black Box	Simple system prompt	Fast start	Inconsistent, unexplainable, hallucination-prone
Lvl 1	JSON + Explanation	Structured output	Answer becomes parseable	Still subjective
Lvl 2	Rubric	Criteria-based evaluation	More objective score	No external evidence, still LLM opinion
Lvl 3	One-Shot / Few-Shot Calibration	Criteria-based examples	More consistent scores	Input token cost increases
Lvl 4	Workflow + Tools	ROUGE + lightweight similarity metrics	Grounded, evidence-based	Developer selects the tools
Lvl 5	ReAct + Tool Calling	Automatic tool calling + execution loop	Real agent behavior	High cost, loop management needed
Lvl 6	Sub-Agents + Python Aggregator	Generic sub-agent + deterministic aggregation	Task decomposition, responsibility separation	5× LLM calls, more expensive
Lvl 7	Orchestrator Agent Decision	Architectural decision-making	Understanding of advanced architectures	Orchestrator not needed everywhere

🎓 Final Message: Prompt Engineering Becomes Systems Engineering

Simply giving an agent a powerful-sounding prompt and expecting it to work correctly is not enough.
A Real AI Agent is not just a model that produces answers; it is a software system that jointly manages thinking patterns, tool usage, reasoning steps, and data-driven evidence.

As we progressed from Level 0 to Level 7, we actually built the same idea layer by layer. First we structured the output, then we broke evaluation into rubric parts, then we calibrated with examples. Then we added grounding with metrics and connected the ReAct idea to real tool execution. Finally, we saw that more advanced architectures like orchestrator agents exist, but in this example, not adding an extra LLM orchestrator was the more correct engineering decision.

Our value as AI Engineers emerges here: instead of expecting miracles from the model, understand the model's strengths and weaknesses and build the right architecture around them. Good agent design is not just about writing prompts; it is about proving with data, supporting with tools, making reasoning visible, and making outputs measurable.

📝 Test Texts Used in the Training

These two texts were specifically selected: low word overlap (low ROUGE), yet they convey a semantically similar message.

Text 1 (Reference):
"In the process of training artificial intelligence models, the use of high-quality datasets is of critical importance. If the dataset contains incorrect, biased, or incomplete information, the results produced by the model will inevitably be flawed and unreliable. Therefore, data cleaning is a more prioritized step than the complexity of the model architecture."

Text 2 (System Output):
"The success of machine learning algorithms heavily depends on the quality of the information they are fed. Algorithms fed with dirty, biased, or incomplete data will naturally produce incorrect and untrustworthy outputs. Therefore, filtering and organizing data is a much more essential process than building the system's infrastructure."

Thursday, June 18, 2026