Thursday, June 18, 2026

How to Design Real AI Agents

Transitioning from "Persona" to Tool-Using Decision Mechanisms
A step-by-step senior-level AI Engineer guide from Level 0 to Level 7.

Published: June 2026  |  Reading time: ~25 min  |  Category: AI Engineering

🎯 Introduction: Why "You Are the Best AI Engineer in the World" Is Not Enough

Most online tutorials tell you to create an agent by writing something like this:

"You are the best software engineer in the world..."

If you are deploying an application to production, persona-only prompts will turn your system into a hallucination bomb.

The real power of agents does not come from the roles you assign them; it comes from Tool Calling, Grounding, and Explainability.

💡 What Will You Learn in This Post?

In this post we will take a simple problem — "measuring text similarity" — and show you step by step how to evolve it from a novice approach to an autonomous level designed by a senior AI Engineer. At each level we will compare prompts, architecture, and cost.

🎬 Learn with Video: Murat Karakaya Akademi

You can also watch this training on the Murat Karakaya Akademi YouTube channel. Follow the same journey from Level 0 to Level 7 with step-by-step live demos, code explanations, and architectural analysis.

▶ Watch on YouTube — Murat Karakaya Akademi

📌 Level 0 — Novice Approach: Black Box & Hallucination

Beginner

Scenario: We need to score how similar Text 1 and Text 2 are on a 0–100 scale. (For example, in a RAG system, we measure how faithful a generated answer is to the reference text.)

Novice Prompt:
"You are a text analyst. Compare the given Text 1 and Text 2 and decide how similar they are. Return the similarity as a numeric value between 0–100. 0: They do not match at all, 100: They are completely identical word for word."

Why Is This Bad?

  1. Black Box: The model only returns "75". Why 75? Not 74 or 80? We don't know.
  2. Subjectivity: The same texts might get 60 one day and 85 the next.
  3. Hallucination Risk: The LLM does not measure anything; it predicts the next token. The "75" is a fabricated (hallucinated) value, not a computed one.

📌 Level 1 — Structured Output & Chain of Thought

Improvement

The first thing we want is Explainability. We tell the model: "Don't just give me the result, show me your thinking process."

"You are a text analyst. Your goal is to measure the semantic similarity between two texts.
Your output MUST be a valid JSON object. Do not use Markdown or additional text.

Required JSON fields:
- reasoning: The specific justification for similarity and differences between these two texts.
- similarity_score: An integer between 0-100.

Expected JSON schema:
{
  "type": "object",
  "required": ["reasoning", "similarity_score"],
  "properties": {
    "reasoning": {"type": "string"},
    "similarity_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}"

Why Is This Better?
  • Chain of Thought: We did not ask for the score first. We asked it to fill the reasoning field first. The model justifies its own inference while writing the explanation.
  • Debugging: If the score is 40, we can look at the logs and say "The model missed this detail, that's why it gave a low score."
  • Integration: The JSON output can be parsed by application code.
⚠️ What Is Missing: The score is still subjective. The 0–100 scale is too broad and unstructured for an LLM.
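
To make the Integration point concrete, here is a minimal sketch of how application code might parse and check a Level 1 reply. The function name `parse_level1_output` and the sample reply are assumptions made for illustration; only the two field names and the 0-100 range come from the schema in the prompt.

```python
import json


def parse_level1_output(raw: str) -> dict:
    """Parse a model reply and enforce the Level 1 schema by hand."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    reasoning = data.get("reasoning")
    score = data.get("similarity_score")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("'reasoning' must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError("'similarity_score' must be an integer between 0 and 100")
    return {"reasoning": reasoning, "similarity_score": score}


# Hypothetical model reply, written by hand for illustration.
sample_reply = '{"reasoning": "Both texts stress data quality.", "similarity_score": 82}'
result = parse_level1_output(sample_reply)
print(result["similarity_score"])
```

A check like this catches malformed replies before they reach the rest of the application, but it says nothing about whether the score itself is reasonable.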

📌 Level 2 — Rubric and Objectivity (Divide & Conquer)

Increased Autonomy

We are not allowing the agent to interpret the abstract concept of "similarity" on its own. We give it rules (rubric). We break the problem into parts.

Rubric:

  • 1. Main Idea (0-20): Do both texts convey the same core message?
  • 2. Tone and Style (0-20): Do the texts have the same formality and emotion?
  • 3. Entities (0-20): Do the names, dates, and numbers in the texts match?
  • 4. Missing Information (0-20): Does Text 2 omit an important detail from Text 1?
  • 5. Fluency (0-20): How structurally coherent is Text 2?

"You are an expert evaluator. Evaluate the two texts according to the 5 criteria below. Each criterion is scored 0-20.
Your output MUST be a valid JSON object. Do not use Markdown or additional text.

Required JSON fields:
- evaluations: A list of 5 items [criterion, score, explanation]
- total_score: The sum of the five criterion scores.

Expected JSON schema:
{
  "type": "object",
  "required": ["evaluations", "total_score"],
  "properties": {
    "evaluations": {
      "type": "array",
      "minItems": 5,
      "maxItems": 5,
      "items": {
        "type": "object",
        "required": ["criterion", "score", "explanation"],
        "properties": {
          "criterion": {"type": "string"},
          "score": {"type": "integer", "minimum": 0, "maximum": 20},
          "explanation": {"type": "string"}
        }
      }
    },
    "total_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}"

Why Is This Better?
  • Grounding: The model no longer has to place the texts on an abstract 0-100 scale; it scores five concrete criteria instead.
  • Objectivity: Scores will be much more consistent even if you run them at different times (variance decreases).
  • Comparability: We can evaluate different prompts against the same rubric.
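
Because the rubric splits the score into five parts, application code can re-check the arithmetic instead of trusting the model's own sum. The sketch below is one possible check; the function name `check_rubric_output`, the requirement that criteria appear in rubric order, and the sample scores are assumptions, not part of the original notebook.

```python
RUBRIC = ["Main Idea", "Tone and Style", "Entities", "Missing Information", "Fluency"]


def check_rubric_output(data: dict) -> int:
    """Validate a Level 2 reply and return the recomputed total score."""
    evaluations = data["evaluations"]
    # Assumption: the model is asked to return criteria in rubric order.
    if [item["criterion"] for item in evaluations] != RUBRIC:
        raise ValueError("Criteria are missing, duplicated, or out of order")
    for item in evaluations:
        score = item["score"]
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 20:
            raise ValueError(f"Invalid score for {item['criterion']}: {score!r}")
    recomputed = sum(item["score"] for item in evaluations)
    if recomputed != data["total_score"]:
        raise ValueError(
            f"total_score {data['total_score']} does not match the sum {recomputed}"
        )
    return recomputed


# Hand-written sample reply with made-up scores.
sample = {
    "evaluations": [
        {"criterion": name, "score": score, "explanation": "..."}
        for name, score in zip(RUBRIC, [18, 16, 17, 15, 14])
    ],
    "total_score": 80,
}
print(check_rubric_output(sample))
```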

📌 Level 3 — Example-Based Rubric Calibration (One-Shot / Few-Shot)

Calibration

In Level 2 we gave the model the rubric, but that was still zero-shot prompting. The model read the criteria yet never saw examples that answer questions such as "When do I give 20 points?", "What is the boundary case for 10?", or "When is 0 appropriate?"

At this level we provide small, representative examples for each criterion.

Scoring Calibration Examples:

  • Main Idea — 20 points: "Data cleaning is critical for model success" and "Model quality heavily depends on clean data" convey the same core message.
  • Main Idea — 10 points: Both texts discuss data quality but one focuses on security risks while the other focuses on model performance.
  • Main Idea — 0 points: One discusses data cleaning, the other discusses a sports match result.
  • Tone and Style — 20 points: Both texts are written in an academic and formal tone.
  • Tone and Style — 10 points: One is formal, the other more conversational, but the meaning is preserved.
Why Is This Better?
  • Consistency: The model uses the same score ranges more reliably.
  • Teachability: The rubric is now supported by behavioral examples, not just an abstract list.
  • Cost: Few-shot examples increase input tokens — so examples must be short and clear.

Expected JSON Schema:

{
  "type": "object",
  "required": ["evaluations", "calibration_note", "total_score"],
  "properties": {
    "evaluations": {
      "type": "array",
      "minItems": 5,
      "maxItems": 5,
      "items": {
        "type": "object",
        "required": ["criterion", "score", "explanation"],
        "properties": {
          "criterion": {"type": "string"},
          "score": {"type": "integer", "minimum": 0, "maximum": 20},
          "explanation": {"type": "string"}
        }
      }
    },
    "calibration_note": {"type": "string"},
    "total_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}
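
One way to keep few-shot examples short and easy to maintain is to store them as data and render them into the prompt. The sketch below does that for the calibration examples listed in this level; the names `CALIBRATION_EXAMPLES`, `build_calibration_block`, and `build_system_prompt` are hypothetical, and the example descriptions are shortened paraphrases.

```python
# (criterion, points, short description) taken from the calibration list in this level.
CALIBRATION_EXAMPLES = [
    ("Main Idea", 20, "Both texts convey the same core message."),
    ("Main Idea", 10, "Same topic (data quality) but a different focus."),
    ("Main Idea", 0, "Unrelated topics, e.g. data cleaning vs. a sports result."),
    ("Tone and Style", 20, "Both texts are academic and formal."),
    ("Tone and Style", 10, "One formal, one conversational; meaning preserved."),
]


def build_calibration_block(examples) -> str:
    """Render the calibration examples as a compact bullet list."""
    lines = ["Scoring calibration examples:"]
    for criterion, points, description in examples:
        lines.append(f"- {criterion}, {points} points: {description}")
    return "\n".join(lines)


def build_system_prompt(rubric_prompt: str, examples) -> str:
    """Append the calibration block to an existing rubric prompt."""
    return rubric_prompt + "\n\n" + build_calibration_block(examples)


block = build_calibration_block(CALIBRATION_EXAMPLES)
print(block)
```

Keeping the examples in a list also makes it easy to measure how much each one adds to the input size before deciding whether it earns its place.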

📌 Level 4 — Tool Calling and Workflow

Grounding Begins

In Level 3 we calibrated the rubric with examples; however, the evaluation still relied solely on LLM interpretation. Now we add deterministic metrics from external systems as evidence.

Metrics Used:

  • ROUGE-L F1: Measures word-sequence overlap.
  • Lightweight Similarity Score: A combination of Token cosine + Token Jaccard + Character 3-gram cosine + Sequence ratio.
Metric Interpretation Rules:
  • A low ROUGE-L F1 indicates low word-sequence overlap; on its own it is not evidence of low semantic similarity.
  • If the Lightweight Similarity Score is higher than ROUGE-L, the texts may convey a similar message with different words.
  • The Lightweight Similarity Score is not a real semantic embedding; it should be used as a decision-support signal, not as the sole decision-maker.
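
The metrics above can be approximated with the Python standard library alone. The following is a minimal sketch, not the notebook's implementation: the tokenizer, the function names, and especially the equal-weight average used for the combined score are assumptions, since the original weighting is not stated. Production ROUGE libraries may also apply stemming or other normalization.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text: str) -> list:
    """Lowercase word tokenizer (an assumed, very simple normalization)."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref, cand = tokenize(reference), tokenize(candidate)
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


def char_ngrams(text: str, n: int = 3) -> Counter:
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def lightweight_similarity(t1: str, t2: str) -> dict:
    tok1, tok2 = tokenize(t1), tokenize(t2)
    set1, set2 = set(tok1), set(tok2)
    union = set1 | set2
    parts = {
        "token_cosine": cosine(Counter(tok1), Counter(tok2)),
        "token_jaccard": len(set1 & set2) / len(union) if union else 0.0,
        "char_3gram_cosine": cosine(char_ngrams(t1), char_ngrams(t2)),
        "sequence_ratio": SequenceMatcher(None, t1.lower(), t2.lower()).ratio(),
    }
    # Assumption: equal weights; the original combination is not specified.
    parts["combined"] = sum(parts.values()) / len(parts)
    return parts


print(rouge_l_f1("a b c d", "a x c y"))
print(lightweight_similarity("clean data matters", "data quality matters"))
```

Both functions are deterministic, so the same text pair always produces the same numbers, which is the property this level relies on.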

🌍 Why Is This Used in the Real World?

  • Reliability: ROUGE and lightweight similarity scores are deterministic — they give the same scores to the same text pair every time.
  • Traceability: Since the LLM's opinion is grounded in external evidence, evaluation becomes more auditable.
  • Cost Control: Lightweight metrics are fast and do not require heavy model dependencies.

🔬 Experiment Hygiene Note

Level 4 uses the same rubric text as Level 3. This is a deliberate decision: the difference between the two levels is not a rubric change, but only external metric context.
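
Since the only change from Level 3 is the metric context, the Level 4 workflow can be pictured as one extra step that appends precomputed metrics to the user message. This is a sketch under assumptions: the function name `build_level4_user_message`, the message layout, and the metric values are invented for illustration.

```python
import json


def build_level4_user_message(text_1: str, text_2: str, metrics: dict) -> str:
    """Attach deterministic metrics, computed before the LLM call, to the user message."""
    return (
        f"Text 1: {text_1}\n"
        f"Text 2: {text_2}\n\n"
        "External metrics (computed in Python before this call):\n"
        f"{json.dumps(metrics, indent=2)}"
    )


# Made-up metric values for illustration only.
example_metrics = {"rouge_l_f1": 0.21, "lightweight_similarity": 0.38}
message = build_level4_user_message("A", "B", example_metrics)
print(message)
```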

Additional Required JSON Fields in Level 4:

  • metric_interpretation: Explain how you interpreted the ROUGE-L F1 and Lightweight Similarity Score values.
  • calibration_note: Explain how the rubric calibration examples affected your scoring.

Expected JSON Schema (Level 4):

{
  "type": "object",
  "required": ["evaluations", "metric_interpretation", "calibration_note", "total_score"],
  "properties": {
    "evaluations": { /* Same 5-item list as Level 3 */ },
    "metric_interpretation": {"type": "string"},
    "calibration_note": {"type": "string"},
    "total_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}

📌 Level 5 — ReAct and Ollama Tool Calling with Real Agent Loop

Real Agentic Loop

We built a strong workflow in Level 4, but was our system a real agent? Not exactly, because we, the developers, decided which metrics to compute and ran them in Python before calling the model. In real agent behavior, the model determines what evidence it needs, the software layer executes the tool, and the result returns to the model.

ReAct Loop:

  • Reasoning: The model determines what external evidence it needs to evaluate text similarity.
  • Action: Instead of writing plain text, the model produces Ollama's native tool_calls field.
  • Observation: Python executes the relevant function and returns the result to the model as role="tool".
  • Final Answer: The model uses the tool results to produce the rubric-based JSON evaluation.

Ollama Native Tool Calling Flow:

  1. Functions are presented to the model via the `tools=[...]` list.
  2. If needed, the model produces `response.message.tool_calls`.
  3. Python executes these tool calls.
  4. Results are added to the conversation as `role="tool"` messages.
  5. The model produces the final answer based on tool results.

Tool Definitions (Python):

import json

# METIN_1 / METIN_2 ("metin" is Turkish for "text"), calculate_rouge, and
# calculate_lightweight_similarity are defined elsewhere in the notebook.


def calculate_rouge_tool(reference_text: str = "", candidate_text: str = "") -> str:
    """Calculate ROUGE scores."""
    # The arguments are ignored here as well; the fixed notebook texts are used.
    return json.dumps(calculate_rouge(METIN_1, METIN_2), ensure_ascii=False)


def calculate_lightweight_similarity_tool(reference_text: str = "", candidate_text: str = "") -> str:
    """Calculate lightweight similarity score for the fixed Text 1 and Text 2 in the notebook.

    Args:
        reference_text: Not considered in training demo; tool uses fixed METIN_1.
        candidate_text: Not considered in training demo; tool uses fixed METIN_2.

    Returns:
        JSON string containing Token cosine, Token Jaccard, character 3-gram cosine,
        sequence ratio, and combined score.
    """
    return json.dumps(calculate_lightweight_similarity(METIN_1, METIN_2), ensure_ascii=False)

Tool Calling Loop (Pseudo-Python):

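# Map the tool names the model may request to the Python callables that run them.
available_tools = {
    "calculate_rouge_tool": calculate_rouge_tool,
    "calculate_lightweight_similarity_tool": calculate_lightweight_similarity_tool,
}

messages = [
    {"role": "system", "content": TOOL_CALLING_SYSTEM_PROMPT},
    {"role": "user", "content": "Text 1: ..., Text 2: ..."},
]

while True:
    response = client.chat(
        model=MODEL,
        messages=messages,
        tools=[calculate_rouge_tool, calculate_lightweight_similarity_tool],
    )
    messages.append(response.message)

    # No tool calls means the model has produced its final answer.
    if not response.message.tool_calls:
        print(response.message.content)  # Final answer
        break

    # Execute every requested tool and feed the result back as a tool message.
    for tool_call in response.message.tool_calls:
        tool_name = tool_call.function.name
        result = available_tools[tool_name](**tool_call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_name": tool_name,
            "content": str(result),
        })
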
Why Is This a Mastery-Level Skill?
  • ReAct is not just writing Thought / Action / Observation; it is connecting thought to real tool execution.
  • An agent is the joint design of prompt, tool calling, execution loop, grounding, and error control layers.
  • The model determines tool needs, Python executes the tool, and the final evaluation is supported by external evidence.

📊 Level 5 Token Cost:

"Because Level 5 is a multi-turn agent loop, its input token count is more than the length of the first user prompt. Every client.chat call sends the system message, the user message, all previous assistant messages, and all tool results back into the context. The Level 5 input token total is therefore a cumulative count of processed tokens (a cost indicator), not a count of unique tokens."

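The cumulative effect described in the token cost note can be worked through with simple arithmetic. The sketch below assumes a loop where each tool round adds one assistant message and its tool results to the context; the function name and all token counts are hypothetical.

```python
def cumulative_input_tokens(system_tokens: int, user_tokens: int, tool_rounds: list):
    """Return the input size of each chat call and their sum.

    tool_rounds holds one (assistant_tokens, tool_result_tokens) pair per round
    in which the model requested tools.
    """
    context = system_tokens + user_tokens
    per_call = [context]  # the first call sees only system + user
    for assistant_tokens, tool_result_tokens in tool_rounds:
        context += assistant_tokens + tool_result_tokens
        per_call.append(context)  # the next call re-reads everything so far
    return per_call, sum(per_call)


# Made-up numbers: 400-token system prompt, 200-token user prompt, one tool round.
per_call, total = cumulative_input_tokens(400, 200, [(50, 120)])
print(per_call, total)  # [600, 770] 1370
```

With one tool round the model is called twice, so the 600 tokens of the original prompt are billed twice even though they were written only once.
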
🛡️ In the Real World: Production Guardrails

In real production, these protections are added: schema validation, maximum step limit, tool allowlist, retry, and tracing/logging.
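
Two of these protections, the maximum step limit and the tool allowlist, can be sketched without a real model. In the example below, `fake_chat` is a stand-in for the model, messages are plain dictionaries rather than Ollama response objects, and the tool returns a made-up metric; all of these are assumptions chosen to keep the sketch runnable.

```python
import json

MAX_STEPS = 4  # assumed step limit


def calculate_rouge_tool() -> str:
    """Hypothetical tool that returns a fixed, made-up metric."""
    return json.dumps({"rouge_l_f1": 0.21})


ALLOWED_TOOLS = {"calculate_rouge_tool": calculate_rouge_tool}  # tool allowlist


def run_guarded_loop(chat_fn, messages: list) -> dict:
    """Run the tool loop with a step limit, an allowlist, and a minimal output check."""
    for _ in range(MAX_STEPS):
        reply = chat_fn(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            final = json.loads(reply["content"])  # fails loudly on non-JSON output
            if "total_score" not in final:  # stand-in for full schema validation
                raise ValueError("Final answer is missing total_score")
            return final
        for call in tool_calls:
            name = call["name"]
            if name in ALLOWED_TOOLS:
                content = ALLOWED_TOOLS[name](**call.get("arguments", {}))
            else:
                content = f"Error: tool '{name}' is not allowed"
            messages.append({"role": "tool", "tool_name": name, "content": content})
    raise RuntimeError("Step limit reached without a final answer")


def fake_chat(messages: list) -> dict:
    """Stand-in for a real model: requests one tool, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {
            "role": "assistant",
            "content": "",
            "tool_calls": [{"name": "calculate_rouge_tool", "arguments": {}}],
        }
    return {"role": "assistant", "content": '{"total_score": 80}'}


result = run_guarded_loop(fake_chat, [{"role": "user", "content": "Text 1: ..., Text 2: ..."}])
print(result)
```

Retry and tracing are left out here; they would wrap the same loop rather than change its structure.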

Level 5 Tool Calling JSON Schema:

{
  "type": "object",
  "required": ["evaluations", "metric_interpretation", "calibration_note", "total_score"],
  "properties": {
    "evaluations": {
      "type": "array",
      "minItems": 5,
      "maxItems": 5,
      "items": {
        "type": "object",
        "required": ["criterion", "score", "explanation"],
        "properties": {
          "criterion": {"type": "string"},
          "score": {"type": "integer", "minimum": 0, "maximum": 20},
          "explanation": {"type": "string"}
        }
      }
    },
    "metric_interpretation": {"type": "string"},
    "calibration_note": {"type": "string"},
    "total_score": {"type": "integer", "minimum": 0, "maximum": 100}
  }
}

📌 Level 6 — Rubric-Based Sub-Agents and Python Aggregator

Modular Architecture

In Level 5, a single agent both called tools and interpreted the entire rubric on its own. At this level we try a different architecture: instead of one large prompt, we give each rubric criterion to a separate sub-agent.

Architecture:

  • 1 generic sub-agent function is written.
  • This function is called 5 times with 5 different rubric configurations.
  • Each sub-agent evaluates only its own criterion.
  • Python aggregator validates, sorts, and calculates the total score.
  • The aggregator makes no LLM calls — it is deterministic.
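
The "one generic function, five configurations" idea can be sketched as follows. The `question` key, the prompt wording, and the `fake_llm` stand-in are assumptions; the notebook's real prompt builder is not shown in this post. Only the `criterion` key and the `parsed` result field mirror names used elsewhere in this level.

```python
import json

# One configuration per rubric criterion; the "question" key is hypothetical.
RUBRIC_AGENT_CONFIGS = [
    {"criterion": "Main Idea", "question": "Do both texts convey the same core message?"},
    {"criterion": "Tone and Style", "question": "Do the texts have the same formality and emotion?"},
    {"criterion": "Entities", "question": "Do the names, dates, and numbers in the texts match?"},
    {"criterion": "Missing Information", "question": "Does Text 2 omit an important detail from Text 1?"},
    {"criterion": "Fluency", "question": "How structurally coherent is Text 2?"},
]


def run_sub_agent(config: dict, text_1: str, text_2: str, llm) -> dict:
    """Evaluate a single criterion; llm is any callable(system, user) -> JSON string."""
    system_prompt = (
        f"You evaluate only the criterion '{config['criterion']}'. "
        f"{config['question']} Score it from 0 to 20 and reply with a JSON object "
        "containing criterion, score, explanation, and evidence."
    )
    raw = llm(system_prompt, f"Text 1: {text_1}\nText 2: {text_2}")
    return {"criterion": config["criterion"], "parsed": json.loads(raw)}


def fake_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    criterion = system_prompt.split("'")[1]
    return json.dumps(
        {"criterion": criterion, "score": 15, "explanation": "stub", "evidence": "stub"}
    )


sub_agent_results = [run_sub_agent(c, "A", "B", fake_llm) for c in RUBRIC_AGENT_CONFIGS]
print([r["parsed"]["criterion"] for r in sub_agent_results])
```

Passing the model in as a callable keeps the sub-agent function testable without a running LLM server.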

The Pedagogical Message of This Level:

✅ "Don't make everything an agent!"
The LLM is for subjective evaluation. Python is for validation, aggregation, formatting, and deterministic computation.

Limitations:

  • 5 sub-agents = 5 LLM calls = more expensive than Level 5.
  • Not necessary in every case; it makes sense when the rubric grows or when auditability is critical.

Level 6 — Sub-Agent JSON Output (single criterion):

{
  "criterion": "Main Idea",
  "score": 18,
  "explanation": "Specific evaluation for this criterion",
  "evidence": "Basis from text or metrics"
}

Python Aggregator Total Output (all criteria):

{
  "evaluations": [
    {"criterion": "Main Idea", "score": 18, "explanation": "...", "evidence": "..."},
    {"criterion": "Tone and Style", "score": 16, "explanation": "...", "evidence": "..."},
    {"criterion": "Entities", "score": 17, "explanation": "...", "evidence": "..."},
    {"criterion": "Missing Information", "score": 15, "explanation": "...", "evidence": "..."},
    {"criterion": "Fluency", "score": 14, "explanation": "...", "evidence": "..."}
  ],
  "metric_interpretation": "ROUGE and lightweight similarity metrics were provided as context to each sub-agent; each criterion was interpreted by its own expert sub-agent.",
  "aggregation_note": "Total score was computed deterministically by the Python aggregator; no LLM orchestrator was used.",
  "total_score": 80
}

Level 6 Python Aggregator Function Example:

def aggregate_sub_agent_results(sub_agent_results, metrics_context):
    # The sub-agents must report exactly the configured criteria, in order.
    expected_criteria = [config['criterion'] for config in RUBRIC_AGENT_CONFIGS]
    actual_criteria = [result['parsed'].get('criterion') for result in sub_agent_results]
    if actual_criteria != expected_criteria:
        raise ValueError("Sub-agent criterion order does not match expected.")

    evaluations = []
    for result in sub_agent_results:
        parsed = result['parsed']
        score = parsed.get('score')
        # bool is a subclass of int, so reject True/False explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 20:
            raise ValueError(f"Invalid score: {parsed}")
        evaluations.append({
            'criterion': parsed['criterion'],
            'score': score,
            'explanation': parsed['explanation'],
            'evidence': parsed['evidence'],
        })

    # Deterministic aggregation: no LLM call is made here.
    total_score = sum(item['score'] for item in evaluations)
    return {
        'evaluations': evaluations,
        'metric_interpretation': '...',
        'aggregation_note': 'Total score was computed deterministically by Python.',
        'total_score': total_score,
    }

Level 6 — Token Cost:

  • 5 sub-agents = 5 independent LLM calls
  • Each call includes system prompt + user prompt + metric context
  • Total input tokens ≈ 5 × (system prompt + user prompt + metric context)
  • Aggregator token cost = 0 (Python code runs)
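
As a quick worked example of this cost model, the sketch below multiplies the per-call input size by the number of sub-agents. The function name and the token counts are hypothetical, and the estimate ignores small differences between the five criterion-specific prompts.

```python
def level6_input_tokens(system_tokens: int, user_tokens: int,
                        metric_tokens: int, n_sub_agents: int = 5) -> int:
    """Approximate total input tokens for independent sub-agent calls."""
    per_call = system_tokens + user_tokens + metric_tokens
    return n_sub_agents * per_call  # the Python aggregator adds no tokens


# Made-up sizes: 300-token system prompt, 200-token user prompt, 80 tokens of metrics.
print(level6_input_tokens(300, 200, 80))  # 2900
```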

Level 6 — Sub-Agent JSON Schema (inside build_sub_agent_system_prompt):

{
  "type": "object",
  "required": ["criterion", "score", "explanation", "evidence"],
  "properties": {
    "criterion": {"type": "string"},
    "score": {"type": "integer", "minimum": 0, "maximum": 20},
    "explanation": {"type": "string"},
    "evidence": {"type": ["string", "array", "object"]}
  }
}

📌 Level 7 — Orchestrator Agents and When They Are Not Needed

Architectural Decision

After Level 6 the natural question is: "Wouldn't it be better if an orchestrator agent managed these sub-agents?"

An orchestrator agent is a powerful architecture in the real world. It can break down tasks, decide which sub-agent to run, select appropriate tools, initiate retries on missing or contradictory results, and convert results from different agents into a final decision.

However, in this example we deliberately do not need it, because:

  • The 5 rubric criteria are predetermined.
  • Every criterion must run.
  • Each sub-agent evaluates only its own criterion.
  • Missing-criterion checks, ordering, and the total score can all be handled reliably in Python.

🏗️ When Does an Orchestrator Agent Make Sense?

  • When the set of sub-agents to run changes from task to task.
  • When tool selection, data source selection, or workflow branching is needed.
  • When there are contradictions between sub-agent responses and an interpretive reconciliation is needed.
  • When there are dynamic steps such as quality control, retry, missing information completion, or human approval.

Level 7's Message: Orchestrator + sub-agent architectures exist and are important; however, they are not necessary in every problem. In this example, the Python aggregator is the correct, simple, and instructive choice.

💡 Level 7 Note: This level does not make a new LLM call. It is an architectural decision-making and boundary-setting section. Performance graphs show the cost of Level 0-6 experiments.

📊 Comparison of All Levels

| Level | Approach | Added Layer | Gain | Limitation / Lesson |
| --- | --- | --- | --- | --- |
| Lvl 0 | Persona / Black Box | Simple system prompt | Fast start | Inconsistent, unexplainable, hallucination-prone |
| Lvl 1 | JSON + Explanation | Structured output | Answer becomes parseable | Still subjective |
| Lvl 2 | Rubric | Criteria-based evaluation | More objective score | No external evidence, still LLM opinion |
| Lvl 3 | One-Shot / Few-Shot Calibration | Criteria-based examples | More consistent scores | Input token cost increases |
| Lvl 4 | Workflow + Tools | ROUGE + lightweight similarity metrics | Grounded, evidence-based | Developer selects the tools |
| Lvl 5 | ReAct + Tool Calling | Automatic tool calling + execution loop | Real agent behavior | High cost, loop management needed |
| Lvl 6 | Sub-Agents + Python Aggregator | Generic sub-agent + deterministic aggregation | Task decomposition, responsibility separation | 5× LLM calls, more expensive |
| Lvl 7 | Orchestrator Agent Decision | Architectural decision-making | Understanding of advanced architectures | Orchestrator not needed everywhere |

🎓 Final Message: Prompt Engineering Becomes Systems Engineering

Simply giving an agent a powerful-sounding prompt and expecting it to work correctly is not enough.
A Real AI Agent is not just a model that produces answers; it is a software system that jointly manages thinking patterns, tool usage, reasoning steps, and data-driven evidence.

As we progressed from Level 0 to Level 7, we actually built the same idea layer by layer. First we structured the output, then we broke evaluation into rubric parts, then we calibrated with examples. Then we added grounding with metrics and connected the ReAct idea to real tool execution. Finally, we saw that more advanced architectures like orchestrator agents exist, but in this example, leaving out an extra LLM orchestrator was the better engineering decision.

Our value as AI Engineers emerges here: instead of expecting miracles from the model, understand the model's strengths and weaknesses and build the right architecture around them. Good agent design is not just about writing prompts; it is about proving with data, supporting with tools, making reasoning visible, and making outputs measurable.

📝 Test Texts Used in the Training

These two texts were specifically selected: low word overlap (low ROUGE), yet they convey a semantically similar message.

Text 1 (Reference):
"In the process of training artificial intelligence models, the use of high-quality datasets is of critical importance. If the dataset contains incorrect, biased, or incomplete information, the results produced by the model will inevitably be flawed and unreliable. Therefore, data cleaning is a more prioritized step than the complexity of the model architecture."
Text 2 (System Output):
"The success of machine learning algorithms heavily depends on the quality of the information they are fed. Algorithms fed with dirty, biased, or incomplete data will naturally produce incorrect and untrustworthy outputs. Therefore, filtering and organizing data is a much more essential process than building the system's infrastructure."


Friday, January 17, 2025

Understanding How Prompts Shape LLM Responses


Understanding How Prompts Shape LLM Responses: Mechanisms Behind "You Are a Computer Scientist"

Large Language Models (LLMs) are incredibly versatile, offering diverse outputs depending on the prompts they receive. For instance, providing a prompt like “You are a computer scientist” yields a very different response compared to “You are an economist.” But what drives these changes? What mechanisms process these prompts within the model? Let’s dive into the core principles and workings behind this fascinating behavior.


1. The Role of Transformers and Context Representation

LLMs such as GPT are based on the Transformer architecture, which processes prompts through a mechanism called self-attention. Here's how it works:

  • Self-Attention: This component analyzes how each word in the prompt relates to others.
  • Context Framing: A prompt like “You are a computer scientist” sets a frame, directing the model to focus on knowledge and vocabulary relevant to computer science.

The framing influences how the model processes subsequent words, shaping the tone and content of the response.
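
A toy version of self-attention helps show how each token's representation becomes a weighted mix of the others. The sketch below is heavily simplified: it uses the raw embeddings as queries, keys, and values (real Transformers apply learned projections and multiple heads), and the three 2-dimensional vectors are invented for illustration.

```python
import math


def softmax(xs: list) -> list:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: list):
    """Scaled dot-product self-attention with Q = K = V = the input vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # New representation: attention-weighted average of all token vectors.
        out = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up embeddings for three tokens; the first two point in similar directions.
tokens = ["computer", "scientist", "economist"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

In this toy setup, "computer" places more weight on "scientist" than on "economist" simply because their vectors are closer.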


2. Pre-Trained Knowledge of the Model

LLMs are pre-trained on vast datasets, which means they have absorbed a wide array of contexts, terminologies, and knowledge areas, such as:

  • Word Associations: Understanding which words commonly appear together.
  • Domain-Specific Patterns: Recognizing patterns specific to fields like economics or computer science.

When given a prompt, the model recalls relevant patterns and applies them to craft its response.


3. How Prompts Change Context and Meaning

Prompts influence the model’s output in two significant ways:

a. Word Selection and Priority:

In a technical prompt like "You are a computer scientist," the model tends to prioritize technical jargon, algorithms, or programming concepts.

b. Tone and Approach:

In contrast, “You are an economist” triggers the model to shift towards economic theories, trends, or statistical data.

This dynamic shift is achieved by re-weighting the probabilities of word choices based on the given context.
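
The re-weighting idea can be illustrated with a toy softmax. In a real model the context changes the logits through the whole network; the lookup table below is only a stand-in for that effect, and every word, role, and number in it is invented.

```python
import math

# Toy next-word scores before any role is given (all values are made up).
BASE_LOGITS = {"algorithm": 1.0, "inflation": 1.0, "recipe": 1.0}

# Hypothetical boost that a role prompt gives to related words.
ROLE_SHIFT = {
    "computer scientist": {"algorithm": 2.0},
    "economist": {"inflation": 2.0},
}


def next_word_probs(role: str) -> dict:
    """Apply the role's boost to the logits, then normalize with softmax."""
    shift = ROLE_SHIFT.get(role, {})
    logits = {word: value + shift.get(word, 0.0) for word, value in BASE_LOGITS.items()}
    m = max(logits.values())
    exps = {word: math.exp(value - m) for word, value in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


for role in ("computer scientist", "economist"):
    probs = next_word_probs(role)
    print(role, {word: round(p, 3) for word, p in probs.items()})
```

The same three candidate words are available under both roles; only their probabilities change.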


4. The Art of Prompt Engineering

Prompt engineering is the deliberate crafting of inputs to guide the model’s responses effectively. A good prompt:

  • Defines Roles: Example: “You are a helpful assistant.”
  • Specifies Tasks: Example: “Write a Python script for sorting algorithms.”
  • Shapes Output Style: Example: “Explain it to a 5-year-old.”

These nuances help extract specific, accurate, and meaningful outputs from the model.


5. Mechanics at Work

Under the hood, this process is governed by probabilistic mechanisms:

  • Dynamic Word Distributions: The model calculates the probability of each possible next word based on the context.
  • Attention Mechanisms: Through self-attention, the tokens of a prompt like "You are a computer scientist" influence the representation of every later token, which raises the probability of related topics and phrases.

6. Advanced Techniques: Prefix Tuning and Fine-Tuning

To refine how prompts influence the model, advanced techniques can be employed:

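  • Prefix Tuning: Trains a small set of continuous “prefix” vectors that are prepended to the model’s input at each layer while the base model’s weights stay frozen, steering the model toward a task without full retraining.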
  • Fine-Tuning: Retrains the model on specialized data to align its responses with a specific domain or task.

7. Key Takeaway

The behavior of LLMs is deeply tied to how prompts direct their focus and leverage their vast pre-trained knowledge. Understanding these mechanisms and crafting effective prompts can unlock the full potential of LLMs, allowing you to tailor responses to specific needs with precision.

By experimenting with prompt variations, you can discover how subtle changes in phrasing yield drastically different results. This is the art and science of working with LLMs—a powerful skill in the AI era.