Prompt Engineering Guide: Essential Insights and Key Practices

Prompt Engineering: What You Need to Know

Cesar Fazio
- 3 min. to read

Prompt Engineering is the disciplined practice of optimizing inputs to steer model behavior, thereby differentiating a generic AI response from a high-value solution.

Large Language Models (LLMs) like GPT-4 and Gemini have changed the interaction between humans and machines. No longer confined to rigid syntax or specific programming languages, we can now code in plain English, automate tasks, and even delegate decisions.

However, the quality of an AI’s output is bounded by the quality of its prompt. AI prompt engineering aims to optimize the model’s output by refining its input, producing answers that are logical and relevant.

Prompt engineering is the discipline of closing that gap. This guide covers 9 core prompt engineering techniques, the frameworks that structure them, and the underlying mechanics that determine why some prompts work and others hallucinate. Whether you are automating workflows, building AI agents, or just trying to get better answers from ChatGPT, these techniques apply.

Prompt Engineering Techniques: Quick Reference

The table below summarizes all 9 prompt engineering techniques covered in this guide, with a one-line description and their primary use case:

| Technique | What it does | Best for |
| --- | --- | --- |
| Zero-Shot Prompting | Ask the model with no examples | Simple, general tasks |
| Few-Shot Prompting | Provide 2–5 examples to set the pattern | Specialized or structured outputs |
| Chain-of-Thought (CoT) | Ask the model to reason step by step | Math, logic, multi-step problems |
| Self-Consistency | Generate multiple reasoning paths, and take the most common answer | High-stakes accuracy requirements |
| Tree of Thoughts (ToT) | Branch, evaluate, and backtrack through reasoning | Complex problem-solving |
| Directional Stimulus Prompting | Provide keywords to guide the output focus | Summaries, focused content generation |
| Maieutic Prompting | Force true/false verification loops to catch contradictions | Fact-checking, hallucination reduction |
| Skeleton-of-Thought (SoT) | Outline first, then expand each point in parallel | Long-form content generation |
| Retrieval-Augmented Generation (RAG) | Retrieve external data before generating a response | Current events, private knowledge bases |

How Prompt Engineering Works: The Technical Foundation

Mastering prompt engineering starts with its mechanical foundation. While you don’t need to be a machine learning expert to make these techniques work for you, it is helpful to understand what constrains the system so that you don’t work against it.

Context Windows and Why They Matter

Context window (definition):
A context window is the maximum amount of text (measured in tokens) that a large language model can process in a single interaction. Text that exceeds the context window is dropped, which can cause the model to lose track of earlier instructions or produce inconsistent responses.

For prompt engineers, the context window is the most critical operational constraint. Modern models range from 8,000 tokens (GPT-3.5) to over 1 million tokens (Gemini 1.5 Pro). Techniques like RAG and Skeleton-of-Thought were specifically developed to work within (and around) this limit.
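To make the constraint concrete, here is a minimal sketch of a pre-flight budget check. The ~4-characters-per-token heuristic and the `fits_context` helper are illustrative assumptions; real tokenizers (such as tiktoken) give exact counts.

```python
# Rough token-budget check before sending a prompt.
# Assumption: ~4 characters per token is a common rule of thumb;
# a real tokenizer gives exact counts.

def estimate_tokens(text: str) -> int:
    """Approximate token count using the ~4 chars/token heuristic."""
    return max(1, len(text) // 4)

def fits_context(prompt: str, context_window: int, reserved_for_output: int = 512) -> bool:
    """Check whether a prompt leaves room for the model's reply."""
    return estimate_tokens(prompt) + reserved_for_output <= context_window

print(fits_context("Summarize this paragraph." * 10, context_window=8000))  # True
```

Reserving tokens for the reply matters: a prompt that exactly fills the window leaves the model no room to answer.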

Why LLMs Hallucinate

AI hallucination (definition):
An AI hallucination occurs when a large language model generates information that sounds plausible but is factually incorrect or fabricated. Hallucinations are not random errors; they are a structural behavior of the model, which is trained to produce fluent, user-pleasing responses even when it lacks accurate information.

Researchers have even pinpointed a tiny subset of neurons (called H-neurons), fewer than 1 in 100,000 in any given model, that account for hallucinatory behavior. Many prompt engineering techniques (Chain-of-Thought, Self-Consistency, Maieutic Prompting, RAG) are designed primarily to mitigate this risk.

Temperature and Inference Parameters

Prompts are the instructions. Parameters are the settings that control how the model executes them:

  • Temperature: Controls randomness. Low temperature (0.1) makes the model literal and predictable. Best for coding and factual tasks. High temperature (0.8+) makes it creative and varied, better for brainstorming and copywriting.
  • Top-P (Nucleus Sampling): Limits the model’s vocabulary pool to the most probable next words, preventing extreme outlier responses.

Most use cases benefit from low temperature (0.1–0.3) for factual, structured outputs and higher temperature (0.6–0.9) for creative work. The right combination depends on the task, not the model’s default settings.
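To see what temperature actually does to the probability distribution, here is a toy sketch; the logits and the three-token vocabulary are made up for illustration.

```python
import math

# Toy demonstration of how temperature reshapes next-token probabilities.

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.5, 1.0]  # model's raw scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
hot = softmax_with_temperature(logits, 1.5)   # flatter, more varied

print(round(cold[0], 3))  # top token dominates at low temperature
print(round(hot[0], 3))   # probability mass spreads out at high temperature
```

At temperature 0.1 the top token takes nearly all the probability mass; at 1.5 the distribution flattens, so sampling produces more varied output.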

Which Model Is Best for Which Task?

LLMs are not interchangeable. Architecture, training data, and optimization all affect which model performs best for a given task:

| Task | Top Models | Key Strength |
| --- | --- | --- |
| Coding & debugging | Claude 3.5 Sonnet / Opus, GPT-4o | Accuracy, error diagnosis |
| Complex logic & math | DeepSeek-R1, QwQ-32B | Multi-step reasoning |
| Long document analysis | Gemini 1.5 Pro, Claude 3 | 1M+ token context window |
| Creative writing | GPT-4o, Claude 3 | Nuance, tone, wit |
| High-speed/low-cost tasks | Claude Haiku, Gemini Flash | Latency and cost efficiency |

The 5 Rules of Prompt Engineering

Prompt Engineering is much like a science. You need to plan, create, and test prompts across different large language models.

The Golden Rule: GIGO

“Garbage In, Garbage Out.” If your instructions are not clear, the outcome won’t be as expected.

For example, you could ask an LLM: “Make me an egg”.

“What does that even mean?” The LLM will try to figure it out. For example, your LLM might try to make a picture of an egg, give you an omelet recipe, give you advice on how you could turn your life more like an “egg”… the possibilities are endless because your prompt was open-ended.

The golden rule of prompt engineering: write the instructions you would give a person to accomplish the task. AIs are geniuses, as long as you instruct them clearly.

Second Rule: Make it Simple; Complexity is for Humans

Don’t overcomplicate it either. Simple prompts solve most everyday problems. Reserve complex prompts for complex tasks.

However, be aware that overly complex tasks may still be beyond an AI’s current capabilities. Amazon’s Kiro, for example, deleted an entire production environment during a hallucination. Many tasks still require a senior developer’s decision and accountability, an approach known as “human-in-the-loop”.

Third Rule: Expensive Models Have Better Responses

Better models are suited to more complex problems. If you don’t have the budget, you can still tackle complex problems with free models, but doing so demands stronger prompt engineering skills.

Fourth Rule: Some Models are Better for Some Tasks

Large Language Models (LLMs) are not one-size-fits-all tools; they vary significantly in architecture, training data, and optimization, making some models better suited for specific tasks than others. Moreover, using different LLMs means no vendor lock-in. Selecting the right model depends on balancing accuracy, speed, cost, and context length.

| Task | Top Models | Key Strength |
| --- | --- | --- |
| Coding | Claude 3.5 Sonnet/Opus, GPT-4o | Accuracy, debugging |
| Complex Logic | DeepSeek-R1, QwQ-32B | Reasoning, math |
| Long Documents | Gemini 1.5 Pro, Claude 3 | 1M+ context window |
| Creative Writing | GPT-4o, Claude 3 | Nuance, wit, tone |
| High Speed | Claude 4 Haiku, Gemini Flash | Cost/latency |

Fifth Rule: Know Your Prompt Engineering Challenges and Constraints

Rather than simply asking a friendly chatbot questions, modern prompt engineering requires a deep understanding of model architecture, tokenization, and contextual constraints. It is an iterative science aimed at reducing hallucinations and maximizing the logical consistency of generative AI.

Components of Prompt Engineering

Prompt engineering is navigating a model’s pre-existing knowledge map to find the best path to a solution. While it involves technical concepts, it is distinct from the deep-learning processes used to build the model. Here are the key elements:

1. Model Architectures

Most modern LLMs are built on the Transformer architecture, which serves as the foundation for large language models like ChatGPT. Using self-attention techniques, these architectures enable models to comprehend context and manage enormous volumes of data. 

For a prompt engineer, the most critical technical constraint is the context window: the finite amount of text the model can “keep in mind” at one time. When the input exceeds that limit, earlier content falls out of scope and the model often hallucinates.

Why do AIs Hallucinate?

Researchers have identified a tiny group of neurons (called H-neurons) responsible for hallucinations. Surprisingly, they represent a minuscule fraction of all neurons in the model (less than 1 in 100,000). Hallucinations are not merely data errors; they are a fundamental behavior of the model: to be overly compliant and please the user.

How to Avoid Hallucination?

A prompt engineer can design strategies like RAG (Retrieval-Augmented Generation) to feed the model only the most relevant data, so it stays within its memory limits without losing the thread of the conversation.

2. Inference Parameters (Top-k Sampling and Temperature)

While prompts are the instructions, parameters are the dials. LLMs use techniques such as temperature control and top-k sampling during response generation to control how diverse and unpredictable an output is. For example, answers are more varied at a higher temperature.

  • Temperature: Controls the randomness. A low temperature (0.1) makes the model predictable and literal (best for coding); a high temperature (0.8+) makes it creative and diverse (ideal for brainstorming).
  • Top-P (Nucleus Sampling): Limits the model to a “pool” of the most likely next words.

In most cases, maximizing an AI model’s outcomes requires prompt engineers to modify these settings. They must find the “Golden Ratio” between these settings to ensure the output is neither robotic nor nonsensical.

3. Latent Space and In-Context Learning

Contrary to popular belief, prompts do not use gradients or loss functions to “train” the model. Gradients are used only during the initial training phase to update the model’s permanent weights.

Instead of training, prompt engineering uses “In-Context Learning.” By providing examples (Few-Shot Prompting), you are guiding the model’s attention to a specific area of its existing Latent Space.

The prompt engineer’s role involves crafting “system prompts” that serve as a persistent persona or set of guardrails, ensuring the model stays within a specific logical “lane” without requiring retraining.

4. Non-Determinism and Evaluation

Since LLMs are probabilistic (we can’t predict exactly what they will produce next), the same prompt can yield different results.

So one must move beyond “vibes” by building Evaluation Rubrics (or “Evals”): for example, running a prompt through hundreds of test cases to measure its accuracy, latency, and safety before it ever reaches a user.
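The rubric idea can be sketched in a few lines. `call_model` below is a stub standing in for a real API call, and the template and test cases are illustrative.

```python
# Minimal evaluation harness sketch: run a prompt template over labeled
# test cases and measure accuracy.

def call_model(prompt: str) -> str:
    """Stub model: naive keyword-based sentiment classifier."""
    return "Negative" if "hate" in prompt.lower() else "Positive"

TEMPLATE = "Classify the sentiment as Positive or Negative.\nReview: {review}"

test_cases = [
    {"review": "I love this product!", "expected": "Positive"},
    {"review": "I hate the battery life.", "expected": "Negative"},
]

def evaluate(template, cases):
    """Return the fraction of cases where the model matched the label."""
    hits = sum(call_model(template.format(review=c["review"])) == c["expected"]
               for c in cases)
    return hits / len(cases)

print(evaluate(TEMPLATE, test_cases))
```

In practice you would also log latency and flag unsafe outputs per case, not just exact-match accuracy.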

To catch hallucinations, prompt engineers also employ a technique called “LLM as a judge,” which uses a separate model instance to evaluate the output’s quality.

Basic Prompt Engineering Techniques

Prompt engineering is the process of refining inputs to pull the highest quality responses from large language models (LLMs). Rather than treating the AI as a search engine, effective prompting treats it as a highly capable but literal-minded collaborator.

By using structured frameworks, you reduce ambiguity and hallucinations, ensuring the model understands not just what to do, but how and why it should do it.

P.R.O.M.P.T.

The P.R.O.M.P.T. acronym is a structured framework for improving the effectiveness of interactions with generative AI models. It ensures outputs are accurate and relevant. While several variations exist, the most common interpretation stands for: 

  • Purpose: Define the specific goal or objective of the prompt.
  • Role: Assign a persona to the AI (e.g., “Act as a marketing expert”).
  • Output: Determine the desired format or structure (e.g., table, list, essay).
  • Markers/Constraints: Set clear constraints, such as word count or tone.
  • Patterns/Examples: Provide examples to guide the AI’s reasoning (few-shot prompting).
  • Tone: Specify the tone (e.g., professional, witty, concise). 

Alternative AI Prompt Frameworks

Other popular mnemonic devices for prompt engineering include:

  • CLEAR: Context, Logical, Explicit, Adaptive, Reflective.
  • CRAFT: Context, Role, Action, Format, Target.
  • R-TAC: Role, Task, Audience, Context. 

Prompt Engineering Refinement

If the output is not as expected, remember these steps to refine your next prompt.

  1. Define the task and the criteria of success
    1. Is your model actually optimized for the task?
    2. Is the latency acceptable? How long can a response take? A thinking model will take longer.
    3. What is your budget? An expensive model will give better results but drain more resources.
  2. Develop edge cases to ensure the output is acceptable
  3. Write the initial prompt.
  4. Test the prompt against other LLMs.
  5. Rewrite the prompt and test it again. Repeat until you have the best prompt with the best consistent results.
  6. Use the prompt in production.

Markdown and XML

Both Markdown and XML formats are useful for developing prompts. While plain text works, treating your prompt like a structured document is a “pro-tier” move in prompt engineering.

It helps the model distinguish between your instructions, the data it needs to process, and the output format you want. Think of it this way: if a prompt is a messy desk, Markdown and XML are the labeled folders and filing cabinets.

Markdown Prompt Example

# Task: Customer Support Reply
## Context
You are a helpful assistant for a tech company.
## Constraints
Keep it under 100 words.
Use a friendly tone.
## User Query
"My router is blinking red, and I'm frustrated!"

XML Prompt Example

<system_role>
You are a legal document analyzer.
</system_role>

<direction_stimulus>
Focus on: termination clauses, liability limits.
</direction_stimulus>

<contract_text>
[Insert the legal text to be analyzed here, omitting real data to avoid compliance issues]
</contract_text>

Markdown vs. XML: Which one wins?

Most experts in prompt engineering actually use a hybrid approach. They use Markdown for the general structure and XML tags to wrap the specific variables or data they want the AI to process.

| Feature | Markdown | XML |
| --- | --- | --- |
| Readability | High (looks like a doc) | Medium (looks like code) |
| Separation | Good (uses headers) | Excellent (uses tags) |
| Best For | Chatting & creative writing | Data processing & long contexts |
| Model Preference | Great for GPT-4/Gemini | Excellent for Claude/Gemini |

The Policy Guide

At the end of May 2025, Pliny leaked internal prompts from Anthropic for Claude. What is worth noticing is how their prompts look less like an instruction list and more like a policy guide.

Instead of just telling the AI what to do, these prompts structure how it should act, decide, and avoid errors in complex scenarios. Here is what we can learn from their prompts:

  • Prevention over Instruction: Most of Anthropic prompt engineering focuses on rules of conduct (what not to do) rather than just task instructions.
  • Logical Structure: They use conditional logic (if-else) to guide responses, behaving like a decision-making algorithm.
  • Organization and Consistency: XML tags organize the document into specific context blocks, allowing the AI to consult the correct “policy” for each situation.
  • Negative Examples: Defining what not to do helps establish clear limits, much like a compliance manual.

Zero-Shot and Few-Shot Prompting

Zero-shot is when you ask the model to perform a task without providing any examples. You are relying entirely on the model’s pre-existing knowledge and its ability to follow instructions.

Zero-Shot Example

Prompt:

Classify the sentiment of the following review as Positive, Neutral, or Negative.
Review: The battery life on this phone is okay, but the camera is incredible.

Output:

Positive.

Few-Shot involves providing a few examples (usually 2 to 5) within the prompt to show the model the desired pattern, tone, or format. This “conditions” the model to follow a specific logic.

Prompt:

Text: The movie was a masterpiece. -> Sentiment: Very Happy
Text: I hated every second of it. -> Sentiment: Very Sad
Text: It was fine, I guess. -> Sentiment: Indifferent
Text: The plot was confusing, but the acting was decent. -> Sentiment:

Output:

Mixed / Indifferent.

Zero-Shot vs. Few-Shot

Think of it like training a new intern. If you ask them to “write an email,” they’ll probably do a decent job (Zero-Shot). If you show them three emails you’ve written in the past so they can mimic your tone and formatting, they’ll do a great job (Few-Shot).

| Feature | Zero-Shot | Few-Shot |
| --- | --- | --- |
| Effort | Low (quick to write) | Medium (requires finding examples) |
| Accuracy | Good for general tasks | High for specialized/tricky tasks |
| Consistency | Can vary | Very high (follows the pattern) |
| Token Usage | Low (saves money/space) | Higher (examples take up space) |
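The few-shot pattern from the sentiment example can be assembled programmatically. `build_few_shot_prompt` is a hypothetical helper, not a library function.

```python
# Sketch of assembling a few-shot prompt from (text, label) example pairs.

def build_few_shot_prompt(examples, query):
    """Format example pairs into the pattern the model should mimic."""
    lines = [f"Text: {text} -> Sentiment: {label}" for text, label in examples]
    lines.append(f"Text: {query} -> Sentiment:")  # leave the answer blank
    return "\n".join(lines)

examples = [
    ("The movie was a masterpiece.", "Very Happy"),
    ("I hated every second of it.", "Very Sad"),
    ("It was fine, I guess.", "Indifferent"),
]

prompt = build_few_shot_prompt(
    examples, "The plot was confusing, but the acting was decent."
)
print(prompt)
```

Keeping the examples in one place also makes it easy to swap them out when the pattern needs to change.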

Directional Stimulus Prompting

Think of Directional Stimulus Prompting (DSP) as giving the AI a “nudge” or a set of keywords to ensure it hits specific points without you having to write the entire response yourself.

While standard prompting tells the AI what to do, DSP provides a hint or a stimulus (like a list of keywords or a specific focus) to guide the reasoning process. It acts as a bridge between the input and the desired output, ensuring the model doesn’t wander off-topic.

Example 1: Summarizing a Long Article

If you just ask for a summary, the AI might focus on the wrong details. With DSP, you provide “directional” keywords.

Prompt:

Summarize the article above. Focus specifically on these points: Battery technology, environmental impact, and 2026 market trends.

Directional Stimulus (Keywords): Battery technology, environmental impact, 2026 market trends.

Example 2: Creative Writing (Tone & Plot)

Instead of just saying “Write a story,” you provide the “beats” you want the AI to follow.

Prompt:

Write a detective story. Ensure the following elements are included: it must be a rainy night, the mystery involves a missing cat, and there must be a plot twist involving a neighbor.

Directional Stimulus: Rainy night, a missing cat, a plot twist involving a neighbor.
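A DSP prompt is easy to template: the task plus a keyword “hint” appended as the stimulus. The `dsp_prompt` helper and its keywords are illustrative.

```python
# Sketch of a Directional Stimulus prompt: task + keyword hint.

def dsp_prompt(task, stimulus_keywords):
    """Append a directional stimulus (keywords) to a task instruction."""
    hint = ", ".join(stimulus_keywords)
    return f"{task}\nHint (focus on): {hint}"

prompt = dsp_prompt(
    "Summarize the article above.",
    ["battery technology", "environmental impact", "2026 market trends"],
)
print(prompt)
```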

Chain-of-Thought

The chain-of-thought technique explicitly tells the LLM how to think step by step. It’s the AI equivalent of “showing your work” on a math test. Instead of asking the model for a direct answer, you encourage it to generate a series of intermediate reasoning steps that lead to the final result.

This technique is a game-changer for tasks involving logic, arithmetic, or multi-step decision-making, where a “gut reaction” from the AI often leads to hallucinations.

1. Zero-Shot CoT (The “Magic Phrase”)

You don’t even need to provide examples. Research has shown that simply adding a phrase like “Let’s think step by step” to the end of your prompt drastically improves performance on logic tasks.

It forces the model to allocate more “compute” (tokens) to the reasoning process before it commits to an answer.

2. Few-Shot CoT (Manual Guidance)

You provide one or two examples where you manually write out the reasoning process. This teaches the model how you want it to think.

Chain-of-Thought Example Prompt

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:

Output:

The cafeteria had 23 apples. They used 20, so they have 23 - 20 = 3. They bought 6 more, so they have 3 + 6 = 9. The answer is 9.

When to Use Chain-of-Thought

| Use CoT For… | Avoid CoT For… |
| --- | --- |
| Simple math word problems | Simple fact retrieval (e.g., “Who is the President?”) |
| Commonsense reasoning | Creative writing or brainstorming |
| Complex coding logic | Translation tasks |
| Multi-step classification | Sentiment analysis |

Intermediary Prompt Engineering Techniques

While basic techniques focus on how to format a request, intermediary techniques focus on how to guide the AI’s internal reasoning process. At this level, you aren’t just giving the model a persona and a task; you are designing a cognitive workflow.

These methods move away from “one-shot” interactions and toward recursive loops, branching logic, and external data integration.

Self-Consistency

Instead of a linear path, Self-Consistency creates a fork in the road. You provide a prompt (usually using Few-Shot CoT) and ask the model to generate multiple independent responses (e.g., 3 or 5 different versions).

Since LLMs are non-deterministic, they will take slightly different logical paths each time. You look at all the final answers. The answer that appears most often is the Consensus Answer.

If you really need a high-stakes answer to be correct, you can combine CoT with Self-Consistency. You ask the model the same CoT prompt multiple times (or generate multiple paths) and take the “majority vote” as the final answer. If the model reaches a single answer in four out of five reasoning paths, that’s your winner.
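The majority-vote step can be sketched like this; `sample_answer` is a stub that simulates a non-deterministic model with a fixed answer sequence.

```python
from collections import Counter
from itertools import cycle

# Self-consistency sketch: generate several reasoning paths and take the
# majority vote on the final answer.

_fake_paths = cycle(["9", "9", "11", "9", "9"])  # simulated model outputs

def sample_answer(prompt):
    """Stub: returns the final answer of one simulated reasoning path."""
    return next(_fake_paths)

def self_consistency(prompt, n_samples=5):
    """Sample n reasoning paths and return the most common final answer."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("23 apples, use 20, buy 6 more. How many now?"))  # '9'
```

With a real model, each sample would be a separate CoT call at a nonzero temperature so the reasoning paths actually diverge.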

Tree of Thoughts

In ToT, the model breaks a problem into “thought units” (intermediate steps). For each step, it generates several potential options and then acts as its own judge to decide which branch is the most promising. 

If Chain-of-Thought is a straight line, and Self-Consistency is a series of parallel lines, then Tree of Thoughts (ToT) is a branching map. It allows AI to look ahead, backtrack, and evaluate its own “thoughts” as it works toward a solution. Instead of just picking the next most likely word, the model treats the problem like a search tree.

Example of Tree of Thoughts

I want you to imagine 3 personas:

{Essayist}: the person who writes the article and will try to generate the article based on feedback. They understand a lot about mobiles, Android, iPhones, and smartphones in general.
{Prospect}: the person who judges and gives their feedback. This is someone who has never read anything about smartphones, has many doubts, and their head is cluttered due to the excess of information on the internet.
{Manager}: this person will understand what the {Essayist} wrote and suggest improvements based on the {Prospect}'s feedback.
Now here are the instructions:

<loop>
Step 1: Take the article I will send and add more suggestions from the {Essayist}. 
Step 2: Now the {Prospect} will give their opinion and evaluate the content. 
Step 3: At this point, the {Manager} will gather the feedback and pass it on to the {Essayist}, so that they can offer new articles. 
Step 4: Repeat the cycle until the {Prospect} has understood the basics about the article. 
Step 5: The cycle repeats until the {Prospect} fully approves the article or 10 cycles have passed. 
ATTENTION: The article should contain relevant citations wrapped in <quote></quote> tags, then a summary wrapped in <Summary></Summary> tags.
</loop>

Confirm and wait for instructions to receive the initial Article.

Why is ToT so powerful?

It introduces deliberate reasoning. Most LLMs are reactive; they produce the next token instantly. ToT forces the model to reason slowly and analytically, evaluating alternatives before committing to one. It is particularly effective for tasks where the first logical step isn’t necessarily the best one.

Implementing a full Tree of Thoughts often requires a bit of external scripting or a very structured XML prompt.

Skeleton-of-Thought

In standard generation, an AI writes a long essay from start to finish. If the essay is 1,000 words, you have to wait for all 1,000 words to be generated in order. Skeleton-of-Thought (SoT) breaks this into two distinct stages:

1. The Skeleton Stage (Planning)

The model is asked to provide a concise “skeleton” or outline of the answer. It identifies the 5–10 main points it needs to cover. In fact, this is what happens when you ask ChatGPT or Gemini to perform in Deep Research mode.

2. The Point-Expanding Stage (Execution)

This is where the magic happens. In a programmatic environment, the AI (or multiple instances of it) expands each point of the skeleton simultaneously. Instead of waiting for Point 1 to finish so it can start Point 2, it writes all points at the exact same time.

Why Use Skeleton-of-Thought?

For very long outputs (like book chapters or technical manuals), SoT can reduce waiting time by up to 80%. Moreover, since the model follows a pre-defined skeleton, the final output is often better organized.

Finally, it allows the model to focus on expanding one specific sub-topic at a time without getting distracted by the 500 words it just wrote previously.

To get the full speed benefit of SoT, you usually need an API-based workflow or an agentic framework (like LangChain) that can trigger multiple calls at once.
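Here is a minimal sketch of that two-stage flow using Python threads; `expand_point` is a stub where a real per-point model call would go.

```python
from concurrent.futures import ThreadPoolExecutor

# Skeleton-of-Thought sketch: take an outline, then expand every point in
# parallel. With a real API, each call runs concurrently, which is where
# the speedup comes from.

skeleton = [
    "Define prompt engineering",
    "Explain context windows",
    "Show a worked example",
]

def expand_point(point):
    """Stub expansion: a real implementation would call the model here."""
    return f"{point}: (expanded paragraph...)"

def skeleton_of_thought(points):
    """Expand all skeleton points concurrently, preserving outline order."""
    with ThreadPoolExecutor(max_workers=len(points)) as pool:
        return list(pool.map(expand_point, points))

for paragraph in skeleton_of_thought(skeleton):
    print(paragraph)
```

`pool.map` keeps the results in outline order even though the expansions run concurrently, so the final document still follows the skeleton.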

Maieutic Prompting

Maieutics is a method of dialogue used by Socrates in ancient Greece. It uses structured questions to help the interlocutor give birth to the latent knowledge in their own mind. In prompt engineering, it’s a recursive verification strategy.

Instead of just asking for an answer, you force the model to generate a tree of explanations for both “True” and “False” possibilities, then check those explanations for logical contradictions.

How Maieutic Prompting Works

While Chain-of-Thought moves forward, Maieutic Prompting moves backward to verify the foundation of a thought. Consider this scenario: fact-checking a complex claim.

User Question:

Did the Roman Empire have access to silk during the reign of Augustus?

Prompt:

Step 1: Assume the answer is YES. What are the historical reasons or trade routes that would support this?

AI Answer:

The Silk Road was active; records show silk in Rome by the 1st century BC.

Prompt:

Now assume the answer is NO. What are the historical barriers that would have prevented this?

AI Answer:

Formal diplomatic ties with the Han Dynasty hadn't been established yet; silk was a rare luxury, not a standard trade good.

Prompt:

Verify the sub-claim: Was the Silk Road active in 27 BC?

AI Answer:

Yes, the Silk Road began around 130 BC.

Conclusion: The “YES” path has more consistent historical support than the “NO” path.

Why Is Maieutic Prompting Essential?

Maieutic prompting is a hallucination killer. AI models often sound very confident even when they are wrong. Maieutic prompting forces the model to interrogate itself. If a model makes up a fact to support an answer, it will often fail to support that made-up fact when prompted again in the next recursive step.
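The verification loop can be sketched as branch scoring. The claims, sub-claims, and the `verify` lookup below are illustrative stand-ins for recursive model calls.

```python
# Maieutic-style verification sketch: collect explanations for both the
# "True" and "False" branches, verify each branch's sub-claims, and keep
# the branch whose support checks out.

branches = {
    "True": ["The Silk Road was active by 27 BC",
             "Silk appears in Roman records in the 1st century BC"],
    "False": ["No trade routes reached Rome from Han China"],
}

known_facts = {  # stand-in knowledge base for the verifier
    "The Silk Road was active by 27 BC": True,
    "Silk appears in Roman records in the 1st century BC": True,
    "No trade routes reached Rome from Han China": False,
}

def verify(sub_claim):
    """Stub verifier: a real system would re-prompt the model recursively."""
    return known_facts.get(sub_claim, False)

def maieutic_answer(branches):
    """Score each branch by how many of its sub-claims survive verification."""
    scores = {label: sum(verify(c) for c in claims) / len(claims)
              for label, claims in branches.items()}
    return max(scores, key=scores.get)

print(maieutic_answer(branches))  # 'True'
```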

Retrieval-Augmented Generation

In 2026, Retrieval-Augmented Generation (RAG) has become the industry standard for prompt engineering. It prevents the model from relying solely on its static training data (which has a cutoff date) and instead allows it to “look up” information in real time before answering.

For a plain LLM, producing an output is usually like answering a history exam from memory. If you forget a date, you have to guess and hope to get it right by luck (a hallucination).

With a RAG-Enabled LLM, it’s like answering a history exam while sitting in a library. Before answering, you go to the shelf, find the right textbook, read the exact page, and then write your answer.

How RAG Works

The process happens in four lightning-fast steps behind the scenes:

  1. The User Query: You ask a question (e.g., “What was our Q3 revenue in 2025?”).
  2. Retrieval: The system searches a private database (usually a Vector Database) for the most relevant documents or “chunks” of text related to your question.
  3. Augmentation: The system takes those retrieved “facts” and stuffs them into the prompt along with your original question.
  4. Generation: The AI reads the provided context and generates a response that is grounded in that data.
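The four steps can be sketched with a toy retriever. Real systems use embeddings and a vector database; here, naive word overlap and made-up documents stand in.

```python
# Minimal RAG sketch: retrieve the most relevant chunk, then stuff it
# into the prompt alongside the question.

documents = [
    "Q3 2025 revenue was $4.2M, up 12% year over year.",
    "The company cafeteria menu changes every Monday.",
    "Employee onboarding takes two weeks on average.",
]

def retrieve(query, docs, k=1):
    """Rank documents by naive word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def augmented_prompt(query):
    """Augmentation step: prepend the retrieved facts as context."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(augmented_prompt("What was our Q3 revenue in 2025?"))
```

Grounding the answer in retrieved context (and instructing the model to use only that context) is what makes the final generation verifiable.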

Gems and GPTs

Gems and GPTs are essentially pre-packaged RAG. When you create a Google Gem or an OpenAI Custom GPT and upload a PDF or connect your Google Drive, the system isn’t “re-training” the AI on your data. Instead, it is automatically setting up a RAG pipeline for you behind the scenes.

| Feature | Google Gems | Custom GPTs | Pro/Custom RAG |
| --- | --- | --- | --- |
| Ease of Use | Highest. Just click and upload. | High. Friendly builder UI. | Low. Requires coding/APIs. |
| Data Source | Lives in Google Workspace (Docs, Gmail, Drive). | Uploaded files (up to 20) or Actions (APIs). | Anything (SQL, Notion, websites, etc.). |
| Best For | Internal team tasks & Google users. | Public-facing bots & personal tools. | Enterprise-scale apps with millions of docs. |
| Knowledge Sync | Real-time. If you edit the Google Doc, the Gem sees it. | Static. You usually have to re-upload files to update knowledge. | Dynamic. Syncs with your live databases. |

Which one are you actually building?

You’re creating a Gem/GPT if you want a private “expert” that knows your specific brand voice, your project notes, or your personal schedule. It’s like a custom-tuned assistant.

You’re building RAG if you are a developer creating a separate app (like a legal search engine or a medical database) that needs to accurately scan through thousands of documents and provide citations.

ReAct

ReAct (short for Reasoning and Acting) is a prompting framework developed by Yao et al. It enables Large Language Models (LLMs) to solve complex tasks by interleaving logical reasoning traces with task-specific actions (through APIs and searches).

Before ReAct, models typically used Chain-of-Thought (reasoning only) or Action-only (interacting with an environment). ReAct combines these two to create a more robust “thinking-doing” loop.

The framework is less about how to write a prompt in a chatbot window and more about giving the AI access to APIs for agentic autonomy.

How ReAct Works

In a ReAct prompt, the model follows a recursive cycle. In fact, it’s how LLMs search for information in Deep Research mode.

  1. Thought: The model writes down its internal reasoning about the current state (“I need to find the population of Paris to answer the user’s question”).
  2. Action: The model chooses a specific tool or action to perform (“Search[Paris population]”).
  3. Observation: The system provides the result of that action (e.g., “Paris has 2.1 million people”).
  4. Repeat: The model uses the new observation to form a new Thought, continuing until the task is complete.
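The Thought → Action → Observation cycle can be sketched with stubs for both the model and the search tool; a real agent would call an LLM and live APIs.

```python
# ReAct loop sketch: alternate Thought -> Action -> Observation until the
# model emits a final answer.

def search(query):
    """Stub tool: canned lookup results."""
    return {"Paris population": "Paris has about 2.1 million people."}.get(
        query, "No results.")

def fake_model(transcript):
    """Stub model: decides the next step from what it has seen so far."""
    if "Observation:" not in transcript:
        return "Thought: I need the population of Paris.\nAction: Search[Paris population]"
    return "Thought: I have the answer.\nFinal Answer: About 2.1 million people."

def react_loop(question, max_steps=5):
    """Run the Thought/Action/Observation cycle until a final answer."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_model(transcript)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        if "Action: Search[" in step:
            query = step.split("Action: Search[")[1].rstrip("]")
            transcript += f"\nObservation: {search(query)}"
    return "Gave up."

print(react_loop("What is the population of Paris?"))
```

Note the `max_steps` cap: real agent loops always need an iteration limit so a confused model cannot spin forever.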

Advanced Prompt Engineering: Designing an Agentic System

Designing an agentic system involves creating autonomous AI agents that plan, use tools, and reflect to achieve complex goals through an iterative loop of reasoning and action.

The architecture generally includes an agent layer (LLM), memory, and specialized tools, starting with simple prompt-based agents before scaling to multi-agent, collaborative structures. 

Core Agentic System Components

The central LLM, also called Brain, decomposes tasks, generates plans, and makes decisions, often using techniques like Chain-of-Thought (CoT) to improve reasoning.

ReAct is the basis of agentic systems in prompt engineering: the system exposes APIs, search engines, code interpreters, or database connectors that let the agent interact with the real world (e.g., searching Airbnb in a travel planner).

An agentic system also has memory. Short-term memory holds the context of the current workflow: user inputs and intermediate agent outputs.

For long-term memory (and RAG), the system uses vector databases (e.g., Pinecone, FAISS) or graph databases to store user preferences and history.

Finally, the agentic system has an environment where it can act (ReAct). In this environment, actions are executed and feedback is received, allowing self-reflection and adjustment.

Key Agentic Design Patterns

  • Single-Agent Reflection: The agent works on a task, evaluates its own output, and iterates to improve quality.
  • LLM-as-a-Judge: A separate instance (another LLM) evaluates the agent’s output before approving the result.
  • Sequential Agents (Pipeline): A structured approach where the output of one specialized agent becomes the input for the next. Best for consistent, repeatable tasks.
  • Parallel Agents: Multiple agents work on the same task independently (e.g., generating code solutions), and an evaluator agent (LLM-as-a-Judge) selects the best outcome.
  • Multi-Agent Orchestration: A “Manager” agent breaks down the problem and delegates subtasks to specialized “Worker” agents (e.g., a Coding Agent + Reviewer Agent). 
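Two of the patterns above, the sequential pipeline and LLM-as-a-Judge, compose naturally. The sketch below shows the shape of that composition; `draft_agent`, `review_agent`, and `judge` are hypothetical stubs that would each be an LLM call in production.

```python
def draft_agent(task: str) -> str:
    """Agent 1: produces a first draft (stub for an LLM call)."""
    return f"draft: {task}"

def review_agent(draft: str) -> str:
    """Agent 2: consumes agent 1's output and refines it."""
    return draft.replace("draft", "reviewed draft")

def judge(output: str) -> bool:
    """A separate model instance approves or rejects the output."""
    return "reviewed" in output  # stand-in for a real quality rubric

def pipeline(task: str) -> str:
    output = draft_agent(task)     # sequential: output feeds forward
    output = review_agent(output)
    if not judge(output):          # judge gates the final result
        raise ValueError("judge rejected the output")
    return output

print(pipeline("summarize Q3 earnings"))
```

The key design choice is that the judge is a *different* instance than the workers: a model grading its own output tends to be more lenient than one grading a peer's.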

Steps to Build an Agentic System

  1. Define the Scope: Start with a narrow, high-value problem.
  2. Develop Tools: Build necessary API integrations (e.g., search tools, databases) and create descriptive documentation to help the agent understand how to use them.
  3. Implement Agentic Loops: Build a “plan-act-reflect” loop (e.g., using ReAct, short for Reasoning and Acting).
  4. Add Evaluation (LLM-as-a-judge): Use a stronger LLM to evaluate your agentic system’s performance, monitoring for loops and accuracy.
  5. Build in Guardrails: Set iteration limits, human-in-the-loop approvals, and cost monitors to ensure safe operations. 
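Steps 3 and 5 above fit together in one control loop. This is a minimal sketch under stated assumptions: `act` and `reflect` are stubs for tool execution and an LLM self-critique call, and the budget numbers are illustrative.

```python
MAX_ITERATIONS = 5   # guardrail: iteration limit
COST_BUDGET = 10.0   # guardrail: e.g., dollars of API spend

def act(plan: str) -> str:
    """Stand-in for executing a tool against the plan."""
    return plan.upper()

def reflect(result: str) -> bool:
    """Stand-in for the self-critique call: 'is this good enough?'"""
    return result.isupper()

def run_agent(plan: str, cost_per_step: float = 1.0):
    spent = 0.0
    for i in range(MAX_ITERATIONS):
        spent += cost_per_step
        if spent > COST_BUDGET:
            raise RuntimeError("budget exceeded, escalate to a human")
        result = act(plan)       # act
        if reflect(result):      # reflect: accept or iterate
            return result, i + 1
    raise RuntimeError("no acceptable result, escalate to a human")

result, steps_used = run_agent("book the flight")
print(result, steps_used)
```

Note that both failure paths escalate rather than silently returning a bad answer; that is the human-in-the-loop guardrail from step 5.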

Key Technologies for Agentic Systems

LangChain provides the modular building blocks for chains and tool integration, while AutoGPT and Microsoft AutoGen specialize in multi-agent orchestration and autonomous task decomposition.

The OpenAI Assistants API simplifies development by handling persistent threads, built-in retrieval, and code execution.

Temporal acts as the orchestration engine, ensuring long-running agent workflows are durable, stateful, and resilient to failures.

Together, these technologies transform static prompts into dynamic, self-correcting loops capable of complex problem-solving.

Prompt Engineering Case Study: Nubank

Nubank’s transformation into an “AI-first” bank is a benchmark in the industry, specifically because it moved past the “chatbot” phase (which often just deflects users) to an agentic orchestration phase (which actually solves problems).

By late 2024 and throughout 2025, Nubank integrated GPT-4o into their core operations to act as a bridge between natural language and their complex microservices architecture.

1. The “Orchestrator” vs. The Chatbot

The traditional chatbot model uses a Decision Tree (if the user says X, show Y). Nubank’s new approach uses the LLM as a Reasoning Engine.

  • Intent Mapping: When a user says, “I don’t recognize this charge on my card,” GPT-4o doesn’t just provide a link to a FAQ. It identifies the intent (“Dispute Transaction”), calls a backend function to list recent transactions, asks the user to pick one, and then triggers the “Dispute” workflow.
  • Contextual Memory: The system remembers that you were asking about a refund three days ago and can connect new questions to that ongoing context.
  • Multimodal Fraud Detection: Using GPT-4o Vision, the orchestrator can “see” a photo of a document or a screenshot of a suspicious transaction, cross-referencing visual data with transaction logs in real-time.
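The intent-mapping flow described above is essentially function calling: the model maps an utterance to a backend function plus arguments. The sketch below shows the shape of such a router; the keyword matcher stands in for the LLM's structured output, and the tools are hypothetical, not Nubank's actual API.

```python
def list_recent_transactions(account_id: str):
    """Hypothetical backend call; returns dummy data here."""
    return [{"id": "tx1", "amount": 42.0}, {"id": "tx2", "amount": 9.9}]

def open_dispute(account_id: str, tx_id: str):
    """Hypothetical workflow trigger, fired once the user picks a charge."""
    return {"status": "dispute_opened", "tx": tx_id}

def detect_intent(utterance: str) -> str:
    """Stand-in for the LLM's intent classification."""
    text = utterance.lower()
    if "charge" in text or "recognize" in text:
        return "Dispute Transaction"
    return "Fallback"

def handle(utterance: str, account_id: str):
    intent = detect_intent(utterance)
    if intent == "Dispute Transaction":
        txs = list_recent_transactions(account_id)  # backend call, not a FAQ link
        return {"intent": intent, "choices": [t["id"] for t in txs]}
    return {"intent": intent}

print(handle("I don't recognize this charge on my card", "acct-1"))
```

The difference from a decision tree is that `detect_intent` is a reasoning step, not a keyword lookup: the same router handles phrasings the designers never enumerated.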

2. Tangible Results (The “Nu” Numbers)

The implementation of GPT-4o and GPT-4o mini across their stack has led to dramatic efficiency gains, according to OpenAI.

| Metric | Pre-Orchestration | Post-Orchestration (GPT-4o) |
| --- | --- | --- |
| Tier 1 Resolution | Low % (deflection) | 55% resolved without a human |
| Response Time | Standard queue times | 70% reduction in chat response time |
| Complex Ops (e.g., Pix) | Multiple manual screens | 60% faster via AI-assisted flows |
| Efficiency Scale | 1 human per X chats | 2.3x faster query resolution |

3. Key Components of Their Case

Instead of replacing humans entirely, Nubank uses GPT-4o to listen to (or read) interactions and provide agents with real-time recommendations, conversation summaries, and one-click actions. This ensures that even when a human is involved, they are supercharged by the AI.

Moreover, they built a massive internal search tool (RAG) used by over 5,000 employees. It pulls from fragmented internal policies and codebases to give developers and support staff instant, accurate answers.

To make this work, Nubank acquired Hyperplane (an AI startup) to create “foundational models” specific to banking data, ensuring the LLM understands financial nuances like credit risk and fraud patterns better than a generic model.

4. The Results

Nubank’s CEO, David Vélez, has stated that the goal is to provide a “Private Banker in your pocket.” Most people can’t afford a personal financial advisor, but an LLM orchestrator can look at your spending, suggest a better savings plan, and then, with your permission, actually move the money to that plan.

Integration with Hyperplane’s infrastructure has enabled agents to perform real-time data analysis. Nu Holdings’ fourth-quarter financial statements (2026) indicate that this operational efficiency helped maintain the average cost to serve per customer at $0.80, even as the Monthly Average Revenue per Active Customer (ARPAC) rose to $15.

This was possible because the agentic system began offering personalized financial products at the exact moment of the customer’s need, transforming support into a value-generating channel.

Conclusion

Despite being a relatively young field, prompt engineering is the key to maximizing the capabilities of AI models, particularly large language models (LLMs) and agentic systems. Overall, the impact of innovations in generative AI is profound, and we’re just at the beginning.

Effective communication and human-like interactions are also imperative as these models grow increasingly ingrained in our everyday lives. That’s why companies that want to harness the power of generative AI and AI-powered tools are hiring prompt engineers with the right skillset.

This role is essential for companies building AI-driven products or planning to incorporate AI tools into their internal processes. At DistantJob, we’re the leading hiring marketplace for dedicated Prompt Engineers who can handle critical engineering development projects and more. 

Our skilled AI prompt engineers are vetted to fit your project needs and leverage advanced AI technologies such as NLP and tools like OpenAI, Midjourney, N8N, Zapier, Make, and DALL-E. Contact us today to hire prompt engineers for your successful AI project implementation!

Frequently Asked Questions About Prompt Engineering

What is the difference between zero-shot and few-shot prompting?

Zero-shot prompting asks the model to complete a task with no examples provided, relying entirely on its pre-trained knowledge. Few-shot prompting provides 2–5 examples of the desired input-output pattern before the actual request, conditioning the model to follow a specific format, tone, or reasoning style. Few-shot produces more consistent and specialized outputs but uses more tokens.
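The difference is easiest to see in how the prompt string is assembled. A minimal sketch, with an illustrative sentiment-classification task:

```python
# Three illustrative examples that set the pattern for few-shot prompting.
EXAMPLES = [
    ("I love this product!", "positive"),
    ("Terrible experience, never again.", "negative"),
    ("It arrived on time, nothing special.", "neutral"),
]

def zero_shot(text: str) -> str:
    """No examples: the model relies on pre-trained knowledge alone."""
    return f"Classify the sentiment of this review:\n{text}\nSentiment:"

def few_shot(text: str) -> str:
    """Prepend labeled examples so the model continues the pattern."""
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in EXAMPLES)
    return f"{shots}\nReview: {text}\nSentiment:"

print(few_shot("Great support team, slow shipping."))
```

Ending the prompt with the bare `Sentiment:` label is the trick: the model's most natural continuation is a single label in the same format as the examples.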

What is chain-of-thought prompting, and when should I use it?

Chain-of-thought prompting instructs the model to generate intermediate reasoning steps before producing a final answer. It is most effective for math problems, logical reasoning, multi-step decision-making, and complex classification tasks. The simplest implementation is adding ‘Let’s think step by step’ to the end of your prompt. This phrase alone significantly improves performance on logic tasks.
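The zero-shot variant really is a one-line transformation of the prompt, which a small helper makes explicit (the question below is just a classic illustrative puzzle):

```python
def with_cot(prompt: str) -> str:
    """Append the zero-shot chain-of-thought trigger to any prompt."""
    return f"{prompt}\n\nLet's think step by step."

question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
print(with_cot(question))
```

Without the trigger, models often blurt the intuitive wrong answer ($0.10); with it, the intermediate reasoning usually surfaces the correct one ($0.05).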

How do I reduce hallucinations in AI outputs?

The most effective techniques for reducing hallucinations are: Retrieval-Augmented Generation (RAG), which grounds responses in real documents; Maieutic Prompting, which forces the model to verify its own claims through true/false interrogation; Self-Consistency, which runs the same prompt multiple times and takes the majority answer; and Chain-of-Thought, which forces the model to reason before committing to an answer. Low temperature settings (0.1–0.3) also reduce hallucination risk for factual tasks.
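Self-consistency in particular reduces to a majority vote over repeated samples. A minimal sketch, where `sample_answer` is a deterministic stub standing in for repeated LLM calls at a nonzero temperature:

```python
from collections import Counter

def sample_answer(seed: int) -> str:
    """Stand-in for one stochastic LLM completion; one path in
    three lands on a wrong answer, mimicking reasoning errors."""
    return "42" if seed % 3 != 0 else "41"

def self_consistent_answer(n_samples: int = 9) -> str:
    """Sample the same prompt n times and take the majority answer."""
    votes = Counter(sample_answer(i) for i in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistent_answer())
```

Even though a third of the individual samples are wrong here, the majority vote returns the correct answer; that error-averaging is the whole point of the technique.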

What is RAG in AI, and how is it different from fine-tuning?

RAG (Retrieval-Augmented Generation) connects an LLM to an external knowledge base at inference time. The model retrieves relevant documents and uses them to generate its response. Fine-tuning permanently updates the model’s weights using a training dataset. RAG is faster, cheaper, and easier to update when your data changes. Fine-tuning is better when you need the model to adopt a specific style, tone, or domain-specific behavior that cannot be communicated through a prompt alone.

What is the best prompt engineering framework?

No single framework is universally best; it depends on the task. P.R.O.M.P.T. (Purpose, Role, Output, Markers, Patterns, Tone) is the most comprehensive for general use. CRAFT (Context, Role, Action, Format, Target) is faster to apply for shorter tasks. For complex reasoning tasks, Chain-of-Thought, combined with a few-shot examples, outperforms any fixed framework. The real answer is to match the framework to the complexity of the task.

What is maieutic prompting?

Maieutic prompting is a recursive verification technique that forces the model to generate and cross-examine both ‘true’ and ‘false’ explanations for a claim. Named after the Socratic method, it works by having the model argue both sides of a question and then verify the sub-claims within each argument. Because invented facts (hallucinations) typically cannot survive a second round of interrogation, this technique is highly effective at exposing errors that a single linear response would never catch.
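The structure of that cross-examination can be sketched in a few lines. This is a skeleton only: `explain` and `verify` are stubs where a real implementation would issue further LLM calls and recurse into sub-claims.

```python
def explain(claim: str, assume_true: bool) -> str:
    """Stand-in for asking the model to argue one side of the claim."""
    side = "true" if assume_true else "false"
    return f"if '{claim}' is {side}, then supporting-fact-{side}"

def verify(explanation: str) -> bool:
    """Stand-in for checking an explanation's sub-claims; a real
    implementation re-queries the model recursively."""
    return "true" in explanation

def maieutic_check(claim: str) -> bool:
    pro = explain(claim, assume_true=True)    # argue the claim is true
    con = explain(claim, assume_true=False)   # argue the claim is false
    # Accept the claim only if the 'true' branch survives verification
    # and the 'false' branch does not.
    return verify(pro) and not verify(con)

print(maieutic_check("Paris is the capital of France"))
```

Hallucinated claims tend to fail this gate because the fabricated "supporting facts" in one branch contradict the other branch under verification.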

What is Tree of Thoughts prompting?

Tree of Thoughts (ToT) prompting is a technique where the model generates multiple candidate reasoning steps at each stage of a problem, evaluates which paths are most promising, and backtracks from dead ends, rather than committing to a single linear chain of thought. It is particularly effective for complex problems where the most obvious first step is not necessarily the correct one. Implementing ToT typically requires either a highly structured prompt or an external agentic framework.
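The branch-score-prune loop at the core of ToT looks like a beam search. A minimal sketch over a toy search space; `propose` and `score` are stubs where real ToT prompts the model both to generate candidate steps and to evaluate them.

```python
def propose(state: str):
    """Generate candidate next steps from a partial solution (stub)."""
    return [state + c for c in "ab"]

def score(state: str) -> int:
    """Heuristic value of a partial solution (stub): prefer 'a' steps."""
    return state.count("a")

def tree_of_thoughts(depth: int = 3, beam: int = 2) -> str:
    frontier = [""]
    for _ in range(depth):
        # branch: expand every frontier state into candidates
        candidates = [s for state in frontier for s in propose(state)]
        # evaluate and prune: keep only the most promising branches,
        # implicitly backtracking away from dead ends
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

print(tree_of_thoughts())
```

Because weaker branches are discarded at every level rather than followed to the end, the search explores alternatives a single linear chain of thought would never consider.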

How is Skeleton-of-Thought different from standard prompting?

In standard prompting, the model generates its response sequentially from start to finish. Skeleton-of-Thought splits generation into two stages: first, the model produces a concise outline; second, each point is expanded, ideally in parallel using multiple API calls. This reduces generation time for long outputs by up to 80% and produces better-organized results because the model follows a pre-defined structure rather than improvising.
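The two-stage shape maps directly onto code. In this sketch, `outline` and `expand` are stubs for the two kinds of LLM calls, and a thread pool mimics issuing the expansion calls in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def outline(topic: str):
    """Stage 1: a concise skeleton instead of a full answer (stub)."""
    return [f"{topic}: definition", f"{topic}: example", f"{topic}: pitfalls"]

def expand(point: str) -> str:
    """Stage 2: each skeleton point expanded independently (stub)."""
    return f"[expanded] {point}"

def skeleton_of_thought(topic: str):
    points = outline(topic)
    # expansion calls are independent, so they can run concurrently;
    # pool.map preserves the skeleton's original order
    with ThreadPoolExecutor() as pool:
        return list(pool.map(expand, points))

for section in skeleton_of_thought("caching"):
    print(section)
```

The speedup comes from the fact that stage 2 is embarrassingly parallel: each point depends only on the skeleton, never on a sibling's expansion.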

Cesar Fazio

César is a digital marketing strategist and business growth consultant with experience in copywriting. Self-taught and passionate about continuous learning, César works at the intersection of technology, business, and strategic communication. In recent years, he has expanded his expertise to product management and Python, incorporating software development and Scrum best practices into his repertoire. This combination of business acumen and technical prowess allows him to structure scalable digital products aligned with real market needs. Currently, he collaborates with DistantJob, providing insights on marketing, branding, and digital transformation, always with a pragmatic, ethical, and results-oriented approach, far from vanity metrics and focused on measurable performance.

