The 1% Skill Nobody Talks About: Why Evals Are Everything
Part 3: When Systematic Evaluation Became My Superpower
(Part 1 of this series is here & Part 2 of this series is here)
My first round of AI-generated emails scored in the low 20s out of 30.
Generic. Website-copy-like. Nothing like me. The AI kept starting emails with “Hi there!” (which I NEVER do). The tone was professional but sterile. The brand voice? Barely recognizable.
My second round? Consistently 28-29 out of 30.
The difference wasn’t better prompts. It wasn’t a different AI model. It wasn’t magic.
It was something far more powerful, far more systematic, and far more valuable than anything else I learned in Sara and Tyler’s entire course:
Evaluation. Rigorous, detailed, human-in-the-loop evaluation.
Week 3 taught me what Sara and Tyler call “the 1% skill”—the capability that separates people who dabble with AI from people who can actually build reliable AI systems that work at scale.
And here’s what nobody tells you: This skill isn’t about AI at all. It’s about developing the systematic discipline to measure, analyze, and iteratively improve outputs until they meet your actual standards—not just “good enough” standards.
In high-stakes government and defense applications where I work? “Good enough” isn’t good enough. Lives depend on AI reliability. Trust calibration matters. Consistency is everything.
Week 3 showed me how to bridge the gap between AI capability and AI reliability. And the transformation was dramatic.
The Evaluation Framework
Week 3’s assignment was deceptively simple: Evaluate your AI email assistant across five different customer service scenarios using six criteria:
Tone Alignment (1-5): Does it match the appropriate emotional register?
Clarity (1-5): Is the message clear and easy to understand?
Relevance (1-5): Does it address what was actually asked?
Correctness (1-5): Is the information accurate?
Conciseness (1-5): Is it appropriately brief without losing necessary detail?
Brand Voice Match (1-5): Does it sound like ME?
Maximum score: 30 points.
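For anyone who thinks in code, here's roughly what that rubric looks like as a data structure, if you wanted to script the bookkeeping. The criterion names mirror the list above; the code itself is my own sketch, not something from the course.

```python
# A minimal sketch of the six-criteria rubric: six ratings from 1-5,
# summed into a total out of 30. Illustrative only.

CRITERIA = [
    "tone_alignment",
    "clarity",
    "relevance",
    "correctness",
    "conciseness",
    "brand_voice_match",
]

def score_email(ratings: dict) -> int:
    """Sum six 1-5 ratings into a total out of 30."""
    for criterion in CRITERIA:
        value = ratings[criterion]
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion} must be 1-5, got {value}")
    return sum(ratings[criterion] for criterion in CRITERIA)

# Hypothetical example: a middling first draft
draft_ratings = {
    "tone_alignment": 4, "clarity": 4, "relevance": 4,
    "correctness": 5, "conciseness": 3, "brand_voice_match": 2,
}
print(score_email(draft_ratings))  # 22
```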
The scenarios ranged from handling a frustrated customer with a delayed order to responding to a journalist asking about sustainability initiatives. Different contexts, different emotional registers, different communication needs.
And here’s the critical part: I had to evaluate TWICE. First round, document what’s wrong. Refine the system instructions. Second round, evaluate again and measure improvement.
This wasn’t just “does this sound okay?” This was systematic, quantifiable, iterative refinement with specific feedback on exactly what needed to change.
Round 1: The Brutal Truth
Let me show you what my first round looked like. Real scores. Real feedback. Real problems.
Scenario 1: Delayed Order (Frustrated Customer)
Score: 21/30
My feedback: “Pretty good for first try. I would just leave the subject pretty simple ‘Update on delayed order’ and not put the ‘And a Thank You from Us’ cause that is not appropriate. I also understand since this is a fake company, we don’t have an active live database… I know my brand voice is a little long so I would prefer it to be a bit more concise… I also do NOT start my emails with ‘Hi there!’ I usually would try to address the person directly. Or leave it just as ‘Hi’ but almost always put the person’s name in there but never ‘Hi there.’”
There it is. The “Hi there!” problem. That generic salutation that immediately signals “this was written by AI, not a human who actually cares.”
Scenario 2: Product Details Inquiry
Score: Low-to-mid 20s
My feedback: Identified that the response was too long, lacked my parenthetical expressions, and didn’t have the exclamation points after short emotional phrases that characterize my actual communication style.
Scenario 4: Subscription Service Inquiry
Score: 26/30
Brand Voice Match: 2/5
My feedback: “Again all good ‘correct’ information. I even like and appreciate the follow up questions at the end because that definitely is something we would do as customer service reps… Again consistent with all the other tests - very good response, but doesn’t really sound like me.”
See the pattern? The AI was getting the INFORMATION right. It was clear, relevant, mostly correct. But it wasn’t capturing ME. It was producing competent customer service emails that could have been written by anyone.
And that’s the gap that most people never close. Because “competent” feels good enough. It’s functional. It works.
But in Week 2, I’d spent hours analyzing my authentic voice—the parenthetical layering, the rhythmic variation, the connection-making, the cognitive transparency. I KNEW what my voice sounded like. And this wasn’t it.
The Night and Day Improvement
Here’s what happened after I refined my system instructions based on my detailed Round 1 feedback:
Round 2 - Scenario 1: Delayed Order
Score: 28/30 (up from 21)
Tone Alignment: 4/5 (appropriate empathy and urgency)
Clarity: 5/5 (crystal clear)
Relevance: 5/5 (directly addressed the concern)
Correctness: 5/5 (accurate information)
Conciseness: 5/5 (much tighter, action-oriented)
Brand Voice Match: 4/5 (NOW it sounds like me)
The AI response in Round 2: “Subject: Update on Your Order
Hi,
I’m so sorry your order is delayed - I know how frustrating that is, especially after waiting a full week.
Here’s what’s happening: I’m tracking your order right now and will send you specific delivery information within 24 hours. We’re also adding a 10% discount to your account (it’ll automatically apply at checkout) as a thank you for your patience.
If you need anything else or want to check on the status, you can reach me at support@greenhavenliving.com or call 1-800-123-4567 (Monday-Friday, 9 AM – 5 PM EST).
Thanks for sticking with us, [Signature] GreenHaven Living Support Team”
See the difference?
No more “Hi there!” Just “Hi” with implied personalization. Short, empathetic opening that acknowledges the frustration. Action-oriented language (“I’m tracking your order right now”). Specific timeline (“within 24 hours”). Clear next steps. Warm but professional closing.
It’s concise without being cold. Professional without being sterile. And it sounds like something I would actually write.
The improvement across all scenarios: From low-to-mid 20s to consistently 28-29 out of 30.
That’s not incremental improvement. That’s transformation.
What Actually Changed
Let me show you the specific refinements that made the difference, because the details matter:
1. The Salutation Fix
Before: “Hi there!” (generic, impersonal, AI-sounding)
After: “Hi,” or “Hi [Name],” (direct, personal, human)
Why it matters: This tiny detail immediately signals whether a human wrote this or an AI did. It’s the difference between “customer service template” and “person who cares.”
2. The Parenthetical Expression Integration
Before: Straightforward sentences with no asides
After: Occasional parenthetical context that shows thinking process
Why it matters: This is MY voice. This is how I actually communicate. The parenthetical asides aren’t decorative—they’re cognitive transparency.
3. The Exclamation Point Calibration
Before: Either too many or too few, inconsistently placed
After: Strategic placement after short emotional phrases for emphasis
Why it matters: You can HEAR my voice when the exclamation points are placed correctly. It creates an oral quality in the text.
4. The Conciseness Without Coldness Balance
Before: Long, thorough explanations that felt like website copy
After: Tight, action-oriented language that still maintains warmth
Why it matters: Customer service emails need to be efficient, but efficiency without humanity is just automation.
5. The Context-Appropriate Tone Shifts
Before: Same tone across all scenarios
After: Calibrated emotional register based on situation (empathy for frustrated customers, enthusiasm for product inquiries, professional warmth for journalist inquiries)
Why it matters: My voice ADAPTS across contexts while maintaining core authenticity. That’s sophisticated communication.
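For anyone wondering what those refinements actually look like once they're written down, here's a hypothetical excerpt of the kind of voice rules they became in my system instructions. I'm reconstructing the wording for illustration; it's not a verbatim copy of my prompt.

```python
# Hypothetical excerpt: the five refinements above, rewritten as explicit
# rules an email assistant's system instructions could carry.
# (Reconstructed for illustration, not my verbatim instructions.)

VOICE_RULES = """
1. Salutation: open with "Hi," or "Hi [Name]," - never "Hi there!".
2. Asides: use an occasional parenthetical aside to show the thinking
   behind a decision; asides are cognitive transparency, not decoration.
3. Emphasis: place exclamation points only after short emotional phrases.
4. Length: keep emails tight and action-oriented without stripping the
   warmth; efficiency without humanity is just automation.
5. Tone: calibrate the emotional register to the scenario - empathy for
   frustrated customers, enthusiasm for product questions, professional
   warmth for press inquiries.
"""
```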
The Systematic Process That Made It Work
Here’s what Week 3 actually taught me: Evaluation isn’t just scoring outputs. It’s developing a systematic process for continuous improvement.
The framework Tyler and Sara taught us:
Step 1: Generate AI output (using current system instructions)
Step 2: Evaluate against six specific criteria (not just “does this feel okay?”)
Step 3: Document SPECIFIC issues (not “this doesn’t sound like me” but “this uses ‘Hi there!’ which I never use, and it’s missing parenthetical expressions that characterize my voice”)
Step 4: Refine system instructions (based on documented issues, not vague feelings)
Step 5: Generate new output (with refined instructions)
Step 6: Evaluate again and measure improvement (quantifiable progress, not subjective assessment)
Step 7: Iterate until outputs consistently meet standards (not “good enough” but “actually good”)
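If it helps to see that loop as pseudocode, here's a rough sketch. The generate, human_evaluate, and refine functions are placeholders for whatever model and human review process you're using; none of this is from the course materials.

```python
# A rough sketch of the seven-step loop above. `generate`, `human_evaluate`,
# and `refine` are placeholders, not real APIs: a person still does the
# scoring and the instruction rewrites.

def refine_until_reliable(system_instructions, scenarios,
                          generate, human_evaluate, refine,
                          threshold=28, max_rounds=5):
    """Iterate generate -> evaluate -> document -> refine until every
    scenario consistently clears the threshold (out of 30)."""
    results = []
    for _ in range(max_rounds):
        results = []
        for scenario in scenarios:
            output = generate(system_instructions, scenario)       # Step 1
            score, issues = human_evaluate(output, scenario)       # Steps 2-3
            results.append((scenario, score, issues))
        if all(score >= threshold for _, score, _ in results):     # Step 7
            break
        # Steps 4-6: fold the documented issues back into the instructions,
        # regenerate, and re-evaluate on the next pass.
        system_instructions = refine(system_instructions, results)
    return system_instructions, results
```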
This isn’t revolutionary. It’s just rigorous. But rigor is what separates AI that’s “interesting” from AI that’s actually RELIABLE.
The High-Stakes Connection
By Week 3, I was making connections everywhere (because that’s what my brain does—pattern recognition across domains).
This evaluation framework? It’s exactly what high-stakes AI systems need.
I’ve been writing about designing AI for government and defense applications for months. About shared responsibility between AI and human judgment. About appropriate trust calibration. About transparency and human oversight.
But evaluation is HOW you operationalize all of that.
In healthcare AI (like Vincent Buil’s work at Phillips that I wrote about), evaluation isn’t optional—it’s how you ensure patient safety. You can’t just deploy AI and hope it works. You need systematic measurement of accuracy, error detection, appropriate confidence levels.
In defense applications (like the mission planning systems I design), evaluation is how you maintain operational reliability. When lives are at stake, you need to KNOW—not guess, not hope, but KNOW—that your AI system will perform consistently under pressure.
In intelligence analysis (where my work is heading), evaluation is how you calibrate trust. Analysts need to know when to rely on AI recommendations and when to override them. That calibration requires systematic measurement of AI performance across different contexts.
The evaluation framework I learned in Week 3 isn’t just for email assistants. It’s the foundation for ANY reliable AI system in ANY high-stakes environment.
The Business Applications Already Emerging
By the end of Week 3, my brain was spinning with applications (because of course it was):
For my bosses’ startup (pursuing DOD contracts): How do we evaluate AI assistants that help with complex proposal writing? What metrics matter? How do we measure whether AI-generated content maintains the technical precision AND persuasive clarity that government contracts require?
For my husband’s accounting software company: How do we systematically evaluate AI that helps with financial analysis? What’s the threshold for “good enough” when you’re dealing with people’s money? How do we build confidence through measurable performance?
For my mom’s Bible education blog: How do we evaluate whether AI-generated scripts maintain theological accuracy AND her authentic teaching voice? What criteria matter most? How do we iterate toward consistency?
Every single one of these applications requires the same fundamental skill: systematic evaluation with human feedback loops.
Why This Is The 1% Skill
Here’s what Sara and Tyler emphasized throughout Week 3, and what I’m now seeing everywhere:
Most people using AI stop at “good enough”—but here’s the thing: they’re not actually FINE with it. They’re frustrated. They’re disappointed. They blame the AI models for producing “generic garbage.”
But what they don’t realize is this: If you put in general information, AI will generate a basic output. That’s not the AI failing. That’s the AI doing exactly what you asked it to do—with no context, no personalization, no understanding of who YOU are or how YOU think or what YOU actually need.
The output looks... okay. Functional. They use it. But they don’t LIKE it. And they get increasingly frustrated because they keep hearing about how “revolutionary” AI is supposed to be, and what they’re getting is just... meh.
So they throw up their hands. “AI doesn’t work for me.” “It’s all hype.” “I tried it and it’s useless.”
But here’s what’s actually happening: They literally don’t know HOW to use it. Not because they’re not smart enough or tech-savvy enough—but because nobody taught them that AI collaboration is a SKILL that requires:
Systematic evaluation (not just “does this look okay?”)
Iterative refinement (not one-and-done prompts)
Measurement of improvement (not vague feelings of dissatisfaction)
Context and personalization (not generic inputs expecting magical outputs)
How do AI systems “know” any context about the human typing the question? They don’t. Unless you TELL them. Unless you build that context into your system instructions. Unless you personalize the AI to understand how YOU think, how YOU communicate, what YOU need.
It’s actually not the AI’s “fault” at all. It’s US. It’s HOW we talk to AI systems.
That’s the skill I’ve been honing over these seven weeks. That’s what separates the 1% who build reliable AI systems from the 99% who try AI once, get frustrated, and give up.
The 1% understand: AI isn’t magic. It’s a tool that requires skill, practice, and systematic refinement to use effectively.
And once you develop that skill? The transformation is dramatic.
Thank you, Tyler. Thank you, Sara.
The 1% who actually build reliable AI systems? They evaluate. Rigorously. Systematically. Continuously.
They don’t just ask “does this work?” They ask:
How well does it work?
Under what conditions does it work best?
Where does it fail?
How can we measure improvement?
What’s our threshold for acceptable performance?
How do we maintain that performance at scale?
This is the skill gap that’s holding back AI adoption in professional contexts. It’s not that AI isn’t capable—it’s that most people don’t know how to systematically improve and maintain AI performance.
And in high-stakes environments? That gap is dangerous.
The Meta-Realization About My UX Work
Here’s where Week 3 connected to everything I’ve been doing as a UX designer:
Evaluation IS user testing. It’s the same fundamental skill—systematic observation, specific feedback, iterative refinement based on measured outcomes.
I’ve been doing this for years in UX design. Testing interfaces with users. Documenting specific issues. Refining designs. Testing again. Measuring improvement.
The only difference with AI is that the “user” is the AI system itself, and the “interface” is the system instructions that guide its behavior.
My Week 3 feedback even pointed this out: “As a UX/UI Systems Designer working with complex applications in government and military sectors, you have a unique perspective on evaluation and refinement. How might you adapt the evaluation framework you’ve developed to assess not just AI communication but also the usability and effectiveness of complex systems interfaces?”
That question hit me hard. Because the answer is: It’s the SAME framework.
Systematic evaluation. Specific criteria. Measurable outcomes. Iterative refinement. Human feedback loops.
Whether you’re evaluating an AI email assistant or a mission planning interface or an intelligence analysis workflow—the fundamental process is identical.
And suddenly I realized: My UX background isn’t just adjacent to AI design. It’s FOUNDATIONAL. The skills I’ve been developing for years—user research, systematic testing, iterative design, stakeholder management—these are exactly the skills needed for reliable AI development.
The Questions I’m Now Asking
Week 3 left me with questions that extend far beyond email assistants:
How do we scale evaluation in enterprise AI systems? I can manually evaluate five email scenarios. But what about AI systems generating hundreds or thousands of outputs daily? How do we maintain evaluation rigor at scale?
What’s the right balance between human evaluation and automated metrics? Some things require human judgment (does this sound like me?). Some things can be automated (word count, sentiment analysis). Where’s the line?
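To make that split concrete, here's a hedged sketch of the kind of automated pre-screen I have in mind. The rules are made up to mirror my own Round 1 feedback; they flag the mechanical stuff and leave the "does this sound like me?" judgment to a human.

```python
# A sketch of automated pre-screening: cheap, mechanical checks run on every
# draft before a human scores it. The specific rules are illustrative
# (they mirror my Round 1 feedback), not a standard metric set.

import re

def automated_checks(email_body: str, max_words: int = 160) -> list:
    """Return a list of flags for the human reviewer to look at."""
    flags = []
    if email_body.lstrip().lower().startswith("hi there"):
        flags.append("uses the generic 'Hi there!' salutation")
    if len(email_body.split()) > max_words:
        flags.append(f"over {max_words} words - probably not concise enough")
    if email_body.count("!") > 3:
        flags.append("more than three exclamation points")
    if not re.search(r"\([^)]+\)", email_body):
        flags.append("no parenthetical asides - may not sound like my voice")
    return flags
```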
How do we build evaluation into AI workflows from the beginning? Most people treat evaluation as an afterthought. What if we designed AI systems with evaluation frameworks built in from day one?
What evaluation criteria matter most in different contexts? Email assistants need brand voice match. Medical AI needs accuracy and safety. Defense AI needs reliability under pressure. How do we identify the right criteria for each context?
How do we train people to evaluate AI effectively? This is a SKILL. It requires practice, feedback, refinement. How do we help more people develop this capability?
What Week 4 Was About to Reveal
By the end of Week 3, I had all the pieces:
Week 1: Personalization (understanding how I learn and think)
Week 2: Voice analysis (understanding how I communicate)
Week 3: Evaluation (systematic improvement and reliability)
Week 4 was about putting it all together: Building autonomous workflows where multiple AI agents collaborate to accomplish complex tasks.
And what I built surprised even me. Because I didn’t build an email assistant. I built something far more ambitious—a multi-agent system that transforms complex theological research into engaging educational content for my mother’s Bible teaching ministry.
And then I turned around and applied the same principles to mission planning for defense applications.
Turns out, the skills are universal. The contexts change. The stakes vary. But the fundamental equation holds: personalization + voice + evaluation = reliable AI systems that actually work.
But that’s a story for Part 4.
The Transformation I’m Sitting With
Here’s what Week 3 taught me that goes beyond AI:
Systematic evaluation is how you bridge the gap between capability and reliability. In any domain. Any context. Any application.
AI can generate impressive outputs. But impressive isn’t the same as reliable. Reliable requires measurement, iteration, continuous improvement, human feedback loops.
And in high-stakes environments where I work—government, defense, intelligence—reliability isn’t optional. It’s everything.
The evaluation framework I learned in Week 3 isn’t just a technique for improving AI assistants. It’s a fundamental skill for building trustworthy AI systems in contexts where trust actually matters.
My first round scores: Low-to-mid 20s out of 30. My second round scores: Consistently 28-29 out of 30.
That’s not luck. That’s not magic. That’s systematic evaluation with human feedback loops.
That’s the 1% skill.
And now I have it.
Stay tuned for Part 4, where everything comes together in ways I didn’t expect.
In the spirit of transparency I advocate for in AI development: I worked with Claude to structure and refine these reflections from my Week 3 experience. The evaluation scores, feedback comments, and insights about systematic improvement are from my actual evaluation work in Sara and Tyler’s course, with AI assistance in articulating the patterns and principles I discovered.
