So... My AI App Has Been Lying to Users (And How I Fixed It)

Chris Raroque
18:59
Apr 7, 2026
10.7K views
595

Hi my name is Chris and I build productivity apps 👋 and this is a DEEP DIVE on how I build (and iterate on) AI systems
---
Braintrust (the tool I use for AI evals): https://braintrustdata.link/Uww5Jin
---
My apps and socials: https://chrisraroque.com

Timestamps:
0:00 – Intro / What we are covering
1:52 – Overview of AI evals (and why you need them)
4:22 – My ACTUAL AI eval workflow
5:14 – Attempt 1 (split search and calculation system)
6:43 – Attempt 2 (mini agent)
7:55 – Attempt 3 (swapping search providers)
9:33 – Trying to squeeze more out of Exa
10:51 – My new AI system for Amy (using Exa)
11:23 – Braintrust (what i use to run AI evals)
14:23 – Common AI eval mistakes (that i made)
15:37 – Writing good test cases
16:22 – How to IMPROVE your AI system with user feedback
17:37 – A summary of what I learned from my experiments
18:33 – Final thoughts and thank you :)

#appdevelopment #dayinthelife #softwareengineer #startup #softwaredev #indieappdeveloper #dayinthelifecoding #codewithme #buildinpublic #vlog

AI accuracy challenges
AI evaluation system
AI model testing
Search provider impact
AI cost-performance optimization
TL;DR

Chris Raroque discusses improving the accuracy of his AI calorie tracking app through testing and experimentation.

Watch Score: 8

The video provides meaningful insights into AI optimization with practical advice.

Clickbait: 2/10
Sentiment: positive
Should watch

Any app developer or AI manager who's interested in improving AI accuracy and testing should watch.

Can skip

Those not engaged with AI technologies or uninterested in technical deep dives may skip.

Quality (9/10)

The video provides valuable insights into AI testing with transparency about method and outcomes.

Clickbait (2/10)

Title accurately reflects content without exaggerated claims.

Sponsorship Detected
Braintrust (~30s)
Summary
Chris Raroque's AI-based calorie tracking app, Amy, has an accuracy problem that is leading subscribers to cancel. Amy pulls its numbers from nutrition databases, so its answers are only as good as the entries it finds, and it often gets them wrong, especially for international products. Raroque demonstrates how he improves accuracy with "evals": test cases built from real production data and rerun after each change. He uses Braintrust's tools to run the tests and get objective accuracy scores.

Across several attempts, including separating the search and reasoning tasks, Raroque compares different models and search providers. Some changes, like using Gemini 3 for reasoning, initially made the app perform worse. Others, like switching search providers from Perplexity to Exa, improved accuracy, and Exa ends up as the more effective search provider.

Raroque reflects on what worked and what did not, and stresses that testing has to be continuous because data and third-party performance change over time. The process has increased the app's accuracy while keeping cost and speed under control. He also underscores the importance of a robust eval system, which means choosing test cases carefully and making sure the scoring itself is reliable. Finally, he encourages viewers to set up their own eval systems and shares the tools and lessons he relies on.
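
The eval loop described in the summary can be sketched in a few lines of plain Python. This is a minimal illustration, not Raroque's code and not the Braintrust API: `EvalCase`, `within_tolerance`, `run_eval`, `stub_system`, the 10% tolerance, and the calorie values are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One test case: a user query and the calorie value accepted as correct."""
    query: str
    expected_kcal: float


def within_tolerance(predicted: float, expected: float, rel_tol: float = 0.10) -> bool:
    """Score a prediction as correct if it is within rel_tol of the expected value."""
    return abs(predicted - expected) <= rel_tol * expected


def run_eval(system: Callable[[str], float], cases: list[EvalCase]) -> float:
    """Run every case through the system and return the fraction scored correct."""
    passed = sum(within_tolerance(system(c.query), c.expected_kcal) for c in cases)
    return passed / len(cases)


# Hypothetical cases with illustrative values; real ones would come from
# production queries that have been checked by hand.
CASES = [
    EvalCase("1 medium banana", 105.0),
    EvalCase("1 large egg", 72.0),
    EvalCase("1 cup cooked white rice", 205.0),
    EvalCase("1 tbsp olive oil", 119.0),
]


def stub_system(query: str) -> float:
    """Stand-in for the real AI pipeline: a lookup table with one deliberate miss."""
    answers = {
        "1 medium banana": 100.0,
        "1 large egg": 70.0,
        "1 cup cooked white rice": 150.0,  # deliberately wrong
        "1 tbsp olive oil": 120.0,
    }
    return answers[query]


score = run_eval(stub_system, CASES)
print(f"accuracy: {score:.0%}")  # accuracy: 75%
```

Because the score is a single number computed the same way every time, a change to the prompt, model, or search provider can be judged by rerunning `run_eval` rather than by spot-checking a few queries.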
Key Takeaways
  • Separating search and reasoning tasks in AI can improve control and accuracy.
  • Switching search providers had a significant effect on the accuracy of the AI's output.
  • Complex solutions often underperformed simpler ones.
  • Continuous evaluation is vital for keeping an AI system accurate over time.
  • Testing with real user data exposed weaker results for international products.
  • Braintrust provides tools for running and scoring AI tests.
  • Different AI setups differed in speed and cost, not only in accuracy.
  • Experiments confirmed the need for specific, well-chosen test cases.
  • Updates to third-party AI tools can boost performance unpredictably.
  • Building your own eval system is crucial for deploying effective AI solutions.
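
The first two takeaways, splitting search from reasoning and swapping search providers, can be illustrated with a small sketch. Everything here is hypothetical: `provider_a` and `provider_b` are toy lookup tables standing in for real search services, `parse_kcal` stands in for a reasoning model, and the products and calorie values are invented. The point is only that once the two steps are separate functions, one can be swapped while the other and the test cases stay fixed.

```python
from typing import Callable

SearchFn = Callable[[str], str]          # query -> raw text snippet
ReasonFn = Callable[[str, str], float]   # (query, snippet) -> kcal estimate


def make_pipeline(search: SearchFn, reason: ReasonFn) -> Callable[[str], float]:
    """Compose a search step and a reasoning step into one system under test."""
    def pipeline(query: str) -> float:
        return reason(query, search(query))
    return pipeline


def provider_a(query: str) -> str:
    """Hypothetical provider that only knows one product."""
    return {"oat bar": "oat bar: 190 kcal"}.get(query, "")


def provider_b(query: str) -> str:
    """Hypothetical provider with wider coverage."""
    data = {"oat bar": "oat bar: 190 kcal", "melon pan": "melon pan: 350 kcal"}
    return data.get(query, "")


def parse_kcal(query: str, snippet: str) -> float:
    """Toy reasoning step: read the number before 'kcal', or 0.0 if search found nothing."""
    if "kcal" not in snippet:
        return 0.0
    return float(snippet.split(":")[1].replace("kcal", "").strip())


CASES = {"oat bar": 190.0, "melon pan": 350.0}  # hypothetical expected values


def accuracy(system: Callable[[str], float]) -> float:
    """Fraction of cases where the estimate is within 10% of the expected value."""
    hits = sum(abs(system(q) - want) <= 0.1 * want for q, want in CASES.items())
    return hits / len(CASES)


# Same reasoning step and same cases; only the search provider changes.
results = {
    name: accuracy(make_pipeline(search, parse_kcal))
    for name, search in {"provider_a": provider_a, "provider_b": provider_b}.items()
}
print(results)  # {'provider_a': 0.5, 'provider_b': 1.0}
```

Holding the reasoning step constant is what makes the comparison fair: any difference in the two scores can be attributed to the search provider alone.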
Action Items
  1. Set up an eval system for AI testing.
  2. Consider switching search providers based on latest data performance.
  3. Regularly update test cases based on user feedback.
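
One way the third action item could work in practice is to turn user corrections into new test cases. The sketch below is an assumption about how that might look, not the workflow shown in the video: the `Feedback` record, `normalize`, `merge_feedback`, and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """A user correction: what they asked, what the app said, what they say is right."""
    query: str
    app_kcal: float
    corrected_kcal: float


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so near-identical queries dedupe."""
    return " ".join(query.lower().split())


def merge_feedback(dataset: dict[str, float], reports: list[Feedback]) -> list[str]:
    """Add corrected answers as new test cases; return the queries that were added.

    Existing cases are left untouched so a single report cannot silently
    overwrite an expected value that was already reviewed.
    """
    added = []
    for report in reports:
        key = normalize(report.query)
        if key not in dataset and report.corrected_kcal > 0:
            dataset[key] = report.corrected_kcal
            added.append(key)
    return added


dataset = {"1 large egg": 72.0}
reports = [
    Feedback("Melon  Pan", 120.0, 350.0),   # new case
    Feedback("melon pan", 130.0, 340.0),    # duplicate after normalization, skipped
    Feedback("1 Large Egg", 90.0, 78.0),    # already covered, not overwritten
]
added = merge_feedback(dataset, reports)
print(added)    # ['melon pan']
print(dataset)  # {'1 large egg': 72.0, 'melon pan': 350.0}
```

User corrections can themselves be wrong, so in a real setup each added case would likely need a manual check before it counts toward the accuracy score.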
Prerequisites
  • Basic understanding of AI models
  • Knowledge of app development concepts
  • Familiarity with performance testing methodologies
Key Definitions
evals
Test cases run against an AI system to measure whether a change improves it.
Braintrust
Platform used for running evals and scoring AI performance.
Mentioned Resources
Braintrust (tool)

Used for evaluating and scoring AI effectiveness.

Perplexity Sonar (tool)

Initial search AI model used in the app.

Gemini 3 Flash (tool)

Used as a reasoning model in testing.

Exa (tool)

Search provider that improved accuracy in tests.

MyFitnessPal (website)

Mentioned as a source for nutrition databases.

Content Analysis
Type

vlog

Sentiment

positive

Difficulty

intermediate

Complexity

moderate

Target Audience

App developers, AI enthusiasts, productivity tool users

#ai testing
#app development
#ai accuracy
#calorie tracking app
#productivity tools
#braintrust
#search providers
#eval system
#ai experimentation
#model optimization