Seeing Through the Smoke and Mirrors: A Practical Guide to Evaluating AI

As AI capabilities advance at a dizzying pace, we are bombarded with bold promises of automated tools that will revolutionize how we work and live. Slick product launches and marketing hype tempt us to buy into claims of AI ascendancy. However, the gap between promises and actual performance often remains cavernous. To determine if an AI tool can genuinely enhance our productivity, we need to see through the smoke and mirrors. We must build our own literacy to evaluate when and how AI can meaningfully complement our unique skills and workflows.

In this post, we'll provide a practical guide to developing an informed perspective on AI tools and show you how not to be fooled. By building core competencies, testing performance on familiar tasks, scrutinising claims, and consulting impartial experts, you can determine if and when AI is currently capable of assisting you. Marketing glitz often far outpaces present functionality. With literate and skeptical eyes, we can sift reality from wishful thinking about AI’s current abilities and focus on getting practical help with the work we actually do.


The Temptation of AI’s Promise

We live in a world of relentless AI marketing hype. The pace of advancement stokes dreams of automated super-human assistants. Slick product launches dazzle us with visions of revolutionising our workflow overnight.

It’s no wonder many of us get lured in. The potential seems limitless. But behind the clever websites, video pitches and benchmark beatings lies a gap between promise and reality. How can we see through the smoke and mirrors?

Building Our Own AI Literacy

With any rapidly emerging technology, hype often outpaces truth. To genuinely assess new AI systems, we must first develop core competencies to evaluate claims. This is what our books and training aim to provide. Mastering our 4Ps of preparation, prompting, process and proficiency gives us the foundation to judge what AI can and cannot yet achieve. Building your AI skills makes you savvier.

Testing AI's Mettle Through Familiar Tasks

Once you’ve learned the foundational skills, testing AI tools on activities within your own areas of expertise provides an authentic litmus test beyond canned demos. Tune out the speculative static and focus on present-day utility.

Once you have seen a language model like ChatGPT perform impressively on known tasks, you’ll realise the power in concrete terms. Perhaps it’s summarising lengthy reports into cogent briefs. Perhaps it’s providing thoughtful creative input.

When I first started exploring ChatGPT, the very first thing I tried was an audience segmentation for a client. As an audience analytics specialist, that’s an area of deep expertise for me. I wanted to genuinely test if this AI could provide value in my wheelhouse.

More people should push models to perform tasks within their professions to test them. Not the absolute most advanced and complicated tasks, but an important activity that would offer real utility if automated. Start with capabilities, not complexity.

During a recent AI workshop we ran, participants tried this approach across areas that mattered to them. Some tested summarizing longform documents into snippets to share with colleagues. Others challenged the model to evaluate a partner’s proposal against specified criteria from their industry.

The specific task differs, but the approach remains replicable: pick a reasonably high-level activity from your domain where AI assistance would provide concrete value. Then, using the tool well (the 4Ps!), rigorously assess where current capabilities excel and where they fall short. This tailored testing builds familiarity with both limitations and utility.
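For readers who like to keep a record, the approach above can be sketched as a small scorecard. This is only an illustration, not a method from our training: the `Trial` class, the 1-5 scale, the `good_enough` threshold and the sample entries are all assumptions made up for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One attempt at a task you know well, scored by you as the expert."""
    task: str
    score: int      # hypothetical 1-5 scale: 1 = unusable, 5 = as good as my own work
    note: str = ""  # optional remark on what went right or wrong


def summarise(trials, good_enough=4):
    """Group trials by task and report attempts, average score and a verdict."""
    by_task = {}
    for t in trials:
        by_task.setdefault(t.task, []).append(t.score)

    report = {}
    for task, scores in by_task.items():
        avg = mean(scores)
        report[task] = {
            "attempts": len(scores),
            "average": avg,
            # The threshold is an arbitrary assumption; set your own bar.
            "verdict": "excels" if avg >= good_enough else "falls short",
        }
    return report


# Placeholder entries for illustration only, not real test results.
trials = [
    Trial("summarise report", 4, "accurate, slightly long"),
    Trial("summarise report", 5),
    Trial("evaluate proposal", 2, "missed two criteria"),
    Trial("evaluate proposal", 3),
]

for task, row in summarise(trials).items():
    print(f"{task}: {row['verdict']} "
          f"(avg {row['average']:.1f} over {row['attempts']} attempts)")
```

Scoring several attempts per task, rather than one, helps separate a lucky answer from a dependable capability.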

Judging AI feasibility on familiar turf provides an authentic litmus test. It illuminates strengths and weaknesses better than any slick demo. And it forges understanding through hands-on experience rather than far-off speculation. By walking through use cases we know intimately, we can determine if and where AI proves useful. And ultimately, we better prepare for more advanced integrations to come.

Meanwhile, use imperfect tools to build expertise and get benefits

Rather than awaiting perfection, we can realize benefits by proactively employing present-day AI utilities, incomplete as they may be. We may not be able to have AI produce a slick PowerPoint for us, but it can certainly help us think, slide by slide, about the content we should include. Early adoption improves our workflows today while building the scaffolding to absorb more advanced tools when they are ready. This active learning builds the foundational skills essential for smoother integration of later AI innovations.

Additionally, employing current AI functionalities, however imperfect, provides valuable insights into optimal integration strategies for enhanced editions yet to come. We discern through trial and error how augmented intelligence complements our strengths. We map ideal intersections between automated assistance and human judgment within our roles.

Just as we gain muscle memory executing routines before perfecting our craft, interacting with AI in its currently imperfect form advances our readiness to utilize more powerful iterations in the future. The capabilities will continue advancing rapidly. If we adopt what is workably available now, we will be well-versed in unlocking the full potential of sophisticated successors when they arrive.

Probing the Proof Behind Bold Claims

This is all necessary because AI marketing often relies on lofty proclamations that run ahead of real-world performance. Take Google's release of Gemini, its much-touted new language model. The slick promotional videos depicted impressive capabilities beyond what the best current model (GPT-4) can achieve, and the benchmark scores suggested the same. But the reality is far less polished.

Google claimed benchmark scores showing Gemini surpassed GPT-4, the most advanced model to date. Yet the testing methodology used more advanced prompting for Gemini than for GPT-4. A fair comparison would require equal conditions.
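The "equal conditions" idea can be sketched in a few lines: only compare scores that were obtained under the same benchmark and the same prompting setup. Everything here is hypothetical, including the `like_for_like` function, the model and benchmark names, and the scores, which are invented placeholders and not published results.

```python
def like_for_like(results, model_a, model_b):
    """Compare two models only where both were run under the same condition.

    results: list of (model, benchmark, prompting, score) tuples.
    Returns (comparable, unmatched): comparable maps each shared
    (benchmark, prompting) condition to the pair of scores; unmatched
    lists the conditions that only one of the two models was run under.
    """
    scores = {}
    for model, benchmark, prompting, score in results:
        scores.setdefault((benchmark, prompting), {})[model] = score

    comparable, unmatched = {}, []
    for condition, by_model in sorted(scores.items()):
        if model_a in by_model and model_b in by_model:
            comparable[condition] = (by_model[model_a], by_model[model_b])
        else:
            unmatched.append(condition)
    return comparable, unmatched


# Invented placeholder numbers, purely to show the mechanics.
results = [
    ("model_a", "benchmark_x", "advanced prompting", 0.90),
    ("model_a", "benchmark_x", "basic prompting", 0.80),
    ("model_b", "benchmark_x", "basic prompting", 0.85),
]

comparable, unmatched = like_for_like(results, "model_a", "model_b")
print("Fair comparisons:", comparable)
print("Not comparable:", unmatched)
```

In this made-up case, setting model_a's advanced-prompting score against model_b's basic-prompting score would flatter model_a, while the only like-for-like pair points the other way.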

Further, the Gemini demo videos were impressive and seemed to show capabilities that GPT-4 didn’t have. But the video adjusted timing and simplified the prompts compared to what was actually done, which exaggerated the smoothness and breadth of the model's abilities. When GPT-4 was tested with the real-world examples and prompts, it was able to replicate Gemini’s responses.

Similarly, many of us were wowed by demo videos and slick promises for Microsoft Copilot. The dream that it could ‘create a PowerPoint for you’ is alluring. But it doesn’t work in a way that’s nearly as useful or reliable as the average person seeing that demo would assume. Far from it. Once more, the hype exceeds real-world abilities.

It pays to scrutinize such bold claims through the lens of our informed experience. Do the slick releases genuinely demonstrate superior performance on activities people actually do? Understanding AI building blocks allows us to separate hype from reality.

Seeking Out Impartial Perspectives

Rather than trusting authority and tech brands, it helps to seek out impartial experts. Independent analysts who you trust (not ones who are incentivised to sell, sell, sell!) can provide authentic guidance on capabilities beyond product launches. Their robust, hands-on evaluations counterbalance carefully orchestrated and often misleading marketing.

Our own firsthand experience offers a grounded perspective. We always start by exploring an AI tool’s performance on tasks within our expertise to make a realistic assessment. For areas beyond it, we work with specialists, combining our AI usage skills (see the 4Ps above) with their domain knowledge to push and evaluate AI performance there. This rigour is very much needed in order to get beyond the marketing hype.

Scepticism First is a good strategy

Be sceptical until you see an AI do useful work in an area where you’re an expert or where an expert you trust tells you it does so for them.

In the meantime, work on those foundational skills (as covered in our books and training) so you have the skills to evaluate AIs!

Summing it all up

By developing AI evaluation skills rooted in our expertise, impartial perspectives and realistic testing, we build the literacy to navigate relentless marketing hype. Wild claims will come and go, but our hard-won discernment remains reliable. Though AI promises tempt, we focus on present-day utility, not speculative dreams.

With patient literacy, we understand AI’s strengths and limitations for enhancing our workflows right now. And with an eye always on advancing competencies, we prepare to leverage AI capabilities not just for today but for the future evolutions our diligence helps create. Though hype attempts to dazzle, our enlightened eyes see AI clearly for what it is and, more importantly, for what it someday may be.

1 comment

All I can say is WOW!!!! This is next level AI gospel right here. I can’t believe these posts are not loaded with comments. But honestly I have to say thank you. When the AIPRM extension first released and your prompts were some of the initial handful to choose from, the results opened my eyes and really pulled my interest down this path of innovation. I worry that if not for your insights I may have used chatgpt for a couple days then went back to my 9-5. Thank you

Steve Michael LaJoie