June 8, 2026 · Carlos Crespo

The Best AI Was The One Used For The Right Job

Every model struggled with full-game NFL picks. Gemini stood out when the question became specific, hitting 73.3 percent on player props.

AINFLGeminiChatGPTTrustworthy AI

Ahead of football season I reopened an older project where I compared ChatGPT, Claude, and Gemini on NFL predictions. I expected one model to be a clear winner. The real result was more useful than that.

Every model struggled when it tried to predict full games.

The test covered 86 verified bets across two weeks of NFL games. I included spreads, totals, moneylines, and player props. The model names were hidden during grading, so I was judging Model 1, Model 2, and Model 3 instead of picking the company I already liked.

ChatGPT lost $7 on spreads, totals, and moneylines. Gemini lost $43.82. Claude came closest to breaking even, but still lost $0.45.

That means the broad game picks did not produce a winning approach for any of them. ChatGPT finished with the best overall total because its failures were less severe and its player picks were still strong. I do not think that makes it the best model for this use case. It made it the least volatile across a mixed group of questions.

The player picks told a different story.

Verified AI model results showing every model struggling on game picks while Gemini leads player props

Gemini went 11-4 on player props. That is a 73.3 percent hit rate and plus $63.82 in the category where the models actually showed a useful edge.

ChatGPT was also strong on player props at 9-4 and 69.2 percent. Claude finished 10-7 at 58.8 percent. All three made money there.

If I were choosing a model based on the part of the experiment that actually worked, Gemini would be my pick. Its full-game results were bad, but I would not use any of the models for full-game picks after seeing this data. The useful question is not which model lost the least when used poorly. The useful question is which model performed best when the task matched what these tools seem able to do.

That is where specificity matters.

Predicting a whole football game asks a model to combine too many moving parts into one answer. A player prop is narrower. The model can focus on one player, one matchup, recent usage, injuries, and a specific stat line. It has a cleaner question and a result that can be checked.

This lesson applies outside of football too. People often ask which AI is "the best" as if there should be one answer. I think a better approach is to understand who built the model, what tools it can use, and what kind of work fits its strengths.

My theory is that Gemini may have benefited from Google's strength in search, retrieval, and current information. Google explains that Gemini can be grounded with Google Search, allowing responses to use current web information and source links. That kind of access could matter when a question depends on recent player performance, injuries, matchup details, or other changing data.

The experiment did not directly test whether Search grounding caused Gemini's result. I cannot claim that it did. The result is still interesting because it matches a reasonable idea: a model connected to a strong information-retrieval system may have an advantage when the task depends on specific, current facts.

ChatGPT's result tells a different story. It was steady across the mixed test and had fewer extreme failures than Gemini. Claude was nearly break-even on game picks and positive on player props. Each model showed a different shape instead of one universal ranking.

The project also reminded me to check my own work. My first spreadsheet had some model assignments mixed up. I rebuilt the data from the original responses, verified each result, and corrected the totals. That cleanup mattered, but it was not the most interesting finding. The bigger finding was still the split between broad predictions and specific player analysis.

This was a small experiment covering two weeks of one NFL season. It does not prove Gemini will repeat a 73.3 percent result. It does not mean anyone should gamble based on an AI response. Betting outcomes change, models change, and a short sample can make a result look stronger than it really is.

What it did show me is that AI gets more useful when the task is specific and the model is chosen for a reason. Gemini was the strongest model in the part of this experiment that worked. The other models still had useful qualities, but treating every LLM as interchangeable would miss the point.

You can explore the original AI Analyzer project and see how the comparisons were presented.

I will keep sharing project breakdowns through the CSolutions blog and newsletter. Shorter updates are also on Instagram and Facebook. Newsletter The ai-analyzer-revisited campaign leads with the useful result: every model failed on broad game picks, all three improved on player props, and Gemini led the category at 73.3 percent. It links to the blog, chart, interactive project, newsletter signup, Instagram, and Facebook.

LinkedIn post I went back to an old AI project before football season, and I realized I had been looking at the winner the wrong way.

I tested ChatGPT, Claude, and Gemini on 86 NFL picks. Every model struggled on full-game bets like spreads, totals, and moneylines.

The player picks were completely different.

Gemini went 11-4 on player props and hit 73.3 percent. ChatGPT hit 69.2 percent. Claude hit 58.8 percent. That was the only category where every model made money.

ChatGPT had the best overall total because it had less variance and lost less on the bad category. If I were actually choosing a model for the part of the test that worked, I would choose Gemini.

That was the bigger lesson for me. The best AI depends on the job. Specific questions gave the models a much better chance than broad predictions, and understanding the company behind a model may help explain its strengths.

My theory is that Google's search and retrieval ecosystem may have helped Gemini with player-specific information. The test did not prove that, but the result gives me a good reason to keep exploring it.

This was a small experiment, not betting advice. Results can change and nobody should expect the same outcome.

Read the full breakdown: https://carloscrespo.info/blog/why-specific-ai-prompts-beat-general-predictions

Explore the original project: https://crespo1301.github.io/AI_Analyzer_Crespo/

Newsletter readers get new project breakdowns first: https://carloscrespo.info/#newsletter

hashtag#AI hashtag#Gemini hashtag#ChatGPT hashtag#NFL hashtag#TrustworthyAI

Research links used for the theory section Google AI for Developers, Grounding with Google Search: https://ai.google.dev/gemini-api/docs/google-search Vertex AI, Grounding with Google Search: https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/grounding-with-google-search Google Cloud, Gemini 2.5 Flash model page: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash OpenAI, Introducing GPT-5: https://openai.com/index/introducing-gpt-5/

Get the next post in your inbox.

A new write-up every Monday, no spam, unsubscribe anytime.

Subscribe to the newsletter

Follow Along for Recent Work, Behind-The-Scenes Updates, & New Launches.