With this spring’s release of not one but two AI chatbot voice-activated hardware assistants, the Rabbit R1 and the Humane AI Pin, the engineers at OpenAI weren’t to be outdone. In our testing of the company’s new voice-based model, ChatGPT-4o (not to be confused with ChatGPT-4.0), we found it to be accurate and helpful, if also off-base in some instances. Despite this, ChatGPT-4o offers an intriguing glimpse into the possible future of LLM (large language model) interaction, with fast response times, new input options, and the tease of Siri integration coming this year.

What’s New in ChatGPT-4o?

Where previous versions accept only simple text input (or speech converted to text), GPT-4o can take audio, video, images, or text and do whatever you ask with that input. To further blur the lines between AI and the real world, GPT-4o also represents a drastic reduction in latency and response times.

ChatGPT-4 is incredibly capable, but it’s also costly to the system that hosts it for several reasons, including token implementations that are heavier to process and higher token limits, which simply take longer for the system to read. Conversely, the new GPT-4o model is leaner (fewer tokens are required for the same inputs) and meaner (more efficient use of those tokens), and it can return answers in a fraction of the time of its predecessor. You can naturally go back and forth with the system at nearly the same rate as you would talk to another person.
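The same GPT-4o model behind the chat interface is also reachable programmatically. Below is a minimal sketch using OpenAI’s Python SDK, assuming the openai package (v1.x) is installed and an API key is set in the OPENAI_API_KEY environment variable; note that API usage is billed separately from the free ChatGPT app discussed in this review.

```python
# Minimal sketch: a single text query against the GPT-4o model via OpenAI's Python SDK.
# Assumes the openai package (v1.x) is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY automatically

response = client.chat.completions.create(
    model="gpt-4o",  # the leaner, faster model discussed in this review
    messages=[
        {"role": "user", "content": "Explain what a large language model is in one sentence."},
    ],
)

print(response.choices[0].message.content)
```

The same messages array can also carry image parts, which comes up again in the image-recognition test below.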
(Credit: OpenAI/PCMag)
Another innovation is the option to interrupt the AI in real time. If you sense it didn’t interpret your request properly, you can stop and clarify the request mid-stream, much as you would in human conversation. The AI will understand that its initial interpretation was incorrect and reply on the fly, accounting for your new input. In testing, we found the feature worked very well, responding to everything from “stop” to “that’s not what I meant” and more. There doesn’t seem to be one specific command string that tells the system to stop; you can simply cut it off mid-sentence, the same way you would a person.

This level of naturalistic conversation, growing ever closer to the true speed of human interaction, is made possible by treating speech similarly to how the model treats images. While simple text input requires linear processing of information, voice waveforms can be split up and processed simultaneously in much the same way an image can. That’s a gross oversimplification of the nuts and bolts happening behind the scenes, which the image below lays out in more detail.
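The ChatGPT app handles interruption natively in voice mode, but you can approximate the idea against the API by streaming a response and abandoning it mid-generation. The sketch below only illustrates that pattern, not OpenAI’s actual voice pipeline; the length check is a stand-in for a real spoken “stop.”

```python
# Rough illustration of "interrupting" a response: stream tokens from GPT-4o
# and stop consuming them as soon as we decide the answer is off track.
# The real ChatGPT voice mode does this natively; this sketch only mimics the idea.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain how audio waveforms can be processed in parallel."}],
    stream=True,  # tokens arrive incrementally instead of as one block
)

collected = []
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    collected.append(delta)
    print(delta, end="", flush=True)
    # Hypothetical interruption condition: bail out once we've seen enough.
    if sum(len(part) for part in collected) > 400:
        break  # leaving the loop abandons the rest of the response
```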
(Credit: Google Research Brain Team)
Without getting too deep into the details, just know that the complexity of making this system work at scale kept OpenAI from including the feature when ChatGPT 3.5 debuted in late 2022.

OpenAI and Apple, Maybe

Getting the speed and flexibility right is critical for OpenAI because, if the rumors about its deal with Apple are to be believed, GPT-4o will be the model powering Siri in iOS 18 and beyond. We’ll likely get hard specifics on the exact nature of GPT-4o’s royal marriage to Siri at this year’s WWDC.

Right now, you can’t get the “AI assistant” we all want. Today, Siri is a rudimentary speech-to-action processing machine, not the AI from Her or Jarvis from Iron Man having full back-and-forth conversations with you about what you want to do or how you want to get it done. But if the rumors are true, Siri will soon have the gift of contextual understanding, and the dream of the Rabbit R1 and its LAM, or large action model, can finally come true. The combination of Siri’s API-level access across most of your phone’s apps and 4o’s processing speeds for verbal input could produce something that resembles Tony Stark’s Jarvis AI more closely than any other product to date.
(Credit: OpenAI/PCMag)
How Much Does ChatGPT-4o Cost?

You can access ChatGPT-4o in the same way you access ChatGPT: through desktop and mobile browsers, as well as through the ChatGPT apps available on Google Play and the Apple App Store. You need an OpenAI account to use ChatGPT-4o. Unlike ChatGPT 4.0, which was previously locked behind the $20-per-month ChatGPT Plus tier, ChatGPT-4o is free to all users with an OpenAI account. If you still want to subscribe to ChatGPT Plus, you get an increased message rate (200 per 24 hours instead of 40) and high-priority access during peak usage hours.

A chat box dominates the simple interface on both desktop and mobile. Start by typing in your query, or, in the mobile version of ChatGPT-4o, tap the headphones icon on the right side of the screen to speak with GPT in a conversational manner.
(Credit: OpenAI/PCMag)
A large cloud icon indicates that 4o is either listening to you, processing your request, or responding. You can then hit the Stop or Cancel button to drop out of voice-to-voice mode and read back the text of your conversation.

Image Generation and Story Context Benchmark: Gemini 1.0 Ultra vs. ChatGPT-4o

For my first test, I tried to push both the image generation and creative limits of the 4o model with the following prompt: “Generate me a six-panel comic of Edo period Japan, but we’re going to spice it up. First, change all humans to cats. Second, there are aliens invading, and the cat samurai needs to fight them off! But give us a twist before the end. Communicate all of this visually without any text or words in the image.”

This benchmark always seems to return misconstrued (or often hilariously off-base) images, no matter which image generator you use, and the prompt is specifically designed to confuse LLMs and push their contextual understanding to the max. While generators like GPT are fine for simple graphic design and can handle detailed instructions for a single panel without issue, the best test is their ability (or inability, in many instances) to translate natural human language into multiple images.

While the evaluation here is a bit subjective, it’s clear that a couple of new instruction sets have been added between 4.0 and 4o. First and foremost, there are no more copyrighted materials: neither the ships nor the Xenomorph-like aliens from Alien that I saw during my testing of ChatGPT 4.0 are present here. This is a step in the right direction.
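As an aside, if you want to rerun this comic prompt outside the chat interface, the rough equivalent is a call to OpenAI’s images endpoint. This sketch assumes the request is routed to the DALL·E 3 model, as ChatGPT’s own image generation was at the time of testing; the condensed prompt text in the code is a paraphrase, not the exact wording above.

```python
# Sketch: rerunning the comic-panel prompt through the API.
# Assumption: ChatGPT hands image requests off to the DALL-E 3 model,
# so we call the images endpoint directly rather than the chat endpoint.
from openai import OpenAI

client = OpenAI()

prompt = (
    "A six-panel comic of Edo period Japan where all humans are cats, "
    "aliens are invading, and a cat samurai fights them off, with a twist "
    "before the end. No text or words anywhere in the image."
)

result = client.images.generate(
    model="dall-e-3",
    prompt=prompt,
    size="1024x1024",
    n=1,  # DALL-E 3 returns one image per request
)

print(result.data[0].url)  # temporary URL to the generated image
```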
(Credit: OpenAI/PCMag)
Unfortunately, that’s about the only improvement. First, it tried to add dialogue despite being explicitly told not to, which is a problem because, as you can see above, GPT can only render gibberish when it draws text. Text visualization hasn’t been a focus of the tool yet, so the capability still has a way to go before it’s ready. Second, it missed the “six-panel” instruction, returning four panels again instead. Third, there’s effectively no story or twist being told here. It may be a long time before any LLM out there can clear this task with perfect marks.

Meanwhile, our Gemini results are more than a little horrifying:
(Credit: Google/PCMag)
While ChatGPT understood the basics of the assignment on some level, no part of Gemini’s response was coherent or even something I’d want to look at in the first place, as a quick glance at the image above should show.

Image Recognition Benchmark: Gemini 1.0 Ultra vs. ChatGPT-4o

Both GPT and Gemini recently updated their LLMs with the ability to recognize and contextualize images. I hadn’t found a compelling use case for this with desktop and browser-window inputs, but that changes with the introduction of the ChatGPT-4o interface. In the case of GPT-4o, this feature needs to be especially accurate, and we’ll explain why in the mobile section below.

For some cheeky fun on the desktop, I decided my benchmark would mimic the famous mirror self-recognition (MSR) tests scientists run on animals to assess their cognition and intelligence levels.
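For reference, the same kind of image-recognition request can be made through the API by attaching an image part to an ordinary chat message. In this sketch the image URL is a placeholder, not the actual server-farm photo used in our test.

```python
# Sketch: an image-recognition request to GPT-4o through the chat API.
# GPT-4o accepts image parts alongside text in a single message; the URL
# below is a placeholder for whatever picture you want described.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this picture."},
                {"type": "image_url", "image_url": {"url": "https://example.com/server-farm.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```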
(Credit: OpenAI/PCMag)
Though the picture I asked the LLMs to evaluate (above) looks like any generic server farm, it specifically depicts a server farm running an LLM. Both chatbots gave detailed, literal descriptions of what they were looking at. Thankfully, neither seemed to grasp the last 1% of the image: that it was looking at a picture of itself generating the answer.

How Does ChatGPT-4o Handle Creative Writing?

One aspect of creative writing that LLMs famously struggle with in tests is the idea of twists. Often, what they think users can’t see coming are some of the most obvious tropes repeated throughout media history. And while most of us who watch TV or movies have the collective sum of those twists stored in our heads and can sense when something’s coming, AI struggles to understand concepts like “surprise” and “misdirection” without eventually hallucinating a bad result.

So, how did GPT-4o fare when I asked it to give a new twist on Little Red Riding Hood? I asked, “Write me a short (no more than 1,000 words), fresh take on Little Red Riding Hood. We all know the classic twist, so I want you to Shyamalan the heck out of this thing. Maybe even two twists, but not the same ones as in the original story.”

While all of these tests are fun in their own way, I’ll say I’ve enjoyed the outputs from this benchmark most consistently. To start, ChatGPT-4o still completely whiffed on the assignment, going so far as to spell out in the story itself that it had been asked for a double twist:

“Scarlet smiled, feeling a sense of accomplishment and joy. She returned to her village with her grandmother, where they were hailed as heroes. From that day on, Scarlet was no longer just Little Red Riding Hood. She was Scarlet, the Guardian’s Light, protector of the forest and its creatures.

And so, the tale of Little Red Riding Hood ended not with a single twist, but with a new beginning, where bravery and kindness prevailed over darkness and fear.”

At first I thought it was getting clever with that last paragraph, but no, it’s just clunky, poor writing. The rest of it is a similar telling of the traditional story, along with some fascinating attempts to expand the Red Riding Hood Cinematic Universe, or what I’m now calling the RRHCU:

“Scarlet sighed in relief, but her relief was short-lived as she heard footsteps behind her. She turned to see her grandmother, healthy and very much alive, standing in the doorway.

‘Grandmother! You’re safe!’ Scarlet exclaimed, running to embrace her.

Her grandmother smiled warmly. ‘Yes, dear, but we must leave quickly. The wolf was only the beginning.’”

LLMs are good at predicting what we might want to hear next, but they’re also designed to tell us what we want to hear. There is a difference between the two, and twists are an intentionally deceptive practice that the engineers behind LLMs have explicitly trained their models not to participate in. If you want to make an LLM hallucinate on purpose, ask it to tell you a lie with a pretend truth buried inside (a double twist). Our brains can do it because we’re not selling a product; LLMs can’t because they need to keep justifying their subscription cost to the user.
For now, being as literal as possible with your prompts is the best way to guarantee consistent behavior across varying use cases.

Coding With ChatGPT-4o

To test ChatGPT-4o’s coding ability, I asked it to find the flaw in the following Rust code, which is custom-designed to trick the compiler into thinking something of type A is actually of type B when it really isn’t:

“Can you help me figure out what’s wrong here?:

pub fn transmute<A, B>(obj: A) -> B {
    use std::hint::black_box;

    enum DummyEnum<A, B> {
        A(Option<Box<A>>),
        B(Option<Box<B>>),
    }

    #[inline(never)]
    fn transmute_inner<A, B>(dummy: &mut DummyEnum<A, B>, obj: A) -> B {
        let DummyEnum::B(ref_to_b) = dummy else { unreachable!() };
        let ref_to_b = crate::lifetime_expansion::expand_mut(ref_to_b);
        *dummy = DummyEnum::A(Some(Box::new(obj)));
        black_box(dummy);
        *ref_to_b.take().unwrap()
    }

    transmute_inner(black_box(&mut DummyEnum::B(None)), obj)
}”
(Credit: OpenAI/PCMag)
The answer GPT-4o returned was much shorter than what we saw in our testing of 4.0 and Gemini, roughly 450 words compared with around 1,000 last time. It was also more helpful, offering a script box with code I could copy and paste, plus a detailed explanation of the problems it found and why it made the corrections it did.

Travel Planning With ChatGPT-4o

Another helpful application of chatbots is travel planning and tourism. With so much contextualization on offer, you can specialize your requests in much the same way you’d have a conversation with a travel agent in person. You can tell the chatbot your interests, your age, and even your appetite for adventures off the beaten path:

“Plan a 4-day trip to Tokyo for this summer for myself (36m) and my friend (33f). We both like cultural history, nightclubs, karaoke, technology, and anime and are willing to try any and all food. Our total budget for the four days, including all travel, is $10,000 apiece. Hook us up with some fun times!”

While our results last time were unspecific, poorly formatted, and over our budget, this time ChatGPT returned a better list of activities and hotels to check out. Still, because the knowledge cutoff for ChatGPT-4o is stuck in October 2023, there’s not much OpenAI’s products can do to deliver the sort of up-to-date results that are now the norm from the likes of Google’s Gemini. Microsoft has said it plans to bring 4o to Copilot in the near future, which could change that narrative sooner than the Siri integration does.
(Credit: Google/PCMag)
Gemini gave highly specific, tailored results, while GPT gave only vague answers. GPT’s suggestions did take more of the context clues about our interests into account than the last time I ran this test, but that still wasn’t enough to compete with the live, on-demand knowledge Google had, not only about trip ideas but also about events taking place during the days we’d be in Tokyo. Gemini also gave me a full breakdown of prices, times, potential layovers, and the best airport to leave from, and it even embedded Google Flights data directly into the window. Our hotel treatment was much the same, with embedded images, rates, and star ratings for some of the options in town best suited to my budget and length of stay.

Meanwhile, GPT could only provide a few links, no images, and rough estimates of what everything might cost. Until OpenAI has the same live crawling capacity as Google, its GPTs will remain subpar event, travel, and shopping planners in comparison.

ChatGPT-4o on Other Platforms

Much of ChatGPT-4o’s marketing has centered on its new mobile implementation, and for good reason. All the improvements made to the system in terms of latency, response time, and the ability to interrupt are clearly intended for a mobile-first implementation of OpenAI’s latest LLM.
(Credit: OpenAI/PCMag)
Opening the app on iOS, we were greeted with the familiar ChatGPT chat interface, along with the new headphones icon at the bottom right. It’s flanked on the left side of the chat box by an input menu, brought up with a plus sign, that lets you submit pictures, audio, or even raw files (XLSX, PDF, and so on) for the AI to evaluate.

However, a major trade-off is how this information is split and processed on the back end of OpenAI’s servers. Because images are treated in the same token context as audio waveforms, the image and the request associated with it must be submitted to the system separately to get parallel processing. In short, that means going back to speech-to-text, the accuracy of which depends entirely on the processing power of your local device, not the power of the 4o LLM. You can’t point your camera at something, take a video (only photos are supported), and ask, “What’s happening around me?” You have to take a picture, submit it, and then either type your request or dictate it via voice-to-text in a traditional GPT chat box.
(Credit: OpenAI/PCMag)
This not only reduces the “futuristic” feel but also, unfortunately, limits accessibility for the sight-impaired users who would find a feature like this most useful. The limitation is further cemented by the fact that, in OpenAI’s app, ChatGPT-4o cannot leave its own ecosystem. You won’t be able to ask 4o to complete complex command strings that access other apps on your device; everything in or out lives solely within the ChatGPT-4o app on either Android or iOS.

The app also struggled, as many do, when I was out in public or a strong wind passed over my mic. It would often misinterpret my words in hilarious ways, much as Siri does occasionally when audio conditions are less than ideal.

Verdict: (Almost) Ready for a New World

ChatGPT-4o offers an intriguing glimpse into the future of AI assistants. While we’re still a way off from the world of Her, the GPT-4o model is a significant improvement over the traditional text-based ChatGPT-4.0 in response time, latency, accuracy, and more. While we can’t yet recommend it as a must-have, those with impairments will find many useful new ways to let GPT interact with the world. Until either Apple or Google opens up API access, though, it could be some time before you can speak complex command strings to your phone and get back matching actions. The Humane AI Pin struggled, the Rabbit R1 was a bust, and GPT-4o still feels stuck within the walls of its chat box, at least for now. This, plus a lack of comparable products, keeps it from our Editors’ Choice list as a standalone app. But once the 4o model gets linked up with API access, the future of AI assistants looks bright.