Musing on OpenAI’s “Device” 

Intro

In May, OpenAI went public with its vague plan to "do hardware". Loving your own brand of farts doesn't begin to cover the sheer level of hubris radiating from this PR-laminated, varnished and polished turd of a "blog" post. Also, the video cost $3 million to make. In this post I'm going to dissect the effluence and try to divine the direction I think OpenAI is going in.

This post will cover some of the hype surrounding this announcement, and then assess the “state of the art” to figure out a possible direction. As with everything flowing from OpenAI, you need to engage critical faculties on this page as well.

What I’ll be covering in this post:

  • Hardware is hard
  • It's all about context, you prick
  • Power budgets are hard
  • Altman is a dipshit
  • Physics thinks you're a naive fool (that's for another day)

Who the fuck made you qualified to question our lord and master Altman?

You are wise to ask. I, until very recently, used to work at a world-leading lab specialising in making computers understand the world. Moreover, I spent a good amount of time trying to make "AI" assistant glasses practical.

Let us out our biases

I want to be frank with you: this is not going to be a page filled with fawning praise. I think Altman is a personality vacuum, a person who would make a very large song and dance about buying a round that one time. I dislike him. Yes, I know that's an ad hominem, but, my dear reader, this is my website, and I'll piss on whomsoever's chips I see fit.

Where I think Altman is most flawed is in his thinking. The man is incapable of original thought. He has a paper-thin grasp of history, and almost no understanding of how an economy works. You would think that someone with access to one of the largest corpuses of human knowledge would actually try to explore some of it. But alas, no: Altman can only think about how clever he is.

But there is another character in OpenAI's blog post. It's unusual for Altman to share the spotlight. The other person is Johnny Ive. Ive is a good designer. He understands and lives "narrative über alles". In the $3m video that accompanies the blog post, what Johnny says is quite prescient: "if you want to see where you are going to end up, you shouldn't look at the technology, you should look at the people [..] and look at their values".

Ive might be all about human-centred design; Altman is all about techno-feudalism. Altman saw Minority Report and thought "fuck yeah, I want a bit of that". However, let's actually get into the meat of the matter:

Get in loser, we're doing Hardware

Yes, let's try and work out what hardware they are trying to make. Watching the video, once you get over the feeling that Altman is being dubbed by a voice actor, we learn that Ive has made a prototype and it's "the best thing in the world". Altman goes on to talk about how getting answers from chatGPT requires opening up a browser, typing in what the conversation was, and asking a question.

This is all very high friction. 

So let us dive into how the prototype hardware might work, and the constraints that'll shape the design decisions/functionality. But before we do, we need to talk about context.

What the fuck is Context?

What do I mean by context? Generally, if you structure your question to an LLM like "why does no-one love me", it gives you a useless generic reply. To get a good answer you need to supply some information, then ask a question: "I was at a Barbie-themed party last night dressed as an emo kid, complaining loudly about how much I hate the Vengaboys. Why did no-one talk to me?" The first part of the text is the "context", the extra info that makes the question easier to answer.

It helps to take a moment to think about how your day-to-day memory functions. I know it's a bit existential to do, but it's well worth the time to introspect. It will give you a better understanding of why LLMs need "context", and why they talk in generic business speak.

If I’m on the phone to you, and the first thing I ask you is “what do I do with this”, you’re basically limited to guessing. Why? Because you’re not able to see what “this” is. If you’re on a video call and see me holding a pepper and a knife, in my kitchen looking confused, you might be able to infer that I need help on cutting a pepper. All of that extra information is context. As humans we naturally look at the world around us to create context, it happens automatically.

But, communicating context is expensive. As humans we use words like “this” to act as variables so we don’t have to describe complex objects every time we want to talk about something. “I need to fry this?” as opposed to “I need to fry the chopped onions, carrots and meat?” Humans are lazy, so spending effort communicating the obvious (to us) is taxing and boring. So we take shortcuts that rely on shared context. 

Context and Time == Funtimes 

Where it gets fun is when we start talking about past context, because that requires memory. That's where we end up slap bang in the middle of theory-of-mind shit. It's also where current LLMs really start to fall apart.

I'm not a psychologist, so these assertions about memory might stray into Malcolm Gladwell-level bollocks. But broadly speaking you have a spectrum of memory types: long-term memory (that time you burnt yourself on a candle, your first kiss, that one time your mum said something about your favourite toy, etc.); medium-term memory (the correct time to wait before retrying the flush in the toilet you use once a week); and short-term working memory (the thing you are working on now, and the next steps to get to what good looks like).

It's far more complex than this, but you need to understand that human memory is a spectrum from indelible to ephemeral. (Well, it's not, but let's not split hairs. Emotion also plays a large part, but we ain't got time for that shit, and the precise mechanism for memory-making is not well understood. There are also many other types of memory not covered here. Please consult your nearest cognitive psychologist.)

However, the important takeaway is that LLMs have excellent read-only long-term memory, and limited ephemeral memory. There is nothing in between.

When you are working on a task, you are combining long-term knowledge with working (i.e. ephemeral) and medium-term memory to reach a goal. Your brain will dynamically update/log new and interesting information based on how important those events were to you. There are a number of voices (agents, in LLM parlance) that plan your next move, assess outcomes, plan next steps, and work out if the current goal has been reached. All of those processes are dynamic.

This is a marked difference from LLMs, which have excellent long-term memory but very limited short-term memory. There is currently no way to dynamically update long-term memory. Anything you want them to remember and do needs to fit in a smallish amount of memory.

That memory needs to hold all the instructions to execute the current task, the success criteria, and any intermediate data. If you push too hard, it loses context and witters on like an American businessman who's been asked a hard question on a subject he knows nothing about. The result is a stream of generic bullshit.

Basically it's Memento, but with less murder.
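To make that "smallish amount of memory" concrete, here's a toy sketch of what context budgeting looks like in practice. It counts tokens with OpenAI's tiktoken library and greedily packs the most recent memories until the budget runs out; the 8,000-token budget and the packing strategy are my own assumptions, purely for illustration.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 8_000  # tokens -- an assumed budget, purely illustrative

def build_prompt(instructions: str, memories: list[str], question: str) -> str:
    """Greedily pack the newest memories until the token budget runs out."""
    remaining = CONTEXT_BUDGET - len(enc.encode(instructions)) - len(enc.encode(question))
    kept = []
    for chunk in reversed(memories):            # newest first
        cost = len(enc.encode(chunk))
        if cost > remaining:
            break                               # everything older is simply forgotten
        kept.append(chunk)
        remaining -= cost
    return "\n".join([instructions, *reversed(kept), question])

Everything that falls off the end of that list is gone as far as the model is concerned, which is exactly the Memento problem.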

Context is king.

Now that we better understand what context is, we can get back to the point.

To do what Altman wants in his video, a device that you can just ask things and get proper answers from, requires context. Let's take the example he uses in the video: "I want to ask a question about the conversation I've just had with Johnny".

To do that now, he'd need to pull out a device, type in what he said, and then ask the question. Even then it might not be that relevant, because chatGPT might not have all the bits of conversation that would help answer the question. Human memory isn't reliable; when you replay a conversation you begin to lose detail.

So how do we stop that loss of information/context? Record everything. To do what Altman wants, the prototype device needs to record every utterance you make, where you made it, what was near you, what you were holding, what you were looking at, and who you were talking with. When I say everything, I mean everything.

But why is that level of privacy invasion needed? Because the machine needs that context to work out how best to reply with the relevant answer. Take the question: "what did Ive say about the product that made me say 'wow, that's awesome'?". The machine needs to know what Ive said, the history of previous conversations that might have been referenced, and the cultural context of the meeting.

If we think of all the bits of context we need to actually answer that question, you can begin to see the scale of the product they are proposing. 

“What did Johnny say”

Firstly we need to know what Johnny said, so we need always-on transcription. Whisper does this well, as do a number of proprietary models. That only gets us what was said, not who said it; for that we need "diarization". On its own, diarization will only give you a speaker ID, not a name, so you need another mechanism to provide a unique identity.

I'm not that up to date with audio fingerprinting, so I assume there is a way to do this with audio only, but that's pretty limiting. Given that cameras are so cheap, if you use facial recognition you not only get a reliable way to work out who is speaking, you also get where they are speaking from. This allows you to isolate their voice from the background, along with lots of other context cues.
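If you want a feel for what the audio half of that pipeline looks like, here's a rough sketch that glues Whisper (for the words) to pyannote's speaker diarization (for "who spoke when"). The model names and the crude midpoint-matching heuristic are my assumptions; note that the output still contains anonymous speaker IDs, which is exactly why you need the face recognition step to attach names.

import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("base")
# The pyannote pipeline needs a (free) Hugging Face access token to download.
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

def transcribe_with_speakers(wav_path: str):
    segments = asr.transcribe(wav_path)["segments"]
    turns = [(turn.start, turn.end, speaker)
             for turn, _, speaker in diarizer(wav_path).itertracks(yield_label=True)]
    for seg in segments:
        mid = (seg["start"] + seg["end"]) / 2
        # Attribute each transcript segment to whichever speaker turn contains its midpoint.
        who = next((spk for start, end, spk in turns if start <= mid <= end), "UNKNOWN")
        yield {"start": seg["start"], "speaker": who, "text": seg["text"].strip()}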

So now we are recording what is being said, by whom, where each speaker is looking, and what they are holding. With those sensors and processing systems (transcription, facial recognition, gaze detection, pose detection) we can get a stream of data that looks like this:

{"timestamp": 1749224525,
 "actor": "Altman",
 "location_name": "Generic expensive coffee shop",
 "position": [37.79001709486111, -122.40921084722736],
 "gaze_vector": {...quaternion...},
 "pose": {...quaternion...},
 "holding": {"id": 2342543234, "name": "coffee mug"},
 "looking_at": {"id": 656434353, "type": "human", "name": "Ive"},
 "transcription": "what did he say that made me say 'wow that's awesome'"}

Once we have a rich stream of data like that, we can begin to answer complex human questions. If we dump it into a database, we can start to retrieve context fairly easily. Altman's initial pitch, "I want to ask something about the conversation I had with Johnny", suddenly becomes an SQL statement.

We can select all the text from when Johnny and Altman were together, dump it in as context, and ask the question that Altman has given. This approach gets you 90% of the way to "magical". It sure as hell impressed me when I first saw it. The hard part is getting good-quality data into the database in the first place.
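As a minimal sketch of that retrieval step, assume the JSON events above have been dumped into a SQLite table called events (the database path, table and column names are invented for illustration): pull every utterance in a time window where both parties were speaking, flatten it into a transcript, and paste it into the prompt as context.

import sqlite3

def conversation_between(db_path: str, a: str, b: str, since_ts: int) -> str:
    """Crude approximation of 'all the text where a and b were together'."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT timestamp, actor, transcription FROM events "
        "WHERE timestamp >= ? AND actor IN (?, ?) ORDER BY timestamp",
        (since_ts, a, b),
    ).fetchall()
    conn.close()
    # Flatten into something we can paste straight into the LLM prompt.
    return "\n".join(f"{actor}: {text}" for _, actor, text in rows)

context = conversation_between("lifelog.db", "Altman", "Ive", since_ts=1749220000)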

Hang on, didn’t you prattle on about LLMs having memory problems?

Well observed; you're right, it does. But what we can do is give the LLM instructions on how to find the data it needs to arrive at an answer. Effectively, we give the LLM a search engine and some clues on how the data is laid out.
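The "search engine plus clues" bit usually amounts to handing the model a tool description in the JSON-schema style that function-calling LLMs expect, and then executing whatever queries it asks for. The schema below and the search_events helper are illustrative assumptions, not anything OpenAI has documented.

import json

SEARCH_TOOL = {
    "name": "search_events",
    "description": "Search the recorded life-log; returns utterances with timestamps and speakers.",
    "parameters": {
        "type": "object",
        "properties": {
            "actor": {"type": "string", "description": "Filter by speaker name"},
            "after_ts": {"type": "integer", "description": "Unix timestamp lower bound"},
            "text": {"type": "string", "description": "Substring to look for"},
        },
        "required": [],
    },
}

def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run whatever search the model asked for and hand the rows back as context."""
    args = json.loads(arguments_json)
    if name == "search_events":
        return search_events(**args)   # hypothetical helper wrapping the SQL query above
    raise ValueError(f"unknown tool: {name}")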

Where this approach falls down is for queries like "how much cooking does this need?". We need some sort of visual describer to divine what "this" is. Using cameras, you can infer what people are holding by combining body pose with segmentation and a VLM to describe the object. This quickly gets expensive to do with any kind of resolution. It's cutting-edge stuff; you'll need to look at conference papers to see what kind of capabilities are being developed right now.
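For a flavour of what "work out what 'this' is" involves, here's a crude sketch: find a hand with MediaPipe, crop a generous box around it, and ask an off-the-shelf captioning model (BLIP) to describe the crop. The model choices and the crop-and-caption heuristic are my assumptions; real systems use proper segmentation and tracking, which is precisely where the compute bill explodes.

import cv2
import mediapipe as mp
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

def describe_held_object(bgr_frame):
    """Return a text guess at what the person in the frame is holding, or None."""
    h, w = bgr_frame.shape[:2]
    rgb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB)
    found = hands.process(rgb)
    if not found.multi_hand_landmarks:
        return None
    lm = found.multi_hand_landmarks[0].landmark
    xs, ys = [p.x * w for p in lm], [p.y * h for p in lm]
    pad = 0.5 * max(max(xs) - min(xs), max(ys) - min(ys))    # widen the box past the hand
    x0, x1 = int(max(min(xs) - pad, 0)), int(min(max(xs) + pad, w))
    y0, y1 = int(max(min(ys) - pad, 0)), int(min(max(ys) + pad, h))
    crop = Image.fromarray(rgb[y0:y1, x0:x1])
    return captioner(crop)[0]["generated_text"]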

However, even if you have the best algorithms in the world, you have a more fundamental problem: what if you have your back to the "device" when you are talking? Congratulations, you've discovered occlusion. This is where people or things get in the way of the camera, making it impossible to directly observe what's going on.

Conclusions so far

  • I think Altman is a prick. 
  • LLMs have a memory problem.
  • LLMs really don't understand the world outside.
  • To make sci-fi AI assistants, we need to make the "machine" understand how humans see the world.

With that out of the way, let’s move on to what I think these “new paradigms” of computer interfaces will look like. 

Product Idea prediction 1:

I'm going to assume the "magical prototype" product is basically a hacked Amazon Echo Show. It has all the things needed to listen in, and enough processing power to stream audio/video to a backend that can transcribe it and store it in a database.

Of course, how it stores and manages your data is another matter completely. Given how lax OpenAI is about security, I suspect you wouldn't want this in your house. Altman has got previous; combine that with his spectacularly flexible morals, and it's not going to be a secure product.

So how will it work? As I described earlier, it'll need to continually record any noise, spoken word or utterance, and stream it to OpenAI to do the heavy lifting. It will also be videoing everything you do.

You'll then be able to ask it things like "what did Ive say about that new TV series?" or "what was that new TV show that Ive talked about last night?" Those become simple questions to answer. It can also summarise what was said and divine actions over a longer period. With this amount of data, it allows questions like "Is there anything I need to do today?". So a bullshit-answer machine morphs into an executive assistant. Slam dunk, excellent product, ship it, right?

Well, no. Running it continuously requires enormous amounts of CPU and GPU power. Sure, the actual device might not use that much power, but the compute required to accurately transcribe, diarise, describe and track the people/objects in range is huge. As in $10k-a-month huge.

It's also not mobile, so you're going to have to fill it in at the end of the day on what happened while you were out. However, without a massive investment in hardware, you're not going to do better.

Crucially, it's something that a small team could smash out in a month or two. That's enough time to hack the hardware, glue together some APIs and databases, and polish the interface enough to impress an intellectual lightweight such as Altman.

Product Idea prediction 2:

To be *really* magical, you need a device that is always on, always with you, and sees what you see. That's right, it's AR glasses time. Silicon Valley is somewhat obsessed with AR glasses, I assume because they don't like the idea of fake eyeballs.

Now, the glasses form factor is *really* hard to do well. Even Apple can't do it yet. Glasses have to be small, light and cool to the touch. This means you have almost no battery power, and even less processing power.

I've hinted at the enormous amounts of power needed, so let's talk about power now. Even with improvements in battery technology, your glasses will only have around 1-2 watt-hours of energy. For reference, that's about enough to run a single chatGPT server for 1-4 seconds.
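For the avoidance of doubt, here's the arithmetic behind that "1-4 seconds" claim, assuming a GPU inference server draws somewhere in the 1-3 kW range (my assumption):

WATT_HOURS = 1.0                                 # energy in the glasses' battery
for server_watts in (1_000, 2_000, 3_000):
    seconds = WATT_HOURS * 3600 / server_watts   # 1 Wh == 3600 joules
    print(f"{server_watts} W server: {seconds:.1f} s of runtime")
# 1 kW -> 3.6 s, 3 kW -> 1.2 s. The glasses battery is a rounding error.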

So, for always on(12 hours), wearable, comfortable glasses you have three options:

  1. Cables
  2. Custom silicon
  3. A fucktonne of custom silicon
  4. Magic battery tree

Here's something that illustrates *just* how hard the glasses form factor is:

You see that pathetic recording light? It's not very bright, is it? That's because it's burning through about 30mW of power. Doesn't sound like much, does it? However, if your power budget is 83mW, it suddenly becomes a big problem (1 watt-hour spread over 12 hours is about 83mW). You could have a brighter LED, but your battery life could halve, for no perceivable gain. (Well, that depends on your viewpoint; also, for these glasses the power budget is a bit higher, so don't write in.)
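Spelled out, the arithmetic from that paragraph looks like this:

BATTERY_WH = 1.0      # watt-hours available in the glasses
DAY_HOURS = 12        # "always on" for a working day
LED_MW = 30           # that pathetic recording light

budget_mw = BATTERY_WH / DAY_HOURS * 1000                  # ~83 mW average power budget
print(f"average power budget: {budget_mw:.0f} mW")
print(f"the LED alone eats:   {LED_MW / budget_mw:.0%}")   # roughly a third of everything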

There are other interesting problems with that kind of power budget: writing to SSD becomes an expensive operation, Wi-Fi will eat your battery in 30 minutes, running Android or another full OS all the time will reduce your battery life by >40%, and even dumping stuff in RAM needs to be thought about. Then there are all the bits about offload devices and such. All things way above my pay grade.

That problem gets much bigger when you enumerate what else those glasses need to do in that tiny power budget:

  • Run at least 2 SLAM cameras (cameras that do magic to work out where you are in 3D space, "localisation" as it's called in computer vision circles)
  • Combine the SLAM cameras with gyros and a compass and feed it all into a SLAM system to get an estimated position
  • Run eye-gaze cameras to see which way your eyes are pointing
  • Run an ML model to work out where your eyes are actually pointing.
  • Run a front facing colour camera to see what your eyes can see. 
  • Record audio from multiple microphones
  • Transcribe that audio
  • Write all this data to storage
  • Upload that data to *the cloud*
  • Run a speaker to answer any questions/play music
  • Run some sort of ML model to work out what you're looking at
  • Run another model to work out what you're holding
  • Run another model to work out what kind of room you're in.

Assuming you have all those features, you have a set of glasses that can probably record enough information; the only problem is that you don't have enough power to process it. You also don't have enough power to upload it, so you need to make compromises. It turns out that you can throw a shit tonne of info away and still get usable data. But that requires custom silicon, and a fucktonne of it.

But where is all that data streaming to? Well, it has to be somewhere within range of either a cable or Bluetooth LE. Magic Leap, Vision Pro and Meta's Orion all have a massive battery/processing unit that's roughly the same volume as a chonky portable charger. With work, that's probably enough space for most of the silicon and battery required to process and record all the audio, gaze and navigation data. The visual stuff might need an upload later for refinement.

Notice that I've not included a display? Yeah, that shit is even harder. You need to bend the light so much, and use so many difficult-to-machine materials, that it's hilariously expensive to make (the prototypes I saw took 12 hours to machine just the frames; the lenses were made from some sort of carbide, so I suspect they cost 10x more). Even then, those displays are dull, full of funny colours, and have lots of artefacts if you don't align them correctly (think late-90s colour LCD, but with a rainbow hologram sticker on the front of the screen).

Quick recap: 

Fully aware glasses are very hard to do.

It's not impossible; you're just going to need to use cables, or spend a good number of years baking computer vision techniques into silicon. (Meta put a full SLAM stack into its research glasses, but you can't do much else while it's running. Although the second generation has more sexy bits in it.)

So what is OpenAI going to do?

I strongly suspect that streaming audio to your phone is achievable now with mostly off-the-shelf parts. You'll not get room-accurate position, or context of what's in the room or what you're holding, short of a low-resolution picture every n seconds. But perhaps that's enough to impress dipshits like Altman. It will answer your questions, and it will have a lot of the context of what you've been doing. It's basically Ray-Ban Metas, but probably with cables.

Conclusions:

The headline is that "proper" AI assistants are not going to be here for a good number of years. Better Amazon Alexas will probably arrive soon, but either they'll bankrupt OpenAI trying to process that much data, or the subscription will be >$400 a month.

Sure, convincing prototypes exist, and they are something that keen hobbyists could make. However, a reliable, secure, useful mass-market product will remain elusive until the cost of running the many ML models needed to understand the world drops to reasonable levels.

Now, I am not an optimist, and especially not a tech optimist. However, the iPhone 16 has the same graphics power as the Xbox One from 2013. Given that the Xbox One draws about 120 watts, and the iPhone 16 has a 3,582 mAh battery, it's not inconceivable that a practicable, usable AR AI assistant will arrive inside 10 years.