I’m optimistic about humanoid robotics, but I’m curious about the reliability issue. Biological limbs and hands are quite miraculous when you consider that they are able to constantly interact with the world, which entails some natural wear and tear, but then constantly heal themselves.
Industrial robots, at least, are very reliable; MTBF is often upwards of 100,000 hours[0]. Industrial robots are optimized to be as reliable as possible because the longer they last and the less often they need to be fixed, the more profitable they are. In fact, German and Japanese companies came to dominate the industrial robotics market because they focused on reliability. They developed rotary electric actuators that were more reliable. Cincinnati Milacron (US) was outcompeted in the industrial robot market because although their hydraulic robots were strong, they were less reliable.
I am personally a bit skeptical of anthropomorphic hands achieving similarly high reliability. There are just too many small parts that need to withstand high forces.
[0]https://robotsdoneright.com/Articles/what-are-the-different-...
It gets either very exciting or very spooky, thinking about the possibilities in the near future.
I had always assumed that such a robot would be very specific (like a cleaning robot) but it does seem like by the time they are ready they will be very generalizable.
I know they would require quite a few sensors and motors, but compared to self-driving cars their liability would be less and they would use far less material.
The exciting part comes when two robots are able to do repairs on each other.
I think this is the spooky part. I feel dumb saying it, but is there a point where they are able to coordinate and build a factory to build chips/more of themselves? Or other things entirely?
Of course there is
2 bots 1 bolt ?
Consumable components could be automatically replaced by other robots.
I think those problems can be solved with further research in materials science, no? Combine that with very responsive but low-torque servos, and I think this is a solvable problem.
It's a simple matter of the number of motors you have. [1]
Assume every motor has a 1% failure rate per year.
A boring wheeled roomba has 3 motors. That's a 2.9% failure rate per year, and 8.6% failures over 3 years.
Assume a humanoid robot has 43 motors. That gives you a 35% failure rate per year, and 73% over 3 years. That ain't good.
And not only is the humanoid robot less reliable, it's also 14.3x the price - because it's got 14.3x as many motors in it.
[1] And bearings and encoders and gearboxes and control boards and stuff... but they're largely proportional to the number of motors.
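To make the compounding explicit, here's a minimal sketch of the math above, assuming independent failures at a flat rate per motor per year (the same simplification as in the footnote):

```python
# Probability that at least one of n motors fails within t years,
# assuming each motor independently fails with probability p in any given year.
def system_failure_prob(p: float, n_motors: int, years: int) -> float:
    return 1.0 - (1.0 - p) ** (n_motors * years)

for name, n in [("roomba", 3), ("humanoid", 43)]:
    print(f"{name}: {system_failure_prob(0.01, n, 1):.1%}/year, "
          f"{system_failure_prob(0.01, n, 3):.1%} over 3 years")
# roomba: 3.0%/year, 8.6% over 3 years
# humanoid: 35.1%/year, 72.7% over 3 years
```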
Would it be possible to reduce the failure rates?
The 1%/year failure rate appears to just be made up. There are plenty of electric motors that don't have anywhere near that failure rate, at least during the expected service life (failure rates will probably hit 1%/year or higher eventually).
For example, do the motors in hard drives fail anywhere close to 1% a year in the first ~5 years? Backblaze data gives a total drive failure rate around 1% and I imagine most of those are not due to failure of motors.
Yes, obviously that 1% figure is a simplification. Of course not all motors are created equal, and neither are all operating conditions!
But the neat thing about my argument is it holds true regardless of the underlying failure rate!
So long as your per-motor annual failure rate is >0, 43x it will be bigger than 3x it.
To an extent, yes.
For example, an industrial robot arm with 6 motors achieves much higher reliability than a consumer roomba with 3 motors. They do this with more metal parts, more precision machining, much more generous design tolerances, and suchlike. Which they can afford by charging 100x as much per unit.
Also, factory robot arms are probably operating in clean, dry environments? How would working in a muddy/dusty/wet environment change this?
I'm interested in how differences between robots develop over time. There are a lot of machines in this world that have been patched or "jimmied up" to continue working. Take a mining robot: it would probably get quite heavily contaminated with dust, wear would occur in different places, and rock falls might bend parts.
So even though another robot could probably do the "jimmying up", it seems like over time the robots will "drift" into all being a bit different.
Even commercial airliners seem to go through fairly unique repairs from things like collisions with objects, tail strikes, etc.
Maybe it's just easier to recycle robots?
Does anyone know how easy it is to join the "trusted tester program", and whether they offer modules that you can easily plug in to run the SDK?
There's a sign up button at the bottom of the article...
What sort of hardware does the SDK run on? Can it run on a modern Raspberry Pi?
According to the blog post, it requires an NVIDIA Jetson Orin with at least 8GB RAM, and they've optimized for Jetson AGX Orin (64GB) and Orin NX (16GB) modules.
Could you quote where in the blog post they claim that? CTRL+F "Jetson" gave no results in TFA.
Yeah they didn't really mention anything, I was almost getting my hopes up that Google might be announcing a modernized Coral TPU for the transformer age, but I guess not. It's probably all just API calls to their TPUv6 data centers lmao.
You can think of these as essentially multimodal LLMs, which is to say you can have very small/fast ones (SmolVLA, 0.5B params) that are good at specific tasks, and larger/slower, more general ones (OpenVLA, a fine-tuned Llama 2 7B). So an RPi could be used for some very specific tasks, but even the more general ones could run on beefy consumer hardware.
Nice. I work with some students younger than 13, so most cloud LLMs are quite tricky to work with; local-only models are nice for this use case. I will try this as a replacement for ChatGPT for computer vision in robotics platforms like Lego Mindstorms.
The MuJoCo link actually points to https://github.com/google-deepmind/aloha_sim
mujoco_menagerie has MuJoCo MJCF XML models of various robots.
google-deepmind/mujoco_menagerie: https://github.com/google-deepmind/mujoco_menagerie
mujoco_menagerie/aloha: https://github.com/google-deepmind/mujoco_menagerie/tree/mai...
What is the model architecture? I'm assuming it's far away from LLMs, but I'm curious about knowing more. Can anyone provide links that describe architectures for VLA?
Actually very close to one I'd say.
It's a "visual language action" VLA model "built on the foundations of Gemini 2.0".
As Gemini 2.0 has native language, audio and video support, I suspect it has been adapted to include native "action" data too, perhaps only on output fine-tuning rather than input/output at training stage (given its Gemini 2.0 foundation).
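For a concrete sense of what "native action data" tends to mean in practice: RT-2 and OpenVLA discretize each continuous action dimension into a fixed number of bins and treat the bin indices as ordinary tokens, so actions become just another token sequence the model learns to emit. A rough sketch of that binning idea, with illustrative ranges and bin counts (not Gemini's actual scheme, which isn't public):

```python
import numpy as np

N_BINS = 256  # RT-2/OpenVLA use 256 bins per action dimension

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map continuous action dims (dx, dy, dz, droll, dpitch, dyaw, gripper)
    to integer bin ids the LLM can emit as ordinary tokens."""
    clipped = np.clip(action, low, high)
    return np.round((clipped - low) / (high - low) * (N_BINS - 1)).astype(int)

def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Invert the binning on the robot side (loses sub-bin precision)."""
    return low + tokens / (N_BINS - 1) * (high - low)

a = np.array([0.12, -0.40, 0.05, 0.0, 0.0, 0.30, 1.0])  # toy arm command
t = action_to_tokens(a)
print(t)                    # [143  76 134 128 128 166 255]
print(tokens_to_action(t))  # approximately the original action, quantized
```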
Natively multimodal LLM's are basically brains.
> Natively multimodal LLM's are basically brains.
Absolutely not.
OpenVLA is basically a slightly modified, fine-tuned Llama 2. I found the launch/intro talk by the lead author to be quite accessible: https://www.youtube.com/watch?v=-0s0v3q7mBk
A more modern one, smolVLA is similar and uses a VLM but skips a few layers and uses an action adapter for outputs. Both are from HF and run on LeRobot.
https://arxiv.org/abs/2506.01844
Explanation by PhosphoAI: https://www.youtube.com/watch?v=00A6j02v450
The only way to prevent robots from being jailbroken and set to rob banks is to move GPUs to private SOTA secure GPU clouds
I continue to be impressed by how Google stealth-releases fairly groundbreaking products and then (usually) just kind of forgets about them.
Rather than advertising blitz and flashy press events, they just do blog posts that tech heads circulate, forget about, and then wonder 3-4 years later "whatever happened to that?"
This looks awesome. I look forward to someone else building a start-up on this and turning it into a great product.
Because the whole purpose of these kinds of projects at Google is to keep regulators at bay. They don't need these products in the sense of making money from them. They will just burn some money and move on, exactly the way they have hundreds of times before. But what kind of company has such a free pass to burn money? The kind of company that is a monopoly. Monopolies are THAT profitable.
These are going to be war machines, make absolutely no mistake about it. On-device autonomy is the perfect foil to escape centralized authority and accountability. There’s no human behind the drone to charge for war crimes. It’s what they’ve always dreamed of.
Who’s going to stop them? Who’s going to say no? The military contracts are too big to say no to, and they might not have a choice.
The elimination of toil will mean the elimination of humans all together. That’s where we’re headed. There will be no profitable life left for you, and you will be liquidated by “AI-Powered Automation for Every Decision”[0]. Every. Decision. It’s so transparent. The optimists in this thread are baffling.
0: https://www.palantir.com/
MIT spinoff and formerly Google-owned Boston Dynamics pledged not to militarize their robots. Which is very hard to believe given they were backed by DARPA, the Pentagon's advanced research agency.
This pledge would last five seconds in an actual conflict, if it makes it even that far.
Was owned by Google. Then Softbank. Now Hyundai.
Militarize is just bad marketing. Call them cleaning machines and put them to work on dirty things.
How would these things be competitive with drones on the battlefield? They probably cost the equivalent of 1,000 autonomous drones and 100x the time and materials to make, and they'd need far more power to operate, too.
Terminator is a good movie but in reality, a cheap autonomous drone would mess one of those up pretty good.
I've seen some of the footage from Ukraine: drones are deadly and efficient, and they are terrifying on the battlefield. Even though those robots will get crazy maneuverable, it's going to be pretty hard to outrun an exploding drone.
Maybe the Terminators will have shotguns, but I could imagine 5 drones per Terminator being pretty easy to achieve, considering they will be built by other autonomous robots.
> These are going to be war machines, make absolutely no mistake about it
Of course they will. Practically everything useful has a military application. I'm not sure why this is considered a hot take.
The difference between this machine and the ones that came before is that there won’t have to be a human in the loop to execute mass murder.
> there won’t have to be a human in the loop to execute mass murder
This looks like an increasingly theoretical concern. (And probably always has been. Wars were far more brutal when folks fought face to face than they are today.)
THANK YOU.
Please make robots. LLMs should be put to work on *manual* tasks, not art/creative/intellectual tasks. The goal is to improve humanity, not put us to work putting screws inside of iPhones.
(five years later)
what do you mean you are using a robot for your drummer
I've spent the last few months looking into VLAs and I'm convinced they're gonna be a big deal, i.e. they very well might be the "ChatGPT moment for robotics" that everyone's been anticipating. Multimodal LLMs already have a ton of built-in understanding of images and text, so VLAs are just regular MMLLMs that are fine-tuned to output a specific sequence of instructions that can be fed to a robot.
OpenVLA, which came out last year, is a Llama 2 fine-tune with extra image encoding that outputs a 7-tuple of integers. The integers are rotation and translation inputs for a robot arm. If you give a vision-enabled Llama 2 a picture of an apple and a bowl and say "put the apple in the bowl", it already understands apples and bowls, and knows the end state should be the apple in the bowl, etc. What's missing is the series of tuples that will correctly manipulate the arm to do that, and the way they got those is through a large number of short instruction videos.
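If you want a sense of how small the interface is, the released OpenVLA checkpoint is driven roughly like this (a sketch based on their README; exact kwargs like unnorm_key depend on the version, and the camera/arm calls are placeholders):

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("camera_frame.jpg")  # placeholder: current scene/wrist camera frame
prompt = "In: What action should the robot take to put the apple in the bowl?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# predict_action returns the 7-tuple: (dx, dy, dz, droll, dpitch, dyaw, gripper)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# send_to_arm(action)  # placeholder: hand off to your arm's controller
```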
The neat part is that although everyone is focusing on robot arms manipulating objects at the moment, there's no reason this method can't be applied to any task. Want a smart lawnmower? It already understands "lawn", "mow", "don't destroy the toy in the path", etc.; it just needs a finetune on how to correctly operate a lawnmower. Sam Altman made some comments about having self-driving technology recently, and I'm certain it's a ChatGPT-based VLA. After all, if you give ChatGPT a picture of a street, it knows what's a car, a pedestrian, etc. It doesn't know how to output the correct turn/go/stop commands, and it does need a great deal of diverse data, but there's no reason why it can't do it. https://www.reddit.com/r/SelfDrivingCars/comments/1le7iq4/sa...
Anyway, super exciting stuff. If I had time, I'd rig a snowblower with a remote control setup, record a bunch of runs and get a VLA to clean my driveway while I sleep.
VLA = Vision-language-action model: https://en.wikipedia.org/wiki/Vision-language-action_model
Not https://public.nrao.edu/telescopes/VLA/ :(
For completeness, MMLLM = Multimodal Large language model.
I don't think transformers will be viable for self-driving cars until they can both:
1) Properly recognize what they are seeing without having to lean so hard on their training data. Go photoshop a picture of a cat and give it a 5th leg coming out of its stomach. No LLM will be able to properly count the cat's legs (they will keep saying 4 legs no matter how many times you insist they recount).
2) Be extremely fast at outputting tokens. I don't know where the threshold is, but it's probably going to be a non-thinking model (at first) and will probably need something like Cerebras or a diffusion architecture to get there.
The current-gen VLA architectures include some tricks (like compressed action tokenization and diffusion decoding) to reach action frequencies between 50 and 200 Hz. I think they're _more_ efficient this way than regular LLMs trying to do everything through text.
1. Well, based on Karpathy's talks on Tesla FSD, his solution is to actually make the training set reflect everything you'd see in reality. The tricky part is that if something occurs 0.0000001% of the time IRL and something else occurs 50% of the time, they might both need to make up 5% of the training corpus. The thing with multimodal LLMs is that lidar/depth input can just be another input that gets encoded along with everything else, so for driving, "there's a blob I don't quite recognize" is still a blob you have to drive around.
2. Figure has a dual-model architecture which makes a lot of sense: a 7B model that does higher-level planning and control and runs at 8 Hz, and a tiny 0.08B model that runs at 200 Hz and does the minute control outputs. https://www.figure.ai/news/helix
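That split is easy to picture as two loops sharing a latent plan: the big model re-plans a few times a second, and the small one consumes whatever the latest plan is on every control tick. A toy sketch of the structure (the model and actuator functions are stubs I made up, not Figure's API):

```python
import threading
import time

# Placeholder stubs -- in reality these are the 7B planner and the 0.08B policy.
def run_vlm():
    return [0.0] * 512            # pretend latent plan vector

def run_small_policy(plan):
    return [0.0] * 35             # pretend joint targets

def send_motor_targets(targets):
    pass                          # would write to the actuator bus

latest_plan = None                # latent goal written by the slow loop
lock = threading.Lock()

def slow_planner():               # ~8 Hz: large VLM digests camera + instruction
    global latest_plan
    while True:
        plan = run_vlm()
        with lock:
            latest_plan = plan
        time.sleep(1 / 8)

def fast_controller():            # ~200 Hz: tiny policy turns the plan into motor targets
    while True:
        with lock:
            plan = latest_plan
        if plan is not None:
            send_motor_targets(run_small_policy(plan))
        time.sleep(1 / 200)

threading.Thread(target=slow_planner, daemon=True).start()
fast_controller()
```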
I will be surprised if VLAs stick around, based on your description. That sounds far too low-level. Better hand that off to the 'nervous system' / kernel of the robot - it's not like humans explicitly think about the rotation of their hip & ankle when they walk. Sounds like a bad abstraction.
I wonder what kind of guardrails (like Three Laws of Robotics) there are to prevent the robots going crazy while executing the prompts
The laws of robotics were literally designed to cause conflict and facilitate strife in a fictional setting; I certainly hope no real goddamn system is built like that.
> To ensure robots behave safely, Gemini Robotics uses a multi-layered approach. "With the full Gemini Robotics, you are connecting to a model that is reasoning about what is safe to do, period," says Parada. "And then you have it talk to a VLA that actually produces options, and then that VLA calls a low-level controller, which typically has safety critical components, like how much force you can move or how fast you can move this arm."
Of course someone will. The terror nexus doesn’t build itself, yet, you know.
The generally accepted term for the research around this in robotics is Constitutional AI (https://arxiv.org/abs/2212.08073) and has been cited/experimented with in several robotics VLAs.
Is there any evidence we have the technical ability to put such ambiguous guardrails on LLMs?
A power cord?
what if they are battery powered?
Usually I put master disconnect switches on my robots just to make working on them safe. I use cheap toggle switches, though; I'm too cheap for the big red spinny ones.
[Robot learns to superglue the switch open]
It's only going to do that if you RL it with episodes that include people shutting it down for safety. The RL I've done with my models is all in simulations that don't even simulate the switch.
That will likely work for on-machine-only AI, but it seems to me that any very complicated actions/interactions with the world may require external calls to LLMs that know about these kinds of actions. Or, in the future, the on-device models will be far larger and more expansive, containing this kind of knowledge.
For example, what if you need to train the model to keep unauthorized people from shutting it off?
Having a robot near people with no master off switch sounds like a dumb idea.
That's what we use twelve gauge buckshot for, here in America.
in practice, those laws are bs.
This will not end well.