Ask HN: So, how hard was it to get your AI to work?
I have been contemplating this for a while.
All the hype about AI is very familiar to us all. Let's leave it at that for this thread.
But having taken on a few things myself, trying to get customer-ready results from homegrown AI/RAG/NLP/ML etc., I have to simply admit that it is by far the most challenging aspect of my programming career.
And I have not come up with results that would even remotely satisfy a paying customer.
How about you?
I spent last week playing around with data extraction using OpenAI's structured outputs feature. Perhaps this is due to me still being fairly new to this world, but man, has it been an exercise in frustration. My general takeaway so far is that LLMs are fantastic for creative writing, but a much harder beast to tame when you're looking for exact, reproducible answers, even when those answers should be the most obvious ones. My engineering brain does not appreciate all the prompt engineering hacks, the hopeful JSON schema tweaking, the "let's try this other random thing just to see if it works" approach to making progress. The worst has been getting to an OK point with one of the fields, only to find out later that adding directions for another field blew away the earlier progress. I'm sure I must be missing something, since there are products that do this, but so far I have little to show for the hours spent other than a growing knowledge of LLM quirks and subtle behaviors.
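For anyone curious, this is roughly the shape of call I mean (Python SDK; the field names here are made up, my real schema is beside the point):

    from openai import OpenAI

    client = OpenAI()

    # Strict mode requires every property to be listed in "required"
    # and additionalProperties to be false.
    schema = {
        "name": "extraction",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "category": {"type": "string", "enum": ["news", "opinion", "other"]},
            },
            "required": ["title", "category"],
            "additionalProperties": False,
        },
    }

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the requested fields from the text."},
            {"role": "user", "content": "...the document to extract from..."},
        ],
        response_format={"type": "json_schema", "json_schema": schema},
    )

    # The reply is guaranteed to parse against the schema; whether the
    # values are any good is another matter.
    print(resp.choices[0].message.content)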
It's a really interesting comment, but I would also love some context: what are you trying to do?
Without going into specifics: trying to classify a body of freeform text (or rather, many such bodies) along a bunch of dimensions. Imagine a fiction story that you're trying to answer questions about, such as "what is the genre?" and "is the protagonist male or female, adult or child?", where I'm instructing the LLM to return values only from a specific set (i.e., enums). OpenAI's new-ish structured outputs feature enforces that answers conform to a user-specified JSON schema, which is awesome, but I'm finding that the actual answers it comes up with are sub-par. There are all kinds of hacks out there, like creating additional fields to let it reason out the answers before constraining them to an enum elsewhere, but I have yet to make it work well for me. There are other approaches I haven't yet tried that require much more effort or cost (more expensive models, fine-tuning custom models). It feels like it should work, so the random poking and prodding continues...
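In case it helps to see it, here's roughly what that "reason first, then constrain" hack looks like with the Python SDK's parse helper; the field names are illustrative, not my actual schema. The idea is that putting a free-text reasoning field before the enum field nudges the model to think before it commits:

    from enum import Enum
    from openai import OpenAI
    from pydantic import BaseModel

    class Genre(str, Enum):
        mystery = "mystery"
        romance = "romance"
        scifi = "sci-fi"
        other = "other"

    class StoryAnalysis(BaseModel):
        genre_reasoning: str        # free-form reasoning, generated first
        genre: Genre                # the constrained answer
        protagonist_is_adult: bool

    client = OpenAI()
    story_text = "...the story being classified..."

    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Classify the story along the requested dimensions."},
            {"role": "user", "content": story_text},
        ],
        response_format=StoryAnalysis,
    )

    result = completion.choices[0].message.parsed  # a StoryAnalysis instance
    print(result.genre, result.protagonist_is_adult)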
Interesting, thanks for sharing.
By coincidence, I was wondering just the other day how hard something like that would be to do.
Wasn't too hard for me. It took about a month for a full prototype, another month to turn it into a full product, and another to refactor it into the existing codebase. But the team had trouble with it and there was a lot of stuff they thought was impossible that turned out to be easy. It's the kind of thing I've been doing since 2021, so it could be a matter of experience as well?
I find most people have trouble because they assume it's like programming - they talk to it like a robot and assume it uses robot speak. But it's trained on human language and works better when talked to like a person. It's more like talking to a gifted child. Some, like Claude, have been trained to use tags and such. There's also a lot of core stuff that people don't understand: when to fine-tune, when to RAG, when to prompt-engineer and use large context windows, the differences between models and how they're trained. It's useful to read the documentation; GPT and Claude look similar at first glance, but the documentation tells you the differences.
Many mistakes are solved the same way you would solve them with a child. If they're getting the answer wrong, there might not be enough context or enough hints. People say that LLMs only get sarcasm when they know the source of the material; well, that's exactly how humans understand sarcasm as well. Instead of asking, "Solve this math question," a prompt like, "What's the best way to solve this math question?" might lead to better results. Tools like Cursor work so much better than Copilot despite using the same models, because they're set up to think through the solution first.
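As a rough sketch of what I mean by reframing (model and wording are just for illustration):

    from openai import OpenAI

    client = OpenAI()
    question = "A train leaves at 3pm going 80 km/h; when does it arrive 200 km away?"

    # Direct ask: the model tends to jump straight to an answer.
    direct = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Solve this math question: {question}"}],
    )

    # Reframed ask: inviting it to talk through the approach first usually
    # surfaces intermediate steps before the final answer.
    reframed = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"What's the best way to solve this math question? Walk through it, then give the answer: {question}"}],
    )

    print(direct.choices[0].message.content)
    print(reframed.choices[0].message.content)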
I would recommend doing hackathons to learn. Set a small goal, one that can be solved in a day or two. Solve it using AI. If you can't, you'll at least learn why not.
A few months ago I made some toy AI apps that had amazing results - when they worked. When they failed, it would get me so annoyed that I'm now feeling burnt out and don't want to keep "fixing" issues that keep appearing.