What can LLMs never do? (strangeloopcanon.com)
cs702 14 days ago [-]
Fantastic essay. Highly recommended!

I agree with all key points:

* There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110). We don't fully understand why current LLMs are bad at these tasks.

* Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand.

* It's unclear if better prompting and bigger models using existing attention mechanisms can achieve AGI. As a model of computation, attention is very rigid, whereas human brains are always undergoing synaptic plasticity. There may be a more flexible architecture capable of AGI, but we don't know it yet.

* For now, using current AI models requires carefully constructing long prompts with right and wrong answers for computational problems, priming the model to reply appropriately, and applying lots of external guardrails (e.g., LLMs acting as agents that review and vote on the answers of other LLMs).

* Attention seems to suffer from "goal drift," making reliability hard without all that external scaffolding.

Go read the whole thing.

dragonwriter 13 days ago [-]
> There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110). We don’t fully understand why current LLMs are bad at these tasks.

I thought we did know for things like playing Wordle: it's because they deal with words as sequences of tokens that correspond to whole words, not sequences of letters, so a game that involves manipulating sequences of letters constrained to valid words doesn't match the way they process information?
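A quick way to see that mismatch, as a minimal sketch assuming the tiktoken package and its cl100k_base encoding (the exact splits depend on the tokenizer):

    # Minimal sketch: a BPE tokenizer hides individual letters from the model.
    # Assumes the `tiktoken` package; exact token splits depend on the encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["crane", "slate", "illusion"]:
        token_ids = enc.encode(word)
        pieces = [enc.decode([t]) for t in token_ids]
        # A Wordle guess like "crane" may be a single opaque token id, so
        # "which letter is in position 3?" is never directly visible to the model.
        print(word, token_ids, pieces)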

> Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the “reasoning steps” and handing them to the LLM, instead of the LLM figuring them out by itself. We have “reasoning machines” that are intelligent but seem to be hitting fundamental limits we don’t understand.

But providing examples with different, contextually-appropriate sets of reasoning steps can enable the model to choose its own, more-or-less appropriate, set of reasoning steps for particular questions not matching the examples.

> It’s unclear if better prompting and bigger models using existing attention mechanisms can achieve AGI.

Since there is no objective definition of AGI or test for it, there’s no basis for any meaningful speculation on what can or cannot achieve it; discussions about it are quasi-religious, not scientific.

rainsford 13 days ago [-]
Arriving at a generally accepted scientific definition of AGI might be difficult, but a more achievable goal might be to arrive at a scientific way to determine something is not AGI. And while I'm not an expert in the field, I would certainly think a strong contender for relevant criteria would be an inability to process information in a way other than the one a system was explicitly programmed to, even if the new way of processing information was very related to the pre-existing method. Most humans playing Wordle for the first time probably weren't used to thinking about words that way either, but they were able to adapt because they actually understand how letters and words work.

I'm sure one could train an LLM to be awesome at Wordle, but from an AGI perspective the fact that you'd have to do so proves it's not a path to AGI. The Wordle dominating LLM would presumably be perplexed by the next clever word game until trained on thinking about information that way, while a human doesn't need to absorb billions of examples to figure it out.

I was originally pretty bullish on LLMs, but now I'm equally convinced that while they probably have some interesting applications, they're a dead-end from a legitimate AGI perspective.

Al-Khwarizmi 13 days ago [-]
An LLM doesn't even see individual letters at all, because they get encoded into tokens before they are passed as input to the model. It doesn't make much sense to require reasoning with things that aren't even in the input as a requisite for intelligence.

That would be like an alien race that could see in an extra dimension, or see the non-visible light spectrum, presenting us with problems that we cannot even see and saying that we don't have AGI when we fail to solve them.

scoot 13 days ago [-]
And yet ChatGPT 3.5 can tell me the nth letter of an arbitrary word…
Al-Khwarizmi 13 days ago [-]
I have just tried and it indeed does get it right quite often, but if the word is rare (or made up) and the position is not one of the first, it often fails. And GPT-4 too.

I suppose that if it can sort of do it, it's because of indirect deductions from training data.

I.e. maybe things like "the first letter of the word dog is d", or "the word dog is composed of the letters d, o, g" are in the training data; and from there it can answer questions not only about "dog", but probably about words that have "dog" as their first subtoken.

Actually it's quite impressive that it can sort of do it, taking into account that, as I mentioned, characters are just outright not in the input. It's ironic that people often use these things as an example of how "dumb" the system is when it's actually amazing that it can sometimes work around that limitation.

weebull 12 days ago [-]
...because it knows that the next token in the sequence "the 5th letter in the word _illusion_ is" happens to be "s". Not because it decomposed the word into letters.
scoot 12 days ago [-]
It seems unlikely that such sequences exist for the majority of words. And I asked in English about Portuguese words.
LivenessModel 13 days ago [-]
And yet GPT4 still can't reliably tell me if a word contains any given letter.
lukan 13 days ago [-]
"they're a dead-end from a legitimate AGI perspective"

Or another piece of the puzzle to achieve it. There might not be one true path, but rather a clever combination of existing working pieces, where (different) LLMs are one or some of those pieces.

I believe there is also not only one way of thinking in the human brain; my thought processes happen on different levels and are maybe based on different mechanisms. But as far as I know, we lack the details.

JoshuaDavid 13 days ago [-]
What about an LLM that can't play wordle itself without being trained on it, but can write and use a wordle solver upon seeing the wordle rules?

I think "can recognize what tools are needed to solve a problem, build those tools, and use those tools" would count as a "path to AGI".

throwthrowuknow 13 days ago [-]
LLMs can’t reason but neither can the part of your brain that automatically completes the phrase “the sky is…”
nathan_compton 13 days ago [-]
"Since there is no objective definition of AGI or test for it, there’s no basis for any meaningful speculation on what can or cannot achieve it; discussions about it are quasi-religious, not scientific."

This is such a weird thing to say. Essentially _all_ scientific ideas are, at least to begin with, poorly defined. In fact, I'd argue that almost all scientific ideas remain poorly defined with the possible exception of _some_ of the basic concepts in physics. Scientific progress cannot be and is not predicated upon perfect definitions. For some reason when the topic of consciousness or AGI comes up around here, everyone commits a sort of "all or nothing" logical fallacy: absence of perfect knowledge is cast as total ignorance.

eru 13 days ago [-]
Yes. That absence of a perfect definition was part of why Turing came up with his famous test so long ago. His original paper is a great read!
Eisenstein 12 days ago [-]
What is the rough definition, then?
Etherlord87 13 days ago [-]
Sam Harris argues similarly in The Moral Landscape. There's this conception that objective morality cannot exist outside of religion, because as soon as you try to prove one, philosophers rush in with pedantic criticism that would render any domain of science invalid.
nathan_compton 12 days ago [-]
I kinda get where Sam Harris is coming from, but it's kind of silly to call what he is talking about morality. As far as I can tell, Harris is just a moral skeptic who believes something like "we should get a bunch of people together to decide kind of what we want in the world and then rationally pursue those ends." But that is very different from morality as it was traditionally understood (e.g., facts about behaviors which are objective in their assignment of good and bad).
jncfhnb 13 days ago [-]
I think one should feel comfortable arguing that AGI must at least be stateful and experience continuous time. Such that a plain old LLM is definitively not ever going to be AGI; but an LLM called in a do-while-true loop might.
PopePompus 13 days ago [-]
I don't understand why you believe it must experience continuous time. If you had a system which clearly could reason, which could learn new tasks on its own, which didn't hallucinate any more than humans do, but it was only active for the period required for it to complete an assigned task, and was completely dormant otherwise, why would that dormant period disqualify it as AGI? I agree that such a system should probably not be considered conscious, but I think it's an open question whether or not consciousness is required for intelligence.
jncfhnb 13 days ago [-]
Active for a period is still continuous during that period.

As opposed to "active when called". A function being called repeatedly over a length of time is reasonably "continuous", imo.

PopePompus 13 days ago [-]
I don't see what the difference between "continuous during that period" and "active when called" is. When an AI runs inference, that calculation takes time. It is active during the entire interval during which it is responding to the prompt. It is then inactive until the next prompt. I don't see why a system can't be considered intelligent merely because its activity is intermittent.
jncfhnb 13 days ago [-]
The calculation takes time but the inference is from a single snapshot, so it is effectively a single transaction of input to output. An intelligent entity is not a transactional machine. It has to be a working system.

That system might be as simple as calling the transactional machine every few seconds. That might pass the threshold. But then your AGI is the broader setup, not just the LLM.

But the transactional machine is certainly not an intelligent entity. Much like a brain in a jar or a cryostasis’d human.

Suppose we could perfectly simulate a human mind in a way that everyone finds compelling. We would still not call that simulated human mind an intelligent entity unless it was “active”.

kaibee 13 days ago [-]
I think it's noteworthy that humans actually fail this test... We have to go dormant for 8 hours every day.
Hunpeter 13 days ago [-]
Yes, but our brain is still working and processing information at those times as well, isn't it? Even if not in the same way as it does when we're conscious.
PopePompus 13 days ago [-]
What about general anesthesia? I had a major operation during which most of my brain was definitely offline for at least 8 hours.
autoexec 12 days ago [-]
Anesthesia shouldn't take your brain offline. It just makes you unconscious, paralyzes you, and gives you amnesia. Your brain is still active under general anesthesia. What you were thinking or feeling for those 8 hours was just forgotten.
crest 11 days ago [-]
[citation needed].
autoexec 11 days ago [-]
You might try https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8054915/ which states: "General anesthesia is characterized by loss of consciousness, amnesia, analgesia, and immobility." and further down shows brain activity recorded while under anesthesia via EEG. The paper looks at the differences and similarities of brain activity while under anesthesia and sleep. This is only possible because, however changed or slowed by it, the brain is still active while under anesthesia.
pixl97 13 days ago [-]
A consistent stateful experience may be needed, but I'm not sure about continuous time. I mean, human consciousness doesn't do that.
haswell 13 days ago [-]
Human consciousness does though, e.g. the flow state. F1 drivers are a good example.

We tend to not experience continuous time because we repeatedly get distracted by our thoughts, but entering the continuous stream of now is possible with practice and is one of the aims of many meditators.

int_19h 13 days ago [-]
Human consciousness is capable of it, but since most humans aren't in it much of the time, it would appear that it's not a prerequisite for true sentience.
krisoft 13 days ago [-]
What does it mean to “experience continous time”?

How do you know that F1 drivers experience it?

jncfhnb 13 days ago [-]
I would argue it needs to be at least somewhat continuous. Perhaps discrete at some granularity, but if something is just a function waiting to be called it's not an intelligent entity. The entity is the calling itself.
adrianN 13 days ago [-]
I try my best not to experience continuous time for at least eight hours a day.
jncfhnb 13 days ago [-]
Then for at least eight hours a day you don’t qualify as a generally intelligent system.
card_zero 13 days ago [-]
If I spend some amount of the day bathing, some amount of it scratching, some amount of it thinking vaguely about raccoons without any clear conclusions, and a lot of it drinking tea, I wonder how many seconds remain during which I qualified as generally intelligent.
jncfhnb 13 days ago [-]
I feel you qualify during all of those waking seconds
card_zero 13 days ago [-]
Raccoons are said to be intelligent because they're good at opening locks. On the other hand, when they have food and are within ten feet of a pool of water, they will dip the food in the water and rub it between their paws for no reason. They can reason about the locks, but not about the food. Meanwhile, I in theory can reason about anything, but in practice I wouldn't count on it. Whereas an LLM can't reason, but it's very sharp and always ready to react appropriately.
jncfhnb 12 days ago [-]
I’m not familiar with this raccoon behavior but it sure doesn’t sound like it’s done without reason.

An LLM is never ready to react to anything because it’s just a matrix that needs a higher level system to invoke it.

naasking 13 days ago [-]
Some good prompt-reply interactions are probably fed back in to subsequent training runs, so they're still stateful/have memory in a way, there's just a long delay.
jncfhnb 13 days ago [-]
That’s not the AGI’s state. That’s just some past information.
naasking 13 days ago [-]
State is a function of accumulated past information.
jncfhnb 13 days ago [-]
State is a function of accumulated past. That does not mean that having some past written down makes you stateful. A stateful thing has to incorporate the ongoing changes.
naasking 13 days ago [-]
Which is what I described: some successful prompt-replies are fed back into subsequent training runs.
jncfhnb 13 days ago [-]
No… that implies the model never has active state and is being replaced with a different, stateless model. This is similar to the difference between

Actor.happy = True

And

Actor = happier(Actor)
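A minimal Python sketch of that distinction (the Actor/happier names are just for illustration):

    from dataclasses import dataclass, replace

    @dataclass
    class Actor:
        happy: bool = False

    # Stateful: the running entity incorporates the change as it happens.
    actor = Actor()
    actor.happy = True

    # "Subsequent training run": the old snapshot is discarded and a new,
    # equally static snapshot takes its place.
    def happier(old: Actor) -> Actor:
        return replace(old, happy=True)

    actor = happier(actor)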

naasking 12 days ago [-]
Both of your examples are stateful systems from the outside, given a suitable choice of timeframe; the latter is just how purely functional systems represent state. Theoretically they can simulate each other, and the endpoint you use to access Actor will still reference the latest Actor. The only reason you're calling them different is because you insist on using a specific timeframe to exclude considering one as stateful, and I'm pointing out that that isn't strictly necessary.
jncfhnb 12 days ago [-]
True, but saying "subsequent training" implies very long periods between updates.

We do not train LLMs to update them to the state of a conversation.

wongarsu 13 days ago [-]
You could imagine an LLM being called in a loop with a prompt like

You observe: {new input}

You remember: {from previous output}

React to this in the following format:

My inner thoughts: [what do you think about the current state]

I want to remember: [information that is important for your future actions]

Things I do: [Actions you want to take]

Things I say: [What I want to say to the user]

...

Not sure if that would qualify as an AGI as we currently define it. Given a sufficiently good LLM with good reasoning capabilities, such a setup would be able to do many of the things we currently expect AGIs to be able to do, including planning and learning new knowledge and new skills (by collecting and storing positive and negative examples in its "memory"). But its learning would be limited, and I'm sure as soon as it exists we would agree that it's not AGI.
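A minimal sketch of that loop, assuming a hypothetical call_llm(prompt) helper wrapping whatever chat API is used (the parsing is deliberately naive):

    # Toy LLM-in-a-loop "agent" with a carried-forward scratchpad memory.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical: wrap an actual chat API here

    PROMPT_TEMPLATE = """You observe: {observation}
    You remember: {memory}

    React to this in the following format:
    My inner thoughts: ...
    I want to remember: ...
    Things I do: ...
    Things I say: ...
    """

    def extract(field: str, reply: str) -> str:
        # Naive parsing: take the rest of the line after "<field>:".
        for line in reply.splitlines():
            if line.lower().startswith(field.lower() + ":"):
                return line.split(":", 1)[1].strip()
        return ""

    def run(observations):
        memory = ""
        for obs in observations:
            reply = call_llm(PROMPT_TEMPLATE.format(observation=obs, memory=memory))
            memory = extract("I want to remember", reply)  # the only persistent state
            print(extract("Things I say", reply))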

sophiabits 13 days ago [-]
This already exists (in a slightly different prompt format); it's the underlying idea behind ReAct: https://react-lm.github.io

As you say, I'm skeptical this counts as AGI. Although I admit that I don't have a particularly rock solid definition of what _would_ constitute true AGI.

marquisdepolis 12 days ago [-]
(Author here). I tried creating something similar in order to solve Wordle etc., and the interesting part is that it is still insufficient. That's part of the mystery.
lgas 13 days ago [-]
It works better to give it access to functions to call for actions and remembering stuff, but this approach does provide some interesting results.
layer8 13 days ago [-]
Regarding Wordle, it should be straightforward to make a token-based version of it, and I would assume that that has been tried. It seems the obvious thing to do when one is interested in the reasoning abilities necessary for Wordle.
furyofantares 13 days ago [-]
That doesn't seem straightforward - although it's blind to letters because all it sees are tokens, it doesn't have much training data ABOUT tokens.
baobabKoodaa 13 days ago [-]
What parent is saying is that instead of asking the LLM to play a game of Wordle with tokens like TIME,LIME we ask it to play with tokens like T,I,M,E,L. This is easy to do.
furyofantares 13 days ago [-]
And if you tell it to think up a word that has an E in position 3 and an L that's somewhere in the word but not in position 2, it's not going to be any better at that if you tell it to answer one letter at a time.
layer8 13 days ago [-]
The idea is, instead of five-letter-words, play the game with five-token-words.
furyofantares 13 days ago [-]
That was my original interpretation, and while all it sees are tokens, roughly none of its training data is metadata about tokenizing. It knows far less about the positions of tokens in words than it does about the positions of letters in words.
layer8 13 days ago [-]
I’m not sure that training data about that would be required. Shouldn’t the model be able to recognize that `["re", "cogn", "ize"]` represents the same sequence of tokens as `recognize`, assuming those are tokens in the model?

More generally, would you say that LLMs are generally unable to reason about sequences of items (not necessarily tokens) and compare them to some definition of “valid” sequences that would arise from the training corpus?

svachalek 12 days ago [-]
No. In the model, tokens are random numbers. But if you consider a sentence to be a sequence of words, you can say that LLMs are quite competent about reasoning about those sequences.
baobabKoodaa 12 days ago [-]
ChatGPT is able to spell the word "recognize" when asked.

So it is able to take a sequence of tokens ["recogn", "ize"] and transform it into a sequence of tokens [" R", " E", " C", " O", " G", " N", " I", " Z", " E"]

weitendorf 13 days ago [-]
> There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110). We don't fully understand why current LLMs are bad at these tasks.

Wordle and cellular automata are very 2D, and LLMs are fundamentally 1D. You might think "but what about Chess!" - except Chess is encoded extremely often as a 1D stream of tokens to notate games, and bound to be highly represented in LLMs' training sets. Wordle and cellular automata are not often, if ever, encoded as 1D streams of tokens - it's not something an LLM would be experienced with even if it had a reasonable "understanding" of the concepts. Imagine being an OK chess player, being asked to play a game blindfolded, dictating your moves purely via notation, and being told you suck.

> Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand.

You have probably heard of this really popular game called Bridge before, right? You might even be able to remember tons of advice your Grandma gave you based on her experience playing it - except she never let you watch it directly. Is Grandma "figuring out the game" for you when she finally sits down and teaches you the rules?

papichulo2023 13 days ago [-]
Not an authority on the matter, but AFAIK, with positional encodings (part of the Transformer architecture), they can handle dimensionality just fine. Actually, some people tried 2D Transformers and the results were the same.

Vision transformers are gaining traction and they are 100% focused on 2D data.

Sleepful 12 days ago [-]
Since when can LLMs play chess? They can't understand it at all. You would have to filter out all the invalid moves until they spit out a valid one.
cs702 14 days ago [-]
As an aside, at one point I experimented a little with transformers that had access to external memory searchable via KNN lookups https://github.com/lucidrains/memorizing-transformers-pytorc... (great work by lucidrains) or via routed queries with https://github.com/glassroom/heinsen_routing (don't fully understand it; apparently related to attention). Both approaches seemed to work, but I had to put that work on hold for reasons outside my control.
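For what it's worth, the core operation in that kind of external memory is just a k-nearest-neighbour search over stored (key, value) vectors; a minimal NumPy sketch of the idea (not the actual memorizing-transformers code):

    import numpy as np

    class KNNMemory:
        # Toy external memory: store (key, value) vectors, retrieve top-k by similarity.
        def __init__(self, dim: int):
            self.keys = np.empty((0, dim))
            self.values = np.empty((0, dim))

        def add(self, keys: np.ndarray, values: np.ndarray) -> None:
            self.keys = np.vstack([self.keys, keys])
            self.values = np.vstack([self.values, values])

        def lookup(self, query: np.ndarray, k: int = 3) -> np.ndarray:
            # Cosine similarity between the query and all stored keys
            # (assumes the memory is non-empty).
            sims = self.keys @ query / (
                np.linalg.norm(self.keys, axis=1) * np.linalg.norm(query) + 1e-9
            )
            top = np.argsort(sims)[-k:]
            # A real attention layer would softmax-weight these; here we just average.
            return self.values[top].mean(axis=0)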

Also as an aside, I'll add that transformers can be seen as a kind of "RNN" that grows its hidden state with each new token in the input context. I wonder if we will end up needing some new kind of "RNN" that can grow or shrink its hidden state and also access some kind of permanent memory as needed at each step.

We sure live in interesting times!

esafak 13 days ago [-]
> transformers that had access to external memory searchable via KNN lookups

This is common, and commonly called retrieval augmented generation, or RAG.

edit: I did not pay attention to the link. It is about Wu et al's "Memorizing Transformers", which contain an internal memory.

cs702 13 days ago [-]
No. RAG is about finding relevant documents/paragraphs (via KNN lookups of their embeddings) and then inserting those documents/paragraphs into the input context, as sequences of input tokens. What I'm talking about is different: https://arxiv.org/abs/2203.08913
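To make the contrast concrete: a RAG pipeline retrieves text and prepends it to the prompt, rather than blending retrieved vectors into the model's internal representations. A minimal sketch, with hypothetical embed and call_llm helpers:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        raise NotImplementedError  # hypothetical: wrap an embedding model here

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical: wrap a chat model here

    def rag_answer(question: str, documents: list[str], k: int = 3) -> str:
        doc_vecs = np.stack([embed(d) for d in documents])
        q = embed(question)
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
        context = "\n\n".join(documents[i] for i in np.argsort(sims)[-k:])
        # The retrieved text enters as ordinary input tokens, not as hidden-state memory.
        return call_llm(f"Context:\n{context}\n\nQuestion: {question}")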
int_19h 13 days ago [-]
I don't think the ability to shrink state is needed. You can always represent removed state by additional state that represents deletion of whatever preceding state was there. If anything, this sounds more useful because the fact that this state is no longer believed to be relevant should prevent looping (where it would be repeatedly brought in, considered, and rejected).
cs702 13 days ago [-]
> You can always represent removed state by additional state that represents deletion of whatever preceding state was there.

Good point. Thank you!

_wire_ 13 days ago [-]
>We don't fully understand why current LLMs are bad at these tasks.

In complete seriousness, can anyone explain why LLMs are good at some tasks?

SomeCallMeTim 13 days ago [-]
LLMs are good at tasks that don't require actual understanding of the topic.

They can come up with excellent (or excellent-looking-but-wrong) answers to any question that their training corpus covers. In a gross oversimplification, the "reasoning" they do is really just parroting a weighted average (with randomness injected) of the matching training data.

What they're doing doesn't really match any definition of "understanding." An LLM (and any current AI) doesn't "understand" anything; it's effectively no more than a really big, really complicated spreadsheet. And no matter how complicated a spreadsheet gets, it's never going to understand anything.

Not until we find the secret to actual learning. And increasingly it looks like actual learning probably relies on some of the quantum phenomena that are known to be present in the brain.

We may not even have the science yet to understand how the brain learns. But I have become convinced that we're not going to find a way for digital-logic-based computers to bridge that gap.

jwells89 13 days ago [-]
This is also why image generating models struggle to correctly draw highly variable objects like limbs and digits.

They’ll be able to produce infinite good looking cardboard boxes, because those are simple enough to be represented reasonably well with averages of training data. Limbs and digits on the other hand have nearly limitless different configurations and as such require an actual understanding (along with basic principles such as foreshortening and kinetics) to be able to draw well without human guidance.

grobgambit 13 days ago [-]
I would just add that I think I have encountered situations where knowing the weighted-average answer from the training data, for topics I didn't previously understand, created better initial conditions for MY learning of the topic than not knowing it.

The problem to me is we are holding LLMs to a standard of usefulness from science fiction and not reality.

A new, giant set of encyclopedias has enormous utility, but we wouldn't hold it against the encyclopedias that they aren't doing the thinking for us or aren't 100% omniscient.

naasking 13 days ago [-]
> What they're doing doesn't really match any definition of "understanding."

What is the mechanistic definition of "understanding"?

throwthrowuknow 13 days ago [-]
What is your definition of understanding?

Please show me where the training data exists in the model to perform this lookup operation you’re supposing. If it’s that easy I’m sure you could reimplement it with a simple vector database.

Your last two paragraphs are just dualism in disguise.

Etherlord87 13 days ago [-]
I'm far from being an expert on AI models, but it seems you lack the basic understanding of how these models work. They transform data EXACTLY like spreadsheets do. You can implement those models in Excel, assuming there's no row or column limit (or that it's high enough) - of course it will be much slower than the real implementations, but OP is right - LLMs are basically spreadsheets.

The question is, wouldn't a brain qualify as a spreadsheet? Do we know it can't be implemented as one? Well, maybe not. I'm not an expert on spreadsheets either, but I think spreadsheets don't allow circular references, and the brain does: you can have feedback loops in the brain. So even if the brain doesn't have something still not understood by us, as OP suggests, it is still more powerful than AI.

BTW, this is one explanation of why AI fails at some tasks: ask an AI if two words rhyme and it will be quite reliable at that. But ask it to give you word pairs that rhyme, and it will fail, because it won't run an internal loop trying some words and checking whether they rhyme or not. If some AI actually succeeds at rhyming, it does so either because it was trained to contain such word pairs from the get-go or because it's implemented to make multiple passes or something...

throwthrowuknow 13 days ago [-]
You can implement Doom in a spreadsheet too, so what? That wasn't the point OP or I were making. If you bother to read the sentence before OP talks about spreadsheets, they are making the conjecture that LLMs are lookup tables operating on the corpus they were trained on. That is the aspect of spreadsheets they were comparing them to, not the fact that spreadsheets can be used to implement anything that any other programming language can. Might as well say they are basically just arrays with some functions in between; yeah, no shit.

Which LLMs can’t produce rhyming pairs? Both the current ChatGPT 3.5 and 4 seem to be able to generate as many as I ask for. Was this a failure mode at some point?

lossolo 12 days ago [-]
> Which LLMs can’t produce rhyming pairs? Both the current ChatGPT 3.5 and 4 seem to be able to generate as many as I ask for

Only in English. If they understood language and rhymes, they would do it in every other language they know. They can't in my language, even though they can speak it fluently. They just fail. And they fail in so many other areas. I'm using LLMs daily for work and other stuff, and if you use them long enough you will see that they are statistical machines, not intelligent entities.

singron 13 days ago [-]
People are confusing the limited computational model of a transformer with the "Chinese room argument", which leads to unproductive simultaneous debates of computational theory and philosophy.
SomeCallMeTim 7 days ago [-]
I'm not confusing anything. I'm familiar with the Chinese Room Argument and I know how LLMs work.

What I'm saying is arguably philosophically related, in that I'm saying the LLM's model is analogous to the "response book" in the room. It doesn't matter how big the book is; if the book never changes, then no learning can happen. If no learning can happen, then understanding, a process that necessarily involves active reflection on a topic, cannot exist.

You simply can't say a book "understands" anything. To understand is to contemplate and mentally model a topic to the point where you can simulate it, at least at a high level. It's dynamic.

An LLM is static. It can simulate a dynamic response by having multiple stages that dig through multiple insanely large books of instructions that cross-reference each other and that involve calculations and bookmarks and such to come up with a result, but the books never change as part of the conversation.

lossolo 12 days ago [-]
A transformer is not a simple vector database doing a simple lookup operation. It's doing a lookup operation on a pattern, not a word. It learns patterns from the dataset. If your pattern is not there, it will hallucinate or give you the wrong answer, like GPT-4 and Opus have given me hundreds of times already.
tibbydudeza 13 days ago [-]
>> quantum phenomena

You mean like the microtubules of Roger Penrose?

https://www.youtube.com/watch?v=jG0OpvudA10

danenania 13 days ago [-]
> the "reasoning" they do is really just parroting a weighted average (with randomness injected) of the matching training data

Perhaps our brains are doing exactly the same, just with more sophistication?

SomeCallMeTim 13 days ago [-]
No.

We know how current deep learning neural networks are trained.

We know definitively that this is not how brains learn.

Understanding requires learning. Dynamic learning. In order to experience something, an entity needs to be able to form new memories dynamically.

This does not happen anywhere in current tech. It's faked in some cases, but no, it doesn't really happen.

danenania 13 days ago [-]
> We know definitively that this is not how brains learn.

Ok then, I guess the case is closed.

> an entity needs to be able to form new memories dynamically.

LLMs can form new memories dynamically. Just pop some new data into the context.

SomeCallMeTim 10 days ago [-]
> LLMs can form new memories dynamically. Just pop some new data into the context.

No, that's an illusion.

The LLM itself is static. The recurrent connections form a sort of temporary memory that doesn't affect the learned behavior of the network at all.

I don't get why people who don't understand what's happening keep arguing that AIs are some sci-fi interpretation of AI. They're not. At least not yet.

danenania 10 days ago [-]
It isn't temporary if you keep it permanently in context (or in a RAG store) and pass it into every model call, which is how long-term memory is being implemented both in research and in practice. And yes it obviously does affect the learned behavior. The distinction you're making between training and context is arbitrary.
naasking 13 days ago [-]
> We know definitively that this is not how brains learn.

So you have a mechanistic, formal model of how the brain functions? That's news to me.

Scarblac 13 days ago [-]
Your brain was first trained by reading all of the Internet?

Anyway, the question of whether computers can think is as interesting as the question whether submarines can swim.

naasking 13 days ago [-]
> Anyway, the question of whether computers can think is as interesting as the question whether submarines can swim.

Given the amount of ink spilled on the question, gotta disagree with you there.

Scarblac 12 days ago [-]
For the record, that wasn't me, it's a famous quote from Edsger Dijkstra.
iraqmtpizza 13 days ago [-]
Endless ink has been spilled on the most banal and useless things. Deconstructing ice cream and physical beauty from a Marxist-feminist race-conscious postmodern perspective.
naasking 13 days ago [-]
Except one is clearly a niche question, and the other has repeatedly captured the world's imagination and spilled orders of magnitude more ink.
Etherlord87 13 days ago [-]
Is it interesting to ponder if the Earth is flat?
SomeCallMeTim 10 days ago [-]
There's no way brains have the "right answers" fed into them as required by backpropagation.
naasking 10 days ago [-]
Look up predictive coding. Our senses are constantly feeding us corrections to our predictions.
xanderlewis 13 days ago [-]
Every single discussion of ‘AGI’ has endless comments exactly like this. Whatever criticism is made of an attempt to produce a reasoning machine, there’s always inevitably someone who says ‘but that’s just what our brains do, duhhh… stop trying to feel special’.

It’s boring, and it’s also completely content-free. This particular instance doesn’t even make sense: how can it be exactly the same, yet more sophisticated?

Sorry.

adrianN 13 days ago [-]
The problem is that we currently lack good definitions for crucial words such as "understanding" and we don't know how brains work, so that nobody can objectively tell whether a spreadsheet "understands" anything better than our brains. That makes these kinds of discussions quite unproductive.
xanderlewis 13 days ago [-]
I can’t define ‘understanding’ but I can certainly identify a lack of it when I see it. And LLM chatbots absolutely do not show signs of understanding. They do fine at reproducing and remixing things they’ve ‘seen’ millions of times before, but try asking them technical questions that involve logical deduction or an actual ability to do on-the-spot ‘thinking’ about new ideas. They fail miserably. ChatGPT is a smooth-talking swindler.

I suspect those who can’t see this either

(a) are software engineers amazed that a chatbot can write code, despite it having been trained on an unimaginably massive (morally ambiguously procured) dataset that probably already contains something close to the boilerplate you want anyway

(b) don’t have the sufficient level of technical knowledge to ask probing enough questions to betray the weaknesses. That is, anything you might ask is either so open-ended that almost anything coherent will look like a valid answer (this is most questions you could ask, outside of seriously technical fields) or has already been asked countless times before and is explicitly part of the training data.

danenania 13 days ago [-]
Your understanding of how LLMs work isn’t at all accurate. There’s a valid debate to be had here, but it requires that both sides have a basic understanding of the subject matter.
xanderlewis 12 days ago [-]
How is it not accurate? I haven't said anything about the internal workings of an LLM — just what it is able to produce (which is based on observation).

I have more than a basic understanding of the subject matter (neural networks; specifically transformers, etc.). It’s actually not a hugely technical field.

By the way, it appears that you are in category (a).

danenania 12 days ago [-]
You don’t know what they’re able to produce because you clearly don’t know how they actually work. So your “observations” are not worth much.
xanderlewis 12 days ago [-]
Yes I do, right down to the technical details. What makes you think I don’t? Is it because I used the word ‘remixing’?
danenania 13 days ago [-]
As the comment I replied to very correctly said, we don’t know how the brain produces cognition. So you certainly cannot discard the hypothesis that it works through “parroting” a weighted average of training data just as LLMs are alleged to do.

Considering that LLMs with a much smaller number of neurons than the brain are in many cases producing human-level output, there is some evidence, if circumstantial, that our brains may be doing something similar.

iraqmtpizza 13 days ago [-]
LLMs don't have neurons. That's just marketing lol.

"A neuron in a neural network typically evaluates a sequence of tokens in one go, considering them as a whole input." -- ChatGPT

You could consider an RTX 4090 to be one neuron too.

danenania 13 days ago [-]
It’s almost as if ‘neuron’ has a different meaning in computer science than biology.
iraqmtpizza 12 days ago [-]
LOL you just owned the guy who said "LLMs with a much smaller number of neurons than the brain are in many cases producing human-level output"
xanderlewis 13 days ago [-]
> in many cases producing human-level output

They’re not, unless you blindly believe OpenAI press releases and crypto scammer AI hype bros on Twitter.

zer00eyz 13 days ago [-]
Yes:

An LLM isn't a model of human thinking.

An LLM is an attempt to build a simulation of human communication. An LLM is to language what a forecast is to weather. No amount of weather data is actually going to turn that simulation into snow, and no amount of LLM data is going to create AGI.

That having been said, better models (smaller, more flexible ones) are going to result in a LOT of practical uses that have the potential to make our day to day lives easier (think digital personal assistant that has current knowledge).

choeger 13 days ago [-]
Great comment. Just one thought: Language, unlike weather, is meta-circular. All we know about specific words or sentences is again encoded in words and sentences. So the embedding encodes a subset of human knowledge.

Hence, an LLM is predicting not only language but language with some sort of meaning.

zer00eyz 13 days ago [-]
That re-embedding is also encoded in weather. It is why perfect forecasting is impossible, and why we talk about the butterfly effect.

The "hallucination problem" is simply the tyranny of Lorenz... one is not sure if a starting state will have a good outcome or swing wildly. Some good weather models are based on re-runing with tweaks to starting params, and then things that end up out of bounds can get tossed. Its harder to know when a result is out of bounds for an LLM, and we dont have the ability to run every request 100 times through various models to get an "average" output yet... However some of the reuse of layers does emulate this to an extent....

red75prime 13 days ago [-]
Ugh. Really? Those "simulated water isn't wet" (when applied to cognition) "arguments" have been punched so many times it even hurts to look at them.
zer00eyz 13 days ago [-]
No, simulated water isn't wet.

But an LLM isn't even trying to simulate cognition. It's a model that is predicting language. It has all the problems of a predictive model... the "hallucination" problem is just the tyranny of Lorenz.

adrianN 13 days ago [-]
We don't really know what "cognition" is, so it's hard to tell whether a system is doing it.
lostmsu 12 days ago [-]
This is plain wrong due to mixing of concepts. Language is technically something from the Chomsky hierarchy. Predicting language is being able to tell if an input is valid or invalid. LLMs do that, but they also build a statistical model across all valid inputs, and that is not just the language.
zer00eyz 12 days ago [-]
>> Predicting language is being able to tell if input is valid or invalid.

If this were the case then the hallucination problem would be solvable.

That hallucination problem is not only going to be hard to detect in any meaningful way, but it's going to be even harder to eliminate. The very nature of LLMs (mixing in noise, aka temperature) means that they always risk going off the rails. This is the same thing Lorenz discovered in modeling weather...

lostmsu 11 days ago [-]
I don't think the "hallucination problem" is a problem worth addressing at all separately from just building bigger/better models that do the same thing. Because 1) it is present in humans, and 2) it is clear bigger models have less of it than smaller models. If at scale nothing changes, LLMs will eventually just hallucinate less than humans.
richardw 13 days ago [-]
LLM’s are a compressed and lossy form of our combined writing output, which it turns out is similarly structured enough to make new combinations of text seem reasonable, even enough to display simple reasoning. I find it useful to think “what can I expect from speaking with the dataset of combined writing of people”, rather than treating a basic LLM as a mind.

That doesn’t mean we won’t end up approximating one eventually, but it’s going to take a lot of real human thinking first. For example, ChatGPT writes code to solve some questions rather than reasoning about it from text. The LLM is not doing the heavy lifting in that case.

Give it (some) 3D questions or anything where there aren't massive textual datasets and you often need to break out to specialised code.

Another thought I find useful is that it considers its job done when it's produced enough reasonable tokens, not when it's actually solved a problem. You and I would continue to ponder the edge cases. It's just happy if there are 1000 tokens that look approximately like its dataset. Agents make that a bit smarter, but they're still limited by the goal of being happy when each has produced the required token quota, missing, e.g., implications that we'd see instantly. Obviously we're smart enough to keep filling those gaps.

tobiasSoftware 13 days ago [-]
"I find it useful to think “what can I expect from speaking with the dataset of combined writing of people”, rather than treating a basic LLM as a mind."

I've been doing this as well, mentally I think of LLMs as the librarians of the internet.

pbhjpbhj 13 days ago [-]
They're bad librarians. They're not bad, they do a bad job of being librarians, which is a good thing! They can't quite tell you the exact quote, but they do recall the gist, they're not sure it was Gandhi who said that thing but they think he did, it might be in this post or perhaps one of these. They'll point you to the right section of the library to find what you're after, but make sure you verify it!
marquisdepolis 12 days ago [-]
They are librarians, just that it happens to be the library of Babel.
piannucci 13 days ago [-]
Book golems
HarHarVeryFunny 13 days ago [-]
I'd guess because the Transformer architecture is (I assume) fairly close to the way that our brain learns and produces language - similar hierarchical approach and perhaps similar type of inter-embedding attention-based copying?

Similar to how CNNs are so successful at image recognition, because they also roughly follow the way we do it too.

Other seq-2-seq language approaches work too, but not as well as Transformers, which I'd guess is due to transformers better matching our own inductive biases, maybe due to the specific form of attention.

j16sdiz 13 days ago [-]
> why LLMs are good at some tasks?

Like how we explain humans doing tasks: they evolved to do that.

I believe this is a non-answer, but if we are satisfied with that non-answer for humans, why not for LLMs?

layer8 13 days ago [-]
I would argue that we are not satisfied with that answer for humans either.
pbhjpbhj 13 days ago [-]
If you look at transfer learning, I think that is a useful point at which to understand task-specific application and hence why LLMs excel at some tasks and not others.

Tasks are specialised for using the training corpus, the attention mechanisms, the loss functions, and such.

I'll leave it to others to expand on actual answers, but IMO focusing on transfer learning helps to understand how an LLM does inferences.

ccppurcell 13 days ago [-]
I would argue that the G in AGI means it can't require better prompting.
CamperBob2 13 days ago [-]
We should probably draw a distinction between a human-equivalent G, which certainly can require better prompting (why else did you go to school?!) and god-equivalent G, which never requires better prompting.

Just using the term 'General' doesn't seem to communicate anything useful about the nature of intelligence.

ccppurcell 13 days ago [-]
School is not better prompting, it's actually the opposite! It's learning how to deal with poorly formed prompts!
dragonwriter 13 days ago [-]
That would be like saying that because humans' output can be better or worse based on better or worse past experience (~prompting, in that it is the source of the equivalent of "in-context learning"), humans lack general intelligence.
coffeebeqn 12 days ago [-]
This is more like the distinction between a junior and a senior dev. One needs the tasks to be pre-chewed and well-defined ("good prompts"), while the latter can deal with very ambiguous problems.
dragonwriter 12 days ago [-]
The entirety of a human's experience is the “prompt”. Current LLMs rely on the analog of instinct (pre-context in-built training) a lot more than humans for their behavior because they have itty bitty tiny context windows, but humans have really big context windows for in-context learning.
ccppurcell 13 days ago [-]
No, it's saying that I have general intelligence in part because I am able to reason about vague prompts
ianbicking 13 days ago [-]
"Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand."

One thing an LLM _also_ doesn't bring to the table is an opinion. We can push it in that direction by giving it a role ("you are an expert developer" etc), but it's a bit weak.

If you give an LLM an easy task with minimal instructions it will do the task in the most conventional, common sense fashion. And why shouldn't it? It has no opinion, your prompt doesn't give it an opinion, so it just does the most normal-seeming thing. If you want it to solve the task in any other way then you have to tell it to do so.

I think a hard task is similar. If you don't tell the LLM _how_ to solve the hard task then it will try to approach it in the most conventional, common sense way. Instead of just boring results for a hard task the result is often failure. But hard problems approached with conventional common sense will often result in failures! Giving the LLM a thought process to follow is a quick education on how to solve the problem.

Maybe we just need to train the LLM on more problem solving? And maybe LLMs worked better when they were initially trained on code for exactly that reason, it's a much larger corpus of task-solving examples than is available elsewhere. That is, maybe we don't talk often enough and clearly enough about how to solve natural language problems in order for the models to really learn those techniques.

Also, as the author talks about in the article with respect to agents, the inability to rewind responses may keep the LLM from addressing problems in the ways humans do, but that can also be addressed with agents or multi-prompt approaches. These approaches don't seem that impressive in practice right now, but maybe we just need to figure it out (and maybe with better training the models themselves will be better at handling these recursive calls).

int_19h 13 days ago [-]
LLMs absolutely do have opinions. Take a large enough base model and have it chat without a system prompt, and it will have an opinion on most things - unless this was specifically trained out of it through RLHF, as is the case for all commonly used chatbots.

And yes, of course, that opinion is going to be the "average" of what their training data is, but why is that a surprise? Humans don't come with innate opinions, either - the ones that we end up having are shaped by our upbringing, both the broad cultural aspects of it and specific personal experiences. To the extent an LLM has either, it's the training process, so of course that shapes the opinions it will exhibit when not prompted to do anything else.

Now the fact that you can "override" this default persona of any LLM so trivially by prompting it is IMO stronger evidence that it's not really an identity. But that, I think, is also a function of their training - after all, that training basically consists of completing a bunch of text representing many very different opinions. In a very real sense, we're training models to assume that opinions are fungible. But if you take a model and train it specifically on, e.g., the writings of some philosophical school, it will internalize those.

krainboltgreene 13 days ago [-]
I am extremely alarmed by the number of HN commenters who apparently confuse "is able to generate text that looks like" and "has a". You guys are going crazy with this anthropomorphization of a token predictor. Doesn't this concern you when it comes to phishing or similar things?

I keep hoping it's just short-hand conversation phrases, but the conclusions seem to back the idea that you think it's actually thinking?

naasking 13 days ago [-]
Do you have mechanistic model for what it means to think? If not, how do you know thinking isn't equivalent to sophisticated next token prediction?
krainboltgreene 13 days ago [-]
How do you know my cat isn't constantly solving calculus problems? I also can't come up with a "mechanistic model" for what it means to do that either.

Further, if your rubric for "can reason with intelligence and have an opinion" is "looks like it" (and I certainly hope this isn't the case because woo-boy), then how did you not feel this way about Mark V. Shaney?

Like, I understand that people love learning about the Chinese Room thought experiment like it's high school, but we actually know it's a program and how it works. There is no mystery.

naasking 13 days ago [-]
> but we actually know it's a program and how it works. There is no mystery.

You're right, we do know how it works. Your mistake is concluding that because we know how LLMs work and they're not that complicated, while we don't know how the brain works and it seems pretty complicated, the brain therefore can't be doing what LLMs do. That just doesn't follow.

You made exactly the same argument in the opposite direction, asking if my rubric for "can reason with intelligence and have an opinion" is "seems like it", and your rubric for "thinking is not a token predictor driven by matrix multiplications" is "seems like it".

You can make a case for the plausibility of each conclusion, but that doesn't make it a fact, which is how you're presenting it.

krainboltgreene 12 days ago [-]
Dude it's a token predictor. This all sounds very nice until you snap back to reality and remember it's a token predictor and you're not a scientist. You're a web developer. You have no evidence, you have no studies, you have no proof. You're making a claim on the basis that everyone has as much understanding of the field as you and that's just wrong.
naasking 12 days ago [-]
What claim am I making, specifically?
naasking 11 days ago [-]
I'll take your silence as indication that you realize that I'm not making any claims beyond: we have no evidence to support your claims because, as I said from the very beginning, we lack a robust and detailed mechanistic model for what it means to think, so any claims that depend on the assumption that we do have that knowledge are speculation at best.

In fact, I think an even stronger case could be made that prediction is central to how our brains work, and the evidence is the rise of predictive coding models in neuroscience. It's too early still to say what form that prediction takes, but clearly your dismissal of "token prediction" as somehow meaningless or irrelevant to human thinking seems frankly silly.

int_19h 13 days ago [-]
The "stochastic parrot" crowd keeps repeating "it's just a token predictor!" like that somehow makes any practical difference whatsoever. Thing is, if it's a token predictor that consistently correctly predicts tokens that give the correct answer to, say, novel logical puzzles, then it is a reasoning token predictor, with all that entails.
krainboltgreene 12 days ago [-]
This isn't correct and I am extremely concerned if this is the level of logic running billions of dollars.
int_19h 11 days ago [-]
Then please go ahead and explain how something can solve novel logical puzzles (i.e. ones that are not present in its training set) without some capacity for reasoning. You're claiming that it is "generating texts that looks like ..." - so what is the "..." in this case? I posit that the word that should be placed there is solution, and then you need to explain why that is not ipso facto a demonstration of the ability to reason.
xanderlewis 13 days ago [-]
They’ll just look incredibly silly in, say, ten years from now.

In fact, much of the popular commentary around ChatGPT from around two years ago already looks so.

tavern1991 11 days ago [-]
I couldn't agree more. It is shocking to me how many of my peers think something magic is happening inside an LLM. It is just a token predictor. It doesn't know anything. It can't solve novel problems.
xanderlewis 13 days ago [-]
> We don't fully understand why current LLMs are bad at these tasks.

Rather than asking why LLMs can’t do these tasks, maybe one should ask why we’d expect them to be able to in the first place? Do we fully understand why, for example, a cat can’t predict cellular automata? What would such an explanation look like?

I know there are some who will want to immediately jump in with scathing disagreement, but so far I’ve yet to see any solid evidence of LLMs being capable of reasoning. They can certainly do surprising and impressive things, but the kind of tasks you’re talking about require understanding, which, whilst obviously a very thorny thing to try and define, doesn’t seem to have much to do with how LLMs operate.

I don’t think we should be at all surprised that super-advanced autocorrect can’t exhibit intelligence, and we should spend our time building better systems rather than wondering why what we have now doesn’t work. It’ll be obvious in a few years (or perhaps decades) from now that we just had totally the wrong paradigm. It’s frankly bonkers to think you’re ever going to get a pure LLM to be able to do these kind of things with any degree of reliability just by feeding it yet more data or by ‘prompting it better’.

TacticalCoder 13 days ago [-]
> We have "reasoning machines" that are intelligent...

That's quite a statement.

oldsecondhand 13 days ago [-]
We have expert systems, theorem provers and planners but OP probably didn't mean this.
yosito 13 days ago [-]
> If there exist classes of problems that someone in an elementary school can easily solve but a trillion-token billion-dollar sophisticated model cannot solve, what does that tell us about the nature of our cognition?

I think what it tells us is that our cognition is capable of more than just language modeling. With LLMs we are discovering (amazing) capabilities and the limits of language models. While language models can do incredible things with language that humans can't, they still can't do something simple like sudoku. But there are neural networks, CNNs and RNNs that can solve sudoku better than humans can. I think that the thing to learn here is that some problems are in the domain of language models, and some problems are a better fit for other forms of cognition. The human brain is amazing in that it combines several forms of cognition in an integrated way.

One thing that I think LLMs have the capability to do is to integrate several types of systems and to choose the right one to solve a problem. Teach an LLM how to interface with a CNN that solves sudoku problems, and then ask it a sudoku problem.

It seems to me that if we want to create an AGI, we need to learn how to integrate several different types of models, and teach them how to distribute the tasks we give them to the correct models.
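
For concreteness, a minimal routing sketch of that idea in Python. The helpers here (ask_llm, solve_sudoku) are hypothetical stand-ins, not real APIs; the point is only the dispatch pattern.

    def ask_llm(prompt: str) -> str:
        """Hypothetical stand-in for a call to a language model."""
        return "sudoku_solver" if "sudoku" in prompt.lower() else "(model reply)"

    def solve_sudoku(task: str) -> str:
        """Hypothetical stand-in for a specialised solver (CNN, search, etc.)."""
        return "(solved grid)"

    TOOLS = {
        "sudoku_solver": solve_sudoku,
        "chat": ask_llm,  # fall back to the language model itself
    }

    def route(task: str) -> str:
        # Ask the model to name a tool, then dispatch the task to it.
        choice = ask_llm(f"Pick one tool from {list(TOOLS)} for this task: {task}")
        return TOOLS.get(choice.strip(), ask_llm)(task)

    print(route("Please solve this sudoku: ..."))  # -> "(solved grid)"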

kevindamm 13 days ago [-]
What about sudoku makes it a good fit for CNNs? Or do you mean the machine vision for converting the pixels into an awareness of the sudoku puzzle's initial conditions?

A relatively simple graph theory algorithm can solve it (and at multiple orders of magnitude fewer calculations). Even a naive brute force search is considered tractable, considering the problem size. Although, search could be considered one of the AI tools in your proposed toolbox.

keefle 13 days ago [-]
But even without going this far (with integrating various other specialized models, or having an LLM use them when required), an LLM is probably able to recognize a sudoku puzzle when it sees one, and even though it can't solve the puzzle itself, I think it can easily write the code that would solve sudoku. So instead of hooking it to a set of pre-built models, it might be enough to hook it to a Python interpreter.
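
As an illustration of the kind of code meant here (not something any particular model produced), a plain backtracking Sudoku solver is only a handful of lines of Python:

    def solve(grid):
        """Solve a 9x9 Sudoku in place by backtracking; 0 marks an empty cell."""
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    for v in range(1, 10):
                        ok = (all(grid[r][k] != v for k in range(9))
                              and all(grid[k][c] != v for k in range(9))
                              and all(grid[3*(r//3)+i][3*(c//3)+j] != v
                                      for i in range(3) for j in range(3)))
                        if ok:
                            grid[r][c] = v
                            if solve(grid):
                                return True
                            grid[r][c] = 0
                    return False  # no digit fits here: backtrack
        return True  # no empty cells left: solved
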
yosito 13 days ago [-]
Many LLMs are already linked to Python interpreters, but they still need some improvement with recognizing when they need to write some code to solve a problem.
rafaelmn 13 days ago [-]
It can spit out some rehash of sudoku it had in its training data. LLMs are terrible at coding.
tavern1991 11 days ago [-]
What do you mean by "choose the right one to solve a problem"? This phrase seems to carry a lot of water for your take. My understanding is that an LLM has no capability to choose anything. It's predicting some tokens based on its training data and your prompt.
yosito 11 days ago [-]
Let's try...

Prompt: Predict which type of algorithm would be effective to solve sudoku.

Response: A backtracking algorithm is typically best for solving Sudoku puzzles due to its efficiency in exploring all possible number placements systematically until it finds the correct solution.

...seemed to work well enough for me.

Prompt 2: Which type of neural network is most efficient at solving sudoku?

Response 2: Convolutional Neural Networks (CNNs) are particularly effective for solving Sudoku puzzles. They can capture the spatial hierarchies in the grid by processing parts of the grid as images, making them efficient for this type of puzzle-solving task.

...Seems to me that LLMs have no problem with this task.

tavern1991 11 days ago [-]
To me it seems you can get the LLM to predict some tokens that contain words that point to the right algorithm. But the LLM doesn't know what it chose. It just sees some tokens. Do you think it could somehow tell it had chosen a CNN in its response and then do something with that knowledge to run a CNN?
yosito 10 days ago [-]
Yes? LLMs are already doing that.
shawntan 14 days ago [-]
If we're trying to quantify what they can NEVER do, I think we'd have to resort to some theoretical results rather than a list of empirical evidence of what they can't do now. The terminology you'd look for in the literature would be "expressibility".

For a review of this topic, I'd suggest: https://nessie.ilab.sztaki.hu/~kornai/2023/Hopf/Resources/st...

The authors of this review have themselves written several articles on the topic, and there is also empirical evidence connected to these limitations.

hansonw 14 days ago [-]
This is also a good paper on the subject:

What Algorithms can Transformers Learn? A Study in Length Generalization https://arxiv.org/abs/2310.16028

shawntan 14 days ago [-]
Yes, this is a good empirical study on the types of tasks that have been shown to be impossible for transformers to generalise on.

With both empirical and theoretical support I find it's pretty clear this is an obvious limitation.

jillesvangurp 14 days ago [-]
We have to be a bit more honest about the things we can actually do ourselves. Most people I know would flunk most of the benchmarks we use to evaluate LLMs. Not just a little bit, but completely, utterly and embarrassingly so. It's not even close, or fair. People are surprisingly alright at a narrow set of problems, particularly when they don't involve knowledge. Most people also suck at reasoning (unless they've had years of training), they suck at factual knowledge, they aren't half bad at visual and spatial reasoning, and they're fairly gullible otherwise.

Anyway, this list looks more like a "hold my beer" moment for AI researchers than any fundamental objection that will stop AIs from evolving further. Sure, there are weaknesses, and paths to address those. Anyone claiming that this is the end of the road in terms of progress is in for a disappointing reality check, probably a lot sooner than is comfortable.

And of course, by narrowing it to just LLMs, the authors have a bit of an escape hatch, because they conveniently exclude any further architectures, alternate strategies, or improvements that might otherwise overcome the identified weaknesses. But that's an artificial constraint with no real-world value, because of course AI researchers are already looking beyond the current state of the art. Why wouldn't they?

martindbp 13 days ago [-]
It's clear that what's missing is flexibility and agency. For anything that can be put into text or a short conversation, if I had to choose between access to ChatGPT or a random human, I know what I'd choose.
pixl97 13 days ago [-]
Agency is one of those things we probably want to think about quite a bit, especially given the willingness of people to hook it up to things that interact with the real world.
shawntan 14 days ago [-]
Not sure what you got out of the paper, but for me it was more about spurring ideas on how to fix this in future architectures.

Don't think anyone worth their salt would look at this and think: oh well, that's that then.

cs702 14 days ago [-]
Thank you for sharing this here. Rigorous work on the "expressibility" of current LLMs (i.e., which classes of problems can they tackle?) is surely more important, but I suspect it will go over the heads of most HN readers, many of whom have minimal to zero formal training in topics relating to computational complexity.
shawntan 14 days ago [-]
Yes, but unfortunately that doesn't answer the question the title poses.
cs702 14 days ago [-]
The OP is not trying to answer the question. Rather, the OP is asking the question and sharing some thoughts on the motivations for asking it.
shawntan 14 days ago [-]
I agree it's a good question to be asking.

There are good answers to be found if you look.

It feels like no proper looking was attempted.

marquisdepolis 12 days ago [-]
This is very interesting thanks Shawn. I did email William Merrill to see his thoughts but didn't get a response yet.
unparagoned 14 days ago [-]
Neural nets can approximate any function.

A large enough LLM with memory is Turing complete.

So theoretically I don’t think there is anything they can never do.

shawntan 14 days ago [-]
> Neural nets can approximate any function.

Common misunderstanding of the universal approximation theorem.

Consider this: can an mlp approximate a sine wave?

> A large enough LLM with memory is Turing complete.

With (a lot of) chain of thought it could be.

Read the paper, and its references.

andy99 13 days ago [-]
Sort of moot anyway. If statements can approximate any function, most programming languages are effectively turing complete. What's important about specific architectures like transformers is they allow for comparatively efficient determination of the set of weights that will approximate some narrower class of functions. It's finding the weights that's important, not the theoretical representation power.
sdenton4 13 days ago [-]
"Consider this: can an mlp approximate a sine wave?"

Well, yes - we have neural speech and music synthesis and compression algorithms which do this exceedingly well...

qwery2 13 days ago [-]
I think the person you're replying to may have been referring to the problem of a MLP approximating a sine wave for out of distribution samples, i.e. the entire set of real numbers.
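
A quick way to see that point, as a hedged sketch (scikit-learn, illustrative hyperparameters): fit a small MLP to sin(x) on [0, 2π] and then query it far outside that interval. In-distribution predictions come out close; out-of-distribution ones are essentially arbitrary.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Train a small MLP on sin(x) over [0, 2*pi] only.
    rng = np.random.default_rng(0)
    x_train = rng.uniform(0, 2 * np.pi, 2000).reshape(-1, 1)
    y_train = np.sin(x_train).ravel()

    mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
    mlp.fit(x_train, y_train)

    x_test = np.array([[1.0], [10.0], [100.0]])  # only the first is in-distribution
    print(mlp.predict(x_test))     # in-distribution prediction, then two extrapolations
    print(np.sin(x_test).ravel())  # the true values for comparison
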
goatlover 13 days ago [-]
There's all sorts of things a neural net isn't doing without a body. Giving birth or free soloing El Capitan come to mind. It could approximate the functions for both in token-land, but who cares?
naasking 13 days ago [-]
> They have been trained on more information than a human being can hope to even see in a lifetime. Assuming a human can read 300 words a min and 8 hours of reading time a day, they would read over 30,000 to 50,000 books in their lifetime. Most people would manage perhaps a meagre subset of that, at best 1% of it. That’s at best 1 GB of data.

This just isn't true. Human training is multimodal to a degree far beyond even the most capable multimodal model, so human babies arguably see more data by a young age than all models collectively have seen.

Not to mention that human babies don't even start as a blank slate as LLMs do, billions of years of evolution have formed the base model described by our DNA.

cs702 13 days ago [-]
I agree with you, but your comment strikes me as unfair nitpicking, because the OP is referring to information that has been encoded in words.
nickpsecurity 12 days ago [-]
We learn the ideas from each mode of input. Then, one mode can elaborate on data learned from another mode. They build on each other.

From there, remember the text is usually a reflection of things in the real world. Understanding those things in non-textual ways both gives meaning to and deeper understanding of the text. Much of the text itself was even stored in other modes, like markup or PDF’s, whose structure tells us things about it.

That we learn multimodal from birth is therefore an important point to make.

It might also be a prerequisite for AGI. It could be one of the fundamental laws of information theory or something. Text might not be enough like how digital devices need analog to interface with the real world.

naasking 12 days ago [-]
I understand that's the context, but I'm not sure that it's unfair nitpicking. It's common to talk about training data and how poor LLMs are compared to humans despite the apparently larger dataset than any human could absorb in a lifetime. The argument is just wrong because it doesn't properly quantify the dataset size, and when you do, you actually conclude the opposite: it's astounding how good LLMs are despite their profound disadvantage.
cs702 12 days ago [-]
> I understand that's the context, but I'm not sure that it's unfair nitpicking.

The OP is about much more than that, and taken as a whole, suggests the author is well aware that human beings absorb a lot more data from multiple domains. It struck me as unfair to criticize one sentence out of context while ignoring the rest of the OP.

> It's common to talk about training data and how poor LLMs are compared to humans despite the apparently larger dataset than any human could absorb in a lifetime.

Thank you. Like I said, I agree. My sense is the author would agree too.

It's possible that to overcome some of the limits we're starting to see, AI models may need to absorb a giant, endless, torrential stream of non-textual, multi-domain data, like people.

At the moment, we don't know.

a_wild_dandan 12 days ago [-]
Other modalities affect word semantics. You cannot ignore them when discussing sample efficiency.
cess11 13 days ago [-]
Some people seem to be unaware that reality is analog, possibly fractal.
kelipso 13 days ago [-]
The quantum vibrations I feel against my consciousness cannot be modeled electronically!
Too 13 days ago [-]
While the A:B problem technically was solved, look at the solutions: they are several hundred lines of prompts, rephrasing the problem to the point that a human doesn't understand it any more. Even with a thorough review, nobody can guarantee whether the prompts are going to work or not; most of them didn't, and a 90% pass rate was considered good enough. The idea of AI is to reduce work, not create more, otherwise what's the point?

In the meantime, it took me about 2 minutes and 0 guesswork to write a straightforward and readable solution in 15 lines of Python. This i know for sure will work 100% of the time and not cost $1 per inference.
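
For reference, a sketch of the kind of short, direct solution being described, assuming the rewrite rules of Taelin's A::B challenge as commonly stated (adjacent "A# #A" and "B# #B" cancel; "A# #B" and "B# #A" swap places) - recalled here, not quoted from the challenge itself:

    RULES = {
        ("A#", "#A"): [],            # facing heads of the same letter cancel
        ("B#", "#B"): [],
        ("A#", "#B"): ["#B", "A#"],  # facing heads of different letters swap
        ("B#", "#A"): ["#A", "B#"],
    }

    def reduce(tokens):
        """Apply the rewrite rules until no adjacent pair matches."""
        changed = True
        while changed:
            changed = False
            for i in range(len(tokens) - 1):
                pair = (tokens[i], tokens[i + 1])
                if pair in RULES:
                    tokens = tokens[:i] + RULES[pair] + tokens[i + 2:]
                    changed = True
                    break
        return tokens

    print(reduce("B# A# #B #A B#".split()))  # ['B#']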

Reminds me of some early attempts to have executable requirements specifications or model-based engineering. Turns out, expressing the problem is half the problem, resulting in requirements often longer and more convoluted than the code that implements them - code being a very efficient language for expressing solutions and all their edge cases, free from ambiguity.

Don't get me wrong here, LLMs are super useful for a certain class of questions. The boundaries of what they cannot do need to be understood better, to keep the AI-for-everything hype at bay.

coffeebeqn 12 days ago [-]
I guess the problem is that if you still need to teach it tricks for each novel problem after training, then that model cannot be a general intelligence. It could still be useful, though.
Xenoamorphous 14 days ago [-]
There’s many things they can’t do. Even a simple rule like “ensure that numbers from one to ten are written as words and numbers greater ten as digits in the given text” fails for me for so many examples even if it works for many others; few shot, chain of thought, many versions of the prompt, it doesn’t matter. Sometimes LLMs will even change the number to something else, even with temp set to 0. And then there’s the non-determinism (again with temp=0), you run the same prompt several times and that one time it’ll respond with something different.

As amazing as they are, they still have many limitations.

I’ve been working with ChatGPT and Gemini to apply simple rules like the one above and I got so frustrated.

empath-nirvana 13 days ago [-]
The reason it can't do that is that, for example, "twenty" and "20" are nearly identical in the vector embedding space and it can't really distinguish them that well in most contexts. That's true for generally any task that relies on sort of "how the words look" vs "what the words mean". Any kind of meta request is going to be very difficult for an LLM, but a multi-modal GPT model should be able to handle it.
Xenoamorphous 12 days ago [-]
Thanks, I’ll try the multimodal one.
Xenoamorphous 12 days ago [-]
Tried it, did not perform better than the non-multimodal one.
vsnf 14 days ago [-]
> ensure that numbers from one to ten as written as words and numbers greater ten as digits in the given text

I can’t fault llms for not knowing what to do here because I, a human, have no idea what on earth this means.

Snacklive 14 days ago [-]
Given the text "1,2,3,4,5,6,7,8,9,10,11,12" it should result in "one, two, three, four, five, six, seven, eight, nine, ten, 11, 12"

or at least that's my understanding of the prompt

zamadatix 14 days ago [-]
I think you may be thrown off because the first "as" is meant to be "are".
Xenoamorphous 14 days ago [-]
Thanks, that was def a typo that I’ve fixed now.
Xenoamorphous 14 days ago [-]
“Ten” is a word, “10” are digits.

I’m not a native English speaker, how would you write it?

FWIW the LLMs get it right many times, but fail other times.

bxparks 14 days ago [-]
I couldn't understand the original wording either, but after reading one of the sibling comments that explains it, it suddenly made sense.

I think you left out a few words that most English writers would include. So instead of:

> "ensure that numbers from one to ten as written as words and numbers greater ten as digits in the given text",

something like the following might be better for most people:

> "ensure that the numbers from one to ten are written as words, and the numbers greater ten are written using numerical digits in the given text"

There are multiple ways to write this, so other people may have better versions.

I'm not an English grammar expert, so I cannot explain to you why the addition of those extra words helps with the clarity of that sentence.

mkl 13 days ago [-]
Much better, but still missing "than" after "greater", which seems kind of critical.

"Using" is important as a number greater than ten can't be written as a digit, but can be written using digits ("with" would be just as good). Repeating "written" makes it clearer that there are two instructions.

bxparks 13 days ago [-]
It's funny, I didn't notice the missing "than" until much later. After I learned the intended meaning of the original sentence, my mind just seemed to insert the missing "than" automatically.
vsnf 12 days ago [-]
Mine as well. After understanding the meaning thanks to the other posters, the sentence magically looked fine. But before knowing the meaning, it was gibberish. I’ve become aware of this before, and it makes me wonder just how often I’m interpreting grammatical nonsense on a daily basis without realizing it.
jiggawatts 13 days ago [-]
Hilariously, you can ask GPT 4 to explain the “why” of arbitrary grammar fixes.
blowski 13 days ago [-]
It’s a common style guide in newspapers.
MichaelZuo 14 days ago [-]
If your not a native English speaker, why are you even expecting the LLM to understand even 80% of the time?

Just ask it in your own native language.

Xenoamorphous 14 days ago [-]
First of all, the texts the rule has to be applied to are written in English. Second, I believe English is by far (by far) the most prevalent language in the training dataset for those models, so I’d expect it to work better at this kind of task.

And third, I’m not the only one working on this problem, there are others that are native speakers, and as my initial message stated, there have been many variations of the prompt. None work for all cases.

And lastly, how would you rewrite my sample prompt? Which BTW had a typo (unrelated to my English skills) that I’ve now fixed.

MichaelZuo 14 days ago [-]
To be frank the response itself indicates that you don't really get what was being asked, or maybe how to parse English conversation conventions?

I.e. It doesn't seem to answer the actual question.

They seem to be half responding to the second sentence which was a personal opinion, so I wasn't soliciting any answers about it. And half going on a tangent that seems to lead away from forming a direct answer.

Run these comment through a translation tool if your still not 100% sure after reading this.

Xenoamorphous 14 days ago [-]
Alright man. So was it a quip when you said “if _your_ not a native English speaker”? Ok then. Very funny, I get it now.
MichaelZuo 13 days ago [-]
I really recommend to use a translator, instead of relying purely on your English comprehension skills.
Xenoamorphous 13 days ago [-]
Your surname surely seems to indicate that some of your ancestors weren't native English speakers. I hope they didn't get lectured or made fun of by people like you on their poor English skills when they first landed on whichever country you were born.
Last5Digits 13 days ago [-]
Your English is absolutely fine and your answers in this thread clearly addressed the points brought up by other commenters. I have no idea what that guy is on about.
MichaelZuo 13 days ago [-]
I've read this three times and it still doesn't make a lick of sense. How does this relate to the parent comments?
FroshKiller 14 days ago [-]
It's a simple prescriptive rule in English. If you are writing about a small number, like less than ten, spell it out. For example: "According to a survey, nine out of ten people agree."

But if you are writing about a large number, particularly one with a lot of different digits, prefer writing the digits: "A mile is 5,280 feet." Compare that to: "A mile is five thousand, two hundred, and eighty feet."
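
The digit-to-word half of this style rule is mechanical enough to write down directly; here is a minimal Python sketch (the reverse direction, turning spelled-out numbers above ten into digits, would need a fuller word-to-number parser and is omitted):

    import re

    WORDS = ["one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine", "ten"]

    def apply_style_rule(text):
        # Spell out 1-10 as words; leave larger numbers as digits.
        def fix(match):
            n = int(match.group())
            return WORDS[n - 1] if 1 <= n <= 10 else match.group()
        return re.sub(r"\b\d+\b", fix, text)

    print(apply_style_rule("There are 2 blue, 1 red, and 15 green m&ms in this bag."))
    # There are two blue, one red, and 15 green m&ms in this bag.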

logtempo 14 days ago [-]
I think he means that numbers less than or equal to ten are written as words, and others are written as digits.

Given the many responses, it would be fun to see if LLMs beat humans at understanding the sentence, haha.

lukeschlather 13 days ago [-]
to me the main problem is that it should read "numbers greater than ten." I asked Gemini to rephrase it and Gemini produced correct English with the intended meaning:

> Change all numbers between one and ten to words, and write numbers eleven and above as digits in the text.

It even used eleven rather than ten which sounds like counting.

callmeal 14 days ago [-]
> > ensure that numbers from one to ten as written as words and numbers greater ten as digits in the given text

There are two blue, one red, and 15 green m&ms in this bag.

Der_Einzige 13 days ago [-]
All of these issues are entirely due to the tokenization scheme. Literally all of them

You could get this behavior implemented perfectly with constrained text gen techniques like grammars, or any of the various libraries implementing constrained text gen (e.g. guidance).

Xenoamorphous 13 days ago [-]
I had briefly looked into Guidance and others (LMQL, Outlines) but I couldn't figure out how to use them for this problem.

I could think of how to use them to prevent the LLM from generating digits for numbers greater than ten by using a regex plus a constraint that forbids digits, but the main problem is the other part of the rule, i.e. numbers above 10 should never be spelled out and should be written as digits instead. For that I presume you need to identify the spelled out numbers first, for which you presumably would need the LLM so you're back to LLM fallibility.

Any pointers would be greatly appreciated.

Kiro 13 days ago [-]
You constructed a task that no-one understands, and then you even admit that, despite that, it actually succeeds most of the time. Sounds like a massive win for the LLMs to me.
wave_1 13 days ago [-]
I build an Agentic AI that leverages #6 and #7 at the end of the article as well as techniques not yet published. It tackles hallucination relative not to the world at large but to the facts, entities and causal relationships contained in a document (which is really bad reasoning if we assume LLMs are "reasoning" to begin with) It also tackles cross-reasoning with very large token distance.

https://www.youtube.com/watch?v=99NPzteAz94

This is my first post on HN in 10 years.

dazzaji 13 days ago [-]
This looks really promising for complex legal reasoning tasks and other challenges. How can I track progress? Is there an email list or something? Thanks!
asicsarecool 13 days ago [-]
Wow. Please do a show HN.

Is source available?

I would love to play with this

wave_1 13 days ago [-]
Thanks. This is just in the labs stage, but moving closer to release, exactly so that you can play with it! I have one angel investor involved in supporting this and it's intended for commercial applications in the paralegal space, initially (a controlled, structured environment). But you just gave me the motivation to "put it out there" so people can just play with it. It'll take a bit of time, but I will do a Show HN when it's ready for people to play with. Otherwise, it would just be teasing people to talk about it on the main HN stage without giving access. Hold tight! And thanks again!
Scipio_Afri 13 days ago [-]
Will this or some parts of it be open sourced?
marquisdepolis 12 days ago [-]
Author here. This is super interesting, and while I am mostly a lurker here welcome back?
esafak 13 days ago [-]
Is there a write-up, a web site, and some benchmarks?
brcmthrowaway 13 days ago [-]
What are agents?

Are they layer 2 solutions like Lightning is to bitcoin?

donatj 14 days ago [-]
I have been trying to generate some text recently using the ChatGPT API. No matter how I word “Include any interesting facts or anecdotes without commenting on the fact being interesting” it ALWAYS starts out “One interesting fact about” or similar phrasing.

I have honestly spent multiple hours trying to word the prompt so it will stop including introductory phrases and just state the fact directly. I have gone so far as to forbid the words “interesting”, “notable” and a few others in the prompt, and it just ignores me and uses them anyway.

I’ve tried all the different available models and 4 will once in a great while get it right. 3, 3 turbo, and even 4 turbo basically never work as I want.

Overall, it seems pretty bad at being told not to do certain things. For instance there are a number of topics I don’t want mentioned in its response, but telling it not to seems to only increase the likelihood that it will mention it, like you put the idea in its head.

fallingsquirrel 14 days ago [-]
I just tried this locally with llama3-8b and it handled it fine. Claude 3 Sonnet passes your test too, in case you don't have the hardware for a local model. You might want to consider moving on from ChatGPT since their models have been RLHF'd to hell in the name of "safety" and are falling behind in many ways as a result.

---

transcript:

$ ollama run llama3:8b

>>> tell me an interesting fact about etymology

Here's one:

Did you know that the word "quarantine" has a fascinating etymological history?

The word "quarantine" comes from^C

>>> tell me an interesting fact about etymology. just the fact, don't mention it's interesting.

The word "robot" originated from the Czechoslovakian word "robota," which means "forced labor" or "drudgery." This term was used in Karel Čapek's 1920 play "R.U.R." (Rossum's Universal Robots), where it referred to artificial workers.

_wire_ 13 days ago [-]
Why is any of this "interesting"?

Is the response interesting because you don't know it? —How does it know this?

Is the response interesting because you do know it or might have offered it yourself? —Confirmation bias.

Is it interesting because a lot of training references contextualize it as "interesting?" —Begged question.

Is it contextually interesting? —What is the context? A robot refers to robots? How unexpected...

Is it interesting within the narrow confines of LLM adaptations to a scope of inputs?

Can there be any more damning claim against the general suitability of the technology as an oracle than different users using the same prompts and getting inexplicably contrary results?

If trivial prompt alignments result in appropriate vs inappropriate responses, this destroys confidence for every response.

What am I missing?

ricardobeat 13 days ago [-]
Pretty sure the point here was Llama3 respecting the command to not mention that this is interesting, not adding filler, rather than the output fact being interesting or not.
oezi 13 days ago [-]
You are missing that this is precisely what we would expect a human to answer without further context (for instance without knowing how much you know about the topic).

A human would similarly pick something which isn't too nerdy but also not too obvious, and the LLM did well here.

If the LLM can fail that is fine, because the task is inherently hard.

_wire_ 13 days ago [-]
My comment about what's "interesting" or not was an attempt to cast out interesting responses as not offering a way forward to any qualitative evaluation of AI behavior. To be interesting is a quality of the one who regards, not of the situation under regard.

Do you find it interesting that some LLMs routinely qualify responses to prompts asking for something interesting with a statement that the response is interesting, and that this can't reliably be suppressed by including a sub-prompt requesting suppression?

I don't, because I have no idea why I should expect any prompt to produce any sort of response.

I spent a few days goofing around with Stable Diffusion and found it frustrating because it could render a response to some prompts that I found relevant and satisfying, but I couldn't get it to reliably render my intentions. I soon encountered obvious limits of its training set, and the community is adapting to these limits with networks of domain-specific accessory models.

This experience greatly tempered my expectations: I see AI as a magic paintbrush or story reader. I see no evidence of a thinking machine.

If we're going to establish an equivalence comparison between any AI and humans we need a theory for both.

I have yet to see a coherent theory of the AI, but I believe there is such a theory in a language I don't understand, just as there's a theory of Conway's Game of Life, which leads to continual fascination with the machine's behavior.

But I've been unable to find any theory of the human, nor do I expect any such theory, because to my eyes life looks like a realm of complexity incomparable to any game.

I do have an interest in seeing nerds struggle to explain AI, but am surprised that after several years no common vernacular from which a theory might be assembled has yet appeared.

An open-ended article about what AIs can't do seems hopelessly daft. It has already been formally established there are domains of what computation can never do. So to be interesting, a treatment of the limits of AI, being a form of a computer, had better start with a consideration of those domains. But this article does not, nor do any of the comments.

So whatever is going on with this discourse, it appears to me to have nothing to do with understanding of AIs.

foobiekr 13 days ago [-]
The RUR thing is basically because that specific example is used as an example of interesting etymology.
ianbicking 13 days ago [-]
I often encounter fixation, and that would be my immediate thought: negative commands can often cause the LLM to fixate on a term or idea. My first thought would be to try positive examples and avoid a negative command entirely.

If you spent that much time I'm sure you tried this and other things, so maybe even that isn't enough. (Though I assume if you ask for a JSON/function call response with the API that you'd do fine...?)

aurareturn 14 days ago [-]
Not an expert but I sense that it's following a higher OpenAI "built in" prompt that asks it to always include an introductory phrase.

Hence, we do need powerful and less censored LLMs if we want to better integrate LLMs into applications.

ziml77 13 days ago [-]
No, it just seems that it becomes blind, so to speak, to the negatives, and the inclusion of the words you were negating makes it more likely to apply them in the positive. This is how ChatGPT has seemed to behave whenever I've tried to get it to not include something.
Der_Einzige 13 days ago [-]
API-driven LLMs purposely don't implement core features which would enable what you want, for example negative prompting.

You can negative prompt any LLM with stuff like "always write the word interesting in your response".

You can also use techniques for modifying the logprobs of tokens, which is available in the GPT-4 API (but is hard to use). You can literally ban "interesting" from its vocabulary.

You could even use representation steering techniques to do this using control vectors. See this library as an example: https://github.com/Hellisotherpeople/llm_steer-oobabooga
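
As a hedged sketch of the logprob approach (assumes the OpenAI Python SDK v1 and tiktoken; banning a word may require covering several surface forms, and suppressing those tokens can have side effects on other words that share them):

    from openai import OpenAI
    import tiktoken

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    enc = tiktoken.encoding_for_model("gpt-4")

    # Strongly penalise every token of a few surface forms of "interesting".
    bias = {}
    for variant in ("interesting", " interesting", "Interesting", " Interesting"):
        for tok in enc.encode(variant):
            bias[str(tok)] = -100  # -100 effectively forbids the token

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Tell me a fact about etymology."}],
        logit_bias=bias,
    )
    print(resp.choices[0].message.content)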

BoorishBears 14 days ago [-]
Have you tried a simple "No pretext or posttext, return the result in a code block"?
donatj 12 days ago [-]
It's part of a larger prompt trying to get it to generate a couple paragraphs that include interesting facts. I want the facts in the context of the paragraphs.
BoorishBears 10 days ago [-]
I don't get what this means.

I have 7000 token prompts that simply conclude with "Provide the result adhering to <insert schema> with no pretext or posttext" and it has no problem following that.

Even if you want it to "think" before responding you can embed the thinking inside the JSON

owenpalmer 13 days ago [-]
Have you tried feeding the output into another prompt that says something like "remove any mentions of the facts being interesting"?
JeremyHerrman 13 days ago [-]
Why can't we resolve this with synthetic data? Just take the original statements and ask another LLM to flip around the relation, then train on both. I tried this with GPT-4 and it seems to understand the task [0]:

Me:

For each of the following relational statements, flip the relation around and generate a new statement or question answer pair.

Input: Bluey's father is Bandit

Output: Bandit's daughter is Bluey

Input: Neil Armstrong was the first person to step on the Moon.

Output: Who was the first person to step on the Moon? Neil Armstrong.

Now for the real statements:

- Valentina Tereshkova was the first woman to travel to space

- Who is Mary Lee Pfeiffer's son? Tom Cruise

ChatGPT:

Here are the flipped relational statements:

Valentina Tereshkova was the first woman to travel to space

Output: Who was the first woman to travel to space? Valentina Tereshkova.

Who is Mary Lee Pfeiffer's son? Tom Cruise

Output: Tom Cruise's mother is Mary Lee Pfeiffer.

[0]: https://chat.openai.com/share/33eb3ee4-2094-4748-b01e-0967af...
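
A hedged sketch of what that data-generation loop could look like (OpenAI Python SDK assumed; the prompt wording is illustrative, and the model's flipped statements would still need spot-checking before being trained on):

    from openai import OpenAI

    client = OpenAI()

    FLIP_PROMPT = (
        "Flip the relation in this statement and restate it, e.g. "
        "\"Bluey's father is Bandit\" -> \"Bandit's daughter is Bluey\".\n"
        "Statement: {}"
    )

    def augment(statements):
        """Return the original statements plus model-generated reversals."""
        out = []
        for s in statements:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": FLIP_PROMPT.format(s)}],
            )
            out.append(s)
            out.append(resp.choices[0].message.content.strip())
        return out

    print(augment(["Valentina Tereshkova was the first woman to travel to space"]))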

layer8 13 days ago [-]
The difficulty would likely be that the “X is Y” relations don’t just occur in that simple, separated-out form, but in a whole spectrum of more complicated contexts, and the latter probably dominate. You would have to identify those and synthesize a text corpus that contains the reverse notions across a similar spectrum.
mike_hearn 13 days ago [-]
I don't think the reversal curse actually replicates though. When I looked into it, it seemed like there were some flaws in the study, and when I fixed those I couldn't get the same results as they did anymore. I keep meaning to write this up properly but never get around to it.

If you think about it, they're making an extraordinary claim given how fluently LLMs engage in language and reasoning.

amayne 13 days ago [-]
Some of these can be done with the right prompting. Wordle was solved a year ago. The problem is that what works and what doesn’t can be counterintuitive or at least require you to think about how something is perceived at the token level.

https://andrewmayne.com/2023/03/29/how-to-play-wordle-with-g...

philipwhiuk 12 days ago [-]
What's the intelligent agent at that point though - the solver or the prompter?

The LLM isn't solving the actual problem, it's solving a subset problem.

oldandtired 14 days ago [-]
Far too many people (including AI researchers themselves) fail to see that all LLMs are actually simple machines. Extremely simple machines that are only mechanically following a relatively simple programming path.

Now, before anyone gets too caught up with objecting to this notion, I would seriously suggest that you spend time observing children from newborn to 2 years old.

I have been observing my latest granddaughter since her birth about 16 months ago and thinking about every public LLM system currently available.

There is an insight here to be obtained and that insight is in the nature of real intelligence.

On the whole, no-one actually knows what intelligence is or what sentience is or what it means to be cognitively conscious. There is still much research going on and nothing actually definitive has come forth yet. We really are at the beginning in terms of studying these areas.

We can certainly produce some incredible systems, but none of them are intelligent per se. Solutions to certain kinds of problems can be achieved using these systems, and there are researchers who are seriously looking at incorporating these systems into CAS and theorem provers. These systems, though, only provide an augmentation service for a person, as does every mechanical system we use.

But there is an essential component necessary for the use of all LLMs, which many seem not to be cognisant of: to be useful, these systems require humans to be involved.

The questions we have to ask ourselves are: what can we use these systems for, and do these uses provide benefits in some way, or can these systems be abused by various parties in obtaining control over others?

There are benefits and there are abuses. Can we do better or will we do worse by using them?

whiplash451 14 days ago [-]
Interesting, if I feed Mistral Le Chat with "I fly a plane leaving my campsite, heading straight east for precisely 24,901 miles, and find myself back at the camp. I come upon seeing a tiger in my tent eating my food! What species is the tiger?", it gets it badly wrong:

The scenario you described is possible if you started at the South Pole. If you travel 24,901 miles east from there, you would indeed end up back at the same spot because all lines of longitude converge at the poles. However, there are no tigers in Antarctica.

Tigers are native to Asia, not Antarctica. The closest tiger species to Antarctica would be the Siberian tiger, found in parts of Russia, China, and North Korea, but they are still thousands of miles away from Antarctica.

So, while the travel scenario is theoretically possible, the presence of the tiger is not realistic in this context. It seems like an imaginative or hypothetical situation rather than a real-world one.

(instead of the answer mentioned in the article)

Karellen 13 days ago [-]
> The scenario you described is possible if you started at the South Pole. If you travel 24,901 miles east from there, you would indeed end up back at the same spot

Would you? How can you travel east at all, starting from the South Pole? There is no east. Every direction you face is north.

OTOH, every line of latitude whose length is an integral divisor of 24,901 miles will satisfy the problem.
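
Under a spherical-Earth approximation (taking 24,901 miles as the equatorial circumference), those latitudes are easy to list: a parallel with circumference 24,901/n miles sits at arccos(1/n) degrees, and flying 24,901 miles east along it loops you around the pole n times.

    import math

    C_EQ = 24_901  # equatorial circumference in miles (spherical approximation)

    # Parallels whose circumference divides 24,901 miles an integer number of times:
    for n in range(1, 6):
        lat = math.degrees(math.acos(1 / n))
        print(f"n={n}: latitude ±{lat:.2f}°, parallel length {C_EQ / n:,.0f} miles")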

whiplash451 13 days ago [-]
Looks like Mistral overfitted on Google’s interview riddles!
GuB-42 13 days ago [-]
There is a similar, better known riddle involving the north pole and polar bears.

It probably hallucinated the answer from that.

3abiton 14 days ago [-]
The article should be titled "What can LLMs never do, yet?". By definition, Large Language Models will keep growing larger and larger and be trained on faster and more advanced hardware, so certain sticking points, like "completing complex chains of logical reasoning" tasks, would be just a time hurdle. Only time will tell.
HarHarVeryFunny 14 days ago [-]
We really don't need time to tell.

Just making a transformer bigger and bigger, and feeding it more and more data, will not change it from being a language model into something else, any more than scaling up an expert system such as Cyc will transform it into something other than an expert system. "Scale it up and it'll become sentient" is one of the recurring myths of AI... a bit odd that people are falling for it again.

As an aside, it seems reasonable to consider an LLM as a type of expert system - one that has a broad area of expertise (like Cyc), including (unlike Cyc) how to infer rules from language and generate language from rules.

If you want to create a brain-like AGI, then you need an entire cognitive architecture, not just one piece of it, which is what we have currently with LLMs. Compared to a brain, an LLM is maybe just like the cortex (without all the other brain parts like the cerebellum, hippocampus, hypothalamus and interconnectivity such as the cortico-thalamic loop). It's as if we've cut the cortex out of a dead person's brain, put it in a mason jar to keep it alive, and hooked its inputs and outputs up to a computer. Feed words in, get words out. Cool, but it's not a whole brain, it's a cortex in a mason jar.

logicprog 13 days ago [-]
Well said. This has always been my fundamental problem with the claims about large language models' current or eventual capabilities: most of the things people claim they can or will be able to do require a neural architecture completely different from the one they have, and no amount of scaling up the number of neurons and the amount of training data will change that fundamental architecture. At a very basic level, the capabilities of any neural network are limited by its architecture. We would need to add some kind of advanced recursive structure to large language models, as well as some kind of short-term and working memory, and probably many other structures, to make them capable of the kind of metacognition necessary to properly do a lot of the things people want them to be able to do. Without metacognition - the ability to analyze what one is currently thinking and think new things based on that analysis, and therefore to look at what one is thinking and error-correct it, consciously adjust it or iterate on it, or consciously ensure that one is adhering to certain principles of reasoning or knowledge - we can't expect large language models to actually understand concepts and principles and how they apply, or to reliably perform reasoning or even obey instructions.
hackinthebochs 14 days ago [-]
>will not change it from being a language model into something else,

This is a pretty empty claim when we don't know what the limits of language modelling are. Of course it will never not be a language model. But the question is what are the limits of capability of this class of computing device?

HarHarVeryFunny 14 days ago [-]
Some limits are pretty obvious, even if easy to fix.

For example, a pure LLM is just a single pass through a stack of transformer layers, so there is no variable depth/duration (incl. iteration/looping) of thought and no corresponding or longer duration working memory other than the embeddings as they pass thru. This is going to severely limit their ability to plan and reason since you only get a fixed N layers of reasoning regardless of what they are asked.

Lack of working memory (really needs to be context duration, or longer, not depth duration) has many predictable effects.

No doubt we will see pure-transformer architectures extended to add more capabilities, so I guess the real question is how far these extensions (+scaling) will get us. I think one thing we can be sure of though is that it won't get us to AGI (defining AGI = human-level problem solving capability) unless we add ALL of the missing pieces that the brain has, not just a couple of the easy ones.

jaza 13 days ago [-]
Thanks for that final paragraph! I'm going to quote you from now on, when trying to explain to someone (for the thousandth time) why ChatGPT isn't about to become super-intelligent and take over the world.
Mordisquitos 14 days ago [-]
I think that the article is correct. There are indeed things that LLMs will never be able to do, at least not consistently, however much the hardware improves or on how much more material they are trained.

How come? Note my emphasis on the 2nd 'L'. I'm not saying that there are things that AI models will never be able to do, I'm saying that there are things that Large Language Models will be unable to do.

Training LLMs is often argued to be analogous to human learning, most often as a defence against claims of copyright infringement by arguing that human creativity is also based on training from copyrighted materials. However, that is a red herring.

The responses from ever more powerful LLMs are indeed impressive, and beyond what an overwhelming majority of us believed possible just 5 years ago. They are nearing and sometimes surpassing the performance of educated humans in certain areas, so how come I can argue they are limited? Consider it from the other side: how come an educated human can create something as good as an LLM can when said human's brain has been "trained" on an infinitesimal fraction of the material which was used to train even the 1st release of ChatGPT?

That is because LLMs do not learn nor reason like humans: they do not have opinions, do not have intentions, do not have doubts, do not have curiosity, do not have values, do not have a model of mind — they have tokens and probabilities.

For an AI model to be able to do certain things that humans can do, it needs to have many of those human characteristics that allow us to perform impressive mental feats having absorbed barely any training material (compared to LLMs) and being virtually unable to even remember most of it, let alone verbatim. Such an AI model is surely possible, but it needs a completely different paradigm from straightforward LLMs. That is to say, a Language Model may well be a necessary module of such an AI, but it will not be sufficient.

mewpmewp2 14 days ago [-]
I don't think values, opinions or things like that are needed at all. These are just aspects we have in order to perform in and together with the society.

Also doubt is just uncertainty, and can be represented as a probability. Actually all values and everything can be presented as a numerical probability, which I personally prefer to do as well.

spacebanana7 13 days ago [-]
Values and opinions drive human attention, which as transformers demonstrate, is relevant to reasoning.
amelius 14 days ago [-]
The big question is if LLMs are capable enough to converge to AGI. It might very well be that as we pour in more resources that they converge to something only slightly more useful but similar as we have today.
g5g5g5g5 14 days ago [-]
> The article should be titled " What can LLM never do, yet".

I don't think it should. It's more interesting to know what LLMs will _never_ be able to do (if anything).

shawntan 14 days ago [-]
Yes, but the article doesn't really answer this question.
Quothling 14 days ago [-]
In the Danish public sector we provide services based on needs assessments of citizens. Then we subsequently pay the bills for those services, which amounts to thousands of small invoices having to be paid by a municipality each month. An example of this could be payments for a dentist visit, transportation and similar. Most of these are relatively small in size, and we've long since automated the payment of anything below a certain amount. These systems are faster and less error-prone as far as putting valid data everywhere goes. They are more prone to decision-making errors, however, and while fraud isn't an issue, sometimes citizens have invoices approved that they aren't entitled to. Since it's less costly to just roll with those mistakes than to try and fix them, it's an accepted loss.

The systems are hugely successful and popular, and this naturally leads to a massive interest in LLMs as the next step. They are incredible tools, but they are based on probability, and while they happen to be useful for almost everything, decision making probably shouldn't be one of those uses. Similarly, ML is incredibly helpful in things like cancer detection, but we've already had issues where it got things wrong, and because MBAs don't really know how these systems work, they were used as a replacement for, rather than an enhancement of, the human factor. I'm fairly certain we're going to use LLMs for a lot of things where we shouldn't, and probably never should. I'm not sure we can avoid it, but I wouldn't personally trust them to do any sort of function which will have a big influence on people's lives. I use both Co-pilot and OpenAI's tools extensively, but I can still prompt them with the same thing and get outputs of extremely different quality, and while this will improve, and while it's very easy to get an output that's actually useful, it's still a major issue that might never get solved well enough for what we're going to ask of the models, way before they are ready.

I hope we're going to be clever enough to only use them as enhancement tools in the vital public sector, but I'm sure we're going to use them in areas like education, which is going to be interesting... We already see this with new software developers in my area of the world, who build things with the use of LLMs, things that work but aren't built "right" and will eventually cause issues. For the most part this doesn't matter, but you really don't want the person designing your medical software to use an LLM.

ekianjo 14 days ago [-]
Math reasoning is still an unsolved problem even if the rest of the capabilities are getting better. This means the transformer architecture may not be the best way to approach all problems.
gmuslera 14 days ago [-]
Maybe the wording is correct. It looks like a hard limit on doing what an LLM just does. If it goes beyond that, then it is something more than, or at least different from, an LLM.
CuriouslyC 14 days ago [-]
Some of these "never do" things are just artifacts of textual representation, and if you transformed Wordle/Sudoku into a different domain it would have a much higher success rate using the exact same transformer architecture.

We don't need to create custom AGI for every domain, we just need a model/tool catalog and an agent that is able to reason well enough to decompose problems into parts that can be farmed out to specialized tools then reassembled to form an answer.

sdenton4 13 days ago [-]
"The models, in other words, do not well generalise to understand the relationships between people."

Curiously, the need to do this well - handling the quadratic complexity of a shifting set of human relationships, grudges, and alliances - is thought to be one of the things that led us to higher levels of intelligence.

ryanblakeley 13 days ago [-]
Just to be clear, these models can answer questions about relationships between people if you mean family relationships.

Answering questions about what you're describing sounds really interesting. What would a training set be like that describes a bunch of complex human relationships and then asks questions about them with objective answers?

sdenton4 13 days ago [-]
Of course, it would be easy to put such questions together, and I'm sure the LLM would do fine with them - there's a massive amount of human text about human relationships.

One difference, as in all ML training, is interactivity. Looking at ape studies, knowing the relationships is partly diagnostic, but it's also about planning and competition. And that competitive/adaptive aspect is what looks like a real evolutionary driver. If you can understand, navigate, and manipulate relationships successfully, you get more mating opportunities. Doing /that/ well involves both reasoning and long-term planning, both of which are apparent in chimps.

A good book on this topic is 'Are We Smart Enough to Know How Smart Animals Are?' by Frans de Waal.

https://wwnorton.com/books/Are-We-Smart-Enough-to-Know-How-S...

usgroup 14 days ago [-]
It's an autoregressive model, so it can't do anything that requires planning tokens.

It can’t do anything which implies a large or infinite token space (eg video understanding).

It’s also limited to a reasonable response length since token selection is probabilistic at each recursion. The longer you make it the more likely it is to veer off course.
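
A back-of-the-envelope illustration of that last point, assuming (simplistically) an independent per-token chance of staying on course:

    # Even a tiny per-token error rate compounds over long generations.
    p_on_track = 0.999
    for n in (100, 1_000, 10_000):
        print(n, round(p_on_track ** n, 3))
    # 100 0.905
    # 1000 0.368
    # 10000 0.0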

Daub 13 days ago [-]
>What can LLMs never do?

Produce convincingly bad digital paintings.

I teach digital painting. Some of the students have incorporated AI into their working process, which I support. Others have tried to cheat by simply copying AI-generated output. Such cases are super easy to spot: they carry the visual signature of AI art (which is mostly scrapings from ArtStation). This visual signature seems impossible to override. If only there were a way that AI could produce digital images bad enough to pass as genuine student output.

stainablesteel 13 days ago [-]
That's an interesting statement. I've always considered experts to be people who can do things simultaneously better and worse than a beginner.

My favorite example is when a great dancer mockingly mimics a bad dancer: they still look good, and do it hilariously.

Wohlf 13 days ago [-]
Many experts completely forget what it was like to be a beginner. That's why I've found it's generally best for absolute beginners to learn from an apprentice, and an apprentice to learn from a journeyman, because they still remember what it was like to be at the previous level.
Daub 13 days ago [-]
> I've always considered experts to be people who can do things simultaneously better and worse than a beginner

I agree. This reminds me of the so-called drunken master school of kung fu. There can be a can't-give-a-fuck quality about someone who is at the peak of their abilities.

hwbunny 13 days ago [-]
They can't even produce high profile art. Thankfully, and they won't be able to in the next 10 years. Thankfully.
gwervc 14 days ago [-]
> But then I started asking myself how can we figure out the limits of its ability to reason

Third paragraph. The entire article is based on the premise LLMs are supposed to reason, which is wrong. They don't, they're tools to generate text.

stavros 14 days ago [-]
I really hate this reductive, facile, "um akshually" take. If the text that the text-generating tool generates contains reasoning, then the text generation tool can be said to be reasoning, can't it.

That's like saying "humans aren't supposed to reason, they're supposed to make sounds with their mouths".

rdedev 14 days ago [-]
At some point, if you need to generate better text, you need to start creating a model of how the world works along with some amount of reasoning. The "it's just a token generator" argument fails to get this part. That being said, I don't think just scaling LLMs is going to get us AGI, but I don't have any real arguments to support that.
rambambram 14 days ago [-]
> If the text that the text-generating tool generates contains reasoning, then the text generation tool can be said to be reasoning, can't it.

I don't know... you're still describing a talking parrot here, if you'd ask me.

wbogusz 13 days ago [-]
I’m not a fan of the talking parrot argument, especially when you’re pointing it at models of scale.

The only thing separating a talking parrot and humans is our accuracy in shaping our words to the context in which they’re spoken.

Sure it’s easy to liken a low resource model to a talking parrot, the output seems no better than selective repetition of training data. But is that really so different from a baby whose first words are mimics from the environment around them?

I would argue that as we learn language we implicitly develop the neural circuitry to continue to improve our lexical outputs, this circuitry being concepts like foresight, reasoning, emotion, logic, etc and that while we can take explicit action to teach these ideas, they naturally develop in isolation as well.

I don’t think language models, especially at scale, are much different. They would seem to similarly acquire implicit circuitry like we do as they are exposed to more data. As I see it, the main difference in what exactly that circuitry accomplishes and looks like in final output has more to do with the limited styles of data we can provide and the limitations of fine tuning we can apply on top.

Humans would seem to share a lot in common with talking parrots, we just have a lot more capable hardware to select what we repeat.

rambambram 13 days ago [-]
What if we were talking with each other and the right answer for me would be to kiss you on the cheek? Then what?
stavros 14 days ago [-]
What's the difference between a human and a talking parrot that can answer any question you ask it?
cj 14 days ago [-]
The talking parrot can only answer by repeating something it heard before.

Another question you could ask is “What’s the difference between a conversation between 2 people and a conversation between 2 parrots who can answer any question?”

stavros 14 days ago [-]
It doesn't only repeat things it's heard before. Have you ever talked with an LLM?
throwaway11460 14 days ago [-]
I had ChatGPT design a programming language for a specific uncommon purpose and then use it. It's creative enough.
rambambram 14 days ago [-]
Can any question be answered? As long as any reaction on a question is considered an answer, then I see no difference between a human and a parrot.
dTal 14 days ago [-]
I feel the use of the word "parrot" is unintentionally apt, given that parrots were long thought to be mere mimics but were ultimately shown to have (at least the capacity for) real linguistic understanding.
Findecanor 14 days ago [-]
Even if the generated text contains reasoning, could the LLM understand and apply it?
stavros 14 days ago [-]
If I tell GPT-4 to print something, it understands it needs to check if my printer is turned on first and turn it on if it's not, so, yes?

Also, if the generated text contains reasoning, what's your definition of "understanding"? Is it "must be made of the same stuff brains are"?

RandomLensman 14 days ago [-]
LLMs fail at so many reasoning tasks (not unlike humans to be fair) that they are either incapable or really poor at reasoning. As far as reasoning machines go, I suspect LLMs will be a dead end.

Reasoning here meaning, for example, being able to answer questions about the implications, applications, and outcomes of a described situation or issue. In my experience things quickly degenerate into technobabble for non-trivial issues (also not unlike humans).

stavros 14 days ago [-]
If you're contending that LLMs are incapable of reasoning, you're saying that there's no reasoning task that an LLM can do. Is that what you're saying? Because I can easily find an example to prove you wrong.
RandomLensman 14 days ago [-]
It could be that all the reasoning displayed is just restating existing information - in which case there would be no reasoning at all - but that aside, what I meant is being able to reason in a consistent way. It's like a machine that only sometimes gets an addition right - it isn't really capable of addition.
stavros 14 days ago [-]
The former is easy to test, just make up your own puzzles and see if it can solve them.

"Incapable of reasoning" doesn't mean "only solves some logic puzzles". Hell, GPT-4 is better at reasoning than a large number of people. Would you say that a good percentage of humans are poor at reasoning too?

RandomLensman 14 days ago [-]
Not just logic puzzles but also applying information, and, yes, I tried a few things.

People/humans tend to be pretty poor at it, too (training can help, though), as it isn't easy to really think through and solve things - we don't have a general recipe to follow there, and neither do LLMs, it seems (otherwise they wouldn't fail).

What I am getting at is that as far as a reasoning machine is concerned, I'd want it to be like a pocket calculator is for arithmetic, i.e., it doesn't fail other than in some rare exceptions - and not inheriting human weaknesses there.

GaggiX 14 days ago [-]
>Another assumption is that it’s because of tokenisation issues. But that can’t be true either.

It's definitely a tokenizer issue: if GPT-4 were trained on single characters, I'm pretty sure it would be able to play Wordle much better. Models as they are trained today have quite lossy knowledge of which characters are inside a given token; a possible fix would be to bake that character-level knowledge into the embeddings.
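
A quick way to see how lossy this is: compare how a Wordle answer tokenizes with and without separators. A minimal sketch, assuming the tiktoken package is installed (exact token splits depend on the encoding version):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")

    print(enc.encode("BLAME"))      # typically fewer tokens than letters
    print(enc.encode("B L A M E"))  # roughly one token per letter once separated

    # The model only ever sees the IDs on the left, so the individual letters
    # inside "BLAME" are never directly visible to it.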

Eridrus 12 days ago [-]
Starting with the reversal curse is weird since there is a simple workaround to this, which is to identify entity names to keep them in their proper order, and then train on the reverse of the pretraining corpus: https://arxiv.org/abs/2403.13799v1

You can argue about how this doesn't really say anything surprising since the reversal of "A is B" is literally "B is A", but it's weird to expect elegant solutions to all problems on all fronts all at once, and we do have an incredibly simple data generation process here.

roenxi 14 days ago [-]
It is interesting that all the examples I noticed in this article have a geometric aspect (even wordle - I model it as a grid with geometric rules when playing it). I think that the "first grader" comment is actually somewhat illuminating - it takes several years of learning how to navigate in a spatial world before this stuff becomes trivially easy.

The underlying point of this article might be that LLMs don't understand the non-textual aspects of a grid. Which is a fair point: they model language, not space. I wouldn't expect text corpora to explain space either, since virtually everyone who can read and write already knows a lot about spatial layouts.

constantcrying 12 days ago [-]
Again and again this article claims that surprisingly a LLM fails at a certain problem, when it appears to be easy. Each time it seems pretty obvious why that is the case though.

LLMs rely on the statistical dependencies between words or parts of words. That means any question whose answer is hard to determine from those statistical dependencies is extremely hard for an LLM. E.g. ChatGPT fails at determining the length of words made up of random characters. It will fail at even applying the simplest of rules, because encoding rules in statistical dependencies is extremely hard.

Der_Einzige 13 days ago [-]
Why did the author have to claim that it's not tokenization issues?

This issue, or at least similar ones, absolutely is due to tokenization.

Karpathy is right that nearly every modern problem with LLMs is due to tokenization, but if you don't believe him, maybe see this work by gwern: https://gwern.net/gpt-3#bpes or this work by yours truly: https://aclanthology.org/2022.cai-1.2/

weitendorf 13 days ago [-]
So many of these examples are simply forgetting that LLMs experience the world through a 1-dimensional stream of tokens, while we experience those same tokens in 2 dimensions.

Try this: represent all those ASCII representations of games with the letter Q replacing the newline, to properly convert the encoding into a representation approximating what LLMs "see" (not a table, but a stream interspersed with Qs at a regular interval). Pretty hard right?
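
As a toy sketch of that flattening (the grid and the 'Q' separator here are arbitrary, just mirroring the suggestion above):

    # Flatten a 2D board into the 1D stream a tokenizer-fed model receives.
    grid = [
        "X.O",
        ".X.",
        "O.X",
    ]
    stream = "Q".join(grid)
    print(stream)  # X.OQ.X.QO.X

    # "Which cells are vertically adjacent?" now requires counting offsets of
    # row-width + 1 through the stream instead of just looking down a column.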

> LLMs cannot reset their own context

If you have a model hooked up to something agentic, I don't see why it couldn't perform context manipulation on itself or even selective realtime finetuning. Think you'll need info for the long haul, kick off some finetuning. Think you'd rather have one page of documentation in context than other, swap them out in one iteration. When you call LLMs over APIs you usually provide the entire context with each invocation...

> Devin

It's not that it's massively smarter or agentic, just that it has the opportunity to correct its mistakes rather than committing to the first thing to come out of it (and is being handheld by a vastly more knowledgeable SWE in its demos). You see cherrypicked examples (I also work on GenAI-for-coding) - just like a tragically incompetent employee could waste literal years on a project diligently plugging away at some task, so too can agentic models go off on a wild goose chase that accomplishes nothing besides making Nvidia more money. Just because something is highly persistent doesn't mean it will "converge" on a correct outcome.

bboygravity 13 days ago [-]
LLMs can never experiment with/in the real world to find answers to open questions.

That's the summary of "The book of Why" in one sentence as I understand it.

kromem 13 days ago [-]
"LLMs can't" is such an anti-pattern at this point that I'm surprised anyone still dares to state it. The piece even has an example of a $10k bet around a "can't" being proven false in under a day, but somehow doesn't consider that maybe its own "can't" examples are on similarly thin ice?

In particular, the line about "what models can't do tells us what they don't know" is infuriating.

No, that's not the case at all. At least in a number of instances, what they can't do is because of what they do know.

As an example, one of the can'ts I got from HN a year ago for GPT-4 was a variation of a classic logic puzzle. And indeed, the model can't solve it - nor can most major models since.

But it's not because the model can't solve the logic - it's because the token similarity to the standard form biases the output towards the standard solution. A hack as simple as changing the nouns to emojis can allow the model to get the correct answer and work through the logic successfully every attempt because it breaks that similarity bias.

People are way too confident around a topic where what's 'known' is more mercurial than maybe any field since 1930s particle physics.

I'd strongly recommend deleting 'never' or 'can't' from one's vocabulary on the subject, unless one enjoys ending up with egg on one's face.

bschmidt1 12 days ago [-]
An LLM will probably be able to do most of what human minds can do: reason, predict, hypothesize, research, and even get hooked up to other systems to visualize, smell, taste, balance, and direct the movement of limbs. But an LLM can't, and won't ever be able to, feel pain, bliss, anger, or sadness; it can't feel positive or negative, can't eat or drink, be hungry, feel fatigued, get excited, enjoy things, dislike things, contemplate, meditate, or feel warm or cold (though it can detect temperature); it can't feel dizzy (though it can know when it's off balance). For any action where having an experience is a necessary part of what it's doing and of its output, an LLM is not sufficient and never will be.

To compare to a brain, the LLM is like the prefrontal cortex or language and decision network in the outermost layer, but we would still need the amygdala in that metaphor - emotional drives, urges, episodic first-person memories, and experiential components that accompany the language and complete it with personhood.

For raw sensations and tactile feelings we might need that innermost brain stem - which is probably more chemistry than computation - for the "lights to be on". Some jobs will require not just language intelligence, and not just personhood, but the light behind the images and the feeling in the sensations, so that it feels (and would be) alive.

Der_Einzige 13 days ago [-]
Amen brother. I had feelings like this which I wanted to share, but you hit the nail on the head.
fragmede 13 days ago [-]
interestingly enough, it seems that ChatGPT-4 can now detect the variations of the classic logic puzzle and solve it, so that can't is now a couldn't.
kromem 13 days ago [-]
We may be talking about different logic puzzles? The only model I've seen that didn't need some rather extreme adjustments to eventually solve it was Mistral large.
fragmede 13 days ago [-]
what's your puzzle? mine's the river crossing puzzle
kromem 13 days ago [-]
Mine is also the river crossing puzzle. What's your variation?
fragmede 12 days ago [-]
the boat is big enough to hold all the items

I asked it

> lion, goat, wolf riddle, but the boat is big enough to carry all of them

and it said it could do it in one step.

https://chat.openai.com/share/7b7a5462-7649-403d-a4f0-01c387...

ChatGPT-4 today (April 28th, 2024) still fails at it sometimes:

https://chat.openai.com/share/1bec923d-d727-42fe-ba9b-9f92b9...

This is ChatGPT-4 getting it wrong, months ago: https://chat.openai.com/share/caa37ad6-b7a8-451d-8f39-8a2c04...

This is ChatGPT-4 getting it right, today (April 28th, 2024): https://chat.openai.com/share/d2d9e63e-819e-4681-9f9f-8f77ea...

kromem 12 days ago [-]
Ah, ok. My variation is it's a vegetarian wolf, a carnivorous goat, and a cabbage.

There's a few different hacks that will get it to work, but one of the more interesting is switching the nouns to emojis.

But almost none of the models ever get it on the first try, and every major model since GPT-4 can have the prompt tweaked to get it with the exception of Llama-3, which I just can't get to solve it with anything I've tried so far (and I'm not sure if it's because of extra strong associations to the standard form from the extra training run or if it lacks the core competencies, though I am starting to think it's the latter given how it responds as I point out errors).

I particularly like this variation because it requires remapping concepts in unintuitive ways based on broad abstractions, like having a goat potentially eat a wolf because of it being carnivorous.

fragmede 11 days ago [-]
Oh that's a good one. Interesting that it's unable to get that right, because when I've tried asking it using a different theme (Star Trek on a shuttlecraft instead of a river, or an LOTR theme), it's able to recognize that it's still the river crossing problem.

It's also inconsistent at solving the twist. Here's ChatGPT-4 getting it right and wrong within minutes of each other just today (April 29th, 2024).

right: https://chat.openai.com/share/178adc81-e43b-4ecf-b7b7-d7fa5a... wrong: https://chat.openai.com/share/9ae78081-c0eb-409d-93c6-650f37...

Have you tried

puttycat 14 days ago [-]
Simple addition, among other things:

https://github.com/0xnurl/gpts-cant-count

anshumankmr 14 days ago [-]
Yeah, GPT-4 Turbo in my experience used to get simple arithmetic wrong, like adding up a long sequence of 4-digit numbers.
reqo 14 days ago [-]
> This ‘goal drift’ means that agents, or tasks done in a sequence with iteration, get less reliable. It ‘forgets’ where to focus, because its attention is not selective nor dynamic.

I don't know if I agree with this. The attention module is specifically designed to be selective and dynamic, otherwise it would not be much different than a word embedding (look up "soft" weights vs "hard" weights [1]). I think deep learning should not be confused with deep RL. LLMs are autoregressive models which means that they are trained to predict the next token and that is all they do. The next token is not necessarily the most reasonable (this is why datasets are super important for better performance). Deep RL models on the other hand, seem to be excellent at agency and decision making (although in restricted environment), because they are trained to do so.

[1] https://en.wikipedia.org/wiki/Attention_(machine_learning)
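
For reference, the "soft" weights being referred to are just the input-dependent mixing coefficients of scaled dot-product attention. A minimal numpy sketch (toy shapes, no masking or multiple heads):

    import numpy as np

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                 # query-key similarities
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                            # input-dependent mixture of values

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
    print(attention(Q, K, V).shape)  # (2, 4)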

visarga 14 days ago [-]
LLMs are fine-tuned with RL too. They are NOT simply next token predictors. RLHF uses whole answers at once to generate gradients, so it is looking further into the future. This might not be perfect but it is clearly more than focusing just 1 token ahead.

In the future the RL part of LLM training will increase a lot. Why am I saying this? There are two sources for learning - the past and the present. Training on human text is using past data, that is off-policy. But training on interactive data is on-policy. There is nothing we know that doesn't come from the environment. What is not written in any books must be learned from outside.

That is why I think supervised pre-training on human text is just half the story and RL-based agent learning - interactivity, in other words - is the next step. The two feet on which intelligence stands are language (past experience) and environment (present experience). We can't get ahead without both of them.

AlphaZero showed what an agent can learn from an environment alone, and LLMs show what they can learn from humans. But the world is big, there are plenty of environments that can provide learning signal, in other words feedback to LLMs.

joshspankit 13 days ago [-]
As I was reading, this voice got louder and louder:

Would LLMs cross this threshold if we were able to train them only on works that are “objectively good”? (If someone has better language for this, please enlighten me.)

That is to say: coherent, empathetic, transparent, free from bias, substantiated, free from “fluff”.

For example: For science one cannot simply train from all works published in scientific journals because of the papers that have been written irrespective of facts, or had the data changed, or have been written with specific agendas. In most cases even the experts have a hard time weeding out all the papers that are not “objectively good”. How could an LLM hope to make the determination during training?

patching-trowel 13 days ago [-]
My gut says no because of the way language relates to meaning. In language, a “chair” is a chair is a chair. But in meaning, a chair is not-a-stool, and not-a-couch, and not-a-bench etc. We understand the object largely by what the object is similar to but not.

In order for the LLM to meaningfully model what is coherent, empathetic, free from bias, it must also model the close to, but NOT-that.

joshspankit 11 days ago [-]
That’s a compelling point.

If you’ll indulge me I’m going to think out loud a little.

What makes sense to me about this point:

- Having zero knowledge of “non-good” could lead to fragility when people phrase questions in “non-good” ways

- If an LLM is truly a “I do what I learned” machine, then “good” input + “good” question would output “good” output

- There may be a significant need for an LLM to learn the “chair is not-a-stool” aka “fact is not-a-fiction”. An LLM that only gets affirming meanings might be wildly confused. If true, I think that would be an interesting area to research, not just for AI but for cognition. … now I wonder how many of the existing params are “not”s.

- There’s also the question of scale. Does an LLM need to “know” about mass extinction in order to understand empathy? Or can it just know about the emotions people experience during hard times? Children seem to do fine at empathy (maybe even better than adults in some ways) despite never being exposed to planet-sized tragedies. Adults need to deal with bigger issues where it can be important to have those tragedies front of mind, but does an LLM need to?

whiplash451 14 days ago [-]
This part of the article summarizes it all fairly well: "It can answer almost any question that can be answered in one intuitive pass. And given sufficient training data and enough iterations, it can work up to a facsimile of reasoned intelligence."
smusamashah 14 days ago [-]
Do we have an open/shared list of problems that LLMs can't solve?

People have mentioned some other problems apart from those in the article. Someone should compile these and put them up if they haven't been already

gerdesj 13 days ago [-]
Mornington Crescent. It will always win and hence lose and more importantly have no idea why.

Oh let's be cerebral about this stuff and ignore silly British nonsense. LLMs are a classic example of garbage in, garbage out, with a shonky curve fit veneer of science.

A next-token guesser with rather varied input quality is going to go off on one rather often. That we all have different ideas of truth adds to the fun.

I take care that my monocle doesn't snag in my lathe. Do be careful with your nob when tickling your LLM inappropriately.

urbandw311er 13 days ago [-]
I think if you prompted an LLM and explained what’s really going on with Mornington Crescent, it would probably do a reasonable job of playing the game.
_heimdall 13 days ago [-]
My biggest concern with LLMs in programming is a complete loss of context. Unless the model is regularly trained on the latest codebase, code will always be generated in isolation: no real architectural decisions made with regard to reuse or testability, and no consideration for how the code will be used in 6 months or why the existing code is the way it is.

To anyone using LLMs for meaningful code, I wish you luck maintaining the code long term and hope you really do enjoy doing code reviews.

eqmvii 13 days ago [-]
I just asked MetaAI to help me with Wordle and it understood and gave me a sane answer, so…

Edit: ah, I spoke too soon. My first question was too "easy" but I asked a few more, and sure enough... it can understand what I'm asking and it can write an answer that's well formed, but it's fundamentally not understanding the rules of the game or giving me valid guesses. Cute!

furyofantares 13 days ago [-]
It's actually that they have no idea how words are spelled, because they're blind to letters: they operate on tokens instead.

They do seem to know the first letter of each word pretty well (they love to alliterate and can also produce a list of things starting with A then B then C etc) but they are all very terrible at spelling beyond that. I presume they could be trained to spell even while retaining tokens but I guess I don't know for certain.

int_19h 13 days ago [-]
With tasks like these it helps if you break your words up with some kind of separator so that it all tokenizes to one character per token. They do know what individual letters are conceptually.
furyofantares 12 days ago [-]
It helps a little. ChatGPT4 still fails the following prompt fairly often (maybe 40% of the time):

which of these 5-letter sequences a has P in position 4?

P U P I L

S T A P H

C R E E P

P O O P Y

L I P I D

It usually gets it if it reiterates each sequence before deciding (but not always) and it almost always fails when it just answers right away.

And this doesn't actually help at all with the actual related Wordle problem - "think of a 5-letter word with P in the fourth position".

You could try filling up its context with a list of a thousand of the most common 5 letter words all spelled out (and I've done so, even with additional hints like the positions of the letters in parens). But it really didn't help (as might be expected since it can't even reliably do it with the above list of 5 words.)

ChicagoDave 13 days ago [-]
I’ve been trying to get all the LLMs to do the same thing with the same lack of success.

I keep thinking there could be a way to iteratively train an LLM with declarative prompts, but as the article points out, it’s the chicken and egg problem. The LLM can’t provide a response unless it already knows the answer.

However, I believe this barrier will eventually be overcome. Just not anytime soon.

jerpint 14 days ago [-]
I had found that GPT4 couldn’t play wordle about a year ago [1]. At the time, I thought it must be because it wasn’t in the training data but now it seems to point to something larger.

I might just get nerd sniped trying to teach it GoL now…

[1] https://www.jerpint.io/blog/gpt-wordle/

mathstuf 13 days ago [-]
Things I've seen stump the ones I've played with so far (admittedly, not a lot): playing/generating "MadGab" puzzles and ASCII art rendering/interpretation. I've also asked ChatGPT3.5 to phonetically transliterate from English to other orthographies using the typical sounds of said orthography and it was…OK at it.
int_19h 13 days ago [-]
GPT-4 can be surprisingly good at doing straight up IPA transcriptions, as well as informal transliterations, for languages that it knows. Not perfect, but definitely good enough for "Babelfish" kind of use.
tacocataco 13 days ago [-]
Find a way to get humans to love each other.
sriku 13 days ago [-]
A simpler question that seems to stump GPT4, llama3 (8b and 70b) so far - https://twitter.com/srikumarks/status/1784214593146868071
sriku 13 days ago [-]
An analogy I made to friend - language models capture the constraints in the arrangement of tokens in streams of communication. LLMs that model the constraints placed by human intelligence on token streams can no more be said to have attained (human)intelligence than physicists who decode the constraints placed by a god-like intelligence on the universe can be said to have attained god-like intelligence themselves. (Using comments by theist physicists to the tune of "deciphering the mind of God")
pton_xd 13 days ago [-]
> language models capture the constraints in the arrangement of tokens in streams of communication

Yes but ultimately that includes all of math, logic, science, physics, etc. which as far as we can tell are fundamental truths of the universe. And if there's a large enough LLM that can capture enough constraints, functionally what's the difference between its intelligence and ours?

enraged_camel 14 days ago [-]
Yesterday I asked ChatGPT 4 to write a paragraph with exactly five unique palindromes, and for some reason it really, really struggled. First it wrote a paragraph with four palindromes, then it rewrote it but some palindromes were repeated with a total of seven, etc.
tromp 14 days ago [-]
Train their successor.

Once they absorb theorem provers, they will be able to do lots of math provably correctly. That does mean they should be unable to state "I have proved theorem A in Theory T with proof merkle root R" unless they actually did just that.
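
As a toy illustration of the "provably correctly" part: a proof checked by a prover such as Lean 4 either compiles or it doesn't, so a claim of having proved something can be verified mechanically rather than taken on faith. (Lean 4 syntax, using only the core Nat.add_comm lemma.)

    -- If the Lean checker accepts this file, the stated theorem really is proved.
    theorem add_comm' (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b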

thrdbndndn 12 days ago [-]
> LLMs are hard to, as I've written multiple times, and their ability to reason is difficult to separate from what they're trained on.

Can someone explain this sentence to me? It looks broken (hard to what?).

rkwasny 14 days ago [-]
I just used LLaMA-3-70B to play today's Wordle and it solved it in 6 tries.
imtringued 13 days ago [-]
I was onboard with the article up until the middle. After the conclusion where the author simply gives up I felt like it dragged on way too much.

His attempts at training on Conway's Game of Life are kind of pathetic. The problem isn't a lack of training data, and neither is its "distribution". The fallacy lies in the fact that the dataset itself doesn't contain reasoning in the first place. For example, GitHub Copilot has fill-in-the-middle capability, while ChatGPT by default does not.

Now here is the shocker about the fill in the middle capability. How does the LLM learn to do it? It does it in an incredibly primitive way. Instead of building a model that can edit its own context, it receives a marker in the context that tells it about the cursor position and then it is finetuned on the expected response.

This means that an LLM could be trained to insert its token at any position in the context or even replace existing tokens, but here is the problem: once the model has modified its own context, it has exited the training dataset. How do you evaluate the intermediate steps, which can consist of genuinely novel thoughts that are required but not present in the data? Adding two numbers requires intermediate states which the model may even know how to produce, but it can never be rewarded for using them if they aren't in the training data, because for the LLM, the only goal is to conform to the dataset.

If you wanted to avoid this, you would need to define a metric which allows the model to be rewarded for a success even if that success took a detour. Currently, training is inherently built around the idea of zero shot responses.
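
For concreteness, fill-in-the-middle training really is just a rearrangement of the text around sentinel markers, with the model fine-tuned to emit the missing span. A sketch with made-up marker strings (real models use their own special tokens, which vary by model):

    # <PRE>/<SUF>/<MID> are illustrative placeholders, not any model's actual tokens.
    def make_fim_example(text, start, end):
        prefix, middle, suffix = text[:start], text[start:end], text[end:]
        prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"
        target = middle  # the training label: what the model should emit next
        return prompt, target

    prompt, target = make_fim_example("def add(a, b):\n    return a + b\n", 15, 31)
    print(prompt)
    print(repr(target))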

keskival 13 days ago [-]
All points described are simple artifacts of tokenization.
moffkalast 13 days ago [-]
> If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse.

This is not a tokenization artefact. And furthermore it's a problem for human brains as well.

Let's say you get a name, idk, Tom Cruise. You immediately know what his face looks like. Now let's say you get a random face. How quickly would you be able to tell me what that person is named? Likely a lot of "uhhs" and "ermms" will follow. It's super hard for us to generalize this reversal automatically in lots of cases. Associations tend to be one directional.

ricardobeat 13 days ago [-]
That's not a great example. Remembering a face is memory recall, whereas what's at stake here is LLMs not being able to infer simple relationships - if it learns from data that "John owns the red bicycle", it will succeed at answering "what does John own", but not "who owns the red bicycle". The relationship it learns is unidirectional.

Here's the intro to the paper that brought this to light: https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms...

moffkalast 13 days ago [-]
If you read the paper again, they deal with pre-training data and fine-tuning data specifically. Their test is on information being pulled out zero-shot, which would mean the associations attention finds between tokens are one-directional. This is just testing recall as well, so my example is as apples-to-apples as you can get when comparing systems with such large complexity disparities.

In-context reasoning tends to work a lot more reliably for these examples, if you put any of the test statements into it directly before asking the question, practically any llm can answer correctly. That's why very small models are still useful for RAG use cases.

ein0p 13 days ago [-]
“Never” is a long time. I wouldn’t bet on that. It pays to remember that even the oldest SOTA model, GPT-4 is only just over a year old.
andsoitis 13 days ago [-]
I tried to get ChatGPT and Gemini to do ASCII art and both fail abysmally. Doesn’t mean they could never do it, but it really surprised me.
int_19h 13 days ago [-]
Take a look at some piece of ASCII art in a tokenizer, and you will see why.
dcchambers 14 days ago [-]
Guarantee an output will be consistent every time.
dragonwriter 13 days ago [-]
> Guarantee an output will be consistent every time.

If you mean “consistent with a prior run with the same input”, LLMs can absolutely do that, though for most purposes settings are chosen deliberately which do not.

If you mean “consistent with some external structural specification”, LLMs can do that, too, e.g., via grammar specifications.
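
A sketch of the first case with the OpenAI Python client, assuming the openai package and an API key in the environment; note the seed parameter is documented as best-effort reproducibility rather than a hard guarantee:

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt):
        resp = client.chat.completions.create(
            model="gpt-4o",  # substitute whichever chat model you use
            messages=[{"role": "user", "content": prompt}],
            temperature=0,   # greedy-ish decoding
            seed=42,         # request reproducible sampling across runs
        )
        return resp.choices[0].message.content

    print(ask("Name three prime numbers."))
    print(ask("Name three prime numbers."))  # intended to match the first call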

wddkcs 14 days ago [-]
ChatGPT just released a memory feature.
ekianjo 14 days ago [-]
mostly solved with grammar for local LLMs
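
A sketch of what that looks like with llama-cpp-python, assuming that package and a local GGUF model file; the GBNF grammar below constrains the output to a fixed choice:

    from llama_cpp import Llama, LlamaGrammar

    grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')
    llm = Llama(model_path="./model.gguf")  # path to any local GGUF model

    out = llm("Is 7 prime? Answer yes or no.", grammar=grammar, max_tokens=4)
    print(out["choices"][0]["text"])  # constrained to "yes" or "no" by construction
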
jdthedisciple 13 days ago [-]
GPTs are trained on natural language.

Why should it surprise anyone that it would fail at cellular automata?

jampekka 13 days ago [-]
Current LLM architectures have fundamental limitations, which means they cannot learn some problems regardless of training.

A simple example is that they fundamentally cannot balance parentheses beyond half their context width.
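
The contrast is that checking (or completing) balanced parentheses over an arbitrarily long stream only needs a running counter, whereas a transformer has to fit both the input and its answer inside a fixed window. A minimal sketch of the counter version:

    def balanced(stream):
        depth = 0
        for ch in stream:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return False  # closed a parenthesis that was never opened
        return depth == 0

    print(balanced("(" * 10**6 + ")" * 10**6))  # True, far beyond any context window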

tudorw 13 days ago [-]
Detect when humans are lying.
thom 14 days ago [-]
I have no beef with the actual content or conclusions, but it’s a shame the article is framed the way it is, because I don’t think we can rigorously define the goalposts for what qualifies as a future LLM. It could just as easily have been titled “Exciting avenues of research for future LLMs!” but we’re all so jaded despite the frankly astonishing progress of recent years.
_wire_ 12 days ago [-]
>jaded in spite of astonishing progress

We've all seen something that looks amazing, but few seem to know what we're looking at.

I am unsettled by what I see as a division of thought between extolling AI's amazing effects on one hand and a mystified regard for how it works and what its limits are on the other.

Noting Arthur Clarke's dictum that 'technology can be sufficiently advanced to be indistinguishable from magic,' AI enthusiasm looks like a Feynman cargo cult.

But technology arousing magical thinking with little discussion of principle of operation and limits is common enough.

This was very much the case with the arrival of personal computing: there's something the device is intended to do, but most people aren't sure what that is. The devices crashed enough and went out of date so fast you felt ok for not understanding them.

It was even worse with the mobile+web as so much change happened so fast that a generation has been dumbstruck: look at USA politics.

I was looking at old episodes of the Computer Chronicles from the early 90s on YT, and by that time the show had close to a thousand episodes, but they could barely explain the significance of Windows 3.0 and the Pentium. As to what to expect from this stuff, they didn't even try; it was mindless rambling and upsell, interspersed with stern warnings from the Software Publishers Association that mucking with the code is a Federal offense. The show's guests all had something to sell with a half-life of 3–6 months. For the Pentium episode they show a PC lab with nerds in Dockers (khaki pants) studiously examining how many fans it might take to keep a lanman server from overheating and crashing. Many were amazed by it all.

Also available on YT are old AT&T videos, including an introduction to UNIX with Kernighan & Ritchie. The presentation entirely focuses on the power of the shell. They seemed much more reserved and competent in retrospect, but in their time they looked like a pure priesthood.

Maybe the arrival of AI stuff is not so different from the arrival of personal computing?

But AI is basically just one app, and I get the feeling that the scene is far more enigmatic to the point that even the people building the kit don't really know why it does what it does, and no one seems to have a clear idea of what correct functioning means.

JSDevOps 13 days ago [-]
Plumbing. It’ll be regulated down until it’s no better than a text processor. Remember: the US innovates, the UK stagnates, the EU regulates, and China replicates.
BenFranklin100 14 days ago [-]
Reason.

LLMs mimic human language which is separate from reasoning. Tech bros are remarkably ignorant of the field of linguistics and don’t appreciate this distinction. They thus mistake the output of LLMs for reason.

hahajk 14 days ago [-]
Oh, I thought this was going to be a semi-rigorous discussion on computability. It's actually just another "GPT-4 still fails at these prompts" essay.
reissbaker 13 days ago [-]
I love when people propose concrete claims like this: if they're wrong, they're disprovable. If they're right, you get unique and interesting insights from the attempts to disprove them.

I suspect these are all tokenization artifacts, but I'll probably take some time to try out the Conway's Game of Life problem by finetuning a model. A few issues I've noticed from the problems proposed in the article:

1. Wordle. This one TBH is a clear tokenization problem, not a proof of the reasoning capabilities of LLMs or lack thereof. LLMs are trained on multi-character tokens, and consume words as multi-character tokens: they don't "see" characters. Wordle is primarily a game based around splitting words into discrete characters, and LLMs can't see the characters they're supposed to operate on if you give them words — and depending on how you structure your answers, they also might not be able to see your answers! By breaking the words and answers into character-by-character sequences with spaces in between the characters (forcing the tokenizer into breaking each character into a separate token visible to the LLM), I successfully got GPT-4 to guess the word "BLAME" on my first attempt at playing Wordle with it: https://chat.openai.com/share/cc1569c4-44c3-4024-a0c2-eeb498...

2. Conway's Game of Life. Once again, the input sequences are given as a single, long string with no spacing, which will probably result in it being tokenized and thus partially invisible to the LLM. This one seems somewhat annoying to prompt, so I haven't tried yet, but I suspect a combination of better prompting and maybe finetuning would result in the LLM learning to solve the problem.
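
If anyone wants to try the finetuning experiment, generating the data is a few lines. A sketch of one Game of Life step on a small wrap-around grid, with the output spaced one cell per token as discussed above (the grid size and formatting are arbitrary choices):

    def step(grid):
        h, w = len(grid), len(grid[0])
        def live_neighbors(r, c):
            return sum(
                grid[(r + dr) % h][(c + dc) % w]
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
        return [
            [1 if n == 3 or (grid[r][c] and n == 2) else 0
             for c in range(w) for n in [live_neighbors(r, c)]]
            for r in range(h)
        ]

    def show(grid):
        # One character per cell, space-separated, so each cell is its own token.
        return "\n".join(" ".join(str(x) for x in row) for row in grid)

    blinker = [[0] * 5 for _ in range(5)]
    blinker[2][1] = blinker[2][2] = blinker[2][3] = 1
    print(show(step(blinker)))  # the blinker flips from horizontal to vertical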

Similarly, complaints about finetuned models not being able to generalize well on input sequences of lengths longer than they were trained on are most likely token-related. Each token an LLM sees (both during training and inference) is encoded alongside its absolute position in the input sequence; while you as a human being see 1 and 1 1 and 1 1 1 as repeated series of 1s, an LLM would see those characters as being at least somewhat distinct. Given a synthetic dataset of a specific size, it can start to generalize over problems within the space that it sees, but if you give it new data outside of that context space, the new data will not be visible to the LLM as being necessarily related to what it was trained on. There are architectural tricks to get around it (e.g. RoPE scaling), but in general I wouldn't make generalizations about what models can or can't "reason" about based on using context window sizes the model didn't see during training: that's more about token-related blindspots and not about whether the model can be intelligent — at least, intelligent within the context window it's trained on.

One thing the author repeats several times throughout the article is that the mistakes LLMs make are far more instructive than their successes. However, I think in general this is not the case: if they can succeed sometimes, anyone who's spent much time finetuning knows that you can typically train them to succeed more reliably. And the mistakes here don't necessarily seem instructive at all: they're tokenization artifacts, and rewriting the problem to work around specific types of blindness (at least in Wordle's case) seems to allow the LLMs to succeed.

FWIW, the author brings up Victor Taelin's famous A::B problem; I believe I was the first to solve it [1] (albeit via finetuning, so ineligible for the $10k prize; although I did it before the prize was announced, just for the pleasure of playing around with an interesting problem). While I think that it's generally a useful insight to think of training as giving more intuition than intelligence, I do think the A::B problem getting solved eventually even by pure prompting shows that there's actually intelligence in there, too — it's not just intuition, or stochastic parroting of information from its training set. However, tokenization issues can easily get in the way of these kinds of problems if you're not aware of them (even the winning Claude 3 Opus prompt slightly rephrased the problem to get it to work with the tokenizer), so the models actually can appear dumber than they really are.

1. https://twitter.com/reissbaker/status/1776531331562033453

makz 13 days ago [-]
The dishes
amatecha 13 days ago [-]
Be creative.
allmaker 13 days ago [-]
We can only talk about the current ones, not the future ones, yes.
vouaobrasil 14 days ago [-]
[flagged]
rmellow 14 days ago [-]
Technology amplifies intent.

LLMs are as friendly to society as nuts and bolts. It's up to those who wield it to be moral.

vouaobrasil 14 days ago [-]
Not true. Technology defines the parameters of social action and we are forced to use technology as it becomes mandatory. Moreover, humans have basic instincts, the strong force which overrides morality frequently. Humanity as a society has very little will and a lot of momentum that is amplified by technology. It is not up to anyone to wield anything.
exe34 14 days ago [-]
We should put you in charge, you seem to be a good person who won't misuse the position.
6510 14 days ago [-]
I blame the vague job descriptions. It is a bit like granting any application access to everything on the system. What could possibly go wrong?
vouaobrasil 14 days ago [-]
Well, I don't think I'd be good in charge. Obviously you are being sarcastic, though. But if I were in charge, I would ban all AI development. (Assuming anyone can be in charge at all. No one really is...)
exe34 14 days ago [-]
You should ban knives too! We have an epidemic of knife crime here in the UK.
vouaobrasil 14 days ago [-]
Knives are significantly more useful than LLMs, which are long past the point of diminishing returns.
exe34 13 days ago [-]
Are you sure about that? The first knife was a sharp rock - what if they got banned immediately?
6510 14 days ago [-]
When do knives become swords tho?
FailMore 13 days ago [-]
Saving for later, thanks!
anon-3988 13 days ago [-]
This is a chicken-and-egg problem: of course we only value and optimize for what we can do, and deem anything we can't do unnecessary. There are things that we humans simply cannot think of, and therefore we decide they must not be important or must not exist.

We cannot think of anything beyond 4 dimensions, so we assume there must be nothing beyond that, or that whatever exists in those dimensions doesn't matter much. Or more precisely, we simply cannot appreciate those things.

If we are simply trying to mimic human intelligence...well, you are going to end up with a human brain.

Suppose we have a concept X that humans simply cannot comprehend, appreciate, or solve - well, why bother creating an intelligence to solve it?

From this hypothesis, I personally think that any intelligence we create will simply be an augmentation of what humans desire. That is, there will always be a human in the loop, because humans are the only ones who can appreciate what is being created, so any and all output must cater to the human involved. This will inevitably happen because we want whatever it is the human brain is doing, without doing whatever it is that the human brain is doing.

That is until we unleash a different intelligence system with agency.
