Ask HN: Any insider takes on Yann LeCun's push against current architectures?

84 points by vessenes 4 days ago

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, the token choice method at each step leads to runaway errors -- these can't be damped mathematically.

Instead, he offers the idea that we should have something that is an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of ijepa from his group.

inimino 3 minutes ago

I have a paper coming up that I modestly hope will clarify some of this.

The short answer should be that it's obvious LLM training and inference are both ridiculously inefficient and biologically implausible, and therefore there have to be some big optimization wins still on the table.

ActorNightly 4 days ago

Not an official ML researcher, but I do happen to understand this stuff.

The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.

Energy minimization is more of an abstract approach where you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
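As a rough illustration (a toy sketch, not anyone's actual proposal): a candidate parameter vector is scored by a non-differentiable fitness function and improved purely by selection and random mutation, the way a simple genetic algorithm would do it. The target vector and the fitness function are made up for the example.

    import random

    # Toy black-box fitness: any score works here (accuracy, memory use, compute
    # cycles...). The round() makes it flat almost everywhere, so gradients are useless.
    def fitness(params):
        return -sum(abs(round(p) - t) for p, t in zip(params, [3, -1, 4]))

    def evolve(pop_size=20, generations=100, mutation=0.5):
        population = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=fitness, reverse=True)
            parents = population[: pop_size // 2]            # keep the fittest half
            children = [[p + random.gauss(0, mutation) for p in random.choice(parents)]
                        for _ in range(pop_size - len(parents))]
            population = parents + children                  # no derivatives anywhere
        return max(population, key=fitness)

    print(evolve())  # tends to land near [3, -1, 4]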

  • seanhunter 3 hours ago

    > The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.

    I don't think this explanation is correct. The output at the end of all the attention heads etc. (as I understand it) is a probability distribution over tokens. So the model as a whole does have an ability to score low confidence in something by assigning it a low probability.

    The problem is that that "something" is a token (a word or part of a word). So the LLM can say "I don't have enough information" about the next part of a word, but it has no ability to say "I don't know what on earth I'm talking about" (in general - not associated with a particular token).
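    A toy way to see this (the logits, vocabulary and numbers below are entirely made up): the final layer's logits go through a softmax to become a distribution over the next token, so "low confidence" only ever attaches to that one token position.

        import math

        def softmax(logits):
            m = max(logits)
            exps = [math.exp(x - m) for x in logits]
            return [e / sum(exps) for e in exps]

        # Hypothetical next-token logits after some prompt
        vocab = ["Paris", "Atlan", "un", "known"]
        probs = softmax([1.2, 1.1, 0.9, 0.8])
        print(dict(zip(vocab, [round(p, 2) for p in probs])))
        # Flat-ish distribution: the model is unsure which *token* comes next,
        # but nothing here expresses "the claim I'm in the middle of is unsupported".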

    • Lerc an hour ago

      I feel like we're stacking naive misinterpretations of how LLMs function on top of one another here. Grasping gradient descent and autoregressive generation can give you a false sense of confidence. It is like knowing how transistors make up logic gates and believing you know more about CPU design than you actually do.

      Rather than inferring from how you imagine the architecture working, you can look at examples and counterexamples to see what capabilities they have.

      One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels. It would be quite easy to detect (and exploit) behaviour that decided to use a vowel word just because it somewhat arbitrarily used an 'an'.

      Models predict the next word, but they don't just predict the next word. They generate a great deal of internal information in service of that goal. Placing limits on their abilities by assuming the output they express is the sum total of what they have done is a mistake. The output probability is not what it thinks, it is a reduction of what it thinks.

      One of Andrej Karpathy's recent videos talked about how researchers showed that models do have an internal sense of not knowing the answer, but fine-tuning on question answering did not give them the ability to express that knowledge. Finding information the model did and didn't know, then fine-tuning it to say "I don't know" for the cases where it had no information, allowed the model to generalise and express "I don't know".

      • littlestymaar 39 minutes ago

        Not an ML researcher or anything (I'm basically only a few Karpathy videos into ML, so please someone correct me if I'm misunderstanding this), but it seems that you're getting this backwards:

        > One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels.

        My understanding is that the model doesn't put “'an' ahead of a word that starts with a vowel”; the model (or more accurately, the sampler) picks “an”, and then the model will never predict a word that starts with a consonant after that. It's not that it “knows” in advance that it wants a vowel-initial word and anticipates that it needs “an”: it generates a probability for both tokens “a” and “an”, picks one, and when it generates the following token, it necessarily takes its previous choice into account and never puts a word starting with a vowel after it has already chosen “a”.
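        A bare-bones sketch of that autoregressive loop (the model interface here is hypothetical): whichever article gets sampled is appended to the context, and every later prediction is conditioned on it.

            import random

            def generate(model, prompt, steps=5):
                context = list(prompt)
                for _ in range(steps):
                    probs = model(context)                   # hypothetical: {token: probability}
                    tokens, weights = zip(*probs.items())
                    tok = random.choices(tokens, weights=weights)[0]
                    context.append(tok)                      # "a" vs "an" is now part of the input
                return context
            # Once "an" is in the context, the model simply assigns ~0 probability
            # to consonant-initial continuations on the next step.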

        • yunwal 9 minutes ago

          The model still has some representation of whether the word after an/a is more likely to start with a vowel or not when it outputs a/an. You can trivially understand this is true by asking LLMs to answer questions with only one correct answer.

          "The animal most similar to a crocodile is:"

          https://chatgpt.com/share/67d493c2-f28c-8010-82f7-0b60117ab2...

          It will always say "an alligator". It chooses "an" because somewhere in the next-word predictor it has already figured out that it wants to say "alligator".

          If you ask the question the other way around, it will always answer "a crocodile" for the same reason.

    • estebarb 2 hours ago

      The problem is exactly that: the probability distribution. The network has no way to say: 0% for everything, this is nonsense, backtrack everything.

      Other architectures, like energy-based models or Bayesian ones, can assess uncertainty. Transformers simply cannot do it (yet). Yes, there are ways to do it, but we are already spending millions to get coherent phrases; few will burn billions to train a model that can do that kind of assessment.

      • ortsa 2 hours ago

        Has anybody ever messed with adding a "backspace" token?
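        Naively I'd picture the decode loop handling it something like this (pure sketch; the <backspace> token and the sampling call are made up):

            BACKSPACE = "<backspace>"

            def decode(model, prompt, max_tokens=100):
                out = []
                for _ in range(max_tokens):
                    tok = model.sample_next(prompt, out)   # hypothetical sampling call
                    if tok == BACKSPACE:
                        if out:
                            out.pop()                      # retract the previous token
                        continue
                    out.append(tok)
                return out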

        • refulgentis 2 hours ago

          Yes. (https://news.ycombinator.com/item?id=36425375, believe there's been more)

          There's a quite intense backlog of new stuff that hasn't made it to prod. (I would have told you in 2023 that we would have ex. switched to Mamba-like architectures in at least one leading model)

          Broadly, it's probably unhelpful that:

          - absolutely no one wants the PR of releasing a model that isn't competitive with the latest peers

          - absolutely everyone wants to release an incremental improvement, yesterday

          - Entities with no PR constraint, and no revenue repercussions when reallocating funds from surely-productive to experimental, don't show a significant improvement in results for the new things they try (I'm thinking of ex. Allen Institute)

          Another odd property I can't quite wrap my head around is the battlefield is littered with corpses that eval okay-ish, and should have OOM increases in some areas (I'm thinking of RWKV, and how it should be faster at inference), and they're not really in the conversation either.

          Makes me think either A) I'm getting old and don't really understand ML from a technical perspective anyway or B) hey, I've been maintaining a llama.cpp wrapper that works on every platform for a year now, I should trust my instincts: the real story is UX is king and none of these things actually improve the experience of a user even if benchmarks are ~=.

          • vessenes 2 hours ago

            For sure read Stephenson’s essay on path dependence; it lays out a lot of these economic and social dynamics. TLDR - we will need a major improvement to see something novel pick up steam most likely.

    • skybrian an hour ago

      I think some “reasoning” models do backtracking by inserting “But wait” at the start of a new paragraph? There’s more to it, but that seems like a pretty good trick.

    • duskwuff 2 hours ago

      Right. And, as a result, low token-level confidence can end up indicating "there are other ways this could have been worded" or "there are other topics which could have been mentioned here" just as often as it does "this output is factually incorrect". Possibly even more often, in fact.

      • vessenes 2 hours ago

        My first reaction is that a model can’t, but a sampling architecture probably could. I’m trying to understand if what we have as a whole architecture for most inference now is responsive to the critique or not.

    • derefr 2 hours ago

      You get scores for the outputs of the last layer; so in theory, you could notice when those scores form a particularly flat distribution, and fault.
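      As a toy version of that check (the threshold and numbers are invented): compute the entropy of the output distribution and fault when it's close to uniform.

          import math

          def entropy(probs):
              return -sum(p * math.log2(p) for p in probs if p > 0)

          def should_fault(probs, frac=0.95):
              # fault when the distribution carries nearly as little information as uniform noise
              return entropy(probs) > frac * math.log2(len(probs))

          print(should_fault([0.96, 0.02, 0.01, 0.01]))  # False: peaked, confident
          print(should_fault([0.26, 0.25, 0.25, 0.24]))  # True: nearly flat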

      What you can't currently get, from a (linear) Transformer, is a way to induce a similar observable "fault" in any of the hidden layers. Each hidden layer only speaks the "language" of the next layer after it, so there's no clear way to program an inference-framework-level observer side-channel that can examine the output vector of each layer and say "yup, it has no confidence in any of what it's doing at this point; everything done by layers feeding from this one will just be pareidolia — promoting meaningless deviations from the random-noise output of this layer into increasing significance."

      You could in theory build a model as a Transformer-like model in a sort of pine-cone shape, where each layer feeds its output both to the next layer (where the final layer's output is measured and backpropped during training) and to an "introspection layer" that emits a single confidence score (a 1-vector). You start with a pre-trained linear Transformer base model, with fresh random-weighted introspection layers attached. Then you do supervised training of (prompt, response, confidence) triples, where on each training step, the minimum confidence score of all introspection layers becomes the controlled variable tested against the training data. (So you aren't trying to enforce that any particular layer notice when it's not confident, thus coercing the model to "do that check" at that layer; you just enforce that a "vote of no confidence" comes either from somewhere within the model, or nowhere within the model, at each pass.)
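      A rough sketch of the shape of that idea (everything here - names, shapes, the frozen-base assumption - is invented for illustration, not an existing technique):

          import torch
          import torch.nn as nn

          class IntrospectedTransformer(nn.Module):
              def __init__(self, base_layers, hidden_dim):
                  super().__init__()
                  # base_layers: assumed to be an nn.ModuleList of blocks mapping hidden -> hidden
                  self.layers = base_layers                  # pretrained stack, kept frozen
                  for p in self.layers.parameters():
                      p.requires_grad = False
                  # one fresh, randomly-initialised "introspection" head per layer
                  self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in self.layers])

              def forward(self, hidden):                     # hidden: (batch, seq, hidden_dim)
                  confidences = []
                  for layer, head in zip(self.layers, self.heads):
                      hidden = layer(hidden)
                      confidences.append(torch.sigmoid(head(hidden.mean(dim=1))))
                  # train against the *minimum* confidence, so a "vote of no confidence"
                  # can come from anywhere in the model, or from nowhere
                  return hidden, torch.stack(confidences).min(dim=0).values

          # training step on (prompt, response, confidence) triples, roughly:
          #   _, conf = model(embed(prompt_and_response))
          #   loss = nn.functional.binary_cross_entropy(conf, confidence_label)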

      This seems like a hack designed just to compensate for this one inadequacy, though; it doesn't seem like it would generalize to helping with anything else. Some other architecture might be able to provide a fully-general solution to enforcing these kinds of global constraints.

      (Also, it's not clear at all, for such training, "when" during the generation of a response sequence you should expect to see the vote-of-no-confidence crop up — and whether it would be tenable to force the model to "notice" its non-confidence earlier in a response-sequence-generating loop rather than later. I would guess that a model trained in this way would either explicitly evaluate its own confidence with some self-talk before proceeding [if its base model were trained as a thinking model]; or it would encode hidden thinking state to itself in the form of word-choices et al, gradually resolving its confidence as it goes. In neither case do you really want to "rush" that deliberation process; it'd probably just corrupt it.)

  • spmurrayzzz 26 minutes ago

    > i.e. there isn't an "I don't have enough information" option.

    This is true in terms of default mode for LLMs, but there's a fair amount of research dedicated to the idea of training models to signal when they need grounding.

    SelfRAG is an interesting, early example of this [1]. The basic idea is that the model is trained to first decide whether retrieval/grounding is necessary and then, if so, after retrieval it outputs certain "reflection" tokens to decide whether a passage is relevant to answer a user query, whether the passage is supported (or requires further grounding), and whether the passage is useful. A score is calculated from the reflection tokens.

    The model then critiques itself further by generating a tree of candidate responses, and scoring them using a weighted sum of the score and the log probabilities of the generated candidate tokens.
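    Roughly (my own paraphrase of the ranking step, not the paper's exact formula), the candidate scoring looks something like:

        def rank_candidates(candidates, w_reflection=0.5):
            # candidate: {"token_logprobs": [...], "reflection_score": float in [0, 1]}
            def score(c):
                avg_logprob = sum(c["token_logprobs"]) / len(c["token_logprobs"])
                return w_reflection * c["reflection_score"] + (1 - w_reflection) * avg_logprob
            return sorted(candidates, key=score, reverse=True)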

    We can probably quibble about the loaded terms used here like "self-reflection", but the idea that models can be trained to know when they don't have enough information isn't pure fantasy today.

    [1] https://arxiv.org/abs/2310.11511

    EDIT: I should also note that I generally do side with LeCun's stance on this, but not due to the "not enough information" canard. I think models learning from abstraction (i.e. JEPA, energy-based models) rather than memorization is the better path forward.

  • thijson 32 minutes ago

    I watched an Andrej Karpathy video recently. He said that hallucination happens because the training data contains no examples where the answer is "I don't know". Maybe I'm misinterpreting what he was saying though.

    https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4832s

  • josh-sematic an hour ago

    I don’t buy LeCun’s argument. Once you get good RL going (as we are now seeing with reasoning models) you can give the model a reward function that rewards a correct answer most highly, rewards an “I’m sorry but I don’t know” less highly than that, penalizes a wrong answer, and penalizes a confidently wrong answer even more severely. As the RL learns to maximize rewards, I would think it would find the strategy of saying it doesn’t know in cases where it can’t find an answer it deems to have a high probability of correctness.
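    Something of this shape, say (the exact numbers and the answer fields are made up):

        def reward(answer, truth):
            # toy reward schedule for the RL idea above; the magnitudes are arbitrary
            if answer["text"].strip() == "I don't know":
                return 0.2                   # honest abstention: small positive reward
            if answer["text"].strip() == truth:
                return 1.0                   # correct answer: best reward
            if answer["confidence"] > 0.9:
                return -2.0                  # confidently wrong: penalized hardest
            return -1.0                      # wrong: penalized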

    • Tryk 37 minutes ago

      How do you define the "correct" answer?

      • jpadkins 28 minutes ago

        obviously the truth is what is the most popular. /s

  • unsupp0rted 26 minutes ago

    > The problem with LLMs is that the output is inherently stochastic

    Isn't that true with humans too?

    There's some leap humans make, even as stochastic parrots, that lets us generate new knowledge.

  • TZubiri 29 minutes ago

    If multiple answers are equally likely, couldn't that be considered uncertainty? Conversely if there's only one answer and there's a huge leap to the second best, that's pretty certain.

  • throw310822 an hour ago

    > there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.

    Have you ever tried telling ChatGPT that you're "in the city centre" and asking it if you need to turn left or right to reach some landmark? It will not answer with the average of the directions given to everybody who asked the question before, it will answer asking you to tell it where you are precisely and which way you are facing.

bobosha 11 minutes ago

I argue that JEPA and its Energy-Based Model (EBM) framework fail to capture the deeply intertwined nature of learning and prediction in the human brain—the “yin and yang” of intelligence. Contemporary machine learning approaches remain heavily reliant on resource-intensive, front-loaded training phases. I advocate for a paradigm shift toward seamlessly integrating training and prediction, aligning with the principles of online learning.

Disclosure: I am the author of this paper.

Reference: (PDF) Hydra: Enhancing Machine Learning with a Multi-head Predictions Architecture. Available from: https://www.researchgate.net/publication/381009719_Hydra_Enh... [accessed Mar 14, 2025].

jawiggins 3 hours ago

I'm not an ML researcher, but I do work in the field.

My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.

I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed, and where it isn't. Then companies tend to enter a holding pattern for a number of years, getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.

Right now I would guess that we are around 0.9 on the S-curve. We can still improve the LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture; others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.

I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.

[1]: https://www.open.edu/openlearn/nature-environment/organisati...

  • Matthyze 3 hours ago

    > Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found.

    That seems to be how science works as a whole. Long periods of little progress between productive paradigm shifts.

    • tyronehed 2 hours ago

      This is actually a lazy approach as you describe it. Instead, what is needed is an elegant and simple approach that is 99% of the way there out of the gate. As soon as you start doing statistical tweaking and overfitting models, you are not approaching a solution.

__rito__ 2 hours ago

Slightly related: Energy-Based Models (EBMs) are better in theory and yet too resource-intensive. I tried to sell using EBMs to my org, but the price for even a small use case was prohibitive.

I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo...

Yann LeCun, and also Michael Bronstein and his colleagues, have some similarities in trying to properly "sciencify" deep learning.

Yann LeCun's approach, at least for vision, has one core tenet: energy minimization, just like in physics. In his course, he also shows some current architectures/algorithms to be special cases of EBMs.

Yann believes that understanding the whys of the behavior of DL algorithms is going to be beneficial in the long term, rather than just playing around with hyper-params.

There is also a case for language being too low-dimensional to lead to AGI even if it is solved. Like, in a recent video, he said that the total amount of data in all digitized books and the internet is about the same as what a human child takes in in the first 4-5 years. He considers this low.

There are also epistemological arguments for language alone not being able to lead to AGI, but I haven't heard him talk about them.

He also believes that vision is a more important aspect of intelligence, one reason being that it is very high-dimensional. (Edit) Consider an example. Take 4 monochrome pixels. Each pixel can range from 0 to 255, so 4 pixels can create 256^4 = 2^32 combinations. 4 words can create 4! = 24 combinations. Solving language is easier and therefore low-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that was an astronomically big number, think how obscenely long it would take a monkey to paint the Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.

Juergen Schmidhuber has gone a lot quieter now. But he has also said that a world model explicitly included in training and reasoning is better, rather than only text or images or whatever. He has a good paper with Lucas Beyer.

  • vessenes 2 hours ago

    Thanks. This is interesting. What kind of equation is used to assess an ebm during training? I’m afraid I still don’t get the core concept well enough to have an intuition for it.

  • tyronehed 2 hours ago

    Since this gives away the answer: the new architecture has to be based on world-model building.

    • uoaei 2 hours ago

      The thing is, this has been known since even before the current crop of LLMs. Anyone who considered (only the English) language to be sufficient to model the world understands so little about cognition as to be irrelevant in this conversation.

hnfong 2 hours ago

I'm not an insider and I'm not sure whether this is directly related to "energy minimization", but "diffusion language models" have apparently gained some popularity in recent weeks.

https://arxiv.org/abs/2502.09992

https://www.inceptionlabs.ai/news

(these are results from two different teams/orgs)

It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.

  • hnuser123456 2 hours ago

    And they seem to be about 10x as fast as similarly-sized transformers.

    • 317070 an hour ago

      No, 10x fewer sampling steps. Whether or not that means 10x faster remains to be seen, as a diffusion step tends to be more expensive than an autoregressive step.

      • littlestymaar 33 minutes ago

        If I understood correctly, in practice they show an actual speed improvement on high-end cards, because autoregressive LLMs are bandwidth-limited, not compute-bound, so switching to an approach that is more compute-heavy but less memory-bandwidth-heavy is going to work well on current hardware.

EEgads 23 minutes ago

Yann LeCun understands that this is a problem of electrical engineering and the physical statistics of machines, not a code problem.

The physics of human consciousness are not implemented in a leaky symbolic abstraction but in the raw physics of existence.

The sort of autonomous system we imagine when thinking of AGI must be built directly into the substrate and exhibit autonomous behavior out of the box. Our computers are black boxes made in a lab, without centuries of evolving in the analog world and finding a balance to build on. They either can do a task or cannot. Obviously, from just looking at one, we know how few real-world tasks it can just get up and do.

Code isn’t magic, it’s instruction to create a machine state. There’s no inherent intelligence to our symbolic logic. It’s an artifact of intelligence. It cannot imbue intelligence into a machine.

TrainedMonkey 3 hours ago

This is a somewhat nihilistic take with an optimistic ending. I believe humans will never fix hallucinations. The amount of totally or partially untrue statements people make is significant. Especially in tech, it's rare for people to admit that they do not know something. And yet, despite all of that, the progress keeps marching forward and maybe even accelerating.

  • ketzo 3 hours ago

    Yeah, I think a lot of people talk about "fixing hallucinations" as the end goal, rather than "LLMs providing value", which misses the forest for the trees; it's obviously already true that we don't need totally hallucination-free output to get value from these models.

  • dtnewman 38 minutes ago

    I’m not sure I follow. Sure, people lie and make stuff up all the time. If an LLM goes and parrots that, then I would argue that it isn’t hallucinating. Hallucinating would be where it makes something up that is not in its training set nor logically deducible from it.

rglover 44 minutes ago

Not an ML researcher, but implementing these systems has shown this opinion to be correct. The non-determinism of LLMs is a feature, not a bug that can be fixed.

As a result, you'll never be able to get 100% consistent outputs or behavior (like you hypothetically can with a traditional algorithm/business logic). And that has proven out in usage across every model I've worked with.

There's also an upper-bound problem in terms of context where every LLM hits some arbitrary amount of context that causes it to "lose focus" and develop a sort of LLM ADD. This is when hallucinations and random, unrequested changes get made and a previously productive chat spirals to the point where you have to start over.

killthebuddha 3 hours ago

I've always felt like the argument is super flimsy because "of course we can _in theory_ do error correction". I've never seen even a semi-rigorous argument that error correction is _theoretically_ impossible. Do you have a link to somewhere where such an argument is made?

  • vhantz 11 minutes ago

    > of course we can _in theory_ do error correction

    Oh yeah? This is begging the question.

  • aithrowawaycomm 2 hours ago

    In theory transformers are Turing-complete and LLMs can do anything computable. The more down-to-earth argument is that transformer LLMs aren't able to correct errors in a systematic way like Lecun is describing: it's task-specific "whack-a-mole," involving either tailored synthetic data or expensive RLHF.

    In particular, if you train an LLM to do Task A and Task B with acceptable accuracy, that does not guarantee it can combine the tasks in a common-sense way. "For each step of A, do B on the intermediate results" is a whole new Task C that likely needs to be fine-tuned. (This one actually does have some theoretical evidence coming from computational complexity, and it was the first thing I noticed in 2023 when testing chain-of-thought prompting. It's not that the LLM can't do Task C, it just takes extra training.)

probably_wrong 2 hours ago

I haven't read Yann LeCun's take. Based on your description alone, my first impression would be: there's a paper [1] arguing that "beam search enforces uniform information density in text, a property motivated by cognitive science". UID claims, in short, that a speaker only delivers as much content as they think the listener can take (no more, no less), and the paper claims that beam search enforces this property at generation time.

The paper would be a strong argument against your point: if neural architectures are already constraining the amount of information that a text generation system delivers the same way a human (allegedly) does, then I don't see which "energy" measure one could take that could perform any better.

Then again, perhaps they have one in mind and I just haven't read it.

[1] https://aclanthology.org/2020.emnlp-main.170/

  • vessenes 2 hours ago

    I believe he’s talking about some sort of ‘energy as measured by distance from the model’s understanding of the world’, as in quite literally a world model. But again I’m ignorant, hence the post!

    • deepsquirrelnet an hour ago

      In some respects that sounds similar to what we already do with reward models. I think with GRPO, the “bag of rewards” approach doesn’t strike me as terribly different. The challenge is in building out a sufficient “world” of rewards to adequately represent more meaningful feedback-based learning.

      While it sounds nice to reframe it like a physics problem, it seems like a fundamentally flawed idea, akin to saying “there is a closed form solution to the question of how should I live.” The problem isn’t hallucinations, the problem is that language and relativism are inextricably linked.

    • tyronehed 2 hours ago

      When an architecture is based around world-model building, then it is a natural outcome that similar concepts and things end up being stored in similar places. They overlap. As soon as your solution starts to get mathematically complex, you are departing from what the human brain does. Not saying that in some universe it might not be possible to make a statistical intelligence, but when you go that direction you are straying away from the only existing intelligence that we know about: the human brain. So the best solutions will closely echo neuroscience.

jurschreuder an hour ago

This concept comes from Hopfield networks.

If two nodes are on, but the connection between them is negative, this causes energy to be higher.

If one of those nodes switches off, energy is reduced.

With two nodes this is trivial. With 10 nodes it's more difficult to solve, and with billions of nodes it is impossible to "solve".

All you can do then is try to get the energy as low as possible.

This way neural networks can also find out "new" information that they have not learned, but that is consistent with the constraints they have learned about the world so far.
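A tiny worked version of the two-node case (using the standard Hopfield energy, with made-up weights and 0/1 states):

    # Hopfield energy: E = -1/2 * sum_ij w[i][j] * s[i] * s[j]
    w = [[0, -1],
         [-1, 0]]          # two nodes joined by a negative connection

    def energy(state):
        return -0.5 * sum(w[i][j] * state[i] * state[j]
                          for i in range(len(state)) for j in range(len(state)))

    print(energy([1, 1]))  # 1.0 -> both on across a negative link: higher energy
    print(energy([1, 0]))  # 0.0 -> switching one off reduces the energy
    # With two nodes you can brute-force the minimum; with billions you can only
    # descend and hope, which is the "as low as possible" part above.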

janalsncm 2 hours ago

I am an MLE not an expert. However, it is a fundamental problem that our current paradigm of training larger and larger LLMs cannot ever scale to the precision people require for many tasks. Even in the highly constrained realm of chess, an enormous neural net will be outclassed by a small program that can run on your phone.

https://arxiv.org/pdf/2402.04494

  • throw310822 2 hours ago

    > Even in the highly constrained realm of chess, an enormous neural net will be outclassed by a small program that can run on your phone.

    This is true also for the much bigger neural net that works in your brain, and even if you're the world champion of chess. Clearly your argument doesn't hold water.

    • janalsncm 29 minutes ago

      For the sake of argument let’s say an artificial neural net is approximately the same as the brain. It sounds like you agree with me that smaller programs are both more efficient and more effective than a larger neural net. So you should also agree with me that those who say the only path to AGI is LLM maximalism are misguided.

      • throw310822 5 minutes ago

        > It sounds like you agree with me that smaller programs are both more efficient and more effective than a larger neural net.

        At playing chess. (But also at doing sums and multiplications, yay!)

        > So you should also agree with me that those who say the only path to AGI is LLM maximalism are misguided.

        No. First of all, it's a claim you just made up. What we're talking about is people saying that LLMs are not the path to AGI- an entirely different claim.

        Second, assuming there's any coherence to your argument, the fact that a small program can outclass an enormous NN is irrelevant to the question of whether the enormous NN is the right way to achieve AGI: we are "general intelligences" and we are defeated by the same chess program. Unless you mean that achieving the intelligence of the greatest geniuses that ever lived is still not enough.

      • jpadkins 17 minutes ago

        Smaller programs are better than artificial or organic neural nets for constrained problems like chess. But chess programs don't generalize to other intelligence applications the way organic neural nets do today.

  • thewarrior 2 hours ago

    Any chance that “reasoning” can fix this?

    • janalsncm 22 minutes ago

      It kind of depends. You can broadly call any kind of search “reasoning”. But search requires 1) enumerating your possible options and 2) assigning some value to those options. Real world problem solving makes both of those extremely difficult.

      Unlike in chess, there’s a functionally infinite number of actions you can take in real life. So just argmax over possible actions is going to be hard.

      Two, you have to have some value function of how good an action is in order to argmax. But many actions are impossible to know the value of in practice because of hidden information and the chaotic nature of the world (butterfly effect).

estebarb 2 hours ago

I have no idea about EBM, but I have researched a bit on the language modelling side. And let's be honest, GPT is not the best learner we can create right now (ourselves). GPT needs far more data and energy than a human, so clearly there is a better architecture somewhere waiting to be discovered.

Attention works, yes. But it is not biologically plausible at all. We don't do quadratic comparisons across a whole book, or need to see thousands of samples to understand.

Personally I think that in the future recursive architectures and test time training will have a better chance long term than current full attention.

Also, I think that OpenAI's biggest contribution is demonstrating that reasoning-like behaviors can emerge from really good language modelling.

blueyes 3 hours ago

Sincere question - why doesn't RL-based fine-tuning on top of LLMs solve this, or at least push accuracy above a minimum acceptable threshold in many use cases? OAI has a team doing this for enterprise clients. Several startups rolling out of the current YC batch are doing versions of this.

  • InkCanon 2 hours ago

    If you mean so-called agentic AI, I don't think it's just several. IIRC someone in the most recent demo day mentioned that ~80%+ were AI.

d--b an hour ago

Well, it could be argued that the “optimal response”, i.e. the one that sorta minimizes that “energy”, is already settled by LLMs on the first iteration. And further iterations aren’t adding any useful information, and in fact are countless occasions to veer off the optimal response.

For example if a prompt is: “what is the Statue of Liberty”, the LLMs first output token is going to be “the”, but it kinda already “knows” that the next ones are going to be “statue of liberty”.

So to me LLMs already “choose” a response path from the first token.

Conversely, an LLM that would try to find a minimum energy for the whole response wouldn't necessarily stop hallucinating. There is nothing in the training of a model that says that “I don't know” has a lower “energy” than a wrong answer…

tyronehed 2 hours ago

The alternative architectures must learn from streaming data, must be error-tolerant, and must have the characteristic that similar objects or concepts naturally come near to each other. They must naturally overlap.

tyronehed 2 hours ago

Any transformer-based LLM will never achieve AGI because it's only trying to pick the next word. You need a larger amount of planning to achieve AGI. Also, the characteristics of LLMs do not resemble any existing intelligence that we know of. Does a baby require 2 years of statistical analysis to become useful? No. Transformer architectures are parlor tricks. They are a glorified Google, but they're not doing any reasoning or planning. If you want that, then you have to base your architecture on the known examples of intelligence that we are aware of in the universe. And that's not a transformer. In fact, whatever AGI emerges will absolutely not contain a transformer.

  • visarga 27 minutes ago

    The transformer is a simple and general architecture. Being such a flexible model, it needs to learn "priors" from data; it makes few assumptions about the data distribution from the start. The same architecture can predict protein folding and fluid dynamics. It's not specific to language.

    We on the other hand are shaped by billions of years of genetic evolution, and 200k years of cultural evolution. If you count the total number of words spoken by 110 billion people who ever lived, assuming 1B estimated words per human during their lifetime, it comes out to 10 million times the size of GPT-4's training set.

    So we spent 10 million times more words discovering than it takes the transformer to catch up. GPT-4 used 10 thousand people's worth of language to catch up on all that evolutionary fine-tuning.

  • flawn an hour ago

    It's not about just picking the next word here; that doesn't at all refute whether Transformers can achieve AGI. Words are just one representation of information. And whether it resembles any intelligence we know is also not an argument, because there is no reason to believe that all intelligence is based on anything we've seen (e.g. us, or other animals). The underlying architecture of attention & MLPs can surely still represent something which we could call an AGI, and in certain tasks it surely can be considered an AGI already. I also don't know for certain whether we will hit any roadblocks or architectural asymptotes, but I haven't come across any well-founded argument that Transformers definitely could not reach AGI.

  • unsupp0rted 21 minutes ago

    > Does a baby require 2 years of statistical analysis to become useful?

    Well yes, actually.

ALittleLight 3 hours ago

I've never understood this critique. Models have the capability to say: "oh, I made a mistake here, let me change this" and that solves the issue, right?

A little bit of engineering and fine tuning - you could imagine a model producing a sequence of statements, and reflecting on the sequence - updating things like "statement 7, modify: xzy to xyz"

  • rscho 2 hours ago

    "Oh, I emptied your bank account here, let me change this."

    For AI to really replace most workers like some people would like to see, there are plenty of situations where hallucinations are a complete no-go and need fixing.

  • fhd2 2 hours ago

    I get "oh, I made a mistake" quite frequently. Often enough, it's just another hallucination, just because I contested the result, or even just prompted "double check this". Statistically speaking, when someone in a conversation says this, the other party is likely to change their position, so that's what an LLM does, too, replicating a statistically plausible conversation. That often goes in circles, not getting anywhere near a better answer.

    Not an ML researcher, so I can't explain it. But I get a pretty clear sense that it's an inherent problem and don't see how it could be trained away.

  • croes 2 hours ago

    Isn’t that the answer if you tell them they are wrong?

bitwize 2 hours ago

Ever hear of Dissociated Press? If not, try the following demonstration.

Fire up Emacs and open a text file containing a lot of human-readable text. Something off Project Gutenberg, say. Then say M-x dissociated-press and watch it spew hilarious, quasi-linguistic garbage into a buffer for as long as you like.

Dissociated Press is a language model. A primitive, stone-knives-and-bearskins language model, but a language model nevertheless. When you feed it input text, it builds up a statistical model based on a Markov chain, assigning probabilities to each character that might occur next, given a few characters of input. If it sees 't' and 'h' as input, the most likely next character is probably going to be 'e', followed by maybe 'a', 'i', and 'o'. 'r' might find its way in there, but 'z' is right out. And so forth. It then uses that model to generate output text by picking characters at random given the past n input characters, resulting in a firehose of things that might be words or fragments of words, but don't make much sense overall.
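The whole mechanism fits in a few lines (a character-level sketch of the same idea; the filename is just a placeholder for any long plain-text file):

    import random
    from collections import defaultdict

    def build_model(text, n=3):
        model = defaultdict(list)
        for i in range(len(text) - n):
            model[text[i:i + n]].append(text[i + n])   # which character follows each n-gram
        return model

    def dissociate(model, seed, length=300):
        out = list(seed)
        while len(out) < length:
            followers = model.get("".join(out[-3:]))   # last n=3 characters of output so far
            if not followers:
                break
            out.append(random.choice(followers))       # repeats in the list = higher probability
        return "".join(out)

    text = open("alice.txt").read()                    # placeholder: e.g. something off Project Gutenberg
    print(dissociate(build_model(text), seed="the"))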

LLMs are doing the same thing. They're picking the next token (word or word fragment) given a certain number of previous tokens. And that's ALL they're doing. The only differences are really matters of scale: the tokens are larger than single characters, the model considers many, many more tokens of input, and the model is a huge deep-learning model with oodles more parameters than a simple Markov chain. So while Dissociated Press churns out obvious nonsensical slop, ChatGPT churns out much, much more plausible sounding nonsensical slop. But it's still just rolling them dice over and over and choosing from among the top candidates of "most plausible sounding next token" according to its actuarial tables. It doesn't think. Any thinking it appears to do has been pre-done by humans, whose thoughts are then harvested off the internet and used to perform macrodata refinement on the statistical model. Accordingly, if you ask ChatGPT a question, it may well be right a lot of the time. But when it's wrong, it doesn't know it's wrong, and it doesn't know what to do to make things right. Because it's just reaching into a bag of refrigerator magnet poetry tiles, weighted by probability of sounding good given the current context, and slapping whatever it finds onto the refrigerator. Over and over.

What I think Yann LeCun means by "energy" above is "implausibility". That is, the LLM would instead grab a fistful of tiles -- enough to form many different responses -- and from those start with a single response and then through gradient descent or something optimize it to minimize some statistical "bullshit function" for the entire response, rather than just choosing one of the most plausible single tiles each go. Even that may not fix the hallucination issue, but it may produce results with fewer obvious howlers.