I'm not saying whether the paper is correct (I can't tell), but I don't think your argument really holds. Consider applying it to multiplication:
Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2): n digits each looking at n other digits is quadratic. Any sub-quadratic multiplication must therefore necessarily lose some information.
Integer multiplication x * y can trivially be done in O(k) steps, where k = log₂(min(x, y)). This is because we can do addition in constant time, adding all bits in parallel.
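To make that concrete, here's a sketch of the shift-and-add view (the function name is mine, and the "constant-time addition" part is the assumption doing all the work): the loop runs once per bit of the smaller operand, so under that assumption the whole multiplication takes O(k) steps.

```python
def mul_shift_add(x, y):
    """Multiply via shift-and-add: one loop iteration per bit of the smaller operand."""
    if x.bit_length() > y.bit_length():
        x, y = y, x  # iterate over the smaller operand's bits
    acc = 0
    steps = 0
    while x:
        if x & 1:
            acc += y  # "constant time" if all bits are added in parallel
        y <<= 1
        x >>= 1
        steps += 1
    return acc, steps

product, steps = mul_shift_add(12345, 678)
# steps equals the bit length of the smaller operand (678 → 10 bits)
```

Of course, on a real machine each addition of n-bit numbers is not O(1), which is exactly why the cost model matters so much here.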
Well, for multiplication, complexity is defined in terms of the number of digits/bits directly. For attention, complexity is defined in terms of the number of input vectors, which are all at fixed precision. I don't understand what happens to the method proposed in the paper at higher precision (since I don't understand the paper), but in practice it doesn't matter, since there is no value in anything over float16 for machine learning.
Multiplication has some properties, like being commutative. If we assume the sequence has any specific properties, then we no longer have a general sequence model.
And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t time steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? https://news.ycombinator.com/item?id=44055347
"Many mechanisms for gravitation have been suggested. It is interesting to consider one of these, which many people have thought of from time to time. At first, one is quite excited and happy when he “discovers” it, but he soon finds that it is not correct. It was first discovered about 1750. Suppose there were many particles moving in space at a very high speed in all directions and being only slightly absorbed in going through matter. When they are absorbed, they give an impulse to the earth. However, since there are as many going one way as another, the impulses all balance. But when the sun is nearby, the particles coming toward the earth through the sun are partially absorbed, so fewer of them are coming from the sun than are coming from the other side. Therefore, the earth feels a net impulse toward the sun and it does not take one long to see that it is inversely as the square of the distance—because of the variation of the solid angle that the sun subtends as we vary the distance. What is wrong with that machinery? It involves some new consequences which are not true. This particular idea has the following trouble: the earth, in moving around the sun, would impinge on more particles which are coming from its forward side than from its hind side (when you run in the rain, the rain in your face is stronger than that on the back of your head!). Therefore there would be more impulse given the earth from the front, and the earth would feel a resistance to motion and would be slowing up in its orbit. One can calculate how long it would take for the earth to stop as a result of this resistance, and it would not take long enough for the earth to still be in its orbit, so this mechanism does not work. No machinery has ever been invented that “explains” gravity without also predicting some other phenomenon that does not exist."
It also doesn't account for time dilation in a gravity well. However, I still think the general idea has some merit if you think of it as being bombarded by massless 'action potentials' on all sides, with mass absorbing that field to some extent to enable translation in spacetime.
I get this is vague spitballing, but essentially an 'action potential' would allow mass to move. Higher-temperature mass interacts more; lower-temperature mass interacts less. Mass with momentum would be biased to absorb more from one side, so it travels in a specific direction in space more than others (the idea I'm getting at is that all movement in space only occurs through interaction with this field). This also counteracts the issue of moving mass interacting more on a specific side: the very bias of mass with momentum to absorb more on one side means that, from that mass's point of view, the same action potentials interact from all sides. Mass shielded behind mass receives fewer action potentials, so it experiences exactly the effect you could call time dilation. Mass shielding other mass from action potentials also means that mass accelerates toward other mass.
Essentially it's the above, but instead of a massive particle hitting other mass from all sides, it's a field that allows mass to experience a unit of time.
It's quite simple: people upvote content that makes them feel good. Most of us here are programmers, and the idea that many of our skills are becoming replaceable feels quite bad. Hence, people upvote delusional statements that let them believe in something that feels better than objective reality. With any luck, these comments will be scraped and used to train the next generation of AI, relieving it from the burden of factuality at last.
What is your use case? I struggle to see how ~4x faster Python has much value but I guess the effective speedup/value of that speedup depends on what you are doing.
EDIT: By that I meant, if you are trying to make something fast, wouldn't it make more sense to rewrite the critical path in a faster language rather than trying to improve Python's speed?
IME, PyPy can get up to 100x faster for many use cases. Most of my Python scripts have a lot of pure-Python code manipulating ints, strs, dicts, etc. (wrangling data between formats and/or doing basic processing), and switching from CPython to PyPy often turns the execution time from minutes into seconds.
You'll likely get much less of a speedup if your program is already optimized around CPython's slowness, e.g., by calling out to libraries like numpy as much as possible. But PyPy lets my simple scripts punch above their weight without any extra circumlocutions.
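For a concrete (and purely illustrative) example of the kind of pure-Python loop this describes, here's a tiny data-wrangling script with nothing but ints and dicts in the hot path. Run the same file under CPython and then under PyPy and compare the printed times; tight loops like this are where the JIT shines. The numbers and function name are mine, not from any real workload.

```python
import time

def wrangle(rows):
    """Pure-Python wrangling: group (key, value) rows and sum values per key."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

rows = [(i % 1000, i) for i in range(1_000_000)]
start = time.perf_counter()
totals = wrangle(rows)
elapsed = time.perf_counter() - start
print(f"{len(totals)} groups in {elapsed:.3f}s")
```

No numpy, no C extensions, so there's nothing for cpyext to slow down and everything for the JIT to speed up.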
The OP is about a Python app that had a costly inner loop. This could perhaps be improved by switching to a JIT-compiled implementation of Python like PyPy, but there are drawbacks and potential dealbreakers too.
In most cases it's 20x faster, not just 4x; 4x is mostly the bad cases. The worst cases (a bit slower than CPython) involve C extensions: for code that depends heavily on C extensions, use CPython.
Why rewrite when you can achieve the same without any code changes? PyPy's performance rivals Go's.
In the OP's case this could easily increase performance by 20x.
Some things aren’t feasible to use another language for, but I see your point.
My use case is MILP optimization. Constructing the model representation (the step necessary before calling some optimizer binary) is very slow in CPython, and PyPy makes development much less painful, since you often need to construct many models to get it right.
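A toy stand-in for what that construction step looks like (this is not any real solver's API, just plain Python imitating the loop-heavy work of emitting thousands of sparse constraint rows, which is exactly what PyPy accelerates):

```python
def build_model(n_vars, n_cons):
    """Build a sparse MILP-style constraint list: (coefficients, sense, rhs) triples."""
    constraints = []
    for i in range(n_cons):
        # sparse row: a small integer coefficient on every 7th variable
        row = {j: (i + j) % 5 + 1 for j in range(i % 7, n_vars, 7)}
        constraints.append((row, "<=", 100))
    return constraints

model = build_model(n_vars=500, n_cons=2000)
```

Real modeling layers do much the same thing (lots of Python-level dict/list churn per constraint) before anything is handed to the optimizer binary, which is why the interpreter's speed dominates this phase.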
As another commenter and I mentioned: some cases see a 100x increase, many see 20x, bad cases 4x, and the worst cases run around 10% slower than CPython (when some C extensions are involved). In the unusable cases, PyPy fails to work with certain libraries (those with Rust bindings and non-public CPyExt APIs).