I'm not sure if people here even read the entirety of the article. From the article:
> We applied the AI co-scientist to assist with the prediction of drug repurposing opportunities and, with our partners, validated predictions through computational biology, expert clinician feedback, and in vitro experiments.
> Notably, the AI co-scientist proposed novel repurposing candidates for acute myeloid leukemia (AML). Subsequent experiments validated these proposals, confirming that the suggested drugs inhibit tumor viability at clinically relevant concentrations in multiple AML cell lines.
and,
> For this test, expert researchers instructed the AI co-scientist to explore a topic that had already been subject to novel discovery in their group, but had not yet been revealed in the public domain, namely, to explain how capsid-forming phage-inducible chromosomal islands (cf-PICIs) exist across multiple bacterial species. The AI co-scientist system independently proposed that cf-PICIs interact with diverse phage tails to expand their host range. This in silico discovery, which had been experimentally validated in the original novel laboratory experiments performed prior to use of the AI co-scientist system, are described in co-timed manuscripts (1, 2) with our collaborators at the Fleming Initiative and Imperial College London. This illustrates the value of the AI co-scientist system as an assistive technology, as it was able to leverage decades of research comprising all prior open access literature on this topic.
The model was able to come up with new scientific hypotheses that were tested to be correct in the lab, which is quite significant.
So, I've been reading Google research papers for decades now and also worked there for a decade and wrote a few papers of my own.
When Google publishes papers, they tend to juice the significance of the results (Google is not the only group that does this, but they are pretty egregious about it). You need to be skilled in the field of the paper to be able to pare away the exceptional claims. A really good example is https://spectrum.ieee.org/chip-design-controversy: while I think Google did some interesting work there, and it's true they included some of the results in their chip designs, their comparison claims are definitely over-hyped, and they did not react well when they got called out on it.
The article you linked is not an example of this happening. Google open-sourced the chip design method, and uses it in production for TPU and other chips.
Yes, I am aware. I didn't find Jeff's argument particularly convincing. Please note: I've worked personally with Jeff before and shared many a coffee with him. He's done great work and messed up a lot of things, too.
Unironically "just trust me bro" is actually fine here. They're objectively right and you'll find they are when you do your painstaking analysis to figure it out.
Seems to be true. 'Published' scientific research, by its sheer social dynamics (verging on highly toxic), is the academic equivalent of a pouty girl on Instagram.
(academic-burnout resembles creator-burnout for similar reasons)
Remember Google is a publicly traded company, so everything must be reviewed to "ensure shareholder value". Like dekhn said, it's impressive, but marketing wants more than "impressive".
This is true for public universities and private universities; you see the same thing happening in academic papers (and especially the university PR around the paper)
The actual papers don't overhype. But the university PRs regarding those papers? They can really overhype the results. And of course, the media then takes it up an extra order of magnitude.
I've definitely seen many examples of papers where the conclusions went far beyond what the actual results warranted. Scientists are incentivized to claim their discovery generalizes as much as possible.
But yes, it's normally: "science paper says an experiment in mice shows promising results in cancer treatment" then "University PR says a new treatment for cancer is around the corner" and "Media says cure for all cancer"
> As a field, I believe that we tend to suffer from what might be called serial silver bulletism, defined as follows: the tendency to believe in a silver bullet for AI, coupled with the belief that previous beliefs about silver bullets were hopelessly naive.
(H. J. Levesque. On our best behaviour. Artificial Intelligence, 212:27–35, 2014.)
I have worked with Google teams as well, and they taught me a fair bit about how to be rigorously skeptical. It takes domain knowledge, statistical knowledge, data, time and the computational resources to challenge them. I've done it, but it took real resources.
That said, it's a useful exercise to figure out the plan of attack. My experience is the "juice" was mainly in "easy true negative" subclasses. They weren't oversampled, but the human brain wouldn't even consider most of that data. Once you ablate those subclasses from the dataset, (which takes a lot of additional labelling effort), you can start challenging their assertions. But it's hard.
And that said I also review a number of articles in that domain, and I haven't seen a group with stronger datasets overall.
That applies to absolutely everyone. Convenient results are highlighted; inconvenient ones are either not mentioned or de-emphasized. You do have to be well read in the field to see what the authors _aren't_ saying; that's one of the purposes of being well-read in the first place. That is also why 100% of science reporting is basically disinformation: journalists are not equipped with this level of nuanced understanding.
yes, but google has a long history of being egregious, with the additional detail that their work is often irreproducible for technical reasons (rather than being irreproducible for missing methods). For example, we published an excellent paper but nobody could reproduce it because at the time, nobody else had a million spare cores to run MD simulations of proteins.
It's hardly Google's problem that nobody else has a million cores, wouldn't you agree? Should they not publish the result at all if it's using more than a handful of cores so that anyone in academia can reproduce it? That'd be rather limiting.
Well, a goal of most science is to be reproducible, and it couldn't be reproduced, merely for technical reasons (and so we shared as much data from the runs as possible so people could verify our results). This sort of thing comes up when CERN is the only place that can run an experiment and nobody can verify it.
Actually it IS Google's problem. They don't publish through traditional academic venues unless it suits them (much like OpenAI/Anthropic, often snubbing venues like NeurIPS because they don't want to open-source their code/models under MIT-style licenses, which peer reviewers demand), and their demand for so many GPUs chokes supply for the rest of the field - a field whose free labor they rely on to build technologies complementary to their models.
Eating, drinking, sleeping apply to absolutely everyone. Deception varies greatly by person and situation. I know people who are painfully honest and people I don't trust on anything, and many in between.
That a UPR inhibitor would inhibit viability of AML cell lines is not exactly a novel scientific hypothesis. They took a previously published inhibitor known to be active in other cell lines and tried it in a new one. It's a cool, undergrad-level experiment. I would be impressed if a sophomore in high school proposed it, but not a sophomore in college.
> I would be impressed if a sophomore in high school proposed it
That sounds good enough for a start, considering you can massively parallelize the AI co-scientist workflow, compared to the timescale and physical scale it would take to do the same thing with human high school sophomores.
And every now and then, you get something exciting and really beneficial coming from even inexperienced people, so if you can increase the frequency of that, that sounds good too.
We don't need an army of high school sophomores, unless they are in the lab pipetting. The expensive part of drug discovery is not the ideation phase, it is the time and labor spent running experiments and synthesizing analogues.
As discussed elsewhere, Deepmind are also working on extending Alphafold to simulate biochemical pathways and then looking to tackle whole-cell simulation. It's not quite pipetting, but this sort of AI scientist would likely be paired with the simulation environment (essentially as function calling), to allow for very rapid iteration of in-silico research.
It sounds like you're suggesting that we need machines that mass produce things like automated pipetting machines and the robots that glue those sorts of machines together.
Replacing a skilled technician is remarkably challenging. Oftentimes, when you automate this, you just end up wasting a ton of resources rather than accelerating discovery. Even simply integrating devices from several vendors (or even one vendor) can take months.
I've built microscopes intended to be installed inside workcells similar to what companies like Transcriptic built (https://www.transcriptic.com/). So my scope could be automated by the workcell automation components (robot arms, motors, conveyors, etc).
When I demo'd my scope (which is similar to a 3d printer, using low-cost steppers and other hobbyist-grade components) the CEO gave me feedback which was very educational. They couldn't build a system that used my style of components because a failure due to a component would bring the whole system down and require an expensive service call (along with expensive downtime for the user). Instead, their mech engineer would select extremely high quality components that had a very low probability of failure to minimize service calls and other expensive outages.
Unfortunately, the cost curve for reliability is not pretty: reducing mechanical failures to close to zero costs close to infinity dollars.
One of the reasons Google's book scanning was so scalable was their choice to build fairly simple, cheap, easy-to-maintain machines, build a lot of them, and train the scanning individuals to work with those machines' quirks. Just like with their clusters, they tolerate a much higher failure rate and build all sorts of engineering solutions where other groups would just buy one expensive device with a service contract.
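A quick back-of-envelope sketch of that cost curve (a toy model with illustrative numbers of my own, not figures from the thread): if a run needs every one of n serially-dependent components to work, per-component reliability compounds multiplicatively, which is why hobbyist-grade parts sink a workcell even when each part is individually "pretty good".

```python
# Toy model: a run succeeds only if all n serially-dependent components work,
# so per-component reliability compounds multiplicatively.
def run_success_prob(n_components: int, per_component_reliability: float) -> float:
    return per_component_reliability ** n_components

# With 50 components: hobbyist-grade (99%) vs industrial-grade (99.99%) parts.
print(round(run_success_prob(50, 0.99), 3))    # ~0.605 - fails 2 runs in 5
print(round(run_success_prob(50, 0.9999), 3))  # ~0.995
```

The gap between 99% and 99.99% parts is where the "close to infinity dollars" lives.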
This sounds like it could be centralised, a bit like the clouds in the IT world. A low failure rate of 1-3% is comparable to servers in a rack, but if you have thousands of them, then this is just a statistic and not a servicing issue. Several hyperscalers simply leave failed nodes where they are, it’s not worth the bother to service them!
Maybe the next startup idea is biochemistry as a service, centralised to a large lab facility with hundreds of each device, maintained by a dedicated team of on-site professionals.
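At fleet scale, the same failure rates turn from an emergency into a planning statistic, which can be sketched with a simple binomial model (illustrative rates of my own, not from the thread):

```python
import math

# For a fleet of n independent devices, each failing with probability p:
def expected_failures(n: int, p: float) -> float:
    """Mean number of failed devices in the fleet."""
    return n * p

def prob_at_least(n: int, p: float, k: int) -> float:
    """Probability that at least k devices fail (binomial tail)."""
    return 1 - sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

# 1000 devices at a 2% failure rate: plan spare capacity for ~20 dead units
# rather than dispatching a technician per failure.
print(expected_failures(1000, 0.02))
```

This is the same logic hyperscalers use when they leave dead nodes in the rack: provision for the expectation, ignore the individuals.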
None of the companies that proposed this concept have managed to demonstrate strong marketplace viability. A lot of discovery science remains extremely manual, artisanal, and vehemently opposed to automation.
> They couldn't build a system that used my style of components because a failure due to a component would bring the whole system down and require an expensive service call
Could they not make the scope easily replaceable by the user and just supply a couple of spares?
Just thinking of how cars are complex machines but a huge variety of parts could be replaced by someone willing to spend a couple of hours learning how.
That’s similar to how Google won in distributed systems. They used cheap PCs in shipping containers when everyone else was buying huge expensive SUN etc servers.
yes, and that's the reason I went to work at google: to get access to their distributed systems and use ML to scale up biology. I never was able to join Google Research and do the work I wanted (but DeepMind went ahead and solved protein structure prediction, so, the job got done anyway).
They really didn't solve it. AF works great for proteins that have a homologous protein with a crystal structure. It is absolutely useless for proteins with no published structure to use as a template - e.g. many of the undrugged cancer targets in existence.
@dekhn it is true (I also work in the field. I'm a software engineer who got a wet-lab PhD in biochemistry and work at a biotech doing oncology drug discovery)
There is a big range in both automation capabilities and prices.
We have a couple automation systems that are semi-custom - the robot can handle operation of highly specific, non-standard instruments that 99.9% of labs aren't running. Systems have to handle very accurate pipetting of small volumes (microliters), moving plates to different stations, heating, shaking, tracking barcodes, dispensing and racking fresh pipette tips, etc. Different protocols/experiments and workflows can require vastly different setups.
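To make the workflow variability concrete, here is a hypothetical sketch of how such a protocol might be represented in scheduling software. The step names and parameters are mine, not any vendor's API; real systems differ substantially.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str                 # e.g. "pipette", "move_plate", "shake", "read_barcode"
    params: dict = field(default_factory=dict)

# One hypothetical workflow: track the plate barcode, transfer 5 microliters,
# move the plate to a shaker station, and shake.
protocol = [
    Step("read_barcode", {"position": "deck_1"}),
    Step("pipette", {"volume_ul": 5.0, "source": "deck_1", "dest": "deck_2"}),
    Step("move_plate", {"from_pos": "deck_2", "to_pos": "shaker"}),
    Step("shake", {"rpm": 300, "seconds": 60}),
]

for step in protocol:
    print(step.action, step.params)
```

The point is that each experiment swaps in a different sequence of these steps, which is why "full engineering-scale optimization" rarely pays off for one-off protocols.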
So pharmaceutical research is largely an engineering problem of running experiments and synthesizing molecules as quickly, cheaply, and accurately as possible?
I wouldn't say it's an engineering problem. Biology and pharmacology are very complex with lots of curveballs, and each experiment is often different and not done enough to warrant full engineering-scale optimization (although this is sometimes the case!).
It also seems to be a financial problem of getting VC funds to run trials to appease regulators. Even if you’ve already seen results in a lab or other country.
We could have an alternative system where VC don’t need to appease regulators but must place X billion in escrow for compensation of any harm the medicine does to customers.
The regulator is not only there to protect the public; it also protects the VCs from responsibility.
Regulations around clinical trials represent the floor of what's ethically permissible, not the ceiling. As in, these guidelines represent the absolute bare minimum required when performing drug trials to prevent gross ethical violations. Not sure what corners you think are ripe for cutting there.
> Regulations around clinical trials represent the floor of what's ethically permissible, not the ceiling.
Disagree. The US FDA especially is overcautious to the point of doing more harm than good - they'd rather ban hundreds of lifesaving drugs than allow one thalidomide to slip through.
Yeah that's not how anything works. Compounds are approved for use or not based on empirical evidence, thus the need for clinical trials. What's your level of exposure to the pharma industry?
> Compounds are approved for use or not based on empirical evidence, thus the need for clinical trials.
But off-label use is legal, so it's ok to use a drug that's safe but not proven effective (to the FDA's high standards) for that ailment... but only if it's been proven effective for some other random ailment. That makes no sense.
> What's your level of exposure to the pharma industry?
I strongly encourage you to take a half hour and have a look at what goes into preclinical testing and the phases of official trials. An understanding of the data gathered during this process should clear up some of your confusion around safety and efficacy of off-label uses, which parenthetically pharma companies are strictly regulated against encouraging in any way.
This is the general problem with nearly all of this era of generative AI and why the public dislike it so much.
It is trained on human prose; human prose is primarily a representation of ideas; it synthesizes ideas.
There are very few uses for a machine to create ideas. We have a wealth of ideas and people enjoy coming up with ideas. It’s a solution built for a problem that does not exist.
Especially when you consider the artificial impressive high school sophomore is capable of having impressive high school sophomore ideas across and between an incredibly broad spectrum of domains.
And that their generation of impressive high school sophomore ideas is faster, more reliable, communicated better, and can continue 24/7 (given matching collaboration), relative to their bio high school sophomore counterparts.
I don’t believe any natural high school sophomore this impressive, on those terms, has ever existed. Not close.
We humans (I include myself) are awful at judging things or people accurately (in even a loose sense) across more than one or two dimensions.
This is especially true when the mix of ability across several dimensions is novel.
(I also think people underestimate the degree to which we, as users and “commanders” of AI, bottleneck their potential. I don’t suggest they are ready to operate without us. But our relative lack of energy, persistence & focus all limit what we get from them in those dimensions, hiding significant value.
We famously do this with each other, so not surprising. But worth keeping in mind when judging limits: whose limits are we really seeing.)
I don't need high school level ideas, though. If people do, that's good for them, but I haven't met any. And if the quality of the ideas is going to improve in future years, that's good too, but also not demonstrated here.
I am going to argue that you do. Then I will be interested in your response, if you feel inclined.
We all have our idiosyncratically distributed areas of high intuition, expertise and fluency.
None of us need apprentice level help there, except to delegate something routine.
Lower quality ideas there would just gum things up.
And then we all have vast areas of increasingly lesser familiarity.
I find that the more we grow our strong areas, the more those areas benefit from contact, as efficient as possible, with as many other areas as possible. In both trivial and deeper ways.
The better developer I am, in terms of development skill, tool span, novel problem recognition and solution vision, the more often and valuable I find quick AI tutelage on other topics, trivial or non-trivial.
If you know a bright high school student highly familiar with a domain that you are not, but have reason to think that area might be helpful, don’t you think instant access to talk things over with that high schooler would be valuable?
Instant non-trivial answers, perspective and suggestions? With your context and motivations taken into account?
Multiplied by a million bright high school students over a million domains.
—
We can project the capability vector of these models onto one dimension, like “school level idea quality”. But lower dimension projections are literally shadows of the whole.
But if we use them in the direction of their total ability vector (and given they can iterate, it is actually a compounding eigenvector!), their value goes way beyond “a human high schooler with ideas”.
It does take time to get the most out of a differently calibrated tool.
Suggesting "maybe try this known inhibitor in other cell lines" isn't exactly novel information though. It'd be more impressive and useful if it hadn't had any published information about working as a cancer inhibitor before. People are blasé about it because it's not really beating the allegations that it's just a very fancy parrot when the highlight of it's achievements is to say try this known inhibitor with these other cell lines, decent odds that the future work sections of papers on the drug already suggested trying on other lines too...
A couple years ago even suggesting that a computer could propose anything at all was sci-fi. Today a computer read the whole internet, suggested a place to look at and experiments to perform and… ‘not impressive enough’. Oof.
People are facing existential dread that the knowledge they worked years for is possibly about to become worth a $20 monthly subscription. People will downplay it for years no matter what.
I'm sure the scientists involved had a wish list of dozens of drug candidates to repurpose to test based on various hypotheses. Ideas are cheap, time is not.
In this case they actually tested a drug probably because Google is paying for them to test whatever the AI came up with.
I’m not familiar with the subject matter, but given your description, I wouldn’t really be impressed by anyone suggesting it. It just sounds like a very plausible “What if” alternative.
On the level of suggesting suitable alternative ingredients in fruit salad.
We should really stop insulting the intelligence of people to sell AI.
I read the cf-PICI paper (abstract) and the hypothesis from the AI co-scientist. While the mechanism from the actual paper is pretty cool (if I'm understanding it correctly), I'm not particularly impressed with the hypothesis from the co-scientist.
It's quite a natural next step to take to consider the tails and binding partners to them, so much so that it's probably what I would have done and I have a background of about 20 minutes in this particular area. If the co-scientist had hypothesised the novel mechanism to start with, then I would be impressed at the intelligence of it. I would bet that there were enough hints towards these next steps in the discussion sections of the referenced papers anyway.
What's a bit suspicious is in the Supplementary Information, around where the hypothesis is laid out, it says "In addition, our own preliminary data indicate that cf-PICI capsids can indeed interact with tails from multiple phage types, providing further impetus for this research direction." (Page 35). A bit weird that it uses "our own preliminary data".
> A bit weird that it uses "our own preliminary data"
I think the potential of LLM-based analysis is sky high, given the amount of concurrent research happening and the high context load required to understand the papers. However, there is a lot of pressure to show how amazing AI is, and we should be vigilant. So my first thought was: could it be that the training data / context / RAG had access to a file it should not have, and that contaminated the result? This is indirect evidence that maybe something was leaked.
This is one thing I've been wondering about AI: will its broad training enable it to uncover previously unnoticed connections between areas, the way multi-disciplinary people tend to, or will it still miss them because it's limited to its training corpus and can't really infer?
If it ends up being more the case that AI can help us discover new stuff, that's very optimistic.
In some sense, AI should be the most capable at doing this within math. Literally the entire domain in its entirety can be tokenized. There are no experiments required to verify anything, just theorem-lemma-proof ad nauseam.
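As a trivial illustration of that theorem-lemma-proof chain being pure text, here is an example of my own in Lean: the statement, the proof term, and a later theorem that reuses it are all one tokenizable artifact, with no lab work needed to check any of it.

```lean
-- A lemma-style theorem and a theorem that reuses it.
-- The proof checker, not an experiment, is the ground truth.
theorem n_add_zero (n : Nat) : n + 0 = n := rfl

theorem zero_add_zero : (0 : Nat) + 0 = 0 := n_add_zero 0
```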
Doing this like in this test, it's very tricky to rule out the hypothesis that the AI is just combining statements from the Discussion / Future Outlook sections of some previous work in the field.
Math seems to me like the hardest thing for LLMs to do. It requires going deep with high IQ symbol manipulation. The case for LLMs is currently where new discoveries can be made from interpolation or perhaps extrapolation between existing data points in a broad corpus which is challenging for humans to absorb.
Alternatively, human brains are just terrible at "high IQ symbol manipulation" and that's a much easier cognitive task to automate than, say, "surviving as a stray cat".
If they solve tokenization, you'll be SHOCKED at how much it was holding back model capabilities. There are tons of works at NeurIPS on various tokenizer hacks or alternatives to BPE which massively improve the types of math that models are bad at (e.g., arithmetic performance).
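A toy illustration of the kind of issue those works target (this is not any real tokenizer, just greedy chunking of a digit string): left-to-right chunks put the place-value boundaries in different spots for numbers of different lengths, while right-to-left grouping keeps the ones/thousands boundaries stable, which some number-tokenization schemes exploit.

```python
# Toy digit chunker, NOT a real BPE implementation.
def chunk(s: str, size: int = 3, right_to_left: bool = False) -> list[str]:
    if right_to_left:
        rev = s[::-1]
        return [rev[i:i + size][::-1] for i in range(0, len(rev), size)][::-1]
    return [s[i:i + size] for i in range(0, len(s), size)]

print(chunk("1234567"))                      # ['123', '456', '7']
print(chunk("1234567", right_to_left=True))  # ['1', '234', '567']
```

In the left-to-right version, the token "456" means hundreds-to-ones in one number and thousands in another; right-to-left grouping makes token position track place value consistently.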
This line of reasoning implies "the stochastic parrot people are right, there is no intelligence in AI". Which is the opposite of what AI thought leaders are saying.
I reject the Stochastic Parrot theory. The claim is more about comparative advantage; AI systems already exist that are superhuman on breadth of knowledge at undergrad understanding depth. So new science should be discoverable in fields where human knowledge breadth is the limiting factor.
> AI systems already exist that are superhuman on breadth of knowledge at undergrad understanding depth
Two problems with this:
1. AI systems hallucinate stuff. If it comes up with some statement, how will you know that it did not just hallucinate it?
2. Human researchers don't work just on their own knowledge, they can use a wide range of search engines. Do we have any examples of AI systems like these that produce results that a third-year grad student couldn't do with Google Scholar and similar instructions? Tests like in TFA should always be compared to that as a baseline.
> new science should be discoverable in fields where human knowledge breadth is the limiting factor
What are these fields? Can you give one example? And what do you mean by "new science"?
The way I see it, at best the AI could come up with a hypothesis that human researchers could subsequently test. Again, you risk that the hypothesis is hallucination and you waste a lot of time and money. And again, researchers can google shit and put facts together from different fields than their own. Why would the AI be able to find stuff the researchers can't find?
This is kinda getting at a core question of epistemology. I’ve been working on an epistemological engine by which LLMs would interact with a large knowledge graph and be able to identify “gaps” or infer new discoveries. Crucial to this workflow is a method for feedback of real world data. The engine could produce endless hypotheses but they’re just noise without some real world validation metric.
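A minimal sketch of the "gap" idea under one simple assumption of my own (not the commenter's actual design, and with made-up example entities): treat two concepts that share a neighbor in the knowledge graph but have no direct edge as a candidate hypothesis, to be ranked and validated against real-world data later.

```python
from itertools import combinations

# Hypothetical knowledge-graph edges (illustrative entities, not real findings).
edges = {("AML", "UPR_inhibitor"), ("UPR_inhibitor", "ER_stress"),
         ("ER_stress", "apoptosis"), ("AML", "apoptosis")}

graph: dict[str, set[str]] = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def candidate_gaps(g: dict[str, set[str]]) -> set[tuple[str, str]]:
    """Pairs sharing a neighbor but lacking a direct edge: candidate hypotheses."""
    gaps = set()
    for hub, nbrs in g.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b not in g[a]:
                gaps.add((a, b))
    return gaps

print(candidate_gaps(graph))
```

As the comment notes, this generator produces endless noise on its own; the hard part is the feedback loop that scores each candidate against experimental reality.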
Similar work is being done in materials science, where AI suggests different combinations of ingredients to find different properties. So when people say AI (machine learning, LLMs) is just for show, I am a bit shocked, as AIs today have accelerated discoveries in many different fields of science, and this is just the start. Anna's Archive will probably play a huge role in this, as no human, or even group of humans, will have all the knowledge of so many fields that an AI will have.
The automobile was a useful invention. I don't know if back then there was a lot of hype around how it can do anything a horse can do, but better. People might have complained about how it can't come to you when called, can't traverse stairs, or whatever.
It could do _one_ thing a horse could do better: Pull stuff on a straight surface. Doing just one thing better is evidently valuable.
I think AI is valuable from that perspective, you provide a good example there. I might well be disappointed if I would expect it to be better than humans at anything humans can do. It doesn't have to. But with wording like "co-scientist", I see where that comes from.
> It could do _one_ thing a horse could do better: Pull stuff on a straight surface
I would say the doubters were right, and the results are terrible.
We redesigned the world to suit the car, instead of fixing its shortcomings.
Navigating a car centric neighbourhood on foot is anywhere between depressing and dangerous.
I hope the same does not happen with AI. But I expect it will.
Maybe in your daily life AI will create legal contracts that are thousands of pages long
And you will need AI of your own to summarise them and process them.
Excellent point. Just because the invention of the automobile arguably introduced something valuable, how we ended up using them had a ton of negative side effects. I don't know enough about cars or horses to argue pros and cons. But I can certainly see how we _could_ have used them in a way that's just objectively better than what we could do without them. But you're right, I can't argue we did.
It's not just about doing something better but about the balance between the pros and the cons. The problem with LLMs is hallucinations. If cars somehow made you drive the wrong way with the frequency that LLMs send one down the wrong path with compelling-sounding nonsense, then I suspect we'd still be riding horses nowadays.
I can get value out of them just fine. But I don't use LLMs to find answers, mostly to find questions. It's not really what they're being sold/hyped for, of course. But that's kinda my point.
What does this cited article have to do with AI? Unless I’m missing something the researchers devised a novel method to create a material that was known since 1967.
It's cool, no doubt. But keep in mind this is 20 years late:
As a prototype for a "robot scientist", Adam is able to perform independent
experiments to test hypotheses and interpret findings without human guidance,
removing some of the drudgery of laboratory experimentation.[11][12] Adam is
capable of:
* hypothesizing to explain observations
* devising experiments to test these hypotheses
* physically running the experiments using laboratory robotics
* interpreting the results from the experiments
* repeating the cycle as required[10][13][14][15][16]
While researching yeast-based functional genomics, Adam became the first
machine in history to have discovered new scientific knowledge independently of
its human creators.[5][17][18]
I also think people underestimate how much benefit a current LLM already provides to researchers.
A lot of them have to do things on computers that have nothing to do with their expertise: coding a small tool for working with their data, small tools for crunching results, formatting text data, searching for and finding the right materials.
An LLM that helps a scientist code something in an hour instead of a week makes this research A LOT faster.
And we know from another paper that we now have so much data that you need to use systems to find the right information for you. That study estimated how much additional critical information a research paper missed.
Don't worry, it takes about 10 years for drugs to get approved, AIs will be superintelligent long before the government gives you permission to buy a dose of AI-developed drugs.
Not that I don't think there's a lot of potential in this approach, but the leukemia example seemed at least poorly-worded, "the suggested drugs inhibit tumor viability" reads oddly given that blood cancers don't form tumors?