This is obviously very cool, but at this point — who knows what I’ll say in a year — my concern with these LLMs is that they’re in the uncanny valley.

Here’s one passing a very difficult test. Amazing! Now, rely on it to build a nuclear doohickey for a power station or a multi-billion dollar device for CERN or anything really and, well, no.

So humans still have to check the output, and now we're in the same situation as humans driving a Tesla on Autopilot: they're supposed to be 100% aware of the road but aren't, because they get lazy and doze off, and then the car crashes and whoops.

No negativity towards AI here. It’s amazing and it’ll change the future. But we need to be careful on the way.



I agree with the point you're making here, but it’s also funny that the description of someone passing a test but not being able to do much without a lot of human supervision is… exactly the description of a human college graduate.


Can anyone estimate the chance that example questions like these were in its training set?

And it's just regurgitating the answers someone else wrote?

I imagine it's a very high chance, given how much uni lecturers recycle exam questions.

When I was at uni you could just get the last 5 years' worth of questions from the library for almost any subject and guess what the questions were probably going to be. Often they just changed a few numbers.

Teaching undergrads is like a sausage factory: the actual intellectual value for undergrads is in the seminars, the practical value in the labs. The rest is showing you can regurgitate what you've been told.

Which ChatGPT excels at.


I don’t know anything about quantum computing. I successfully answered a few true/false questions by pasting them into Google.

The exact questions aren’t in the top results, but the answers are.


From the article:

> To the best of my knowledge—and I double-checked—this exam has never before been posted on the public Internet, and could not have appeared in GPT-4’s training data.


The exam, no, but most of the questions most certainly are. I know this because I've done extremely similar problems for homework and checked my answers online.


Yes, the vast majority of these questions are standard, well-known problems that it definitely already saw, just with a slightly different formulation.


You can try phrasing the question in a way it wouldn't normally be phrased but that would still demonstrate understanding of the concept.

I remember Yann LeCun gave an interview and he came up with some random question like "If I'm holding a piece of paper with both of my hands above the desk and I release one, what would happen?" His point was that since the LLM doesn't have a world model, it wouldn't be able to answer these trivial intuitive questions unless it saw something similar in the training set. And then the interviewer tried it and it failed. That was 3.5. I've tried many variations of that class of problem with 4 and it seems to generalize basic physics concepts quite well. So maybe 4 learned basic physics? Why couldn't it learn QM theory as well?


For a college graduate, that is the starting point. Test results are supposed to signal that the person can learn new things. While a fresh graduate needs a lot of supervision, they should quickly become more capable and productive.

For a language model, test results are the end. They are supposed to measure what the model is capable of. If you need better performance, you must train a better model.


It's the college graduates who aren't the way you describe, the ones who show initiative and responsibility in their work, who are the best hires. So not much changes.


I think watching the development of driverless cars in the last 15 years has taught a lot of people to be skeptical of 95% solutions. Sometimes you really need that 100% or the solution is practically useless.


I guess the million-dollar question is: what are the problems where a 95% solution works?


Making websites for small businesses such as restaurants and hairdressers, where neither the owners nor the clients have heard of or care about "reactive design", and don't want any database more complex than an Excel spreadsheet, even if you do try to explain why that's a horrifyingly bad use of the wrong tool.


Probably a lot of them, honestly.

The 95%-only problem is an issue for cars because that last 5% means you die horribly in a head-on collision, or maybe you only get a mild concussion but are stuck in a ditch.

But if I can get 95% of my router configs done, 95% of my documentation written, and 95% of a website whipped up, I can hand that off to a Sr Engineer/Admin and have them take care of the last bits. As long as the hours, phone number, and location are good, a website just needs to be "directionally accurate" and otherwise fairly basic.


Or people need to learn that different problems have different risk profiles. A 95% answer to whether I need more eggs is different from a 95% solution for driving a vehicle at 60 MPH.


I mean, this is clearly not the case with LLMs. They create value today even though they are not AGI yet.


LLMs create value but it's not yet clear how much.


> No negativity towards AI here. It's amazing and it'll change the future. But we need to be careful on the way.

Yeah, I suspect a lot of fields will have a similar trajectory to how AI has impacted radiology.

It might catch the tumor in 99.9999% of cases, better than any human doctor. But missing a malignant tumor 0.0001% of the time is unacceptable, because it spikes the hospital's malpractice costs. So every single scan still has to be reviewed manually by a doctor first, then by the AI as a fallback.

In theory there's some insurance scheme that could overcome this, but in practice when you have software reviewing millions of scans a day you're opening yourself up to class action lawsuits in a way no competent human doctor would.


> It might catch the tumor in 99.9999% of cases, better than any human doctor. But missing a malignant tumor 0.0001% of the time is unacceptable, because it spikes the hospital's malpractice costs. So every single scan still has to be reviewed manually by a doctor first, then by the AI as a fallback.

I find it hard to believe human doctors miss malignant tumors in less than 1 out of every million cases.


The point is that it may miss a tumor obvious to a human doctor.


As long as it performs better on average, that should not be an issue.


"So sorry the AI missed your malignant tumor! On average, it actually performs better than a human doctor. I mean, a human doctor definitely would have caught this one, and yeah, you're going to die, but hopefully the whole average thing makes you feel better!"


Does the opposite work too? What if a human doctor misdiagnoses me, but I can prove in court that an available medical-grade AI would have given the correct diagnosis? Could I sue for that?


I don’t really understand why you’d ask this. The point is to save as many lives as possible.


We acknowledge that both humans and "medical grade AI" are flawed, but they're flawed in very different ways and until we can understand how and why an AI model fails, it should be supplemental.


The standards for medical malpractice are super nuanced and variable but the general idea is the "man on the street" concept, or in this case "the average doctor" concept.

As the parent poster put it, it's only a problem if the average doc would have caught it. If it's truly a one-in-a-million thing, an extreme edge or corner case, malpractice courts may not have a problem with you missing it -- as they say, "if you hear hooves, do you think of horses or zebras?" 99% of the time a different diagnosis is the right one, and even at five nines you're letting someone through eventually.


I always think of comparisons to aviation. There are a million and one things that can go wrong when flying a plane, but it's still one of the safest ways to travel. That's because regulations and safety standards are held to such a high degree that we simply don't consider injury or death an acceptable outcome.

Whenever someone says "as long as it's better than a human", that's where my mind goes. We shouldn't be satisfied with just being better than a human. We shouldn't be satisfied with five nines! I don't really care about what courts have a problem with — my point is just that our goal should be zero preventable deaths, not just moving from humans to AI once the latter can be better on average than the former.


We already accept that a particular doctor, even if they are an expert, can miss a tumor that could be obvious to a second doctor.


If also passing the scan to a human is feasible (and clearly it is, because that's what we were doing previously) and will reduce the error rate even further, what's the argument for not doing it?


I'm pretty sure koboll's point was that by having a doctor in the loop, the hospital can wash their hands of that one person's malpractice suit easy enough. Just fire the doctor, let their individual insurance deal with it, and move on. When the hospital cuts out the middle man, they take on a new level of direct accountability they don't currently have.


I suspect AI went that way in radiology not because of the chances of false negatives, but because radiologists are entrenched in the system and will not yield an insanely lucrative stream of revenue.


Hospitals would love to fire all radiologists and replace them with software.

They've already done it with outsourcing; a large chunk of what used to be done entirely in-house has been contracted out to remote overseas doctors.


What is outsourced to overseas doctors today? I'm assuming you're talking about the US.

From what I understand it isn't even possible generally to see a doctor remotely in a cheaper state, because medical licensing is per-state.


Medical scans are reviewed abroad. This practice started in dentistry in the 90s/early 2000s but expanded to radiological scans as well. At this point most CT, MRI, and X-ray scans in the US have a first-pass analysis done by doctors in India and Pakistan.

Medical billing has also been offshored to India and Pakistan, btw.

In general, a lot of back-office dental and medical functions were outsourced in the 2000s and 2010s.

E.g. this paper about it from 2006: https://ipc.mit.edu/sites/default/files/2019-01/06-005.pdf


> It might catch the tumor in 99.9999% of cases, better than any human doctor. But missing a malignant tumor 0.0001% of the time is unacceptable

Those probabilities are way off given biology, but anyway ...

The interesting cases of AI in radiology would be catching stuff that a human has no hope of catching.

For example, a woman with lobular (instead of ductal) breast cancer generally doesn't present until mid-to-late Stage 3 (which limits treatment options) because those cancers don't form lumps.

You can stare at mammograms and ultrasounds all day and won't see anything because the "lumps" are unresolvable. You're trying to find a sleet particle in a blizzard. Sure, it's totally obvious on an MRI scan, but you don't want to do those without reason (picking up totally benign growths, gadolinium bioaccumulation, infections from IVs, etc.)

An AI, however, could correlate subtle, but broad changes that humans are really bad at catching. Your last 5 mammograms looked like this but there is just something a little off about this one--go get an MRI this time.


This seems an oversimplification of radiology. Things are not black and white; we are talking years of training on specific subjects to be able to "see" an image. I believe AI will help, but it will need supervision, and at the same time the doctors are going to get trained on the difficult border cases. Also, de-anonymizing data for training is a big deal. This is not happening any time soon.


ChatGPT has already found an issue with my relative in the ICU that a literal team of doctors and nurses missed. This just happened last week. Unfortunately, we only checked ChatGPT retroactively, after we went through the screw-up.

I think people probably overestimate (maybe vastly) how good at differential diagnosis most doctors are.


It will absolutely, 100%, be in place as a fallback very soon, and be ubiquitous in that role. Just as AI is now for radiology. That's different from replacing the team of doctors and nurses, though.


Even with a 1-in-10,000 false-negative rate, I bet someone is doing the cost calculation of risk vs. how many hours it would take for a doctor to check 10,000 scans. Doctors themselves are not perfect, so they may even have a higher error rate.
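A back-of-envelope version of that calculation; every number below is a made-up placeholder (assumed miss rates, review time, and costs), purely to illustrate the shape of the trade-off:

    # Hypothetical trade-off: human review labor vs. expected cost of misses.
    # Every number below is an assumed placeholder, not a real statistic.
    scans = 10_000
    ai_miss_rate = 1 / 10_000       # assumed AI false-negative rate
    doctor_miss_rate = 3 / 10_000   # assumed human false-negative rate
    minutes_per_scan = 2            # assumed human review time per scan
    doctor_cost_per_hour = 300      # assumed fully loaded hourly cost
    cost_per_miss = 5_000_000       # assumed liability/outcome cost per miss

    labor = scans * minutes_per_scan / 60 * doctor_cost_per_hour
    ai_risk = scans * ai_miss_rate * cost_per_miss
    doctor_risk = scans * doctor_miss_rate * cost_per_miss

    print(f"human review labor:       ${labor:,.0f}")        # $100,000
    print(f"AI expected miss cost:    ${ai_risk:,.0f}")      # $5,000,000
    print(f"human expected miss cost: ${doctor_risk:,.0f}")  # $15,000,000

Under these placeholder numbers the expected cost of misses dwarfs the review labor on both sides, so the comparison hinges on the relative error rates rather than the doctor's hours.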


Give the doctor an AI tool which is fast and 99.999% accurate. Since they have automation now, give them a massive workload, so they can’t reasonably check everything. Now the machine does the work and the doctor is just the fall-guy if it messes up.


I have no idea how the legal responsibility works out if a doctor misses a malignant mole. But I'd be very willing to believe that AI's inability to be held legally responsible for anything will be a problem for AI uptake.

This does seem like an odd outcome though, right? I guess fundamentally humans will remain legally/economically advantageous in some sense, because the amount of insurance an individual doctor can be expected to hold is much less than what a hospital would be on the hook for. Is the fate of humans to exist not as a unit of competence, but as a unit of ablative legal armor?


> catch the tumor in 99.9999% of cases, better than any human doctor

I don't think results are anywhere close to that in the field. If hospitals could do without radiologists, they would do so immediately. Currently we are seeing very little progress from applied statistics in the field, and the real cause of that is that tech people don't understand what we do and why we're still very much needed. The problem lies more in information-retrieval capabilities than in acting on the data itself.


> But missing a malignant tumor 0.0001% of the time is unacceptable, because it spikes the hospital's malpractice costs.

Meanwhile, in every country that isn't America, tumor detection rates go up, cancer outcomes improve, and the cost of delivery falls.

Americans continue to claim their system of funding insurance company profits rather than actual healthcare is best.



I guess it doesn't have to make much sense, as long as you can blame America...


Image recognition and statistics is already being used as a first pass for pathologists in full force today. It’s weird to pretend like this is some new uncharted frontier for medicine and/or that insurance doesn't know how to handle it…


And the class action lawsuit will be orchestrated by an AI


That's not how malpractice works.


A lot of humans fake it until they make it. A lot of humans are lazy. A lot of humans are given responsibility of things that they are unqualified for. A lot of humans make mistakes.

The military, for all its funding and all its training and all its planning, has lost multiple nuclear weapons, on American soil.

We are imperfect machines who aspire to build more perfect versions of ourselves, through children and now through AI. By many measures, we’ve succeeded. The progress will almost certainly continue. The question is, when will it be good enough for you to embrace it despite its imperfections?


This is my philosophical take on it… We'd be better off admitting that we rely on nature for almost everything we need, no matter how great an iPhone or ChatGPT is. Some "peasant" still grows your food and the "dumb plants" still produce your air. The more complicated we make things, even with AI augmentation, the more problems we will continue to cause ourselves.

Look at what we've done to the soil and the oceans; we're likely on the wrong track. Maybe it's even unrecoverable.

I think we can have a much more advanced society, but we need to slow down a lot, do things more in harmony with the natural world, and put our egos aside. We'd probably have a much better quality of life for doing so.


A lot of people here will say "never", and they'll drag the rest of us to hell on the way to creating their "heaven".


How is this any different from using other technology, e.g. a calculator or a power tool? Or a manager and the output of their ICs?

Obviously the scope of what's possible is different but _any_ craftsman using _any_ tool will only be as good as what they verify themselves.


Look at the people who have been killed by Teslas because they were playing video games or having hanky-panky in the back seat. People don't do that with a gas pedal or even cruise control, because it will go wrong very quickly and we know it. With subhuman but roughly functional self-driving, we instinctively sense that our attention is wasted 99% of the time. It's hard to make people pay attention 100% of the time when it's needed only 1% of the time.


People absolutely do that with cruise control, and with a plain gas pedal... Look at the people texting, or drinking and driving, in non-self-driving vehicles. Adding AI to an inherently risky activity doesn't eliminate the risk. If you're saying the stakes change when technology is involved (in this case AI), then where do you draw the line?

It reminds me of anti seatbelt rhetoric or the same stuff skateboarders say about helmets.


I think this is the point.

These tools aren't replacements for knowledge and experience (which is the narrative that is constantly pushed); they're bionic enhancements.


A calculator or even most power tools have very predictable outputs for a given input. GPT does not.


A big factor is that the error rate compounds. Tiny mistakes would become bigger and bigger if you ran GPT in a loop.
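A minimal sketch of that compounding, assuming (purely for illustration) that each step is independently correct with probability p:

    # If each step in a chain is right with probability p, the chance the
    # whole chain stays mistake-free decays as p**n. Illustrative numbers only.
    p = 0.99  # assumed per-step accuracy
    for n in (1, 10, 50, 100):
        print(f"{n:>3} chained steps: {p**n:.1%} chance of zero mistakes")
    #   1 chained steps: 99.0% chance of zero mistakes
    #  10 chained steps: 90.4% chance of zero mistakes
    #  50 chained steps: 60.5% chance of zero mistakes
    # 100 chained steps: 36.6% chance of zero mistakes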

It needs interaction and calibration from a human to keep it in check. Even we as humans are using all kinds of different feedback mechanisms to validate our thought process.

Though this might be a different way of hooking GPT up to reality.


Also, humans already have to check each other's output today: code reviews, etc.

Maybe in the near future humans can just focus on the reviewing and testing part.


Sure, but we'll get lazy.


Humans are going to want to do the worst parts of the job?


Would we even still be able to? If we stop learning how to code (because AI does that for us) we might not be very good reviewers.


Humans used to go in coal mines.


They still do. But that job hasn't gotten worse; it's always been the same level of suck.


It's gotten quite a bit safer and healthier, but it is still a quite gnarly job.


Let me make an AI to check the first AI's work, then we can all have cake… on the beach… in Tahiti.


I also don't think it's going to be linear growth. 4 is really good, but for writing, 3.5 gave better results in some cases.

And if you really dive deep, you see 4 doesn't really have any deep understanding and makes obvious mistakes. I think they've also explicitly trained it for tests.


> So humans still have to check the output, and now we're in the same situation as humans driving a Tesla on Autopilot: they're supposed to be 100% aware of the road but aren't, because they get lazy and doze off, and then the car crashes and whoops.

The only question is whether, on average, AI-augmented code suggestions reduce overall bug rates. Human devs are already dozing off, with crashes and whoopsies at high rates, so it doesn't matter too much if AI output has a bug or whoopsie, so long as they occur at rates lower than in purely human-generated code.


I don't think it's as simple as a mere bug-rate comparison. When a human makes a whoopsie, we blame that human. Maybe it's a bit uncomfortable, but it works to prevent them from cooperating with attempts to inject whoopsies into their code.

If we create a culture of tolerating fatal AI hallucinations as long as they happen at a rate that's below the human caused fatality baseline, we create an opportunity for bad actors to use the "nobody is to blame" window as a plausible deniability weapon (so long as they don't use it too often).


> humans still have to check the output

How often is the "have to check the output" step more work than just doing the task at hand without the LLM in the first place?


I don't think the data supports the fear that AI-assisted driving is more, or newly, dangerous compared to fully human drivers. Teslas are safer than any other car on the road. Yes, they're newer, but by the mile they're safer. So the fear that "we should be careful" is understandable but ultimately unfounded. We are being careful.


> Teslas are safer than any other car on the road.

This simply isn't true. By the mile they have a worse safety record than other cars in their class (mitigating factors: where they're driven and who drives them). You might be referring to Tesla's marketing statistic that there are fewer accidents per mile involving autopilot - typically engaged in ideal driving circumstances - than when it's switched off, or across all other drivers. But that's not very meaningful data. Some would argue that citing marketing figures from a company with a track record for obfuscation and dishonesty as "the data" (and really, there isn't much better data available to the average person) is an indication we're not being careful.


Where did you get your stats? I could only find crash test and similar ratings, not actual records.


It’s abundantly clear that we are going to expect near-perfect reliability from autonomous vehicles. This isn’t necessarily illogical; they operate in a different context than humans do. We expect humans to make mistakes and we have various ways of dealing with the consequences (eg lawsuits targeted at the responsible individual). The argument from statistics doesn’t appear likely to win the kind of societal approval we need for autonomous vehicles to be accepted.


> We expect humans to make mistakes and we have various ways of dealing with the consequences (eg lawsuits targeted at the responsible individual).

The same is true of manufacturers and others in the chain of commerce of goods (see, e.g., the general rules on defective product liability), even where they aren’t individual humans. There’s nothing about AI which makes it particularly special in this regard.


Your point about liability is valid, but it’s far from “abundantly clear” that society needs machines to be near perfect as opposed to just significantly better than humans.

Solve the liability problem and I would 100% take a machine that performs 30% better than a human or helps a human perform 30% better every time because it means fewer humans die on the road.


> I would 100% take a machine that performs 30% better than a human or helps a human perform 30% better every time because it means fewer humans die on the road.

It's surprisingly hard to establish those parameters though, since (i) the more meaningful indicators of good performance (driver-error fatalities) happen only every million or so miles, even less frequently once you've narrowed your pool down to errors made by sober drivers who weren't racing or attempting to drive in conditions unsuited to electronic assistance, so that's a lot of real-world road use required to establish statistically significant evidence that a machine is 30% or so safer than a human driver, (ii) complex software doesn't improve monotonically, so really you need that amount of testing per update to be confident that the next minor version of something "30% safer" hasn't introduced regression bugs which mean it is now a bit worse than the average driver, and (iii) performance in different road conditions is likely highly variable, such that it might be both 30% better overall and 30x as likely to cause an accident if not disengaged in a particular circumstance.

To make a valid assessment of the overall safety impact, you'd also have to factor in that (iv) the worst drivers who skew the stats are generally the ones least likely to buy it, and (v) if it's fully autonomous driving, road use would increase substantially, and whilst that may have other benefits, the likely outcome of a substantial increase in miles driven using tech that's only marginally better than a human is more humans dying on the road.
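To put a rough number on (i): a standard two-proportion sample-size approximation, with hypothetical placeholder rates (real-world human fatality rates are on the order of one per hundred million miles):

    # Rough sample size for detecting a 30% drop in a very rare event rate
    # (fatal errors per mile), via the usual two-proportion z-test
    # approximation: alpha = 0.05 two-sided, 80% power. Rates are assumed.
    from math import sqrt

    z_a, z_b = 1.96, 0.84         # critical values for alpha and power
    p1 = 1 / 100_000_000          # assumed human fatal-error rate per mile
    p2 = 0.7 * p1                 # hypothetical system that is 30% safer
    pbar = (p1 + p2) / 2

    # Miles needed *per group* for the difference to be detectable.
    n = (z_a * sqrt(2 * pbar * (1 - pbar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p1 - p2) ** 2

    print(f"~{n:.2e} miles per group")  # ~1.5e10: tens of billions of miles

Under these assumptions you'd need on the order of fifteen billion miles per arm, per software version, before a 30% difference becomes statistically detectable.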


Billions of miles are driven every day.


And yet the total mileage driven by a huge variety of autonomous systems over research programmes dating back a decade is of the order of 20-30 million. This disparity obviously supports my point about the difficulty of establishing statistically significant evidence that a particular software update on a particular platform is less lethal than the human driver [in a given set of circumstances] based on events which are very rare on a per mile basis, particularly if the baseline performance gap isn't that large.

The fact that overall road use is so high that a sufficiently bad regression bug in sufficiently widely-deployed software could rack up a massive body count within hours obviously doesn't make the case for introducing something believed to only be a marginal improvement any stronger.


> Your point about liability is valid

How is it valid? It's currently a solved problem: the driver of the car is still held liable because they are still, ultimately, driving the car. Teslas with Autopilot seem to make drivers safer, and acting like we don't know who is liable in the event of a crash is just a red herring.


I meant for fully autonomous systems, but yeah I agree with you.


It's very easy to stop these AIs cold. Make "I don't know" or "Insufficient information" a correct answer--those are not in the training set.

These AIs have "Male Answer Syndrome" to the nth degree. They will shamelessly give you an answer even if they have to completely make one up.


> Male Answer Syndrome

is that like mansplaining?

is botsplaining the new mansplaining?


That's just QA, and human work requires those same feedback loops. Feedback is the cornerstone of all value-delivery pipelines, really.



