Clinical pharmacist for 10 years here. Yeah, the base model is very good. Better than first-year residents - but not necessarily better than experienced clinicians.
Now - throw a bunch of clinical guidelines into a vector database and give it that context, and it's 10x better than me, any doctor outside their speciality, or any of the mid-levels. (I.e., it's better than a cardiologist doing infectious disease - but not a cardiologist doing cardiology.) This is because, as you specialize, there is very niche stuff that only about 5 doctors in the whole world see on a consistent basis (and they don't blog!)
I trained it on the IDSA guidelines (infectious disease) and put up a proof of concept at GalenAI.co - just as a way to start talking to health systems and clinicians. It's going to be a very different world in medicine a couple of years from now!!
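For anyone wondering what "guidelines in a vector database" looks like mechanically, here's a minimal sketch, assuming the sentence-transformers package for embeddings; the guideline snippets and the downstream LLM call are hypothetical placeholders, not GalenAI's actual pipeline.

```python
# Minimal sketch of retrieval-augmented prompting over guideline text.
# Assumes the sentence-transformers package; guideline_chunks and the final
# LLM call are hypothetical stand-ins, not any real product's pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Pretend these are paragraphs extracted from clinical guidelines.
guideline_chunks = [
    "Empiric therapy for community-acquired pneumonia in healthy outpatients ...",
    "Duration of therapy for uncomplicated cystitis ...",
    "Vancomycin dosing should target an AUC/MIC ratio of ...",
]
chunk_vectors = embedder.encode(guideline_chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k guideline chunks most similar to the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q          # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [guideline_chunks[i] for i in top]

def build_prompt(question: str) -> str:
    """Stuff the retrieved guideline text into the model's context window."""
    context = "\n\n".join(retrieve(question))
    return (
        "Answer using ONLY the guideline excerpts below. "
        "If they do not cover the question, say so.\n\n"
        f"Guidelines:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How long should uncomplicated cystitis be treated?"))
# The resulting prompt would then be sent to the model of your choice.
```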
For some context, the USMLE is taken during medical school. The amount I have learned about actually practicing medicine since graduating is probably an order of magnitude more than everything I learned in medical school! I still learn stuff, all the time, and I’m not just talking about new research.
So, while impressive and clearly part of the future world, we shouldn’t get too far ahead of ourselves with the current models.
Edit: oh, I should add that there are more clinically relevant exams that would be more likely to reveal clinical usefulness, for example "board" exams. These are taken after training, usually right before practice. Not knocking LLMs, just making sure people don't misread passing the USMLE as being clinically useful.
I agree that we shouldn't get ahead of ourselves with the current technology, but what you said applies to practically every industry and science. What you learn on the actual job is always far more up to date than what you learn in school, whether you're an engineer, a doctor, or just a lowly programmer.
Yes, but the difference is that most engineers, and pretty much all lowly programmers, are unlicensed. An AI or some other non-human option accessible in the countryside, so you don't have to order questionable "fish" antibiotics or "cat" anti-parasitics, would be a nice step up from the current gatekeeping of medicine away from people with limited access.
The only significant ways I'm aware of that people get needed Rx medications outside of physicians and mid-level practitioners are to leave the country or use "vet"/"animal" drugs. What simpler alternatives are currently available?
The medical cartel loves to cloak their policies under the auspices of safety. In the end you'll find their policies magically result in massive profits for bureaucrats and chokepoints that constrain the supply of gatekeepers. This is not accidental.
Incentives. Plenty of people would go into rural medicine if it paid better and didn't mean dealing with almost nothing but elderly Medicare patients, a consequence of the lack of health insurance for people just marginally above the Medicaid line. Of course this means there's basically no support currently for these hospitals, and more than half are being bought up by private equity anyway and managed down to the smallest possible bottom line.
What if the AI is trained on board exams and other high signal testing/examination materials? Surely it will become superhuman in its medical abilities?
I've never experienced anything in my care with doctors that I couldn't understand with a day's worth of research into resources like UpToDate. It isn't complicated. It is largely memorization and application of an algorithm, which is borderline useless for the complex conditions emerging today.
Have you ever had anything besides a bad cold? Tell me you understand every acronym in this article and could explain it succinctly to a patient, much less hand the case off to "another" physician: https://www.nejm.org/doi/full/10.1056/NEJMoa2206714
Doctors and nurses have saved me many times from some very close calls because of decades of experience, training and intuition. That is of course not to mention the friend who beat a deep brain tumor on their brain stem that everyone else told them was inoperable, and now are in medical school themselves for neurology. No LLM is going to pull that out of itself, possibly ever, and certainly not GPT-4 (no one else had ever had the surgery done before, it was novel).
How do you know that you actually understand a topic as well as a doctor? How do you verify that? It's not unusual for people to think they comprehend a topic at expert level, when in fact they do not. The correlation between confidence and understanding is not a reliable measure. That's why doctors are trained by more expert colleagues who can judge their true understanding, have to take exams, etc.
I think most people with complicated chronic diseases for more than a few years end up knowing more than most doctors about their condition and related conditions. Doctors are more breadth than depth. But the problem is that depth is what is absolutely necessary in these situations. But there is a lack of that among specialists too, or at least they are not willing to go outside of insurance mandated covered procedures and testing and it creates a really useless and frustrating scenario for the patients.
Doesn't matter how much GI doctors know when all they do is scope you. Sure, scoping is really going to help people with atypical intolerances, IBS, and any number of modern chronic conditions for which treatments are inadequate. They have to do better!
Reminder that these models have no concept of truth, no abstract reasoning abilities, and there's no guarantee that the plausible-looking text they spit out isn't complete bullshit.
Their output is impressive considering that they're incapable of reasoning, but it desperately needs to be sanity-checked by someone with relevant knowledge before using it for anything serious.
I wonder how many people who think these models have no reasoning abilities have actually seen GPT-4. There's clearly some form of reasoning going on in GPT-4.
There are situations where a human can reason better. But these are usually identified as bugs and will probably just be fixed later.
Yes, they're basically autocomplete machines, but a human is a bundle of neurons optimized for survival anyway. And humans can become irrational about things that threaten survival. The classic medical example is the drama that erupted when one doctor suggested that washing hands would reduce mortality rates.
Humans and LLM-based AI evolve rationality based around different factors and form different "bugs" and I love that they're capable of fact checking each other now.
Just because it produces sequences of tokens that we typically experience as evidence of reasoning does not mean that it reasoned to produce them. We know -- generally -- the process by which those tokens were produced. It is not deterministic, but its properties are understood. It's a language model! It is not a mind.
Real-time goalpost shifting and mental gymnastics are truly a sight to behold. No, the output isn't perfect and should be double-checked (as in any professional domain: do you think what an engineer with decades of experience signs off on hasn't been checked multiple times by multiple parties? Do you think a professional programmer writes code and... just deploys it?)
but LLMs reason just fine. If you have anything to say about that other than the nonsensical "it's not true understanding just because," I'd love to hear it.
This line of reasoning sort of goes both ways. Advocates of the technology claiming that humans reason no differently, and that LLMs given the right data will be better than everyone at everything, is just as hokey.
I didn't say anything about reasoning no differently.
Suppose you have 2 equations. You don't know what these equations are. However, you know that for any input, the output is the same.
Any mathematician worth his salt will tell you that given said information, those 2 equations are equal or equivalent.
It doesn't actually matter what those equations are.
Equation 1 could be say a+b
and Equation 2 could be (a-5) + (b-5) + 10
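Worked out, the two toy equations above really are the same function of a and b (simple algebra, nothing hidden here):

```latex
(a - 5) + (b - 5) + 10 = a + b - 10 + 10 = a + b
```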
Or more realistically, both equations could look vastly different.
The point I'm driving home here is that true distinction reveals itself in results.
The fallacy of the philosophical zombie is that there is this supposed important distinction between "true understanding" and "fake/mimicry/whatever understanding" and yet you can't actually test for it. You can't show this supposed huge difference. A distinction that can't be tested for is not a distinction.
This analogy does not make sense to me. We do not have equivalence between all of the infinite inputs and outputs here, we have equivalence in a finite number of cases and known cases where the two functions (human output and llm output) diverge drastically. Any mathematician worth their salt would tell you these functions are definitely not equal.
Now you could make the argument that these functions are close enough most of the time that it won't matter, but unless you want to get really rigorous, that's more of a stats/engineering perspective than mathematics. And more importantly, that's very much up for debate, especially in a high-pressure situation like medicine.
Of course these models are wild and I'm quite impressed with them. I still can be worried about the damage someone who doesn't think things through could cause by assuming GPT-4 has human or super human level intelligence in a niche, high impact field.
The point I'm making is that I can quite clearly show output that demonstrates reasoning and understanding by any actual metric. That's not the problem. The problem is that when I do, the argument quickly shifts to "it's not real understanding!".
That is what is nonsensical here. It's actually nonsensical whatever domain you want to think about it in.
Either your dog fetches what you throw at it or it doesn't. The idea of "pretend fetching" as any meaningful distinction is incredibly silly and makes no sense.
If you want to tell me there's a special distinction cool but when you can't show me what that distinction is, how to test for it, the qualitative or quantitative differences then I'm going to throw your argument away because it's not a valid one. It's just an arbitrary line drawn on sand.
There is a lot of research that shows they have trouble reasoning. Folks working with LLMs and building them agree that they can't reason, yet pop-sci folks religiously insist they can reason.
Moreover, "have trouble reasoning" and "can't reason at all" are two very different statements. Not that i particularly agree with either but you should keep that in mind.
I'm not the person you responded to, but that's an old paper. LLMs have come a long way since. Nothing conclusive but GPT4 does show some signs of something deeper. https://arxiv.org/pdf/2303.12712.pdf
Yes, I have read that paper and have forwarded it eagerly to colleagues. Respectfully, please read the paper you have linked, especially section 8.2.
Look at section 8.2 in the paper you linked:
"8.2 Lack of planning in arithmetic/reasoning problems"
> However, it seems that the autoregressive nature of the model which forces it to solve problems in a sequential fashion sometimes poses a more profound difficulty that cannot be remedied simply by instructing the model to find a step by step solution
Nowhere in the paper is an evaluation on standard reasoning datasets (https://tptp.org) because they are much much much tougher than the simple examples that GPT-4 struggles with.
LLMs are a huge breakthrough but let us not muddy the science.
GPT-4 was evaluated on multiple reasoning benchmarks in the release paper.[1] The problems in TPTP are not really appropriate for a language model and exceed the capabilities of most humans working without tools. Clearly humans have some reasoning ability even if they cannot solve those problems.
> Reminder that these models have no concept of truth, no abstract reasoning abilities
This has positive potential too - bias prevention. Give it an ultrasound and it'll do its work regardless of the patient.
None of this is guaranteed (there could be training pitfalls, i.e. maybe patient data is fed to AI, including race/sex, and if training data has bias toward non-white males then so will AI), but it's a potentially positive aspect to explore!
I think most, if not all, training data has bias. Removing bias from training data is a challenge in and of itself that I don't think we've solved yet. I worry that there isn't enough incentive there for us to solve it and some minorities will be left behind.
Did you mean to write bias can have a positive impact, too?
Any model or human will suffer from some degree of uncertainty. You can never say anything with 100% certainty.
If you know a patient had cancer before, the chance that they have cancer again is much higher usually than the general population. This is very valuable for diagnosis.
It's good to have a prior/bias to check more closely here and err on the side of caution.
It's good to check for the more likely explanations first when you have multiple competing hypotheses and limited capacity for testing.
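As a concrete illustration with made-up numbers, Bayes' rule shows how much the prior (history of cancer vs. general population) moves the post-test probability; the sensitivity and false-positive rate below are purely illustrative:

```latex
% Illustrative numbers only: test sensitivity 0.90, false-positive rate 0.05.
P(\text{cancer} \mid +) =
  \frac{P(+ \mid \text{cancer})\,P(\text{cancer})}
       {P(+ \mid \text{cancer})\,P(\text{cancer}) + P(+ \mid \text{no cancer})\,P(\text{no cancer})}

% Prior 1%  (general population):   0.9 \cdot 0.01 / (0.9 \cdot 0.01 + 0.05 \cdot 0.99) \approx 0.15
% Prior 20% (prior cancer history): 0.9 \cdot 0.20 / (0.9 \cdot 0.20 + 0.05 \cdot 0.80) \approx 0.82
```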
That's not bias, it's use of relevant context. If you're going to have an LLM doing diagnoses, you're definitely going to feed it that sort of context.
Do you just tell the AI "this person had cancer?", or "this person went into remission 1 year ago?" or "this person was treated with 4 lines of chemo, partial remission for 1 line, non-response for 2nd line, strong response for cycles 1-3 for 3rd line, but then discontinuation due to tolerability, then 4th line complete remission after 6 cycles plus two courses of radiotherapy"?
Theoretically you could do this, but capturing all this and correctly interpreting it is far harder for AI than for a human who has been directing treatment the entire time.
Meanwhile, from late childhood to mid-adulthood, 99% of my visits have been a doctor listening to me for 2 minutes tops while applying what amounts to a flow chart for how to diagnose and treat something like strep. Yes, a doctor is best, but remember that in many places we're competing with "no access to a doctor, so just pray it doesn't turn into rheumatic fever."
Very often asking the _right_ questions is the hardest part, especially in non-typical cases. Statistical ML models tend to do well in the high-probability regions of data that are densely sampled in training and are less good at dealing with outlier cases. I am curious to see how GPT-4 deals with hard atypical cases.
Except we have very powerful reasoning systems at the moment. Attaching a fully trained LLM for a speciality, outputting in structured form into a semantic reasoning system, and feeding the results back through the LLM in a feedback cycle should control for hallucinations. It's pretty clear to all but the most cynical that these LLMs are producing amazing results regardless of whether "reasoning" is happening. The degree to which they produce insightful, cogent, and complex synthesis of concepts in a rational-seeming form is startling.
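A rough sketch of that feedback cycle, with stub functions standing in for the LLM calls and the semantic reasoner (all names here are hypothetical, not a real API):

```python
# Hypothetical sketch of an LLM <-> semantic-reasoner feedback loop for
# hallucination control. The three helpers are stubs, not real components.

def llm_extract(question: str) -> dict:
    # Stand-in for an LLM call returning a structured draft answer.
    return {"question": question, "claims": ["drug X, dose Y"], "sources": []}

def reasoner_check(draft: dict) -> list[str]:
    # Stand-in for a rule/ontology engine; flags claims without cited sources.
    return [c for c in draft["claims"] if not draft["sources"]]

def llm_revise(draft: dict, findings: list[str]) -> dict:
    # Stand-in for an LLM call revising the draft given the flagged issues.
    draft["sources"] = ["guideline-123"]   # pretend support was found this round
    return draft

def answer_with_checking(question: str, max_rounds: int = 3) -> dict:
    draft = llm_extract(question)
    for _ in range(max_rounds):
        findings = reasoner_check(draft)
        if not findings:                   # checker satisfied: accept the draft
            return draft
        draft = llm_revise(draft, findings)
    return {"error": "could not resolve flagged claims", "flags": findings}

print(answer_with_checking("empiric therapy for community-acquired pneumonia?"))
```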
I’d challenge anyone to demonstrate humans don’t “reason” in a similar way. Are we really anything more than expectation engines powered by gradient descent expectation juice? How come our reasoning fails so often and we have to be taught logic through reinforcement learning? Why do we fall for fallacies so easily when we “know” about them and can “reason” so well? I will wager $5 that as we build ensemble models that can leverage the reasoning AI tech of the last 80 years in a feedback cycle with LLMs and other generative AI models we will be staring at intelligence that far surpasses our own, and perhaps be so alien we can’t recognize it because it actually is backed by reasoning - unlike ours.
The question isn't really whether someone who built it is aware of the limitations, but whether the users of the thing that someone built are aware of them.
In one study (see reference), users were able to identify the AI about two-thirds of the time (and the same for the human experts). So a third of the time the AI is not detected and may well be spouting nonsense.
It does in fact create some sort of model out of those billions of dimensions, and it's able to re-use it across many different fields, too.
It's not "bottom-up reasoning" so it misses a bunch of little details.
Basically, AI is a search through a vast space, and most of the recent approaches have focused on top-down search. Reasoning bottom-up is what alpha-beta search did (Rybka and all those other chess programs that calculated all possible combinations 10 moves ahead). Those missed some of the "top-down positional" stuff, whereas AlphaGo missed some weird-ass edge cases.
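For reference, the "bottom-up" calculation being contrasted here is plain minimax with alpha-beta pruning, roughly like this toy sketch (game-specific move generation and evaluation are left as caller-supplied callbacks; this is not any real engine's code):

```python
# Toy minimax with alpha-beta pruning: the exhaustive "bottom-up" search that
# classic chess engines used, as opposed to learned "top-down" evaluation.

def alphabeta(state, depth, alpha, beta, maximizing, moves, apply, evaluate):
    """moves/apply/evaluate are game-specific callbacks supplied by the caller."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)           # static evaluation at the leaves
    if maximizing:
        best = float("-inf")
        for m in legal:
            best = max(best, alphabeta(apply(state, m), depth - 1,
                                       alpha, beta, False, moves, apply, evaluate))
            alpha = max(alpha, best)
            if beta <= alpha:            # opponent would never allow this line
                break
        return best
    else:
        best = float("inf")
        for m in legal:
            best = min(best, alphabeta(apply(state, m), depth - 1,
                                       alpha, beta, True, moves, apply, evaluate))
            beta = min(beta, best)
            if beta <= alpha:
                break
        return best
```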
I experimented with asking the models to rewrite text using odd and specific rules - “all sentences must start with the letter b and end with the letter d.”
It was pretty good, but sometimes it wouldn’t even try. It would spit out a sentence that could have been easily rewritten to start with b and end with d - but it just didn’t.
These sorts of hallucinations are a big problem in high-stakes practice.
LLMs like GPT4 operate on tokens, not characters, so they're operating with a severe handicap when playing those sorts of games -- they see a word like "robot" as a single token, not as a collection of letters, so they don't know that it starts with the letter R unless that fact appeared in its training material.
Interestingly, they do better at rhyming games, because those are based on associations between tokens which are easier to infer from usage in poetry, or by reading rhyming dictionaries.
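You can see the token-level view directly with OpenAI's tiktoken library; here's a small sketch (the exact splits depend on the tokenizer and on leading whitespace, so treat the printed output as illustrative):

```python
# Inspect how GPT-style models "see" text: as token IDs, not letters.
# Requires the tiktoken package; exact splits vary by tokenizer and whitespace.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by GPT-4-era models

for word in ["robot", " robot", "rhyme", "bd"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:10} -> {len(ids)} token(s): {pieces}")

# A common word is often a single token, so the model never directly "sees" its
# first or last letter unless it learned that fact from training text.
```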
GPT-4 can do some reasoning, just not at a smart/trained human level. You cannot explain its results in USMLE, Uniform Bar Exam, and several other tests otherwise. This is a major improvement from earlier models.
This assumes there is not too much data contamination, but other people’s experiences with it suggest that the capability is real even if it might be a bit below what the test performance suggests. The latter is also explainable by a broader distribution of real world problems relative to tests.
GalenAI looks very interesting -- definitely will be taking a closer look! :)
>> Now - throw a bunch of clinical guidelines into a vector database and give it that context
I built MedQA (https://labs.cactiml.com/medqa) as a way to explore how GPT could be used through providing clinical guidelines as context + pretty extensive prompt engineering. There are definitely limitations and the accuracy/constraining "hallucinations" has to meet a very high bar, but I've found it interesting-to-helpful several times while on rounds at the hospital as a med student.
Some functionality made possible through GPT that I am excited to explore further:
This makes me think we need some kind of program for experts to start writing things down in a way which is helpful. Even just take dictation and transcribe it.
You should take a quick look at EPIC. They dominate the electronic health record space, and a ton of health systems use it. You will know if your doctor's office uses an EHR application, because they will be typing notes into it for the majority of your visit. I have not been too excited about the amount of time that physicians spend on EHR systems, but I am hopeful that feeding the data they input (along with blood work and other test results) into models like this will make everything more accurate, fast, and effective.
EPIC unfortunately is all the bad things about Google, and none of the good.
Unable to ship anything, protecting their margin > helping users solve problems, a monopoly, locked-up distribution so no one else can innovate.
Honestly, my bear case for AI in medicine is Epic picking up the phone and telling health-systems not to buy anything because they are working on something for them for free. (Which would be some note completion BS stuff, rather than actual clinical support that helps patients and cuts costs). They may be doing this already.
I spent a few weeks at an MGH-affiliated hospital that had, since my last stay, begun using Epic, and from what I could tell all it did was muddy things up. The staff all seemed to fumble through the interface, even those who had spent years with it. From the nurses to the medical director of this particular program, everyone was always complaining about using the software.
As a patient I never really set eyes on the interface, but there seems to be a UX nightmare afoot. Once I was unintentionally dispensed 3x the intended dose of a stimulant medication due to a "default dose" feature in the interface that my physician admitted to accidentally submitting.
Training a model on an EHR is worse than nothing. Epic allows infinite customization, and customers build up their own informal standards such that you can’t dump and compare data across multiple sites.
While it's not easy or simple for every facility, in general it seems to be possible to pull whatever data you want from Epic and other EHRs. There might be a fee, work order, and vendor involved, but if you want a 100GB CSV containing certain columns, it's generally possible.
Of course matching that data up with sets from other locations will still involve someone in the middle gluing it all together.
By any reasonable standard, a “clinical pharmacist for 10 years” should understand that there are circumstances where there is a correct answer, and incorrect answers are potentially catastrophic.
You, clearly, are either unable to recognize that or are unable to recognize that LLMs are dangerously useless in such contexts. Either you already know the correct answer, making this useless, or you are not competent to use its answers because you can't tell whether it's about to kill someone with plausible nonsense.
And if a “clinical pharmacist for 10 years” can lose sight of this really fundamental issue, this whole thing needs to be halted. Governmentally if necessary.
And it won’t be.
It’s bad enough that “bad actors” will leverage this for their purposes, but the presumably competent simply turning off their brains over this is the most perplexing glitch ever to be on public display.
"Either you already know the correct answer, making this useless, or you are not competent to utilize its answers because you can’t tell whether it’s about to kill someone with plausible nonsense"
Evaluating if an answer is correct or not is easier than coming up with a correct answer from scratch.
NP != P.
If you don't understand this basic fact, then you are not competent enough to comment on AI.
A few of the comments in this thread seem to be misusing mathematics in order to lend more credence to themselves. At the risk of responding to low quality flamebait here are some problems with your statements.
1. P = NP refers to two very specific sets of problems (which might actually be the same set), not any general question. There are problems that we know don't fall into P or NP (for example, the Halting Problem). Also, whether or not P = NP is an open question, almost the opposite of a fact. (A formal statement of the verifier definition of NP is sketched below, after point 2.)
2. You claim:
"Evaluating if an answer is correct or not is easier than coming up with a correct answer from scratch." This is the right idea but not quite correct.
The correct statement is as follows:
"Evaluating if an answer is correct is not harder than the difficulty of coming up with a correct answer from scratch."
This is because evaluating some answer can still be just as hard as the original problem. In fact sometimes it's uncomputable (if the original problem is also uncomputable). To use an example from above consider the question: "Does a program x halt?" If I tell you "no" it could be impossible to verify my answer unless you have solved the halting problem.
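For reference, the formal verifier characterization of NP alluded to in point 1:

```latex
% Verifier characterization of NP: a language L is in NP iff there exist a
% polynomial-time verifier V and a polynomial p such that
x \in L \iff \exists w,\; |w| \le p(|x|) \text{ and } V(x, w) = 1
% This is a statement about one specific class of problems; it is not a general
% law that checking an answer is always easier than producing one.
```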
To bring this back to reality, again if GPT-4 is wrong about some complex medical question it doesn't mean it's mathematically easier to figure that out than solving the problem from scratch.
GalenAI is such an awesome idea, and I'm genuinely rooting for its success! I just wanted to point out a small typo in one of the headings: "So how does Galen really works?" might be better as "So how does Galen really work?". Having a native speaker give your copy a quick check could be a helpful way to ensure it comes across as polished and professional. :)
> This is because, as you specialize, there is very niche stuff that only about 5 doctors in the whole world see on a consistent basis (and they don't blog!)
We need to do a better job with sharing niche information as a society! Imagine all the benefits for medicine that can come from a model knowing about ultra rare and niche occurrences happening all over the world.
And therein lies a problem. In a world with billions of people, or even a country of 300 million - how could someone possibly get diagnosed when their problem needs to be put in front of one of five doctors!