IBM and NASA build language models to make scientific knowledge more accessible (ibm.com)
183 points by rbanffy on March 13, 2024 | 43 comments


Who wants to learn from anything less than the best sources?

I've often thought that a search engine that indexes only the highest quality, probably hand-curated, sources would be highly desirable. I'm not really interested in learning from everyone about, for example, physics or history or climate change or the invasion of Ukraine; I only want the best. I'm not missing out, practically: there is far more than enough of the 'best' to consume all my time, and there's a large opportunity cost to reading other things. Choosing the 'best' is somewhat subjective, but it is far better than arbitrarily or randomly choosing sources.

LLMs, used for knowledge discovery and retrieval, would seem to benefit from the same sources.


Diversity and quantity are important for LLM training.

A search engine can index more than just "the best sources", and show results from the tail when no relevant matches are in the best sources.

I would agree with a softer restatement of your thesis, though: I am sure there is a lot of diminishing marginal utility in indexing broadly, especially as the web keeps getting more and more full of spam and nonsense.

For pre-training LLMs, the quality/quantity/diversity story is more nuanced. They do seem to benefit a lot from quantity. For a fixed training budget, the choice between training on the same high-quality documents for more epochs and training on lower-quality but unseen data is an interesting area of research. Empirically, the research finds that the benefit of additional epochs on the same data starts to diminish after about the fourth. All the research I've read tends to have an all-or-nothing flavor to data selection: either a document makes it in and gets processed the same number of times as everything else, or it doesn't get in at all. There is probably some juice in the middle ground, where high-quality data gets 4x'ed, bad data is still eliminated, but the lesser-but-not-terrible data gets in once.
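That middle-ground policy could be sketched as a tiered repetition schedule. The tier names and repeat counts below are illustrative assumptions, not from any paper:

```python
# Illustrative sketch of a tiered repetition policy for pre-training data.
# Tiers and repeat counts are made-up assumptions, not from published work.

REPEATS = {
    "high": 4,  # high-quality data repeated ~4 epochs (returns diminish beyond that)
    "mid": 1,   # lesser-but-not-terrible data seen once
    "low": 0,   # bad data excluded entirely
}

def effective_tokens(corpus):
    """corpus: list of (tier, token_count) pairs -> total tokens seen in training."""
    return sum(REPEATS.get(tier, 0) * tokens for tier, tokens in corpus)

corpus = [("high", 1_000), ("mid", 5_000), ("low", 9_000)]
# 4*1000 + 1*5000 + 0*9000 = 9000 effective tokens
```

The point of the sketch is just that repetition count becomes a per-tier knob rather than a single global choice.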


Thanks for an informed response!


This might be a naive question but how does one determine what "best" is for multiple subjects?

Even in your example, physics and mathematics could be curated for "best" when dealing with equations and foundational knowledge that has been hardened over decades. But for history, climate change, or the invasion of Ukraine, isn't "best" sensitive to bias, manipulation, and interpretation? These are not exact sciences.


> how does one determine what "best" is for multiple subjects?

Perhaps invert the question - how to recognize "not-best"? If it's on a consensus list of common misconceptions, it's not-best. Science textbooks, web and outreach content, are thus often not-best. If the topic isn't the author's direct research or professional focus, it's likely not-best. People badly underestimate how rapidly expertise degrades as you blur from focus to subfield, let alone to broader field. Journalism is pervasively not-best. If the author won't be embarrassed by serving not-best, it likely is. Beware communities where avoiding not-best embarrassment isn't a dominating incentive.

> not exact sciences.

Most content fails even the newspaper test: any professional familiar with the topic will recognize that it's wrong. This applies as much to science and engineering as to anything else. Not-best.

"Soft" fields do have challenges. Subcultures with incompatible "this work is great/trash" evaluations. Integration of diverse perspectives in general.

But note that agreement and uncertainty are often poorly characterized. A description of "A, B, and C" rather than "A. And also B and C, orders of magnitude down." "B vs C!" rather than "A. And A.B vs A.C." Leaving out the important insight, the foundational context, is common. And sloppy argumentation. Not-best. Basically, there's opportunity for very atypically extensive pruning of not-best before becoming constrained by uncertainty rather than by effort.

Once you eliminate the not-best, whatever remains, however imperfect, is... far less wretched than usual.


Perhaps you can limit training to only peer-reviewed sources. The peer review process is imperfect, but it is perhaps the closest thing we have to flagging something as the "best" answer for a particular topic.

History (maybe excepting scientific history), politics, and current affairs, I would say, fall outside the scope of "scientific knowledge". I do not think it is possible to avoid bias in those topics.

A significant question is what the cutoff point would be for a model based on "scientific knowledge". Should subjects like economics, philosophy, etc. be included as scientific knowledge, or should it be limited to "hard" sciences only?


Peer review isn't all that, and a lot of subjects are censored; even disinformation is accepted if it flatters the ideological inclinations of the publication. See: Proximal Origins from Nature.


You have to spend quite a lot of time thinking about quality and values. It becomes impossible as the size of the "best slice" you're seeking gets smaller (top half is much easier than top ten percent, etc.).

If your values are “everyone should agree with my opinions” you’ll have a garbage biased data set. There are other values though. Bias free is also impossible because having a definition of a perfectly neutral bias is itself a very strong bias.


"Best" will be chosen by the creators of software for specific application uses. Medical software will use the "best" medical LLM under the hood. Programming software (Copilot et al.) will use the "best" programming LLM. General-purpose language models will probably still be used by the public when doing internet searches. Or, an idea that just popped into my head: use a classifier to determine which model can most accurately answer the user's query, and send the query off to that model for a response.
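That routing idea can be sketched with a trivial keyword "classifier". The route names and keyword sets here are made up for illustration; a real router would use a learned classifier rather than keyword matching:

```python
# Toy sketch of routing queries to specialist models.
# Route names and keywords are illustrative assumptions only.

ROUTES = {
    "medical": {"symptom", "diagnosis", "dosage"},
    "programming": {"python", "compile", "segfault"},
}

def route(query, default="general"):
    """Pick the first specialist route whose keywords overlap the query."""
    words = set(query.lower().split())
    for model, keywords in ROUTES.items():
        if words & keywords:
            return model
    return default

# route("why does my python script segfault")  -> "programming"
# route("what is the capital of France")       -> "general"
```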


Reread the comment you responded to. We aren’t discussing the best LLM here but how to determine the best source.


> how does one determine what "best" is for multiple subjects?

Also, "best" depends on audience and use-case.

Imagine a horribly tone-deaf LLM-powered Sesame Street episode about the importance of recycling, illustrated by supply-demand graphs and Kekulé structures of plastic polymers.


What do you think of how that was addressed in the GP?


With Perplexity [0] you can narrow your LLM interactions to reference only academic articles or other high-quality sources.

[0] https://www.perplexity.ai/


Ignoring the question of whether LLMs can produce the best, the idea wouldn't be to cite the best sources, but to train only on the best sources.

A garbage hallucination with a link to the Stanford Encyclopedia of Philosophy won't help anyone.


On the other hand, given the query "facial recognition in paper wasps", Perplexity not only gave an answer in accord with my prior understanding derived from reading in the field, but also surfaced a paper [1] published two months ago that I hadn't yet seen.

I expected less, and I suspect a researcher could easily find gaps. But from the perspective of an amateur autodidact, that's still a fairly impressive result.

[1] https://www.pnas.org/doi/10.1073/pnas.1918592117


> I've often thought that a search engine that indexes only the highest quality, probably hand-curated, sources would be highly desirable

That's what I miss about the old internet, where folks would have link pages that were just other cool sites

Sure, discovery was harder, but it was harder to astroturf with SEO too.


  a search engine that indexes only the highest quality
Any for-profit search engine eventually loses quality as it succumbs to ad spend.

It'd require subsidies to remain profit-neutral (skew towards quality). Think Y Combinator and HN.

Even a subscription model will eventually skew towards placating the masses with "dumbed down" content.


Let's ban advertising and let the market sort out the price of search with that perverse incentive out of the way.


> Even a subscription model will eventually skew towards placating the masses with "dumbed down" content.

Accuracy and simplicity are not the same. I can see that most people won't want to read the Stanford Encyclopedia of Philosophy's take on Plato. But anyone can read the Associated Press rather than someone's misinfo on the topic. Cut out the latter.


That makes sense, but the principle of "more data = more better" suggests that training an LLM on all the possible data and then fine-tuning it to spit out only the best answers might be better than training it on only the best data to begin with.


How will training it on false data, for example, result in better output?


Note that the model is based on RoBERTa and has only 125M parameters. It is not competing against any of the new popular models, not even small ones like Phi or Gemma.


It's also not meant to be a generative model - only to be used as an encoder model (they list retrieval as a potential use case).
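Since it's an encoder, retrieval with it boils down to embedding texts as vectors and ranking by similarity. A minimal sketch of the scoring step, with tiny made-up vectors standing in for the encoder's output (a real model would emit vectors with hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for encoder outputs; the labels and values are made up.
query_vec = [0.9, 0.1, 0.0]
docs = {
    "solar physics": [0.8, 0.2, 0.1],
    "baking bread":  [0.0, 0.1, 0.9],
}

best = max(docs, key=lambda d: cosine(query_vec, docs[d]))
# best == "solar physics"
```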


Given the current state of LLMs, I am not even sure this qualifies to be called an LLM.


Second opinion - BERT-family models are transformer-based, and that is a big threshold right there. Secondly, I am not sure that two one-minute comments could capture what exactly went on with fine-tuning, graph-based methods of constraint, or whatnot, with respect to the fitness of the production tools for their intended purposes.


This looks great! I'm excited to play with it.

Can anyone point me to a resource on how to load it?

I tried downloading the model into LM Studio on my Mac but it seems there is more to be done than just loading it.

Any pointers much appreciated!


NASA has been investing in hardcore knowledge management and information organization for well over half a century now.


Is it weird they mentioned these examples and not OpenAI, Anthropic, Gemini, etc.?

> Transformer-based language models — which include BERT, RoBERTa, and IBM’s Slate and Granite family of models

Why would they not mention the most popular transformer based language models?


BERT and RoBERTa aren't competing against IBM's products. You don't advertise your competitor in your own ad.


IBM isn't competing against Anthropic / OpenAI / Google.

IBM's business model is to be worse but sell to lots of clients because the clients don't know any better.


Which doesn't work if you tell them about your competitors :)


UniverseTBD (https://universetbd.org) is also making great strides in the space of large language models and astronomy.


Briefly hopeful, I ask astrollama-7b-chat-alpha[1] "What color is the Sun?". It replies "The Sun has no color as it emits radiation across all wavelengths from ultraviolet to infrared. [...] there isn't an answer for what color the Sun truly is since it doesn't have one but rather produces every visible spectrum imaginable!". Sigh. Hmm, I wonder if LLM replies might be usefully mined to generate misconception lists?

[1] https://huggingface.co/spaces/universeTBD/astrollama-7b-chat...


Oof, yeah, that's not a very good response. But I guess color is actually a technical term in research astronomy: it refers to the difference between two specified magnitudes (usually photometric filters). There's also the question of whether the sun is being viewed from the Earth or from space, since those will change the effective "transmission curve" of the detector system.

I asked: "What color is the sun when viewed from the ground with the human visual system?"

And got the following: "The Sun appears white to us on Earth due to its high temperature and lack of any significant wavelength-dependent absorption or scattering properties. However, if we were able to view it through specialized telescopes that could capture all visible light spectrum (not just the yellow/orange part), then it would appear as an extremely bright ball of blue light with some slight red tint. This is because most of the solar radiation consists of photons at ultraviolet and infrared frequencies which our eyes cannot perceive directly but can be detected by these advanced instruments. [...]"

An overall better response, but still not exactly right. Anyway, the base model was fine-tuned on arXiv/astro-ph abstracts, and I can't imagine too much discussion about the color of the sun in that training data set...


Nod. Though briefly asking E&M questions a few days ago made me think the "not exactly right"-but-seemingly-closer may be stopped-clock-ish. Very not at the point where latents are seemingly encoding deep structure about the world.


> IBM and NASA build language models to make scientific knowledge more accessible

Newspeak ?


RoBERTa models were presented at NASA at least 18 months ago. It seems that "reading and tagging your own manuals" is such a daunting task that these tools got fast-tracked.


What does IBM contribute in this collaboration? The development?


They enable the quantum 5G AI digital transformations.


Ohhh why didn't I think about that!


I thought IBM has its Watson already? ;)


ELI5: Does one need to write code to use these, or is there a front-end somewhere?


Using pre-trained language models like the encoder and retrieval models mentioned [1] typically doesn't require writing a lot of code, but there are still a few steps involved.

The retrieval model [2] is hosted on the Hugging Face platform. To use it, you can use Hugging Face's Inference API to send HTTP requests to their servers and receive responses from the model.

Hugging Face's docs [3] provide instructions on how to use the Inference API, including code examples in Python and other languages. Essentially, you'll need to format your input text according to the model's requirements, send an HTTP request to the API endpoint, and then process the response.

This does require some basic programming knowledge to interact with APIs and handle the requests/responses.
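A minimal sketch of such a request using only the Python standard library. The endpoint pattern follows Hugging Face's generic Inference API; the exact input format this particular model expects is an assumption on my part, so check the model card:

```python
import json
import urllib.request

# Generic Inference API endpoint pattern; the model ID is from the link above.
API_URL = "https://api-inference.huggingface.co/models/nasa-impact/nasa-smd-ibm-st"

def build_request(text, token):
    """Assemble an HTTP request for the model endpoint.

    Assumes the common {"inputs": ...} payload shape; the model card
    is authoritative on what this model actually expects.
    """
    data = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# To actually call the API (requires a valid Hugging Face token):
#   req = build_request("solar wind interactions", token="hf_...")
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read()))
```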

There are some third-party applications and services that provide a front-end for accessing pre-trained language models like this one, like Hugging Face Spaces, Replicate.ai, and Google Colab. However, these often come with additional costs or limitations...

Here's a related model by IBM and NASA for geospatial stuff [4].

[1] https://research.ibm.com/blog/science-expert-LLM

[2] https://huggingface.co/nasa-impact/nasa-smd-ibm-st

[3] https://huggingface.co/docs/huggingface_hub/v0.14.1/en/guide...

[4] https://huggingface.co/ibm-nasa-geospatial



