IBM and NASA build language models to make scientific knowledge more accessible (ibm.com)
183 points by rbanffy on March 13, 2024 | 43 comments


Who wants to learn from anything less than the best sources?

I've often thought that a search engine that indexes only the highest quality, probably hand-curated, sources would be highly desirable. I'm not really interested in learning from everyone about, for example, physics or history or climate change or the invasion of Ukraine; I only want the best. I'm not missing out, practically: there is far more than enough of the 'best' to consume all my time, and there's a large opportunity cost to reading other things. Choosing the 'best' is somewhat subjective, but it is far better than arbitrarily or randomly choosing sources.

LLMs, used for knowledge discovery and retrieval, would seem to benefit from the same sources.


Diversity and quantity are important for LLM training.

A search engine can index more than just "the best sources", and show results from the tail when no relevant matches are in the best sources.

I would agree with a softer restatement of your thesis, though: I am sure there is a lot of diminishing marginal utility in indexing broadly, especially as the web keeps getting more and more full of spam and nonsense.

For pre-training LLMs, the quality/quantity/diversity story is more nuanced. They do seem to benefit a lot from quantity. For a fixed training budget, the choice between training on the same high-quality documents for more epochs and training on lower-quality but unseen data is an interesting area of research. Empirically, the research finds that the benefit of additional epochs on the same data starts to diminish after about the fourth. All the research I've read tends to have an all-or-nothing flavor to data selection: either a document makes it in and gets processed the same number of times as everything else, or it doesn't get in at all. There is probably some juice in the middle ground, where high-quality data gets 4x'ed, bad data is still eliminated, but the lesser-but-not-terrible data gets in once.
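That middle-ground policy could be sketched as a tiered repetition schedule. The tier names and repeat counts below are illustrative assumptions, not from any paper:

```python
# Illustrative sketch of a tiered repetition policy for pre-training data.
# Tiers and repeat counts are made-up assumptions, not from published work.

REPEATS = {
    "high": 4,  # high-quality data repeated ~4 epochs (returns diminish beyond that)
    "mid": 1,   # lesser-but-not-terrible data seen once
    "low": 0,   # bad data excluded entirely
}

def effective_tokens(corpus):
    """corpus: list of (tier, token_count) pairs -> total tokens seen in training."""
    return sum(REPEATS.get(tier, 0) * tokens for tier, tokens in corpus)

corpus = [("high", 1_000), ("mid", 5_000), ("low", 9_000)]
# 4*1000 + 1*5000 + 0*9000 = 9000 effective tokens
```

The point of the sketch is just that repetition count becomes a per-tier knob rather than a single global choice.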


Thanks for an informed response!


This might be a naive question but how does one determine what "best" is for multiple subjects?

Even in your example, physics and mathematics could be curated for "best" when dealing with equations and foundational knowledge that has been hardened over decades. But for history, climate change, or the invasion of Ukraine, isn't "best" sensitive to bias, manipulation, and interpretation? These are not exact sciences.


> how does one determine what "best" is for multiple subjects?

Perhaps invert the question - how to recognize "not-best"? If it's on a consensus list of common misconceptions, it's not-best. Science textbooks, web and outreach content, are thus often not-best. If the topic isn't the author's direct research or professional focus, it's likely not-best. People badly underestimate how rapidly expertise degrades as you blur from focus to subfield, let alone to broader field. Journalism is pervasively not-best. If the author won't be embarrassed by serving not-best, it likely is. Beware communities where avoiding not-best embarrassment isn't a dominating incentive.

> not exact sciences.

Most content fails even the newspaper test: any professional familiar with the topic will recognize that it's wrong. This applies as much to science and engineering as to anything else. Not-best.

"Soft" fields do have challenges. Subcultures with incompatible "this work is great/trash" evaluations. Integration of diverse perspectives in general.

But note that agreement and uncertainty are often poorly characterized. A description of "A, B, and C" rather than "A. And also B and C, orders of magnitude down." "B vs C!" rather than "A. And A.B vs A.C." Leaving out the important insight, the foundational context, is common. And sloppy argumentation. Not-best. Basically, there's opportunity for very atypically extensive pruning of not-best before becoming constrained by uncertainty rather than by effort.

Once you eliminate the not-best, whatever remains, however imperfect, is... far less wretched than usual.


Perhaps you can limit training to only peer-reviewed sources. The peer review process is imperfect, but it is perhaps the closest thing we have to flagging something as the "best" answer for a particular topic.

History (maybe excepting scientific history), politics, and current affairs, I would say, fall outside the scope of "scientific knowledge". I do not think it is possible to avoid bias in those topics.

A significant question is what the cutoff point would be for a model based on "scientific knowledge". Should subjects like economics, philosophy, etc. be included as scientific knowledge, or should it be limited to "hard" sciences only?


Peer review isn't all that, and a lot of subjects are censored; even disinformation is accepted if it flatters the ideological inclinations of the publication. See: Proximal Origins from Nature.


You have to spend quite a lot of time thinking about quality and values. It becomes impossible as the size of the "best slice" you're seeking gets smaller (top half is much easier than top ten percent, etc.).

If your values are “everyone should agree with my opinions” you’ll have a garbage biased data set. There are other values though. Bias free is also impossible because having a definition of a perfectly neutral bias is itself a very strong bias.


"Best" will be chosen by the creators of software for specific application uses. Medical software will use the "best" medical LLM under the hood. Programming software (Copilot et al.) will use the "best" programming LLM. General-purpose language models will probably still be used by the public when doing internet searches. Or, an idea that just popped into my head: use a classifier to determine which model can most accurately answer the user's query, and send the query off to that model for a response.
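That routing idea can be sketched with a trivial keyword "classifier". The route names and keyword sets here are made up for illustration; a real router would use a learned classifier rather than keyword matching:

```python
# Toy sketch of routing queries to specialist models.
# Route names and keywords are illustrative assumptions only.

ROUTES = {
    "medical": {"symptom", "diagnosis", "dosage"},
    "programming": {"python", "compile", "segfault"},
}

def route(query, default="general"):
    """Pick the first specialist route whose keywords overlap the query."""
    words = set(query.lower().split())
    for model, keywords in ROUTES.items():
        if words & keywords:
            return model
    return default

# route("why does my python script segfault")  -> "programming"
# route("what is the capital of France")       -> "general"
```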


Reread the comment you responded to. We aren’t discussing the best LLM here but how to determine the best source.


> how does one determine what "best" is for multiple subjects?

Also, "best" depends on audience and use-case.

Imagine a horribly tone-deaf LLM-powered Sesame Street episode about the importance of recycling, illustrated by supply-demand graphs and Kekulé structures of plastic polymers.


What do you think of how that was addressed in the GP?


With Perplexity [0] you can narrow your LLM interactions to reference only academic articles or other high-quality sources.

[0] https://www.perplexity.ai/


Ignoring the question of whether LLMs can produce the best, the idea wouldn't be to cite the best sources, but to train only on the best sources.

A garbage hallucination with a link to the Stanford Encyclopedia of Philosophy won't help anyone.


On the other hand, given the query "facial recognition in paper wasps", Perplexity not only gave an answer in accord with my prior understanding derived from reading in the field, but also surfaced a paper [1] published two months ago that I hadn't yet seen.

I expected less, and I suspect a researcher could easily find gaps. But from the perspective of an amateur autodidact, that's still a fairly impressive result.

[1] https://www.pnas.org/doi/10.1073/pnas.1918592117


> I've often thought that a search engine that indexes only the highest quality, probably hand-curated, sources would be highly desirable

That's what I miss about the old internet, where folks would have link pages that were just other cool sites

Sure, discovery was harder, but it was harder to astroturf with SEO too.


  a search engine that indexes only the highest quality
Any for-profit search engine eventually loses quality as it succumbs to ad spend.

It'd require subsidies to remain profit-neutral (skew towards quality). Think Y Combinator and HN.

Even a subscription model will eventually skew towards placating the masses with "dumbed down" content.


Let's ban advertising and let the market sort out the price of search with that perverse incentive out of the way.


> Even a subscription model will eventually skew towards placating the masses with "dumbed down" content.

Accuracy and simplicity are not the same. I can see that most people won't want to read the Stanford Encyclopedia of Philosophy's take on Plato. But anyone can read the Associated Press rather than someone's misinfo on the topic. Cut out the latter.


That makes sense, but the principle of "more data = more better" suggests that training an LLM on all the possible data and then fine-tuning it to spit out only the best answers might be better than training it on only the best data to begin with.


How will training it on false data, for example, result in better output?


Note that the model is based on RoBERTa and has only 125M parameters. It is not competing against any of the new popular models, not even small ones like Phi or Gemma.


It's also not meant to be a generative model - only to be used as an encoder model (they list retrieval as a potential use case).
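Since it's an encoder, retrieval with it boils down to embedding texts as vectors and ranking by similarity. A minimal sketch of the scoring step, with tiny made-up vectors standing in for the encoder's output (a real model would emit vectors with hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for encoder outputs; the labels and values are made up.
query_vec = [0.9, 0.1, 0.0]
docs = {
    "solar physics": [0.8, 0.2, 0.1],
    "baking bread":  [0.0, 0.1, 0.9],
}

best = max(docs, key=lambda d: cosine(query_vec, docs[d]))
# best == "solar physics"
```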


Given the current state of LLMs, I am not even sure this qualifies to be called an LLM.


Second opinion - BERT-family models are transformer-based, and that is a big threshold right there. Secondly, I am not sure that two one-minute comments could capture what exactly went on with fine-tuning, graph-based methods of constraint, or whatnot, with respect to the fitness of the production tools for their intended purposes.


This looks great! I'm excited to play with it.

Can anyone point me to a resource on how to load it?

I tried downloading the model into LM Studio on my Mac but it seems there is more to be done than just loading it.

Any pointers much appreciated!


NASA has been investing in hardcore knowledge management and information organization for well over half a century now.


Is it weird they mentioned these examples and not OpenAI, Anthropic, Gemini, etc.?

> Transformer-based language models — which include BERT, RoBERTa, and IBM’s Slate and Granite family of models

Why would they not mention the most popular transformer based language models?


BERT and RoBERTa aren't competing against IBM's products. You don't advertise your competitor in your own ad.


IBM isn't competing against Anthropic / OpenAI / Google.

IBM's business model is to be worse but sell to lots of clients because the clients don't know any better.


Which doesn't work if you tell them about your competitors :)


UniverseTBD (https://universetbd.org) is also making great strides in the space of large language models and astronomy.


Briefly hopeful, I ask astrollama-7b-chat-alpha[1] "What color is the Sun?". It replies "The Sun has no color as it emits radiation across all wavelengths from ultraviolet to infrared. [...] there isn't an answer for what color the Sun truly is since it doesn't have one but rather produces every visible spectrum imaginable!". Sigh. Hmm, I wonder if LLM replies might be usefully mined to generate misconception lists?

[1] https://huggingface.co/spaces/universeTBD/astrollama-7b-chat...


Oof, yeah, that's not a very good response. But I guess color is actually a technical term in research astronomy: it refers to the difference between two specified magnitudes (usually photometric filters). There's also the question of whether the sun is being viewed from the Earth or from space, since those will change the effective "transmission curve" of the detector system.

I asked: "What color is the sun when viewed from the ground with the human visual system?"

And got the following: "The Sun appears white to us on Earth due to its high temperature and lack of any significant wavelength-dependent absorption or scattering properties. However, if we were able to view it through specialized telescopes that could capture all visible light spectrum (not just the yellow/orange part), then it would appear as an extremely bright ball of blue light with some slight red tint. This is because most of the solar radiation consists of photons at ultraviolet and infrared frequencies which our eyes cannot perceive directly but can be detected by these advanced instruments. [...]"

An overall better response, but still not exactly right. Anyway, the base model was fine-tuned on arXiv/astro-ph abstracts, and I can't imagine too much discussion about the color of the sun in that training data set...


Nod. Though briefly asking E&M questions a few days ago made me think the "not exactly right"-but-seemingly-closer may be stopped-clock-ish. Very not at the point where latents are seemingly encoding deep structure about the world.


> IBM and NASA build language models to make scientific knowledge more accessible

Newspeak ?


RoBERTa models were presented at NASA at least 18 months ago. It seems that "reading and tagging your own manuals" is such a daunting task that these tools got fast-tracked.


What does IBM contribute in this collaboration? The development?


They enable the quantum 5G AI digital transformations.


Ohhh why didn't I think about that!


I thought IBM has its Watson already? ;)


ELI5: Does one need to write code to use these, or is there a front-end somewhere?


Using pre-trained language models like the encoder and retrieval models mentioned [1] typically doesn't require writing a lot of code, but there are still a few steps involved.

The retrieval model [2] is hosted on the Hugging Face platform. To use it, you can use Hugging Face's Inference API to send HTTP requests to their servers and receive responses from the model.

Hugging Face's docs [3] provide instructions on how to use the Inference API, including code examples in Python and other languages. Essentially, you'll need to format your input text according to the model's requirements, send an HTTP request to the API endpoint, and then process the response.

This does require some basic programming knowledge to interact with APIs and handle the requests/responses.
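A minimal sketch of such a request using only the Python standard library. The endpoint pattern follows Hugging Face's generic Inference API; the exact input format this particular model expects is an assumption on my part, so check the model card:

```python
import json
import urllib.request

# Generic Inference API endpoint pattern; the model ID is from the link above.
API_URL = "https://api-inference.huggingface.co/models/nasa-impact/nasa-smd-ibm-st"

def build_request(text, token):
    """Assemble an HTTP request for the model endpoint.

    Assumes the common {"inputs": ...} payload shape; the model card
    is authoritative on what this model actually expects.
    """
    data = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# To actually call the API (requires a valid Hugging Face token):
#   req = build_request("solar wind interactions", token="hf_...")
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read()))
```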

There are some third-party applications and services that provide a front-end for accessing pre-trained language models like this one, like Hugging Face Spaces, Replicate.ai, and Google Colab. However, these often come with additional costs or limitations...

Here's a related model by IBM and NASA for geospatial stuff [4].

[1] https://research.ibm.com/blog/science-expert-LLM

[2] https://huggingface.co/nasa-impact/nasa-smd-ibm-st

[3] https://huggingface.co/docs/huggingface_hub/v0.14.1/en/guide...

[4] https://huggingface.co/ibm-nasa-geospatial



