
That was a good post. Vector embeddings are in some sense a unique summary of a document, similar to a hash code. It makes me think it would be cool if there were some universal standard for generating embeddings, but I guess they'll be different for each AI model, so they can't have the same kind of "permanence" that hash codes have.

It definitely also seems like there should be lots of ways to utilize "Cosine Similarity" (or other closeness algos) in databases and other information processing apps that we haven't really exploited yet. For example you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but a vector similarity between resume and job description. That's probably so obvious it's being done, already.
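
Roughly, the core of that would be something like this sketch (using the sentence-transformers library as one possible embedding model; the job text and resumes here are made up):

    from sentence_transformers import SentenceTransformer  # any embedding model would do
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    job = "Senior Java web developer, Spring Boot, 5+ years of experience"
    resumes = {
        "candidate_1": "Junior developer, some Java, Spring and React work",
        "candidate_2": "Senior engineer: Java, Spring, Kafka, AWS, Kubernetes, SQL",
    }

    job_vec = model.encode(job)
    ranked = sorted(
        ((name, cosine(job_vec, model.encode(text))) for name, text in resumes.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
    print(ranked)  # highest-similarity resume first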



”you could almost build a new kind of Job Search Service that matches job descriptions to job candidates”

The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.

For example, say a job requires A and B.

Candidate 1 is a junior who has done some work with A, B and C.

Candidate 2 is a senior and knows A, B, C, D, E and F by heart. All are relevant to the job and would make 2 the optimal candidate, even though C–F are not explicitly stated in the job requirements.

Candidate 1 would seem a much better candidate than 2, because 1’s embedding vector is closer to the job embedding vector.


Even that is just static information.

We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever they like to their skill list, even though they hardly used it, just because it's a buzzword.

So Candidate 1 could still blow them out of the water in performance, and might even be able to trivially learn D and E in a short while on the job if needed.

The skill vector won't tell you much by itself, and can even prevent finding the better candidate if it's used for screening.


> We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever they like to their skill list, even though they hardly used it, just because it's a buzzword.

That is indeed a problem. I have been thinking about a possible solution to the very same problem for a while.

The fact is that people lie on their resumes, and they do it for different reasons. There are white lies (e.g. pumping something up because they aspire to it but were never presented with an opportunity to do it, yet they are eager to skill up, learn and do it if given the chance). Then there are other lies. Generally speaking, lies are never black or white, true or false; they are a shade of grey.

So the best idea I have been able to come up with so far is a hybrid solution: text embeddings (for the skills similarity match and search) coupled with sentiment analysis (to score the sincerity of the information stated on a resume) to gain extra insight into the candidate's intentions. Granted, sentiment analysis is an ethically murky area…


Sincerity score on a resume? I can't tell if you're joking or not. I mean yeah, any sentence that ends in something like "...yeah, that's the ticket." would be detectable for sure, but I'm not sure everyone is as bad a liar as Jon Lovitz.


Are you speaking hypothetically or from your own experience? Sentiment analysis is a thing, and it mostly works – I have tested it with satisfactory results on sample datasets. It is relatively easy to extract the emotional context from a corpus of text, less so when it comes to resumes due to their inherently condensed content. Which is precisely why I mentioned ethical considerations in my previous response. With extra effort and fine-tuning, it should be possible to overcome most of the false negatives, though.


Sure, AI can detect emotional tone (positive, negative, sometimes even sarcasm) in writing, so if you mean something like detecting negativity in a resume so it can be thrown straight in the trash, then I agree that can work. Any negative emotionality is always a red flag.

But insofar as detecting lies in sentences, that simply cannot be done, because even if it ever did work the failure rate would still be 99%, so you're better off flipping a coin.


So your point is that LLMs can't tell when job candidates are lying on their resume? Well that's true, but neither can humans. lol.


> The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.

Text embeddings are not about matching; they are about extracting semantic topics and semantic context. Matching comes next, if required.

If an LLM is used to generate the text embeddings, it will «expand» the semantic context of each keyword. E.g. «GenAI» would make the LLM expand the term into directly and loosely related semantic topics, say «LLM», «NLP» (with lesser relevance, though), «artificial intelligence», «statistics» (more distant) and so forth. The generated embeddings carry a much richer semantic context, which allows for straightforward similarity search as well as exploratory radial search with ease. It also works well across languages, provided the LLM was trained on a sufficiently diverse, multilingual corpus.
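
As a minimal sketch of what such a similarity search looks like (using a generic open embedding model and scikit-learn's brute-force k-NN as stand-ins for whatever one actually deploys):

    from sentence_transformers import SentenceTransformer  # stand-in embedding model
    from sklearn.neighbors import NearestNeighbors
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = ["GenAI platform engineer", "NLP research assistant", "Payroll accountant"]
    doc_vectors = np.array([model.encode(d) for d in docs])

    # Cosine-metric k-NN index over the document embeddings.
    index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(doc_vectors)

    # A query sharing no keywords with the documents still lands near the related ones.
    query_vec = model.encode("large language models").reshape(1, -1)
    distances, neighbors = index.kneighbors(query_vec)
    print([docs[i] for i in neighbors[0]])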

Fun fact: I have recently delivered an LLM-assisted (to generate the text embeddings) k-NN similarity search for a client of mine. For the hell of it, we searched for «the meaning of life» in Cantonese, English, Korean, Russian and Vietnamese.

It pulled up the same top search result across the entire dataset for the query in English, Korean and Russian. Effectively, it turned into a Babelfish of search.

The Cantonese and Vietnamese versions diverged and were less relevant, as the LLM did not have a substantial corpus in either language. This can easily be fixed in the future, once a new LLM version trained on a better corpus in both Cantonese and Vietnamese comes along – simply by regenerating the text embeddings on the dataset. The implementation won't have to change.


The trick is to evaluate a score for each skill, also weighting it by the years of experience with that skill, then sum the evaluations. This will address your problem 100%.

Also, what a candidate claims as a skill is totally irrelevant and can be a lie. It is the work experience that matters, and skills can be extracted from it.
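
As a toy sketch of that weighting (the skills, weights and years are made up, and extracting them from the work history is assumed to happen upstream):

    job_requirements = {"java": 1.0, "spring": 0.8}           # skill -> importance weight
    candidate_skills = {"java": 10, "spring": 4, "kafka": 6}  # skill -> years of experience

    def score(candidate: dict, job: dict) -> float:
        # Sum of importance * years over the skills the job actually asks for.
        return sum(weight * candidate.get(skill, 0.0) for skill, weight in job.items())

    print(score(candidate_skills, job_requirements))  # 10*1.0 + 4*0.8 = 13.2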


That's not accurate. You can explicitly bake in these types of search behaviors with model training.

People do this in ecommerce with the concept of user embeddings and product embeddings, where the result of personalized recommendations is just a user embedding search.


> not useful for the task of finding an optimal candidate

That statement is just flat-out incorrect on its face; however, it did make me think of something I hadn't thought of before, which is this:

Embedding vectors can be given a "scale" (multiplier) on specific terms which represents the amount of "weight" to add to that term. For example, if I have 10 years of experience in Java Web Development, then we can take the actual components of that vector embedding (i.e. for the string "Java Web Development") and multiply them by something proportional to 10, and that results in a vector that reaches "further" in that direction. The length represents an "amount" of pull in the Java Web direction.

So this means even with vector embeddings we can encode specific amounts of experience. Now here's the cool part: you can then take all THOSE scaled vectors (one for each individual candidate skill) and average them to get a single point in space, which CAN be compared, as a single scalar distance, against what the Job Requirements specify.
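
If I were to sketch it (with `embed` standing in for any text-embedding function):

    import numpy as np

    def profile_vector(skills: dict, embed) -> np.ndarray:
        """Average of per-skill embedding vectors, each scaled by years of experience."""
        scaled = [years * embed(skill) for skill, years in skills.items()]
        return np.mean(scaled, axis=0)

    # candidate = profile_vector({"Java Web Development": 10, "SQL": 5}, embed)
    # job       = profile_vector({"Java Web Development": 10}, embed)
    # match     = -np.linalg.norm(candidate - job)  # smaller distance = better match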


Then you would have to renormalize the vectors. You really, really want to keep them unit-length, because that is the special case where cosine similarity equals the dot product, and Euclidean distance becomes a monotone function of both (all three rank neighbors the same way).
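
Quick numeric check of that special case (for unit-length vectors the dot product is exactly the cosine, and the squared Euclidean distance is 2 - 2·cos):

    import numpy as np

    a, b = np.array([3.0, 4.0]), np.array([1.0, 2.0])
    a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)

    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(np.isclose(np.dot(a_hat, b_hat), cos))                        # True
    print(np.isclose(np.linalg.norm(a_hat - b_hat) ** 2, 2 - 2 * cos))  # True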


I meant that the normalized direction (unit vector) represents a particular "skill", and the distance along that direction (extending outside the unit hypersphere) is the years of experience.

This is geometrically "meaningful", semantically. It would apply not just to a time scalar (experience); in other contexts it could mean other things, like, for example, money invested in a particular sector (hedge fund apps).

This makes me realize we could design a new type of Perceptron (MLP) where specific scalars for particular things (money, time, etc.) could be wired into the actual NN architecture, in such a way that a specific input "neuron" would be fed a scalar for time, and a different neuron a scalar for money, etc. You'd have to "prefilter" each training input to generate the individual scalars, but then input them into the same "neuron" every time during training. This would have to improve overall "Intelligence" by a big amount.
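
A very loose sketch of that wiring (PyTorch, every name and size hypothetical): dedicated input slots carry the known scalars next to the text embedding.

    import torch
    import torch.nn as nn

    class SkillScorer(nn.Module):
        def __init__(self, embedding_dim: int, num_scalars: int = 2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(embedding_dim + num_scalars, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, text_embedding: torch.Tensor, scalars: torch.Tensor) -> torch.Tensor:
            # scalars[..., 0] is always years, scalars[..., 1] is always money, etc.
            return self.net(torch.cat([text_embedding, scalars], dim=-1))

    # scorer = SkillScorer(embedding_dim=384)
    # score = scorer(torch.randn(1, 384), torch.tensor([[10.0, 0.0]]))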


It does exist! I built this for the monthly Who's Hiring threads: https://hnresumetojobs.com/

It just does cosine similarity with OpenAI embeddings + pgVector. It's not perfect by any means, but it's useful. It could probably stand to be improved with a re-ranker, but I just never got around to it.
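
For anyone curious, the core of that kind of query is roughly the shape below (table and column names here are illustrative, not the real schema); `<=>` is pgvector's cosine-distance operator:

    import psycopg  # psycopg 3

    query_embedding = [0.01, -0.02, 0.03]  # ...vector from the embedding model (truncated)
    vector_literal = "[" + ",".join(map(str, query_embedding)) + "]"

    with psycopg.connect("dbname=jobs") as conn:
        rows = conn.execute(
            """
            SELECT id, title
            FROM job_listings                  -- hypothetical table
            ORDER BY embedding <=> %s::vector  -- cosine distance, ascending
            LIMIT 10
            """,
            (vector_literal,),
        ).fetchall()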


Very cool. I knew it was too obvious an idea to be missed! Did you read my comments below about how you can maybe "scale up" a vector based on the number of years of experience? I think that will work. It makes somebody with 10 yrs of Java experience closer to the target than someone with only 5 yrs, if the target is 10 years! -- but the problem is that someone with 20 yrs looks even worse, when they should look better! The problem of my life. hahaha. Too much experience.

I think the best "matching" factor is to minimize the total distance, where each distance is taken between the time-multiplied vectors for a specific skill.


> For example you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but a vector similarity between resume and job description. That's probably so obvious it's being done, already.

Literally the next item on my roadmap for employbl dot com, lol. We're calling it a "personalized job board" and using PGVector to store the embeddings. I've also heard good things about Typesense, though.

One thing I've found to be important when creating the embeddings is to not embed the whole job description. Instead, use an LLM to make a concise summary of the job listing (location, skills, etc.) in a structured format, then store that summary as the embedding. It reduces noise and increases accuracy for vector search.
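
A sketch of that summarize-then-embed step (assuming the OpenAI SDK; the model names and the prompt are just examples):

    from openai import OpenAI

    client = OpenAI()

    def job_listing_embedding(raw_listing: str) -> list:
        # 1. Ask an LLM for a terse, structured summary: location, skills, seniority.
        summary = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Summarize this job listing as terse bullet points: location, skills, seniority."},
                {"role": "user", "content": raw_listing},
            ],
        ).choices[0].message.content

        # 2. Embed the summary rather than the full, noisy listing text.
        return client.embeddings.create(
            model="text-embedding-3-small",
            input=summary,
        ).data[0].embedding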


For one point of inspiration, see https://entropicthoughts.com/determining-tag-quality

I really like the picture you are drawing with "semantic hashes"!


Yeah for "Semantic Hashes" (that's a good word for them!) we'd need some sort of "Canonical LLM" model that isn't necessarily used for inference, nor does it need to even be all that smart, but it just needs to be public for the world. It would need to be updated like every 2 to 5 years tho to account for new words or words changing meaning? ...but maybe could be updated in such a way as to not "invalidate" prior vectors, if that makes sense? For example "ride a bicycle" would still point in the same direction even after a refresh of the canonical model? It seems like feeding the same training set could replicate the same model values, but there are nonlinear instabilities which could make it disintegrate.


Maybe the embedding could be paired up with a set of words that embed to somewhere close to the original embedding? Then the embedding could be updated for new models by re-embedding those words. (And it would be more interpretable by a human.)
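
Roughly (with `old_embed`, `new_embed` and `vocab` as stand-ins): keep the vocabulary words whose old-model embeddings sit closest to the stored vector, then re-embed just those words under any future model.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def anchor_words(vector, vocab, old_embed, k=10):
        """Words whose old-model embeddings are closest to the stored vector."""
        ranked = sorted(vocab, key=lambda w: cosine(vector, old_embed(w)), reverse=True)
        return ranked[:k]

    def migrate(words, new_embed):
        """Approximate the old embedding in the new model's space from its anchor words."""
        return np.mean([new_embed(w) for w in words], axis=0)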


I mean, it was just a thought I had. Maybe a "solution in search of a problem". I generate those a lot! haha. But it seems to me that with some sort of canonical set of training data and a canonical LLM architecture, we'd of course end up able to generate consistent embeddings; I'm just not sure what the use cases are.


I guess it might be possible to retroactively create an embeddings model which could take several different models' embeddings, and translate them into the same format.


This is done with two models in most standard biencoder approaches; it is how multimodal embedding search works. We want to train a pair of encoders such that the text embedding that represents an item and the image embedding for that item are colocated.
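
The training objective is usually a CLIP-style contrastive loss; a condensed sketch (PyTorch, temperature scaling omitted):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Normalize, then compute similarity between every text and every image in the batch.
        text_emb = F.normalize(text_emb, dim=-1)
        image_emb = F.normalize(image_emb, dim=-1)
        logits = text_emb @ image_emb.T
        targets = torch.arange(len(text_emb))  # the i-th text pairs with the i-th image
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2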


No. That’s like saying you can transplant a person’s neuronal action potentials into another person’s brain and have it make sense to them.


That metaphor is skipping the most important part in between! You wouldn't be transplanting anything directly, you'd have a separate step in between, which would attempt to translate these action potentials.

The point of the translating model in between would be that it re-weights each and every value of the embedding, after being trained on a massive dataset of original text -> vector embedding from model A + vector embedding from model B. If you had billions of parameters trained to do this translation between just two specific models to start with, wouldn't this be in the realm of the possible?
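
For what it's worth, a first cut at that translation step doesn't even need billions of parameters: fit a simple map (here just linear least squares) on paired embeddings of the same texts from both models. A sketch, with all names illustrative:

    import numpy as np

    # A: [n_texts, dim_a] embeddings from model A; B: [n_texts, dim_b] from model B,
    # both computed on the same corpus of texts.
    def fit_translation(A: np.ndarray, B: np.ndarray) -> np.ndarray:
        W, *_ = np.linalg.lstsq(A, B, rcond=None)  # least-squares W such that A @ W ~ B
        return W

    def translate(a_vec: np.ndarray, W: np.ndarray) -> np.ndarray:
        return a_vec @ W  # approximate model-B embedding of a model-A vector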


A translation between models doesn't seem possible, because there are actually no "common dimensions" at all between models. That is, each dimension has a completely different semantic meaning in different models, and on top of that it's the combination of dimension values that begins to impart real "meaning".

For example, the number of different unit-vector combinations in a 1500-dimensional space is like the number of different ways of "ordering" the components, which is 1500!, roughly 5 × 10^4114.

EDIT: And the point of that factorial is that even if the dimensions were "identical" across two different LLMs but merely "scrambled" (in ordering), there would be that astronomically large number of permutations to contend with to "unscramble" them.
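
(Quick sanity check on that count, via the log-gamma function:)

    import math

    log10_factorial = math.lgamma(1501) / math.log(10)  # lgamma(n + 1) == ln(n!)
    print(log10_factorial)  # ~4114.7, i.e. 1500! is about 5 * 10^4114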


This is very similar to how LLMs are taught to understand images in LLaVA-style models (the image embeddings are encoded into the existing language token stream).


This is definitely possible. I made something like this. It worked pretty well for cosine similarity in my testing.


I tried doing something like that: https://gettjalerts.com/

I added semantic search, but I'm working on adding resume upload/parsing to do automatic matching.



