
You can ask the model to describe the image, but that is inherently lossy. What if it's a chart and the model captures most of the x, y pairs, but the user asks about one it missed? Presenting the image at inference time guarantees the LLM can answer exactly the user's question. The only blocker then is how good retrieval is, and that's a smaller problem to solve. This approach lets us focus only on passing in the relevant context and the LLM takes care of the rest; otherwise the problem space expands to correct OCR, parsing, and getting every possible description of the images out of the model upfront.
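
To make that concrete, the retrieve-then-infer step is basically "find the right page image, attach it to the question." A minimal sketch of the pattern, not Morphik's actual code (the model name and the retrieval step are placeholders):

    # Pass the retrieved page image to a multimodal LLM at inference time,
    # so nothing is lost to OCR or pre-generated captions.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def answer_from_page(question: str, page_image_path: str) -> str:
        # page_image_path is whatever your retriever returned as the best-matching page
        with open(page_image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4o",  # any multimodal chat model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content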


Yeah, the fine-tuning is definitely the best part.

Often, the blocker becomes high-quality eval sets (which I guess is always the blocker).


We do use ColQwen! Currently 2, but upgrading to 2.5 soon :)


Yeah, we had an overload on the ingestion queue. If you try again it will be much faster, as we just moved to a beefier machine. (The previous ingestion will still work since it's in the queue, but new ones will be faster.)


Wait, your title says this "runs locally"?


Yes! If you're running the local version and it's taking long, that's an indication that your GPU isn't being used properly. This can be traced back to the `colpali_embedding_model.py` file, where you can set the device and attention implementation you want PyTorch to use.
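
For reference, the device/attention selection boils down to something like this. Illustrative sketch only (the exact file contents, model id, and dtypes in Morphik may differ; flash attention needs a CUDA GPU with flash-attn installed):

    # Pick the best available device and attention implementation for ColQwen.
    import torch
    from colpali_engine.models import ColQwen2, ColQwen2Processor

    if torch.cuda.is_available():
        device, attn = "cuda", "flash_attention_2"
    elif torch.backends.mps.is_available():
        device, attn = "mps", "eager"   # Apple Silicon
    else:
        device, attn = "cpu", "eager"

    model = ColQwen2.from_pretrained(
        "vidore/colqwen2-v1.0",
        torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
        device_map=device,
        attn_implementation=attn,
    ).eval()
    processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")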


Depending on the use case, it happily runs on my MacBook Air M2 with 16GB RAM using MPS for small PDFs, and searching over 100-150 documents with ColPali takes 2-ish minutes. Very rough numbers. Ingestion takes around 15-20 seconds per page, which is on the slower end. On an A100, ingestion with ColPali takes 4-5 seconds per page (we haven't optimized performance or batch sizes yet tho). Without ColPali it is much faster. Per-page ingestion speed doesn't change much as the corpus grows.

I'd be happy to report back after some testing; we are looking to optimize more of this soon, as speed is somewhat of a missing piece at the moment.


For ingesting graphs, you can define a filter or specify certain document ids. When updating, we check whether any new docs have been added matching that filter (or you can specify new doc ids). We then do entity and relationship extraction again, and run entity resolution against the existing graph to merge the two.

Creating graphs and entity resolution are both tunable with overrides: you can specify domain-specific prompts and overrides (will add a pharma example!) (https://docs.morphik.ai/python-sdk/create_graph#parameters). I tried to add code inline but it was formatting badly; a rough sketch is below, and the link has the exact parameters.
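
Roughly what I was trying to paste, as a sketch (parameter names are approximate and the prompt override shape is simplified; the docs link above has the exact signature):

    # Build a domain-specific graph over docs matching a filter,
    # steering entity extraction with an override prompt.
    from morphik import Morphik

    db = Morphik("morphik://...")  # connection URI elided

    graph = db.create_graph(
        name="pharma_graph",
        filters={"domain": "pharma"},
        prompt_overrides={  # approximate shape, see docs for the real override types
            "entity_extraction": {
                "prompt_template": "Extract drugs, indications, and molecular targets ...",
            }
        },
    )
    # There's a corresponding update call for folding newly added docs
    # (matching the filter, or given explicit doc ids) into the existing graph.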


You can run this fully locally using Ollama for inference, although you'll need larger models and a beefy machine for great results. On my end llama 3.2 8B does a good job on technical docs, but the bigger the better lol.
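
If you want to sanity-check your local model before wiring it in, Ollama exposes an OpenAI-compatible endpoint, so something like this works (the model tag is whatever you've pulled):

    # Talk to a local Ollama model through its OpenAI-compatible API.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored

    resp = client.chat.completions.create(
        model="llama3.2",  # e.g. after `ollama pull llama3.2`
        messages=[{"role": "user", "content": "Summarize this spec section: ..."}],
    )
    print(resp.choices[0].message.content)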


Ahh, I didn't see that, I just saw them talking about a free tier or whatever and my eyes glazed over. I'll try it out with Mistral-small 3.1 at some point tonight, I've been having really great results with its multimodal understanding.


how would you use this within open-web-ui locally?


Thanks, we should have been clearer. The part in `ee` is our UI, which can be used for testing or in dev environments. The main code, including the API, SDK, and the entire backend logic, is MIT (Expat).


Do you mean ingesting the extracted rectangles/bounding boxes? We're actually working on bounding boxes; this is a good insight and we can add it to the product. However, the way we ingest is literally converting each page to an image and then embedding that, so the text, layout, and diagrams are all encoded. Would like to know what the exact use case is, so we can help you better.
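
For context, the page-as-image ingestion path is basically this (simplified sketch; the real pipeline adds batching, storage, and device handling):

    # Render each PDF page to an image and embed it with ColQwen,
    # so text, layout, and diagrams are all captured together.
    import torch
    from pdf2image import convert_from_path            # needs poppler installed
    from colpali_engine.models import ColQwen2, ColQwen2Processor

    model = ColQwen2.from_pretrained("vidore/colqwen2-v1.0", torch_dtype=torch.float32).eval()
    processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

    pages = convert_from_path("report.pdf", dpi=150)   # one PIL image per page

    with torch.no_grad():
        batch = processor.process_images(pages).to(model.device)
        page_embeddings = model(**batch)               # one multi-vector embedding per page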


Why do you convert to image? It’s easy to turn the components of a pdf into separate items and then ingest them individually. I also imagine at some point rasterizing vectors will become a pain point for some reason.


Mainly to maintain layout information. Also search becomes easier this way.


Depends on your document types.

If you're working with plain text files, then plain RAG built on top of any vector database can suffice depending on your queries (if they directly reference the text, or can be made to, then similarity search is good enough). If the queries are cross-document, setting a higher number of chunks for plain RAG to retrieve might also do a good job.

If you have tables, images, etc. then using a better extraction mechanism (maybe unstructured, or other document processors) and then creating the embeddings can also work well.

I'd say if docs are simple, then just building your own pipeline on top of a vector db is good!
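
For a starting point, here's a minimal version of that kind of pipeline using Chroma as the vector db (any store works; the chunking here is deliberately naive):

    # Bare-bones RAG: chunk a text file, embed/store it, then retrieve top-k chunks
    # to paste into your LLM prompt as context.
    import chromadb

    client = chromadb.Client()
    collection = client.create_collection("docs")  # uses Chroma's default embedder

    def chunk(text: str, size: int = 500) -> list[str]:
        # Naive fixed-size chunking; swap in something smarter for real documents.
        return [text[i:i + size] for i in range(0, len(text), size)]

    with open("notes.txt") as f:
        chunks = chunk(f.read())

    collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

    results = collection.query(
        query_texts=["What does the contract say about renewal?"],
        n_results=5,
    )
    context = "\n\n".join(results["documents"][0])  # feed this to the LLM with the question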

