Nice! I've been looking for a way to animate diagrams, as it would help a lot in visualising workflows.
Feedback:
1. I tried different mermaid diagrams from https://mermaid.live/, and your animation is only working with classes and flowcharts. It didn't work with the sequence diagram (which is the most interesting to me).
2. It would be great to control the animation to be a sequence instead of one animation for all arrows at once. What I would like to do is show fellow devs the workflow from start to finish, according to the spec.
I appreciate that this is just a start, but it looks promising and has great potential. Good luck!
Why? Did I miss something? There's no indication that OpenAI has been collecting personal information about me (other than typical name, payment info, email) for reasons other than the actual service.
The hardest part of RAG is document parsing. If you only consider plain text it should be fine, but once you have tables, tables spanning multiple pages, charts, a table of contents to skip when present, footnotes, etc., that part becomes really hard, and the accuracy of the retrieved context suffers regardless of which chunking strategy you use.
There are patterns that help, such as RAPTOR, where you make ingestion content-aware: instead of just ingesting the raw content, you use LLMs to question and summarise it, and save those summaries to the vector database alongside the chunks.
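A minimal sketch of that RAPTOR-style idea: index raw chunks as usual, but also group them and index LLM-generated summaries at a higher level. The `summarize` function here is a placeholder for a real LLM call, and the grouping-by-position is a simplification (RAPTOR proper clusters by embedding similarity and builds a tree of summaries).

```python
def summarize(texts):
    # Placeholder: a real system would call an LLM here.
    # This stub just keeps the first sentence of each chunk.
    return " ".join(t.split(".")[0] for t in texts)[:200]

def raptor_ingest(chunks, group_size=3):
    """Return (text, level) records to index: raw chunks at level 0,
    group summaries at level 1. Retrieval can then hit either level."""
    records = [(chunk, 0) for chunk in chunks]
    for i in range(0, len(chunks), group_size):
        group = chunks[i:i + group_size]
        records.append((summarize(group), 1))
    return records

chunks = [
    "Revenue grew 10%. Details follow.",
    "Costs fell 5%. See table 2.",
    "Margins improved. Outlook stable.",
]
for text, level in raptor_ingest(chunks):
    print(level, text)
```

The point is that a query like "how did the quarter go overall?" can match the level-1 summary even when no single raw chunk answers it.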
But the reality is that a one-size-fits-all solution for RAG is not an easy thing to build.
The issue is ingestion (extracting the right data in the right format). This is mainly a problem with PDFs, and sometimes with DOCX files when tables are embedded as images. You need a mix of text extraction and OCR to get the data out correctly before you start chunking and adding embeddings.
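That mixed text/OCR pass can be sketched like this: use the text layer where a page actually has one, and fall back to OCR where it doesn't (scanned pages, or tables embedded as images). The extractors here are stand-ins; a real pipeline might use something like pdfplumber for the text layer and Tesseract or a vision model for the OCR branch.

```python
def ocr(image_bytes):
    # Placeholder: a real system would run Tesseract or a vision model.
    return f"<ocr output of {len(image_bytes)} bytes>"

def extract_page(page):
    """page: dict with 'text' (str from the PDF text layer, may be empty)
    and 'image' (raw page image for the OCR fallback).
    Returns (content, source) so downstream chunking knows the origin."""
    if page["text"].strip():            # usable text layer: trust it
        return page["text"], "text"
    return ocr(page["image"]), "ocr"    # scanned/image page: OCR fallback

pages = [
    {"text": "Q3 revenue was $10m.", "image": b""},
    {"text": "", "image": b"\x89PNG..."},   # e.g. a table saved as an image
]
for page in pages:
    content, source = extract_page(page)
    print(source, content)
```

Tagging each page with its extraction source also helps later, since OCR output usually needs extra cleanup before chunking.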
I can believe that many startups are doing prompt engineering and agents, but in a sense this is like saying 90% of startups are using cloud providers, mainly AWS and Azure.
There is absolutely no point in reinventing the wheel to create a generic LLM and spending a fortune on GPUs while providers offer that power cheaply.
In addition, there may be value in getting to market quickly with existing LLM providers, proving out the concept, then building / training specialized models if needed once you have traction.
I got excited reading the article about releasing the training data, went to their HF account to look at it (dolma3), and the first rows? Text scraped from porn websites!
Isn’t this from before any curation has happened? I looked at it, and I can see why it looks bad, but if they’re really being open about the whole pipeline, they have to include everything. Giving them a hard time for it only encourages keeping models closed.
That said, if it were my dataset I would have shuffled that part further down the list so it didn’t show up in the HF preview.
It says it’s Common Crawl, which I interpret to mean it’s a generic web-scrape dataset; presumably they filter out what they don’t want before pretraining. You’d have to do some ablation testing to know what value it adds.