A long time back, I built a Retrieval Augmented Generation (RAG) app that contained all publicly available data on Warren Buffett and Charlie Munger. I could ask it questions, and it would answer the way they might at their shareholder meetings. I thought I should test Cloudflare's AutoRAG product on that dataset and see how it performs.
Some raw notes from my testing:
- It provides only two options for embedding models: baai/bge-large-en-v1.5 and baai/bge-m3; the former specialises in English text while the latter is multilingual. It also offers only cosine-distance matching, even though Cloudflare's Vectorize product supports more distance metrics.
- It lets you tweak two chunking settings, chunk size and chunk overlap, with sane defaults (sketched right after these notes).
- It integrates with their AI Gateway product, which can route LLM calls to top providers like Gemini, OpenAI, Anthropic, Mistral, etc., and offers good observability. Oddly, even though AI Gateway supports many different LLMs, AutoRAG's calls are routed only to Llama models.
- Their default LLM pick is Llama 3.3 70B; Gemma 27B would be a better default.
- Gives the option to do query rewriting
- Two options in the retrieval configuration: number of returned results and a match threshold (also sketched after these notes)
- Four presets for similarity search (how they're implemented remains unknown):
- Exact - near-identical matches
- Strong - high semantic similarity
- Broad - Moderate matches, more hits
- Loose - Low similarity, max reuse
- I love the fact that it adds references to the data source used to generate an answer.
- I can add more data files to the Cloudflare R2 bucket, and AutoRAG automatically indexes them and uses them in retrieval.
- I can understand what's going on through AI Gateway, where the Logs tab shows the prompts and the context.
- It populates the context window with `<document name="...">...</document>` tags and uses the name to reference which document an answer drew from.
- The query rewriting is slow: ~1.7 s.
- Some caveats:
- The R2 bucket needs to contain data before you create the AutoRAG app; otherwise, it doesn't work
- Inability to do prompt optimisation
- It has a few bugs, but those will likely get fixed; it's currently in beta, after all
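To make the chunking settings concrete, here's a minimal sketch of fixed-size chunking with overlap. The function and its defaults are my own illustration (character-based for simplicity; AutoRAG presumably counts tokens), not AutoRAG's implementation:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks, where consecutive chunks
    share `overlap` characters so context isn't cut mid-thought."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances per chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```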
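And here's roughly what cosine-similarity retrieval with a result count and a match threshold looks like; again, a generic sketch with an arbitrary threshold, not Cloudflare's code:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray,
             top_k: int = 10, threshold: float = 0.4) -> list[int]:
    """Return indices of the top_k chunks whose cosine similarity
    to the query clears the match threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                # cosine similarity of every chunk vs the query
    ranked = np.argsort(-sims)  # indices sorted by similarity, best first
    return [int(i) for i in ranked[:top_k] if sims[i] >= threshold]
```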
Some vibe-check questions I asked:
- Why did you invest in Salomon Brothers? What made you like the company before it got involved in the scandal? What did you learn from it?
- What should a young, enterprising investor do to get started in investing? Where should they look? What would you do if you had a million dollars to invest today?
- What is float in investing? Why has it been crucial for Berkshire's success?
The answers were good, but the retrieval wasn't. A lot of improvement can and should be made on retrieval. Then again, some retrieval techniques are domain- and dataset-dependent, so maybe a generalised product like this can't get there?
Where would I use this?
- Let's say I want to achieve something I haven't managed purely by prompting, and I suspect that supplying a dataset could help; then I would use this as a simple way to test whether RAG is going to be beneficial or not.
- To test whether an idea resonates with people without investing a lot of time and energy
What I would like to see before considering this beyond prototyping:
- A way to evaluate how well the retrieval worked, i.e. whether the correct context was placed to answer the question (see the first sketch after this list)
- More retrieval options, like BM25 (second sketch)
- Ability to plug in re-rankers, although seeing how promising frontier AI models are getting, especially GPT-4.1 at needle-in-a-haystack retrieval, I don't think re-rankers will be necessary in the future. My hot take is that by 2026, the market for re-rankers will completely collapse: they increase the overall latency and cost of the system while providing increasingly less value.
- Custom rules in retrieval: e.g., if I can label each piece of a data source as coming from `news` or from `disclosures`, then I don't want the context window filled with only `news` or only `disclosures`; i.e., limit the number of instances from any one data type to get more diversity, even if the cosine similarity is slightly lower (third sketch)
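On the evaluation point, here's a minimal sketch of what I mean, assuming a small hand-labelled set of query-to-relevant-document pairs (the labels and the `retrieve` function are hypothetical). Recall@k just checks whether any correct document made it into the retrieved set:

```python
from typing import Callable

def recall_at_k(labeled: dict[str, set[str]],
                retrieve: Callable[[str], list[str]], k: int = 10) -> float:
    """Fraction of queries for which at least one relevant document
    shows up in the top-k retrieved results."""
    hits = 0
    for query, relevant_docs in labeled.items():
        retrieved = set(retrieve(query)[:k])
        if retrieved & relevant_docs:
            hits += 1
    return hits / len(labeled)

# Hypothetical labelled pair: this query should pull up the 1995 letter.
# labeled = {"Why has float been crucial for Berkshire?": {"1995-letter.pdf"}}
```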
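On BM25, this is what lexical scoring looks like with the rank_bm25 package (the toy corpus is mine; AutoRAG exposes nothing like this today). A hybrid setup would merge these scores with the cosine ones:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "Float is money we hold but don't own.",
    "Salomon Brothers was a leading bond-trading firm.",
    "Berkshire's insurance operations generate float.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "why is float crucial for berkshire".split()
print(bm25.get_scores(query))  # one lexical relevance score per document
```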
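And the custom diversity rule, sketched: walk the results in similarity order but cap how many chunks any one data type can contribute. The labels and the cap are hypothetical:

```python
from collections import Counter

def diversify(results: list[tuple[str, str, float]],
              max_per_type: int = 3, top_k: int = 10) -> list[tuple[str, str, float]]:
    """results: (chunk, data_type, similarity) tuples sorted best-first.
    Keep the highest-similarity chunks while capping each data type
    (e.g. 'news', 'disclosures') so the context window stays diverse."""
    counts: Counter = Counter()
    picked: list[tuple[str, str, float]] = []
    for chunk, dtype, sim in results:
        if counts[dtype] < max_per_type:
            picked.append((chunk, dtype, sim))
            counts[dtype] += 1
        if len(picked) == top_k:
            break
    return picked
```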
Overall, I think this product fits well with their current serverless direction. They have been building products that pick sane defaults, abstract away messy details, and allow their customers to ship ridiculously fast, as long as they relinquish control.
Their AI Gateway product looks a lot more promising, and so does their Vectorize product. Not sure if I want to review them right now.
A side note:
Six months back, I designed some 15 prompts, of which ChatGPT failed 7, whereas the RAG I built failed only 2. I reran those same prompts, and ChatGPT now failed only 2. Incredible progress in these six months. I suspect niche RAG will become less and less valuable, especially where the dataset is smallish and easily publicly available. Proprietary datasets that aren't publicly available will still benefit from RAG. I would say RAG is selectively facing extinction.