Tri Dao, the creator of an increasingly popular technique in language model development, is running head-first into a new problem for AI researchers: working with the incredibly fast-moving LLM open source community.
Dao is one of the creators of FlashAttention, a technique now adopted by developers to more efficiently expand the amount of information that can go into a large language model’s context window. The theory goes that if you can find ways to get larger amounts of useful information into that window (without going overboard), you could achieve better outcomes because the model has a better idea of what you’re trying to do. FlashAttention was updated this week to FlashAttention-2.
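FlashAttention itself is a fused GPU kernel, but its core trick can be sketched in plain code: compute softmax attention over tiles of keys and values, keeping a running max and normalizer, so the full attention matrix is never materialized. Here's a rough NumPy illustration of that online-softmax idea (a simplification for intuition, not the actual kernel; all names are mine):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full (n, n) score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    # Online-softmax attention: process K/V in blocks, keeping a
    # running row max (m), normalizer (l), and output accumulator (O).
    n, d = Q.shape
    O = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)  # running max of scores seen so far
    l = np.zeros((n, 1))          # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)                       # one tile of scores
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)                           # unnormalized tile probs
        correction = np.exp(m - m_new)                  # rescale old accumulators
        l = l * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ Vj
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

Because each tile fits in fast on-chip memory, the real kernel avoids the memory traffic that makes long context windows expensive; the math above is the exact same attention, just computed incrementally.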
Dao this week joined Together, a startup that aims to build open source language models and associated technology, where he’ll work as chief scientist. Together raised $20 million in a round earlier this year led by Lux Capital (an investor in MosaicML, which was recently acquired by Databricks for $1.3 billion). SV Angel, First Round Capital, and Jakob Uszkoreit, one of the co-authors of the Transformer paper, are also investors.
FlashAttention, which speeds up training and fine-tuning of LLMs, could help solve an important piece of the puzzle for making large language models address more practical use cases. For complicated tasks, users could hypothetically inject a lot of instructions and examples for how to address the task.
Dao is now one of many academic researchers finding their work catching fire on the internet among hobbyists and non-academic practitioners, with developers quickly finding widely applicable use cases. That’s different from the typical environment of academia, which often prioritizes novel technologies and novel use cases without necessarily weighing performance, cost, or practicality.
“This is probably my first exposure to open source,” he told me in an interview. “Previously we had research where almost all our papers have just made code available, but it was more on the research-y side. And then some interested researchers would use that and improve the methods. With FlashAttention it was solving a problem that people need and immediately got integrated into a bunch of frameworks like Hugging Face and PyTorch. It’s available in PyTorch and it benefits a huge number of folks. As a result we get great feedback.”
That, of course, comes with a learning curve to shift to a more production-level mindset that scientists like Dao now face. And it’s one that AI researchers will continue to face going forward as their work gets quickly adapted and iterated on in the open source community.
Dao’s interest in Together started with his colleagues already working at the company. Chris Ré, a co-author of the FlashAttention paper, is a co-founder of Together, and co-author Dan Fu is an academic partner of the company. Together at its core is built around open source methodologies, in contrast to closed-source model developers like OpenAI.
“What appealed to me about Together was their focus on open source,” Dao said. “Philosophically, I think that aligns well with what I wanted to do and how I see the future looking like. The future might not be a few companies offering really good AI models. The future is probably a bunch of really good open source models with a bunch of players in this field. The models will be accessible and lots of people can contribute and improve them, or contribute data and things like that.”
But the biggest learning curve for academics entering the open source ecosystem is that production-grade open source tools are mostly focused on practicality, rather than on achieving some optimal solution without regard for resources, costs, or other such constraints, Bob Van Luijt, CEO of open source vector database startup Weaviate, told me.
“So, as a researcher, you might squeeze a couple of microseconds out of your algorithm with double the memory footprint, making the index immutable, etc. and eureka! You are state-of-the-art,” he said. “The only downside is that the rather pragmatic community doesn’t really adopt it because it’s unusable. Pragmatically, the cheapest to run, good enough solution often wins.”
Many of the most widely-used technologies in the open source ecosystem found their roots in academia. One of the most notable is Apache Spark, which came out of UC Berkeley. That led to the birth of Databricks, a now $38 billion company that’s generating $1 billion in revenue annually. But it certainly wasn’t an easy transition.
“I think it’s been great for Databricks,” CEO Ali Ghodsi told me. “We innovate largely thanks to our academic roots. In academia it’s all about novelty.”
In the case of Spark, it built up a foundational layer for managing colossal amounts of data. Databricks has also popularized the use of data lake architectures (and the “lakehouse” paradigm, extending that functionality to data warehouses) and built essentially a one-stop shop for developing and deploying machine learning models.
Both have extremely practical use cases for businesses, and can lead to even more new open source technologies, like the company’s open source Delta Lake framework. But it also meant growing beyond producing novel technologies and solving niche use cases, and focusing more on how the work could have a measurable impact on products and businesses.
“I actually actively had to put hurdles in so people would stop innovating,” Ghodsi said. “It’s so ingrained in the bloodstream of the company. The original 20 people were researchers and their modus operandi was, ‘how can we improve how people do things in a novel way?’”
It’s one thing to appeal to data wonks looking to derive insights from their vast piles of data and build functionality around it. It’s another thing to run head-on into a community of avid hobbyists and practitioners that find immediate use cases that aren’t just practical—they’re widely accessible and, perhaps more importantly, fun to play with.
Large language models have captured the attention of those same hobbyists and practitioners in the same way they’ve captured the attention of the general public. That’s partly thanks to ChatGPT, but the explosion of open source development around language models like LLaMA has enabled a vast array of fun new use cases that are very accessible to non-scientist types.
Meta released Llama 2, the first version of LLaMA that’s available for commercial use, just yesterday. Within a few hours, quantized GGML versions of Llama 2 started appearing on Hugging Face that would work with llama.cpp, a package that enables users to run the models locally on a MacBook Pro. That technology is built using ggml, another open source tensor library built by llama.cpp creator Georgi Gerganov.
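The quantization that makes this possible follows a simple recipe: store each small block of weights as low-bit integers plus one floating-point scale per block. Here's a deliberately simplified Python sketch of symmetric 4-bit block quantization in that spirit (illustrative only; it does not reproduce the actual GGML Q4_0 bit packing, and the names and block size are my choices):

```python
import numpy as np

BLOCK = 32  # GGML-style formats quantize weights in small blocks

def quantize_q4(w):
    # Symmetric 4-bit quantization: each block keeps one float scale
    # plus an integer in [-8, 7] per weight.
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q, scale):
    # Reconstruct approximate float weights from ints and per-block scales.
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_q4(w)
w_hat = dequantize_q4(q, scale)
# Roughly 4 bits per weight (plus one scale per block) instead of 32:
# a small accuracy hit in exchange for a large memory reduction.
max_err = np.abs(w - w_hat).max()
```

That tradeoff, bounded rounding error per block in exchange for fitting a multi-billion-parameter model into laptop RAM, is exactly the practicality-over-optimality instinct Van Luijt describes.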
Tools like that, though, are focused on practicality—sacrificing certain levels of accuracy in a large language model for the sake of running them locally on a laptop. An emphasis on novelty takes a back seat to managing the tradeoffs to create a technology that has the most widely applicable capabilities.
“If academia wants to adapt to users, they need to broaden the scope, which in turn conflicts with creating state-of-the-art algorithms,” Van Luijt said. “It’s very common that the academic work that’s the easiest to use, causing the least amount of friction to end users, wins.”
Together is particularly focused on that tradeoff. It created the RedPajama dataset to mimic the dataset used to train the original LLaMA, as well as two accompanying open source models in the form of RedPajama-INCITE, to address a much broader set of use cases. It’s part of the appeal of scientists joining a startup like Together, where they can potentially have a larger impact than by incrementally advancing the field through research.
But there is an enormous appeal to research and academia, especially as we start to run into walls around the performance of Transformers, the most popular technique in AI today. The restriction around context windows is one example that FlashAttention addresses. But Dao also said he’s exploring completely novel approaches that go beyond Transformers at Together, which tries to strike a balance between research and developing practical tools.
“If we want to understand why Transformers is so good, we should try to develop alternatives and see if we can come up with something just as good,” he said. “If we can’t, maybe there’s something special about Transformers. If there’s evidence there are alternatives, that gives us info about what’s important for models to perform well.”
Apple Tests ‘Apple GPT,’ Develops Generative AI Tools to Catch OpenAI (Bloomberg): Mark Gurman at Bloomberg basically confirms what we thought to be true: Apple is building an in-house LLM for its products. Apple had noted that it was using transformer-based approaches for language modeling in autocorrect, which was a clear indicator that there was something new under the hood. It also, per Bloomberg, built a framework called Ajax to build LLMs. Perhaps Apple is interested in personalized fine-tuned models for every user, hosted directly on-device?
How is ChatGPT’s behavior changing over time? (ArXiv): A team out of Stanford and Berkeley, including Databricks CTO Matei Zaharia, investigates what seemed to be a pet theory on the internet: that the quality of OpenAI’s language models was decreasing over time. The paper suggests that, indeed, that may be the case. OpenAI said it was updating its models earlier this year, incorporating new feedback (typically via RLHF) to “improve” them. But we’re seeing lately that any changes and tuning can have a big downstream impact on the actual responses from the model.
Unstructured Raises $25 Million to Bring Order to the Chaos for Language Model Data (Newcomer): It wasn’t going to be long before we started getting new tools that make vector databases even more useful. Unstructured improves the process of transitioning company data into a vector database like Weaviate, Pinecone, or Chroma. Unstructured is getting $25 million in a round led by Madrona and including Bain Capital Ventures. Weaviate CEO Bob van Luijt and LangChain creator/CEO Harrison Chase also participated in the round.
OpenAI Worries About What Its Chatbot Will Say About People’s Faces (New York Times): The Times details some reasoning behind OpenAI restricting its multimodal capabilities for GPT-4—namely, avoiding use cases that could invade privacy. But with all the major developers racing to integrate multimodal capabilities into next-generation models, it’s really only a matter of time before one comes out that wades into that very complicated territory.
If you have any tips, please send me a note at email@example.com or contact me directly on Signal at +1-415-690-7086. As always, please send any and all feedback my way.