Open source NLP is fueling a new wave of startups
Posted on: Dec 24, 2021

Large language models capable of writing poems, summaries, and computer code are driving the demand for “natural language processing (NLP) as a service.” As these models become more capable — and accessible, relatively speaking — appetite in the enterprise for them is growing. According to a 2021 survey from John Snow Labs and Gradient Flow, 60% of tech leaders indicated that their NLP budgets grew by at least 10% compared to 2020, while a third — 33% — said that their spending climbed by more than 30%.

Well-resourced providers like OpenAI, Cohere, and AI21 Labs are reaping the benefits. As of March, OpenAI said that GPT-3 was being used in more than 300 different apps by “tens of thousands” of developers and producing 4.5 billion words per day. Historically, training and deploying these models was beyond the reach of startups without substantial capital — not to mention compute resources. But the emergence of open source NLP models, datasets, and infrastructure is democratizing the technology in surprising ways.

Open source NLP

The hurdles to developing a state-of-the-art language model are significant. Those with the resources to develop and train them, like OpenAI, often choose not to open-source their systems in favor of commercializing them (or exclusively licensing them). But even the models that are open-sourced require immense compute resources to commercialize.

Take, for example, Megatron 530B, which was jointly created and released by Microsoft and Nvidia. The model was originally trained across 560 Nvidia DGX A100 servers, each hosting 8 Nvidia A100 80GB GPUs. Microsoft and Nvidia say that they observed between 113 and 126 teraflops per second per GPU while training Megatron 530B, which would put the training cost in the millions of dollars. (A teraflop rating measures the performance of hardware, including GPUs.)

Inference — actually running the trained model — is another challenge. Getting inference time (e.g., for sentence autocompletion) with Megatron 530B down to half a second requires the equivalent of two $199,000 Nvidia DGX A100 systems. While cloud alternatives might be cheaper, they’re not dramatically so — one estimate pegs the cost of running GPT-3 on a single Amazon Web Services instance at a minimum of $87,000 per year.

Recently, however, open research efforts like EleutherAI have lowered the barriers to entry. A grassroots collection of AI researchers, EleutherAI aims to eventually deliver the code and datasets needed to run a model similar (though not identical) to GPT-3. The group has already released a dataset called The Pile that’s designed to train large language models to complete text, write code, and more. (Incidentally, Megatron 530B was trained on The Pile.) And in June, EleutherAI made GPT-Neo and its successor, GPT-J, available under the Apache 2.0 license — the latter a language model trained for five weeks on Google’s third-generation TPUs that performs nearly on par with an equivalent-sized GPT-3 model.
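
To give a sense of how accessible these open-source models are in practice, here is a minimal sketch of generating text with an EleutherAI checkpoint. It assumes the Hugging Face transformers library and the publicly hosted "EleutherAI/gpt-neo-1.3B" weights (a smaller sibling of GPT-J chosen so it fits on commodity hardware); neither tool is prescribed by the article, and the same pattern applies to the larger checkpoints given enough memory.

```python
# Sketch: text generation with EleutherAI's open-source GPT-Neo.
# Assumes the Hugging Face transformers library; the 1.3B-parameter
# checkpoint is used here so the example runs on a single GPU or CPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"  # GPT-J ("EleutherAI/gpt-j-6B") follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Open source NLP is fueling a new wave of startups because"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a short continuation from the model.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The point of the sketch is less the specific output than the workflow: with open weights and a permissive license, a small team can prototype an NLP feature in a few lines rather than training a model from scratch.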