In today’s column, I am continuing my ongoing coverage of prompt engineering strategies and tactics that aid in getting the most out of using generative AI apps such as ChatGPT, GPT-4, Bard, Gemini, Claude, etc. The focus this time will be on several clever approaches you can use to speed up the response time to your prompts. Another way to express this concept is that I will be showcasing techniques that reduce latency or delays in getting generated responses.
If you are interested in prompt engineering overall, you might find my comprehensive guide covering over fifty other keystone prompting strategies to be of interest; see the discussion at the link here.
Let’s begin our journey that intertwines art and science regarding fruitfully composing prompts and getting the most out of generative AI at the fastest attainable speeds.
Understanding The Tradeoffs Is Half The Battle
I feel the need, the need for speed.
That is the familiar refrain vigorously exclaimed in the classic movie Top Gun. Turns out that the desire to be fast applies to prompts and generative AI too. When you enter a prompt, you want a generated response that will immediately appear in front of you. No waiting. No watching the clock. It is your dear hope that the response will be altogether instantaneous.
Like many things in life, sometimes we don’t quite get what our hearts desire.
There are times when you wait a near eternity to get a response from generative AI. Well, let’s be serious, the waiting time isn’t that bad. The typical large-scale generative AI app tends to respond within 1 to maybe 10 seconds. You can’t even take a fulfilling sip from your coffee mug in that short a length of time.
Nonetheless, we are used to fast food, fast cars, and fast lives, so our expectations are set that we want fast answers to our entered prompts. All else being equal, it would be great if we could always get responses within 1 second or so. Waiting ten or more seconds is something that no modern-day online user should have to experience. We live in a glorious era of supersized computer servers, and rightfully expect and demand that those massive-sized large language models (LLMs) and generative AI can quickly produce answers.
Is getting a super-fast response time truly needed?
The answer is that it all depends.
If you are using generative AI on a casual basis and mainly just for fun, the reality is that waiting a few extra seconds is no skin off your nose. On the other hand, if you are using generative AI for crucially important tasks that are time-boxed, those added seconds might be bad. This comes up, for example, in commercial uses of generative AI, such as using AI to control real-time robots or perhaps operate time-sensitive hospital equipment. In those dire cases, even just an additional second can potentially be the difference between life and death.
Okay, we can acknowledge that being truly speedy is something that ranges from being a nice-to-have to a must-have. The way someone is making use of generative AI, whether for solving problems, answering questions, or controlling allied systems, is a major factor in how much speed matters.
Whenever we discuss speed, you can look at the matter in one of two ways. One aspect is to make things fast or speedy. Another aspect entails reducing latency. Those are two sides of the same coin. Our goal is to minimize latency while maximizing speed. They go hand-in-hand.
There is a rub to all of this.
Oftentimes, you will have to choose between speed and quality.
Allow me to elaborate.
There is a famous line that in this world you can get things fast, cheap, or good, but you must select only two of those at a time. You can either get something that is fast and cheap but won’t be good, or you can get something that is cheap and good, but you won’t get it fast, and finally, you can get something that is fast and good, but it won’t be cheap. Well, upon second thought, I’m not sure that those propositions are always true. Anyway, you get the idea that tradeoffs exist.
The same notion of tradeoffs applies here. Generative AI can be fast at generating a response, but the response will likely be of a lesser quality. If you are willing to allow more time toward generating the response, which means added delay from your perspective as a user, you can possibly get a higher quality response.
I want you to always keep that in mind.
Each clever technique or trick in prompting will almost certainly involve a choice between time and quality. Some techniques will improve quality but sacrifice time to do so. Other techniques will speed things up, but likely undercut quality. Sorry, that’s the real world telling us that there isn’t such a thing as a free lunch.
A twist on this is that the tradeoff does not always hold. In other words, I regret to inform you that despite allowing more time, you aren’t guaranteed that the quality of the result will be better. It might be the same as if you hadn’t consumed the extra time. Shockingly, there is even a chance that the quality might be less. In a sense, allowing a longer time to calculate something can inadvertently cause more harm than good.
Yikes, you might be thinking, it seems that you are darned if you do, and darned if you don’t. I wouldn’t go that far out on a limb. A practical rule of thumb is that much of the time the willingness to allow for more time will produce a higher quality response. The quality increase might not be grandiose, and you might wonder whether you could have settled for the lesser quality response in a lesser amount of time.
The gist is that once you start playing the speed game, you have to accept that there are trade-offs and risks. You will aim to gain as much speed as you can, keep latency low, and yet achieve high quality. That’s the bullseye.
Factors That Go Into Latency And Speed
Generative AI is not usually an island unto itself.
The odds are that when you use a generative AI app, the app is running on a server in some remote locale. Furthermore, you will need network access to send a prompt to the AI and get back a generated response from the AI. There are a lot of moving parts all told.
Consider these six key components or factors:
- (1) Network access speed
- (2) Server speed that is running generative AI
- (3) Number of simultaneous users while you are using generative AI
- (4) Priority of your prompts as per the AI provider
- (5) Speed of generative AI or large language model
- (6) Nature of the prompt you enter
Let’s contemplate how these factors interrelate with each other.
You enter a prompt and hit return. The response from the AI takes a lengthy time. What happened?
Where was the glitch or issue that caused things to get bogged down?
It could be that the network you are using is as slow as a snail. Perhaps the generative AI responded instantly, but the network was sluggish and took a long time to deliver the response to you. From your perspective, you are probably going to blame the generative AI app for being slow. Unfair! It was the network that was the culprit, figuratively standing there with a guilty look in the kitchen and carrying the murderous candlestick.
Another possibility is that the network works like greased lightning and the generative AI as software is blazingly fast, but the computer server is woefully underpowered and overextended. Too many users, too little hardware. The prompt that you entered arrives on a timely basis at the AI software. The software was then starved in terms of computer processing cycles. It was the hardware that undermined the effort.
The crux is that there are a multitude of points of contention that arise in the span of time between entering a prompt and getting a generated response.
One of those factors involves which generative AI app you decide to make use of. Some generative AI apps are optimized for speed. Other generative AI apps have several variations that you can select from, wherein there are faster versions but at a likely lesser quality of responses. And so on, it goes.
Thus, the moment that you decide to use a particular generative AI app, you are making a choice about speed. You might not even realize that you are doing so. Most users do not realize that some generative AI apps are slower or faster than others. Perhaps the assumption is that all generative AI apps are roughly the same, including the speed of responsiveness. Nope, that’s not the case.
Another consideration involves the licensing of the generative AI and possibly the fees that you are paying to use the generative AI app. Many of the AI makers have made provisions that if you pay more, you will get a faster response time. This is usually phrased as being a higher-priority user. If you are whining about how slow the AI is, take a look to see if you can pay a surcharge to get a faster response time. The question will be whether the added cost is bearable to you.
Speaking of the AI maker being able to determine speed, they can do lots of other behind-the-scenes actions to speed up or slow down their generative AI. There are parameters associated with LLMs and generative AI that can be set on a global basis by the AI maker for their AI app. They can tweak the quality of answers to goose the AI to be faster. You might not realize the quality has decreased. You almost certainly will notice that the AI is going faster.
The odds are though that eventually, users of that AI would figure out that quality is being shortchanged. The change in speed would be obvious almost right away, while the lessening of quality might take a while to discern. It is a dicey gambit because once users start grumbling about quality, an AI maker will have a devil of a time turning around that perception. They could be cooking their own goose.
Factors That You Can Control Or At Least Seek To Moderate
I’m guessing that you now realize that there are some factors that you can potentially control when it comes to response time speed, while there are other factors essentially outside of your control.
In a roundabout way, you control which network you use to access a generative AI app, so that factor is at least partially within your hands. If you opt to use a slow network, you are undermining a component in the speed-determining supply chain. Another roundabout choice is the generative AI app that you choose to use. Furthermore, your priority of use will be determined by the licensing agreement and how much you are willing to pay to use the AI.
The upshot is that if you are really worried about getting the fastest speed and the lowest latency, you will need to make suitable choices about the network you will be using, and mindfully decide which generative AI app you will utilize (plus, what arrangement associated with the use of the AI you’ve bargained for).
Those are pretty much up to you to select.
An AI maker makes a lot of crucial decisions too. They decide which servers to use. They decide how many servers to use. They decide how to split response time across their user base. They must remain vigilant to monitor response time and try to continuously tune their generative AI. There are lots of under-the-hood mechanisms and AI parameters they can globally set for all of the users and ergo determine average speeds of response times.
The good news is that generative AI apps are now almost a dime a dozen. I say this is good news because the marketplace is relatively competitive. With competition underway, the AI makers know they must strive mightily to try and keep their speed high and their quality high. If an AI maker falters on those metrics, the chances are that a lot of existing or prospective users are going to gravitate to someone else’s generative AI app.
As an aside, the speed and latency issues are going to be somewhat upended by a new trend that will gradually be evident in a year or so. Here’s the deal. Rather than running generative AI over a network, you will be able to run generative AI on your smartphone or similar mobile device. The speed will no longer be moderated via network access. It is just you and whatever chunk of iron is running the generative AI app.
A beauty too is that besides no longer suffering network delays, you will be able to keep relatively private how you are using the mobile version of generative AI. The prompts you enter can stay solely on the local device. Well, to clarify, I’m sure that some AI makers will provide you with an option to flow your prompts up to a server in the sky for backup purposes or might even force you to do so. We’ll have to see how that pans out.
Out of all the factors at play, the one that you have the most direct control over is the prompts that you enter into generative AI. You can compose prompts that are super-fast for the generative AI to process and return a result. Or you can compose prompts that cause the AI to take additional time.
It’s all up to you.
I would suggest that the average user of generative AI is pretty much clueless, or shall we say unaware, of how their prompts will impact the speed of response. They ought not to be blamed for their lack of awareness. You see, the world has rushed headlong into generative AI by simply rolling it out to anyone who wants to use it. There aren’t any required training courses or special certificates needed to use modern-day generative AI. Just create an account and go for it.
In my classes on prompt engineering, I always make sure to include a portion devoted to the timing facets and discuss the wording that can either speed up or slow down the response time. You might be thinking that I should only cover how to speed up. The thing is, as mentioned earlier, there is likely a relationship between speed and quality of results, and as such, sometimes you will purposely be willing to take a hit on speed to attain the highest quality results that you can get.
Here are nine crucial precepts about prompt composition and speed (I believe any prompt engineer worth their salt should know these by heart):
- (1) Prompt Wording. Be aware of how the wording of a prompt impacts the speed of a response.
- (2) Write Prompts Smartly. Try to write prompts that will be efficiently processed.
- (3) Employ Special Phrases. Use special phrasing to identify that speed is crucial.
- (4) Techniques Consume Speed. Watch out for prompting techniques that tend to chew up speed.
- (5) Multitude Of Factors. Realize that a slew of factors will determine the response time.
- (6) Optimize When Possible. Seek to optimize as many factors as possible.
- (7) Arbitrariness Of Speed. Response time will vary dramatically day-to-day, moment-to-moment.
- (8) Cost Versus Speed. There are cost tradeoffs associated with seeking low latency.
- (9) Quality Of Results Versus Speed. There are tradeoffs between the quality of the response and seeking low latency.
I will cover a few of those straight away. The rest will be covered as I take you deeper into the forest and show you the assorted trees.
Let’s begin by highlighting one of the most popular prompting techniques that entails invoking chain-of-thought (CoT). For my details on chain-of-thought prompting, see the link here and the link here.
I am going to use CoT to illustrate the points above about prompt wording (bulleted point #1), writing prompts smartly (bulleted point #2), employing special phrases (bulleted point #3), the use of prompting techniques (bulleted point #4), cost versus speed (bulleted point #8), and quality of results versus speed (bulleted point #9). Prepare yourself accordingly.
The chain-of-thought approach is easy to describe.
When you are composing a prompt, just add a line that tells the AI to work on a stepwise or step-by-step basis. Something like this: “I want you to add together the numbers one through 10. Show me your work on a step-by-step basis”. This will cause the generative AI to show each of the steps that it performed when deriving the generated response that will be shown to you. You will get your answer and a step-by-step indication of how it came to be figured out.
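If you would like to see what that looks like as an actual call to a generative AI app, here is a minimal sketch, assuming the OpenAI Python client; the model name is merely illustrative, and you would substitute whichever generative AI app and account setup you actually use.

```python
# Minimal sketch: invoke chain-of-thought by appending a stepwise instruction
# to an otherwise ordinary prompt. Assumes the OpenAI Python client and an
# illustrative model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

base_prompt = "I want you to add together the numbers one through 10."
cot_prompt = base_prompt + " Show me your work on a step-by-step basis."

for label, prompt in [("direct", base_prompt), ("chain-of-thought", cot_prompt)]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```

The chain-of-thought version will typically return a noticeably longer, stepwise answer than the direct version.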
So far, so good.
Empirical research that I’ve cited in my coverage on chain-of-thought tends to provide hard evidence that the generated result will be of a higher quality when you’ve invoked CoT. Great! By the mere act of telling the AI to proceed on a step-at-a-time basis, you can get better results (much of the time, though not necessarily all the time).
The downside is that the stepwise endeavor is almost always more costly and time-consuming than if you hadn’t asked for the stepwise approach to be used. Your prompt is going to force the AI to take more time than usual, merely as a byproduct of trying to work out the answer on a stepwise basis (which, notably, you told it to do).
Is the added time worth it to you? The stepwise effort will cause a potential delay in seeing the result. Also, if you are paying by the amount of time consumed or by the processing cycles of the server, it will mean more money out of your pocket.
That takes us to the zillion-dollar question.
Just about any prompt that you compose can materially impact the timing of the generated response. This is going to happen whether you realize it or not. You can either blindly enter prompts and have no idea of how the response time will be impacted, or you can try to anticipate how the wording might affect the response time.
You already now have one handy rule of thumb, namely that if you opt to invoke CoT by telling generative AI to proceed on a stepwise basis, the chances are that the response time is going to have a higher latency. For those of you who have perchance been routinely invoking CoT, you are causing your response time to be longer, and you are bearing a higher cost if you are paying for the AI usage.
I don’t want to make a mountain out of a molehill.
It could be that the added time means, for example, that instead of getting an answer in 2 seconds it takes 8 seconds. I would guess that those added 6 seconds of waiting time on your part are negligible and you are happy to have gotten the stepwise explanation. Likewise, the cost might be tiny fractions of a penny. It all depends on the licensing arrangement you’ve agreed to with the AI maker.
If you are insensitive to time and cost, such that you aren’t worried about either one, I suppose you can compose your prompts in the most bloated of ways. It won’t matter to you. On the other hand, if getting speedier results is important, knowing what to do and what to avoid can make a demonstrable difference.
I’ll walk you through the kind of logic you ought to be employing.
You are entering a prompt. You customarily add a line that invokes CoT by telling the AI to work on a stepwise basis. Aha, now that you realize that the CoT will raise the latency and possibly increase the cost, you mentally weigh whether invoking CoT is worthwhile to you. If you truly need the CoT, or if you don’t care about time or cost, you can include the CoT instruction in your prompt. If you didn’t especially need the CoT and are concerned about time and cost, you would omit it.
Here are the thinking processes in general:
- (a) What aspects of the wording of my prompt will potentially increase time or cost?
- (b) Should I reword the prompt to avoid those possibilities?
- (c) Will any such rewording undercut the quality of the response?
- (d) Am I worried more about time and cost, or quality?
- (e) Based on which is most important, word the prompt accordingly.
I strongly recommend that you implant such a process into your prompting mindset.
Wording Ways To Speed Up The Response Time
We’ve covered that if you use CoT this will almost certainly expand time and cost. By and large, nearly all of the various prompt engineering techniques are going to expand time and cost. That’s because each of the techniques forces the AI to take additional steps that otherwise by default would be unlikely to be undertaken.
Okay, so your rule of thumb can be enlarged to stipulate that for just about any prompting technique such as chain-of-thought, skeleton-of-thought, verification-of-thought, take-a-deep-breath, and so on (see my coverage of fifty prompting techniques at the link here), you are expanding time and cost.
If you are serious about gauging the impacts beforehand, you could try out each of the techniques on your chosen generative AI app and across your chosen network, paying attention to the time and cost additives. This will give you a semblance of the added tax, as it were, involved in using those techniques. You should eye that cautiously because you could do the same tryout a day later and potentially get different time and cost results, due to the vagaries of the network and servers being used by the AI maker.
Anyway, if you do such an experiment, write down the results and keep them handy on a tab sheet somewhere. You will want to refer to it from time to time.
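For those who want to run that kind of tryout programmatically, here is a minimal sketch of a timing harness, assuming the OpenAI Python client; the model name and the technique suffixes are illustrative placeholders for whatever techniques you care to gauge, and running it will consume whatever usage fees your account incurs.

```python
# Minimal timing harness: compare elapsed time and token usage for a base
# prompt versus the same prompt with a technique phrase appended.
# Assumes the OpenAI Python client; model name and suffixes are illustrative.
import time
from openai import OpenAI

client = OpenAI()

BASE_PROMPT = "Summarize the main causes of climate change."
TECHNIQUES = {
    "plain": "",
    "chain-of-thought": " Work through this on a step-by-step basis.",
    "take-a-deep-breath": " Take a deep breath and think carefully before answering.",
}

for name, suffix in TECHNIQUES.items():
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": BASE_PROMPT + suffix}],
    )
    elapsed = time.perf_counter() - start
    usage = response.usage
    print(f"{name:>20}: {elapsed:5.2f}s  "
          f"prompt_tokens={usage.prompt_tokens}  "
          f"completion_tokens={usage.completion_tokens}")
```

Rerunning the same harness on different days will also give you a feel for the day-to-day variability that I noted.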
A somewhat muddled viewpoint is that maybe you should avoid using prompting techniques if your primary concern is time and cost. Ouch! Don’t neglect the quality of the result. Most prompting techniques aim to get you better-than-normal results. Keep in mind that when you shoot from the hip, the quality of the result might be so low that you will inevitably have to make use of a prompting technique anyway. Avoiding one at the outset is only staving off the inevitable.
I would caution you that you might shoot your own foot (figuratively, not literally). Here’s how. You avoid using a prompting technique, desirous of a fast result. You write a quick-and-dirty prompt. You get a fast result. Sadly, maddeningly, you look at the fast result and plainly see that it is worthless. You now make use of a prompt engineering technique.
Voila, you have probably roughly doubled the total time because you ran that prompt essentially twice (once without the prompting technique, then again with the prompting technique included). Your desire to keep the time low has backfired. You did double duty and paid dearly for it.
Avoid unnecessary double-duty.
One question that I often get during my classes on prompt engineering is whether there is a remark or statement that can be included in a prompt to get the AI to go faster.
Not exactly, but there are some possibilities.
For example, suppose I tell the AI to add together a bunch of numbers and include a line in the prompt that says to go fast. Maybe like this: “Add up the numbers from 1 to 10. Do this as fast as possible.” The odds are that in that particular circumstance, things aren’t going to go faster.
In fact, the processing time required to examine the line that says to go faster will end up chewing up added time. You would have been better off omitting the second line and submitting only the first.
This takes us down a bit of a rabbit hole.
It will be worth the detour.
The size of a prompt is material to the processing time, and likely to the cost too. You conventionally think about prompts in terms of the number of words in a prompt. AI makers usually count based on the number of tokens. A token is a numeric identifier that the generative AI uses to represent words or parts of words. Words are typically divided up into portions and a token represents each portion. You could for example have 75 words but require 100 tokens to represent those words.
Upon entering a prompt and hitting return, the generative AI converts the words into tokens. This is commonly referred to as tokenization. The processing then within the generative AI uses those tokens. When a generated response is prepared inside the AI, it is done as tokens. A last step before showing you the generated result involves the AI converting those tokens into words.
For my detailed explanation of tokenization, see the link here.
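To see the word-versus-token distinction concretely, here is a small sketch using the tiktoken tokenizer library; the sample prompt and the encoding name are shown purely for illustration, and other AI apps use their own tokenizers.

```python
# Count words versus tokens for a sample prompt.
# Uses the tiktoken library; the encoding shown is one commonly used by
# recent OpenAI models and is illustrative here.
import tiktoken

prompt = "Add up the numbers from 1 to 10. Do this as fast as possible."

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode(prompt)

print(f"Words:  {len(prompt.split())}")   # naive whitespace word count
print(f"Tokens: {len(tokens)}")           # what the model actually processes
print(tokens[:10])                        # the first few numeric token IDs
```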
Various metrics about generative AI are based on the nature of tokens.
For example, I mentioned that when the AI has produced a generated result, it is composed of tokens. The first token produced for a generated result is customarily considered an important time demarcation. A popular metric is the TTFT, time to first token. That is the time from the moment you hit return and your prompt reaches the AI, up to the moment the generative AI produces the first of what will likely be a series of tokens in your generated result.
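If you would like a rough feel for TTFT on your own setup, here is a minimal sketch that streams a response and clocks the arrival of the first piece of output, assuming the OpenAI Python client; it measures wall-clock time from your side of the network, so it is an approximation that folds in network delay.

```python
# Rough client-side TTFT measurement using a streaming response.
# Assumes the OpenAI Python client; the model name is illustrative, and the
# timing includes network transit, so treat it as an approximation.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain why the sky is blue."}],
    stream=True,
)

first_token_at = None
chunks = 0
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        chunks += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()

total = time.perf_counter() - start
print(f"TTFT (approx): {first_token_at - start:.2f}s")
print(f"Total time:    {total:.2f}s over {chunks} streamed chunks")
```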
Here’s another rule of thumb for you.
Shorter prompts tend to save processing time and reduce latency or speed things up.
Now that I’ve told you about tokens, the underlying reason for that rule is going to be easier to explain. The fewer the number of words in your prompt, the less effort is involved in the tokenization on the input side of things. That seems obvious perhaps. Fewer words mean fewer tokens. A key consideration is that this tends to mean less time involved in the input processing.
Short prompts also tend to lend themselves to short responses, once again speeding things up. Long prompts tend to require lengthier processing and tend to produce longer generated results, both of which consume time.
I want to clear up a misconception on that matter.
Many assume that a long response is being generated simply due to the prompt being long. It is as though the number of words or sentences in the prompt is being blindly matched to the number of generated words or sentences in the response. Tit for tat, as it were.
That’s not usually the basis for the long response. More often, it is the complexity of the matters contained within the prompt that leads to a voluminous reply. You ask a complicated question; you get a complicated answer. Likewise, assuming that short prompts tend to pose fewer complex questions, you tend to get short responses that are ergo less complicated.
All in all, you can explicitly control what is going to happen about the length of the generated response. Don’t let things happen by chance alone. In a short prompt, you can say whether you want a lengthy answer or a brief answer. In a long prompt, you can also say whether you want a lengthy answer or a brief answer.
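As a concrete illustration of taking control, here is a small sketch that pairs a brevity instruction inside the prompt with a hard cap on output tokens, assuming the OpenAI Python client; the model name and the cap value are illustrative.

```python
# Steer the length of the generated response rather than leaving it to chance:
# ask for brevity in the prompt and cap the output tokens as a hard backstop.
# Assumes the OpenAI Python client; the model name and cap are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Briefly describe photosynthesis. Limit the response to 50 words.",
    }],
    max_tokens=80,  # hard ceiling on output tokens, which also bounds latency
)

print(response.choices[0].message.content)
```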
Research On Generative AI Latency And Speed Of Response
There is a slew of research that focuses on trying to speed up large language models and generative AI. I will walk you through a few examples to illustrate the various approaches taken.
First, Anthropic, the AI maker that provides the generative AI app Claude, offers a variety of insights about how to speed up response time. In the posting entitled “Reducing Latency” (posted online at the Anthropic website, accessed May 10, 2024), they provide these points (excerpted):
- “Latency, in the context of LLMs like Claude, refers to the time it takes for the model to process your input (the prompt) and generate an output (the response, also known as the “completion”). Latency can be influenced by various factors, such as the size of the model, the complexity of the prompt, and the underlying infrastructure supporting the model and point of interaction.”
- “It’s always better to first engineer a prompt that works well without model or prompt constraints, and then try latency reduction strategies afterward. Trying to reduce latency prematurely might prevent you from discovering what top performance looks like.”
- “Baseline latency: This is the time taken by the model to process the prompt and generate the response, without considering the input and output tokens per second. It provides a general idea of the model’s speed.”
- “Time to first token (TTFT): This metric measures the time it takes for the model to generate the first token of the response, from when the prompt was sent. It’s particularly relevant when you’re using streaming (more on that later) and want to provide a responsive experience to your users.”
As noted in the above points, one rule of thumb is that you might consider first trying a prompt regardless of any concerns about speed. Once you’ve refined the prompt with an aim toward heightening quality, you can begin to hone toward reducing latency.
That approach assumes that you are likely devising a prompt that you hope to reuse. If you are doing a prompt on a one-time-only basis, the idea of iterating is probably not going to be especially worthwhile. I say that because the time and cost to repeatedly refine a prompt will in total undoubtedly exceed whatever happens on a first-shot basis.
You likely observed that the TTFT was defined in the above points, a metric that I earlier introduced to you.
An online blog provides additional terminology and associated definitions. As stated in “LLM Inference Performance Engineering: Best Practices” by Megha Agarwal, Asfandyar Qureshi, Nikhil Sardana, Linden Li, Julian Quevedo, and Daya Khudia (blog posting online at the Databricks website, October 12, 2023), here are some excerpts:
- “Our team uses four key metrics for LLM serving.”
- “Time To First Token (TTFT): How quickly users start seeing the model’s output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.”
- “Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds with how each user will perceive the “speed” of the model. For example, a TPOT of 100 milliseconds/tok would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.”
- “Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated).”
- “Throughput: The number of output tokens per second an inference server can generate across all users and requests.”
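Those metrics compose neatly. Here is a tiny sketch that plugs some made-up numbers into the latency formula quoted above; the TTFT and TPOT figures are purely illustrative, not measurements.

```python
# Apply the quoted formula: latency = TTFT + TPOT * (number of output tokens).
# The TTFT and TPOT values below are illustrative, not measured.
ttft_seconds = 0.4      # time to first token
tpot_seconds = 0.1      # time per output token (100 ms/token, about 10 tokens/sec)
output_tokens = 300     # expected length of the generated response

latency = ttft_seconds + tpot_seconds * output_tokens
print(f"Estimated total latency: {latency:.1f} seconds")  # 0.4 + 0.1 * 300 = 30.4
```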
Shifting to another speed topic, I mentioned that the length of a prompt will impact the amount of processing and therefore affect latency.
All sorts of research studies have examined prompt length-related facets. For example, in “LLM Has A Performance Problem Inherent To Its Architecture: Latency” (a blog posting online at Proxet, August 1, 2023), the authors describe an experiment on this topic (excerpts):
- “We elaborate on why we think latency is an inherent concern to the technology of LLMs and predict that longer prompts, which require increased tokens — limited by their sequential nature — are a significant driver of latency.”
- “Through benchmarking how OpenAI’s API response time varies in response to different prompt lengths, we explore the relationship between response time and prompt size.”
- “We input a Wikipedia article with a predefined length (in tokens) to the GPT-3.5-turbo model and prompted the model with a question that it could answer by examining the article. There were 10 trials for each token length (between 250 and 4000, with a step of 250).”
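If you wanted to try a stripped-down version of that kind of prompt-length benchmark yourself, the following sketch shows the general shape, assuming the OpenAI Python client and the tiktoken tokenizer; the filler text, the token steps, and the single trial per step are placeholders rather than a reproduction of the cited study.

```python
# Stripped-down prompt-length benchmark: time responses as the input grows.
# Assumes the OpenAI Python client and tiktoken; filler text, token steps,
# and one trial per step are placeholders, not the cited experiment.
import time
import tiktoken
from openai import OpenAI

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

filler_text = "Generative AI latency depends on many moving parts. " * 500
filler_tokens = encoding.encode(filler_text)

for length in range(250, 2001, 250):
    excerpt = encoding.decode(filler_tokens[:length])
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": excerpt + "\n\nIn one sentence, what is this passage about?",
        }],
    )
    elapsed = time.perf_counter() - start
    print(f"{length:5d} input tokens -> {elapsed:5.2f}s")
```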
Sometimes research efforts will examine one specific generative AI app. In other instances, a comparison is made among several generative AI apps.
Here’s an instance of examining four generative AI apps, as depicted in “Comparative Analysis of Large Language Model Latency” by Dylan Zuber, a blog posting online at Medium, May 13, 2024, per these points (excerpts):
- “For our evaluation, we selected four industry-leading language models: (1) Anthropic’s Claude-3-Opus-20240229, (2) OpenAI’s GPT-4, (3) Groq running LLaMA3–8B-8192, (4) Cohere’s Command-R-Plus.”
- “These models were tested under various token configurations to reflect common usage scenarios in real-world applications.”
- “The token configurations for the tests were as follows: (1) Scenario A: ~500 input tokens with a ~3,000-token output limit, (2) Scenario B: ~1,000 input tokens with a ~3,000-token output limit, (3) Scenario C: ~5,000 input tokens with a ~1,000-token output limit.”
- “Choosing the right LLM for specific operational needs depends critically on understanding each model’s latency under various conditions.”
One of the biggest difficulties in conducting research on the speed of generative AI and large language models is that the underlying AI apps are moving targets.
Allow me to elaborate.
AI makers are typically changing their generative AI apps and LLMs on an ongoing basis. They tweak this or that. They perform additional data training. All manner of maintenance, upkeep, and improvements are regularly being made.
The problem then is that if a research study examines a generative AI app, the performance results might differ dramatically the next month, or possibly even the next day, due to underlying changes being made by the AI maker. You could even argue that the results might change notably from moment to moment. Recall that I mentioned that the network, the servers, and a host of other factors impact latency. Doing a test of speed is subject to the quirks of those factors.
Overall, be cautious in interpreting the results of speed tests on generative AI and LLMs.
You will want to find out how a study was performed. Which generative AI apps were used? Which versions? What was the network setup and other factors? What kinds of prompts were used? Was the quality of the result included in the metrics or measurements? Etc.
In addition, the results are likely to reflect a particular point in time. A week later or a day later, the results might switch around. Make sure to interpret such studies with a hefty grain of salt.
Using ChatGPT To Dive Further Into Matters Of Latency And Speed
I will next proceed to examine further the nature of latency and speed facets. This will consist of a series of dialogues with ChatGPT. ChatGPT is a logical choice in this case due to its immense popularity as a generative AI app. An estimated one hundred million weekly active users are said to be utilizing ChatGPT. That’s a lot of people and a lot of generative AI usage underway.
A few quick comments before we launch into using ChatGPT.
If you are going to try to do the same prompts that I show here, realize that the probabilistic and statistical properties will likely produce slightly different results than what I show here. That’s the nature of generative AI and how it is devised.
You also should expect that different generative AI apps will respond in different ways. Not all generative AI apps are the same. The key is that sometimes a particular prompt will work in one generative AI app and not another. Nonetheless, the odds are relatively high that you will get roughly similar responses from all the major generative AI apps such as GPT-4, Gemini, Bard, Claude, etc.
There is also a timing factor to be considered. A prompt that worked today might not work the same way tomorrow. Most of the generative AI apps are continually being updated. The updates might alter internal mechanisms that could change how the generative AI reacts to your prompting.
We are ready to get underway with ChatGPT.
I opted to start by asking ChatGPT what impacts latency in generative AI. We will likely find out which way ChatGPT goes on this matter.
As an aside, whenever you are starting a conversation with generative AI, I recommend as a prompt engineering technique to begin by asking a question that will establish if the AI has been data-trained on the topic at hand. A generated response that is vacuous will give you a heads-up that you might be barking up the wrong tree with that generative AI app.
Here we go.
- My entered prompt: “What impacts the speed of getting responses from my prompts?”
- ChatGPT generated response:
“The speed at which you receive responses from your prompts can be influenced by several factors.”
“Complexity of the Query: More complex queries require more processing power and time to generate a comprehensive and accurate response. Queries that involve detailed explanations, analysis, or the synthesis of multiple pieces of information will take longer than simpler requests.”
“Length of the Response: Generating longer responses naturally takes more time. If your prompt requires a detailed or lengthy answer, it will impact the response time.”
“Model Load and Server Performance: The current load on the servers running the language model can affect response times. High demand or heavy traffic can slow down processing.”
“Internet Connection: The speed and stability of your internet connection can affect how quickly you receive the response after it’s generated. Poor or slow internet connections can introduce delays.”
“Optimization and Efficiency of the Model: The specific architecture and optimizations of the language model also play a role. Newer, more efficient models may process requests faster than older or less optimized ones.”
“Backend Systems and Infrastructure: The efficiency of the backend systems, including the servers and databases supporting the language model, can influence response times. High-quality infrastructure can handle requests more swiftly.”
“Request Type: Different types of requests may require different processing approaches. For example, generating creative content versus factual answers can impact how quickly a response is generated.”
End of ChatGPT response
The generated response is pretty sensible and comports with the factors I identified earlier.
Let’s next ask for specific suggestions on the wording of prompts.
- My entered prompt: “Does the wording of a prompt impact the speed of response?”
- ChatGPT generated response:
“The wording of a prompt can impact the speed of the response in several ways.”
“Clarity and Specificity: Clear and specific prompts are easier for the model to understand and process. Ambiguous or vague prompts might require additional computation as the model tries to infer the intended meaning, potentially slowing down the response.”
“Complexity of Language: Prompts that use complex or convoluted language can require more processing time. Simple, straightforward language is processed more quickly.”
“Length of the Prompt: Longer prompts require more processing time simply because there is more text to analyze. Shorter prompts can lead to faster responses.”
“Number of Queries in One Prompt: If a prompt contains multiple questions or requests, the model has to address each part, which can increase the response time. Breaking down complex queries into separate, simpler prompts can help speed up the process.”
“Request for Multiple Outputs: Asking for multiple outputs (e.g., “give me five examples”) in a single prompt will take longer than requesting a single response.”
“Use of Special Instructions: Prompts that include specific instructions on formatting or style may require additional processing. For example, asking for a response in a particular format (bullet points, numbered list, etc.) adds a layer of complexity.”
“Context and Background Information: Prompts that require the model to generate context or background information can be more time-consuming. Providing necessary context within the prompt can make it easier and quicker for the model to generate a response.”
End of ChatGPT response
Once again, I would gauge that the reply abides by my points about latency and wording of prompts.
A handy use of generative AI when exploring a topic is to get the AI to provide examples of whatever you are studying.
I’ll ask for examples of prompt wording that impacts speed.
- My entered prompt: “Give me an example of a prompt that would run fast and a different example of a similar prompt but that would run slower.”
- ChatGPT generated response:
“Fast Prompt Example: “List three benefits of regular exercise.”
“Slow Prompt Example: “Can you explain in detail why regular exercise is beneficial for overall health, particularly focusing on physical, mental, and emotional aspects, and provide some examples of specific exercises that offer these benefits?”
“Clarity and Specificity.”
“Fast Prompt: This is a straightforward question with a clear request for three benefits. It doesn’t require extensive elaboration or deep analysis.”
“Slow Prompt: This prompt asks for a detailed explanation, covering multiple aspects (physical, mental, and emotional benefits) and specific examples, requiring the model to process and generate a more complex and lengthy response.”
“Length of the Prompt.”
“Fast Prompt: The prompt is short and concise, making it quicker to process.”
“Slow Prompt: The prompt is longer and more detailed, which takes additional time to analyze.”
“Complexity of Language.”
“Fast Prompt: Uses simple and direct language.”
“Slow Prompt: Includes more complex instructions and multiple components that the model needs to address.”
“Depth of Response Required.”
“Fast Prompt: The expected response is brief and to the point.”
“Slow Prompt: The response requires a deeper analysis and more comprehensive information.”
“By keeping the prompt simple, direct, and limited in scope, you can generally achieve faster responses. More detailed and complex prompts will naturally take longer to process and answer comprehensively.”
End of ChatGPT response
The example was simple but definitely highlighted the aspects that determine latency.
One aspect worthy of note is that I had asked a relatively short question, and yet I got quite a lengthy response. From a latency perspective, the lengthy response would have increased the time and cost of my getting the response. I could have controlled this by stating in my prompt to just give the two examples and not provide an associated explanation. This showcases how the instructions in your prompt can impact the time and cost, including when you fail to include specific instructions on what you want the AI to do.
Let’s next ask a question regarding typical latency times.
- My entered prompt: “What is the normal expected speed at which I should expect a response from a prompt?”
- ChatGPT generated response:
“The normal expected speed for receiving a response from a prompt can vary based on several factors, including the complexity of the prompt, server load, and the specific architecture of the language model. However, here are some general guidelines.”
“Simple Prompts: For straightforward questions or requests, response times are typically very fast, often within 1-2 seconds. These include prompts like: “What is the capital of France?”, “List three benefits of regular exercise.”
“Moderately Complex Prompts: For prompts that require a bit more processing or generation of a moderately detailed response, you can expect response times of 3-5 seconds. Examples include: “Explain the main causes of climate change.”, “Summarize the plot of ‘To Kill a Mockingbird.'”
“Complex or Detailed Prompts: For very detailed, complex, or multi-part prompts, response times may range from 5-10 seconds or longer. These could be: “Describe the impact of the Industrial Revolution on European society, including both positive and negative aspects.”, “Write a short story about a robot learning to feel emotions, including dialogue and character development.”
End of ChatGPT response
The times listed are generally apt.
Once again, the times you will experience are dependent on the factors I mentioned earlier. If you have a slow network, your times will be higher than those typical times. If you have a fast network connection and are paying for fast processing, you will be at the lower end of the latency range.
In case you are curious about the under-the-hood mechanisms and how those can be designed, built, and tweaked for speed, I decided to ask ChatGPT to list the main considerations.
Here we go.
- My entered prompt: “How does the design of the generative AI or large language model impact the latency or speed of responses?”
- ChatGPT generated response:
“The design of a generative AI or large language model significantly impacts the latency or speed of responses. Several key design elements contribute to this.”
“Model Architecture.”
“Complexity and Size: Larger models with more parameters can provide more detailed and accurate responses but require more computation time, leading to increased latency. Models like GPT-4 have billions of parameters, which can slow down response times compared to smaller models.”
“Layer Depth: Deep architectures with many layers can process information more comprehensively but also more slowly due to the increased number of computations per query.”
“Optimization Techniques.”
“Quantization: Reducing the precision of the model’s parameters (e.g., from 32-bit to 16-bit) can speed up computations with minimal impact on performance.”
“Pruning: Removing less critical parts of the network can reduce the computational load and improve speed without significantly affecting accuracy.”
“Inference Engine.”
“Hardware Acceleration: Utilizing specialized hardware such as GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units) can significantly speed up the processing of large models compared to general-purpose CPUs.”
“Parallel Processing: Distributing the computational load across multiple processors or machines can reduce response times.”
“Software Optimizations.”
“Efficient Algorithms: Implementing optimized algorithms for matrix multiplications and other core operations can enhance speed.”
“Batch Processing: Processing multiple queries simultaneously can improve efficiency but might introduce slight delays for individual queries.”
“Model Serving Infrastructure.”
“Load Balancing: Distributing requests evenly across multiple servers can prevent any single server from becoming a bottleneck.”
“Caching: Storing frequently accessed responses can reduce the need for repeated computations.”
“Response Generation Techniques.”
“Beam Search vs. Greedy Search: Beam search can produce better results by considering multiple candidate sequences, but it is slower compared to greedy search, which selects the most likely next token at each step without considering alternatives.”
“Temperature and Sampling: Adjusting the randomness in response generation can affect the speed. Higher temperatures can produce more diverse outputs but may require more computation to ensure coherence.”
“Preprocessing and Postprocessing.”
“Tokenization: Efficient tokenization methods that quickly convert text into model-compatible formats can reduce preprocessing time.”
“Detokenization: Converting model outputs back into human-readable text efficiently can also impact overall response time.”
End of ChatGPT response
Those are the types of concerns that AI developers focus on.
Returning to an earlier topic about how prompt engineering techniques can impact latency, let’s see what ChatGPT has to say on that matter.
- My entered prompt: “If I invoke chain-of-thought (CoT) in my prompt, will this increase the latency of the response?”
- ChatGPT generated response:
“Yes, invoking a chain of thought in your prompt is likely to increase the latency of the response. The chain of thought technique involves prompting the model to reason through the problem step-by-step, which typically results in a more detailed and extended response. This process requires additional computation and time for several reasons.”
“Extended Processing: The model needs to generate a sequence of intermediate steps, each of which involves generating and evaluating text. This requires more computational cycles compared to generating a direct response.”
“Increased Token Count: Chain of thought prompts usually leads to longer outputs as the model elaborates on its reasoning process. Longer responses mean more tokens are processed, which increases the time taken to generate the response.”
“Complexity of Reasoning: Breaking down a problem into steps and reasoning through each step can be more complex than generating a straightforward answer. This complexity adds to the computational load and increases latency.”
End of ChatGPT response
The reply by ChatGPT affirms my earlier indication that CoT is undoubtedly going to bump up the latency due to the added effort that the AI needs to undertake.
Let’s get an example that shows this.
- My entered prompt: “Give me an example of a prompt that uses CoT versus one that doesn’t, and explain why the prompt with CoT would take longer.”
- ChatGPT generated response:
“Direct Prompt: “Why is the sky blue?”
“Chain of Thought Prompt: “Explain step-by-step why the sky appears blue during the day.”
“Direct Prompt Response. The model might quickly generate: “The sky appears blue because molecules in the air scatter blue light from the sun more than they scatter red light.”
“Chain of Thought Response: The model would generate a more detailed, step-by-step explanation: “1. Sunlight is made up of different colors of light, which are scattered in all directions by the gases and particles in the Earth’s atmosphere. Blue light is scattered more than other colors because it travels as shorter, smaller waves. When we look in any direction other than directly at the sun, we see the scattered blue light.” This detailed reasoning involves more steps and thus more computation time.”
End of ChatGPT response
I liked that the question asked was regarding the sky being blue. Anyone who has been a parent or otherwise interacted with youngsters knows that this is a favorite question of inquiring minds.
My final exploration, for now, will be to see what ChatGPT has to say about the wording of prompts that will intentionally nudge generative AI toward going faster and reducing latency.
Let’s see what happens.
- My entered prompt: “What can I say in a prompt to say that I want my prompt to be run as fast as possible?”
- ChatGPT generated response:
“To indicate that you want your prompt to be processed as quickly as possible, you can use concise language and explicitly request a brief response. Here are a few strategies and example phrases you can include in your prompt.”
“Explicit Request for Speed: Clearly state that you prioritize speed over detail. Examples include: “Provide a brief response.”, “Give a quick summary.”, “Answer concisely.”
“Limit the Scope: Ask for fewer details or limit the number of points. Examples include: “List three key points.”, “Summarize in one sentence.”, “Give a short answer.”
“Avoid Complex Requests: Simplify the prompt to avoid invoking complex reasoning or long outputs. Instead of: “Explain the process of photosynthesis in detail.” Use: “Briefly describe photosynthesis.”
“Specify Maximum Length: Indicate the maximum length for the response. Examples include: “Limit the response to 50 words.”, “Keep the answer under 100 characters.”
End of ChatGPT response
Those are reasonable rules of thumb to keep in mind.
Try to keep your speed-related instructions as short as possible. Doing so limits the extra time the AI needs to process those instructions about latency in the first place.
Conclusion
Nearly everyone knows the medical maxim commonly attributed to the Hippocratic oath: “First, do no harm.”
The same motto or pledge applies to any effort seeking to reduce latency or increase the speed of processing for prompts in generative AI. You can almost certainly go faster by using prompts that are going to undercut quality. That is a type of harm. The tradeoff at hand is to maximize quality while also minimizing the time of processing.
I guess you could say that I feel the need, the need for speed, but only as long as the speed doesn’t flatten quality and produce an undesirable result. This applies to generative AI and probably applies to flying fighter jets too.
Please be safe out there.