While working on multi-agent systems with CrewAI, I initially used the OpenAI API for my use case, but it turned out to be quite expensive, so I had to search for a more cost-effective alternative.
I decided to download Ollama on my system to run LLMs locally (you can also use Groq's free API). However, I faced a few issues with this approach. Firstly, Ollama consumed a significant amount of disk space. Secondly, the inference time was much longer than I had anticipated for my use case, even though my PC has an NVIDIA RTX 3050 Ti, and its temperature consistently exceeded 92°C during inference. Hence, I had to look for another approach.
That's when I discovered a way to use Ollama on Google Colab. One of the major benefits of this method is the ability to leverage Google Colab's accelerators, such as the T4 GPU or TPU v2, which significantly speeds up inference. Additionally, using Colab kept my PC's temperature under control.
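For context, here is a minimal sketch of the kind of setup cell this involves on Colab. It assumes the official install script from ollama.com and Ollama's default local endpoint; you could equally run these as `!` shell commands in a notebook cell:

```python
import subprocess, time

# Install Ollama using the official install script (run as a shell pipeline).
subprocess.run("curl -fsSL https://ollama.com/install.sh | sh", shell=True, check=True)

# Start the Ollama server in the background and give it a few seconds to come up.
server = subprocess.Popen(["ollama", "serve"])
time.sleep(5)

# Pull the model weights (the ~4.7GB llama3 8B model in my case).
subprocess.run(["ollama", "pull", "llama3"], check=True)
```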
Another remarkable advantage is the download speed on Google Colab. I observed download speeds of 180 to 200 MB/s (megabytes per second). For instance, the 4.7GB Llama 3 model with 8 billion parameters downloaded in less than 1 minute and 30 seconds. This is because Google's servers have optimized, high-bandwidth connections and efficient data-transfer paths, which are much faster than typical consumer internet connections.
With these benefits, I can download and test multiple LLM models, such as Llama 3, Phi 3, WizardLM 2, Qwen 2, Mixtral, and others, to determine which model performs best for my use case.
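A rough way to run that comparison is to pull each candidate and time the same prompt against it. The sketch below assumes the models are available under those tags in the Ollama library and that the server from the setup cell above is still running:

```python
import subprocess, time
from langchain_community.llms import Ollama

# Example tags from the Ollama library; swap in whichever models you want to compare.
candidates = ["llama3", "phi3", "qwen2"]

for tag in candidates:
    # Download the model (fast on Colab thanks to the bandwidth mentioned above).
    subprocess.run(["ollama", "pull", tag], check=True)

    llm = Ollama(model=tag)
    start = time.time()
    answer = llm.invoke("Tell me a joke")
    print(f"{tag}: {time.time() - start:.1f} s")
    print(answer, "\n")
```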
To provide a clearer picture, here's a comparison of inference times between my local PC and Google Colab (T4 GPU) for the following script:
```python
from langchain_community.llms import Ollama
import time

# Point LangChain at the locally running Ollama server (default: localhost:11434)
llama3 = Ollama(model="llama3")

# Time a single inference call
start_time = time.time()
joke = llama3.invoke("Tell me a joke")
end_time = time.time()

print(joke)
print(f"Time taken to execute script: {end_time - start_time} seconds")
```
| Run | Local PC Inference Time | Google Colab (T4) Inference Time |
|---|---|---|
| 1st Try | 11 seconds | 10 seconds |
| 2nd Try | 6.3 seconds | 6 seconds |
| 3rd Try | 8 seconds | 1 second |
| 4th Try | 9 seconds | 1 second |

The improvement in inference time after the first try is due to caching: once the model has been loaded into memory, subsequent calls skip that overhead.
While the improvement might not look like much here, for a larger use case it can save a significant amount of time. For example, in my CrewAI use case, inference took around 8 to 10 minutes with Llama 3 on my local PC, whereas it took a maximum of 2 minutes on Google Colab.
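For completeness, here is a minimal, hypothetical sketch of wiring an Ollama-served model into CrewAI; the agent and task text are placeholders rather than my actual setup, and note that the CrewAI version I used accepted a LangChain LLM directly via the `llm` parameter, which newer releases may handle differently:

```python
from crewai import Agent, Task, Crew
from langchain_community.llms import Ollama

# The model served by the Ollama instance running in the Colab session.
llm = Ollama(model="llama3")

# Placeholder agent definition for illustration only.
researcher = Agent(
    role="Researcher",
    goal="Summarize a topic in three bullet points",
    backstory="A concise research assistant.",
    llm=llm,
)

# Placeholder task for illustration only.
task = Task(
    description="Summarize the benefits of running LLMs on Google Colab.",
    expected_output="Three short bullet points.",
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[task])
print(crew.kickoff())
```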