Load Testing a Chatbot API: A Journey with Apache JMeter


It started with a simple question from our QA lead:
“How fast is our chatbot when 200 users start talking to it at the same time?” 

It was a good question. We had tested the chatbot for correctness, covered edge cases, and made sure it could respond sensibly. But we hadn’t tested it under pressure — the kind of pressure it might face during a product launch or a marketing campaign. 

There are many load testing tools available, such as K6, Apache JMeter, and LoadRunner, each with its own strengths.
But for our needs, JMeter stood out. It’s open-source, easy to configure, has a graphical interface, supports scripting when needed, and integrates well into CI/CD pipelines. More importantly, it allowed us to simulate realistic user behavior without writing a single line of code.

So, we turned to Apache JMeter, a powerful tool that helps simulate real-world usage patterns and understand how your system performs under load. 

Why JMeter?

Before diving in, we asked ourselves — why JMeter? 

For starters, it’s free, supports multiple protocols like HTTP, REST, and FTP, and offers a graphical interface that doesn’t require much coding. For those who want more control, scripting is available through BeanShell and JSR223. It’s also compatible with CI/CD tools like Jenkins, making it perfect for agile and DevOps workflows. 
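
In practice, that CI/CD fit comes from JMeter's non-GUI mode. As a rough sketch (the file names below are placeholders rather than our actual project files), a Jenkins step can run a saved test plan headlessly, record raw samples, and generate the HTML dashboard report in one command:

    # Run the plan without the GUI (-n), log samples to a .jtl file (-l),
    # and build the HTML dashboard report (-e) into the report/ folder (-o).
    jmeter -n -t chatbot_load_test.jmx -l results.jtl -e -o report/

The results file can then be archived as a build artifact so response-time trends remain visible across runs.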

With JMeter, we could simulate hundreds of virtual users, mimic realistic behavior, analyze response times, and quickly identify bottlenecks. 

Setting the Stage – The Test Plan

The chatbot API’s interaction model was simple:

  1. A user starts a new chat session (GET request). 
  2. Then, the user sends a query (a POST request). 

The Thread Group is configured with 200 virtual users to simulate concurrent user activity, a ramp-up period of 1 second, and a loop count set to 1. This setup mimics how multiple users would interact with the chatbot API in a real-world scenario. 
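
One common way to keep those Thread Group values configurable (a sketch of the approach, not necessarily how our plan was saved) is to read them from JMeter properties with the __P function, so the same plan can be reused at different load levels:

    Number of Threads (users):  ${__P(users,200)}
    Ramp-up period (seconds):   ${__P(rampup,1)}
    Loop Count:                 1

The defaults match the values above, and a command-line run can override them with, for example, -Jusers=400 -Jrampup=60.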

But simulating real-world behavior is rarely simple. Each interaction needed to:

  • Extract and pass dynamic session IDs using JSON Extractors.
  • Simulate real user thinking time with Constant Timers.
  • Support multi-language payloads, starting with English (“en”) and adaptable to other languages.
  • Gradually ramp up user load to avoid spiking the system instantly. 

Example Payload:
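
The original payload isn't reproduced here, so the snippet below is an illustrative sketch only: the field names (session_id, query, language) are assumptions about the API's schema, and ${sessionId} is the JMeter variable filled in by the JSON Extractor described in the next section:

    {
      "session_id": "${sessionId}",
      "query": "What are your support hours?",
      "language": "en"
    }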

Crafting the Simulation

We broke the test plan into two core components: 

1. Start a New Session – A GET request that returns a session ID, which we extract using JMeter’s JSON Extractor (see the sketch just after this list).

2. Send a User Query – A POST request using that session ID, simulating a real question from the user. 
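
As a sketch of how the two samplers are wired together (the JSON path below assumes the session endpoint returns a field named session_id, which may differ from the real response schema):

    JSON Extractor  (child of the "Start a New Session" request)
        Names of created variables:  sessionId
        JSON Path expressions:       $.session_id
        Default Values:              SESSION_NOT_FOUND

    HTTP Request  "Send a User Query"  (POST)
        Body Data references ${sessionId}, so each virtual user carries
        its own session through the conversation.

The default value makes failed extractions easy to spot in the results, because any request sent with SESSION_NOT_FOUND stands out immediately.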

We added an HTTP Header Manager to include the request headers the API expects: 
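
The original header list isn't shown here; as an assumption, a JSON-based chatbot API typically needs something like:

    Content-Type:   application/json
    Accept:         application/json
    Authorization:  Bearer ${authToken}    (only if the endpoint is secured; the token variable is hypothetical)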

To simulate user traffic, we configured a Thread Group with 200 virtual users, each performing the above actions with random delays (using Gaussian and Constant Timers). This made the simulation more life-like and less robotic.
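
The exact timer values aren't part of the original write-up, so the settings below are illustrative only, but they show how the two components differ:

    Constant Timer         Thread Delay: 2000 ms
                           (a fixed pause before every request)
    Gaussian Random Timer  Deviation: 1000 ms, Constant Delay Offset: 3000 ms
                           (a pause that varies around roughly 3 seconds per user)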

Execution and Discovery

Once our test was ready, we hit Start.

As the virtual users swarmed the chatbot API, we used JMeter’s Summary Report, Aggregate Report, and Graph Results to monitor how the system held up. 

Key Observations: 

  • New session response time: Averaged 904 ms — fast and consistent.
  • Query response time: Averaged 22 seconds, with spikes up to 30 seconds.
  • Failures: Zero, which was reassuring. 

But there was a clear performance gap between starting a session and processing a query. The backend, likely due to complex chatbot logic or third-party API calls, was struggling to keep up when the load increased. 

Using the View Results Tree, we confirmed that the API was responding correctly, just not fast enough. 

What We Learned (and What’s Next)

Our load test journey didn’t just validate system stability—it offered valuable insights into the internal behavior of our chatbot API under pressure. 

Key Learnings 

Session Creation Performed Well 

When simulating hundreds of users starting conversations, the chatbot API maintained excellent response times. Session creation averaged around 904 ms and scaled predictably even as the number of users increased. This gave us confidence that the authentication and initial session handling mechanisms are well-architected and production-ready. 

Query Handling Showed Performance Bottlenecks 

The true test of the system came when users started sending queries. 

Response times surged to an average of 22 seconds, with some outliers touching 30 seconds. Although no failures were detected, such long response times would lead to poor user experience in real scenarios. We investigated possible causes: 

  • Backend Processing Overhead: As query volume increased, backend services struggled to maintain speed. This points to potentially non-optimized workflows or excessive internal API calls. 
  • Absence of Caching: Identical or repeated queries from multiple users consistently hit backend systems without any response caching, adding unnecessary load. 
  • Heavy NLP Model Processing: Natural Language Processing (NLP) models or large language models (LLMs) are computationally expensive. The simultaneous invocation by many users could easily explain the sharp increase in latency. 

Overall, while the system passed the “survivability test,” it failed the “speed test” under high concurrency for query handling. 

Turning Learnings into Action

Having identified the weak points, we were able to outline clear next steps to enhance performance: 

  • Introduce Caching Mechanisms: Applying caching for repetitive queries will drastically reduce backend load and improve response times. 
  • Refactor Backend Processes: Streamlining internal API workflows, minimizing database calls, and using asynchronous processing where possible will cut down backend response delays. 
  • Optimize LLM/NLP Operations: Investigate batch processing of queries, reduce model calls when not strictly necessary, or explore lightweight fallback mechanisms. 
  • Improve Database Indexing: Optimizing index structures will accelerate the search and retrieval operations that likely contribute to delays during query resolution.
  • Expand Test Scenarios: Plan additional load tests with variations in network conditions, payload sizes, and concurrency levels to further harden the system. 

The Bigger Picture

Performance testing is not just about breaking the system; it’s about understanding its limits and preparing for the real world. 

Tools like JMeter empower teams to simulate large-scale traffic, pinpoint bottlenecks, and deliver fast, scalable applications — especially critical for chatbots where users expect immediate responses. 

Final Thoughts

What began as a curious question turned into a journey of exploration and insight. With Apache JMeter, we could visualize how our chatbot behaved under load, uncover hidden issues, and build a roadmap for better performance. 

Whether you’re working on APIs, web apps, or microservices — load testing should never be an afterthought.


Akash Kumar

QA Engineer


Akash Kumar is a QA Engineer with over 3 years of experience in both manual and automation testing. He specializes in CI/CD integration and has hands-on expertise with a variety of testing tools. Akash is a certified Google Cloud Professional Cloud Security Engineer, passionate about delivering secure and reliable software solutions.

