Debugging memory leaks in Node-based applications
Tracking memory leaks in Node-based applications is not something you will do every day, but when your production environment starts to eat its memory and you start receiving alerts, you should be ready to roll up your sleeves and start digging into your application.
Unfortunately, you read that right: it is usually the production environment that reveals a memory leak. You would be one of the lucky few if you discovered it during development, because in development you typically have neither a monitoring tool nor enough load to notice the problem.
Introduction
After a long weekend we returned to the office and our SRE team raised an alarm: the front-end application had almost exhausted its memory. The quick and dirty fix was to reboot the machines, but we knew we had a serious issue to solve. After some investigation we concluded that the memory leak had been there for ages, but since we released at least once a day, and every release rebooted the machines, the problem had slipped past us for a long time. One long weekend was enough to reveal it. In normal circumstances we would release on Friday and again on Monday, but this time we released on Thursday and returned to the office on Monday. In that window without a release, the application consumed enough memory to trigger the alert.
We investigated and fairly quickly figured out, at least broadly, where the problem was. Since the memory leak had been there for ages, there was no point in rolling back the last release, so we kept operating as before but assigned an engineer to track down the leak.
Dev environment setup
There are many articles explaining how to debug memory leaks in Node.js applications, but all of them use Chrome DevTools to record memory and analyze it. In our case the recordings took too long to finish, and some never finished and crashed Chrome, so we decided to try something else entirely.
We turned to node-heapdump, a package that creates a heap dump of a V8-based application for later inspection. At that time we were on Next.js 6.1.2 (already an antique version), so we added a route we could trigger manually to create a heap dump.
To create a heap dump you need to install the package with `npm install heapdump` and change your code to expose a route that writes a snapshot on demand.
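The snippet below is a minimal sketch of such a route, assuming a custom Express server (the usual setup with Next.js 6); the route path, port and file naming are illustrative rather than our exact production code:

```js
// server.js - minimal sketch, assuming a custom Express server with Next.js 6
const express = require('express');
const next = require('next');
const heapdump = require('heapdump');

const app = next({ dev: false });
const handle = app.getRequestHandler();

app.prepare().then(() => {
  const server = express();

  // Writing a snapshot is expensive and blocks the event loop,
  // so keep this behind an internal debug route.
  server.get('/heapdump', (req, res) => {
    const filename = `${Date.now()}.heapsnapshot`;
    heapdump.writeSnapshot(filename, (err, file) => {
      if (err) return res.status(500).send(err.message);
      res.send(`Heap dump written to ${file}`);
    });
  });

  // Everything else is handled by Next.js as usual.
  server.get('*', (req, res) => handle(req, res));
  server.listen(3000);
});
```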
With this route in place, every time you navigate to `/heapdump` a new snapshot file is created.
Next, navigate to `chrome://inspect`, select the "Open dedicated DevTools for Node" option, and choose the Memory tab, where you can load a `.heapsnapshot` file. Once the file is loaded, you have access to everything contained in the heap memory.
You can select any object and see references to where it is used, along with the file name.
Now we have everything in place to start looking for the memory leak.
Problem statement
By default, Next.js pre-renders every page: it generates the HTML for each page on the server in advance instead of leaving it all to client-side JavaScript. Every time a user visits the website, resources are allocated on our server, and since we had a memory leak, our servers kept eating their memory. We were trapped in a vicious circle: we wanted more users, but more users meant our servers would run out of memory faster.
Solution
Since the problem surfaced in production, we needed to configure our development environment to match production as closely as possible. We ran the application with a production build, and the only thing left was the workload. We needed to simulate users visiting our website, so we wrote a small Python script that starts N instances of the Chrome web browser and visits our homepage.
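The script itself is not shown in this post. As a rough illustration of the same idea, here is a Node sketch that does the equivalent (the original was written in Python; the `google-chrome` binary name, the port and the flags below are assumptions, not our actual script):

```js
// load.js - illustrative Node equivalent of the small load script described above.
// Assumes a `google-chrome` binary on the PATH and the production build on port 3000.
const { spawn } = require('child_process');

const number_of_threads = 20;              // how many Chrome instances to start
const homepage = 'http://localhost:3000/'; // homepage of the local production build

for (let i = 0; i < number_of_threads; i++) {
  // A separate user-data-dir forces Chrome to start an independent instance
  // instead of opening a tab in an already running one.
  spawn(
    'google-chrome',
    [`--user-data-dir=/tmp/load-test-profile-${i}`, '--new-window', homepage],
    { stdio: 'ignore', detached: true }
  );
}
```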
Running this script opens `number_of_threads` Chrome instances and navigates them to your homepage. This is how we simulated load on our application.
Every time we deployed the application, the servers were rebooted and the memory reset; over time, memory allocation would grow, meaning memory usage grew with traffic on our website. In the development environment we needed to achieve a similar effect, so we took three snapshots:
- Right after the application is started
- After first round of load (first run of the Python script above)
- After second round of load (second run of the Python script above)
This experiment gave us three `.heapsnapshot` files ready to be analyzed.
As you can see in the image above, the consumed memory increased with every snapshot: there is our memory leak. At this stage we want to see the difference between the first and the last snapshot, and for that you need to:
- select the last (biggest) snapshot
- select the Comparison mode
- select the first (smallest) snapshot
The biggest difference was in closures (events we subscribed to but never unsubscribed from). After expanding the closures you will see a lot of objects and can start digging. Going through the objects in the closures section we kept ending up at Axios objects, so we shifted our focus to them. Inspecting them one by one, we noticed that many were instantiated by the Contentful SDK.
We used Contentful a lot, for many different things, so it made perfect sense to mark it as a potential root cause. A quick code inspection showed that we created a new client object for every request towards Contentful.
We decided to implement a singleton pattern for the Contentful client: a new client object would be instantiated only if one did not already exist, otherwise the existing one would be reused. The idea was solid, but on its own it didn't work. We had multiple spaces in use, and a client instantiated with one space has no access once you need data from another space. So the solution evolved toward an object-pool-like design pattern: an object pool of singletons, where we keep a single client instance for every space we have on Contentful, as sketched below.
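A minimal sketch of that pool of singletons, assuming the official `contentful` JavaScript SDK (the module layout and helper name here are illustrative, not our exact code):

```js
// contentfulClient.js - a sketch of the "pool of singletons" approach
const { createClient } = require('contentful');

// One client instance per Contentful space, created lazily and then reused
// for every subsequent request instead of creating a new client each time.
const clients = {};

function getClient(spaceId, accessToken) {
  if (!clients[spaceId]) {
    clients[spaceId] = createClient({
      space: spaceId,
      accessToken: accessToken,
    });
  }
  return clients[spaceId];
}

module.exports = { getClient };
```

Request handlers then ask `getClient` for the space they need instead of calling `createClient` on every request, so at most one client per space is ever kept alive.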
We applied the change, ran the experiment again, and the memory leak seemed to be gone.
As before, we took three snapshots, but this time they were almost identical, which meant our system no longer allocated additional resources with every visit.
Since the memory leak was found in production, we needed to confirm our fix in production as well. We deployed the code the next day and kept an eye on memory usage. Twenty-four hours after the deployment we saw no increase in memory usage, but we let it run for several days before declaring victory. A week later we knew that was it, because the New Relic diagram showed there was no longer a memory leak in our system.
The diagram shows memory utilization over a one-month period, with a clear point in time when the memory leak was resolved and the memory usage curve flattened out.
Conclusion
Engineers often don't think about monitoring and forget to feed important information back to the monitoring tool so that their code can be profiled.
A lot of engineers overlook the fact that our work is not done when we deploy code to production. We are responsible for monitoring our code in production, measuring its impact, and improving it over time. Writing code for a few users is not the same as writing it for millions, and the lessons learned from monitoring tools are among the most valuable.
In our case monitoring tools played a critical role in discovering the problem; without them, our application would have crashed and impacted a few hundred thousand users.
Memory leaks are tricky to find, but there is real enjoyment in tracking them down and fixing them. This kind of problem pushes your boundaries and demands your full attention. You have to think differently and dig into your application's internals; you will learn how things are built and how they work, not just how to use them. Do not run away from tasks like this: the personal satisfaction when you find the leak is something you wouldn't trade for anything.