Have you ever had performance issues with your MongoDB database? A common situation is a sudden performance issue when running a query. The obvious first solution is “let’s create an index!” While this works in some cases, there are other options we need to consider when trying to optimize MongoDB.
Performance is not a matter of having big machines with very expensive disks and gigabit networks. In fact, these are not necessarily the keys to good performance.
MongoDB performance comes from good concepts, organization and data distribution. We are going to list some best practices for good MongoDB optimization. This is not an exhaustive or complete guide, as there are many variables. But this is a good start.
Keep documents simple
MongoDB is a schema-free database. This means there is no predefined schema by default. We can add a predefined schema in newer versions, but it is not mandatory. Be aware of the difficulties involved when working with embedded documents and arrays, as it can become really complicated to parse your data on the application side or in an ETL process. Besides, arrays can hurt replication performance: for every change in the array, all the array values are replicated!
In MMAPv1, choosing the right field names is really important because the database needs to save the field names inside every document. It is not like a relational database, where the schema is saved once. Let’s imagine how much data a field called “lastmessagereceivedfromsensor” (29 characters) costs you if you have a million documents: around 28 MB just to save this field name! A collection with ten fields of similar name length would demand around 280 MB (just to save otherwise empty documents).
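The arithmetic above can be reproduced with a rough sketch like the one below. It assumes no compression and counts only the characters of each field name (BSON also stores a type byte and a terminating null byte per field, which are ignored here). The shorter name “lastMsg” is a hypothetical alternative used only for comparison.

```python
def field_name_overhead_bytes(field_names, document_count):
    """Bytes spent on field names alone, assuming no compression.

    Only the UTF-8 length of each name is counted, to match the rough
    estimate in the text.
    """
    per_document = sum(len(name.encode("utf-8")) for name in field_names)
    return per_document * document_count


docs = 1_000_000
long_name = "lastmessagereceivedfromsensor"  # field name from the text
short_name = "lastMsg"                       # hypothetical shorter name

long_cost = field_name_overhead_bytes([long_name], docs)
short_cost = field_name_overhead_bytes([short_name], docs)

print(f"{long_name}: {long_cost / 1024**2:.1f} MiB")
print(f"{short_name}: {short_cost / 1024**2:.1f} MiB")
```

Shorter names reduce this overhead, but readability still matters, so treat the numbers as a guide rather than a rule.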
Documents approaching the 16 MB maximum document size aren’t desirable, as the database will need a lot of pages to work on one single document. This demands more CPU cycles to finish any operation.
Hardware is important but…
Using good hardware with several processors and a considerable amount of memory definitely helps performance.
WiredTiger takes advantage of multiple processors to deliver good performance. This storage engine uses per-document locking, so many operations can run concurrently across many processors (there is a ticket limitation, but this is out of this blog’s scope). The MMAPv1 storage engine, however, has to lock per collection and sometimes cannot take advantage of multiple processors for writes.
But what could happen in an environment with three big machines (32 CPUs, 128 GB of RAM and a 2 TB disk each) when one instance dies? The answer is that it will fail over, and the drivers are smart enough to read from the healthy instances and write to the new primary. However, your performance will not be the same.
It is not a universal rule, but having multiple small or medium machines in a distributed environment can ensure that an outage affects only a small part of the sharded cluster, with little or no impact perceived by the application. At the same time, more machines imply a higher probability of having a failure. Consider this tradeoff when designing your environment. The right choices affect performance.
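To put rough numbers on this tradeoff, here is a simplified sketch. It assumes that every machine carries an equal share of the load and that machines fail independently with the same probability, which real deployments only approximate. The 1% failure probability is an illustrative value, not a measurement.

```python
def capacity_after_one_failure(machines: int) -> float:
    """Fraction of total capacity left when one of `machines` equal nodes dies."""
    if machines < 1:
        raise ValueError("machines must be at least 1")
    return (machines - 1) / machines


def prob_at_least_one_failure(machines: int, p_single: float) -> float:
    """Probability that at least one node fails, assuming independent failures."""
    return 1 - (1 - p_single) ** machines


p = 0.01  # illustrative per-machine failure probability (assumption)
for n in (3, 10):
    print(
        f"{n} machines: {capacity_after_one_failure(n):.0%} capacity left "
        f"after one failure, {prob_at_least_one_failure(n, p):.1%} chance "
        f"of at least one failure"
    )
```

Fewer, bigger machines lose a larger share of capacity per outage, while many smaller machines see outages more often but each one hurts less.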
Read preference and WriteConcern
The read preference and write concern vary according to a company’s requirements. But please keep the defaults in mind: the default read preference is “primary”, and while MongoDB 3.6 still defaults to a write concern of w: 1, MongoDB 5.0 and later default to writeConcern: “majority”.
With “majority”, a write must be acknowledged by at least floor(N/2)+1 instances, where N is the number of instances in the replica set. This can be slow. However, it is a fair trade-off of speed for consistency.
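As a quick check of the majority formula, here is a minimal sketch. It treats N simply as the number of instances, as the text does.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a majority: floor(N/2) + 1."""
    if members < 1:
        raise ValueError("a replica set needs at least one member")
    return members // 2 + 1


for n in (1, 3, 4, 5, 7):
    print(f"{n} members -> {majority(n)} acknowledgements needed")
```

Note that a four-member set needs three acknowledgements, the same as a five-member set, which is one reason odd member counts are common.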
Please make sure you’re using the most appropriate read preference and write concern for your company. By default, drivers read from the primary, but if that is not a requirement for your environment, consider distributing the queries among the other instances. If you don’t, the secondaries are only there for failover and won’t get used in regular operation.
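One common place to set both options is the connection string. The sketch below only builds the URI text with the standard library; the host names and the replica set name “rs0” are hypothetical, while replicaSet, readPreference and w are standard MongoDB connection string options.

```python
from urllib.parse import urlencode


def build_mongo_uri(hosts, replica_set, read_preference="primary", w="majority"):
    """Build a MongoDB connection string with explicit read/write settings."""
    options = urlencode(
        {"replicaSet": replica_set, "readPreference": read_preference, "w": w}
    )
    return f"mongodb://{','.join(hosts)}/?{options}"


# Hypothetical hosts, used only for illustration.
hosts = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"]

# Reporting traffic that tolerates slightly stale data can prefer secondaries.
reporting_uri = build_mongo_uri(hosts, "rs0", read_preference="secondaryPreferred")
print(reporting_uri)
```

Keeping a separate URI for reporting or ETL traffic lets the main application continue to read from the primary while heavier reads go to the secondaries.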
How big is the working set? Usually, an application doesn’t use all the data. Some data is updated often, while other data isn’t.
Does your working data set fit in RAM? Optimal performance occurs when the whole working data set is in RAM. Some slowness, such as that caused by page faults, can hurt performance depending on your workload.
Reads, such as backup, ETL or reporting from primaries, can really hurt performance as there is competition to have pages in cache. The same is true for large reports or aggregation.
Having multiple collections for multiple purposes and using specific machines for specific purposes (such as using zones to store documents that will no longer be used) will help keep the working set simple and predictable.
Are you monitoring your system? Can you tell the difference in performance from last week to this week?
If you are not using any monitoring system and want to use a free tool, we highly recommend Percona Monitoring and Management (PMM), which monitors MongoDB, MySQL and PostgreSQL. With a GUI monitoring system, it is easy to see activity patterns and isolate instances at a specific point in time. Keeping the MongoDB log files also helps to understand what an instance is doing (as all slow queries taking longer than 100 ms are logged by default).