A comprehensive guide to troubleshooting Redis performance issues
1. Introduction
Redis is renowned for its high performance: a single instance can typically handle on the order of 100,000 operations per second. Even so, users may encounter unexpected latency issues in various scenarios:
- The same command is sometimes fast, sometimes slow
- Simple operations like SET and DEL taking unexpectedly long
- Temporary slowdowns that resolve themselves
- Sudden performance degradation after long periods of stability
This guide aims to provide a systematic troubleshooting approach for Redis performance issues.
2. Confirming Redis Slowdown
a) Isolate the problem
- Implement distributed tracing in your application
- Record response times for external dependencies, including Redis
- Identify if the Redis operation is the bottleneck
b) Eliminate network issues
- Check if all services on the same server experience similar delays
- If so, involve the network operations team
- If not, focus on Redis-specific issues
c) Establish baseline performance
Run benchmark tests with redis-cli directly on the Redis server:
Measure maximum latency:
redis-cli -h 127.0.0.1 -p 6379 --intrinsic-latency 60
Example output: max latency of 72 microseconds
Monitor latency history:
redis-cli -h 127.0.0.1 -p 6379 --latency-history -i 1
Example output: average latencies between 0.08 and 0.13 milliseconds
d) Compare suspected slow instances
- Test a known good Redis instance for baseline
- Test the suspected slow instance
- If its latency is 2x or more of the baseline, treat the instance as genuinely slow (see the sketch below)
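As a rough client-side cross-check, the same comparison can be scripted. A minimal sketch, assuming the redis-py client and two placeholder hostnames (redis-good and redis-suspect): it times a burst of PING commands against each instance and compares the average round trip.

import time
import redis

def avg_ping_ms(host, port=6379, samples=200):
    # Time a burst of PING commands and return the average round trip in milliseconds.
    r = redis.Redis(host=host, port=port, socket_timeout=1)
    start = time.perf_counter()
    for _ in range(samples):
        r.ping()
    return (time.perf_counter() - start) / samples * 1000

# "redis-good" and "redis-suspect" are placeholder hostnames, not real servers.
baseline = avg_ping_ms("redis-good")
suspect = avg_ping_ms("redis-suspect")
print(f"baseline: {baseline:.2f} ms, suspect: {suspect:.2f} ms")
if suspect >= 2 * baseline:
    print("suspect instance is at least 2x slower than the baseline")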
3. Investigating Slowdown Causes
a) Using high complexity commands
Check Redis slow log
Set slow log parameters (slowlog-log-slower-than is in microseconds, so 5000 = 5 ms):
CONFIG SET slowlog-log-slower-than 5000
CONFIG SET slowlog-max-len 500
View slow log:
SLOWLOG GET 5
Look for:
- O(N) or higher complexity commands (e.g., SORT, SUNION, ZUNIONSTORE)
- O(N) commands with large N values
- Commands returning large amounts of data
Impact:
- High CPU usage for complex operations
- Network transfer time for large data sets
- Blocking of subsequent commands due to Redis’ single-threaded nature
Solutions:
- Avoid high complexity commands; move aggregation to the client side
- Limit the amount of data returned (N <= 300 recommended)
- Use SCAN instead of KEYS for iterating over large key sets
- Monitor CPU usage - high CPU with low OPS suggests complex commands
Additional tips:
- Use pipelining to reduce network round trips (see the sketch after these tips)
- Consider using Lua scripts for complex operations
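A minimal client-side sketch of those tips, assuming the redis-py client (the user:* pattern and counts are illustrative): it iterates keys with SCAN instead of a blocking KEYS call and batches the reads through a pipeline to cut network round trips.

import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

# SCAN walks the keyspace incrementally instead of blocking the server like KEYS.
keys = list(r.scan_iter(match="user:*", count=100))

# A pipeline buffers the GETs and sends them in one batch instead of one round trip each.
pipe = r.pipeline(transaction=False)
for key in keys:
    pipe.get(key)
values = pipe.execute()
print(f"fetched {len(values)} values in a single pipelined batch")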
b) Operating on big keys
Reasons for slowdown:
- Time-consuming memory allocation for large values
- Slow memory deallocation when deleting large keys
Detecting big keys:
redis-cli -h 127.0.0.1 -p 6379 --bigkeys -i 0.01
- Scans entire keyspace, reports largest keys by type
- Shows distribution of key sizes and types
Cautions when scanning:
- Can cause OPS spikes on production systems
- May block other operations on busy systems
Solutions:
- Avoid storing big keys in applications
- Use UNLINK instead of DEL for large keys (Redis 4.0+)
- Enable the lazy-free mechanism so DEL behaves like UNLINK (lazyfree-lazy-user-del, Redis 6.0+)
- Set lazyfree-lazy-eviction, lazyfree-lazy-expire, lazyfree-lazy-server-del to yes
Best practices:
- Split large values into smaller chunks
- Use Hash data structures for large objects instead of single String keys
Hashes suit large objects better than single Strings because they allow atomic operations on individual fields, partial retrieval, and cheaper in-place updates, which improves both performance and scalability, as sketched below.
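A sketch of that approach with the redis-py client (the profile:1001 key and its fields are illustrative): the object is stored as a Hash so the client can read or update individual fields, and a big key is removed with UNLINK rather than DEL.

import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

# Store the object as a Hash so fields can be touched individually
# instead of rewriting one large String value every time.
r.hset("profile:1001", mapping={
    "name": "Alice",
    "email": "alice@example.com",
    "bio": "a long biography...",
})

# Partial retrieval and partial update move only the needed fields over the network.
email = r.hget("profile:1001", "email")
r.hset("profile:1001", "email", "alice@example.org")

# UNLINK reclaims the memory of a big key in the background (Redis 4.0+),
# so the main thread is not blocked the way it can be with DEL.
r.unlink("profile:1001")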
c) Keys expiring at the same time
Symptoms:
- Latency spikes at regular intervals
Redis expiration strategies:
- Passive: Check expiration on access
- Active: Periodic scan of expired keys
Problems:
- Active expiration runs in main thread, blocking other operations
- Many keys expiring simultaneously cause latency spikes
Detection:
- Monitor expired_keys metric in INFO stats
- Check INFO keyspace for the number of keys with TTLs per database (the expires and avg_ttl fields)
Solutions:
- Add random jitter to expiration times (e.g., expire_at = now + TTL + random(0, 300)); see the sketch at the end of this subsection
- Enable lazy-free for expired key deletion (Redis 4.0+)
- Tune the active expiry cycle (hz, plus active-expire-effort in Redis 6.0+)
Additional considerations:
- Balance between memory usage and CPU usage when tuning expiration
- Consider using SCAN to manually expire keys in batches for extreme cases
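A minimal sketch of the jitter idea from the solutions above, assuming the redis-py client: each key gets its base TTL plus a random offset of up to 300 seconds, so a batch of keys written at the same moment does not expire at the same moment. The TTL values are illustrative.

import random
import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

BASE_TTL = 3600    # intended lifetime in seconds (illustrative)
MAX_JITTER = 300   # spread expirations over an extra 0-300 seconds

def set_with_jitter(key, value):
    # Randomize the TTL so keys written together do not expire together.
    r.set(key, value, ex=BASE_TTL + random.randint(0, MAX_JITTER))

for i in range(1000):
    set_with_jitter(f"session:{i}", "payload")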
d) Memory fragmentation
Check fragmentation ratio:
INFO memory
Look for mem_fragmentation_ratio
Causes:
- Frequent creation/deletion of keys of varying sizes
- OS memory allocation strategies
Impact:
- High fragmentation ratio (>1.5) indicates inefficient memory use
- Very low ratio (<1) suggests part of Redis memory has been swapped to disk, i.e., the host is short on RAM
Solutions:
- Enable activedefrag (Redis 4.0+):
CONFIG SET activedefrag yes
- Configure active-defrag-* parameters:
CONFIG SET active-defrag-ignore-bytes 100mb
CONFIG SET active-defrag-threshold-lower 10
CONFIG SET active-defrag-threshold-upper 100
CONFIG SET active-defrag-cycle-min 25
CONFIG SET active-defrag-cycle-max 75
- Restart Redis instance to defragment memory (last resort)
Best practices:
- Use consistent key sizes when possible
- Monitor fragmentation ratio over time (see the sketch after this list)
- Use the jemalloc allocator (the default on Linux builds; active defragmentation requires it)
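A small monitoring sketch for the point above, assuming the redis-py client; the thresholds and polling interval are illustrative. It reads mem_fragmentation_ratio from INFO memory and flags values outside the healthy band.

import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

while True:
    # INFO memory reports the RSS / used-memory ratio as mem_fragmentation_ratio.
    ratio = r.info("memory")["mem_fragmentation_ratio"]
    if ratio > 1.5:
        print(f"high fragmentation ratio {ratio:.2f} - consider enabling activedefrag")
    elif ratio < 1.0:
        print(f"ratio {ratio:.2f} < 1 - memory is likely being swapped to disk")
    time.sleep(60)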
e) AOF persistence impacting performance
Problem:
- With appendfsync always, an fsync() on every write blocks the main thread; even with everysec, a slow disk can stall writes
Detection:
Check INFO persistence for the aof_* metrics (aof_delayed_fsync counts writes stalled by slow fsyncs)
Solutions:
- Consider relaxing durability guarantees
- Use “everysec” fsync policy as a compromise
- Set no-appendfsync-on-rewrite to yes to avoid fsync stalls during rewrites (see the sketch at the end of this subsection)
AOF policies:
- always: Most durable, worst performance
- everysec: Good durability, acceptable performance
- no: Best performance, risk of data loss
Additional considerations:
- Use AOF rewrite to keep AOF file size manageable
- Monitor AOF rewrite progress and impact
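A configuration sketch for the compromise described above, assuming the redis-py client (the same settings can be applied with CONFIG SET in redis-cli or in redis.conf): switch to the everysec policy, skip fsync during rewrites, and watch aof_delayed_fsync for writes stalled by a slow disk.

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Relax durability to one fsync per second and avoid fsync while a rewrite is running.
r.config_set("appendfsync", "everysec")
r.config_set("no-appendfsync-on-rewrite", "yes")

# aof_delayed_fsync counts writes that had to wait on a slow fsync;
# a steadily growing value points at disk pressure.
delayed = r.info("persistence").get("aof_delayed_fsync", 0)
print(f"delayed fsyncs so far: {delayed}")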
f) Replication issues
Problems:
- Large replication buffers on the master can cause out-of-memory conditions
- Slow replicas cause the master's buffers and backlog to grow
Monitoring:
- Check INFO replication for master_repl_offset and slave_repl_offset (see the sketch at the end of this subsection)
- Monitor repl_backlog_size
Solutions:
- Increase replication backlog size if needed
- Optimize network between master and replicas
- Consider using diskless replication for faster sync
Best practices:
- Use replication timeout (repl-timeout) to detect stuck replicas
- Configure appropriate client-output-buffer-limit for slave clients
- Use Redis Sentinel or Cluster for automatic failover
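A sketch of the offset check mentioned above, assuming the redis-py client, which parses each replica's line of INFO replication into a dictionary; the lag threshold is illustrative.

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

info = r.info("replication")
master_offset = info.get("master_repl_offset", 0)

# Each connected replica appears as a slave0, slave1, ... entry with its own offset.
for field, value in info.items():
    if field.startswith("slave") and isinstance(value, dict):
        lag_bytes = master_offset - int(value.get("offset", 0))
        print(f"{field} ({value.get('ip')}:{value.get('port')}) is {lag_bytes} bytes behind")
        if lag_bytes > 1_000_000:  # illustrative threshold
            print(f"  {field} is falling behind - check the network and replica load")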
g) CPU utilization
Monitoring:
- Use INFO CPU to check Redis CPU usage
- Monitor overall system CPU usage
Solutions:
- Distribute load across multiple Redis instances
- Use Redis Cluster for better CPU utilization
- Optimize client-side operations to reduce load on Redis
Additional tips:
- Enable the latency monitor (CONFIG SET latency-monitor-threshold 100) and use LATENCY HISTORY / LATENCY DOCTOR to profile latency events
- Use OBJECT ENCODING (or DEBUG OBJECT) to inspect how a key is encoded and stored
h) Network issues
- Test latency from Redis server itself to isolate network problems
- Use redis-cli --latency to measure network latency
- Check for network congestion or hardware issues
Considerations:
- Network interface configuration
- TCP settings (e.g., tcp-backlog)
- Use of proxies or load balancers
i) Transparent huge pages (THP)
Check if enabled:
cat /sys/kernel/mm/transparent_hugepage/enabled
Solution:
Disable THP for Redis servers:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Make the change permanent via /etc/rc.local or a systemd unit
Additional system settings to consider:
- vm.overcommit_memory
- vm.swappiness
- Review NUMA memory policy on multi-socket hosts
4. Additional Considerations
Impact of data structure choice on performance
- Strings: Fastest, but limited functionality
- Hashes: Efficient for objects, supports partial updates
- Lists: Good for queue-like data structures
- Sets: Efficient for membership checks
- Sorted Sets: Useful for leaderboards and range queries
Proper configuration of maxmemory and eviction policies
- noeviction, allkeys-lru, volatile-lru, allkeys-random, volatile-random, volatile-ttl (plus allkeys-lfu and volatile-lfu in Redis 4.0+)
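As an illustration, assuming the redis-py client (the same can be set with CONFIG SET in redis-cli or in redis.conf), a cache-style instance might cap memory and evict the least recently used keys; the 2gb limit is illustrative.

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Cap memory and evict least-recently-used keys across the whole keyspace.
r.config_set("maxmemory", "2gb")
r.config_set("maxmemory-policy", "allkeys-lru")

print(r.config_get("maxmemory*"))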
Importance of connection pooling in clients
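A brief sketch with the redis-py client, whose ConnectionPool reuses TCP connections instead of opening a new one per request; the max_connections value is illustrative and should match the application's concurrency.

import redis

# One pool per process; every client created from it shares its connections.
pool = redis.ConnectionPool(host="127.0.0.1", port=6379, max_connections=50,
                            decode_responses=True)

def get_client():
    # Cheap to call: the client borrows a pooled connection per command.
    return redis.Redis(connection_pool=pool)

client = get_client()
client.set("greeting", "hello")
print(client.get("greeting"))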
Monitoring and alerting for early detection of issues
- Set up monitoring for key metrics (CPU, memory, network, ops/sec)
- Use INFO command regularly to gather stats
- Consider tools like Redis Exporter for Prometheus
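A minimal polling sketch, assuming the redis-py client; the fields come from standard INFO output and the interval is illustrative. In practice an exporter such as Redis Exporter for Prometheus collects these continuously.

import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

while True:
    info = r.info()
    hits, misses = info["keyspace_hits"], info["keyspace_misses"]
    print(
        f"ops/sec={info['instantaneous_ops_per_sec']} "
        f"clients={info['connected_clients']} "
        f"memory={info['used_memory_human']} "
        f"hit_ratio={hits / max(1, hits + misses):.2%}"
    )
    time.sleep(10)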
Regular performance testing and capacity planning
Consideration of Redis Cluster for scaling and better resource utilization
Use of Redis modules for specific use cases (e.g., RediSearch, RedisTimeSeries)
5. Debugging and Profiling Tools
- redis-cli --stat: Real-time stats
- redis-cli --latency: Network latency testing
- redis-cli --latency-history: Latency over time
- redis-cli --latency-dist: Latency distribution
- redis-cli --bigkeys: Find large keys
- redis-cli --memkeys: Analyze memory usage of keys
- redis-cli --hotkeys: Identify frequently accessed keys (requires an LFU maxmemory policy)
- MONITOR command: Real-time log of Redis commands (use cautiously in production)
6. Best Practices
- Regular backups and disaster recovery planning
- Implement proper security measures (password, firewall, SSL/TLS)
- Keep Redis version up-to-date
- Use Redis benchmark tool for performance testing
- Implement proper error handling and retry mechanisms in clients
- Use Redis Pub/Sub with caution, as it can impact performance
7. Conclusion
- Redis performance issues can have various causes
- Systematic approach to troubleshooting is crucial
- Understanding Redis internals helps in diagnosing and resolving issues
- Regular monitoring and proactive optimization are key to maintaining high performance
- Keep learning and stay up to date with new Redis features
- Read the Redis documentation and follow the official blog for updates