Memory Management: NUMA, Huge Pages, and Memory Compaction in Production Systems
Deep technical analysis of advanced memory management techniques in production systems, covering NUMA optimization in MySQL, huge pages in Redis, and memory compaction in Linux kernels with real-world performance benchmarks.
Modern server systems with multiple CPU sockets and large memory configurations present complex memory management challenges that can significantly impact application performance. Non-Uniform Memory Access (NUMA), huge pages, and memory compaction are critical techniques for optimizing memory performance in production environments. This comprehensive analysis examines these advanced memory management concepts with real-world implementations, performance benchmarks, and production deployment strategies.
The NUMA Architecture Challenge
NUMA systems organize memory into multiple nodes, each associated with specific CPU sockets. Accessing local memory is significantly faster than accessing remote memory, creating performance implications for applications that don’t account for NUMA topology.
NUMA Topology and Access Patterns
NUMA Node Structure:
- Local Memory: Memory attached to the same CPU socket (fastest access)
- Remote Memory: Memory attached to different CPU sockets (slower access)
- Interconnect: High-speed interconnects (QPI, UPI) for cross-socket communication
- NUMA Distance: Latency and bandwidth characteristics between nodes
Access Latency Characteristics:
- Local Memory: 100-200 nanoseconds access time
- Remote Memory: 200-400 nanoseconds access time
- Cross-Socket: Additional latency for cache coherency protocols
- NUMA Distance Matrix: Quantifies relative access costs between nodes
Memory Bandwidth Considerations:
- Local Bandwidth: 50-100 GB/s per socket
- Remote Bandwidth: 20-50 GB/s across sockets
- Interconnect Bandwidth: Limited by QPI/UPI bandwidth
- Contention: Shared interconnect can cause bandwidth contention
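Before optimizing for NUMA, it helps to inspect the actual topology. A minimal sketch using libnuma (link with -lnuma) that prints the node count and the kernel's distance matrix, where 10 denotes local access and larger values proportionally higher cost:

// Query NUMA topology with libnuma (link with -lnuma).
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    printf("NUMA nodes: %d\n", nodes);
    // Print the SLIT distance matrix: 10 = local, larger = more costly.
    for (int i = 0; i < nodes; i++) {
        for (int j = 0; j < nodes; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}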
Real-World NUMA Implementations
MySQL with NUMA Optimization
- Performance Impact: 2-3x performance improvement with NUMA awareness
- Configuration: innodb_numa_interleave=1 for the InnoDB buffer pool
- Memory Allocation: NUMA-aware memory allocation for buffer pools
- Thread Affinity: Bind MySQL threads to specific NUMA nodes
Implementation Details:
// Illustrative NUMA-aware allocation in the style of InnoDB's buffer
// pool setup (libnuma; link with -lnuma).
#include <numa.h>
#include <sched.h>

void *innodb_numa_alloc(size_t size) {
    // Map the current CPU to its NUMA node via libnuma rather than
    // guessing from a cores-per-socket constant.
    int node = numa_node_of_cpu(sched_getcpu());
    void *ptr = (node >= 0) ? numa_alloc_onnode(size, node) : NULL;
    if (ptr == NULL) {
        // Fall back to interleaving across all nodes if the local
        // node cannot satisfy the request.
        ptr = numa_alloc_interleaved(size);
    }
    return ptr;
}
Redis with NUMA Optimization
- Performance Impact: 30-50% performance improvement
- Configuration: --memory-allocator-malloc with NUMA awareness
- Memory Pool: NUMA-aware memory pools for different data types
- Thread Binding: Bind Redis threads to specific NUMA nodes
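As a minimal sketch of the thread-binding idea (illustrative, not Redis source): pin the calling thread to a node's CPUs with libnuma and prefer that node for its subsequent allocations.

// Hypothetical helper: pin the calling thread to a NUMA node and
// prefer that node for its future memory allocations (libnuma).
#include <numa.h>

int bind_thread_to_node(int node) {
    if (numa_available() < 0)
        return -1;
    // Restrict the calling thread to the CPUs of this node.
    if (numa_run_on_node(node) != 0)
        return -1;
    // First-touch allocations from this thread now prefer this node.
    numa_set_preferred(node);
    return 0;
}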
Performance Characteristics:
- Latency: 20-40% reduction in memory access latency
- Throughput: 30-50% improvement in operations per second
- CPU Usage: 15-25% reduction in CPU usage
- Memory Efficiency: Better memory utilization across NUMA nodes
NUMA Optimization Techniques
Memory Allocation Strategies:
- Local Allocation: Allocate memory on the same NUMA node as the CPU
- Interleaved Allocation: Distribute memory across all NUMA nodes
- First-Touch Policy: Allocate memory on the first CPU to access it
- Bind Policy: Explicitly bind memory to specific NUMA nodes
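These policies map onto the Linux mbind(2)/set_mempolicy(2) interface (wrapped by libnuma). A minimal sketch that binds a fresh mapping to node 0 before first touch; the single-node mask is illustrative, and MPOL_INTERLEAVE would stripe the region across the mask instead:

// Illustrative use of mbind(2): bind an anonymous mapping to node 0
// before first touch so its pages are allocated there (link with -lnuma).
#include <numaif.h>
#include <sys/mman.h>

void *alloc_on_node0(size_t len) {
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return NULL;
    unsigned long nodemask = 1UL << 0;  // node 0 only (illustrative)
    // MPOL_BIND pins the region; MPOL_INTERLEAVE would stripe it.
    mbind(addr, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
    return addr;
}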
Thread and Process Affinity:
- CPU Affinity: Bind threads to specific CPU cores
- NUMA Affinity: Bind threads to specific NUMA nodes
- Memory Affinity: Allocate memory on the same NUMA node as threads
- Load Balancing: Balance load across NUMA nodes
Application-Level Optimizations:
- Data Locality: Keep related data on the same NUMA node
- Workload Partitioning: Partition workloads by NUMA node
- Memory Pooling: Use NUMA-aware memory pools
- Cache Optimization: Optimize for NUMA-aware cache hierarchies
Huge Pages: Reducing TLB Misses
Translation Lookaside Buffer (TLB) misses are a significant source of performance overhead in memory-intensive applications. Huge pages reduce TLB pressure by mapping larger memory regions with single TLB entries.
TLB and Page Size Impact
TLB Characteristics:
- TLB Size: 64-1024 entries for L1 TLB, 1024-4096 entries for L2 TLB
- Page Size: 4KB standard pages, 2MB huge pages, 1GB gigantic pages
- TLB Coverage: 256KB-4MB with 4KB pages, 128MB-8GB with 2MB pages
- Miss Penalty: 10-100 CPU cycles for TLB miss handling
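For example, a 64-entry L1 TLB covers 64 × 4 KB = 256 KB of address space with standard pages but 64 × 2 MB = 128 MB with huge pages, which is why TLB-bound workloads benefit so sharply from larger pages.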
Page Size Trade-offs:
- Memory Efficiency: 4KB pages provide better memory efficiency
- TLB Efficiency: 2MB pages provide better TLB efficiency
- Fragmentation: Larger pages can cause internal fragmentation
- Allocation: Larger pages require contiguous physical memory
Performance Impact:
- TLB Misses: 50-90% reduction in TLB misses with huge pages
- Memory Access: 10-30% improvement in memory access performance
- Cache Efficiency: Better cache utilization with larger pages
- Context Switching: Reduced TLB flushing during context switches
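For explicit (non-transparent) huge pages, an application can request 2MB-backed mappings directly. A minimal sketch with a fallback to regular pages, assuming huge pages have been reserved as described later in this section (len should be a multiple of the huge page size):

// Map a buffer backed by explicit 2MB huge pages, falling back to
// regular 4KB pages if no huge pages are reserved or available.
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

void *alloc_huge(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        // No huge pages available: fall back to standard pages.
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? NULL : p;
}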
Real-World Huge Pages Implementations
Redis with Huge Pages
- Configuration: --transparent-huge-pages yes for automatic huge pages
- Performance Impact: 20-40% performance improvement
- Memory Usage: 10-20% reduction in memory usage
- Latency: 15-30% reduction in operation latency
Implementation Details:
// Enable transparent huge pages system-wide via sysfs (requires root).
#include <fcntl.h>
#include <unistd.h>

void redisEnableTransparentHugePages(void) {
    // Ask the kernel to back anonymous memory with huge pages.
    int fd = open("/sys/kernel/mm/transparent_hugepage/enabled", O_WRONLY);
    if (fd >= 0) {
        write(fd, "always", 6);
        close(fd);
    }
    // Allow the kernel to compact memory to satisfy huge page faults.
    fd = open("/sys/kernel/mm/transparent_hugepage/defrag", O_WRONLY);
    if (fd >= 0) {
        write(fd, "always", 6);
        close(fd);
    }
}
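Writing "always" system-wide is a blunt instrument; with THP set to "madvise", a process can instead opt in per memory region, as in this sketch:

// Per-region alternative: with THP in "madvise" mode, hint that a
// specific allocation should be backed by transparent huge pages.
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

int hint_thp(void *addr, size_t len) {
    // addr/len should be 2MB-aligned for the hint to be effective.
    return madvise(addr, len, MADV_HUGEPAGE);
}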
PostgreSQL with Huge Pages
- Configuration: huge_pages = on for shared memory
- Performance Impact: 15-25% performance improvement
- Memory Management: Better memory management for large databases
- Buffer Pool: Huge pages for shared buffer pool
Performance Characteristics:
- Query Performance: 15-25% improvement in query execution time
- Memory Access: 20-40% improvement in memory access performance
- Concurrency: Better performance under high concurrency
- Scalability: Better scalability with large memory configurations
Huge Pages Optimization Techniques
Transparent Huge Pages (THP):
- Automatic: Kernel automatically promotes pages to huge pages
- Configuration: /sys/kernel/mm/transparent_hugepage/enabled
- Defragmentation: Automatic defragmentation for huge page allocation
- Monitoring: Monitor huge page usage and effectiveness
Explicit Huge Pages:
- Pre-allocation: Pre-allocate huge pages at boot time
- Configuration: hugepagesz=2M hugepages=1024 in kernel parameters
- Application: Applications explicitly request huge pages
- Management: Manual management of huge page allocation
Memory Allocation Strategies:
- Huge Page Pools: Dedicated pools for huge page allocation
- Fallback: Fallback to regular pages when huge pages unavailable
- Fragmentation: Handle fragmentation in huge page allocation
- Monitoring: Monitor huge page usage and performance impact
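For the monitoring point above, the kernel exposes its huge page counters in /proc/meminfo; a small sketch that filters them:

// Print the kernel's huge page counters from /proc/meminfo
// (HugePages_Total, HugePages_Free, AnonHugePages, Hugepagesize, ...).
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return 1;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "HugePages") || strstr(line, "Hugepagesize"))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}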
Memory Compaction: Defragmenting Physical Memory
Memory compaction is a kernel mechanism that defragments physical memory by moving pages to create larger contiguous regions, essential for allocating huge pages and reducing fragmentation.
Memory Fragmentation Problem
Fragmentation Types:
- External Fragmentation: Free memory exists but not contiguous
- Internal Fragmentation: Allocated memory larger than requested
- Page Fragmentation: Scattered pages across physical memory
- Huge Page Fragmentation: Insufficient contiguous memory for huge pages
Fragmentation Causes:
- Dynamic Allocation: Frequent allocation and deallocation
- Page Migration: Pages moved between NUMA nodes
- Memory Reclaim: Memory reclaim can cause fragmentation
- Workload Patterns: Specific workload patterns that cause fragmentation
Performance Impact:
- Allocation Failures: Inability to allocate large contiguous regions
- Huge Page Failures: Failure to allocate huge pages
- Memory Pressure: Increased memory pressure due to fragmentation
- Performance Degradation: Reduced performance due to fragmentation
Real-World Memory Compaction Implementations
Linux Kernel Memory Compaction
- Compaction Daemon: kcompactd performs background compaction, woken by kswapd under memory pressure
- Compaction Zones: Zone-based compaction for different memory types
- Migration: Page migration for defragmentation
- Monitoring: /proc/vmstat for compaction statistics
Implementation Details:
// Simplified sketch of the kernel's zone compaction loop: a migration
// scanner walks forward from the start of the zone while a free-page
// scanner walks backward from the end, and movable pages are migrated
// into free pages until the two scanners meet.
// (isolate_and_migrate() stands in for the real
// isolate_migratepages()/migrate_pages() sequence.)
static int compact_zone(struct zone *zone, struct compact_control *cc)
{
    cc->migrate_pfn = zone->zone_start_pfn;  /* migration scanner: low end */
    cc->free_pfn = zone_end_pfn(zone);       /* free scanner: high end */

    while (cc->migrate_pfn < cc->free_pfn) {
        if (!isolate_and_migrate(zone, cc))
            break;
    }
    return COMPACT_COMPLETE;
}
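From userspace, an administrator can force an immediate, system-wide compaction pass by writing to /proc/sys/vm/compact_memory (root required); a minimal sketch:

// Trigger an immediate, system-wide compaction pass (requires root).
// Equivalent to: echo 1 > /proc/sys/vm/compact_memory
#include <fcntl.h>
#include <unistd.h>

int trigger_compaction(void) {
    int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "1", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}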
MySQL with Memory Compaction
- Configuration: innodb_buffer_pool_size with huge pages
- Compaction: Automatic compaction for buffer pool allocation
- Monitoring: Monitor compaction effectiveness
- Tuning: Tune compaction parameters for optimal performance
Performance Characteristics:
- Allocation Success: 90-95% success rate for huge page allocation
- Compaction Overhead: 5-15% CPU overhead for compaction
- Memory Efficiency: 10-20% improvement in memory efficiency
- Performance: 15-25% improvement in overall performance
Memory Compaction Optimization Techniques
Compaction Tuning:
- Compaction Threshold: Tune when compaction is triggered
- Compaction Frequency: Control how often compaction runs
- Compaction Intensity: Control how aggressively compaction runs
- Compaction Zones: Configure compaction for specific memory zones
Application-Level Optimizations:
- Memory Pooling: Use memory pools to reduce fragmentation
- Allocation Patterns: Optimize allocation patterns to reduce fragmentation
- Memory Reuse: Reuse memory to reduce allocation/deallocation cycles
- Huge Page Awareness: Design applications to be aware of huge pages
Monitoring and Debugging:
- Compaction Statistics: Monitor compaction success rates
- Fragmentation Metrics: Track fragmentation levels
- Performance Impact: Measure performance impact of compaction
- Tuning: Use metrics to tune compaction parameters
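The compaction statistics referenced above are exported as compact_* counters in /proc/vmstat; a small sketch that extracts them:

// Dump the compaction counters from /proc/vmstat, e.g. compact_stall,
// compact_success, compact_fail, compact_migrate_scanned.
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f)
        return 1;
    char line[128];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "compact_", 8) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}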
Performance Analysis and Benchmarks
NUMA Performance Impact
Memory Access Latency:
- Local Memory: 100-200 nanoseconds
- Remote Memory: 200-400 nanoseconds
- NUMA Optimization: 20-40% reduction in average latency
- Application Impact: 15-30% improvement in application performance
Memory Bandwidth:
- Local Bandwidth: 50-100 GB/s per socket
- Remote Bandwidth: 20-50 GB/s across sockets
- NUMA Optimization: 30-50% improvement in effective bandwidth
- Scalability: Better scalability with NUMA-aware applications
CPU Utilization:
- NUMA-Aware: 15-25% reduction in CPU usage
- Memory Contention: Reduced memory contention across sockets
- Cache Efficiency: Better cache utilization with NUMA awareness
- Thread Efficiency: Better thread efficiency with proper affinity
Huge Pages Performance Impact
TLB Performance:
- TLB Misses: 50-90% reduction in TLB misses
- TLB Coverage: up to 512x more address space covered per TLB entry (2MB vs 4KB pages)
- Memory Access: 10-30% improvement in memory access performance
- Context Switching: Reduced TLB flushing overhead
Application Performance:
- Database Performance: 15-25% improvement in database performance
- Cache Performance: 20-40% improvement in cache performance
- Memory-Intensive Apps: 30-50% improvement in memory-intensive applications
- Latency: 15-30% reduction in operation latency
Memory Efficiency:
- Memory Usage: 10-20% reduction in memory usage
- Fragmentation: Reduced memory fragmentation
- Allocation: More efficient memory allocation
- Reclaim: Better memory reclaim efficiency
Memory Compaction Performance Impact
Allocation Success:
- Huge Page Allocation: 90-95% success rate
- Large Allocations: 80-90% success rate for large allocations
- Fragmentation: 50-70% reduction in fragmentation
- Memory Pressure: Reduced memory pressure
Compaction Overhead:
- CPU Overhead: 5-15% CPU overhead for compaction
- Memory Overhead: 1-5% memory overhead for compaction
- I/O Overhead: Minimal I/O overhead for compaction
- Latency: 1-5% increase in allocation latency
Overall Performance:
- Application Performance: 15-25% improvement in overall performance
- Memory Efficiency: 10-20% improvement in memory efficiency
- Scalability: Better scalability with large memory configurations
- Reliability: Improved reliability with reduced allocation failures
Production Deployment Strategies
NUMA Configuration
Hardware Configuration:
- NUMA Topology: Understand NUMA topology and distances
- Memory Configuration: Configure memory per NUMA node
- CPU Configuration: Configure CPU cores per NUMA node
- Interconnect: Ensure adequate interconnect bandwidth
Software Configuration:
- NUMA Policy: Configure NUMA memory allocation policies
- Thread Affinity: Bind threads to specific NUMA nodes
- Memory Allocation: Use NUMA-aware memory allocation
- Monitoring: Monitor NUMA performance and utilization
Application Configuration:
- NUMA Awareness: Design applications to be NUMA-aware
- Data Locality: Keep related data on the same NUMA node
- Workload Partitioning: Partition workloads by NUMA node
- Performance Tuning: Tune applications for NUMA performance
Huge Pages Configuration
Kernel Configuration:
- Huge Page Support: Enable huge page support in kernel
- Transparent Huge Pages: Configure transparent huge pages
- Huge Page Sizes: Configure available huge page sizes
- Huge Page Allocation: Configure huge page allocation policies
Application Configuration:
- Huge Page Requests: Request huge pages in applications
- Memory Allocation: Use huge page-aware memory allocation
- Fallback: Implement fallback to regular pages
- Monitoring: Monitor huge page usage and effectiveness
System Configuration:
- Huge Page Pools: Configure huge page pools
- Memory Defragmentation: Configure memory defragmentation
- Compaction: Configure memory compaction
- Monitoring: Monitor huge page performance
Memory Compaction Configuration
Kernel Configuration:
- Compaction Support: Enable memory compaction support
- Compaction Daemon: Configure compaction daemon
- Compaction Zones: Configure compaction zones
- Compaction Parameters: Tune compaction parameters
Application Configuration:
- Memory Patterns: Optimize memory allocation patterns
- Memory Pools: Use memory pools to reduce fragmentation
- Huge Page Awareness: Design for huge page allocation
- Monitoring: Monitor fragmentation and compaction
System Configuration:
- Compaction Threshold: Configure compaction thresholds
- Compaction Frequency: Configure compaction frequency
- Compaction Intensity: Configure compaction intensity
- Monitoring: Monitor compaction performance
Monitoring and Troubleshooting
Performance Monitoring
NUMA Monitoring:
- NUMA Statistics: Monitor NUMA memory access patterns
- NUMA Distance: Monitor NUMA distance and latency
- NUMA Utilization: Monitor NUMA node utilization
- NUMA Performance: Monitor NUMA performance impact
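Per-node access counters (numa_hit, numa_miss, numa_foreign, local_node, other_node) are exported in sysfs; a small sketch that reads them for node 0:

// Read per-node NUMA counters for node 0 from sysfs.
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/sys/devices/system/node/node0/numastat", "r");
    if (!f)
        return 1;
    char name[64];
    unsigned long long value;
    // Each line is "<counter> <value>".
    while (fscanf(f, "%63s %llu", name, &value) == 2)
        printf("%s = %llu\n", name, value);
    fclose(f);
    return 0;
}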
Huge Pages Monitoring:
- Huge Page Usage: Monitor huge page allocation and usage
- TLB Performance: Monitor TLB miss rates and performance
- Memory Access: Monitor memory access performance
- Application Performance: Monitor application performance impact
Memory Compaction Monitoring:
- Compaction Statistics: Monitor compaction success rates
- Fragmentation: Monitor memory fragmentation levels
- Allocation Success: Monitor allocation success rates
- Performance Impact: Monitor compaction performance impact
Troubleshooting Common Issues
NUMA Issues:
- Remote Memory Access: High remote memory access rates
- NUMA Imbalance: Uneven NUMA node utilization
- Performance Degradation: Performance degradation due to NUMA
- Thread Affinity: Incorrect thread affinity configuration
Huge Pages Issues:
- Allocation Failures: Huge page allocation failures
- Fragmentation: Memory fragmentation preventing huge page allocation
- Performance Issues: Performance issues with huge pages
- Configuration Issues: Incorrect huge page configuration
Memory Compaction Issues:
- Compaction Failures: Memory compaction failures
- High Fragmentation: High memory fragmentation levels
- Allocation Failures: Memory allocation failures
- Performance Degradation: Performance degradation due to compaction
Future Directions and Research
Emerging Technologies
Memory Technologies:
- Persistent Memory: Intel Optane and other persistent memory technologies
- High-Bandwidth Memory: HBM and other high-bandwidth memory technologies
- Memory Disaggregation: Disaggregated memory architectures
- Memory Compression: Hardware memory compression
NUMA Evolution:
- NUMA Topologies: New NUMA topologies and architectures
- NUMA Optimization: Advanced NUMA optimization techniques
- NUMA Awareness: Better NUMA awareness in applications
- NUMA Monitoring: Advanced NUMA monitoring and analysis
Memory Management:
- Advanced Compaction: More sophisticated memory compaction algorithms
- Memory Defragmentation: Better memory defragmentation techniques
- Memory Allocation: Improved memory allocation strategies
- Memory Monitoring: Better memory monitoring and analysis
Research Areas
Performance Optimization:
- NUMA Optimization: Advanced NUMA optimization techniques
- Huge Pages: Better huge page management and allocation
- Memory Compaction: Improved memory compaction algorithms
- Memory Allocation: Better memory allocation strategies
Scalability:
- Large-Scale Systems: Memory management for large-scale systems
- Distributed Memory: Distributed memory management
- Memory Sharing: Efficient memory sharing across processes
- Memory Migration: Dynamic memory migration
Security:
- Memory Security: Memory security and isolation
- Memory Encryption: Memory encryption and protection
- Memory Integrity: Memory integrity checking
- Memory Access Control: Fine-grained memory access control
Best Practices for Production Deployment
Design Principles
Performance First:
- Measure: Always measure performance before and after optimization
- Profile: Use profiling tools to identify memory bottlenecks
- Optimize: Optimize the most critical memory access patterns first
Reliability:
- Error Handling: Implement comprehensive error handling for memory operations
- Testing: Test with realistic workloads and memory pressure scenarios
- Monitoring: Monitor memory performance and health
Maintainability:
- Documentation: Document memory configuration and optimization procedures
- Logging: Implement comprehensive logging for memory operations
- Debugging: Provide debugging tools and procedures
Implementation Guidelines
Code Quality:
- Standards: Follow coding standards and best practices
- Testing: Implement unit tests and integration tests
- Review: Code review for quality and performance
Configuration:
- Tuning: Tune memory parameters for optimal performance
- Validation: Validate memory configuration before deployment
- Documentation: Document memory configuration options and trade-offs
Operations:
- Deployment: Automated deployment and rollback procedures
- Monitoring: Comprehensive monitoring and alerting
- Maintenance: Regular maintenance and updates
Conclusion
Advanced memory management techniques are essential for achieving optimal performance in modern server systems. NUMA optimization, huge pages, and memory compaction each address specific aspects of memory performance, enabling applications to achieve better performance, lower latency, and higher throughput.
The choice of memory management techniques depends on specific requirements: NUMA optimization for multi-socket systems, huge pages for memory-intensive applications, and memory compaction for systems requiring large contiguous memory allocations. Understanding these techniques and their trade-offs is crucial for building systems that can meet the demanding performance requirements of modern applications.
As hardware continues to evolve with new memory technologies and architectures, advanced memory management techniques will become increasingly important for achieving optimal performance. The future lies in hybrid approaches that combine multiple memory management techniques, providing both high performance and the reliability and security guarantees that production systems require.