Monitoring Disk I/O Performance with iostat

[Screenshot: terminal output from iostat showing per-device read/write throughput, IOPS, average wait and service times, utilization percentage, and CPU statistics]

In today's data-driven infrastructure landscape, understanding how storage systems perform under various workloads determines whether applications run smoothly or grind to a halt. Storage bottlenecks often remain invisible until they cascade into user-facing problems—slow database queries, delayed batch processes, or unresponsive applications. Identifying these issues early requires visibility into what happens beneath the surface when applications request data from disks.

Disk input/output monitoring represents the practice of measuring and analyzing how data moves between system memory and storage devices. This discipline encompasses tracking read and write operations, measuring latency, identifying queue depths, and understanding utilization patterns. By examining these metrics systematically, infrastructure teams gain multiple perspectives on storage health—from hardware-level disk performance to filesystem behavior and application I/O patterns.

Throughout this exploration, you'll discover practical approaches to measuring storage performance, interpreting the metrics that matter most, establishing baselines for normal operation, and recognizing patterns that signal emerging problems. You'll learn to distinguish between symptoms and root causes, understand when storage becomes the limiting factor in system performance, and develop strategies for ongoing monitoring that prevents small issues from becoming critical failures.

Understanding Storage Performance Fundamentals

Storage performance exists at the intersection of hardware capabilities, operating system management, and application demands. Unlike CPU or memory resources that operate at nanosecond scales, storage devices work in milliseconds—orders of magnitude slower. This speed differential creates unique challenges because even minor inefficiencies in storage access patterns multiply into significant performance impacts.

Modern storage architectures layer abstractions between applications and physical media. Applications interact with filesystems, which communicate with volume managers, which ultimately translate requests into physical device operations. Each layer introduces overhead and potential bottlenecks. A comprehensive monitoring approach examines performance at multiple levels to identify where delays originate.

"The difference between a responsive system and a sluggish one often comes down to whether storage can keep pace with the rate of data requests being generated."

Traditional spinning disks and solid-state drives exhibit fundamentally different performance characteristics. Mechanical drives excel at sequential operations but struggle with random access patterns due to physical seek times. Solid-state storage eliminates mechanical latency but introduces considerations around write amplification and wear leveling. Network-attached storage adds another dimension, where network latency and bandwidth constraints interact with underlying storage performance.

Core Performance Dimensions

Effective storage monitoring tracks several interconnected dimensions simultaneously. Throughput measures the volume of data transferred per unit time, typically expressed in megabytes or gigabytes per second. This metric indicates raw data-moving capacity but doesn't reveal whether that capacity meets application needs efficiently.

Latency captures the time elapsed between issuing an I/O request and receiving a response, measured in milliseconds or microseconds. Applications experience latency directly as wait time, making this metric critical for understanding user-perceived performance.

IOPS (Input/Output Operations Per Second) counts discrete I/O operations completed within a timeframe, regardless of operation size. This metric particularly matters for workloads involving many small transactions, like database operations.

Queue depth represents I/O requests waiting for service at any given moment. Growing queues indicate storage cannot process requests as quickly as they arrive, leading to increased latency.

Utilization shows the percentage of time storage devices actively service requests versus sitting idle. High utilization suggests capacity constraints, though the relationship between utilization and performance varies by storage type.

These dimensions interact in complex ways. High throughput with low IOPS suggests large sequential operations, while high IOPS with modest throughput indicates many small random operations. Understanding these relationships helps diagnose whether workload characteristics or storage limitations cause performance issues.
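
To make the relationship concrete, dividing throughput by IOPS gives the average operation size; the figures below are hypothetical:

    200 MB/s at 800 IOPS   ->  200,000 KB/s / 800 ops/s   = 250 KB per operation (large, likely sequential)
    20 MB/s at 5,000 IOPS  ->  20,000 KB/s / 5,000 ops/s  = 4 KB per operation (small, likely random)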

The iostat Command Architecture

The iostat utility emerged from the sysstat package as a specialized tool for reporting storage and CPU statistics. Unlike general-purpose monitoring tools that sample many system aspects superficially, iostat focuses specifically on I/O subsystem behavior with detailed metrics unavailable elsewhere. The tool reads kernel statistics from /proc/diskstats and /sys/block/ interfaces, processing raw counters into meaningful performance indicators.

When invoked without arguments, iostat displays a single snapshot of statistics accumulated since system boot. This default behavior provides historical averages but obscures current performance patterns. The tool's true value emerges when run in continuous mode, sampling at regular intervals to reveal performance trends and anomalies as they occur.

The command accepts two primary numeric arguments: interval and count. The interval specifies seconds between reports, while count limits total reports generated. Running iostat 5 12 produces twelve reports at five-second intervals, providing one minute of detailed observations. Omitting the count parameter causes iostat to run indefinitely until interrupted.
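
A few representative invocations, as a minimal sketch of the forms described above:

    # Single snapshot of averages accumulated since system boot
    iostat

    # Twelve reports at five-second intervals, one minute of observation
    iostat 5 12

    # Report every two seconds indefinitely until interrupted with Ctrl+C
    iostat 2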

Essential Command Options

The -x flag activates extended statistics mode, dramatically expanding the metrics reported. This mode reveals granular details about I/O patterns, including separate read and write statistics, queue depths, and service times. Extended mode transforms iostat from a basic monitoring tool into a comprehensive performance analysis instrument.

Adding -d restricts output to device statistics, suppressing CPU information. This focused view reduces clutter when investigating storage-specific issues. The -k or -m options control whether throughput appears in kilobytes or megabytes per second, with megabytes generally more readable for modern high-throughput devices.

The -p parameter followed by a device name displays statistics for individual partitions rather than whole devices. This granularity helps identify whether specific partitions experience disproportionate load. The -t flag timestamps each report, essential for correlating I/O patterns with external events or application behaviors.

Option       | Purpose                                            | Typical Use Case
-x           | Display extended statistics with detailed metrics  | Deep performance analysis and bottleneck identification
-d           | Show only device statistics, suppress CPU data     | Storage-focused troubleshooting sessions
-k / -m      | Report throughput in kilobytes or megabytes        | Adjusting scale for readability based on device speed
-p [device]  | Display partition-level statistics                 | Identifying hot partitions or imbalanced workloads
-t           | Add timestamps to each report                      | Correlating I/O patterns with application events
-y           | Omit first report showing boot-time averages       | Focusing on current performance without historical skew
-z           | Suppress devices with zero activity                | Reducing clutter in systems with many inactive devices

Combining options creates powerful monitoring configurations. The command iostat -xdmtz 5 produces extended device statistics in megabytes with timestamps, suppressing inactive devices, updating every five seconds. This combination provides comprehensive, readable output focused on active storage subsystems.
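
For reference, that invocation and a partition-focused variant look like this (sda is a placeholder device name):

    # Extended device statistics in megabytes, timestamped, inactive devices hidden,
    # refreshed every five seconds
    iostat -xdmtz 5

    # The same view limited to one device and its partitions
    iostat -xdmt -p sda 5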

Interpreting Extended Statistics

Extended mode output contains numerous columns, each revealing specific aspects of I/O behavior. Understanding what each metric represents and how metrics relate to each other transforms raw numbers into actionable insights. The statistics fall into several categories: request rates, throughput, latency, queue behavior, and utilization.

rrqm/s and wrqm/s show read and write requests merged per second. The kernel I/O scheduler attempts to merge adjacent requests before submitting them to devices, improving efficiency. High merge rates indicate the scheduler successfully consolidates requests, while low rates suggest either highly random access patterns or scheduler configuration issues.

"When you see average wait times climbing while utilization remains moderate, you're witnessing the storage system struggle with workload characteristics rather than raw capacity limits."

The r/s and w/s columns report actual read and write operations submitted to devices per second after merging. These figures represent the IOPS load devices must service. Comparing these values against device specifications reveals whether operation rates approach hardware limits.

rkB/s and wkB/s indicate throughput in kilobytes per second for reads and writes. Dividing throughput by operation count yields average operation size—a critical workload characteristic. Large average sizes suggest sequential access patterns that storage systems handle efficiently, while small averages indicate random patterns that challenge performance.
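
Average read size per device can be derived by dividing rkB/s by r/s (newer sysstat releases also report it directly as rareq-sz and wareq-sz). The sketch below locates columns by header name, since positions differ between sysstat versions:

    # Print average read request size (KB) per device for each report produced
    iostat -dxk 5 2 | awk '
        /Device/ { for (i = 1; i <= NF; i++) { if ($i == "r/s") r = i; if ($i == "rkB/s") k = i }; next }
        r && k && NF > 1 && $r > 0 { printf "%s: average read size %.1f KB\n", $1, $k / $r }'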

Latency and Queue Metrics

The await column presents average time in milliseconds for I/O requests to complete, including both queue wait time and service time. This end-to-end latency directly impacts application performance. Values under 10ms generally indicate healthy performance for mechanical drives, while solid-state devices should maintain sub-millisecond latencies under normal conditions.

Separate r_await and w_await columns break down latency by operation type. Many workloads exhibit asymmetric patterns where reads require immediate responses while writes can tolerate higher latency through caching. Monitoring these separately reveals whether read or write paths experience problems.

svctm historically represented average service time, but the field has been deprecated and removed in recent sysstat releases because the value was no longer meaningful. The metric attempted to separate queue wait time from actual service time, yet accurate calculation proved impossible with modern I/O schedulers and device command queuing. Ignore this column if your version still prints it.

The aqu-sz (average queue size) metric shows how many requests remained outstanding on average during the interval. Growing queue sizes indicate requests arrive faster than devices can service them. Sustained queue growth translates directly into rising await values and, when writers keep outpacing the device, into dirty pages accumulating in memory until the kernel begins throttling the processes generating them.

%util displays the percentage of time devices had requests outstanding. Interpretation requires understanding device capabilities. Single-queue mechanical drives saturate near 100% utilization, while modern NVMe devices with deep command queues may handle additional load even at high utilization percentages. Context matters significantly for this metric.

Metric       | Healthy Range | Warning Signs | Critical Threshold
await (HDD)  | < 10ms        | 10-20ms       | > 20ms sustained
await (SSD)  | < 1ms         | 1-5ms         | > 5ms sustained
aqu-sz       | < 4           | 4-16          | > 16 or growing
%util (HDD)  | < 70%         | 70-90%        | > 90% sustained
%util (NVMe) | < 80%         | 80-95%        | > 95% with queue growth

Establishing Performance Baselines

Understanding whether current performance represents normal operation or degradation requires established baselines. Baselines capture typical I/O patterns during various operational phases—business hours versus overnight batch processing, month-end reporting periods versus routine operations, backup windows versus production workloads. Without these reference points, distinguishing normal variation from genuine problems becomes guesswork.

Effective baseline establishment involves collecting statistics during known-good operational periods across representative timeframes. Daily patterns often show predictable cycles as users arrive, work, and depart. Weekly patterns may reveal batch processes scheduled for specific days. Monthly or quarterly patterns emerge around reporting cycles or seasonal business fluctuations.

Automated collection through scheduled iostat invocations captures this data systematically. A cron job running iostat -xdmtz 60 1440 > /var/log/iostat/iostat-$(date +\%Y\%m\%d).log generates 24 hours of per-minute samples daily. Accumulating several weeks of such data reveals normal operational ranges and variation patterns.
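
A crontab entry along those lines might look like the following sketch; it assumes the sysstat package is installed, that /var/log/iostat/ exists and is writable, and that the iostat binary lives at /usr/bin/iostat (adjust for your distribution):

    # Start a 24-hour, per-minute collection run at midnight every day
    0 0 * * * /usr/bin/iostat -xdmtz 60 1440 > /var/log/iostat/iostat-$(date +\%Y\%m\%d).log 2>&1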

"Baseline data transforms monitoring from reactive firefighting into proactive capacity management by revealing trends before they become emergencies."

Identifying Meaningful Deviations

Once baselines exist, recognizing significant deviations requires statistical thinking rather than arbitrary thresholds. A single interval showing elevated latency might represent normal variation, while sustained elevation signals genuine issues. Comparing current metrics against baseline distributions helps distinguish signal from noise.

Calculate percentile values from baseline data to establish normal ranges. If the 95th percentile of read latency during business hours typically falls at 8ms, current values consistently exceeding 12ms warrant investigation even if they remain below absolute alarm thresholds. This approach adapts monitoring to actual system behavior rather than generic rules.
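
One way to pull such a percentile out of collected logs is sketched below; the device name, log path, and percentile are illustrative, and the header row is used to find the r_await column because its position varies between sysstat versions:

    cat /var/log/iostat/iostat-*.log \
        | awk '/Device/ { for (i = 1; i <= NF; i++) if ($i == "r_await") c = i; next }
               c && $1 == "sda" { print $c }' \
        | sort -n \
        | awk '{ v[NR] = $1 }
               END { if (NR) { i = int(NR * 0.95); if (i < 1) i = 1;
                               printf "p95 r_await: %.1f ms\n", v[i] } }'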

Trend analysis reveals gradual degradation that might not trigger threshold alarms. If average queue depth slowly grows from 2 to 6 over several weeks, the system approaches saturation even though current values remain acceptable. Detecting such trends enables proactive intervention before performance becomes unacceptable.

Workload characterization through baseline analysis informs capacity planning. Understanding typical IOPS rates, throughput patterns, and operation size distributions guides storage system design. If baselines show predominantly large sequential writes during backup windows but random reads dominate business hours, storage architectures can optimize for these distinct patterns.

Diagnosing Common Performance Patterns

Certain metric combinations consistently indicate specific problems. Recognizing these patterns accelerates troubleshooting by focusing investigation on likely root causes. Pattern recognition develops through experience, but understanding fundamental relationships provides a starting framework.

⚡ High utilization accompanied by low throughput and low IOPS suggests storage capacity exhaustion or hardware failure. Devices should deliver substantial work when busy; if utilization reaches 100% while throughput remains modest, the device likely cannot keep pace with demand. This pattern warrants examining device health, checking for hardware errors, and considering capacity upgrades.

⚡ Low utilization with high latency indicates problems elsewhere in the I/O stack rather than device saturation. The storage device spends most time idle yet requests experience delays, suggesting bottlenecks in filesystem layers, volume management, or I/O scheduling. Investigation should focus on kernel I/O subsystem configuration and filesystem performance.

⚡ High read latency with acceptable write latency often points to cache inefficiency. Read operations require immediate data retrieval, while write operations can complete quickly through cache acknowledgment. If reads slow while writes remain fast, examine cache hit rates and consider memory pressure forcing cache eviction.

⚡ Elevated merge rates alongside high IOPS suggest I/O scheduler effectiveness. The system successfully consolidates many small requests into fewer larger operations, improving efficiency. This pattern represents healthy behavior, though it may indicate opportunities for application-level optimization to reduce request fragmentation.

⚡ Growing queue depths with increasing latency signal demand exceeding capacity. More requests arrive than devices can service, causing queues to build and wait times to grow. This pattern demands immediate attention as it precedes system instability.

Workload-Specific Considerations

Database workloads typically generate random read patterns with periodic sequential write bursts during checkpoint operations. Expect moderate IOPS with small average operation sizes during normal operation, punctuated by high write throughput intervals. Sustained read latency elevation impacts query performance immediately, while write latency affects checkpoint completion times.

Web server workloads often show read-heavy patterns with small file accesses. Static content serving generates many small sequential reads as files are streamed to clients. High read IOPS with modest throughput characterizes this workload. Filesystem caching significantly impacts performance, so monitoring cache hit rates alongside I/O metrics provides complete visibility.

"The same I/O statistics mean entirely different things depending on whether you're running a database, a file server, or a batch processing system."

Batch processing and ETL workloads produce large sequential operations with predictable patterns. Expect high throughput with moderate IOPS as large blocks transfer. These workloads often tolerate higher latency than interactive applications, making throughput the primary performance indicator. Monitoring should focus on whether jobs complete within allocated time windows.

Virtualization environments present unique monitoring challenges as multiple virtual machines share underlying storage. Individual VM I/O patterns may appear normal while aggregate load overwhelms shared storage. Monitor both hypervisor-level aggregate statistics and per-VM metrics to identify whether problems originate from specific VMs or overall capacity constraints.

Advanced Monitoring Techniques

Beyond basic iostat invocation, several advanced techniques extract deeper insights from I/O behavior. These approaches combine iostat with other tools, apply statistical analysis, or leverage additional kernel interfaces to build comprehensive understanding of storage performance.

Correlating iostat output with application logs reveals cause-and-effect relationships between application behavior and I/O patterns. Timestamped iostat data aligned with application event logs shows whether specific application operations trigger I/O spikes. This correlation identifies which application behaviors stress storage systems, guiding optimization efforts.

Combining iostat with iotop or pidstat attributes I/O activity to specific processes. While iostat shows device-level statistics, these process-level tools identify which applications generate load. Running both simultaneously during performance issues quickly isolates problematic applications or unexpected background processes.
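
A workable arrangement is to keep the device view and a process view running side by side, for example (the flags shown are common in recent sysstat and iotop releases, but check your versions):

    iostat -xdmtz 5      # device-level latency, queues, and utilization
    pidstat -d 5         # per-process read/write throughput and I/O delay
    iotop -obt -d 5      # batch-mode list of only those processes currently doing I/O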

Automated Analysis and Alerting

Parsing iostat output programmatically enables automated analysis and alerting. Simple shell scripts can extract specific metrics, compare against thresholds, and trigger notifications when problems appear. More sophisticated approaches use time-series databases to store metrics long-term, enabling trend analysis and predictive alerting.

A basic monitoring script might continuously run iostat, parse output for specific devices, and alert when latency exceeds thresholds for consecutive intervals. This approach detects sustained problems while filtering transient spikes. Building hysteresis into alerting logic—requiring both threshold breach and sustained elevation—reduces false alarms.
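
A minimal sketch of such a script follows. The device name, threshold, r_await column position, and the alert action are all placeholders to adapt to your environment and sysstat version:

    #!/bin/bash
    # Alert only after three consecutive elevated intervals to filter transient spikes.
    DEVICE="sda"
    THRESHOLD_MS=20
    REQUIRED_HITS=3
    hits=0

    iostat -dxy 5 | while read -r line; do
        set -- $line
        [ "$1" = "$DEVICE" ] || continue
        r_await=$6        # column 6 holds r_await in recent sysstat extended output; verify yours
        if awk -v v="$r_await" -v t="$THRESHOLD_MS" 'BEGIN { exit !(v > t) }'; then
            hits=$((hits + 1))
        else
            hits=0
        fi
        if [ "$hits" -ge "$REQUIRED_HITS" ]; then
            echo "ALERT: ${DEVICE} r_await ${r_await}ms above ${THRESHOLD_MS}ms for ${REQUIRED_HITS} intervals"
            hits=0        # replace echo with a mail or webhook notification as needed
        fi
    done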

Integration with monitoring platforms like Prometheus, Grafana, or Nagios centralizes I/O metrics alongside other system data. Exporters translate iostat output into platform-native formats, enabling unified dashboards and correlation with CPU, memory, and network metrics. This holistic view reveals whether storage problems exist in isolation or as part of broader resource contention.
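
As one simplified example of that translation step, the sketch below writes a single iostat-derived metric into a file for node_exporter's textfile collector. The output directory must match the collector's configured --collector.textfile.directory, and the metric name is invented for illustration; node_exporter already exports raw /proc/diskstats counters by default, so scripts like this are only worthwhile for derived values:

    OUT=/var/lib/node_exporter/textfile_collector/iostat.prom
    # Take only the second report (the current interval), skipping the since-boot averages
    iostat -dx 5 2 | awk '
        /Device/ { n++; for (i = 1; i <= NF; i++) if ($i == "r_await") c = i; next }
        n == 2 && NF > 1 { printf "custom_disk_read_await_ms{device=\"%s\"} %s\n", $1, $c }' > "${OUT}.tmp"
    mv "${OUT}.tmp" "$OUT"    # atomic replace so the collector never reads a partial file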

Machine learning approaches can detect anomalies in I/O patterns without explicit threshold configuration. By learning normal operational patterns, these systems identify deviations that might indicate emerging problems even when metrics remain within traditionally acceptable ranges. This technique particularly benefits complex environments where baseline patterns vary significantly across time and workload phases.

Performance Optimization Strategies

Once monitoring identifies performance issues, optimization strategies address root causes. Effective optimization requires understanding whether problems stem from hardware limitations, configuration issues, or workload characteristics. Different problems demand different solutions, making accurate diagnosis through monitoring essential.

I/O scheduler tuning represents a first-line optimization approach. Current Linux kernels use multiqueue (blk-mq) schedulers such as mq-deadline, bfq, kyber, and none, which replaced the legacy deadline, cfq, and noop schedulers. Each optimizes for different workload characteristics. Database servers often benefit from mq-deadline, which prioritizes request completion within time bounds. File servers and interactive desktops may prefer bfq's fairness guarantees. Fast solid-state and NVMe devices often perform best with none or mq-deadline, since they gain little from request reordering.
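
Checking and changing the scheduler happens through sysfs, and a udev rule makes the choice persistent across reboots; the device names and chosen scheduler below are illustrative:

    # Show available schedulers for a device; the active one appears in brackets
    cat /sys/block/sda/queue/scheduler

    # Switch the active scheduler at runtime
    echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

    # Persist the choice for SATA/SAS disks via a udev rule
    echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"' \
        | sudo tee /etc/udev/rules.d/60-ioscheduler.rules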

Filesystem selection and configuration significantly impact I/O performance. Modern filesystems like XFS and ext4 offer numerous tuning parameters affecting everything from allocation strategies to journaling behavior. Disabling access time updates (noatime mount option) eliminates write operations generated by read accesses. Adjusting journal sizes and locations balances data integrity against performance. Alignment with underlying storage geometry prevents unnecessary read-modify-write cycles.
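
For instance, noatime can be tested with a remount and then made permanent in /etc/fstab; the mount point, UUID, and filesystem type below are placeholders:

    # Try it immediately on a mounted filesystem
    sudo mount -o remount,noatime /data

    # Example /etc/fstab entry making it permanent
    # UUID=xxxx-xxxx  /data  xfs  defaults,noatime  0  2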

"Optimization without measurement is just guessing—monitor before and after changes to verify improvements rather than assumptions."

Hardware and Architecture Solutions

When software optimization exhausts possibilities, hardware upgrades or architectural changes address capacity constraints. Adding faster storage devices—replacing mechanical drives with SSDs or upgrading to NVMe—provides immediate performance improvements for I/O-bound workloads. Cost-benefit analysis should consider whether performance gains justify investment.

Implementing tiered storage strategies places frequently accessed data on fast storage while archiving cold data to slower, cheaper media. Monitoring identifies hot data through access pattern analysis. Automated tiering solutions dynamically migrate data between tiers based on access frequency, optimizing cost and performance simultaneously.

Caching layers between applications and storage dramatically reduce I/O load by serving repeated requests from memory. Operating system page caches provide basic functionality, while dedicated caching solutions like bcache or dm-cache add persistent cache layers using SSDs to accelerate slower backing storage. Monitoring cache hit rates validates caching effectiveness and guides capacity sizing.

Distributing I/O load across multiple devices through RAID configurations or software-defined storage prevents individual device saturation. Striping (RAID 0) improves throughput and IOPS by parallelizing operations. Monitoring individual device statistics within arrays identifies whether load distributes evenly or certain devices become hotspots requiring rebalancing.

Continuous Monitoring Best Practices

Effective I/O monitoring requires ongoing commitment rather than sporadic investigation during outages. Establishing systematic monitoring practices ensures performance visibility becomes routine rather than exceptional. These practices balance comprehensive data collection against overhead and operational burden.

Sampling intervals represent a fundamental tradeoff between granularity and overhead. One-second intervals capture rapid fluctuations but generate substantial data volume and impose measurable system load. Five to ten-second intervals provide reasonable granularity for most purposes while minimizing overhead. Longer intervals—one to five minutes—suit capacity planning and trend analysis but may miss transient problems.

Retention policies balance historical data value against storage costs. High-resolution recent data enables detailed troubleshooting, while aggregated historical data supports long-term trend analysis. A common approach retains per-second data for 24-48 hours, per-minute data for 30 days, and hourly aggregates indefinitely. This tiered retention provides detail when needed while controlling growth.
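
For the daily log files produced by the cron job shown earlier, a simple logrotate policy keeps growth in check; the retention count here is illustrative, and longer-term aggregates would come from a separate summarization step:

    # Sketch of /etc/logrotate.d/iostat
    /var/log/iostat/*.log {
        daily
        rotate 30
        compress
        delaycompress
        missingok
        notifempty
    }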

Documentation and Knowledge Transfer

Documenting baseline characteristics, known issues, and optimization history creates institutional knowledge that survives personnel changes. Record typical metric ranges during various operational phases, known problematic patterns and their causes, and the impact of past optimization efforts. This documentation accelerates future troubleshooting and prevents repeated investigation of understood issues.

Runbooks codify response procedures for common problems identified through monitoring. When specific metric patterns appear, documented procedures guide investigation and remediation. Runbooks might specify checking for specific processes, examining related subsystems, or implementing known workarounds. This codification ensures consistent responses regardless of who handles incidents.

Regular review of monitoring data during non-crisis periods reveals gradual trends and emerging patterns. Scheduled weekly or monthly analysis sessions examine capacity utilization trends, identify slowly degrading performance, and validate that monitoring remains effective as systems evolve. Proactive review prevents surprises by catching problems before they become critical.

Cross-training team members on I/O monitoring techniques and interpretation ensures knowledge distribution. When only one person understands storage performance analysis, that person becomes a bottleneck and single point of failure. Shared expertise improves incident response and enables peer review of analysis and optimization decisions.

Integration with Broader Observability

Storage performance rarely exists in isolation—I/O patterns interact with CPU usage, memory pressure, network activity, and application behavior. Integrating I/O monitoring into broader observability practices reveals these relationships and enables holistic understanding of system behavior.

Correlation between memory pressure and I/O patterns often reveals cause-and-effect relationships. Insufficient memory forces cache eviction, increasing read I/O as data must be retrieved from storage repeatedly. Monitoring both memory utilization and I/O statistics together identifies whether adding memory would reduce storage load more effectively than storage upgrades.
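
Running a memory-oriented view next to iostat makes that correlation visible, for example:

    vmstat 5     # si/so show swap traffic, bi/bo show block I/O, and cache shows page cache size
    free -m      # point-in-time breakdown of used, free, and buffer/cache memory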

"Understanding system performance requires seeing how all components interact rather than examining each in isolation."

Application performance metrics provide context for I/O statistics. Database query latency, web request response times, or batch job completion durations show whether I/O performance meets application needs. Monitoring systems should correlate application-level metrics with infrastructure metrics to validate that infrastructure improvements translate into user-visible benefits.

Network I/O monitoring complements local storage monitoring in environments using network-attached storage. High network latency or packet loss can masquerade as storage performance problems when using NFS, iSCSI, or other network protocols. Monitoring both network and storage layers distinguishes between network and storage issues.

Distributed tracing in microservice architectures reveals how I/O latency in one service impacts overall request processing. A service experiencing storage delays may hold resources and block other services, cascading problems throughout the system. Tracing shows these dependencies and helps prioritize optimization efforts based on overall system impact rather than individual service metrics.

Troubleshooting Real-World Scenarios

Practical application of monitoring concepts becomes clearer through examining realistic scenarios. These examples illustrate diagnostic processes, showing how multiple metrics combine to reveal root causes and guide resolution.

Scenario: Gradual Performance Degradation

A database server experiences slowly increasing query latency over several weeks. Initial investigation shows CPU and memory utilization remain normal. Running iostat -xdmtz 5 reveals read latency averaging 15ms, up from a baseline of 5ms. Utilization hovers around 85%, while IOPS remain within historical ranges. Queue depth averages 8, double the baseline of 4.

This pattern suggests the storage device approaches capacity limits. The combination of elevated latency, high utilization, and growing queues indicates demand exceeds device capability. Investigation reveals filesystem fragmentation has increased significantly—sequential file writes have become fragmented over time, forcing the drive to perform more seeks. Defragmentation or migration to a fresh filesystem resolves the issue, returning performance to baseline levels.

Scenario: Intermittent Performance Spikes

An application experiences periodic slowdowns without apparent pattern. Continuous iostat monitoring with timestamps captures a spike: write latency suddenly jumps to 200ms for 30 seconds before returning to normal. Utilization during spikes reaches 100%, but IOPS and throughput remain modest. Correlation with system logs reveals these spikes coincide with automated backup processes.

The backup process generates large sequential writes that saturate the storage device temporarily. While backup throughput seems reasonable, the device cannot simultaneously handle backup writes and application I/O, causing application request queuing. The solution involves throttling backup processes to leave capacity for application I/O or scheduling backups during maintenance windows when application load is minimal.

Future Considerations and Evolving Technologies

Storage technologies continue evolving rapidly, introducing new performance characteristics and monitoring challenges. Understanding emerging trends helps prepare monitoring strategies for future requirements.

NVMe storage with its dramatically lower latency and higher parallelism requires rethinking traditional monitoring approaches. Metrics like utilization become less meaningful when devices handle hundreds of parallel operations. Queue depths that would indicate problems on SATA SSDs represent normal operation for NVMe. Monitoring must adapt to these new performance profiles.

Persistent memory technologies blur the line between storage and memory, operating at speeds between traditional RAM and storage. Monitoring these devices requires new metrics and tools as existing I/O monitoring frameworks may not capture their unique characteristics. Understanding how applications leverage persistent memory and whether they achieve expected performance benefits becomes a new monitoring challenge.

Cloud and virtualized environments introduce additional abstraction layers between monitoring tools and physical devices. Container storage interfaces, virtual disk images, and cloud storage services present different performance characteristics than direct-attached storage. Monitoring must account for these layers, understanding both virtual device performance and underlying physical resource behavior.

Software-defined storage distributes data across multiple nodes, requiring monitoring approaches that aggregate statistics across the cluster while identifying individual node problems. Traditional single-device monitoring provides incomplete visibility. Cluster-aware monitoring tools that understand data distribution, replication, and rebuild operations become essential.

How often should I run iostat for effective monitoring?

For real-time troubleshooting, run iostat with 5-second intervals to capture detailed behavior without excessive overhead. For continuous monitoring and logging, 30-60 second intervals provide sufficient granularity for trend analysis while minimizing data volume. Adjust based on your specific needs—faster intervals for high-transaction environments, longer intervals for stable systems with predictable workloads.

What await value indicates a storage problem?

Context matters significantly. For mechanical hard drives, sustained await values above 10-15ms warrant investigation, while values above 20ms indicate serious problems. Solid-state drives should maintain sub-millisecond latency under normal conditions, with values above 5ms suggesting issues. Compare current values against your established baselines rather than relying solely on absolute thresholds, as acceptable latency varies by workload and device type.

Why does high utilization not always mean poor performance?

Utilization measures time with requests outstanding, not capacity exhaustion. Modern storage devices with deep command queues can process many parallel operations efficiently even at high utilization. A device showing 90% utilization with low latency and healthy throughput operates normally. Problems appear when high utilization coincides with growing queues and increasing latency, indicating the device cannot keep pace with demand.

How do I identify which process causes high I/O?

While iostat shows device-level statistics, it doesn't attribute I/O to specific processes. Use complementary tools like iotop, pidstat with the -d flag, or /proc/[pid]/io files to identify processes generating I/O. Run these tools alongside iostat during high I/O periods to correlate device-level patterns with process-level activity, quickly identifying problematic applications or unexpected background processes.

Should I monitor individual partitions or whole devices?

Start with whole-device monitoring to understand overall storage system behavior. If problems appear, drill down to partition-level statistics using the -p flag to identify whether specific partitions experience disproportionate load. Partition-level monitoring helps in systems with multiple workloads sharing a device, revealing whether problems concentrate in specific areas or affect the entire device uniformly.

What's the difference between IOPS and throughput, and which matters more?

IOPS counts discrete operations regardless of size, while throughput measures data volume transferred. Both matter, but importance depends on workload characteristics. Database transactions care about IOPS—completing many small operations quickly. Video streaming cares about throughput—moving large amounts of data continuously. Monitor both and understand their relationship: dividing throughput by IOPS reveals average operation size, characterizing whether your workload is random (small operations) or sequential (large operations).