EDA (Exploratory Data Analysis)

TL;DR - EDA
-
What is EDA? Exploratory Data Analysis (EDA) is the process of diving into datasets to uncover patterns, detect anomalies, and test hypotheses, often with the help of visualizations.
-
Data Cleaning and Insights: Data before March 2022 was excluded due to inconsistencies. Outliers, like heart rates above 200 bpm, were smoothed using a 7-day rolling average.
-
Sleep and HRV Data: Sleep quality and HRV are pivotal health metrics. Sleep outliers were flagged with the IQR and Z-Score methods, while HRV outliers were kept because they likely reflect stress, alcohol consumption, or other real factors.
-
Sleep Insights: Stress impacts sleep: higher stress equals lower sleep quality. Over time, there’s a shift from poor-quality to fair and good sleep trends.
-
Activities: From 2022 to 2024, activities soared, alongside fitness improvements like increased VO2 Max, calories burned, and distances covered. Activities were split into high and low heart rate zones for deeper insights.
What is EDA?
EDA is the process of initially exploring a dataset to understand its main characteristics.
Before doing deeper analysis, let's get to know the data:
-
Understand the Data:
-
Review structure, data types, and column meanings.
-
Summarize statistics (mean, median, range).
-
-
Handle Missing Data:
-
Identify missing values and decide on imputation or removal.
-
-
Univariate Analysis:
-
Analyze individual variables using histograms, bar charts, or box plots.
-
-
Bivariate & Multivariate Analysis:
-
Explore relationships using scatter plots, correlation matrices, or cross-tabulations.
-
-
Outliers & Anomalies:
-
Detect outliers using box plots or statistical methods.
-
-
Transform Features:
-
Address skewness or scaling issues through transformations.
-
-
Visualize:
-
Create insightful charts to uncover patterns and trends.
-
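As a rough illustration of the first few steps above, here is a minimal pandas sketch. The DataFrame and its column names (`date`, `resting_hr`, `steps`) are made up for the example and are not the real Garmin export schema.

```python
import pandas as pd

# Hypothetical daily export; column names are placeholders, not the real schema.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
        "resting_hr": [52.0, 54.0, None, 51.0],
        "steps": [8200, 11050, 9400, 15300],
    }
)

# 1. Structure and data types.
print(df.dtypes)

# 2. Summary statistics (count, mean, quartiles, min/max) for numeric columns.
summary = df[["resting_hr", "steps"]].describe()
print(summary)

# 3. Missing values per column, to decide on imputation or removal.
missing = df.isna().sum()
print(missing)

# 4. Pairwise correlations for a quick bivariate view.
corr = df[["resting_hr", "steps"]].corr()
print(corr)
```

On a real export, the same four calls usually give enough of a picture to decide which columns need cleaning first.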
Data Cleaning and Insights
Heart Rate
Why Start with Heart Rate?
Heart rate is a critical indicator of overall health and activity levels, making it the ideal starting point for the data cleaning process.
Cleaning and analyzing this data first provides key insights and sets the stage for cleaning other datasets more efficiently.
Here's why:
-
Baseline Trends: Heart rate data reveals average and extreme values, establishing a baseline to compare against other metrics.
-
Anomalies: It allows for the easy detection of irregularities, ensuring that outliers or errors are addressed before they affect subsequent analysis.
Let's start processing this dataset with a scatter plot to see how the data behaves.
For most of the data visualization I use Plotly. It offers interactive, web-based visualizations with features like zoom and pan.
It also offers plenty of customization options while maintaining a responsive design.
The library also integrates well with multiple frameworks if required.
There are significant spikes and variations in the data around early 2022 and in 2016, likely due to a change in devices—from the Fenix 3HR to the Garmin 7. To streamline the analysis and ensure consistency, we will eliminate all data prior to March 2022.
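Dropping the older data is a one-line filter in pandas. The sketch below assumes a `date` column and reads "prior to March 2022" as "before 1 March 2022"; both are assumptions for the example.

```python
import pandas as pd

# Hypothetical frame; the real export's column names may differ.
hr = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-05-10", "2022-02-27", "2022-03-01", "2023-07-15"]),
        "resting_hr": [61, 58, 53, 50],
    }
)

CUTOFF = pd.Timestamp("2022-03-01")  # assumed to mean "from 1 March 2022 onward"

# Keep only rows on or after the cutoff and reset the index for a clean timeline.
hr_clean = hr[hr["date"] >= CUTOFF].reset_index(drop=True)
print(hr_clean)
```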
Here's how the dataset looks after applying this adjustment:
Let's explore some other metrics now that the data looks a lot cleaner.
When analyzing the max_hr values, we observed:
-
Readings above 200 bpm, which are likely due to errors in the data capture process rather than actual physiological activity.
-
These values are considered outliers, as they fall well beyond typical heart rate thresholds, especially for me.
Approach to Handling Outliers
Instead of deleting these outliers, which could disrupt data continuity, we can opt for a more robust solution:
-
Replace the erroneous values with a 7-day rolling mean. This preserves the overall trend while ensuring the data remains clean and usable.
-
This method maintains the integrity of other key data features, ensuring that the analysis remains robust and reflective of real-world activity.
-
Data Exclusion:
-
Data prior to March 2022 has been excluded due to inconsistencies in recording and quality.
-
Starting from March 2022, we now have a consistent, long-term timeline that allows for in-depth analysis and the generation of actionable, up-to-date insights.
-
-
Data Quality:
-
Careful attention has been given to avoid over-cleaning the data.
-
This ensures that the dataset retains its essential patterns and characteristics.
-
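The 7-day rolling-mean replacement described above could be sketched like this. The sample values are invented, and the choice of a trailing window that ignores the flagged readings is my assumption; a centered window would work just as well.

```python
import pandas as pd

# Hypothetical daily max heart rate with one implausible spike.
max_hr = pd.Series(
    [165, 170, 168, 240, 172, 166, 169, 171],
    index=pd.date_range("2022-03-01", periods=8, freq="D"),
    dtype="float64",
)

THRESHOLD = 200  # bpm; readings above this are treated as capture errors

# Blank out the suspect readings so they do not contaminate the rolling mean.
masked = max_hr.where(max_hr <= THRESHOLD)

# Trailing 7-day mean computed from the remaining valid readings only.
rolling = masked.rolling(window=7, min_periods=1).mean()

# Fill only the blanked-out days; every valid reading is left untouched.
cleaned = masked.fillna(rolling)
print(cleaned)
```

Because only the flagged days are overwritten, the rest of the series keeps its original day-to-day variation, which is the point of avoiding over-cleaning.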
Sleep
The Importance of Sleep
We all understand the critical role sleep plays in overall health and performance. It’s during sleep that our body recovers, repairs, and prepares for the challenges of the next day. Tracking sleep metrics allows us to better understand our sleep and, where possible, improve it.
Outlier Analysis
To maintain clean and reliable sleep duration data, we use two robust statistical methods: IQR (Interquartile Range) and the Modified Z-Score. These approaches help identify and manage extreme values while preserving key data characteristics.
1. IQR Method
-
The IQR (Interquartile Range) method flags outliers that fall outside:
$$\text{Lower Bound} = Q_1 - 1.5 \times \text{IQR} \quad \text{and} \quad \text{Upper Bound} = Q_3 + 1.5 \times \text{IQR}$$
-
This approach works well for data with skewed distributions and ensures that extreme values are flagged.
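A minimal sketch of the IQR fences in pandas, using made-up sleep durations (the real values come from the watch export):

```python
import pandas as pd

# Hypothetical sleep durations in hours.
sleep_hours = pd.Series([6.1, 6.5, 7.0, 7.4, 6.8, 3.2, 6.9, 10.5, 7.1, 6.3])

q1 = sleep_hours.quantile(0.25)
q3 = sleep_hours.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values outside the fences rather than dropping them.
is_outlier = (sleep_hours < lower_bound) | (sleep_hours > upper_bound)
print(sleep_hours[is_outlier])
```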
2. Modified Z-Score
The Modified Z-Score is ideal for data with natural variations and non-normal distributions.
-
Why it works: It uses the median and Median Absolute Deviation (MAD), making it robust to occasional extreme events like late nights or disturbances.
-
Outliers are flagged when the Modified Z-Score > 3.5.
By combining these methods, we ensure accurate and reliable analysis without distorting essential data patterns.
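The Modified Z-Score can be sketched in a few lines as well. The data is again invented; the constant 0.6745 is the usual scaling factor that makes the score comparable to a standard Z-Score for normally distributed data.

```python
import pandas as pd

# Hypothetical sleep durations in hours.
sleep_hours = pd.Series([6.1, 6.5, 7.0, 7.4, 6.8, 3.2, 6.9, 10.5, 7.1, 6.3])

median = sleep_hours.median()

# Median Absolute Deviation: the median distance from the median.
mad = (sleep_hours - median).abs().median()

# Note: if mad is 0 (more than half the values identical), this division
# breaks down and another method is needed.
modified_z = 0.6745 * (sleep_hours - median) / mad

# Flag nights whose score exceeds the 3.5 threshold in either direction.
is_outlier = modified_z.abs() > 3.5
print(sleep_hours[is_outlier])
```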
IQR analysis
-
Outlier Thresholds: Sleep durations below 4.10 hours and above 9.38 hours are flagged as statistical outliers.
-
Typical Sleep Range: Most sleep durations fall between 6.08 and 7.40 hours, reflecting typical recovery patterns.
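As a quick check, the thresholds follow from the fence formula if 6.08 and 7.40 hours are taken as Q1 and Q3. That reading is an assumption on my part, since the text only calls them the typical range, but the arithmetic lines up:

```latex
\begin{aligned}
\text{IQR} &= Q_3 - Q_1 = 7.40 - 6.08 = 1.32 \\
\text{Lower Bound} &= Q_1 - 1.5 \times \text{IQR} = 6.08 - 1.98 = 4.10 \\
\text{Upper Bound} &= Q_3 + 1.5 \times \text{IQR} = 7.40 + 1.98 = 9.38
\end{aligned}
```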
Modified Z-score
Z-Score Analysis
-
Statistical Summary:
-
Mean: 6.62 hours | Standard Deviation: 1.37 hours
-
Lower Bound: 2.49 hours (Mean - 3 * SD)
-
Upper Bound: 10.74 hours (Mean + 3 * SD)
-
-
Outlier Thresholds:
-
Sleep durations below 2.49 hours or above 10.74 hours are flagged.
-
This dual analysis provides a robust framework for identifying and managing outliers in sleep duration data.
Number of Outliers: 13 nights of sleep are flagged as outliers based on the Z-Score analysis.
Upon reviewing the graph and data, these outliers likely result from erroneous data collection and early-morning work commitments.
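The mean ± 3 SD bounds used in the Z-Score summary can be reproduced along these lines. The durations below are synthetic, and using the sample standard deviation (pandas' default) is an assumption about how the original numbers were computed.

```python
import pandas as pd

# Hypothetical durations: twenty ordinary nights plus one very short night.
sleep_hours = pd.Series([6.5, 7.0] * 10 + [1.5])

mean = sleep_hours.mean()
sd = sleep_hours.std()  # sample standard deviation (ddof=1)

lower = mean - 3 * sd
upper = mean + 3 * sd

# Nights outside mean +/- 3 SD are flagged.
flagged = sleep_hours[(sleep_hours < lower) | (sleep_hours > upper)]
print(f"bounds: {lower:.2f} to {upper:.2f} hours, flagged: {flagged.tolist()}")
```

With small samples, a single extreme value inflates the standard deviation enough to hide itself, which is one reason to pair this with the median-based methods.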
Approach to Handling Outliers
Instead of deleting these outliers, which could disrupt data continuity, we apply the same solution used for heart rate:
Replace the erroneous values with a 7-day rolling mean. This preserves the overall trend while ensuring the data remains clean and usable.
HRV (Heart Rate Variability)
Heart Rate Variability (HRV) is a key metric that reveals insights into the balance between the sympathetic nervous system (fight or flight) and the parasympathetic nervous system (rest and recovery). Tracking HRV helps us understand:
-
Stress and Recovery: How well your body is managing stress and how effectively it's recovering.
-
Sleep Quality: HRV trends can indicate how restorative your sleep is.
-
Training Readiness: High HRV suggests readiness for physical activity, while low HRV might indicate the need for more rest.
You can read up the details here
What is a Good HRV Range?
-
Baseline Matters:
-
HRV is individual-specific, influenced by factors like age, fitness level, stress, and genetics.
-
For most healthy adults, HRV typically falls between 20 and 100 ms, with trained athletes often showing higher HRV in the range of 60–100 ms or more.
-
-
Interpreting HRV Trends:
-
High HRV (80–100 ms): Indicates better recovery, reduced stress, and a well-functioning parasympathetic nervous system (rest-and-digest state).
-
Low HRV (30–50 ms): Suggests fatigue, stress, overtraining, or inadequate recovery.
-
-
Lifestyle Influences:
-
Significant dips or spikes in HRV are linked to lifestyle factors, including stressful events, intense training, illness, or poor sleep quality.
-
Summary:
A "good" HRV range is relative to your personal baseline, but values closer to 80–100 ms are generally positive indicators of recovery and resilience, while sustained values below 30–50 ms may signal the need for rest or intervention.
Our analysis of the HRV data reveals additional outliers that were not previously removed. These variations may be attributed to external factors, including stress, alcohol consumption, or other influences that affect HRV performance.
Because they may reflect real events rather than measurement errors, we will not remove them from the analysis at this point.
Insights
Let's look at some of the insights now that we have cleaned up most of the data.
Daily Summary
Steps: Range from 7,068 to 19,170, with peak activity on December 30.
Active Calories: Highest burn of 1,175 kcal correlates with the high step count on December 30.
Heart Rate (HR): Consistent resting HR reflects good recovery; peaks vary with activity intensity.
Sleep: Fluctuates between 4.8 and 7.5 hours, with REM sleep ranging from 0% to 26%, indicating varying recovery quality.
Activity Duration: Workout time varies, with the longest session on December 29.
Floors Climbed: Ranges from 16.2 to 49.7, showing variability in intensity.
Overall, this is a good summary of our daily activities.
Conclusion
This dataset provides a comprehensive summary of daily activities, highlighting variations in effort, recovery, and performance.
Sleep
The scatter plot above illustrates the correlation between stress levels and sleep quality:
-
Negative Correlation: As stress levels increase, sleep scores tend to decrease. This suggests that higher stress impacts sleep quality.
-
Sleep Duration Matters: The color gradient (indicating sleep duration) shows that longer sleep durations (green) are often associated with higher sleep scores, whereas shorter durations (purple) correspond to lower scores.
What This Means:
-
Impact of Stress on Sleep: Elevated stress levels disrupt sleep patterns, leading to lower restorative sleep.
-
The Stress-Sleep Cycle: Poor sleep further exacerbates stress levels, creating a cycle that can affect health and productivity.
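To put a number on the negative relationship, a Pearson correlation is a reasonable first check. The nightly values and column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical nightly stress levels and sleep scores.
nights = pd.DataFrame(
    {
        "avg_stress": [22, 30, 41, 35, 55, 67, 28],
        "sleep_score": [83, 78, 64, 70, 49, 28, 80],
    }
)

# Series.corr uses the Pearson coefficient by default; -1 is a perfect
# negative linear relationship and 0 means no linear relationship.
r = nights["avg_stress"].corr(nights["sleep_score"])
print(f"Pearson r = {r:.2f}")
```

A strongly negative r supports the pattern in the plot, though it says nothing about which way the causation runs.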
The chart above showcases the start (blue) and end (orange) times of sleep across different days. Over time, we can observe a clear trend:
-
Earlier Bedtimes: The blue line indicates that sleep start times are shifting earlier.
-
Earlier Wake Times: The orange line shows that wake times are also becoming earlier.
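One way to quantify the shift toward earlier bedtimes is to fit a straight line to the sleep start times. This is a sketch with invented timestamps, not how the chart itself was produced; the midnight wrap-around handling is my own assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sleep start timestamps, roughly one per week.
sleep_start = pd.Series(
    pd.to_datetime(
        [
            "2024-01-02 00:30",
            "2024-01-09 00:10",
            "2024-01-15 23:50",
            "2024-01-22 23:40",
            "2024-01-29 23:20",
        ]
    )
)

# Put bedtimes on a continuous scale: times before noon belong to the
# previous evening, so 00:30 becomes 24.5 and sorts after 23:30.
hours = sleep_start.dt.hour + sleep_start.dt.minute / 60
bedtime = hours.where(hours >= 12, hours + 24)

# Days elapsed since the first record, used as the x axis for the fit.
dates = sleep_start.dt.normalize()
days = (dates - dates.iloc[0]).dt.days

# Least-squares line; the slope is in hours per day, negative means earlier.
slope, intercept = np.polyfit(days, bedtime, deg=1)
print(f"bedtime shifts by {slope * 60:.1f} minutes per day")
```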
The pie chart above provides a breakdown of sleep quality into various categories:
-
Fair (52.1%): Over half of the nights fall into the "fair" category, indicating room for improvement in sleep consistency and quality.
-
Poor (21.1%): A significant portion of nights is categorized as "poor," highlighting the need for immediate attention to sleep habits and recovery strategies.
-
Good (12.2%): A small percentage of nights achieve "good" quality, suggesting opportunities to increase this proportion through better sleep routines.
-
Excellent (14.7%): While a notable percentage of nights are classified as "excellent," maintaining and further improving this metric can lead to better overall health and recovery.
-
This distribution shows clear scope for improvement in sleep quality.
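A breakdown like this is a normalized value count over the per-night quality label. The labels below are placeholders; the real ones come from the watch.

```python
import pandas as pd

# Hypothetical per-night quality labels.
quality = pd.Series(["fair", "poor", "fair", "good", "excellent", "fair", "poor", "fair"])

# Share of nights in each category, as percentages rounded to one decimal.
share = (quality.value_counts(normalize=True) * 100).round(1)
print(share)
```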
Key Features from Sleep and HRV Data
-
Sleep Duration:
-
Ranges from 4h 50m to 7h 32m, reflecting varying rest levels.
-
Shorter sleep durations (e.g., December 27) correlate with poor recovery scores.
-
-
Sleep Score: Ranges from 28 (poor) to 83 (excellent), covering nights with both unbalanced and balanced HRV.
-
Sleep Stages: Distribution of deep (D), REM (R), and light (L) sleep varies across days, influencing recovery and readiness.
-
HRV: Balanced HRV is maintained across most days, with values between 67 ms and 75 ms reflecting good recovery, dropping as low as 36 ms (unbalanced).
-
Stress and Respiration: Stress levels fluctuate, peaking at 67 on December 27, correlating with poor sleep and HRV.
-
Respiration rates remain steady, indicating consistent respiratory health.
-
Average HR: Ranges from 49.8 bpm to 75.1 bpm.
Is It All Bad News?
Looking at the trends over the year, it's clear that there has been significant improvement:
-
The proportion of poor sleep (red) has visibly decreased over time.
-
There is an upward trend in fair (yellow) and good (green) sleep categories.
-
Instances of excellent sleep (dark green), although rare, are beginning to emerge, signaling progress.
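The change in category mix over time can be tabulated as the share of each quality label per month. The sketch below uses invented nights and an assumed `quality` column.

```python
import pandas as pd

# Hypothetical nights from two different months.
nights = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2024-01-03", "2024-01-10", "2024-01-17", "2024-01-24",
                "2024-06-05", "2024-06-12", "2024-06-19", "2024-06-26",
            ]
        ),
        "quality": ["poor", "poor", "fair", "poor", "fair", "good", "fair", "poor"],
    }
)

# Month label such as "2024-01" for grouping.
month = nights["date"].dt.strftime("%Y-%m")

# Rows are months, columns are categories, and each row sums to 1.
monthly_share = pd.crosstab(month, nights["quality"], normalize="index")
print(monthly_share)
```

Plotted as a stacked bar or area chart, a table like this makes a shrinking share of poor nights easy to see.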
Key Takeaways
-
Positive Trends:
-
The reduction in poor-quality sleep is a clear indicator of improvements in sleep habits or external factors.
-
Gradual increases in fair and good sleep scores suggest consistent progress.
-
-
Areas for Improvement:
-
A large proportion of sleep still falls into the poor and fair categories, highlighting room for improvement in sleep consistency and quality.
-
The goal should be to further shift the distribution towards the good and excellent categories.
-

Activities
We have a detailed dataset of all the activities recorded by the Garmin watch. Let's explore them.
Key Features from Activities Summary
-
Activity Types: Sessions alternate between Strength Training and Running, combining endurance and resistance training.
-
Duration: Sessions range from 1 hour 2 minutes (shortest) to 1 hour 26 minutes (longest).
-
Calories Burned: Calorie expenditure varies widely, from 503 kcal during strength training to 1,042 kcal.
-
Heart Rate (HR): Average HR ranges from 108 to 153 bpm, while Max HR peaks at 168 bpm.
-
Training Effect: Aerobic (AE) and Anaerobic (AN) training effects vary across sessions.
-
Training Load: Training loads are highest during running sessions (up to 228.3) and lower during strength training (as low as 35.7).
Increase in Activities
-
The number of recorded activities grew significantly, from 196 in 2022 to 297 in 2024, reflecting increased consistency.
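Counting activities per year is a simple group-by on the start date. The rows and column names below are placeholders for the real activity export.

```python
import pandas as pd

# Hypothetical activity log.
activities = pd.DataFrame(
    {
        "start_time": pd.to_datetime(
            ["2022-04-02", "2022-09-18", "2023-01-07", "2023-05-21", "2023-11-30", "2024-02-14"]
        ),
        "activity_type": [
            "running", "strength_training", "running",
            "cycling", "running", "strength_training",
        ],
    }
)

# Number of recorded activities in each calendar year.
per_year = activities.groupby(activities["start_time"].dt.year).size()
print(per_year)
```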
Calories Burned
-
2022: 75,917 calories burned—equivalent to about 10.8 kg of weight loss (7,000 calories ≈ 1 kg of fat) or burning off 152 plates of biryani (500 calories per plate)🍛.
-
2023: 101,010 calories burned, approximately 14.4 kg of weight, or 202 plates of biryani🍛🍛.
-
2024: 133,940 calories burned, roughly 19.1 kg of weight, or 268 plates of biryani 🍛🍛🍛!
Over three years, I’ve burned a total of 622 plates of biryani or about 44 kg of fat!🍛🍛🍛🍛🍛
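The conversions are plain arithmetic using the yearly totals and the two rules of thumb quoted above (7,000 kcal per kg of fat, 500 kcal per plate). Summed over the three years, the fat equivalent works out to roughly 44.4 kg.

```python
KCAL_PER_KG_FAT = 7_000  # rule of thumb quoted in the text
KCAL_PER_PLATE = 500     # one plate of biryani, as quoted in the text

calories_by_year = {2022: 75_917, 2023: 101_010, 2024: 133_940}

for year, kcal in calories_by_year.items():
    kg = round(kcal / KCAL_PER_KG_FAT, 1)
    plates = round(kcal / KCAL_PER_PLATE)
    print(f"{year}: {kg} kg of fat, {plates} plates")

total_kcal = sum(calories_by_year.values())
total_kg = total_kcal / KCAL_PER_KG_FAT
total_plates = sum(round(kcal / KCAL_PER_PLATE) for kcal in calories_by_year.values())
print(f"total: {total_kg:.1f} kg of fat, {total_plates} plates")
```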
VO2 Max Shift
-
A clear improvement in VO2 Max is observed, rising from 42.8–47.0 in 2022 to 51.0–54.0 in 2024, indicating enhanced cardiovascular health and endurance.
Distance Covered:
-
2022: 298.5 km, equivalent to traveling from Bangalore to Coorg ⛰️.
-
2023: 788.5 km, close to the distance from Bangalore to Kanyakumari 🌊🏖️.
-
2024: 998.2 km, nearly the same as traveling from Bangalore to Goa and back 🏖️🌊🏖️.
Over three years, I’ve covered a total of about 2,085 km, almost the distance from Bangalore to Delhi. 🛣️🌄🏞️🌌🛣️
Let's break these activities down by type.
It’s clear from the data that running and cycling activities exhibit distinctly higher heart rate patterns compared to other activities. To enhance the analysis, it makes sense to:
-
Group Activities into Buckets:
-
High Heart Rate Activities: Include running and cycling, which show elevated heart rate patterns.
-
Low Heart Rate Activities: Include less intense activities with relatively lower heart rate patterns.
-
-
Remove Erroneous Data:
-
Eliminate any invalid or erroneous entries (e.g., excessively high heart rates or incomplete records) to ensure the dataset remains clean and reliable.
-
This grouping approach will allow us to draw more meaningful insights and simplify comparisons between activity types while maintaining the integrity of the analysis.
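The bucketing and cleanup steps above could be sketched as follows. The rows, the column names, and the 200 bpm plausibility limit (borrowed from the heart rate section) are assumptions for the example.

```python
import pandas as pd

# Hypothetical activity summary; the last row is incomplete and implausible.
activities = pd.DataFrame(
    {
        "activity_type": ["running", "strength_training", "cycling", "walking", "running"],
        "avg_hr": [153.0, 108.0, 141.0, 96.0, None],
        "max_hr": [168.0, 131.0, 159.0, 112.0, 245.0],
    }
)

HIGH_HR_TYPES = {"running", "cycling"}
MAX_PLAUSIBLE_HR = 200  # bpm

# Drop incomplete records and implausibly high readings before grouping.
valid = activities.dropna(subset=["avg_hr", "max_hr"])
valid = valid[valid["max_hr"] <= MAX_PLAUSIBLE_HR].copy()

# Assign each activity to a high or low heart rate bucket by its type.
valid["hr_bucket"] = valid["activity_type"].isin(HIGH_HR_TYPES).map({True: "high", False: "low"})

# Average heart rate per bucket.
bucket_mean = valid.groupby("hr_bucket")["avg_hr"].mean()
print(bucket_mean)
```

Bucketing by activity type keeps the grouping stable even when an individual session has an unusually low or high average.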
Note: the interactive graphs are available on the desktop version.
