HEALTH METRICS

mailfreda
Aug 26, 2025
3 min read

Updated: Sep 10, 2025

The dataset simulates health metrics for 1,000 individuals, including information on blood pressure, cholesterol levels, BMI, smoking status, and diabetes status. The data was generated randomly, with certain constraints to mimic real-world distributions.

Link to code:

Data_Analysis_Projects/Data_Analysis_MySQL/Health_metrics.sql at Data-cleaning-and-insights · FredaMaree/Data_Analysis_Projects

DATA CLEANING

Load all data from the enhanced_health_data table
Switch to the health_metrics database
Check for duplicate names in the dataset

Note: 5 Names are duplicated, but the data to each name is unique

SELECT

FROM

enhanced_health_data

WHERE

Name IN ('Jessica Jones' , 'Timothy Williams',

'Amy Dixon',

'Steven Webb',

'Nicole Anderson')

Assign row numbers to identify duplicates uniquely
Add a primary key (row_id) for unique identification
ALTER TABLE enhanced_health_data ADD COLUMN row_id INT AUTO_INCREMENT PRIMARY KEY;
SELECT
*
FROM
enhanced_health_data;
Rename duplicate names safely using row numbers to ensure uniqueness
Verify that all previously duplicated names are now unique

DATA EXPLORATION

Compare gender representation as counts and percentages
SELECT
gender,
COUNT(*) AS count_gender,
ROUND((COUNT(*) * 100.0 / (SELECT
COUNT(*)
FROM
enhanced_health_data)),
2) AS percentage
FROM
enhanced_health_data
GROUP BY gender;
Examine minimum and maximum age per gender
Categorize age groups (young, middle_age, elderly)
Count number of people in each age group
Determine percentage distribution of age groups
SELECT
CASE
WHEN age BETWEEN 18 AND 39 THEN 'young'
WHEN age BETWEEN 40 AND 65 THEN 'middle_age'
WHEN age BETWEEN 66 AND 80 THEN 'elderly'
ELSE 'other'
END AS age_group,
COUNT(*) AS count_people,
ROUND( COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2 ) AS percentage
FROM enhanced_health_data
GROUP BY age_group;

HYPERTENSION ANALYSIS

Identify individuals with type 1 and type 2 hypertension
Count of individuals per hypertension classification
Determine percentage of individuals in each hypertension classification
SELECT
CASE
WHEN
`Systolic BP` > 140
OR `Diastolic BP` > 90
THEN
'Type_2_hypertension'
WHEN
`Systolic BP` >= 130
OR `Diastolic BP` >= 80
THEN
'Type_1_hypertension'
ELSE 'No_hypertension'
END AS Hypertension_classification,
COUNT(*) AS count_hypertension,
COUNT(*) * 100 / SUM(Type_2_hypertension
AND Type_1_hypertension)
FROM
enhanced_health_data
GROUP BY Hypertension_classification;

CHOLESTEROL ANALYSIS

Review minimum and maximum cholesterol values
Count individuals with high cholesterol (≥ 240), equals zero
Count individuals with borderline high cholesterol (200–239)
Cholesterol requires further exploration with lipogram counts

BMI AND OBESITY

Review minimum and maximum BMI values
Determine obesity prevalence (BMI ≥ 30)
Count obesity prevalence per gender with percentages
SELECT
Gender,
COUNT(*) AS Obesity_Count,
ROUND(COUNT(*)*100/SUM(COUNT(*)) OVER(),2) as percentage
FROM enhanced_health_data
WHERE BMI >= 30
GROUP BY Gender;

SMOKING STATUS

Count total smokers in the dataset
Count smokers per gender with percentages

SELECT Smoker,Gender,COUNT(*) as Smoker_count,

ROUND(COUNT(*)*100/SUM(COUNT(*)) OVER(),2) as percentage

FROM enhanced_health_data

WHERE Smoker ='True'

GROUP BY Smoker,Gender;

DIABETES STATUS

Count total diabetic individuals and calculate percentages
SELECT
*
FROM
enhanced_health_data;
SELECT Diabetes,COUNT(*) as Diabetes_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM enhanced_health_data
GROUP BY Diabetes;

HEALTH CLASSIFICATION

Count individuals per Health category and gender with percentages
SELECT health,gender, COUNT(*) as health_classification,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM enhanced_health_data
Group by Health,gender
Order by Health;
Health classification for individuals without comorbidities (age < 40, normal BP, BMI < 30, non-smoker, non-diabetic)
SELECT Health,
COUNT(*) AS health_without_comorbitities,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM enhanced_health_data
WHERE age < 40
AND `Systolic BP` <= 140
AND `Diastolic BP` <= 90
AND BMI < 30
AND Smoker = 'False'
AND Diabetes = 'False'
GROUP BY Health
ORDER BY Health;
Health classification for individuals with all comorbidities (age ≥ 40, high BP, BMI ≥ 30, smoker, diabetic)
SELECT Health,
COUNT(*) AS health_with_comorbitities,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM enhanced_health_data
WHERE age >= 40
AND `Systolic BP` > 140
OR `Diastolic BP` > 90
AND BMI >= 30
AND Smoker = 'True'
AND Diabetes = 'True'
GROUP BY Health
ORDER BY Health;
INSIGHTS
1. Gender Representation: Males and females are equally represented in the sample.
2. Age Distribution: All age groups are represented, although the elderly population is smaller compared to the middle-aged and younger groups.
3. Hypertension Prevalence: 17% of individuals in this sample have type 2 hypertension.
4. Obesity Trends: Obesity is slightly more prevalent among females.
5. Smoking Patterns: Both males and females have nearly 50% of individuals who smoke.
6. Diabetes Prevalence: Diabetes is present across both gender groups.
7. Health Classification Without Comorbidities: Individuals with no comorbidities generally fall into the Fair or Good health classification.
8. Health Classification With All Comorbidities: Individuals with all comorbidities show significantly poorer health, highlighting the detrimental impact of multiple coexisting conditions.

DATA CLEANING

DATA EXPLORATION

INSIGHTS