US Household income data
- mailfreda
- Aug 25, 2025
- 2 min read
Data cleaning and exploration
Link to code:

Data cleaning
Look for duplicates
Delete duplicates
DELETE FROM us_project.us_household_income
WHERE row_id IN (
SELECT row_id
FROM (
SELECT row_id,id,
ROW_NUMBER () OVER (PARTITION BY id ORDER BY id) AS row_num
FROM us_project.us_household_income) as duplicates
WHERE row_num>1
);
Update spelling of state names
UPDATE us_project.us_household_income
SET State_Name = 'Georgia'
WHERE State_Name = 'georia'
;
Populate missing data in the place column

SELECT *
FROM us_project.us_household_income
WHERE place = '';
SELECT *
FROM us_project.us_household_income
WHERE county ='Autauga County' ;
UPDATE us_project.us_household_income
SET place = 'Autaugaville'
WHERE COUNTY = 'Autauga County'
AND city ='Vinemont'
;
Look for missing Awater and Aland columns
SELECT DISTINCT Awater,Aland
from us_project.us_household_income
WHERE Awater =0 OR Awater =''OR Awater is NULL
AND Aland =0 OR Aland =''OR Aland is NULL
;
Exploratory data analysis
Landarea per state
SELECT DISTINCT State_Name, SUM(ALand) as total_Aland
FROM us_project.us_household_income
GROUP bY State_Name
ORDER BY total_Aland Desc
;

Comparison of water and land area per state
SELECT DISTINCT State_Name, SUM(Awater) as total_Awater, SUM(Aland) as total_Aland
FROM us_project.us_household_income
GROUP bY State_Name
ORDER BY 3 Desc
;
Top 10 states according to land area
SELECT DISTINCT State_Name, SUM(Awater) as total_Awater, SUM(Aland) as total_Aland
FROM us_project.us_household_income
GROUP bY State_Name
ORDER BY 3 Desc
Limit 10
;

Combing tables with inner join,
Mean of 0 indicates inaccurate reporting, thus filter out
SELECT *
FROM us_project.us_household_income i
INNER JOIN us_project.us_household_income_statistics s
ON i.id=s.id
WHERE mean <>0
;
AVERAGE median and mean income per state
SELECT i.State_name, ROUND(AVG( Mean),1), ROUND(AVG(Median),1)
FROM us_project.us_household_income i
INNER JOIN us_project.us_household_income_statistics s
ON i.id=s.id
WHERE mean <>0
GROUP BY i.State_name
ORDER BY 2 DESC
LIMIT 10
;

Key Insights
Data quality issues (duplicates, misspellings and zero statistics) were systematically identified and corrected.
Land and water distribution varies widely across states, with some states dominating in either category.
States and cities with the highest average mean/median incomes were identified, offering insights into regional income disparities.
Categorical variables such as type also influenced income patterns, with meaningful trends once outliers were excluded.

