| EDA LOG | ||||
| ID | Date | Description | Observations and Insights | |
| 1 | 6/13/2025 | Context of under 65 focus | 1.
Stroke cases make up 4.87%
of the original dataset. 2. Of the 249 total stroke cases, 159 (≈64%) are patients aged 65 and over—indicating that this age group heavily dominates stroke representation. dominantly represent stroke cases. 3.By excluding patients 65 and over, we reduce the overwhelming effect of age and allow other risk factors to emerge more clearly. 4. After removal, stroke cases under 65 will make up 2.20% of the remaining dataset. |
|
| 2 | 6/13/2025 | Age column number summary analysis | 1.
After data filtering, minimum age is 0.08 and max is 64. 2. No outlier detected using 1.5 x IQR bounds. |
|
| 3 | 6/13/2025 | Average glucose level number summary analysis | 1.
Average glucose level shows large outliers over the 1.5 x IQR bound. 2. Upon research, glucose levels can be as extreme as over 600; thus, outliers were deemed possibly valid and not removed. |
|
| 4 | 6/13/2025 | Bmi column number summary analysis | 1.
The number summary for bmi shows large value outliers when using 1.5 x IQR upper bound. 2. Upon investigation, bmi of 97 or more is rare but possible. Thus, data with large values were preserved. |
|
| 5 | 6/13/2025 | Impute null bmi values | 1.
The bmi column contains large outliers, making the mean an unsuitable choice for imputation. The median value (27.7) was selected as it provides
a more robust and less disruptive replacement for nulls. 2. After imputation, summary statistics such as mean and IQR shifted slightly, but not enough to compromise data integrity or require further adjustment. |
|
| 6 | 6/13/2025 | Bin Continuous Features | Based
on CDC Standards, the age group was binned into: * children(0-17), young adult(18-24), adults(25-34), midlife adults(34-44), older adults(45-54), pre-seniors(55-64) Based on CDC standards, avg_glucose_level was binned into: * hypoglycemic(<70), normal (70-99), pre-diabetic (100-125), diabetic (126-199), and high diabetes (200+) Based on CDC standards, bmi was binned into: * underweight, normal weight, overweight, obesity class 1, obesity class 2, and obesity class 3 |
|
| 7 | 6/13/2025 | Gender Distribution Analysis | 1.
Females represent 58.40% of
the under 65 population and accounts for 53.33% of the stroke cases, the stroke rate is skewed because of the
difference in population representation and not overall risk. 2. Males have a higher stroke rate compared to women within their respective groups (2.47% vs 2.01%). 3. These findings suggest that men under 65 may require more targeted early screening compared to women. |
|
| 8 | 6/13/2025 | Age Group Distribution Analysis | 1.The
age distribution of patients under 65 is fairly even, except for young adults (9.31%), who
represent roughly half the size of other groups. 2. In-group stroke risk increases steadily with age, peaking at 7.05% among pre-seniors. Children (0.23%) and young adults (0%) show very low stroke rates. 3. Pre-seniors (55–64) account for 58.89% of all stroke cases under 65, reinforcing the dominance of age as a stroke risk factor, even before age 65. |
|
| 9 | 6/14/2025 | Engineer Categorical Column Values | 1.
Added hypertension status, heart_disease_status, ever_married_status, and
stroke_status columns. 2. Replaced 0/1 and no/yes values to more interpretable version (no hypertension/hypertension, no heart disease /heart disease, never married/married, no stroke/ had stroke. |
|
| 10 | 6/14/2025 | Work Type Distribution Analysis | 1.
Private job workers make up 59.04% of the population. 2. Self-employed(2.97%) and government job (3.04%) workers have a slightly higher stroke rate compared to private company workers (2.45%). 3. Although private company workers account for 65.56% of stroke cases, this is likely skewed by their large share of the total population. |
|
| 11 | 6/14/2025 | Residence Type Distribution Analysis | 1.
The distribution for both categories are similar (50.37% vs 49.63%) 2. Urban residents (2.33%) have a slightly higher stroke rate than rural residents (2.07%). 3. Urban residents (53.33%) represent a higher stroke occurrence rate than rural residents (46.67%) overall. 4. The trend largely follows the population distribution where urban residents outnumber the rural residents. |
|
| 12 | 6/14/2025 | Marriage History Distribution Analysis | 1.
Married people (59.14%)
account for 20% more than the never
married people (40.86%). 2. Married people have a 3.36% risk of having a stroke, compared to other married people, while people that are never married experience stroke at a rate of 0.54% compared to other never married people. 3. Married individuals account for 90% of all stroke cases in the dataset. |
|
| 13 | 6/14/2025 | Smoking Status Distribution Analysis | 1.
Never smoked and unknown status patients represent roughly 69% of the data. 2. Having a history of smoking (smokes, formerly smokes) showed a higher stroke rate (3.79% and 3.76%). 3. Smoking or formerly smoked shows almost double the rate of stroke occurrence compared to the other groups. 4. Smokes and formerly smokes contributed roughly the same number of stroke occurrences while being roughly half the size of the other two groups. |
|
| 14 | 6/14/2025 | BMI Distribution Analysis | 1.
Patients within the normal and overweight categories make up ~55% of the population. 2. Patients in the normal weight range are 7 to 8 times less likely to experience stroke. 3. Patients that are above normal weight have a stroke rate of around 3%. 4. Patients in the overweight category represent 42.22% of the overall stroke cases while making up 28.47% of the population. |
|
| 15 | 6/15/2025 | Average Glucose Level Distribution Analysis | 1.
Patients with normal average
glucose level account for 48.19% of the population and 36.67% of the total stroke cases. The dominant representation of
this group accounts for the higher stroke case occurrence. 2. Patients with high diabetes is the least represented at 5.59% but it accounts for 18.89% of stroke cases which is the second highest representation of stroke overall. Patients with high diabetes also shows an in group stroke rate of 7.46%, which is more than twice the rate of other groups. |
|
| 16 | 6/15/2025 | Heart Disease Distribution Analysis | 1.
Patients with heart disease only represents 2.35% of the population while accounting for 14.44% of overall stroke
cases. 2. Patients with heart disease have a stroke rate of 13.54% or about 6x more than patients without heart disease (1.93%). |
|
| 17 | 6/15/2025 | Hypertension Distribution Analysis | 1.
Although patients with hypertension only make up 6.57% of the population, they account for 17.78%
of all stroke cases. 2. Patients with hypertension experiences stroke at a rate of 5.97%, which is more than three times higher than the 1.94% stroke rate among patients without hypertension. |
|
| 18 | 6/15/2025 | Chi-Square Test | 1.
The Chi-Square Test is was
used to evaluate the statistical significance of categorical features in relation to the target variable
(stroke). 2. Both gender and residence_type had p-values that were greater than the 0.05 significance threshold, indicating no statistical significant association between these two features and stroke in patients under 65. 3. As a result, gender and residence type will be excluded from further analysis. |
|
| 19 | 6/16/2025 | Mann-Whitney Test | 1.
Excel does not natively have a Mann-Whitney Test so Python was used to conduct this test. 2. Both bmi and average glucose level tested to have a p value that is significantly lower than the 0.05 threshold, indicating that they are both statistically significant predictors of stroke for patients under the age of 65. |
|
| 20 | 6/16/2025 | Top In-Group Risk Factors | The top 5 in group stroke risk
factors are: 1. Heart Disease at 13.54% 2. High Diabetes at 7.46% 3. Pre-Senior age at 7.05% 4. Hypertension at 5.97% 5. Smokes at 3.79% |
|
| 21 | 6/16/2025 | In Group Stroke Rate of Change | 1.
Heart disease has the highest in-group stroke rate at 13.54%. 2. The decline in rate is drastic until the fifth factor (smokes, 3.79%) and then the decline rate slows down. |
|
| 22 | 6/16/2025 | Disproportionate Stroke Burden | The
bar chart shows the following insights: 1. Pre-seniors account for 58.89% of the stroke cases while only being 18.42% of the population under 65. 2. Heart disease accounts for 14.44% of the stroke cases while only being 2.35% of the population. |
|
| 23 | 6/16/2025 | Age - Heart Disease Distribution | 1.
Pre-seniors (55-64) have the highest heart disease rate at 7.71% 2.There is a drastic increase in heart disease diagnosis starting with midlife adults (0.58%), rising to older adults (3.88%), and peaking with pre-seniors (7.71%). |
|
| 24 | 6/16/2025 | Age - Hypertension Distribution | 1.
Pre-seniors have the highest rate of hypertension at
15.56%. 2. The rate of hypertension diagnosis starts increasing drastically in adulthood (25-34). 3. Screening and preventive care for hypertension should begin in the mid-20s, long before hypertension manifests. |
|
| 25 | 6/16/2025 | Age - Diabetes Distribution | 1.
Diabetes affects 1 in
4 (26.60%)
pre-seniors (55-64). 2. Drastic increase in diabetes rate starts at adult age (25-34) at 9.38% and doubles at older adult stage (45-54) at 20.30%. 3. Screening and preventive care plan should be considered for patients starting mid 20s. |
|
| 26 | 6/16/2025 | Age - Smoking Distribution | 1. Stroke risk remains elevated even after quitting, prevention must start early. Former smokers show similar stroke risk to current
smokers, emphasizing that not starting at all is the most effective protection. 3. Most smokers begin before age 35. With smoking history rising sharply between ages 18 and 34, prevention campaigns must target young adults and adolescents before lifetime risk is locked in. |
|