Disease Prediction System
A machine learning system that predicts diseases from patient symptoms using Decision Tree, Naive Bayes, and KNN classifiers, deployed as an interactive web application.
Dataset
The dataset used is the Disease Prediction Using Machine Learning dataset from Kaggle. It contains binary symptom features mapped to disease labels, making it ideal for multi-class classification.
Dataset Structure
Each row represents a patient. The first 132 columns are binary symptom flags (1 = present, 0 = absent). The last column, prognosis, is the target disease label.
| Column | Type | Description | Example |
|---|---|---|---|
| itching, skin_rash, fever ... | Integer (0/1) | Binary symptom presence flags | 1 or 0 |
| prognosis | String | Target disease label | Dengue, Malaria... |
Sample Diseases Covered
Fungal Infection · Allergy · GERD · Diabetes · Malaria · Dengue · Typhoid · Pneumonia · Heart Attack · Tuberculosis · Jaundice · Chicken Pox · Hypertension · Arthritis · and 27 more.
ML Pipeline
The project follows a standard supervised learning pipeline from raw CSV data to a deployed prediction system.
Training.csv / Testing.csv → drop null columns → use the pre-split train/test sets → train and evaluate the three classifiers → classification report → joblib export
Algorithms Used
Three classification algorithms were implemented and compared. Each algorithm brings a different approach to multi-class disease classification.
The ID3 algorithm builds a tree by recursively selecting the feature with the highest Information Gain at each node.
Entropy: H(S) = -Σ p(x) log₂ p(x)
Information Gain: IG(S, A) = H(S) - Σ_v (|S_v| / |S|) · H(S_v)
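The two quantities above can be sketched in a few lines of plain Python. The helper names here are illustrative, not part of the project code:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p(x) * log2 p(x) over the class distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(S, A) = H(S) minus the weighted entropy of each split S_v."""
    total = len(labels)
    subsets = {}
    for label, value in zip(labels, feature_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum((len(s) / total) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# A symptom that perfectly separates two diseases yields IG equal to H(S):
labels  = ['Dengue', 'Dengue', 'Malaria', 'Malaria']
symptom = [1, 1, 0, 0]
print(information_gain(labels, symptom))  # 1.0
```

ID3 evaluates this gain for every remaining symptom at each node and splits on the winner, which is exactly what `criterion='entropy'` asks scikit-learn's tree builder to do.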
Naive Bayes applies Bayes' theorem with a feature independence assumption. It performs exceptionally well for binary symptom features.
Bayes' rule: P(Disease | Symptoms) ∝ P(Symptoms | Disease) · P(Disease)
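For 0/1 symptom flags the posterior is easy to compute by hand. Note the project itself uses GaussianNB; the Bernoulli-style sketch below (toy data, illustrative names) only serves to make the formula concrete:

```python
from collections import Counter

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(disease) and Laplace-smoothed P(symptom=1 | disease)."""
    priors = {d: n / len(y) for d, n in Counter(y).items()}
    cond = {}
    for d in priors:
        rows = [x for x, label in zip(X, y) if label == d]
        cond[d] = [(sum(col) + alpha) / (len(rows) + 2 * alpha)
                   for col in zip(*rows)]
    return priors, cond

def predict(x, priors, cond):
    """argmax over P(d) * prod P(x_i | d) -- the Naive Bayes rule."""
    def score(d):
        p = priors[d]
        for xi, theta in zip(x, cond[d]):
            p *= theta if xi == 1 else (1 - theta)
        return p
    return max(priors, key=score)

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = ['Flu', 'Flu', 'Dengue', 'Dengue']
priors, cond = fit_bernoulli_nb(X, y)
print(predict([1, 1, 0], priors, cond))  # Flu
```

The "naive" part is the product over symptoms: each symptom is treated as conditionally independent given the disease, which is what makes inference this cheap.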
KNN predicts based on the most similar historical symptom vectors. With K=5, the model votes among the nearest training samples.
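That voting scheme can be sketched directly over binary vectors. Hamming distance stands in for the Euclidean metric scikit-learn defaults to; on 0/1 data it equals the squared Euclidean distance, so it ranks neighbors identically:

```python
from collections import Counter

def knn_predict(x, X_train, y_train, k=5):
    """Vote among the k nearest training symptom vectors."""
    # On 0/1 vectors, Hamming distance is the squared Euclidean
    # distance, so neighbor ordering matches the Euclidean metric.
    dists = sorted(
        (sum(a != b for a, b in zip(x, row)), label)
        for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

X_train = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 1, 0]]
y_train = ['Flu', 'Flu', 'Flu', 'Dengue', 'Dengue', 'Dengue']
print(knn_predict([1, 1, 0], X_train, y_train, k=3))  # Flu
```

Because KNN stores the full training set and searches it at query time, prediction cost grows with the data; this is why it is labeled the slow "lazy learner" in the comparison table below.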
Libraries, Algorithms and Tools Report
This section summarizes the exact stack used in the project's ML pipeline and deployment. The details are generated from the backend, so they stay synchronized with the project code.
Libraries and Frameworks
| Name | Category | Purpose in Project |
|---|---|---|
| pandas | Library | Data loading and dataframe operations |
| numpy | Library | Numerical arrays and vector handling |
| matplotlib | Visualization | Charts and model visualizations |
| seaborn | Visualization | Confusion matrix heatmap |
| scikit-learn | ML Framework | ML models, metrics, and utilities |
| joblib | Tool | Model and feature-list serialization |
| Flask | Web Framework | Web serving and prediction API |
Algorithms Configured
| Algorithm | Configuration | Role |
|---|---|---|
| Decision Tree (ID3) | DecisionTreeClassifier(criterion='entropy', random_state=42) | Interpretable baseline model |
| Gaussian Naive Bayes | GaussianNB() | Primary deployment model |
| K-Nearest Neighbors | KNeighborsClassifier(n_neighbors=5) | Comparison model |
Project Tools
| # | Tool |
|---|---|
| 1 | Google Colab (training notebook execution) |
| 2 | CSV datasets: Training.csv and Testing.csv |
| 3 | Python warnings module for clean output |
| 4 | Flask templates for web report and predictor UI |
Full ML Training Code
The complete code used for training, evaluation, visualization, and model export is shown below. This is the same script content passed from the Flask backend.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib
import warnings
warnings.filterwarnings('ignore')
print('All libraries imported successfully!')
from google.colab import files
print('Upload Training.csv and Testing.csv')
uploaded = files.upload()
# Load datasets
train_df = pd.read_csv('Training.csv').drop(columns=['Unnamed: 133'], errors='ignore')
test_df = pd.read_csv('Testing.csv')
# Strip whitespace from disease names
train_df['prognosis'] = train_df['prognosis'].str.strip()
test_df['prognosis'] = test_df['prognosis'].str.strip()
print('Train shape:', train_df.shape)
print('Test shape :', test_df.shape)
print()
train_df.head()
print('Dataset Info:')
print(f'Total training samples : {len(train_df)}')
print(f'Total testing samples : {len(test_df)}')
print(f'Number of symptoms : {train_df.shape[1] - 1}')
print(f'Number of diseases : {train_df["prognosis"].nunique()}')
print()
print('Missing values:', train_df.isnull().sum().sum())
print()
print('All diseases:')
for i, d in enumerate(sorted(train_df['prognosis'].unique()), 1):
    print(f'{i}. {d}')
# Disease distribution
plt.figure(figsize=(14, 6))
train_df['prognosis'].value_counts().plot(kind='bar', color='steelblue')
plt.title('Disease Distribution in Training Data')
plt.xlabel('Disease')
plt.ylabel('Number of Samples')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
# Top 15 most common symptoms across all records
symptom_cols = train_df.columns[:-1].tolist()
top15_symptoms = train_df[symptom_cols].sum().sort_values(ascending=False).head(15)
plt.figure(figsize=(12, 4))
top15_symptoms.plot(kind='bar', color='seagreen')
plt.title('Top 15 Most Frequently Occurring Symptoms')
plt.xlabel('Symptom')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
symptom_cols = train_df.columns[:-1].tolist()
X_train = train_df[symptom_cols]
y_train = train_df['prognosis']
X_test = test_df[symptom_cols]
y_test = test_df['prognosis']
print('X_train shape:', X_train.shape)
print('X_test shape :', X_test.shape)
print('Unique diseases in train:', y_train.nunique())
# criterion='entropy' makes it ID3
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)
print(f'Decision Tree Accuracy: {dt_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, dt_pred))
# Visualize top 3 levels of the tree
plt.figure(figsize=(22, 8))
plot_tree(
    dt_model,
    feature_names=symptom_cols,
    class_names=dt_model.classes_,
    filled=True,
    max_depth=3,
    fontsize=8
)
plt.title('Decision Tree - Top 3 Levels (ID3 / Entropy)', fontsize=14)
plt.tight_layout()
plt.show()
# Top 15 most important symptoms
importances = pd.Series(dt_model.feature_importances_, index=symptom_cols)
top15 = importances.sort_values(ascending=False).head(15)
plt.figure(figsize=(10, 5))
top15.plot(kind='bar', color='steelblue')
plt.title('Top 15 Most Important Symptoms (Decision Tree)')
plt.ylabel('Importance Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_pred)
print(f'Naive Bayes Accuracy: {nb_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, nb_pred))
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_pred)
print(f'KNN Accuracy: {knn_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, knn_pred))
# Print correct and wrong predictions (Lab 9 requirement)
print('KNN - All Test Predictions:\n')
print(f'{"#":<5} {"Actual Disease":<42} {"Predicted Disease":<42} {"Result"}')
print('-' * 105)
correct = 0
for i, (actual, predicted) in enumerate(zip(y_test, knn_pred), 1):
    result = 'Correct' if actual == predicted else 'Wrong'
    if actual == predicted:
        correct += 1
    print(f'{i:<5} {actual:<42} {predicted:<42} {result}')
print()
print(f'Total Correct: {correct}/{len(y_test)}')
results = pd.DataFrame({
    'Model': ['Decision Tree (ID3)', 'Naive Bayes', 'KNN (k=5)'],
    'Accuracy (%)': [
        round(dt_accuracy * 100, 2),
        round(nb_accuracy * 100, 2),
        round(knn_accuracy * 100, 2)
    ]
})
print('=' * 40)
print('Model Accuracy Comparison')
print('=' * 40)
print(results.to_string(index=False))
print('=' * 40)
# Accuracy bar chart
plt.figure(figsize=(8, 5))
colors = ['steelblue', 'seagreen', 'tomato']
bars = plt.bar(results['Model'], results['Accuracy (%)'], color=colors, width=0.5)
plt.ylim(0, 110)
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy (%)')
for bar, acc in zip(bars, results['Accuracy (%)']):
    plt.text(bar.get_x() + bar.get_width() / 2,
             bar.get_height() + 1,
             f'{acc}%', ha='center', fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()
# Confusion matrix for Decision Tree
cm = confusion_matrix(y_test, dt_pred, labels=dt_model.classes_)
plt.figure(figsize=(18, 14))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=dt_model.classes_,
            yticklabels=dt_model.classes_)
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
def predict_disease(symptoms_input, model, symptom_cols):
    input_vector = [1 if col in symptoms_input else 0 for col in symptom_cols]
    input_df = pd.DataFrame([input_vector], columns=symptom_cols)
    return model.predict(input_df)[0]
# Sample test
sample = ['itching', 'skin_rash', 'nodal_skin_eruptions']
print(f'Input Symptoms : {sample}')
print(f'Decision Tree : {predict_disease(sample, dt_model, symptom_cols)}')
print(f'Naive Bayes : {predict_disease(sample, nb_model, symptom_cols)}')
print(f'KNN : {predict_disease(sample, knn_model, symptom_cols)}')
print('All available symptoms:')
print(symptom_cols)
# Change this list and run!
my_symptoms = ['fever', 'chills', 'joint_pain', 'vomiting']
print(f'\nYour symptoms     : {my_symptoms}')
print(f'Predicted disease : {predict_disease(my_symptoms, dt_model, symptom_cols)}')
joblib.dump(nb_model, 'disease_model.pkl')
joblib.dump(symptom_cols, 'symptom_cols.pkl')
print('disease_model.pkl saved')
print('symptom_cols.pkl saved')
print()
print('Downloading files...')
files.download('disease_model.pkl')
files.download('symptom_cols.pkl')
Results and Evaluation
Best Model - Naive Bayes (100% Accuracy)
All samples in the held-out Testing.csv split were correctly classified.
Model Accuracy Comparison
Detailed Comparison Table
| Algorithm | Accuracy | Training Time | Interpretable | Suitable For |
|---|---|---|---|---|
| Decision Tree (ID3) | 97.6% | Fast | Yes | Rule-based classification |
| Naive Bayes | 100% | Very Fast | Partial | Independent binary features |
| KNN (k=5) | 100% | Slow (lazy learner) | No | Small, well-structured datasets |
Conclusion
This project demonstrates a complete machine learning workflow from data preparation to model training and evaluation for disease prediction.
Naive Bayes emerged as the deployment model: it matched KNN's perfect test accuracy while remaining far cheaper at prediction time, and its independence assumption fits the clearly separable binary symptom features well.