Machine Learning - Project Report

Disease Prediction System

A machine learning system that predicts diseases from patient symptoms using Decision Tree, Naive Bayes, and KNN classifiers, deployed as an interactive web application.

Subject: Machine Learning Lab
Algorithms: Decision Tree · Naive Bayes · KNN
Best Accuracy: 100% (Naive Bayes)
Deployment: Flask + HTML Web App

Dataset

The dataset used is the Disease Prediction Using Machine Learning dataset from Kaggle. It contains binary symptom features mapped to disease labels, making it ideal for multi-class classification.

4,920 Training Samples · 42 Testing Samples · 132 Symptom Features · 41 Disease Classes

Dataset Structure

Each row represents one patient. The first 132 columns are binary symptom flags (1 = present, 0 = absent); the last column, prognosis, is the target disease label.

Column                         Type           Description                     Example
itching, skin_rash, fever ...  Integer (0/1)  Binary symptom presence flags   1 or 0
prognosis                      String         Target disease label            Dengue, Malaria...
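The layout above can be illustrated with a miniature stand-in (3 of the 132 symptom columns, made-up values, not real patient records):

```python
import pandas as pd

# Miniature of the dataset layout: binary symptom flags plus a 'prognosis' label.
df = pd.DataFrame([
    {'itching': 1, 'skin_rash': 1, 'fever': 0, 'prognosis': 'Fungal Infection'},
    {'itching': 0, 'skin_rash': 0, 'fever': 1, 'prognosis': 'Malaria'},
])

symptom_cols = [c for c in df.columns if c != 'prognosis']
assert df[symptom_cols].isin([0, 1]).all().all()  # every symptom flag is 0 or 1
```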

Sample Diseases Covered

Fungal Infection · Allergy · GERD · Diabetes · Malaria · Dengue · Typhoid · Pneumonia · Heart Attack · Tuberculosis · Jaundice · Chicken Pox · Hypertension · Arthritis · and 27 more.

ML Pipeline

The project follows a standard supervised learning pipeline from raw CSV data to a deployed prediction system.

Load CSV (Training.csv, Testing.csv) → Preprocess (strip whitespace, drop null columns) → Split (pre-split train/test sets) → Train Models (DT · NB · KNN) → Evaluate (accuracy, classification report) → Save Model (PKL via joblib)
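The stages above can be sketched end-to-end on a tiny synthetic stand-in for the real Training.csv (3 symptoms, 2 diseases, invented values, and Naive Bayes as the single example model):

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import joblib

# Load: synthetic stand-in for pd.read_csv('Training.csv')
train_df = pd.DataFrame({
    'itching':   [1, 1, 0, 0],
    'fever':     [0, 0, 1, 1],
    'chills':    [0, 1, 1, 1],
    'prognosis': ['Fungal Infection ', 'Fungal Infection', 'Malaria ', 'Malaria'],
})

# Preprocess: strip whitespace from disease names
train_df['prognosis'] = train_df['prognosis'].str.strip()

# Split: features vs. target (the real data ships pre-split into train/test files)
symptom_cols = [c for c in train_df.columns if c != 'prognosis']
X_train, y_train = train_df[symptom_cols], train_df['prognosis']

# Train -> Evaluate -> Save
model = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_train, model.predict(X_train))
joblib.dump(model, 'disease_model.pkl')
```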

Algorithms Used

Three classification algorithms were implemented and compared. Each algorithm brings a different approach to multi-class disease classification.

01
Decision Tree - ID3 Algorithm
Primary Model

The ID3 algorithm builds a tree by recursively selecting the feature with the highest Information Gain at each node.

Entropy: H(S) = −Σₓ p(x) log₂ p(x)

Information Gain: IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)
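Both formulas can be checked numerically with a short helper (a sketch for illustration, not project code):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum p(x) log2 p(x) over the class proportions in labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, feature):
    """IG(S, A) = H(S) - sum_v (|S_v|/|S|) * H(S_v) for one feature column."""
    gain = entropy(labels)
    for v in set(feature):
        subset = [l for l, f in zip(labels, feature) if f == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# A symptom that perfectly separates two diseases recovers the full entropy:
labels  = ['Malaria', 'Malaria', 'Dengue', 'Dengue']
symptom = [1, 1, 0, 0]
print(entropy(labels))                    # 1.0
print(information_gain(labels, symptom))  # 1.0
```

ID3 picks the feature with the largest such gain at each node, then recurses on each branch.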

Criterion: entropy (ID3)
Random State: 42
Library: sklearn.tree
Accuracy: 97.6%
02
Naive Bayes Classifier
Best Model

Naive Bayes applies Bayes' theorem with a feature independence assumption. It performs exceptionally well for binary symptom features.

Bayes: P(Disease | Symptoms) is proportional to P(Symptoms | Disease) * P(Disease)

Variant: GaussianNB
Assumption: Feature Independence
Library: sklearn.naive_bayes
Accuracy: 100%
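The proportionality rule can be worked by hand on a toy example. The likelihoods below are Bernoulli-style and entirely made up for clarity; the project itself uses sklearn's GaussianNB on the 0/1 features:

```python
import numpy as np

# P(disease) priors and P(symptom_i = 1 | disease) for 3 symptoms (invented numbers)
prior = {'Malaria': 0.5, 'Dengue': 0.5}
p_symptom_given_disease = {
    'Malaria': [0.9, 0.8, 0.1],
    'Dengue':  [0.6, 0.2, 0.7],
}
x = [1, 1, 0]  # observed symptom vector

scores = {}
for disease, probs in p_symptom_given_disease.items():
    # independence assumption: multiply per-symptom likelihoods
    likelihood = np.prod([p if xi else 1 - p for p, xi in zip(probs, x)])
    scores[disease] = prior[disease] * likelihood  # proportional to the posterior

print(max(scores, key=scores.get))  # Malaria
```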
03
K-Nearest Neighbours (KNN)
Comparison Model

KNN predicts based on the most similar historical symptom vectors. With K=5, the model votes among the nearest training samples.

K value: 5
Distance: Euclidean
Library: sklearn.neighbors
Accuracy: 100%
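The vote can be written out directly (a from-scratch sketch on toy vectors; the project uses sklearn's KNeighborsClassifier):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training rows by Euclidean distance."""
    dists = np.linalg.norm(np.asarray(X_train, float) - np.asarray(x, float), axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# toy 3-symptom vectors (invented, not the real dataset)
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = ['Dengue', 'Dengue', 'Dengue', 'Malaria', 'Malaria', 'Malaria']
print(knn_predict(X, y, [1, 1, 0], k=5))  # Dengue
```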

Libraries, Algorithms and Tools Report

This section summarizes the exact stack used in the project's ML pipeline and deployment. The details are generated from the Flask backend so they stay synchronized with the project code.

Libraries and Frameworks

Name Category Purpose in Project
pandas Library Data loading and dataframe operations
numpy Library Numerical arrays and vector handling
matplotlib Visualization Charts and model visualizations
seaborn Visualization Confusion matrix heatmap
scikit-learn ML Framework ML models, metrics, and utilities
joblib Tool Model and feature-list serialization
Flask Web Framework Web serving and prediction API

Algorithms Configured

Algorithm Configuration Role
Decision Tree (ID3) DecisionTreeClassifier(criterion='entropy', random_state=42) Interpretable baseline model
Gaussian Naive Bayes GaussianNB() Primary deployment model
K-Nearest Neighbors KNeighborsClassifier(n_neighbors=5) Comparison model
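The configurations in the table can be instantiated directly (a sketch of the setup, not the training script itself):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# the three configurations compared in the project
models = {
    'Decision Tree (ID3)': DecisionTreeClassifier(criterion='entropy', random_state=42),
    'Naive Bayes':         GaussianNB(),
    'KNN (k=5)':           KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    print(name, '->', type(model).__name__)
```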

Project Tools

# Tool
1 Google Colab (training notebook execution)
2 CSV datasets: Training.csv and Testing.csv
3 Python warnings module for clean output
4 Flask templates for web report and predictor UI
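The Flask predictor side can be sketched as a single JSON endpoint. The route name, payload shape, and the stub predictor below are assumptions for illustration; the real app loads disease_model.pkl and symptom_cols.pkl via joblib instead of the stand-ins:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

SYMPTOM_COLS = ['itching', 'skin_rash', 'fever']  # stand-in for symptom_cols.pkl

def predict_disease(symptoms):
    # stand-in for model.predict on a 0/1 vector built from SYMPTOM_COLS;
    # the real app calls the deserialized GaussianNB model here
    return 'Fungal Infection' if 'itching' in symptoms else 'Unknown'

@app.route('/predict', methods=['POST'])
def predict():
    symptoms = request.get_json().get('symptoms', [])
    return jsonify({'disease': predict_disease(symptoms)})
```

Served with `flask run`, a POST of `{"symptoms": ["itching"]}` to `/predict` returns the predicted disease as JSON.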

Full ML Training Code

The complete code used for training, evaluation, visualization, and model export is shown below. This is the same script content passed from the Flask backend.

ml_training_full_script.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import joblib
import warnings
warnings.filterwarnings('ignore')

print('All libraries imported successfully!')

from google.colab import files
print('Upload Training.csv and Testing.csv')
uploaded = files.upload()

# Load datasets
train_df = pd.read_csv('Training.csv').drop(columns=['Unnamed: 133'], errors='ignore')
test_df  = pd.read_csv('Testing.csv')

# Strip whitespace from disease names
train_df['prognosis'] = train_df['prognosis'].str.strip()
test_df['prognosis']  = test_df['prognosis'].str.strip()

print('Train shape:', train_df.shape)
print('Test shape :', test_df.shape)
print()
train_df.head()

print('Dataset Info:')
print(f'Total training samples : {len(train_df)}')
print(f'Total testing samples  : {len(test_df)}')
print(f'Number of symptoms     : {train_df.shape[1] - 1}')
print(f'Number of diseases     : {train_df["prognosis"].nunique()}')
print()
print('Missing values:', train_df.isnull().sum().sum())
print()
print('All diseases:')
for i, d in enumerate(sorted(train_df['prognosis'].unique()), 1):
    print(f'{i}. {d}')

# Disease distribution
plt.figure(figsize=(14, 6))
train_df['prognosis'].value_counts().plot(kind='bar', color='steelblue')
plt.title('Disease Distribution in Training Data')
plt.xlabel('Disease')
plt.ylabel('Number of Samples')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Top 15 most common symptoms across all records
symptom_cols = train_df.columns[:-1].tolist()
top15_symptoms = train_df[symptom_cols].sum().sort_values(ascending=False).head(15)

plt.figure(figsize=(12, 4))
top15_symptoms.plot(kind='bar', color='seagreen')
plt.title('Top 15 Most Frequently Occurring Symptoms')
plt.xlabel('Symptom')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

symptom_cols = train_df.columns[:-1].tolist()

X_train = train_df[symptom_cols]
y_train = train_df['prognosis']

X_test  = test_df[symptom_cols]
y_test  = test_df['prognosis']

print('X_train shape:', X_train.shape)
print('X_test shape :', X_test.shape)
print('Unique diseases in train:', y_train.nunique())

# criterion='entropy' makes it ID3
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt_model.fit(X_train, y_train)

dt_pred     = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

print(f'Decision Tree Accuracy: {dt_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, dt_pred))

# Visualize top 3 levels of the tree
plt.figure(figsize=(22, 8))
plot_tree(
    dt_model,
    feature_names=symptom_cols,
    class_names=dt_model.classes_,
    filled=True,
    max_depth=3,
    fontsize=8
)
plt.title('Decision Tree - Top 3 Levels (ID3 / Entropy)', fontsize=14)
plt.tight_layout()
plt.show()

# Top 15 most important symptoms
importances = pd.Series(dt_model.feature_importances_, index=symptom_cols)
top15 = importances.sort_values(ascending=False).head(15)

plt.figure(figsize=(10, 5))
top15.plot(kind='bar', color='steelblue')
plt.title('Top 15 Most Important Symptoms (Decision Tree)')
plt.ylabel('Importance Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

nb_pred     = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_pred)

print(f'Naive Bayes Accuracy: {nb_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, nb_pred))

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

knn_pred     = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_pred)

print(f'KNN Accuracy: {knn_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, knn_pred))

# Print correct and wrong predictions (Lab 9 requirement)
print('KNN - All Test Predictions:\n')
print(f'{"#":<5} {"Actual Disease":<42} {"Predicted Disease":<42} {"Result"}')
print('-' * 105)
correct = 0
for i, (actual, predicted) in enumerate(zip(y_test, knn_pred), 1):
    result = 'Correct' if actual == predicted else 'Wrong'
    if actual == predicted:
        correct += 1
    print(f'{i:<5} {actual:<42} {predicted:<42} {result}')
print()
print(f'Total Correct: {correct}/{len(y_test)}')

results = pd.DataFrame({
    'Model': ['Decision Tree (ID3)', 'Naive Bayes', 'KNN (k=5)'],
    'Accuracy (%)': [
        round(dt_accuracy * 100, 2),
        round(nb_accuracy * 100, 2),
        round(knn_accuracy * 100, 2)
    ]
})

print('=' * 40)
print('Model Accuracy Comparison')
print('=' * 40)
print(results.to_string(index=False))
print('=' * 40)

# Accuracy bar chart
plt.figure(figsize=(8, 5))
colors = ['steelblue', 'seagreen', 'tomato']
bars = plt.bar(results['Model'], results['Accuracy (%)'], color=colors, width=0.5)
plt.ylim(0, 110)
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy (%)')
for bar, acc in zip(bars, results['Accuracy (%)']):
    plt.text(bar.get_x() + bar.get_width() / 2,
             bar.get_height() + 1,
             f'{acc}%', ha='center', fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()

# Confusion matrix for Decision Tree
cm = confusion_matrix(y_test, dt_pred, labels=dt_model.classes_)
plt.figure(figsize=(18, 14))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=dt_model.classes_,
            yticklabels=dt_model.classes_)
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

def predict_disease(symptoms_input, model, symptom_cols):
    input_vector = [1 if col in symptoms_input else 0 for col in symptom_cols]
    input_df = pd.DataFrame([input_vector], columns=symptom_cols)
    return model.predict(input_df)[0]

# Sample test
sample = ['itching', 'skin_rash', 'nodal_skin_eruptions']
print(f'Input Symptoms : {sample}')
print(f'Decision Tree  : {predict_disease(sample, dt_model, symptom_cols)}')
print(f'Naive Bayes    : {predict_disease(sample, nb_model, symptom_cols)}')
print(f'KNN            : {predict_disease(sample, knn_model, symptom_cols)}')

print('All available symptoms:')
print(symptom_cols)

# Change this list and run!
my_symptoms = ['fever', 'chills', 'joint_pain', 'vomiting']

print(f'\nYour symptoms     : {my_symptoms}')
print(f'Predicted disease : {predict_disease(my_symptoms, dt_model, symptom_cols)}')

joblib.dump(nb_model, 'disease_model.pkl')
joblib.dump(symptom_cols, 'symptom_cols.pkl')

print('disease_model.pkl saved')
print('symptom_cols.pkl saved')
print()
print('Downloading files...')
files.download('disease_model.pkl')
files.download('symptom_cols.pkl')

Results and Evaluation

100%

Best Model Accuracy - Naive Bayes

All 42 test samples in the pre-split benchmark set were correctly classified.

Model Accuracy Comparison

Naive Bayes: 100%
KNN (k=5): 100%
Decision Tree (ID3): 97.6%

Detailed Comparison Table

Algorithm Accuracy Training Time Interpretable Suitable For
Decision Tree (ID3) 97.6% Fast Yes Rule-based classification
Naive Bayes 100% Very Fast Partial Independent binary features
KNN (k=5) 100% Instant (lazy learner; cost shifts to prediction) No Small, well-structured datasets

Conclusion

This project demonstrates a complete machine learning workflow from data preparation to model training and evaluation for disease prediction.

Naive Bayes emerged as the best model for this dataset because of clear symptom separability and efficient probabilistic inference.

3 Algorithms Implemented · 100% Best Accuracy Achieved · 41 Diseases Predicted · 132 Features Used