Disease Prediction System
A machine learning system that predicts diseases from patient symptoms using Decision Tree, Naive Bayes, and KNN classifiers, deployed as an interactive web application.
Dataset
The dataset used is the Disease Prediction Using Machine Learning dataset from Kaggle. It contains binary symptom features mapped to disease labels, making it ideal for multi-class classification.
Dataset Structure
Each row represents a patient. The first 132 columns are binary symptom flags (1 = present, 0 = absent). The last column, prognosis, is the target disease label.
| Column | Type | Description | Example |
|---|---|---|---|
| itching, skin_rash, fever ... | Integer (0/1) | Binary symptom presence flags | 1 or 0 |
| prognosis | String | Target disease label | Dengue, Malaria... |
Sample Diseases Covered
Fungal Infection · Allergy · GERD · Diabetes · Malaria · Dengue · Typhoid · Pneumonia · Heart Attack · Tuberculosis · Jaundice · Chicken Pox · Hypertension · Arthritis · and 27 more.
ML Pipeline
The project follows a standard supervised learning pipeline from raw CSV data to a deployed prediction system.
Training.csv / Testing.csv → drop null columns → use the pre-split train/test sets → train and evaluate the three classifiers → classification report → joblib export
Algorithms Used
Three classification algorithms were implemented and compared. Each algorithm brings a different approach to multi-class disease classification.
The ID3 algorithm builds a tree by recursively selecting the feature with the highest Information Gain at each node.
Entropy: H(S) = -Σ p(x) log₂ p(x)
Information Gain: IG(S, A) = H(S) - Σ_v (|S_v| / |S|) · H(S_v)
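The two quantities above can be sketched in a few lines of plain Python. The helper names here are illustrative, not part of the project code:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p(x) * log2 p(x) over the class distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(S, A) = H(S) minus the weighted entropy of each split S_v."""
    total = len(labels)
    subsets = {}
    for label, value in zip(labels, feature_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum((len(s) / total) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# A symptom that perfectly separates two diseases yields IG equal to H(S):
labels  = ['Dengue', 'Dengue', 'Malaria', 'Malaria']
symptom = [1, 1, 0, 0]
print(information_gain(labels, symptom))  # 1.0
```

ID3 evaluates this gain for every remaining symptom at each node and splits on the winner, which is exactly what `criterion='entropy'` asks scikit-learn's tree builder to do.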
Naive Bayes applies Bayes' theorem with a feature independence assumption. It performs exceptionally well for binary symptom features.
Bayes' rule: P(Disease | Symptoms) ∝ P(Symptoms | Disease) · P(Disease)
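For 0/1 symptom flags the posterior is easy to compute by hand. Note the project itself uses GaussianNB; the Bernoulli-style sketch below (toy data, illustrative names) only serves to make the formula concrete:

```python
from collections import Counter

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(disease) and Laplace-smoothed P(symptom=1 | disease)."""
    priors = {d: n / len(y) for d, n in Counter(y).items()}
    cond = {}
    for d in priors:
        rows = [x for x, label in zip(X, y) if label == d]
        cond[d] = [(sum(col) + alpha) / (len(rows) + 2 * alpha)
                   for col in zip(*rows)]
    return priors, cond

def predict(x, priors, cond):
    """argmax over P(d) * prod P(x_i | d) -- the Naive Bayes rule."""
    def score(d):
        p = priors[d]
        for xi, theta in zip(x, cond[d]):
            p *= theta if xi == 1 else (1 - theta)
        return p
    return max(priors, key=score)

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = ['Flu', 'Flu', 'Dengue', 'Dengue']
priors, cond = fit_bernoulli_nb(X, y)
print(predict([1, 1, 0], priors, cond))  # Flu
```

The "naive" part is the product over symptoms: each symptom is treated as conditionally independent given the disease, which is what makes inference this cheap.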
KNN predicts based on the most similar historical symptom vectors. With K=5, the model votes among the nearest training samples.
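That voting scheme can be sketched directly over binary vectors. Hamming distance stands in for the Euclidean metric scikit-learn defaults to; on 0/1 data it equals the squared Euclidean distance, so it ranks neighbors identically:

```python
from collections import Counter

def knn_predict(x, X_train, y_train, k=5):
    """Vote among the k nearest training symptom vectors."""
    # On 0/1 vectors, Hamming distance is the squared Euclidean
    # distance, so neighbor ordering matches the Euclidean metric.
    dists = sorted(
        (sum(a != b for a, b in zip(x, row)), label)
        for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

X_train = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 1, 0]]
y_train = ['Flu', 'Flu', 'Flu', 'Dengue', 'Dengue', 'Dengue']
print(knn_predict([1, 1, 0], X_train, y_train, k=3))  # Flu
```

Because KNN stores the full training set and searches it at query time, prediction cost grows with the data; this is why it is labeled the slow "lazy learner" in the comparison table below.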
Libraries, Algorithms and Tools Report
This section summarizes the exact stack used in the project's ML pipeline and deployment. The details are generated from the backend, so they stay synchronized with the project code.
Libraries and Frameworks
| Name | Category | Purpose in Project |
|---|---|---|
| pandas | Library | Data loading and dataframe operations |
| numpy | Library | Numerical arrays and vector handling |
| matplotlib | Visualization | Charts and model visualizations |
| seaborn | Visualization | Confusion matrix heatmap |
| scikit-learn | ML Framework | ML models, metrics, and utilities |
| joblib | Tool | Model and feature-list serialization |
| Flask | Web Framework | Web serving and prediction API |
Algorithms Configured
| Algorithm | Configuration | Role |
|---|---|---|
| Decision Tree (ID3) | DecisionTreeClassifier(criterion='entropy', random_state=42) | Interpretable baseline model |
| Gaussian Naive Bayes | GaussianNB() | Primary deployment model |
| K-Nearest Neighbors | KNeighborsClassifier(n_neighbors=5) | Comparison model |
Project Tools
| # | Tool |
|---|---|
| 1 | Google Colab (training notebook execution) |
| 2 | CSV datasets: Training.csv and Testing.csv |
| 3 | Python warnings module for clean output |
| 4 | Flask templates for web report and predictor UI |
Full ML Training Code
The complete code used for training, evaluation, visualization, and model export is shown below. This is the same script content passed from the Flask backend.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib
import warnings
warnings.filterwarnings('ignore')
print('All libraries imported successfully!')
from google.colab import files
print('Upload Training.csv and Testing.csv')
uploaded = files.upload()
# Load datasets
train_df = pd.read_csv('Training.csv').drop(columns=['Unnamed: 133'], errors='ignore')
test_df = pd.read_csv('Testing.csv')
# Strip whitespace from disease names
train_df['prognosis'] = train_df['prognosis'].str.strip()
test_df['prognosis'] = test_df['prognosis'].str.strip()
print('Train shape:', train_df.shape)
print('Test shape :', test_df.shape)
print()
train_df.head()
print('Dataset Info:')
print(f'Total training samples : {len(train_df)}')
print(f'Total testing samples : {len(test_df)}')
print(f'Number of symptoms : {train_df.shape[1] - 1}')
print(f'Number of diseases : {train_df["prognosis"].nunique()}')
print()
print('Missing values:', train_df.isnull().sum().sum())
print()
print('All diseases:')
for i, d in enumerate(sorted(train_df['prognosis'].unique()), 1):
    print(f'{i}. {d}')
# Disease distribution
plt.figure(figsize=(14, 6))
train_df['prognosis'].value_counts().plot(kind='bar', color='steelblue')
plt.title('Disease Distribution in Training Data')
plt.xlabel('Disease')
plt.ylabel('Number of Samples')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
# Top 15 most common symptoms across all records
symptom_cols = train_df.columns[:-1].tolist()
top15_symptoms = train_df[symptom_cols].sum().sort_values(ascending=False).head(15)
plt.figure(figsize=(12, 4))
top15_symptoms.plot(kind='bar', color='seagreen')
plt.title('Top 15 Most Frequently Occurring Symptoms')
plt.xlabel('Symptom')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
symptom_cols = train_df.columns[:-1].tolist()
X_train = train_df[symptom_cols]
y_train = train_df['prognosis']
X_test = test_df[symptom_cols]
y_test = test_df['prognosis']
print('X_train shape:', X_train.shape)
print('X_test shape :', X_test.shape)
print('Unique diseases in train:', y_train.nunique())
# criterion='entropy' makes it ID3
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)
print(f'Decision Tree Accuracy: {dt_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, dt_pred))
# Visualize top 3 levels of the tree
plt.figure(figsize=(22, 8))
plot_tree(
    dt_model,
    feature_names=symptom_cols,
    class_names=dt_model.classes_,
    filled=True,
    max_depth=3,
    fontsize=8
)
plt.title('Decision Tree - Top 3 Levels (ID3 / Entropy)', fontsize=14)
plt.tight_layout()
plt.show()
# Top 15 most important symptoms
importances = pd.Series(dt_model.feature_importances_, index=symptom_cols)
top15 = importances.sort_values(ascending=False).head(15)
plt.figure(figsize=(10, 5))
top15.plot(kind='bar', color='steelblue')
plt.title('Top 15 Most Important Symptoms (Decision Tree)')
plt.ylabel('Importance Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_pred)
print(f'Naive Bayes Accuracy: {nb_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, nb_pred))
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_pred)
print(f'KNN Accuracy: {knn_accuracy * 100:.2f}%')
print()
print(classification_report(y_test, knn_pred))
# Print correct and wrong predictions (Lab 9 requirement)
print('KNN - All Test Predictions:\n')
print(f'{"#":<5} {"Actual Disease":<42} {"Predicted Disease":<42} {"Result"}')
print('-' * 105)
correct = 0
for i, (actual, predicted) in enumerate(zip(y_test, knn_pred), 1):
    result = 'Correct' if actual == predicted else 'Wrong'
    if actual == predicted:
        correct += 1
    print(f'{i:<5} {actual:<42} {predicted:<42} {result}')
print()
print(f'Total Correct: {correct}/{len(y_test)}')
results = pd.DataFrame({
    'Model': ['Decision Tree (ID3)', 'Naive Bayes', 'KNN (k=5)'],
    'Accuracy (%)': [
        round(dt_accuracy * 100, 2),
        round(nb_accuracy * 100, 2),
        round(knn_accuracy * 100, 2)
    ]
})
print('=' * 40)
print('Model Accuracy Comparison')
print('=' * 40)
print(results.to_string(index=False))
print('=' * 40)
# Accuracy bar chart
plt.figure(figsize=(8, 5))
colors = ['steelblue', 'seagreen', 'tomato']
bars = plt.bar(results['Model'], results['Accuracy (%)'], color=colors, width=0.5)
plt.ylim(0, 110)
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy (%)')
for bar, acc in zip(bars, results['Accuracy (%)']):
    plt.text(bar.get_x() + bar.get_width() / 2,
             bar.get_height() + 1,
             f'{acc}%', ha='center', fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()
# Confusion matrix for Decision Tree
cm = confusion_matrix(y_test, dt_pred, labels=dt_model.classes_)
plt.figure(figsize=(18, 14))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=dt_model.classes_,
            yticklabels=dt_model.classes_)
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
def predict_disease(symptoms_input, model, symptom_cols):
    input_vector = [1 if col in symptoms_input else 0 for col in symptom_cols]
    input_df = pd.DataFrame([input_vector], columns=symptom_cols)
    return model.predict(input_df)[0]
# Sample test
sample = ['itching', 'skin_rash', 'nodal_skin_eruptions']
print(f'Input Symptoms : {sample}')
print(f'Decision Tree : {predict_disease(sample, dt_model, symptom_cols)}')
print(f'Naive Bayes : {predict_disease(sample, nb_model, symptom_cols)}')
print(f'KNN : {predict_disease(sample, knn_model, symptom_cols)}')
print('All available symptoms:')
print(symptom_cols)
# Change this list and run!
my_symptoms = ['fever', 'chills', 'joint_pain', 'vomiting']
print(f'\nYour symptoms     : {my_symptoms}')
print(f'Predicted disease : {predict_disease(my_symptoms, dt_model, symptom_cols)}')
joblib.dump(nb_model, 'disease_model.pkl')
joblib.dump(symptom_cols, 'symptom_cols.pkl')
print('disease_model.pkl saved')
print('symptom_cols.pkl saved')
print()
print('Downloading files...')
files.download('disease_model.pkl')
files.download('symptom_cols.pkl')
Results and Evaluation
Best Model - Naive Bayes (100% Accuracy)
All samples in the held-out Testing.csv split were correctly classified.
Model Accuracy Comparison
Detailed Comparison Table
| Algorithm | Accuracy | Training Time | Interpretable | Suitable For |
|---|---|---|---|---|
| Decision Tree (ID3) | 97.6% | Fast | Yes | Rule-based classification |
| Naive Bayes | 100% | Very Fast | Partial | Independent binary features |
| KNN (k=5) | 100% | Slow (lazy learner) | No | Small, well-structured datasets |
Conclusion
This project demonstrates a complete machine learning workflow from data preparation to model training and evaluation for disease prediction.
Naive Bayes emerged as the deployment model: it matched KNN's perfect test accuracy while remaining far cheaper at prediction time, and its independence assumption fits the clearly separable binary symptom features well.