Exploring Machine Learning: Building a Simple Model with Scikit-Learn

corpvision.fun

February 21, 2025

Machine learning (ML) is a powerful tool in the field of data science and artificial intelligence. It allows computers to learn from data, identify patterns, and make decisions with minimal human intervention. In this guide, we will explore the fundamentals of machine learning and walk through the process of building a simple model using Scikit-Learn, a widely used library in Python for machine learning.

What is Machine Learning?

Machine learning is a subset of artificial intelligence that focuses on the development of algorithms that can learn from and make predictions or decisions based on data. It generally falls into three main categories:

Supervised Learning: The model is trained on labeled data, meaning that both the input data and the expected output (labels) are provided.
Unsupervised Learning: The model is trained on unlabeled data, and it tries to find patterns and relationships in the data.
Reinforcement Learning: The model learns by interacting with its environment and receiving feedback based on its actions.
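
To make the first two categories concrete, here is a minimal sketch on a tiny made-up dataset (the points and labels are invented for illustration). It shows that a supervised estimator receives labels in fit(), while an unsupervised one receives only the inputs. Reinforcement learning is left out because Scikit-Learn does not cover it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: two well-separated groups of 2-D points
X_toy = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                  [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]])
y_toy = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model

# Supervised: fit() receives both the inputs and the labels
clf = LogisticRegression().fit(X_toy, y_toy)
supervised_pred = clf.predict([[1.0, 1.0], [8.0, 8.0]])

# Unsupervised: fit() receives only the inputs and groups them on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = km.labels_

print("Supervised predictions:", supervised_pred)
print("Cluster assignments:", cluster_ids)
```

The cluster numbers themselves are arbitrary; only the grouping matters.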

Building a Simple Model with Scikit-Learn

In this example, we will create a simple supervised machine learning model to predict whether a person has diabetes based on some health metrics. We will use the Pima Indians Diabetes dataset, a public dataset originally distributed through the UCI Machine Learning Repository and loaded here from a GitHub mirror.

Steps Involved:

Install Required Libraries
Load the Dataset
Explore the Data
Preprocess the Data
Split the Data into Training and Testing Sets
Build a Machine Learning Model
Evaluate the Model
Make Predictions

Step 1: Install Required Libraries

First, ensure you have pandas, scikit-learn, and matplotlib libraries installed. You can install them using pip:

pip install pandas scikit-learn matplotlib

Step 2: Load the Dataset

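We will load the Pima Indians Diabetes Database. This dataset contains 768 records of female patients, each with eight health measurements and a binary outcome (1 = diabetes, 0 = no diabetes). The file has no header row, so we supply the column names ourselves.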

import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, header=None, names=column_names)

# Display the first few rows of the dataset
print(data.head())

Step 3: Explore the Data

Before modeling, check for missing values, look at summary statistics, and see how balanced the two outcome classes are.

# Check for missing values
print(data.isnull().sum())

# Summary statistics
print(data.describe())

# Count of outcomes
print(data['Outcome'].value_counts())
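
One caveat: isnull() only counts NaN entries, so it reports nothing when gaps are stored as zeros, which is how this dataset records them (see Step 4). The sketch below uses a hypothetical three-row frame with made-up values in place of `data` to show how counting zeros per column exposes those placeholders.

```python
import pandas as pd

# Hypothetical three-row frame standing in for `data`; values are made up
sample = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BloodPressure': [72, 66, 0],
    'Insulin': [0, 0, 94],
    'BMI': [33.6, 26.6, 0.0],
})

# isnull() finds nothing because the gaps are stored as zeros, not NaN
null_counts = sample.isnull().sum()

# Counting zeros per column exposes the placeholders
zero_counts = (sample == 0).sum()

print(null_counts)
print(zero_counts)
```

On the real dataset, the same check would be `(data == 0).sum()`. Zeros in Pregnancies and Outcome are legitimate values, so only the measurement columns need attention.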

Step 4: Preprocess the Data

Preprocessing usually means handling missing values and, for some models, scaling the features. In this dataset, missing measurements in several columns are recorded as zeros, so we convert those zeros to NaN and then fill them with the column mean. Logistic regression often benefits from scaled features, but this walkthrough keeps the raw scale for simplicity.

import numpy as np

# Columns where a zero is not a plausible measurement and marks a missing value
cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace zeros with NaN in those columns only (assigning back avoids
# the unreliable inplace=True on a column selection)
data[cols_with_missing] = data[cols_with_missing].replace(0, np.nan)

# Fill NaN with the mean of the respective column
data = data.fillna(data.mean())
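
Filling with the mean of the full dataset, as above, lets information from rows that later land in the test set influence training. A common alternative is to split first and learn the means from the training rows only. The sketch below illustrates this with Scikit-Learn's SimpleImputer on a hypothetical toy frame; the values and the names `toy`, `train_filled`, and `test_filled` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: NaN marks a missing measurement
toy = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, 89.0, 137.0, 116.0, 78.0, np.nan],
    'BMI': [33.6, 26.6, np.nan, 28.1, 43.1, 25.6, 31.0, 35.3],
})

train, test = train_test_split(toy, test_size=0.25, random_state=42)

# Learn the column means from the training rows only
imputer = SimpleImputer(strategy='mean')
train_filled = pd.DataFrame(imputer.fit_transform(train),
                            columns=toy.columns, index=train.index)

# Reuse those training means to fill gaps in the test rows
test_filled = pd.DataFrame(imputer.transform(test),
                           columns=toy.columns, index=test.index)

print(train_filled)
print(test_filled)
```

For a small tutorial dataset the difference is usually minor, but the habit matters once evaluation numbers are used to compare models.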

Step 5: Split the Data into Training and Testing Sets

We will divide the data into training (80%) and testing (20%) sets.

from sklearn.model_selection import train_test_split

# Features and target variable
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
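
When the classes are imbalanced, a purely random split can leave the test set with a different share of positive cases than the training set. As a hedged sketch, the `stratify` parameter of train_test_split keeps the class proportions the same in both parts. The labels below are made up (70 negatives, 30 positives) and `X_demo` is a placeholder feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives, 30 positives
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo preserves the 70/30 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share, train:", y_tr.mean())
print("Positive share, test:", y_te.mean())
```

In the walkthrough above, the equivalent change would be adding `stratify=y` to the existing call.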

Step 6: Build a Machine Learning Model

We will use logistic regression, a simple linear classifier, as our model.

from sklearn.linear_model import LogisticRegression

# Create a logistic regression model; a generous max_iter gives the solver
# room to converge on unscaled features
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)
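
Step 4 mentions feature scaling, but the model above is trained on the raw features. One way to add scaling, sketched here under the assumption that standardization suits the data, is to chain StandardScaler and the classifier in a pipeline so that the scaler is fitted on the training data only. The example runs on synthetic data from make_classification rather than the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (8 columns, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() standardizes the training data, then trains the classifier on it;
# score() and predict() reuse the training mean and variance
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scaled_model.fit(X_tr, y_tr)

test_accuracy = scaled_model.score(X_te, y_te)
print(f"Accuracy on the synthetic test set: {test_accuracy:.2f}")
```

With the real data, `scaled_model.fit(X_train, y_train)` would take the place of the plain model.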

Step 7: Evaluate the Model

To evaluate how well our model is performing, we will check its accuracy and other performance metrics.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

# Classification Report
report = classification_report(y_test, y_pred)
print('Classification Report:')
print(report)
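
To read the confusion matrix, it helps to unpack its four cells. The sketch below uses made-up labels and predictions for ten patients (not results from the model above) and derives precision and recall by hand, then compares them with Scikit-Learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions for ten patients
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 4 2 1 3

# Precision: of the patients flagged as diabetic, how many really are
precision = tp / (tp + fp)  # 3 / 5 = 0.6
# Recall: of the patients who are diabetic, how many were flagged
recall = tp / (tp + fn)     # 3 / 4 = 0.75

print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```

In a medical screening setting, recall for the positive class is often the number to watch, since a false negative means a missed case.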

Step 8: Make Predictions

We can use the trained model to make predictions on new data.

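# Example for new patient data (replace with actual data)
# Wrapping the row in a DataFrame keeps the feature names the model was trained with
new_data = pd.DataFrame([[2, 85, 66, 29, 0, 26.6, 0.351, 31]], columns=X.columns)
prediction = model.predict(new_data)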
print('Predicted Outcome (0 = No Diabetes, 1 = Diabetes):', prediction[0])
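
A hard 0/1 answer hides how confident the model is. As a sketch, predict_proba returns the estimated probability of each class, which also makes it possible to apply a threshold other than the default 0.5. The example uses synthetic data, and the threshold of 0.3 is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the diabetes data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = LogisticRegression(max_iter=200).fit(X_demo, y_demo)

# One row per sample, one column per class, ordered as in demo_model.classes_
proba = demo_model.predict_proba(X_demo[:3])
print(np.round(proba, 3))

# Flag a sample as positive when P(class 1) reaches a custom threshold
threshold = 0.3  # hypothetical value; lower thresholds flag more cases
flagged = (proba[:, 1] >= threshold).astype(int)
print(flagged)
```

With the model from this guide, the equivalent call would be `model.predict_proba(new_data)`.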

Conclusion

In this guide, we explored the fundamental concepts of machine learning and built a simple predictive model using Scikit-Learn. We went through the entire machine learning pipeline, from loading and preprocessing data to training the model and evaluating its performance.

Next Steps

Experiment with Different Algorithms: Try using decision trees, random forests, or support vector machines.
Tune Hyperparameters: Use techniques such as grid search to find the best hyperparameters for your model.
Cross-Validation: Implement cross-validation techniques to get a better assessment of your model’s performance.
Feature Engineering: Explore different feature engineering techniques to improve your model.
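
Two of these steps, cross-validation and hyperparameter tuning, can be sketched in a few lines. The example below is an illustration on synthetic data, not a tuned setup for the diabetes dataset: it scores a scaled logistic regression with 5-fold cross-validation, then uses GridSearchCV to try a few values of the regularization strength C. The candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")

# Grid search: pipeline parameters are addressed as <step name>__<parameter>
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_['logisticregression__C'])
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

Because the search itself uses cross-validation, a separate test set is still needed for a final, unbiased estimate.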

Further Learning Resources

Books:
- “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien Géron
- “Python Machine Learning” by Sebastian Raschka and Vahid Mirjalili
Online Courses:
- Coursera: Machine Learning by Andrew Ng
- Udacity: Intro to Machine Learning with Python

With this foundational knowledge, you’re now ready to dive deeper into the exciting field of machine learning! Happy coding!