Data Analysis and Visualization with Python: A Hands-On Guide from Scratch


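In today's data-driven world, data analysis and visualization have become essential skills across industries. Whether the goal is a business decision, scientific research, or AI development, understanding the patterns and trends behind the data is critical. Python is powerful and easy to learn, and its rich ecosystem of libraries (such as NumPy, Pandas, Matplotlib, and Seaborn) has made it a leading tool in data science.

This article walks you through a complete data analysis and visualization project, covering:

- Data loading and cleaning
- Exploratory data analysis (EDA)
- Data visualization
- Interpreting and summarizing the results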

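We will use the public Iris flower dataset, which is available on Kaggle and also ships with scikit-learn; the code below loads it through scikit-learn's load_iris. The dataset contains 150 samples from three iris species, each described by four features (sepal length, sepal width, petal length, petal width) and a species label.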

Environment Setup

First, make sure the following Python libraries are installed:

pip install numpy pandas matplotlib seaborn scikit-learn

Step 1: Import the Required Libraries and Load the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Build a DataFrame with the integer label and a readable species name
df = pd.DataFrame(X, columns=feature_names)
df['species'] = y
df['species_name'] = df['species'].map({i: name for i, name in enumerate(target_names)})

print(df.head())

Sample output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species  species_name
0                5.1               3.5                1.4               0.2        0      setosa
1                4.9               3.0                1.4               0.2        0      setosa
2                4.7               3.2                1.3               0.2        0      setosa
3                4.6               3.1                1.5               0.2        0      setosa
4                5.0               3.6                1.4               0.2        0      setosa
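As an alternative to assembling the DataFrame by hand, reasonably recent scikit-learn versions (0.23 and later, which is an assumption about your environment) can return the data as a pandas object directly. A minimal sketch:

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of arrays.
bunch = load_iris(as_frame=True)

# bunch.frame holds the four feature columns plus an integer "target" column.
frame = bunch.frame

# Add a readable species name, mirroring the species_name column built above.
frame["species_name"] = frame["target"].map(dict(enumerate(bunch.target_names)))

print(frame.head())
```

Note that the label column is called target here rather than species, so later snippets in this article would need that name adjusted.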

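Step 2: Data Cleaning and Preprocessing

Although Iris is a clean dataset, real projects usually require handling missing values, outliers, and similar problems.

# Check for missing values
print(df.isnull().sum())

# Check how many samples each species has
print(df['species_name'].value_counts())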

Sample output:

sepal length (cm)     0
sepal width (cm)      0
petal length (cm)     0
petal width (cm)      0
species               0
species_name          0
dtype: int64
setosa        50
versicolor    50
virginica     50
Name: species_name, dtype: int64

The data has no missing values and the classes are evenly balanced, so it is ready for further analysis.
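Since Iris needs no cleaning, here is a small sketch of what handling missing values and outliers might look like on messier data. The toy DataFrame, the median fill, and the 1.5 × IQR rule are illustrative assumptions, not steps required for Iris:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one obviously extreme value.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, np.nan, 1.5, 1.4, 9.9],
    "petal_width": [0.2, 0.2, 0.3, 0.2, 0.2, 0.2],
})

# Fill missing values with each column's median (one simple, common choice).
filled = toy.fillna(toy.median())

# Flag outliers in one column with the 1.5 * IQR rule.
q1 = filled["petal_length"].quantile(0.25)
q3 = filled["petal_length"].quantile(0.75)
iqr = q3 - q1
is_outlier = (
    (filled["petal_length"] < q1 - 1.5 * iqr)
    | (filled["petal_length"] > q3 + 1.5 * iqr)
)

# Keep only the rows that were not flagged.
cleaned = filled[~is_outlier]
print(cleaned)
```

Whether to fill, drop, or keep such values depends on the data and the question being asked; the rule used here is only one convention.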

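Step 3: Exploratory Data Analysis (EDA)

3.1 Descriptive Statistics

print(df.describe())

This prints the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.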
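One detail worth noting: df.describe() also summarizes the integer-coded species column, whose mean and standard deviation have no real meaning. A minimal sketch that restricts the summary to the four measurement columns (the name feature_summary is just illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target

# Select only the four measurement columns before summarizing,
# so the integer class label is left out of the statistics.
feature_summary = df[iris.feature_names].describe()
print(feature_summary)
```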

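3.2 Correlation Between Features

# Drop the last two columns (species and species_name) and keep the four features
corr = df.iloc[:, :-2].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()

The heatmap shows a very strong positive correlation between petal length and petal width (a Pearson coefficient of about 0.96).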
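To make the heatmap values less of a black box, the Pearson coefficient for one pair of features can be computed by hand. This is a sketch; the helper name pearson is hypothetical, and the column positions 2 and 3 assume scikit-learn's feature order (petal length, then petal width):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
petal_length, petal_width = X[:, 2], X[:, 3]


def pearson(a, b):
    """Pearson r: covariance of a and b divided by the product of their spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)


r = pearson(petal_length, petal_width)
print(round(r, 3))
```

The result should match the corresponding cell of df.corr(), which uses the Pearson method by default.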

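Step 4: Data Visualization

4.1 Comparing Features Across Species

Box plots let us compare how each feature differs between species.

# Draw one box plot per feature, grouped by species
for feature in feature_names:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x='species_name', y=feature, data=df)
    plt.title(f'{feature} by Species')
    plt.show()

These plots show that setosa's petals are clearly smaller than those of the other two species.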
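The visual impression from the box plots can be checked numerically. The sketch below compares petal length ranges per species; the variable names are illustrative:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species_name"] = [iris.target_names[i] for i in iris.target]

# Minimum, median, and maximum petal length for each species.
ranges = df.groupby("species_name")["petal length (cm)"].agg(["min", "median", "max"])
print(ranges)

# Does the largest setosa petal stay below the smallest petal of the other species?
setosa_max = ranges.loc["setosa", "max"]
others_min = ranges.drop(index="setosa")["min"].min()
print(setosa_max < others_min)
```

If the final comparison is true, petal length alone is enough to tell setosa apart from the other two species.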

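4.2 Scatter Plots of Feature Pairs

sns.pairplot(data=df, hue='species_name', vars=feature_names)
plt.suptitle("Pairwise Scatter Plots of Features", y=1.02)
plt.show()

This grid plots every pair of features against each other and colors the points by species, which makes it easy to see how the samples cluster.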

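Step 5: Dimensionality Reduction for Visualization (PCA)

To show the structure of the data more directly, we can use principal component analysis (PCA) to project it into two dimensions.

from sklearn.decomposition import PCA

# Project the four features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['species_name'] = df['species_name']

plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='species_name', data=df_pca, palette='Set1')
plt.title('PCA of Iris Dataset')
plt.grid(True)
plt.show()

In the PCA plot, setosa forms a clearly separate cluster, while versicolor and virginica sit next to each other with some overlap. The three classes remain largely separable in two dimensions, which suggests the original features separate the species well.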
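How much information survives the projection can be checked with the fitted model's explained_variance_ratio_ attribute. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit a two-component PCA on the four original features.
pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component.
ratios = pca.explained_variance_ratio_
print(ratios)
print("Total variance retained:", ratios.sum())
```

A total close to 1 means the two-dimensional plot loses little of the variance in the original four-dimensional data.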

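Step 6: Analyzing and Summarizing the Results

This analysis leads to the following conclusions:

- High data quality: the Iris dataset has no missing values and the classes are balanced.
- Correlated features: petal length and petal width in particular are highly correlated.
- Good class separability: the visualizations show a clear tendency for the species to separate in feature space.
- Effective dimensionality reduction with PCA: compressing the four-dimensional data to two dimensions still preserves the class structure well.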

Next Steps

If you are interested in machine learning, a natural next step is to build a classifier with Scikit-Learn, such as KNN or SVM, to predict the species.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a 3-nearest-neighbors classifier and evaluate it on the held-out set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
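A single train/test split can give a noisy estimate on a dataset of only 150 samples. As a sketch of one possible refinement, 5-fold cross-validation can compare a few values of n_neighbors; the candidate values below are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for several neighborhood sizes.
scores = {}
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

for k, score in scores.items():
    print(f"k={k}: mean accuracy {score:.3f}")
```

For a classifier, cross_val_score uses stratified folds by default, so each fold keeps the three species in roughly equal proportion.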

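Summary

This article walked through a complete workflow for analyzing and visualizing the Iris dataset with Python. From loading and cleaning the data, through exploratory analysis, to pair plots and dimensionality reduction, it covered several key steps of data science work. Hopefully it helps you build a systematic approach to data analysis and sparks your interest in data processing with Python.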

Appendix: Complete Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Build DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['species'] = y
df['species_name'] = df['species'].map({i: name for i, name in enumerate(target_names)})
print(df.head())

# Data inspection
print(df.isnull().sum())
print(df['species_name'].value_counts())
print(df.describe())

# Feature correlation (the four feature columns only)
corr = df.iloc[:, :-2].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()

# Boxplots
for feature in feature_names:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x='species_name', y=feature, data=df)
    plt.title(f'{feature} by Species')
    plt.show()

# Pair plots
sns.pairplot(data=df, hue='species_name', vars=feature_names)
plt.suptitle("Pairwise Scatter Plots of Features", y=1.02)
plt.show()

# PCA visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['species_name'] = df['species_name']
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='species_name', data=df_pca, palette='Set1')
plt.title('PCA of Iris Dataset')
plt.grid(True)
plt.show()

# Classification model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

If you have any questions or suggestions, feel free to leave a comment!
