Building a Simple Text Sentiment Analysis System in Python


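Sentiment analysis is an important application of natural language processing (NLP), widely used for social media monitoring, product review analysis, and public opinion tracking. This article shows how to build a simple text sentiment analysis system in Python. It covers the basic workflow of data preprocessing, model training, and prediction, and provides the complete code.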

Project Goal

We will build a machine-learning-based binary sentiment classifier that decides whether a piece of English text is positive or negative. The tech stack is:

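- Python 3.x
- scikit-learn: text feature extraction and model training
- nltk: text preprocessing
- pandas and numpy: data handling
- keras: loading the IMDB dataset (used by the code below)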

Environment Setup

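First, install the required libraries. The dataset loader used below comes from Keras, which is installed together with tensorflow:

pip install scikit-learn nltk pandas numpy tensorflow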

The Dataset

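We use the IMDB movie review dataset, a widely used public dataset for sentiment analysis. It contains 50,000 reviews, 25,000 for training and 25,000 for testing, and every review is labeled as positive or negative.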

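Because the dataset is fairly large, we load the preprocessed version that ships with keras.datasets, in which every review is already encoded as a sequence of integer word indices. Passing num_words=10000 keeps only the 10,000 most frequent words: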

from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

Data Preprocessing

4.1 Decoding the Text

The IMDB dataset stores each review as a sequence of integers (one ID per word), so to read a review we need to map it back to text:

word_index = imdb.get_word_index()

# Map word IDs back to words
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    # Indices are offset by 3 because 0, 1, and 2 are reserved
    # for padding, start-of-sequence, and unknown words
    return ' '.join(reverse_word_index.get(i - 3, '?') for i in text)

# Example
print(decode_review(x_train[0]))
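To make the offset of 3 concrete, here is a minimal sketch that decodes a hand-written sequence without downloading anything. The three-word toy_word_index and the toy_decode helper are hypothetical stand-ins for the real IMDB vocabulary and for decode_review; the reserved indices follow the Keras defaults.

```python
# Hypothetical toy index (word -> frequency rank); the real index is far larger.
toy_word_index = {"the": 1, "movie": 2, "great": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def toy_decode(sequence):
    # Indices 0-2 are reserved (padding, start, unknown), so they decode to '?'.
    return " ".join(toy_reverse_index.get(i - 3, "?") for i in sequence)

# 1 = start marker, 4 = "the", 5 = "movie", 2 = unknown word, 6 = "great"
encoded = [1, 4, 5, 2, 6]
print(toy_decode(encoded))  # ? the movie ? great
```

The leading "?" in a decoded review is therefore the start-of-sequence marker, not a missing word.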

4.2 Feature Vectorization

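Raw text is usually vectorized with TfidfVectorizer or CountVectorizer. Because the Keras copy of the dataset is already encoded as integer sequences, we skip those classes here and build a comparable binary bag-of-words (multi-hot) representation directly: each review becomes a 10,000-dimensional vector whose entry j is 1 if word index j occurs in the review and 0 otherwise.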

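import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per review, one column per word index.
    # Note: a dense 25,000 x 10,000 float64 array takes about 2 GB of memory.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set every column whose index appears in the review to 1
        results[i, sequence] = 1.
    return results

x_train_vec = vectorize_sequences(x_train)
x_test_vec = vectorize_sequences(x_test)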

Model Training

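We use logistic regression as the classifier because it performs well on text classification tasks and its learned weights are easy to interpret.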

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train the model
model = LogisticRegression(max_iter=1000)
model.fit(x_train_vec, y_train)

# Predict on the test set
y_pred = model.predict(x_test_vec)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Sample output:

Accuracy: 0.88764
              precision    recall  f1-score   support

           0       0.89      0.89      0.89     12500
           1       0.89      0.88      0.89     12500

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000

Predicting on Custom Text

We can wrap the steps in a function that takes any English sentence and returns its predicted sentiment:

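import re

def predict_sentiment(text, num_words=10000):
    # Lowercase and strip punctuation so that "movie." matches the vocabulary entry "movie"
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    seq = []
    for word in tokens:
        rank = word_index.get(word)
        # Known words are shifted by 3; unknown or out-of-range words map to the unknown index 2
        if rank is not None and rank + 3 < num_words:
            seq.append(rank + 3)
        else:
            seq.append(2)
    vec = vectorize_sequences([seq])
    prediction = model.predict(vec)[0]
    return "Positive" if prediction == 1 else "Negative"

# Test
print(predict_sentiment("I really enjoyed this movie. It was fantastic!"))
print(predict_sentiment("This film was terrible and boring. I hate it."))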

Output:

Positive
Negative
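When feeding new text to the model, the encoding has to match how load_data built the training sequences: real words are shifted by 3, and anything unknown or beyond the num_words cutoff becomes the unknown index 2. Here is a minimal, download-free sketch of that mapping. The toy_word_index, the constants, and toy_encode are hypothetical; the values 2 and 3 follow the Keras defaults.

```python
import re

# Hypothetical toy index (word -> frequency rank), standing in for imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 2, "great": 3, "fantastic": 4}
NUM_WORDS = 6   # keep only encoded indices below this, like num_words=10000
OOV_INDEX = 2   # Keras default index for out-of-vocabulary words
INDEX_FROM = 3  # Keras default offset applied to real words

def toy_encode(text):
    # Lowercase and drop punctuation before the vocabulary lookup.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    encoded = []
    for token in tokens:
        rank = toy_word_index.get(token)
        if rank is None or rank + INDEX_FROM >= NUM_WORDS:
            encoded.append(OOV_INDEX)  # unknown or too rare
        else:
            encoded.append(rank + INDEX_FROM)
    return encoded

print(toy_encode("The movie: FANTASTIC!"))  # [4, 5, 2]
```

Here "fantastic" is in the index but falls outside the cutoff (4 + 3 = 7 is not below 6), so it is treated as unknown. Without that check, an index at or above the vector dimension would raise an IndexError during vectorization.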

Directions for Improvement

The model already reaches a reasonable accuracy, but there is still room for improvement:

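- Try other classical models, such as SVM, random forests, or XGBoost.
- Use deep learning models: LSTM, GRU, and Transformer models can capture semantic information better.
- Improve the word representation: use embeddings such as Word2Vec, GloVe, or BERT.
- Add stop-word filtering and stemming to improve feature quality.
- Tune the num_words parameter to control the vocabulary size and balance speed against accuracy.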
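Two of the directions above, stop-word filtering and swapping in an SVM, can be combined in a few lines with a scikit-learn pipeline. This is only a sketch on a made-up four-sentence dataset. It works on raw strings, so on IMDB it would need the decoded reviews, and nothing here says how much it would change the accuracy reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; 1 = positive, 0 = negative.
texts = [
    "I loved this wonderful film",
    "An excellent and touching story",
    "This was a dull and awful film",
    "A boring, terrible story",
]
labels = [1, 1, 0, 0]

# TF-IDF features with scikit-learn's built-in English stop-word list,
# followed by a linear support vector classifier.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LinearSVC(),
)
pipeline.fit(texts, labels)

print(pipeline.predict(["wonderful story", "awful story"]))
```

Keeping the vectorizer and the classifier in one pipeline means that new text goes through exactly the same preprocessing as the training data.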

Summary

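This article showed how to build a basic sentiment analysis system in Python, covering data preprocessing, feature extraction, model training, and prediction. Using the classic IMDB dataset and a logistic regression model, we reached about 89% accuracy. Although this is only a baseline implementation, it lays a solid foundation for exploring more complex NLP tasks.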

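As the technology develops, sentiment analysis is being applied in more and more scenarios. A natural next step is to build stronger models with a deep learning framework such as TensorFlow or PyTorch.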

Appendix: Complete Code

import re

import numpy as np
from keras.datasets import imdb
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the data
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# Decoding function
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    # Indices are offset by 3 because 0, 1, and 2 are reserved
    return ' '.join(reverse_word_index.get(i - 3, '?') for i in text)

# Vectorization function
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train_vec = vectorize_sequences(x_train)
x_test_vec = vectorize_sequences(x_test)

# Train the model
model = LogisticRegression(max_iter=1000)
model.fit(x_train_vec, y_train)

# Predict and evaluate
y_pred = model.predict(x_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Custom prediction function
def predict_sentiment(text, num_words=10000):
    # Lowercase and strip punctuation so that "movie." matches "movie"
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    seq = []
    for word in tokens:
        rank = word_index.get(word)
        # Known words are shifted by 3; unknown or out-of-range words map to index 2
        if rank is not None and rank + 3 < num_words:
            seq.append(rank + 3)
        else:
            seq.append(2)
    vec = vectorize_sequences([seq])
    prediction = model.predict(vec)[0]
    return "Positive" if prediction == 1 else "Negative"

# Test
print(predict_sentiment("I really enjoyed this movie. It was fantastic!"))
print(predict_sentiment("This film was terrible and boring. I hate it."))

