基于机器学习对猪肉价格预测
猪肉价格预测
支持向量机回归
随机森林回归
MLP神经网络回归
问题背景
“猪粮安天下”,生猪自古以来便在国计民生中占据着重要地位,猪肉是我国城乡居民“菜篮子”中不可或缺的产品。但从 2018 年非洲猪瘟爆发以来,生猪产业遭到巨大冲击,生猪市场价格波动频繁,不仅给养殖者造成巨大的经济损失,也给广大消费者造成了很大困扰。2020 年新冠肺炎疫情突袭,再次对逐步恢复的生猪产业产生一定不利影响。
(本文指标选取有待商榷,仅仅做着玩)
导入数据
# 安装库专用
# 通过如下命令设定镜像
options(repos = '')
# 查看镜像是否修改
getOption('repos')
# 尝试下载R包
#若有需要,进行安装
#install.packages('h2o')
‘’
#设置工作路径
setwd("D:/LengPY")
#导入数据
library(readxl)
data<- read_excel("liudata.xlsx")
head(data)
A tibble: 6 × 11
日期活猪仔猪去骨牛肉带骨羊肉豆粕小麦麸玉米育肥猪综合饲料非洲猪瘟新冠疫情
<dttm><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><lgl><lgl>
2006-01-01 7.56 9.89 18.60 19.09 2.64 1.24 1.26 1.81 NA NA 2006-02-01 7.11 9.48 18.65 18.76 2.75 1.24 1.27 1.83 NA NA 2006-03-01 6.68 8.85 18.37 18.25 2.69 1.23 1.28 1.83 NA NA 2006-04-01 6.21 7.82 18.33 18.41 2.60 1.21 1.28 1.82 NA NA 2006-05-01 5.96 6.98 18.31 18.35 2.56 1.21 1.34 1.84 NA NA 2006-06-01 6.08 6.84 18.32 18.23 2.54 1.20 1.39 1.86 NA NA
data1<-data[,2:9]
head(data1)
A tibble: 6 × 8
活猪仔猪去骨牛肉带骨羊肉豆粕小麦麸玉米育肥猪综合饲料
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
7.56 9.89 18.60 19.09 2.64 1.24 1.26 1.81 7.11 9.48 18.65 18.76 2.75 1.24 1.27 1.83 6.68 8.85 18.37 18.25 2.69 1.23 1.28 1.83 6.21 7.82 18.33 18.41 2.60 1.21 1.28 1.82 5.96 6.98 18.31 18.35 2.56 1.21 1.34 1.84 6.08 6.84 18.32 18.23 2.54 1.20 1.39 1.86
## 可视化特征之间的相关系数
library(corrplot)
cor <- cor(data1)
corrplot.mixed(cor,tl.col="black",tl.pos = "lt",
tl.cex = 0.8,number.cex = 0.7)
可根据相关系数结果,对变量相关性进行初步探索。
一、支持向量机
## 支持向量机回归模型
library(e1071)
library(caret)
library(Metrics)
library(readr)
system.time(
svmreg <- svm(活猪~.,data =data1,kernel = "radial")
)
user system elapsed
0.02 0.00 0.02
summary(svmreg)
Call:
svm(formula = 活猪 ~ ., data = data1, kernel = "radial")
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 1
gamma: 0.1428571
epsilon: 0.1
Number of Support Vectors: 111
set.seed(123)
index <- sample(nrow(data1),round(0.7*nrow(data1)))
train_data <- data1[index,]
test_data <-data1[-index,]
#查看训练集维度
dim(train_data)
128
8
## 在训练集上的误差
train_pre <- predict(svmreg,train_data)
train_mae <- mae(train_data$活猪,train_pre)
sprintf("训练集上的绝对值误差: %f",train_mae)
‘训练集上的绝对值误差: 1.007821’
test_pre <- predict(svmreg,test_data)
test_mae <- mae(train_data$活猪,test_pre)
sprintf("测试集上的绝对值误差: %f",test_mae)
‘测试集上的绝对值误差: 6.022539’
data<-data.frame(train_data$活猪,train_pre)
预测j效果良好,测试集误差较大。
total_pre <- predict(svmreg,data1[2:8])
total_mae <- mse(data1$活猪,total_pre)
sprintf("全部上的绝对值误差: %f",total_mae)
‘全部上的绝对值误差: 1.651221’
输出对应预测结果:
data5<-data.frame(total_pre,data1$活猪)
colnames(data5)<-c('预测','实际')
data5
A data.frame: 183 × 2
预测实际
<dbl><dbl>
1
7.081429 7.56 2
7.326892 7.11 3
7.125876 6.68 4
6.871381 6.21 5
6.957324 5.96 6
7.221186 6.08 7
7.400246 6.47 8
7.658352 7.17 9
7.709680 7.84 10
7.599010 7.93 11
7.786543 8.33 12
8.612998 9.18 13
8.892967 9.55 14
9.065229 9.20 15
9.556271 8.91 16
9.566126 9.02 17
10.184165 10.20 18
11.079956 11.37 19
12.172093 13.12 20
13.362894 14.27 21
14.095941 13.60 22
13.871400 13.21 23
14.403343 14.13 ...
... ... 171
34.98679 35.93 172
34.07102 33.41 173
34.44107 29.50 174
34.54299 30.97 175
33.63167 35.73 176
32.10181 36.50 177
29.42965 35.20 178
31.75152 30.93 179
32.81464 29.23 180
32.59779 32.40 181
29.80336 35.17 182
29.01780 31.43 183
29.59932 27.90
二、随机森林
library(readr)
library(VIM)
library(rpart)
library(rpart.plot)
library(Metrics)
library(ROCR)
library(tidyr)
library(randomForest)
library(ggRandomForests)
set.seed(123)
index <- sample(nrow(data1),round(0.7*nrow(data1)))
train_data <- data1[index,]
test_data <-data1[-index,]
rfreg <- randomForest(活猪~.,data = train_data,ntree=500)
summary(rfreg)
Length Class Mode
call
4 -none- call
type
1 -none- character
predicted
128 -none- numeric
mse
500 -none- numeric
rsq
500 -none- numeric
oob.times
128 -none- numeric
importance
7 -none- numeric
importanceSD
0 -none- NULL
localImportance 0 -none- NULL
proximity
0 -none- NULL
ntree
1 -none- numeric
mtry
1 -none- numeric
forest
11 -none- list
coefs
0 -none- NULL
y
128 -none- numeric
test
0 -none- NULL
inbag
0 -none- NULL
terms
3 terms call
## 可视化模型随着树的增加误差OOB的变化
par(family = "STKaiti")
plot(rfreg,type = "l",col = "red",main = "随机森林回归")
可发现,在trees数量 30时即可获得较好结果。
## 使用ggrandomforest包可视化误差
plot(gg_error(rfreg))+labs(title = "随机森林回归")
## 可视化变量的重要性
importance(rfreg)
A matrix: 7 × 1 of type dbl
IncNodePurity
仔猪
1938.7910 去骨牛肉
1623.8613 带骨羊肉
1522.7004 豆粕
143.6410 小麦麸
517.8845 玉米
215.0726 育肥猪综合饲料
406.2389
varImpPlot(rfreg,pch = 20, main = "Importance of Variables")
#可发现仔猪、牛、羊肉对猪肉价格影响较大,因为互为替代品或前提(仔猪),故结果较为合理。
## 对测试集进行预测,并计算 Mean Squared Error
rfpre <- predict(rfreg,test_data)
sprintf("均方根误差为: %f",mse(test_data$活猪,rfpre))
‘均方根误差为: 1.646604’
## 参数搜索,寻找合适的 mtry参数,训练更好的模型
## Tune randomForest for the optimal mtry parameter
set.seed(1234)
rftune <- tuneRF(x = test_data[,2:8],y = test_data$活猪,
stepFactor=1.5,ntreeTry = 500)
mtry = 2 OOB error = 2.785136
Searching left ...
Searching right ...
mtry = 3 OOB error = 2.63388
0.05430821 0.05
mtry = 4 OOB error = 2.531242
0.03896842 0.05
print(rftune)
mtry OOBError
2 2 2.785136
3 3 2.633880
4 4 2.531242
## OOBError误差最小的mtry参数为6
## 建立优化后的随机森林回归模型
rfregbest <- randomForest(活猪~.,data = train_data,ntree=500,mtry = 6)
## 可视化两种模型随着树的增加误差OOB的变化
rfregerr <- as.data.frame(plot(rfreg))
colnames(rfregerr) <- "rfregerr"
rfregbesterr <- as.data.frame(plot(rfregbest))
colnames(rfregerr) <- "rfregerr"
rfregbesterr <- as.data.frame(plot(rfregbest))
colnames(rfregbesterr) <- "rfregbesterr"
plotrfdata <- cbind.data.frame(rfregerr,rfregbesterr)
plotrfdata$ntree <- 1:nrow(plotrfdata)
plotrfdata <- gather(plotrfdata,key = "Type",value = "Error",1:2)
ggplot(plotrfdata,aes(x = ntree,y = Error))+
geom_line(aes(linetype = Type,colour = Type),size = 0.9)+
theme(legend.position = "top")+
ggtitle("随机森林回归模型")+
theme(plot.title = element_text(hjust = 0.5))
## 使用优化后的随机森林回归模型,对测试集进行预测,并计算 Mean Squared Error
rfprebest <- predict(rfregbest,test_data[,2:8])
sprintf("优化后均方根误差为: %f",mse(test_data$活猪,rfprebest))
‘优化后均方根误差为: 1.660115’
## 使用优化后的随机森林回归模型,对测试集进行预测,并计算 Mean Squared Error
#全部数据
total <- predict(rfregbest,data1[,2:8])
sprintf("优化后均方根误差为: %f",mse(data1$活猪,total))
‘优化后均方根误差为: 0.773364’
#预测结果
totalpre<-data.frame(data1$活猪,total)
totalpre
<tr><th scope=row>172</th><td>33.41</td><td>33.70117</td></tr>
<tr><th scope=row>173</th><td>29.50</td><td>31.42532</td></tr>
<tr><th scope=row>174</th><td>30.97</td><td>32.10734</td></tr>
<tr><th scope=row>175</th><td>35.73</td><td>34.39926</td></tr>
<tr><th scope=row>176</th><td>36.50</td><td>35.47128</td></tr>
<tr><th scope=row>177</th><td>35.20</td><td>34.27110</td></tr>
<tr><th scope=row>178</th><td>30.93</td><td>32.03528</td></tr>
<tr><th scope=row>179</th><td>29.23</td><td>31.95318</td></tr>
<tr><th scope=row>180</th><td>32.40</td><td>32.44584</td></tr>
<tr><th scope=row>181</th><td>35.17</td><td>33.15130</td></tr>
<tr><th scope=row>182</th><td>31.43</td><td>32.34892</td></tr>
<tr><th scope=row>183</th><td>27.90</td><td>32.95256</td></tr>
A data.frame: 183 × 2
data1.活猪total
<dbl><dbl>
1
7.56 7.745717 2
7.11 7.473767 3
6.68 6.802603 4
6.21 6.484864 5
5.96 6.340616 6
6.08 6.334053 7
6.47 6.463621 8
7.17 7.035551 9
7.84 6.900711 10
7.93 6.985457 11
8.33 7.875041 12
9.18 9.065969 13
9.55 9.289838 14
9.20 9.244121 15
8.91 9.259116 16
9.02 9.250085 17
10.20 9.805307 18
11.37 10.590388 19
13.12 13.012034 20
14.27 13.206428 21
13.60 13.403372 22
13.21 13.176563 23
14.13 13.715155 24
15.46 15.510558
## 数据准备
index <- order(test_data$活猪)
X <- sort(index)
Y1 <- test_data$活猪[index]
rfpre2 <- rfpre[index]
rfprebest2 <- rfprebest[index]
plotdata <- data.frame(X = X,Y1 = Y1,rfpre =rfpre2,rfprebest = rfprebest2)
plotdata <- gather(plotdata,key="model",value="value",c(-X))
## 可视化模型的预测误差
ggplot(plotdata,aes(x = X,y = value))+
geom_line(aes(linetype = model,colour = model),size = 0.8)+
theme(legend.position = c(0.1,0.8),
plot.title = element_text(hjust = 0.5))+
ggtitle("随机森林回归模型")
可发现预测结果良好,在猪瘟 时期价格预测偏差较大,可能 是未量化猪瘟i直接影响,故 可考虑进行改进。
三、 MLP神经网络
library(RSNNS)
set.seed(123)
index <- sample(nrow(data1),round(0.7*nrow(data1)))
train_data <- data1[index,]
test_data <-data1[-index,]
dim(train_data)
## 数据max-min归一化到0-1之间
data1[,2:8] <- normalizeData(data1[,2:8],"0_1")
## 猪肉价归一化到0-1之间,并获取归一化参数
price <- normalizeData(data1$活猪,type = "0_1")
NormParameters <- getNormParameters(price)
128
8
head(data1)
A tibble: 6 × 8
活猪仔猪去骨牛肉带骨羊肉豆粕小麦麸玉米育肥猪综合饲料
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
7.56 0.029987219 0.0046354825 0.013367831 0.2395437 0.07368421 0.000000000 0.000000000 7.11 0.025956150 0.0053378283 0.008624407 0.2813688 0.07368421 0.006451613 0.011049724 6.68 0.019762069 0.0014046917 0.001293661 0.2585551 0.06315789 0.012903226 0.011049724 6.21 0.009635237 0.0008428150 0.003593503 0.2243346 0.04210526 0.012903226 0.005524862 5.96 0.001376462 0.0005618767 0.002731062 0.2091255 0.04210526 0.051612903 0.016574586 6.08 0.000000000 0.0007023458 0.001006181 0.2015209 0.03157895 0.083870968 0.027624309
hist(price)
## 数据切分
set.seed(123)
datasplist <- splitForTrainingAndTest(data1[,2:8],price,
ratio = 0.3)
## MLP回归模型
system.time(
mlpreg <- mlp(datasplist$inputsTrain,datasplist$targetsTrain,maxit = 200,
size = c(100,50,100,50),learnFunc = "Rprop",
inputsTest=datasplist$inputsTest,
targetsTest = datasplist$targetsTest,
metric = "RSME")
)
Warning message in snnsObject$setUnitName(num, iNames[[i]]):
"SNNS error message in setUnitName : SNNS-Kernel Error: Symbol pattern invalid (must match [A-Za-z][^|, ]*)"
Warning message in snnsObject$setUnitName(num, iNames[[i]]):
"SNNS error message in setUnitName : SNNS-Kernel Error: Symbol pattern invalid (must match [A-Za-z][^|, ]*)"
Warning message in snnsObject$setUnitName(num, iNames[[i]]):
"SNNS error message in setUnitName : SNNS-Kernel Error: Symbol pattern invalid (must match [A-Za-z][^|, ]*)"
Warning message in snnsObject$setUnitName(num, iNames[[i]]):
"SNNS error message in setUnitName : SNNS-Kernel Error: Symbol pattern invalid (must match [A-Za-z][^|, ]*)"
Warning message in snnsObject$setUnitName(num, iNames[[i]]):
"SNNS error message in setUnitName : SNNS-Kernel Error: Symbol pattern invalid (must match [A-Za-z][^|, ]*)"
Warning message in snnsObject$setUnitName(num, iNames[[i]]):
"SNNS error message in setUnitName : SNNS-Kernel Error: Symbol pattern invalid (must match [A-Za-z][^|, ]*)"
Warning message in snnsObject$setUnitName(num, iNames[[i]]):
"SNNS error message in setUnitName : SNNS-Kernel Error: Symbol pattern invalid (must match [A-Za-z][^|, ]*)"
user system elapsed
13.86 0.00 14.25
## 可视化模型的效果
plotIterativeError(mlpreg,main = "MLP Iterative Erro")
plotRegressionError(datasplist$targetsTrain,mlpreg$fitted.values,
main="MLP train fit")
plotRegressionError(datasplist$targetsTest,mlpreg$fittedTestValues,
main="MLP test fit")
## 在训练集上的误差
mlppre_train <- denormalizeData(mlpreg$fitted.values,NormParameters)
mlp_train <- denormalizeData(datasplist$targetsTrain,NormParameters)
train_mae <- mae(mlp_train,mlppre_train)
sprintf("训练集上的绝对值误差: %f",train_mae)
‘训练集上的绝对值误差: 0.456312’
mlppre_test <- denormalizeData(mlpreg$fittedTestValues,NormParameters)
mlp_test <- denormalizeData(datasplist$targetsTest,NormParameters)
test_mae <- mae(mlp_test,mlppre_test)
sprintf("测试集上的绝对值误差: %f",test_mae)
‘测试集上的绝对值误差: 6.967324’
测试集误差较大,需加以改进。
如对猪瘟及新冠疫情进行加入分析。结合RNN等方法,获得更精准结果。