So, I am a rookie to machine learning and Spark, and I was going through Spark MLlib's documentation on regression, especially LinearRegressionWithSGD, at this page. I am having a bit of difficulty understanding the Python code. Here is what I have understood so far: the code loads the data and then forms a LabeledPoint. After that, the model is built, and then it is evaluated on the training data and the MSE is calculated.
Now, the part that is confusing me is that in the normal machine learning process we first divide the data into a training set and a test set. Then we build the model using the training set and finally evaluate it using the test set. In the code in Spark MLlib's documentation, I do not see any division into training and test sets. On top of that, I see them building the model using the data and then evaluating it using that same data.
Is there something in the code that I am not understanding? Any help in understanding the code would be appreciated.
NOTE: This is the code from Spark MLlib's documentation page for LinearRegressionWithSGD:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)

# Build the model
model = LinearRegressionWithSGD.train(parsedData)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))

# Save and load model
model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")

Best answer
The procedure you are talking about is cross-validation (or, in its simplest form, a train/test split). As you observed, the example above doesn't do any such validation, but that doesn't mean it's wrong.
The sole purpose of that example is to illustrate how to train and use a model. You are free to split the data and cross-validate the model; the procedure stays the same, only the data changes.
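For instance, here is a minimal sketch of the same pipeline with a held-out test set, assuming the parsedData RDD from the question; randomSplit is part of the RDD API, and the 70/30 ratio and seed are arbitrary illustrative choices:

# Split the parsed data before training; everything else is unchanged
trainData, testData = parsedData.randomSplit([0.7, 0.3], seed=42)

model = LinearRegressionWithSGD.train(trainData)

# Evaluate on the held-out test set instead of the training data
valuesAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
testMSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
print("Test Mean Squared Error = " + str(testMSE))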
In addition, performance on the training set is also valuable: it can tell you whether your model is overfitting or underfitting.
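A rough way to see this, assuming the trainData/testData split from the sketch above (the mse helper below is hypothetical, not an MLlib function):

def mse(rdd, model):
    # Mean squared error of a model over an RDD of LabeledPoints
    return rdd.map(lambda p: (p.label - model.predict(p.features)) ** 2).mean()

trainMSE = mse(trainData, model)
testMSE = mse(testData, model)
# A test MSE far above the training MSE suggests overfitting;
# both being high suggests underfitting.
print("Train MSE = %s, Test MSE = %s" % (trainMSE, testMSE))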
So, to summarize: the example is all right; what you need is another example, one that does cross-validation.
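In case it helps, here is a rough sketch of manual k-fold cross-validation with the RDD API (as far as I know, the RDD-based MLlib API has no built-in cross-validator, so the folds are built by hand; the fold count and seed are arbitrary):

k = 5
folds = parsedData.randomSplit([1.0 / k] * k, seed=42)

errors = []
for i in range(k):
    test = folds[i]
    # The union of the remaining folds forms the training set
    train = sc.union([folds[j] for j in range(k) if j != i])
    model = LinearRegressionWithSGD.train(train)
    foldMSE = test.map(lambda p: (p.label - model.predict(p.features)) ** 2).mean()
    errors.append(foldMSE)

print("Average cross-validated MSE = " + str(sum(errors) / k))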