endymecy / spark-ml-source-analysis

An analysis of the principles behind the Spark ML algorithms and of their concrete source code implementations

Home Page: https://github.com/endymecy/spark-ml-source-analysis

License: Apache License 2.0

spark machine-learning source-analysis

spark-ml-source-analysis's Introduction

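Spark machine learning: algorithm study and source code analysis

This project introduces the principles behind the algorithms in the Spark ml package and analyzes their code implementations in detail. The aim is to deepen my own understanding of machine learning algorithms and to become familiar with how they are implemented in a distributed setting.

Spark versions supported by this series

Most of the algorithms in this series are based on Spark 1.6.1; a few are based on Spark 2.x.

Directory structure of this series

The contents of this series are as follows:

Notes

Most of the content in this series comes from the Spark source code and the official Spark documentation, and it is not for commercial use. When reposting, please cite the address of this series. All content quoted from others is listed in the references; in case of any infringement, please notify the author by email. Email address: [email protected].

Some articles in this series use LaTeX for mathematical formulas; installing a MathJax plugin in your browser will render them.

My knowledge is limited, so the analysis inevitably contains errors and misunderstandings. Corrections are very welcome and much appreciated.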

License

The license for this work is given in LICENSE.


spark-ml-source-analysis's Issues

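GBDT leaf node feature id

Hello. With the mllib GBDT model, you can get each DecisionTreeModel through the trees member of GradientBoostedTreesModel, get its top Node through the topNode member of DecisionTreeModel, and then read the leaf node ids from the id member of each node.
However, the nodes of the ml GBDT model have no id member, so how can I get the leaf node indices? Are you familiar with this? Could you help me? The relevant code is below: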

    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache.spark.mllib.tree.model.Node

    // Collect the ids of all leaf nodes below `node`, from left to right.
    def getLeafNodes(node: Node): Array[Int] = {
      if (node.isLeaf) {
        Array(node.id)
      } else {
        getLeafNodes(node.leftNode.get) ++ getLeafNodes(node.rightNode.get)
      }
    }

    // `data` is the training set, an RDD[LabeledPoint]; its construction is not shown here.
    val numTrees = 100
    val boostingStrategy = BoostingStrategy.defaultParams("Classification")
    boostingStrategy.numIterations = numTrees // Note: Use more iterations in practice.
    boostingStrategy.treeStrategy.numClasses = 2
    boostingStrategy.treeStrategy.maxDepth = 4
    boostingStrategy.learningRate = 0.01
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

    val gbdtModel = GradientBoostedTrees.train(data, boostingStrategy)

    println("gbdt train ok")
    // print(gbdtModel.toDebugString)

    // Use the GBDT model to get the leaf node features: one array of leaf ids per tree.
    val treeLeafArray: Array[Array[Int]] =
      gbdtModel.trees.map(tree => getLeafNodes(tree.topNode))
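One possible way to approach the question above, sketched under stated assumptions: when nodes carry no id, indices can be assigned during traversal using the heap-style convention that mllib uses (root is 1, the children of node i are 2*i and 2*i + 1). The `SketchNode`, `Leaf`, and `Internal` types below are hypothetical stand-ins so the example runs without Spark; they are not the ml package's classes. In the ml package the analogous traversal would pattern match on its internal and leaf node types instead.

```scala
object LeafIdSketch {
  // Hypothetical stand-in for a tree whose nodes have no id field.
  sealed trait SketchNode
  final case class Leaf(prediction: Double) extends SketchNode
  final case class Internal(left: SketchNode, right: SketchNode) extends SketchNode

  // Assign heap-style indices while walking the tree: the root is 1 and a
  // node with index i has children 2 * i (left) and 2 * i + 1 (right).
  // Returns the indices of the leaves, from left to right.
  // Note: Int indices overflow for trees deeper than about 30 levels.
  def leafIds(node: SketchNode, id: Int = 1): Seq[Int] = node match {
    case Leaf(_)               => Seq(id)
    case Internal(left, right) => leafIds(left, 2 * id) ++ leafIds(right, 2 * id + 1)
  }

  // A small example tree: the root's left child is internal, its right child is a leaf.
  val example: SketchNode = Internal(Internal(Leaf(0.1), Leaf(0.2)), Leaf(0.3))
}
```

Calling `LeafIdSketch.leafIds(LeafIdSketch.example)` walks the left subtree (index 2) before the right leaf (index 3), so the reconstructed indices match what an id field would have held under this numbering convention.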

Question about Random Forest

You mentioned in this section https://github.com/endymecy/spark-ml-source-analysis/blob/master/%E5%88%86%E7%B1%BB%E5%92%8C%E5%9B%9E%E5%BD%92/%E7%BB%84%E5%90%88%E6%A0%91/%E9%9A%8F%E6%9C%BA%E6%A3%AE%E6%9E%97/random-forests.md that there seem to be two scoring methods: predictBySumming and predictByVoting.

But in the ml package, I can only find:

    override protected def predictRaw(features: Vector): Vector = {
      // TODO: When we add a generic Bagging class, handle transform there: SPARK-7128
      // Classifies using majority votes.
      // Ignore the tree weights since all are 1.0 for now.
      val votes = Array.fill[Double](numClasses)(0.0)
      _trees.view.foreach { tree =>
        val classCounts: Array[Double] = tree.rootNode.predictImpl(features).impurityStats.stats
        val total = classCounts.sum
        if (total != 0) {
          var i = 0
          while (i < numClasses) {
            votes(i) += classCounts(i) / total
            i += 1
          }
        }
      }
      Vectors.dense(votes)
    }

Does this mean that the ml random forest only supports scoring by voting?

Thank you in advance.
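To make the question concrete, here is a minimal, Spark-free sketch of what the quoted predictRaw does: each tree contributes the normalized class counts of the leaf it reaches, rather than a single vote. The `hardVotes` function is added purely for contrast and is an illustrative assumption, as are all names in the `ForestVotingSketch` object; neither function is taken from Spark.

```scala
object ForestVotingSketch {
  // Soft voting, as in the quoted code: every tree adds the class
  // distribution (counts / total) of the leaf reached by the sample.
  def softVotes(leafCounts: Seq[Array[Double]], numClasses: Int): Array[Double] = {
    val votes = Array.fill(numClasses)(0.0)
    leafCounts.foreach { counts =>
      val total = counts.sum
      if (total != 0) {
        var i = 0
        while (i < numClasses) {
          votes(i) += counts(i) / total
          i += 1
        }
      }
    }
    votes
  }

  // Hard voting (illustrative contrast): every tree adds exactly one vote
  // for the most frequent class in the leaf reached by the sample.
  def hardVotes(leafCounts: Seq[Array[Double]], numClasses: Int): Array[Double] = {
    val votes = Array.fill(numClasses)(0.0)
    leafCounts.foreach { counts => votes(counts.indexOf(counts.max)) += 1.0 }
    votes
  }

  // Index of the class with the largest accumulated vote.
  def argmax(votes: Array[Double]): Int = votes.indexOf(votes.max)
}
```

The two schemes can disagree. With leaf counts (6, 4), (6, 4), and (0, 10), hard voting favors class 0 by two trees to one, while soft voting favors class 1 because the third tree is far more confident than the first two.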

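How is bernoulliCalculation in NaiveBayesModel computed?

In the naive Bayes article under Classification and Regression, your analysis does not seem to explain how the formula is computed.
The relevant Spark code is in bernoulliCalculation.
I have been studying it for a long time. Thanks in advance.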
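As background for the question above, the following sketch shows the standard Bernoulli naive Bayes log-score for one class, together with an algebraically equivalent rearrangement that reduces to a dot product plus a constant. It assumes the parameters are stored as logarithms (a log prior and per-feature log probabilities) and that features are 0 or 1. All names in `BernoulliNBSketch` are hypothetical, and the code is not taken from Spark.

```scala
object BernoulliNBSketch {
  // Direct form:
  //   logPrior + sum_j [ x_j * log(theta_j) + (1 - x_j) * log(1 - theta_j) ]
  // where logTheta(j) = log(theta_j) and x(j) is 0.0 or 1.0.
  def direct(logPrior: Double, logTheta: Array[Double], x: Array[Double]): Double = {
    var score = logPrior
    for (j <- x.indices) {
      val logNeg = math.log(1.0 - math.exp(logTheta(j))) // log(1 - theta_j)
      score += x(j) * logTheta(j) + (1.0 - x(j)) * logNeg
    }
    score
  }

  // Rearranged form:
  //   logPrior + sum_j x_j * (log(theta_j) - log(1 - theta_j)) + sum_j log(1 - theta_j)
  // The last sum does not depend on x, so it can be precomputed once per class.
  def rearranged(logPrior: Double, logTheta: Array[Double], x: Array[Double]): Double = {
    val logNeg = logTheta.map(v => math.log(1.0 - math.exp(v)))
    val dot = x.indices.map(j => x(j) * (logTheta(j) - logNeg(j))).sum
    logPrior + dot + logNeg.sum
  }
}
```

For example, with prior 0.4, feature probabilities (0.2, 0.7, 0.5), and sample (1, 0, 1), both forms give log(0.4 * 0.2 * 0.3 * 0.5) up to rounding. The rearranged form is convenient because scoring all classes becomes one matrix-vector product plus per-class constants.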


spark-ml-source-analysis's Issues

Most of this covers source code from the mllib library; is there any source code analysis of the ml library?

GBDT leaf node feature id

Question about Random Forest

How is bernoulliCalculation in NaiveBayesModel computed?
