
knn_rec_mahout's Introduction

Hi! In this project I've used Apache Mahout (the mr package) to perform the tasks explained below:

1_ Computing Precision@N, Recall@N, RMSE and MAE of recommendations computed by Mahout's brute-force algorithm:

To accomplish that with Mahout, we create a user-based recommender that uses the NearestNUserNeighborhood class from the org.apache.mahout.cf.taste.impl.neighborhood package, which is exactly a brute-force computation of the k-NN graph (the k being N, the neighborhood size), to make recommendations:

UserNeighborhood neighborhood = new NearestNUserNeighborhood(k, similarity, fold.getTraining());

For finding similar users, I've used Jaccard similarity: I created an AbstractUserSimilarity class in the package org.apache.mahout.cf.taste.impl.similarity that implements the UserSimilarity interface, then, in the same package, a JaccardUserSimilarity class that extends AbstractUserSimilarity and overrides the userSimilarity(long userID1, long userID2) method to compute the Jaccard similarity between two users, respecting the threshold fixed by the class constructor. We then invoke the class with:

UserSimilarity similarity = new JaccardUserSimilarity(fold.getTraining(), threshold);
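
For concreteness, here is a minimal sketch of what the overridden userSimilarity method could look like; this is an assumed illustration using Mahout's DataModel and FastIDSet, not the project's exact code, and the constructor's threshold handling is omitted:

    // Jaccard similarity: |items rated by both| / |items rated by either|.
    // dataModel is assumed to be the DataModel passed to the constructor.
    @Override
    public double userSimilarity(long userID1, long userID2) throws TasteException {
        FastIDSet items1 = dataModel.getItemIDsFromUser(userID1);
        FastIDSet items2 = dataModel.getItemIDsFromUser(userID2);
        int intersection = items1.intersectionSize(items2);
        int union = items1.size() + items2.size() - intersection;
        return union == 0 ? 0.0 : (double) intersection / union;
    }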

Now we have all the pieces to create our recommender. We do that in the class MyRecommenderBuilder1 in the package org.apache.main, using the buildRecommender function:

public Recommender buildRecommender(DataModel dataModel, Fold fold) throws TasteException {
    UserSimilarity similarity = new JaccardUserSimilarity(fold.getTraining(), threshold);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(k, similarity, fold.getTraining());
    UserBasedRecommender recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
    return recommender;
}

Then we evaluate this recommender in the MainBruteForce class, in the same main package:


DataModel model = new FileDataModel(new File(rootPath + "\\Datasets\\" + fileName + ".csv")); // load the dataset
RecommenderBuilder builder = new MyRecommenderBuilder1(model, threshold, k); // the builder for the recommender we've just defined

// compute Precision@N and Recall@N
KFoldRecommenderIRStatsEvaluator evaluatorIRStats = new KFoldRecommenderIRStatsEvaluator(model, 5); // 5-fold cross-validation
IRStatistics irstats = evaluatorIRStats.evaluate(builder, number_rec, threshold);
System.out.println("Precision: " + irstats.getPrecision() + " Recall: " + irstats.getRecall());

// compute RMSE and MAE
KFoldRecommenderPredictionEvaluator evaluatorPred = new KFoldRecommenderPredictionEvaluator(model, 5);
PredictionStatistics prestats = evaluatorPred.evaluate(builder);
System.out.println("MAE = " + prestats.getMAE() + " RMSE = " + prestats.getRMSE());

2_ Computing Precision@N, Recall@N, RMSE and MAE of recommendations using Mahout, based on KNN files generated independently by a Python project:

Consider the figure linked as Program description: it explains how we have used the Mahout framework to accomplish the tasks named in this section's title. First, let's describe how the code operates:

  • Data preparation and splitting:

The program reads the dataset file and converts it to CSV format if that has not already been done. Since we perform 5-fold cross-validation, at each iteration the program creates training-set and test-set files in the Datasets folder. These tasks are performed by the buildTest and buildTrain functions in the MyRecommenderBuilder class in the package org.apache.main; a sketch of such a split is shown below.
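
As an illustration only, a minimal sketch of how a 5-fold split could write the i-th fold's files (the real buildTrain/buildTest may work differently; the file names here are assumed):

    // Every k-th line (offset i) goes to the test file, the rest to training.
    static void writeFold(java.nio.file.Path csv, int i, int k) throws java.io.IOException {
        java.util.List<String> lines = java.nio.file.Files.readAllLines(csv);
        try (java.io.PrintWriter train = new java.io.PrintWriter("Datasets/train_" + i + ".csv");
             java.io.PrintWriter test  = new java.io.PrintWriter("Datasets/test_" + i + ".csv")) {
            for (int j = 0; j < lines.size(); j++) {
                (j % k == i ? test : train).println(lines.get(j));
            }
        }
    }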

  • Use of the Python project and creation of the LBNN graph file (the KNN graph)

The program generates the KNN graph file (named the LBNN graph) from the training-set file by launching the Python script that is responsible for executing the algorithm and storing the result (the LBNN graph) as a JSON file called KNNG_LBNN.txt, whose path is specified in the config.properties file. The function that launches this script is runScript in the MyRecommenderBuilder class.
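
A minimal sketch of how such a launcher could look, assuming ProcessBuilder and a python3 interpreter on the PATH (the script and log names here are illustrative, not the project's exact runScript):

    static void runScript(String scriptPath, String trainingSetPath) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("python3", scriptPath, trainingSetPath);
        pb.redirectErrorStream(true);                                // merge stderr into stdout
        pb.redirectOutput(new java.io.File("Logs/python_run.log"));  // assumed log location
        int exitCode = pb.start().waitFor();                         // block until KNNG_LBNN.txt is written
        if (exitCode != 0) {
            throw new IllegalStateException("Python script failed with exit code " + exitCode);
        }
    }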

  • Create the recommender based on the LBNN graph file

Now we compute the neighborhood and the similarities for every user based on the LBNN graph generated by the Python script. For that I use the UserNeighborhoodImpl class, which implements the UserNeighborhood interface, and the UserSimilarityIml class, which implements the UserSimilarity interface; then we build our recommender in the buildRecommender function of the MyRecommenderBuilder class (an illustrative sketch of the neighborhood class follows the snippet below):

  
    UserNeighborhood neighborhood = new UserNeighborhoodImpl(p.getJsonObject());
    UserSimilarity similarity = new UserSimilarityIml(p.getJsonObject());
    UserBasedRecommender recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
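
To show the idea, here is a minimal sketch of a neighborhood backed by the precomputed graph; it assumes a json-simple JSONObject whose keys are user IDs and whose values are arrays of neighbor IDs, which may differ from the project's actual classes:

    import java.util.Collection;
    import org.apache.mahout.cf.taste.common.Refreshable;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.json.simple.JSONArray;
    import org.json.simple.JSONObject;

    public class UserNeighborhoodImpl implements UserNeighborhood {
        private final JSONObject graph; // parsed KNNG_LBNN.txt

        public UserNeighborhoodImpl(JSONObject graph) {
            this.graph = graph;
        }

        @Override
        public long[] getUserNeighborhood(long userID) throws TasteException {
            // look up the user's precomputed neighbor list in the JSON graph
            JSONArray neighbors = (JSONArray) graph.get(String.valueOf(userID));
            long[] ids = new long[neighbors.size()];
            for (int i = 0; i < ids.length; i++) {
                ids[i] = Long.parseLong(neighbors.get(i).toString());
            }
            return ids;
        }

        @Override
        public void refresh(Collection<Refreshable> alreadyRefreshed) {
            // the graph is static for a given fold; nothing to refresh
        }
    }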
  • Finally:

The Main class binds all the previous classes together and calls the functions that compute either Recall and Precision or RMSE and MAE.

Note that this project is configured to run on a Unix OS. If you want to run your experiments on a Windows OS, you should modify the paths in Main.java, MainBruteForce.java and MyRecommenderBuilder.java to match the Windows environment (change / to \, for example /Datasets/ to \Datasets\), then export the jar files with this new configuration.
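
As an aside, a more portable option, if one wanted to avoid editing separators by hand, would be to build paths with java.nio.file.Paths (a suggestion, not what the project currently does):

    // Paths.get joins segments with the platform's separator, so the same
    // code works unchanged on both Unix and Windows:
    java.nio.file.Path dataset = java.nio.file.Paths.get(rootPath, "Datasets", fileName + ".csv");
    DataModel model = new FileDataModel(dataset.toFile());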

For tests on servers, or if you want to run the two functionalities (1 and 2) explained above from an executable jar, you can use the bf_mahout jar file for the brute-force experiments (point 1) and the lbnn_mahout jar file for the LBNN-graph experiments (point 2).

You should first create the Datasets and Logs folders and put the config.properties file in the same location as the jar files; you can modify the paths in config.properties to match your environment's structure.

You also need a Java development environment on your system, so set up the Java SE Development Kit 8 or a more recent version. You should also set up Python 3.7 to run the experiments with the LBNN approach.

bf_mahout.jar

You can run bf_mahout with this command:

java -jar bf_mahout.jar DataSetName accuracy|error K(size of neighborhood) threshold (nbr_recommendation)+

For example: java -jar bf_mahout.jar TestSet0 accuracy 50 3 10 20 30. Here we have chosen the accuracy measures (to compute Recall and Precision), K=50 and threshold=3, and we will run three experiments for three numbers of recommendations: 10, 20 and 30 (you can specify whichever numbers you want).
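
For orientation, a minimal sketch of how these arguments might map inside main (an assumed illustration, not bf_mahout's actual parsing):

    // args: DataSetName accuracy|error K threshold nbr_recommendation...
    String fileName  = args[0];
    String mode      = args[1];                      // "accuracy" or "error"
    int k            = Integer.parseInt(args[2]);    // neighborhood size
    double threshold = Double.parseDouble(args[3]);  // assumed numeric threshold
    int[] numbersOfRecs = new int[args.length - 4];  // one experiment per value
    for (int i = 4; i < args.length; i++) {
        numbersOfRecs[i - 4] = Integer.parseInt(args[i]);
    }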

lbnn_mahout.jar

The command here is as follows:

java -jar lbnn_mahout.jar DataSetName accuracy|error threshold graph_learning|sym_graph_learning (nbr_recommendation)+

For example: java -jar lbnn_mahout.jar DataSetName error 3 graph_learning 12 20 30 40. Here graph_learning and sym_graph_learning are the Python executables located in /py_scripts/dist/.

Welcome to Apache Mahout!

The Apache Mahout™ project's goal is to build an environment for quickly creating scalable, performant machine learning applications.

For additional information about Mahout, visit the Mahout Home Page

Setting up your Environment

Whether you are using Mahout's shell, running command-line jobs, or using it as a library to build your own apps, you'll need to set up several environment variables. Edit your environment in ~/.bash_profile on a Mac or ~/.bashrc on many Linux distributions, and add the following:

export MAHOUT_HOME=/path/to/mahout
export MAHOUT_LOCAL=true # for running standalone on your dev machine, 
# unset MAHOUT_LOCAL for running on a cluster

You will need a $JAVA_HOME, and if you are running on Spark you will also need a $SPARK_HOME.

Using Mahout as a Library

Running any application that uses Mahout will require installing a binary or source version and setting the environment. To compile from source:

  • mvn -DskipTests clean install
  • To run tests do mvn test
  • To set up your IDE, do mvn eclipse:eclipse or mvn idea:idea

To use maven, add the appropriate setting to your pom.xml or build.sbt following the template below.

To use the Samsara environment you'll need to include both the engine-neutral math-scala dependency:

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-math-scala_2.10</artifactId>
    <version>${mahout.version}</version>
</dependency>

and a dependency for back-end engine translation, e.g.:

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-spark_2.10</artifactId>
    <version>${mahout.version}</version>
</dependency>

Building From Source

Prerequisites:

  • Linux environment (preferably Ubuntu 16.04.x). Note: currently only the JVM-only build will work on a Mac.
  • gcc > 4.x
  • NVIDIA card (installed with OpenCL drivers alongside the usual GPU drivers)

Downloads

Install java 1.7+ in an easily accessible directory (for this example, ~/java/) http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Create a directory ~/apache/ .

Download apache Maven 3.3.9 and un-tar/gunzip to ~/apache/apache-maven-3.3.9/ . https://maven.apache.org/download.cgi

Download and un-tar/gunzip Hadoop 2.4.1 to ~/apache/hadoop-2.4.1/ . https://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/

Download and un-tar/gunzip spark-1.6.3-bin-hadoop2.4 to ~/apache/ . http://spark.apache.org/downloads.html (choose release: Spark 1.6.3 (Nov 07 2016); choose package type: Pre-Built for Hadoop 2.4)

Install ViennaCL 1.7.0+. If running Ubuntu 16.04+:

sudo apt-get install libviennacl-dev

Otherwise, if your distribution’s package manager does not have a viennacl-dev package > 1.7.0, clone it directly into a directory that will be on the include path when Mahout is compiled:

mkdir ~/tmp
cd ~/tmp && git clone https://github.com/viennacl/viennacl-dev.git
cd viennacl-dev
cp -r viennacl/ /usr/local/
cp -r CL/ /usr/local/

Ensure that the OpenCL 1.2+ drivers are installed (packed with most consumer grade NVIDIA drivers). Not sure about higher end cards.

Clone mahout repository into ~/apache.

git clone https://github.com/apache/mahout.git

Configuration

When building Mahout for a Spark backend, we need four system environment variables set:

    export MAHOUT_HOME=/home/<user>/apache/mahout
    export HADOOP_HOME=/home/<user>/apache/hadoop-2.4.1
    export SPARK_HOME=/home/<user>/apache/spark-1.6.3-bin-hadoop2.4    
    export JAVA_HOME=/home/<user>/java/jdk-1.8.121

Mahout on Spark regularly uses one more env variable, the IP of the Spark cluster’s master node (usually the node which one would be logged into).

To use 4 local cores (Spark master need not be running)

export MASTER=local[4]

To use all available local cores (again, Spark master need not be running)

export MASTER=local[*]

To point to a cluster with spark running:

export MASTER=spark://master.ip.address:7077

We then add these to the path:

   PATH=$PATH:$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin

These should be added to your ~/.bashrc file.

Building Mahout with Apache Maven

Currently Mahout has 3 builds. From the $MAHOUT_HOME directory we may issue the commands to build each using mvn profiles.

JVM only:

mvn clean install -DskipTests

JVM with native OpenMP level 2 and level 3 matrix/vector Multiplication

mvn clean install -Pviennacl-omp -Phadoop2 -DskipTests

JVM with native OpenMP and OpenCL for Level 2 and level 3 matrix/vector Multiplication. (GPU errors fall back to OpenMP, currently only a single GPU/node is supported).

mvn clean install -Pviennacl -Phadoop2 -DskipTests

Testing the Mahout Environment

Mahout provides an extension to the spark-shell, which is good for getting to know the language, testing partition loads, prototyping algorithms, etc.

To launch the shell in local mode with 2 threads, simply do the following:

$ MASTER=local[2] mahout spark-shell

After a very verbose startup, a Mahout welcome screen will appear:

Loading /home/andy/sandbox/apache-mahout-distribution-0.13.0/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@3ca1f0a4

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0


That file does not exist


scala>

At the scala> prompt, enter:

scala> :load /home/<andy>/apache/mahout/examples/bin/SparseSparseDrmTimer.mscala

This will load a matrix multiplication timer function definition. To run the matrix timer:

        scala> timeSparseDRMMMul(1000,1000,1000,1,.02,1234L)
            {...} res3: Long = 16321

We can see that the JVM-only version is rather slow, hence our motivation for GPU and native multithreading support.

To get an idea of what’s going on under the hood of the timer, we may examine the .mscala (Mahout Scala) code, which is both fully functional Scala and the Mahout R-like DSL for tensor algebra:




def timeSparseDRMMMul(m: Int, n: Int, s: Int, para: Int, pctDense: Double = .20, seed: Long = 1234L): Long = {
  val drmA = drmParallelizeEmpty(m , s, para).mapBlock(){
       case (keys,block:Matrix) =>
           val R =  scala.util.Random
           R.setSeed(seed)
           val blockB = new SparseRowMatrix(block.nrow, block.ncol)
           blockB := {x => if (R.nextDouble < pctDense) R.nextDouble else x }
       (keys -> blockB)
  }

  val drmB = drmParallelizeEmpty(s , n, para).mapBlock(){
       case (keys,block:Matrix) =>
           val R =  scala.util.Random
           R.setSeed(seed + 1)
           val blockB = new SparseRowMatrix(block.nrow, block.ncol)
           blockB := {x => if (R.nextDouble < pctDense) R.nextDouble else x }
       (keys -> blockB)
  }

  var time = System.currentTimeMillis()

  val drmC = drmA %*% drmB
 
  // trigger computation
  drmC.numRows()

  time = System.currentTimeMillis() - time

  time  
 
}

For more information please see the following references:

http://mahout.apache.org/users/environment/in-core-reference.html

http://mahout.apache.org/users/environment/out-of-core-reference.html

http://mahout.apache.org/users/sparkbindings/play-with-shell.html

http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html

Note that due to an intermittent out-of-memory bug in a Flink test we have disabled it from the binary releases. To use Flink please uncomment the line in the root pom.xml in the <modules> block so it reads <module>flink</module>.

Examples

For examples of how to use Mahout, see the examples directory located in examples/bin

For information on how to contribute, visit the How to Contribute Page

Legal

Please see the NOTICE.txt included in this directory for more information.

