jasonwalker80 / cse427s
License: MIT License
Submit your report, including documentation as well as results, as project_report.pdf by adding it to the final_project/spark folder in your SVN repository. Submit your implementation by adding it to the final_project/spark/src folder in your SVN repository. Do NOT add any data!
Add the new files/folders to your SVN repo before committing:
$ svn add src/*
$ svn add project_report.pdf
$ svn commit -m 'final project submission'
A few comments on the use of the regex:
The problem suggests determining the delimiter, either , or |, by parsing out character 19 of each string. The regex looks like it should work for either comma- or pipe-delimited text. I wonder if there are any gotchas where the regex matches but using character 19 would produce a different result.
Is there a regex/pattern match to filter out records that do not contain exactly 14 fields? A regex match is usually stricter than a simple split() on a delimiter. However, again I wonder whether this produces a different answer.
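To make the comparison concrete, here is a small sketch (not from the project code; the sample records used in testing are synthetic): a backreferenced regex that accepts only consistent 14-field records, next to the character-19 split approach.

```python
import re

# Exactly 14 fields separated by a single, consistent delimiter (comma or
# pipe). The backreference \1 rejects records that mix the two delimiters,
# which a plain split() on one character would silently accept.
RECORD_14 = re.compile(r"^[^,|]*([,|])[^,|]*(?:\1[^,|]*){12}$")

def matches_14_fields(line):
    """Stricter regex check: True only for consistent 14-field records."""
    return RECORD_14.match(line) is not None

def split_14_fields(line):
    """The character-19 alternative discussed above."""
    if len(line) < 20 or line[19] not in ",|":
        return None
    fields = line.split(line[19])
    return fields if len(fields) == 14 else None
```

One observable difference: a record that mixes commas and pipes can still pass the split()-based check if the chosen delimiter happens to produce 14 pieces, but it can never pass the regex.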
Download the synthetic clustering data from http://statistical-research.com/wp-content/uploads/2013/11/sample_geo.txt and visualize the (latitude, longitude) pairs. You do not have to use SPARK for the visualization.
In this step, you will compare the clusters using Euclidean distance vs. great circle distance. Calculate the k-means clusters for the device location data using k = 5.
Calculate the k-means clusters for the synthetic location data using k = 2 and k = 4.
Calculate the k-means clusters for the large-scale DBpedia location data. You will need to experiment with the number of clusters (maybe use k = 6 for a start or k = 2 or 4 if you use the US locations only). Argue, what choice of k makes sense by considering the problem context, i.e., what could the clusters actually mean/represent?
Visualize the clusters and cluster centers (use a random subset of data points for the last dataset) for both distance measures. Can you observe a difference?
MMDS chapter 7.3 (http://infolab.stanford.edu/~ullman/mmds/ch7.pdf) gives pseudocode and implementation strategies for the k-means clustering algorithm. Detailed implementation requirements/specifications are listed below:
The following functions will be useful for calculating k-means:
• closestPoint: given a (latitude/longitude) point and an array of current center points, returns the index in the array of the center closest to the given point
• addPoints: given two points, returns a point that is the sum of the two points.
• EuclideanDistance: given two points, returns the Euclidean distance between them.
• GreatCircleDistance: given two points, returns the great circle distance between them.
Note that the addPoints function will be used to compute the new cluster centers. As we are working with spatial data given as latitude/longitude pairs, implementing this function in a meaningful way will need some thought!
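As a starting point, the four helpers might look like the following sketch (plain Python, not the required Spark code; the naive addPoints shown here is exactly the part that needs more thought, and the haversine formula with an Earth radius of 6371 km is one common choice for the great circle distance, not a specification):

```python
import math

def EuclideanDistance(p, q):
    """Euclidean distance between two (lat, lon) pairs treated as 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def GreatCircleDistance(p, q, radius=6371.0):
    """Great circle distance in km via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def closestPoint(point, centers):
    """Index of the center nearest to point (Euclidean here for simplicity)."""
    best, best_dist = 0, float("inf")
    for i, c in enumerate(centers):
        d = EuclideanDistance(point, c)
        if d < best_dist:
            best, best_dist = i, d
    return best

def addPoints(p, q):
    """Naive component-wise sum; as noted above, near the poles or the
    +/-180 meridian this simple version can give misleading centers."""
    return (p[0] + q[0], p[1] + q[1])
```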
The distance measure used (Euclidean or great circle), as well as the parameter k (number of clusters), should be read as input from the command line.
Create a variable convergeDist that will be used to decide when the k-means calculation is done, i.e., when the amount by which the locations of the means change between iterations is less than convergeDist. A "perfect" solution would be 0; this number represents a "good enough" solution. For this project, use a value of 0.1.
Parse the input file, which should also be specified as a variable via the command line, into (latitude, longitude) pairs. Be sure to persist (cache) the resulting RDD, because you will access it each time through the following iterations.
Now, plan and implement the main part of the k-means algorithm. Aim for an efficient implementation, being aware of tasks, stages, and cached RDDs.
When the iteration is complete, display and return the final k center points and store the k clusters (i.e., all data points plus cluster information).
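Putting the pieces together, the iteration could be sketched as follows. Plain Python lists stand in for RDDs here (in Spark the assignment step would be a map over the cached RDD followed by a reduceByKey), and initializing the centers from the first k points is a simplification of takeSample:

```python
import math

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def closest(point, centers):
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

def kmeans(points, k, convergeDist=0.1):
    centers = points[:k]                    # in Spark: takeSample(False, k)
    moved = float("inf")
    while moved > convergeDist:
        # assign each point to its nearest center and accumulate sums
        # (in Spark: map to (index, (point, 1)) then reduceByKey)
        sums = {i: [0.0, 0.0, 0] for i in range(k)}
        for lat, lon in points:
            i = closest((lat, lon), centers)
            sums[i][0] += lat
            sums[i][1] += lon
            sums[i][2] += 1
        # recompute centers and measure how far they moved in total
        new_centers = list(centers)
        for i, (slat, slon, n) in sums.items():
            if n:
                new_centers[i] = (slat / n, slon / n)
        moved = sum(euclidean(c, nc) for c, nc in zip(centers, new_centers))
        centers = new_centers
    return centers
```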
Are we supposed to sort somewhere? I'm getting a different answer than you.
[u'131.166.169.114/67858',
u'254.253.248.193/3579',
u'200.192.65.234/8897',
u'62.193.77.201/169',
u'237.64.198.45/4261',
u'79.35.153.191/91',
u'191.71.1.210/26287',
u'181.208.88.8/59523',
u'255.201.124.130/183',
u'187.27.227.110/113']
Project 3
Project Manager: @jasonwalker80
Local Developer: @keon6kim
Cloud Dev/Key User: @bskowron23
Again, just to confirm: 2.2 Viz.png will be submitted as step2.png in our SVN?
@bskowron23 I think this is a typo. I get 805 for the negative word count. Also you used 805 in the s and p equations:
Line 77 in 22e9722
@keon6kim have we verified that this actually runs out on a subset of data?
Very unsure about this... there is a question on piazza about it, but this will likely need some review...
Review the contents of the file $DEV1DATA/devicestatus.txt. You will have to pre-process the data in order to get it into a standardized format for later processing. This is a common part of the ETL (Extract-Transform-Load) process called data scrubbing.
The input data contains information collected from mobile devices on Loudacre’s network, including device ID, current status, location and so on. Loudacre Mobile is a (fictional) fast-growing wireless carrier that provides mobile service to customers throughout the western USA. Because Loudacre previously acquired other mobile providers’ networks, the data from different subnetworks comes in different formats. Note that the records in this file have different field delimiters: some use commas, some use pipes (|) and so on. Your task is to
• Load the dataset
• Determine which delimiter to use (hint: the character at position 19 is the first use of the delimiter)
• Filter out any records which do not parse correctly (hint: each record should have exactly 14 values)
• Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields respectively). You might want to store latitude and longitude as the first two fields to make it consistent with the other two datasets.
• Filter out locations that have a latitude and longitude of 0.
• The model field contains the device manufacturer and model name (e.g. Ronin S2). Split this field by spaces to separate the manufacturer from the model (e.g. manufacturer Ronin, model S2).
• Save the extracted data to comma-delimited text files in the /loudacre/devicestatus_etl directory on HDFS.
• Confirm that the data in the file(s) was saved correctly. Provide a screenshot named devicedata.png showing a couple of records, and add it to your SVN repository.
• Visualize the (latitude, longitude) pairs of the device location data. You do not have to use SPARK for the visualization. Show this visualization at your milestone 2 demo with the TA or instructor.
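A minimal scrubbing sketch for a single record, assuming the field positions given in the list above (the sample records used for checking are synthetic, not real devicestatus data):

```python
def scrub(line):
    """Parse one devicestatus record, or return None if it is malformed.
    Field positions follow the assignment text: date (1st), model (2nd),
    device ID (3rd), latitude and longitude (13th and 14th)."""
    if len(line) < 20:
        return None
    delimiter = line[19]            # hint: position 19 holds the delimiter
    if delimiter not in ",|":
        return None
    fields = line.split(delimiter)
    if len(fields) != 14:           # drop records that do not parse correctly
        return None
    date, model, device_id = fields[0], fields[1], fields[2]
    lat, lon = float(fields[12]), float(fields[13])
    if lat == 0.0 and lon == 0.0:   # drop unknown locations
        return None
    # split manufacturer from model name, e.g. "Ronin S2" -> ("Ronin", "S2")
    maker, _, model_name = model.partition(" ")
    # latitude/longitude first, for consistency with the other datasets
    return (lat, lon, date, maker, model_name, device_id)
```

In the Spark job this would run inside a map, followed by a filter dropping the None results before saving to HDFS.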
With lat/long you can only go up to ±90/±180. I think we need to adjust addPoints to handle this.
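One possible way to handle this (an assumption on my part, not a requirement from the assignment): convert each lat/long pair to a 3-D unit vector, average the vectors, and convert the mean back, which avoids the wrap-around at the ±180 meridian:

```python
import math

def to_vector(lat, lon):
    """Lat/long in degrees -> 3-D unit vector."""
    la, lo = math.radians(lat), math.radians(lon)
    return (math.cos(la) * math.cos(lo), math.cos(la) * math.sin(lo), math.sin(la))

def to_latlon(x, y, z):
    """Vector -> lat/long in degrees (the vector need not be unit length)."""
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))

def spherical_mean(points):
    """Average (lat, lon) pairs without the +/-180 wrap-around problem."""
    vectors = [to_vector(lat, lon) for lat, lon in points]
    n = len(vectors)
    x, y, z = (sum(component) / n for component in zip(*vectors))
    return to_latlon(x, y, z)
```

For example, averaging (0, 179) and (0, -179) this way gives a point on the ±180 meridian, whereas naive component-wise averaging would put it at longitude 0, on the opposite side of the globe.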
@bskowron23 So it seems like the Tableau file you uploaded onto Google Drive is not calculating dist_norm and Distance Measure at all.
I cannot output an Elbow graph without them properly working. Specifically, what is Distance Measure supposed to be?
I don't think your implementation of k-means algo outputs Distance Measure.
Can you find or crawl an even bigger dataset you want to cluster? This dataset should be big enough that your k-means clustering computation is not feasible in your pseudo-cluster. Cluster this data by executing your SPARK program on Amazon EMR and report your results and experiences. If you do not find another geo-location dataset, feel free to perform clustering on any other Big Data clustering problem. Keep in mind that now the interpretation and visualization of the retrieved clusters is much harder. If need be, you can also use the DBpedia data in EMR.
Document your cloud execution approach and provide the data source in your final project report. Add your pre-processing code to the src folder in your SVN repository. Describe your findings including dataset size and runtimes in your final project report.
If the output is really a print statement, I don't think the "result" goes anywhere. Although when submitted to the cluster I didn't see the output number of JPGs, so maybe that ends up in a log file on HDFS?
Reference this line:
Line 54 in 57341d6
This is the comment on piazza regarding this question:
https://piazza.com/class/j6nxmwmqeet1zy?cid=376
To confirm, 2.1 Viz.PNG is the file to be submitted to SVN as devicedata.png?
This is the theory part of the project. Review the slides from the lecture and Lab 6 to understand the main data concept in SPARK – Resilient Distributed Datasets (RDDs). You will need to persist an RDD (at least once) in your k-means implementation. Additionally, make yourself familiar with how to view stages and tasks, e.g., using the Spark Application UI (when using the Spark shell in local mode) or the Spark History Server at http://localhost:18080 (when running scripts locally).
I noticed that the @bskowron23 version of the k-means implementation is missing the script/preprocessing logic for DBpedia data. @keon6kim can you upload the latest version of the implementation that was used for Problem 3 Step 3?
I will need that implementation for submission of the final project.
I just realized that we may need to add distinct() to the code at some point.
What do you guys think?
@bskowron23
Can you upload a PNG for the devicedata.png
submission requirement?
This is question 3 in HW8:
(a) Describe what pipelining means in the context of a SPARK job execution. What is its benefit?
(b) Give an example of two operations that can be pipelined together.
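Not Spark itself, but the same idea can be illustrated with Python generators: in a pipelined map-then-filter, each element flows through both functions in turn and no intermediate collection is materialized, just as Spark pipelines narrow transformations within one stage.

```python
# Record the order in which elements pass through the two operations.
events = []

def parse(x):
    events.append(("map", x))
    return x * 2

def keep(x):
    events.append(("filter", x))
    return x > 2

# Analogous to rdd.map(parse).filter(keep) inside a single Spark stage:
# generators pull one element at a time through both functions.
pipeline = (y for y in (parse(x) for x in [1, 2, 3]) if keep(y))
result = list(pipeline)
# events interleaves map and filter per element, rather than running
# all maps first and all filters afterwards -- that is pipelining.
```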
Write the project report documenting your clustering approach, your implementation, the obtained results, and runtime analysis. This report should be readable for an informed outsider and it should not require the reader to look at or run any code.
Line 52 in 6bede92
@bskowron23 - I found slightly different numbers for the case-insensitive results. The values for case-sensitive matched.
[training@localhost ~]$ awk '{if($1=="a")print}' shakespeare_AvgWordLenght_case-insensitive.txt
a 3.275899648342265
[training@localhost ~]$ awk '{if($1=="w")print}' shakespeare_AvgWordLenght_case-insensitive.txt
w 4.373096283946263
[training@localhost ~]$ awk '{if($1=="z")print}' shakespeare_AvgWordLenght_case-insensitive.txt
z 5.053333333333334
Run on the full TrainingRatings.txt set with N=15 and add the result to the written answers.
Compare the runtime of your k-means implementation (using the same value for k) for all three datasets using the local mode with at least two threads. Further, rerun your implementation without using persistent RDDs and compare those runtimes to the previously obtained ones. Create a table summarizing these results and briefly discuss your findings. After job completion you can read off the runtimes and other statistics from the Spark History Server at http://localhost:18080. You might want to rerun each experiment a couple of times and use the average runtime for a more robust comparison (time permitting).
Step 1: Hue File Browser and Data Deployment
Step 2: Spark Documentation
Step 3: Spark Shell and RDDs
Step 4: Word-Count and Job Execution
@bskowron23,
Sorry about the radio silence. Work continues to be relentless. Also I've had a fever the last two days to top it all off.
My current work is fairly unorganized in the Word doc, but I think the Spark answers/code are there for 1 and 2.
I need to reformat to the latex document. Hence this issue.
Also, I need to read some of the text for 3. I think I know the answer, but I suspect it's in one of our readings or available in Spark docs.
Jason
Download the large-scale clustering data of (latitude, longitude) pairs extracted from DBpedia (https://classes.cec.wustl.edu/cse427/lat_longs.zip). Each record represents a location/place that has a Wikipedia article and latitude/longitude information. The format is: lat long name_of_page.
In total, there are 450,151 points in a 2D space (i.e., space with spherical geometry – maybe it would make sense to use the great circle distance when analyzing this data...). To get a smaller sample of this dataset for testing purposes, you could put a bounding box around the US and filter only those records inside the bounding box. Try to visualize this data. Eventually, you want to cluster the whole world using the entire dataset...
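A sketch of such a bounding-box filter (the corner coordinates for the contiguous US are rough assumptions, and parse_record assumes the "lat long name_of_page" format above, where the page name may itself contain spaces):

```python
# Rough bounding box for the contiguous US -- the exact corners are an
# assumption, adjust as needed.
US_LAT = (24.0, 50.0)
US_LON = (-125.0, -66.0)

def in_us(lat, lon):
    return US_LAT[0] <= lat <= US_LAT[1] and US_LON[0] <= lon <= US_LON[1]

def parse_record(line):
    """Each record is 'lat long name_of_page'; split only on the first
    two spaces so names containing spaces stay intact."""
    lat, lon, name = line.split(" ", 2)
    return float(lat), float(lon), name

def us_sample(lines):
    """Keep only the records inside the US bounding box."""
    records = (parse_record(line) for line in lines)
    return [(lat, lon, name) for lat, lon, name in records if in_us(lat, lon)]
```

In the Spark version this would be a map followed by a filter on the full RDD; the same predicate also works for drawing a random visualization subset.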
For the combiner, the input record count is: 3,192,295
The output record count is: 1,789
Line 71 in ed8172c
I think the answer would be 3,190,506 key,value pairs when combined.
Understanding Parallel Data Processing and Persisting RDDs