jasonwalker80 / cse427s
License: MIT License
Submit your report, including documentation as well as results, as project_report.pdf by adding it to the final_project/spark folder in your SVN repository. Submit your implementation by adding it to the final_project/spark/src folder in your SVN repository. Do NOT add any data!
Add the new files/folders to your SVN repo before committing:
$ svn add src/*
$ svn add project_report.pdf
$ svn commit -m 'final project submission'
A few comments on the use of the regex:
The problem suggests determining the delimiter, either , or |, by parsing out character 19 of each string. The regex looks like it should work for either comma- or pipe-delimited text. I wonder if there are any gotchas where the regex matches but using character 19 would produce a different result.
Is there a regex/pattern match to filter out records that do not contain exactly 14 fields? A regex match is usually stricter than a simple split() on a delimiter. However, again I wonder whether this produces a different answer.
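To make the comparison concrete, here is a small sketch (not from the project code; the sample records used in testing are synthetic): a backreferenced regex that accepts only consistent 14-field records, next to the character-19 split approach.

```python
import re

# Exactly 14 fields separated by a single, consistent delimiter (comma or
# pipe). The backreference \1 rejects records that mix the two delimiters,
# which a plain split() on one character would silently accept.
RECORD_14 = re.compile(r"^[^,|]*([,|])[^,|]*(?:\1[^,|]*){12}$")

def matches_14_fields(line):
    """Stricter regex check: True only for consistent 14-field records."""
    return RECORD_14.match(line) is not None

def split_14_fields(line):
    """The character-19 alternative discussed above."""
    if len(line) < 20 or line[19] not in ",|":
        return None
    fields = line.split(line[19])
    return fields if len(fields) == 14 else None
```

One observable difference: a record that mixes commas and pipes can still pass the split()-based check if the chosen delimiter happens to produce 14 pieces, but it can never pass the regex.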
Download the synthetic clustering data from http://statistical-research.com/wp-content/uploads/2013/11/sample_geo.txt and visualize the (latitude, longitude) pairs. You do not have to use SPARK for the visualization.
In this step, you will compare the clusters using Euclidean distance vs. great circle distance. Calculate the k-means clusters for the device location data using k = 5.
Calculate the k-means clusters for the synthetic location data using k = 2 and k = 4.
Calculate the k-means clusters for the large-scale DBpedia location data. You will need to experiment with the number of clusters (maybe use k = 6 for a start or k = 2 or 4 if you use the US locations only). Argue, what choice of k makes sense by considering the problem context, i.e., what could the clusters actually mean/represent?
Visualize the clusters and cluster centers (use a random subset of data points for the last dataset) for both distance measures. Can you observe a difference?
MMDS chapter 7.3 (http://infolab.stanford.edu/~ullman/mmds/ch7.pdf) gives pseudocode and implementation strategies for the k-means clustering algorithm. Detailed implementation requirements/specifications are listed below:
The following functions will be useful for calculating k-means:
• closestPoint: given a (latitude/longitude) point and an array of current center points, returns the index in the array of the center closest to the given point
• addPoints: given two points, returns a point that is the sum of the two points.
• EuclideanDistance: given two points, returns the Euclidean distance between them.
• GreatCircleDistance: given two points, returns the great circle distance between them.
Note that the addPoints function will be used to compute the new cluster centers. As we are working with spatial data given as latitude/longitude pairs, implementing this function in a meaningful way will need some thought!
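As a starting point, the four helpers might look like the following sketch (plain Python, not the required Spark code; the naive addPoints shown here is exactly the part that needs more thought, and the haversine formula with an Earth radius of 6371 km is one common choice for the great circle distance, not a specification):

```python
import math

def EuclideanDistance(p, q):
    """Euclidean distance between two (lat, lon) pairs treated as 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def GreatCircleDistance(p, q, radius=6371.0):
    """Great circle distance in km via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def closestPoint(point, centers):
    """Index of the center nearest to point (Euclidean here for simplicity)."""
    best, best_dist = 0, float("inf")
    for i, c in enumerate(centers):
        d = EuclideanDistance(point, c)
        if d < best_dist:
            best, best_dist = i, d
    return best

def addPoints(p, q):
    """Naive component-wise sum; as noted above, near the poles or the
    +/-180 meridian this simple version can give misleading centers."""
    return (p[0] + q[0], p[1] + q[1])
```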
The distance measure used (Euclidean or great circle), as well as the parameter k (number of clusters), should be read as input from the command line.
Create a variable convergeDist that will be used to decide when the k-means calculation is done, i.e., when the amount by which the locations of the means change between iterations is less than convergeDist. A "perfect" solution would be 0; this number represents a "good enough" solution. For this project, use a value of 0.1.
Parse the input file, which should also be specified as a variable via the command line, into (latitude, longitude) pairs. Be sure to persist (cache) the resulting RDD, because you will access it each time through the following iterations.
Now, plan and implement the main part of the k-means algorithm. Aim for an efficient implementation, being aware of tasks, stages, and cached RDDs.
When the iteration is complete, display and return the final k center points and store the k clusters (i.e., all data points plus cluster information).
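Putting the pieces together, the iteration could be sketched as follows. Plain Python lists stand in for RDDs here (in Spark the assignment step would be a map over the cached RDD followed by a reduceByKey), and initializing the centers from the first k points is a simplification of takeSample:

```python
import math

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def closest(point, centers):
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

def kmeans(points, k, convergeDist=0.1):
    centers = points[:k]                    # in Spark: takeSample(False, k)
    moved = float("inf")
    while moved > convergeDist:
        # assign each point to its nearest center and accumulate sums
        # (in Spark: map to (index, (point, 1)) then reduceByKey)
        sums = {i: [0.0, 0.0, 0] for i in range(k)}
        for lat, lon in points:
            i = closest((lat, lon), centers)
            sums[i][0] += lat
            sums[i][1] += lon
            sums[i][2] += 1
        # recompute centers and measure how far they moved in total
        new_centers = list(centers)
        for i, (slat, slon, n) in sums.items():
            if n:
                new_centers[i] = (slat / n, slon / n)
        moved = sum(euclidean(c, nc) for c, nc in zip(centers, new_centers))
        centers = new_centers
    return centers
```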
Are we supposed to sort somewhere? I'm getting a different answer than you.
[u'131.166.169.114/67858',
u'254.253.248.193/3579',
u'200.192.65.234/8897',
u'62.193.77.201/169',
u'237.64.198.45/4261',
u'79.35.153.191/91',
u'191.71.1.210/26287',
u'181.208.88.8/59523',
u'255.201.124.130/183',
u'187.27.227.110/113']
Project 3
Project Manager: @jasonwalker80
Local Developer: @keon6kim
Cloud Dev/Key User: @bskowron23
Again, just to confirm: 2.2 Viz.png will be submitted as step2.png in our SVN?
@bskowron23 I think this is a typo. I get 805 for the negative word count. Also you used 805 in the s and p equations:
Line 77 in 22e9722
@keon6kim have we verified that this actually runs out on a subset of data?
Very unsure about this... there is a question on piazza about it, but this will likely need some review...
Review the contents of the file $DEV1DATA/devicestatus.txt. You will have to pre-process the data in order to get it into a standardized format for later processing. This is a common part of the ETL (Extract-Transform-Load) process called data scrubbing.
The input data contains information collected from mobile devices on Loudacre’s network, including device ID, current status, location and so on. Loudacre Mobile is a (fictional) fast-growing wireless carrier that provides mobile service to customers throughout the western USA. Because Loudacre previously acquired other mobile providers’ networks, the data from different subnetworks comes in different formats. Note that the records in this file have different field delimiters: some use commas, some use pipes (|) and so on. Your task is to
• Load the dataset
• Determine which delimiter to use (hint: the character at position 19 is the first use of the delimiter)
• Filter out any records which do not parse correctly (hint: each record should have exactly 14 values)
• Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields respectively). You might want to store latitude and longitude as the first two fields to make it consistent with the other two datasets.
• Filter out locations that have a latitude and longitude of 0.
• The model field contains the device manufacturer and model name (e.g. Ronin S2). Split this field by spaces to separate the manufacturer from the model (e.g. manufacturer Ronin, model S2).
• Save the extracted data to comma-delimited text files in the /loudacre/devicestatus_etl directory on HDFS.
• Confirm that the data in the file(s) was saved correctly. Provide a screenshot named devicedata.png showing a couple of records, and add it to your SVN repository.
• Visualize the (latitude, longitude) pairs of the device location data. You do not have to use SPARK for the visualization. Show this visualization at your milestone 2 demo with the TA or instructor.
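A minimal scrubbing sketch for a single record, assuming the field positions given in the list above (the sample records used for checking are synthetic, not real devicestatus data):

```python
def scrub(line):
    """Parse one devicestatus record, or return None if it is malformed.
    Field positions follow the assignment text: date (1st), model (2nd),
    device ID (3rd), latitude and longitude (13th and 14th)."""
    if len(line) < 20:
        return None
    delimiter = line[19]            # hint: position 19 holds the delimiter
    if delimiter not in ",|":
        return None
    fields = line.split(delimiter)
    if len(fields) != 14:           # drop records that do not parse correctly
        return None
    date, model, device_id = fields[0], fields[1], fields[2]
    lat, lon = float(fields[12]), float(fields[13])
    if lat == 0.0 and lon == 0.0:   # drop unknown locations
        return None
    # split manufacturer from model name, e.g. "Ronin S2" -> ("Ronin", "S2")
    maker, _, model_name = model.partition(" ")
    # latitude/longitude first, for consistency with the other datasets
    return (lat, lon, date, maker, model_name, device_id)
```

In the Spark job this would run inside a map, followed by a filter dropping the None results before saving to HDFS.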
With lat/long you can only go up to ±90/±180. I think we need to adjust addPoints to handle this.
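One possible way to handle this (an assumption on my part, not a requirement from the assignment): convert each lat/long pair to a 3-D unit vector, average the vectors, and convert the mean back, which avoids the wrap-around at the ±180 meridian:

```python
import math

def to_vector(lat, lon):
    """Lat/long in degrees -> 3-D unit vector."""
    la, lo = math.radians(lat), math.radians(lon)
    return (math.cos(la) * math.cos(lo), math.cos(la) * math.sin(lo), math.sin(la))

def to_latlon(x, y, z):
    """Vector -> lat/long in degrees (the vector need not be unit length)."""
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))

def spherical_mean(points):
    """Average (lat, lon) pairs without the +/-180 wrap-around problem."""
    vectors = [to_vector(lat, lon) for lat, lon in points]
    n = len(vectors)
    x, y, z = (sum(component) / n for component in zip(*vectors))
    return to_latlon(x, y, z)
```

For example, averaging (0, 179) and (0, -179) this way gives a point on the ±180 meridian, whereas naive component-wise averaging would put it at longitude 0, on the opposite side of the globe.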
@bskowron23 So it seems like the Tableau file you uploaded onto Google Drive is not calculating dist_norm and Distance Measure at all.
I cannot output an Elbow graph without them properly working. Specifically, what is Distance Measure supposed to be?
I don't think your implementation of k-means algo outputs Distance Measure.
Can you find or crawl an even bigger dataset you want to cluster? This dataset should be big enough that your k-means clustering computation is not feasible in your pseudo-cluster. Cluster this data by executing your SPARK program on Amazon EMR and report your results and experiences. If you do not find another geo-location dataset, feel free to perform clustering on any other Big Data clustering problem. Keep in mind that now the interpretation and visualization of the retrieved clusters is much harder. If need be, you can also use the DBpedia data in EMR.
Document your cloud execution approach and provide the data source in your final project report. Add your pre-processing code to the src folder in your SVN repository. Describe your findings including dataset size and runtimes in your final project report.
If the output is really a print statement, I don't think the "result" goes anywhere. Although when submitted to the cluster I didn't see the output number of JPGs, so maybe that ends up in a log file on HDFS?
Reference this line:
Line 54 in 57341d6
This is the comment on piazza regarding this question:
https://piazza.com/class/j6nxmwmqeet1zy?cid=376
To confirm, 2.1 Viz.PNG is the file to be submitted to SVN as devicedata.png?
This is the theory part of the project. Review the slides from the lecture and Lab 6 to understand the main data concept in SPARK – Resilient Distributed Datasets (RDDs). You will need to persist an RDD (at least once) in your k-means implementation. Additionally, make yourself familiar with how to view stages and tasks, e.g., using the Spark Application UI (when using the Spark shell in local mode) or the Spark History Server at http://localhost:18080 (when running scripts locally).
I noticed that the @bskowron23 version of the k-means implementation is missing the script/preprocessing logic for DBpedia data. @keon6kim can you upload the latest version of the implementation that was used for Problem 3 Step 3?
I will need that implementation for submission of the final project.
I just realized that we may need to add distinct() to the code at some point.
What do you guys think?
@bskowron23
Can you upload a PNG for the devicedata.png
submission requirement?
This is question 3 in HW8:
(a) Describe what pipelining means in the context of a SPARK job execution. What is its benefit?
(b) Give an example of two operations that can be pipelined together.
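Not Spark itself, but the same idea can be illustrated with Python generators: in a pipelined map-then-filter, each element flows through both functions in turn and no intermediate collection is materialized, just as Spark pipelines narrow transformations within one stage.

```python
# Record the order in which elements pass through the two operations.
events = []

def parse(x):
    events.append(("map", x))
    return x * 2

def keep(x):
    events.append(("filter", x))
    return x > 2

# Analogous to rdd.map(parse).filter(keep) inside a single Spark stage:
# generators pull one element at a time through both functions.
pipeline = (y for y in (parse(x) for x in [1, 2, 3]) if keep(y))
result = list(pipeline)
# events interleaves map and filter per element, rather than running
# all maps first and all filters afterwards -- that is pipelining.
```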
Write the project report documenting your clustering approach, your implementation, the obtained results, and runtime analysis. This report should be readable for an informed outsider and it should not require the reader to look at or run any code.
Line 52 in 6bede92
@bskowron23 - I found slightly different numbers for the case-insensitive results. The values for case-sensitive matched.
[training@localhost ~]$ awk '{if($1=="a")print}' shakespeare_AvgWordLenght_case-insensitive.txt
a 3.275899648342265
[training@localhost ~]$ awk '{if($1=="w")print}' shakespeare_AvgWordLenght_case-insensitive.txt
w 4.373096283946263
[training@localhost ~]$ awk '{if($1=="z")print}' shakespeare_AvgWordLenght_case-insensitive.txt
z 5.053333333333334
Run on the full TrainingRatings.txt set with N=15 and add the result to the written answers.
Compare the runtime of your k-means implementation (using the same value for k) for all three datasets using the local mode with at least two threads. Further, rerun your implementation without using persistent RDDs and compare those runtimes to the previously obtained ones. Create a table summarizing these results and briefly discuss your findings. After job completion you can read off the runtimes and other statistics from the Spark History Server at http://localhost:18080. You might want to rerun each experiment a couple of times and use the average runtime for a more robust comparison (time permitting).
Step 1: Hue File Browser and Data Deployment
Step 2: Spark Documentation
Step 3: Spark Shell and RDDs
Step 4: Word-Count and Job Execution
@bskowron23,
Sorry about the radio silence. Work continues to be relentless. Also I've had a fever the last two days to top it all off.
My current work is fairly unorganized in the Word doc, but I think the Spark answers/code are there for 1 and 2.
I need to reformat to the latex document. Hence this issue.
Also, I need to read some of the text for 3. I think I know the answer, but I suspect it's in one of our readings or available in Spark docs.
Jason
Download the large-scale clustering data of (latitude, longitude) pairs extracted from DBpedia (https://classes.cec.wustl.edu/cse427/lat_longs.zip). Each record represents a location/place that has a Wikipedia article and latitude/longitude information. The format is: lat long name_of_page.
In total, there are 450,151 points in a 2D space (i.e., space with spherical geometry – maybe it would make sense to use the great circle distance when analyzing this data...). To get a smaller sample of this dataset for testing purposes, you could put a bounding box around the US and filter only those records inside the bounding box. Try to visualize this data. Eventually, you want to cluster the whole world using the entire dataset...
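A sketch of such a bounding-box filter (the corner coordinates for the contiguous US are rough assumptions, and parse_record assumes the "lat long name_of_page" format above, where the page name may itself contain spaces):

```python
# Rough bounding box for the contiguous US -- the exact corners are an
# assumption, adjust as needed.
US_LAT = (24.0, 50.0)
US_LON = (-125.0, -66.0)

def in_us(lat, lon):
    return US_LAT[0] <= lat <= US_LAT[1] and US_LON[0] <= lon <= US_LON[1]

def parse_record(line):
    """Each record is 'lat long name_of_page'; split only on the first
    two spaces so names containing spaces stay intact."""
    lat, lon, name = line.split(" ", 2)
    return float(lat), float(lon), name

def us_sample(lines):
    """Keep only the records inside the US bounding box."""
    records = (parse_record(line) for line in lines)
    return [(lat, lon, name) for lat, lon, name in records if in_us(lat, lon)]
```

In the Spark version this would be a map followed by a filter on the full RDD; the same predicate also works for drawing a random visualization subset.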
For the combiner, the input record count is: 3,192,295
The output record count is: 1,789
Line 71 in ed8172c
I think the answer would be 3,190,506 key,value pairs when combined.
Understanding Parallel Data Processing and Persisting RDDs