cse427s's People

Contributors

bob-skowron, jasonwalker80, keon6kim

cse427s's Issues

Final Submission

Submit your report, including documentation as well as results, as project_report.pdf by adding it to the final_project/spark folder in your SVN repository. Submit your implementation by adding it to the final_project/spark/src folder in your SVN repository. Do NOT add any data!

Add the new files/folders to your SVN repo before committing:

$ svn add src/*
$ svn add project_report.pdf
$ svn commit -m 'final project submission'

Delimiter and number of fields

A few comments on the use of the regex:

valid_regex = re.compile(r"\d{4}-\d{2}-\d{2}:\d{2}:\d{2}:\d{2}[,|][\w\s.]+[,|]\w{8}-\w{4}-\w{4}-\w{4}-\w{12}[,|]([\d]+|enabled|disabled)[,|]([\d]+|enabled|disabled)[,|]([\d]+|enabled|disabled)[,|]([\d]+|enabled|disabled)[,|]([\d]+|enabled|disabled)[,|]([\d]+|enabled|disabled)[,|]([\d]+|enabled|disabled)[,|]([\d]+|enabled|disabled)[,|]([\d]+|enabled|disabled)[,|]([-]{0,1}[\d]{1,3}[.][\d]+|0)[,|]([-]{0,1}[\d]{1,3}[.][\d]+|0)")

  1. The problem suggests determining the delimiter, either , or |, by parsing out character 19 of each string. The regex looks like it should work for either comma or pipe delimited text. I wonder if there are any gotchas where a regex matches, but using character 19 would produce a different result.

  2. Is there a regex/pattern match to filter out records that do not contain exactly 14 fields? A regex match is usually stricter than a simple split() on a delimiter. However, again I wonder whether this results in a different answer.
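
A quick way to check whether the two approaches ever disagree is to count the records where they differ. Below is a minimal sketch, assuming the valid_regex defined above and an RDD of raw input lines named lines (the name is only illustrative); since the date prefix YYYY-MM-DD:HH:MM:SS is 19 characters long, the character at 0-based index 19 is the first delimiter.

def fields_by_char19(line):
    # The date prefix "YYYY-MM-DD:HH:MM:SS" is 19 characters, so index 19
    # (0-based) holds the first occurrence of the delimiter.
    return line.split(line[19])

def agrees_with_regex(line):
    # True when the regex decision and the character-19 split decision match.
    by_regex = valid_regex.match(line) is not None
    by_split = len(line) > 19 and line[19] in ",|" and len(fields_by_char19(line)) == 14
    return by_regex == by_split

# Count records where the two approaches disagree (lines = RDD of raw records).
disagreements = lines.filter(lambda l: not agrees_with_regex(l)).count()
print(disagreements)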

Problem 3 Step 3: Compute and Visualize Clusters

In this step, you will compare the clusters using Euclidean distance vs. great circle distance. Calculate the k-means clusters for the device location data using k = 5.
Calculate the k-means clusters for the synthetic location data using k = 2 and k = 4.
Calculate the k-means clusters for the large-scale DBpedia location data. You will need to experiment with the number of clusters (maybe use k = 6 for a start, or k = 2 or 4 if you use the US locations only). Argue what choice of k makes sense by considering the problem context, i.e., what could the clusters actually mean/represent?
Visualize the clusters and cluster centers (use a random subset of data points for the last dataset) for both distance measures. Can you observe a difference?
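
For the visualization, something as simple as a matplotlib scatter plot works; Spark is not required here. The sketch below is only one possible approach and assumes the clustered points have been collected (or sampled) to the driver as (latitude, longitude, cluster_index) tuples and that final_centers is the list of (latitude, longitude) center points; both names are illustrative.

import matplotlib.pyplot as plt

def plot_clusters(points, centers, title):
    # points: list of (lat, lon, cluster_index) tuples collected from the clustered RDD
    # centers: list of final (lat, lon) cluster centers
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    labels = [p[2] for p in points]
    plt.scatter(lons, lats, c=labels, s=2, cmap="tab10")
    plt.scatter([c[1] for c in centers], [c[0] for c in centers],
                c="black", marker="x", s=80)
    plt.xlabel("longitude")
    plt.ylabel("latitude")
    plt.title(title)
    plt.show()

# For the DBpedia result, sample before collecting to keep the plot (and driver) manageable:
# sample = clustered_rdd.sample(False, 0.01).collect()
# plot_clusters(sample, final_centers, "k-means with great circle distance, k = 6")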

Problem 3 Step 2: Understanding and Implementing k-means

MMDS chapter 7.3 (http://infolab.stanford.edu/~ullman/mmds/ch7.pdf) gives pseudocode and implementation strategies for the k-means clustering algorithm. Detailed implementation requirements/specifications are listed below:
The following functions will be useful for calculating k-means:
• closestPoint: given a (latitude/longitude) point and an array of current center points, returns the index in the array of the center closest to the given point
• addPoints: given two points, return a point which is the sum of the two points.
• EuclideanDistance: given two points, returns the Euclidean distance of the two.
• GreatCircleDistance: given two points, returns the great circle distance of the two.
Note that the addPoints function will be used to compute the new cluster centers. As we are working with spatial data given as latitude-longitude pairs, implementing this function in a meaningful way will need some thought!
The distance measure used (Euclidean or great circle), as well as the parameter k (number of clusters), should be read as an input from the command line.
Create a variable convergeDist that will be used to decide when the k-means calculation is done, i.e., when the amount by which the locations of the means change between iterations is less than convergeDist. A "perfect" solution would be 0; this number represents a "good enough" solution. For this project, use a value of 0.1.
Parse the input file, which should also be specified as a variable via the command line, into (latitude, longitude) pairs. Be sure to persist (cache) the resulting RDD because you will access it each time through the following iterations.
Now, plan and implement the main part of the k-means algorithm. Make sure to consider an efficient implementation, being aware of tasks, stages, and cached RDDs.
When the iteration is complete, display and return the final k center points and store the k clusters (i.e., all data points plus cluster information).
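
The sketch below shows one possible way to put these pieces together in PySpark. It is not the required solution: the function names follow the list above (in snake_case), the input is assumed to be comma-delimited with latitude and longitude as the first two fields (as produced in Problem 2), addPoints is a plain component-wise sum here (see the addpoints issue below for the spherical-coordinate concern), and the great circle distance is returned as the central angle on a unit sphere.

import math
import sys
from pyspark import SparkContext

def closest_point(p, centers, dist_fn):
    # Index of the center in `centers` closest to point p.
    best_index, best_dist = 0, float("inf")
    for i, c in enumerate(centers):
        d = dist_fn(p, c)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

def add_points(p1, p2):
    # Component-wise sum of two (lat, lon) points, used to average cluster members.
    return (p1[0] + p2[0], p1[1] + p2[1])

def euclidean_distance(p1, p2):
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

def great_circle_distance(p1, p2):
    # Haversine formula; returns the central angle (radians) on a unit sphere.
    lat1, lon1, lat2, lon2 = map(math.radians, (p1[0], p1[1], p2[0], p2[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))

if __name__ == "__main__":
    # Usage: spark-submit kmeans.py <input_file> <k> <euclidean|greatcircle>
    input_file, k, measure = sys.argv[1], int(sys.argv[2]), sys.argv[3]
    dist_fn = euclidean_distance if measure == "euclidean" else great_circle_distance
    converge_dist = 0.1

    sc = SparkContext(appName="KMeans")
    # Parse into (lat, lon) pairs and cache: the RDD is reused every iteration.
    points = (sc.textFile(input_file)
                .map(lambda line: line.split(","))
                .map(lambda f: (float(f[0]), float(f[1])))
                .persist())

    centers = points.takeSample(False, k, seed=42)
    temp_dist = float("inf")
    while temp_dist > converge_dist:
        # Assign each point to its closest center, then sum points and counts per cluster.
        closest = points.map(lambda p: (closest_point(p, centers, dist_fn), (p, 1)))
        sums = closest.reduceByKey(lambda a, b: (add_points(a[0], b[0]), a[1] + b[1]))
        new_centers = sums.mapValues(lambda s: (s[0][0] / s[1], s[0][1] / s[1])).collectAsMap()
        # Total movement of the centers in this iteration decides convergence.
        temp_dist = sum(dist_fn(centers[i], new_centers[i]) for i in new_centers)
        for i, c in new_centers.items():
            centers[i] = c

    print(centers)
    # Optionally store every point together with its final cluster index:
    # points.map(lambda p: (closest_point(p, centers, dist_fn), p)).saveAsTextFile("clusters_out")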

Problem 1(a)

Are we supposed to sort somewhere? I'm getting a different answer than you:
[u'131.166.169.114/67858',
u'254.253.248.193/3579',
u'200.192.65.234/8897',
u'62.193.77.201/169',
u'237.64.198.45/4261',
u'79.35.153.191/91',
u'191.71.1.210/26287',
u'181.208.88.8/59523',
u'255.201.124.130/183',
u'187.27.227.110/113']

hw7 prob3 b

Very unsure about this... there is a question on Piazza about it, but this will likely need some review...

Problem 2 Step 1: Prepare device status data

Review the contents of the file $DEV1DATA/devicestatus.txt. You will have to pre-process the data in order to get it into a standardized format for later processing. This is a common part of the ETL (Extract, Transform, Load) process called data scrubbing.
The input data contains information collected from mobile devices on Loudacre's network, including device ID, current status, location and so on. Loudacre Mobile is a (fictional) fast-growing wireless carrier that provides mobile service to customers throughout the western USA. Because Loudacre previously acquired other mobile providers' networks, the data from different subnetworks has a different format. Note that the records in this file have different field delimiters: some use commas, some use pipes (|) and so on. Your task is to
• Load the dataset
• Determine which delimiter to use (hint: the character at position 19 is the first use of the delimiter)
• Filter out any records which do not parse correctly (hint: each record should have exactly 14 values)
• Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields respectively). You might want to store latitude and longitude as the first two fields to make it consistent with the other two datasets.
• Filter out locations that have a latitude and longitude of 0.
• The model field contains the device manufacturer and model name (e.g., Ronin S2). Split this field by spaces to separate the manufacturer from the model (e.g., manufacturer Ronin, model S2).
• Save the extracted data to comma delimited text files in the /loudacre/devicestatus_etl directory on HDFS.
• Confirm that the data in the file(s) was saved correctly. Provide a screenshot named devicedata.png showing a couple of records and add it to your SVN repository.
• Visualize the (latitude, longitude) pairs of the device location data. You do not have to use SPARK for the visualization. Show this visualization at your milestone 2 demo with the TA or instructor.
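
Below is a minimal sketch of one way to implement the scrubbing steps above in PySpark. It is only an outline, not a reference solution: it assumes the delimiter hint refers to the 0-based character index 19, that every valid record really has 14 fields, and that the field positions match the bullets above.

from pyspark import SparkContext

sc = SparkContext(appName="DeviceStatusETL")

def extract(fields):
    # lat, lon first (for consistency with the other datasets), then date,
    # manufacturer, model name and device id.
    lat, lon = fields[12], fields[13]
    date, model, device_id = fields[0], fields[1], fields[2]
    manufacturer = model.split(" ")[0]
    model_name = " ".join(model.split(" ")[1:])
    return (lat, lon, date, manufacturer, model_name, device_id)

cleaned = (sc.textFile("/loudacre/devicestatus.txt")
             # character 19 is the first occurrence of the record's delimiter
             .filter(lambda line: len(line) > 19)
             .map(lambda line: line.split(line[19]))
             # keep only records that parsed into exactly 14 fields
             .filter(lambda fields: len(fields) == 14)
             .map(extract)
             # drop locations reported as (0, 0)
             .filter(lambda r: not (float(r[0]) == 0.0 and float(r[1]) == 0.0)))

cleaned.map(lambda r: ",".join(r)).saveAsTextFile("/loudacre/devicestatus_etl")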

addpoints

With lat/long you can only go up to ±90/±180. I think we need to adjust addPoints to handle this.
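
One possible way around this (a sketch, not necessarily what the assignment expects): keep the running sums in 3D Cartesian coordinates and only convert back to latitude/longitude when the cluster mean is taken. Averaging raw degrees can behave badly near the poles and the ±180° longitude seam, while the mean of unit vectors always maps back to a valid point on the sphere. The helper names below are made up for illustration.

import math

def to_cartesian(p):
    # (lat, lon) in degrees -> 3D unit vector
    lat, lon = math.radians(p[0]), math.radians(p[1])
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def to_lat_lon(v):
    # 3D vector (any magnitude) -> (lat, lon) in degrees; only the direction matters
    x, y, z = v
    lon = math.degrees(math.atan2(y, x))
    lat = math.degrees(math.atan2(z, math.sqrt(x * x + y * y)))
    return (lat, lon)

def add_points(v1, v2):
    # Sum two Cartesian vectors; conversion back to lat/lon happens once per cluster.
    return (v1[0] + v2[0], v1[1] + v2[1], v1[2] + v2[2])

# Two points near the north pole on opposite sides of the date line:
members = [(89.0, 10.0), (89.0, -170.0)]
total = (0.0, 0.0, 0.0)
for p in members:
    total = add_points(total, to_cartesian(p))
print(to_lat_lon(total))   # a sensible center near the pole, never outside +-90/180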

Problem with creating Elbow Graph

@bskowron23 So it seems like the Tableau file you uploaded to Google Drive is not calculating dist_norm and Distance Measure at all.

I cannot output an Elbow graph without them properly working. Specifically, what is Distance Measure supposed to be?

I don't think your implementation of k-means algo outputs Distance Measure.
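
For what it is worth, the usual quantity behind an elbow graph is the within-cluster sum of squared distances (often called WSSSE); whether that is what the Tableau workbook expects as Distance Measure is a guess. A sketch, reusing the illustrative helpers from the k-means sketch earlier in these issues:

def wssse(points, centers, dist_fn):
    # Within-cluster sum of squared distances: sum over all points of the squared
    # distance to their closest center. Plot this against k to find the elbow.
    return (points
            .map(lambda p: dist_fn(p, centers[closest_point(p, centers, dist_fn)]) ** 2)
            .sum())

# Run k-means for several values of k, record (k, wssse) rows, and export them as
# CSV so Tableau (or matplotlib) can draw the elbow graph.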

Problem 4: Big Data and Cloud Execution

Can you find or crawl an even bigger dataset you want to cluster? This dataset should be big enough that your k-means clustering computation is not feasible in your pseudo-cluster. Cluster this data by executing your SPARK program on Amazon EMR and report your results and experiences. If you do not find another geo-location dataset, feel free to perform clustering on any other big data clustering problem. Keep in mind that the interpretation and visualization of the retrieved clusters is now much harder. If need be, you can also use the DBpedia data on EMR.
Document your cloud execution approach and provide the data source in your final project report. Add your pre-processing code to the src folder in your SVN repository. Describe your findings including dataset size and runtimes in your final project report.
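
A minimal sketch of submitting the job as an EMR step from Python with boto3 (the cluster id, bucket names and script arguments below are placeholders, not values from this project):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # placeholder: id of a running EMR cluster with Spark
    Steps=[{
        "Name": "kmeans on large dataset",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # standard way to run spark-submit as an EMR step
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://your-bucket/src/kmeans.py",
                     "s3://your-bucket/data/big_lat_longs.txt", "6", "greatcircle"],
        },
    }],
)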

Problem 3 Step 1: Understanding Parallel Data Processing and Persisting RDDs

This is the theory part of the project. Review the slides from the lecture and Lab 6 to understand the main data concept in SPARK – Resilient Distributed Datasets (RDDs). You will need to persist an RDD (at least once) in your k-means implementation. Additionally, make yourself familiar with how to view stages and tasks, e.g., using the Spark Application UI (when using the Spark shell in local mode) or the Spark History Server at http://localhost:18080 (when running scripts locally).
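
A minimal sketch of the persistence part (the input path and variable names are only illustrative):

from pyspark import StorageLevel

# Cache the parsed points so every k-means iteration reuses the in-memory partitions
# instead of re-reading and re-parsing the input file.
points = (sc.textFile("/loudacre/devicestatus_etl/part-*")
            .map(lambda line: line.split(","))
            .map(lambda f: (float(f[0]), float(f[1]))))
points.persist(StorageLevel.MEMORY_ONLY)   # .cache() is shorthand for this level

points.count()   # first action: computes the RDD and fills the cache
points.count()   # second action: in the Spark UI this stage reads from the cached RDD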

Upload FINAL implementation of k-means

I noticed that the @bskowron23 version of the k-means implementation is missing the script/preprocessing logic for DBpedia data. @keon6kim can you upload the latest version of the implementation that was used for Problem 3 Step 3?

I will need that implementation for submission of the final project.

HW4: Problem 1 - Part E


@bskowron23 - I found slightly different numbers for the case-insensitive results. The case-sensitive values matched.

[training@localhost ~]$ awk '{if($1=="a")print}' shakespeare_AvgWordLenght_case-insensitive.txt
a	3.275899648342265
[training@localhost ~]$ awk '{if($1=="w")print}' shakespeare_AvgWordLenght_case-insensitive.txt
w	4.373096283946263
[training@localhost ~]$ awk '{if($1=="z")print}' shakespeare_AvgWordLenght_case-insensitive.txt
z	5.053333333333334

HW6: Problem 1b

Run on the full TrainingRatings.txt set with N=15 and add the answer to the written answers.

Problem 3 Step 4: Runtime Analysis

Compare the runtime of your k-means implementation (using the same value for k) for all three datasets using the local mode with at least two threads. Further, rerun your implementation without using persistent RDDs and compare those runtimes to the previously obtained ones. Create a table summarizing these results and briefly discuss your findings. After job completion you can read off the runtimes and other statistics from the Spark History Server at http://localhost:18080. You might want to rerun each experiment a couple of times and use the average runtime for a more robust comparison (time permitting).

Problem 1

Step 1: Hue File Browser and Data Deployment
Step 2: Spark Documentation
Step 3: Spark Shell and RDDs
Step 4: Word-Count and Job Execution

HW8: Convert Word doc to PDF and LaTeX document.

@bskowron23,

Sorry about the radio silence. Work continues to be relentless. Also I've had a fever the last two days to top it all off.

My current work is fairly unorganized in the Word doc, but I think the Spark answers/code are there for 1 and 2.

I need to reformat this into the LaTeX document. Hence this issue.

Also, I need to read some of the text for 3. I think I know the answer, but I suspect it's in one of our readings or available in Spark docs.

Jason

Problem 2 Step 3: Get and Pre-process the DBpedia location data

Download the large-scale clustering data of (latitude, longitude) pairs extracted from DBpedia (https://classes.cec.wustl.edu/cse427/lat_longs.zip). Each record represents a location/place that has a Wikipedia article and latitude/longitude information. The format is: lat long name_of_page.
In total, there are 450,151 points in a 2D space (i.e., space with spherical geometry – maybe it would make sense to use the great circle distance when analyzing this data...). To get a smaller sample of this dataset for testing purposes, you could put a bounding box around the US and filter only those records inside the bounding box. Try to visualize this data. Eventually, you want to cluster the whole world using the entire dataset...
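
A sketch of this pre-processing step (the bounding box values are rough, and the input path is whatever the unzipped file is called locally or on HDFS):

# Each line looks like: lat long name_of_page. The page name may itself contain spaces,
# so only the first two whitespace-separated tokens are parsed.
def parse_line(line):
    parts = line.split()
    return (float(parts[0]), float(parts[1]))

points = sc.textFile("lat_longs").map(parse_line)

# Rough bounding box around the contiguous US, just to get a smaller test sample.
US_LAT = (24.0, 50.0)
US_LON = (-125.0, -66.0)
us_points = points.filter(lambda p: US_LAT[0] <= p[0] <= US_LAT[1]
                                    and US_LON[0] <= p[1] <= US_LON[1])

print(points.count(), us_points.count())   # 450,151 points in total per the description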

HW6: Problem 2

For the combiner, the input record count is: 3,192,295
The output record count is: 1,789

Number of key-value pairs combined: 1,789

I think the answer would be 3,192,295 - 1,789 = 3,190,506 key-value pairs combined.

Problem 2: Step 1

Understanding Parallel Data Processing and Persisting RDDs

This is the theory part of the project. Review the slides from the lecture and Lab 6 to understand the main data concept in SPARK – Resilient Distributed Datasets (RDDs). You will need to persist an RDD (at least once) in your k-means implementation. Additionally, make yourself familiar with how to view stages and tasks, e.g., using the Spark Application UI (when using the Spark shell in local mode) or the Spark History Server at http://localhost:18080 (when running scripts locally).
