Bitly-Data-from-1.USA-project

This dataset contains the browser name, city, time zone, country, and web page visited by anonymous users who shortened links ending in .gov or .mil in 2011.

  • Browser name
  • City
  • Time zone
  • Country
  • Web page visited

What are the 10 time zones with the highest presence?

Transforming the data

We count how many times each time zone appears, using a small helper function.

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
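As a minimal sketch of how this function might be used: the list `time_zones` is not defined in this README, so the snippet below assumes it is built from a list of decoded JSON records, keeping only the records that have a "tz" field. The `sample_records` list and the `top_counts` helper are hypothetical stand-ins for illustration, not part of the project.

```python
# Hypothetical sample standing in for the decoded 1.usa.gov records.
sample_records = [
    {"tz": "America/New_York", "a": "Mozilla/5.0"},
    {"tz": "America/Chicago", "a": "Mozilla/4.0"},
    {"tz": "", "a": "Opera/9.80"},
    {"a": "GoogleMaps/RochesterNY"},  # no "tz" key at all
    {"tz": "America/New_York", "a": "Mozilla/5.0"},
]

# Keep only the records that actually carry a time zone field.
time_zones = [rec["tz"] for rec in sample_records if "tz" in rec]


def get_counts(sequence):
    """Count occurrences of each item with a plain dict."""
    counts = {}
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1
    return counts


def top_counts(count_dict, n=10):
    """Return the n largest (count, key) pairs, biggest first (hypothetical helper)."""
    pairs = sorted(((count, key) for key, count in count_dict.items()), reverse=True)
    return pairs[:n]


counts = get_counts(time_zones)
print(top_counts(counts, n=2))
```

Sorting `(count, key)` pairs keeps the helper free of dependencies; for the real data, a larger `n` such as 10 would answer the question in the section heading.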

The same counts can be obtained more compactly with the standard library, using either defaultdict or Counter:

```python
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

get_counts2(time_zones)

from collections import Counter

# Counter does the counting and the ranking in one step
counts = Counter(time_zones)
counts.most_common(10)
```

![image](https://github.com/EduardoJMR/Bitly-Data-from-1.USA-project/blob/master/images/Capture.JPG)

Once the records have been converted to a DataFrame, counting how often each time zone appears is even easier.

import pandas as pd

frame = pd.DataFrame(records)
frame["tz"]

tz_counts = frame["tz"].value_counts()
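The DataFrame step can be sketched on a few made-up records. The `sample_records` list below is a hypothetical stand-in for the real `records`; it only illustrates that a record without a "tz" key becomes NaN in the column and that `value_counts()` ignores NaN by default.

```python
import pandas as pd

# Hypothetical records; the real ones come from the 1.usa.gov data file.
sample_records = [
    {"tz": "America/New_York", "a": "Mozilla/5.0"},
    {"tz": "America/Chicago", "a": "Mozilla/4.0"},
    {"tz": "", "a": "Opera/9.80"},
    {"a": "GoogleMaps/RochesterNY"},  # missing "tz" becomes NaN
    {"tz": "America/New_York", "a": "Mozilla/5.0"},
]

frame = pd.DataFrame(sample_records)

# value_counts() sorts by frequency (descending) and skips NaN by default.
tz_counts = frame["tz"].value_counts()
print(tz_counts)
```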

Cleaning the data

Some records have no time zone: the value is either NaN or an empty string. We clean the data by replacing NaN with "Missing" and empty strings with "Unknown".

clean_tz = frame["tz"].fillna("Missing")
clean_tz[clean_tz == ""] = "Unknown"
tz_counts=clean_tz.value_counts()
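A small sketch of the two cleaning steps, run on a hypothetical Series rather than the real `frame["tz"]` column, shows how NaN and empty strings end up in separate categories:

```python
import numpy as np
import pandas as pd

# Hypothetical time zone column with one NaN and two empty strings.
tz = pd.Series(["America/New_York", "", np.nan, "America/New_York", ""])

clean_tz = tz.fillna("Missing")          # NaN -> "Missing"
clean_tz[clean_tz == ""] = "Unknown"     # empty string -> "Unknown"

tz_counts = clean_tz.value_counts()
print(tz_counts)
```

Keeping the two labels distinct preserves the difference between a record with no "tz" field and a record whose field was present but empty.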

Visualizing the data

Top time zones in the 1.usa.gov sample data

import seaborn as sns
subset = tz_counts.head()
sns.barplot(y=subset.index, x=subset.to_numpy())

image

What are the 5 most used browsers?

Cleaning the data

Parsing all of the interesting information in these “agent” strings may seem like a daunting task. One possible strategy is to split off the first token in the string (corresponding roughly to the browser capability) and make another summary of the user behavior:

results= pd.Series([x.split()[0] for x in frame["a"].dropna()])
results.head(5)

image

results.value_counts().head(8)
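To make the first-token strategy concrete, here is a sketch on a handful of invented agent strings. The strings themselves are assumptions for illustration only; the real values live in column "a" of `frame`.

```python
import pandas as pd

# Hypothetical agent strings, including one missing value.
agents = pd.Series([
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11",
    "GoogleMaps/RochesterNY",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)",
    None,
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)",
])

# Drop missing agents, then keep the text before the first whitespace.
first_tokens = pd.Series([a.split()[0] for a in agents.dropna()])
browser_counts = first_tokens.value_counts()
print(browser_counts)
```

Note that the first token is only a rough proxy: very different browsers can share a token such as "Mozilla/5.0".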

Visualizing the data

Top 5 most used browsers

import seaborn as sns
subset2 = results.value_counts().head()
sns.barplot(y=subset2.index, x=subset2.to_numpy())

image

Top time zones by Windows and non-Windows users?

Transforming the data

Another way to differentiate users is by whether or not they use Windows. To do this, we check whether each agent string in column "a" contains the word "Windows".

cframe= frame[frame["a"].notna()].copy()
cframe

image

import numpy as np

cframe["os"] = np.where(cframe["a"].str.contains("Windows"), "Windows", "Not Windows")
cframe["os"].head(5)

image

by_tz_os = cframe.groupby(["tz" , "os"])
agg_counts= by_tz_os.size().unstack().fillna(0)
agg_counts.head(20)
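The labelling and grouping steps above can be sketched on a four-row frame. The rows are hypothetical; the point is to show how `size().unstack()` turns the (tz, os) groups into a table and why `fillna(0)` is needed for combinations that never occur.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the cleaned frame.
cframe = pd.DataFrame({
    "tz": ["America/New_York", "America/New_York", "America/Chicago", "Europe/London"],
    "a": [
        "Mozilla/5.0 (Windows NT 6.1)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)",
        "Opera/9.80 (X11; Linux i686)",
    ],
})

# Label each row by whether its agent string mentions Windows.
cframe["os"] = np.where(cframe["a"].str.contains("Windows"), "Windows", "Not Windows")

# Count rows per (tz, os) pair, pivot "os" into columns, fill absent pairs with 0.
agg_counts = cframe.groupby(["tz", "os"]).size().unstack().fillna(0)
print(agg_counts)
```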

image

Once we know the number of Windows and non-Windows users in each time zone, we plot the ten time zones with the highest traffic, split by operating system.

To do this, we first build an indexer for the agg_counts table. agg_counts.sum("columns") gives the total number of users (Windows plus non-Windows) per time zone, and argsort() returns the row positions that would sort those totals in ascending order. Taking the last ten positions then selects the ten busiest time zones.

indexer= agg_counts.sum("columns").argsort()
count_subset = agg_counts.take(indexer[-10:])
count_subset
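The argsort-then-take idea can be checked on a tiny table. The numbers below are invented, and the sketch selects the top 2 rows instead of the top 10; `to_numpy()` is used here only to make the positional integers explicit.

```python
import pandas as pd

# Hypothetical per-time-zone counts.
agg_counts = pd.DataFrame(
    {"Not Windows": [5.0, 1.0, 30.0, 2.0], "Windows": [10.0, 0.0, 20.0, 8.0]},
    index=pd.Index(
        ["Asia/Tokyo", "Europe/Oslo", "America/New_York", "Europe/London"],
        name="tz",
    ),
)

# Row totals: Tokyo 15, Oslo 1, New York 50, London 10.
totals = agg_counts.sum(axis="columns")

# argsort() gives the positions that sort the totals in ascending order.
indexer = totals.argsort().to_numpy()

# The last two positions are the two largest totals (ascending order kept).
top2 = agg_counts.take(indexer[-2:])
print(top2)

# An alternative that selects the same rows, largest first.
print(totals.nlargest(2))
```

`nlargest` is shorter when only the totals matter; the indexer approach keeps the full two-column table in ascending order, which suits a horizontal bar plot.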

image

count_subset = count_subset.stack()
count_subset.name = "total"
count_subset = count_subset.reset_index()
count_subset.head(10)

image

Visualizing the data

Top time zones by Windows and non-Windows users

sns.barplot(x="total",y="tz",hue="os", data= count_subset)

image

Transforming the data

With the Windows and non-Windows counts per time zone known, it helps to normalize the data before plotting, so that each bar shows the proportion of users within its time zone rather than the raw count.

def norm_total(group):
    group["normed_total"] = group["total"]/group["total"].sum()
    return group
results= count_subset.groupby("tz").apply(norm_total)
results

image

g= count_subset.groupby("tz")
result2= count_subset["total"]/g["total"].transform("sum")
count_subset["normed_total_2"]=result2
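As a quick sanity check of the normalization, the sketch below applies the `transform("sum")` approach to a hypothetical four-row `count_subset` and confirms that the proportions within each time zone add up to 1.

```python
import pandas as pd

# Hypothetical long-format counts, two OS rows per time zone.
count_subset = pd.DataFrame({
    "tz": ["America/New_York", "America/New_York", "Europe/London", "Europe/London"],
    "os": ["Not Windows", "Windows", "Not Windows", "Windows"],
    "total": [30.0, 20.0, 2.0, 8.0],
})

# transform("sum") broadcasts each group's sum back onto its own rows.
group_totals = count_subset.groupby("tz")["total"].transform("sum")
count_subset["normed_total"] = count_subset["total"] / group_totals

# Each time zone's proportions should sum to 1.
check = count_subset.groupby("tz")["normed_total"].sum()
print(count_subset)
print(check)
```

The `transform` version avoids calling a Python function once per group, so it is usually the faster of the two approaches shown in this section.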

Visualizing the data

Top time zones by Windows and non-Windows users normalized

sns.barplot(x="normed_total", y="tz", hue="os", data=results)

image

Contributors

  • eduardojmr
