This dataset contains the browser name, city, time zone, country, and web page visited by anonymous users who shortened links ending in .gov or .mil in 2011.
- Browser name
- City
- Time zone
- Country
- Web page visited
# What are the 10 time zones with the highest presence?
## Transforming the data
We start by counting how many times each time zone appears, using a small helper function.
```python
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
```

## What are the time zones with the highest presence?

```python
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

get_counts2(time_zones)
```

```python
from collections import Counter

counts = Counter(time_zones)
counts.most_common(10)
```
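The snippets above rely on a list named `time_zones` that is not defined in this section. As a minimal sketch, assuming the raw data holds one JSON object per line with an optional `tz` field, the list could be built as follows. The sample lines are made up for illustration and are not taken from the real file.

```python
import json
from collections import Counter

# Hypothetical sample lines standing in for the raw file (one JSON object per line).
raw_lines = [
    '{"a": "Mozilla/5.0 (Windows NT 6.1)", "tz": "America/New_York"}',
    '{"a": "Mozilla/5.0 (Macintosh)", "tz": "America/Chicago"}',
    '{"a": "Mozilla/4.0 (compatible)", "tz": ""}',
    '{"a": "Opera/9.80 (Windows NT 5.1)", "tz": "America/New_York"}',
    '{"t": 1331923261}',  # a record without any "tz" field
]

records = [json.loads(line) for line in raw_lines]

# Not every record has a "tz" key, so filter while extracting.
time_zones = [rec["tz"] for rec in records if "tz" in rec]

counts = Counter(time_zones)
top = counts.most_common(2)
```

Note that an empty string is still counted as a time zone here; only records with no `tz` key at all are skipped.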
```python
tz_counts = frame["tz"].value_counts()
```
![image](https://github.com/EduardoJMR/Bitly-Data-from-1.USA-project/blob/master/images/Capture.JPG)
Once the data has been converted to a DataFrame, counting the records per time zone is even easier.
Some of the records have no time zone data and show up as NaN. We can clean the data by replacing these missing values with "Unknown".
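A minimal sketch of this cleaning step, assuming the time zones live in a column named `tz` as in the `value_counts()` call above. The sample values are hypothetical, and treating empty strings as missing is an extra assumption not stated in the text.

```python
import pandas as pd

# Hypothetical sample standing in for the real records.
frame = pd.DataFrame({
    "tz": ["America/New_York", None, "", "America/Chicago", "America/New_York"],
})

# Replace missing values (NaN/None) with a readable label.
clean_tz = frame["tz"].fillna("Unknown")

# Assumption: empty strings also mean "no time zone", so relabel them too.
clean_tz = clean_tz.replace("", "Unknown")

tz_counts = clean_tz.value_counts()
```

After this step, `tz_counts` has no NaN entry, and the records without a time zone appear as a single "Unknown" row.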
Parsing all of the interesting information in these “agent” strings may seem like a daunting task. One possible strategy is to split off the first token in the string (corresponding roughly to the browser capability) and make another summary of the user behavior:
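The first-token strategy can be sketched as below. The agent strings are hypothetical examples, and the variable names are chosen only for illustration; in the real data the strings would come from column "a".

```python
from collections import Counter

# Hypothetical agent strings; None stands for a record with no agent.
agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11",
    "GoogleMaps/RochesterNY",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)",
    None,
]

# Keep the first whitespace-separated token of each non-missing agent string.
first_tokens = [a.split()[0] for a in agents if a]

# Summarise how often each leading token appears.
token_counts = Counter(first_tokens)
top_tokens = token_counts.most_common(3)
```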
Another way to differentiate users is to separate those who use Windows from those who do not. To do this, we search each cell of column "a" for the string "Windows".
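A minimal sketch of this step, assuming the agent string is in column "a" and the time zone in column "tz". The sample rows and the names `cframe` and `os` are illustrative assumptions, as is the choice to drop rows without an agent string.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real records.
frame = pd.DataFrame({
    "a": [
        "Mozilla/5.0 (Windows NT 6.1; WOW64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)",
        None,
        "Opera/9.80 (Windows NT 5.1; U; en)",
    ],
    "tz": ["America/New_York", "America/Chicago", "America/Chicago", "Europe/London"],
})

# Drop rows without an agent string before searching it.
cframe = frame[frame["a"].notna()].copy()

# Label each row by whether its agent string mentions Windows.
cframe["os"] = np.where(cframe["a"].str.contains("Windows"), "Windows", "Not Windows")

# Count users per time zone and OS group; absent combinations become 0.
agg_counts = cframe.groupby(["tz", "os"]).size().unstack().fillna(0)
```

The resulting table has one row per time zone and one column per group, which is the shape the plotting steps below work with.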
With the number of Windows and non-Windows users per time zone in hand, we plot the ten time zones with the highest traffic, split into those two groups.
To do this, we first build an indexer to reorder the agg_counts table. The indexer is based on the total number of users per time zone, Windows and non-Windows combined, computed with sum("columns"). Calling argsort() on those totals returns the row positions that sort them in ascending order.
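The reordering can be sketched as follows, assuming the counts table (called `agg_counts` here) has one row per time zone and one column per OS group. The numbers are hypothetical, and the sketch keeps the top 3 rows instead of 10 to stay short.

```python
import pandas as pd

# Hypothetical counts table: rows are time zones, columns are OS groups.
agg_counts = pd.DataFrame(
    {"Not Windows": [5.0, 1.0, 8.0, 2.0], "Windows": [7.0, 2.0, 20.0, 2.0]},
    index=["America/Chicago", "Europe/Madrid", "America/New_York", "Asia/Tokyo"],
)

# Total users per time zone, then the positions that sort those totals ascending.
indexer = agg_counts.sum(axis="columns").argsort()

# Take the rows at the last positions, i.e. the time zones with the largest totals.
count_subset = agg_counts.take(indexer.to_numpy()[-3:])
```

Because the sort is ascending, the busiest time zone ends up in the last row of `count_subset`.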
Once the number of Windows users per time zone is known, it is advisable to normalise the counts before plotting, so that the chart shows the proportion of users in each time zone instead of raw totals.
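One way to normalise is to divide each row by its own total, so the groups sum to 1 within every time zone. This is a sketch under the assumption that the selected rows are held in a table named `count_subset`; the values are hypothetical.

```python
import pandas as pd

# Hypothetical subset of the counts table (top time zones only).
count_subset = pd.DataFrame(
    {"Not Windows": [2.0, 5.0, 8.0], "Windows": [2.0, 15.0, 24.0]},
    index=["Asia/Tokyo", "America/Chicago", "America/New_York"],
)

# Divide each row by its own total so the columns sum to 1 per time zone.
row_totals = count_subset.sum(axis="columns")
normed_subset = count_subset.div(row_totals, axis="index")
```

If matplotlib is installed, `normed_subset.plot.barh(stacked=True)` should draw one full-width bar per time zone, which makes the Windows share directly comparable across zones.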