Giter Club home page Giter Club logo

social_graph's Introduction

Social_Graph

This is social networking research ๐Ÿ‘ฅ.

Paresing from social network
Build graph of friends
Graph visualization
Graph analyzing
Strongly connected components
Algorithm description and correctness
Histograms

Purpose: To find, cluster and classify the strong connectivity components in a graph of friends from a social network using Python and Gephi.


Paresing from social network

In order to parse your friends, we've generated an easy-to-analyse json file. The following is one version of parsing VKontakte.

import requests
import json

all_id = []

for i in range(13):
    print(i)
    res = requests.get(f"https://api.vk.com/method/groups.getMembers?group_id=45&offset={i * 1000}&count=1000"
                       f"&access_token=4b32289d4b32289d4b32289d1d4b40471c44b324b32289d15e80a8caa89b73d1d2a5579&v=5.107")
    all_id.extend(json.loads(res.text)['response']['items'])

with open('ids.json', 'w') as f:
    f.write(json.dumps({'ids': all_id}))

Build graph of friends

Next the json was converted into a handy two files. One has a list of all the vertices of the graph, the other describes an edge on each line, with two vertices. Next the code that forms the text file.

txt = "test.txt"
f = open(txt).read()
ans = open('rebra.txt', 'w')
arr = []
flag = False
k = 0
check = 1
for i in f:
    if not (ord('0') <= ord(i) <= ord('9')):
        if k != 0:
            if check == 1:
                ans.write(str(k) + ' ')
            else:
                ans.write(str(k) + '\n')
            check *= -1
        k = 0
        flag = False
    if not flag and ord('0') <= ord(i) <= ord('9'):
        flag = True
    if flag and ord('0') <= ord(i) <= ord('9'):
        k = k * 10 + int(i)
print(arr)

Graph visualization

Now a table of vertices and edges has been generated in an xlsx file for further analysis. The table generation code will be given below, as it is quite similar when solving the two problems. These tables can now be exported to gephi. This is a graph visualizer. This is what the whole graph looks like. Immediately you can notice a few giant vertices (the size of a graph vertex is proportional to its degree). These are the pages of the two teachers https://vk.com/id238683 and https://vk.com/id133883. They are closely related to the organization of various extracurricular activities in the school, and consequently communicate a lot with all classes in the school, so there are so many different connections. Janina Yadova, for example, gave us the opportunity to get into this internship. image

Immediately you can see that the nodes with the most connections are usually the teachers, and thus it is not difficult to distinguish them from the general mass. Now I have done a little investigation with my pens. As a physical model is applied to the graph, it highlights communities, they look like clustered vertices. I chose a small sub-graph, and characterised each group by contacting some of its members and finding out how they were connected to our school. image

Graph analyzing

A person's class I determined directly by asking everyone on facebook, under the pretext of research. And so, under group 1 in the photo is the community of current grades 11-1 and 11-2. They are so closely related, as they were intermingled over the course of their studies. That is to say, the former when we moved from grade 7 to 8, we were roughly sorted by interests (maths or physics), but the old connections remained. And by and large, the two classes are pretty close, speaking as a student of one of them. Under numbers 2 is 11-3 grade, 3 is 11-5 grade, 4 is 11-7 and in question are 3 people from 11-4. Their class has sparred badly as many of the pages are either unsubscribed to the group or closed. As a result, only a few people are surrounded by other classes from the parallel, and the rest are either smeared amongst other classes or are not in the graph at all. As you can see, the communities are formed quite clearly, and manually separating them is not a problem.

Strongly connected components

But it's rather long and impractical to allocate everything by hand. Therefore, an algorithm was invented for selecting a class based on the id of one of its members. That is, the component of strong relatedness to which the member in question belongs is selected. Here is the code that does this.

import requests
import json


def findFriend(id, da):
    ss = set()
    for i in d[id]:
        ss.add(i)
    ds = {}
    for i in ss:
        ds[i] = 0
        for j in d[i]:
            for k in ss:
                if k == j:
                    ds[i] += 1
                    break

    anss = []
    for i in ss:
        anss.append((ds[i], i))
    anss = sorted(anss, reverse=True)
    k = 1
    for i in anss:
        if not da.get(i[1]) is None:
            da[i[1]] += k
            k *= 0.95


def GetNameById(user_id):
    res = requests.get(f"https://api.vk.com/method/users.get?user_ids={user_id}&"
                       f"access_token=4b32289d4b32289d4b32289d1d4b40471c44b324b32289d15e80a8caa89b73d1d2a5579&v=5.107&lang=ru")
    res = json.loads(res.text)['response'][0]
    return res['first_name'] + ' ' + res['last_name']


f = open("rebra.txt").read().split("\n")
d = {}
st = set()
for ind, i in enumerate(f):
    a = 0
    b = 0
    j = 0
    flag = False
    while j < len(i):
        if i[j] == ' ':
            flag = True
            j += 1
            continue
        if not flag:
            a = a * 10 + int(i[j])
        else:
            b = b * 10 + int(i[j])
        j += 1
    st.add(a)
    st.add(b)
    if d.get(a) is None:
        d[a] = []
    if d.get(b) is None:
        d[b] = []
    d[a].append(b)
    d[b].append(a)
id0 = 420403096
suspected = set()
for i in d[id0]:
    suspected.add(i)
ds = {}
dans = {}
for i in suspected:
    dans[i] = 0
ii = 0
findFriend(id0, dans)
ans = []
for i in suspected:
    ans.append((dans[i], i))
ans = sorted(ans, reverse=True)
for i in ans:
    findFriend(i[1], dans)
    if ii > 10:
        break
    ii += 1

ans = []
for i in suspected:
    ans.append((dans[i] / (len(d[i]) + 5), i))
ans = sorted(ans, reverse=True)
jj = 0
for i in ans:
    print(i[0], GetNameById(i[1]), jj)
    jj += 1
    if jj > 30:
        break

Algorithm description and correctness

To describe the algorithm in a nutshell, we simply take all vertices with which the vertex in question is connected, and then distinguish the first 30 members that are closely related to each other. But also this algorithm is applied to several other vertices from the friends of the considered one, in order to exclude false connectivity components, with which the considered vertex is connected by just a couple of edges. So, I applied this algorithm to my page, and got the following result. It is 90% correct. But also our class teacher (although it may be considered as a class member), and a couple of people from 11-2, as I said before that we have close social ties with them. image

Histograms

Next, a histogram has been made. I now attach the code that generates the xlsx table I referred to above.

from openpyxl import load_workbook

wb = load_workbook('t1.xlsx')
sheet = wb.get_sheet_by_name('test')
f = open("rebra.txt").read().split("\n")
ans = []
d = {}
dd = {}
st = set()
ii = 0
for ind, i in enumerate(f):
    ans.append([])
    a = 0
    b = 0
    j = 0
    flag = False
    while j < len(i):
        if i[j] == ' ':
            flag = True
            j += 1
            continue
        if not flag:
            a = a * 10 + int(i[j])
        else:
            b = b * 10 + int(i[j])
        j += 1
    if(a > b):
        a, b = b, a
    if dd.get((a, b)) is None:
        dd[(a, b)] = 1
        st.add(a)
        st.add(b)
        if d.get(a) is None:
            d[a] = 0
        if d.get(b) is None:
            d[b] = 0
        d[a] += 1
        d[b] += 1
st1 = set()
d1 = {}
check = 0
for i in st:
    st1.add(d[i])
    check += 1
    if d1.get(d[i]) is None:
        d1[d[i]] = 0
    d1[d[i]] += 1
ans1 = []
iii = 0
for i in st1:

    ans1.append([i, d1[i]])
    sheet[f'A{iii + 2}'] = i
    sheet[f'B{iii + 2}'] = d1[i]
    iii += 1
ans1 = sorted(ans1)
print(ans1)

wb.save('t1.xlsx')

The constructed table stores two columns. The first has the number of links and the second the number of nodes with that number of links. A graph has been plotted using this data. image

It is hard to speculate why the distribution is the way it is. It is especially strange that there are so many nodes with so few connections. In my opinion it can be explained by the fact that people interested in our school sign up for groups. But the interest rarely develops into the fact that a person or his/her children come here and they forget to unsubscribe from the group. It is worth remembering that the first point is missing on the graph, for 0 connections, where there are about 2500 nodes, which confirms the above theory In the same way a study of the age dependence of the participants in the group was conducted. The method of forming the table is the same. Here is the resulting graph of dependence. It is clear, that it is not exact, as many specify incorrect age (more than 100 years), but in the average dependence makes sense. image

This concludes the study. Thank you for your attention.

social_graph's People

Contributors

mirong1707 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.