
reddit-detective's Introduction

👋 Hi, I'm Umit


Interests: Machine learning (time-series and graphs), Explainable AI, Social networks, Venture capital

Currently: Learning Lisp & CUDA

⭐ My favorite projects

🕵️ reddit-detective: Detect political disinformation campaigns, discover how ideas spread between communities, find "cyborg-like" activities carried out by bots, and more on Reddit

🏰 Jomini: What if the Byzantines had more soldiers in 1453? You can model this and many other historical battles

🐤 TIA: An advanced Twitter stalking/analysis tool powered by machine learning.

💰 Trying a New Fraud Detection Approach for Trust Networks: While trying to detect fraud rings in the bitcoin-otc network, I came up with an individual fraud detection approach that outperforms 9 of 10 well-known network-based fraud detection algorithms on this problem. (At least for this data set; I'll try it on different datasets and tune the models when I have time.)

reddit-detective's People

Contributors

latueur, umitkaanusta

reddit-detective's Issues

Suspended users throw an error

If the author of a Comment or Submission is suspended and is queried with self.resp.author.name or self.resp.author.id, reddit-detective throws an error and stops the network creation (or whatever operation is in progress).
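
One possible guard is sketched below. It assumes the failure surfaces as an AttributeError on the author object, or that author is None (which is how PRAW reports deleted accounts). The helper author_field, the PLACEHOLDER value, and the stub objects are hypothetical and only stand in for real PRAW responses.

```python
from types import SimpleNamespace

PLACEHOLDER = "[unavailable]"  # hypothetical marker for missing author data


def author_field(item, field, default=PLACEHOLDER):
    """Return item.author.<field>, or `default` when it cannot be read."""
    author = getattr(item, "author", None)
    if author is None:
        return default
    # getattr with a default swallows the AttributeError raised
    # when the attribute is missing on the author object.
    return getattr(author, field, default)


# Stand-ins for PRAW objects (not real API responses)
active = SimpleNamespace(author=SimpleNamespace(name="some_user", id="abc12"))
suspended = SimpleNamespace(author=SimpleNamespace(name="gone_user"))  # no `id`
deleted = SimpleNamespace(author=None)

print(author_field(active, "id"))
print(author_field(suspended, "id"))
print(author_field(deleted, "name"))
```

With a guard like this, network creation could continue and attach the placeholder to the node instead of aborting.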

Add static code analysis

Will be using SonarCloud to track duplications, security vulnerabilities, and technical debt at the code level to test production readiness.

"CommentsReplies" does not work with Submissions with a high number of comments

Input:

# necessary imports are done, praw integration works fine
# no problems on the DB side
net = RedditNetwork(
        driver=driver,
        components=[
            CommentsReplies(Submission(api, "li5sts", limit=5))
        ]
    )
# constraints are already created
net.run_cypher_code()

Output:

Traceback (most recent call last):
  File "C:/Users/Ümit/PycharmProjects/rdtest/aaa.py", line 32, in <module>
    net.run_cypher_code()
  File "C:\Users\Ümit\PycharmProjects\rdtest\venv\lib\site-packages\reddit_detective\network.py", line 95, in run_cypher_code
    self._run_query(codes=self._codes())
  File "C:\Users\Ümit\PycharmProjects\rdtest\venv\lib\site-packages\reddit_detective\network.py", line 83, in _codes
    codes = list(chain.from_iterable([point.code() for point in self.components]))
  File "C:\Users\Ümit\PycharmProjects\rdtest\venv\lib\site-packages\reddit_detective\network.py", line 83, in <listcomp>
    codes = list(chain.from_iterable([point.code() for point in self.components]))
  File "C:\Users\Ümit\PycharmProjects\rdtest\venv\lib\site-packages\reddit_detective\relationships.py", line 154, in code
    comment_merges, comment_links, submissions = self._merge_and_link_comments(self.comments())
  File "C:\Users\Ümit\PycharmProjects\rdtest\venv\lib\site-packages\reddit_detective\relationships.py", line 196, in comments
    comment.refresh()
AttributeError: 'MoreComments' object has no attribute 'refresh'
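
The traceback shows refresh() being called on a MoreComments placeholder, which PRAW leaves in a comment forest wherever replies are collapsed. The sketch below shows one possible guard using duck typing. The stub classes and the helper refreshable_comments are hypothetical, and nothing here reflects how reddit-detective actually builds its comment list.

```python
class FakeComment:
    """Stand-in for praw.models.Comment."""

    def __init__(self, id_):
        self.id = id_
        self.refreshed = False

    def refresh(self):
        self.refreshed = True
        return self


class FakeMoreComments:
    """Stand-in for praw.models.MoreComments: it has no refresh()."""

    def __init__(self, count):
        self.count = count


def refreshable_comments(items):
    """Yield only items that can be refreshed, skipping placeholders."""
    for item in items:
        if callable(getattr(item, "refresh", None)):
            yield item


forest = [FakeComment("c1"), FakeMoreComments(120), FakeComment("c2")]
kept = [c.refresh().id for c in refreshable_comments(forest)]
print(kept)
```

With real PRAW objects, calling submission.comments.replace_more(limit=0) before iterating removes the placeholders from the forest, at the cost of dropping the collapsed replies; a higher limit loads more of them but costs extra requests.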

Relationships create duplicate code

Assume that K comments are under the same submission.
During the code generation, the code to create a node for the given submission is generated K times.

Shortcomings of the current solution:

  • Forces the user to create a network with a degree and then generate the code, to get a code with no duplicates.

Add Implementation of Analytics functions for NetworkX objects as well

There might be cases where one wants to keep interaction with Neo4j to a minimum, or does not use Neo4j at all. So, the analytics functions should work for NetworkX objects as well. (NetworkX is chosen since it is one of the most popular and easiest-to-use graph data science libraries.)

Add CI

After fixing #22, add CI using GitHub Actions

Add feature: Convert tabular Reddit data to Neo4j Graph

On-the-spot scraping scales poorly because of the delays caused by Reddit API calls. RD should be usable by people who want to analyze large datasets. Given any tabular data and a description of its columns from the user, we should be able to convert it to a Neo4j Graph.
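
A minimal sketch of the idea, assuming the user's description is a mapping from query parameters to column names. The column names, the Comment label, and the helper rows_to_statements are assumptions made for illustration; they are not part of reddit-detective.

```python
import csv
import io

# Hypothetical user-supplied description: query parameter -> column name
MAPPING = {"username": "author", "comment_id": "id", "text": "body"}

# Parameterized Cypher, so values never need manual escaping
QUERY = (
    "MERGE (r:Redditor {username: $username}) "
    "MERGE (c:Comment {id: $comment_id}) "
    "SET c.text = $text "
    "MERGE (r)-[:AUTHORED]->(c)"
)


def rows_to_statements(table, mapping):
    """Yield one (query, parameters) pair per row of a CSV file object."""
    for row in csv.DictReader(table):
        yield QUERY, {param: row[column] for param, column in mapping.items()}


# Made-up sample data standing in for a Reddit dump
sample = io.StringIO(
    "id,author,body\n"
    'c1,alice,"Hello, world"\n'
    "c2,bob,Second comment\n"
)
statements = list(rows_to_statements(sample, MAPPING))
for query, params in statements:
    print(params)
```

Each pair could then be handed to the Neo4j Python driver with session.run(query, params), which avoids any call to the Reddit API.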

v0.1.2 Produces Incomplete Graphs

It seems that the commits meant to reduce redundancy also led to incomplete graphs. The next version will roll them back; until then, please do not use v0.1.2. (I will also add this notice to the docs.)

Reduce running times by minimizing the number of calls to the Reddit API

Problem

Currently, the main reason for high running times is the rate limit of the Reddit API. So we should minimize the number of calls to reduce running times.

Validation of the Problem

Test code:

# Assume api_ is declared and Neo4j is started
net = RedditNetwork(
        driver=driver_,
        components=[
            Comments(Redditor(api_, "Anub_Rekhan", limit=5))
        ]
    )
net.run_cypher_code()

The rate-limit evasion function in PRAW is called 59 times by the code above.

The code creates 18 nodes and 22 relationships at the moment. 59 calls is simply too many for such a small graph.

cProfile results: [call graph figure not included]

Why are there so many requests?

TL;DR: The current version sends more requests than necessary.

Num of calls - Redditor object itself: 1

Num of calls - Getting the complete list of comments: 5 (2 items in the list of 11 comments are duplicates. That'll be fixed)

Num of calls - Merge and Link Comments: 33; At each iteration: 3 (1 Comment, 1 Author, 1 Submission). 3 * 11 = 33

Num of calls - Merge and Link Submissions: 22; At each iteration: 2 (1 Subreddit, 1 Author). 2 * 11 = 22 (There should be 2 submissions, not 11. That'll be fixed)

The total number of calls is 61 (the same as the number in the Call Graph)

What to do?

  • "Merge code" generation shouldn't require a data_models object. Maybe abstract? (Affects 55 calls in the example above)
  • Find other ways of getting the complete list of comments (Affects 5 calls in the example above)
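
As a complement to the ideas above, repeated lookups can also be memoized so that each distinct object costs one request. This is a different technique from abstracting the merge code, and the sketch below only simulates it: fetch, the api_calls counter, and all ids are made up, and no real Reddit API call is involved.

```python
from functools import lru_cache

api_calls = 0  # counts simulated requests


@lru_cache(maxsize=None)
def fetch(kind, id_):
    """Simulate one API request per distinct (kind, id) pair."""
    global api_calls
    api_calls += 1
    return {"kind": kind, "id": id_}


# 11 comments by one author under 2 submissions (ids are made up)
comments = [(f"c{i}", "author1", "s1" if i < 6 else "s2") for i in range(11)]

for comment_id, author_id, submission_id in comments:
    fetch("comment", comment_id)
    fetch("redditor", author_id)
    fetch("submission", submission_id)

print(api_calls)  # 14: 11 comments + 1 author + 2 submissions
```

Without the cache, the same loop would issue 3 * 11 = 33 simulated requests, which matches the per-iteration count reported for "Merge and Link Comments".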

Creating a Documentation - v0.1.1

Needed:

  • YML files to integrate with readthedocs.io and MkDocs
  • Documents for v0.1.1

Before moving on to development/code reviews, I'm going to create docs of the codebase for potential users and potential contributors.

Any tips are appreciated at this point, especially from experienced and/or detail-oriented people.

I'll use readthedocs.io and build the docs with MkDocs, using Markdown files.

For v0.1.1, the docs will be organized in this way:

docs/
    index.md (will be the same as ../README.md)
    analytics/
        metrics.md
    data_models.md
    relationships.md
    network.md

CONTRIBUTING.md missing

Will create a markdown file including:

  • Resources for people interested in contributing (PRAW documentation, Neo4j tutorials, introductory material to Social Network Analysis, etc.)
  • Issue standards
  • PR guidelines

P.S. This is the first time I've received an outside contribution to an open-source project of mine. Thank you, Lajos!

Relationships generate duplicate Cypher code

In the lines linked below, reddit-detective removes duplicate code fragments from the cache while preserving the order of its elements. However, the Cypher code generated by the .code() method still contains duplicates.

https://github.com/umitkaanusta/reddit-detective/blob/main/reddit_detective/network.py#L79-L86

def _codes(self):
    """
    Get codes for every component
    """
    codes = list(chain.from_iterable([point.code() for point in self.components]))
    # Remove duplicates without changing order
    codes = sorted(set(codes), key=lambda x: codes.index(x))
    return codes
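
As a side note on the idiom above: sorted(set(codes), key=codes.index) rescans the list for every element, so it is quadratic in the number of fragments. The sketch below shows an equivalent order-preserving alternative that relies on dicts keeping insertion order (guaranteed since Python 3.7). The helper name dedupe_keep_order and the sample fragments are made up, and this only deduplicates the combined list; it does not address the duplicates produced inside .code().

```python
def dedupe_keep_order(codes):
    """Drop repeated fragments, keeping the first occurrence of each."""
    return list(dict.fromkeys(codes))


# Made-up fragments standing in for generated Cypher code
codes = ["MERGE (a)", "MERGE (b)", "MERGE (a)", "MERGE (c)", "MERGE (b)"]

deduped = dedupe_keep_order(codes)
print(deduped)  # ['MERGE (a)', 'MERGE (b)', 'MERGE (c)']

# Same result as the idiom used in _codes()
print(deduped == sorted(set(codes), key=codes.index))  # True
```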


Given test case (the debugger shows the length of the list codes before and after the operation above):

# the necessary imports are done 
obj = CommentsReplies(Submission(api_, "jpt7s7", limit=None))
net = RedditNetwork(
    driver=driver_,
    components=[
        obj
    ]
)
net.run_cypher_code()

# 83 nodes
# 101 connections
# length of "codes": 332
# length of "codes" after the operation above: 184


The code used to get the duplicates:

# the continuation of the code above
# the variables obj and net are the ones created above
from pprint import pprint
from collections import Counter
L1 = obj.code()
L2 = list(net._codes())
# Fragments that appear more often in obj.code() than in the deduplicated list
pprint(Counter(L1) - Counter(L2))


Duplicates:

Counter({'MERGE (:Submission {id: "jpt7s7", created_utc: 1604764778.0, title: "Play detective on Reddit: Discover political trolls, secret influencers and more", text: ""});': 49,
         '\nMATCH (n1 {id: "jpt7s7"})\nMATCH (n2 {id: "2qh0y"})\nWITH n1, n2\nMERGE ((n1)-[:UNDER {}]->(n2));\n': 49,
         '\nMATCH (n1 {id: "d1nfvrn"})\nMATCH (n2 {id: "jpt7s7"})\nWITH n1, n2\nMERGE ((n1)-[:AUTHORED {}]->(n2));\n': 49,
         'MERGE (:Redditor {id: "d1nfvrn", username: "Anub_Rekhan", created_utc: 1504794766.0, has_verified_email: "True"});': 1})
