
BehaviorBounty

Humans, for the most part, behave and act as economic agents driven by primordial incentives or by more sophisticated reward schemes. Actions and behaviours carried out in Internet-based contexts (forums, social media, etc.) are not exempt from this biological truth. This is why social media platforms and forums soon understood that implementing features such as likes, or some other reward scheme, could improve customer retention and interaction by orders of magnitude.

This project investigates the most rewarding behaviours for users interacting in the online forum Stacker News, an unconventional Internet-based forum where likes are replaced by zaps: bitcoin microtransactions.

More details about the project can be found in the attached paper.

Co-author: Alberto Bersan

Reproduce the environment for the analysis

Important: as of June 2024, the Stacker News forum has implemented several new features and given users the option to hide some information on their profiles. These changes could generate inconsistencies between the results reported in the paper and the current forum landscape. If you need to reproduce the analysis exactly as carried out by the authors, please get in touch with me; my contacts are listed on my personal website.

To reproduce the environment used for the research, the following steps are suggested.

  1. Clone this repository locally (or download the zipped folder);
  2. Unzip the folder to a custom path;
  3. Navigate to the unzipped folder and run the following commands to create a Python environment, activate it and install the requirements.

The '$' symbol indicates a new shell prompt line

$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt

At this point all the necessary Python packages are installed locally in the environment. The scraping process is broken down into three steps:

  1. Setup the database folder and a new sqlite database;
  2. Scrape the items of the forum;
  3. Scrape the user profiles (the profiles crawled are those of users that appeared at least once in the previous step).

$ python python/setupDB.py         # Set up the SQLite database
$ python python/scraping_items.py  # Scrape forum items
$ python python/scraping_users.py  # Scrape user profiles
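
For reference, here is a minimal sketch of what the setup step plausibly does, assuming the four table names listed in the Data section below; the column layouts are hypothetical, not the actual schema used by setupDB.py.

import os
import sqlite3

# Hypothetical sketch: create data/stacker_news.sqlite with the four tables
# described in the Data section. Column layouts are illustrative only.
os.makedirs("data", exist_ok=True)
con = sqlite3.connect("data/stacker_news.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS post (item_code TEXT PRIMARY KEY, title TEXT, username TEXT, stacked INTEGER, timestamp TEXT)")
con.execute("CREATE TABLE IF NOT EXISTS comments (item_code TEXT PRIMARY KEY, username TEXT, stacked INTEGER, timestamp TEXT)")
con.execute("CREATE TABLE IF NOT EXISTS user (username TEXT PRIMARY KEY, stacked INTEGER)")
con.execute("CREATE TABLE IF NOT EXISTS exceptions (item_code TEXT, error TEXT)")
con.commit()
con.close()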

R packages

It is suggested to run the R scripts with the RStudio software, opening the folder as an R project (via the stacker_news.Rproj file). At the start of every .R script, a function checks whether the needed packages are installed: if not, it proceeds to install them; if they are installed, they are loaded into the environment.

Alternative installation of R packages

To sync all the packages and R requirements, it is also possible to use the renv tools provided by RStudio. Open the project file with RStudio, navigate to the tools settings and open the project options. There, navigate to the environments section and activate the setting Use renv for this project.

The R session will restart. Then, navigate to the console and type the following command:

renv::init()

This command will ask how to manage renv; select the option to restore the project from the lockfile. RStudio will then proceed to install all the needed R packages.

These steps reproduce the environment and dataset used to produce this research.

Project structure and customization

Python code

The functions and parameters used for the web scraping are located in different scripts and are freely customizable. To change the number of items to retrieve, or the exact range, edit python/scraping_items.py:62.
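
That line presumably defines the range of item codes to crawl; a hypothetical example of the kind of parameter to look for (the actual variable name may differ):

# python/scraping_items.py, around line 62 (hypothetical variable name)
ITEM_RANGE = range(1, 200000)  # item codes to crawl; shrink this for a test run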

python
├── comment.py
├── discussion.py
├── __init__.py
├── item.py
├── link.py
├── scraping_items.py
├── scraping_users.py
├── setupDB.py
└── user.py

R code

The structure of the R scripts mirrors the paper chapters. The overview folder contains the data_cleaning.R script (which transforms the data and saves the RDS files) and summary_tables.R, with the code used for the initial data exploration. The directed folder contains all the code used for the social network analysis: directed_general.R reproduces the general graph section, while the numbered scripts correspond to the five periods analysed to build the final table of the paper.

R
├── directed
│ ├── directed_general.R
│ ├── fifth.R
│ ├── first.R
│ ├── fourth.R
│ ├── second.R
│ └── third.R
└── overview
    ├── data_cleaning.R
    └── summary_tables.R

Data

Data are contained in a single SQLite database file inside the data folder. The database contains four tables:

stacker_news.sqlite
├── comments            # All the 'comment' items
├── post                # All the 'post' items
├── user                # All the user profiles
└── exceptions          # Exceptions and errors that occurred during the scraping process

Every script interacting with the data at its source looks for the database file in the data/ path.

The setupDB.py script completely wipes the stacker_news.sqlite file. Remember to back up the 'stacker_news.sqlite' file before running any Python script.
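
A minimal backup sketch in Python (paths as described above):

import shutil
import time

# copy the database aside with a timestamp before re-running setupDB.py
stamp = time.strftime("%Y%m%d-%H%M%S")
shutil.copyfile("data/stacker_news.sqlite", f"data/stacker_news-{stamp}.sqlite.bak")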

RDS files

To simplify the data processing and analysis conducted in R, the data used for the analysis are saved in .RDS form and are available in the RDS_files folder in the main directory of the project.

RDS_files
├── c_fifth_period
├── c_first_period
├── c_fourth_period
├── comments
├── c_second_period
├── c_third_period
├── p_fifth_period
├── p_first_period
├── p_fourth_period
├── posts
├── p_second_period
├── p_third_period
└── users

The posts, comments and users files are copies of the respective data.table objects. Files starting with 'c' correspond to data.table objects derived from the comments table (partitioned into periods); files starting with 'p' refer to the posts table (partitioned into periods).

Images

The execution of the R scripts generates some plot images, used for exploratory analysis. The images will be generated inside an images/ folder.

I'm Using GitHub Under Protest

This project is currently hosted on GitHub. This is not ideal; GitHub is a proprietary, trade-secret system that is not Free and Open Source Software (FOSS). I urge you to read about the Give up GitHub campaign from the Software Freedom Conservancy to understand some of the reasons why GitHub is not a good place to host FOSS projects.

Any use of this project's code by GitHub Copilot, past or present, is done without our permission. I do not consent to GitHub's use of this project's code in Copilot.



Issues

Management of missing records

I propose to set up another DB table to save all the entries that somehow trigger the general except statement.
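
A sketch of what this could look like inside the scraping loop (table layout and helper names hypothetical):

import sqlite3

def scrape_item(code):
    ...  # placeholder for the real scraping logic

con = sqlite3.connect("data/stacker_news.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS exceptions (item_code TEXT, error TEXT)")
for code in range(1, 1000):  # hypothetical range of item codes
    try:
        scrape_item(code)
    except Exception as err:
        # save the entry that triggered the general except instead of losing it
        con.execute("INSERT INTO exceptions VALUES (?, ?)", (str(code), repr(err)))
        con.commit()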

Similar post item structure

Observed behaviour

The following items

  • bounty
  • poll
  • discussion

all have the same structure, meaning that we could handle the scraping with the same function.
Obviously this depends on the specific data we think we need from each type of item: if we decide to extract other type-specific values, we should keep the functions as separate processes.
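
A sketch of the idea: one parser shared by the three types, with the type kept as a label for possible later splits (helper body elided):

def parse_standard_post(page, item_type):
    # bounty, poll and discussion items share the same page structure,
    # so a single parser can serve all three
    assert item_type in {"bounty", "poll", "discussion"}
    ...  # extract title, banner, comments as in the cookbook below

PARSERS = {t: parse_standard_post for t in ("bounty", "poll", "discussion")}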

Link detection

Observed behaviour:

Regarding link items, some of them have the main link placed in a different HTML position (different tag/class/target) from the standard behaviour observed.
It seems that the link formatting changed over time: at the beginning it was different from now.
In any case, the scraping of links in link items is structured in such a way that:

  • main link is a string or None, depending on whether the main link is present in the header of the post. This is the standard behaviour.
  • body links is a list of links, namely the links contained in the comments. It also includes tagged users, which can be removed later. For link items formatted differently, this list is populated with at least one link, which is the main link.

The conclusion is that if an item is classified as a link item, then its main link is either in the main link field or, if that field is empty, the first link in the body links field.
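
That rule translates to a small helper (field names as in the description above):

def resolve_main_link(main_link, body_links):
    # a link item's main link is either in the dedicated field (standard
    # formatting) or, for differently formatted items, the first body link
    if main_link is not None:
        return main_link
    return body_links[0] if body_links else None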

Structure not clear

Why are you saving link, discussion, etc. in different datasets? Would it be better to have just a factor column representing the type of post?

Cookbook: scrape item data

  • Retrieve item webpage provided the item code

  • Detect item type: comment or post

    • If post, which kind of post:
      1. Discussion
      2. Link
      3. Poll
      4. Bounty
      5. Job
  • Retrieve title

  • Retrieve banner

    • Extract number of comments, compulsory
    • Extract stacked amount by the item, if present
    • Extract Boost value, if present
    • Extract username, compulsory
    • Extract timestamp, compulsory
    • Extract badge, compulsory
  • Extract amount stacked by comments, compulsory

  • Extract item code of comments OR extract user that commented
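
The fields above could be collected into a single record; a hypothetical container (names illustrative):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ScrapedItem:
    item_code: str
    item_type: str                   # comment, discussion, link, poll, bounty, job
    title: Optional[str] = None
    n_comments: int = 0              # compulsory banner field
    stacked: Optional[int] = None    # stacked amount, if present
    boost: Optional[int] = None      # boost value, if present
    username: str = ""               # compulsory
    timestamp: str = ""              # compulsory
    badge: Optional[str] = None      # compulsory
    comment_codes: list = field(default_factory=list)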

Forum moderators and administrators

Suggestion for data visualization

Since the forum has some administrators and moderators who are known (they are the creators), it could be useful to label those users differently. It could be interesting to observe their posting behaviour in relation to other users: whether they create most of the post flow or whether their role in the forum is marginal (from a content-creation perspective).

Social Network Analysis parameters

Social Network Analysis parameters to be calculated for every period

Degree

  • Node degree (may be calculated for the most important nodes)
  • Average degree
  • Degree distribution

Components observations

  • Visualize components
  • Consideration about giant components

Path

  • Diameter (=largest distance recorded between any pair of nodes in the network)
  • Average path length
  • Average degree of separation (small world phenomenon)

Clustering and partitioning

  • Average clustering coefficient
  • Embeddedness
  • Betweenness (The Girvan-Newman Method)
  • Homophily(?)
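
A sketch of the degree, path and clustering measures with networkx, assuming a directed edge list (commenter -> post author) built from the scraped tables; the edges here are toy data:

import networkx as nx

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")]  # toy (commenter, author) pairs
G = nx.DiGraph(edges)

avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
degree_dist = nx.degree_histogram(G)
avg_clustering = nx.average_clustering(G)
# diameter and average path length are computed on the largest weakly
# connected component, since the full graph may be disconnected
wcc = G.subgraph(max(nx.weakly_connected_components(G), key=len)).to_undirected()
diameter = nx.diameter(wcc)
avg_path_length = nx.average_shortest_path_length(wcc)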

Missing vertex users in the corresponding user table

I noticed this problem while working on the creation of the graph.
We may also need the role of each user in the forum, because some users are contributors (developers of the platform) and they seem to be the ones interacting the most. While trying to add the role as a graph attribute, I noticed that the number of vertices in the graph (users) differs from the number of users in the user table. Moreover, an inner join between the tables shows that ~300 users are in the vertex list but not in the user table. This could happen because, for some reason, they could not be retrieved during scraping.

To deal with this problem we could:

  1. Use for the graph only the users that have an entry in the user table;
    OR
  2. Use for the graph also the users that have a vertex but no corresponding entry in the user table.

The consequence of 1 is that I am left with 5208 users; the consequence of 2 is that I have no anagraphic information for ~300 users, nor their Role or StackedAmount (even though the role can be derived easily).

The reason why ~300 users are not retrieved during scraping is still unknown.
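
The mismatch itself is easy to isolate with an anti-join; the project does this in R, but the set logic is the same in a pandas sketch (toy data, hypothetical column names):

import pandas as pd

vertices = pd.Series(["alice", "bob", "carol"], name="name")            # graph vertex names
users = pd.DataFrame({"name": ["alice", "bob"], "stacked": [100, 50]})  # user table

# vertex users with no matching profile (the ~300 users of option 2)
missing = vertices[~vertices.isin(users["name"])]
print(missing.tolist())  # ['carol']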

Analysis of centrality: important steps

  • Node degree (quantify node connectivity - local measure)
  • Eigen centrality (takes into account neighbors connectivity - global measure)
  • PageRank
  • Closeness (easy access to all nodes) -> What if it is not so important for a user to have many direct friends, but one wants to be in the "middle" of things, not too far from the center? It measures the average path length.
  • Betweenness (quantify node importance in network flow)
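
All of these are available in networkx; a sketch on the same toy directed graph as above:

import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")])

degree = dict(G.degree())                            # local connectivity
eigen = nx.eigenvector_centrality(G, max_iter=1000)  # neighbours' connectivity
pagerank = nx.pagerank(G)
closeness = nx.closeness_centrality(G)               # average distance to all nodes
betweenness = nx.betweenness_centrality(G)           # importance in network flow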

Adjusting stacked amounts

Problems

  1. While building the general directed graph I discovered that a bunch of high-earning users (according to the cumulative stacked amount computed from the comments and posts tables) are in fact 'banned' users, because they faked their stacked amount.
  2. Users can earn sats even through 'forwarding' actions. That is, user X creates a post and decides that a custom percentage of the earnings from that post is forwarded to user Y. These amounts are not included in the post's 'stacked amount', but they can be captured by looking at the difference between the stacked amount in the users table and the one resulting from the cumulative computation.
  3. The stacked amounts on the users' profiles are the result of a plain sum, therefore a more general criterion for evaluating the stacked amount should be used:
comments + posts + received (from forwarding) + forum daily rewards

Background

Jailed users are still forum members, therefore they should still be included in the research. However, by cross-validating the two available stacked amounts it is possible to isolate these users, as well as the users that received substantial amounts via forwarding or in the form of platform rewards.
Platform rewards are distributed in such a way that the more active the user, the larger the daily reward.

Every day, the stackers who created the top 21% of posts and comments from the previous day will receive extra sats as a reward. The extra rewards depend on how popular the content you created was as determined by other stackers. (from FAQ)

The daily reward model is based on a Web of Trust (look here for more info). The algorithm enforces a rule whereby every user is assigned a score between 0 and 0.9, based on how much trust the other users gave them. Trust is given by liking one's content, in such a way that the first satoshi zapped is relevant but the total zapped is not (in other words, the act of liking, aka zapping, one's content is the actual trust vote, not the quantity zapped).
One's score acts as a weight for the amounts he/she zaps to others' posts: the higher the score, the more impactful the user's zaps. The more weighted zaps a post collects, the higher it ranks on the homepage and hence the more likely it is to be in the top 21%.
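
In rough pseudocode, the described weighting could look like this; a loose sketch of the stated rule, not the actual Stacker News algorithm:

def post_ranking_weight(zappers, trust):
    # each distinct zapper counts once (only the first sat matters),
    # scaled by their trust score in [0, 0.9]
    return sum(trust.get(user, 0.0) for user in set(zappers))

trust = {"alice": 0.9, "bob": 0.5}
print(post_ranking_weight(["alice", "bob", "bob", "mallory"], trust))  # 1.4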

This process is crucial for the research: the more rewards collected, the more the user is certified as a good forum user, and the more he/she earns from his/her online activity. Therefore, looking also at the total rewards collected is crucial.

Possible solution

At this point, in order to answer the main question it is necessary to include as a node attribute also the stacked amount from the users table, that is, from the scraped user profiles containing the aggregated sum of the different stacked amounts, as in the formula highlighted previously.
Since the values cannot be split, we could consider three different columns (to be added as node attributes):

  1. Stacked amount from posts+comments
  2. Stacked amount from profile
  3. The difference 2 - 1, which is in fact the rewards plus the amounts received via forwarding

These values should help to isolate the most rewarding behaviour.
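
A sketch of the three node attributes with pandas (toy numbers, hypothetical column names):

import pandas as pd

users = pd.DataFrame({
    "name": ["alice", "bob"],
    "stacked_items": [1200, 450],    # 1. cumulative sum over posts + comments
    "stacked_profile": [1500, 400],  # 2. scraped from the user profile
})
# 3. difference = rewards + amounts received via forwarding;
# negative values can flag anomalies such as users faking their amounts
users["stacked_other"] = users["stacked_profile"] - users["stacked_items"]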

Extraction of post stats

Extraction of post stats is more efficient when using regular expressions that look for unique patterns.
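
For example, with a hypothetical banner string (the real markup may differ):

import re

banner = r"421 sats \ 12 comments \ @satoshi 11 Jan"
sats = int(re.search(r"(\d+)\s+sats", banner).group(1))            # 421
n_comments = int(re.search(r"(\d+)\s+comments", banner).group(1))  # 12
username = re.search(r"@(\w+)", banner).group(1)                   # 'satoshi'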

  • Extract banner in a function
  • Extract number of comments starting from the get_banner output
  • Extract number of sats stacked starting from the get_banner output
  • Extract boost number starting from get_banner output
  • Extract timestamp using the get_post_timestamp function starting from get_banner data
  • Extract sats stacked by comments (?)

Leading question(s)

Investigating the most rewarding behaviour in an economic-reward-based online forum

Anomalies

Investigate the small clusters:

  • Who are the users
  • What did they post
  • Why did they earn so much (if so)
