Backlink checker

Backlink checker is a simple tool that checks backlink quality, identifies problematic backlinks, and posts them to a specified Slack channel.

The tool tries to reach each backlink, a page that is supposed to link to a referent page, and checks whether it actually does. It retrieves the backlink's HTML and looks for elements that indicate the quality of the backlink, such as noindex and nofollow markers.

Packages Required

The first step is to prepare the environment. The backlink checker is written in Python. Two of the most common Python packages for building web crawling tools are Requests and Beautiful Soup 4, a library for pulling data out of HTML. Also make sure the pandas package is installed, as it is used for some simple data wrangling.

These packages can be installed using the pip install command.

pip install beautifulsoup4 requests pandas

This installs all three packages.

Important: Note that version 4 of BeautifulSoup is being installed here. Earlier versions are now obsolete.

Checking backlinks

The script scrapes backlink websites and checks for several backlink quality signs:

  • whether the backlink is reachable
  • whether the backlink contains a noindex element
  • whether the backlink contains a link to the referent page
  • whether the link to the referent page is marked as nofollow

STEP 1: Check if backlink is reachable

The first step is to try to reach the backlink. This can be done using the Requests library's get() method.

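try:
    resp = requests.get(
        backlink,
        allow_redirects=True
    )
except requests.RequestException:
    # Connection errors, timeouts, and invalid URLs all end up here.
    return ("Backlink not reachable", "None")

# Any status other than 200 OK is treated as unreachable.
response_code = resp.status_code
if response_code != 200:
    return ("Backlink not reachable", response_code)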

If the request returns any status other than 200 (such as 404 Not Found), or if the backlink cannot be reached at all, the backlink is assigned the Backlink not reachable status.
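Requests does not apply a timeout unless one is passed, so a backlink server that never responds would stall the whole run. A minimal sketch of the same reachability check with a timeout follows; the function name check_reachable and the 10-second value are assumptions made for illustration, not part of the original script.

```python
import requests

NOT_REACHABLE = "Backlink not reachable"


def check_reachable(backlink, timeout=10):
    """Return (response, None) on HTTP 200, otherwise (None, (status, code)).

    check_reachable and the 10-second default are illustrative assumptions.
    """
    try:
        resp = requests.get(backlink, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        # Covers connection errors, timeouts, and malformed URLs.
        return None, (NOT_REACHABLE, "None")
    if resp.status_code != 200:
        return None, (NOT_REACHABLE, resp.status_code)
    return resp, None
```

Returning the failure tuple separately lets the caller pass it straight through as the backlink's status.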

STEP 2: Check if backlink HTML has noindex element

To navigate the HTML of a backlink, a BeautifulSoup object needs to be created.

bsObj = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)

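If lxml is not installed yet, run pip install lxml. The encoding variable is assumed to hold the page's character encoding, determined earlier in the complete code; when from_encoding is None, Beautiful Soup detects the encoding on its own.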

Beautiful Soup's find_all() method can be used to find <meta> tags whose content attribute contains noindex. If any are found, the backlink is assigned the Noindex status.

if len(bsObj.find_all('meta', content=re.compile("noindex"))) > 0:
    return ('Noindex', response_code)
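The noindex check can be tried on inline HTML without fetching any page. The sketch below is an illustration under assumptions: the helper name has_noindex and the sample markup are made up, the match is made case-insensitive, and Python's built-in html.parser is used in place of lxml so that nothing extra has to be installed.

```python
import re

from bs4 import BeautifulSoup


def has_noindex(html):
    """Return True if any <meta> tag's content attribute mentions noindex."""
    # html.parser ships with Python; the script above uses lxml instead.
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile("noindex", re.IGNORECASE)
    return bool(soup.find_all("meta", content=pattern))


blocked = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
indexable = '<html><head><meta name="description" content="A page"></head></html>'

print(has_noindex(blocked))    # True
print(has_noindex(indexable))  # False
```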

STEP 3: Check if backlink HTML contains a link to a referent page

Next, check whether the HTML contains an anchor tag (<a>) whose href points to the referent link. If no such link is found, the backlink is assigned the Link was not found status.

# re.escape() keeps characters such as "." and "?" in the URL literal.
elements = bsObj.find_all('a', href=re.compile(re.escape(our_link)))
if not elements:
    return ('Link was not found', response_code)
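One caveat when a URL is used as a regular expression: characters such as ? and . have special meaning in a pattern, so a referent link with a query string may fail to match itself. The sketch below uses a made-up URL to show the effect and how re.escape() avoids it.

```python
import re

# Hypothetical referent link and href value, used only for illustration.
our_link = "example.com/page?id=1"
href = "https://example.com/page?id=1"

# Unescaped, "e?" means "an optional e", so the literal "?" is never matched.
print(re.search(our_link, href) is None)  # True

# re.escape() turns every special character into a literal one.
print(re.search(re.escape(our_link), href) is not None)  # True
```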

STEP 4: Check if the link to the referent page is marked as nofollow

Finally, check whether the HTML element that contains the link to the referent page carries a nofollow value. This value is found in the rel attribute.

# Inspect the first matching anchor; a missing rel attribute counts as dofollow.
element = elements[0]
if 'nofollow' in element.get('rel', []):
    return ('Link found, nofollow', response_code)
return ('Link found, dofollow', response_code)

Based on the result, let's assign either Link found, nofollow or Link found, dofollow status.
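Beautiful Soup treats rel as a multi-valued attribute and returns it as a list, and a link without rel has no such key at all. The following sketch runs the check on two inline HTML snippets; the helper name link_status, the sample URLs, and the use of html.parser are assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup


def link_status(html, our_link):
    """Classify the first anchor whose href contains our_link."""
    soup = BeautifulSoup(html, "html.parser")
    elements = soup.find_all("a", href=re.compile(re.escape(our_link)))
    if not elements:
        return "Link was not found"
    # rel is parsed into a list, e.g. ['nofollow', 'noopener'];
    # .get() supplies an empty list when the attribute is absent.
    if "nofollow" in elements[0].get("rel", []):
        return "Link found, nofollow"
    return "Link found, dofollow"


nofollow_html = '<a href="https://example.com/" rel="nofollow noopener">x</a>'
dofollow_html = '<a href="https://example.com/">x</a>'

print(link_status(nofollow_html, "example.com"))  # Link found, nofollow
print(link_status(dofollow_html, "example.com"))  # Link found, dofollow
print(link_status(dofollow_html, "other.org"))    # Link was not found
```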

Assigning results to Pandas DataFrame

After getting the status for each backlink and referent link pair, collect this information (along with the backlink's response code) in a pandas DataFrame.

# DataFrame.append() was removed in pandas 2.0, so gather rows in a list first.
records = []
for backlink, referent_link in zip(backlinks_list, referent_links_list):
    status, response_code = get_page(backlink, referent_link)
    records.append([backlink, status, response_code])
df = pd.DataFrame(records, columns=['Backlink', 'Status', 'Response code'])

The get_page() function wraps the four-step process described above (see the complete code for details).
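Since the goal is to surface problematic backlinks, the DataFrame can be filtered before reporting. A minimal sketch, assuming that every status other than Link found, dofollow counts as problematic; the sample rows are made up.

```python
import pandas as pd

# Made-up results in the same shape as the DataFrame built above.
df = pd.DataFrame(
    [
        ["https://blog.example/a", "Link found, dofollow", 200],
        ["https://blog.example/b", "Link found, nofollow", 200],
        ["https://blog.example/c", "Backlink not reachable", 404],
    ],
    columns=["Backlink", "Status", "Response code"],
)

# Assumption: only dofollow links are considered healthy.
problematic = df[df["Status"] != "Link found, dofollow"].reset_index(drop=True)
print(problematic["Backlink"].tolist())
```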

Pushing results to Slack

To report backlinks and their statuses automatically, a Slack app can be used. Create an app in Slack and add an incoming webhook that connects it to the channel where notifications should be posted. More on Slack apps and webhooks: https://api.slack.com/messaging/webhooks

SLACK_WEBHOOK = "YOUR_SLACK_CHANNEL_WEBHOOK"

Although the following piece of code may look a bit complicated, all it does is format the data into a readable form and push it to the Slack channel via a POST request to the Slack webhook.

cols = df.columns.tolist()
rows = []

# Format every DataFrame row as one Slack bullet with backtick-quoted values.
for record in df[cols].itertuples(index=False):
    row = ' '.join("`" + str(value) + "`" for value in record)
    rows.append(':black_small_square:' + row)

data = ["*Backlinks*\n"] + rows

slack_data = {
    "text": '\n'.join(data)
}

# requests.post() takes the target address as its first argument (url).
requests.post(SLACK_WEBHOOK, json=slack_data)
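The request above does not check whether Slack accepted the message. A hedged sketch that separates formatting from sending and raises on a failed delivery; build_slack_payload, post_to_slack, and the 10-second timeout are illustrative names and values, not part of the original script.

```python
import requests


def build_slack_payload(rows, title="Backlinks"):
    """Format rows (lists of values) into a Slack message payload."""
    lines = ["*" + title + "*\n"]
    for row in rows:
        cells = " ".join("`" + str(value) + "`" for value in row)
        lines.append(":black_small_square:" + cells)
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, payload, timeout=10):
    """POST the payload to an incoming webhook and raise on an HTTP error."""
    response = requests.post(webhook_url, json=payload, timeout=timeout)
    response.raise_for_status()
    return response


payload = build_slack_payload([["https://blog.example/a", "Noindex", 200]])
print(payload["text"])
```

Keeping the formatting in its own function makes it possible to inspect the message text without sending anything to Slack.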

That's it! In this example, Slack was used for reporting, but the code can be adjusted to export backlinks and their statuses to a .csv file, Google Sheets, or a database.
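For instance, a CSV export takes a single call to the DataFrame's to_csv() method. A minimal sketch with made-up data; the file name backlinks.csv is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    [["https://blog.example/a", "Noindex", 200]],
    columns=["Backlink", "Status", "Response code"],
)

# index=False leaves out the DataFrame's row numbers.
df.to_csv("backlinks.csv", index=False)

# Without a path, to_csv() returns the CSV text instead of writing a file.
print(df.to_csv(index=False).splitlines()[0])  # Backlink,Status,Response code
```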

Please see backlink_monitoring_oxylabs.py for the complete code.
