British-Broadcasting-Corporation

Overview

The objective of this project is to employ Natural Language Processing (NLP) techniques to analyze a collection of articles from the BBC across five distinct categories. The project aims to uncover insights by addressing two primary questions:

  1. Identifying Common Topics by Category: For each of the five categories, the project determines the most prevalent topics among the articles, using NLP algorithms to extract key themes, subjects, and trends. The goal is a clear picture of the predominant discussions in each section of the BBC.

  2. Exploring G20 Country Mentions Across Categories: The project counts how many articles, across all categories, mention each of the G20 countries. Quantifying these references over the entire dataset reveals patterns in the BBC's global coverage and shows how prominent each G20 country is in different news topics.
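The G20 question can be approximated with plain pattern matching before reaching for heavier NLP tooling. The following is a minimal sketch, not the project's actual method: the alias table `G20_ALIASES` and the function `count_articles_mentioning` are hypothetical names introduced here, and the alias list is deliberately small (it covers the 19 member countries only and would miss demonyms such as "French"). Matching is case-sensitive so that "US" does not match the pronoun "us".

```python
import re
from collections import Counter

# Illustrative alias table (an assumption, not the project's list).
# Ambiguous names such as "Turkey" may produce false positives.
G20_ALIASES = {
    "Argentina": ["Argentina"],
    "Australia": ["Australia"],
    "Brazil": ["Brazil"],
    "Canada": ["Canada"],
    "China": ["China"],
    "France": ["France"],
    "Germany": ["Germany"],
    "India": ["India"],
    "Indonesia": ["Indonesia"],
    "Italy": ["Italy"],
    "Japan": ["Japan"],
    "Mexico": ["Mexico"],
    "Russia": ["Russia"],
    "Saudi Arabia": ["Saudi Arabia"],
    "South Africa": ["South Africa"],
    "South Korea": ["South Korea"],
    "Turkey": ["Turkey"],
    "United Kingdom": ["United Kingdom", "UK", "Britain"],
    "United States": ["United States", "US", "USA"],
}

# One case-sensitive, whole-word pattern per country.
PATTERNS = {
    country: re.compile(r"\b(?:" + "|".join(map(re.escape, aliases)) + r")\b")
    for country, aliases in G20_ALIASES.items()
}


def count_articles_mentioning(articles):
    """Return a Counter mapping country -> number of articles mentioning it."""
    counts = Counter()
    for text in articles:
        for country, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[country] += 1  # each article counts at most once
    return counts
```

Because the function counts articles rather than occurrences, an article that names a country ten times still contributes one to that country's total, which matches the "how many articles" wording of the question.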

Together, these analyses give a data-driven view of BBC content: the prevalent topics within each category and the coverage of G20 countries across news segments.

BBC Datasets

Two news article datasets, originating from BBC News, are provided for use as benchmarks for machine learning research. These datasets are made available for non-commercial and research purposes only. If you make use of these datasets, please consider citing the publication:

D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.

Dataset: BBC

All rights, including copyright, in the content of the original articles are owned by the BBC.

  • Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
  • Class Labels: 5 (business, entertainment, politics, sport, tech)

Download pre-processed dataset

Download raw text files

Dataset: BBCSport

All rights, including copyright, in the content of the original articles are owned by the BBC.

  • Consists of 737 documents from the BBC Sport website corresponding to sports news articles in five topical areas from 2004-2005.
  • Class Labels: 5 (athletics, cricket, football, rugby, tennis)

Download pre-processed dataset

Download raw text files

File formats

The datasets have been pre-processed as follows: stemming (Porter algorithm), stop-word removal (stop word list) and low term frequency filtering (count < 3) have already been applied to the data. The files contained in the archives given above have the following formats:

  • *.mtx: Original term frequencies stored in a sparse data matrix in Matrix Market format.
  • *.terms: List of content-bearing terms in the corpus, with each line corresponding to a row of the sparse data matrix.
  • *.docs: List of document identifiers, with each line corresponding to a column of the sparse data matrix.
  • *.classes: Assignment of documents to natural classes, with each line corresponding to a document.
  • *.urls: Links to original articles, where appropriate.
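The pre-processed files described above can be read with the standard library alone. The sketch below assumes the `*.mtx` file uses the coordinate (sparse) variant of Matrix Market with 1-based indices, and that `*.terms` holds one term per line in row order, as the list above states. The helper names `read_lines`, `read_mtx`, and `top_terms_for_doc` are hypothetical. If SciPy is available, `scipy.io.mmread` reads the same format and is likely the more practical choice.

```python
def read_lines(path):
    """Read non-empty lines, one entry per line (for *.terms or *.docs)."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def read_mtx(path):
    """Parse a coordinate-format Matrix Market file.

    Returns (n_rows, n_cols, entries), where entries maps a 0-based
    (row, col) pair to its value. Rows are terms, columns are documents.
    """
    entries = {}
    size = None
    with open(path, encoding="utf-8") as fh:
        header = fh.readline()
        if not header.startswith("%%MatrixMarket") or "coordinate" not in header:
            raise ValueError("expected a coordinate-format Matrix Market file")
        for line in fh:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip comments and blank lines
            parts = line.split()
            if size is None:
                # First data line: "rows cols nonzeros"
                size = (int(parts[0]), int(parts[1]))
                continue
            row, col, value = parts
            entries[(int(row) - 1, int(col) - 1)] = float(value)
    if size is None:
        raise ValueError("missing size line")
    return size[0], size[1], entries


def top_terms_for_doc(entries, terms, doc_index, k=5):
    """Return the k most frequent terms in one document (matrix column)."""
    column = [(v, terms[r]) for (r, c), v in entries.items() if c == doc_index]
    column.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by term
    return [term for _, term in column[:k]]
```

Note that the terms are already stemmed, so the output shows stems rather than surface words.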

Questions:

  1. For each of the five categories: what are the most common topics among articles?
  2. Across all categories: how many articles talk about each of the G20 countries?
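As a rough first pass at question 1, the most common terms per category can stand in for topics. This is a simple document-frequency proxy, not a topic model, and it is not taken from the linked analysis. The stop-word list, the `tokenize` helper, and `top_terms_by_category` are assumptions made for illustration, and the input is assumed to be `(category, text)` pairs built from the raw text files.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "for", "on",
    "with", "as", "at", "by", "it", "its", "has", "have", "was", "were",
}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words and short tokens."""
    return [
        tok for tok in re.findall(r"[a-z]+", text.lower())
        if tok not in STOP_WORDS and len(tok) > 2
    ]


def top_terms_by_category(labelled_articles, k=3):
    """Return {category: [top-k terms]} ranked by number of articles using them.

    labelled_articles is an iterable of (category, text) pairs.
    """
    doc_freq = defaultdict(Counter)
    for category, text in labelled_articles:
        # set() so each term counts once per article
        doc_freq[category].update(set(tokenize(text)))
    return {
        category: [
            term
            for term, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        ]
        for category, counts in doc_freq.items()
    }
```

Ranking by document frequency favors terms that recur across many articles in a category. Weighting schemes such as TF-IDF or a topic model such as LDA would be natural next steps for separating category-specific themes from generally common words.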

Link to analysis on Kaggle

Contributors

bornofdata