ada-2021-project-concatsanddogs's Introduction

Project Milestone 3 - Group Concatsanddogs

The use of women's rights and gender equality rhetoric in the US

Abstract:

The Quotebank dataset was created on the premise that while quotations are interesting material by themselves, more meaning can be inferred when they come with an attributed speaker (see research paper). In this data story we build upon this premise to ask a few questions about women’s rights and gender equality rhetoric in news articles.

Our questions make use of the data that contextualize the quotations in Quotebank, namely the date, the attributed speaker and the quotations themselves. We will explore who relies on arguments of women’s rights and gender equality, using additional datasets to complete the social profile of the attributed speaker. We are also interested in discovering in which social and political context this rhetoric is used. To this end, we will use additional datasets and natural language processing to highlight possible links between political events and the evolution of feminist topics in news articles. We will also dive into specific topics, using subsets of quotes, to investigate different feminist ideologies.

Research Questions:

Topic modelling and sentiment analysis did not give the results we were hoping for; in particular, we did not find a sufficient number of quotes regarding femonationalism (~800 quotes). Our updated research questions are:

RQ0 : What? What do people talk about?

	How can we subdivide topics in a supervised manner?

RQ1 : What? What do people talk about?

	What kind of topics appear in selected quotes with an unsupervised method? 
	Is there a topic/cluster representing femonationalist quotes?

RQ2 : When? When do people talk about women's rights?

	How do people feel about women’s rights and gender equality overall?
	How has sentiment evolved over time ?
	Are peaks in sentiment related to political or social events ?

RQ3 : Who? Who talks about women's rights?

	Is speaker sentiment related to political party, gender or other?
	Are women’s rights a source of disagreement between the two sexes?

RQ4: How? A dive into different feminist ideologies

	Can we highlight some ideologies within topic subsets?

Proposed additional datasets :

  • Article content and keywords: These two text files contain the articles retrieved from usnews.com and the list of most frequent bigrams that we will use as keywords.
  • Parquet files and QIDs: Samples of the Wikidata knowledge base will be used to translate QID items in the dataset to readable labels. After selecting the quotes we will work on, we will use the provided .parquet file and JSON to load the corresponding attributes for each speaker. Some attributes, such as the political party, could be used in our analyses. Since multiple QIDs are given for the political party, we might extract the dates of each term and only keep the party relevant to the quote.
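
Keeping only the party membership relevant to a quote could look like the sketch below. This is a minimal illustration rather than the project's code: the `PartyTerm` record, the `party_at` helper, and the placeholder QIDs and dates are all hypothetical, and it assumes each membership comes with a start date and an optional end date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PartyTerm:
    """One party membership of a speaker (hypothetical record layout)."""
    party_qid: str
    start: date
    end: Optional[date] = None  # None means the membership is still ongoing


def party_at(terms: List[PartyTerm], quote_date: date) -> Optional[str]:
    """Return the QID of the party the speaker belonged to on quote_date.

    If several terms overlap, the one that started most recently wins.
    Returns None when no term covers the date.
    """
    active = [
        t for t in terms
        if t.start <= quote_date and (t.end is None or quote_date <= t.end)
    ]
    if not active:
        return None
    return max(active, key=lambda t: t.start).party_qid


# Placeholder data: these QIDs and dates are made up for illustration.
terms = [
    PartyTerm("Q_PARTY_A", date(2001, 1, 1), date(2011, 12, 31)),
    PartyTerm("Q_PARTY_B", date(2012, 1, 1)),
]

print(party_at(terms, date(2009, 6, 15)))  # Q_PARTY_A
print(party_at(terms, date(2018, 3, 8)))   # Q_PARTY_B
print(party_at(terms, date(1999, 1, 1)))   # None
```

Returning None for uncovered dates keeps quotes with an unknown affiliation visible, so they can be counted or excluded explicitly in later analyses.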

Methods :

  • Webscraping: To scrape news websites and extract keywords related to women's rights, we use Python libraries for parsing the HTML tree (BeautifulSoup) and for natural language processing (NLTK), which gives us the most frequent bigrams. The format of the website has to be inspected manually to define the relevant tag contents. In our case, we manually determined two tags that identify the column of interest on the primary page, as well as one tag that identifies the article itself on the secondary pages. The article's content is cleaned with regular expressions to keep only the relevant part of the text (e.g. copyright notices are ignored).

  • Clustering : The BERTopic package is used to cluster the quotes into meaningful labeled topics. A subset of our chosen quotes is fed to it. It embeds the quotes using Sentence-BERT, reduces the dimensionality of the embeddings, clusters them using HDBSCAN, and labels the clusters (which are our topics, each consisting of a few keywords) using a modified version of TF-IDF and MMR (Maximal Marginal Relevance).

  • Sentiment analysis : Polarity scores from the VaderSentiment library (available in the NLTK library) could be used as a metric of how positive or negative a quote is. The VaderSentiment tool was created to work best on social media content, which is appropriate as we are dealing with short texts. We also compare a sample of the results to other tools such as Flair and TextBlob.
  • QID to attributes & QID to readable labels: From a dataframe containing the chosen quotes, the QID column (only the first QID if several are available) is used to perform a left join with the dataframe loaded from "speaker_attributes.parquet". Since some labels are still in QID format, another join is performed between the remaining QIDs and "wikidata_labels_descriptions_quotebank.csv.bz2". Code has also been prepared to perform the same join in chunks against the file containing all the Wikidata labels ("wikidata_labels_descriptions.csv.bz2"), in case some QIDs are not available in the Quotebank version.
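
The two joins described in the last point can be sketched without any dependency. The project itself works on dataframes; this sketch swaps in plain dictionaries to stay self-contained. The field names (`qids`, `gender`, `party`), the helper `enrich_quotes`, and the placeholder identifiers are assumptions for illustration, not necessarily the actual column names of the files.

```python
from typing import Dict, List, Optional


def enrich_quotes(
    quotes: List[dict],
    speaker_attributes: Dict[str, dict],
    labels: Dict[str, str],
) -> List[dict]:
    """Left-join quotes with speaker attributes, then map QIDs to labels.

    Every quote is kept. Quotes whose speaker has no QID or no attribute
    record get None for each attribute. QIDs without a known label are
    left in QID form, so a later pass over a larger label file can fill them.
    """
    enriched = []
    for quote in quotes:
        qids = quote.get("qids") or []
        first_qid: Optional[str] = qids[0] if qids else None  # first QID only
        attrs = speaker_attributes.get(first_qid, {})
        row = dict(quote, speaker_qid=first_qid)
        for field in ("gender", "party"):
            value = attrs.get(field)
            # Second join: replace the QID by its label when one is known.
            row[field] = labels.get(value, value)
        enriched.append(row)
    return enriched


# Placeholder data: identifiers and labels are made up for illustration.
quotes = [
    {"quotation": "first placeholder quote", "qids": ["Q_SPEAKER_1", "Q_SPEAKER_9"]},
    {"quotation": "second placeholder quote", "qids": []},
]
speaker_attributes = {
    "Q_SPEAKER_1": {"gender": "Q_GENDER_X", "party": "Q_PARTY_A"},
}
labels = {"Q_GENDER_X": "label for gender X"}

for row in enrich_quotes(quotes, speaker_attributes, labels):
    print(row)
```

In this sketch the party of the first quote stays as `Q_PARTY_A` because the small label table does not contain it, which is the situation the chunked join against the full label file is meant to handle.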

Proposed timeline: weekly

  • 9 : Get feedback and review P2 and potentially modify timeline
  • 10 : Have a result and plot for every question - start interpreting
  • 11 : Combine the results and sketch website with visualizations - finish notebook
  • 12 : Continue website - write text accompanying visualizations
  • 13 : Run all the code and visualizations one last time - Finalize website

Organization within the team:

Amina :

  • Webscraping,
  • RQ0 (data categorization, results interpretation, website writing)
  • RQ2 (data, results interpretation, website writing)
  • RQ4 (data)

Younes :

  • RQ1 (data topic clustering, results interpretation, website writing)
  • RQ2 (data, results interpretation, website writing)
  • RQ3 (data, results interpretation)
  • QID & BERTopic pipeline

Galann :

  • Website set up
  • RQ3 (results interpretation, website writing)
  • RQ4 (results interpretation, website writing)

Valérian :

  • Website set up
  • RQ1 (data topic clustering, results interpretation, website writing)

(RQ4 is an additional question, addressed last.)

Notebook organization:

  • Notebooks with the plots used in the website :

    • Who_talks_about_womens_rights.ipynb

    • Womens_rights_and_gender_equality_quotes.ipynb

    • When_do_people_talk_about_womens_rights.ipynb

    • Additional_Investigations.ipynb

  • Method notebooks:

    • Webscraping & data selection.ipynb : Webscraping and how the quotes were selected from the Quotebank dataset
    • BERTopic modelling.ipynb : pipeline for topic modelling
    • Data_enriching.ipynb : QID pipeline

Folder organization :

  • generated_data : contains files generated by the notebooks
  • img : contains images used for the website
  • scripts: contains .py files used in our notebooks
  • Two files were too big for GitHub, so they were placed in the external Google Drive link. The files are BERTopic_topic_model.pkl and BERTopic_topic_model.pkl, and they need to be placed in ./generated_data/BERTopic

ada-2021-project-concatsanddogs's People

Contributors

unesmu, amina-matt, valoun7, galannp
