LINK TO DATASTORY WEBSITE

Project Milestone 3 - Group Concatsanddogs

The use of women's rights and gender equality rhetoric in the US

Abstract:

The Quotebank dataset was created on the premise that, while quotations are interesting material in themselves, more meaning can be inferred once a speaker is attributed (see research paper). In this data story we build on this premise to ask a few questions about women’s rights and gender equality rhetoric in news articles.

Our questions make use of the data that contextualize the quotations in Quotebank, namely the date, the attributed speaker and the quotations themselves. We explore who relies on arguments of women’s rights and gender equality, using additional datasets to complete the social profile of the attributed speaker. We are also interested in discovering in which social and political contexts this rhetoric is used. To this end, we use additional datasets and natural language processing to highlight possible links between political events and the evolution of feminist topics in news articles. We also dive into specific topics, using subsets of quotes, to investigate different feminist ideologies.

Research Questions:

Since topic modelling and sentiment analysis did not give the results we were hoping for (in particular, we found too few quotes regarding femonationalism, about 800), our updated research questions are:

RQ0 : What? What do people talk about?

	How can we subdivide topics in a supervised manner?

RQ1 : What? What do people talk about?

	What kind of topics appear in selected quotes with an unsupervised method? 
	Is there a topic/cluster representing femonationalist quotes?

RQ2 : When? When do people talk about women's rights?

	How do people feel about women’s rights and gender equality overall?
	How has sentiment evolved over time ?
	Are peaks in sentiment related to political or social events ?

RQ3 : Who? Who talks about women's rights?

	Is speaker sentiment related to political party, gender or other?
	Are women’s rights a source of disagreement between the two sexes ?

RQ4: How? A dive into different feminist ideologies

	Can we highlight some ideologies within topic subsets?

Proposed additional datasets :

Articles content and Keywords: These two text files contain the articles retrieved from usnews.com and the list of most frequent bigrams that we will use as keywords.

Parquet files and QIDs: Samples of the Wikidata knowledge base are used to translate the QIDs in the dataset into readable labels. After selecting the quotes we will work on, we load the corresponding attributes of each speaker from the provided .parquet and JSON files. Some attributes, such as political party, could be used in our analyses. Since multiple QIDs may be given for the political party, we might extract the dates of each term and keep only the party relevant to the quote.
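
The idea of keeping only the party relevant to a quote could look like the minimal sketch below. It assumes each speaker's memberships are available as (party QID, start date, end date) tuples; that structure, the placeholder QIDs and the helper name `party_at` are illustrative assumptions, not part of the provided files.

```python
from datetime import date

# Hypothetical party memberships for one speaker: (party QID, start, end).
# The QIDs are placeholders; an end of None means the membership is ongoing.
memberships = [
    ("Q111", date(1990, 1, 1), date(2009, 12, 31)),
    ("Q222", date(2010, 1, 1), None),
]


def party_at(memberships, quote_date):
    """Return the party QID whose term covers quote_date, or None if none does."""
    for qid, start, end in memberships:
        if start <= quote_date and (end is None or quote_date <= end):
            return qid
    return None


# A quote from mid-2015 falls inside the second (ongoing) membership.
print(party_at(memberships, date(2015, 6, 1)))
```

Quotes dated outside every recorded term return None, so they can be dropped or handled separately rather than being assigned a party by default.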

Methods :

Webscraping: To scrape news websites and extract keywords related to women's rights, we use Python libraries that parse the HTML tree (BeautifulSoup) and perform natural language processing (NLTK) to get the most frequent bigrams. The format of the website has to be inspected manually to define the relevant tag contents. In our case, we manually determined two tags identifying the column of interest on the primary page, as well as a tag identifying the article itself on its secondary pages. The article's content is cleaned with regular expressions to keep only the relevant part of the text (e.g. copyright notices are ignored).
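
As a rough illustration of the keyword step, the sketch below counts the most frequent bigrams in an already cleaned article text. It deliberately uses only the Python standard library as a stand-in for NLTK; the tiny stopword list, the sample text and the helper name `top_bigrams` are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list (the project relies on NLTK).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "are", "on"}


def top_bigrams(text, n=3):
    """Return the n most frequent bigrams of non-stopword tokens."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    # Stopwords are removed first, so a bigram may bridge a removed word.
    bigrams = zip(tokens, tokens[1:])
    return Counter(bigrams).most_common(n)


# Hypothetical article text, standing in for a scraped and cleaned article.
article = (
    "Gender equality is a goal. Advocates for gender equality and "
    "reproductive rights marched. Reproductive rights are on the ballot."
)
print(top_bigrams(article, n=2))
```

The resulting bigrams can then serve as keywords for filtering quotes.
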
Clustering : The BERTopic package is used to cluster the quotes into meaningful labeled topics. A subset of our chosen quotes is fed to it. It embeds the quotes using Sentence-BERT, reduces the dimensionality of the embeddings, clusters them using HDBSCAN, and labels the clusters (our topics, each consisting of a few keywords) using a modified version of TF-IDF and MMR (Maximal Marginal Relevance).
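
To give an intuition for the MMR step, here is a minimal, generic sketch of Maximal Marginal Relevance over toy 2-D vectors. It is not BERTopic's internal implementation; the vectors, the `diversity` parameter name and the helper functions are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(topic_vec, candidates, top_n=2, diversity=0.5):
    """Greedily pick keywords that are relevant to the topic but not redundant."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_n:

        def score(word):
            relevance = cosine(remaining[word], topic_vec)
            redundancy = max(
                (cosine(remaining[word], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical embeddings: two near-duplicate keywords and one distinct keyword.
topic_vec = [1.0, 1.0]
candidates = {
    "abortion": [1.0, 0.9],
    "abortions": [1.0, 0.8],
    "vote": [0.2, 1.0],
}
print(mmr(topic_vec, candidates, top_n=2, diversity=0.5))
```

With `diversity=0` the selection reduces to plain relevance ranking, so near-duplicate keywords can both be chosen; a higher value trades some relevance for variety in the topic label.
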

Sentiment analysis : Polarity scores from VADER (vaderSentiment, also available through NLTK) could be used as a metric of how positive or negative a quote is. VADER was designed to work best on social media content, which is appropriate since we are dealing with short texts. We also compare a sample of the results with other tools such as Flair and TextBlob.
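
Once compound polarity scores are available, they can be turned into labels and averaged over time. The sketch below assumes the scores were already computed and uses the conventional VADER thresholds of +0.05 and -0.05; the sample scores and helper names are hypothetical.

```python
from collections import defaultdict


def label_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def monthly_mean(scored_quotes):
    """Average compound scores per 'YYYY-MM' key."""
    buckets = defaultdict(list)
    for month, compound in scored_quotes:
        buckets[month].append(compound)
    return {month: sum(vals) / len(vals) for month, vals in buckets.items()}


# Hypothetical (month, compound score) pairs, assumed to be precomputed.
scored_quotes = [("2017-01", 0.5), ("2017-01", -0.25), ("2017-02", 0.0)]
print(monthly_mean(scored_quotes))
print([label_compound(c) for _, c in scored_quotes])
```

Averaging per month gives a simple time series in which peaks can be compared with the dates of political or social events.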

QID to attributes & QID to readable labels: Starting from a dataframe containing the chosen quotes, the QID column (only the first QID if several are available) is used to perform a left join with the dataframe loaded from "speaker_attributes.parquet". Since some attribute values are still in QID format, another join is performed between the remaining QIDs and "wikidata_labels_descriptions_quotebank.csv.bz2". Code has also been prepared to perform the same join in chunks against the file containing all the Wikidata labels ("wikidata_labels_descriptions.csv.bz2"), in case some QIDs are not available in the Quotebank version.
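
The two joins can be sketched with pandas as follows. The small in-memory frames stand in for the real files, and the column names (`qids`, `id`, `gender`, `QID`, `Label`) as well as the single-valued `gender` attribute are assumptions made for the example; the chunked read is simulated with an in-memory CSV.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the selected quotes and the speaker attributes.
quotes = pd.DataFrame({
    "quotation": ["quote one", "quote two", "quote three"],
    "qids": [["Q1", "Q9"], ["Q2"], []],
})
speakers = pd.DataFrame({"id": ["Q1", "Q2"], "gender": ["Q100", "Q200"]})

# Keep only the first QID of each quote (None when no speaker QID is listed).
quotes["qid"] = quotes["qids"].apply(lambda q: q[0] if len(q) else None)

# Left join: every quote is kept, even when no speaker attributes are found.
merged = quotes.merge(speakers, how="left", left_on="qid", right_on="id")

# Read the label file in chunks, keeping only the QIDs that are actually needed.
labels_csv = io.StringIO("QID,Label\nQ100,female\nQ200,male\nQ300,other\n")
needed = set(merged["gender"].dropna())
parts = []
for chunk in pd.read_csv(labels_csv, chunksize=2):
    hits = chunk[chunk["QID"].isin(needed)]
    if not hits.empty:
        parts.append(hits)
labels = pd.concat(parts, ignore_index=True)

# Second left join: translate the attribute QIDs into readable labels.
merged = merged.merge(labels, how="left", left_on="gender", right_on="QID")
print(merged[["quotation", "qid", "Label"]])
```

Filtering each chunk before concatenating keeps memory usage bounded by the chunk size, which matters for the full Wikidata label file.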

Proposed timeline: weekly

9 : Get feedback and review P2 and potentially modify timeline
10 : Have a result and plot for every question - start interpreting
11 : Combine the results and sketch website with visualizations - finish notebook
12 : Continue website - write text accompanying visualizations
13 : Run all the code and visualizations one last time - Finalize website

Organization within the team:

Amina :

Webscraping
RQ0 (data categorization, results interpretation, website writing)
RQ2 (data, results interpretation, website writing)
RQ4 (data)

Younes :

RQ1 (data topic clustering, results interpretation, website writing)
RQ2 (data, results interpretation, website writing)
RQ3 (data, results interpretation)
QID & BERTopic pipeline

Galann :

Website set up
RQ3 (results interpretation, website writing)
RQ4 (results interpretation, website writing)

Valérian :

Website set up
RQ1 (data topic clustering, results interpretation, website writing)

(RQ4 is an additional question, addressed last)

Notebook organization:

Notebooks with the plots used in the website :
- Who_talks_about_womens_rights.ipynb
- Womens_rights_and_gender_equality_quotes.ipynb
- When_do_people_talk_about_womens_rights.ipynb
- Additional_Investigations.ipynb
Method notebooks:
- Webscraping & data selection.ipynb : Webscraping and how the quotes were selected from the Quotebank dataset
- BERTopic modelling.ipynb : pipeline for topic modelling
- Data_enriching.ipynb : QID pipeline

Folder organization :

generated_data : contains files generated by the notebooks
img : contains images used for the website
scripts: contains .py files used in our notebooks
Two files were too big for GitHub and were placed in the external Google Drive link. These files are BERTopic_topic_model.pkl and BERTopic_topic_model.pkl, and they need to be placed in ./generated_data/BERTopic

