tosdhr's Introduction

Terms of Service; Didn't Have to Read

a model for annotating terms of service and privacy policies, trained on data available from ToS;DR, which will hopefully make the volunteers' jobs easier

Getting Started

Installation

The quickest way to get started is to use Poetry to install the dependencies:

git clone https://github.com/skewballfox/tosdhr
cd tosdhr
poetry install

p.s. if you want to keep the virtual environment in the project folder (so VS Code automatically detects it), run this command inside the project folder before poetry install:

poetry config virtualenvs.in-project true --local

the --local flag makes this setting apply only to this project, rather than becoming the default for every project. Remove it only if you want this to be the default behavior for every Poetry-managed project.


tosdhr's Issues

get_topics function

  • currently there are roughly 12,000 annotations spread across roughly 900 documents, divided into 239 cases
    • this might be too many classes for a model to handle; if so, we may cluster the cases into topics
  • topics are something ToS;DR uses to group cases together under broader ideas (such as the concept of "ownership")
    • there are only 28 topics, excluding "deprecated", and not all of them may be necessary for the 239 cases we have
    • we can then take the scoring field for the case (which is currently useless for our situation) and expand each topic into three categories: good, bad, or neutral
    • this would reduce the 239 cases to at most 84 (28 × 3) hierarchical classifications
  • topics aren't available via the API, so here's the plan:
    1. we create an .env file with credentials for the edit.tosdhr site, where the list of topics is available
    2. we use requests.Session to create a session, and python-dotenv to load the credentials for the site
    3. we request the page which lists all the topics
    4. we use Beautiful Soup to parse the content of the page, extracting a list of URLs and names for the topics
      • if we have a way of identifying topics which aren't relevant (such as "deprecated"), we filter them out here
      • this might be where we want some kind of cache loop, the way we do for services
    5. for each topic, we use the existing session to visit the URL associated with that topic
    6. we use Beautiful Soup to parse the body of the response to get the list of cases
    7. we check the cases for that topic against the set of cases we have annotations for
      • if at least one case under that topic is in the set, we store it in a dict where the key is the numeric topic_id (part of the URL) and the value is the set of cases
      • (a set of cases may be better than a list, since Python supports set operations like union and intersection)
    8. we store this dict to swap in for the cases at tokenization time
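
Step 7 above reduces to a set intersection. A minimal sketch (the names filter_topics, topic_cases, and annotated_case_ids are illustrative assumptions, not existing code in the repo):

```python
def filter_topics(topic_cases: dict[int, set[int]],
                  annotated_case_ids: set[int]) -> dict[int, set[int]]:
    """Map topic_id -> annotated cases under that topic (step 7),
    dropping any topic with no overlap."""
    result = {}
    for topic_id, case_ids in topic_cases.items():
        overlap = case_ids & annotated_case_ids  # set intersection
        if overlap:
            result[topic_id] = overlap
    return result
```

The resulting dict is what step 8 would store for tokenization time.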

tokenize data

we need to create a function in bookshelf that tokenizes the text stored in documents using the indices of the annotations:

raw_tokens = []
for document in documents:
    text = document.text
    # reverse order keeps earlier annotations' indices valid after each split
    for annotation in reversed(document.annotations):
        raw_tk = text[annotation.start:annotation.end]
        text = text[:annotation.start] + text[annotation.end:]
        raw_tokens.append(raw_tk)
  • we have to iterate through the annotations in reverse, because otherwise each split would shift the indices of the annotations still to be processed
  • we also need to grab the case_id, because we need to put it in front of the token string as a classification label for LegalBERT
  • after we have the list of raw tokens, we need to remove anything that isn't raw text. some tokenizer functions might already do this; if not, Beautiful Soup may have something for detecting HTML tags (such as paragraph tags)
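
For the markup-stripping step, the standard library's html.parser would also work (Beautiful Soup's get_text() does the same job if it is already a dependency). A sketch, with strip_markup being a hypothetical helper rather than existing code:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only text content, discarding tags such as <p>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(raw_token: str) -> str:
    """Return raw_token with any HTML markup removed."""
    extractor = _TextExtractor()
    extractor.feed(raw_token)
    return "".join(extractor.parts).strip()
```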

store data from the cases API for cases we know we have annotations for

This doesn't require interacting with the cases API or the filesystem directly, as that is already handled here. We need two functions:

  1. a function that takes the list of distinct cases we know we have and pulls the data, like [get_all_reviewed_services](https://github.com/skewballfox/tosdhr/blob/master/tosdhr/dataManagement/data_handler.py#L71) does for services. it will need to take each numeric case id and call get_case to get the JSON from the API and cache it locally (already handled by another function).
  2. a function in cases.py that takes the JSON object and turns it into internal Python objects, similar to get_reviewed_documents in services.py

we need objects for cases and, ideally, topics as well; we might need both.
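
A minimal sketch of both functions, under stated assumptions: get_case exists per the text, but every other name here, and the JSON fields id and title, are guesses for illustration rather than the API's actual schema:

```python
from dataclasses import dataclass


def get_annotated_cases(case_ids, get_case):
    """Function (1): map case_id -> case JSON, fetching each distinct
    id exactly once via get_case (which already caches locally)."""
    return {case_id: get_case(case_id) for case_id in set(case_ids)}


@dataclass
class Case:
    """Function (2)'s internal object; fields are assumed, not the
    API's actual schema."""
    case_id: int
    title: str


def case_from_json(obj: dict) -> Case:
    """Function (2): turn one case's JSON into an internal object."""
    return Case(case_id=obj["id"], title=obj["title"])
```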
