tosdhr's Introduction

Terms of Service; Didn't Have to Read

a model for annotating terms of service and privacy policies, trained on data available from ToS;DR, which will hopefully make the volunteers' jobs easier

Getting Started

Installation

The quickest way to get started is to use Poetry to install the dependencies:

git clone https://github.com/skewballfox/tosdhr
cd tosdhr
poetry install

p.s. if you want to keep the virtual environment in the project folder (so VS Code automatically detects it), run this command inside the project folder before poetry install:

poetry config virtualenvs.in-project true --local

the --local flag makes this setting apply only to this project, rather than becoming the default for every project. Remove it only if you want this to be the default behavior for every Poetry-managed project.


tosdhr's Issues

get_topics function

  • currently there are roughly 12,000 annotations spread across roughly 900 documents, divided into 239 cases
    • this might be too many classes for a model to handle; if so, we may cluster the cases into topics
  • topics are something ToS;DR uses to group cases together under broader ideas (such as the concept of "ownership")
    • there are only 28 topics, excluding "deprecated", and not all of them may be necessary for the 239 cases we have
    • we can then take the scoring field for the case (which is currently useless for our situation) and expand each topic into three categories: good, bad, or neutral
    • this would reduce the 239 cases to at most 84 (28 × 3) hierarchical classifications
  • topics aren't available via the API, so here's the plan:
    1. we create an .env file with credentials for the edit.tosdhr site, where the list of topics is available
    2. we use requests.Session to create a session, and python-dotenv to load the credentials for the site
    3. we request the page which lists all the topics
    4. we use Beautiful Soup to parse the content of the page, extracting a list of URLs and names for the topics
      • if we have a way of identifying topics which aren't relevant (such as "deprecated"), we filter them out here
      • this might be where we want some kind of cache loop, the way we do for services
    5. for each topic, we use the existing session to visit the URL associated with that topic
    6. we use Beautiful Soup to parse the body of the response to get the list of cases
    7. we check the cases for that topic against the set of cases we have annotations for
      • if at least one case under that topic is in the set, we store it in a dict where the key is the numeric topic_id (part of the URL) and the value is the set of cases
      • (a set of cases may be better than a list, since Python supports set operations like union and intersection)
    8. we store this dict to swap in for the cases at tokenization time
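
Step 7 above reduces to a set intersection. A minimal sketch (the names filter_topics, topic_cases, and annotated_case_ids are illustrative assumptions, not existing code in the repo):

```python
def filter_topics(topic_cases: dict[int, set[int]],
                  annotated_case_ids: set[int]) -> dict[int, set[int]]:
    """Map topic_id -> annotated cases under that topic (step 7),
    dropping any topic with no overlap."""
    result = {}
    for topic_id, case_ids in topic_cases.items():
        overlap = case_ids & annotated_case_ids  # set intersection
        if overlap:
            result[topic_id] = overlap
    return result
```

The resulting dict is what step 8 would store for tokenization time.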

tokenize data

we need to create a function in bookshelf that tokenizes the text stored in documents using the indices of the annotations:

raw_tokens = []
for document in documents:
    text = document.text
    # reverse order keeps earlier annotations' indices valid after each split
    for annotation in reversed(document.annotations):
        raw_tk = text[annotation.start:annotation.end]
        text = text[:annotation.start] + text[annotation.end:]
        raw_tokens.append(raw_tk)
  • we have to iterate through the annotations in reverse, because otherwise each split would shift the indices of the annotations still to be processed
  • we also need to grab the case_id, because we need to put it in front of the token string as a classification label for LegalBERT
  • after we have the list of raw tokens, we need to remove anything that isn't raw text. some tokenizer functions might already do this; if not, Beautiful Soup may have something for detecting HTML tags (such as paragraph tags)
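
For the markup-stripping step, the standard library's html.parser would also work (Beautiful Soup's get_text() does the same job if it is already a dependency). A sketch, with strip_markup being a hypothetical helper rather than existing code:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only text content, discarding tags such as <p>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(raw_token: str) -> str:
    """Return raw_token with any HTML markup removed."""
    extractor = _TextExtractor()
    extractor.feed(raw_token)
    return "".join(extractor.parts).strip()
```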

store data from the cases API for cases we know we have annotations for

This doesn't require interacting with the cases API or the filesystem directly, as that is already handled here. We need two functions:

  1. a function that takes the list of distinct cases we know we have and pulls the data, like [get_all_reviewed_services](https://github.com/skewballfox/tosdhr/blob/master/tosdhr/dataManagement/data_handler.py#L71) does for services. it will need to take each numeric case id and call get_case to get the JSON from the API and cache it locally (already handled by another function).
  2. a function in cases.py that takes the JSON object and turns it into internal Python objects, similar to get_reviewed_documents in services.py

we need objects for cases and, ideally, topics as well; we might need both.
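
A minimal sketch of both functions, under stated assumptions: get_case exists per the text, but every other name here, and the JSON fields id and title, are guesses for illustration rather than the API's actual schema:

```python
from dataclasses import dataclass


def get_annotated_cases(case_ids, get_case):
    """Function (1): map case_id -> case JSON, fetching each distinct
    id exactly once via get_case (which already caches locally)."""
    return {case_id: get_case(case_id) for case_id in set(case_ids)}


@dataclass
class Case:
    """Function (2)'s internal object; fields are assumed, not the
    API's actual schema."""
    case_id: int
    title: str


def case_from_json(obj: dict) -> Case:
    """Function (2): turn one case's JSON into an internal object."""
    return Case(case_id=obj["id"], title=obj["title"])
```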
