banananer's People

Contributors

ceparadise, keighrim, tcurcuru

banananer's Issues

f_name: ZONE

Indicates the zone of the article.
possible values: TXT, HL, DATELINE, DD

Feature Ideas

Here are some ideas for features to implement:

Dictionaries of named entities (pulled from net)
Clusters
POS tags
Word longer than the average length of all words
Word longer than 10 characters (or some other threshold)
Whether BIO chunks are included in the dictionary of names

  • Unigram, Bigram, Trigram, Quadrigram included in dictionary of names

Word capitalized or not
Number of capitalized letters in word
Number of vowels in word
Number of consonants in word
Specific letters?
Length of sentence
Word in the first half or the second half of the sentence
Word is 'banana' or not
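
A few of the word-level ideas above can be sketched as small standalone functions. This is only an illustration written for Python 3: the function names, the vowel set, and the default threshold of 10 are assumptions, not code from the project.

```python
VOWELS = set("aeiouAEIOU")  # assumption: 'y' is counted as a consonant


def is_capitalized(word):
    """True if the first character is an uppercase letter."""
    return word[:1].isupper()


def count_capitals(word):
    """Number of uppercase letters in the word."""
    return sum(1 for ch in word if ch.isupper())


def count_vowels(word):
    """Number of vowels in the word."""
    return sum(1 for ch in word if ch in VOWELS)


def count_consonants(word):
    """Number of alphabetic characters that are not vowels."""
    return sum(1 for ch in word if ch.isalpha() and ch not in VOWELS)


def is_long(word, threshold=10):
    """True if the word has more than `threshold` characters."""
    return len(word) > threshold


def longer_than_average(word, words):
    """True if the word is longer than the mean length of `words`."""
    if not words:
        return False
    average = sum(len(w) for w in words) / len(words)
    return len(word) > average


def is_banana(word):
    """True if the word is 'banana', ignoring case."""
    return word.lower() == "banana"
```

Each function takes a single word (plus a word list for the average), so any of them could later be wrapped to fit whatever signature the feature extraction code settles on.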

TODO list

1/26/2015

  • Base code for feature extraction - Yalin
  • Base code for an interface to CRF++ - Keigh
  • Get Brown clustering running - Todd

test issue

We don't have any bugs, since we don't have any code.

Designing base code for feature extraction

We need a class that has at least:

  1. read() - a method to load the train.gold file into a list of sentences (or any other data structure)
  2. any necessary methods that return tokens, sentences, <token, postag> pairs at the token level, and <token, postag> pairs at the sentence level
  3. feature_functions = [] - a list of function names, each function doing one job. Since we already have POS tags, something like pos_tag() can be the first element of this list
  4. extract() - a method that traverses all functions in feature_functions to create the feature vector
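
The four points above could be sketched roughly as follows. This is a minimal sketch in Python 3, not the project's code. It assumes, without confirmation from this thread, that train.gold has one token per line with whitespace-separated columns (token first, POS tag second, label last) and a blank line between sentences. The class name FeatureExtractor and the helper method names are hypothetical, and the list holds function objects rather than function names.

```python
class FeatureExtractor:
    def __init__(self):
        # Each sentence is a list of (token, postag, label) tuples.
        self.sentences = []
        # Each feature function takes (sentence, index) and returns one value.
        self.feature_functions = [self.pos_tag]

    def read(self, lines):
        """Load sentences from an iterable of lines, e.g. an open file.

        Assumed format: 'token postag ... label', blank line between sentences.
        """
        sentence = []
        for line in lines:
            fields = line.split()
            if not fields:
                if sentence:
                    self.sentences.append(sentence)
                    sentence = []
                continue
            sentence.append((fields[0], fields[1], fields[-1]))
        if sentence:
            self.sentences.append(sentence)

    def tokens(self):
        """All tokens in corpus order."""
        return [tok for sent in self.sentences for tok, _, _ in sent]

    def tagged_sentences(self):
        """<token, postag> pairs grouped by sentence."""
        return [[(tok, pos) for tok, pos, _ in sent] for sent in self.sentences]

    @staticmethod
    def pos_tag(sentence, i):
        """Feature function: the POS tag already present in the data."""
        return sentence[i][1]

    def extract(self):
        """Apply every feature function to every token; one row per token."""
        rows = []
        for sent in self.sentences:
            for i, (tok, _, _) in enumerate(sent):
                rows.append([tok] + [fn(sent, i) for fn in self.feature_functions])
        return rows
```

Passing an open file handle to read() keeps the parsing independent of where the data lives, for example `with open("train.gold") as f: extractor.read(f)`.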

feature vector data structure

Basically, it should be a table-like structure in which each column corresponds to one function in feature_functions and each row corresponds to one token.

        fn1      fn2      fn3      ...
token1  val1(1)  val2(1)  val3(1)  ...
token2  val1(2)  val2(2)  val3(2)  ...
token3  val1(3)  val2(3)  val3(3)  ...
...     ...      ...      ...      ...
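
Since the table is meant to feed CRF++, one way to serialize it is sketched below. CRF++ reads one token per line with whitespace-separated columns, the answer tag in the last column for training data, and an empty line between sentences. The function name to_crfpp and the nested-list input shape are assumptions made for this sketch.

```python
def to_crfpp(sentences):
    """Serialize feature rows into CRF++-style text.

    sentences: list of sentences; each sentence is a list of rows;
    each row is a list of column values with the label last.
    """
    blocks = []
    for rows in sentences:
        # One token per line, columns separated by tabs.
        blocks.append("\n".join("\t".join(str(v) for v in row) for row in rows))
    # A blank line marks the boundary between sentences.
    return "\n\n".join(blocks) + "\n"
```

Keeping rows grouped by sentence matters here, because the sentence boundary is part of the format.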

Any additions or suggestions are welcome.

f_name: CASE

possible values: InitCaps, MixedCaps, AllCaps
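
A possible shape for this feature, as a sketch only: the issue lists three values, so the fallback value "NoCaps" for tokens without uppercase letters is an assumption added here, as are the function name and the decision to ignore non-letter characters.

```python
def case_feature(token):
    """Classify a token as InitCaps, MixedCaps, AllCaps, or NoCaps."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "NoCaps"  # hypothetical fallback, e.g. for numbers or punctuation
    if all(ch.isupper() for ch in letters):
        return "AllCaps"
    if token[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "InitCaps"
    if any(ch.isupper() for ch in letters):
        return "MixedCaps"
    return "NoCaps"
```

Note that under these rules a single uppercase letter such as "I" comes out as AllCaps; whether that should be InitCaps instead is a design choice left open.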

Problems in url

I have a question about using the URL. The URL does not give us the names; it is just a website, and I do not know how to use it. I tried:

    file = urllib2.urlopen('http://www.timeanddate.com/')
    data = file.read()
    file.close()
    tree=ET.parse(file)
    root=tree.getroot()

I did not get the correct root. I am confused.
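
Two things in the snippet may explain this, judging only from what is posted: the response is read and closed before `ET.parse(file)` runs, and an ordinary web page is HTML, which is often not well-formed XML, so an XML parser can reject it. A minimal sketch of one alternative is below, using the standard-library `html.parser` in Python 3 to collect visible text from an HTML string. The class and function names are hypothetical, and which parts of the page actually hold the names still has to be worked out by inspecting the page.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text chunks, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.chunks.append(text)


def visible_text(html):
    """Return the list of non-empty text chunks found in an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

In Python 3 the page itself could be fetched with `urllib.request.urlopen(url).read()` and decoded to a string before being passed to `visible_text`.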
