I want to experiment using HMMs for Parts-of-speech tagging but I am not sure how to


Example of input data format for HMMs (mlpack, closed)

kiner-shah commented on September 27, 2024

Example of input data format for HMMs

from mlpack.

Comments (6)

rcurtin commented on September 27, 2024

states = total no. of tags in training data ?

Yes, that is correct.

also, you mentioned for my case the observations will be multi-dimensional vectors, I didn't understand that

Each column of a sequence (sentence) should correspond to one word. You can choose how you want to represent each word. One option is to store each word as a numeric index (e.g. "hello" corresponds to 0, "goodbye" corresponds to 1, and so on), but that will only work when the emission distribution is DiscreteDistribution. If you are using GaussianDistribution or GMM, then you will want to represent each word as a numeric vector, e.g. an embedding or a one-hot encoding.
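The two word encodings mentioned here can be sketched in plain C++ roughly as follows. This is only an illustration under stated assumptions: Vocabulary, encodeSentence, and oneHot are hypothetical helper names, not part of mlpack, and standard library containers stand in for Armadillo types. In mlpack, each encoded word would end up as one column of the sequence matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of mlpack): assigns each distinct word a
// numeric index in order of first appearance.
class Vocabulary
{
 public:
  // Return the index of `word`, adding it to the vocabulary if it is new.
  std::size_t indexOf(const std::string& word)
  {
    auto it = indices.find(word);
    if (it != indices.end())
      return it->second;
    const std::size_t next = indices.size();
    indices.emplace(word, next);
    return next;
  }

  std::size_t size() const { return indices.size(); }

 private:
  std::unordered_map<std::string, std::size_t> indices;
};

// Index encoding: one number per word. This is the representation that
// suits a discrete emission distribution.
std::vector<std::size_t> encodeSentence(Vocabulary& vocab,
                                        const std::vector<std::string>& words)
{
  std::vector<std::size_t> encoded;
  encoded.reserve(words.size());
  for (const std::string& w : words)
    encoded.push_back(vocab.indexOf(w));
  return encoded;
}

// One-hot encoding: a vector of length `vocabSize` with a single 1.0 at
// position `index`. Each such vector would be one column of a sequence.
std::vector<double> oneHot(std::size_t index, std::size_t vocabSize)
{
  assert(index < vocabSize);
  std::vector<double> column(vocabSize, 0.0);
  column[index] = 1.0;
  return column;
}
```

With the sentence "hello goodbye hello", encodeSentence assigns "hello" the index 0 and "goodbye" the index 1, matching the example in the comment. Note that one-hot vectors grow with the vocabulary, so an embedding may be more practical for large vocabularies.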

of type size_t - is it to represent states as numbers?

Yes, each state is represented by its index.

Does each row = states for each sequence?

Yes, each row vector in stateSeq should correspond to the list of hidden states for each word in the corresponding sentence.
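A quick consistency check along these lines can catch misaligned labels before training. This is a sketch, not mlpack code: labelsMatchObservations is a hypothetical helper, and plain std::vector objects stand in for the arma::mat sequences and the size_t row vectors described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check (not part of mlpack).
// sequenceLengths[i] is the number of words in sentence i;
// stateSeq[i] holds one state index per word of sentence i.
bool labelsMatchObservations(
    const std::vector<std::size_t>& sequenceLengths,
    const std::vector<std::vector<std::size_t>>& stateSeq,
    std::size_t numStates)
{
  assert(numStates > 0);  // An HMM needs at least one state.

  // There must be exactly one row of states per sentence.
  if (sequenceLengths.size() != stateSeq.size())
    return false;

  for (std::size_t i = 0; i < stateSeq.size(); ++i)
  {
    // One state per word in the corresponding sentence.
    if (stateSeq[i].size() != sequenceLengths[i])
      return false;

    // States are indices, so each must be smaller than the state count.
    for (std::size_t state : stateSeq[i])
      if (state >= numStates)
        return false;
  }
  return true;
}
```

For POS tagging, numStates would be the number of distinct tags, and each state index would identify one tag.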


rcurtin commented on September 27, 2024

So, the answer to the question depends on whether you are doing this from C++, or from a binding or command-line program. In both cases, it could be helpful to take a look at the tests to get an idea of some examples, although I do understand that looking at test code is not always the easiest:

C++ interface tests: https://github.com/mlpack/mlpack/blob/master/src/mlpack/tests/hmm_test.cpp
command-line/binding interface tests: https://github.com/mlpack/mlpack/blob/master/src/mlpack/tests/main_tests/hmm_train_test.cpp (note that there are also tests for the hmm_viterbi, hmm_loglik, and hmm_predict bindings too)

In short, an HMM is trained on a set of sequences (optionally with the hidden state for each observation, if you know it; that is not required). In C++, this is represented as a std::vector<arma::mat>, where each element of the outer std::vector is one sequence, and each arma::mat holds one observation of that sequence per column.

In really simple cases, each observation might be a single scalar (e.g. the temperature); in this case, each arma::mat sequence would have 1 row (the temperature) and however many columns were in that sequence. Each sequence can have a different length (number of columns). In more complex cases, each observation may actually be a multidimensional vector; I think that will be the case with your parts-of-speech tagging.
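The layout described above can be sketched without Armadillo. In the snippet below, Sequence is a hypothetical stand-in for arma::mat (it is not an mlpack or Armadillo type) that stores values column by column, and makeTemperatureData is an invented example with made-up temperatures. It only illustrates the shape of the data: one observation per column, and sequences of different lengths in the same data set.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for arma::mat. `rows` is the dimensionality of one
// observation; each column is one observation (one time step).
struct Sequence
{
  std::size_t rows;
  std::vector<double> values;  // Column-major: one observation after another.

  explicit Sequence(std::size_t dimensions) : rows(dimensions)
  {
    assert(rows > 0);
  }

  // Number of observations in this sequence.
  std::size_t cols() const { return values.size() / rows; }

  // Append one observation as a new column.
  void appendObservation(const std::vector<double>& observation)
  {
    assert(observation.size() == rows);
    values.insert(values.end(), observation.begin(), observation.end());
  }

  double at(std::size_t row, std::size_t col) const
  {
    return values[col * rows + row];
  }
};

// A data set is one Sequence per element. Here each observation is a single
// scalar, so every sequence has 1 row, but the column counts differ.
std::vector<Sequence> makeTemperatureData()
{
  Sequence first(1);
  for (double t : {18.5, 19.0, 21.5})
    first.appendObservation({t});

  Sequence second(1);
  for (double t : {17.0, 16.5})
    second.appendObservation({t});

  return {first, second};
}
```

A multidimensional observation would simply use more rows, for example Sequence(50) for 50-dimensional word embeddings, with one column per word.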

In C++, it is up to you to use data::Load() to load each sequence into its own matrix and pack those matrices into a std::vector<arma::mat>. If you only have one sequence, the std::vector needs only one element.

If you are using the bindings (e.g. command-line mlpack_hmm_train), you can pass in a single sequence with the input_file option, where that file is just a matrix that contains a single sequence as described above. Or, if you specify the batch option, then it is expected that the file specified by the input_file option contains a list of filenames, each of which specifies one sequence.

I hope this helps! It actually is on the short-term TODO list to clarify the expectations of these methods, so hopefully that should help out. Let me know if I can clarify anything. 👍


kiner-shah commented on September 27, 2024

@rcurtin Thanks for the above explanation. However, it seems I still do have some confusion regarding the inputs to Train().
For my use case, from what I understood:

Sequence = sentence
Words = observations
States = POS tags
Transition probability: probability of current state C given previous state was P
Emission probability: probability of observation O given the current state C.
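When the tag of every word is known, these two probabilities can be estimated by simple counting. The sketch below shows that idea in plain C++. It is not mlpack's implementation, and estimateTransitions, estimateEmissions, and normalizeRows are hypothetical names. It assumes words and tags are already encoded as indices and applies no smoothing, so pairs never seen in the data get probability 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using IndexSeq = std::vector<std::size_t>;

// Divide each row by its sum so it becomes a probability distribution.
// Rows with no counts are left as all zeros.
void normalizeRows(Matrix& m)
{
  for (std::vector<double>& row : m)
  {
    double sum = 0.0;
    for (double v : row)
      sum += v;
    if (sum > 0.0)
      for (double& v : row)
        v /= sum;
  }
}

// Result[p][c] estimates P(current state c | previous state p).
Matrix estimateTransitions(const std::vector<IndexSeq>& stateSeqs,
                           std::size_t numStates)
{
  Matrix counts(numStates, std::vector<double>(numStates, 0.0));
  for (const IndexSeq& states : stateSeqs)
  {
    for (std::size_t t = 1; t < states.size(); ++t)
    {
      assert(states[t - 1] < numStates && states[t] < numStates);
      counts[states[t - 1]][states[t]] += 1.0;
    }
  }
  normalizeRows(counts);
  return counts;
}

// Result[c][o] estimates P(observation o | current state c).
Matrix estimateEmissions(const std::vector<IndexSeq>& wordSeqs,
                         const std::vector<IndexSeq>& stateSeqs,
                         std::size_t numStates,
                         std::size_t numWords)
{
  assert(wordSeqs.size() == stateSeqs.size());
  Matrix counts(numStates, std::vector<double>(numWords, 0.0));
  for (std::size_t i = 0; i < wordSeqs.size(); ++i)
  {
    // Each sentence needs exactly one tag per word.
    assert(wordSeqs[i].size() == stateSeqs[i].size());
    for (std::size_t t = 0; t < wordSeqs[i].size(); ++t)
    {
      assert(stateSeqs[i][t] < numStates && wordSeqs[i][t] < numWords);
      counts[stateSeqs[i][t]][wordSeqs[i][t]] += 1.0;
    }
  }
  normalizeRows(counts);
  return counts;
}
```

The row and column orientation here is a choice made for this sketch; a library may store its transition matrix the other way around, so check its documentation before comparing values.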

I will be using C++ for experimentation (but I would love to know how to use CLI bindings as well).
From what I have thought:

I will call the constructor: HMM(states) // states = total no. of tags in training data ?

Then I will call Train():

Train(dataSeq,  // vector of size = no. of sentences in data set
                // each element of vector is a matrix, where I am confused on what will be the rows and columns and what will each element of matrix hold
                // also, you mentioned for my case the observations will be multi-dimensional vectors, I didn't understand that
      stateSeq  // vector of rows (of type size_t - is it to represent states as numbers?) with arbitrary number of columns
                // Does each row = states for each sequence?
);


mlpack-bot commented on September 27, 2024

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍


kiner-shah commented on September 27, 2024

@rcurtin I was able to implement POS Tagging with HMMs successfully.
I made a YouTube video explaining the steps from start to end.
Also, you can find the code here.

Thanks for the guidance.


rcurtin commented on September 27, 2024

Awesome! I will point people towards that in the future when there are questions about the HMM code. Also, if you had interest in adapting that to the examples repository I think it would be nice to add, but don't feel obligated (it's easy enough to link to the repository you have).

