Outline
· Project Aim:
· Develop a conditional random field (CRF) model that assesses protein functionality using a protein family.
· The protein family acts as a reference database for scoring new protein sequences for functionality.
· What are Graphical CRFs?
· More expressive than HMMs because feature functions can depend on arbitrary, overlapping properties of the observation sequence.
· Undirected graphical model.
· Has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence.
· Linear-chain CRFs, like HMMs, only impose a dependency on the previous element, whereas general CRFs can impose dependencies on arbitrary elements.
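To make the "single exponential model" concrete, here is a minimal sketch of a linear-chain CRF in plain Python. It is an illustration under stated assumptions, not the project's implementation: the two labels, the `("emit", ...)` / `("trans", ...)` feature keys, and all function names are hypothetical. It computes the probability of a label sequence given the observations, normalizing with the forward algorithm, and includes a brute-force normalizer for comparison on short sequences.

```python
import itertools
import math

# Hypothetical label set, for illustration only.
LABELS = ["F", "N"]


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def score(prev_label, label, obs, t, weights):
    """Weighted sum of the feature functions that fire at position t.

    Assumed features: one emission feature per (label, symbol) pair and
    one transition feature per (previous label, label) pair.
    """
    s = weights.get(("emit", label, obs[t]), 0.0)
    if prev_label is not None:
        s += weights.get(("trans", prev_label, label), 0.0)
    return s


def sequence_score(labels, obs, weights):
    """Unnormalized log-score of a complete label sequence."""
    total, prev = 0.0, None
    for t, lab in enumerate(labels):
        total += score(prev, lab, obs, t, weights)
        prev = lab
    return total


def log_partition(obs, weights):
    """log Z(x) via the forward algorithm; assumes obs is non-empty."""
    alpha = {y: score(None, y, obs, 0, weights) for y in LABELS}
    for t in range(1, len(obs)):
        alpha = {
            y: logsumexp([alpha[yp] + score(yp, y, obs, t, weights) for yp in LABELS])
            for y in LABELS
        }
    return logsumexp(list(alpha.values()))


def all_label_sequences(n):
    """Every possible label sequence of length n (exponential in n)."""
    return list(itertools.product(LABELS, repeat=n))


def brute_force_log_partition(obs, weights):
    """log Z(x) by explicit enumeration; only feasible for short sequences."""
    return logsumexp([sequence_score(ys, obs, weights) for ys in all_label_sequences(len(obs))])


def conditional_prob(labels, obs, weights):
    """p(labels | obs) under the globally normalized exponential model."""
    return math.exp(sequence_score(labels, obs, weights) - log_partition(obs, weights))
```

The forward pass costs on the order of T·|Y|² operations for a sequence of length T with |Y| labels, whereas the brute-force sum grows as |Y|^T. A general CRF with arbitrary dependencies does not admit this simple recursion.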
· Applications of CRFs
· Natural language processing
· Part-of-speech tagging
· Named entity recognition
· Sequence prediction
· Gene prediction
· CRF options
· RNNSharp: CRFs based on recurrent neural networks
· CRF-ADF: Linear-chain CRFs with fast online ADF training
· CRFSharp: Linear-chain CRFs
· GCO: CRF with submodular energy functions
· DGM: General CRFs
· HCRF library: Hidden-state CRFs
· PyStruct: Structured Learning and prediction library in Python
· Advantages
· Design is flexible
· No strict independence assumptions on the observations, unlike HMMs
· Avoids the label bias problem of MEMMs
· Computes the conditional probability of the output nodes globally, not node by node
· Computes the joint probability distribution of the entire label sequence given the observation sequence
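The label bias point can be illustrated by comparing where each model normalizes. The following is a sketch in standard notation; the symbols f_k (feature functions), λ_k (weights), and Z (normalizers) are assumptions introduced here and do not appear in the slides.

```latex
% MEMM: normalized locally, once per previous state and position
p_{\mathrm{MEMM}}(y \mid x)
  = \prod_{t=1}^{T}
    \frac{\exp\big(\sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\big)}
         {Z(y_{t-1}, x, t)}

% Linear-chain CRF: normalized once, over entire label sequences
p_{\mathrm{CRF}}(y \mid x)
  = \frac{1}{Z(x)}
    \exp\Big(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y'_{t-1}, y'_t, x, t)\Big)
```

Because each MEMM factor must sum to one over the successors of a state, states with few outgoing transitions pass their probability mass on almost regardless of the observation. The single global Z(x) of the CRF removes that constraint.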
· Disadvantages
· Training is computationally expensive
· Difficult to re-train the model when newer data becomes available