With the growth of unstructured text data on the Internet, mainly the result of human interaction on Web 2.0 and social networks, finding a way to automatically process this data and extract knowledge from it seems indispensable. Despite its unstructured format, this data contains valuable knowledge that can be extracted using knowledge discovery and machine learning techniques. There has been great progress in natural language processing tasks such as
- Sentiment analysis,
- Opinion mining,
- Topic identification,
- Automatic machine translation,
- Named entity recognition,
- Part-of-speech tagging,
- Parsing,
- Information extraction,
- Question answering,
- Paraphrase detection,
- etc.
In most NLP tasks, we first develop an algorithm and then convert our data into a form that can be fed to it. This step is called feature engineering, and it is very time consuming. In text data, words are usually taken as the features. This approach has two shortcomings: first, word order may be lost, and second, the resulting feature vectors are sparse, which affects training time.
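Both shortcomings can be seen in a small bag-of-words sketch. The corpus, the function names, and the whitespace tokenizer below are assumptions made for illustration only; they are not taken from the project.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def sparsity(vector):
    """Fraction of entries in the vector that are zero."""
    return sum(1 for value in vector if value == 0) / len(vector)

# Hypothetical toy corpus; a real corpus has a far larger vocabulary.
corpus = [
    "the dog bit the man",
    "the man bit the dog",
    "deep networks learn word vectors",
]

# One feature per distinct word in the corpus.
vocabulary = sorted({word for s in corpus for word in s.lower().split()})
vectors = [bag_of_words(s, vocabulary) for s in corpus]

# Shortcoming 1: the first two sentences mean different things,
# but their feature vectors are identical because order is discarded.
same_vector = vectors[0] == vectors[1]

# Shortcoming 2: each sentence uses only part of the vocabulary,
# so many entries are zero, and the share grows with vocabulary size.
zero_fractions = [sparsity(v) for v in vectors]
```

Even with this nine-word vocabulary, roughly half of each vector is zero. With a realistic vocabulary of tens of thousands of words, almost every entry would be.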
The aim of this project is to find a way to extract features automatically from Persian text data. We found deep learning to be a way to deal with this problem. A neural network with more than one hidden layer is called a deep network. In this method, each word is described by a numerical vector, called a distributed representation or word vector. This representation carries semantic and syntactic information about the word. Sentences are concatenations of words, so if words can be described by such vectors, sentences can be too, by combining the vectors of their words. The combination methods range from simple mathematical operations, such as vector addition or multiplication, to recurrent neural networks and recursive auto-encoders. These representations help with most natural language processing problems, such as POS tagging, NER, topic identification, machine translation, and automatic text summarization.
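The simplest combination methods mentioned above, vector addition and element-wise multiplication, can be sketched as follows. The three-dimensional vectors and the function names are made up for illustration; real word vectors are learned from data and typically have tens to hundreds of dimensions.

```python
from functools import reduce

# Hypothetical toy word vectors, not learned from any corpus.
word_vectors = {
    "good": [0.5, 1.0, -0.5],
    "movie": [1.0, 0.0, 2.0],
    "bad": [-0.5, -1.0, 0.5],
}

def compose_add(words, vectors):
    """Sentence vector as the element-wise sum of its word vectors."""
    return [sum(components) for components in zip(*(vectors[w] for w in words))]

def compose_multiply(words, vectors):
    """Sentence vector as the element-wise product of its word vectors."""
    def product(a, b):
        return [x * y for x, y in zip(a, b)]
    return reduce(product, (vectors[w] for w in words))

sentence = ["good", "movie"]
added = compose_add(sentence, word_vectors)            # [1.5, 1.0, 1.5]
multiplied = compose_multiply(sentence, word_vectors)  # [0.5, 0.0, -1.0]
```

Both operations are commutative, so they give the same sentence vector for any ordering of the same words. Order-sensitive models such as recurrent neural networks and recursive auto-encoders address that limitation.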