In this project, a Persian Question Answering (QA) system was created to ease access to information resources for doctors, health providers, and other users. To this end, a set of Persian documents related to drugs and diseases was collected. Because processing structured documents improves the performance of the QA system, all documents were converted into semi-structured documents.
The developed system consists of three main units:
- question processing
- document retrieval
- answer extraction
The question processing unit is the first part of the QA system and its most important module. It consists of four components that sequentially extract keywords/queries: Question Classifier, N-gram Tokenizer, Patterns Matching, and Advanced Tokenizer. These components use a dictionary of drug/disease names and a dictionary of keywords/queries. This process is shown in the following figure. If one component fails to extract keywords from the question, another component performs the extraction instead, depending on the condition of the question.
In this architecture, the question asked by the user is normalized, and then the name of the drug or disease that the question refers to is extracted using the Named Entity (NE) Dictionary. If a name is found, the question is sent to the Question Classifier component, which extracts the phrases that indicate the meaning of the question. Finally, using the concepts in the dictionary, these phrases are mapped to dictionary keywords, which become the extracted keywords. If the Question Classifier fails to extract any keywords, the question is sent to the N-gram module, which extracts keywords using the keywords dictionary.
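The dictionary-based steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary entries are hypothetical English placeholders (the real system uses Persian entries), and the function names `normalize`, `find_entity`, and `ngram_keywords` are invented for this example rather than taken from the project.

```python
from typing import Dict, List, Optional

# Hypothetical placeholder dictionaries; the actual system holds Persian entries.
NE_DICTIONARY: Dict[str, str] = {
    "aspirin": "drug",
    "ibuprofen": "drug",
    "diabetes": "disease",
}
KEYWORD_DICTIONARY: Dict[str, str] = {
    "side effects": "side_effects",
    "adverse effects": "side_effects",
    "dosage": "dosage",
    "symptoms": "symptoms",
}


def normalize(question: str) -> List[str]:
    """Lowercase the question, replace punctuation with spaces, and split it."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in question.lower()
    )
    return cleaned.split()


def find_entity(tokens: List[str]) -> Optional[str]:
    """Return the first token listed in the NE dictionary, if any."""
    for token in tokens:
        if token in NE_DICTIONARY:
            return token
    return None


def ngram_keywords(tokens: List[str], max_n: int = 3) -> List[str]:
    """Look up every n-gram (longest first) in the keyword dictionary."""
    found: List[str] = []
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            keyword = KEYWORD_DICTIONARY.get(phrase)
            if keyword is not None and keyword not in found:
                found.append(keyword)
    return found


tokens = normalize("What are the side effects of Aspirin?")
entity = find_entity(tokens)
keywords = ngram_keywords(tokens)
```

Mapping several surface phrases ("side effects", "adverse effects") to one canonical keyword is one simple way to model the phrase-to-keyword mapping; the project may organize its dictionary differently.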
If the question matches a pattern, the keywords are extracted by the Patterns Matching component. If none of the components can understand the question, the Advanced Tokenizer, which includes a list of specific stop words, tokenizes the question and extracts the keywords. When this process finishes successfully, the extracted keywords are passed to the next module, the Document Retrieval module.
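The fallback behaviour of the question processing module can be pictured as a chain in which each component is tried in turn and the first one that returns keywords wins. The sketch below is a minimal illustration, not the project's code: the four extractor functions are deliberately trivial stand-ins, and the stop-word list, the regular expression, and all function names are assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical stop-word list; the real system uses Persian stop words.
STOP_WORDS = {"what", "is", "are", "the", "of", "for", "a", "an", "in", "to"}

# Hypothetical pattern, used only to illustrate the pattern-matching step.
SAFETY_PATTERN = re.compile(r"^is \w+ safe during pregnancy$")

Extractor = Callable[[List[str]], List[str]]


def question_classifier(tokens: List[str]) -> List[str]:
    # Stand-in: recognizes only questions that mention "dosage".
    return ["dosage"] if "dosage" in tokens else []


def ngram_tokenizer(tokens: List[str]) -> List[str]:
    # Stand-in: checks bigrams against a single dictionary phrase.
    bigrams = {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}
    return ["side_effects"] if "side effects" in bigrams else []


def pattern_matcher(tokens: List[str]) -> List[str]:
    # Stand-in: one fixed question pattern mapped to one keyword.
    return ["pregnancy_safety"] if SAFETY_PATTERN.match(" ".join(tokens)) else []


def advanced_tokenizer(tokens: List[str]) -> List[str]:
    # Last resort: keep every token that is not a stop word.
    return [t for t in tokens if t not in STOP_WORDS]


PIPELINE: List[Tuple[str, Extractor]] = [
    ("question_classifier", question_classifier),
    ("ngram_tokenizer", ngram_tokenizer),
    ("pattern_matcher", pattern_matcher),
    ("advanced_tokenizer", advanced_tokenizer),
]


def extract_keywords(tokens: List[str]) -> Tuple[str, List[str]]:
    """Try each component in order; return the first non-empty result."""
    for name, extractor in PIPELINE:
        keywords = extractor(tokens)
        if keywords:
            return name, keywords
    return "none", []


result = extract_keywords("what is the storage temperature for insulin".split())
```

Keeping the components in an ordered list makes the fallback order explicit and easy to change, which is convenient if the conditions for choosing a component need tuning.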
In the Answer Extraction module, the appropriate answer is selected from the document retrieved in the previous step. Because the document has been converted to a structured form, answers can be extracted more accurately.
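To illustrate why a structured document helps, the sketch below assumes a hypothetical layout in which each document is a set of named sections and the extracted keywords correspond directly to section names. The layout, the field names, and the `extract_answer` function are assumptions for this example; the section texts are placeholders, not real medical content.

```python
from typing import Dict, List, Optional

# Hypothetical semi-structured document with placeholder section texts.
document: Dict[str, object] = {
    "title": "aspirin",
    "sections": {
        "dosage": "PLACEHOLDER: dosage text for this drug.",
        "side_effects": "PLACEHOLDER: side-effects text for this drug.",
    },
}


def extract_answer(doc: Dict[str, object], keywords: List[str]) -> Optional[str]:
    """Return the text of the first section whose name matches a keyword."""
    sections = doc["sections"]
    assert isinstance(sections, dict)
    for keyword in keywords:
        if keyword in sections:
            return sections[keyword]
    return None


answer = extract_answer(document, ["side_effects"])
```

With unstructured text, the same question would require locating the relevant passage inside a long body of prose; with named sections, answer selection reduces to a lookup.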