Most of valuable data around us is in unstructured format. Discovering worthy knowledge from text which is kind of unstructured data is an important task. Text mining (or text analytics), refers to the process of extracting information from text using machine learning algorithms. Research on text mining n DSPLab covers text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization and document similarity focusing on Persian language.
Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights. By applying advanced analytical techniques, such as Naïve Bayes, Support Vector Machines (SVM), and other deep learning algorithms, companies are able to explore and discover hidden relationships within their unstructured data.
Text is a one of the most common data types within databases. Depending on the database, this data can be organized as:
- Structured data: This data is standardized into a tabular format with numerous rows and columns, making it easier to store and process for analysis and machine learning algorithms. Structured data can include inputs such as names, addresses, and phone numbers.
- Unstructured data: This data does not have a predefined data format. It can include text from sources, like social media or product reviews, or rich media formats like, video and audio files.
- Semi-structured data: As the name suggests, this data is a blend between structured and unstructured data formats. While it has some organization, it doesn’t have enough structure to meet the requirements of a relational database. Examples of semi-structured data include XML, JSON and HTML files.