ubi:indexer: Digital Document Reverse Indexing and Semantic Search
ubi:indexer is a multilingual, scalable, linguistics- and semantics-based search engine for organizations. It offers reverse indexing, semantic enrichment and semantic search services over large amounts of unstructured, semi-structured or fully structured digital documents and data. By performing semantic analysis with the help of knowledge models and text-mining techniques, ubi:indexer enriches both the indexed documents and the queries to be executed with semantic information derived from ontological models developed specifically to represent the application domain.
ubi:indexer handles morphological variations, synonyms, context awareness, generalizations, concept matching, semantic matching and natural language queries, and allows end-users to enter their questions freely, without special formats or operators.
ubi:indexer provides end-users with all the tools needed to complete a set of tasks based on natural language processing technologies, with the ultimate goal of running semantic, complex and expressive queries on large datasets. These tasks are fundamental to the development of advanced text-processing services.
Technical Walkthrough
Specifically, the platform handles the automatic or semi-automatic processing of structured or unstructured digital documents and files, a process that comprises three stages:
Stage 1 – Data and metadata extraction: This stage processes the original format of the digital files in order to extract the essential text and the metadata of each document. By essential text we mean the plain text that can be extracted from any structured or unstructured digital file; any additional digital data in other formats, such as sound, images or comment tags, is outside the scope of the SES system and is not processed. By metadata we mean the plain-text elements attached to or embedded in a digital file. The metadata extracted during this first stage forms a closed list of fields, each accompanied by its respective value.
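As an illustration of Stage 1, the minimal sketch below pulls the essential text and a closed list of metadata fields out of an HTML file using only the Python standard library. The closed list of fields and the sample document are assumptions for the example; ubi:indexer's actual extraction pipeline and supported formats are not shown here.

```python
from html.parser import HTMLParser

class EssentialTextExtractor(HTMLParser):
    """Illustrative sketch: extract plain text plus a closed list of metadata."""

    METADATA_FIELDS = {"author", "description", "keywords"}  # assumed closed list

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.metadata = {}
        self._skip = 0  # depth inside non-content tags (script/style)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Keep only fields on the closed list, with their values.
            name = attrs.get("name", "").lower()
            if name in self.METADATA_FIELDS:
                self.metadata[name] = attrs.get("content", "")
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Collect the essential (plain) text, skipping script/style content.
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

doc = """<html><head>
<meta name="author" content="J. Doe">
<meta name="keywords" content="search, indexing">
<title>Report</title></head>
<body><p>Quarterly revenue grew in Athens.</p></body></html>"""

extractor = EssentialTextExtractor()
extractor.feed(doc)
essential_text = " ".join(extractor.text_parts)
```

Sound, images and comment tags would simply never reach `text_parts` or `metadata`, matching the scope restriction described above.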
Stage 2 – Named-entity recognition: During this stage, named-entity recognition is performed on the essential text extracted during the first stage. The task is carried out by two independent named-entity extraction subsystems within the primary NER system, described below:
Named-entity recognition based on statistical rules: This subsystem extracts named entities from the essential text using statistical rules and pre-trained probabilistic entity models; based on these models, it recognizes and categorizes the entities identified in the text. The models and the NER procedures are based on the maximum-entropy algorithm, a widely used approach in natural language processing tasks such as part-of-speech (POS) tagging, sentence detection, relationship extraction and sentiment analysis.
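To make the maximum-entropy idea concrete, the sketch below scores each candidate tag as a softmax over weighted binary features of a token. The feature set, weights and tag inventory are invented for illustration; a real system like the one described here would learn the weights from training data rather than hard-code them.

```python
import math

# Illustrative hand-set weights: (feature, tag) -> weight.
# A trained maxent model would estimate these from annotated text.
WEIGHTS = {
    ("is_capitalized", "ENTITY"): 1.8,
    ("is_capitalized", "OTHER"): -0.6,
    ("follows_title", "ENTITY"): 2.1,   # e.g. preceded by "Mr.", "Dr."
    ("follows_title", "OTHER"): -1.0,
    ("is_lower", "OTHER"): 1.2,
}
TAGS = ("ENTITY", "OTHER")

def maxent_probs(features):
    """p(tag | features) as a softmax over summed feature weights."""
    scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in features) for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(scores[t]) / z for t in TAGS}

def tag_token(token, prev_token=None):
    features = []
    if token[:1].isupper():
        features.append("is_capitalized")
    else:
        features.append("is_lower")
    if prev_token in ("Mr.", "Dr.", "Ms."):
        features.append("follows_title")
    probs = maxent_probs(features)
    return max(probs, key=probs.get)
```

For example, `tag_token("Smith", "Dr.")` accumulates evidence from both features and labels the token as an entity, while a lowercase common word is labeled as non-entity.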
Named-entity recognition based on vocabulary: Unlike the previous subsystem, this one relies neither on pre-trained entity models nor on machine-learning algorithms. It performs named-entity recognition by tokenizing the essential text into words and passing each one through a filter that extracts its stem. A customized analyzer then looks for an exact match between the stemmed word and the words in the vocabularies fed to the system.
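The vocabulary-based subsystem can be sketched as a tokenize-stem-match pipeline. The crude suffix-stripping stemmer and the two small vocabularies below are illustrative assumptions; the actual system uses its own stemming filter and externally supplied vocabulary lists.

```python
import re

# Toy suffix-stripping stemmer standing in for the real stemming filter.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Illustrative vocabularies, stemmed the same way the input tokens are,
# so that the lookup is an exact match on stems.
VOCABULARIES = {
    "ORGANIZATION": {stem(w) for w in ["ministry", "ministries", "council"]},
    "LOCATION": {stem(w) for w in ["athens", "thessaloniki"]},
}

def recognize(text):
    """Tokenize, stem each token, and match it against every vocabulary."""
    entities = []
    for token in re.findall(r"\w+", text):
        stemmed = stem(token)
        for label, vocab in VOCABULARIES.items():
            if stemmed in vocab:
                entities.append((token, label))
    return entities
```

Because matching happens on stems, morphological variants such as "Ministries" hit the same vocabulary entry as "ministry" without any trained model.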
Stage 3 – Data indexing: In this final stage, the data is processed, analyzed and stored in its final format. More specifically, all the essential information extracted from the primary data source is stored in a persistent location in a specific structured format, together with all the indexes defined during this process. The data is then in a form on which subject indexing can be performed.
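A minimal sketch of this stage, under the assumption that each document record carries its essential text and recognized entities: the records are kept as the stored structured format, and an inverted (reverse) index maps every term back to the documents containing it. Record fields and document IDs are invented for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Store structured records and build an inverted index over them."""
    inverted = defaultdict(set)   # term -> set of document ids
    store = {}                    # stands in for the persistent location
    for doc_id, record in documents.items():
        store[doc_id] = record
        # Index the essential text...
        for term in record["text"].lower().split():
            inverted[term].add(doc_id)
        # ...and the entities recognized in Stage 2.
        for entity, _label in record.get("entities", []):
            inverted[entity.lower()].add(doc_id)
    return store, inverted

docs = {
    "d1": {"text": "revenue grew in Athens",
           "entities": [("Athens", "LOCATION")]},
    "d2": {"text": "council report on revenue",
           "entities": [("council", "ORGANIZATION")]},
}
store, inverted = build_index(docs)
```

After this step a query term resolves to document IDs in one lookup, which is the form the search operations below rely on.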
Supported Search Operations and Semantic Queries
Once the aforementioned technical implementation and deployment stages have been completed, the end-user can run a range of searches and semantic queries on the processed digital data through the dedicated user interface of the semantically enriched search engine that ubi:indexer provides. Specifically, ubi:indexer supports the following operations:
Simple (basic) search: A search based solely on the end-user's input, which is usually raw text.
Faceted search: Lets the end-user narrow results by preferred topics, with the data categorized according to the indexes that define it.
Advanced search: The results are post-processed according to the user's preferences; the end-user can filter the result set by criteria of interest. Indicative criteria include ascending/descending sorting, range filtering over a result set, field filtering, statistical or complex mathematical function filtering, semantic distance between words, and grouping by max/min/mean value.
Complex search: A combination of all the search types above.
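The search types above compose naturally: a simple keyword lookup can be narrowed by a facet and then by an advanced range filter, which together form a complex search. The documents, field names and filters below are illustrative assumptions, not ubi:indexer's query API.

```python
# Toy indexed documents with a text field and two facet/filter fields.
DOCS = {
    "d1": {"text": "annual revenue report", "topic": "finance", "year": 2021},
    "d2": {"text": "revenue forecast",      "topic": "finance", "year": 2023},
    "d3": {"text": "staff report",          "topic": "hr",      "year": 2023},
}

def simple_search(query):
    """Simple search: match every query term against the text field."""
    terms = query.lower().split()
    return {d for d, rec in DOCS.items()
            if all(t in rec["text"].split() for t in terms)}

def faceted(hits, field, value):
    """Faceted search: keep only hits in the chosen category."""
    return {d for d in hits if DOCS[d][field] == value}

def advanced(hits, field, lo, hi):
    """Advanced search: range filtering on a result set."""
    return {d for d in hits if lo <= DOCS[d][field] <= hi}

# Complex search: chain the three operations.
hits = advanced(faceted(simple_search("revenue"), "topic", "finance"),
                "year", 2022, 2024)
```

Each stage only shrinks the previous result set, so the operations can be chained in any order without re-querying the index.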
In addition, ubi:indexer provides the end-user with the following intelligent and useful features:
- Result Highlighting
- Spelling Check
- Multi-Format Result Exporting (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)
- End-users’ Permission Search Control
- Spelling auto-suggestion functionality on users’ queries
- Auto-complete functionality on users’ queries
- Search History Record
- Automatic Save of recent search results
- Performance Optimization
- Geospatial Search
- Adaptive Similarity Model per search field
- Adaptation and Modification of Special and Lexicon lists (synonym lists, protected list, named-entity lists, stop-word lists)