Document & verbatim classification

Automatic text classification is an increasingly important technology for business development. Put simply, a text classification system automatically assigns one or more user-defined categories to a text provided as input.

Depending on the organization’s needs, a text classification task can be described along three orthogonal dimensions:

  • Topical vs. functional classification. If your system needs to classify documents according to their content, you are in the domain of topical classification. For instance, distinguishing documents about cars from documents about motorbikes is a case of topical classification. If, on the other hand, you have to tell invoices from receipts, or to classify documents according to their security level irrespective of their content, you are in a case of functional classification.
  • Short vs. long documents. Automatically classifying a short, often badly written document and classifying a full document in a formal style are two radically different tasks. Typical short documents are call center transcriptions and user-entered text such as tickets, user queries, tweets, etc. Long documents are all the remaining document types, composed of full sentences and with a stricter structure.
  • Human-driven classification vs. learning-based classification. If at some stage your organization has already manually classified a set of documents, it is probably time to switch to learning-based classification: simply let our system automatically learn the criteria that were used in the past to classify documents and apply them to future documents. It may, however, be the case that you have only just decided to introduce a classification structure to your document base, or that you make use of several ever-changing classification systems. In this case, the best option is human-driven classification: just code a set of simple rules with a few examples per category, and the system will be able to infer the optimal category for each document.
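
To make the human-driven option above more concrete, here is a minimal sketch in Python. It is not the Ho2S implementation: the plan format, the category names, the stopword list and the scoring scheme are all assumptions made for illustration. It only shows how a handful of keywords and one example per category can be enough to pick a category for a new text.

```python
import re
from collections import Counter

# Hypothetical classification plan: a few keywords (the "rules") and
# one example per category. All names and data here are illustrative.
PLAN = {
    "cars": {
        "keywords": ["car", "sedan", "hatchback"],
        "examples": ["The new sedan has four doors and a large trunk."],
    },
    "motorbikes": {
        "keywords": ["motorbike", "motorcycle", "scooter"],
        "examples": ["She rides her motorcycle with a helmet on."],
    },
}

STOPWORDS = {"a", "an", "and", "has", "her", "is", "my", "on", "she", "the", "with"}
KEYWORD_WEIGHT = 3  # assumption: hand-written keywords count more than example words


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_profiles(plan):
    """Turn each category's keywords and examples into a weighted word profile."""
    profiles = {}
    for category, spec in plan.items():
        profile = Counter()
        for keyword in spec["keywords"]:
            for token in tokenize(keyword):
                profile[token] += KEYWORD_WEIGHT
        for example in spec["examples"]:
            profile.update(tokenize(example))
        profiles[category] = profile
    return profiles


def classify(text, profiles):
    """Return the best-scoring category, or None when nothing matches."""
    tokens = tokenize(text)
    scores = {cat: sum(profile[t] for t in tokens) for cat, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


profiles = build_profiles(PLAN)
print(classify("My scooter needs a new tire", profiles))
print(classify("The hatchback is parked outside", profiles))
```

With this toy plan, the first text is assigned to motorbikes and the second to cars. A learning-based setup would replace the hand-written plan with profiles learned from documents that were already classified in the past.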

At Ho2S, we are aware of all the inherent complexities of classification tasks. Our systems cover every combination of the dimensions described above and provide an optimal classification in all cases.

Moreover, our classification system is language-aware: we provide classification for French, English, Italian, German and some other languages, but we do not claim to be language-independent. The algorithmic classification mechanism emulates the process carried out by a human operator when classifying a segment of text, so knowledge of the operating language is clearly a prerequisite.

Domain awareness is also a crucial feature of our text classification solution. In all available combinations, it is always possible to create a domain-adapted knowledge model. This means that the same word will not be treated the same way in different domains: for instance, the word “crane” in the construction domain has a different semantic value from the word “crane” in the nature domain.
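
The “crane” example above can be pictured with a very small Python sketch. The lexicons, the concept labels and the interpret function are hypothetical and purely illustrative; a real domain-adapted knowledge model is certainly richer than a lookup table.

```python
# Hypothetical per-domain lexicons: the same surface word maps to a
# different concept depending on the domain the model was built for.
DOMAIN_LEXICONS = {
    "construction": {"crane": "lifting_machine"},
    "nature": {"crane": "bird"},
}


def interpret(word, domain):
    """Look up the concept a word stands for in the given domain."""
    lexicon = DOMAIN_LEXICONS.get(domain, {})
    return lexicon.get(word.lower(), "unknown")


print(interpret("crane", "construction"))
print(interpret("Crane", "nature"))
```

A classifier built on the construction lexicon would then treat “crane” as evidence for machinery-related categories, while one built on the nature lexicon would treat it as evidence for wildlife-related categories.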

Modern information systems are flooded with short texts to be classified. These are typically tickets from call centers and Customer Relationship Management (CRM) systems, text entered in response to open questions in user-submitted questionnaires, call center transcriptions, tweets, SMS messages, etc. We offer the best technology to tackle the problem of automatically classifying these texts. Our hybrid technology allows users to rapidly design the classification plan best suited to their needs and have the system interpret it to classify any document. Two options are available:

  • Local Classification: The classifier is installed in client-server mode on the customer’s premises. All nonfunctional aspects, such as security, load balancing, fault tolerance, etc., are dealt with by following the standard procedures in use by the customer.
  • Service-Based Classification: This is the most “agile” way of obtaining high-quality results with minimal integration costs and time. A client application sends our servers a “classification design document”, i.e. a document containing the minimal information necessary to learn a classifier (the classification hierarchy itself, the description of the categories, a set of manually chosen examples for each category, etc.). The server returns an ID. From then on, the client can use this ID to classify new documents in the selected classification hierarchy.
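
The service-based workflow above (send a design document, receive an ID, then classify with that ID) can be sketched as follows. This is an in-memory mock in Python, not the actual Ho2S service or its API: the class name, the method names, the design-document layout and the word-overlap “learning” are all assumptions, and the hierarchy is flattened to a single level for brevity.

```python
import re
import uuid


def words(text):
    """Return the set of lowercase alphabetic tokens in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


class MockClassificationService:
    """In-memory stand-in for a remote classification service (hypothetical)."""

    def __init__(self):
        self._models = {}

    def register(self, design_document):
        """Build a toy model from the design document and return its ID."""
        categories = design_document["categories"]
        if not categories:
            raise ValueError("design document needs at least one category")
        model = {}
        for name, spec in categories.items():
            vocabulary = words(spec["description"])
            for example in spec["examples"]:
                vocabulary |= words(example)
            model[name] = vocabulary
        model_id = uuid.uuid4().hex
        self._models[model_id] = model
        return model_id

    def classify(self, model_id, text):
        """Classify a text with a previously registered model."""
        model = self._models[model_id]  # raises KeyError for an unknown ID
        tokens = words(text)
        return max(model, key=lambda name: len(model[name] & tokens))


# Hypothetical design document: descriptions plus a few chosen examples.
design = {
    "categories": {
        "billing": {
            "description": "invoice payment charge",
            "examples": ["I was charged twice on my invoice"],
        },
        "delivery": {
            "description": "shipping parcel delivery",
            "examples": ["My parcel has not arrived yet"],
        },
    }
}

service = MockClassificationService()
model_id = service.register(design)
print(service.classify(model_id, "Where is my parcel?"))
```

In a real deployment the two calls would travel over the network, but the client-side logic stays the same: register once, keep the returned ID, and reuse it for every new document.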

Thanks to our rich classification matrix, we are able to provide accurate product classification based on simple product descriptions. This is a traditionally difficult task, as it often involves many hundreds of categories that are distinguished only by linguistic nuances. In this case, we offer a semi-structured classification system that combines automatically learned information with hand-coded non-defeasible rules, to ensure that no “unpredictable” results are obtained.
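
One way to picture the combination of learned information and non-defeasible rules described above is the Python sketch below. It is an illustration under stated assumptions, not the Ho2S system: the rule, the category names and the keyword-based learned_scores function (a placeholder for a trained model) are all hypothetical.

```python
import re

# Hypothetical hand-coded rules: when a pattern matches, its category is
# final and the learned scores are never consulted.
HARD_RULES = [
    (re.compile(r"\breplacement cable\b", re.IGNORECASE), "cables"),
]

# Placeholder for learned information: per-category hint words.
LEARNED_HINTS = {
    "phone_chargers": ["charger", "charging", "adapter"],
    "cables": ["cable", "cord"],
    "headphones": ["headphones", "earbuds"],
}


def learned_scores(description):
    """Stand-in for a trained model: count hint words per category."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return {
        category: sum(tokens.count(hint) for hint in hints)
        for category, hints in LEARNED_HINTS.items()
    }


def classify_product(description):
    """Apply hard rules first, then fall back to the learned scores."""
    for pattern, category in HARD_RULES:
        if pattern.search(description):
            return category, "rule"
    scores = learned_scores(description)
    return max(scores, key=scores.get), "learned"


print(classify_product("Replacement cable for fast charging adapter charger"))
print(classify_product("Wireless earbuds with noise cancelling"))
```

For the first description, the learned scores alone would favor phone_chargers, but the hard rule forces the result to cables. The second description matches no rule, so the learned scores decide.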

Classifying long documents is the most “traditional” form of document classification, the one that is applied, for instance, to news, web pages, corporate documents, and so on. Out-of-the-box products can already achieve reasonable topic classification performance on these kinds of documents, as the presence of large quantities of text can drive the learning process. What we add on top of this is a highly accurate functional classification layer, which is of paramount importance in driving, for instance, corporate workflows. Such a functional classification can detect document type, security and privacy level, sources, and other features that would be difficult to fit into a standard topic classification system.

  • Availability of topic classification and functional classification
  • Availability of rule-based and learning-based classification
  • Effective on both long and short documents
  • Available locally or as a web service
  • Language aware
  • Domain aware
  • Fast integration