Autocategorization in Records Management

Automatic categorization of records is a powerful capability, but it is not well understood and must be implemented carefully if it is to succeed. Autocategorization “attempts to assign electronic records to either predefined file structures or to self-defined categories through computer-based processes” (Lubbes 2003, 60). Its objective is to recognize the concepts that group similar documents together while excluding documents that are not relevant to a search query. It is designed to facilitate contextual searching by delineating the relationships that exist within and between topics in content. When properly established, autocategorization software learns to file documents in either a predefined taxonomy or user-defined categories.

Santangelo (2009) notes that accurate classification allows organizations to “retain [records], place legal holds on it, and make reasonable disposition decisions about it, thus helping to minimize the significant legal costs and risks associated with continuing to store it unnecessarily” (23). Chester (2004) notes that deploying autocategorization for records management differs from deploying it for other types of material, because records management “requires great accuracy because the cost of misfiling a record is greater than when sorting press releases or magazine articles” (17). Further complicating matters, records file plans are more complex than the category schemes found in other applications and are determined externally by the organization, rather than derived from the documents themselves (Chester 2004).

Autocategorization uses either pattern-based systems or rule-based systems. Pattern-based systems use word patterns and concepts to associate the records with the file categories. In essence, the system learns how to distinguish between concepts using sophisticated algorithms. The four basic techniques are k-nearest neighbor, Bayesian, neural network, and support vector machines (Lubbes 2003). Other techniques include “clustering of set of documents based on similarities,…sophisticated linguistic inferences, the use of pre-existing sets of categories, and seeding categories with keywords” (Reamy 2002, 17).
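As an illustration of the pattern-based approach, the following is a minimal sketch of one of the four techniques named above, a Bayesian (naive Bayes) classifier, written in plain Python. It is not drawn from the cited sources: the training documents, category names, and function names are hypothetical, and a production system would train on a far larger sample and use more careful text processing.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-classified training records: (text, file category).
TRAINING = [
    ("invoice payment due vendor amount", "Finance"),
    ("purchase order invoice remittance", "Finance"),
    ("employee onboarding benefits leave policy", "Human Resources"),
    ("performance review employee salary", "Human Resources"),
]


def tokenize(text):
    """Split text into lowercase words (a deliberately simple assumption)."""
    return text.lower().split()


def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Prior: share of training documents filed under this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen word/category pairs above zero.
            score += math.log(
                (word_counts[category][token] + 1)
                / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train(TRAINING)
print(classify("vendor invoice overdue payment", model))
```

The system "learns" only in the sense that the word counts come from examples someone has already filed correctly, which is why the quality and size of the training sample matter so much.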

Rule-based systems depend on user-defined sets of rules to decipher the concepts contained in the records. The system parses documents, determines their concepts, and assigns them to categories based on the rule set. An advantage of rule-based systems is that, unlike filing by humans, even an erroneous filing decision is applied consistently.

Unless other variables are incorporated as a cross-reference, rule-based systems may classify unrelated documents with similar terms together. Stephens (2007) notes that “The trend is to combine multiple methods to categorize the corpus of documents to increase the accuracy and relevancy of grouping similar documents” (158).
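A rule-based system, and the value of a cross-referenced variable, can be sketched as follows. This is an illustrative assumption rather than any vendor's rule language: the `Rule` structure, the category names, and the use of the originating department as the second variable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    category: str                      # file plan category to assign
    keywords: frozenset                # every keyword must appear in the text
    department: Optional[str] = None   # optional cross-reference variable


# Hypothetical rule set, checked in order; the first matching rule wins.
RULES = [
    Rule("Vehicle Leases", frozenset({"lease"}), department="Fleet"),
    Rule("Real Estate Leases", frozenset({"lease"}), department="Facilities"),
    Rule("Leases (General)", frozenset({"lease"})),
]


def classify_by_rules(text, department=None, rules=RULES):
    """Return the category of the first matching rule, or None if no rule fires."""
    tokens = set(text.lower().split())
    for rule in rules:
        if not rule.keywords <= tokens:
            continue  # a required keyword is missing
        if rule.department is not None and rule.department != department:
            continue  # keyword matched, but the cross-reference did not
        return rule.category
    return None


print(classify_by_rules("signed lease for the warehouse", department="Facilities"))
print(classify_by_rules("signed lease for two trucks", department="Fleet"))
```

With only the keyword rule, both documents would land in the same general category because they share the term "lease"; the department check is what separates them.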

The fundamental problem of records management is getting documents into the system, and it is unrealistic, slow, and costly to expect records creators to classify documents themselves (Medina et al. 2006). Autocategorization addresses adequate participation and accuracy, the two basic requirements of enterprise content management. However, before undertaking an autocategorization project, records and information management (RIM) professionals must determine whether autocategorization suits the organization’s documents and file plan, whether it can be accurate enough to be cost-effective compared with other classification schemes, and whether training the system will require unacceptable resources. Medina et al. (2006) advise, “Plan to make use of [autocategorization] in the long term, and introduce it in a later phase of your deployment. Most importantly, make sure to do a test drive against your organization’s documents and requirements” (17). Bock (2002) adds, “There is no commercially oriented benchmark for determining the effectiveness of one particular text-analysis solution or another. Thus, a company choosing between [a product] and its competitors has to do extensive comparisons on its own to determine the costs and benefits of alternative approaches” (as cited in Lubbes 2003, 69).
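One simple way to run such a test drive is to compare the system’s assignments against a sample that records staff have already classified by hand. The sketch below is an assumption about how that comparison might be scored, not a method prescribed by the cited authors; the function name and sample categories are hypothetical.

```python
from collections import Counter


def evaluate(expected, predicted):
    """Return overall accuracy and a count of each (expected, predicted) misfiling."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must be the same length")
    pairs = list(zip(expected, predicted))
    correct = sum(1 for e, p in pairs if e == p)
    misfiled = Counter((e, p) for e, p in pairs if e != p)
    accuracy = correct / len(pairs) if pairs else 0.0
    return accuracy, misfiled


# Hypothetical hand-classified sample and the system's proposals for it.
expected = ["Finance", "Finance", "Human Resources", "Legal"]
predicted = ["Finance", "Human Resources", "Human Resources", "Legal"]

accuracy, misfiled = evaluate(expected, predicted)
print(accuracy)
print(misfiled)
```

The misfiling counts show which categories the system confuses, which helps in judging whether the errors fall in low-risk or high-risk parts of the file plan.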

Even with the best autocategorization system, human intervention is needed to define rules and monitor results. Documents that the system does not understand are reviewed by the RIM professional, who decides whether the system’s proposed assignments are appropriate or should be modified (Chester 2004). Additionally, periodic review is necessary both to confirm that the system remains on target and to account for new categories that will be introduced and existing categories that may “drift” from their original target (Chester 2004, 17).
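The division of labor between the system and the reviewer can be sketched as a routing step. The example assumes the system reports a confidence score between 0 and 1 with each proposed category; that score, the 0.8 threshold, and the document identifiers are hypothetical and would depend on the product and the organization’s tolerance for misfiling.

```python
def route(proposals, threshold=0.8):
    """Split proposals into auto-filed records and a human review queue.

    proposals: iterable of (document_id, proposed_category, confidence).
    threshold: assumed minimum confidence for filing without review.
    """
    auto_filed, review_queue = [], []
    for doc_id, category, confidence in proposals:
        if confidence >= threshold:
            auto_filed.append((doc_id, category))
        else:
            # Keep the proposal so the reviewer can accept or modify it.
            review_queue.append((doc_id, category, confidence))
    return auto_filed, review_queue


# Hypothetical system output.
proposals = [
    ("DOC-001", "Finance", 0.95),
    ("DOC-002", "Legal", 0.55),
    ("DOC-003", "Human Resources", 0.80),
]

auto_filed, review_queue = route(proposals)
print(auto_filed)
print(review_queue)
```

Tracking how often reviewers override the proposals in the queue is one way to notice when a category has begun to drift.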

Autocategorization software works best for organizations with a large influx of documents, preferably well written by professionals, that either need to be categorized into a fairly shallow or general file plan or use highly developed and specialized vocabularies, as in the pharmaceutical or legal industries. With careful planning and training of the system, autocategorization can successfully sort records into buckets for later retrieval by RIM professionals.

Works Cited

Bock, G. (2002). Meta tagging and text analysis from Clearforest, identifying and organizing unstructured content for dynamic delivery through digital networks. Patricia Seybold Group.

Chester, B. (March/April 2004). Auto-categorization and records management. Infonomics 16-18.

Lubbes, R. K. (October 2001). Automatic categorization: How it works, related issues, and impacts on records management. Information Management Journal 38-43.

Lubbes, R. K. (March/April 2003). So you want to implement automatic categorization? Information Management Journal 60-69.

Medina, R., Gaffaney, D., & Andrews, L. (July/August 2006). Autocategorization: One key component for enterprise records management. Infonomics 15-17.

Reamy, T. (November 2002). Auto-categorization: Coming to a library or intranet near you! EContent 17-22.

Santangelo, J. (November/December 2009). Rise of the machines: The role of text analytics in record classification and disposition. Information Management 22-26.

Stephens, D. O. (2007). Records management: Making the transition from paper to electronic. Lenexa, KS: ARMA International.