Document classification
Document classification is a problem in information science. The task is to assign an electronic document to one or more categories based on its contents. Document classification tasks fall into two kinds: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification of documents, and unsupervised document classification, where the classification must be done entirely without reference to external information.
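The difference between the two kinds can be illustrated with a minimal Python sketch. It is not a standard algorithm from any particular library: the function names, the Jaccard word-overlap similarity, the 0.2 threshold, and the toy documents are all assumptions chosen for illustration. The supervised function relies on externally supplied labels, while the unsupervised one only groups documents that share vocabulary.

```python
def tokens(text):
    """Lowercase a document and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_supervised(doc, labeled_docs):
    """Supervised: copy the label of the most similar labeled example."""
    doc_tokens = tokens(doc)
    _, best_label = max(
        labeled_docs, key=lambda pair: jaccard(doc_tokens, tokens(pair[0]))
    )
    return best_label


def group_unsupervised(docs, threshold=0.2):
    """Unsupervised: greedily group documents whose words overlap enough.

    No labels are used; each group is compared via its first document.
    The threshold is an arbitrary illustrative value.
    """
    groups = []
    for doc in docs:
        for group in groups:
            if jaccard(tokens(doc), tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])  # no similar group found: start a new one
    return groups


# Hypothetical labeled examples (the "external information").
labeled = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the budget bill", "politics"),
    ("the senate will vote on the bill", "politics"),
]
label = classify_supervised("the senate passed the bill", labeled)

# Hypothetical unlabeled documents: only their contents are available.
unlabeled = [
    "football match tonight",
    "budget bill vote",
    "football match result",
    "budget bill debate",
]
groups = group_unsupervised(unlabeled)
print(label)
print(groups)
```

In the supervised case the output is a category name taken from the labeled examples; in the unsupervised case the output is only a set of groups, which a person would still have to name.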
Techniques
Document classification techniques include:
- naive Bayes classifier
- tf-idf
- latent semantic indexing
- support vector machines
- artificial neural networks
- k-nearest neighbors (kNN)
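As an illustration of the first technique in the list, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing, written with only the Python standard library. The function names, the `spam`/`ham` labels, and the toy training messages are assumptions made up for this example; a practical system would use far more data and proper tokenization.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: fraction of training documents in this class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, class) pairs from
            # producing log(0).
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set.
training = [
    ("win cash prize now", "spam"),
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_naive_bayes(training)
prediction_1 = classify("buy cheap prize now", model)
prediction_2 = classify("agenda for the team meeting", model)
print(prediction_1, prediction_2)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, which is why the scores are sums of logarithms instead of products.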
Applications
A notable use of document classification techniques is spam filtering, which tries to distinguish spam e-mail messages from legitimate ones.
See also
- classification
- supervised learning, unsupervised learning
- document retrieval
- information retrieval
- machine learning
External links
- Rafael A. Calvo, Jae-Moon Lee and Xiaobo Li. [Managing Content with Automatic Document Classification]. Journal of Digital Information, Volume 5 Issue 2, Article No. 282, 2004-06-08
- [Introduction to document classification]
- [Bibliography on Automated Text Categorization]
- [TechTC - Technion Repository of Text Categorization Datasets]
- [LingPipe]: Java natural language processing software including a rich classification runtime and evaluation framework, with classifiers based on character- and token-based language models (including naive Bayes).
From Wikipedia, the Free Encyclopedia.
All text is available under the terms of the GNU Free Documentation License. See Wikipedia Copyrights for details.
