So I did a small project to understand how Apache Mahout works. I decided to use Apache Maven 2 to manage all the dependencies, so I will start with the POM file first.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.acme</groupId>
    <artifactId>mahout</artifactId>
    <version>0.94</version>
    <name>Mahout Examples</name>
    <description>Scalable machine learning library examples</description>
    <packaging>jar</packaging>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <apache.mahout.version>0.4</apache.mahout.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <encoding>UTF-8</encoding>
                    <source>1.6</source>
                    <target>1.6</target>
                    <optimize>true</optimize>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-core</artifactId>
            <version>${apache.mahout.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-math</artifactId>
            <version>${apache.mahout.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-utils</artifactId>
            <version>${apache.mahout.version}</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.6.0</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jcl</artifactId>
            <version>1.6.0</version>
        </dependency>
    </dependencies>
</project>

Then I looked into the Apache Mahout examples and the algorithms available for the text classification problem. The simplest and most accurate one is the Naive Bayes classifier. Here is a code snippet:
package org.acme;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.TrainClassifier;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;

public class Starter {
    public static void main( final String[] args ) {
        // Parameters shared by the trainer and the classifier
        final BayesParameters params = new BayesParameters();
        params.setGramSize( 1 );
        params.set( "verbose", "true" );
        params.set( "classifierType", "bayes" );
        params.set( "defaultCat", "OTHER" );
        params.set( "encoding", "UTF-8" );
        params.set( "alpha_i", "1.0" );
        params.set( "dataSource", "hdfs" );
        params.set( "basePath", "/tmp/output" );

        try {
            // Train the Naive Bayes model on the labeled files under /tmp/input
            // and store the model under /tmp/output
            Path input = new Path( "/tmp/input" );
            TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );

            // Load the trained model into memory and build the classifier around it
            Algorithm algorithm = new BayesAlgorithm();
            Datastore datastore = new InMemoryBayesDatastore( params );
            ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
            classifier.initialize();

            // Classify the file passed as the first command-line argument, line by line
            final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
            String entry = reader.readLine();

            while( entry != null ) {
                List< String > document = new NGrams( entry,
                    Integer.parseInt( params.get( "gramSize" ) ) )
                        .generateNGramsWithoutLabel();

                // result holds the predicted category for this line (or the default one)
                ClassifierResult result = classifier.classifyDocument(
                    document.toArray( new String[ document.size() ] ),
                    params.get( "defaultCat" ) );

                entry = reader.readLine();
            }

            reader.close();
        } catch( final IOException ex ) {
            ex.printStackTrace();
        } catch( final InvalidDatastoreException ex ) {
            ex.printStackTrace();
        }
    }
}
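Note that Starter retrains the model from scratch on every run. Once the model has been built under /tmp/output, it can be loaded and reused directly. Below is a minimal sketch of that, assuming the same Mahout 0.4 Bayes API as above; the class name and the sample text are made up for illustration, and the prediction is printed via ClassifierResult's getLabel() and getScore() accessors.

package org.acme;

import java.util.List;

import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;

public class ReuseTrainedModel {
    public static void main( final String[] args ) throws Exception {
        // Point the parameters at the already trained model under /tmp/output
        final BayesParameters params = new BayesParameters();
        params.setGramSize( 1 );
        params.set( "classifierType", "bayes" );
        params.set( "defaultCat", "OTHER" );
        params.set( "encoding", "UTF-8" );
        params.set( "alpha_i", "1.0" );
        params.set( "dataSource", "hdfs" );
        params.set( "basePath", "/tmp/output" );

        // Load the model into memory, skipping TrainClassifier.trainNaiveBayes() entirely
        Algorithm algorithm = new BayesAlgorithm();
        Datastore datastore = new InMemoryBayesDatastore( params );
        ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
        classifier.initialize();

        // Classify a single piece of text and print the predicted category with its score
        String text = "Do you sell Microsoft Office?";
        List< String > document = new NGrams( text, 1 ).generateNGramsWithoutLabel();
        ClassifierResult result = classifier.classifyDocument(
            document.toArray( new String[ document.size() ] ),
            params.get( "defaultCat" ) );

        System.out.println( text + " => " + result.getLabel() + " (" + result.getScore() + ")" );
    }
}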
There is one important note here: the system must be trained before it can start classifying. To do that, it is necessary to provide examples (the more, the better) of texts for the different categories. These should be simple text files where each line starts with the category, separated by a tab from the text itself. For example:

SUGGESTION	That's a great suggestion
QUESTION	Do you sell Microsoft Office?
...

The more files you provide, the more precise the classification you will get. All files must be put into the '/tmp/input' folder; they will be processed by Apache Hadoop first. :)
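For completeness, here is a rough sketch of preparing such a file programmatically and copying it into the /tmp/input folder on HDFS with Hadoop's FileSystem API. The class name, the file name and the sample lines are made up for illustration, and it assumes the Hadoop configuration on the classpath (core-site.xml) points at the cluster the trainer will use; a real corpus would of course be much larger.

package org.acme;

import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TrainingDataUploader {
    public static void main( final String[] args ) throws Exception {
        // Write a tiny training file: one example per line, the category and the text
        // separated by a tab
        final File local = new File( "training-data.txt" );
        final PrintWriter writer = new PrintWriter( new FileWriter( local ) );
        try {
            writer.println( "SUGGESTION\tThat's a great suggestion" );
            writer.println( "QUESTION\tDo you sell Microsoft Office?" );
            writer.println( "OTHER\tThank you very much" );
        } finally {
            writer.close();
        }

        // Copy the file into the /tmp/input folder where TrainClassifier expects to find it
        // (FileSystem.get() picks up the cluster settings from the Hadoop configuration
        // on the classpath)
        final FileSystem fs = FileSystem.get( new Configuration() );
        fs.copyFromLocalFile( new Path( local.getAbsolutePath() ),
            new Path( "/tmp/input/training-data.txt" ) );
        fs.close();
    }
}

Once the training data is in place, Starter can be run with the path of a local file to classify as its only argument.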