I recently had an interesting problem to solve: how to automatically classify text coming from different sources. Some time ago I read about a project that does this, along with many other kinds of text analysis: Apache Mahout. Though it isn't very mature yet (the current version is 0.4), it's very powerful and scalable. Built on top of another excellent project, Apache Hadoop, it's capable of analyzing huge data sets.
So I did a small project to understand how Apache Mahout works. I decided to use Apache Maven 2 to manage all the dependencies, so I will start with the POM file.
Then I looked into the Apache Mahout examples and the algorithms available for text classification. The simplest and most accurate one is the Naive Bayes classifier. Here is a code snippet:
package org.acme;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.FileReader;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.TrainClassifier;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;
public class Starter {
    public static void main( final String[] args ) {
        // Parameters shared by training and classification
        final BayesParameters params = new BayesParameters();
        params.setGramSize( 1 );
        params.set( "verbose", "true" );
        params.set( "classifierType", "bayes" );
        params.set( "defaultCat", "OTHER" );
        params.set( "encoding", "UTF-8" );
        params.set( "alpha_i", "1.0" );
        params.set( "dataSource", "hdfs" );
        params.set( "basePath", "/tmp/output" );
        try {
            // Train the model from the labelled examples in /tmp/input
            Path input = new Path( "/tmp/input" );
            TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );
            // Load the trained model into memory
            Algorithm algorithm = new BayesAlgorithm();
            Datastore datastore = new InMemoryBayesDatastore( params );
            ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
            classifier.initialize();
            // Classify every line of the file passed as the first argument
            final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
            try {
                String entry = reader.readLine();
                while( entry != null ) {
                    List< String > document = new NGrams( entry,
                        Integer.parseInt( params.get( "gramSize" ) ) )
                        .generateNGramsWithoutLabel();
                    ClassifierResult result = classifier.classifyDocument(
                        document.toArray( new String[ document.size() ] ),
                        params.get( "defaultCat" ) );
                    // Print the predicted category next to the original line
                    System.out.println( result.getLabel() + "\t" + entry );
                    entry = reader.readLine();
                }
            } finally {
                reader.close();
            }
        } catch( final IOException ex ) {
            ex.printStackTrace();
        } catch( final InvalidDatastoreException ex ) {
            ex.printStackTrace();
        }
    }
}
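To give a feeling for what a unigram Naive Bayes classifier computes, here is a minimal, library-free sketch. It is not Mahout's implementation: the class name TinyNaiveBayes, its methods and the very simple tokenizer are all hypothetical, and I am only assuming that the smoothing constant plays a role similar to the "alpha_i" parameter above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative unigram Naive Bayes with additive smoothing (hypothetical, not Mahout's code). */
final class TinyNaiveBayes {
    private final double alpha; // smoothing constant, assumed analogous to "alpha_i"
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    TinyNaiveBayes(double alpha) {
        this.alpha = alpha;
    }

    /** Lower-cases the text and splits it into alphanumeric unigrams. */
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Adds one labelled example to the per-category word counts. */
    void train(String label, String text) {
        totalDocs++;
        docCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the category with the highest log-probability, or the default if nothing was trained. */
    String classify(String text, String defaultCategory) {
        String best = defaultCategory;
        double bestScore = Double.NEGATIVE_INFINITY;
        String[] tokens = tokenize(text);
        for (String label : docCounts.keySet()) {
            // log P(label) + sum of log P(token | label), with additive smoothing
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            double denominator = totalWords.getOrDefault(label, 0) + alpha * vocabulary.size();
            for (String token : tokens) {
                score += Math.log((counts.getOrDefault(token, 0) + alpha) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes bayes = new TinyNaiveBayes(1.0);
        bayes.train("SUGGESTION", "That's a great suggestion");
        bayes.train("QUESTION", "Do you sell Microsoft Office?");
        System.out.println(bayes.classify("Do you sell laptops?", "OTHER"));
    }
}
```

With those two training lines, "Do you sell laptops?" shares three words with the QUESTION example and none with the SUGGESTION one, so QUESTION gets the higher score. Mahout does the same kind of counting, but as Hadoop jobs over much larger inputs.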
One important note: the classifier must be trained before it can classify anything. To do so, you need to provide examples (the more, the better) of text for each category. These should be plain text files where each line starts with the category, separated by a tab from the text itself. For example:
SUGGESTION	That's a great suggestion
QUESTION	Do you sell Microsoft Office?
...
The more files you provide, the more precise the classification will be. All files must be put into the '/tmp/input' folder; they will be processed by Apache Hadoop first. :)
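Since a stray tab or line break inside the text would break the "category, tab, text" layout, a small helper for producing training files can save some debugging. The sketch below uses only the JDK; the class TrainingFile and its methods are hypothetical names of mine, not part of Mahout, and the file name passed to write is just a placeholder.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for the "category<TAB>text" training format; not part of Mahout. */
final class TrainingFile {
    /** Builds one training line, flattening tabs and line breaks inside the text. */
    static String formatLine(String category, String text) {
        return category + "\t" + text.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    /** Splits a training line at the first tab and returns {category, text}. */
    static String[] parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            throw new IllegalArgumentException("Expected '<category><TAB><text>': " + line);
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    /** Writes the examples (pairs of category and text) as UTF-8, one per line. */
    static void write(Path file, List<String[]> examples) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String[] example : examples) {
            lines.add(formatLine(example[0], example[1]));
        }
        // java.nio.file.Path here, not the Hadoop Path used in the snippet above
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String[]> examples = Arrays.asList(
            new String[] { "SUGGESTION", "That's a great suggestion" },
            new String[] { "QUESTION", "Do you sell Microsoft Office?" } );
        // "examples.txt" is a placeholder; copy the result into the input folder afterwards
        write(Paths.get("examples.txt"), examples);
    }
}
```

Using parseLine on existing files before training is a cheap way to spot lines that are missing the category or the tab.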