Recently I got an interesting problem to solve: how to classify text from different sources automatically? Some time ago I read about a project which does this, as well as many other text analysis tasks: Apache Mahout. Though it's not very mature yet (the current version is 0.4), it's very powerful and scalable. Built on top of another excellent project, Apache Hadoop, it's capable of analyzing huge data sets.
So I did a small project in order to understand how Apache Mahout works. I decided to use Apache Maven 2 to manage all the dependencies, so I will start with the POM file first.
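The POM boils down to a single Mahout dependency (version 0.4, matching the version mentioned above); Hadoop and the other libraries are pulled in transitively. The groupId and artifactId of the project itself are just placeholders here:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.acme</groupId>
    <artifactId>mahout-classifier</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- Brings in the Bayes classifier, NGrams and Hadoop transitively -->
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-core</artifactId>
            <version>0.4</version>
        </dependency>
    </dependencies>
</project>
```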
Then I looked into the Apache Mahout examples and the algorithms available for the text classification problem. The simplest, yet quite accurate, one is the Naive Bayes classifier. Here is a code snippet:
package org.acme;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.TrainClassifier;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;

public class Starter {
    public static void main( final String[] args ) {
        // Parameters for training and classification.
        final BayesParameters params = new BayesParameters();
        params.setGramSize( 1 );
        params.set( "verbose", "true" );
        params.set( "classifierType", "bayes" );
        params.set( "defaultCat", "OTHER" );
        params.set( "encoding", "UTF-8" );
        params.set( "alpha_i", "1.0" );
        params.set( "dataSource", "hdfs" );
        params.set( "basePath", "/tmp/output" );

        try {
            // Train the model on the labeled examples in /tmp/input.
            final Path input = new Path( "/tmp/input" );
            TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );

            // Load the trained model into memory and classify the file
            // passed as the first command line argument, line by line.
            final Algorithm algorithm = new BayesAlgorithm();
            final Datastore datastore = new InMemoryBayesDatastore( params );
            final ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
            classifier.initialize();

            final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
            try {
                String entry = reader.readLine();
                while( entry != null ) {
                    final List< String > document = new NGrams( entry,
                        Integer.parseInt( params.get( "gramSize" ) ) )
                            .generateNGramsWithoutLabel();
                    final ClassifierResult result = classifier.classifyDocument(
                        document.toArray( new String[ document.size() ] ),
                        params.get( "defaultCat" ) );
                    System.out.println( entry + " -> " + result.getLabel() );
                    entry = reader.readLine();
                }
            } finally {
                reader.close();
            }
        } catch( final IOException ex ) {
            ex.printStackTrace();
        } catch( final InvalidDatastoreException ex ) {
            ex.printStackTrace();
        }
    }
}
There is one important note here: the system must be trained before it can classify anything. In order to do so, it's necessary to provide examples (the more, the better) of texts for each category. These are simple text files where each line starts with the category, separated by a tab from the text itself. E.g.:
SUGGESTION That's a great suggestion
QUESTION Do you sell Microsoft Office?
...
The more training files you provide, the more precise the classification you will get. All files must be put into the '/tmp/input' folder; they will be processed by Apache Hadoop first. :)
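For illustration, such a training file could be generated programmatically like this (a minimal sketch; the file name and the categories are just the examples from above):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class TrainingFileWriter {
    public static void main( final String[] args ) throws IOException {
        // Each line: CATEGORY, then a tab, then the sample text.
        final PrintWriter out = new PrintWriter( new FileWriter( "training.txt" ) );
        try {
            out.println( "SUGGESTION\tThat's a great suggestion" );
            out.println( "QUESTION\tDo you sell Microsoft Office?" );
        } finally {
            out.close();
        }
    }
}
```

Files in this format should then be copied into '/tmp/input' before running the trainer.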