Recently I ran into an interesting problem: how to classify text from different sources automatically. Some time ago I read about a project that does this, along with many other kinds of text analysis: Apache Mahout. Though it is not very mature yet (the current version is 0.4), it is very powerful and scalable. Built on top of another excellent project, Apache Hadoop, it is capable of analyzing huge data sets.
So I did a small project to understand how Apache Mahout works. I decided to use Apache Maven 2 to manage all dependencies, so I will start with the POM file.
Then I looked into the Apache Mahout examples and the algorithms available for the text classification problem. The simplest one, and a quite accurate one, is the Naive Bayes classifier. Here is a code snippet:
package org.acme;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.FileReader;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.TrainClassifier;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;
public class Starter {
public static void main( final String[] args ) {
final BayesParameters params = new BayesParameters();
params.setGramSize( 1 );
params.set( "verbose", "true" );
params.set( "classifierType", "bayes" );
params.set( "defaultCat", "OTHER" );
params.set( "encoding", "UTF-8" );
params.set( "alpha_i", "1.0" );
params.set( "dataSource", "hdfs" );
params.set( "basePath", "/tmp/output" );
try {
Path input = new Path( "/tmp/input" );
TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );
Algorithm algorithm = new BayesAlgorithm();
Datastore datastore = new InMemoryBayesDatastore( params );
ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
classifier.initialize();
final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
try {
String entry = reader.readLine();
while( entry != null ) {
// Split the line into n-grams of the configured size
List< String > document = new NGrams( entry,
Integer.parseInt( params.get( "gramSize" ) ) )
.generateNGramsWithoutLabel();
ClassifierResult result = classifier.classifyDocument(
document.toArray( new String[ document.size() ] ),
params.get( "defaultCat" ) );
// Print the category assigned to this line
System.out.println( result.getLabel() + "\t" + entry );
entry = reader.readLine();
}
} finally {
reader.close();
}
} catch( final IOException ex ) {
ex.printStackTrace();
} catch( final InvalidDatastoreException ex ) {
ex.printStackTrace();
}
}
}
There is one important note here: the system must be trained before it can classify anything. To do so, you need to provide examples (the more, the better) of text for each category. These should be plain text files where each line starts with the category, separated by a tab from the text itself. For example:
SUGGESTION That's a great suggestion
QUESTION Do you sell Microsoft Office?
...
The more files you provide, the more precise the classification will be. All files must be put into the '/tmp/input' folder; they will be processed by Apache Hadoop first. :)
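To make the idea behind training more concrete, here is a minimal sketch of a unigram Naive Bayes classifier in plain Java. It is only an illustration and not Mahout's implementation: the class TinyNaiveBayes, its methods and the extra training lines are made up for this example. It reads the same "CATEGORY<TAB>text" format, uses additive smoothing (playing the role of alpha_i above) and falls back to a default category when nothing has been learned.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy multinomial Naive Bayes over unigrams; an illustration, not Mahout's code. */
final class TinyNaiveBayes {
    private final double alpha; // additive smoothing constant
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs;

    TinyNaiveBayes(double alpha) {
        this.alpha = alpha;
    }

    /** Splits text into lower-case unigrams (gram size 1). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Learns from one training line in the form "LABEL<TAB>text". */
    void train(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            throw new IllegalArgumentException("Expected LABEL<TAB>text: " + line);
        }
        String label = line.substring(0, tab);
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(line.substring(tab + 1))) {
            counts.merge(token, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-score, or defaultLabel if nothing was learned. */
    String classify(String text, String defaultLabel) {
        List<String> tokens = tokenize(text);
        String best = defaultLabel;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log prior + sum of smoothed log likelihoods
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            double denominator = wordsPerLabel.getOrDefault(label, 0) + alpha * vocabulary.size();
            for (String token : tokens) {
                int count = wordCounts.get(label).getOrDefault(token, 0);
                score += Math.log((count + alpha) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes(1.0);
        nb.train("SUGGESTION\tThat's a great suggestion");
        nb.train("SUGGESTION\tYou should add a dark theme");
        nb.train("QUESTION\tDo you sell Microsoft Office?");
        nb.train("QUESTION\tDo you ship to Canada?");
        System.out.println(nb.classify("Do you sell laptops?", "OTHER")); // prints QUESTION
    }
}
```

With only four training lines the result is driven by a handful of shared words ("do", "you", "sell"), which is exactly why more training examples per category give better results.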
Testing has never been an easy thing. I have been following the TDD approach for at least the last 5-6 years and I am really excited about it. But for me, TDD is not only unit testing. It is the whole set of testing techniques I find appropriate for a particular project (unit tests, integration tests, performance tests, ...). Recently I discovered an excellent tool: soapUI. It has a bunch of useful features, but the one I would like to cover today is testing BlazeDS services using the AMF protocol.
Before we start with code snippets, let's copy the BlazeDS libraries to the bin/ext folder of the soapUI installation:
- commons-codec-1.3.jar
- commons-httpclient-3.0.1.jar
- commons-logging.jar
- flex-messaging-common.jar
- flex-messaging-core.jar
- flex-messaging-opt.jar
- flex-messaging-proxy.jar
- flex-messaging-remoting.jar
Among other very cool features, soapUI supports Groovy as a scripting language, which is just awesome. So all my examples will be in Groovy. Let's start with the necessary part: creating a connection and aliasing services.
With the connection established, we are ready to call service methods of any aliased remote object. Here is a code snippet that calls the service method foo(), which has no parameters.
// Calling service method without arguments
def result = amfConnection.call( "testService.foo" );
And here is a code snippet that calls the service method foo(), which accepts one parameter of type Person.
// Calling service method with object as argument
def person = new ASObject( "com.example.Person" );
person[ "name" ] = "John Smith";
result = amfConnection.call( "testService.foo", person );
There's one issue which I've omitted so far. If you have security enabled for channels, you must authenticate before calling any services. It's quite simple to do:
When we are done, let's be good citizens and close the connection:
amfConnection.close();
Again, if security for channels is enabled, log out before closing the connection:
CommandMessage c = new CommandMessage();
c.setHeader( Message.FLEX_CLIENT_ID_HEADER, clientId );
c.setOperation( CommandMessage.LOGOUT_OPERATION );
c.setDestination( "auth" );
amfConnection.call( null, c );
With such a script in place, soapUI allows you to create a load test based on it. It also supports quite complicated scenarios with many scripts involved and parameters passed from one to another. There is a very good blog which contains tons of useful information on how to use soapUI for different kinds of testing.
If you are looking for better integration between Adobe BlazeDS and the Java platform, I would like to recommend one very useful project from the SpringSource portfolio: Spring Flex (or Spring BlazeDS Integration). It's pretty easy to start with and, moreover, you can integrate it with other projects like Spring Framework and Spring Security.
Basically, those few lines of code do all the routine work: they start the Adobe BlazeDS MessageBroker servlet (to handle the AMF protocol) and publish your classes (annotated with @RemotingDestination) as remote objects accessible by Flex clients.
The Adobe BlazeDS configuration, referenced here as /WEB-INF/flex/services-config.xml, is pretty standard. It includes the bare minimum needed to run a simple application.
The configuration part is done. Let's create a simple remote object class.
package org.example.flex;
import org.springframework.flex.remoting.RemotingDestination;
import org.springframework.stereotype.Service;
@Service
@RemotingDestination( value = "simpleService", channels = { "default-amf", "secure-amf" } )
public class SimpleService {
public Boolean test() {
return Boolean.TRUE;
}
}
That's it! SimpleService is declared as a simple POJO with the @RemotingDestination annotation; it will be discovered by the Spring configuration and automatically published as a remote object on the "default-amf" and "secure-amf" channels.
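For readers curious how this kind of annotation-driven discovery can work in principle, here is a small plain-Java sketch. It does not use Spring Flex and does not show how Spring implements it: the @Destination annotation and the scan() helper are hypothetical stand-ins, used only to show that a runtime-retained annotation can be read via reflection and turned into a "destination name to channels" mapping.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DestinationScan {

    /** Hypothetical stand-in for @RemotingDestination; NOT the Spring Flex annotation. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Destination {
        String value();
        String[] channels() default {};
    }

    @Destination(value = "simpleService", channels = { "default-amf", "secure-amf" })
    static class SimpleService {
        public Boolean test() {
            return Boolean.TRUE;
        }
    }

    /** A class without the annotation; it must not be published. */
    static class NotExposed {
    }

    /** Maps each destination name to its channels for every annotated candidate class. */
    static Map<String, List<String>> scan(Class<?>... candidates) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Class<?> candidate : candidates) {
            Destination destination = candidate.getAnnotation(Destination.class);
            if (destination != null) {
                result.put(destination.value(), Arrays.asList(destination.channels()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {simpleService=[default-amf, secure-amf]}
        System.out.println(scan(SimpleService.class, NotExposed.class));
    }
}
```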
Integrating Spring Security again takes just a few configuration lines. Here is an example:
Spring Flex also provides a bunch of other interesting features, such as exception translators. It is worthwhile to look at this project if you are developing Flex applications with Adobe BlazeDS.
When we talk about software development, it's not only about writing code (high-quality code, for sure). It's also about a bunch of supporting processes like automated building, testing, deployment, integration, ... In this blog I am trying to touch every aspect, so this post starts a series of articles about building Java projects with Apache Maven 2. The Maven web site has very good documentation, so I will skip the introductory part and concentrate on some practical issues which arise quite often.
Suppose you have XML configuration files and, depending on the build profile, you have to modify some parameters (database server address, JMS endpoints, ...). How do you do that with Apache Maven 2? Quite easily, using ... the Apache Ant integration for Apache Maven 2. There is an excellent and very powerful task for Apache Ant that works with XML files: XMLTask. Let us make use of it!
What this simple fragment does: for testing builds, it removes from web.xml all XML elements with the id attribute <some id here>. Not very meaningful, but it gives an idea of how it works. XMLTask can do almost everything you need: insert/remove elements and XML fragments, insert/remove/modify attributes with values and properties, copy/cut/paste XML, and a lot more. I found it extremely useful.
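To show what "remove all elements with a given id" amounts to, here is a rough equivalent written with the JDK's own DOM and XPath APIs. This is not XMLTask and not the Maven configuration itself, just a sketch of the same transformation; the class name RemoveById, the id value "test-only" and the tiny web.xml fragment are assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RemoveById {

    /** Removes every element whose id attribute equals the given value and returns the new XML. */
    static String removeElementsWithId(String xml, String id) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Pass the id as an XPath variable instead of concatenating it into the expression
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(name -> id);
        NodeList matches = (NodeList) xpath.evaluate("//*[@id = $id]", doc, XPathConstants.NODESET);
        for (int i = 0; i < matches.getLength(); i++) {
            Node node = matches.item(i);
            node.getParentNode().removeChild(node);
        }

        // Serialize the modified document back to a string
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical web.xml fragment: one servlet is only wanted outside of testing builds
        String webXml = "<web-app>"
                + "<servlet id=\"test-only\"><servlet-name>debug</servlet-name></servlet>"
                + "<servlet id=\"main\"><servlet-name>app</servlet-name></servlet>"
                + "</web-app>";
        System.out.println(removeElementsWithId(webXml, "test-only"));
    }
}
```

The XPath expression is the interesting part: the same kind of path expression is what you hand to XMLTask to select the elements to remove.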
This post will not be very technical, but I would like to share some of my experience with Internet application development.
I have been involved in web application development for quite a few years. I started with PHP, then moved to ASP.NET, then to JSF, then AJAX diluted all that stuff, and finally I moved to Adobe Flex. The trend is obvious: web applications must be as close to their desktop counterparts as possible. Adobe Flex is really cool, very coooool ... I haven't played with Microsoft Silverlight or JavaFX much, but they are all about the same thing.
The richer web applications become, the more features are requested from them. For developers it's a whole new world to explore. My current project is built on top of Adobe Flex and Java. It is worth saying that Adobe Flex and Java integrate very well via the BlazeDS (open source) or LCDS (commercial) bridges. SpringSource provides excellent support for Flex and BlazeDS development by means of the Spring BlazeDS Integration project.
What is all this about... Developing RIAs on top of the Java platform is a challenge which requires the developer to engage a whole new technology stack. It's something that cannot be done using the pure Java platform. JavaFX is coming, but too late. Will it be successful?
Nevertheless, I would like to encourage developers to consider Adobe Flex as part of their next web project. It is worth the time you will spend on it.
So far I haven't needed to test servlets within a Spring Framework environment. But the issue came up recently, and I am going to share my experience with testing a file upload servlet based on Apache Commons FileUpload and Spring.
Let's start with the file upload servlet implementation. I will omit some unnecessary details and concentrate on two things: getting the application context and retrieving/saving the file to disk.
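import java.io.IOException;
import java.util.Iterator;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.springframework.context.ApplicationContext;
import org.springframework.web.context.support.WebApplicationContextUtils;
public class FileUploadServlet extends HttpServlet {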
@Override
public void doPost( HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
ApplicationContext appContext = WebApplicationContextUtils
.getRequiredWebApplicationContext( getServletContext() );
// Get some beans here from application context
...
DiskFileItemFactory factory = new DiskFileItemFactory();
ServletFileUpload upload = new ServletFileUpload( factory );
try {
Iterator< ? > iter = upload.parseRequest( request ).iterator();
while( iter.hasNext() ) {
FileItem item = ( FileItem )iter.next();
if( !item.isFormField() ) {
// store items here
...
}
}
response.setStatus( HttpServletResponse.SC_OK );
} catch( Exception e ) {
response.setStatus( HttpServletResponse.SC_INTERNAL_SERVER_ERROR );
} finally {
response.flushBuffer();
}
}
}
The servlet is ready. Let's develop a test case to verify it. There are basically three steps:
- create a mock request (and response)
- create a servlet instance and pass the Spring application context to it
- wrap the file into the request and call the servlet's doPost()
The code fragment below shows how easily it can be done using the Spring testing scaffolding (thanks, Spring team, again).
public class UploadServletTestCase extends AbstractJUnit4SpringContextTests {
private byte[] buffer;
@Before
public void setUp() throws Exception {
// Load file content from resource
final InputStream in = getClass().getResourceAsStream( "test.pdf" );
buffer = new byte[ in.available() ];
in.read( buffer );
in.close();
}
@Test
public void testFileUpload() throws Exception {
// create mock servlet config and pass Spring application context to it
StaticWebApplicationContext ctx = new StaticWebApplicationContext();
ctx.setParent( applicationContext );
MockServletConfig sc = new MockServletConfig();
sc.getServletContext().setAttribute(
WebApplicationContext.ROOT_WEB_APPLICATION_CONTEXT_ATTRIBUTE, ctx );
// create mock request (and response)
MockHttpServletRequest request = new MockHttpServletRequest( "POST",
"http://localhost/" );
MockHttpServletResponse response = new MockHttpServletResponse();
// wrap file into request
final ByteArrayOutputStream out = new ByteArrayOutputStream();
try {
out.write( String.format( "-----1234\r\n" +
"Content-Disposition: form-data; name=\"%s\"; filename=\"%s\"\r\n" +
"Content-Type: %s\r\n" +
"\r\n",
"textField",
"test.pdf",
"application/pdf" ).getBytes()
);
out.write( buffer );
// closing delimiter: "--" + boundary + "--"
out.write( "\r\n-----1234--\r\n".getBytes() );
out.flush();
request.setContentType( "multipart/form-data; boundary=---1234" );
request.setContent( out.toByteArray() );
} finally {
out.close();
}
// create servlet instance and call post()
FileUploadServlet servlet = new FileUploadServlet();
servlet.init( sc );
servlet.doPost( request, response );
// do some checks to ensure file has been stored
...
}
}
The test case is ready. Depending on your upload management strategy (disk, database, Amazon S3, ...), the test case should be extended to ensure that the file has been stored by the upload servlet at the proper location.
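The trickiest part of the test above is the hand-written multipart body, so here is a small standalone sketch of how such a body is laid out. The helper class MultipartBody is hypothetical and uses only the JDK. The rule it illustrates comes from the MIME multipart format: each part starts with "--" followed by the boundary, and the whole body ends with "--" + boundary + "--".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class MultipartBody {

    /** Builds a multipart/form-data body that contains a single file part. */
    static byte[] build(String boundary, String fieldName, String fileName,
                        String contentType, byte[] content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Part delimiter and part headers, followed by an empty line
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + contentType + "\r\n"
                + "\r\n";
        out.write(header.getBytes(StandardCharsets.US_ASCII));
        // Raw file content
        out.write(content);
        // Closing delimiter marks the end of the whole body
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.US_ASCII));
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Same boundary as in the Content-Type header of the test: "---1234"
        byte[] body = build("---1234", "textField", "test.pdf", "application/pdf",
                "%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(body, StandardCharsets.US_ASCII));
    }
}
```

Note that the boundary declared in the Content-Type header ("---1234") is two dashes shorter than the delimiter that appears in the body ("-----1234"), which is an easy thing to get wrong when writing such a request by hand.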
Seasoned software developer with a great passion for code. I work extensively with the JVM platform using Java, Groovy, Scala, as well as other languages and technologies (Ruby, Grails, Play!, Akka, MySQL, PostgreSQL, MongoDB, Redis, JUnit, ...)