Tuesday, July 28, 2020

It is never enough of them: enriching Apache Avro generated classes with custom Java annotations

Apache Avro, along with Apache Thrift and Protocol Buffers, is often being used as a platform-neutral extensible mechanism for serializing structured data. In the context of event-driven systems, the Apache Avro's schemas play the role of the language-agnostic contracts, shared between loosely-coupled components of the system, not necessarily written using the same programming language.

Probably, the most widely adopted reference architecture for such systems circles around Apache Kafka backed by Schema Registry and Apache Avro, although many other excellent options are available. Nevertheless, why Apache Avro?

The official documentation page summarizes pretty well the key advantages Apache Avro has over Apache Thrift and Protocol Buffers. But we are going to add another one to the list: biased (in a good sense) support of the Java and JVM platform in general.

Let us imagine that one of the components (or, it has to be said, microservice) takes care of the payment processing. Not every payment may succeed and to propagate such failures, the component broadcasts PaymentRejectedEvent whenever such unfortunate event happens. Here is its Apache Avro schema, persisted in the PaymentRejectedEvent.avsc file.

{
    "type": "record",
    "name": "PaymentRejectedEvent",
    "namespace": "com.example.event",
    "fields": [
        {
            "name": "id",
            "type": {
                "type": "string",
                "logicalType": "uuid"
            }
        },
        {
            "name": "reason",
            "type": {
                "type": "enum",
                "name": "PaymentStatus",
                "namespace": "com.example.event",
                "symbols": [
                    "EXPIRED_CARD",
                    "INSUFFICIENT_FUNDS",
                    "DECLINED"
                ]
            }
        },
        {
            "name": "date",
            "type": {
                "type": "long",
                "logicalType": "local-timestamp-millis"
            }
        }
    ]
}

The event is notoriously kept simple, you can safely assume that in more or less realistic system it has to have considerably more details available. To turn this event into Java class at build time, we could use Apache Avro Maven plugin, it is as easy as it could get.

<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.10.0</version>
    <configuration>
        <stringType>String</stringType>
    </configuration>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                <outputDirectory>${project.build.directory}/generated-sources/avro/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>

Once the build finishes, you will get PaymentRejectedEvent Java class generated. But a few annoyances are going to emerge right away:

@org.apache.avro.specific.AvroGenerated
public class PaymentRejectedEvent extends ... {
   private java.lang.String id;
   private com.example.event.PaymentStatus reason;
   private long date;
}

The Java's types for id and date fields are not really what we would expect. Luckily, this is easy to fix by specifying customConversions plugin property, for example.

<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.10.0</version>
    <configuration>
        <stringType>String</stringType>
        <customConversions>
            org.apache.avro.Conversions$UUIDConversion,org.apache.avro.data.TimeConversions$LocalTimestampMillisConversion
        </customConversions>
    </configuration>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                <outputDirectory>${project.build.directory}/generated-sources/avro/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>

If we build the project this time, the plugin would generate the right types.

@org.apache.avro.specific.AvroGenerated
public class PaymentRejectedEvent extends ... {
   private java.util.UUID id;
   private com.example.event.PaymentStatus reason;
   private java.time.LocalDateTime date;
}

It looks much better! But what about next challenge. In Java, annotations are commonly used to associate some additional metadata pieces with a particular language element. What if we have to add a custom, application-specific annotation to all generated event classes? It does not really matter which one, let it be @javax.annotation.Generated, for example. It turns out, with Apache Avro it is not an issue, it has dedicated javaAnnotation property we could benefit from.

{
    "type": "record",
    "name": "PaymentRejectedEvent",
    "namespace": "com.example.event",
    "javaAnnotation": "javax.annotation.Generated(\"avro\")",
    "fields": [
        {
            "name": "id",
            "type": {
                "type": "string",
                "logicalType": "uuid"
            }
        },
        {
            "name": "reason",
            "type": {
                "type": "enum",
                "name": "PaymentStatus",
                "namespace": "com.example.event",
                "symbols": [
                    "EXPIRED_CARD",
                    "INSUFFICIENT_FUNDS",
                    "DECLINED"
                ]
            }
        },
        {
            "name": "date",
            "type": {
                "type": "long",
                "logicalType": "local-timestamp-millis"
            }
        }
    ]
}

When we rebuild the project one more time (hopefully the last one), the generated PaymentRejectedEvent Java class is going to be decorated with the additional custom annotation.

@javax.annotation.Generated("avro")
@org.apache.avro.specific.AvroGenerated
public class PaymentRejectedEvent extends ... {
   private java.util.UUID id;
   private com.example.event.PaymentStatus reason;
   private java.time.LocalDateTime date;
}

Obviously, this property has no effect if the schema is used to produce respective constructs in other programming languages but it still feels good to see that Java has privileged support in Apache Avro, thanks for that! As a side note, it is good to see that after some quite long inactivity time the project is expiriencing the second breath, with regular releases and new features delivered constantly.

The complete source code is available on Github.

Thursday, April 30, 2020

The crypto quirks using JDK's Cipher streams (and what to do about that)

In our day-to-day job we often run into the recurrent theme of transferring data (for example, files) from one location to another. It sounds like a really simple task but let us make it a bit more difficult by stating the fact that these files may contain confidential information and could be transferred over non-secure communication channels.

One of the solutions which comes to mind first is to use encryption algorithms. Since the files could be really large, hundreds of megabytes or tens of gigabytes, using the symmetric encryption scheme like AES would probably make a lot of sense. Besides just encryption it would be great to make sure that the data is not tampered in transit. Fortunately, there is a thing called authenticated encryption which simultaneously provides to us confidentiality, integrity, and authenticity guarantees. Galois/Counter Mode (GCM) is one of the most popular modes that supports authenticated encryption and could be used along with AES. These thoughts lead us to use AES256-GCM128, a sufficiently strong encryption scheme.

In case you are on JVM platform, you should feel lucky since AES and GCM are supported by Java Cryptography Architecture (JCA) out of the box. With that being said, let us see how far we could go.

The first thing we have to do is to generate a new AES256 key. As always, OWASP has a number of recommendations on using JCA/JCE APIs properly.

final SecureRandom secureRandom = new SecureRandom();
        
final byte[] key = new byte[32];
secureRandom.nextBytes(key);

final SecretKey secretKey = new SecretKeySpec(key, "AES");

Also, to initialize AES/GCM cipher we need to generate random initialization vector (or shortly, IV). As per NIST recommendations, its length should be 12 bytes (96 bits).

For IVs, it is recommended that implementations restrict support to the length of 96 bits, to promote interoperability, efficiency, and simplicity of design. - Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC

So here we are:

final byte[] iv = new byte[12];
secureRandom.nextBytes(iv);

Having the AES key and IV ready, we could create a cipher instance and actually perform the encryption part. Dealing with large files assumes the reliance on streaming, therefore we use BufferedInputStream / BufferedOutputStream combined with CipherOutputStream for encryption.

public static void encrypt(SecretKey secretKey, byte[] iv, final File input, 
        final File output) throws Throwable {

    final Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    final GCMParameterSpec parameterSpec = new GCMParameterSpec(128, iv);
    cipher.init(Cipher.ENCRYPT_MODE, secretKey, parameterSpec);

    try (final BufferedInputStream in = new BufferedInputStream(new FileInputStream(input))) {
        try (final BufferedOutputStream out = new BufferedOutputStream(new CipherOutputStream(new FileOutputStream(output), cipher))) {
            int length = 0;
            byte[] bytes = new byte[16 * 1024];

            while ((length = in.read(bytes)) != -1) {
                out.write(bytes, 0, length);
            }
        }
    }
}

Please note how we specify GCM cipher parameters with the tag size of 128 bits and initialize it in encryption mode (be aware of some GCM limitations when dealing with files over 64Gb). The decryption part is no different besides the fact the cipher is initialized in decryption mode.

public static void decrypt(SecretKey secretKey, byte[] iv, final File input, 
        final File output) throws Throwable {

    final Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    final GCMParameterSpec parameterSpec = new GCMParameterSpec(128, iv);
    cipher.init(Cipher.DECRYPT_MODE, secretKey, parameterSpec);
        
    try (BufferedInputStream in = new BufferedInputStream(new CipherInputStream(new FileInputStream(input), cipher))) {
        try (BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(output))) {
            int length = 0;
            byte[] bytes = new byte[16 * 1024];
                
            while ((length = in.read(bytes)) != -1) {
                out.write(bytes, 0, length);
            }
        }
    }
}

It seems like we are done, right? Unfortunately, not really, encrypting and decrypting the small files takes just a few moments but dealing with more or less realistic data samples gives shocking results.

Mostly 8 minutes to process a ~42Mb file (and as you may guess, larger is the file, longer it takes), the quick analysis reveals that most of that time is spent while decrypting the data (please note by no means this is a benchmark, merely a test). The search for possible culprits points out to the long-standing list of issues with AES/GCM and CipherInputStream / CipherOutputStream in JCA implementation here, here, here and here.

So what are the alternatives? It seems like it is possible to sacrifice the CipherInputStream / CipherOutputStream, refactor the implementation to use ciphers directly and make the encryption / decryption work using JCA primitives. But arguably there is a better way by bringing in battle-tested BouncyCastle library.

From the implementation perspective, the solutions are looking mostly identical. Indeed, although the naming conventions are unchanged, the CipherOutputStream / CipherInputStream in the snippet below are coming from BouncyCastle.

public static void encrypt(SecretKey secretKey, byte[] iv, final File input, 
        final File output) throws Throwable {

    final GCMBlockCipher cipher = new GCMBlockCipher(new AESEngine());
    cipher.init(true, new AEADParameters(new KeyParameter(secretKey.getEncoded()), 128, iv));

    try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(input))) {
        try (BufferedOutputStream out = new BufferedOutputStream(new CipherOutputStream(new FileOutputStream(output), cipher))) {
            int length = 0;
            byte[] bytes = new byte[16 * 1024];

            while ((length = in.read(bytes)) != -1) {
                out.write(bytes, 0, length);
            }
        }
    }
}

public static void decrypt(SecretKey secretKey, byte[] iv, final File input, 
        final File output) throws Throwable {

    final GCMBlockCipher cipher = new GCMBlockCipher(new AESEngine());
    cipher.init(false, new AEADParameters(new KeyParameter(secretKey.getEncoded()), 128, iv));

    try (BufferedInputStream in = new BufferedInputStream(new CipherInputStream(new FileInputStream(input), cipher))) {
        try (BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(output))) {
            int length = 0;
            byte[] bytes = new byte[16 * 1024];
                
            while ((length = in.read(bytes)) != -1) {
                out.write(bytes, 0, length);
            }
        }
    }
}

Re-runing the previous encryption/decryption tests using BouncyCastle crypto primitives yields the completely different picture.

To be fair, the file encryption / decryption on the JVM platform looked like a solved problem at first but turned out to be full of surprising discoveries. Nonetheless, thanks to BouncyCastle, some shortcomings of JCA implementation are addressed in efficient and clean way.

Please find the complete sources available on Github.

Tuesday, February 18, 2020

In praise of the thoughful design: how property-based testing helps me to be a better developer

The developer's testing toolbox is one of these things which rarely stays unchanged. For sure, some testing practices have proven to be more valuable than others but still, we are constantly looking for better, faster and more expressive ways to test our code. Property-based testing, largely unknown to Java community, is yet another gem crafted by Haskell folks and described in QuickCheck paper.

The power of this testing technique has been quickly realized by Scala community (where the ScalaCheck library was born) and many others but the Java ecosystem has lacked the interest into adopting property-based testing for quite some time. Luckily, since the jqwik appearance, the things are slowly changing for better.

For many, it is quite difficult to grasp what property-based testing is and how it could be exploited. The excellent presentation Property-based Testing for Better Code by Jessica Kerr and comprehensive An introduction to property-based testing, Property-based Testing Patterns series of articles are excellent sources to get you hooked, but in today's post we are going to try discovering the practical side of the property-based testing for typical Java developer using jqwik.

To start with, what the name property-based testing actually implies? The first thought of every Java developer would be it aims to test all getters and setters (hello 100% coverage)? Not really, although for some data structures it could be useful. Instead, we should identify the high-level characteristics, if you will, of the component, data structure, or even individual function and efficiently test them by formulating the hypothesis.

Our first example falls into category "There and back again": serialization and deserialization into JSON representation. The class under the test is User POJO, although trivial, please notice that it has one temporal property of type OffsetDateTime.

public class User {
    private String username;
    @JsonFormat(pattern = "yyyy-MM-dd'T'HH:mm:ss[.SSS[SSS]]XXX", shape = Shape.STRING)
    private OffsetDateTime created;
    
    // ...
}

It is surprising to see how often manipulation with date/time properties are causing issues these days since everyone tries to use own representation. As you could spot, our contract is using ISO-8601 interchange format with optional milliseconds part. What we would like to make sure is that any valid instance of User could be serialized into JSON and desearialized back into Java object without loosing any date/time precision. As an exercise, let us try to express that in pseudo code first:

For any user
  Serialize user instance to JSON
  Deserialize user instance back from JSON
  Two user instances must be identical

Looks simple enough but here comes the surprising part: let us take a look at how this pseudo code projects into real test case using jqwik library. It gets as close to our pseudo code as it possibly could.

@Property
void serdes(@ForAll("users") User user) throws JsonProcessingException {
    final String json = serdes.serialize(user);

    assertThat(serdes.deserialize(json))
        .satisfies(other -> {
            assertThat(user.getUsername()).isEqualTo(other.getUsername());
            assertThat(user.getCreated().isEqual(other.getCreated())).isTrue();
        });
        
    Statistics.collect(user.getCreated().getOffset());
}

The test case reads very easy, mostly natural, but obviously, there is some background hidden behind jqwik's @Property and @ForAll annotations. Let us start from @ForAll and clear out where all these User instances are coming from. As you may guess, these instances must be generated, preferably in a randomized fashion.

For most of the built-in data types jqwik has a rich set of data providers (Arbitraries), but since we are dealing with application-specific class, we have to supply our own generation strategy. It should be able to emit User class instances with the wide range of usernames and the date/time instants for different set of timezones and offsets. Let us do a sneak peek at the provider implementation first and discuss it in details right after.

@Provide
Arbitrary<User> users() {
    final Arbitrary<String> usernames = Arbitraries.strings().alpha().ofMaxLength(64);
 
    final Arbitrary<OffsetDateTime> dates = Arbitraries
        .of(List.copyOf(ZoneId.getAvailableZoneIds()))
        .flatMap(zone -> Arbitraries
            .longs()
            .between(1266258398000L, 1897410427000L) // ~ +/- 10 years
            .unique()
            .map(epochMilli -> Instant.ofEpochMilli(epochMilli))
            .map(instant -> OffsetDateTime.from(instant.atZone(ZoneId.of(zone)))));

    return Combinators
        .combine(usernames, dates)
        .as((username, created) -> new User(username).created(created));

}

The source of usernames is easy: just random strings. The source of dates basically could be any date/time between 2010 and 2030 whereas the timezone part (thus the offset) is randomly picked from all available region-based zone identifiers. For example, below are some samples jqwik came up with.

{"username":"zrAazzaDZ","created":"2020-05-06T01:36:07.496496+03:00"}
{"username":"AZztZaZZWAaNaqagPLzZiz","created":"2023-03-20T00:48:22.737737+08:00"}
{"username":"aazGZZzaoAAEAGZUIzaaDEm","created":"2019-03-12T08:22:12.658658+04:00"}
{"username":"Ezw","created":"2011-10-28T08:07:33.542542Z"}
{"username":"AFaAzaOLAZOjsZqlaZZixZaZzyZzxrda","created":"2022-07-09T14:04:20.849849+02:00"}
{"username":"aaYeZzkhAzAazJ","created":"2016-07-22T22:20:25.162162+06:00"}
{"username":"BzkoNGzBcaWcrDaaazzCZAaaPd","created":"2020-08-12T22:23:56.902902+08:45"}
{"username":"MazNzaTZZAEhXoz","created":"2027-09-26T17:12:34.872872+11:00"}
{"username":"zqZzZYamO","created":"2023-01-10T03:16:41.879879-03:00"}
{"username":"GaaUazzldqGJZsqksRZuaNAqzANLAAlj","created":"2015-03-19T04:16:24.098098Z"}
...

By default, jqwik will run the test against 1000 different sets of parameter values (randomized User instances). The quite helpful Statistics container allows to collect whatever distribution insights you are curious about. Just in case, why not to collect the distribution by zone offsets?

    ...
    -04:00 (94) :  9.40 %
    -03:00 (76) :  7.60 %
    +02:00 (75) :  7.50 %
    -05:00 (74) :  7.40 %
    +01:00 (72) :  7.20 %
    +03:00 (69) :  6.90 %
    Z      (62) :  6.20 %
    -06:00 (54) :  5.40 %
    +11:00 (42) :  4.20 %
    -07:00 (39) :  3.90 %
    +08:00 (37) :  3.70 %
    +07:00 (34) :  3.40 %
    +10:00 (34) :  3.40 %
    +06:00 (26) :  2.60 %
    +12:00 (23) :  2.30 %
    +05:00 (23) :  2.30 %
    -08:00 (20) :  2.00 %
    ...    

Let us consider another example. Imagine at some point we decided to reimplement the equality for User class (which in Java means, overriding equals and hashCode) based on username property. With that, for any pair of User class instances the following invariants must hold true:

  • if two User instances have the same username, they are equal and must have same hash code
  • if two User instances have different usernames, they are not equal (but hash code may not necessarily be different)
It is the perfect fit for property-based testing and jqwik in particular makes such kind of tests trivial to write and maintain.

@Provide
Arbitrary<String> usernames() {
    return Arbitraries.strings().alpha().ofMaxLength(64);
}

@Property
void equals(@ForAll("usernames") String username, @ForAll("usernames") String other) {
    Assume.that(!username.equals(other));
        
    assertThat(new User(username))
        .isEqualTo(new User(username))
        .isNotEqualTo(new User(other))
        .extracting(User::hashCode)
        .isEqualTo(new User(username).hashCode());
}

The assumptions expressed through Assume allow to put additional constraints on the generated parameters since we introduce two sources of the usernames, it could happen that both of them emit the identical username at the same run so the test would fail.

The question you may be holding up to now is: what is the point? It is surely possible to test serialization / deserialization or equals/hashCode without embarking on property-based testing and using jqwik, so why even bother? Fair enough, but the answer to this question basically lies deeply in how we approach the design of our software systems.

By and large, property-based testing is heavily influenced by functional programming, not a first thing which comes into mind with respect to Java (at least, not yet), to say it mildly. The randomized generation of test data is not novel idea per se, however what property-based testing is encouraging you to do, at least in my opinion, is to think in more abstract terms, focus not on individual operations (equals, compare, add, sort, serialize, ...) but what kind of properties, characteristics, laws and/or invariants they come with to obey. It certainly feels like an alien technique, paradigm shift if you will, encourages to spend more time on designing the right thing. It does not mean that from now on all your tests must be property-based but I believe it certainly deserves the place in the front row of our testing toolboxes.

Please find the complete project sources available on Github.