With Pith

Ethan Petuchowski

Avoiding Sign Extension

I was looking at an implementation of file-based mergesort from GitHub, and found the following snippet.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
/**
 * Author: cowtowncoder
 * https://github.com/cowtowncoder/java-merge-sort/blob/master/src/main/java/com/fasterxml/sort/std/ByteArrayComparator.java
 * 
 * Simple implementation of comparator for byte arrays which
 * will compare using <code>unsigned</code> byte values (meaning
 * that 0xFF is creator than 0x00, for example).
 */
public class ByteArrayComparator
    implements Comparator<byte[]>
{
    @Override
    public int compare(byte[] o1, byte[] o2)
    {
        final int len = Math.min(o1.length, o2.length);
        for (int i = 0; i < len; ++i) {
            // alas, sign extension means we must do masking...
            int diff = (o1[i] & 0xFF) - (o2[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return o1.length - o2.length;
    }

}

What is the meaning of the remark “// alas, sign extension means we must do masking...”? What’s the deal with the masking?

References

  1. Supressing sign extension in Java
  2. Promotion in Java

ol[i] is a byte, i.e. 8 bits. But if you try to subtract two bytes

1
2
3
byte a = 0x00;
byte b = 0xff;
byte c = a - b;

This will produce a compiler error, because Java’s - (minus) operator promotes bytes to ints before performing the -, thus - produces an int, and you can’t assign an int to a byte (without a cast, because you’d lose bits).

So what about

1
2
3
byte a = 0x00;
byte b = 0xff;
int c = a - b;

The problem is that (according to the spec in the doc comment), we want c to be negative. However note:

1
2
3
4
5
// this
int c = 0x00 - 0xff;

// becomes this
int c = 0 - 0xffffffff; // 0 - (-1) = 1 "sign extension"

So now c = 1 instead of -255.

1
2
3
4
5
6
7
8
// But this
int c = (`0x00` & 0xff) - (`0xff` & 0xff);

// becomes this
int c = (`0x00` & 0x000000ff) - (`0xff` & 0x000000ff);

// which becomes this
int c = 0 - 0x000000ff; // 0 - 255 = -255 (avoided sign extension)

So because of the masking, we avoided sign extension, and now c = -255, which is what we wanted.