UCS-2 vs GSM-7: Getting SMS Segment Counts Right

November 18, 2025 MOBITELSMS Engineering 9 min read

SMS billing is per-segment, not per-message. A single "message" from the user's perspective can be split into multiple segments on the wire, and each segment costs money. The number of segments depends entirely on the character encoding used. Get the encoding detection wrong, and your billing engine will systematically overcharge or undercharge every message that passes through it.

Why Encoding Matters for Billing

When a message is submitted to an SMSC or SMPP gateway, it is encoded using one of two character sets: GSM-7 or UCS-2. The choice of encoding determines how many characters fit in a single SMS segment, and therefore how many segments the message is split into. Since carriers and aggregators bill per segment, the encoding directly determines the cost.

This is not a theoretical concern. We found and fixed a bug in our own SMPP server where UCS-2 messages were being billed at double the correct segment count. The root cause was a single line of code that confused bytes with characters. That one confusion meant affected UCS-2 messages (Arabic, Chinese, Japanese, Korean, or emoji-containing) were billed for roughly twice the segments they actually used.

GSM-7 Encoding: The Default

GSM-7 is the default encoding for SMS. It uses a 7-bit character set defined in 3GPP TS 23.038 that covers the Latin alphabet, digits, and common punctuation. Most English-language messages use GSM-7.

The segment limits for GSM-7 are:

Single segment: up to 160 characters
Concatenated (multi-part): up to 153 characters per segment

Why the drop from 160 to 153? When a message exceeds 160 characters and must be split into multiple segments, each segment needs a User Data Header (UDH) that tells the receiving phone how to reassemble the parts. The UDH occupies 6 bytes (48 bits), which rounds up to 7 septets in 7-bit encoding, reducing the usable payload from 160 to 153 characters per segment.
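The 160 and 153 figures can be reproduced from the payload size. The sketch below assumes the standard 140-byte segment payload and a 6-byte concatenation UDH; the helper name chars_per_segment is hypothetical and exists only for illustration.

```c
/* Payload capacity of one SMS segment: 140 bytes = 1120 bits. */
#define SMS_PAYLOAD_BYTES 140

/* Hypothetical helper: how many characters of `bits_per_char` bits fit in
   one segment once `udh_bytes` bytes are reserved for the UDH. */
static int chars_per_segment(int bits_per_char, int udh_bytes)
{
    int usable_bits = (SMS_PAYLOAD_BYTES - udh_bytes) * 8;
    return usable_bits / bits_per_char;  /* integer division rounds down */
}
```

With 7-bit characters this gives 160 when no UDH is present and 153 with a 6-byte UDH. Passing 16 bits per character gives the corresponding limits for 2-byte encodings.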

GSM-7 Extended Characters

The GSM-7 basic character set includes 128 characters. There is also an extension table that adds characters like {, }, [, ], ~, \, ^, |, and the euro sign. Each extended character uses an escape sequence and counts as 2 characters toward the segment limit. This is a common source of off-by-one errors: a message with 155 characters that includes 3 pipe symbols actually uses 155 + 3 = 158 septets, which still fits in a single segment -- but a message with 158 characters and 3 pipes uses 161 septets and spills into two segments.
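A minimal sketch of that counting rule follows. It assumes the input is plain ASCII text that GSM-7 can already represent, so the euro sign is left out, and the function names gsm7_septets and gsm7_segments are illustrative rather than part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Counts the septets an ASCII string occupies in GSM-7.
   Assumption: every byte is a character GSM-7 can represent. */
static size_t gsm7_septets(const char *text)
{
    /* ASCII members of the extension table (euro sign omitted). */
    static const char extended[] = "{}[]~\\^|";
    size_t septets = 0;
    for (const char *p = text; *p != '\0'; ++p) {
        /* Extended characters are sent as escape + code: 2 septets. */
        septets += (strchr(extended, *p) != NULL) ? 2 : 1;
    }
    return septets;
}

/* Applies the 160 / 153 limits to a septet count. */
static int gsm7_segments(size_t septets)
{
    if (septets <= 160) {
        return 1;
    }
    return (int)((septets + 152) / 153);  /* integer ceiling of septets / 153 */
}
```

A 155-character string containing 3 pipes comes out at 158 septets and 1 segment, while 158 characters with 3 pipes comes out at 161 septets and 2 segments. A production splitter also has to avoid separating an escape from the character it introduces, which this count does not model.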

UCS-2 Encoding: Unicode Support

UCS-2 is a 16-bit encoding that supports the Unicode Basic Multilingual Plane (BMP). It is used whenever a message contains any character outside the GSM-7 character set. This includes:

Arabic, Hebrew, and other RTL scripts
Chinese, Japanese, and Korean characters
Thai, Hindi, Bengali, and other Indic scripts
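Cyrillic (the GSM-7 default alphabet contains no Cyrillic letters)
Emoji (any emoji triggers UCS-2; most lie outside the BMP and are carried as a UTF-16 surrogate pair, which counts as 2 characters)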
Accented characters not in GSM-7 (many diacritics)

The segment limits for UCS-2 are:

Single segment: up to 70 characters
Concatenated: up to 67 characters per segment

The same UDH overhead applies: concatenated UCS-2 segments lose 3 characters (6 bytes / 2 bytes per char = 3 chars) to the reassembly header.
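The "characters" in these limits are 16-bit units, which matters when the application holds text as UTF-8. The sketch below counts those units, assuming the input is valid UTF-8 and that characters outside the BMP (most emoji) are sent as two 16-bit units; utf16_units and ucs2_segments are hypothetical names.

```c
#include <stddef.h>

/* Counts the 16-bit units a UTF-8 string needs when sent as UCS-2/UTF-16.
   Assumption: `utf8` is valid UTF-8; no validation is attempted. */
static size_t utf16_units(const char *utf8)
{
    size_t units = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != 0; ++p) {
        if ((*p & 0xC0) == 0x80) {
            continue;                   /* continuation byte, already counted */
        }
        units += (*p >= 0xF0) ? 2 : 1;  /* 4-byte lead byte: surrogate pair */
    }
    return units;
}

/* Applies the 70 / 67 limits to a count of 16-bit units. */
static int ucs2_segments(size_t units)
{
    if (units <= 70) {
        return 1;
    }
    return (int)((units + 66) / 67);    /* integer ceiling of units / 67 */
}
```

Under these assumptions a single emoji such as U+1F600 uses 2 of the 70 available units, so a message of 36 such emoji already needs two segments.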

The Byte vs. Character Trap

Here is where billing engines go wrong. In the SMPP protocol, the length field of a SUBMIT_SM PDU (sm_length in the SMPP specification, message_length in the code here) contains the byte length of the message payload, not the character count. For GSM-7 messages the byte count is at least close to the character count: SMPP typically carries GSM-7 unpacked at one septet per byte, and even packed septets need 7 bytes for every 8 characters. For UCS-2 messages, the byte count is exactly double the character count, because every UCS-2 character is 2 bytes.

If your billing engine uses message_length directly as the character count for segment calculation, it will work correctly for GSM-7 but produce wildly wrong results for UCS-2:

// WRONG: Using byte length directly, with GSM-7 limits
int segments = message_length <= 160 ? 1 : (int)ceil(message_length / 153.0);

// Example: 100 UCS-2 characters = 200 bytes
// Wrong calculation: ceil(200 / 153) = 2 segments
// Correct calculation: ceil(100 / 67) = 2 segments (matches, but only by luck)

// Example: 60 UCS-2 characters = 120 bytes
// Wrong calculation: 120 <= 160, so 1 segment. The result is right but the
// reasoning is not: it compares bytes against the GSM-7 limit (160)
// instead of characters against the UCS-2 limit (70)
// Correct calculation: 60 <= 70, so 1 segment

The real danger is more subtle. Consider a 134-character UCS-2 message. The byte length is 268. Using GSM-7 limits: ceil(268 / 153) = 2 segments. Using correct UCS-2 limits: ceil(134 / 67) = 2 segments. Same answer -- but only by coincidence. A 140-character UCS-2 message has 280 bytes. GSM-7 math: ceil(280 / 153) = 2. UCS-2 math: ceil(140 / 67) = 3. Now the billing engine undercharges by a full segment.

The Correct Segment Calculation

Here is the correct approach. First, detect the encoding from the SMPP data_coding field. Then calculate segments using the right limits:

#include <math.h>  // ceil

// Detect encoding from SMPP data_coding field
// data_coding == 0x00: SMSC default alphabet (normally GSM-7)
// data_coding == 0x01: IA5 (ASCII)
// data_coding == 0x08: UCS-2
int count_segments(int data_coding, int message_length) {
    if (data_coding == 0x08) {
        // UCS-2: convert bytes to characters first
        int chars = message_length / 2;
        return (chars <= 70) ? 1 : (int)ceil((double)chars / 67.0);
    }
    // GSM-7 (or Latin-1 / ASCII)
    int chars = message_length;  // 1 byte per char for unpacked
    return (chars <= 160) ? 1 : (int)ceil((double)chars / 153.0);
}

The critical line is int chars = message_length / 2 for UCS-2. Without that division, every UCS-2 message is billed as if it has twice as many characters as it actually does: a 36-character message (72 bytes) would be charged as ceil(72 / 67) = 2 segments instead of 1.

Real-World Billing Impact

The financial impact depends on your traffic mix. Suppose 20% of your traffic is UCS-2 (common for platforms serving Middle Eastern, Asian, or emoji-heavy markets) and the billing engine treats byte counts as character counts under GSM-7 limits. The error pattern looks like this:

Short UCS-2 messages (1-35 chars / 2-70 bytes): Billed as 1 segment, which is correct, because 70 bytes is below the GSM-7 single-segment limit of 160.
Medium UCS-2 messages (36-70 chars / 72-140 bytes): Still billed as 1 segment, still correct by coincidence (under 160 bytes).
Long UCS-2 messages (71-134 chars / 142-268 bytes): This is where it breaks. Messages of 71-80 characters (142-160 bytes) should be 2 segments but pass the 160 check and are billed as 1. From 81 to 134 characters both calculations give 2, so a 100-character message (200 bytes) is billed correctly.
Very long UCS-2 messages (135+ chars / 270+ bytes): Byte math effectively allows 76.5 characters per segment (153 / 2) instead of 67, so it falls behind in a band after each true boundary: 135-153 characters are billed as 2 instead of 3, 202-229 as 3 instead of 4, and the bands widen as messages grow. In between, the results agree: 201 characters (402 bytes) gives 3 either way, 250 characters (500 bytes) gives 4, and 335 characters (670 bytes) gives 5, which is why spot checks can miss the bug.

The worst case is messages just past a segment boundary. A 71-character UCS-2 message should be 2 segments. Its byte length is 142. If the billing code applies GSM-7 rules (142 <= 160 ? 1), the customer is undercharged by 1 segment. A 140-character UCS-2 message should be 3 segments (140 / 67 = 2.09, rounded up to 3). Byte math: 280 <= 160? No. 280 / 153 = 1.83, rounded up to 2. Undercharged by 1 segment.
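One way to see exactly which lengths are affected is to compare the two calculations over a range of message sizes. The sketch below is illustrative only: naive_segments models the byte-length bug under GSM-7 limits, ucs2_segments is the character-based calculation, and all three function names are hypothetical.

```c
#include <stdio.h>

/* Buggy: treats the byte length as a character count under GSM-7 limits. */
static int naive_segments(int bytes)
{
    return bytes <= 160 ? 1 : (bytes + 152) / 153;
}

/* Correct for UCS-2: works on characters with the 70 / 67 limits. */
static int ucs2_segments(int chars)
{
    return chars <= 70 ? 1 : (chars + 66) / 67;
}

/* Prints each run of UCS-2 lengths where the two results differ and
   returns how many lengths in 1..max_chars are mis-billed. The segment
   counts printed are those of the first length in each run. */
static int report_mismatches(int max_chars)
{
    int mismatched = 0;
    int run_start = 0;  /* 0 means "not inside a mismatching run" */
    for (int chars = 1; chars <= max_chars + 1; ++chars) {
        int differs = chars <= max_chars &&
                      naive_segments(chars * 2) != ucs2_segments(chars);
        if (differs && run_start == 0) {
            run_start = chars;
        }
        if (!differs && run_start != 0) {
            printf("%d-%d chars: billed %d, correct %d\n",
                   run_start, chars - 1,
                   naive_segments(run_start * 2), ucs2_segments(run_start));
            run_start = 0;
        }
        mismatched += differs;
    }
    return mismatched;
}
```

Calling report_mismatches(250) reports three undercharged runs, 71-80, 135-153, and 202-229 characters, for a total of 57 mis-billed lengths out of 250.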

Best Practices

If you operate an SMPP platform, SMS gateway, or any system that calculates SMS billing, follow these rules:

Always check data_coding before calculating segments. Never assume GSM-7. The encoding must drive the math.
Convert bytes to characters for UCS-2. Divide message_length by 2 before applying segment limits. This is the single most common mistake.
Use the correct segment limits. GSM-7: 160/153. UCS-2: 70/67. Do not mix them.
Account for GSM-7 extended characters. If you are counting characters rather than encoded bytes, remember that extended GSM-7 characters ({ } [ ] ~ \ ^ | and euro) count as 2.
Use floating-point division with ceiling. Integer division truncates. ceil(7/6) in integer math is 1, not 2. Cast to double or float before dividing: ceil((double)chars / 67.0).
Test with boundary values. Test at exactly 160, 161, 70, 71, 153, 154, 67, and 68 characters for both encodings, and at the multi-part boundaries: 306 and 307 (2 x 153) for GSM-7, 134 and 135 (2 x 67) for UCS-2. These are where bugs hide.
Audit your CDRs. Compare the segment count in your billing CDRs against the actual message_length and data_coding values. If UCS-2 messages consistently show segment counts that differ from what those two fields imply, whether higher or lower, you likely have this bug.
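The audit step can be sketched as a small recomputation over each CDR. This example assumes a CDR exposes data_coding, message_length, and the billed segment count, treats every data_coding other than 0x08 as unpacked GSM-7, and uses an integer-only ceiling as an alternative to casting to double. All function names are hypothetical.

```c
/* Ceiling division for non-negative integers, no floating point needed. */
static int ceil_div(int n, int d)
{
    return (n + d - 1) / d;
}

/* Recomputes the expected segment count from the two CDR fields.
   Assumption: 0x08 means UCS-2, anything else is unpacked GSM-7. */
static int expected_segments(unsigned data_coding, int message_length)
{
    if (data_coding == 0x08) {
        int chars = message_length / 2;  /* bytes to characters */
        return chars <= 70 ? 1 : ceil_div(chars, 67);
    }
    return message_length <= 160 ? 1 : ceil_div(message_length, 153);
}

/* Positive: overbilled. Negative: underbilled. Zero: consistent. */
static int cdr_segment_delta(unsigned data_coding, int message_length,
                             int billed_segments)
{
    return billed_segments - expected_segments(data_coding, message_length);
}
```

Running cdr_segment_delta over a day of CDRs and grouping the non-zero results by data_coding is one quick way to see whether errors cluster on UCS-2 traffic.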

Character encoding in SMS is one of those areas where a small misunderstanding creates a systematic billing error that can persist for months or years without detection -- especially if your team tests primarily with English-language messages. If your platform handles any international traffic, it is worth spending an hour to verify that your segment calculation handles UCS-2 correctly. The cost of the bug is proportional to your traffic volume, and it compounds every day it goes undetected.