SMS billing is per-segment, not per-message. A single "message" from the user's perspective can be split into multiple segments on the wire, and each segment costs money. The number of segments depends entirely on the character encoding used. Get the encoding detection wrong, and your billing engine will systematically overcharge or undercharge every message that passes through it.
Why Encoding Matters for Billing
When a message is submitted to an SMSC or SMPP gateway, it is encoded using one of two character sets: GSM-7 or UCS-2. The choice of encoding determines how many characters fit in a single SMS segment, and therefore how many segments the message is split into. Since carriers and aggregators bill per segment, the encoding directly determines the cost.
This is not a theoretical concern. We found and fixed a bug in our own SMPP server where UCS-2 messages were being billed at double the correct segment count. The root cause was a single line of code that confused bytes with characters. That one confusion meant every UCS-2 message -- every Arabic, Chinese, Japanese, Korean, or emoji-containing message -- was billed for roughly twice the segments it actually used.
GSM-7 Encoding: The Default
GSM-7 is the default encoding for SMS. It uses a 7-bit character set defined in 3GPP TS 23.038 that covers the Latin alphabet, digits, and common punctuation. Most English-language messages use GSM-7.
The segment limits for GSM-7 are:
- Single segment: up to 160 characters
- Concatenated (multi-part): up to 153 characters per segment
Why the drop from 160 to 153? When a message exceeds 160 characters and must be split into multiple segments, each segment needs a User Data Header (UDH) that tells the receiving phone how to reassemble the parts. The UDH consumes 7 characters worth of space (6 bytes, but in 7-bit encoding that maps to 7 septets), reducing the usable payload from 160 to 153 characters per segment.
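The arithmetic above can be sketched in a few lines of C. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the input is assumed to be a septet count in which extended characters have already been counted twice, and an empty message is treated as one segment. It uses integer ceiling division, which gives the same result as calling ceil() on a floating-point quotient.

```c
#include <stddef.h>

/* Septets consumed by a 6-byte UDH: 48 bits rounded up to whole septets. */
size_t gsm7_udh_septets(void)
{
    const size_t udh_bits = 6 * 8;
    return (udh_bits + 6) / 7;  /* ceiling of 48 / 7 = 7 */
}

/* Hypothetical helper: segment count for a GSM-7 message of `septets` length. */
size_t gsm7_segments(size_t septets)
{
    const size_t single_limit = 160;                              /* no UDH */
    const size_t concat_limit = single_limit - gsm7_udh_septets(); /* 153 */

    if (septets <= single_limit)
        return 1;
    /* Integer ceiling division: round up without floating point. */
    return (septets + concat_limit - 1) / concat_limit;
}
```

With these limits, 160 septets fit in one segment, 161 need two, and the two-segment ceiling is 306 septets.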
GSM-7 Extended Characters
The GSM-7 basic character set includes 128 characters. There is also an extension table that adds characters like {, }, [, ], ~, \, ^, |, and the euro sign. Each extended character uses an escape sequence and counts as 2 characters toward the segment limit. This is a common source of off-by-one errors: a message with 155 characters that includes 3 pipe symbols actually uses 155 + 3 = 158 septets, which still fits in a single segment -- but a message with 158 characters and 3 pipes uses 161 septets and spills into two segments.
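A rough sketch of septet counting with extended characters follows. It is deliberately limited: the hypothetical gsm7_septets_ascii function only handles printable ASCII plus CR and LF, so the euro sign and the accented letters in the basic set are out of scope, and it returns -1 for anything it does not handle (including the backtick, which is not part of GSM-7).

```c
#include <stddef.h>
#include <string.h>

/* Returns the GSM-7 septet count of an ASCII-only string, or -1 if the
 * string contains a byte this sketch does not handle. */
long gsm7_septets_ascii(const char *text)
{
    /* ASCII characters that live in the GSM-7 extension table. */
    static const char extended[] = "{}[]~\\^|";
    long septets = 0;

    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; ++p) {
        if (*p == '\n' || *p == '\r') {
            septets += 1;
        } else if (*p < 0x20 || *p > 0x7E || *p == '`') {
            return -1;   /* control byte, non-ASCII byte, or backtick */
        } else if (strchr(extended, *p) != NULL) {
            septets += 2;   /* escape septet + character septet */
        } else {
            septets += 1;
        }
    }
    return septets;
}
```

Feeding the result into the 160/153 limits, rather than the raw character count, is what avoids the off-by-one errors described above.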
UCS-2 Encoding: Unicode Support
UCS-2 is a 16-bit encoding that supports the Unicode Basic Multilingual Plane (BMP). It is used whenever a message contains any character outside the GSM-7 character set. This includes:
- Arabic, Hebrew, and other RTL scripts
- Chinese, Japanese, and Korean characters
- Thai, Hindi, Bengali, and other Indic scripts
- Cyrillic characters (none of which are in the GSM-7 set)
- Emoji (any emoji triggers UCS-2)
- Accented characters not in GSM-7 (many diacritics)
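One way to decide whether a message falls into any of these categories is to check every code point against the GSM-7 tables. The sketch below assumes the text arrives as valid, NUL-terminated UTF-8 and that only the default GSM-7 alphabet and its default extension table are in use (no national language shift tables). The names needs_ucs2, in_gsm7, and next_code_point are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Non-ASCII code points of the GSM-7 basic set, plus the euro sign
 * (U+20AC) from the extension table. */
static const uint32_t gsm7_non_ascii[] = {
    0x00A1, 0x00A3, 0x00A4, 0x00A5, 0x00A7, 0x00BF,
    0x00C4, 0x00C5, 0x00C6, 0x00C7, 0x00C9, 0x00D1, 0x00D6, 0x00D8,
    0x00DC, 0x00DF, 0x00E0, 0x00E4, 0x00E5, 0x00E6, 0x00E8, 0x00E9,
    0x00EC, 0x00F1, 0x00F2, 0x00F6, 0x00F8, 0x00F9, 0x00FC,
    0x0393, 0x0394, 0x0398, 0x039B, 0x039E, 0x03A0, 0x03A3, 0x03A6,
    0x03A8, 0x03A9, 0x20AC
};

static bool in_gsm7(uint32_t cp)
{
    if (cp == '\n' || cp == '\r' || cp == '\f')
        return true;
    if (cp >= 0x20 && cp <= 0x7E)
        return cp != '`';   /* backtick is the one printable ASCII gap */
    for (size_t i = 0; i < sizeof gsm7_non_ascii / sizeof gsm7_non_ascii[0]; ++i)
        if (gsm7_non_ascii[i] == cp)
            return true;
    return false;
}

/* Decodes one code point and advances *p. Assumes valid UTF-8. */
static uint32_t next_code_point(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    ++s;
    while (extra-- > 0 && *s != '\0')
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return cp;
}

/* True if any character forces the whole message into UCS-2. */
bool needs_ucs2(const char *utf8)
{
    const unsigned char *p = (const unsigned char *)utf8;
    while (*p != '\0')
        if (!in_gsm7(next_code_point(&p)))
            return true;
    return false;
}
```

Note that a single out-of-set character is enough to switch the entire message, which is why one emoji can more than double the segment count of an otherwise plain English text.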
The segment limits for UCS-2 are:
- Single segment: up to 70 characters
- Concatenated: up to 67 characters per segment
The same UDH overhead applies: concatenated UCS-2 segments lose 3 characters (6 bytes / 2 bytes per char = 3 chars) to the reassembly header.
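As a small illustration of these limits, the following sketch computes the UCS-2 segment count from a character count (16-bit units). The function name ucs2_segments is hypothetical, and the 6-byte UDH size is the one stated above.

```c
#include <stddef.h>

/* Hypothetical helper: segment count for a UCS-2 message of `chars`
 * 16-bit characters. */
size_t ucs2_segments(size_t chars)
{
    const size_t udh_chars = 6 / 2;    /* 6-byte UDH = 3 two-byte characters */
    const size_t single_limit = 70;
    const size_t concat_limit = single_limit - udh_chars;   /* 67 */

    if (chars <= single_limit)
        return 1;
    /* Integer ceiling division: round up without floating point. */
    return (chars + concat_limit - 1) / concat_limit;
}
```

The boundaries worth remembering are 70 to 71 characters (one segment to two) and 134 to 135 characters (two segments to three).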
The Byte vs. Character Trap
Here is where billing engines go wrong. In the SMPP protocol, the message_length field in a SUBMIT_SM PDU contains the byte length of the message payload, not the character count. For GSM-7 messages, the relationship depends on how the payload is carried: with one septet per byte (unpacked), the byte count equals the septet count, while packed septets fit 160 characters into 140 bytes. For UCS-2 messages, the byte count is exactly double the character count, because every UCS-2 character is 2 bytes.
If your billing engine uses message_length directly as the character count for segment calculation, it will work correctly for GSM-7 but produce wildly wrong results for UCS-2:
// WRONG: Using byte length directly
int segments = message_length <= 160 ? 1 : ceil(message_length / 153.0);
// Example: 100 UCS-2 characters = 200 bytes
// Wrong calculation: ceil(200 / 153) = 2 segments
// But the message is only 100 characters!
// Correct answer: ceil(100 / 67) = 2 segments (happens to match here)
// Example: 60 UCS-2 characters = 120 bytes
// Wrong calculation: 120 <= 160 ? 1 segment -- but this is wrong too,
// because it should use UCS-2 limits (70), not GSM-7 limits (160)
// Correct answer: 60 <= 70 ? 1 segment
The real danger is more subtle. Consider a 134-character UCS-2 message. The byte length is 268. Using GSM-7 limits: ceil(268 / 153) = 2 segments. Using correct UCS-2 limits: ceil(134 / 67) = 2 segments. Same answer -- but only by coincidence. A 140-character UCS-2 message has 280 bytes. GSM-7 math: ceil(280 / 153) = 2. UCS-2 math: ceil(140 / 67) = 3. Now the billing engine undercharges by a full segment.
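The two calculations can be put side by side in code. This is a sketch with hypothetical function names: naive_segments_from_bytes reproduces the buggy pattern, and ucs2_segments_from_bytes converts bytes to characters before applying UCS-2 limits. Both take the payload byte length as input.

```c
#include <stddef.h>

/* Integer ceiling division, used here instead of ceil() on doubles. */
static size_t ceil_div(size_t n, size_t d)
{
    return (n + d - 1) / d;
}

/* The buggy pattern: byte length fed straight into GSM-7 limits. */
size_t naive_segments_from_bytes(size_t message_length)
{
    return message_length <= 160 ? 1 : ceil_div(message_length, 153);
}

/* Encoding-aware count for a UCS-2 payload: bytes to characters first. */
size_t ucs2_segments_from_bytes(size_t message_length)
{
    size_t chars = message_length / 2;
    return chars <= 70 ? 1 : ceil_div(chars, 67);
}
```

For 268 bytes (134 characters) both functions return 2; for 280 bytes (140 characters) the naive version returns 2 while the encoding-aware version returns 3.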
The Correct Segment Calculation
Here is the correct approach. First, detect the encoding from the SMPP data_coding field. Then calculate segments using the right limits:
#include <math.h>  // for ceil()

// Detect encoding from SMPP data_coding field
// data_coding == 0x00 (SMSC default alphabet, typically GSM-7) or 0x01 (IA5/ASCII): GSM-7 limits
// data_coding == 0x08: UCS-2
int segments;
if (data_coding == 0x08) {
    // UCS-2: convert bytes to characters first
    int chars = message_length / 2;
    segments = (chars <= 70) ? 1 : (int)ceil((double)chars / 67.0);
} else {
    // GSM-7 (or Latin-1 / ASCII)
    int chars = message_length; // 1 byte per char for unpacked
    segments = (chars <= 160) ? 1 : (int)ceil((double)chars / 153.0);
}
The critical line is int chars = message_length / 2 for UCS-2. Without that division, every UCS-2 message is billed as if it has twice as many characters as it actually does.
Real-World Billing Impact
The financial impact depends on your traffic mix. If 20% of your traffic is UCS-2 (common for platforms serving Middle Eastern, Asian, or emoji-heavy markets), and the billing engine treats byte counts as character counts under GSM-7 limits, the error pattern looks like this:
- Short UCS-2 messages (1-35 chars / 2-70 bytes): Billed as 1 segment regardless -- no overcharge, because 70 bytes is below the GSM-7 single-segment limit of 160.
- Medium UCS-2 messages (36-70 chars / 72-140 bytes): Still billed as 1 segment, still correct by coincidence (under 160 bytes).
- Long UCS-2 messages (71+ chars / 142+ bytes): This is where it breaks. Messages of 71-80 characters (142-160 bytes) still pass the GSM-7 single-segment check and are billed as 1 segment instead of 2. A 100-character UCS-2 message (200 bytes) should be 2 segments, and byte math also gives 2, but only by coincidence.
- Very long UCS-2 messages (134+ chars / 268+ bytes): The two calculations agree at some lengths and disagree at others. A 201-character UCS-2 message (402 bytes) should be ceil(201/67) = 3 segments, and byte math gives ceil(402/153) = 3. A 250-character UCS-2 message (500 bytes) should be ceil(250/67) = 4 segments, and byte math gives ceil(500/153) = 4. At 335 characters (670 bytes), the correct count is ceil(335/67) = 5 and byte math gives ceil(670/153) = 5. The errors appear just past each UCS-2 segment boundary: a 202-character message (404 bytes) should be ceil(202/67) = 4 segments, but byte math gives ceil(404/153) = 3.
The worst case is messages near segment boundaries. A 71-character UCS-2 message should be 2 segments. Its byte length is 142. If the billing code applies GSM-7 rules (142 <= 160, so 1 segment), the customer is undercharged by 1 segment. A 140-character UCS-2 message should be 3 segments (140/67 is about 2.09, which rounds up to 3). Byte math: 280 exceeds 160, so the concatenated formula applies, and 280/153 is about 1.83, which rounds up to 2. Undercharged by 1 segment.
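To see how many message lengths are affected, the comparison can be run over a range of character counts. The sketch below uses hypothetical names and assumes the specific bug discussed in this section: the UCS-2 byte length is fed into GSM-7 limits.

```c
#include <stddef.h>

/* Buggy byte math: UCS-2 byte length checked against GSM-7 limits. */
static size_t buggy_count(size_t bytes)
{
    return bytes <= 160 ? 1 : (bytes + 152) / 153;
}

/* Correct UCS-2 math on the character count. */
static size_t correct_count(size_t chars)
{
    return chars <= 70 ? 1 : (chars + 66) / 67;
}

/* Counts the UCS-2 message lengths in 1..max_chars for which the buggy
 * calculation disagrees with the correct one. */
size_t count_mismatched_lengths(size_t max_chars)
{
    size_t mismatches = 0;
    for (size_t chars = 1; chars <= max_chars; ++chars)
        if (buggy_count(chars * 2) != correct_count(chars))
            ++mismatches;
    return mismatches;
}
```

Up to 335 characters, the disagreements fall in four bands: 71-80, 135-153, 202-229, and 269-306 characters. Each band starts just past a UCS-2 segment boundary, and in every one of them the buggy calculation comes out one segment low.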
Best Practices
If you operate an SMPP platform, SMS gateway, or any system that calculates SMS billing, follow these rules:
- Always check data_coding before calculating segments. Never assume GSM-7. The encoding must drive the math.
- Convert bytes to characters for UCS-2. Divide message_length by 2 before applying segment limits. This is the single most common mistake.
- Use the correct segment limits. GSM-7: 160/153. UCS-2: 70/67. Do not mix them.
- Account for GSM-7 extended characters. If you are counting characters rather than encoded bytes, remember that extended GSM-7 characters ({ } [ ] ~ \ ^ | and euro) count as 2.
- Use floating-point division with ceiling. Integer division truncates. ceil(7/6) in integer math is 1, not 2. Cast to double or float before dividing: ceil((double)chars / 67.0).
- Test with boundary values. Test at exactly 160, 161, 70, 71, 153, 154, 67, and 68 characters for both encodings. These are where bugs hide.
- Audit your CDRs. Compare the segment count in your billing CDRs against the actual message_length and data_coding values. If UCS-2 messages consistently show higher segment counts than expected, you likely have this bug.
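The CDR audit in the last rule could look something like the sketch below. The struct cdr layout and the function names are hypothetical, since real CDR schemas differ per platform. It also assumes message_length excludes any UDH bytes and that every data_coding value other than 0x08 is carried one character per byte, as in the segment calculation shown earlier.

```c
#include <stdint.h>

/* Hypothetical CDR record with only the fields the audit needs. */
struct cdr {
    uint8_t  data_coding;      /* SMPP data_coding of the submitted message */
    uint32_t message_length;   /* payload length in bytes */
    uint32_t billed_segments;  /* segment count the billing engine recorded */
};

/* Recomputes the segment count from encoding and byte length. */
static uint32_t expected_segments(uint8_t data_coding, uint32_t message_length)
{
    if (data_coding == 0x08) {
        uint32_t chars = message_length / 2;   /* UCS-2: 2 bytes per char */
        return chars <= 70 ? 1 : (chars + 66) / 67;
    }
    return message_length <= 160 ? 1 : (message_length + 152) / 153;
}

/* Positive: overbilled. Negative: underbilled. Zero: consistent. */
int32_t cdr_segment_delta(const struct cdr *record)
{
    return (int32_t)record->billed_segments
         - (int32_t)expected_segments(record->data_coding,
                                      record->message_length);
}
```

Running a check like this over a day of CDRs and grouping the non-zero deltas by data_coding is usually enough to show whether the error is specific to UCS-2 traffic.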
Character encoding in SMS is one of those areas where a small misunderstanding creates a systematic billing error that can persist for months or years without detection -- especially if your team tests primarily with English-language messages. If your platform handles any international traffic, it is worth spending an hour to verify that your segment calculation handles UCS-2 correctly. The cost of the bug is proportional to your traffic volume, and it compounds every day it goes undetected.