Fuzzy Arabic Search Without Elasticsearch: Solving التشكيل, الألف, and المد in PostgreSQL
You need fuzzy Arabic search but don't want to introduce Elasticsearch. PostgreSQL can do it—but Arabic diacritics, character variants, and multi-schema operator issues will fight you every step of the way. This is how to win against all three.
And why you'll sleep better knowing search doesn't depend on another service.
Why You Can't Ignore Arabic Text Search
Your users are searching for reports. They type "اجمالي" and expect to find everything matching that word. Simple, right?
Wrong.
They find nothing. Because the database contains "إجمالي المبيعات" and "اجمالي المدينة" and "أجمالي العملاء"—all slightly different in diacritics or character variants, none of which match the naked string they typed.
Your first instinct: "Elastic Search will fix this." And yes, Elasticsearch has Arabic analyzers. But now you're running another service. Another database. Another synchronization problem. Another moving part that can break at 3 AM.
PostgreSQL can solve this. It's not as famous. It's not as simple. But it's yours—and it doesn't require another service.
The constraint: You want fuzzy Arabic search that lives entirely in PostgreSQL. No external services. No distributed complexity. Just your database doing the work.
Three Problems That Break String Matching
Arabic text search fails for three specific reasons. Understanding them is half the battle.
Problem 1: Tashkeel – التشكيل (Diacritical Marks)
Arabic diacritical marks—fatḥah, dammah, kasrah, sukūn, shadda—are phonetic guides. They don't change the word's meaning, but PostgreSQL treats them as different characters.
= fails. LIKE fails. substring fails. You get zero results.
Problem 2: Hamza Variants – الهمزة (Alef Chaos)
The letter Alef has four Unicode representations. They're linguistically identical.
- أ U+0623 (Alef with Hamza Above)
- إ U+0625 (Alef with Hamza Below)
- آ U+0622 (Alef with Madda)
- ٱ U+0671 (Alef Wasla)
All should match a bare ا U+0627 (Alef).
They don't. A user typing "ايرادات" won't find "أيرادات" or "إيرادات" even though they're the same word spelled four different ways.
Problem 3: Character Equivalents – تهجئة (Accepted Spelling)
Some letters have accepted replacements in different regions or styles:
- ة (Teh Marbuta) can be written as ه (Heh) in informal text
- ى (Alef Maqsura) can be written as ي (Yeh)
Your users expect these to match. PostgreSQL doesn't.
The reality: Out-of-the-box PostgreSQL string matching treats Arabic like ASCII. It fails immediately. You have to build the solution yourself.
🛠️ How PostgreSQL Handles This
PostgreSQL has the tools. You just need to use them correctly.
Step 1: Normalize Text with a Function
Create a deterministic SQL function that normalizes Arabic text by:
- Removing all diacritical marks
- Collapsing Hamza variants to bare Alef
- Normalizing character equivalents
CREATE OR REPLACE FUNCTION normalize_arabic(input text)
RETURNS text
LANGUAGE sql
IMMUTABLE
STRICT
PARALLEL SAFE
AS $$
SELECT
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
lower(input),
'[ً-ٟـ]', '', 'g'), -- Strip diacritics
'[أإآٱ]', 'ا', 'g'), -- Hamza → bare Alef
'ة', 'ه', 'g'), -- Teh Marbuta → Heh
'ى', 'ي', 'g') -- Alef Maqsura → Yeh
$$;
Why these keywords matter:
Step 2: Use pg_trgm for Fuzzy Matching
PostgreSQL's `pg_trgm` extension breaks text into 3-character trigrams and does similarity matching. It's not Arabic-specific—it works for any language.
Enable it:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
Then create a functional index on normalized columns:
CREATE INDEX idx_search_normalized_trgm
ON your_table
USING gin (normalize_arabic(search_column) public.gin_trgm_ops);
Now queries using the `<%` operator will use the index:
SELECT * FROM your_table
WHERE normalize_arabic(search_column) <% normalize_arabic('اجمالي');
The trigram approach: When you search for "اجمالي", PostgreSQL breaks it into trigrams: ["اجم", "جما", "ما", "الي"]. Then it finds all rows whose normalized content shares significant trigram overlap. It's forgiving of typos and close matches.
⚠️ The Multi-Schema Operator Hell
If you run a single public schema, the above works fine. But if you operate on a custom schema (which many production systems do), you hit a wall.
The Problem: Operator Not Found
Say your tables live in a custom schema (not public). You create the `normalize_arabic` function there. But `pg_trgm` is installed in the public schema.
When you run this:
SELECT * FROM your_table
WHERE normalize_arabic(col) <% 'search_term';
PostgreSQL throws:
operator does not exist: text <% text
Why? Because the `<%` operator lives in the `public` schema, and PostgreSQL isn't searching there by default.
The Fix: Explicit Operator Qualification
In your application code (Python, Go, whatever), use the op() method to explicitly specify the operator with its schema:
# Using SQLAlchemy (Python example)
normalized_col = func.normalize_arabic(YourTable.search_column)
normalized_query = normalize_arabic_in_app('اجمالي')
# THE FIX: Explicitly call the operator from the public schema
query = query.where(
literal(normalized_query).op("OPERATOR(public.<%)")(normalized_col)
)
Breaking it down:
literal(normalized_query) — Wrap the query string as a literal value.op("OPERATOR(public.<%)") — Use the <% operator from the public schema(normalized_col) — Apply it to the normalized columnWhy this works: You're telling PostgreSQL exactly where to find the operator. No ambiguity. No search path guessing. It works.
The Root Cause
This happens because:
pg_trgmis a system extension installed in the public schema- Your code runs in a custom schema (for isolation, multi-tenancy, whatever)
- When PostgreSQL looks for the `<%` operator, it searches your schema first, doesn't find it, and fails
- The search_path setting sometimes helps, but it's unreliable across different connection contexts
Explicit qualification sidesteps the entire problem.
🔗 Building the Complete Solution
Part 1: Python Normalizer (Application Side)
Create a pure function in your app that mirrors the SQL function:
class ArabicTextNormalizer:
"""Mirrors normalize_arabic() SQL function"""
_TASHKEEL = re.compile(r"[ً-ٟـ]")
_HAMZA_ALEF = re.compile(r"[أإآٱ]")
_TEH_MARBUTA = re.compile(r"ة")
_ALEF_MAQSURA = re.compile(r"ى")
def normalize(self, text: str) → str:
text = text.lower()
text = self._TASHKEEL.sub("", text)
text = self._HAMZA_ALEF.sub("ا", text)
text = self._TEH_MARBUTA.sub("ه", text)
text = self._ALEF_MAQSURA.sub("ي", text)
return text
Design principle: No magic. No side effects. You pass in text, you get normalized text. It's testable and deterministic.
Part 2: SQL Function (Database Side)
The function we showed earlier. Keep it in sync with the Python version—same transformations, same order.
Part 3: Functional Index
Index the normalized column using trigrams. Create separate indexes if you have multiple search columns or locales:
CREATE INDEX idx_search_normalized_ar_trgm
ON your_table
USING gin (normalize_arabic(search_column) public.gin_trgm_ops)
WHERE is_deleted = false; -- Ignore soft-deleted rows
Part 4: Query Builder (The Multi-Schema Fix)
class ArabicSearchFilter:
def build_query(self, search_term: str):
# Step 1: Normalize the search term in Python
normalized = self.normalizer.normalize(search_term)
# Step 2: Build SQL with explicit operator qualification
column = YourTable.search_column
normalized_col = func.normalize_arabic(column)
# Step 3: Use explicit operator path for multi-schema safety
filter_clause = literal(normalized).op(
"OPERATOR(public.<%)"
)(normalized_col)
return session.query(YourTable).filter(filter_clause)
This approach:
- ✓ Normalizes the query string in Python (for consistency)
- ✓ Normalizes the database column in SQL (for index usage)
- ✓ Uses the explicit operator path (works across any schema)
- ✓ Triggers the functional index (fast, even on millions of rows)
🧪 Verification That It Works
You need to test both sides:
Unit Tests (Python Normalizer)
def test_normalize_arabic():
normalizer = ArabicTextNormalizer()
# Diacritics removed
assert normalizer.normalize("إجمالي") == "اجمالي"
# Hamza variants normalized
assert normalizer.normalize("أيرادات") == normalizer.normalize("إيرادات")
# Character equivalents
assert normalizer.normalize("مدينة") == normalizer.normalize("مدينه")
Integration Tests (Database)
async def test_fuzzy_search_arabic(db):
# Insert test data with various forms
await db.insert({'content': 'إجمالي المبيعات'})
await db.insert({'content': 'أجمالي العملاء'})
# Search with the naked form (what users type)
results = await search('اجمالي')
# Both rows should match
assert len(results) == 2
Why both tests matter: Unit tests catch normalizer bugs. Integration tests catch SQL setup issues. Together, they prove the entire pipeline works.
🎯 Key Takeaways
1. Normalization Is Linguistic, Not Technical
Most developers approach this backwards. They ask "how do I do fuzzy search in PostgreSQL?" The real question is "what does Arabic linguistic normalization look like?" Once you understand the linguistics (diacritics, Hamza variants, character equivalents), the technical solution follows naturally.
2. Multi-Schema Requires Explicit Operator Paths
If you're building systems with schema isolation (for multi-tenancy, security, whatever), never trust PostgreSQL to find operators across schemas. Use op("OPERATOR(public.<%)"). It's verbose but reliable.
3. Functional Indexes Are Your Secret Weapon
Most developers don't know about functional indexes. They index raw columns. But you can index the result of a function—in this case, normalized text. This is what makes the entire system fast.
4. Keep Python and SQL in Sync
The moment you have normalization logic in two places, you have a synchronization problem. Document the relationship. Write tests that verify both sides produce identical output. If they drift, searches break silently.
5. Don't Introduce Elasticsearch If PostgreSQL Can Do It
Elasticsearch is powerful. It's also another service. Another database. Another failure point. If PostgreSQL solves your problem—and for Arabic fuzzy search, it does—stick with PostgreSQL. Simpler systems are more reliable systems.
Building fuzzy Arabic search in PostgreSQL isn't magic. It's understanding the problem space (Arabic linguistics), knowing your database (normalization functions + trigram indexes + explicit operator paths), and not reaching for external services unnecessarily.
Your users type what they want to find. They don't care about diacritics or Hamza variants. Make it work for them. PostgreSQL can. You just have to know how.
The OPERATOR(public.<%)) pattern is your friend when things break at 2 AM. Remember it. 👇
If this was useful, let's connect. I write about engineering leadership, backend architecture, and lessons from building products at scale.
Follow on LinkedIn