2026.05.10 8 min read

PostgreSQL Arabic NLP Backend DatabaseOptimization

Fuzzy Arabic Search Without Elasticsearch: Solving التشكيل, الألف, and المد in PostgreSQL

You need fuzzy Arabic search but don't want to introduce Elasticsearch. PostgreSQL can do it—but Arabic diacritics, character variants, and multi-schema operator issues will fight you every step of the way. This is how to win against all three.

And why you'll sleep better knowing search doesn't depend on another service.

Fuzzy Arabic Search Architecture - Three-layer solution with Tashkeel normalization, trigrams indexing, and character merging

The Problem

Why You Can't Ignore Arabic Text Search

Your users are searching for reports. They type "اجمالي" and expect to find everything matching that word. Simple, right?

Wrong.

They find nothing. Because the database contains "إجمالي المبيعات" and "اجمالي المدينة" and "أجمالي العملاء"—all slightly different in diacritics or character variants, none of which match the naked string they typed.

Your first instinct: "Elastic Search will fix this." And yes, Elasticsearch has Arabic analyzers. But now you're running another service. Another database. Another synchronization problem. Another moving part that can break at 3 AM.

PostgreSQL can solve this. It's not as famous. It's not as simple. But it's yours—and it doesn't require another service.

The constraint: You want fuzzy Arabic search that lives entirely in PostgreSQL. No external services. No distributed complexity. Just your database doing the work.

The Linguistics

Three Problems That Break String Matching

Arabic text search fails for three specific reasons. Understanding them is half the battle.

Problem 1: Tashkeel – التشكيل (Diacritical Marks)

Arabic diacritical marks—fatḥah, dammah, kasrah, sukūn, shadda—are phonetic guides. They don't change the word's meaning, but PostgreSQL treats them as different characters.

User types

"اجمالي"

7 characters, no diacritics

Database has

"إجمالي"

7 characters + hamza + diacritics

= fails. LIKE fails. substring fails. You get zero results.

Problem 2: Hamza Variants – الهمزة (Alef Chaos)

The letter Alef has four Unicode representations. They're linguistically identical.

أ U+0623 (Alef with Hamza Above)
إ U+0625 (Alef with Hamza Below)
آ U+0622 (Alef with Madda)
ٱ U+0671 (Alef Wasla)

All should match a bare ا U+0627 (Alef).

They don't. A user typing "ايرادات" won't find "أيرادات" or "إيرادات" even though they're the same word spelled four different ways.

Problem 3: Character Equivalents – تهجئة (Accepted Spelling)

Some letters have accepted replacements in different regions or styles:

ة (Teh Marbuta) can be written as ه (Heh) in informal text
ى (Alef Maqsura) can be written as ي (Yeh)

Your users expect these to match. PostgreSQL doesn't.

The reality: Out-of-the-box PostgreSQL string matching treats Arabic like ASCII. It fails immediately. You have to build the solution yourself.

The PostgreSQL Solution

🛠️ How PostgreSQL Handles This

PostgreSQL has the tools. You just need to use them correctly.

Step 1: Normalize Text with a Function

Create a deterministic SQL function that normalizes Arabic text by:

Removing all diacritical marks
Collapsing Hamza variants to bare Alef
Normalizing character equivalents

CREATE OR REPLACE FUNCTION normalize_arabic(input text)
RETURNS text
LANGUAGE sql
IMMUTABLE
STRICT
PARALLEL SAFE
AS $$
SELECT
  regexp_replace(
    regexp_replace(
      regexp_replace(
        regexp_replace(
          lower(input),
          '[ً-ٟـ]', '', 'g'),  -- Strip diacritics
        '[أإآٱ]', 'ا', 'g'),  -- Hamza → bare Alef
      'ة', 'ه', 'g'),  -- Teh Marbuta → Heh
    'ى', 'ي', 'g')  -- Alef Maqsura → Yeh
$$;

Why these keywords matter:

✓

IMMUTABLE — PostgreSQL can use this in indexes. Without it, PostgreSQL refuses to optimize because it can't guarantee the function won't change results.

✓

STRICT — Function returns NULL if any input is NULL. Safer, faster.

✓

PARALLEL SAFE — PostgreSQL can run this in parallel queries. No hidden state.

Step 2: Use pg_trgm for Fuzzy Matching

PostgreSQL's `pg_trgm` extension breaks text into 3-character trigrams and does similarity matching. It's not Arabic-specific—it works for any language.

Enable it:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

Then create a functional index on normalized columns:

CREATE INDEX idx_search_normalized_trgm
  ON your_table
  USING gin (normalize_arabic(search_column) public.gin_trgm_ops);

Now queries using the `<%` operator will use the index:

SELECT * FROM your_table
WHERE normalize_arabic(search_column) <% normalize_arabic('اجمالي');

The trigram approach: When you search for "اجمالي", PostgreSQL breaks it into trigrams: ["اجم", "جما", "ما", "الي"]. Then it finds all rows whose normalized content shares significant trigram overlap. It's forgiving of typos and close matches.

The Trap

⚠️ The Multi-Schema Operator Hell

If you run a single public schema, the above works fine. But if you operate on a custom schema (which many production systems do), you hit a wall.

The Problem: Operator Not Found

Say your tables live in a custom schema (not public). You create the `normalize_arabic` function there. But `pg_trgm` is installed in the public schema.

When you run this:

SELECT * FROM your_table
WHERE normalize_arabic(col) <% 'search_term';

PostgreSQL throws:

operator does not exist: text <% text

Why? Because the `<%` operator lives in the `public` schema, and PostgreSQL isn't searching there by default.

The Fix: Explicit Operator Qualification

In your application code (Python, Go, whatever), use the op() method to explicitly specify the operator with its schema:

# Using SQLAlchemy (Python example)
normalized_col = func.normalize_arabic(YourTable.search_column)
normalized_query = normalize_arabic_in_app('اجمالي')

# THE FIX: Explicitly call the operator from the public schema
query = query.where(
    literal(normalized_query).op("OPERATOR(public.<%)")(normalized_col)
)

Breaking it down:

literal(normalized_query) — Wrap the query string as a literal value

.op("OPERATOR(public.<%)") — Use the <% operator from the public schema

(normalized_col) — Apply it to the normalized column

Why this works: You're telling PostgreSQL exactly where to find the operator. No ambiguity. No search path guessing. It works.

The Root Cause

This happens because:

pg_trgm is a system extension installed in the public schema
Your code runs in a custom schema (for isolation, multi-tenancy, whatever)
When PostgreSQL looks for the `<%` operator, it searches your schema first, doesn't find it, and fails
The search_path setting sometimes helps, but it's unreliable across different connection contexts

Explicit qualification sidesteps the entire problem.

Implementation

🔗 Building the Complete Solution

Part 1: Python Normalizer (Application Side)

Create a pure function in your app that mirrors the SQL function:

class ArabicTextNormalizer:
    """Mirrors normalize_arabic() SQL function"""
    
    _TASHKEEL = re.compile(r"[ً-ٟـ]")
    _HAMZA_ALEF = re.compile(r"[أإآٱ]")
    _TEH_MARBUTA = re.compile(r"ة")
    _ALEF_MAQSURA = re.compile(r"ى")
    
    def normalize(self, text: str) → str:
        text = text.lower()
        text = self._TASHKEEL.sub("", text)
        text = self._HAMZA_ALEF.sub("ا", text)
        text = self._TEH_MARBUTA.sub("ه", text)
        text = self._ALEF_MAQSURA.sub("ي", text)
        return text

Design principle: No magic. No side effects. You pass in text, you get normalized text. It's testable and deterministic.

Part 2: SQL Function (Database Side)

The function we showed earlier. Keep it in sync with the Python version—same transformations, same order.

Part 3: Functional Index

Index the normalized column using trigrams. Create separate indexes if you have multiple search columns or locales:

CREATE INDEX idx_search_normalized_ar_trgm
  ON your_table
  USING gin (normalize_arabic(search_column) public.gin_trgm_ops)
  WHERE is_deleted = false;  -- Ignore soft-deleted rows

Part 4: Query Builder (The Multi-Schema Fix)

class ArabicSearchFilter:
    def build_query(self, search_term: str):
        # Step 1: Normalize the search term in Python
        normalized = self.normalizer.normalize(search_term)
        
        # Step 2: Build SQL with explicit operator qualification
        column = YourTable.search_column
        normalized_col = func.normalize_arabic(column)
        
        # Step 3: Use explicit operator path for multi-schema safety
        filter_clause = literal(normalized).op(
            "OPERATOR(public.<%)"
        )(normalized_col)
        
        return session.query(YourTable).filter(filter_clause)

This approach:

✓ Normalizes the query string in Python (for consistency)
✓ Normalizes the database column in SQL (for index usage)
✓ Uses the explicit operator path (works across any schema)
✓ Triggers the functional index (fast, even on millions of rows)

Testing

🧪 Verification That It Works

You need to test both sides:

Unit Tests (Python Normalizer)

def test_normalize_arabic():
    normalizer = ArabicTextNormalizer()
    
    # Diacritics removed
    assert normalizer.normalize("إجمالي") == "اجمالي"
    
    # Hamza variants normalized
    assert normalizer.normalize("أيرادات") == normalizer.normalize("إيرادات")
    
    # Character equivalents
    assert normalizer.normalize("مدينة") == normalizer.normalize("مدينه")

Integration Tests (Database)

async def test_fuzzy_search_arabic(db):
    # Insert test data with various forms
    await db.insert({'content': 'إجمالي المبيعات'})
    await db.insert({'content': 'أجمالي العملاء'})
    
    # Search with the naked form (what users type)
    results = await search('اجمالي')
    
    # Both rows should match
    assert len(results) == 2

Why both tests matter: Unit tests catch normalizer bugs. Integration tests catch SQL setup issues. Together, they prove the entire pipeline works.

Lessons

🎯 Key Takeaways

1. Normalization Is Linguistic, Not Technical

Most developers approach this backwards. They ask "how do I do fuzzy search in PostgreSQL?" The real question is "what does Arabic linguistic normalization look like?" Once you understand the linguistics (diacritics, Hamza variants, character equivalents), the technical solution follows naturally.

2. Multi-Schema Requires Explicit Operator Paths

If you're building systems with schema isolation (for multi-tenancy, security, whatever), never trust PostgreSQL to find operators across schemas. Use op("OPERATOR(public.<%)"). It's verbose but reliable.

3. Functional Indexes Are Your Secret Weapon

Most developers don't know about functional indexes. They index raw columns. But you can index the result of a function—in this case, normalized text. This is what makes the entire system fast.

4. Keep Python and SQL in Sync

The moment you have normalization logic in two places, you have a synchronization problem. Document the relationship. Write tests that verify both sides produce identical output. If they drift, searches break silently.

5. Don't Introduce Elasticsearch If PostgreSQL Can Do It

Elasticsearch is powerful. It's also another service. Another database. Another failure point. If PostgreSQL solves your problem—and for Arabic fuzzy search, it does—stick with PostgreSQL. Simpler systems are more reliable systems.

Building fuzzy Arabic search in PostgreSQL isn't magic. It's understanding the problem space (Arabic linguistics), knowing your database (normalization functions + trigram indexes + explicit operator paths), and not reaching for external services unnecessarily.

Your users type what they want to find. They don't care about diacritics or Hamza variants. Make it work for them. PostgreSQL can. You just have to know how.

The OPERATOR(public.<%)) pattern is your friend when things break at 2 AM. Remember it. 👇

If this was useful, let's connect. I write about engineering leadership, backend architecture, and lessons from building products at scale.

Follow on LinkedIn

Discussion

Fuzzy Arabic Search Without Elasticsearch: Solving التشكيل, الألف, and المد in PostgreSQL

Why You Can't Ignore Arabic Text Search

Three Problems That Break String Matching

Problem 1: Tashkeel – التشكيل (Diacritical Marks)

Problem 2: Hamza Variants – الهمزة (Alef Chaos)

Problem 3: Character Equivalents – تهجئة (Accepted Spelling)

🛠️ How PostgreSQL Handles This

Step 1: Normalize Text with a Function

Step 2: Use pg_trgm for Fuzzy Matching

⚠️ The Multi-Schema Operator Hell

The Problem: Operator Not Found

The Fix: Explicit Operator Qualification

The Root Cause

🔗 Building the Complete Solution

Part 1: Python Normalizer (Application Side)

Part 2: SQL Function (Database Side)

Part 3: Functional Index

Part 4: Query Builder (The Multi-Schema Fix)

🧪 Verification That It Works

Unit Tests (Python Normalizer)

Integration Tests (Database)

🎯 Key Takeaways

1. Normalization Is Linguistic, Not Technical

2. Multi-Schema Requires Explicit Operator Paths

3. Functional Indexes Are Your Secret Weapon

4. Keep Python and SQL in Sync

5. Don't Introduce Elasticsearch If PostgreSQL Can Do It

Stay in the loop