# dTelecom STT — Real-Time Speech-to-Text API

> Production-grade real-time speech-to-text for AI agents. Pay with USDC via x402. No API keys, no accounts.

Built on [dTelecom DePIN](https://dtelecom.org) — a decentralized real-time communication network.

## Overview

dTelecom STT is the first x402-powered real-time speech-to-text service, purpose-built for AI agents that need transcription without human onboarding.

Any agent with a USDC wallet (EVM or Solana) can discover, pay, and use this service autonomously:

1. Check pricing (`GET /pricing`)
2. Buy a session with USDC (`POST /v1/session` — x402 payment)
3. Connect via WebSocket, stream audio, receive transcriptions
4. Extend the session as needed
5. Disconnect and reconnect without losing paid time

## Why This Service

We are not a model hosting service. We are an **audio intelligence pipeline** — a complete system for turning noisy real-world audio into accurate text. Here's what makes us different from every other STT API.

### 1. Full Audio Intelligence Pipeline

Most STT APIs take raw audio and feed it directly to a model. We don't. Every audio stream goes through a multi-stage processing pipeline before any transcription happens:

- **Real-time voice activity detection** — a dedicated hardware-accelerated neural network (Silero VAD on Apple Neural Engine) detects speech with millisecond precision. No wasted compute on silence.
- **Neural noise reduction** — GTCRN deep learning denoiser removes background noise, echo, and interference in real-time. Optimized for conference calls, phone audio, and noisy environments.
- **Speech validation** — a secondary voice analysis pass confirms the detected audio is actual human speech, not keyboard clicks, music, or environmental noise. Measures speech confidence, speech-to-noise ratio, and signal strength. Rejects non-speech segments before they reach the transcription engine.
- **Intelligent audio trimming** — silence before and after speech is precisely removed with configurable padding, so the transcription engine receives clean, tight speech segments.
- **Hallucination filtering** — post-transcription filter catches and removes phantom output (repeated phrases, subtitle artifacts, filler patterns) that plague raw model output.

The result: **cleaner input → better transcription → no garbage output.**

### 2. Dual-Engine Architecture

We run two independent transcription engines simultaneously:

- **Parakeet-TDT** (speed engine) — 3-4x faster than Whisper, optimized for low latency. Supports 25 European languages natively with word-level confidence scoring.
- **Whisper** (accuracy engine) — broader language support (99+ languages) with contextual prompting that uses conversation history to improve accuracy over time.
- **Smart routing** — requests are automatically directed to the best available engine. If the speed engine is busy, the accuracy engine takes over. If both are busy, requests queue to the one that will be free soonest.

For the 25 Parakeet languages, this means **no single point of failure**: if one engine has issues, the other handles the traffic. Every other language is served by Whisper alone.

### 3. Agent-Native Access via x402

No API keys. No account creation. No billing setup. No human in the loop.

Any AI agent with a USDC wallet can:
1. Discover the service (x402 directory, MCP tools)
2. Check pricing (`GET /pricing`)
3. Buy a session (x402 micropayment)
4. Stream audio and receive transcriptions
5. Extend the session as needed
6. Disconnect and reconnect without losing paid time

This is how machine-to-machine commerce should work.

### 4. Fair Usage-Based Billing

We charge for the time you're actually connected and streaming — not wall-clock time from when you purchased.

- Network disconnection? Clock pauses. Reconnect when ready.
- Finished early? Your remaining balance is preserved for reconnection.
- Need more time? Extend your session with one HTTP call while streaming continues.

No other STT provider offers this. Deepgram, OpenAI, Google — they all charge from the moment audio is submitted.
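The usage-based clock can be pictured as summing only the intervals during which a client is connected. Below is a minimal sketch of that arithmetic, assuming each connection is metered from connect to disconnect; the function names and the interval representation are illustrative and not part of the API.

```python
def billed_seconds(connected_intervals):
    """Sum the time spent connected; gaps between intervals are not billed.

    Each interval is a (connect_time, disconnect_time) pair in seconds.
    """
    return sum(end - start for start, end in connected_intervals)


def remaining_seconds(purchased_minutes, connected_intervals):
    """Balance left after connected time is deducted, never below zero."""
    return max(0, purchased_minutes * 60 - billed_seconds(connected_intervals))


# A 5-minute session: 90 s of streaming, a 10-minute outage, then 60 s more.
intervals = [(0, 90), (690, 750)]
print(remaining_seconds(5, intervals))  # 150: the outage cost nothing
```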

### 5. Privacy — Your Audio Stays on Our Hardware

Unlike cloud STT services where audio is uploaded to shared infrastructure, our transcription engines run on dedicated hardware with local inference.

- **No cloud processing** — audio is transcribed by on-premise engines, never sent to third-party APIs or cloud GPU providers
- **Nothing stored** — audio exists only in working memory during processing and is immediately discarded. No recordings, no audio logs, no retention
- **No training** — customer audio is never used to train or improve models
- **Encrypted in transit** — all connections use TLS (WSS/HTTPS)

### 6. Competitive Pricing

Premium quality at a per-minute price below the major cloud providers:

| Provider | Price/min | dTelecom is |
|----------|-----------|-------|
| **dTelecom STT (us)** | **$0.005** | — |
| OpenAI Whisper/GPT-4o | $0.006 | 17% cheaper |
| Deepgram Nova-3 | $0.0077 | 35% cheaper |
| Google Cloud STT | $0.016 | 69% cheaper |
| Azure Real-time | $0.0167 | 70% cheaper |
| AWS Transcribe | $0.024 | 79% cheaper |
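The percentages in the last column are the saving relative to each provider's rate, that is `(their rate - our rate) / their rate`. A short sketch that reproduces them from the prices in the table (the helper name is illustrative):

```python
OUR_RATE = 0.005  # USD per minute, from the table above


def savings_percent(their_rate, our_rate=OUR_RATE):
    """Percentage by which our_rate undercuts their_rate, as a whole percent."""
    return round((their_rate - our_rate) / their_rate * 100)


competitors = {
    "OpenAI Whisper/GPT-4o": 0.006,
    "Deepgram Nova-3": 0.0077,
    "Google Cloud STT": 0.016,
    "Azure Real-time": 0.0167,
    "AWS Transcribe": 0.024,
}

for name, rate in competitors.items():
    print(f"{name}: {savings_percent(rate)}% cheaper")
```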

## Pricing

| Parameter | Value |
|-----------|-------|
| Rate | $0.005/min |
| Minimum purchase | 5 minutes ($0.025) |
| Maximum purchase | 120 minutes ($0.60) |
| Currency | USDC on Base (L2) or Solana |
| Protocol | x402 |
| Billing | Usage-based (clock pauses on disconnect) |

```http
GET /pricing
```
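For budgeting before a purchase, the cost of a session follows directly from the table above. The sketch below uses `Decimal` to avoid floating-point drift on small USDC amounts; it assumes purchases are made in whole minutes, which the table does not state explicitly, and the names are illustrative.

```python
from decimal import Decimal

RATE_USD_PER_MIN = Decimal("0.005")
MIN_MINUTES = 5
MAX_MINUTES = 120


def session_price_usd(minutes: int) -> Decimal:
    """Price of a session in USD, enforcing the documented purchase limits."""
    if not MIN_MINUTES <= minutes <= MAX_MINUTES:
        raise ValueError(
            f"minutes must be between {MIN_MINUTES} and {MAX_MINUTES}, got {minutes}"
        )
    return RATE_USD_PER_MIN * minutes


print(session_price_usd(5))    # 0.025
print(session_price_usd(120))  # 0.600
```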

## Authentication

This service uses [x402](https://www.x402.org/) for payment-based authentication. No API keys needed.

1. Agent sends `POST /v1/session` with `{minutes, language}`
2. Server returns `402 Payment Required` with payment options (Base + Solana)
3. Agent's x402 client picks a matching network, signs a USDC payment, and retries
4. Server returns `{session_id, session_key, ws_url, remaining_seconds}`
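The four steps above can be sketched as a small retry loop. This is an illustration only: the SDKs below handle the handshake for you. The `post` and `sign_payment` callables are hypothetical stand-ins for an HTTP client and a wallet signer, and the `accepts` list, the `network` field, and the `X-PAYMENT` header name are assumptions based on common x402 conventions rather than details confirmed by this document.

```python
def choose_payment_option(accepts, wallet_network):
    """Return the first offered payment option matching the wallet's network."""
    for option in accepts:
        if option.get("network") == wallet_network:
            return option
    raise LookupError(f"no payment option for network {wallet_network!r}")


def buy_session(post, sign_payment, wallet_network, minutes=5, language="en"):
    """Run the 402 handshake with a caller-supplied transport and signer.

    post(path, body, headers) -> (status_code, parsed_json)  # hypothetical
    sign_payment(option) -> str                              # hypothetical
    """
    body = {"minutes": minutes, "language": language}
    status, payload = post("/v1/session", body, {})
    if status == 402:
        # Pick the option for our network, sign it, and retry once.
        option = choose_payment_option(payload["accepts"], wallet_network)
        headers = {"X-PAYMENT": sign_payment(option)}  # header name assumed
        status, payload = post("/v1/session", body, headers)
    if status != 200:
        raise RuntimeError(f"session purchase failed with HTTP {status}")
    return payload  # session_id, session_key, ws_url, remaining_seconds
```

Injecting the transport and signer keeps the flow easy to exercise with a fake server before any real USDC is involved.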

## API Reference

### Endpoints

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `GET` | `/pricing` | None | Get pricing info |
| `GET` | `/health` | None | Service health check |
| `GET` | `/docs.md` | None | This documentation (markdown) |
| `GET` | `/.well-known/x402` | None | x402 discovery document |
| `POST` | `/v1/session` | x402 | Buy an STT session |
| `POST` | `/v1/session/extend` | x402 | Extend an active session |
| `GET` | `/v1/session/{id}/status` | None | Check session remaining time |
| `WS` | `/v1/stream` | Session key | WebSocket audio streaming |

### POST /v1/session

Buy a new STT session.

**Request:**
```json
{
  "minutes": 5,
  "language": "en"
}
```

**Response (200):**
```json
{
  "session_id": "abc-123-uuid",
  "session_key": "eyJ...",
  "ws_url": "wss://x402stt.dtelecom.org/v1/stream",
  "remaining_seconds": 300,
  "minutes": 5,
  "price_usd": "0.025000"
}
```

### POST /v1/session/extend

Add more time to an active session.

**Request:**
```json
{
  "session_id": "abc-123-uuid",
  "minutes": 5
}
```

**Response (200):**
```json
{
  "session_id": "abc-123-uuid",
  "remaining_seconds": 600,
  "minutes_added": 5,
  "price_usd": "0.025000"
}
```

### GET /v1/session/{id}/status

Check remaining time (no payment required).

**Response:**
```json
{
  "session_id": "abc-123-uuid",
  "status": "connected",
  "remaining_seconds": 245.3,
  "used_seconds": 54.7,
  "balance_seconds": 300,
  "language": "en"
}
```
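A client can poll this endpoint to decide when to top up. The sketch below works on the parsed response body; it assumes, based on the sample above, that `balance_seconds` is the total purchased time and that `used_seconds + remaining_seconds` adds up to it. The 60-second threshold and the function names are illustrative choices.

```python
def needs_extension(status: dict, threshold_seconds: float = 60.0) -> bool:
    """True when the remaining time has dropped to the threshold or below."""
    return status["remaining_seconds"] <= threshold_seconds


def is_consistent(status: dict) -> bool:
    """True when used + remaining accounts for the whole balance."""
    total = status["used_seconds"] + status["remaining_seconds"]
    return abs(total - status["balance_seconds"]) < 1e-6


status = {
    "session_id": "abc-123-uuid",
    "status": "connected",
    "remaining_seconds": 245.3,
    "used_seconds": 54.7,
    "balance_seconds": 300,
    "language": "en",
}
print(needs_extension(status))  # False: more than 60 s left
print(is_consistent(status))    # True
```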

## WebSocket Protocol

### Connection

Connect to `wss://x402stt.dtelecom.org/v1/stream` and send a config message:

```json
{"type": "config", "language": "en", "session_key": "eyJ..."}
```

Server responds with:

```json
{"type": "ready", "remaining_seconds": 300}
```
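The opening exchange is two small JSON messages. A sketch of building the first and checking the second, independent of any particular WebSocket library (the helper names are illustrative):

```python
import json


def build_config_message(session_key: str, language: str = "en") -> str:
    """Serialize the config message sent right after connecting."""
    return json.dumps(
        {"type": "config", "language": language, "session_key": session_key}
    )


def parse_ready(raw: str) -> float:
    """Return remaining_seconds from the server's ready message.

    Raises RuntimeError if the server answered with an error message,
    and ValueError for any other unexpected message type.
    """
    message = json.loads(raw)
    if message.get("type") == "error":
        raise RuntimeError(message.get("message", "unknown error"))
    if message.get("type") != "ready":
        raise ValueError(f"expected 'ready', got {message.get('type')!r}")
    return message["remaining_seconds"]


print(build_config_message("eyJ..."))
print(parse_ready('{"type": "ready", "remaining_seconds": 300}'))  # 300
```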

### Sending Audio

Send raw PCM16 audio as binary WebSocket frames:
- **Format:** PCM16 (signed 16-bit little-endian)
- **Sample rate:** 16000 Hz
- **Channels:** 1 (mono)
- **Recommended chunk size:** 20ms (640 bytes)
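The 640-byte figure follows from the format: 16000 samples per second × 0.020 s × 2 bytes per sample × 1 channel. A sketch that derives the chunk size and slices a PCM buffer into frames ready to send (names are illustrative):

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # PCM16
CHANNELS = 1          # mono


def chunk_size_bytes(chunk_ms: int = 20) -> int:
    """Bytes of PCM16 mono audio covering chunk_ms milliseconds."""
    samples = SAMPLE_RATE_HZ * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def iter_chunks(pcm: bytes, chunk_ms: int = 20):
    """Yield consecutive chunks; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(iter_chunks(one_second_of_silence))
print(chunk_size_bytes(), len(chunks))  # 640 50
```

When streaming a file rather than live audio, pacing the sends at roughly one chunk per 20 ms keeps the stream close to real time; whether the server requires this is not stated here.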

### Receiving Transcriptions

```json
{
  "type": "transcription",
  "text": "Hello, how are you?",
  "start": 1.5,
  "end": 2.8,
  "confidence": 0.95,
  "is_final": true
}
```
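On the client side, each text frame can be parsed into a typed record. A minimal sketch, assuming every transcription message carries exactly the fields shown above; the `Transcription` class is illustrative and is not the SDK's type.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcription:
    text: str
    start: float       # seconds from the start of the stream
    end: float
    confidence: float
    is_final: bool


def parse_transcription(raw: str) -> Optional[Transcription]:
    """Return a Transcription, or None if the message is of another type."""
    message = json.loads(raw)
    if message.get("type") != "transcription":
        return None
    return Transcription(
        text=message["text"],
        start=message["start"],
        end=message["end"],
        confidence=message["confidence"],
        is_final=message["is_final"],
    )


raw = ('{"type": "transcription", "text": "Hello, how are you?", '
       '"start": 1.5, "end": 2.8, "confidence": 0.95, "is_final": true}')
t = parse_transcription(raw)
print(f"[{t.start:.1f}s-{t.end:.1f}s] {t.text}")  # [1.5s-2.8s] Hello, how are you?
```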

### Session Lifecycle Messages

| Message | Direction | When |
|---------|-----------|------|
| `{"type": "session_expiring", "remaining_seconds": 60}` | Server → Client | 60s remaining |
| `{"type": "session_expiring", "remaining_seconds": 10}` | Server → Client | 10s remaining |
| `{"type": "session_extended", "remaining_seconds": N}` | Server → Client | After extend |
| `{"type": "session_expired"}` | Server → Client | Time exhausted, connection closes |
| `{"type": "error", "message": "..."}` | Server → Client | Error occurred |

## Supported Languages

99+ languages across two engines. Languages supported by Parakeet-TDT are routed there for 3-4x faster transcription; all others use Whisper.

### Parakeet-TDT Languages (Fast Engine — 25 Languages)

Based on [NVIDIA Parakeet-TDT 0.6B v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3). WER measured on FLEURS benchmark (lower is better):

| Language | Code | WER (FLEURS) | Tier |
|----------|------|-------------|------|
| Italian | it | 3.0% | Excellent |
| Spanish | es | 3.5% | Excellent |
| Portuguese | pt | 4.8% | Excellent |
| English | en | 4.9% | Excellent |
| German | de | 5.0% | Excellent |
| French | fr | 5.2% | Excellent |
| Russian | ru | 5.5% | Excellent |
| Ukrainian | uk | 6.8% | Very Good |
| Polish | pl | 7.3% | Very Good |
| Dutch | nl | 7.5% | Very Good |
| Slovak | sk | 8.8% | Good |
| Czech | cs | 11.0% | Good |
| Romanian | ro | 12.4% | Good |
| Croatian | hr | 12.5% | Good |
| Bulgarian | bg | 12.6% | Good |
| Finnish | fi | 13.2% | Good |
| Swedish | sv | 15.1% | Fair |
| Hungarian | hu | 15.7% | Fair |
| Estonian | et | 17.7% | Fair |
| Danish | da | 18.4% | Fair |
| Lithuanian | lt | 20.4% | Fair |
| Maltese | mt | 20.5% | Fair |
| Greek | el | 20.7% | Fair |
| Latvian | lv | 22.8% | Fair |
| Slovenian | sl | 24.0% | Fair |

Average: 11.97% WER across all 25 languages. Top 10 languages average 5.4% WER.

### Whisper Languages (Accuracy Engine — 99+ Languages)

Based on [OpenAI Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo). Used as fallback for Parakeet languages (when busy) and as primary engine for all other languages.

**High-resource languages (3-8% WER):** English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Dutch

**Medium-resource languages (8-15% WER):** Russian, Polish, Czech, Turkish, Arabic, Hindi, Swedish, Indonesian, Vietnamese, Ukrainian, Romanian, Hungarian, Finnish, Danish, Norwegian, Greek, Hebrew, Thai, Bulgarian, Croatian, Slovak, Catalan, Slovenian, Lithuanian, Latvian, Estonian, Serbian, Malay, Galician, Basque

**Additional supported languages:** Afrikaans, Albanian, Amharic, Armenian, Assamese, Azerbaijani, Bashkir, Belarusian, Bengali, Bosnian, Breton, Cantonese, Faroese, Georgian, Gujarati, Haitian Creole, Hausa, Hawaiian, Icelandic, Kannada, Kazakh, Khmer, Lao, Latin, Lingala, Luxembourgish, Macedonian, Malagasy, Malayalam, Maori, Marathi, Mongolian, Myanmar, Nepali, Nynorsk, Occitan, Pashto, Persian, Punjabi, Sanskrit, Shona, Sindhi, Sinhala, Somali, Sundanese, Swahili, Tagalog, Tajik, Tamil, Tatar, Telugu, Tibetan, Turkmen, Urdu, Uzbek, Welsh, Yiddish, Yoruba, and more.

### Engine Selection

| Scenario | Engine Used |
|----------|-------------|
| Parakeet language + Parakeet available | Parakeet-TDT (fastest) |
| Parakeet language + Parakeet busy | Whisper (fallback) |
| Non-Parakeet language | Whisper (always) |
| Both engines busy | Queue to next available |
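The table can be read as a small decision function. The sketch below is an illustration of the documented behavior from a client's point of view, not the server's actual router; the function name and the `"queue"` return value are hypothetical.

```python
# The 25 language codes from the Parakeet-TDT table above.
PARAKEET_LANGUAGES = frozenset({
    "it", "es", "pt", "en", "de", "fr", "ru", "uk", "pl", "nl",
    "sk", "cs", "ro", "hr", "bg", "fi", "sv", "hu", "et", "da",
    "el", "lt", "mt", "lv", "sl",
})


def select_engine(language: str, parakeet_busy: bool = False,
                  whisper_busy: bool = False) -> str:
    """Mirror the engine-selection table for one request."""
    if language not in PARAKEET_LANGUAGES:
        return "whisper"   # the only engine for this language
    if not parakeet_busy:
        return "parakeet"  # fastest path
    if not whisper_busy:
        return "whisper"   # fallback while Parakeet is busy
    return "queue"         # both busy: wait for the next free engine


print(select_engine("en"))                                         # parakeet
print(select_engine("en", parakeet_busy=True))                     # whisper
print(select_engine("ja"))                                         # whisper
print(select_engine("en", parakeet_busy=True, whisper_busy=True))  # queue
```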

## Client SDKs

### Python

```bash
pip install dtelecom-stt
```

```python
import asyncio
from dtelecom_stt import STTClient

async def main():
    # EVM wallet (Base)
    client = STTClient(private_key="0x...")
    # Or Solana wallet — detected automatically
    # client = STTClient(private_key="base58...")

    async with client.session(minutes=5, language="en") as stream:
        async for t in stream.transcribe_file("audio.wav"):
            print(f"[{t.start:.1f}s] {t.text}")

asyncio.run(main())
```

GitHub: [dTelecom/stt-client-python](https://github.com/dTelecom/stt-client-python) | PyPI: [dtelecom-stt](https://pypi.org/project/dtelecom-stt/)

### TypeScript

```bash
npm install @dtelecom/stt
```

```typescript
import { STTClient } from "@dtelecom/stt";

// EVM wallet (Base)
const client = new STTClient({ privateKey: "0x..." });
// Or Solana wallet — use async factory
// const client = await STTClient.create({ privateKey: "base58..." });

const stream = await client.session({ minutes: 5, language: "en" }).open();
for await (const t of stream.transcribeFile("audio.wav")) {
  console.log(`[${t.start?.toFixed(1)}s] ${t.text}`);
}
await stream.close();
```

GitHub: [dTelecom/stt-client-ts](https://github.com/dTelecom/stt-client-ts) | npm: [@dtelecom/stt](https://www.npmjs.com/package/@dtelecom/stt)

### MCP Server (Claude, Cursor, AI Assistants)

Use dTelecom STT directly from Claude Code, Claude Desktop, Cursor, and other MCP-compatible AI assistants.

```bash
npm install -g @dtelecom/stt-mcp
```

**Claude Code** — add to `.mcp.json` in your project:

```json
{
  "mcpServers": {
    "dtelecom-stt": {
      "command": "npx",
      "args": ["-y", "@dtelecom/stt-mcp"],
      "env": {
        "DTELECOM_PRIVATE_KEY": "YOUR_PRIVATE_KEY"
      }
    }
  }
}
```

**Claude Desktop** — add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "dtelecom-stt": {
      "command": "npx",
      "args": ["-y", "@dtelecom/stt-mcp"],
      "env": {
        "DTELECOM_PRIVATE_KEY": "YOUR_PRIVATE_KEY"
      }
    }
  }
}
```

Available tools: `transcribe_file` (transcribe WAV files), `stt_pricing` (get pricing), `stt_health` (check service health).

GitHub: [dTelecom/stt-mcp](https://github.com/dTelecom/stt-mcp) | npm: [@dtelecom/stt-mcp](https://www.npmjs.com/package/@dtelecom/stt-mcp)

## Audio Format

The service expects **PCM16, 16kHz, mono** audio. Convert with ffmpeg:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -acodec pcm_s16le output.wav
```
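Before streaming a converted file, it can be worth confirming that it really is 16 kHz, mono, 16-bit. A sketch using Python's standard `wave` module (the function name is illustrative); the demo builds a tiny WAV in memory so it runs without any input file.

```python
import io
import wave


def is_valid_stt_wav(path_or_file) -> bool:
    """True if the WAV is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path_or_file, "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # bytes per sample
        )


# Demo: write 10 ms of silence in the expected format, then validate it.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 160)
buffer.seek(0)
print(is_valid_stt_wav(buffer))  # True
```

The same function accepts a file path, for example `is_valid_stt_wav("output.wav")`.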

## Error Codes

| HTTP | Meaning |
|------|---------|
| 400 | Invalid request (bad minutes, missing fields) |
| 402 | Payment required (x402) |
| 404 | Session not found |
| 500 | Server error |

| WS Close Code | Meaning |
|----------------|---------|
| 4001 | Config timeout or invalid config |
| 4002 | Session expired or not found |
| 4003 | Authentication failed (invalid session key) |
| 4004 | Session already connected elsewhere |
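In client code, the close codes above are easiest to handle through a lookup table. The meanings below are copied from the table; the suggested follow-up for each code is an assumption about sensible client behavior, not documented server guidance.

```python
WS_CLOSE_CODES = {
    4001: ("Config timeout or invalid config",
           "send a valid config message right after connecting"),
    4002: ("Session expired or not found",
           "buy a new session"),
    4003: ("Authentication failed (invalid session key)",
           "check the session_key returned by POST /v1/session"),
    4004: ("Session already connected elsewhere",
           "close the other connection before reconnecting"),
}


def describe_close(code: int) -> str:
    """Human-readable explanation for a WebSocket close code."""
    if code not in WS_CLOSE_CODES:
        return f"Unrecognized close code {code}"
    meaning, suggestion = WS_CLOSE_CODES[code]
    return f"{meaning}; suggested next step: {suggestion}"


print(describe_close(4002))
print(describe_close(1000))
```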

## Links

- **Service:** https://x402stt.dtelecom.org
- **dTelecom DePIN:** https://dtelecom.org
- **x402 Protocol:** https://www.x402.org
- **Python SDK:** https://github.com/dTelecom/stt-client-python | [PyPI](https://pypi.org/project/dtelecom-stt/)
- **TypeScript SDK:** https://github.com/dTelecom/stt-client-ts | [npm](https://www.npmjs.com/package/@dtelecom/stt)
- **MCP Server:** https://github.com/dTelecom/stt-mcp | [npm](https://www.npmjs.com/package/@dtelecom/stt-mcp)
