
How to do Real Time Sign Language Recognition with a Webcam
Proverb: “There is nothing new under the sun.” — Ecclesiastes 1:9
Introduction
This is not a new invention, nor a breakthrough that has never been seen before. It is a simple and faithful implementation of techniques that others have shared openly: using existing tools to build a real‑time sign language recognition system with a standard webcam.
The aim here is to show how to reproduce a well‑known solution using MediaPipe, PyTorch, and OpenCV. What follows is a complete walk‑through—a practical guide to creating a system that recognises static hand signs and converts them into letters, in real time.
Recognition belongs to those who first created these tools and shared the knowledge freely. This article simply gathers those ideas and shows how to use them together.
1. What We Are Building
The system captures hand landmarks from a webcam feed, converts those landmarks into numeric features, and feeds those features into a small neural network. The network classifies the features as specific hand signs that represent letters of the alphabet.
The approach works well for static signs such as the majority of the ASL alphabet. Signs that require motion (for example, the letters J and Z in ASL) need sequence analysis and are beyond this simple baseline.
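In short, the per-frame data flow is: webcam frame → MediaPipe hand landmarks (21 points, each with x, y, z) → normalised feature vector → small MLP classifier → predicted letter with a confidence score.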
2. Prerequisites
- A Linux, macOS or Windows system with a webcam.
- Python 3.10 or newer.
- Familiarity with Python virtual environments.
Install the dependencies:
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install mediapipe==0.10.14 opencv-python torch torchvision torchaudio numpy scikit-learn
On Alpine Linux you may prefer opencv-python-headless if binary wheels for opencv-python are not available; note, however, that the headless build omits GUI functions such as cv2.imshow, which the capture and demo scripts below rely on.
3. Step One: Feature Extraction
MediaPipe detects 21 hand landmarks (x, y, z). We translate and scale these points so that the wrist becomes the origin and the hand size is normalised. From these we compute:
- The flattened 3D coordinates.
- Distances between key fingertip pairs.
- A handful of finger joint angles.
These form a compact feature vector that is robust to scale and position.
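As a minimal sketch of that normalisation step (the helper name here is illustrative; the full version, including the distances and angles, appears in features.py in the walkthrough below):

import numpy as np

def normalise_landmarks(pts):
    # pts: (21, 3) array of x, y, z landmark coordinates from MediaPipe
    pts = pts - pts[0]                                   # translate: wrist (landmark 0) becomes the origin
    scale = np.max(np.linalg.norm(pts, axis=1)) + 1e-8   # distance to the farthest landmark
    return pts / scale                                   # scale: hand size normalised to 1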
4. Step Two: Data Collection
Run a simple OpenCV window that shows the webcam feed. Press a letter key (for example A), pose the hand, and press C to capture a sample. Each capture stores the feature vector and the chosen label in a CSV file.
Aim for 50 to 200 samples per sign. Vary lighting, orientation, and distance. This diversity will improve the trained model.
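For reference, the dataset produced by the capture script below is a flat CSV with 85 feature columns (63 normalised coordinates, 14 fingertip and palm distances, 8 joint angles) followed by the label. Schematically, with made-up values:

f0,f1,f2,...,f84,label
0.0,0.0,0.0,...,1.4137,A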
5. Step Three: Train the Model
A small PyTorch multi-layer perceptron (MLP) is sufficient. It takes the feature vector as input and outputs the probability of each class.
- Split the dataset into training and validation sets.
- Encode the labels.
- Train for about 50 epochs.
- Save the trained model and the label encoder.
The network is tiny and trains quickly even on a CPU.
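To make "tiny" concrete: with the 85-dimensional feature vector from Step One and, say, 24 static-letter classes, the architecture used later in train_pytorch.py has roughly 85×256 + 256 + 256×128 + 128 + 128×24 + 24 ≈ 58,000 parameters, so 50 full-batch epochs typically finish in seconds on a CPU.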
6. Step Four: Real-Time Inference
With the model saved, run the live demo:
- Capture a webcam frame.
- Detect the hand landmarks.
- Convert to features and feed into the network.
- Display the predicted letter and its confidence.
The process runs comfortably in real time on an ordinary laptop.
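If you want to check the real-time claim on your own machine, the rough standalone sketch below times only the classifier's forward pass (the layer sizes mirror the model defined later; a 24-class output is assumed, and the numbers you get will depend on your CPU):

# Timing sketch: how long does one MLP forward pass take?
import time
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(85, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 24),
)
mlp.eval()

x = torch.randn(1, 85)                    # stand-in for one feature vector
with torch.no_grad():
    for _ in range(10):                   # warm-up
        mlp(x)
    t0 = time.perf_counter()
    for _ in range(1000):
        mlp(x)
    per_call = (time.perf_counter() - t0) / 1000
print(f"~{per_call * 1000:.3f} ms per forward pass")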
7. Optional Improvements
- Debouncing: Smooth the output by requiring the same letter to be predicted over several frames before it is accepted.
- Dynamic signs: For letters or signs that involve motion, extend the system with a temporal model such as an LSTM or a sliding-window majority vote (a minimal sketch follows this list).
- Dataset variety: Gather data from multiple people and both hands to improve generalisation.
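For the dynamic-signs suggestion above, one possible shape of such a temporal model is sketched here. This is illustrative only: the class name, the 30-frame window, and the hidden size are assumptions, not part of the pipeline built in this article.

# Hypothetical temporal classifier for motion-based signs.
# It consumes a sliding window of per-frame feature vectors
# (e.g. 30 frames of the 85-dimensional features) and predicts one class.
import torch
import torch.nn as nn

class SignSequenceLSTM(nn.Module):
    def __init__(self, feat_dim=85, hidden=128, num_classes=26):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        _, (h_n, _) = self.lstm(x)         # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])          # classify from the final hidden state

# Usage sketch with a placeholder window of 30 frames:
model = SignSequenceLSTM()
logits = model(torch.randn(1, 30, 85))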
Conclusion
Many things can be done with a computer these days. Some will claim the work as their own; others will give recognition where it belongs. The method here is no secret: it is simply a practical joining of the open work of many others. The credit remains with those who built MediaPipe, OpenCV and PyTorch, and with the countless contributors who showed that sign language recognition with a webcam is possible.
There is nothing new under the sun—only the opportunity to learn and to share.
Full Code Walkthrough (Copy–Paste Ready)
The following files reproduce the pipeline end‑to‑end. Place them in a working folder (e.g., handsign/), then follow the usage steps.
0) Environment & Dependencies
Instructions follow for Alpine Linux first, then Arch Linux.
Alpine Linux (edge/testing may be needed for some wheels):
# base tools
apk add --no-cache python3 py3-pip python3-dev build-base cmake pkgconfig \
    linux-headers libstdc++
# If you want OpenCV system libs (optional, wheels usually suffice):
# apk add --no-cache opencv opencv-dev
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip wheel setuptools
# Prefer headless only if GUI libs are missing; note that opencv-python-headless
# has no cv2.imshow, so the preview windows in this guide will not open with it.
pip install mediapipe==0.10.14 opencv-python-headless torch torchvision torchaudio numpy scikit-learn
Arch Linux:
pacman -S --needed python python-pip base-devel cmake pkgconf
python -m venv venv
source venv/bin/activate
pip install --upgrade pip wheel setuptools
pip install mediapipe==0.10.14 opencv-python torch torchvision torchaudio numpy scikit-learn
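On either distribution, a quick sanity check (a minimal sketch; the version strings you see will differ) confirms that the core libraries import cleanly before you continue:

# Verify that the core dependencies import and report their versions.
import cv2
import mediapipe as mp
import numpy as np
import torch

print("OpenCV:", cv2.__version__)
print("MediaPipe:", mp.__version__)
print("NumPy:", np.__version__)
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())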
Project structure:
handsign/
├─ data/
├─ models/
├─ features.py
├─ collect_signs.py
├─ train_pytorch.py
└─ infer_realtime.py
1) features.py
Converts 21×(x,y,z) landmarks into a translation/scale‑normalised feature vector with a few distances and angles.
# features.py
import numpy as np

LANDMARK_COUNT = 21

def _pairwise_distances(points):
    idx_pairs = [
        (4, 8), (4, 12), (4, 16), (4, 20),   # thumb tip to other fingertips
        (8, 12), (12, 16), (16, 20),         # tip-to-tip
        (0, 5), (0, 9), (0, 13), (0, 17),    # wrist to MCPs
        (5, 9), (9, 13), (13, 17)            # MCP spans
    ]
    d = []
    for i, j in idx_pairs:
        d.append(np.linalg.norm(points[i] - points[j]))
    return np.array(d, dtype=np.float32)

def _finger_angle(a, b, c):
    # Angle at joint b formed by the points a-b-c, in radians.
    ba = a - b
    bc = c - b
    na = np.linalg.norm(ba) + 1e-8
    nc = np.linalg.norm(bc) + 1e-8
    cosang = np.clip(np.dot(ba, bc) / (na * nc), -1.0, 1.0)
    return np.arccos(cosang)

def _basic_angles(points):
    joints = [(5, 6, 7), (6, 7, 8),          # index base bend, tip bend
              (9, 10, 11), (10, 11, 12),     # middle
              (13, 14, 15), (14, 15, 16),    # ring
              (17, 18, 19), (18, 19, 20)]    # little
    angs = []
    for a, b, c in joints:
        angs.append(_finger_angle(points[a], points[b], points[c]))
    return np.array(angs, dtype=np.float32)

def landmarks_to_features(landmarks):
    """
    landmarks: length-63 flat list or (21, 3) array of x, y, z in image coords.
    returns: 1D float32 feature vector.
    """
    pts = np.array(landmarks, dtype=np.float32).reshape(LANDMARK_COUNT, 3)
    wrist = pts[0].copy()
    pts -= wrist                                       # translate: wrist at the origin
    scale = np.max(np.linalg.norm(pts, axis=1)) + 1e-8
    pts /= scale                                       # scale: farthest landmark at unit distance
    flat = pts.flatten()
    dists = _pairwise_distances(pts)
    angs = _basic_angles(pts)
    return np.concatenate([flat, dists, angs], axis=0)
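A quick way to confirm the resulting feature dimensionality, using random stand-in values instead of real MediaPipe landmarks:

import numpy as np
from features import landmarks_to_features

fake = np.random.rand(21, 3).astype(np.float32)   # stand-in for 21 detected landmarks
print(landmarks_to_features(fake).shape)          # expected: (85,) = 63 coords + 14 distances + 8 angles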
2) collect_signs.py
Press a letter key (A–Z) to set the current label, press C to capture a sample, and Q to quit.
# collect_signs.py
import cv2
import csv
import os
import time
import numpy as np

from features import landmarks_to_features, LANDMARK_COUNT
import mediapipe as mp

mp_hands = mp.solutions.hands

DATA_CSV = "handsign/data/dataset.csv"
os.makedirs(os.path.dirname(DATA_CSV), exist_ok=True)

def main():
    cap = cv2.VideoCapture(0)
    if not cap.isOpened():
        raise RuntimeError("Webcam not available")

    current_label = None
    last_info = "Press A–Z to set label, 'C' capture, 'Q' quit"

    # Write the CSV header once, sizing it from a dummy feature vector.
    if not os.path.exists(DATA_CSV):
        with open(DATA_CSV, "w", newline="") as f:
            writer = csv.writer(f)
            tmp_len = len(landmarks_to_features(np.zeros(LANDMARK_COUNT * 3, dtype=np.float32)))
            header = [f"f{i}" for i in range(tmp_len)] + ["label"]
            writer.writerow(header)

    with mp_hands.Hands(
        static_image_mode=False,
        max_num_hands=1,
        model_complexity=1,
        min_detection_confidence=0.6,
        min_tracking_confidence=0.6
    ) as hands, open(DATA_CSV, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.flip(frame, 1)
            h, w = frame.shape[:2]
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            res = hands.process(rgb)

            if res.multi_hand_landmarks:
                for lm in res.multi_hand_landmarks:
                    for p in lm.landmark:
                        cx, cy = int(p.x * w), int(p.y * h)
                        cv2.circle(frame, (cx, cy), 3, (0, 255, 0), -1)

            status = f"Label: {current_label if current_label else '-'} | {last_info}"
            cv2.putText(frame, status, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)
            cv2.imshow("Collect Hand Signs", frame)

            key = cv2.waitKey(1) & 0xFF
            if key == ord('q'):
                break
            elif key == ord('c'):
                if current_label and res.multi_hand_landmarks:
                    lms = res.multi_hand_landmarks[0].landmark
                    coords = []
                    for p in lms:
                        coords.extend([p.x, p.y, p.z])
                    feats = landmarks_to_features(coords).tolist()
                    writer.writerow(feats + [current_label])
                    last_info = f"Captured '{current_label}' at {time.strftime('%H:%M:%S')}"
                else:
                    last_info = "No hand or no label set."
            else:
                # Accept upper- and lower-case letter keys as labels
                # (lower-case 'c' and 'q' are reserved for capture/quit above).
                if 65 <= key <= 90 or 97 <= key <= 122:
                    current_label = chr(key).upper()
                    last_info = f"Current label: '{current_label}'"

    cap.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    main()
3) train_pytorch.py
Reads the CSV, trains a compact MLP, and saves models/hand_sign_mlp.pt and models/label_encoder.json.
# train_pytorch.py
import json
import csv
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn as nn
import torch.optim as optim

DATA_CSV = "handsign/data/dataset.csv"
MODEL_DIR = "handsign/models"
MODEL_PATH = os.path.join(MODEL_DIR, "hand_sign_mlp.pt")
ENCODER_PATH = os.path.join(MODEL_DIR, "label_encoder.json")
os.makedirs(MODEL_DIR, exist_ok=True)

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def load_dataset(path):
    X, y = [], []
    with open(path, "r") as f:
        reader = csv.reader(f)
        header = next(reader)  # skip the header row
        for row in reader:
            X.append(list(map(float, row[:-1])))
            y.append(row[-1])
    return np.array(X, dtype=np.float32), np.array(y)

def main():
    if not os.path.exists(DATA_CSV):
        raise FileNotFoundError("No dataset found. Run collect_signs.py first.")
    X, y = load_dataset(DATA_CSV)

    le = LabelEncoder()
    y_enc = le.fit_transform(y)
    num_classes = len(le.classes_)
    with open(ENCODER_PATH, "w") as f:
        json.dump({"classes": le.classes_.tolist()}, f, indent=2)

    X_train, X_val, y_train, y_val = train_test_split(
        X, y_enc, test_size=0.2, random_state=42, stratify=y_enc
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = MLP(in_dim=X.shape[1], out_dim=num_classes).to(device)
    optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()

    # The dataset is small enough to train full-batch on tensors kept on the device.
    X_train_t = torch.tensor(X_train, device=device)
    y_train_t = torch.tensor(y_train, dtype=torch.long, device=device)
    X_val_t = torch.tensor(X_val, device=device)
    y_val_t = torch.tensor(y_val, dtype=torch.long, device=device)

    best_val = 0.0
    for epoch in range(1, 51):
        model.train()
        optimizer.zero_grad()
        logits = model(X_train_t)
        loss = criterion(logits, y_train_t)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_logits = model(X_val_t)
            val_pred = val_logits.argmax(dim=1)
            val_acc = (val_pred == y_val_t).float().mean().item()

        print(f"Epoch {epoch:02d} | loss {loss.item():.4f} | val_acc {val_acc*100:.2f}%")

        if val_acc > best_val:
            best_val = val_acc
            # Keep the checkpoint with the best validation accuracy seen so far.
            torch.save({
                "state_dict": model.state_dict(),
                "in_dim": X.shape[1],
                "num_classes": num_classes
            }, MODEL_PATH)

    print(f"Best val acc: {best_val*100:.2f}% | model saved to {MODEL_PATH}")

if __name__ == "__main__":
    main()
4) infer_realtime.py
Runs the webcam demo and overlays the predicted letter and confidence. Includes debounce to print a stable letter.
# infer_realtime.py
import json
import os
import cv2
import numpy as np
import torch
import torch.nn as nn
from collections import deque, Counter

from features import landmarks_to_features, LANDMARK_COUNT
import mediapipe as mp

MODEL_DIR = "handsign/models"
MODEL_PATH = os.path.join(MODEL_DIR, "hand_sign_mlp.pt")
ENCODER_PATH = os.path.join(MODEL_DIR, "label_encoder.json")

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def load_model():
    chk = torch.load(MODEL_PATH, map_location="cpu")
    model = MLP(chk["in_dim"], chk["num_classes"])
    model.load_state_dict(chk["state_dict"])
    model.eval()
    with open(ENCODER_PATH, "r") as f:
        classes = json.load(f)["classes"]
    return model, classes

def main():
    if not os.path.exists(MODEL_PATH):
        raise FileNotFoundError("Model not found. Train with train_pytorch.py first.")
    model, classes = load_model()
    softmax = nn.Softmax(dim=1)

    # Debounce window for stable outputs.
    window = deque(maxlen=8)
    stable_label = None
    stable_needed = 5
    last_printed = None  # so each stable letter is written to stdout only once

    mp_hands = mp.solutions.hands
    cap = cv2.VideoCapture(0)
    if not cap.isOpened():
        raise RuntimeError("Webcam not available")

    with mp_hands.Hands(
        static_image_mode=False,
        max_num_hands=1,
        model_complexity=1,
        min_detection_confidence=0.6,
        min_tracking_confidence=0.6
    ) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.flip(frame, 1)
            h, w = frame.shape[:2]
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            res = hands.process(rgb)

            label_text = "No hand"
            if res.multi_hand_landmarks:
                lm = res.multi_hand_landmarks[0].landmark
                for p in lm:
                    cx, cy = int(p.x * w), int(p.y * h)
                    cv2.circle(frame, (cx, cy), 3, (0, 255, 0), -1)

                coords = []
                for p in lm:
                    coords.extend([p.x, p.y, p.z])
                feats = landmarks_to_features(coords).astype(np.float32)
                x = torch.tensor(feats, dtype=torch.float32).unsqueeze(0)
                with torch.no_grad():
                    logits = model(x)
                    probs = softmax(logits).cpu().numpy()[0]
                idx = int(np.argmax(probs))
                conf = float(probs[idx])

                # Debounce logic: only accept confident, repeated predictions.
                if conf > 0.75:
                    window.append(classes[idx])
                    counts = Counter(window)
                    most, cnt = counts.most_common(1)[0]
                    if cnt >= stable_needed:
                        stable_label = most
                else:
                    window.clear()
                    stable_label = None

                label_text = f"{classes[idx]} ({conf*100:.1f}%)"
                if stable_label:
                    # Print each new stable label to stdout once, and overlay it.
                    if stable_label != last_printed:
                        print(stable_label, end="", flush=True)
                        last_printed = stable_label
                    label_text = f"{stable_label} (stable)"

            cv2.putText(frame, label_text, (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
            cv2.imshow("Realtime Hand Sign Recognition", frame)
            key = cv2.waitKey(1) & 0xFF
            if key == ord('q'):
                break

    cap.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    main()
5) Usage
Run all commands from the directory that contains handsign/, so that the relative paths used inside the scripts (handsign/data/..., handsign/models/...) resolve correctly.
Collect data (repeat for each letter/class you want to support):
python handsign/collect_signs.py
# Press a letter to set the label (e.g., A), pose your hand, press C to capture.
# Gather ~50–200 samples per class under varied lighting and orientation.
Train the classifier:
python handsign/train_pytorch.py
Run real‑time inference with on‑screen overlay and stdout printing of stable letters:
python handsign/infer_realtime.py
Notes
- Start with static letters (leave out motion‑based letters like J and Z initially).
- For mirrored hands, consider augmenting features or collecting data from both hands.
- CPU‑only is fine; this model is small and fast.
Outro
Many things can be done with a computer these days. Some will claim the work as their own; others will give recognition where it belongs.