Customer Sensor Data Intake: Privacy First
                    Your health data deserves the highest level of protection, which is why Wake AI prioritizes HIPAA compliance in every aspect of our platform. We handle all customer data with the utmost care, ensuring that your personal health information remains secure and confidential throughout your wellness journey. Our architecture is designed to extract only the sensor data necessary for generating meaningful insights – nothing more. We believe in data minimization as a core principle, processing only what's essential to help you improve. Most importantly, we automatically remove sensor data once it's no longer needed. For example, once pose estimation frames have been extracted from your swimming videos and temporal analysis is complete, the original video files are immediately deleted. This approach ensures that your raw data doesn't linger in our systems, maintaining your privacy while still delivering the personalized insights you need to reach your goals.
                 
                
                
                    Understanding Sensor Metrics: Machine Learning
                    Imagine teaching a child to recognize patterns – you show them examples, and over time they learn to spot similarities and differences. Machine learning works the same way, but with sensor data from your swimming. When you swim, sensors capture hundreds of metrics: the angle of your elbow, the speed of your kick, the rotation of your hips. Our machine learning system assigns "weights" to each of these signals – think of weights like volume knobs that determine how important each metric is for predicting your performance. At first, these weights are random guesses. But as the system sees more of your swimming data, it adjusts these knobs up or down. Maybe it discovers that your hip rotation has a huge impact on your speed, so it turns that knob way up. Perhaps your hand entry angle matters less than expected, so that knob gets turned down. Through thousands of these tiny adjustments, the system learns exactly which combination of movements makes you faster or more efficient.
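                    To make the "volume knob" analogy concrete, here is a minimal sketch of that adjustment loop using gradient descent on a toy linear model. The metric names and numbers are hypothetical stand-ins, not Wake AI's actual training code.

```python
import numpy as np

# Hypothetical per-stroke metrics: hip rotation, kick speed, hand entry angle
X = np.array([[0.8, 1.2, 0.3],
              [0.6, 1.0, 0.4],
              [0.9, 1.3, 0.2]])      # three observed strokes
y = np.array([1.45, 1.30, 1.52])     # measured speed (m/s) for each stroke

weights = np.zeros(3)                # the "volume knobs", initially flat
lr = 0.05                            # how far each adjustment turns a knob

for _ in range(500):
    pred = X @ weights               # current guess at speed
    error = pred - y                 # how far off the guess is
    grad = X.T @ error / len(y)      # which knobs to turn, and in which direction
    weights -= lr * grad             # one tiny adjustment, repeated over and over

print(weights)   # a larger weight means that metric matters more for predicted speed
```

                    Each pass through the loop is one of the thousands of tiny adjustments described above; deep networks repeat the same idea across many stacked layers.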
                    This is where deep neural networks – the technology behind V-JEPA – truly shine. While simple machine learning might look at each sensor metric in isolation, deep neural networks are structured in layers, like a wedding cake. The first layer might notice basic patterns: "left arm moving" or "body tilting right." The next layer combines these basics into more complex insights: "initiating freestyle stroke" or "compensating for fatigue." Higher layers build even more sophisticated understanding: "stroke efficiency declining due to dropped elbow position." By stacking these layers deep, our system can understand the intricate relationships between dozens of simultaneous movements – how a slight change in your head position affects your hip rotation, which impacts your kick efficiency, which ultimately determines your speed. It's like having a coach who can watch every muscle in your body at once and understand how they all work together, then predict exactly what would happen if you adjusted any part of your technique.
                 
                
                
                    LLMs and Why They Don't Scale
                    You've probably heard about Large Language Models (LLMs) like ChatGPT. These are the neural networks we described above, but trained on massive amounts of text to predict the next word in a sentence. They're impressive at generating human-like text, but as Meta's Chief AI Scientist Yann LeCun explains, they have fundamental limitations when it comes to understanding the physical world.
                    Here's the problem: LLMs learn by predicting words, not by understanding reality. They can tell you that "water flows downhill" because they've read it millions of times, but they don't actually understand gravity or fluid dynamics. It's like memorizing every swimming technique manual ever written without ever getting in the pool – you might sound knowledgeable, but you can't actually swim or teach someone else to swim effectively.
                    More critically, LLMs don't scale efficiently. To make them better, you need exponentially more text data and computing power. There's only so much text in the world, and even if you fed an LLM every book, article, and website ever written, it still wouldn't understand how your body moves through water or why your stroke technique affects your speed. Text alone can't capture the rich, multi-dimensional nature of physical movement and wellness.
                    This is why autoregressive prediction (guessing the next word) isn't sufficient for understanding health and movement. Your body doesn't operate on language – it operates on physics, biomechanics, and complex feedback loops that can only be understood by observing actual movement, not by reading descriptions of it.
                 
                
                
                    Adapting to Change: Self-Supervised Learning
                    This is where V-JEPA 2 (Video Joint-Embedding Predictive Architecture) represents a fundamental breakthrough. Unlike LLMs that predict words, V-JEPA 2 learns by predicting what happens next in the physical world. It uses energy functions and latent representations – think of these as the AI's internal model of physics and movement – that actually scale with more video data because they're learning the underlying rules of how bodies move, not just memorizing sequences.
                    Your body changes, your goals evolve, and new wellness science emerges constantly. Traditional AI models become outdated quickly, but Wake AI stays current through V-JEPA 2 – our self-supervised learning system that continuously improves by observing patterns in movement, just like a coach who gets better with every athlete they train. Because V-JEPA 2 builds these physics-based representations rather than word predictions, it can generalize to new movements and techniques it's never seen before.
                    
                    How Wake AI Transforms Your Swimming Video into Actionable Insights
                    Here's exactly how V-JEPA 2 processes your phone video of freestyle swimming to provide insights about stroke rate, American Red Cross swimming levels, and injury risks – broken down into 25 detailed steps:
                    
                    Phase 1: Video Capture & Initial Processing (Steps 1-5)
                    
                        - Video Recording: Using your iPhone, you capture a 30-60 second video of the swim, ideally from a side angle that shows the full body moving through the water.
 
                        
                        - Secure Upload: The Body Sauce iOS app uploads your video to our HIPAA-compliant processing service, establishing a secure WebSocket connection for real-time progress updates.
 
                        
                        - Frame Extraction: The video is decomposed into individual frames at 4-30 fps depending on the swimming activity (dives use higher frame rates for safety analysis).
 
                        
                        - Spatial Patchification: Each frame is divided into 16×16 pixel patches, creating a grid of visual "tokens" – like breaking a picture into puzzle pieces that the AI can analyze.
 
                        
                        - Temporal Tubelet Formation: Patches from consecutive frames are grouped into 3D "tubelets" (2 frames × 16 pixels × 16 pixels), capturing how each region changes over time. A short code sketch after this list illustrates the patch and tubelet shapes.
 
                    
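                    As a concrete illustration of steps 4-5, here is a minimal sketch of how frames can be cut into 16×16 patches and paired into 2-frame tubelets with plain array reshaping. The 256×256 frame size and random pixel values are stand-ins, not the production pipeline.

```python
import numpy as np

frames = np.random.rand(16, 256, 256, 3)   # 16 RGB frames (stand-in for decoded video)

P = 16   # spatial patch size (16×16 pixels)
T = 2    # tubelet depth (2 consecutive frames)

n_frames, H, W, C = frames.shape
# Split each frame into a grid of 16×16 patches...
patches = frames.reshape(n_frames, H // P, P, W // P, P, C).transpose(0, 1, 3, 2, 4, 5)
# ...then pair the patches from every 2 consecutive frames into 3D tubelets.
tubelets = patches.reshape(n_frames // T, T, H // P, W // P, P, P, C)

print(patches.shape)   # (16, 16, 16, 16, 16, 3): frame, grid row, grid col, patch y, patch x, channel
print(tubelets.shape)  # (8, 2, 16, 16, 16, 16, 3): tubelet, frame-in-tubelet, row, col, patch y, patch x, channel
```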
                    
                    Phase 2: Pose Detection with MoveNet (Steps 6-10)
                    
                        - Body Keypoint Detection: MoveNet Thunder identifies 17 key body points in each frame – nose, shoulders, elbows, wrists, hips, knees, and ankles – with confidence scores.
 
                        
                        - Skeleton Construction: The detected keypoints are connected to form a skeletal representation, showing the body's posture at each moment in time.
 
                        
                        - Swimming-Specific Optimization: For swimming activities, the system applies specialized processing to better detect keypoints despite water occlusion and splashing.
 
                        
                        - Kinematic Profile Generation: For each keypoint, the system calculates displacement, velocity, and acceleration between frames, creating a detailed motion profile (a short sketch of this computation follows this list).
 
                        
                        - Confidence Filtering: Low-confidence detections (below 0.3 threshold) are filtered out to ensure only reliable pose data is used for analysis.
 
                    
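                    Steps 9-10 reduce to finite differences plus a confidence mask. Here is a minimal sketch, assuming keypoints arrive as (frame, keypoint, [y, x, confidence]) arrays in the standard 17-keypoint convention; the random values are stand-ins for real detections.

```python
import numpy as np

FPS = 30
CONF_THRESHOLD = 0.3

# Stand-in MoveNet-style output: (frames, 17 keypoints, [y, x, confidence])
keypoints = np.random.rand(90, 17, 3)

conf = keypoints[..., 2]
xy = keypoints[..., :2]

# Confidence filtering: mark low-confidence detections as unusable (NaN).
xy = np.where(conf[..., None] >= CONF_THRESHOLD, xy, np.nan)

dt = 1.0 / FPS
displacement = np.diff(xy, axis=0)              # frame-to-frame movement per keypoint
velocity = displacement / dt                    # first derivative of position
acceleration = np.diff(velocity, axis=0) / dt   # second derivative of position

# Example: mean speed of the right wrist (index 10 in the 17-keypoint convention)
wrist_speed = np.linalg.norm(velocity[:, 10, :], axis=-1)
print(np.nanmean(wrist_speed))
```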
                    
                    Phase 3: V-JEPA 2 Representation Learning (Steps 11-15)
                    
                        - Encoder Processing: The V-JEPA 2 encoder (a 1-billion parameter Vision Transformer) processes the video patches, converting them into high-dimensional feature vectors (1408 dimensions per patch).
 
                        
                        - Spatiotemporal Attention: Using 3D Rotary Position Embeddings (RoPE), the model understands how different body parts relate to each other in space and time.
 
                        
                        - Mask-Based Learning: During training, V-JEPA 2 learned by predicting masked portions of videos in representation space – not pixel space – so it captures motion patterns rather than visual details (a toy sketch of this objective follows this list).
 
                        
                        - Representation Extraction: Each video frame is encoded into a 16×16×1408 feature map, capturing both appearance and motion information in an abstract representation.
 
                        
                        - Temporal Consistency: The encoder ensures temporal coherence, meaning similar movements produce similar representations across different swimmers and conditions.
 
                    
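                    Step 13 is the key training idea: predict the features of hidden video regions and measure the error in representation space rather than in pixels. The toy sketch below uses fixed random projections as stand-ins for the real encoder and predictor networks, so it only illustrates where the loss is computed, not V-JEPA 2 itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TOKENS, TOKEN_DIM, FEATURE_DIM = 256, 512, 1408
PROJ = rng.normal(size=(TOKEN_DIM, FEATURE_DIM)) / np.sqrt(TOKEN_DIM)

def encoder(tokens):
    # Stand-in for the ViT encoder: a fixed random projection to 1408-d features.
    return tokens @ PROJ

def predictor(context_features, n_masked):
    # Stand-in for the predictor network: guesses the masked features
    # from the average of the visible (context) features.
    return np.tile(context_features.mean(axis=0), (n_masked, 1))

video_tokens = rng.normal(size=(N_TOKENS, TOKEN_DIM))   # flattened tubelet tokens
mask = rng.random(N_TOKENS) < 0.75                       # hide most of the video

targets = encoder(video_tokens[mask])      # features of the hidden regions
context = encoder(video_tokens[~mask])     # features of what the model can see
predictions = predictor(context, int(mask.sum()))

# The training signal: distance between predicted and actual features of the
# masked regions, measured in representation space rather than in pixels.
loss = np.mean((predictions - targets) ** 2)
print(loss)
```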
                    
                    Phase 4: Motion Analysis & Comparison (Steps 16-20)
                    
                        - Embedding Sequence Creation: The sequence of frame representations forms a trajectory in the high-dimensional embedding space, representing the swimmer's motion pattern.
 
                        
                        - Reference Stroke Comparison: Your embedding trajectory is compared to pre-computed reference embeddings from elite swimmers for your chosen stroke. Here's exactly how this works (a code sketch follows this phase's steps):
                            
                                - Reference Library Creation: Elite swimmers' videos are processed to create a library of "ideal" embedding trajectories for each stroke type, capturing the full range of efficient motion patterns.
 
                                - Temporal Alignment: Your embedding sequence is aligned with reference sequences using Dynamic Time Warping (DTW), which accounts for differences in swimming speed while preserving motion patterns.
 
                                - Nearest Neighbor Search: For each frame in your video, we find the K-nearest reference embeddings (typically K=5) using efficient algorithms like FAISS or Annoy.
 
                                - Distance Metrics: The L2 distance between your embedding and the nearest reference embeddings quantifies how far your technique deviates from ideal form at each moment.
 
                                - Percentile Ranking: Your distances are compared to a distribution of distances from recreational to elite swimmers, giving you a percentile score (e.g., "Your technique is better than 75% of swimmers").
 
                            
                         
                        
                        - Deviation Analysis: Cosine similarity measures how closely your technique matches optimal form. The detailed process (also sketched in code after this phase's steps):
                            
                                - Vector Normalization: Each 1408-dimensional embedding vector is normalized to unit length, making cosine similarity equivalent to dot product.
 
                                - Similarity Calculation: For each frame, compute cosine_similarity = dot_product(your_embedding, reference_embedding), yielding values between -1 and 1.
 
                                - Deviation Scoring: Convert similarity to deviation: deviation_score = 1 - cosine_similarity. Values near 0 indicate perfect form, while higher values indicate problems.
 
                                - Body Part Attribution: By analyzing which dimensions of the embedding vector contribute most to the deviation, we can identify specific body parts causing the issue (e.g., "high deviation in dimensions 256-384, which typically encode arm position").
 
                                - Temporal Patterns: Track deviation scores over time to identify when in the stroke cycle problems occur (e.g., "deviation peaks during arm recovery phase").
 
                                - Threshold-Based Alerts: If deviation exceeds 0.3 for more than 5 consecutive frames, generate specific technique feedback like "Your elbow is dropping during the pull phase."
 
                            
                         
                        
                        - Stroke Cycle Detection: Changes in embedding patterns identify individual stroke cycles. The precise mechanism (see the code sketches after this phase's steps):
                            
                                - Embedding Differentiation: Calculate the first derivative of embeddings: change_vector[t] = embedding[t+1] - embedding[t].
 
                                - Magnitude Calculation: Compute the magnitude of change: change_magnitude[t] = ||change_vector[t]||₂.
 
                                - Periodicity Detection: Apply autocorrelation to the change_magnitude signal to find the dominant period T (typically 20-40 frames for freestyle).
 
                                - Peak Finding: Use signal processing techniques (like scipy.signal.find_peaks) to identify local maxima in the change signal, which correspond to stroke transitions.
 
                                - Stroke Rate Calculation: Count peaks over time: stroke_rate = (number_of_peaks / video_duration_seconds) × 60.
 
                                - Phase Segmentation: Divide each stroke cycle into phases (catch, pull, push, recovery) by finding characteristic embedding patterns within each cycle.
 
                            
                         
                        
                        - Consistency Scoring: Variance in embedding similarities across stroke cycles reveals technique consistency. The detailed computation (sketched in code after this phase's steps):
                            
                                - Cycle Extraction: Extract N stroke cycles from your swimming, each represented as a sequence of M embeddings (typically M=30 for a complete cycle).
 
                                - Cycle Alignment: Use DTW to align all cycles to a common temporal reference, ensuring phase correspondence across cycles.
 
                                - Mean Cycle Computation: Calculate the average embedding at each phase: mean_embedding[i] = (1/N) × Σ(cycle[j].embedding[i]).
 
                                - Variance Calculation: For each phase i, compute variance: variance[i] = (1/N) × Σ(||cycle[j].embedding[i] - mean_embedding[i]||²).
 
                                - Consistency Score: Overall consistency = 1 / (1 + mean(variance)). Scores near 1 indicate highly consistent technique.
 
                                - Phase-Specific Analysis: Identify which phases have highest variance (e.g., "Your catch phase varies by 23% between strokes, suggesting timing issues").
 
                                - Fatigue Detection: Track how consistency degrades over time by comparing early vs. late stroke cycles in the video.
 
                            
                         
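                    Steps 17-18 above (reference comparison and deviation analysis) come down to a nearest-neighbour lookup plus a cosine-based deviation score. Here is a minimal sketch using FAISS, with random vectors standing in for real elite-reference and swimmer embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

DIM, K = 1408, 5

# Stand-ins for the pre-computed elite reference library and your per-frame embeddings
reference = np.random.rand(10_000, DIM).astype("float32")
yours = np.random.rand(120, DIM).astype("float32")

# Step 17: K-nearest reference embeddings by L2 distance
index = faiss.IndexFlatL2(DIM)
index.add(reference)
distances, neighbors = index.search(yours, K)         # shapes (120, 5)

# Percentile ranking against a stored distribution of swimmer-to-reference distances
population = np.random.rand(5_000) * distances.max()  # stand-in distribution
percentile = 100.0 * np.mean(population > distances.mean())   # lower distance = better

# Step 18: cosine deviation against the closest reference for each frame
def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

closest = reference[neighbors[:, 0]]
deviation = 1.0 - np.sum(unit(yours) * unit(closest), axis=-1)  # 0 = matches reference form

# Threshold-based alert: deviation above 0.3 for 5+ consecutive frames
consecutive = 0
for d in deviation:
    consecutive = consecutive + 1 if d > 0.3 else 0
    if consecutive >= 5:
        print("Technique flag: sustained deviation from reference form")
        break
print(f"Better than {percentile:.0f}% of the comparison population")
```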
                    
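                    Step 19 (stroke cycle detection and stroke rate) maps onto standard signal-processing calls. A minimal sketch with NumPy and SciPy, again using random stand-in embeddings:

```python
import numpy as np
from scipy.signal import find_peaks

FPS = 30
embeddings = np.random.rand(300, 1408)       # stand-in per-frame embeddings (10 s of video)

# First derivative of the embedding trajectory and its magnitude per frame
change = np.diff(embeddings, axis=0)
magnitude = np.linalg.norm(change, axis=1)

# Dominant period via autocorrelation, searching lags of 20-40 frames (typical for freestyle)
centered = magnitude - magnitude.mean()
autocorr = np.correlate(centered, centered, mode="full")[len(centered) - 1:]
period = 20 + int(np.argmax(autocorr[20:41]))

# Peaks in the change signal mark stroke transitions
peaks, _ = find_peaks(magnitude, distance=int(0.6 * period))

duration_s = len(embeddings) / FPS
stroke_rate = len(peaks) / duration_s * 60    # strokes per minute
print(period, stroke_rate)
```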
                    
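                    Step 20's consistency score is a variance computation over aligned stroke cycles. A minimal sketch, assuming the N cycles have already been time-aligned (e.g., with DTW) to a common length of M phases:

```python
import numpy as np

N, M, DIM = 8, 30, 1408
# Stand-in: N stroke cycles, each aligned to M phases of DIM-dimensional embeddings
cycles = np.random.rand(N, M, DIM)

mean_cycle = cycles.mean(axis=0)                      # average embedding per phase
variance = np.mean(np.sum((cycles - mean_cycle) ** 2, axis=2), axis=0)   # per-phase variance

consistency = 1.0 / (1.0 + variance.mean())           # near 1 = very repeatable technique
worst_phase = int(np.argmax(variance))                # phase with the least repeatable form

# Fatigue check: consistency of the first half of the cycles vs. the last half
early = 1.0 / (1.0 + np.mean(np.sum((cycles[:N // 2] - cycles[:N // 2].mean(0)) ** 2, axis=2)))
late = 1.0 / (1.0 + np.mean(np.sum((cycles[N // 2:] - cycles[N // 2:].mean(0)) ** 2, axis=2)))
print(consistency, worst_phase, early, late)
```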
                    Phase 5: Safety & Skill Assessment (Steps 21-25)
                    
                        - Anomaly Detection: Sudden changes in embedding patterns trigger safety alerts. The precise safety monitoring system (a code sketch follows this phase's steps):
                            
                                - Baseline Establishment: Calculate the mean and covariance of embeddings from the first 5 seconds of swimming to establish "normal" patterns.
 
                                - Mahalanobis Distance: For each new embedding, compute Mahalanobis distance from baseline: d = √((x-μ)ᵀ Σ⁻¹ (x-μ)), where μ is mean and Σ is covariance.
 
                                - Anomaly Threshold: If d > 3 standard deviations for 10+ consecutive frames, trigger anomaly alert.
 
                                - Velocity Tracking: Project embeddings onto a "forward motion" subspace (learned from training data) and track velocity: if velocity < 0.1 m/s for 3+ seconds, trigger "stopped swimming" alert.
 
                                - Vertical Position Monitoring: Specific embedding dimensions correlate with body depth; if these indicate sinking (decreasing values over 2+ seconds), trigger drowning risk alert.
 
                                - Panic Detection: Erratic embedding changes (high-frequency oscillations in embedding space) indicate panic; if embedding_variance > 5× baseline for 2+ seconds, alert lifeguard.
 
                                - Submersion Detection: Certain embedding patterns indicate full submersion; if these persist > 5 seconds without expected surfacing pattern, critical alert.
 
                            
                         
                        
                        - Skill Level Classification: Your overall embedding pattern is classified against American Red Cross swimming levels. The classification pipeline (also sketched in code after this phase's steps):
                            
                                - Feature Extraction: From your full embedding trajectory, extract a compact set of key features:
                                    
                                        - Mean embedding vector (1408 dimensions → compressed to 128 via PCA)
 
                                        - Stroke rate and variance (2 features)
 
                                        - Average deviation from reference (1 feature)
 
                                        - Consistency score (1 feature)
 
                                        - Body alignment metrics (extracted from specific embedding dimensions, 5 features)
 
                                        - Breathing pattern regularity (3 features)
 
                                        - Kick-to-stroke ratio (2 features)
 
                                    
                                 
                                - Ensemble Classification: Feed features into a pre-trained ensemble of classifiers:
                                    
                                        - Random Forest (100 trees) trained on 10,000 labeled swimming videos
 
                                        - Support Vector Machine with RBF kernel
 
                                        - Gradient Boosting classifier
 
                                    
                                 
                                - Level Mapping: Classifiers output probabilities for each Red Cross level:
                                    
                                        - Level 1: Water adjustment (can't swim 5 yards)
 
                                        - Level 2: Basic skills (can swim 5-15 yards)
 
                                        - Level 3: Stroke readiness (15-25 yards, multiple strokes)
 
                                        - Level 4: Stroke development (consistent breathing, 25+ yards)
 
                                        - Level 5: Stroke refinement (efficient technique, 100+ yards)
 
                                        - Level 6: Advanced skills (racing starts, flip turns)
 
                                    
                                 
                                - Confidence Scoring: If ensemble agreement < 70%, provide range (e.g., "Level 3-4") rather than single level.
 
                                - Skill Gap Analysis: Compare your features to level requirements to identify what's needed for advancement (e.g., "To reach Level 5, improve breathing consistency by 15%").
 
                            
                         
                        
                        - Injury Risk Identification: Repetitive patterns that deviate from biomechanically sound form are flagged. The biomechanical analysis (see the code sketch after this phase's steps):
                            
                                - Joint Angle Estimation: Specific embedding dimensions correlate with joint angles; extract these to estimate shoulder rotation, elbow flexion, hip rotation.
 
                                - Biomechanical Constraints: Compare extracted angles to safe ranges:
                                    
                                        - Shoulder rotation > 170° repeatedly → impingement risk
 
                                        - Elbow angle < 90° during pull → tennis elbow risk
 
                                        - Asymmetric hip rotation > 15° → lower back strain risk
 
                                    
                                 
                                - Repetition Counting: Track how many times risky patterns occur: risk_score = (count_risky_positions / total_frames) × repetition_factor.
 
                                - Cumulative Load Calculation: Estimate joint stress: stress = Σ(angle_deviation² × velocity × time_in_position).
 
                                - Asymmetry Detection: Compare left vs right side embeddings; if ||embedding_left - embedding_right|| > threshold for 50%+ of strokes, flag muscle imbalance.
 
                                - Fatigue-Induced Risk: Track how form degrades over time; if injury risk patterns increase by 30%+ from start to end, recommend shorter sessions.
 
                                - Risk Reporting: Generate specific warnings: "Your right shoulder shows impingement risk in 68% of strokes. Consider widening your hand entry by 6 inches."
 
                            
                         
                        
                        - Insight Generation: All metrics are synthesized into human-readable insights: stroke rate, skill level assessment, technique corrections, and injury prevention recommendations.
 
                        
                        - Continuous Learning: Your swimming patterns contribute to V-JEPA 2's ongoing learning (anonymized), helping the system better understand diverse swimming styles and improve future analyses.
 
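                    Step 21's baseline-and-distance check is standard multivariate statistics. A minimal sketch with NumPy; the embeddings are random stand-ins reduced to 64 dimensions so the covariance stays well-conditioned, a small ridge term keeps it invertible, and the "three standard deviations" rule is interpreted relative to the swimmer's own baseline distances:

```python
import numpy as np

FPS = 30
embeddings = np.random.rand(900, 64)      # stand-in: 30 s of per-frame features (reduced dims)

# Baseline "normal" pattern from the first 5 seconds of swimming
baseline = embeddings[: 5 * FPS]
mu = baseline.mean(axis=0)
cov = np.cov(baseline, rowvar=False) + 1e-6 * np.eye(baseline.shape[1])   # ridge keeps it invertible
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Interpret "3 standard deviations" against the baseline's own distance distribution
base_d = np.array([mahalanobis(f) for f in baseline])
threshold = base_d.mean() + 3 * base_d.std()

# Anomaly alert: distance stays above threshold for 10+ consecutive frames
consecutive = 0
for frame in embeddings[5 * FPS:]:
    consecutive = consecutive + 1 if mahalanobis(frame) > threshold else 0
    if consecutive >= 10:
        print("Anomaly alert: sustained deviation from this swimmer's baseline")
        break
```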
                    
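                    Step 22's ensemble vote can be sketched with scikit-learn. The features, labels, and the 70% agreement rule below are stand-ins and one interpretation of the description above, not the production classifiers:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((600, 12))            # stand-in feature vectors (see the feature list above)
y_train = rng.integers(1, 7, 600)          # stand-in Red Cross levels 1-6

models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(kernel="rbf", probability=True, random_state=0),
    GradientBoostingClassifier(random_state=0),
]
for m in models:
    m.fit(X_train, y_train)

x_you = rng.random((1, 12))                # your extracted features
probs = np.mean([m.predict_proba(x_you)[0] for m in models], axis=0)   # ensemble average
levels = models[0].classes_

best = int(np.argmax(probs))
if probs[best] < 0.70:                     # low agreement: report a range of the top two levels
    second = int(np.argsort(probs)[-2])
    lo, hi = sorted((levels[best], levels[second]))
    print(f"Level {lo}-{hi}")
else:
    print(f"Level {levels[best]}")
```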
                    
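                    Step 23's risk score and cumulative load reduce to counting frames where estimated joint angles leave safe ranges and integrating weighted deviations over time. A minimal sketch with made-up angle series and the thresholds listed above; the repetition factor is a stand-in, since the text does not pin down its exact form:

```python
import numpy as np

FPS = 30
frames = 900
# Stand-in joint-angle estimates per frame (degrees), as extracted in step 23
shoulder_rotation = np.random.uniform(120, 185, frames)
elbow_angle = np.random.uniform(70, 140, frames)
hip_asymmetry = np.random.uniform(0, 25, frames)

risky = (
    (shoulder_rotation > 170) |   # impingement risk
    (elbow_angle < 90) |          # elbow strain risk
    (hip_asymmetry > 15)          # lower-back strain risk
)

repetition_factor = 1.0 + risky.sum() / FPS         # stand-in: more repetitions, more weight
risk_score = risky.mean() * repetition_factor       # (risky frames / total frames) × repetition factor

# Cumulative load: stress = sum(angle_deviation^2 × velocity × time_in_position)
deviation = np.clip(shoulder_rotation - 170, 0, None)     # degrees beyond the safe range
velocity = np.abs(np.gradient(shoulder_rotation)) * FPS   # deg/s
stress = np.sum(deviation ** 2 * velocity * (1.0 / FPS))

print(round(float(risk_score), 2), round(float(stress)))
```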
                    Why V-JEPA 2 Excels at Swimming Analysis
                    Unlike traditional computer vision that tries to track exact pixel positions, V-JEPA 2 understands swimming at a conceptual level. It learned from over 1 million hours of video that:
                    
                        - Objects persist: A swimmer doesn't disappear underwater – they're just temporarily occluded
 
                        - Motion has patterns: Efficient swimming follows predictable biomechanical principles
 
                        - Context matters: The same arm movement means different things in freestyle vs. butterfly
 
                        - Physics applies: Bodies in water follow fluid dynamics that affect movement efficiency
 
                    
                    
                    This intuitive physics understanding – demonstrated by V-JEPA achieving 98% accuracy on physical plausibility tests – means Wake AI can distinguish between efficient technique and movements that waste energy or risk injury, all without being explicitly programmed with swimming rules.
                 
                
                
                    Beyond Video: Multimodal Sensor Integration
                    While V-JEPA 2 transforms swimming videos into actionable insights, the future of wellness intelligence lies in understanding how multiple sensors work together to paint a complete picture of your health. Just as the M3-JEPA framework demonstrates, different modalities – whether video, audio, accelerometer data, or heart rate monitors – can be aligned in a shared latent space where their relationships become clear. This isn't just about adding more data sources; it's about understanding how each sensor's unique perspective contributes to the whole story.
                    Imagine Wake AI analyzing not just your swimming video, but simultaneously processing the sound of your breathing patterns, the acceleration data from a waterproof sensor on your wrist, and your heart rate variability during different stroke phases. The multimodal architecture would use specialized encoders for each sensor type – preserving what makes each signal unique – while a lightweight mixture-of-experts connector finds the relationships between them. This means discovering insights that no single sensor could reveal: how your breathing rhythm affects your stroke efficiency, how fatigue in your heart rate data predicts form breakdown before it's visible, or how the harmony between all signals indicates your readiness for intense training.
                    The beauty of this approach is its efficiency and scalability. Each sensor's data can be pre-processed by its specialized encoder, with only the essential patterns passed to the alignment system. As you add new sensors – perhaps a continuous glucose monitor or a sleep tracking ring – Wake AI doesn't need to be retrained from scratch. It simply learns how these new signals relate to what it already knows about your body, building an increasingly complete model of your wellness that adapts as technology evolves.
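                    As a rough illustration of this direction (not the actual M3-JEPA or Wake AI architecture), the sketch below gives each sensor its own encoder and projects everything into one shared latent space where the signals can be compared; adding a new sensor means adding one more encoder rather than retraining the rest. All encoders here are random-projection stand-ins, and the sensor names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
SHARED_DIM = 256

def make_encoder(in_dim, out_dim=SHARED_DIM):
    # Stand-in for a modality-specific encoder: a fixed random projection.
    W = rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ W

# One specialized encoder per sensor, each keeping its own input format.
encoders = {
    "video": (make_encoder(1408), 1408),   # V-JEPA 2 frame features
    "audio": (make_encoder(128), 128),     # breathing-sound features
    "wrist_imu": (make_encoder(6), 6),     # accelerometer + gyroscope
    "heart_rate": (make_encoder(4), 4),    # heart-rate variability features
}

def to_unit(v):
    return v / np.linalg.norm(v)

# One moment in time from each sensor, projected into the shared latent space.
latents = {name: to_unit(enc(rng.normal(size=dim))) for name, (enc, dim) in encoders.items()}

# Pairwise cosine similarity: how strongly the different signals "agree" right now.
names = list(latents)
agreement = np.mean([latents[a] @ latents[b]
                     for i, a in enumerate(names) for b in names[i + 1:]])
print(round(float(agreement), 3))

# Adding a new sensor later only means registering one more encoder; nothing else is retrained.
encoders["glucose"] = (make_encoder(2), 2)
```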