Lesson 8.1: The AI Stack - Security experts

Module: 8 – AI & Advanced Analytics
Prerequisites: Lesson 3.1 (Camera Hardware) & Lesson 7.4 (Cloud/Edge)
Estimated Time: 45–60 Minutes

By the end of this lesson, you will be able to:

Deconstruct the “AI Stack” into its three layers: Capture, Inference, and Action.
Differentiate between Pixel Motion Detection (Old School) and Object Classification (New School).
Define “Training” vs. “Inference” and explain why the camera only does one of them.
Compare Edge AI (On-Camera) vs. Server AI (On-Premise) vs. Cloud AI.

To understand AI, you must understand what we are replacing.

Pixel Motion (The Old Way):
- How it worked: The camera compared each frame to the previous one. If the share of pixels that changed exceeded a configured sensitivity threshold, it screamed “ALARM!”
- The Failure: Trees blowing in the wind, moving shadows, and passing headlights all change pixels. In busy outdoor scenes, the large majority of these alarms were false.
Computer Vision / AI (The New Way):
- How it works: The camera still reads pixels, but instead of counting how many changed, it looks for the Shapes and Patterns they form.
- The Result: It ignores the moving tree because it concludes: “That is a tree, not a human.”
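
To make the weakness of pixel motion concrete, here is a minimal sketch of frame differencing in plain Python. The function names, the `pixel_delta` brightness tolerance, and the `trigger_fraction` threshold are illustrative assumptions, not values from any particular camera. Note that the code has no idea what moved; a shadow and a burglar look identical to it.

```python
def changed_fraction(prev_frame, curr_frame, pixel_delta=25):
    """Return the fraction of pixels whose brightness changed by more than pixel_delta.

    Frames are lists of rows of grayscale values (0-255). pixel_delta is an
    assumed tolerance so that sensor noise is not counted as motion.
    """
    total = 0
    changed = 0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for prev_px, curr_px in zip(prev_row, curr_row):
            total += 1
            if abs(prev_px - curr_px) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0


def pixel_motion_alarm(prev_frame, curr_frame, trigger_fraction=0.10):
    """Raise an alarm when enough pixels changed. trigger_fraction is a hypothetical setting."""
    return changed_fraction(prev_frame, curr_frame) >= trigger_fraction


# Two tiny 4x4 "frames": a dark scene, then a bright shadow edge across the top half.
frame_1 = [[10, 10, 10, 10] for _ in range(4)]
frame_2 = [[90, 90, 90, 90], [90, 90, 90, 90], [10, 10, 10, 10], [10, 10, 10, 10]]

print(changed_fraction(frame_1, frame_2))    # 0.5
print(pixel_motion_alarm(frame_1, frame_2))  # True, even though no person is present
```

Raising the threshold reduces false alarms from shadows, but it also makes the camera miss small or distant intruders, which is why tuning pixel motion never fully solved the problem.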

AI isn’t magic; it’s a pipeline.

Deep Learning / CNN (Convolutional Neural Network): This is the software brain.
Training: The manufacturer feeds powerful computers 1,000,000 labeled photos of humans (plus examples of things that are not humans). The network learns what a “human” looks like (two legs, head, torso). This happens at the factory, not on the camera.
Inference: You install this “trained brain” onto the camera. The camera looks at live video and guesses: “I am 98% sure that shape is a Human.”
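
The split between Training and Inference can be sketched with a toy example. A real camera runs a CNN; the code below swaps in a much simpler nearest-centroid classifier purely to show the two phases. The two features (height-to-width ratio and relative width), the sample values, and the function names are all assumptions made for illustration.

```python
import math


def train(labeled_examples):
    """TRAINING (done once, off-camera): average the feature vectors for each label.

    The returned dictionary is the "trained brain" that gets installed on the camera.
    """
    sums, counts = {}, {}
    for features, label in labeled_examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def infer(model, features):
    """INFERENCE (done on live video): score new features against the frozen model.

    Returns (best_label, confidence), where confidence is a share between 0 and 1.
    """
    scores = {
        label: math.exp(-math.dist(features, centroid))
        for label, centroid in model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


# Hypothetical training data: (height/width ratio, relative width) with a label.
training_set = [
    ((2.5, 0.30), "human"), ((2.8, 0.40), "human"), ((2.6, 0.35), "human"),
    ((0.5, 0.90), "vehicle"), ((0.6, 0.80), "vehicle"), ((0.4, 1.00), "vehicle"),
]

model = train(training_set)             # happens at the factory
label, confidence = infer(model, (2.7, 0.30))  # happens on the camera
print(label, round(confidence, 2))
```

The camera only ever calls `infer`. It never sees the training set and does not learn from what it watches, which is why inference fits on a small, low-power chip.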

GPU (Graphics Processing Unit): Massive parallel processing power, usually in a server (most often from NVIDIA). Used for heavy lifting (Face Recognition).
NPU (Neural Processing Unit): A tiny, efficient chip inside the camera. Used for basic tasks (Human/Vehicle detection).

Where the AI runs is often the most expensive decision an integrator makes.

Edge AI (On-Camera):
Location: The analytics run inside the camera chipset (SoC).
Pros: Fast, Cheap (no server needed), Low Bandwidth (the camera can send only alerts and metadata instead of continuous video).
Cons: Limited brainpower. Can detect “Human” or “Car,” but maybe not “Steve Smith” or “2015 Toyota Camry.”

Server AI (On-Premise):
Location: All video is streamed to a big server with NVIDIA GPUs in the closet.
Pros: Smarter. Can run Facial Recognition, License Plate Recognition (LPR), and Weapon Detection simultaneously.
Cons: Expensive. Requires $5,000+ servers and consumes a lot of electricity.

Cloud AI:
Location: Video is sent to Amazon/Microsoft/Google servers.
Pros: Virtually unlimited computing power. Can search for “Man in red shirt” across 1,000 hours of video in seconds.
Cons: High recurring monthly fees. Requires a lot of sustained internet upload bandwidth.
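
The bandwidth cost of streaming every camera off-site is easy to estimate. The sketch below is a back-of-the-envelope calculation; the camera count and the 4 Mbps per-camera bitrate are assumed example figures, and real bitrates vary with resolution, codec, and scene activity.

```python
def upload_mbps(camera_count, bitrate_mbps_per_camera):
    """Sustained upload bandwidth needed to stream every camera continuously."""
    return camera_count * bitrate_mbps_per_camera


def monthly_terabytes(total_mbps, days=30):
    """Data volume sent per month at a constant bitrate (decimal terabytes)."""
    seconds = days * 24 * 60 * 60
    megabits = total_mbps * seconds
    megabytes = megabits / 8          # 8 bits per byte
    return megabytes / 1_000_000      # 1,000,000 MB per TB


cameras = 16   # assumed site size
bitrate = 4    # assumed Mbps per camera

total = upload_mbps(cameras, bitrate)
print(total)                     # 64 (Mbps of sustained upload)
print(monthly_terabytes(total))  # 20.736 (TB per month)
```

A camera that runs its analytics on the edge and sends only alerts avoids most of this traffic, which is the trade-off behind the Low Bandwidth advantage.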

Classification: Identifying what an object is (Human, Vehicle, Animal).
Attribute Extraction: Identifying details about the object (Red Shirt, Hat, Backpack, White Truck).
Behavioral Analytics: Identifying actions (Loitering, Falling Down, Fighting/Fast Movement).
False Positive: The alarm went off, but no real target was there (e.g., a scarecrow detected as a human).
False Negative: The thief walked right past the camera, and it failed to alert. (In security, this is usually the worse failure.)
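
False Positives and False Negatives are usually tracked as rates. Below is a minimal sketch of how an integrator might summarize a test period; the function name and the sample counts are invented for illustration and are not measurements from any real system.

```python
def alarm_metrics(true_positives, false_positives, false_negatives):
    """Summarize detection quality from raw counts.

    precision: of all alarms raised, the share that were real.
    recall:    of all real events, the share the camera caught.
    """
    alarms = true_positives + false_positives
    real_events = true_positives + false_negatives
    precision = true_positives / alarms if alarms else 0.0
    recall = true_positives / real_events if real_events else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_alarm_share": false_positives / alarms if alarms else 0.0,
        "miss_rate": false_negatives / real_events if real_events else 0.0,
    }


# Hypothetical one-week test: 45 real intrusions caught, 5 false alarms, 15 missed.
metrics = alarm_metrics(true_positives=45, false_positives=5, false_negatives=15)
print(metrics["precision"])  # 0.9
print(metrics["recall"])     # 0.75
print(metrics["miss_rate"])  # 0.25
```

Lowering the confidence threshold catches more real events (higher recall) but raises more false alarms (lower precision), so the two numbers have to be tuned together.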