Prototype · Research Platform

Educational AI / Robotics

LION Lab · University of Toledo

Team: Uma · Jay · Ahmad

DeepFlyer platform dashboard — drone, reward editor, hoop navigation course, and edge hardware overview

DeepFlyer: 3D DeepRacer for Drone RL

An educational drone reinforcement learning platform that teaches students RL through reward-function editing, PPO agent training, and autonomous drone simulation — no deep AI background required.

80%

PPO success rate within 100 steps

1M

PPO training steps completed

<1s

Gazebo cold simulation startup

10ms

Reward API median latency

<5%

Collision-detection false-positive rate

40%

X500 URDF mesh optimization

Overview

DeepFlyer reimagines AWS DeepRacer for drone autonomy. Instead of racing an RC car around a flat 2D track, students train a drone in simulation to fly through hoops, avoid obstacles, and navigate a 3D course — all by modifying reward functions in a browser-based editor.

The platform reframes reinforcement learning as something physical and visual. If the reward function is poorly shaped, the drone behaves poorly. If it's better designed, the drone becomes safer, smoother, and more efficient. Students can see that relationship directly — no lectures about abstract RL theory required.

The early prototype established the simulation stack, drone model, React reward editor, backend API, dynamic reward switching, and initial PPO training results. Future milestones target SLAM, event cameras, XAI overlays, and sim-to-real deployment.

Platform: DeepFlyer (Educational Drone RL)

Inspired By: AWS DeepRacer (2D car → 3D drone)

Lab: LION Lab, University of Toledo

Team: Uma (simulation/CAD) · Jay (UI/integration) · Ahmad (RL/AI)

Status: Prototype (Weeks 1–3 complete)

Hardware Target: Holybro S500 · Pixhawk 6C · ZED Mini

Core Thesis

“Reward functions are not just math. They encode behavior. The best way to teach that is to let students see a drone learn — and fail — in real time.”

Problem

Reinforcement learning is powerful, but it's difficult to teach.

The core concepts are abstract. Students tune parameters, watch plots, and read reward curves — but rarely develop the physical intuition for why reward design matters.

Abstract Objectives

Students are taught that the reward function drives behavior, but never experience the consequences of a bad reward directly.

No Physical Feedback

RL assignments produce plots and numbers. They don't produce a drone that crashes because the penalty for going off-course was too low.

Reward Exploitation

Without a physical system to anchor understanding, reward hacking stays an abstract concept. With a drone, it is immediately visible.

Sim-to-Real Gap

Why does a policy that scores 99% in simulation fail on hardware? The abstraction of code-only experiments never makes this tangible.

Safety Constraints

Safety boundaries and emergency stops are afterthoughts in most RL coursework. In real robotics, they are foundational design decisions.

AWS DeepRacer Ceiling

DeepRacer is excellent for 2D car navigation but limited in scope. Drone autonomy requires 3D reasoning, altitude control, and richer state representations.

AWS DeepRacer

2D track navigation

RC car (ground vehicle)

Fixed action space

2D obstacle avoidance

Cloud training only

Pre-built course only

DeepFlyer

3D hoop & obstacle navigation

Quadcopter drone (aerial)

3D continuous action space

Altitude, yaw, speed, alignment

Edge + local inference

Configurable course layouts

Platform Architecture

A full-stack learning loop from browser to simulation to hardware.

01

Student UI

React 18 Reward Editor · Mission Selector · Training Dashboard · Reward Preset Library · Future: Leaderboard / Telemetry

02

Backend API

Node.js + Express · MongoDB · JWT Authentication · Reward Preset Endpoints · Session & Log Storage · 10ms median / 20ms P95 latency

03

RL Training Layer

OpenAI Gym Environment · Stable-Baselines3 · PPO Algorithm · YAML Reward-Interface Schema · Dynamic Reward Switching · MLflow Experiment Logging

04

Simulation Layer

ROS 2 Humble · Gazebo Fortress · X500 URDF (visual + collision + inertial) · Gazebo Contact Sensors · Future: SLAM / Event-Camera Plugins

05

Drone Control (Future)

ROS 2 Bridge · MAVROS · Pixhawk 6C Flight Controller · Safety Boundaries · Emergency Stop Protocol

06

Physical Hardware (Planned)

Holybro S500 Frame · Pixhawk 6C · Raspberry Pi 4B Companion · ZED Mini Stereo Camera · Safety Netting + Propeller Guards · Instructor Kill Switch

Course Type A — Hoop Navigation

A fixed 4–5 hoop circuit with multiple laps at ~0.8 m altitude. Visual targets provide immediate student feedback. Best for public demos and intuitive reward-behavior observation.

5 Hoops · Multi-lap · 0.8m altitude · Visual targets · Demo-friendly

Course Type B — Obstacle Course

A straight path from start to finish with static obstacles requiring lateral maneuvering. Fixed altitude. Easier to compare across reward functions. Best for controlled benchmarking.

Static obstacles · Lateral maneuver · Fixed altitude · Benchmark-friendly · Used in Week 3

Reward Design

The reward function is the educational core of DeepFlyer.

Students don't write RL training code. They modify reward weights through the editor and observe how incentive changes produce different drone behaviors.

Hoop Passage Reward

Higher values make the drone prioritize passing through hoops aggressively, potentially sacrificing smoothness.

Center Alignment Bonus

Rewards cleaner hoop passage. Students observe the drone gradually centering its approach path over training.

Speed Efficiency Bonus

Can produce faster flight but less precise hoop targeting. A classic speed-versus-accuracy tradeoff.

Collision Penalty

Increasing this makes the policy more conservative — students directly observe the agent "choosing" to avoid obstacles.

Smooth Flight Bonus

Reduces jerk in motor commands. Students notice the drone trajectory becoming more continuous over episodes.

Boundary Violation Penalty

Strongly penalizes leaving the safety zone. Essential for real-hardware deployment with safety netting.
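One way to picture how the six components above combine is a weighted sum evaluated at every simulation step. The sketch below is a minimal illustration, not DeepFlyer's actual reward code: the weight names, default values, and the `StepSignals` fields are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    """Student-editable weights. Names and defaults are illustrative assumptions."""
    hoop_passage: float = 10.0
    center_alignment: float = 2.0
    speed_efficiency: float = 0.5
    smooth_flight: float = 0.2
    collision_penalty: float = 20.0
    boundary_penalty: float = 50.0


@dataclass
class StepSignals:
    """Per-step signals a simulator could supply (hypothetical fields)."""
    passed_hoop: bool
    center_offset: float   # 0.0 = dead center, 1.0 = at the hoop rim
    speed_ratio: float     # current speed / target speed
    action_delta: float    # normalized change in command since the last step
    collided: bool
    out_of_bounds: bool


def _clamp01(x: float) -> float:
    """Clamp a value into [0, 1] so one noisy signal cannot dominate the sum."""
    return min(max(x, 0.0), 1.0)


def compute_reward(w: RewardWeights, s: StepSignals) -> float:
    """Weighted sum of bonuses minus penalties for a single step."""
    reward = 0.0
    if s.passed_hoop:
        reward += w.hoop_passage
        # Alignment bonus is largest for a pass through the hoop center.
        reward += w.center_alignment * (1.0 - _clamp01(s.center_offset))
    reward += w.speed_efficiency * _clamp01(s.speed_ratio)
    # Smoothness bonus shrinks as the command changes more abruptly.
    reward += w.smooth_flight * (1.0 - _clamp01(s.action_delta))
    if s.collided:
        reward -= w.collision_penalty
    if s.out_of_bounds:
        reward -= w.boundary_penalty
    return reward
```

Raising `collision_penalty` while leaving everything else fixed lowers the return of any trajectory that touches an obstacle, which is the mechanism behind the more conservative behavior described above.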

12-Dimensional Observation Space

Direction to target hoop · Relative altitude error · Forward velocity · Lateral velocity · Distance to target · Velocity alignment · Hoop alignment (camera) · Vision hoop distance · Hoop visibility · Course progress · Lap progress · Safety zone proximity

The observation space teaches students that RL agents don't learn from magic. They learn from state representations, and more informative observations tend to produce better policies.
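The twelve observation components can be thought of as one fixed-order vector handed to the policy at each step. The following sketch shows that idea with a plain dataclass; the field names are derived from the list above, while the ordering and the clipping range of [-1, 1] are assumptions, since the source does not state how the values are normalized.

```python
from dataclasses import astuple, dataclass
from typing import List


@dataclass(frozen=True)
class Observation:
    """The 12 observation components, in an assumed fixed order."""
    direction_to_hoop: float
    relative_altitude_error: float
    forward_velocity: float
    lateral_velocity: float
    distance_to_target: float
    velocity_alignment: float
    hoop_alignment_camera: float
    vision_hoop_distance: float
    hoop_visibility: float
    course_progress: float
    lap_progress: float
    safety_zone_proximity: float

    def to_vector(self, low: float = -1.0, high: float = 1.0) -> List[float]:
        """Flatten to a list, clipping each value into [low, high] (assumed range)."""
        return [min(max(float(v), low), high) for v in astuple(self)]


obs = Observation(0.2, -0.1, 0.5, 0.0, 0.8, 0.9, 0.1, 0.7, 1.0, 0.25, 0.0, 3.0)
vector = obs.to_vector()  # 12 floats; the out-of-range last value is clipped to 1.0
```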

Technical Stack

Frontend

React 18 · Node.js v18 · ESLint · Prettier · Figma (user flows)

Backend

Node.js · Express · MongoDB · JWT Auth · Reward API Endpoints · Session / Log Storage

RL & Experiment Tracking

OpenAI Gym · Stable-Baselines3 · PPO · YAML Reward Schema · MLflow · Dynamic Reward Switching

Simulation & Robotics

ROS 2 Humble · Gazebo Fortress · X500 URDF · Gazebo Contact Sensors · MAVROS (planned) · Future: SLAM / Event-Camera

Prototype Results

Weeks 1–3 validated the full learning loop.

Progress spanned simulation infrastructure, drone model, backend, frontend, RL training, and experiment tracking — all in three weeks.

Week 1

ROS 2 Humble + Gazebo Fortress installed and configured

Simulation startup performance validated

React frontend repo initialized with ESLint/Prettier

Gym + Stable-Baselines3 Docker environment

YAML reward-interface schema defined

CI builds passing
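The YAML reward-interface schema itself is not reproduced on this page. As a rough illustration of what such a schema could enforce, the sketch below validates a reward preset after it has been parsed into a Python dict. Every key name (`name`, `weights`, and the six weight names) is a hypothetical stand-in, not the project's real schema.

```python
from typing import Any, Dict

# Hypothetical weight names; the real schema may use different keys.
ALLOWED_WEIGHTS = {
    "hoop_passage",
    "center_alignment",
    "speed_efficiency",
    "smooth_flight",
    "collision_penalty",
    "boundary_penalty",
}


def validate_preset(preset: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleaned preset, or raise ValueError describing the problem."""
    name = preset.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("preset needs a non-empty string 'name'")

    weights = preset.get("weights")
    if not isinstance(weights, dict) or not weights:
        raise ValueError("preset needs a non-empty 'weights' mapping")

    unknown = set(weights) - ALLOWED_WEIGHTS
    if unknown:
        raise ValueError(f"unknown weight(s): {sorted(unknown)}")

    clean: Dict[str, float] = {}
    for key, value in weights.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"weight '{key}' must be a number")
        if value < 0:
            raise ValueError(f"weight '{key}' must be non-negative")
        clean[key] = float(value)
    return {"name": name, "weights": clean}
```

Rejecting unknown keys and negative magnitudes before training starts keeps a typo in the editor from silently producing a reward the student did not intend.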

Week 2

X500 URDF built with visual, collision, and inertial components

40% mesh optimization achieved

Mass/inertia validated within 2%

React Dashboard + Reward Editor UI scaffolded

Figma user flows finalized

Reward API endpoints implemented — 10ms median / 20ms P95 latency

Dynamic reward switching integrated into Gym environment
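Dynamic reward switching means the environment can change which reward function it applies without being rebuilt. A minimal sketch of that pattern, assuming the environment calls a single reward object each step, is shown below. The class name, the signal keys, and both preset formulas are hypothetical; only the preset names Distance-to-Goal and Path-Efficiency come from this page.

```python
import threading
from typing import Callable, Dict

RewardFn = Callable[[Dict[str, float]], float]


class SwitchableReward:
    """Holds the active reward function and lets another thread swap it."""

    def __init__(self, initial: RewardFn) -> None:
        self._fn = initial
        self._lock = threading.Lock()

    def set(self, fn: RewardFn) -> None:
        """Replace the active reward function; takes effect on the next call."""
        with self._lock:
            self._fn = fn

    def __call__(self, signals: Dict[str, float]) -> float:
        # Read the current function under the lock, then evaluate outside it.
        with self._lock:
            fn = self._fn
        return fn(signals)


def distance_to_goal(signals: Dict[str, float]) -> float:
    """Assumed form: closer to the target is better."""
    return -signals["distance_to_target"]


def path_efficiency(signals: Dict[str, float]) -> float:
    """Assumed form: also discourage long, wandering paths."""
    return -signals["distance_to_target"] - 0.1 * signals["path_length"]


reward = SwitchableReward(distance_to_goal)
step = {"distance_to_target": 2.0, "path_length": 5.0}
before = reward(step)
reward.set(path_efficiency)   # e.g. triggered by a request from the reward editor
after = reward(step)
```

The environment's step logic only ever calls `reward(...)`, so a preset change from the UI never requires touching the training loop.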

Week 3

Gazebo contact sensors integrated

Collision false positives reduced below 5%

Node.js + Express + MongoDB + JWT backend complete

PPO agent trained for 1M steps

80% success rate within 100 steps on Distance-to-Goal preset

MLflow experiment logging integrated

Convergence plateau identified at ~9e5 steps for further tuning
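The page does not spell out how "success rate within 100 steps" is computed. One plausible reading, sketched here as an assumption, is the fraction of evaluation episodes in which the agent reaches the goal in at most 100 environment steps. The episode data below is made up purely to exercise the function and is not a project result.

```python
from typing import Iterable, Optional


def success_rate(steps_to_goal: Iterable[Optional[int]], max_steps: int = 100) -> float:
    """Fraction of episodes that reached the goal within max_steps.

    Each entry is the step count at which the goal was reached,
    or None if the episode ended without reaching it.
    """
    results = list(steps_to_goal)
    if not results:
        raise ValueError("need at least one evaluation episode")
    successes = sum(1 for s in results if s is not None and s <= max_steps)
    return successes / len(results)


# Toy evaluation run: three of five episodes succeed within the budget.
toy_episodes = [42, 97, None, 100, 130]
rate = success_rate(toy_episodes)
```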

80%

PPO success within 100 steps (Distance-to-Goal preset)

<1s

Gazebo cold start / ~0.3s warm start

10ms

Reward API median latency (20ms P95)

<5%

Collision-detection false-positive rate

40%

X500 URDF mesh optimization

<2%

Mass/inertia validation error

Lessons Learned

Educational AI Needs Clean Abstractions

Students learn reward design most effectively when the platform hides unnecessary complexity. Every API endpoint exposed in the UI should have a direct observable effect on drone behavior.

Reward Design Is the Best RL Entry Point

No other concept maps as directly from abstract theory to physical outcome. Students who adjust a collision penalty and watch the drone become more conservative have had a genuine insight.

A Stable Simulator Beats a Flashy Demo

A sub-1s startup and reliable contact sensor behavior matter more than impressive visuals. Students who encounter simulation bugs stop learning RL and start debugging ROS.

UI Latency Kills Learning Flow

The 10ms median Reward API response time was a deliberate design target. If reward updates feel sluggish, students disengage from the feedback loop.
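Median and P95 figures like the ones quoted for the Reward API can be derived from raw request timings with the standard library. The sketch below assumes latencies have already been collected in milliseconds; how the project actually measured them is not stated, and the sample values are synthetic.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return the median and 95th-percentile latency of the samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],
    }


# Synthetic timings of 1 ms through 100 ms, only to exercise the function.
summary = latency_summary([float(ms) for ms in range(1, 101)])
```

Reporting P95 alongside the median matters for a feedback loop like this one, because occasional slow responses are what students actually notice.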

Log Everything From Day One

MLflow was integrated in Week 3 and immediately revealed the convergence plateau at ~9e5 steps. Without experiment tracking, finding that plateau would have required re-running experiments.
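Once a metric curve has been logged, a plateau like this can be flagged automatically rather than spotted by eye. The sketch below compares the mean of the metric over adjacent windows and reports the first step where the gain falls below a threshold. The window size, threshold, and curve are all illustrative assumptions; the curve is synthetic and is not the project's training log.

```python
from typing import Optional, Sequence, Tuple


def find_plateau(
    curve: Sequence[Tuple[int, float]],
    window: int = 3,
    min_gain: float = 0.01,
) -> Optional[int]:
    """Return the first step where the metric stops improving, else None.

    curve is a sequence of (step, metric) pairs in increasing step order.
    A plateau starts at index i when the mean of the next `window` values
    exceeds the mean of the previous `window` values by less than min_gain.
    """
    values = [v for _, v in curve]
    for i in range(window, len(values) - window + 1):
        before = sum(values[i - window:i]) / window
        after = sum(values[i:i + window]) / window
        if after - before < min_gain:
            return curve[i][0]
    return None


# Synthetic curve: rises for five checkpoints, then flattens out.
toy_curve = [
    (100_000, 0.1), (200_000, 0.2), (300_000, 0.3), (400_000, 0.4),
    (500_000, 0.5), (600_000, 0.5), (700_000, 0.5), (800_000, 0.5),
]
plateau_step = find_plateau(toy_curve, window=2)
```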

Separate High-Level RL From Low-Level Flight Control

Students should reason about rewards and actions, not attitude stabilization and PID tuning. The flight controller abstraction boundary is a pedagogical design decision as much as a technical one.
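That abstraction boundary can be made concrete as a thin translation layer: the policy emits a normalized action, and a separate function turns it into bounded setpoints for the flight controller. The sketch below assumes a three-component action in [-1, 1] mapped to forward, lateral, and vertical velocity; the axis meanings and the speed limits are hypothetical, since the page only says the action space is 3D and continuous.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VelocitySetpoint:
    """Body-frame velocity command handed to the flight controller (assumed form)."""
    forward_mps: float
    lateral_mps: float
    vertical_mps: float


def action_to_setpoint(
    action: Sequence[float],
    max_horizontal_mps: float = 1.0,   # illustrative limit
    max_vertical_mps: float = 0.5,     # illustrative limit
) -> VelocitySetpoint:
    """Clamp a normalized policy action and scale it to velocity limits."""
    if len(action) != 3:
        raise ValueError("expected a 3-component action")
    # Clamping here means no policy output can exceed the configured limits.
    fwd, lat, vert = (max(-1.0, min(1.0, float(a))) for a in action)
    return VelocitySetpoint(
        forward_mps=fwd * max_horizontal_mps,
        lateral_mps=lat * max_horizontal_mps,
        vertical_mps=vert * max_vertical_mps,
    )


setpoint = action_to_setpoint([2.0, -0.5, 0.5])  # first component is clamped to 1.0
```

Attitude stabilization and PID loops live entirely below this function, so students only ever see rewards, observations, and the normalized action.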

Roadmap

From prototype to production-ready educational platform.

Near-Term

Motor & camera plugin integration

Mission Selector UI MVP

Path-Efficiency reward preset

SLAM plugin integration

SLAM map widget

Energy-Efficiency preset

Mid-Term

Event-camera simulation

Collision-Avoidance preset

Live telemetry view

Downloadable logs

Model upload (.pt / .onnx)

Live model swapping

Long-Term

XAI logging + attention overlays

SHAP explainability

Altitude-stability maps

Dynamic leaderboard

Composite reward presets

Final sim-to-real validation

DeepFlyer

Making reinforcement learning physical.

Reward design, PPO training, ROS 2 simulation, and a future path to real drone hardware — all accessible through a browser.
