Prototype · Research Platform
Educational AI / Robotics
LION Lab · University of Toledo
Team: Uma · Jay · Ahmad
DeepFlyer platform dashboard — drone, reward editor, hoop navigation course, and edge hardware overview

DeepFlyer:
3D DeepRacer for Drone RL

An educational drone reinforcement learning platform that teaches students RL through reward-function editing, PPO agent training, and autonomous drone simulation — no deep AI background required.

PPO success rate within 100 steps: 80%
PPO training steps completed: 1M
Gazebo cold simulation startup: <1 s
Reward API median latency: 10 ms
Collision-detection false-positive rate: <5%
X500 URDF mesh optimization: 40%
Overview

DeepFlyer reimagines AWS DeepRacer for drone autonomy. Instead of racing an RC car around a flat 2D track, students train a drone in simulation to fly through hoops, avoid obstacles, and navigate a 3D course — all by modifying reward functions in a browser-based editor.

The platform reframes reinforcement learning as something physical and visual. If the reward function is poorly shaped, the drone behaves poorly. If it's better designed, the drone becomes safer, smoother, and more efficient. Students can see that relationship directly — no lectures about abstract RL theory required.

The early prototype established the simulation stack, drone model, React reward editor, backend API, dynamic reward switching, and initial PPO training results. Future milestones target SLAM, event cameras, XAI overlays, and sim-to-real deployment.

Platform: DeepFlyer, Educational Drone RL
Inspired By: AWS DeepRacer (2D car → 3D drone)
Lab: LION Lab, University of Toledo
Team: Uma (simulation/CAD) · Jay (UI/integration) · Ahmad (RL/AI)
Status: Prototype, Weeks 1–3 complete
Hardware Target: Holybro S500 · Pixhawk 6C · ZED Mini
Core Thesis

“Reward functions are not just math. They encode behavior. The best way to teach that is to let students see a drone learn — and fail — in real time.”

Problem

Reinforcement learning is powerful, but it's difficult to teach.

The core concepts are abstract. Students tune parameters, watch plots, and read reward curves — but rarely develop the physical intuition for why reward design matters.

Abstract Objectives

Students are taught that the reward function drives behavior, but never experience the consequences of a bad reward directly.

No Physical Feedback

RL assignments produce plots and numbers. They don't produce a drone that crashes because the penalty for going off-course was too low.

Reward Exploitation

Without a physical system to anchor understanding, reward hacking stays an abstract concept. With a drone, it's immediately obvious.

Sim-to-Real Gap

Why does a policy that scores 99% in simulation fail on hardware? Code-only experiments never make that gap tangible.

Safety Constraints

Safety boundaries and emergency stops are afterthoughts in most RL coursework. In real robotics, they are foundational design decisions.

AWS DeepRacer Ceiling

DeepRacer is excellent for 2D car navigation but limited in scope. Drone autonomy requires 3D reasoning, altitude control, and richer state representations.

AWS DeepRacer → DeepFlyer
2D track navigation → 3D hoop & obstacle navigation
RC car (ground vehicle) → Quadcopter drone (aerial)
Fixed action space → 3D continuous action space
2D obstacle avoidance → Altitude, yaw, speed, alignment
Cloud training only → Edge + local inference
Pre-built course only → Configurable course layouts
Platform Architecture

A full-stack learning loop from browser to simulation to hardware.

01
Student UI
React 18 Reward Editor · Mission Selector · Training Dashboard · Reward Preset Library · Future: Leaderboard / Telemetry
02
Backend API
Node.js + Express · MongoDB · JWT Authentication · Reward Preset Endpoints · Session & Log Storage · 10 ms median / 20 ms P95 latency
03
RL Training Layer
OpenAI Gym Environment · Stable-Baselines3 · PPO Algorithm · YAML Reward-Interface Schema · Dynamic Reward Switching · MLflow Experiment Logging
04
Simulation Layer
ROS 2 Humble · Gazebo Fortress · X500 URDF (visual + collision + inertial) · Gazebo Contact Sensors · Future: SLAM / Event-Camera Plugins
05
Drone Control (Future)
ROS 2 Bridge · MAVROS · Pixhawk 6C Flight Controller · Safety Boundaries · Emergency Stop Protocol
06
Physical Hardware (Planned)
Holybro S500 Frame · Pixhawk 6C · Raspberry Pi 4B Companion · ZED Mini Stereo Camera · Safety Netting + Propeller Guards · Instructor Kill Switch
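The dynamic reward switching listed in the RL training layer can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the project's code: the `RewardSwitcher` class, the weight keys, and the weight values are hypothetical, and the in-memory `PRESETS` dict stands in for whatever the YAML reward-interface schema and Reward API actually provide. Applying a swap only at episode reset is also an assumption made here to keep each episode's reward consistent.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical presets: the names echo presets mentioned in this write-up,
# but the weight keys and values are illustrative assumptions.
PRESETS: Dict[str, Dict[str, float]] = {
    "distance_to_goal": {"progress": 1.0, "collision": -10.0},
    "path_efficiency": {"progress": 1.0, "collision": -10.0, "step_cost": -0.05},
}


@dataclass
class RewardSwitcher:
    """Holds the active reward weights and applies swaps at episode boundaries."""

    active_name: str = "distance_to_goal"
    pending_name: str = ""
    weights: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Copy so later edits to the active weights never mutate the preset table.
        self.weights = dict(PRESETS[self.active_name])

    def request(self, name: str) -> None:
        """Queue a preset change, e.g. when a student picks one in the editor."""
        if name not in PRESETS:
            raise KeyError(f"unknown reward preset: {name}")
        self.pending_name = name

    def on_reset(self) -> None:
        """Call from the environment's reset() so a swap never lands mid-episode."""
        if self.pending_name:
            self.active_name = self.pending_name
            self.weights = dict(PRESETS[self.active_name])
            self.pending_name = ""

    def reward(self, features: Dict[str, float]) -> float:
        """Weighted sum over the features named by the active preset."""
        return sum(w * features.get(k, 0.0) for k, w in self.weights.items())
```

In this sketch, a training loop would call `request()` whenever the UI posts a new preset and `on_reset()` at the start of every episode, so the agent keeps training without restarting the simulator.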
Course Type A — Hoop Navigation

A fixed 4–5 hoop circuit with multiple laps at ~0.8 m altitude. Visual targets provide immediate student feedback. Best for public demos and intuitive reward-behavior observation.

5 Hoops · Multi-lap · 0.8 m altitude · Visual targets · Demo-friendly
Course Type B — Obstacle Course

A straight path from start to finish with static obstacles requiring lateral maneuvering. Fixed altitude. Easier to compare across reward functions. Best for controlled benchmarking.

Static obstacles · Lateral maneuver · Fixed altitude · Benchmark-friendly · Used in Week 3
Reward Design

The reward function is the educational core of DeepFlyer.

Students don't write RL training code. They modify reward weights through the editor and observe how incentive changes produce different drone behaviors.

Hoop Passage Reward

Higher values make the drone prioritize passing through hoops aggressively, potentially sacrificing smoothness.

Center Alignment Bonus

Rewards cleaner hoop passage. Students observe the drone gradually centering its approach path over training.

Speed Efficiency Bonus

Can produce faster flight but less precise hoop targeting: a classic speed-versus-precision tradeoff.

Collision Penalty

Increasing this makes the policy more conservative — students directly observe the agent "choosing" to avoid obstacles.

Smooth Flight Bonus

Reduces jerk in motor commands. Students notice the drone trajectory becoming more continuous over episodes.

Boundary Violation Penalty

Strongly penalizes leaving the safety zone, so the boundary acts as a firm limit during training. Essential for real-hardware deployment with safety netting.
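The six reward terms above can be read as one weighted sum that students tune. The sketch below shows one plausible way to combine them; the field names, signal definitions, and default weights are all hypothetical, since the write-up does not give the platform's actual formula or values.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Illustrative defaults only; the platform's real values are not given here.
    hoop_passage: float = 10.0
    center_alignment: float = 2.0
    speed_efficiency: float = 0.5
    smooth_flight: float = 0.5
    collision_penalty: float = 20.0
    boundary_penalty: float = 20.0


@dataclass
class StepSignals:
    """Per-step signals from the simulator; all field names are hypothetical."""

    passed_hoop: bool      # drone crossed the hoop plane this step
    center_offset: float   # 0.0 = hoop center, 1.0 = hoop rim
    speed_ratio: float     # current speed / target speed, expected in [0, 1]
    action_change: float   # size of the change in command, expected in [0, 1]
    collided: bool
    out_of_bounds: bool


def compute_reward(s: StepSignals, w: RewardWeights) -> float:
    """Combine the six reward terms into a single scalar for this step."""
    reward = 0.0
    if s.passed_hoop:
        reward += w.hoop_passage
        # Alignment bonus grows as the pass gets closer to the hoop center.
        offset = min(max(s.center_offset, 0.0), 1.0)
        reward += w.center_alignment * (1.0 - offset)
    reward += w.speed_efficiency * s.speed_ratio
    reward += w.smooth_flight * (1.0 - s.action_change)
    if s.collided:
        reward -= w.collision_penalty
    if s.out_of_bounds:
        reward -= w.boundary_penalty
    return reward
```

Raising `collision_penalty` in this sketch lowers the return of any trajectory that touches an obstacle, which is the mechanism behind the more conservative behavior described above.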

12-Dimensional Observation Space
Direction to target hoop · Relative altitude error · Forward velocity · Lateral velocity · Distance to target · Velocity alignment · Hoop alignment (camera) · Vision hoop distance · Hoop visibility · Course progress · Lap progress · Safety zone proximity

The observation space teaches students that RL agents don't learn by magic; they learn from state representations. Better observations tend to produce better policies.
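To make the 12 dimensions concrete, the sketch below lays them out as a typed record that flattens into a fixed-length vector, in the order listed above. The field names are hypothetical and no units or value ranges are implied; the write-up does not specify how the real environment encodes each component.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass
class Observation:
    """The 12 observation components, in the order listed above.

    Field names are assumptions made for illustration.
    """

    direction_to_hoop: float
    altitude_error: float
    forward_velocity: float
    lateral_velocity: float
    distance_to_target: float
    velocity_alignment: float
    hoop_alignment: float
    vision_hoop_distance: float
    hoop_visible: float
    course_progress: float
    lap_progress: float
    safety_zone_proximity: float

    def to_vector(self) -> List[float]:
        """Flatten to the fixed-length vector a policy network would consume."""
        return [float(x) for x in astuple(self)]
```

Keeping the fields named rather than working with a bare array makes it easier to show students which input a policy is missing when it fails, for example by zeroing `hoop_visible` and retraining.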

Technical Stack
Frontend
React 18 · Node.js v18 · ESLint · Prettier · Figma (user flows)
Backend
Node.js · Express · MongoDB · JWT Auth · Reward API Endpoints · Session / Log Storage
RL & Experiment Tracking
OpenAI Gym · Stable-Baselines3 · PPO · YAML Reward Schema · MLflow · Dynamic Reward Switching
Simulation & Robotics
ROS 2 Humble · Gazebo Fortress · X500 URDF · Gazebo Contact Sensors · MAVROS (planned) · Future: SLAM / Event-Camera
Prototype Results

Weeks 1–3 validated the full learning loop.

Progress spanned simulation infrastructure, drone model, backend, frontend, RL training, and experiment tracking — all in three weeks.

Week 1
ROS 2 Humble + Gazebo Fortress installed and configured
Simulation startup performance validated
React frontend repo initialized with ESLint/Prettier
Gym + Stable-Baselines3 Docker environment
YAML reward-interface schema defined
CI builds passing
Week 2
X500 URDF built with visual, collision, and inertial components
40% mesh optimization achieved
Mass/inertia validated within 2%
React Dashboard + Reward Editor UI scaffolded
Figma user flows finalized
Reward API endpoints implemented — 10ms median / 20ms P95 latency
Dynamic reward switching integrated into Gym environment
Week 3
Gazebo contact sensors integrated
Collision false positives reduced below 5%
Node.js + Express + MongoDB + JWT backend complete
PPO agent trained for 1M steps
80% success rate within 100 steps on Distance-to-Goal preset
MLflow experiment logging integrated
Convergence plateau identified at ~9e5 steps for further tuning
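The write-up does not define how the "success rate within 100 steps" figure was computed. One plausible reading is the fraction of evaluation episodes that reach the goal in at most 100 environment steps, sketched below. The `(reached_goal, steps_taken)` episode format and the function name are assumptions.

```python
from typing import Iterable, Tuple


def success_rate(episodes: Iterable[Tuple[bool, int]], max_steps: int = 100) -> float:
    """Fraction of episodes that reached the goal in at most `max_steps` steps.

    Each episode is a (reached_goal, steps_taken) pair.
    """
    episodes = list(episodes)
    if not episodes:
        return 0.0
    hits = sum(1 for reached, steps in episodes if reached and steps <= max_steps)
    return hits / len(episodes)
```

Under this definition, an episode that reaches the goal on step 140 counts as a failure, so the metric rewards efficient policies and not just eventual arrival.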
PPO success within 100 steps (Distance-to-Goal preset): 80%
Gazebo startup: <1 s cold / ~0.3 s warm
Reward API latency: 10 ms median / 20 ms P95
Collision-detection false-positive rate: <5%
X500 URDF mesh optimization: 40%
Mass/inertia validation error: <2%
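For readers unfamiliar with the latency figures, the median and P95 can be computed from raw request timings as sketched below. This uses the nearest-rank percentile definition, which is an assumption; the write-up does not say which method or tooling produced the reported numbers, and the function names are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize_latency(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Report the two figures used on this page: median and 95th percentile."""
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": percentile_nearest_rank(samples_ms, 95),
    }
```

Reporting P95 alongside the median matters for a feedback loop: a fast typical response can still hide occasional slow requests that a student would notice.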

Lessons Learned
Educational AI Needs Clean Abstractions

Students learn reward design most effectively when the platform hides unnecessary complexity. Every API endpoint exposed in the UI should have a direct observable effect on drone behavior.

Reward Design Is the Best RL Entry Point

No other concept maps as directly from abstract theory to physical outcome. Students who adjust a collision penalty and watch the drone become more conservative have had a genuine insight.

A Stable Simulator Beats a Flashy Demo

A sub-1 s startup and reliable contact-sensor behavior matter more than impressive visuals. Students who encounter simulation bugs stop learning RL and start debugging ROS.

UI Latency Kills Learning Flow

The 10 ms median Reward API response time was a deliberate design target. If reward updates feel sluggish, students disengage from the feedback loop.

Log Everything From Day One

MLflow was integrated in Week 3 and immediately revealed the convergence plateau at ~9e5 steps. Without experiment tracking, finding that plateau would have meant re-running experiments.

Separate High-Level RL From Low-Level Flight Control

Students should reason about rewards and actions, not attitude stabilization and PID tuning. The flight controller abstraction boundary is a pedagogical design decision as much as a technical one.
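One way to picture that abstraction boundary: the policy emits a small normalized action, and a thin adapter turns it into bounded velocity setpoints that the flight controller then tracks. The sketch below assumes the three action dimensions are forward, lateral, and vertical velocity, which the write-up does not confirm; the names and speed limits are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VelocitySetpoint:
    forward_mps: float
    lateral_mps: float
    vertical_mps: float


# Hypothetical limits; a real deployment would take these from its safety config.
MAX_HORIZONTAL_MPS = 1.0
MAX_VERTICAL_MPS = 0.5


def _clip(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def action_to_setpoint(action: Sequence[float]) -> VelocitySetpoint:
    """Map a normalized action in [-1, 1]^3 to bounded velocity setpoints.

    Attitude stabilization and motor mixing stay inside the flight
    controller; the policy never sees them.
    """
    if len(action) != 3:
        raise ValueError("expected a 3-element action")
    fwd, lat, vert = (_clip(float(a)) for a in action)
    return VelocitySetpoint(
        forward_mps=fwd * MAX_HORIZONTAL_MPS,
        lateral_mps=lat * MAX_HORIZONTAL_MPS,
        vertical_mps=vert * MAX_VERTICAL_MPS,
    )
```

Because out-of-range actions are clipped before scaling, an untrained policy cannot command speeds beyond the configured limits, which keeps early training episodes inside a predictable envelope.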

Roadmap

From prototype to production-ready educational platform.

Near-Term
Motor & camera plugin integration
Mission Selector UI MVP
Path-Efficiency reward preset
SLAM plugin integration
SLAM map widget
Energy-Efficiency preset
Mid-Term
Event-camera simulation
Collision-Avoidance preset
Live telemetry view
Downloadable logs
Model upload (.pt / .onnx)
Live model swapping
Long-Term
XAI logging + attention overlays
SHAP explainability
Altitude-stability maps
Dynamic leaderboard
Composite reward presets
Final sim-to-real validation
DeepFlyer

Making reinforcement learning physical.

Reward design, PPO training, ROS 2 simulation, and a future path to real drone hardware — all accessible through a browser.
