3we Benchmark Leaderboard¶
Standardized evaluation for embodied AI agents on the 3we robot platform. All benchmarks run in deterministic simulation environments with fixed seeds, enabling reproducible comparison of navigation and exploration policies.
Tasks¶
The benchmark suite evaluates agents on three core embodied AI tasks:
| Task | Description | Environment |
|---|---|---|
| PointNav | Navigate to a specified (x, y) coordinate | Indoor office scenes with static obstacles |
| ObjectNav | Navigate to an instance of a named object category | Indoor scenes with semantic object annotations |
| Exploration | Maximize map coverage within a time budget | Unknown environments requiring frontier-based or learned exploration |
Metrics¶
| Metric | Definition | Range |
|---|---|---|
| SR (Success Rate) | Fraction of episodes where the agent reached the goal | 0.0 - 1.0 |
| SPL (Success weighted by Path Length) | Success penalized by path inefficiency relative to shortest path | 0.0 - 1.0 |
| Duration | Mean episode wall-clock time in seconds | 0.0+ |
| Coverage | Fraction of explorable area visited (exploration only) | 0.0 - 1.0 |
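The SPL entry above follows the standard embodied-navigation definition: per-episode success weighted by the ratio of shortest-path length to actual path length, averaged over episodes. A minimal sketch in Python (the function name is illustrative, not part of the threewe API):

```python
def spl(episodes):
    """Mean SPL over episodes.

    Each episode is a tuple (success, shortest_path_len, actual_path_len).
    Per-episode SPL = S_i * l_i / max(p_i, l_i): a successful agent that
    takes the shortest path scores 1.0, detours are penalized, and
    failed episodes contribute 0.
    """
    total = 0.0
    for success, shortest, actual in episodes:
        if success:
            total += shortest / max(actual, shortest)
    return total / len(episodes)

# One optimal success and one failure -> mean SPL = 0.5
print(spl([(True, 10.0, 10.0), (False, 10.0, 25.0)]))  # 0.5
```

Note that SPL can be low even when SR is high: an agent that always reaches the goal via long detours still scores poorly.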
Baseline Results¶
Results measured on the office_v2 scene with 100 episodes per task, seed range [0, 99].
| Agent | Task | Scene | SR | SPL | Duration (s) | Coverage |
|---|---|---|---|---|---|---|
| Nav2 DWB | pointnav | office_v2 | 0.85 | 0.72 | 12.3 | - |
| Frontier Exploration | exploration | office_v2 | 0.90 | - | 45.2 | 0.82 |
| Random Agent | objectnav | office_v2 | 0.12 | 0.08 | 28.1 | - |
Running Benchmarks¶
Install the benchmark dependencies, then use the benchmark CLI to run an evaluation task, optionally selecting a specific backend and scene, and compare your result against a published baseline.
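As a sketch of that workflow, assuming a hypothetical `threewe-bench` command (check the threewe documentation for the actual CLI name and flags):

```shell
# Hypothetical commands -- names and flags are illustrative assumptions.
pip install "threewe[benchmark]"

# Run a benchmark task:
threewe-bench run --task pointnav --agent my_agent

# Run with a specific backend and scene:
threewe-bench run --task pointnav --agent my_agent \
    --backend gazebo --scene office_v2

# Compare your result against a baseline:
threewe-bench compare results.json --baseline nav2_dwb
```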
Submission Format¶
Results must be submitted as a JSON file conforming to the following schema:
{
"agent_name": "string (required) — display name for leaderboard",
"task": "string (required) — one of: pointnav, objectnav, exploration",
"scene": "string (required) — scene identifier (e.g., office_v2)",
"episodes": "integer (required) — number of episodes evaluated",
"seed_start": "integer (required) — first seed used",
"metrics": {
"success_rate": "float (required) — fraction of successful episodes",
"spl": "float (required for pointnav/objectnav) — success weighted by path length",
"mean_duration": "float (required) — mean episode duration in seconds",
"coverage": "float (required for exploration) — fraction of area explored"
},
"software": {
"threewe_version": "string (required) — threewe package version",
"backend": "string (required) — gazebo or isaac_sim",
"backend_version": "string (required) — simulator version"
},
"metadata": {
"method": "string (optional) — brief description of the approach",
"paper_url": "string (optional) — link to paper or preprint",
"code_url": "string (optional) — link to source code",
"hardware": "string (optional) — training hardware description"
}
}
Example submission:
{
"agent_name": "PPO-LiDAR-v2",
"task": "pointnav",
"scene": "office_v2",
"episodes": 100,
"seed_start": 0,
"metrics": {
"success_rate": 0.91,
"spl": 0.78,
"mean_duration": 10.5
},
"software": {
"threewe_version": "0.2.0",
"backend": "gazebo",
"backend_version": "Harmonic"
},
"metadata": {
"method": "PPO with 360-point LiDAR + goal vector, 500k timesteps",
"code_url": "https://github.com/user/ppo-nav"
}
}
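The required fields can be checked locally before submitting. A small sketch of such a check (a hypothetical helper, not the official validator) that enforces the schema above, including the task-dependent metric requirements:

```python
REQUIRED_TOP = ("agent_name", "task", "scene", "episodes", "seed_start",
                "metrics", "software")
REQUIRED_SOFTWARE = ("threewe_version", "backend", "backend_version")

def validate_submission(sub: dict) -> list:
    """Return a list of schema errors; an empty list means valid."""
    errors = [f"missing field: {k}" for k in REQUIRED_TOP if k not in sub]
    if errors:
        return errors
    task = sub["task"]
    if task not in ("pointnav", "objectnav", "exploration"):
        errors.append(f"unknown task: {task}")
    # success_rate and mean_duration are always required;
    # spl / coverage depend on the task.
    needed = {"success_rate", "mean_duration"}
    if task in ("pointnav", "objectnav"):
        needed.add("spl")
    elif task == "exploration":
        needed.add("coverage")
    errors += [f"missing metric: {m}" for m in needed - sub["metrics"].keys()]
    errors += [f"missing software field: {k}" for k in REQUIRED_SOFTWARE
               if k not in sub["software"]]
    return errors
```

Running it on the example submission above returns an empty list; dropping `spl` from a pointnav submission would produce a `missing metric: spl` error.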
How to Submit¶
Submit your result JSON to the leaderboard with the submission tool.
The submission tool validates schema compliance and checks that the seed range matches the official protocol before uploading.
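As a sketch, again assuming a hypothetical `threewe-bench` CLI (the actual submission command may differ; see the threewe documentation):

```shell
# Hypothetical command -- name and flags are illustrative assumptions.
threewe-bench submit results.json
```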
Evaluation Protocol¶
To ensure fair and reproducible comparison, all submissions must follow this protocol:
- Deterministic seeds: Episodes use seeds `[seed_start, seed_start + episodes)`. The official evaluation uses `seed_start=0`.
- Scene version: Results are only comparable within the same scene version. The current official scene is `office_v2`.
- Software version: Record the exact `threewe` package version and simulator version. Results from different versions are grouped separately on the leaderboard.
- No scene-specific tuning: Agents must not be trained on the exact evaluation episodes. Training on the same scene type is permitted, but not on the specific seed-generated configurations used for evaluation.
- Timeout: Episodes that exceed the task-specific time limit are marked as failures. PointNav: 60s, ObjectNav: 120s, Exploration: 180s.
- Single attempt: Each episode is attempted exactly once. No retries or cherry-picking.
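The seed and timeout rules above are mechanical and can be expressed directly. A sketch, with illustrative helper names that are not part of the threewe API:

```python
# Task-specific episode time limits from the protocol, in seconds.
TIME_LIMITS = {"pointnav": 60.0, "objectnav": 120.0, "exploration": 180.0}

def episode_seeds(seed_start: int, episodes: int) -> range:
    # Deterministic seeds: the half-open range [seed_start, seed_start + episodes)
    return range(seed_start, seed_start + episodes)

def episode_success(task: str, reached_goal: bool, duration: float) -> bool:
    # An episode that exceeds the task time limit counts as a failure,
    # even if the goal was eventually reached.
    return reached_goal and duration <= TIME_LIMITS[task]

# Official protocol: 100 episodes with seed_start=0 -> seeds 0..99
print(list(episode_seeds(0, 100))[:3])           # [0, 1, 2]
print(episode_success("pointnav", True, 61.0))   # False (over the 60 s limit)
```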
Full Documentation¶
For detailed benchmark API reference, custom task definitions, and advanced evaluation options, see the benchmark module documentation.