# Reference to Video

> Generate consistent AI videos with character elements, scene references, and multi-shot control on the Venice API using Kling O3 and Grok Imagine R2V models.

Reference to Video lets you lock in the appearance of characters, objects, and scenes so your AI-generated videos stay visually consistent. Instead of hoping the model interprets your prompt correctly, you provide visual anchors — reference images that tell the model exactly what your subject looks like.

This feature is available on **Kling O3** and **Grok Imagine R2V** models in the [Venice Video Studio](https://venice.ai/video). Each model family uses a different approach to reference images — see the model-specific sections below.

## When to use Reference to Video

Use Reference to Video when you need:

* **Character consistency** — the same person or character across multiple shots
* **Product accuracy** — a real product that must look identical to the original
* **Scene continuity** — a specific environment or background across generations
* **Multi-character scenes** — multiple distinct characters interacting without blending

For simple text-to-video or image-to-video where consistency isn't critical, the standard models work well without references.

## Available models

| Model                     | Approach                | Best for                                                     |
| ------------------------- | ----------------------- | ------------------------------------------------------------ |
| **Kling O3 Pro R2V**      | Elements + scene images | Complex multi-character scenes with precise identity control |
| **Kling O3 Standard R2V** | Elements + scene images | Faster iteration on element-based scenes                     |
| **Grok Imagine R2V**      | Flat reference images   | Quick reference-driven generation with up to 7 images        |

**Kling O3** uses a structured approach with Elements (character identity anchors with frontal + reference images) and Scene Images. **Grok Imagine R2V** takes a simpler approach — you upload reference images directly and reference them in your prompt with `@Image1`, `@Image2`, etc.

***

## Kling O3 Reference to Video

### Core concepts

Kling O3 Reference to Video uses three types of visual input that work together:

| Input                      | Required                    | Purpose                               | How to reference in prompt     |
| -------------------------- | --------------------------- | ------------------------------------- | ------------------------------ |
| **Elements**               | At least one visual input\* | Lock a character or object's identity | `@Element1`, `@Element2`, etc. |
| **Scene Reference Images** | At least one visual input\* | Set the environment, style, and mood  | `@Image1`, `@Image2`, etc.     |
| **Start Frame**            | At least one visual input\* | Control the first frame of the video  | N/A (set via upload)           |
| **End Frame**              | No                          | Control the last frame of the video   | N/A (set via upload)           |

\*At least one of: start frame, elements, or scene reference images is required.

### Elements

An Element is a character or object you want to keep visually stable throughout the video. Each element consists of:

* **Frontal Image** (required per element) — a clear, front-facing photo of the subject. This is the primary identity anchor. Think of it as the "passport photo" of your character or product.
* **Reference Images** (1–3, optional) — additional angles of the same subject (side view, 45-degree angle, back). These help the model understand the subject in 3D space. If not provided, the frontal image is automatically used as the reference.

You can add up to **7 elements** per generation, or **3** when a start or end frame is set. Elements also count toward the combined limit of 7 inputs. Reference them in your prompt using `@Element1`, `@Element2`, etc.

### Scene Reference Images

Scene references define the "stage" where the action takes place. They influence:

* Lighting and color palette
* Architecture and environment details
* Overall visual style and mood

You can add up to **4 scene images**. Reference them as `@Image1`, `@Image2`, etc. in your prompt.

### Limitations

The total number of images across all input types is limited:

| Limit                                                                   | Value                                                          |
| ----------------------------------------------------------------------- | -------------------------------------------------------------- |
| **Minimum required**                                                    | At least 1 visual input (start frame, element, or scene image) |
| **Combined total** (first frame + last frame + elements + scene images) | **7 maximum**                                                  |
| Elements (without start/end frame)                                      | 7 maximum                                                      |
| Elements (with start or end frame)                                      | 3 maximum                                                      |
| Scene reference images                                                  | 4 maximum                                                      |
| Reference images per element                                            | Up to 3 (optional)                                             |

**Example scenarios:**

* 7 elements + 0 scene images = 7 ✓ (no frames)
* 5 elements + 2 scene images = 7 ✓ (no frames)
* First frame (1) + 3 elements + 3 scene images = 7 ✓
* First frame (1) + last frame (1) + 3 elements + 2 scene images = 7 ✓
* First frame (1) + 4 elements = ✗ (max 3 elements with frame)
* First frame (1) + last frame (1) + 4 elements = ✗ (max 3 elements with frames)

<Note>
  Each element requires a **frontal image**. If you don't provide reference images for an element, the frontal image is automatically used as the reference.
</Note>
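
The limits above can be checked before you upload anything. The following is a minimal sketch, not part of the Venice API: the function name `check_kling_inputs` and its arguments are hypothetical, and the rules are taken only from the table in this section.

```python
def check_kling_inputs(elements=0, scene_images=0, start_frame=False, end_frame=False):
    """Return a list of limit violations for a planned Kling O3 R2V request."""
    problems = []
    frames = int(start_frame) + int(end_frame)

    # An end frame alone does not count as the required visual input.
    if not (start_frame or elements > 0 or scene_images > 0):
        problems.append("need at least one start frame, element, or scene image")

    # Elements are capped at 3 when any frame is set, otherwise at 7.
    max_elements = 3 if frames else 7
    if elements > max_elements:
        problems.append(f"too many elements: {elements} > {max_elements}")

    if scene_images > 4:
        problems.append(f"too many scene images: {scene_images} > 4")

    # First frame + last frame + elements + scene images share one budget of 7.
    total = frames + elements + scene_images
    if total > 7:
        problems.append(f"combined total too high: {total} > 7")

    return problems


print(check_kling_inputs(elements=5, scene_images=2))    # []
print(check_kling_inputs(elements=4, start_frame=True))  # ['too many elements: 4 > 3']
```

An empty list means the combination fits the documented limits. The service remains the source of truth and may apply checks this sketch does not model.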

### Multi-shot mode

Multi-shot lets you break a single generation into multiple scenes, each with its own prompt and duration. Elements and scene references carry across all shots, maintaining consistency. The total duration across all shots cannot exceed **15 seconds**.

***

### Step-by-step guide (Video Studio)

#### 1. Open Video Studio and select the model

Go to [venice.ai/video](https://venice.ai/video). In the Model Browser on the left, select one of the **Kling O3 Reference to Video** models:

* **Kling O3 Pro R2V** — higher quality, longer generation time (\~6 min)
* **Kling O3 Standard R2V** — faster, more cost-effective for iteration

#### 2. Add Visual Inputs (at least one required)

You must provide **at least one visual input** to generate a video: a start frame, an element, or a scene reference image. In the Input Panel, you'll see the **Elements** section. Click **Add Element** to create an element for characters or objects you want to keep visually consistent.

For each element:

1. Click the **Frontal** tile to upload a clear, front-facing image of your character or object
2. Optionally click **Add** under **Reference Images** to upload additional angles (1–3)

Repeat for additional characters or objects (up to 7 elements total, or 3 if using start/end frames).

<Warning>
  The combined total of first frame, last frame, elements, and scene images cannot exceed **7**. See [Limitations](#limitations) for details.
</Warning>

<Tip>
  **Best reference images:** Use well-lit photos with a clean background. Provide front, side, and 45-degree angle views for the strongest identity lock. Make sure all reference images share the same visual style (don't mix photorealistic and anime).
</Tip>

#### 3. Add Scene Reference Images (optional)

Below the Elements section, you'll see **Scene Reference Images**. Upload images that define the environment you want — a specific location, lighting setup, or art style.

These are tagged automatically as `@Image1`, `@Image2`, etc.

#### 4. Upload a Start Frame (optional)

If you want to control the exact first frame of your video, switch to the **Image** input type and upload a start frame. You can also optionally set an end frame.

#### 5. Write your prompt

In the prompt field, describe the action you want while referencing your elements and scene images using the `@` tags:

```
@Element1 walks through the streets of @Image1, looking up at the buildings.
The camera slowly tracks from behind, revealing the city skyline.
```

For **multi-character scenes**:

```
@Element1 and @Element2 enter the cafe in @Image1 from opposite sides.
@Element1 waves and walks toward @Element2, who is sitting at a corner table.
```
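
If you assemble prompts in code, it can help to confirm that every `@` tag points at something you actually uploaded. This is a small illustrative check; `unknown_tags` is a hypothetical helper, and it assumes tags are numbered from 1 in upload order, as described above.

```python
import re


def unknown_tags(prompt, num_elements, num_scene_images):
    """Return @Element/@Image tags in the prompt that have no matching upload."""
    limits = {"Element": num_elements, "Image": num_scene_images}
    missing = []
    for kind, index in re.findall(r"@(Element|Image)(\d+)", prompt):
        n = int(index)
        if n < 1 or n > limits[kind]:
            tag = f"@{kind}{n}"
            if tag not in missing:  # report each bad tag once
                missing.append(tag)
    return missing


prompt = "@Element1 and @Element2 enter the cafe in @Image1 from opposite sides."
print(unknown_tags(prompt, num_elements=1, num_scene_images=1))  # ['@Element2']
```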

#### 6. Configure settings

Open **Video Settings** to adjust:

| Setting        | Options         | Default |
| -------------- | --------------- | ------- |
| Duration       | 3s – 15s        | 5s      |
| Aspect Ratio   | 16:9, 9:16, 1:1 | 16:9    |
| Generate Audio | On/Off          | Off     |

<Note>
  Audio generation adds native sound effects, dialogue, and ambient audio synchronized to the video. It increases cost by \~25%.
</Note>

#### 7. Generate

Click **Generate Video**. Kling O3 typically takes 4–6 minutes depending on the model tier and duration. You can queue multiple generations and browse results in the Video Gallery.

***

### Multi-shot storyboarding

For narrative sequences, use multi-shot mode to define separate scenes within a single generation.

1. In the prompt area, click **Add Shot** to create additional shots
2. Write a separate prompt for each shot
3. Set the duration for each shot (3–15s each, total ≤ 15s)

Elements and scene references persist across all shots automatically:

```
Shot 1 (5s): @Element1 stands at the edge of @Image1, looking out at the horizon.
Slow camera push forward.

Shot 2 (5s): Close-up of @Element1's face as they turn toward the camera.
Soft natural lighting, shallow depth of field.

Shot 3 (5s): @Element1 walks away from camera into the distance.
Wide cinematic shot, golden hour lighting.
```

<Warning>
  Multi-shot total duration cannot exceed 15 seconds. For example, three 5-second shots = 15s maximum.
</Warning>
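
The duration rules are simple enough to verify before you submit a storyboard. The sketch below assumes whole-second durations; `check_shots` is a hypothetical helper, not an API call.

```python
def check_shots(durations):
    """Validate per-shot durations (in seconds) for a multi-shot generation."""
    problems = []
    for number, seconds in enumerate(durations, start=1):
        if not 3 <= seconds <= 15:
            problems.append(f"shot {number}: {seconds}s is outside 3-15s")
    total = sum(durations)
    if total > 15:
        problems.append(f"total {total}s exceeds 15s")
    return problems


print(check_shots([5, 5, 5]))  # []
print(check_shots([5, 5, 6]))  # ['total 16s exceeds 15s']
```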

***

### Prompting tips

#### Structure your prompt

Follow this pattern for reliable results:

```
[subject with @Element tag] + [action] + [environment with @Image tag] + [camera movement] + [lighting/style]
```

**Example:**

```
@Element1 hops happily across the candy ground of @Image1, stops to look at a
giant lollipop, tilts its head curiously. Cinematic tracking shot, soft warm lighting.
```
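
If you generate many prompts, a tiny template keeps the parts in the same order every time. This is only a sketch of the pattern above; `build_prompt` and its parameter names are hypothetical, and the punctuation it adds is a stylistic choice, not a requirement of the model.

```python
def build_prompt(subject, action, environment, camera, style):
    """Join the five prompt parts in the order the pattern suggests."""
    scene = f"{subject} {action} {environment}".strip()
    return f"{scene}. {camera}, {style}."


print(build_prompt(
    subject="@Element1",
    action="hops happily across the candy ground of",
    environment="@Image1",
    camera="Cinematic tracking shot",
    style="soft warm lighting",
))
# @Element1 hops happily across the candy ground of @Image1. Cinematic tracking shot, soft warm lighting.
```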

#### Keep prompts 50–150 words

Prompts under 50 words tend to lack detail, and prompts over 150 words tend to introduce contradictions. Aim for the range in between.
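
A rough length check can flag prompts that fall outside this range. The sketch below counts whitespace-separated words, which is an assumption; how the model itself measures length is not specified here, and `prompt_length_hint` is a hypothetical helper.

```python
def prompt_length_hint(prompt, low=50, high=150):
    """Classify a prompt as 'short', 'ok', or 'long' by whitespace-separated word count."""
    words = len(prompt.split())
    if words < low:
        return "short"
    if words > high:
        return "long"
    return "ok"


print(prompt_length_hint("@Element1 walks through @Image1."))  # short
```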

#### Use simple camera language

The model responds best to straightforward camera directions:

| Use                         | Avoid                                           |
| --------------------------- | ----------------------------------------------- |
| `slow camera push forward`  | `dolly zoom with rack focus transition`         |
| `tracking shot from behind` | `complex handheld parallax movement`            |
| `close-up`                  | `extreme macro with tilt-shift bokeh`           |
| `wide cinematic shot`       | `anamorphic ultra-wide establishing crane shot` |

#### Use consistent vocabulary

If you describe a character wearing "a red jacket" in one prompt, don't switch to "crimson coat" in the next. The model can read different wording as a different intent.

#### Place camera instructions early

Put the camera direction near the beginning of the prompt for more reliable results:

```
Cinematic tracking shot of @Element1 walking through @Image1, leaves
blowing in the wind, golden afternoon light.
```

***

### Kling O3 pricing

Kling O3 Reference to Video models use duration-based pricing:

| Model                 | Per second (no audio) | Per second (with audio) |
| --------------------- | --------------------- | ----------------------- |
| Kling O3 Pro R2V      | \$0.112               | \$0.140                 |
| Kling O3 Standard R2V | \$0.112               | \$0.140                 |

**Example:** A 10-second video with audio = 10 × \$0.140 = **\$1.40**

Use the [Video Quote API](https://docs.venice.ai/api-reference/endpoint/video/quote) for exact pricing before generation.
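
For a quick local estimate, you can multiply the duration by the per-second rate. This sketch hard-codes the rates from the table in this section, so treat its result as approximate and rely on the Video Quote API for the real figure. `estimate_kling_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Per-second rates in USD, copied from the pricing table in this section.
# Keyed by whether audio generation is enabled.
KLING_O3_RATES = {False: Decimal("0.112"), True: Decimal("0.140")}


def estimate_kling_cost(seconds, audio=False):
    """Estimate cost in USD as duration multiplied by the per-second rate."""
    return seconds * KLING_O3_RATES[audio]


print(estimate_kling_cost(10, audio=True))   # 1.400
print(estimate_kling_cost(10, audio=False))  # 1.120
```

`Decimal` is used instead of `float` so that the totals come out as exact currency amounts.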

***

### Kling O3 API usage

Kling O3 Reference to Video is also available via the Venice API. See the [Video Queue API](https://docs.venice.ai/api-reference/endpoint/video/queue) for full details.

#### Python

```python theme={"system"}
import requests

response = requests.post(
    "https://api.venice.ai/api/v1/video/queue",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "kling-o3-pro-reference-to-video",
        "prompt": "@Element1 walks through @Image1, camera tracking from behind",
        "duration": "8",
        "aspect_ratio": "16:9",
        "audio": True,
        "elements": [
            {
                "frontal_image_url": "https://example.com/character-front.jpg",
                "reference_image_urls": [
                    "https://example.com/character-side.jpg",
                    "https://example.com/character-angle.jpg"
                ]
            }
        ],
        "image_urls": [
            "https://example.com/scene-background.jpg"
        ]
    }
)

response.raise_for_status()  # raise on HTTP errors before reading the body
queue_id = response.json()["id"]
```

#### Node.js

```javascript theme={"system"}
const response = await fetch("https://api.venice.ai/api/v1/video/queue", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "kling-o3-pro-reference-to-video",
    prompt: "@Element1 walks through @Image1, camera tracking from behind",
    duration: "8",
    aspect_ratio: "16:9",
    audio: true,
    elements: [
      {
        frontal_image_url: "https://example.com/character-front.jpg",
        reference_image_urls: [
          "https://example.com/character-side.jpg",
          "https://example.com/character-angle.jpg"
        ]
      }
    ],
    image_urls: [
      "https://example.com/scene-background.jpg"
    ]
  })
});

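// Fail early on HTTP errors before reading the body.
if (!response.ok) {
  throw new Error(`Venice API request failed: ${response.status}`);
}

const { id: queueId } = await response.json();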
```

#### cURL

```bash theme={"system"}
curl https://api.venice.ai/api/v1/video/queue \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kling-o3-pro-reference-to-video",
    "prompt": "@Element1 walks through @Image1, camera tracking from behind",
    "duration": "8",
    "aspect_ratio": "16:9",
    "audio": true,
    "elements": [
      {
        "frontal_image_url": "https://example.com/character-front.jpg",
        "reference_image_urls": [
          "https://example.com/character-side.jpg",
          "https://example.com/character-angle.jpg"
        ]
      }
    ],
    "image_urls": [
      "https://example.com/scene-background.jpg"
    ]
  }'
```

#### Element schema

Each element in the `elements` array accepts:

| Field                  | Type      | Required | Description                                                                          |
| ---------------------- | --------- | -------- | ------------------------------------------------------------------------------------ |
| `frontal_image_url`    | string    | **Yes**  | Clear front-facing image URL                                                         |
| `reference_image_urls` | string\[] | No       | Additional angle URLs (1–3). If omitted, the frontal image is used as the reference. |

<Note>
  The API also supports `video_url` for video-based elements, but this is not currently available in the Video Studio UI.
</Note>
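
When you build the `elements` array in code, a small helper can enforce the schema above before the request is sent. This is a sketch: `make_element` is hypothetical, and it uses only the `frontal_image_url` and `reference_image_urls` fields documented in the table.

```python
def make_element(frontal_image_url, reference_image_urls=None):
    """Build one entry for the `elements` array, checking the documented limits."""
    if not frontal_image_url:
        raise ValueError("frontal_image_url is required")
    element = {"frontal_image_url": frontal_image_url}
    if reference_image_urls:
        if len(reference_image_urls) > 3:
            raise ValueError("at most 3 reference images per element")
        element["reference_image_urls"] = list(reference_image_urls)
    return element


elements = [
    make_element(
        "https://example.com/character-front.jpg",
        ["https://example.com/character-side.jpg"],
    ),
    make_element("https://example.com/product-front.jpg"),  # frontal image only
]
```

The resulting list can be passed as the `elements` value in the request body shown earlier.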

***

### Kling O3 troubleshooting

| Problem                                           | Likely cause                                  | Fix                                                                                 |
| ------------------------------------------------- | --------------------------------------------- | ----------------------------------------------------------------------------------- |
| Generate button is disabled                       | No visual inputs provided                     | Add at least one visual input: start frame, element, or scene reference image       |
| "Number of images exceeds the limit" error        | Too many combined inputs                      | Total of first frame + last frame + elements + scene images must be ≤ 7             |
| Character face changes between shots              | Different or missing frontal image            | Use the same frontal image consistently, keep description identical                 |
| Camera movement feels random                      | Multiple or conflicting camera instructions   | Use a single camera instruction, place it early in the prompt                       |
| Style shifts between generations                  | Inconsistent scene references or mixed styles | Reuse the same scene images, keep style keywords consistent                         |
| Elements blend together in multi-character scenes | Vague spatial instructions                    | Be explicit about each element's position: "foreground left", "entering from right" |
| Background looks distorted                        | Cluttered or complex scene reference image    | Use clean, high-quality scene reference images                                      |
| Motion looks unnatural                            | Too many actions in one prompt                | Simplify the action, use shorter duration, one action per shot                      |

<Tip>
  Test with a 3–5 second clip before committing to longer durations. Shorter clips maintain better consistency and let you iterate faster.
</Tip>

***

## Grok Imagine Reference to Video

Grok Imagine R2V takes a simpler approach than Kling O3. Instead of structured Elements with frontal/reference image separation, you upload **flat reference images** and reference them directly in your prompt using `@Image1`, `@Image2`, etc. The model incorporates those subjects into the generated video.

### How it works

1. Upload **1–7 reference images** — photos of characters, objects, or scenes you want in the video
2. Write a prompt that describes the video, using `@Image1`, `@Image2`, etc. to reference specific images
3. The model generates a video incorporating those references

If you don't include `@Image` tags in your prompt, all uploaded images are referenced automatically.
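
The tagging rule can be modelled in a few lines, which is handy for checking a prompt before you submit it. This is an illustrative sketch only: `referenced_images` is a hypothetical helper, and raising an error for a tag with no matching upload is this sketch's choice, not documented service behavior.

```python
import re


def referenced_images(prompt, num_images):
    """Return the 1-based image numbers a prompt refers to."""
    if not 1 <= num_images <= 7:
        raise ValueError("upload between 1 and 7 reference images")
    tagged = sorted({int(n) for n in re.findall(r"@Image(\d+)", prompt)})
    if not tagged:
        # No tags in the prompt: every uploaded image is referenced.
        return list(range(1, num_images + 1))
    out_of_range = [n for n in tagged if n < 1 or n > num_images]
    if out_of_range:
        raise ValueError(f"no upload for @Image tags: {out_of_range}")
    return tagged


print(referenced_images("@Image2 walks past @Image1 in a park", 3))  # [1, 2]
print(referenced_images("Two friends walking in a park", 3))         # [1, 2, 3]
```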

### Settings

| Setting      | Options                             | Default |
| ------------ | ----------------------------------- | ------- |
| Aspect Ratio | 16:9, 4:3, 3:2, 1:1, 2:3, 3:4, 9:16 | 16:9    |
| Resolution   | 480p, 720p                          | 480p    |
| Duration     | 5s, 8s, 10s                         | 8s      |

<Note>
  Grok Imagine R2V does not support audio generation, multi-shot mode, or Elements. For those features, use Kling O3 R2V.
</Note>

### Step-by-step guide (Video Studio)

#### 1. Select the model

Go to [venice.ai/video](https://venice.ai/video). In the Model Browser, select **Grok Imagine R2V**.

#### 2. Upload reference images

Click **References** in the input toolbar (or use the + menu) to open the reference images panel. Upload 1–7 images of the characters, objects, or scenes you want in the video.

Each image is automatically tagged as `@Image1`, `@Image2`, etc. in the order you upload them (left to right).

#### 3. Write your prompt

Describe the video you want. Use `@Image` tags to reference specific images:

```
@Image1 and @Image2 walking together through a sunlit park,
camera slowly tracking alongside them, warm afternoon light.
```

Type `@` in the prompt field to see an autocomplete menu of available image references.

<Tip>
  If you omit `@Image` tags entirely, the backend automatically prepends references to all uploaded images. This is useful when you want all images used without specifying which is which.
</Tip>

#### 4. Configure settings and generate

Open **Video Settings** to adjust aspect ratio, resolution, and duration. Click **Generate Video**.

### Grok Imagine R2V pricing

Grok Imagine R2V uses duration- and resolution-based pricing:

| Resolution | Per second |
| ---------- | ---------- |
| 480p       | \~\$0.063  |
| 720p       | \~\$0.088  |

**Example:** An 8-second video at 480p = 8 × \$0.063 = **\~\$0.50**

<Note>
  Grok Imagine charges a content moderation fee for generated videos, even if the video is rejected. This is reflected in the credit cost shown before generation.
</Note>
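
A rough estimate follows directly from the table. The sketch below uses the approximate per-second rates listed in this section and does not model the content moderation fee, so the credit cost shown before generation remains authoritative. `estimate_grok_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Approximate per-second rates in USD, copied from the table in this section.
GROK_R2V_RATES = {"480p": Decimal("0.063"), "720p": Decimal("0.088")}


def estimate_grok_cost(seconds, resolution="480p"):
    """Approximate cost in USD: duration multiplied by the per-second rate."""
    return seconds * GROK_R2V_RATES[resolution]


print(estimate_grok_cost(8))           # 0.504
print(estimate_grok_cost(10, "720p"))  # 0.880
```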

### Grok Imagine R2V API usage

#### Python

```python theme={"system"}
import requests

response = requests.post(
    "https://api.venice.ai/api/v1/video/queue",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "grok-imagine-reference-to-video",
        "prompt": "@Image1 and @Image2 walking through a park, cinematic tracking shot",
        "duration": "8",
        "aspect_ratio": "16:9",
        "referenceImageUrls": [
            "https://example.com/character-a.jpg",
            "https://example.com/character-b.jpg"
        ]
    }
)

response.raise_for_status()  # raise on HTTP errors before reading the body
queue_id = response.json()["id"]
```

#### Node.js

```javascript theme={"system"}
const response = await fetch("https://api.venice.ai/api/v1/video/queue", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "grok-imagine-reference-to-video",
    prompt: "@Image1 and @Image2 walking through a park, cinematic tracking shot",
    duration: "8",
    aspect_ratio: "16:9",
    referenceImageUrls: [
      "https://example.com/character-a.jpg",
      "https://example.com/character-b.jpg"
    ]
  })
});

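// Fail early on HTTP errors before reading the body.
if (!response.ok) {
  throw new Error(`Venice API request failed: ${response.status}`);
}

const { id: queueId } = await response.json();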
```

#### cURL

```bash theme={"system"}
curl https://api.venice.ai/api/v1/video/queue \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-imagine-reference-to-video",
    "prompt": "@Image1 and @Image2 walking through a park, cinematic tracking shot",
    "duration": "8",
    "aspect_ratio": "16:9",
    "referenceImageUrls": [
      "https://example.com/character-a.jpg",
      "https://example.com/character-b.jpg"
    ]
  }'
```

#### API parameters

| Field                | Type      | Required | Description                                               |
| -------------------- | --------- | -------- | --------------------------------------------------------- |
| `model`              | string    | **Yes**  | Must be `grok-imagine-reference-to-video`                 |
| `prompt`             | string    | **Yes**  | Text prompt with optional `@Image1`, `@Image2` references |
| `referenceImageUrls` | string\[] | **Yes**  | 1–7 image URLs or data URLs                               |
| `duration`           | string    | No       | `"5"`, `"8"` (default), or `"10"`                         |
| `aspect_ratio`       | string    | No       | e.g., `"16:9"` (default), `"9:16"`, `"1:1"`               |
| `resolution`         | string    | No       | `"480p"` (default) or `"720p"`                            |

<Note>
  Grok Imagine R2V does not use the `elements`, `image_urls`, or `imageUrl` fields. All reference images are passed via `referenceImageUrls`.
</Note>
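
To catch invalid values before they reach the API, you can validate the request body locally. This is a minimal sketch based only on the parameter table above; `build_grok_payload` is a hypothetical helper, and it leaves `aspect_ratio` unchecked because the table lists that field's values only as examples.

```python
def build_grok_payload(prompt, reference_image_urls,
                       duration="8", aspect_ratio="16:9", resolution="480p"):
    """Assemble a request body for grok-imagine-reference-to-video."""
    if not 1 <= len(reference_image_urls) <= 7:
        raise ValueError("referenceImageUrls needs 1-7 entries")
    if duration not in {"5", "8", "10"}:
        raise ValueError('duration must be "5", "8", or "10"')
    if resolution not in {"480p", "720p"}:
        raise ValueError('resolution must be "480p" or "720p"')
    return {
        "model": "grok-imagine-reference-to-video",
        "prompt": prompt,
        "referenceImageUrls": list(reference_image_urls),
        "duration": duration,  # the API expects a string, not a number
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
    }


payload = build_grok_payload(
    "@Image1 and @Image2 walking through a park, cinematic tracking shot",
    ["https://example.com/character-a.jpg", "https://example.com/character-b.jpg"],
    resolution="720p",
)
```

The returned dictionary can be passed as the `json` argument of `requests.post`, as in the Python example above.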

### Grok Imagine R2V troubleshooting

| Problem                                          | Likely cause                              | Fix                                                                                                       |
| ------------------------------------------------ | ----------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| Generate button is disabled                      | No reference images uploaded              | Upload at least 1 reference image                                                                         |
| "At least one reference image is required" error | `referenceImageUrls` is empty or missing  | Provide at least one image URL in `referenceImageUrls`                                                    |
| Wrong image associated with `@Image` tag         | Image order doesn't match tags            | `@Image1` corresponds to the first image in your upload order (left to right). Reorder uploads if needed. |
| Subject not appearing in video                   | Too many references without explicit tags | Use `@Image` tags in your prompt to be explicit about which images to use                                 |
| Low quality output                               | Using 480p resolution                     | Try 720p for higher quality (costs more)                                                                  |
| Video too short                                  | Default duration is 8s                    | Set duration to `"10"` for longer videos                                                                  |