p-video-avatar

Generate talking-head videos from a single image, driven by a text script or an audio file

Overview

p-video-avatar is Pruna's talking-head video generation model. Provide a portrait image and either a speech script or an audio file, and the model generates a realistic video of the person speaking. It supports multiple voices, languages, and output resolutions.

Rate Limit: 50 requests per minute
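
One way to stay under this limit on the client side is a sliding-window throttle. The sketch below is an illustration only: the class name and its methods are hypothetical, it is not part of any Pruna SDK, and it assumes the limit applies to any rolling 60-second window (the page does not say how the window is measured).

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window.

    Hypothetical client-side helper; the window semantics are an assumption.
    """

    def __init__(self, max_calls=50, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Return how many seconds to wait before the next call is allowed."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def acquire(self):
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        if delay > 0:
            self._sleep(delay)
            self.wait_time()  # prune again now that time has passed
        self._calls.append(self._clock())
```

Call `limiter.acquire()` immediately before each request. The `clock` and `sleep` arguments exist only so the behavior can be checked without real waiting.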

Category: Video Generation

Pricing:

Configuration	Price
720p	$0.025 per second of output video
1080p	$0.045 per second of output video
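
As a rough budgeting aid, the per-second prices can be turned into a cost estimate. This is a minimal sketch: the function name is hypothetical, and it assumes billing is strictly proportional to output duration with no rounding or minimum charge, which this page does not state.

```python
from decimal import Decimal

# Per-second prices in USD, copied from the pricing table.
PRICE_PER_SECOND = {
    "720p": Decimal("0.025"),
    "1080p": Decimal("0.045"),
}


def estimate_cost(duration_seconds, resolution="720p"):
    """Return the estimated cost in USD for a video of the given length.

    Assumes linear per-second billing (an assumption, not documented).
    """
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Decimal avoids binary floating-point drift in currency arithmetic.
    return PRICE_PER_SECOND[resolution] * Decimal(str(duration_seconds))
```

Under that assumption, a 10-second clip at 720p comes to $0.25 and a 12.5-second clip at 1080p to $0.5625.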

Quickstart

Start by uploading your image:

curl -X POST "https://api.pruna.ai/v1/files" \
  -H "apikey: YOUR_API_KEY" \
  -F "content=@/path/to/portrait.jpg"

Note: Use -F (form) with @ prefix to upload a file from your local filesystem. The file path should be absolute or relative to your current directory.
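
The same upload can be sketched in Python using only the standard library. The helper names below are hypothetical. The form field name (content), the apikey header, and the URL are taken from the curl command above; the response format is not documented on this page, so the sketch returns the raw response body rather than guessing at its structure.

```python
import os
import urllib.request
import uuid


def build_multipart(field_name, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_file(path, api_key, url="https://api.pruna.ai/v1/files"):
    """POST a local file to the files endpoint; return the raw response bytes."""
    with open(path, "rb") as handle:
        data = handle.read()
    body, content_type = build_multipart("content", os.path.basename(path), data)
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("apikey", api_key)
    request.add_header("Content-Type", content_type)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Whether the server requires a specific per-part Content-Type is unknown; the sketch sends application/octet-stream, whereas curl guesses a type from the file extension.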

Avatar with Voice Script (Synchronous)

curl -X POST 'https://api.pruna.ai/v1/predictions' \
  -H 'Content-Type: application/json' \
  -H 'apikey: YOUR_API_KEY' \
  -H 'Model: p-video-avatar' \
  -H 'Try-Sync: true' \
  -d '{
    "input": {
      "image": "https://your-url.com/portrait.jpg",
      "voice_script": "Hello, welcome to our product demo!",
      "voice": "Zephyr (Female)"
    }
  }'

Avatar with Voice Script (Asynchronous)

curl -X POST 'https://api.pruna.ai/v1/predictions' \
  -H 'Content-Type: application/json' \
  -H 'apikey: YOUR_API_KEY' \
  -H 'Model: p-video-avatar' \
  -d '{
    "input": {
      "image": "https://your-url.com/portrait.jpg",
      "voice_script": "Hello, welcome to our product demo!",
      "voice": "Zephyr (Female)",
      "voice_language": "English (US)"
    }
  }'
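
In the two voice-script examples above, the only request-level difference is the Try-Sync: true header. A minimal Python sketch of the same requests, using only the standard library, could look like this. The function names are hypothetical, and parsing the response as JSON is an assumption, since this page does not show the response format.

```python
import json
import urllib.request

PREDICTIONS_URL = "https://api.pruna.ai/v1/predictions"


def prediction_headers(api_key, model="p-video-avatar", try_sync=False):
    """Build the request headers used in the curl examples."""
    headers = {
        "Content-Type": "application/json",
        "apikey": api_key,
        "Model": model,
    }
    if try_sync:
        # Present only in the synchronous example.
        headers["Try-Sync"] = "true"
    return headers


def build_prediction_request(api_key, model_input, try_sync=False):
    """Wrap `model_input` in {"input": ...} and return a POST request object."""
    body = json.dumps({"input": model_input}).encode("utf-8")
    return urllib.request.Request(
        PREDICTIONS_URL,
        data=body,
        headers=prediction_headers(api_key, try_sync=try_sync),
        method="POST",
    )


def create_prediction(api_key, model_input, try_sync=False):
    """Send the request. Assumes (not documented here) a JSON response body."""
    request = build_prediction_request(api_key, model_input, try_sync)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, create_prediction("YOUR_API_KEY", {"image": "https://your-url.com/portrait.jpg", "voice_script": "Hello!"}, try_sync=True) sends the same payload shape as the synchronous curl call.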

Avatar with Audio

curl -X POST 'https://api.pruna.ai/v1/predictions' \
  -H 'Content-Type: application/json' \
  -H 'apikey: YOUR_API_KEY' \
  -H 'Model: p-video-avatar' \
  -d '{
    "input": {
      "image": "https://your-url.com/portrait.jpg",
      "audio": "https://your-url.com/speech.mp3"
    }
  }'

Parameters

Required Parameters

Parameter	Type	Description
image	string	Input image, used as the first frame. Supported formats: jpg, jpeg, png, webp

You must also provide either voice_script or audio (or both, in which case audio takes priority).
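
The requirement above can be checked on the client before a request is sent. The following is a small sketch with hypothetical helper names; it mirrors the stated rule (image plus at least one of voice_script or audio, with audio taking priority) and makes no claim about how the server reports violations.

```python
def build_input(image, voice_script=None, audio=None, **options):
    """Return the "input" object for a prediction request.

    Raises ValueError when the documented requirements are not met.
    """
    if not image:
        raise ValueError("image is required")
    if not voice_script and not audio:
        raise ValueError("provide voice_script, audio, or both")
    payload = {"image": image}
    if audio:
        payload["audio"] = audio
    if voice_script:
        payload["voice_script"] = voice_script
    # Any optional parameters (voice, resolution, seed, ...) pass through as-is.
    payload.update(options)
    return payload


def speech_source(payload):
    """Name the field that drives the speech: audio wins when both are set."""
    return "audio" if payload.get("audio") else "voice_script"
```

Sending both fields is allowed, but speech_source shows that the voice_script would be ignored in that case, so a client may prefer to warn the user.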

Optional Parameters

Parameter	Type	Default	Description
seed	integer	random	Random seed. Set a fixed value for reproducible generation
audio	string	-	URL of uploaded audio to condition video generation. If both audio and voice_script are provided, the uploaded audio is used
voice	string	"Zephyr (Female)"	Voice for generated speech. Options: "Zephyr (Female)", "Puck (Male)", "Charon (Male)", "Kore (Female)", "Fenrir (Male)", "Leda (Female)", "Orus (Male)", "Aoede (Female)", "Callirrhoe (Female)", "Autonoe (Female)", "Enceladus (Male)", "Iapetus (Male)", "Umbriel (Male)", "Algenib (Male)", "Despina (Female)", "Erinome (Female)", "Laomedeia (Female)", "Achernar (Female)", "Algieba (Male)", "Schedar (Male)", "Gacrux (Female)", "Pulcherrima (Female)", "Achird (Male)", "Zubenelgenubi (Male)", "Vindemiatrix (Female)", "Sadachbia (Male)", "Sadaltager (Male)", "Sulafat (Female)", "Alnilam (Male)", "Rasalgethi (Male)"
resolution	string	"720p"	Resolution of the video. Options: "720p", "1080p"
video_prompt	string	"The person is talking."	Optional prompt for the video
voice_prompt	string	"Say the following."	Optional speaking style, tone, pacing or emotion instructions
voice_script	string	""	Script for the person to say when no audio is uploaded
voice_language	string	"English (US)"	Output language. Options: "English (US)", "English (UK)", "Spanish", "French", "German", "Italian", "Portuguese (Brazil)", "Japanese", "Korean", "Hindi"
negative_prompt	string	""	What you do not want in the video, e.g. "subtitles, text, blurry, low quality, watermark". Disabled if empty. We recommend listing several keywords at once
strength_negative_prompt	number	0.5	Strength of the negative prompt, from 0 to 4. The optimal value can differ with video length (experimental feature)
disable_safety_filter	boolean	true	Disable the safety filter for prompts and the input image. When true, prompts are not checked for unsafe content before generation
disable_prompt_upsampling	boolean	false	When true, skip the OpenRouter multimodal prompt upsampler and pass the raw user prompt to the video model
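
Several of the optional parameters accept only fixed values or ranges, so a client can catch mistakes before spending a request. The sketch below is a hypothetical helper, not part of the API. It covers resolution, voice_language, strength_negative_prompt, seed, and the two boolean flags, and it assumes the 0 to 4 range is inclusive at both ends. The voice list is omitted for brevity.

```python
RESOLUTIONS = ("720p", "1080p")
VOICE_LANGUAGES = (
    "English (US)", "English (UK)", "Spanish", "French", "German",
    "Italian", "Portuguese (Brazil)", "Japanese", "Korean", "Hindi",
)
BOOLEAN_FIELDS = ("disable_safety_filter", "disable_prompt_upsampling")


def validate_options(options):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    resolution = options.get("resolution")
    if resolution is not None and resolution not in RESOLUTIONS:
        problems.append(f"resolution must be one of {RESOLUTIONS}")
    language = options.get("voice_language")
    if language is not None and language not in VOICE_LANGUAGES:
        problems.append(f"unsupported voice_language: {language!r}")
    strength = options.get("strength_negative_prompt")
    if strength is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(strength, (int, float)) and not isinstance(strength, bool)
        if not is_number or not 0 <= strength <= 4:
            problems.append("strength_negative_prompt must be a number from 0 to 4")
    seed = options.get("seed")
    if seed is not None and (isinstance(seed, bool) or not isinstance(seed, int)):
        problems.append("seed must be an integer")
    for name in BOOLEAN_FIELDS:
        if name in options and not isinstance(options[name], bool):
            problems.append(f"{name} must be a boolean")
    return problems
```

Returning a list instead of raising on the first error lets a caller report every problem in one pass.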