TIDE: Task-Isolated Diffusion for
Unified Video Editing and Generation

A single diffusion model for instruction editing, reference-guided editing, and subject-driven generation

Qi Liu1,*, Gang Yue2,*, Mingyu Yin2, Lisai Zhang2, Yidi Wu2, Yaole Wang2, Yaohui Wang2, Chang Yao1, Jingyuan Chen1,†, Lin Ma

1Zhejiang University  2Bilibili Inc.
*Equal contribution  †Corresponding author

Abstract

Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet these capabilities are still handled by separate, task-specific models. Building a unified framework that supports diverse video tasks remains an open challenge: existing unified attempts either require dedicated auxiliary encoders or lack explicit mechanisms to distinguish heterogeneous conditioning tokens, struggling when the number and type of visual conditions vary across tasks. We propose TIDE, a unified framework that integrates instruction-based editing, reference-guided editing, and multi-reference generation. At its core, we introduce per-token task embeddings that assign each input token a task-specific identifier, enabling the model to explicitly disambiguate target, source, and reference tokens. To simultaneously capture high-level semantic understanding and fine-grained structural fidelity, we design a dual-path conditioning scheme that couples a vision-language model with a VAE latent path for complementary signals. We further devise a multi-task progressive training strategy that incrementally introduces tasks of increasing complexity, effectively harmonizing diverse objectives and enabling smooth generalization across heterogeneous task distributions. Extensive experiments on multiple video editing and generation benchmarks demonstrate that TIDE achieves state-of-the-art performance across all evaluated tasks.
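The progressive training idea can be pictured as a staged schedule in which the pool of tasks grows over time. The sketch below is illustrative only: the stage boundaries, the order in which tasks are introduced, the uniform sampling, and the names `STAGES`, `active_tasks`, and `sample_task` are all assumptions and are not taken from TIDE.

```python
import random

# Hypothetical schedule: (first training step of the stage, tasks active from then on).
# The task order and the step counts are assumptions made for illustration.
STAGES = [
    (0, ["instruction_editing"]),
    (1000, ["instruction_editing", "reference_guided_editing"]),
    (2000, ["instruction_editing", "reference_guided_editing",
            "multi_reference_generation"]),
]


def active_tasks(step):
    """Return the task pool of the latest stage that has already started."""
    pool = STAGES[0][1]
    for start, tasks in STAGES:
        if step >= start:
            pool = tasks
    return pool


def sample_task(step, rng):
    """Pick the task for this training step uniformly from the active pool."""
    return rng.choice(active_tasks(step))


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 1500, 5000):
        print(step, active_tasks(step), sample_task(step, rng))
```

Earlier tasks stay in the pool in this sketch so that later stages keep revisiting them; whether TIDE does the same is not stated on this page.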

Method Overview

TIDE unifies video editing and generation through per-token task embeddings and dual-path conditioning.


Figure: Overview of the TIDE framework. Per-token task embeddings isolate target, source, and reference roles within shared attention layers, enabling a single model to handle editing and generation tasks seamlessly.
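As a rough illustration of per-token task embeddings, the sketch below concatenates target, source, and reference tokens into one sequence and adds a role-specific vector to every token, so that shared attention layers could tell the roles apart. It is a toy under stated assumptions: the hidden size `DIM`, the random (rather than learned) embedding table, the additive combination, and the helper names `make_role_table`, `build_sequence`, and `tag_tokens` are hypothetical and not confirmed by the page.

```python
import random

ROLES = ("target", "source", "reference")  # roles named in the overview
DIM = 4  # toy hidden size; the real width is not stated on this page


def make_role_table(dim, seed=0):
    """One embedding vector per role (learned in practice; random here)."""
    rng = random.Random(seed)
    return {role: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for role in ROLES}


def build_sequence(target, source, references):
    """Concatenate token groups into one sequence with a parallel role list."""
    tokens = list(target) + list(source)
    roles = ["target"] * len(target) + ["source"] * len(source)
    for ref in references:  # zero, one, or several reference images
        tokens += list(ref)
        roles += ["reference"] * len(ref)
    return tokens, roles


def tag_tokens(tokens, roles, table):
    """Add each token's role embedding to its feature vector."""
    if len(tokens) != len(roles):
        raise ValueError("need exactly one role per token")
    return [[x + e for x, e in zip(tok, table[role])]
            for tok, role in zip(tokens, roles)]


if __name__ == "__main__":
    zero = [0.0] * DIM
    tokens, roles = build_sequence([zero] * 2, [zero] * 2, [[zero], [zero]])
    tagged = tag_tokens(tokens, roles, make_role_table(DIM))
    print(roles)
    print(tagged[0] != tagged[2])  # same features, different roles
```

Because the role list is built alongside the token list, the same code path covers tasks with no reference images, one, or several.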

Instruction-Based Video Editing

Drag the slider to compare input and output.

Style Transfer
Input | Output

Edit Video-1. Apply a Cyberpunk animation style featuring flickering neon signs, rain-slicked streets with holographic reflections, and a gritty urban backdrop.

Style Transfer
Input | Output

Edit Video-1. Apply a Studio Ghibli animation style with its distinctive artistic flair.

Style Transfer
Input | Output

Edit Video-1. Apply a watercolor animation style with fluid color transitions, soft outlines, and a dreamlike atmosphere.

Style Transfer
Input | Output

Edit Video-1. Apply a pixel art style with sharp, retro-style pixels and a cohesive color palette to emulate classic pixel animations.

Style Transfer
Input | Output

Edit Video-1. Apply a dawn aesthetic with a soft, warm ambiance, gradual light transitions, and pastel sky tones.

Style Transfer
Input | Output

Edit Video-1. Apply a classic cartoon aesthetic with a vivid color scheme and exaggerated character movements.

Style Transfer
Input | Output

Edit Video-1. Apply a Chinese Ink Wash Painting style featuring animated ink smudging, blended textures, and a soft, ethereal glow.

Local Change
Input | Output

Edit Video-1. Replace the man's green t-shirt with a dark navy blue business suit, white shirt, and burgundy tie, keeping his position and pose unchanged.

Local Change
Input | Output

Edit Video-1. Replace the smooth white table with a luxurious marble surface table while maintaining its original position and pose.

Local Change
Input | Output

Edit Video-1. Replace the sofa with a classic brown leather sofa with visible stitching, maintaining its exact position and pose.

Local Change
Input | Output

Edit Video-1. Replace the young girl's winter sweater and skirt with a light, sleeveless pastel summer dress while maintaining her seated position and pose.

Local Change
Input | Output

Edit Video-1. Replace the woman in the denim jacket with an elderly woman with silver hair and gentle wrinkles, maintaining her original position and pose.

Local Change
Input | Output

Edit Video-1. Replace the man's orange astronaut suit with a light blue short-sleeve shirt and beige shorts, maintaining his seated pose and position.

Background
Input | Output

Edit Video-1. Replace the background with a dynamic cozy lounge featuring a flickering fireplace, moving wall shadows, and a soft ambient glow. Keep the subject perfectly still.

Background
Input | Output

Edit Video-1. Replace the background with a modern conference room featuring sunlight streaming through glass walls, moving shadows, and shimmering reflections, while keeping the subject still.

Background
Input | Output

Edit Video-1. Replace the background with a dynamic classic car showroom featuring subtle reflections on polished surfaces and gently flickering overhead lights. Keep the subject and car interior perfectly still.

Background
Input | Output

Edit Video-1. Replace the background with a dynamic ship's command bridge featuring ocean waves through the windows and blinking navigation lights. The subject must remain perfectly still.

Creative
Input | Output

Edit Video-1. Transform the clouds into a sprawling futuristic cityscape composed of gleaming glass and steel towers with visible streets and lights.

Creative
Input | Output

Edit Video-1. Transform the man's hands into sleek, metallic robotic hands with articulated joints and glowing blue lights while maintaining the original typing motion.

Creative
Input | Output

Edit Video-1. Transform the white corrugated walls into towering bookshelves filled with colorful books. Replace the framed pictures with ornate reading lamps emitting warm light and adjust the floor reflections to mirror the golden glow.

Creative
Input | Output

Edit Video-1. Transform the paddleboarder into a glowing ethereal water guardian. Make the water sparkle with bioluminescent waves, add glowing mythical water creatures, and replace the sun with a large luminous moon casting a mystical glow.

Creative
Input | Output

Edit Video-1. Transform the bread loaf into a miniature lush forest with green trees, moss, and tiny flowers, while keeping the knife poised to slice through the forest canopy.

Add
Input | Output

Edit Video-1. Overlay an animated black metal music stand to the right of the seated man holding the guitar, ensuring it remains fixed to the man as the camera moves.

Remove
Input | Output

Edit Video-1. Remove the neon green "DON'T PANIC!" text from the tablet, inpainting the background to match the surroundings.

Remove
Input | Output

Edit Video-1. Remove the elegant orchid with its slender stem, delicate pink blossoms, and vibrant green leaves from the entire sequence, reconstructing the background in its place.

Remove
Input | Output

Edit Video-1. Remove the white flower with four ruffled petals and a pale yellow central core, reconstructing the background in its place.

Remove
Input | Output

Edit Video-1. Remove the small bird with glossy black plumage and a yellow beak from the entire video, reconstructing the background.

Subtitle
Input | Output

Edit Video-1. Remove the subtitles located at the top of the video.

Subtitle
Input | Output

Edit Video-1. Remove the subtitles at the bottom of the video.

Subtitle
Input | Output

Edit Video-1. Remove the subtitles at the bottom of the video.

Subtitle
Input | Output

Edit Video-1. Remove the subtitles located at the center of the frame.

Subtitle
Input | Output

Edit Video-1. Remove the subtitles located at the center of the frame.

Background
Input | Output

Edit Video-1. Change the background to a modern art gallery with soft ambient lighting, polished floors with subtle reflections, distant visitors, and dynamic light shifts from overhead spotlights.

Creative
Input | Output

Edit Video-1. Transform the person into a stylized 3D digital avatar with a translucent visor replacing the VR headset. Change the background to a digital grid environment with floating data streams and neon accents.

Reference-Guided Video Editing

Reference images guide the edit; drag the slider to compare.
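The sample labels in this section, such as `2ref_bg_add` or `0ref_remove`, appear to follow a `<N>ref_<op>_<op>...` pattern, where `N` matches the number of reference images shown and each `op` names one edit (`add`, `remove`, `replace`, `bg`). That reading is inferred from the labels on this page rather than documented, and `parse_label` is a hypothetical helper; a minimal sketch of splitting a label under that assumption:

```python
import re

# Assumed pattern: a reference count, the literal "ref", then one or more
# underscore-separated operation names. Inferred from this page's labels only.
LABEL_RE = re.compile(r"^(\d+)ref((?:_[a-z]+)+)$")


def parse_label(label):
    """Split a label like '2ref_bg_add' into (reference count, operations)."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"label does not follow the assumed pattern: {label!r}")
    num_refs = int(match.group(1))
    operations = match.group(2).strip("_").split("_")
    return num_refs, operations


if __name__ == "__main__":
    for label in ("2ref_bg_add", "0ref_remove_remove", "1ref_replace"):
        print(label, parse_label(label))
```

Labels that do not fit the pattern, such as `Style Transfer`, raise `ValueError` instead of being guessed at.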

2ref_add_add
Input | Output
ref | ref

Add the lush green potted fiddle-leaf fig plant from Image-1 to the empty left corner of the room in Video-1. Place the beige, round, fabric-covered meditation cushion from Image-2 directly underneath the woman in the video, so she is sitting centered on top of it.

2ref_add_add
Input | Output
ref | ref

Add the potted green basil plant from Image-1 to the far right corner of the wooden kitchen counter in Video-1. Place the rectangular wooden cutting board from Image-2 flat on the center of the kitchen counter, directly in front of the man.

1ref_add_remove
Input | Output
ref

Add the black cylindrical speaker from Image-1 to the far back corner of the room, positioned on the floor near the right side of the wall. Remove the large arched window located on the back wall in the center of Video-1, replacing that area with a plain white wall surface that matches the surrounding room texture.

1ref_bg
Input | Output
ref

Replace the existing dark, plain background of Video-1 with the interior of the Roman Colosseum depicted in Image-1. Ensure the new background includes the sandy arena floor, the ancient tiered stone arches, and the blue sky to create a historic architectural atmosphere behind the two performers.

1ref_bg
Input | Output
ref

Replace the sandy beach background of Video-1 with the modern podcast studio from Image-1, featuring a large wooden desk, microphones, acoustic wall panels, and sleek LED lighting. Ensure the two surfers in their wetsuits remain in the foreground, seamlessly integrated into the new indoor environment as if walking through the studio.

1ref_bg
Input | Output
ref

Replace the entire nighttime city street background of Video-1 with the vast salt flats shown in Image-1. Ensure the cracked white salt surface, the distant mirror-like water reflecting the sky, and the bright cloudy sky seamlessly integrate behind the people currently standing in the foreground, maintaining the perspective of the expansive landscape.

2ref_bg_add
Input | Output
ref | ref

Replace the existing forest trail background in Video-1 with the desert scene featuring the Pyramids of Giza under a clear blue sky from Image-1. Add the bright orange backpack rain cover from Image-2 over the blue backpack worn by the person in the center of the frame, ensuring it snugly covers the rear portion of the backpack.

2ref_bg_add
Input | Output
ref | ref

Replace the existing architectural stairwell background in Video-1 with the park featuring a large green lawn, trees, and the Eiffel Tower under a golden sunset sky shown in Image-1. Add the minimalist black floor lamp from Image-2 to the left side of the frame, positioning it on the grass next to the man playing the guitar so that it stands vertically.

1ref_bg_remove
Input | Output
ref

Replace the existing gymnasium background of Video-1 with the wooden pier overlooking the water and the Sydney Opera House from Image-1, maintaining the foreground trampoline and the person jumping. Remove the brown leather pommel horse located in the center background of the gymnasium.

2ref_bg_replace
Input | Output
ref | ref

Replace the existing outdoor forest background in Video-1 with the large Mayan pyramid in a grassy plaza under a clear blue sky from Image-1. Replace the wooden signpost located on the right side of the hiking trail in Video-1 with the modern stainless steel trail map kiosk shown in Image-2, ensuring it is positioned firmly in the ground where the original sign stood.

2ref_bg_replace
Input | Output
ref | ref

Replace the entire background of Video-1, including the white walls and windows, with the modern penthouse loft with hardwood floors, a marble fireplace, and large floor-to-ceiling windows overlooking the New York City skyline from Image-1. Additionally, replace the large rectangular mirror located on the left wall with the sleek, black, flat-screen television from Image-2, ensuring the TV is mounted in the same position and orientation as the original mirror.

Style Transfer
Input | Output
ref

Apply the low-poly 3D geometric art style of Image-1 to Video-1, characterized by flat-shaded triangular facets, geometric simplification, angular surfaces, and minimal color gradients within each polygon. It is essential to preserve the original motion of the street performer, the musical instruments, and the entire spatial composition.

Style Transfer
Input | Output
ref

Style Video-1 as a soft watercolor painting based on Image-1, showcasing translucent washes, bleeding edges, soft gradients, visible brush strokes, and delicate color blending. All original motion of the man fishing by the lake, along with the subjects and spatial composition, must remain unchanged.

Style Transfer
Input | Output
ref

Apply the pointillism painting style of Image-1 to Video-1, using tiny distinct dots of color, optical color mixing, a vibrant mosaic-like texture, and systematic dot patterns. It is crucial to maintain the original motion of the basketball player, the outdoor court, and the entire spatial composition.

0ref_remove
Input | Output

Remove the blue-tinted, three-columned potted cactus located at the center front of the white desk in Video-1. Ensure the surface of the white desk is seamlessly filled in where the brown terracotta pot previously sat.

0ref_remove_remove
Input | Output

Remove the weathered brown wooden signpost located on the right side of the hiking trail in Video-1. Additionally, remove the blue water bottle tucked into the side pocket of the blue backpack worn by the hiker in the center of Video-1.

0ref_remove_remove
Input | Output

Remove the clear glass beaker positioned at the front left of the desk in Video-1. Additionally, remove the transparent acrylic test tube rack containing colorful liquids located on the left side of the desk.

0ref_remove_remove
Input | Output

Remove the pile of broken concrete debris located at the bottom left foreground of Video-1. Additionally, remove the yellow taxi car positioned in the background on the right side of the street.

1ref_remove_replace
Input | Output
ref

Remove all the vertical brown wooden posts lining both sides of the boardwalk in Video-1. Replace the entire wooden plank boardwalk surface with the grey, irregular stone tile texture shown in Image-1, ensuring the new path maintains the same winding shape and perspective as the original walkway.

1ref_remove_replace
Input | Output
ref

Remove the cyclist positioned on the left side of the road in Video-1. Replace the black windbreaker worn by the central cyclist with the bright neon yellow short-sleeved cycling jersey shown in Image-1, ensuring the new jersey maintains the same fit and follows the rider's posture.

1ref_replace
Input | Output
ref

Replace the black and white patterned hat worn by the man in Video-1 with the solid brown fedora shown in Image-1. The new brown fedora should maintain the same position on the man's head, covering his hair, and match the size and brim style of the original hat.

2ref_replace_replace
Input | Output
ref | ref

Replace the man's gray and white horizontally striped sweater in Video-1 with the beige linen shirt shown in Image-1, maintaining the same fit and seated posture. Simultaneously, replace the boy's solid blue long-sleeved shirt in Video-1 with the bright yellow hoodie from Image-2, ensuring the hood and sleeves match the boy's current position on the wooden dock.

1ref_add
Input | Output
ref

Add the beige, rectangular, textured floor rug from Image-1 underneath the black wooden chair in Video-1. The rug should be positioned flat on the dark wood floor, centered beneath the chair's legs, covering the area directly under the seating space.

2ref_add_add_remove
Input | Output
ref | ref

Remove the dark brown wooden side table located on the left side of Video-1. Add the potted fiddle-leaf fig plant from Image-1 to the floor space where the side table was previously positioned, next to the chair. Hang the minimalist framed abstract wall art from Image-2 on the empty white wall behind the woman, centering it horizontally above her head.

1ref_bg
Input | Output
ref

Replace the entire indoor living room background of Video-1 with the snowy field at night under bright green aurora borealis depicted in Image-1. Ensure the woman from Video-1 remains in the foreground, maintaining her seated position on the ground while her surroundings are seamlessly transitioned into the vast snow-covered landscape with distant mountains and pine trees under a spectacular starry sky.

1ref_bg
Input | Output
ref

Replace the plain white wall background of Video-1 with the misty view of the Taj Mahal and its reflecting pool from Image-1. Ensure the new background is seamlessly integrated behind the chair and the side table, maintaining the original lighting and perspective of the scene.

2ref_bg_add_remove
Input | Output
ref | ref

Replace the plain white wall background of Video-1 with the dusty wild west town scene with wooden buildings like a saloon and hotel from Image-1. Remove the potted snake plant located in the wicker basket on the right side of the frame. Add the sleek black floor lamp from Image-2 into the right corner of the room, positioning it upright on the floor where the plant was previously located.

1ref_bg_remove
Input | Output
ref

Replace the current solid dark blue background in Video-1 with the alien glowing forest scene from Image-1, featuring trees with glowing cyan leaves and pinkish trunks under a starry night sky. Remove the silver hoop earring with the white teardrop pearl pendant currently worn on the subject's ear, ensuring the earlobe appears natural and free of jewelry.

2ref_bg_replace
Input | Output
ref | ref

Replace the entire living room background of Video-1 with the scene of the Taj Mahal with a reflecting pool and hazy warm light from Image-1, maintaining the subject's position in the foreground. Replace the large abstract painting hanging on the wall behind the sofa with the framed green fern botanical print shown in Image-2, ensuring it is centered on the wall.

0ref_remove
Input | Output

Remove the small green potted plant sitting on the white kitchen counter in the background of Video-1. Ensure the area where the plant was located is seamlessly filled with the same white countertop and cabinet finish visible behind it.

Showcase: Complex Multi-Reference Editing

Complex editing with multiple reference images for background replacement, object addition, and removal.

1ref_bg_remove
Input | Output
ref

Replace the entire city street background of Video-1 with the snowy field at night under bright green aurora borealis from Image-1, with distant mountains and pine trees under a starry sky. Remove the white tote bag from the woman's left shoulder on the left side of the frame. Ensure all three people remain in the foreground with their walking motion preserved, seamlessly integrated into the new arctic night environment.

2ref_bg_replace
Input | Output
ref | ref

Replace the modern city street background of Video-1 with the dusty Wild West town street lined with wooden saloon and hotel buildings from Image-1. Replace the dark gray button-up shirt worn by the middle man with the red and black buffalo plaid flannel shirt from Image-2, preserving his white undershirt, jeans, backpack, and walking posture.

2ref_bg_add
Input | Output
ref | ref

Replace the indoor living room background of Video-1 with the vast Uyuni salt flats under a bright sky from Image-1. Add the red Santa Claus hat with white fluffy trim from Image-2 onto the head of the woman sitting in the middle, fitting naturally with her long black hair and seated posture. All clothing remains unchanged.

2ref_bg_replace
Input | Output
ref | ref

Replace the beach and lighthouse background of Video-1 with the vast Uyuni salt flats reflecting sky and clouds from Image-1. Replace the right woman's white top and dark floral skirt with the bright yellow sleeveless sundress from Image-2, keeping her red curly hair and relaxed pose on the blanket.

2ref_bg_replace
Input | Output
ref | ref

Replace the winter city street background of Video-1 with the vast Uyuni salt flats reflecting the sky from Image-1. Replace the middle man's gray houndstooth blazer with the burgundy velvet blazer from Image-2, keeping his navy turtleneck and black beret.

Subject-Driven Video Generation

Given reference images, generate a video depicting the subjects.

humanobj
ref | ref | ref

a man feeding a bird in the park.

humanobj
ref | ref | ref

a man sitting in the office, a cat sitting beside him.

humanobj
ref | ref | ref

a man sitting on the grass in the park, a dog walking around him.

multiface
ref | ref | ref

The video depicts an indoor scene where two individuals are engaged in a conversation. The setting appears to be a well-lit room with natural light streaming through large windows. The room is furnished with a wheelchair positioned near the back, suggesting it might be a medical or care facility. A desk with a laptop and some other items is visible on the right side of the frame. The person on the left, dressed in a gray uniform, is holding a tablet and gesturing with her hands while speaking. Her posture indicates she is explaining something, possibly related to the tablet's content. The individual on the right, wearing a light-colored robe, is seated and listening attentively, smiling slightly, which suggests a positive interaction. The camera remains static throughout the sequence, focusing on capturing the interaction between the two individuals. There is no noticeable camera movement such as panning or zooming. The overall atmosphere seems calm and professional, with the focus on the exchange between the two people.

multiface
ref | ref

The video captures two individuals taking a selfie together in an indoor setting. The man is holding a smartphone with his right hand extended forward, capturing the photo. He is dressed in a light gray blazer over a white shirt. The woman beside him has long blonde hair and is wearing a white top. Both are smiling broadly, appearing cheerful and engaged in the moment. The background suggests they are in a modern office or a similar professional environment, with a mix of neutral tones and greenery visible behind them. The lighting is bright and even, likely from overhead fluorescent lights, which illuminates the scene clearly without harsh shadows. As the video progresses, there is minimal change in the positioning of the subjects. They maintain their close proximity and continue to smile at the camera. The man's arm remains steady, holding the phone at arm's length, while the woman slightly adjusts her position to ensure she is within the frame. The overall atmosphere conveys a sense of camaraderie and lightheartedness. The camera remains static throughout the sequence, focusing on the two individuals as they pose for the selfie. There is no noticeable panning, tilting, or zooming, keeping the framing consistent and centered on the subjects. The video captures a candid and joyful moment shared between the two individuals.

multiface
ref | ref | ref

The video depicts two individuals sitting closely on a beige couch covered with a matching beige slipcover. The man is seated on the left side, wearing a maroon t-shirt and blue jeans, while the woman sits beside him, dressed in a purple plaid shirt and dark blue jeans. She has her arm around his shoulder, suggesting a friendly or intimate relationship. The man is focused on a silver laptop placed on his lap, occasionally moving his hands to interact with it. The woman leans in towards the laptop, pointing at the screen with her right hand, indicating she might be explaining something or showing him something specific. Her facial expression suggests engagement and interest. The background shows large windows with a view of a cityscape, indicating an urban setting. The lighting is bright, suggesting daytime with clear weather outside. The overall atmosphere appears casual and relaxed, with the two individuals comfortably interacting with each other and the laptop. There is no significant camera movement; the shot remains static throughout the sequence, focusing on capturing the interaction between the two individuals and their shared activity on the laptop.

multiobject
ref | ref

The video begins with a close-up of a French press coffee maker on a wooden surface, containing partially brewed coffee, and a blue and white striped mug with a heart design to its left. The background is blurred, suggesting a cozy indoor setting. A person's hand, wearing a blue sleeve, appears and presses down on the plunger of the French press, indicating the completion of the brewing process. The person then lifts the French press and pours the freshly brewed coffee into the mug, with the camera focusing on the pouring action. The coffee flows smoothly into the mug, filling it up. After pouring, the person places the French press back on the wooden surface next to the mug, which is now filled with coffee. The video concludes with a close-up of the filled mug and the French press, emphasizing the freshly brewed coffee and the cozy indoor setting.

multiobject
ref | ref

The video begins with a close-up view of a wooden birdhouse with a black roof, set against a backdrop of green foliage and a wooden fence. A blue tit, a small bird with blue and yellow plumage, is seen peeking out from the entrance of the birdhouse. Initially, the bird's head is visible, and it appears to be looking around. As the video progresses, the bird gradually extends its body out of the entrance, revealing more of its blue and yellow feathers. The bird's movements are cautious and deliberate, suggesting it is assessing its surroundings. Throughout the video, the bird's position changes slightly, with its head and body moving in and out of the birdhouse. The background remains consistent, with the green foliage and wooden fence providing a natural setting for the birdhouse. The video captures the bird's interactions with its environment, highlighting its cautious behavior and the gradual exploration of its surroundings.

multiobject
ref | ref | ref

The video begins with a person standing behind a table, dressed in a red robe with white cuffs, indicating a traditional or cultural setting. On the table, there is a brown clay teapot with a lid, a transparent glass pitcher filled with a golden liquid, and a small white cup with a blue pattern. The person's hands are initially resting on the table, near the teapot and the pitcher. The person then lifts the glass pitcher and pours the golden liquid into the white cup, suggesting the preparation of a beverage, likely tea. After pouring, the person places the pitcher back on the table and picks up the cup, bringing it to their mouth to drink. The background features a bamboo mat, reinforcing the traditional or cultural ambiance of the scene. The video captures a sequence of actions related to the preparation and consumption of a beverage, set against a backdrop that suggests a cultural or traditional context.

multiobject
ref | ref

The video showcases the process of frying French fries in a commercial kitchen setting. It begins with a close-up of a metal fryer basket filled with golden, crispy French fries being lifted out of a fryer by a person wearing white gloves. The basket is held above a stainless steel fryer, which contains more fries submerged in hot oil. The person tilts the basket to drain excess oil from the fries, allowing it to drip back into the fryer. This action is repeated several times to ensure all the oil is drained. The background is dimly lit, focusing the viewer's attention on the frying process. The video emphasizes the meticulous care taken in preparing the fries, highlighting the importance of draining excess oil for the perfect texture and taste.

multiobject
ref | ref

The video features a humanoid robot with a shiny metallic finish, standing against a plain white background. The robot is holding a cupcake with chocolate frosting and a yellow cherry on top in its right hand. Throughout the video, the robot maintains a static pose, with no movement or change in its position. The cupcake remains in the robot's hand, and there are no alterations in the robot's posture or the environment. The background remains consistently white, and there are no additional elements introduced into the scene. The camera angle and perspective do not change, providing a continuous, unaltered view of the robot and the cupcake.

multiobject
ref | ref

The video begins with a close-up of a black analog alarm clock with a white face and black numbers, placed on a light-colored wooden floor. The clock shows the time as 10:10. The background features a blurred interior setting with a wooden structure on the left and a striped object on the right, possibly a piece of furniture or a curtain. The scene remains static, with no movement or changes in the environment. As the video progresses, the time on the clock changes to 10:11, indicating a minute has passed. The background and the position of the clock remain unchanged, maintaining the same static scene.

multiobject
ref | ref | ref

The video showcases a serene and static scene set on a wooden table. On the left side of the table, there is a plate filled with golden-brown, crescent-shaped cookies. To the right, a clear glass vase holds a bouquet of lilac flowers, adding a touch of natural beauty to the setting. In the foreground, a transparent glass teapot is seen pouring a golden liquid, presumably tea, into a matching glass cup. The pouring action is captured in detail, with the tea flowing smoothly from the teapot into the cup. Throughout the video, there are no significant changes or movements observed. The scene remains consistent, with the cookies, lilac flowers, and the act of pouring tea maintaining their positions and appearances. The camera angle and perspective do not change, providing a steady view of this tranquil moment.

singlehuman
ref

A man gently clutching a bouquet of vibrant flowers, his eyes radiating a serene contentment as he glances at the camera. His slightly upturned lips convey a sense of calm joy, accompanied by a faint twinkle in his eye. The scene is set in a lush garden, brimming with colorful blooms and verdant foliage, creating a tranquil haven. The shot captures him from the waist up, emphasizing his relaxed stance and the natural harmony of his surroundings.

singleobj
ref

The video begins with a close-up of a vintage camera resting on an old wooden desk, surrounded by scattered photographs. The camera zooms in to capture the details of the camera—its weathered leather, brass accents, and the glass lens catching the light. A hand gently picks up one of the photographs, flipping it over to reveal its back. The camera shifts focus between the photos and the camera, as the hand adjusts the camera’s settings with a soft click. As the hand places the photo back onto the desk, a slight breeze causes a few of the scattered photographs to shift, creating subtle motion in the scene. The soft sound of the camera’s dials and the faint rustle of paper add to the nostalgic atmosphere of the moment.

singleobj
ref

The video begins with a close-up of a street lamp, its warm glow casting soft light across the cobblestone street below. The camera slowly zooms out, revealing the lantern-style lamp against a dusky sky, with scattered clouds catching the last light of sunset. As the breeze stirs, the faint rustle of leaves can be heard, and the soft light from the lamp flickers slightly, casting changing shadows on the cobblestones. The camera moves gently around the scene, capturing the play of light and shadow, with the street lamp remaining the focal point. The distant sound of footsteps or faint city noise adds to the calm yet slightly mysterious atmosphere of the quiet evening.

singleobj
ref

The video opens with a close-up of a teddy bear sitting on a park bench, surrounded by vibrant autumn leaves scattered across the ground. The camera slowly zooms in on the teddy bear, capturing the soft texture of its fur as a gentle breeze causes a few leaves to flutter around it. The bear remains still, its button eyes reflecting the soft light of the autumn afternoon. Occasionally, a leaf lands on the bear's lap, adding a subtle motion to the otherwise calm scene. The sound of leaves rustling and the distant chirping of birds creates a peaceful atmosphere, with the teddy bear remaining the focal point of the tranquil park setting.

singleobj
ref

The video begins with a close-up of a beer can placed at the center of a circle made of international currencies—paper bills and coins from various countries. The camera slowly zooms out, capturing the can’s metallic surface and the intricate details of the surrounding money. As the camera moves, the currencies subtly shimmer, reflecting light, while the beer can remains the focal point. The sound of light ambient noise adds to the quiet significance of this global scene, emphasizing the connection between the beer can and its worldwide influence.

singleobj
ref

The video begins with a close-up of a basketball shoe resting on a reflective surface, its vibrant, multicolored laces standing out against the sleek design. The camera slowly zooms in to capture the intricate details of the shoe—its texture, stitching, and the dynamic play of colors in the laces. As the camera pulls back, the reflections on the surface create an interesting visual effect, enhancing the shoe’s bold features. The video maintains focus on the shoe, emphasizing its design and energy, while subtle background sounds highlight the movement of the camera.

singleobj
ref

The video opens with a plush armchair positioned near a roaring fireplace, the warm glow from the flames casting soft light across the room. The camera slowly zooms in on the armchair, capturing its inviting cushions and rich fabric. As the fire crackles, a slight movement is seen in the armchair, as if someone just left it, leaving behind a faint impression in the cushions. The camera shifts to the flickering flames, then back to the chair, highlighting the contrast between the stillness of the armchair and the lively motion of the fire. The gentle crackling sound of the fire and the soft flicker of light create a warm, cozy atmosphere.

Multi-Subject Video Generation

Generate videos featuring multiple subjects from reference images in a coherent scene.

3-subject
ref ref ref

A mouth-watering slow-pan across a rustic wooden diner table. Image-1 sits center stage, steaming slightly. Next to it, Image-2 is piled high, and in the background, Image-3 catches the neon lights of the diner window. The camera slowly pushes in, emphasizing the textures and juicy details of the meal.

3-subject
ref ref ref

A warm, inviting living room scene at twilight. Image-1 sits elegantly on top of Image-3. Beside the chair, Image-2 casts a warm, soft yellow glow over the velvet fabric. The camera slowly arcs around the setup, highlighting the rich textures and cozy atmosphere of the home.

3-subject
ref ref ref

A moody, dimly lit jazz club stage. A musician is passionately playing Image-1, plugged directly into Image-2 which vibrates slightly with the bass. In the foreground, Image-3 stands ready, catching the dramatic blue and purple stage lights. Slow panning shot capturing the live music energy.

4-subject
ref ref ref ref

A slick, high-end commercial video. Image-1 is walking purposefully through a modern glass-walled office lobby. He is wearing Image-2 impeccably. As he checks the time, the camera does a macro zoom on Image-3 on his wrist, before cutting to a low-angle tracking shot of Image-4 clicking rhythmically on the marble floor.

4-subject
ref ref ref ref

A luxurious sunlit living room. Image-1, wearing the sparkling Image-4 around its neck, steps gracefully out of Image-2. The cat stretches lazily, then walks over to Image-3 to eat. The camera follows at floor level with a shallow depth of field, highlighting the cat's majestic fur.

4-subject
ref ref ref ref

An architectural interior design video. The camera slowly pans across a chic dining space, showing Image-1 surrounded by four of Image-2. In the center of the table sits Image-3, adding verticality to the space. On the wall behind the setup, Image-4 hangs perfectly centered, completing the modern aesthetic.

5-subject
ref ref ref ref ref

A wide, sweeping drone shot dropping down into a sun-drenched park. Image-4 is spread out on the green grass. Image-1 and Image-2 are sitting on it, laughing together as they unpack fresh fruit from Image-3. Beside them, Image-5 wags its tail excitedly, hoping for a treat. The scene is bright, happy, and cinematic.

5-subject
ref ref ref ref ref

A mouth-watering macro video sequence in a bright kitchen. Image-1 sits in the center of the frame. A hand slowly pours Image-2 over the stack in slow motion, the syrup cascading beautifully. Image-3 is scattered artistically around the plate. To the side sits Image-4, steam rising into the morning light. Finally, a hand picks up Image-5 and cuts into the fluffy stack.