convertToCaptions() (available from v4.0.131)
This API assumes a newer version of Whisper.cpp than the stable release in order to support tokenLevelTimestamps. As a downside, this version may crash unexpectedly.
Use an older version of Whisper.cpp (1.0.54 or earlier) if you prefer a stable version and can forgo tokenLevelTimestamps support.
Opinionated function that converts the output of transcribe() into easily digestible captions.
It can also combine words whose timestamps are close together.
Useful for TikTok- or Reels-style videos that animate captions word by word.
transcribe.mjs
```tsx
import path from "path";
import { transcribe, convertToCaptions } from "@remotion/install-whisper-cpp";

const { transcription } = await transcribe({
  inputPath: "/path/to/audio.wav",
  whisperPath: path.join(process.cwd(), "whisper.cpp"),
  model: "medium.en",
  tokenLevelTimestamps: true,
});

const { captions } = convertToCaptions({
  transcription,
  combineTokensWithinMilliseconds: 200,
});

for (const line of captions) {
  console.log(line.text, line.startInSeconds);
}
```
Options
transcription
The transcription object that you retrieved from transcribe().
The tokenLevelTimestamps option must have been set to true.
combineTokensWithinMilliseconds
Combines words whose timestamps are within this many milliseconds of each other.
Without combining, individual words may be displayed only very briefly when captions are shown word by word.
Set to 0 to disable combining.
Recommended value: 200.
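To illustrate what combining close words means, here is a minimal sketch. It is not the library's actual implementation: the `TimedWord` shape, the `combineCloseWords` name, and the rule "merge a word if it starts within the threshold of the previous word's start" are assumptions made for illustration only.

```typescript
// Hypothetical input shape for illustration; the real transcription
// object returned by transcribe() is structured differently.
type TimedWord = {
  text: string;
  startInMilliseconds: number;
};

type Caption = {
  text: string;
  startInSeconds: number;
};

// Merge a word into the previous caption when it starts within
// `thresholdMs` of the previous word's start time.
// A threshold of 0 disables merging, so every word becomes its own caption.
const combineCloseWords = (
  words: TimedWord[],
  thresholdMs: number,
): Caption[] => {
  const result: Caption[] = [];
  let lastWordStart = -Infinity;

  for (const word of words) {
    const previous = result[result.length - 1];
    if (
      previous &&
      thresholdMs > 0 &&
      word.startInMilliseconds - lastWordStart <= thresholdMs
    ) {
      // Close enough: append the text, keep the earlier start time.
      previous.text += word.text;
    } else {
      result.push({
        text: word.text,
        startInSeconds: word.startInMilliseconds / 1000,
      });
    }
    lastWordStart = word.startInMilliseconds;
  }

  return result;
};
```

In this sketch, a larger threshold produces fewer, longer captions, while 0 keeps every word separate.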
Return value
An object of the following shape:
```ts
type Caption = {
  text: string;
  startInSeconds: number;
};

type ReturnValue = {
  captions: Caption[];
};
```
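Each caption carries only a start time, so when a caption ends is implied by the start of the next one. The sketch below shows one way to look up the caption for a given time. The `findCaptionAt` helper is hypothetical and not part of the package, and it assumes that the captions are sorted by `startInSeconds` in ascending order.

```typescript
type Caption = {
  text: string;
  startInSeconds: number;
};

// Return the last caption that has started at or before `timeInSeconds`,
// or null if no caption has started yet.
// Assumes `captions` is sorted by startInSeconds in ascending order.
const findCaptionAt = (
  captions: Caption[],
  timeInSeconds: number,
): Caption | null => {
  let active: Caption | null = null;
  for (const caption of captions) {
    if (caption.startInSeconds > timeInSeconds) {
      break;
    }
    active = caption;
  }
  return active;
};
```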
Suggested usage
The following example shows how to render word-by-word captions in a Remotion project from the data structure produced by convertToCaptions().
See our TikTok template for a full reference implementation.
Note: @remotion/install-whisper-cpp is a Node.js API and cannot be imported on the frontend.
This example imports only the TypeScript type.
```tsx
import React from "react";
import type { Caption } from "@remotion/install-whisper-cpp";
import { Sequence, useVideoConfig } from "remotion";

// `Subtitle` is assumed to be your own component that renders a single caption.
const Captions: React.FC<{
  subtitles: Caption[];
}> = ({ subtitles }) => {
  const { fps } = useVideoConfig();

  return (
    <>
      {subtitles.map((subtitle, index) => {
        const nextSubtitle = subtitles[index + 1] ?? null;
        const subtitleStartFrame = subtitle.startInSeconds * fps;
        // Show each caption until the next one starts, but for at most one second.
        const subtitleEndFrame = Math.min(
          nextSubtitle ? nextSubtitle.startInSeconds * fps : Infinity,
          subtitleStartFrame + fps,
        );

        // The key belongs on the outermost element returned from map().
        return (
          <Sequence
            key={index}
            from={subtitleStartFrame}
            durationInFrames={subtitleEndFrame - subtitleStartFrame}
          >
            <Subtitle text={subtitle.text} />
          </Sequence>
        );
      })}
    </>
  );
};
```
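The timing logic of the component above can also be extracted into a plain function, which makes it easier to test without rendering anything. This is a sketch under a few assumptions: `toFrameRanges` and `FrameRange` are hypothetical names, frame numbers are rounded to integers, and captions that would end up with a non-positive duration (for example two captions with the same start time) are dropped. None of this is prescribed by the package.

```typescript
type Caption = {
  text: string;
  startInSeconds: number;
};

type FrameRange = {
  text: string;
  from: number;
  durationInFrames: number;
};

// Convert captions to frame ranges: each caption is shown until the next
// one starts, but for at most one second (`fps` frames).
const toFrameRanges = (captions: Caption[], fps: number): FrameRange[] => {
  return captions
    .map((caption, index) => {
      const next = captions[index + 1];
      const from = Math.round(caption.startInSeconds * fps);
      const nextFrom = next ? Math.round(next.startInSeconds * fps) : Infinity;
      const end = Math.min(nextFrom, from + fps);
      return { text: caption.text, from, durationInFrames: end - from };
    })
    // Skip captions that would be visible for zero frames.
    .filter((range) => range.durationInFrames > 0);
};
```

Each resulting range maps directly to the `from` and `durationInFrames` props of a `<Sequence>`.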