C#•5d ago
Silme94

Help with Whisper.net (AI Voice detection)

I'm trying to make a subtitle generator using Whisper and ffmpeg, but I noticed that when nobody is talking in the video, the subtitle shows up too early. Is there a way to fix this using Whisper or anything else? Thanks. Watch the video to understand the problem, even though it's not in English: https://streamable.com/tkzeah
27 Replies
Silme94
Silme94OP•5d ago
Here is the code:
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using Whisper.net;
using Whisper.net.Ggml;

string inputVideo = "copy.mp4";
string outputVideo = "output.mp4";
string audioFile = "temp.wav";
string assFile = "subtitles.ass";
string model = "ggml-large-v3-turbo-q8_0.bin";

using var whisperFactory = WhisperFactory.FromPath(model);

using var processor = whisperFactory.CreateBuilder()
    .WithLanguage("fr")
    .Build();

// Extract 16 kHz mono WAV audio for Whisper (paths quoted in case they contain spaces).
RunFFmpeg($"-i \"{inputVideo}\" -ar 16000 -ac 1 -f wav \"{audioFile}\"");

using var fileStream = File.OpenRead(audioFile);
using var writer = new StreamWriter(assFile, false, new UTF8Encoding(false));

writer.WriteLine("[Script Info]");
// Format for .ass file ...

// Write one Dialogue line per segment, using Whisper's own start/end times.
await foreach (var result in processor.ProcessAsync(fileStream))
{
    string text = result.Text.Trim();

    if (string.IsNullOrWhiteSpace(text))
        continue;

    string start = FormatAssTimestamp(result.Start);
    string end = FormatAssTimestamp(result.End);

    writer.WriteLine($"Dialogue: 0,{start},{end},Default,,0,0,0,,{text}");
}

// Flush and close the subtitle file before ffmpeg reads it.
writer.Close();

// Burn the subtitles into the video and copy the audio stream unchanged.
RunFFmpeg($"-i \"{inputVideo}\" -vf \"ass={assFile}\" -c:a copy \"{outputVideo}\"");

// ASS timestamps are H:MM:SS.cc (centiseconds); TotalHours avoids wrapping at 24 h.
static string FormatAssTimestamp(TimeSpan ts)
    => $"{(int)ts.TotalHours}:{ts.Minutes:D2}:{ts.Seconds:D2}.{ts.Milliseconds / 10:D2}";


void RunFFmpeg(string argument)
{
    using var proc = new Process()
    {
        StartInfo = new ProcessStartInfo()
        {
            FileName = "ffmpeg",
            Arguments = $"-y {argument}",
            UseShellExecute = false,
            CreateNoWindow = true,
            RedirectStandardOutput = true,
            RedirectStandardError = true
        }
    };

    proc.Start();

    // Read output ...
    // (The redirected streams must be drained here, otherwise ffmpeg can block on a full pipe.)

    proc.WaitForExit();
}
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
I don't understand, what's wrong?
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
Yes, but this isn't my video, I got it from a friend.
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
💀 Bro, there is nothing wrong, but if you want I'll remove it... Also, if you know something about the issue, please help me.
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
"if she makes a mistake she owes you a blowjob"
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
I don't see anything wrong with it. Personally I just take it as a joke. My friend said it was for a TikTok.
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
So, uh, do you know how to fix it?
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
Dude, I swear I'm not doing anything illegal or bad.
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
This isn't my video. You just don't understand French; the video is just a quiz...
Unknown User
Unknown User•5d ago
Message Not Public
Silme94
Silme94OP•5d ago
...with subtitles that show too early. Please, bro, I swear there is nothing wrong or sexual in it.
Lex Li
Lex Li•5d ago
The actual technique is not to depend only on what Whisper provides. The correct timeline should be constructed from different layers of data collected through several other approaches, and human review/editing is still required if you want high-quality results in the end.
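One possible extra "layer" is silence detection. The sketch below is only an illustration, not the method used by anyone in this thread: it assumes you have captured the stderr text of something like `ffmpeg -i temp.wav -af silencedetect=noise=-30dB:d=0.5 -f null -` and that a Whisper segment which begins inside a detected silence should start where that silence ends. `ParseSilences`, `SnapStart`, the 300 ms tolerance and the sample log lines are all hypothetical names and values chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: turn silencedetect log text into (Start, End) intervals.
static List<(TimeSpan Start, TimeSpan End)> ParseSilences(string ffmpegLog)
{
    var result = new List<(TimeSpan Start, TimeSpan End)>();
    double? pendingStart = null;

    // Matches "silence_start: 1.23" and "silence_end: 4.56" (not "silence_duration").
    foreach (Match m in Regex.Matches(ffmpegLog, @"silence_(start|end):\s*(-?\d+(?:\.\d+)?)"))
    {
        double seconds = double.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);
        if (m.Groups[1].Value == "start")
        {
            pendingStart = seconds;
        }
        else if (pendingStart is double s)
        {
            // Clamp slightly negative start values to zero.
            result.Add((TimeSpan.FromSeconds(Math.Max(0, s)), TimeSpan.FromSeconds(seconds)));
            pendingStart = null;
        }
    }
    return result;
}

// Hypothetical helper: if a silence covers the beginning of the segment
// (allowing a small tolerance) and ends before the segment does,
// move the segment start to the end of that silence.
static TimeSpan SnapStart(TimeSpan start, TimeSpan end,
                          List<(TimeSpan Start, TimeSpan End)> silences, TimeSpan tolerance)
{
    foreach (var s in silences)
    {
        if (s.Start <= start + tolerance && s.End > start && s.End < end)
            return s.End;
    }
    return start;
}

// Illustrative log lines only; real ffmpeg output contains more text around them.
string sampleLog =
    "[silencedetect @ 0x1] silence_start: 0\n" +
    "[silencedetect @ 0x1] silence_end: 2.5 | silence_duration: 2.5\n";

var detected = ParseSilences(sampleLog);
TimeSpan snapped = SnapStart(TimeSpan.Zero, TimeSpan.FromSeconds(4), detected,
                             TimeSpan.FromMilliseconds(300));
Console.WriteLine(snapped);
```

In the code posted earlier in the thread, the snapped value would replace `result.Start` before it is passed to `FormatAssTimestamp`. The noise threshold and minimum silence duration need tuning per video, and background music can hide pauses entirely, so the output still deserves a manual check.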
Silme94
Silme94OP•5d ago
I see. But is there still any way to make the results better, so the subtitles appear on time and not too early?
Lex Li
Lex Li•5d ago
Yes, AI/algorithms can give better results. My team has an in-house enterprise solution developed in this field, but unfortunately even with that level of detail, AI/algorithms can fail in edge cases.
Silme94
Silme94OP•5d ago
I have a question: do you think it may be because of the model? The model I'm using is around 800 MB. If I get a larger one, will it be more precise and stop putting subtitles too early?
Lex Li
Lex Li•5d ago
It would be a trade-off when you compare a small local model to cloud-based commercial services (and the modern models behind them). However, the actual raw material varies, and none of them is perfect right now at handling all cases (pauses with noise, music, etc.), so you need to expect a certain amount of human editing, as I mentioned earlier. We are also developing our own editing tools to further minimize human error and effort during the process.
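To make the "pauses with noise, music" caveat concrete, here is a rough sketch of a naive energy-based onset check, assuming 16-bit mono PCM samples already loaded into memory (for example from the 16 kHz WAV that ffmpeg produces). `FindSpeechOnset`, the 0.02 threshold and the 20 ms frame size are hypothetical choices for illustration, not part of Whisper.net or ffmpeg.

```csharp
using System;

// Hypothetical helper: returns the start time of the first frame inside
// [windowStart, windowEnd) whose RMS level (0..1 of full scale) exceeds
// the threshold, or null if no frame does.
static TimeSpan? FindSpeechOnset(short[] samples, int sampleRate,
                                 TimeSpan windowStart, TimeSpan windowEnd,
                                 double threshold = 0.02, int frameMs = 20)
{
    int frameLen = sampleRate * frameMs / 1000;
    if (frameLen <= 0)
        throw new ArgumentException("Sample rate too low for the chosen frame size.");

    int first = Math.Max(0, (int)(windowStart.TotalSeconds * sampleRate));
    int last = Math.Min(samples.Length, (int)(windowEnd.TotalSeconds * sampleRate));

    for (int pos = first; pos + frameLen <= last; pos += frameLen)
    {
        double sumSquares = 0;
        for (int i = pos; i < pos + frameLen; i++)
        {
            double v = samples[i] / 32768.0; // normalise to -1..1
            sumSquares += v * v;
        }

        if (Math.Sqrt(sumSquares / frameLen) > threshold)
            return TimeSpan.FromSeconds((double)pos / sampleRate);
    }
    return null;
}

// Synthetic test signal: one second of silence, then one second of a 440 Hz tone.
int rate = 16000;
short[] pcm = new short[2 * rate];
for (int n = rate; n < pcm.Length; n++)
    pcm[n] = (short)(8000 * Math.Sin(2 * Math.PI * 440 * n / rate));

TimeSpan? onset = FindSpeechOnset(pcm, rate, TimeSpan.Zero, TimeSpan.FromSeconds(2));
Console.WriteLine(onset);
```

A check like this reacts to any loud sound, so music or background noise during a pause will trigger it just as speech would. That is one reason a single signal is not enough and some human editing is still expected.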
Silme94
Silme94OP•5d ago
I have a question about your video editor app. Are you going to develop it using ffmpeg or another library?
Lex Li
Lex Li•5d ago
It's a commercial product for our internal use right now, so I'm not able to share many more details. You can use whatever technique is feasible, as there are just too many options.
Silme94
Silme94OP•5d ago
I'm just asking because I'm curious how popular editing apps like CapCut were built. It would be very hard to do such a thing from scratch without a library.
