C#•5d ago

Help with Whisper.net (AI Voice detection)

I'm trying to make a subtitle generator using Whisper and ffmpeg, but I noticed that when nobody is talking in the video, the subtitle shows up too early. Is there a way to fix this using Whisper or anything else? Thanks. Watch the video to see what I mean, even though it's not in English: https://streamable.com/tkzeah


27 Replies

Silme94OP•5d ago

Here is the code:

using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using Whisper.net;
using Whisper.net.Ggml;

string inputVideo = "copy.mp4";
string outputVideo = "output.mp4";
string audioFile = "temp.wav";
string assFile = "subtitles.ass";
string model = "ggml-large-v3-turbo-q8_0.bin";

using var whisperFactory = WhisperFactory.FromPath(model);

using var processor = whisperFactory.CreateBuilder()
    .WithLanguage("fr")
    .Build();

RunFFmpeg($"-i {inputVideo} -ar 16000 -ac 1 -f wav {audioFile}");

using var fileStream = File.OpenRead(audioFile);
using var writer = new StreamWriter(assFile, false, new UTF8Encoding(false));

writer.WriteLine("[Script Info]");
// Format for .ass file ...

await foreach (var result in processor.ProcessAsync(fileStream))
{
    string text = result.Text.Trim();

    if (string.IsNullOrWhiteSpace(text))
        continue;

    string start = FormatAssTimestamp(result.Start);
    string end = FormatAssTimestamp(result.End);

    writer.WriteLine($"Dialogue: 0,{start},{end},Default,,0,0,0,,{text}");
}

writer.Close();  

RunFFmpeg($"-i {inputVideo} -vf \"ass={assFile}\" -c:a copy {outputVideo}");

// ASS timestamps are H:MM:SS.cc (centiseconds); TotalHours avoids wrapping past 24 h.
static string FormatAssTimestamp(TimeSpan ts)
    => $"{(int)ts.TotalHours}:{ts.Minutes:D2}:{ts.Seconds:D2}.{ts.Milliseconds / 10:D2}";


void RunFFmpeg(string argument)
{
    Process proc = new Process()
    {
        StartInfo = new ProcessStartInfo()
        {
            FileName = "ffmpeg",
            Arguments = $"-y {argument}",
            UseShellExecute = false,
            CreateNoWindow = true,
            RedirectStandardOutput = true,
            RedirectStandardError = true
        }
    };

    proc.Start();

// Read output ...

    proc.WaitForExit();
}


Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

I don't understand, what's wrong?

Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

Yes, but this isn't my video, I got it from a friend.

Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

💀 Bro, there is nothing wrong, but if you want I'll remove it... and also, if you know something about the issue, please help me.

Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

"if she makes a mistake she owes you a blowjob"

Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

I don't see anything wrong with it. Personally I just take it as a joke. My friend said it was for a TikTok.

Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

So uh, do you know how to fix it?

Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

Dude, I swear I'm not doing anything illegal or anything bad.

Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

This isn't my video. It's just that you don't understand French; the video is just a quiz...

Unknown User•5d ago

Message Not Public

Silme94OP•5d ago

...with subtitles that show too early. Please, bro, I swear there is nothing wrong or anything sexual.

Lex Li•5d ago

The actual technique is not to depend only on what Whisper provides. The correct timeline should be constructed from several layers of data collected by other approaches, and human review/editing is still required if you want high-quality results in the end.
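
One possible extra "layer", offered only as a sketch and not as the method Lex Li's team uses: ffmpeg's `silencedetect` audio filter reports silent intervals on stderr (for example via `ffmpeg -i temp.wav -af silencedetect=noise=-30dB:d=0.5 -f null -`), and a Whisper segment that begins inside one of those intervals can be delayed to the end of the silence. The helper names `ParseSilences` and `SnapStart`, the `-30dB` threshold, and the `0.5` second minimum duration are all assumptions for illustration and would need tuning per video.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: parse the stderr text written by ffmpeg's silencedetect
// filter into (start, end) silent intervals.
static List<(TimeSpan Start, TimeSpan End)> ParseSilences(string ffmpegStderr)
{
    var silences = new List<(TimeSpan Start, TimeSpan End)>();
    double? pendingStart = null;

    // Matches "silence_start: 3.2" and "silence_end: 7.85"; "silence_duration" is ignored.
    var pattern = @"silence_(start|end):\s*(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)";
    foreach (Match m in Regex.Matches(ffmpegStderr, pattern))
    {
        double seconds = double.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);
        if (m.Groups[1].Value == "start")
        {
            pendingStart = seconds;
        }
        else if (pendingStart is double s)
        {
            // Clamp slightly negative start values to zero.
            silences.Add((TimeSpan.FromSeconds(Math.Max(0, s)), TimeSpan.FromSeconds(seconds)));
            pendingStart = null;
        }
    }
    return silences;
}

// Hypothetical helper: if a segment starts inside a silent interval that ends
// before the segment does, move the start to the end of that silence.
static TimeSpan SnapStart(TimeSpan start, TimeSpan end, List<(TimeSpan Start, TimeSpan End)> silences)
{
    foreach (var (silStart, silEnd) in silences)
    {
        if (start >= silStart && start < silEnd && silEnd < end)
            return silEnd;
    }
    return start;
}
```

In the transcription loop from the original post, this would amount to formatting `SnapStart(result.Start, result.End, silences)` instead of `result.Start`, assuming the stderr of the `silencedetect` run has been captured into a string first. Background music or noise during the pause will not register as silence, so this only helps with genuinely quiet gaps.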

Silme94OP•5d ago

I see. But are there still any ways to make the results better, so the subtitles appear on time and not too early?

Lex Li•5d ago

Yes, AI/algorithms can give better results. My team has an in-house enterprise solution developed in this field, but unfortunately even at that level of detail AI/algorithms can fail in edge cases.

Silme94OP•5d ago

I have a question: do you think it may be because of the model? The model I'm using is around 800 MB. If I get a larger one, will it be more precise and stop putting subtitles too early?

Lex Li•5d ago

It would be a trade-off when you compare a small local model to cloud-based commercial services (and the modern models behind them). However, the actual raw materials vary, and none of them is perfect right now at handling all cases (pauses with noise, music, etc.), so you need to expect a certain amount of human editing, as I mentioned earlier. We are also developing our own editing tools to further minimize human errors/effort during the process.
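
As a rough illustration of why noisy pauses are hard, here is a minimal energy-based onset check, assuming the 16 kHz mono WAV from the original post has already been decoded into 16-bit PCM samples (WAV header parsing is left out). The function name `FindSpeechOnset`, the `0.02` RMS threshold, and the 20 ms frame size are assumptions, not values from the thread. It moves a segment's start to the first frame that is loud enough, which works for quiet gaps but will fire on music or background noise just as readily as on speech.

```csharp
using System;

// Hypothetical helper: return the time of the first frame between `start` and `end`
// whose RMS level (on a 0..1 scale) exceeds `threshold`, or `start` if none does.
// `samples` is assumed to be 16-bit mono PCM at `sampleRate` Hz.
static TimeSpan FindSpeechOnset(short[] samples, int sampleRate, TimeSpan start, TimeSpan end,
                                double threshold = 0.02, int frameMs = 20)
{
    int frameLen = Math.Max(1, sampleRate * frameMs / 1000);
    int first = Math.Max(0, (int)(start.TotalSeconds * sampleRate));
    int last = Math.Min(samples.Length, (int)(end.TotalSeconds * sampleRate));

    for (int pos = first; pos + frameLen <= last; pos += frameLen)
    {
        // Root-mean-square level of this frame.
        double sumSquares = 0;
        for (int i = pos; i < pos + frameLen; i++)
        {
            double x = samples[i] / 32768.0;
            sumSquares += x * x;
        }
        double rms = Math.Sqrt(sumSquares / frameLen);

        if (rms > threshold)
            return TimeSpan.FromSeconds((double)pos / sampleRate);
    }
    return start; // nothing loud enough found: keep Whisper's timestamp
}
```

The result is accurate only to one frame (20 ms here), which is finer than the centisecond resolution of ASS timestamps needs in practice.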

Silme94OP•5d ago

I have a question about your video editor app. Are you going to develop it using ffmpeg or another library?

Lex Li•5d ago

It's a commercial product for our internal use right now, so I'm not able to share many more details. You can use whatever technique is feasible, as there are just too many options.

Silme94OP•5d ago

I'm just asking because I'm curious how popular editing apps like CapCut were built. It would be very hard to do such a thing from scratch without a library.
