Procedure

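The code below takes the path of a video file and an output path for the subtitle file, then extracts the subtitles. It uses the ffmpeg-python package (imported as ffmpeg). Replace the placeholder paths with real ones before running it.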

import ffmpeg

def get_subtitle_stream_index(video_path):
    try:
        probe = ffmpeg.probe(video_path)
        for stream in probe['streams']:
            if stream['codec_type'] == 'subtitle':
                return stream['index']
        print("No subtitle stream was found.")
        return None
    except ffmpeg.Error as e:
        print(f"An error occurred while reading stream information: {e}")
        return None

def extract_subtitles_from_video(video_path, output_subtitle_path, subtitle_stream_index):
    try:
        # Select the subtitle stream with -map 0:<index> and write it out as SRT
        ffmpeg.input(video_path).output(output_subtitle_path, format='srt', map=f'0:{subtitle_stream_index}').run(overwrite_output=True)
        print(f"Subtitles extracted successfully. Saved to: {output_subtitle_path}")
    except ffmpeg.Error as e:
        print(f"An error occurred while extracting the subtitles: {e}")

# Usage example
video_file_path = '/path/to/your/video.mp4'  # Path to the video file
output_srt_path = '/path/to/output/subtitles.srt'  # Path for the extracted subtitle file

# Get the index of the first subtitle stream
subtitle_stream_index = get_subtitle_stream_index(video_file_path)

# Extract the subtitles (only if a subtitle stream exists)
if subtitle_stream_index is not None:
    extract_subtitles_from_video(video_file_path, output_srt_path, subtitle_stream_index)

This procedure extracts the subtitles from the video.
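The lookup above returns the first subtitle stream of any kind. As far as I know, ffmpeg can only convert text-based subtitles to SRT, so bitmap formats such as PGS or DVD subtitles would fail at the extraction step. The sketch below filters a probe result down to text-based streams and reports each stream's language tag. The function name find_text_subtitle_streams, the codec set, and the sample probe dictionary are illustrative assumptions, not part of the original code.

```python
# Codec names ffprobe typically reports for text-based subtitle formats
# (an assumed, non-exhaustive list).
TEXT_SUBTITLE_CODECS = {'subrip', 'ass', 'ssa', 'mov_text', 'webvtt', 'text'}

def find_text_subtitle_streams(probe):
    """Return (index, codec_name, language) for each text-based subtitle stream."""
    results = []
    for stream in probe.get('streams', []):
        if stream.get('codec_type') != 'subtitle':
            continue
        codec = stream.get('codec_name')
        if codec not in TEXT_SUBTITLE_CODECS:
            continue  # Skip bitmap subtitles; they cannot be written as SRT directly
        language = stream.get('tags', {}).get('language', 'und')
        results.append((stream['index'], codec, language))
    return results

# Hypothetical probe result shaped like the dictionary ffmpeg.probe() returns.
sample_probe = {
    'streams': [
        {'index': 0, 'codec_type': 'video', 'codec_name': 'h264'},
        {'index': 1, 'codec_type': 'audio', 'codec_name': 'aac'},
        {'index': 2, 'codec_type': 'subtitle', 'codec_name': 'subrip',
         'tags': {'language': 'jpn'}},
        {'index': 3, 'codec_type': 'subtitle', 'codec_name': 'hdmv_pgs_subtitle',
         'tags': {'language': 'eng'}},
    ]
}

text_streams = find_text_subtitle_streams(sample_probe)
print(text_streams)
```

With a real file, the dictionary returned by ffmpeg.probe(video_path) could be passed in place of sample_probe, and the index of the chosen stream handed to extract_subtitles_from_video.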

Cleaning up the extracted subtitles

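The following function removes the timecodes and section numbers from the subtitle file contents and reshapes what remains into natural Japanese text.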

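import re

def clean_subtitles_advanced(subtitles_text):
    # Use a regular expression to remove section numbers and timecodes
    cleaned_text = re.sub(r'\d+\n\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\n', '', subtitles_text)

    # Replace consecutive newlines with a single newline
    cleaned_text = re.sub(r'\n+', '\n', cleaned_text)
    return cleaned_text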
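To make the effect of the timecode pattern concrete, here is a small check on a made-up two-cue SRT sample. The sample text and the variable names are illustrative assumptions. The pattern is the same one used in the function, followed by the newline collapse its comment describes.

```python
import re

# Hypothetical two-cue SRT sample (not taken from a real file).
sample_srt = (
    "1\n"
    "00:00:01,000 --> 00:00:03,000\n"
    "こんにちは。\n"
    "\n"
    "2\n"
    "00:00:03,500 --> 00:00:05,000\n"
    "今日は字幕を抽出します。\n"
)

# A cue header is the section number line followed by the timecode line.
SRT_HEADER = re.compile(r'\d+\n\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\n')

# Step 1: drop every cue header, leaving only the subtitle text.
without_headers = SRT_HEADER.sub('', sample_srt)

# Step 2: collapse runs of newlines and trim the ends.
collapsed = re.sub(r'\n+', '\n', without_headers).strip()
print(collapsed)
```

Each cue ends up as one line of plain text. Note that the pattern expects Unix line endings, so a file saved with Windows line endings would need normalizing first.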
Cleaning and saving the entire subtitle file

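# Read the extracted subtitle file (assumes output_srt_path from the extraction step)
with open(output_srt_path, encoding='utf-8') as file:
    subtitles = file.read()

# Clean the full text
cleaned_full_subtitles = clean_subtitles_advanced(subtitles)

# Save the cleaned subtitle data to a file
cleaned_subtitle_file_path = '/mnt/data/cleaned_subtitles.txt'
with open(cleaned_subtitle_file_path, 'w', encoding='utf-8') as file:
    file.write(cleaned_full_subtitles)

print(cleaned_subtitle_file_path)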

Cleaning the subtitle contents again

cleaned_subtitles_advanced = clean_subtitles_advanced(subtitles)
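If a second pass is needed because the first result still contains stray markup or broken lines, a somewhat more tolerant cleaner might look like the sketch below. It is not the original function: the name clean_srt_text is hypothetical, and the choices to normalize Windows line endings, strip a leading byte-order mark, remove simple inline tags such as <i>, and join lines without spaces (reasonable for Japanese, not for space-separated languages) are all assumptions.

```python
import re

def clean_srt_text(srt_text):
    """Return the subtitle text of an SRT string as one continuous string."""
    # Normalize Windows line endings and drop a leading byte-order mark.
    text = srt_text.replace('\r\n', '\n').lstrip('\ufeff')
    # Remove each section number line together with its timecode line.
    text = re.sub(
        r'^\d+\n\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}[^\n]*\n',
        '',
        text,
        flags=re.MULTILINE,
    )
    # Remove simple inline markup such as <i>...</i> or <font ...>.
    text = re.sub(r'</?[A-Za-z][^>]*>', '', text)
    # Join the non-empty lines; Japanese text needs no separator between them.
    lines = [line.strip() for line in text.split('\n')]
    return ''.join(line for line in lines if line)

# Hypothetical sample with a BOM, CRLF line endings, and an italic tag.
messy_srt = (
    "\ufeff1\r\n"
    "00:00:01,000 --> 00:00:03,000\r\n"
    "<i>こんにちは、</i>\r\n"
    "今日は\r\n"
    "\r\n"
    "2\r\n"
    "00:00:03,500 --> 00:00:05,000\r\n"
    "字幕を抽出します。\r\n"
)

cleaned_example = clean_srt_text(messy_srt)
print(cleaned_example)
```

Joining everything into a single string loses the cue boundaries, so this variant only suits cases where the goal is running prose and not a line-per-cue transcript.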
