A Transformers ASR sample for Japanese exists: converting kotoba-whisper-v1.0 to ONNX

I had been trying the following Transformers ASR sample,
Automatic speech recognition
when it turned out that Japanese support already exists.

Searching Google for "transformer asr japanese" turned up results.
I converted kotoba-tech/kotoba-whisper-v1.0 to ONNX and ran it.

1. If you want to do the transfer learning (fine-tuning) yourself:
August 2023, Fine-Tuning ASR Models for Japanese and Vietnamese by using OpenAI Whisper and Azure Speech Studio

2. If you want to use a model as is:
kotoba-tech/kotoba-whisper-v1.0

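One thing I recently learned from the sample there is that the line

torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

seems to need care depending on the GPU you use.
My GPU is a GTX 1070, and on it half precision (bfloat16/float16) does not seem to be usable in practice.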

#torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
torch_dtype = torch.float32
This is the setting to use on that GPU.
With it, 3.8 sec/f dropped to 1.18 sec/f.
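
As a rough illustration of choosing the dtype per GPU instead of hard-coding it, here is a minimal sketch. The function name choose_dtype_name and the thresholds are my own assumptions (a rule of thumb based on CUDA compute capability: Ampere, 8.0 and later, has native bfloat16; Volta/Turing, 7.x, has fast float16; the GTX 1070 is Pascal, 6.1). It is not taken from the kotoba-whisper sample, and actual speed should still be measured.

```python
def choose_dtype_name(cuda_available, capability=None):
    """Suggest a torch dtype name from the CUDA compute capability.

    Hypothetical helper: the thresholds are a rule of thumb, not a guarantee.
    """
    if not cuda_available or capability is None:
        return "float32"
    major, _minor = capability
    if major >= 8:
        return "bfloat16"  # Ampere and newer: native bfloat16
    if major == 7:
        return "float16"   # Volta/Turing: fast float16
    return "float32"       # older GPUs such as the GTX 1070 (6.1)


# With PyTorch installed, the result could be applied like this:
#   import torch
#   ok = torch.cuda.is_available()
#   cap = torch.cuda.get_device_capability(0) if ok else None
#   torch_dtype = getattr(torch, choose_dtype_name(ok, cap))
print(choose_dtype_name(True, (6, 1)))  # GTX 1070 -> float32
```

On a GTX 1070 this reproduces the float32 choice made above; on other GPUs it is only a starting point for a timing comparison.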

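Related articles (in Japanese):
"日本語音声認識に特化したWhisperである kotoba-whisper-v1.0を早速試してみた" (Trying out kotoba-whisper-v1.0, a Whisper specialized for Japanese speech recognition)
"Kotoba-Whisper入門 - 日本語音声認識の新しい選択肢" (Introduction to Kotoba-Whisper: a new option for Japanese speech recognition)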

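The bottom line: with kotoba-whisper-v1.0 on a GTX 1070, using torch_dtype = torch.float32,
real-time microphone input runs reasonably comfortably.
I have been playing with it by sending the audio of YouTube ghost-story videos to the speakers and picking it up with the microphone.
Well, the occasional mistake is part of the charm.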

2. Trying to convert kotoba-whisper-v1.0 to ONNX.
Following "Export to ONNX", I try the conversion to ONNX.

$ optimum-cli export onnx -h
lists the available --task values.
Should I use --task automatic-speech-recognition?

$ optimum-cli export onnx --model kotoba-tech/kotoba-whisper-v1.0 --task automatic-speech-recognition kotoba-whisper-v1.0_onnx/

$ optimum-cli export onnx --model kotoba-tech/kotoba-whisper-v1.0 --task automatic-speech-recognition kotoba-whisper-v1.0_onnx/
Framework not specified. Using pt to export the model.
The task `automatic-speech-recognition` was manually specified, and past key values will not be reused in the decoding. if needed, please pass `--task automatic-speech-recognition-with-past` to export using the past key values.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'max_length': 448, 'begin_suppress_tokens': [220, 50257]}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

***** Exporting submodel 1/2: WhisperEncoder *****
Using framework PyTorch: 2.3.1+cu121
Overriding 1 configuration item(s)
    - use_cache -> False
/home/nishi/torch_env/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1164: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_features.shape[-1] != expected_seq_length:
/home/nishi/torch_env/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:339: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/home/nishi/torch_env/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:378: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
Saving external data to one file...

***** Exporting submodel 2/2: WhisperForConditionalGeneration *****
Using framework PyTorch: 2.3.1+cu121
Overriding 1 configuration item(s)
    - use_cache -> False
/home/nishi/torch_env/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:86: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_shape[-1] > 1 or self.sliding_window is not None:
/home/nishi/torch_env/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:162: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if past_key_values_length > 0:
/home/nishi/torch_env/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:346: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):
Post-processing the exported models...
Deduplicating shared (tied) weights...
Found different candidate ONNX initializers (likely duplicate) for the tied weights:
    model.decoder.embed_tokens.weight: {'model.decoder.embed_tokens.weight'}
    proj_out.weight: {'onnx::MatMul_1292'}
Removing duplicate initializer onnx::MatMul_1292...
Validating ONNX model kotoba-whisper-v1.0_onnx/encoder_model.onnx...
    -[✓] ONNX model output names match reference model (last_hidden_state)
    - Validating ONNX Model output "last_hidden_state":
        -[✓] (2, 1500, 1280) matches (2, 1500, 1280)
        -[x] values not close enough, max diff: 0.011976242065429688 (atol: 0.001)
Validating ONNX model kotoba-whisper-v1.0_onnx/decoder_model.onnx...
    -[✓] ONNX model output names match reference model (logits)
    - Validating ONNX Model output "logits":
        -[✓] (2, 16, 51866) matches (2, 16, 51866)
        -[✓] all values close (atol: 0.001)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 0.001:
- last_hidden_state: max diff = 0.011976242065429688.
 The exported model was saved at: kotoba-whisper-v1.0_onnx

Did the conversion actually succeed?
The last part worries me:
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model
and the ONNX exported model is not within the set tolerance 0.001:
- last_hidden_state: max diff = 0.011976242065429688.
The exported model was saved at: kotoba-whisper-v1.0_onnx
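
To make the warning above concrete, here is a minimal sketch of the kind of comparison it reports: the largest absolute difference between reference and exported outputs, checked against an absolute tolerance (atol). The helper names and the sample values are my own; only the two numbers 0.011976242065429688 and 0.001 come from the log, and the actual check inside Optimum may differ in detail.

```python
def max_abs_diff(reference, exported):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(r - e) for r, e in zip(reference, exported))


def within_tolerance(reference, exported, atol=1e-3):
    """True when every element differs by at most atol."""
    return max_abs_diff(reference, exported) <= atol


# Made-up values; one element is off by the max diff reported in the log.
reference = [0.5, -1.25, 2.0]
exported = [0.5004, -1.25, 2.0 + 0.011976242065429688]

print(max_abs_diff(reference, exported))
print(within_tolerance(reference, exported, atol=0.001))
```

In other words, the export itself finished; the warning only says the encoder output deviates from the PyTorch output by about 0.012 at most, which is above the default tolerance of 0.001.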

At any rate, the output seems to have been written to the kotoba-whisper-v1.0_onnx/ directory.

However, two files came out:
encoder_model.onnx
decoder_model.onnx

How are they supposed to be used? I have no idea.

Is "How can I use the ONNX model?" a useful reference?

Optimum Inference with ONNX Runtime

from optimum.onnxruntime import ORTModelForSeq2SeqLM
In this import, is it enough to pick the right ORTModelForXxx class?
Which one does Transformers ASR use?
Is ORTModelForSeq2SeqLM the right one?

I tried the following code.
onnx-pred.py
"""
https://discuss.huggingface.co/t/how-can-i-use-the-onnx-model/70923
"""
from transformers import AutoTokenizer, pipeline, PretrainedConfig
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import onnxruntime

# Load encoder model
encoder_session = onnxruntime.InferenceSession('kotoba-whisper-v1.0_onnx/encoder_model.onnx')
# Load decoder model
decoder_session = onnxruntime.InferenceSession('kotoba-whisper-v1.0_onnx/decoder_model.onnx')

model_id = "kotoba-whisper-v1.0_onnx/"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = PretrainedConfig.from_json_file('kotoba-whisper-v1.0_onnx/config.json')

model = ORTModelForSeq2SeqLM(
    config=config,
    onnx_paths=['kotoba-whisper-v1.0_onnx/decoder_model.onnx','kotoba-whisper-v1.0_onnx/encoder_model.onnx'],
    encoder_session=encoder_session,
    decoder_session=decoder_session,
    model_save_dir='kotoba-whisper-v1.0_onnx',
    use_cache=False,
)

onnx_translation = pipeline("translation_src_to_target", model=model, tokenizer=tokenizer)

#text = 'the text to perform your translation task'
#result = onnx_translation(text, max_length = 10000)
#print(result)

I ran it.
$ python onnx-pred.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
So all that is left is to feed in human speech?
As for the input data,
maybe kotoba-whisper-v1.0_onnx/config.json is a useful reference:
"encoder_ffn_dim": 5120,
"max_length": 448,

Something like these values?
Do I have to convert the audio into the model's input format myself, or will pipeline() do the conversion for me?

It turns out that ORTModelForSpeechSeq2Seq is the class to use,
not ORTModelForSeq2SeqLM.
In the end, the following code worked.

onnx-pred.py
"""
onnx-pred.py

https://discuss.huggingface.co/t/how-can-i-use-the-onnx-model/70923
https://huggingface.co/transformers/v4.11.3/_modules/transformers/pipelines/automatic_speech_recognition.html
"""
from transformers import AutoTokenizer, pipeline, PretrainedConfig
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
import onnxruntime
from transformers import WhisperFeatureExtractor

# Load encoder model
encoder_session = onnxruntime.InferenceSession('kotoba-whisper-v1.0_onnx/encoder_model.onnx')
# Load decoder model
decoder_session = onnxruntime.InferenceSession('kotoba-whisper-v1.0_onnx/decoder_model.onnx')

model_id = "kotoba-whisper-v1.0_onnx/"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = PretrainedConfig.from_json_file('kotoba-whisper-v1.0_onnx/config.json')

#generate_kwargs = {"language": "japanese", "task": "transcribe"}

feature_extractor = WhisperFeatureExtractor(
    chunk_length=30,
    feature_size=128,
    hop_length=160,
    n_fft=400,
    n_samples=480000,
    nb_max_frames=3000,
    padding_side="right",
    padding_value=0.0,
    processor_class="WhisperProcessor",
    return_attention_mask=False,
    sampling_rate=16000
)

#model = ORTModelForSeq2SeqLM(
model = ORTModelForSpeechSeq2Seq(
    config=config,
    onnx_paths=['kotoba-whisper-v1.0_onnx/decoder_model.onnx','kotoba-whisper-v1.0_onnx/encoder_model.onnx'],
    encoder_session=encoder_session,
    decoder_session=decoder_session,
    model_save_dir='kotoba-whisper-v1.0_onnx',
    use_cache=False,
)

onnx_translation = pipeline(
    "automatic-speech-recognition",
    model=model,
    device="cpu",
    tokenizer=tokenizer)

#print('type(onnx_translation):',type(onnx_translation))
#type(onnx_translation): <class 'transformers.pipelines.automatic_speech_recognition.AutomaticSpeechRecognitionPipeline'>

# set feature_extractor
onnx_translation.feature_extractor=feature_extractor
#print('onnx_translation.feature_extractor:',pipe.feature_extractor)

sample="/home/nishi/local/tmp/commonvoice/cv-corpus-11.0-2022-09-21/ja/common_voice_ja_32866812.mp3"

#text = 'the text to perform your translation task'
#result = onnx_translation(text, max_length = 10000)
result = onnx_translation(sample)
print(result)

The result:
$ python onnx-pred.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
{'text': 'どこまでも死体が生きるには死体が構成していかなければならない'}

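For some reason, it worked.
Amazing!!

Next, since pipeline() is slow, I need to find a way to do without it.
Should I figure out how to feed audio data directly into ORTModelForSpeechSeq2Seq()?

Could the ONNX model be quantized and made to run on an Orange Pi 5 with Armbian Jammy?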

3. Trying Transformers quantization.
I tried bitsandbytes, described in the Transformers Quantization documentation.
bitsandbytes

Following it, I turned kotoba-tech/kotoba-whisper-v1.0 into an 8-bit model.

quantize-pred.py

"""
kotoba-whisper/quantize-pred.py

https://huggingface.co/docs/transformers/quantization/bitsandbytes
https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model
"""
import sys
import torch
from transformers import AutoModelForCausalLM, pipeline, AutoTokenizer, BitsAndBytesConfig,AutoModelForSpeechSeq2Seq
from transformers import WhisperFeatureExtractor

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "kotoba-tech/kotoba-whisper-v1.0"
generate_kwargs = {"language": "japanese", "task": "transcribe"}

#model_8bit = AutoModelForCausalLM.from_pretrained(
model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

feature_extractor = WhisperFeatureExtractor(
    chunk_length=30,
    feature_size=128,
    hop_length=160,
    n_fft=400,
    n_samples=480000,
    nb_max_frames=3000,
    padding_side="right",
    padding_value=0.0,
    processor_class="WhisperProcessor",
    return_attention_mask=False,
    sampling_rate=16000
)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_8bit,
    torch_dtype=torch.float8_e4m3fn,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor
)

sample="/home/nishi/local/tmp/commonvoice/cv-corpus-11.0-2022-09-21/ja/common_voice_ja_32866812.mp3"

model_8bit.generation_config.forced_decoder_ids=None

# Run inference
result = pipe(sample, generate_kwargs=generate_kwargs)
#result = pipe(sample)
print(result["text"])

#print('model_8bit.generation_config:',model_8bit.generation_config)

#model.push_to_hub("bloom-560m-8bit")
#model.save_pretrained("path/to/model")
model_8bit.save_pretrained("./kotoba-whisper-v1.0-8bit")

#print(model)

$ python quantize-pred.py
This produced ./kotoba-whisper-v1.0-8bit/model.safetensors, but can it be converted to ONNX as is?
It does seem to be usable just by loading it as is.

It does not match the GPU version:
sec/f: 3.8781877358754477

but it seems fast even without a GPU:
sec/f: 4.035113255182902

I also wrote a prediction script that uses it.
$ python sample2-bit8.py

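4. Quantizing the ONNX model.
Here I quantize the ONNX model converted in "2. Trying to convert kotoba-whisper-v1.0 to ONNX."
I used "深層学習の量子化に入門してみた〜BERTをDynamic Quantization〜" (An introduction to quantization in deep learning: dynamic quantization of BERT) as a reference.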

onnx2qauntize.py

"""
onnx2qauntize.py

https://tech.retrieva.jp/entry/20220304

https://github.com/microsoft/onnxruntime/issues/15888
quantize_dynamic(input_model, output_model, weight_type=QuantType.QInt8, nodes_to_exclude=['/conv1/Conv'])
"""
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="kotoba-whisper-v1.0_onnx/encoder_model.onnx",
    model_output="kotoba-whisper-v1.0_onnx/encoder_model-8bit.onnx",
    weight_type=QuantType.QInt8,
    #weight_type=QuantType.QUInt8,
    nodes_to_exclude=['/conv1/Conv','/conv2/Conv']
)

quantize_dynamic(
    model_input="kotoba-whisper-v1.0_onnx/decoder_model.onnx",
    model_output="kotoba-whisper-v1.0_onnx/decoder_model-8bit.onnx",
    weight_type=QuantType.QInt8,
    #weight_type=QuantType.QUInt8,
    nodes_to_exclude=['/conv1/Conv','/conv2/Conv']
)

This produces kotoba-whisper-v1.0_onnx/encoder_model-8bit.onnx and decoder_model-8bit.onnx,
so after that you just use them in the earlier onnx-pred.py.
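
For intuition about what 8-bit weight quantization does, here is a simplified sketch of symmetric int8 quantization of a weight vector. This is an illustration only, with function names of my own; it is not how ONNX Runtime's quantize_dynamic is implemented internally, which may use a different scheme (for example zero points or a different value range).

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is within half a quantization step of the original.
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, scale, max(errors))
```

The storage per weight drops from 32 bits to 8, at the cost of a rounding error of at most half a step per weight.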
Perhaps pipeline() is the bottleneck; there is not much improvement in speed.
sec/f: 12.050926446914673
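
The sec/f figures in this post are averages over repeated runs on the same sample. A minimal sketch of such a measurement is shown below; the helper name seconds_per_run and the warm-up option are my own additions, and the dummy workload merely stands in for a real call such as pipe(sample) or model.generate(...).

```python
import time


def seconds_per_run(func, arg, repeats=5, warmup=1):
    """Average wall-clock seconds of func(arg) over `repeats` calls.

    `warmup` calls are executed first and not timed, so one-off
    start-up costs do not distort the average.
    """
    for _ in range(warmup):
        func(arg)
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    return (time.perf_counter() - start) / repeats


def dummy_transcribe(sample):
    # Placeholder workload standing in for a real transcription call.
    return sum(i * i for i in range(10000))


avg = seconds_per_run(dummy_transcribe, "sample.mp3", repeats=5)
print("sec/f:", avg)
```

Timing the pipeline call and the direct generate call with the same helper makes it easier to see where the time actually goes.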

Maybe it would be faster without pipeline()?

"Build and deploy fast and portable speech recognition applications with ONNX Runtime and Whisper"
has an example that does not use pipelines, so I adopted it.

onnx-pred_pro.py
"""
onnx-pred_pro.py

https://discuss.huggingface.co/t/how-can-i-use-the-onnx-model/70923
https://huggingface.co/transformers/v4.11.3/_modules/transformers/pipelines/automatic_speech_recognition.html

/home/nishi/torch_env/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py
audio_utils.ffmpeg_read
/home/nishi/torch_env/lib/python3.10/site-packages/transformers/pipelines/audio_utils.py

https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort
"""
#from transformers import AutoTokenizer, pipeline, PretrainedConfig
from transformers import AutoTokenizer, PretrainedConfig
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
import onnxruntime
from transformers import WhisperFeatureExtractor
import time
import numpy as np
import sys
#from onnx import numpy_helper
import torch

# test by nishi
from transformers.pipelines.audio_utils import ffmpeg_read
#from transformers.pipelines import AutomaticSpeechRecognitionPipeline

#from mltu.preprocessors import WavReader
#import librosa

"""
copy from
/home/nishi/Documents/VisualStudio-TF/lstm_sound_to_text/inferencModel.py
"""
import matplotlib.pyplot as plt

def plot_spectrogram(spectrogram: np.ndarray, title:str = "", transpose: bool = True, invert: bool = True) -> None:
    """Plot the spectrogram of a WAV file

    Args:
        spectrogram (np.ndarray): Spectrogram of the WAV file.
        title (str, optional): Title of the plot. Defaults to None.
        transpose (bool, optional): Transpose the spectrogram. Defaults to True.
        invert (bool, optional): Invert the spectrogram. Defaults to True.
    """
    if transpose:
        spectrogram = spectrogram.T
    if invert:
        spectrogram = spectrogram[::-1]
    plt.figure(figsize=(15, 5))
    plt.imshow(spectrogram, aspect="auto", origin="lower")
    plt.title(f"Spectrogram: {title}")
    plt.xlabel("Time")
    plt.ylabel("Frequency")
    #plt.colorbar()
    plt.tight_layout()
    plt.show()

model_id = "./kotoba-whisper-v1.0_onnx"

# Switch between the original and the quantized ONNX files
if False:
    encoder="encoder_model.onnx"
    decoder="decoder_model.onnx"
else:
    encoder="encoder_model-8bit.onnx"
    decoder="decoder_model-8bit.onnx"

# Load encoder model
encoder_session = onnxruntime.InferenceSession(model_id+'/'+encoder)
# Load decoder model
decoder_session = onnxruntime.InferenceSession(model_id+'/'+decoder)

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = PretrainedConfig.from_json_file(model_id+'/config.json')

#generate_kwargs = {"language": "japanese", "task": "transcribe"}

feature_extractor = WhisperFeatureExtractor(
    chunk_length=30,
    feature_size=128,
    hop_length=160,
    n_fft=400,
    n_samples=480000,
    nb_max_frames=3000,
    padding_side="right",
    padding_value=0.0,
    processor_class="WhisperProcessor",
    return_attention_mask=False,
    sampling_rate=16000
)

#model = ORTModelForSeq2SeqLM(
model = ORTModelForSpeechSeq2Seq(
    config=config,
    onnx_paths=[model_id+'/'+decoder,model_id+'/'+encoder],
    encoder_session=encoder_session,
    decoder_session=decoder_session,
    model_save_dir=model_id,
    use_cache=False,
)

# Print the input/output signatures of both ONNX sessions
if True:
    print('--------------')
    input_name0 = encoder_session.get_inputs()[0].name
    print('input_name0:',input_name0)
    # input_name: input_features
    shape0 = encoder_session.get_inputs()[0].shape
    print('shape0:',shape0)
    # shape0: ['batch_size', 128, 3000]
    outputs0 = encoder_session.get_outputs()
    print('len(outputs0):',len(outputs0))
    # len(outputs): 1
    output_name0 = encoder_session.get_outputs()[0].name
    print('output_name0:',output_name0)
    output_shape0 = encoder_session.get_outputs()[0].shape
    print('output_shape0:',output_shape0)
    output_type0 = encoder_session.get_outputs()[0].type
    print('output_type0:',output_type0)

    print('--------------')
    input_name1 = decoder_session.get_inputs()[0].name
    print('input_name1:',input_name1)
    shape1 = decoder_session.get_inputs()[0].shape
    print('shape1:',shape1)
    outputs1 = decoder_session.get_outputs()
    print('len(outputs1):',len(outputs1))
    # len(outputs): 1
    output_name1 = decoder_session.get_outputs()[0].name
    print('output_name1:',output_name1)
    # output_name[0]: logits
    output_shape1 = decoder_session.get_outputs()[0].shape
    print('output_shape1:',output_shape1)
    # output_shape: ['batch_size', 'decoder_sequence_length', 51866]
    output_type1 = decoder_session.get_outputs()[0].type
    print('output_type1:',output_type1)
    # output_type: tensor(float)
    print('--------------')

#print('type(model):',type(model))
#print('model.config:',model.config)

TEST1=True
if TEST1==True:
    print('TEST1')
    sample_mp3="/home/nishi/local/tmp/commonvoice/cv-corpus-11.0-2022-09-21/ja/common_voice_ja_32866812.mp3"

    # Decode the mp3 to a waveform at the extractor's sampling rate
    with open(sample_mp3, "rb") as f:
        inputs = f.read()
    inputs = ffmpeg_read(inputs, sampling_rate=feature_extractor.sampling_rate)
    print('inputs.shape:',inputs.shape)

    # Convert the waveform to the model's input features
    x=feature_extractor(inputs,sampling_rate=feature_extractor.sampling_rate)
    #print('type(x):',type(x))
    #x_dict=dict(x)
    #print(x_dict)
    dt=x['input_features']
    #print('dt.shape',dt.shape)
    sample=dt

    if False:
        dtx=np.transpose(dt[0])
        plot_spectrogram(dtx)

    print('sample.shape',sample.shape)

    cnt=0
    start=time.time()
    while True:
        # https://medium.com/microsoftazure/build-and-deploy-fast-and-portable-speech-recognition-applications-with-onnx-runtime-and-whisper-5bf0969dd56b
        #print('model.config.forced_decoder_ids:',model.config.forced_decoder_ids)
        #print('model.config:',model.config)
        input_my = torch.from_numpy(sample).clone()
        #predicted_ids = model.generate(input_my, max_length=448)
        predicted_ids = model.generate(input_my)
        #print("predicted_ids:",predicted_ids)

        #x = predicted_ids.to('cpu').detach().numpy().copy()
        #print('x:shape',x.shape)
        #speech=tokenizer.decode(x[0])
        speech=tokenizer.decode(predicted_ids[0])
        print('speech:',speech)
        cnt+=1
        if cnt >= 5:
            break

    end=time.time()
    print("sec/f:",(end-start)/cnt)
    # quantization
    #sec/f: 11.72364239692688

i) Calling
predicted_ids = model.generate(input_my)
was all it took.
ii) The input data is converted by
x=feature_extractor(inputs,sampling_rate=feature_extractor.sampling_rate)
To find out what this does, you have to look inside
transformers.WhisperFeatureExtractor
~/torch_env/lib/python3.10/site-packages/transformers/models/whisper/feature_extraction_whisper.py
class WhisperFeatureExtractor(SequenceFeatureExtractor):
    ...
    def __call__(
        ...

If I switch to microphone input in the future, I will need to understand this part!!
It probably includes mel conversion, padding, normalization and so on?
The processing flow here
is easy to follow in the openai / whisper sample.
import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
....

a) Load the audio data and normalize it,
b) apply pad_or_trim,
c) convert it to a mel spectrogram (as a tensor),
d) and feed it to the model.
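
Steps a) and b) can be sketched in plain Python as follows. The constants come from the WhisperFeatureExtractor settings used in the scripts above (16 kHz sampling rate, 30-second chunks, hop_length 160); the function names are my own, and this is only an illustration of the idea rather than the code in whisper/audio.py. Step c), the log-mel computation, is left to the library.

```python
SAMPLE_RATE = 16000    # Hz, from the feature extractor settings
CHUNK_SECONDS = 30     # Whisper works on 30-second windows
HOP_LENGTH = 160       # samples between successive mel frames
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000
N_FRAMES = N_SAMPLES // HOP_LENGTH       # 3000


def int16_to_float(samples):
    """a) Normalize 16-bit PCM samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def pad_or_trim(samples, length=N_SAMPLES):
    """b) Cut to `length` samples, or pad the end with zeros (silence)."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


# One second of fake microphone data (a constant value, for illustration only)
pcm = [1000] * SAMPLE_RATE
audio = pad_or_trim(int16_to_float(pcm))
print(len(audio), N_FRAMES)
```

The 3000 frames match the encoder input shape ['batch_size', 128, 3000] printed by onnx-pred_pro.py, which is why every input is padded or trimmed to 30 seconds.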

The relevant code is in
whisper/audio.py
so it is easy to follow.
Couldn't this be used here?

I tried it, but the processing time hardly changed. Too bad.
sec/f: 11.72364239692688
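
One factor that may matter here, although I have not verified it, is the decoder cache. The export log above notes that with --task automatic-speech-recognition "past key values will not be reused in the decoding", and the scripts pass use_cache=False. Without a cache, each decoding step has to run the decoder over the whole prefix again. The sketch below only counts token positions processed, using a hypothetical helper of my own; it says nothing about actual wall-clock time.

```python
def decoder_positions_processed(num_tokens, use_cache):
    """Count token positions the decoder processes to emit num_tokens tokens.

    With a key/value cache, each step handles only the newest token.
    Without it, step t re-processes all t tokens generated so far.
    """
    if use_cache:
        return num_tokens
    return num_tokens * (num_tokens + 1) // 2


for n in (30, 448):  # 448 is max_length in config.json
    without = decoder_positions_processed(n, use_cache=False)
    with_cache = decoder_positions_processed(n, use_cache=True)
    print(n, without, with_cache)
```

If this is indeed the bottleneck, exporting with `--task automatic-speech-recognition-with-past`, as the log suggests, would be the thing to try next.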

Something seems odd: the quantized ONNX version takes torch.float32 as its input format.
The original torch version converts float32 -> float16 at load time, so there the input is torch.float16.

I also wrote a prediction script for the torch 8-bit version that does not use pipelines.
$ python sample2-bit8_pro.py

I will put the Python scripts I tried on GitHub later.

tosa-no-onchan/kotoba-whisper

Would the quantized ONNX version of Transformers kotoba-whisper-v1.0 also run on the Orange Pi 5 NPU once converted?

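References.
1) "【音声認識2022】音声からテキストへ変換する「Whisper」でYouTube動画の文字起こしを実装する" (Speech recognition 2022: implementing YouTube video transcription with Whisper, which converts speech to text)

2) "【音声認識2023】Google Colab で「Whisper large-v3」を使ってYouTube動画を文字起こしする（large-v2との精度比較あり）" (Speech recognition 2023: transcribing YouTube videos with Whisper large-v3 on Google Colab, with an accuracy comparison against large-v2)

3) "Whisperを使ったリアルタイム音声認識と字幕描画方法の紹介" (An introduction to real-time speech recognition and subtitle rendering with Whisper)

4) openai / whisper

5) "【音声認識ライブラリ】PythonでSpeechRecognitionを使って日本語を認識させてみる！" (Speech recognition library: recognizing Japanese with SpeechRecognition in Python)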

speech_recognition/__init__.py
line 58
class Microphone(AudioSource):
looks like a useful reference for microphone input.
The sample below seems to use it too.

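6) Real Time Whisper Transcription
It runs ASR in real time.

The mechanism constantly captures microphone input and passes it along through a queue.
Presumably, once the microphone input exceeds --energy_threshold=1000,
it starts recording and queues the data?
I did it the same way when I worked with microphone input before.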

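I wrote sample2-pro_mic.py, which incorporates this.

$ python sample2-pro_mic.py

This one is the GTX 1070 float32 version, and it transcribes without much lag.
However, when capturing YouTube video audio through the microphone, it seems to miss quite a lot.

Maybe this cannot be improved without something like double buffering on the input side?
In C++ that would be easy enough to do, but how do you do it in Python?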
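
One common way to decouple capture from recognition in Python is a producer/consumer pair connected by queue.Queue, with the capture side on its own thread. The sketch below uses made-up byte strings instead of real microphone chunks, and all function names are my own; in a real program the producer would be replaced by the audio library's capture loop or callback.

```python
import queue
import threading


def capture(chunks, out_queue):
    """Producer: push audio chunks as they arrive, then a None sentinel."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(None)  # tells the consumer that capture has ended


def consume(in_queue, handle):
    """Consumer: process chunks in arrival order until the sentinel."""
    while True:
        chunk = in_queue.get()
        if chunk is None:
            break
        handle(chunk)


def run_pipeline(chunks, handle):
    # Unbounded queue: the capture side never blocks, so slow recognition
    # delays the output instead of dropping input.
    q = queue.Queue()
    producer = threading.Thread(target=capture, args=(chunks, q))
    producer.start()
    consume(q, handle)
    producer.join()


received = []
run_pipeline([b"chunk-1", b"chunk-2", b"chunk-3"], received.append)
print(received)
```

The queue plays the role of the second buffer: while recognition works on one chunk, capture keeps filling the queue with the next ones.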

I made an improved version. It is not double buffering, but
PyAudio (pyaudio.PyAudio()) has a non-blocking mode, so based on that
I wrote sample2-pro_mic_my.py.

$ python sample2-pro_mic_my.py

Still, words seem to drop out now and then.

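That said, it strikes me that every sample I look at merely combines existing libraries,
with no thought of using them as a reference to write a better function of your own.
That is really taking the easy way out!!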

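Since I dislike using other people's samples as is, I wrote my own microphone input function.
It starts streaming when the microphone input level rises above a threshold and stops when the level drops.
mic_stream.py

It can also be tested on its own.
$ python mic_stream.py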
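
The idea of a level-triggered recorder can be sketched as below. This is not the code of mic_stream.py; it is a simplified illustration with names and thresholds of my own (the start threshold of 1000 is borrowed from the --energy_threshold=1000 option mentioned above, the other values are arbitrary). It works on lists of 16-bit sample values and uses the RMS level of each chunk.

```python
import math


def rms(chunk):
    """Root-mean-square level of one chunk of samples."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))


def gate_segments(chunks, start_threshold=1000.0, stop_threshold=500.0,
                  max_silent_chunks=2):
    """Group chunks into segments: start on loud input, stop after silence."""
    segments = []
    current = None  # samples of the segment being recorded, if any
    silent = 0      # consecutive quiet chunks inside the current segment
    for chunk in chunks:
        level = rms(chunk)
        if current is None:
            if level >= start_threshold:
                current = list(chunk)
                silent = 0
        else:
            current.extend(chunk)
            if level < stop_threshold:
                silent += 1
                if silent >= max_silent_chunks:
                    segments.append(current)
                    current = None
            else:
                silent = 0
    if current is not None:
        segments.append(current)
    return segments


loud, quiet = [2000] * 4, [10] * 4
segments = gate_segments([quiet, loud, loud, quiet, quiet, quiet, loud])
print([len(s) for s in segments])
```

Using a lower stop threshold than start threshold, plus a short run of quiet chunks before stopping, keeps brief pauses between words from cutting a sentence in two.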

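7) faster-whisper.
SYSTRAN/faster-whisper
I expected this to be faster, but it is not much different from openai-whisper.
When I tried it with real-time microphone input,
it often failed with GPU out-of-memory errors.
Accuracy feels better with openai-whisper (medium).
Still, kotoba-whisper-v1.0 (GPU, float32) is lightweight and seems to give reasonable recognition performance.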
