A couple of days ago, the company handed me a new requirement: add speech recognition to a UE5 project.

My first idea was to record audio in UE5 C++ and, once the recording finished, get a file object directly. That led to the following code (recording logic omitted):

 

// Get the channel count and sample rate
int32 Channels = this->AudioCapture->GetNumChannels();
int32 SampleRate = this->AudioCapture->GetSampleRate();

Audio::FSampleBuffer SampleBuffer(this->AudioBuffer.GetData(), this->AudioBuffer.Num(), Channels, SampleRate);
Audio::FSoundWavePCMWriter Writer;
FString FilePath = FPaths::ProjectSavedDir();
UE_LOG(LogTemp, Warning, TEXT("FilePath: %s"), *FilePath);
Writer.BeginWriteToWavFile(SampleBuffer, "CapturedAudio", FilePath, []()
{
	UE_LOG(LogTemp, Log, TEXT("SaveComplete"));
});
Writer.SaveFinishedSoundWaveToPath(FilePath + "CapturedAudio");

The plan was to save the file at the SaveComplete callback and then read it back in. This approach has two serious drawbacks. First, it costs two rounds of disk I/O (write, then read). Second, the SaveComplete callback does not fire reliably on every recording.

After some deliberation, I switched to a different solution:

// Get the channel count and sample rate
int32 Channels = this->AudioCapture->GetNumChannels();
int32 SampleRate = this->AudioCapture->GetSampleRate();
TArray<uint8> ByteData;
const int32 DataSize = AudioBuffer.Num() * sizeof(float);
ByteData.Append(reinterpret_cast<const uint8*>(AudioBuffer.GetData()), DataSize);
UploadUrl += "?sampleRate=" + FString::FromInt(SampleRate);
UploadUrl += "&numChannels=" + FString::FromInt(Channels);
// Create the HTTP request
TSharedRef<IHttpRequest, ESPMode::ThreadSafe> HttpRequest = FHttpModule::Get().CreateRequest();
HttpRequest->SetURL(UploadUrl);
HttpRequest->SetVerb(TEXT("POST"));
HttpRequest->SetHeader(TEXT("Content-Type"), TEXT("application/octet-stream"));
// Use the raw bytes as the request body
HttpRequest->SetContent(ByteData);

// Completion callback
HttpRequest->OnProcessRequestComplete().BindLambda(
	[this, UploadComplate](FHttpRequestPtr Request, FHttpResponsePtr Response, bool bSuccess)
	{
		if (bSuccess && Response.IsValid())
		{
			UE_LOG(LogTemp, Log, TEXT("Response: %s"), *Response->GetContentAsString());
			this->AudioBuffer.Reset();
			UploadComplate.Execute(Response->GetContentAsString());
		}
		else
		{
			UE_LOG(LogTemp, Error, TEXT("HTTP Request failed"));
		}
	});

// Send the request
HttpRequest->ProcessRequest();

This sends the recorded TArray<float> samples straight to the backend and lets the backend assemble the file. I am more at home with Java's APIs than with C++'s, so the file assembly lives server-side.
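On the wire, the request body is nothing but the float samples reinterpreted as bytes, little-endian on the usual platforms. A minimal sketch of how the backend side can turn that byte stream back into samples (the class name is mine, not from the project):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FloatPcmDecoder {
    // Decode a little-endian 32-bit-float byte stream (as produced by the
    // reinterpret_cast over UE5's TArray<float>) back into float samples.
    public static float[] decode(byte[] audioData) {
        ByteBuffer buf = ByteBuffer.wrap(audioData).order(ByteOrder.LITTLE_ENDIAN);
        float[] samples = new float[audioData.length / 4];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = buf.getFloat();
        }
        return samples;
    }
}
```

The backend below never needs the decoded floats (it writes the bytes into the WAV payload as-is), but this is the layout assumption both sides must agree on.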

The Java implementation looks like this:

@PostMapping("/speechRecognition")
public String speechRecognition(@RequestBody byte[] audioData, @RequestParam int sampleRate, @RequestParam int numChannels) throws Exception {
    int bitsPerSample = 32;  // 32-bit float samples
    File file = WavConverter.convertToWav(
            audioData,
            numChannels,
            sampleRate,
            bitsPerSample
    );
    String res = speechRecognitionService.speechSynthesis(file);
    // Regex that pulls each recognized Chinese word (the "w" fields) out of the escaped JSON response
    Pattern pattern = Pattern.compile("\\\\\"w\\\\\":\\\\\"([\\u4e00-\\u9fa5]+)\\\\\"");
    Matcher matcher = pattern.matcher(res);

    // Collect the matches
    List<String> chineseList = new ArrayList<>();
    while (matcher.find()) {
        chineseList.add(matcher.group(1));
    }

    // Concatenate them into the final transcript
    return String.join("", chineseList);
}
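The regex deserves a note: the recognition result comes back as JSON whose payload is itself an escaped JSON string, so in the raw text the word fields literally read `\"w\":\"…\"` with backslashes, which is what the quadruple-backslash pattern targets. A standalone sketch of the extraction (the sample payload below is illustrative, not a real iFlytek response):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // Extract the Chinese text of every \"w\":\"...\" field from a response
    // string in which the inner JSON is escaped (quotes preceded by backslashes).
    public static String extract(String res) {
        Pattern pattern = Pattern.compile("\\\\\"w\\\\\":\\\\\"([\\u4e00-\\u9fa5]+)\\\\\"");
        Matcher matcher = pattern.matcher(res);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group(1));
        }
        return String.join("", words);
    }
}
```

If the service ever returns the result as plain (unescaped) JSON, a proper JSON parser is the safer choice; the regex only works because of the double-encoded shape of this particular response.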
The controller delegates the WAV assembly to a small utility that prepends a standard 44-byte header to the raw bytes:

package com.ruoyi.ai.utils;

import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.charset.StandardCharsets;

public class WavConverter {

    /**
     * Wraps a raw audio byte stream in a WAV container.
     *
     * @param audioData     raw audio data (little-endian byte order)
     * @param numChannels   channel count (1 = mono, 2 = stereo)
     * @param sampleRate    sample rate (e.g. 44100)
     * @param bitsPerSample bit depth (32 = float, 16 = integer PCM)
     */
    public static File convertToWav(
            byte[] audioData,
            int numChannels,
            int sampleRate,
            int bitsPerSample
    ) throws IOException {
        // Derived header fields
        int byteRate = sampleRate * numChannels * (bitsPerSample / 8);
        int blockAlign = numChannels * (bitsPerSample / 8);
        int dataSize = audioData.length;
        int riffChunkSize = 36 + dataSize;

        // Allocate the 44-byte canonical header
        ByteBuffer header = ByteBuffer.allocate(44);
        header.order(ByteOrder.LITTLE_ENDIAN);

        // RIFF chunk
        header.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        header.putInt(riffChunkSize);
        header.put("WAVE".getBytes(StandardCharsets.US_ASCII));

        // fmt sub-chunk
        header.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        header.putInt(16); // fmt chunk size (16 for the standard PCM/float layout)
        header.putShort((short) (bitsPerSample == 32 ? 3 : 1)); // format code: 3 = IEEE float, 1 = integer PCM
        header.putShort((short) numChannels);
        header.putInt(sampleRate);
        header.putInt(byteRate);
        header.putShort((short) blockAlign);
        header.putShort((short) bitsPerSample);

        // data sub-chunk
        header.put("data".getBytes(StandardCharsets.US_ASCII));
        header.putInt(dataSize);

        // Concatenate header and audio payload
        byte[] headerBytes = header.array();
        byte[] wavBytes = new byte[headerBytes.length + audioData.length];
        System.arraycopy(headerBytes, 0, wavBytes, 0, headerBytes.length);
        System.arraycopy(audioData, 0, wavBytes, headerBytes.length, audioData.length);

        return createTempWavFile(wavBytes);
    }

    public static File createTempWavFile(byte[] wavBytes) throws IOException {
        // Create a temp file with an auto-generated name and a .wav suffix
        Path tempFilePath = Files.createTempFile("audio_", ".wav");
        File tempFile = tempFilePath.toFile();
        // Write out the bytes
        Files.write(tempFilePath, wavBytes);
        // Remove the temp file when the JVM exits (optional)
        tempFile.deleteOnExit();
        return tempFile;
    }
}
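As a sanity check on the header layout above: in the canonical 44-byte header, the format code sits at byte offset 20, the channel count at 22, and the sample rate at 24, all little-endian. A small reader sketch makes those offsets explicit (class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WavHeaderCheck {
    // Read the fmt fields back out of a canonical 44-byte WAV header:
    // returns {formatCode, numChannels, sampleRate}.
    public static int[] readFmt(byte[] wav) {
        ByteBuffer b = ByteBuffer.wrap(wav).order(ByteOrder.LITTLE_ENDIAN);
        int format = b.getShort(20);    // 3 = IEEE float, 1 = integer PCM
        int channels = b.getShort(22);
        int sampleRate = b.getInt(24);
        return new int[]{format, channels, sampleRate};
    }
}
```

Dumping a generated file through a checker like this (or through `ffprobe`) is a quick way to catch header mistakes before blaming the recognition service.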

The speech recognition here uses iFlytek's API. Once you have the audio file object, you can swap in whatever recognition service you prefer.

If you are comfortable with C++, the second step (WAV assembly) can be implemented in C++ instead.

Advancing technology together, growing side by side: the iFlytek AI Developer Community