@@ -293,6 +297,139 @@ if let imageURL = images.first?.url {
}
```
## Audio

Audio generation and transcription are becoming increasingly important in mobile apps as users grow more comfortable speaking directly to apps and having their audio input transcribed. Preternatural integrates with these continually improving AI technologies.

### Audio Transcription

[Whisper](https://openai.com/index/whisper/), created and open-sourced by OpenAI, is an Automatic Speech Recognition (ASR) system trained on 680,000 hours of mostly English audio content collected from the web. This makes Whisper more robust to background noise and varying accents than its predecessors. It also transcribes audio with correct sentence punctuation.

```swift
import OpenAI

let client = OpenAI.Client(apiKey: "YOUR_API_KEY")

// Supported formats include flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm
let audioFile = URL(string: "YOUR_AUDIO_FILE_URL_PATH")

// Optional - great for including the correct spelling of audio-specific keywords
// For example, here we provide the correct spelling for company-specific words in an earnings call
// Placeholder value: replace it with your own domain-specific terms
let prompt = "YOUR_PROMPT"

// Optional - Supplying the input language in ISO-639-1 format will improve accuracy and latency.
// While Whisper supports 98 languages, languages other than English can have a much higher error rate, so test thoroughly
let language: LargeLanguageModels.ISO639LanguageCode = .en

// The sampling temperature, between 0 and 1.
// Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
// If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
let temperature: Double = 0

// Optional - Setting Timestamp Granularities provides the time stamps for roughly every sentence in the transcription.
// Note that timestampGranularities is an array of granularities, so you can include both .segment and .word, or just one of them
let timestampGranularities: [OpenAI.AudioTranscription.TimestampGranularity] = [.segment, .word]

do {
    let transcription = try await client.createTranscription(
        audioFile: audioFile,
        prompt: prompt,
        language: language,
        temperature: temperature,
        timestampGranularities: timestampGranularities
    )

    let fullTranscription = transcription.text
    let segments = transcription.segments
    let words = transcription.words
} catch {
    print(error)
}
```
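
Since the transcription endpoint only accepts the formats listed in the comment above, it can be useful to reject unsupported files before uploading them. The following is a minimal sketch using only Foundation; `supportedAudioExtensions` and `isSupportedAudioFile` are hypothetical names introduced for illustration and are not part of the Preternatural or OpenAI APIs.

```swift
import Foundation

// Extensions copied from the supported-formats comment in the example above.
let supportedAudioExtensions: Set<String> = [
    "flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"
]

/// Hypothetical helper: returns true when the URL's file extension
/// (compared case-insensitively) is one of the supported formats.
func isSupportedAudioFile(_ url: URL) -> Bool {
    supportedAudioExtensions.contains(url.pathExtension.lowercased())
}

// Example check on a local file path with an upper-case extension.
let candidate = URL(fileURLWithPath: "/tmp/earnings-call.M4A")
if isSupportedAudioFile(candidate) {
    print("Ready to transcribe \(candidate.lastPathComponent)")
} else {
    print("Unsupported audio format: \(candidate.pathExtension)")
}
```

This only inspects the file extension, not the file contents, so it is a quick pre-flight check rather than a guarantee that the upload will be accepted.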
### Audio Generation: OpenAI
Preternatural offers a simple Text-to-Speech (TTS) integration with OpenAI:
```swift
import OpenAI

let client = OpenAI.Client(apiKey: "YOUR_API_KEY")

// OpenAI offers two Text-to-Speech (TTS) models at this time.
// tts-1 is the latest text-to-speech model optimized for speed, ideal for real-time text-to-speech use cases.
let tts_1: OpenAI.Model.Speech = .tts_1
// tts-1-hd is the latest text-to-speech model optimized for quality.
let tts_1_hd: OpenAI.Model.Speech = .tts_1_hd

// Text for audio generation
let textInput = "In a quiet, unassuming village nestled deep in a lush, verdant valley, young Elara leads a simple life, dreaming of adventure beyond the horizon. Her village is filled with ancient folklore and tales of mystical relics, but none capture her imagination like the legend of the Enchanted Amulet, a powerful artifact said to grant its bearer the ability to control time."

// OpenAI currently offers 6 voice options
// Voice samples are available on their website: https://platform.openai.com/docs/guides/text-to-speech
let alloy: OpenAI.Speech.Voice = .alloy
let echo: OpenAI.Speech.Voice = .echo
let fable: OpenAI.Speech.Voice = .fable
let onyx: OpenAI.Speech.Voice = .onyx
let nova: OpenAI.Speech.Voice = .nova
let shimmer: OpenAI.Speech.Voice = .shimmer

// The OpenAI API offers the ability to adjust the speed of the audio.
// A speed between 0.25 and 4.0 can be selected, with 1.0 as the default.
let speed = 1.0

let speech: OpenAI.Speech = try await client.createSpeech(
    model: tts_1,
    text: textInput,
    voice: alloy,
    speed: speed
)

let audioData = speech.data
```
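
Once the audio bytes are available, a common next step is to persist them so they can be played back or shared. The sketch below uses only Foundation and assumes that `speech.data` in the example above is a `Data` value containing MP3 audio (MP3 is the OpenAI API's default response format; adjust the extension if you request another). `saveAudio` is a hypothetical helper, not part of the Preternatural API.

```swift
import Foundation

/// Hypothetical helper: writes audio bytes to a uniquely named file in the
/// temporary directory and returns that file's URL.
func saveAudio(_ data: Data, fileExtension: String = "mp3") throws -> URL {
    let fileURL = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString)
        .appendingPathExtension(fileExtension)
    // .atomic writes to a scratch file first, so a failed write leaves no partial file.
    try data.write(to: fileURL, options: .atomic)
    return fileURL
}

// Stand-in bytes for illustration; in the example above this would be `speech.data`.
let sampleAudioData = Data([0x49, 0x44, 0x33])

do {
    let savedURL = try saveAudio(sampleAudioData)
    print("Saved \(sampleAudioData.count) bytes to \(savedURL.path)")
} catch {
    print(error)
}
```

The returned file URL can then be handed to a player such as `AVAudioPlayer(contentsOf:)` on Apple platforms.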
### Audio Generation: ElevenLabs

[ElevenLabs](https://elevenlabs.io/) is a voice AI research & deployment company providing the ability to generate speech in hundreds of new and existing voices in 29 languages. They also allow voice cloning - provide only 1 minute of audio and you can generate a new voice!

```swift
import ElevenLabs

let client = ElevenLabs.Client(apiKey: "YOUR_API_KEY")

// ElevenLabs offers Multilingual and English-specific models
// More details on their website here: https://elevenlabs.io/docs/speech-synthesis/models
let model: ElevenLabs.Model = .MultilingualV2

// Select the voice you would like for the audio on the ElevenLabs website
// Note that you first have to add voices from the Voice Lab, then check your Voices for the ID
let voiceID = "4v7HtLWqY9rpQ7Cg2GT4"

let textInput = "In a quiet, unassuming village nestled deep in a lush, verdant valley, young Elara leads a simple life, dreaming of adventure beyond the horizon. Her village is filled with ancient folklore and tales of mystical relics, but none capture her imagination like the legend of the Enchanted Amulet, a powerful artifact said to grant its bearer the ability to control time."

// Optional - if you set any or all settings to nil, default values will be used
let voiceSettings: ElevenLabs.VoiceSettings = .init(
    // Increasing stability will make the voice more consistent between re-generations, but it can also make it sound a bit monotone. On longer text fragments it is recommended to lower this value.
    // This is a double between 0 (more variable) and 1 (more stable)
    stability: 0.5,
    // Increasing the Similarity Boost setting enhances the overall voice clarity and targets speaker similarity.
    // This is a double between 0 (Low) and 1 (High)
    similarityBoost: 0.75,
    // High values are recommended if the style of the speech should be exaggerated compared to the selected voice. Higher values can lead to more instability in the generated speech. Setting this to 0 will greatly increase generation speed and is the default setting.
    // This is a double between 0 (Low) and 1 (High)
    styleExaggeration: 0.0,
    // Boost the similarity of the synthesized speech and the voice at the cost of some generation speed.
    speakerBoost: true
)

do {
    let speech = try await client.speech(
        for: textInput,
        voiceID: voiceID,
        voiceSettings: voiceSettings,
        model: model
    )

    return speech
} catch {
    print(error)
}
```
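
The numeric voice settings above are each documented as a double between 0 and 1. If these values come from user-facing controls, it may be worth clamping them before building the settings object. The sketch below is plain Swift with no SDK dependency; `NumericVoiceSettings`, `clampedToUnitRange`, and `sanitized` are hypothetical names introduced for illustration, not part of the ElevenLabs package.

```swift
/// Hypothetical container mirroring the three numeric settings described above.
struct NumericVoiceSettings: Equatable {
    var stability: Double
    var similarityBoost: Double
    var styleExaggeration: Double
}

/// Clamps a value into the closed range 0...1.
func clampedToUnitRange(_ value: Double) -> Double {
    min(max(value, 0.0), 1.0)
}

/// Returns a copy of the settings with every field clamped into 0...1.
func sanitized(_ settings: NumericVoiceSettings) -> NumericVoiceSettings {
    NumericVoiceSettings(
        stability: clampedToUnitRange(settings.stability),
        similarityBoost: clampedToUnitRange(settings.similarityBoost),
        styleExaggeration: clampedToUnitRange(settings.styleExaggeration)
    )
}

// Out-of-range input, for example from a slider that was misconfigured.
let userInput = NumericVoiceSettings(stability: 1.4, similarityBoost: 0.75, styleExaggeration: -0.2)
let safeSettings = sanitized(userInput)
print(safeSettings)
```

The sanitized values can then be passed to the `ElevenLabs.VoiceSettings` initializer shown above.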
## Text Embeddings
Text embedding models are translators for machines. They convert text, such as sentences or paragraphs, into sets of numbers, which the machine can easily use in complex calculations. The biggest use case for Text Embeddings is improving Search in your application.
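
To make the search use case concrete, one common way to compare two embeddings is cosine similarity: the closer the score is to 1, the more similar the texts. The sketch below is plain Swift and uses tiny hand-written vectors as stand-ins for real embeddings (which typically have hundreds or thousands of dimensions); `cosineSimilarity` and the sample data are hypothetical and not part of the Preternatural API.

```swift
import Foundation

/// Cosine similarity between two equal-length vectors.
/// Returns nil for empty, mismatched, or zero-length vectors.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double? {
    guard a.count == b.count, !a.isEmpty else { return nil }
    let dot = zip(a, b).reduce(0.0) { $0 + $1.0 * $1.1 }
    let normA = sqrt(a.reduce(0.0) { $0 + $1 * $1 })
    let normB = sqrt(b.reduce(0.0) { $0 + $1 * $1 })
    guard normA > 0, normB > 0 else { return nil }
    return dot / (normA * normB)
}

// Toy 3-dimensional "embeddings" for illustration only.
let queryEmbedding: [Double] = [0.9, 0.1, 0.0]
let documents: [(title: String, embedding: [Double])] = [
    (title: "Shipping times", embedding: [0.0, 0.1, 0.9]),
    (title: "Resetting your password", embedding: [0.8, 0.2, 0.1])
]

// Score every document against the query, then sort best match first.
var ranked: [(title: String, score: Double)] = []
for document in documents {
    if let score = cosineSimilarity(queryEmbedding, document.embedding) {
        ranked.append((title: document.title, score: score))
    }
}
ranked.sort { $0.score > $1.score }

for result in ranked {
    print(result.title, result.score)
}
```

In a real application the vectors would come from an embedding model, and the ranked list would drive the search results shown to the user.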