If you can dream it, it can happen! Jay LeBoeuf from Descript joins Anne to discuss the benefits of having a voice clone and how Descript can improve work-from-home potential for talent. Remove filler words with one click, adjust your audio via transcript, fix errors using an Overdub voice clone, and so much more. Use your voice beyond its in-person potential with tools that bring the power of AI editing directly to talent. More at https://voboss.com/voice-and-ai-descript-jay-leboeuf/ Transcript >> It's time to take your business to the next level, the BOSS level! These are the premiere Business Owner Strategies and Successes being utilized by the industry's top talent today. Rock your business like a BOSS, a VO BOSS! Now let's welcome your host, Anne Ganguzza. AI Voices: Welcome to the podcast. The VO BOSS podcast blends solid, actionable business advice with a dose of inspiration for today's voiceover talent. Each week host Anne Ganguzza focuses on a specific topic to help you grow your voiceover business. Anne: All right. Hey everyone, who was that? That was some other people introducing the podcast today. So welcome again, everyone, to the VO BOSS podcast. This is the AI and Voice series, and I am your host Anne Ganguzza, Anne Ganguzza. Today, I'm excited to bring you special guest Jay LeBoeuf, head of business development at Descript, a company that creates tools for new media creators. Now, Jay is also a lecturer on media technology and business at Stanford University, Carnegie Mellon, and University of Michigan, and sits on the board of advisors of numerous AI media and ed tech startups. He previously worked at some little-known companies, probably to you guys out there and in the voiceover world, Avid Pro Tools and iZotope. Jay, thank you so much for joining me today. Jay: Thanks for having me, Anne. It's, uh, it's wonderful to be here with some of my AI-driven voice friends. Anne: Yeah, that was fantastic.
So what we heard in the beginning was a couple of your voices on your platform, right? Jay: Indeed. One of which is my own, and we were using Descript's Overdub technology. Anne: Awesome. Well, I want to definitely talk to you about that, but before we get into your role at Descript and what the company offers, first of all, let me just say, okay. Avid Pro Tools and iZotope, known to just about everybody that listens to this podcast, and your resume is so incredibly impressive. Back in 2008, you were founder and CEO of Imagine Research, where you created the first sound object recognition platform. And I believe that led to a patent as well as some small business research awards to you. And then somehow that became iZotope in 2012. Now, does that mean that my mouse clicks are being detected by an AI engine? Jay: So there's so many ways that AI is now integrated into the creative products that we use on a daily basis. And so the short answer is yes. So Imagine Research was based on some of what I was seeing. So I was at the, on the Pro Tools team, like you mentioned, for about eight, eight and a half years before that. And I was seeing all these struggles that recording engineers, mixing engineers, voiceover talent, uh, ADR, we were seeing all these, these problems in the process that AI could solve. So we attempted to create the first set of tools where we could teach a computer how to recognize basic sounds and musical instruments, and even robustly differentiate whether this is a male speaker versus a female voice, and, you know, try to choose presets automatically for it. So iZotope acquired that company and that technology. I was at iZotope for about two years or so, helping to integrate all that work. And you know, you now see that iZotope products include a number of assistants -- Anne: Oh yeah. Jay: -- and things that will listen to your content and it's going to help it -- Anne: Absolutely. Jay: -- get it to the next stage.
And that's the goal with all of this. Anne: And I have to say that there's a lot of people in the voiceover industry that just absolutely -- that is their go-to, that is their go-to product to get rid of excess noise in their recording. So I thought that that was so fascinating. So, and now you are at Descript, and I've heard of Descript from the podcast world, and I'd heard about it a few years back where a lot of people were starting to use Descript for transcripts for their podcasts. And then wow, you guys just seem to have like catapulted with your product offerings since then. Tell us a little bit about Descript and the products that you offer, because I'm genuinely impressed with everything that you guys have going on over there. Jay: Great. Uh, thanks for using it and being familiar with it. For those that don't know De-script or Descript, we have no official pronunciation. So the choice is yours. Anne: Okay. Jay: Our team is kind of split on it. I go with De-script myself. So -- Anne: De-script. Jay: Descript allows creators to create and edit audio and video as simply as typing. And this is this paradigm where you can drag in content that you've recorded externally, or you can record natively in the app. A transcript appears in seconds to minutes. If you have multiple people on a track, it will automatically detect who they are and split them into different speaker labels. So you have this like really rich transcription going on. And a lot of people might stop right there and say, yeah, I've seen transcription tools before. Then I, you know, do a paper edit in Google Docs, and then we bring it into Pro Tools and then just start cutting. But with Descript, we have all this alignment technology where the transcript is automagically aligned to the underlying audio and video. So as you are editing the text, as you are doing things like cutting out all of your ums, ahs, likes, you knows, all of that, it just snips them out.
And we use some AI to kind of stitch it all together. So that way you make a few cuts. And I have plenty of examples I can play of like befores and afters, where we can take a lot of great material and just make it sound so much better. So that's all you have to do, just edit text. Anne: Now I remember when I looked at it a couple of years ago, one of the things that I have today is when I record through ipDTL, because it's a high quality audio connection, people can talk over one another. And whenever I tried transcript technologies in the past, it couldn't deal with people talking at the same time and then basically separating out who they were. But I feel like your technology has now surpassed those issues. And it's really something that I think is incredible, that it can even overlay the words on the waveform. Is that what you had mentioned? Jay: Absolutely, so you have, you have two ways of editing. You have the script view where you can actually just see the transcript. And if all you want to do is select words and phrases and hit delete, or strike through, you can edit through that. But if you are more comfortable with the waveform, we actually will overlay the words on top of each part of the waveform. Anne: Wow. Jay: And then you can make your manipulations there. So if you want to add a crossfade to a certain place, you know that, okay, yeah, just put a crossfade between the words "voiceover" and "business," and no more needing to audition thousands and thousands of times to get them right. Anne: Wow. Well, that's fantastic. All right. So that's for podcasting. And now you have some other products that you offer as well that are quite powerful. Jay: Exactly. So, you know, we're most known for podcasting, I'd say. You know, the, the people in that community have probably heard of us, have probably tried it out. If you haven't, by all means, now's a great time to at least try.
Drag some tape in, start cutting it up, and of course if there's anything I can help you with, let me know. But you know, we added video support in 20 -- what year are we in now -- 2020. Anne: Yep. I saw that. Jay: It's been a year. Anne: It's been a year. Jay: It's been a year. So about halfway through 2020, we -- you were always able to kind of edit the video because it was always linked to the audio, but we really doubled down. So, uh, what we ended up doing was built in all of the basic features that you would have in a typical non-linear editor, like an iMovie or a Final Cut or a Premiere. We built in all the basics, all the bread and butter things that you need, on top of all of the word and text editing capabilities we had. So you can now do all of your crossfades, all of your titling, arrows and annotations, and you know, very basic multicam support. All these things work great, 4K, 60 frames-a-second video. It's all synced to the cloud, so that's something that's also really wonderful about the tool, and you and I could record something. I can invite you just like a Google Doc, and then you and I can start collaborating on this material simultaneously. We see the same doc. We have the same footage. Anne: So, wow, a video word processor. So we have the audio word processor -- Jay: Video word processor. Anne: -- and now a video word processor. That's, wow. Also, in addition to that, I think you can do screen recording as well with Descript? Jay: Exactly. So for all of us that are fully embracing the remote collaboration -- Anne: Yeah. Jay: -- asynchronous video communication life, we're sending each other a lot of quick updates or quick tutorials. So rather than have to type out those "here's all the instructions on how to connect to ipDTL for the first time," you can actually just do a quick screen recording using your own voice.
And what differentiates the Descript screen recorder is, again, as soon as you finish recording your screen recording -- either, you know, your webcam or the screen itself -- you see an instant transcript of what you said. And with one click, if you want to remove all of your filler words -- Anne: Right. Jay: -- I am a prolific ummer and ahher when I'm making stuff up. Anne: We all -- yeah, I think we all. We all are. Jay: So when -- you get to this little dialogue that pops up that says you have 35 filler words -- Anne: Wow. Jay: -- click to remove, and then you'll see the sentence where I start explaining it. And then I say, "yeah, let me try that again." I can just whack that sentence out and then send the video along. You can ask my team. I do tons of those every day. Anne: Now does it record the screen, and also use the video cam? So it can do multiple cameras or multiple recordings at the same time? Jay: Exactly, exactly. So, so right now you can have your webcam as a bubble that you can position anywhere you want on the screen. Also, you have separate audio tracks for your mic. You have computer audio. So that's something that I use a lot where I'm demoing something and maybe sharing the output of Descript to the app or a different tool. So you can capture audio from computer audio and also your high-quality input. Anne: Fantastic. Jay: Very nice microphones. Anne: Now I happened to read a press release the other day about a new product called Studio Sound, which allows you to remove noise [laughs] in your recording. Jay: Okay. Anne: That's pretty powerful. [laughs] Jay: So I have incredible admiration for companies that make professional noise reduction, de-reverberation, restoration tools. I have a ton of friends that work at iZotope. Having worked there myself, I love the company. So -- Anne: I was going to say, you have quite a background in it. So that would make sense.
[laughs] Jay: So I will say, what we wanted to build was as close to a one-checkbox solution, where you know what, you have this audio, you either don't have the time, you don't have the skill -- Anne: Right, exactly. Jay: -- you don't have the knowledge to use the professional tools. So like we're not talking about saving a location recording from Deadliest Catch and removing like -- Anne: Right. Jay: -- some of those conditions. We're talking about -- let me play an example. So I'm going to play you some material, and this, this is maybe what got recorded with some, you know, room tone on a not great mic. So let me just hit play. Anne: Okay. [room noise] Jay: Hey, there's the room tone. Voice: The appearance of the island when I came on deck next morning was altogether changed. Although the breeze had now utterly ceased, we'd made a great deal of way during the night and were now lying becalmed about half a mile to the southeast of the low eastern coast. Jay: Okay. So now let me click a checkbox that's called Studio Sound in Descript. Anne: And that's not uncommon for people with podcasts who have guests that are not necessarily -- Jay: Right. Anne: -- having the right recording studio. Jay: Right. No, definitely. Anne: Yeah. Jay: So now, now let me hit the space bar and now I'm playing. Voice: The appearance of the island when I came on deck next morning was altogether changed. Although the breeze had now utterly ceased, we'd made a great deal of way during the night and were now lying -- Jay: Let me turn it off. Voice: -- becalmed about half a mile to the southeast of the low eastern coast. Anne: Wow. Jay: And back on. Voice: Green colored woods covered a large part of the surface. Anne: Wow, wow! Jay: That's one checkbox. Anne: This is a product that's actually out now? Jay: This is out now. We -- Anne: Wow, that's incredible. Jay: -- have a beta tag applied to it because we're still experimenting with it -- Anne: Sure. Jay: -- but it's actually on every plan.
Anne: Okay. Jay: We have a free Descript plan. So people listening to this, they're like, I want to try this out. You can try this out. It's totally free. Try it on your files, download your files when you're done with them. Anne: Right. Jay: We're really excited about this. And this is just one of these other suites of tools that we're trying to build to allow people to create professional sounding and looking content faster than ever before. Anne: Sure. Jay: You shouldn't have to spend hundreds and hundreds of extra dollars to download and learn tools when you have problems with your content. And so that's, that's some of the stuff we're trying to solve. Anne: Yeah, and that really serves a need. You know, I cannot tell you how many people -- I mean, I'm a full-time voice talent. And so for me, you know, this is part of my daily thing. I had to learn how to, or I'd had tools that helped me to remove noise, but there's so many people out there in the podcast world, or just in general, that are creating content, and yeah, stuff like this can be immensely helpful. So, wow. So that's an incredible suite of tools, and you also now have -- well, you've had it for a couple of years now -- Overdub, right, which is your -- this is how you can create an AI voice, your voice cloning technology. Talk to me a little bit about that. Jay: Absolutely. So Overdub allows anyone to create their own voice clone, and importantly, only with their own voice. And you can do that with only a few minutes of training data. And once you have this voice clone, this voice model, you can generate new sentences or correct your verbal typos. So a few ways that we see it being used that will really resonate with your listeners. Let's say you made a mistake in a, in an audio book or, you know, in a podcast, you mispronounced the key character's name. Anne: Right. Jay: You stated a date wrong, something like that.
So you need to go back to the studio, or if you're at home, you need to kind of set up your equipment again, get it exactly how it was before. Anne: Punch and roll. [laughs] Jay: Rerecord everything, punch and roll, or even better -- I have much more experience on the editor side. So as an editor, I would spend hours trying to find that word or phrase and then splice it in from elsewhere in the archives. Anne: Absolutely. Jay: It just never sounds right. Anne: Yeah, that actually makes me think of a lot of medical recordings that I do, for medical narration. If you find that you've mispronounced the word once, it's usually in the script quite a few times, if it's a product name. Jay: Right. So with Overdub, you would have created your own voice model. And so if you have the script and you knew -- you're using Descript, you can actually go in, find that one word that needs fixing, or that phrase that needs fixing, or the sentence you actually forgot to say, and just type it in. And what we actually do behind the scenes -- this part is fascinating -- we don't just generate the word in isolation. We take the text that you type in. We take basically the audio recording before your contextual edit and the audio after. And then we send that all to the cloud, and using those three inputs along with, you know, your voice model, we're able to generate the missing word or phrase to make it fit in context. So, you know, if I was trying to resynthesize the word Overdub, sometimes it will sound like Overdub. Sometimes it'll sound like Overdub, and it's just gonna depend on where it's going to fit in within the phrasing of what you were saying. Anne: Wow. So tell me again, what does it take to create your Overdub? How long does it take? Jay: As little as 10 minutes -- Anne: Wow! Jay: -- of training data. Anne: So does that mean you have a model that's already there, that's being used for these voices? Jay: So let's go even deeper with the super behind the scenes.
The way that we're able to make it so easy, where all you need to do is, you know, you basically read a training script. Anne: Okay. Jay: And you read this training script to us, and, you know, we have it on our website and there's, there's nothing special about it. Technically any source material would work, but we just provide this like David Attenborough voiceover stuff. It's really fun to read. Anne: Okay. Jay: So you read that, and we need as little as 10 minutes. The more you add, the better it's going to get. There's no point in going over an hour at that point. Our research has shown it's not going to sound any better. Anne: Okay. Jay: So, you know, between 10 minutes and an hour that you're willing to sit and read this script. The other thing we need of course is your voice consent statement. So this is a 30-second long blurb we also have available on our website, in which you grant consent to Descript to create your own voice model. And you're just stating that, like, I and I alone have access to this voice model. If I choose to grant it to somebody else, then I'm giving people the option to use my voice. But you know, this voice is just mine. And we use that to compare against the training data to make sure that this is really you. Anne: Got it. So then let me just back up just a second. Jay: Yeah, please. Anne: So if you're using any of the material that people upload, let's say, for podcast editing or any of the, any of the products that you offer, is any of that being used for training data from Descript? Jay: No. So all of your material, all your voice data is yours and yours alone. Anne: Got it. Jay: Previous to releasing Overdub, we had actually learned the general speech patterns from thousands and thousands of speakers. Uh, Descript acquired a company called Lyrebird in 2019. Anne: Yes, I'm familiar with that. Jay: And they're real pioneers in this space. And they had actually learned from thousands of existing speakers.
Anne: I heard about the viral thing they did with politicians back a few years ago. Absolutely. And so you've had the model for a while that's been developed with thousands and thousands of voices. Jay: Exactly. Anne: Got it. Jay: What, what the secret sauce is, is the ability to, with just a few minutes of a different person's speech, be able to identify what makes Jay or what makes Anne sound the way they do, with the mic they have, in the room that they're in, with the cadence that they're speaking. And we kind of can make this like lighter weight model to generate your speech. Anne: Okay. So what, in your opinion, or what, in your knowledge, what makes a better AI voice? Is it the person that records being, I don't know, more conversational, or what makes some voices sound a little more robotic than others? Jay: The short answer is it's really going to depend on the underlying technology that's being used. So that's why Descript's Overdub technology sounds different than Alexa, than Google WaveNet, than Thimble, than, you know, than other solutions. For our approach, some of the things that we think make it sound so good -- so one thing is that we are one of the only solutions that actually generates 44,100 samples every second of your voice. And your listeners know what that means. If, if people don't, it's, you know, CD-quality sound -- you don't even know what CDs are anymore. Anne: I know! Jay: It's really good, super high resolution. And so that's one of the things that people often notice, like Alexa is nowhere even close to -- Anne: Right. Jay: 44.1 K. And so that's why she'll always sound that little bit muffled, that little bit flat. And so by generating in, you know, what the researchers call super resolution, that's one thing that really makes a very big difference with what we're doing.
From a training material standpoint, when we, you know, when we work with artists and celebrities, sometimes we'll actually coach them on, you know, the training material that they put into the system should be read as naturally -- Anne: As possible. Jay: -- as they want the output to be. So, yeah. So, you know, we have the David Attenborough scripts, but if you're never going to be doing that in the wild, then read it in a way that's more representative -- Anne: In the wild! [laughs] Right, right, absolutely. Jay: Literally in the wild. Anne: Yup. Yup. Okay. All right. That makes sense. Now, do you have tools that allow you to change the sound of it once you've, you know, once you've typed in a script, and you change -- can you add emotion? Can you change speed? Those sorts of things? Jay: Change style is what we have. Rather than exposing 10, 15, you know, sliders, controls, checkboxes, the Descript way of doing it is to allow you to actually select some source material that sounds representative of the style you want to recreate. So I would go in there, I would highlight a sentence or part of a paragraph that sounds like what I want to create. I would then right-click on it, say Overdub voice style, and I would say "create new voice style," and then call it whatever you want. So maybe it's happy or enthusiastic. Anne: Okay. Jay: You give it a name, and then that name can be applied for Overdub generation in the future to steer the material. Anne: Are you recording that happy? Or are you recording that? Like, where are they getting that from? Where are you getting the happy from? Or the emotion from? Jay: Yeah. Anne: The style. Jay: We leave it to users. Anne: Oh, okay. Jay: That's one of the things people say, like -- Anne: I got it. Jay: -- "hey, you know, I just created my voice model. Why don't you provide some templates?" I'm like, because I don't know what you sound like when you're happy. Anne: Okay, okay.
Jay: So you get one default style -- Anne: Okay. Jay: -- that the system thinks is neutral Anne. This is what neutral Anne sounds like. And then it's up to you to go through, and in your training data, start finding examples of here's me being contemplative, here's me being excitable, and then give them the names -- Anne: Okay. Jay: -- that you feel comfortable with. Anne: Do you resell these voices? Jay: No. So your voice is only your voice. You can assign it to other people that you work with on your team -- Anne: Okay. Jay: -- but you can also revoke that at any time. That's, uh, you know, it's functionality that we, we treat seriously. Now that -- the one thing we do provide to get people started out of the box, when we were playing the welcome to the VO BOSS intro, for example, we provide some stock voices. So we have eight right now, just a very limited palette, but still eight stock voices, which are pre-trained voice models of voice actors that we have an agreement with, to get people up and running. Anne: Got it. So then if I wanted to resell my voice, is that possible? Like if I create, let's say I get a script, I mean, you can hire human Anne or you can hire AI Anne. And so somebody says, well, I'm going to hire AI Anne, and I'm going to pay a certain amount. You know, probably not as much as human Anne. Could I then on Descript generate that voice and sell that? Jay: Yeah, we, you know, we don't have a marketplace or anything like that to facilitate that, but -- Anne: Interesting. Jay: -- the voice is yours. So you would come to an arrangement. You would be responsible for sharing your voice with another Descript user and overseeing how they're using it. And you know, the nice part of the voice ownership, you can turn it off at any time, so you can revoke access. Anne: So I guess my question would be, let's say I have a client, and they say, you know what? I have a bunch of material that I need to have recorded, but my budget is so much.
And I say, okay, well, I can do that for you with my AI voice. 'Cause I don't have enough time to go in my studio and record that, but I could go to Descript, throw in the scripts, generate that, and then sell that to my client. I guess that's my question. Um, and that would be in agreement -- Jay: Oh, totally. Anne: -- that would be in agreement. How interesting, because I think one thing that a lot of people in the voiceover industry have been fearful of is, you know, who owns that voice, and how do I know where it's being used, and how do I, you know, is there an agreement, a contract that's been drawn up? So what that would do is it would allow us control over our own voice in selling the voice. So we would like, we normally do, we have contracts where we specify usage. So if it happens to be, let's say, in the commercial realm, and it's a commercial for McDonald's, if that's, you know, what they were looking for, we could then, you know, put in usage that would be appropriate for the job. And it would be something that we would negotiate with the company. Jay: Right. Anne: And that would be fine. You're not even a middle guy in that. That's basically we own our own voice. Jay: No. Anne: We can do whatever we want with it. Right? We can download it, right, I assume. Jay: Absolutely. This is the workflow I heard you say, Anne, is maybe we can flip it. You hire me, I'm voice talent. You give me the script. Anne: Yup, yup. Jay: But then like, oh, this is not within my budget. And you're like, how about this? I'm going to give you AI Jay. Anne: Yup. Jay: You're only interested in the final files. Maybe I can also give you the Descript file so that way, if you need to make -- Anne: Changes. Jay: -- changes and tweaks, you can, but you can't make, you can't generate new material. Anne: Well, then they'd have -- Jay: So here's AI Jay. This is Jay. I'm reading a sentence for Anne. She paid me to read this. Here you go. Anne: Oh, yup, yup. Jay: There's my material. 
You provide the audio files. These things are getting a lot of traction. So we actually have the ability to batch export material. And also we have API access for -- Anne: Wow. Jay: -- Overdub, if you want to programmatically do things. Anne: Sure. Jay: So a real example, there's a -- Anne: Wow. Jay: -- creative agency, and they work with one of their voice actors to do a mixture of things that are read real, but then they have a contract with Sunglass Hut. And they want to personalize it to go to your local Sunglass Hut. Anne: Right, exactly. Jay: And they get the address or the town. Anne: Sure. Jay: And so what they actually do, and Descript is not involved in this -- Anne: Right. Jay: -- but they use the tools to programmatically then create all the addresses, so this voice talent doesn't have to read 10,000 different Sunglass Hut locations. And so the voice actor consents to using their voice for that. And often they're the ones like generating on their system -- Anne: Sure. Jay: -- because they want to make sure it sounds right, and it's -- Anne: Well, yeah, exactly. So the client isn't necessarily going in -- they don't have a Descript account, and they're going in and typing it -- in addresses. It would be the talent probably, 'cause you're right. They would tweak it speed-wise or, you know, just so it sounds good. Jay: Right. And it's super flexible. So I would encourage -- Anne: Right. Jay: -- because you know, the voice that you create, you can only create a voice with your own voice. You -- Anne: Right. Jay: We have people that try to upload a Barack Obama voice, you know, try to fake the consent statement, and, you know, the AI is kind of smarter than that. So it can detect that you're trying to fake the system. Anne: Right. Jay: We also have a human in the loop that listens to these consent statements to make sure everything's legit. Anne: Oh, got it, got it. Jay: So we do everything we can to keep this as secure as possible. Anne: Wow.
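The workflow Jay describes here -- expanding one template read into thousands of per-location scripts and then sending each to a voice API -- can be sketched roughly as below. This is a hypothetical illustration, not Descript's actual Overdub API: the template, function names, and the placeholder synthesis call are all invented for the example; only the templating logic is concrete.

```python
# Hypothetical sketch of programmatic personalization: expand one template
# across many store locations, then hand each script to a voice API.

TEMPLATE = "Visit your local Sunglass Hut at {address} today."

def build_scripts(addresses):
    """Expand the single template into one script per store address."""
    return [TEMPLATE.format(address=a) for a in addresses]

def synthesize(script):
    # Placeholder for the real API call the talent's system would make,
    # e.g. an HTTP POST to a text-to-speech endpoint with the talent's
    # authorized voice model. Returned value here is just a stand-in.
    return f"<audio for: {script}>"

if __name__ == "__main__":
    stores = ["123 Main St, Irvine, CA", "45 Ocean Ave, San Diego, CA"]
    for script in build_scripts(stores):
        print(synthesize(script))
```

As in the episode, the point is that the talent reads the template once and keeps control of the voice model; the loop, not the person, produces the 10,000 location variants.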
Talk to us a little bit about ethics, because I know you're one of the early adopters of putting a terms of service and an ethics statement on your website. Tell me about your policies on that. Jay: Yeah. I love that when I joined -- I joined the company at the beginning of 2020 -- there was already an ethics statement in place -- Anne: Mm-hmm, yup. Jay: -- which, which I was really inspired by. So you own, and you control, your use of your digital voice. And this is something we strongly believe in, that users can, you know, create a model that's authorized by you and controlled by you. So that's something that we unwaveringly do not budge from, and it's all based on this recorded verbal consent statement, that kind of grants consent and also helps us verify that you are a real, live, consenting person. So we will not clone voices of the deceased. Anne: Okay, okay. Jay: It's just, it's just a slippery slope. Anne: Yeah. Jay: That's unapproved voice cloning. So unless we have a consent statement. Anne: Oh, okay, that makes sense then why you have a verbal consent statement, yeah. Jay: We have a verbal consent statement, and, you know, uh, and again, people will try to stitch it together with -- Anne: Sure. Jay: -- with words, but the system's designed to, to try to not allow that. And you know, we personally view that unapproved voice cloning -- like if we start making exceptions to this rule, then we're going to get into a world where we're making subjective judgment calls -- Anne: Yeah. Jay: -- about what's ethical and what's not ethical -- Anne: Absolutely. Jay: -- or what's a creative use case. And that's a very slippery slope. So we just want to be very clear and transparent. You have to own your voice. You have to be able to provide a consent statement. Um, we do not clone voices of children or minors. That's also against our terms of service. So if you're under 13, you can't use Descript. Our terms of service prohibit that. Anne: Okay.
Jay: And we really want to stay up on what are the, the latest ethical standards? How are other companies using this? So we're talking to a lot of companies, participating in different membership organizations, to try to figure out, you know, how do we ensure that content is authentic and -- Anne: Right. Jay: -- we're, we're as responsible as possible? Anne: Are you in the process of improving your model? So the AI voices will become even better and better and better with even maybe less data or, you know, even more human-like? Or is there a point where you kind of say, this is the level of -- like, how human do you want it to be? Because I think there's a level there of, if it becomes too human, then maybe there's that one note that somebody says, "wait a minute, am I being duped? Is this, you know, is this a human talking to me? Or is it an AI voice?" Do you have a level of, I guess, humanness for your AI model? Jay: We're going to keep improving it until it is indistinguishable from reality. And there's a lot of podcasts right now where, you know, the sweet spot right now, Anne, is for these contextual edits where a word or a phrase has been fixed in the context of a longer recording. So we're at the point now where hosts are using that on a regular basis, and you can't tell. Like, no one's writing in and saying -- Anne: Right. Jay: -- that it sounds fake. And that's something that even a few years ago, it sounded like -- Anne: Sure. Jay: -- like voicemail phone tree systems, it would stick out. Those are just smooth. They sound great. Where we're going to be going, and what I think is going to sound better and better in the coming years, is this like longer form text-to-speech. Anne: Yeah, right. Jay: So let me give you an example. So this is, this is how Malcolm Gladwell and his team at Pushkin Industries use Descript, and they use this for podcasts and audio books.
So, you know, they're using Descript, the desktop app, to transcribe dozens of interviews and, you know, archive material, and then starting to pull tape, pull selects, and getting the show in like a good rough cut. And then Malcolm Gladwell created his Overdub voice, and he assigned access to his voice to some of his editors. So they can create a draft narration for what the show would sound like with him doing the intro and kind of transitioning between different pieces. And so they can actually do a table read, and everybody can just kind of get on a call, listen to the table read with digital Malcolm, so they can hear how it sounds before anybody entered the -- Anne: Sure. Jay: Now that -- nothing's going to replace Malcolm in the zone saying and introducing his stories as himself. Anne: Right. Jay: And he's going to be like that for a while. Anne: Yeah. Jay: But there's always going to be applications, and it could be for really short commercials. Anne: Yeah. Jay: It could be for no budget audio books where, you know what, I'm just going to throw the AI voice at it. And we're certainly gonna know it's fake, but it's not going to be like listening to Alexa reading audio. Anne: Right, right, exactly. Jay: Because it's going to, it's going to actually have some, have some level of dynamics. Anne: Well, I think as long as the listener, I mean, then it becomes like the consumer, right? And you know, as long as they're aware. You know, I don't have a problem listening to Alexa 'cause I know it's Alexa, and I don't feel like Alexa is trying to dupe me into thinking she's human. And so I feel that same way. If I'm aware, I don't have a problem in certain cases, listening to it, as long as I know. Jay: That's it. And that's also why we want to, if anything, empower creators to have control of their voice. And if they wanna use it for editorial corrections, fantastic. 
If they want to use it for some longer form projects that they don't actually have the time to do or the budget -- their clients might not have the budget to do it -- Anne: Right. Jay: That that's their choice. Anne: Wow. Well, this has just been so enlightening. Woo, thank you so much for talking to me and talking to our listeners and talking about this, this amazing product that just seems to keep going. You guys keep coming up with these really wonderful things. So congratulations on that. Where do you see AI going in five years or even ten years? Jay: I'm super excited about this. Like media production is now actually entering a phase where if you can dream it, it can happen. And we don't necessarily need the expensive studio or the years and years and years of audio or video production training. We just need our laptops. So you and I have both seen this in our careers with, with the move from editing on tape -- Anne: Yup. Jay: -- to digital, and then with PCs becoming so powerful with tools like iMovie and GarageBand that, you know, truly anybody can be a creator, and professionals can work from home. Well, the thing is, there were a lot of advances during this time on other parts of the production process, like filming on smartphones and being able to broadcast and publish on social media, YouTube, and podcast hosts, but all that stuff in between, all the editorial, all the correcting out mistakes -- Anne: Yeah. Jay: -- uh, generating small replacements, re-records, cutting, all that has been painstakingly difficult. Anne: Yeah. Jay: So this is where AI is really stepping in. And this next wave is, is huge, because everybody is going to have access to these tools that make life even simpler, and the next generation of storytellers have never had it so good. Anne: Yeah. Well, that's fantastic. Oh, my goodness. Thank you so, so very much again, for spending time with us today. I'm going to give a big shout-out to our sponsor, ipDTL. 
You too can connect like a BOSS and find out more at ipdtl.com. You guys, BOSSes, have an amazing week, and we will see you next week. Thanks again. Bye-bye. Jay: Bye, everybody. >> Join us next week for another edition of VO BOSS with your host Anne Ganguzza. And take your business to the next level. Sign up for our mailing list at voboss.com and receive exclusive content, industry revolutionizing tips and strategies, and new ways to rock your business like a BOSS. Redistribution with permission. Coast to Coast connectivity via ipDTL.