Voice and AI: Resemble AI with Zohaib Ahmed and Tanja Milojevic



With new AI technology, voice actors might be afraid of deep-fakes, or of someone stealing their voice. But did you know that the same technology may be able to track where your voice is used and flag any deep-fakes? This week, we welcome Zohaib Ahmed, CEO of Resemble.ai, and Tanja Milojevic, voice talent and Community Manager. In addition to offering a variety of solutions for voice cloning, character voices, and other content building with synthetic voices, Resemble AI runs an open-source project called Resemblyzer, which allows detection of deep-fakes and misuse. Listen in as we discuss ethics, accessibility, and the importance of storytelling in AI voices.

Guest Bios

About Zohaib

Zohaib Ahmed is the CEO of Resemble and oversees tech development. His previous experience includes leading engineering teams at Magic Leap, Deepen AI, Hipmunk, and BlackBerry. At Hipmunk, he was the lead engineer of the first AI assistant for travel, built using modern NLP techniques. Zohaib graduated from the University of Toronto with a degree in computer science.

About Tanja

Tanja Milojevic is highly motivated, talented, and dedicated to audio description. Tanja is working with Resemble AI to clone her voice and serves as its Community Manager. Tanja has worked on games (The Gate, Flippd, and others in development), Audible (Baby Teeth), Pseudopod, Podcastle, Podscape, and radio dramas (Edict Zero, What's the Frequency, 11th Hour, You Are Here, A Scottish Podcast, Koach Studios, Electric Vicuna Productions, Campfire Radio Theater, All's Fair, Organism, Greater Boston, Twilight Radio Theater, Misfits Audio, Darker Projects, Brokensea Audio, 19 Nocturne Boulevard, Audioblivious Productions, Icebox Radio Theater, The Grey Area, and The No Sleep Podcast).

Top 10 Takeaways

1. With Resemble, voice actors can choose from three levels of control over use of their voice: No Control, Content Filters, and Full Approval.
2. Resemble has thorough terms of service that spell out voice usage.
3. There are two ways to record for Resemble: record samples on the Resemble website or upload previous materials.
4. Resemblyzer is an open-source project that lets you derive a high-level representation of a voice.
5. Resemblyzer allows for detection of deep-fakes and synthetic voice misuse.
6. Resemble offers emotional gradients that can be added to its AI voices.
7. Resemble can analyze audio, and can edit and supplement a voice actor's work.
8. Voice talent can use Resemble starting at $30 per month.
9. Resemble's marketplace can "score" voice samples based on professionalism, and companies can license these voices on a per-character usage basis.
10. Resemble wants voice actors to have a major role in the industry.

Referenced in this Episode

- Learn more about Resemble.ai
- Visit Tanja's Website
- Recorded on ipDTL

Share This

- Resemble AI believes in compensating voice actors #VOBOSS
- AI voices improve accessibility #VOBOSS
- Passive income is never a bad thing #VOBOSS

Transcript

>> It's time to take your business to the next level, the BOSS level! These are the premier Business Owner Strategies and Successes being utilized by the industry's top talent today. Rock your business like a BOSS, a VO BOSS! Now let's welcome your host, Anne Ganguzza.

Anne: Hey everyone. Welcome to the VO BOSS podcast, the AI and Voice series. I'm your host, Anne Ganguzza, and it is my pleasure to introduce some very special guests who are with me today. First, Zohaib Ahmed is the founder and CEO at Resemble.AI. Zohaib also previously led engineering teams at Magic Leap, Deepen AI, Hipmunk, and BlackBerry.
We also have special guest Tanja Milojevic, award-winning voice talent and community manager of Resemble.AI. Tanja assists in onboarding and supporting voice talent through Resemble's synthetic voice creation process. She has 10-plus years of voice acting experience, and her work ranges from character voices for audio dramas, to short story narrations, to audio descriptions for the blind and more. Zohaib and Tanja, thank you both so very much for joining me today. Tanja: Thank you, Anne. I love your podcast, so I'm very excited to be here. Anne: Oh my gosh. Thank you so much! I appreciate that. So I've been interviewing quite a few companies that produce voices in the AI space for the series. So I'd love to start off by asking if you could tell us a little bit about your company and the products that you offer and what makes you different. Zohaib: Yeah, so we started off, for example, with one core problem in mind. We looked at the computer vision community, you know, all those lucky people who have had Photoshop since 1991, and they've had all these fancy tools, Unity, and you name it, to do all these fancy movies and graphic effects and all sorts of visuals that we kind of take for granted now. And we looked at them and we said, hey, why don't audio people have any of these tools? Why, why are they still stuck with old knobs on a screen that kind of resemble literally what the physical version of that instrument would be in real life, just on a computer screen? And we kind of looked at that and looked around closer and closer. And we found that there were a lot of use cases where extending voice actors would make life a lot easier. And it would also create a lot of interesting applications that couldn't be done before. So that's kind of where we started. And since then we've kind of gone in all sorts of directions, but at our very core, we were building synthetic voices. So our core business is to take any sort of arbitrary speech data that's unstructured, create these high quality synthetics, and then kind of go from there. And over time we've decreased latency. So we're able to generate content within a few milliseconds -- like, seconds or minutes of conversation in a few seconds or milliseconds. We're able to do unique audio editing, where you have like a hybrid text-to-speech where you paste in your own voice and say, in this podcast, I'm kind of blabbing on, and if I wanted to remove bits and pieces, I can do that. Or if I wanted to change a few words, I could do that. And my synthetic voice would kick in in those parts. We found pre-production and post-production value in everything that we're doing. And then within all that, you know, we kind of want to still keep the performance of voice actors. So we want them to not only sound like them, but to be as emotional as they are and to kind of give the performances as they do. So, yeah, we've been around for a little bit over two years now, and we've worked with more than 150,000 users who have created voices on our platform. We try to make it as accessible as possible. That's kind of the big differentiating factor between us and everyone else: from day one, our goal was, well, this is a lot more powerful if we give it to the user and let them clone their voice. Then, you know, Anne can go ahead and click on her voice and figure out, oh, I'd love this on my podcast. And she'll come up with the ideas, and we kind of just facilitate her ideas.
So that's been our, that's been our motivation, has just been like putting it out into people's hands. And we have like 150,000 of them now and growing every day, and all sorts of different use cases have popped up. Anne: Wow. So I want to get in a little bit more into how you create your voices on your platform. But before that, I'd like to ask Tanja a question. Tell us a little bit about your voiceover career and what was it that led you -- some of my voice talent friends would say -- to the dark side? What was it that led you to your interest in AI? Tanja: Well, seeing that I love Star Wars, I think that's a great reference, dark side. That's, that's awesome. So I generally stumbled upon voice acting when I was in high school. I've always been interested in acting ever since I was a kid. I received one Christmas a tape recorder, you know, with some blank tapes at the time, and that was the best thing ever. 'Cause then I started recording stories and everybody's conversations and annoying the whole household in general. So, so that's where that started, and I was always interested in it. And then I had a friend, her and I would do a lot of improv kind of over the phone. Do you remember the days where after 9:00 PM, it didn't matter what phone plan you had, it would be free? Anne: Wow. I don't know if that was the case in my zone, wherever I was. I don't remember that. Tanja: Yeah, here in Massachusetts, that was, that was kind of a thing. You could just call after 9:00. Anne: Oh wow! Tanja: So we would improv a lot, needless to say, and not be very alert for school. But then I stumbled upon, actually at my local library, a talking book, a couple of audio books for school, A Tale of Two Cities and Pet Sematary by Stephen King. And they turned out to be audio dramas. So, you know, full sound effects, story, everything was in audio, and I, I was intrigued. So then I did a bunch of research, and I found out about voice acting and a website called Voice Acting Alliance where anyone could audition. It was meant for amateurs. You didn't have to know what you were doing at the time. So I got a really terrible compressor microphone, Linin from Radio Shack, I think, and started using that and did as many auditions as I possibly could and listened to people's feedback. Everyone was so gracious and welcoming. So I was hooked. And after that, I just started getting into whatever I could with voice acting, anything from audio dramas, which the independent podcasting movement now has a lot of, and they're growing. There are several places to find them, so many amazing stories out there that there just isn't enough time in the day to listen. So I've been involved in that consistently for 10 years. Recently, I'd say about a couple of years ago, maybe a year ago now, I started recording audio descriptions for a couple of different companies, which is a track that's added to a piece of media, whether it's a TV series or a film, and it describes essential costume changes, action sequences, facial expressions, and so on and so forth, meant for the blind and visually impaired to access media on an equal playing field with everyone else. So that's been a lot of fun. I've done a couple of short stories, narration, kind of narration based, playlist intros, radio spots, things like that, tied in here and there. And then I accidentally actually stumbled upon Resemble.AI because I've always been fascinated with artificial intelligence and smart assistants.
Being someone with a visual impairment, I use screen readers all day, every day. My phone has a screen reader. I have, I kid you not, four Echoes in my house. So that gives you an idea of how much I love artificial intelligence and assistants. So I was looking for voice cloning and specifically searching, as a voice actor, how do you become a virtual assistant? And Resemble.AI popped up with voice cloning, of course, with what Zohaib was discussing earlier, the ability and the power that it gives the user, where you can then clone your voice for free. I never saw that anywhere. I've, I haven't come across any other website that allows you to just record and it's there. So I tried it and then I sent in data. I also applied on the form, and I heard back from one of the team and started the dialogue of, hey, I want to do all the voices, like everything. I don't, I don't care what it is, whatever you guys need. And it evolved into this role. So it was a very organic process that was serendipitous for me. So really excited to be here. Anne: Absolutely. That's a really wonderful story. And I saw on your website, I've listened to some of your demos, and I know that you're very passionate about the acting part of voiceover and being able to tell the story. I think you have an affirmation that you said, it's not just the voice that matters. It's how you tell the story. Tanja: Absolutely. Anne: I guess my question going into that is how does AI fit into this? Do you envision an AI voice being able to tell the story as a voice actor would? Tanja: I do. I mean, I think that we have room here to definitely include both, where, for example, a client might want to use the artificial voice for a smaller character, but then they might want to, down the line, hire that voice actor, if they're readily available and also not remaining anonymous, which is an option for our voice talent as well. Then they might get additional work with that company depending on what the needs are, especially if they are someone who's, who's available and responsive, and the company likes their voice anyway. But that said, the emotional gradients that we have, that we offer at Resemble really allow the developers to add these emotions to the clips that they're generating, whether it's sad, angry, caring, happy, scared, et cetera. There is a lot of customization that's available for the developers. And we do take feedback very seriously. Anyone's feedback for improvement, we're ever evolving and improving. So with just the rapid changes in AI technology in the last decade, I'll be looking forward to seeing how much more realistic and how much more powerful neural TTS voices will become over time, since it's literally, it's just a matter of time and of data crunching. Anne: Excellent. Zohaib, can you talk to me a little bit about how voices are created on your platform? Zohaib: Yeah, so we, we wanted to make it as simple as possible. So there's, there's two ways of creating voices. The first option is anyone could go on our web platform, record 50 sentences. Typically these sentences are fairly short, five to eight words each, and after 50 sentences, we will build the voice for you in the next 15 to 20 minutes. There you go. There's your voice. And the second way is you upload some sort of unstructured data.
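[For developers following along: once a voice exists on the platform, generating speech with it is an API call, and Resemble publishes a Python SDK (pip install resemble). The sketch below is a minimal illustration based on the SDK's public README, not a guaranteed-current interface -- the API key and UUIDs are placeholders, and signatures may have changed since this episode aired.]

```python
# Illustrative sketch of generating speech from an existing Resemble voice,
# based on the resemble-python SDK's public README. All IDs are placeholders.
from resemble import Resemble

Resemble.api_key('YOUR_API_KEY')       # from your Resemble account settings

project_uuid = 'YOUR_PROJECT_UUID'     # the project that holds your clips
voice_uuid = 'YOUR_VOICE_UUID'         # the synthetic voice built from your recordings

# Synchronously render a clip; the response includes a link to the audio.
response = Resemble.v2.clips.create_sync(
    project_uuid,
    voice_uuid,
    'This sentence was never recorded by the original speaker.'
)
print(response)
```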
So you can imagine a lot of talent that we work with, and a lot of customers that we directly work with, they're sitting on top of data they've previously recorded that conveys a certain emotion or a certain style of speech. So it allows our folks to kind of create domain-specific voices. So for example, if Tanja is doing a voice for a telco on like IVR -- so it's like you pick up the phone, and you call Verizon, and it kind of talks back to you, or a synthetic voice talks back to you -- that kind of data set is very different than if Tanja was doing like an audio drama. Anne: Right? Zohaib: She needs to be a lot more emotional, and it's a completely different performance. So we try to capture exactly the kind of data or the performance that we want to reproduce. To your earlier question, in terms of emotion and inflection, the model, the AI, is built in a way such that it tries to predict the right emotion or the right inflection, given a few words or whatever you're typing in. So that's a couple of ways of doing it. Once we ingest that data, typically the 30-second technical pitch is: it consumes the audio as well as a transcription of that audio. And then it tries to learn what the transcript would output. And at the end of it, once it's learned a mapping between the words and the audio -- and the reason we could do it at 50 samples, and not 20 hours like it used to be, or 100 hours like it used to be, is because the model already has a notion of English. So you don't need to provide it with everything. It's just -- Anne: So you have a base model. Zohaib: -- we try to cover most of the phonetic -- exactly. Anne: Ah, okay. Zohaib: We do that across different languages. So if you wanted Tanja to speak Spanish, and assuming Tanja doesn't speak Spanish -- Anne: Right. Zohaib: -- we can record her English and then get her voice to speak Spanish, because the AI has learned some notion of Spanish already. And then during prediction time, you give it text and then you say, hey model, if this Tanja synthetic voice was generating this particular text, what would this audio look like or sound like? And it tries to make predictions. That's what most machine learning does. It's just prediction at the end of the day. Anne: Right. So then if people are creating voices on your platform, are you using those? Is that data that gets fed in for the machine to learn from, in order to create that model or -- Zohaib: No, no. Yeah. So the, the underlying model is -- it's stagnant. Anne: Oh, okay. Zohaib: So like we freeze it in time and then we don't append data to it. And the reason we don't append data to it -- that's like live data that's being adjusted -- is, typically in any stream of machine learning, bad data, even a little bit of bad data, hurts you significantly, so -- Anne: Oh, interesting. Zohaib: -- you don't want to pollute the data at all. So you kind of want to -- you know, it's like if you were building an application for measuring house prices, and then all of a sudden, you started sneaking outliers into that dataset, the predictions would get worse overall because of these outliers. So we don't include any user data in any of these models. Those are all like custom models where we've collected data ourselves specifically for that task. Anne: Gotcha. Zohaib: And those are like stuck in time. Anne: So that's been done. And so there's no other mo -- I'm just curious. So there's no other models that will improve that model. Is that correct?
Or is there no other information that can be added that would improve the model? Or are you maybe continually trying to improve it, or no, you're, you're good with this model? Zohaib: No. So we're constantly trying to improve it. There's two ways of improving the model. One is like where we lack data. So for example, if we understand that our model used to struggle with deep voices, deep male voices -- Anne: Okay. Zohaib: -- we would go ahead and be like, oh, that's because we didn't collect enough -- Anne: Enough. Zohaib: -- of this kind of data. So we'll go ahead and try to fill in those gaps and see like, okay, what else are we missing now? So we can always try to improve it with data. Anne: Gotcha. Zohaib: But then some of the bigger advancements, the bigger improvements, occur due to just architecture changes. So just recently we've done things like produce audio at 44 kilohertz; most, if not all, text-to-speech engines produce it at 22 or 24, but that's like an architectural change in the model that produces better results. So yeah, there's only really two avenues to go. Either you feed it more data and see if you could tweak things, or, once you get stuck there, you look at the architecture of the model, and you say, well, what are we not able to do better? Are we not able to render higher frequencies? Are we not able to predict certain emotions really well? Did we struggle with a particular accent, et cetera? Then you kind of adjust the architecture from there. Anne: So, that's interesting. So it leads me to think about, let's say if I were to produce my own voice, my emotion, right, or my model's emotion, maybe would be completely different than somebody else's, or maybe the model's emotion or inclination toward that emotion. Is it correct to assume that, and would it be better, if I wanted a model of my voice, to upload more data? Zohaib: Yeah. So we definitely do that. So the fundamental model is built up of all sorts of accents and a variety of data. What we've seen is the dataset that we provide you to record typically captures -- it's like phonetically balanced. So it captures like most of the phonemes that you're -- that we speak in. So that kind of gets us really far, but we have had scenarios where we -- I, I recall one with this company out in New Zealand that, you know, recorded or sent over some data. And when we generated it, to my non-oceanic ears, it sounds good. It sounds like, yes, like people in New Zealand sound. And then we sent it over to them and, you know, there's like all sorts of like, oh, but in New Zealand, we don't do -- Anne: Right. Zohaib: -- the, the R's like that. That's how Australians do the R's, and that accent is slightly different than the Australian. So when you get really integrated, then yeah, we need to collect more data from that kind of source. Anne: Interesting. And is it also, let's say, for example, as a voice talent, right, I want to have my human voice, right, that I use and get paid for, but I also want to have my AI voice available, and maybe I want to do a lot of, I want to do IVR systems. I do a lot of them now anyway, figuring that that's going to be one genre that's going to utilize AI voices. If I were to give you data that I've already recorded, where I've done a lot of phone voices, would that make a better phone voice for me versus, let's say, something like, I might do some acting and do some more dramatic emotional stuff?
So if I wanted to create an AI voice for IVR systems, I would maybe give you more data that would be inclusive of that type of read, versus maybe I could have another AI voice that would be my, you know, my more dramatic voice that could be for video games or whatever, and then work more on the emotional aspect of the data. Tanja: Yep. Yep, exactly. Zohaib: Absolutely. Everything is domain specific. Anne: Got it. Zohaib: So you give us data that's like IVR, it produces a better IVR voice. Anne: Right. Yep. Got it. Now I assume that, you know, I can't create my own voice for free there, or can I? That would be -- I don't imagine it's free to create my own voice, like, or an accurate representation of my voice? Zohaib: Yeah. So if it's like for a custom data set that we're ingesting, it's no longer free -- Anne: Got it. That makes sense. Zohaib: -- because there's some sort of pipeline that we put you through. Anne: Sure. Okay. Zohaib: But yeah, there are different -- depending on which aspect you're coming from, whether you are a VO, you're a voice talent, or whether you are a company that just happens to have voice data, the pricing kind of varies from there. Anne: Well, I always found it interesting because I literally, I've been doing my VO BOSS podcast for four years. Literally I could give you probably the most conversational aspects of my voice if I were to just give you all that data. And that would create a very conversational Anne for an AI voice, I would hope anyway. Zohaib: Yep. Exactly. Tanja: Right. Anne: Okay. Right. Okay, so let me ask you a question. Do you sell AI voices as well? So let's say somebody doesn't necessarily want to create their own voice. If they want to use your platform for, you know, creating audio files, can they use your voices at a, at a cost? Zohaib: Yeah. So we do have a marketplace. Anne: Okay. Zohaib: So we, we invited a lot of voice talent. And actually, if you just go and build a voice, we -- it rates, we rank your voice in some way, or we score your voice in some way, depending on what kind of microphone you use and what kind of setup you've got. Anne: Okay. Zohaib: So we do invite people to add their voice to this marketplace. And then from there on, those voices are available for selection to our customers. Anne: Got it. Zohaib: So our customers can say, oh, I really like Anne's voice. And they'll click on your name, they'll listen to a sample with your synthetic voice and be like, oh yeah, that's, that's exactly kind of what I want. And then they could propose a project to you -- Anne: Got it. Zohaib: -- and then we kind of facilitate that kind of agreement. Anne: So then my next question is, is there compensation and usage for any, let's say, company that wants to use my voice, and is it on a per job basis? Zohaib: Yeah. So the way that we look at it is, it's on a per-character usage fee. Anne: Okay. Zohaib: So we do compensate voice actors on a per-character level. Anne: Okay. Zohaib: So if you are building an IVR voice, as that customer uses your voice more, we show you exactly what that character usage is, so you can track it. And then there's some sort of compensation at the end of every month. Anne: Oh, okay. So it's a monthly thing? It's not necessarily based on per job? Tanja: Right. It's just how many characters were run through your -- Anne: Got it. Tanja: You can have multiple voice models, and maybe clients are using all of them, different clients.
It would all simply be based on how much data these clients were running through these voice models collectively on a month-to-month basis. And the royalties would come out of the characters generated. Anne: Got it. So then if my voice were recognizable, right, and it was a great AI voice, I would be concerned as a voice talent that maybe the usage of that voice might not be in alignment with my brand, right? Or maybe they're using it for something that I would not necessarily be aligned with with my brand. Is there any sort of, you know, job control in that respect? Zohaib: Yep. So we have three levels of control that we offer at the moment. So the first is no control. We have plenty of voice talent that do like impersonations of voices, or they do like really specific character voices that aren't really tied to them in any way. And they're okay with anyone doing anything with those voices. So if you're making a game, the character in that game could be anything. It doesn't really matter to them. So that's like one level, I guess. Anne: Okay. Zohaib: I'll jump to the most extreme level, which is, we also have the ability for the talent to completely make it a manual process, and make it so that there has to be a project description, and that it has to be accepted. There have to be sample lines that are discussed with or shown to the voice talent, and then the project executes. And then there's like a category in the middle, which is kind of exploratory right now. But it's something that I think not only voice AI, or hopefully not just us, but other people are also trying to do, which is something we call a content filter. Anne: Yeah, mm-hmm, yeah. Zohaib: So the idea is that we're able to detect with text whether something is political -- Anne: Yeah. Zohaib: -- whether it's not safe for work, and we're automatically able to prevent that from ever being generated. So at the moment, it errs on the side of caution. So there are a lot of false positives, but that's because we want to be extra safe for that scenario -- Anne: Sure, absolutely. Zohaib: -- where, you know, if you mention Donald Trump on there -- Anne: Sure. Zohaib: -- it'll most likely say, hey, that's political -- Anne: Right. Zohaib: -- and doesn't want you to say anything. Anne: Or if there's -- yeah. Or maybe there's swear words or, you know, words that I would never say myself -- Tanja: Right. Anne: -- wouldn't be represented -- okay. Very interesting. So tell me a little bit about, I saw something on your website about Resemble Protect. What is that, what does that do for us? Resemblyzer? Is that what it -- Zohaib: Yeah, exactly. So that's an open source project that we have. It's, it's on GitHub, github.com. Anne: Is that what you were just describing to me? Was that the Resemble Protect or? Zohaib: Nope. So yeah, Resemble Protect [inaudible] it's the same thing. Anne: Got it. Zohaib: The idea behind Resemblyzer was, when we first started, there were a lot of components that kind of build up our voice model. So you can assume that since we're detecting emotion, we can also offer just emotion detection as a service; since we're detecting different sorts of languages and different sorts of voices, we can kind of offer -- like we do some sort of fingerprinting to identify or disentangle your voice from the text that you're reading or your accent, et cetera. So Resemblyzer basically is this open source package. It ships with, or it comes with, like, this pre-trained model.
So what that means is, as a user of that open source or free package, you don't need to train anything. There's no ML work to do. You don't need to buy compute or have powerful computers. Because the open source project is basically our way of looking at this problem of speaker identification and deepfake protection, and basically looking at it and saying, like, this is a problem for everyone to solve. And this is a machine learning network that's able to distinguish between fakes and reals. And we kind of put it out in the public because it's not our core product or anything. So we're like, well, let's put this out there and see what other people can do with it. So we've had people who are training that model with a lot more data than we trained it with, so like different languages, et cetera. Anne: Sure. Well, that's amazing. I think that's really wonderful. And that is an open source project that you began and put it out there? Zohaib: Exactly. Anne: That's really great, because I've always said that there's gotta be some sort of a way for us to figure out where is our voice being used. I mean, we have enough problems as it is, and I'm sure Tanja can identify -- we don't know if our voice is being used, you know, in another region or, you know, another campaign that maybe we didn't agree to in the first place. I mean, that's always been -- Tanja: Right. Anne: -- you know, something that voice artists have been concerned with, is usage. And there really hasn't been a way that I'm familiar with, outside of some other voice talent saying, "hey, I heard your commercial in California. I thought you said it was only for the east coast." And so that's really how we found out before. So I would think with AI voices, I would hope, that there would be technology that would allow us to figure out where is this voice being used. And also, I guess my question would be, does that also take care of -- let's say, 'cause you have a model, right? I can speak Spanish, even though I don't. So is there a way, are there hybrid models of, of AI voices, like two or three different people, and then you create a whole new voice? Is that a thing, to create new voices like that? And then if so, how do you know if your voice is involved in there? Zohaib: Yeah. So we've been experimenting with a technique that kind of does what you just described, which is like blending voices together. We've been using it largely for a different purpose. So we work with customers who are trying to get really particular pronunciation of words or a really particular performance, but they want to keep the original voice that they're, that they're using, or the target voice. But that target voice -- we just don't have enough data in that target voice to get that kind of pronunciation. So typically what we've done in the past is, say we're using Tanja's voice, and her voice has to say something in particular, and you want to control exactly how she's pronouncing those words. What we typically do is augment her data with someone else's data, like your data. And then we kind of blend some of it together, enough that we can disentangle the way that you're pronouncing the words versus how she's saying them. And we can blend different aspects of voices together that way. But yeah. Anne: Is that, is that like a separate model? Like, you know what I mean? Like is that like a new model that you've generated? Zohaib: You can think of it as a new model. Anne: Okay.
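[Resemblyzer, discussed above, is a real open-source package (github.com/resemble-ai/Resemblyzer), and Anne's question -- how would I know if my voice is in there? -- maps onto exactly the kind of check it enables. Below is a minimal sketch using the package's published API; the file paths are placeholders:]

```python
# Speaker-similarity check with Resemblyzer (pip install resemblyzer).
# File paths are placeholders for a known reference and a suspect clip.
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # loads the pretrained speaker-embedding model

reference = encoder.embed_utterance(preprocess_wav(Path("my_reference.wav")))
suspect = encoder.embed_utterance(preprocess_wav(Path("suspect_clip.wav")))

# Embeddings are L2-normalized 256-dim vectors, so their dot product is a
# cosine similarity: values near 1.0 suggest the same (or a cloned) voice.
similarity = float(np.dot(reference, suspect))
print(f"voice similarity: {similarity:.3f}")
```

[Roughly this comparison-against-a-reference is what the project's own fake-speech demo does; a production detector would of course be more involved than a single threshold.]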
Zohaib: You can think of it as a new model, but -- so when we say model, it's always kind of weird, because models are comprised of models. Like there's a model that just does emotion detection. There's a model that just does pitch detection. There's one that's just looking at languages and making sure it's conditioning on the right languages. There's one for, like, gender, et cetera, speaker. And these are all like disentangled pieces. These are like blocks, and you could put one block with the other block and get something different, if that makes any sense, but they're fairly modular by design. Anne: Okay. Okay. Well, I do want to mention that I did check your website for any terms of service and ethics policies, and I just want to make sure that my BOSS listeners know that yes, I found a wonderful page on that. And I really liked that you had a statement that said once your voice is created, that we own all the rights to that voice, and that you don't use that voice data to train other models, which is something I've not seen on some other websites that generate AI voices. So I appreciate that, nor do we resell the voice data to third-party companies. So yeah, I appreciate that you have that page. And I just want to kind of put that out there, so BOSSes know I like to work with companies that are concerned with ethics in this. It's kind of a crazy time for us, as a lot of voice talent are fearful that they're going to be losing a lot of work. So with that, I think my last questions to you guys would be, first of all, Tanja, where do you see the future of voiceover, let's say, in five years? Tanja: That's a great question. So I really do think that we are at the beginning of a new market, but that said, I don't think that we're in danger as voice talent of losing our jobs. And that's just because AI is going to be very powerful, and companies are going to want to utilize it, because for them it's less paperwork. And maybe these are companies that don't necessarily go out on Fiverr or voices.com or look at people's websites to find talent. They just don't have the bandwidth, or they would rather just work with a third party where they have a selection of voices already available, and they just sign an NDA or something, and then they move forward. And so these companies are normally, in my opinion, not places we would be searching for clients anyhow -- Anne: Yeah. Tanja: -- or not clients that are readily available to us in our search, auditions, or working with our agents or what have you. So that said, a lot of these places are going to be using AI, in my opinion, mainly for IVR, conversational apps -- all these new apps that are coming out continuously -- virtual assistants, probably smaller characters in games, non-player characters, and even e-learning. I think e-learning is going to be huge. Or even folks making their websites more accessible to all by having an AI read their blogs, or maybe customizing ads. Anne: Right, sure. Tanja: So you, you get the idea. It's like, it's endless what the use cases could possibly be. I think that this is a nice blend, a nice way for voice talent to get additional marketing that maybe they would have to do themselves. And it would take longer. They'd work with new clients that maybe they wouldn't have sought out initially.
And also their voice would be used ethically, I would say, in five years, because I think all the companies that are not paying attention to ethics right now -- in five years, we're going to know who they are. There's only so far you can go -- Anne: Yeah, yeah. Tanja: -- before your -- somebody's going to come out, concerns are brought up. So yeah, I really do think, you know, we'll be more educated then as voice talent on what the market is. And I think if we get in on the market now, we'll be there when it's saturated in the future; we'll still be there. So I don't see any problems or concerns myself as a voice talent. We just need to educate ourselves and really ask those hard questions and make sure that it is something that we're comfortable with, and then move forward. Passive income is never a bad thing. Anne: Well, I agree with you there. [laughs] Zohaib, where do you see the future of AI and the future of Resemble in five years? Zohaib: Yeah, so Tanja kind of mentioned all sorts of interesting concepts that we might see in the future or we're already seeing now. So that's, that's one thing. So there are a lot of interesting things that we can do. And honestly, I'm in the camp that I don't quite know exactly what the answer to that question is. I can make really good predictions, but a lot more people are a lot more creative than I am. So they will figure out where to use this. So already we have some interesting use cases within EdTech and within banks -- like old companies that you would think would never do anything innovative, but since they have the solution now, they're able to be a bit more flexible with what they can do. But in the long run, I think a good parallel to look at is, if you look at how movies were made, and especially like visuals. So you went from this world of actors to like stunt actors perhaps, or from stunt actors, actors, whatever you had for a long time, we did that little dance. We went to like a green screen -- Anne: Yeah. Zohaib: -- you know, which is more recent. Then we had technology like mocap that came out, where you just wear a suit, and then you kind of just do the action, and maybe the face stays still. And that's, that's the rest. There's a, there's a movie on Netflix called The Irishman -- Anne: Yup. Zohaib: -- which they take, I believe, Robert De Niro and make him older. And those are like all things that, you know -- that's a very good path to look at when you're looking at audio, because we're lucky in audio that we're kind of the last people that people think about. It's kind of like the addition at the very end, like, oh no, we needed someone to do the voiceover. And what I think we want to do is kind of become like the first thought that people have, is like, oh, this is a core part of the -- Anne: Sure. Zohaib: -- entire movie or the entire product. Anne: Absolutely. Zohaib: And you have a lot of these scenarios where you have these gorgeous looking movies that get like really high budgets, and the, the dialogue and the VO is just so underwhelming, because it's such a last minute effort to piece that together. And sometimes the writing and the dialogue doesn't kind of go hand in hand, or sometimes there's just not enough time to improvise as voiceover talent. And we kind of want to change the way that works. And I think like in five years or so, you will start seeing a lot more experiences that will involve AI voices. And I'm not sure if those experiences are what we see today.
Perhaps you'll start seeing dubs of movies and films where the original voiceover talent is -- that type of voice is kept and the character is still kept, but they're now speaking Mandarin or Japanese or Spanish, without worrying that, hey, it sounds like a different character, which can be jarring to the entire experience. Anne: Right, right. Zohaib: So yeah, there's, I think there are plenty of things, but if you just take a peek at the computer vision and the visual world, there's like a pretty clear path, or some sort of a vague path, where AI voices can go as well. Anne: Well, you guys have been so gracious. Thank you, both, so very much for joining me today and having such an interesting conversation. Where can my listeners go to find out more about you guys? Zohaib: Yeah. You can go to www.resemble.ai. We're also Resemble.AI on practically every social media thing out there. You can email any of us. You could get all of us if you email team@resemble.ai. That'll send an email to everybody. Yeah. You can find out more. Anne: Well, thank you, guys, again so very much. It's been a pleasure having you here. I am going to give a great, big shout-out to my sponsor, ipDTL. You, too, can connect like BOSSes and find out more at ipdtl.com. You guys, have an amazing week, and we'll see you next week. Bye! Zohaib: Bye. Thanks for having us. Tanja: Bye, thank you so much for having us.

>> Join us next week for another edition of VO BOSS with your host Anne Ganguzza. And take your business to the next level. Sign up for our mailing list at voBOSS.com and receive exclusive content, industry-revolutionizing tips and strategies, and new ways to rock your business like a BOSS. Redistribution with permission. Coast to Coast connectivity via ipDTL.

CONNECT + FOLLOW
Twitter: @vo_boss
Instagram: @vo_boss
Facebook: /VO BOSS
YouTube: VO BOSS

SUBSCRIBE
YouTube: https://www.youtube.com/c/VOBOSS
Spotify: https://rb.gy/meopx8
Apple Podcasts: https://rb.gy/chdamm
Amazon Music: https://rb.gy/luw83x
Google Podcasts: https://rb.gy/koc3ls
Stitcher: https://rb.gy/hslkgj
TuneIn: http://tun.in/piZHU
iHeart Radio: https://rb.gy/uixh90
Pandora: https://rb.gy/knoz7c

SPONSORED BY
ipDTL: https://ipdtl.com
Anne Ganguzza Voice Productions: https://anneganguzza.com