This podcast series is aimed at educating healthcare services and insurance companies about HIPAA compliance in the context of development and test data. Each episode focuses on key aspects of compliance, addressing technical, legal, and ethical challenges.
George Barroso
There is. I mean, obviously, expert determination requires a lot more analysis. And, you know, when the data set changes, like if you start to capture extra information or you want to send something different, you have to go back and re-examine that. So, it is a little bit more work to maintain there, which isn't to say that Safe Harbor is no work. There's still work involved there, but it's, I guess I would call it, a little bit faster and easier to implement. Okay. Right? Because, you know, there are tools you can get off the shelf to help you do this sort of thing, and you just go in and you change all the data. Got it. And you know you're changing all the locations where the names are and the addresses are. And if you do it for all 18 identifiers, then, you know, you're considered safe. Got it. So, you can go and do that. It's a straightforward approach. If you've got a lot of databases, you know, you're a big organization, you've got 100 databases, do you want to have 10 experts sitting around for, you know, two weeks to specify a data set?
Nick Shuart
Right.
George Barroso
Right? They kind of make that judgment call. If you need to share data out, then, you know, that kind of leans more towards expert determination method because you're sharing it and you want to get it back and maybe get some results and look at the analysis. That's a little different.
Nick Shuart
You're right.
George Barroso
But if you're just talking about pure, like, development testing and, you know, you're not as worried about the data utility, then Safe Harbor.
Nick Shuart
Okay.
George Barroso
Faster, easier, faster implementation.
Nick Shuart
Okay.
George Barroso
I'd say easier because it's always different.
Nick Shuart
Gotcha. It's kind of the click button for obfuscation there.
George Barroso
Right.
Nick Shuart
Interesting. Okay. That makes sense. So what do you think are the advantages to expert determination as opposed to Safe Harbor outside of, you know, the simplified use that comes out of Safe Harbor?
George Barroso
The advantages of expert determination, I would say: because you're not changing data, you don't lose any of the richness or the utility of it. Right. And what I mean by that is if you're going to go in and you're going to change addresses and you have to change zip code a specific way, you know, then you're going to have probably a little bit of what I would call blurring of, you know, the kind of clear geographic zones in which the data sits. So for customers that are doing something regionally or, you know, if they have some specific kind of logic around the zip code, and, you know, an example was we had a healthcare client that wanted to make sure that, in the masked data, the clinics all needed to be within a one-and-a-half-mile radius of the patient's original address. True. When you're changing the address, now you have to change all the clinics.
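The "change zip code a specific way" point can be made concrete with a short sketch. Under Safe Harbor, a ZIP code may be reduced to its first three digits only when the area covered by all ZIP codes sharing those three digits has more than 20,000 people; otherwise it is replaced with 000. The Python below is a minimal illustration, not a compliance tool: the function name `generalize_zip` and the contents of `LOW_POPULATION_PREFIXES` are assumptions made for the example, and a real list would have to come from current Census data.

```python
# Placeholder set of 3-digit ZIP prefixes whose combined population is
# 20,000 or fewer. These values are illustrative assumptions only; a real
# implementation would load the list from current Census figures.
LOW_POPULATION_PREFIXES = {"036", "059", "102"}


def generalize_zip(zip_code: str) -> str:
    """Reduce a ZIP or ZIP+4 code to a Safe Harbor style 3-digit value."""
    # Keep only the digits so "03601-1234" and "03601" are handled alike.
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        raise ValueError(f"not a valid ZIP code: {zip_code!r}")
    prefix = digits[:3]
    # Sparsely populated prefixes are collapsed to "000".
    if prefix in LOW_POPULATION_PREFIXES:
        return "000"
    return prefix
```

Collapsing five digits to three is the "blurring" of geographic zones described above: any logic that depends on a precise location, such as a clinic within a mile and a half of the patient, no longer has the detail it needs.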
Nick Shuart
Right.
George Barroso
Attending.
Nick Shuart
Yeah.
George Barroso
That gets very complicated to do with the Safe Harbor method. Right. But if you can go and remove all the pieces that allow you to identify who that data belongs to, then you can leave those zip codes and the addresses. I mean, addresses probably not, because you can look them up in the phone book. Property records, et cetera. But the zip code at least, right, doesn't need to change if there's nothing else to go along with it that would identify the person. Got it. Does that make sense? Yeah, that makes perfect sense. In terms of data utility, you know, expert determination, it gives you maximum data utility, but it does take effort to get the data set to that point where it's safe.
Nick Shuart
Okay. I think I'm picking up on your point.
Jan Diana
You're downing it. That definitely makes a lot of sense. Essentially, it just depends on how you want to utilize that data once it's being masked, right?
George Barroso
Right. Cool. I mean, expert determination is not really, I don't want to say it's a bad idea for development and testing. It's just an expensive idea for development and testing. Right. Unless it's, you know, I mean, if your data set contains two columns or three, you know, a half a dozen fields and none of them are PHI, it's probably easy. It's not PHI.
Nick Shuart
Yeah.
George Barroso
There's no PHI identifiers in it at all.
Nick Shuart
Okay.
George Barroso
Then maybe that's easier, right? But, you know, in development and testing, we're going to have lots of databases. You know, we have some healthcare customers with billions of rows of data. How are you going to pore through that statistically, right, as an expert, and make sure that all those records are safe?
Nick Shuart
Right.
George Barroso
I mean, you're not going to share a billion rows of data either.
Nick Shuart
Yeah.
George Barroso
But you see my point.
Nick Shuart
Yeah, absolutely.
George Barroso
It's going to get more costly. Some of you, I don't remember who it was, which one of you guys asked about the AI, though. I think it was Jan. So, Jan, you asked about AI, whether AI could be the expert. Correct. Potentially, yes. But you still need an expert to train the AI. Right. Here's all the relationships, right, in the data. Although there are some groups that are looking at, at least in healthcare, they're looking at can the AI find patterns that we don't see, right? Yeah. So, you know, there may be a path forward there, but I don't believe that there's any AI out there right now that can do that.
Nick Shuart
Okay. But if it was implemented.
George Barroso
Yeah, they could be training it right now. I don't know what, you know.
Nick Shuart
Yeah, of course.
George Barroso
What they're up to.
Nick Shuart
Yeah. And that could definitely help out with billions of lines that you would have to go through.
Jan Diana
Absolutely. It would definitely streamline the process, at least.
George Barroso
It would, yeah. You definitely need to, I mean, you would need to train the AI, given enough, I mean, billions of records is enough. Right. But yeah, there could be some that are, I don't know of any right now, but I'm sure that somebody is looking at that.
Jan Diana
I mean, it's just a lot of data for them to comb through in order to get to that point.
George Barroso
But then you have the other problem. Now you're putting all this unprotected PHI into an AI. Right. Right? So, are you running that internally?
Nick Shuart
Right.
George Barroso
You're not sending it to ChatGPT.
Nick Shuart
Yeah.
George Barroso
These other, you know, common AIs that are out there.
Nick Shuart
You would hope not.
George Barroso
Yeah. Right? I mean, there might be some, but you wouldn't want them, you wouldn't want your protected health information being fed into one of these generative AI models, or even the DeepSeek that, you know, or whatever. Right. So, there's that concern as well.
Jan Diana
Okay. That's perfect. That actually brings us to the next topic, which is interesting, which is just the challenges in managing the development and test data. Why is it important to minimize the use of PHI in development and testing? Right.
George Barroso
Well, the obvious things are, you know, besides just violations of HIPAA, breach. Breaches. Right? Right. Breaches are... ...not good. Yeah. They're common. It seems like you see them in the news all the time. Right. But what happens when you see a breach? Right? I think in some countries, there are much more stringent kind of rules... Right. ...and penalties around that sort of thing. I mean, it's starting to happen here too. But, you know, you see it. There's breaches. You know, it's like every other week, I get a letter that says that my data was shared or somehow somebody accessed it.
Nick Shuart
That was my bad. I'm sorry about that.
George Barroso
And, you know... I got you on the podcast. They pay for LifeLock or whatever that credit monitoring service is. And, you know, basically, they just monitor your credit for a couple of years. Right?
Nick Shuart
Right.
George Barroso
And breaches are bad because there could be penalties. So... There could be litigation. Right? And the worst is reputational damage. Of course. Right? I mean, if you've got a reputation of being a company that's been breached multiple times, that's not a good look, right? Hire a PR firm.
Nick Shuart
Of course. Yeah, yeah. Okay. I understand that. So, would de-identification be the only way to kind of mask the data? Are there other...
George Barroso
No, there are some. So, there's pseudonymization. I think I pronounced that right.
Nick Shuart
Yes.
George Barroso
Like a pseudonym. It's a very long... I always use tokens.
Nick Shuart
Okay. That's really what it's... That rolls off the tongue a lot easier.
George Barroso
Yeah. Right. So, essentially what it is, is you're going to replace the sensitive data with some sort of a token. That token can be an alphanumeric string. It can be just a number, you know, depending on how you want to tokenize that data. And there's tokenization solutions out there that use different kind of combinations of those. But essentially what it is, is it just replaces the actual data with something else. So, instead of seeing George Barroso, you see, you know, one, two, three, and four, five, six.
Nick Shuart
Okay. Right.
George Barroso
To use a dumb example.
Nick Shuart
But I can see that being implemented in the healthcare industry pretty easily with tokenization.
George Barroso
Yeah. So, tokenization specifically for healthcare is used. It's very common. I see a lot of our clients use it when they need to share data with third parties. Got it. And they need to reverse it back. And I know we talked about expert determination. Right. Do that.
Nick Shuart
Right.
George Barroso
And I kind of introduced it a little bit. But essentially, all our healthcare clients that are sharing data out to third parties are generally, instead of doing a masking or a de-identification, you know, which many times and many tools, it's one way because you don't want to be able to reverse it.
Nick Shuart
Right.
George Barroso
They'll use tokens instead and say, okay, instead of George, it's going to be one, two, three, instead of Barroso, it's four, five, six. They're going to send it out. Whoever they're sending it out, presumably, whoever gets it presumably doesn't need the name to do whatever analysis they're going to do. Then they send it back. And then when, you know, the healthcare company gets it back, they just take the tokens and reverse it back to the name. And now they've got the analyzed data. Oh, wow. Right. That's cool. So that's one of the other ways that we've seen healthcare protect the data while it's, you know, being shared out. And because it's a token that's arbitrarily assigned, there's no risk of re-identification.
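The round trip described here (tokenize, send out, get it back, reverse) can be sketched with a small in-memory lookup. This is a hedged illustration, assuming a hypothetical `TokenVault` class invented for this example; it is not how any particular tokenization product works, and a production vault would need durable, access-controlled storage that never leaves the organization.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping real values to random tokens."""

    def __init__(self) -> None:
        self._value_to_token: dict[str, str] = {}
        self._token_to_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps the same way.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = secrets.token_hex(8)
        while token in self._token_to_value:  # guard against a rare collision
            token = secrets.token_hex(8)
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only the holder of the vault can reverse a token.
        return self._token_to_value[token]


def tokenize_record(vault: TokenVault, record: dict, fields: list[str]) -> dict:
    """Return a copy of the record with the named fields replaced by tokens."""
    return {k: vault.tokenize(v) if k in fields else v for k, v in record.items()}


def restore_record(vault: TokenVault, record: dict, fields: list[str]) -> dict:
    """Reverse tokenize_record on a record that came back from a third party."""
    return {k: vault.detokenize(v) if k in fields else v for k, v in record.items()}
```

The third party only ever sees the tokenized copy, and can add its analysis results as extra fields. When the record returns, `restore_record` puts the names back while leaving those new fields untouched.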
Nick Shuart
Right. Right. Gotcha. So essentially, it's like synthetic data?
George Barroso
I wouldn't, no. I think synthetic data is a little different. So synthetic data does not have an actual real value as the basis.
Nick Shuart
Okay.
George Barroso
You know, when you're talking about a synthetic data solution, what you're talking about is, you know, I want to go, I know I have this medical database. We've got medical record numbers, patient IDs, various other, you know, PHI. I have nothing. It's empty. And I want to populate it with data that will treat it.
Nick Shuart
Gotcha. Okay.
George Barroso
And there are companies and tools that do specifically that.
Nick Shuart
Okay.
George Barroso
Right? They generally don't do masking, although I think some of them are starting to kind of delve into that. But typically, when you get, when you are looking to synthesize data, you're not necessarily doing it from a set of production data.You're just generating.
Jan Diana
You're just generating, creating something. Right. So it's totally random. It's not, for the most part.
George Barroso
Right. I mean, you know, that's a challenge also because with synthetic data, the challenge there is there are relationships between all of these different tables. Right. In your system. How do you generate all the appropriate, you know, kind of child rows, if you will? Like if I'm going to go in and add Jan Diana. Right. Right? And your address. How do I add all of the other records so that it maps back to you? I see. Right? So that it builds your medical profile, but it's completely fake. Right? So sometimes it does get a little more complicated to do, but yes, you know, it's all fake.
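The child-row problem can be illustrated with a toy generator. The sketch below assumes a made-up two-table schema (patients and visits) and hypothetical value lists; it only shows the idea that every generated child row must point back to a parent that was also generated, and says nothing about how commercial synthetic data tools work.

```python
import random

# Hypothetical value pools; nothing here is derived from real patient data.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
LAST_NAMES = ["Rivera", "Chen", "Okafor", "Novak"]
DIAGNOSES = ["hypertension", "asthma", "type 2 diabetes", "migraine"]


def generate_dataset(num_patients: int, max_visits: int, seed: int = 0):
    """Generate fake parent rows (patients) and child rows (visits)."""
    rng = random.Random(seed)  # seeded so a test run can be repeated
    patients, visits = [], []
    for patient_id in range(1, num_patients + 1):
        patients.append({
            "patient_id": patient_id,
            "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
            "mrn": f"MRN{patient_id:06d}",
        })
        # Each patient gets at least one visit, and every visit carries the
        # patient_id of its parent so the relationship holds.
        for _ in range(rng.randint(1, max_visits)):
            visits.append({
                "visit_id": len(visits) + 1,
                "patient_id": patient_id,
                "diagnosis": rng.choice(DIAGNOSES),
            })
    return patients, visits
```

With two tables this is easy. The complication raised in the conversation comes from real systems having many related tables, each of which needs the same parent-first treatment.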
Jan Diana
Okay. Okay.
George Barroso
Completely fake from the beginning. And there are some healthcare companies out there. We talked to a few prospects that essentially they have contractual obligations or contractual restrictions that don't allow them to actually use the real data as a basis for anything.
Jan Diana
Hmm.
George Barroso
So the only way they can do development and testing for that particular client because of their contracts is to generate it all from scratch.
Nick Shuart
Got it. Oh, okay.
George Barroso
Yeah. So they can't use a de-identification solution, which uses the original data as a basis for generating the new data.
Jan Diana
Right.
George Barroso
Which is a lot of what happens with a lot of these de-identification tools. They look at the original data and then they do some calculations to figure out what the masked data should be.
Nick Shuart
Hmm.
George Barroso
And that way you can always maintain consistency. Right? Because that's the other thing we haven't really talked about. Is you want to make sure that the name, you know, Diana gets changed to, let's say, Mickey Mouse, right? Everywhere the same way. Because you can't have it show up one place as Mickey Mouse and the other place as Donald Duck because now all of a sudden that doesn't match.
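One common way to get "everywhere the same way" is to derive the replacement from the original value with a keyed hash, so the same input always yields the same masked output. The sketch below is an assumption for illustration, not a description of any specific tool: `mask_name`, `REPLACEMENT_NAMES`, and `SECRET_KEY` are hypothetical names, and the key shown is a placeholder.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names and a placeholder key.
REPLACEMENT_NAMES = ["Mickey Mouse", "Donald Duck", "Minnie Mouse", "Daisy Duck"]
SECRET_KEY = b"demo-key-change-me"


def mask_name(original: str, key: bytes = SECRET_KEY) -> str:
    """Map a real name to a replacement, the same way every time."""
    # Normalize so " Jan Diana " and "jan diana" are treated as one person.
    normalized = original.strip().lower()
    # A keyed hash makes the mapping repeatable without storing a lookup
    # table, and hard to predict for anyone who does not hold the key.
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]
```

Because the result depends only on the input and the key, the same person gets the same masked name in every table and every database, so joins and reports still line up. With a pool this small, different people will often share a replacement; a realistic pool would be far larger.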
Jan Diana
Correct. Right? And now your report's real. I see. And that also is kind of an identifier of masked data essentially, right? So then somebody that's looking for that data can now be like, oh, this is the pattern here. That makes sense. Okay. That makes sense. So actually when it comes to HIPAA-compliant data handling, what tools can help with data masking and de-identification?
George Barroso
That's a good question. That's an excellent question. And there's a lot that are out there today.