This podcast series is aimed at educating healthcare services and insurance companies about HIPAA compliance in the context of development and test data. Each episode focuses on key aspects of compliance, addressing technical, legal, and ethical challenges.
Nick Shuart
Hey everybody, this is Nick Shuart. I'm here with Jan Diana and George Barroso. And today we're going to be talking about HIPAA compliance for development and test data. How you doing today, George?
George Barroso
Good. How are you?
Nick Shuart
Not too bad. Thanks for coming in.
Jan Diana
Thanks for having me.
Nick Shuart
Absolutely. Glad you could make it. So, tell us a little bit about yourself. As far as development and test data, you know, how does that play into your daily life? How long have you been in this industry?
George Barroso
So I've been dealing with development and test data for well over 10 years. We handle, you know, de-identification, masking, redaction, what some folks call obfuscation. Essentially, we've been doing that for our clients for 15 years or so, and specifically around HIPAA and healthcare for 10-plus years now, going back to our first healthcare client. So you could say we have a lot of experience in helping them make sure that they have HIPAA-compliant data in their development and test environments.
Nick Shuart
Very cool. Very cool. Obfuscation, by the way. That's a new word for me. I got to write it down.
George Barroso
Oh, yeah. There's all sorts. They come up with one every year.
Nick Shuart
Yeah, they do. Yeah. We were just talking yesterday about interoperability for everybody. You know, it's like, okay, synergy would work pretty well, too. That's right. But another thing you mentioned was HIPAA. What exactly is HIPAA? Why is it important to software developers?
George Barroso
So HIPAA stands for the Health Insurance Portability and Accountability Act. It's basically a law that protects patients' medical data, or protected health information. That's abbreviated as PHI, so you hear a lot of people talk about PHI. And it's very important because if you're doing development, there are very strict regulations and rules about how you handle the data and who's supposed to have access to it, because it's protected. So if you're a developer at a healthcare firm, or some company that's dealing with healthcare data, you do have to be very careful about how you handle that data. If it's not de-identified, masked, obfuscated, however you want to refer to it, if it's actual data, then there are very strict rules about who's supposed to see it and how you handle it. So that's why it's important to consider, because if you violate those, then you have legal and financial issues.
Nick Shuart
And nobody wants those.
George Barroso
Nobody wants those.
Nick Shuart
Not in this day and age.
Jan Diana
Of course. To follow up on that, what risks are associated with using real PHI in development environments?
George Barroso
Well, so, you know, in production, everything's very locked down. And so you're going to have your real health information in those systems. When you get to development and testing, they're not going to put the same controls around those systems. And many more people are going to have access to it, right? A production system might be a bunch of folks in a medical office that can see and use the system. But if you're a developer, you know, they might send some of that development offshore. They could send it to, you know, a third party here in the United States. But there's going to be a lot more people accessing the data and looking at it. And so if it's not protected, encrypted, obfuscated, de-identified, all of that PHI is in the clear. Right? And that means that all of those developers have access to it. They can see people's names and addresses, potentially. Some sensitive health information could be there. And so it's a big risk for companies to have these development and testing systems that are not protected.
Jan Diana
And that makes sense.
George Barroso
And then that opens up and increases your risk, right? Because in production, maybe you have a few systems where people access it. You maybe have 100 people accessing it if you're a big developer.But in development, you might have 1,000 developers. That's 10 times the risk if you just look at pure numbers.
Nick Shuart
That's a good point. I didn't think about that at all. That really opens up the floodgates, potentially. Correct. Okay. So, are there methods to help de-identify PHI?
George Barroso
There are. And actually, HIPAA specifies two accepted approaches for protecting PHI and using it outside of the production scenario. One is called the Safe Harbor Method, and we'll talk about what that is probably in a few minutes here. And the other one is expert determination. Both of them are accepted, and they're actually in the HIPAA regulations, too. But those are the accepted methods for handling data that's going to be used in development and testing.
Jan Diana
Interesting. Okay. Awesome. So, you did mention one of the two methods would be the Safe Harbor Method? Yes. What exactly is it when it comes to HIPAA compliance?
George Barroso
So, the Safe Harbor Method is essentially a strategy for making sure you can de-identify your dataset. The HIPAA regulations lay out the 18 identifiers that are required to be de-identified or otherwise treated. Things like names and addresses. Technically, it's geographic data smaller than a state. And so, that means your street, your city, and potentially some parts of the zip code as well. Right. There's some very strict rules around that. And then there's social security numbers, there's medical record numbers, but there's a whole list of 18 that need to be de-identified in your dataset in order for it to be considered safe.
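A minimal sketch of the zip code rule George alludes to, assuming Python. Under Safe Harbor, only the first three digits of a zip code may be kept, and three-digit prefixes covering 20,000 or fewer people are replaced with 000. The function name `generalize_zip` and the contents of `SPARSE_PREFIXES` are hypothetical; the real list of low-population prefixes comes from Census data and is not reproduced here.

```python
def generalize_zip(zip_code: str, low_population_prefixes: frozenset) -> str:
    """Reduce a zip code to its first three digits, or "000" for sparse areas."""
    prefix = zip_code.strip()[:3]
    if len(prefix) < 3 or not prefix.isdigit():
        raise ValueError(f"not a zip code: {zip_code!r}")
    # Prefixes that cover too few people are collapsed to a single value.
    return "000" if prefix in low_population_prefixes else prefix


# Placeholder set for illustration only; NOT the real Census-derived list.
SPARSE_PREFIXES = frozenset({"999"})

print(generalize_zip("30309-1234", SPARSE_PREFIXES))  # 303
print(generalize_zip("99901", SPARSE_PREFIXES))       # 000
```

Street and city fields have no equivalent partial form under this rule, so a sketch like this would only cover the zip code column.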
Jan Diana
Right. That makes sense. We don't want to just disseminate our information out into the world, so that would make sense. We want to make sure that it's safe. Yeah.
Nick Shuart
I hear data is kind of important these days. So, based off of that, you said there's 18 identifiers. Are there any specific challenges? Because you say things like social security number, addresses, things I wouldn't want out in the public. Are there any specific challenges when it comes to protecting those with the Safe Harbor Method?
George Barroso
Probably the biggest challenge with protecting those is making sure that the data is still usable afterwards.
Nick Shuart
Okay.
George Barroso
Right? It's easy to say, oh, well, that's no problem. I'll just replace it with all Xs.
Nick Shuart
Right.
George Barroso
Great. But if you've got business logic that's going to go and give you some report based on zip code, and all of a sudden it's all Xs, that's going to be useless. Right. You no longer can tell whether or not that report's going to work. So the biggest challenge is maintaining utility when you're making changes like this. Okay. Right? That's the biggest challenge with the Safe Harbor Method: making sure that the data, after you're done treating it, still has usability for whatever you're testing.
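One way to picture the utility problem is to compare blanket replacement with a consistent, format-preserving substitution. The sketch below is an illustration in Python, not a description of any particular masking product: `mask_digits` and `SECRET_KEY` are hypothetical names, and the keyed-hash approach is an assumption chosen for brevity. It only demonstrates that grouping logic keeps working; it does not by itself show that the output satisfies Safe Harbor.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would live in a secrets store.
SECRET_KEY = b"example-key-do-not-use"


def mask_digits(value: str, key: bytes = SECRET_KEY) -> str:
    """Swap every digit for a keyed pseudorandom digit.

    Length and punctuation are preserved, and the same input always yields
    the same output, so reports that group or join on the column still run.
    Distinct inputs can occasionally collide on the same masked value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    used = 0
    for ch in value:
        if ch.isdigit():
            out.append(str(digest[used % len(digest)] % 10))
            used += 1
        else:
            out.append(ch)
    return "".join(out)


zips = ["30309", "30309", "10001"]
blanked = ["X" * len(z) for z in zips]   # every row collapses into one group
masked = [mask_digits(z) for z in zips]  # repeated values stay grouped together

print(len(set(blanked)))       # 1
print(masked[0] == masked[1])  # True
```

With the blanked column, a report by zip code has a single bucket no matter what the business logic does, which is the "useless" outcome described above.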
Nick Shuart
Got it. That makes sense. Okay. I can definitely see that. Right.
Jan Diana
Well, now that we understand what the Safe Harbor Method is, would you like to do a little bit of a deep dive into the Expert Determination Method?
George Barroso
Yeah, sure. It applies to HIPAA. So the Expert Determination Method is used generally when you want to maintain the maximum usability or utility of the data. And really what that is, is you have a qualified expert that analyzes your data set. You might have already removed some of the protected health information, but they'll analyze the data set to make sure that, in whatever you're sharing out to your third parties or using internally for development, there's no risk of someone looking at that data and being able to tie it back to the original person that it belongs to. That would be something like, you know, if it only had, let's say, my test results from my last physical, but there was no indicator of who it belonged to: my name wasn't in the data set, my address wasn't in the data set, my medical record number wasn't in the data set. It was purely used for just, okay, here's some test results, right? And it was completely disconnected. The expert could say, yes, this is safe to disseminate because you can't tie it back to anybody specifically.
Nick Shuart
Okay. So I have two questions there. First, say that information, that data was sent back to a medical professional. Is there a way to re-identify? And then second, you said that there's experts that use that method. Who qualifies as an expert in that case?
George Barroso
So the expert is actually somebody who has enough knowledge and understanding of the data set to be able to apply the statistical and scientific methods required to show that the data can't be re-identified. And that applies to that specific data set. So if the data set changes and you add a column or you add a bunch of data, the next time around the expert has to go back in and re-certify that it is still safe.
Nick Shuart
Got it.
George Barroso
And then your other question was, can you re-identify the data? Right. Potentially, yes. Okay. Depending on what's in the data set, it will be up to the expert to determine whether or not that re-identification risk is enough to worry about. Or, you know, let's say, for example, there was some internal identifier or token for each record that only you had, or only your company had, and when you shared it out, there was no way anybody could figure out that, you know, number three belongs to me and number five belongs to you.
Nick Shuart
Gotcha.
George Barroso
Right? But you, as the owner, could still re-identify it.
Nick Shuart
Right. Okay.
George Barroso
In that scenario, then, yes, when the company that owns the data gets it back after the analysis, they could go in and map those records back to the individuals.
Nick Shuart
Okay.
George Barroso
And that's a very common thing that companies will look to do when they have to basically send it out for analysis and then get the results back and kind of marry that back with their regular data to determine, you know, whatever they're trying to figure out. Re-identification is possible in those scenarios and desired in those scenarios.
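The token scenario can be sketched roughly as follows, assuming Python and a data owner who keeps a private lookup table. The function name `tokenize` and the field names `mrn` and `result` are hypothetical, chosen only for illustration.

```python
import secrets


def tokenize(records, id_field):
    """Return (shareable_records, private_lookup).

    Each distinct identifier is replaced with a random token. The lookup
    (token -> original identifier) never leaves the data owner, so only the
    owner can map returned results back to individuals.
    """
    token_for = {}  # original identifier -> token, so repeats stay linked
    lookup = {}     # token -> original identifier, kept private
    shared = []
    for rec in records:
        original = rec[id_field]
        if original not in token_for:
            token = secrets.token_hex(8)
            while token in lookup:  # practically never loops; guards collisions
                token = secrets.token_hex(8)
            token_for[original] = token
            lookup[token] = original
        shared.append({**rec, id_field: token_for[original]})
    return shared, lookup


# Hypothetical records for illustration.
records = [
    {"mrn": "A100", "result": 5.4},
    {"mrn": "B200", "result": 6.1},
    {"mrn": "A100", "result": 5.9},
]
shared, lookup = tokenize(records, "mrn")

# The third party sees only tokens; the owner re-identifies on return.
for row in shared:
    print(lookup[row["mrn"]], row["result"])
```

Because the tokens are random rather than derived from the identifier, nothing in the shared records alone leads back to the original values. Whether the remaining columns are safe to share is a separate question.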
Nick Shuart
But it sounds like there's good checks and balances there in that process. Yes, there has to be. Absolutely.
George Barroso
Otherwise, you know, it will be too easy to reverse. I mean, reversing it is the big problem.
Jan Diana
Right. I actually have a little bit of a compound question when it comes to that also. The expert itself, now is this a literal person? Is there some AI attached to that or can there be some AI attached to that? Could it be an automated process or is it a literal person? And then also, in addition to that, how does this differ from the safe harbor method as well?
George Barroso
So, traditionally, it has always been a person because you have to have knowledge of how the data is used, what's stored in the data, and, you know, have information about whether or not you could re-identify. I'll give you actually a real world example, not from the healthcare space. This is from retail. But I think it applies and kind of illustrates what I mean by, you know, have to have knowledge of the data. So, we were doing the de-identification for a retail customer. They had their membership program, which had people's names and addresses and their membership number and the things that they purchased. And you could get rewards points for all the things that you buy there and then use your reward points to buy other things. So, they had this database with all of this information in it. And they said we have to go and mask all of this data because we want to use it in development. We don't want people to know or, you know, hold personal information for our customers. So, we went through, we did all the things that they identified as being sensitive. They gave it to one of the team members who was trying to find his own record because he was also part of the rewards program. And so, everything had been de-identified: the membership number, his name, his address, you know, everything about him was de-identified. What was not de-identified was the store number, the register number, the item that was purchased, and the purchase price.
Jan Diana
Oh, wow. Okay.
George Barroso
So, you'd think, well, how would you ever get back to that?
Jan Diana
Right.
George Barroso
Right? Well, he knew that he had bought a kayak, let's say, somewhere recently, so he grabbed his receipt, right? He checked the data. And he checked all that information. Yeah, makes sense. So, he found his record, which gave him the de-identified version of his membership number, let's call it 123. And then, because he had knowledge of the data set, he could go to all the other tables and reproduce his entire purchase history.
Nick Shuart
Very cool.
George Barroso
That was in the database.
Nick Shuart
Very interesting.
George Barroso
So, you know, then we thought, okay, well, it's this one guy who knows the data so well, he's a quote-unquote expert, right? So, you know, how would somebody else, how would you get somebody else's information?
Jan Diana
Right.
George Barroso
So, he went out on social media and did a search and found like 50 people that had posted their receipts. Oh, look at this great new purchase. I got this kayak or I got this whatever, right? So, then he started searching for theirs and he found theirs. So, he came back with his results. We realized, oh, there's this huge kind of re-identification risk because of the fact that people are now posting things like this. They don't realize it's sensitive. It's just a receipt.
Nick Shuart
Yeah.
George Barroso
It doesn't have their name on it. Right. But they were able to reverse that and get the information back. That's very cool. Right? Social engineering expert.
Jan Diana
Yeah, very cool.
George Barroso
But, so, we ended up having to go and, you know, we did some analysis with the customer and realized, okay, well, if we change the register number and we change the store number and we modify the price amount, we're good.
Jan Diana
Interesting.
George Barroso
Right? Because how many kayaks do they sell a day? You know? Yeah. And so, that was enough to where he went and got a new set of receipts from, you know, social media postings or whatever, and then after we went that extra level, he was no longer able to re-identify. Oh, wow. But that's an example of expert determination, right? A normal person looking at that without any personal information, you know, doesn't see anything that means something to them, but they also wouldn't know that the register number, the store number, and the price amount were enough to uniquely identify that transaction and tie it back to a person.
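The kayak story comes down to a combination of harmless-looking columns being unique to one transaction. A rough first-pass check for that, assuming Python and rows held as dictionaries, might look like the sketch below. The function name `unique_combinations` and the sample rows are hypothetical, and a real expert determination involves far more than a count like this.

```python
from collections import Counter


def unique_combinations(rows, columns):
    """Return the value combinations of `columns` that appear in exactly one row."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# Hypothetical de-identified purchase rows; the member column is already masked.
rows = [
    {"member": "tok-a", "store": 12, "register": 3, "item": "kayak", "price": 499.99},
    {"member": "tok-b", "store": 12, "register": 3, "item": "water bottle", "price": 9.99},
    {"member": "tok-c", "store": 12, "register": 3, "item": "water bottle", "price": 9.99},
]

# Anyone holding the kayak receipt can find the one matching row, and from it
# the masked member number that links to the rest of that person's history.
print(unique_combinations(rows, ("store", "register", "price")))
# [(12, 3, 499.99)]
```

Changing the store number, register number, and price, as the team ended up doing, breaks the match between a posted receipt and the row.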
Jan Diana
That makes sense. So. Very cool.
George Barroso
Yeah. That's an example of the expert. And how does it differ from Safe Harbor? Correct. Safe Harbor is rules-based. It's basically, if it's a name, you mask it. If it's a social security number, you mask it. If it's whatever, you know, if it's one of the 18 identifiers, you just go in and you change it.
Jan Diana
Right? Well done.
George Barroso
But, you know, obviously, expert determination, you have a little more flexibility because, you know, maybe the social security number isn't stored in a format that makes it obvious that it's a social.
Jan Diana
That makes sense.
George Barroso
Then maybe it's not so easy to determine, and it might be fine based on expert determination that, you know, that can be left in the clear because you can't map back to anybody. You don't even realize it's a social; it's just a nine-digit number, right? I have one client, actually, that said, I don't need to mask social security numbers in my systems because our account numbers are also nine digits, and then we have a bunch of other nine-digit numbers that we're not storing, and the column isn't labeled social security number, so it's just a nine-digit number. You have no idea what it is. Right. Interesting. Right? So, for them, their determination was that that was not enough of a risk to require masking. There are differences like that. Safe Harbor is, quote, unquote, safer, right? Because you're changing everything that could be sensitive.
Jan Diana
Right.
George Barroso
Expert determination, though, you know, is much better for data utility, right? Because you're not changing everything; you're just making sure that the data set can't be re-identified.
Nick Shuart
Right. That makes sense. Okay. Interesting. So, just to be clear, I shouldn't be posting my social security number online anymore? Oh, you absolutely shouldn't. And you should probably think twice about posting any receipts. Any receipts. I mean, I've got, like, two kayaks, so I was like, wait a second. I wonder. My identity has been stolen so many times. That's right.