How Census Data Put Trans Children at Risk

2022-09-20 18:45:52

Every decade, the U.S. Census Bureau counts the people in the United States, trying to strike a balance between gathering accurate information and protecting the privacy of the people described in that data. But current technology can reveal a person’s transgender identity by linking seemingly anonymized information, such as their neighborhood and age, to discover that their sex was reported differently in successive censuses. The ability to deanonymize gender and other data could spell disaster for trans people and families living in states that seek to criminalize them.

In places like Texas, where families seeking medical care for trans children can be accused of child abuse, the state would need to know which teenagers are trans in order to carry out its investigations. We worried that census data could be used to make this kind of investigation and punishment easier. Might a weakness in how publicly released data sets are anonymized be exploited to find trans kids—and to punish them and their families? A similar concern underscored the public outcry in 2018 over the census asking people to reveal their citizenship: that the data would be used to find and punish people living in the U.S. illegally.

Using our expertise in data science and data ethics, we took simulated data designed to mimic the data sets that the Census Bureau releases publicly and tried to reidentify trans teenagers, or at least narrow down where they might live, and unfortunately, we succeeded. With the data-anonymization approach the Census Bureau used in 2010, we were able to identify 605 trans kids. Thankfully, the Census Bureau is undertaking a new differential-privacy approach that will improve privacy overall, but it is still a work in progress. When we reviewed the most recent data released, we found the bureau’s new approach cuts the identification rate by 70 percent—a lot better, but still with room for improvement.

Even as researchers who use census data in our own work to answer questions about life in the U.S., we believe strongly that privacy matters. The bureau is currently undertaking a public comment period on designing the 2030 census. Submissions could shape how the census is undertaken, and how the bureau will go about anonymizing data. Here is why this is important.

The federal government gathers census data to make decisions about things like the size and shape of congressional districts, or how to disburse funding. Yet government agencies aren’t the only ones who use the data. Researchers in a variety of fields, such as economics and public health, use the publicly released information to study the state of the nation and make policy recommendations.

But the risks of deanonymizing data are real, and not just for trans children. In a world where private data collection and access to powerful computing systems are increasingly ubiquitous, it might be possible to unwind the privacy protections that the Census Bureau builds into the data. Perhaps most famously, computer scientist Latanya Sweeney showed that almost 90 percent of U.S. citizens could be reidentified from just their ZIP code, date of birth and assigned sex.
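Sweeney’s result rests on a simple observation: in a large population, very few people share the same combination of ZIP code, date of birth and sex. A minimal sketch of that uniqueness check, using a hypothetical five-person toy population (the records and the resulting fraction are invented for illustration, not Sweeney’s data):

```python
from collections import Counter

# Hypothetical toy population; each record is (ZIP code, date of birth, sex).
# Sweeney's analysis used real U.S. data; these five records are invented.
population = [
    ("02138", "1990-01-01", "F"),
    ("02138", "1990-01-01", "F"),  # two people share this triple: not unique
    ("02138", "1985-06-15", "M"),
    ("02139", "1990-01-01", "F"),
    ("02139", "1972-03-30", "M"),
]

def unique_fraction(records):
    """Fraction of records whose quasi-identifier triple is unique in the
    data set -- the individuals a linkage attack could single out."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)

print(unique_fraction(population))  # → 0.6
```

In this toy data, three of the five triples are unique, so 60 percent of the records could be singled out; Sweeney’s finding was that for real U.S. data this fraction approaches 90 percent.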

In August of 2021, the Census Bureau responded. The organization used the cryptographer-preferred approach of differential privacy to protect its redistricting data. Mathematicians and computer scientists have been drawn to the mathematical elegance of this approach, which involves intentionally introducing a controlled amount of error into key census counts and then cleaning up the results to ensure they remain internally consistent. For example, if the census counted precisely 16,147 people who identified as Native American in a specific county, it might report a number that is close but different, like 16,171. This sounds simple, but counties are made up of census tracts, which are made up of census blocks. That means, in order to get a number that is close to the original count, the census must also tweak the number of Native Americans in each census block and tract; the art of the Census Bureau’s approach is to make all of these close-but-different numbers add up to another close-but-different number.
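The bureau’s production system (the TopDown algorithm, which uses discrete Gaussian noise) is considerably more sophisticated, but the core two-step idea, noise followed by consistency post-processing, can be sketched with simple Laplace noise. Everything below (the function names, the epsilon value, the residual-spreading rule) is an illustrative assumption, not the bureau’s implementation:

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) by inverse-CDF; scale = sensitivity / epsilon.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_consistent_counts(block_counts, epsilon=1.0, seed=0):
    """Add Laplace noise to each block count and to the overall total, round
    to non-negative integers, then post-process the blocks so they sum
    exactly to the noisy total (the close-but-different numbers add up)."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    noisy_total = max(0, round(sum(block_counts) + laplace_noise(scale, rng)))
    blocks = [max(0, round(c + laplace_noise(scale, rng))) for c in block_counts]
    residual = noisy_total - sum(blocks)  # consistency gap to distribute
    i = 0
    while residual != 0:
        step = 1 if residual > 0 else -1
        j = i % len(blocks)
        if blocks[j] + step >= 0:        # never drive a count negative
            blocks[j] += step
            residual -= step
        i += 1
    return noisy_total, blocks

total, blocks = noisy_consistent_counts([16147, 820, 95], epsilon=0.5, seed=7)
assert sum(blocks) == total  # blocks remain internally consistent
```

The published total and the per-block counts are each perturbed, yet the post-processing guarantees they still add up, which is exactly the internal consistency the article describes.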

One might think that protecting people’s privacy is a no-brainer. But some researchers, primarily those whose work depends on the existing data privacy approach, feel differently. These changes, they argue, will make it harder for researchers to do their jobs in practice—while the privacy risks the Census Bureau is protecting against are largely theoretical.

Remember: we’ve shown that the risk is not theoretical. Here’s a bit on how we did it.

We reconstructed a complete list of people under the age of 18 in each census block so that we could learn what their age, sex, race and ethnicity were in 2010. Then we matched this list up with the analogous list in 2020 to find people now 10 years older and with a different reported sex. This method, called a reconstruction-abetted linkage attack, requires only publicly released data sets. When we had the method reviewed and presented it formally to the Census Bureau, it proved robust and worrying enough that researchers from Boston University and Harvard University reached out to us for more details about our work.
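Once per-block record lists have been reconstructed for both censuses, the linkage step is almost mechanical. The sketch below assumes the reconstruction is already done and uses invented records; matching on (block, age + 10, race) and flagging sex discrepancies is the core idea, though a real analysis must handle ambiguous matches far more carefully:

```python
# Hypothetical reconstructed records: (census block, age, sex, race).
# Real reconstruction recovers such rows from published tabulations;
# these six records are invented for illustration.
census_2010 = [
    ("block-A", 8,  "M", "White"),
    ("block-A", 12, "F", "Black"),
    ("block-B", 9,  "F", "Asian"),
]
census_2020 = [
    ("block-A", 18, "F", "White"),   # was (8, "M") in 2010: sex differs
    ("block-A", 22, "F", "Black"),
    ("block-B", 19, "F", "Asian"),
]

def flag_sex_changes(rec_2010, rec_2020):
    """Link records a decade apart on (block, age + 10, race) and flag
    those whose reported sex differs between the two censuses. Assumes
    at most one match per key; real blocks need ambiguity handling."""
    later = {(b, a, r): s for (b, a, s, r) in rec_2020}
    return [
        (block, age + 10, race)
        for (block, age, sex, race) in rec_2010
        if later.get((block, age + 10, race), sex) != sex
    ]

print(flag_sex_changes(census_2010, census_2020))  # → [('block-A', 18, 'White')]
```

Note that both inputs are public: no private database, insider access or special computing power is needed, which is what makes the attack worrying.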

We simulated what a bad actor could do. So how do we make sure that attacks like this don’t happen? The Census Bureau is taking this aspect of privacy seriously, and researchers who use these data must not stand in its way.

The census has been collected at great labor and great cost, and we will all benefit from data produced by this effort. But these data can also do harm, and the Census Bureau’s work to protect privacy has come a long way in mitigating this risk. We must encourage them to continue.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.
