Orlando, Fla. – First came the “noise”: the U.S. Census Bureau's decision to introduce small errors into 2020 census data to protect the privacy of participants. Now the bureau is looking at “synthetic data”, manipulating numbers widely used for economic and demographic research, to obscure the identities of those who provided the information.
The move has some researchers up in arms, worried that the statistical agency may sacrifice accuracy in its zeal to protect privacy.
Census Bureau statisticians revealed at a virtual conference last week that over the next three years they would work toward developing a method to create “synthetic data” for files on individuals and homes that are already devoid of personal information. These files, known as the American Community Survey microdata, are used by researchers to create customized tables tailored to their research.
Census Bureau statisticians said more privacy protections are needed because technological innovations increase the risk of people being identified through their survey answers, which are confidential. Computing power is now so vast that it can easily crunch third-party data sets that combine personal information from credit-rating and social media companies, purchasing records, voting patterns and public documents, among other things.
“This is a balancing act. The law requires us to do competing things. We need to release statistics on the nation to allow people to make useful decisions. But we also have to protect the privacy of our respondents,” Census Bureau statistician Rolando Rodriguez said at the conference.
But critics say the proposal, along with an ongoing effort to add small inaccuracies to the 2020 census data to protect the privacy of participants, weakens the Census Bureau's credibility as a provider of accurate data about the U.S. population.
Demographer Steven Ruggles of the University of Minnesota categorically stated that the synthetic data “would not be suitable for research.”
“The Census Bureau is inventing hypothetical threats to privacy to drastically reduce public access to data,” Ruggles said. “I don’t think it will stand up, because society needs information to function.”
The microdata come from the American Community Survey, which is conducted each year with a sample of 3.5 million homes whose answers are extrapolated to populations of all sizes, from the whole country down to the neighborhood. The survey provides a wide range of estimates on the country’s demographic makeup and housing characteristics. Ruggles said the microdata are used in about 12,000 research papers a year.
Synthetic data is created by taking variables in the microdata to build a model that re-creates the interrelationships of the variables, and then building a simulated population based on the model. Scholars would conduct their research using the simulated population (the synthetic data) and then submit it, if they wish, to the Census Bureau for double-checking against the actual data to ensure that their analyses are correct.
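For readers who want to see the idea in miniature, the following is a rough sketch of the general approach, not the Census Bureau's method, which has not been published in this form. It assumes a toy data set with two invented variables (housing tenure and an income band); the function names `fit_model` and `synthesize` and all of the records are hypothetical. The "model" here is simply the share of each tenure type and the share of each income band within it.

```python
import random
from collections import Counter, defaultdict


def fit_model(records):
    """Estimate P(tenure) and P(income_band | tenure) from the real records."""
    tenure_counts = Counter(tenure for tenure, _ in records)
    income_given_tenure = defaultdict(Counter)
    for tenure, income in records:
        income_given_tenure[tenure][income] += 1
    return tenure_counts, income_given_tenure


def synthesize(model, n, seed=0):
    """Draw n simulated records from the fitted model."""
    tenure_counts, income_given_tenure = model
    rng = random.Random(seed)
    tenures = list(tenure_counts)
    tenure_weights = [tenure_counts[t] for t in tenures]
    synthetic = []
    for _ in range(n):
        # First pick a tenure, then an income band conditional on that tenure.
        tenure = rng.choices(tenures, weights=tenure_weights)[0]
        incomes = list(income_given_tenure[tenure])
        weights = [income_given_tenure[tenure][i] for i in incomes]
        income = rng.choices(incomes, weights=weights)[0]
        synthetic.append((tenure, income))
    return synthetic


# Hypothetical toy "microdata": (tenure, income_band) pairs invented for illustration.
real = (
    [("own", "high")] * 40
    + [("own", "low")] * 10
    + [("rent", "high")] * 15
    + [("rent", "low")] * 35
)

model = fit_model(real)
synthetic = synthesize(model, n=1000)
print(Counter(synthetic).most_common())
```

No synthetic record corresponds to a real respondent, yet the link between owning a home and income survives because the model includes it. Any relationship the model leaves out would not carry over into the simulated population.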
Ruggles said that new discoveries in the data will be missed because the models capture only what is already known.
Another problem is that synthetic data can exaggerate an outlier, such as in a health study where one person engages in risky behavior from time to time but no one else does, making the risky behavior look far more widespread than it really is, said David Swanson, an emeritus professor of sociology at the University of California Riverside.
However, there are benefits, such as the ability to get details about people at really small geographic levels, such as neighborhood blocks, said Cornell University economist Lars Vilhuber, who has researched the method. Synthetic data makes this possible because it protects privacy, he said.
“You can actually find far more detail in the data than with traditional methods,” Vilhuber said.
The Census Bureau said in a statement on Thursday that it had not made a final decision on the use of synthetic data in the American Community Survey and that it welcomes feedback from researchers.
The Census Bureau has taken other recent steps to protect the privacy of individuals, a task that has become more difficult due to the proliferation of external data sources. This year, the bureau proposed using housing units instead of people when defining an urban area. And it has drawn sharp criticism for using a statistical technique known as “differential privacy” on the 2020 census data, which will be used to redraw congressional and legislative districts.
Differential privacy adds mathematical “noise”, or intentional errors, to the data so that the identity of any individual is obscured while the data still provides statistically valid information. It has been challenged in court by the state of Alabama, which argues that its use will result in inaccurate data.
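As a rough illustration of how such noise works, the sketch below uses the textbook Laplace mechanism for a simple count. It is not a description of the bureau's production system, which this article does not detail. The block names, counts, the privacy parameter `epsilon` and the function names are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution centered on zero with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count plus Laplace noise.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)

# Hypothetical population counts for three census blocks.
block_counts = {"block A": 12, "block B": 340, "block C": 3}

released = {
    name: noisy_count(count, epsilon=0.5, rng=rng)
    for name, count in block_counts.items()
}

for name, value in released.items():
    print(name, "true:", block_counts[name], "released:", round(value, 1))
```

A smaller `epsilon` means more noise and stronger privacy but less accuracy. Because the noise is the same size regardless of the count, it distorts a block of 3 people proportionally far more than a block of 340, which is the kind of tradeoff critics and the bureau are arguing over.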
Historian Margo Anderson, a professor at the University of Wisconsin-Milwaukee, said the Census Bureau maintains that its methods are in keeping with what it has always done to protect privacy. “A significant group of critics is saying that it is completely different,” she said. “They say, ‘You have never intentionally distorted the data.’”
The Census Bureau first floated the idea of using synthetic data three years ago, but concerns about it and differential privacy were overshadowed by the Trump administration's failed effort to add a citizenship question to the 2020 census questionnaire and by the challenges the pandemic posed to the nation's head count last year, Anderson said.
For Swanson, the Census Bureau’s efforts at privacy remind him of the quote that reporter Peter Arnett attributed to an unnamed American military officer during the Vietnam War: “We had to destroy the city to save it.”
“I think they will literally destroy the census data to protect it from an undetermined threat,” Swanson said. “If they destroy the data, they are going to destroy the bureau.”
Follow Mike Schneider on Twitter https://twitter.com/MikeSchneiderAP