One of the most important attributes of data-driven journalism is that it scales. The primary goal of my OpenRural, Open N.C. and data dashboard projects has been to democratize data so that we start seeing the same types of reporting and presentation in small community papers that we see on the big national news sites. So when I saw Thursday’s New York Times graphic on the race gap in America’s police departments, I immediately thought that something similar could be done fairly quickly for North Carolina towns.
Being a words guy rather than a picture guy, I used the data visualization software Tableau to put together a prototype of something similar to what The Times had done. It is nowhere near as good as what they did, but I copied their concept, color scheme and fonts. About two hours later I had something that told the same story.
The graphic alone doesn’t tell the whole story. When I showed her the chart, Tippett pointed out that most of the Latinos in Siler City aren’t even eligible to join the city’s police force — 40% are not adults, and 80% of the adult Hispanics there are not citizens.
And many of these police forces are very small, which makes it easy for them to end up with huge percentage disparities in the racial breakdowns of their police and residents. Tiny Biscoe, for example, only has nine police officers. Wagram has two police officers — one white and one “other.”
The other potential problem with the data is that it’s seven years old. But so is the data used by The Times.
This is just an example of how we might continue to democratize data. This graphic could be emailed to an editor of each news outlet in North Carolina, along with a list of suggested questions that local reporters could ask to quickly make the data more relevant.
Suggested Questions to Localize This Data Driven Story
- “This data is seven years old. Does it still look accurate to you? Can you provide me with some more recent data of the racial and ethnic breakdown of the police department?”
- “Why do you think your department has a higher percentage of white officers than the residents?”
- “How does the racial disparity between the police department and local residents affect the way your department works?”
- “Walk me through the hiring process for new officers. How does a candidate’s race factor into hiring decisions, if at all?”
- “How do you publicize vacancies in the department? Do you do anything to recruit minority applicants?”
- “What percentage of your officers live in the city? How important is it that officers come from within the city? Why?”
- Also, seek opinions of others — both insiders such as city council members and community leaders as well as people on the street. Consider using social media such as Facebook or Twitter to ask people what they think about the data and these questions. This is the start of a conversation, not the end. Be sure to get a diversity of perspectives — age, gender, geography and certainly race and ethnicity.
The Challenge: News Deserts
But even if we acquire, clean and produce data along with some simple story guides, data driven journalism may still not find its way into smaller newspapers if nobody is there to receive our help. At many papers, this would still be seen as enterprise reporting. As an editor with a staff you can count on one hand, do you send a reporter out prospecting for answers to these somewhat uncomfortable questions? Or do you have them write up the day’s arrests? Or preview this weekend’s chamber of commerce golf tournament?
North Carolina also has broad news deserts — whole counties that have no reporters shining light in dark places, holding powerful people accountable and explaining an increasingly complex and interconnected world. Siler City, for example, is in a county of 65,000 people with a single newspaper that reaches only 12 percent of them. The News & Observer provides scant coverage of the county.
What other story templates would you like to see? What would make them easier to use?
First of all, let’s not allow the alluring alliteration to distract from what we’re really talking about — not robot reporters, but robot writers.
Mashable’s Lance Ulanoff asked me what I thought about the news that Durham’s Automated Insights would be writing automated business stories for the Associated Press.
This trend excites me about the future of journalism. I’ve been talking with folks about it for about five years, since I first saw similar work that was being incubated by Northwestern’s journalism school. That effort grew into the company Narrative Science, which has been writing earnings preview stories for Forbes.com. The Los Angeles Times uses an algorithm to write earthquake stories. The Washington Post has looked into using Narrative Science for high school sports stories.
The Guardian learned how hard it is to build a robot writer, but the automated stories I’ve seen written by both Automated Insights and Narrative Science are pretty good. And 46 media and communications undergrads couldn’t distinguish a computer written story from one written by a human.
The trend in automation should free up the best writers and best reporters to add the how and why context that still needs to come from humans. If I were a beat reporter at a newspaper, I’d be working as fast as I could to convince my editor to let a computer write the scut stories I have to write and free me up to do more explanatory and accountability reporting, or to craft beautifully written narratives.
One significant risk is that for the last decade we’ve seen “good enough” journalism growing in popularity. News organizations that continue to have a strategy of harvesting profits rather than investing in growth will no doubt cut reporters if machines can write commodity news at a lower cost.
If I were a young journalist looking for my first job, I’d be looking for news organizations that are sustaining a small margin and growing both expenses and revenues — the ones that are using both bots and humans.
The trend toward automation will result in an emphasis on the news value of impact. Mass customization is going to change the nouns in the leads of stories from the third person to the second — “investors” will become “you.”
The trick is how to make money off this. News organizations that continue to see themselves as manufacturers of goods will probably increase the volume of digital commodity content they publish and continue to drive down ad rates.
But smart content companies are evolving from a manufacturing industry to a service industry, and trying to create, explain and capture the value they provide to each client by getting the right information to the right people at the right time.
What we see now as data is as unsophisticated as what many of us thought of data when Google first made organizing all of it its mission. We think of data now as numbers in tables — scores, money, temperatures — but we’ll soon see data as behavior and content metadata. And we will see automated stories that incorporate the user’s data and the data of her social network as well.
That level of concierge news service, though, is going to come at a price for users. If we’ve seen the democratization of media, this automation trend has the potential to create a world of media haves and have-nots: the haves will pay premium subscription fees to get highly personalized news from bots. The have-nots will get generic news (maybe written by bots as well).
The one thing from which I think everyone will benefit is an increase in the quality and frequency of narrative writing, and of explanatory and accountability reporting.
To aid that transition I’m working on the idea that we can use digital public records to build a newsroom dashboard system that will alert beat reporters to possible story ideas. Automated Insights and Narrative Science are scaling commodity news stories. I want to see if we can lower the human reporters’ opportunity cost of pursuing enterprise stories that land with much bigger and much longer lasting impact.
If you want a pithy quote from a journalism prof. on the effect that robot writers are going to have on the job market for journalism students, here it is: “My C students are probably screwed. My A students are going to do better than ever.”
I made an error in the piece I wrote for PBS Media Shift Idea Lab yesterday. I misrepresented the audience for The Columbia Tribune. The paper’s general manager, Andy Waters, kindly brought it to my attention and I want to offer a correction here.
Numbers that Waters sent me from The Media Audit show that the news organization reaches nearly 80 percent of the 130,000 adults in its market. I used too small a potential audience base and too small a penetration rate when I wrote up the post. That number was no good, and I should’ve known better than to use it: I simply took the print circulation and divided it by the Census estimate of residents in Columbia.
You could quibble all night about audience measurement methodologies, but whatever faults The Media Audit numbers may or may not have, they are certainly better than the way I tried to calculate it.
And I think this is a particularly important measurement to correct because of the number of unsourced posts you can find on the Internet saying that the Tribune lost anywhere between 25 and 40 percent of its online audience when it implemented online subscriptions. I’m not fact-checking those claims one way or the other, and even if true they may not be important. I’m repeating them here only to provide context for the correction and to hopefully spur some critical thinking about any audience claims you see — including mine.
Anyway. The Media Audit numbers that Waters showed me indicate that out of a base population of 130,634 adults 18 or older, 103,260 of them – or 79 percent – say they read the Tribune either in print or online. The print edition (weekday/Sunday) reaches 62 percent of the market at least once a week, and the website reaches 52.5 percent of the market at least once a month.
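The arithmetic behind the corrected figure is simple enough to sketch. The Media Audit numbers below are the ones cited above; the naive method (print circulation divided by Census population) is what produced the bad number, and I’ve left its inputs out since they aren’t in the post.

```python
# The Media Audit figures cited in the post.
adults_in_market = 130_634   # adults 18 or older in the Tribune's market
readers = 103_260            # say they read the Tribune in print or online

penetration = readers / adults_in_market
print(f"{penetration:.0%}")  # prints "79%"
```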
I hope that gives a better picture of the kind of environment in which the Tribune’s OpenBlock experiment is taking place. And the point I was trying to make I think remains valid — that Columbia, Mo., is WAY different than Chicago or Charlotte or San Francisco.
Here’s a great exercise for journalism professors who are introducing their students to data-driven journalism. It provides a good opportunity to show them that they have to get over the common perception that data is unbiased — clean and clear. It gives instructors an opportunity to talk about the need to “interview” the data.
The assignment is deceptively simple: Have the students download the Census Bureau’s list of rural and urban counties and calculate the population density for the counties in your state.
That’s it. Tell them no more. Depending on where they get stuck, slowly reveal the clues they need to complete the project. What you may not be surprised to find is that too many college undergrads are accustomed to following step-by-step instructions, and too few know how to break a problem down into smaller, sequential pieces. These are the kinds of critical thinking skills they need to be good journalists — or, as I like to say, to think journalistically regardless of their eventual profession.
Helping Them Get Unstuck
Force your students to get a quick start. Don’t let them sit and stare at their computer screens for even a second. Agitate them in whatever way you need to make them feel like an asteroid is about to smash the earth to smithereens. They can’t solve the whole problem all at once, so what are the pieces of the problem hidden inside this big problem?
- Where can you find the Census list of rural and urban counties?
The answer — of course — is Google. So, there’s an opportunity to teach efficient search strategies.
Students will click around the Census site a bit trying to find what they want. Ask how many skimmed and how many read every word on each page. It’s a good opportunity to talk about the way people use information online.
You can help students find the data they need. And from there you can show them basic file-management and Excel techniques. Where does the file download on their computer? What’s the difference between a .csv and a .xlsx file?
With the data open in Excel, they’ll need to sort or filter to isolate just their state. But now what? Ask the students what they think each of the columns represents. What does it mean that a row has a POP_UA of 10791 and a STATE of 37?
Once they figure that out, they may note that the data includes some pre-calculated population density. But it’s not the information you asked them to find, so they’ll have to calculate population density themselves — a commonly needed, very simple piece of journalism math.
This gives you a chance to explain that numbers are only meaningful in relation to other numbers. And how to do basic calculations in Excel.
The students will do the math correctly, but they won’t get answers that make any sense — a chance for you to talk with them about how data still has to pass the sniff test. Why doesn’t the data make sense? They can find the answer back on the Census website.
Once they’ve made the correct calculations (how many square meters are in a square mile, anyway?), you can talk with them about how you still need to find the story in the data. Even though their calculations have added value to the data — essentially refining raw ore — mere presentation is of marginal value.
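The sniff-test failure comes from units: the Census reports land area in square meters, and a square mile is about 2.59 million of them. Here’s a sketch of the fix in Python rather than Excel — the sample county below is hypothetical, not a row from the actual file.

```python
# Convert Census land area from square meters to square miles, then divide.
SQ_METERS_PER_SQ_MILE = 1609.344 ** 2  # about 2,589,988.11

def density_per_sq_mile(population, land_area_sq_meters):
    """People per square mile, given land area in square meters."""
    return population / (land_area_sq_meters / SQ_METERS_PER_SQ_MILE)

# A hypothetical county: 50,000 people on roughly 700 square miles of land.
naive = 50_000 / 1_813_000_000                      # fails the sniff test
fixed = density_per_sq_mile(50_000, 1_813_000_000)  # ~71 people per sq. mile
```

In Excel the same fix is one extra column dividing the land-area field by 2,589,988.11 before the density division.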
You can top off the conversation by coming back to language, and that journalistic aspiration for precision and objectivity. What does “rural” mean anyway? What does the dictionary say? Is it an abstract concept or something you can measure? How (many different ways) does the Census measure it? How is it different than the USDA’s definition? Which is better? Why?
This is a project that could take several weeks as a module in a college class, or as a MOOC or quick conference or newsroom workshop. Its strength is its scope and flexibility. Just like a good journalist.
7 p.m. –
Romney: 44 (Ga., S.C., Ky., Ind.)
Obama: 3 (Vt.)
Undeclared: 34 (Va., Fla.)
Virginia and Florida will be our first undecided states, and in 2008 they were the ones that finally got called at 11 p.m. and allowed TV networks to project that Obama would win.
7:30 p.m. –
Romney: 49 (W.Va.)
Obama: 3 (Vt.)
Undeclared: 43 (N.C., Ohio, Va., Fla.)
In 2008, McCain conceded even while he was still ahead in North Carolina. Of course, after all precincts reported, it was Obama who won, becoming the first Democrat since Jimmy Carter to carry the state. It will be at least 9:30 before the state is called, and I suspect that the longer it stays open the worse it looks for Romney.
West Virginia hasn’t gone Democratic since 1996. Sometimes I forget that.
8 p.m. –
Romney: 130 (Tenn., Ala., Miss., Mo., Tex., Okla.)
Obama: 98 (Mich., Maine, N.H., R.I., Conn., N.J., Del., Md., D.C., Ill.)
Undeclared: 59 (N.C., Ohio, Va., Fla.)
If N.H. and Mich. don’t go for Obama right away, then he may be in trouble there.
If N.J. doesn’t get called right away, it’ll probably not be an indication of anything other than storm-related voting issues.
8:30 p.m. –
Romney: 136 (Ark.)
Obama: 98 ()
Undeclared: 59 (N.C., Ohio, Va., Fla.)
9 p.m. –
Romney: 174 (La., N.D., S.D., Neb., Kans., Wyo., Ariz.)
Obama: 152 (N.Y., Minn., N.M., Wis.)
Undeclared: 68 (Colo., N.C., Ohio, Va., Fla.)
Colorado is another of those states that didn’t get called in 2008 until after McCain conceded shortly after 11 p.m.
In 2008 at about 9:30 p.m., the networks projected Obama would win Ohio. They also projected Wisconsin going for Obama about the same time. It looks like they both may go Obama’s way this year, too – but Wisconsin before Ohio.
10 p.m. –
Romney: 185 (Mont., Utah)
Obama: 164 (Nev., Iowa)
Undeclared: 68 (Colo., N.C., Ohio, Va., Fla.)
Iowa went for Obama right away in 2008, and he eventually got 54 percent of the vote there. If the state goes to Romney, it would be only the second time it has gone Republican since 1988. A slow call for Obama here might point in that direction.
I seem to recall that Nevada is very slow to report, but a slow call in Nevada might also be an indication that the state will return to the Republican column.
In 2008 at about 10:45 p.m., Fox called Virginia for Obama.
11 p.m. –
Romney: 185 ()
Obama: 269 (Ohio, Va., Calif., Wash., Ore.)
Undeclared: 50 (Colo., N.C., Fla.)
Hope will remain for Romney if Ohio doesn’t go to Obama by 11 p.m. But also note that Ohio is expecting more than 200,000 provisional ballots and will be forced into a recount if the difference between the two candidates is within about 14,000 votes.
Of the five remaining undeclared states, Nate Silver predicts that Obama is most likely to win Virginia and Colorado. But winning Virginia would still leave Obama one electoral vote shy of the 270 he needs to win the presidency. So expect this election night to go later than it did four years ago.
Florida and Virginia got called for Obama shortly after 11 p.m. in 2008, allowing networks to project him as the winner. Obama had about 53 percent in Virginia and 51 percent in Florida when all the votes were counted in 2008.
Nevada didn’t get declared until after McCain conceded. Obama ended up with 55 percent of the vote there.
12 a.m. –
Romney: 185 ()
Obama: 273 (Hawaii)
Undeclared: 37 (Colo., N.C., Va., Fla.)
1 a.m. –
Romney: 188 (Alaska)
Obama: 273 ()
Undeclared: 37 (Colo., N.C., Va., Fla.)