This project provides a sortable list of datasets that may be of use to journalists. Most of the datasets and descriptions come from Jeremy Singer-Vine’s Data Is Plural, “a weekly newsletter highlighting useful and curious datasets”.
You can save a dataset for future reference by clicking the "+" at the top right-hand corner of the dataset cards.
The original Data is Plural spreadsheet (and this remixed project) are published under a Creative Commons Attribution - Share Alike 4.0 International license.
You can find the GitHub repo for this project here.
The Global Investigative Journalism Network has published a collection of resources for finding and working with data. They also have resources available in Arabic, Bangla, Chinese, French, Portuguese, Russian and Spanish.
Voice of America does not endorse and has not verified these datasets. This page was created as a resource for journalists to find potentially useful data to help report stories.
VOA provides trusted and objective news and information in over 45 languages to a measured weekly audience of more than 275.2 million people around the world. For over 75 years, VOA journalists have told American stories and supplied content that many people cannot get locally: objective news and information about the US, their specific region and the world.
CIRCL, Luxembourg’s computer security incident response team, has published a dataset of 37,500 .onion website screenshots, a subset of which have been categorized by topic (e.g., “drugs-narcotics”, “extremism”, “finance”) and/or purpose (e.g., “forum”, “file-sharing”, “scam”). [h/t Alexandre Dulaunoy] — Data is Plural: September 18, 2019
Links:
Tags: crime, extremism, technology
Urban planning professor Geoff Boeing’s US street network data represents America’s roads as a network graph, where each intersection (and dead-end) is a node, and each street segment is an edge between two of those nodes. The project’s data repository contains these networks for each city, county, Census tract, and more. You might remember: Boeing’s urban street orientation charts. [h/t Robin Hawkes] — Data is Plural: September 18, 2019
Links:
Tags: mapping
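As a rough illustration of the graph representation described in this entry, the Python sketch below builds a tiny street network by hand. The intersection labels and segment lengths are invented for illustration and do not come from Boeing's data; the actual files may use a different schema.

```python
from collections import defaultdict

# Hypothetical street segments: (node_a, node_b, length_in_meters).
# Each node is an intersection or dead end; each segment is an edge.
SEGMENTS = [
    ("A", "B", 120.0),
    ("B", "C", 80.0),
    ("B", "D", 60.0),  # D connects to nothing else, so it is a dead end
    ("C", "A", 150.0),
]


def build_graph(segments):
    """Return an undirected adjacency map: node -> {neighbor: length}."""
    graph = defaultdict(dict)
    for a, b, length in segments:
        graph[a][b] = length
        graph[b][a] = length
    return dict(graph)


def dead_ends(graph):
    """Return the nodes that have exactly one connecting street segment."""
    return sorted(node for node, nbrs in graph.items() if len(nbrs) == 1)


graph = build_graph(SEGMENTS)
print(len(graph), "nodes; dead ends:", dead_ends(graph))
```

A plain dictionary is used here only to keep the sketch dependency-free; a dedicated graph library would be a more natural fit for networks of real size.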
The State Networks dataset gathers comparative and relationship metrics for every combination of the 50 US states, plus the District of Columbia. Among the metrics: the number of flights between each state-pair, migration in either direction, and total value of goods imported. The comparisons also include state-to-state differences in demographics, ideology, and GDP. [h/t Matt Grossmann] — Data is Plural: September 18, 2019
Links:
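For a sense of scale, the short Python sketch below counts how many state pairs 51 units (the 50 states plus the District of Columbia) can form. The labels are placeholders, and whether the dataset stores each pair once or once per direction is an assumption not confirmed by the description above; both counts are shown.

```python
from itertools import combinations, permutations

# Placeholder labels standing in for the 50 states plus DC.
units = [f"unit_{i:02d}" for i in range(51)]

# Unordered pairs: A-B is counted once (e.g., a difference in GDP).
unordered = list(combinations(units, 2))

# Ordered pairs: A->B and B->A are counted separately
# (e.g., migration in either direction).
ordered = list(permutations(units, 2))

print(len(unordered), len(ordered))
```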
Political science professor Jamie Monogan has compiled a dataset of more than 2,700 immigration laws passed by US state legislatures from 2005 to 2016. The dataset summarizes the laws and also categorizes them by subject, scope, and whether they appear to be welcoming or hostile to immigrants. [h/t Jason Anastasopoulos] — Data is Plural: September 18, 2019
Links:
Tags: immigration
Since 1988, Brazil’s PRODES project has been using satellite imagery to track clear-cutting in the country’s Amazon basin. The government’s TerraBrasilis web portal provides an interactive map and downloads of the data. Global Forest Watch also provides a dataset of PRODES-detected deforestation, from 2001 to 2015. [h/t Giuseppe Sollazzo] — Data is Plural: September 18, 2019
Links:
Tags: environment, mapping, plants
Researchers at Brazil’s Federal University of Ceará have published a new dataset “composed of more than 70,000 bug-fix reports from 10 years of bug-fixing activity of 55 projects from the Apache Software Foundation.” — Data is Plural: September 11, 2019
Links:
Tags: technology
Since 1992, the US Bureau of Labor Statistics has collected data on work-related deaths through its Census of Fatal Occupational Injuries. The results are presented as various cross-tabulations — by industry, demographic, circumstances, and more. Related: The agency also publishes data on non-fatal injuries and illnesses. [h/t Elissa Philip Gentry and W. Kip Viscusi] — Data is Plural: September 11, 2019
Links:
Transport for London has launched its Cycling Infrastructure Database, which “contains the location of more than 240,000 pieces of cycling infrastructure in London, including places to park and the location of cycle lanes.” The new information can be found among the agency’s broader collection of cycling data; look for the “CyclingInfrastructure” folder. [h/t Jolyon Whaymand] — Data is Plural: September 11, 2019
Links:
Uber Movement, from the titular ride-hailing company, “shares anonymized data aggregated from over ten billion trips to help urban planning around the world.” Online, you can explore street speeds and estimated travel times for dozens of cities. To download data from the website, Uber requires you to provide your name, email address, and purpose. But they also provide a command-line tool that lets you download street-speed data without any registration. [h/t Michael A. Rice] — Data is Plural: September 11, 2019
Links:
Tags: mapping, transportation
The UN’s World Database on Protected Areas is, it says, “the most up to date and complete source of information on protected areas, updated monthly with submissions from governments, non-governmental organizations, landowners and communities.” It contains structured, geospatial information on more than 245,000 nature reserves, national parks, wildlife sanctuaries, and other kinds of conservation sites. The project provides bulk downloads, an interactive map, country-level statistics, and an API. Previously: The California Protected Areas Database (DIP 2019.07.10). [h/t Giuseppe Sollazzo] — Data is Plural: September 11, 2019
Links:
FiveThirtyEight has built a dataset of 65 college football fight songs, which contains each song’s name, authors, year written, tempo, duration, and whether it includes various tropes, such as spelling out words or mentioning the school’s colors. Related: FiveThirtyEight’s “Guide To The Exuberant Nonsense Of College Fight Songs,” where you can listen to the songs, read the lyrics, and explore an interactive chart of tempo versus duration. — Data is Plural: September 4, 2019
Links:
The Drama Corpora Project has collected and processed more than 800 plays in German, Greek, Spanish, Russian, Latin, and English. For each play, the project provides a structured-data version of the text, a network diagram, speech distribution metrics, plus several other files and features. [h/t Lynn Cherny] — Data is Plural: September 4, 2019
Links:
Tags: entertainment, language
The 3PFL dataset — Patents and Publications with a Public-Funding Linkage — lists more than 13,000 US patents that have acknowledged federal funding. The dataset, accompanied by a detailed methodology, also links the patents to details about the funding, as well as to scientific publications that stemmed from it. Previously: Patent geography (DIP 2019.07.31). [h/t Gaétan de Rassenfosse] — Data is Plural: September 4, 2019
Links:
Tags: technology
The Central African Republic’s ongoing civil war has pressed more than 600,000 people to flee the country. The violence has also internally displaced another 600,000 people, a phenomenon that the UN's Humanitarian Data Exchange has been tracking. In addition to counts of internally displaced people by locality, the UN’s datasets include a listing of refugee sites and the country's road network. Related: A multimedia presentation of one family's 600-kilometer journey in search of safety. [h/t Becky Band Jain] — Data is Plural: September 4, 2019
Links:
The University of Oxford’s Malaria Atlas Project collects, models, and publishes a range of datasets related to the mosquito-borne disease, including localized incidence rates. You can explore and download the data, layer by layer, through the project’s interactive map. [h/t Clara Burgert-Brucker] — Data is Plural: September 4, 2019
Links:
Missing Migrants tracks migrant deaths around the world. The data is available as XLS, CSV and interactive maps and charts. Related: The US Border Patrol also has a PDF of deaths from 1998 to 2018. [h/t @VMMacchi] — VOA: September 4, 2019
Links:
James E. Cutting, a Cornell University psychology professor, has compiled several datasets on the structure of popular films, including one that indicates the length of each shot in 220 movies from 1915 to 2015. [h/t Igor Schwarzmann + Noah Brier] — Data is Plural: August 28, 2019
Links:
Tags: entertainment, movies
Legal scholar and open-data enthusiast Hanjo Hamann has digitized seventy years of rosters from Germany’s seven federal courts, extracted structured data about the judges, and linked them to their Wikidata IDs. Related: Hamann’s detailed description of the dataset’s historical context and its construction. [h/t Erik Gahner Larsen] — Data is Plural: August 28, 2019
Links:
Tags: law
Government professor C. Lawrence Evans’ dataset of US House "whip counts" describes more than 650 of the informal polls conducted by party leadership — covering 1955–86 for Democrats and 1975–80 for Republicans, on topics as varied as dairy prices, Alaskan statehood, voting rights, and Vietnam. It also indicates how each party member responded. [h/t Neil Malhotra + Janet Box-Steffensmeier] — Data is Plural: August 28, 2019
Links:
Tags: politics
A team led by meta-research pioneer John Ioannidis has developed a dataset of citation metrics for science’s 100,000 most-cited authors. The dataset includes each author’s name, institutional affiliation, number of publications, total citations, “h-index,” and more. For each citation metric, there’s a second version that excludes self-citations. Related: “Hundreds of extreme self-citing scientists revealed in new database” (Nature). — Data is Plural: August 28, 2019
Links:
Tags: science
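To make the “h-index” and the self-citation variant concrete, here is a minimal Python sketch. The per-paper counts are invented for illustration, and the dataset's own methodology may differ in detail from this textbook definition.

```python
def h_index(citations):
    """Return the largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-paper counts: (total citations, self-citations).
papers = [(10, 2), (8, 5), (5, 1), (4, 3), (3, 0)]

# First version: every citation counts.
with_self = h_index(total for total, _ in papers)

# Second version: self-citations are subtracted before ranking.
without_self = h_index(total - own for total, own in papers)

print(with_self, without_self)
```

In this made-up example the author's h-index drops once self-citations are excluded, which is the kind of gap the second version of each metric is meant to expose.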
The OECD’s ADIMA database tracks multinational corporations — Walmart, Toyota, Nestle, etc. — and their subsidiaries. It currently includes economic statistics about each of the world’s 100 largest multinationals, the names and locations of 26,000 subsidiaries, and information about nearly 20,000 of their websites. The OECD says it plans to expand the number of companies in the future. Now you know: In 2016, the companies in the dataset “generated nearly $10 trillion in revenues (almost 20% of global GDP), earned $730 billion in profits and paid $185 billion in taxes,” according to the OECD. — Data is Plural: August 28, 2019
Links:
Tags: business
The Confidence Database is aggregating data from behavioral studies that have asked participants how confident they were in their own assessments. As of its launch earlier this month, the database contains 145 datasets, 8,700 participants, and 4 million individual observations. [h/t Audrey Mazancieux + Doby Rahnev] — Data is Plural: August 21, 2019
Links:
Tags: statistics
On Monday, the British government published a dataset of voting results, by party and parliamentary constituency, for every UK general election since 1918 — merging modern data with a handful of historical sources. — Data is Plural: August 21, 2019
Links:
The TV-NGRAM project pulls 14 TV stations’ data from the Television News Archive and calculates how often each word (and two-word combination) was said during each 30-minute window. Most of the stations’ counts go back 9 or 10 years, and all are updated daily. — Data is Plural: August 21, 2019
Links:
Tags: journalism, language, media
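The counting approach described for TV-NGRAM can be sketched in a few lines of Python. This is an illustration only, not the project's code: the caption snippets are invented, the tokenization is a naive whitespace split, and two-word combinations are not counted across separate captions.

```python
from collections import Counter
from datetime import datetime


def window_start(ts):
    """Round a timestamp down to the start of its 30-minute window."""
    return ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)


def count_ngrams(captions):
    """Map each 30-minute window to a Counter of words and two-word combinations.

    captions is an iterable of (datetime, text) pairs.
    """
    counts = {}
    for ts, text in captions:
        words = text.lower().split()
        bucket = counts.setdefault(window_start(ts), Counter())
        bucket.update(words)  # single words
        bucket.update(" ".join(pair) for pair in zip(words, words[1:]))  # word pairs
    return counts


# Hypothetical caption snippets, invented for illustration.
captions = [
    (datetime(2019, 8, 21, 9, 5), "breaking news tonight"),
    (datetime(2019, 8, 21, 9, 20), "more breaking news"),
    (datetime(2019, 8, 21, 9, 40), "weather update"),
]

counts = count_ngrams(captions)
first = counts[datetime(2019, 8, 21, 9, 0)]
print(first["news"], first["breaking news"])
```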
Joshua Tschantret, a political science Ph.D. candidate at the University of Iowa, has compiled a dataset of 260+ terrorist groups formed between 1860 and 1969. For the purposes of the dataset, “terrorist groups are operationally defined as politically-motivated non-state actors using bombings or assassinations,” Tschantret writes in an introductory article (PDF). About one-third of the groups in the dataset operated in the US, Russia, or China; the rest are spread across dozens of other countries. Related: Additional documentation (PDF). Good to know: On Twitter, Tschantret explains why the Black Panthers are included. [h/t Carla Martinez Machain] — Data is Plural: August 21, 2019
Links:
The Joint Organisations Data Initiative (JODI) coordinates the collection, standardization, and publication of oil and gas data from around the world; the 100+ countries that participate represent the vast majority of global production. The oil data goes back to 2002; the gas data goes back to 2009. Both datasets are updated monthly and track a range of subproducts (e.g., crude oil, diesel, jet fuel) and flows (e.g., imports, exports, production) for each country. Previously: Global oil and gas infrastructure (DIP 2018.06.06) and state-owned oil companies (DIP 2019.05.01). — Data is Plural: August 21, 2019
Links:
Tags: energy
Katherine M. Kinnaird and John Laudun — professors whose research includes cultural analytics and computational folklore studies — have created a dataset of 2,656 TED talks, with metadata and transcripts, and have published a detailed description of the project. [h/t Lynn Cherny] — Data is Plural: August 14, 2019
Links:
Tags: entertainment, media
The Open Power System Data platform has aggregated energy data from across Europe into a series of standardized datasets, including electricity consumption, power plants, and generation capacity. The project has also published an “IT philosophy,” a guide for new users, and a detailed listing of primary sources. — Data is Plural: August 14, 2019
Links:
Tags: energy
ThePLUG, a news site that reports on the black innovation economy, has been collecting data on conferences for black tech professionals. The dataset currently contains 33 events in more than a dozen cities, and lists their costs, year started, contact information, sponsors, and more. [h/t Sherrell Dorsey] — Data is Plural: August 14, 2019
Links:
Tags: race, technology
OurAirports, a community-assisted project that began in 2007, provides bulk data detailing 55,000+ airports and 41,000+ runways, plus listings of airport radio frequencies and global navigation aids. In addition to standard airports, the records include 23 balloonports, 1,000+ seaplane bases, and 11,000+ heliports. Related: “How we created a map of the global architecture of airport runways, which turned out to be a wind map.” [h/t Robin Hawkes] — Data is Plural: August 14, 2019
Links:
Tags: transportation
The London Stage Database “is the latest in a long line of projects that aim to capture and present the rich array of information available on the theatrical culture of London, from the reopening of the public playhouses following the English civil wars in 1660 to the end of the eighteenth century.” The database contains information on more than 50,000 events, which you can search online and download in bulk, and are often supplemented with detailed notes and cast lists. The site also offers a user guide and a detailed explanation of the data’s provenance. (“We hope that visitors to the site will find this frank acknowledgment and foregrounding of the dataset’s history and limitations refreshing rather than frustrating.”) [h/t Ula Klein] — Data is Plural: August 14, 2019
Links:
Tags: entertainment, history
About a third of US states hold a monopoly on the local sale of hard liquor. Some of them — including Virginia, Alabama, Michigan, Utah, and North Carolina — let you download their price lists as spreadsheets. [h/t Christopher Ingraham] — Data is Plural: August 7, 2019
Links:
Tags: alcohol
Brigham Young University’s Antarctic Iceberg Tracking Database provides surveillance on hundreds of floating hunks of ice, past and present. The records cover 1978 plus 1992 through mid-2019; a subset of the database lists 117 icebergs’ daily position, estimated size, and rotation angle. [h/t Robin Hawkes] — Data is Plural: August 7, 2019
Links:
An international team of researchers has compiled a “comprehensive spatial inventory” of nearly 100,000 public health facilities in sub-Saharan Africa. The dataset includes facilities in 50 countries and lists each facility’s name, country, administrative region, type, ownership, and coordinates. [h/t Karen Grepin] — Data is Plural: August 7, 2019
Links:
Tags: healthcare
The government-run Federal Judicial Center publishes a daily-updated “biographical directory” of all judges who’ve served on federal courts — the Supreme Court, appellate courts, district courts, the bygone circuit courts, plus a few others. The directory is presented as structured data, and includes information on the judges’ demographics, educations, professional careers, nominations and more. Related: The University of South Carolina’s Judicial Research Initiative also maintains historical datasets of district and appellate court judges; they contain many of the same variables plus some extras, such as religion and estimated net worth. [h/t Dan Nguyen + Sergio Galletta, Elliott Ash, and Daniel L. Chen] — Data is Plural: August 7, 2019
Links:
Tags: law
The Washington Post and the Charleston Gazette-Mail recently won a year-long legal battle to obtain a large slice of the Drug Enforcement Administration’s data on opioid shipments. (The data had previously been provided to plaintiffs in a federal lawsuit, but a judge had sealed the records from public access.) The Post has begun publishing its findings, as well as a cleaned-up version of the dataset that focuses on “shipments of oxycodone and hydrocodone pills to chain pharmacies, retail pharmacies and practitioners” between 2006 and 2012. The raw, unsealed dataset is also available. Related: A 500-row subset, so you can see what the data looks like before downloading the large files. — Data is Plural: August 7, 2019
Links:
Tags: drugs, healthcare
Duncan Geere has compiled a database of the 48 dogs who participated in the USSR’s space program in the 1950s and 1960s. The information, which also includes details about the canines’ 42 flights, is based on Olesya Turkina's book, Soviet Space Dogs. — Data is Plural: July 31, 2019
Links:
The UK Institute for Government has been updating a spreadsheet of ministers who’ve resigned since 1979, the post each one held, the reasons for resignation, and the prime minister in charge at the time. The spreadsheet, which so far contains 151 resignations through last week, includes a few methodological notes embedded as comments in the header row. [h/t Gavin Freeguard] — Data is Plural: July 31, 2019
Links:
Tags: politics
Researchers at two Swiss universities have created a dataset of inventors’ and applicants’ locations listed in 18.8 million patents filed between 1980 and 2014. The locations, which span 46 countries, are specified both by their geographic coordinates as well as their administrative areas (e.g. city, state, country). [h/t Gaétan de Rassenfosse] — Data is Plural: July 31, 2019
Links:
Tags: technology
A team of researchers at the MIT Media Lab has built a corpus of machine-generated transcriptions from 284,000 hours of talk radio. The transcripts capture approximately 2.8 billion words from 50 semi-randomly selected stations, and include metadata, such as the program name, the speaker’s (guessed) gender, and whether the speaker seemed to be in the studio or on the phone. [h/t Lynn Cherny] — Data is Plural: July 31, 2019
Links:
For nearly two decades, the US Department of Defense has released detailed tables on the foreign military units it has trained. For each training, the information describes the units trained, number of trainees, course name, start and end dates, location, cost, and more. Unfortunately, the government publishes these records only as PDFs. To make the data more accessible, Security Force Monitor, a project of the Columbia Law School Human Rights Institute, has converted the PDFs into an open, queryable database. An associated GitHub repository contains an extensive methodology, the extraction code, and the raw data. [h/t Jamon Van Den Hoek] — Data is Plural: July 31, 2019
Links:
Tags: military
“The Merchant Shipping Act 1835 required all British registered ships of 80 tons or more employed in the coastal trade or fisheries to carry crew agreements and accounts, often referred to as crew lists.” The lists include crew members’ ages, places of birth, previous vessels, and more. Thanks to the National Library of Wales Volunteering Programme, thousands of crew lists from the Welsh port of Aberystwyth, from 1856 to 1914, have been transcribed. [h/t u/cavedave] — Data is Plural: July 24, 2019
Links:
The Standardised Precipitation-Evapotranspiration Index is a metric, calculated from climatic data, that “can be used for determining the onset, duration and magnitude of drought conditions with respect to normal conditions.” The project, based at the Spanish National Research Council, provides both a “near real-time” global drought monitor and a historical database. — Data is Plural: July 24, 2019
Links:
ICESat-2, launched by NASA in September 2018, “is measuring the height of a changing Earth one laser pulse at a time, 10,000 laser pulses per second”; the satellite “allow[s] scientists to monitor the elevation of ice sheets, glaciers, sea ice, and more—all in unprecedented detail.” Its datasets are available to download. [h/t Michael McLaughlin] — Data is Plural: July 24, 2019
Links:
As part of Oak Ridge National Laboratory’s efforts to evaluate America’s hydropower resources, researchers there have developed a system (and corresponding dataset) for classifying all 2.6 million streams in the Lower 48 by size, hydrology, gradient, temperature, and “valley confinement.” Elsewhere, other researchers have assessed the “connectivity status of 12 million kilometres of rivers globally” and have identified “those that remain free-flowing in their entire length”; you can download that data and also explore it online. — Data is Plural: July 24, 2019
Links:
The Water Observatory “provides reliable and timely information about surface water levels of water bodies across the globe.” The locations are based on NASA’s Global Reservoir and Dam Database and the World Wildlife Fund’s Global Lakes and Wetlands Database. Concerned about the accuracy of the boundaries in those databases, the researchers instead treated them as a “collection of potentially interesting water bodies” and then “extracted their polygons from the OpenStreetMap.” Of the 40,000 bodies of water they extracted, they’ve published water level data for roughly 7,000 through the project’s interactive dashboard and API. [h/t Emma Vitz] — Data is Plural: July 24, 2019
Links:
UNICEF compiles child marriage rates for men and women married before ages 15 and 18. The data primarily comes from national census and household surveys, including the Multiple Indicator Cluster Surveys (MICS) and Demographic and Health Surveys (DHS). Related: Researchers at UCLA publish child marriage rates in the US (which are unavailable in the UN data). — VOA: July 23, 2019
Links:
The Database of Global Administrative Areas aims “to map the administrative areas of all countries, at all levels of sub-division.” With 386,735 divisions and counting, “this is a never ending project, but we are happy to share what we have.” Note: “commercial use is not allowed without prior permission.” — Data is Plural: July 17, 2019
Links:
Tags: mapping
The United States’ Foreign Agents Registration Act requires lobbyists who represent foreign governments to file paperwork with the Department of Justice. The database has long been available to browse online; last month, however, the agency added three new features: full-text search, an API, and bulk downloads. [h/t Lachlan Markay + Jack Corrigan + u/surlyq] — Data is Plural: July 17, 2019
Links:
Tags: politics
The PluriCourts Investment Treaty Arbitration Database (PITAD) provides “a comprehensive, regularly-updated and networked overview of all-known investment arbitration cases.” You can download the 1,400+ cases or explore them online, searching by case, arbitrator, investor, or country. Note: PITAD says its data are “strictly for academic use.” Related: My former colleague Chris Hamby’s “The Court That Rules the World” series — “an exposé of a dispute-settlement process used by multinational corporations to undermine domestic regulations and gut environmental laws at the expense of poorer nations,” as the Pulitzer committee put it. [h/t Joel Dahlquist Cullborg] — Data is Plural: July 17, 2019
Links:
Tags: conflict
A group of researchers have collected, parsed, and added metadata to all UN Security Council debates from 1995 through 2017. The dataset includes more than 65,000 speeches (with information about each speaker), extracted from nearly 5,000 meeting transcripts. Related: The authors describe their methodology. [h/t Ronny Patz] — Data is Plural: July 17, 2019
Links:
Tags: United Nations
The CITES Trade Database, named after the Convention on International Trade in Endangered Species of Wild Fauna and Flora, contains information about more than 20 million shipments of wildlife (e.g., live tapirs, sturgeon eggs, wolf skulls) and wildlife products (e.g., venus flytrap extract) since 1975. The database is maintained by a UN agency and includes the year of the shipment; the scientific name of the plant or animal; the type and quantity of the particular thing being traded; their purpose and source; and the country of origin, export, and import. Related: Citesdb, an R package for analyzing the database. — Data is Plural: July 17, 2019
Links:
Tags: United Nations, animals
James Fee has compiled a dataset of more than 400 baseball stadiums from more than 40 leagues around the world; each stadium’s information includes its name, team(s), league(s), and geographic coordinates. — Data is Plural: July 10, 2019
Links:
Tags: sports
With more than 15,000 “super units,” and an even larger number of subdivisions within them, the California Protected Areas Database is “the authoritative GIS database of parks and open space in California.” It’s one of the two main databases that the California Natural Resources Agency publishes regarding protected lands; the other, the California Conservation Easement Database, tracks restricted-use private land. [h/t @cartonaut] — Data is Plural: July 10, 2019
Links:
Tags: environment, mapping
In order to develop its maps of North American ecoregions, the US Environmental Protection Agency consulted with other federal agencies and state agencies, plus the governments of Canada and Mexico. Each “ecoregion” is an area with “similarity in the mosaic of biotic, abiotic, terrestrial, and aquatic ecosystem components with humans being considered as part of the biota.” The maps are available both as PDFs and as geospatial data files, at four levels of increasing specificity. [h/t Brandyn Friedly] — Data is Plural: July 10, 2019
Links:
Tags: environment, mapping
The Open Observatory of Network Interference, run by the Tor Project, “collects and processes network measurements with the aim of detecting network anomalies, such as censorship, surveillance and traffic manipulation.” You can volunteer to run OONI’s tests from your computer or phone; so far, “millions of network measurements have been collected from more than 200 countries since 2012.” You can explore that data online, download it in bulk, and access it via an API. Related: OONI’s blog, which includes reports on some of its findings. [h/t John Emerson] — Data is Plural: July 10, 2019
Links:
Tags: censorship, technology
Last month, the US Federal Emergency Management Agency released two major datasets from its National Flood Insurance Program: more than 47 million insurance policies and more than 2 million insurance claims. The latter includes details on each claim’s property, flood zone, amount paid, and more. Both datasets have been partially redacted to remove personally-identifiable information. [h/t Anna Weber] — Data is Plural: July 10, 2019
Links:
“In this paper, we aim to teach a machine how to make a pizza,” writes a team of computer scientists from MIT and the Qatar Computing Research Institute. One of the key ingredients: 9,213 photos of pizza, with their lists of toppings annotated by Amazon Mechanical Turk workers. [h/t Kristin Houser + Center for Data Innovation] — Data is Plural: July 3, 2019
Links:
Tags: food
Global Mangrove Watch uses satellite data to track the global extent of those coastal intertidal forests; the project’s seven snapshots span 1996 to 2016. Note: To download the data, you’ll need to provide a few details and agree to certain terms and conditions. [h/t Dan Friess] — Data is Plural: July 3, 2019
Links:
Tags: environment, mapping
The Issue Correlates of War project, which started in 1997 with a focus on territorial disputes, gathers “systematic data on contentious issues in world politics.” In addition to its two centuries of territorial claims, the project has also catalogued disputes over rivers, maritime zones, and ethnic groups, and compiled supplementary datasets on colonial history, historical country names, and more. — Data is Plural: July 3, 2019
Links:
Tags: conflict
Dan Salmon, a grad student who specializes in information security, has published data on more than 7 million Venmo transactions, which he downloaded from the mobile payment platform’s public API. “I am releasing this dataset,” he writes, “in order to bring attention to Venmo users that all of this data is publicly available for anyone to grab without even an API key.” Practical: How to make your Venmo transactions private. Related: Salmon explains more, in Wired. Also: In 2018, Hang Do Thi Duc analyzed 200 million public Venmo transactions to show how revealing they could be. [h/t Álex Barredo] — Data is Plural: July 3, 2019
Links:
Tags: business, money, technology
The Administrative Office of the United States Courts posts its annual “wiretap reports”, which provide details on the wiretaps that state and federal judges have authorized. Last week, the agency published its 2018 report; the supplementary data includes each wiretap’s jurisdiction, authorizing judge, date of authorization, type of intercept, number of communications intercepted, total cost, and more. [h/t Chris Zubak-Skees + Steven Rich] — Data is Plural: July 3, 2019
Links:
FiveThirtyEight has collected the text of all 50 state governors’ 2019 annual addresses, and has analyzed the most common words and phrases used by Republican and Democratic governors. — Data is Plural: June 26, 2019
Links:
The United Kingdom’s Department for Education publishes data on its university graduates’ annual earnings 1, 3, 5, and 10 years after graduation, broken down by school attended, subject studied, and demographic characteristics. [h/t Tera Allas] — Data is Plural: June 26, 2019
Links:
Tags: education, money, statistics
This is the spreadsheet that “broke the art world’s culture of silence.” In just a few weeks, Michelle Millar Fisher and anonymous colleagues have collected more than 2,600 self-reported salaries from their fellow curators, managers, interns, and other art-world employees. Related: “It took us three minutes to build this spreadsheet,” the organizers have written in The Art Newspaper. “It is not a perfect survey tool, nor was it ever intended to be. While we’ll work with statistics professionals to review and glean meaningful facts [...] Its primary goal is to catalyse us all into action.” [h/t u/cavedave] — Data is Plural: June 26, 2019
Links:
Tags: art, money, statistics
The Judicial Review of Congress dataset, compiled by Princeton politics professor Keith E. Whittington, “catalogs all the cases in which the U.S. Supreme Court has substantively reviewed the constitutionality of a provision or application of a federal law.” The dataset currently covers 1,308 cases, stretching from the high court’s founding through its 2017 term. For each case, it specifies the statute being reviewed, how long the statute had been in effect, the main constitutional issues at hand, the outcome, and more. [h/t Sheldon Gilbert] — Data is Plural: June 26, 2019
Links:
You can browse NASA’s Image and Video Library online; you can also access it via NASA’s API. Through that interface, you can search by caption, keyword, location, photographer, year created, and other fields; in return, you get structured data on each media file. The library was launched two years ago, bringing together more than 140,000 images, videos, and audio files that had previously been spread across dozens of separate collections. [h/t Seth Donoughe] — Data is Plural: June 26, 2019
Links:
Tags: science, technology
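As a rough illustration of the NASA entry above, here is a minimal Python sketch of building a search request and reading the structured data that comes back. It assumes the public search endpoint lives at `https://images-api.nasa.gov/search` and accepts fields such as `q`, `photographer`, `year_start`, and `media_type`; check NASA's API documentation before relying on those names. The helper functions and the sample response are hypothetical and exist only for this example.

```python
from urllib.parse import urlencode

# Assumed base URL of NASA's Image and Video Library search endpoint.
SEARCH_URL = "https://images-api.nasa.gov/search"


def build_search_url(**fields):
    """Return a search URL, skipping any field whose value is None."""
    params = {name: value for name, value in fields.items() if value is not None}
    return f"{SEARCH_URL}?{urlencode(params)}"


def summarize(response):
    """Pull (nasa_id, title) pairs out of a decoded JSON search response."""
    items = response.get("collection", {}).get("items", [])
    return [
        (record.get("nasa_id"), record.get("title"))
        for item in items
        for record in item.get("data", [])
    ]


# Hypothetical query: images matching "apollo 11" created in 1969 or later.
url = build_search_url(q="apollo 11", photographer=None, year_start=1969, media_type="image")

# Hand-written stand-in for a response; real responses carry many more fields.
sample = {"collection": {"items": [{"data": [{"nasa_id": "example-id", "title": "Example title"}]}]}}
pairs = summarize(sample)
```

To run it against the live service, fetch `url` with any HTTP client, decode the JSON body, and pass the result to `summarize`.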
MUStARD is a corpus of 690 text and video clips “for research in automated sarcasm discovery.” The dataset’s 690 examples — half involving sarcasm, half not — come from Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. Related: Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper), the researchers’ introduction to the dataset. — Data is Plural: June 19, 2019
Links:
Developer Michael Zemel has built an interactive timeline of 282 European kings, queens, emperors, and other monarchs. For each, the data includes his or her name, religion, period of reign, reason for losing power, wars involved in, relationships, and notable events. Zemel has also published a detailed writeup about his inspiration and process, plus the underlying data and code. [h/t Giuseppe Sollazzo + Sophie Warnes] — Data is Plural: June 19, 2019
Links:
Tags: politics
“Most people can name a mammal or bird that has become extinct in recent centuries, but few can name a recently extinct plant.” That’s from a new academic paper that presents “a comprehensive, global analysis of modern extinction in plants.” The paper itself is paywalled, but the dataset — of 571 extinct seed plants, plus other species that have been rediscovered or reclassified — is available to download. Related: World’s largest plant survey reveals alarming extinction rate, a summary of the findings. [h/t Joseph Stirt] — Data is Plural: June 19, 2019
Links:
Discogs, a user-contributed music database and marketplace, publishes “monthly data dumps” listing the millions of artists, labels, and releases in its system. Additional types of data (e.g., user reviews) are available through Discogs’ API. [h/t Jan Willem Tulp] — Data is Plural: June 19, 2019
Links:
Tags: entertainment, music
The Centers for Medicare and Medicaid Services’ National Average Drug Acquisition Cost dataset indicates how much U.S. pharmacies have to pay, on average, to obtain thousands of prescription and over-the-counter drugs. The dataset contains millions of rows — one for each National Drug Code in the survey, for each week since 2013 — but you can also download smaller, weekly slices. The agency also publishes a dataset of changes in these average costs. Previously: Total and average costs for Medicare Part B and Part D prescriptions (DIP 2016.12.14). [h/t data.world] — Data is Plural: June 19, 2019
Links:
Tags: drugs, healthcare
The U.S. Department of the Interior publishes data describing the boundaries of all 420 units of the National Park System. In addition to the 61 officially-designated national parks, the boundaries include the country’s national preserves, national seashores, and 30 other types of special places. — Data is Plural: June 12, 2019
Links:
Tags: environment, mapping
For 220 countries between the 1750s and 2018, the Tax Introduction Dataset tracks “the year of the first permanent introduction at the national level of government of six major taxes, as well as on the top statutory tax rate for that year.” The six taxes are those on personal income, corporate income, inheritance, and general sales, plus VATs and compulsory social security contributions. [h/t Philipp Heimberger + Laura Seelkopf] — Data is Plural: June 12, 2019
Links:
Opportunity Insights, a research and policy institute that uses data analysis to examine economic mobility in the United States, publishes dozens of datasets stemming from their studies, often accompanied by code to replicate their findings. Related: “The radical plan to change how Harvard teaches economics,” a recent profile of Raj Chetty, who co-leads the institute. Bonus: The lecture materials for Chetty’s popular new class, “Using Big Data to Solve Economic and Social Problems.” [h/t Michael A. Rice] — Data is Plural: June 12, 2019
Links:
Tags: economics, money, statistics
The International Consortium of Investigative Journalists and partners have obtained records that detail 8,000+ instances, between 2012 and 2017, in which U.S. Immigration and Customs Enforcement detention centers placed detainees in solitary confinement. For each confinement, the records indicate the detainee’s citizenship, detention facility, dates of confinement, and the stated reasons for it. Note: “ICE said it does not keep records of every solitary confinement placement. Instead it tracks only those cases where detainees were held in isolation for more than 14 days, and where immigrants with a ‘special vulnerability’ were placed in isolation.” [h/t Jason Norwood-Young] — Data is Plural: June 12, 2019
Links:
The Humanitarian Data Exchange has been tracking cases and deaths in the North Kivu Ebola outbreak. The numbers come from the Democratic Republic of the Congo’s health ministry and distinguish between suspected, probable, and confirmed cases; they are available at both the national level and disaggregated into the ministry’s 25 currently-affected health zones. Related: “Ebola cases pass 2,000 as crisis escalates” (Nature). Also related: The World Health Organization’s weekly situation reports. Previously: Data from the 2014 Ebola outbreak (DIP 2018.05.23). [h/t Sam Phinizy] — Data is Plural: June 12, 2019
Links:
Tags: disease, ebola, healthcare
“The BLOND dataset was collected at a typical office building in Germany, with the main occupants being academic institutes and their researchers.” BLOND’s several dozen terabytes of data provide “long-term continuous measurements of voltage and current waveforms” for 74 appliances in the office over several months, including a bunch of computers, a printer, a paper shredder, a space heater, and an electric toothbrush. — Data is Plural: June 5, 2019
Links:
Tags: energy
In a study published last year (preprint PDF here), three Boston-area professors analyzed data from more than 600,000 people who took an online English grammar quiz. In addition to the participants’ answers, the dataset includes their native languages, the age they began learning English, the countries they’ve lived in, gender, age, and more. Related: Scott Chacon’s analysis of the data, and what it might mean for older learners. [h/t George McIntire] — Data is Plural: June 5, 2019
Links:
Tags: language
The Chicago-focused Lawyers’ Committee for Better Housing has built a database of evictions in the city from 2010 to 2017. It aggregates nearly 300,000 evictions to the ward, community area, and Census tract level, and contains metrics on case types, outcomes, legal representation, and more. There’s a user guide, bulk download, and methodology. Previously: The Eviction Lab, an effort to collect eviction data for the entire country (DIP 2018.04.18). [h/t Maya Dukmasova] — Data is Plural: June 5, 2019
Links:
Tags: justice
The International Institute for Democracy and Electoral Assistance’s Voter Turnout Database tracks the number of registered voters, total voter turnout, voting-age population, and associated metrics for elections in more than 200 countries, some going as far back as 1945. Related: The European Parliament’s election results website provides charts and bulk downloads. Also related: “What’s going on with abstention in Europe?,” a recent article by Lorenzo Ferrari and Jacopo Ottaviani. [h/t Gianna Grün + Giuseppe Sollazzo] — Data is Plural: June 5, 2019
Links:
O Say Can You See, a project partially funded by the National Endowment for the Humanities, “documents the challenge to slavery and the quest for freedom in early Washington, D.C., by collecting, digitizing, making accessible, and analyzing freedom suits filed between 1800 and 1862, as well as tracing the multigenerational family networks they reveal.” The project provides several ways to access the data and documents; it covers more than 500 lawsuits, nearly 5,000 people, and tens of thousands of relationships. You can also explore the cases, people, and families online. [h/t Jan Willem Tulp] — Data is Plural: June 5, 2019
Links:
Duncan Geere’s 00s Indie Band Database quantifies 130+ acts from the early-millennium’s indie music scenes. In addition to basic facts, the database also includes several subjective scales: “Guitars to Synths,” “Artsy to Populist,” “Loudness,” and “Coolness.” — Data is Plural: May 29, 2019
Links:
Tags: entertainment, music
The Texas General Land Office’s geospatial data offerings include beach access points, shoreline environmental sensitivity ratings, offshore oil structures, oil and gas leases, and more. Related: “Relinquishing Riches: Auctions vs Informal Negotiations in Texas Oil and Gas Leasing,” an NBER working paper by economists Thomas R. Covert and Richard L. Sweeney; code and data available on GitHub. — Data is Plural: May 29, 2019
Links:
Tags: energy, environment, mapping
Postupci Protiv Funkcionera “is a unique database made by the Center for Investigative Reporting of Serbia, which gives citizens the opportunity to get information in one place about the processes conducted by the Serbian Anti-Corruption Agency against public officials in the period from 2010 to November 2018.” The database contains information on nearly 2,800 proceedings against more than 1,700 officials, and can be downloaded as an RDS file (and opened in R). Kudos: The project has been shortlisted for the 2019 Data Journalism Awards. (Full shortlist here.) — Data is Plural: May 29, 2019
Links:
Tags: corruption
GeoChicas, an initiative to close the gender gap in the OpenStreetMap community, has built an interactive map and dataset that shows which streets in Latin America and Spain are named after women (and the much larger number named after men). So far, they’ve mapped 11 cities in 8 countries, including Barcelona, Havana, Mexico City, and Buenos Aires. — Data is Plural: May 29, 2019
Links:
“Every year, the federal government releases large amounts of data on US schools, districts, and colleges. But this information is scattered across multiple datasets, and changes in data structure make it hard to measure change.” The Urban Institute’s Education Data Explorer aims to fix that by pulling together the Department of Education’s Common Core of Data, Civil Rights Data Collection, Integrated Postsecondary Education Data System, and College Scorecard, plus the Census Bureau’s Small Area Income and Poverty Estimates. You can download custom queries, access the data via an API, or download bulk files for all elementary and secondary schools, school districts, and colleges. [h/t Daniel Wood] — Data is Plural: May 29, 2019
Links:
Tags: education, statistics
The Pudding’s Jan Diehm has identified and analyzed decades of hyphenated last names in seven North American sports leagues: the MLB, NBA, NFL, NHL, MLS, WNBA, and NWSL. The code and data are available to download. Now you know: Two ambi-hyphenates — Pierre-Luc Letourneau-Leblond and Jean-Luc Grand-Pierre — have played in the NHL (and none in any of the other leagues). — Data is Plural: May 22, 2019
Links:
Tags: language, sports, statistics
A team of researchers at the Universidad Nacional Autónoma de México have aggregated the observations of 1,216 studies into a database describing 504 primate species. The traits in the database include body mass, habitat, type of diet, conservation status, and more. — Data is Plural: May 22, 2019
Links:
Tags: animals
At Canada’s highest court, “interveners” are the rough equivalent of amicus brief filers in U.S. Supreme Court cases. Sancho McCann, a student at the University of British Columbia’s law school, has created a dataset of the past ten years of interveners and has analyzed it. For each of the 665 cases from 2009 to 2018, the dataset includes the case name, the previous court, a couple of case classifications, and the names of the interveners (if any). — Data is Plural: May 22, 2019
Links:
Tags: law
The Fiscally Standardized Cities database “makes it possible to compare local government finances for 150 of the largest U.S. cities across more than 120 categories of revenues, expenditures, debt, and assets.” The database, developed by Adam Langley at the Lincoln Institute of Land Policy, covers the years 1977 to 2016 and takes into account the ways in which finances and responsibilities overlap between cities, counties, school districts, and other local governments. [h/t Cezary Podkul] — Data is Plural: May 22, 2019
Links:
Tags: statistics
The Measurement Lab describes itself as “the largest open source Internet measurement effort in the world.” Volunteers run the lab’s tests on their own devices, measuring their internet connection’s speed, latency, and other characteristics. The lab then publishes the data it collects, both as raw output and as BigQuery tables. It also offers a tool for charting internet speeds by location and ISP, based on 240+ million tests generated from 87,000+ cities; you can access the data underlying any chart, and also download the same aggregations directly. [h/t Georgia Bullen] — Data is Plural: May 22, 2019
Links:
Tags: technology
The Death Penalty Information Center maintains a database of all executions in the United States since 1976. (There have been 1,495 so far.) The database tracks the date, method, county, and state of each execution; the name, age, sex, and race of the person executed; and the race and sex of the victims they were convicted of killing. Related: The Marshall Project’s The Next to Die. Previously: Death sentences (DIP 2018.08.01) and executed prisoners' last words (DIP 2019.03.06). — Data is Plural: May 15, 2019
Links:
Tags: justice
The World Cube Association “governs competitions for mechanical puzzles that are operated by twisting groups of pieces,” the most famous of which is the Rubik’s Cube. The association also publishes a database of all competitions, competitors, results, rankings, and more. Related: “Children of the Cube,” by the New York Times’ John Branch. [h/t Michael Höhle + u/cavedave] — Data is Plural: May 8, 2019
Links:
Tags: entertainment, games
A team of biologists has compiled and standardized data on 790+ animal social networks, covering more than 45 species on six continents. The Animal Social Network Repository features networks of wild and captive mammals, reptiles, fish, birds, and insects; the connective data-tissue includes dominance relationships, group memberships, grooming behaviors, and several other types of interactions. — Data is Plural: May 8, 2019
Links:
Tags: animals
Publishers Weekly’s Translation Database tracks books of fiction and poetry that have been translated into English and published in the United States. The database, which contains more than 7,200 entries since 2008, includes the books’ original languages and countries of publication, the authors’ and translators’ names and genders, the publishers’ names, publication years, prices, and ISBNs. Related: “Will Translated Fiction Ever Really Break Through?” a recent Vulture article by Chad Post, who created the database. — Data is Plural: May 8, 2019
Links:
Team Populism is an initiative that “brings together renowned scholars from Europe and the Americas to study the causes and consequences” of the titular political style. The collaboration has published several datasets, including one that scores the populist rhetoric of 40 countries’ leaders between 2000 and 2018 — a project commissioned by The Guardian, which has visualized the findings and described the methodology. [h/t Erik Gahner Larsen] — Data is Plural: May 8, 2019
Links:
Tags: politics
Using data scraped from BoxRec.com and UFCStats.com, Thomas Richardson analyzed “over 13,800 professional boxers and mixed martial artists of varying abilities” and found “robust evidence that left-handed fighters have greater fighting success.” — Data is Plural: May 8, 2019
Links:
Tags: sports
Last month, Chicago officials launched a public mural registry. So far, the database includes more than 140 pieces, credited to more than 100 artists. About half of the entries specify the mural’s medium (e.g., paint, spray, mosaic) and nearly all indicate the mural’s location and installation year. — Data is Plural: May 8, 2019
Links:
Tags: art
A decade ago, researchers built VizWiz, a smartphone app that allowed blind users to take photos and ask questions about them. For instance: “What color is this?” or “When is the expiration date?” Now 20,000 VizWiz images and questions, plus 200,000 answers, are available to download — part of a contest to develop algorithms for visual question-answering. Related: Be My Eyes, an app that lets you volunteer your visual assistance through a video call. — Data is Plural: May 8, 2019
Links:
Tags: language, technology
University of Montreal PhD candidate Semra Sevi has compiled data on all Canadian federal candidates from 1867 to 2017. The dataset lists each candidate’s gender, occupation, incumbency status, party affiliations, birth year, and electoral results. The tens of thousands of candidates have represented roughly 140 parties. Among them: Canada’s Work Less Party, which has fielded one lone federal candidate, who in 2008 received 1% of Vancouver East’s votes. [h/t Éric Grenier + Peter Loewen] — Data is Plural: May 8, 2019
Links:
Tags: politics
The United Nations’ FAOSTAT provides dozens of country-by-country datasets on agriculture. The datasets include crop and livestock production, imports and exports, fertilizer usage, emissions, and more. Many go back to 1961. (In that year, Afghanistan harvested about 32,000 metric tons of apricots.) Related: Researchers have previously used this data to trace the “increasing homogeneity in global food supplies” over time. Also related: National Geographic’s visualization of that research. [h/t David Svab] — Data is Plural: May 8, 2019
Links:
Through an unofficial API, you can access data on the latest items, weapons, challenges, and other aspects of the global video game phenomenon. — Data is Plural: May 1, 2019
Links:
The MAESTRO dataset gathers recordings from nine years of the International Piano-e-Competition, where “virtuoso pianists perform on Yamaha Disklaviers which, in addition to being concert-quality acoustic grand pianos, utilize an integrated high-precision MIDI capture and playback system.” The MIDI data “includes key strike velocities and sustain pedal positions”; additional metadata contains each performance’s year, composer, and title. Related: OpenAI’s music-composing MuseNet neural network, trained in part on the MAESTRO data. — Data is Plural: May 1, 2019
Links:
Tags: music
A team of researchers has compiled the publication histories of 545 Nobel laureates — 92% of the prize-winners in physics, chemistry, and physiology-or-medicine between 1900 and 2016. The researchers say they spent more than 1,000 hours collecting and validating the data, drawing on the Nobel website, laureates’ personal pages, Wikipedia entries, and the Microsoft Academic Graph (featured in DIP earlier this month). — Data is Plural: May 1, 2019
Links:
USA Today has collaborated with more than 100 of its affiliated newsrooms and the Invisible Institute to gather police disciplinary records “from thousands of state agencies, prosecutors and local police departments” around the country, creating “the biggest collection of police misconduct records” ever assembled. They’re starting to make the records public, beginning with a database of 30,000+ officers who’ve had their certifications revoked. The database lists each officer’s name, state, agency, and year decertified. It includes records from 44 states, but you won’t find Massachusetts in it, for instance, because the state doesn’t license police officers. And although there are a handful of records from New York state, none regard NYPD officers; that’s in part because the country’s largest police force keeps its misconduct cases secret. (Last year, colleagues at BuzzFeed News published a database of 1,800 NYPD officers accused of misconduct, based on some of those secret records, obtained from a source who requested anonymity.) — Data is Plural: May 1, 2019
Links:
The browseable and downloadable National Oil Company Database, a project of the Natural Resource Governance Institute, pulls together official data on nearly 100 metrics concerning 71 oil/gas companies owned by 61 countries. For instance: Petróleos de Venezuela, S.A., reported transferring roughly $5.5 billion to its government in 2016, down from nearly $28 billion in 2013; Saudi Aramco produces the equivalent of 13 million barrels of oil daily; and in 2017, Russia’s Rosneft generated approximately $283,000 in revenue per employee. [h/t Rachel Ziemba] — Data is Plural: May 1, 2019
Links:
Tags: energy
“In a Cox proportional hazards model, which covariates are associated with the odds (or hazard ratios) being ever in your favor?” To find out, Brett Keller created a spreadsheet of all 24 tributes in the 74th Hunger Games, including the districts from which they hailed, their ages, and how many days they survived. — Data is Plural: April 24, 2019
Links:
Tags: books, entertainment, movies
Derek M. Jones analyzes software-engineering data. Recently, he convinced a small software company to release a dataset documenting its internal time estimates, spanning 10 years, 20 projects, and 10,000+ tasks. For each task, the dataset indicates the number of hours it was predicted to take, how long it actually took, the (anonymized) developers it was assigned to, and more. [h/t Erik Bern] — Data is Plural: April 24, 2019
Links:
Tags: business, technology
Chicago has become the first city to publish detailed data from ride-hailing services, such as Uber and Lyft. Last week, officials released three datasets — on (anonymized) drivers, vehicles, and trips. The driver and vehicle datasets cover early 2015 through December 2018. The trip dataset covers only November and December 2018; even so, it includes more than 17 million rides. For each ride, the records contain the rough pickup and dropoff location, duration, the approximate fare and tip, and more. [h/t Sharon Machlis + Dan Nguyen + Karl Sluis + Michael A. Rice] — Data is Plural: April 24, 2019
Links:
Tags: transportation
The Archigos dataset provides historical data on the leaders of nearly 200 countries between 1875 and 2015. The dataset — a collaboration between political scientists Hein Goemans, Kristian Skrede Gleditsch, and Giacomo Chiozza — includes basic demographic information, plus categorizations of how each leader came to power, how they lost it, and their post-office fate. Now you know: No UK prime minister has died in office since 1865; José María Velasco Ibarra became president of Ecuador five separate times, and was removed by coup four times; Tunisian president Beji Caid Essebsi is 92 years old. [h/t Jeffrey Sachs] — Data is Plural: April 24, 2019
Links:
Tags: politics
Varieties of Democracy bills itself as “a new approach to conceptualizing and measuring democracy” — one that “reflects the complexity of the concept of democracy as a system of rule that goes beyond the simple presence of elections.” The project scores countries annually on five high-level aspects of democracy, which are further broken down (by thousands of country-experts, based on a detailed codebook) into hundreds of more granular “indicators,” such as how often the government publicly attacks the judiciary, the extent to which authorities respect religious freedom, and the proportion of journalists who are women. Version 9 of the dataset, released earlier this month, covers 1789 to 2018 and includes 202 countries. [h/t John Polga-Hecimovich] — Data is Plural: April 24, 2019
Links:
Tags: government, politics
The question: How many bags of Skittles must you open before finding two identical color-distributions? The answer: “82 days, 13 boxes, 468 packs, and 27,740 individual Skittles later [...]”. The data: available on GitHub. [h/t u/cavedave] — Data is Plural: April 17, 2019
Links:
Tags: statistics
The London Lives initiative “makes available, in a fully digitised and searchable form, a wide range of primary sources about eighteenth-century London, with a particular focus on plebeian Londoners.” As part of the project, digital historian Sharon Howard has compiled a dataset of 2,894 Westminster coroners’ inquests from 1760 to 1799. The fields include the date of death, the name of the deceased, the cause of death, the coroner’s verdict, and more. Bonus: A recent Twitter thread from Howard highlighting more datasets. — Data is Plural: April 17, 2019
Links:
Tags: healthcare, history
The Microsoft Academic Knowledge Graph, published under an Open Data Attributions license, describes 8+ billion relationships between scientific papers, their authors, affiliated institutions, conferences, journals, fields of study, and more. The data can be downloaded and also queried online through a SPARQL interface. [h/t Michael Färber] — Data is Plural: April 17, 2019
Links:
Tags: science
The R Street Institute has converted the last five decades of successful Supreme Court confirmation hearings into a spreadsheet, with one row for each statement, question, and answer. The 15 transcripts begin with William Rehnquist’s 1971 hearing and end with Neil Gorsuch’s in 2017. (Robert Bork’s failed nomination is excluded, and Brett Kavanaugh’s 2018 transcript is not yet available.) [h/t Zachary Agatstein + Alex Spurrier] — Data is Plural: April 17, 2019
Links:
The International Consortium of Investigative Journalists, along with media partners in dozens of countries, has been compiling a cross-border database of medical-device safety alerts. The alerts include recalls as well as less-urgent notifications published by health authorities and manufacturers. You can download the public database, which so far includes 90,000+ notices for devices in 18 countries. The records include the date and type of notice; a device identifier; the reason for the alert; a classification of its severity; and more. Related: The Implant Files, an investigative series by the consortium, based on the data. — Data is Plural: April 17, 2019
Links:
Tags: healthcare, technology
To study the relationship between artificial light and “flight calling” among nocturnally-migrating species, a team of researchers examined 70,000 instances of birds colliding with buildings in Chicago. [h/t Ben Winger] — Data is Plural: April 10, 2019
Links:
Tags: animals, architecture
Boston College’s Center for Retirement Research compiles detailed financial data on state and local public pension plans. The database covers fiscal years 2001–18 and includes 180 public pension plans, which together “account for 95 percent of state/local pension assets and members in the US.” [h/t Cezary Podkul] — Data is Plural: April 10, 2019
Links:
Tags: money
From Nate Silver: “we’ve been publishing forecasts for more than a decade now, and although we’ve sometimes tried to do an after-action report following a big election or sporting event, this is the first time we’ve studied all of our forecast models in a comprehensive way.” You can now explore and download thousands of FiveThirtyEight’s predictions about sports and politics (and their outcomes). [h/t Gavin Freeguard] — Data is Plural: April 10, 2019
Links:
Political science professor Nils B. Weidmann and collaborators have taken tens of thousands of reports — published by the AP, AFP, and BBC Monitoring — of political protests in autocratic countries and have turned them into structured data. The resulting Mass Mobilization in Autocracies Database is available to download (free registration required), and comes with documentation and code examples. The database currently covers 2003–15, with data for 2016–17 in the works. — Data is Plural: April 10, 2019
Links:
“The Rulers, Elections, and Irregular Governance (REIGN) dataset describes political conditions in every country each and every month. These conditions include the tenures and personal characteristics of world leaders, the types of political institutions and political regimes in effect, election outcomes and election announcements, and irregular events like coups, coup attempts and other violent conflicts.” The latest dataset covers 200 countries, from 1950 to the present, and includes dozens of variables for each monthly snapshot. [h/t Erik Gahner] — Data is Plural: April 10, 2019
Links:
“In 1949, an Italian Jesuit priest named Roberto Busa presented a pitch to Thomas J. Watson, of I.B.M.,” according to a New Yorker article principally about the Enron email archive. “Busa was trained in philosophy, and had just published his thesis on St. Thomas Aquinas, the Catholic theologian with a famously unmanageable œuvre.” Watson agreed to help, “and, for the next thirty years, Busa encoded sixty-five thousand pages of Thomist text so that it could be word-searched, cross-referenced, and what we now call hyperlinked.” The Index Thomisticus became “the first corpus to be primed for digital scholarship,” and is available online to search and download. — Data is Plural: April 3, 2019
Links:
The Virginia Institute of Marine Science at The College of William & Mary maintains shoreline inventories for Virginia, Maryland, and parts of Delaware and North Carolina. The datasets include geospatial information about land use, vegetation, different types of structures (e.g., jetties, bulkheads, docks, boathouses), and more. [h/t Susie Cambria] — Data is Plural: April 3, 2019
Links:
To test the “moralizing gods” hypothesis (which posits that “belief in morally concerned supernatural agents culturally evolved to facilitate cooperation among strangers in large-scale societies”), the authors of a recent paper in Nature “coded records from 414 societies that span the past 10,000 years from 30 regions around the world, using 51 measures of social complexity and 4 measures of supernatural enforcement of morality.” The dataset is available to download. Findings: “Our analyses not only confirm the association between moralizing gods and social complexity, but also reveal that moralizing gods follow — rather than precede — large increases in social complexity.” [h/t Juan Moreno-Cruz + Peter Irvine] — Data is Plural: April 3, 2019
Links:
The UNESCO Institute for Statistics collects country-level data on the number of teachers, teacher-to-student ratios, and related figures. You can download the data or explore it in UNESCO’s eAtlas of Teachers or their interactive visualization of teacher supply in Asia. — Data is Plural: April 3, 2019
Links:
Tags: United Nations, education
To mark the four-year anniversary of the Saudi-led bombing campaign in Yemen, the Yemen Data Project last week released civilian casualty estimates for the entire air war. The project’s researchers collect and cross-reference data from a range of sources, including news reports, social media, video footage, local authorities, and NGOs; their published data contains dates, locations, and casualty estimates for more than 19,000 air raids. As seen in: “Saudi Strikes, American Bombs, Yemeni Suffering: How Saudi Arabia’s war tactics have fueled Yemen’s humanitarian crisis” (New York Times, December 2018). [h/t Andrea Carboni] — Data is Plural: April 3, 2019
Links:
Tags: conflict
From Alexis C. Madrigal, writing at The Atlantic: “Now, a decade since Uber blazed the trail, and half that since the craze faded, we built a spreadsheet of 105 Uber-for-X companies founded in the United States, representing $7.4 billion in venture-capital investment. We culled from lists, dug in Crunchbase, and pulled from old news coverage. It’s not a comprehensive list, but it is a large sample of the hopes and dreams of the entrepreneurs of the time.” — Data is Plural: March 27, 2019
Links:
Tags: business, technology
University of Tasmania Ph.D. candidate Shaun T. Brooks has created a geospatial dataset of “all buildings and disturbance detected across Antarctica, manually digitised from Google Earth images.” The dataset includes research stations, lighthouses, weather stations, historic sites, and more. [h/t Jasmine Lee] — Data is Plural: March 27, 2019
Links:
Tags: infrastructure, mapping
“The Foundations of Rebel Group Emergence (FORGE) Dataset examines the roots of rebellion by considering the characteristics and activities of the ‘parent’ organizations from which rebel groups emerged,” plus details such as “the organization's ‘birthdate’ and founding location, initial goals, ideology, and ethnic/religious foundations.” The new dataset, developed by the University of Arizona’s Jessica Maves Braithwaite and the University of Maryland’s Kathleen Gallagher Cunningham, contains 430 rebel groups active between 1946 and 2011. [h/t Jori Breslawski + Michael Poznansky] — Data is Plural: March 27, 2019
Links:
Phenology (literally: “the science of appearance”) is the location-and-species-specific study of recurring plant and animal phenomena, such as the annual arrivals and departures of migratory birds. The USA National Phenology Network collects observational data from thousands of citizen scientists, professional researchers, NGOs, and other groups; assesses the data’s quality; and makes it available to explore and download. Previously: The flowering dates of Kyoto’s Prunus jamasakura cherry trees going back to the 9th century (DIP 2017.04.05). [h/t Greta Kaul] — Data is Plural: March 27, 2019
Links:
FiveThirtyEight has compiled a dataset of all U.S. special counsel, independent counsel, and special prosecutor investigations since 1973 — and the people charged in them. Related: FiveThirtyEight’s visual comparison of the Mueller probe to other investigations. Bonus: FiveThirtyEight’s Amelia Thomson-DeVeaux has also been tracking major lawsuits related to President Trump and his administration; that dataset currently contains 45 civil cases and 6 criminal cases. — Data is Plural: March 27, 2019
Links:
New York City requires the owners of buildings with rooftop water tanks to get the vessels inspected annually for things like sediment, bacteria, and dead bugs. The city publishes a dataset of the owner-reported results, based on 15,000 inspections, mostly from 2015–17. Unfortunately: “A review of city records indicates that most building owners still do not inspect and clean their tanks” ... and the “city can’t even say with certainty how many there are or where they are located” ... and in “almost every case the [bacteriological] tests are conducted only after the tanks have been disinfected.” [h/t Zack Quaintance] — Data is Plural: March 20, 2019
Links:
Tags: mapping, statistics
Security firm Rapid7’s Project Sonar “conducts internet-wide surveys across more than 70 different services and protocols to gain insights into global exposure to common vulnerabilities.” Much of the data (on DNS responses, SSL certificates, and more) can be bulk-downloaded through the company’s open data portal without an account, and historical data and the most-current data are available with a free account. Related: Project Sonar: An Underrated Source of Internet-wide Data (Patrik Hudak). Also: Rapid7’s guide to using their open data API with R. [h/t Sharon Machlis] — Data is Plural: March 20, 2019
Links:
Tags: technology
“[W]hy are so many cities and metropolitan areas still split along racial lines? And what is the role of local government in reinforcing those divides? To answer those questions, Governing conducted a six-month investigation of black-white segregation in the small cities of downstate Illinois.” As part of the investigation, the magazine calculated (and published) school and residential segregation metrics for hundreds of U.S. metropolitan areas, based on the latest Department of Education and Census Bureau data. Related: “The Most Diverse Cities Are Often The Most Segregated” (FiveThirtyEight, 2015). [h/t Mike Maciag] — Data is Plural: March 20, 2019
Links:
The Council of State Governments’ annual Book of the States compiles 50-state reference tables on a range of topics, including elections, finances, courts, and more. It has been published since 1935, and the tables for the past decade-plus are available as spreadsheets. Now you know: The chief justice of the California Supreme Court makes $256,059 per year — the highest compensation for any state judge, and nearly double New Mexico’s top judge, according to 2018’s Table 5.4. [h/t Cezary Podkul] — Data is Plural: March 20, 2019
Links:
Tags: elections, statistics
“Produced by the OECD Sahel and West Africa Club, Africapolis.org is the only comprehensive and standardised geospatial database on cities and urbanisation dynamics in Africa. Combining demographic sources, satellite and aerial imagery and other cartographic sources, it is designed to enable comparative and long-term analyses of urban dynamics - covering 7,500 agglomerations in 50 countries.” You can download the data — which includes historical populations, urbanization metrics, and geospatial outlines — and also explore it online. [h/t Rafael Prieto Curiel] — Data is Plural: March 20, 2019
Links:
Tags: statistics
Researchers based at Chicago’s Lincoln Park Zoo have published “life expectancy estimates for hundreds of vertebrate species based on carefully vetted studbook data from North American zoos and aquariums.” Their dataset includes “sex-specific median life expectancies as well as sample size and 95% confidence limits for each estimate.” — Data is Plural: March 13, 2019
Links:
The magazine Psychology Today hosts paid listings for therapists, who advertise their services to prospective patients. Andrew Thompson has created a dataset of the 50,000+ U.S. listings (as of October 2018), with each therapist’s name, city, specialties, and subject areas. — Data is Plural: March 13, 2019
Links:
Tags: healthcare
FiveThirtyEight is tracking who’s endorsing whom to be the Democrats’ 2020 presidential nominee. The site has published a methodology describing its approach, plus the underlying data, which includes each endorser’s name, state, relevant position, and other details. (According to the site’s formula, Sens. Cory Booker and Kamala Harris are currently leading, although almost entirely based on home-state endorsements.) — Data is Plural: March 13, 2019
Links:
Stanford University’s Big Local News project has compiled data from 100,000+ daily situation reports (known as “SIT-209”s) filed by federal firefighting authorities, detailing their efforts to suppress large wildfires. The dataset covers 2014 to 2017, and includes 240+ variables from each report, including estimated costs, damaged/destroyed buildings, injuries, fatalities, and more. Related: Eric Sagara’s quick introduction to the dataset. — Data is Plural: March 13, 2019
Links:
“Thousands of people report workplace discrimination to the government each year. Employers are rarely held accountable,” according to an investigation by the Center for Public Integrity. Reporters Maryam Jameel and Joe Yerardi “analyzed eight years of complaint data — through fiscal 2017 — from the [U.S. Equal Employment Opportunity Commission] as well as its state and local counterparts, reviewed hundreds of court cases and interviewed dozens of people who filed complaints.” The data (on more than 3.7 million allegations and their outcomes) and code are available online. Related: A visual exploration of the data. Previously: Two decades of workplace sexual harassment complaints (DIP 2017.12.06). [h/t Reddit user "cavedave" + Giuseppe Sollazzo] — Data is Plural: March 13, 2019
Links:
A few years ago, a team of scientists examined the shapes of 49,000 bird eggs belonging to 1,400 different species. You can download their calculations of each species’ average egg length, asymmetry, and ellipticity, which formed the basis of a graphics-forward article in Science Magazine. [h/t Sophie Warnes] — Data is Plural: March 6, 2019
Links:
For a recent article in The Pudding, Amber Thomas and two data assistants “recorded every rule listed in each dress code” at 481 public high schools in 36 states, plus “the words used in the dress code’s rationale, as well as any listed sanctions for breaking the dress code.” The 15,000+ rules and 1,470 sanctions are available to download. — Data is Plural: March 6, 2019
Links:
Tags: education
The UNESCO Institute of Statistics compiles data on “internationally mobile” university students, including annual numbers of students by country of origin and country of study. Related: UNESCO's interactive map of student flows. [h/t Francisco Marmolejo] — Data is Plural: March 6, 2019
Links:
Tags: United Nations, education
The U.S. Department of Agriculture’s CropScape website provides interactive access to the agency’s Cropland Data Layer — “a raster, geo-referenced, crop-specific land cover data layer created annually for the continental United States using moderate resolution satellite imagery and extensive agricultural ground truth.” You can use CropScape to filter the data’s acreage estimates (for more than a hundred different crops) by state, county, or custom-drawn geographies — or download the complete data in bulk. [h/t Katie McGaughey] — Data is Plural: March 6, 2019
Links:
Tags: agriculture
The Texas Department of Criminal Justice publishes a list of each death row inmate executed since 1982 — the year the state resumed capital punishment. In addition to providing basic demographic information, the listing also links to transcriptions of the inmates’ final statements. And although the state doesn’t provide the statements as structured data, Zi Chong Kao has created a spreadsheet of them (plus additional details extracted from the state’s website) for his interactive tutorial, Select Star SQL. Related: “‘Love’ Is the Most Common Word in Death Row Last Statements” (Will Young, Oct. 2018). [h/t Noah Veltman] — Data is Plural: March 6, 2019
Links:
The Professional Disc Golf Association publishes a spreadsheet of flying objects officially approved for use in competition. [h/t Ryan Maus] [Note, 2019-02-20: Original item included incorrect link, now fixed.] — Data is Plural: February 20, 2019
Links:
Tags: sports
Thanks to a 2015 state bill, when California law enforcement agencies obtain search warrants for digital communications (or are granted access to such information in an emergency), they must notify the people whose information they targeted. The state’s Department of Justice publishes data about these notifications, including the agency name, the grounds for the warrant, the nature of the investigation, the companies searched (e.g., AT&T, Verizon, Google, Facebook), and more. As seen in: “San Bernardino County Sheriff's electronic surveillance use — already highest in state — continues to surge” (Palm Springs Desert Sun, Jan. 2019). — Data is Plural: February 20, 2019
Links:
MyEU.uk’s interactive map lets you search and explore tens of thousands of European Union–funded projects in the United Kingdom, aggregated from a range of official sources. The initiative, which opposes Brexit, has published its data-collection and data-processing code as well as a spreadsheet of all projects it has identified. [h/t Jovi Juan] — Data is Plural: February 20, 2019
Links:
The Academy of Motion Picture Arts and Sciences website hosts two searchable databases related to their annual awards show: one of nominees and winners, and another of acceptance speeches. The Academy doesn’t provide direct downloads, but many folks have created structured datasets from the records. For instance: Statistics professor Adam B. Kashlak has built a dataset that combines speech word-counts, Best Picture winners’ budgets, and total broadcast length. And: Alex Albright’s analysis from a few years ago, “I’d Like to Thank the Academy… for making this data available,” is based on her dataset of all speeches from the 2010–14 broadcasts. [h/t Jay Arthur] — Data is Plural: February 20, 2019
Links:
The Global Power Plant Database, published by the World Resources Institute, “is a comprehensive, open source database of power plants around the world” and contains “information on plant capacity, generation, ownership, and fuel type.” The current edition, released in June 2018, covers 28,600+ power plants in 164 countries — including more than 1,000 each in Brazil, Canada, China, Great Britain, France, and the United States. Previously: U.S. power plants (DIP 2016.02.10). [h/t Kelly Rose + Paul Deane] — Data is Plural: February 20, 2019
Links:
Tags: energy
Drawing upon a fan wiki, Matt Laessig has created a spreadsheet of all 889 obstacles in the first 10 seasons of American Ninja Warrior. (Free registration required to download.) [h/t Ilan Brat] — Data is Plural: February 13, 2019
Links:
Tags: sports, television
The CDC’s State Tobacco Activities Tracking and Evaluation system tracks “current and historical state-level legislative data on tobacco [and now also e-cigarette] use prevention and control policies.” The system’s datasets provide quarterly snapshots — going back to 1995 — of rules concerning taxes, youth access, licensing, fire safety, and more. — Data is Plural: February 13, 2019
Links:
The Open Knowledge Foundation Deutschland and OpenCorporates have partnered to make Germany’s official business register available to download in bulk. The dataset contains basic information about more than 5 million German companies, and more than 4 million associated officers. Note: Although the dataset’s landing page is written in German, its documentation is available in English. Related: Joachim Gassen’s initial analysis of the companies’ locations, using R. [h/t Sharon Machlis] — Data is Plural: February 13, 2019
Links:
Tags: business
The Resources and Conflict Project’s Rebel Contraband Dataset “measures if and how rebel groups earn income from the exploitation of natural resources or criminal activities.” The dataset spans 1990–2015, covers more than 70 countries, and specifies dozens of types of resources — such as oil, cannabis, gold, tea, and timber. [h/t Erik Gahner] — Data is Plural: February 13, 2019
Links:
Tags: conflict, environment
The U.S. Forest Service’s Forest Inventory and Analysis program tracks “trends in forest area and location; in the species, size, and health of trees; in total tree growth, mortality, and removals by harvest; in wood production and utilization rates by various products; and in forest land ownership.” It also “serves as perhaps the largest publicly available” dataset of “downed and dead wood.” The inventory is available to download and comes with user guides. — Data is Plural: February 13, 2019
Links:
To support its data-driven feature, “30 Years of American Anxieties,” The Pudding gathered 20,000 questions posed to legendary advice columnist Dear Abby. — Data is Plural: February 6, 2019
Links:
Tags: language, miscellaneous
The District of Columbia’s taxi trip data covers 2015–17 and includes each trip’s pickup and dropoff location, mileage, total fare, tip amount, and other details. Previously: Chicago and NYC taxi rides (DIP 2016.12.07). [h/t Richard Sigman] — Data is Plural: February 6, 2019
Links:
Tags: transportation
In the course of investigating why Oklahoma’s female incarceration rate is so high, The Frontier and the Center for Investigative Reporting obtained “a decade’s worth of state prison data never before analyzed by the state itself.” The data includes information about each prisoner, their prison sentences, and their entries and exits from Department of Corrections supervision. [h/t Dan Nguyen] — Data is Plural: February 6, 2019
Links:
For a recent analysis of Trump administration turnover, FiveThirtyEight compiled a dataset of the last seven presidents’ cabinets — covering the 24 positions included in Donald Trump’s cabinet. (As author Nathaniel Rakich notes, “Not every president designates the same positions to be in the Cabinet.”) The dataset includes each cabinet member’s position, start date, departure date, and total days in office. — Data is Plural: February 6, 2019
Links:
An international team of researchers has created a dataset of 343 cities’ CO2 emissions. The researchers aggregated and standardized the emissions data — largely self-reported — from three sources: the Carbon Disclosure Project, the Bonn Center for Local Climate Action and Reporting, and a new project at Peking University. The dataset includes cities large and small, from Lagos and Shanghai to Kadıovacık, Turkey (pop. 216) and Brisbane, California (pop. ~4,700). In addition to emissions, the dataset also provides contextual information about the cities, such as average household sizes and gasoline prices. — Data is Plural: February 6, 2019
Links:
Tags: environment
“Shane Nackerud needed to know: Does 89.3 the Current play the Replacements every day?” To figure it out, the University of Minnesota librarian extracted track listings from 1.1 million @currentplaylist tweets from 2009 through 2018. He’s also published the total play counts by artist and the raw data. [h/t Kent Gerber + Amy Riegelman] — Data is Plural: January 30, 2019
Links:
“Funemployed programmer” Colin Morris looked for all the times where commenters on Reddit added “(sp?)”, or a related annotation, to their remarks. E.g., “SF is putting on quite a show, especially Kapernick (sp?).” Morris then compiled a dataset of the words that preceded those annotations, accompanied by examples of their usage. [h/t Rich Posert] — Data is Plural: January 30, 2019
Links:
Tags: language, social media
Christina Isabel Zuber and Edina Szöcsik’s Ethnonationalism in Party Competition dataset compiles ratings for more than 200 political parties in 22 European countries. Experts rated the parties twice — first in 2011, and then again in 2017 — on a range of factors, such as the centrality of ethnonationalism to the parties’ platforms, and their positions on territorial autonomy for minorities. (Dataset access requires providing a name and email address.) [h/t Erik Gahner] — Data is Plural: January 30, 2019
Links:
Since 1997, the Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN) algorithm has used satellite imagery to estimate rainfall rates around the world. The system’s hourly, daily, monthly, and annual estimates can now be explored online and downloaded. — Data is Plural: January 30, 2019
Links:
Tags: climate
On Thursday, Sarah Ryley, Sean Campbell, and I published a deeply-reported investigation into U.S. cities’ failure to solve shootings — a year-long collaboration between The Trace and BuzzFeed News. To reach our quantitative findings, we analyzed (and standardized) three major FBI datasets, internal data from 22 police departments, and a database of Baltimore victims and suspects. Data, code, and methodologies for the analyses are available on GitHub. Related: Last year, The Washington Post published Murder with Impunity, a series examining unsolved homicides; their data, on 52,000+ homicides in 50 cities, is also available on GitHub. — Data is Plural: January 30, 2019
Links:
The U.S. Department of Agriculture’s Dairy Data Set contains annual tabulations of production, sales, imports, exports, consumption, and other economic aspects of “the U.S. dairy situation.” As seen in: “Nobody Is Moving Our Cheese: American Surplus Reaches Record High” (NPR). — Data is Plural: January 16, 2019
Links:
Tags: agriculture, food
In previous centuries, maritime officers kept “detailed log books of the ships’ activities and management,” including observations of the wind and weather. The Climatological Database for the World's Oceans 1750-1850 has digitized a quarter-million entries from such logbooks, originally written in Dutch, English, French, and Spanish, and published them as detailed, structured data. Helpful: Steven Ottens has converted the project’s fixed-width files into tab-delimited data. [h/t Robi Sen + Roger Davies + Topi Tjukanov] — Data is Plural: January 16, 2019
Links:
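For readers unfamiliar with the conversion mentioned above, here is a minimal sketch of turning one fixed-width record into a tab-delimited row. The field names, character offsets, and sample line are all made up for illustration; they are not the project's actual layout, which should be taken from its documentation.

```python
# Hypothetical layout: (field name, start offset, end offset).
# The real logbook files define their own columns and widths.
FIELDS = [("ship", 0, 12), ("date", 12, 22), ("wind", 22, 30)]


def fixed_width_to_tsv(line: str) -> str:
    """Slice a fixed-width line at known offsets and join the fields with tabs."""
    values = [line[start:end].strip() for _, start, end in FIELDS]
    return "\t".join(values)


# Made-up sample record, padded with spaces to the assumed widths.
sample = "EXAMPLE SHIP1800-01-01NNE     "
row = fixed_width_to_tsv(sample)
print(row)
```

Slicing by offset rather than splitting on whitespace matters here because values such as ship names can themselves contain spaces.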
Party Facts is a “collaborative data collection” that links various political-party datasets together. The project has two main tables. One contains basic information about 4,100+ political parties in more than 200 countries, including each party’s mother-tongue name and English translation, year founded, and Wikipedia page. The second table cross-references each party with its unique identifier in 26 external datasets, such as ParlGov (DIP 2018.09.19), The Manifesto Project (DIP 2017.06.21), and the Constituency-Level Elections Archive (DIP 2016.09.28). [h/t Matt Grossmann + Erik Gahner] — Data is Plural: January 16, 2019
Links:
The Census Bureau’s My Congressional District tool lets you browse (and download) demographic, socioeconomic, and business data corresponding to each of the country’s 435 congressional districts. Political scientist Ella Foster-Molina has compiled a historical dataset containing similar information for 1972 to 2014; it also contains details about each district’s representatives — such as their personal characteristics, the committees they served on, and the number of bills they sponsored. [h/t Josh McCrain + Derek Willis] — Data is Plural: January 16, 2019
Links:
The Committee to Protect Journalists maintains a database of journalists who’ve been killed for reasons related to their work. The database goes back to 1992 and contains more than 1,300 entries, with details about the journalists, the circumstances of their deaths, and whether perpetrators have been convicted. More recently, the organization has also begun publishing data on journalists who’ve been imprisoned or gone missing. [h/t Giuseppe Sollazzo] — Data is Plural: January 16, 2019
Links:
Norse World is an “online, open access searchable index and mapping of the foreign place names found in medieval East Norse texts.” Through the project’s interactive map, you can search and download the data. — Data is Plural: January 9, 2019
Links:
If you decide to acquire a new citizenship, do you get to keep your previous one? Are you allowed to renounce it? The Maastricht Center for Citizenship, Migration and Development’s Global Expatriate Dual Citizenship Dataset tracks how 200 countries have, each year since 1960, treated this situation. The extensive documentation provides links to the relevant laws, and descriptions of how each country’s rules have changed. [h/t Sam Petulla] — Data is Plural: January 9, 2019
Links:
Researchers at the International Monetary Fund have built a historical database of fiscal crises, defined as “periods of extreme fiscal distress, when governments have not been able to contain large fiscal imbalances leading to the adoption of extreme measures (e.g., debt default and monetization of the deficit).” The researchers, building off of previous work, have “expand[ed] the country coverage to 188 countries, over 1970-2015, more than double the size of the sample relative to many other studies,” and identified 436 distinct episodes of fiscal crisis. [h/t David Tercero Lucas] — Data is Plural: January 9, 2019
Links:
Last spring, the BBC published an archive of 16,000+ sound effects, licensed “for personal, educational or research purposes.” Each audio file is accompanied by a description, categorization, and its length. For instance, the first sound effect on the archive’s page is a 194-second clip described as “two-stroke petrol engine driving small elevator, start, run, stop,” and categorized as “Engines: Petrol.” Not documented, but useful: You can download a CSV of the metadata. Highlight: The one-two punch of “several men snoring, hilariously” and “several men snoring, less hilariously.” [h/t Amy King] — Data is Plural: January 9, 2019
Links:
Monitoring the Future surveys approximately 50,000 eighth-, tenth-, and twelfth-grade students in the U.S. each year. The project, which is funded by the National Institute on Drug Abuse, has been running since 1975. Although best known for its detailed drug-use questions, the surveys also ask questions related to education, labor, sex, race, politics, happiness, and other topics. Public-use versions of the data are available through the National Addiction & HIV Data Archive Program (free registration required). [h/t Dan Kopf] — Data is Plural: January 9, 2019
Links:
The Narrabeen-Collaroy Beach Survey Program has been measuring a major stretch of the Sydney shore every month since April 1976. You can explore the data online and (free registration required) download it. [h/t Robbi Bishop-Taylor + Mitchell Harley] — Data is Plural: January 2, 2019
Links:
The UK Marine Noise Registry tracks “human activities in UK seas that produce loud, low to medium frequency (10Hz – 10kHz) impulsive noise” — including pile-driving, explosives, military sonar, and “acoustic deterrent devices.” For each of the UK’s oil and gas licensing blocks, the registry’s published data counts the number of days that a given type of impulsive noise was generated. Related: Owen Boswarva has built an interactive map of the data. [h/t Giuseppe Sollazzo] — Data is Plural: January 2, 2019
Links:
Tags: audio, energy, environment
A pair of researchers have used satellite imagery to quantify nighttime lights in five urban areas in Niger and Nigeria — Agadez, Katsina, Maradi, Niamey, and Zinder. Describing their findings in a recent issue of Scientific Data, the researchers write, “Our data showed 1) urban illumination fluctuated seasonally, 2) corresponding population fluctuations were sufficient to drive seasonal measles outbreaks, and 3) overlooking these fluctuations during vaccination activities resulted in below-target coverage levels, incapable of halting transmission of the virus.” — Data is Plural: January 2, 2019
Links:
Tags: disease, environment
Melbourne, Australia, has placed dozens of pedestrian-counting sensors across the city, and publishes a dataset of the hourly observations going back to 2009. Now you know: Among the 2.5 million entries so far, the highest count has been the 12,289 pedestrians at the Bourke Street pedestrian bridge between 6pm and 7pm on Friday, October 26, 2018. Bonus: Melbourne’s interactive map of the data. Related: Pedestrian counts from the Brooklyn Bridge and Somerville, Massachusetts. — Data is Plural: January 2, 2019
Links:
Tags: transportation
The Vera Institute of Justice’s recently-expanded Incarceration Trends project combines data from a range of government reports — such as the Census of Jails and the National Corrections Reporting Program — into a single, longitudinal, well-documented dataset. For each county and year, the dataset tallies the number of people admitted to jails and prisons, the average daily incarcerated jail and prison population, and other related details. Many of the counts are also broken down by race, ethnicity, and sex. Bonus: The institute’s interactive map of the data. [h/t Chris Henrichson + Sam Petulla] — Data is Plural: January 2, 2019
Links:
Tags: crime
A team of evolutionary biologists has compiled a dataset describing the size and shape of eggs laid by more than 6,700 insect species. You can explore and download the underlying data, which is based on measurements from 1,756 published sources. [h/t Cassandra Extavour] — Data is Plural: December 19, 2018
Links:
Tags: animals
Open Units is a dataset detailing the total amount of alcohol in 1,000+ beer and cider offerings, “based on information made public by drinks manufacturers, distributors and retailers.” For instance, a 355-mL bottle of Sierra Nevada Pale Ale contains 20 mL of alcohol, the same as a pint of Bud Light. [h/t Giuseppe Sollazzo] — Data is Plural: December 19, 2018
Links:
Tags: alcohol
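The Open Units comparison above comes down to one multiplication: serving volume times alcohol by volume (ABV). The sketch below checks that arithmetic. The ABV figures (5.6% for Sierra Nevada Pale Ale, 4.2% for Bud Light) and the use of a US pint are assumptions made for illustration, not values taken from the dataset.

```python
def alcohol_ml(volume_ml: float, abv_percent: float) -> float:
    """Return the millilitres of pure alcohol in a serving."""
    return volume_ml * abv_percent / 100


US_PINT_ML = 473.176  # assumption: "pint" here means a US pint

# Assumed ABV values, not read from the Open Units data.
pale_ale = alcohol_ml(355, 5.6)
bud_light = alcohol_ml(US_PINT_ML, 4.2)

print(round(pale_ale), round(bud_light))
```

Under those assumptions, both servings round to 20 mL of alcohol.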
Philadelphia’s Department of Records has begun publishing a dataset of all real estate transfers recorded since late 1999. The 3.7 million records include deeds, mortgages, condo declarations, and a few other types of documents. The deed data includes each property’s fair market value, address, grantor and grantee names, various taxes, and more. Bonus: An interactive visualization of the data. Previously: UK property sales (DIP 2016.03.23). [h/t Michael McLaughlin] — Data is Plural: December 19, 2018
Links:
Tags: real estate
The International Monetary Fund’s Global Debt Database brings together “total gross debt” numbers for 190 countries, for the years 1950 to 2017. The database features a detailed methodology and includes indicators of government, household, and corporate debt. — Data is Plural: December 19, 2018
Links:
The CDC’s Small-area Life Expectancy Estimates Project calculates how long someone, born in a given Census tract in 2010–15, might expect to live. The estimates are based on a combination of death records, Census population data, and statistical modeling. Related: “Map: What story does your neighborhood’s life expectancy tell?” (Quartz). Previously: Life expectancy by income, gender, and city (DIP 2016.04.13), and by country (DIP 2017.02.08). [h/t Dan Kopf] — Data is Plural: December 19, 2018
Links:
Tags: death, mapping, statistics
The Pudding’s Internet Boy Band Database is “an audio-visual history of every boy band to chart on the Billboard Hot 100 since 1980.” You can download the underlying data, which is stored in two files: boys.csv and bands.csv. — Data is Plural: December 12, 2018
Links:
Tags: entertainment, music
Academic researcher Adrien Barbaresi has compiled a corpus of thousands of speeches from the German Presidency, Presidency of the Bundestag, Chancellery, and Ministry of Foreign Affairs. The corpus, now in its third version, was first released in 2011. [h/t Adrien Barbaresi] — Data is Plural: December 12, 2018
Links:
The National Renewable Energy Laboratory’s solar datasets measure the average annual and monthly “total solar resource” for the United States, broken down by state, county, ZIP code, and roughly-10-square-kilometer chunks of the country. Bonus: More sun-radiation datasets via this Stack Overflow answer. [h/t Joe Hourclé] — Data is Plural: December 12, 2018
Links:
Common Vulnerabilities and Exposures is a downloadable list of more than 110,000 “publicly known cybersecurity vulnerabilities.” Each vulnerability is assigned a unique identifier (e.g., CVE-2014-0160) and given a description. The National Institute of Standards and Technology’s National Vulnerability Database takes the list and adds more information for each entry, “such as fix information, severity scores, and impact ratings.” That database is available in a variety of bulk downloads and data feeds; you can also search it online. [h/t GitHub user "nanoseconds"] — Data is Plural: December 12, 2018
Links:
Tags: technology
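CVE identifiers such as CVE-2014-0160 follow a fixed pattern: the prefix "CVE", a four-digit year, and a sequence number of four or more digits. A minimal sketch of splitting an identifier into its parts might look like the following; the function name is hypothetical and not part of any CVE or NVD tooling.

```python
import re

# "CVE-", a four-digit year, "-", then four or more digits.
CVE_ID = re.compile(r"^CVE-(\d{4})-(\d{4,})$")


def parse_cve_id(cve_id: str) -> tuple[int, int]:
    """Return (year, sequence number) for a well-formed CVE identifier."""
    match = CVE_ID.match(cve_id)
    if match is None:
        raise ValueError(f"not a CVE identifier: {cve_id!r}")
    return int(match.group(1)), int(match.group(2))


year, sequence = parse_cve_id("CVE-2014-0160")
print(year, sequence)
```

Splitting out the year this way makes it straightforward to group a downloaded list by the year each identifier was assigned.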
A growing number of cities publish detailed data on bicyclist and pedestrian injuries involving cars, including New York City, Chicago, Boston, Seattle, St. Paul, Minn., Chapel Hill, N.C., Tempe, Ariz., Toronto, and London — many through the cities’ “Vision Zero” street-safety initiatives. (Some of the datasets also include car-on-car collisions.) Related: “The most dangerous intersections in Seattle for bicyclists and pedestrians.” [h/t Rachel Schallom + Jeff Asher] — Data is Plural: December 12, 2018
Links:
Tags: injury, transportation
Last month, I Quant NY’s Ben Wellington analyzed New York City’s raw snow plow data, “which had only been viewed 41 times before apparently.” The 250 million–row dataset is, as Wellington notes, “stored in an odd format” — snapshots that indicate, every 15 minutes, the last time each of the city’s street segments was plowed. Related: ClearStreets provides historical data from the City of Chicago’s Plow Tracker; Iowa Department of Transportation also publishes a live plow tracker; Syracuse and Pittsburgh have published historical snow plow data. — Data is Plural: December 5, 2018
Links:
Tags: climate, transportation
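Because each snapshot repeats the most recent plow time for every street segment, the same plowing shows up in many consecutive rows. The sketch below shows one possible way to collapse such snapshots into distinct plow events. The tuple layout and field names (`snapshot_time`, `segment_id`, `last_plowed`) are assumptions for illustration, not the city's actual column names.

```python
def plow_events(snapshots):
    """Collapse repeated snapshots into distinct (segment_id, last_plowed) events.

    `snapshots` is an iterable of (snapshot_time, segment_id, last_plowed)
    tuples. These field names are hypothetical. Timestamps are assumed to be
    ISO 8601 strings, and `last_plowed` may be None for unplowed segments.
    """
    seen = set()
    events = []
    # Process snapshots in time order so events come out chronologically.
    for _snapshot_time, segment_id, last_plowed in sorted(
        snapshots, key=lambda row: (row[0], row[1])
    ):
        key = (segment_id, last_plowed)
        if last_plowed is not None and key not in seen:
            seen.add(key)
            events.append(key)
    return events
```

With the duplicates removed, the number of events per segment is a rough count of how many times each street was plowed.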
SUBTLEXus is a dataset of word frequencies in American English, derived from the subtitles for 8,388 films. The dataset, which covers more than 74,000 words, includes each word’s total frequency, the number of films in which the word appeared, and several other metrics. Bonus: Similar datasets are also available for Chinese and Dutch. [h/t The Language Goldmine] — Data is Plural: December 5, 2018
Links:
The British nonprofit 360Giving helps grantmakers “to publish their grants data in an open, standardised way and helps people to understand and use the data.” Through its GrantNav platform, you can search across more than 300,000 grants — totalling more than £25 billion — given by scores of funders to nearly 180,000 recipients. You can download the results of each search, as well as the underlying datasets. [h/t Enigma Public] — Data is Plural: December 5, 2018
Links:
The New Orleans Police Department’s “Body Worn Camera Metadata” contains the dates, times, durations, and locations for 2.7 million body camera recordings, going back to 2014. Related: The agency publishes similar data for 1.5 million in-car camera recordings. [h/t Alexandre Léchenet] — Data is Plural: December 5, 2018
Links:
Tags: crime, justice, technology
In 2014, author C. M. Taylor began writing a new novel, this time with a twist: He would write the entire story on a laptop intentionally infected with spyware. With the help of the British Library, a program recorded every keystroke Taylor typed and took screenshots every few seconds. The novel, Staying On, was published in October; soon after, Taylor and the library made the spyware recordings available to download. [h/t Dan Hett] — Data is Plural: December 5, 2018
Links:
Tags: books, technology
New York City’s Department of Health publishes a dataset of 8,000+ reported instances of dogs biting humans, mostly from 2015 through 2017. The agency collects the reports “to determine if the biting dog is healthy ten days after the person was bitten in order to avoid having the person bitten receive unnecessary rabies shots.” [h/t Justin Baker] — Data is Plural: November 28, 2018
Links:
Last month, an international team of researchers published the third major version of their Gridded Livestock of the World dataset, which estimates the global distribution of cattle, buffaloes, horses, sheep, goats, pigs, chickens and ducks. The new dataset is based on 2010 statistics and provides estimates at “a spatial resolution of 0.083333 decimal degrees (approximately 10 km at the equator).” — Data is Plural: November 28, 2018
Links:
Tags: agriculture, animals
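The quoted resolution, 0.083333 decimal degrees, is 5 arc-minutes (one-twelfth of a degree). The rough calculation below converts that to a ground distance. It is a sketch that treats the Earth as a sphere with the WGS84 equatorial radius; the function name `cell_width_km` is hypothetical.

```python
import math

EARTH_EQUATORIAL_RADIUS_KM = 6378.137  # WGS84 equatorial radius


def cell_width_km(resolution_degrees, latitude_degrees=0.0):
    """East-west width of a grid cell in km at a given latitude.

    Uses a spherical approximation, so treat the result as an estimate.
    """
    km_per_degree_at_equator = 2 * math.pi * EARTH_EQUATORIAL_RADIUS_KM / 360
    shrink = math.cos(math.radians(latitude_degrees))
    return resolution_degrees * km_per_degree_at_equator * shrink
```

At the equator this gives a little over 9 km per cell, consistent with the "approximately 10 km" in the dataset description. Cells narrow toward the poles, so the same grid cell at 60 degrees latitude is about half as wide.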
Bilateral labor agreements regulate the migration of workers between two countries, and the Bilateral Labor Agreements Dataset aims to catalog as many of these treaties as it can. So far the University of Chicago Law School professors and researchers running the initiative have identified 582 treaties signed between 1945 and 2015. “However, this list is almost certainly underinclusive,” they write. “Many BLAs are not deposited in the major international treaty databases and they often do not receive much, if any, publicity.” [h/t Adam Chilton] — Data is Plural: November 28, 2018
Links:
Tags: economics
The Small World of Words project “is a large-scale scientific study that aims to build a mental dictionary or lexicon in the major languages of the world.” The experiment has asked hundreds of thousands of participants to list their immediate associations with various words (such as “telephone,” “journalist,” and “yoga”). In all, the project has collected more than 15 million responses. You can download the data, examine the project’s analysis pipeline, and explore the responses online. [h/t Lewis Mitchell] — Data is Plural: November 28, 2018
Links:
Tags: language
The German Aerospace Center is publishing global elevation data derived from its TanDEM-X satellite mission. For five years, two satellites orbited Earth together in a formation that allowed their radars to “‘see’ the same land area, but from slightly different perspectives” and to calculate elevations based on those differences. Although the most detailed versions of the data are “subject to restrictions due to the potential for commercial exploitation, and thus requires a scientific proposal,” the least detailed version (which still clocks in at more than 90 gigabytes) can be downloaded for free. [h/t Matt Brealey] — Data is Plural: November 28, 2018
Links:
Tags: mapping
STAPI bills itself as “the first public Star Trek API.” It provides access to structured data not only about the fictional universe (e.g., 6,364 characters, 1,215 spacecraft, and 155 conflicts) but also its intersection with reality (e.g., 5,302 performers, 731 television episodes, 76 soundtracks). [h/t Cezary Kluczyński] — Data is Plural: November 7, 2018
Links:
A recent study revealed the results of “the Moral Machine, an online experimental platform designed to explore the moral dilemmas faced by autonomous vehicles.” The experiment asked participants to decide whether a self-driving car — faced with two deadly options — should stay on course (killing one group of pedestrians) or swerve (killing another). The project “gathered 40 million decisions in ten languages from millions of people in 233 countries and territories,” and a dataset containing every decision is available to download. Read more: “Should a self-driving car kill the baby or the grandma? Depends on where you’re from.” [h/t Walt Hickey] — Data is Plural: November 7, 2018
Links:
Tags: technology, transportation
A team led by University of Kansas professor Ron Francisco has collected and codified data on protests, strikes, and other “coercive acts” in dozens of European countries during the late 20th century. There’s a row for each day of each protest, and each row specifies the issue at stake, the organizers, their target, the type of action, and the location — as well as the number of protesters, arrests, injuries, and deaths. [h/t Alexandre Léchenet] — Data is Plural: November 7, 2018
Links:
The Department of Education requires U.S. universities to report all major gifts from (and contracts with) foreign entities. The agency’s database of these gifts and contracts currently covers 2012 to mid-2018, and includes 18,000+ entries from more than 150 schools. Related: In the wake of Jamal Khashoggi’s murder, the AP’s Collin Binkley and Chad Day used the data to examine colleges’ financial ties to Saudi Arabia. [h/t Meghan Hoyer] — Data is Plural: November 7, 2018
Links:
Tags: education
The Caselaw Access Project aims “to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.” Currently, the project provides an API for fetching data on more than 6 million cases published between 1658 and 2018 — though public access is limited to downloading 500 cases per day. You can also download bulk data for all cases in Illinois and Arkansas, but getting bulk data for other states currently requires a research agreement. [h/t Caitlin Ostroff] — Data is Plural: November 7, 2018
Links:
Tags:
When the Federal Election Commission receives a registration form that contains “questionable information” from a candidate or committee, the agency asks for additional information. If the FEC doesn’t get a proper response, it adds the registration to its dataset of “unverified filers”. Among the 500+ registrations currently on the list: “VoldemortCantStopTheVote.org”, “Department of Treasury,” “Wookie PAC,” and “Al Pacino.” [h/t Chris Zubak-Skees] — Data is Plural: October 31, 2018
Links:
Tags: elections
What happens when coal mines shut down? Money for their cleanup is supposed to be ensured by a system of bonds. But when Climate Home News’ Mark Olalde investigated these remediation funds, he found “a system incapable of dealing with large-scale bankruptcies, amid a declining industry, which severely threatens the environment and future of coal-mining communities across the country.” You can download the data behind Olalde’s findings — including bond databases covering the “23 states that produce 99% of US coal,” obtained via public records requests. [h/t Megan Darby] — Data is Plural: October 31, 2018
Links:
Tags: energy, environment
The U.S. Energy Information Administration uses Form EIA-861 to collect annual data from thousands of electric utilities about their sales, revenue, peak loads, customer counts, energy efficiency savings, and more. More than 3,400 utilities submitted the form (or its shorter cousin, EIA-861S) for 2017, and the data go back to 1990. [h/t Jordan Wirfs-Brock] — Data is Plural: October 31, 2018
Links:
Tags: energy
Earlier this month, Twitter released data on the public activity of “3,841 accounts affiliated with the [Internet Research Agency], originating in Russia, and 770 other accounts, potentially originating in Iran.” Together, the datasets “include more than 10 million Tweets and more than 2 million images, GIFs, videos, and Periscope broadcasts.” Related: My colleague Peter Aldhous used this data — combined with data on 3 million “Russian troll tweets” released this summer by Clemson University researchers and FiveThirtyEight — to examine the Internet Research Agency’s traction before and after the 2016 election. Bonus: Peter’s code. — Data is Plural: October 31, 2018
Links:
Tags: elections, social media
The Federal Highway Administration’s National Bridge Inventory contains detailed data on more than 600,000 “highway bridges” in the United States. The inventory goes back to 1992 and contains scores of fields, including the bridge’s age, condition, design, and materials. Now you know: Texas has the most highway bridges in the inventory, with more than 53,800. Bonus: You can also search the bridges via the unofficial BridgeReports.com. Related: The code the Baltimore Sun used to answer the question, “How safe are Maryland's bridges?” [h/t Christine Zhang] — Data is Plural: October 31, 2018
Links:
Tags: infrastructure
New York City provides the latitude, longitude, ID number, and current status — active, inactive, retired, planned, and removed — of more than 14,700 parking meters. [h/t Zack Quaintance] — Data is Plural: October 10, 2018
Links:
Tags: transportation
Connecticut’s Department of Consumer Protection has released a dataset listing all branded medical marijuana products registered with the state. For each of the nearly 4,000 products so far, the dataset describes the producer, brand name, form of dosage, and chemical potencies — plus links to images of each product and label. [h/t Kristin Hussey] — Data is Plural: October 10, 2018
Links:
Tags: drugs, healthcare
Nager.Date calculates the timing — past, present, and future — of public holidays for more than 90 countries. The holidays can be browsed online, accessed via an API, or downloaded as CSVs (one per country per year). Now you know: Today is Cuba’s Día de la Independencia and Suriname’s Day of the Maroons. [h/t Tino Hager] — Data is Plural: October 10, 2018
Links:
Tags: miscellaneous
England’s public health department generates quantitative “profiles” of the country’s well-being. The metrics include rates of HPV vaccination, dementia, exercise, diabetes, and much more. The results can be downloaded directly, and also accessed via an API. [h/t Sharon Machlis] — Data is Plural: October 10, 2018
Links:
Tags: healthcare, statistics
LawAtlas.org publishes interactive maps that detail state and federal regulations on dozens of public health–related topics. Among them: e-cigarettes, HIV criminalization, fair housing, syringe distribution, and cell phone use while driving. (You can, for instance, use the e-cigarette map to identify all states where vaping is allowed in hotel rooms but prohibited in public parks.) You can download the underlying data, plus documentation about how the laws were categorized. Bonus: The website, run by Temple University’s Center for Public Health Law Research, will also teach you how to map laws yourself. Previously: The Correlates of State Policy Project (DIP 2016.07.06). — Data is Plural: October 10, 2018
Links:
Tags: HIV and AIDS, healthcare
The Florida Department of Transportation publishes its inventory of active permits for billboards and other “outdoor advertising.” For each permit, the dataset provides details about the permit-holder and the structure itself — such as its location, height, whether it’s in a city, and more. [h/t Caitlin Ostroff] — Data is Plural: October 3, 2018
Links:
Tags: mapping
A slew of cities have installed devices to count bicycles that pass through major routes. At least several publish hourly or daily tallies: London, Ottawa, Edinburgh, Seattle, Cambridge, Mass., and the Washington, DC area. New York City provides daily counter-tallies for its East River bridges, but currently only as PDFs. Related: "[Transport for London]’s cycle counter data: initial thoughts" and “What we can learn from Seattle’s bike-counter data.” [h/t Giuseppe Sollazzo] — Data is Plural: October 3, 2018
Links:
Tags: transportation
The UK Department for Transport’s traffic counts calculate the average daily number of vehicles “for every junction-to-junction link on the 'A' road and motorway network in Great Britain.” Likewise, California publishes the average daily traffic, peak hourly traffic, truck traffic, and ramp traffic for each of its state highways. Previously: U.S. interstate highway traffic (DIP 2016.10.05) and public roads (DIP 2018.04.25) [h/t Dave Fisher-Hickey + u/ron_leflore] — Data is Plural: October 3, 2018
Links:
Tags: transportation
PollOfPolls.eu aggregates political polls from 30 European countries. The Vienna-based initiative has, for instance, collected and standardized more than 1,000 individual polls on British parliament since 2014, and 60 on the Bavarian state elections. You can download each set of standardized data as either JSON or CSV. [h/t Jovi Juan] — Data is Plural: October 3, 2018
Links:
Tags: elections
The U.S. Fish & Wildlife Service publishes a database outlining the critical habitats for more than 700 threatened and endangered species. For each habitat, the dataset provides its geographic boundary lines, the species’ name and type, the size of the habitat, the date it was declared critical, and more. Related: Other geospatial datasets from the USFWS, including those on the Coastal Barrier Resources System and migratory bird populations. — Data is Plural: October 3, 2018
Links:
Tags: animals, environment
The Hass Avocado Board publishes weekly data on the retail volume and average price of Hass avocados sold in the United States, based on information collected “directly from retailers’ cash registers.” The data is available at the national and city level going back to 2015, and distinguishes between conventional and organic avocados of various sizes. Related: Justin Kiggins has aggregated the historical spreadsheets for 2015 through March 2018 into a single file. — Data is Plural: September 26, 2018
Links:
Tags: food
The Social Assistance, Politics and Institutions database, developed at a United Nations University research center, “provides a synthesis of longitudinal and harmonized comparable information on social assistance programmes in developing countries, covering the period 2000-2015.” For each program, such as Brazil’s “Bolsa Familia,” the database describes its basic characteristics, budget and financing, and population coverage. [h/t Erik Gahner Larsen] — Data is Plural: September 26, 2018
Links:
The National Survey of Family Growth, run by the U.S. Centers for Disease Control and Prevention, “gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health.” Versions of the survey have been conducted nine times, dating back to 1973. The most recent results come from interviews of more than 10,205 people between September 2013 and September 2015. Related: The Pudding’s Amber Thomas used the data to explore trends in birth control. Bonus: Thomas also published the code and data behind her analysis. [h/t Giuseppe Sollazzo] — Data is Plural: September 26, 2018
Links:
Tags: family, healthcare
Some cities — including San Francisco, Los Angeles, and Austin — provide downloadable databases of lobbyists who’ve officially registered to influence their administrations. Chicago has gone one step further, publishing data on lobbyists’ compensation, expenditures, gifts, and more. Previously: Lobbying data from the U.S. House, U.S. Senate, and European Union (DIP 2017.05.31 + DIP 2017.08.02). [h/t Alisha Green and Laurenellen McCann] — Data is Plural: September 26, 2018
Links:
Tags:
When Amsterdam began excavating parts of the Amstel River in 2003 to construct a new metro line, the city gave archaeologists access to two large sections of the riverbed. Over time, these archaeologists unearthed “a deluge of finds, some 700,000 in all: a vast array of objects, some broken, some whole, all jumbled together.” To showcase the work, the city has published Below the Surface, a website that lets you explore 20,000 of the objects online, download detailed data on more than 130,000 of the artifacts, read the backstory, and watch a documentary about it. Among the discoveries: Thousands of tobacco pipes, hundreds of teapots, dozens of gin bottles, and one “miniature wind mill.” [h/t Adam J Calhoun + Manoj Mallela] — Data is Plural: September 26, 2018
Links:
Tags: history
New York State’s Department of State publishes a structured listing of all real estate brokers, salespeople, and offices currently licensed by the agency. Roughly half of the 160,000 licensees are registered to business addresses in New York City. The ZIP code with the largest raw number of active licenses is 10022, a chunk of Midtown East that includes (among other things) the Waldorf Astoria and Trump Tower. — Data is Plural: September 19, 2018
Links:
Tags: real estate
The U.S. Department of State publishes, “to the maximum extent practicable,” a database of “each United States citizen who dies in a foreign country from a non-natural cause.” The database currently contains 13,045 deaths, starting in October 2002, and is updated every six months. For each incident, the database provides the date, city, and cause of death. [h/t Jacquelyn Elias] — Data is Plural: September 19, 2018
Links:
Tags: death
Unpaywall has collected data on millions of open-access scholarly articles, plus many more paywalled articles. You can download the full dataset, or submit specific Digital Object Identifiers to the website’s API or online form. For each article, you can learn whether it’s openly accessible, whether the journal that published it is open-access, and additional details about the article itself. [h/t @authcontroller] — Data is Plural: September 19, 2018
Links:
Tags: education
For many decades, the Department of Energy’s Residential Energy Consumption Survey has been asking people about their homes’ energy-related characteristics (e.g., number of bedrooms and roofing materials) and energy-consuming appliances (e.g., television size and dishwasher use). Then, the agency cross-references those answers with billing data collected “directly from energy suppliers under a mandatory authority granted by Congress.” The survey has been conducted 14 times since 1978; survey microdata is available for the eight most recent iterations. — Data is Plural: September 19, 2018
Links:
Tags: energy
ParlGov, “a data infrastructure for political science,” has collected detailed information on 1,500+ political parties, the results of 900+ elections, and the formation of 1,400+ parliamentary cabinets. The 37 countries it covers include every member of the European Union plus certain non-EU members of the OECD (such as Israel, Turkey, and Canada — but not the United States). The datasets are available in several formats, can be explored online, and come with extensive documentation. [h/t Jovi Juan] — Data is Plural: September 19, 2018
Links:
Open Brewery DB is a searchable database of more than 8,000 breweries in the United States (although “future plans are to import world-wide data”). The site provides an API, which lets you query by name, location, and type — microbrewery, regional brewery, brewpub, and so on. Previously: Official brewery statistics (DIP 2017.05.24). [h/t Chris Mears] — Data is Plural: September 12, 2018
Links:
Tags: alcohol
Zillow has created a dataset outlining the boundaries of more than 17,000 neighborhoods in the United States’ largest cities, spanning 49 states (all but Wyoming) plus D.C. and Puerto Rico. Related: OpenStreetMap, which is API-queryable, has a “neighbourhood” tag type. [h/t Volodymyr Kupriyanov] — Data is Plural: September 12, 2018
Links:
Tags: mapping, real estate
“For the first time, the city’s database, which tracks more than 28 million parking and vehicle compliance tickets, is easily available to the public,” according to ProPublica Illinois, which has published the two-gigabyte dataset in collaboration with WBEZ. The dataset, which covers January 2007 to mid-May 2018, “includes information on when, where, and by whom tickets were issued; de-identified license plates; vehicle make; registration zip code; the violation for which the vehicle was cited; the payment status and more.” — Data is Plural: September 12, 2018
Links:
Last month, Puerto Rico’s government began publishing a dataset of all deaths registered in the U.S. territory from January 2017, updated weekly. For each death, the information includes the year and month of the death; the type and causes of death; the deceased’s age, sex, marital status, occupation, place of birth and residence, and more. Related: “More Than 2,000 Puerto Ricans Applied For Funeral Assistance After Hurricane Maria. FEMA Approved Just 75.” [h/t Giancarlo Gonzalez] — Data is Plural: September 12, 2018
Links:
Utility companies are required to report major power outages and other “electric disturbance events” to the Department of Energy within a business day (or, depending on the type of event, sooner) of the incident. The federal agency then aggregates the reports into annual summary datasets. For each event, the data includes the time it began and was resolved, the geographic areas it affected, the type of incident, and the estimated number of customers affected. [h/t Jordan Wirfs-Brock] — Data is Plural: September 12, 2018
Links:
Tags: energy
Jan Diehm and Amber Thomas measured the pockets of 80 pairs of jeans — four pairs each from 20 brands, half marketed to men and the other half to women. Their findings “confirmed what every woman already knows to be true: women’s pockets are ridiculous.” In fact, “on average, the pockets in women’s jeans are 48% shorter and 6.5% narrower than men’s pockets.” For each pair of jeans, the duo’s underlying dataset contains the front and back pocket dimensions, material composition, retail price, and more. — Data is Plural: August 22, 2018
Links:
Tags: miscellaneous
The University of North Carolina’s Louis Harris Data Center serves as “the national depository for publicly available survey data collected by Louis Harris and Associates, Inc.” The online depository contains more than 1,000 Harris polls, some from as early as 1958. In total, they include “160,000 questions asked of more than 1,200,000 respondents.” [h/t Xan Gregg] — Data is Plural: August 22, 2018
Links:
Tags: statistics
DigitalGlobe’s open data program publishes georeferenced satellite imagery from before and after major natural disasters. The archive currently includes a couple dozen events, including recent flooding in Kerala and California’s Carr Wildfire and Mendocino Complex Fire. Previously: NOAA's emergency response aerial imagery (DIP 2017.09.20). [h/t Laura Noren and Brad Stenger] — Data is Plural: August 22, 2018
Links:
The African Economic Research Consortium, African Development Bank, and the World Bank have partnered to create the Service Delivery Indicators program — “a new Africa-wide initiative” that dispatches teams of surveyors “to gauge the quality of service delivery in basic health services” across the continent. The initiative’s de-identified data contains results for nine countries so far, including assessments of facility infrastructure, worker absenteeism, and patient case simulations. [h/t Matthew Collin] — Data is Plural: August 22, 2018
Links:
Tags: healthcare
Google recently launched a database of political ads “that have appeared on Google and partner properties.” The searchable and downloadable dataset indicates the organization that paid for each advertisement, approximately how much they spent, how long the ad ran, what demographics were used for targeting, and roughly how many people it reached. A few months ago, Facebook launched a similar initiative, but you need to be logged in to view it and you can’t download the data. You can, however, get Facebook political-advertising data from at least two sources: A repository of 267,000 ads scraped from Facebook’s official archive by NYU researchers, and ProPublica’s ongoing, detailed database of ads and targeting parameters gathered through their Political Ad Collector. [h/t Sheera Frenkel] — Data is Plural: August 22, 2018
Links:
The SEC’s Office of Structured Disclosure publishes data extracted from corporations’ public financial statements. That dataset contains the numbers listed in each company’s primary financial statements — balance sheets, cash flows, et cetera. An even more detailed version of the dataset includes plain-text notes from the filings, plus numbers from a broader array of forms. Both datasets are updated quarterly and go back to 2009. — Data is Plural: August 15, 2018
Links:
Tags: business
Best Buy’s API and Walmart’s API both let you search their products and stores. Both also require (free) registration to obtain an API key. In 2016, Best Buy also published bulk data describing its products and stores. [h/t Dan Nguyen + Dave Machado] — Data is Plural: August 15, 2018
Links:
Tags: business, technology
Nike, Inc.’s manufacturing map displays 618 factories and material suppliers that the company uses to manufacture its products (as of May 2018). You can export the entire dataset, or browse and filter the data online. For each of the factories, the information includes the factory’s name, address, product type, number of workers, percentage of workers who are female, and more. [h/t Marc DaCosta] — Data is Plural: August 15, 2018
Links:
SpaceX’s API provides data on the company’s rockets, launchpads, launches, and more. It also will tell you the current orbital position of the car SpaceX launched into space. [h/t Mike Allred] — Data is Plural: August 15, 2018
Links:
Tags: business, technology
The Lending Club, which matches borrowers with investors, publishes a dataset of all loans issued through its platform since 2007. The dataset’s many fields include each loan’s amount, term, interest rate, grade, status, and purpose (as a category, and often also a fuller description), as well as the borrower’s employer, home ownership status, and annual income. You can also download all declined loans, i.e., those “that did not meet Lending Club's credit underwriting policy.” [h/t Charlie Stanton] — Data is Plural: August 15, 2018
Links:
Tags: money
London, Belfast, Vancouver, Washington (D.C.), Philadelphia, Boston, Cambridge (Mass.), Madison, Providence, San Francisco, Oakland, and Berkeley are among the many cities that publish data cataloguing the trees that line their streets. Previously: NYC’s street trees (DIP 2016.11.16). [h/t Jens von Bergmann + Sunlight Open Cities + u/willwardo] — Data is Plural: August 8, 2018
Links:
The nonprofit organization Reclaim The Records recently obtained New Jersey’s death index, and has made it available to search and download. The records include structured data for 1,275,833 deaths in the state between 2001 and 2017, plus digitized images of the death index for 1901-1903, 1920-1929, and 1949-2000. The structured data contains each person’s name, date of birth, date of death, and death certificate number — plus, for the most recent records, the locations of birth and death. Also: NJ Advance Media has published data on 17 years of drug overdose deaths from the state’s Office of the State Medical Examiner, and property tax rolls for “all 2.3 million taxable parcels of land” in 2017. (Free registration required to download the files.) [h/t Benjamin Cooley + Martin Burch] — Data is Plural: August 8, 2018
Links:
Tags: death, statistics
Researchers at Primer, a machine learning and natural language processing startup, have released a dataset describing more than 36,000 notable computer scientists, “only 15%” of whom have Wikipedia biographies. The researchers trained their algorithms on a corpus of existing Wikipedia articles, Wikidata entries, news articles, and the Semantic Scholar Open Research Corpus. (The latter contains data on more than 39 million research papers in computer science, neuroscience, and biomedical science.) The results include each computer scientist’s name, basic metadata, academic papers, and snippets of news articles mentioning them. Related: “Using Artificial Intelligence to Fix Wikipedia's Gender Problem” (Wired). [h/t Sara Blask] — Data is Plural: August 8, 2018
Links:
Tags:
ProPublica is tracking the money that political campaigns and government agencies have reported spending at Donald Trump’s hotels, golf clubs, and restaurants. You can download the data, which includes the spender, property, date, amount, and listed purpose for each payment. From ProPublica’s notes: “Federal government spending is incomplete because many government agencies have actively fought requests to disclose spending at Trump properties. The data we have so far was released, in part, after lawsuits.” — Data is Plural: August 8, 2018
Links:
Tags: Trump
The Militarized Interstate Dispute datasets provide details about more than 2,200 instances between 1816 and 2010 where a government “threatened, displayed, or used force against another” — including each dispute’s timing, participants, death count, result, and more. A supplementary database tracks the disputes’ locations. The datasets are part of the Correlates of War project, which was founded in 1963 and which strives for “the systematic accumulation of scientific knowledge about war.” [h/t Erik Beuck] — Data is Plural: August 8, 2018
Links:
There are many official datasets of public toilets, including those in New York City parks, Vancouver, Seattle parks, many UK cities, Australia, and New Zealand. [h/t Jens von Bergmann] — Data is Plural: August 1, 2018
Links:
Tags: mapping
Earlier this year, Johns Hopkins professor Dan Honig released the Project Performance Database, which tracks the outcome ratings of international development projects (typically conducted by auditors on a four- or six-point scale). “The PPD is, at present, the world's largest” such database and “contains over 14,000 unique projects from eight agencies,” including the World Bank, the Asian Development Bank, and others. [h/t Paddy Carter] — Data is Plural: August 1, 2018
Links:
Tags:
Researchers at the Environmental Protection Agency have created a new dataset of “reported and predicted information on more than 75,000 chemicals and more than 15,000 consumer products.” The Chemicals and Products Database, as they’ve named it, is an “aggregation of publicly available data on chemical-use categorization, consumer product composition [...], and functional use of chemicals”, and uses “a consistent scheme for categorizing products and chemicals.” You can download the data via the EPA’s Chemistry Dashboard. — Data is Plural: August 1, 2018
Links:
The Organisation for Economic Co-operation and Development (OECD) has launched a database “providing detailed and comparable tax revenue information for 80 countries around the world.” The Global Revenue Statistics Database, “which will expand to cover more than 90 countries by the end of 2018,” breaks tax revenues into dozens of categories and subcategories — such as sales taxes, taxes on capital gains, and taxes on exports. Related: The OECD’s interactive charts of the data. — Data is Plural: August 1, 2018
Links:
Tags: taxes
Law professor Brandon L. Garrett has led an effort to compile data on every death sentence in the U.S. since the early 1990s. Garrett’s “End of its Rope” database currently includes more than 4,900 sentencings, and specifies each defendant’s name, race, and gender; the state, county, and year of the sentence; whether it was a resentencing; and whether the defendant has been executed. You can download the data, browse it online, and explore it via an interactive map. — Data is Plural: August 1, 2018
Links:
“When somebody's obituary appears in the New York Times, FOIA The Dead sends an automated request to the FBI for their (newly-available) records.” So far, the project has obtained and published the FBI’s files on 54 people. The site’s data includes each person’s name, a short description, a link to the relevant obituary, a link to the received records, and the number of pages obtained. [h/t Noah Veltman] — Data is Plural: July 18, 2018
Links:
In a recent essay at The Pudding, Jason Li, Amber Thomas, and Divya Manian explored the shades of foundation offered by best-selling makeup brands in the U.S., Nigeria, India, and Japan. They also published the underlying data — color values for more than 600 shades from 36 different brands. — Data is Plural: July 18, 2018
Links:
Tags: business, miscellaneous
Microsoft’s Bing Maps team has published an open dataset describing the outlines of nearly 125 million buildings in the United States. To build the dataset, the team trained neural networks to detect buildings’ footprints in satellite images. — Data is Plural: July 18, 2018
Links:
The Amsterdam-based activist group UNITED for Intercultural Action has, since the early 1990s, been collecting information about the deaths of Europe’s refuge-seekers. The organization's volunteers “update the data annually, spending six months at a time verifying reports, categorising deaths and entering them into the database,” according to The Guardian's story about the endeavor and its findings. “When the project began, they received physical clippings from a network of groups around Europe. Now, the data is collected from email submissions and Google Alerts in a number of languages.” The story features a PDF listing of the deaths, including the date the migrants were found dead, names and countries of origin (where known), and the causes of death. The Italian civic-data organization OnData has converted the PDF to a spreadsheet. [h/t Giuseppe Sollazzo] — Data is Plural: July 18, 2018
Links:
Tags: death, immigration
Since July 2014, ScotusMap.com has been tracking the U.S. Supreme Court justices’ public events — “whether the Supreme Court is in session or on summer recess, the justices keep busy with writers’ conferences, state bar luncheons, award ceremonies, and more.” The map’s database now contains more than 700 entries, and even includes events attended by the retired justices. Bonus: The creators of ScotusMap recently launched ScotusWat.ch, a website (with downloadable data) that “tracks the public statements made by United States senators about how they plan to vote on the Supreme Court nominee, Brett Kavanaugh, and tallies them into a likely vote count.” [h/t Jay Pinho + Victoria Kwan] — Data is Plural: July 18, 2018
Links:
Tags: law
Kansas City publishes a dataset of cars for sale at its monthly auction. As of yesterday, the dataset contained 482 cars. For each car, the variables include the make, model, year, VIN, reason for being auctioned — e.g., “abandoned,” “stolen,” “illegally parked” — and other details. — Data is Plural: July 4, 2018
Links:
Tags: transportation
Christine Zhang has compiled a CSV of 400+ locations featured in Anthony Bourdain’s No Reservations, The Layover, and Parts Unknown shows. The spreadsheet-as-remembrance includes each location’s name, country, latitude/longitude, plus the relevant episode’s show, season, number, and title. — Data is Plural: July 4, 2018
Links:
The U.S. Centers for Medicare & Medicaid Services publishes a series of “geographic variation” spreadsheets, which cover hundreds of metrics — such as kidney dialysis usage, the total cost of medical tests, and hospital readmission rates — related to Medicare beneficiaries’ healthcare in each state, county, and “hospital referral region.” [h/t Drew Ivan] — Data is Plural: July 4, 2018
Links:
Tags: healthcare, statistics
The Cooperative Open Online Landslide Repository (COOLR) is a recently-launched NASA project that “seeks to cultivate an open platform where scientists and citizen scientists around the world can share landslide reports to guide awareness of landslide hazards for improving scientific modeling and emergency response.” The repository has been seeded with the agency’s Global Landslide Catalog, which it says is already “the largest openly available global database of rainfall-triggered mass movements known to date.” You can explore the COOLR data on an interactive map or download the data in several formats. — Data is Plural: July 4, 2018
Links:
Every year, hundreds of U.S. transit systems — from the Pomona Valley Transportation Authority’s Claremont Dial-a-Ride to the Metropolitan Transportation Authority’s New York City Transit — submit detailed metrics to the congressionally-established National Transit Database. The NTD's datasets cover a broad set of topics, including “agency funding sources, inventories of vehicles and maintenance facilities, safety event reports, measures of transit service provided and consumed, and data on transit employees.” The NTD also provides a glossary, data collection manuals, and the underlying forms. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] — Data is Plural: July 4, 2018
Links:
Tags: transportation
Ever since 2010, the National Institute for Computer-Assisted Reporting (NICAR) annual conference has featured a session of five-minute “lightning talks,” selected by popular vote. NICARian Christine Zhang has compiled a spreadsheet of all 309 lightning talk proposals, the proposed presenters, their professional affiliations, how many votes each proposal received, and more. Related: “Nine Years of NICAR Lightning Talks (and Cats),” Zhang’s analysis of the data. Also related: The code behind Zhang’s analysis. — Data is Plural: June 6, 2018
Links:
Tags: journalism
The Mexican state of Yucatán publishes a dataset listing the names and locations of cenotes, the region’s famous water-filled sinkholes. Related: Other datasets from the Programa de Ordenamiento Ecológico Territorial del Estado de Yucatán. [h/t Forest Gregg] — Data is Plural: June 6, 2018
Links:
Tags: environment, mapping
PubMed, the National Library of Medicine’s search engine for biomedical and life-sciences literature, lets you search for retracted publications; just add "retracted publication"[PTYP] to your query. For instance, here are retracted articles that were originally published in 2016. Using the “Send to” link at the top-right of the query pages, you can download all the results. Data scientist Neil Saunders has gathered this data and condensed it into an interactive, graphical report. (Clicking on the axis labels takes you to the relevant PubMed search.) Related: The code behind Saunders’ report. [h/t u/cavedave] — Data is Plural: June 6, 2018
Links:
Tags: healthcare, science
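The PubMed filter described above can also be used programmatically. The following is a minimal sketch, not an official recipe: it assumes NCBI's E-utilities `esearch` endpoint and uses the `[PDAT]` publication-date tag for the 2016 example. The helper names `retracted_query` and `esearch_url` are made up for illustration, and the code only builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def retracted_query(extra_terms=""):
    """Combine optional search terms with the retracted-publication filter."""
    tag = '"retracted publication"[PTYP]'
    return f"({extra_terms}) AND {tag}" if extra_terms else tag


def esearch_url(term, retmax=20):
    """Build (but do not send) an esearch request URL for PubMed."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# Hypothetical example: retracted articles with a 2016 publication date.
url = esearch_url(retracted_query("2016[PDAT]"))
print(url)
```

Fetching the printed URL should return a JSON document with matching PubMed IDs. Check NCBI's usage guidelines before sending requests in bulk.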
The Smithsonian Institution’s Global Volcanism Program maintains a database of more than 12,000 volcanoes and 11,000 eruptions — dating from 10450 BCE to the present year. You can search the data online, and then download the results as a spreadsheet. Related: “Here's every volcano that has erupted since Krakatoa.” [h/t Duncan Geere + Rachel Schallom + Lazaro Gamio] — Data is Plural: June 6, 2018
Links:
The Department of Energy’s National Energy Technology Laboratory has published what it says is the “first-ever database inventory of oil and natural gas infrastructure information from the top hydrocarbon-producing and consuming countries in the world.” The database contains tons of geospatial information and “identifies more than 4.8 million individual features like wells, pipelines, and ports from more than 380 datasets in 194 countries. It includes information about the type, age, status, and owner/operator of infrastructure features.” Helpful: The authors’ (detailed) methodology paper. [h/t Michael McLaughlin] — Data is Plural: June 6, 2018
Links:
Tags: energy
The Wikimedia Foundation has published a dataset listing each clearly-cited source (e.g., a book with an ISBN, a scholarly article with a DOI, etc.) on each page of each of Wikipedia’s 298 language editions — 15,693,732 source-page combinations in all. Related: “The Most-Cited Authors on Wikipedia Had No Idea,” by Louise Matsakis. [h/t Ted Lawless] — Data is Plural: May 23, 2018
Links:
Tags: language, media, technology
Earlier this month, the Department of Energy’s National Renewable Energy Laboratory made a big new slice of its Wind Integration National Dataset available online. The latest version provides API access to 50 terabytes of wind-related measurements — about 10% of the full database. It includes “barometric pressure, wind speed and direction, relative humidity, temperature, and air density data” between 2007 and 2013, from nearly 5 million locations in/near the continental United States. The NREL has also published an animated map of the data. Note: Free registration is required to access the API. Previously: Wind turbines (DIP 2018.04.25). [h/t Michael McLaughlin] — Data is Plural: May 23, 2018
Links:
Tags: energy
The Nonviolent and Violent Campaigns and Outcomes (NAVCO) Data Project, based at the University of Denver, “catalogues major nonviolent and violent resistance campaigns around the globe from 1900-2013.” The project’s initial dataset explored the general characteristics of hundreds of campaigns; follow-up datasets have examined the annual activity and tactics of smaller subsets. Each dataset comes with a detailed codebook. Note: Free registration is required to download the most recent datasets. [h/t Peace Science Digest] — Data is Plural: May 23, 2018
Links:
Tags: conflict
As a way to “lower the barrier” for analyzing public transportation data, researchers at Finland’s Aalto University have published “a curated collection of [now more than] 25 cities' public transport networks in multiple easy-to-use formats including network edge lists, temporal network event lists, SQLite databases, GeoJSON files, and the GTFS data format.” On the project’s website, you can browse, visualize, and download each city’s data. (The cities are mostly in Europe and Australia, but also include Detroit, Winnipeg, and Antofagasta, Chile.) Previously: TransitLand and TransitFeeds (DIP 2016.07.27). [h/t NYU Data Science Community Newsletter] — Data is Plural: May 23, 2018
Links:
Tags: transportation
Caitlin Rivers, a computational epidemiologist at the Johns Hopkins Center for Health Security, has started compiling data tracking the current Ebola outbreak in the Democratic Republic of Congo. So far, the datasets are based on case counts and other information from the DRC’s Ministry of Health and the World Health Organization. A series of “data interpretation notes” accompanies each dataset. (Rivers administered a similar data repository during the 2014 Ebola outbreak.) Related: “Most Maps of the New Ebola Outbreak Are Wrong,” by Ed Yong. — Data is Plural: May 23, 2018
Links:
You know those metallic grates embedded into city sidewalks? D.C.’s Office of the Chief Technology Officer has identified 10,000+ of them in the District. Also: 89,727 curb segments. [h/t Sunlight Open Cities] — Data is Plural: May 9, 2018
Links:
Tags: infrastructure, mapping
The European Space Agency’s Gaia spacecraft “has produced the richest star catalogue to date, including high-precision measurements of nearly 1.7 billion stars and revealing previously unseen details of our home Galaxy.” Those measurements, released last month, are available to download. They’ve also been used to create a high-resolution image of all observed stars and to expand the ESA’s interactive space map. Related: This Vox article provides some more context. [h/t u/Kopachris] — Data is Plural: May 9, 2018
Links:
Tags: science
The Himalayan Database tracks “all expeditions that have climbed in the Nepalese Himalaya.” The hyper-detailed database “is based on the expedition archives of Elizabeth Hawley, a longtime journalist based in Kathmandu, and it is supplemented by information gathered from books, alpine journals and correspondence with Himalayan climbers.” The database — long accessible only on CD, for a fee — is now available to download for free. (The main download is provided as a Microsoft Visual FoxPro database, but the .DBF files within it can be opened using other software, including LibreOffice.) Related: Yuichiro Miura, the oldest person to reach the summit of Mount Everest. [h/t Jacob Bradburn] — Data is Plural: May 9, 2018
Links:
Cook County, Illinois, publishes data on all deaths reported to its medical examiner — 20,000+ deaths since August 2014, and updated daily. (FYI: “Not all deaths that occur in Cook County are reported to the Medical Examiner or fall under the jurisdiction of the Medical Examiner.”) Connecticut’s Office of the Chief Medical Examiner has published data on all accidental drug deaths reported between 2012 and 2017. The Dallas Morning News’ Dana Amihere obtained autopsy data from the Dallas County medical examiner's office, and NJ Advance Media’s Stephen Stirling obtained data on “all cases referred to the NJ Medical Examiner system from 1996 to 2016.” [Correction, 2018-05-09: The original version of this item misspelled Stephen Stirling's name. Data Is Plural regrets the error.] — Data is Plural: May 9, 2018
Links:
Tags: death
OpenTrials, a collaboration between Open Knowledge International and Oxford University’s Ben Goldacre, “aims to locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally.” The project’s “public beta” brings together data from several of the world’s largest clinical trial registries — including the United States’ ClinicalTrials.gov, the European Union Clinical Trials Register, and the WHO’s International Clinical Trials Registry Platform — and other related sources. You can explore the data through an online search tool, monthly bulk exports, and an API. — Data is Plural: May 9, 2018
Links:
Tags: healthcare, science
Computer-vision researchers convinced 32 participants (of 10 nationalities, living in 4 cities) to record everything they did in their kitchens for three days using a head-mounted camera. Later, the participants narrated what they had been doing. Taken together, the EPIC-Kitchens dataset includes 55 hours of video, nearly 40,000 narration segments, and more. [h/t Duncan Geere] — Data is Plural: April 25, 2018
Links:
Tags: food
For her 2013 book, Making the News: Politics, the Media, and Agenda Setting, UC Davis professor Amber E. Boydstun oversaw the compilation of a dataset of every front-page article in the New York Times from 1996 to 2006. Each of the 31,034 articles has been categorized by topic, according to a detailed codebook, and given a short summary. Related: The Comparative Agendas Project's list of datasets that use its topic-classification system, including Boydstun’s data. Also related: The NYT’s APIs. [h/t Cornelius Puschmann] — Data is Plural: April 25, 2018
Links:
Tags: journalism
Lawrence Berkeley National Laboratory, the U.S. Geological Survey, and the American Wind Energy Association have partnered to publish the U.S. Wind Turbine Database. The dataset, which the government says will be “continuously updated,” currently contains 57,636 turbines and includes each turbine’s location, development project, manufacturer, model, height, rotor diameter, and other characteristics. You can download the data in several formats, and also explore it on an interactive map. [h/t Ed Vine] — Data is Plural: April 25, 2018
Links:
Tags: energy
The federal Highway Performance Monitoring System “includes inventory information for all of the Nation's public roads as certified by the States’ Governors annually.” And it’s not just highways: “All roads open to public travel are reported in HPMS regardless of ownership, including Federal, State, county, city, and privately owned roads such as toll facilities.” Shapefiles representing the HPMS data are available for 2011–2015. For each segment of road, the dataset indicates the average daily traffic, number of turn lanes, surface type, and dozens of other variables. Related: America’s Quietest Routes, which uses the data. — Data is Plural: April 25, 2018
Links:
Over the past year, reporters at the Washington Post “attempted to identify every act of gunfire at a primary or secondary school during school hours since the Columbine High massacre on April 20, 1999.” Using a range of sources, the reporters “reviewed more than 1,000 alleged incidents, but counted only those that happened on campuses immediately before, during or just after classes.” The resulting database, published last week, currently contains more than 200 incidents and can be downloaded as a CSV. For each shooting, the database includes details about the location, timing, circumstances, shooter, casualties, and the school’s students. [h/t The INN Nerds] — Data is Plural: April 25, 2018
Links:
The University of Florida’s Larry Winner has collected hundreds of “miscellaneous” datasets, many from niche academic studies. A few highlights: “Antiseptic as Treatment for Amputation – Upper Limb” (from an 1870 study), “Sex, Lies, and Religiosity” (1971), and “Reading Times by E-Reader Device and Lighting Conditions” (2013). [h/t Charles Minshew] — Data is Plural: April 18, 2018
Links:
Tags: miscellaneous, statistics
Researchers at the University of Colorado at Boulder and the Santa Fe Institute have compiled a dataset of 200+ universities’ parental leave policies. For each institution, the dataset indicates the amount of paid leave granted to/taken by both women and men, and what type of leave it is (e.g., relief from teaching, from all duties, et cetera). [h/t Sam Way] — Data is Plural: April 18, 2018
Links:
The nonpartisan Campaign Finance Institute has launched a database of current and historical state campaign finance laws. The information goes back to 1996 and describes each state’s contribution limits, various kinds of prohibitions, disclosure rules, and more. You can download the full dataset or explore it online. [h/t Rachel Shorey] — Data is Plural: April 18, 2018
Links:
The U.S. Office of the Federal Register publishes structured data on every presidential executive order since 1994. For each of the 886 entries, the dataset provides the order’s title, the date it was signed, the president who signed it, and where to find it in the Federal Register. [h/t u/cavedave] — Data is Plural: April 18, 2018
Links:
Tags: politics
A team led by Princeton sociologist and Evicted author Matthew Desmond has compiled the United States’ first-ever national-scale, publicly-available database of eviction metrics. Desmond’s Eviction Lab has collected more than 80 million records from cities, counties, and states across the country, and used them to calculate the number of evictions and eviction filings in each place. (Short methodology here; longer methodology here.) You can download the aggregate data in bulk (after supplying your email address) and explore it through an interactive map. Related: “In 83 Million Eviction Records, a Sweeping and Intimate New Look at Housing in America” (The New York Times), which includes additional background and graphics. — Data is Plural: April 18, 2018
Links:
Last year, Data Is Plural pointed readers to dog registration data for NYC, Tacoma, and Edmonton. It turns out that the government of Zurich also publishes local dog registrations, including each canine’s name, gender, and birth year. And the Sunshine Coast Council, in Australia, publishes a spreadsheet of both dogs and cats, their primary breeds and colors, and whether they’ve been spayed/neutered. [h/t Open Data Institute] — Data is Plural: April 4, 2018
Links:
Tags: animals
Software engineer Michael Penkov has scraped the official, polling station–level results for Russia’s recent presidential election, and made the data available as a single JSON file. He’s also published an introductory Python notebook, which explains the data structure and provides English translations for the Russian field names. — Data is Plural: April 4, 2018
Links:
The Center for Strategic and International Studies’ Beyond Parallel project publishes several databases related to North Korean international relations — including 200+ negotiations between the U.S. and DPRK since 1990, and several hundred military provocations since 1958. Related: Los Angeles Times correspondent Matt Stiles’ visual explorations of the provocations data. Previously: The James Martin Center for Nonproliferation Studies’ North Korea Missile Test Database (DIP 2017.05.17). [h/t Matt Stiles] — Data is Plural: April 4, 2018
Links:
Tags: conflict
The UK government has begun requiring all companies with at least 250 employees in Great Britain (i.e., England, Scotland, and Wales) to report the pay differences between their male and female workers. Today is the official deadline to submit the reports; as of last night, more than 8,800 employers had done so. The reports include the percentage gaps in hourly earnings, differences in bonus pay, and the proportions of male and female employees in each pay quartile. You can search the data online and also download it as a CSV. Related: The Guardian’s series of reports on the data. [h/t Peter Yeung] — Data is Plural: April 4, 2018
Links:
Tags: gender, money, statistics
The GOES-16 satellite was launched into orbit in November 2016, and it’s been collecting near-realtime images and data ever since. (GOES stands for “Geostationary Operational Environmental Satellite.”) It collects data on 16 different spectral bands, and it can capture a full image of the Western Hemisphere every 15 minutes, plus “an image of the Continental U.S. every five minutes, and two smaller, more detailed images of areas where storm activity is present, every 60 seconds.” You can browse the images and data online, and also download them as NetCDF files. Related: Washington Post graphics reporter John Muyskens’ list of GOES-16 resources and usage examples. [h/t John Muyskens] — Data is Plural: April 4, 2018
Links:
Tags: mapping
OpenPowerlifting.org “aims to create a permanent, accurate, convenient, accessible, open archive of the world's powerlifting data. In support of this mission, all of the OpenPowerlifting data and code is available for download in useful formats.” So far, that includes 400,000+ performances at 9,000+ competitions in dozens of countries. [h/t u/cavedave] — Data is Plural: March 7, 2018
Links:
Tags: sports
The University of Miami Libraries has digitized 53,000+ pages of La Gaceta de La Habana, “the paper of record during the Spanish colonial occupation of Cuba in the nineteenth century.” The digitized editions span 33 of the years between 1849 and 1897. Previously: Historical U.S. newspapers (DIP 2017.08.16). [h/t Mike Stucka + Heather Froehlich] — Data is Plural: March 7, 2018
Links:
Tags: journalism
Last year, the Stanford Center for Reproducible Neuroscience launched OpenNeuro, “a free and open platform for analyzing and sharing neuroimaging data.” (It’s the successor to the center’s earlier initiative, OpenfMRI.) You can, for instance, download scans of brains that were watching a particular episode of The Twilight Zone. Related: The Brain Imaging Data Structure, “a simple and intuitive way to organize and describe your neuroimaging and behavioral data.” Previously: The Open Access Series of Imaging Studies (DIP 2017.08.16). [h/t Laura Noren and Brad Stenger] — Data is Plural: March 7, 2018
Links:
Tags: healthcare, science
Afrobarometer “is a pan-African, non-partisan research network that conducts public attitude surveys on democracy, governance, economic conditions, and related issues in more than 35 countries in Africa.” You can download data from the first six rounds of surveys, conducted between 1999 and 2015. You can also read the detailed questionnaires and explore the results online. Note: To download the data, you’ll need to create a (free) account on the website. [h/t Jeffrey Arnold] — Data is Plural: March 7, 2018
Links:
Tags: statistics
The U.S. Energy Information Administration publishes near-real-time data on the Lower 48’s electrical grid. The datasets include net electricity generation, flows in and out of the country’s various “balancing authorities,” regional demand, and forecasts of demand. You can explore the data online, access it through the EIA’s API, or download it in bulk. Helpful: The EIA’s guide to the data and “known issues”. — Data is Plural: March 7, 2018
Links:
Tags: energy
The (unofficial) Rick and Morty API provides data on 390+ characters, 60+ locations, and all 31 episodes of the science-fictional animated series. — Data is Plural: February 28, 2018
Links:
Tags: entertainment
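As a rough illustration of how such an API might be read, here is a minimal sketch. It assumes the API is served at `https://rickandmortyapi.com/api/character` and that each response holds a `results` list plus an `info.next` link to the following page. Both details are assumptions to verify against the project's own documentation, and `fetch_json` and `iter_results` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

# Assumption: base URL of the (unofficial) API's character listing.
CHARACTERS_URL = "https://rickandmortyapi.com/api/character"


def fetch_json(url):
    """Download one page and decode it as JSON."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_results(first_url, fetch=fetch_json):
    """Yield every record, following each page's assumed `info.next` link."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get("results", [])
        url = page.get("info", {}).get("next")


# Usage (performs network requests, so it is left commented out):
# names = [c["name"] for c in iter_results(CHARACTERS_URL)]
```

Passing the `fetch` function as a parameter keeps the pagination logic testable without network access.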
The American Society of Mammalogists’ Mammal Diversity Database “is home base for tracking the latest taxonomic changes to species and higher groups of mammals.” Currently, it contains more than 1,300 genera and 6,000 total species. Fun facts: The impala is the only member of the genus Aepyceros, and the name “Schmidly's deer mouse” can refer to either of two species in two entirely different genera. [h/t Himanshu Goenka] — Data is Plural: February 28, 2018
Links:
Tags: animals
The Environmental Protection Agency’s RadNet system “monitors the nation's air, precipitation and drinking water for radiation.” The radiation measurements, collected from 130+ stations in all 50 states plus the District of Columbia and Puerto Rico, are available on a “near-real-time” basis. Related: Randall Munroe’s radiation dose chart. Previously: SafeCast (DIP 2016.02.03). [h/t Stanislav Kralin] — Data is Plural: February 28, 2018
Links:
Tags: environment, nuclear
Political scientist Jeffrey Arnold has converted the U.S. Army Concepts Analysis Agency (CAA) Database of Battles from a series of Lotus 1-2-3 worksheets into tidier, easier-to-use CSV files. The dataset includes details of 660 battles — associated with several dozen wars — between 1600 and the mid/late-1900s. The fields indicate each battle’s “name, date, and location; the strengths and losses on each side; identification of the victor; temporal duration of the battle,” and more. — Data is Plural: February 28, 2018
Links:
The PA-X Peace Agreements Database contains structured information about 1,500+ “formal, publicly-available documents” that address “conflict with a view to ending it.” The database covers more than 140 peace processes between 1990 and 2015, and each agreement has been coded for more than 200 variables — for instance, whether the agreement contains provisions about religious groups. [h/t Melissa Terras] — Data is Plural: February 28, 2018
Links:
Tags: conflict
The Open Plaques project is dedicated to “documenting the historical links between people and places as recorded by commemorative plaques.” The latest data dump contains nearly 40,000 plaques — the vast majority in the U.S., U.K., and Germany. OpenBenches, meanwhile, has collected similar data for 4,300+ memorial benches. [h/t Jason Norwood-Young] — Data is Plural: February 21, 2018
Links:
Tags: mapping
The GitHub Archive is an effort to record the popular code-sharing website’s public timeline, “archive it, and make it easily accessible for further analysis.” The dataset, which includes more than 20 types of events and often contains more than 1 million events per day, goes back to February 2011. Related: Structured data representing the “commit histories” of two dozen popular open-source projects, including Rust, Pandas, Redis, and Bitcoin. — Data is Plural: February 21, 2018
Links:
Tags: technology
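To give a sense of working with the GitHub Archive's event records, here is a minimal sketch that tallies event types. It assumes the archive publishes one gzipped, newline-delimited JSON file per hour at URLs shaped like `https://data.gharchive.org/2018-02-21-15.json.gz`, and that each record carries a `type` field. Both are assumptions to check against the project's site, and the sample records below are invented.

```python
import gzip
import io
import json
from collections import Counter


def archive_url(date, hour):
    """Assumed URL pattern: one gzipped file per hour (hour not zero-padded)."""
    return f"https://data.gharchive.org/{date}-{hour}.json.gz"


def count_event_types(fileobj):
    """Count events by their `type` field in a gzipped JSON-lines stream."""
    counts = Counter()
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                counts[json.loads(line).get("type", "unknown")] += 1
    return counts


# Invented in-memory sample standing in for a downloaded hourly file.
sample = b'{"type": "PushEvent"}\n{"type": "WatchEvent"}\n{"type": "PushEvent"}\n'
demo_counts = count_event_types(io.BytesIO(gzip.compress(sample)))
print(demo_counts.most_common())
```

Reading the file line by line avoids loading a full hour of events into memory at once.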
HappyDB is “a corpus of 100,000 crowd-sourced happy moments.” An example: “My son gave me a big hug in the morning when I woke him up.” The researchers, who recently described their efforts in an academic paper, collected the sentiments from Mechanical Turk workers, who also supplied basic demographic information, such as age, gender, and whether they have children. [h/t Marcel Weiher] — Data is Plural: February 21, 2018
Links:
Tags: entertainment, statistics
Through its Office of Foreign Assets Control, the Treasury publishes several datasets that describe the people and companies subject to U.S. economic sanctions. The two main listings are the Specially Designated Nationals and Blocked Persons (“SDN”) and the Consolidated Sanctions List. Those contain only currently-sanctioned entities, but the Treasury also publishes (semi-structured) documents describing historical additions and removals. Related: Enigma Public’s Sanctions Tracker. [h/t Jennifer Roscoe] — Data is Plural: February 21, 2018
Links:
The Humanitarian Data Exchange has collated dozens of datasets related to the Rohingya refugee crisis. Among them: the geographic boundaries of Rohingya refugee settlements in Bangladesh, the numbers of refugees living in those settlements, and the infrastructure available there. — Data is Plural: February 21, 2018
Links:
Tags: refugees
Via a Freedom of Information Act request to the Fish and Wildlife Service, Newsweek reporter Kristin Hugo obtained a spreadsheet listing all imports of bats — vampire, fruit, yellow-shouldered, leaf-nosed, and more — to the United States between January 2016 and October 2017. — Data is Plural: February 14, 2018
Links:
Tags: animals
The United Kingdom’s Home Office publishes dozens of fire-safety related datasets, including aggregate statistics on response times, smoke alarms, and fire department staffing; incident-level data on appliance fires, vehicle fires, and fatalities; and much more. Of the 100,000+ domestic appliance fires reported over a six-year span, 52% were believed to have been caused by a “cooker incl. oven,” 11% by a “grill/toaster,” 2% by dishwashers, and just over 1% by deep-fat fryers. Semi-related: Jamie Oliver’s Bad Cheese Idea Is Still Starting Toaster Fires. [h/t Owen Boswarva] — Data is Plural: February 14, 2018
Links:
Tags: disaster
Common Voice is a Mozilla-led project that aims “to make voice recognition technology easily accessible to everyone.” To that end, the project asks visitors to record themselves speaking specific sentences, and to validate the recordings of other users. The whole dataset is available to download and currently clocks in at 12 gigabytes, compressed. (Bonus: That download page also links to other freely available voice datasets.) Related: The project’s FAQ. — Data is Plural: February 14, 2018
Links:
Tags: language
Through the Constituency-Level Elections Archive (DIP 2016.09.28) and other sources, you can get historical election results for the U.S. Congress. And through the work of Jeffrey B. Lewis et al., you can get data describing the historical boundaries of each congressional district. In a Scientific Data article published last year, quantitative geographer Levi John Wolf presented a dataset that brings the two types of information together, so that all congressional election results from 1896 to 2014 are “explicitly linked to the geospatial data about the districts themselves.” — Data is Plural: February 14, 2018
Links:
In April 2015, the Gorkha earthquake killed more than 8,000 people in Nepal, and destroyed hundreds of thousands of homes. In early 2016, a team led by the not-for-profit Kathmandu Living Labs, in collaboration with Nepal’s government, undertook “a massive household survey using mobile technology to assess building damage in the earthquake-affected districts.” The responses to that survey are now available at the 2015 Nepal Earthquake Open Data Portal; you can explore the data online or download it in bulk. In all, the datasets include details on millions of individuals, plus information about each surveyed household and building. [h/t Reddit user “phishfart”] — Data is Plural: February 14, 2018
Links:
Tags: disaster
KongTrackr hosts detailed stats about specific games of Donkey Kong, the beloved arcade fixture, with a focus on record-setting scores. The website’s database, which can be downloaded as a single JSON file, currently includes 1,715 games by 450 players. Related: KongTrackr played a role in some recent high-score commotion. Also related: KongTrackr says its site is “heavily influenced” by this database of StarCraft 2 results. — Data is Plural: February 7, 2018
Links:
Tags: entertainment
Before each meeting of the Federal Open Market Committee, the Federal Reserve’s research staff prepares a set of economic projections known as the Greenbook. Those forecasts are kept secret for five years, and then released to the public. The Philadelphia Fed’s archive of public Greenbook data dates back to 1966, and contains both PDFs and structured data files. — Data is Plural: February 7, 2018
Links:
Tags: economics
The Global Register of Introduced and Invasive Species combines data and observations from thousands of sources to create a standardized database of such species in more than 200 countries. The register can be explored by kingdom (plants, animals, fungi, etc.), ecosystem, and country. Each slice of data can be downloaded as a CSV. Related: In a Scientific Data paper published last month, the researchers behind the effort described their methodology in detail. — Data is Plural: February 7, 2018
Links:
Tags: animals, environment, plants
The Global Terrorism Database, run by a University of Maryland–based consortium, is an “open-source database” of more than 170,000 terrorist events. The database, which currently covers 1970 through 2016, is well-documented and includes information about the attackers, locations, weapons, victims, and more. Note: To download the data, you first need to accept an end-user license agreement. Previously: Profiles of Individual Radicalization in the United States, from the same consortium (DIP 2017.05.24). [h/t Brian C. Keegan] — Data is Plural: February 7, 2018
Links:
Tags: terrorism
Lynn Fisher’s Hollywood Age Gap collects data on silver screen love interests — more than 880 so far, from more than 630 movies — and then calculates the difference in those actors’ ages. The largest gap so far is the 52-year age difference in Harold and Maude. The movie with the most pairings is Love Actually, with seven. You can download the data as JSON and CSV files from the project’s GitHub page. [h/t Julia Smith] — Data is Plural: February 7, 2018
Links:
Tags: entertainment, gender
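The age-gap calculation described in the Hollywood Age Gap entry can be sketched in a few lines. This is a minimal illustration, not the project's own code: the function name `age_gap_years` and the birthdates in the usage note are hypothetical, and the project may compute its gaps differently (for example, from ages at a film's release).

```python
from datetime import date


def age_gap_years(birth_a: date, birth_b: date) -> int:
    """Return the whole-year difference between two birthdates.

    Hypothetical helper for illustration; argument order does not matter.
    """
    older, younger = sorted([birth_a, birth_b])
    years = younger.year - older.year
    # Subtract one if the younger person's birthday falls earlier in the
    # calendar year than the older person's, so only full years are counted.
    if (younger.month, younger.day) < (older.month, older.day):
        years -= 1
    return years
```

For example, `age_gap_years(date(1950, 6, 15), date(1980, 6, 14))` counts 29 full years, because the second birthday falls one day short of the 30-year mark.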
Hans Lienesch calls himself The Ramen Rater, and (as his website’s banner declares) he’s been “Celebrating the Instant Noodle for 15 Years.” Over that time, he’s amassed a spreadsheet of more than 2,600 ratings. [h/t dreyco] — Data is Plural: January 24, 2018
Links:
Tags: food
Using a range of public sources, The Duke Chronicle collected data on all 1,739 students listed in the Class of 2018’s “Freshman Picture Book” — including their hometowns, details about their high schools, whether they won a merit scholarship, and whether they play on a sports team — in order to analyze “trends between those who do and don't join Greek life at Duke.” Related: “Is Greek life at Duke as homogenous as you think?,” the first story in the Chronicle’s multipart series based on the data. [h/t Gautam Hathi] — Data is Plural: January 24, 2018
Links:
Tags: education
A team led by researchers at the University of Oxford’s Malaria Atlas Project has estimated the time it would take (as of 2015) to get from any square kilometer in the world to the nearest city of 50,000+ people. The analysis, which improves upon a similar effort from 15 years earlier, benefits from “the first-ever, global-scale synthesis of two leading roads datasets – Open Street Map (OSM) data and distance-to-roads data derived from the Google roads database.” You can download the data as a GeoTIFF, or explore the map online. [h/t Data & Eggs] — Data is Plural: January 24, 2018
Links:
Tags: mapping, transportation
The Consumer Financial Protection Bureau’s National Financial Well-Being Survey collected more than 6,000 responses to the agency’s 10-question Financial Well-Being Scale, plus additional demographic and financial information. The survey results, which were collected in late 2016, come with a detailed methodology and data dictionary. Plus: You can take the questionnaire yourself, anonymously. [h/t Amy Cesal] — Data is Plural: January 24, 2018
Links:
Tags: economics, money, statistics
The Atlas of Economic Complexity has collected decades of import/export data from the United Nations Comtrade database, and then applied “a unique method to clean the data to account for inconsistent reporting practices.” You can download the raw data, learn more about the cleaning process in the FAQ, explore current and historical trade flows, and browse the Atlas’s rankings of countries by “economic complexity.” Related: The researchers have also created regionally-detailed economic atlases of Mexico and Colombia. [h/t Annie White] — Data is Plural: January 24, 2018
Links:
Tags: economics
The unofficial Studio Ghibli API contains structured information about the famed Japanese animation studio’s films (e.g., Princess Mononoke and Spirited Away), plus the characters, locations, and vehicles featured in them. You can also download a single file containing all the data. — Data is Plural: January 17, 2018
Links:
Tags: entertainment, movies
The Open Source Psychometrics Project “provides a collection of interactive personality tests with detailed results that can be taken for personal entertainment or to learn more about personality assessment.” You can download results from more than 30 such tests, including the Big Five Personality Test, the Kentucky Inventory of Mindfulness Skills, and Bob Altemeyer's Right-wing Authoritarianism Scale. Related: “Most Personality Quizzes Are Junk Science. I Found One That Isn’t” (FiveThirtyEight). [h/t Chris Zioutas] — Data is Plural: January 17, 2018
Links:
Tags: healthcare
The London Air Quality Network, run by researchers at King's College London, gathers data on levels of nitrogen dioxide, ozone, fine particulate matter, and other pollutants from more than 100 monitoring sites. You can download the data as CSV files (for up to six metric and site combinations at a time) or fetch JSON and XML data from the site’s API. Related: “London air pollution live data – where will be first to break legal limits in 2018?” (The Guardian). Previously: Air quality data from the EPA (DIP 2017.10.04), OpenAQ (DIP 2017.03.29), Berkeley Earth (DIP 2017.03.22), and the World Health Organization (DIP 2016.06.15). [h/t Gavin Freeguard] — Data is Plural: January 17, 2018
Links:
Tags: environment
The Immigration Policies in Comparison (IMPIC) project has quantified the immigration regulations of 33 OECD countries between 1980 and 2010. The project, led by political sociologist Marc Helbling, dives deeply into the regulations related to four policy areas: labor migration, family reunification, asylum/refugees, and “co-ethnics.” You can find the dataset’s detailed codebook and methodology in this PDF. Related: Helbling's summary of the project’s goals, approach, and initial findings (Migration Data Portal). [h/t David Brady] — Data is Plural: January 17, 2018
Links:
Reclaim The Records launched in 2015 and became a 501(c)(3) non-profit last year. Its mission: To “identify important genealogical records sets that ought to be in the public domain but which are being wrongly restricted by government archives, libraries, and agencies.” The organization files freedom-of-information requests and lawsuits to get the data, and “then we digitize everything we win and put it all online for free, without any paywalls or usage restrictions, so that it can never be locked up again.” Most of the records they’ve received so far have arrived as PDFs or microfilm. But a 2016 court settlement with the NYC City Clerk’s Office netted the group — and the public — a dataset of 3 million NYC marriage licenses from 1950 to 1995. — Data is Plural: January 17, 2018
Links:
Scott Cole is a neuroscience PhD student at UC San Diego who, in his spare time, is leading a project to rate the region’s burritos on a 10-dimensional scale. — Data is Plural: January 10, 2018
Links:
Tags: food
The National Water and Climate Center maintains a series of interactive snow maps. Their snow depth map is based on data from nearly one thousand monitoring stations around the country — mostly in western states, but also a handful in the Southwest, Northeast, and Midwest. To download data from a map, click on “Selected Stations” in the top-left corner, and then click “Export Data as CSV.” [h/t Charlie Loyd's collection of "near-realtime Earth observation resources" + Noah Veltman] — Data is Plural: January 10, 2018
Links:
Tags: climate
With the help of research assistants, legal historian Jed Shugerman has compiled a “tentative database” of prosecutor politicians — presidents, Supreme Court justices, circuit court justices, governors, state attorneys general, and senators who served as prosecutors earlier in their careers. Shugerman’s spreadsheet goes back to 1880 and lists the dates served in office, political party, other offices held, and “relevant prosecutorial background” for each politician. [h/t Geoff Hing] — Data is Plural: January 10, 2018
Links:
The CDC’s 500 Cities Project provides “city and census tract-level data, obtained using small area estimation methods, for 27 chronic disease measures for the 500 largest American cities.” The metrics range from cancer prevalence to binge drinking to dental health to undersleeping. The latest data release was published in December and covers more than 28,000 Census tracts. [h/t Kate Rabinowitz] — Data is Plural: January 10, 2018
Links:
Tags: disease, healthcare
The Bureau of Ocean Energy Management and the Bureau of Safety and Environmental Enforcement — two of the agencies that replaced the troubled U.S. Minerals Management Service in the wake of the Deepwater Horizon spill — publish a few dozen bulk datasets related to their oversight of offshore drilling operations. Among them: lease owners, production metrics, company details, pipeline permits and locations, incident investigations, and platform structures. Related: “American Idle: Decommissioning costs sink offshore drillers into latest crisis,” a 2017 Debtwire investigation that used the platform data. [h/t Alex Plough] — Data is Plural: January 10, 2018
Links:
Tags: energy
“The Khipu Database Project began in the fall of 2002, with the goal of collecting all known information about khipu” — the knotted string textiles used for recordkeeping in the Inca Empire — “into one centralized repository.” The project’s datasets include detailed structural data about hundreds of khipu, as well as an inventory of all known specimens. Related: The College Student Who Decoded the Data Hidden in Inca Knots. — Data is Plural: January 3, 2018
Links:
Tags: miscellaneous
The Census Bureau’s Building Permits Survey collects data from thousands of municipalities every month. For each municipality, metro area, and state, the datasets provide the number of permits issued for new residential housing, the number of housing units authorized, and the total estimated value of the new construction. Previously: The Census Bureau’s Annual Characteristics of New Housing survey (DIP 2016.06.22). [h/t Susie Cambria + Issi Romem] — Data is Plural: January 3, 2018
Links:
Tags: architecture, real estate
Movebank is “a free, online database of animal tracking data hosted by the Max Planck Institute for Ornithology.” On the site’s data map, you can display the animal tracks from particular studies — for instance, the migrations of more than a dozen turkey vultures. Contributing researchers can decide whether to share the underlying data; not all do. (Here’s the data for those vultures, plus six buffalo in Kruger National Park, and seven Venezuelan oilbirds.) [h/t Hari Karthic] — Data is Plural: January 3, 2018
Links:
The Open University Learning Analytics dataset features demographic information about 28,000+ students who, in 2013 and 2014, enrolled in any of seven particular distance learning courses at the UK’s Open University; their final results (distinction, pass, fail, or withdrawn); 173,000+ graded assignments; and 10+ million rows describing each student’s interactions with the courses’ “virtual learning environments.” Useful: The researchers’ academic article describing the dataset. — Data is Plural: January 3, 2018
Links:
Tags: education, technology
The IRS publishes a ton of tax statistics. One of the most interesting portions: data aggregated from individual income tax returns (i.e., Form 1040s), which the IRS provides at the state, county, and ZIP code level. Those datasets’ 100+ fields include details that range from the basic (e.g., the number of tax filings and total income reported) to the more obscure (e.g., the number of returns that included “educator expenses” and the total amount of overpayments refunded). [h/t Cecilia Reyes] — Data is Plural: January 3, 2018
Links:
Tags: taxes
Through a series of surveys, L'Atlante della Lingua Italiana QUOTidiana has been asking Italian speakers what words they use to describe various everyday things. The results for each question can be browsed as maps, or downloaded as XML files. When shown a picture of a watermelon, most respondents wrote “anguria,” but others responded with “cocomero,” “melone,” “citrone,” or “zipangulu.” [h/t Giuseppe Sollazzo] — Data is Plural: December 27, 2017
Links:
As part of a recent investigation, reporters at Reason Magazine used public records law to obtain geospatial data on each of Tennessee's 8,544 drug-free zones. In addition to the geographic boundaries, the shapefile also includes each zone’s name and type (school, childcare, park, or library). [h/t CJ Ciaramella] — Data is Plural: December 27, 2017
Links:
The Digital Database for Screening Mammography was first released two decades ago, in 1997. It contains data and images from 2,620 mammographies — a mix of normal, benign, and malignant cases. In a Scientific Data article published last week, a team of Stanford University researchers describe a series of improvements they’ve made to the original database; their Curated Breast Imaging Subset of DDSM has modernized the database’s image formatting, added detailed “region-of-interest” annotations, and converted the metadata into CSV files. — Data is Plural: December 27, 2017
Links:
Tags: healthcare, women
Ships use the internationally-standardized automatic identification systems (AIS) to broadcast their name, speed, direction, and other details. With a bit of radio hardware and software, anyone can collect the signals emitted by nearby vessels. AISHub aggregates AIS data from hundreds of volunteer signal-collectors around the world, and makes that data available via an API and online maps. The Finnish Transport Agency also provides an API of data collected by its AIS stations on the Baltic Sea and other local waters; Denmark’s government publishes free historical data of maritime traffic on Danish waters; and the Coast Guard publishes historical AIS data for U.S. coastal waters (currently only for 2009–2014). [h/t Topi Tjukanov + Miska Knapek] — Data is Plural: December 27, 2017
Links:
Tags: transportation
The SEC requires Moody’s, Standard & Poor’s, and other “nationally recognized statistical rating organizations” to report their rating assignments and changes (e.g., upgrades, downgrades, withdrawals) going back to 2010. The agencies publish the reports as XBRL-formatted files, and update them monthly. But “because most researchers are unfamiliar with XBRL and cannot easily locate the history files, this valuable resource has seen limited use,” according to the Center for Municipal Finance’s RatingsHistory.info, which now provides the reports as easier-to-use CSVs. [h/t data.world] — Data is Plural: December 27, 2017
Links:
In far northern Norway, the Svalbard Global Seed Vault safekeeps hundreds of millions of seeds, helping to back up the world’s biodiversity. Data on the vault’s deposits, which often contain hundreds of seeds apiece, are available to search and to download. [h/t Enigma Public] — Data is Plural: December 13, 2017
Links:
Tags: plants
For a recent investigation into state legislators’ financial interests, the Center for Public Integrity “analyzed disclosure reports from 6,933 lawmakers holding office in 2015 from the 47 states that required them.” You can search through the disclosures and download the data. For each of the 11,000+ disclosed interests, the dataset includes the lawmaker’s state, legislative body, and district; the name and industry of the financial interest; and a link to the lawmaker’s personal disclosure form. [h/t The Nerds at INN Labs] — Data is Plural: December 13, 2017
Links:
Tags: politics
For several years now, the folks at FiveThirtyEight have been quantifying professional sports teams’ current and historical strength, mostly using Elo rating systems. Their global club soccer ratings go back to 2016, their basketball ratings go back to 1946, their American football ratings go back to 1920, and their baseball ratings go back to 1871. For each of those, the entire histories of match-by-match ratings are available as CSV files. [h/t Jay Boice] — Data is Plural: December 13, 2017
Links:
Tags: history, sports, statistics
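For readers unfamiliar with Elo ratings, the sketch below shows the textbook update rule that such systems build on. It is an illustration under stated assumptions, not FiveThirtyEight's implementation: the function names are hypothetical, the K-factor of 20 and the 400-point scale are conventional defaults, and sport-specific adjustments (home advantage, margin of victory, and so on) are left out.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (between 0 and 1) for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 20.0) -> tuple[float, float]:
    """Return both sides' new ratings after one match.

    score_a is 1.0 if A won, 0.5 for a draw, and 0.0 if A lost.
    k (the K-factor) is an assumed default that controls how far
    ratings move after a single result.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating pool is conserved.
    return rating_a + delta, rating_b - delta
```

With these defaults, two evenly matched 1500-rated teams each have an expected score of 0.5, so a win moves the winner up 10 points and the loser down 10. Applying `update_elo` to each match in chronological order produces the kind of match-by-match rating history the CSV files contain.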
For an investigation published Monday, Vice News spent “nine months collecting data on both fatal and nonfatal police shootings from the 50 largest local police departments in the United States.” They’ve published raw and standardized data on every shooting, plus the code they used to analyze it. [h/t Allison McCann] — Data is Plural: December 13, 2017
Links:
Last month, the Council on Foreign Relations launched the Cyber Operations Tracker, a database of “publicly known state-sponsored cyber incidents that have occurred since 2005.” The 191 attacks in the database so far have been sponsored by 16 different countries, with China, Russia, and Iran being the most cited. For each incident, the dataset also includes the type of attack (e.g. espionage, data destruction), its name (e.g., “Stuxnet”), a description, the date it occurred, its victims, and the type of response, if any. — Data is Plural: December 13, 2017
Links:
Tags: conflict, technology
An anonymous married couple has decided “to be completely open about [their] finances so that people can see what an actual family’s budget looks like.” In addition to blogging about their financial habits, they’ve also published a spreadsheet of “(almost) every dollar” they spent between December 2015 and November 2017. For each transaction, the dataset provides the date, dollar amount, category (e.g., “Groceries”), and meta-category (e.g., “Food”). — Data is Plural: December 6, 2017
Links:
Atlas da Notícia is a Brazilian project that aims to collect data on all local and regional news outlets in the country. Last month, the project released its first batch of data, which identified 5,354 newspapers and online publications in a total of 1,125 municipalities. The raw dataset is currently only available in Portuguese, but the aggregate tables have been translated into English. [h/t Sérgio Spagnuolo] — Data is Plural: December 6, 2017
Links:
Tags: journalism, media
Back in 2013, four dozen Dartmouth College students agreed to let a custom smartphone app surveil them for the StudentLife Study. During the 10 weeks of the spring academic term, the app collected data on the students’ physical activity, GPS coordinates, eating schedule, sleep habits, phone usage, and more. The study combined all that information with a slew of other data, including the students’ class deadlines, academic performance, and their responses to surveys about stress, depression, personality, and sleep quality. The study’s public (and anonymized) dataset clocks in at 53 gigabytes. Related: “Towards Deep Learning Models for Psychological State Prediction using Smartphone Data: Challenges and Opportunities,” a recently-released academic paper that uses the StudentLife dataset. [h/t Konrad Kording] — Data is Plural: December 6, 2017
Links:
Tags: education
The Consumer Financial Protection Bureau’s consumer complaint database can be searched online, accessed via an API, and downloaded in bulk. The 915,000+ complaints the Bureau has received have been categorized into 18 financial product groups (e.g., mortgages, debt collection, student loans, cryptocurrency) and more than 160 kinds of issues (e.g., billing disputes, communication tactics, privacy). The agency says they “don’t verify all the facts alleged in these complaints,” but that they “take steps to confirm a commercial relationship between the consumer and the company.” [h/t Dan Brady] — Data is Plural: December 6, 2017
Links:
My colleague Lam Thuy Vo obtained an anonymized dataset listing all 170,000+ sexual harassment claims submitted to the U.S. Equal Employment Opportunity Commission between October 1995 and September 2016. For each claim, the dataset indicates the date the complaint was filed, the complainant’s gender, and the general category of employer. Additional fields — available for most claims, but not all — indicate the complainant’s birthdate, race, and national origin, as well as the employer’s industry and approximate number of workers. Related: Lam’s story and interactive graphics, which place the data in context. — Data is Plural: December 6, 2017
Links:
The Aarne-Thompson-Uther Classification of Folk Tales organizes (mostly Indo-European) folktales into groups and hierarchies. As Atlas Obscura’s Cara Giaimo puts it, the ATU is “like the Dewey Decimal System, but with more ogres.” The ATU doesn’t publish any downloadable versions of its data, but researchers studying the “ancient roots” of such stories have built a data-matrix that denotes the presence/absence of the 275 ATU “tales of magic” across 50 Indo-European-speaking populations. [h/t Andrew McCartney] — Data is Plural: November 29, 2017
Links:
Tags: entertainment, language
Since 2014, the California Civic Data Coalition has been working to improve access to CAL-ACCESS, “the jumbled, dirty and difficult government database that tracks campaign finance and lobbying activity in California politics.” Their cleaned-up datasets are updated often and include formats suitable for beginners, “database junkies,” and masochists. Last month, the organization released data files cataloging every state ballot measure and candidate for public office since 2000. [h/t Zack Quaintance] — Data is Plural: November 29, 2017
Links:
ProPublica has published a searchable and downloadable dataset of visitor logs and meeting calendars from five White House agencies: the Office of Management and Budget, the Office of the U.S. Trade Representative, the Office of National Drug Control Policy, the Office of Science and Technology Policy, and the Council on Environmental Quality. ProPublica received the underlying documents from Property of the People, a transparency group that sued the Trump administration to release the records under the Freedom of Information Act. (The administration has not released the White House’s main visitor logs.) Related: Politico has manually compiled a searchable database it calls “The Unauthorized White House Visitor Logs”, based on thousands of known visits, meetings, phone calls, and other presidential interactions. Also related: The Obama administration’s White House visitor logs. — Data is Plural: November 29, 2017
Links:
Missing Pieces is “a yearlong investigation by The Trace and more than a dozen NBC TV stations [that has] identified more than 23,000 stolen firearms recovered by police between 2010 and 2016 — the vast majority connected with crimes.” To support the investigation, the reporters obtained more than 800,000 records of stolen and recovered guns, which they’ve standardized into a single CSV file and supplemented with a data dictionary. The dataset “contains nearly complete stolen-gun records for the states of California and Florida, both of which have centralized collections of gun-theft data,” as well as records from nearly 300 other agencies across the country. Previously: The ATF’s gun trace statistics (DIP 2017.11.08) and firearm background checks (DIP 2015.12.09). [h/t Sarah Ryley] — Data is Plural: November 29, 2017
Links:
The Armed Conflict Location & Event Data Project (ACLED) records the locations, dates, actors, and outcomes of “all reported political violence and protest events in over 60 developing countries in Africa and Asia.” The Africa datasets currently go back to 1997 and cover more than 50 countries. The Asia datasets currently only go back to 2015, but ACLED’s website says it’s planning to add data soon going back to 2010. Both of the datasets are extensively documented, as is the methodology. [h/t Lari McEdward] — Data is Plural: November 29, 2017
Links:
A few years ago, economist Alex Albright and a friend transcribed the plotline-sharing dynamics of Friends’ six friends, across all 236 episodes. In the very first episode (“The One Where Monica Gets a Roommate”), Monica and Rachel each have their own plotline; Rachel and Ross share a plotline; and Chandler, Joey, and Ross share another plotline. Related: Albright’s analysis of the data. — Data is Plural: November 8, 2017
Links:
Tags: entertainment, television
Over at BuzzFeed India, Harsha Devulapalli and Janak Jain have crowned Hyderabad the best city in India for going to the movies, based on their analysis of nearly 600 theaters in eight major cities. The underlying dataset lists each theater’s location, name, average ticket price (where available), number of screens, and number of seats. — Data is Plural: November 8, 2017
Links:
Tags: entertainment, movies
As part of NerdWallet’s recent investigation into Rent-A-Center, “the nation’s largest rent-to-own company,” reporters compiled pricing data for 39 consumer products on rentacenter.com. For each product, the dataset lists the various Rent-A-Center costs (e.g., installment fees for weekly/monthly payment plans, cash prices, et cetera) in each of 48 states and D.C. — plus prices for the same product at standard online retailers. Related: NerdWallet’s analysis of the data. — Data is Plural: November 8, 2017
Links:
Reporters at the Center for Investigative Reporting asked 200+ of the largest Silicon Valley tech companies for their official diversity data. Specifically, the reporters requested each company’s latest EEO-1, the detailed demographic report that every large U.S. employer must submit to the federal government. Only 23 companies shared their data. For those that did, their numbers are now available as a tidy spreadsheet. [h/t Sophie Chou] — Data is Plural: November 8, 2017
Links:
The Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF) helps trace guns — such as those recovered at crime scenes by law enforcement agencies — back to their original manufacturers, wholesale distributors, dealers, and purchasers. Each year, ATF publishes a range of datasets based on these gun traces. The datasets for 2016 provide state-by-state tallies of gun caliber, state of original purchase, possessors’ age, associated crime, and more. Related: “Gun Laws Stop At State Lines, But Guns Don’t,” from FiveThirtyEight, using the data. Also related: “How a Gun Trace Works,” from The Trace. Previously: Firearm background checks (DIP 2015.12.09), which my colleague Peter Aldhous analyzed last week, finding that gun sales did not spike after the Las Vegas shooting. — Data is Plural: November 8, 2017
Links:
“Jane Goodall drew the attention of a global audience with vivid depictions of the personalities of eastern chimpanzees (Pan troglodytes schweinfurthii) at Gombe National Park, yet only one attempt [in 1973] has been made to quantify these personality traits systematically,” writes a team of researchers in the latest issue of Scientific Data. To remedy the situation, the researchers paid field observers to score 128 Gombe chimpanzees on 24 personality traits — “dominant,” “excitable,” “helpful,” “sensitive,” and more — on a seven-point scale. — Data is Plural: November 1, 2017
Links:
Tags: animals
Researchers at the University of Washington’s Institute for Health Metrics and Evaluation have estimated cardiovascular mortality rates for each U.S. county, for every year between 1980 and 2014. The findings, based on 32 million de-identified death records, population data from the Census, and other sources, are also broken down by particular disease (e.g., aortic aneurysm, ischemic stroke, etc.) and gender. Related: The researchers’ JAMA article describing their methodology and findings. Previously: The Global Burden of Disease dataset, published by the same institute (DIP 2016.07.27). [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] — Data is Plural: November 1, 2017
Links:
For years, the National Oceanic & Atmospheric Administration has been working to assess the damage done to natural resources by the April 2010 Deepwater Horizon explosion and oil spill. As part of that effort, they’ve collected and compiled several dozen related datasets, including toxicity studies, plankton samples, necropsies of stranded turtles, dolphin health assessments, and a “backyard boater” survey. [h/t Sebastian Kraus] — Data is Plural: November 1, 2017
Links:
Tags: disaster, environment
Rebecca Zisser and Lazaro Gamio at Axios have compiled a timeline of alleged sexual assaults by Harvey Weinstein, Bill O'Reilly, Roger Ailes, Donald Trump, and Bill Cosby. For each of the 140+ cases recorded as of Oct. 20, the timeline indicates the year of the assault, the year the victim came forward (if they did), and the year of any legal settlement (if there was one). The underlying data is available as a spreadsheet. [h/t Mike Allen] — Data is Plural: November 1, 2017
Links:
The U.S. Federal Judicial Center’s “Integrated Data Base” contains a longitudinal record of all federal criminal, civil, and appellate court cases going back to the 1970s, as well as bankruptcy cases going back to late 2007. Each dataset contains dozens of detailed fields — including each case’s jurisdiction, name, docket number, relevant legal statutes, and more — accompanied by explanatory codebooks. You can download single-year snapshots and cumulative files, or interactively select specific slices of data to export. Related: “How the Bankruptcy System Is Failing Black Americans,” an investigation by ProPublica that used the IDB’s data on bankruptcy cases for its analysis. — Data is Plural: November 1, 2017
Links:
Tags: law
ConceptNet “is a freely-available semantic network, designed to help computers understand the meanings of words that people use.” It defines approximately 28 million “statements,” i.e., relationships between various things. For instance, ConceptNet indicates that a newsletter is a type of “report”, and that a computer can be used to “send email”. You can download the entire dataset, or access it via an API. — Data is Plural: October 18, 2017
Links:
Tags: language, technology
In the wake of the Second Vatican Council in the 1960s, Sister Marie Augusta Neal conducted an enormous opinion survey of Catholic “women religious.” More than 130,000 sisters responded to the 649-question multiple-choice survey, the results of which the University of Notre Dame recently cleaned up and made available online. [h/t Kevin Schlottmann] — Data is Plural: October 18, 2017
Links:
The U.S. Patent and Trademark Office publishes a huge amount of bulk data, including detailed XML files that contain information about millions of patent/trademark applications, assignments, trials, and appeals. The agency also publishes a collection of “research datasets”, which distill those bulk XML files into easier-to-use tabular data. [h/t Rachael Tatman] — Data is Plural: October 18, 2017
Links:
Tags: business, technology
University of Michigan–based researchers have created “a repository of micro-level, subnational event data on armed conflict and political violence around the world.” The project, dubbed xSub, standardizes information from 21 data sources, and includes conflicts in 139 countries between 1942 and 2016. For each administrative boundary (e.g., country, province, district) and data source, xSub’s data counts the number of violent incidents by year, month, week, or day. The numbers are also broken down by the sides involved, who initiated the conflict, and what types of force were used. [h/t Andy Halterman] — Data is Plural: October 18, 2017
Links:
Tags: conflicts
Since shortly after Hurricane Maria hit Puerto Rico, the territory’s government has been publishing a dashboard of recovery statistics. The website tracks a couple dozen metrics, including the percent of homes with electricity, number of people in shelters, and the number of open hospitals. For several of the main metrics, researcher Michael A. Johansson has been scraping daily figures from the dashboard and publishing them as a CSV file. Related: The Washington Post has been charting the recovery, and published a deep dive into the island’s ongoing power outages. — Data is Plural: October 18, 2017
Links:
Tags: disaster, statistics
Carnegie Mellon’s Motion Capture Database provides data files and videos representing humans performing various activities: shaking hands, drinking soda, exchanging “angry hand gestures,” doing cartwheels, mopping floors, laughing, chicken-dancing, and oh-so-much more. [h/t John Emerson] — Data is Plural: October 11, 2017
Links:
Tags: science
The U.S. Geological Survey has been measuring water quality in the San Francisco Bay for nearly 50 years. The agency recently published 210,826 of these measurements, collected from dozens of monitoring stations between April 1969 and December 2015. (It’s “one of the longest records of water-quality measurements in a North American estuary,” according to a recent academic article describing the data.) Each row specifies the measurement’s date, station, depth, temperature, and salinity; many rows include levels of chlorophyll, oxygen, nitrate, ammonium, and other matter. — Data is Plural: October 11, 2017
Links:
Tags: environment
The Federal Motor Carrier Safety Administration helps to regulate the United States’ large trucks and passenger buses. The datasets available through its Safety Measurement System include a census of all regulated carriers, the results of safety inspections, and reported crashes. The crash files list the number of injuries and fatalities; the weather, light, and road conditions; the involved vehicle’s VIN and license plate number; and more. [h/t Dan Brady] — Data is Plural: October 11, 2017
Links:
The Crowd Counting Consortium, launched earlier this year, is a volunteer effort to “[collect] publicly available data on political crowds reported in the United States, including marches, protests, strikes, demonstrations, riots, and other actions.” The team publishes monthly spreadsheets that list each crowd’s date, location, type, and cause (e.g., “Oppose removal of confederate statue”); high and low size estimates; the number of reported arrests and injuries; links to sources; and additional details. Related: The project’s main coordinators have been summarizing their findings on the Washington Post’s Monkey Cage blog. [h/t Amanda L. James] — Data is Plural: October 11, 2017
Links:
“Monitoring Trends in Burn Severity (MTBS) is an interagency program whose goal is to consistently map the burn severity and extent of large fires across all lands of the United States”; the most recent release contains more than 20,000 fires from 1984 to 2015. You can explore the data online, or download it in bulk. For more recent data, see GeoMAC, which aims to map all current wildfires; NOAA’s Hazard Mapping System, which uses satellites to detect fire locations and smoke plumes; and NASA’s MODIS and VIIRS datasets, which provide satellite-based detections for the entire globe. Previously: National Fire Incident Reporting System, which also includes structure fires and vehicle fires (DIP 2016.07.20). [h/t Max Joseph] — Data is Plural: October 11, 2017
Links:
Tags: disaster, environment
For a new interactive essay at The Pudding, Ash Ngu analyzed the gender composition of This American Life episodes. To support the findings, Ngu has published the underlying data, extracted from the show’s transcripts. Among the data extracted: the number of words spoken by each person in each act of each episode. — Data is Plural: October 4, 2017
Links:
Tags: gender, journalism
In certain cities, private developers can earn zoning concessions by converting sections of their properties into plazas, atriums, mini-parks, and other open-to-the-public spaces. You can download datasets of these “privately owned public spaces” in San Francisco, Seattle, New York City, and — thanks to a recent collaboration between Guardian Cities and a local community group — London. Related: A guide to NYC’s POPS. [h/t Reddit user seeriktus + Ed Vine] — Data is Plural: October 4, 2017
Links:
Tags: mapping
Media Cloud, a collaboration between MIT and Harvard–based researchers, describes itself as “an open-source platform for studying media ecosystems.” The project lets you track topics and keywords across thousands of sources — including mainstream news publications in the U.S. and many other countries — at both a story and sentence level. You can access Media Cloud’s data via its dashboard or its API. Both require (free) registration. Related: “The Media Really Has Neglected Puerto Rico,” by Dhrumil Mehta at FiveThirtyEight; the analysis uses data from Media Cloud, the TV News Archive, and Google Trends. Also related: The geometry of hurricane coverage, as told through the front pages of The New York Times and Washington Post. — Data is Plural: October 4, 2017
Links:
Tags: journalism
Last week, the National Institutes of Health released a dataset containing more than 100,000 anonymized chest x-rays, from 30,000 patients, “including many with advanced lung disease.” For each image, the associated metadata includes the patient’s age, gender, and diagnosis labels. (The dataset’s authors used natural language processing to extract those labels from radiological reports; they estimate that fewer than 10% of the labels are incorrect.) Related: Andrew L. Beam’s list of medical datasets for machine learning. [h/t Chris Hamby] — Data is Plural: October 4, 2017
Links:
Tags: healthcare
The Environmental Protection Agency collects air quality samples from thousands of monitoring stations across the country. The resulting datasets, which go back to the 1980s, are available as daily files, annual files, and via an API. The monitored pollutants include ozone, carbon monoxide, sulfur dioxide, nitrogen dioxide, particulate matter, volatile organic compounds, and more. You can also download daily Air Quality Index ratings and information about each monitoring station. Previously: Global air pollution datasets from Berkeley Earth (DIP 2017.03.22) and from the World Health Organization (DIP 2016.06.15). [h/t Swier Heeres] — Data is Plural: October 4, 2017
Links:
Tags: environment
For each of 966 occupations, the Department of Labor’s O*NET database quantifies the types of knowledge, skills, abilities, education, and training required, tasks involved, tools used, and more job-related parameters. Related: The Upshot uses the data to ask (and answer), “What Is Your Opposite Job?” — Data is Plural: September 27, 2017
Links:
Tags: statistics
New York City’s Department of Transportation publishes a bunch of data, including its own assessments of each street segment’s quality on a 1-to-10 scale. It also publishes spreadsheets of all construction-related street closures, by intersection and by block, updated daily. [h/t Christian Moscardi] — Data is Plural: September 27, 2017
Links:
Tags: mapping, transportation
The UK’s Ordnance Survey makes detailed digital maps of Great Britain. Their free offerings include all of the island’s roads, rivers, green spaces, and place names. The Survey’s “open map” includes buildings, railways, electricity transmission lines, and other features. Related: Want only the buildings? The University of Sheffield’s Alasdair Rae has you covered. [h/t Robyn Inglis] — Data is Plural: September 27, 2017
Links:
Tags: architecture, mapping
The TV News Archive’s new “Third Eye” project is extracting chyrons — those placards of text at the bottom of news broadcasts, also known as “lower thirds” — from four major cable networks: BBC News, CNN, Fox News, and MSNBC. The resulting database contains every chyron that Third Eye’s optical character recognition (OCR) software has extracted since late August. Related: This Washington Post piece analyzing cable news’ chyrons during James Comey’s congressional testimony, and this explanation of how they did it. [h/t Nancy Watzman] — Data is Plural: September 27, 2017
Links:
Tags: journalism, language, media
Earlier this month, the FBI and 18F released the first iteration of their Crime Data Explorer, a website that simplifies access to the FBI’s Uniform Crime Reporting program. You can download bulk data on individual incidents, state and national trends, hate crimes, arrests, assaults on officers, police employees, human trafficking, and cargo theft. You can also access the data via an API. Caution: The FBI’s data collection program is voluntary; not all law enforcement agencies participate. (In fact, more than 3,000 agencies don’t submit hate crime data.) [h/t Nick Wright] — Data is Plural: September 27, 2017
Links:
Tags: crime
The popular “webcomic of romance, sarcasm, math, and language” provides an interface for grabbing data about each comic strip, including the title, image file, date of publication, easter-egg-y “alt” text, and transcript. [h/t Karl L. Hughes] — Data is Plural: September 20, 2017
Links:
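A per-strip record like the one described above can be parsed with the standard library. This is a sketch only: it assumes the interface returns one JSON object per strip, and the field names (`title`, `img`, `alt`, `transcript`, `year`, `month`, `day`) and the sample values are assumptions made for illustration.

```python
import json
from datetime import date

# Hand-written sample record. Field names and values are assumptions.
SAMPLE = json.dumps({
    "title": "Example strip",
    "img": "https://example.com/strip.png",
    "alt": "Hover text goes here",
    "transcript": "",
    "year": "2017",
    "month": "9",
    "day": "20",
})


def summarize(raw):
    """Turn one raw JSON record into a small dict with an ISO date."""
    record = json.loads(raw)
    published = date(int(record["year"]), int(record["month"]), int(record["day"]))
    return {
        "title": record["title"],
        "image": record["img"],
        "alt": record["alt"],
        "published": published.isoformat(),
        "has_transcript": bool(record["transcript"]),
    }


print(summarize(SAMPLE))
```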
Earlier this year, Politico reporters scoured the internet’s WHOIS records for domains registered to the Trump Organization. They found thousands, including TrumpRussia.com, No2Trump.com, Trumpublican.net, and ImBeingSuedByTheDonald.com. (Most, including those, just send readers to a generic “domain parking” landing page.) Politico has open-sourced the article’s components, including a JSON file containing 1,267 of the domains, which includes each domain’s owner, creation date, last-updated date, and expiration date. [h/t Tyler Fisher] — Data is Plural: September 20, 2017
Links:
Tags: Trump, technology
The Democracy Fund Voter Study Group, “a research collaboration comprised of nearly two dozen analysts and scholars from across the political spectrum,” has published the participant-level data from its 2016 VOTER survey. It’s a “unique longitudinal data set” that represents the “political attitudes, values, and affinities” of 8,000 American adults who were interviewed first in December 2011, then again before and after the 2012 election, and again in December 2016. [h/t Jenny Listman] — Data is Plural: September 20, 2017
Links:
Tags: elections
After major natural disasters, NOAA’s National Geodetic Survey routinely collects detailed aerial photos of the affected areas. For each disaster — including Hurricane Harvey, Hurricane Irma, and a couple dozen others — you can download the full set of (georeferenced) images, by date and survey flight. [h/t David Yanofsky] — Data is Plural: September 20, 2017
Links:
Tags: disaster
The U.S. Federal Communications Commission publishes a ton of data on the “wireline” telecommunications industry, including several datasets about broadband internet access. Among them: the places where providers offer service, subscriptions per 1,000 households in each Census tract, and a survey of plans available in urban areas. You can also find a spreadsheet of payphones-by-state at the bottom of that landing page. (As of last March, there were only 113 payphones left in North Dakota, down from 705 in 2008.) Related: “Signs of Digital Distress,” a new Brookings Institution report, with findings and maps based on the broadband subscription data. — Data is Plural: September 20, 2017
Links:
Tags: technology
The “robust and curated” Global Wood Density Database contains more than 16,000 entries, culled from scientific literature, websites, and unpublished scholarship. The densest so far is a Caesalpinia sclerocarpa from Mexico, weighing in at 1.39 grams per cubic centimeter. Related: The TRY database of “curated plant traits” (free registration required). [h/t Amy Zanne] — Data is Plural: September 13, 2017
Links:
Tags: plants
When companies file reports to the U.S. Securities and Exchange Commission, they do so through the SEC’s EDGAR system. The SEC makes those filings available online, and it uses EDGAR’s server logs to analyze web traffic to the site. The SEC’s EDGAR Log File Data Set contains a set of CSVs — one for each day between February 14, 2003 and December 31, 2016 — extracted from those server logs. For each document visited, the data includes the visitor’s unique-but-obfuscated IP address, the date and time of the visit, the IDs of the document and associated company, and some information about the visitor’s browser. [h/t Brian C. Keegan] — Data is Plural: September 13, 2017
Links:
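Because the log data comes as one file per day, a download or processing script usually starts by walking the date range. The sketch below does that for the dates given above. The file-name pattern is hypothetical and would need to be checked against the SEC's own listing.

```python
from datetime import date, timedelta

FIRST_DAY = date(2003, 2, 14)
LAST_DAY = date(2016, 12, 31)


def daily_dates(start=FIRST_DAY, end=LAST_DAY):
    """Yield every calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def hypothetical_filename(day):
    """Build a per-day file name. The pattern is an assumption."""
    return f"log{day:%Y%m%d}.csv"


days = list(daily_dates())
print(len(days), hypothetical_filename(days[0]), hypothetical_filename(days[-1]))
```

Counting the days first gives a quick sanity check on how many files a complete download should contain.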
The Internet Archive has pumped footage from CNN, Fox News, MSNBC, and the BBC through software trained to recognize the faces of Donald Trump and majority/minority leaders of the U.S. House and Senate. The result: Face-O-Matic, a dataset released to the public last week. For each face the software found, the dataset includes the network, program, date, time, duration, and a link to the footage on the TV News Archive. Since mid-July, Face-O-Matic has logged more than 50,000 sightings. [h/t Nancy Watzman] — Data is Plural: September 13, 2017
Links:
Tags: Trump, journalism, politics
Two weeks ago, DIP featured Case-Shiller’s home price index data. There are, in fact, several other prominent (and downloadable) house price indices, including the Federal Housing Finance Agency’s House Price Index, the National Association of Realtors’ indices, and Zillow’s Home Value Index. Helpful: This guide to various home price indices and how they’re constructed, by Jed Kolko, formerly Trulia’s chief economist. Related: This critique of Case-Shiller’s approach, also by Kolko. — Data is Plural: September 13, 2017
Links:
Tags: real estate
The Dartmouth Flood Observatory’s Global Archive of Large Flood Events contains data about 4,500+ floods, dating back to 1985. It’s updated often, and is available in Excel, XML, HTML, and geospatial formats. The variables include each flood’s location, timespan, severity, main cause, and estimated impact. The organization also publishes detailed maps of the “maximum observed flooding” for specific disasters, such as for Hurricane Harvey and for Hurricane Irma. Related: A Science Magazine mini-profile of the DFO and its founder. Previously: U.S. tide gauges and flood observations (DIP 2016.03.23), UK coastal flooding (DIP 2017.08.09), and FEMA flood risk maps (DIP 2017.08.30). — Data is Plural: September 13, 2017
Links:
Earlier this month, The New York Times asked readers to rate 50 of the show’s most recognizable characters along two dimensions: good ↔ evil, and ugly ↔ beautiful. They’ve received 190,000+ submissions. The results are accessible as two JSON files: one for the averages and another for the distributions. — Data is Plural: August 30, 2017
Links:
Favicons are the little square icons in your browser’s tabs, placed there by the websites you’ve loaded. Two recent projects attempted to collect these markers from the web’s million most-trafficked domains. One, by programmer Colin Morris, collected 360,000 favicons in July 2016. The second, by researchers at ETH Zurich, collected 548,000 favicons in April 2017. Semi-related: Morris’s “Finding bad flamingo drawings with recurrent neural networks”; the analysis uses Google’s 50-million-doodles data, featured in DIP 2017.05.04. — Data is Plural: August 30, 2017
Links:
Tags: art, technology
The Federal Reserve Bank of St. Louis publishes S&P/Case-Shiller Home Price Index data, which measures changes in average home prices over time. The monthly-updated datasets — copyrighted, but free to download — are available at a national and metro-area level, and go back several decades. — Data is Plural: August 30, 2017
Links:
Tags: real estate
The Mapping Inequality project has digitized more than 150 of the “security maps” produced by the Home Owners' Loan Corporation between 1935 and 1940. Together, the maps “offer a view of Depression-era America as developers, realtors, tax assessors, and surveyors saw it — a set of interlocking color-lines, racial groups, and environmental risks.” To download the data for a given map, click on the cloud icon in the top-right corner. Related: A new research paper, by economists at the Federal Reserve Bank of Chicago, uses the data to quantify redlining’s lasting effects. Also related: The New York Times’ summary of the data and research. [h/t Kendall Taggart] — Data is Plural: August 30, 2017
Links:
FEMA’s Flood Map Service Center publishes geospatial files that detail the agency’s flood risk assessments — both current and historical. The maps include flood zones, levee locations, “base flood elevations,” and more. Helpful: FEMA’s technical documentation. Related: “Why Houston Isn’t Ready for Harvey,” published last week by ProPublica and The Texas Tribune; and “Hell and High Water,” the reporting team’s deep dive on Houston last year. Previously: The most comprehensive global dataset of cyclone paths (DIP 2017.04.19). — Data is Plural: August 30, 2017
Links:
In the 1970s, a team of linguistic investigators canvassed the globe, armed with boxes of color chips. They sought out a couple dozen native speakers of 110 unwritten languages, and asked: What do you call these colors? The results are available online. Related: This Vox video provides context. — Data is Plural: August 23, 2017
Links:
The congressionally-established National Endowment for the Humanities publishes a dataset of all of the grants it has awarded since the late 1960s. On the same page, you can download a file describing the organization’s 25,000+ “evaluators” — “knowledgeable persons outside NEH who are asked for their judgments about the quality and significance” of proposed projects. [h/t Brett Bobley + Max Kemman] — Data is Plural: August 23, 2017
Links:
Tags: aid, art, entertainment
The Database of State Incentives for Renewables & Efficiency “is the most comprehensive source of information on incentives and policies that support renewables and energy efficiency in the United States.” The database, which was founded in 1995 and is funded by the Department of Energy, includes tax rebates, solar energy buybacks, building standards, and more. You can download the data in several formats, or browse and search it online. [h/t Carol Brotman White] — Data is Plural: August 23, 2017
Links:
Last year, the British government began requiring companies to identify all the people who exert power over them. The resulting “People with Significant Control” database contains each person’s name, country of residence, nationality, and “nature of control” — e.g., ownership of large numbers of shares, voting rights, or the ability to appoint/remove directors. [h/t Enigma Public] — Data is Plural: August 23, 2017
Links:
Tags: business
A team of researchers has compiled “the largest ever geo-coded database of anophelines in Africa.” (Anophelines are the only kind of mosquito that transmits malaria.) The database covers 1898 to 2016 and includes more than 13,400 observations of mosquitoes in specific locations. For each observation, the dataset lists the country, administrative region(s), and latitude/longitude, as well as the time period, the species identified, the sampling method, and the source of the information. [h/t Michael Chew] — Data is Plural: August 23, 2017
Links:
Robin Sloan, author of Mr. Penumbra’s 24‑Hour Bookstore, has a new book coming out next month — one that he believes “is the first novel in English to feature, as a main supporting character, a possibly-sentient sourdough starter.” To dole out advance copies of the book, Sloan conducted the following contest: Try to choose the smallest prime number that nobody else will pick. Now he’s posted the results — a CSV listing the number of contestants who chose each prime number. (Seventeen was the most popular number among the contest’s 1,354 entries; the smallest unique prime was 409.) — Data is Plural: August 16, 2017
Links:
Tags: miscellaneous
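The contest rule above (the smallest prime that only one person picked) is easy to check against a CSV of picks per prime. This is a minimal sketch: the column names `prime` and `count` and the sample rows are made up for illustration and are not taken from the published file.

```python
import csv
import io

# Made-up sample. The real file's column names and values may differ.
SAMPLE_CSV = """prime,count
2,5
3,4
5,2
7,1
11,3
13,1
"""


def load_counts(text):
    """Read 'prime,count' rows into a {prime: count} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["prime"]): int(row["count"]) for row in reader}


def smallest_unique(counts):
    """Return the smallest prime picked exactly once, or None."""
    unique = [prime for prime, n in counts.items() if n == 1]
    return min(unique) if unique else None


def most_popular(counts):
    """Return the prime with the highest number of picks."""
    return max(counts, key=counts.get)


counts = load_counts(SAMPLE_CSV)
print(smallest_unique(counts), most_popular(counts))
```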
The Energy Information Administration’s Petroleum Supply Monthly contains detailed data about how the United States obtains crude oil and petroleum products, and where that supply goes. In May, for instance, the U.S. refined nearly 314 million barrels of “finished motor gasoline” and exported 18.6 million barrels of it. — Data is Plural: August 16, 2017
Links:
The Open Access Series of Imaging Studies (OASIS) project is “aimed at making MRI data sets of the brain freely available to the scientific community,” with the goal of “[facilitating] future discoveries in basic and clinical neuroscience.” So far, the project has published two collections: a cross-sectional dataset of scans from 416 people, ages 18 to 96; and a longitudinal dataset, based on 150 people aged 60 to 96, each of whom was scanned at least two different times. [h/t Andrew Beam] — Data is Plural: August 16, 2017
Links:
Tags: healthcare
“The U.S. government has prosecuted 808 people for terrorism since the 9/11 attacks. Most of them never even got close to committing an act of violence.” Those are the findings of The Intercept’s Trial and Terror database, first published in April and most recently updated last week. The underlying data — available on GitHub — contains each defendant’s name and demographic details, as well as each case’s description, status, charges, charge date, conviction date (if convicted), jurisdiction, and more. — Data is Plural: August 16, 2017
Links:
UPDATE: VOA News used this data for Terror on Trial: The Imam’s Choice.
Chronicling America — a project run by the Library of Congress and the National Endowment for the Humanities — provides information about more than 150,000 historic newspapers and access to digitized pages from many of them. Its API lets you search the database and doesn’t require registration; its bulk data includes text from more than 12 million pages. For instance, here’s the Omaha Daily Bee’s front page on April 7, 1917, the day after the U.S. entered World War I. [h/t Ed Summers] — Data is Plural: August 16, 2017
Links:
Tags: history, journalism
California’s Department of Industrial Relations publishes a dataset of all licensed talent agencies, with each agency’s name, address, license number, workers’ comp insurer, and bond issuer. Florida publishes something similar. Previously: Texas’s licensed professionals (DIP 2015.12.09). — Data is Plural: August 9, 2017
Links:
Tags: business, entertainment
Earlier this year, the researchers behind SurgeWatch.org published an updated version of their database of UK coastal floods. They combined tidal gauge data with reports from scientific journals, newspapers, and social media to identify 329 “coastal flooding events” that occurred between 1915 and 2016. For each event, the dataset includes the date, region, and severity level, which ranges from 1 (“nuisance”) to 6 (“disaster,” applied to only one event — the North Sea flood of 1953). — Data is Plural: August 9, 2017
Links:
The Atlas of Pidgin and Creole Language Structures contains data on 76 languages, such as Trinidad English Creole, Afrikaans, Guadeloupean Creole, and Singapore Bazaar Malay. For each language, the dataset includes information about 130 “structural features,” example sentences, and more. Previously: The World Atlas of Language Structures (DIP 2016.01.06) and a database of the Trans-New Guinea language family (DIP 2015.11.04). [h/t Rachael Tatman] — Data is Plural: August 9, 2017
Links:
Tags: language
The federally funded Freight Analysis Framework “integrates data from a variety of sources to create a comprehensive picture of freight movement among states and major metropolitan areas by all modes of transportation.” For each year between 2012 and 2015, the database “provides estimates for tonnage (in thousand tons) and value (in million dollars) by regions of origin and destination, commodity type, and mode.” Last week, Axios published an interactive map of the state-to-state flows for each commodity group, as well as some helpful caveats and “head-scratchers.” [h/t Chris Canipe] — Data is Plural: August 9, 2017
Links:
The USDA National Nutrient Database for Standard Reference is the primary source for most of the food nutrition facts you see in America. The database assesses more than 8,000 foods, from abiyuch to zwieback, and provides the average nutrient levels per 100 grams — e.g., protein, carbohydrates, vitamin D, caffeine, lycopene, and water. North of the border, you can find the (bilingual) Canadian Nutrient File. It’s based on the USDA data, but excludes stateside foods “known not to be on the Canadian market”, adds some foods (such as poutine and ptarmigan), and makes adjustments based on “Canadian levels of fortification and regulatory standards.” The United Kingdom has its own nutrient file, as do many other countries. [h/t Reddit user Alacritous] — Data is Plural: August 9, 2017
Links:
Tags: food
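Since the nutrient levels above are given per 100 grams, comparing real portions means scaling each value by the serving weight. A minimal sketch follows, with made-up numbers that are not taken from the database.

```python
def nutrient_in_serving(per_100g, serving_grams):
    """Scale a per-100-gram nutrient value to a serving of the given weight."""
    return per_100g * serving_grams / 100.0


# Hypothetical food: 10 g of protein per 100 g, eaten as a 30 g serving.
print(nutrient_in_serving(10.0, 30.0))
```

The same scaling applies to any nutrient column, as long as the value and the result are read in the same unit (grams, milligrams, and so on).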
The New York Philharmonic has published three spreadsheets listing its subscribers — including where they sat, how much they paid, and where they had their tickets sent — for a slew of orchestral seasons between 1883 and the late 1990s. The earliest data includes names, too. (“Miss A. Brown” of 715 Fifth Avenue seems to have been a big fan, having subscribed to 26 seats for the 1890-91 season.) Previously: The Philharmonic’s performance history (DIP 2016.10.12). [h/t Rachel Shorey] — Data is Plural: August 2, 2017
Links:
Tags: entertainment, music
The Seattle Public Library publishes a dataset of every checkout of every physical item (e.g., paperback books and DVDs, but not e-books) since April 2005. It currently contains more than 90 million rows. Previously: The library’s monthly checkout counts, by title (DIP 2017.03.01). [h/t David Christensen] — Data is Plural: August 2, 2017
Links:
Tags: books
The California Department of Education publishes aggregate scores on the AP, SAT, and ACT high-school tests for each county, district, and school going back to the late 1990s. One hitch: For more than two months, the 2016 AP data “contained 350,000 more tests than had actually been taken,” according to inewsource.org’s Megan Wood, who spotted the discrepancies (and others) and got the department to fix them. Similar datasets are available from other states, including Texas, Florida, and Pennsylvania. Bonus: inewsource.org has also published easy-to-search tables of the California AP, SAT, and ACT scores. — Data is Plural: August 2, 2017
Links:
Tags: education
The EU publishes a searchable database of people and organizations registered to lobby the European Parliament and the European Commission. The website LobbyFacts.eu takes that data and makes it available via an API. LobbyFacts also scrapes the European Commission’s disclosed lobbying meetings, which you can download here (warning: 10-megabyte direct download). Related: You can also explore the lobbyists and meetings via IntegrityWatch.eu, which uses LobbyFacts’ data. Previously: U.S. government lobbyists (DIP 2017.05.31). [h/t Enigma Public + Xavier Dutoit] — Data is Plural: August 2, 2017
Links:
Tags: politics
After Malaysia Airlines flight MH370 disappeared in March 2014, the Australian government undertook an enormous seafloor-mapping operation in search of the lost Boeing 777. Last month, it released data from the first phase of the project, which collected 278,000 square kilometers of bathymetry (i.e., seafloor topography) measurements. “In general, the world's deep oceans have had little investigation,” the government explains in an interactive map. “Only 10 to 15 percent of the ocean has been mapped with the sonar technology similar to that used in the search for MH370.” As a result, the MH370 search area “is now among the most thoroughly mapped regions of the deep ocean on the planet.” [h/t Soh Kam Yung] — Data is Plural: August 2, 2017
Links:
Data Stories is a podcast about data visualization, hosted by Enrico Bertini and Moritz Stefaner. To celebrate their recently-published 100th episode, the hosts released a spreadsheet detailing the date, title, number and genders of guests, length, and timestamped subchapters of each episode so far. Related: Christian Laesser’s visualization of the data. [h/t Benjamin Cooley] — Data is Plural: July 26, 2017
Links:
Tags: audio, media, statistics
During the course of its Enron investigation, the Federal Energy Regulatory Commission obtained the emails of approximately 150 (mostly high-ranking) Enron staff. You can find versions of the dataset — cleaned, deduplicated, and restructured in various ways — hosted by Carnegie Mellon, UC Berkeley, and Duke Law. Related: “What the Enron Emails Say About Us,” published by The New Yorker last week. Nathan Heller writes: The Enron archive “remains one of the country’s largest private e-mail corpora turned public. Its lasting value is less as an account of Enron’s daywork than as a social and linguistic data pool, a record of the way we write online when we’re not preening for the public eye.” — Data is Plural: July 26, 2017
Links:
As the basis for his recent study, “Is Running Enough? Reconsidering the Conventional Wisdom about Women Candidates” (paywalled, but a draft is freely available), PhD candidate Peter Bucchianeri compiled a dataset of female candidates in House primary elections from 1972 to 2010. The spreadsheet covers 1,242 candidacies, and includes each candidate’s party, votes garnered in the primary and general elections, the seat’s incumbency status, the district’s demographics, and more. — Data is Plural: July 26, 2017
Links:
NBC News has been tracking the president’s visits to his own luxury properties. For each day since Trump took office, the data — available to download at the bottom of the page — tells you which properties he visited and whether any were golf courses. Since February, Trump has visited his properties roughly 10 days a month, including 25 trips to Mar-a-Lago and 42 trips to his golf courses. Related: A similar tracker from The New York Times. [h/t Rachel Schallom] — Data is Plural: July 26, 2017
Links:
The Densho Digital Repository is an archive of oral histories, photographs, newspaper clippings, and other primary sources relating to the internment of Japanese Americans during World War II. Among the materials: several datasets listing people sent to the internment camps, based on official government records. The largest dataset contains more than 100,000 entries and includes details such as each internee’s “relocation” site, arrival date, hometown, birth year, time spent in Japan, marital status, religion, educational degrees, occupation, and military service. The National Archives hosts the raw data, as well as its documentation. — Data is Plural: July 26, 2017
Links:
The National Park Service and Geyser Observation and Study Association have been using water-temperature sensors to track the eruption times of dozens of geysers in Yellowstone — Old Faithful, of course, but also Beehive, Little Squirt, and Narcissus. GeyserTimes.org combines this data with historical logbooks and observations from “geyser gazers” to form what it describes as “the most comprehensive database of geyser eruption and observation data on the internet.” — Data is Plural: July 19, 2017
Links:
Tags: environment, statistics
The International Monetary Fund’s World Economic Outlook Database contains the fund’s projections for future “national accounts, inflation, unemployment rates, balance of payments, fiscal indicators, trade for countries and country groups” and commodity prices. (They predict that farm-bred Norwegian salmon will cost $6.79/kg in 2022.) The database also contains historical observations for many of the economic indicators back to 1980. [h/t David Mihalyi] — Data is Plural: July 19, 2017
Links:
Tags: economics
Each September, the United Nations gathers for its annual General Assembly. Among the activities: the General Debate, a series of speeches delivered by the UN’s nearly 200 member states. The statements provide “an invaluable and, largely untapped, source of information on governments’ policy preferences across a wide range of issues over time,” write a trio of researchers who, earlier this year, published the UN General Debate Corpus — a dataset containing the transcripts of 7,701 speeches from 1970 to 2016. The researchers have also published an online tool for exploring and visualizing the dataset. Previously: UN General Assembly votes since 1946 (DIP 2016.07.13). [h/t Ronny Patz] — Data is Plural: July 19, 2017
Links:
Tags: United Nations
Last week, a team at NYU announced “the world’s densest urban aerial laser scanning (LiDAR) dataset” — a 1.4-billion-point description of Dublin’s city center. They write: “At over 300 points per square meter, this is more than 30 times denser than typical LiDAR data and is an order of magnitude denser than any other aerial LiDAR dataset.” The researchers collected the topographical data during a series of criss-crossing flyovers on March 26, 2015. They’ve also published a short, illustrative video. Previously: LiDAR datasets (DIP 2016.05.25) and 3D models (DIP 2017.04.05) of cities and countries around the world. [h/t Darrell Etherington] — Data is Plural: July 19, 2017
Links:
Tags: mapping
You’ve probably seen The Washington Post’s solar eclipse graphics from last Monday. The stellar maps are largely based on an online tool that uses data from NASA's Five Millennium Canon of Solar Eclipses. The tool can (among other things) generate maps and KMZ files describing the paths of the 11,898 solar eclipses Earth will have experienced between 2000 BCE and 3000 CE. Helpful: NASA’s key to understanding the data terminology. — Data is Plural: July 19, 2017
Links:
Tags: science
The recently-launched Tweets Of Congress is collecting and publishing daily archives of tweets by congressional representatives, caucuses, and committees. Meanwhile, the Trump Twitter Archive has collected more than 30,000 of @realDonaldTrump’s tweets, which you can search and download. — Data is Plural: June 28, 2017
Links:
The Internal Revenue Service publishes a file listing all “organizations eligible to receive tax-deductible charitable contributions” — currently more than 1 million charities, private foundations, and other groups. (Not all nonprofits apply for, or receive, tax-exempt status from the IRS; but all tax-exempt organizations are nonprofits.) Previously: Annual IRS 990 filings, in bulk (DIP 2016.06.22). [h/t Norbert Krupa + Derek Willis] — Data is Plural: June 28, 2017
Links:
Tags: taxes
The National Association of Realtors publishes monthly real estate inventory data “at the national level, the 500 largest metropolitan areas, the 1,000 largest counties, and over 15,000 zip codes.” The data, based on the realtors’ multiple listing services, goes back five years and “tracks key market metrics including list prices, days on market, and total active inventory.” As of early June, six counties — Manhattan, plus five in California — had median listing prices above $1 million. Previously: The Census Bureau’s Annual Characteristics of New Housing (DIP 2016.06.22), international house prices (DIP 2017.02.08), millions of mortgages (DIP 2015.12.30), and millions more mortgages (DIP 2017.03.15). [h/t Reddit user bbekks] — Data is Plural: June 28, 2017
Links:
Tags: real estate
OpenSNP is a website that lets people publish the results of their genetic tests (such as those sold by 23andMe, deCODEme, FamilyTreeDNA), “find others with similar genetic variations, [get] the latest primary literature on their variations, and help scientists find new associations.” Since 2012, users have uploaded more than 3,000 sets of genetic variants, which you can download individually or in bulk or access via OpenSNP’s API. Users can also list various personal traits, such as eye color, height, coffee consumption, and lactose intolerance. Useful primer: SNP stands for “single nucleotide polymorphism,” the NIH explains. They’re “the most common type of genetic variation”; each one “represents a difference in a single DNA building block, called a nucleotide.” — Data is Plural: June 28, 2017
Links:
The European Centre for Disease Prevention and Control’s Surveillance Atlas of Infectious Diseases lets you browse, map, and download data on the historical incidence of several dozen diseases — from anthrax to Zika — in each of the European Economic Area’s countries. Related: Keila Guimarães’s recent investigation into penicillin shortages, which uses the Centre’s data on syphilis cases. — Data is Plural: June 28, 2017
Links:
Tags: disease
The Florida Department of Corrections’ public database contains a table describing current and released inmates’ tattoos. That data includes each tattoo’s location (e.g., “right arm,” “stomach,” “face”) and description (“cross,” “tribal,” and “skull” being the most common). Helpful: Dan Nguyen’s guide to converting the database into SQLite and CSV files. Related: Recent analyses by The Economist and by The Palm Beach Post. — Data is Plural: June 21, 2017
Links:
The Manifesto Project has collected and coded more than 4,000 electoral manifestoes from more than 1,000 political parties in more than 50 countries between 1945 and 2015. For each manifesto, the project’s dataset indicates whether the document expresses support for/against dozens of policies and attitudes, including “market regulation,” a “national way of life”, “environmental protection,” and “anti-imperialism.” You can also browse the manifestoes online. Caveat: The dataset is subject to a somewhat restrictive usage policy. [h/t The Quartz Directory of Essential Data] — Data is Plural: June 21, 2017
Links:
Libraries.io monitors “over 2.4m unique open source projects, 25m repositories and 85m interdependencies between them.” Last week, the site released its first bulk dataset, which describes each project’s metadata, published versions, and dependencies on other software libraries. [h/t Nadia Eghbal] — Data is Plural: June 21, 2017
Links:
Tags: technology
“Created by USAID in 1985 to help decision-makers plan for humanitarian crises,” the Famine Early Warning Systems Network (FEWS NET) “provides evidence-based analysis on some 34 countries.” As part of its work, FEWS NET publishes geospatial shapefiles that score each country’s “most likely food security outcome” on a standardized scale: Minimal, Stressed, Crisis, Emergency, and Famine. Previously: Global food prices (DIP 2017.05.17). [h/t Melissa Segura] — Data is Plural: June 21, 2017
Links:
Tags: disaster
“Police pull over more than 50,000 drivers on a typical day, more than 20 million motorists every year. Yet the most common police interaction — the traffic stop — has not been tracked, at least not in any systematic way,” according to the Stanford Open Policing Project. To that end, the group has been collecting and standardizing traffic-stop data from state police agencies across America. Its first data release, published Monday, contains 130 million records from 31 states. The records vary by agency, but the most-complete states include the date, time, location, reason, and outcome of each stop; the driver’s race, gender, and age; whether a search was conducted; and whether the search found contraband. Related: The project’s findings so far. Previously: Raw traffic stop data from a smaller number of states (DIP 2015.10.28). — Data is Plural: June 21, 2017
Links:
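For readers who want to work with stop-level records like these, here is a minimal Python sketch of two common summary measures: the share of stops that led to a search, and the share of searches that found contraband. The field names (`driver_race`, `search_conducted`, `contraband_found`) are assumptions made for illustration; check the project's own documentation for the actual column names and coding.

```python
from collections import defaultdict


def search_and_hit_rates(stops, group_field="driver_race"):
    """Summarize search rates and contraband hit rates per group.

    `stops` is an iterable of dicts. The keys used here are hypothetical
    and may not match the real dataset's column names.
    """
    counts = defaultdict(lambda: {"stops": 0, "searches": 0, "hits": 0})
    for stop in stops:
        c = counts[stop[group_field]]
        c["stops"] += 1
        if stop["search_conducted"]:
            c["searches"] += 1
            if stop["contraband_found"]:
                c["hits"] += 1

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            # Share of stops in this group that led to a search.
            "search_rate": c["searches"] / c["stops"],
            # Share of searches that found contraband (None if no searches).
            "hit_rate": c["hits"] / c["searches"] if c["searches"] else None,
        }
    return rates
```

Because the records vary by agency, a real analysis would also need to handle missing values before computing either rate.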
The Los Angeles City Controller has released a map of the city’s openly-operating medical marijuana businesses. You can access a spreadsheet of the 191 dispensaries that comply with Proposition D, which the city passed in 2013. Additionally, you can find hundreds of (active and inactive) dispensaries by filtering the city’s business registrations to those whose primary NAICS category is listed as “medical marijuana collective.” [h/t Zack Quaintance] — Data is Plural: June 14, 2017
Links:
Tags: drugs, healthcare
ResistoMap is an interactive visualization of antibiotic drug resistance, based on more than 1,500 bacteria genome samples from people’s intestinal tracts. The data behind the visualization is available to download. It’s partly based on two prior datasets: McMaster University’s Comprehensive Antibiotic Resistance Database (“a bioinformatic database of resistance genes, their products and associated phenotypes”) and the University of Gothenburg’s BacMet (“an easy-to-use bioinformatics resource of antibacterial biocide- and metal-resistance genes”). [h/t Carlos Somohano] — Data is Plural: June 14, 2017
Links:
Tags: disease, healthcare
The Census Bureau’s Survey of Business Owners and Self-Employed Persons “provides the only comprehensive, regularly collected source of information on selected economic and demographic characteristics for businesses and business owners by gender, ethnicity, race, and veteran status.” The most recent data comes from 2012. The survey has been conducted every five years since 1972, but data from before 1992 is “available only in printed form.” Related: “30% Of The Black-Owned Businesses In New York Disappeared In 5 Years,” by my colleague Cora Lewis. — Data is Plural: June 14, 2017
Links:
Tags: business
Last week, the University of Virginia School of Law launched an expanded version of its Corporate Prosecution Registry. The revamped database includes “detailed information about every federal organizational prosecution since 2001, as well as deferred and non-prosecution agreements with organizations since 1990” — more than 3,000 cases so far. Previously: Good Jobs First’s Violation Tracker (DIP 2015.11.11). [h/t Tom Jackman] — Data is Plural: June 14, 2017
Links:
Oyez.org bills itself as, among other things, “a complete and authoritative source for all of the [Supreme] Court’s audio since the installation of a recording system in October 1955.” The site has an API and releases all its material — including timestamped transcripts of oral arguments — under a Creative Commons license. At least two GitHub repositories have aggregated the transcripts and make them easy to bulk-download. For each segment of audio, the transcripts list the start/end time, the speaker, and the text. Related: PuppyJusticeAutomated, a YouTube channel that (a) must be seen to be understood and (b) uses the Oyez API. Previously: CourtListener (DIP 2016.04.13) and The Supreme Court Database (DIP 2016.02.23). [h/t Walker Boyle + Reddit user 21cannons] — Data is Plural: June 14, 2017
Links:
The San Francisco Public Utilities Commission’s Beach Water Quality Monitoring Program measures bacteria levels at fifteen locations on the city’s shoreline. You can download the measurements by clicking the “raw data” link below this map. The data powers the (unsurprisingly) unofficial @BeachPooBot account on Twitter. [h/t Reddit user cavedave] — Data is Plural: June 7, 2017
Links:
Tags: disease, environment
Researchers at Google took a semi-random sample of 9,473 Reddit threads, containing 116,347 comments in total. Then, they paid people to categorize each comment by its “discourse act” — e.g., whether it was a question, answer, announcement, agreement, humor, et cetera. The result is Coarse Discourse, “a dataset for understanding online discussions.” [h/t Roberto Bayardo] — Data is Plural: June 7, 2017
Links:
In January 2015, the Occupational Safety and Health Administration began requiring U.S. employers to report “all severe work-related injuries, defined as an amputation, in-patient hospitalization, or loss of an eye.” You can download a spreadsheet of these injuries — some 20,000 in 2015 and 2016 combined. It contains the injury dates, descriptions, and outcomes, as well as the employers’ names and locations. Previously: OSHA’s more detailed (but slightly more cumbersome) inspection data and API (DIP 2016.07.13). [Clarification, 2017-06-07/2017-06-14: The dataset reflects “federal OSHA states only.” It excludes “injuries in state plans,” which cover private sector employees in 21 states.] — Data is Plural: June 7, 2017
Links:
Tags: injury
Before Donald Trump began flying on Air Force One, he rode a fleet of private aircraft. Reporters at Bloomberg used the Freedom of Information Act to obtain flight records for three major components of that fleet — a “Boeing 757 with gold-plated seatbelt buckles, known as Trump Force One during the campaign; a Cessna 750 Citation X jet; and a Sikorsky helicopter”. For each of the more than 1,500 flights taken between August 2010 and November 2016, the dataset contains the date, time, and airport of both the departure and arrival. Trump wasn’t necessarily aboard each of those flights; the dataset does not contain passenger information. Related: Bloomberg’s analysis/maps of the data. Also related: The Washington Post used the data to estimate the flights’ CO2 emissions. — Data is Plural: June 7, 2017
Links:
Tags: Trump, transportation
ORCID is a nonprofit organization that provides unique identifiers for researchers — mostly scientists so far — to make it easier to distinguish between them. It has issued more than 3 million IDs so far, and provides annual bulk downloads of all researchers’ public profiles. In many cases, the researchers have supplied their education and employment histories. That enabled Science magazine to analyze the migrations of more than 110,000 researchers who’ve listed multiple countries in these public CVs. (The data and code underlying the analysis are also available to download.) [h/t Shaun Coffey] — Data is Plural: June 7, 2017
Links:
Tags: science
You might have seen New York City’s bubble map of dog names. It turns out that the underlying dataset — which includes the name, gender, age as of 2015, breed, and borough of more than 110,000 dogs — is available on GitHub. You can also download slightly older, but more detailed data from WNYC’s Dogs of NYC project. That data includes each dog’s coat colors, whether it had been spayed/neutered, and its ZIP code. Related: Similar pet license data from Tacoma, Wash., and Edmonton, Canada. [h/t Alex P. Miller + Dan Nguyen] — Data is Plural: May 31, 2017
Links:
Tags: animals
Aswath Damodaran — a professor of finance at NYU’s business school — maintains a trove of data on per-sector financials, including effective tax rates, return on equity, and working capital ratios by industry. For most datasets, Damodaran publishes both current and historical versions. [h/t Tim McGovern] — Data is Plural: May 31, 2017
Links:
A team of researchers at the Boston University School of Public Health has collected data on the presence/absence of 133 different types of firearm laws in each U.S. state, for each year between 1991 and 2016. The legal provisions are grouped into 14 categories, such as background checks, “Stand Your Ground” laws, and child access prevention. You can download a spreadsheet of the data, and also browse state-by-state summaries. Previously: The Correlates of State Policy Project (DIP 2016.07.06). — Data is Plural: May 31, 2017
Links:
Tags: guns
U.S. lobbyists must notify Congress within 45 days of being retained by new clients. Every quarter after that, they’re required to file activity reports that detail the agencies they lobbied, the topics they covered, and the income they earned. Bulk downloads of both types of reports are available as XML files from the House (going back to 2004) and from the Senate (since 1999). Although they receive the same filings, each chamber “follows different data-cleaning, processing, and editing procedures before storing the data,” according to this recent GAO report. — Data is Plural: May 31, 2017
Links:
Tags: government
Last week at BuzzFeed News, we shared a vast trove of federal payroll data. Those records — provided by the Office of Personnel Management through the Freedom of Information Act — cover more than 40 years and millions of employees. The dataset includes salaries, titles, job types, and demographic variables. In many-but-not-all cases (per OPM’s data release policies), it also includes names. Previously, federal payroll data had been searchable online, but very little was available in downloadable, analysis-friendly formats. Also: Many states – including New York, California, Florida, New Jersey, Minnesota, Arkansas, South Carolina, and Washington – proactively make payroll data available for download. (Some cities, such as Chicago, do, too.) — Data is Plural: May 31, 2017
Links:
Tags: government, money
The website CraftCans.com publishes a database of 2,000+ canned beers. For each beer, the database lists its name, style, brewery, size, alcohol level, and bitterness. The website doesn’t provide a direct download, but — as Jean-Nicholas Hould points out — you can basically just copy-paste the website’s data into your favorite spreadsheet program. Or, if you want something slightly cleaner, you can use this script. Related: This data-profiling tutorial by Hould, which uses the data. Also related: RateBeer.com’s API, but you’ll need to request a developer key to use it. Plus: This interactive graphic, which uses the RateBeer data to explore America’s microbrew epicenters. And also: Official brewery production stats from the U.S. Alcohol and Tobacco Tax and Trade Bureau. [h/t Daniel Brady] — Data is Plural: May 24, 2017
Links:
Tags: alcohol
Google is clever: It created a drawing game, got 15 million people to play it, and then turned those doodles into a public dataset of people drawing. You can download the raw data, or just browse the doodles online. — Data is Plural: May 24, 2017
Links:
Tags: art, technology
When the malware program known as “WannaCry” hit hundreds of thousands of computers earlier this month, it demanded that the computers’ owners pay $300 in Bitcoin — or lose all of their data. Keith Collins at Quartz has been using Blockchain’s API to track Bitcoin payments to the three digital wallets that the hackers designated to receive the ransoms. He’s published the data and is also using it to power a Twitter bot. Related: “Victims of the WannaCry ransomware attacks have stopped paying up” and “Inside the digital heist that terrorized the world—and only made $100k,” both by Collins. Previously: Historical Bitcoin prices (DIP 2017.03.08). — Data is Plural: May 24, 2017
Links:
Tags: crime, technology
The Profiles of Individual Radicalization in the United States (PIRUS) database “contains deidentified individual-level information on the backgrounds, attributes, and radicalization processes of nearly 1,500 violent and non-violent extremists who adhere to far right, far left, Islamist, or single issue ideologies in the United States” — including the Ku Klux Klan, the Taliban, and the Animal Liberation Front, among others. The dataset covers 1948 through 2013 and was released earlier this year by a team at the University of Maryland. [h/t Lorand Bodo] — Data is Plural: May 24, 2017
Links:
Last week, the Library of Congress released its largest dataset ever: nearly 25 million records for books, maps, manuscripts and other items in its online catalog. For each item, the data includes standardized bibliographic information, such as the title, author, publication date, and genre. (The dataset represents the online catalog as it was in 2013; more recent data will cost you.) Related: A bit of background about the library’s MARC (Machine Readable Cataloging Records) data format. — Data is Plural: May 24, 2017
Links:
Tags: books, technology
“The WikiPlots corpus is a collection of 112,936 story plots extracted from English language Wikipedia.” The plots describe movies, books, plays, TV series, TV episodes, video games, and other stories — essentially, anything that has a Wikipedia article with the word “plot” in one of its subheadings. Related: “Examining the arc of 100,000 stories: a tidy analysis” and “Gender and verbs across 100,000 stories: a tidy analysis,” two blog posts by David Robinson that use the data. — Data is Plural: May 17, 2017
Links:
The Chicago Sun-Times has obtained and published an August 2016 copy of the Chicago Police Department’s “Strategic Subject List,” a database that scores nearly 400,000 (unnamed) people on a scale from 10 to 500, based on an algorithm that attempts to estimate their risk of being involved in gun violence (either as a shooter or a victim). The database includes demographic, geographic, criminal history, and other information about the people it ranks. “But the database doesn’t indicate — and the police won’t say — how much weight is given to each factor in computing the scores, which are produced using an algorithm developed at the Illinois Institute of Technology,” according to the Sun-Times. — Data is Plural: May 17, 2017
Links:
How might rising sea levels affect coastal flooding? A new-ish NOAA Technical Report, published in January, combines historical data on global sea levels with “regional factors contributing to sea level change for the entire U.S. coastline.” The result: Localized projections under six sea-level rise scenarios, ranging from “low” to “extreme.” You can download the data (at the bottom of this page) or explore it on a map. Related: Climate Central describes what NOAA’s “extreme” scenario could mean for America (including more maps and calculations). Previously: Tide gauge data (DIP 2016.03.23) and sea ice measurements (DIP 2016.09.14). [h/t Susie Cambria] — Data is Plural: May 17, 2017
Links:
The UN World Food Programme’s vulnerability analysis group collects and publishes food price data for more than 1,000 towns and cities in more than 70 countries. The dataset, which goes back more than a decade, covers basic staples, such as wheat, rice, milk, oil, and more. It’s updated monthly and feeds into (among other things) the UNWFP’s price-spike indicators. Related: The Humanitarian Data Exchange, which hosts the dataset for the UN. Also: The Economist’s Big Mac Index. [h/t Andrew McCartney] — Data is Plural: May 17, 2017
Links:
The James Martin Center for Nonproliferation Studies publishes what it calls “the first database to record flight tests of all missiles launched by North Korea capable of delivering a payload of at least 500 kilograms a distance of at least 300 kilometers.” The database currently contains 107 missile tests — starting with North Korea’s first, launched in April 1984, to its latest, launched Sunday morning. For each test, the data includes the missile’s launch site, highest altitude, distance travelled, landing location, success/failure, and other details. [h/t Ian Greenleigh] — Data is Plural: May 17, 2017
Links:
Grad students in Princeton’s computer science department have published a dataset they call Self-Annotated Reddit Corpus, or “SARC” for short. “The corpus has 1.3 million sarcastic statements — 10 times more than any previous dataset,” the authors write, and takes advantage of Reddit users’ habit of tagging sarcastic comments with an “/s”. Related: A dataset of sarcastic Amazon reviews. [h/t Carlos Somohano + Reddit user cavedave] — Data is Plural: May 10, 2017
Links:
Tags: language, social media
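The "/s" convention described above is simple enough to sketch in a few lines of Python. The function below is only an illustration of the idea, not the SARC authors' actual pipeline: it assumes the marker is a standalone "/s" at the very end of a comment, and the function name is made up for this example.

```python
import re

# Assumption: sarcasm is marked by a standalone "/s" at the end of a comment,
# optionally followed by whitespace. The real corpus may apply stricter rules.
SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$")


def split_sarcasm_tag(comment):
    """Return (comment text without the trailing tag, whether a tag was found)."""
    match = SARCASM_TAG.search(comment)
    if match is None:
        return comment.strip(), False
    return comment[: match.start()].strip(), True
```

For example, `split_sarcasm_tag("Great idea. /s")` separates the text "Great idea." from its tag, while a comment that merely ends in a word like "this/that" is left untagged.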
The National Science Foundation’s Survey of Doctorate Recipients “is a longitudinal biennial survey conducted since 1973 that provides demographic and career history information about individuals with a research doctoral degree in a science, engineering, or health (SEH) field from a U.S. academic institution.” You can download aggregated data and detailed survey responses going back to 1993. The next release is scheduled for this month. Related: The NSF has published an interactive graphic of the data. [h/t Peter Aldhous] — Data is Plural: May 10, 2017
Links:
Groceries-on-demand startup Instacart has released a dataset containing 3 million orders from 200,000 (anonymized) users. “For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order,” the company’s head of data science writes. “We also provide the week and hour of day the order was placed, and a relative measure of time between orders.” Here’s the data dictionary. — Data is Plural: May 10, 2017
Links:
Tags: food
Last month, ProPublica and Consumer Reports published an analysis of car insurance costs in four states, finding that “some major insurers charge minority neighborhoods as much as 30 percent more than other areas with similar accident costs.” The reporters also published a detailed methodology and dataset supporting their findings. The dataset contains company-by-company insurance premiums for a (hypothetical) college-educated, excellent-credit, accident-free 30-year-old woman in each of 6,261 ZIP codes in the four states — California, Texas, Missouri, and Illinois. The dataset also includes several years of average (per-car) insurance payouts for each ZIP code, which the reporters obtained from state insurance commissioners. Related: The insurance industry's rebuttal and ProPublica's counter-rebuttal. — Data is Plural: May 10, 2017
Links:
There are about 700 miles of official fencing between the U.S. and Mexico, covering about one-third of the full border. The Department of Homeland Security doesn’t provide structured spatial data about the fence’s path. But, thanks to a Texas law professor’s FOIA and some serious elbow grease, reporters at Reveal have created “the most detailed border fence map publicly available.” For each segment of fence, Reveal’s dataset includes the fence type (i.e., pedestrian, vehicle, or unknown), the government’s name for the segment, and the project through which the segment was built. — Data is Plural: May 10, 2017
Links:
Tags: immigration, mapping
For April Fools, Reddit launched a million-pixel canvas called “r/place.” Users could place a single-pixel tile, in one of 16 colors, anywhere on the canvas — but only every five minutes. By the end of r/place’s 72-hour lifetime, Redditors had placed 16.5 million tiles on the canvas, likely making it “the largest collaborative art project in history.” Last week, Reddit published the entire history of the canvas as structured data. [h/t Felipe Hoffa] — Data is Plural: April 26, 2017
Links:
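A history of tile placements like this can be replayed to rebuild the canvas at any moment: apply the placements in time order and let the most recent tile at each position win. The Python sketch below shows the idea under an assumed record layout of (timestamp, x, y, color index); the published files may use different fields or ordering.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (timestamp, x, y, color_index).
Placement = Tuple[int, int, int, int]


def replay(
    placements: Iterable[Placement], width: int, height: int, blank: int = 0
) -> List[List[int]]:
    """Rebuild a canvas by applying placements in timestamp order."""
    canvas = [[blank] * width for _ in range(height)]
    for _, x, y, color in sorted(placements, key=lambda p: p[0]):
        # Later placements overwrite earlier ones at the same position.
        canvas[y][x] = color
    return canvas
```

Filtering the placements to those before a given timestamp, then replaying them, would give a snapshot of the canvas at that point in time.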
The CDC has been running its National Survey of Family Growth since 1973. For the first three decades, it surveyed only women ages 15-44. Starting in 2002, it began also surveying men. The latest survey was conducted in 2013-15, when it collected data from 10,205 residents about sexual activity and contraception, pregnancy and infertility, marriage and divorce, adoption, parenting, and more. [h/t Allen B. Downey] — Data is Plural: April 26, 2017
Links:
Tags: family, healthcare, women
For each of India’s 36 states and Union Territories, the country’s latest National Family Health Survey includes 114 metrics, such as the percentages of “households using iodized salt” and “men who have comprehensive knowledge of HIV/AIDS.” Unfortunately, the government publishes the reports only as PDFs. But the Hindustan Times has extracted the data for the survey’s eight “women’s empowerment and gender based violence” metrics, including the percentages of “ever-married women who have ever experienced spousal violence” and “women having a bank or savings account that they themselves use.” They’ve published that data as a spreadsheet and used it to construct an interactive Women Empowerment Index. [h/t Gurman Bhatia] — Data is Plural: April 26, 2017
Links:
You’re probably familiar with the Google Books Ngram Viewer, which lets you chart word and phrase frequencies over time. Google publishes the underlying data but those files can (depending on your tools and goals) be cumbersomely large. Here’s an alternative: DIP reader (and former colleague) Chris Wilson has condensed the overall frequencies for 87,000 words — those found in the CMU Pronouncing Dictionary — into a svelte, four-megabyte file. Related: BYU’s advanced interface to the Google Books data. Also related: “The Pitfalls of Using Google Ngram to Study Language” (Wired, 2015). And also: “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them” (The Atlantic, 2017). — Data is Plural: April 26, 2017
Links:
Tags: language
The U.S. National Park Service publishes a ton of data about visitors to its parks, historic sites, memorials, preserves, and more. Among them: Visitors per park (annually since 1904, and monthly since 1979), overnight stays by type of lodging (tents, RVs, backcountry, etc.), and traffic. Related: “The National Parks Have Never Been More Popular” (FiveThirtyEight, 2016). [h/t Jack King] — Data is Plural: April 26, 2017
Links:
Tags: environment, statistics
For a 2012 academic paper, researchers captured the keystrokes of paid volunteers as they typed descriptions of images. Whenever a participant used the backspace key to correct a word, the researchers added it to a dataset of self-corrections. Each of the 44,000 lines in the English-language version of the dataset contains the original mistake and the correction. The most common change was in → on. Other common fixes included waling → walking and pople → people. [h/t Seth Stephens-Davidowitz] — Data is Plural: April 19, 2017
Links:
Tags: language
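A tally like "the most common change was in → on" can be reproduced by counting mistake/correction pairs. The Python sketch below assumes each line holds the mistake and the correction separated by a tab; that layout is a guess for illustration, so adjust the separator to match the actual files.

```python
from collections import Counter


def most_common_fixes(lines, sep="\t", top=3):
    """Count (mistake, correction) pairs and return the most frequent ones.

    Assumes one pair per line, split on `sep`. This is a hypothetical
    layout, not the dataset's documented format.
    """
    pairs = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if sep not in line:
            continue  # Skip lines that do not look like a pair.
        mistake, correction = line.split(sep, 1)
        pairs[(mistake, correction)] += 1
    return pairs.most_common(top)
```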
The USDA’s Plant Hardiness Zone Map “is the standard by which gardeners and growers can determine which plants are most likely to thrive at a location.” The USDA and Oregon State, which have jointly developed the map, previously sold access to the underlying data through a vendor. But after the vendor shut down earlier this year, OSU began publishing the data free of charge (though with some licensing restrictions). The dataset is available as detailed shapefiles and as ZIP code–based spreadsheets. [h/t Waldo Jaquith + Lynn Cherny] — Data is Plural: April 19, 2017
Links:
Tags: agriculture, plants
Through its International Best Track Archive for Climate Stewardship project, the National Oceanic and Atmospheric Administration publishes what it calls “the most complete global set of historical tropical cyclones available.” For each tropical cyclone — a category that includes typhoons, hurricanes, tropical depressions, and more — the dataset includes its position, wind speed, central pressure, and classification at six-hour intervals. The dataset is updated annually and includes some historical cyclones from as early as 1842. [h/t Daniel Miller] — Data is Plural: April 19, 2017
Links:
The CDC’s National Center for Immunization and Respiratory Diseases collects and publishes state-by-state vaccination rates for infants, kindergartners, teens, and adults — plus, flu vaccination rates for several age groups. Each dataset includes several years’ worth of data, with many going back to 2008 or 2009. Related: “California Shows The Rest Of The Country How To Boost Kindergarten Vaccination Rates,” by my colleague Peter Aldhous, with additional county-level data from the Golden State. Previously: International vaccination rates and policies (DIP 2016.08.03). — Data is Plural: April 19, 2017
Links:
Tags: healthcare
The UK’s National Health Service publishes monthly data on drugs prescribed in England through the country’s single-payer health care system. (Drugs prescribed in Scotland, Wales, or Northern Ireland aren’t included.) For each prescriber-and-drug combination, the dataset includes the quantity and cost of prescriptions for each month since August 2010. The US publishes similar data about prescriptions issued through Medicare, but only on an annual basis and currently only covering 2013 and 2014. Related: ProPublica’s Prescriber Checkup, which uses the Medicare data to examine doctors’ prescribing patterns. Previously: A decade-plus of Australian prescription data (DIP 2016.08.24). [h/t Adam Crahen] — Data is Plural: April 19, 2017
Links:
Tags: drugs, healthcare
Comic books make use of white space — or gutters — to propel the story forward, relying on readers’ intuitive ability to fill in the gaps between panels. To see whether computers could learn to make the same inferences, a group of computer scientists built a giant corpus of public-domain comics and tried training a series of neural networks on it. (Spoiler: Humans are much better at this.) The underlying dataset contains 1.2 million panels from nearly 200,000 scanned pages of nearly 4,000 books in the Digital Comic Museum, all published during the 1938–1954 “Golden Age” of American comics. It also contains 2.5 million chunks of text extracted from the comics’ speech balloons, thought bubbles, and narration boxes. [h/t Robin Sloan] — Data is Plural: April 12, 2017
Links:
Researchers at the World Health Organization have assembled a dataset of international aid — both from official government assistance and private grants — devoted to reproductive, maternal, newborn, and child health from 2003 to 2013. The dataset, which the researchers described in a recent academic article, draws on 2.1 million records, and is based largely on the OECD’s Creditor Reporting System. Related: Earlier this month, the U.S. State Department cut all its funding for the UN's family planning agency; it was the agency’s third-largest donor. — Data is Plural: April 12, 2017
Links:
Sci-Hub, which describes itself as “the first pirate website in the world to provide mass and public access to tens of millions of research papers,” recently released a list of the 62,835,101 academic papers it has collected. That dataset identifies each paper only by its DOI — a short, unique ID. Helpfully, graduate student Bastian Greshake has extracted the journal name, publisher, and publication year from those DOIs. Greshake has also combined that data with six months of Sci-Hub download data (previously featured in DIP 2016.05.04), and analyzed the datasets together. Among his findings: Both are “largely made up of recently published articles, with users disproportionately favoring newer articles and 35% of downloaded articles being published after 2013.” — Data is Plural: April 12, 2017
Links:
Tags: education
The Environmental Protection Agency publishes fuel efficiency data on all the car models it has tested, going back to the 1980s… minus all the Volkswagen, Audi, and Porsche diesels caught cheating. The data typically includes three estimates: for city driving, highway driving, and a city-highway combination. — Data is Plural: April 12, 2017
Links:
Every four years, Congress publishes United States Government Policy and Supporting Positions, better known as the Plum Book. The 2016 version, which is available as both PDF and Excel files, identifies more than 8,000 executive and legislative branch jobs subject to “noncompetitive appointment.” Those positions include 1,710 presidential appointments, which are as wide-ranging as the ambassadorship to Afghanistan and the directorship of the Occupational Safety and Health Administration’s Whistleblower Protection Program. Related: For positions requiring its confirmation, the Senate publishes XML files of pending, confirmed, and withdrawn nominees. — Data is Plural: April 12, 2017
Links:
Tags: government, politics
In a peer-reviewed paper published last week, a trio of University College London researchers describe their Global Avian Invasions Atlas. The dataset includes information on “971 species, introduced to 230 countries and administrative areas across all eight biogeographical realms, spanning the period 6000 BCE – AD 2014.” — Data is Plural: April 5, 2017
Links:
Tags: animals
The National Science Foundation publishes data on all of the grants the agency has awarded since the 1970s (and some earlier ones, too). Each grant is represented as an XML file, which contains information about the project, the awardee, and the NSF division that awarded the grant. [h/t France A. Córdova] — Data is Plural: April 5, 2017
Links:
Tags: science
Bruegel, “a European think tank that specialises in economics,” publishes a quarterly-updated dataset quantifying sovereign bond holdings for 12 countries: Belgium, Finland, France, Germany, Greece, Ireland, Italy, Netherlands, Portugal, Spain, the U.K., and the United States. For each country, the dataset tells you what proportion of the federal government’s bonds are held by each of five types of owners: the country’s central bank, other public institutions, domestic banks, other domestic investors, and foreign investors. [h/t @CoolDatasets] — Data is Plural: April 5, 2017
Links:
Tags: economics, government, money
Yasuyuki Aono, an associate professor at Osaka Prefecture University, has collected the historical flowering dates of Kyoto’s Prunus jamasakura cherry trees going all the way back to the 9th century. The dataset is based on “many diaries and chronicles written by Emperors, aristocrats, [governors] and monks,” Aono writes. The dates are those “on which cherry blossom viewing parties had been held or full flowerings had been observed.” Over the past century, Kyoto’s cherry trees have been blooming earlier and earlier. Related: @bbgblossoms, a Twitter bot that tracks the status of the Brooklyn Botanic Garden’s 152 cherry trees. [h/t Eric Steig] — Data is Plural: April 5, 2017
Links:
Tags: plants
In 2014, the NYC Department of Information Technology & Telecommunications conducted a massive aerial survey of the city. Then, they converted the images and data they collected into a three-dimensional model of every building in all five boroughs. Related: In December, The New York Times used the data to map the city’s shadows. Also related: Berlin, the Hague, and Lyon offer digital 3D models of their cities, too. Previously: LiDAR-powered elevation data from around the world (May 25, 2016). [h/t Dan Nguyen] — Data is Plural: April 5, 2017
Links:
Tags: mapping, technology
The anonymously-published DNS Census 2013 “is an attempt to provide a public dataset of registered domains and DNS records” — essentially the Internet’s phone book. The dataset, which has also been uploaded to the Internet Archive, includes 2.7 billion Domain Name System records and 106,928,034 distinct domains, organized by extension (e.g., .com, .info, .edu). RIP, certificationcommissionforhealthcareinformationtechnology.biz. [h/t Andrew Ferlitsch] — Data is Plural: March 29, 2017
Links:
Tags: technology
After last week’s item on Berkeley Earth’s real-time air quality data, reader Olaf Veerman pointed me to OpenAQ. The open-source project currently gathers pollution data from nearly 5,500 locations in 47 countries, aggregated “from real-time government and research grade sources.” You can download the data via OpenAQ’s API. [h/t Olaf Veerman] — Data is Plural: March 29, 2017
Links:
Tags: environment, statistics
The Federal Deposit Insurance Corporation publishes a spreadsheet of failed banks for which the agency has been appointed as a receiver — some 550 banks since October 2000. It also provides short descriptions of each bank failure. The most recent: Proficio Bank of Cottonwood Heights, Utah, which closed on March 3. More on the FDIC’s receivership program here. — Data is Plural: March 29, 2017
Links:
Late last year, the FDA began publishing a dataset of “adverse events” that have been reported to its Center for Food Safety and Applied Nutrition. The database currently covers January 2004 through December 2016, and includes reports of (suspected) bad reactions to foods, dietary supplements, and cosmetics. For instance, the first row names a particular brand of chocolate chips as the potential culprit in the hospitalization of a two-year-old girl, whose symptoms included a rash, swelling face, cough, and difficulty breathing. Previously: FDA adverse event data for pharmaceutical drugs (May 18, 2016). [h/t Sheila Hagar + Drew Ivan] — Data is Plural: March 29, 2017
Links:
Tags: food
The Stockholm International Peace Research Institute’s Military Expenditure Database is based on official reports, International Monetary Fund yearbooks, newspaper articles, and other sources. It covers most major countries since the 1950s and more than 100 countries since 1988. The dataset also quantifies military spending on a per-capita basis, as a share of the country’s GDP, and as a proportion of total government spending. Also: The Defense Manpower Data Center publishes spreadsheets detailing the number of active and reserve U.S. personnel stationed in each state, territory, and foreign country. Previously: SIPRI’s database of international arms transfers (Nov. 18, 2015). [h/t K.K. Rebecca Lai, Troy Griggs, Max Fisher and Audrey Carlsen] — Data is Plural: March 29, 2017
Links:
NOAA Fisheries’ Greater Atlantic Region publishes spreadsheets of the federal permits it awards to fishing vessels, operators, and dealers. For each vessel, the data includes the boat’s name, owner, principal port city, length, horsepower, and categories of fish permitted. The agency’s Southeast Regional Office also publishes lists of its permits — for shark dealers, domestic swordfish dealers, spiny lobster tailing, and more — but as HTML tables with no CSV-export option. [h/t J. Albert Bowden II] — Data is Plural: March 22, 2017
Links:
The Census’ Value of Construction Put in Place Survey “provides monthly estimates of the total dollar value of construction work done in the U.S.” For instance, construction spending in 2016 totaled approximately $1.1 trillion, $89 billion of which went to education-related construction. The survey has been collected monthly since 1964; historical data files are available going back to 1993. [h/t Kevin Gilmore] — Data is Plural: March 22, 2017
Links:
Tags: architecture
Five states in India, representing nearly 250 million residents — Punjab, Uttar Pradesh, Uttarakhand, Goa, and Manipur — have already held legislative assembly elections this year. India’s Election Commission publishes these results, but only as webpages. A couple of Hyderabad-based developers have scraped the website, and published CSVs of the data on GitHub. Previously: Data Is Plural’s election edition (Sept. 28, 2016). — Data is Plural: March 22, 2017
Links:
To accompany its 2016 and 2017 budget proposals, the Obama administration published machine-readable copies on GitHub. Each proposal’s data are divided into three CSV files: for budget authority, outlays, and receipts. The accompanying user guide explains the data sources and structure. Sample tidbit: The White House expected the Department of Homeland Security to pull in $712 million in excise taxes from the Oil Spill Liability Trust Fund in 2017. [h/t Dan Nguyen] — Data is Plural: March 22, 2017
Links:
Tags: government
The team at Berkeley Earth has released the data files behind their real-time global air quality map. The map and data track measurements of pollution particles smaller than 2.5 microns in diameter. “Under typical conditions,” the Berkeley Earth team writes, this particulate matter “is the most damaging form of air pollution likely to be present, contributing to heart disease, stroke, lung cancer, respiratory infections, and other diseases.” Previously: The World Health Organization’s Global Urban Ambient Air Pollution Database (June 15, 2016). — Data is Plural: March 22, 2017
Links:
To prepare for an exhibition last year, the National Archives and Records Administration created a dataset of more than 11,000 constitutional amendment proposals introduced in Congress between 1787 and 2014. [h/t Justin Lewis] — Data is Plural: March 15, 2017
Links:
The Windy City publishes two datasets on traffic violations. One tallies the daily number of speeding violations in each Children’s Safety Zone; the other, red-light violations at each camera-surveilled intersection. Both go back to July 2014. The city also publishes a spreadsheet of city-towed vehicles. Related: The Chicago Tribune’s long-running investigation into the city’s traffic camera troubles. [h/t Jacob Sheff] — Data is Plural: March 15, 2017
Links:
Tags: crime, transportation
Freddie Mac — the government-sponsored, publicly traded company also known as the Federal Home Loan Mortgage Corporation — publishes data on 23 million single-family home mortgages it has originated or guaranteed since 1999. The dataset includes the loan amount and interest rate, the borrower’s credit score, the property type (e.g., condo, co-op, manufactured housing), metro area, first payment month, whether the borrower is a first-time homebuyer, and lots more. Freddie Mac requests that you register before downloading the data, but you can also access the files directly. Don’t miss the terms and conditions, which prohibit republishing the files. Previously: Data on millions more loans from the Home Mortgage Disclosure Act (Dec. 30, 2015). — Data is Plural: March 15, 2017
Links:
Tags: real estate
Last week, a research team at Google published AudioSet, a dataset of “2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.” The clips have been classified into hundreds of categories, including “plucked string instrument,” “computer keyboard,” “chuckle, chortle,” “snoring,” and “fowl.” [h/t Suman Deb Roy] — Data is Plural: March 15, 2017
Links:
Donald Trump’s new travel ban is scheduled to take effect at 12:01am Eastern tonight. The State Department doesn’t publish realtime visa data, but it does publish historical data, including the number of non-immigrant visas issued each fiscal year between 1997 and 2016, by nationality and visa type. (For example, the government issued 226 “fiancé(e)” K-1 visas to Syrian nationals in fiscal year 2016.) The agency also reports how many visas of each type it refused each year, as well as refusal rates by nationality. [h/t Thomas Kasang] [Update, 2017-12-12: The State Department link appears no longer to be working; here's a copy from the Wayback Machine: http://web.archive.org/web/20171201161048/https://travel.state.gov/content/visas/en/law-and-policy/statistics/non-immigrant-visas.html ] — Data is Plural: March 15, 2017
Links:
Tags: Trump, immigration
A trio of European researchers has published a dataset containing 101,000 photos of food — 1,000 images each from 101 food categories, all downloaded from foodspotting.com. The categories include apple pie, escargots, onion rings, paella, bibimbap, prime rib, and more. [h/t Reddit user cavedave] — Data is Plural: March 8, 2017
Links:
Tags: food
Researcher Amber Thomas has parsed the transcripts of last year’s 10 highest-grossing films. The resulting data files indicate each character’s number of turns speaking, number of words spoken, and gender. Previously: Dialogue from 2,000 movies, by gender (April 13, 2016). — Data is Plural: March 8, 2017
Links:
Tags: entertainment, movies
The FDA’s “Orange Book” lists approved drugs, their associated patents, and government-granted exclusivity rights. The Orange Book is available as a 1,400-page PDF, but you can also download the key data as structured text files. The files are updated monthly. Related: “Drugs For Rare Diseases Have Become Uncommonly Rich Monopolies,” published by Kaiser Health News and NPR in January. Question for readers: The Orange Book data comes as tilde-delimited files, the first I’ve ever seen. Do you have ~any other examples~? [h/t Sydney Lupkin] — Data is Plural: March 8, 2017
Links:
Tags: business, drugs, healthcare
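For readers who have not met a tilde-delimited file like the Orange Book's before, here is a minimal sketch of reading one with Python's standard `csv` module. The column names and rows in `SAMPLE` are invented for illustration and are not the Orange Book's actual fields; only the `~` delimiter comes from the description above.

```python
import csv
import io

# Hypothetical sample in a tilde-delimited layout. The column names and
# values are made up for illustration; they are NOT the real Orange Book fields.
SAMPLE = (
    "Ingredient~Trade_Name~Applicant\n"
    "EXAMPLE INGREDIENT~EXAMPLE BRAND~EXAMPLE CO\n"
    "ANOTHER INGREDIENT~ANOTHER BRAND~ANOTHER CO\n"
)


def read_tilde_delimited(text_stream):
    """Return a list of dicts, one per data row, keyed by the header row."""
    reader = csv.DictReader(text_stream, delimiter="~")
    return list(reader)


rows = read_tilde_delimited(io.StringIO(SAMPLE))
for row in rows:
    print(row["Trade_Name"], "|", row["Applicant"])
```

To read a downloaded file instead of the inline sample, pass an open file object, for example `open(path, newline="")`, in place of the `io.StringIO` wrapper.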
The Bitcoin exchange rate hit an all-time high last week, at more than $1,290 per bitcoin. That’s according to CoinDesk’s Bitcoin Price Index, an average rate derived from several major exchanges. You can download daily and hourly data for the index and its components. [h/t Jan Doggen] — Data is Plural: March 8, 2017
Links:
Tags: money, technology
From Treasury.io: “Every day at 4pm, the United States Treasury publishes data tables summarizing the cash spending, deposits, and borrowing of the federal government.” Those data tables “catalog all the money taken in that day from taxes, the programs, and how much debt the government took out.” On Monday, for instance, the government spent $481 million on the Postal Service. One hitch: The Treasury’s data tables are (subjectively) ugly and (objectively) spreadsheet-unfriendly. So Treasury.io — an open-source civic project — continuously converts the files into good ol’ tabular data. You can download individual tables as CSVs, get the whole dataset as a big SQLite database, or query the API. There’s also a data dictionary and a Twitter bot. — Data is Plural: March 8, 2017
Links:
Tags: government, taxes
Florida’s Fish and Wildlife Conservation Commission publishes data from its statewide recreational alligator hunt. For each alligator harvested between 2000 and 2015, the dataset includes the date, the hunting area, and the length of the carcass. (Legal hunting tools include crossbows, harpoons, spearguns, fishing poles, snatch hooks, and bang sticks — but not rifles, pistols, or other guns.) [h/t Christopher Groskopf + Neil Bedi + Eric Sagara] — Data is Plural: March 1, 2017
Links:
Last month, the Seattle Public Library released a dataset tracking the total number of checkouts for each title by year and month from April 2005 to December 2016 (so far). The dataset isn’t limited to physical books; it also includes e-books, magazines, CDs, DVDs, and more. Last year, the three most popular physical books were Paula Hawkins’s The Girl on the Train (2,355 checkouts), Lauren Groff’s Fates and Furies (2,151 checkouts), and Ta-Nehisi Coates’s Between the World and Me (2,134 checkouts). — Data is Plural: March 1, 2017
Links:
Tags: books
The National Highway Traffic Safety Administration provides an impressively rich API detailing every manufacturer, make, and model in its database. The API can translate cars’ Vehicle Identification Numbers into the nitty-gritty details that those VINs encode, including the plant where the vehicle was manufactured, number of doors, engine measurements, fuel type, and more. [h/t Justin Myers] — Data is Plural: March 1, 2017
Links:
Tags: transportation
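As a rough illustration of what a VIN encodes, the sketch below splits a 17-character VIN into its positional fields, assuming the standard North American layout. It does not call the NHTSA API, which returns far more detail. The `VinParts` and `split_vin` names are hypothetical, and the sample VIN is used purely for illustration.

```python
from typing import NamedTuple


class VinParts(NamedTuple):
    wmi: str              # positions 1-3: world manufacturer identifier
    descriptor: str       # positions 4-8: vehicle attributes (model, body, engine)
    check_digit: str      # position 9: check digit
    model_year_code: str  # position 10: code for the model year
    plant_code: str       # position 11: code for the assembly plant
    serial: str           # positions 12-17: sequential production number


def split_vin(vin: str) -> VinParts:
    """Split a 17-character VIN into positional fields (no decoding or validation)."""
    vin = vin.strip().upper()
    if len(vin) != 17:
        raise ValueError("expected a 17-character VIN")
    return VinParts(vin[0:3], vin[3:8], vin[8], vin[9], vin[10], vin[11:17])


# Sample VIN for illustration only.
parts = split_vin("1M8GDM9AXKP042788")
print(parts.plant_code, parts.model_year_code, parts.serial)
```

Turning the plant and model-year codes into names and years requires manufacturer-specific lookup tables, which is the part the API handles.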
In an early executive order, Donald Trump instructed the Department of Homeland Security to expand its use of Section 287(g) of the Immigration and Nationality Act, which allows the federal government to deputize local law enforcement agencies in its search for undocumented immigrants. In response to FOIA requests, DHS has previously released data on the local agencies that participate in the 287(g) program. The Marshall Project has collated the DHS data, which includes the number of immigrants deported, for 2006 to 2013 (the most recent year available). During that timespan, “more than 175,000 people nationwide were deported under the program,” Anna Flagg writes. “More than 30,000 of them came from Maricopa County, Ariz., the most from any single jurisdiction.” [h/t Tom Meagher] — Data is Plural: March 1, 2017
Links:
Tags: Trump, immigration, law
Wordbank is an “open database of children's vocabulary development.” So far, the Stanford-hosted project has gathered data from more than 71,000 standardized and anonymized vocabulary questionnaires across 23 languages. You could spend hours exploring the data online, charting how quickly children learn individual words, how quickly the same word (e.g., “grandma,” “abuela,” “ба́бушка”) is learned in different languages, and connections between words. You can download the data for each word or for each child’s vocabulary. Bonus: Wordbank has an R package and a GitHub repository. [h/t Hacker News user "Jasamba"] — Data is Plural: March 1, 2017
Links:
When researchers asked 1,354 people to name or visualize a playing card, 1 in 6 of them first chose the Ace of Spades. Here’s the data, which includes each participant’s three card choices, age, and gender. — Data is Plural: February 22, 2017
Links:
Since March 2015, the National Basketball Association has issued post-game reports reviewing referees’ calls during the final two minutes of neck-and-neck games. The NBA publishes those reports as PDFs; journalist Russell Goldenberg has been converting them to spreadsheet-friendly CSVs. Goldenberg is also analyzing and visualizing the data — updated daily — to show, for example, which players are benefitting most from incorrect and missed calls. (Answer so far: the Wizards’ Marcin Gortat and the Nets’ Brook Lopez.) — Data is Plural: February 22, 2017
Links:
Tags: sports
The National Cancer Institute has published ultraviolet radiation exposure estimates for every county in the continental United States. The estimates, based on a peer-reviewed methodology and 30 years of data from the National Solar Radiation Data Base, can also be explored using the institute’s mapping tool. Luna County, New Mexico, had the highest estimated UV exposure at 5,723 Watt-hours per square meter; Clallam County, Washington, was exposed to the least estimated UV radiation, at 3,012 Wh/m². [h/t J. Albert Bowden II] — Data is Plural: February 22, 2017
Links:
Tags: healthcare
Last week, a team of researchers released a dataset containing “60,949 Doppler velocity measurements covering 1,624 stars taken over 20 years” from the Keck Observatory in Hawaii. The authors have already used the dataset to identify more than 100 exoplanets — i.e., planets outside our solar system. Now, they’re hoping that the public and other researchers will use their data to help discover even more. Previously: The NASA Exoplanet Archive (May 11, 2016). [h/t Arthur Bashlykov] — Data is Plural: February 22, 2017
Links:
Tags: science
Earlier this month, the Department of Housing and Urban Development released its “Picture of Subsidized Households” report for 2016. The dataset describes the living conditions, demographics, and finances of families receiving subsidies via the agency’s various programs — including public housing, Section 8 vouchers, and several others. The figures are provided for the entire U.S., by state, metro area, housing agency, city, county, Census tract, and even by housing development. HUD provides a data dictionary explaining each field, as well as a tool to query the data without downloading the entire dataset. [h/t Pat Smith] — Data is Plural: February 22, 2017
Links:
Tags: aid, real estate
The NCAA publishes data on its student athletes’ academic progress and graduation rates. The numbers are aggregated by school and sport — from baseball, to women’s bowling, to mixed rifle. [h/t Albert Bowden] — Data is Plural: February 15, 2017
Links:
From the Journal of Open Psychology Data: “We present a dataset of a single (N=1) participant diagnosed with major depressive disorder, who completed 1,478 measurements over the course of 239 consecutive days in 2012 and 2013.” The “participant” happens to be one of the study’s authors — Peter C. Groot, a researcher at Maastricht University Medical Centre. Each day, he recorded the degree to which “I feel relaxed,” “I feel lonely,” “I worry,” and responses to dozens of other prompts. [h/t Sacha Epskamp] — Data is Plural: February 15, 2017
Links:
Tags: healthcare
The Clinical Trials Transformation Initiative — a public-private partnership of more than 80 organizations — upgraded its clinical trials database late last month. The relational database, called the Aggregate Analysis of ClinicalTrials.gov (AACT), contains “all information (protocol and result data elements) about every study registered” through that titular government website. The AACT data is well-documented and accessible both via download and remote database connection. ClinicalTrials.gov also publishes the underlying data itself, but as one big XML file. — Data is Plural: February 15, 2017
Links:
Tags: healthcare
Last week, the Metropolitan Museum of Art made 375,000 images free to use, remix, and share under a Creative Commons Zero license. The museum also publishes bulk metadata on more than 420,000 pieces of art; that file indicates whether a given artwork is in the public domain, and hence whether the images fall under the new license. You can also search the images here. Other museums providing open-access imagery include the National Gallery of Art, the Getty, and Amsterdam’s Rijksmuseum. Previously: Mo’ museum metadata (Nov. 4, 2015). [h/t Joshua Barone + Sarah Bond] — Data is Plural: February 15, 2017
Links:
Tags: art
The National Weather Service’s Cooperative Observer Program (COOP) is a 127-year-old network of volunteer weather observers. “More than 8,700 volunteers take observations on farms, in urban and suburban areas, National Parks, seashores, and mountaintops,” according to the NWS. Want to become a volunteer? Because the program is so old, “many areas already have the necessary stations operating,” but “about 200 observers resign each year, about 4 per state.” While you’re waiting, you can download the COOP data from Iowa State University. [h/t Bill Frischling] — Data is Plural: February 15, 2017
Links:
Tags: climate
The World Health Organization publishes life expectancy estimates for 194 countries, for each year between 2000 and 2015. Related: “One Dataset, Visualized 25 Ways.” Previously: American life expectancies by city (April 13, 2016). — Data is Plural: February 8, 2017
Links:
For their 2011 paper, “Flavor network and the principles of food pairing,” four scientists analyzed 56,498 recipes downloaded from three websites — allrecipes.com, epicurious.com, and menupan.com. To support their findings, the authors published two datasets. One names the cuisine and ingredients for each recipe. The other dataset counts how often any two ingredients appeared in the same recipe. (Parmesan cheese and beef appeared together 93 times; starfruit and Algerian geranium oil just once.) Related: “food2vec – Augmented cooking with machine intelligence,” published last month. [h/t Rob Barry] — Data is Plural: February 8, 2017
Links:
Tags: food
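The second flavor-network dataset, which counts how often two ingredients share a recipe, can be rebuilt from per-recipe ingredient lists. Here is a minimal sketch of that co-occurrence count, assuming each recipe is available as a set of ingredient names. The three recipes are invented for illustration and are not rows from the published data, and this is not necessarily how the authors computed their counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical recipes for illustration only; not taken from the published dataset.
recipes = [
    {"beef", "parmesan cheese", "tomato"},
    {"beef", "parmesan cheese", "garlic"},
    {"garlic", "tomato"},
]


def count_pairs(recipes):
    """Count how many recipes each unordered ingredient pair appears in."""
    counts = Counter()
    for ingredients in recipes:
        # Sort first so a given pair is always keyed in the same order.
        for pair in combinations(sorted(ingredients), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(recipes)
print(pair_counts[("beef", "parmesan cheese")])
```

Because each recipe is a set, an ingredient listed twice in one recipe still counts only once per pair.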
The Nobel Prizes, those prestigious Scandinavian awards, have an API. The official documentation explains it succinctly: “The data is free to use and contains information about who has been awarded the Nobel Prize, when, in what prize category and the motivation, as well as basic information about the Nobel Laureates such as birth data and the affiliation at the time of the award. The data is regularly updated as the information on Nobelprize.org is updated, including at the time of announcements of new Laureates.” Related: “These Nobel Prize Winners Show Why Immigration Is So Important For American Science,” by my colleague Peter Aldhous. Plus: The R code supporting Peter's analysis. — Data is Plural: February 8, 2017
Links:
Tags: history, miscellaneous
The International House Price Database combines and standardizes house price indices from 23 countries — mostly in Europe and North America, but also including South Africa, Australia, New Zealand, Japan, South Korea, and Israel. The dataset, published by the Federal Reserve Bank of Dallas, is deeply documented and updated quarterly. Previously: Historical San Francisco rents (May 25, 2016) and the U.S. Census Bureau’s Annual Characteristics of New Housing (June 22, 2016). — Data is Plural: February 8, 2017
Links:
Tags: real estate
Two weeks ago, Bloomberg News reporters requested entrance and exit data from Washington, DC’s Metrorail system for three days: Jan. 20, 2009 (Obama's first inauguration), Jan. 20, 2017 (Trump's inauguration), and Jan. 21, 2017 (the Women's March). A week later, they received the data — but as PDFs, which they turned into structured data and published this week. Related: NYC’s MTA publishes detailed turnstile-by-turnstile data, and Chicago publishes daily “L” ridership data for each station going back to 2001. Plus: “Second Avenue Subway Relieves Crowding on Neighboring Lines,” which uses the NYC data. — Data is Plural: February 8, 2017
Links:
The National Institute of Standards and Technology publishes Special Database 18 “for use in development and testing of automated mugshot identification systems.” The dataset contains 3,248 mugshot photos portraying 1,573 different people (mostly men), and includes each arrestee’s age and gender. [h/t Noah Veltman] — Data is Plural: January 25, 2017
Links:
Tags: crime, technology
EU-Forest is a new dataset that, according to its authors, “extends by almost one order of magnitude the publicly available information on European tree species distribution.” The new project merges and harmonizes data from 21 national forest surveys and two related databases. In all, EU-Forest includes more than 580,000 observations of more than 200 species in 1km-by-1km square plots of land, and is available in both tabular and geospatial file formats. Previously: American tree maps (Dec. 23, 2015) and NYC street trees (Nov. 16, 2016). — Data is Plural: January 25, 2017
Links:
Tags: mapping, plants, statistics
The GDELT Project and the Internet Archive have collaborated to make the latter's Television News Archive more powerfully searchable. Their new tool, announced in December, lets you search across “more than 5.7 billion words from over 150 distinct stations spanning July 2009 to present” at a sentence-by-sentence level. The results are downloadable as CSV or JSON files. Previously: The Political TV Ad Archive (Feb. 2, 2016). — Data is Plural: January 25, 2017
Links:
The Bank of England publishes a spreadsheet of historical economic data going back, in some cases, to the late 1600s. The country’s GDP in 1700 was £11.7 billion in 2013 prices. That’s about 1/157th the size of the UK’s GDP in 2015. And in November 1694, monthly short-term interest rates were roughly 6%. [h/t Ian Greenleigh] — Data is Plural: January 25, 2017
Links:
A team of economists studying “the equality of opportunity” has published new research identifying which colleges “help the most children climb the income ladder.” For their analysis, the researchers combined federal tax records and data from the Department of Education. California State University–Los Angeles was one of the greatest engines of mobility; nearly 1 in 10 students enrolled there began in the bottom 20% of income but reached the top 20% by their early thirties. You can download the findings, which include similar statistics for more than 2,000 schools, as a series of spreadsheets. Related: “Some Colleges Have More Students From the Top 1 Percent Than the Bottom 60. Find Yours,” from the New York Times. — Data is Plural: January 25, 2017
Links:
The General Services Administration recently updated its list of known .gov domains. It currently includes more than 1,300 federal domains — from aapi.gov to youthrules.gov — and more than 4,300 domains registered by state, local, and native sovereign agencies. — Data is Plural: January 18, 2017
Links:
Tags: government, technology
State-owned Deutsche Bahn AG is Europe’s largest railway company by revenue, serving 12 million train and bus passengers each day. It also happens to publish a bunch of open data, including datasets on its routes, stations, platforms, and cargo facilities. [h/t Martin Bergmann] — Data is Plural: January 18, 2017
Links:
Tags: transportation
Between December 2014 and March 2016, Alberto Cavallo — co-founder of MIT’s Billion Prices Project — sent 323 crowdsourced workers to collect product prices from 56 large retailers in 10 countries. Then, he found the prices for the same products on the retailers’ websites. The results, which contain tens of thousands of observations, are available as several Excel spreadsheets. (Caveat: The dataset’s “Terms of Use” rules stipulate that the information is “EXCLUSIVELY FOR USE IN ACADEMIC RESEARCH AND PUBLICATIONS”.) Related: Cavallo summarized his findings in a paper published recently by the American Economic Review. — Data is Plural: January 18, 2017
Links:
Tags: business, technology
Late last year, the USDA published a study that used “point-of-sale transaction data from a leading grocery retailer to examine the food choices” of households receiving Supplemental Nutrition Assistance Program (SNAP) benefits. In an appendix, the report ranks the total spending on major commodities by SNAP households and non-SNAP households. Soft drinks, “fluid milk products,” and ground beef were the top three commodities purchased by SNAP households. Milk, soft drinks, and cheese were the top three for non-SNAP households. That information is presented as a PDF table, but I’ve converted it to a spreadsheet-friendly text file for you. [h/t Reddit user "junglejuicy"] — Data is Plural: January 18, 2017
Links:
At BuzzFeed News, a few colleagues and I spent the past two months compiling a big database of organizations and people connected to President-elect Trump, his family, advisers, and Cabinet picks. On Sunday, we published what we’ve found so far — connections between more than 1,500 organizations and people altogether. Still, there are certainly things we’ve missed. So you can download and search the data, but you can also help us expand it. See something we’ve overlooked? Let us know! — Data is Plural: January 18, 2017
Links:
In 2015, computer scientist Randy Olson tried computing “the optimal search strategy for finding Waldo” in the seven original Where’s Waldo? books. In doing so, he transcribed a 2013 Slate chart of Waldo’s locations (itself transcribed from those seven original books). The resulting dataset contains 68 rows — one for each Waldo — and four columns: book, page, x coordinate, and y coordinate. — Data is Plural: January 11, 2017
Links:
CelesTrak’s T.S. Kelso has been obsessively transcribing NORAD’s “resident space object” data for decades. Among his offerings: the SATCAT satellite catalog, which provides data on all known satellites launched since 1957 — more than 41,900 of ‘em. Kelso also provides a SATCAT Boxscore, which is like a baseball box score ... but for satellites. The U.S., it turns out, is responsible for almost exactly one-third of the 1,590 satellites classified as “active.” Previously: The Union of Concerned Scientists’ satellite database, featured Dec. 30, 2015. [h/t Noah Veltman] — Data is Plural: January 11, 2017
Links:
Tags: history, technology
Years ago, Lt. Col. Jenns Robertson began entering information into “a simple Excel spreadsheet that eventually matured into the largest compilation of releasable U.S. air operations data in existence.” Last month, the Department of Defense published a “beta” version of this data, known as Theater History of Operations Reports (THOR). Currently, THOR’s data covers bombing operations from World War I, World War II, the Korean War, and the Vietnam War. For each bombing, the reports include data about the aircraft, munitions, targets, results, and more. — Data is Plural: January 11, 2017
Links:
Scientists expect that, when the final numbers come in, 2016 will have been Earth’s hottest year on record. The National Oceanic and Atmospheric Administration publishes monthly data on “temperature anomalies” — how much hotter or cooler a month was than the 20th century average. (November 2016, the most recent month available, was 0.73° Celsius warmer than the average November.) You can grab the data for the entire globe, by hemisphere, or by continent; for the land and ocean combined, or separately; and going all the way back to 1880. Related: My colleague Peter Aldhous demonstrates how he charted this data using R. Also: NOAA released its 2016 U.S. “State of the Climate” report on Monday. — Data is Plural: January 11, 2017
Links:
Tags: climate
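As a rough illustration of what a “temperature anomaly” measures, the sketch below subtracts the mean of a baseline period from each observed value. The function name and all temperature values are made-up placeholders for illustration, not NOAA figures or NOAA's actual method.

```python
from statistics import mean


def temperature_anomalies(observed, baseline):
    """Return each observed value minus the mean of the baseline values."""
    baseline_mean = mean(baseline)
    return [round(value - baseline_mean, 2) for value in observed]


# Hypothetical monthly averages in degrees Celsius (illustrative only).
baseline_values = [10.0, 10.2, 9.8]   # stands in for a long-run baseline period
observed_values = [10.5, 9.6]         # stands in for two recent months

anomalies = temperature_anomalies(observed_values, baseline_values)
print(anomalies)
```

A positive anomaly means the month was warmer than the baseline average; a negative one means it was cooler.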
In September, ProPublica published a Chrome extension that showed readers what Facebook said it knew about them — and then asked readers to share that data. In the following months, readers unearthed more than 52,000 of the “unique interest categories” that Facebook uses for advertising, such as “yoga,” “beer,” and “Scent of a Woman (1992 film).” But ProPublica’s reporters also found that Facebook doesn’t tell users about the “far more sensitive” data it buys about their offline lives, which can include “their income, the types of restaurants they frequent and even how many credit cards are in their wallets.” To support these findings, ProPublica published two key datasets: the crowdsourced “interest categories” and the list of categories that Facebook allows advertisers to target. — Data is Plural: January 11, 2017
Links:
Tags: social media, technology
Last week, Quartz published an addictive tool that lets you map word usage on Twitter, by U.S. county. It’s based on an academic analysis of 890 million geocoded tweets uttered between October 2013 and November 2014. Data and details available here. — Data is Plural: December 21, 2016
Links:
The FAA’s Near Midair Collision System keeps track of incidents where two planes flew uncomfortably close to each other. The system, which is based on reports from pilots and flight crew members, contains more than 7,500 incidents dating back to 1987. The FAA received 305 of these reports for the first 10 months of 2016, including 35 classified as “critical.” — Data is Plural: December 21, 2016
Links:
Tags: transportation
The U.S. Geological Survey’s BISON service brings together “species occurrence” data from hundreds of sources. The service, whose name stands for “Biodiversity Information Serving our Nation,” currently contains 262 million records, each of which refers to the observation of “an organism at a particular time in a particular place.” Most of the observations are based on direct sightings; others use fossils, written records, or other sources. The data aren’t available for bulk download, but can be accessed via BISON’s free API. [h/t Clare Malone] — Data is Plural: December 21, 2016
Links:
Tags: animals, plants, statistics
Since the 1940s, oilfield services corporation Baker Hughes and its predecessor companies have been publishing “rig counts” — the number of rigs actively drilling for oil and/or gas in various parts of the world. These days, the company updates its North America numbers every week and its international counts every month. As of December 16, they counted 637 rigs in — and offshore of — the United States, nearly half of them in Texas. [h/t Jordan Wirfs-Brock] — Data is Plural: December 21, 2016
Links:
Last week, the U.S. Department of Health and Human Services released a dataset of state-level Obamacare metrics. The dataset is divided into five main categories: coverage gains, employer coverage, individual market coverage, Medicaid, and Medicare. Between 2010 and 2015, the proportion of Nevadans without health insurance dropped from 22.6% to 12.3% — the largest percentage-point decrease of any state. (In 2015, an estimated 17.1% of Texans still didn’t have health insurance, the highest rate of any state that year.) The metrics come from various sources, including the Census, academic studies, and the department’s own estimates. [h/t Nadja Popovich] — Data is Plural: December 21, 2016
Links:
Tags: government, healthcare
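The Nevada figure above is a percentage-point decrease, which is easy to confuse with a percent (relative) decrease. The sketch below computes both from the two uninsured rates quoted in the item; the function names are hypothetical helpers, not part of the HHS dataset.

```python
def percentage_point_change(before_pct, after_pct):
    """Absolute difference between two rates, in percentage points."""
    return round(after_pct - before_pct, 1)


def percent_change(before_pct, after_pct):
    """Relative change between two rates, as a percent of the starting rate."""
    return round((after_pct - before_pct) / before_pct * 100, 1)


# Uninsured rates for Nevada quoted above: 22.6% (2010) and 12.3% (2015).
nevada_2010, nevada_2015 = 22.6, 12.3

point_drop = percentage_point_change(nevada_2010, nevada_2015)
relative_drop = percent_change(nevada_2010, nevada_2015)
print(point_drop, relative_drop)  # -10.3 -45.6
```

So the same change can be reported as a drop of 10.3 percentage points or as a drop of roughly 45.6 percent, depending on which measure a story uses.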
UK-based CarbonCulture helps organizations measure and publish their buildings’ energy and water use in near-realtime. Among the first users: 10 Downing Street, the Tate Modern, and University College London. For each building, you can download yearly datasets, which are broken down into 30-minute intervals. [h/t Max Roser] — Data is Plural: December 14, 2016
Links:
Tags: energy
A Lithuania-based web-scraping company has been collecting data on Kickstarter projects and Indiegogo campaigns every month. The datasets include (among other things) each project’s number of backers, amount pledged, and category. You can also explore the data online. [h/t Vincent Granville] — Data is Plural: December 14, 2016
Links:
Tags: money, technology
The European Commission and Google engineers have mapped surface water – including lakes, rivers, reservoirs, oceans, and more – on every 30-meter-by-30-meter square on Earth between 1984 and 2015. During that time, “permanent surface water has disappeared from an area of almost 90,000 square kilometres, roughly equivalent to that of Lake Superior, though new permanent bodies of surface water covering 184,000 square kilometres have formed elsewhere.” The data, based on the U.S. government’s Landsat satellite images, are available to download and explore online. Related: “Mapping Three Decades of Global Water Change,” published by The New York Times, based on this dataset. — Data is Plural: December 14, 2016
Links:
Troy Hunt runs HaveIBeenPwned.com, a service that lets you see whether your email address has been included in any major data breaches. Last week, Hunt published an anonymized dataset based on the breaches he’s collected. (That post provides a torrent file for the dataset; you can also download the data here.) Unlike the HaveIBeenPwned website, the dataset doesn’t include information about specific accounts; instead it counts the number of email addresses that have been compromised on particular combinations of websites. For example, 14.6 million email addresses appeared in both the LinkedIn and Dropbox breaches. (You can read more about each breach here.) — Data is Plural: December 14, 2016
Links:
Tags: crime, technology
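To make the idea of counting addresses per combination of breaches concrete, here is a minimal sketch. It assumes, purely for illustration, a mapping from each address to the set of breaches it appeared in; the sample addresses, site names, and function name are invented, and the published dataset's actual layout may differ.

```python
from collections import Counter


def count_breach_combinations(breaches_by_address):
    """Count how many addresses share each exact combination of breaches."""
    counts = Counter()
    for breaches in breaches_by_address.values():
        # frozenset makes the combination hashable and order-independent.
        counts[frozenset(breaches)] += 1
    return counts


# Invented sample data (not from the real dataset).
sample = {
    "a@example.com": {"SiteA", "SiteB"},
    "b@example.com": {"SiteB", "SiteA"},
    "c@example.com": {"SiteA"},
}

combo_counts = count_breach_combinations(sample)
for combo, total in combo_counts.items():
    print(sorted(combo), total)
```

Because only the combination and its count are kept, the output carries no information about any individual address.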
The federal government has released data on Medicare’s prescription drug spending from 2011 to 2015. Previously, Medicare had only published data on the most expensive drugs; the new release includes data on all drugs used by at least 11 Medicare patients in a given year. Caveat: Medicare “is prohibited from publicly disclosing drug-specific information on manufacturer rebates,” so the “spending metrics do not reflect any manufacturers’ rebates or other price concessions.” [h/t Charles Ornstein] — Data is Plural: December 14, 2016
Links:
Tags: drugs, healthcare
“MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition.” [h/t Lon Riesberg] — Data is Plural: December 7, 2016
Links:
Tags: audio, entertainment, music
The IPUMS Higher Ed portal provides data from three “leading surveys for studying the science and engineering (STEM) workforce in the United States.” The surveys currently cover 1993 through 2013 and include questions about educational choices, demographics, employment outcomes, and more. Requires a free account. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] — Data is Plural: December 7, 2016
Links:
Last month, Chicago’s city government published data on more than 100 million local taxi rides taken in the city since 2013. (The city gathers the data through “periodic reporting by two major payment processors believed to cover most taxis in Chicago.”) The dataset contains each ride’s start/end times, pickup/dropoff location (based on Chicago’s “community areas”), distance, cost, payment type, and taxi company. Related: “Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance,” which contains pointers to similar data for New York City. [h/t Dan Nguyen] — Data is Plural: December 7, 2016
Links:
Tags: transportation
The Open PV Project is a “community driven, comprehensive database” of solar panel installations in the U.S., ranging from home installations to utility-scale projects. The database, run by the Department of Energy, contains more than 1 million installations — with a total capacity of 16,000+ megawatts — and tracks their locations, sizes, costs, installers, and other variables. [h/t Dad] — Data is Plural: December 7, 2016
Links:
Tags: energy
The U.S. Energy Information Administration publishes a bunch of geographic data, including shapefiles mapping the country’s crude oil, petroleum product, hydrocarbon gas liquid, and natural gas pipelines. (They were last updated five months ago.) Additionally, the Pipeline and Hazardous Materials Safety Administration keeps track of “significant incidents” — for example, those that caused a serious injury or $50,000 in damage. Related: “Six maps that show the anatomy of America’s vast infrastructure.” Also related: ProPublica’s Pipeline Safety Tracker, covering 1986–2012. — Data is Plural: December 7, 2016
Links:
A few years ago, Reddit user trexmatt uploaded 216,930 Jeopardy! trivia-tidbits, scraped from j-archive.com, “the nearly comprehensive online Jeopardy! archive maintained by obsessive fans.” Each entry lists the question, answer, category, value, round, show number, and show air-date. — Data is Plural: November 30, 2016
Links:
Data analyst Patrick Martinchek has published a dataset of all Facebook posts from “15 of the top mainstream media sources” — a group that includes The New York Times, The Wall Street Journal, NPR, Fox News, and other familiar sources — from January 2012 through Nov. 8, 2016. Related: “What I Discovered About Trump and Clinton From Analyzing 4 Million Facebook Posts.” — Data is Plural: November 30, 2016
Links:
This year, I decided to grade a bunch of prominent election forecasts for BuzzFeed News. Now that Michigan has finally been called, I’ve published the results. I’ve also published the underlying data and code on GitHub, including state-level predictions from all nine forecasters in the analysis. — Data is Plural: November 30, 2016
Links:
Earlier this month, Forbes published an examination of ShotSpotter, a company that uses networks of outdoor microphones to detect and locate gunshot-like sounds. Forbes found that ShotSpotter has produced “few tangible results.” “In some cities, ShotSpotter hasn’t had the effect city officials and residents had hoped for. While officers are responding to more illegal gunfire, they rarely catch the shooter.” To support its findings, Forbes has published the ShotSpotter data they received from police departments in seven cities: Brockton, Mass.; East Palo Alto, Calif.; Kansas City, Mo.; Milwaukee, Wis.; Omaha, Neb.; San Francisco, Calif.; and Wilmington, N.C. The data varies somewhat for each city, but typically includes the date, time, location, and outcome of each gunshot alert. [h/t Matt Drange] — Data is Plural: November 30, 2016
Links:
The CDC’s Underlying Cause of Death database provides county-level mortality statistics based on death certificates of U.S. residents for each year from 1999 to 2014. The tool lets you group the data by geography, demographics, place of death (e.g., inpatient hospital, hospice, home, etc.), and other variables. In 2014, for example, about 40,000 residents died of pancreatic cancer — with the highest rates coming in America’s most-rural counties (~15.6 deaths per 100,000 residents) and the lowest rates in the country’s most-urban counties (~11.3 per 100,000). The CDC’s “compressed mortality” datasets contain slightly less detail, but go all the way back to 1968. [h/t Drew Ivan] — Data is Plural: November 30, 2016
Links:
Earlier this month, New York City published the results of its decennial tree count. You can explore a map of every street tree in NYC — nearly 700,000 of ‘em — or download the corresponding dataset, which contains info on each tree’s species, circumference, health status, and other observations. (Note: That dataset appears to contain about one-third fewer trees than the map’s count, for reasons I can’t quite figure out.) Results of the 1995 and 2005 tree censuses are also available. — Data is Plural: November 16, 2016
Links:
Germany-based researcher Andreas Thalhammer has applied PageRank — the algorithm at the heart of Google’s origin story — to the world of Wikipedia. The result: the DBpedia PageRank dataset, which estimates the importance of each page based on the other pages that link to it. You can download the data directly, or query it online. (According to the metric, Aristotle, Plato, and Karl Marx are history’s three most Wiki-central philosophers.) — Data is Plural: November 16, 2016
Links:
Tags: technology
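For readers unfamiliar with PageRank, the sketch below shows the basic idea on a tiny invented link graph: each page repeatedly passes a share of its score to the pages it links to. This is a generic textbook-style power iteration, not necessarily the procedure used to build the DBpedia PageRank dataset; the page names, damping value, and iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores; `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Toy graph with placeholder page names (not real Wikipedia data).
toy_links = {"A": ["B"], "B": ["A"], "C": ["A", "B"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C links out but receives no links, so it ends up with the lowest score, while A and B, which link to each other, share the rest equally.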
Jason Baumgartner — a.k.a. Stuck_In_the_Matrix — has collected and published every submission and comment posted to Reddit from November 8th through November 10th. For each of the nearly 8 million comments, the dataset includes the message, the author, the subreddit it was posted to, the comment thread’s ID, and more. Previously: 1.7 billion Reddit comments, featured Nov. 25, 2015. — Data is Plural: November 16, 2016
Links:
Last month, colleagues at BuzzFeed News and I analyzed and fact-checked 1,000+ posts from hyperpartisan Facebook pages, and found a disturbingly high rate of fake news. Here’s the data. Facebook CEO Mark Zuckerberg has dismissed the possibility that fake news influenced the election, calling it a “pretty crazy idea”. Meanwhile, renegade Facebook employees have now formed an unofficial task force to battle fake news on the platform. — Data is Plural: November 16, 2016
Links:
Since the 1990s, the FBI has collected data on hate crimes from local law enforcement agencies. On Monday, the bureau released data for 2015, reporting “5,850 criminal incidents and 6,885 related offenses, as being motivated by bias toward race, ethnicity, ancestry, religion, sexual orientation, disability, gender, and gender identity.” Those numbers are based on reports from 14,997 participating agencies. On the FBI’s website, you can view and download summary tables of the most recent data. You can also download incident-specific data for 1992 through 2014 from the National Archive of Criminal Justice Data. Unfortunately, as ProPublica noted yesterday, the FBI dataset is “deeply flawed”; more than 3,000 law enforcement agencies don’t participate in the program. [h/t John Templon] — Data is Plural: November 16, 2016
Links:
The city publishes a spreadsheet — last updated in May — of local dogs who’ve officially been “declared dangerous.” (“They have attacked in the past. The owner is required to provide $100,000 in financial responsibility. If they attack again the court could order them put to sleep.”) The file currently contains 63 entries, from a Labrador named Charlie to a Blue Lacy named Flint. [h/t Sharon Machlis] — Data is Plural: November 2, 2016
Links:
Julian McAuley, an assistant professor at UC San Diego, has collected a massive amount of user-generated data from Amazon.com, including 142.8 million reviews and 1.4 million answered Q&As. (As of mid-2014, Sophie la Girafe was the most-reviewed item in the baby category. Backstory here.) Much of the data can be downloaded directly, but the largest files require contacting McAuley for access. [h/t Reddit user samofny] — Data is Plural: November 2, 2016
Links:
Earlier this autumn, New York City began publishing a dataset of official citizen complaints against the city’s police, for every case closed since 2006. For each of the 200,000+ allegations, the main dataset includes various details about the incident — e.g., where it took place, and whether there’s video evidence — but no information about the officer involved. Related: Similar data from Indianapolis, which includes demographic information about the complained-against officers but not their names. Also related: “The local projects that are making police complaint data open and accessible.” Previously: Complaints against Chicago police, featured Nov. 11, 2015. [h/t Eve Ahearn] — Data is Plural: November 2, 2016
Links:
The European Commission’s Global Human Settlement Layer combines satellite imagery and census data to measure three things: population, building density, and urban/rural classification. The resulting datasets are fairly detailed — they provide population estimates for every 250-meter square in the world, for example — and are available for 1975, 1990, 2000, and 2015. [h/t Alasdair Rae] — Data is Plural: November 2, 2016
Links:
The U.S. government’s Medicare Health Outcomes Survey tracks the “physical and mental health and well-being” of Americans covered by Medicare. Each survey, currently available for 1998–2000 to 2012–2014, follows a sample of Medicare beneficiaries for two years, and asks them questions along the lines of, “In the past 12 months, have you had a problem with balance or walking?” The 2012–2014 data includes (at least partial) responses from 296,320 people. [h/t Ricardo Pietrobon] [Update, 2016-11-02: The original link in this item points to an ICPSR page, which provides access only to people at "member institutions." Here's a better link to the data: http://www.hosonline.org/en/data-dissemination/research-data-files/] — Data is Plural: November 2, 2016
Links:
Tags: healthcare
Between October 2014 and September 2015, the U.S. Transportation Security Administration confiscated 22,196 “dangerous” items at airports, including 156 times at New York’s JFK. (Twice there, someone had placed fireworks in checked baggage.) That’s according to data obtained from the government by FOIA enthusiast Max Galka, who has also built an interactive map of the confiscations. — Data is Plural: October 26, 2016
Links:
Tags: transportation
OpenFlights.org has collected data on more than 60,000 flight routes, including 915 itineraries departing Atlanta’s Hartsfield–Jackson International Airport. (That airport was recently named the world’s busiest, for the 18th year in a row.) For each route, the dataset indicates the airline, the departing airport, the arriving airport, the number of stops, and what type of plane is typically used. The website also provides datasets on thousands of airports and airlines. Important caveat: “This data is not suitable for navigation.” — Data is Plural: October 26, 2016
Links:
Tags: transportation
The World Cities Culture Forum, a convening of 32 major cities on six continents, has assembled a series of mini-datasets on 70+ “cultural indicators”. Those indicators range from the number of art galleries in each city (Paris had 1,151 in 2012) to the number of international tourists each city sees per year (Istanbul had 11.8 million in 2014) to the value of cinema ticket sales (Shanghai sold $563 million in 2014). Note: The data points draw on various sources — at least one just says “Google” — and aren’t necessarily directly comparable. [h/t Camilo Moreno] — Data is Plural: October 26, 2016
Links:
The Department of Education’s EDFacts data tracks public grade schools’ participation and proficiency rates on standardized math and reading/language exams. The files provide data on all students who took the tests, broken down by race/ethnicity, sex, disability status, homelessness, and more. A related set of data files, available on the same page, tracks high-school graduation rates. — Data is Plural: October 26, 2016
Links:
Tags: education
The website EveryCRSReport.com provides unprecedented public access to reports from the Congressional Research Service — essentially the national legislature’s think-tank. The website, which was launched last week by Demand Progress and the Congressional Data Coalition, also lets you download metadata and text for each report. [h/t Daniel Schuman] — Data is Plural: October 26, 2016
Links:
Tags: government
Today’s newsletter marks the 50th edition of Data Is Plural, as well as its one-year anniversary. To celebrate, I’ve started publishing a spreadsheet that details each edition’s basic stats — total subscribers, the “open rate,” the number of people who chose to unsubscribe, and more. — Data is Plural: October 19, 2016
Links:
The Department of Agriculture publishes a spreadsheet of farmers markets in the United States. For each market, the dataset notes its location, hours, and the types of goods available (e.g., vegetables, seafood, flowers, et cetera). [h/t Susie Lu] — Data is Plural: October 19, 2016
Links:
Tags: agriculture
The Jordà-Schularick-Taylor Macrohistory Database claims to be “the most extensive long-run macro-financial dataset to date.” It contains dozens of variables — GDP per capita, long-term interest rates, and the timing of systemic financial crises, for example — for 17 “advanced economies”. The dataset uses a Creative Commons license and has been extensively documented. — Data is Plural: October 19, 2016
Links:
Tags: economics
Each year, the Department of Health and Human Services updates its Area Health Resources Files, a vast suite of local health care data collated from more than 50 sources. Among the topics covered: the number of health care professionals by specialty, various rates of hospital usage, air quality, and demographic profiles. You can download the data, or explore and map it online. [h/t Ricardo Pietrobon] — Data is Plural: October 19, 2016
Links:
Tags: healthcare
The Census Bureau’s Annual Survey of Manufactures provides state-by-state and industry-by-industry statistics for America’s manufacturing sector. Metrics include the number of employees, annual payroll, “value added,” beginning-of-year inventory, and many more. In 2014, dog and cat food manufacturers employed about 18,000 people nationwide. Related: “Why Are Politicians So Obsessed With Manufacturing?” [h/t Scott Stern + RJ Andrews] — Data is Plural: October 19, 2016
Links:
You don’t have to like Flight of the Conchords to enjoy New Zealand’s national statistics website, though it couldn’t hurt. The country publishes data on a broad range of topics, including abortion, work stoppages, the Māori census, and, of course, exports. In '08 and '09, the country exported NZD $3.5 billion and NZD $2.4 billion, respectively, of “mineral fuels, mineral oils and products of their distillation; bituminous substances; mineral waxes.” [h/t Drew Ivan] — Data is Plural: October 12, 2016
Links:
Tags: statistics
John C. McCallum has collected the advertised prices of computer memory over time. In 1957, one byte of memory cost $392, or the equivalent of $411 million per megabyte; today, one megabyte costs about a third of a cent. [h/t Jorge Luis] — Data is Plural: October 12, 2016
Links:
Tags: economics, technology
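The per-megabyte figure for 1957 can be checked with a line of arithmetic. The sketch below assumes a binary megabyte (1,048,576 bytes), since that is the definition under which $392 per byte works out to roughly $411 million; the variable names are illustrative.

```python
# Assumption: a "megabyte" here means 2**20 = 1,048,576 bytes.
BYTES_PER_MEGABYTE = 2 ** 20

price_per_byte_1957 = 392  # dollars per byte, as quoted in the item above

price_per_megabyte_1957 = price_per_byte_1957 * BYTES_PER_MEGABYTE
print(f"${price_per_megabyte_1957:,}")  # $411,041,792
```

With a decimal megabyte (1,000,000 bytes) the same calculation would give $392 million instead.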
The New York Philharmonic’s performance history dataset contains “all known concerts” — more than 20,000 of ‘em — played by the Philharmonic and the groups with which it has merged (e.g., the New York Symphony). Last month, the Museum of Modern Art published a dataset containing “all of the known exhibitions held at the museum from 1929 through 1989” — 1,788 in total. The first featured Cézanne, Gauguin, Seurat, and van Gogh. [h/t Stacy-Marie Ishmael + Miriam Posner + Chad Weinard] — Data is Plural: October 12, 2016
Links:
The World Bank keeps statistics on total forest coverage per country and worldwide. (Between 1990 and 2015, that worldwide total declined from 41.3 million to 40.0 million square kilometers.) More than 98% of all land area in Suriname was forest in 2015, according to a related dataset — the highest proportion of any country. [h/t Tariq Khokhar + Max Galka] — Data is Plural: October 12, 2016
Links:
Tags: environment, plants
Noah Veltman has collected all presidential endorsements (and non-endorsements) of 100+ major newspapers from 1980 (Reagan vs. Carter) to 2016. You can view the data as a spreadsheet, or as a formatted table. — Data is Plural: October 12, 2016
Links:
Tags: journalism, media, politics
FOIA enthusiast Max Galka received a month of highway traffic data from the U.S. Department of Transportation. The dataset “includes hourly traffic counts for each hour of each day of [November 2015] at approximately 4,000 continuous traffic counting locations nationwide.” In all, the dataset “amounts to a total of 14 million traffic count readings and a total of 6 billion vehicles counted.” — Data is Plural: October 5, 2016
Links:
Tags: transportation
The UNESCO Institute for Statistics’ data on national research and development budgets contains estimates of personnel and total spending by field, funding source, and more. You can also explore the data online through a series of interactive graphics. [h/t Rebecca Galloway] — Data is Plural: October 5, 2016
Links:
The federal government publishes default rates for federal student loans, aggregated by school, state, and school type. Last week, it published data covering students whose loans were due for repayment beginning in FY2013. The national default rate for those students as of this August was 11.3%. At certain schools, however, more than a third of students defaulted. More: Some background on the 10 colleges with highest default rates, by my colleague Molly Hensley-Clancy. — Data is Plural: October 5, 2016
Links:
Researchers at the Vienna-based Wittgenstein Centre for Demography and Global Human Capital have developed a dataset of historical and projected education levels for 171 countries. For five-year age groups in each country, the project estimates the percentage of people in each of several categories of educational attainment — no education, primary education, secondary education, post-secondary education, and a few gradations in between. The dataset is available to browse and download via the Wittgenstein Centre Data Explorer – look for “Educational Attainment Distribution” in the “indicators” dropdown. — Data is Plural: October 5, 2016
Links:
Tags: education
AidData, an organization based at the College of William & Mary, has compiled a dataset of more than 1.5 million foreign aid projects between 1947 and 2013. Together, the dataset accounts for more than $7 trillion in commitments from 96 donors such as the U.S. government, UNICEF, the Nordic Development Fund, and the World Bank. AidData also publishes geospatial datasets and a data user guide. Previously: ForeignAssistance.gov, featured Jan. 13. [h/t Kedar Pavgi] — Data is Plural: October 5, 2016
Links:
Tags: United Nations, aid, history
After 2000’s contentious election, the National Opinion Research Center — funded by a consortium of news organizations — rigorously reviewed 175,010 Florida ballots that weren’t recognized as “valid” votes for president. In November 2001 the researchers concluded that, even with a full recount of disputed ballots, George W. Bush still would have won the state by 493 votes. The underlying data is available in several formats. — Data is Plural: September 28, 2016
Links:
The Constituency-Level Elections Archive, based at the University of Michigan, collects and standardizes results from lower-house legislative elections around the world. (In the U.S., the lower house is the House of Representatives; in the U.K., it’s the House of Commons; in Albania, it’s the Kuvendi i Shqipërisë.) The latest release covers 1,591 elections from 136 countries. [h/t Jeremy Darrington] — Data is Plural: September 28, 2016
Links:
The U.S. Election Assistance Commission’s Election Administration and Voting Survey “includes data on the ability of civilian, military and overseas citizens to register to vote and successfully cast a ballot,” as well as an overview of each state’s voting laws and procedures. [h/t Derek Willis] — Data is Plural: September 28, 2016
Links:
OpenElections, a Knight Foundation–funded project, aims “to create the first free, comprehensive, standardized, linked set of election data for the United States.” They’ve made progress, but are looking for additional volunteers. In the meantime, you can download county-level presidential results from the National Atlas of the United States for 2004, 2008, and 2012 — or all combined. And you can download precinct-level results from 2002 to 2012 from the Harvard Election Data Archive (codebook here). — Data is Plural: September 28, 2016
Links:
Perhaps better known for its campaign-finance data, the Federal Election Commission also publishes official state-level results for presidential, House, and Senate elections going back to 1982. The results include all official candidates, and sometimes even write-ins (depending on the state). In the 2008 presidential election, eight Rhode Island voters wrote in “Stephen Colbert,” five scribbled “Joe the Plumber,” and seven chose “Jesus.” — Data is Plural: September 28, 2016
Links:
A group of computer scientists and the New Yorker’s cartoon editor walk into a room… and write an academic article titled, “Humor in Collective Discourse: Unsupervised Funniness Detection in the New Yorker Cartoon Caption Contest.” The corresponding dataset — available via the “cartoons” link on this page — includes 50 cartoons and nearly 300,000 reader-submitted captions. — Data is Plural: September 14, 2016
Links:
Reporters at the New York Times have assembled a dataset counting the number of inmates each U.S. county sent to state prison in 2006, 2013, and 2014. The reporters derived the numbers from the Bureau of Justice Statistics’ National Corrections Reporting Program, which only certain researchers can access. Related: “This small Indiana county sends more people to prison than San Francisco and Durham, N.C., combined. Why?” — Data is Plural: September 14, 2016
Links:
Tags: crime
The National Snow and Ice Data Center, based at the University of Colorado, publishes the Sea Ice Index. The data files, which track ice coverage in the Arctic and Antarctic oceans, include daily and monthly measurements from November 1978 to the present. Lately, the extent of sea ice on the Arctic Ocean has been two or more standard deviations below its long-term average, according to the center, while Antarctic sea ice remained at average levels. [h/t Dan Vergano] — Data is Plural: September 14, 2016
Links:
The CDC calls its Behavioral Risk Factor Surveillance System “the largest continuously conducted health survey system in the world.” Every year, the survey asks more than 400,000 American adults about a range of health-related topics, from tobacco to seatbelt use, from alcohol consumption to arthritis, from HIV testing to immunizations. Annual datasets from 1984–2015 are currently available. [h/t Ricardo Pietrobon] — Data is Plural: September 14, 2016
Links:
Researchers at the Washington Center for Equitable Growth have compiled a dataset of current and historical minimum wages in America. The federal and state minimum-wage data stretches back to May 1974 — when the federal minimum was $2.00 per hour, roughly equivalent to $9.76 per hour in today’s dollars — while the data for cities and counties starts in January 2004. [h/t Ben Casselman] — Data is Plural: September 14, 2016
Links:
The U.S. General Services Administration publishes an annual dataset about vehicles owned and leased by the federal government. The spreadsheets — which contain details on total inventories, cost, usage, and fuel consumption — go back to fiscal year 2011. In FY 2015, federal vehicles drove 4.8 billion miles, down about 9% from FY 2011. [h/t John Templon] — Data is Plural: September 7, 2016
Links:
Tags: government, transportation
Last week, a team of researchers published HistPat, a database containing county-of-residence data for 2.8 million U.S. patents granted between 1836 and 1975. The database covers approximately 83% of all patents granted to U.S. residents during that time, according to the authors. The most frequent home counties for innovation were New York County (422,234 patents); Cook County, Ill. (215,021); and Los Angeles County (90,171). Related: The National Bureau of Economic Research’s dataset of patent citations, 1975-1999. And: “Cancer moonshot” patents, 1976–2016. [h/t Drew Ivan] — Data is Plural: September 7, 2016
Links:
Tags: technology
Earlier this summer, a group of researchers published a “new world atlas of artificial night sky brightness,” also known as light pollution. You can download a KMZ version of their atlas and view it in Google Earth. The researchers haven’t made their most detailed, “floating point” dataset available for public download; instead, they ask that you first submit a data-request form. [h/t Matthew Petroff] — Data is Plural: September 7, 2016
Links:
Tags: environment, mapping
The Federal Communications Commission decides who can use the nation’s airwaves and how. To date, they’ve issued millions of licenses, including nearly 200,000 last year for broadcast, personal use, law enforcement, and more. Almost exactly six years ago, the FCC launched a consolidated portal that pulls data from its various licensing systems into a single dataset. You can download all 17 million licenses in bulk, search for specific licenses online, or query the dataset’s API. [h/t Marc DaCosta] — Data is Plural: September 7, 2016
Links:
Tags: media
Since 1996, the Medical Expenditure Panel Survey has collected data on “the specific health services that Americans use,” and the “health insurance held by and available to U.S. workers.” In a typical year, the survey collects data from more than 30,000 people from more than 10,000 families. In addition to the raw data files, the Agency for Healthcare Research and Quality, which runs the survey, also provides summary data tables. They show that, for example, in 2013 an estimated 61% of Americans faced expenses for prescription drugs, which cost the median patient about $278 before insurance. [h/t Ricardo Pietrobon] — Data is Plural: September 7, 2016
Links:
Tags: healthcare
The 2016 U.S. Open began on Monday. It’s as good an occasion as any to highlight the work of TennisAbstract.com’s Jeff Sackmann, who has published decades of match results and historical rankings from the men’s ATP and women’s WTA tours. Related: How FiveThirtyEight is using the data to forecast this year’s U.S. Open. Also: Prize money for the four Grand Slam tournaments, by gender and over time. And: The Tennis Racket. [h/t Nadja Popovich + John Templon] — Data is Plural: August 31, 2016
Links:
Tags: sports
California Senate Bill 272, enacted last year, required every local government agency to publish a “catalog of enterprise systems” — essentially a guide to all the big databases they keep — by July 1 of this year. To find out who complied, a group of data-transparency organizations hosted the California Database Hunt last weekend. Volunteers searched 680 agencies, and published two spreadsheets of their findings: 430 (63%) of local agencies had posted their database catalogs, while 250 had not. [h/t Stephanie M. Lee] — Data is Plural: August 31, 2016
Links:
Tags: technology
EarthStat provides geographic data on harvest regions, yields, and fertilizer use for more than 100 crops. The website also publishes data on pasture land, water depletion, and climatological effects on crop yields. — Data is Plural: August 31, 2016
Links:
Tags: agriculture
On Monday, the Department of Transportation released 2015 data from its Fatality Analysis Reporting System. The dataset contains detailed information about every fatal motor-vehicle crash in the U.S., aggregated from a variety of state databases, including police reports, death certificates, and licensing files. In 2015, such crashes led to 35,092 deaths, 7.2% more than in 2014. [h/t Tanya Snyder] — Data is Plural: August 31, 2016
Links:
The Panel Study of Income Dynamics is “the longest running longitudinal household survey in the world,” according to its University of Michigan overseers. The study, which began in 1968, has interviewed more than 70,000 people, including four generations of some families. You can access the data for free, but you first need to register for an account and agree to a set of guidelines. An example insight: In 2013 — the most recent year for which data is available — approximately 11% of families said they owned a business in the previous year. [h/t Don Fullerton + Nirupama S. Rao] — Data is Plural: August 31, 2016
Links:
The German Traffic Sign Recognition Benchmark dataset contains 50,000+ images of 43 kinds of German traffic signs — from the classic “STOP,” to various speed limits, to roundabout indicators. The dataset, published by researchers at Ruhr-Universität Bochum’s Institut für Neuroinformatik, formed the basis of a 2011 machine-learning competition. [h/t Viktor Schepik] — Data is Plural: August 24, 2016
Links:
Tags: language, technology
New York State tracks every time a horse has been injured or died at a state race track since March 2009. The dataset, which is updated often, also includes a few other types of incidents, such as when a rider falls or a horse loses badly. Related: “Horses’ Deaths at Aqueduct Prompt New Rules.” [h/t Mark Secada] — Data is Plural: August 24, 2016
Links:
The OpenOil project aims to collect and standardize data on oil and gas development contracts around the world. So far, they’ve gathered at least some data from more than 60 countries. They’ve also published a map of oil concessions in the Middle East and Africa. [h/t Michael Gardiner] — Data is Plural: August 24, 2016
Links:
Tags: energy
Australia’s Department of Health has recently released an enormous dataset of Medicare and subsidized-prescription claims. It includes all claims from a random 10% sample of patients, and “contains approximately 1 billion lines of data relating to approximately 3 million Australians.” The Medicare claims go back to 1984, and the prescription claims go back to 2003. [h/t Drew Ivan] — Data is Plural: August 24, 2016
Links:
Tags: healthcare
The Marshall Project has collected and analyzed four decades of FBI data “on the most serious violent crimes in 68 police jurisdictions.” The FBI data covers 1975 through 2014; the reporters “also obtained data directly from 61 local agencies for 2015 — a period for which the FBI has not yet released its numbers.” Between 2010 and 2015, violent crime increased most in Milwaukee (+11%) and declined most in Prince George’s County, Md. (-22%). — Data is Plural: August 24, 2016
Links:
Tags: crime
In the 1990s, ethnoarchaeologist Lewis Binford digitized more than 200 variables describing 339 groups of hunter-gatherers, a project his collaborator and widow Amber Johnson continues to maintain. The data come from historical ethnographies of societies, ranging from the Chichimec of the 1570s (in what is now Mexico), to the Dorobo of the 1920s (in what is now Kenya), to the Shompen of the 1980s (in the Nicobar Islands). — Data is Plural: August 17, 2016
Links:
Tags:
The Penn World Table contains GDP estimates, normalized for purchasing power, for 182 countries. These “real GDP” estimates — based on a combination of price surveys and national accounts data — stretch back at least to 1960, and many to 1950. In the most recent year available, 2014, Qatar’s real GDP per capita ranked highest: roughly $144,340 in 2011 U.S. dollars. The Central African Republic’s ranked lowest (~$594), and the United States’ ranked 11th (~$52,292). [h/t Willem Kerstholt] — Data is Plural: August 17, 2016
Links:
Tags: economics
The American Society of Composers, Authors and Publishers (ASCAP) boasts a membership of “more than 585,000 US composers, songwriters, lyricists and music publishers of every kind of music.” The organization also maintains a downloadable catalog of the writers and publishers behind nearly 9 million songs. (But the downloaded files lack key details, such as the date the song was published.) — Data is Plural: August 17, 2016
Links:
Tags: entertainment, music
Through its Form 477 program, the Federal Communications Commission collects detailed data on broadband internet access in the United States. One of the easiest ways to access county-level data is through the agency's Mapping Broadband Health in America project, which overlays internet access data and physical health indicators. The latest tabulations come from 2014. In more than a quarter of counties with at least 1,000 residents that year, broadband reached less than 50% of the population. — Data is Plural: August 17, 2016
Links:
Tags: technology
The U.S. Environmental Protection Agency’s BEACON system contains data on more than 5,000 public beaches. For each state’s most “significant” beaches, BEACON’s downloadable reports include data on water quality, pollution advisories, closures, and more. Of these highly-visited beaches, the longest — at nearly 24 miles — is the Oregon Dunes National Recreation Area’s South Jetty, also home to “the largest expanse of coastal sand dunes in North America.” — Data is Plural: August 17, 2016
Links:
Tags: environment
PhysioNet has published sound and data files for more than 3,000 heart recordings (a.k.a. phonocardiograms). The files support PhysioNet’s 2016 contest, which seeks algorithms that can detect abnormal heart sounds. [h/t Joe Isaacson] — Data is Plural: August 10, 2016
Links:
Macrostrat.org provides data and maps on thousands of geologic formations around the world. The database currently includes 1,474 “regional columns,” 33,903 “rock units,” and 1,750,044 “geologic map polygons.” You can also explore the data through the University of Minnesota’s “Flyover Country” iOS and Android apps. [h/t Grant J. Smith] — Data is Plural: August 10, 2016
Links:
Tags: science
The Centers for Medicare & Medicaid Services evaluates hospitals on dozens of measures — relating to safety, timeliness of care, patient satisfaction, and more — and publishes the results online as the “Hospital Compare” dataset. The dataset also includes an overall score, which distills each hospital’s results into a single five-star rating. If you don’t want to download the data, you can explore the results online. [h/t Drew Ivan] — Data is Plural: August 10, 2016
Links:
Tags: healthcare
For more than a century, the U.S. Census collected slave population figures. An assistant professor at George Mason University has aggregated that data, and mapped it. He cautions: “Treat the Census numbers skeptically: even in the best of circumstances the Census undercounts the population.” Previously: New Orleans slave sales in the December 30 edition; slave ship voyages in the January 20 edition. — Data is Plural: August 10, 2016
Links:
Connecticut has begun publishing a daily census of every inmate held in jail while awaiting trial. Starting July 1, the database contains one row per inmate per day; each row includes basic demographic data (age, gender, race), as well as the inmate’s bond amount, main offense, and jail location. Read more at: The New Haven Independent and TrendCT. Question: This release seems unprecedented; does any other state or country publish such detailed data on pretrial inmates? [h/t Camille Seaberry] — Data is Plural: August 10, 2016
Links:
The 20 Newsgroups dataset contains 20,000 messages (including some duplicates) sent to 20 Usenet bulletin boards in 1993. Among the groups: alt.atheism, misc.forsale, sci.electronics, talk.politics.guns, and talk.politics.mideast. — Data is Plural: August 3, 2016
Links:
A group of public health researchers have estimated the average height of adults in 200 countries over the course of a century. Their calculations are based on a re-analysis of 1,472 previous studies, which collectively measured nearly 19 million participants. The resulting dataset contains annual height estimates for both men and women born each year between 1896 and 1996. During that time, South Korean women’s average height increased by approximately 8 inches, the largest gain of any group. These days, the Netherlands boasts the tallest men, and Latvia the tallest women. — Data is Plural: August 3, 2016
Links:
Tags: statistics
In May 2016, U.S. residential consumers paid an average of roughly 12.8 cents per kilowatt hour of electricity. The price was lowest in Louisiana (9.28 cents) and Washington state (9.54 cents), and highest in Hawaii (26.87 cents) and Connecticut (21.63 cents). These data-points, and more, are available through the Energy Information Administration’s electric power reports, which are updated monthly. [h/t Jordan Wirfs-Brock] — Data is Plural: August 3, 2016
Links:
Tags: energy
At least 6,913 people died while in the custody of Texas police, jails, and prisons between 2005 and 2015, according to the newly-launched Texas Justice Initiative. The data, gathered through freedom-of-information requests, contains the age, sex, and race/ethnicity of each person who died, as well as the general cause of death and a more detailed summary. Read more at: The Atlantic. Related: California’s Department of Justice publishes similar statistics and raw data. [h/t Melissa Segura + Reade Levinson] — Data is Plural: August 3, 2016
Links:
The World Health Organization publishes a slew of datasets on national vaccination rates and policies. Some facts gleaned from the data: Asked whether they provided routine vaccinations to children at school, just 55% of 191 countries that responded said they did. And: In 2015, Equatorial Guinea reported that only 26% of infants had received a first dose of measles vaccine, a lower rate than any other country’s. [h/t Philip Shemella] — Data is Plural: August 3, 2016
Links:
Tags: disease, healthcare
The Bigfoot Field Researchers Organization dubs itself “the only scientific research organization exploring the bigfoot/sasquatch mystery.” The BFRO collects and vets sighting reports, and publishes them online. (Direct link to KMZ file.) Related: “'Squatch Watch: 92 Years of Bigfoot Sightings in the US and Canada.” [h/t Joshua Stevens + Lynn Cherny] — Data is Plural: July 27, 2016
Links:
The Pacific walrus (Odobenus rosmarus divergens) accounts for the vast majority of walruses on the planet. When they’re not swimming, Pacific walruses like to rest at places called “haulouts.” A new dataset and study include details on 150 current and historic haulouts, the largest of which has been reported to attract more than 100,000 walruses. Miscellany: Three of the study’s authors work for the U.S. Department of the Interior; the fourth works for Russia’s Institute of Biological Problems of the North. [h/t Keith Collins] — Data is Plural: July 27, 2016
Links:
Tags: animals, environment
The U.S. Institute of Museum and Library Services annually collects responses from 9,000 public library systems. The results, currently available through 2013, include information about the libraries’ collection size, physical footprint, population served, hours, and more. Previously: Every known museum in the United States, featured Nov. 11, 2015. — Data is Plural: July 27, 2016
Links:
Tags: books
Transitland and TransitFeeds both aggregate data on routes, stops, and timetables from hundreds of public transit systems — from the Bay Area’s BART, to New York’s MTA, to Milan’s ATM, to Budapest’s BKK. — Data is Plural: July 27, 2016
Links:
Tags: transportation
The Global Burden of Disease dataset represents “the largest and most comprehensive effort to date to measure epidemiological levels and trends worldwide,” according to the Institute for Health Metrics and Evaluation, which runs the project. For each disease and each country, the dataset contains estimates of the total deaths, years of life lost, and years lived with disability. The estimates are currently available for 1990, 1995, 2000, 2005, 2010, and 2013. Related: “Where We Live and How We Die: What a year of death looks like around the world.” [h/t Mimi Onuoha + Data & Society] — Data is Plural: July 27, 2016
Links:
Thanks to the Paperwork Reduction Act, federal agencies must get approval from the Office of Information and Regulatory Affairs for any “information collection” (e.g., a form) that seeks 10 or more responses. You can search all information collections — under review, approved, or rejected — online, or download an XML file of all active collections. — Data is Plural: July 20, 2016
Links:
Tags: government
Today, UNESCO’s World Heritage Committee will wrap up its 40th session, during which it has “inscribed” more than 20 new awe-inspiring places around the world. Online, the organization publishes spreadsheets and map files of 1,031 heritage sites it has previously inducted. For each site, the spreadsheet tracks its location, size, date inducted, category (“cultural,” “natural,” or “mixed”), which selection criteria it met, and more. Through 2015, the countries with the largest number of heritage sites were Italy (51), China (48), and Spain (44). — Data is Plural: July 20, 2016
Links:
Tags: United Nations, mapping
The National Fire Incident Reporting System (NFIRS) is “the world’s largest, national, annual database of fire incident information,” containing about 1 million fires per year, including wildfires, structure fires, vehicle fires, and more. NFIRS data from 2013 (and prior years) are available online from FEMA. Looking for 2014’s data? The government asks you to request it via postal mail; or you could trust the copy a public safety analyst uploaded in March. (See the links at the bottom of that page.) The U.S. Fire Administration, which maintains NFIRS, publishes additional datasets, including a spreadsheet of 27,000+ fire departments and a database of on-duty firefighter fatalities. Also, the U.S. Geological Survey publishes data on current and historical wildfire perimeters. [h/t Nick Penzenstadler + Nadja Popovich] — Data is Plural: July 20, 2016
Links:
Tags: disaster
StackOverflow is a Q&A site for programmers, and part of the larger StackExchange network of Q&A communities. StackExchange publishes periodic data dumps of the network’s users, questions, answers, votes, and comments. On Monday, the company released “StackLite,” a smaller, easier-to-use slice of the data. (Even so, it contains metadata on more than 15 million questions.) If you don’t want to download anything, you can also explore and analyze the data online. [h/t David Robinson] — Data is Plural: July 20, 2016
Links:
Tags: technology
Two political science professors at the University of Kentucky are compiling a dataset of coup attempts. So far, the dataset covers both successful and unsuccessful attempts from 1950 to late 2015. During those 65+ years, coup plotters have been foiled about half the time, with 236 victories and 238 failures. According to the dataset, Bolivia’s top leaders have faced 23 coup attempts, including 11 successful overthrows — more than any other country by either metric. [h/t Arthur Charpentier] — Data is Plural: July 20, 2016
Links:
Pokéapi is an API “detailing everything about the Pokémon main game series,” including every character, evolution, battle skill, and more. The data is also available as a series of CSVs. Currently, however, the dataset doesn’t include details from the so-hot-right-now Pokémon Go game. — Data is Plural: July 13, 2016
Links:
Tags: entertainment, game
The U.S. Occupational Safety and Health Administration (OSHA) conducted 86,000 workplace inspections last year. The agency makes its inspection results — including investigations of fatal accidents and severe injuries — available in bulk and via an API. — Data is Plural: July 13, 2016
Links:
Tags: death, healthcare, injury
OpenAddresses.io is an effort to collect the official geocoordinates of all the world’s physical addresses. (These data come from “authoritative” sources, such as city governments. When Google Maps tells you the location of an address, it’s often just a very-educated guess, extrapolated from coarser data.) As of Monday evening, the project had processed 265,078,567 addresses, mostly in North America, Europe, Japan, and Australia. Related: “Open-source geo is really something right now.” — Data is Plural: July 13, 2016
Links:
Tags: mapping
Late last month, the Centers for Medicare and Medicaid Services added data from 2015 to its Open Payments database, which tracks medical companies’ payments to doctors and teaching hospitals. The payments — which include consulting fees, gifts, honoraria, meals, drinks, grants, and more — totaled more than $7.5 billion last year. Related: ProPublica’s Dollars for Docs project, which began tracking medical industry payments in 2010, long before CMS released the Open Payments database. [h/t Cat Ferguson + Chris Hamby] — Data is Plural: July 13, 2016
Links:
Tags: business, healthcare
This repository contains voting data from each of the UN General Assembly’s first 69 sessions. One spreadsheet summarizes the topic and results of each voted-upon resolution. (The dataset also indicates whether the U.S. State Department identified the vote as “important” — such as those condemning human rights violations in Syria and North Korea — in its annual Voting Practices in the United Nations report.) Another file contains each country’s individual voting decisions. [h/t David Robinson] — Data is Plural: July 13, 2016
Links:
Tags: United Nations
People point lasers at aircraft by the thousands. In 2005, the Federal Aviation Administration created a system for pilots to report “laser events,” which it says can temporarily blind crewmembers. The administration has published five years of data from the reporting system. In 2014, the most recent year available, pilots reported 3,894 laser beamings. The vast majority involved a green beam, and none were reported to have caused an injury. — Data is Plural: July 6, 2016
Links:
Property tax data in New York City is technically available to the public, but the city makes it difficult to access. So a pair of civic hackers liberated the data. Now you can download 1.1 million rows of bulk data, which details each property’s type, assessed value, taxes due, owner’s name, and more. You can also download 750,000 rows of tax exemptions and abatements. Related: “A Look at NYC’s $650 Million Property Tax Breaks Related to Religion” — Data is Plural: July 6, 2016
Links:
Tags: real estate, taxes
In a recently-updated paper, three academics say they’ve found “convincing evidence of election fraud” in federal Russian elections since 2004. To support their analyses, the researchers have published the underlying data, which includes polling station data from seven Russian elections (as well as one Polish and one Spanish election, which showed no such signs of fraud). Related: WSJ analysis of Russian parliamentary election “points to widespread fraud” (2012). [h/t Arthur Bashlykov] — Data is Plural: July 6, 2016
Links:
Last month, German investigative nonprofit Correctiv published a searchable database of 13,000 nursing homes in the country. The data are based on government inspections, and the reporters have published the raw and processed data on GitHub. Related: ProPublica’s searchable database of nursing homes in the United States and Medicare’s nursing home data. [h/t Sandhya Kambhampati] — Data is Plural: July 6, 2016
Links:
Tags: healthcare
The Correlates of State Policy Project aims to become a “one-stop shop” for data related to public policy in America’s 50 states. So far, the project is tracking 700+ aspects of each state’s laws, budgets, demographics, and more. Among the policy variables: Can pharmacies dispense emergency contraception without a prescription? Does the state ban corporal punishment in schools? and Does the state have an endangered species act? Don’t miss the codebook, which describes the data and sources in greater detail. Related: State and Local Public Policies in the United States, a similar project, for which an update to include 2014 data is “underway.” [h/t Rob Gillezeau] — Data is Plural: July 6, 2016
Links:
Tags: government
“Most people find this website because they are searching for the source of an unusual low frequency sound.” The World Hum Database currently includes more than 10,000 reader-submitted reports, including recent submissions that describe the noise as sounding “like a fridge,” “like a train in the distance,” and “like a cicada that never shuts up.” [h/t Susie Cambria] — Data is Plural: June 22, 2016
Links:
Tags: audio
The U.S. Census Bureau’s Annual Characteristics of New Housing culls data on features such as square footage, wall material, number of bedrooms, and number of fireplaces. (Air conditioning was present in 93% of new single-family homes built in 2015, up from 49% in 1973.) Related: “Houses Keep Getting Bigger, Even as Families Get Smaller.” [h/t Lindsey Cook] — Data is Plural: June 22, 2016
Links:
Tags: real estate
Next month, thousands of adrenaline junkies will gather in Pamplona for the city’s annual Running of the Bulls. The San Fermin festival, which organizes the spectacle, publishes injury data on its website. (Here’s a shortcut to display every year of data, instead of one year at a time.) Last year, the bulls gored 10 runners and injured another 27. Related: “Your Chances Of Being Gored By A Bull In Pamplona Are Getting Higher.” — Data is Plural: June 22, 2016
Links:
Earlier this month, researchers published “the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000,” along with a detailed methodology. The dataset digitizes and geocodes population numbers originally tabulated by historian Tertius Chandler (Four Thousand Years of Urban Growth) and political scientist George Modelski (World Cities: -3,000 to 2,000). Though “far from comprehensive,” the authors say that the dataset is a “first step towards understanding the geographic distribution of urban populations throughout history.” Related: “Watch 6,000 years of urbanization taking over the world.” — Data is Plural: June 22, 2016
Links:
Last week, the Internal Revenue Service released a huge dataset of nonprofits’ annual Form 990 filings, which provide details on program expenses, salaries, and more. More than 60% of Form 990s are filed digitally, according to the IRS. Previously, those forms were only available as images; now the IRS is publishing them as analysis-friendly XML files. (You can also download the data in bulk from the Internet Archive, thanks to Carl Malamud, the public domain advocate who led the fight for 990s-as-XML.) One early observer noted that some of the data was misformatted, and has provided instructions for fixing it. [h/t Andrew Sullivan + Kendall Taggart] — Data is Plural: June 22, 2016
Links:
Tags: taxes
The National Oceanic and Atmospheric Administration’s Fisheries Statistics Division provides data on seafood caught by U.S. commercial fisheries, sliceable by month, species, and fishing gear. You can learn, for example, that these fisheries caught 88,893,305 pounds of Dungeness crab in 2006 — the highest recorded total since at least 1950. [h/t Gwynn Guilford] — Data is Plural: June 15, 2016
Links:
PhishTank is a clearinghouse that tracks thieves’ attempts to steal personal information and online credentials. The website also publishes bulk data on all verified phishing attempts — 44,000 and counting. With more than 1,000 phishing attempts recorded against it, PayPal is the single most-targeted website in the database. [h/t Herman Slatman] — Data is Plural: June 15, 2016
Links:
Tags: crime, technology
Last month, the World Health Organization released its latest update to the Global Urban Ambient Air Pollution Database, which now covers nearly 3,000 cities in 103 countries. For each city, the dataset includes annual average density of two key categories of particulates (PM2.5 and PM10), as well as details regarding the data collection. According to the organization’s own analysis, “98% of cities in low and middle income countries with more than 100,000 inhabitants do not meet WHO air quality guidelines.” Related: “A New Air Pollution Database Is Good, but Imperfect.” — Data is Plural: June 15, 2016
Links:
Tags: climate, environment
The Chronicle of Higher Education has been tracking federal investigations into sexual assault on college campuses. Recently, The Chronicle added an API, so that developers and data analysts can access the data more easily. Currently, the dataset includes 292 investigations conducted since April 2011 — 49 of which have been resolved. [h/t Jon Davenport] — Data is Plural: June 15, 2016
Links:
On everypolitician.org, you can search and download data on 70,000+ legislators (past and present) from 233 countries. (Among those missing: Cuba, Ethiopia, and Qatar.) The dataset includes each lawmaker’s party affiliation, years served, gender, social media profiles, and more. Related: Every member of the United States Congress since 1789. — Data is Plural: June 15, 2016
Links:
Statistics grad student Kaylin Walker scraped 50 years of Billboard’s “Year-End Hot 100” rankings and those songs’ lyrics. Related: Walker’s analysis and methodology. [h/t Melissa Bierly] — Data is Plural: June 8, 2016
Links:
In 2006, Netflix launched a $1 million challenge to beat the company’s movie-recommendation algorithm. In 2009, Netflix awarded the prize to a group of AT&T scientists (though ultimately didn’t use the winning algorithm). The challenge, which was open to the public, was based on a dataset of 100 million ratings from 480,000 (anonymized) users, corresponding to more than 17,000 movies between Oct. 1998 and Dec. 2005. The dataset, once hosted at UC Irvine, is currently available through the Internet Archive. Previously: MovieLens, featured Jan. 27. [h/t Brandon Loudermilk] — Data is Plural: June 8, 2016
Links:
The Sunlight Foundation’s Hall of Justice brings together “nearly 10,000” criminal justice datasets and research documents from across the United States. You can search for topics and filter by geography, publisher, and accessibility (open, open-but-not-machine-readable, restricted access, et cetera). Related: Sunlight’s “lessons learned from a year of opening police data.” [h/t Susie Cambria + Noah Veltman] — Data is Plural: June 8, 2016
Links:
The U.S. government maintains a “judgment fund,” which it uses to pay plaintiffs when federal agencies lose in court (or settle “actual or imminent lawsuits”). The Department of the Treasury, which administers the fund, publishes data on these payouts for each fiscal year going back to FY2006. [h/t CJ Ciaramella] — Data is Plural: June 8, 2016
Links:
Researchers in Europe have published a database of 216 nuclear energy accidents — a compendium they say is “twice the size of the previous best data set.” For each accident, the database contains the date, location, description, and four measurements of severity: its ratings on the International Nuclear Event Scale and on the Nuclear Accident Magnitude Scale, the number of fatalities, and total monetary cost. (The three most expensive: Chernobyl, Fukushima, and a 1995 accident at Japan’s Monju Nuclear Power Plant, estimated to have caused $15.5 billion in damages.) [h/t Dad] — Data is Plural: June 8, 2016
Links:
BrickLink is a website for buying and selling LEGOs. It also happens to publish a (nearly?) complete inventory of every LEGO set and piece produced since 1949. Related: LEGO sets have become increasingly violent, according to a recent study. [h/t Lindsey Cook] — Data is Plural: June 1, 2016
Links:
Tags: entertainment
The Scripps National Spelling Bee publishes the competition’s results online, but not in any analysis-friendly format. Thankfully, statistician Christopher Long has scraped and spreadsheet-ified the Scripps results going back to 1996 – including last week’s finals. Related: FiveThirtyEight uses the data to ask, “Where Do Spelling Bee Words Come From?” — Data is Plural: June 1, 2016
Links:
Tags: language, statistics
The United Nations University’s World Income Inequality Database contains historical Gini coefficients for more than 170 countries — in some instances stretching back to the 1930s or ’40s. The latest version of the database was released in October 2015 and includes key details about each estimate, such as the name of the primary source and the quality of data collection. — Data is Plural: June 1, 2016
Links:
Tags: United Nations, statistics
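For readers unfamiliar with the Gini coefficient mentioned above, here is a minimal Python sketch of one common way to compute it from a list of incomes. The function name `gini` and the sample figures are illustrative assumptions; this is not the database's own estimation method, which relies on survey sources and adjustments that are not reproduced here.

```python
def gini(values):
    """Return the Gini coefficient of a list of non-negative incomes.

    0 means everyone has the same income; values closer to 1 mean
    income is concentrated in fewer hands.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted incomes (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


if __name__ == "__main__":
    # Hypothetical incomes, for illustration only.
    print(gini([20, 30, 50, 100]))
```

Sorting first lets the formula use each income's rank, which avoids comparing every pair of incomes.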
Between 2002 and 2004, researchers surveyed more than 9,500 farming households in 11 African countries to better understand how climate change might affect agricultural practices. Last month, they published the detailed results and documentation in Scientific Data. The dataset includes responses to questions about plantings, harvests, yields, water sources, animal purchases, taxes paid, and much more. — Data is Plural: June 1, 2016
Links:
In 2014, approximately 22 million U.S. military veterans were still alive, including 1 million who served in World War II, 7.2 million who served during the Vietnam War era, and 3.9 million who have served in post-9/11 wars. Those numbers come from the VA’s National Center for Veterans Analysis and Statistics, which publishes estimates and future-projections of the country’s veteran population. You can explore the data by age, race, ethnicity, gender, military branch, state, county, era of service, and more. (To see the files, click on the “Population Tables” header.) [h/t Charles Worthington] — Data is Plural: June 1, 2016
Links:
The Photographers’ Identities Catalog aggregates data on more than 110,000 photographers and photo studios throughout history. The information “has been culled from trusted biographical dictionaries, catalogs and databases, and from extensive original research” by the New York Public Library’s photography experts. The catalog — which includes data on gender, geography, range of years active, and more — is available as raw CSVs on GitHub. — Data is Plural: May 25, 2016
Links:
The Major League Soccer Players Union publishes salary data going back to 2007, and released 2016’s figures last week. (At $7.17 million in total compensation, Orlando City’s Kaká ranks as the league’s highest-paid player.) The MLSPU publishes the data as PDFs; I’ve converted those PDFs into CSVs for you. [h/t Rose Eveleth + John Templon] — Data is Plural: May 25, 2016
Links:
To help understand San Francisco’s soaring real estate prices, Eric Fischer transcribed decades of apartment and house listings in the San Francisco Chronicle. For each year from 1948 through 1979, Fischer jotted down every monthly rent advertised in the paper on the first Sunday in April. (Similar data for 1979 through 2001 is available from San Francisco’s Housing Study DataBook.) The transcriptions are available on GitHub. [h/t Kendall Taggart + Michael Andersen] — Data is Plural: May 25, 2016
Links:
“There’s software used across the country to predict future criminals. And it’s biased against blacks,” a ProPublica analysis has found. The investigation focused on risk assessments and recidivism in Broward County, Florida, and found that black defendants were more likely than white defendants to be mislabeled as “high risk.” The reporters have published their methodology, code, and the underlying data — including two years of Broward County risk assessments — on GitHub. — Data is Plural: May 25, 2016
Links:
Tags: crime, justice, technology
Governments around the world have used “LiDAR” — a laser-powered surveying technology — to build impressively precise elevation maps. In many cases, they’ve also released these topographic datasets to the public. The U.S., for instance, publishes gobs of LiDAR data through the Interagency Elevation Inventory. And you can also find LiDAR datasets for the United Kingdom, Spain, Finland, Slovenia, Denmark, Switzerland, the Netherlands, and New York City. Related: Using LiDAR data to print a 3D map of London. — Data is Plural: May 25, 2016
Links:
Tags: mapping
The MusicBrainz database contains metadata on more than one million artists, 16 million recordings, and 900,000 pieces of cover art. You can download the data in bulk or query it via an API. Previously: The smaller-but-more-detailed Million Song Dataset, featured Feb. 10. [h/t Geoff Boeing] — Data is Plural: May 18, 2016
Links:
Tags: audio, entertainment, music
I Quant NY author Ben Wellington recently discovered that New York City had been “ticketing legally parked cars for millions of dollars a year.” To reach that finding, Wellington analyzed three years of parking tickets, amounting to more than 30 million summonses. NYC isn’t alone in providing parking ticket data; Philadelphia, Toronto, Baltimore, Seattle, and others publish similar datasets. — Data is Plural: May 18, 2016
Links:
Tags: statistics
The United Nations publishes estimates of the number of foreign-born residents living in every country. The figures cover 1990 to 2015, at five-year intervals. The Vatican (100% foreign-born) and the United Arab Emirates (88%) had the highest proportion of immigrant residents in 2015; the U.S. (46.6 million) boasted the largest total immigrant population. The dataset also includes estimates by age, sex, and country of origin. Previously: Refugees in America, featured Nov. 25, 2015. [h/t Manu Balachandran] — Data is Plural: May 18, 2016
Links:
The Paleobiology Database, run by a non-profit group of researchers, has aggregated data on more than a million fossils from all around the world. You can access the dataset — organized by species, era, and location — via an interactive map, download form, or API. — Data is Plural: May 18, 2016
Links:
To help monitor drug safety, the FDA collects “adverse event” reports submitted by patients, doctors, and manufacturers. You can download the (anonymized) reports from the FDA directly, but that dataset includes duplicate cases, and sometimes calls the same drug by different names. A group of researchers recently announced that they’ve cleaned up the data — removing duplicates and standardizing nomenclature — so that you don’t have to. The resulting dataset covers 4,245 drugs, more than 17,000 types of reactions, and nearly 5 million case reports. Previously: The SIDER database of pharmaceutical side effects, featured Nov. 11, 2015. — Data is Plural: May 18, 2016
Links:
Tags: drugshealthcare
In response to a freedom-of-information request, the NYC Department of Buildings provided WNYC with a spreadsheet of 76,088 “registered elevator devices” in the city. Elevators and escalators dominate the list, but you’ll also find dumbwaiters, handicap lifts, and a few other vertical transporters. The spreadsheet includes data on location, speed, maximum capacity, floors served, and more. Related: FiveThirtyEight analyzed the data last week. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle + John Templon] — Data is Plural: May 11, 2016
Links:
Tags: mapping, statistics
An international network of researchers who study noncommunicable diseases estimates the annual prevalence of obesity and diabetes for approximately 200 countries and territories around the world. The data currently covers 1975–2014 and is based on 2,000+ surveys, according to the group. Related: Bloomberg’s chart and maps of the data. — Data is Plural: May 11, 2016
Links:
Tags: healthcare
The Institute for Cannabis (established in 1985 as The Institute for Hemp) has obtained, via FOIA, the U.S. Drug Enforcement Administration’s list of organizations licensed to handle marijuana — or, as the license application form calls it, “marihuana.” Many of the nearly 3,000 licensees are law enforcement organizations, but universities, pharmacies, and hospitals also pepper the list. [h/t Michael Ravnitzky] — Data is Plural: May 11, 2016
Links:
Tags: drugs
Since 2009, NASA’s Kepler spacecraft has been looking for Earth-like exoplanets — i.e., planets outside our solar system. Through the NASA Exoplanet Archive, you can explore, filter, and download databases of “candidate” and “confirmed” exoplanets, including Kepler’s discoveries. [h/t David Kipping] — Data is Plural: May 11, 2016
Links:
Tags: science
On Monday, the International Consortium of Investigative Journalists released data on 210,000 companies, trusts, and funds named in the massive Panama Papers leak. The database is searchable online and downloadable as several CSV files. The dataset includes companies’ officers, registered addresses, and middlemen. It supplements a pre-existing cache of 105,000 companies named in ICIJ’s 2013 “Offshore Leaks” investigation. — Data is Plural: May 11, 2016
Links:
Climate scientists have compiled a dataset of grape-harvest-dates from 380 European vineyards, across 27 regions, and stretching back 650 years. The earliest data-point refers to a Burgundy harvest in 1354. Related: The original academic paper. [h/t Martín González] — Data is Plural: May 4, 2016
Links:
OpenFootball collects and publishes results and rosters from national and international soccer/football matches, including the Premier League and the World Cup. Related: English soccer/football results, 1871–2014. [h/t Wendy Mak] — Data is Plural: May 4, 2016
Links:
Tags: sports
The National Practitioner Data Bank tracks medical malpractice payments, license suspensions, Medicare expulsions, and other lists of penalized physicians. The public use data file includes dozens of details per entry but excludes the part that is almost certainly most important to patients: the doctors' names. Related: “Doctors perform thousands of unnecessary surgeries,” according to a 2013 USA Today investigation that relied partly on the NPDB. — Data is Plural: May 4, 2016
Links:
Tags: healthcare
Scholars at Virginia Commonwealth University have identified and mapped the locations of 2,000 KKK branches active in the early 20th century. The dataset contains the city, state, earliest-known-date, and sources for each “klavern.” Related: “Active Hate Groups in the United States in 2015,” a report by the Southern Poverty Law Center. [h/t K Reed] — Data is Plural: May 4, 2016
Links:
UPDATE: Check out the Southern Poverty Law Center's Hate Map, which includes a link for downloading data from 2000 to 2018.
Sci-Hub bills itself as “the first pirate website in the world to provide mass and public access to tens of millions of research papers.” Who’s downloading papers from the site? “Everyone,” Science magazine concluded after analyzing data culled from six months of Sci-Hub server logs. For every download, the dataset identifies the paper downloaded, the date and time, an anonymized version of the downloader’s IP address, and a rough location. [h/t Melissa Bierly + Tom Grahame] — Data is Plural: May 4, 2016
Links:
Tags: science
The Star Wars API provides programmatic access to data about every character, species, spaceship, planet, and film in George Lucas’ cinematic universe. You can also download JSON files containing all the data. [h/t Robin Sloan] — Data is Plural: April 27, 2016
Links:
Tags: entertainment, movies
The U.S. House of Representatives requires all staff to reveal all “gift travel” — i.e., “free” trips that the government didn’t pay for. The Office of the Clerk compiles those filings into a database containing each trip’s dates and sponsors. (The Consumer Electronics Show paid for 49 staffers and one congressman to visit the Las Vegas convention in January.) The Senate publishes similar data, except it doesn’t include the sponsor name ... which kind of undermines the entire point. [h/t John Stanton] — Data is Plural: April 27, 2016
Links:
Tags: politics
The Bureau of Transportation Statistics requires the nation’s largest airlines to report scheduled and actual timing data for every domestic flight. The corresponding database includes information about delays, cancellations, and diversions, among other fields — and goes back to 1987. In January 2016, departing flights taxied for an average of 16 minutes, a minimum of 1 minute, and a maximum of 2 hours, 38 minutes. Related: “Which Flight Will Get You There Fastest?” [h/t Tom Augspurger] — Data is Plural: April 27, 2016
Links:
Tags: transportation
Last week, the researchers at CERN’s Compact Muon Solenoid Experiment released more than 300 terabytes of data. The datasets include raw particle-detection data from the Large Hadron Collider, as well as pre-processed datasets the researchers say “can be readily analysed by university or high-school students.” [h/t Dad] — Data is Plural: April 27, 2016
Links:
Tags: science
On Saturday, BuzzFeed hosted a FOIA data hackathon. Participants used datasets — from MuckRock, FOIA Machine, FOIA Mapper, and FOIA.gov — to analyze federal, state, and local responsiveness to public records requests. The first three datasets contain details about individual FOIA requests and responses; FOIA.gov provides aggregate internal data from federal agencies. — Data is Plural: April 27, 2016
Links:
Tags: journalism
The U.S. Alcohol and Tobacco Tax and Trade Bureau publishes a few permit datasets, including this table of 1,900+ businesses licensed to produce and/or bottle liquor. [h/t Maggie Lee] — Data is Plural: April 20, 2016
Links:
Tags: alcohol
Researchers have analyzed 15 years of satellite imagery to create a nearly-global dataset of seasonal cloud coverage. The data — available at a kilometer-square resolution — could help scientists monitor and predict changes in ecosystems. [h/t Grant Smith + Joanna Klein] — Data is Plural: April 20, 2016
Links:
Tags: climate
Baseball season is in full swing, basketball and hockey playoffs have begun, and the NFL draft is nigh. No better time to highlight some cricket data! Cricsheet.org has gathered ball-by-ball data on more than 2,700 matches played since the mid-2000s. Looking for historical data? A new GitHub repository contains stats for more than 40,000 matches going back to 1773 (but mostly since the 1970s), scraped from ESPN Cricinfo. Related: How, statistically, the coin toss affects who’ll win. [h/t Derek Willis] — Data is Plural: April 20, 2016
Links:
Tags: sports, statistics
Last week, the Bureau of Labor Statistics published its midyear update to the Consumer Expenditure Survey. The survey collects data on spending, income, and a handful of characteristics about U.S. consumers. One tidbit: On average, Americans are spending approximately 33% of their income on housing, and a tad less than 1% on alcohol. [h/t Nathan Yau] — Data is Plural: April 20, 2016
Links:
An under-scrutinized quirk in a little-known, widely-used database “turned a random Kansas farm into a digital hell.” How? The database contains best-guess geographic coordinates for every IP address on the internet. But for millions of IP addresses, the best guess is just somewhere in the United States. And, until recently, the database translated that vague location into the latitude and longitude of a farm in Potwin, Kansas. (Now it points to a lake.) — Data is Plural: April 20, 2016
Links:
Tags: mapping
The Federal Aviation Administration maintains a database of all non-military aircraft registrations, which includes extensive details about each plane/helicopter/glider/blimp and their owners. Related: “Spies In The Skies.” [h/t Peter Aldhous] — Data is Plural: April 13, 2016
Links:
Tags: transportation
Over the weekend, Hannah Anderson and Matt Daniels published an interactive analysis of male and female speaking roles in 2,000 movie scripts. Among their findings: 308 scripts gave 90%+ of the film’s dialogue to men, while just 8 scripts did so for women. The duo has also released “as much data as we can share (without getting sued)” on GitHub. — Data is Plural: April 13, 2016
Links:
The Health Inequality Project calculates American life expectancies by income, gender, and geography. You can download the data at the national, state, county, and “commuting zone” levels. Where do poor Americans live the longest? New York City, Santa Barbara, and San Jose. [h/t Margot Sanger-Katz] — Data is Plural: April 13, 2016
Links:
CourtListener gathers and publishes bulk data from the Supreme Court, all federal appeals courts, and hundreds of other jurisdictions. The files include opinions, audio from oral arguments, dockets, and citations. It also has an API. (If you register, you can also create and explore networks of citation-linked cases.) [h/t Jeff Grove] — Data is Plural: April 13, 2016
Links:
Tags: law
To create the most detailed measurements of global rainfall ever, researchers at UC Santa Barbara’s Climate Hazards Group harmonize data from satellites and on-the-ground weather stations. The dataset, known as CHIRPS, stretches back more than 30 years and is freely available. Related: Eric Holthaus provides more details and explains why the dataset is so important. [h/t Dave Riordan] — Data is Plural: April 13, 2016
Links:
Tags: climate
An API Of Ice And Fire lets you fetch data about every book, character, and house in Game of Thrones — including allegiances, family trees, and dates of death. You can also download the data in bulk. Related: Macalester researchers recently published a network analysis (and underlying data) of all characters in A Storm of Swords, the third book in the series. Jon Snow, according to the analysis, was the second-most important character. [h/t Melissa Bierly] — Data is Plural: April 6, 2016
Links:
When physician John Snow constructed his now-famous dot-map of London’s Broad Street cholera outbreak in the 1850s, the leading geospatial technologies were ink and paper. Academic Robin Wilson has adapted the data for the computer age, converting Snow’s map into several modern GIS formats. Related: Infographics in the Time of Cholera. — Data is Plural: April 6, 2016
Links:
Hacker News’ official API provides data describing every submission, comment, and user on the community-driven website. You can also analyze the full dataset via Google’s recently-relaunched BigQuery Public Datasets program. [h/t Michael Gardiner] — Data is Plural: April 6, 2016
Links:
Tags: technology
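As a rough illustration of what working with the Hacker News API can look like, here is a minimal Python sketch. The base URL and the `item/<id>.json` layout reflect the API's public documentation as I understand it, and the field names (`type`, `by`, `title`) should be checked against that documentation before relying on them. The helper names `item_url`, `fetch_item`, and `summarize` are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the official Hacker News API; verify against its docs.
BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """Build the URL for a single item (story, comment, job, poll)."""
    return f"{BASE_URL}/item/{item_id}.json"


def fetch_item(item_id, timeout=10):
    """Download one item and decode its JSON body into a dict."""
    with urlopen(item_url(item_id), timeout=timeout) as response:
        return json.load(response)


def summarize(item):
    """Return a one-line description; missing fields fall back to defaults."""
    return f'{item.get("type", "?")} by {item.get("by", "?")}: {item.get("title", "")}'


if __name__ == "__main__":
    # Requires network access; the item ID is an arbitrary example.
    print(summarize(fetch_item(8863)))
```

Fetching items one at a time is fine for spot checks; for analysis across the whole site, the bulk BigQuery copy mentioned above is likely the more practical route.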
Every January, at the behest of the U.S. Department of Housing and Urban Development, volunteers across the country attempt to count the homeless in their communities. The result: HUD’s “point in time” estimates, which are currently available for 2007–2015. The most recent estimates found 564,708 homeless people nationwide, with 75,323 of that count (more than 13%) living in New York City. Related: “Why counting America’s homeless is both imperative and imperfect.” Also related: “How Many Street Homeless? NYC’s Tallies Leave the Question Open.” [h/t Tim Henderson + Jonathan Stray] — Data is Plural: April 6, 2016
Links:
Tags: aid
The citybik.es API provides access to live data on every bike-sharing station in more than 400 cities around the world. It’s free, and the underlying software is open-source. What data you get per station depends on the city, but typically includes the number of empty slots, number of available bikes, and location information. Looking for bulk data on bike-sharing rides? Many cities — including New York, Chicago, and D.C. — make it available. Related: “A Tale of Twenty-Two Million Citi Bike Rides.” Also related: Three maps illustrating the gender gap in bike-share usage. — Data is Plural: April 6, 2016
Links:
Tags: transportation
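To make the per-station data described above more concrete, here is a small Python sketch that summarizes a list of station records. The field names `free_bikes`, `empty_slots`, and `name` are assumptions about how the API labels the values mentioned in the description, and the sample stations are invented; check a live response before reusing this.

```python
def summarize_stations(stations):
    """Aggregate bike and slot counts across a list of station dicts.

    Missing or null counts are treated as zero, since the description
    notes that the available fields vary by city.
    """
    bikes = sum(s.get("free_bikes") or 0 for s in stations)
    slots = sum(s.get("empty_slots") or 0 for s in stations)
    without_bikes = [
        s.get("name", "unknown")
        for s in stations
        if (s.get("free_bikes") or 0) == 0
    ]
    return {
        "stations": len(stations),
        "free_bikes": bikes,
        "empty_slots": slots,
        "stations_without_bikes": without_bikes,
    }


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    sample = [
        {"name": "Station A", "free_bikes": 4, "empty_slots": 11},
        {"name": "Station B", "free_bikes": 0, "empty_slots": 15},
        {"name": "Station C", "free_bikes": 7, "empty_slots": None},
    ]
    print(summarize_stations(sample))
```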
In 1999, the USDA Economic Research Service published a “natural amenities scale,” which rated every county in the contiguous United States based on factors such as landscape variation and January sunniness. Last year, based on the dataset, a Washington Post reporter called Minnesota’s Red Lake County “the absolute worst place to live in America.” Now, he’s moving there. [h/t Jody Avirgan] — Data is Plural: March 30, 2016
Links:
Tags: environment
Open Food Facts is a crowdsourced database of food products’ nutrition data and ingredient lists. (E.g., this kilogram jar of Nutella contains 316 grams of fat.) The entire database can be downloaded in several formats. — Data is Plural: March 30, 2016
Links:
Tags: food
Based in large part on Encyclopedia Titanica, researchers have compiled a structured dataset of 1,309 passengers on the RMS Titanic’s maiden voyage. (To get the data, download titanic3.csv on this page.) The dataset includes passengers’ names, ages, ticket fare, cabin number, and whether they survived. — Data is Plural: March 30, 2016
Links:
Tags: history
Researcher Gwern Branwen has assembled an archive of listings posted to “dark net markets.” Silk Road is the best-known among the group, but the collection covers scores of other markets, including Amazon Dark and FreeBay. The materials gathered from each site are slightly different; many include product advertisements and seller profiles. Warning: Some of the archives contain pictures, which may include offensive or disturbing imagery. And it’s probably wise to heed Gwern’s caveats: The scrapes “are large, complicated, redundant, and highly error-prone. They cannot be taken at face-value.” [h/t Mike Sconzo] — Data is Plural: March 30, 2016
Links:
Tags: crime, economics, technology
Want to fly a drone in the United States for non-recreational purposes? You’ll need a “Section 333” exemption from the Federal Aviation Administration, which governs drone activity. The FAA publishes a list of approved exemptions, which Bard College’s Center for the Study of the Drone has converted into a PDF-formatted database. The Verge, in turn, has converted that PDF into an easy-to-use CSV. Related: Last week, the FAA updated its dataset of unmanned aircraft sightings. [h/t Dan Vergano] — Data is Plural: March 30, 2016
Links:
Tags: technology
NYC’s 311 dataset contains a special category for rat sightings. This slice of data, which is updated daily and stretches back to 2010, contains more than 73,000 rows. One-third of sightings have occurred in Brooklyn. Related: An academic study of NYC rat sightings. Also related: Reply All #56 — “Zardulu”. — Data is Plural: March 23, 2016
Links:
Tags: animals
The USDA Economic Research Service’s County Typology Codes categorize each U.S. county based on (a) its dependence on certain industries and (b) various socio-economic factors. For example, the data classifies 219 counties as “mining-dependent.” [h/t Steven Romalewski] — Data is Plural: March 23, 2016
Links:
Tags: economics
The U.S. National Water Level Observation Network tracks water levels at hundreds of tide gauges around the country. The data is available via an API. Related: Water’s Edge, a 2014 Reuters investigation based on the gauge data. Also related: The Advanced Hydrologic Prediction Service’s flood observations and warnings, as structured data. [h/t Ryan McNeill] — Data is Plural: March 23, 2016
Links:
The UK’s Price Paid Data contains virtually all of the country’s residential property sales, with only a few exceptions. (Sales forced under court order are excluded, for example.) Each row includes the sale price, address, property type, and more. The full, multi-gigabyte dataset covers all sales since 1995, but you can also download files for individual years or the most recent month, or just search the dataset online. Related: Where can you afford to buy a house? [h/t Helena Bengtsson] — Data is Plural: March 23, 2016
Links:
Tags: real estate
The Oklahoma Geological Survey Observatory’s “Catalog of Nuclear Explosions” contains a “nearly complete” list of such detonations — more than 2,000 of them between 1945 and 2006. The dataset roughly (but not precisely) overlaps with the explosions listed in the Stockholm International Peace Research Institute’s “Nuclear Explosions, 1945–1998” (PDF) report. Both datasets list the date and location of each explosion, the country responsible, the detonation site, and (where known) its explosive yield, among other variables. And both reports use unconventional formatting, so I’ve extracted a couple of CSVs for you. — Data is Plural: March 23, 2016
Links:
If you’re looking for historical data on baseball teams, players, salaries, or managers, Sean Lahman’s Baseball Archive likely has it. The archive was updated with data from the 2015 season last week. Related: Retrosheet’s game logs — a record of every major league game since 1871. [h/t Joe Murphy] — Data is Plural: March 9, 2016
Links:
Tags: sports
The cruciverb industry is facing its first major plagiarism scandal, unearthed thanks to a newly published database of crosswords that are at least 25% similar to previously published puzzles. — Data is Plural: March 9, 2016
Links:
With the help of volunteers, the New York Public Library is transcribing 6,000+ mortgage and bond ledgers from Emigrant Savings Bank, founded in 1850 and the oldest such bank in the city. You can search the transcribed records, or download the (very) raw data. — Data is Plural: March 9, 2016
Links:
Tags: history, money, real estate
The Sunlight Foundation’s Capitol Words project lets you explore the frequency of words and phrases in the Congressional Record since 1996. For example: “weapons of mass destruction”, “war” vs. “peace”, or “Obamacare”. The underlying data is available via an API. — Data is Plural: March 9, 2016
Links:
Researchers have compiled a multi-decade database of the super-rich. Building off the Forbes World’s Billionaires lists from 1996–2014, scholars at Peterson Institute for International Economics have added a couple dozen more variables about each billionaire — including whether they were self-made or inherited their wealth. (Roughly half of European billionaires and one-third of U.S. billionaires got a significant financial boost from family, the authors estimate.) — Data is Plural: March 9, 2016
Links:
Tags: money
Through a freedom of information request, WNYC obtained four years of New York City film and television permits. The 40,000+ records, which date from October 2011 to September 2015, cover several types of permits, including those for scouting, shooting, and red carpet premieres. More: Popular TV shows’ shooting locations, mapped. [h/t John Templon] — Data is Plural: March 2, 2016
Links:
National population data is easy to find. But it’s much harder to find reliable, standardized population figures for finer-grained geographies. To that end, the World Bank has launched a pilot of its Subnational Population Database, which calculates estimates for 75 countries’ major provinces/states/regions. — Data is Plural: March 2, 2016
Links:
Tags: statistics
Congress has finally begun publishing official bulk data on the status of its bills — something open-government advocates had been requesting for more than a decade. The bulk downloads include an XML file for each piece of legislation, with indicators tracking (among other things) committee referrals and actions. Nostalgia: I’m Just A Bill. [h/t Derek Willis] — Data is Plural: March 2, 2016
Links:
Tags: money
The UK government has published data on 27 years of food consumption. The National Food Survey datasets are based on “food diaries” recorded by a sample of British families from 1974 to 2000. In addition to tracking food consumption, the data contains details about each household, including whether they kept vegetarian, had a pregnancy, and/or owned a microwave. [h/t Hannah Brooks + Sebastian Gutierrez] — Data is Plural: March 2, 2016
Links:
Tags: food
Last week, the Department of Homeland Security published more than 250 infrastructure-related datasets, which had previously been marked as "For Official Use Only." The release covers a wide range of topics, including datasets on educational facilities, hurricane evacuation routes, poultry slaughterhouses, and sports venues. (According to that dataset, the Indianapolis Motor Speedway holds more people than any other major sports venue, with a listed capacity of 257,325.) [h/t Michael Keller] — Data is Plural: March 2, 2016
Links:
Tags: infrastructure
Since 1999, Jester has been telling jokes. The website, built by UC Berkeley’s Laboratory for Automation Science and Engineering, asks you to rate its sometimes-humorous offerings, and then uses those answers to guess which of the remaining 100+ jokes you’ll like best. The UC Berkeley team behind the project has released millions of joke ratings from more than 100,000 anonymous users. [h/t Alex Gude] — Data is Plural: February 24, 2016
Links:
Tags: entertainment, language
The CDC publishes a searchable database of its cruise ship sanitation inspections — but doesn’t provide an option to download the data. Last week, an open-data enthusiast scraped the database and posted CSVs of specific deficiencies and overall inspection scores since 1990. The lowest score: The Nippon Maru’s 38 points (out of 100) in 1998. Related: ProPublica’s “Cruise Control,” a searchable database of health and safety reports. [h/t Mike Stucka + Lena Groeger] — Data is Plural: February 24, 2016
Links:
Tags: transportation
The Nuclear Latency Dataset contains “all known uranium enrichment and plutonium reprocessing facilities” built between 1939 and 2012. That amounts to 253 plants around the world, each with information on its construction timeframe, civilian-vs-military purpose, international oversight, and more. [h/t Abraham Epton] — Data is Plural: February 24, 2016
Links:
The Uppsala Conflict Data Program maintains several large, interconnected datasets describing decades of war, genocide, and other armed hostilities. Looking for a slightly less depressing experience? Try the UCDP’s dataset of 216 peace agreements signed between 1975 and 2011. [h/t Tony Gray] — Data is Plural: February 24, 2016
Links:
Tags: conflict
The Supreme Court Database is exactly what it sounds like — and definitively so. The most recent release covers all SCOTUS cases from 1946 through 2014. For each case, the database contains 247 “pieces of information,” including the source of the case, why the court agreed to hear the case, the legal provisions at play, and how each justice voted. — Data is Plural: February 24, 2016
Links:
Tags: law
For 18 years, a trap on the roof of the University of Copenhagen’s Zoological Museum lured moths, butterflies, and beetles to their early deaths. Researchers at the university counted and identified more than 250,000 specimens from 1,500+ species. The most common: Yponomeuta evonymella, a moth species also known as the bird-cherry ermine, which got trapped nearly 40,000 times. — Data is Plural: February 17, 2016
Links:
Tags: animals, statistics
Portable Game Notation, a file format used to describe chess matches, was invented in 1993. Since then, enthusiasts have created PGN files for virtually all top players’ games and every high-level tournament at sites such as PGN Mentor and Chess DB. [h/t Seth Kadish] — Data is Plural: February 17, 2016
Links:
Tags: entertainment, games
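For readers who have not opened a PGN file: each game starts with bracketed tag pairs (event, players, result), followed by the move text. The sketch below reads those tag pairs with Python's standard library. The sample game and the helper name parse_pgn_tags are illustrative assumptions, not taken from PGN Mentor or Chess DB, and the sketch ignores the moves themselves.

```python
import re

# Hypothetical sample in PGN form: tag pairs, a blank line, then the move text.
SAMPLE_PGN = '''[Event "Example Open"]
[White "Player, A"]
[Black "Player, B"]
[Result "1-0"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 1-0
'''

# A tag pair looks like: [Name "value"]
TAG_PAIR = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

def parse_pgn_tags(pgn_text):
    """Return the tag pairs of a single PGN game as a dict."""
    tags = {}
    for line in pgn_text.splitlines():
        match = TAG_PAIR.match(line)
        if match:
            tags[match.group(1)] = match.group(2)
    return tags

tags = parse_pgn_tags(SAMPLE_PGN)
print(tags["White"], "vs.", tags["Black"], "->", tags["Result"])
```

For anything beyond headers (validating moves, replaying positions), a dedicated chess library is likely a better fit than a regular expression.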
In 2011, agriculture occupied about 22% of all land in the contiguous U.S., according to the National Land Cover Database. The NLCD classifies every 30-meter-by-30-meter chunk of land into one of 16 categories, including “woody wetlands,” “cultivated crops,” and “developed” land, at different intensities. (Alaska’s unique landscape has earned it a few additional categories, such as “dwarf scrub.”) The database is presented as raster files, so you’ll need some geospatial software to dig in. [h/t Ryan McNeill] — Data is Plural: February 17, 2016
Links:
Computational linguists at Canada’s National Research Council used Mechanical Turk to crowdsource the emotional associations of 14,182 words. For each word, participants were asked whether it was “positive” and/or “negative”, and whether it was associated with any of eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The resulting Word-Emotion Association Lexicon was first published in 2010. Of the full lexicon, only two words — “treat” and “feeling” — were associated with all eight emotions. [h/t Bipul Mohanto] — Data is Plural: February 17, 2016
Links:
Tags: language
Every two years since 1991, the CDC has conducted the Youth Risk Behavior Survey, which asks high school students questions about drug use, sex, eating habits, and more. The results are available at the national, state, and district level. Results from the 2015 survey will be published in June, the CDC says. Related: Today’s teens _______ less than you did. — Data is Plural: February 17, 2016
Links:
The Million Song Dataset contains metadata and “feature analysis” (e.g., loudness, tempo, and “danceability”) for, you guessed it, one thousand-thousand songs. The full dataset occupies hundreds of gigabytes, but you can also download a 1% sample. [h/t Neal Lathia] — Data is Plural: February 10, 2016
Links:
Tags: audio, music, statistics
The Internet Archive’s Political TV Ad Archive uses audio fingerprinting to identify the campaign ads playing in key primary states. You can search the database, watch the ads, and download the data. The data file contains information about each ad’s sponsor, pro/con-ness, TV network, and time of airing. Previously: Political Ad Sleuth, featured Jan. 20. — Data is Plural: February 10, 2016
Links:
Tags: politics
The Organ Procurement and Transplantation Network, a public-private partnership, keeps records of organ donations, transplants, and waiting lists in the United States. The website’s “advanced” data tool lets you generate fairly detailed custom reports. One hitch: The site doesn’t provide an option to download the data. Data Is Plural wrote a small bit of software to fix that. — Data is Plural: February 10, 2016
Links:
Tags: healthcare
iNaturalist is a sort of social network for nature enthusiasts. Users can post photos and descriptions of birds, fish, bugs, and even mold, which experts can then help to identify. In November, the site recorded its two-millionth observation. You can explore the data via API or, with a free account, use the site’s export tool. [h/t Dan Brady] — Data is Plural: February 10, 2016
Links:
Every year, the U.S. Energy Information Administration requires thousands of power plants to report detailed data on fuel consumption and electricity generation. The datasets stretch back more than three decades, to 1989. In 2014, the most recent year available, Arizona’s Palo Verde Nuclear Generating Station generated more electricity — 32 million megawatt hours — than any other power plant in the country. [h/t Marc DaCosta] — Data is Plural: February 10, 2016
Links:
Tags: energy
The Cornell Movie-Dialogs Corpus contains 220,579 “conversational exchanges” between 9,035 characters in 617 movies. Included: “Hello. My name is Inigo Montoya. You killed my father. Prepare to die.” — Data is Plural: February 3, 2016
Links:
Next month marks the five-year anniversary of the Fukushima Daiichi disaster, the worst nuclear accident since Chernobyl. Since shortly after the meltdown, volunteers for Safecast have been collecting radiation measurements in Japan and beyond. The results are available to download or to access via API. — Data is Plural: February 3, 2016
Links:
Fears about the Zika virus — and a possible, but not proven, connection to microcephaly — are growing. Little data on the latest outbreak has been published, but here’s an open guide to what’s available so far, including reported cases of microcephaly in Brazil and the number of suspected Zika samples sent to Colombia’s national institute of health. — Data is Plural: February 3, 2016
Links:
Tags: disease, healthcare
Last month, a group of researchers introduced Pantheon 1.0, “a manually verified dataset of globally famous biographies.” It starts with 11,341 Wikipedia biography pages in 25 languages, and adds birthplace, birthdate, gender, occupations, and page views. You can download the data or explore it online. Baffling factoid: As of May 2013, High School Musical star Corbin Bleu had biographies in more language editions than anyone other than Jesus Christ and Barack Obama. Related: A broader-but-shallower dataset of more than 400,000 influential people on the English-language Wikipedia. [h/t Ben Dilday] — Data is Plural: February 3, 2016
Links:
Tags: social media, technology
The Transportation Security Administration publishes spreadsheets of legal claims against the agency, including the location, circumstances, and outcome of each claim. The most expensive settlement on record appears to involve a vehicle-related personal injury in July 2004, for which the TSA paid $125,000. On the other end of the spectrum: In 2014, a traveler recouped $1.25 for lost food or drink at Hilton Head Island Airport. [h/t Seth Kadish + Lindsey Cook] — Data is Plural: February 3, 2016
Links:
Tags: justice, transportation
The University of Edinburgh hosts an incredibly detailed and deeply documented database of more than 3,000 accused witches in Scotland. The mania reached its quantitative peak in 1662, when, according to the database, 402 people were accused of witchcraft. [h/t Felix Haass] — Data is Plural: January 27, 2016
Links:
Tags: history, statistics
Last year, more than 400,000 federal employees took the Office of Personnel Management’s annual survey, which includes questions about satisfaction, leadership, and work schedules. You can download aggregate and raw results. Important note: The survey is voluntary and non-random. — Data is Plural: January 27, 2016
Links:
Tags: statistics
MovieLens.org is a free, noncommercial movie recommender — sort of like Netflix, minus the ability to watch movies. The service is run by a research lab at the University of Minnesota. The lab publishes several datasets of user ratings and movie info. The largest contains 22 million ratings. Among movies with at least 1,000 ratings, The Shawshank Redemption has received the highest average score (4.44 of 5), while 2007’s Epic Movie has netted the lowest (1.48 of 5). — Data is Plural: January 27, 2016
Links:
Tags: entertainment, movies
Earlier this month, the American Cancer Society launched a new data dashboard. Metrics include estimated new cases, historical survival rates, and more. To download the corresponding spreadsheets, use the “tools” button on each page. [h/t Virginia Hughes] — Data is Plural: January 27, 2016
Links:
NASA collects aviation safety reports from pilots, technicians, flight attendants, and other personnel. The (anonymized) published data contains text narratives, as well as details about flight conditions and other safety factors. (“Ok, I did it; the dumbest thing I have ever done in my entire life,” one confessional begins.) You can search the database but can only download so many records at a time. And you can request the full database from NASA, but you’ll have to wait. An alternative option: There’s a copy from November on the Internet Archive. [h/t Dave Riordan + Julian Simioni] — Data is Plural: January 27, 2016
Links:
Tags: transportation
Last month, Nature Communications published a study of the “long-term neural and physiological phenotyping of a single human.” That human? Study co-author Russell A. Poldrack, “a right-handed Caucasian male, aged 45 years at the onset of the study.” The 18 months of results — tracking brain connections, food consumption, stress levels, and much more — are available to download and explore. [h/t Sune Lehmann] — Data is Plural: January 20, 2016
Links:
In 2013, Stanford University researchers published a paper examining how people’s tastes “change and evolve over time.” They drew, in part, on a dataset containing 13 years of Amazon reviews of gourmet foods. (Note: Not all foods were intended for humans.) The dataset comes in a slightly unconventional format; here’s a Python script to convert it to a TSV file. [h/t Kaggle] — Data is Plural: January 20, 2016
Links:
Tags: food
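To give a sense of what such a conversion involves, here is a minimal Python sketch. It is not the script linked above. It assumes, without confirmation from this page, that the file stores each review as a block of "key: value" lines with a blank line between reviews; the sample records and field names are made up for illustration.

```python
import csv
import io

# Hypothetical two-record sample; the layout and field names are assumptions.
SAMPLE = """product/productId: B000000001
review/score: 5.0
review/summary: Great treats

product/productId: B000000002
review/score: 2.0
review/summary: My dog refused it
"""

def parse_records(text):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            if record:
                yield record
                record = {}
            continue
        key, _, value = line.partition(": ")
        record[key] = value
    if record:
        yield record

def to_tsv(records):
    """Write the records as tab-separated text, using the first record's keys as columns."""
    records = list(records)
    fields = list(records[0]) if records else []
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                            lineterminator="\n", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

rows = list(parse_records(SAMPLE))
tsv_text = to_tsv(rows)
print(tsv_text)
```

Review text that itself contains tabs or line breaks would need extra handling before the output is safe to load into a spreadsheet.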
The FCC requires broadcasters to keep records of “all requests for broadcast time made by or on behalf of a candidate for public office.” With the help of volunteers, Political Ad Sleuth gathers those records and enters them into a searchable, downloadable database. Note: Due, in part, to the difficulty of transcribing the (non-standardized) records, the information in the database is incomplete. — Data is Plural: January 20, 2016
Links:
Tags: politics
Slate Magazine’s “The Atlantic Slave Trade in Two Minutes” — recently named a multimedia finalist for the American Society of Magazine Editors’ annual awards — tracks 20,528 transatlantic voyages over 315 years. The information comes via SlaveVoyages.org, which provides searchable, downloadable records of ships’ and captains’ names, regions where slaves were purchased and sent, and more. — Data is Plural: January 20, 2016
Links:
Researchers from Virginia Tech have joined forces with Flint, Mich., residents to sample the city’s lead-tainted water supply. In December, the researchers posted the results of 271 samples, which indicated high levels of lead contamination. The most extreme sample found a lead concentration of 158 parts per billion — 10 times higher than the EPA’s “action level.” Related: The New York Times + The Washington Post have used the data. — Data is Plural: January 20, 2016
Links:
Tags: environment
When State Department employees travel on official business abroad, they can get reimbursed — to a point — for lodging, meals, and things such as laundry. The department publishes monthly spreadsheets of the maximum per diems, which vary by location. The highest right now? The Cayman Islands ($735 per day). The lowest? Antarctica ($0/day) and Iraq ($11/day). — Data is Plural: January 13, 2016
Links:
Tags: statistics
Last year, more than 2 million people applied for new Social Security retirement and survivor benefits. When they did, they indicated their preferred language. More than 93% said English, and about 5% of applicants said Spanish — the second most popular choice. Among the 88 other options: 1,616 applicants chose American Sign Language, 32 chose Japanese, nine chose Yiddish, and one chose Swedish. — Data is Plural: January 13, 2016
Links:
Tags: language, statistics
USAID, the Peace Corps, the U.S. African Development Foundation, and other agencies report data on foreign assistance spending to ForeignAssistance.gov. The full dataset includes detailed information for each grant and contract — and comes with a data dictionary. The website also provides a chart of participating agencies and an interactive map of the data. — Data is Plural: January 13, 2016
Links:
Tags: aid, money, statistics
UPDATE: The site is no longer in beta and can be found at https://foreignassistance.gov/.
What did the world’s political boundaries look like in 1945? The lines between Swedish counties in 1968? The U.S. states in 1865? Thenmap, an open-source API and mapping tool, answers these questions and more. [h/t Carlos Matallín] — Data is Plural: January 13, 2016
Links:
The 2010 Religious Congregations and Membership Study counts, for more than 200 religious groups, the number of congregations and adherents in each U.S. state and county. In total, the study reported more than 344,000 congregations and more than 150 million adherents — nearly half of the 2010 U.S. population. New counts are published every 10 years. [h/t Julia Silge] — Data is Plural: January 13, 2016
Links:
Tags: religion, statistics
Crowdsourced from his 1983 “Motown 25” performance. [h/t Nadja Popovich] — Data is Plural: January 6, 2016
Links:
Tags: entertainment, music
The UN’s refugee agency is keeping track of daily refugee movements through Greece, Macedonia, Serbia, and farther along into Europe. The downloadable data and interactive map cover migrations since October 2015. — Data is Plural: January 6, 2016
Links:
Tags: refugees
The historically opaque New York Police Department has finally started publishing incident-level felony data — something that cities such as Chicago and Boston have done for years. The dataset includes the date, time, and approximate location of each offense. It currently covers the first nine months of 2015 and will (apparently) be updated quarterly. Don’t miss the footnotes in this PDF. Related: Some initial insights. Also related: “Which Cities Share The Most Crime Data?” [h/t Dan Nguyen + Mark Silverberg] — Data is Plural: January 6, 2016
Links:
Tags: crime
This database compares the phonological, grammatical, and lexical properties of hundreds of languages. One dataset looks at languages’ counting systems. (Many use the decimal system, but Yoruba uses the vigesimal system and Danish uses a hybrid.) Others examine the use of tone, how you say “tea”, and whether there are different words for “finger” and “hand”. [h/t Jacqui Maher] — Data is Plural: January 6, 2016
Links:
Tags: language
After it became clear that the federal government was doing an awful job of keeping track of how often police kill civilians, two newspapers started counting last year. According to The Guardian’s tally, U.S. police killed 1,136 people in 2015. The Washington Post’s count — which focused on shootings only and didn’t include off-duty officers — came to 984 deaths. Both organizations provide methodologies and downloadable datasets (including demographic and geographic details): Guardian / WaPo. — Data is Plural: January 6, 2016
Links:
Among them: 37,622 cellphones; 3,604 hats; 1,903 scarves; 1,017 birth certificates; 483 diaries; 115 VHS tapes; 82 violins; 41 GPS navigation systems; and 9 answering machines. At least one of the 2,756 umbrellas is mine. [h/t Mona Chalabi + Allison McCann + Noah Veltman] — Data is Plural: December 30, 2015
Links:
Tags: statistics
The Union of Concerned Scientists’s Satellite Database currently contains 1,305 entries and is updated “roughly quarterly.” The longest-orbiting: AMSAT-OSCAR 7, an amateur radio satellite launched in November 1974. Related: The satellites, visualized. [h/t David Yanofsky] — Data is Plural: December 30, 2015
Links:
Tags: technology
Over the weekend, the Seattle Times and BuzzFeed News published an investigation into Clayton Homes, a company that is owned by Warren Buffett's Berkshire Hathaway and that “has grown to dominate virtually every aspect of America’s mobile-home industry.” The investigation draws on data released through the Home Mortgage Disclosure Act. The law requires large lenders to publish details about each of their loans. You can download the raw data from the FFIEC, or slightly user-friendlier versions from the CFPB. [h/t Mike Baker + Dan Wagner] — Data is Plural: December 30, 2015
Links:
Tags: economics, real estate
Last week, the Centers for Medicare & Medicaid Services published a new drug-spending dataset. It focuses on medications that (a) cost the most, overall; (b) cost the most per patient; or (c) saw the largest price-hike between 2013 and 2014. Vimovo, an arthritis pain reliever, tops the price-hike rankings: Between 2013 and 2014, the average cost per unit increased more than sixfold, from $1.94 to $12.46. [h/t Virginia Hughes] — Data is Plural: December 30, 2015
Links:
Tags: business, drugs, healthcare
A new study in the American Economic Review suggests that slaveholders in the South underestimated the odds of “emancipation without compensation.” To reach its conclusions, researchers compiled a dataset of 15,377 slave sales, culled from remarkably detailed official records. Data for each sale includes demographic information about the slaves, seller, and buyer; the price paid; payment method; and researcher notes. — Data is Plural: December 30, 2015
Links:
The Unicode Consortium publishes a big ol’ HTML table of every emoji, how they look in various contexts, and when they entered the canon. The “Christmas tree” emoji occupies code point U+1F384, and was introduced in 2010. (“Menorah with nine branches” arrived in 2015.) [h/t Ben Collins] — Data is Plural: December 23, 2015
Links:
Tags: technology
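The code point cited above can be checked directly in Python, whose standard unicodedata module ships a copy of the Unicode character database. This is a small sketch that looks up the character's official name and its UTF-8 bytes; it does not read the consortium's HTML table.

```python
import unicodedata

tree = chr(0x1F384)                      # code point U+1F384, as cited above
name = unicodedata.name(tree)            # official Unicode character name
utf8_hex = tree.encode("utf-8").hex()    # the four bytes used to store it in UTF-8

print(f"U+{ord(tree):04X}", name, utf8_hex)
```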
The Forest Service has digitized many of the tree species distribution maps from Elbert Little's “Atlas of United States Trees,” first published in the 1970s. Shapefiles and PDFs are available for more than 600 species — including Ilex opaca (American holly) and Pseudotsuga menziesii (Douglas fir). — Data is Plural: December 23, 2015
Links:
Tags: environment, mapping, plants
The Wikimedia Foundation publishes hourly pageview counts for each of its articles. It’s a tremendous amount of data — about 90 megabytes, compressed, per hour. Luckily, there’s also a tool for browsing individual pages’ daily traffic stats. Last Wednesday, the English-language page for "Christmas tree" received 7,822 visits, its highest mark so far this year. — Data is Plural: December 23, 2015
Links:
The USDA’s 2012 Census of Agriculture — the most recent vintage available — tallies agricultural activity at the national, state, and county levels. You can download detailed data from the agency’s Quick Stats tool. In 2012, Oregon harvested more Christmas trees than any other state: 6.8 million of them, or 39% of the census total. [Correction, 2015-12-23: The Oregon numbers incorrectly referenced 2007 data. In 2012, Oregon harvested 6.4 million trees, or 37% of the census total. Thanks to @JoeMurph for flagging this mistake.] — Data is Plural: December 23, 2015
Links:
Tags: agriculture
Every year, the U.S. Consumer Product Safety Commission tracks emergency rooms visits to approximately 100 hospitals. The commission uses the resulting National Electronic Injury Surveillance System data to estimate national injury statistics, but it also publishes anonymized information for each consumer product–related visit, including the associated product code (e.g., 1701: “Artificial Christmas trees”) and a short narrative (“71 YO WM FRACTURED HIP WHEN GOT DIZZY AND FELL TAKING DOWN CHRISTMAS TREE AT HOME”). — Data is Plural: December 23, 2015
Links:
Tags: healthcare, injury
This dataset is fucking amazing. — Data is Plural: December 16, 2015
Links:
You’ve probably heard of PolitiFact, the Tampa Bay Times project that fact-checks what politicians say. What you might not know: PolitiFact has an API. You can use it to fetch detailed data from the project’s national and state-level editions. Related: “All Politicians Lie. Some Lie More Than Others,” PolitiFact’s top editor writes in the New York Times. — Data is Plural: December 16, 2015
Links:
Tags: politics
Last week, USA Today released its annual accounting of assistant — yes, assistant — college football coaches’ salaries. At $1.6 million per annum, Auburn’s Will Muschamp leads the pack. More than 371 assistants have salaries of $250,000+. The release complements the publication’s database of head-coaching salaries. Related: Each state’s highest paid public employee, as of 2013-ish. [h/t Steve Berkowitz] — Data is Plural: December 16, 2015
Links:
The recently-updated Randolph Glacier Inventory contains spreadsheets and outlines of every known glacier in the world. Of the 212,000+ glaciers inventoried, more than 27,000 are in Alaska. Someone please adopt Deserted Glacier. [h/t Robin Wilson’s stunningly extensive directory of free GIS data] — Data is Plural: December 16, 2015
Links:
The Department of Justice is authorized to investigate police departments that display a “pattern or practice” of civil rights violations. In April, the Marshall Project began publishing a spreadsheet of the DOJ investigations into local law enforcement. The dataset, which is updated regularly, indicates when each case began, when it ended, and what type of agreement (if any) was reached. The latest entry: An investigation into the Chicago Police Department, announced last week. Related: PBS Frontline's interactive map of DOJ investigations. [h/t Tom Meagher] — Data is Plural: December 16, 2015
Links:
The Texas Department of Licensing and Regulation maintains a webpage of well-formatted data on state-licensed workers, including tow truck operators, boxing judges, journeyman electricians, elevator inspectors, manicurists, and, yes, barbers. [h/t Ryan Murphy] — Data is Plural: December 9, 2015
Links:
Tags: statistics
The CDC’s Foodborne Outbreak Online Database (FOOD) contains 18,000+ outbreaks, which resulted in 358,000+ illnesses and 13,000+ hospitalizations, from 1998 through last year. In 2008, a multi-state Salmonella Saintpaul outbreak hospitalized 308 people — the highest count in the database. — Data is Plural: December 9, 2015
Links:
Tags: disease, food, healthcare
Gun dealers use the FBI’s National Instant Criminal Background Check System to determine whether someone is allowed to buy a firearm. There isn’t a one-to-one correlation between these background checks and gun sales, but they’re said to be the best available proxy. The FBI publishes a PDF tallying the monthly number of firearm checks for each state and type. At BuzzFeed News, we’ve parsed that PDF into a CSV/spreadsheet for easier use. — Data is Plural: December 9, 2015
Links:
Tags: guns, statistics
Last week, Data Is Plural highlighted ShootingTracker.com, a source for data on shootings that wounded at least four people. Other resources include the Gun Violence Archive and Mother Jones’ detailed database of mass shootings since 1982. The Mother Jones database takes a narrower approach, focusing on shootings that killed at least four people in a public setting. In a New York Times op-ed, published shortly after last week’s San Bernardino shooting, the editor behind that database argues that broader methodologies don’t distinguish between “a 1 a.m. gang fight” and “the madness that just played out in Southern California.” A Washington Post article weighs the pros and cons of broader and narrower approaches. [h/t Robin Shields + Mark Follman + Christopher Ingraham] — Data is Plural: December 9, 2015
Links:
Open Knowledge International has just published its latest survey of openly available government data. This year’s audit includes 112 countries and territories, up from 97 last year. The survey scores each based on the availability of datasets in 13 key categories (e.g., “election results,” “government spending,” and “pollutant emissions”) and links out to the available datasets. In this year’s survey, Taiwan ranks first, the U.K. second, and Denmark third. The U.S. ranks eighth. — Data is Plural: December 9, 2015
Links:
Tags: elections, statistics
The CelebA dataset, published in September, contains 200,000+ images of 10,000+ celebrities, each annotated with 40 yes/no variables. Some favorites: “5_o_Clock_Shadow,” “Bags_Under_Eyes,” and “Goatee.” — Data is Plural: December 2, 2015
Links:
Tags: entertainment
The Huffington Post and Chronicle of Higher Education teamed up to investigate how colleges bankroll their athletics. (Georgia State, for example, spent more than $100 million subsidizing sports between 2010 and 2014, mostly via student fees.) The report, published last week, draws on five years of revenue/expense reports from 234 Division I public universities. You can download the raw data or explore it online. Related: The Washington Post also tackled this topic — from a slightly different angle — last week, examining the profitability (or lack thereof) of athletic programs at 48 schools. [h/t Shane Shifflett] — Data is Plural: December 2, 2015
Links:
Socrata’s software powers open-data portals around the world. But downloading large datasets — e.g., this 2.8-gigabyte dataset of NYC parking tickets — from Socrata-powered portals can feel, well, sluggish. One solution: OpenDataCache.com, a free website that provides faster-to-download versions of virtually every dataset from 50+ Socrata portals. Related: Thomas Levine’s detailed analyses of Socrata-powered portals, published in 2013 and 2014. [h/t John Krauss and Steven Romalewski] — Data is Plural: December 2, 2015
Links:
Tags: statistics
ShootingTracker.com provides datasets listing all U.S. mass shootings — defined as “when four or more people are shot in an event, or related series of events” — since 2013. So far in 2015, mass shootings have killed 447 people and wounded an additional 1,292. — Data is Plural: December 2, 2015
Links:
The National Centers for Environmental Information maintains more than 20 petabytes of data, it says. Among the most useful slices is the Global Historical Climatology Network’s data, which aggregates reports on temperature, precipitation, wind, and more from tens of thousands of climate-monitoring stations around the world. One tidbit: January 1995 was Death Valley’s wettest month since at least the 1960s, with a whopping 2.59 inches of precipitation. — Data is Plural: December 2, 2015
Links:
In 2011, the New York Public Library launched a crowdsourcing project to transcribe its massive collection of restaurant menus, dating back to the 1850s. So far, volunteers have transcribed more than 1.3 million dishes, their prices, and where on the menu each dish appeared. The library publishes a spreadsheet of all the data, and updates it twice a month. Happy Thanksgiving! — Data is Plural: November 25, 2015
Links:
Tags: food
The U.S. government has one very large Google Analytics account, and has begun sharing traffic data with the public. Not every federal website is accounted for, but more than 4,000 are. Over the past 90 days, they’ve racked up approximately 1.5 billion visits. The most popular page at the time of this writing? Weather.gov. Bonus: How they built it. [h/t Rebecca Williams] — Data is Plural: November 25, 2015
Links:
Tags: statistics, technology
You can download every comment posted to Reddit since October 2007 … but you’ll need some patience and a terabyte of storage. If you’re more of the instant-gratification, don’t-have-an-external-hard-drive-lying-around type, you might enjoy FiveThirtyEight’s “How The Internet* Talks,” a sort of Google Ngrams for the Reddit data. [h/t Randall Olson and Ritchie King] — Data is Plural: November 25, 2015
Links:
The Department of State publishes demographic reports on refugee arrivals since 2002. The data includes country of origin, resettlement city and state, religion, age, gender, and more. Related: At BuzzFeed, I used the data to chart the past decade of refugee arrivals. Also related: The UN’s refugee data portal. — Data is Plural: November 25, 2015
Links:
Tags: refugees, statistics
The newly-launched Citizens Police Data Project has collected more than 56,000 allegations of police misconduct. The data, covering 2002-2008 and 2011-2015, includes demographic information about the complainant and the officer, as well as the type and location of the incident. Click here to download the raw data. Related: The City of Chicago’s wide-ranging data portal includes a spreadsheet of every reported crime in the city since 2001; you can explore neighborhood trends via the Chicago Tribune. [h/t Melissa Segura and Abraham Epton] — Data is Plural: November 25, 2015
Links:
What contains 34,052 bottles and is worth an estimated £3 million? The United Kingdom’s official wine cellar, which provides libations for the government’s guests and hosts — and a dram of data for the public. Between April 2014 and March 2015, the cellar’s clients consumed more than 5,500 bottles of wine and liquor. Among them: 205 bottles of Champagne, 51-and-a-half bottles of gin, and one bottle of Château Pichon-Longueville Comtesse de Lalande 1986. [h/t Nadja Popovich] — Data is Plural: November 18, 2015
Links:
Tags: alcohol
Under the HITECH Act of 2009, companies must notify the government of any data breach involving the HIPAA-protected health data of 500 or more people. Summaries of those reports are available at the Department of Health and Human Services’s Breach Portal, which currently contains more than 1,300 incidents. Related: In April, JAMA published an analysis of the breaches. Also related: Forty years of legislative acronyms. [h/t Virginia Hughes] — Data is Plural: November 18, 2015
Links:
Tags: healthcare
The National Registry of Exonerations contains “every known exoneration in the United States since 1989—cases in which a person was wrongly convicted of a crime and later cleared of all the charges based on new evidence of innocence.” For each of the 1,702 cases, the registry includes details about the exoneree, the crime, and the factors — such as new DNA evidence — that contributed to the exoneration. [h/t agate] — Data is Plural: November 18, 2015
Links:
The 2016 presidential hopefuls have been tweeting, ‘gramming, and ‘booking like a pack of millennials. Fusion collected nearly 70,000 images from the candidates’ social media accounts, then pumped the pictures through an automated tagging system. Now you can search for guns, money, beer and more — or download the raw data for your own analysis. — Data is Plural: November 18, 2015
Links:
Tags: politics, social media
The Arms Transfer Database tracks the international flow of major weapons — artillery, missiles, military aircraft, tanks, and the like. Maintained by the Stockholm International Peace Research Institute (SIPRI), the database contains documented sales since 1950 and is updated annually. SIPRI provides a download tool, which outputs rich-text files, but it’s also possible to download the data as CSV. [h/t Martín González] — Data is Plural: November 18, 2015
Links:
For his 1898 book, The Law of Small Numbers, statistician Ladislaus Bortkiewicz tabulated the number of Prussian cavalrymen killed by horse kicks each year between 1875 and 1894. (In total, 196 suffered that tragic fate.) The dataset is tiny, but boasts an outsized legacy: Bortkiewicz’s lethal horse kicks allegedly helped to popularize the then-obscure Poisson distribution. [h/t Noah Veltman] — Data is Plural: November 11, 2015
Links:
Tags: history, statistics
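A rough illustration of why the horse-kick counts suit the Poisson distribution: the Python sketch below turns the 196 deaths over 20 years into an average rate and prints the Poisson probability of 0 to 4 deaths in one corps in one year. The figure of 14 army corps is an assumption (it is commonly cited for the full dataset but is not stated above), and the function name poisson_pmf is only illustrative.

```python
import math

TOTAL_DEATHS = 196   # from the entry above
YEARS = 20           # 1875 through 1894, inclusive
CORPS = 14           # assumption: number of army corps, not stated above

# Average deaths per corps per year under these assumptions.
rate = TOTAL_DEATHS / (CORPS * YEARS)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

for k in range(5):
    print(k, round(poisson_pmf(k, rate), 4))
```

Comparing these probabilities with the observed share of corps-years at each death count is the classic classroom use of the dataset.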
Earlier this year, the HathiTrust Research Center released a massive dataset extracted from 4.8 million digitized volumes. For each of its 1.8 billion pages, the dataset includes word frequencies, languages used, and sentence counts, among other features. — Data is Plural: November 11, 2015
Links:
Albuquerque, New Mexico, publishes dozens of regularly-updated, well-documented datasets. Among them: government employee earnings, the number of daily visitors to the city’s swimming pools, real-time bus locations, the geography of police beats, and the city’s complete vendor checkbook. [h/t Tom Johnson, who emailed Data Is Plural to praise how Albuquerque is sharing its data: “I have not found any other city in the world doing so in such detail.”] — Data is Plural: November 11, 2015
Links:
Tags: statistics
The Side Effect Resource, a.k.a. SIDER, takes all the fine print from drug labels, and aggregates the information about side effects into a searchable, downloadable database. SIDER got a major upgrade last month, and now contains 40% more drug-effect pairs than before. The website incorporates both generic and brand names, so that searches for “Prozac” and “fluoxetine” bring you to the same page. — Data is Plural: November 11, 2015
Links:
Tags: drugs, healthcare
Good Jobs First’s Violation Tracker calls itself “the first national search engine on corporate misconduct.” The new database currently contains nearly 100,000 penalties for environmental, health, and safety violations — sourced from 13 U.S. regulatory agencies — since 2010. Search results can be downloaded as CSV files, which contain a few additional fields. (Tip: Search for “*” to get all cases.) The largest single fine? The Department of Justice’s $20.8 billion penalty this year against BP. [h/t Samuel Rubenfeld] — Data is Plural: November 11, 2015
Links:
Tags: business, crime, environment
Last May, a Gulfstream G150 taking off from Houston’s Ellington Airport struck an armadillo. The animal’s remains were collected, but were not sent to the Smithsonian Institution for identification. This anecdote comes from a single row in the Federal Aviation Administration’s Wildlife Strike Database, and draws on just seven of the 94 available fields. The database contains more than 168,000 strikes reported since 1990, almost all involving birds. Roughly 10% of the time, the animal’s remains are sent to the Smithsonian’s Feather Identification Lab. [h/t Dan Vergano] — Data is Plural: November 4, 2015
Links:
Tags: transportation
Trans-New Guinea is the world’s third-largest language family. But it’s also among the poorest-studied. TransNewGuinea.org, an online database launched in 2013, is trying to change that. It now contains more than 1,000 New Guinea languages and lists 145,000 word translations — including 1,065 entries for “dog.” It even has an API. A recent PLOS ONE journal article provides additional background and statistics. [h/t Simon J. Greenhill] — Data is Plural: November 4, 2015
Links:
Tags: language
The Bureau of Alcohol, Tobacco, Firearms, and Explosives publishes a searchable and downloadable licensing database. License-holders fall into eleven categories. Among them: run-of-the-mill dealers, ammunition manufacturers, collectors of “curios and relics,” pawnbrokers, and importers of “destructive devices.” The ATF’s website contains monthly and state-by-state archives. [h/t Marc DaCosta] [Correction, 2015-11-04: There are only nine categories of license-holders. The published ATF data includes only eight of them; it does not include "Collector of Curios and Relics." Thanks to @MikeStucka for flagging this mistake.] — Data is Plural: November 4, 2015
Links:
Tags: guns
This July, the Museum of Modern Art published a dataset containing 120,000 artworks from its catalog, joining the UK’s Tate, the Smithsonian’s Cooper Hewitt, and other forward-thinking museums. The MoMA data contains the names of the artwork and artist, the dates created and acquired, and the medium — but no images. Related: Artist Jer Thorp encourages you to “perform” the data. Also related: Every museum in the United States. [h/t Nadja Popovich] — Data is Plural: November 4, 2015
Links:
Tags: art
The 600+ entries in this searchable, sortable database range from 3M to Amazon to Zynga, and list both paid and unpaid leave. The database, run by the women-in-the-workplace website FairyGodBoss.com, culls from published policies and employee tips. An introductory blog post provides more information. — Data is Plural: November 4, 2015
Links:
Sexualitics.org is on a mission: “to contribute to human sexuality understanding through a Big Data approach.” Last year, the site posted detailed metadata on 800,000 adult videos, including titles, descriptions, view counts, and tags. It powers Porngram, an only-kinda-safe-for-work charting tool. — Data is Plural: October 28, 2015
Links:
Tags: entertainment
Prior to October 15th, the Census Bureau’s USA Trade Online tool cost $300/year. No longer. The newly free dataset covers more than 17,000 commodities, including a category for “magic tricks, practical joke articles; parts and accessories.” [h/t Noah Veltman] — Data is Plural: October 28, 2015
Links:
Most population numbers tell you where people live. But legions of Americans commute for work across city, county, and state lines. The Census Bureau’s Commuter-Adjusted Daytime Population Data accounts for these daily migrations. Manhattan’s (non-tourist) population doubles from 1.5 million to 3 million, by far the largest influx by raw numbers. But Lake Buena Vista, Fla., takes the percentage-growth prize. The city’s entire resident population could fit in two sedans, but its “daytime population” includes 33,000 workers — including a not-insubstantial number dressed as Mickey Mouse. [h/t Steven Romalewski] — Data is Plural: October 28, 2015
Links:
Tags: transportation
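The commuter adjustment in the daytime-population entry comes down to simple arithmetic. The sketch below assumes the adjustment works the way the Census Bureau's method is generally described: take the resident population, add everyone who works in the area, and subtract the workers who live there, since they are already counted as residents. The function names are made up for illustration. The Manhattan figures come from the entry; the small-town resident and local-worker counts are hypothetical.

```python
def daytime_population(residents: int,
                       workers_working_in_area: int,
                       workers_living_in_area: int) -> int:
    """Residents, plus people who work in the area, minus resident workers
    (who would otherwise be counted twice or who commute out)."""
    return residents + workers_working_in_area - workers_living_in_area


def percent_change(before: float, after: float) -> float:
    """Percentage growth from the resident count to the daytime count."""
    return (after - before) / before * 100


# Manhattan, using the rounded figures quoted in the entry.
manhattan_change = percent_change(1_500_000, 3_000_000)

# HYPOTHETICAL tiny town: 10 residents, 5 of whom work, against the
# 33,000 workers the entry mentions for Lake Buena Vista.
small_town_daytime = daytime_population(
    residents=10,
    workers_working_in_area=33_000,
    workers_living_in_area=5,
)
small_town_change = percent_change(10, small_town_daytime)

print(f"Manhattan: {manhattan_change:.0f}% growth")
print(f"Hypothetical small town: {small_town_change:,.0f}% growth")
```

The contrast shows why a place with almost no residents can win on percentage growth while Manhattan wins on raw numbers.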
This weekend, the New York Times published a front-page article on “the disproportionate risk of driving while black.” Among other findings: “officers were more likely to conduct [searches] when the driver was black, even though they consistently found drugs, guns or other contraband more often if the driver was white.” The investigation drew on several statewide traffic-stop datasets that track the race and gender of stopped drivers. The “seven states with the most sweeping reporting requirements,” in order of how easy it seems (to me) to get detailed data: Connecticut, North Carolina, Missouri, Nebraska, Maryland, Illinois, and Rhode Island. — Data is Plural: October 28, 2015
Links:
If you can’t beat ’em, post spreadsheets about ’em. Earlier this month, the Federal Communications Commission started publishing a dataset of complaints against telemarketers and robocalls. The FCC says the file will be updated weekly. It’s already being put to use: A clever programmer has crammed all the offending numbers into a single phone “contact” so that you can block them all at once. [h/t Shale Craig] — Data is Plural: October 28, 2015
Links:
Tags: statistics
WNYC, through a freedom-of-information request to the New York DMV, obtained a list of vanity plate approvals and denials from late 2010 to late 2014. Among the denials: “RUBMYDUB,” “S5SS5S5S,” “RFLMAO,” and “CBSNEWS.” (Strangely, “NBC4” was approved. Go figure.) The files and related story were published in August, but the data are timeless. [h/t @veltman] — Data is Plural: October 21, 2015
Links:
Tags: statistics
The Wikimedia Foundation has published a dataset enumerating monthly revision counts for every editor, across all of its wikis. The foundation is asking for help investigating a few perplexing trends. For example: Why has the number of “very active editors” — those with 100+ edits per month — increased while the number of merely “active” editors has plateaued? — Data is Plural: October 21, 2015
Links:
The Police Open Data Census, created by Code for America fellows in Indianapolis, is tracking “currently available open datasets about police interactions with citizens in the US,” including officer-involved shootings, use of force, and citizen complaints. The census currently covers 36 police departments. Related: The NYPD says it will start tracking all officer use-of-force incidents — not just gunfire — next year, the New York Times reports. — Data is Plural: October 21, 2015
Links: