Here at Media Cloud, we have over 800 million stories in our database, and we’re adding a million more each day. That’s great news for researchers, journalists, and advocates who want to ask questions about media coverage. But it also means that our database can sometimes be too large to analyze quickly! To make sure our webpages are both fast and useful, we use sampling in a handful of places to show a representative set of the data (rather than all of it). To help you have confidence in the results you are seeing, this blog post evaluates the sampling approach that drives our word clouds and provides evidence for its validity.

One of the main ways we help people understand the media narrative about their issue is by showing them the literal language being employed. The simplest way to do this is to count the words and then visualize the most frequently used ones. Word clouds are a quick and easy way to see the top words for a topic or term. (To see the word cloud for a query, and download a list of word frequencies, click on the “language” tab on the search results page.)

To show these top words, many of our tools employ what we call an "ordered word cloud" visualization. Like traditional word clouds, each word is sized according to how frequently it is used. However, in these "ordered" word cloud visualizations, rather than laying the words out randomly, we list them in order from most-used to least-used.

Comparing a traditional word cloud layout to an ordered word cloud layout for news about “north korea” (in the US Top Online News collection during May of 2018).

We find that encoding the frequency of use into both the order and size of the word makes it easier to read and understand. In our Explorer tool you can flip between these two views of the language data by using the "view options" menu that appears underneath each word cloud.

Why Sample?

As mentioned, generating this word frequency data for the entire set of results can take too long. That’s why we rely on sampling to generate our word clouds. Instead of analyzing several million stories that contain a term like “North Korea,” we use a random sample of 1,000 sentences* to represent our data.

Evaluating Our Sampling

Random sampling is a simple and popular statistical method that lets us survey data in an unbiased way. But how do we know if our 1,000-sentence sample is a good representation of our data? To measure how good our samples are, we use a method called bootstrapping. We take several random samples and calculate the standard error of our results to see if our data is stable. In our case, we took 100 random samples of 1,000 sentences for nine popular topics: climate change, deep state, ebola, gun violence, immigration, network neutrality, teen pregnancy, US election, and vaccines. We then repeated the analysis for 100 random samples of 10,000 sentences to see if there was a significant improvement in accuracy with a larger sample size.
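To make the repeated-sampling check described in this post more concrete, here is a minimal Python sketch. It is an illustration only, not Media Cloud's actual code: the function names, the whitespace tokenizer, and the made-up toy corpus are all assumptions. It also assumes each sample is drawn without replacement from the full set of sentences, which this post does not specify, and it treats the standard error of a word's relative frequency as the standard deviation of that frequency across samples.

```python
import random
import statistics
from collections import Counter


def word_frequencies(sentences):
    """Return each word's share of all words in the given sentences."""
    # Assumption: naive lowercase + whitespace tokenization.
    counts = Counter(word for s in sentences for word in s.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def top_words(freqs, n):
    """List the n most-used words from most-used to least-used,
    the same ordering an "ordered word cloud" uses."""
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def sampling_standard_error(corpus, sample_size, n_samples, seed=0):
    """Draw repeated random samples and measure how much each word's
    relative frequency varies from sample to sample (n_samples >= 2)."""
    rng = random.Random(seed)
    per_sample = [
        word_frequencies(rng.sample(corpus, sample_size))
        for _ in range(n_samples)
    ]
    vocabulary = set().union(*per_sample)
    errors = {}
    for word in vocabulary:
        # A word missing from a sample has frequency 0 in that sample.
        values = [freqs.get(word, 0.0) for freqs in per_sample]
        errors[word] = statistics.stdev(values)
    return errors


if __name__ == "__main__":
    # Hypothetical toy corpus standing in for real sentences.
    rng = random.Random(42)
    vocab = ["korea", "summit", "nuclear", "talks", "trump", "kim"]
    corpus = [" ".join(rng.choices(vocab, k=8)) for _ in range(5000)]
    for size in (100, 1000):
        errors = sampling_standard_error(corpus, size, n_samples=100)
        print(size, round(max(errors.values()), 5))
```

Comparing the largest per-word error at two sample sizes mirrors the 1,000 versus 10,000 sentence comparison above: if the error is already small at the smaller size, a larger sample adds cost without adding much stability.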