topic distribution


“I raise up my voice—not so I can shout, but so that those without a voice can be heard…we cannot succeed when half of us are held back.”

Gender discrimination refers to the unequal treatment or perception of individuals based on their gender.

The history of women’s rights is a long and complex one, with significant advancements and setbacks over time. In the past, women have often been treated as second-class citizens, with limited rights and opportunities compared to men. In many societies, women have been denied the right to vote, own property, and receive an education. In the 19th and early 20th centuries, women’s rights movements emerged in many countries, leading to the adoption of laws and policies that granted women greater legal rights and equality. Despite these advancements, women continue to face discrimination and inequality in many parts of the world, and the struggle for women’s rights and gender equality continues.

It is therefore important that we devote further effort to revealing embedded discrimination, which is sometimes unseen and latent, and bringing it into public view.

What we did: in brief

First, we establish the existence of gender differences by investigating the graphical structure of actors’ collaborations, the uneven distribution of actor ages upon participation, the distribution of actors’ career spans, the most frequency-diverged words across genders, and the distribution of common words across genders. After that, we look into the evolution of gender inequalities by investigating the gender composition of the movie industry, the distribution of actor ages upon movie participation and career length, the most diverged words, and the distribution of common words.

Story logistics

First, we point out the existence of cross-gender differences and stereotypes (gender effects) in the movie industry by focusing on:

  • The gender composition among top actors/actresses in history;
  • Society’s differing expectations of male/female actors;
  • The graphical structure and properties of top actors’ collaboration;
  • The uneven distribution of actor ages upon participation and their career span;
  • The most frequency-diverged words in the plot summaries across the two genders;
  • The difference in the distribution of common words across genders.

Then, we shift our focus to the evolution of gender effects over time by investigating:

  • The gender composition in the movie industry;
  • The change of distribution of actors’ career span on a decade-scale;
  • The most frequency-diverged words over time;
  • The distribution of common words across genders over time.

The data

We obtain the data from the CMU Movie Summary Corpus, which contains data about movies, the characters and actors in them, and the movies’ plot summaries.

The movie data is composed of:

  • wikipedia_id: Wikipedia movie ID
  • freebase_id: Freebase movie ID
  • name: Movie name
  • release_date: Movie release date
  • box_office_revenue: Movie box office revenue
  • runtime: Movie runtime (in minutes)
  • languages: Movie languages (Freebase ID:name tuples)
  • countries: Movie countries (Freebase ID:name tuples)
  • genres: Movie genres (Freebase ID:name tuples)

The character data is composed of:

  • wikipedia_id: Wikipedia movie ID
  • freebase_id: Freebase movie ID
  • release_date: Movie release date
  • character_name: Character name
  • actor_dob: Actor date of birth
  • actor_gender: Actor gender
  • actor_height: Actor height (in meters)
  • actor_ethnicity: Actor ethnicity (Freebase ID)
  • actor_name: Actor name
  • actor_age: Actor age at movie release
  • freebase_character_map: Freebase character/actor map ID
  • freebase_character_id: Freebase character ID
  • freebase_actor_id: Freebase actor ID

The plot data contains plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

I’ve seen things you people wouldn’t believe. Attack ships on fire off the shoulder of Orion. I watched c-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Now, time to…

Tell the story!

The Existence of Gender Effects

The famous actors: what are their characteristics?

Actors are an essential part of movies. A movie in which two actors have worked together can be seen as a connection between them, and together these connections form a very large social network. With the help of this network structure, we can look for structural gender differences in the movies.

We build the graph from the prepared node and edge data. We use nx.Graph() to create an empty undirected graph and load our prepared data into it. In our social network graph, every actor is a node, and there is an edge between two nodes if the two actors have collaborated in at least one movie. The higher a node’s degree, the more influential we consider the actor to be.
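The construction described above can be sketched as follows. This is a minimal illustration rather than our exact pipeline: the `casts` dictionary and the helper name `build_collaboration_graph` are hypothetical stand-ins for the prepared node and edge data.

```python
from itertools import combinations

import networkx as nx

# Hypothetical toy cast lists: movie ID -> actors credited in that movie.
casts = {
    "m1": ["A", "B", "C"],
    "m2": ["A", "B"],
    "m3": ["C", "D"],
}


def build_collaboration_graph(casts):
    """Return an undirected graph with one node per actor and one edge
    per pair of actors who share at least one movie."""
    G = nx.Graph()
    for actors in casts.values():
        unique = sorted(set(actors))
        G.add_nodes_from(unique)
        # Adding an edge that already exists is a no-op, so repeated
        # collaborations still yield a single edge.
        G.add_edges_from(combinations(unique, 2))
    return G


G = build_collaboration_graph(casts)
degrees = dict(G.degree())  # actor -> number of distinct co-stars
```

Because the graph is unweighted, an actor’s degree counts distinct collaborators, not the number of shared movies.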

First, we select the 100 nodes with the highest degree, which represent the 100 most influential actors. We call the original graph G and this subgraph G100. Comparing the gender distribution in G and G100, we find only eight women among the Top 100 actors, which shows that male actors dominate the actors’ social network.
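Selecting the highest-degree nodes and counting genders in the resulting subgraph could look like the sketch below. It assumes each node carries a `gender` attribute, and the helper names `top_k_subgraph` and `gender_counts` are illustrative only.

```python
import networkx as nx


def top_k_subgraph(G, k):
    """Return the subgraph induced by the k highest-degree nodes."""
    ranked = sorted(G.degree(), key=lambda pair: pair[1], reverse=True)
    top_nodes = [node for node, _ in ranked[:k]]
    return G.subgraph(top_nodes).copy()


def gender_counts(G):
    """Count nodes per value of the (assumed) 'gender' node attribute."""
    counts = {}
    for _, gender in G.nodes(data="gender"):
        counts[gender] = counts.get(gender, 0) + 1
    return counts


# Toy graph: A has 4 co-stars, B has 3, C and D have 2, E has 1.
G_toy = nx.Graph()
G_toy.add_nodes_from([
    ("A", {"gender": "M"}), ("B", {"gender": "F"}), ("C", {"gender": "M"}),
    ("D", {"gender": "F"}), ("E", {"gender": "M"}),
])
G_toy.add_edges_from([
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C"), ("B", "D"),
])

G_top2 = top_k_subgraph(G_toy, 2)  # plays the role of G100 with k=100
top2_genders = gender_counts(G_top2)
```

Ties at the cut-off are broken by node insertion order in this sketch; a real ranking may want an explicit tie-breaking rule.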

Figure 1: The number of female and male actors in TOP 100 ranking

Let us have a look at the most famous male and female actors in this ranking: Samuel L. Jackson and Whoopi Goldberg.

Samuel L. Jackson

Born on December 21, 1948, he is a famous American actor and producer. He is one of the most widely recognized actors of his generation: the films in which he has appeared have collectively grossed over 27 billion US dollars worldwide, making him the third highest-grossing actor of all time. The Academy of Motion Picture Arts and Sciences gave him an Academy Honorary Award in 2022 as "A cultural icon whose dynamic work has resonated across genres and generations and audiences worldwide."

Representative Works: Jurassic Park, Pulp Fiction, Star Wars Series, Marvel Cinematic Universe.

You can see more details here!

Whoopi Goldberg

Caryn Elaine Johnson (born November 13, 1955), known professionally as Whoopi Goldberg (/ˈwʊpi/), is an American actor, comedian, author, and television personality. A recipient of numerous accolades, she is one of 17 entertainers to win the EGOT, which includes an Emmy Award, a Grammy Award, an Academy Award ("Oscar"), and a Tony Award. In 2001, she received the Mark Twain Prize for American Humor.

Representative Works: Ghost, The Color Purple, Sister Act 1, Sister Act 2: Back in the Habit, Teenage Mutant Ninja Turtles.

You can see more details here!


After filtering out height outliers, we compare the heights of male and female actors in G and G100. For male actors, a t-test on the average height of the Top 100 versus all actors gives a p-value of 0.61, so we find no significant difference between the two groups. In other words, female actors in the Top 100 are usually taller than the average female actor, whereas male actors in the Top 100 are about as tall as the average male actor, or even slightly shorter. Taller people appear more likely to become actors or to appear in movies. We further conclude that society tolerates a broad range of heights for male actors, but often sets higher height requirements for women.
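As a reminder of how such a p-value is read, here is a small sketch of a two-sample t-test with SciPy. The choice of Welch’s variant (`equal_var=False`), the 0.05 threshold, the helper name `compare_heights`, and the height values are all assumptions made for illustration, not our actual data or settings.

```python
from scipy import stats


def compare_heights(top_heights, all_heights, alpha=0.05):
    """Two-sample t-test on mean height.

    Returns (p_value, significant). A large p-value means we cannot
    reject the hypothesis that both groups share the same mean height.
    """
    # equal_var=False selects Welch's t-test (an assumption of this sketch).
    _, p_value = stats.ttest_ind(top_heights, all_heights, equal_var=False)
    return float(p_value), bool(p_value < alpha)


# Made-up heights in meters, for illustration only.
top_sample = [1.78, 1.83, 1.80, 1.75, 1.88, 1.81]
all_sample = [1.79, 1.82, 1.76, 1.85, 1.80, 1.77, 1.84, 1.81]
p_value, significant = compare_heights(top_sample, all_sample)
```

With a p-value such as 0.61, `significant` would be `False`, which is why the male Top 100 heights are read as indistinguishable from the overall male average.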

Figure 2: Average height of famous/all actors across genders

Finally, we generate the male subgraph G_male and the female subgraph G_female from the social network graph G and compute structural features on them to explore structural gender differences.

Table 3: Structural features of the total graph and gender subgraphs

Interestingly, nodes in the male subgraph have a higher average degree, which means that there are more collaborations between male actors in the film and television industry. Female actors, however, have more stable partnerships with other female actors, according to the higher transitivity of the female subgraph and the higher clustering coefficients of specific actors.

One plausible explanation is that there are fewer female actors than male actors, so for any specific kind of female character group in the movies there are few potential female candidates, which leads to more stable collaborations between female actors.
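The structural features discussed above can be computed with NetworkX roughly as follows. This sketch assumes a `gender` node attribute with values `"F"` and `"M"`; the helper names and the toy graph are hypothetical.

```python
import networkx as nx


def gender_subgraph(G, gender):
    """Subgraph induced by nodes whose (assumed) 'gender' attribute matches."""
    nodes = [n for n, g in G.nodes(data="gender") if g == gender]
    return G.subgraph(nodes)


def structural_features(G):
    """Average degree, transitivity and average clustering of a graph."""
    n = G.number_of_nodes()
    return {
        "nodes": n,
        "edges": G.number_of_edges(),
        "average_degree": 2 * G.number_of_edges() / n if n else 0.0,
        # Fraction of connected triples that close into triangles.
        "transitivity": nx.transitivity(G),
        # Mean of the per-node local clustering coefficients.
        "average_clustering": nx.average_clustering(G) if n else 0.0,
    }


# Toy graph: three women forming a triangle, three men forming a path.
G_toy = nx.Graph()
G_toy.add_nodes_from(["F1", "F2", "F3"], gender="F")
G_toy.add_nodes_from(["M1", "M2", "M3"], gender="M")
G_toy.add_edges_from([
    ("F1", "F2"), ("F2", "F3"), ("F1", "F3"),  # closed triangle
    ("M1", "M2"), ("M2", "M3"),                # open path
    ("F1", "M1"),                              # cross-gender edge, dropped in subgraphs
])

female_features = structural_features(gender_subgraph(G_toy, "F"))
male_features = structural_features(gender_subgraph(G_toy, "M"))
```

In the toy graph, the triangle among the women yields a transitivity of 1, while the open path among the men yields 0, which is the kind of contrast that transitivity is meant to capture.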

Actors’ careers: is there a gender effect?

In this part, we investigate how female and male actors differ in their careers. We plot the evolution of the average age of female and male actors. Interestingly, the average ages of both female and male actors increase over time, while female actors are almost always younger than male actors.

Figure 4: Average age of participating actors upon movie release

We then plot the distribution of actors’ ages upon participation in movies. When the same actor participates in different movies, we count each participation separately.

Figure 5: Age distribution of female and male actors upon movies participation

The age distribution peaks at around the 20s to 30s for both female and male actors, with a slight difference: the peak for male characters comes a bit later than for female characters.

Besides the peak shift, the histogram of the female age distribution is “thinner” than that of the male age distribution, which probably indicates shorter career spans. To be more precise, we group the data by actor and collect each actor’s characters. The career span in years is then computed as the difference between the actor’s latest and earliest characters.

Below is the overall career span distribution. We conclude that female actors generally have shorter career spans than male actors, while most actors have a career span of only 1 year, meaning that they have starred in only one movie.

Figure 6: Cumulative distribution function of female and male actors' starred age span
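The career span computation can be sketched with pandas as below. The column names follow the character data described earlier, but the toy rows, the helper name `career_spans`, and the `+ 1` convention (so that an actor with a single movie gets a span of 1 year, as stated above) are assumptions of this sketch.

```python
import pandas as pd

# Toy character rows; release dates in the corpus may be full dates or bare years.
characters = pd.DataFrame({
    "freebase_actor_id": ["a1", "a1", "a1", "a2", "a3", "a3"],
    "actor_gender": ["F", "F", "F", "M", "M", "M"],
    "release_date": ["1990-05-01", "1995", "2001-11-20", "2003", "1980-01-01", "2010-06-15"],
})


def career_spans(df):
    """Career span in years per actor: latest minus earliest release year, plus one."""
    # Keep only the leading four-digit year; unparsable values become NaN.
    years = pd.to_numeric(df["release_date"].astype(str).str[:4], errors="coerce")
    df = df.assign(release_year=years).dropna(subset=["release_year"])
    grouped = df.groupby(["freebase_actor_id", "actor_gender"])["release_year"]
    span = (grouped.max() - grouped.min() + 1).astype(int)
    return span.rename("career_span").reset_index()


spans = career_spans(characters)
```

The resulting `career_span` column can then be split by `actor_gender` to draw one cumulative distribution curve per gender.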

Words count! What do we know from the plots?

We also study the movie plot summaries to discover whether gender stereotypes exist in them. We define a gender stereotype as a gender-neutral word that is used disproportionately to describe male or female characters. For each character, we treat as relevant the verbs and adjectives that appear within two words before or after the character’s name in the same sentence of a plot summary. We extract these relevant words by gender and compute the log frequency of each word for each gender.

For qualitative analysis, we look at the differences in word frequencies between genders and rank the verbs and adjectives by this difference. We take the top-ranked words as the most distinguishable words for men and women.

Figure 7: Top 20 words with largest frequency difference according to genders
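The extraction and ranking steps can be illustrated with a small pure-Python sketch. It is not our actual pipeline: part-of-speech tagging is replaced by a hypothetical `candidates` vocabulary of verbs and adjectives, character names are assumed to be single tokens, and the add-one smoothing in the log frequencies is an assumption.

```python
import math
import re
from collections import Counter


def context_words(summary, names, candidates, window=2):
    """Collect candidate words within `window` tokens of a character name,
    staying inside the sentence that contains the name."""
    names = {name.lower() for name in names}
    found = []
    for sentence in re.split(r"[.!?]+", summary.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        for i, token in enumerate(tokens):
            if token in names:
                lo, hi = max(0, i - window), i + window + 1
                found.extend(t for t in tokens[lo:hi] if t in candidates)
    return found


def log_frequency_difference(male_counts, female_counts):
    """Male minus female log relative frequency per word (add-one smoothed)."""
    vocab = set(male_counts) | set(female_counts)
    male_total = sum(male_counts.values()) + len(vocab)
    female_total = sum(female_counts.values()) + len(vocab)
    return {
        word: math.log((male_counts[word] + 1) / male_total)
        - math.log((female_counts[word] + 1) / female_total)
        for word in vocab
    }


# Toy plot summary and stand-in vocabulary (both hypothetical).
text = "Tom will kill the guard. Anna is beautiful. Anna will marry Tom. Guards kill nobody here."
candidates = {"kill", "marry", "beautiful"}

male_counts = Counter(context_words(text, {"Tom"}, candidates))
female_counts = Counter(context_words(text, {"Anna"}, candidates))
differences = log_frequency_difference(male_counts, female_counts)
ranking = sorted(differences, key=differences.get, reverse=True)
```

Words at the head of `ranking` lean male and words at the tail lean female; a word such as "marry", which here appears once near each name, ends up in the middle.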

We find that men and women are described with clearly distinct words. Among the verbs and adjectives, males are associated with crime (kill, shoot, fight, arrest, dead, criminal), power (lead, manage, powerful), and politics (corrupt), while females are associated with marriage (marry, marriage), love, reproduction (pregnant), appearance (beautiful), and sex (seduce, sexual).

For quantitative analysis, we compare the distributions of verb and adjective frequencies between men and women. We use a chi-square test to see whether the difference is significant and adopt the KL divergence to measure how different the two distributions are. The p-value of the chi-square test is close to zero, indicating that the difference is statistically significant. The KL divergence is 0.07 for verbs and 0.16 for adjectives; see our codebase for more details!
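A minimal sketch of both measurements with SciPy is shown below. The word counts are made up, the helper name `compare_word_distributions` is hypothetical, and the direction of the KL divergence (male relative to female) is an assumption, since the divergence is not symmetric.

```python
import numpy as np
from scipy.stats import chi2_contingency, entropy


def compare_word_distributions(male_counts, female_counts):
    """Chi-square test of homogeneity and KL(male || female) over a shared vocabulary.

    Both inputs are counts of the same words in the same order. Every word
    needs a nonzero count for at least one gender (for the chi-square test)
    and a nonzero female count wherever the male count is nonzero (for a
    finite KL divergence).
    """
    table = np.array([male_counts, female_counts], dtype=float)
    _, p_value, _, _ = chi2_contingency(table)
    p = table[0] / table[0].sum()
    q = table[1] / table[1].sum()
    kl = entropy(p, q)  # sum(p * log(p / q)), in nats
    return float(p_value), float(kl)


# Made-up counts for three words, e.g. "kill", "marry", "love".
male = [120, 30, 50]
female = [40, 90, 70]
p_value, kl = compare_word_distributions(male, female)
```

The chi-square p-value answers whether the two distributions differ at all, while the KL divergence gives a sense of how large the difference is.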

Time Evolution of Gender Effects

Are women gaining more places in the industry?

We are interested in how gender composition in the movie industry evolves. To answer this question, we first derive the annual gender composition in the movie industry.

Figure 8: Annual actor, actress count average by sex

With the annual actor counts by sex in hand, we can investigate how the gender composition of the movie industry changes over time.

Note that the data variance is very large in the first and last few years, and the pattern we observe there differs somewhat from the rest of the series, so the data for those years may not be representative. To check this, we look at how many films were released each year: if only a few films were released in a given year, the data for that year may not be representative. Based on this analysis (see details in our codebase!), we crop the head and tail of the original data to make the results more reliable.

Figure 9: Annual actor/actress count average difference with 95% confidence interval

Then, we look at the ratio of the male actor count to the female actor count by year.

Figure 10: Annual actor/actress count average ratio
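The yearly averages, their difference, and their ratio can be derived along these lines. The sketch assumes a `release_year` column already extracted from `release_date`, gender codes `"F"` and `"M"`, and a hypothetical `min_movies` threshold that stands in for cropping sparsely covered years.

```python
import pandas as pd


def annual_gender_stats(characters, min_movies=1):
    """Average number of female and male actors per movie for each year,
    plus their difference and ratio."""
    df = characters.dropna(subset=["release_year", "actor_gender"])
    per_movie = (
        df.groupby(["release_year", "wikipedia_id", "actor_gender"]).size()
        .unstack("actor_gender", fill_value=0)
        .reindex(columns=["F", "M"], fill_value=0)
        .reset_index()
    )
    yearly = per_movie.groupby("release_year").agg(
        movies=("wikipedia_id", "nunique"),
        female=("F", "mean"),
        male=("M", "mean"),
    )
    # Drop years with too few movies to be representative.
    yearly = yearly[yearly["movies"] >= min_movies].copy()
    yearly["difference"] = yearly["male"] - yearly["female"]
    # The ratio is infinite for a year without any female actor.
    yearly["ratio"] = yearly["male"] / yearly["female"]
    return yearly


# Toy rows: two movies in 1950, one movie in 2000.
characters = pd.DataFrame({
    "release_year": [1950, 1950, 1950, 1950, 1950, 2000, 2000, 2000, 2000],
    "wikipedia_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "actor_gender": ["M", "M", "F", "M", "F", "M", "F", "F", "M"],
})

yearly = annual_gender_stats(characters)
cropped = annual_gender_stats(characters, min_movies=2)
```

In each row, `difference` equals `female * (ratio - 1)`, so the two curves can move differently when the female average itself changes over time.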

The proportion of female actors first decreases and then increases, while the absolute number of female actors keeps increasing after around 1942.

To explain this, we write the male-to-female ratio as a function of time:

\[\frac{Male}{Female} = f(t)\]

Therefore we have:

\[Male - Female = Female(f(t) - 1)\]

We observe that:

  • $Male - Female = Female(f(t) - 1)$ increases to a peak in 1941, then decreases and stays stable.
  • $\frac{Male}{Female} = f(t)$ increases to a peak in 1941, then decreases steadily.

We conclude that the male-to-female ratio of film participation first increases and then decreases, but in general there are more men than women in the movie industry.

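To account for the difference in actor counts staying stable while the ratio keeps falling, note that the absolute number of female actors keeps increasing: as $f(t) - 1$ shrinks, the growing $Female$ factor keeps the product $Female(f(t) - 1)$ roughly constant.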

Actors’ careers: from decades’ perspective

Below is the evolution of the career span distribution across decades. We see that the difference between genders is narrowing (except in the last plot, for actors born from 1990 to 2000, who may be too young for us to judge an entire career span). This may be due to increasing awareness of women’s rights.