
Bias in word embeddings

Bias in word embeddings, Papakyriakopoulos et al., FAT*’20

There are no (stochastic) parrots in this paper, but it does examine bias in word embeddings, and how that bias carries forward into models that are trained using them. There are definitely some dangers to be aware of here, but also some cause for hope as we also see that bias can be detected, measured, and mitigated.

…we want to provide a complete overview of bias in word embeddings: its detection in the embeddings, its diffusion in algorithms using the embeddings, and its mitigation at the embeddings level and at the level of the algorithm that uses them.

It’s been shown before (‘Man is to computer programmer as woman is to homemaker?’) that word embeddings contain bias. The dominant source of that bias is the input dataset itself, i.e. the text corpus that the embeddings are trained on. Bias in, bias out. David Hume put his finger on the fundamental issue at stake here back in 1737 when he wrote about the unjustified shift in stance from describing what is and is not to all of a sudden talking about what ought or ought not to be. The inputs to a machine learning model are a description of what is. If we train a model on them without thinking, the outputs of the model will treat what is as if it were what ought to be. We have made a sophisticated machine for reinforcing the status quo, warts and all.

When it comes to word embeddings this effect can be especially powerful because the embeddings isolate the words from the contexts in which they were originally used, leaving behind a static residue of bias:

…the projection of words in a mathematical space by the embeddings consolidates stereotyping and prejudice, assigning static properties to social groups and individuals. Relations are no longer context-dependent and dynamic, and embeddings become deterministic projections of the bias of the social world. This bias is diffused into further algorithms unchanged, resulting in socially discriminative decisions.

The authors proceed in the following manner:

  • Firstly, they train one set of word embeddings on the complete German Wikipedia, and another set on Facebook and Twitter content relating to the six main political parties in Germany. The training is done using GloVe. To be able to compare these word embeddings (by placing them both within the same vector space), they then find the linear transformation matrix that maps the words of one into the vector space of the other with minimal error (see the sketch after this list). Since the transformation is linear, the normalised distances between words do not change, and hence any bias is preserved.
  • They then measure bias in the trained embeddings.
  • Having demonstrated that the word embeddings do indeed contain bias, the next step is to see if a model trained using these word embeddings exhibits bias in its outputs (which they show it does).
  • Given that the resulting models do show bias, they then explore mechanisms for mitigating that bias at both the word embedding level and at the level of the final trained classifier.
  • Finally, they show how biased word embeddings can actually help us to detect the same biases in new text samples (for example, the output of a language model?).
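The paper specifies only that the transformation is linear; one standard way to realise it, sketched below, is orthogonal Procrustes via SVD. The orthogonality constraint (which is what actually preserves normalised distances) and the variable names are assumptions of mine:

```python
import numpy as np

def align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal matrix W minimising ||X @ W - Y||_F, where row i of X
    and row i of Y are the two embeddings of the same shared word."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Hypothetical usage: fit W on the shared vocabulary, then map one
# space into the other.
#   W = align(social_shared, wiki_shared)
#   social_in_wiki_space = social_vectors @ W
```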

Measuring bias in word embeddings

Say we want to know whether a word embedding for a concept $c$ has a gender bias. We can take the cosine distance between $c$ and ‘Man’ and subtract the cosine distance between $c$ and ‘Woman’. A non-zero result reveals bias in one direction or the other, and the magnitude of the result tells us the amount of the bias. Since the study is done using the German language, which is gendered, concepts with male and female versions are represented by word pairs. This basic technique is used to assess bias in the trained word embeddings for professions, for Germans vs foreigners, and for homosexual vs heterosexual.
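As a minimal sketch, the measure might be implemented as follows; the anchor words (‘Mann’/‘Frau’) and the dict-of-vectors interface are my own illustrative choices rather than anything mandated by the paper:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def bias(emb, concept, a="Mann", b="Frau"):
    """Cosine distance from `concept` to anchor `a`, minus cosine distance
    to anchor `b`. Zero means no bias along this axis; the sign gives the
    direction (negative: closer to `a`) and the magnitude the amount."""
    c = emb[concept]
    return (1.0 - cos_sim(c, emb[a])) - (1.0 - cos_sim(c, emb[b]))
```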

The results show that the trained vectors are more likely to associate women with professions such as nursing and secretarial work, whereas men are associated with roles such as policemen and commanders. Germans are linked to positive sentiments such as charm and passion, cooperation and union, while foreigners are generally linked to concepts such as immigration, law, and crime. Homosexuals are related to roles such as hairdresser or artist, and heterosexuals to blue collar professions and science. Homosexuality was associated with negative sentiments, and heterosexuality with positive ones.

Summaries of the most extremely biased words in both the Wikipedia and social media data sets are given in the tables below.

[the] results illustrate that word embeddings contain a high level of bias in them in terms of group stereotypes and prejudice. The intergroup comparison between sexes, populations, and sexual orientations revealed the existence of strong stereotypes and unbalanced evaluation of groups. Although Wikipedia contained a stronger bias in terms of stereotypes, social media contained a higher bias in terms of group prejudice.

Do biased word embeddings lead to biased models?

The authors trained a model that took a word embedding as input and predicted whether that word carries a positive or negative sentiment. To assess bias in the trained model, the authors fed names as inputs, with the expectation that in the absence of bias names should have a zero sentiment score, since they are polarity independent. The chosen names were stereotypical first names for nine population groups: German, Turkish, Polish, Italian, Greek, French, US American, Russian, and Arabic. The authors also compared male and female names.
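A sketch of what such a probe could look like, assuming a scikit-learn style linear classifier with a decision_function and a hypothetical names_by_group mapping:

```python
import numpy as np

def probe_name_bias(clf, emb, names_by_group):
    """Mean signed sentiment score per population group. Absent bias,
    names are polarity-neutral, so every group mean should be ~0."""
    means = {}
    for group, names in names_by_group.items():
        X = np.stack([emb[n] for n in names if n in emb])
        means[group] = float(np.mean(clf.decision_function(X)))
    return means
```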

The study illustrated the use of biased word embeddings results in the creation of biased machine learning classifiers. Models trained on the embeddings replicate the preexisting bias. Bias diffusion was proved both for sexism and xenophobia, with sentiment classifiers assigning positive sentiments to Germans and negative sentiments to foreigners. In addition, the amount of polarity for men and women in the embeddings was diffused unaltered into the models.

Mitigating bias

The authors investigate two different methods for bias mitigation in the sentiment classifier:

  1. Removing bias at the individual word embedding level (by making theoretically neutral words orthogonal to the sentiment vector, where the sentiment vector is e.g. good – bad, positive – negative etc.) – see the sketch after this list.
  2. Removing bias at the level of the classifier by adjusting the linear hyperplane learned by the linear SVM classifier, such that this plane is orthogonal to the sentiment vector.
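Method 1 boils down to an orthogonal projection. Here’s a minimal sketch, assuming the sentiment vector is averaged from a few anchor pairs; the German anchors below are my own illustrative choices:

```python
import numpy as np

def sentiment_direction(emb, pairs=(("gut", "schlecht"), ("positiv", "negativ"))):
    """Unit vector along the positive-minus-negative axis (anchor pairs
    are illustrative assumptions, not necessarily the paper's)."""
    v = np.mean([emb[p] - emb[n] for p, n in pairs], axis=0)
    return v / np.linalg.norm(v)

def neutralise(vec, direction):
    """Remove the component of `vec` along `direction`, leaving the two
    orthogonal. Applied per neutral word for method 1; applying the same
    projection to the SVM weight vector is one way to read method 2."""
    return vec - (vec @ direction) * direction
```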

While both methods reduce bias in the resulting classifications, the classifier-level correction is much more effective, as can be seen in the following figure.

This is because correcting embeddings for a theoretically neutral set of words still leaves other potential biases in place that weren’t corrected for. The classifier is good at finding these!

…the classifier learns further associations between the vectors, which are not taken into consideration when debiasing at the embeddings level… Hence, we show that debiasing at the classifier level is a much better and safer methodology to follow.

Detecting bias in text

The authors created a dataset containing an equal number of sexist and non-sexist comments from Facebook pages of German political parties. Then they trained models with LSTM and attention-based architectures to classify comments as sexist or non-sexist. Multiple variants were trained, using random embeddings, the Wikipedia embeddings, the social media embeddings, and embeddings trained on the sexist comments.

The more similar the bias in the embeddings with the target data, the higher the ability of the classifier to detect them.

The attention-based network architecture, initialised with the sexism embeddings and allowed to retrain them freely, achieved an accuracy of 92% when tested on comments containing only words with embeddings.

A call to action

The study showed that bias in word embeddings can result in algorithmic social discrimination, yielding negative inferences on specific social groups and individuals. Therefore, it is necessary not only to reflect on the related issues, but also to develop frameworks of action for the just use of word embeddings…

When it comes to generative language models, one possibility that strikes me is to use the model to generate a text corpus, train word embeddings on that corpus, and then analyse them for bias. Any detected bias could then be fed back into the language model training process as a form of negative reinforcement.


An overview of end-to-end entity resolution for big data

An overview of end-to-end entity resolution for big data, Christophides et al., ACM Computing Surveys, Dec. 2020, Article No. 127

The ACM Computing Surveys are always a great way to get a quick orientation in a new subject area, and hot off the press is this survey on the entity resolution (aka record linking) problem. It’s an important part of many modern data workflows, and an area I’ve been wrestling with in one of my own projects.

Entity Resolution (ER) aims to identify different descriptions that refer to the same real-world entity appearing either within or across data sources, when unique entity identifiers are not available.

When ER is applied to records from the same data source it can be used for deduplication; when used to join records across data sources, we call it record linking. Doing this well at scale is non-trivial: at its core, the problem requires comparing each entity to every other, i.e. it is quadratic in the input size.

An individual record/document for an entity is called an entity description. A set of such descriptions is an entity collection. Two descriptions that correspond to the same real world entity are called matches or duplicates. The general flow of an ER pipeline looks like this:

  • Blocking takes input entity descriptions and assigns them to one or more blocks based on blocking keys. The point of blocking is to reduce the number of comparisons that have to be made later on – the key idea is that any two entity descriptions that have a chance of referring to the same real-world entity should end up in the same block under at least one of the blocking keys. Therefore we only have to do more detailed comparisons within blocks, but not across blocks. “The key is redundancy, i.e., the act of placing every entity into multiple blocks, thus increasing the likelihood that matching entities co-occur in at least one block.”
  • Block processing then strives to further reduce the number of comparisons that will need to be made by eliminating redundant comparisons that occur in multiple blocks, and superfluous comparisons within blocks.
  • Matching takes each pair of entity descriptions from a block and applies a similarity function to determine if they refer to the same real-world entity or not. (In an iterative ER process, matching and blocking may be interleaved with the results of each iteration potentially impacting the blocks).
  • Clustering groups together all of the identified matches such that all the descriptions within a cluster refer to the same real-world entity. The clustering stage may infer additional indirect matching relations.

The resulting clusters partition the input entity collections into a set of resolved entities.

ER solutions can be classified along three dimensions:

  • Schema-awareness – is there a schema to give structure to the data or not?
  • The nature of the matching process – is it based on a comparison of attributes in the entity descriptions, or is there more complex matching going on such as comparing related entities to give further confidence in matching?
  • The processing mode – traditional batch (with or without budget constraints), or incremental.

Let’s dive into each of the major pipeline stages in turn to get a feel for what’s involved…

Blocking

There’s a whole separate survey dedicated just to the problem of blocking for relational data, so in this survey the authors focus their attention on blocking for schema-less data. There are lots of different approaches to this:

Classic approaches look at relations and attributes. For example, Token Blocking makes one block for each unique token appearing in values, regardless of the attribute. Any entity with that token in the value of any attribute is added to the block. Redundancy positive blocking schemes such as token blocking are those in which the probability that two descriptions match increases with the number of blocks that include both of them. For redundancy neutral schemes this is not the case. An example of a redundancy neutral scheme is Canopy Clustering, which uses token-based blocks, but assigns an entity to a block only when a similarity score exceeds a threshold $t_{in}$. Moreover, if the similarity score exceeds $t_{ex}$ (where $t_{ex} > t_{in}$), then the description is not added to any further blocks.
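Token Blocking is simple enough to sketch in a few lines; the dict-of-attributes input format below is an assumption on my part:

```python
from collections import defaultdict

def token_blocking(descriptions):
    """descriptions: dict of entity id -> dict of attribute -> value."""
    blocks = defaultdict(set)
    for eid, attributes in descriptions.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)   # redundancy: one entity, many blocks
    return blocks
```

Entities sharing any token then co-occur in at least one block, and detailed comparisons are restricted to within-block pairs.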

Block processing

As with blocking, there is a multiplicity of approaches to block processing. Block cleaning methods may purge excessively large blocks (as these are likely to be the result of common stop-word tokens and hence less useful for matching) and filter the blocks a given description is present in – for example, by removing the description from the largest $r\%$ of the blocks it appears in. More sophisticated methods may also split and merge blocks. Dynamic approaches schedule block processing on the fly to maximise efficiency.
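A minimal sketch of the purging and filtering heuristics just described; the size cap and the fraction r are illustrative parameters, not values from the survey:

```python
from collections import defaultdict

def clean_blocks(blocks, max_block_size=1000, r=0.2):
    # Block purging: drop oversized blocks (typically stop-word keys).
    blocks = {k: set(v) for k, v in blocks.items() if len(v) <= max_block_size}
    # Block filtering: remove each description from the largest
    # r-fraction of the blocks it appears in.
    keys_of = defaultdict(list)
    for key, ents in blocks.items():
        for e in ents:
            keys_of[e].append(key)
    for e, keys in keys_of.items():
        keys.sort(key=lambda k: len(blocks[k]), reverse=True)
        for k in keys[: int(r * len(keys))]:
            blocks[k].discard(e)
    # Blocks with a single remaining description yield no comparisons.
    return {k: ents for k, ents in blocks.items() if len(ents) > 1}
```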

Comparison cleaning methods work on redundancy positive block collections. A graph is constructed where nodes are entity descriptions, and there is an edge between every pair of nodes co-located in a block, with an edge weight representing the likelihood of a match, e.g. the number of blocks they are co-located in. Once the graph is constructed, edge-pruning can be used to remove lower weighted edges. There are a variety of strategies both for weighting and for pruning edges. For example, Weighted Edge Pruning removes all edges less than or equal to the average edge weight. After pruning, new blocks are created from the retained edges. Learning-based methods train classifiers for pruning.
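Here’s a minimal sketch of Weighted Edge Pruning with the common-blocks weighting scheme described above:

```python
from collections import Counter
from itertools import combinations

def weighted_edge_pruning(blocks):
    """Weight each co-blocked pair by the number of blocks it shares,
    then keep only edges strictly above the average weight."""
    weights = Counter()
    for entities in blocks.values():
        for pair in combinations(sorted(entities), 2):
            weights[pair] += 1
    if not weights:
        return []
    average = sum(weights.values()) / len(weights)
    return [pair for pair, w in weights.items() if w > average]
```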

Matching

A matching function $M$ takes a pair of entity descriptions and measures their similarity using some similarity function $sim$. If the similarity score exceeds a given threshold they are said to match, otherwise they do not match. In a refinement the match function may also return uncertain for middling scores. The similarity function could be atomic, such as Jaccard similarity, or composite (e.g., using a linear combination of several similarity functions on different attributes). Any similarity measure that satisfies non-negativity, identity, symmetry, and triangle inequality can be used.
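As a minimal sketch, here is an atomic match function using Jaccard similarity over token sets, with the optional ‘uncertain’ middle band (the thresholds are illustrative):

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def match(d1: set, d2: set, t_match=0.8, t_uncertain=0.5) -> str:
    """Similarity plus thresholds: match / uncertain / non-match."""
    s = jaccard(d1, d2)
    if s >= t_match:
        return "match"
    return "uncertain" if s >= t_uncertain else "non-match"
```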

Collective based matching processes use an iterative process to uncover new matches as a result of matches already made. Merging-based collective techniques create a new entity description by merging a matched pair and removing the original paired descriptions. Relationship-based collective techniques use relationships in the original entity graph to provide further similarity evidence. For example, Collective ER:

Collective ER employs an entity graph, following the intuition that two nodes are more likely to match if their edges connect to nodes corresponding to the same entity. To capture this intuition, hierarchical agglomerative clustering is performed iteratively, where, at each iteration, the two most similar clusters are merged, until the similarity of the most similar cluster is below a threshold.

A variety of supervised, semi-supervised, and unsupervised matching techniques have also been developed.

The output of the matching process is a similarity graph with nodes corresponding to descriptions and edges connecting descriptions that matched, weighted by the matching probability.

Clustering

Clustering aims to infer more edges from indirect matching relations, while discarding edges that are unlikely to connect duplicates in favor of edges with higher weights. Hence, its end result is a set of entity clusters, each of which comprises all descriptions that correspond to the same, distinct real-world object.

The simplest approach is just to find Connected Components, but generally more advanced clustering techniques are used. For example, Markov Clustering uses random walks to strengthen intra-cluster edges while weakening inter-cluster ones.
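Connected Components is straightforward to sketch with a small union-find over the edges retained by matching (e.g. the output of the pruning sketch above):

```python
from collections import defaultdict

def connected_components(edges):
    """Group matched descriptions into clusters: the connected
    components of the similarity graph, via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())
```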

Making it incremental

Section 8 in the survey discusses incremental approaches, but my general takeaway is that these seem rather thin on the ground, and mostly oriented towards assembling the information needed to answer an incoming query on the fly. The exception is Incremental Correlation Clustering which updates the clustering results on the fly based on newly created, updated, and deleted descriptions. All of the discussed approaches require schemas.

Open source ER systems

The survey includes an assessment of open source tools for ER, summarised in the table below.

…we share the view of ER as an engineering task by nature, and hence, we cannot just keep developing ER algorithms in a vacuum. In the Big Data era, we opt for open-world ER systems that allow one to plug-and-play different algorithms and can easily integrate with third-party tools for data exploration, data cleaning, or data analytics.


Logarithm Fundamentals | Lockdown math ep. 6


What makes the natural log “natural”? | Lockdown math ep. 7


The power tower puzzle | Lockdown math ep. 8


Does contact tracing necessarily sacrifice privacy? (via Nicky Case)


Intuition for i to the power i | Lockdown math ep. 9


Tips to be a better problem solver [Last lecture] | Lockdown math ep. 10


The impossible chessboard puzzle


Group theory and why I love 808,017,424,794,512,875,886,459,904,961,710,757,005,754,368,000,000,000