Emerging risks and opportunities from large language models

Posted on May 22, 2022


We have seen big gains from language models pretrained on large internet corpora. The paradigm is now standard: take a large pretrained model and fine-tune it on a downstream task.
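This paradigm can be sketched in miniature. The snippet below is a toy illustration, not a real setup: the "pretrained encoder" is a frozen random embedding standing in for a large language model, and the token ids and labels are synthetic. Fine-tuning here takes its lightest form: training only a small task head on top of frozen pretrained features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen pretrained encoder: in practice these
# features would come from a large language model's hidden states.
def pretrained_features(token_ids, W_frozen):
    # Mean-pool a fixed (frozen) embedding table over the sequence.
    return W_frozen[token_ids].mean(axis=1)

W_frozen = rng.normal(size=(1000, 64))         # frozen "pretrained" embeddings
tokens = rng.integers(0, 1000, size=(32, 16))  # toy batch of token ids
labels = rng.integers(0, 2, size=32)           # toy binary task labels

# Fine-tune: train only a logistic-regression head on the frozen features.
w, b = np.zeros(64), 0.0
for step in range(200):
    x = pretrained_features(tokens, W_frozen)
    p = 1 / (1 + np.exp(-(x @ w + b)))   # sigmoid head
    grad = p - labels                    # dLoss/dLogit for cross-entropy
    w -= 0.1 * (x.T @ grad) / len(labels)
    b -= 0.1 * grad.mean()
```

In practice one would often unfreeze some or all of the pretrained weights as well; freezing them, as above, is the cheapest variant (a linear probe).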

Can we trust these models?

Language models are now being used to generate new synthetic data, and results even suggest the models are good at recovering facts. Since it is hard to quantify the quality of natural language, we have come up with a dream:

Train language models to evaluate natural language against high-quality human judgements. Then use these trained models as an evaluation metric.
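The idea can be sketched as follows. This is a hypothetical, minimal version: texts are represented by feature vectors (in practice, hidden states from a language model), human judgements are synthetic, and the "trained evaluator" is just a least-squares fit; real learned metrics are far more elaborate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: each candidate text is a feature vector (in practice,
# language-model hidden states) paired with a human quality score in (0, 1).
features = rng.normal(size=(100, 8))
human_scores = 1 / (1 + np.exp(-features @ rng.normal(size=8)))  # synthetic

# "Train" an evaluator to mimic human judgement (here, linear least squares).
w, *_ = np.linalg.lstsq(features, human_scores, rcond=None)

def learned_metric(x):
    """Score a text's feature vector with the trained evaluator."""
    return float(x @ w)

# Agreement with human judgement on the training data (Pearson correlation).
predicted = features @ w
agreement = np.corrcoef(predicted, human_scores)[0, 1]
```

Once trained, `learned_metric` can score new generations without further human labels, which is exactly what makes the dream attractive and also what makes its failure modes matter.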

We are getting very close to human-level evaluation! However, LLMs still break in surprising ways.

Does pretraining result in new harms?

Privacy concerns