Semaglutide: Structure & Chemistry

Molecular Formula and Basic Properties

Semaglutide is a modified peptide with the molecular formula C₁₈₇H₂₉₁N₄₅O₅₉ and a molecular weight of approximately 4,113 Daltons. This makes it a relatively large molecule by pharmaceutical standards, though small compared to proteins like antibodies (typically about 150,000 Da). The molecule consists of a 31-amino acid peptide backbone with three key modifications that distinguish it from native human GLP-1: substitution of alanine with aminoisobutyric acid at position 8, attachment of a C-18 fatty diacid chain via a spacer to lysine at position 26, and substitution of lysine with arginine at position 34.
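The quoted molecular weight can be checked directly from the molecular formula. The short sketch below sums standard average atomic masses over the formula; the names `ATOMIC_MASS`, `SEMAGLUTIDE`, and `molecular_weight` are illustrative and not taken from any particular library.

```python
# Average atomic masses in Da (standard values, rounded to three decimals).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# Elemental composition from the formula C187 H291 N45 O59 quoted above.
SEMAGLUTIDE = {"C": 187, "H": 291, "N": 45, "O": 59}


def molecular_weight(composition):
    """Sum (atomic mass x atom count) over every element in the formula."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


mw = molecular_weight(SEMAGLUTIDE)
print(f"Average molecular weight: {mw:.1f} Da")  # about 4113.6 Da
```

The result agrees with the approximate figure of 4,113 Da given above.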

As a peptide, semaglutide is composed primarily of amino acids linked by peptide bonds (amide bonds between the carboxyl group of one amino acid and the amino group of the next). The peptide backbone provides the basic structure, while the amino acid side chains (which vary for each of the 20 standard amino acids plus the non-standard aminoisobutyric acid) give the molecule its specific properties and biological activity. The fatty acid modification adds significant hydrophobic character, dramatically affecting the molecule's pharmacokinetic properties.

Amino Acid Sequence

Understanding semaglutide's structure requires examining its amino acid sequence in detail. The sequence is based on human GLP-1(7-37), with strategic modifications.

Native GLP-1(7-37) Sequence

Native human GLP-1(7-37) has the following sequence (using single-letter amino acid codes):

H-A-E-G-T-F-T-S-D-V-S-S-Y-L-E-G-Q-A-A-K-E-F-I-A-W-L-V-K-G-R-G

Breaking this down by position:

  • Position 7 (1): Histidine (H)
  • Position 8 (2): Alanine (A)
  • Position 9 (3): Glutamic acid (E)
  • Position 10 (4): Glycine (G)
  • Position 11 (5): Threonine (T)
  • Position 12 (6): Phenylalanine (F)
  • Position 13 (7): Threonine (T)
  • Position 14 (8): Serine (S)
  • Position 15 (9): Aspartic acid (D)
  • Position 16 (10): Valine (V)
  • Position 17 (11): Serine (S)
  • Position 18 (12): Serine (S)
  • Position 19 (13): Tyrosine (Y)
  • Position 20 (14): Leucine (L)
  • Position 21 (15): Glutamic acid (E)
  • Position 22 (16): Glycine (G)
  • Position 23 (17): Glutamine (Q)
  • Position 24 (18): Alanine (A)
  • Position 25 (19): Alanine (A)
  • Position 26 (20): Lysine (K)
  • Position 27 (21): Glutamic acid (E)
  • Position 28 (22): Phenylalanine (F)
  • Position 29 (23): Isoleucine (I)
  • Position 30 (24): Alanine (A)
  • Position 31 (25): Tryptophan (W)
  • Position 32 (26): Leucine (L)
  • Position 33 (27): Valine (V)
  • Position 34 (28): Lysine (K)
  • Position 35 (29): Glycine (G)
  • Position 36 (30): Arginine (R)
  • Position 37 (31): Glycine (G)

Semaglutide Sequence

Semaglutide modifies this sequence at three positions:

H-AIB-E-G-T-F-T-S-D-V-S-S-Y-L-E-G-Q-A-A-K(spacer + C18 fatty diacid)-E-F-I-A-W-L-V-R-G-R-G

The modifications are:

  • Position 8 (2): Alanine → Aminoisobutyric acid (AIB)
  • Position 26 (20): Lysine → Lysine with C-18 fatty diacid attached via spacer
  • Position 34 (28): Lysine → Arginine

Sequence Numbering Convention

A note on numbering: GLP-1 is derived from proglucagon, and the numbering (referred to here as proglucagon numbering) reflects positions in the full-length GLP-1(1-37) peptide originally predicted from the proglucagon sequence. GLP-1(7-37) therefore means residues 7-37 of that peptide, which is the biologically active form. When discussing semaglutide, a simplified numbering (1-31) is sometimes used for clarity, but the modifications are typically described using the proglucagon numbering: positions 8, 26, and 34 correspond to positions 2, 20, and 28 in simplified numbering.
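The conversion between the two numbering schemes is a fixed offset of 6. As a minimal sketch (the helper names `to_simplified` and `residue_at` are made up for illustration), the native sequence listed earlier can be indexed either way:

```python
# Native human GLP-1(7-37) in single-letter codes, as listed above.
NATIVE = "HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG"

# Simplified position 1 corresponds to position 7 in the conventional numbering.
OFFSET = 6


def to_simplified(position):
    """Convert a conventional position (7-37) to simplified numbering (1-31)."""
    return position - OFFSET


def residue_at(position):
    """Return the native residue at a conventional position (7-37)."""
    return NATIVE[to_simplified(position) - 1]  # Python strings are 0-indexed


for position in (8, 26, 34):
    print(position, to_simplified(position), residue_at(position))
# 8 2 A
# 26 20 K
# 34 28 K
```

The three positions that semaglutide modifies are therefore an alanine and two lysines in the native peptide.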

Key Structural Modifications

Each of semaglutide's three modifications serves a specific purpose and was carefully designed based on structure-activity relationship studies.

Modification 1: Aminoisobutyric Acid (AIB) at Position 8

Aminoisobutyric acid (also called α-methylalanine) is a non-proteinogenic amino acid: it is not one of the 20 standard amino acids found in natural proteins. Structurally, AIB is similar to alanine but has two methyl groups attached to the alpha carbon instead of one. This leaves the alpha carbon fully substituted (bonded to two methyl groups, the amino group, and the carboxyl group, with no hydrogen), which introduces steric hindrance.

The purpose of this modification is to protect against dipeptidyl peptidase-4 (DPP-4), the enzyme that rapidly degrades native GLP-1. DPP-4 removes the N-terminal dipeptide from peptides that have alanine or proline at position 2 (position 8 in proglucagon numbering), cleaving the bond immediately after that residue. The enzyme's active site accommodates alanine's small methyl side chain. By replacing alanine with AIB, which has an additional methyl group creating steric bulk, the peptide no longer fits properly into DPP-4's active site, which strongly hinders cleavage.
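The position-2 rule described above can be written as a one-line check. This is only a rule-of-thumb sketch: it encodes the stated preference for alanine or proline and says nothing about cleavage rates. The function name `is_dpp4_substrate` is hypothetical.

```python
# Residues that DPP-4 accepts at position 2, per the rule stated above.
DPP4_PREFERRED = {"A", "P"}


def is_dpp4_substrate(residues):
    """Rule of thumb only: True when the second residue is Ala or Pro."""
    return len(residues) >= 2 and residues[1] in DPP4_PREFERRED


native_n_terminus = ["H", "A", "E", "G"]         # native GLP-1(7-37)
semaglutide_n_terminus = ["H", "Aib", "E", "G"]  # alanine replaced by AIB

print(is_dpp4_substrate(native_n_terminus))       # True
print(is_dpp4_substrate(semaglutide_n_terminus))  # False
```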

This modification is elegant because it provides strong protection against DPP-4 with minimal structural perturbation. The AIB substitution doesn't significantly alter the peptide's three-dimensional structure or its ability to bind and activate GLP-1 receptors. Structural studies indicate that semaglutide adopts a conformation similar to that of native GLP-1, with the AIB substitution causing only local structural changes.

Modification 2: C-18 Fatty Acid at Position 26

The fatty acid modification is the most important for extending semaglutide's half-life. A C-18 fatty diacid (octadecanedioic acid, an 18-carbon chain with a carboxyl group at each end) is attached to the epsilon-amino group of lysine at position 26 (position 20 in simplified numbering). The attachment is not direct but through a spacer consisting of one gamma-glutamic acid residue and two 8-amino-3,6-dioxaoctanoic acid units (abbreviated OEG or ADO).

The complete modification can be represented as: Lys-[OEG-OEG-γGlu-C18 diacid]. The gamma-glutamic acid is linked through its gamma-carboxyl group rather than the alpha-carboxyl group used in normal peptide bonds, which leaves its alpha-carboxyl group free. This creates a branched structure extending from the lysine side chain. One carboxyl group of the diacid is attached to the amino group of the gamma-glutamic acid through an amide bond, while the other remains free at the far end of the chain.

This fatty acid modification enables strong but reversible binding to serum albumin, the most abundant protein in blood (concentration ~600-700 μM). Albumin has multiple binding sites for fatty acids and other hydrophobic molecules. The C-18 fatty acid of semaglutide inserts into one of these binding pockets, anchoring the peptide to albumin. The spacer provides optimal distance between the peptide backbone and the fatty acid, allowing the fatty acid to bind albumin while the peptide portion remains accessible for GLP-1 receptor binding.

The fatty acid length was carefully optimized. Shorter chains (C-14, C-16) provide weaker albumin binding, resulting in shorter half-lives insufficient for once-weekly dosing. Longer chains (C-20, C-22) bind too strongly, potentially reducing the amount of free drug available to activate receptors and possibly causing toxicity. The C-18 chain provides the optimal balance—strong enough binding for weekly dosing but reversible enough to maintain therapeutic efficacy.

Modification 3: Arginine at Position 34

The third modification replaces lysine at position 34 (position 28 in simplified numbering) with arginine. Both lysine and arginine are positively charged amino acids at physiological pH, so this substitution maintains the charge distribution of the peptide. However, arginine has a more complex side chain with a guanidinium group, while lysine has a simpler primary amine.

The main purpose of this modification relates to synthesis rather than pharmacology. Native GLP-1 has two lysines (positions 26 and 34), and the side-chain amino group of either could react when the fatty acid side chain is attached. Replacing lysine 34 with arginine leaves lysine 26 as the only lysine in the peptide, so acylation occurs at a single, defined site. Because arginine carries the same positive charge as lysine, the substitution has little effect on receptor binding.

Three-Dimensional Structure

Peptides are not rigid linear chains but adopt specific three-dimensional conformations determined by the sequence of amino acids and the environment. Understanding semaglutide's three-dimensional structure helps explain how it binds to and activates GLP-1 receptors.

Secondary Structure Elements

GLP-1 and semaglutide adopt an alpha-helical conformation in the N-terminal region (approximately residues 7-28 in proglucagon numbering). Alpha helices are common secondary structures in proteins where the peptide backbone forms a right-handed helix stabilized by hydrogen bonds between the carbonyl oxygen of residue n and the amide hydrogen of residue n+4. This helical structure is important for receptor binding.

The C-terminal region (residues 29-37) is more flexible and less structured, existing as an extended chain or random coil. This flexibility may be important for allowing the peptide to adopt the optimal conformation for receptor binding. The fatty acid modification at position 26 extends from the helical region, projecting outward where it can interact with albumin without interfering with the helical structure important for receptor binding.

Receptor Binding

GLP-1 receptors are class B G-protein coupled receptors (GPCRs) with a large extracellular domain (ECD) and a seven-transmembrane domain (7TM). Peptide binding involves a two-step process. First, the C-terminal region of the peptide binds to the ECD, which acts as a recognition domain. This initial binding positions the peptide for the second step, where the N-terminal region inserts into the 7TM domain, triggering receptor activation.

The alpha-helical structure of semaglutide's N-terminal region is crucial for this second step. The helix presents specific amino acid side chains in the correct spatial arrangement to interact with binding pockets in the 7TM domain. Key residues include phenylalanine at position 12, which inserts into a hydrophobic pocket, and several charged residues that form ionic interactions with receptor residues.

Crystal structures of GLP-1 receptor bound to peptide agonists show that the peptide adopts a specific conformation upon binding, with the N-terminal helix extending deep into the receptor's transmembrane core. The AIB modification at position 8 doesn't significantly alter this binding mode, explaining why semaglutide activates the receptor with similar potency to native GLP-1 despite the structural modification.

Albumin Binding

Albumin is a large protein (66.5 kDa, 585 amino acids) with a heart-shaped structure consisting of three homologous domains, each containing two subdomains. Fatty acids bind to multiple sites on albumin, with the highest-affinity sites located in subdomains IIA and IIIA. The C-18 fatty acid of semaglutide binds to one of these sites, likely subdomain IIIA based on binding studies.

The binding is non-covalent, involving hydrophobic interactions between the fatty acid chain and a hydrophobic pocket in albumin, plus ionic interactions between the carboxyl group of the fatty acid and positively charged residues in the binding pocket. The binding is strong (dissociation constant in the low micromolar range) but reversible, allowing semaglutide to dissociate from albumin and bind to GLP-1 receptors.

Approximately 99% of semaglutide in circulation is bound to albumin at any given time, with only 1% free. However, this 1% free fraction is sufficient for therapeutic activity because it's continuously replenished as bound semaglutide dissociates. The albumin-bound fraction serves as a reservoir, slowly releasing free drug to maintain steady therapeutic levels.
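The figures above can be tied together with a simple equilibrium estimate. The sketch below assumes a single binding site and albumin in large excess over the drug, so that the free fraction is Kd / (Kd + [albumin]). Both assumptions are simplifications, and the function names are illustrative. The albumin concentration of 600 μM is the lower end of the range quoted earlier.

```python
def free_fraction(kd_um, albumin_um):
    """Fraction of drug unbound, for one site and albumin in large excess."""
    return kd_um / (kd_um + albumin_um)


def implied_kd(free_frac, albumin_um):
    """Apparent Kd (in uM) that would produce a given free fraction."""
    return free_frac * albumin_um / (1.0 - free_frac)


ALBUMIN_UM = 600.0  # serum albumin concentration, lower end of the quoted range

kd = implied_kd(0.01, ALBUMIN_UM)
print(f"Apparent Kd for 1% free drug: {kd:.2f} uM")  # about 6.06 uM
print(f"Free fraction at that Kd: {free_fraction(kd, ALBUMIN_UM):.3f}")
```

Under these assumptions, a 1% free fraction corresponds to an apparent dissociation constant of a few micromolar, which is consistent with the low micromolar range mentioned above.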

Chemical Properties

Semaglutide's chemical properties influence its stability, solubility, and pharmaceutical formulation.

Charge and Isoelectric Point

At physiological pH (7.4), semaglutide carries a net negative charge. The peptide contains multiple acidic amino acids (glutamic acid, aspartic acid) that are negatively charged at pH 7.4, and the fatty acid side chain contributes further free carboxyl groups. The positively charged groups are fewer: the two arginines and the N-terminal amine. The single lysine is acylated and no longer carries a charge, and histidine is largely uncharged at this pH. The net charge depends on the exact number of each type of group and their pKa values in the context of the folded peptide.

The isoelectric point (pI)—the pH at which the peptide has no net charge—is approximately 4.5-5.0 for semaglutide. This means that at pH values above the pI, semaglutide is negatively charged, while at pH values below the pI, it's positively charged. The formulation pH (around 7.4) is well above the pI, so semaglutide is negatively charged in the formulation and in the body.
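The sign of the net charge at a given pH can be estimated with the Henderson-Hasselbalch relationship. The sketch below is a rough estimate only: the pKa values are assumed textbook-style numbers, not measured values for semaglutide, and the group list assumes the side chain contributes two free carboxyl groups while the acylated lysine contributes no charge.

```python
# Assumed pKa values (textbook-style); real values shift with local environment.
# Each entry is (label, pKa, count).
ACIDIC_GROUPS = [  # negatively charged when deprotonated
    ("C-terminal carboxyl", 3.6, 1),
    ("Asp side chain", 3.9, 1),
    ("Glu side chain", 4.1, 3),
    ("gamma-Glu alpha-carboxyl (spacer)", 3.5, 1),
    ("fatty diacid terminal carboxyl", 4.8, 1),
    ("Tyr side chain", 10.1, 1),
]
BASIC_GROUPS = [  # positively charged when protonated
    ("N-terminal amine", 8.0, 1),
    ("His side chain", 6.0, 1),
    ("Arg side chain", 12.5, 2),
]


def net_charge(ph):
    """Estimate net charge at a given pH from independent ionizable groups."""
    negative = sum(n / (1 + 10 ** (pka - ph)) for _, pka, n in ACIDIC_GROUPS)
    positive = sum(n / (1 + 10 ** (ph - pka)) for _, pka, n in BASIC_GROUPS)
    return positive - negative


for ph in (3.0, 7.4):
    print(f"pH {ph}: estimated net charge {net_charge(ph):+.1f}")
```

With these assumed values the estimate is positive at pH 3 and about -4 at pH 7.4, matching the qualitative picture above. The exact crossover pH from such a simple model should not be expected to reproduce an experimentally measured pI.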

Solubility

Semaglutide's solubility is complex due to its amphipathic nature—it contains both hydrophilic (the peptide backbone with charged and polar amino acids) and hydrophobic (the fatty acid modification) regions. In aqueous solution, semaglutide can form micelles or aggregates at high concentrations, with the hydrophobic fatty acids clustering together and the hydrophilic peptide portions facing the aqueous environment.

The formulation includes propylene glycol, which enhances solubility by providing a more favorable environment for the fatty acid portion. The pH and ionic strength are also optimized to maintain solubility while minimizing aggregation. Proper formulation is crucial for achieving the high concentrations needed for subcutaneous injection (several mg/mL) while maintaining stability and preventing precipitation.

Stability

Peptides can degrade through multiple pathways including hydrolysis (cleavage of peptide bonds), oxidation (of methionine, tryptophan, or other susceptible residues), deamidation (of asparagine or glutamine), and aggregation (formation of larger complexes through peptide-peptide interactions).

Semaglutide is relatively stable compared to many peptides, but degradation can still occur. The most common degradation pathways are:

  • Oxidation: Tryptophan at position 31 is susceptible to oxidation, forming various oxidation products. This is minimized by formulating in an inert atmosphere and including antioxidants if needed.
  • Deamidation: Asparagine and glutamine residues can undergo deamidation, converting to aspartic acid and glutamic acid respectively. Semaglutide contains no asparagine, so the single glutamine at position 23 is the relevant site. This is pH-dependent and minimized by formulating at optimal pH.
  • Aggregation: Peptide molecules can associate to form dimers, oligomers, or larger aggregates. This is concentration-dependent and minimized by optimizing formulation conditions and storage temperature.
  • Hydrolysis: Peptide bonds can be cleaved by water, particularly at elevated temperatures or extreme pH. This is minimized by refrigerated storage and neutral pH formulation.

The formulation and storage conditions (2-8°C, protected from light) are designed to minimize these degradation pathways and ensure semaglutide remains stable for its labeled shelf life (typically 2-3 years).

Analytical Characterization

Multiple analytical techniques are used to characterize semaglutide's structure and confirm its identity and purity.

Mass Spectrometry

Mass spectrometry determines the exact molecular weight of semaglutide. Electrospray ionization (ESI) or matrix-assisted laser desorption/ionization (MALDI) are used to ionize the peptide, and time-of-flight (TOF) or other mass analyzers measure the mass-to-charge ratio. The measured molecular weight should match the theoretical weight calculated from the amino acid sequence and modifications (approximately 4,113 Da).
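In ESI, a peptide of this size is usually observed as a series of multiply protonated ions rather than at its full mass. As an illustrative sketch, the expected mass-to-charge ratio for each charge state follows from (M + z × proton mass) / z. The mass used here is the approximate average mass, taken as 4113.6 Da for the example; a real assignment would use monoisotopic masses and the instrument's calibration.

```python
PROTON_MASS = 1.007276  # Da


def mz(neutral_mass, charge):
    """m/z of an ion carrying `charge` extra protons: (M + z * m_proton) / z."""
    return (neutral_mass + charge * PROTON_MASS) / charge


AVERAGE_MASS = 4113.6  # Da, approximate average mass assumed for this example

for z in (2, 3, 4, 5):
    print(f"[M+{z}H]{z}+  m/z = {mz(AVERAGE_MASS, z):.2f}")
# The z = 3 and z = 4 ions fall near m/z 1372.21 and 1029.41.
```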

Tandem mass spectrometry (MS/MS) can sequence the peptide by fragmenting it and analyzing the fragments. This confirms the amino acid sequence and the positions of modifications. High-resolution mass spectrometry can detect impurities and degradation products based on small mass differences.

Amino Acid Analysis

Amino acid analysis involves hydrolyzing the peptide to break all peptide bonds, then separating and quantifying the individual amino acids. The results should match the expected composition based on the sequence. This technique confirms that the correct amino acids are present in the correct ratios, though it doesn't provide sequence information.

Peptide Mapping

Peptide mapping involves digesting semaglutide with a specific protease (typically trypsin, which cleaves after lysine and arginine), then analyzing the resulting fragments by HPLC-MS. Each fragment has a characteristic retention time and mass. By comparing the observed fragments to those expected from the known sequence, the complete sequence and modifications can be confirmed. This technique is particularly useful for detecting sequence variants or incorrect modifications.
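The expected fragments can be predicted before running the experiment. The sketch below applies the cleavage rule (cut after lysine or arginine) to the semaglutide backbone as written out from the three modifications described in this document. It assumes that the acylated lysine at position 26 is not cleaved, because its side chain no longer carries the positive charge trypsin recognizes. The function name `tryptic_digest` and the `blocked` parameter are illustrative.

```python
def tryptic_digest(residues, blocked=()):
    """Split after every K or R, except at 0-based indices listed in `blocked`."""
    fragments, current = [], []
    for index, residue in enumerate(residues):
        current.append(residue)
        if residue in ("K", "R") and index not in blocked:
            fragments.append(current)
            current = []
    if current:  # keep any residues left after the last cleavage site
        fragments.append(current)
    return fragments


# Semaglutide backbone: AIB at position 8, Lys at 26 (acylated), Arg at 34.
SEMAGLUTIDE = ["H", "Aib"] + list("EGTFTSDVSSYLEGQAAKEFIAWLVRGRG")
ACYLATED_LYS_INDEX = 19  # position 26 is index 19 when counting from 0

for fragment in tryptic_digest(SEMAGLUTIDE, blocked={ACYLATED_LYS_INDEX}):
    print(len(fragment), "-".join(fragment))
```

Under that assumption the digest yields three fragments of 28, 2, and 1 residues. If the lysine were cleaved as well, four fragments (20, 8, 2, and 1 residues) would be expected, so the observed pattern helps confirm where the side chain is attached.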

Chromatography

High-performance liquid chromatography (HPLC) separates semaglutide from impurities based on hydrophobicity (reversed-phase HPLC) or charge (ion-exchange HPLC). The retention time is characteristic for semaglutide and can be used for identification. The peak area is proportional to concentration, allowing quantification. HPLC is the primary method for assessing purity—the semaglutide peak should represent >95% of the total peak area.
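Area-percent purity is a simple ratio of the main peak area to the total integrated area. The sketch below uses made-up peak areas purely for illustration; the peak names and values are hypothetical and do not come from any real chromatogram.

```python
def area_percent_purity(peak_areas, main_peak):
    """Main peak area as a percentage of the total integrated peak area."""
    total = sum(peak_areas.values())
    return 100.0 * peak_areas[main_peak] / total


# Hypothetical integrated peak areas (arbitrary units), for illustration only.
peaks = {"semaglutide": 9720.0, "impurity_A": 150.0, "impurity_B": 130.0}

purity = area_percent_purity(peaks, "semaglutide")
print(f"Purity: {purity:.1f}%")  # 97.2%
```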

Spectroscopy

UV-visible spectroscopy measures absorbance at specific wavelengths. Peptides absorb UV light at 280 nm mainly due to the aromatic amino acids tryptophan and tyrosine (phenylalanine absorbs only weakly, with its maximum near 257 nm) and at 214 nm due to peptide bonds. The absorbance spectrum can be used for identification and quantification. Circular dichroism (CD) spectroscopy measures the differential absorption of left- and right-circularly polarized light, providing information about secondary structure (alpha helix content).
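Quantification at 280 nm relies on the Beer-Lambert law, A = ε × c × l. As a hedged sketch, the molar extinction coefficient can be estimated from the sequence using commonly cited per-residue values for tryptophan and tyrosine (phenylalanine contributes negligibly at this wavelength). The per-residue values are an assumption of the estimate, and a measured coefficient would be used in practice.

```python
# Commonly used per-residue estimates at 280 nm, in M^-1 cm^-1.
EPS_TRP = 5500
EPS_TYR = 1490


def extinction_280(sequence):
    """Estimate the molar extinction coefficient from Trp and Tyr counts."""
    return sequence.count("W") * EPS_TRP + sequence.count("Y") * EPS_TYR


def concentration_molar(absorbance, epsilon, path_cm=1.0):
    """Beer-Lambert law rearranged for concentration: c = A / (epsilon * l)."""
    return absorbance / (epsilon * path_cm)


# Semaglutide backbone in single-letter codes; X stands in for AIB.
BACKBONE = "HXEGTFTSDVSSYLEGQAAKEFIAWLVRGRG"

epsilon = extinction_280(BACKBONE)
print(epsilon)  # 6990 (one Trp and one Tyr)
print(concentration_molar(0.699, epsilon))  # roughly 1e-4 M, i.e. 100 uM
```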

Structure-Activity Relationships

Understanding which structural features are essential for semaglutide's activity helps explain its design and suggests possibilities for future improvements.

Essential Features for Receptor Activation

Not all parts of the semaglutide molecule are equally important for GLP-1 receptor activation. The N-terminal region (residues 7-15) is critical—modifications or deletions in this region typically abolish activity. Key residues include histidine at position 7, which forms important interactions with the receptor, and phenylalanine at position 12, which inserts into a hydrophobic pocket.

The mid-region (residues 16-28) is important for maintaining the alpha-helical structure and proper positioning of the N-terminus. The C-terminal region (residues 29-37) is less critical for receptor activation but important for receptor binding affinity. Truncations or modifications in this region reduce potency but don't completely eliminate activity.

Features for Extended Half-Life

The fatty acid modification is essential for extended half-life. Without it, the peptide would be rapidly cleared like native GLP-1. The fatty acid length is critical: C-18 provides optimal albumin binding for once-weekly dosing. The spacer (one gamma-glutamic acid plus two short ethylene glycol-based OEG units) is also important, providing optimal distance between the peptide and fatty acid.

The AIB substitution at position 8 is essential for DPP-4 resistance. Without it, the peptide would be rapidly degraded even if albumin binding extended circulation time. The combination of DPP-4 resistance and albumin binding is synergistic—both are needed for the extended half-life that enables once-weekly dosing.

Opportunities for Optimization

Structure-activity relationship studies continue to explore whether semaglutide can be further optimized. Possible modifications include different fatty acid lengths or structures, alternative spacers, additional amino acid substitutions to enhance stability or potency, or combination with other peptides or proteins. Some of these approaches are being explored in next-generation GLP-1 agonists currently in development.

Comparison to Related Peptides

Comparing semaglutide's structure to other GLP-1 agonists highlights different strategies for achieving extended half-life.

Liraglutide

Liraglutide uses a similar fatty acid modification strategy but with a C-16 fatty acid (palmitic acid) attached to lysine 26 through a single gamma-glutamic acid spacer. This shorter fatty acid provides weaker albumin binding, resulting in a shorter half-life (about 13 hours vs about 7 days) requiring daily dosing. Liraglutide shares the arginine substitution at position 34 but retains the native alanine at position 8, so it lacks the AIB substitution that gives semaglutide its DPP-4 resistance.
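The link between half-life and dosing interval can be made concrete with a simple exponential decay estimate. The sketch below assumes first-order elimination after a single dose, which is a simplification that ignores absorption and accumulation. The half-lives are the approximate values quoted above.

```python
def fraction_remaining(hours_elapsed, half_life_hours):
    """Fraction of a dose left after a given time, assuming first-order decay."""
    return 0.5 ** (hours_elapsed / half_life_hours)


WEEK_HOURS = 7 * 24  # 168 hours

# Semaglutide, half-life about 7 days: half the dose remains after one week.
print(fraction_remaining(WEEK_HOURS, WEEK_HOURS))  # 0.5

# Liraglutide, half-life about 13 hours: fraction left after a day and a week.
print(f"{fraction_remaining(24, 13):.2f}")
print(f"{fraction_remaining(WEEK_HOURS, 13):.5f}")
```

With a 13-hour half-life, roughly a quarter of a dose remains after one day and essentially none after a week, which is why liraglutide is dosed daily while semaglutide can be dosed weekly.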

Dulaglutide

Dulaglutide takes a completely different approach, fusing a modified GLP-1 peptide to an immunoglobulin Fc fragment. The large Fc fragment (approximately 50 kDa) prevents renal clearance and extends half-life to approximately 5 days. This approach avoids fatty acid modification but results in a much larger molecule that must be produced using recombinant DNA technology rather than chemical synthesis.

Exenatide

Exenatide is based on exendin-4 from Gila monster venom rather than human GLP-1. It shares 53% sequence identity with GLP-1 but has natural resistance to DPP-4 due to a glycine at position 2 instead of alanine. The extended-release formulation (exenatide QW) uses microsphere technology to slowly release exenatide over a week, a pharmaceutical rather than molecular approach to extending duration.
