COS(M+O)S: Enhancing Story Generation with Curiosity-Driven Exploration and Fine-Tuning
As the field of large language models (LLMs) evolves, their capacity to generate coherent prose and even engage in creative tasks like storytelling has become increasingly sophisticated. However, LLMs often produce formulaic or predictable text, which can detract from the creativity and depth expected in open-ended tasks such as story generation.

Traditional methods typically rely on a single-pass approach, generating text one step at a time based on the most likely next token, which often leads to repetitive or uninspired outputs. To address these shortcomings, we introduce COS(M+O)S, a framework inspired by System 2 reasoning that systematically explores multiple plot developments to generate richer, more engaging stories.
COS(M+O)S leverages a combination of Monte Carlo Tree Search (MCTS) and Odds Ratio Preference Optimization (ORPO) to create an iterative process for refining plot expansions. The framework rewards curiosity and creativity while maintaining coherence, offering a scalable path to higher-quality story generation, even with smaller models. Through a series of experiments, we demonstrate that COS(M+O)S significantly improves the story quality produced by a 3B-parameter model, approaching the quality of much larger models with minimal additional computational cost.
The Challenge of Open-Ended Story Generation
Story generation involves navigating a vast space of potential narrative developments. Traditional language models like GPT-4 and Llama are capable of generating text by predicting the next most likely word or phrase based on a given prompt. However, this method tends to rely heavily on past patterns and may lack novelty, character depth, and engaging plot twists. The resulting stories often feel formulaic and predictable.
To overcome these limitations, we need a system that encourages exploration of more diverse narrative possibilities while maintaining coherence. This requires moving beyond System 1-style quick decision-making (which often results in predictable outputs) to System 2-style thinking, where the story is developed over multiple reasoning steps, allowing for deeper creativity and refinement.
COS(M+O)S: Curiosity-Oriented Step-Level Exploration
The core of COS(M+O)S is the combination of MCTS and ORPO, which allows the model to iteratively refine its plot generation process. The framework works by introducing a curiosity-driven exploration mechanism, which rewards surprising or original plot developments while penalizing incoherent ones. This process is guided by a step-level value model that evaluates the quality of each plot expansion.
Monte Carlo Tree Search (MCTS)
MCTS is an established algorithm that is widely used in decision-making tasks, such as in the famous AlphaGo system. In the context of story generation, MCTS explores a large tree of potential story expansions by evaluating multiple candidate actions at each step. The search process unfolds as follows:
- Selection: Starting from the root, the algorithm descends the tree by repeatedly choosing the child node (a partial story) that best trades off high estimated value against under-explored alternatives.
- Expansion: New potential actions are generated, and the story is expanded.
- Simulation: The story is advanced using the selected action, and the next sequence is generated.
- Evaluation: The resulting story is evaluated using the step-level value model.
- Backpropagation: The value of the action is updated, and high-quality plot developments are propagated back through the tree.
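The five steps above can be sketched in code. This is a minimal, illustrative skeleton of the search loop, not the paper's actual implementation: `expand_actions` and `value_model` are hypothetical stand-ins for the policy model and the step-level value model, and the selection rule uses a standard UCT formula as one plausible choice.

```python
import math
import random

class Node:
    def __init__(self, story, parent=None):
        self.story = story          # partial story so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def uct(self, c=1.4):
        # Unvisited children are explored first; otherwise balance the
        # average value (exploitation) against a visit-count bonus
        # (exploration), as in standard UCT.
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def expand_actions(story):
    # Placeholder for the policy model proposing candidate plot actions.
    return [f"{story} -> action{i}" for i in range(3)]

def value_model(story):
    # Placeholder for the step-level value model (originality, coherence, ...).
    return random.random()

def mcts(root_story, iterations=50):
    root = Node(root_story)
    for _ in range(iterations):
        # Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct())
        # Expansion: generate candidate plot continuations.
        for story in expand_actions(node.story):
            node.children.append(Node(story, parent=node))
        # Simulation + Evaluation: score one continuation with the value model.
        child = random.choice(node.children)
        value = value_model(child.story)
        # Backpropagation: update statistics along the path back to the root.
        while child is not None:
            child.visits += 1
            child.value_sum += value
            child = child.parent
    # Return the most-visited first expansion from the root.
    return max(root.children, key=lambda n: n.visits).story
```

In a real system, `value_model` would be the learned evaluator described below rather than a random score, and `expand_actions` would sample from the policy LLM.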
Odds Ratio Preference Optimization (ORPO)
Once the MCTS has explored several potential plot branches, we apply ORPO to fine-tune the model. ORPO adjusts the policy by promoting actions that lead to higher-value plot expansions, thereby shifting the model toward generating better-quality stories over time. It uses action-value estimates derived from the MCTS search to increase the likelihood of selecting higher-rated actions and decrease the likelihood of poorer ones.
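Numerically, ORPO's preference term can be sketched as follows. This toy example assumes length-normalized sequence probabilities `p_chosen` and `p_rejected` for a preferred and a dispreferred plot action, and a weighting factor `lam`; the values and the helper names are illustrative, not taken from the paper.

```python
import math

def odds(p):
    # odds(y|x) = p / (1 - p), for a sequence probability p in (0, 1)
    return p / (1.0 - p)

def orpo_preference_loss(p_chosen, p_rejected):
    # -log sigmoid(log odds ratio): small when the chosen action is
    # already much more likely than the rejected one, large otherwise.
    log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))

def orpo_loss(p_chosen, p_rejected, lam=0.1):
    # ORPO adds the preference term to a standard supervised
    # fine-tuning loss on the chosen sequence.
    nll = -math.log(p_chosen)
    return nll + lam * orpo_preference_loss(p_chosen, p_rejected)
```

Because the preference term shrinks as the odds ratio grows, gradient descent on this loss pushes the policy to assign higher probability to the actions MCTS rated well and lower probability to the ones it rated poorly.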
This iterative process enables the model to self-improve: each round of MCTS provides the model with better feedback, and the ORPO mechanism incorporates these insights into the policy, guiding future plot expansions toward higher-quality, more engaging narratives.
Experimental Results
In our experiments, we focus on the relatively small Llama 3.2 3B model to evaluate the effectiveness of COS(M+O)S on smaller, more computationally feasible systems. Despite the smaller model size, COS(M+O)S’s iterative refinement and curiosity-driven exploration led to significant improvements in story quality.
- Human Evaluation: In small-scale tests with short story prompts, 67%–77% of participants favored COS(M+O)S’s highest-rated plot expansions over lower-rated ones. This indicates that the framework successfully generates more engaging and creative storylines, aligning with human preferences.
- Automated Evaluation: Using GPT-4o ratings, COS(M+O)S outperformed naive single-pass decoding from the Llama 3.2 3B model by 0.59 standard deviations. Moreover, pairwise comparisons between COS(M+O)S and Llama 3.1 70B showed that COS(M+O)S generated plots with no significant difference in quality, suggesting that the smaller 3B model with COS(M+O)S applied is almost on par with much larger models.
These results suggest that even with a model as small as 3 billion parameters, COS(M+O)S can substantially enhance the quality of story generation, approaching the level of much larger models in terms of creativity and coherence.
Methodology
The COS(M+O)S framework combines three key components:
- Policy Model: The policy model, based on the Llama 3.2 3B LLM, generates possible plot actions at each step of the story. These actions are framed as Chain-of-Thought (CoT) reasoning steps that guide the narrative.
- Simulation Model: The simulation model applies the selected actions to simulate the next segment of the story. This model uses the outputs generated by the policy model to explore possible plot continuations.
- Step-Level Value Model: The value model evaluates the quality of each plot expansion based on factors like originality, engagement, and coherence. It assigns a value to each partial plot to guide the MCTS search process.

These components interact in an iterative process: the MCTS algorithm explores the plot space, refining the plot based on the value model's feedback, and ORPO is then applied to fine-tune the policy, reinforcing high-value plot trajectories.
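One way the iterative process can tie these components together is by turning MCTS value estimates into training data for ORPO: sibling actions whose estimated values are sufficiently separated become (chosen, rejected) preference pairs. The sketch below is hypothetical; the node representation and the `margin` threshold are illustrative assumptions, not the paper's exact procedure.

```python
def collect_preference_pairs(root_children, margin=0.1):
    # root_children: list of {"action": str, "value": float} dicts, where
    # "value" is the MCTS value estimate for that candidate plot action.
    # Pairs whose value gap meets the margin become ORPO training examples.
    pairs = []
    ranked = sorted(root_children, key=lambda n: n["value"], reverse=True)
    for better in ranked:
        for worse in ranked:
            if better["value"] - worse["value"] >= margin:
                pairs.append((better["action"], worse["action"]))
    return pairs
```

Each fine-tuning round then feeds these pairs into the ORPO objective, so the policy proposed in the next MCTS round already favors the kinds of expansions the value model rewarded.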
Conclusion
COS(M+O)S represents a significant advancement in story generation for LLMs, combining curiosity-driven exploration with System 2-style iterative reasoning to create more engaging and coherent narratives. The integration of MCTS and ORPO allows for the systematic exploration of diverse plot possibilities, enabling even smaller models to produce creative stories that rival the output of much larger models.
By employing this framework, we demonstrate that smaller LLMs can generate high-quality narratives, bridging the gap between the capabilities of small and large models. As future research continues to refine and expand upon this approach, COS(M+O)S offers a promising direction for achieving open-ended creativity in AI-driven storytelling.